PARQUET-922: ColumnIndex Layout to Support Page Skipping #63

poojanilangekar · 2017-08-31T23:38:00Z

Added PageLocation, OffsetIndex and ColumnIndex structures to the
parquet.thrift file in order to support secondary indexes in parquet files.

lekv · 2017-09-01T02:20:31Z

Should we also add a description of how these work together to parquet.thrift as a comment? It could be a smaller version of the design document.

lekv · 2017-09-13T15:49:22Z

src/main/thrift/parquet.thrift

+  4: optional list<i64> null_count
+
+/** A list containing DataPageHeaderV2.statistics.distinct_count for each page **/
+  5: optional list<i64> distinct_count


I think we agreed to remove this one in the last sync?

Actually, I think we need this because this structure is replacing Statistics, which can store a distinct count. I don't know that anyone actually writes distinct count (Parquet MR doesn't) but we should make sure before we remove it entirely.

Impala doesn't populate the field either. We could leave it out of this change and check other readers (which ones?). Once we drop the page statistics, we can either add it here as an optional field or not.

lekv · 2017-09-13T16:40:57Z

@julienledem @mkornacker @rdblue - Can you have a look at this PR, please? Thanks :)

rdblue · 2017-10-06T23:59:57Z

@poojanilangekar, I'd like to get this in so we can make a 2.3.2 release. Are you okay with us making modifications to it? For example, we need to get the spec document added as markdown and finalize the fields, like distinct_count.

rdblue · 2017-10-09T23:58:43Z

src/main/thrift/parquet.thrift

+/** Offset of the page in the file **/
+  1: required i64 offset
+
+/** Size of the page, including header. The same as PageHeader.compressed_page_size **/


I don't think this is the same as PageHeader.compressed_page_size. In Parquet MR, we recover the header size by writing the header and getting the current position: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L375. That header includes the compressed_page_size referenced here, so I don't think it can contain the header as well.

Yes, I remember that this was also a bug in the prototype implementation. The comment should read:

/** Size of the page, including header. Sum of compressed_page_size and header length **/
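To make the corrected comment concrete, here is a minimal sketch of how a writer could derive the value, following the approach described above for Parquet MR (write the header, then read the current position). The `write_page` helper and the Python `PageLocation` class are hypothetical illustrations, not part of any Parquet library, and the class models only the two fields shown in this hunk.

```python
import io
from dataclasses import dataclass


@dataclass
class PageLocation:
    # Hypothetical Python mirror of the two Thrift fields under discussion.
    offset: int                # offset of the page in the file
    compressed_page_size: int  # header length + compressed page body


def write_page(out, header_bytes: bytes, compressed_body: bytes) -> PageLocation:
    """Write one page and return its location, with the size including the header."""
    offset = out.tell()
    out.write(header_bytes)
    # Recover the header length from the stream position after writing it.
    header_length = out.tell() - offset
    out.write(compressed_body)
    return PageLocation(offset, header_length + len(compressed_body))
```

With a 10-byte header and a 100-byte body, the recorded size is 110, while PageHeader.compressed_page_size for the same page would be 100.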

rdblue · 2017-10-10T00:05:32Z

src/main/thrift/parquet.thrift

+ */
+struct ColumnIndex {
+/** A list of bools to determine the validity of the corresponding min and max values **/
+  1: required list<bool> valid_values


I think we should rename this to be more clear. From my notes, this is true if there was a non-null value so the min and max in the next two lists are also non-null.

How about naming this null_pages? This should also state that the corresponding min_values and max_values entries for null pages are byte[0].

+1 for null_pages.
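A small sketch of the semantics proposed in this thread, assuming null_pages[i] is true when page i holds only nulls (the inverse of the original valid_values) and that the min and max entries for such pages are empty byte strings. The function name `build_column_index` and the in-memory page representation are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple


def build_column_index(
    pages: List[List[Optional[bytes]]],
) -> Tuple[List[bool], List[bytes], List[bytes], List[int]]:
    """Build per-page null_pages, min_values, max_values and null_count lists."""
    null_pages, min_values, max_values, null_counts = [], [], [], []
    for page in pages:
        non_null = [v for v in page if v is not None]
        all_null = not non_null
        null_pages.append(all_null)
        # For null pages the min and max entries are empty (byte[0]).
        min_values.append(b"" if all_null else min(non_null))
        max_values.append(b"" if all_null else max(non_null))
        # Number of null values in the page.
        null_counts.append(len(page) - len(non_null))
    return null_pages, min_values, max_values, null_counts
```

A reader would check null_pages[i] before interpreting min_values[i] or max_values[i], since an empty byte string is otherwise indistinguishable from a legitimate empty binary value.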

rdblue · 2017-10-10T00:07:54Z

src/main/thrift/parquet.thrift

+/** A list containing the upper bounds for the values of each page **/
+  3: optional list<binary> max_values
+
+/** A list containing DataPageHeaderV2.statistics.null_count for each page **/


This shouldn't reference DataPageHeaderV2 or Statistics. It should just have a clear description, "Number of null values in the page".

rdblue · 2017-10-10T00:10:36Z

src/main/thrift/parquet.thrift

+  4: optional list<i64> null_count
+
+/** A list containing DataPageHeaderV2.statistics.distinct_count for each page **/
+  5: optional list<i64> distinct_count


Actually, I think we need this because this structure is replacing Statistics, which can store a distinct count. I don't know that anyone actually writes distinct count (Parquet MR doesn't) but we should make sure before we remove it entirely.

rdblue · 2017-10-10T00:11:01Z

src/main/thrift/parquet.thrift

+/** A list containing DataPageHeaderV2.statistics.null_count for each page **/
+  4: optional list<i64> null_count
+
+/** A list containing DataPageHeaderV2.statistics.distinct_count for each page **/


If kept, this shouldn't reference DataPageHeaderV2 or Statistics

rdblue · 2017-10-10T00:12:15Z

src/main/thrift/parquet.thrift

+/** A list of bools to determine the validity of the corresponding min and max values **/
+  1: required list<bool> valid_values
+
+/** A list containing the lower bounds for the values of each page **/


This should state that the values are lower bounds, not necessarily the min value in a page. That allows us to truncate large values to save space. Same thing for the max_values field.
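As an illustration of why bounds rather than exact min/max values permit truncation, here is a sketch for binary values, assuming unsigned lexicographic byte ordering. The helper names `truncate_lower` and `truncate_upper` are hypothetical and do not come from the PR or from any Parquet implementation.

```python
def truncate_lower(value: bytes, max_len: int) -> bytes:
    """Return a lower bound for value that is at most max_len bytes long."""
    # A prefix always sorts at or before the full value.
    return value[:max_len]


def truncate_upper(value: bytes, max_len: int) -> bytes:
    """Return an upper bound for value, shortened when that is possible."""
    if len(value) <= max_len:
        return value
    prefix = bytearray(value[:max_len])
    # Increment the last byte that is not 0xFF and drop everything after it.
    for i in reversed(range(len(prefix))):
        if prefix[i] != 0xFF:
            prefix[i] += 1
            return bytes(prefix[: i + 1])
    # Every prefix byte is 0xFF, so no shorter upper bound exists.
    return value
```

For example, a page whose true min and max are both b"apple pie" could store b"app" and b"apq" with max_len set to 3. Page skipping stays correct because the stored range still contains every value in the page.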

rdblue · 2017-10-10T00:15:34Z

src/main/thrift/parquet.thrift

+/** Size of the page, including header. The same as PageHeader.compressed_page_size **/
+  2: required i32 compressed_page_size
+
+/** Index within the RowGroup of the first row of the page **/


This should be more clear about the requirements for using this index structure: Pages must be split on record boundaries to ensure that no records are split across pages. The repetition level of the first value in each page must be 0.

Good point. Currently we don't require this to be the case for nested columns though.

@lekv @rdblue Have we achieved this for nested columns? I mean splitting on record boundaries to ensure that no records are split across pages. Thanks a lot!

https://github.com/apache/parquet-mr/blob/330242b3cc2a746f3bbb939304becfa313ef1e53/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java#L123 I found this, thank you all.
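The record-boundary requirement raised in this thread can be sketched as follows. This is not the Parquet MR logic linked above; it is a minimal illustration that assumes a column chunk is represented by its list of repetition levels, and the name `split_on_record_boundaries` and its parameters are hypothetical.

```python
from typing import List, Tuple


def split_on_record_boundaries(
    rep_levels: List[int], target_values_per_page: int
) -> List[Tuple[int, int]]:
    """Return (start, end) value ranges for pages that never split a record."""
    pages = []
    start = 0
    for i, level in enumerate(rep_levels):
        # A repetition level of 0 marks the first value of a new record,
        # so it is the only position where a page may begin.
        if level == 0 and i - start >= target_values_per_page:
            pages.append((start, i))
            start = i
    if start < len(rep_levels):
        pages.append((start, len(rep_levels)))
    return pages
```

Pages may exceed the target size when a single record is long, which is the price of guaranteeing that the first value of every page has repetition level 0.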

julienledem · 2017-10-10T18:07:06Z

This is being finalized in #72.

PARQUET-922: ColumnIndex Layout to Support Page Skipping

f983794

lekv reviewed Sep 13, 2017

rdblue requested changes Oct 10, 2017

julienledem mentioned this pull request Oct 10, 2017

PARQUET-922: Add column indexes to parquet.thrift #72

Closed

asfgit closed this in f1de77d Oct 16, 2017

asfimport mentioned this pull request Jun 23, 2024

Add index pages to the format to support efficient page skipping #324

Closed
