-
Notifications
You must be signed in to change notification settings - Fork 428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-922: ColumnIndex Layout to Support Page Skipping #63
Conversation
Added PageLocation, OffsetIndex and ColumnIndex structures to the parquet.thrift file in order to support secondary indexes in parquet files.
Should we also add a description of how these work together to parquet.thrift as a comment? It could be a smaller version of the design document. |
4: optional list<i64> null_count | ||
|
||
/** A list containing DataPageHeaderV2.statistics.distinct_count for each page **/ | ||
5: optional list<i64> distinct_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we agreed to remove this one in the last sync?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think we need this because this structure is replacing Statistics
, which can store a distinct count. I don't know that anyone actually writes distinct count (Parquet MR doesn't) but we should make sure before we remove it entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impala also doesn't populate the field. We could keep it out in this change and check other readers (which ones)? Once we drop the page statistics, we can either add it here as an optional, or not.
@julienledem @mkornacker @rdblue - Can you have a look at this PR, please? Thanks :) |
@poojanilangekar, I'd like to get this in so we can make a 2.3.2 release. Are you okay with us making modifications to it? For example, we need to get the spec document added as markdown and finalize the fields, like |
/** Offset of the page in the file **/ | ||
1: required i64 offset | ||
|
||
/** Size of the page, including header. The same as PageHeader.compressed_page_size **/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the same as PageHeader.compressed_page_size. In Parquet MR, we recover the header size by writing the header and getting the current position: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L375. That header includes the compressed_page_size referenced here, so I don't think it can contain the header as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I remember that this was also a bug in the prototypical implementation. The comment should read
/** Size of the page, including header. Sum of compressed_page_size and header length **/
*/ | ||
struct ColumnIndex { | ||
/** A list of bools to determine the validity of the corresponding min and max values **/ | ||
1: required list<bool> valid_values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should rename this to be more clear. From my notes, this is true if there was a non-null value so the min and max in the next two lists are also non-null.
How about naming this null_pages
? This should also state that the corresponding min_values
and max_values
entries for null pages are byte[0]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for null_pages.
/** A list containing the upper bounds for the values of each page **/ | ||
3: optional list<binary> max_values | ||
|
||
/** A list containing DataPageHeaderV2.statistics.null_count for each page **/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't reference DataPageHeaderV2
or Statistics
. It should just have a clear description, "Number of null values in the page".
4: optional list<i64> null_count | ||
|
||
/** A list containing DataPageHeaderV2.statistics.distinct_count for each page **/ | ||
5: optional list<i64> distinct_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think we need this because this structure is replacing Statistics
, which can store a distinct count. I don't know that anyone actually writes distinct count (Parquet MR doesn't) but we should make sure before we remove it entirely.
/** A list containing DataPageHeaderV2.statistics.null_count for each page **/ | ||
4: optional list<i64> null_count | ||
|
||
/** A list containing DataPageHeaderV2.statistics.distinct_count for each page **/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If kept, this shouldn't reference DataPageHeaderV2
or Statistics
/** A list of bools to determine the validity of the corresponding min and max values **/ | ||
1: required list<bool> valid_values | ||
|
||
/** A list containing the lower bounds for the values of each page **/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should state that the values are lower bounds, not necessarily the min value in a page. That allows us to truncate large values to save space. Same thing for the max_values
field.
/** Size of the page, including header. The same as PageHeader.compressed_page_size **/ | ||
2: required i32 compressed_page_size | ||
|
||
/** Index within the RowGroup of the first row of the page **/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be more clear about the requirements for using this index structure: Pages must be split on record boundaries to ensure that no records are split across pages. The repetition level of the first value in each page must be 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Currently we don't require this to be the case for nested columns though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is being finalized in #72 |
Added PageLocation, OffsetIndex and ColumnIndex structures to the
parquet.thrift file in order to support secondary indexes in parquet files.