Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-922: ColumnIndex Layout to Support Page Skipping #63

Closed
wants to merge 1 commit into from

Conversation

poojanilangekar
Copy link

Added PageLocation, OffsetIndex and ColumnIndex structures to the
parquet.thrift file in order to support secondary indexes in parquet files.

Added PageLocation, OffsetIndex and ColumnIndex structures to the
parquet.thrift file in order to support secondary indexes in parquet files.
@lekv
Copy link
Contributor

lekv commented Sep 1, 2017

Should we also add a description of how these work together to parquet.thrift as a comment? It could be a smaller version of the design document.

4: optional list<i64> null_count

/** A list containing DataPageHeaderV2.statistics.distinct_count for each page **/
5: optional list<i64> distinct_count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we agreed to remove this one in the last sync?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think we need this because this structure is replacing Statistics, which can store a distinct count. I don't know that anyone actually writes distinct count (Parquet MR doesn't) but we should make sure before we remove it entirely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impala also doesn't populate the field. We could keep it out in this change and check other readers (which ones)? Once we drop the page statistics, we can either add it here as an optional, or not.

@lekv
Copy link
Contributor

lekv commented Sep 13, 2017

@julienledem @mkornacker @rdblue - Can you have a look at this PR, please? Thanks :)

@rdblue
Copy link
Contributor

rdblue commented Oct 6, 2017

@poojanilangekar, I'd like to get this in so we can make a 2.3.2 release. Are you okay with us making modifications to it? For example, we need to get the spec document added as markdown and finalize the fields, like distinct_count.

/** Offset of the page in the file **/
1: required i64 offset

/** Size of the page, including header. The same as PageHeader.compressed_page_size **/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the same as PageHeader.compressed_page_size. In Parquet MR, we recover the header size by writing the header and getting the current position: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L375. That header includes the compressed_page_size referenced here, so I don't think it can contain the header as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I remember that this was also a bug in the prototypical implementation. The comment should read

/** Size of the page, including header. Sum of compressed_page_size and header length **/

*/
struct ColumnIndex {
/** A list of bools to determine the validity of the corresponding min and max values **/
1: required list<bool> valid_values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should rename this to be more clear. From my notes, this is true if there was a non-null value so the min and max in the next two lists are also non-null.

How about naming this null_pages? This should also state that the corresponding min_values and max_values entries for null pages are byte[0].

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for null_pages.

/** A list containing the upper bounds for the values of each page **/
3: optional list<binary> max_values

/** A list containing DataPageHeaderV2.statistics.null_count for each page **/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't reference DataPageHeaderV2 or Statistics. It should just have a clear description, "Number of null values in the page".

4: optional list<i64> null_count

/** A list containing DataPageHeaderV2.statistics.distinct_count for each page **/
5: optional list<i64> distinct_count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think we need this because this structure is replacing Statistics, which can store a distinct count. I don't know that anyone actually writes distinct count (Parquet MR doesn't) but we should make sure before we remove it entirely.

/** A list containing DataPageHeaderV2.statistics.null_count for each page **/
4: optional list<i64> null_count

/** A list containing DataPageHeaderV2.statistics.distinct_count for each page **/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If kept, this shouldn't reference DataPageHeaderV2 or Statistics

/** A list of bools to determine the validity of the corresponding min and max values **/
1: required list<bool> valid_values

/** A list containing the lower bounds for the values of each page **/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should state that the values are lower bounds, not necessarily the min value in a page. That allows us to truncate large values to save space. Same thing for the max_values field.

/** Size of the page, including header. The same as PageHeader.compressed_page_size **/
2: required i32 compressed_page_size

/** Index within the RowGroup of the first row of the page **/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be more clear about the requirements for using this index structure: Pages must be split on record boundaries to ensure that no records are split across pages. The repetition level of the first value in each page must be 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Currently we don't require this to be the case for nested columns though.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lekv @rdblue have we achieve this for nested column? i mean split on record boundaries to ensure that no records are split across pages, thanks a lot!

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@julienledem
Copy link
Member

this is being finalized in #72

@asfgit asfgit closed this in f1de77d Oct 16, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants