Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2352: Allow truncation of row group min_values/max_value statistics #216

Merged
merged 1 commit into from Oct 18, 2023

Conversation

raunaqmorarka
Copy link
Contributor

@raunaqmorarka raunaqmorarka commented Sep 21, 2023

Jira

Description

This updates the spec to allow truncation of row group min_values/max_value statistics so that readers can take advantage of row group pruning for predicates on columns containing long strings.
https://issues.apache.org/jira/browse/PARQUET-1685 already introduced a feature to parquet-mr which allows users to deviate from the current spec and configure truncation of row group statistics.
This change also adds is_max_value_exact/is_min_value_exact to allow writers to specify
when the max_value/min_value are the actual max and min values found on the column chunk.

Since the possibility of truncation exists and is not possible to explicitly detect, attempts to pushdown min/max aggregation to parquet have avoided implementing it for string columns (e.g. https://issues.apache.org/jira/browse/SPARK-36645)
Given the above situation, the spec should be updated to allow truncation of min/max row group stats. This would align the spec with current reality that string column min/max row group stats could be truncated.

@@ -216,7 +216,12 @@ struct Statistics {
/** count of distinct values occurring */
4: optional i64 distinct_count;
/**
* Min and max values for the column, determined by its ColumnOrder.
* lower and upper bound values for the column, determined by its ColumnOrder.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, you want to allow truncation by adding a flag to indicate whether the min_value/max_value is truncated or not, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the feature in https://issues.apache.org/jira/browse/PARQUET-1685, I want to assume that all existing stats are truncated. Going forward we should have a flag to explicitly indicate whether or not truncation took place and applications should perform aggregation pushdown only if that flag is found to indicate no truncation. But I think adding that flag can be tackled separately as a follow-up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they are relevant and can be discussed together.

What do you think? @gszadovszky @shangxinli

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might not directly related to this question. If the column is a utf-8 column ( with LogicalType UTF8 ), can the binary here be a non-utf8 bytes?

e.g. s[...100] is a valid utf-8, however, s[...50] is not.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mapleFU very good point, we should not allow truncation to produce a value that is invalid for given logical type
i think this is what @raunaqmorarka meant with Such more compact values must still be valid values within the column's logical type. in the code comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated this PR to now include is_max_value_exact/is_min_value_exact for specifying when the max_value/min_value are the actual max and min values

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac @gszadovszky can you please take a look at current changes ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac @gszadovszky is anything more needed to merge this ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just merged it. Thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@findepi
Copy link

findepi commented Sep 21, 2023

Thanks @raunaqmorarka, this lgtm

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Rest LGTM

src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
src/main/thrift/parquet.thrift Outdated Show resolved Hide resolved
…alue statistics

This updates the spec to allow truncation of row group min_values/max_value statistics
so that readers can take advantage of row group pruning for predicates on columns
containing long strings.
https://issues.apache.org/jira/browse/PARQUET-1685 already introduced a feature to parquet-mr
which allows users to deviate from the current spec and configure truncation of row group statistics.
This change also adds is_max_value_exact/is_min_value_exact to allow writers to specify
when the max_value/min_value are the actual max and min values found on the column chunk.
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@wgtmac
Copy link
Member

wgtmac commented Oct 14, 2023

cc @pitrou @emkornfield @shangxinli

@wgtmac wgtmac merged commit 31f92c7 into apache:master Oct 18, 2023
3 checks passed
@raunaqmorarka raunaqmorarka deleted the PARQUET-2352 branch October 18, 2023 09:41
@mapleFU
Copy link
Member

mapleFU commented Nov 23, 2023

So seems that for PageHeader (though if PageIndex enabled, we might not write page statistics) might also have to write these two statistics?

Also, we doesn't have page-index level statistics here

@mapleFU
Copy link
Member

mapleFU commented Dec 13, 2023

Also, seems that this property only works for ByteArray, right (maybe FLBA also included)

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L965-L979

The syntax like this is not included in exact, right( at least in current arrow-rs impl)? @raunaqmorarka @tustvold

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Apr 26, 2024
…-format 2.10 (#15412)

[PARQUET-2352](apache/parquet-format#216) added fields to the `Statistics` struct to indicate whether the min and max values were exact or had been truncated. This was somewhat ambiguous in the past. One reason to want to know this is to allow avoiding the decoding of pages (or column chunks) that contain a single value (if the min and max are the same value, and are known to be exact values, and there are no nulls, then the only valid value for the page will be that value). This PR adds these new fields, which will always be `true` in cuDF since cuDF does not support truncating min and max values in the statistics (but does support truncation in the page indexes).

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Paul Mattione (https://github.com/pmattione-nvidia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #15412
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants