PARQUET-2261: Implement SizeStatistics #1177

wgtmac · 2023-10-19T07:29:34Z

Make sure you have checked all steps below.

Jira

My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-2261
- In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain Javadoc that explain what it does

wgtmac · 2023-10-19T15:10:33Z

I have drafted the POC to read/write SizeStatistics. The feature implementation should be complete and associated tests will be added progressively. Please take a look when you have time. Thanks! @emkornfield

cc @mapleFU

etseidl · 2023-10-19T21:31:41Z

Thanks @wgtmac, this looks great! I'm not sure if this is in scope for this PR, but it would be nice if the CLI was aware of the changes. Specifically, it would be great if the column-index command could write out the unencoded sizes and histograms. The former is pretty straightforward, but I'm not entirely sure what the right approach is for the histograms. Also, rewrite currently ignores the new statistics. For a straight copy, all that's needed is to add chunk.getSizeStatistics() to the arg list of ColumnChunkMetadata.get() here.

wgtmac · 2023-10-20T01:50:10Z

Thanks for the suggestion! Yes, it definitely should be done. @etseidl

emkornfield · 2023-10-20T16:49:21Z

@wgtmac took a scan through and this generally seems like what I expected. Thank you for doing it. Agree unit tests are needed.

parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java

wgtmac · 2023-11-23T07:02:47Z

I have just rebased on the latest master branch and fixed all CI failures. As this PR is getting too large, I will add the print CLI command and rewriter support for SizeStatistics in follow-up PRs. This is now ready for review. @emkornfield @etseidl @gszadovszky @shangxinli

wgtmac · 2023-11-23T07:06:34Z

cc @ConeyLiu as I have modified the mergeColumnStatistics method, which you've just refactored.

ConeyLiu · 2023-11-23T15:10:09Z

Thanks @wgtmac for the notification.

parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java

parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndex.java

parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/OffsetIndex.java

...-column/src/main/java/org/apache/parquet/internal/column/columnindex/OffsetIndexBuilder.java

parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java

pom.xml

emkornfield · 2023-12-06T06:53:29Z

Gentle ping @emkornfield

Took another pass through. I'm less familiar with parquet-mr, but overall this looks OK to me, with the exception of confirming whether we should be changing the length to 0 in the one place I flagged.

parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java

parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java

shangxinli · 2024-02-18T22:36:47Z

LGTM

@ConeyLiu @etseidl @emkornfield Do you still have pending comments?

etseidl · 2024-02-19T00:50:43Z

Looks good to me too. I'd still like to see the CLI changed at some point to print the new statistics, but if no one else has cycles to work on that, I could try cleaning up what I have locally.

wgtmac · 2024-02-19T01:57:06Z

@etseidl I have filed https://issues.apache.org/jira/browse/PARQUET-2433 and https://issues.apache.org/jira/browse/PARQUET-2434 as follow-up work items. Will work on them once this PR gets merged.

ConeyLiu

+1 thanks for the great work.

emkornfield · 2024-02-19T17:46:18Z

I think all of my suggestions have been addressed. Thanks @wgtmac !

wgtmac · 2024-02-23T05:47:58Z

@gszadovszky It would be great if you could take a look.

gszadovszky

I have some comments but LGTM overall.

This change is about writing these new statistics. Are there any benefits to actually using them when reading? Do we plan to implement that?

Side note: This is yet another set of statistics that we gather while writing data. We already have min/max statistics, null counts, bloom filters, and now some additional ones. I think we should implement a centralized builder that we call once for each value/dl/rl and that can generate the statistics we need. The gathering part needs to be as efficient as possible. Even a single extra check per value can significantly decrease write performance.
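As an illustration of the "centralized builder" idea in the side note above, here is a minimal sketch in which the writer makes one call per value and that call fans out to every statistic being gathered. All type and method names (`ValueObserver`, `CentralCollector`, and so on) are hypothetical and chosen for this example only; they are not parquet-mr classes, and the sketch handles only `long` values.

```java
// Hypothetical sketch; none of these types are parquet-mr classes.
final class CentralCollectorDemo {
  public static void main(String[] args) {
    MinMaxObserver minMax = new MinMaxObserver();
    NullCountObserver nulls = new NullCountObserver();
    LevelHistogramObserver levels = new LevelHistogramObserver(1);
    CentralCollector collector = new CentralCollector(minMax, nulls, levels);

    collector.write(5L, 0, 1);   // non-null value at definition level 1
    collector.writeNull(0, 0);   // null at definition level 0
    collector.write(-2L, 0, 1);

    System.out.println("min=" + minMax.min + " max=" + minMax.max);
    System.out.println("nulls=" + nulls.nullCount);
    System.out.println("defLevel[1]=" + levels.definitionLevelHistogram[1]);
  }
}

/** Callback invoked once per written value or null (assumed interface). */
interface ValueObserver {
  void onValue(long value, int repetitionLevel, int definitionLevel);

  void onNull(int repetitionLevel, int definitionLevel);
}

/** Tracks the minimum and maximum of the non-null values. */
final class MinMaxObserver implements ValueObserver {
  long min = Long.MAX_VALUE;
  long max = Long.MIN_VALUE;

  @Override
  public void onValue(long value, int repetitionLevel, int definitionLevel) {
    min = Math.min(min, value);
    max = Math.max(max, value);
  }

  @Override
  public void onNull(int repetitionLevel, int definitionLevel) {}
}

/** Counts nulls. */
final class NullCountObserver implements ValueObserver {
  long nullCount;

  @Override
  public void onValue(long value, int repetitionLevel, int definitionLevel) {}

  @Override
  public void onNull(int repetitionLevel, int definitionLevel) {
    nullCount++;
  }
}

/** Counts how many entries were written at each definition level. */
final class LevelHistogramObserver implements ValueObserver {
  final long[] definitionLevelHistogram;

  LevelHistogramObserver(int maxDefinitionLevel) {
    definitionLevelHistogram = new long[maxDefinitionLevel + 1];
  }

  @Override
  public void onValue(long value, int repetitionLevel, int definitionLevel) {
    definitionLevelHistogram[definitionLevel]++;
  }

  @Override
  public void onNull(int repetitionLevel, int definitionLevel) {
    definitionLevelHistogram[definitionLevel]++;
  }
}

/** Single entry point: the writer calls this once per value. */
final class CentralCollector {
  private final ValueObserver[] observers;

  CentralCollector(ValueObserver... observers) {
    this.observers = observers;
  }

  void write(long value, int repetitionLevel, int definitionLevel) {
    for (ValueObserver o : observers) {
      o.onValue(value, repetitionLevel, definitionLevel);
    }
  }

  void writeNull(int repetitionLevel, int definitionLevel) {
    for (ValueObserver o : observers) {
      o.onNull(repetitionLevel, definitionLevel);
    }
  }
}
```

The loop over observers keeps the sketch short; a performance-sensitive implementation might instead hold each statistic in its own field so the per-value path has no virtual dispatch through an array.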

parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java

gszadovszky · 2024-02-23T09:44:53Z

parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java

+    public void add(int repetitionLevel, int definitionLevel, Binary value) {
+      add(repetitionLevel, definitionLevel);
+      if (type.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.BINARY && value != null) {
+        unencodedByteArrayDataBytes = Math.addExact(unencodedByteArrayDataBytes, value.length());


I don't think we need to fear an overflow when adding an int to a long.

wgtmac · 2024-02-24

I'm inclined to keep this check just in case.

gszadovszky · 2024-02-26

In most cases, it is good to have checks like this one. But they can significantly hurt performance in frequently called code. This method is invoked for every value, so we should be as efficient as possible.

wgtmac · 2024-02-26

Makes sense. I have removed the overflow check.
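To make the outcome of this thread concrete, the following is a minimal sketch of a size accumulator that uses plain `long` addition instead of `Math.addExact`. It is not the PR's `SizeStatistics.Builder`: the class name `SizeStatsSketch`, the use of `byte[]` in place of `Binary`, and the constructor parameters are assumptions made so the example is self-contained.

```java
// Hypothetical sketch; SizeStatsSketch is not a parquet-mr class.
final class SizeStatsSketchDemo {
  public static void main(String[] args) {
    SizeStatsSketch stats = new SizeStatsSketch(0, 1);
    stats.add(0, 1, new byte[] {1, 2, 3});
    stats.add(0, 0, null); // a null value contributes no data bytes
    stats.add(0, 1, new byte[] {4, 5});
    System.out.println("unencoded bytes = " + stats.unencodedByteArrayDataBytes());

    // Number of maximum-length values needed before a long sum could overflow.
    System.out.println("adds before overflow = " + (Long.MAX_VALUE / Integer.MAX_VALUE));
  }
}

final class SizeStatsSketch {
  private final long[] repetitionLevelHistogram;
  private final long[] definitionLevelHistogram;
  private long unencodedByteArrayDataBytes;

  SizeStatsSketch(int maxRepetitionLevel, int maxDefinitionLevel) {
    repetitionLevelHistogram = new long[maxRepetitionLevel + 1];
    definitionLevelHistogram = new long[maxDefinitionLevel + 1];
  }

  /** Records one entry in the level histograms. */
  void add(int repetitionLevel, int definitionLevel) {
    repetitionLevelHistogram[repetitionLevel]++;
    definitionLevelHistogram[definitionLevel]++;
  }

  /** Records one entry and, for non-null values, its unencoded length. */
  void add(int repetitionLevel, int definitionLevel, byte[] value) {
    add(repetitionLevel, definitionLevel);
    if (value != null) {
      // Plain addition: each term is at most Integer.MAX_VALUE, so the long
      // sum cannot overflow until more than four billion maximum-length
      // values have been added.
      unencodedByteArrayDataBytes += value.length;
    }
  }

  long unencodedByteArrayDataBytes() {
    return unencodedByteArrayDataBytes;
  }

  long definitionLevelCount(int level) {
    return definitionLevelHistogram[level];
  }

  long repetitionLevelCount(int level) {
    return repetitionLevelHistogram[level];
  }
}
```

The trade-off discussed above is visible in `add`: the hot path is a null check and an addition, with no exception-throwing branch per value.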

parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java

parquet-column/src/test/java/org/apache/parquet/column/statistics/TestSizeStatistics.java

parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java

pom.xml

wgtmac · 2024-02-24T15:17:58Z

Thanks for the feedback! I've addressed all the comments and added a new internal ColumnValueCollector class for common stats collection. @gszadovszky

gszadovszky · 2024-02-26T10:21:43Z

parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnValueCollector.java

+  void write(int value, int repetitionLevel, int definitionLevel) {
+    statistics.updateStats(value);
+    sizeStatisticsBuilder.add(repetitionLevel, definitionLevel);
+    if (bloomFilter != null) {


What do you think about using a no-op BloomFilter implementation instead of a null check? I am not sure if it would perform better, though.

wgtmac · 2024-02-26

Fixed. Could you please review it again?
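The suggestion in this thread is an instance of the null-object pattern. Below is a minimal sketch of the idea, assuming a hypothetical `HashSink` interface as a stand-in for parquet-mr's `BloomFilter` and a placeholder hash function; it does not reproduce the actual `ColumnValueCollector` code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; HashSink stands in for a bloom filter interface.
final class NoOpSinkDemo {
  public static void main(String[] args) {
    RecordingSink recording = new RecordingSink();
    CollectorSketch withFilter = new CollectorSketch(recording);
    CollectorSketch withoutFilter = new CollectorSketch(null);

    for (int v = 0; v < 3; v++) {
      withFilter.write(v);
      withoutFilter.write(v); // no null check on the write path
    }
    System.out.println("hashes recorded = " + recording.hashes.size());
    System.out.println("values seen without filter = " + withoutFilter.valueCount());
  }
}

interface HashSink {
  void insertHash(long hash);
}

/** Test double that remembers every hash it receives. */
final class RecordingSink implements HashSink {
  final Set<Long> hashes = new HashSet<>();

  @Override
  public void insertHash(long hash) {
    hashes.add(hash);
  }
}

/** Null object: accepts hashes and discards them. */
final class NoOpSink implements HashSink {
  static final NoOpSink INSTANCE = new NoOpSink();

  private NoOpSink() {}

  @Override
  public void insertHash(long hash) {
    // Intentionally empty.
  }
}

final class CollectorSketch {
  private final HashSink sink;
  private long valueCount;

  /** A null argument means "no bloom filter configured". */
  CollectorSketch(HashSink sinkOrNull) {
    // The null check happens once here rather than once per value.
    this.sink = sinkOrNull != null ? sinkOrNull : NoOpSink.INSTANCE;
  }

  void write(int value) {
    valueCount++;
    // Placeholder hash, used only to keep the sketch self-contained.
    sink.insertHash(value * 0x9E3779B97F4A7C15L);
  }

  long valueCount() {
    return valueCount;
  }
}
```

Whether this is faster than a per-value null check depends on how the JIT handles the interface call, which matches the uncertainty voiced in the comment above; measuring with a benchmark would be the way to decide.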

gszadovszky

Thank you, @wgtmac!

wgtmac marked this pull request as draft October 19, 2023 07:49

etseidl reviewed Oct 20, 2023

parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java (outdated, resolved)

wgtmac mentioned this pull request Oct 28, 2023

PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering apache/parquet-format#197

Merged

wgtmac force-pushed the size_stats branch 2 times, most recently from cc0d75d to 0acf99f Compare November 4, 2023 17:25

wgtmac force-pushed the size_stats branch 2 times, most recently from 26ced88 to 4ff9d3d Compare November 22, 2023 04:20

wgtmac marked this pull request as ready for review November 22, 2023 04:21

wgtmac force-pushed the size_stats branch 7 times, most recently from 339d397 to ed3a89e Compare November 23, 2023 06:15

ConeyLiu reviewed Nov 23, 2023

parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java (outdated, resolved)

ConeyLiu reviewed Nov 23, 2023

parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java (outdated, resolved)

ConeyLiu reviewed Nov 23, 2023

parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java (outdated, resolved)

ConeyLiu reviewed Nov 23, 2023

parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java (outdated, resolved)

ConeyLiu reviewed Nov 23, 2023

parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java (resolved)

ConeyLiu reviewed Nov 23, 2023

parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java (outdated, resolved)

ConeyLiu reviewed Nov 23, 2023

parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java (outdated, resolved)

ConeyLiu reviewed Nov 23, 2023

parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndex.java (outdated, resolved)

emkornfield reviewed Dec 6, 2023

parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/OffsetIndex.java (resolved)

emkornfield reviewed Dec 6, 2023

...-column/src/main/java/org/apache/parquet/internal/column/columnindex/OffsetIndexBuilder.java (resolved)

emkornfield reviewed Dec 6, 2023

parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java (outdated, resolved)

emkornfield reviewed Dec 6, 2023

pom.xml (outdated, resolved)

ConeyLiu reviewed Dec 7, 2023

parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java (outdated, resolved)

ConeyLiu reviewed Dec 7, 2023

parquet-column/src/main/java/org/apache/parquet/column/statistics/SizeStatistics.java (outdated, resolved)

wgtmac force-pushed the size_stats branch 2 times, most recently from eb7a1a5 to aa69b35 Compare December 10, 2023 14:01

wgtmac force-pushed the size_stats branch from aa69b35 to 4de885d Compare February 7, 2024 09:52

wgtmac force-pushed the size_stats branch from 4de885d to 2d36a77 Compare February 18, 2024 07:46

ConeyLiu approved these changes Feb 19, 2024

gszadovszky requested changes Feb 23, 2024

PARQUET-2261: Implement SizeStatistics

6039c93

wgtmac force-pushed the size_stats branch from 2d36a77 to 6039c93 Compare February 24, 2024 08:21

wgtmac added 3 commits February 24, 2024 22:35

address feedback

d59f848

remove japicmp exclusion

969040c

add ColumnValueCollector for collecting stats

4eb688a

gszadovszky reviewed Feb 26, 2024

use no-op bloomfilter

794d68e

gszadovszky approved these changes Feb 27, 2024

wgtmac merged commit d31a891 into apache:master Feb 27, 2024
9 checks passed

This was referenced Mar 16, 2024

[C++][Parquet] Implement SizeStatistics apache/arrow#40592

Open

GH-40592: [C++][Parquet] Implement SizeStatistics apache/arrow#40594

Open
