
Core: Enable column statistics filtering after planning #8803

Merged: 10 commits into apache:main on Nov 14, 2023

Conversation

@pvary (Contributor) commented Oct 11, 2023

Based on our discussion on the dev list, I have created this PR, which makes it possible to narrow down the retained column statistics in the ScanTask returned from planning.

For reference the discussion: https://lists.apache.org/thread/pcfpztld5gfpdvm1dy4l84xfl6odxhw8

The PR makes it possible to set includeColumnStats for a Scan. The resulting ScanTasks will contain column statistics only for the specified columnIds, omitting statistics which might be present in the metadata files but were not specifically requested by the user.

The PR consists of 3 main parts:

  1. Interface changes:
    • Scan.includeColumnStats to set the required columnIds
    • ContentFile.copyWithSpecificStats to provide an interface for the stat removal when copying the file objects
  2. Core changes:
    • Implementing the BaseFile constructor which takes care of the statistics filtering, and making sure that the other implementations use this method as well.
    • Propagating the columnStatsToInclude field through the different scan implementations, and putting it into the TableScanContext.
    • Adding a new property to the ManifestGroup builder to store the columnStatsToKeep. This class is responsible for the final copy of the DataFiles where we remove the statistics which are not needed.
    • Added tests to check that the statistics removal is working as expected.
  3. Flink changes:
    • Adding a new FlinkReadOption to set which column stats we should keep: column-stats-to-keep
    • Minimal Flink ScanContext and Planner changes to propagate the values
    • Updated the documentation for the Flink Source
    • Added tests to check that the statistics removal is working as expected.

new ContinuousSplitPlannerImpl(tableResource.tableLoader().clone(), scanContext, null);

ContinuousEnumerationResult initialResult = splitPlanner.planSplits(null);
Assert.assertEquals(1, initialResult.splits().size());
Contributor:

it would be better to use AssertJ-style assertions for newly added code as that makes migrating away from JUnit4 easier. See also https://github.com/apache/iceberg/blob/a3aff95f9e60962240b94242e24a778760bdd1d9/CONTRIBUTING.md#assertj.

That particular check would then be assertThat(initialResult.splits()).hasSize(1)

Contributor Author (@pvary):

These are a few new tests in a test class where different assertion methods are used. I think we should migrate them in one PR, and until then stick to the style which is used in the actual test class.

@stevenzwu (Contributor) commented:

@pvary I think we probably want to push the copyStatsForColumns down to ManifestReader. https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestReader.java#L299

@pvary (author) commented Oct 18, 2023

> @pvary I think we probably want to push the copyStatsForColumns down to ManifestReader. https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestReader.java#L299

That is for reading the data from the manifest file.
If we want statistics for at least one column, then the manifest file reading schema should contain the stat fields, like:

  private static final Set<String> STATS_COLUMNS =
      ImmutableSet.of(
          "value_counts",
          "null_value_counts",
          "nan_value_counts",
          "lower_bounds",
          "upper_bounds",
          "record_count");

So we cannot do the filtering here. We need to read the stat fields from the manifest file, and then filter out the stats later for the columns where we do not need them.

@stevenzwu (Contributor) commented:

> > @pvary I think we probably want to push the copyStatsForColumns down to ManifestReader. https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestReader.java#L299
>
> That is for reading the data from the manifest file. If we want statistics for at least one column, then the manifest file reading schema should contain the stat fields, like:
>
>   private static final Set<String> STATS_COLUMNS =
>       ImmutableSet.of(
>           "value_counts",
>           "null_value_counts",
>           "nan_value_counts",
>           "lower_bounds",
>           "upper_bounds",
>           "record_count");
>
> So we cannot do the filtering here. We need to read the stat fields from the manifest file, and then filter out the stats later for the columns where we do not need them.

If we look at this line
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestReader.java#L299

it calls this method from ContentFile

  default F copy(boolean withStats) {
    return withStats ? copy() : copyWithoutStats();
  }

if we push down the selection to the ManifestReader, it can call the new copyWithSpecificStats method that you added in this PR.

I understand the current code is for metadata column selection/projection, not the columns selected to include stats
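The three-way dispatch stevenzwu suggests pushing into ManifestReader can be sketched with a stdlib-only simulation. This is not Iceberg code: the sketch acts on a single stats map rather than whole file objects, and the names mirror the discussion above (copy, copyWithoutStats, copyWithSpecificStats) only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Simulated dispatch: null = keep all stats (copy()), empty set = drop all
// stats (copyWithoutStats()), otherwise keep only the requested field ids
// (the copyWithSpecificStats() case discussed above).
public class CopyDispatch {

  static Map<Integer, Long> copyStats(Map<Integer, Long> stats, Set<Integer> requestedIds) {
    if (requestedIds == null) {
      return new HashMap<>(stats); // full copy: retain all stats
    } else if (requestedIds.isEmpty()) {
      return new HashMap<>(); // copy without stats
    } else {
      Map<Integer, Long> kept = new HashMap<>(); // copy with specific stats
      stats.forEach((id, value) -> {
        if (requestedIds.contains(id)) {
          kept.put(id, value);
        }
      });
      return kept;
    }
  }

  public static void main(String[] args) {
    Map<Integer, Long> stats = Map.of(1, 5L, 2, 7L);
    System.out.println(copyStats(stats, null).size());      // 2: full copy
    System.out.println(copyStats(stats, Set.of()).size());  // 0: no stats
    System.out.println(copyStats(stats, Set.of(2)).size()); // 1: only id 2
  }
}
```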

Resolved review threads:
  • api/src/main/java/org/apache/iceberg/ContentFile.java (2)
  • api/src/main/java/org/apache/iceberg/Scan.java
  • core/src/main/java/org/apache/iceberg/BaseScan.java
  • core/src/main/java/org/apache/iceberg/BaseFile.java (3)
@pvary (author) commented Oct 19, 2023

> I understand the current code is for metadata column selection/projection, not the columns selected to include stats

My understanding is that planning might need stats which are not required by the user.
Based on this, changing the stat retrieval here would mean:

  • We need a more complicated decision tree for which column stats to keep (the union of the user-requested stats and the planning-required stats)
  • We (might) need to do a selective copy of the stats later again, to remove the planning-required stats which are not requested by the user. This might have a detrimental performance impact.

This change improves memory consumption anyway. If we think we need even more improvement and we accept the extra complexity, we can add this feature later.
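The union-based alternative described above can be sketched with stdlib sets. This is hypothetical: Iceberg does not implement this two-phase selection; the helper names and the idea of a second narrowing pass are illustrations of the trade-off pvary describes, not merged code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical two-phase selection: planning keeps the union of user-requested
// and planning-required stats columns, then a second pass narrows the retained
// stats back down to only what the user asked for.
public class StatsSelection {

  // Phase 1: stats columns to keep while planning (union).
  static Set<Integer> planningStats(Set<Integer> userRequested, Set<Integer> planningRequired) {
    Set<Integer> union = new HashSet<>(userRequested);
    union.addAll(planningRequired);
    return union;
  }

  // Phase 2: the extra selective copy before returning tasks; this second
  // pass is the potential performance cost mentioned above.
  static Set<Integer> finalStats(Set<Integer> keptDuringPlanning, Set<Integer> userRequested) {
    Set<Integer> result = new HashSet<>(keptDuringPlanning);
    result.retainAll(userRequested);
    return result;
  }

  public static void main(String[] args) {
    Set<Integer> user = Set.of(1);
    Set<Integer> planner = Set.of(2, 3); // e.g. columns used by filter expressions
    Set<Integer> kept = planningStats(user, planner);
    System.out.println(finalStats(kept, user)); // only the user-requested column remains
  }
}
```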

@pvary force-pushed the colstat2 branch 3 times, most recently from 8c183a1 to 95ec264 on October 19, 2023 16:35
*
* <p>Column stats include: value count, null value count, lower bounds, and upper bounds.
*
* @param requestedColumns column names for which to keep the stats. If <code>null</code> then all
Contributor:

I am not sure how I feel about supporting null here. Let me think.
Also, won't the current implementation throw an exception if we pass null?

Contributor:

I think we should not support null here as a valid value independently of what we will do in ContentFile. I would drop the doc.

Contributor Author (@pvary):

I would guess that the situation is similar with the other @Nullable TableScanContext attributes, like:

  • snapshotId
  • selectedColumns
  • projectedSchema
  • fromSnapshotId
  • toSnapshotId
  • branch

We have an undefined default behaviour which could be achieved either by not setting the values, or by setting them to null. The only difference here is that we define this default behaviour. For consistency's sake we can remove the comment, but the behaviour will remain the same.

Am I missing something?

Thanks for the detailed review!

@aokolnychyi (Contributor) left a comment:

I did a detailed round, would love to hear what others think.

Resolved review threads: api/src/main/java/org/apache/iceberg/ContentFile.java (4)
*
* <p>Column stats include: value count, null value count, lower bounds, and upper bounds.
*
* @param requestedColumns column names for which to keep the stats. If <code>null</code> then all
Contributor:

I think we should not support null here as a valid value independently of what we will do in ContentFile. I would drop the doc.

@aokolnychyi (Contributor) left a comment:

I left some final comments, a lot of them are optional personal suggestions. This looks good to me overall.

@@ -866,6 +866,11 @@ acceptedBreaks:
old: "method void org.apache.iceberg.encryption.Ciphers::<init>()"
new: "method void org.apache.iceberg.encryption.Ciphers::<init>()"
justification: "Static utility class - should not have public constructor"
"1.4.0":
org.apache.iceberg:iceberg-core:
- code: "java.field.serialVersionUIDChanged"
Contributor:

While I think it should be fine, here is an idea. Java comes with a serialver utility that lets us generate the version UID as it was prior to the change in this PR. We can use that value instead of 1L to be fully compatible. We don't modify the serialization of this class; we simply never assigned a serialVersionUID. If we can recover the default value, we shouldn't worry about compatibility.

Here is the value I got locally:

cd core/build/classes/java/main
serialver org.apache.iceberg.util.SerializableMap
org.apache.iceberg.util.SerializableMap:    private static final long serialVersionUID = -3377238354349859240L;

Could you double check, @pvary? If not, we can keep it as is.

Contributor Author (@pvary):

Checked, but even when setting the serialVersionUID to -3377238354349859240L we get a revapi failure.
Also double checked: serialver generated the same id for the old and the new code on my Mac.

So I resorted to the revapi change.

Contributor:

Yeah, I am not sure what revapi actually does. I doubt they compare actual values. I think we should be fine.

core/src/main/java/org/apache/iceberg/BaseFile.java Outdated Show resolved Hide resolved
core/src/main/java/org/apache/iceberg/DataTableScan.java Outdated Show resolved Hide resolved
}

  @Override
  public DataFile copy() {
-   return new GenericDataFile(this, true /* full copy */);
+   return new GenericDataFile(this, true /* full copy */, null);
Contributor:

You may consider overloading the constructor so that you don't have to pass an extra null here, or adding a comment for the second argument (we have a comment for true but not for null).

Contributor Author (@pvary):

I think this usage is straightforward, and adding a new constructor would not help much, so I did not apply this change.

@@ -154,6 +156,12 @@ ManifestGroup caseSensitive(boolean newCaseSensitive) {
return this;
}

ManifestGroup columnsToKeepStats(Set<Integer> newColumnsToKeepStats) {
this.columnsToKeepStats =
newColumnsToKeepStats == null ? null : Sets.newHashSet(newColumnsToKeepStats);
Contributor:

This copy seems redundant but up to you.

Contributor Author (@pvary):

Kept it as is; this is more consistent with the other implementations.
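For context on the defensive copy in the snippet above: copying the caller's set means later mutations of that set cannot change the builder's state. A stdlib-only illustration (not the actual ManifestGroup code; the class and accessor names are made up for the sketch):

```java
import java.util.HashSet;
import java.util.Set;

// Minimal builder illustrating why columnsToKeepStats copies its argument.
public class DefensiveCopy {
  private Set<Integer> columnsToKeepStats;

  DefensiveCopy columnsToKeepStats(Set<Integer> newColumns) {
    // Copy so that later changes to the caller's set do not leak in;
    // null still means "keep all stats".
    this.columnsToKeepStats = newColumns == null ? null : new HashSet<>(newColumns);
    return this;
  }

  Set<Integer> kept() {
    return columnsToKeepStats;
  }

  public static void main(String[] args) {
    Set<Integer> caller = new HashSet<>(Set.of(1, 2));
    DefensiveCopy builder = new DefensiveCopy().columnsToKeepStats(caller);
    caller.add(3); // mutate after handing the set to the builder
    System.out.println(builder.kept().size()); // still 2
  }
}
```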

@pvary (author) commented Nov 14, 2023

@rdblue: I have fixed the changes requested by you. If you have any further comments, please leave a review.

@aokolnychyi did another thorough review, and I applied most of his suggested changes.

So with 2 +1's, I would like to merge this change in the next few days.

Thanks,
Peter

@aokolnychyi (Contributor) commented:

I went through Ryan's comments one more time. They seem to be addressed. I also think the current version is simpler. Let's merge it as is and follow up if needed to unblock subsequent changes in Flink.

@rdblue, please let us know if you spot anything else.

@aokolnychyi aokolnychyi merged commit 6ec3de3 into apache:main Nov 14, 2023
46 checks passed
@pvary (author) commented Nov 15, 2023

Thanks @nastra, @rdblue, @stevenzwu, @aokolnychyi for the diligent reviews!

pvary pushed commits to pvary/iceberg that referenced this pull request, Nov 23 and 24, 2023
stevenzwu pushed a commit that referenced this pull request Nov 24, 2023
Co-authored-by: Peter Vary <peter_vary4@apple.com>
@pvary pvary deleted the colstat2 branch November 28, 2023 08:49
mas-chen pushed a commit to mas-chen/iceberg that referenced this pull request Jan 9, 2024
rodmeneses pushed a commit to rodmeneses/iceberg that referenced this pull request Feb 19, 2024
rodmeneses added a commit to rodmeneses/iceberg that referenced this pull request Feb 19, 2024
devangjhabakh pushed commits to cdouglas/iceberg that referenced this pull request Apr 22, 2024
5 participants