Core, Spark 3.4: Add filter to Rewrite position deletes #7582

Merged: 4 commits into apache:master on Aug 7, 2023

Conversation

szehon-ho (Collaborator):

This adds support for a filter in RewritePositionDeleteFiles.

Logic: RewritePositionDeleteFiles is based on PositionDeletesTable (a metadata table representing position deletes). Like all metadata tables, it performs partition predicate pushdown by transforming the partition spec so that the partition predicate can be evaluated against the metadata table (i.e., my_table.position_deletes.partition.part_col instead of my_table.part_col).

But here the RewritePositionDeleteFiles action actually receives a filter on the original table, not on the PositionDeletesTable metadata table. So we short-circuit this partition-spec transformation in this case.

This is done by adding a new method, baseTableFilter(), to the PositionDeletesTableScan, which takes a filter expressed against the base table rather than the position_deletes table. Checks are added to ensure it is set exclusively of the filter based on the position_deletes table.
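The "exclusively set" check described above can be sketched as follows. This is a hypothetical, simplified illustration, not the actual Iceberg code: String stands in for org.apache.iceberg.expressions.Expression, and the class name only mirrors the scan discussed in this PR.

```java
// Hypothetical sketch of mutually exclusive filter setters (not Iceberg's
// actual implementation). String stands in for an Expression.
public class PositionDeletesScanSketch {
  private String metadataTableFilter = null; // filter on the position_deletes schema
  private String baseTableFilter = null;     // filter on the base table schema

  // filter(): rejected if baseTableFilter() was already called
  public PositionDeletesScanSketch filter(String expr) {
    if (baseTableFilter != null) {
      throw new IllegalStateException("Cannot set filter: baseTableFilter is already set");
    }
    this.metadataTableFilter = expr;
    return this;
  }

  // baseTableFilter(): rejected if filter() was already called
  public PositionDeletesScanSketch baseTableFilter(String expr) {
    if (metadataTableFilter != null) {
      throw new IllegalStateException("Cannot set baseTableFilter: filter is already set");
    }
    this.baseTableFilter = expr;
    return this;
  }

  // whichever filter was set drives planning; null means always-true
  public String effectiveFilter() {
    return baseTableFilter != null ? baseTableFilter : metadataTableFilter;
  }
}
```

Setting one filter and then attempting to set the other fails fast, which is the invariant the filterSet flags below are guarding.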

@@ -130,18 +132,38 @@ private Schema calculateSchema() {
public static class PositionDeletesBatchScan
extends SnapshotScan<BatchScan, ScanTask, ScanTaskGroup<ScanTask>> implements BatchScan {

private boolean filterSet = false;
szehon-ho (Collaborator, Author):

This is messy, but the overall idea is: we either have a filter set on the metadata table or on the base table, and we can only handle one of these.

szehon-ho (Collaborator, Author):

Rebased

@Override
protected CloseableIterable<ScanTask> doPlanFiles() {
String schemaString = SchemaParser.toJson(tableSchema());

// prepare transformed partition specs and caches
Map<Integer, PartitionSpec> transformedSpecs = transformSpecs(tableSchema(), table().specs());
Map<Integer, PartitionSpec> transformedSpecs = transformSpecsIfNecessary();
Contributor:

Hm, can't they actually work together? There seems to be quite a bit of logic that decides whether to use a filter on the base table or a filter on the metadata table.

Suppose we add baseTableFilter as in this PR. Can we do something like this later?

Expressions.and(filter(), Projections.inclusive(spec, isCaseSensitive()).project(baseTableRowFilter))

Whenever we compute evalCache?

szehon-ho (Collaborator, Author):

OK, the latest change supports both filters now. I use another ManifestEvaluator, obtained via ManifestEvaluator.forPartitionFilter(), which internally does the projection.
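ManifestEvaluator.forPartitionFilter() prunes whole delete manifests using the partition summaries each manifest carries. As a rough, hypothetical illustration of that idea (not Iceberg's actual evaluator, which handles null/NaN counts, non-identity transforms, and arbitrary expression trees), a manifest storing integer lower/upper bounds for a partition field can be skipped when an equality predicate's value falls outside those bounds:

```java
// Hypothetical sketch of manifest-level pruning via partition bounds.
// Real Iceberg manifests expose ManifestFile.partitions() summaries and the
// check is done by ManifestEvaluator; this only models the bounds test.
public class ManifestBoundsSketch {
  // true if a manifest whose partition field spans [lower, upper] might
  // contain rows matching "field = value"; false means it can be skipped
  public static boolean mayContainEq(int lower, int upper, int value) {
    return value >= lower && value <= upper;
  }
}
```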


LoadingCache<Integer, ResidualEvaluator> residualCache =
partitionCacheOf(
transformedSpecs,
spec ->
ResidualEvaluator.of(
spec,
shouldIgnoreResiduals() ? Expressions.alwaysTrue() : filter(),
shouldIgnoreResiduals() ? Expressions.alwaysTrue() : effectiveFilter(),
Contributor:

Hm, it seems a bit suspicious to use the base table filter as the residual. It will be propagated to tasks, and I am not sure those filters will even be resolvable against the metadata table schema.

I need to take a closer look with fresh eyes.

szehon-ho (Collaborator, Author):

yea, made this back to filter()

* @return this for method chaining
*/
@Override
public BatchScan filter(Expression expr) {
Contributor:

Is this actually needed? Won't the base implementation work given the current version of newRefinedScan?

partitionCacheOf(
transformedSpecs,
spec -> ManifestEvaluator.forRowFilter(filter(), spec, isCaseSensitive()));

// iterate through delete manifests
List<ManifestFile> manifests = snapshot().deleteManifests(table().io());

CloseableIterable<ManifestFile> matchingManifests =
Contributor:

Shall we do this filter only if either of the filter expression is non-trivial? Otherwise, what's the point of doing this work?

Contributor:

Well, this is for manifests, so shouldn't matter that much. Never mind.

@@ -223,12 +289,16 @@ public void close() throws IOException {

@Override
public CloseableIterator<ScanTask> iterator() {
Expression partitionFilter =
Projections.inclusive(spec, isCaseSensitive()).project(baseTableFilter);
Contributor:

You could have cached this and used ManifestEvaluator.forPartitionFilter but it is probably not worth it.
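Projections.inclusive(spec, caseSensitive).project(rowFilter), used in the hunk above, turns a row filter on the base table into a filter on partition tuples while staying inclusive: predicates on non-partition columns relax to always-true, so the projection can over-select but never miss a matching file. A simplified, hypothetical model of that behavior for identity-transformed partition columns (the real API works on Expression trees and arbitrary transforms):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch of an inclusive projection for identity partitions.
// A filter is modeled as column -> required value (a conjunction of equalities).
public class InclusiveProjectionSketch {
  // keep only predicates on partition source columns; dropping the rest is
  // what makes the projection inclusive (it can only over-select)
  public static Map<String, Object> project(Map<String, Object> rowFilter, Set<String> partitionCols) {
    Map<String, Object> projected = new HashMap<>();
    for (Map.Entry<String, Object> e : rowFilter.entrySet()) {
      if (partitionCols.contains(e.getKey())) {
        projected.put(e.getKey(), e.getValue());
      }
    }
    return projected;
  }

  // evaluate the projected filter against a file's partition tuple;
  // an empty projection behaves like alwaysTrue
  public static boolean matches(Map<String, Object> projected, Map<String, Object> partition) {
    for (Map.Entry<String, Object> e : projected.entrySet()) {
      if (!Objects.equals(partition.get(e.getKey()), e.getValue())) {
        return false;
      }
    }
    return true;
  }
}
```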

@@ -133,16 +137,23 @@ public RewritePositionDeleteFiles.Result execute() {
}

private StructLikeMap<List<List<PositionDeletesScanTask>>> planFileGroups() {
CloseableIterable<PositionDeletesScanTask> fileTasks = planFiles();
Table deletesTable =
Contributor:

Why add it here and why rename fileTasks? Don't we have to modify planFiles instead?

private CloseableIterable<PositionDeletesScanTask> planFiles() {
  Table deletesTable =
      MetadataTableUtils.createMetadataTableInstance(table, MetadataTableType.POSITION_DELETES);

  PositionDeletesBatchScan scan = (PositionDeletesBatchScan) deletesTable.newBatchScan();

  return CloseableIterable.transform(
      scan.baseTableFilter(filter).ignoreResiduals().planFiles(),
      task -> (PositionDeletesScanTask) task);
}

CloseableIterable<PositionDeletesScanTask> fileTasks = planFiles();
Table deletesTable =
MetadataTableUtils.createMetadataTableInstance(table, MetadataTableType.POSITION_DELETES);
PositionDeletesTable.PositionDeletesBatchScan deletesScan =
Contributor:

Can we use a direct import to shorten the lines, like in the sample snippet I mentioned above?

@@ -421,7 +516,6 @@ public void testPartitionEvolutionRemove() throws Exception {
.rewritePositionDeletes(table)
.option(SizeBasedFileRewriter.REWRITE_ALL, "true")
.execute();

Contributor:

Needed?

aokolnychyi (Contributor) left a review:

This seems correct to me overall. I left a few suggestions.

szehon-ho merged commit 51782d3 into apache:master on Aug 7, 2023. 41 checks passed.
szehon-ho (Collaborator, Author):

Thanks @aokolnychyi for review!
