Spark: Enhance metadata deletes in 3.2 (#3369)
Conversation
```java
private static final Map<Class<? extends Filter>, Operation> FILTERS = ImmutableMap
    .<Class<? extends Filter>, Operation>builder()
    .put(AlwaysTrue.class, Operation.TRUE)
```
I had to add this because the TRUNCATE tests I added were failing.
That's strange. This is exactly the kind of thing that makes me glad we decided to test and build against each Spark branch independently.
Yeah, I think the TRUNCATE logic was added in 3.1, so it has probably been broken since then.
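For readers following along, here is a rough sketch of why the map entry above matters: Spark's TRUNCATE path pushes down an AlwaysTrue filter, and without a mapping for it the filter conversion fails. The surrounding class and method here are illustrative assumptions; only the FILTERS map entry comes from the diff above.

```java
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;
import org.apache.spark.sql.sources.AlwaysTrue;
import org.apache.spark.sql.sources.Filter;

class TruncateFilterSketch {
  // Illustrative only: convert Spark's AlwaysTrue filter to Iceberg's TRUE expression
  static Expression convert(Filter filter) {
    if (filter instanceof AlwaysTrue) {
      // without an AlwaysTrue mapping, TRUNCATE pushes a filter the converter
      // does not recognize and the operation fails
      return Expressions.alwaysTrue();
    }
    return null; // other filter types handled elsewhere
  }
}
```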
@@ -212,33 +220,44 @@ private String getRowLevelOperationMode(String operation) {
```java
@Override
public boolean canDeleteWhere(Filter[] filters) {
```
This implementation is just an idea of what can be done. It makes the check much more expensive than before, but it also lets us cover more scenarios (e.g. deletes with transforms, deletes using metrics, deletes with multiple specs).
We may consider a flag or something to disable it, but I am not sure at this point. Maybe there are better ideas. Let me know.
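To make the idea concrete, here is a minimal sketch of the check being described, assembled from the fragments quoted later in this review. The scan options, the `caseSensitive` field, and the error handling are assumptions rather than the exact code in the PR, and it assumes Iceberg's `Expression`, `Projections`, `Evaluator`, `StrictMetricsEvaluator` and Guava's `Maps`/`Iterables` are imported.

```java
// Sketch: a metadata-only delete is reported as possible only if every data file
// matching the delete expression can be dropped entirely, judged either by a
// strict partition projection or by column metrics.
private boolean canDeleteUsingMetadata(Expression deleteExpr) {
  TableScan scan = table().newScan()
      .filter(deleteExpr)
      .caseSensitive(caseSensitive)      // see the caseSensitivity discussion below
      .includeColumnStats()
      .ignoreResiduals();

  try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
    Map<Integer, Evaluator> evaluators = Maps.newHashMap();
    StrictMetricsEvaluator metricsEvaluator =
        new StrictMetricsEvaluator(table().schema(), deleteExpr);

    return Iterables.all(tasks, task -> {
      DataFile file = task.file();
      PartitionSpec spec = task.spec();
      Evaluator evaluator = evaluators.computeIfAbsent(
          spec.specId(),
          specId -> new Evaluator(spec.partitionType(), Projections.strict(spec).project(deleteExpr)));
      // the file is fully covered if its partition strictly matches the filter
      // or its column metrics prove every row matches
      return evaluator.eval(file.partition()) || metricsEvaluator.eval(file);
    });
  } catch (IOException ioe) {
    // if planning fails, fall back to the rewrite-based delete path
    return false;
  }
}
```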
I am wondering if it makes sense to do this sort of metadata delete inside the non-metadata delete path, and here just check whether a metadata delete is even possible. That is, instead of checking whether a metadata delete can be done, make sure it cannot be done:
see whether the delete conditions could not possibly apply to all the specs currently in play, rather than checking whether they can apply to all live files.
I am afraid it will be too late given how we plan to rewrite row-level commands and how Spark handles this check. However, we could make this check less expensive and not cover some use cases.
@RussellSpitzer @szehon-ho @flyrain @rdblue @kbendick @karuppayya, any thoughts? I think our current metadata-only delete implementation is very limited.
```java
  return !identitySourceIds.contains(field.fieldId());
});

// a metadata delete is possible iff matching files can be deleted entirely
private boolean canDeleteUsingMetadata(Expression deleteExpr) {
```
Looks like this pretty much just runs the delete logic to see if it will succeed.
I was thinking about something slightly different, which is to check whether the strict projection and the inclusive projection of the delete filter are equivalent. If so, then we know that the delete aligns with partitioning and will succeed. I'm okay with this if we don't think it's going to be too expensive during planning though.
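A rough sketch of that alternative for a single spec, hedged: it assumes an expression-equivalence helper such as `isEquivalentTo` is available on Iceberg's `Expression`; otherwise a manual structural comparison would be needed.

```java
// Sketch of the projection-equivalence check: if the strict and inclusive
// projections of the delete filter agree, the filter aligns with partitioning
// and matching files can be dropped whole without inspecting them.
private boolean alignsWithPartitioning(PartitionSpec spec, Expression deleteExpr) {
  Expression strict = Projections.strict(spec).project(deleteExpr);
  Expression inclusive = Projections.inclusive(spec).project(deleteExpr);
  return strict.isEquivalentTo(inclusive);   // assumption: equivalence helper exists
}
```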
I have the same concern about the performance. I think it will be reasonable in most cases just because we won't write new manifests and will most likely have a selective filter. However, I am open to any other alternatives.
I initially wanted to check just the conditions but I was not sure how to handle multiple specs. How would the inclusive and strict approach work here? Will we require that the projections are equivalent for each spec?
Yeah, we would need the expressions to be equal for each spec that has a matching manifest. So we could filter manifests using the manifest list, then get the specs and do the projection. That way you could take advantage of some partition filtering to eliminate specs.
I actually like what you have here. It shouldn't be a big problem. Let's see how it goes with this and we can always introduce a lighter-weight version later. Luckily, this should fail fast.
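For the multi-spec case, something along these lines could work, building on the single-spec helper sketched earlier. The snapshot and manifest accessors here (`dataManifests()`, `ManifestEvaluator.forRowFilter`) reflect one version of the Iceberg API and may differ across releases; treat this as a sketch, not the PR's code.

```java
// Sketch: only specs that have manifests matching the delete expression need to
// align with it; filtering manifests first lets partition filtering eliminate specs.
private boolean alignsWithAllSpecs(Expression deleteExpr, boolean caseSensitive) {
  Snapshot snapshot = table().currentSnapshot();
  if (snapshot == null) {
    return true;   // empty table: nothing to delete
  }

  for (ManifestFile manifest : snapshot.dataManifests()) {
    PartitionSpec spec = table().specs().get(manifest.partitionSpecId());
    ManifestEvaluator manifestEval = ManifestEvaluator.forRowFilter(deleteExpr, spec, caseSensitive);
    // a spec with matching data must align with the filter for a metadata delete
    if (manifestEval.eval(manifest) && !alignsWithPartitioning(spec, deleteExpr)) {
      return false;
    }
  }
  return true;
}
```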
That should work. Since this implementation is more straightforward and covers slightly more use cases, let's try it out. We can switch to the alternative if the performance is bad.
```java
Evaluator evaluator = evaluators.computeIfAbsent(
    spec.specId(),
    specId -> new Evaluator(spec.partitionType(), Projections.strict(spec).project(deleteExpr)));
return evaluator.eval(file.partition()) || metricsEvaluator.eval(file);
});
```
```java
// a metadata delete is possible iff matching files can be deleted entirely
private boolean canDeleteUsingMetadata(Expression deleteExpr) {
  TableScan scan = table().newScan()
```
Do we need to pass down the caseSensitivity flag to this scan?
Good catch. We probably should.
Missed that. Added.
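For reference, the fix being discussed is roughly just threading the flag into the scan; where the `caseSensitive` value comes from (e.g. the Spark session configuration) is an assumption here.

```java
TableScan scan = table().newScan()
    .caseSensitive(caseSensitive)   // propagate the session's case sensitivity
    .filter(deleteExpr);
```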
kbendick left a comment:
One or two questions, but outside of adding the caseSensitivity flag, this looks good to me.
```java
}

return true;
return deleteExpr == Expressions.alwaysTrue() || canDeleteUsingMetadata(deleteExpr);
```
Should this use equals instead of ==?
We use reference equality for true/false literals in a few places. It should be safe as these literals are singletons.
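A tiny illustration of why the reference check is safe, assuming Iceberg's TRUE/FALSE literal expressions remain singletons:

```java
// Expressions.alwaysTrue() returns the same singleton instance on every call,
// so comparing with == behaves the same as equals() for this literal.
Expression a = Expressions.alwaysTrue();
Expression b = Expressions.alwaysTrue();
boolean sameInstance = (a == b);   // true: both refer to the TRUE singleton
```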
Thanks, @RussellSpitzer @rdblue @szehon-ho @kbendick!
This PR is an attempt to enable more metadata-only deletes in Spark.