
Spark: Enhance metadata deletes in 3.2 #3369

Merged

aokolnychyi merged 2 commits into apache:master from aokolnychyi:improve-spark-metadata-deletes on Oct 26, 2021

Conversation

@aokolnychyi (Contributor):

This PR is an attempt to enable more metadata-only deletes in Spark.

github-actions bot added the spark label on Oct 25, 2021

private static final Map<Class<? extends Filter>, Operation> FILTERS = ImmutableMap
.<Class<? extends Filter>, Operation>builder()
.put(AlwaysTrue.class, Operation.TRUE)
Contributor (Author):
I had to add this because TRUNCATE tests I added were failing.
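To illustrate why this entry matters: in Spark 3.1+, TRUNCATE TABLE appears to reach the connector as a delete with an AlwaysTrue filter (which is why the tests above failed without it), so the Spark-to-Iceberg filter converter must handle AlwaysTrue. A minimal sketch, not the PR's exact code (the class and method names here are illustrative):

import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;
import org.apache.spark.sql.sources.AlwaysTrue;
import org.apache.spark.sql.sources.Filter;

public class FilterConversionSketch {
  // returns the Iceberg expression for a Spark filter, or null if unsupported;
  // only the AlwaysTrue case is shown here
  public static Expression convert(Filter filter) {
    if (filter instanceof AlwaysTrue) {
      return Expressions.alwaysTrue();
    }
    return null;
  }
}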

Contributor:

That's strange. This is exactly the kind of thing that makes me glad we decided to test and build against each Spark branch independently.

Contributor (Author):

Yeah, I think the TRUNCATE logic was added in 3.1, so it has probably been broken since then.

@@ -212,33 +220,44 @@ private String getRowLevelOperationMode(String operation) {

@Override
public boolean canDeleteWhere(Filter[] filters) {
Contributor (Author):

This implementation is just an idea of what can be done. It makes the check way more expensive than before but also allows us to cover more scenarios (e.g. deletes with transforms, deletes using metrics, deletes with multiple specs).

We may consider a flag or something to disable it, but I am not sure at this point. Maybe there are better ideas. Let me know.
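To make the discussion concrete, here is a hedged sketch of the approach, reconstructed from the fragments quoted in this review (helper names and scan options are assumptions, not necessarily the PR's exact code): plan a scan with the delete expression and confirm that every matching file either falls entirely inside the filter by strict partition projection or is proven fully matching by its column metrics.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.expressions.Evaluator;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Projections;
import org.apache.iceberg.expressions.StrictMetricsEvaluator;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
import org.apache.iceberg.relocated.com.google.common.collect.Maps;

public class MetadataDeleteSketch {
  // true iff every file matching deleteExpr can be dropped as a whole
  static boolean canDeleteUsingMetadata(Table table, Expression deleteExpr, boolean caseSensitive) {
    TableScan scan = table.newScan()
        .filter(deleteExpr)
        .caseSensitive(caseSensitive)
        .includeColumnStats()   // metrics are needed by StrictMetricsEvaluator
        .ignoreResiduals();

    Map<Integer, Evaluator> evaluators = Maps.newHashMap();
    StrictMetricsEvaluator metricsEvaluator = new StrictMetricsEvaluator(table.schema(), deleteExpr);

    try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
      return Iterables.all(tasks, task -> {
        DataFile file = task.file();
        PartitionSpec spec = task.spec();
        // strict projection: if the partition tuple matches, all rows in the file match
        Evaluator evaluator = evaluators.computeIfAbsent(
            spec.specId(),
            id -> new Evaluator(spec.partitionType(), Projections.strict(spec).project(deleteExpr)));
        return evaluator.eval(file.partition()) || metricsEvaluator.eval(file);
      });
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}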

Member:

I am wondering if it makes sense to do this sort of metadata delete inside the non-metadata delete path, and here just check whether a metadata delete is even possible. That is, instead of checking whether a metadata delete can be done, make sure it cannot be done.

See if the delete conditions could not possibly apply to all the specs currently in play, rather than checking whether they apply to all live files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid it will be too late given how we plan to rewrite row-level commands and how Spark handles this check. However, we may want to make this check less expensive, even if that means not covering some use cases.
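For readers unfamiliar with the Spark side of this: in the DSv2 API, Spark asks canDeleteWhere during planning and only calls deleteWhere if it returns true; otherwise it falls back to a row-level plan. A minimal sketch of that contract, assuming Spark 3.2's SupportsDelete mixin (which is why the decision cannot be deferred into the rewrite path):

import org.apache.spark.sql.connector.catalog.SupportsDelete;
import org.apache.spark.sql.sources.Filter;

// abstract: the remaining Table methods (name, schema, capabilities) are omitted
abstract class MetadataDeleteTableSketch implements SupportsDelete {
  @Override
  public boolean canDeleteWhere(Filter[] filters) {
    // answer Spark's planning-time question: can deleteWhere(filters) be
    // satisfied by dropping whole files, without rewriting any rows?
    return false; // placeholder; the PR plans a metadata scan to decide this
  }

  @Override
  public void deleteWhere(Filter[] filters) {
    // metadata-only delete: commit a delete of the matching files
  }
}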

@aokolnychyi (Author):

@RussellSpitzer @szehon-ho @flyrain @rdblue @kbendick @karuppayya, any thoughts? I think our current metadata-only delete implementation is very limited.

return !identitySourceIds.contains(field.fieldId());
});
// a metadata delete is possible iff matching files can be deleted entirely
private boolean canDeleteUsingMetadata(Expression deleteExpr) {
Contributor:

Looks like this pretty much just runs the delete logic to see if it will succeed.

I was thinking about something slightly different, which is to check whether the strict projection and the inclusive projection of the delete filter are equivalent. If so, then we know that the delete aligns with partitioning and will succeed. I'm okay with this if we don't think it's going to be too expensive during planning though.
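A hedged sketch of what that alternative might look like for a single spec (the equivalence test is the main assumption here; a real implementation might need a different way to compare the two projected expressions):

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Projections;

public class ProjectionAlignmentSketch {
  // if the strict and inclusive projections agree, any file the filter touches
  // is matched in full, so it can be dropped without reading rows
  static boolean alignsWithPartitioning(PartitionSpec spec, Expression deleteExpr, boolean caseSensitive) {
    Expression strict = Projections.strict(spec, caseSensitive).project(deleteExpr);
    Expression inclusive = Projections.inclusive(spec, caseSensitive).project(deleteExpr);
    // isEquivalentTo is assumed; comparing bound expression trees is one option
    return strict.isEquivalentTo(inclusive);
  }
}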

@aokolnychyi (Author) on Oct 25, 2021:

I have the same concern about the performance. I think it will be reasonable in most cases just because we won't write new manifests and will most likely have a selective filter. However, I am open to any other alternatives.

I initially wanted to check just the conditions but I was not sure how to handle multiple specs. How would the inclusive and strict approach work here? Will we require that the projections are equivalent for each spec?

Contributor:

Yeah, we would need the expressions to be equal for each spec that has a matching manifest. So we could filter manifests using the manifest list, then get the specs and do the projection. That way you could take advantage of some partition filtering to eliminate specs.

I actually like what you have here. It shouldn't be a big problem. Let's see how it goes with this and we can always introduce a lighter-weight version later. Luckily, this should fail fast.
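A hedged sketch of the manifest-filtering step described above (the Snapshot and ManifestEvaluator usage is an assumption about the API shape, not code from this PR): collect the spec ids that still have at least one manifest that may contain matching files, so only those specs need the projection check.

import java.util.Set;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.ManifestEvaluator;
import org.apache.iceberg.relocated.com.google.common.collect.Sets;

public class SpecFilterSketch {
  static Set<Integer> matchingSpecIds(Table table, Expression deleteExpr, boolean caseSensitive) {
    Set<Integer> specIds = Sets.newHashSet();
    // the manifest list gives per-manifest partition summaries, so specs whose
    // manifests cannot match the filter are eliminated without reading them
    for (ManifestFile manifest : table.currentSnapshot().dataManifests(table.io())) {
      PartitionSpec spec = table.specs().get(manifest.partitionSpecId());
      ManifestEvaluator evaluator = ManifestEvaluator.forRowFilter(deleteExpr, spec, caseSensitive);
      if (evaluator.eval(manifest)) {
        specIds.add(spec.specId());
      }
    }
    return specIds;
  }
}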

Contributor (Author):

That should work. Since this implementation is more straightforward and covers slightly more use cases, let's try it out. We can switch to the alternative if the performance is bad.

Evaluator evaluator = evaluators.computeIfAbsent(
spec.specId(),
specId -> new Evaluator(spec.partitionType(), Projections.strict(spec).project(deleteExpr)));
return evaluator.eval(file.partition()) || metricsEvaluator.eval(file);
Contributor:

This looks correct to me.

});
// a metadata delete is possible iff matching files can be deleted entirely
private boolean canDeleteUsingMetadata(Expression deleteExpr) {
TableScan scan = table().newScan()
Member:

Do we need to pass down the caseSensitivity flag to this scan?

Contributor:

Good catch. We probably should.

Contributor (Author):

Missed that. Added.

@kbendick (Contributor) left a comment:

One or two questions, but outside of adding the caseSensitivity flag, this looks good to me.

}

return true;
return deleteExpr == Expressions.alwaysTrue() || canDeleteUsingMetadata(deleteExpr);
Contributor:

Should this use equals instead of ==?

Contributor (Author):

We use reference equality for true/false literals in a few places. It should be safe as these literals are singletons.
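A small illustration of the singleton behavior described above (assuming Iceberg's Expressions factory, which hands back the same True instance on every call):

import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;

public class SingletonEqualityCheck {
  public static void main(String[] args) {
    Expression a = Expressions.alwaysTrue();
    Expression b = Expressions.alwaysTrue();
    // both calls return the True singleton, so reference equality holds
    System.out.println(a == b); // prints: true
  }
}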

@aokolnychyi merged commit d55695e into apache:master on Oct 26, 2021
@aokolnychyi (Author):

Thanks, @RussellSpitzer @rdblue @szehon-ho @kbendick!

hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 9, 2022