Spark: Enhance metadata deletes in 3.2 (#3369)
Conversation
```java
private static final Map<Class<? extends Filter>, Operation> FILTERS = ImmutableMap
    .<Class<? extends Filter>, Operation>builder()
    .put(AlwaysTrue.class, Operation.TRUE)
```
I had to add this because the TRUNCATE tests I added were failing.
That's strange. This is exactly the kind of thing that makes me glad we decided to test and build against each Spark branch independently.
Yeah, I think the TRUNCATE logic was added in 3.1, so it has probably been broken since then.
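For readers following along, here is a rough sketch of why the map entry above matters: Spark's TRUNCATE path pushes down an AlwaysTrue filter, and without a mapping for it the filter conversion fails. The surrounding class and method here are illustrative assumptions; only the FILTERS map entry comes from the diff above.

```java
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;
import org.apache.spark.sql.sources.AlwaysTrue;
import org.apache.spark.sql.sources.Filter;

class TruncateFilterSketch {
  // Illustrative only: convert Spark's AlwaysTrue filter to Iceberg's TRUE expression
  static Expression convert(Filter filter) {
    if (filter instanceof AlwaysTrue) {
      // without an AlwaysTrue mapping, TRUNCATE pushes a filter the converter
      // does not recognize and the operation fails
      return Expressions.alwaysTrue();
    }
    return null; // other filter types handled elsewhere
  }
}
```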
@@ -212,33 +220,44 @@ private String getRowLevelOperationMode(String operation) {
```java
@Override
public boolean canDeleteWhere(Filter[] filters) {
```
This implementation is just an idea of what can be done. It makes the check much more expensive than before, but it also lets us cover more scenarios (e.g. deletes with transforms, deletes using metrics, deletes with multiple specs).
We may consider a flag or something to disable it, but I am not sure at this point. Maybe there are better ideas. Let me know.
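To make the idea concrete, here is a minimal sketch of the check being described, assembled from the fragments quoted later in this review. The scan options, the `caseSensitive` field, and the error handling are assumptions rather than the exact code in the PR, and it assumes Iceberg's `Expression`, `Projections`, `Evaluator`, `StrictMetricsEvaluator` and Guava's `Maps`/`Iterables` are imported.

```java
// Sketch: a metadata-only delete is reported as possible only if every data file
// matching the delete expression can be dropped entirely, judged either by a
// strict partition projection or by column metrics.
private boolean canDeleteUsingMetadata(Expression deleteExpr) {
  TableScan scan = table().newScan()
      .filter(deleteExpr)
      .caseSensitive(caseSensitive)      // see the caseSensitivity discussion below
      .includeColumnStats()
      .ignoreResiduals();

  try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
    Map<Integer, Evaluator> evaluators = Maps.newHashMap();
    StrictMetricsEvaluator metricsEvaluator =
        new StrictMetricsEvaluator(table().schema(), deleteExpr);

    return Iterables.all(tasks, task -> {
      DataFile file = task.file();
      PartitionSpec spec = task.spec();
      Evaluator evaluator = evaluators.computeIfAbsent(
          spec.specId(),
          specId -> new Evaluator(spec.partitionType(), Projections.strict(spec).project(deleteExpr)));
      // the file is fully covered if its partition strictly matches the filter
      // or its column metrics prove every row matches
      return evaluator.eval(file.partition()) || metricsEvaluator.eval(file);
    });
  } catch (IOException ioe) {
    // if planning fails, fall back to the rewrite-based delete path
    return false;
  }
}
```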
I am wondering if it makes sense to do this sort of metadata delete inside the non-metadata delete path, and here just check whether a metadata delete is even possible. That is, instead of checking whether a metadata delete can be done, make sure it cannot be done:
see whether the delete conditions could not possibly apply to all the specs currently in play, rather than checking whether they can apply to all live files.
I am afraid it will be too late given how we plan to rewrite row-level commands and how Spark handles this check. However, we could make this check less expensive and not cover some use cases.
@RussellSpitzer @szehon-ho @flyrain @rdblue @kbendick @karuppayya, any thoughts? I think our current metadata-only delete implementation is very limited.
```java
  return !identitySourceIds.contains(field.fieldId());
});

// a metadata delete is possible iff matching files can be deleted entirely
private boolean canDeleteUsingMetadata(Expression deleteExpr) {
```
Looks like this pretty much just runs the delete logic to see if it will succeed.
I was thinking about something slightly different, which is to check whether the strict projection and the inclusive projection of the delete filter are equivalent. If so, then we know that the delete aligns with partitioning and will succeed. I'm okay with this if we don't think it's going to be too expensive during planning though.
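A rough sketch of that alternative for a single spec, hedged: it assumes an expression-equivalence helper such as `isEquivalentTo` is available on Iceberg's `Expression`; otherwise a manual structural comparison would be needed.

```java
// Sketch of the projection-equivalence check: if the strict and inclusive
// projections of the delete filter agree, the filter aligns with partitioning
// and matching files can be dropped whole without inspecting them.
private boolean alignsWithPartitioning(PartitionSpec spec, Expression deleteExpr) {
  Expression strict = Projections.strict(spec).project(deleteExpr);
  Expression inclusive = Projections.inclusive(spec).project(deleteExpr);
  return strict.isEquivalentTo(inclusive);   // assumption: equivalence helper exists
}
```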
I have the same concern about the performance. I think it will be reasonable in most cases just because we won't write new manifests and will most likely have a selective filter. However, I am open to any other alternatives.
I initially wanted to check just the conditions but I was not sure how to handle multiple specs. How would the inclusive and strict approach work here? Will we require that the projections are equivalent for each spec?
Yeah, we would need the expressions to be equal for each spec that has a matching manifest. So we could filter manifests using the manifest list, then get the specs and do the projection. That way you could take advantage of some partition filtering to eliminate specs.
I actually like what you have here. It shouldn't be a big problem. Let's see how it goes with this and we can always introduce a lighter-weight version later. Luckily, this should fail fast.
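For the multi-spec case, something along these lines could work, building on the single-spec helper sketched earlier. The snapshot and manifest accessors here (`dataManifests()`, `ManifestEvaluator.forRowFilter`) reflect one version of the Iceberg API and may differ across releases; treat this as a sketch, not the PR's code.

```java
// Sketch: only specs that have manifests matching the delete expression need to
// align with it; filtering manifests first lets partition filtering eliminate specs.
private boolean alignsWithAllSpecs(Expression deleteExpr, boolean caseSensitive) {
  Snapshot snapshot = table().currentSnapshot();
  if (snapshot == null) {
    return true;   // empty table: nothing to delete
  }

  for (ManifestFile manifest : snapshot.dataManifests()) {
    PartitionSpec spec = table().specs().get(manifest.partitionSpecId());
    ManifestEvaluator manifestEval = ManifestEvaluator.forRowFilter(deleteExpr, spec, caseSensitive);
    // a spec with matching data must align with the filter for a metadata delete
    if (manifestEval.eval(manifest) && !alignsWithPartitioning(spec, deleteExpr)) {
      return false;
    }
  }
  return true;
}
```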
That should work. Since this implementation is more straightforward and covers slightly more use cases, let's try it out. We can switch to the alternative if the performance is bad.
```java
Evaluator evaluator = evaluators.computeIfAbsent(
    spec.specId(),
    specId -> new Evaluator(spec.partitionType(), Projections.strict(spec).project(deleteExpr)));
return evaluator.eval(file.partition()) || metricsEvaluator.eval(file);
});
```
```java
// a metadata delete is possible iff matching files can be deleted entirely
private boolean canDeleteUsingMetadata(Expression deleteExpr) {
  TableScan scan = table().newScan()
```
Do we need to pass down the caseSensitivity flag to this scan?
Good catch. We probably should.
Missed that. Added.
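For reference, the fix being discussed is roughly just threading the flag into the scan; where the `caseSensitive` value comes from (e.g. the Spark session configuration) is an assumption here.

```java
TableScan scan = table().newScan()
    .caseSensitive(caseSensitive)   // propagate the session's case sensitivity
    .filter(deleteExpr);
```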
kbendick left a comment:
One or two questions, but outside of adding the caseSensitivity flag, this looks good to me.
```java
}

return true;
return deleteExpr == Expressions.alwaysTrue() || canDeleteUsingMetadata(deleteExpr);
```
Should this use equals instead of ==?
We use reference equality for true/false literals in a few places. It should be safe as these literals are singletons.
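A tiny illustration of why the reference check is safe, assuming Iceberg's TRUE/FALSE literal expressions remain singletons:

```java
// Expressions.alwaysTrue() returns the same singleton instance on every call,
// so comparing with == behaves the same as equals() for this literal.
Expression a = Expressions.alwaysTrue();
Expression b = Expressions.alwaysTrue();
boolean sameInstance = (a == b);   // true: both refer to the TRUE singleton
```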
Thanks, @RussellSpitzer @rdblue @szehon-ho @kbendick!
This PR is an attempt to enable more metadata-only deletes in Spark.