-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-33608][SQL] Handle DELETE/UPDATE/MERGE in PullupCorrelatedPredicates #30555
Conversation
@viirya @sunchao @dbtsai @dongjoon-hyun @cloud-fan, could you take a look whenever you get a minute? |
Sure, @aokolnychyi ! |
Test build #132006 has started for PR 30555 at commit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. (Pending CIs).
Thank you, @aokolnychyi .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. BTW, these commands are not supported yet, right?
cc @maryannxue as well. |
@@ -328,6 +328,8 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper | |||
// Only a few unary nodes (Project/Filter/Aggregate) can contain subqueries. | |||
case q: UnaryNode => | |||
rewriteSubQueries(q, q.children) | |||
case s: SupportsSubquery => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment above this line? To be honest, it is hard to tell that this trait means UPDATE/MERGE/DELETE.
Also, I think this change is just part of the whole changes for supporting the subquery in UPDATE/MERGE/DELETE. We need the other changes in Analyzer and Optimizer rules. For example, CheckAnalysis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment above this line? To be honest, it is hard to tell that this trait means UPDATE/MERGE/DELETE.
Sure, what kind of comment would make sense? SupportsSubquery
seems generic to me and may cover different rules in the future. Here, I match the behavior in the analyzer.
Also, I think this change is just part of the whole changes for supporting the subquery in UPDATE/MERGE/DELETE. We need the other changes in Analyzer and Optimizer rules. For example, CheckAnalysis.
You are right it is the first step and potentially more changes will be needed. At the same time, I think we've updated the analyzer already. Here is what we have in CheckAnalysis
:
// Only certain operators are allowed to host subquery expression containing
// outer references.
plan match {
case _: Filter | _: Aggregate | _: Project | _: SupportsSubquery => // Ok
case other => failAnalysis(
"Correlated scalar sub-queries can only be used in a " +
s"Filter/Aggregate/Project and a few commands: $plan")
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we decide to implement SupportsSubquery
in other nodes and remove UnaryNode
from here, I think the comment above may be sufficient (with minor tweaks once we remove UnaryNode
).
cc @dilipbiswal |
@@ -328,6 +328,8 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper | |||
// Only a few unary nodes (Project/Filter/Aggregate) can contain subqueries. | |||
case q: UnaryNode => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we make Filter
, Aggregate
and Project
extend SupportsSubquery
and only match SupportsSubquery
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. We can also simplify the check inside CheckAnalysis
in a follow-up PR.
Let me submit a separate PR for this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we can get this one in first. How does it sound, @cloud-fan?
thanks, merging to master! |
Thanks everyone for the review! I've created SPARK-33624 to extend |
Actually, I think there are places where we distinguish |
I think we need a util method to match |
After a second look, I'm a bit worried about this half-baked solution. The correlated subquery handling is split into 3 steps in general:
I have a hard time imagining how we can rewrite UPDATE/DELETE/MERGE commands with correlated subquery to joins, and start to doubt if this is the right direction to go. Before this PR, @aokolnychyi can you share your plan of supporting UPDATE/DELETE/MERGE commands with correlated subquery? It's better to leave this half-baked state ASAP, by either reverting this patch or finishing the feature completely. |
I fully understand your concern, @cloud-fan . @aokolnychyi is working on this area actively. If you have an idea for the better and cleaner solution, could you share it as a working example? We can adjust to it accordingly. As of now, I'm reluctant to discuss revert these without feasible alternatives.
|
UPDATE/DELETE/MERGE are just logical plans in Spark, we need third-party libraries or vendors to provide proper implementations. So it's not about fully support this feature in Spark (as Spark can't), but about what Spark can do to make it easier for others to support this feature. Before this PR, the UPDATE/DELETE/MERGE implementation (physical plans) is fully responsible to handle correlated subqueries, as correlated subqueries inside UPDATE/DELETE/MERGE are not decorrelated. As an example, in physical plan's After this PR, correlated subqueries inside UPDATE/DELETE/MERGE are half-decorrelated. I don't know how UPDATE/DELETE/MERGE implementation can handle it. At least our internal UPDATE/DELETE/MERGE implementation is broken after this commit. If you guys have a good idea about how to handle half-decorrelated correlated subqueries, let's document it so that others can follow. |
If you can, could you elaborate about the detail of conflicts or the reason of brokerage, please?
|
Sorry, I missed this discussion. Let me take a look. |
I think Spark is actually capable of rewriting DELETE/UPDATE/MERGE operations and should do that in the future instead of delegating that to data source implementations. In my view, data sources just need to know which records to modify and Spark should be responsible for executing a distributed query to determine that. Spark may rewrite the plan differently based on whether a data source supports row-level or file-level changes but that should be it. Let me show how a DELETE statement with subqueries can be represented as a filter/project/join.
This DELETE can be rewritten by Spark as below for data sources that support file-level updates only. It basically queries a set of files, filters out records to be deleted, writes new files and replaced the old files with new ones.
Where the filter, in turn, could be converted into a join by
For the last step to happen, we need to handle DELETE/UPDATE/MERGE in I thought there was enough consensus that Spark should rewrite DELETE/UPDATE/MERGE but it was not clear how. That's why I went ahead and added That being said, I do accept the criticism that it is half done and I would not object if we want to revert the change from 3.1. I'd be still interested to know more about how it breaks other rules, though. Could you provide a bit more details, @cloud-fan? A design doc is being prepared but it is not ready yet. I would like to finish with the required distribution and ordering first. |
I can't refer to code in our private repo, but the problem is after We've reverted this commit internally and are unblocked, but I'm not sure if there are others implementing DELETE/UPDATE/MERGE like us and get broken. If no other complaints I'm OK to not revert it. I agree with the plan rewriting approach, and I'm looking forward to your design doc! |
Thank you, @aokolnychyi and @cloud-fan . |
Thanks for the context, @cloud-fan! I think we will want this rule eventually but I would not object reverting this from 3.1 if the community thinks that's safer. |
What changes were proposed in this pull request?
This PR adds logic to handle DELETE/UPDATE/MERGE plans in
PullupCorrelatedPredicates
.Why are the changes needed?
Right now,
PullupCorrelatedPredicates
applies only to filters and unary nodes. As a result, correlated predicates in DELETE/UPDATE/MERGE are not rewritten.Does this PR introduce any user-facing change?
No.
How was this patch tested?
The PR adds 3 new test cases.