[SPARK-33608][SQL] Handle DELETE/UPDATE/MERGE in PullupCorrelatedPredicates #30555

Closed

Conversation

@aokolnychyi (Contributor)

What changes were proposed in this pull request?

This PR adds logic to handle DELETE/UPDATE/MERGE plans in PullupCorrelatedPredicates.

Why are the changes needed?

Right now, PullupCorrelatedPredicates applies only to filters and unary nodes. As a result, correlated predicates in DELETE/UPDATE/MERGE are not rewritten.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The PR adds 3 new test cases.
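For illustration, a minimal sketch of the kind of scenario those tests cover (the relation names and the direct invocation of the rule below are assumptions made for this sketch, not the PR's actual test code):

import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.expressions.{InSubquery, ListQuery}
import org.apache.spark.sql.catalyst.optimizer.PullupCorrelatedPredicates
import org.apache.spark.sql.catalyst.plans.logical.{DeleteFromTable, LocalRelation}

// DELETE FROM t WHERE id IN (SELECT value FROM deleted_ids WHERE value = dep)
val t = LocalRelation('id.int, 'dep.int)
val deletedIds = LocalRelation('value.int)
val cond = InSubquery(Seq('id), ListQuery(deletedIds.where('value === 'dep).select('value)))
val delete = DeleteFromTable(t, Some(cond)).analyze

// Before this patch the rule left DELETE/UPDATE/MERGE plans untouched; after it,
// the correlated predicate is pulled out of the subquery so later rewrites can run.
val rewritten = PullupCorrelatedPredicates(delete)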

@github-actions bot added the SQL label on Nov 30, 2020
@aokolnychyi (Contributor, Author)

@viirya @sunchao @dbtsai @dongjoon-hyun @cloud-fan, could you take a look whenever you get a minute?

@dongjoon-hyun (Member)

Sure, @aokolnychyi !

@SparkQA commented on Nov 30, 2020

Test build #132006 has started for PR 30555 at commit b92852d.

@dongjoon-hyun (Member) left a comment

+1, LGTM. (Pending CIs).

Thank you, @aokolnychyi .

@viirya (Member) left a comment

Looks good. BTW, these commands are not supported yet, right?

@HyukjinKwon (Member)

cc @maryannxue as well.

@@ -328,6 +328,8 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper
     // Only a few unary nodes (Project/Filter/Aggregate) can contain subqueries.
     case q: UnaryNode =>
       rewriteSubQueries(q, q.children)
+    case s: SupportsSubquery =>
Member

Add a comment above this line? To be honest, it is hard to tell that this trait means UPDATE/MERGE/DELETE.

Also, I think this change is just one part of the whole set of changes needed to support subqueries in UPDATE/MERGE/DELETE. We need the other changes in the Analyzer and Optimizer rules as well, for example CheckAnalysis.

@aokolnychyi (Contributor, Author)

> Add a comment above this line? To be honest, it is hard to tell that this trait means UPDATE/MERGE/DELETE.

Sure, what kind of comment would make sense? SupportsSubquery seems generic to me and may cover different rules in the future. Here, I match the behavior in the analyzer.

> Also, I think this change is just one part of the whole set of changes needed to support subqueries in UPDATE/MERGE/DELETE. We need the other changes in the Analyzer and Optimizer rules as well, for example CheckAnalysis.

You are right that this is the first step and more changes will potentially be needed. At the same time, I think we've updated the analyzer already. Here is what we have in CheckAnalysis:

// Only certain operators are allowed to host subquery expression containing
// outer references.
plan match {
  case _: Filter | _: Aggregate | _: Project | _: SupportsSubquery => // Ok
  case other => failAnalysis(
    "Correlated scalar sub-queries can only be used in a " +
      s"Filter/Aggregate/Project and a few commands: $plan")
}

@aokolnychyi (Contributor, Author)

If we decide to implement SupportsSubquery in other nodes and remove UnaryNode from here, I think the comment above may be sufficient (with minor tweaks once we remove UnaryNode).

@gatorsmile (Member)

cc @dilipbiswal

@@ -328,6 +328,8 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper
     // Only a few unary nodes (Project/Filter/Aggregate) can contain subqueries.
     case q: UnaryNode =>
@cloud-fan (Contributor)

shall we make Filter, Aggregate and Project extend SupportsSubquery and only match SupportsSubquery here?
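A rough sketch of the simplification being suggested (assumptions: Project, Filter and Aggregate are changed to extend SupportsSubquery; any special cases in the current rule are left out for brevity; this is not code from this PR):

// Hypothetical simplification of PullupCorrelatedPredicates.apply: with the
// shared trait, commands and these unary operators are covered by one case,
// and the separate UnaryNode case can be dropped.
plan transformUp {
  case s: SupportsSubquery =>
    rewriteSubQueries(s, s.children)
}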

@aokolnychyi (Contributor, Author)

Sounds good to me. We can also simplify the check inside CheckAnalysis in a follow-up PR.

Let me submit a separate PR for this one.

@aokolnychyi (Contributor, Author)

Actually, we can get this one in first. How does it sound, @cloud-fan?

@cloud-fan (Contributor)

thanks, merging to master!

@cloud-fan closed this in 478fb7f on Dec 1, 2020
@aokolnychyi (Contributor, Author)

Thanks everyone for the review! I've created SPARK-33624 to extend SupportsSubquery in Filter, Aggregate and Project.

@aokolnychyi (Contributor, Author) commented on Dec 1, 2020

Actually, I think there are places where we distinguish Aggregate and SupportsSubquery so we may not unify those. For example, CheckAnalysis.

@cloud-fan (Contributor)

I think we need a util method to match Project/Filter/Aggregate/SupportsSubquery, so that we don't make the same mistake again due to code inconsistency (CheckAnalysis handles SupportsSubquery but this rule did not).
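For illustration, such a helper might look like the following (a hypothetical sketch; the name and where it would live are assumptions, not existing Spark code):

import org.apache.spark.sql.catalyst.plans.logical._

// One shared predicate that CheckAnalysis and the subquery rewrite rules could both
// call, so the set of operators allowed to host correlated subqueries is defined once.
def canHostCorrelatedSubquery(plan: LogicalPlan): Boolean = plan match {
  case _: Project | _: Filter | _: Aggregate | _: SupportsSubquery => true
  case _ => false
}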

@cloud-fan (Contributor) commented on Jan 20, 2021

After a second look, I'm a bit worried about this half-baked solution. Correlated subquery handling is generally split into 3 steps:

  1. CheckAnalysis makes sure correlated subqueries can only exist in SupportsSubquery, Filter, and a few other operators.
  2. PullupCorrelatedPredicates pulls up the outer references in the correlated subquery to the root node. It handles SupportsSubquery and UnaryNode.
  3. RewriteCorrelatedScalarSubquery and RewritePredicateSubquery rewrite correlated subqueries into joins. They only handle Filter, Aggregate and Project.

I have a hard time imagining how we can rewrite UPDATE/DELETE/MERGE commands with correlated subqueries into joins, and I'm starting to doubt whether this is the right direction to go. Before this PR, SupportsSubquery was mostly a marker trait that kept CheckAnalysis out of the way (i.e., kept it from failing UPDATE/DELETE/MERGE commands with correlated subqueries). We assumed users would add catalyst rules and/or provide proper UPDATE/DELETE/MERGE physical plans to support correlated subqueries. Now PullupCorrelatedPredicates can get in the way as well.

@aokolnychyi can you share your plan for supporting UPDATE/DELETE/MERGE commands with correlated subqueries? It's better to get out of this half-baked state ASAP, by either reverting this patch or finishing the feature completely.

@dongjoon-hyun (Member) commented on Jan 21, 2021

I fully understand your concern, @cloud-fan. @aokolnychyi is working on this area actively.

If you have an idea for a better and cleaner solution, could you share it as a working example? We can adjust to it accordingly. As of now, I'm reluctant to discuss reverting this without a feasible alternative.

> start to doubt if this is the right direction to go.

@dongjoon-hyun (Member) commented on Jan 21, 2021

We can switch to your new suggestion (PR) if it's applicable ASAP.
cc @rdblue , @dbtsai , @viirya , @holdenk , @sunchao

@cloud-fan (Contributor)

UPDATE/DELETE/MERGE are just logical plans in Spark; we need third-party libraries or vendors to provide the actual implementations. So it's not about fully supporting this feature in Spark (as Spark can't), but about what Spark can do to make it easier for others to support it.

Before this PR, the UPDATE/DELETE/MERGE implementation (physical plans) was fully responsible for handling correlated subqueries, as correlated subqueries inside UPDATE/DELETE/MERGE were not decorrelated. As an example, in the physical plan's doExecute method, people could put the UPDATE/DELETE/MERGE condition in a filter, build a DataFrame to evaluate the condition, and collect the result.
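For concreteness, a very rough sketch of that pre-PR approach (everything here is illustrative: the helper name, the use of input_file_name to track which file a row came from, and the assumption that the data source rewrites whole files are not taken from any actual implementation):

import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.input_file_name

// Evaluate the un-rewritten DELETE condition (subqueries included) with a normal
// DataFrame query and collect the files that contain at least one matching row.
def filesTouchedByDelete(spark: SparkSession, table: String, condition: Column): Array[String] = {
  spark.table(table)
    .filter(condition)
    .select(input_file_name())
    .distinct()
    .collect()
    .map(_.getString(0))
}

This only works if the condition can be analyzed as-is, which is exactly the assumption that, as described next, stops holding once the subquery has been partially rewritten.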

After this PR, correlated subqueries inside UPDATE/DELETE/MERGE are half-decorrelated. I don't know how an UPDATE/DELETE/MERGE implementation can handle that. At least our internal UPDATE/DELETE/MERGE implementation is broken after this commit. If you have a good idea about how to handle half-decorrelated subqueries, let's document it so that others can follow.

@dongjoon-hyun (Member)

If you can, could you elaborate on the details of the conflict or the reason for the breakage, please?

> At least our internal UPDATE/DELETE/MERGE implementation is broken after this commit.

@aokolnychyi (Contributor, Author)

Sorry, I missed this discussion. Let me take a look.

@aokolnychyi (Contributor, Author)

I think Spark is actually capable of rewriting DELETE/UPDATE/MERGE operations and should do that in the future instead of delegating it to data source implementations. In my view, data sources just need to know which records to modify, and Spark should be responsible for executing a distributed query to determine that.

Spark may rewrite the plan differently based on whether a data source supports row-level or file-level changes, but that should be it.

Let me show how a DELETE statement with a subquery can be represented as a filter/project/join.

DELETE FROM t WHERE id IN (SELECT * FROM deleted_id)

This DELETE can be rewritten by Spark as below for data sources that support only file-level updates. It basically queries a set of files, filters out the records to be deleted, writes new files, and replaces the old files with the new ones.

ReplaceData (a node that replaces data)
+- Project [id, dep]
   +- Filter NOT (id IN (list []))   (a filter with our subquery)
      +- RelationV2[id#48, dep#49]

The filter, in turn, could be converted into a join by RewritePredicateSubquery:

ReplaceData (a node that replaces data) 
+- Project [id, dep]
   +- Filter NOT (exists <=> true)
      +- Join ExistenceJoin(exists), (id = value)
         :- RelationV2[id, dep]
         +- LocalRelation [value]

For the last step to happen, we need to handle DELETE/UPDATE/MERGE in PullupCorrelatedPredicates.
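For reference, the ReplaceData node in the trees above could be sketched roughly as follows (ReplaceData does not exist in Spark at this point; the shape below is an assumption based on the description, not a real API):

import org.apache.spark.sql.catalyst.analysis.NamedRelation
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{Command, LogicalPlan}

// Hypothetical: `table` is the target to write back to, and `query` produces the rows
// that should remain after the DELETE (the Project/Filter/Join subtree shown above).
case class ReplaceData(table: NamedRelation, query: LogicalPlan) extends Command {
  override def children: Seq[LogicalPlan] = query :: Nil
  override def output: Seq[Attribute] = Nil
}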

I thought there was enough consensus that Spark should rewrite DELETE/UPDATE/MERGE, but it was not clear how. That's why I went ahead and added SupportsSubquery to PullupCorrelatedPredicates as a first step.

That being said, I do accept the criticism that it is half done, and I would not object if we want to revert the change from 3.1. I'd still be interested to know more about how it breaks other rules, though. Could you provide a bit more detail, @cloud-fan?

A design doc is being prepared but it is not ready yet. I would like to finish with the required distribution and ordering first.

@cloud-fan (Contributor)

I can't refer to code in our private repo, but the problem is that after PullupCorrelatedPredicates, the correlated scalar subquery has more output columns (because the outer references are pulled up), and then we can't use the DELETE/UPDATE/MERGE conditions to build a Dataset, as the Analyzer requires that a scalar subquery have only one output column.

We've reverted this commit internally and are unblocked, but I'm not sure whether there are others implementing DELETE/UPDATE/MERGE like we do who got broken by it. If there are no other complaints, I'm OK with not reverting it.

I agree with the plan rewriting approach, and I'm looking forward to your design doc!

@dongjoon-hyun (Member)

Thank you, @aokolnychyi and @cloud-fan .

@aokolnychyi (Contributor, Author)

Thanks for the context, @cloud-fan! I think we will want this rule eventually, but I would not object to reverting it from 3.1 if the community thinks that's safer.
