[SPARK-40862][SQL] Support non-aggregated subqueries in RewriteCorrelatedScalarSubquery #38336

allisonwang-db · 2022-10-21T22:26:14Z

What changes were proposed in this pull request?

This PR updates splitSubquery in RewriteCorrelatedScalarSubquery to support non-aggregated one-row subqueries.

In CheckAnalysis, we allow three types of correlated scalar subquery patterns:

1. SubqueryAlias/Project + Aggregate
2. SubqueryAlias/Project + Filter + Aggregate
3. SubqueryAlias/Project + LogicalPlan (maxRows <= 1)

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

Lines 851 to 856 in 748fa27

    
    cleanQueryInScalarSubquery(query) match {
      case a: Aggregate => checkAggregateInScalarSubquery(outerAttrs, query, a)
      case Filter(_, a: Aggregate) => checkAggregateInScalarSubquery(outerAttrs, query, a)
      case p: LogicalPlan if p.maxRows.exists(_ <= 1) => // Ok
      case other =>
        expr.failAnalysis(

We should support the third case in splitSubquery to avoid Unexpected operator exceptions.
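To illustrate the three accepted shapes, here is a minimal sketch using toy plan classes. Everything in it (`Plan`, `Relation`, `Limit`, `classify`, and the toy `Filter` and `Aggregate`) is a hypothetical stand-in, not Spark's Catalyst API, and it skips the `cleanQueryInScalarSubquery` step that strips SubqueryAlias/Project nodes first.

```scala
object ScalarSubqueryShapes {
  // Toy stand-ins for Catalyst plan nodes; NOT the real Spark classes.
  sealed trait Plan { def maxRows: Option[Long] }

  case class Relation(name: String) extends Plan {
    val maxRows: Option[Long] = None // unknown row count
  }
  case class Limit(n: Long, child: Plan) extends Plan {
    val maxRows: Option[Long] = Some(child.maxRows.fold(n)(m => math.min(m, n)))
  }
  case class Filter(condition: String, child: Plan) extends Plan {
    val maxRows: Option[Long] = child.maxRows
  }
  case class Aggregate(grouping: Seq[String], child: Plan) extends Plan {
    // Toy rule: an aggregate without grouping keys yields exactly one row.
    val maxRows: Option[Long] = if (grouping.isEmpty) Some(1L) else child.maxRows
  }

  // Mirrors the order of the cases in the CheckAnalysis snippet above.
  def classify(plan: Plan): String = plan match {
    case _: Aggregate                  => "aggregate"
    case Filter(_, _: Aggregate)       => "filter over aggregate"
    case p if p.maxRows.exists(_ <= 1) => "non-aggregated, at most one row"
    case _                             => "rejected"
  }
}
```

In this toy model, `Limit(1, Relation("t"))` falls into the third shape, which is the one splitSubquery did not handle before this PR.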

Why are the changes needed?

To fix an issue with correlated subquery rewrite.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit tests.

allisonwang-db · 2022-10-24T23:23:23Z

cc @cloud-fan

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

jchen5

Looks good to me, one small comment

jchen5 · 2022-10-26T00:46:13Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

@@ -511,17 +511,18 @@ object RewriteCorrelatedScalarSubquery extends Rule[LogicalPlan] with AliasHelpe
   * (first part of returned value), the HAVING clause of the innermost query block
   * (optional second part) and the parts below the HAVING CLAUSE (third part).


Looks like this comment needs to be updated - in the new case it's returning None, rather than the inner query block below HAVING (and this is ok because we only needed the aggregate to fix the COUNT bug). Right?

Yea, we should at least explain when the third part can be None.
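The return shape under discussion can be sketched as follows. This is an illustration with toy plan classes, not the code in subquery.scala: `Plan`, `OneRow`, `Relation`, `split`, and the toy `Project`, `Filter`, and `Aggregate` are all hypothetical names, and the real method also handles operators such as SubqueryAlias.

```scala
object SplitSubquerySketch {
  // Toy stand-ins for Catalyst plan nodes; NOT the real Spark classes.
  sealed trait Plan { def maxRows: Option[Long] }

  case class OneRow() extends Plan { val maxRows: Option[Long] = Some(1L) }
  case class Relation(name: String) extends Plan { val maxRows: Option[Long] = None }
  case class Project(exprs: Seq[String], child: Plan) extends Plan {
    val maxRows: Option[Long] = child.maxRows
  }
  case class Filter(condition: String, child: Plan) extends Plan {
    val maxRows: Option[Long] = child.maxRows
  }
  case class Aggregate(aggExprs: Seq[String], child: Plan) extends Plan {
    val maxRows: Option[Long] = Some(1L) // toy model: global aggregates only
  }

  /** Returns (operators above the innermost block, optional HAVING filter, optional aggregate). */
  @annotation.tailrec
  def split(
      plan: Plan,
      top: Vector[Plan] = Vector.empty): (Seq[Plan], Option[Filter], Option[Aggregate]) =
    plan match {
      case having @ Filter(_, agg: Aggregate) => (top, Some(having), Some(agg))
      case agg: Aggregate                     => (top, None, Some(agg)) // no HAVING clause
      case p @ Project(_, child)              => split(child, top :+ p)
      // Non-aggregated one-row subquery: there is no aggregate to return.
      case p if p.maxRows.exists(_ <= 1)      => (top, None, None)
      case other =>
        throw new IllegalStateException(s"Unexpected operator: $other")
    }
}
```

Under these assumptions, the third element is `None` exactly when the innermost block is a non-aggregated plan that produces at most one row.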

cloud-fan · 2022-10-27T02:42:37Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

@@ -561,14 +566,18 @@ object RewriteCorrelatedScalarSubquery extends Rule[LogicalPlan] with AliasHelpe
        val origOutput = query.output.head

        val resultWithZeroTups = evalSubqueryOnZeroTups(query)
-        if (resultWithZeroTups.isEmpty) {
+        val (topPart, havingNode, aggNode) = splitSubquery(query)


How about we make the case 1 result a lazy val? (A multi-variable lazy val looks weird.)

    lazy val planWithoutCountBug = Project(... // Or just val as constructing logical plan is cheap
    if (resultWithZeroTups.isEmpty) {
      planWithoutCountBug
    } else {
      val (topPart, havingNode, aggNode) = splitSubquery(query)
      if (aggNode.isEmpty) planWithoutCountBug else ...
    }
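The branching suggested above can be exercised in isolation with placeholder values. The sketch below is an assumption-laden illustration: `Rewritten`, `PlainJoin`, `CountBugAwareJoin`, and `choose` are hypothetical names, and the aggregate is passed by name only to mimic calling splitSubquery in the else branch.

```scala
object CountBugBranchSketch {
  // Hypothetical placeholders for the plans the rule would build.
  sealed trait Rewritten
  case object PlainJoin extends Rewritten         // no COUNT-bug handling needed
  case object CountBugAwareJoin extends Rewritten // patches the value for unmatched outer rows

  // aggNode is by-name so it is evaluated only when the first check fails,
  // mirroring the suggestion to call splitSubquery inside the else branch.
  def choose(resultWithZeroTups: Option[Any], aggNode: => Option[String]): Rewritten = {
    lazy val planWithoutCountBug: Rewritten = PlainJoin
    if (resultWithZeroTups.isEmpty) {
      planWithoutCountBug
    } else if (aggNode.isEmpty) {
      // Non-aggregated one-row subquery: reuse the simple plan.
      planWithoutCountBug
    } else {
      CountBugAwareJoin
    }
  }
}
```

Sharing one `planWithoutCountBug` value keeps the two simple paths identical, which is the point of the suggestion.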

cloud-fan · 2022-10-28T04:25:14Z

thanks, merging to master!

Closes apache#38336 from allisonwang-db/spark-40862-split-subquery.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

github-actions bot added the SQL label Oct 21, 2022

allisonwang-db force-pushed the spark-40862-split-subquery branch from 815d3e1 to 0120a70 Compare October 24, 2022 20:16

cloud-fan reviewed Oct 25, 2022

cloud-fan approved these changes Oct 25, 2022

jchen5 approved these changes Oct 26, 2022

cloud-fan reviewed Oct 27, 2022

allisonwang-db force-pushed the spark-40862-split-subquery branch from f24ead1 to 8ea1954 Compare October 27, 2022 16:02

allisonwang-db added 2 commits October 27, 2022 19:40

fix (898efc0)

address comments (7f4cf74)

allisonwang-db force-pushed the spark-40862-split-subquery branch from 8ea1954 to 7f4cf74 Compare October 27, 2022 23:46

cloud-fan approved these changes Oct 28, 2022

cloud-fan closed this in 3feddec Oct 28, 2022
