[SPARK-40862][SQL] Support non-aggregated subqueries in RewriteCorrelatedScalarSubquery #38336
Conversation
Force-pushed 815d3e1 to 0120a70
cc @cloud-fan
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
Looks good to me, one small comment
```diff
@@ -511,17 +511,18 @@ object RewriteCorrelatedScalarSubquery extends Rule[LogicalPlan] with AliasHelpe
  * (first part of returned value), the HAVING clause of the innermost query block
  * (optional second part) and the parts below the HAVING CLAUSE (third part).
```
Looks like this comment needs to be updated - in the new case it's returning None rather than the inner query block below HAVING (and this is ok because we only needed the aggregate to fix the COUNT bug). Right?
Yea, we should at least explain when the third part can be None.
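For readers following along, a hedged sketch of what the updated contract could look like (the exact signature in `subquery.scala` may differ; `aggNode` being an `Option` is inferred from the diff in this thread):

```scala
// Sketch only -- assumes the Catalyst types visible in the surrounding diff.
// The third component is an Option so that case 3 (a non-aggregated subquery
// with maxRows <= 1) can return None: there is no Aggregate node to return,
// and none is needed, because the Aggregate is only used to fix the COUNT bug.
def splitSubquery(
    plan: LogicalPlan): (Seq[LogicalPlan], Option[Filter], Option[Aggregate])
```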
```diff
@@ -561,14 +566,18 @@ object RewriteCorrelatedScalarSubquery extends Rule[LogicalPlan] with AliasHelpe
 val origOutput = query.output.head

 val resultWithZeroTups = evalSubqueryOnZeroTups(query)
 if (resultWithZeroTups.isEmpty) {
   val (topPart, havingNode, aggNode) = splitSubquery(query)
```
how about we make the case 1 result a lazy val? (multiple variable lazy val looks weird)

```scala
lazy val planWithoutCountBug = Project(... // Or just val as constructing logical plan is cheap

if (resultWithZeroTups.isEmpty) {
  planWithoutCountBug
} else {
  val (topPart, havingNode, aggNode) = splitSubquery(query)
  if (aggNode.isEmpty) planWithoutCountBug else ...
}
```
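A fuller sketch of that suggestion (illustrative only; names like `currentChild`, `conditions`, and the exact join construction are assumed from the surrounding rule and may not match the final code):

```scala
// Illustrative sketch: hoist the "no COUNT bug" rewrite into one place so
// both early-exit branches share it. currentChild, conditions, origOutput,
// and query are assumed to be in scope in the surrounding rule.
lazy val planWithoutCountBug = Project(
  currentChild.output :+ origOutput,
  Join(currentChild, query, LeftOuter, conditions.reduceOption(And), JoinHint.NONE))

if (resultWithZeroTups.isEmpty) {
  // The subquery yields no row on empty input, so the COUNT bug cannot occur.
  planWithoutCountBug
} else {
  val (topPart, havingNode, aggNode) = splitSubquery(query)
  if (aggNode.isEmpty) {
    // Non-aggregated one-row subquery (case 3): no Aggregate, no COUNT bug.
    planWithoutCountBug
  } else {
    // Aggregated case: build the COUNT-bug-safe plan from the three parts.
    ...
  }
}
```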
Force-pushed f24ead1 to 8ea1954
Force-pushed 8ea1954 to 7f4cf74
thanks, merging to master!
…atedScalarSubquery

### What changes were proposed in this pull request?

This PR updates `splitSubquery` in `RewriteCorrelatedScalarSubquery` to support non-aggregated one-row subqueries. In CheckAnalysis, we allow three types of correlated scalar subquery patterns:

1. SubqueryAlias/Project + Aggregate
2. SubqueryAlias/Project + Filter + Aggregate
3. SubqueryAlias/Project + LogicalPlan (maxRows <= 1)

https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L851-L856

We should support the third case in `splitSubquery` to avoid `Unexpected operator` exceptions.

### Why are the changes needed?

To fix an issue with correlated subquery rewrite.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New unit tests.

Closes apache#38336 from allisonwang-db/spark-40862-split-subquery.

Authored-by: allisonwang-db &lt;allison.wang@databricks.com&gt;
Signed-off-by: Wenchen Fan &lt;wenchen@databricks.com&gt;
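As a hedged illustration of the third pattern (the table `t1` and a `spark` session are assumed; this is not a test from the PR), a correlated scalar subquery whose inner plan is a one-row `Project` rather than an `Aggregate`:

```scala
// Hypothetical repro: the inner query plans as a Project over a one-row
// relation (maxRows = 1), i.e. case 3 above, with a correlated reference
// to t1.a. Before this PR, rewriting such a subquery could fail with an
// "Unexpected operator" error in splitSubquery.
spark.sql("""
  SELECT t1.a,
         (SELECT c + t1.a FROM (SELECT 1 AS c)) AS v
  FROM t1
""")
```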