Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-40862][SQL] Support non-aggregated subqueries in RewriteCorrelatedScalarSubquery #38336

Closed

Conversation

allisonwang-db
Copy link
Contributor

What changes were proposed in this pull request?

This PR updates the splitSubquery in RewriteCorrelatedScalarSubquery to support non-aggregated one-row subquery.

In CheckAnalysis, we allow three types of correlated scalar subquery patterns:

  1. SubqueryAlias/Project + Aggregate
  2. SubqueryAlias/Project + Filter + Aggregate
  3. SubqueryAlias/Project + LogicalPlan (maxRows <= 1)

cleanQueryInScalarSubquery(query) match {
case a: Aggregate => checkAggregateInScalarSubquery(outerAttrs, query, a)
case Filter(_, a: Aggregate) => checkAggregateInScalarSubquery(outerAttrs, query, a)
case p: LogicalPlan if p.maxRows.exists(_ <= 1) => // Ok
case other =>
expr.failAnalysis(

We should support the thrid case in splitSubquery to avoid Unexpected operator exceptions.

Why are the changes needed?

To fix an issue with correlated subquery rewrite.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit tests.

@allisonwang-db
Copy link
Contributor Author

cc @cloud-fan

Copy link
Contributor

@jchen5 jchen5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, one small comment

@@ -511,17 +511,18 @@ object RewriteCorrelatedScalarSubquery extends Rule[LogicalPlan] with AliasHelpe
* (first part of returned value), the HAVING clause of the innermost query block
* (optional second part) and the parts below the HAVING CLAUSE (third part).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this comment needs to be updated - in the new case it's returning None, rather than the inner query block below HAVING (and this is ok because we only needed the aggregate to fix the COUNT bug). Right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, we should at least explain when the third part can be None.

@@ -561,14 +566,18 @@ object RewriteCorrelatedScalarSubquery extends Rule[LogicalPlan] with AliasHelpe
val origOutput = query.output.head

val resultWithZeroTups = evalSubqueryOnZeroTups(query)
if (resultWithZeroTups.isEmpty) {
val (topPart, havingNode, aggNode) = splitSubquery(query)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about we make the case 1 result a lazy val? (multiple variable lazy val looks weird)

lazy val planWithoutCountBug = Project(... // Or just val as constructing logical plan is cheap
if (resultWithZeroTups.isEmpty) {
  planWithoutCountBug
} else {
  val (topPart, havingNode, aggNode) = splitSubquery(query)
  if (aggNode.isEmpty) planWithoutCountBug else ...
}

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 3feddec Oct 28, 2022
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…atedScalarSubquery

### What changes were proposed in this pull request?
This PR updates the `splitSubquery` in `RewriteCorrelatedScalarSubquery` to support non-aggregated one-row subquery.

In CheckAnalysis, we allow three types of correlated scalar subquery patterns:
1. SubqueryAlias/Project + Aggregate
2. SubqueryAlias/Project + Filter + Aggregate
3. SubqueryAlias/Project + LogicalPlan (maxRows <= 1)

https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L851-L856

We should support the thrid case in `splitSubquery` to avoid `Unexpected operator` exceptions.

### Why are the changes needed?
To fix an issue with correlated subquery rewrite.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New unit tests.

Closes apache#38336 from allisonwang-db/spark-40862-split-subquery.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants