New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-35545][SQL] Split SubqueryExpression's children field into outer attributes and join conditions #32687
[SPARK-35545][SQL] Split SubqueryExpression's children field into outer attributes and join conditions #32687
Conversation
Kubernetes integration test unable to build dist. exiting with code: 1 |
cc @cloud-fan |
Test build #139036 has finished for PR 32687 at commit
|
@@ -62,11 +62,13 @@ abstract class PlanExpression[T <: QueryPlan[_]] extends Expression { | |||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add classdoc to explain these parameters?
Kubernetes integration test unable to build dist. exiting with code: 1 |
Test build #139071 has finished for PR 32687 at commit
|
thanks, merging to master! |
…queryExpression refactor ### What changes were proposed in this pull request? Add a test. ### Why are the changes needed? The SubqueryExpression refactor PR #32687 actually fixes the bug of `SubqueryExpression.references`. So this follow-up PR adds a regression unit test for it. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a new test. Closes #32990 from Ngone51/spark-35545-followup. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
What changes were proposed in this pull request?
This PR refactors
SubqueryExpression
class. It removes the children field from SubqueryExpression's constructor and addsouterAttrs
andjoinCond
.Why are the changes needed?
Currently, the children field of a subquery expression is used to store both collected outer references in the subquery plan and join conditions after correlated predicates are pulled up.
For example:
SELECT (SELECT max(c1) FROM t1 WHERE t1.c1 = t2.c1) FROM t2
During the analysis phase, outer references in the subquery are stored in the children field:
scalar-subquery [t2.c1]
, but after the optimizer rulePullupCorrelatedPredicates
, the children field will be used to store the join conditions, which contain both the inner and the outer references:scalar-subquery [t1.c1 = t2.c1]
. This is why the references of SubqueryExpression excludes the inner plan's output:spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
Lines 68 to 69 in 29ed1a2
This can be confusing and error-prone. The references for a subquery expression should always be defined as outer attribute references.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests.