[SPARK-35545][SQL] Split SubqueryExpression's children field into outer attributes and join conditions #32687

allisonwang-db · 2021-05-27T21:58:13Z

What changes were proposed in this pull request?

This PR refactors the SubqueryExpression class. It removes the children field from the SubqueryExpression constructor and adds two fields in its place: outerAttrs and joinCond.

Why are the changes needed?

Currently, the children field of a subquery expression is used to store both collected outer references in the subquery plan and join conditions after correlated predicates are pulled up.

For example:
SELECT (SELECT max(c1) FROM t1 WHERE t1.c1 = t2.c1) FROM t2

During the analysis phase, outer references in the subquery are stored in the children field: scalar-subquery [t2.c1]. After the optimizer rule PullupCorrelatedPredicates runs, however, the children field stores the join conditions instead, which contain both the inner and the outer references: scalar-subquery [t1.c1 = t2.c1]. This is why the references of SubqueryExpression exclude the inner plan's output:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala

Lines 68 to 69 in 29ed1a2

    
    override lazy val references: AttributeSet =
      if (plan.resolved) super.references -- plan.outputSet else super.references

This can be confusing and error-prone. The references for a subquery expression should always be defined as outer attribute references.
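To make the difference concrete, here is a minimal toy model in plain Scala. It does not use Spark's Catalyst classes: `Attr`, `EqualTo`, `LegacySubquery`, and `SplitSubquery` are hypothetical stand-ins invented for this illustration, and the sketch leaves out the `plan.resolved` check from the real code. It also assumes that the inner plan's output contains `t1.c1` once the predicate has been pulled up. Only the field names `outerAttrs` and `joinCond` come from this PR.

```scala
// Toy model only: every type here is a hypothetical stand-in, not a Catalyst class.
object SubqueryRefsSketch {
  final case class Attr(name: String)

  // A join condition such as t1.c1 = t2.c1, reduced to the attributes it mentions.
  final case class EqualTo(left: Attr, right: Attr) {
    def references: Set[Attr] = Set(left, right)
  }

  // Old shape: a single children field holds outer references after analysis,
  // but join conditions after PullupCorrelatedPredicates. The inner plan's output
  // has to be subtracted to recover the outer references.
  final case class LegacySubquery(childrenRefs: Set[Attr], planOutput: Set[Attr]) {
    def references: Set[Attr] = childrenRefs -- planOutput
  }

  // New shape: outer attributes and join conditions live in separate fields,
  // so references can be defined purely in terms of the outer attributes.
  final case class SplitSubquery(outerAttrs: Set[Attr], joinCond: Seq[EqualTo]) {
    def references: Set[Attr] = outerAttrs
  }

  val inner: Attr = Attr("t1.c1")
  val outer: Attr = Attr("t2.c1")
  val maxC1: Attr = Attr("max(c1)")
  val cond: EqualTo = EqualTo(inner, outer)

  // After analysis: children = [t2.c1].
  val analyzed: LegacySubquery = LegacySubquery(Set(outer), Set(maxC1))

  // After pull-up: children = [t1.c1 = t2.c1].
  // Assumption: the inner plan now also outputs t1.c1.
  val pulledUp: LegacySubquery = LegacySubquery(cond.references, Set(maxC1, inner))

  // Refactored shape for the same pulled-up subquery.
  val split: SplitSubquery = SplitSubquery(Set(outer), Seq(cond))

  def main(args: Array[String]): Unit = {
    println(s"legacy, analyzed:  ${analyzed.references}")
    println(s"legacy, pulled up: ${pulledUp.references}")
    println(s"split fields:      ${split.references}")
  }
}
```

In this model all three shapes report only the outer attribute, but the legacy shape gets there only because of the subtraction of the plan output, while the split shape never mixes inner attributes into the set in the first place.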

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

SparkQA · 2021-05-27T23:16:07Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43554/

allisonwang-db · 2021-05-28T01:09:21Z

cc @cloud-fan

SparkQA · 2021-05-28T02:57:20Z

Test build #139036 has finished for PR 32687 at commit d37d01a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-05-28T07:25:10Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala

@@ -62,11 +62,13 @@ abstract class PlanExpression[T <: QueryPlan[_]] extends Expression {
 */


can we add classdoc to explain these parameters?

SparkQA · 2021-05-28T21:42:06Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43592/

SparkQA · 2021-05-29T00:57:29Z

Test build #139071 has finished for PR 32687 at commit eab933a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-05-31T04:57:22Z

thanks, merging to master!

…queryExpression refactor

### What changes were proposed in this pull request?
Add a test.

### Why are the changes needed?
The SubqueryExpression refactor PR #32687 actually fixes the bug of `SubqueryExpression.references`. So this follow-up PR adds a regression unit test for it.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added a new test.

Closes #32990 from Ngone51/spark-35545-followup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>

subquery expression

d37d01a

github-actions bot added the SQL label May 27, 2021

cloud-fan reviewed May 28, 2021


cloud-fan approved these changes May 28, 2021


add docs

eab933a

cloud-fan closed this in 806da9d May 31, 2021

Ngone51 mentioned this pull request Jun 21, 2021

[SPARK-35545][FOLLOW-UP][TEST][SQL] Add a regression test for the SubqueryExpression refactor #32990

Closed

mythrocks mentioned this pull request Jun 24, 2021

Investigate SPARK-35545 (Refactor of SubqueryExpression's members) NVIDIA/spark-rapids#2808

Closed

allisonwang-db deleted the refactor-subquery-expr branch January 19, 2024 01:22
