Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-35545][SQL] Split SubqueryExpression's children field into outer attributes and join conditions #32687

Closed

Conversation

allisonwang-db
Copy link
Contributor

What changes were proposed in this pull request?

This PR refactors SubqueryExpression class. It removes the children field from SubqueryExpression's constructor and adds outerAttrs and joinCond.

Why are the changes needed?

Currently, the children field of a subquery expression is used to store both collected outer references in the subquery plan and join conditions after correlated predicates are pulled up.

For example:
SELECT (SELECT max(c1) FROM t1 WHERE t1.c1 = t2.c1) FROM t2

During the analysis phase, outer references in the subquery are stored in the children field: scalar-subquery [t2.c1], but after the optimizer rule PullupCorrelatedPredicates, the children field will be used to store the join conditions, which contain both the inner and the outer references: scalar-subquery [t1.c1 = t2.c1]. This is why the references of SubqueryExpression excludes the inner plan's output:

override lazy val references: AttributeSet =
if (plan.resolved) super.references -- plan.outputSet else super.references

This can be confusing and error-prone. The references for a subquery expression should always be defined as outer attribute references.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

@github-actions github-actions bot added the SQL label May 27, 2021
@SparkQA
Copy link

SparkQA commented May 27, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43554/

@allisonwang-db
Copy link
Contributor Author

cc @cloud-fan

@SparkQA
Copy link

SparkQA commented May 28, 2021

Test build #139036 has finished for PR 32687 at commit d37d01a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -62,11 +62,13 @@ abstract class PlanExpression[T <: QueryPlan[_]] extends Expression {
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add classdoc to explain these parameters?

@SparkQA
Copy link

SparkQA commented May 28, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43592/

@SparkQA
Copy link

SparkQA commented May 29, 2021

Test build #139071 has finished for PR 32687 at commit eab933a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 806da9d May 31, 2021
MaxGekk pushed a commit that referenced this pull request Jun 21, 2021
…queryExpression refactor

### What changes were proposed in this pull request?

Add a test.

### Why are the changes needed?

The SubqueryExpression refactor PR #32687 actually fixes the bug of `SubqueryExpression.references`. So this follow-up PR adds a regression unit test for it.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a new test.

Closes #32990 from Ngone51/spark-35545-followup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@allisonwang-db allisonwang-db deleted the refactor-subquery-expr branch January 19, 2024 01:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants