[SPARK-28375][SQL] Prevent the PullupCorrelatedPredicates optimizer rule from removing predicates if run multiple times #25164

Closed
wants to merge 3 commits into from

Conversation

yeshengm
Contributor

What changes were proposed in this pull request?

The original implementation of PullupCorrelatedPredicates can remove predicates in subqueries if the rule is run multiple times. This fix resolves this issue by appending new predicates to existing predicates from the last run.
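To make the failure mode concrete, here is a minimal sketch in plain Scala. It is a toy model and does not use Spark's Catalyst classes: `Cond`, `Subquery`, `overwritingPullUp`, and `appendingPullUp` are hypothetical names introduced only for illustration. It also ignores the fact that, after analysis, `children` holds outer references rather than pulled-up predicates.

```scala
// Toy model, not Spark's Catalyst classes: all names here are hypothetical.
object PullupSketch {
  // A condition is just a string; `correlated` marks one that references the outer query.
  final case class Cond(sql: String, correlated: Boolean)

  // A subquery holds the filters still inside its plan plus the conditions already pulled up.
  final case class Subquery(planFilters: List[Cond], children: List[Cond])

  // Split the plan's filters into the ones to pull up and the ones that stay in the plan.
  private def pullOut(filters: List[Cond]): (List[Cond], List[Cond]) =
    filters.partition(_.correlated)

  // Overwrites `children` with whatever this run found, so a second run
  // (which finds nothing left in the plan) discards the earlier result.
  def overwritingPullUp(s: Subquery): Subquery = {
    val (pulled, remaining) = pullOut(s.planFilters)
    Subquery(remaining, pulled)
  }

  // Keeps the conditions from the previous run and prepends any new ones.
  def appendingPullUp(s: Subquery): Subquery = {
    val (pulled, remaining) = pullOut(s.planFilters)
    Subquery(remaining, pulled ++ s.children)
  }
}
```

In this toy model, running `overwritingPullUp` twice leaves `children` empty, while `appendingPullUp` returns the same `Subquery` on the second run.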

How was this patch tested?

A new unit test.

@joshrosen-stripe
Contributor

FYI @mgaido91 also has a patch for this at #25145; posting this cross-reference here so reviewers are aware of both PRs and can work together to converge on a single fix.

@SparkQA

SparkQA commented Jul 15, 2019

Test build #107693 has finished for PR 25164 at commit 0efae65.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yeshengm
Contributor Author

retest this please

@@ -275,13 +275,16 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper
plan transformExpressions {
case ScalarSubquery(sub, children, exprId) if children.nonEmpty =>
val (newPlan, newCond) = pullOutCorrelatedPredicates(sub, outerPlans)
ScalarSubquery(newPlan, newCond, exprId)
val conds = newCond ++ children.filter(_.isInstanceOf[Predicate])
Contributor

I am not sure about this. What if we use children if newCond is empty?

Contributor Author

Hmm, that might not be correct. In the analysis phase, children contains all outer references, so even when newCond is empty we can't just leave children as it is.

Contributor

But, if there are outer references there, newCond should not be empty, right? I am a bit worried about using Predicate here. You might have an outer reference in a Coalesce or CaseWhen, which are not Predicates for instance...
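The concern about filtering on the expression type can be sketched with a small stand-in expression tree. This is a hypothetical model, not Catalyst: `Pred` stands in for Spark's `Predicate` trait, and `Expr`, `OuterRef`, `Attr`, `EqualTo`, `Coalesce`, `hasOuterRef`, and `keepPredicates` are simplified illustrations.

```scala
// Hypothetical stand-ins for Catalyst expressions, for illustration only.
object PredicateFilterSketch {
  sealed trait Expr
  // Marker for boolean-valued expressions, standing in for Catalyst's Predicate trait.
  sealed trait Pred extends Expr
  final case class OuterRef(name: String) extends Expr
  final case class Attr(name: String) extends Expr
  final case class EqualTo(left: Expr, right: Expr) extends Pred
  // Coalesce returns one of its arguments, so it is not a predicate.
  final case class Coalesce(args: List[Expr]) extends Expr

  // True if the expression mentions the outer query anywhere in its tree.
  def hasOuterRef(e: Expr): Boolean = e match {
    case OuterRef(_)    => true
    case Attr(_)        => false
    case EqualTo(l, r)  => hasOuterRef(l) || hasOuterRef(r)
    case Coalesce(args) => args.exists(hasOuterRef)
  }

  // The type-based filter under discussion: keeps only predicate-typed children.
  def keepPredicates(children: List[Expr]): List[Expr] =
    children.filter(_.isInstanceOf[Pred])
}
```

With a `children` list containing both an `EqualTo` and a `Coalesce` that wraps an `OuterRef`, `keepPredicates` keeps only the `EqualTo`, so the outer reference inside the `Coalesce` is silently dropped.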

Contributor Author

That makes sense... Basically we want to distinguish analyzed plans and optimized plans. But in the current implementation, both the analyzer and optimizer are stripping outer references...

Contributor Author

Also, a side note... The PullupCorrelatedPredicates rule is tightly coupled with RewriteSubqueryPredicates. For ListQuery, it seems that PullupCorrelatedPredicates is compulsory for a correct physical plan.

Contributor

Yes, I agree with your suggestion @dilipbiswal, basically it is what I suggested earlier

Contributor

@mgaido91 Thanks a lot. Let's see if it works or if there is something we may be missing :-)

Contributor Author

@yeshengm yeshengm Jul 18, 2019

I doubt it, because the logic for checking OuterReferences and the logic for actually pulling up predicates are slightly different. So even if l.children is non-empty, it does not necessarily mean that newCond is non-empty.

The most natural way I can think of is to combine these two rules, PullupCorrelatedPredicates and RewriteSubqueryPredicates, and have RewriteSubqueryPredicates completely remove those hacky list subqueries. I don't think the plan can change if we apply these two rules in a single Once batch.

Contributor Author

@yeshengm yeshengm Jul 18, 2019

TBH subquery resolution and optimization are super tricky and can be error-prone. The current code is a bit complex and fragile, because one piece of code might have some pre-conditions on some other parts of the codebase, which might change over time.

Contributor

@yeshengm Thanks. Let me make a PR to get it tested and see what is broken. I think it makes sense to keep this logic idempotent, even if it's merged with rewriteSubquery. That is because pullupCorrelatedPredicates works for scalar subqueries as well, and those are handled by a different rule.
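For reference, "idempotent" here means that applying the rule to its own output changes nothing. A minimal generic check might look like the sketch below; `IdempotenceSketch` and `isIdempotentOn` are hypothetical names, and a rule is modeled as a plain function over any plan type with value equality rather than a Catalyst `Rule[LogicalPlan]`.

```scala
// Hypothetical helper, not part of Spark: checks idempotence on one sample plan.
object IdempotenceSketch {
  // A rule is modeled as a function from plan to plan; P needs meaningful equality.
  def isIdempotentOn[P](rule: P => P)(plan: P): Boolean = {
    val once = rule(plan)
    // Idempotent on this input if a second application leaves the result unchanged.
    rule(once) == once
  }
}
```

A check like this only covers the sample plans it is given, so it can expose a non-idempotent rule but cannot prove idempotence in general.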

@SparkQA

SparkQA commented Jul 15, 2019

Test build #107697 has finished for PR 25164 at commit 0efae65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 15, 2019

Test build #107700 has finished for PR 25164 at commit 0efae65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 16, 2019

Test build #107730 has finished for PR 25164 at commit c898430.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yeshengm
Contributor Author

retest this please

@SparkQA

SparkQA commented Jul 16, 2019

Test build #107750 has finished for PR 25164 at commit c898430.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yeshengm yeshengm force-pushed the once-pullup-correlated-expr branch from c898430 to 6b79f88 Compare July 16, 2019 18:03
@SparkQA

SparkQA commented Jul 16, 2019

Test build #107754 has finished for PR 25164 at commit 6b79f88.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yeshengm yeshengm force-pushed the once-pullup-correlated-expr branch from 6b79f88 to f8f4ec8 Compare July 16, 2019 18:44
@SparkQA

SparkQA commented Jul 16, 2019

Test build #107755 has finished for PR 25164 at commit f8f4ec8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yeshengm yeshengm changed the title from "[SPARK-28375] Prevent the PullupCorrelatedPredicates optimizer rule from removing predicates if run multiple times" to "[SPARK-28375][SQL] Prevent the PullupCorrelatedPredicates optimizer rule from removing predicates if run multiple times" on Jul 17, 2019
@SparkQA

SparkQA commented Jul 17, 2019

Test build #107767 has finished for PR 25164 at commit 4d405f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yeshengm yeshengm closed this Jul 17, 2019
@yeshengm
Contributor Author

yeshengm commented Jul 17, 2019 via email

7 participants