[SPARK-28375][SQL] Prevent the PullupCorrelatedPredicates optimizer rule from removing predicates if run multiple times #25164

yeshengm · 2019-07-15T17:24:11Z

What changes were proposed in this pull request?

The original implementation of PullupCorrelatedPredicates can remove predicates in subqueries if the rule is run multiple times. This fix resolves this issue by appending new predicates to existing predicates from the last run.
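A minimal sketch of the failure mode, assuming a toy model rather than Catalyst's real classes. `Cond`, `Subquery`, `pullOut`, `pullupReplacing` and `pullupAppending` are hypothetical names used only for illustration, and the toy ignores the fact that analyzed plans keep outer references in `children`:

```scala
// Toy model, not Spark's classes: a subquery is a list of filter conditions
// (its "plan") plus the join conditions already pulled up (its "children").
object PullupSketch {
  // A condition counts as correlated if it mentions an outer column.
  final case class Cond(text: String, correlated: Boolean)
  final case class Subquery(plan: List[Cond], children: List[Cond])

  // Hypothetical stand-in for pullOutCorrelatedPredicates: returns the
  // conditions that stay in the subquery and the ones that are pulled up.
  def pullOut(plan: List[Cond]): (List[Cond], List[Cond]) = {
    val (pulled, kept) = plan.partition(_.correlated)
    (kept, pulled)
  }

  // Original behavior: children are replaced by whatever this run pulled up.
  def pullupReplacing(s: Subquery): Subquery = {
    val (newPlan, newCond) = pullOut(s.plan)
    Subquery(newPlan, newCond)
  }

  // Behavior described above: append to the conditions from the last run.
  def pullupAppending(s: Subquery): Subquery = {
    val (newPlan, newCond) = pullOut(s.plan)
    Subquery(newPlan, newCond ++ s.children)
  }

  val input: Subquery = Subquery(
    plan = List(
      Cond("outer.a = inner.a", correlated = true),
      Cond("inner.b > 1", correlated = false)),
    children = Nil)

  def main(args: Array[String]): Unit = {
    val once = pullupReplacing(input)
    println(once.children)                  // the correlated condition
    println(pullupReplacing(once).children) // empty: the condition is lost
    println(pullupAppending(pullupAppending(input)).children) // still present
  }
}
```

In this toy, the second run of `pullupReplacing` finds nothing left to pull out of the plan and overwrites the join condition with an empty list, which is the kind of predicate loss the description refers to.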

How was this patch tested?

A new unit test (UT).

joshrosen-stripe · 2019-07-15T17:43:33Z

FYI @mgaido91 also has a patch for this at #25145; posting this cross-reference here so reviewers are aware of both PRs and can work together to converge on a single fix.

SparkQA · 2019-07-15T18:50:47Z

Test build #107693 has finished for PR 25164 at commit 0efae65.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

yeshengm · 2019-07-15T19:38:44Z

retest this please

mgaido91 · 2019-07-15T19:58:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

@@ -275,13 +275,16 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper
    plan transformExpressions {
      case ScalarSubquery(sub, children, exprId) if children.nonEmpty =>
        val (newPlan, newCond) = pullOutCorrelatedPredicates(sub, outerPlans)
-        ScalarSubquery(newPlan, newCond, exprId)
+        val conds = newCond ++ children.filter(_.isInstanceOf[Predicate])


I am not sure about this. What if we use children if newCond is empty?

Hmmm... That might not be correct, since in the analysis phase children contains all outer references. Even if newCond is empty, we can't just leave children as it is.

But, if there are outer references there, newCond should not be empty, right? I am a bit worried about using Predicate here. You might have an outer reference in a Coalesce or CaseWhen, which are not Predicates for instance...
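To make the concern concrete, here is a small sketch with a toy expression hierarchy. These are not Catalyst's `Expression` classes; `Expr`, `OuterRef`, `EqualTo`, `Coalesce` and `keepPredicates` are hypothetical stand-ins. It only shows that an `isInstanceOf[Predicate]` filter silently drops any non-predicate child that wraps an outer reference:

```scala
object PredicateFilterSketch {
  sealed trait Expr
  // Marker for boolean-valued expressions in this toy hierarchy.
  sealed trait Predicate extends Expr
  final case class OuterRef(name: String) extends Expr
  final case class EqualTo(left: Expr, right: Expr) extends Predicate
  // Coalesce returns a value, so it is deliberately not a Predicate.
  final case class Coalesce(args: List[Expr]) extends Expr

  // Mirrors the filter used in the diff, on the toy types.
  def keepPredicates(children: List[Expr]): List[Expr] =
    children.filter(_.isInstanceOf[Predicate])

  val children: List[Expr] = List(
    EqualTo(OuterRef("a"), OuterRef("b")),
    Coalesce(List(OuterRef("c"), OuterRef("d"))))

  def main(args: Array[String]): Unit = {
    // Only the EqualTo survives; the Coalesce with its outer references is dropped.
    keepPredicates(children).foreach(println)
  }
}
```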

That makes sense... Basically we want to distinguish analyzed plans and optimized plans. But in the current implementation, both the analyzer and optimizer are stripping outer references...

Also, a side note... The PullupCorrelatedPredicates rule is tightly coupled with RewriteSubqueryPredicates. For ListQuery, it seems that PullupCorrelatedPredicates is compulsory for a correct physical plan.

Yes, I agree with your suggestion @dilipbiswal, basically it is what I suggested earlier

@mgaido91 Thanks a lot. Let's see if it works or if there is something we may be missing :-)

I doubt it... Because the logic for checking OuterReferences and the logic for actually pulling up predicates are slightly different. With that being said, even though l.children is non-empty, it does not necessarily mean that newCond is non-empty.

The most natural way I can think of is to combine these two rules, PullupCorrelatedPredicates and RewriteSubqueryPredicates, so that RewriteSubqueryPredicates completely removes those hacky list subqueries. I don't think the plan can change if we apply these two rules in a single Once batch.

TBH subquery resolution and optimization are super tricky and can be error-prone. The current code is a bit complex and fragile, because one piece of code might have some pre-conditions on some other parts of the codebase, which might change over time.

@yeshengm Thanks. Let me make a PR to get it tested and see what is broken. I think it makes sense to keep this logic idempotent (even though it's merged with rewriteSubquery). That is because pullupCorrelatedPredicates works for scalar subqueries as well, and those are handled by a different rule.
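For readers less familiar with the term, a rule is idempotent when applying it a second time changes nothing. A generic sketch of that check follows; it is not RuleExecutor's implementation, and `isIdempotent`, `dedup` and `dropHead` are hypothetical names:

```scala
object IdempotenceSketch {
  // A rule is idempotent on `input` if a second application is a no-op.
  def isIdempotent[A](rule: A => A, input: A): Boolean = {
    val once = rule(input)
    rule(once) == once
  }

  // Idempotent: removing duplicates twice is the same as removing them once.
  val dedup: List[Int] => List[Int] = _.distinct
  // Not idempotent: every application drops one more element.
  val dropHead: List[Int] => List[Int] = _.drop(1)

  def main(args: Array[String]): Unit = {
    println(isIdempotent(dedup, List(1, 1, 2)))
    println(isIdempotent(dropHead, List(1, 2, 3)))
  }
}
```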

SparkQA · 2019-07-15T22:31:16Z

Test build #107697 has finished for PR 25164 at commit 0efae65.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-15T23:12:55Z

Test build #107700 has finished for PR 25164 at commit 0efae65.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-16T09:22:21Z

Test build #107730 has finished for PR 25164 at commit c898430.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

yeshengm · 2019-07-16T17:26:23Z

retest this please

SparkQA · 2019-07-16T17:34:26Z

Test build #107750 has finished for PR 25164 at commit c898430.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-16T18:17:10Z

Test build #107754 has finished for PR 25164 at commit 6b79f88.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-16T20:13:03Z

Test build #107755 has finished for PR 25164 at commit f8f4ec8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-17T04:28:15Z

Test build #107767 has finished for PR 25164 at commit 4d405f6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yeshengm · 2019-07-17T18:59:51Z

Yep. It does work!

On Wed, Jul 17, 2019 at 11:57 AM Dilip Biswal wrote:

> In sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
>
> @yeshengm Ok, sounds good. One question: can we not make PullupCorrelatedPredicates idempotent the way it is now (i.e. when these two rules are separate)? If we did something like this:
>
>     case l @ ListQuery(sub, _, exprId, childOutputs) =>
>       val (newPlan, newCond) = pullOutCorrelatedPredicates(sub, outerPlans)
>       if (newCond.isEmpty) {
>         // Perhaps just returning `l` may work as well. But in case we are relying on
>         // the de-dup processing somehow..
>         ListQuery(newPlan, l.children, exprId, childOutputs)
>       } else {
>         ListQuery(newPlan, newCond, exprId, childOutputs)
>       }
>
> will it work? Or have you tried it already?

-- Yesheng
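The fallback suggested in the quoted comment can be sketched on a toy model. This is not Catalyst code: `Cond`, the two-field `ListQuery`, `pullOut` and `pullup` are hypothetical simplifications, and in the toy the correlation check and the pull-up logic coincide, which the review thread notes is not exactly true in Spark:

```scala
object FallbackSketch {
  final case class Cond(text: String, correlated: Boolean)
  // Toy ListQuery: only the subquery's conditions and the pulled-up children.
  final case class ListQuery(plan: List[Cond], children: List[Cond])

  // Hypothetical stand-in for pullOutCorrelatedPredicates.
  def pullOut(plan: List[Cond]): (List[Cond], List[Cond]) = {
    val (pulled, kept) = plan.partition(_.correlated)
    (kept, pulled)
  }

  // Keep the existing children when this run pulls nothing up.
  def pullup(l: ListQuery): ListQuery = {
    val (newPlan, newCond) = pullOut(l.plan)
    if (newCond.isEmpty) ListQuery(newPlan, l.children)
    else ListQuery(newPlan, newCond)
  }

  val input: ListQuery = ListQuery(
    plan = List(Cond("outer.a = inner.a", correlated = true)),
    children = Nil)

  def main(args: Array[String]): Unit = {
    val once = pullup(input)
    val twice = pullup(once)
    println(once == twice) // the second run leaves the toy query unchanged
  }
}
```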

joshrosen-stripe mentioned this pull request Jul 15, 2019

[SPARK-28375][SQL] Make PullupCorrelatedPredicates idempotent #25145

Closed

yeshengm closed this Jul 15, 2019

yeshengm reopened this Jul 15, 2019

mgaido91 reviewed Jul 15, 2019

dongjoon-hyun added the SQL label Jul 16, 2019

yeshengm force-pushed the once-pullup-correlated-expr branch from 0efae65 to c898430 Compare July 16, 2019 09:10

fix

ed6ff67

yeshengm force-pushed the once-pullup-correlated-expr branch from c898430 to 6b79f88 Compare July 16, 2019 18:03

no pull up for subqueries without outerref

f8f4ec8

yeshengm force-pushed the once-pullup-correlated-expr branch from 6b79f88 to f8f4ec8 Compare July 16, 2019 18:44

modify UT

4d405f6

yeshengm changed the title ~~[SPARK-28375] Prevent the PullupCorrelatedPredicates optimizer rule from removing predicates if run multiple times~~ [SPARK-28375][SQL] Prevent the PullupCorrelatedPredicates optimizer rule from removing predicates if run multiple times Jul 17, 2019

yeshengm closed this Jul 17, 2019

yeshengm mentioned this pull request Jul 26, 2019

[SPARK-28237][SQL] Enforce Idempotence for Once batches in RuleExecutor #25249

Closed


Review comments: mgaido91 (Jul 15, 2019); yeshengm (Jul 15, 2019); mgaido91 (Jul 15, 2019); yeshengm (Jul 15, 2019); yeshengm (Jul 16, 2019); mgaido91 (Jul 17, 2019); dilipbiswal (Jul 17, 2019); yeshengm (Jul 18, 2019, edited); yeshengm (Jul 18, 2019, edited); dilipbiswal (Jul 26, 2019)
