[SPARK-16343][SQL] Improve the PushDownPredicate rule to pushdown predicates correctly in non-deterministic condition. by jiangxb1987 · Pull Request #14012 · apache/spark

jiangxb1987 · 2016-07-01T09:17:18Z

What changes were proposed in this pull request?

Currently the Optimizer may reorder predicates to evaluate them more efficiently. When a condition contains non-deterministic parts, however, changing the order between the deterministic and non-deterministic parts may change the number of input rows that the non-deterministic parts see. For example:
SELECT a FROM t WHERE rand() < 0.1 AND a = 1
and
SELECT a FROM t WHERE a = 1 AND rand() < 0.1
may call rand() a different number of times, and therefore the output rows may differ.

This PR fixes this by pushing a predicate down only if it is placed before any non-deterministic predicate in the condition.
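
To make the ordering issue concrete, here is a minimal sketch in plain Scala with no Spark dependency. It assumes left-to-right, short-circuit evaluation of the conjunction; the object `PredicateOrderDemo` and the method `countRandCalls` are hypothetical names introduced only for this illustration.

```scala
import scala.util.Random

object PredicateOrderDemo {
  // Filters `rows` with the conjunction `rand() < 0.1 AND a = 1`, evaluated
  // left to right with short-circuiting, and counts how many times the
  // non-deterministic predicate is actually invoked.
  // Assumption: this only models evaluation order; it is not Spark's evaluator.
  def countRandCalls(rows: Seq[Int], randFirst: Boolean, seed: Long): (Seq[Int], Int) = {
    val rng = new Random(seed)
    var calls = 0
    def randPred(): Boolean = { calls += 1; rng.nextDouble() < 0.1 }

    val out = rows.filter { a =>
      if (randFirst) randPred() && a == 1 // rand() runs for every row
      else a == 1 && randPred()           // rand() runs only when a == 1
    }
    (out, calls)
  }
}
```

With five input rows of which two satisfy `a = 1`, the first ordering invokes the random predicate five times and the second only twice, so the two orderings consume different random values even with the same seed.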

How was this patch tested?

Expanded the related test cases in FilterPushdownSuite.


jiangxb1987 · 2016-07-01T09:22:20Z

cc @liancheng @cloud-fan

cloud-fan · 2016-07-03T15:15:03Z

ok to test

SparkQA · 2016-07-03T16:59:17Z

Test build #61691 has finished for PR 14012 at commit 856d86d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2016-07-05T08:02:07Z

add to whitelist

liancheng · 2016-07-05T08:15:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

      val (pushDown, stayUp) = splitConjunctivePredicates(condition).partition { cond =>
-        cond.references.subsetOf(partitionAttrs) && cond.deterministic &&
+        isPredicatePushdownAble = isPredicatePushdownAble && cond.deterministic
+        isPredicatePushdownAble && cond.references.subsetOf(partitionAttrs) &&


The following can be easier to read:

    val (candidates, containingNonDeterministic) =
      splitConjunctivePredicates(condition).span(_.deterministic)
    val (pushDown, rest) = candidates.partition { cond =>
      cond.references.subsetOf(partitionAttrs) && partitionAttrs.forall(_.isInstanceOf[Attribute])
    }
    val stayUp = rest ++ containingNonDeterministic

And we should move the partitionAttrs.forall(_.isInstanceOf[Attribute]) predicate out of the closure.

Sure, I'll update that!
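
The span-then-partition idea suggested in this thread can be tried outside Spark with a small sketch. `Pred` is a hypothetical stand-in for a Catalyst expression that carries only the two properties used here, and `partitionAttrs` is modeled as a plain `Set[String]`; neither is the real Catalyst type.

```scala
object SplitSketch {
  // Hypothetical stand-in for an expression: the attribute names it references
  // and whether it is deterministic.
  final case class Pred(name: String, references: Set[String], deterministic: Boolean)

  // Returns (pushDown, stayUp).
  def split(conjuncts: Seq[Pred], partitionAttrs: Set[String]): (Seq[Pred], Seq[Pred]) = {
    // span stops at the first non-deterministic conjunct, so everything after
    // it stays above the operator even if it is deterministic itself.
    val (candidates, containingNonDeterministic) = conjuncts.span(_.deterministic)
    // Among the leading deterministic conjuncts, push down only those that
    // reference nothing but partition attributes.
    val (pushDown, rest) = candidates.partition(_.references.subsetOf(partitionAttrs))
    (pushDown, rest ++ containingNonDeterministic)
  }
}
```

Using `span` rather than `partition` for the first step is what preserves order: a deterministic predicate that follows a non-deterministic one is not considered for pushdown.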

SparkQA · 2016-07-05T09:47:47Z

Test build #61748 has finished for PR 14012 at commit 856d86d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-05T10:54:02Z

Test build #61751 has finished for PR 14012 at commit c005645.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-05T11:03:47Z

Test build #61752 has finished for PR 14012 at commit a4e62f0.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-05T13:09:34Z

Test build #61753 has finished for PR 14012 at commit 33d45de.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2016-07-07T17:02:32Z

cc @liancheng please review this PR, thanks!

liancheng · 2016-07-08T06:49:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -1106,21 +1106,32 @@ object PushDownPredicate extends Rule[LogicalPlan] with PredicateHelper {
    // Push [[Filter]] operators through [[Window]] operators. Parts of the predicate that can be
    // pushed beneath must satisfy the following two conditions:


Nit: Remove "two".

liancheng · 2016-07-08T06:55:25Z

LGTM except for some minor comments. Thanks for improving this!

liancheng · 2016-07-08T06:56:26Z

One more thing, please complete the PR title.

cloud-fan · 2016-07-08T08:12:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+      // This is for ensuring all the partitioning expressions have been converted to alias
+      // in Analyzer. Thus, we do not need to check if the expressions in conditions are
+      // the same as the expressions used in partitioning columns.
+      if (partitionAttrs.forall(_.isInstanceOf[Attribute])) {


I don't think this check is necessary. partitionAttrs is an AttributeSet, and AttributeSet extends Traversable[Attribute].

Yep, I'll remove that check, thanks!
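
The reasoning behind dropping the check can be illustrated with a small sketch: once a collection's element type is already the attribute type, a runtime `isInstanceOf` test over its elements cannot fail. `Expr`, `Attr` and `Add` are hypothetical miniature types for this illustration, not Catalyst's classes.

```scala
object RedundantCheckSketch {
  // Hypothetical miniature expression hierarchy.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // On a collection of arbitrary expressions the runtime test is meaningful.
  def allAttrs(exprs: Seq[Expr]): Boolean = exprs.forall(_.isInstanceOf[Attr])

  // On a collection whose element type is already Attr the same test is
  // vacuous: every element passes, and an empty set passes trivially.
  def vacuous(attrs: Set[Attr]): Boolean = attrs.forall(_.isInstanceOf[Attr])
}
```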

cloud-fan · 2016-07-08T08:19:31Z

Is there a typo in the PR title? currectly -> correctly

SparkQA · 2016-07-08T17:40:55Z

Test build #61985 has finished for PR 14012 at commit 98d369e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-07-09T12:59:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+   * @return (pushDown, stayUp)
+   */
+  private def splitPushdownPredicates(
+      condition: Expression)(specificRules: (Expression) => Boolean) = {


Hmm, it looks like duplicating this code is more readable. @liancheng, what do you think?

Yea... @jiangxb1987 Sorry that I had to agree with @cloud-fan. Seems that factoring out this method makes the code harder to understand, mostly because the semantics of specificRules is quite convoluted. Could you please revert this part? Sorry again for the trouble!

I have reverted this part, thanks!

jiangxb1987 · 2016-07-11T11:29:47Z

@liancheng please find some time to review the latest updates, thanks!


SparkQA · 2016-07-11T18:58:43Z

Test build #62100 has finished for PR 14012 at commit 9b2b5a8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2016-07-13T16:22:09Z

Thanks! Merged this to master.

JoshRosen · 2016-09-29T18:43:25Z

Since this is a correctness fix, albeit a minor one, I'd like to backport it. I'm going to cherry-pick this to branch-2.0.

…dicates correctly in non-deterministic condition.

## What changes were proposed in this pull request?

Currently our Optimizer may reorder the predicates to run them more efficient, but in non-deterministic condition, change the order between deterministic parts and non-deterministic parts may change the number of input rows. For example:
```SELECT a FROM t WHERE rand() < 0.1 AND a = 1```
And
```SELECT a FROM t WHERE a = 1 AND rand() < 0.1```
may call rand() for different times and therefore the output rows differ.

This PR improved this condition by checking whether the predicate is placed before any non-deterministic predicates.

## How was this patch tested?

Expanded related testcases in FilterPushdownSuite.

Author: 蒋星博 <jiangxingbo@meituan.com>

Closes #14012 from jiangxb1987/ppd.

(cherry picked from commit f376c37)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

[SPARK-16343][SQL] Improve the PushDownPredicate rule to pushdown predicates currectly in non-deterministic condition.

856d86d

liancheng reviewed Jul 5, 2016

refactor code to be easier to read.

c005645

fix Scalastyle check fails.

a4e62f0

fix Scalastyle check fails.

33d45de

liancheng reviewed Jul 8, 2016

jiangxb1987 changed the title ~~[SPARK-16343][SQL] Improve the PushDownPredicate rule to pushdown pre…~~ [SPARK-16343][SQL] Improve the PushDownPredicate rule to pushdown predicates currectly in non-deterministic condition. Jul 8, 2016

cloud-fan reviewed Jul 8, 2016

jiangxb1987 changed the title ~~[SPARK-16343][SQL] Improve the PushDownPredicate rule to pushdown predicates currectly in non-deterministic condition.~~ [SPARK-16343][SQL] Improve the PushDownPredicate rule to pushdown predicates correctly in non-deterministic condition. Jul 8, 2016

jiangxb1987 added 2 commits July 8, 2016 22:40

remove unnessary check

f40a0fa

refactor the split pushdown predicates related logic as a helper method.

98d369e

cloud-fan reviewed Jul 9, 2016

revert the previous change because duplicating these codes is more readable.

9b2b5a8

asfgit closed this in f376c37 Jul 13, 2016

jiangxb1987 mentioned this pull request Jul 20, 2016

[SPARK-14172][SQL] Hive table partition predicate not passed down correctly #13893

Closed

jiangxb1987 deleted the ppd branch August 12, 2016 08:07

gatorsmile mentioned this pull request Aug 25, 2016

[SPARK-17244] Catalyst should not pushdown non-deterministic join conditions #14815

Closed
