
[SPARK-9082][SQL] Filter using non-deterministic expressions should not be pushed down #7446

Closed
wants to merge 6 commits

Conversation

cloud-fan
Contributor

No description provided.

@cloud-fan
Contributor Author

cc @yhuai

@SparkQA

SparkQA commented Jul 16, 2015

Test build #37499 has finished for PR 7446 at commit 804754d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

test("nondeterministic: can't push down filter through project") {
  val originalQuery = testRelation
    .select(Rand(10).as('rand))
    .where('rand > 5)
Contributor

Maybe it is better to use a condition having both deterministic and non-deterministic expressions, e.g.

val originalQuery = testRelation
  .select(Rand(10).as('rand), 'a)
  .where('rand > 5 && 'a > 5)

@SparkQA

SparkQA commented Jul 16, 2015

Test build #37518 has finished for PR 7446 at commit b5b3c85.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Jul 16, 2015

Thanks for the fix! There is one thing I think we should double-check. Say I have a partitioned Parquet table, I select a few columns from it, and then I have predicates on top of that. These predicates include non-deterministic expressions, a predicate on the partition column, and some predicates that can be pushed down to the Parquet table scan (i.e., into Parquet's reader). With our fix, will the partitioning column still get correctly pruned, and will those predicates still get pushed down to Parquet's reader? ParquetFilterSuite.scala is the file for testing Parquet filter pushdown. We may need to add tests for partition pruning. @liancheng Where are our tests for partition pruning?

test("nondeterministic: can't push down filter through project") {
  val originalQuery = testRelation
    .select(Rand(10).as('rand), 'a)
    .where('rand > 5 && 'a > 5)
Contributor Author

Looks like we should optimize it into

testRelation
  .where('a > 5)
  .select(Rand(10).as('rand), 'a)
  .where('rand > 5)

Can we do this?

Contributor

I feel it may require many more changes. I am fine if we do that in a separate PR in the future.
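The optimization discussed above can be sketched in plain Scala (a simplified stand-in expression model, not Catalyst's real classes; in the real optimizer the condition's attributes would first be substituted through the project's alias map): split a conjunctive condition, push the deterministic conjuncts below the project, and keep the nondeterministic ones in a filter above it.

```scala
// Minimal expression model with a `deterministic` flag, mirroring Catalyst's contract.
sealed trait Expr { def deterministic: Boolean }
case class Attr(name: String) extends Expr { val deterministic = true }
case class Rand(seed: Int) extends Expr { val deterministic = false }
case class Gt(left: Expr, value: Int) extends Expr {
  def deterministic = left.deterministic
}
case class And(conjuncts: Seq[Expr]) extends Expr {
  def deterministic = conjuncts.forall(_.deterministic)
}

// Split `cond` into conjuncts that may be pushed below the project and
// conjuncts that must stay above it.
def splitForPushdown(cond: And): (Seq[Expr], Seq[Expr]) =
  cond.conjuncts.partition(_.deterministic)

val cond = And(Seq(Gt(Rand(10), 5), Gt(Attr("a"), 5)))
val (pushable, retained) = splitForPushdown(cond)
// pushable: Seq(Gt(Attr("a"), 5)) — safe to push below the project
// retained: Seq(Gt(Rand(10), 5))  — must stay above it
```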

@SparkQA

SparkQA commented Jul 16, 2015

Test build #37527 has finished for PR 7446 at commit 10bdd29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

@yhuai I believe expressions with UDF(s) are never pushed down in Parquet or any other data source. Only simple comparison and string predicates dealing with constants can be pushed down. I'm double-checking partition pruning.

@yhuai
Contributor

yhuai commented Jul 16, 2015

Yeah. But my concern is whether this fix will prevent any legitimate predicates from being pushed down.

// We only push down filter if their overlapped expressions are all
// deterministic.
val hasNondeterministic = condition.collect {
  case a: Attribute if aliasMap.contains(a) => aliasMap(a)
Contributor

We probably cannot use `contains` here, as we have to use `semanticEquals` to find the identical expression.

Contributor Author

But `aliasMap` is an `AttributeMap`, so I think it should be safe to call `contains` here?

Contributor

Oh yes, the original code seems to have a bug: it uses `.toMap`.
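The distinction in this thread can be illustrated with a simplified model (hypothetical `Attr`/`AttrMap` classes, not Catalyst's real ones): Catalyst attributes carry an `exprId`, and two references to the same attribute can differ in cosmetic fields such as qualifiers, so a plain `Map` keyed on the whole object may miss a lookup that an `exprId`-keyed `AttributeMap` would hit.

```scala
// Simplified stand-in for Catalyst's AttributeReference: identity is the exprId,
// while fields like `qualifier` can vary between references to the same column.
case class Attr(name: String, exprId: Long, qualifier: Option[String])

// Stand-in for Catalyst's AttributeMap: keys lookups on exprId only.
class AttrMap[A](entries: Seq[(Attr, A)]) {
  private val byId = entries.map { case (k, v) => k.exprId -> v }.toMap
  def contains(a: Attr): Boolean = byId.contains(a.exprId)
}

val defined  = Attr("rand", 1L, None)
val referred = Attr("rand", 1L, Some("t")) // same attribute, different qualifier

val plain = Map(defined -> "alias")
// plain Map: structural equality, so the qualified reference misses
// plain.contains(referred) == false

val attrMap = new AttrMap(Seq(defined -> "alias"))
// exprId-keyed map: the qualified reference still hits
// attrMap.contains(referred) == true
```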

val hasNondeterministic = projectList1.flatMap(_.collect {
  case a: Attribute if aliasMap.contains(a) => aliasMap(a).child
}).exists(_.find(!_.deterministic).isDefined)
val canCollapse = projectList1.find(hasNondeterministic(_, aliasMap)).isEmpty
Contributor Author

An unrelated but small change: we don't need to go through the whole project list; we can stop once we find a nondeterministic expression.
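A small plain-Scala illustration of why `exists` is preferable here: it short-circuits on the first match, whereas a `flatMap`/`collect` pipeline materializes the whole intermediate list before the check.

```scala
// Count how many elements the predicate actually inspects.
var visited = 0
val items = Seq(1, 2, 3, 4, 5)

// `exists` stops at the first hit (here: the first even number).
val found = items.exists { x => visited += 1; x % 2 == 0 }

// found == true, and only 1 and 2 were inspected; 3, 4, 5 were never touched.
assert(found)
assert(visited == 2)
```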

@SparkQA

SparkQA commented Jul 17, 2015

Test build #37618 has finished for PR 7446 at commit 0e5e2d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logDebug("isMulticlass = " + metadata.isMulticlass)
    • * (i.e., if isMulticlass && isSpaceSufficientForAllCategoricalSplits),
    • logDebug("isMulticlass = " + metadata.isMulticlass)
    • abstract class UnsafeProjection extends Projection
    • case class FromUnsafeProjection(fields: Seq[DataType]) extends Projection
    • abstract class BaseProjection extends Projection
    • class SpecificProjection extends $
    • class SpecificProjection extends $

val hasNondeterministic = projectList1.flatMap(_.collect {
  case a: Attribute if aliasMap.contains(a) => aliasMap(a).child
}).exists(_.find(!_.deterministic).isDefined)
val noCollapse = projectList1.exists(hasNondeterministic(_, aliasMap))
Contributor Author

An unrelated but small change: we don't need to go through the whole project list; we can stop once we find a nondeterministic expression.

@SparkQA

SparkQA commented Jul 17, 2015

Test build #37630 has finished for PR 7446 at commit 33eb2d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sourceAliases: AttributeMap[Alias]) = {
  project.exists {
    case a: Attribute =>
      sourceAliases.get(a).map(_.child.exists(!_.deterministic)).getOrElse(false)
Contributor

        sourceAliases.get(a).exists(_.child.exists(!_.deterministic))

@SparkQA

SparkQA commented Jul 21, 2015

Test build #37901 has finished for PR 7446 at commit 330021e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait ExpectsInputTypes extends Expression
    • trait ImplicitCastInputTypes extends ExpectsInputTypes
    • trait Unevaluable extends Expression
    • trait Nondeterministic extends Expression
    • trait CodegenFallback extends Expression
    • case class Hex(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class Unhex(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class Ascii(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class Base64(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class UnBase64(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class FakeFileStatus(

@cloud-fan
Contributor Author

ping @yhuai

@liancheng
Contributor

@yhuai Sorry that I misunderstood your question at first. I think this PR is safe for normal filter push-down. Verified that partition pruning and Parquet filter push-down are both working properly.

@liancheng
Contributor

LGTM

@yhuai
Contributor

yhuai commented Jul 22, 2015

LGTM. I am merging it to master.

// Split the condition into small conditions by `And`, so that we can push down part of this
// condition without nondeterministic expressions.
val andConditions = splitConjunctivePredicates(condition)
val nondeterministicConditions = andConditions.filter(hasNondeterministic(_, aliasMap))
Contributor

Maybe using `partition` here would be better?
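The suggestion above, sketched in plain Scala with stand-in string conditions and a hypothetical `isNondeterministic` check: `partition` splits a sequence into matching and non-matching halves in a single pass, instead of the two complementary `filter`/`filterNot` traversals.

```scala
val conditions = Seq("rand > 5", "a > 5", "b < 3")
// Stand-in check; the real code would inspect the expression tree.
def isNondeterministic(c: String): Boolean = c.contains("rand")

// One pass: matching conjuncts first, the rest second.
val (nondeterministic, deterministic) = conditions.partition(isNondeterministic)

// Equivalent two-pass version this replaces:
// val nondeterministic = conditions.filter(isNondeterministic)
// val deterministic    = conditions.filterNot(isNondeterministic)

// nondeterministic == Seq("rand > 5")
// deterministic    == Seq("a > 5", "b < 3")
```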

@asfgit asfgit closed this in 7652095 Jul 22, 2015
asfgit pushed a commit that referenced this pull request Jul 23, 2015
…ghProject`

a follow up of #7446

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7607 from cloud-fan/tmp and squashes the following commits:

7106989 [Wenchen Fan] use `partition` in `PushPredicateThroughProject`
@cloud-fan cloud-fan deleted the filter branch August 27, 2015 13:01
6 participants