
[SPARK-20758][SQL] Add Constant propagation optimization #17993

Closed

Conversation

tejasapatil
Contributor

@tejasapatil tejasapatil commented May 16, 2017

What changes were proposed in this pull request?

See class doc of ConstantPropagation for the approach used.

How was this patch tested?

  • Added unit tests
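
For readers skimming the thread, the rule's core idea can be sketched on a toy expression tree (the AST and names below are illustrative only, not Catalyst's): equality predicates of the form attribute = literal found among the conjuncts of an AND are used to substitute that attribute in the remaining predicates.

```scala
// Toy AST, not Catalyst. Bindings are collected only from top-level
// conjuncts, mirroring the rule's guard against Not/Or subtrees.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(value: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr
case class Eq(l: Expr, r: Expr) extends Expr
case class Conj(l: Expr, r: Expr) extends Expr // AND

object ToyConstantPropagation {
  // attr = literal conjuncts yield attribute bindings.
  private def bindings(e: Expr): Map[String, Int] = e match {
    case Conj(l, r)          => bindings(l) ++ bindings(r)
    case Eq(Attr(a), Lit(v)) => Map(a -> v)
    case _                   => Map.empty
  }

  // Substitute bound attributes, but keep the defining attr = literal
  // predicates intact: replacing inside them would yield a trivial
  // 1 = 1 and silently drop the constraint.
  private def substitute(e: Expr, env: Map[String, Int]): Expr = e match {
    case Conj(l, r)               => Conj(substitute(l, env), substitute(r, env))
    case eq @ Eq(Attr(_), Lit(_)) => eq
    case Eq(l, r)                 => Eq(substitute(l, env), substitute(r, env))
    case Add(l, r)                => Add(substitute(l, env), substitute(r, env))
    case Attr(a)                  => env.get(a).map(Lit(_)).getOrElse(Attr(a))
    case other                    => other
  }

  def propagate(e: Expr): Expr = substitute(e, bindings(e))
}
```

For example, propagate(Conj(Eq(Attr("a"), Lit(1)), Eq(Attr("b"), Add(Attr("a"), Lit(2))))) rewrites the a inside the second conjunct to Lit(1) while preserving the defining a = 1 predicate, loosely mirroring the constantsMap / predicates split the review below converges on.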

@tejasapatil
Contributor Author

Jenkins test this please

@hvanhovell
Contributor

ok to test

@hvanhovell
Contributor

It is weird that Jenkins is not kicking off.

*/
object ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper {

def containsNonConjunctionPredicates(expression: Expression): Boolean = expression match {
Contributor

This might be more straightforward:

expression.find {
  case _: Not | _: Or => true
  case _ => false
}.isDefined

Contributor Author

did this change


def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case q: LogicalPlan => q transformExpressionsUp {
case and @ (left And right)
Contributor

case and: And if containsNonConjunctionPredicates(and)?

Contributor Author

did this change

case and @ (left And right)
if !containsNonConjunctionPredicates(left) && !containsNonConjunctionPredicates(right) =>

val leftEntries = left.collect {
Contributor

Let's put the collect in a function, so we can avoid the repetition.

Contributor Author

sure

val predicates = (leftEntries.map(_._2) ++ rightEntries.map(_._2)).toSet

def replaceConstants(expression: Expression) = expression transform {
case a: AttributeReference if constantsMap.contains(a) =>
Contributor

I don't think the double lookup is necessary. constantsMap.get(a).getOrElse(a) should cover this.

Contributor

What happens if I do something stupid like i = 1 and ((j = 1) = (j = i))? I think j = 1 might be replaced by 1 = 1.

Contributor Author

Nice catch !!! I changed the logic to handle that.
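
The hazard pointed out above can be checked with plain boolean arithmetic: in i = 1 AND ((j = 1) = (j = i)), the inner j = 1 only has its truth value compared, it is never asserted to hold, so it must not contribute a j -> 1 binding. A minimal sketch of how a naive rewrite changes the answer (toy values only, not the PR's actual code):

```scala
// With i = 1 and j = 5 the original predicate is TRUE:
// j = 1 is false, j = i is false, and false == false.
val (i, j) = (1, 5)
val original = (i == 1) && ((j == 1) == (j == i))

// A naive rule that also harvests the nested "j = 1" as a binding and
// substitutes j -> 1 (together with the legitimate i -> 1) rewrites
// "j = i" into "1 = 1":
val naive = (i == 1) && ((j == 1) == (1 == 1))

assert(original)   // the row satisfies the original predicate
assert(!naive)     // the naive rewrite filters the row out: wrong result
```

This is why bindings may only come from equalities that are actual conjuncts of the AND, not from equalities nested inside other expressions.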

@SparkQA

SparkQA commented May 16, 2017

Test build #76964 has finished for PR 17993 at commit bb3b349.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


and transform {
case e @ EqualTo(_, _) if !predicates.contains(e) &&
e.references.exists(ref => constantsMap.contains(ref)) =>
Contributor

Building the references map is more expensive, shall we just skip this?

Contributor Author

skipped it

if !containsNonConjunctionPredicates(left) && !containsNonConjunctionPredicates(right) =>

val leftEntries = left.collect {
case e @ EqualTo(left: AttributeReference, right: Literal) => ((left, right), e)
Member

How about EqualNullSafe? Normally, we use Equality

Contributor Author

did this change

private val columnB = 'b.int

/**
* Unit tests for constant propagation in expressions.
Member

Hi, @tejasapatil. Nit: it looks like a test suite comment. Can we move this comment to line 27?

Contributor Author

did this change

@tejasapatil tejasapatil force-pushed the SPARK-20758_const_propagation branch 2 times, most recently from cc026da to b8c4147 Compare May 20, 2017 22:32
}.isDefined

def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case f: Filter => f transformExpressionsUp {
Contributor Author

I was initially doing this for the entire logical plan but have now switched to doing it only for the Filter operator.
Reason: doing this for the entire logical plan will mess up JOIN predicates, e.g.

SELECT * FROM a JOIN b ON a.i = 1 AND b.i = a.i
=>
 SELECT * FROM a JOIN b ON a.i = 1 AND b.i = 1

.. the result is a cartesian product and Spark fails (asking the user to set a config). In the case of OUTER JOINs, changing the join predicates might cause a regression.

Contributor

Maybe I am being myopic here, but the result should be the same, right? The only way this regresses is when we plan a CartesianProduct instead of a BroadcastNestedLoopJoin... I am fine with not optimizing this for now; it would be nice if these constraints are at least generated here.

Contributor Author

Yes, the result should be the same. I don't have any proof that doing this over joins will be safe, so I want to be cautious here ... a bad rule might lead to correctness bugs, which is super bad for end users.

it would be nice if these constraints are at least generated here

Sorry, I am not able to follow you here and want to make sure I am not ignoring your comment. Are you suggesting any changes over the existing version?

Contributor

We currently infer IS NOT NULL constraints up and down the plan. This could easily be extended to other constraints. Your PR has some overlap with this. However, let's focus on getting this merged first, and then we might take a stab at extending it.

Contributor

also cc @sameeragarwal

@SparkQA

SparkQA commented May 20, 2017

Test build #77133 has finished for PR 17993 at commit cc026da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 21, 2017

Test build #77134 has finished for PR 17993 at commit b8c4147.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

Jenkins test this please

@SparkQA

SparkQA commented May 21, 2017

Test build #77160 has started for PR 17993 at commit aaad78c.

@tejasapatil
Contributor Author

Jenkins test this please

@SparkQA

SparkQA commented May 22, 2017

Test build #77193 has finished for PR 17993 at commit aaad78c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 26, 2017

Test build #77392 has finished for PR 17993 at commit 399f348.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

* in the AND node.
*/
object ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper {
def containsNonConjunctionPredicates(expression: Expression): Boolean = expression.find {
Member

private def?

Contributor Author

done

}

val constantsMap = AttributeMap(equalityPredicates.map(_._1))
val predicates = equalityPredicates.map(_._2).toSet
Member

I'm wondering if it's safe when we have both a = 1 and a = 2 at the same time?

Contributor Author

The current impl will pick the last one (i.e. a = 2) and propagate it. Given that it's one of the equality predicates the user provided, there is nothing wrong in propagating it. When the query is evaluated, it will return an empty result, given that a = 1 and a = 2 cannot be true at the same time.

scala> hc.sql(" SELECT * FROM table1 a WHERE a.j = 1 AND a.j = 2 AND a.k = (a.j + 3)").explain(true)

== Physical Plan ==
*Project [i#51, j#52, k#53]
+- *Filter ((((isnotnull(k#53) && isnotnull(j#52)) && (j#52 = 1)) && (j#52 = 2)) && (cast(k#53 as int) = 5))
   +- *FileScan orc default.table1[i#51,j#52,k#53] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/Users/tejasp/warehouse/table1], PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j), EqualTo(j,1), EqualTo(j,2)], ReadSchema: struct<i:int,j:int,k:string>

.where(columnA === Literal(11) && columnB === Literal(10)).analyze

comparePlans(Optimize.execute(query.analyze), correctAnswer)
}
Member

Could you add a negative test case like SELECT * FROM t WHERE a=1 and a=2 and b=a+3?

Contributor Author

added

- FileSourceStrategySuite.partitioned table
- FileSourceStrategySuite.partitioned table - case insensitive
- FileSourceStrategySuite.partitioned table - after scan filters
@tejasapatil tejasapatil force-pushed the SPARK-20758_const_propagation branch from 399f348 to 731f796 Compare May 26, 2017 18:37
@SparkQA

SparkQA commented May 26, 2017

Test build #77432 has finished for PR 17993 at commit 731f796.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

and transform {
case e @ EqualTo(_, _) if !predicates.contains(e) => replaceConstants(e)
Contributor

Should we check for identity instead of equality? I think you are doing the latter. What will happen in the following example: select * from bla where (a = 1 or b = 2) and a = 1

Contributor Author

@tejasapatil tejasapatil May 28, 2017

Here is the behavior with this PR. It seems reasonable: a = 1 has to be true, so (a = 1 or b = 2) would always be true and can be eliminated.

scala> hc.sql(" select * from bla where (a = 1 or b = 2) and a = 1 ").explain(true)

== Physical Plan ==
*Project [a#34, b#35]
+- *Filter (isnotnull(a#34) && (a#34 = 1))
   +- *FileScan ....

.where(
columnA === Literal(11) &&
columnB === Literal(10) &&
(columnA === Add(columnC, Literal(3)) || Literal(10) === columnC))
Contributor

Should we be able to infer that columnA == Literal(11)?

Contributor

Perhaps if you increase the number of iterations on ConstantPropagation batch...

Contributor Author

I had set a higher value for iterations in a previous version of this PR, but somehow the unit tests kept failing for me from the terminal (surprisingly, they worked fine in IntelliJ). This seems unrelated to the change done in the PR. If you have any advice here, let me know.

eg. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77193/testReport/org.apache.spark.sql.catalyst.optimizer/ConstantPropagationSuite/basic_test/

sbt.ForkMain$ForkError: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Max iterations (2) reached for batch ConstantPropagation, tree:
!Filter ((a#82440 = 11) && (b#82441 = 10))
+- !Project [a#82440]
   +- LocalRelation <empty>, [a#82437, b#82438, c#82439]

	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:105)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
	at org.apache.spark.sql.catalyst.optimizer.ConstantPropagationSuite$$anonfun$1.apply$mcV$sp(ConstantPropagationSuite.scala:60)
	at org.apache.spark.sql.catalyst.optimizer.ConstantPropagationSuite$$anonfun$1.apply(ConstantPropagationSuite.scala:50)

Contributor

Hmmm... that means the optimizer is not converging to a fixed point. Could you try to increase the number of iterations? You can also check if the optimizer reaches 100 iterations during regular execution; it should log a warning. If it does, something is wrong with the rule, and it might cause the optimizer to run prohibitively long...

Contributor Author

That did it!! Updated the change.
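
The "Max iterations (2) reached" failure discussed above comes from the optimizer's fixed-point driver: a batch is re-applied until the plan stops changing or the iteration budget runs out. A self-contained sketch of that loop (a hypothetical fixedPoint helper, not Spark's actual RuleExecutor):

```scala
// Re-apply `rule` until the value stops changing (a fixed point) or the
// iteration budget is exhausted, in which case we fail loudly, roughly
// like RuleExecutor's "Max iterations (n) reached for batch ..." error.
@annotation.tailrec
def fixedPoint[A](current: A, maxIterations: Int, iteration: Int = 1)(rule: A => A): A = {
  val next = rule(current)
  if (next == current) current // converged: rule no longer changes anything
  else if (iteration >= maxIterations)
    sys.error(s"Max iterations ($maxIterations) reached")
  else fixedPoint(next, maxIterations, iteration + 1)(rule)
}
```

For example, fixedPoint(10, 100)(n => if (n > 0) n - 1 else n) converges to 0, while the same rule with maxIterations = 3 trips the error, which is what a rule that keeps rewriting the plan back and forth does to the optimizer.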

@SparkQA

SparkQA commented May 29, 2017

Test build #77493 has finished for PR 17993 at commit 38543de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

LGTM - merging to master. Thanks!

@asfgit asfgit closed this in f9b59ab May 29, 2017
@tejasapatil tejasapatil deleted the SPARK-20758_const_propagation branch June 16, 2017 17:56