
[SPARK-13919] [SQL] fix column pruning through filter #11828

Closed
wants to merge 13 commits

Conversation

davies
Contributor

@davies davies commented Mar 18, 2016

What changes were proposed in this pull request?

This PR fixes the conflict between ColumnPruning and PushPredicatesThroughProject: ColumnPruning tries to insert a Project before a Filter, but PushPredicatesThroughProject then moves the Filter back before the Project, so the two rules undo each other on every iteration. This is fixed by removing the Project before the Filter when that Project only does column pruning.

RuleExecutor will now fail the test if it reaches the maximum number of iterations.

Closes #11745

How was this patch tested?

Existing tests.

One test case is still failing; it is disabled for now and will be fixed by https://issues.apache.org/jira/browse/SPARK-14137

*/
object ColumnPruning extends Rule[LogicalPlan] {
  private def sameOutput(output1: Seq[Attribute], output2: Seq[Attribute]): Boolean =
    output1.size == output2.size &&
      output1.zip(output2).forall(pair => pair._1.semanticEquals(pair._2))

-  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  def apply(plan: LogicalPlan): LogicalPlan = removeProjectBeforeFilter(plan transform {
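For context, the wrapper introduced here can be sketched as follows (assembled from the pattern quoted later in this thread; treat it as a sketch, not the exact merged source):

```scala
/**
 * Sketch: remove a Project that sits between a Filter and its child when that
 * Project only prunes columns (its output is a subset of the child's output).
 * Dropping it cannot change the Filter's result, and it removes the node that
 * PushPredicatesThroughProject would otherwise keep moving the Filter across.
 */
private def removeProjectBeforeFilter(plan: LogicalPlan): LogicalPlan = plan transform {
  case p1 @ Project(_, f @ Filter(_, p2 @ Project(_, child)))
      if p2.outputSet.subsetOf(child.outputSet) =>
    p1.copy(child = f.copy(child = child))
}
```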
Member

Here, we are using transform, which is actually transformDown. In this rule, ColumnPruning, we could add many Projects into the child. This could easily cause a stack overflow. That is why my PR #11745 changes it to transformUp. Do you think this change makes sense?

Contributor Author

Column pruning has to go from top to bottom, or you will need multiple runs of this rule. The added Project is exactly the same whether you go from the top or from the bottom. Going from the bottom will sometimes not work (because the added Project will be moved by other rules, for example filter pushdown).

Have you actually seen a stack overflow from this rule? I don't think so.

Member

If we use transformUp, removeProjectBeforeFilter's assumption no longer holds. The following pattern does not cover all the cases:

case p1 @ Project(_, f @ Filter(_, p2 @ Project(_, child)))
   if p2.outputSet.subsetOf(child.outputSet) =>

Member

I saw the stack overflow in my local environment.

Member

I think my PR #11745 covers all the cases even if we change transform to transformUp.

Contributor Author

We should not change transform to transformUp. It would be great if you could post a test case that causes the StackOverflowError, thanks!

Member

Will do it tonight. I don't have one right now.

Member

I am unable to reproduce the stack overflow now if we keep the following line in ColumnPruning:

    // Eliminate no-op Projects
    case p @ Project(projectList, child) if sameOutput(child.output, p.output) => child

If we remove the above line, we get the stack overflow easily because we can generate duplicate Projects. Anyway, I am fine with using transformDown.
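To see why that line matters, here is a minimal self-contained toy model (illustrative only, not Catalyst's LogicalPlan): a Project whose output equals its child's output is a no-op, and removing it keeps the optimizer from stacking duplicate Projects on every iteration.

```scala
// Toy plan tree, illustrative only (not Spark's LogicalPlan).
sealed trait Plan { def output: Seq[String] }
case class Relation(cols: Seq[String]) extends Plan { def output: Seq[String] = cols }
case class Project(cols: Seq[String], child: Plan) extends Plan { def output: Seq[String] = cols }

// Mirrors "case p @ Project(...) if sameOutput(child.output, p.output) => child":
// a Project whose output equals its child's output is a no-op and is removed.
def eliminateNoOp(p: Plan): Plan = p match {
  case Project(cols, child) =>
    val c = eliminateNoOp(child)
    if (cols == c.output) c else Project(cols, c)
  case r: Relation => r
}

val plan = Project(Seq("a"), Project(Seq("a"), Relation(Seq("a", "b"))))
val pruned = eliminateNoOp(plan)
// The outer Project matches its child's output and is dropped;
// the inner Project still prunes column b.
```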

Contributor Author

There is no reason we should remove this line.

Member

If transformDown is required here, could you change transform to transformDown? I got this from the comment on the transform function:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L242-L243
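For readers unfamiliar with the distinction being debated, here is a self-contained toy sketch of the two traversal orders (illustrative only, not TreeNode's actual implementation): transform/transformDown applies the rule parent-first, transformUp child-first, which can yield different results for the same rule.

```scala
// Toy tree, illustrative only (not Spark's TreeNode).
sealed trait Node
case class Leaf(v: Int) extends Node
case class Branch(l: Node, r: Node) extends Node

// Parent-first: apply the rule, then recurse into the (possibly new) children.
def transformDown(n: Node)(rule: PartialFunction[Node, Node]): Node = {
  val applied = rule.applyOrElse(n, identity[Node])
  applied match {
    case Branch(l, r) => Branch(transformDown(l)(rule), transformDown(r)(rule))
    case leaf => leaf
  }
}

// Child-first: recurse into the children, then apply the rule to the result.
def transformUp(n: Node)(rule: PartialFunction[Node, Node]): Node = {
  val withChildren = n match {
    case Branch(l, r) => Branch(transformUp(l)(rule), transformUp(r)(rule))
    case leaf => leaf
  }
  rule.applyOrElse(withChildren, identity[Node])
}

val t: Node = Branch(Branch(Leaf(1), Leaf(2)), Leaf(3))
val fold: PartialFunction[Node, Node] = { case Branch(Leaf(a), Leaf(b)) => Leaf(a + b) }
// transformUp folds the whole tree, while transformDown stops after one level
// because it never revisits a node once its children have been rewritten.
```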

@SparkQA

SparkQA commented Mar 18, 2016

Test build #53560 has finished for PR 11828 at commit ffe1270.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 18, 2016

Test build #53572 has finished for PR 11828 at commit d956aad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 19, 2016

Test build #53574 has finished for PR 11828 at commit b26d1c0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 19, 2016

Test build #53577 has finished for PR 11828 at commit 920de45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -297,7 +302,7 @@ class ColumnPruningSuite extends PlanTest {
         SortOrder('b, Ascending) :: Nil,
         UnspecifiedFrame)).as('window) :: Nil,
       'a :: Nil, 'b.asc :: Nil)
-      .select('a, 'c, 'window).where('window > 1).select('a, 'c).analyze
+      .where('window > 1).select('a, 'c).analyze
Member

Any reason for removing .select('a, 'c, 'window)? It seems like the previous one was a better plan, right?

Contributor Author

The select before the where helps nothing and could be worse (without whole-stage codegen); is it really a better plan?

Member

If so, it becomes harder for the Optimizer to judge which plan is better. Based on my understanding, the general principle of ColumnPruning is to do its best to add extra Projects to prune unnecessary columns, pushing the Project down as deep as possible. In this case, .select('a, 'c, 'window) prunes the useless column b.

Could you explain the current strategy for this rule? We might need to add more test cases to check if it does the desired work.

Member

After more thinking, can we modify the existing operator Filter by adding the functionality of Project into Filter?

Contributor Author

Added a comment for that.

I don't think it's necessary, or a good idea, to add the functionality of Project into Filter.

Member

Got it. It is easier to understand it now. : )

@SparkQA

SparkQA commented Mar 19, 2016

Test build #53600 has finished for PR 11828 at commit b1118e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor Author

davies commented Mar 21, 2016

cc @cloud-fan

@SparkQA

SparkQA commented Mar 21, 2016

Test build #53698 has finished for PR 11828 at commit 6e698cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

 * The Project before Filter is not necessary but conflicts with PushPredicatesThroughProject,
 * so remove it.
 */
private def removeProjectBeforeFilter(plan: LogicalPlan): LogicalPlan = plan transform {
Member

Same here. We still need to explicitly use transformDown.

Contributor Author

We use transform nearly everywhere, even when we know transformDown is what we mean, for example in all those rules that push down a predicate.

I think it's fine, or we should update all these places.

Contributor Author

It's still correct even if someone suddenly changes transform to transformUp.

Member

I see.

Is it possible that there are two consecutive Projects below the Filter?

Contributor Author

Two consecutive Projects will be combined by other rules.

Member

CollapseProject is called after this rule. Anyway, we can leave it here if no test case fails because of it.

@cloud-fan
Contributor

If my understanding is right, what we want is:

  • insert a Project below Filter, so that we may have a chance to push it down further.
  • push down Filter through Project, to reduce the number of input rows.

The problem is: at the time we push Filter through Project, we don't know if this Project can be pushed down further, and we may mistakenly lift it and make the plan sub-optimal.

I think logically we should put ColumnPruning and PushPredicatesThroughProject in different batches, so that when we push down Filter through Project, we can make sure this Project can't be pushed further and it's safe to lift it.

Anyway, this PR does fix the problem, but I think we should simplify the Operator Optimizations batch: it contains over 30 rules, and it is very hard to reason about how these rules interact with each other.
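One possible shape for such a split, sketched against RuleExecutor's Batch API (the batch names and rule groupings here are illustrative, not something proposed in this PR):

```scala
// Illustrative only: run ColumnPruning to a fixed point before any rule that
// moves Filters across Projects, so a pruning-only Project is never lifted
// prematurely by PushPredicatesThroughProject.
val batches =
  Batch("Column Pruning", fixedPoint, ColumnPruning, CollapseProject) ::
  Batch("Predicate Pushdown", fixedPoint, PushPredicatesThroughProject) :: Nil
```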

@davies
Contributor Author

davies commented Mar 22, 2016

@cloud-fan That's correct. The reason we keep them together is that many rules depend on each other. It makes sense to split them into multiple batches; I'm not sure how cleanly we can split them, but that's another topic.

@gatorsmile
Member

We need to add more tests when splitting the Optimizer rules into multiple batches. So far, the test coverage of the Optimizer is weak: we only evaluate the effects of the individual rules. After introducing Constraints, I found that more rules became correlated. For example, the rule SimplifyCasts affects predicate pushdown. : (

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54069 has finished for PR 11828 at commit 6a41cd4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54090 has finished for PR 11828 at commit bb8f0cc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -518,6 +438,23 @@ class Analyzer(
     def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
       case p: LogicalPlan if !p.childrenResolved => p

+      // If the projection list contains Stars, expand it.
+      case p: Project if containsStar(p.projectList) =>
Contributor Author

This is moved from ResolveStar without any changes.

Member

Thank you!

Contributor

Oh I see, this can speed up resolution for nested plans, thanks for fixing it!

@SparkQA

SparkQA commented Mar 25, 2016

Test build #54144 has finished for PR 11828 at commit cd7132e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@davies
Contributor Author

davies commented Mar 25, 2016

@cloud-fan Does this look good to you?

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Mar 25, 2016

Test build #54150 has finished for PR 11828 at commit d2da9e4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 25, 2016

Test build #2694 has finished for PR 11828 at commit d2da9e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor Author

davies commented Mar 25, 2016

Merging into master, thanks!

@asfgit asfgit closed this in 6603d9f Mar 25, 2016
asfgit pushed a commit that referenced this pull request Mar 25, 2016
…n-nullable attributes

## What changes were proposed in this pull request?

This PR adds support for automatically inferring `IsNotNull` constraints from any non-nullable attributes that are part of an operator's output. This also fixes the issue that causes the optimizer to hit the maximum number of iterations for certain queries in #11828.

## How was this patch tested?

Unit test in `ConstraintPropagationSuite`

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11953 from sameeragarwal/infer-isnotnull.