
[SPARK-13919] [SQL] fix column pruning through filter #11828

Closed
wants to merge 13 commits

Conversation

davies
Contributor

@davies davies commented Mar 18, 2016

What changes were proposed in this pull request?

This PR fixes the conflict between ColumnPruning and PushPredicatesThroughProject: ColumnPruning tries to insert a Project before a Filter, but PushPredicatesThroughProject then moves the Filter back before the Project, so the two rules undo each other on every iteration. This is fixed by removing the Project before the Filter when that Project only does column pruning.

RuleExecutor will now fail the test if it reaches the maximum number of iterations.

Closes #11745

How was this patch tested?

Existing tests.

One test case is still failing; it is disabled for now and will be fixed by https://issues.apache.org/jira/browse/SPARK-14137

*/
object ColumnPruning extends Rule[LogicalPlan] {
  private def sameOutput(output1: Seq[Attribute], output2: Seq[Attribute]): Boolean =
    output1.size == output2.size &&
      output1.zip(output2).forall(pair => pair._1.semanticEquals(pair._2))

-  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  def apply(plan: LogicalPlan): LogicalPlan = removeProjectBeforeFilter(plan transform {
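For context, the wrapper introduced here can be sketched as follows (assembled from the pattern quoted later in this thread; treat it as a sketch, not the exact merged source):

```scala
/**
 * Sketch: remove a Project that sits between a Filter and its child when that
 * Project only prunes columns (its output is a subset of the child's output).
 * Dropping it cannot change the Filter's result, and it removes the node that
 * PushPredicatesThroughProject would otherwise keep moving the Filter across.
 */
private def removeProjectBeforeFilter(plan: LogicalPlan): LogicalPlan = plan transform {
  case p1 @ Project(_, f @ Filter(_, p2 @ Project(_, child)))
      if p2.outputSet.subsetOf(child.outputSet) =>
    p1.copy(child = f.copy(child = child))
}
```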
Member

Here, we are using transform, which is actually transformDown. In this rule, ColumnPruning, we could add many Projects into the child. This could easily cause a stack overflow. That is why my PR #11745 changes it to transformUp. Do you think this change makes sense?

Contributor Author

Column pruning has to go from top to bottom, or you will need multiple runs of this rule. The added Project is exactly the same whether you go from the top or from the bottom. Going from the bottom will sometimes not work (because the added Project will be moved by other rules, for example filter pushdown).

Have you actually seen a stack overflow from this rule? I don't think so.

Member

If we use transformUp, removeProjectBeforeFilter's assumption no longer holds. The following pattern does not cover all the cases:

case p1 @ Project(_, f @ Filter(_, p2 @ Project(_, child)))
   if p2.outputSet.subsetOf(child.outputSet) =>

Member

I saw the stack overflow in my local environment.

Member

I think my PR #11745 covers all the cases even if we change transform to transformUp.

Contributor Author

We should not change transform to transformUp. It would be great if you could post a test case that causes the StackOverflowError, thanks!

Member

Will do it tonight. I don't have one right now.

Member

I am unable to reproduce the stack overflow now if we keep the following line in ColumnPruning:

    // Eliminate no-op Projects
    case p @ Project(projectList, child) if sameOutput(child.output, p.output) => child

If we remove the above line, we get the stack overflow easily because we can generate duplicate Projects. Anyway, I am fine with using transformDown.
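To see why that line matters, here is a minimal self-contained toy model (illustrative only, not Catalyst's LogicalPlan): a Project whose output equals its child's output is a no-op, and removing it keeps the optimizer from stacking duplicate Projects on every iteration.

```scala
// Toy plan tree, illustrative only (not Spark's LogicalPlan).
sealed trait Plan { def output: Seq[String] }
case class Relation(cols: Seq[String]) extends Plan { def output: Seq[String] = cols }
case class Project(cols: Seq[String], child: Plan) extends Plan { def output: Seq[String] = cols }

// Mirrors "case p @ Project(...) if sameOutput(child.output, p.output) => child":
// a Project whose output equals its child's output is a no-op and is removed.
def eliminateNoOp(p: Plan): Plan = p match {
  case Project(cols, child) =>
    val c = eliminateNoOp(child)
    if (cols == c.output) c else Project(cols, c)
  case r: Relation => r
}

val plan = Project(Seq("a"), Project(Seq("a"), Relation(Seq("a", "b"))))
val pruned = eliminateNoOp(plan)
// The outer Project matches its child's output and is dropped;
// the inner Project still prunes column b.
```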

Contributor Author

There is no reason we should remove this line.

Member

If transformDown is required here, could you change transform to transformDown? I got this from the comment on the transform function:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L242-L243
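For readers unfamiliar with the distinction being debated, here is a self-contained toy sketch of the two traversal orders (illustrative only, not TreeNode's actual implementation): transform/transformDown applies the rule parent-first, transformUp child-first, which can yield different results for the same rule.

```scala
// Toy tree, illustrative only (not Spark's TreeNode).
sealed trait Node
case class Leaf(v: Int) extends Node
case class Branch(l: Node, r: Node) extends Node

// Parent-first: apply the rule, then recurse into the (possibly new) children.
def transformDown(n: Node)(rule: PartialFunction[Node, Node]): Node = {
  val applied = rule.applyOrElse(n, identity[Node])
  applied match {
    case Branch(l, r) => Branch(transformDown(l)(rule), transformDown(r)(rule))
    case leaf => leaf
  }
}

// Child-first: recurse into the children, then apply the rule to the result.
def transformUp(n: Node)(rule: PartialFunction[Node, Node]): Node = {
  val withChildren = n match {
    case Branch(l, r) => Branch(transformUp(l)(rule), transformUp(r)(rule))
    case leaf => leaf
  }
  rule.applyOrElse(withChildren, identity[Node])
}

val t: Node = Branch(Branch(Leaf(1), Leaf(2)), Leaf(3))
val fold: PartialFunction[Node, Node] = { case Branch(Leaf(a), Leaf(b)) => Leaf(a + b) }
// transformUp folds the whole tree, while transformDown stops after one level
// because it never revisits a node once its children have been rewritten.
```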

@SparkQA

SparkQA commented Mar 18, 2016

Test build #53560 has finished for PR 11828 at commit ffe1270.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 18, 2016

Test build #53572 has finished for PR 11828 at commit d956aad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 19, 2016

Test build #53574 has finished for PR 11828 at commit b26d1c0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 19, 2016

Test build #53577 has finished for PR 11828 at commit 920de45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -297,7 +302,7 @@ class ColumnPruningSuite extends PlanTest {
         SortOrder('b, Ascending) :: Nil,
         UnspecifiedFrame)).as('window) :: Nil,
       'a :: Nil, 'b.asc :: Nil)
-      .select('a, 'c, 'window).where('window > 1).select('a, 'c).analyze
+      .where('window > 1).select('a, 'c).analyze
Member

Any reason for removing .select('a, 'c, 'window)? It seems like the previous one was a better plan, right?

Contributor Author

The select before the where helps nothing and could be worse (without whole-stage codegen); is it really a better plan?

Member

If so, it becomes harder for the Optimizer to judge which plan is better. Based on my understanding, the general principle of ColumnPruning is to do its best to add extra Projects to prune unnecessary columns, pushing the Project down as deep as possible. In this case, .select('a, 'c, 'window) prunes the useless column b.

Could you explain the current strategy for this rule? We might need to add more test cases to check if it does the desired work.

Member

After more thinking, can we modify the existing operator Filter by adding the functionality of Project into Filter?

Contributor Author

Added a comment for that.

I don't think it's necessary, or a good idea, to add the functionality of Project into Filter.

Member

Got it. It is easier to understand it now. : )

@SparkQA

SparkQA commented Mar 19, 2016

Test build #53600 has finished for PR 11828 at commit b1118e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor Author

davies commented Mar 21, 2016

cc @cloud-fan

@SparkQA

SparkQA commented Mar 21, 2016

Test build #53698 has finished for PR 11828 at commit 6e698cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

 * The Project before Filter is not necessary but conflicts with PushPredicatesThroughProject,
 * so remove it.
 */
private def removeProjectBeforeFilter(plan: LogicalPlan): LogicalPlan = plan transform {
Member

Same here. We still need to explicitly use transformDown.

Contributor Author

We use transform nearly everywhere, even when we know transformDown is what we mean, for example in all those rules that push down a predicate.

I think it's fine, or we should update all these places.

Contributor Author

It's still correct even if someone suddenly changes transform to transformUp.

Member

I see.

Is it possible that there are two consecutive Projects below the Filter?

Contributor Author

Two consecutive Projects will be combined by other rules.

Member

CollapseProject is called after this rule. Anyway, we can leave it here if no test case fails because of it.

@cloud-fan
Contributor

If my understanding is right, what we want is:

  • insert a Project below Filter, so that we may have a chance to push it down further.
  • push down Filter through Project, to reduce the number of input rows.

The problem is: at the time we push Filter through Project, we don't know if this Project can be pushed down further, and we may mistakenly lift it and make the plan sub-optimal.

I think logically we should put ColumnPruning and PushPredicatesThroughProject in different batches, so that when we push down Filter through Project, we can make sure this Project can't be pushed further and it's safe to lift it.

Anyway, this PR does fix the problem, but I think we should simplify the Operator Optimizations batch: it contains over 30 rules, and it is very hard to reason about how these rules interact with each other.
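One possible shape for such a split, sketched against RuleExecutor's Batch API (the batch names and rule groupings here are illustrative, not something proposed in this PR):

```scala
// Illustrative only: run ColumnPruning to a fixed point before any rule that
// moves Filters across Projects, so a pruning-only Project is never lifted
// prematurely by PushPredicatesThroughProject.
val batches =
  Batch("Column Pruning", fixedPoint, ColumnPruning, CollapseProject) ::
  Batch("Predicate Pushdown", fixedPoint, PushPredicatesThroughProject) :: Nil
```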

@davies
Contributor Author

davies commented Mar 22, 2016

@cloud-fan That's correct. The reason we keep them together is that many rules depend on each other. It makes sense to split them into multiple batches; I'm not sure how cleanly we can split them, but that's another topic.

@gatorsmile
Member

We need to add more tests when splitting the Optimizer rules into multiple batches. So far, the test coverage of the Optimizer is weak: we only evaluate the effects of the individual rules. After introducing Constraints, I found that more rules became correlated. For example, the rule SimplifyCasts affects predicate pushdown. : (

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54069 has finished for PR 11828 at commit 6a41cd4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54090 has finished for PR 11828 at commit bb8f0cc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -518,6 +438,23 @@ class Analyzer(
     def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
       case p: LogicalPlan if !p.childrenResolved => p

+      // If the projection list contains Stars, expand it.
+      case p: Project if containsStar(p.projectList) =>
Contributor Author

This is moved from ResolveStar without any changes.

Member

Thank you!

Contributor

Oh I see, this can speed up resolution for nested plans, thanks for fixing it!

@SparkQA

SparkQA commented Mar 25, 2016

Test build #54144 has finished for PR 11828 at commit cd7132e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@davies
Contributor Author

davies commented Mar 25, 2016

@cloud-fan Does this look good to you?

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Mar 25, 2016

Test build #54150 has finished for PR 11828 at commit d2da9e4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 25, 2016

Test build #2694 has finished for PR 11828 at commit d2da9e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor Author

davies commented Mar 25, 2016

Merging into master, thanks!

@asfgit asfgit closed this in 6603d9f Mar 25, 2016
asfgit pushed a commit that referenced this pull request Mar 25, 2016
…n-nullable attributes

## What changes were proposed in this pull request?

This PR adds support for automatically inferring `IsNotNull` constraints from any non-nullable attributes that are part of an operator's output. This also fixes the issue that causes the optimizer to hit the maximum number of iterations for certain queries in #11828.

## How was this patch tested?

Unit test in `ConstraintPropagationSuite`

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11953 from sameeragarwal/infer-isnotnull.