[SPARK-13919] [SQL] fix column pruning through filter #11828
Conversation
 */
object ColumnPruning extends Rule[LogicalPlan] {
  private def sameOutput(output1: Seq[Attribute], output2: Seq[Attribute]): Boolean =
    output1.size == output2.size &&
      output1.zip(output2).forall(pair => pair._1.semanticEquals(pair._2))

-  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  def apply(plan: LogicalPlan): LogicalPlan = removeProjectBeforeFilter(plan transform {
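The sameOutput helper in the diff above checks that two attribute lists match pairwise via semanticEquals. A minimal self-contained sketch of the same shape, using a hypothetical Attr class whose semanticEquals ignores a per-instance exprId (Attr, dataType, and exprId are illustrative stand-ins, not Catalyst's actual types):

```scala
// Toy stand-in for Catalyst's Attribute: semantic equality compares name
// and type while ignoring the per-instance exprId.
final case class Attr(name: String, dataType: String, exprId: Int) {
  def semanticEquals(other: Attr): Boolean =
    name == other.name && dataType == other.dataType
}

// Same structure as the sameOutput in the diff: equal length plus
// pairwise semantic equality.
def sameOutput(output1: Seq[Attr], output2: Seq[Attr]): Boolean =
  output1.size == output2.size &&
    output1.zip(output2).forall(pair => pair._1.semanticEquals(pair._2))
```

With this toy definition, two attribute lists that differ only in exprId still count as the same output, which is what lets the rule treat a schema-preserving Project as a no-op.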
Here, we are using transform, which is actually transformDown. In this rule, ColumnPruning, we could add many Projects into the child, which could easily cause a stack overflow. That is why my PR #11745 changes it to transformUp. Do you think this change makes sense?
Column pruning has to go from top to bottom, or you would need multiple runs of this rule. The added Project is exactly the same whether you go from the top or from the bottom, but going from the bottom will sometimes not work, because the added Project will be moved by other rules (for example, filter pushdown).
Have you actually seen the stack overflow on this rule? I don't think so.
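The direction matters because the two traversals visit nodes in opposite orders. A toy sketch (not Spark's TreeNode; Node, Branch, and Leaf are made up for illustration) showing that transformDown applies the rule parent-first while transformUp applies it child-first:

```scala
sealed trait Node
final case class Branch(name: String, kids: List[Node]) extends Node
final case class Leaf(name: String) extends Node

// Parent-first: apply the rule, then recurse into the (possibly new) children.
def transformDown(n: Node)(rule: Node => Node): Node = rule(n) match {
  case Branch(nm, kids) => Branch(nm, kids.map(transformDown(_)(rule)))
  case leaf             => leaf
}

// Child-first: rebuild the children, then apply the rule to the result.
def transformUp(n: Node)(rule: Node => Node): Node = {
  val rebuilt = n match {
    case Branch(nm, kids) => Branch(nm, kids.map(transformUp(_)(rule)))
    case leaf             => leaf
  }
  rule(rebuilt)
}

def nameOf(n: Node): String = n match {
  case Branch(nm, _) => nm
  case Leaf(nm)      => nm
}

val tree = Branch("Project", List(Branch("Filter", List(Leaf("Relation")))))

// Record the order each traversal visits the nodes in.
val downOrder = scala.collection.mutable.ListBuffer[String]()
transformDown(tree) { n => downOrder += nameOf(n); n }

val upOrder = scala.collection.mutable.ListBuffer[String]()
transformUp(tree) { n => upOrder += nameOf(n); n }
```

Because transformDown recurses into the children of the rule's result, a rule that keeps wrapping children in new Projects can revisit its own additions; the no-op-Project elimination case discussed below is what keeps that from diverging.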
If we are using transformUp, the assumption behind removeProjectBeforeFilter no longer holds. The following line does not cover all the cases:
case p1 @ Project(_, f @ Filter(_, p2 @ Project(_, child)))
    if p2.outputSet.subsetOf(child.outputSet) =>
I saw the stack overflow in my local environment.
I think my PR #11745 covers all the cases, even if we change it from transform to transformUp.
We should not change transform to transformUp. It would be great if you could post a test case that causes the StackOverflow, thanks!
Will do it tonight. I don't have it now.
I am unable to reproduce the stack overflow now, if we keep the following lines in ColumnPruning:
// Eliminate no-op Projects
case p @ Project(projectList, child) if sameOutput(child.output, p.output) => child
If we remove the above line, we will get the stack overflow easily, because we can generate duplicate Projects. Anyway, I am fine if you want to use transformDown.
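The no-op elimination case above can be sketched with a toy plan whose columns are plain strings (Plan, Relation, and Project here are simplified stand-ins, with column names instead of Catalyst Attributes):

```scala
sealed trait Plan { def output: Seq[String] }
final case class Relation(output: Seq[String]) extends Plan
final case class Project(output: Seq[String], child: Plan) extends Plan

// Mirrors the "Eliminate no-op Projects" case: a Project that leaves the
// schema unchanged is dropped, so repeated rewrites cannot pile up
// duplicate Projects.
def eliminateNoopProject(p: Plan): Plan = p match {
  case Project(out, child) if out == child.output => child
  case other                                      => other
}
```

A pruning Project (fewer columns than its child) is left alone; only identity Projects disappear, which is exactly the guard that prevents the duplicate-Project stack overflow described above.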
There is no reason we should remove this line.
If transformDown is required here, could you change transform to transformDown? I got that from the comment on the function transform:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L242-L243
Test build #53560 has finished for PR 11828 at commit
This reverts commit d956aad.
Test build #53572 has finished for PR 11828 at commit
Test build #53574 has finished for PR 11828 at commit
Test build #53577 has finished for PR 11828 at commit
@@ -297,7 +302,7 @@ class ColumnPruningSuite extends PlanTest {
        SortOrder('b, Ascending) :: Nil,
        UnspecifiedFrame)).as('window) :: Nil,
      'a :: Nil, 'b.asc :: Nil)
-    .select('a, 'c, 'window).where('window > 1).select('a, 'c).analyze
+    .where('window > 1).select('a, 'c).analyze
Any reason why you removed .select('a, 'c, 'window)? It seems like the previous one is a better plan, right?
The select before the where helps nothing and could even be worse (without whole-stage codegen); is it really a better plan?
If so, it becomes harder for the Optimizer to judge which plan is better. Based on my understanding, the general principle of ColumnPruning is to do its best to add an extra Project to prune unnecessary columns, or to push a Project down as deep as possible. In this case, .select('a, 'c, 'window) prunes the useless column b.
Could you explain the current strategy for this rule? We might need to add more test cases to check that it does the desired work.
After more thinking: could we modify the existing operator Filter by adding the functionality of Project into Filter?
Added a comment for that.
I don't think it's necessary, or a good idea, to add the functionality of Project into Filter.
Got it. It is easier to understand it now. : )
Test build #53600 has finished for PR 11828 at commit
cc @cloud-fan
Test build #53698 has finished for PR 11828 at commit
 * The Project before Filter is not necessary but conflicts with PushPredicatesThroughProject,
 * so remove it.
 */
private def removeProjectBeforeFilter(plan: LogicalPlan): LogicalPlan = plan transform {
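The pattern in removeProjectBeforeFilter can be sketched with the same toy-plan style (simplified stand-ins with column sets instead of project lists; the real rule also has to check that the inner Project does nothing but prune columns):

```scala
sealed trait Plan { def outputSet: Set[String] }
final case class Relation(cols: Set[String]) extends Plan {
  def outputSet: Set[String] = cols
}
final case class Project(cols: Set[String], child: Plan) extends Plan {
  def outputSet: Set[String] = cols
}
final case class Filter(cond: String, child: Plan) extends Plan {
  def outputSet: Set[String] = child.outputSet
}

// Project-over-Filter-over-Project, where the inner Project only narrows
// the child's columns: drop the inner Project so the Filter sits directly
// on the child and predicate pushdown cannot undo the pruning rewrite.
def removeProjectBeforeFilter(plan: Plan): Plan = plan match {
  case p1 @ Project(_, Filter(cond, p2 @ Project(_, child)))
      if p2.outputSet.subsetOf(child.outputSet) =>
    Project(p1.cols, Filter(cond, child))
  case other => other
}
```

The subset guard here mirrors the one quoted earlier in the thread; as noted there, under transformUp this single pattern would not cover all the shapes the rule can produce.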
Same here. We still need to explicitly use transformDown.
We usually use transform everywhere, even when we know transformDown is better, for example in all the rules that push down a predicate. I think it's fine; otherwise we should update all of those places.
It stays correct even if someone suddenly changes transform to transformUp.
I see.
Is it possible for two consecutive Projects to follow the Filter?
Two consecutive Projects will be combined together by other rules.
CollapseProject is called after this rule. Anyway, we can leave it here if no test case fails because of it.
If my understanding is right, what we want is:
The problem is: at the time we push
I think logically we should put
Anyway, this PR does fix the problem, but I think we should simplify the
@cloud-fan That's correct. The reason we keep them together is that many rules depend on each other. It makes sense to split them into multiple batches; I'm not sure how cleanly we can split them, but that's another topic.
We need to add more tests when splitting the Optimizer rules into multiple batches. So far, the test coverage of the Optimizer is weak; we only evaluate the effects of the individual rules. After introducing
Test build #54069 has finished for PR 11828 at commit
Test build #54090 has finished for PR 11828 at commit
@@ -518,6 +438,23 @@ class Analyzer(
    def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
      case p: LogicalPlan if !p.childrenResolved => p

      // If the projection list contains Stars, expand it.
      case p: Project if containsStar(p.projectList) =>
This is moved from ResolveStar without any changes.
Thank you!
Oh I see, this can speed up resolution for nested plans, thanks for fixing it!
Test build #54144 has finished for PR 11828 at commit
Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@cloud-fan Does this look good to you?
LGTM
Test build #54150 has finished for PR 11828 at commit
Test build #2694 has finished for PR 11828 at commit
Merging into master, thanks!
…n-nullable attributes ## What changes were proposed in this pull request? This PR adds support for automatically inferring `IsNotNull` constraints from any non-nullable attributes that are part of an operator's output. This also fixes the issue that causes the optimizer to hit the maximum number of iterations for certain queries in #11828. ## How was this patch tested? Unit test in `ConstraintPropagationSuite` Author: Sameer Agarwal <sameer@databricks.com> Closes #11953 from sameeragarwal/infer-isnotnull.
What changes were proposed in this pull request?
This PR fixes the conflict between ColumnPruning and PushPredicatesThroughProject: ColumnPruning tries to insert a Project before a Filter, while PushPredicatesThroughProject moves the Filter back before the Project. This is fixed by removing the Project before the Filter when that Project only does column pruning.
The RuleExecutor will now fail the test if it reaches the maximum number of iterations.
Closes #11745
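The ping-pong the description mentions can be reproduced in miniature. A sketch under stated assumptions (toy plan nodes with column sets, and deliberately simplified versions of the rules; none of this is Spark's actual code) showing one optimizer round that lands back on the plan it started from:

```scala
sealed trait Plan { def outputSet: Set[String] }
final case class Relation(cols: Set[String]) extends Plan {
  def outputSet: Set[String] = cols
}
final case class Project(cols: Set[String], child: Plan) extends Plan {
  def outputSet: Set[String] = cols
}
final case class Filter(cond: String, refs: Set[String], child: Plan) extends Plan {
  def outputSet: Set[String] = child.outputSet
}

// Simplified ColumnPruning: insert a Project under the Filter to prune
// columns that neither the parent Project nor the predicate needs.
def columnPruning(p: Plan): Plan = p match {
  case Project(cols, Filter(cond, refs, child))
      if (cols ++ refs).subsetOf(child.outputSet) && (cols ++ refs) != child.outputSet =>
    Project(cols, Filter(cond, refs, Project(cols ++ refs, child)))
  case other => other
}

// Simplified PushPredicatesThroughProject: swap a Filter below a Project
// that still carries every column the predicate references.
def pushPredicateThroughProject(p: Plan): Plan = p match {
  case Filter(cond, refs, Project(pcols, child)) if refs.subsetOf(pcols) =>
    Project(pcols, Filter(cond, refs, child))
  case other => other
}

// Simplified CollapseProject: merge adjacent Projects.
def collapseProject(p: Plan): Plan = p match {
  case Project(outer, Project(inner, child)) if outer.subsetOf(inner) => Project(outer, child)
  case other => other
}

val start  = Project(Set("a"), Filter("b > 1", Set("b"), Relation(Set("a", "b", "c"))))
val pruned = columnPruning(start)
val pushed = pruned match {
  case Project(cols, f: Filter) => Project(cols, pushPredicateThroughProject(f))
  case other                    => other
}
val afterRound = collapseProject(pushed)
```

afterRound comes out structurally equal to start: without removing the Project before the Filter, the batch never reaches a fixed point and RuleExecutor runs until its iteration cap.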
How was this patch tested?
Existing tests.
This is a test case still failing, disabled for now, will be fixed by https://issues.apache.org/jira/browse/SPARK-14137