-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-13840] [SQL] Split Optimizer Rule ColumnPruning to ColumnPruning and EliminateOperator #11682
Conversation
Test build #53016 has finished for PR 11682 at commit
|
Test build #53046 has finished for PR 11682 at commit
|
@gatorsmile Is this behavior new (introduced by a recent change) or previous versions of the optimizer also have the same behavior? |
@yhuai This was introduced in the recent change in |
@gatorsmile Thanks for finding this. In general, both PushPredicateThroughProject and ColumnPruning are useful, we should keep both and fix the conflict (make them stable). I think they are only conflicted with each other when they are on opt of a LeafNode (can't be pushed further), so we don't need to insert a Project between Filter and LeafNode, because PhysicalOperation can already handle Project(Filter(LeafNode())) very well. For the non-deterministic Filter or Project, we may do things wrong in many places, could you create a separate JIRA for that? |
I'm worried that we are adding rules that aren't stable. Perhaps we should make the rule executor throw an error when the testing flag is set if we ever hit the maximum number of iterations. |
After some offline discussions with @marmbrus , we realized that these two rules may still conflict with each other if the Filter can'be pushed throughout the child (for example, outer join). A better solution could be split the ColumnPruning as two rules: a) the first one add new Project 2) the second one remove unnecessary projects, include the Project under Filter (the Project should only prune some columns). The second rule ran just after the first role. |
Yeah, that is a good idea. Let me try it. Thanks! |
Test build #53121 has finished for PR 11682 at commit
|
Test build #53119 has finished for PR 11682 at commit
|
Test build #53116 has finished for PR 11682 at commit
|
LGTM |
Test build #53144 has finished for PR 11682 at commit
|
Merging in master. |
@gatorsmile Have you missed the special rule for Filter(Project()) ? |
@davies In the case Filter(Project()), the second rule Could you explain any issue this PR misses? I can submit a follow-up PR to address it. Thanks! |
@gatorsmile This latest changes does not address the problem in the PR description, ColumnPruning and PushPredicateThroughProject still conflict with each other, right? Could you also add a regression test for that. |
@davies I might understand your points. We still prefer Let me explain my understanding. They still conflicts with each other. However, now the changes can converge without extra changes. The final plan depends the order of these two rules. If we want to push |
After writing more test cases, I found a couple of issues. In addition, we need another split of the rule Anyway, will show the details in the PR. Thanks! |
* Note: this rule should be executed just after ColumnPruning. | ||
*/ | ||
object EliminateOperators extends Rule[LogicalPlan] { | ||
def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not what we want, it cause that we can't prune columsn for Filter(Join()).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the latest PR: #11745. It will enable it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize it after this PR. Thus, I submitted a new one to fix the issue. Sorry for that.
@gatorsmile @rxin @cloud-fan Since this PR does not solve the problem as expected, also introduce other problems (can't prune columns for Filter(Join(xx)), I have reverted this patch. |
…g and EliminateOperator #### What changes were proposed in this pull request? Before this PR, two Optimizer rules `ColumnPruning` and `PushPredicateThroughProject` reverse each other's effects. Optimizer always reaches the max iteration when optimizing some queries. Extra `Project` are found in the plan. For example, below is the optimized plan after reaching 100 iterations: ``` Join Inner, Some((cast(id1#16 as bigint) = id1#18L)) :- Project [id1#16] : +- Filter isnotnull(cast(id1#16 as bigint)) : +- Project [id1#16] : +- Relation[id1#16,newCol#17] JSON part: struct<>, data: struct<id1:int,newCol:int> +- Filter isnotnull(id1#18L) +- Relation[id1#18L] JSON part: struct<>, data: struct<id1:bigint> ``` This PR splits the optimizer rule `ColumnPruning` to `ColumnPruning` and `EliminateOperators` The issue becomes worse when having another rule `NullFiltering`, which could add extra Filters for `IsNotNull`. We have to be careful when introducing extra `Filter` if the benefit is not large enough. Another PR will be submitted by sameeragarwal to handle this issue. cc sameeragarwal marmbrus In addition, `ColumnPruning` should not push `Project` through non-deterministic `Filter`. This could cause wrong results. This will be put in a separate PR. cc davies cloud-fan yhuai #### How was this patch tested? Modified the existing test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes apache#11682 from gatorsmile/viewDuplicateNames.
What changes were proposed in this pull request?
Before this PR, two Optimizer rules
ColumnPruning
andPushPredicateThroughProject
reverse each other's effects. Optimizer always reaches the max iteration when optimizing some queries. ExtraProject
are found in the plan. For example, below is the optimized plan after reaching 100 iterations:This PR splits the optimizer rule
ColumnPruning
toColumnPruning
andEliminateOperators
The issue becomes worse when having another rule
NullFiltering
, which could add extra Filters forIsNotNull
. We have to be careful when introducing extraFilter
if the benefit is not large enough. Another PR will be submitted by @sameeragarwal to handle this issue.cc @sameeragarwal @marmbrus
In addition,
ColumnPruning
should not pushProject
through non-deterministicFilter
. This could cause wrong results. This will be put in a separate PR.cc @davies @cloud-fan @yhuai
How was this patch tested?
Modified the existing test cases.