
[SPARK-13840] [SQL] Split Optimizer Rule ColumnPruning to ColumnPruning and EliminateOperator #11682

Closed
wants to merge 9 commits

Conversation

gatorsmile
Member

What changes were proposed in this pull request?

Before this PR, the two Optimizer rules ColumnPruning and PushPredicateThroughProject reverse each other's effects, so the Optimizer always reaches the maximum number of iterations when optimizing some queries, and extra Project operators remain in the plan. For example, below is the optimized plan after reaching 100 iterations:

Join Inner, Some((cast(id1#16 as bigint) = id1#18L))
:- Project [id1#16]
:  +- Filter isnotnull(cast(id1#16 as bigint))
:     +- Project [id1#16]
:        +- Relation[id1#16,newCol#17] JSON part: struct<>, data: struct<id1:int,newCol:int>
+- Filter isnotnull(id1#18L)
   +- Relation[id1#18L] JSON part: struct<>, data: struct<id1:bigint>
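To make the ping-pong concrete, here is a minimal, self-contained toy model (NOT Catalyst code; the plan classes, rule bodies, and the `optimize` driver are all hypothetical simplifications) showing how a naive column-pruning rewrite and a push-predicate-through-project rewrite undo each other, so the optimizer burns all of its iterations without converging:

```scala
// Toy model: two rewrite rules that undo each other never reach a fixed point.
object OptimizerLoopDemo {
  sealed trait Plan
  case class Relation(cols: List[String]) extends Plan
  case class Project(cols: List[String], child: Plan) extends Plan
  case class Filter(cond: String, child: Plan) extends Plan

  // Bottom-up rewrite, in the spirit of Catalyst's transformUp.
  def transformUp(p: Plan)(rule: PartialFunction[Plan, Plan]): Plan = {
    val rewritten = p match {
      case Project(cs, c) => Project(cs, transformUp(c)(rule))
      case Filter(f, c)   => Filter(f, transformUp(c)(rule))
      case r: Relation    => r
    }
    rule.applyOrElse(rewritten, identity[Plan])
  }

  // Naive "ColumnPruning": insert a pruning Project below a Filter
  // that sits directly on a relation with unused columns.
  val columnPruning: PartialFunction[Plan, Plan] = {
    case Filter(f, r: Relation) if r.cols.size > 1 =>
      Filter(f, Project(List(r.cols.head), r))
  }

  // Naive "PushPredicateThroughProject": swap a Filter below its Project.
  val pushPredicate: PartialFunction[Plan, Plan] = {
    case Filter(f, Project(cs, c)) => Project(cs, Filter(f, c))
  }

  val initialPlan: Plan =
    Filter("isnotnull(id1)", Relation(List("id1", "newCol")))

  // Run both rules to a fixed point, giving up after maxIters iterations.
  def optimize(start: Plan, maxIters: Int): (Plan, Int) = {
    var plan = start
    var iters = 0
    var changed = true
    while (changed && iters < maxIters) {
      val next = transformUp(transformUp(plan)(columnPruning))(pushPredicate)
      changed = next != plan
      plan = next
      iters += 1
    }
    (plan, iters)
  }

  def main(args: Array[String]): Unit = {
    val (_, iters) = optimize(initialPlan, 100)
    // Each round nets one more redundant Project, so all 100 iterations
    // are used without converging, mirroring the plan shown above.
    println(s"stopped after $iters iterations")
  }
}
```

Each iteration re-inserts a pruning Project that the other rule then hoists, leaving the old copy behind, which is why the real plan above ends up with stacked `Project [id1#16]` operators.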

This PR splits the optimizer rule ColumnPruning into two rules: ColumnPruning and EliminateOperators.

The issue becomes worse with another rule, NullFiltering, which could add extra IsNotNull Filters. We have to be careful about introducing an extra Filter when the benefit is not large enough. Another PR will be submitted by @sameeragarwal to handle this issue.

cc @sameeragarwal @marmbrus

In addition, ColumnPruning should not push Project through a non-deterministic Filter, since this could cause wrong results. That fix will be put in a separate PR.

cc @davies @cloud-fan @yhuai

How was this patch tested?

Modified the existing test cases.

@SparkQA

SparkQA commented Mar 13, 2016

Test build #53016 has finished for PR 11682 at commit e128a0a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 14, 2016

Test build #53046 has finished for PR 11682 at commit e5e00ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Mar 14, 2016

@gatorsmile Is this behavior new (introduced by a recent change), or did previous versions of the optimizer have the same behavior?

@gatorsmile
Member Author

@yhuai This was introduced by the recent change to ColumnPruning; 1.6 does not have such an issue. Thanks!

@davies
Contributor

davies commented Mar 14, 2016

@gatorsmile Thanks for finding this. In general, both PushPredicateThroughProject and ColumnPruning are useful; we should keep both and fix the conflict (make them stable).

I think they only conflict with each other when they operate on top of a LeafNode (where nothing can be pushed further), so we don't need to insert a Project between a Filter and a LeafNode: PhysicalOperation can already handle Project(Filter(LeafNode())) very well.

For the non-deterministic Filter or Project, we may do things wrong in many places, could you create a separate JIRA for that?

@marmbrus
Contributor

I'm worried that we are adding rules that aren't stable. Perhaps, when the testing flag is set, we should make the rule executor throw an error if we ever hit the maximum number of iterations.

@gatorsmile
Member Author

@davies Will try to find a way to keep both. Will submit a separate PR for handling the non-deterministic Filter.

@marmbrus Yeah, your concern is valid. Will submit a separate PR to do what you said above.

@davies
Contributor

davies commented Mar 14, 2016

After some offline discussions with @marmbrus, we realized that these two rules may still conflict with each other if the Filter can't be pushed through the child (for example, an outer join). A better solution could be to split ColumnPruning into two rules: a) the first adds new Projects; b) the second removes unnecessary Projects, including a Project under a Filter (a Project should only prune some columns). The second rule runs just after the first.
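The proposed split can be sketched with a self-contained toy model (NOT Catalyst code; the plan classes and rule bodies are hypothetical simplifications): rule (a) inserts pruning Projects, and rule (b), run right after it, removes any Project that does not actually prune columns, so the pair reaches a fixed point instead of looping.

```scala
// Toy model of the two-rule split: add Projects, then drop the no-op ones.
object SplitRuleDemo {
  sealed trait Plan
  case class Relation(cols: List[String]) extends Plan
  case class Project(cols: List[String], child: Plan) extends Plan
  case class Filter(cond: String, child: Plan) extends Plan

  // Columns a plan node produces.
  def output(p: Plan): List[String] = p match {
    case Relation(cs)   => cs
    case Project(cs, _) => cs
    case Filter(_, c)   => output(c)
  }

  // Bottom-up rewrite, in the spirit of Catalyst's transformUp.
  def transformUp(p: Plan)(rule: PartialFunction[Plan, Plan]): Plan = {
    val rewritten = p match {
      case Project(cs, c) => Project(cs, transformUp(c)(rule))
      case Filter(f, c)   => Filter(f, transformUp(c)(rule))
      case r: Relation    => r
    }
    rule.applyOrElse(rewritten, identity[Plan])
  }

  // (a) insert a pruning Project below a Filter over a wider relation.
  val addPruningProject: PartialFunction[Plan, Plan] = {
    case Filter(f, r: Relation) if r.cols.size > 1 =>
      Filter(f, Project(List(r.cols.head), r))
  }

  // (b) remove a Project that keeps its child's output unchanged.
  val eliminateProject: PartialFunction[Plan, Plan] = {
    case Project(cs, c) if cs == output(c) => c
  }

  val initialPlan: Plan =
    Filter("isnotnull(id1)", Relation(List("id1", "newCol")))

  // Run (a) then (b) to a fixed point, giving up after maxIters iterations.
  def optimize(start: Plan, maxIters: Int): (Plan, Int) = {
    var plan = start
    var iters = 0
    var changed = true
    while (changed && iters < maxIters) {
      val next =
        transformUp(transformUp(plan)(addPruningProject))(eliminateProject)
      changed = next != plan
      plan = next
      iters += 1
    }
    (plan, iters)
  }

  def main(args: Array[String]): Unit = {
    val (plan, iters) = optimize(initialPlan, 100)
    println(s"converged after $iters iterations: $plan")
  }
}
```

Because rule (b) deletes only Projects that prune nothing, the genuinely useful pruning Project under the Filter survives, and there is nothing left for the next iteration to undo.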

@gatorsmile
Member Author

Yeah, that is a good idea. Let me try it. Thanks!

@SparkQA

SparkQA commented Mar 14, 2016

Test build #53121 has finished for PR 11682 at commit adf64da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 14, 2016

Test build #53119 has finished for PR 11682 at commit f1eee03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 14, 2016

Test build #53116 has finished for PR 11682 at commit 608b901.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile changed the title from "[SPARK-13840] [SQL] Disable Project Pushdown Through Filter" to "[SPARK-13840] [SQL] Split Optimizer Rule ColumnPruning to ColumnPruning and EliminateOperator" on Mar 15, 2016
@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Mar 15, 2016

Test build #53144 has finished for PR 11682 at commit bc4685a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Mar 15, 2016

Merging in master.

asfgit closed this in 99bd2f0 on Mar 15, 2016
@davies
Contributor

davies commented Mar 15, 2016

@gatorsmile Have you missed the special rule for Filter(Project()) ?

@gatorsmile
Member Author

@davies In the case Filter(Project()), the second rule EliminateOperators removes the useless Project if it does not prune any column.

Could you explain any issue this PR misses? I can submit a follow-up PR to address it. Thanks!

@davies
Contributor

davies commented Mar 15, 2016

@gatorsmile The latest changes do not address the problem in the PR description; ColumnPruning and PushPredicateThroughProject still conflict with each other, right?

Could you also add a regression test for that?

@gatorsmile
Member Author

@davies I think I understand your points. We still prefer PushPredicateThroughProject. Let me submit another PR to address that issue and ping you.

Let me explain my understanding. The two rules still conflict with each other; however, the optimization now converges without extra changes. The final plan depends on the order of these two rules.

If we want to push Project deeper, we should keep the current order. Then, after finishing the batch Batch("Operator Optimizations"), we can add another batch for PushPredicateThroughProject. We just need to run it once to make sure Filter is below Project in the final optimized plan.
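The batch ordering described in the comment above could look roughly like this against Catalyst's RuleExecutor API (a hypothetical layout using the rule names from this discussion, not the actual Optimizer code):

```scala
// Hypothetical sketch: the fixed-point batch runs the split rules, and the
// conflicting PushPredicateThroughProject runs Once in a later batch, so it
// cannot ping-pong with ColumnPruning.
val batches =
  Batch("Operator Optimizations", FixedPoint(100),
    ColumnPruning,
    EliminateOperators) ::
  Batch("Push Predicate Through Project", Once,
    PushPredicateThroughProject) :: Nil
```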

gatorsmile deleted the viewDuplicateNames branch on March 15, 2016 17:35
@gatorsmile
Member Author

After writing more test cases, I found a couple of issues. In addition, we need another split of the rule ColumnPruning: a new rule PushProjectThroughPredicate that runs before ColumnPruning.

Anyway, will show the details in the PR. Thanks!

 * Note: this rule should be executed just after ColumnPruning.
 */
object EliminateOperators extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    // Sketch of the intended body (the full diff is not shown in this excerpt):
    // drop a Project that does not prune any column from its child.
    case p @ Project(_, child) if p.output == child.output => child
  }
}
Contributor

This is not what we want; it means we can't prune columns for Filter(Join()).

Member Author

See the latest PR, #11745; it will enable that.

Member Author

I realized it after this PR, so I submitted a new one to fix the issue. Sorry for that.

@davies
Contributor

davies commented Mar 17, 2016

@gatorsmile @rxin @cloud-fan Since this PR does not solve the problem as expected and also introduces other problems (we can't prune columns for Filter(Join(xx))), I have reverted this patch.

@gatorsmile
Member Author

@davies Sorry for that. :(

Could you review the other PR, #11745? It builds on this PR and resolves all the issues. Thanks!

roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
Closes apache#11682 from gatorsmile/viewDuplicateNames.
7 participants