
[SPARK-13840] [SQL] Split Optimizer Rule ColumnPruning to ColumnPruning and EliminateOperator #11682

Closed
wants to merge 9 commits

Conversation

gatorsmile
Member

What changes were proposed in this pull request?

Before this PR, the two Optimizer rules ColumnPruning and PushPredicateThroughProject reverse each other's effects, so the Optimizer always reaches the maximum number of iterations when optimizing some queries, and extra Project operators remain in the plan. For example, below is the optimized plan after reaching 100 iterations:

Join Inner, Some((cast(id1#16 as bigint) = id1#18L))
:- Project [id1#16]
:  +- Filter isnotnull(cast(id1#16 as bigint))
:     +- Project [id1#16]
:        +- Relation[id1#16,newCol#17] JSON part: struct<>, data: struct<id1:int,newCol:int>
+- Filter isnotnull(id1#18L)
   +- Relation[id1#18L] JSON part: struct<>, data: struct<id1:bigint>
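To make the ping-pong concrete, here is a minimal, self-contained toy model (NOT Catalyst code; the plan classes, rule bodies, and the `optimize` driver are all hypothetical simplifications) showing how a naive column-pruning rewrite and a push-predicate-through-project rewrite undo each other, so the optimizer burns all of its iterations without converging:

```scala
// Toy model: two rewrite rules that undo each other never reach a fixed point.
object OptimizerLoopDemo {
  sealed trait Plan
  case class Relation(cols: List[String]) extends Plan
  case class Project(cols: List[String], child: Plan) extends Plan
  case class Filter(cond: String, child: Plan) extends Plan

  // Bottom-up rewrite, in the spirit of Catalyst's transformUp.
  def transformUp(p: Plan)(rule: PartialFunction[Plan, Plan]): Plan = {
    val rewritten = p match {
      case Project(cs, c) => Project(cs, transformUp(c)(rule))
      case Filter(f, c)   => Filter(f, transformUp(c)(rule))
      case r: Relation    => r
    }
    rule.applyOrElse(rewritten, identity[Plan])
  }

  // Naive "ColumnPruning": insert a pruning Project below a Filter
  // that sits directly on a relation with unused columns.
  val columnPruning: PartialFunction[Plan, Plan] = {
    case Filter(f, r: Relation) if r.cols.size > 1 =>
      Filter(f, Project(List(r.cols.head), r))
  }

  // Naive "PushPredicateThroughProject": swap a Filter below its Project.
  val pushPredicate: PartialFunction[Plan, Plan] = {
    case Filter(f, Project(cs, c)) => Project(cs, Filter(f, c))
  }

  val initialPlan: Plan =
    Filter("isnotnull(id1)", Relation(List("id1", "newCol")))

  // Run both rules to a fixed point, giving up after maxIters iterations.
  def optimize(start: Plan, maxIters: Int): (Plan, Int) = {
    var plan = start
    var iters = 0
    var changed = true
    while (changed && iters < maxIters) {
      val next = transformUp(transformUp(plan)(columnPruning))(pushPredicate)
      changed = next != plan
      plan = next
      iters += 1
    }
    (plan, iters)
  }

  def main(args: Array[String]): Unit = {
    val (_, iters) = optimize(initialPlan, 100)
    // Each round nets one more redundant Project, so all 100 iterations
    // are used without converging, mirroring the plan shown above.
    println(s"stopped after $iters iterations")
  }
}
```

Each iteration re-inserts a pruning Project that the other rule then hoists, leaving the old copy behind, which is why the real plan above ends up with stacked `Project [id1#16]` operators.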

This PR splits the optimizer rule ColumnPruning into two rules: ColumnPruning and EliminateOperators.

The issue becomes worse with another rule, NullFiltering, which could add extra IsNotNull Filters. We have to be careful about introducing an extra Filter when the benefit is not large enough. Another PR will be submitted by @sameeragarwal to handle this issue.

cc @sameeragarwal @marmbrus

In addition, ColumnPruning should not push Project through a non-deterministic Filter, since this could cause wrong results. That fix will be put in a separate PR.

cc @davies @cloud-fan @yhuai

How was this patch tested?

Modified the existing test cases.

@SparkQA

SparkQA commented Mar 13, 2016

Test build #53016 has finished for PR 11682 at commit e128a0a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 14, 2016

Test build #53046 has finished for PR 11682 at commit e5e00ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Mar 14, 2016

@gatorsmile Is this behavior new (introduced by a recent change), or did previous versions of the optimizer have the same behavior?

@gatorsmile
Member Author

@yhuai This was introduced by the recent change to ColumnPruning; 1.6 does not have such an issue. Thanks!

@davies
Contributor

davies commented Mar 14, 2016

@gatorsmile Thanks for finding this. In general, both PushPredicateThroughProject and ColumnPruning are useful; we should keep both and fix the conflict (make them stable).

I think they only conflict with each other when they operate on top of a LeafNode (where nothing can be pushed further), so we don't need to insert a Project between a Filter and a LeafNode: PhysicalOperation can already handle Project(Filter(LeafNode())) very well.

For the non-deterministic Filter or Project, we may do things wrong in many places, could you create a separate JIRA for that?

@marmbrus
Contributor

I'm worried that we are adding rules that aren't stable. Perhaps, when the testing flag is set, we should make the rule executor throw an error if we ever hit the maximum number of iterations.

@gatorsmile
Member Author

@davies Will try to find a way to keep both. Will submit a separate PR for handling the non-deterministic Filter.

@marmbrus Yeah, your concern is valid. Will submit a separate PR to do what you said above.

@davies
Contributor

davies commented Mar 14, 2016

After some offline discussions with @marmbrus, we realized that these two rules may still conflict with each other if the Filter can't be pushed through the child (for example, an outer join). A better solution could be to split ColumnPruning into two rules: a) the first adds new Projects; b) the second removes unnecessary Projects, including a Project under a Filter (a Project should only prune some columns). The second rule runs just after the first.
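The proposed split can be sketched with a self-contained toy model (NOT Catalyst code; the plan classes and rule bodies are hypothetical simplifications): rule (a) inserts pruning Projects, and rule (b), run right after it, removes any Project that does not actually prune columns, so the pair reaches a fixed point instead of looping.

```scala
// Toy model of the two-rule split: add Projects, then drop the no-op ones.
object SplitRuleDemo {
  sealed trait Plan
  case class Relation(cols: List[String]) extends Plan
  case class Project(cols: List[String], child: Plan) extends Plan
  case class Filter(cond: String, child: Plan) extends Plan

  // Columns a plan node produces.
  def output(p: Plan): List[String] = p match {
    case Relation(cs)   => cs
    case Project(cs, _) => cs
    case Filter(_, c)   => output(c)
  }

  // Bottom-up rewrite, in the spirit of Catalyst's transformUp.
  def transformUp(p: Plan)(rule: PartialFunction[Plan, Plan]): Plan = {
    val rewritten = p match {
      case Project(cs, c) => Project(cs, transformUp(c)(rule))
      case Filter(f, c)   => Filter(f, transformUp(c)(rule))
      case r: Relation    => r
    }
    rule.applyOrElse(rewritten, identity[Plan])
  }

  // (a) insert a pruning Project below a Filter over a wider relation.
  val addPruningProject: PartialFunction[Plan, Plan] = {
    case Filter(f, r: Relation) if r.cols.size > 1 =>
      Filter(f, Project(List(r.cols.head), r))
  }

  // (b) remove a Project that keeps its child's output unchanged.
  val eliminateProject: PartialFunction[Plan, Plan] = {
    case Project(cs, c) if cs == output(c) => c
  }

  val initialPlan: Plan =
    Filter("isnotnull(id1)", Relation(List("id1", "newCol")))

  // Run (a) then (b) to a fixed point, giving up after maxIters iterations.
  def optimize(start: Plan, maxIters: Int): (Plan, Int) = {
    var plan = start
    var iters = 0
    var changed = true
    while (changed && iters < maxIters) {
      val next =
        transformUp(transformUp(plan)(addPruningProject))(eliminateProject)
      changed = next != plan
      plan = next
      iters += 1
    }
    (plan, iters)
  }

  def main(args: Array[String]): Unit = {
    val (plan, iters) = optimize(initialPlan, 100)
    println(s"converged after $iters iterations: $plan")
  }
}
```

Because rule (b) deletes only Projects that prune nothing, the genuinely useful pruning Project under the Filter survives, and there is nothing left for the next iteration to undo.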

@gatorsmile
Member Author

Yeah, that is a good idea. Let me try it. Thanks!

@SparkQA

SparkQA commented Mar 14, 2016

Test build #53121 has finished for PR 11682 at commit adf64da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 14, 2016

Test build #53119 has finished for PR 11682 at commit f1eee03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 14, 2016

Test build #53116 has finished for PR 11682 at commit 608b901.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile changed the title from "[SPARK-13840] [SQL] Disable Project Pushdown Through Filter" to "[SPARK-13840] [SQL] Split Optimizer Rule ColumnPruning to ColumnPruning and EliminateOperator" on Mar 15, 2016
@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Mar 15, 2016

Test build #53144 has finished for PR 11682 at commit bc4685a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Mar 15, 2016

Merging in master.

asfgit closed this in 99bd2f0 on Mar 15, 2016
@davies
Contributor

davies commented Mar 15, 2016

@gatorsmile Have you missed the special rule for Filter(Project()) ?

@gatorsmile
Member Author

@davies In the case Filter(Project()), the second rule EliminateOperators removes the useless Project if it does not prune any column.

Could you explain any issue this PR misses? I can submit a follow-up PR to address it. Thanks!

@davies
Contributor

davies commented Mar 15, 2016

@gatorsmile The latest changes do not address the problem in the PR description; ColumnPruning and PushPredicateThroughProject still conflict with each other, right?

Could you also add a regression test for that?

@gatorsmile
Member Author

@davies I think I understand your points. We still prefer PushPredicateThroughProject. Let me submit another PR to address that issue and ping you.

Let me explain my understanding. The two rules still conflict with each other; however, the optimization now converges without extra changes. The final plan depends on the order of these two rules.

If we want to push Project deeper, we should keep the current order. Then, after finishing the batch Batch("Operator Optimizations"), we can add another batch for PushPredicateThroughProject. We just need to run it once to make sure Filter is below Project in the final optimized plan.
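The batch ordering described in the comment above could look roughly like this against Catalyst's RuleExecutor API (a hypothetical layout using the rule names from this discussion, not the actual Optimizer code):

```scala
// Hypothetical sketch: the fixed-point batch runs the split rules, and the
// conflicting PushPredicateThroughProject runs Once in a later batch, so it
// cannot ping-pong with ColumnPruning.
val batches =
  Batch("Operator Optimizations", FixedPoint(100),
    ColumnPruning,
    EliminateOperators) ::
  Batch("Push Predicate Through Project", Once,
    PushPredicateThroughProject) :: Nil
```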

gatorsmile deleted the viewDuplicateNames branch on March 15, 2016 17:35
@gatorsmile
Member Author

After writing more test cases, I found a couple of issues. In addition, we need another split of the rule ColumnPruning: a new rule PushProjectThroughPredicate that runs before ColumnPruning.

Anyway, will show the details in the PR. Thanks!

 * Note: this rule should be executed just after ColumnPruning.
 */
object EliminateOperators extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    // Sketch of the intended body (the full diff is not shown in this excerpt):
    // drop a Project that does not prune any column from its child.
    case p @ Project(_, child) if p.output == child.output => child
  }
}
Contributor

This is not what we want; it means we can't prune columns for Filter(Join()).

Member Author

See the latest PR, #11745; it will enable that.

Member Author

I realized it after this PR, so I submitted a new one to fix the issue. Sorry for that.

@davies
Contributor

davies commented Mar 17, 2016

@gatorsmile @rxin @cloud-fan Since this PR does not solve the problem as expected and also introduces other problems (we can't prune columns for Filter(Join(xx))), I have reverted this patch.

@gatorsmile
Member Author

@davies Sorry for that. :(

Could you review the other PR, #11745? It builds on this PR and resolves all the issues. Thanks!

roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
Closes apache#11682 from gatorsmile/viewDuplicateNames.
7 participants