[SPARK-27123][SQL] Improve CollapseProject to handle projects cross limit/repartition/sample #24049
Conversation
@maropu . This is the one you requested before. Could you review this?
Thanks, @dongjoon-hyun ! If this is resolved, we can remove the pattern match (https://github.com/apache/spark/pull/23964/files#diff-43334bab9616cc53e8797b9afa9fc7aaR46) in #23964 ?
Yes, @maropu ! I'll rebase that PR after this is merged.
nice! I'll review later.
Yep. Thanks, @maropu .
Test build #103298 has finished for PR 24049 at commit
retest this please
Test build #103306 has finished for PR 24049 at commit
Test build #103334 has finished for PR 24049 at commit
Test build #103347 has finished for PR 24049 at commit
Ur, it's weird because it passed locally. I'll rebase this to the master.
Test build #103354 has finished for PR 24049 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Test build #103364 has finished for PR 24049 at commit
LGTM. Merged into master. Thanks.
…imit/repartition/sample

## What changes were proposed in this pull request?

`CollapseProject` optimizer rule simplifies some plans by merging the adjacent projects and performing alias substitutions.

```scala
scala> sql("SELECT b c FROM (SELECT a b FROM t)").explain
== Physical Plan ==
*(1) Project [a#5 AS c#1]
+- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5]
```

We can do that in more complex cases like the following. This PR aims to handle adjacent projects across limit/repartition/sample. Here, repartition means `Repartition`, not `RepartitionByExpression`.

**BEFORE**
```scala
scala> sql("SELECT b c FROM (SELECT /*+ REPARTITION(1) */ a b FROM t)").explain
== Physical Plan ==
*(2) Project [b#0 AS c#1]
+- Exchange RoundRobinPartitioning(1)
   +- *(1) Project [a#5 AS b#0]
      +- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5]
```

**AFTER**
```scala
scala> sql("SELECT b c FROM (SELECT /*+ REPARTITION(1) */ a b FROM t)").explain
== Physical Plan ==
Exchange RoundRobinPartitioning(1)
+- *(1) Project [a#11 AS c#7]
   +- Scan hive default.t [a#11], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#11]
```

## How was this patch tested?

Pass the Jenkins with the newly added and updated test cases.

Closes #24049 from dongjoon-hyun/SPARK-27123.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
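The core idea — substituting inner aliases into the outer project list, and doing so across an intervening limit — can be sketched with a toy plan algebra. This is plain Scala, not Spark's Catalyst classes; the node and method names here are illustrative only, and projections are simplified to name-to-column renamings.

```scala
// Toy model of the CollapseProject idea (not Spark's actual Catalyst API).
sealed trait Plan
case class Scan(columns: Seq[String]) extends Plan
case class Limit(n: Int, child: Plan) extends Plan
// A project maps each output name to the input column it is computed from.
case class Project(projectList: Seq[(String, String)], child: Plan) extends Plan

object Collapse {
  // Merge two adjacent project lists by substituting inner aliases into the
  // outer list, e.g. (c -> b) over (b -> a) becomes (c -> a).
  private def merge(outer: Seq[(String, String)],
                    inner: Seq[(String, String)]): Seq[(String, String)] = {
    val aliases = inner.toMap
    outer.map { case (name, ref) => (name, aliases.getOrElse(ref, ref)) }
  }

  def apply(plan: Plan): Plan = plan match {
    // Adjacent projects: substitute and collapse.
    case Project(p1, Project(p2, child)) =>
      apply(Project(merge(p1, p2), child))
    // Projects separated by a limit: collapse across it, as this PR does.
    case Project(p1, Limit(n, Project(p2, child))) =>
      Limit(n, apply(Project(merge(p1, p2), child)))
    case other => other
  }
}
```

For example, `Collapse(Project(Seq("c" -> "b"), Limit(1, Project(Seq("b" -> "a"), Scan(Seq("a"))))))` collapses the two projects into a single `Project(Seq("c" -> "a"), ...)` under the limit, mirroring the BEFORE/AFTER plans above.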
Sorry to be late; could you update the description of
Probably, it would be better to describe something about the new target this PR added. It seems
Yep. Sure!
This will cause a perf regression, right?
```diff
@@ -699,6 +699,24 @@ object CollapseProject extends Rule[LogicalPlan] {
         agg.copy(aggregateExpressions = buildCleanedProjectList(
           p.projectList, agg.aggregateExpressions))
       }
+      case p1 @ Project(_, g @ GlobalLimit(_, l @ LocalLimit(_, p2: Project))) =>
```
Sorry to be late for the review. I have 2 concerns about this optimization:

- if `p2` outputs one column, and `p1` outputs 1000 columns, then pushing down `p1` through the limit operator would increase the data size to be shuffled.
- if `p1` has an expensive expression like a UDF, pushing `p1` through the limit operator means the expensive expression will be executed a lot more times.
Do we have a general rule to justify the benefit of pushing down the project operator?
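The second concern can be made concrete with a small counting experiment in plain Scala (not Spark): whether a projection is evaluated above or below a limit determines how many rows an expensive expression touches. The helper name and numbers here are illustrative only.

```scala
// Count how many times an "expensive UDF" runs when a projection sits
// above vs. below a limit. `pushedDown = true` models the project being
// pushed below the limit, so the UDF sees every input row.
def countEvals(pushedDown: Boolean, rows: List[Int], limit: Int): Int = {
  var evals = 0
  def expensiveUdf(x: Int): Int = { evals += 1; x * 2 }
  if (pushedDown)
    rows.map(expensiveUdf).take(limit)   // project below the limit: all rows
  else
    rows.take(limit).map(expensiveUdf)   // project above the limit: `limit` rows
  evals
}
```

With 1000 input rows and a limit of 10, the project-above placement evaluates the UDF 10 times, while pushing it below the limit evaluates it 1000 times — which is why collapsing across a limit needs a guard.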
Thank you for the review, @gatorsmile and @cloud-fan . I got it. I'll narrow this down with `if isRenaming(l1, l2) =>` for these cases, too.
I believe case 1 can be handled with `isRenaming`, but I'm not sure how case 2 can be handled.
Case 2 actually means that we can't push down a project through operators that reduce the numRows, e.g. limit, sample, etc.
@cloud-fan . `isRenaming` uses `semanticEquals`. Case 2 will be prevented. I'll make a PR soon.
Hi, @maropu , @gatorsmile , @cloud-fan . I'll make a followup very soon. Is there any other concern for this PR?