Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-27123][SQL] Improve CollapseProject to handle projects cross limit/repartition/sample #24049

Closed
wants to merge 4 commits into from

Conversation

dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Mar 11, 2019

What changes were proposed in this pull request?

CollapseProject optimizer rule simplifies some plans by merging the adjacent projects and performing alias substitutions.

scala> sql("SELECT b c FROM (SELECT a b FROM t)").explain
== Physical Plan ==
*(1) Project [a#5 AS c#1]
+- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5]

We can do that more complex cases like the following. This PR aims to handle adjacent projects across limit/repartition/sample. Here, repartition means Repartition, not RepartitionByExpression.

BEFORE

scala> sql("SELECT b c FROM (SELECT /*+ REPARTITION(1) */ a b FROM t)").explain
== Physical Plan ==
*(2) Project [b#0 AS c#1]
+- Exchange RoundRobinPartitioning(1)
   +- *(1) Project [a#5 AS b#0]
      +- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5]

AFTER

scala> sql("SELECT b c FROM (SELECT /*+ REPARTITION(1) */ a b FROM t)").explain
== Physical Plan ==
Exchange RoundRobinPartitioning(1)
+- *(1) Project [a#11 AS c#7]
   +- Scan hive default.t [a#11], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#11]

How was this patch tested?

Pass the Jenkins with the newly added and updated test cases.

@dongjoon-hyun
Copy link
Member Author

@maropu . This is the one you requested before. Could you review this?

@maropu
Copy link
Member

maropu commented Mar 11, 2019

Thanks, @dongjoon-hyun ! If this resolved, we can remove the patten match (https://github.com/apache/spark/pull/23964/files#diff-43334bab9616cc53e8797b9afa9fc7aaR46) in #23964 ?

@dongjoon-hyun
Copy link
Member Author

Yes, @maropu ! I'll rebase that PR after this is merged.

@maropu
Copy link
Member

maropu commented Mar 11, 2019

nice! I'll review later.

@dongjoon-hyun
Copy link
Member Author

Yep. Thanks, @maropu .
Also, @cloud-fan . Could you review this PR when you have some time?

@SparkQA
Copy link

SparkQA commented Mar 11, 2019

Test build #103298 has finished for PR 24049 at commit 7ae3932.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Mar 11, 2019

retest this please

@SparkQA
Copy link

SparkQA commented Mar 11, 2019

Test build #103306 has finished for PR 24049 at commit 7ae3932.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 11, 2019

Test build #103334 has finished for PR 24049 at commit 4d92cb9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 11, 2019

Test build #103347 has finished for PR 24049 at commit cc8bec6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Ur, it's weird because it passed locally. I'll rebase this to the master.

@SparkQA
Copy link

SparkQA commented Mar 12, 2019

Test build #103354 has finished for PR 24049 at commit 853cd4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @dbtsai and @maropu .
Now, it's specifically targeting the redundant aliasing (renaming) cases.

@SparkQA
Copy link

SparkQA commented Mar 12, 2019

Test build #103364 has finished for PR 24049 at commit 3189685.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @dbtsai and @maropu .
Could you review this once more when you have some time?

@dbtsai
Copy link
Member

dbtsai commented Mar 12, 2019

LGTM. Merged into master. Thanks.

@dbtsai dbtsai closed this Mar 12, 2019
dbtsai pushed a commit that referenced this pull request Mar 12, 2019
…imit/repartition/sample

## What changes were proposed in this pull request?

`CollapseProject` optimizer rule simplifies some plans by merging the adjacent projects and performing alias substitutions.
```scala
scala> sql("SELECT b c FROM (SELECT a b FROM t)").explain
== Physical Plan ==
*(1) Project [a#5 AS c#1]
+- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5]
```

We can do that more complex cases like the following. This PR aims to handle adjacent projects across limit/repartition/sample. Here, repartition means `Repartition`, not `RepartitionByExpression`.

**BEFORE**
```scala
scala> sql("SELECT b c FROM (SELECT /*+ REPARTITION(1) */ a b FROM t)").explain
== Physical Plan ==
*(2) Project [b#0 AS c#1]
+- Exchange RoundRobinPartitioning(1)
   +- *(1) Project [a#5 AS b#0]
      +- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5]
```

**AFTER**
```scala
scala> sql("SELECT b c FROM (SELECT /*+ REPARTITION(1) */ a b FROM t)").explain
== Physical Plan ==
Exchange RoundRobinPartitioning(1)
+- *(1) Project [a#11 AS c#7]
   +- Scan hive default.t [a#11], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#11]
```

## How was this patch tested?

Pass the Jenkins with the newly added and updated test cases.

Closes #24049 from dongjoon-hyun/SPARK-27123.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
@dongjoon-hyun
Copy link
Member Author

Thank you so much for review and merging, @dbtsai . Also, thank you, @maropu !

@dongjoon-hyun dongjoon-hyun deleted the SPARK-27123 branch March 12, 2019 21:52
@maropu
Copy link
Member

maropu commented Mar 12, 2019

Sorry to be late; could you update the description of CollapseProject as follow-up?

/**
 * Combines two adjacent [[Project]] operators into one and perform alias substitution,
 * merging the expressions into one single expression.
 */

Probably, it would be better to describe something about the new target this pr added. It seems Project -> Sample -> Project is not a case of adjacent projects?

@dongjoon-hyun
Copy link
Member Author

Yep. Sure!

@gatorsmile
Copy link
Member

This will cause the perf regression, right?

@@ -699,6 +699,24 @@ object CollapseProject extends Rule[LogicalPlan] {
agg.copy(aggregateExpressions = buildCleanedProjectList(
p.projectList, agg.aggregateExpressions))
}
case p1 @ Project(_, g @ GlobalLimit(_, l @ LocalLimit(_, p2: Project))) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be late for the review. I have 2 concerns about this optimization:

  1. if p2 outputs one column, and p1 outputs 1000 columns, then pushing down p1 through limit operator would increase the data size to be shuffled.
  2. if p1 has an expensive expression like UDF, pushing p1 through limit operator means the expensive expression will be executed a lot more times.

Do we have a general rule to justify the benefit of pushing down the project operator?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for review, @gatorsmile and @cloud-fan . I got it. I'll narrow down with if isRenaming(l1, l2) => for these cases, too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe case 1 can be handled with isRenaming, but I'm not sure how case 2 can be handled.

Case 2 actually means that we can't push down project through operators that will reduce the numRows, e.g. limit, sample, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan . isRenaming use semanticEquals. Case 2 will be prevented. I'll make a PR soon.

@dongjoon-hyun
Copy link
Member Author

Hi, @maropu , @gatorsmile , @cloud-fan . I'll make a followup very soon. Is there other concern for this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants