
[SPARK-32753][SQL][3.0] Only copy tags to node with no tags #29665

Closed
manuzhang wants to merge 2 commits into branch-3.0 from manuzhang:spark-32753-3.0

Conversation

manuzhang (Contributor)

This PR backports #29593 to branch-3.0

What changes were proposed in this pull request?

Only copy tags to nodes that have no tags when transforming plans.

Why are the changes needed?

@cloud-fan made a good point that it doesn't make sense to append tags to existing nodes when nodes are removed. That can cause bugs such as duplicate rows when deduplicating and repartitioning by the same column with AQE.

spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
val df = spark.sql("select id from v1 group by id distribute by id") 
println(df.collect().toArray.mkString(","))
println(df.queryExecution.executedPlan)

// With AQE
[4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
AdaptiveSparkPlan(isFinalPlan=true)
+- CustomShuffleReader local
   +- ShuffleQueryStage 0
      +- Exchange hashpartitioning(id#183L, 10), true
         +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
            +- Union
               :- *(1) Range (0, 10, step=1, splits=2)
               +- *(2) Range (0, 10, step=1, splits=2)

// Without AQE
[4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
*(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
+- Exchange hashpartitioning(id#206L, 10), true
   +- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
      +- Union
         :- *(1) Range (0, 10, step=1, splits=2)
         +- *(2) Range (0, 10, step=1, splits=2)

It's too expensive to detect node removal, so we compromise and only copy tags to nodes that have no tags.
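
To illustrate the idea, here is a small, self-contained sketch of the guarded tag copy. It is an editor's illustration only, not the actual Spark `TreeNode` patch; the class and tag names (`TagCopyDemo`, `Node`, `"reusedExchange"`, etc.) are made up for the example.

```
import scala.collection.mutable

object TagCopyDemo {
  final class Node(val name: String) {
    val tags: mutable.Map[String, Any] = mutable.Map.empty

    // Guarded copy: adopt tags from the node being replaced only when this
    // node carries no tags of its own, so tags from removed nodes are not
    // appended to surviving nodes during plan transformations.
    def copyTagsFrom(other: Node): Unit = {
      if (tags.isEmpty) {
        tags ++= other.tags
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val removed = new Node("removedShuffleExchange")
    removed.tags += ("reusedExchange" -> true)

    val survivor = new Node("existingHashAggregate")
    survivor.tags += ("origin" -> "parser")

    survivor.copyTagsFrom(removed) // no-op: survivor already has tags
    println(survivor.tags)         // only the survivor's own tag remains

    val fresh = new Node("newHashAggregate")
    fresh.copyTagsFrom(removed)    // a fresh, untagged node adopts the tags
    println(fresh.tags)
  }
}
```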

Does this PR introduce any user-facing change?

Yes. It fixes a bug.

How was this patch tested?

Add test.

SparkQA commented Sep 7, 2020

Test build #128366 has finished for PR 29665 at commit be84096.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

test("SPARK-32753: Only copy tags to node with no tags") {
withSQLConf(
SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true"
) {
Member

Indentation?

withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {

Contributor Author

Sure. BTW, if this indentation style must be followed, should it be added to the checkstyle test?

@dongjoon-hyun (Member)

@manuzhang and @cloud-fan, if this is a correctness issue, could you add a label to the JIRA, please?

withSQLConf(
SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true"
) {
spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
Member

Shall we use withTempView("v1")?
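
Taken together with the indentation comment above, the test setup could look roughly like this. This is an illustrative sketch, not the merged test; the exact assertion (here, a row count of 10) is an assumption based on the duplicate-row reproduction in the PR description.

```
test("SPARK-32753: Only copy tags to node with no tags") {
  withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
    withTempView("v1") {
      // withTempView drops v1 automatically when the block exits.
      spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
      val df = spark.sql("select id from v1 group by id distribute by id")
      // With the fix, deduplication holds under AQE: 10 rows, not 20.
      assert(df.collect().length == 10)
    }
  }
}
```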

@cloud-fan (Contributor)

label added.

SparkQA commented Sep 8, 2020

Test build #128374 has finished for PR 29665 at commit 15b1673.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 8, 2020

Test build #128387 has finished for PR 29665 at commit 6973697.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to 3.0!

cloud-fan pushed a commit that referenced this pull request Sep 8, 2020
[SPARK-32753][SQL][3.0] Only copy tags to node with no tags
Closes #29665 from manuzhang/spark-32753-3.0.

Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan (Contributor)

cloud-fan commented Sep 8, 2020

@manuzhang can you create a new PR against master to add the new code updates from this PR?

@manuzhang (Contributor Author)

@dongjoon-hyun @cloud-fan, thanks for the review. #29682 has been opened as a follow-up against master.

@dongjoon-hyun (Member)

Thank you, @cloud-fan and @manuzhang.

HyukjinKwon pushed a commit that referenced this pull request Sep 9, 2020
### What changes were proposed in this pull request?
Fix indentation and clean up the temp view in the test added by #29593.

### Why are the changes needed?
Address review comments in #29665.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Updated test.

Closes #29682 from manuzhang/spark-32753-followup.

Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
rshkv pushed a commit to palantir/spark that referenced this pull request Feb 17, 2021
rshkv pushed a commit to palantir/spark that referenced this pull request Feb 19, 2021