[SPARK-30524] [SQL] Disable OptimizeSkewedJoin rule when introducing additional shuffle #27226

JkSelf · 2020-01-16T03:11:13Z

What changes were proposed in this pull request?

The OptimizeSkewedJoin rule changes the outputPartitioning after inserting PartialShuffleReaderExec or SkewedPartitionReaderExec, so an additional shuffle may need to be introduced to ensure the correct result. This PR disables the OptimizeSkewedJoin rule when applying it would introduce an additional shuffle.

Why are the changes needed?

bug fix

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new unit test.

JkSelf · 2020-01-16T03:13:16Z

@cloud-fan @hvanhovell @maryannxue Please help review when you have time. Thanks for your help.

SparkQA · 2020-01-16T03:16:34Z

Test build #116807 has finished for PR 27226 at commit 4bed602.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-01-16T03:18:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

+  private def containShuffleQueryStage(plan : SparkPlan): (Boolean, ShuffleQueryStageExec) =
+    plan match {
+      case stage: ShuffleQueryStageExec => (true, stage)
+      case sort: SortExec if (sort.child.isInstanceOf[ShuffleQueryStageExec]) =>


nit: case SortExec(_, _, s: ShuffleQueryStageExec, _)
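To illustrate the nit, here is a minimal sketch comparing the two matching styles. `Plan`, `Stage`, `Sort` and `Filter` are hypothetical stand-ins, not Spark's operators (the real SortExec has four fields), so treat this as an illustration of the pattern only.

```scala
object ExtractorSketch {
  // Hypothetical stand-ins for plan nodes; not Spark's real classes.
  sealed trait Plan
  case class Stage(id: Int) extends Plan
  case class Sort(order: String, child: Plan) extends Plan
  case class Filter(child: Plan) extends Plan

  // Guard style, as in the diff: test the child's type, then cast to use it.
  def stageIdWithGuard(plan: Plan): Option[Int] = plan match {
    case sort: Sort if sort.child.isInstanceOf[Stage] =>
      Some(sort.child.asInstanceOf[Stage].id)
    case _ => None
  }

  // Extractor style, as suggested: the pattern binds the typed child directly.
  def stageIdWithExtractor(plan: Plan): Option[Int] = plan match {
    case Sort(_, s: Stage) => Some(s.id)
    case _ => None
  }
}
```

Both functions accept the same inputs; the extractor form avoids the separate type test and cast.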

cloud-fan · 2020-01-16T03:19:15Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

+  private def reOptimizeChild(
+      skewedReader: SkewedPartitionReaderExec,
+      child: SparkPlan): SparkPlan = child match {
+    case sort: SortExec if (sort.child.isInstanceOf[ShuffleQueryStageExec]) =>


cloud-fan · 2020-01-16T04:40:04Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

-          |Try to optimize skewed join.
-          |Left side partition size: $leftSizeInfo
-          |Right side partition size: $rightSizeInfo
+           |Try to optimize skewed join.


the previous indentation seems correct.

cloud-fan · 2020-01-16T04:41:14Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

-        s1 @ SortExec(_, _, left: ShuffleQueryStageExec, _),
-        s2 @ SortExec(_, _, right: ShuffleQueryStageExec, _))
-      if supportedJoinTypes.contains(joinType) =>
+  private def containShuffleQueryStage(plan : SparkPlan): (Boolean, ShuffleQueryStageExec) =


why not just return Option[ShuffleQueryStageExec]? we can rename the method to getShuffleQueryStage
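A minimal sketch of the suggested shape, assuming hypothetical stand-in classes (`Plan`, `Stage`, `Sort`, `Scan` are not Spark's operators) and a hypothetical helper `bothStages` added only to show a caller:

```scala
object OptionSketch {
  // Hypothetical stand-ins for plan nodes; not Spark's real classes.
  sealed trait Plan
  case class Stage(id: Int) extends Plan
  case class Sort(child: Plan) extends Plan
  case class Scan(table: String) extends Plan

  // Returns the stage if the plan is a stage or a sort directly over one.
  def getShuffleQueryStage(plan: Plan): Option[Stage] = plan match {
    case s: Stage => Some(s)
    case Sort(s: Stage) => Some(s)
    case _ => None
  }

  // A caller can combine both join sides without checking a Boolean flag first.
  def bothStages(left: Plan, right: Plan): Option[(Stage, Stage)] =
    for {
      l <- getShuffleQueryStage(left)
      r <- getShuffleQueryStage(right)
    } yield (l, r)
}
```

With a `(Boolean, ShuffleQueryStageExec)` return type, the "not found" case still has to put some placeholder in the second slot; `Option` makes that case explicit in the type.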

cloud-fan · 2020-01-16T04:43:15Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

+      child: SparkPlan): SparkPlan = child match {
+    case sort @ SortExec(_, _, s: ShuffleQueryStageExec, _) =>
+      sort.copy(child = skewedReader)
+    case _ => child


shouldn't this be: case _: ShuffleQueryStageExec => skewedReader?

cloud-fan · 2020-01-16T04:44:48Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

          }
        }
      }
      logDebug(s"number of skewed partitions is ${skewedPartitions.size}")
      if (skewedPartitions.nonEmpty) {
+        val visitedStages = HashSet.empty[Int]
        val optimizedSmj = smj.transformDown {


how about transformUp? Then we don't need the visitedStages
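The reasoning behind this suggestion can be sketched on a toy tree. The classes below are hypothetical and only model the usual pre-order and post-order traversal semantics; they are not Spark's TreeNode API. A top-down rewrite descends into the node it just created and meets the same stage again, so it needs a visited set; a bottom-up rewrite never revisits the nodes it creates.

```scala
import scala.collection.mutable

object TransformSketch {
  // Hypothetical toy tree; a simplified model, not Spark's TreeNode.
  sealed trait Node {
    def children: Seq[Node]
    def withChildren(newChildren: Seq[Node]): Node

    // Pre-order: rewrite this node, then descend into the children of the result.
    def transformDown(rule: PartialFunction[Node, Node]): Node = {
      val after = rule.applyOrElse(this, identity[Node])
      after.withChildren(after.children.map(_.transformDown(rule)))
    }

    // Post-order: rewrite the children first, then this node.
    def transformUp(rule: PartialFunction[Node, Node]): Node = {
      val rebuilt = withChildren(children.map(_.transformUp(rule)))
      rule.applyOrElse(rebuilt, identity[Node])
    }
  }
  case class Stage(id: Int) extends Node {
    def children: Seq[Node] = Nil
    def withChildren(newChildren: Seq[Node]): Node = this
  }
  case class Reader(child: Node) extends Node {
    def children: Seq[Node] = Seq(child)
    def withChildren(newChildren: Seq[Node]): Node = Reader(newChildren.head)
  }
  case class Sort(child: Node) extends Node {
    def children: Seq[Node] = Seq(child)
    def withChildren(newChildren: Seq[Node]): Node = Sort(newChildren.head)
  }
  case class Join(left: Node, right: Node) extends Node {
    def children: Seq[Node] = Seq(left, right)
    def withChildren(newChildren: Seq[Node]): Node = Join(newChildren(0), newChildren(1))
  }

  val plan: Node = Join(Sort(Stage(1)), Sort(Stage(2)))

  // Bottom-up: no bookkeeping needed.
  val up: Node = plan.transformUp { case s: Stage => Reader(s) }

  // Top-down: without the visited guard, the Stage under each new Reader
  // would match again and be wrapped without end.
  val visited = mutable.HashSet.empty[Int]
  val down: Node = plan.transformDown {
    case s: Stage if !visited.contains(s.id) =>
      visited += s.id
      Reader(s)
  }
}
```

In this sketch both traversals produce the same tree; the bottom-up version simply does not need the mutable set.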

cloud-fan · 2020-01-16T04:45:27Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

        val optimizedSmj = smj.transformDown {
-          case sort @ SortExec(_, _, shuffleStage: ShuffleQueryStageExec, _) =>
-            sort.copy(child = PartialShuffleReaderExec(shuffleStage, skewedPartitions.toSet))
+          case shuffleStage: ShuffleQueryStageExec if !visitedStages.contains(shuffleStage.id) =>


to be safe, we should do case s: ShuffleQueryStageExec if s.id == left.id || s.id == right.id
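As a rough sketch of why the id check is safer, the toy rewrite below wraps only the stages whose ids were chosen up front and leaves any other stage in the subtree alone. All names here (`Plan`, `Stage`, `Reader`, `wrapTargets`, `targetIds`) are hypothetical and do not come from the PR.

```scala
object TargetIdSketch {
  // Hypothetical stand-ins for plan nodes; not Spark's real classes.
  sealed trait Plan
  case class Stage(id: Int) extends Plan
  case class Reader(child: Stage) extends Plan
  case class Sort(child: Plan) extends Plan
  case class Join(left: Plan, right: Plan) extends Plan

  // Wrap only the stages whose id is in targetIds; every other node is kept as is.
  def wrapTargets(plan: Plan, targetIds: Set[Int]): Plan = plan match {
    case s: Stage if targetIds.contains(s.id) => Reader(s)
    case s: Stage => s
    case r: Reader => r
    case Sort(child) => Sort(wrapTargets(child, targetIds))
    case Join(l, r) => Join(wrapTargets(l, targetIds), wrapTargets(r, targetIds))
  }
}
```

Here `targetIds` plays the role of `left.id` and `right.id` in the comment above: a stage that happens to sit elsewhere in the subtree is not touched.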

cloud-fan · 2020-01-16T04:45:56Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

@@ -189,6 +230,21 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
      }
  }

+  def handleSkewJoin(plan: SparkPlan): SparkPlan = {


this is not a long method, maybe just inline it in apply?

cloud-fan · 2020-01-16T04:46:56Z

sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala

@@ -579,6 +579,33 @@ class AdaptiveQueryExecSuite
    }
  }

+  test("SPARK-30524: AQE should disable OptimizeSkewedJoin rule" +


nit: SPARK-30524: Do not optimize skew join if introduce additional shuffle

SparkQA · 2020-01-16T07:57:22Z

Test build #116810 has finished for PR 27226 at commit 903309f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-16T08:05:02Z

Test build #116816 has finished for PR 27226 at commit 81ad999.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-16T08:05:10Z

Test build #116819 has finished for PR 27226 at commit c37f397.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-16T08:05:12Z

Test build #116808 has finished for PR 27226 at commit c1c05d4.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-01-16T08:50:38Z

retest this please

SparkQA · 2020-01-16T13:23:00Z

Test build #116835 has finished for PR 27226 at commit c37f397.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-01-16T14:52:19Z

thanks, merging to master!

maryannxue · 2020-01-16T16:32:04Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

-      handleSkewJoin(plan)
+      // When multi table join, there will be too many complex combination to consider.
+      // Currently we only handle 2 table join like following two use cases.
+      // SMJ                    SMJ


Sorry that my previous comment was wrong. Once we have shuffle, there should always be a sort. So we don't need to match this.

maryannxue · 2020-01-16T16:33:47Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala

@@ -95,8 +96,20 @@ case class SortMergeJoinExec(
        s"${getClass.getSimpleName} should not take $x as the JoinType")
  }

-  override def requiredChildDistribution: Seq[Distribution] =


We should probably make this a flag to indicate it's a partial SMJ. This whole matching is too tightly coupled with the skew join rule itself.

maryannxue · 2020-01-16T16:43:45Z

@JkSelf can you do a quick follow up for the comments above as well as this one:
https://github.com/apache/spark/pull/26434/files#r367506078 ?

### What changes were proposed in this pull request?
Resolve the remaining comments in [PR#27226](#27226).

### Why are the changes needed?
Resolve the comments.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing unit tests.

Closes #27253 from JkSelf/followup-skewjoinoptimization2.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

JkSelf added 6 commits January 3, 2005 22:35

disable OptimizeSkewedJoin rule when introducing additional shuffle

4bed602

fix the compile error

c1c05d4

resolve comments

903309f

resolve the comments

d9f0c6b

remove unused import

81ad999

small fix

c37f397

JkSelf changed the title ~~Disable OptimizeSkewedJoin rule when introducing additional shuffle~~ [SPARK-30524] [SQL] Disable OptimizeSkewedJoin rule when introducing additional shuffle Jan 16, 2020

cloud-fan reviewed Jan 16, 2020

cloud-fan approved these changes Jan 16, 2020

cloud-fan closed this in 6e5b4bf Jan 16, 2020

maryannxue reviewed Jan 16, 2020

JkSelf mentioned this pull request Jan 17, 2020

[SPARK-30524] [SQL] follow up SPARK-30524 to resolve comments #27253

Closed

dongjoon-hyun added the SQL label Feb 5, 2020
