[SPARK-31134][SQL] optimize skew join after shuffle partitions are coalesced by cloud-fan · Pull Request #27893 · apache/spark

cloud-fan · 2020-03-12T17:06:37Z

What changes were proposed in this pull request?

Run the OptimizeSkewedJoin rule after the CoalesceShufflePartitions rule.

Why are the changes needed?

Remove duplicated coalescing code in OptimizeSkewedJoin.

Does this PR introduce any user-facing change?

No

How was this patch tested?

existing tests

cloud-fan · 2020-03-12T17:13:02Z

cc @maryannxue @JkSelf

SparkQA · 2020-03-12T21:55:43Z

Test build #119721 has finished for PR 27893 at commit 4b8a195.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maryannxue · 2020-03-13T15:36:28Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

+      val sizes = mapStats.bytesByPartitionId
+      val partitions = partitionSpecs.map {
+        case spec @ CoalescedPartitionSpec(start, end) =>
+          var sum = 0L


nit: sizes.slice(start, end).sum?

slice will create a new array, which is less efficient.
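
To make the trade-off in this exchange concrete, here is a minimal sketch of the two ways to sum a range of partition sizes. The object and method names (`RangeSum`, `sumRange`, `sumRangeWithSlice`) are hypothetical and are not taken from the PR; only the `var sum = 0L` loop style and the `sizes.slice(start, end).sum` alternative come from the thread.

```scala
object RangeSum {
  // Sums sizes(start) .. sizes(end - 1) with a plain loop.
  // No intermediate collection is allocated.
  def sumRange(sizes: Array[Long], start: Int, end: Int): Long = {
    var sum = 0L
    var i = start
    while (i < end) {
      sum += sizes(i)
      i += 1
    }
    sum
  }

  // Same result, but slice first copies the range into a new array.
  def sumRangeWithSlice(sizes: Array[Long], start: Int, end: Int): Long =
    sizes.slice(start, end).sum
}
```

Both methods return the same value; the loop version just avoids the temporary array, which is the efficiency concern raised in the reply.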

maryannxue · 2020-03-13T15:45:02Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

-            val mapStartIndices = getMapStartIndices(left, partitionIndex, leftTargetSize)
-            if (mapStartIndices.length > 1) {
+            val CoalescedPartitionSpec(start, end) = left.partitions(partitionIndex)._1
+            assert(start + 1 == end, "coalesced partition should never be skewed.")


First, our factor check is not strict enough: it only requires "> 0", so what happens here if the factor is set to "1"?
Second, assertions are usually disabled in production, so a violated assumption could lead to errors later in this code.
We should make this more robust by putting the condition into isSkewed. You can still add the assertion inside the isSkewed implementation.
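
One way to read this suggestion is sketched below: the "not coalesced" condition becomes part of the skew check itself instead of a separate `assert`. This is an illustration under stated assumptions, not the code that was merged. `PartitionRange` is a hypothetical stand-in for `CoalescedPartitionSpec`, and the `factor` parameter and the exact skew formula are assumptions made for the example.

```scala
object SkewCheck {
  // Hypothetical stand-in for CoalescedPartitionSpec: covers reducers [start, end).
  final case class PartitionRange(startReducerIndex: Int, endReducerIndex: Int) {
    // A range spanning more than one reducer partition has been coalesced.
    def isCoalesced: Boolean = startReducerIndex + 1 < endReducerIndex
  }

  // Assumed skew rule: a partition is skewed when it is larger than
  // factor * median size. A coalesced range is never reported as skewed,
  // so callers do not depend on an assertion that may be disabled.
  def isSkewed(spec: PartitionRange, size: Long, medianSize: Long, factor: Double): Boolean =
    !spec.isCoalesced && size > medianSize * factor
}
```

With the condition folded into the check, a coalesced range is skipped even when assertions are turned off.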

maryannxue · 2020-03-13T16:29:35Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

-        val rightSize = rightStats.bytesByPartitionId(partitionIndex)
+        val rightSize = rightSizes(partitionIndex)
        val isRightSkew = isSkewed(rightSize, rightMedSize) && canSplitRight
        if (isLeftSkew || isRightSkew) {


Not related to this PR, but I think we can remove this outer if now.

JkSelf · 2020-03-16T06:16:10Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

          val rightParts = if (isRightSkew) {
-            val mapStartIndices = getMapStartIndices(right, partitionIndex, rightTargetSize)
-            if (mapStartIndices.length > 1) {
+            val CoalescedPartitionSpec(start, end) = right.partitions(partitionIndex)._1


The code that calculates leftParts and rightParts is almost the same. It would be better to wrap it in a method.

SparkQA · 2020-03-16T07:05:01Z

Test build #119840 has finished for PR 27893 at commit a66933b.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-16T07:05:01Z

Test build #119837 has finished for PR 27893 at commit fbf616c.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-16T07:05:01Z

Test build #119839 has finished for PR 27893 at commit d7e55e8.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-03-16T07:06:38Z

retest this please

SparkQA · 2020-03-16T12:48:00Z

Test build #119841 has finished for PR 27893 at commit a66933b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-03-16T16:31:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

+        val isRightCoalesced = rightPartSpec.startReducerIndex + 1 < rightPartSpec.endReducerIndex
+
+        // Ideally a skewed partition won't get coalesced, but skip it here for safety.
+        val leftParts = if (isLeftSkew && !isLeftCoalesced) {


@JkSelf I tried to create a common method to handle both sides, but the method takes too many parameters, so I gave up. Besides, there is not much duplicated code here.

maryannxue · 2020-03-16T16:46:50Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

+        val rightPartSpec = right.partitionsWithSizes(partitionIndex)._1
+        val isRightCoalesced = rightPartSpec.startReducerIndex + 1 < rightPartSpec.endReducerIndex
+
+        // Ideally a skewed partition won't get coalesced, but skip it here for safety.


nit: A skewed partition should never be coalesced, but skip it here just to be safe.

maryannxue

LGTM.

SparkQA · 2020-03-16T21:29:41Z

Test build #119878 has finished for PR 27893 at commit ba07313.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-16T22:33:38Z

Test build #119881 has finished for PR 27893 at commit 89ab703.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-16T22:38:39Z

Test build #119883 has finished for PR 27893 at commit c67590a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Ngone51 · 2020-03-17T02:46:10Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

+        val rightPartSpec = right.partitionsWithSizes(partitionIndex)._1
+        val isRightCoalesced = rightPartSpec.startReducerIndex + 1 < rightPartSpec.endReducerIndex
+
+        // A skewed partition should never be coalesced, but skip it here just to be safe.


Say the original map output sizes are 100, 10, and 2000, and the coalesce target is 100. After CoalesceShufflePartitions, we would have CoalescedPartitionSpec(0, 1) and CoalescedPartitionSpec(1, 3). When OptimizeSkewedJoin is then applied, CoalescedPartitionSpec(1, 3) is obviously skewed but would be missed. Right?

I don't think the coalesce rule will coalesce 10 and 2000, can you double check?

Oh yeah, I checked, you're right!
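
The resolution of this exchange can be illustrated with a simplified greedy coalescing sketch. It assumes that a range is closed as soon as adding the next partition would push it past the target size. That rule is an assumption made for illustration and is not claimed to match the CoalesceShufflePartitions implementation. `GreedyCoalesce` and `coalesce` are hypothetical names.

```scala
import scala.collection.mutable.ArrayBuffer

object GreedyCoalesce {
  // Returns reducer ranges as (start, end) pairs, end exclusive.
  // Simplified sketch: close the current range whenever adding the
  // next partition would exceed targetSize.
  def coalesce(sizes: Array[Long], targetSize: Long): Seq[(Int, Int)] = {
    val ranges = ArrayBuffer.empty[(Int, Int)]
    var start = 0
    var currentSize = 0L
    var i = 0
    while (i < sizes.length) {
      // Only close a range that already holds at least one partition.
      if (i > start && currentSize + sizes(i) > targetSize) {
        ranges += ((start, i))
        start = i
        currentSize = 0L
      }
      currentSize += sizes(i)
      i += 1
    }
    // Emit the trailing range, if any input was given.
    if (sizes.nonEmpty) ranges += ((start, sizes.length))
    ranges.toSeq
  }
}
```

Under this rule, sizes 100, 10, 2000 with a target of 100 yield the ranges (0, 1), (1, 2) and (2, 3): the 10 and the 2000 are not merged, because 10 + 2000 exceeds the target, so the large partition stays on its own and remains visible to a later skew check.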

Ngone51 · 2020-03-17T06:43:41Z

LGTM

gatorsmile · 2020-03-17T07:22:15Z

Thanks! Merged to master/3.0

…alesced

### What changes were proposed in this pull request?
Run the `OptimizeSkewedJoin` rule after the `CoalesceShufflePartitions` rule.

### Why are the changes needed?
Remove duplicated coalescing code in `OptimizeSkewedJoin`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing tests

Closes #27893 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(cherry picked from commit 30d9535)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

optimize skew join after shuffle partitions are coalesced

fbf616c

cloud-fan force-pushed the aqe branch 2 times, most recently from 5671ce2 to d7e55e8 Compare March 16, 2020 06:48

address comment

a66933b

cloud-fan force-pushed the aqe branch from d7e55e8 to a66933b Compare March 16, 2020 06:51

simplify

ba07313

cloud-fan added 2 commits March 17, 2020 01:06

improve logging

89ab703

fix

c67590a

maryannxue approved these changes Mar 16, 2020

gatorsmile closed this in 30d9535 Mar 17, 2020
