[SPARK-32830][SQL] Optimize Skewed BroadcastNestedLoopJoin with AQE #29692
AngersZhuuuu wants to merge 8 commits into apache:master from
Conversation
Test build #128445 has finished for PR 29692 at commit
Test build #128453 has finished for PR 29692 at commit
Test build #128487 has finished for PR 29692 at commit
retest this please
     |Stream side partitions size info:
     |${getSizeInfo(streamMedSize, stream.mapStats.bytesByPartitionId)}
   """.stripMargin)
val canSplitStream = canSplitLeftSide(joinType)
Is this really correct? How about the case: BuildLeft and right-outer?
Sorry, my mistake; we should remove this condition, since for BroadcastNestedLoopJoin we don't need to care about the join type.
}

case bnl @ BroadcastNestedLoopJoinExec(leftChild, rightChild, buildSide, joinType, _, _) =>
  def resolveBroadcastNLJoinSkew(
This is just a suggestion; could we share code between smj and bnl cases? Most parts look duplicated.
I quite agree with this suggestion; once all the details are settled, I will start that work.
Ah, one more comment; could you update the code comment in
Updated; no comment refers only to SMJ now.
What's the high-level idea? We can handle skewed SMJ because there is a shuffle, so we can split the partition at the granularity of shuffle blocks. Broadcast join doesn't have shuffles.
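For readers following along, here is a minimal, self-contained sketch of the splitting granularity the comment above refers to: once a shuffle exists, the per-map-output sizes of a skewed reduce partition are known, so it can be cut into map-index ranges that several tasks read independently. The case class and helper below are simplified stand-ins, not Spark's actual internal partial-reducer spec.

```scala
// Simplified stand-in for Spark's internal partial-reducer spec (illustrative only).
case class PartialReducerSlice(reducerIndex: Int, startMapIndex: Int, endMapIndex: Int)

/** Split one skewed reduce partition into slices of roughly `targetSize` bytes,
 *  using the per-map-output sizes that the shuffle makes available. */
def splitSkewedPartition(
    reducerIndex: Int,
    bytesByMapId: Array[Long],
    targetSize: Long): Seq[PartialReducerSlice] = {
  val slices = scala.collection.mutable.ArrayBuffer.empty[PartialReducerSlice]
  var start = 0
  var acc = 0L
  bytesByMapId.indices.foreach { i =>
    acc += bytesByMapId(i)
    if (acc >= targetSize) {
      slices += PartialReducerSlice(reducerIndex, start, i + 1)
      start = i + 1
      acc = 0L
    }
  }
  if (start < bytesByMapId.length) {
    slices += PartialReducerSlice(reducerIndex, start, bytesByMapId.length)
  }
  slices.toSeq
}
```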
Test build #128513 has finished for PR 29692 at commit
Test build #128514 has finished for PR 29692 at commit
Test build #128496 has finished for PR 29692 at commit
Yeah, I have thought a lot about this; normal SQL can't easily work around this case. In our production environment, data skew on the stream side (for example, the stream side ends with a skewed group-by) before a broadcast nested loop join always causes long running times. There are two ways to avoid this case:
Method 1 needs changes to the SQL logic; method 2, although it adds one more shuffle, is a narrow dependency, and nowadays network cost is cheap and usually not the bottleneck. After trying with AQE, the existing AQE mechanism is not suitable for this case, since there is no shuffle before the BNLJ. It's not very elegant, but the benefits are very clear. I am not sure whether the community will accept this approach.
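For concreteness, a hedged sketch of the "extra shuffle" workaround (method 2) using only the public DataFrame API; the table names, column names, and partition count are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bnlj-skew-workaround").getOrCreate()

// Stream side: an aggregation whose output is often heavily skewed.
val skewedStream = spark.table("events")
  .groupBy("user_id")
  .count()
  // Manually added extra shuffle to rebalance the skewed output before the join.
  .repartition(200)

// Small side, assumed small enough to be broadcast; a non-equi condition
// forces a BroadcastNestedLoopJoin plan.
val smallDim = spark.table("dim_rules")

val joined = skewedStream.crossJoin(smallDim)
  .where(skewedStream("count") > smallDim("threshold"))
```

The SQL equivalent is adding a DISTRIBUTE BY on the stream-side subquery, which is the manual workaround also mentioned later in this thread.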
Test build #128508 has finished for PR 29692 at commit
Test build #128515 has finished for PR 29692 at commit
So you add a shuffle to the stream side to stop it before the join node and get statistics?
Yes, we stop before the join and get the row count of each stream-side partition, then re-shuffle the stream side and join.
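A simplified, self-contained sketch of what that detection step could look like, in the spirit of the `getSizeInfo(streamMedSize, stream.mapStats.bytesByPartitionId)` snippet quoted earlier; the factor and minimum-size threshold here are placeholders, not the PR's actual values:

```scala
// Median size across the stream side's shuffle partitions.
def medianSize(bytesByPartitionId: Array[Long]): Long = {
  val sorted = bytesByPartitionId.sorted
  math.max(sorted(sorted.length / 2), 1L)
}

// A partition is considered skewed only if it is both much larger than the
// median and larger than an absolute minimum, so tiny partitions are never split.
def isSkewed(
    size: Long,
    median: Long,
    factor: Long = 5L,
    minBytes: Long = 256L * 1024 * 1024): Boolean =
  size > median * factor && size > minBytes
```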
If the user manually adds a shuffle (DISTRIBUTE BY) in the query before the broadcast join, I think we can take care of the skew. The Spark query optimizer should not add the extra shuffle by itself, as it's likely to cause a perf regression. We also need to be careful that splitting the skewed partitions breaks the output partitioning and can cause an extra shuffle.
With this rule, we can't handle such data skew cases automatically. With strict and reasonable config values, the cost of the extra shuffle is much less than the overhead of the data skew, especially for broadcast join / broadcast nested loop join where the stream side ends with a group-by (there are many such business scenarios) and is often seriously skewed. Getting business people to tune each job is difficult.
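For reference, these are the existing AQE skew-join thresholds that this kind of tuning argument is about, shown with illustrative values and assuming an active SparkSession `spark`; whether the PR adds an analogous knob for BNLJ is not shown here:

```scala
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
// A partition is treated as skewed only if it exceeds both of these thresholds.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```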
Then we need some estimation work, as the shuffle/scan node may be far away from the join node. We also need to carefully justify whether the extra shuffle cost is worth the skew-elimination benefits.
  val shuffleStages = collectShuffleStages(plan)

- if (shuffleStages.length == 2) {
+ if (shuffleStages.length >= 1) {
Does this mean we will also handle multi-table join (e.g. three-table join)?
No; a broadcast nested loop join can only have a shuffle exchange on one side, while a sort merge join has one on both sides.
Yeah, only when the skew is very serious and the threshold is reached is it worth re-shuffling the data.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
For BroadcastNestedLoopJoin, we broadcast the build-side child to all executors, and each stream-side partition traverses the broadcast-side data row by row.
We have met cases where the stream side is skewed and all the finished tasks have to wait for the skewed partition to complete.
The execution time grows with the amount of data in a partition: if a partition is skewed by 100x, its task runs roughly 100x longer than the non-skewed ones.
This is a bottleneck; with AQE, we can avoid it by splitting the skewed partition's data to make the work more balanced.
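To make the idea concrete, here is a conceptual sketch using plain Scala collections in place of Spark's internals: the skewed stream-side partition is split into balanced slices, each slice is joined against the full broadcast side independently, and the results are concatenated. For inner/cross joins this is safe because every output row depends only on one stream row plus the whole broadcast relation; all names below are illustrative.

```scala
// Naive nested loop join of one stream slice against the broadcast relation.
def nestedLoopJoin[A, B](streamSlice: Seq[A], broadcast: Seq[B])
    (condition: (A, B) => Boolean): Seq[(A, B)] =
  for { a <- streamSlice; b <- broadcast if condition(a, b) } yield (a, b)

// Split the skewed stream partition into `numSlices` pieces so the work is
// spread across several tasks instead of one long-running straggler.
def joinWithSkewSplitting[A, B](
    skewedStreamPartition: Seq[A],
    broadcast: Seq[B],
    numSlices: Int)(condition: (A, B) => Boolean): Seq[(A, B)] = {
  val sliceSize = math.max(1, skewedStreamPartition.length / numSlices)
  skewedStreamPartition
    .grouped(sliceSize)
    .toSeq
    .flatMap(slice => nestedLoopJoin(slice, broadcast)(condition))
}
```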
Why are the changes needed?
To avoid the long-tail bottleneck caused by stream-side data skew in BroadcastNestedLoopJoin, as described above.
Does this PR introduce any user-facing change?
NO
How was this patch tested?
Added UT