[SPARK-32830][SQL] Optimize Skewed BroadcastNestedLoopJoin with AQE #29692
AngersZhuuuu wants to merge 8 commits into apache:master from
Conversation
Test build #128445 has finished for PR 29692 at commit
Test build #128453 has finished for PR 29692 at commit
Test build #128487 has finished for PR 29692 at commit
retest this please
     |Stream side partitions size info:
     |${getSizeInfo(streamMedSize, stream.mapStats.bytesByPartitionId)}
   """.stripMargin)
val canSplitStream = canSplitLeftSide(joinType)
Is this really correct? How about the case: BuildLeft and right-outer?
Sorry, my mistake; we should remove this condition, since for BroadcastNestedLoopJoin we don't need to care about the join type.
}

case bnl @ BroadcastNestedLoopJoinExec(leftChild, rightChild, buildSide, joinType, _, _) =>
  def resolveBroadcastNLJoinSkew(
This is just a suggestion; could we share code between smj and bnl cases? Most parts look duplicated.
I quite agree with this suggestion; once all the details are settled, I will start that work.
Ah, one more comment; could you update the code comment in
Updated; no comment refers only to SMJ now.
What's the high-level idea? We can handle skewed SMJ because there is a shuffle, so we can split the partition at the granularity of shuffle blocks. Broadcast join doesn't have shuffles.
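For readers following along, here is a minimal, self-contained sketch of the splitting granularity the comment above refers to: once a shuffle exists, the per-map-output sizes of a skewed reduce partition are known, so it can be cut into map-index ranges that several tasks read independently. The case class and helper below are simplified stand-ins, not Spark's actual internal partial-reducer spec.

```scala
// Simplified stand-in for Spark's internal partial-reducer spec (illustrative only).
case class PartialReducerSlice(reducerIndex: Int, startMapIndex: Int, endMapIndex: Int)

/** Split one skewed reduce partition into slices of roughly `targetSize` bytes,
 *  using the per-map-output sizes that the shuffle makes available. */
def splitSkewedPartition(
    reducerIndex: Int,
    bytesByMapId: Array[Long],
    targetSize: Long): Seq[PartialReducerSlice] = {
  val slices = scala.collection.mutable.ArrayBuffer.empty[PartialReducerSlice]
  var start = 0
  var acc = 0L
  bytesByMapId.indices.foreach { i =>
    acc += bytesByMapId(i)
    if (acc >= targetSize) {
      slices += PartialReducerSlice(reducerIndex, start, i + 1)
      start = i + 1
      acc = 0L
    }
  }
  if (start < bytesByMapId.length) {
    slices += PartialReducerSlice(reducerIndex, start, bytesByMapId.length)
  }
  slices.toSeq
}
```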
Test build #128513 has finished for PR 29692 at commit
Test build #128514 has finished for PR 29692 at commit
Test build #128496 has finished for PR 29692 at commit
Yeah, I have thought a lot about this; normal SQL can't easily work around this case. In our production environment, data skew on the stream side (for example, the stream side ends with a skewed group-by) before a broadcast nested loop join always causes long running times. There are two ways to avoid this case:
Method 1 needs changes to the SQL logic; method 2, although it adds one more shuffle, is a narrow dependency, and nowadays network cost is cheap and usually not the bottleneck. After trying with AQE, the existing AQE mechanism is not suitable for this case, since there is no shuffle before the BNLJ. It's not very elegant, but the benefits are very clear. I am not sure whether the community will accept this approach.
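For concreteness, a hedged sketch of the "extra shuffle" workaround (method 2) using only the public DataFrame API; the table names, column names, and partition count are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bnlj-skew-workaround").getOrCreate()

// Stream side: an aggregation whose output is often heavily skewed.
val skewedStream = spark.table("events")
  .groupBy("user_id")
  .count()
  // Manually added extra shuffle to rebalance the skewed output before the join.
  .repartition(200)

// Small side, assumed small enough to be broadcast; a non-equi condition
// forces a BroadcastNestedLoopJoin plan.
val smallDim = spark.table("dim_rules")

val joined = skewedStream.crossJoin(smallDim)
  .where(skewedStream("count") > smallDim("threshold"))
```

The SQL equivalent is adding a DISTRIBUTE BY on the stream-side subquery, which is the manual workaround also mentioned later in this thread.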
Test build #128508 has finished for PR 29692 at commit
Test build #128515 has finished for PR 29692 at commit
So you add a shuffle to the stream side to stop it before the join node and get statistics?
Yes, we stop before the join and get the row count of each stream-side partition, then re-shuffle the stream side and join.
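A simplified, self-contained sketch of what that detection step could look like, in the spirit of the `getSizeInfo(streamMedSize, stream.mapStats.bytesByPartitionId)` snippet quoted earlier; the factor and minimum-size threshold here are placeholders, not the PR's actual values:

```scala
// Median size across the stream side's shuffle partitions.
def medianSize(bytesByPartitionId: Array[Long]): Long = {
  val sorted = bytesByPartitionId.sorted
  math.max(sorted(sorted.length / 2), 1L)
}

// A partition is considered skewed only if it is both much larger than the
// median and larger than an absolute minimum, so tiny partitions are never split.
def isSkewed(
    size: Long,
    median: Long,
    factor: Long = 5L,
    minBytes: Long = 256L * 1024 * 1024): Boolean =
  size > median * factor && size > minBytes
```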
If the user manually adds a shuffle (DISTRIBUTE BY) in the query before the broadcast join, I think we can take care of the skew. The Spark query optimizer should not add the extra shuffle by itself, as it's likely to cause a perf regression. We also need to be careful that splitting the skewed partitions breaks the output partitioning and can cause an extra shuffle.
With this rule, we can't handle such data skew cases automatically. With strict and reasonable config values, the cost of the extra shuffle is much less than the overhead of the data skew, especially for broadcast join / broadcast nested loop join where the stream side ends with a group-by (there are many such business scenarios) and is often seriously skewed. Getting business people to tune each job is difficult.
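For reference, these are the existing AQE skew-join thresholds that this kind of tuning argument is about, shown with illustrative values and assuming an active SparkSession `spark`; whether the PR adds an analogous knob for BNLJ is not shown here:

```scala
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
// A partition is treated as skewed only if it exceeds both of these thresholds.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```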
Then we need some estimation work, as the shuffle/scan node may be far away from the join node. We also need to carefully justify whether the extra shuffle cost is worth the skew-elimination benefits.
  val shuffleStages = collectShuffleStages(plan)

- if (shuffleStages.length == 2) {
+ if (shuffleStages.length >= 1) {
Does this mean we will also handle multi-table join (e.g. three-table join)?
No; a broadcast nested loop join can only have a shuffle exchange on one side, while a sort merge join has one on both sides.
Yeah, only when the skew is very serious and the threshold is reached is it worth re-shuffling the data.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
For BroadcastNestedLoopJoin, we broadcast the build-side child to all executors, and each stream-side partition traverses the broadcast-side data row by row.
We have met cases where the stream side is skewed and all the finished tasks have to wait for the skewed partition to complete.
The execution time grows with the amount of data in a partition: if a partition is skewed by 100x, its task runs roughly 100x longer than the non-skewed ones.
This is a bottleneck; with AQE, we can avoid it by splitting the skewed partition's data to make the work more balanced.
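To make the idea concrete, here is a conceptual sketch using plain Scala collections in place of Spark's internals: the skewed stream-side partition is split into balanced slices, each slice is joined against the full broadcast side independently, and the results are concatenated. For inner/cross joins this is safe because every output row depends only on one stream row plus the whole broadcast relation; all names below are illustrative.

```scala
// Naive nested loop join of one stream slice against the broadcast relation.
def nestedLoopJoin[A, B](streamSlice: Seq[A], broadcast: Seq[B])
    (condition: (A, B) => Boolean): Seq[(A, B)] =
  for { a <- streamSlice; b <- broadcast if condition(a, b) } yield (a, b)

// Split the skewed stream partition into `numSlices` pieces so the work is
// spread across several tasks instead of one long-running straggler.
def joinWithSkewSplitting[A, B](
    skewedStreamPartition: Seq[A],
    broadcast: Seq[B],
    numSlices: Int)(condition: (A, B) => Boolean): Seq[(A, B)] = {
  val sliceSize = math.max(1, skewedStreamPartition.length / numSlices)
  skewedStreamPartition
    .grouped(sliceSize)
    .toSeq
    .flatMap(slice => nestedLoopJoin(slice, broadcast)(condition))
}
```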
Why are the changes needed?
To avoid the long-tail bottleneck caused by stream-side data skew in BroadcastNestedLoopJoin, as described above.
Does this PR introduce any user-facing change?
NO
How was this patch tested?
Added UT