[SPARK-36087][SQL][WIP] An Impl of skew key detection and data inflation optimization #33298
Closed
zhengruifeng wants to merge 1 commit into apache:master from
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140910 has finished for PR 33298 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
What changes were proposed in this pull request?
1. Introduce ShuffleExecAccumulator in ShuffleExchangeExec to support arbitrary statistics.
2. Implement a key-sampling ShuffleExecAccumulator to detect skewed keys and show debug info in the Spark UI.
3. In OptimizeSkewedJoin, estimate the joined size of each partition based on the sampled keys, and split a partition if it has not been split yet and its estimated joined size is too large.
Why are the changes needed?
1. Make it easy to add new statistics that can be used in AQE rules.
2. Showing skew info in the Spark UI is useful.
3. Splitting partitions based on joined size can resolve data inflation.
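The idea of estimating a partition's joined size from sampled keys can be illustrated with a minimal, Spark-free sketch. This is not the code in this PR. The names SkewSketch, KeySample, estimateJoinedRows, and shouldSplit are hypothetical, and so are the sampling rates and the row threshold. The sketch assumes an inner equi-join in which each side keeps a per-partition map from sampled key to sampled row count.

```scala
object SkewSketch {
  // Hypothetical per-partition sample: sampled key -> number of sampled rows.
  type KeySample = Map[String, Long]

  // Merge two samples by summing counts per key, the way an accumulator
  // would combine per-task results.
  def merge(a: KeySample, b: KeySample): KeySample =
    b.foldLeft(a) { case (acc, (k, n)) => acc.updated(k, acc.getOrElse(k, 0L) + n) }

  // Estimate the join output row count for one partition. For every key seen
  // on both sides, an inner equi-join emits (left rows) * (right rows);
  // sampled counts are scaled back up by the assumed sampling rates.
  def estimateJoinedRows(
      left: KeySample,
      right: KeySample,
      leftRate: Double,
      rightRate: Double): Double =
    left.iterator.collect {
      case (k, l) if right.contains(k) => (l / leftRate) * (right(k) / rightRate)
    }.sum

  // Split only partitions that are not split yet and whose estimate exceeds
  // the (assumed) threshold.
  def shouldSplit(estimatedRows: Double, thresholdRows: Double, alreadySplit: Boolean): Boolean =
    !alreadySplit && estimatedRows > thresholdRows
}
```

For example, a left sample with 100 rows for key "a" joined with a right sample with 50 rows for "a" gives an estimate of 5000 output rows at a sampling rate of 1.0, even though neither input looks large on its own. This is the data inflation case that a size-based skew check alone would miss.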
Does this PR introduce any user-facing change?
Yes, new features are added
How was this patch tested?
Added test suites.