[SPARK-36087][SQL][WIP] An Impl of skew key detection and data inflation optimization#33298

Closed
zhengruifeng wants to merge 1 commit into apache:master from zhengruifeng:skew_key_detect_and_data_inflation_opt

zhengruifeng (Contributor) commented Jul 12, 2021

What changes were proposed in this pull request?

1, introduce ShuffleExecAccumulator in ShuffleExchangeExec to support collecting arbitrary statistics;

2, implement a key-sampling ShuffleExecAccumulator to detect skewed keys and show debug info on the Spark UI;

3, in OptimizeSkewedJoin, estimate the joined size of each partition based on the sampled keys, and split a partition if it has not been split yet and its estimated joined size is too large.
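
To illustrate the idea in item 3, here is a minimal sketch in plain Scala with no Spark dependency. It is not the code in this PR: the object `JoinedSizeSketch`, its methods, the `sampleRate` and `maxJoinedRows` parameters, and the use of row counts as a stand-in for "joined size" are all assumptions made for illustration. It scales sampled key counts up to estimated row counts, estimates the inner-join output of one partition as the sum over shared keys of left count times right count, and flags the partition for splitting only if it has not been split already.

```scala
// Hypothetical sketch; none of these names come from the PR itself.
object JoinedSizeSketch {

  /** Estimated row count per join key within one shuffle partition. */
  type KeyCounts = Map[String, Long]

  /** Scale raw sample counts up to estimated row counts (assumes uniform sampling). */
  def scaleUp(sampled: Map[String, Long], sampleRate: Double): KeyCounts = {
    require(sampleRate > 0.0 && sampleRate <= 1.0, "sampleRate must be in (0, 1]")
    sampled.map { case (key, count) => key -> math.round(count / sampleRate) }
  }

  /** Estimated inner-join output rows: sum over keys of leftCount * rightCount. */
  def estimateJoinedRows(left: KeyCounts, right: KeyCounts): Long =
    left.iterator.map { case (key, l) => l * right.getOrElse(key, 0L) }.sum

  /** Split only if the partition is not split yet and the estimate is too large. */
  def shouldSplit(
      left: KeyCounts,
      right: KeyCounts,
      alreadySplit: Boolean,
      maxJoinedRows: Long): Boolean =
    !alreadySplit && estimateJoinedRows(left, right) > maxJoinedRows

  def main(args: Array[String]): Unit = {
    // Both sides are small in raw size, but the shared key "hot" inflates the join.
    val left = scaleUp(Map("hot" -> 100L, "cold" -> 1L), sampleRate = 0.01)
    val right = scaleUp(Map("hot" -> 50L), sampleRate = 0.01)
    println(s"estimated joined rows: ${estimateJoinedRows(left, right)}")
    println(s"split: ${shouldSplit(left, right, alreadySplit = false, maxJoinedRows = 1000000L)}")
  }
}
```

In this example neither input looks skewed by size alone, yet the single shared key dominates the estimated output, which is the data inflation case a size-based skew check would miss.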

Why are the changes needed?

1, make it easy to add new statistics that can be used in AQE rules;
2, showing skew info on the Spark UI is useful;
3, splitting partitions based on joined size can resolve data inflation.

Does this PR introduce any user-facing change?

Yes, new features are added

How was this patch tested?

Added test suites.

init
github-actions bot added the SQL label Jul 12, 2021

SparkQA commented Jul 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45422/

SparkQA commented Jul 12, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45422/

SparkQA commented Jul 12, 2021

Test build #140910 has finished for PR 33298 at commit a9888ba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • trait ShuffleExecAccumulator extends AccumulatorV2[InternalRow, String]

SparkQA commented Oct 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48412/

SparkQA commented Oct 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48412/

zhengruifeng deleted the skew_key_detect_and_data_inflation_opt branch March 30, 2023 03:22