[SPARK-36087][SQL][WIP] An Impl of skew key detection and data inflation optimization by zhengruifeng · Pull Request #33298 · apache/spark

zhengruifeng · 2021-07-12T09:30:07Z

What changes were proposed in this pull request?

1, introduce ShuffleExecAccumulator in ShuffleExchangeExec to support collecting arbitrary statistics;

2, implement a key-sampling ShuffleExecAccumulator to detect skewed keys and show debug info in the Spark UI;

3, in OptimizeSkewedJoin, estimate the joined size of each partition based on the sampled keys, and split a partition if it has not been split yet and its estimated joined size is too large.
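To illustrate the idea behind items 2 and 3, here is a rough sketch in plain Python (not the PR's Scala code). The function names `sample_key_counts`, `estimate_joined_rows`, and `skewed_keys`, the Bernoulli sampling scheme, and the row-count-based estimate are all assumptions made for illustration. The actual accumulator and the size estimate in this patch may work differently.

```python
import random
from collections import Counter


def sample_key_counts(keys, sample_rate, seed=0):
    """Bernoulli-sample join keys and scale the counts back up (hypothetical helper)."""
    rng = random.Random(seed)
    sampled = Counter(k for k in keys if rng.random() < sample_rate)
    # Dividing by the sample rate gives an estimate of the true per-key count.
    return {k: n / sample_rate for k, n in sampled.items()}


def estimate_joined_rows(left_counts, right_counts):
    """Estimate inner-join output rows for one partition from per-key counts."""
    # Each key contributes (rows on the left) * (rows on the right).
    return sum(n * right_counts.get(k, 0.0) for k, n in left_counts.items())


def skewed_keys(counts, top_n=3):
    """Return the keys with the largest estimated counts, most frequent first."""
    return sorted(counts, key=counts.get, reverse=True)[:top_n]


if __name__ == "__main__":
    left = sample_key_counts(["a"] * 1000 + ["b"] * 10, sample_rate=0.5)
    right = sample_key_counts(["a"] * 500 + ["b"] * 10, sample_rate=0.5)
    print("likely skewed keys:", skewed_keys(left, top_n=1))
    print("estimated joined rows:", estimate_joined_rows(left, right))
```

With a sample rate below 1.0 the counts are only estimates, so rare keys may be missed entirely. That is acceptable here because the heavy keys are the ones that dominate the joined size.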

Why are the changes needed?

1, make it easy to add new statistics that can be used in AQE rules;
2, showing skew info in the Spark UI is useful;
3, splitting partitions based on joined size can resolve data inflation.
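A small worked example of the data inflation case in item 3, again as a plain-Python sketch. The helper names `joined_rows` and `num_splits` and the `target_rows` threshold are hypothetical and do not correspond to Spark configuration options. The example shows two partitions with identical input row counts whose join outputs differ by three orders of magnitude, so a check based on input size alone would not tell them apart.

```python
import math
from collections import Counter


def joined_rows(left_keys, right_keys):
    """Exact inner-join output row count for one partition."""
    left, right = Counter(left_keys), Counter(right_keys)
    # Counter returns 0 for keys missing on the right side.
    return sum(n * right[k] for k, n in left.items())


def num_splits(estimated_rows, target_rows):
    """How many pieces a partition would need so each stays under target_rows."""
    return max(1, math.ceil(estimated_rows / target_rows))


# Partition A: 1000 distinct keys on each side, each matching once.
uniform = joined_rows(range(1000), range(1000))

# Partition B: the same 1000 input rows per side, but all sharing one key.
inflated = joined_rows(["k"] * 1000, ["k"] * 1000)

if __name__ == "__main__":
    target = 100_000  # assumed per-task output budget, for illustration only
    print(uniform, num_splits(uniform, target))
    print(inflated, num_splits(inflated, target))
```

Partition A produces 1000 joined rows, while partition B produces 1000 * 1000 = 1,000,000, even though both read the same number of input rows.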

Does this PR introduce any user-facing change?

Yes, new features are added

How was this patch tested?

Added test suites.


SparkQA · 2021-07-12T10:16:49Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45422/

SparkQA · 2021-07-12T10:53:30Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45422/

SparkQA · 2021-07-12T13:51:35Z

Test build #140910 has finished for PR 33298 at commit a9888ba.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait ShuffleExecAccumulator extends AccumulatorV2[InternalRow, String]

SparkQA · 2021-10-07T01:35:08Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48412/

SparkQA · 2021-10-07T01:43:52Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48412/

Commit a9888ba: init

github-actions bot added the SQL label Jul 12, 2021

zhengruifeng closed this Nov 19, 2021

zhengruifeng deleted the skew_key_detect_and_data_inflation_opt branch March 30, 2023 03:22

zhengruifeng wants to merge 1 commit into apache:master from zhengruifeng:skew_key_detect_and_data_inflation_opt
