[SPARK-36967][CORE] Report accurate shuffle block size if its skewed #34234

Closed

wankunde wants to merge 10 commits into apache:master from wankunde:map_status

Conversation

@wankunde
Contributor

@wankunde wankunde commented Oct 10, 2021

What changes were proposed in this pull request?

A shuffle block is considered skewed and will be accurately recorded in HighlyCompressedMapStatus if its size is larger than this factor multiplied by the median shuffle block size.

Before this change

![map_status_before](https://user-images.githubusercontent.com/3626747/137251903-08a3544c-dc77-4b78-8ae5-93b42a54bd03.png)

After this change

![map_status_after](https://user-images.githubusercontent.com/3626747/137251871-355db24d-d66b-4702-8766-216db30a39e0.jpg)

Why are the changes needed?

Currently, a map task reports an accurate shuffle block size only if the block size is greater than "spark.shuffle.accurateBlockThreshold" (100M by default). But if there are a large number of map tasks and the shuffle block sizes of these tasks are all smaller than "spark.shuffle.accurateBlockThreshold", data skew may go unrecognized.

For example, with 10000 map tasks and 10000 reduce tasks, where each map task creates a 50M shuffle block for reduce 0 and 10K shuffle blocks for the remaining reduce tasks, reduce 0 is skewed, but the statistics of this plan do not contain this information.
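For illustration, a minimal Scala sketch of the proposed check (the parameter names accurateBlockThreshold and accurateBlockSkewedFactor are assumptions based on the discussion and diff below, not the exact implementation):

// Sketch only: record a block size accurately if it exceeds the absolute
// threshold, or if it looks skewed relative to the median block size.
def shouldRecordAccurately(
    size: Long,
    medianSize: Long,
    accurateBlockThreshold: Long,        // assumed: spark.shuffle.accurateBlockThreshold
    accurateBlockSkewedFactor: Double): Boolean = {  // assumed: the new skew-factor config
  size >= accurateBlockThreshold ||
    (accurateBlockSkewedFactor > 0 && size >= accurateBlockSkewedFactor * medianSize)
}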

Does this PR introduce any user-facing change?

No

How was this patch tested?

Updated existing UTs

@github-actions github-actions bot added the CORE label Oct 10, 2021
@wankunde wankunde changed the title [SPARK-36967]Report accurate block size threshold per reduce task [SPARK-36967][CORE]Report accurate block size threshold per reduce task Oct 10, 2021
@HyukjinKwon HyukjinKwon changed the title [SPARK-36967][CORE]Report accurate block size threshold per reduce task [SPARK-36967][CORE] Report accurate block size threshold per reduce task Oct 11, 2021
@wankunde wankunde changed the title [SPARK-36967][CORE] Report accurate block size threshold per reduce task [WIP][SPARK-36967][CORE] Report accurate block size threshold per reduce task Oct 13, 2021
@wankunde wankunde closed this Oct 14, 2021
@wankunde wankunde reopened this Oct 14, 2021
@wankunde wankunde changed the title [WIP][SPARK-36967][CORE] Report accurate block size threshold per reduce task [SPARK-36967][CORE] Report accurate block size threshold per reduce task Oct 14, 2021
@wankunde wankunde changed the title [SPARK-36967][CORE] Report accurate block size threshold per reduce task [SPARK-36967][CORE] Report accurate shuffle block size if skewed Oct 14, 2021
@wankunde wankunde changed the title [SPARK-36967][CORE] Report accurate shuffle block size if skewed [SPARK-36967][CORE] Report accurate shuffle block size if its skewed Oct 14, 2021
Member

@Ngone51 Ngone51 left a comment

The general idea makes sense to me.

@Ngone51
Member

Ngone51 commented Oct 18, 2021

cc @cloud-fan @JoshRosen

Contributor

@JoshRosen JoshRosen left a comment

Thanks for this PR. This seems similar to the stale PR #32733 by @exmy (cc @mridulm who reviewed that PR).

The goal of HighlyCompressedMapStatus is to achieve a trade-off between the storage space of the compressed map status and the accuracy of size information for large blocks. We care about accurate block sizes for two reasons:

  1. Reducers use estimated block sizes in order to limit the total amount of data that they try to fetch at a given time. If the estimated block sizes are severe underestimates then reducers may fetch more data than intended, potentially causing OOMs.
  2. Spark SQL's OptimizeSkewedJoin rule uses map output statistics to identify skewed partitions. If we underestimate the sizes of skewed map outputs then we'll underestimate partition sizes and thereby fail to perform the skew-join optimization.

I think that PR #32733 was primarily motivated by (1) whereas this PR is motivated by (2), but the issues are related.

A key difference between the two PRs is the threshold for deciding whether to report a more accurate size for a skewed map output: #32733 compared against 5x the average non-empty map output size (plus an additional configurable threshold), whereas this PR compares against the median map output size.

Given how Spark SQL's skew optimization works, I think it makes sense to compare against the median rather than the average. In that code, a partition is considered skewed if its size is larger than spark.sql.adaptive.skewJoin.skewedPartitionFactor * medianPartitionSize and larger than spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes. With default configuration values, a partition is considered to be skewed if size > 5 * medianSize && size > 256MB. Note that these thresholds apply to reduce partition sizes (summing map output sizes to obtain the total input size of each reduce partition).
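For reference, the condition described above boils down to something like the following sketch (an illustration of the check with the default values plugged in, not the actual OptimizeSkewedJoin code):

// Sketch of the skew condition with the default config values:
// size > 5 * medianSize && size > 256MB.
def isSkewedPartition(
    size: Long,
    medianSize: Long,
    skewedPartitionFactor: Double = 5.0,
    skewedPartitionThresholdInBytes: Long = 256L * 1024 * 1024): Boolean = {
  size > skewedPartitionFactor * medianSize &&
    size > skewedPartitionThresholdInBytes
}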

I guess the main risk to this change is that we could somehow report too many large blocks, thereby increasing memory pressure on the driver and reduce tasks (which currently need to hold all of the map statuses in memory). Theoretically I could create a contrived map output size distribution where up to half of the map outputs are > 5 * medianSize but maybe that pathological distribution is unlikely to occur in practice.

I'll keep brainstorming about this.

Comment on lines 258 to 265
Contributor

It looks like this code is copied from:

private def medianSize(sizes: Array[Long]): Long = {
  val numPartitions = sizes.length
  val bytes = sizes.sorted
  numPartitions match {
    case _ if (numPartitions % 2 == 0) =>
      math.max((bytes(numPartitions / 2) + bytes(numPartitions / 2 - 1)) / 2, 1)
    case _ => math.max(bytes(numPartitions / 2), 1)
  }
}

Instead of copy-paste duplicating it, I think we should extract the common code into a median(long[]) helper method in org.apache.spark.Utils.

Contributor Author

Done

@JoshRosen
Contributor

If we proceed with this approach then I think we should add new unit tests. We might be able to re-use test cases / logic from https://github.com/apache/spark/pull/32733/files#diff-6f905716753c4647e146b60bb2e397cb19b7cda76f05208cbbdc891a5f16d54a

Contributor

@mridulm mridulm left a comment

For shuffle map stages with a large number of partitions, the benefits of more accurate estimation are at odds with the nontrivial cost of maintaining accurate statistics.
Unfortunately, any statistic we pick (median, avg, x-th percentile, x-th percentile * scale, top N'th value, etc.) has drawbacks: either increased memory use or inaccuracy in the statistics.

As long as we can turn the behavior off, that should minimize the impact for degenerate situations.

Thoughts ?

Contributor

Given the cost, compute this only if required ?

Contributor Author

I don't think sorting thousands of numbers will cost too much. Am I right?

Contributor

Let us avoid unnecessary allocations and cost.

Contributor Author

Yes, I have updated the PR to sort the shuffle blocks only if the user enables this feature manually. Is this OK?

Contributor

Enable this only if explicitly configured? That way we preserve the current behavior and can see what the impact would be.
We can make it the default in the future.

Contributor Author

There are many long-running skewed jobs in our cluster, but the driver does not recognize the skewed tasks.
For example, the job in the PR description.

Contributor

If we are making a change to the current behavior, I would prefer to keep it disabled by default and enable it explicitly.

Contributor Author

Thanks @mridulm, I just updated the code and disabled this behavior by default. Could you help me review again?

@wankunde
Contributor Author

Hi, @Ngone51 @JoeyValentine @mridulm
I added a parameter to limit the number of reported shuffle blocks when there are too many huge skewed blocks.
I think this is also helpful for limiting the memory usage of the MapStatus object.
Could you help me to review this PR again?
Many thanks.
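A rough sketch of what such a cap could look like (the parameter name maxAccurateSkewedBlocks is hypothetical; see the diff for the actual config):

// Sketch only: keep accurate sizes for at most maxAccurateSkewedBlocks of the
// largest blocks above the threshold; all other blocks keep the averaged size.
def accurateBlockIndices(
    sizes: Array[Long],
    threshold: Long,
    maxAccurateSkewedBlocks: Int): Array[Int] = {
  sizes.zipWithIndex
    .filter { case (size, _) => size >= threshold }
    .sortBy { case (size, _) => -size }   // prefer the largest blocks
    .take(maxAccurateSkewedBlocks)
    .map { case (_, index) => index }
}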

Contributor

@mridulm mridulm left a comment

Just had a couple of comments, +CC @JoshRosen, @Ngone51

Contributor

nit: Avoid this duplication and pull this value out of the if/else ?

Contributor Author

Pulled the code out of the if/else.

Thanks

Contributor

Do we want to make this a doubleConf ?

Contributor Author

Updated this conf to a doubleConf

@mridulm
Contributor

mridulm commented Oct 25, 2021

Ok to test

@SparkQA

SparkQA commented Oct 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49048/

@SparkQA

SparkQA commented Oct 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49048/

@SparkQA

SparkQA commented Oct 25, 2021

Test build #144577 has finished for PR 34234 at commit 4068f58.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49052/

@SparkQA

SparkQA commented Oct 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49052/

@SparkQA

SparkQA commented Oct 25, 2021

Test build #144581 has finished for PR 34234 at commit 8fa9e86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49253/

@SparkQA

SparkQA commented Oct 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49253/

@wankunde
Contributor Author

wankunde commented Nov 1, 2021

hi, @attilapiros
Jenkins failed to build due to a StackOverflowError; could you help me retrigger the tests?
Thanks.

@attilapiros
Contributor

Jenkins retest this please

@SparkQA

SparkQA commented Nov 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49275/

@SparkQA

SparkQA commented Nov 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49275/

@SparkQA

SparkQA commented Nov 1, 2021

Test build #144805 has finished for PR 34234 at commit a1505ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@mridulm mridulm left a comment

Had one comment, rest looks good to me.

+CC @Ngone51, @JoshRosen, @attilapiros

@SparkQA

SparkQA commented Nov 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49699/

@SparkQA

SparkQA commented Nov 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49699/

@SparkQA

SparkQA commented Nov 15, 2021

Test build #145229 has finished for PR 34234 at commit 2083cc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Contributor

mridulm commented Nov 15, 2021

The changes look reasonable to me - but will want additional eyes on this.
Any thoughts @Ngone51, @JoshRosen, @attilapiros ?

@wankunde
Contributor Author

Hi, @Ngone51 @JoshRosen @attilapiros
Could you help me to review this PR?
Thanks

Contributor

@attilapiros attilapiros left a comment

Some ideas to make the tests more readable.

@wankunde
Contributor Author

@attilapiros Thanks for your review. I have updated the UTs; could you help review the code again? Thanks.

Contributor

@attilapiros attilapiros left a comment

We are very close.

Contributor

@attilapiros attilapiros left a comment

Thanks @wankunde!

LGTM

dchvn pushed a commit to dchvn/spark that referenced this pull request Jan 19, 2022

Closes apache#34234 from wankunde/map_status.

Authored-by: Kun Wan <wankun@apache.org>
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
val threshold =
  if (accurateBlockSkewedFactor > 0) {
    val sortedSizes = uncompressedSizes.sorted
    val medianSize: Long = Utils.median(sortedSizes)
Contributor

It seems we sort uncompressedSizes twice: sortedSizes is already sorted, but Utils.median sorts it again.

Contributor Author

Yes, uncompressedSizes is sorted twice. Maybe we can change Utils.median(sizes: Array[Long]) to Utils.median(sizes: Array[Long], alreadySorted: Boolean = false), and change Utils.median(sortedSizes) to Utils.median(sortedSizes, true) to avoid this extra sort ?

@ulysses-you WDYT?
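A quick sketch of that suggested signature change (illustrative only, not necessarily the final code):

// Sketch: skip the extra sort when the caller has already sorted the input.
def median(sizes: Array[Long], alreadySorted: Boolean = false): Long = {
  val len = sizes.length
  val sorted = if (alreadySorted) sizes else sizes.sorted
  if (len % 2 == 0) {
    math.max((sorted(len / 2) + sorted(len / 2 - 1)) / 2, 1)
  } else {
    math.max(sorted(len / 2), 1)
  }
}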

Contributor

@wankunde Looks good, thanks.
