[SPARK-36967][CORE] Report accurate shuffle block size if its skewed #34234

Closed

wankunde wants to merge 10 commits into apache:master from wankunde:map_status

Conversation

@wankunde
Contributor

@wankunde wankunde commented Oct 10, 2021

What changes were proposed in this pull request?

A shuffle block is considered skewed and will be accurately recorded in HighlyCompressedMapStatus if its size is larger than this factor multiplied by the median shuffle block size.

Before this change

![map_status_before](https://user-images.githubusercontent.com/3626747/137251903-08a3544c-dc77-4b78-8ae5-93b42a54bd03.png)

After this change

![map_status_after](https://user-images.githubusercontent.com/3626747/137251871-355db24d-d66b-4702-8766-216db30a39e0.jpg)

Why are the changes needed?

Currently, a map task reports an accurate shuffle block size only if the block size is greater than "spark.shuffle.accurateBlockThreshold" (100M by default). But if there are a large number of map tasks and the shuffle block sizes of these tasks are all smaller than "spark.shuffle.accurateBlockThreshold", data skew may go unrecognized.

For example, with 10000 map tasks and 10000 reduce tasks, where each map task creates a 50M shuffle block for reduce 0 and 10K shuffle blocks for the remaining reduce tasks, reduce 0 is skewed, but the statistics of this plan do not contain this information.
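For illustration, a minimal Scala sketch of the proposed check (the parameter names accurateBlockThreshold and accurateBlockSkewedFactor are assumptions based on the discussion and diff below, not the exact implementation):

// Sketch only: record a block size accurately if it exceeds the absolute
// threshold, or if it looks skewed relative to the median block size.
def shouldRecordAccurately(
    size: Long,
    medianSize: Long,
    accurateBlockThreshold: Long,        // assumed: spark.shuffle.accurateBlockThreshold
    accurateBlockSkewedFactor: Double): Boolean = {  // assumed: the new skew-factor config
  size >= accurateBlockThreshold ||
    (accurateBlockSkewedFactor > 0 && size >= accurateBlockSkewedFactor * medianSize)
}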

Does this PR introduce any user-facing change?

No

How was this patch tested?

Updated existing UTs

@github-actions github-actions bot added the CORE label Oct 10, 2021
@wankunde wankunde changed the title [SPARK-36967]Report accurate block size threshold per reduce task [SPARK-36967][CORE]Report accurate block size threshold per reduce task Oct 10, 2021
@HyukjinKwon HyukjinKwon changed the title [SPARK-36967][CORE]Report accurate block size threshold per reduce task [SPARK-36967][CORE] Report accurate block size threshold per reduce task Oct 11, 2021
@wankunde wankunde changed the title [SPARK-36967][CORE] Report accurate block size threshold per reduce task [WIP][SPARK-36967][CORE] Report accurate block size threshold per reduce task Oct 13, 2021
@wankunde wankunde closed this Oct 14, 2021
@wankunde wankunde reopened this Oct 14, 2021
@wankunde wankunde changed the title [WIP][SPARK-36967][CORE] Report accurate block size threshold per reduce task [SPARK-36967][CORE] Report accurate block size threshold per reduce task Oct 14, 2021
@wankunde wankunde changed the title [SPARK-36967][CORE] Report accurate block size threshold per reduce task [SPARK-36967][CORE] Report accurate shuffle block size if skewed Oct 14, 2021
@wankunde wankunde changed the title [SPARK-36967][CORE] Report accurate shuffle block size if skewed [SPARK-36967][CORE] Report accurate shuffle block size if its skewed Oct 14, 2021
Member

@Ngone51 Ngone51 left a comment

The general idea makes sense to me.

@Ngone51
Member

Ngone51 commented Oct 18, 2021

cc @cloud-fan @JoshRosen

Contributor

@JoshRosen JoshRosen left a comment

Thanks for this PR. This seems similar to the stale PR #32733 by @exmy (cc @mridulm who reviewed that PR).

The goal of HighlyCompressedMapStatus is to achieve a trade-off between the storage space of the compressed map status and the accuracy of size information for large blocks. We care about accurate block sizes for two reasons:

  1. Reducers use estimated block sizes in order to limit the total amount of data that they try to fetch at a given time. If the estimated block sizes are severe underestimates then reducers may fetch more data than intended, potentially causing OOMs.
  2. Spark SQL's OptimizeSkewedJoin rule uses map output statistics to identify skewed partitions. If we underestimate the sizes of skewed map outputs then we'll underestimate partition sizes and thereby fail to perform the skew-join optimization.

I think that PR #32733 was primarily motivated by (1) whereas this PR is motivated by (2), but the issues are related.

A key difference between the two PRs is the threshold for deciding whether to report a more accurate size for a skewed map output: #32733 compared against 5x the average non-empty map output size (plus an additional configurable threshold), whereas this PR compares against the median map output size.

Given how Spark SQL's skew optimization works, I think it makes sense to compare against the median rather than the average. In that code, a partition is considered skewed if its size is larger than spark.sql.adaptive.skewJoin.skewedPartitionFactor * medianPartitionSize and larger than spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes. With default configuration values, a partition is considered to be skewed if size > 5 * medianSize && size > 256MB. Note that these thresholds apply to reduce partition sizes (summing map output sizes to obtain the total input size of each reduce partition).
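For reference, the condition described above boils down to something like the following sketch (an illustration of the check with the default values plugged in, not the actual OptimizeSkewedJoin code):

// Sketch of the skew condition with the default config values:
// size > 5 * medianSize && size > 256MB.
def isSkewedPartition(
    size: Long,
    medianSize: Long,
    skewedPartitionFactor: Double = 5.0,
    skewedPartitionThresholdInBytes: Long = 256L * 1024 * 1024): Boolean = {
  size > skewedPartitionFactor * medianSize &&
    size > skewedPartitionThresholdInBytes
}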

I guess the main risk to this change is that we could somehow report too many large blocks, thereby increasing memory pressure on the driver and reduce tasks (which currently need to hold all of the map statuses in memory). Theoretically I could create a contrived map output size distribution where up to half of the map outputs are > 5 * medianSize but maybe that pathological distribution is unlikely to occur in practice.

I'll keep brainstorming about this.

Comment on lines 258 to 265
Contributor

It looks like this code is copied from:

private def medianSize(sizes: Array[Long]): Long = {
  val numPartitions = sizes.length
  val bytes = sizes.sorted
  numPartitions match {
    case _ if (numPartitions % 2 == 0) =>
      math.max((bytes(numPartitions / 2) + bytes(numPartitions / 2 - 1)) / 2, 1)
    case _ => math.max(bytes(numPartitions / 2), 1)
  }
}

Instead of copy-paste duplicating it, I think we should extract the common code into a median(long[]) helper method in org.apache.spark.Utils.

Contributor Author

Done

@JoshRosen
Contributor

If we proceed with this approach then I think we should add new unit tests. We might be able to re-use test cases / logic from https://github.com/apache/spark/pull/32733/files#diff-6f905716753c4647e146b60bb2e397cb19b7cda76f05208cbbdc891a5f16d54a

Contributor

@mridulm mridulm left a comment

For shuffle map stages with a large number of partitions, the benefits of more accurate estimation are at odds with the nontrivial cost of maintaining accurate statistics.
Unfortunately, any statistic we pick (median, avg, x-th percentile, x-th percentile * scale, top N'th value, etc.) has drawbacks: either increased memory use or inaccuracy in the statistics.

As long as we can turn the behavior off, that should minimize the impact for degenerate situations.

Thoughts ?

Contributor

Given the cost, compute this only if required ?

Contributor Author

I don't think sorting thousands of numbers will cost too much. Am I right?

Contributor

Let us avoid unnecessary allocations and cost.

Contributor Author

Yes, I have updated the PR to sort the shuffle blocks only if the user enables this feature manually. Is this OK?

Contributor

Enable this only if explicitly configured? That way we preserve the current behavior and can see what the impact would be.
We can make it the default in the future.

Contributor Author

There are many long-running skewed jobs in our cluster, but the driver does not recognize the skewed tasks.
For example, the job in the PR description.

Contributor

If we are making a change to the current behavior, I would prefer to keep it disabled by default and enable it explicitly.

Contributor Author

Thanks @mridulm, I just updated the code and disabled this behavior by default. Could you help me review again?

@wankunde
Contributor Author

Hi, @Ngone51 @JoeyValentine @mridulm
I added a parameter to limit the number of reported shuffle blocks when there are too many huge skewed blocks.
I think this is also helpful for limiting the memory usage of the MapStatus object.
Could you help me to review this PR again?
Many thanks.
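A rough sketch of what such a cap could look like (the parameter name maxAccurateSkewedBlocks is hypothetical; see the diff for the actual config):

// Sketch only: keep accurate sizes for at most maxAccurateSkewedBlocks of the
// largest blocks above the threshold; all other blocks keep the averaged size.
def accurateBlockIndices(
    sizes: Array[Long],
    threshold: Long,
    maxAccurateSkewedBlocks: Int): Array[Int] = {
  sizes.zipWithIndex
    .filter { case (size, _) => size >= threshold }
    .sortBy { case (size, _) => -size }   // prefer the largest blocks
    .take(maxAccurateSkewedBlocks)
    .map { case (_, index) => index }
}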

Contributor

@mridulm mridulm left a comment

Just had a couple of comments, +CC @JoshRosen, @Ngone51

Contributor

nit: Avoid this duplication and pull this value out of the if/else ?

Contributor Author

Pulled the code out of the if/else.

Thanks

Contributor

Do we want to make this a doubleConf ?

Contributor Author

Updated this conf to a doubleConf

@mridulm
Contributor

mridulm commented Oct 25, 2021

Ok to test

@SparkQA

SparkQA commented Oct 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49048/

@SparkQA

SparkQA commented Oct 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49048/

@SparkQA

SparkQA commented Oct 25, 2021

Test build #144577 has finished for PR 34234 at commit 4068f58.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49052/

@SparkQA

SparkQA commented Oct 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49052/

@SparkQA

SparkQA commented Oct 25, 2021

Test build #144581 has finished for PR 34234 at commit 8fa9e86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49253/

@SparkQA

SparkQA commented Oct 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49253/

@wankunde
Contributor Author

wankunde commented Nov 1, 2021

hi, @attilapiros
Jenkins failed to build due to a StackOverflowError; could you help me retrigger the tests?
Thanks.

@attilapiros
Contributor

Jenkins retest this please

@SparkQA

SparkQA commented Nov 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49275/

@SparkQA

SparkQA commented Nov 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49275/

@SparkQA

SparkQA commented Nov 1, 2021

Test build #144805 has finished for PR 34234 at commit a1505ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@mridulm mridulm left a comment

Had one comment, rest looks good to me.

+CC @Ngone51, @JoshRosen, @attilapiros

@SparkQA

SparkQA commented Nov 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49699/

@SparkQA

SparkQA commented Nov 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49699/

@SparkQA

SparkQA commented Nov 15, 2021

Test build #145229 has finished for PR 34234 at commit 2083cc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Contributor

mridulm commented Nov 15, 2021

The changes look reasonable to me - but will want additional eyes on this.
Any thoughts @Ngone51, @JoshRosen, @attilapiros ?

@wankunde
Contributor Author

Hi, @Ngone51 @JoshRosen @attilapiros
Could you help me to review this PR?
Thanks

Contributor

@attilapiros attilapiros left a comment

Some ideas to make the tests more readable.

@wankunde
Contributor Author

@attilapiros Thanks for your review. I have updated the UTs; could you help review the code again? Thanks.

Contributor

@attilapiros attilapiros left a comment

We are very close.

Contributor

@attilapiros attilapiros left a comment

Thanks @wankunde!

LGTM

dchvn pushed a commit to dchvn/spark that referenced this pull request Jan 19, 2022

Closes apache#34234 from wankunde/map_status.

Authored-by: Kun Wan <wankun@apache.org>
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
val threshold =
  if (accurateBlockSkewedFactor > 0) {
    val sortedSizes = uncompressedSizes.sorted
    val medianSize: Long = Utils.median(sortedSizes)
Contributor

It seems we sort uncompressedSizes twice: sortedSizes is already sorted, but Utils.median sorts it again.

Contributor Author

Yes, uncompressedSizes is sorted twice. Maybe we can change Utils.median(sizes: Array[Long]) to Utils.median(sizes: Array[Long], alreadySorted: Boolean = false), and change Utils.median(sortedSizes) to Utils.median(sortedSizes, true) to avoid this extra sort ?

@ulysses-you WDYT?
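A quick sketch of that suggested signature change (illustrative only, not necessarily the final code):

// Sketch: skip the extra sort when the caller has already sorted the input.
def median(sizes: Array[Long], alreadySorted: Boolean = false): Long = {
  val len = sizes.length
  val sorted = if (alreadySorted) sizes else sizes.sorted
  if (len % 2 == 0) {
    math.max((sorted(len / 2) + sorted(len / 2 - 1)) / 2, 1)
  } else {
    math.max(sorted(len / 2), 1)
  }
}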

Contributor

@wankunde Looks good, thanks.
