[SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum #33451

Closed
wants to merge 48 commits

Conversation

Ngone51
Member

@Ngone51 Ngone51 commented Jul 21, 2021

What changes were proposed in this pull request?

This PR adds support to diagnose shuffle data corruption. Basically, the diagnosis mechanism works like this:
The shuffle reader calculates the checksum (c1) for the corrupted shuffle block and sends it to the server where the block is stored. The server then reads back the checksum (c2) stored in the checksum file and recalculates the checksum (c3) for the corresponding shuffle block. If c2 != c3, we suspect the corruption is caused by a disk issue. Otherwise, if c1 != c3, we suspect the corruption is caused by a network issue. Otherwise, the checksum verification passes. If any error occurs during the diagnosis itself, the cause is reported as unknown.
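
For illustration, here is a minimal sketch of the comparison described above (the `Cause` names and the method shape are illustrative assumptions, not the exact Spark API):

```java
// Hedged sketch of the server-side checksum comparison; enum values and signature are illustrative.
public class ChecksumDiagnosisSketch {

  enum Cause { DISK_ISSUE, NETWORK_ISSUE, CHECKSUM_VERIFY_PASS }

  /**
   * c1 = checksum recalculated by the shuffle reader from the (corrupted) block it received
   * c2 = checksum read back from the checksum file on the server
   * c3 = checksum recalculated by the server from the shuffle block on disk
   */
  static Cause diagnose(long c1, long c2, long c3) {
    if (c2 != c3) {
      return Cause.DISK_ISSUE;           // the block on disk no longer matches its recorded checksum
    } else if (c1 != c3) {
      return Cause.NETWORK_ISSUE;        // disk data is intact, so corruption happened in transit
    } else {
      return Cause.CHECKSUM_VERIFY_PASS; // all checksums agree; the corruption can't be attributed
    }
  }

  public static void main(String[] args) {
    System.out.println(diagnose(42L, 7L, 7L));   // NETWORK_ISSUE
    System.out.println(diagnose(42L, 42L, 7L));  // DISK_ISSUE
    System.out.println(diagnose(42L, 42L, 42L)); // CHECKSUM_VERIFY_PASS
  }
}
```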

After the shuffle reader receives the diagnosis response, it takes action based on the type of cause. Only in the case of a network issue do we retry the fetch; otherwise, we throw the fetch failure directly. Also note that if the corruption happens inside BufferReleasingInputStream, the reducer throws the fetch failure immediately regardless of the cause, since the data has already been partially consumed by downstream RDDs. If the corruption happens again after a retry, the reducer throws the fetch failure directly this time, without diagnosis.
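
A similarly hedged sketch of the reader-side decision (method and flag names are placeholders, not Spark APIs):

```java
// Hedged sketch of how the reader acts on a diagnosis result; names are illustrative only.
public class CorruptionHandlingSketch {

  enum Cause { DISK_ISSUE, NETWORK_ISSUE, CHECKSUM_VERIFY_PASS, UNKNOWN_ISSUE }

  static String handle(Cause cause, boolean consumedDownstream, boolean alreadyRetried) {
    if (consumedDownstream || alreadyRetried) {
      // Data was partially consumed by downstream RDDs, or this is a repeated corruption
      // of the same block: fail fast without (another) diagnosis.
      return "throwFetchFailure";
    }
    // Only a suspected network issue justifies re-fetching the block.
    return cause == Cause.NETWORK_ISSUE ? "retryFetch" : "throwFetchFailure";
  }

  public static void main(String[] args) {
    System.out.println(handle(Cause.NETWORK_ISSUE, false, false)); // retryFetch
    System.out.println(handle(Cause.DISK_ISSUE, false, false));    // throwFetchFailure
    System.out.println(handle(Cause.NETWORK_ISSUE, true, false));  // throwFetchFailure
  }
}
```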

Please check out #32385 for the complete proposal of the shuffle checksum project.

Why are the changes needed?

Shuffle data corruption is a long-standing issue in Spark. For example, in SPARK-18105, people continually report corruption issues. However, data corruption is difficult to reproduce in most cases and even harder to trace to a root cause. We don't know whether it's a Spark issue or not. With diagnosis support for shuffle corruption, Spark itself can at least distinguish between disk and network causes, which is very important for users.

Does this PR introduce any user-facing change?

Yes, users may know the cause of the shuffle corruption after this change.

How was this patch tested?

Added tests.

@github-actions github-actions bot added the CORE label Jul 21, 2021
@Ngone51
Member Author

Ngone51 commented Jul 21, 2021

cc @mridulm @tgravescs @otterc @cloud-fan Please help review, thanks!

@SparkQA

SparkQA commented Jul 21, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45897/

@SparkQA

SparkQA commented Jul 21, 2021

Test build #141382 has finished for PR 33451 at commit 7220442.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45903/

@SparkQA

SparkQA commented Jul 21, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45903/

@SparkQA

SparkQA commented Jul 21, 2021

Test build #141385 has finished for PR 33451 at commit df30c80.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@otterc otterc left a comment

Did one round of review. My general thoughts about this change are:

  • Though it avoids re-fetching a corrupted block when the cause of corruption is a disk issue, the cost of finding the cause, which requires sending another message to the server, is about as high as just retrying the corrupt block. The retry of a corrupt block happens just once, so I don't think this change saves much time in that respect. If there is any data (benchmark results) suggesting otherwise, please do share.
  • With respect to diagnostics, most of the time the corruption is due to disk; corruption due to network (or other) issues should be rare. I feel that this broad classification of corruption may not be that helpful to the user. Again, if there are metrics suggesting otherwise, please share. I looked at the other PR but didn't find additional information there.

Contributor

@mridulm mridulm left a comment

Had a few comments, mostly looks good - thanks for working on this @Ngone51 !

File probeFile = ExecutorDiskUtils.getFile(
  executor.localDirs,
  executor.subDirsPerLocalDir,
  fileName);
Contributor

Btw, while looking at this, an unrelated potential issue - we use intern in ExecutorDiskUtils.
Probably should move to using guava interner (Utils.weakIntern does this) ... thoughts @Ngone51 ?

Member Author

Unfortunately, Utils can't be referenced in network-shuffle module.

Contributor

Yeah, I meant something similar ... we don't need to do this for this PR, btw; just thinking out loud.
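
(For reference, a minimal sketch of the weak-interning idea, assuming Guava is available on the classpath; this is not the Spark implementation:)

```java
// Illustrative sketch of a Guava-based weak interner (similar in spirit to Utils.weakIntern).
import com.google.common.collect.Interner;
import com.google.common.collect.Interners;

public class WeakInternSketch {
  // Unlike String.intern(), entries can be garbage-collected once no longer referenced.
  private static final Interner<String> INTERNER = Interners.newWeakInterner();

  public static String weakIntern(String s) {
    return INTERNER.intern(s);
  }

  public static void main(String[] args) {
    String a = weakIntern(new String("shuffle-local-dir"));
    String b = weakIntern(new String("shuffle-local-dir"));
    System.out.println(a == b); // true: both references point to the single interned instance
  }
}
```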

@Ngone51
Member Author

Ngone51 commented Jul 23, 2021

@otterc

Though it avoids re-fetching a corrupted block when the cause of corruption is a disk issue, the cost of finding the cause, which requires sending another message to the server, is about as high as just retrying the corrupt block.

The main motivation behind the shuffle checksum project is to report the cause of data corruption to users/developers, to help them debug the underlying root cause further. It doesn't really aim to bring a performance improvement. Please also note that diagnosis only happens on a corruption error (which is a corner case), so it won't have a big impact on performance.

I feel that this broad classification of corruption may not be that helpful to the user

These are the only causes we can report under the current solution, and I think it's actually helpful. Without this change, people can only guess at the cause. Even if we all suspect disk issues are the most likely cause, no one can say so for sure.

@Ngone51
Member Author

Ngone51 commented Jul 23, 2021

@otterc @mridulm Thanks for the reviews. I'll try to address the comments soon.

@SparkQA

SparkQA commented Jul 26, 2021

Test build #141617 has finished for PR 33451 at commit ac44409.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 26, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46135/

        cause = Cause.CHECKSUM_VERIFY_PASS;
      }
    } catch (UnsupportedOperationException e) {
      cause = Cause.UNSUPPORTED_CHECKSUM_ALGORITHM;
Member Author

cc @mridulm @otterc @tgravescs, who discussed the ESS upgrade issue before.

@Ngone51
Member Author

Ngone51 commented Jul 26, 2021

FYI, there's a major change after addressing #33451 (comment):

Previously, we'd diagnose corruption when the first corruption of a block was detected. Now, we only diagnose corruption on the second corruption of the same block. Thus, we resolve the concern in #33451 (comment).

@SparkQA

SparkQA commented Jul 26, 2021

Test build #141647 has finished for PR 33451 at commit efeea10.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 26, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46163/

Contributor

@mridulm mridulm left a comment

Mostly looks good, just a few minor comments.
Thanks for the fixes @Ngone51 !


Contributor

@otterc otterc left a comment

There don't seem to be any UTs added to ShuffleBlockFetcherIteratorSuite. Don't have any other comments.

@SparkQA

SparkQA commented Jul 27, 2021

Test build #141710 has finished for PR 33451 at commit a66db1a.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 27, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46223/

@SparkQA

SparkQA commented Jul 27, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46223/

@Ngone51
Member Author

Ngone51 commented Jul 27, 2021

There don't seem to be any UTs added to ShuffleBlockFetcherIteratorSuite.

Sure. I'll add some there.

@SparkQA

SparkQA commented Jul 27, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46227/

@SparkQA

SparkQA commented Aug 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46454/

@SparkQA

SparkQA commented Aug 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46454/

@SparkQA

SparkQA commented Aug 2, 2021

Test build #141944 has finished for PR 33451 at commit ca1b058.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in a98d919 Aug 2, 2021
asfgit pushed a commit that referenced this pull request Aug 2, 2021
…ffle checksum

Closes #33451 from Ngone51/SPARK-36206.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
@mridulm
Contributor

mridulm commented Aug 2, 2021

Tests passed, merged to master and branch-3.2
+CC @gengliangwang

Thanks for working on this @Ngone51 !
Thanks for the reviews @cloud-fan, @otterc, @gengliangwang :-)

@Ngone51
Member Author

Ngone51 commented Aug 2, 2021

Thank you, everybody!

venkata91 pushed a commit to linkedin/spark that referenced this pull request Oct 13, 2021
domybest11 pushed a commit to domybest11/spark that referenced this pull request Jun 15, 2022
domybest11 pushed a commit to domybest11/spark that referenced this pull request Jun 15, 2022