[WIP][SPARK-35275][CORE] Add checksum for shuffle blocks and diagnose corruption #32385
Conversation
I marked the PR as WIP.
cc @mridulm @otterc @attilapiros @tgravescs @cloud-fan Please take a look, thanks!
Test build #138049 has finished for PR 32385 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
+CC @otterc This should be of interest given recent discussions.
Thanks for copying me. I will take a look at it in a few days.
Hi all, to ease the review for everyone, I plan to split this PR into 2 smaller PRs first:

(I have created PR #32401 for this part, so you can start reviewing from there.)

(I'll create the PR later.)
I haven't had time to look in detail; if this is a working prototype, have you done any performance measurements to see what impact this has? I know you said it should be minimal, but some numbers would be nice. Also, at a high level, how does this affect the other shuffle work going on, like merging and pluggable shuffle? Is it independent of that, or would it need to be reimplemented?
@tgravescs Thanks for the good points! I did find some perf regression by benchmarking with the change. I'll double-check it for sure and try to get rid of it if possible.

For merging, it needs an extension to send checksum values along with the block data while merging. The extension is also needed for the decommission feature. For pluggable, my current implementation is added at

An alternative way to support checksum for all plugins, or rather to make it a built-in feature, may be to implement it in
Hi @tgravescs @mridulm @otterc, I have resolved the regression issue (verified by running the TPCDS benchmark with 3TB data internally) and made the checksum a built-in feature of Spark. I have also updated PR #32401 (which adds checksum support at the shuffle writer side only) and I think it's ready for review.
thanks @Ngone51, I'm very busy this week, so will take a look early next week.

Sure, take your time :)
Took one pass through the PR, thanks for working on this @Ngone51!
```java
if (!appId.equals(that.appId)) return false;
if (!execId.equals(that.execId)) return false;
if (!blockId.equals(that.blockId)) return false;
return checksum == that.checksum;
```
super nit: check `checksum` first? It's the cheapest check.
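A minimal sketch of the suggested reordering; it assumes (from the snippet above) that this is the `equals` of the `DiagnoseCorruption` message, and the surrounding boilerplate is illustrative:

```java
@Override
public boolean equals(Object other) {
  if (this == other) return true;
  if (other == null || getClass() != other.getClass()) return false;
  DiagnoseCorruption that = (DiagnoseCorruption) other;
  // Compare the primitive long first: it's the cheapest check and the most
  // likely to differ, so mismatches bail out before any String comparison.
  if (checksum != that.checksum) return false;
  if (!appId.equals(that.appId)) return false;
  if (!execId.equals(that.execId)) return false;
  return blockId.equals(that.blockId);
}
```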
```java
@Override
public void encode(ByteBuf buf) {
  buf.writeInt(cause.ordinal());
```
`int` -> `byte`?
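A hedged sketch of what the suggested change could look like; the enum values and the codec class are illustrative (the PR's actual `Cause` values aren't shown here), and the `writeInt` -> `writeByte` swap is the only point:

```java
import io.netty.buffer.ByteBuf;

public class CorruptionCauseCodec {
  // Illustrative enum; the real Cause values live in the PR.
  public enum Cause { UNKNOWN_ISSUE, DISK_ISSUE, NETWORK_ISSUE, CHECKSUM_VERIFY_PASS }

  public static void encode(ByteBuf buf, Cause cause) {
    buf.writeByte(cause.ordinal());  // a byte is plenty for a small enum; was writeInt
  }

  public static Cause decode(ByteBuf buf) {
    return Cause.values()[buf.readByte()];
  }
}
```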
```diff
@@ -47,6 +48,15 @@
 protected volatile TransportClientFactory clientFactory;
 protected String appId;
```

```java
public Cause diagnoseCorruption(
```
Include javadoc?
```java
int result = appId.hashCode();
result = 31 * result + execId.hashCode();
result = 31 * result + blockId.hashCode();
result = 31 * result + (int) checksum;
```
nit: `checksum` -> `Long.hashCode(checksum)`?
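A sketch of the hash with that suggestion applied (the method wrapper is illustrative, not the PR's actual code):

```java
static int hashDiagnoseCorruption(String appId, String execId, String blockId, long checksum) {
  int result = appId.hashCode();
  result = 31 * result + execId.hashCode();
  result = 31 * result + blockId.hashCode();
  // Long.hashCode mixes both halves: (int) (checksum ^ (checksum >>> 32)),
  // whereas (int) checksum silently drops the high 32 bits.
  result = 31 * result + Long.hashCode(checksum);
  return result;
}
```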
```java
this.partitionLengths = new long[numPartitions];
this.partitionChecksums = checksumEnabled ? new long[numPartitions] : new long[0];
```
We are using `null` in `MapOutputCommitMessage` while an empty array here when checksum is disabled. Unify to a single idiom? Given `writeMetadataFileAndCommit` depends on the empty array (based on how it is written up right now), thoughts on using `new long[0]`? (Btw, use a constant `EMPTY_LONG_ARRAY` if deciding to use `new long[0]`.)
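A sketch of the constant idiom being suggested, which also avoids allocating a fresh empty array per writer; the enclosing class and constructor are illustrative, with field names taken from the snippet above:

```java
public class ShuffleWriterState {
  private static final long[] EMPTY_LONG_ARRAY = new long[0];

  private final long[] partitionLengths;
  private final long[] partitionChecksums;

  public ShuffleWriterState(int numPartitions, boolean checksumEnabled) {
    this.partitionLengths = new long[numPartitions];
    // One shared sentinel instead of a new allocation whenever checksums
    // are disabled.
    this.partitionChecksums = checksumEnabled ? new long[numPartitions] : EMPTY_LONG_ARRAY;
  }
}
```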
```scala
    }
  }
} finally {
  logDebug(s"Shuffle index for mapId $mapId: ${lengths.mkString("[", ",", "]")}")
  if (indexTmp.exists() && !indexTmp.delete()) {
    logError(s"Failed to delete temporary index file at ${indexTmp.getAbsolutePath}")
  }
  checksumTmpOpt.foreach { checksumTmp =>
    if (checksumTmp.exists() && !checksumTmp.delete()) {
      logError(s"Failed to delete temporary checksum file at ${checksumTmp.getAbsolutePath}")
```
`logInfo`?
```scala
val channel = Files.newByteChannel(checksumFile.toPath)
channel.position(reduceId * 8L)
in = new DataInputStream(Channels.newInputStream(channel))
val goldenChecksum = in.readLong()
```
Extract out `readChecksum` and `computeChecksum` methods?

Btw, `tryWithResource { DataInputStream(FileInputStream()).skip(reduceId * 8L).readLong() }` would do the trick for `readChecksum`.
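A minimal sketch of what such a `readChecksum` helper might look like in plain JDK code; the method name follows the suggestion above, and the file layout (one 8-byte long per reduce partition) is taken from the quoted snippet:

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

static long readChecksum(File checksumFile, int reduceId) throws IOException {
  try (DataInputStream in = new DataInputStream(new FileInputStream(checksumFile))) {
    // Each partition's checksum is a long, so its entry starts at reduceId * 8.
    long toSkip = reduceId * 8L;
    while (toSkip > 0) {
      long skipped = in.skip(toSkip);  // skip() may skip fewer bytes than asked
      if (skipped <= 0) {
        throw new IOException("Unexpected EOF while seeking to checksum entry");
      }
      toSkip -= skipped;
    }
    return in.readLong();
  }
}
```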
```scala
val checksumIn = new CheckedInputStream(blockData.createInputStream(), new Adler32)
val buffer = new Array[Byte](8192)
while (checksumIn.read(buffer, 0, 8192) != -1) {}
val recalculatedChecksum = checksumIn.getChecksum.getValue
```
We are not closing `checksumIn`, btw.
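A sketch of the recomputation with the stream closed via try-with-resources, addressing the leak noted above (the helper name is illustrative):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.Adler32;
import java.util.zip.CheckedInputStream;

static long recalculateChecksum(InputStream blockData) throws IOException {
  try (CheckedInputStream in = new CheckedInputStream(blockData, new Adler32())) {
    byte[] buffer = new byte[8192];
    // Drain the stream; CheckedInputStream updates the running Adler-32
    // checksum as a side effect of every read.
    while (in.read(buffer, 0, buffer.length) != -1) { }
    return in.getChecksum().getValue();
  }
}
```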
```diff
@@ -76,6 +77,9 @@ private[spark] class DiskBlockObjectWriter(
   private var initialized = false
   private var streamOpen = false
   private var hasBeenClosed = false
+  private var checksumEnabled = false
+  private var checksumCal: Checksum = null
+  private var checksumOutputStream: CheckedOutputStream = null
```
Same comment as above - reduce the number of streams kept as fields?
```scala
override def close(): Unit = sink.close()
}
```
Nice unification!
oh.. @mridulm Sorry if I confused you here. I have planned to split this PR into two separate PRs to ease the review:

So please help review the smaller PR there, and I'll try to resolve your comments in the separate PRs. Thanks!
…checksum file

### What changes were proposed in this pull request?

This is the initial work of adding checksum support for shuffle. This is a piece of #32385, and this PR only adds checksum functionality at the shuffle writer side.

Basically, the idea is to wrap a `MutableCheckedOutputStream`* upon the `FileOutputStream` while the shuffle writer generates the shuffle data. But the specific wrapping places are a bit different among the shuffle writers due to their different implementations:

* `BypassMergeSortShuffleWriter` - wrap on each partition file
* `UnsafeShuffleWriter` - wrap on each spill file directly, since they don't require aggregation or sorting
* `SortShuffleWriter` - wrap on the `ShufflePartitionPairsWriter` after merging spill files, since they might require aggregation or sorting

\* `MutableCheckedOutputStream` is a variant of `java.util.zip.CheckedOutputStream` which can change the checksum calculator at runtime. And we use `Adler32`, which is much faster than CRC-32, to calculate the checksum, the same as `Broadcast`'s checksum.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

Yes, added a new conf: `spark.shuffle.checksum`.

### How was this patch tested?

Added unit tests.

Closes #32401 from Ngone51/add-checksum-files.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
…ffle checksum

### What changes were proposed in this pull request?

This PR adds support to diagnose shuffle data corruption. Basically, the diagnosis mechanism works like this: the shuffle reader calculates the checksum (c1) for the corrupted shuffle block and sends it to the server where the block is stored. At the server, it reads back the checksum (c2) that is stored in the checksum file and recalculates the checksum (c3) for the corresponding shuffle block. Then, if c2 != c3, we suspect the corruption is caused by a disk issue. Otherwise, if c1 != c3, we suspect the corruption is caused by a network issue. Otherwise, the checksum verification passes and the cause remains unknown.

After the shuffle reader receives the diagnosis response, it takes action based on the type of cause. Only in the case of a network issue do we retry; otherwise, we throw the fetch failure directly. Also note that if the corruption happens inside BufferReleasingInputStream, the reducer throws the fetch failure immediately no matter what the cause is, since the data has been partially consumed by downstream RDDs. If corruption happens again after a retry, the reducer throws the fetch failure directly this time without the diagnosis.

Please check out #32385 to see the complete proposal of the shuffle checksum project.

### Why are the changes needed?

Shuffle data corruption is a long-standing issue in Spark. For example, in SPARK-18105, people continually report corruption issues. However, data corruption is difficult to reproduce in most cases and even harder to root-cause. We don't know if it's a Spark issue or not. With the diagnosis support for shuffle corruption, Spark itself can at least distinguish between disk and network causes, which is very important for users.

### Does this PR introduce _any_ user-facing change?

Yes, users may learn the cause of shuffle corruption after this change.

### How was this patch tested?

Added tests.

Closes #33451 from Ngone51/SPARK-36206.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
Hi @Ngone51, thanks for providing the checksum feature for shuffle. I have a thought about the following case.

I think the checksum could handle this case. Picking up the idea from #28525, we can consume the input stream inside

To achieve this, we may need to update the shuffle network protocol to support passing checksums when fetching shuffle blocks. Compared to the existing

We encounter this issue in our production every day; for some jobs, the performance overhead is acceptable compared to stability. @Ngone51 WDYT?
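A rough sketch of the eager-verification idea under discussion, assuming the fetch response carries the writer-side checksum; this is not the protocol the PR implements, and the names are hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.Adler32;
import java.util.zip.CheckedInputStream;

/** Drains the fetched block and fails fast if it doesn't match the shipped checksum. */
static void verifyBeforeConsumption(InputStream fetchedBlock, long expectedChecksum)
    throws IOException {
  CheckedInputStream in = new CheckedInputStream(fetchedBlock, new Adler32());
  byte[] buffer = new byte[8192];
  while (in.read(buffer, 0, buffer.length) != -1) { }
  if (in.getChecksum().getValue() != expectedChecksum) {
    // Corruption is detected before any bytes reach downstream RDDs, so a
    // plain re-fetch is still safe at this point.
    throw new IOException("Corrupted shuffle block detected by checksum");
  }
}
```

In practice the bytes would have to be buffered (e.g., in memory or to disk) during the drain so they can still be handed downstream after verification.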
What changes were proposed in this pull request?

This PR proposes to add checksum support for shuffle blocks. The basic idea is:

On the mapper side, we'll wrap a `CheckedOutputStream` upon the `FileOutputStream` to calculate the checksum (using the same checksum calculator, `Adler32`, as broadcast) for each shuffle block (a.k.a. partition) at the same time as we write the map output files. And similar to the index file, we'll have a checksum file to save these checksums.
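As an illustration of the mapper-side wrapping, here is a self-contained JDK sketch; it is not Spark's actual writer code (the PR uses a `MutableCheckedOutputStream` variant), and the helper name is made up:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.Adler32;
import java.util.zip.CheckedOutputStream;

static long writePartitionWithChecksum(File mapOutput, byte[] partitionBytes)
    throws IOException {
  Adler32 checksum = new Adler32();
  try (CheckedOutputStream out =
      new CheckedOutputStream(new FileOutputStream(mapOutput, /* append= */ true), checksum)) {
    // The checksum is updated while the bytes are written, so no extra pass
    // over the data is needed in the normal case.
    out.write(partitionBytes);
  }
  return checksum.getValue();  // persisted to the checksum file, like the index file
}
```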
On the reducer side, we'll also wrap a `CheckedInputStream` upon the `FileInputStream` to read the block. When block corruption is detected, we'll try to diagnose the cause of the corruption:

First, we'll use the `CheckedInputStream` to consume the remaining data of the corrupted block to calculate the checksum (`c1`);

Second, the reducer sends an RPC request called `DiagnoseCorruption` (which contains `c1`) to the server where the block is stored;

Third, the server will read (using very little memory) the corresponding block back from disk and calculate the checksum (`c2`) again for it, and also read back the checksum (`c3`) of the block saved in the checksum file. Then, if `c2 != c3`, we'll suspect the corruption is caused by a disk issue. Otherwise, if `c1 != c3`, we'll suspect the corruption is caused by a network issue. Otherwise, the cause remains unknown. The server then replies to the reducer with a `CorruptionCause` containing the cause;

Fourth, the reducer takes action after it receives the cause. If it's a disk issue or unknown, it throws a fetch failure directly. If it's a network issue, it re-fetches the block later. Also note that if the corruption happens inside `BufferReleasingInputStream`, the reducer throws the fetch failure immediately no matter what the cause is, since the data has been partially consumed by downstream RDDs. If corruption happens again after a retry, the reducer throws the fetch failure directly this time without the diagnosis.

Overall, I think we don't introduce severe overhead with this proposal. In the normal case, the checksum is calculated in a streaming way, just like the other stream wrappers (e.g., encryption, compression). The major overhead is an extra traversal of the data file in the error case, in order to calculate the checksum (`c2`).

And the proposal in this PR is much simpler compared to the previous one, #15894 (abandoned due to complexity), which introduced more overhead since it needed to traverse the data file twice for every block. In that proposal, the checksum was appended to each block's data, so it was also invasive to the existing code.
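To make the three-way comparison concrete, here is a hedged sketch of the decision rule described above; the names are illustrative rather than Spark's actual API, with `c1` = the reducer-side checksum, `c2` = the server-side recomputation from disk, and `c3` = the stored checksum:

```java
enum Cause { DISK_ISSUE, NETWORK_ISSUE, UNKNOWN_ISSUE }

static Cause diagnose(long c1, long c2, long c3) {
  if (c2 != c3) {
    // The block on disk no longer matches what was written: disk corruption.
    return Cause.DISK_ISSUE;
  } else if (c1 != c3) {
    // Disk is consistent but the reducer saw different bytes: corrupted in
    // transit, so a re-fetch is worth trying.
    return Cause.NETWORK_ISSUE;
  } else {
    // All three checksums agree; the corruption has some other, unknown cause,
    // so the reducer throws the fetch failure directly.
    return Cause.UNKNOWN_ISSUE;
  }
}
```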
Why are the changes needed?
Shuffle data corruption is a long-standing issue in Spark. For example, in SPARK-18105, people continually report corruption issues. However, data corruption is difficult to reproduce in most cases and even harder to root-cause. We don't know if it's a Spark issue or not. With checksum support for shuffle, Spark itself can at least distinguish between disk and network causes, which is very important for users.
Does this PR introduce any user-facing change?
Yes. Added `spark.shuffle.checksum` to let users enable/disable the checksum (enabled by default).

How was this patch tested?
Added an end-to-end unit test in `ShuffleSuite`. I'll add more tests if the community accepts the proposal.