[SPARK-38987][shuffle] Handle fallback when merged shuffle blocks are corrupted and spark.shuffle.detectCorrupt is set to true #36305
Conversation
cc @mridulm @otterc Thoughts about the handling here? Should we also add the fallback when the merged shuffle chunk is corrupted and spark.shuffle.detectCorrupt is set to false? The exception handling would need to be outside of the ShuffleBlockFetcherIterator though, which will make things complicated.
```diff
@@ -1166,6 +1166,8 @@ final class ShuffleBlockFetcherIterator(
         case ShuffleBlockBatchId(shuffleId, mapId, startReduceId, _) =>
           throw SparkCoreErrors.fetchFailedError(address, shuffleId, mapId, mapIndex, startReduceId,
             msg, e)
+        case ShuffleBlockChunkId(_, _, _, _) =>
+          pushBasedFetchHelper.initiateFallbackFetchForPushMergedBlock(blockId, address)
```
Wouldn't this result in duplicate records? Some of the chunk has already been consumed, but we don't know how much, correct? So if we now fall back to the original blocks, we could re-fetch records that were already consumed.
Can you change it to throw a fetch failed exception in this case, @zhouyejoe?
Sure, will update accordingly
I don't think we can handle corruption that happens partway through a shuffle chunk with a fallback, because we don't know how much of the chunk has already been consumed. We would have to treat this like corruption of a regular shuffle block and propagate it to the driver. So, on the driver, we will now have to handle fetch failures related to a shuffle chunk.
Can one of the admins verify this patch?
I removed the WIP tag @zhouyejoe.
Actually, can you add a test for this as well, @zhouyejoe?
@zhouyejoe @mridulm I don't think this change is sufficient. We have to make changes on the driver to handle a FetchFailure of a shuffle chunk. At the very least, we have to unregister the merge results corresponding to the shuffle chunk.
```diff
@@ -1166,6 +1166,8 @@ final class ShuffleBlockFetcherIterator(
         case ShuffleBlockBatchId(shuffleId, mapId, startReduceId, _) =>
           throw SparkCoreErrors.fetchFailedError(address, shuffleId, mapId, mapIndex, startReduceId,
             msg, e)
+        case ShuffleBlockChunkId(shuffleId, _, reduceId, _) =>
+          throw SparkCoreErrors.fetchFailedError(address, shuffleId, -1L, -1, reduceId, msg, e)
```
This will propagate the FetchFailure with mapId/mapIndex = -1. Currently the driver will not unregister the mergeStatus because mapIndex = -1. Since the mergeStatus is still registered, the next attempt will still try to fetch the corrupted merged block, and eventually the application will fail.

So this requires a change in the driver as well. Once we make that change, we should also add a UT to DAGScheduler to verify it.
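As a sketch of the driver-side change this comment describes: when the DAGScheduler sees a FetchFailed whose mapIndex is the -1 sentinel, the failure refers to a push-merged chunk, so the merge result for that reduce partition has to be dropped before rescheduling. The exact plumbing and the `unregisterMergeResult` call below are assumptions about Spark internals for illustration, not the final patch:

```scala
// Hypothetical sketch of the handling inside DAGScheduler#handleTaskCompletion
// (illustrative, not the actual patch). A FetchFailed with mapIndex == -1
// refers to a corrupted push-merged shuffle chunk rather than a specific
// map output.
case FetchFailed(bmAddress, shuffleId, _, mapIndex, reduceId, _) =>
  if (mapIndex == -1) {
    // Drop the merge result for this reduce partition; otherwise the next
    // attempt would fetch the same corrupted merged block again and the
    // application would eventually fail. The retry then falls back to the
    // original unmerged shuffle blocks.
    mapOutputTracker.unregisterMergeResult(shuffleId, reduceId, bmAddress)
  }
  // ... existing stage-failure / resubmission handling continues here
```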
Thanks for the detailed description. Will update accordingly then.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Adds corruption exception handling for merged shuffle chunks when spark.shuffle.detectCorrupt is set to true (the default).
Why are the changes needed?
Prior to Spark 3.0, spark.shuffle.detectCorrupt was true by default, and this configuration was the knob for early corruption detection, so the fallback could be triggered as expected.

Since Spark 3.0, spark.shuffle.detectCorrupt is still true by default, but early corruption detection is controlled by a new configuration, spark.shuffle.detectCorrupt.useExtraMemory, which defaults to false. Thus the default behavior, with only Magnet enabled after Spark 3.2.0 (internal li-3.1.1), disables early corruption detection, so no fallback is triggered; instead, an exception is thrown when the corrupted blocks start to be read.
We need to handle the corrupted stream for merged blocks, with or without fallback, in the following scenarios:

If the user sets spark.shuffle.detectCorrupt.useExtraMemory to true, early detection will trigger the fallback. However, this check only reads a small portion of the shuffle block to evaluate whether it is corrupted; corruption may still occur in later parts of the block, in which case it is handled under spark.shuffle.detectCorrupt.

If spark.shuffle.detectCorrupt.useExtraMemory is false but spark.shuffle.detectCorrupt is true, we shouldn't throw an exception saying a ShuffleChunk is not a shuffle block; instead, the retry/fallback should be triggered when the block is a shuffle chunk.

If both spark.shuffle.detectCorrupt.useExtraMemory and spark.shuffle.detectCorrupt are false, we should just throw an exception on the client side and fail the task.
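For reference, the three scenarios above map to these configuration combinations (a spark-defaults.conf style sketch; the property names are Spark's, the grouping and comments are illustrative):

```
# Scenario 1: early detection reads a small portion of each block up front
# and can trigger the fallback immediately; later corruption is still
# caught by spark.shuffle.detectCorrupt while reading the stream.
spark.shuffle.detectCorrupt                 true
spark.shuffle.detectCorrupt.useExtraMemory  true

# Scenario 2 (the defaults since Spark 3.0): corruption is only detected
# while reading the stream; this PR makes that path handle shuffle chunks
# instead of throwing "not a shuffle block".
spark.shuffle.detectCorrupt                 true
spark.shuffle.detectCorrupt.useExtraMemory  false

# Scenario 3: no corruption detection; the read fails on the client side
# and the task fails.
spark.shuffle.detectCorrupt                 false
spark.shuffle.detectCorrupt.useExtraMemory  false
```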
Does this PR introduce any user-facing change?
No
How was this patch tested?
Test is WIP. UT to be added