
[SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size #38333

Closed
wants to merge 5 commits

Conversation

gaoyajun02
Contributor

@gaoyajun02 gaoyajun02 commented Oct 21, 2022

What changes were proposed in this pull request?

When push-based shuffle is enabled, a zero-size buffer error may occur when fetching shuffle chunks from bad nodes, especially when memory is full. In this case, we can fall back to the original shuffle blocks.
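For context, the core of the change can be pictured with a minimal sketch (not the actual patch): when the fetched buffer for a push-merged shuffle chunk is empty, treat it like a corrupt chunk and fall back to the original blocks. The names below (`openFetchedBlock`, `fallbackToOriginalBlocks`) are illustrative assumptions, not Spark APIs.

```scala
import java.io.{ByteArrayInputStream, InputStream}

object ZeroSizeChunkFallbackSketch {
  // Sketch only: decide whether a fetched block can be consumed, or whether we
  // should fall back to the original (un-merged) shuffle blocks instead.
  def openFetchedBlock(
      blockId: String,
      bytes: Array[Byte],
      isPushMergedChunk: Boolean,
      fallbackToOriginalBlocks: String => Unit): Option[InputStream] = {
    if (isPushMergedChunk && bytes.isEmpty) {
      // A zero-size push-merged chunk means the merged data on the shuffle
      // service is unusable, but the original shuffle blocks still exist,
      // so re-fetch those instead of failing the fetch.
      fallbackToOriginalBlocks(blockId)
      None
    } else {
      Some(new ByteArrayInputStream(bytes))
    }
  }
}
```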

Why are the changes needed?

When a reduce task receives a shuffle chunk with a zero-size buffer, we let it fall back to the original shuffle blocks. After verification, these blocks can be read successfully without a shuffle retry.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT

@github-actions github-actions bot added the CORE label Oct 21, 2022
@mridulm
Contributor

mridulm commented Oct 21, 2022

+CC @otterc

@AmplabJenkins

Can one of the admins verify this patch?

@gaoyajun02
Contributor Author

gaoyajun02 commented Nov 1, 2022

We have located the cause of the zero-size chunk problem on the shuffle service node; the system log (dmesg -T) shows the following:

[Tue Nov  1 19:40:04 2022] EXT4-fs (sde1): Delayed block allocation failed for inode 25755946 at logical offset 0 with max blocks 15 with error 117
[Tue Nov  1 19:40:04 2022] EXT4-fs (sde1): This should not happen!! Data will be lost

Although this is not a software-layer issue, and the number of bad nodes that lose data is very low, I think it makes sense to support the fallback here.

cc @otterc

@mridulm
Contributor

mridulm commented Nov 2, 2022

For cases like this, it might actually be better to fail the task (and recompute the parent stage), and leverage the deny list to prevent tasks from running on the problematic node?

@gaoyajun02
Contributor Author

gaoyajun02 commented Nov 2, 2022

> For cases like this, it might actually be better to fail the task (and recompute the parent stage), and leverage the deny list to prevent tasks from running on the problematic node?

It is not necessary to recompute the parent stage. This case is similar to chunk corruption, so we can fall back to the original shuffle blocks. The reasons are:

  1. The original shuffle blocks are still available.
  2. Recomputing the parent stage is very expensive for some large jobs and makes the application run longer.
  3. We observed that these bad nodes lose data very rarely, maybe once every few days.

@mridulm
Contributor

mridulm commented Nov 2, 2022

If there are hardware issues causing failures, it is better to move the nodes to the deny list and prevent them from getting used: we will keep seeing more failures, including for vanilla shuffle.

On the other hand, I can also look at this as a data corruption issue. @otterc, what was the plan for how we support shuffle corruption diagnosis for push-based shuffle (SPARK-36206, etc.)? Is the expectation that we fall back, or that we diagnose and fail?

Thoughts, @Ngone51?

@otterc
Contributor

otterc commented Nov 3, 2022

> If there are hardware issues causing failures, it is better to move the nodes to the deny list and prevent them from getting used: we will keep seeing more failures, including for vanilla shuffle.
>
> On the other hand, I can also look at this as a data corruption issue. @otterc, what was the plan for how we support shuffle corruption diagnosis for push-based shuffle (SPARK-36206, etc.)? Is the expectation that we fall back, or that we diagnose and fail?

I think it is more efficient to fall back and fetch the map outputs instead of failing the stage and regenerating the data of the partition. When corrupt blocks are merged shuffle blocks or chunks, we don't retry fetching them anyway and fall back immediately.

@mridulm
Contributor

mridulm commented Nov 11, 2022

> I think it is more efficient to fall back and fetch the map outputs instead of failing the stage and regenerating the data of the partition. When corrupt blocks are merged shuffle blocks or chunks, we don't retry fetching them anyway and fall back immediately.

Sounds good.

Contributor

@mridulm mridulm left a comment

Thanks for fixing this @gaoyajun02.

+CC @otterc, @Ngone51

Contributor

@mridulm mridulm left a comment

Couple of changes ...

@mridulm
Contributor

mridulm commented Nov 15, 2022

There is a pending comment, can you take a look at it @gaoyajun02? Thx

@mridulm
Contributor

mridulm commented Nov 15, 2022

Also, can you please update to the latest master, @gaoyajun02? Not sure why we are seeing the linter failure in the build.

Contributor Author

@gaoyajun02 gaoyajun02 left a comment

done

@mridulm
Contributor

mridulm commented Nov 17, 2022

The test failure looks unrelated; can you retrigger the tests, @gaoyajun02?

Contributor

@mridulm mridulm left a comment

Looks good to me.

+CC @otterc

@otterc
Contributor

otterc commented Nov 18, 2022

Looks good to me

@asfgit asfgit closed this in 72cce5c Nov 21, 2022
@mridulm
Contributor

mridulm commented Nov 21, 2022

Merged to master.
Thanks for fixing this, @gaoyajun02!
Thanks for the review, @otterc :-)

@dongjoon-hyun
Member

Thank you, @gaoyajun02, @mridulm, @otterc.

  • Do we need to backport this to branch-3.3?
  • According to the previous failure description, what happens in branch-3.3 in case of failure?

@mridulm
Contributor

mridulm commented Nov 21, 2022

I was of two minds about whether to fix this in 3.3 as well ...
Yes, 3.3 is affected by it.

But I agree, a backport to branch-3.3 would be helpful.
Can you give it a shot, @gaoyajun02? You might need to fix some minor nits to get a patch.

@dongjoon-hyun
Member

Thank you, @mridulm!

gaoyajun02 pushed a commit to gaoyajun02/spark that referenced this pull request Nov 22, 2022

Closes apache#38333 from gaoyajun02/SPARK-40872.

Authored-by: gaoyajun02 <gaoyajun02@meituan.com>
Signed-off-by: Mridul <mridul<at>gmail.com>
(cherry picked from commit 72cce5c)
@gaoyajun02
Contributor Author

> I was of two minds about whether to fix this in 3.3 as well ... Yes, 3.3 is affected by it.
>
> But I agree, a backport to branch-3.3 would be helpful. Can you give it a shot, @gaoyajun02? You might need to fix some minor nits to get a patch.

OK, can you take a look, @mridulm? cc @otterc @dongjoon-hyun

@gaoyajun02
Contributor Author

> Thank you, @gaoyajun02, @mridulm, @otterc.
>
> • Do we need to backport this to branch-3.3?
> • According to the previous failure description, what happens in branch-3.3 in case of failure?

Since the 3.3 branch does not contain the PR for SPARK-38987, if the mergedChunk is zero-size, throwFetchFailedException actually throws a SparkException, which will eventually cause the app to fail after the task fails 4 times.
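To illustrate the difference, here is a simplified sketch only (not Spark's actual control flow; the method name, flag, and fallback helper are assumptions):

```scala
import org.apache.spark.SparkException

object Branch33BehaviorSketch {
  // Sketch only: how a zero-size push-merged chunk plays out with and without
  // the fallback fix. Names and control flow are simplified assumptions.
  def handleZeroSizeMergedChunk(
      hasFallbackFix: Boolean,
      fallbackToOriginalBlocks: () => Unit): Unit = {
    if (hasFallbackFix) {
      // master and the branch-3.3 backport: fall back to the original
      // shuffle blocks, the task keeps running, and no exception is thrown.
      fallbackToOriginalBlocks()
    } else {
      // branch-3.3 without SPARK-38987 and without this fix: the zero-size
      // buffer surfaces as a plain SparkException (not a FetchFailedException),
      // so the task fails and, after spark.task.maxFailures (default 4)
      // attempts, the whole application fails.
      throw new SparkException("Received a zero-size buffer for a push-merged shuffle chunk")
    }
  }
}
```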

@dongjoon-hyun

HyukjinKwon pushed a commit that referenced this pull request Nov 27, 2022
This is a backport PR of #38333.

Closes #38751 from gaoyajun02/SPARK-40872-backport.

Authored-by: gaoyajun02 <gaoyajun02@meituan.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022