
[SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size #38333

Closed
wants to merge 5 commits

Conversation

gaoyajun02
Contributor

@gaoyajun02 gaoyajun02 commented Oct 21, 2022

What changes were proposed in this pull request?

When push-based shuffle is enabled, a zero-size buffer error may occur when fetching shuffle chunks from bad nodes, especially when memory is full. In this case, we can fall back to the original shuffle blocks.
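For context, the core of the change can be pictured with a minimal sketch (not the actual patch): when the fetched buffer for a push-merged shuffle chunk is empty, treat it like a corrupt chunk and fall back to the original blocks. The names below (`openFetchedBlock`, `fallbackToOriginalBlocks`) are illustrative assumptions, not Spark APIs.

```scala
import java.io.{ByteArrayInputStream, InputStream}

object ZeroSizeChunkFallbackSketch {
  // Sketch only: decide whether a fetched block can be consumed, or whether we
  // should fall back to the original (un-merged) shuffle blocks instead.
  def openFetchedBlock(
      blockId: String,
      bytes: Array[Byte],
      isPushMergedChunk: Boolean,
      fallbackToOriginalBlocks: String => Unit): Option[InputStream] = {
    if (isPushMergedChunk && bytes.isEmpty) {
      // A zero-size push-merged chunk means the merged data on the shuffle
      // service is unusable, but the original shuffle blocks still exist,
      // so re-fetch those instead of failing the fetch.
      fallbackToOriginalBlocks(blockId)
      None
    } else {
      Some(new ByteArrayInputStream(bytes))
    }
  }
}
```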

Why are the changes needed?

When a reduce task receives a shuffle chunk with a zero-size buffer, we let it fall back to the original shuffle blocks. After verification, these blocks can be read successfully without a shuffle retry.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT

@github-actions github-actions bot added the CORE label Oct 21, 2022
@mridulm
Contributor

mridulm commented Oct 21, 2022

+CC @otterc

@AmplabJenkins

Can one of the admins verify this patch?

@gaoyajun02
Contributor Author

gaoyajun02 commented Nov 1, 2022

We have located the cause of the zero-size chunk problem on the shuffle service node; the system log (dmesg -T) shows the following:

[Tue Nov  1 19:40:04 2022] EXT4-fs (sde1): Delayed block allocation failed for inode 25755946 at logical offset 0 with max blocks 15 with error 117
[Tue Nov  1 19:40:04 2022] EXT4-fs (sde1): This should not happen!! Data will be lost

Although this is not a software-layer issue, and the number of bad nodes that lose data is very low, I think it makes sense to support the fallback here.

cc @otterc

@mridulm
Contributor

mridulm commented Nov 2, 2022

For cases like this, it might actually be better to fail the task (and recompute the parent stage), and leverage the deny list to prevent tasks from running on the problematic node?

@gaoyajun02
Contributor Author

gaoyajun02 commented Nov 2, 2022

> For cases like this, it might actually be better to fail the task (and recompute the parent stage), and leverage the deny list to prevent tasks from running on the problematic node?

It is not necessary to recompute the parent stage. This case is similar to chunk corruption, so we can fall back to the original shuffle blocks. The reasons are:

  1. The original shuffle blocks are still available.
  2. Recomputing the parent stage is very expensive for some large jobs and makes the application run longer.
  3. We observed that these bad nodes lose data very rarely, maybe once every few days.

@mridulm
Contributor

mridulm commented Nov 2, 2022

If there are hardware issues causing failures, it is better to move the nodes to the deny list and prevent them from getting used: we will keep seeing more failures, including for vanilla shuffle.

On the other hand, I can also look at this as a data corruption issue. @otterc, what was the plan for how we support shuffle corruption diagnosis for push-based shuffle (SPARK-36206, etc.)? Is the expectation that we fall back, or that we diagnose and fail?

Thoughts, @Ngone51?

@otterc
Contributor

otterc commented Nov 3, 2022

> If there are hardware issues causing failures, it is better to move the nodes to the deny list and prevent them from getting used: we will keep seeing more failures, including for vanilla shuffle.
>
> On the other hand, I can also look at this as a data corruption issue. @otterc, what was the plan for how we support shuffle corruption diagnosis for push-based shuffle (SPARK-36206, etc.)? Is the expectation that we fall back, or that we diagnose and fail?

I think it is more efficient to fall back and fetch the map outputs instead of failing the stage and regenerating the data of the partition. When corrupt blocks are merged shuffle blocks or chunks, we don't retry fetching them anyway and fall back immediately.

@mridulm
Contributor

mridulm commented Nov 11, 2022

> I think it is more efficient to fall back and fetch the map outputs instead of failing the stage and regenerating the data of the partition. When corrupt blocks are merged shuffle blocks or chunks, we don't retry fetching them anyway and fall back immediately.

Sounds good.

Contributor

@mridulm mridulm left a comment

Thanks for fixing this @gaoyajun02.

+CC @otterc, @Ngone51

Contributor

@mridulm mridulm left a comment

Couple of changes ...

@mridulm
Contributor

mridulm commented Nov 15, 2022

There is a pending comment, can you take a look at it @gaoyajun02? Thx

@mridulm
Contributor

mridulm commented Nov 15, 2022

Also, can you please update to the latest master, @gaoyajun02? Not sure why we are seeing the linter failure in the build.

Contributor Author

@gaoyajun02 gaoyajun02 left a comment

done

@mridulm
Contributor

mridulm commented Nov 17, 2022

The test failure looks unrelated; can you retrigger the tests, @gaoyajun02?

Contributor

@mridulm mridulm left a comment

Looks good to me.

+CC @otterc

@otterc
Contributor

otterc commented Nov 18, 2022

Looks good to me

@asfgit asfgit closed this in 72cce5c Nov 21, 2022
@mridulm
Contributor

mridulm commented Nov 21, 2022

Merged to master.
Thanks for fixing this, @gaoyajun02!
Thanks for the review, @otterc :-)

@dongjoon-hyun
Member

Thank you, @gaoyajun02, @mridulm, @otterc.

  • Do we need to backport this to branch-3.3?
  • According to the previous failure description, what happens in branch-3.3 in case of failure?

@mridulm
Contributor

mridulm commented Nov 21, 2022

I was of two minds about whether to fix this in 3.3 as well ...
Yes, 3.3 is affected by it.

But I agree, a backport to branch-3.3 would be helpful.
Can you give it a shot, @gaoyajun02? You might need to fix some minor nits to get a patch.

@dongjoon-hyun
Member

Thank you, @mridulm!

gaoyajun02 pushed a commit to gaoyajun02/spark that referenced this pull request Nov 22, 2022

Closes apache#38333 from gaoyajun02/SPARK-40872.

Authored-by: gaoyajun02 <gaoyajun02@meituan.com>
Signed-off-by: Mridul <mridul<at>gmail.com>
(cherry picked from commit 72cce5c)
@gaoyajun02
Contributor Author

> I was of two minds about whether to fix this in 3.3 as well ... Yes, 3.3 is affected by it.
>
> But I agree, a backport to branch-3.3 would be helpful. Can you give it a shot, @gaoyajun02? You might need to fix some minor nits to get a patch.

OK, can you take a look, @mridulm? cc @otterc @dongjoon-hyun

@gaoyajun02
Contributor Author

> Thank you, @gaoyajun02, @mridulm, @otterc.
>
> • Do we need to backport this to branch-3.3?
> • According to the previous failure description, what happens in branch-3.3 in case of failure?

Since the 3.3 branch does not contain the PR for SPARK-38987, if the mergedChunk is zero-size, throwFetchFailedException actually throws a SparkException, which will eventually cause the app to fail after the task fails 4 times.
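To illustrate the difference, here is a simplified sketch only (not Spark's actual control flow; the method name, flag, and fallback helper are assumptions):

```scala
import org.apache.spark.SparkException

object Branch33BehaviorSketch {
  // Sketch only: how a zero-size push-merged chunk plays out with and without
  // the fallback fix. Names and control flow are simplified assumptions.
  def handleZeroSizeMergedChunk(
      hasFallbackFix: Boolean,
      fallbackToOriginalBlocks: () => Unit): Unit = {
    if (hasFallbackFix) {
      // master and the branch-3.3 backport: fall back to the original
      // shuffle blocks, the task keeps running, and no exception is thrown.
      fallbackToOriginalBlocks()
    } else {
      // branch-3.3 without SPARK-38987 and without this fix: the zero-size
      // buffer surfaces as a plain SparkException (not a FetchFailedException),
      // so the task fails and, after spark.task.maxFailures (default 4)
      // attempts, the whole application fails.
      throw new SparkException("Received a zero-size buffer for a push-merged shuffle chunk")
    }
  }
}
```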

@dongjoon-hyun

HyukjinKwon pushed a commit that referenced this pull request Nov 27, 2022
This is a backport PR of #38333.

Closes #38751 from gaoyajun02/SPARK-40872-backport.

Authored-by: gaoyajun02 <gaoyajun02@meituan.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022