[SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor #38852
@holdenk @dongjoon-hyun Help take a look?
Can one of the admins verify this patch?
@holdenk @dongjoon-hyun @Ngone51 Help take a look?
I like this addition conceptually 👍 Shuffle fetches should be fairly short in general as well. Any thoughts on adding automated testing?
Thanks for the feedback, I'll add a UT to cover this.
Just following up @warrenzhu25, do you have the time to add a unit test?
Thanks for looking at this. Any ideas where I should add UT to cover which case? The change about |
So with #42296, as long as all the shuffle data has been migrated, I think the fetcher would be able to retry the shuffle fetch after the decommissioned executor is gone.
Maybe the |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Wait for in-progress shuffle fetches to finish when decommissioning an executor, by checking the number of open streams.
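
As a rough illustration of the idea (not the code in this PR), the wait can be sketched as a bounded polling loop over an open-stream counter. Everything named here is hypothetical: `DecommissionWaitSketch`, `waitForOpenStreams`, and the `openStreams` callback stand in for whatever the executor backend actually uses to read the stream count, and the timeout and poll interval are assumed values.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical sketch: none of these names come from Spark or from this PR.
object DecommissionWaitSketch {

  /**
   * Polls `openStreams` until it reports 0 or `timeoutMs` elapses.
   * Returns true if no streams remain open when the wait ends.
   */
  def waitForOpenStreams(
      openStreams: () => Long,
      timeoutMs: Long,
      pollIntervalMs: Long): Boolean = {
    val deadline = System.nanoTime() + timeoutMs * 1000000L
    // Keep waiting while fetches are still streaming and the deadline has not passed.
    while (openStreams() > 0 && System.nanoTime() - deadline < 0) {
      Thread.sleep(pollIntervalMs)
    }
    openStreams() == 0
  }

  def main(args: Array[String]): Unit = {
    // Simulate two in-flight shuffle fetch streams that finish shortly.
    val open = new AtomicLong(2)
    val fetcher = new Thread(() => {
      Thread.sleep(50); open.decrementAndGet()
      Thread.sleep(50); open.decrementAndGet()
    })
    fetcher.start()

    // The decommissioned executor would exit only after this returns
    // (or after the timeout, so a stuck fetch cannot block exit forever).
    val drained = waitForOpenStreams(() => open.get(), timeoutMs = 5000L, pollIntervalMs = 10L)
    fetcher.join()
    println(s"drained=$drained")
  }
}
```

The timeout is a design assumption of this sketch: without an upper bound, a hung fetch could keep a decommissioned executor alive indefinitely.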
Why are the changes needed?
If a shuffle fetch still has open streams when a decommissioned executor exits on its own, the fetch fails because the executor is gone. To fix this, the decommissioned executor waits for pending shuffle fetches to finish before exiting.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Manually tested