Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-41341][CORE] Wait shuffle fetch to finish when decommission executor #38852

Closed
wants to merge 1 commit into from

Conversation

warrenzhu25
Copy link
Contributor

@warrenzhu25 warrenzhu25 commented Nov 30, 2022

What changes were proposed in this pull request?

Wait for shuffle fetch to finish when decommissioning executors by checking number of opening streams.

Why are the changes needed?

When shuffle fetch with some opening streams is in progress and executor self-exit after decommission, there'll be fetch failed caused by executor self-exit. To fix this, let decommissioned executor wait pending shuffle fetch to be finished.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually tested

@warrenzhu25
Copy link
Contributor Author

@holdenk @dongjoon-hyun Help take a look?

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@warrenzhu25
Copy link
Contributor Author

@holdenk @dongjoon-hyun @Ngone51 Help take a look?

2 similar comments
@warrenzhu25
Copy link
Contributor Author

@holdenk @dongjoon-hyun @Ngone51 Help take a look?

@warrenzhu25
Copy link
Contributor Author

@holdenk @dongjoon-hyun @Ngone51 Help take a look?

@holdenk
Copy link
Contributor

holdenk commented Apr 20, 2023

I like this addition conceptually 👍 Shuffle fetches should be fairly short in general as well. Any thoughts on adding automated testing?

@warrenzhu25
Copy link
Contributor Author

I like this addition conceptually 👍 Shuffle fetches should be fairly short in general as well. Any thoughts on adding automated testing?

Thanks for feedback, I'll add UT to cover this.

@holdenk
Copy link
Contributor

holdenk commented Jul 8, 2023

Just following up @warrenzhu25 do you have the time to add a unit test?

@warrenzhu25
Copy link
Contributor Author

Just following up @warrenzhu25 do you have the time to add a unit test?

Thanks for looking at this. Any ideas where I should add UT to cover which case? The change about BlockManager and BlockTransferService is pretty straightforward, the only place might need UT is in CoarseGrainedExecutorBackend, but it seems hard to mock BlockManager there.

@warrenzhu25
Copy link
Contributor Author

@holdenk @dongjoon-hyun @Ngone51 Help take a look?

@Ngone51
Copy link
Member

Ngone51 commented Sep 27, 2023

So with #42296, as long as all the shuffle data has been migrated, I think the fetcher would be able to retry the shuffle fetch after the decommissioned executor is gone.

@holdenk
Copy link
Contributor

holdenk commented Oct 10, 2023

Maybe the NettyBlockTransferServiceSuite would be enough so we can assert that we are keeping track of active streams and also once done that the count is zero (so we know this is not blocking shutdown indefinitely). WDYT?

Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 19, 2024
@github-actions github-actions bot closed this Jan 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants