
Retire workers in SpecCluster by name #4074

Open · wants to merge 1 commit into base: main

Conversation

jrbourbeau (Member)

Currently there's a mismatch between how workers are stored on the scheduler and on the SpecCluster. Scheduler.workers uses worker addresses for keys, while SpecCluster.workers uses the output of SpecCluster._new_worker_name for keys (which is often an integer that's incremented when a new worker is added). This mismatch results in workers not being retired properly here (xref #4069):

await self.scheduler_comm.retire_workers(workers=list(to_close))
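
To make the mismatch concrete, here's roughly what the two sets of keys look like on a small LocalCluster (a SpecCluster subclass); the exact addresses will differ, and scheduler_info mirrors the scheduler-side Scheduler.workers keys:

from dask.distributed import LocalCluster

# LocalCluster is a SpecCluster subclass, so it exhibits the same key mismatch.
cluster = LocalCluster(n_workers=2, processes=False)

print(list(cluster.workers))                    # spec names, e.g. [0, 1]
print(list(cluster.scheduler_info["workers"]))  # addresses, e.g. ['inproc://...', ...]

cluster.close()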

This PR updates how we call retire_workers to use worker names instead of addresses, which ensures the Scheduler and SpecCluster are on the same page about which workers should be retired.

Note that since the key stored in SpecCluster.workers isn't always used as the corresponding worker's name:

if "name" not in opts:
opts = opts.copy()
opts["name"] = name

I added a new SpecCluster._spec_name_to_worker_name mapping, which maps the name stored in SpecCluster.workers to the actual name used for the worker that's created.
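
Roughly, the retire call then becomes something like the sketch below. This is not the PR's literal diff: it assumes Scheduler.retire_workers accepts a names= keyword, and the .get fallback to the spec key itself is just a defensive guess for workers whose spec key already matches their name.

# Sketch only: translate SpecCluster spec keys into the names the scheduler
# registered the workers under, then retire by name instead of by address.
names_to_close = [
    self._spec_name_to_worker_name.get(key, key) for key in to_close
]
await self.scheduler_comm.retire_workers(names=names_to_close)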

Closes #4069

@jrbourbeau (Member, Author)

cc @bbockelm

@quasiben (Member)

@jacobtomlinson if you have a moment, would you be able to look this over?

@mrocklin (Member)

Note that since the key stored in SpecCluster.workers isn't always used as the corresponding worker's name:

We could also make this a requirement. I don't recall why this distinction was made, but it might not be strictly required, and this sounds like the kind of thing that would make things more consistent.

Also, a corner case to watch out for with SpecCluster is multi-worker jobs (grep for MultiWorker for a test). This comes up with dask-cuda and with dask-jobqueue.

@jacobtomlinson (Member)

This looks reasonable to me. But I want to echo what Matt said about multiple workers. A worker object, with respect to SpecCluster, may reference multiple tightly coupled workers, like in dask-cuda.

When you run dask-cuda-worker, it inspects the number of GPUs and spawns one worker per GPU, so it may not always be possible to reconcile a 1-to-1 relationship.
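
For context on that corner case: a SpecCluster spec entry can carry a "group" key, in which case one entry expands into several scheduler-side workers whose names are the spec key plus a suffix. A rough illustration, where MultiWorker is a hypothetical stand-in for something like dask-cuda's CUDAWorker or the MultiWorker class used in the tests:

from distributed import Worker

class MultiWorker(Worker):
    """Hypothetical stand-in for a worker class that launches several
    tightly coupled workers from one object (e.g. dask-cuda's CUDAWorker)."""

worker_spec = {
    0: {"cls": Worker, "options": {}},  # one worker, named "0"
    1: {
        "cls": MultiWorker,
        "options": {"n": 2},
        "group": ["-0", "-1"],          # scheduler-side names "1-0" and "1-1"
    },
}

# A single SpecCluster key (1) corresponds to several scheduler-side worker
# names, so a spec-name -> worker-name mapping has to allow one-to-many entries.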

Base automatically changed from master to main March 8, 2021 19:04
LunarLanding added a commit to LunarLanding/distributed that referenced this pull request Apr 4, 2022
@LunarLanding

Hi, I'm interested in fixing this too. I updated it and added support for MultiWorkers (#6065); locally, the only failing tests seem unrelated. At this point, some guidance on how to fix them would be great. Maybe I should base the patch on a stable version?


Successfully merging this pull request may close these issues.

SpecCluster + Dask JobQueue does not retire workers when scaling down.
5 participants