Retire workers in SpecCluster by name #4074
cc @bbockelm
@jacobtomlinson if you have a moment, would you be able to look this over?
We could also make this a requirement. I don't recall why this distinction was made, but it might not be strictly required, and this sounds like the kind of thing that would make things more consistent. Also, a corner case to watch out for with SpecCluster is multi-worker jobs (grep for |
This looks reasonable to me. But I want to echo what Matt said about multiple workers. A worker object in respect to When you run |
Updating and completing dask#4074
Updates / Completes dask#4074
Hi, I'm interested in fixing this too. I updated it and added support for MultiWorkers (#6065). Locally, the only failing tests seem unrelated, so at this point some guidance on how to fix them would be great. Maybe using a stable version to base the patch on?
Currently there's a mismatch between how workers are stored on the scheduler and on the spec cluster. `Scheduler.workers` uses worker addresses for keys, while `SpecCluster.workers` uses the output of `SpecCluster._new_worker_name` for keys (which is often an integer that's incremented when a new worker is added). This mismatch results in workers not being retired properly here (xref #4069):

distributed/distributed/deploy/spec.py
Line 328 in 586ded3
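
To make the mismatch concrete, here is a minimal sketch that uses plain dictionaries as stand-ins for `Scheduler.workers` and `SpecCluster.workers`. The addresses, the `name` field layout, and the variable names are illustrative assumptions, not the actual `distributed` data structures.

```python
# Toy stand-ins (assumptions for illustration only, not the real classes).
# Scheduler side: keyed by worker address, each entry knows its worker name.
scheduler_workers = {
    "tcp://10.0.0.1:40001": {"name": 0},
    "tcp://10.0.0.2:40002": {"name": 1},
}

# SpecCluster side: keyed by the spec name (often an incrementing integer).
spec_workers = {0: "<worker object>", 1: "<worker object>"}

# The spec cluster decides that the worker stored under key 1 should go.
to_retire = {1}

# Treating spec keys as addresses matches nothing on the scheduler.
by_address = [addr for addr in scheduler_workers if addr in to_retire]

# Matching on the worker name finds the intended worker.
by_name = [
    addr for addr, state in scheduler_workers.items() if state["name"] in to_retire
]

print(by_address)
print(by_name)
```

In this toy setup the address-based lookup comes back empty, while the name-based lookup selects the worker that the spec cluster meant to retire.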
This PR updates how we call `retire_workers` to use worker names instead of addresses, which ensures the `Scheduler` and `SpecCluster` are on the same page about which workers should be retired.

Note that since the key stored in `SpecCluster.workers` isn't always used as the corresponding worker's name (distributed/distributed/deploy/spec.py, lines 343 to 345 in 586ded3), I added a new `SpecCluster._spec_name_to_worker_name` mapping, which maps between the name stored in `SpecCluster.workers` and the actual name used for the worker that's created.

Closes #4069