
Retire workers in SpecCluster by name #4074

Open · wants to merge 1 commit into base: main

Conversation

jrbourbeau (Member)

Currently there's a mismatch between how workers are stored on the scheduler and on the SpecCluster. Scheduler.workers uses worker addresses for keys, while SpecCluster.workers uses the output of SpecCluster._new_worker_name for keys (which is often an integer that's incremented when a new worker is added). This mismatch results in workers not being retired properly here (xref #4069):

await self.scheduler_comm.retire_workers(workers=list(to_close))
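
To make the mismatch concrete, here's roughly what the two sets of keys look like on a small LocalCluster (a SpecCluster subclass); the exact addresses will differ, and scheduler_info mirrors the scheduler-side Scheduler.workers keys:

from dask.distributed import LocalCluster

# LocalCluster is a SpecCluster subclass, so it exhibits the same key mismatch.
cluster = LocalCluster(n_workers=2, processes=False)

print(list(cluster.workers))                    # spec names, e.g. [0, 1]
print(list(cluster.scheduler_info["workers"]))  # addresses, e.g. ['inproc://...', ...]

cluster.close()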

This PR updates how we call retire_workers to use worker names instead of addresses, which ensures the Scheduler and SpecCluster are on the same page about which workers should be retired.

Note that since the key stored in SpecCluster.workers isn't always used as the corresponding worker's name:

if "name" not in opts:
opts = opts.copy()
opts["name"] = name

I added a new SpecCluster._spec_name_to_worker_name mapping, which maps the name stored in SpecCluster.workers to the actual name used for the worker that's created.
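
Roughly, the retire call then becomes something like the sketch below. This is not the PR's literal diff: it assumes Scheduler.retire_workers accepts a names= keyword, and the .get fallback to the spec key itself is just a defensive guess for workers whose spec key already matches their name.

# Sketch only: translate SpecCluster spec keys into the names the scheduler
# registered the workers under, then retire by name instead of by address.
names_to_close = [
    self._spec_name_to_worker_name.get(key, key) for key in to_close
]
await self.scheduler_comm.retire_workers(names=names_to_close)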

Closes #4069

@jrbourbeau (Member, Author)

cc @bbockelm

@quasiben (Member)

@jacobtomlinson if you have a moment, would you be able to look this over?

@mrocklin (Member)

Note that since the key stored in SpecCluster.workers isn't always used as the corresponding worker's name:

We could also make this a requirement. I don't recall why this distinction was made, but it might not be strictly required, and this sounds like the kind of thing that would make things more consistent.

Also, a corner case to watch out for with SpecCluster is multi-worker jobs (grep for MultiWorker for a test). This comes up with dask-cuda and with dask-jobqueue.

@jacobtomlinson (Member)

This looks reasonable to me. But I want to echo what Matt said about multiple workers. A worker object, with respect to SpecCluster, may reference multiple tightly coupled workers, like in dask-cuda.

When you run dask-cuda-worker, it inspects the number of GPUs and spawns one worker per GPU, so it may not always be possible to reconcile a 1-to-1 relationship.
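
For context on that corner case: a SpecCluster spec entry can carry a "group" key, in which case one entry expands into several scheduler-side workers whose names are the spec key plus a suffix. A rough illustration, where MultiWorker is a hypothetical stand-in for something like dask-cuda's CUDAWorker or the MultiWorker class used in the tests:

from distributed import Worker

class MultiWorker(Worker):
    """Hypothetical stand-in for a worker class that launches several
    tightly coupled workers from one object (e.g. dask-cuda's CUDAWorker)."""

worker_spec = {
    0: {"cls": Worker, "options": {}},  # one worker, named "0"
    1: {
        "cls": MultiWorker,
        "options": {"n": 2},
        "group": ["-0", "-1"],          # scheduler-side names "1-0" and "1-1"
    },
}

# A single SpecCluster key (1) corresponds to several scheduler-side worker
# names, so a spec-name -> worker-name mapping has to allow one-to-many entries.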

Base automatically changed from master to main March 8, 2021 19:04
LunarLanding added a commit to LunarLanding/distributed that referenced this pull request Apr 4, 2022
@LunarLanding

Hi, I'm interested in fixing this too. I updated it and added support for MultiWorkers (#6065); locally, the only failing tests seem unrelated. At this point, some guidance on how to fix them would be great. Maybe I should base the patch on a stable version?


Successfully merging this pull request may close these issues.

SpecCluster + Dask JobQueue does not retire workers when scaling down.
5 participants