Using grouped workers and Adaptive #1987

Closed · jhamman opened this issue May 18, 2018 · 4 comments
jhamman (Member) commented May 18, 2018

I am seeing what appears to be some buggy behavior when using Adaptive with grouped workers.

A fully reproducible example is a bit tough here because this also involves dask-jobqueue (dask/dask-jobqueue#26), but hopefully I can lay out what I see as potential problems and we can go from there.

Here's my current workflow:

```python
from distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster(..., processes=12, threads=4)
client = Client(cluster)
cluster.adapt(minimum=2, maximum=10, interval='500ms')
```

In dask-jobqueue, this leads to the following command being invoked for each scale-up call:

```
dask-worker --nprocs 12 --nthreads 4 ...
```

(hence the grouped workers)
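
For concreteness, a rough sketch of the arithmetic behind these grouped workers (the numbers come from the `PBSCluster` call above; nothing here is dask API):

```python
# Each scale-up call submits one PBS job running
# `dask-worker --nprocs 12 --nthreads 4`, so one "grouped worker"
# is really 12 worker processes with 4 threads each.
processes_per_job = 12    # --nprocs
threads_per_process = 4   # --nthreads

jobs = 2                                                 # minimum=2 triggers two scale-up calls
worker_processes = jobs * processes_per_job              # 24 worker processes
total_threads = worker_processes * threads_per_process   # 96 threads
print(worker_processes, total_threads)
```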

Problem description:

Initializing the cluster/client goes as expected. The problem occurs when using Adaptive. With minimum=2, scale up is called twice, which is translated into two groups of workers (24 worker processes in total). These workers come online and are immediately culled.

So problem 1 may just be a semantics issue. Do the minimum/maximum kwargs to Adaptive correspond to individual workers (processes), and not to grouped workers (executions of `dask-worker`)?

Problem 2 is perhaps a bit harder to see. Even if each group is being treated incorrectly, one of the two grouped workers should have survived and I should be left with 12 processes/workers. Instead, all of the workers are culled, so this seems like a bug.
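
A quick way to see what the scheduler (and therefore, I assume, Adaptive) is actually counting, using the client from the snippet above:

```python
# scheduler_info() reports every worker process currently connected to
# the scheduler; before the cull I would expect 24 entries here
# (2 groups x 12 processes).
n_workers = len(client.scheduler_info()["workers"])
print(n_workers)
```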

I should note that manually scaling the PBSCluster using scale_up/scale_down works just fine.
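
For reference, a sketch of the manual path that does work for me, assuming the old Cluster interface where `scale_up` takes a target count and `scale_down` takes a list of worker addresses (whether the count means jobs or individual processes is exactly the ambiguity above):

```python
# Ask for more workers ("2" here runs into the same jobs-vs-processes
# ambiguity discussed above).
cluster.scale_up(2)

# Retire specific workers by address; addresses come from the scheduler.
workers = list(client.scheduler_info()["workers"])
cluster.scale_down(workers[:12])
```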

cc @mrocklin @guillaumeeb

mrocklin (Member) commented:
Yeah, there are definitely bugs here. Thanks for raising the issue.

> scale up is called twice, which is translated into two groups of workers (24 worker processes in total)

What happens if you change your interval to something very fast, like 10ms? I suspect that this might become much much worse. My recollection is that we don't yet have a good way to track the jobs that are in-flight in dask-jobqueue.
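
Something like the following, reusing the adapt call from the original report with only the interval changed:

```python
# Same adaptive settings as before, but with a very aggressive interval,
# to see whether untracked in-flight jobs make the over-scaling worse.
cluster.adapt(minimum=2, maximum=10, interval='10ms')
```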

> Do the minimum/maximum kwargs to Adaptive correspond to individual workers (processes), and not to grouped workers (executions of `dask-worker`)?

I suspect that they refer to the number of Python Worker processes connected to the Scheduler.

> But all the workers are culled so this seems like a bug.

Yup, I agree

jhamman (Member, Author) commented May 18, 2018

> My recollection is that we don't yet have a good way to track the jobs that are in-flight in dask-jobqueue.

That's right (see dask/dask-jobqueue#11).

> I suspect that they refer to the number of Python Worker processes connected to the Scheduler.

That is my assumption now too. It should be documented (both here and in dask-jobqueue, I think).
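
For example, if that interpretation holds, the original intent of "between 2 and 10 jobs" with 12 processes per job would be expressed as (a sketch under that assumption):

```python
# minimum/maximum expressed in individual worker processes,
# not dask-worker jobs: 2 jobs * 12 = 24, 10 jobs * 12 = 120.
cluster.adapt(minimum=2 * 12, maximum=10 * 12, interval='500ms')
```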

mrocklin (Member) commented:
@jhamman is this now resolved?

jhamman (Member, Author) commented Jun 20, 2018

Yes, @mrocklin. We can close this now.

jhamman closed this as completed on Jun 20, 2018.