Using grouped workers and Adaptive #1987

Closed · jhamman opened this issue May 18, 2018 · 4 comments
jhamman (Member) commented May 18, 2018

I am seeing what appears to be some buggy behavior when using Adaptive with grouped workers.

A fully reproducible example is a bit tough here because this also involves dask-jobqueue (dask/dask-jobqueue#26), but hopefully I can lay out what I see as potential problems and we can go from there.

Here's my current workflow:

```python
from distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster(..., processes=12, threads=4)
client = Client(cluster)
cluster.adapt(minimum=2, maximum=10, interval='500ms')
```

In dask-jobqueue, this leads to the following command being invoked for each scale-up call:

```
dask-worker --nprocs 12 --nthreads 4 ...
```

(hence the grouped workers)
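
For concreteness, a rough sketch of the arithmetic behind these grouped workers (the numbers come from the `PBSCluster` call above; nothing here is dask API):

```python
# Each scale-up call submits one PBS job running
# `dask-worker --nprocs 12 --nthreads 4`, so one "grouped worker"
# is really 12 worker processes with 4 threads each.
processes_per_job = 12    # --nprocs
threads_per_process = 4   # --nthreads

jobs = 2                                                 # minimum=2 triggers two scale-up calls
worker_processes = jobs * processes_per_job              # 24 worker processes
total_threads = worker_processes * threads_per_process   # 96 threads
print(worker_processes, total_threads)
```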

Problem description:

Initializing the cluster/client goes as expected. The problem occurs when using Adaptive. With minimum=2, scale up is called twice, which is translated into two groups of workers (24 worker processes in total). These workers come online and are immediately culled.

So problem 1 may just be a semantics issue. Do the minimum/maximum kwargs to Adaptive correspond to individual workers (processes), and not to grouped workers (executions of `dask-worker`)?

Problem 2 is perhaps a bit harder to see. Even if each group is being treated incorrectly, one of the two grouped workers should have survived and I should be left with 12 processes/workers. Instead, all of the workers are culled, so this seems like a bug.
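
A quick way to see what the scheduler (and therefore, I assume, Adaptive) is actually counting, using the client from the snippet above:

```python
# scheduler_info() reports every worker process currently connected to
# the scheduler; before the cull I would expect 24 entries here
# (2 groups x 12 processes).
n_workers = len(client.scheduler_info()["workers"])
print(n_workers)
```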

I should note that manually scaling the PBSCluster using scale_up/scale_down works just fine.
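
For reference, a sketch of the manual path that does work for me, assuming the old Cluster interface where `scale_up` takes a target count and `scale_down` takes a list of worker addresses (whether the count means jobs or individual processes is exactly the ambiguity above):

```python
# Ask for more workers ("2" here runs into the same jobs-vs-processes
# ambiguity discussed above).
cluster.scale_up(2)

# Retire specific workers by address; addresses come from the scheduler.
workers = list(client.scheduler_info()["workers"])
cluster.scale_down(workers[:12])
```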

cc @mrocklin @guillaumeeb

mrocklin (Member) commented:
Yeah, there are definitely bugs here. Thanks for raising the issue.

> scale up is called twice, which is translated into two groups of workers (24 worker processes in total)

What happens if you change your interval to something very fast, like 10ms? I suspect that this might become much much worse. My recollection is that we don't yet have a good way to track the jobs that are in-flight in dask-jobqueue.
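
Something like the following, reusing the adapt call from the original report with only the interval changed:

```python
# Same adaptive settings as before, but with a very aggressive interval,
# to see whether untracked in-flight jobs make the over-scaling worse.
cluster.adapt(minimum=2, maximum=10, interval='10ms')
```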

> Do the minimum/maximum kwargs to Adaptive correspond to individual workers (processes), and not to grouped workers (executions of `dask-worker`)?

I suspect that they refer to the number of Python Worker processes connected to the Scheduler.

> But all the workers are culled so this seems like a bug.

Yup, I agree

jhamman (Member, Author) commented May 18, 2018

> My recollection is that we don't yet have a good way to track the jobs that are in-flight in dask-jobqueue.

That's right (see dask/dask-jobqueue#11).

> I suspect that they refer to the number of Python Worker processes connected to the Scheduler.

That is my assumption now too. It should be documented (both here and in dask-jobqueue, I think).
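
For example, if that interpretation holds, the original intent of "between 2 and 10 jobs" with 12 processes per job would be expressed as (a sketch under that assumption):

```python
# minimum/maximum expressed in individual worker processes,
# not dask-worker jobs: 2 jobs * 12 = 24, 10 jobs * 12 = 120.
cluster.adapt(minimum=2 * 12, maximum=10 * 12, interval='500ms')
```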

mrocklin (Member) commented:
@jhamman is this now resolved?

jhamman (Member, Author) commented Jun 20, 2018

Yes, @mrocklin. We can close this now.

jhamman closed this as completed on Jun 20, 2018.