
Use number of cores as a parameter to scale() #130

Closed
guillaumeeb opened this issue Aug 16, 2018 · 11 comments
Labels
enhancement New feature or request

Comments

@guillaumeeb
Member

We've recently switched from number of jobs to number of worker processes as the argument to the scale method. This works well with the adaptive cluster as defined in distributed, but it is somehow not really easy to deal with when using dask-jobqueue, at least for me.

Ideally, we would use a single job that spawns an entire node of our cluster. But in practice, I often find myself changing the cores and processes parameters between use cases, or to better fit the current load on my cluster, for example when other users' jobs are not using full nodes.

And in the end, when using the scale method, what I want is to scale to a given number of CPU cores, no matter the number of jobs, workers, processes, or threads. So I'm left doing the (simple) maths of translating between the number of cores and the number of worker processes, given my JobqueueCluster initialization.

It would be much easier if we could specify a cores kwarg (or something like it) to the scale method.
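The "simple maths" mentioned above can be sketched as a small helper. This is a hypothetical function for illustration, not part of dask-jobqueue; the parameter names only mirror the cluster's cores and processes kwargs:

```python
import math

def workers_for_cores(target_cores, cores_per_job, processes_per_job):
    """Translate a desired total core count into the worker-process
    count that scale() currently expects.

    Hypothetical helper: cores_per_job and processes_per_job mirror the
    cores= and processes= kwargs passed at cluster creation.
    """
    # Each worker process gets an equal share of the job's cores.
    threads_per_worker = cores_per_job // processes_per_job
    # Round up so we never end up with fewer cores than requested.
    return math.ceil(target_cores / threads_per_worker)

# With cores=24 and processes=4, each worker has 6 threads, so asking
# for 36 total cores means calling scale(6).
print(workers_for_cores(36, 24, 4))  # → 6
```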

@mrocklin
Member

mrocklin commented Aug 16, 2018 via email

@jhamman
Member

jhamman commented Aug 17, 2018

I like the ideas here. These kwargs would probably fit in the base cluster, yes?

@guillaumeeb, I'm wondering if you want to play with my branch a bit in #97. I'm having trouble wrapping it up but I think the general improvements there would make using scale and adaptive more friendly.

@guillaumeeb
Member Author

I like the ideas here. These kwargs would probably fit in the base cluster, yes?

Yes I believe so too, @mrocklin do you agree?

@mrocklin
Member

mrocklin commented Aug 17, 2018 via email

@mrocklin
Member

mrocklin commented Aug 17, 2018 via email

@guillaumeeb
Member Author

After a dask-jobqueue presentation today to teams doing heavy flight-dynamics computation, I believe this proposal could be an important improvement for easier adoption. They were really enthusiastic about the tool, but explaining the scaling parameter was a pain point, even though it's not that hard. Having scheduler jobs that don't correspond one-to-one to workers is confusing.

I see two ways of doing this:

  • The easier option would be to overload scale and adapt here in dask-jobqueue, converting ncores to nworkers.
  • The better one would be to implement the functionality upstream, as @mrocklin and @jhamman suggest. This would probably prove more complicated, especially when dealing with Cluster subclasses that don't implement the relevant methods or attributes, and with synchronizing development across both projects. Special thought goes to the cluster widget, but that can be addressed later on.
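The first option could look roughly like this: an overloaded scale that accepts a cores kwarg and converts it to a worker count locally. This is a minimal sketch under assumed names; the class, attributes, and conversion are illustrative, not the actual dask-jobqueue implementation:

```python
import math

class CoreAwareCluster:
    """Sketch of option one: a cluster whose scale() accepts either a
    worker count (current behaviour) or a total-core target.

    Assumes cores= and processes= kwargs as in dask-jobqueue; the
    attribute names here are invented for illustration.
    """

    def __init__(self, cores=24, processes=4):
        # Threads (cores) available to each worker process.
        self.worker_threads = cores // processes
        self.requested_workers = 0

    def scale(self, n=None, cores=None):
        # Convert a cores target into a worker count before delegating
        # to the existing worker-based scaling logic.
        if cores is not None:
            n = math.ceil(cores / self.worker_threads)
        self.requested_workers = n
        return n

cluster = CoreAwareCluster(cores=24, processes=4)
print(cluster.scale(cores=36))  # → 6 workers, i.e. 36 cores
```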

I'm keen to work on either solution. It would be nice to have some advice, and maybe rough insights on how best to do it if we choose the second option.

@mrocklin
Member

I agree that doing the second option would be nicer. I think that this requires us to develop a convention used by the dask deployment subprojects to record cpu and memory per worker in a standard way. I'll raise an issue upstream.
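The kind of convention being discussed could be as simple as each deployment subproject exposing per-worker resources in a standard structure, so the base Cluster class can do the cores-to-workers conversion generically. The attribute name and keys below are assumptions for illustration, not an agreed-upon spec:

```python
# Hypothetical per-worker resource record a deployment project
# (dask-jobqueue, dask-kubernetes, ...) might expose to the base
# Cluster class. Names are illustrative, not a real convention.
worker_resources = {
    "cores": 6,          # threads available to each worker process
    "memory": "24 GB",   # memory available to each worker process
}

# The base class could then convert a cores target generically:
import math
target_cores = 36
n_workers = math.ceil(target_cores / worker_resources["cores"])
print(n_workers)  # → 6
```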

@mrocklin
Member

Raised upstream at dask/distributed#2208

@guillaumeeb
Member Author

After discussions in dask/distributed#2209 and dask/distributed#2235, and after some thought, but without actually trying anything, I feel it may be simpler to just copy and paste the code from dask/distributed#2209 into dask-jobqueue. But should I:

I don't see other solutions, but I don't really like any of them; an external opinion would be welcome here. Is it really wrong to duplicate distributed.deploy.Cluster here while waiting for dask/distributed#2235 to be resolved?

@mrocklin
Member

mrocklin commented Oct 2, 2018 via email

@jhamman
Member

jhamman commented Oct 2, 2018

I also don't have particularly strong feelings about this. If you remember, I was also heading down the path of overriding the scale method. My hope is that a more elegant solution comes out of dask/distributed#2235. In the short term, though, I think going down this path may yield some important lessons.
