
Use number of cores as a parameter to scale() #130

Closed
guillaumeeb opened this issue Aug 16, 2018 · 11 comments
Labels
enhancement New feature or request

Comments

@guillaumeeb
Member

We've recently switched from number of jobs to number of worker processes as the argument to the scale method. This works well with the adaptive cluster as defined in distributed, but it is somehow not really easy to deal with when using dask-jobqueue, at least for me.

Ideally, we would use a single job that spawns an entire node of our cluster. But in practice, I often find myself changing the cores and processes parameters between use cases, or to better fit the current load on my cluster, for example when other users' jobs are not using full nodes.

And in the end, when using the scale method, what I want is to scale to a given number of CPU cores, no matter the number of jobs, workers, processes, or threads. So I'm left doing the (simple) maths of translating between the number of cores and the number of worker processes, given my JobqueueCluster initialization.

It would be much easier if we could specify a cores kwarg (or something like it) to the scale method.
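The "simple maths" mentioned above can be sketched as a small helper. This is a hypothetical function for illustration, not part of dask-jobqueue; the parameter names only mirror the cluster's cores and processes kwargs:

```python
import math

def workers_for_cores(target_cores, cores_per_job, processes_per_job):
    """Translate a desired total core count into the worker-process
    count that scale() currently expects.

    Hypothetical helper: cores_per_job and processes_per_job mirror the
    cores= and processes= kwargs passed at cluster creation.
    """
    # Each worker process gets an equal share of the job's cores.
    threads_per_worker = cores_per_job // processes_per_job
    # Round up so we never end up with fewer cores than requested.
    return math.ceil(target_cores / threads_per_worker)

# With cores=24 and processes=4, each worker has 6 threads, so asking
# for 36 total cores means calling scale(6).
print(workers_for_cores(36, 24, 4))  # → 6
```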

@mrocklin
Member

mrocklin commented Aug 16, 2018 via email

@jhamman
Member

jhamman commented Aug 17, 2018

I like the ideas here. These kwargs would probably fit in the base cluster, yes?

@guillaumeeb, I'm wondering if you want to play with my branch a bit in #97. I'm having trouble wrapping it up but I think the general improvements there would make using scale and adaptive more friendly.

@guillaumeeb
Member Author

I like the ideas here. These kwargs would probably fit in the base cluster, yes?

Yes I believe so too, @mrocklin do you agree?

@mrocklin
Member

mrocklin commented Aug 17, 2018 via email

@mrocklin
Member

mrocklin commented Aug 17, 2018 via email

@guillaumeeb
Member Author

After a dask-jobqueue presentation today to teams doing heavy flight-dynamics computation, I believe this proposal could be an important improvement for easier adoption. They were really enthusiastic about the tool, but explaining the scaling parameter was a pain point, even though it's not that hard. Having scheduler jobs that don't correspond one-to-one to workers is confusing.

I see two ways of doing this:

  • The easier option would be to overload scale and adapt here in dask-jobqueue, converting ncores to nworkers.
  • The better one would be to implement the functionality upstream, as @mrocklin and @jhamman suggest. This would probably prove more complicated, especially when dealing with Cluster subclasses that don't implement the relevant methods or attributes, and with synchronizing development across both projects. Special thought goes to the cluster widget, but that can be addressed later on.
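The first option could look roughly like this: an overloaded scale that accepts a cores kwarg and converts it to a worker count locally. This is a minimal sketch under assumed names; the class, attributes, and conversion are illustrative, not the actual dask-jobqueue implementation:

```python
import math

class CoreAwareCluster:
    """Sketch of option one: a cluster whose scale() accepts either a
    worker count (current behaviour) or a total-core target.

    Assumes cores= and processes= kwargs as in dask-jobqueue; the
    attribute names here are invented for illustration.
    """

    def __init__(self, cores=24, processes=4):
        # Threads (cores) available to each worker process.
        self.worker_threads = cores // processes
        self.requested_workers = 0

    def scale(self, n=None, cores=None):
        # Convert a cores target into a worker count before delegating
        # to the existing worker-based scaling logic.
        if cores is not None:
            n = math.ceil(cores / self.worker_threads)
        self.requested_workers = n
        return n

cluster = CoreAwareCluster(cores=24, processes=4)
print(cluster.scale(cores=36))  # → 6 workers, i.e. 36 cores
```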

I'm keen to work on either solution. It would be nice to have some advice, and maybe rough insights on how best to do it if we choose the second option.

@mrocklin
Member

I agree that doing the second option would be nicer. I think that this requires us to develop a convention used by the dask deployment subprojects to record cpu and memory per worker in a standard way. I'll raise an issue upstream.
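The kind of convention being discussed could be as simple as each deployment subproject exposing per-worker resources in a standard structure, so the base Cluster class can do the cores-to-workers conversion generically. The attribute name and keys below are assumptions for illustration, not an agreed-upon spec:

```python
# Hypothetical per-worker resource record a deployment project
# (dask-jobqueue, dask-kubernetes, ...) might expose to the base
# Cluster class. Names are illustrative, not a real convention.
worker_resources = {
    "cores": 6,          # threads available to each worker process
    "memory": "24 GB",   # memory available to each worker process
}

# The base class could then convert a cores target generically:
import math
target_cores = 36
n_workers = math.ceil(target_cores / worker_resources["cores"])
print(n_workers)  # → 6
```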

@mrocklin
Member

Raised upstream at dask/distributed#2208

@guillaumeeb
Member Author

After discussions in dask/distributed#2209 and dask/distributed#2235, and after some thought, but without actually trying anything, I feel it may be simpler to just copy and paste the code from dask/distributed#2209 into dask-jobqueue. But should I:

I don't see other solutions, but I don't really like any of them; an external opinion would be welcome here. Is it really wrong to duplicate distributed.deploy.Cluster here while waiting for dask/distributed#2235 to be resolved?

@mrocklin
Member

mrocklin commented Oct 2, 2018 via email

@jhamman
Member

jhamman commented Oct 2, 2018

I also don't have particularly strong feelings about this. If you remember, I was also heading down the path of overriding the scale method. My hope is that a more elegant solution comes out of dask/distributed#2235. In the short term, though, I think going down this path may yield some important lessons.
