Support for nprocs? #349

Closed
hhuuggoo opened this issue Jul 21, 2021 · 8 comments

@hhuuggoo

Is there any interest in building in support for nprocs? I know the consensus in #84 was that having a 1-1 relationship between processes and pods makes the most sense.

We use nprocs because

  • Some workloads work better with processes vs threads.
  • We prefer to think in terms of machines rather than pods.

I've considered thinking in pods rather than machines, but for the clusters we manage, machines are the fundamental unit people pay for, and it's easy to end up in a situation where machines are under-utilized at the k8s level. Yes, k8s can move pods around, but that can end up disrupting longer-running workloads.

For the most part, using dask-kubernetes with nprocs > 1 has worked pretty well. It can get a little goofy: if nprocs=4 and I call scale(4), I end up with 16 workers. I think the most value would come from making Adaptive understand nprocs.
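
A minimal sketch of that mismatch with the classic KubeCluster (the template name here is hypothetical, and the exact API varies by dask-kubernetes version):

```python
from dask_kubernetes import KubeCluster

# Assumes a worker-template.yaml whose worker command runs
# `dask-worker --nprocs 4 --nthreads 1`, i.e. one pod = four worker processes.
cluster = KubeCluster.from_yaml("worker-template.yaml")

cluster.scale(4)  # scale() counts pods, so this starts 4 pods...
# ...and the scheduler ends up seeing 4 pods x 4 processes = 16 workers.
```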

So the question is just whether anyone else cares about this. If it's just me, I'll subclass Adaptive and call it a day. Otherwise, I can add this functionality into dask-kubernetes.

@woodcockr

I think this would be good. There's a difference between processes and Pods in terms of worker locality for data comms.
I also prefer to think of machines rather than pods.

@jacobtomlinson
Member

I have no specific objection to having multiple processes within a pod. Ideally I would like things to be configurable with sensible defaults. One process per pod is a sensible default, but if you have a workload that would benefit from some additional tuning, then sure, go for it.

The issue with Adaptive not being aware of processes is not dask-kubernetes specific; it would need to be addressed upstream in distributed.

@hhuuggoo do you have someone at Saturn who could contribute here and work on this?

@hhuuggoo
Author

@hhuuggoo do you have someone at Saturn who could contribute here and work on this?

Yes! It will probably be me, though I might be a bit slow on this. @jacobtomlinson, do you think I should move this issue to distributed? Are there other cluster types that typically have the same issue? dask-kubernetes is the only one I've been thinking about lately.

@jacobtomlinson
Member

I think there are two things here:

  1. dask-kubernetes does not expose nprocs as a configuration option.
  2. Adaptive is not aware of multiple processes.

I think item 1 should remain here in this issue, but it might be worth raising a separate issue on distributed for item 2.

@hhuuggoo
Author

@jacobtomlinson I'm not sure item 2 needs to be addressed.

The scheduler already understands hosts:

https://github.com/dask/distributed/blob/1be9265ac11876df766bb8bd6d6eb519d04d3bac/distributed/scheduler.py#L6398

and Adaptive supports configuring that parameter:

https://github.com/dask/distributed/blob/1be9265ac11876df766bb8bd6d6eb519d04d3bac/distributed/deploy/adaptive.py#L93

I think we would only need to modify dask-kubernetes to configure Adaptive with the proper key?
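
If that is the right route, a minimal sketch of what the dask-kubernetes side could look like, assuming Adaptive's worker_key argument is forwarded to Scheduler.workers_to_close as its key:

```python
# Untested sketch: group workers by host when scaling down, so the scheduler
# recommends retiring whole machines (all nprocs processes on a host) at once.
# Assumes `worker_key` is passed through to Scheduler.workers_to_close as `key`.
cluster.adapt(
    minimum=1,
    maximum=10,
    worker_key=lambda ws: ws.host,  # ws is a WorkerState; group by host/machine
)
```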

@hhuuggoo
Author

Actually, I think there is one other thing I can raise with distributed. I'm not sure how important it is yet, but if we always want to use host as the key when calling workers_to_close on the Scheduler, we would need to make sure the scheduler passes that parameter too: adaptive_target (which SpecCluster uses to figure out how many workers to scale down) calls workers_to_close with no arguments.

@hhuuggoo
Author

Just a note that I've started digging around, and I'm not sure this is an issue anymore (I was looking at 2021.07 earlier last week). I believe the recommendations I'm getting back from the scheduler are for whole pods, but I can confirm on this issue later once I dig deeper.

I do think there is an issue where dask_kubernetes does not know that pods are still starting. I had a situation where the scheduler wanted to scale down to 1, and it resulted in all pods being shut down except for one that was still starting up. When I confirm that, I will write it up as a separate issue and possibly close this one.

@jacobtomlinson
Member

The classic KubeCluster was removed in #890. All users will need to migrate to the Dask Operator. Closing.

@jacobtomlinson closed this as not planned on Apr 30, 2024