
Redesign Cluster Managers #2235

@mrocklin

Description


Over the past few months we've learned more about the requirements to deploy Dask on different cluster resource managers. This has created several projects like dask-yarn, dask-jobqueue, and dask-kubernetes. These projects share some code within this repository, but also add their own constraints.
Now that we have more experience, this might be a good time to redesign things both in the external projects and in the central API.

There are a few changes that have been proposed that likely affect everyone:

  1. We would like to separate the Cluster manager object from the Scheduler in some cases
  2. We would like to optionally change how the user specifies cluster size to incorporate things like number of cores or amount of memory. In the future this might also expand to other resources like GPUs.
  3. We would like to build more UI elements around the cluster object, like JupyterLab server extensions

The first two will likely require synchronized changes to all of the downstream projects.

Separate ClusterManager, Scheduler, and Client

I'm going to refer to the object that starts and stops workers (like KubeCluster) as the ClusterManager from here on.

Previously the ClusterManager, Scheduler, and Client were started from within a notebook and so all lived in the same process, and could act on the same event loop thread. It didn't matter so much which part did which action. We're now considering separating all of these. This forces us to think about the kinds of communication they'll have to do, and where certain decisions and actions take place. At first glance we might assign actions like the following:

  1. ClusterManager
    • Start new workers
    • Force-kill old workers
    • Have UI that allows users to specify a desired makeup of workers, either a specific number or an Adaptive range. I'm going to call this the Worker Spec
  2. Scheduler
    • Maintain Adaptive logic, and determine how many workers should exist given current load
    • Choose which workers to remove, and retire those workers politely
  3. Client: normal operation, I don't think that anything needs to change here.
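To make the division of responsibilities above concrete, here is a minimal sketch of what a ClusterManager interface might look like. The method names (`scale_up`, `kill_worker`, `set_worker_spec`) and the shape of the Worker Spec are hypothetical, not an agreed-upon API:

```python
from abc import ABC, abstractmethod


class ClusterManager(ABC):
    """Hypothetical interface for the object that starts and stops workers.

    Method and attribute names here are illustrative only.
    """

    @abstractmethod
    def scale_up(self, n):
        """Start new workers until ``n`` have been launched."""

    @abstractmethod
    def kill_worker(self, worker_address):
        """Force-kill a worker that the scheduler has already retired."""

    def set_worker_spec(self, spec):
        """Record the user's desired makeup of workers (the Worker Spec).

        ``spec`` might be a fixed count (``{"n": 10}``) or an Adaptive
        range (``{"minimum": 0, "maximum": 100}``).
        """
        self.worker_spec = spec
```

Note that the Adaptive decision-making itself lives on the Scheduler in this split; the ClusterManager only carries out start/kill actions and holds the user's intent.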

Some of these actions depend on information from the others. In particular:

  • The ClusterManager needs to send the following to the Scheduler
    • The Worker Spec (like amount of desired memory or adaptive range), whenever it changes
  • The Scheduler needs to send the following to the ClusterManager
    • The desired target number of workers given current load (if in adaptive mode)
    • An announcement every time a new worker arrives (to manage the difference between launched/pending and actively running workers)
    • Requests to force-kill and remove workers after it has politely retired them
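The messages above could be sketched as simple structured payloads. The operation names and fields below are hypothetical stand-ins, just to show what each direction of the protocol might carry:

```python
# Hypothetical wire messages between ClusterManager and Scheduler.
# Operation names and field names are illustrative, not an existing protocol.

# ClusterManager -> Scheduler: the Worker Spec, sent whenever it changes
worker_spec_msg = {
    "op": "worker-spec",
    "spec": {"minimum": 0, "maximum": 100},  # an Adaptive range
}

# Scheduler -> ClusterManager: desired target size given current load
# (only relevant in adaptive mode)
target_msg = {"op": "scale", "target": 12}

# Scheduler -> ClusterManager: a new worker has arrived, so the manager
# can move it from "launched/pending" to "actively running"
arrived_msg = {"op": "worker-arrived", "address": "tcp://10.0.0.5:8786"}

# Scheduler -> ClusterManager: politely-retired workers to force-kill
kill_msg = {"op": "kill-workers", "addresses": ["tcp://10.0.0.5:8786"]}
```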

This is all just a guess. I suspect that other things might arise when actually trying to build everything here.

How to specify cluster size

This came up in #2208 by @guillaumeeb . Today we ask users for a number of desired workers

cluster.scale(10)

But we might instead want to allow users to specify an amount of desired memory

cluster.scale('1TB')

And in the future people will probably also ask for more complex compositions, like some small workers and some big workers, some with GPUs, and so on.

If we now have to establish a communication protocol between the ClusterManager and the Scheduler/Adaptive then it might be an interesting challenge to make that future-proof.

Further discussion on this topic should probably remain in #2208 , but I wanted to mention it here.

Shared UI

@ian-r-rose has done some excellent work on the JupyterLab extension for Dask. He has also mentioned thoughts on how to include the ClusterManager within something like that as a server extension. This would optionally move the ClusterManager out of the notebook and into the JupyterLab sidebar. A number of people have asked for this in the past. If we're going to redesign things, I thought we should also include UI in that process to make sure we capture any constraints it adds.

Organization

Should the ClusterManager object continue to be a part of this repository or should it be spun out? There are probably costs and benefits both ways.

cc @jcrist (dask-yarn) @jacobtomlinson (dask-kubernetes) @jhamman @guillaumeeb @lesteve (dask-jobqueue) @ian-r-rose (dask-labextension) for feedback.
