
Server extension to start cluster #30

Closed
mrocklin opened this issue Oct 22, 2018 · 20 comments

@mrocklin
Member

It would be useful to be able to start and stop clusters from within the sidebar rather than within a notebook. This would allow clusters to persist between notebooks and between notebook sessions.

So, how do we start, stop, scale, and adapt clusters within the sidebar? Presumably this requires ...

  1. A server extension that runs a bit of Python and actually manages the Cluster object (we'll have to configure which cluster object we use with config files)
  2. Some UI elements within the JLab sidebar that connect to that Python process
  3. A mechanism to communicate between that process and the notebook processes (maybe some sort of json file that we write to a pre-configured place?)

@ian-r-rose is this something that you have time to help with? I'm quite happy to help out with this, but I suspect that would benefit from having you (or someone with your experience) lead.
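Concretely, the server-extension piece (item 1) might keep something like the registry below. `ClusterRegistry` and its dict entries are purely hypothetical stand-ins for real Dask `Cluster` objects, just to show the lifecycle operations the sidebar would need:

```python
import uuid

class ClusterRegistry:
    """Hypothetical in-memory store a server extension could keep.

    Each entry stands in for a real Dask ``Cluster`` object; here it is
    just a dict so the lifecycle logic can be shown without a scheduler.
    """

    def __init__(self):
        self._clusters = {}

    def start(self, workers=0):
        cluster_id = uuid.uuid4().hex
        self._clusters[cluster_id] = {"id": cluster_id, "workers": workers}
        return cluster_id

    def scale(self, cluster_id, workers):
        self._clusters[cluster_id]["workers"] = workers

    def stop(self, cluster_id):
        del self._clusters[cluster_id]

    def list(self):
        return list(self._clusters.values())
```

The UI elements (item 2) would then just be views over this registry, and the notebook-communication mechanism (item 3) would hand out whatever address each entry wraps.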

@ian-r-rose
Collaborator

Yes, I can help with some of this. To my mind, the most difficult thing will be managing the lifecycle of the clusters. Your basic description of the setup sounds reasonable. For communicating between the server side and the client side: I think rather than a config json we should probably model it after the rest of the notebook REST API. In particular, the sessions API is probably a good place to start.
Major questions that I have:

  • How do we poll for clusters in a backend-agnostic way? Does dask-distributed have all the abstractions we need?
  • What are things that need to be configurable on the backend? How much of that should be exposed to the frontend vs configured at server launch time?

@mrocklin
Member Author

So whenever someone starts up a Client in a Python session we would optionally hit some address to see if it will respond? Presumably it responds with the address that we should check? That could work.

I can imagine doing this either over HTTP similar to how Jupyter does things, or using Dask comms instead.

>>> client = Client('http://localhost/api/dask') 
>>> client.scheduler.address
'tcp://some-other-address:####'
>>> client = Client('tcp://localhost:8786')  # actually the address of our nbserver extension
>>> client.scheduler.address
'tcp://some-other-address:####'

@mrocklin
Member Author

> What are things that need to be configurable on the backend? How much of that should be exposed to the frontend vs configured at server launch time?

Well, to start we'll need to decide what library we're using to construct clusters. Common choices today include dask-kubernetes, dask-jobqueue, dask-yarn, and the LocalCluster in the core dask.distributed library. This should probably be determined by configuration, and not by the user directly.

At runtime we'll want users to be able to start, stop, and restart their cluster. We'll also want them to have numerical or text inputs for number of cores and memory. They'll also want to be able to hit "Adapt" and have Dask take over the decision about cores and memory.
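As a rough illustration of that split (every key name here is invented, not an agreed format): the cluster class would be fixed once in server configuration, while scale and adapt requests would arrive from the frontend as small payloads:

```python
# Hypothetical server-launch configuration: picks the cluster backend once,
# out of the user's hands.
server_config = {
    "cluster_class": "distributed.LocalCluster",  # or dask-kubernetes, dask-jobqueue, ...
    "default_workers": 2,
}

# Hypothetical runtime payloads the frontend could send.
scale_request = {"workers": 8}                            # explicit worker count
adapt_request = {"adapt": {"minimum": 0, "maximum": 10}}  # hand control to Dask

def wants_adaptive(payload):
    """Decide whether a payload asks for adaptive scaling or a fixed size."""
    return "adapt" in payload
```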

@mrocklin
Member Author

[image attachment]

@ian-r-rose
Collaborator

> So whenever someone starts up a Client in a Python session we would optionally hit some address to see if it will respond? Presumably it responds with the address that we should check? That could work.

Yes, we can have something like a dask/clients endpoint that returns a list of active clients, as well as ids for them. We can then hit dask/clients/&lt;client_id&gt; with GET, DELETE, etc. requests to manage them. We can poll the list every few seconds to keep up to date. This is pretty close to how we handle kernel sessions in the application. I am currently thinking that our server extension would just proxy the client dashboard URLs and comms and such, but you have a better idea of the networking requirements there than I do.
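A minimal sketch of that routing, with the endpoint names taken from this comment but everything else hypothetical (a real extension would register Tornado handlers on the notebook server rather than dispatch by hand):

```python
import re

# Hypothetical dispatch for the proposed endpoints. CLUSTERS stands in for
# whatever object actually owns the Dask client/cluster state.
CLUSTERS = {"abc123": {"id": "abc123", "workers": 4}}

def handle(method, path):
    """Route (method, path) pairs the way the REST sketch above suggests."""
    if method == "GET" and path == "/dask/clients":
        return list(CLUSTERS.values())
    match = re.fullmatch(r"/dask/clients/(\w+)", path)
    if match:
        client_id = match.group(1)
        if method == "GET":
            return CLUSTERS[client_id]
        if method == "DELETE":
            return CLUSTERS.pop(client_id)
    raise KeyError((method, path))
```

Polling every few seconds would then just be a repeated `GET /dask/clients`, exactly as the sessions API is polled.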

> Well, to start we'll need to decide what library we're using to construct clusters. Common choices today include dask-kubernetes, dask-jobqueue, dask-yarn, and the LocalCluster in the core dask.distributed library. This should probably be determined by configuration, and not by the user directly.

Are the abstractions here sufficient that we could hit multiple (or all) of these use cases with a single extension, and allow their selection via a config option?

> At runtime we'll want users to be able to start, stop, and restart their cluster. We'll also want them to have numerical or text inputs for number of cores and memory. They'll also want to be able to hit "Adapt" and have Dask take over the decision about cores and memory.

I think this should be doable via a REST API.

@mrocklin
Member Author

> Are the abstractions here sufficient that we could hit multiple (or all) of these use cases with a single extension, and allow their selection via a config option?

I think so, yes.

> Yes, we can have something like a dask/clients endpoint that returns a list of active clients

I think that we'll want to switch out the term clients for clusters or schedulers. The client is an object that the user will need to interact with in their notebook/script/whatever. That object will need the address of the scheduler to connect to.

@ian-r-rose perhaps we should chat about this real-time? We might be able to bounce back and forth and come up with a plan more quickly. I'm around most of today and tomorrow if you're free.

@ian-r-rose
Collaborator

Sure, I am around and pretty flexible today. Feel free to ping me on Gitter and we can set up a room.

@mrocklin
Member Author

@ian-r-rose and I had a quick chat, and we agreed that ...

  • a server extension should probably manage a few clusters, not just one
  • a user might attach a particular cluster to a notebook kernel by clicking and dragging something into a notebook
  • on the server side this can probably be a fairly simple tornado web application

As an initial set of operations, the following probably work pretty well

from dask.distributed import LocalCluster

cluster = LocalCluster(threads_per_worker=2, memory_limit='4GB')  # configure workers and start

cluster.scale(10)  # scale cluster to ten workers
cluster.scale(2)  # scale cluster down to two workers

cluster.adapt(minimum=0, maximum=10)  # adapt cluster between 0 and 10 workers

cluster.close()  # shut down cluster

We may at some point want to run these on the same event loop as the Jupyter web server; I'm not sure. This will probably affect some of the deployment discussions we're having upstream right now.
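For what it's worth, sharing the server's loop could look roughly like this; `start_cluster` here is a stand-in coroutine, not the real LocalCluster startup:

```python
import asyncio

async def start_cluster():
    """Stand-in for asynchronously starting a Dask cluster."""
    await asyncio.sleep(0)  # pretend to wait on scheduler/worker startup
    return {"status": "running"}

async def server_main():
    # Schedule cluster startup on the already-running server loop instead of
    # spinning up a second loop in another thread.
    task = asyncio.ensure_future(start_cluster())
    return await task

result = asyncio.run(server_main())
```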

@ian-r-rose
Collaborator

It looks like you were right to be concerned about the tornado event loop @mrocklin. In my initial explorations, just importing dask.distributed from the same process as that which is running the notebook server seems to cause problems. Specifically, the notebook server no longer responds to HTTP requests. Any ideas about what might be causing the problem or how we could get around it? Since it seems to be happening at import time, I don't really know where to start looking.

https://github.com/ian-r-rose/dask-labextension/tree/serverextension

@mrocklin
Member Author

@ian-r-rose I'm happy to investigate. This may sound dumb, but what's the right way to install and test this?

@ian-r-rose
Collaborator

Thanks @mrocklin. You can install it with

pip install -e .
jupyter serverextension enable --sys-prefix dask_labextension

This attempts to add an additional REST endpoint to the web server. However, I was able to reproduce the problem with a do-nothing extension that just imported dask.distributed.

@ian-r-rose
Collaborator

My suspicion is that both dask.distributed and the notebook server are trying to wrest control of the default tornado IOLoop and stepping on each other's toes, but I don't know that for sure.

@mrocklin
Member Author

Some binary search over imports and code led to this diff on the Dask side, which seems to solve the immediate problem:

diff --git a/distributed/utils.py b/distributed/utils.py
index df7561aa..dcdd7f5e 100644
--- a/distributed/utils.py
+++ b/distributed/utils.py
@@ -1394,8 +1394,8 @@ def reset_logger_locks():
 
 
 # Only bother if asyncio has been loaded by Tornado
-if 'asyncio' in sys.modules:
-    fix_asyncio_event_loop_policy(sys.modules['asyncio'])
+# if 'asyncio' in sys.modules:
+#     fix_asyncio_event_loop_policy(sys.modules['asyncio'])
 
 
 def has_keyword(func, keyword):

I'll look into why we did this in the first place. In the meantime, applying this diff directly may allow us to move forward.

@mrocklin
Member Author

Seems to be a workaround for tornadoweb/tornado#2183

@mrocklin
Member Author

OK, after looking more at this I'm not sure that Dask is doing something wrong here. I've standardized things on the Dask side at dask/distributed#2326 .

If possible I think we should ask someone on the Jupyter side about why this might cause issues. Who is the right contact for this today?

@mrocklin
Member Author

@minrk do you have thoughts on why adding the following lines might break the Jupyter server?

    import asyncio
    import tornado.platform.asyncio
    asyncio.set_event_loop_policy(tornado.platform.asyncio.AnyThreadEventLoopPolicy())
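For context, here is a stdlib-only illustration of what such a policy changes, using a hand-rolled equivalent rather than Tornado's actual class: the default asyncio policy refuses to auto-create an event loop outside the main thread, while an "any thread" policy creates one on demand. Installing that policy globally at import time is the kind of process-wide side effect that could interact badly with another Tornado application sharing the process:

```python
import asyncio
import threading

class AnyThreadPolicy(asyncio.DefaultEventLoopPolicy):
    """Hand-rolled stand-in for tornado's AnyThreadEventLoopPolicy."""
    def get_event_loop(self):
        try:
            return super().get_event_loop()
        except RuntimeError:
            # Default policy refused: create and install a loop on demand.
            loop = self.new_event_loop()
            self.set_event_loop(loop)
            return loop

def probe(results, key):
    try:
        loop = asyncio.get_event_loop()
        loop.close()
        results[key] = "created loop"
    except RuntimeError:
        results[key] = "RuntimeError"

results = {}

# Default policy: no loop is auto-created outside the main thread.
t = threading.Thread(target=probe, args=(results, "default"))
t.start(); t.join()

# Any-thread policy: a loop appears on demand in any thread.
asyncio.set_event_loop_policy(AnyThreadPolicy())
t = threading.Thread(target=probe, args=(results, "any_thread"))
t.start(); t.join()

asyncio.set_event_loop_policy(None)  # restore the default policy
```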

@ian-r-rose
Collaborator

Thanks for looking into this @mrocklin, I'll apply your fix while we work out a more permanent solution. @Carreau recently did a lot of work on the IPython event loop and may have some insights as well.

@Carreau

Carreau commented Oct 31, 2018

I can have a look. I've been poking at async stuff, and in some places we are still doing things the old way, calling ensure_future instead of yielding anything that is not None. For me this has led to some prototypes simply not running coroutines.

So far I'm working on deploying a JupyterHub on the Merced cluster, and once this is done I'll likely start integrating Dask, so I'm happy to be a guinea pig and debug these things.

I just need to get things to work first :-)

@ian-r-rose
Collaborator

Thanks for the info @Carreau.

@ian-r-rose mentioned this issue Nov 1, 2018
@ian-r-rose
Collaborator

Fixed by #31
