
Server extension to start cluster #30

Closed
mrocklin opened this issue Oct 22, 2018 · 20 comments

@mrocklin
Member

It would be useful to be able to start and stop clusters from within the sidebar rather than within a notebook. This would allow clusters to persist between notebooks and between notebook sessions.

So, how do we start, stop, scale, and adapt clusters within the sidebar? Presumably this requires ...

  1. A server extension that runs a bit of Python and actually manages the Cluster object (we'll have to configure which cluster object we use with config files)
  2. Some UI elements within the JLab sidebar that connect to that Python process
  3. A mechanism to communicate between that process and the notebook processes (maybe some sort of json file that we write to a pre-configured place?)

@ian-r-rose is this something that you have time to help with? I'm quite happy to help out with this, but I suspect that would benefit from having you (or someone with your experience) lead.
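Concretely, the server-extension piece (item 1) might keep something like the registry below. `ClusterRegistry` and its dict entries are purely hypothetical stand-ins for real Dask `Cluster` objects, just to show the lifecycle operations the sidebar would need:

```python
import uuid

class ClusterRegistry:
    """Hypothetical in-memory store a server extension could keep.

    Each entry stands in for a real Dask ``Cluster`` object; here it is
    just a dict so the lifecycle logic can be shown without a scheduler.
    """

    def __init__(self):
        self._clusters = {}

    def start(self, workers=0):
        cluster_id = uuid.uuid4().hex
        self._clusters[cluster_id] = {"id": cluster_id, "workers": workers}
        return cluster_id

    def scale(self, cluster_id, workers):
        self._clusters[cluster_id]["workers"] = workers

    def stop(self, cluster_id):
        del self._clusters[cluster_id]

    def list(self):
        return list(self._clusters.values())
```

The UI elements (item 2) would then just be views over this registry, and the notebook-communication mechanism (item 3) would hand out whatever address each entry wraps.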

@ian-r-rose
Collaborator

Yes, I can help with some of this. To my mind, the most difficult thing will be managing the lifecycle of the clusters. Your basic description of the setup sounds reasonable. For communicating between the server side and the client side: I think rather than a config json we should probably model it after the rest of the notebook REST API. In particular, the sessions API is probably a good place to start.
Major questions that I have:

  • How do we poll for clusters in a backend-agnostic way? Does dask-distributed have all the abstractions we need?
  • What are things that need to be configurable on the backend? How much of that should be exposed to the frontend vs configured at server launch time?

@mrocklin
Member Author

So whenever someone starts up a Client in a Python session we would optionally hit some address to see if it will respond? Presumably it responds with the address that we should check? That could work.

I can imagine doing this either over HTTP similar to how Jupyter does things, or using Dask comms instead.

>>> client = Client('http://localhost/api/dask') 
>>> client.scheduler.address
'tcp://some-other-address:####'
>>> client = Client('tcp://localhost:8786')  # actually the address of our nbserver extension
>>> client.scheduler.address
'tcp://some-other-address:####'

@mrocklin
Member Author

> What are things that need to be configurable on the backend? How much of that should be exposed to the frontend vs configured at server launch time?

Well, to start we'll need to decide what library we're using to construct clusters. Common choices today include dask-kubernetes, dask-jobqueue, dask-yarn, and the LocalCluster in the core dask.distributed library. This should probably be determined by configuration, and not by the user directly.

At runtime we'll want users to be able to start, stop, and restart their cluster. We'll also want them to have numerical or text inputs for number of cores and memory. They'll also want to be able to hit "Adapt" and have Dask take over the decision about cores and memory.
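As a rough illustration of that split (every key name here is invented, not an agreed format): the cluster class would be fixed once in server configuration, while scale and adapt requests would arrive from the frontend as small payloads:

```python
# Hypothetical server-launch configuration: picks the cluster backend once,
# out of the user's hands.
server_config = {
    "cluster_class": "distributed.LocalCluster",  # or dask-kubernetes, dask-jobqueue, ...
    "default_workers": 2,
}

# Hypothetical runtime payloads the frontend could send.
scale_request = {"workers": 8}                            # explicit worker count
adapt_request = {"adapt": {"minimum": 0, "maximum": 10}}  # hand control to Dask

def wants_adaptive(payload):
    """Decide whether a payload asks for adaptive scaling or a fixed size."""
    return "adapt" in payload
```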

@mrocklin
Member Author

[image attachment]

@ian-r-rose
Collaborator

> So whenever someone starts up a Client in a Python session we would optionally hit some address to see if it will respond? Presumably it responds with the address that we should check? That could work.

Yes, we can have something like a dask/clients endpoint that returns a list of active clients, as well as ids for them. We can then hit dask/clients/&lt;client_id&gt; with GET, DELETE, etc. requests to manage them. We can poll the list every few seconds to keep up to date. This is pretty close to how we handle kernel sessions in the application. I am currently thinking that our server extension would just proxy the client dashboard URLs and comms and such, but you have a better idea of the networking requirements there than I do.
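A minimal sketch of that routing, with the endpoint names taken from this comment but everything else hypothetical (a real extension would register Tornado handlers on the notebook server rather than dispatch by hand):

```python
import re

# Hypothetical dispatch for the proposed endpoints. CLUSTERS stands in for
# whatever object actually owns the Dask client/cluster state.
CLUSTERS = {"abc123": {"id": "abc123", "workers": 4}}

def handle(method, path):
    """Route (method, path) pairs the way the REST sketch above suggests."""
    if method == "GET" and path == "/dask/clients":
        return list(CLUSTERS.values())
    match = re.fullmatch(r"/dask/clients/(\w+)", path)
    if match:
        client_id = match.group(1)
        if method == "GET":
            return CLUSTERS[client_id]
        if method == "DELETE":
            return CLUSTERS.pop(client_id)
    raise KeyError((method, path))
```

Polling every few seconds would then just be a repeated `GET /dask/clients`, exactly as the sessions API is polled.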

> Well, to start we'll need to decide what library we're using to construct clusters. Common choices today include dask-kubernetes, dask-jobqueue, dask-yarn, and the LocalCluster in the core dask.distributed library. This should probably be determined by configuration, and not by the user directly.

Are the abstractions here sufficient that we could hit multiple (or all) of these use cases with a single extension, and allow their selection via a config option?

> At runtime we'll want users to be able to start, stop, and restart their cluster. We'll also want them to have numerical or text inputs for number of cores and memory. They'll also want to be able to hit "Adapt" and have Dask take over the decision about cores and memory.

I think this should be doable via a REST API.

@mrocklin
Member Author

> Are the abstractions here sufficient that we could hit multiple (or all) of these use cases with a single extension, and allow their selection via a config option?

I think so, yes.

> Yes, we can have something like a dask/clients endpoint that returns a list of active clients

I think that we'll want to switch out the term clients for clusters or schedulers. The client is an object that the user will need to interact with in their notebook/script/whatever. That object will need the address of the scheduler to connect to.

@ian-r-rose perhaps we should chat about this real-time? We might be able to bounce back and forth and come up with a plan more quickly. I'm around most of today and tomorrow if you're free.

@ian-r-rose
Collaborator

Sure, I am around and pretty flexible today. Feel free to ping me on Gitter and we can set up a room.

@mrocklin
Member Author

@ian-r-rose and I had a quick chat, and we agreed that ...

  • a server extension should probably manage a few clusters, not just one
  • a user might attach a particular cluster to a notebook kernel by clicking and dragging something into a notebook
  • on the server side this can probably be a fairly simple tornado web application

As an initial set of operations, the following probably work pretty well

from dask.distributed import LocalCluster

cluster = LocalCluster(threads_per_worker=2, memory_limit='4GB')  # configure workers and start

cluster.scale(10)  # scale cluster to ten workers
cluster.scale(2)  # scale cluster down to two workers

cluster.adapt(minimum=0, maximum=10)  # adapt cluster between 0 and 10 workers

cluster.close()  # shut down cluster

We may at some point want to run these on the same event loop as the Jupyter web server; I'm not sure. This will probably affect some of the deployment discussions we're having upstream right now.
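For what it's worth, sharing the server's loop could look roughly like this; `start_cluster` here is a stand-in coroutine, not the real LocalCluster startup:

```python
import asyncio

async def start_cluster():
    """Stand-in for asynchronously starting a Dask cluster."""
    await asyncio.sleep(0)  # pretend to wait on scheduler/worker startup
    return {"status": "running"}

async def server_main():
    # Schedule cluster startup on the already-running server loop instead of
    # spinning up a second loop in another thread.
    task = asyncio.ensure_future(start_cluster())
    return await task

result = asyncio.run(server_main())
```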

@ian-r-rose
Collaborator

It looks like you were right to be concerned about the tornado event loop @mrocklin. In my initial explorations, just importing dask.distributed from the same process as that which is running the notebook server seems to cause problems. Specifically, the notebook server no longer responds to HTTP requests. Any ideas about what might be causing the problem or how we could get around it? Since it seems to be happening at import time, I don't really know where to start looking.

https://github.com/ian-r-rose/dask-labextension/tree/serverextension

@mrocklin
Member Author

@ian-r-rose I'm happy to investigate. This may sound dumb, but what's the right way to install and test this?

@ian-r-rose
Collaborator

Thanks @mrocklin. You can install it with

pip install -e .
jupyter serverextension enable --sys-prefix dask_labextension

This attempts to add an additional REST endpoint to the web server. However, I was able to reproduce the problem with a do-nothing extension that just imported dask.distributed.

@ian-r-rose
Collaborator

My suspicion is that both dask.distributed and the notebook server are trying to wrest control of the default tornado IOLoop and stepping on each other's toes, but I don't know that for sure.

@mrocklin
Member Author

Some binary search over imports and code led to this diff on the Dask side, which seems to solve the immediate problem:

diff --git a/distributed/utils.py b/distributed/utils.py
index df7561aa..dcdd7f5e 100644
--- a/distributed/utils.py
+++ b/distributed/utils.py
@@ -1394,8 +1394,8 @@ def reset_logger_locks():
 
 
 # Only bother if asyncio has been loaded by Tornado
-if 'asyncio' in sys.modules:
-    fix_asyncio_event_loop_policy(sys.modules['asyncio'])
+# if 'asyncio' in sys.modules:
+#     fix_asyncio_event_loop_policy(sys.modules['asyncio'])
 
 
 def has_keyword(func, keyword):

I'll look into why we did this in the first place. In the meantime, applying this diff directly may allow us to move forward.

@mrocklin
Member Author

Seems to be a workaround for tornadoweb/tornado#2183

@mrocklin
Member Author

OK, after looking more at this I'm not sure that Dask is doing something wrong here. I've standardized things on the Dask side at dask/distributed#2326 .

If possible I think we should ask someone on the Jupyter side about why this might cause issues. Who is the right contact for this today?

@mrocklin
Member Author

@minrk do you have thoughts on why adding the following lines might break the Jupyter server?

    import asyncio
    import tornado.platform.asyncio
    asyncio.set_event_loop_policy(tornado.platform.asyncio.AnyThreadEventLoopPolicy())
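For context, here is a stdlib-only illustration of what such a policy changes, using a hand-rolled equivalent rather than Tornado's actual class: the default asyncio policy refuses to auto-create an event loop outside the main thread, while an "any thread" policy creates one on demand. Installing that policy globally at import time is the kind of process-wide side effect that could interact badly with another Tornado application sharing the process:

```python
import asyncio
import threading

class AnyThreadPolicy(asyncio.DefaultEventLoopPolicy):
    """Hand-rolled stand-in for tornado's AnyThreadEventLoopPolicy."""
    def get_event_loop(self):
        try:
            return super().get_event_loop()
        except RuntimeError:
            # Default policy refused: create and install a loop on demand.
            loop = self.new_event_loop()
            self.set_event_loop(loop)
            return loop

def probe(results, key):
    try:
        loop = asyncio.get_event_loop()
        loop.close()
        results[key] = "created loop"
    except RuntimeError:
        results[key] = "RuntimeError"

results = {}

# Default policy: no loop is auto-created outside the main thread.
t = threading.Thread(target=probe, args=(results, "default"))
t.start(); t.join()

# Any-thread policy: a loop appears on demand in any thread.
asyncio.set_event_loop_policy(AnyThreadPolicy())
t = threading.Thread(target=probe, args=(results, "any_thread"))
t.start(); t.join()

asyncio.set_event_loop_policy(None)  # restore the default policy
```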

@ian-r-rose
Collaborator

Thanks for looking into this @mrocklin, I'll apply your fix while we work out a more permanent solution. @Carreau recently did a lot of work on the IPython event loop and may have some insights as well.

@Carreau

Carreau commented Oct 31, 2018

I can have a look. I've been poking at async stuff, and in some places we are still doing things the old way, calling ensure_future instead of yielding anything that is not None. For me this has led to some prototypes simply not running coroutines.

So far I'm working on deploying a JupyterHub on the Merced cluster, and once this is done I'll likely start integrating Dask, so I'm happy to be a guinea pig and debug these things.

I just need to get things to work first :-)

@ian-r-rose
Collaborator

Thanks for the info @Carreau.

@ian-r-rose mentioned this issue Nov 1, 2018
@ian-r-rose
Collaborator

Fixed by #31
