Server extension to start cluster #30
Yes, I can help with some of this. To my mind, the most difficult thing will be managing the lifecycle of the clusters. Your basic description of the setup sounds reasonable. For communicating between the server side and the client side: I think rather than a config json we should probably model it after the rest of the notebook REST API. In particular, the sessions API is probably a good place to start.
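To make the sessions-API analogy concrete, here is a rough sketch of what a cluster resource model could look like if patterned on that API. All field names here are illustrative assumptions, not an agreed-upon schema:

```python
import json

# Hypothetical resource model for one Dask cluster, patterned on the
# Jupyter sessions REST API.  Every key below is illustrative only.
cluster_model = {
    "id": "0",                                 # server-assigned cluster id
    "name": "my-cluster",
    "scheduler_address": "tcp://127.0.0.1:8786",
    "dashboard_link": "http://127.0.0.1:8787/status",
    "workers": 4,
    "adapt": None,                             # or {"minimum": 0, "maximum": 10}
}

# A hypothetical GET /api/dask/clusters endpoint would return a JSON list:
payload = json.dumps([cluster_model])
```

As with the sessions API, POST/DELETE on the collection would then map naturally onto creating and shutting down clusters.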
So whenever someone starts up a … I can imagine doing this either over HTTP, similar to how Jupyter does things, or using Dask comms instead.

```python
>>> client = Client('http://localhost/api/dask')
>>> client.scheduler.address
'tcp://some-other-address:####'
```

```python
>>> client = Client('tcp://localhost:8786')  # actually the address of our nbserver extension
>>> client.scheduler.address
'tcp://some-other-address:####'
```
Well, to start we'll need to decide what library we're using to construct clusters. Common choices today include dask-kubernetes, dask-jobqueue, dask-yarn, and the LocalCluster in the core dask.distributed library. This should probably be determined by configuration, and not by the user directly. At runtime we'll want users to be able to start, stop, and restart their cluster. We'll also want them to have numerical or text inputs for number of cores and memory. They'll also want to be able to hit "Adapt" and have Dask take over the decision about cores and memory.
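One hedged sketch of how configuration, rather than the user, could pick the cluster class: resolve a dotted class name from a config dict with `importlib`. The config keys (`factory`, `kwargs`) are hypothetical, not part of any existing extension:

```python
import importlib

def make_cluster(config):
    """Instantiate whichever cluster class the deployment's config names.

    `config` is a hypothetical dict such as:
        {"factory": "dask.distributed.LocalCluster",
         "kwargs": {"threads_per_worker": 2}}
    """
    module_name, _, class_name = config["factory"].rpartition(".")
    cluster_class = getattr(importlib.import_module(module_name), class_name)
    return cluster_class(**config.get("kwargs", {}))
```

Swapping dask-kubernetes for dask-jobqueue would then be a one-line config change, with the user-facing start/stop/scale/adapt controls unchanged.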
Yes, we can have something like a …
Are the abstractions here sufficient that we could hit multiple (or all) of these use cases with a single extension, and allow their selection via a config option?
I think this should be doable via a REST API.
I think so, yes.
I think that we'll want to switch out the term clients for clusters or schedulers. The client is an object that the user will need to interact with in their notebook/script/whatever. That object will need the address of the scheduler to connect to. @ian-r-rose perhaps we should chat about this in real time? We might be able to bounce back and forth and come up with a plan more quickly. I'm around most of today and tomorrow if you're free.
Sure, I am around and pretty flexible today. Feel free to ping me on Gitter and we can set up a room.
@ian-r-rose and I had a quick chat; we agreed that ...
As an initial set of operations, the following probably work pretty well:

```python
from dask.distributed import LocalCluster

cluster = LocalCluster(threads_per_worker=2, memory_limit='4GB')  # configure workers and start
cluster.scale(10)                     # scale cluster to ten workers
cluster.scale(2)                      # scale cluster down to two workers
cluster.adapt(minimum=0, maximum=10)  # adapt cluster between 0 and 10 workers
cluster.close()                       # shut down cluster
```

We may at some point want to run these on the same event loop as the Jupyter web server; I'm not sure. This will probably affect some deployment discussions that we're thinking about upstream right now.
It looks like you were right to be concerned about the tornado event loop @mrocklin. In my initial explorations, just importing … caused problems. My work so far is at https://github.com/ian-r-rose/dask-labextension/tree/serverextension
@ian-r-rose I'm happy to investigate. This may sound dumb, but what's the right way to install and test this?
Thanks @mrocklin. You can install it with …
This attempts to add an additional REST endpoint to the web server. However, I was able to reproduce the problem with a do-nothing extension that just imported …
My suspicion is that both …
Some binary searching of imports and code led to this diff on the Dask side, which seems to solve the immediate problem:

```diff
diff --git a/distributed/utils.py b/distributed/utils.py
index df7561aa..dcdd7f5e 100644
--- a/distributed/utils.py
+++ b/distributed/utils.py
@@ -1394,8 +1394,8 @@ def reset_logger_locks():
 # Only bother if asyncio has been loaded by Tornado
-if 'asyncio' in sys.modules:
-    fix_asyncio_event_loop_policy(sys.modules['asyncio'])
+# if 'asyncio' in sys.modules:
+#     fix_asyncio_event_loop_policy(sys.modules['asyncio'])

 def has_keyword(func, keyword):
```

I'll look into why we did this in the first place. In the meantime, though, applying this diff directly may allow us to move forward.
Seems to be a workaround for tornadoweb/tornado#2183 |
OK, after looking more at this I'm not sure that Dask is doing something wrong here. I've standardized things on the Dask side at dask/distributed#2326 . If possible I think we should ask someone on the Jupyter side about why this might cause issues. Who is the right contact for this today? |
@minrk do you have thoughts on why adding the following lines might break the Jupyter server?

```python
import asyncio
import tornado.platform.asyncio
asyncio.set_event_loop_policy(tornado.platform.asyncio.AnyThreadEventLoopPolicy())
```
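For context on what that policy change does: `AnyThreadEventLoopPolicy` makes `asyncio.get_event_loop()` create a loop on demand in non-main threads instead of raising. Below is a rough stdlib-only sketch of the idea — my own reimplementation for illustration, not Tornado's actual code:

```python
import asyncio
import threading

class AnyThreadPolicy(asyncio.DefaultEventLoopPolicy):
    """Rough sketch of what Tornado's AnyThreadEventLoopPolicy does:
    fall back to creating a fresh loop when the current (non-main)
    thread has none, instead of raising RuntimeError."""

    def get_event_loop(self):
        try:
            return super().get_event_loop()
        except RuntimeError:
            loop = self.new_event_loop()
            self.set_event_loop(loop)
            return loop

results = []

def worker():
    asyncio.set_event_loop_policy(AnyThreadPolicy())
    # Without the policy, get_event_loop() in a plain non-main thread
    # raises RuntimeError; with it, a loop is created on demand.
    loop = asyncio.get_event_loop()
    results.append(loop.run_until_complete(asyncio.sleep(0, result="ok")))
    loop.close()

t = threading.Thread(target=worker)
t.start()
t.join()
```

Globally swapping the process-wide policy like this is exactly why importing one library can change event-loop behavior for an unrelated web server running in the same process.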
I can have a look. I've been poking at async stuff, and in some places we are still doing things the old way, calling ensure_future instead of yielding, which has led to some of my prototypes simply not running coroutines. So far I'm working on deploying a JupyterHub on the Merced cluster; once this is done I'll likely start integrating Dask, so I'm happy to be a guinea pig and help debug these things. I just need to get things to work first :-)
Thanks for the info @Carreau. |
Fixed by #31 |
It would be useful to be able to start and stop clusters from within the sidebar rather than within a notebook. This would allow clusters to persist between notebooks and between notebook sessions.
So, how do we start, stop, scale, and adapt clusters within the sidebar? Presumably this requires ...
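One way to sketch the server-side state such a sidebar would need, assuming hypothetical route names and a factory injected by configuration (none of this is an existing API): an in-memory manager keyed by cluster id, whose methods map onto REST verbs.

```python
import itertools

class ClusterManager:
    """Hypothetical in-memory registry backing sidebar actions, e.g.:

    POST   /api/dask/clusters        -> start_cluster
    DELETE /api/dask/clusters/<id>   -> stop_cluster
    PUT    /api/dask/clusters/<id>   -> scale_cluster
    GET    /api/dask/clusters        -> list_clusters
    """

    def __init__(self, cluster_factory):
        self._factory = cluster_factory   # e.g. LocalCluster, chosen by config
        self._clusters = {}
        self._ids = itertools.count()

    def start_cluster(self):
        cluster_id = str(next(self._ids))
        self._clusters[cluster_id] = self._factory()
        return cluster_id

    def stop_cluster(self, cluster_id):
        self._clusters.pop(cluster_id).close()

    def scale_cluster(self, cluster_id, n):
        self._clusters[cluster_id].scale(n)

    def list_clusters(self):
        return sorted(self._clusters)
```

Because the manager lives in the server extension rather than in any one kernel, clusters started this way would naturally persist across notebooks and notebook sessions.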
@ian-r-rose is this something that you have time to help with? I'm quite happy to help out with this, but I suspect it would benefit from having you (or someone with your experience) lead.