
dask-jobqueue binder #276

Open
lesteve opened this issue May 24, 2019 · 6 comments
Labels
documentation Documentation-related sprint A good issue to tackle during a sprint

Comments

@lesteve
Member

lesteve commented May 24, 2019

The idea is to have a binder setup with a toy cluster so that people can play with dask-jobqueue a bit without having to set it up on their cluster.

Our SLURM CI setup uses a single Dockerfile; maybe this image could be used for a binder.

Binder allows you to use a Dockerfile:
https://github.com/binder-examples/minimal-dockerfile

Questions:

  • How does this idea work in practice? Is 1-2 GB of RAM enough for a toy cluster?
  • If I use binder.pangeo.io, does it work better? (There seems to be more RAM available on pangeo.io.)
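To get a feel for whether 1-2 GB is enough, here is a rough back-of-the-envelope budget. Every figure below is an assumption for illustration (not a measurement of the actual image), and the process list is a guess at what runs in a single-pod toy SLURM setup:

```python
# Rough memory budget for a toy SLURM cluster inside a single binder pod.
# All figures are assumptions for illustration, not measurements.
overhead_mb = {
    "jupyter + notebook kernel": 300,
    "slurmctld + slurmd + munge": 150,
    "mysqld (slurmdbd backend)": 200,
    "dask scheduler": 150,
}
n_workers = 2
worker_limit_mb = 256  # hypothetical per-worker memory limit

total_mb = sum(overhead_mb.values()) + n_workers * worker_limit_mb
print(f"estimated total: {total_mb} MB")  # → 1312 MB, under a 2 GB pod
```

Under these assumptions a couple of small workers fit in a 2 GB pod, but not much more.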

If this idea works, we could think about what kind of notebooks to create (related to #253).

@lesteve lesteve added the sprint A good issue to tackle during a sprint label May 24, 2019
@lesteve
Member Author

lesteve commented May 24, 2019

I am going to try to do this and see how far I can push it.

@lesteve
Member Author

lesteve commented May 28, 2019

I have some proof of concept here:
https://github.com/lesteve/test-binder

Here is the binder link:
https://mybinder.org/v2/gh/lesteve/test-binder/master

For now there is a single notebook, simple.ipynb. Comments are more than welcome @willirath @guillaumeeb!

Full disclosure: I have seen some sporadic problems with the processes supervised by supervisord (mostly, mysqld does not start correctly, for a reason I have not yet figured out ...). I think we can probably find a work-around for this.

@guillaumeeb
Member

Thanks @lesteve! This is nice!

I had trouble getting the binder to start; I needed to launch it 4 times... I don't know why. Then the mysqld daemon was not started, but thanks to your first cell I could start it easily.

I think the idea works, and RAM may not be a limitation for some simple examples. There may be more on the Pangeo binder, but I am not sure this will make a big difference if we don't use separate pods for the workers.

The first question that came to my mind is: how is using SLURMCluster different from LocalCluster? That's the beauty of Dask: just swap LocalCluster for SLURMCluster and the rest of the code stays the same. So what examples specific to dask-jobqueue can we set up?

  • Is LocalCluster able to use the adaptive logic?
  • Should we show the arguments specific to a job queuing system, like local-directory or the memory resources?
  • Should we add an HPC-like example: a Monte Carlo simulation, like computing Pi?
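A Monte Carlo Pi estimate is embarrassingly parallel, which makes it a natural fit for the futures interface. A minimal stdlib-only sketch of such an example is below; it uses `concurrent.futures.ProcessPoolExecutor` as a local stand-in, since dask's `Client.submit`/`result` calls have the same shape (the dask variant is shown in comments and assumes a `client` connected to a cluster):

```python
import random
from concurrent.futures import ProcessPoolExecutor

def sample_pi(n, seed):
    """Count how many of n random points fall inside the unit quarter-circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

if __name__ == "__main__":
    n_tasks, n_samples = 8, 100_000
    # With dask-jobqueue, the same pattern would be:
    #   futures = [client.submit(sample_pi, n_samples, seed) for seed in range(n_tasks)]
    #   hits = sum(f.result() for f in futures)
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(sample_pi, n_samples, seed) for seed in range(n_tasks)]
        hits = sum(f.result() for f in futures)
    pi_estimate = 4 * hits / (n_tasks * n_samples)
    print(f"pi is roughly {pi_estimate:.3f}")
```

Swapping the executor for a dask `Client` backed by a SLURMCluster is exactly the kind of one-line change the example could highlight.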

@lesteve
Member Author

lesteve commented Feb 28, 2020

I had another look at this, tweaked it a bit, and it seems to be working better than my last attempt (not sure why ...). So maybe it is worth revisiting?

https://mybinder.org/v2/gh/lesteve/test-binder/master?filepath=simple.ipynb

For me the main point would be a quick intro to dask-jobqueue:

  • creating the cluster + client
  • cluster.scale
  • simple examples of Dask DataFrame, delayed, and futures
  • cluster.job_script
  • looking at the logs created by the workers
  • mentioning the dashboard
  • mentioning the different ways to tweak the submission script: queue, walltime, job_extra, env_extra, etc.
  • referring to the Dask documentation for more details on Dask, mentioning that SLURMCluster and LocalCluster are interchangeable
  • referring to the local cluster docs for more details
  • maybe more things that I have missed
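On the cluster.job_script point: that method just renders a batch script from the cluster's arguments, so the notebook could show what each keyword (queue, walltime, job_extra, env_extra) contributes to the script. The helper below is a hypothetical stdlib sketch of that rendering, not dask-jobqueue's actual template; only the `#SBATCH` directive syntax and the `dask-worker` CLI are real:

```python
def make_job_script(queue, walltime, memory, cores, job_extra=(), env_extra=()):
    """Hypothetical sketch of the batch script SLURMCluster.job_script() renders.

    The real template lives inside dask-jobqueue; the directive names here
    follow standard SLURM #SBATCH syntax.
    """
    lines = [
        "#!/usr/bin/env bash",
        f"#SBATCH -p {queue}",            # queue / partition
        f"#SBATCH -t {walltime}",         # walltime
        f"#SBATCH --mem={memory}",        # memory per job
        f"#SBATCH --cpus-per-task={cores}",
    ]
    lines += [f"#SBATCH {extra}" for extra in job_extra]  # e.g. "--qos=debug"
    lines += list(env_extra)                              # e.g. "module load python"
    lines.append(f"dask-worker tcp://scheduler:8786 --nthreads {cores}")
    return "\n".join(lines)

print(make_job_script("normal", "00:30:00", "2GB", 2, job_extra=["--qos=debug"]))
```

In the notebook, comparing this mental model against the real `cluster.job_script()` output would make the submission-script tweaking options concrete.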

Comments more than welcome!

@willirath
Collaborator

That's great news! I'll have a look.

@willirath
Collaborator

Looks a lot more stable. Scaling the cluster up and down doesn't seem to break the Slurm scheduler anymore.

@lesteve lesteve added the documentation Documentation-related label Mar 5, 2020