Documentation feedback #40

Closed
mrocklin opened this Issue Apr 20, 2018 · 10 comments

Comments

mrocklin commented Apr 20, 2018

Here are a few high level thoughts on the current documentation:

Looking at the main example on the main page I'm curious if it is realistic:

from dask_jobqueue import PBSCluster

cluster = PBSCluster(processes=6, threads=4, memory="16GB")
cluster.start_workers(10)

from dask.distributed import Client
client = Client(cluster)

Should we include project, queue, resource specs, and other keywords that might both be necessary for realistic use and also recognizable to users of that kind of system? Similarly I think it would be very useful to include a few real-world examples in the example deployments documentation. I suspect that this was the original intent of that page (nice idea!). Perhaps we can socialize this on the pangeo issue tracker and ask people to submit PRs for their clusters?

History

I recommend that we remove the history section from the main page

Description of how it works

My experience trying to explain these projects to users of HPC systems is that most of them are familiar with job scripts. I wonder if we might include a "How does this work?" section that shows the job script that we generate, and explain that we submit this several times.
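To make that "How does this work?" section concrete, the page could walk through a sketch like the following: one job script is rendered from the cluster's keyword arguments, and `start_workers(n)` amounts to submitting that same script n times. This is an illustrative mock-up, not the actual dask-jobqueue template; `render_job_script` and the scheduler address are hypothetical.

```python
# Hypothetical sketch of the job-script generation dask-jobqueue performs
# internally: one template, filled from the cluster kwargs, submitted once
# per requested worker job.
PBS_TEMPLATE = """#!/bin/bash
#PBS -q {queue}
#PBS -A {project}
#PBS -l {resource_spec}
#PBS -l walltime={walltime}
dask-worker {scheduler_address} --nthreads {threads} \\
    --nprocs {processes} --memory-limit {memory}
"""

def render_job_script(queue, project, resource_spec, walltime,
                      scheduler_address, threads, processes, memory):
    """Fill the template -- stands in for JobQueueCluster.job_script()."""
    return PBS_TEMPLATE.format(
        queue=queue, project=project, resource_spec=resource_spec,
        walltime=walltime, scheduler_address=scheduler_address,
        threads=threads, processes=processes, memory=memory)

script = render_job_script(queue='regular', project='DaskOnPBS',
                           resource_spec='select=1:ncpus=24:mem=100GB',
                           walltime='02:00:00',
                           scheduler_address='tcp://10.0.0.1:8786',
                           threads=4, processes=6, memory='16GB')
print(script)
# cluster.start_workers(10) then boils down to submitting this script
# 10 times, e.g. piping it to qsub once per job.
```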

Thoughts?

mrocklin commented Apr 20, 2018

Socialized the examples documentation on pangeo-data/pangeo#218

lesteve commented Apr 20, 2018

All your suggestions make sense, thanks for starting this!

guillaumeeb commented Apr 20, 2018

I agree with all the suggestions.

The main example on the main page is realistic to me (I probably updated it according to my own use 😃 ). My PBS cluster has a default routing queue, the project option is not mandatory, and I like the automatically computed resource_spec (and once again I'm biased 😁 ).
But I fully agree that this is not the case for everyone! There is a more detailed example that we could put here in the PBSCluster docstrings:

cluster = PBSCluster(queue='regular', project='DaskOnPBS',
                     local_directory=os.getenv('TMPDIR', '/tmp'),
                     threads=4, processes=6, memory='16GB',
                     resource_spec='select=1:ncpus=24:mem=100GB')

The interface keyword is important too and should appear.

jhamman commented Apr 20, 2018

Agree with all the suggestions. For the examples, I was indeed hoping to crowdsource this piece. Version 0 of the docs was just meant to lay the groundwork. If others can help fine-tune things, that would be quite welcome.

guillaumeeb commented Apr 20, 2018

One thing worth noting in a "how it works" section is that the Dask scheduler is started on the host where the JobQueueCluster is initialized, so in typical use either on the Jupyter notebook node or on a login/interactive node of the HPC cluster. This is not clear to all users, and it may deserve some more thought.

From my point of view, this is fine for prototyping or interactive analysis of data, but we should maybe propose a more appropriate way to submit batch processing. My current opinion is to submit the main Python script to the job queueing system with enough resources for the scheduler and main process, say 4 CPUs and 20GB RAM, and with a long enough walltime for the computation, say 24 hours. Workers would then be spawned on the job queueing system from that node by the JobQueueCluster, potentially with a shorter walltime and different resource needs.
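A hedged sketch of what such a wrapper job could look like, assuming PBS; the job name, resources, walltimes, and analysis.py are all illustrative, not a recommendation:

```shell
#!/bin/bash
# Hypothetical wrapper job for batch use: modest resources and a long
# walltime for the scheduler + main process. The JobQueueCluster created
# inside analysis.py then submits its own worker jobs, which can request
# a shorter walltime and different resources.
#PBS -N dask-main
#PBS -l select=1:ncpus=4:mem=20GB
#PBS -l walltime=24:00:00

# analysis.py builds PBSCluster(...) / Client(...) and runs the computation
python analysis.py
```

One would submit this once with qsub; only the worker jobs it spawns need to fit the shorter, faster-scheduling queues.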

jhamman commented Apr 23, 2018

I provided this on the dask documentation issue but I'll paste it here as well:

cluster = PBSCluster(processes=18, threads=4, memory="6GB",
                     project='P48500028', queue='premium',
                     resource_spec='select=1:ncpus=36:mem=109G',
                     walltime='02:00:00', interface='ib0')

This is what I've been using on NCAR's HPC system Cheyenne.

mrocklin commented Apr 24, 2018

jhamman commented May 3, 2018

Looking at the most recent build of the docs, we could probably clean up the docstrings a bit for the API docs:

[screenshot of the rendered job queue API docs]

I'd like to see us clean up a few things:

  • figure out how to drop the __init__ method from the abstract job queue class
  • fill in docstrings for the job_script, scale_down, scale_up, and stop_workers methods

Some of the docstrings can probably go in distributed.deploy.cluster (all except for job_script).
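As a starting point for that docstring work, here is a sketch of what a filled-in numpydoc-style docstring for job_script could look like; the signature and wording are illustrative, not taken from the dask-jobqueue source.

```python
# Hypothetical docstring sketch for the job_script method of the
# abstract job queue cluster class.
def job_script(self):
    """Construct one job submission script for this cluster.

    The scheduler-specific header template (e.g. ``#PBS`` directives
    for PBS) is filled with the resource keywords passed to the
    cluster constructor.

    Returns
    -------
    str
        The shell script piped to the submit command (``qsub`` for
        PBS), once per requested worker job.
    """
```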

jhamman commented May 3, 2018

Actually, everything could be done here and was fairly straightforward: f7b14fe

guillaumeeb commented Jul 31, 2018

Closing this, as it has been addressed in #50 by @mrocklin and by @jhamman above.
