
Slurm refactor and other fixes #25

Merged: 5 commits into dask:master on Apr 4, 2018

Conversation

@guillaumeeb (Member)

Closes #24, closes #22, closes #15, closes #21, closes #19.

There are a lot of modifications here, sorry... I was working on #22 when I found the #24 problem. I've also corrected some docstring errors.

Moreover, I've extracted the environment-variable handling from SLURMCluster and moved it directly into JobQueueCluster. I don't know whether we should keep it at all, but it may be useful.
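As a rough illustration of what moving that handling into the base class could look like (a sketch only: the class shape and the `env_extra` name here are assumptions for illustration, not the PR's actual code), extra shell lines can simply be prepended to the worker command when the job script is generated, so every scheduler backend inherits the behavior:

```python
# Hypothetical sketch, not dask_jobqueue's actual implementation: the base
# class owns the environment setup, so SLURM/PBS subclasses inherit it.
class JobQueueCluster:
    def __init__(self, env_extra=None):
        # Shell lines to run before starting the worker (e.g. exports).
        self.env_extra = env_extra or []

    def job_script(self, worker_command):
        # Prepend the environment setup lines to the worker command.
        return "\n".join(self.env_extra + [worker_command])

cluster = JobQueueCluster(env_extra=["export OMP_NUM_THREADS=1"])
print(cluster.job_script("dask-worker tcp://scheduler:8786"))
```

With this shape, a subclass only has to build its scheduler-specific header and call the shared `job_script` for the environment part.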

I've no way to test the SLURM part, so if somebody is willing to do it: @leej3 or @bw4sz?

@guillaumeeb (Member, Author)

Ping @jhamman for review too.

@jhamman (Member) commented Mar 27, 2018

This all looks fine to me. It would also be good if 1 or 2 folks could give it a spin on a slurm cluster.

@bw4sz commented Mar 27, 2018 via email

@leej3 (Contributor) commented Mar 27, 2018 via email

@leej3 (Contributor) commented Mar 28, 2018

I had some problems getting this working, though I think they were independent of these changes. Using dask-scheduler and dask-worker on the command line I saw intermittent failures. Once I installed distributed from the master branch on GitHub things became a little saner. Then I installed dask_jobqueue and ran a simple series of commands:

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(queue='quick', memory="12g", processes=1, threads=4)
cluster.start_workers(2)

from dask.distributed import Client
c = Client(cluster)

from dask import delayed
import time

def inc(x):
    time.sleep(5)
    return x + 1

def dec(x):
    time.sleep(3)
    return x - 1

def add(x, y):
    time.sleep(7)
    return x + y

x = delayed(inc)(1)
y = delayed(dec)(2)
total = delayed(add)(x, y)
fut = c.compute(total)
c.gather(fut)

Works fine. I'll report back if I have any problems with more realistic pipelines.

By the way, there are a few places where pbs needs to be replaced with slurm in slurm.py.

@jhamman (Member) commented Mar 29, 2018

Any final comments here? This looks good to me.

@leej3 (Contributor) commented Mar 30, 2018 via email

@leej3 (Contributor) commented Mar 30, 2018 via email

@guillaumeeb (Member, Author)

So I've followed leej3's suggestion and lowered the default thread and process counts.

That made me think that there was currently no way to specify your own CPU or memory limits to give to the Slurm SBATCH command. So I've added some parameters to match the PBS "resource_spec" keyword. Now we are able to specify independently the cpus/mem requested from Slurm and the processes/threads/memory options given to the workers.
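To illustrate the distinction (a sketch only; the parameter names below are assumptions for illustration, not necessarily the API this PR adds): the job-level resources tell SBATCH what to reserve from Slurm, while the worker-level options tell dask-worker what to actually use.

```python
# Hypothetical sketch of the two independent levels of resource
# specification; names like job_cpu/job_mem are illustrative only.
def slurm_job_script(queue, job_cpu, job_mem, processes, threads):
    """Build a job script: SBATCH reserves job_cpu/job_mem from Slurm,
    while the dask-worker command uses its own processes/threads."""
    header = [
        "#!/bin/bash",
        "#SBATCH -J dask-worker",
        "#SBATCH -p %s" % queue,
        "#SBATCH --cpus-per-task=%d" % job_cpu,
        "#SBATCH --mem=%s" % job_mem,
    ]
    worker = "dask-worker --nprocs %d --nthreads %d" % (processes, threads)
    return "\n".join(header + [worker])

print(slurm_job_script("quick", job_cpu=8, job_mem="24G",
                       processes=2, threads=4))
```

Here the job reserves 8 CPUs and 24G even though the worker is only configured for 2 processes of 4 threads, which is exactly the kind of decoupling the new parameters allow.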

@mrocklin (Member) commented Apr 4, 2018

It looks like a couple people are waiting for this to go in. Is there anything stopping progress here or is it just that no one feels comfortable merging?

@jhamman jhamman merged commit bdb2c26 into dask:master Apr 4, 2018
@jhamman (Member) commented Apr 4, 2018

@mrocklin - I had missed the notification that this was all done. At this point I'd prefer to encourage fairly rapid iterations as we work out the API so I'll try not to let things sit idle like this again.

Thanks @guillaumeeb.

@guillaumeeb guillaumeeb deleted the slurm_refactor branch August 27, 2018 11:12