Add LSF #4

Closed
mrocklin opened this issue Mar 1, 2018 · 14 comments
Labels
good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@mrocklin
Member

mrocklin commented Mar 1, 2018

It might be valuable to extend this repository with a solution for LSF. My hope is that this is relatively easy for someone with modest LSF experience. Looking at the current solutions for PBS or SLURM might be helpful (they're about 100 lines each, mostly docstrings).

@jakirkham
Member

Happy to give this a go. We use LSF at work. So this should be pretty useful.

@lzamparo

@jakirkham if it helps, I tried making a dask-distributed LSF script a while back. I encountered an error I couldn't fix, but maybe you can? Here's a gist.

lesteve added the help wanted and good first issue labels Jun 25, 2018
@lesteve
Member

lesteve commented Jun 25, 2018

Just to split the task into smaller chunks:

  • add an LSFCluster that inherits from dask_jobqueue.core.JobQueueCluster
  • define LSFCluster.submit_command (probably bsub?) and LSFCluster.cancel_command (probably bkill?)
  • implement LSFCluster._job_id_from_submit_output, which takes a string (the stdout output from the submit command) and turns it into a job identifier.

Once implemented, it would be great to test it on an LSF cluster that you have at your disposal.

Looking at dask_jobqueue/slurm.py, dask_jobqueue/pbs.py, dask_jobqueue/sge.py is a good way to get started too.
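The three steps above could be sketched roughly as follows. This is only an illustration, not the code from dask-jobqueue: `JobQueueClusterStub` is a hypothetical stand-in for `dask_jobqueue.core.JobQueueCluster` so the snippet runs without the library installed, and the regular expression assumes that `bsub` prints a line of the form `Job <12345> is submitted to queue <normal>.`, which should be checked against your own LSF installation.

```python
import re


class JobQueueClusterStub:
    """Hypothetical stand-in for dask_jobqueue.core.JobQueueCluster."""

    submit_command = None
    cancel_command = None

    def _job_id_from_submit_output(self, out):
        raise NotImplementedError


class LSFClusterSketch(JobQueueClusterStub):
    """Illustrative only; a real LSFCluster would inherit from JobQueueCluster."""

    submit_command = "bsub"   # LSF command used to submit a job
    cancel_command = "bkill"  # LSF command used to cancel a job

    # Assumption: bsub reports the job id between angle brackets after "Job".
    _job_id_regex = re.compile(r"Job <(\d+)>")

    def _job_id_from_submit_output(self, out):
        """Extract the job id (as a string) from the stdout of the submit command."""
        match = self._job_id_regex.search(out)
        if match is None:
            raise ValueError(f"Could not find a job id in: {out!r}")
        return match.group(1)


sketch = LSFClusterSketch()
job_id = sketch._job_id_from_submit_output(
    "Job <12345> is submitted to queue <normal>."
)
print(job_id)
```

Raising an error on unexpected output, rather than returning None, makes a changed or site-specific `bsub` message show up at submission time instead of later when cancelling jobs.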

@jakirkham
Member

Sorry have been busy with other things lately. If someone has interest and time, they should feel free to go ahead.

@lesteve
Member

lesteve commented Jun 26, 2018

If someone has interest and time, they should feel free to go ahead.

Yep, that was what I had in mind by specifying the steps in more detail.

@lzamparo

@lesteve I can give this a go; thanks for the direction. Is there a specific fixture-based test implemented for other cluster methods, or am I free to try some of my own tasks?

@lesteve
Member

lesteve commented Jun 26, 2018

It'd be great if you could give this a go! For the first iteration, I think you can put together and run a small snippet, e.g. something along these lines, and get it to work on your local LSF cluster:

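from dask.distributed import Client  # Client lives in dask.distributed, not dask_jobqueue
from dask_jobqueue import LSFCluster

cluster = LSFCluster(...)  # use some arguments that make sense on your local cluster
cluster.scale(1)  # request a worker so the tasks below have somewhere to run
client = Client(cluster)
result = client.map(lambda x: x + 1, range(10))
client.gather(result)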

For tests, you could probably take some inspiration from the existing tests in test_slurm.py, test_pbs.py, etc.
Personally I would be in favour of doing that in a separate PR.

raybellwaves mentioned this issue Jun 27, 2018
@lesteve
Member

lesteve commented Jun 27, 2018

@lzamparo just so you know and to avoid duplicating work, there is an ongoing PR at #78. It would be great if you can give it a go (nowish or once the PR is merged, your call really) on your local LSF cluster and tell us whether that works for you!

@raybellwaves
Member

raybellwaves commented Jun 27, 2018

Still working through my PR and trying to get it to work. Subsequent testing (once working) on other LSF clusters would be great.
Leaving this link here: https://slurm.schedmd.com/rosetta.pdf which is helping me adapt the Slurm and PBS code.

@lzamparo

lzamparo commented Jun 28, 2018

@lesteve Thanks for the heads up. Looks like @raybellwaves is making good headway, I'll step aside.

@lesteve
Member

lesteve commented Jun 29, 2018

@lzamparo I'll ping you when #78 is merged and it will be great if you could give it a go on your LSF cluster!

@raybellwaves
Member

raybellwaves commented Jul 13, 2018

While #78 is almost finished, I stumbled across https://github.com/IBMSpectrumComputing/lsf-python-api. It may be of interest for the future, e.g. for implementing something like bjobs to check whether the dask workers are running or pending. (#11)
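As a rough idea of what such a check could look like without the LSF Python API, here is a sketch that parses the text printed by `bjobs`. It is an assumption-laden illustration: `parse_bjobs` is a hypothetical helper, the sample text is made up, and the layout (job id in the first column, status such as RUN or PEND in the third) is the default `bjobs` table as I understand it, so verify it on your own scheduler.

```python
def parse_bjobs(output):
    """Map job id -> status, assuming the default `bjobs` table layout."""
    statuses = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, whose first field is "JOBID".
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        statuses[fields[0]] = fields[2]
    return statuses


# Made-up sample output; on a real cluster this text would come from running
# `bjobs`, e.g. via subprocess.run(["bjobs"], capture_output=True, text=True).
SAMPLE = """\
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME    SUBMIT_TIME
101     alice   RUN   normal     login1      node07      dask-worker Jul 13 10:02
102     alice   PEND  normal     login1                  dask-worker Jul 13 10:02
"""

statuses = parse_bjobs(SAMPLE)
running = [job for job, stat in statuses.items() if stat == "RUN"]
pending = [job for job, stat in statuses.items() if stat == "PEND"]
print("running:", running)
print("pending:", pending)
```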

@raybellwaves
Member

This has been merged now. Happy to know if it works on other LSF schedulers, not just UM Pegasus.
One thing which cropped up for me again was the psutil issue that I showed in my first comment in the PR. If this comes up for others we could add psutil to the dependencies, as installing psutil made it go away. But if it's just something unique to Pegasus I'll just have to remember to install it.

@lesteve
Member

lesteve commented Aug 3, 2018

I am going to close this one since the associated PR has been merged.

@raybellwaves note you can use "Fix #issueNumber" in your PR description, this way the associated issue gets closed automatically when the PR is merged. For more details, look at this.

About the psutil problem that you bumped into, I think it's quite hard to guess what the root cause is, but I would bet that it was a problem with your environment (somehow you ended up with a broken psutil) rather than with dask-jobqueue.

Note psutil is already in the dependencies because dask-jobqueue depends on distributed which depends on psutil.
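If it helps with debugging an environment like this, a small check along the following lines reports whether each package in the dependency chain imports cleanly and where it is loaded from. It is only a diagnostic sketch; `import_status` is a hypothetical helper name, not part of dask-jobqueue.

```python
import importlib


def import_status(module_name):
    """Return a one-line report on whether a module can be imported."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # a broken install may raise more than ImportError
        return f"{module_name}: import failed ({type(exc).__name__}: {exc})"
    version = getattr(module, "__version__", "unknown version")
    location = getattr(module, "__file__", "built-in")
    return f"{module_name}: {version} from {location}"


# dask-jobqueue depends on distributed, which depends on psutil.
for name in ("psutil", "distributed", "dask_jobqueue"):
    print(import_status(name))
```

Running this inside the same environment that the worker jobs use is the useful case, since a login node and a compute node can end up with different installs.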


5 participants