feat: Add RemoteSLURMCluster & RemoteSLURMJob #502 #504

Conversation

alexandervaneck (Author)

What does this PR change?

It adds support for RemoteSLURMCluster which spins up dask-workers through the SLURM REST API instead of through subprocess.Popen.

Why is this useful?

In environments where the Python process doesn't have direct access to the SLURM CLI (e.g. a Docker container), we still want to be able to use dask-jobqueue to interface with SLURM.
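A rough usage sketch of what this enables (assuming the class lives in the new dask_jobqueue/remote_slurm.py module added by this PR; the exact constructor arguments shown here are illustrative, not final):

# Hedged sketch only: `api_url` mirrors the attribute used in the diff below,
# the host/port are placeholders, and the remaining kwargs are the usual SLURMCluster ones.
from dask_jobqueue.remote_slurm import RemoteSLURMCluster

cluster = RemoteSLURMCluster(
    api_url="http://slurm-head:6820/",  # slurmrestd endpoint instead of the SLURM CLI
    queue="normal",
    cores=4,
    memory="8GB",
)
cluster.scale(jobs=2)  # jobs are submitted over HTTP rather than via subprocess.Popen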

Open questions

  1. I've been able to write a minimal end-to-end test. The implementation largely inherits from SLURMCluster, which is sufficiently tested. Where could we invest more in tests?
  2. I've not been able to test this against a working SLURM REST API cluster, since I am still waiting on my colleagues to activate it. I'm opening this PR early to receive reviews.

Thank you for allowing me to contribute 🙇‍♂️ @jacobtomlinson @guillaumeeb

Resolves #502

alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch 3 times, most recently from 6b16ebb to f3b3e0c on July 5, 2021 14:50
andersy005 (Member) left a comment

Thank you for putting this together, @alexandervaneck

Review thread on dask_jobqueue/remote_slurm.py (outdated, resolved)
alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch from f3b3e0c to 0bd738e on July 5, 2021 14:56
alexandervaneck marked this pull request as draft on July 5, 2021 15:04
alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch from 0bd738e to 3231f6a on July 5, 2021 15:08
alexandervaneck marked this pull request as ready for review on July 5, 2021 15:08
jacobtomlinson (Member) left a comment

Thanks so much for raising this. I took a quick pass over the code but don't have a SLURM setup to test against.

Review threads on dask_jobqueue/remote_slurm.py and dask_jobqueue/tests/test_remote_slurm.py (outdated, resolved)
alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch 3 times, most recently from cbdd016 to 2ac68dc on July 5, 2021 16:20
alexandervaneck (Author) commented

TODO: Add SLURM REST API to ci/slurm and remove mocking from test_end_to_end. We can use the slurm cluster instead.

(cc @jacobtomlinson I think this is a better idea)

alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch from 2ac68dc to 887b6f5 on July 5, 2021 16:30
jacobtomlinson (Member) commented

Given that you've already written the mocked test there's probably little harm in leaving it in, but adding a CI test would be good.

(Also don't worry about squashing and force pushing, we will squash on merge. It actually makes it harder to review because we can't easily see what has changed between reviews)

alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch from 887b6f5 to e82d457 on July 6, 2021 09:24
alexandervaneck (Author) commented Jul 6, 2021

> Given that you've already written the mocked test there's probably little harm in leaving it in, but adding a CI test would be good.
>
> (Also don't worry about squashing and force pushing, we will squash on merge. It actually makes it harder to review because we can't easily see what has changed between reviews)

I unfortunately have to remove this test because the aresponses dependency cannot be installed via conda. In any case, I'm currently working on making the tests work with the ci/slurm Docker environment. That should provide plenty of test coverage for us to accept this PR in good conscience.

(I'll stop force pushing :D)

andersy005 added the enhancement and SLURM labels on Jul 6, 2021
alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch from e82d457 to b6601f0 on July 7, 2021 08:38
alexandervaneck (Author) commented Jul 7, 2021

@lesteve Would you have any tips on how I can update the ci/slurm integration-test Docker setup and have it be used during the SLURM integration tests?

For this PR to pass we would need:

  1. Update the SLURM version to 20.11 (the version the REST API was released in).
  2. Add slurmrestd somehow so that we have an endpoint we can send requests to in the tests.

Update SLURM

I pushed an image to Docker Hub and added it to the FROM section of ci/slurm/Dockerfile.

slurmrestd

Perhaps you know, off the top of your head, where I should look for examples of how to configure this? Any direction you'd prefer would be appreciated.

I'll make time to look at this today 👍

Alternatively, I could update this PR with unit tests only (with mocking) and we could tackle the integration tests in another PR (since the ci.yml action would need to build the newer SLURM Docker image later).

alexandervaneck (Author) commented Jul 13, 2021

I've also managed to add the JWT HTTP configuration; however, the integration tests fail. Given that the posted payload is correct, this leads me to believe that slurmrestd doesn't actually have working JWT support yet.

The SLURM NEWS file (https://github.com/SchedMD/slurm/blob/master/NEWS) mentions a couple of bug fixes in versions later than 20-11-4-1 (the version currently installable on Ubuntu), so I'll have to investigate whether something is wrong with the token generation/authentication or whether this is a bug in SLURM.
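(For reference, a minimal sketch of how JWT auth against slurmrestd is typically sent from aiohttp; the endpoint, user name, and token below are placeholders and this is not the PR's actual code.)

import aiohttp

async def ping_slurmrestd(api_url: str, user: str, token: str) -> dict:
    # slurmrestd expects the calling user and their JWT (e.g. from `scontrol token`) as headers
    headers = {"X-SLURM-USER-NAME": user, "X-SLURM-USER-TOKEN": token}
    async with aiohttp.ClientSession(headers=headers, raise_for_status=True) as session:
        async with session.get(f"{api_url}slurm/v0.0.36/ping") as response:
            return await response.json()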

I figured out JWT authentication; now I still need to figure out why we're seeing:

Unclosed connector
connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x7f39d33c7048>, 1059921.899742882)]']
connector: <aiohttp.connector.UnixConnector object at 0x7f3a0578d7f0>
Task was destroyed but it is pending!
task: <Task pending coro=<AdaptiveCore.adapt() done, defined at /opt/anaconda/lib/python3.6/site-packages/distributed/deploy/adaptive_core.py:179> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f39d34bc5e8>()]> cb=[IOLoop.add_future.<locals>.<lambda>() at /opt/anaconda/lib/python3.6/site-packages/tornado/ioloop.py:688]>

🤔

EDIT:

Turns out

Task was destroyed but it is pending!
task: <Task pending coro=<AdaptiveCore.adapt() done, defined at /opt/anaconda/lib/python3.6/site-packages/distributed/deploy/adaptive_core.py:179> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fd92fdbf768>()]> cb=[IOLoop.add_future.<locals>.<lambda>() at /opt/anaconda/lib/python3.6/site-packages/tornado/ioloop.py:688]>

is always raised, even when using SLURMCluster directly. @jacobtomlinson @lesteve this doesn't seem like something I need to look into further, right?

memory="2GB",
job_mem="2GB",
loop=loop,
log_directory="/var/log/slurm",
alexandervaneck (Author) commented

@jacobtomlinson log_directory & local_directory are now not supposed to be null, because otherwise the worker will try to write to /dask-worker-space, which is obviously not writeable.
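For context, a small sketch of the kind of fixture this implies (values mirror the surrounding diff; the constructor arguments are illustrative, not the PR's exact test code):

# Hedged sketch: point both directories at writeable paths instead of relying on
# the default /dask-worker-space location.
from dask_jobqueue.remote_slurm import RemoteSLURMCluster  # import path as in this PR

cluster = RemoteSLURMCluster(
    cores=1,
    memory="2GB",
    job_mem="2GB",
    log_directory="/var/log/slurm",  # as in the test above
    local_directory="/tmp",          # any path the worker can actually write to
)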

Member

@alexandervaneck, is this true for other tests without remote APIs?

alexandervaneck (Author) commented

@jacobtomlinson @lesteve this is now ready for a proper review; all the features explained in the issue are in. Please let me know your thoughts 🙇 they help immensely.

raise_for_status=True, **self.api_client_session_kwargs
)
response = await client_session.delete(
f"{self.api_url}slurm/v0.0.36/job/{self.job_id}",
andersy005 (Member) commented Jul 13, 2021

Isn't it the case that someone could be running a version other than v0.0.36? So, should we hard-code this here?

alexandervaneck (Author) commented

Theoretically yes; if we want to open it up like that, then we should probably have users include slurm/v0.0.36 in the api_url itself.

In reality, slurmrestd's implementations are nearly identical across v0.0.35/36/37.

Member

Sounds good... Thank you for the clarification

Member

Is this really resolved? I'd say we need a way to configure the version in the URL for future compatibility.
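One possible shape for that, as an illustrative sketch only (not the PR's current code): make the REST API version a constructor argument, defaulting to the version currently hard-coded, and build request paths from it.

# Illustrative only: `api_version` is a hypothetical knob; the real RemoteSLURMJob
# in this PR may be structured differently.
class RemoteSLURMJob:
    def __init__(self, api_url: str, api_version: str = "v0.0.36"):
        self.api_url = api_url
        self.api_version = api_version
        # ... rest of the job setup ...

    def _job_url(self, job_id: str) -> str:
        # e.g. http://host:6820/slurm/v0.0.36/job/1234
        return f"{self.api_url}slurm/{self.api_version}/job/{job_id}"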

…mages.

This has the added advantage that we can change images in the same PR and have them be used in the tests.
@@ -1,6 +1,6 @@
#!/bin/bash

docker-compose up -d --no-build
andersy005 (Member) commented Jul 14, 2021

@alexandervaneck, this option was introduced as a way to avoid building images on each push event (it was making CI take too long to run). Since you want to be able to use the up-to-date Docker images in this PR, I recommend separating this PR into two PRs:

(1) for the existing RemoteSLURMCluster and
(2) for updating the docker images.

PR (2) would get merged before merging this PR. What do you think?

alexandervaneck (Author) commented

I understand where you are coming from; however, I think we can get away with it.

  1. In the scenario where there is no change in the Dockerfile:
  • We do docker-compose pull first, which pulls the latest image from Docker Hub.
  • docker-compose up --build recognises that there are no changes and uses the cache for all layers.
  2. In the scenario where there is a change in the Dockerfile:
  • We do docker-compose pull first, which pulls the latest image from Docker Hub.
  • docker-compose up --build recognises that there are some changes and uses the cache for all layers that are unaffected.

Drawbacks:

  1. It will try to build every service in docker-compose, but it will use the cache for all but the first one. E.g. https://github.com/dask/dask-jobqueue/pull/504/checks?check_run_id=3069548195#step:6:4178

Is this not what is expected?

> separating this PR into two PRs

The "problem" here would be that the (2) PR does not test the Dockerfile before it is merged because the CI uses the old 'latest' one. and thus it is very likely that I'll push a dockerfile that won't work.

E.g. see the failed tests in this PR, where for some reason the cluster did not come up. Locally this worked.

Member

@andersy005 I've not yet fully understood the new CI process; do you agree with @alexandervaneck's answer?

alexandervaneck (Author) commented

I hit a bit of a snag working with aiohttp. Because SLURM responses are sometimes (most of the time?) not parsable, the http-parser library fails when reading the response. Unfortunately I have not been able to deduce why these responses cannot be read.

In any case, setting AIOHTTP_NO_EXTENSIONS as suggested in aio-libs/aiohttp#1843 fixes this. However, I have no idea how we should control for this in _submit_job. 🤔
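(A minimal sketch of that workaround, assuming the flag only needs to be visible before aiohttp is first imported: aiohttp reads the environment variable at import time to decide whether to load its C extensions.)

import os

# Must be set before aiohttp is imported anywhere in the process,
# so the pure-Python HTTP parser is used instead of the C extension.
os.environ.setdefault("AIOHTTP_NO_EXTENSIONS", "1")

import aiohttp  # noqa: E402  (deliberately imported after setting the flag)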

alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch from f1b37d6 to 7e1d70b on July 30, 2021 11:25
alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch from 00026cc to c92bb80 on August 5, 2021 15:17
alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch from 5783a8a to c90ee0b on August 5, 2021 15:59
guillaumeeb (Member) commented

@alexandervaneck I see you pushed some commits about a month ago. Do you still feel this PR is ready for review?

sanjeev-one commented

+1

jacobtomlinson (Member) commented

This PR hasn't had any activity for a long time, so I'm going to close it out.

Labels
enhancement, SLURM
Development

Successfully merging this pull request may close these issues.

Add RemoteSlurmJob to connect SLURMCluster to a remote Slurm cluster
7 participants