feat: Add RemoteSLURMCluster & RemoteSLURMJob #502 #504

Conversation

alexandervaneck (Author)

What does this PR change?

It adds support for RemoteSLURMCluster which spins up dask-workers through the SLURM REST API instead of through subprocess.Popen.

Why is this useful?

In environments where the Python process doesn't have direct access to the SLURM CLI (e.g. a Docker container), we still want to be able to use dask-jobqueue to interface with SLURM.
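A rough usage sketch of what this enables (assuming the class lives in the new dask_jobqueue/remote_slurm.py module added by this PR; the exact constructor arguments shown here are illustrative, not final):

# Hedged sketch only: `api_url` mirrors the attribute used in the diff below,
# the host/port are placeholders, and the remaining kwargs are the usual SLURMCluster ones.
from dask_jobqueue.remote_slurm import RemoteSLURMCluster

cluster = RemoteSLURMCluster(
    api_url="http://slurm-head:6820/",  # slurmrestd endpoint instead of the SLURM CLI
    queue="normal",
    cores=4,
    memory="8GB",
)
cluster.scale(jobs=2)  # jobs are submitted over HTTP rather than via subprocess.Popen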

Open questions

  1. I've been able to write a minimal end-to-end test. The implementation largely inherits from SLURMCluster, which is sufficiently tested. Where could we invest more in tests?
  2. I've not been able to test this against a working SLURM REST API cluster, since I am still waiting on my colleagues to activate it. I'm opening this PR early to receive reviews.

Thank you for allowing me to contribute 🙇‍♂️ @jacobtomlinson @guillaumeeb

Resolves #502

alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch 3 times, most recently from 6b16ebb to f3b3e0c on July 5, 2021 14:50
andersy005 (Member) left a comment

Thank you for putting this together, @alexandervaneck

Review thread on dask_jobqueue/remote_slurm.py (outdated, resolved)
alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch from f3b3e0c to 0bd738e on July 5, 2021 14:56
alexandervaneck marked this pull request as draft on July 5, 2021 15:04
alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch from 0bd738e to 3231f6a on July 5, 2021 15:08
alexandervaneck marked this pull request as ready for review on July 5, 2021 15:08
jacobtomlinson (Member) left a comment

Thanks so much for raising this. I took a quick pass over the code but don't have a SLURM setup to test against.

Review threads on dask_jobqueue/remote_slurm.py and dask_jobqueue/tests/test_remote_slurm.py (outdated, resolved)
alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch 3 times, most recently from cbdd016 to 2ac68dc on July 5, 2021 16:20
alexandervaneck (Author) commented

TODO: Add SLURM REST API to ci/slurm and remove mocking from test_end_to_end. We can use the slurm cluster instead.

(cc @jacobtomlinson I think this is a better idea)

alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch from 2ac68dc to 887b6f5 on July 5, 2021 16:30
jacobtomlinson (Member) commented

Given that you've already written the mocked test there's probably little harm in leaving it in, but adding a CI test would be good.

(Also don't worry about squashing and force pushing, we will squash on merge. It actually makes it harder to review because we can't easily see what has changed between reviews)

alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch from 887b6f5 to e82d457 on July 6, 2021 09:24
alexandervaneck (Author) commented Jul 6, 2021

> Given that you've already written the mocked test there's probably little harm in leaving it in, but adding a CI test would be good.
>
> (Also don't worry about squashing and force pushing, we will squash on merge. It actually makes it harder to review because we can't easily see what has changed between reviews)

I unfortunately have to remove this test because the aresponses dependency cannot be installed via conda. In any case, I'm currently working on making the tests work with the ci/slurm Docker environment. That should provide plenty of test coverage for us to accept this PR in good conscience.

(I'll stop force pushing :D)

andersy005 added the enhancement and SLURM labels on Jul 6, 2021
alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch from e82d457 to b6601f0 on July 7, 2021 08:38
alexandervaneck (Author) commented Jul 7, 2021

@lesteve Would you have any tips on how I can update the ci/slurm integration-test Docker setup and have it be used during the SLURM integration tests?

For this PR to pass we would need:

  1. Update the SLURM version to 20.11 (the version the REST API was released in).
  2. Add slurmrestd somehow so that we have an endpoint we can send requests to in the tests.

Update SLURM

I pushed an image to Docker Hub and added it to the FROM section of ci/slurm/Dockerfile.

slurmrestd

Perhaps you know, off the top of your head, where I should look for examples of how to configure this? Any direction you'd prefer would be appreciated.

I'll make time to look at this today 👍

Alternatively, I could update this PR with unit tests only (with mocking) and we could tackle the integration tests in another PR (since the ci.yml action would need to build the newer SLURM Docker image later).

alexandervaneck (Author) commented Jul 13, 2021

I've also managed to add the JWT HTTP configuration; however, the integration tests fail. Given that the posted payload is correct, this leads me to believe that slurmrestd doesn't actually have working JWT support yet.

The SLURM NEWS file (https://github.com/SchedMD/slurm/blob/master/NEWS) mentions a couple of bug fixes in versions later than 20-11-4-1 (the version currently installable on Ubuntu), so I'll have to investigate whether something is wrong with the token generation/authentication or whether this is a bug in SLURM.
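(For reference, a minimal sketch of how JWT auth against slurmrestd is typically sent from aiohttp; the endpoint, user name, and token below are placeholders and this is not the PR's actual code.)

import aiohttp

async def ping_slurmrestd(api_url: str, user: str, token: str) -> dict:
    # slurmrestd expects the calling user and their JWT (e.g. from `scontrol token`) as headers
    headers = {"X-SLURM-USER-NAME": user, "X-SLURM-USER-TOKEN": token}
    async with aiohttp.ClientSession(headers=headers, raise_for_status=True) as session:
        async with session.get(f"{api_url}slurm/v0.0.36/ping") as response:
            return await response.json()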

I figured out JWT authentication; now I still need to figure out why we're seeing:

Unclosed connector
connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x7f39d33c7048>, 1059921.899742882)]']
connector: <aiohttp.connector.UnixConnector object at 0x7f3a0578d7f0>
Task was destroyed but it is pending!
task: <Task pending coro=<AdaptiveCore.adapt() done, defined at /opt/anaconda/lib/python3.6/site-packages/distributed/deploy/adaptive_core.py:179> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f39d34bc5e8>()]> cb=[IOLoop.add_future.<locals>.<lambda>() at /opt/anaconda/lib/python3.6/site-packages/tornado/ioloop.py:688]>

🤔

EDIT:

Turns out

Task was destroyed but it is pending!
task: <Task pending coro=<AdaptiveCore.adapt() done, defined at /opt/anaconda/lib/python3.6/site-packages/distributed/deploy/adaptive_core.py:179> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fd92fdbf768>()]> cb=[IOLoop.add_future.<locals>.<lambda>() at /opt/anaconda/lib/python3.6/site-packages/tornado/ioloop.py:688]>

is always raised, even when using SLURMCluster directly. @jacobtomlinson @lesteve this doesn't seem like something I need to look into further, right?

memory="2GB",
job_mem="2GB",
loop=loop,
log_directory="/var/log/slurm",
alexandervaneck (Author) commented

@jacobtomlinson log_directory & local_directory are now not supposed to be null, because otherwise the worker will try to write to /dask-worker-space, which is obviously not writeable.
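For context, a small sketch of the kind of fixture this implies (values mirror the surrounding diff; the constructor arguments are illustrative, not the PR's exact test code):

# Hedged sketch: point both directories at writeable paths instead of relying on
# the default /dask-worker-space location.
from dask_jobqueue.remote_slurm import RemoteSLURMCluster  # import path as in this PR

cluster = RemoteSLURMCluster(
    cores=1,
    memory="2GB",
    job_mem="2GB",
    log_directory="/var/log/slurm",  # as in the test above
    local_directory="/tmp",          # any path the worker can actually write to
)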

Member

@alexandervaneck, is this true for other tests without remote APIs?

alexandervaneck (Author) commented

@jacobtomlinson @lesteve this is now ready for a proper review; all the features explained in the issue are in. Please let me know your thoughts 🙇 they help immensely.

raise_for_status=True, **self.api_client_session_kwargs
)
response = await client_session.delete(
f"{self.api_url}slurm/v0.0.36/job/{self.job_id}",
andersy005 (Member) commented Jul 13, 2021

Isn't it the case that someone could be running a version other than v0.0.36? So, should we hard-code this here?

alexandervaneck (Author) commented

Theoretically yes; if we want to open it up like that, then we should probably have users include slurm/v0.0.36 in the api_url itself.

In reality, slurmrestd's implementations are nearly identical across v0.0.35/36/37.

Member

Sounds good... Thank you for the clarification

Member

Is this really resolved? I'd say we need a way to configure the version in the URL for future compatibility.
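One possible shape for that, as an illustrative sketch only (not the PR's current code): make the REST API version a constructor argument, defaulting to the version currently hard-coded, and build request paths from it.

# Illustrative only: `api_version` is a hypothetical knob; the real RemoteSLURMJob
# in this PR may be structured differently.
class RemoteSLURMJob:
    def __init__(self, api_url: str, api_version: str = "v0.0.36"):
        self.api_url = api_url
        self.api_version = api_version
        # ... rest of the job setup ...

    def _job_url(self, job_id: str) -> str:
        # e.g. http://host:6820/slurm/v0.0.36/job/1234
        return f"{self.api_url}slurm/{self.api_version}/job/{job_id}"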

…mages.

This has the added advantage that we can change images in the same PR and have them be used in the tests.
@@ -1,6 +1,6 @@
#!/bin/bash

docker-compose up -d --no-build
andersy005 (Member) commented Jul 14, 2021

@alexandervaneck, this option was introduced as a way to avoid building images on each push event (it was making CI take too long to run). Since you want to be able to use the up-to-date Docker images in this PR, I recommend separating this PR into two PRs:

(1) for the existing RemoteSLURMCluster and
(2) for updating the docker images.

PR (2) would get merged before merging this PR. What do you think?

alexandervaneck (Author) commented

I understand where you are coming from; however, I think we can get away with it.

  1. In the scenario where there is no change in the Dockerfile:
  • We do docker-compose pull first, which pulls the latest image from Docker Hub.
  • docker-compose up --build recognises that there are no changes and uses the cache for all layers.
  2. In the scenario where there is a change in the Dockerfile:
  • We do docker-compose pull first, which pulls the latest image from Docker Hub.
  • docker-compose up --build recognises that there are some changes and uses the cache for all layers that are unaffected.

Drawbacks:

  1. It will try to build every service in docker-compose, but it will use the cache for all but the first one. E.g. https://github.com/dask/dask-jobqueue/pull/504/checks?check_run_id=3069548195#step:6:4178

Is this not what is expected?

> separating this PR into two PRs

The "problem" here would be that the (2) PR does not test the Dockerfile before it is merged because the CI uses the old 'latest' one. and thus it is very likely that I'll push a dockerfile that won't work.

E.g. see the failed tests in this PR, where for some reason the cluster did not come up. Locally this worked.

Member

@andersy005 I've not yet fully understood the new CI process; do you agree with @alexandervaneck's answer?

alexandervaneck (Author) commented

I hit a bit of a snag working with aiohttp. Because SLURM responses are sometimes (most of the time?) not parsable, the http-parser library fails when reading the response. Unfortunately I have not been able to deduce why these responses cannot be read.

In any case, setting AIOHTTP_NO_EXTENSIONS as suggested in aio-libs/aiohttp#1843 fixes this. However, I have no idea how we should control for this in _submit_job. 🤔
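(A minimal sketch of that workaround, assuming the flag only needs to be visible before aiohttp is first imported: aiohttp reads the environment variable at import time to decide whether to load its C extensions.)

import os

# Must be set before aiohttp is imported anywhere in the process,
# so the pure-Python HTTP parser is used instead of the C extension.
os.environ.setdefault("AIOHTTP_NO_EXTENSIONS", "1")

import aiohttp  # noqa: E402  (deliberately imported after setting the flag)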

alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch from f1b37d6 to 7e1d70b on July 30, 2021 11:25
alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch from 00026cc to c92bb80 on August 5, 2021 15:17
alexandervaneck force-pushed the feat/502-add-remote-slurm-cluster branch from 5783a8a to c90ee0b on August 5, 2021 15:59
guillaumeeb (Member) commented

@alexandervaneck I see you pushed some commits about a month ago. Do you still feel this PR is ready for review?

sanjeev-one commented

+1

jacobtomlinson (Member) commented

This PR hasn't had any activity for a long time, so I'm going to close it out.

Labels
enhancement, SLURM
Development

Successfully merging this pull request may close these issues.

Add RemoteSlurmJob to connect SLURMCluster to a remote Slurm cluster
7 participants