feat: Add RemoteSLURMCluster & RemoteSLURMJob #502 #504
Conversation
(force-pushed from 6b16ebb to f3b3e0c)
Thank you for putting this together, @alexandervaneck
(force-pushed from f3b3e0c to 0bd738e)
(force-pushed from 0bd738e to 3231f6a)
Thanks so much for raising this. I had a quick pass of the code but do not have a SLURM setup to test against.
(force-pushed from cbdd016 to 2ac68dc)
TODO: Add SLURM REST API to ci/slurm and remove mocking from test_end_to_end. We can use the SLURM cluster instead. (cc @jacobtomlinson, I think this is a better idea)
(force-pushed from 2ac68dc to 887b6f5)
This enables end-to-end testing with the new SLURM REST API.
Given that you've already written the mocked test, there's probably little harm in leaving it in, but adding a CI test would be good. (Also, don't worry about squashing and force pushing, we will squash on merge. It actually makes it harder to review because we can't easily see what has changed between reviews.)
(force-pushed from 887b6f5 to e82d457)
I unfortunately have to remove this test because the … (I'll stop force pushing :D)
(force-pushed from e82d457 to b6601f0)
@lesteve Would you have any tips on how I can update the ci/slurm integration test docker setup and have it be used during integration testing of SLURM? For this PR to pass we would need:
I pushed an image to Docker Hub and added it to the FROM section in ci/slurm/Dockerfile.
Perhaps you know, off the top of your head, where I should look for configuration examples to configure this? Any direction you'd prefer would be appreciated. I'll make time to look at this today 👍 Alternatively I could update this PR with unit tests only (with mocking) and we tackle the integration tests in another PR (due to the ci.yml action needing to build the newer SLURM docker image later).
I figured out JWT authentication; now I still need to figure out why we're seeing this 🤔
EDIT: Turns out this is always raised, even when using SLURMCluster directly. @jacobtomlinson @lesteve this doesn't seem like something I have to look into further, right?
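For reference, a minimal sketch of what a JWT-authenticated slurmrestd call can look like (the header names and ping endpoint come from the SLURM REST API documentation; the URL and token below are placeholders, not values from this PR):

```python
import asyncio
import aiohttp

async def ping(api_url: str, user: str, token: str) -> dict:
    # slurmrestd expects the user name and JWT in these headers; a token can be
    # generated on the cluster with `scontrol token`.
    headers = {"X-SLURM-USER-NAME": user, "X-SLURM-USER-TOKEN": token}
    async with aiohttp.ClientSession(headers=headers, raise_for_status=True) as session:
        async with session.get(f"{api_url}slurm/v0.0.36/ping") as response:
            return await response.json()

# Placeholder values for illustration only.
print(asyncio.run(ping("http://slurmrestd:6820/", "slurm", "<JWT from scontrol token>")))
```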
memory="2GB",
job_mem="2GB",
loop=loop,
log_directory="/var/log/slurm",
@jacobtomlinson this log_directory & local_directory are now not supposed to be null, because otherwise it'll try to write to /dask-worker-space, which is obviously not writeable.
@alexandervaneck, is this true for other tests without remote APIs?
@jacobtomlinson @lesteve this is now ready for a proper review; all the features explained in the issue are in. Please let me know your thoughts 🙇 they help immensely.
dask_jobqueue/remote_slurm.py
raise_for_status=True, **self.api_client_session_kwargs
)
response = await client_session.delete(
    f"{self.api_url}slurm/v0.0.36/job/{self.job_id}",
Isn't it the case that someone could be running a version other than v0.0.36? So, should we hard-code this here?
Theoretically yes; if we want to open it up like that, then we should probably have users include slurm/v0.0.36 in the api_url itself. In reality, slurmrestd's implementations are nearly identical between v0.0.35/36/37.
Sounds good... Thank you for the clarification
Is this really resolved? I'd say we need a way to configure the version in the URL for future compatibility.
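For illustration, a hypothetical sketch of how the version could be made configurable rather than hard-coded; the api_version parameter and _endpoint helper are assumptions for this sketch, not something the PR currently implements:

```python
# Hypothetical sketch: accept the REST API version as a parameter so request
# paths are not tied to v0.0.36. `api_version` and `_endpoint` are illustrative
# names, not part of this PR.
class RemoteSLURMJob:
    def __init__(self, api_url: str, api_version: str = "v0.0.36"):
        self.api_url = api_url
        self.api_version = api_version

    def _endpoint(self, path: str) -> str:
        # e.g. _endpoint("job/1234") -> "http://slurmrestd:6820/slurm/v0.0.36/job/1234"
        return f"{self.api_url}slurm/{self.api_version}/{path}"
```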
…mages. This has the added advantage that we can change images in the same PR and have them be used in the tests.
@@ -1,6 +1,6 @@
#!/bin/bash

docker-compose up -d --no-build
@alexandervaneck, this option was introduced as a way to avoid building images for each push event (it was making the CI take too long to run). Since you want to be able to use the up-to-date docker images in this PR, I recommend separating this PR into two PRs: (1) for the existing RemoteSLURMCluster and (2) for updating the docker images. PR (2) would get merged before merging this PR. What do you think?
I understand where you are coming from, however I think we can get away with it.
- In the scenario where there is no change in the Dockerfile:
  - We do docker-compose pull first, which pulls the latest image from Docker Hub.
  - docker-compose up --build recognises that there are no changes and uses the cache for all layers.
- In the scenario where there is a change in the Dockerfile:
  - We do docker-compose pull first, which pulls the latest image from Docker Hub.
  - docker-compose up --build recognises that there are changes and uses the cache for all layers that are unaffected.

Drawbacks:
- It will try to build every service in docker-compose, but it will use the cache for all but the first one, e.g. https://github.com/dask/dask-jobqueue/pull/504/checks?check_run_id=3069548195#step:6:4178

Is this not what is expected?
> separating this PR into two PRs

The "problem" here would be that PR (2) does not test the Dockerfile before it is merged, because the CI uses the old 'latest' image, and thus it is very likely that I'll push a Dockerfile that won't work. For example, see the failed tests in this PR, which for some reason made the cluster not come up; locally this worked.
@andersy005 I've not yet fully understood the new CI process; do you agree with @alexandervaneck's answer?
I found a bit of a snag working with aiohttp. Due to SLURM responses sometimes (most of the time?) not being parsable, the http-parser library fails on reading the response. Unfortunately I have not been able to deduce why these responses cannot be read. In any case, setting AIOHTTP_NO_EXTENSIONS as suggested by aio-libs/aiohttp#1843 fixes this. I have however no idea how we should control for this in _submit_job. 🤔
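For context, AIOHTTP_NO_EXTENSIONS is read when aiohttp is imported, so it has to be set beforehand; a minimal sketch:

```python
# AIOHTTP_NO_EXTENSIONS must be set before aiohttp is imported, so that the
# pure-Python HTTP parser is used instead of the C extension that fails on the
# malformed slurmrestd responses (see aio-libs/aiohttp#1843).
import os
os.environ["AIOHTTP_NO_EXTENSIONS"] = "1"

import aiohttp  # noqa: E402  (imported after the environment variable is set)
```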
(force-pushed from f1b37d6 to 7e1d70b)
(force-pushed from 00026cc to c92bb80)
… body to log" This reverts commit c92bb80.
(force-pushed from 5783a8a to c90ee0b)
@alexandervaneck I see you pushed some commits about a month ago. Do you still feel this PR is ready for review?
+1
This PR hasn't had any activity for a long time, so I'm going to close it out.
What does this PR change?
It adds support for RemoteSLURMCluster, which spins up dask workers through the SLURM REST API instead of through subprocess.Popen.
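A rough usage sketch, based on the parameters visible in this PR (api_url and api_client_session_kwargs appear in the diff); the import path, endpoint, and token are placeholders and may differ from the final code:

```python
from dask_jobqueue import RemoteSLURMCluster  # import path assumed for this sketch

# Placeholder endpoint and token for illustration; api_client_session_kwargs is
# forwarded to aiohttp.ClientSession, as in the diff above.
cluster = RemoteSLURMCluster(
    api_url="http://slurmrestd:6820/",
    api_client_session_kwargs={
        "headers": {"X-SLURM-USER-NAME": "slurm", "X-SLURM-USER-TOKEN": "<JWT>"},
    },
    cores=1,
    memory="2GB",
    log_directory="/var/log/slurm",  # must not be null, see the review discussion above
    local_directory="/tmp",
)
cluster.scale(jobs=2)  # jobs are submitted through the REST API instead of sbatch
```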
Why is this useful?
In environments where the Python process doesn't have direct access to the SLURM CLI (e.g. a Docker container), we still want to be able to use dask-jobqueue to interface with SLURM.
Open questions
Thank you for allowing me to contribute 🙇♂️ @jacobtomlinson @guillaumeeb
Resolve #502