Add support for dask-ctl #544
Apologies in advance for naive questions! On every HPC system I've used, the head node and compute nodes have access to at least one shared filesystem, so my brain naively jumped to storing the state of each cluster in text files there. Can you see obvious reasons this wouldn't work? One potential thing we would need to deal with is that cluster admins sometimes kill long-running jobs on head nodes (c.f. #471).
Yeah! Please ask as many questions as you like, it helps identify things that are undocumented 😂. I'm happy to answer whatever questions you have. I think we would need to think about moving the scheduler off the head node and into a job in the cluster, but I might be wrong here. A shared filesystem feels like a reasonably safe assumption; it's just the inconsistency of the implementation that would worry me. Let's use SLURM as an example. Cluster discovery and reconstruction would look like this:
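To make the discovery step concrete, here is a minimal sketch of what listing scheduler jobs on SLURM could look like, assuming a hypothetical convention where scheduler jobs are submitted with a name of the form `dask-ctl-<cluster id>` (neither the naming convention nor this helper exist today):

```python
# Hypothetical discovery sketch: list this user's running SLURM jobs and
# pick out schedulers by a "dask-ctl-<id>" job-name convention.
import getpass
import subprocess

def discover_slurm_schedulers():
    """Yield (cluster_id, job_id) for running scheduler jobs."""
    # %i = job id, %j = job name (see the squeue man page for format codes)
    out = subprocess.check_output(
        ["squeue", "--user", getpass.getuser(), "--noheader", "--format=%i %j"],
        text=True,
    )
    for line in out.splitlines():
        job_id, _, job_name = line.strip().partition(" ")
        if job_name.startswith("dask-ctl-"):
            yield job_name[len("dask-ctl-"):], job_id
```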
Gotcha, appreciate the explanation! My confusion was about whether
This one could maybe be automatically encoded in the job name?
If we find a way to hook into the filesystem, something as simple as text files per scheduler which are created/destroyed on job start/end could work? Not sure if machinery for this already exists, but it could be added (see the sketch below).
This is tougher for sure, I'll have a think 🙂 Thanks for step-by-stepping it, super useful!
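For the text-file idea, here is a sketch of what the scheduler side could do, assuming a hypothetical state directory on the shared filesystem (note that `atexit` will not fire if the job is killed, so stale files would still need reaping, which ties back to the admins-killing-jobs concern above):

```python
# Hypothetical state tracking on a shared filesystem: the scheduler job
# writes a file on start and removes it on clean exit.
import atexit
import json
from pathlib import Path

# Assumed location, visible from both head and compute nodes.
STATE_DIR = Path.home() / ".config" / "dask-jobqueue" / "clusters"

def register_cluster(cluster_id: str, scheduler_address: str) -> None:
    """Record this cluster's existence for later discovery."""
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    state_file = STATE_DIR / f"{cluster_id}.json"
    state_file.write_text(json.dumps({"address": scheduler_address}))
    # Best-effort cleanup; a killed job leaves the file behind.
    atexit.register(state_file.unlink, missing_ok=True)
```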
We would have to check if all HPC schedulers support this, but when we list the running jobs it is likely we would have the command that was invoked to start the scheduler. So if things like the ID and config path were passed to the scheduler as arguments, we could parse those out again. Again, that might be a big assumption though.
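For illustration, parsing identifiers back out of a stored command line could look like this. The `--dask-ctl-id`/`--dask-ctl-config` flags are hypothetical, and it assumes the queueing system can report the invoked command (e.g. `squeue --format=%o` on SLURM):

```python
# Hypothetical: recover metadata that was passed to the scheduler as CLI flags.
import shlex

def parse_scheduler_args(command: str) -> dict:
    """Extract values of hypothetical --dask-ctl-* flags from a command string."""
    tokens = shlex.split(command)
    found = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag.startswith("--dask-ctl-"):
            found[flag[len("--dask-ctl-"):]] = value
    return found

# parse_scheduler_args("dask-scheduler --dask-ctl-id abc123 --dask-ctl-config ~/c.yaml")
# -> {"id": "abc123", "config": "~/c.yaml"}
```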
My knowledge of jobqueue systems is pretty limited, but I wonder if it would be possible to use environment variables for this? This is a standard OS feature, so it is very likely that every single cluster implementation supports it (though the mechanism for exporting the variables to the job might vary). It should also be possible to retrieve the values of the exported variables from the jobs. The only issue I can imagine would be size limits on environment variables, but it should be possible to work around those (and an ID and config path do not sound like a lot of characters).
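As a sketch of this route on SLURM (the variable names are hypothetical), the metadata could be injected at submission time with `sbatch --export` and read back from `os.environ` inside the job:

```python
# Hypothetical: carry cluster metadata in the scheduler job's environment.
import subprocess

def submit_scheduler(cluster_id: str, config_path: str, script: str) -> None:
    """Submit a scheduler job with identifying variables in its environment."""
    subprocess.run(
        [
            "sbatch",
            f"--export=ALL,DASK_CTL_ID={cluster_id},DASK_CTL_CONFIG={config_path}",
            script,
        ],
        check=True,
    )

# Inside the running job:
#   import os
#   cluster_id = os.environ["DASK_CTL_ID"]
```

Retrieving those values from outside the job is the scheduler-specific part; I don't know of a portable way to inspect another job's environment, so that would need checking per batch system.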
As mentioned in #543 it would be really nice for `dask-jobqueue` to support `dask-ctl` for convenient cluster management. However, from what I understand about HPC scheduling systems this may not be a trivial task.

Dask Control aims to allow users to create/list/scale/delete Dask clusters via the CLI and a Python API. Support for `dask-ctl` is implemented on a per-cluster-manager basis.

The main challenge here is moving the state out of the Cluster object into a place where it can be retrieved later. On platforms like Kubernetes or the cloud, much of the state can be serialised into tags/labels on the various resources, but I'm not sure how many HPC systems support this kind of metadata storage.

The other challenge is how to discover clusters. On Kubernetes, for example, we set a tag on all resources that marks them as being created by `dask-ctl` and stores an ID that can be used to retrieve the metadata. Again, I'm not sure how flexible HPC schedulers are at tagging/labelling jobs with arbitrary metadata.

The last thing that may be a blocker is that the Dask cluster must always run the scheduler remotely; it cannot be within the local (or login node) Python process. I'm not sure how that affects things here.
I'm keen to see this happen, and if folks have thoughts on how this could be implemented I'd love to hear them.
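For context, my rough understanding is that a per-cluster-manager integration boils down to two pieces: a discovery hook that enumerates running clusters, and a way to reconstruct a cluster manager by name. This is only a sketch of that shape; the `discover_slurm_schedulers` helper is the hypothetical one sketched in the comments above, and the exact hooks `dask-ctl` expects are described in its docs:

```python
# Rough shape of a dask-ctl integration for dask-jobqueue (sketch only).
from typing import AsyncIterator, Callable, Tuple

from dask_jobqueue import SLURMCluster

async def discover() -> AsyncIterator[Tuple[str, Callable]]:
    """Yield (cluster name, cluster manager class) pairs for running clusters."""
    # Relies on some job-level convention (job names, state files or
    # environment variables, as discussed above) to enumerate schedulers.
    for cluster_id, _job_id in discover_slurm_schedulers():  # hypothetical helper above
        yield cluster_id, SLURMCluster
```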