Add support for PBS clusters #38
This works, but still needs tests and periodic status checking enabled.
@jcrist do you have any thoughts on how you might leverage the dask-jobqueue infrastructure or community? It would be unfortunate to cause a significant technology split here. Maybe there is some component in specifying jobs that could be shared between the two projects?
I don't see a way to use the same codebase - there's too much work that needs to be done to adequately support multiple users in an efficient and secure manner. FWIW, this didn't take terribly long to write up, and I suspect much of this could be pulled into a base class to support other schedulers. Leveraging the community would be useful though; I'm not familiar with these systems and may not be doing things 100% correctly. Code review would be quite welcome.
I'm not suggesting that you use the dask-jobqueue codebase as is; I'm suggesting that there might be some component that could be designed so that both systems could use it. The advantage of this is that dask-gateway would then benefit from a bunch of other users. Otherwise I suspect that you'll be needled with hundreds of "oh, my job scheduler uses this special parameter, can you add that too please?" requests. However the dask-jobqueue folks have handled this, it seems to have worked.
I think the first step here would be for you to look through the dask-jobqueue codebase, talk with the maintainers of that project, and then maybe discuss ways in which job-specification might be generalized.
I have been using their codebase as inspiration, but suspect that for now it's easier to copy and augment than it is to figure out what should be shared. Shared components could maybe be extracted at a later point once it's clearer what that means. I would be more than happy to talk to the maintainers of
Ok I took a quick look at the code from this PR, some thoughts:
This also looks quite similar to batchspawner from JupyterHub, which is not surprising if I understand correctly the "Daskhub" nature of dask-gateway. Could you, or do you, use Dask Cluster objects for the other infrastructure the gateway supports? Don't you think we could design both Cluster and dask-gateway to be compatible?
Thanks @guillaumeeb for reading through the PR. Responding to a few comments inline:
Why would
I'm not super familiar with these systems. Why the severe restriction (does it result in a noticeable lag on the scheduler)? In this case we make many requests to qstat, but each is a single request that tracks all users' jobs, not one per job. The frequency could be turned down; the only reason we check at all is to catch failures early rather than waiting for a failed job to time out on startup.
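To make the "one request for all jobs" idea concrete, here is a minimal sketch of how a single status reply could be turned into a per-job view. It is not the code from this PR. It assumes the reply is JSON shaped like the output of `qstat -f -F json` on PBS Professional (a top-level `Jobs` mapping whose entries carry a `job_state` letter); the function names and the sample job ids are hypothetical.

```python
from __future__ import annotations

import json


def parse_job_states(qstat_json: str) -> dict[str, str]:
    """Map each job id to its one-letter state from ONE qstat reply.

    Assumes a payload like {"Jobs": {"<job id>": {"job_state": "R"}}}.
    """
    payload = json.loads(qstat_json)
    jobs = payload.get("Jobs", {})
    return {job_id: info.get("job_state", "?") for job_id, info in jobs.items()}


def find_stopped(tracked: set[str], states: dict[str, str]) -> set[str]:
    """Tracked jobs that are missing from the reply, exiting, or finished."""
    return {job for job in tracked if states.get(job) in (None, "E", "F")}


# One reply covers every tracked job, so the polling cost does not grow
# with one qstat call per job.
sample = '{"Jobs": {"101.pbs": {"job_state": "R"}, "102.pbs": {"job_state": "F"}}}'
states = parse_job_states(sample)
stopped = find_stopped({"101.pbs", "102.pbs", "103.pbs"}, states)
```

Treating a job that is absent from the reply as stopped is a design choice for this sketch; a real implementation might want a grace period before declaring a failure.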
This is correct - batchspawner's code was also looked at for inspiration.
We don't rely on any of the existing dask cluster implementations inside dask-gateway. The gateway server library ( For now I plan to expand support for the various existing dask backends. If at some point it becomes clear how we could share some code we can think about splitting that out then. Of course I welcome any feedback/features/issues/comments you might have - so far this has been a single-person project and I'm definitely not as familiar with job-queue systems as you/the dask-jobqueue team are.
I was unclear; I just meant you could configure or extend dask-jobqueue to launch jobs with sudo.
Exactly, it can overwhelm the scheduler if a lot of users do frequent
OK, that's a good thing!
I think we were trying to define this ClusterManager in dask-distributed (dask/distributed#2235). I agree this is not the same as what you describe, but it is a step towards sharing code. Anyway, I'm happy to try to address this later on (and I just don't have the time to give you support right now); let's just keep in mind that there could be some generalization.
Co-authored-by: Jacob Tomlinson <jacobtomlinson@users.noreply.github.com>
Adds initial support for the PBS job queue system.
This works by using an intermediate script (`dask-gateway-jobqueue-launcher`) to do the heavy lifting of submitting/killing jobs. The gateway user then needs sudo access for this script for all dask users.
The gateway also needs to be a PBS operator in order to check the job status of all users. This could be worked around, but avoiding being an operator would likely be more expensive as `sudo` would have to be used for status queries as well.
The configurable fields exposed here were my best attempt at determining which parameters were important for users (balancing flexibility with ease of use). Not being a PBS user myself, I've likely done a poor job at determining what's important.
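As a rough illustration of the flow described above, the sketch below renders a few configurable fields into PBS directives and builds the sudo invocation that would run the launcher as the target user. It is not the PR's implementation: the `PBSJobFields` names, the defaults, and the idea that the launcher takes no arguments are all assumptions made for the example. Only the `#PBS` directive forms and the `sudo -n -H -u` flags are standard.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PBSJobFields:
    # Hypothetical field names, not the PR's actual configuration.
    job_name: str = "dask-worker"
    queue: str | None = None
    account: str | None = None
    cores: int = 1
    memory_gb: int = 2
    walltime: str = "01:00:00"


def render_job_script(fields: PBSJobFields, command: str) -> str:
    """Render the fields as #PBS directives followed by the command to run."""
    lines = ["#!/bin/sh", f"#PBS -N {fields.job_name}"]
    if fields.queue:
        lines.append(f"#PBS -q {fields.queue}")
    if fields.account:
        lines.append(f"#PBS -A {fields.account}")
    lines.append(f"#PBS -l select=1:ncpus={fields.cores}:mem={fields.memory_gb}gb")
    lines.append(f"#PBS -l walltime={fields.walltime}")
    lines.append(command)
    return "\n".join(lines) + "\n"


def build_launcher_command(
    username: str, launcher: str = "dask-gateway-jobqueue-launcher"
) -> list[str]:
    """Build the argv that runs the launcher as `username` without prompting."""
    # -n: never prompt for a password, -H: use the target user's HOME.
    return ["sudo", "-n", "-H", "-u", username, launcher]


# Placeholder values: the queue name, user, and scheduler address are made up.
script = render_job_script(
    PBSJobFields(queue="workq", cores=4, memory_gb=8),
    "dask-worker tcp://scheduler:8786",
)
cmd = build_launcher_command("alice")
```

Keeping optional fields such as the queue and account out of the script when unset lets the cluster's own defaults apply, which is one way to balance flexibility with ease of use.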