Add support for PBS clusters #38

Merged
merged 4 commits into from
Jun 7, 2019

Conversation

jcrist
Member

@jcrist jcrist commented Jun 7, 2019

Adds initial support for the PBS job queue system.

This works by using an intermediate script (dask-gateway-jobqueue-launcher) to do the heavy lifting of submitting/killing jobs. The gateway user then needs sudo access for this script for all dask users.

The gateway also needs to be a PBS operator in order to check the job status of all users. This could be worked around, but avoiding being an operator would likely be more expensive as sudo would have to be used for status queries as well.
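
As a rough illustration of the split described above (sudo for submitting/killing, plain operator queries for status), here is a minimal sketch. The launcher's real command-line interface is not shown in this thread, so the `submit`/`kill` actions and the helper names `as_user` and `status_command` are assumptions made up for the example, not the PR's actual code.

```python
import shlex

# Name taken from the PR description. Its real command-line interface is not
# shown in this thread, so the argument layout below is an assumption.
LAUNCHER = "dask-gateway-jobqueue-launcher"


def as_user(username, action, *args, launcher=LAUNCHER):
    """Build the argv that runs `launcher action args...` as `username`.

    `sudo -n` fails instead of prompting for a password, and `-u` selects
    the target user. The "submit"/"kill" actions are hypothetical.
    """
    if action not in {"submit", "kill"}:
        raise ValueError(f"unsupported action: {action!r}")
    return ["sudo", "-n", "-u", username, launcher, action, *args]


def status_command(job_ids):
    """Status checks skip sudo: a PBS operator can query any user's jobs."""
    return ["qstat", *job_ids]


if __name__ == "__main__":
    print(shlex.join(as_user("alice", "submit")))
    print(shlex.join(as_user("alice", "kill", "1234.pbs")))
    print(shlex.join(status_command(["1234.pbs", "1235.pbs"])))
```

Only the launcher would need an entry in the sudoers configuration under this layout; status queries never go through sudo.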

The configurable fields exposed here were my best attempt at determining which parameters were important for users (balancing flexibility with ease of use). Not being a PBS user myself, I've likely done a poor job at determining what's important.

@mrocklin
Member

mrocklin commented Jun 7, 2019

@jcrist do you have any thoughts on how you might leverage the dask-jobqueue infrastructure or community? It would be unfortunate to cause a significant technology split here. Maybe there is some component in specifying jobs that could be shared between the two projects?

cc @jhamman @guillaumeeb

@jcrist
Member Author

jcrist commented Jun 7, 2019

I don't see a way to use the same codebase - there's too much work that needs to be done to adequately support multiple users in an efficient and secure manner.

FWIW, this didn't take terribly long to write up, and I suspect much of this could be pulled into a base class to support other schedulers.
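
To make the base-class idea concrete, here is a rough sketch of what such a split might look like. The class and method names (`JobQueueManager`, `PBSManager`, `SlurmManager`, `cancel_argv`, `status_argv`) are hypothetical and are not taken from this PR; only the scheduler commands themselves (`qsub`/`qdel`/`qstat` for PBS, `sbatch`/`scancel`/`squeue` for Slurm) are real.

```python
class JobQueueManager:
    """Hypothetical base class: shared logic lives here, and each scheduler
    subclass only supplies its command names and any argument quirks."""

    submit_command = None
    cancel_command = None
    status_command = None

    def cancel_argv(self, job_id):
        # Both qdel and scancel take the job id as a positional argument.
        return [self.cancel_command, job_id]

    def status_argv(self, job_ids):
        # Default: pass every job id positionally in a single call.
        return [self.status_command, *job_ids]


class PBSManager(JobQueueManager):
    submit_command = "qsub"
    cancel_command = "qdel"
    status_command = "qstat"


class SlurmManager(JobQueueManager):
    submit_command = "sbatch"
    cancel_command = "scancel"
    status_command = "squeue"

    def status_argv(self, job_ids):
        # squeue takes a comma-separated list via --jobs.
        return [self.status_command, "--jobs=" + ",".join(job_ids)]
```

The point of the sketch is only that the per-scheduler surface can stay small; the multi-user and state-tracking logic would sit in the base class.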

Leveraging the community would be useful though. I'm not familiar with these systems and may not be doing things 100% correctly. Code review would be quite welcome.

@jcrist jcrist merged commit 9280e6c into master Jun 7, 2019
@jcrist jcrist deleted the add-pbs branch June 7, 2019 21:31
@jcrist jcrist mentioned this pull request Jun 7, 2019
@mrocklin
Member

mrocklin commented Jun 7, 2019

I'm not suggesting that you use the dask-jobqueue codebase as is. I'm suggesting that there might be some component that could be designed that both systems could use. The advantage of this is that dask-gateway would then benefit from a bunch of other users. Otherwise I suspect that you'll be needled with hundreds of "oh, my job scheduler uses this special parameter, can you add that too please?" requests. However the dask-jobqueue folks have handled this, it seems to have worked.

@mrocklin
Member

mrocklin commented Jun 7, 2019

I think the first step here would be for you to look through the dask-jobqueue codebase, talk with the maintainers of that project, and then maybe discuss ways in which job-specification might be generalized.

@jcrist
Member Author

jcrist commented Jun 7, 2019

I think the first step here would be for you to look through the dask-jobqueue codebase

I have been using their codebase as inspiration, but suspect that for now it's easier to copy and augment than it is to figure out what should be shared. Shared components could maybe be extracted at a later point once it's clearer what that means.

I would be more than happy to talk to the maintainers of dask-jobqueue, but think figuring out a shared dependency at this point is more effort than it's worth.

@guillaumeeb
Member

Ok I took a quick look at the code from this PR, some thoughts:

  • It would be really good if we could share some components; it's a shame that we do similar things in two different Dask modules.
  • The dask-gateway jobqueue system has some advantages and specificities:
    • It is designed to launch jobs from a central place for multiple users, so it has some sudo logic in it. This could probably be added to dask-jobqueue.
    • It handles the submission of the Dask Scheduler as a separate job, a long-awaited feature for dask-jobqueue.
  • dask-jobqueue could help dask-gateway:
    • It handles many options and scheduler specificities:
      • avoiding calling qstat too often to check job status (we ask our users not to call it automatically more than once per minute)
      • providing a good HPC configuration with the local-directory and interface (Infiniband) parameters
    • It is already compatible with multiple schedulers.

This also looks quite similar to batchspawner from Jupyterhub, which is not surprising if I understand correctly the "Daskhub" nature of dask-gateway?

Could or do you use Dask Cluster objects for other infrastructure gateway? Don't you think we could design both Cluster and dask-gateway to be compatible?

@jcrist
Member Author

jcrist commented Jun 8, 2019

Thanks @guillaumeeb for reading through the PR. Responding to a few comments inline:

It is designed to launch jobs from a central place for multiple users, so has some sudo logic in it, this could probably be added to dask-jobqueue.

Why would dask-jobqueue need support for launching jobs as other users?

avoid using qstat too often for ensuring jobs status (we ask our users not to use it automatically more than once per minute)

I'm not super familiar with these systems. Why the severe restriction (does it result in a noticeable lag on the scheduler)? In this case we make many requests to qstat, but each is a single request that tracks all users' jobs, not one per job. The frequency could be turned down. The reason we check at all is to catch failures early rather than waiting for a failed job to time out on startup.
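
A minimal sketch of that polling pattern, assuming a caller-supplied `query_all` function that returns a mapping of job id to state from one batched query (for example by running and parsing a single `qstat` call; the parsing is deliberately left out). The class name `BatchedStatusPoller` and the 60 second default are illustrative choices, not the gateway's actual implementation.

```python
import time


class BatchedStatusPoller:
    """Answer per-job status lookups from one shared, rate-limited query."""

    def __init__(self, query_all, min_interval=60.0, clock=time.monotonic):
        self._query_all = query_all        # one call covers every tracked job
        self._min_interval = min_interval  # seconds between real queries
        self._clock = clock                # injectable so tests can fake time
        self._last_poll = None
        self._states = {}

    def states(self):
        """Return the cached states, refreshing only if the interval passed."""
        now = self._clock()
        if self._last_poll is None or now - self._last_poll >= self._min_interval:
            self._states = dict(self._query_all())
            self._last_poll = now
        return self._states

    def state_of(self, job_id):
        """Look up one job without triggering a query of its own."""
        return self.states().get(job_id)
```

With this shape, the load on the PBS server depends on `min_interval` alone, not on the number of users or jobs being tracked.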

This also looks quite similar to batchspawner from Jupyterhub, which is not surprising if I understand correctly the "Daskhub" nature of dask-gateway?

This is correct - batchspawner's code was also looked at for inspiration.

Could or do you use Dask Cluster objects for other infrastructure gateway? Don't you think we could design both Cluster and dask-gateway to be compatible?

We don't rely on any of the existing dask cluster implementations inside dask gateway. The gateway server library (dask-gateway-server) doesn't depend on dask at all (requirements here); it only handles authentication and process management. I think we could potentially find a way to share code, but I think focusing on that right now is the wrong solution. The requirements for a dask-gateway ClusterManager and an existing dask Cluster implementation are sufficiently different that many things have to be rewritten (we have to support multiple users securely, save state to a database, handle authentication, robust failure handling, etc...). Think of it like ipyparallel - that codebase already had logic for launching jobs on many job queue systems, but it didn't make sense for dask-jobqueue to rely on it for starting jobs.


For now I plan to expand support for the various existing dask backends. If at some point it becomes clear how we could share some code we can think about splitting that out then. Of course I welcome any feedback/features/issues/comments you might have - so far this has been a single-person project and I'm definitely not as familiar with job-queue systems as you/the dask-jobqueue team are.

@guillaumeeb
Member

Why would dask-jobqueue need support for launching jobs as other users?

I was unclear. I just meant you could configure or extend dask-jobqueue to launch jobs with sudo.

Why the severe restriction (does it result in a noticeable lag on the scheduler)?

Exactly, it can overwhelm the scheduler if a lot of users run qstat frequently. We tell our users to avoid more than one call per minute in their programs. This is probably very conservative, but it prevents misuse.

it's a single request for tracking all users' jobs, not one-per-job.

OK, that's a good thing!

The requirements for a dask-gateway ClusterManager and an existing dask Cluster implementation are sufficiently different that many things have to be rewritten (we have to support multiple users securely, save state to a database, handle authentication, robust failure handling, etc...)

I think we were trying to define this ClusterManager in dask-distributed (dask/distributed#2235). I agree this is not the same as what you describe, but it is a step towards sharing code.

Anyway, I'm happy to try to address this later on (and I just don't have the time to give you support right now), let's just keep in mind that there could be some generalization.

@jcrist jcrist mentioned this pull request Mar 31, 2020
9 tasks
kyprifog pushed a commit to kyprifog/dask-gateway that referenced this pull request Oct 16, 2020
Co-authored-by: Jacob Tomlinson <jacobtomlinson@users.noreply.github.com>