Offloading jobs to Kubernetes #1902
Comments
It's great to hear someone else is working on this; we're also very interested! Is the source hosted somewhere we could see the approach? Does the current 'container' requirement tag (example: https://github.com/galaxyproject/galaxy/blob/dev/test/functional/tools/catDocker.xml#L4) convey enough information for a job runner to create the necessary config files?
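(For context, the container declaration in that test tool looks roughly like this; quoted from memory, so the file at that link may differ:)

```xml
<requirements>
    <container type="docker">busybox:ubuntu-14.04</container>
</requirements>
```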
Well, our proof of concept is simply a wrapper around kubectl, nothing fancy. I have just started looking at the Galaxy code, so I don't have anything to show yet (I only forked the project). As you may be aware, Kubernetes is normally abbreviated k8s, so I'll use that abbreviation from here on. Consider a tool that initially looks like this (no k8s integration):

```xml
<tool id="upps_blankfilter" name="BlankFilter_Regular" version="0.1.0">
    <requirements>
        <requirement type="package">Rscript</requirement>
    </requirements>
    <command><![CDATA[
        Rscript BlankFilter.r "$input1" "$output1"
    ]]></command>
    <inputs>
        <param type="data" name="input1" format="xls" />
    </inputs>
    <outputs>
        <data name="output1" format="xls" />
    </outputs>
    <help><![CDATA[
        TODO: Fill in help.
    ]]></help>
</tool>
```

The same tool, modified to be able to use this wrapper, looks like this (but bear in mind that this is the approach we are moving away from):

```xml
<tool id="upps_blankfilter" name="BlankFilter" version="0.1.0">
    <requirements>
        <requirement type="package">submit_k8s_job</requirement>
    </requirements>
    <stdio>
        <exit_code range="1:" />
    </stdio>
    <command><![CDATA[
        submit_k8s_jobs
            -j blankfilter
            -n blankfilter
            -c blankfilter
            --cimgrepos docker-registry.local:50000
            --cimgowner phnmnl
            --cimgname ex-blankfilter
            --cimgver latest
            --volpath /mnt/glusterfs
            --volname glusterfsvol
            --glusterfspath scratch
            --
            "$input1" "$output1"
    ]]></command>
    <inputs>
        <param type="data" name="input1" format="xls" />
    </inputs>
    <outputs>
        <data name="output1" format="xls" />
    </outputs>
    <help><![CDATA[
        TODO: Fill in help.
    ]]></help>
</tool>
```

So in the command section: before the `--` you find all the definitions needed to set up the job on k8s; after the `--` you find all the arguments passed to the tool running inside the container. You can get the bigger picture from an internal demo I wrote about this. But as you can see, this is a lot of tool modification, which is what we want to move away from. Adding a mapping from executables to container images, along these lines:

```yaml
- exec: blankfilter
  containers:
    - image: blankfilter_container
      image_version: 16
      image_owner: phnmnl
- exec: tool2
  containers:
    - image: cont_for_tool2
...
```

would mean that when the k8s Galaxy job runner (which doesn't exist yet) encounters a command whose executable appears in that mapping, it knows which container image to run it in. But yes, answering your question: what you have there in the requirement would suffice, supposing either that the executable matches the image's entrypoint and you pass the rest as arguments, or that the given executable can be run within the image.
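(To make the idea concrete, here is a minimal sketch of how a runner might resolve an executable to an image from such a mapping file; the file name, function name, and image-reference format are my own assumptions, not part of the proposal above:)

```python
import yaml

def resolve_container(mapping_path, executable):
    # Load the exec -> containers mapping sketched above.
    with open(mapping_path) as fh:
        entries = yaml.safe_load(fh)
    for entry in entries:
        if entry["exec"] == executable:
            # Use the first listed container for this tool.
            container = entry["containers"][0]
            image = container["image"]
            owner = container.get("image_owner")
            version = container.get("image_version")
            ref = "%s/%s" % (owner, image) if owner else image
            return "%s:%s" % (ref, version) if version is not None else ref
    return None  # No container registered for this executable.

# Example: resolve_container("tool_containers.yaml", "blankfilter")
# -> "phnmnl/blankfilter_container:16"
```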
To follow up on this, our understanding (@RJMW, @korseby, @sneumann and me) so far, for our use case (shared filesystem available), is that we should implement a Kubernetes job runner in lib/galaxy/jobs/runners that inherits from AsynchronousJobRunner and uses pykube to interface with the k8s REST API. In addition to this, we should add a plugin entry in job_conf.xml:

```xml
<plugin id="k8s" type="runner" load="galaxy.jobs.runners.kubernetes:kubernetes">
<param id="k8s_config_path">/path/to/kubeconfig</param>
...
</plugin>
```

a destination for each Docker container that we want to use:

```xml
<destination id="blankfilter-container" runner="k8s">
<param id="repo">docker-registry.lan:80000</param>
<param id="owner">bfcreator</param>
<param id="image">ex-blankfilter</param>
<param id="tag">latest</param>
</destination>
```

and then pair each tool with its destination (and thereby the runner) in the same file, within the `<tools>` section:

```xml
<tool id="blankfilter" destination="blankfilter-container"/>
```

Would this be the correct way of adding this feature to Galaxy? We have some code written here. Thanks for the feedback @dannon!
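(Independent of Galaxy internals, the submission step such a runner would perform could look roughly like this with pykube; a sketch only, and the function name and parameters are hypothetical, mirroring the destination params above:)

```python
import pykube

def submit_job_to_k8s(kubeconfig_path, repo, owner, image, tag, job_name, command):
    # Connect to the cluster via the kubeconfig named in the plugin config.
    api = pykube.HTTPClient(pykube.KubeConfig.from_file(kubeconfig_path))
    job_spec = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": job_name,
                        # Image reference assembled from the destination params.
                        "image": "%s/%s/%s:%s" % (repo, owner, image, tag),
                        "command": command,
                    }],
                    "restartPolicy": "Never",
                },
            },
        },
    }
    pykube.Job(api, job_spec).create()

# Example (values from the destination above; the command is illustrative):
# submit_job_to_k8s("/path/to/kubeconfig", "docker-registry.lan:80000",
#                   "bfcreator", "ex-blankfilter", "latest", "blankfilter-job-1",
#                   ["Rscript", "BlankFilter.r", "in.xls", "out.xls"])
```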
For me this sounds like the correct way of doing it. Awesome work! Don't forget to register for a talk at the next GCC :)
Thanks @bgruening! We will continue on this track then. I would certainly be interested in joining efforts with @abdulrahmanazab, as we are only starting this.
@pcm32 Sorry for the lag here; most of us are at a meeting this week. I like the approach you've outlined above.
No problem @dannon, and thanks for confirming that this is the way to go.
@bgruening, @dannon I now have something that works here for Kubernetes (k8s) and Galaxy. I haven't tried all the borderline cases yet (restarting failed jobs, etc.), but the basic functionality is there: it submits jobs to k8s, monitors progress, signals jobs when done or failed, and fills in the stdout/stderr files. I have made some assumptions, though, and some things still need to improve. My questions to you guys are:
Currently, installation doesn't work out of the box because my pull requests to pykube haven't been accepted yet, but I'll work on that before finally making a pull request here. Thanks for the feedback!
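(The monitoring step described above could be as simple as polling the Job status object; a minimal sketch with pykube, where the function name and polling interval are my own assumptions, not the actual implementation:)

```python
import time
import pykube

def wait_for_job(api, job_name, namespace="default", poll_seconds=10):
    # Fetch the Job by name and poll until k8s reports success or failure.
    job = pykube.Job.objects(api).filter(namespace=namespace).get_by_name(job_name)
    while True:
        job.reload()
        status = job.obj.get("status", {})
        if status.get("succeeded", 0) > 0:
            return True
        if status.get("failed", 0) > 0:
            return False
        time.sleep(poll_seconds)
```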
@pcm32 this sounds awesome. Can't wait to try it! In related news, we managed to convince HTCondor to submit Docker containers; according to @abdulrahmanazab this scales better than Kubernetes. The PR is here: #2278
@bgruening, @dannon I'm still waiting for my PR to be accepted at pykube (the library I use to communicate with k8s through its REST API) before I can make a pull request here. Is it acceptable to open the pull request in Galaxy if it means I need to add a GitHub source for the pykube package in the requirements file (doing something like this) instead of the official pykube release from pip? I would fix this once my changes are merged into pykube and a new release is available on PyPI.
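(For reference, such a Git-sourced pip requirement usually takes this shape; the fork URL and branch below are placeholders, not the actual ones:)

```
git+https://github.com/<your-fork>/pykube.git@<branch>#egg=pykube
```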
@pcm32 Galaxy has already shipped modified Python packages in the past, so I think this is possible. You can also label your PR as WIP so we can search for testers and reviewers.
@pcm32 this can be closed, correct? Hope your talk went well!
Thanks! Yes, we can close it!
Hi there,
We are interested in offloading jobs to Kubernetes, which means that each job is executed within a Docker container. For our use case we should be able to rely on a shared filesystem, as we intend to run Galaxy itself within Kubernetes (as a pod). I was wondering how I should proceed: should I just try to write a job runner, or should I go through a more complex route with Pulsar/LWR? My understanding is that Pulsar/LWR is the option when there is no shared filesystem, so since we have one, I should refrain from using it.
Basically, to send a job to Kubernetes you need to write a JSON or YAML file defining the job, and a crucial part of that definition is which container image will run it. Different Galaxy tools will normally use different Docker images, and we already have many tools that we want to run containerized. So here, within the Galaxy use case, we need to provide a mapping from tools to Docker images (that is, which tool is executed within which container). I wonder how this tool-container mapping fits within the normal Galaxy structure: where would you place something like this, considering that it might depend on the local Kubernetes cluster you're using? It could well be fetched from a container in the cluster holding those definitions/mappings.
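(For illustration, a minimal k8s Job definition of the kind described here could look like this; a sketch only, where the image reference reuses the registry/owner/image values from the examples earlier in this thread and the paths are hypothetical:)

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: blankfilter-job-1
spec:
  template:
    spec:
      containers:
        - name: blankfilter
          # Which image runs the job is the crucial part of the definition.
          image: docker-registry.local:50000/phnmnl/ex-blankfilter:latest
          command: ["Rscript", "BlankFilter.r", "/data/input1.xls", "/data/output1.xls"]
      restartPolicy: Never
```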
We currently have a proof of principle working with an older Galaxy version, but it entails modifying the Galaxy tool wrappers, which is of course a bad idea. It essentially interfaces with kubectl (the CLI provided by Kubernetes) to send jobs and wait for completion/failure. We want to move to a more mainstream usage of Galaxy.
I did look around for something like this, but couldn't find anything.
What would be the best way to proceed in terms of implementation? Write a job runner or something else? Thanks!