Kubernetes resources requests/limits support #4751

Merged

merged 6 commits into galaxyproject:dev on Dec 18, 2017

Conversation

@pcm32
Member

pcm32 commented Oct 4, 2017

This PR introduces support for setting memory/cpu requests and limits for Kubernetes based jobs, as explained in https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/.

This functionality allows the containers deployed for jobs to be throttled to a certain amount of resources, avoiding potential cluster chokes under high load. I have example job_resource_params_conf.xml, job_conf.xml and rules/k8s_destinations.py files that I could add to this PR for a more complete implementation, where we set different levels of resource usage through dynamic destinations, plus the ability for the user to override these with specific values for CPU and memory. It is just not clear to me where I should put these example files, so please let me know.

Thanks to advice from @bgruening, this PR also includes functionality that uses the resubmit capability when the job/pod fails due to reaching an imposed memory limit. This doesn't apply to CPU usage, as k8s/Docker manages to throttle CPU without needing to kill the job (which is what happens to jobs that go beyond the memory allocated).

@galaxybot galaxybot added the triage label Oct 4, 2017

@galaxybot galaxybot added this to the 18.01 milestone Oct 4, 2017

@pcm32 pcm32 changed the title from Kubernetes resources requests/limits support to [WIP] Kubernetes resources requests/limits support Oct 10, 2017

@pcm32


Member

pcm32 commented Oct 16, 2017

While the code currently signals that the job failed due to reaching the memory limit, the resubmit functionality doesn't seem to be triggered. My dynamic destinations for those test jobs look like:

<destination id="dynamic-k8s-tiny" runner="dynamic">
		<param id="type">python</param>
		<param id="function">k8s_wrapper_tiny</param>
		<resubmit condition="memory_limit_reached" destination="dynamic-k8s-small"/>
</destination>
<destination id="dynamic-k8s-small" runner="dynamic">
		<param id="type">python</param>
		<param id="function">k8s_wrapper_small</param>
		<resubmit condition="memory_limit_reached" destination="dynamic-k8s-medium"/>
</destination>
<destination id="dynamic-k8s-medium" runner="dynamic">
		<param id="type">python</param>
		<param id="function">k8s_wrapper_medium</param>
		<resubmit condition="memory_limit_reached" destination="dynamic-k8s-large"/>
</destination>
<destination id="dynamic-k8s-large" runner="dynamic">
		<param id="type">python</param>
		<param id="function">k8s_wrapper_large</param>
		<resubmit condition="memory_limit_reached" destination="dynamic-k8s-xlarge"/>
</destination>
<destination id="dynamic-k8s-xlarge" runner="dynamic">
		<param id="type">python</param>
		<param id="function">k8s_wrapper_xlarge</param>
</destination>

Is there anything else needed to trigger this behaviour? Thanks!

@pcm32 pcm32 changed the title from [WIP] Kubernetes resources requests/limits support to Kubernetes resources requests/limits support Oct 16, 2017

job_state.fail_message = "More pods failed than allowed. See stdout for pods details."
error_file.close()
job_state.running = False
self.mark_as_failed(job_state)


@jmchilton

jmchilton Nov 1, 2017

Member

I'm seeing that none of the other runners call this directly anymore (except the chronos one, which I think was based on this one). So you should probably not be calling this directly - I think you want to make sure you call self.fail_job() and let it handle those details. It is what triggers the resubmission. Can you rework this so that happens?
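Something roughly like this is what I mean (just a sketch, not exact code - the attribute names and the runner-state constant are assumed from the base runner classes):

job_state.fail_message = "More pods failed than allowed. See stdout for pods details."
job_state.running = False
# If the pod was killed for exceeding its memory limit, setting the runner
# state is what lets a <resubmit condition="memory_limit_reached"/> rule match
# (constant name assumed):
# job_state.runner_state = JobState.runner_states.MEMORY_LIMIT_REACHED
# Queue the shared failure handler instead of calling mark_as_failed directly;
# fail_job is what runs the resubmission logic.
self.work_queue.put((self.fail_job, job_state))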


@pcm32

pcm32 Nov 1, 2017

Member

Sure! Will probably get back to this next week. Thanks for the review John!


@pcm32

pcm32 Nov 21, 2017

Member

Thanks for this input John! I'm looking at the other runners, but I'm struggling a bit as they seem to be doing different things:

  • condor: in check_watched_items it does:
cjs.fail_message = "Cluster could not complete job"
self.work_queue.put((self.fail_job, cjs))
continue

but in a different case it sets the cjs.failed var and uses self.finish_job instead:

if job_failed:
     log.debug("(%s/%s) job failed" % (galaxy_id_tag, job_id))
     cjs.failed = True
     self.work_queue.put((self.finish_job, cjs))
  • GoDocker: this one resembles the k8s one; it takes care of creating a log file and then uses the self.mark_as_failed(job_state) approach

  • PBS: it overrides fail_job and does something like:

pbs_job_state.job_wrapper.fail(pbs_job_state.fail_message)
  • slurm:
if drmaa_state == self.drmaa_job_states.FAILED:
       ajs.fail_message += '\nPlease click the bug icon to report this problem if you need help.'
       ajs.stop_job = False
       self.work_queue.put((self.fail_job, ajs))
       return
  • drmaa: uses fail_job, but doesn't set ajs.* variables
elif self.runner_params[state_param] == model.Job.states.ERROR:
       log.warning("(%s/%s) job will now be errored", galaxy_id_tag, external_job_id)
       self.work_queue.put((self.fail_job, ajs))

So I'm a bit confused, which is the correct construct to use? Is it:

ajs.failed = True
self.work_queue.put((self.fail_job, ajs))

To add to this confusion, the Galaxy runner guidelines that I recently came across state, in the figure for Stage 3, that the mark_as_failed method should be used inside check_watched_item. Could you please clarify and point me to the desired examples? Thanks!


@jmchilton

jmchilton Dec 12, 2017

Member

Fair enough - if this part is working for you we can keep it as is. We need to work toward figuring out a best practice and get that resubmission logic used everywhere but the state of things is clearly all over the map. Thanks for looking into this though - it is nice we are at least catching the reason and such - this is really nice!


@pcm32

pcm32 Dec 12, 2017

Member

One detail though: I would like to get the resubmission working. You said there is something in particular that needs to be invoked for that - how should it be done? Currently the resubmission is not working.


@pcm32

pcm32 Dec 12, 2017

Member

Well, I guess we can leave that for another PR as well.


@jmchilton

jmchilton Dec 12, 2017

Member

Digging into it a bit more - it does look like it should work to me the way you are doing it.

I'd put some logging in around this line in lib/galaxy/jobs/runners/__init__.py - make sure it is getting there and that job_state.runner_state is correct.

self._handle_runner_state('failure', job_state)
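For instance (illustrative only - the surrounding code and attribute names are assumed):

log.debug("(%s) job failed, runner_state is '%s'", job_state.job_id, job_state.runner_state)
self._handle_runner_state('failure', job_state)
log.debug("(%s) runner state handled", job_state.job_id)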

I'd also try with completely static job destinations - if the problem is that resubmission isn't working because of dynamic job destinations, we should create an issue and address that. There are some ideas of potential problems and how to address them in this PR from last week (#5139).

Regardless - once we merge this as is - can you create an issue so we can track this further?

@jmchilton


Member

jmchilton commented Nov 1, 2017

Can you split this into two PRs - one with the memory stuff and one with the resources? We can continue to talk about the issues together in this PR if you wish - but I'd prefer two PRs to discuss the two different changes.

I commented on the resubmission problem in a commit - hopefully it is helpful. I'd like to see the resources stuff reworked a bit. If you want to define a dynamic destination that takes some default and compares it against some user supplied value - that is super fantastic but I think the separation of concerns should be such that the runner takes a defined interface (so supplied resource request and limits) and the dynamic destination checks the input and supplies some default for the runner if it isn't there.

In other words - the runner should just take what it is given and define what it consumes in job_conf.xml.sample. The branching default logic:

 +
 +        if 'limits_memory' in job_destination.params:
 +            return self.__transform_memory_value(job_destination.params['limits_memory'])
 +        return self.runner_params['k8s_default_limits_memory']
 +

should be in the dynamic destination. If you want to define the defaults with destination parameters like this and consume them in your dynamic destination - that is fine - but it is deployment specific and shouldn't be documented in job_conf.xml.sample.
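As a rough sketch of that split, the dynamic destination could look something like this (function name, param keys and default values are only illustrative, not the PR's actual interface):

from galaxy.jobs import JobDestination

def k8s_wrapper_small(resource_params):
    # Hypothetical dynamic rule: pick user-supplied values if present,
    # fall back to deployment-specific defaults, and always hand the
    # runner a fully populated destination.
    params = {
        'requests_memory': resource_params.get('memory', '512Mi'),
        'limits_memory': resource_params.get('memory', '1Gi'),
        'requests_cpu': resource_params.get('cores', 1),
        'limits_cpu': resource_params.get('cores', 2),
    }
    return JobDestination(runner='k8s', params=params)

The runner then just reads whatever requests/limits it is handed, with no branching defaults of its own.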

Does that make some sense in the abstract? Are there details that I am missing that makes this separation of concerns not appropriate or viable?

If you want to reuse some logic for parsing those resources (e.g. __transform_cpu_value) across the runner and the dynamic destination function - I'd just create a utility module, maybe galaxy.jobs.runners.util.k8s, and import the functionality in both places.
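For instance, the shared piece could be as small as this (module path and exact semantics are assumptions - the real __transform_memory_value may differ):

# hypothetical galaxy/jobs/runners/util/k8s.py
def transform_memory_value(value):
    """Normalise a memory setting such as '512M' or '2G' into bytes so the
    runner and the dynamic rules agree on a single representation."""
    suffixes = {'K': 1024, 'M': 1024 ** 2, 'G': 1024 ** 3, 'T': 1024 ** 4}
    value = str(value).strip()
    if value and value[-1].upper() in suffixes:
        return int(float(value[:-1]) * suffixes[value[-1].upper()])
    return int(value)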

Overall though - this is super exciting. I'm so glad you are maturing this runner so rapidly - everyone is going to catch up with you one day I think and a year or two from now I suspect many people will be using this fantastic runner! Thanks.

@pcm32


Member

pcm32 commented Nov 1, 2017

Unfortunately, disentangling the memory and CPU additions from the commits into two separate PRs would be complicated. Maybe, if you want @jmchilton, we can discuss the two topics separately in separate issues and then reflect the conclusions with additional commits here? Would that work for what you have in mind?

@pcm32


Member

pcm32 commented Nov 21, 2017

the separation of concerns should be such that the runner takes a defined interface (so supplied resource request and limits) and the dynamic destination checks the input and supplies some default for the runner if it isn't there.

Ok, I think I'm partly doing that, with some missing pieces that are not part of the PR. In our containers I place this dynamic destinations logic, which is in turn taken care of in our job_conf.xml using things like these, and then set for the tool here.

Because I moved to dynamic destinations and can no longer use the destination for setting the container, I had to add some logic in the dynamic destination code to sort that out.

The branching default logic: ... should be in the dynamic destination.

Yes, currently I do this in the dynamic destination to combine user-set values with defaults. As for the defaults that you see in the runner, read from the plugin settings, I'm not really using them, I think.

Let me know if this is what you're after or if I'm not getting exactly what you suggest.

If you want to reuse some logic for parsing those resources (e.g. __transform_cpu_value) across the runner and the dynamic destination function - I'd just create like utility class I think maybe galaxy.jobs.runners.util.k8s and import the functionality in both places.

I think I only need this at the ingestion point of the data provided by the user-settable sliders... at the destinations I more or less control the formatting of the resources, but it could be done.

Thanks again for all your time on helping us make this better.

@pcm32


Member

pcm32 commented Nov 21, 2017

but it is deployment specific and shouldn't be documented in job_conf.xml.sample

Ok, so I'll remove these mentions from the sample then.

jmchilton added a commit to jmchilton/galaxy that referenced this pull request Dec 12, 2017

jmchilton added a commit to jmchilton/galaxy that referenced this pull request Dec 12, 2017

@jmchilton


Member

jmchilton commented Dec 12, 2017

Reviewing your comments it looks like we are on the same page. I've opened a PR to clarify what I think it should look like - #4751 (comment).

If that doesn't work and you need help pushing those defaults into your destination instead of doing it in the runner - let me know.

Merge pull request #19 from jmchilton/feature/k8s-resources-requests-limits

Rework k8s resource request parameters and docs

@jmchilton jmchilton merged commit 21dc271 into galaxyproject:dev Dec 18, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed