Kubernetes runner: improves handling of failed jobs #2559

Merged
jmchilton merged 10 commits into galaxyproject:dev from phnmnl:feature/failingPodsNotDetected on Jul 11, 2016

4 participants
@pcm32
Member

commented Jun 27, 2016

This pull request addresses several issues with how the Kubernetes runner handles failed jobs. This PR:

  • Uses the pod retry parameters set in job_conf.xml, which were previously ignored.
  • Handles the exception raised when a pod, or its logs, are no longer available.
  • Scales Kubernetes jobs down to zero correctly when terminating a job, without errors.
  • Overrides the parent fail_job method in the Kubernetes runner to save the stderr/stdout of the pods (the output of kubectl logs pods/<pod-id>), which was previously lost when the job failed. The previous behaviour did not let the user see why the pod failed (i.e., the error of the program running inside the container).

On the last point: unfortunately, Kubernetes currently does not distinguish between stderr and stdout in the pod logs, and Galaxy rightly limits the amount of data it will save in the database for this purpose to 32k, so we might still end up in scenarios where the reason for the failure is lost.
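
To illustrate the idea (this is not the code in this PR), here is a minimal Python sketch of overriding fail_job so that the combined pod output is saved before the job is marked as failed. Everything except the fail_job name is hypothetical: BaseRunner is a stand-in for the parent runner, fetch_pod_logs stands for whatever wraps kubectl logs pods/<pod-id>, PodLogsUnavailable is an invented exception, and the job state is a plain dict. The sketch assumes that "32k" means 32 * 1024 bytes and chooses to keep the tail of the log, since the failure reason tends to be printed last.

```python
MAX_STORED_BYTES = 32 * 1024  # assumption: "32k" read as 32 KiB


class PodLogsUnavailable(Exception):
    """Hypothetical error: the pod or its logs no longer exist."""


class BaseRunner:
    """Stand-in for the parent runner; not Galaxy's real class."""

    def fail_job(self, job_state):
        job_state["state"] = "error"


def tail_within_limit(text, limit=MAX_STORED_BYTES):
    """Return the end of text, encoded as UTF-8, in at most `limit` bytes."""
    data = text.encode("utf-8")
    if len(data) <= limit:
        return text
    # Cutting bytes may split a multi-byte character; drop the broken part.
    return data[-limit:].decode("utf-8", errors="ignore")


class KubernetesRunnerSketch(BaseRunner):
    def __init__(self, fetch_pod_logs):
        # fetch_pod_logs: hypothetical callable taking a pod id and
        # returning the combined stdout/stderr of that pod as a string.
        self._fetch_pod_logs = fetch_pod_logs

    def fail_job(self, job_state):
        try:
            logs = self._fetch_pod_logs(job_state["pod_id"])
        except PodLogsUnavailable:
            # The pod may already be gone; record that instead of crashing.
            logs = "Pod logs are no longer available."
        # stdout and stderr cannot be told apart, so store one combined field.
        job_state["output"] = tail_within_limit(logs)
        super().fail_job(job_state)
```

With this shape, a log larger than the limit loses its beginning rather than its end, which is a trade-off: an error printed early and followed by a lot of output would still be lost.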

I have tested this functionality extensively with pipelines that have failing jobs; the current version on the dev branch does not handle these cases well.

I didn't realise that pull #2528 wasn't merged yet, so unfortunately this pull includes 2 commits from that other pull.

@galaxybot galaxybot added the triage label Jun 27, 2016

@galaxybot galaxybot added this to the 16.07 milestone Jun 27, 2016

@pcm32

Member Author

commented Jul 11, 2016

Hi there! I have no way of fixing the current error. I think it is due to a problem with the test run itself rather than with the changes (api test: "Build finished. No test results found."). Could you please re-run the tests so that they can be marked as passed (or so that I can fix them if they fail)? Thanks!

@nsoranzo

Member

commented Jul 11, 2016

@galaxybot test this

@jmchilton jmchilton merged commit f3e9736 into galaxyproject:dev Jul 11, 2016

4 checks passed

api test: Build finished. 220 tests run, 0 skipped, 0 failed.
continuous-integration/travis-ci/pr: The Travis CI build passed
framework test: Build finished. 110 tests run, 0 skipped, 0 failed.
toolshed test: Build finished. 582 tests run, 0 skipped, 0 failed.

@jmchilton

Member

commented Jul 11, 2016

Awesome - thanks for the continued work on this!

@ilveroluca ilveroluca deleted the phnmnl:feature/failingPodsNotDetected branch Oct 9, 2018
