Fail job if job output collection fails #5078

Merged
merged 4 commits into galaxyproject:dev from mvdbeek:fail_if_job_output_collection_fails on Nov 28, 2017

Conversation

3 participants

mvdbeek (Member) commented Nov 27, 2017

We now set a non-zero exit code to actually set a job to failed.
I believe this is a reasonable action, and it avoids green, empty datasets
with log messages like:

```
galaxy.jobs.runners ERROR 2017-11-14 13:32:06,469 (19020/860246.torque6.curie.fr) Job output not returned from cluster: [Errno 2] No such file or directory: '/data/users/mvandenb/gx124/tmp_nfs/jwd/019/19020/galaxy_19020.o'
galaxy.jobs.runners DEBUG 2017-11-14 13:32:06,491 (19020/860246.torque6.curie.fr) Unable to cleanup /data/users/mvandenb/gx124/tmp_nfs/jwd/019/19020/galaxy_19020.o: [Errno 2] No such file or directory: '/data/users/mvandenb/gx124/tmp_nfs/jwd/019/19020/galaxy_19020.o'
galaxy.jobs.runners DEBUG 2017-11-14 13:32:06,503 (19020/860246.torque6.curie.fr) Unable to cleanup /data/users/mvandenb/gx124/tmp_nfs/jwd/019/19020/galaxy_19020.e: [Errno 2] No such file or directory: '/data/users/mvandenb/gx124/tmp_nfs/jwd/019/19020/galaxy_19020.e'
galaxy.jobs.runners DEBUG 2017-11-14 13:32:06,513 (19020/860246.torque6.curie.fr) Unable to cleanup /data/users/mvandenb/gx124/tmp_nfs/jwd/019/19020/galaxy_19020.ec: [Errno 2] No such file or directory: '/data/users/mvandenb/gx124/tmp_nfs/jwd/019/19020/galaxy_19020.ec'
```
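
A minimal sketch of the approach described above (simplified; the function and its file-name parameters are hypothetical stand-ins, not the actual runner code or the exact diff):

```python
import logging

log = logging.getLogger(__name__)


def collect_output_and_exit_code(output_file, exit_code_file):
    """Sketch: force a failing exit code when output collection fails."""
    exit_code_str = None
    try:
        stdout = open(output_file, "r").read()
    except Exception as e:
        log.error("Job output not returned from cluster: %s", e)
        stdout = ""
        # Force the job to be treated as failed instead of green and empty.
        exit_code_str = "1"
    # Only read the exit-code file if the failure branch did not already
    # set a non-zero code.
    if exit_code_str is None:
        exit_code_str = open(exit_code_file, "r").read(32)
    return stdout, exit_code_str
```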

The review thread below refers to this change from the diff:

```diff
                 else:
                     time.sleep(1)
                 which_try += 1

         try:
             # This should be an 8-bit exit code, but read ahead anyway:
-            exit_code_str = open(job_state.exit_code_file, "r").read(32)
+            exit_code_str = open(job_state.exit_code_file, "r").read(32) if not exit_code_str else exit_code_str
```

nsoranzo (Member) commented Nov 27, 2017

This works properly only if exit_code_str is always defined, otherwise this would result in:

NameError: name 'exit_code_str' is not defined

which is caught by the next except, ending up almost always with a value of "0".
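
A small, self-contained illustration of the failure mode described above (the function and the "0" fallback handler are hypothetical stand-ins for the surrounding code path, not the actual Galaxy source):

```python
def read_exit_code(exit_code_file):
    try:
        # exit_code_str is only assigned on an earlier failure path, so on the
        # normal path this raises UnboundLocalError (a NameError) before the
        # file is even opened ...
        exit_code_str = open(exit_code_file, "r").read(32) if not exit_code_str else exit_code_str
    except Exception:
        # ... and the error is swallowed here, so the job reports "0".
        exit_code_str = "0"
    return exit_code_str


print(read_exit_code("some_exit_code_file"))  # prints: 0
```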

mvdbeek (Member) commented Nov 27, 2017

uff, of course. thanks for the catch.

jmchilton (Member) commented Nov 27, 2017

At first glance, I don't think I love this. This wouldn't cause the datasets to fail if the exit code isn't used as a condition to fail the job - which is the old default and a reasonable way to configure even newer tools. I feel like we should somehow fail the job more directly. We could either always fail the job when this exit code is encountered elsewhere (maybe in job finish?), or better yet somehow set the exit code to None and fail all tool-based jobs with None exit codes?
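
A minimal sketch of the second idea (a hypothetical helper for the job-finish path, not Galaxy's actual API):

```python
def job_should_fail(exit_code, tool_uses_exit_code):
    # A missing (None) exit code means output collection failed, so fail the
    # job regardless of how the tool is configured to interpret exit codes.
    if exit_code is None:
        return True
    if tool_uses_exit_code:
        return exit_code != 0
    # Old-style tools that ignore exit codes fall through to other checks
    # (e.g. stderr-based detection).
    return False
```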

mvdbeek (Member) commented Nov 28, 2017

Alright, I have done some manual testing, and this properly fails jobs for tools that would previously end up as OK.
[screenshot: screen shot 2017-11-28 at 12 19 08]

Add new runner state for 'Job output not returned from cluster'
This would allow admins to write resubmit conditions for this specific failure.

mvdbeek (Member) commented Nov 28, 2017

With the last commit, admins could even set up job resubmit conditions for this specific failure.

jmchilton (Member) commented Nov 28, 2017

Thanks so much for the contribution @mvdbeek - this is awesome.

jmchilton merged commit fac6589 into galaxyproject:dev Nov 28, 2017

7 checks passed

api test: Build finished. 317 tests run, 5 skipped, 0 failed.
continuous-integration/travis-ci/pr: The Travis CI build passed
framework test: Build finished. 162 tests run, 0 skipped, 0 failed.
integration test: Build finished. 58 tests run, 0 skipped, 0 failed.
lgtm analysis: JavaScript: No alert changes
selenium test: Build finished. 100 tests run, 1 skipped, 0 failed.
toolshed test: Build finished. 577 tests run, 0 skipped, 0 failed.

mvdbeek (Member) commented Nov 28, 2017

@erasche this may be relevant to the failures you were seeing

mvdbeek deleted the mvdbeek:fail_if_job_output_collection_fails branch Jun 12, 2018
