Retry job submissions in ShellJobRunner #4599

Merged
merged 2 commits into from Sep 13, 2017

@mvdbeek
Member

commented Sep 12, 2017

If I submit a large number of jobs in a very short time, I sometimes see

galaxy.jobs.runners.cli ERROR 2017-09-11 05:03:02,404 (17633) submission
failed (stderr): qsub: submit error (Invalid credential)

in my logs for a minor subset of jobs (2 out of 120 in the last instance
that this happened).
In this case waiting a little while and trying again solves the problem.
By default we will try submitting the job 3 times, and sleep 10 seconds
before trying again.
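
The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `submit_with_retries`, the `submit` callable, and the injectable `sleep` parameter are hypothetical names introduced here, and the log levels are only one possible choice.

```python
import logging
import time

log = logging.getLogger(__name__)


def submit_with_retries(submit, galaxy_id_tag, retries=3, timeout=10, sleep=time.sleep):
    """Call submit() until it succeeds or all attempts are used up.

    `submit` is a hypothetical callable returning (returncode, stdout, stderr).
    Returns (returncode, stdout) of the last attempt.
    """
    returncode, stdout, stderr = None, "", ""
    for attempt in range(1, retries + 1):
        returncode, stdout, stderr = submit()
        if returncode == 0:
            return returncode, stdout
        if attempt < retries:
            # Not the last attempt: report the failure, wait, and try again.
            log.warning("(%s) submission failed (stderr): %s, retrying in %s seconds",
                        galaxy_id_tag, stderr, timeout)
            sleep(timeout)
        else:
            # Out of attempts: this failure is final.
            log.error("(%s) submission failed (stderr): %s", galaxy_id_tag, stderr)
    return returncode, stdout
```

Passing `sleep` as a parameter is only there so the loop can be exercised without actually waiting 10 seconds between attempts.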

Testing this today, I provoked the qsub error once while running filter1 on 250 small input files, and everything seems to have worked smoothly.


@galaxybot galaxybot added this to the 17.09 milestone Sep 12, 2017

@mvdbeek mvdbeek modified the milestones: 18.01, 17.09 Sep 12, 2017

@mvdbeek

Member Author

commented Sep 12, 2017

Set the milestone to 18.01, but if someone wants to review/merge, that wouldn't be a problem.

return cmd_out.returncode, cmd_out.stdout
stdout = '(%s) submission failed (stdout): %s' % (galaxy_id_tag, cmd_out.stdout)
stderr = '(%s) submission failed (stderr): %s' % (galaxy_id_tag, cmd_out.stderr)
log_func = log.warning if retry > 0 else log.error

@erasche

erasche Sep 13, 2017

Member

I'd make this log.debug, personally. I.e. "please don't bother me with this if it will work eventually, I only care if it fails at the end"

@erasche

erasche Sep 13, 2017

Member

Also, given that you have exactly the same `if retry > 0` case below, it seems odd to have log_func with different behaviours that precisely match the branching below. Might as well just replace it with the full log....() calls.

@erasche
Member

left a comment

Looks good to me other than logging stuff :)

log_func = log.warning if retry > 0 else log.error
if retry > 0:
    log_func("%s, retrying in %s seconds" % (stdout, timeout))
    log_func("%s, retrying in %s seconds" % (stderr, timeout))

@erasche

erasche Sep 13, 2017

Member

please use log_func("...", stdout, timeout) instead. xref https://docs.python.org/2/library/logging.html#logging.Logger.debug
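
To illustrate the difference this suggestion makes, here is a small sketch, assuming a logger whose level filters out debug messages. `CountingStr` and the logger name `retry_demo` are hypothetical and exist only to make the effect observable; they are not part of Galaxy.

```python
import logging


class CountingStr(object):
    """Hypothetical helper that counts how often it is turned into a string."""

    def __init__(self, text):
        self.text = text
        self.formatted = 0

    def __str__(self):
        self.formatted += 1
        return self.text


log = logging.getLogger("retry_demo")
log.setLevel(logging.WARNING)  # debug messages are dropped by this logger

stderr = CountingStr("qsub: submit error (Invalid credential)")
timeout = 10

# Eager: the % operator builds the full string before log.debug is called,
# even though the message is then discarded.
log.debug("%s, retrying in %s seconds" % (stderr, timeout))
eager_count = stderr.formatted

# Lazy: the arguments are passed through, and the logger only interpolates
# them if the message is actually going to be emitted.
log.debug("%s, retrying in %s seconds", stderr, timeout)
lazy_count = stderr.formatted - eager_count
```

In this setup the eager call formats the message once for nothing, while the lazy call skips the formatting entirely because the debug level is disabled.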

Decrease log-level for resubmission msg
and specify log level directly in if/else clause. Thx @erasche.

@mvdbeek mvdbeek force-pushed the mvdbeek:make_qsub_submissions_more_robust branch from fa0e409 to ecd8b90 Sep 13, 2017

@erasche erasche merged commit 11d4380 into galaxyproject:dev Sep 13, 2017

6 checks passed

api test Build finished. 292 tests run, 4 skipped, 0 failed.
continuous-integration/travis-ci/pr The Travis CI build passed
framework test Build finished. 161 tests run, 0 skipped, 0 failed.
integration test Build finished. 45 tests run, 0 skipped, 0 failed.
lgtm analysis: JavaScript No alert changes
toolshed test Build finished. 579 tests run, 0 skipped, 0 failed.

@erasche erasche modified the milestones: 17.09, 18.01 Sep 13, 2017

@mvdbeek

Member Author

commented Sep 13, 2017

Thanks for the review @erasche!

@mvdbeek mvdbeek deleted the mvdbeek:make_qsub_submissions_more_robust branch Jun 12, 2018
