
Implement exponential backoff retry mechanism for transport tasks #1837

Conversation

@sphuber (Contributor) commented Aug 1, 2018

Fixes #1834

JobProcesses have various tasks they need to execute that require
a transport, and these can fail for various reasons, for example
because the command executed over the transport raises an
exception. Examples are the submission of a job calculation as
well as updating its scheduler state. Such failures do not
necessarily mean that the job is irrecoverably lost: the internet
connection may be temporarily unavailable or the scheduler may
simply not be responding. Instead of putting the process in an
excepted state, the engine should automatically retry at a later
stage.

Here we implement the exponential_backoff_retry utility, a
coroutine that wraps another function or coroutine and tries to
run it, rerunning it whenever an exception is caught, with an
exponentially growing wait between attempts. Once an exception
has been caught on the final allowed attempt, it is re-raised.
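
For illustration, a minimal sketch of what such a utility could look like. This uses asyncio for brevity and is not the actual implementation in this PR; the parameter names `initial_interval` and `max_attempts` are illustrative:

```python
# Minimal sketch of an exponential-backoff retry coroutine (illustrative,
# not the actual AiiDA implementation; parameter names are hypothetical).
import asyncio
import logging

LOGGER = logging.getLogger(__name__)


async def exponential_backoff_retry(fct, initial_interval=10.0, max_attempts=5):
    """Run ``fct`` and, if it raises, retry with exponentially growing waits.

    :param fct: a no-argument callable returning an awaitable
    :param initial_interval: seconds to wait after the first failure
    :param max_attempts: total number of attempts before the exception is re-raised
    """
    interval = initial_interval

    for attempt in range(max_attempts):
        try:
            return await fct()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # all attempts exhausted: let the caller deal with it
            LOGGER.warning('attempt %d/%d failed, retrying in %.1f s',
                           attempt + 1, max_attempts, interval)
            await asyncio.sleep(interval)
            interval *= 2  # exponential backoff: double the wait each time
```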

This is implemented in the various transport tasks that are called
by the Waiting state of the JobProcess class:

  • task_submit_job: submit the calculation
  • task_update_job: update the scheduler state
  • task_retrieve_job: retrieve the files of the completed calculation
  • task_kill_job: kill the job through the scheduler

These are now wrapped in the exponential_backoff_retry coroutine,
which gives the process some leeway when the tasks fail for
transient reasons that often resolve themselves given time; see
the usage sketch below.
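
As a purely hypothetical usage sketch, reusing the utility sketched above: `flaky_update_scheduler_state` is a made-up stand-in for a real transport task such as task_update_job, not code from this PR.

```python
import random


async def flaky_update_scheduler_state():
    """Stand-in for a transport task that sometimes fails transiently."""
    if random.random() < 0.7:
        raise ConnectionError('scheduler temporarily not responding')
    return 'job state: RUNNING'


async def main():
    # Waits 1 s, 2 s, 4 s, 8 s between the up-to-five attempts before giving up
    result = await exponential_backoff_retry(
        flaky_update_scheduler_state, initial_interval=1.0, max_attempts=5)
    print(result)


asyncio.run(main())
```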

@sphuber requested a review from muhrin on August 1, 2018 17:40
@sphuber force-pushed the fix_1834_exponential_backoff_retry_transport_task branch from 2a0713e to 6ac3fb5 on August 2, 2018 08:26
@codecov-io commented Aug 2, 2018

Codecov Report

Merging #1837 into develop will increase coverage by 0.03%.
The diff coverage is 10%.


@@             Coverage Diff             @@
##           develop    #1837      +/-   ##
===========================================
+ Coverage    66.69%   66.73%   +0.03%     
===========================================
  Files          317      317              
  Lines        32407    32406       -1     
===========================================
+ Hits         21613    21625      +12     
+ Misses       10794    10781      -13
Impacted Files Coverage Δ
aiida/transport/plugins/local.py 81.21% <100%> (ø) ⬆️
aiida/orm/implementation/sqlalchemy/group.py 87.62% <100%> (+0.06%) ⬆️
aiida/daemon/execmanager.py 8.6% <5.26%> (+0.88%) ⬆️
aiida/backends/djsite/db/models.py 76.23% <0%> (+0.88%) ⬆️
aiida/backends/djsite/globalsettings.py 86.84% <0%> (+5.26%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@sphuber force-pushed the fix_1834_exponential_backoff_retry_transport_task branch from 6ac3fb5 to 3bd18b4 on August 2, 2018 09:52
@muhrin (Contributor) left a comment

Very nice!

@sphuber merged commit 5ed5f6e into aiidateam:develop on Aug 2, 2018
@sphuber deleted the fix_1834_exponential_backoff_retry_transport_task branch on August 2, 2018 12:09