Recoverability

sandal edited this page Dec 6, 2011 · 8 revisions

While most errors cannot be recovered from automatically, a handful of relatively common issues can often be worked around by simply trying again. Robust software makes an effort to recover from this sort of error in a way that is transparent to the user.

The resque-retry plugin is a nice example of a recoverability tool. With it, we can retry a particular background job several times until it either succeeds or the retry limit is reached. The README for resque-retry provides excellent examples of just how flexible it is about all of this, but here are a couple of my favorites:

# Try a job up to 10 times with a two minute delay between attempts

class DeliverWebHook
  extend Resque::Plugins::Retry
  @queue = :web_hooks

  @retry_limit = 10
  @retry_delay = 120

  def self.perform(url, hook_id, hmac_key)
    heavy_lifting
  end
end

# Try a job repeatedly, increasing the delay exponentially
# (i.e. 6s, 1min, 10mins, 1hr, 3hr, 6hr)

class DeliverSMS
  extend Resque::Plugins::ExponentialBackoff
  @queue = :mt_messages

  def self.perform(mt_id, mobile_number, message)
    heavy_lifting
  end
end

# This approach limits the likelihood that retries themselves
# will make a server load problem worse, and reduces the chance
# that a failure in the way your application interacts with a
# service will trigger rapid retries that could cause failure
# cascades.

# Retry with different delays based on different errors; do not
# retry at all if the error is not specifically listed.

class DeliverSMS
  extend Resque::Plugins::Retry
  @queue = :mt_messages

  @retry_exceptions = { NetworkError => 30, SystemCallError => 120 }

  def self.perform(mt_id, mobile_number, message)
    heavy_lifting
  end
end

Whether or not you have a particular need for Resque-based recoverability support, this plugin should give you a good overview of the ways in which we can recover from intermittent failures. In pure Ruby programs, it is possible to roll our own features similar to the ones provided by this library using Ruby's rescue and retry keywords along with the Timeout standard library. However, keep in mind that recoverability is only practical in a relatively limited set of scenarios, and that it introduces a fair bit of complexity that may increase maintenance overhead in your projects and make debugging harder.
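
To make the pure Ruby approach a bit more concrete, here is a minimal sketch that combines rescue, retry, and Timeout. The with_retries helper and its limit, base_delay, timeout, and on parameters are hypothetical names invented for this illustration; they are not part of resque-retry or of Ruby itself. The sketch assumes that only the listed exception classes are worth retrying, and that doubling the delay after each failed attempt is an acceptable backoff policy.

```ruby
require "timeout"

# Hypothetical helper (not part of resque-retry or the standard library).
# Runs the given block, retrying only when one of the exception classes
# in `on` is raised, and waiting twice as long after each failed attempt.
def with_retries(limit: 3, base_delay: 0.1, timeout: 5, on: [Timeout::Error])
  attempts = 0

  begin
    attempts += 1
    # Abort any single attempt that runs longer than `timeout` seconds.
    Timeout.timeout(timeout) { yield attempts }
  rescue *on
    # Give up and re-raise the original error once the limit is reached.
    raise if attempts >= limit

    # Exponential backoff: base_delay, then 2x, 4x, and so on.
    sleep(base_delay * (2**(attempts - 1)))
    retry
  end
end

# Usage: a stand-in operation that fails twice before succeeding.
result = with_retries(limit: 5, base_delay: 0.01, on: [IOError]) do |attempt|
  raise IOError, "simulated intermittent failure" if attempt < 3
  "delivered on attempt #{attempt}"
end

puts result
```

Errors that are not listed in on propagate immediately, which mirrors the "do not retry at all if the error is not specifically listed" behavior shown in the last resque-retry example. Even a helper this small adds a hidden loop and hidden delays around the calling code, which is exactly the kind of debugging cost mentioned above.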


Turn the page if you're taking the linear tour, or feel free to jump around via the sidebar.
