Job Errors

KrakenOverlord edited this page Dec 21, 2018 · 6 revisions

Jobs are software, and software fails for many reasons: bugs, network issues, etc. Faktory recognizes this and provides automatic error handling for all jobs by default.

The Faktory worker process fetches a job and executes it.

  1. If the job does not raise an error, it is considered a success. The worker will ACK it to report success.
  2. If the job does raise an error, the worker will send FAIL with error information to Faktory. This kicks off the error process.
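The two outcomes above can be sketched as a small worker-side function. This is a minimal illustration, not Faktory's worker code: `run_job`, `handler`, and `send` are hypothetical names, `send` stands in for whatever writes a command to the Faktory connection, and the FAIL field names (`errtype`, `message`, `backtrace`) are an assumption about the error information a worker reports.

```python
import json
import traceback


def run_job(job, handler, send):
    """Execute one fetched job and report the result.

    `job` is the decoded job payload, `handler` is the user code for the
    job, and `send` is a hypothetical callable that writes one command
    line to the Faktory connection.
    """
    try:
        handler(*job.get("args", []))
    except Exception as exc:
        # The job raised: report FAIL with error information.
        # Field names here are assumed for illustration.
        failure = {
            "jid": job["jid"],
            "errtype": type(exc).__name__,
            "message": str(exc),
            "backtrace": traceback.format_exc().splitlines(),
        }
        send("FAIL " + json.dumps(failure))
        return False
    # No error was raised: the job is a success, so ACK it.
    send("ACK " + json.dumps({"jid": job["jid"]}))
    return True
```

Passing `send` in as a parameter keeps the sketch independent of any particular client library and makes the ACK and FAIL paths easy to exercise in isolation.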

The Process

Faktory provides retries with exponential backoff. This means that Faktory will retry the job up to N times, waiting a little longer before each successive retry. By default Faktory will retry a job 25 times, which spreads the retries over roughly 21 days. In other words, if the failure is caused by a bug in your software, you have about three weeks to deploy a fix. You deploy a fix, the job executes successfully on its next retry, everyone is happy.

The wait formula, in seconds, is:

15 + count ^ 4 + (rand(30) * (count + 1))
  • 15 establishes a minimum wait time.
  • count^4 is our exponential term: for the first retry it is 0, for the 20th retry it is 20^4 (160,000 sec), or about two days.
  • rand(30) gives us a random "smear". Sometimes people enqueue 1000s of jobs at one time, which all fail for the same reason. This ensures we don't retry 1000s of jobs all at the exact same time and take down a system.
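The formula can be tried out with a short Python sketch. It assumes that `^` means exponentiation (not XOR), that `rand(30)` returns a random integer from 0 to 29, and that `count` starts at 0 for the first retry, as the bullets above suggest. The name `retry_delay` and the injectable `rand` parameter are illustrative, not part of Faktory.

```python
import random


def retry_delay(count, rand=random.randrange):
    """Seconds to wait before the next retry.

    `count` is the number of retries already attempted (0 for the first
    retry). `rand(n)` is assumed to return an integer in [0, n).
    """
    minimum = 15                      # floor on the wait time
    backoff = count ** 4              # grows quickly with each retry
    smear = rand(30) * (count + 1)    # random spread so mass failures don't retry in lockstep
    return minimum + backoff + smear


if __name__ == "__main__":
    for count in (0, 1, 5, 10, 20):
        print(count, retry_delay(count))
```

With the random term at zero, the 25 default retries (counts 0 through 24) add up to 1,763,395 seconds, or about 20.4 days, which lines up with the roughly three weeks mentioned above. The random smear adds at most a few hours on top of that.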

Job Death

After retrying N times, Faktory assumes the job will continue to fail forever and will stop retrying. It moves the job into the Dead Set. Jobs in the Dead Set are not touched by Faktory but can be manually executed from the Web UI. If you have a fix which takes a while to develop, you can trigger a retry after deploying the fix.


How do I configure the number of retries?

Set "retry": 6 in the job payload, where 6 is the chosen retry count. After that count, the job will go to the Dead Set as normal.
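As a rough illustration, a job payload with a custom retry count might be built like this before being pushed. The helper `build_job` and the job type `SendInvoice` are hypothetical, and the other payload fields shown (`jid`, `jobtype`, `args`) are assumed here for the sake of a complete example. Only the `retry` field comes from the text above.

```python
import json
import uuid


def build_job(jobtype, args, retry=None):
    """Build a job payload dict; `retry` is only set when given explicitly."""
    job = {
        "jid": uuid.uuid4().hex,   # unique job id
        "jobtype": jobtype,
        "args": list(args),
    }
    if retry is not None:
        job["retry"] = retry       # omit the key to keep the server default
    return job


if __name__ == "__main__":
    job = build_job("SendInvoice", [42], retry=6)
    print(json.dumps(job))
```

Leaving `retry` out of the payload keeps the default of 25 retries described earlier.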

How do I disable retry completely?

Set "retry": 0 in the job payload. The job will be discarded if it fails. Set "retry": -1 if you want failed jobs to skip retries and be saved to the Dead Set instead.
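The retry settings described on this page can be summarized as a small decision function. This is a model of the rules as stated here, not Faktory's actual implementation. The names `failure_outcome` and `retries_done` are made up for the example.

```python
DEFAULT_RETRIES = 25  # default retry count stated on this page


def failure_outcome(retry=DEFAULT_RETRIES, retries_done=0):
    """Return what happens to a job that has just failed.

    `retry` is the value from the job payload; `retries_done` is how many
    retries have already been attempted for this job.
    """
    if retry == 0:
        return "discard"   # no retries, job is thrown away
    if retry == -1:
        return "dead"      # no retries, job is saved to the Dead Set
    if retries_done < retry:
        return "retry"     # schedule another attempt with backoff
    return "dead"          # retries exhausted, job moves to the Dead Set


if __name__ == "__main__":
    print(failure_outcome(retry=6, retries_done=2))
    print(failure_outcome(retry=6, retries_done=6))
    print(failure_outcome(retry=0))
    print(failure_outcome(retry=-1))
```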

Do worker crashes trigger retries?

Yes. If a worker crashes, Faktory re-enqueues any jobs that worker had reserved once their job reservations time out. This is treated identically to a FAIL.
