
pseudo random jitter can cause dead lock #4316

Closed
2 tasks done
codingjoe opened this issue Oct 7, 2017 · 3 comments

Comments

@codingjoe

codingjoe commented Oct 7, 2017

Checklist

  • I have included the output of celery -A proj report in the issue.
    (if you are not able to do this, then at least specify the Celery
    version affected).
  • I have verified that the issue exists against the master branch of Celery.

Versions affected: None (not yet released)
On master: Yes, added in 0d5b840

Steps to reproduce

celery/celery/utils/time.py

Lines 390 to 404 in d59518f

import random  # imported at the top of celery/utils/time.py

def get_exponential_backoff_interval(
    factor,
    retries,
    maximum,
    full_jitter=False
):
    """Calculate the exponential backoff wait time."""
    # Will be zero if factor equals 0
    countdown = factor * (2 ** retries)
    # Full jitter according to
    # https://www.awsarchitectureblog.com/2015/03/backoff.html
    if full_jitter:
        countdown = random.randrange(countdown + 1)
    # Adjust according to maximum wait time and account for negative values.
    return max(0, min(maximum, countdown))

Imagine you have two workers that pull the same task from a queue at the same time and both fail, due to a lock, for example. If you retry with backoff, the jitter is supposed to keep those tasks from being executed at the same time again, thus preventing a deadlock. But random.seed defaults to the current system time if no better entropy source is available. This means that if two workers fail within the same millisecond, they can in fact end up deadlocked: the seed is the same, so the jitter is the same.
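
A minimal sketch of that scenario, assuming both workers happen to seed their RNG from the same timestamp; the seed and countdown values are made up for illustration:

import random

SHARED_SEED = 1507370400000   # hypothetical: the same epoch-millisecond on both workers
COUNTDOWN = 64                # hypothetical backoff before jitter

def worker_jitter(seed):
    rng = random.Random(seed)          # identical seed -> identical stream
    return rng.randrange(COUNTDOWN + 1)

# Both workers compute exactly the same delay and collide again on retry.
assert worker_jitter(SHARED_SEED) == worker_jitter(SHARED_SEED)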

Expected behavior

In Python 3.6 we have secrets, which would prevent this. In Python 2 we have to find an alternative. If we have a unique identifier for each worker, we could mix it into the epoch timestamp to form a new, unique seed.
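
A rough sketch of what such a fix could look like (not necessarily what Celery should ship; the countdown and maximum values are placeholders):

import random
import secrets  # Python 3.6+

countdown = 64    # hypothetical backoff before jitter
maximum = 3600    # hypothetical maximum wait time

# Option 1: secrets draws from the OS CSPRNG, so workers cannot share a seed.
jitter_a = secrets.randbelow(countdown + 1)

# Option 2: random.SystemRandom also reads os.urandom and exists on Python 2 and 3.
jitter_b = random.SystemRandom().randrange(countdown + 1)

print(max(0, min(maximum, jitter_a)))
print(max(0, min(maximum, jitter_b)))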

Actual behavior

Two or more tasks could be deadlocked because of the pseudo-random jitter.

codingjoe referenced this issue Oct 7, 2017
* Add options for exponential backoff with task autoretry

* Add test for exponential backoff

* closer to a fixed test

* Move autoretry backoff functionality inside run wrapper

* Add a test for jitter

* Correct for semantics of `random.randrange()`

`random.randrange()` treats the argument it receives as just *outside* the
bound of possible return values. For example, if you call
`random.randrange(2)`, you might get 0 or 1, but you'll never get 2.
Since we want to allow the `retry_jitter` parameter to occasionally apply no
jitter at all, we need to add one to the value we pass to `randrange()`,
so that there's a chance that we receive that original value back (see the
sketch after this commit list).

* Put side_effect on patch lines

* Fix flake8

* Add celery.utils.time.get_exponential_backoff_interval

* Use exponential backoff calculation from utils in task

* Update docs around retry_jitter

* Remove unnecessary random.choice patching

* Update task auto-retry documentation

* PEP8: remove unused import

* PEP8: remove trailing whitespace

* PEP8: Fix E123 warning
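
A small sketch of the randrange() semantics described in the "Correct for semantics of `random.randrange()`" commit message above (values chosen only for illustration):

import random

# randrange(2) yields 0 or 1, never 2: the argument is an exclusive upper bound.
assert all(random.randrange(2) in (0, 1) for _ in range(1000))

# Passing countdown + 1 makes countdown itself a possible result, so the
# jitter can occasionally leave the computed delay unchanged.
countdown = 4
assert all(0 <= random.randrange(countdown + 1) <= countdown for _ in range(1000))
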
@georgepsarakis
Contributor

As far as I understand from the Python random module documentation, the majority of systems support os.urandom, which is used internally by random.seed. The race condition you are describing sounds possible, but it seems highly improbable to me. If you have observed this behavior in an actual deployment of Celery, please describe the exact details (OS, Python version, task example).
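
For example, on a platform where os.urandom is available, two generators created without an explicit seed are seeded from it and diverge immediately (a quick sketch, output will of course vary):

import random

rng_a = random.Random()   # no explicit seed: entropy comes from os.urandom when available
rng_b = random.Random()

print([rng_a.randrange(100) for _ in range(5)])
print([rng_b.randrange(100) for _ in range(5)])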

@codingjoe
Author

I agree, this is a super duper edge case ;) I haven't run into it yet either, but if you have billions of tasks it might just happen.
Feel free to close it.

@auvipy
Member

auvipy commented Aug 14, 2018

will reopen if the issue has a real effect.
