CC-1059 Changed ES connector to use exponential backoff with jitter #116
Conversation
With lots of ES connector tasks all hitting the ES backend, when the ES backend becomes overloaded all of the tasks will experience timeouts (possibly at nearly the same time) and thus retry. Prior to this change, all tasks used the same constant backoff time and would therefore all retry at about the same point in time, possibly overwhelming the ES backend again. This is known as a thundering herd, and when many attempts fail it takes a long time and many attempts to recover.

A solution to this problem is to use exponential backoff to give the ES backend time to recover, except that exponential backoff alone doesn't really reduce the thundering herd problem. To solve both problems we use exponential backoff with jitter, which randomizes the sleep time for each retry attempt. This PR adds exponential backoff with jitter.

The new algorithm computes the normal maximum time to wait for a particular retry attempt using exponential backoff, and then chooses a random value between the `retry.backoff.ms` initial backoff value and that maximum time. Since the exponential calculation breaks down after a large number of retry attempts, rather than adding a constraint on `max.retries` this change simply uses a practical (and arbitrary) absolute upper limit on the backoff time of 24 hours, and it logs a warning if this upper limit is exceeded.
This supersedes the previous PRs. Note that this will not merge cleanly onto the […]
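For illustration, a minimal sketch of such a computation might look like the following. The class and method names here are hypothetical and this is not the actual code added by the PR:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of exponential backoff with jitter as described above:
// the upper bound grows exponentially with the retry attempt (capped at 24h),
// and the actual wait is a random value between retry.backoff.ms and that bound.
public class RetryBackoffSketch {

    private static final long MAX_RETRY_BACKOFF_MS = TimeUnit.HOURS.toMillis(24);

    public static long computeRetryWaitTimeMs(int retryAttempt, long initialBackoffMs) {
        if (initialBackoffMs <= 0) {
            return 0L;
        }
        // Exponential upper bound: initialBackoffMs * 2^retryAttempt, capped at 24 hours.
        long maxBackoffMs = initialBackoffMs * (1L << Math.min(retryAttempt, 32));
        if (maxBackoffMs <= 0 || maxBackoffMs > MAX_RETRY_BACKOFF_MS) {
            // A real implementation would log a warning here that the cap was exceeded.
            maxBackoffMs = MAX_RETRY_BACKOFF_MS;
        }
        if (maxBackoffMs <= initialBackoffMs) {
            return initialBackoffMs;
        }
        // Jitter: uniform random value in [initialBackoffMs, maxBackoffMs).
        return ThreadLocalRandom.current().nextLong(initialBackoffMs, maxBackoffMs);
    }
}
```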
Looks good overall!
I wonder whether after each retry we should increase the minimum backoff as well. Example of ranges between min and max among which we pick a random value:
- 1st retry: [0.5, 1]
- 2nd retry: [1, 2]
- 3rd retry: [2, 4]

I don't know if you've seen evidence that jitter works better when the range is not sliding but just growing. But if it's not sliding, then there's a good chance that for a few retries you won't back off enough ... Just a thought. (A rough sketch of this sliding-range idea follows below.)
Based on which policy you'll end up implementing, the assertion in the unit test might need some fixing.
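For comparison, here is a hypothetical sketch of the sliding-range variant suggested above, assuming milliseconds and a 0-based retry attempt; it is not code from this PR and omits any cap or overflow handling:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the "sliding" variant: both the lower and the upper
// bound double on each retry, e.g. with a 500 ms initial backoff the ranges
// are [0.5s, 1s], [1s, 2s], [2s, 4s], ...
public class SlidingBackoffSketch {

    public static long slidingRetryWaitTimeMs(int retryAttempt, long initialBackoffMs) {
        if (initialBackoffMs <= 0) {
            return 0L;
        }
        long lowerMs = initialBackoffMs << retryAttempt;  // initial, 2x, 4x, ...
        long upperMs = lowerMs * 2;
        return ThreadLocalRandom.current().nextLong(lowerMs, upperMs);
    }
}
```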
As referenced in the JavaDocs of the new […]

So, I'm disinclined to use anything more complicated than the jitter approach used in this PR, especially if there has been no testing or comparison of the approaches.
The current logic reuses the `retry.backoff.ms` value as the lower bound of the random range.

One variation I did consider was using 0 as the lower bound for the random calculation rather than the retry backoff time. IOW, […]

The upper limit would still be the exponential backoff calculated using the `retry.backoff.ms` value and the retry attempt. Given the use of jitter, I'm not sure the non-zero lower bound is necessary, and in fact it may be counter-productive as it might artificially and unnecessarily increase the delay beyond what is needed by the ES backend. Using zero for the lower bound allows the retry logic to sometimes wait for less than the configured `retry.backoff.ms`.
I've implemented the 0 lower bound as mentioned above, and this definitely results in a more even distribution of retry backoff times. For example, when using a lower bound of […]

However, using 0 as the lower bound means that the backoff time for the first retry is a random number in the range […]

A good improvement, IMO.
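As a rough illustration of this change, the only difference from the earlier sketch would be passing 0 as the lower bound of the random range; again, this is hypothetical code, not the PR's actual implementation:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the 0-lower-bound variant: the wait is chosen uniformly
// from [0, upper bound), where the upper bound is still the exponential backoff
// derived from retry.backoff.ms and capped at 24 hours.
public class ZeroLowerBoundBackoffSketch {

    private static final long MAX_RETRY_BACKOFF_MS = TimeUnit.HOURS.toMillis(24);

    public static long computeRetryWaitTimeMs(int retryAttempt, long initialBackoffMs) {
        if (initialBackoffMs <= 0) {
            return 0L;
        }
        long maxBackoffMs = initialBackoffMs * (1L << Math.min(retryAttempt, 32));
        if (maxBackoffMs <= 0 || maxBackoffMs > MAX_RETRY_BACKOFF_MS) {
            maxBackoffMs = MAX_RETRY_BACKOFF_MS;  // overflow or over the cap
        }
        // For the first retry (retryAttempt == 0) this yields a value in [0, initialBackoffMs).
        return ThreadLocalRandom.current().nextLong(0, maxBackoffMs);
    }
}
```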
Great.
Yes, I've read the blog post but didn't dive into the details of the different policies, and keeping it simple is nice here. Starting with a random interval from the 1st retry also sounds better. Thanks for the PR.