
CC-1059 Changed ES connector to use exponential backoff with jitter #116

Merged
merged 4 commits into confluentinc:3.3.x on Sep 8, 2017

Conversation

rhauch
Member

@rhauch rhauch commented Sep 6, 2017

With lots of ES connector tasks all hitting the ES backend, when the ES backend becomes overloaded all of the tasks will experience timeouts (possibly at nearly the same time) and thus retry. Prior to this change, all tasks used the same constant backoff time and would thus all retry at about the same point in time, possibly overwhelming the ES backend again. This is known as a thundering herd, and when many attempts fail it takes a long time and many retries to recover.

A solution to this problem is to use exponential backoff to give the ES backend time to recover, except that this alone doesn't really reduce the thundering herd problem. To solve both problems we use exponential backoff with jitter, which randomizes the sleep time for each attempt. This PR adds exponential backoff with jitter.

This new algorithm computes the normal maximum time to wait for a particular retry attempt using exponential backoff, and then chooses a random value less than that maximum time.

Since this exponential algorithm breaks down after a large number of retry attempts, rather than adding a constraint on `max.retries` this change simply uses a practical (and arbitrary) absolute upper limit on the backoff time of 24 hours, and it logs a warning if this upper limit is exceeded.
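
To make the description concrete, here is a minimal sketch of the computation described above; the class and method names (and the overflow guard) are illustrative and are not the actual `RetryUtil` API:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class ExponentialBackoffWithJitter {

    // Practical absolute upper limit on any single backoff, as described above (24 hours).
    private static final long MAX_RETRY_TIME_MS = TimeUnit.HOURS.toMillis(24);

    // Compute how long to wait before the given retry attempt (0-based):
    // exponential backoff capped at 24 hours, then "full jitter" by picking
    // a uniformly random value no larger than that exponential maximum.
    public static long computeRetryWaitTimeMs(int retryAttempt, long initialRetryBackoffMs) {
        if (initialRetryBackoffMs <= 0) {
            return 0;
        }
        // Exponential maximum: initial * 2^attempt, guarding against long overflow
        // for very large attempt numbers and capping at the 24-hour limit.
        long maxBackoffMs;
        if (retryAttempt >= 0 && retryAttempt < Long.numberOfLeadingZeros(initialRetryBackoffMs)) {
            maxBackoffMs = Math.min(MAX_RETRY_TIME_MS, initialRetryBackoffMs << retryAttempt);
        } else {
            // The PR also logs a warning when this cap is hit; omitted in this sketch.
            maxBackoffMs = MAX_RETRY_TIME_MS;
        }
        // Full jitter: random wait time up to (and including) the exponential maximum.
        return ThreadLocalRandom.current().nextLong(0, maxBackoffMs + 1);
    }
}
```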

@rhauch
Member Author

rhauch commented Sep 6, 2017

This supersedes the previous PRs. Note that this will not merge cleanly onto the 3.4.x branch.

Member

@kkonstantine kkonstantine left a comment


Looks good overall!

I wonder whether after each retry we should increase the minimum backoff as well. Example of ranges between min and max from which we pick a random value:

  • 1st retry: [0.5, 1]
  • 2nd retry: [1,2]
  • 3rd retry: [2,4]

I don't know if you've seen evidence that jitter works better when the range is not sliding but just growing. But if it's not sliding, then there's a good chance that for a few retries you won't back off enough ... Just a thought.
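
A hypothetical sketch of that sliding-range variant (not what this PR implements), assuming an initial backoff of 500 ms to reproduce the ranges above:

```java
import java.util.concurrent.ThreadLocalRandom;

public class SlidingRangeBackoff {

    // Sliding-range jitter: for the n-th retry pick uniformly from
    // [initial * 2^(n-1), initial * 2^n], so the minimum wait also doubles each time.
    // With initial = 500 ms: 1st retry in [0.5s, 1s], 2nd in [1s, 2s], 3rd in [2s, 4s].
    // Assumes retryAttempt >= 1; overflow guarding and an upper cap are omitted for brevity.
    public static long slidingBackoffMs(int retryAttempt, long initialBackoffMs) {
        long min = initialBackoffMs << (retryAttempt - 1); // lower bound doubles each retry
        long max = min << 1;                               // upper bound is twice the lower bound
        return ThreadLocalRandom.current().nextLong(min, max + 1);
    }
}
```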

Based on which policy you'll end up implementing, the assertion in the unit test might need some fixing.

@rhauch
Copy link
Member Author

rhauch commented Sep 7, 2017

As referenced in the JavaDocs of the new RetryUtil, the basic logic is similar to the "full jitter" approach described in this AWS blog post. That post also describes two other jitter approaches and how they perform differently. Interestingly, the simple "full jitter" case performed quite well and is simple and intuitive. And, as the post shows, using "full jitter" completes the work in significantly less time and with fewer total requests (less than half) compared to simple exponential backoff. Adding jitter to exponential backoff is clearly better than exponential backoff alone, but there are different ways of adding jitter, and the benefits of the more complex jitter approaches do not seem to outweigh the additional complexity.

So, I'm disinclined to use anything more complicated than the jitter approach used in this PR, especially given that there has been no testing or comparison of the alternative approaches.

@rhauch
Member Author

rhauch commented Sep 7, 2017

The current logic reuses the retry.backoff.ms as an initial delay, effectively computing the delay before retrying as

random(retryBackoffMs, exponentialBackoffMs)

One variation I did consider was using 0 as the lower bound for the random calculation rather than the retry backoff time. IOW,

random(0, exponentialBackoffMs)

The upper limit would still be the exponential backoff calculated using the retry.backoff.ms.

Given the use of jitter, I'm not sure the non-zero lower bound is necessary, and in fact it may be counter-productive: it might artificially and unnecessarily increase the delay beyond what the ES backend needs. Using zero for the lower bound allows the retry logic to sometimes wait for less than the retry.backoff.ms minimum. The big advantage of using 0 as the lower bound is that the backoff logic is a bit easier to understand and to control, since retry.backoff.ms is used in only one way rather than two.
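
For illustration only, the two lower-bound choices side by side (hypothetical helper names, with `exponentialBackoffMs` assumed to be the capped exponential maximum for the current attempt):

```java
import java.util.concurrent.ThreadLocalRandom;

public class LowerBoundVariants {

    // Variant with a minimum wait: random(retryBackoffMs, exponentialBackoffMs).
    // The wait is never shorter than retry.backoff.ms.
    static long withMinimumBackoff(long retryBackoffMs, long exponentialBackoffMs) {
        return ThreadLocalRandom.current().nextLong(retryBackoffMs, exponentialBackoffMs + 1);
    }

    // Variant discussed here: random(0, exponentialBackoffMs).
    // retry.backoff.ms is then used only to seed the exponential upper bound.
    static long withZeroLowerBound(long exponentialBackoffMs) {
        return ThreadLocalRandom.current().nextLong(0, exponentialBackoffMs + 1);
    }
}
```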

@rhauch
Member Author

rhauch commented Sep 7, 2017

I've implemented the 0 lower bound as mentioned above, and this definitely results in a more even distribution of retry backoff times. For example, when using a lower bound of retry.backoff.ms, the backoff time before the first retry was always a constant value (for the first attempt the lower bound and the exponential maximum coincide). Obviously if this happens to many different tasks at once, it might still create a thundering herd, at least for the first retry.

However, using 0 as the lower bound means that the backoff time for the first retry is a random number in the range [0, ${retry.backoff.ms}], which means even the first retries of many tasks are more evenly distributed. (Yes, many of these might still fail if the ES backend is still overloaded, but some may succeed, and distributing the retries might prevent the ES backend from experiencing a sudden cluster of requests at the same time.)
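
A tiny illustrative check (not part of the PR) of how the first-retry wait spreads out across tasks with the 0 lower bound, assuming an example `retry.backoff.ms` of 100 ms:

```java
import java.util.concurrent.ThreadLocalRandom;

public class FirstRetrySpread {
    public static void main(String[] args) {
        long retryBackoffMs = 100; // example value for retry.backoff.ms
        // Simulate the first retry for ten tasks: each picks a random wait in [0, 100] ms,
        // so the retries no longer all line up at exactly the same instant.
        for (int task = 0; task < 10; task++) {
            long waitMs = ThreadLocalRandom.current().nextLong(0, retryBackoffMs + 1);
            System.out.println("task " + task + ": first retry after " + waitMs + " ms");
        }
    }
}
```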

A good improvement, IMO.

Member

@kkonstantine kkonstantine left a comment


Great.

Yes, I've read the blog but didn't dive into the details of the different policies, and keeping it simple is nice here. Starting with a random interval from the 1st retry also sounds better. Thanks for the PR.

@rhauch rhauch merged commit 6990d14 into confluentinc:3.3.x Sep 8, 2017
@rhauch rhauch mentioned this pull request Sep 29, 2017