
CC-1059 Changed ES connector to use exponential backoff with jitter #116

Merged
merged 4 commits into confluentinc:3.3.x on Sep 8, 2017

Conversation

rhauch
Member

@rhauch rhauch commented Sep 6, 2017

With lots of ES connector tasks all hitting the ES backend, when the ES backend becomes overloaded all of the tasks will experience timeouts (possibly at nearly the same time) and thus retry. Prior to this change, all tasks used the same constant backoff time and would thus all retry at about the same point in time, possibly overwhelming the ES backend again. This is known as a thundering herd, and when many attempts fail it takes a long time and many retries to recover.

A solution to this problem is to use exponential backoff to give the ES backend time to recover, except that this alone doesn't really reduce the thundering herd problem. To solve both problems we use exponential backoff with jitter, which randomizes the sleep time for each attempt. This PR adds exponential backoff with jitter.

This new algorithm computes the normal maximum time to wait for a particular retry attempt using exponential backoff, and then chooses a random value less than that maximum time.

Since this exponential algorithm breaks down after a large number of retry attempts, rather than adding a constraint on `max.retries` this change simply uses a practical (and arbitrary) absolute upper limit on the backoff time of 24 hours, and it logs a warning if this upper limit is exceeded.
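
To make the description concrete, here is a minimal sketch of the computation described above; the class and method names (and the overflow guard) are illustrative and are not the actual `RetryUtil` API:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class ExponentialBackoffWithJitter {

    // Practical absolute upper limit on any single backoff, as described above (24 hours).
    private static final long MAX_RETRY_TIME_MS = TimeUnit.HOURS.toMillis(24);

    // Compute how long to wait before the given retry attempt (0-based):
    // exponential backoff capped at 24 hours, then "full jitter" by picking
    // a uniformly random value no larger than that exponential maximum.
    public static long computeRetryWaitTimeMs(int retryAttempt, long initialRetryBackoffMs) {
        if (initialRetryBackoffMs <= 0) {
            return 0;
        }
        // Exponential maximum: initial * 2^attempt, guarding against long overflow
        // for very large attempt numbers and capping at the 24-hour limit.
        long maxBackoffMs;
        if (retryAttempt >= 0 && retryAttempt < Long.numberOfLeadingZeros(initialRetryBackoffMs)) {
            maxBackoffMs = Math.min(MAX_RETRY_TIME_MS, initialRetryBackoffMs << retryAttempt);
        } else {
            // The PR also logs a warning when this cap is hit; omitted in this sketch.
            maxBackoffMs = MAX_RETRY_TIME_MS;
        }
        // Full jitter: random wait time up to (and including) the exponential maximum.
        return ThreadLocalRandom.current().nextLong(0, maxBackoffMs + 1);
    }
}
```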

@rhauch
Member Author

rhauch commented Sep 6, 2017

This supersedes the previous PRs. Note that this will not merge cleanly onto the 3.4.x branch.

Member

@kkonstantine kkonstantine left a comment


Looks good overall!

I wonder whether after each retry we should increase the minimum backoff as well. Example of ranges between min and max from which we pick a random value:

  • 1st retry: [0.5, 1]
  • 2nd retry: [1,2]
  • 3rd retry: [2,4]

I don't know if you've seen evidence that jitter works better when the range is not sliding but just growing. But if it's not sliding, then there's a good chance that for a few retries you won't back off enough ... Just a thought.
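
A hypothetical sketch of that sliding-range variant (not what this PR implements), assuming an initial backoff of 500 ms to reproduce the ranges above:

```java
import java.util.concurrent.ThreadLocalRandom;

public class SlidingRangeBackoff {

    // Sliding-range jitter: for the n-th retry pick uniformly from
    // [initial * 2^(n-1), initial * 2^n], so the minimum wait also doubles each time.
    // With initial = 500 ms: 1st retry in [0.5s, 1s], 2nd in [1s, 2s], 3rd in [2s, 4s].
    // Assumes retryAttempt >= 1; overflow guarding and an upper cap are omitted for brevity.
    public static long slidingBackoffMs(int retryAttempt, long initialBackoffMs) {
        long min = initialBackoffMs << (retryAttempt - 1); // lower bound doubles each retry
        long max = min << 1;                               // upper bound is twice the lower bound
        return ThreadLocalRandom.current().nextLong(min, max + 1);
    }
}
```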

Based on which policy you'll end up implementing, the assertion in the unit test might need some fixing.

@rhauch
Copy link
Member Author

rhauch commented Sep 7, 2017

As referenced in the JavaDocs of the new RetryUtil, the basic logic is similar to the "full jitter" approach described in this AWS blog post. That post also describes two other jitter approaches and how they perform differently. Interestingly, the simple "full jitter" case performed quite well and is simple and intuitive. And, as the post shows, using "full jitter" completes the work in significantly less time and with fewer total requests (less than half) compared to simple exponential backoff. Adding jitter to exponential backoff is clearly better than exponential backoff alone, but there are different ways of adding jitter, and the benefits of the more complex jitter approaches do not seem to outweigh the additional complexity.

So, I'm disinclined to use anything more complicated than the jitter approach used in this PR, especially given that there has been no testing or comparison of the alternative approaches.

@rhauch
Member Author

rhauch commented Sep 7, 2017

The current logic reuses the retry.backoff.ms as an initial delay, effectively computing the delay before retrying as

random(retryBackoffMs, exponentialBackoffMs)

One variation I did consider was using 0 as the lower bound for the random calculation rather than the retry backoff time. IOW,

random(0, exponentialBackoffMs)

The upper limit would still be the exponential backoff calculated using the retry.backoff.ms.

Given the use of jitter, I'm not sure the non-zero lower bound is necessary, and in fact it may be counter-productive: it might artificially and unnecessarily increase the delay beyond what the ES backend needs. Using zero for the lower bound allows the retry logic to sometimes wait for less than the retry.backoff.ms minimum. The big advantage of using 0 as the lower bound is that the backoff logic is a bit easier to understand and to control, since retry.backoff.ms is used in only one way rather than two.
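
For illustration only, the two lower-bound choices side by side (hypothetical helper names, with `exponentialBackoffMs` assumed to be the capped exponential maximum for the current attempt):

```java
import java.util.concurrent.ThreadLocalRandom;

public class LowerBoundVariants {

    // Variant with a minimum wait: random(retryBackoffMs, exponentialBackoffMs).
    // The wait is never shorter than retry.backoff.ms.
    static long withMinimumBackoff(long retryBackoffMs, long exponentialBackoffMs) {
        return ThreadLocalRandom.current().nextLong(retryBackoffMs, exponentialBackoffMs + 1);
    }

    // Variant discussed here: random(0, exponentialBackoffMs).
    // retry.backoff.ms is then used only to seed the exponential upper bound.
    static long withZeroLowerBound(long exponentialBackoffMs) {
        return ThreadLocalRandom.current().nextLong(0, exponentialBackoffMs + 1);
    }
}
```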

@rhauch
Member Author

rhauch commented Sep 7, 2017

I've implemented the 0 lower bound as mentioned above, and this definitely results in a more even distribution of retry backoff times. For example, when using a lower bound of retry.backoff.ms, the backoff time before the first retry was always a constant value (for the first attempt the lower bound and the exponential maximum coincide). Obviously if this happens to many different tasks at once, it might still create a thundering herd, at least for the first retry.

However, using 0 as the lower bound means that the backoff time for the first retry is a random number in the range [0, ${retry.backoff.ms}], which means even the first retries of many tasks are more evenly distributed. (Yes, many of these might still fail if the ES backend is still overloaded, but some may succeed, and distributing the retries might prevent the ES backend from experiencing a sudden cluster of requests at the same time.)
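
A tiny illustrative check (not part of the PR) of how the first-retry wait spreads out across tasks with the 0 lower bound, assuming an example `retry.backoff.ms` of 100 ms:

```java
import java.util.concurrent.ThreadLocalRandom;

public class FirstRetrySpread {
    public static void main(String[] args) {
        long retryBackoffMs = 100; // example value for retry.backoff.ms
        // Simulate the first retry for ten tasks: each picks a random wait in [0, 100] ms,
        // so the retries no longer all line up at exactly the same instant.
        for (int task = 0; task < 10; task++) {
            long waitMs = ThreadLocalRandom.current().nextLong(0, retryBackoffMs + 1);
            System.out.println("task " + task + ": first retry after " + waitMs + " ms");
        }
    }
}
```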

A good improvement, IMO.

Member

@kkonstantine kkonstantine left a comment


Great.

Yes, I've read the blog but didn't dive into the details of the different policies, and keeping it simple is nice here. Starting with a random interval from the 1st retry also sounds better. Thanks for the PR.

@rhauch rhauch merged commit 6990d14 into confluentinc:3.3.x Sep 8, 2017
@rhauch rhauch mentioned this pull request Sep 29, 2017