CC-1059 Changed ES connector to use exponential backoff with jitter #115
Conversation
With lots of ES connector tasks all hitting the ES backend, when the backend becomes overloaded all of the tasks will experience timeouts (possibly at nearly the same time) and thus retry. Prior to this change, all tasks used the same constant backoff time and would therefore all retry at about the same point in time, possibly overwhelming the ES backend again. This is known as a thundering herd, and when many attempts fail it takes a long time and many attempts to recover.

A solution is to use exponential backoff to give the ES backend time to recover, except that exponential backoff alone doesn't really reduce the thundering herd problem. To solve both problems we use exponential backoff with jitter, which randomizes the sleep time for each attempt. This PR adds exponential backoff with jitter.

We need at least two configuration parameters to control this approach. Since `retry.backoff.ms` already exists, we can co-opt it to define the minimum time to wait during retries, but we need another configuration property to define the maximum time to wait. (We could expose parameters to control the exponential part of the equation, but that's unnecessarily complicated.) The new algorithm computes the normal exponential backoff based upon `retry.backoff.ms` (the minimum value) and the `max.retry.backoff.ms` value, bounds the result to lie within these two values, and then chooses a random value within that range.

Note that to maintain backward compatibility, we always wait for exactly the minimum time if it equals or exceeds the maximum time to wait. This might happen if an existing connector configuration defines `retry.backoff.ms` but doesn't set the newer `max.retry.backoff.ms`. The default value of `max.retry.backoff.ms` is 10 seconds, which is hopefully larger than most `retry.backoff.ms` values in use.
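The bounded-backoff-plus-jitter computation described above can be sketched roughly as follows. This is an illustrative Java sketch only; the method and parameter names (`computeRetryWaitTimeMs`, `retryBackoffMs`, `maxRetryBackoffMs`) are assumptions for clarity, not the connector's actual code.

```java
import java.util.concurrent.ThreadLocalRandom;

public class RetryUtil {

    /**
     * Compute how long to sleep before the given (0-based) retry attempt,
     * using exponential backoff bounded by [retryBackoffMs, maxRetryBackoffMs]
     * and then choosing a uniformly random value within that range (jitter).
     */
    public static long computeRetryWaitTimeMs(int attempt,
                                              long retryBackoffMs,
                                              long maxRetryBackoffMs) {
        if (retryBackoffMs < 0) {
            retryBackoffMs = 0;
        }
        // Backward compatibility: if the minimum equals or exceeds the maximum,
        // wait exactly the minimum, as the old constant-backoff behavior did.
        if (retryBackoffMs >= maxRetryBackoffMs) {
            return retryBackoffMs;
        }
        // Exponential upper bound: min * 2^attempt, capped at the maximum.
        // Large attempt counts would overflow the shift, so cap those at the max.
        long expBound = attempt > 32 ? maxRetryBackoffMs : retryBackoffMs << attempt;
        if (expBound <= 0 || expBound > maxRetryBackoffMs) {
            expBound = maxRetryBackoffMs;
        }
        // Jitter: pick a random wait between the minimum and the exponential bound.
        return ThreadLocalRandom.current().nextLong(retryBackoffMs, expBound + 1);
    }
}
```

Because the random range widens with each attempt, retries from different tasks spread out over time instead of landing together, which is what breaks up the herd.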
Can't see immediately why a separate PR was required for 3.4.x/master vs 3.3.x. I'd think the rebasing pain wouldn't be that high, but I might be forgetting something.
For the moment, I've left my comments on #114 and they should be ported here if you choose to merge using two PRs.
I'll address your comments in both PRs. Unfortunately, a commit in 3.4.x reformatted the bulk of the
Yes, these are the changes introduced by checkstyle. Up to you. I had to adjust during rebasing in the past, but I've noticed that if the new changes are done with the new style in mind, maybe it's not too bad. The problem with the other PR is that the new additions do not conform to the new style (or that's what I thought while reviewing).
I can backport the formatting changes and then try to come up with a single PR. BTW, the other PR passes the checkstyle validation.
Whatever is easier for you. Which checkstyle validation is passing? There are no rich checkstyle rules (if any) in 3.3.x. I think I noticed a case of
Superseded by #116
This PR is for `3.4.x` and `master`; see #114 for the PR for `3.3.x`.