
CC-1059 Changed ES connector to use exponential backoff with jitter #115

Closed
wants to merge 1 commit

Conversation

@rhauch (Member) commented Sep 5, 2017

With many ES connector tasks all hitting the ES backend, when the backend becomes overloaded all of the tasks will experience timeouts (possibly at nearly the same time) and thus retry. Prior to this change, all tasks used the same constant backoff time and would therefore all retry at about the same point in time, possibly overwhelming the ES backend again. This is known as a thundering herd, and when many attempts fail, recovery takes a long time and many attempts.

A solution to this problem is to use exponential backoff to give the ES backend time to recover, except that this alone doesn’t really reduce the thundering herd problem. To solve both problems we use exponential backoff with jitter, which randomizes the sleep time for each attempt. This PR adds exponential backoff with jitter.

We need at least two configuration parameters to control this approach. Since `retry.backoff.ms` already exists, we can co-opt it to define the minimum time to wait during retries, but we need another configuration property to define the maximum time to wait. (We could expose parameters to control the exponential part of the equation, but that’s unnecessarily complicated.) The new algorithm computes the normal exponential backoff based upon `retry.backoff.ms` (the minimum value) and the `max.retry.backoff.ms` value, bounds the result to be within these two values, and then chooses a random value within that range.

Note that to maintain backward compatibility, we always wait for exactly the minimum time if it equals or exceeds the maximum time to wait. This can happen if an existing connector configuration defines `retry.backoff.ms` but doesn’t set the newer `max.retry.backoff.ms`. The default value of `max.retry.backoff.ms` is 10 seconds, which should be larger than most `retry.backoff.ms` values in use.

This PR is for 3.4.x and master; see #114 for the PR for 3.3.x.

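The scheme described above could be sketched roughly as follows. This is a hypothetical illustration, not the connector's actual code: the class and method names are invented, and the doubling of the bound per attempt is an assumption, since the description doesn't spell out the exponent.

```java
import java.util.concurrent.ThreadLocalRandom;

public class RetryUtil {

    /**
     * Compute the time to wait before the given retry attempt, using
     * exponential backoff with jitter (illustrative sketch).
     *
     * @param retryAttempt      zero-based retry attempt number
     * @param retryBackoffMs    minimum wait, i.e. the existing retry.backoff.ms
     * @param maxRetryBackoffMs maximum wait, i.e. the new max.retry.backoff.ms
     */
    public static long computeRetryWaitTimeMs(int retryAttempt,
                                              long retryBackoffMs,
                                              long maxRetryBackoffMs) {
        // Backward compatibility: if the minimum equals or exceeds the
        // maximum, always wait exactly the minimum time.
        if (retryBackoffMs >= maxRetryBackoffMs) {
            return retryBackoffMs;
        }
        // Exponential growth of the upper bound (doubling per attempt here),
        // with the shift capped to avoid overflow on large attempt counts.
        int shift = Math.min(Math.max(retryAttempt, 0), 32);
        long exp = retryBackoffMs << shift;
        if (exp <= 0 || exp > maxRetryBackoffMs) {
            exp = maxRetryBackoffMs;
        }
        // Jitter: pick a uniformly random wait between the minimum and the
        // exponentially growing, bounded maximum.
        return retryBackoffMs
                + ThreadLocalRandom.current().nextLong(exp - retryBackoffMs + 1);
    }
}
```

With `retry.backoff.ms = 100` and `max.retry.backoff.ms = 10000`, attempt 3 would sleep a random time in [100, 800] ms, while a config with a minimum at or above the maximum always sleeps exactly the minimum, matching the compatibility behavior described above.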
@kkonstantine (Member) left a comment


Can't see immediately why a separate PR was required for 3.4.x/master vs 3.3.x. I'd think the rebasing pain wouldn't be that high, but I might be forgetting something.

For the moment, I've left my comments on #114 and they should be ported here if you choose to merge using two PRs.

@rhauch (Member, Author) commented Sep 6, 2017

I'll address your comments in both PRs. Unfortunately, a commit in 3.4.x reformatted the bulk of `ElasticsearchSinkConnectorConfig.java`, so a single PR cannot be cleanly merged onto the 3.4.x and 3.3.x branches.

@kkonstantine (Member)

Yes, these are the changes introduced by checkstyle. Up to you. I had to adjust during rebasing in the past, but I've noticed that if the new changes are done with the new style in mind, maybe it's not too bad. The problem with the other PR is that the new additions do not conform to the new style (or that's what I thought while reviewing).

@rhauch (Member, Author) commented Sep 6, 2017

I can backport the formatting changes, and then try to come up with a single PR. BTW, the other PR passes the checkstyle validation.

@kkonstantine (Member)

Whatever is easier for you. Which checkstyle validation is passing? There are no rich checkstyle rules (if any) in 3.3.x. I think I noticed a case of `var=val` that checkstyle, if it were working, should have warned about: no whitespace around `=`. (But this is also in src/test.)

@rhauch (Member, Author) commented Sep 6, 2017

Superseded by #116

@rhauch closed this Sep 6, 2017