CC-1059 Changed ES connector to use exponential backoff with jitter #115
Conversation
With lots of ES connector tasks all hitting the ES backend, when the backend becomes overloaded all of the tasks will experience timeouts (possibly at nearly the same time) and thus retry. Prior to this change, all tasks used the same constant backoff time and would therefore all retry at about the same point in time, possibly overwhelming the ES backend again. This is known as a thundering herd, and when many attempts fail it takes a long time and many attempts to recover.

A solution is to use exponential backoff to give the ES backend time to recover, except that exponential backoff alone doesn't really reduce the thundering herd problem. To solve both problems we use exponential backoff with jitter, which randomizes the sleep time for each attempt. This PR adds exponential backoff with jitter.

We need at least two configuration parameters to control this approach. Since `retry.backoff.ms` already exists, we can co-opt it to define the minimum time to wait during retries, but we need another configuration property to define the maximum time to wait. (We could expose parameters to control the exponential part of the equation, but that's unnecessarily complicated.) The new algorithm computes the normal exponential backoff based upon `retry.backoff.ms` (the minimum value) and the `max.retry.backoff.ms` value, bounds the result to lie within these two values, and then chooses a random value within that range.

Note that to maintain backward compatibility, we always wait for exactly the minimum time if it equals or exceeds the maximum time to wait. This might happen if an existing connector configuration defines `retry.backoff.ms` but doesn't set the newer `max.retry.backoff.ms`. The default value of `max.retry.backoff.ms` is 10 seconds, which is hopefully larger than most `retry.backoff.ms` values in use.
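The bounded-backoff-plus-jitter computation described above can be sketched roughly as follows. This is an illustrative Java sketch only; the method and parameter names (`computeRetryWaitTimeMs`, `retryBackoffMs`, `maxRetryBackoffMs`) are assumptions for clarity, not the connector's actual code.

```java
import java.util.concurrent.ThreadLocalRandom;

public class RetryUtil {

    /**
     * Compute how long to sleep before the given (0-based) retry attempt,
     * using exponential backoff bounded by [retryBackoffMs, maxRetryBackoffMs]
     * and then choosing a uniformly random value within that range (jitter).
     */
    public static long computeRetryWaitTimeMs(int attempt,
                                              long retryBackoffMs,
                                              long maxRetryBackoffMs) {
        if (retryBackoffMs < 0) {
            retryBackoffMs = 0;
        }
        // Backward compatibility: if the minimum equals or exceeds the maximum,
        // wait exactly the minimum, as the old constant-backoff behavior did.
        if (retryBackoffMs >= maxRetryBackoffMs) {
            return retryBackoffMs;
        }
        // Exponential upper bound: min * 2^attempt, capped at the maximum.
        // Large attempt counts would overflow the shift, so cap those at the max.
        long expBound = attempt > 32 ? maxRetryBackoffMs : retryBackoffMs << attempt;
        if (expBound <= 0 || expBound > maxRetryBackoffMs) {
            expBound = maxRetryBackoffMs;
        }
        // Jitter: pick a random wait between the minimum and the exponential bound.
        return ThreadLocalRandom.current().nextLong(retryBackoffMs, expBound + 1);
    }
}
```

Because the random range widens with each attempt, retries from different tasks spread out over time instead of landing together, which is what breaks up the herd.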
Can't see immediately why a separate PR was required for 3.4.x/master vs 3.3.x. I'd think the rebasing pain wouldn't be that high, but I might be forgetting something.
For the moment, I've left my comments on #114 and they should be ported here if you choose to merge using two PRs.
I'll address your comments in both PRs. Unfortunately, a commit in 3.4.x reformatted the bulk of the
Yes, these are the changes introduced by checkstyle. Up to you. I had to adjust during rebasing in the past, but I've noticed that if the new changes are done with the new style in mind, maybe it's not too bad. The problem with the other PR is that the new additions do not conform to the new style (or that's what I thought while reviewing).
I can backport the formatting changes and then try to come up with a single PR. BTW, the other PR passes the checkstyle validation.
Whatever is easier for you. Which checkstyle validation is passing? There are no rich checkstyle rules (if any) in 3.3.x. I think I noticed a case of
Superseded by #116
This PR is for `3.4.x` and `master`; see #114 for the PR for `3.3.x`.