
common: jittered backoff implementation #3791

Merged: 5 commits into envoyproxy:master from fix/jitter_backoff on Jul 11, 2018

Conversation

ramaraochavali (Contributor)

Signed-off-by: Rama <rama.rao@salesforce.com>

Description:
Implements a fully jittered exponential backoff algorithm and refactors the router retry timer to use it.
Risk Level: Low
Testing: Added automated tests
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Rama <rama.rao@salesforce.com>
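Editor's note: for readers unfamiliar with the algorithm, here is a minimal sketch of "full jitter" as described in the PR description above. Parameter and function names are illustrative, not the exact API merged in this PR; each retry draws a uniformly random delay from an exponentially growing window, capped at a maximum.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>

// Full jitter in one function (illustrative names, not the PR's exact API):
// each retry samples uniformly from [0, min(cap, base * (2^retry - 1))).
// Assumes retry >= 1 and cap > 0 so the window is non-empty.
uint64_t fullJitterMs(uint32_t retry, uint64_t base, uint64_t cap, std::mt19937_64& rng) {
  const uint64_t window = std::min(cap, base * ((uint64_t{1} << retry) - 1));
  return rng() % window;
}
```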
@ramaraochavali (Contributor Author)

@htuch this is the follow-up PR for #3758 with the jittered implementation. PTAL.

@htuch (Member) left a comment


Awesome @ramaraochavali, a few comments and we can merge. Would be good to make this the default backoff for everything as per @mattklein123's earlier comment.

uint64_t JitteredBackOffStrategy::computeNextInterval() {
current_retry_++;
uint32_t multiplier = (1 << current_retry_) - 1;
uint64_t new_interval = random_.random() % (base_interval_ * multiplier);
Member

What are the behavioral tradeoffs between doing what you have here and something more like (1 << current_retry_) * base_interval + random.random() % some_fixed_interval? I.e. you still do exponential backoff, but just add a small jitter around the point that you back off to?

I think what you have is fine actually, reading https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/, but it would be worth exploring the tradeoff in the code comments here and in review.
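Editor's note: the two formulas under discussion look roughly like this (hypothetical free functions, not code from this PR). The first keeps the delay essentially deterministic and only randomizes a small additive term; the second randomizes the entire exponentially growing window.

```cpp
#include <cstdint>
#include <random>

// Exponential backoff plus a small additive jitter: the delay grows
// deterministically and only the last fixed_jitter_ms milliseconds vary.
// Assumes fixed_jitter_ms > 0.
uint64_t expPlusSmallJitterMs(uint32_t retry, uint64_t base_ms, uint64_t fixed_jitter_ms,
                              std::mt19937_64& rng) {
  return (uint64_t{1} << retry) * base_ms + rng() % fixed_jitter_ms;
}

// Full jitter, as in this PR: the whole window is random, which spreads
// concurrent retries out more but makes individual delays less predictable.
// Assumes retry >= 1 so the window is non-empty; no cap shown for brevity.
uint64_t fullJitterNoCapMs(uint32_t retry, uint64_t base_ms, std::mt19937_64& rng) {
  return rng() % (base_ms * ((uint64_t{1} << retry) - 1));
}
```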

@ramaraochavali (Contributor Author), Jul 10, 2018

I have compared both approaches; the one I have produces more randomness. Here is the output.
(1 << current_retry_) * base_interval + random.random() % some_fixed_interval (fixed interval = 100):
1082
2069
4079
8054
16010
30000

random_.random() % (base_interval_ * (1 << current_retry_) - 1)
102
1065
954
1084
7196
28940
23291
30000
9979


uint64_t JitteredBackOffStrategy::computeNextInterval() {
current_retry_++;
uint32_t multiplier = (1 << current_retry_) - 1;
Member

Nit: please use const for these intermediate variables.


public:
/**
* Use this constructor if max_interval need not be enforced.
Member

How come we need this? I think it is always sensible to add some bound.
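Editor's note: a hedged sketch of what the two constructor flavors under discussion might look like (names and the sentinel are assumptions, not the merged API). The "unbounded" variant is effectively a bound at the largest representable value, which is why a real cap always seems preferable.

```cpp
#include <cstdint>
#include <limits>

class JitteredBackOffSketch {
public:
  // Bounded: computed intervals never exceed max_interval_ms.
  JitteredBackOffSketch(uint64_t base_interval_ms, uint64_t max_interval_ms)
      : base_interval_(base_interval_ms), max_interval_(max_interval_ms) {}

  // "Unbounded": just delegates with the largest representable cap.
  explicit JitteredBackOffSketch(uint64_t base_interval_ms)
      : JitteredBackOffSketch(base_interval_ms, std::numeric_limits<uint64_t>::max()) {}

private:
  const uint64_t base_interval_;
  const uint64_t max_interval_;
};
```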

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
current_retry_++;
uint32_t multiplier = (1 << current_retry_) - 1;
// for retries that take longer, multiplier may overflow and become zero.
if (multiplier == 0) {
Contributor Author

@htuch for retries that run long, I noticed that the multiplier becomes zero because of overflow, so I added this check to reset it. Also added a test to verify this. LMK if there are better options here.

Member

I don't think this should be possible if we bound the maximum backoff. When we hit the cap, let's just immediately return that, rather than doing math that can overflow. I.e. once (1 << current_retry_) * base_interval_ > max_interval_, stop incrementing current_retry_.
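Editor's note: a sketch of the suggestion above (member names mirror the quoted snippet, but this is not the merged code). The exponent stops advancing once the exponential ceiling reaches the cap, so the multiplier can never overflow.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>

struct CappedJitteredBackOffSketch {
  uint64_t base_interval_;
  uint64_t max_interval_;
  uint32_t current_retry_{0};
  std::mt19937_64 random_{std::random_device{}()};

  uint64_t nextBackOffMs() {
    // Only grow the exponent while base * 2^retry is still below the cap;
    // once it is not, every subsequent call just draws from [0, max_interval_).
    if ((base_interval_ << current_retry_) < max_interval_) {
      current_retry_++;
    }
    const uint64_t ceiling = std::min(max_interval_, base_interval_ << current_retry_);
    return random_() % ceiling;
  }
};
```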

@@ -71,23 +71,19 @@ RetryStateImpl::RetryStateImpl(const RetryPolicy& route_policy, Http::HeaderMap&
// Merge in the route policy.
retry_on_ |= route_policy.retryOn();
retries_remaining_ = std::max(retries_remaining_, route_policy.numRetries());
uint32_t base = runtime_.snapshot().getInteger("upstream.base_retry_backoff_ms", 25);
backoff_strategy_ptr_ = std::make_unique<JitteredBackOffStrategy>(base, base * 10, random_);
Contributor Author

Here I have added base * 10 as the maximum interval (bound). I think it should be sufficient. LMK if you think otherwise.

Member

Seems reasonable, maybe add a comment or constant.
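Editor's note: one possible shape of that "comment or constant", as a standalone illustration (the constant and function names are hypothetical, not from the PR):

```cpp
#include <cstdint>

// Retries back off to at most ten times the configured base interval;
// giving the multiplier a name documents the intent at the call site.
constexpr uint64_t kMaxBackOffMultiplier = 10;

uint64_t maxRetryBackOffMs(uint64_t base_retry_backoff_ms) {
  return base_retry_backoff_ms * kMaxBackOffMultiplier;
}
```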

@ramaraochavali (Contributor Author)

@htuch made JitteredBackOff the default for management server connections as well and removed the earlier ExponentialBackOff. Left a couple of questions, PTAL.

current_interval_ = new_interval > max_interval_ ? max_interval_ : new_interval;
uint64_t JitteredBackOffStrategy::nextBackOffMs() {
current_retry_++;
uint32_t multiplier = (1 << current_retry_) - 1;
Member

Why - 1?

@ramaraochavali (Contributor Author), Jul 11, 2018

Good point, I just took it from the existing implementation. I am not sure why it was like that, but one thing that comes to mind is that 1 << current_retry_ always returns a power of two, so maybe the - 1 was meant to make the result a bit more random. I left it as is since it is not a big deal; let me know if you want me to change it.
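Editor's note: a standalone illustration of the windows in question (not PR code). The - 1 only narrows the modulus window from base * 2^n to base * (2^n - 1); the result is uniformly random either way.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t base = 25;
  for (uint32_t retry = 1; retry <= 5; ++retry) {
    const uint64_t window_with_minus_one = base * ((uint64_t{1} << retry) - 1);
    const uint64_t window_without = base * (uint64_t{1} << retry);
    std::printf("retry %u: [0, %llu) vs [0, %llu)\n", retry,
                static_cast<unsigned long long>(window_with_minus_one),
                static_cast<unsigned long long>(window_without));
  }
  return 0;
}
```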

DoRetryCallback callback_;
Event::TimerPtr retry_timer_;
Upstream::ResourcePriority priority_;
BackOffStrategyPtr backoff_strategy_ptr_;
Member

FWIW, I'm not a fan of putting _ptr in the variable name unless it is really needed to make things clearer (e.g. to disambiguate). Here, I would just call this backoff_strategy_. I think you'll find this consistent with other Envoy code (e.g. retry_timer_ above).

@@ -71,23 +71,19 @@ RetryStateImpl::RetryStateImpl(const RetryPolicy& route_policy, Http::HeaderMap&
// Merge in the route policy.
retry_on_ |= route_policy.retryOn();
retries_remaining_ = std::max(retries_remaining_, route_policy.numRetries());
uint32_t base = runtime_.snapshot().getInteger("upstream.base_retry_backoff_ms", 25);
Member

Nit: const

Signed-off-by: Rama <rama.rao@salesforce.com>
@ramaraochavali (Contributor Author)

@htuch made changes. PTAL.

@htuch merged commit b1f870a into envoyproxy:master on Jul 11, 2018
@ramaraochavali deleted the fix/jitter_backoff branch on July 12, 2018 at 02:59
snowp added a commit to snowp/envoy that referenced this pull request Jul 12, 2018
* origin/master:
  config: making v2-config-only a boolean flag (envoyproxy#3847)
  lc trie: add exclusive flag. (envoyproxy#3825)
  upstream: introduce PriorityStateManager, refactor EDS (envoyproxy#3783)
  test: deflaking header_integration_test (envoyproxy#3849)
  http: new style WebSockets, where headers and data are processed by the filter chain. (envoyproxy#3776)
  common: minor doc updates (envoyproxy#3845)
  fix master build (envoyproxy#3844)
  logging: Requiring details for RELEASE_ASSERT (envoyproxy#3842)
  test: add test for consistency of RawStatData internal memory representation (envoyproxy#3843)
  common: jittered backoff implementation (envoyproxy#3791)
  format: run buildifier on .bzl files. (envoyproxy#3824)
  Support mutable metadata for endpoints (envoyproxy#3814)
  test: deflaking a test, improving debugability (envoyproxy#3829)
  Update ApiConfigSource docs with grpc_services only for GRPC configs (envoyproxy#3834)
  Add hard-coded /hot_restart_version test (envoyproxy#3832)
  healthchecks: Add interval_jitter_percent healthcheck option (envoyproxy#3816)

Signed-off-by: Snow Pettersen <snowp@squareup.com>
htuch pushed a commit that referenced this pull request Aug 12, 2018
…ction failures (#4108)

I changed the retry strategy of the HdsDelegate on stream/connection failures. Instead of retrying every set number of seconds, we now use a jittered backoff strategy, as in #3791.

Risk Level: Low

This is for #1310.

Signed-off-by: Lilika Markatou <lilika@google.com>