
common: jittered backoff implementation #3791

Merged: 5 commits into envoyproxy:master from fix/jitter_backoff on Jul 11, 2018

Conversation

ramaraochavali (Contributor)

Signed-off-by: Rama <rama.rao@salesforce.com>

Description:
Implements a fully jittered exponential backoff algorithm and refactors the router retry timer to use it.
Risk Level: Low
Testing: Added automated tests
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Rama <rama.rao@salesforce.com>
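Editor's note: for readers unfamiliar with the algorithm, here is a minimal sketch of "full jitter" as described in the PR description above. Parameter and function names are illustrative, not the exact API merged in this PR; each retry draws a uniformly random delay from an exponentially growing window, capped at a maximum.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>

// Full jitter in one function (illustrative names, not the PR's exact API):
// each retry samples uniformly from [0, min(cap, base * (2^retry - 1))).
// Assumes retry >= 1 and cap > 0 so the window is non-empty.
uint64_t fullJitterMs(uint32_t retry, uint64_t base, uint64_t cap, std::mt19937_64& rng) {
  const uint64_t window = std::min(cap, base * ((uint64_t{1} << retry) - 1));
  return rng() % window;
}
```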
@ramaraochavali (Contributor Author)

@htuch this is the follow-up PR for #3758 with the jittered implementation. PTAL.

@htuch (Member) left a comment


Awesome @ramaraochavali, a few comments and we can merge. Would be good to make this the default backoff for everything as per @mattklein123's earlier comment.

uint64_t JitteredBackOffStrategy::computeNextInterval() {
current_retry_++;
uint32_t multiplier = (1 << current_retry_) - 1;
uint64_t new_interval = random_.random() % (base_interval_ * multiplier);
Member

What are the behavioral tradeoffs between doing what you have here and something more like (1 << current_retry_) * base_interval + random.random() % some_fixed_interval? I.e. you still do exponential backoff, but just add a small jitter around the point that you back off to?

I think what you have is fine actually, reading https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/, but it would be worth exploring the tradeoff in the code comments here and in review.
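Editor's note: the two formulas under discussion look roughly like this (hypothetical free functions, not code from this PR). The first keeps the delay essentially deterministic and only randomizes a small additive term; the second randomizes the entire exponentially growing window.

```cpp
#include <cstdint>
#include <random>

// Exponential backoff plus a small additive jitter: the delay grows
// deterministically and only the last fixed_jitter_ms milliseconds vary.
// Assumes fixed_jitter_ms > 0.
uint64_t expPlusSmallJitterMs(uint32_t retry, uint64_t base_ms, uint64_t fixed_jitter_ms,
                              std::mt19937_64& rng) {
  return (uint64_t{1} << retry) * base_ms + rng() % fixed_jitter_ms;
}

// Full jitter, as in this PR: the whole window is random, which spreads
// concurrent retries out more but makes individual delays less predictable.
// Assumes retry >= 1 so the window is non-empty; no cap shown for brevity.
uint64_t fullJitterNoCapMs(uint32_t retry, uint64_t base_ms, std::mt19937_64& rng) {
  return rng() % (base_ms * ((uint64_t{1} << retry) - 1));
}
```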

@ramaraochavali (Contributor Author), Jul 10, 2018

I have compared both approaches; the one I have produces more randomness. Here is the output.
(1 << current_retry_) * base_interval + random.random() % some_fixed_interval (fixed interval = 100):
1082
2069
4079
8054
16010
30000

random_.random() % (base_interval_ * (1 << current_retry_) - 1)
102
1065
954
1084
7196
28940
23291
30000
9979


uint64_t JitteredBackOffStrategy::computeNextInterval() {
current_retry_++;
uint32_t multiplier = (1 << current_retry_) - 1;
Member

Nit: please use const for these intermediate variables.


public:
/**
* Use this constructor if max_interval need not be enforced.
Member

How come we need this? I think it is always sensible to add some bound.
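Editor's note: a hedged sketch of what the two constructor flavors under discussion might look like (names and the sentinel are assumptions, not the merged API). The "unbounded" variant is effectively a bound at the largest representable value, which is why a real cap always seems preferable.

```cpp
#include <cstdint>
#include <limits>

class JitteredBackOffSketch {
public:
  // Bounded: computed intervals never exceed max_interval_ms.
  JitteredBackOffSketch(uint64_t base_interval_ms, uint64_t max_interval_ms)
      : base_interval_(base_interval_ms), max_interval_(max_interval_ms) {}

  // "Unbounded": just delegates with the largest representable cap.
  explicit JitteredBackOffSketch(uint64_t base_interval_ms)
      : JitteredBackOffSketch(base_interval_ms, std::numeric_limits<uint64_t>::max()) {}

private:
  const uint64_t base_interval_;
  const uint64_t max_interval_;
};
```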

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
current_retry_++;
uint32_t multiplier = (1 << current_retry_) - 1;
// for retries that take longer, multiplier may overflow and become zero.
if (multiplier == 0) {
Contributor Author

@htuch for retries that run long, I noticed that the multiplier becomes zero because of overflow, so I added this check to reset it. Also added a test to verify this. LMK if there are better options here.

Member

I don't think this should be possible if we bound the maximum backoff. When we hit the cap, let's just immediately return that, rather than doing math that can overflow. I.e. once (1 << current_retry_) * base_interval_ > max_interval_, stop incrementing current_retry_.
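Editor's note: a sketch of the suggestion above (member names mirror the quoted snippet, but this is not the merged code). The exponent stops advancing once the exponential ceiling reaches the cap, so the multiplier can never overflow.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>

struct CappedJitteredBackOffSketch {
  uint64_t base_interval_;
  uint64_t max_interval_;
  uint32_t current_retry_{0};
  std::mt19937_64 random_{std::random_device{}()};

  uint64_t nextBackOffMs() {
    // Only grow the exponent while base * 2^retry is still below the cap;
    // once it is not, every subsequent call just draws from [0, max_interval_).
    if ((base_interval_ << current_retry_) < max_interval_) {
      current_retry_++;
    }
    const uint64_t ceiling = std::min(max_interval_, base_interval_ << current_retry_);
    return random_() % ceiling;
  }
};
```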

@@ -71,23 +71,19 @@ RetryStateImpl::RetryStateImpl(const RetryPolicy& route_policy, Http::HeaderMap&
// Merge in the route policy.
retry_on_ |= route_policy.retryOn();
retries_remaining_ = std::max(retries_remaining_, route_policy.numRetries());
uint32_t base = runtime_.snapshot().getInteger("upstream.base_retry_backoff_ms", 25);
backoff_strategy_ptr_ = std::make_unique<JitteredBackOffStrategy>(base, base * 10, random_);
Contributor Author

Here I have added base * 10 as the maximum interval (bound). I think it should be sufficient. LMK if you think otherwise.

Member

Seems reasonable, maybe add a comment or constant.
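Editor's note: one possible shape of that "comment or constant", as a standalone illustration (the constant and function names are hypothetical, not from the PR):

```cpp
#include <cstdint>

// Retries back off to at most ten times the configured base interval;
// giving the multiplier a name documents the intent at the call site.
constexpr uint64_t kMaxBackOffMultiplier = 10;

uint64_t maxRetryBackOffMs(uint64_t base_retry_backoff_ms) {
  return base_retry_backoff_ms * kMaxBackOffMultiplier;
}
```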

@ramaraochavali (Contributor Author)

@htuch made JitteredBackOff the default for management server connections as well and removed the earlier ExponentialBackOff. Left a couple of questions, PTAL.

current_interval_ = new_interval > max_interval_ ? max_interval_ : new_interval;
uint64_t JitteredBackOffStrategy::nextBackOffMs() {
current_retry_++;
uint32_t multiplier = (1 << current_retry_) - 1;
Member

Why - 1?

@ramaraochavali (Contributor Author), Jul 11, 2018

Good point, I just took it from the existing implementation. I am not sure why it was like that, but one thing that comes to mind is that 1 << current_retry_ always returns a power of two, so maybe the - 1 was meant to make the result a bit more random. I left it as is since it is not a big deal; let me know if you want me to change it.
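Editor's note: a standalone illustration of the windows in question (not PR code). The - 1 only narrows the modulus window from base * 2^n to base * (2^n - 1); the result is uniformly random either way.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t base = 25;
  for (uint32_t retry = 1; retry <= 5; ++retry) {
    const uint64_t window_with_minus_one = base * ((uint64_t{1} << retry) - 1);
    const uint64_t window_without = base * (uint64_t{1} << retry);
    std::printf("retry %u: [0, %llu) vs [0, %llu)\n", retry,
                static_cast<unsigned long long>(window_with_minus_one),
                static_cast<unsigned long long>(window_without));
  }
  return 0;
}
```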

DoRetryCallback callback_;
Event::TimerPtr retry_timer_;
Upstream::ResourcePriority priority_;
BackOffStrategyPtr backoff_strategy_ptr_;
Member

FWIW, I'm not a fan of putting _ptr in the variable name unless it is really needed to make things clearer (e.g. to disambiguate). Here, I would just call this backoff_strategy_. I think you'll find this consistent with other Envoy code (e.g. retry_timer_ above).

@@ -71,23 +71,19 @@ RetryStateImpl::RetryStateImpl(const RetryPolicy& route_policy, Http::HeaderMap&
// Merge in the route policy.
retry_on_ |= route_policy.retryOn();
retries_remaining_ = std::max(retries_remaining_, route_policy.numRetries());
uint32_t base = runtime_.snapshot().getInteger("upstream.base_retry_backoff_ms", 25);
Member

Nit: const

Signed-off-by: Rama <rama.rao@salesforce.com>
@ramaraochavali (Contributor Author)

@htuch made changes. PTAL.

@htuch merged commit b1f870a into envoyproxy:master on Jul 11, 2018
@ramaraochavali deleted the fix/jitter_backoff branch on July 12, 2018 at 02:59
snowp added a commit to snowp/envoy that referenced this pull request Jul 12, 2018
* origin/master:
  config: making v2-config-only a boolean flag (envoyproxy#3847)
  lc trie: add exclusive flag. (envoyproxy#3825)
  upstream: introduce PriorityStateManager, refactor EDS (envoyproxy#3783)
  test: deflaking header_integration_test (envoyproxy#3849)
  http: new style WebSockets, where headers and data are processed by the filter chain. (envoyproxy#3776)
  common: minor doc updates (envoyproxy#3845)
  fix master build (envoyproxy#3844)
  logging: Requiring details for RELEASE_ASSERT (envoyproxy#3842)
  test: add test for consistency of RawStatData internal memory representation (envoyproxy#3843)
  common: jittered backoff implementation (envoyproxy#3791)
  format: run buildifier on .bzl files. (envoyproxy#3824)
  Support mutable metadata for endpoints (envoyproxy#3814)
  test: deflaking a test, improving debugability (envoyproxy#3829)
  Update ApiConfigSource docs with grpc_services only for GRPC configs (envoyproxy#3834)
  Add hard-coded /hot_restart_version test (envoyproxy#3832)
  healthchecks: Add interval_jitter_percent healthcheck option (envoyproxy#3816)

Signed-off-by: Snow Pettersen <snowp@squareup.com>
htuch pushed a commit that referenced this pull request Aug 12, 2018
…ction failures (#4108)

I changed the retry strategy of the HdsDelegate on stream/connection failures. Instead of retrying every set number of seconds, we now use a jittered backoff strategy, as in #3791.

Risk Level: Low

This is for #1310.

Signed-off-by: Lilika Markatou <lilika@google.com>