Skip to content

http: add connection-duration jitter, drain-timeout jitter, and drain_percentage to stagger downstream reconnections#45101

Open
Winbobob wants to merge 20 commits into
envoyproxy:mainfrom
Winbobob:zheweihu/f/drain-timeout-jitter
Open

http: add connection-duration jitter, drain-timeout jitter, and drain_percentage to stagger downstream reconnections#45101
Winbobob wants to merge 20 commits into
envoyproxy:mainfrom
Winbobob:zheweihu/f/drain-timeout-jitter

Conversation

@Winbobob
Copy link
Copy Markdown
Member

@Winbobob Winbobob commented May 16, 2026

Commit Message:
http: add connection-duration jitter, drain-timeout jitter, and drain_percentage to stagger downstream reconnections

Additional Description:
Mitigates the thundering-herd reconnect problem when many downstream long-lived HTTP/2 or gRPC connections share the same max_connection_duration and all reach the limit (and therefore drain) at the same instant. Three opt-in knobs:

  1. HttpProtocolOptions.max_connection_duration_jitter (type.v3.Percent):
    extends each connection's duration timer by random(0, base * pct/100).
    Mirrors the existing TCP-proxy field max_downstream_connection_duration_jitter_percentage (from tcp_proxy: Add max_downstream_connection_duration_jitter_percentage #40999).
  2. HttpConnectionManager.drain_timeout_jitter (type.v3.Percent): extends
    the drain grace period between the shutdown-notice GOAWAY and the final
    GOAWAY by random(0, drain_timeout * pct/100).
  3. HttpConnectionManager.drain_percentage (type.v3.Percent, default 100):
    when max_connection_duration fires, only this fraction of eligible
    connections drain this cycle; the rest re-arm the duration timer (with
    fresh jitter, if configured) and wait for the next round. 100 (or unset)
    preserves current behavior; 0 means never drain via this path.

All three fields are individually opt-in.

Risk Level:
Low.

  • Defaults preserve current behavior on all three fields:
    • jitter fields default to unset → no jitter
    • drain_percentage defaults to 100 → drain all same as today
  • No existing config paths or stats are altered.

Testing:
Unit tests and config-parsing tests

Docs Changes:
Proto field comments document semantics, defaults, and interaction with
max_connection_duration / drain_timeout.

Release Notes:
Updated change log.

Platform Specific Features:
None.

Runtime guard:
None.

Fixes #Issue:

API Considerations:

  • max_connection_duration_jitter placed on HttpProtocolOptions (next to
    max_connection_duration) following the TCP proxy precedent. Note: the
    field is currently honored only by the downstream HCM runtime; upstream
    clusters read the same proto but do not apply the jitter today.
  • drain_timeout_jitter and drain_percentage placed on
    HttpConnectionManager because the drain sequence they gate is
    downstream-HCM-only; placing them on HttpProtocolOptions would
    incorrectly imply upstream applicability.
  • All three use type.v3.Percent, which carries [0, 100] PGV validation;
    no additional runtime bounds checking is needed.

@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @adisuissa
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #45101 was opened by Winbobob.

see: more, trace.

@Winbobob Winbobob force-pushed the zheweihu/f/drain-timeout-jitter branch 2 times, most recently from 804a7a3 to c72c4b5 Compare May 16, 2026 23:26
@Winbobob
Copy link
Copy Markdown
Member Author

Winbobob commented May 18, 2026

I saw multiple PRs try to implement the jitter for max connection duration timeout. This PR includes that and also adds drain timeout jitter and drain percentage based on a conversation two years ago: #35391. CI passed, cc @mathetake, @wbpcode

@Winbobob Winbobob force-pushed the zheweihu/f/drain-timeout-jitter branch from 98c1040 to 4e69643 Compare May 18, 2026 16:09
Comment thread changelogs/current.yaml Outdated
Winbobob and others added 12 commits May 19, 2026 18:24
…_percentage to HCM

Mitigates the thundering-herd reconnect problem (envoyproxy#35391) when many
long-lived HTTP/2 or gRPC connections share the same max_connection_duration and
all hit it simultaneously.

- HttpProtocolOptions.max_connection_duration_jitter (Percent): extends the
  per-connection duration timer by random(0, base * pct/100). Mirrors TCP proxy's
  max_downstream_connection_duration_jitter_percentage.
- HttpConnectionManager.drain_timeout_jitter (Percent): extends the drain grace
  period the same way, staggering the final GOAWAY across simultaneously-draining
  connections.
- HttpConnectionManager.drain_percentage (Percent, default 100): when
  max_connection_duration fires, only this fraction of eligible connections drain
  this cycle; the rest re-arm the duration timer (with fresh jitter) and wait
  for the next round. 0 = never drain, 100/unset = drain all (current behavior).

Includes interface plumbing in conn_manager_config.h, admin server stubs,
mock/fixture overrides, unit tests for the runtime jitter math, and config
parsing tests.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
The five new tests (DrainTimeoutJitter, DrainTimeoutJitterZeroPercent,
DrainTimeoutNoJitter, DrainPercentageSkip, DrainPercentageDrain) failed under
CI with "leaked mock objects found at program exit" because they never drove
ConnectionManagerImpl to its teardown path -- so the mock codec, mock timers,
and mock dispatcher were never destructed and gmock could not verify their
expectations.

Fix: append `conn_manager_->onEvent(Network::ConnectionEvent::RemoteClose);`
at the end of each test, matching the teardown pattern used elsewhere in
conn_manager_impl_test_2.cc (e.g. L589).

Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Switch drain_timer creation in DrainTimeoutJitter/DrainTimeoutJitterZeroPercent/
DrainTimeoutNoJitter/DrainPercentageDrain from a raw 'new Event::MockTimer(...)'
to the setUpTimer() test helper, matching the existing ConnectionDuration test.
The raw form leaves the mock leaked at program exit; the helper registers the
timer with the dispatcher and is cleaned up correctly.

Also apply clang-format fixes to conn_manager_impl.cc and
http_connection_manager/config.cc reported by CI.

Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
…pelling

- Restructure 5 new HCM drain tests (DrainTimeoutJitter*, DrainTimeoutNoJitter,
  DrainPercentage*) to drive a real request through startRequest() so codec_ is
  initialized before onConnectionDurationTimeout() runs. The previous shape hit
  the codec_==nullptr branch in doConnectionClose() and tripped UBSan because
  the drain_timer mock was reached through an unexpected path. Use
  FlushWriteAndDelay to mirror the proven ConnectionDuration test.
- Apply clang-format suggested by precheck for conn_manager_impl.cc (drain
  jitter block).
- Add 'teardowns' to spelling dictionary (used in protocol.proto).

Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
The new DrainPercentageSkip test sets EXPECT_CALL(*connection_duration_timer,
disableTimer()) but the test no longer drives the connection close path, so
the expectation is never satisfied. Drop it; the meaningful assertions are the
re-armed enableTimer call and the two stat increments.

Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Preempt review feedback (cf. wbpcode on PR envoyproxy#44064) by clarifying that
the jitter field on the shared HttpProtocolOptions is currently only
honored by the downstream HTTP connection manager.

Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Address review feedback on PR envoyproxy#44064: rather than exposing a separate
maxConnectionDurationJitterPercentage() interface method that callers
must combine with maxConnectionDuration() themselves, treat the jitter
as an internal implementation detail of HttpConnectionManagerConfig.

Each call to maxConnectionDuration() now returns the base duration
extended by a freshly-sampled random amount up to base * jitter / 100.
Callers (ConnectionManagerImpl) arm their timer with whatever value
this returns; the jitter virtual is removed from the interface and from
all mocks/stubs.

Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Same treatment as the previous commit's max_connection_duration refactor.
Removes drainTimeoutJitterPercentage() from the ConnectionManagerConfig
interface; HttpConnectionManagerConfig::drainTimeout() now returns a
freshly jittered value on each call. drainPercentage() is kept as a
separate predicate since it is not a transform of any existing value.

Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
@Winbobob Winbobob force-pushed the zheweihu/f/drain-timeout-jitter branch from 901d84b to c7a91b4 Compare May 19, 2026 18:24
@Winbobob Winbobob force-pushed the zheweihu/f/drain-timeout-jitter branch from c7a91b4 to 3a6bf0e Compare May 19, 2026 18:33
Comment thread changelogs/current/new_features/http__drain-timeout-jitter.rst Outdated
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Winbobob added 3 commits May 19, 2026 19:34
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
… value stable

Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
@Winbobob
Copy link
Copy Markdown
Member Author

Hi @markdroth , could you please review the API change? Thanks a lot!

@naren-13
Copy link
Copy Markdown

hi @Winbobob
When can we expect this feature to be fully implemented ?

Copy link
Copy Markdown
Contributor

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.
Left a couple of API points:

  1. Consider using a jitter duration instead of percentage (not a requirement, just want to understand whether using a duration is better or not).
  2. I think that the drain_percentage field will not be easy to use/reason about. Let's decouple that into a different PR.

// <envoy_v3_api_field_extensions.filters.network.http_connection_manager.v3.HttpConnectionManager.drain_timeout>`.
google.protobuf.Duration max_connection_duration = 3;

// Percentage-based jitter for ``max_connection_duration``. If set, the actual connection duration
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OOC why is percent based better than an explicit max-jitter-value?
I assume that max-jitter-value is easier to control and reason about, so I'm wondering why the percentage based approach is preferred here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two reasons:

  1. to match the existing behavior in tcp proxy: tcp_proxy: Add max_downstream_connection_duration_jitter_percentage #40999
  2. the value of type.v3.Percent will be [0,100], no extra validations needed

Hope it makes sense.


// When :ref:`max_connection_duration
// <envoy_v3_api_field_config.core.v3.HttpProtocolOptions.max_connection_duration>` fires,
// only this percentage of eligible connections are drained in the current round. For the
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a round?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be rephrased as "current connection will have this percentage of chance to be drained". But as you suggested, I will remove the drain_percentage from this PR.

// 5000 milliseconds (5 seconds) if this option is not specified.
google.protobuf.Duration drain_timeout = 12;

// Percentage-based jitter for ``drain_timeout``. If set, the actual drain grace period
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the documentation that explains all the timeouts, to take into account the new jitters.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just updated docs/root/faq/configuration/timeouts.rst

// only this percentage of eligible connections are drained in the current round. For the
// remaining connections, the connection duration timer is reset (with jitter, if
// configured via ``max_connection_duration_jitter``) and they wait for the next round.
// This bounds the number of simultaneous reconnects when many long-lived connections share
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes the understanding of max number of long-lived connections at a given time challenging to reason about.
Can you please explain why working in "rounds"/"cycles" is better here, and how are they defined?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rounds/cycles mean: if this time this connection does not get a chance to be drained, its duration timer will be reset and wait until the timer is fired and try to drain again.

Winbobob and others added 3 commits May 21, 2026 17:04
Decouple drain_percentage into a separate follow-up PR as it is
hard to reason about independently. Removes the proto field, HCM
config parsing, conn_manager_impl logic, interface method, stats
counter, all tests, mocks, and changelog entry.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
…r in timeouts FAQ

Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants