http: add connection-duration jitter, drain-timeout jitter, and drain_percentage to stagger downstream reconnections#45101
Conversation
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
804a7a3 to
c72c4b5
Compare
|
I saw multiple PRs try to implement the jitter for max connection duration timeout. This PR includes that and also adds drain timeout jitter and drain percentage based on a conversation two years ago: #35391. CI passed, cc @mathetake, @wbpcode |
98c1040 to
4e69643
Compare
…_percentage to HCM Mitigates the thundering-herd reconnect problem (envoyproxy#35391) when many long-lived HTTP/2 or gRPC connections share the same max_connection_duration and all hit it simultaneously. - HttpProtocolOptions.max_connection_duration_jitter (Percent): extends the per-connection duration timer by random(0, base * pct/100). Mirrors TCP proxy's max_downstream_connection_duration_jitter_percentage. - HttpConnectionManager.drain_timeout_jitter (Percent): extends the drain grace period the same way, staggering the final GOAWAY across simultaneously-draining connections. - HttpConnectionManager.drain_percentage (Percent, default 100): when max_connection_duration fires, only this fraction of eligible connections drain this cycle; the rest re-arm the duration timer (with fresh jitter) and wait for the next round. 0 = never drain, 100/unset = drain all (current behavior). Includes interface plumbing in conn_manager_config.h, admin server stubs, mock/fixture overrides, unit tests for the runtime jitter math, and config parsing tests. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
The five new tests (DrainTimeoutJitter, DrainTimeoutJitterZeroPercent, DrainTimeoutNoJitter, DrainPercentageSkip, DrainPercentageDrain) failed under CI with "leaked mock objects found at program exit" because they never drove ConnectionManagerImpl to its teardown path -- so the mock codec, mock timers, and mock dispatcher were never destructed and gmock could not verify their expectations. Fix: append `conn_manager_->onEvent(Network::ConnectionEvent::RemoteClose);` at the end of each test, matching the teardown pattern used elsewhere in conn_manager_impl_test_2.cc (e.g. L589). Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Switch drain_timer creation in DrainTimeoutJitter/DrainTimeoutJitterZeroPercent/ DrainTimeoutNoJitter/DrainPercentageDrain from a raw 'new Event::MockTimer(...)' to the setUpTimer() test helper, matching the existing ConnectionDuration test. The raw form leaves the mock leaked at program exit; the helper registers the timer with the dispatcher and is cleaned up correctly. Also apply clang-format fixes to conn_manager_impl.cc and http_connection_manager/config.cc reported by CI. Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
…pelling - Restructure 5 new HCM drain tests (DrainTimeoutJitter*, DrainTimeoutNoJitter, DrainPercentage*) to drive a real request through startRequest() so codec_ is initialized before onConnectionDurationTimeout() runs. The previous shape hit the codec_==nullptr branch in doConnectionClose() and tripped UBSan because the drain_timer mock was reached through an unexpected path. Use FlushWriteAndDelay to mirror the proven ConnectionDuration test. - Apply clang-format suggested by precheck for conn_manager_impl.cc (drain jitter block). - Add 'teardowns' to spelling dictionary (used in protocol.proto). Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
The new DrainPercentageSkip test sets EXPECT_CALL(*connection_duration_timer, disableTimer()) but the test no longer drives the connection close path, so the expectation is never satisfied. Drop it; the meaningful assertions are the re-armed enableTimer call and the two stat increments. Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Preempt review feedback (cf. wbpcode on PR envoyproxy#44064) by clarifying that the jitter field on the shared HttpProtocolOptions is currently only honored by the downstream HTTP connection manager. Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Address review feedback on PR envoyproxy#44064: rather than exposing a separate maxConnectionDurationJitterPercentage() interface method that callers must combine with maxConnectionDuration() themselves, treat the jitter as an internal implementation detail of HttpConnectionManagerConfig. Each call to maxConnectionDuration() now returns the base duration extended by a freshly-sampled random amount up to base * jitter / 100. Callers (ConnectionManagerImpl) arm their timer with whatever value this returns; the jitter virtual is removed from the interface and from all mocks/stubs. Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Same treatment as the previous commit's max_connection_duration refactor. Removes drainTimeoutJitterPercentage() from the ConnectionManagerConfig interface; HttpConnectionManagerConfig::drainTimeout() now returns a freshly jittered value on each call. drainPercentage() is kept as a separate predicate since it is not a transform of any existing value. Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
901d84b to
c7a91b4
Compare
…45095) Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
c7a91b4 to
3a6bf0e
Compare
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
… value stable Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
|
Hi @markdroth , could you please review the API change? Thanks a lot! |
|
hi @Winbobob |
adisuissa
left a comment
There was a problem hiding this comment.
Thanks.
Left a couple of API points:
- Consider using a jitter duration instead of percentage (not a requirement, just want to understand whether using a duration is better or not).
- I think that the
drain_percentagefield will not be easy to use/reason about. Let's decouple that into a different PR.
| // <envoy_v3_api_field_extensions.filters.network.http_connection_manager.v3.HttpConnectionManager.drain_timeout>`. | ||
| google.protobuf.Duration max_connection_duration = 3; | ||
|
|
||
| // Percentage-based jitter for ``max_connection_duration``. If set, the actual connection duration |
There was a problem hiding this comment.
OOC why is percent based better than an explicit max-jitter-value?
I assume that max-jitter-value is easier to control and reason about, so I'm wondering why the percentage based approach is preferred here.
There was a problem hiding this comment.
Two reasons:
- to match the existing behavior in tcp proxy: tcp_proxy: Add max_downstream_connection_duration_jitter_percentage #40999
- the value of
type.v3.Percentwill be [0,100], no extra validations needed
Hope it makes sense.
|
|
||
| // When :ref:`max_connection_duration | ||
| // <envoy_v3_api_field_config.core.v3.HttpProtocolOptions.max_connection_duration>` fires, | ||
| // only this percentage of eligible connections are drained in the current round. For the |
There was a problem hiding this comment.
It should be rephrased as "current connection will have this percentage of chance to be drained". But as you suggested, I will remove the drain_percentage from this PR.
| // 5000 milliseconds (5 seconds) if this option is not specified. | ||
| google.protobuf.Duration drain_timeout = 12; | ||
|
|
||
| // Percentage-based jitter for ``drain_timeout``. If set, the actual drain grace period |
There was a problem hiding this comment.
Please update the documentation that explains all the timeouts, to take into account the new jitters.
There was a problem hiding this comment.
Just updated docs/root/faq/configuration/timeouts.rst
| // only this percentage of eligible connections are drained in the current round. For the | ||
| // remaining connections, the connection duration timer is reset (with jitter, if | ||
| // configured via ``max_connection_duration_jitter``) and they wait for the next round. | ||
| // This bounds the number of simultaneous reconnects when many long-lived connections share |
There was a problem hiding this comment.
I think this makes the understanding of max number of long-lived connections at a given time challenging to reason about.
Can you please explain why working in "rounds"/"cycles" is better here, and how are they defined?
There was a problem hiding this comment.
Rounds/cycles mean: if this time this connection does not get a chance to be drained, its duration timer will be reset and wait until the timer is fired and try to drain again.
Decouple drain_percentage into a separate follow-up PR as it is hard to reason about independently. Removes the proto field, HCM config parsing, conn_manager_impl logic, interface method, stats counter, all tests, mocks, and changelog entry. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
…r in timeouts FAQ Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Signed-off-by: Winbobob <zhewei.hu33@gmail.com>
Commit Message:
http: add connection-duration jitter, drain-timeout jitter, and drain_percentage to stagger downstream reconnections
Additional Description:
Mitigates the thundering-herd reconnect problem when many downstream long-lived HTTP/2 or gRPC connections share the same
max_connection_durationand all reach the limit (and therefore drain) at the same instant. Three opt-in knobs:HttpProtocolOptions.max_connection_duration_jitter(type.v3.Percent):extends each connection's duration timer by
random(0, base * pct/100).Mirrors the existing TCP-proxy field
max_downstream_connection_duration_jitter_percentage(from tcp_proxy: Add max_downstream_connection_duration_jitter_percentage #40999).HttpConnectionManager.drain_timeout_jitter(type.v3.Percent): extendsthe drain grace period between the shutdown-notice GOAWAY and the final
GOAWAY by
random(0, drain_timeout * pct/100).HttpConnectionManager.drain_percentage(type.v3.Percent, default 100):when
max_connection_durationfires, only this fraction of eligibleconnections drain this cycle; the rest re-arm the duration timer (with
fresh jitter, if configured) and wait for the next round. 100 (or unset)
preserves current behavior; 0 means never drain via this path.
All three fields are individually opt-in.
Risk Level:
Low.
drain_percentagedefaults to 100 → drain all same as todayTesting:
Unit tests and config-parsing tests
Docs Changes:
Proto field comments document semantics, defaults, and interaction with
max_connection_duration/drain_timeout.Release Notes:
Updated change log.
Platform Specific Features:
None.
Runtime guard:
None.
Fixes #Issue:
API Considerations:
max_connection_duration_jitterplaced onHttpProtocolOptions(next tomax_connection_duration) following the TCP proxy precedent. Note: thefield is currently honored only by the downstream HCM runtime; upstream
clusters read the same proto but do not apply the jitter today.
drain_timeout_jitteranddrain_percentageplaced onHttpConnectionManagerbecause the drain sequence they gate isdownstream-HCM-only; placing them on
HttpProtocolOptionswouldincorrectly imply upstream applicability.
type.v3.Percent, which carries[0, 100]PGV validation;no additional runtime bounds checking is needed.