listener_manager: implement filter-chain drain-close callback by jronak · Pull Request #44932 · envoyproxy/envoy

jronak · 2026-05-07T21:01:52Z

Commit Message: listener_manager: implement filter-chain drain-close callback
Additional Description: Today, Envoy network filters cannot reliably react to drain events. addOnDrainCloseCb on the network filter factory context is a no-op, so filters miss server shutdown, listener removal via LDS, and in-place filter chain replacement. This change implements addOnDrainCloseCb allowing the network filters to setup/register drain callbacks where the callbacks to the filters are invoked with a random drain time to spread the cleanups.

As a follow-up, we will use this hook in tcp_proxy to drain live TCP connections gradually over the configured drain window instead of letting them sit until forced shutdown.

Risk Level: Low
Testing: Integration and Unit
Docs Changes: Updated
Release Notes: NA
Platform Specific Features: NA

wbpcode · 2026-05-08T09:03:34Z

See #44567 for TCP connection draining.

NOTE, for now we prefer to let the connection to check the drain flag actively (same with HTTP's design) rather than to the callbacks because:

The connections distribute on different workers and the callbacks registration requires to take the thread safety into account.
the callbacks need to be called in correct workers.

wbpcode · 2026-05-08T09:04:44Z

/wait

Wire addOnDrainCloseCb through listener and per-filter-chain factory contexts so drain-close signaling is consistent for server shutdown, LDS listener removal, and in-place filter chain updates. Per-listener drain managers are children of the server drain manager; listener factory context forwards registration to the listener drain manager; each per-filter-chain context keeps local callbacks and registers with the parent drain decision once, with startDraining() for chains removed in-place. Add unit tests for cascade, idempotency, reuse vs removal, and late registration. Follow-up: use this hook in tcp_proxy to drain live connections with jitter over the drain window in a separate change. Test: bazel test //test/common/listener_manager:filter_chain_manager_impl_test Signed-off-by: Ronak Jain <ronakjainc@gmail.com>

jronak · 2026-05-09T14:06:27Z

See #44567 for TCP connection draining.

NOTE, for now we prefer to let the connection to check the drain flag actively (same with HTTP's design) rather than to the callbacks because:

The connections distribute on different workers and the callbacks registration requires to take the thread safety into account.

the callbacks need to be called in correct workers.

Ah nice #44567, I have something similar in our internal fork, sharing a couple of things that didn't work great with the flag-check approach in case it's worth keeping the callback:

Active connections work fine with the flag, but closing them all roughly together creates a thundering herd on the newer instance during host-restart or same instance on rbac policy changes for us that means a burst of downstream/upstream TLS handshakes. I added random jitter with timer in the tcp-proxy (internally) up to the drain time and that helps, but still a chunk ends up closing near the end since connections may not be active at the moment drain starts and they'd pick their jittered time at some random start point.
Idle connections are a bigger problem, kafka/redis clients hold large pools with long idle times, so they never notice the flag and get force-closed at the end of the drain window causing similar issues as they immediately reconnect.

I'm thinking with the callback manager, callback is pushed into connections impacted by drain with it's delay value and tcp-proxy can setup the connection closure after the delay.

Also sorry, idk why I thought CallbackManager was thread aware. Updated the PR to manage callbacks to be thread aware, lmk if it still makes sense to proceed.

kyessenov · 2026-05-12T17:14:31Z

@wbpcode waiting for your response.
/wait

wbpcode · 2026-05-21T01:24:33Z

Active connections work fine with the flag, but closing them all roughly together creates a thundering herd on the newer instance during host-restart or same instance on rbac policy changes for us that means a burst of downstream/upstream TLS handshakes. I added random jitter with timer in the tcp-proxy (internally) up to the drain time and that helps, but still a chunk ends up closing near the end since connections may not be active at the moment drain starts and they'd pick their jittered time at some random start point.

I guess the graceful draining should be helpful for this cases because every connections will get different drain flag based on the a time based percentage?

Idle connections are a bigger problem, kafka/redis clients hold large pools with long idle times, so they never notice the flag and get force-closed at the end of the drain window causing similar issues as they immediately reconnect.

I think this is common problem for inactive connections. My previous initial plan is to design a new timer to check flag periodically if necessary, which could keep the drain flag as the unique way to know the drain status. But I am open to this.

I'm thinking with the callback manager, callback is pushed into connections impacted by drain with it's delay value and tcp-proxy can setup the connection closure after the delay.

A drain manager should be better if we can have a great design to avoid the impact to core code. Let me take a full passthrough to the Envoy's current drain manager design to see the performance and code impact.

And sorry for my delay! 🙇 Slack me directly if I missed your message again!

kyessenov requested a review from wbpcode May 7, 2026 22:26

wbpcode self-assigned this May 8, 2026

repokitteh-read-only Bot added the waiting label May 8, 2026

jronak force-pushed the ronakj/drain-callbacks branch from 258c6c2 to 1cff095 Compare May 9, 2026 13:50

repokitteh-read-only Bot removed the waiting label May 9, 2026

repokitteh-read-only Bot added the waiting label May 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

listener_manager: implement filter-chain drain-close callback#44932

listener_manager: implement filter-chain drain-close callback#44932
jronak wants to merge 1 commit into
envoyproxy:mainfrom
jronak:ronakj/drain-callbacks

jronak commented May 7, 2026

Uh oh!

wbpcode commented May 8, 2026 •

edited

Loading

Uh oh!

wbpcode commented May 8, 2026

Uh oh!

jronak commented May 9, 2026

Uh oh!

kyessenov commented May 12, 2026

Uh oh!

wbpcode commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jronak commented May 7, 2026

Uh oh!

wbpcode commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wbpcode commented May 8, 2026

Uh oh!

jronak commented May 9, 2026

Uh oh!

kyessenov commented May 12, 2026

Uh oh!

wbpcode commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

wbpcode commented May 8, 2026 •

edited

Loading