tcp proxy: add new option to let tcp proxy check the drain close status#44567
Conversation
Signed-off-by: wbpcode/wangbaiping <wbphub@gmail.com>
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
|
/retest |
| * All tcp proxy stats. @see stats_macros.h | ||
| */ | ||
| #define ALL_TCP_PROXY_STATS(COUNTER, GAUGE) \ | ||
| COUNTER(downstream_cx_drain_close) \ |
There was a problem hiding this comment.
Is this counter described in a doc?
… dev-fix-tcp-proxy
Signed-off-by: wbpcode/wangbaiping <wbphub@gmail.com>
Signed-off-by: wbpcode/wangbaiping <wbphub@gmail.com>
Signed-off-by: wbpcode/wangbaiping <wbphub@gmail.com>
agrawroh
left a comment
There was a problem hiding this comment.
Just one question, LGTM otherwise
… dev-fix-tcp-proxy
|
/retest |
|
I guess the CI failure have no business with this PR... |
… dev-fix-tcp-proxy
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a check_drain_close configuration option to the TCP proxy filter. When enabled, the filter checks for a drain signal after each read or write operation and closes the downstream connection using FlushWrite if a drain is requested. The changes include updates to the API, documentation, implementation in tcp_proxy.cc, and comprehensive unit tests. A review comment suggests that the drain check in onData should be extended to cover active proxying paths where an upstream connection exists, ensuring consistent behavior.
There was a problem hiding this comment.
Pull request overview
Adds an opt-in TCP proxy feature to proactively honor drain-close decisions during active TCP data flow, reducing the likelihood of reconnect storms at the end of listener drain.
Changes:
- Add
check_drain_closeAPI field and wire it intoTcpProxy::Filterto close downstream connections withFlushWritewhen drain close is requested. - Add a new stat (
downstream_cx_drain_close) and a new local close reason (tcp_proxy_drain_close) for observability. - Add/extend unit tests and update TCP proxy stats documentation.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| test/mocks/server/factory_context.h | Extend mock factory context with MockListenerInfo to support listener direction-dependent behavior. |
| test/mocks/server/factory_context.cc | Provide default listenerInfo()/direction() behavior for tests. |
| test/extensions/filters/network/tcp_proxy/config_test.cc | Add config parsing test for the new check_drain_close field. |
| test/common/tcp_proxy/tcp_proxy_test.cc | Add unit coverage for drain-close behavior on downstream read/upstream write and for inbound-only scope selection. |
| source/common/tcp_proxy/tcp_proxy.h | Add new stat and config accessors for drain-close decision/scope and the feature flag. |
| source/common/tcp_proxy/tcp_proxy.cc | Implement drain-close check and downstream closure after read/write handling when enabled. |
| envoy/stream_info/stream_info.h | Add TcpProxyDrainClose local close reason string. |
| docs/root/configuration/listeners/network_filters/tcp_proxy_filter.rst | Document the new downstream drain-close counter. |
| changelogs/current.yaml | Add release note entry describing the new TCP proxy drain-close check feature. |
| api/envoy/extensions/filters/network/tcp_proxy/v3/tcp_proxy.proto | Add check_drain_close field and update next-free-field annotation. |
Signed-off-by: wbpcode/wangbaiping <wbphub@gmail.com>
| // the downstream connection is closed with ``FlushWrite``. | ||
| // | ||
| // This is disabled by default for backward compatibility. | ||
| bool check_drain_close = 24; |
There was a problem hiding this comment.
Curious: does HTTP manager do something similar? Does it poll the state of connections or does it register a callback with the drain manager? I feel like the polling method adds non-determinism (e.g. active connections are closed but idle connection remain stale), which makes it hard to use.
There was a problem hiding this comment.
The HCM will poll the state of drain manager at the end of every single request. This also one reason why I choose the polling implementation.
I feel like the polling method adds non-determinism (e.g. active connections are closed but idle connection remain stale), which makes it hard to use.
Yeah. This is known problem/shortage. But callbacks solution will brings huge complexity to core code (callback lifetime management, thread safe, graceful draining and so on). I guess an appropriate idle timeout setting combine the polling should be good enough for most scenarios?
|
/retest |
ggreenway
left a comment
There was a problem hiding this comment.
Looks good overall; just a few nits
/wait
| // after each read or write. When drain close is requested for the listener's traffic direction, | ||
| // the downstream connection is closed with ``FlushWrite``. | ||
| // | ||
| // This is disabled by default for backward compatibility. |
There was a problem hiding this comment.
Should we enable by default and leave a runtime setting to temporarily change the default. It seems like much better behavior to have this enabled.
There was a problem hiding this comment.
I am not sure if there is a case that the users want to keep the existing TCP connections as long time as possible by setting a long drain duration. Different with HTTP, we have no way to close the connection gracefully.
So, I slightly inclined to keep an option to control it. But I can change the bool to wrapped value so it's would be easier to change the default value in the future. WDYT?
There was a problem hiding this comment.
If you strongly prefer to use runtime guard and enable it by default, I am also fine to that. Feel free to let me know.
There was a problem hiding this comment.
I don't feel strongly, it was just a thought. I'm ok leaving it as is.
There was a problem hiding this comment.
If you used a message here, you could have put a timer-based check as well in the future. I don't think we see TCP keep-alives in Envoy, unlike HTTP, so a connection can be stuck without drain or data without a timer.
Signed-off-by: wbpcode/wangbaiping <wbphub@gmail.com>
… dev-fix-tcp-proxy
|
This API is reviewed before. And latest one only change it to wrapped message. |
…us (envoyproxy#44567) Commit Message: tcp proxy: add new option to let tcp proxy check the drain close status Additional Description: To close envoyproxy#44419. There are some other solutions in my mind, like to register a callback for every connection to the drain close manager. But it will introduce huge complexity to ensure all these callbacks are called correctly in correct threads and have impact to our core code tree. So, I finally select the simplest one. This is clean, safe, easy to review/maintain, and kept similar logic with HCM. Risk Level: low. touch core code but guarded by new proto API. Testing: unit. Docs Changes: n/a. Release Notes: added. Platform Specific Features: n/a. --------- Signed-off-by: wbpcode/wangbaiping <wbphub@gmail.com> Signed-off-by: Jiawei Wu <wujiawei@google.com>
…us (envoyproxy#44567) Commit Message: tcp proxy: add new option to let tcp proxy check the drain close status Additional Description: To close envoyproxy#44419. There are some other solutions in my mind, like to register a callback for every connection to the drain close manager. But it will introduce huge complexity to ensure all these callbacks are called correctly in correct threads and have impact to our core code tree. So, I finally select the simplest one. This is clean, safe, easy to review/maintain, and kept similar logic with HCM. Risk Level: low. touch core code but guarded by new proto API. Testing: unit. Docs Changes: n/a. Release Notes: added. Platform Specific Features: n/a. --------- Signed-off-by: wbpcode/wangbaiping <wbphub@gmail.com> Signed-off-by: Alireza <alrzazz98@gmail.com>
Commit Message: tcp proxy: add new option to let tcp proxy check the drain close status
Additional Description:
To close #44419.
There are some other solutions in my mind, like to register a callback for every connection to the drain close manager. But it will introduce huge complexity to ensure all these callbacks are called correctly in correct threads and have impact to our core code tree.
So, I finally select the simplest one. This is clean, safe, easy to review/maintain, and kept similar logic with HCM.
Risk Level: low. touch core code but guarded by new proto API.
Testing: unit.
Docs Changes: n/a.
Release Notes: added.
Platform Specific Features: n/a.