Skip to content

Comments

network: add close through network filter_manager support to check filter's close status#38778

Merged
botengyao merged 43 commits intoenvoyproxy:mainfrom
botengyao:network_filter_manager_close
Apr 9, 2025
Merged

network: add close through network filter_manager support to check filter's close status#38778
botengyao merged 43 commits intoenvoyproxy:mainfrom
botengyao:network_filter_manager_close

Conversation

@botengyao
Copy link
Member

@botengyao botengyao commented Mar 18, 2025

To solve the data loss issue during the close sequence especially for the middle external processing network filter in the filter chain. #38779

image

This PR will let the close manage by filter manager for some cases, and also introduce an enum StopIterationAndDontClose for network filter, which a continueClosing() is expected if the filter manger is closing the connection. It will check all the networks' filter status before closing.

Alternative Design

To provide more flexibility, we can split the disable close and StopIteration functionalities, and introduce an disableClose(bool) to change a filter's pending_close_ status.

Key changes in ConnectionImpl

For the read path directly from socket:

  • The RemoteClose will close the socket when the socket is closed when half close is not enabled or both ways are half closed if half close is enabled immediately. With this PR right now the RemoteClose will be managed by the filter manager. Since a filter could return StopIterationAndDontClose, the closeThroughFilerManager() will pause the close until the filter calls continueClosing.

For the write path directly from socket:

  • If it is in the FlushWrite or FlushWriteAndDelay status that is triggered by close(), the socket will close immediately since the close() is managed by the filter manager.
  • For half close cases, if both ways are half closed, the close will also be managed by the filter manager, since the read filter is possible in the close pending status.

For the close() that is initialized by Envoy's connection:

  • We will only let the close manage by the filter manger if it is FlushWrite and FlushWriteAndDelay
  • Abort, NoFlush, and AbortReset will still close the socket immediately.
  • We still want to have the ability to flush the write buffer after the wait, and that is the major reason that we don't directly manage the closeSocket(LocalClose).

For remote early close:

  • If half-close is enabled, this is never activated.
  • If half-close is disabled, there are two scenarios when the close event is monitored:
    1. During the closeInternal(_), which comes from filter manager already.
    2. When an early close is detected while the connection is read-disabled. It is okay to bypass the filter manager's status since there will be data loss even in normal cases.

Key changes in Filter_manager

  • Introduce a StopIterationAndDontClose for filter status and conntinueClosing()
  • Introduce the state_ in filter manager to check the pending close status
  • It supports multiple filter use cases by introducing a counter.

Note:

  • this will not impact normal traffic without the StopIterationAndDontClose.
  • If the connection is pending close, only injectWriteDataToFilterChain and injectReadDataToFilterChain is called.
  • The close is tracked by counting if all network filters have finished the job.

Commit Message:
Additional Description:
Risk Level: Low for normal case, and High if feature is used.
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:

Signed-off-by: Boteng Yao <boteng@google.com>
@repokitteh-read-only
Copy link

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #38778 was opened by botengyao.

see: more, trace.

Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
@botengyao botengyao changed the title [DRAFT] network: allow network filter_chain_manager to delay close from Connectionimpl [DRAFT] network: add close through network filter_manager support to check filter's close status Mar 18, 2025
Signed-off-by: Boteng Yao <boteng@google.com>
…ger_close

Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
@botengyao
Copy link
Member Author

botengyao commented Mar 18, 2025

@yanavlasov @KBaichoo this is a PR to let network filter manager manage the close for connections, and introduced a StopIterationAndDontClose and continueClosing status. More tests are pending, but want to get the feedback at current stage. Thank you!

cc @envoyproxy/envoy-maintainers

Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
@botengyao botengyao changed the title [DRAFT] network: add close through network filter_manager support to check filter's close status network: add close through network filter_manager support to check filter's close status Mar 19, 2025
Signed-off-by: Boteng Yao <boteng@google.com>
@botengyao botengyao marked this pull request as ready for review March 19, 2025 19:42
Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
…ger_close

Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
@wbpcode
Copy link
Member

wbpcode commented Apr 1, 2025

I am converting this PR to avoid a new filter status, but once we understand the logic the complexity is acceptable and manageable ;-).

Yeah. I think it may need more comment to these additional branch to explain why for me and future developers. Sorry, I don't want to block this new feature, but help ensure it maintainable for future.

Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
@botengyao
Copy link
Member Author

botengyao commented Apr 1, 2025

I am converting this PR to avoid a new filter status, but once we understand the logic the complexity is acceptable and manageable ;-).

Yeah. I think it may need more comment to these additional branch to explain why for me and future developers. Sorry, I don't want to block this new feature, but help ensure it maintainable for future.

agreed, added more comments. PTAL, and thank you!

Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
@botengyao
Copy link
Member Author

/retest

Copy link
Member

@wbpcode wbpcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I think I gradually get the code. Although it take some time. Some comments are added. My core concern is still the complexity. I am trying to make it more simpler and without effect the feature you want.

But basically, I think this is fine.

FilterManagerImpl& parent_;
ReadFilterSharedPtr filter_;
bool initialized_{};
bool pending_close_{false};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get the usage of this bool flag.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a flag to indicate if a filter is gating the close, and this flag will impact the counter and disableClose() will change its status.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I got it's usage. But I am not sure if it's necessary to use it. Why cannot we just increment the count when the closeDisable(true) is called?
We have a similar practice on the read, the readDisable(true/false). I think use the similar parttern may be better?

Copy link
Member Author

@botengyao botengyao Apr 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key design here is to simplify the usage and lose the counter check, and we only count and check each filter's status, and the counter will only ++ when the filter status is changing.

  • disableClose(true) marks the filter as unable to close by delaying the closure process.
  • Calling disableClose(true) again has no additional effect, as the closure is already marked as pending.
  • One disableClose(false) can unblock the close, meaning the filter has done its job.

readDisable is kind of ++ and -- mapped for each call due to the buffer scenario and some initialization scenario are very simple. But close scenario can be more complicated depending on many things which it is possible that there is no a strict start and end.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disableClose(true) marks the filter as unable to close by delaying the closure process.
Calling disableClose(true) again has no additional effect, as the closure is already marked as pending.
One disableClose(false) can unblock the close, meaning the filter has done its job.

I knew this behavior. I was thinking if there are multiple async actions block the closing, I think may be we should call the disableClose(true) multiple time and increment the count. And everytime once an action is done we can call the disableClose(false) to decrement the count.

That's say, I think the disableClose(true) and disableClose(false) should be a pair. wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline, and we will lose the counter check by the filter manager through only checking all filters status. We will let the filter itself to maintain the final close check.

…ger_close

Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
Signed-off-by: Boteng Yao <boteng@google.com>
@repokitteh-read-only
Copy link

CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).

🐱

Caused by: #38778 was synchronize by botengyao.

see: more, trace.

Copy link
Member Author

@botengyao botengyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks you @wbpcode for the review. Updated and added more comments.

FilterManagerImpl& parent_;
ReadFilterSharedPtr filter_;
bool initialized_{};
bool pending_close_{false};
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a flag to indicate if a filter is gating the close, and this flag will impact the counter and disableClose() will change its status.

Signed-off-by: Boteng Yao <boteng@google.com>
@botengyao
Copy link
Member Author

/retest

Signed-off-by: Boteng Yao <boteng@google.com>
@botengyao
Copy link
Member Author

/retest

Copy link
Member

@wbpcode wbpcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall with a minor comment. I think we will cut a release next week. I will prefer to merge this after next week as IMO this is a high-risk PR. Thanks for the contribution again. 🌷

Signed-off-by: Boteng Yao <boteng@google.com>
@botengyao
Copy link
Member Author

botengyao commented Apr 8, 2025

LGTM overall with a minor comment. I think we will cut a release next week. I will prefer to merge this after next week as IMO this is a high-risk PR. Thanks for the contribution again

Thanks @wbpcode for the dedicated time to review this PR. This is guarded by a false runtime guard meaning it is not enabled, and I want to bake it some time before enabling it, so I think there should be no-risk, wdyt?

@botengyao
Copy link
Member Author

/retest

@wbpcode
Copy link
Member

wbpcode commented Apr 9, 2025

LGTM overall with a minor comment. I think we will cut a release next week. I will prefer to merge this after next week as IMO this is a high-risk PR. Thanks for the contribution again

Thanks @wbpcode for the dedicated time to review this PR. This is guarded by a false runtime guard meaning it is not enabled, and I want to bake it some time before enabling it, so I think there should be no-risk, wdyt?

sgtm.

Copy link
Contributor

@KBaichoo KBaichoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise lgtm

@botengyao botengyao merged commit 079e652 into envoyproxy:main Apr 9, 2025
25 checks passed
yanavlasov pushed a commit that referenced this pull request Apr 23, 2025
Added the basic ext_proc gRPC callout functions without the data loss
that was addressed in #38778.
gRPC client was shared with HTTP ext_proc that was templated by
#38898

Pending:
* COUNTER will be in the followup RPs.
* filter_metadata handling is pending.
* log info is pending.
* timeout setting is pending
* flow control is pending

Commit Message:
Additional Description:
Risk Level: low
Testing:

#38721

---------

Signed-off-by: Boteng Yao <boteng@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants