Enhance ext_proc filter to support MXN streaming #34942
Conversation
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to …
/assign @gbrail @htuch @jmarantz @tyxia @yanavlasov
@gbrail cannot be assigned to this issue.
/assign @tyxia
As codeowner for first pass.
/wait
/wait (for #36999)
Kind ping!
Flushing comments.
I haven't read the tests yet. Are we hitting all the corner cases?
To answer this I'd look at a coverage map generated by CI, which you can do with a few (non-obvious) clicks.
if (!message_timer_) {
  message_timer_ = filter_callbacks_->dispatcher().createTimer(cb);
...
if (bodyMode() != ProcessingMode::FULL_DUPLEX_STREAMED) {
Please comment on how the next processing step occurs in full duplex mode.
I think that comment reflects what the code does (which is clear enough already from reading the code), but not the plan for how the next step will occur.
    return ProcessorState::CallbackState::TrailersCallback;
  }
}
...
return ProcessorState::CallbackState::Idle;
}
handleHeadersResponse is too big to comprehend. It will be hard to know whether the change you made has the desired effect, and no undesired ones.
WDYT of breaking this one up also?
Unit tests and integration tests were added specifically for the code change in this function.
Sure, due to historical reasons, there is quite a bit of technical debt in the ext_proc filter state machine code. I added a TODO here. Let's take care of this technical debt in separate follow-up PRs.
By technical debt, do you mean "lack of test coverage"?
I think it would be better to get the test coverage really solid before adding a lot of complexity. This stuff is really complicated to read, and it would help a lot if we at least had confidence that all the code is covered by tests.
For test coverage, we should be good. These are the test cases we added for this new body mode:
Integration tests:
- Server buffers headers and the whole body before sending a response.
- Server buffers headers, the whole body, and trailers before sending a response.
- Server buffers headers and a certain amount of body, then sends body responses without waiting for the end of the body. At the same time, new body chunks keep arriving, and the server continues this buffer-process-respond cycle for a while. Eventually trailers come in; the server then sends the last body-chunk response and the trailer response.
Unit tests:
- Client sends headers and body. Server sends the header response as soon as it receives the header request, i.e., without waiting for the body.
- Client sends headers and trailers, no body. Server sends the header response after receiving the trailers.
- For one HTTP stream, the server does MxN processing for some chunks, then 1x1 (i.e., sends one response per request immediately) for some chunks, then MxN again.
- A couple of server-misbehaving test cases.
- A couple of Envoy-misconfiguration test cases.
These tests try to cover different scenarios: client requests may or may not have a body, and may or may not have trailers; the server's header response may or may not wait for the body, may or may not wait for trailers, may or may not buffer, etc.
Are we hitting all the lines in the coverage report?
In terms of test coverage for the ext_proc code as a whole, I recall it meets the Envoy criteria (>96.3%). The ext_proc fuzzer coverage is >70%.
That's good, but I'd still like you to look at the line coverage and see if we are hitting all your new code.
Yes, based on this LCOV report, the new code is covered:
https://storage.googleapis.com/envoy-pr/8526639/coverage/source/extensions/filters/http/ext_proc/index.html
@@ -39,6 +39,7 @@ envoy_extension_cc_test(
    }),
    extension_names = ["envoy.filters.http.ext_proc"],
    rbe_pool = "2core",
    shard_count = 8,
I really want to know why this test is so slow that it needs multiple cores and 8 shards. I think I asked this before but didn't see the answer.
I responded to this here: #34942 (comment)
Followed up on that thread but copying here:
I don't have an objection to calling it large if it's large.
I'm just surprised it takes this long, and I suspect we have some sleeps or something more complex than a unit test usually has, beyond simply having a number of test cases.
I quickly checked the trace log timestamps. It looks like the tests themselves are very fast, i.e., <5ms in a local setup. It's the test initialization that consumes most of the time (>100ms in the same local setup). And this is the case for existing tests as well, like the very basic SimplestPost test:
TEST_F(HttpFilterTest, SimplestPost) {
/lgtm api
A few more comments regarding how to handle server responses/behavior. Thanks for your patience.
if (!message_timer_) {
  message_timer_ = filter_callbacks_->dispatcher().createTimer(cb);
...
// Skip starting the timer for FULL_DUPLEX_STREAMED body mode.
Without the timer, how are we going to handle the timeout situation where the side stream server doesn't respond?
If the side stream server does not respond, the router filter idle timeout will kick in and destroy the ext_proc filter. This is the same as when a backend server does not respond to a client request.
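For context, a minimal sketch of that router-level backstop (the route and cluster names are placeholders, not from this PR; 15s is the documented default for the per-route timeout):

route_config:
  virtual_hosts:
  - name: backend            # placeholder virtual host
    domains: ["*"]
    routes:
    - match: { prefix: "/" }
      route:
        cluster: backend_cluster  # hypothetical cluster name
        timeout: 15s              # router request timeout; when it fires, the stream is reset
                                  # and the ext_proc filter is destroyed along with it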
ENVOY_LOG(debug, "Applying body response to chunk of data. Size = {}", chunk->length);
MutationUtils::applyBodyMutations(common_response.body_mutation(), chunk_data);
}
bool should_continue = chunk->end_stream;
What if the side stream server doesn't (or just forgets to) respond with end_stream = true? It looks like this is not handled, and we will be stuck there.
Yes, if the side stream does not respond or never sets end_of_stream to true, the ext_proc filter will keep waiting for a response, and eventually the router filter timeout will kick in and destroy the ext_proc filter. This is the same as when a backend server misbehaves.
I created issue #37065 to track the work of adding an integration test for the case where the server fails to send a response in time and the router filter times out.
This theory doesn't sound solid to me, and it seems to make each individual filter's error handling pointless.
Besides, tightly coupling side stream errors with router/upstream errors will result in bad observability and customer experience, as they are two different errors.
I think we should improve the error handling of this design (maybe ext_proc can have its own timeout).
Oh, I thought we agreed that there would be no ext_proc timer for FULL_DUPLEX_STREAMED mode, and that we would let the generic Envoy router idle timer (default 15s) take care of the case where the server is not responding.
This lets the server buffer more data, maybe all the way until end_of_stream is received. Adding an ext_proc-specific timer would limit this capability.
@adisuissa I did an upstream merge. It needs your API approval again. Thanks!
/lgtm api
LGTM, as a WIP/good start. I think those open comments above need to be addressed (and some more tests/load tests) to complete this feature.
This PR is for issue #32090. One example use case is compression performed by the external processing server.
This lets the ext_proc server buffer M request body chunks from Envoy first, process them, and then send N chunks back to Envoy in the STREAMED mode. It also lets the server buffer the entire message, i.e., headers, body, and trailers, before sending back any response.
The ext_proc MxN streaming works this way:
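A minimal sketch of such a configuration (the gRPC cluster name is a placeholder; my understanding is that this body mode also requires the corresponding trailer mode to be SEND):

http_filters:
- name: envoy.filters.http.ext_proc
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor
    grpc_service:
      envoy_grpc:
        cluster_name: ext_proc_server  # hypothetical cluster name
    processing_mode:
      request_body_mode: FULL_DUPLEX_STREAMED  # the new MxN streaming mode
      request_trailer_mode: SEND               # assumed requirement for duplex streaming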
With the above config, Envoy sends body chunks to the ext_proc server as they arrive. The server can buffer all or part of the body (M chunks), then stream the mutated body (possibly split into N chunks) back to Envoy.