Fix ext_proc flaky streaming integration test #30253
Conversation
Signed-off-by: Yanjun Xiang <yanjunxiang@google.com>
…_streamed_test_convert Signed-off-by: Yanjun Xiang <yanjunxiang@google.com>
@@ -47,6 +47,8 @@ class StreamingIntegrationTest : public HttpIntegrationTest,
    // This enables a built-in automatic upstream server.
    autonomous_upstream_ = true;
    proto_config_.set_allow_mode_override(true);
    proto_config_.mutable_message_timeout()->set_seconds(2);
This still seems likely to flake when overloaded. Is there a way to run these tests with mock/fake time?
Yeah, that's doable. I am thinking of converting some of these flaky tests in streaming_integration_test.cc into unit tests and putting them in filter_test.cc. This will be done in a follow-up PR.
You can use simulated time in integration tests, so IMO this is actionable in this PR.
Also adding a unit test doesn't really address Harvey's comment that this test would be likely to flake :)
Setting failure_mode_allow to true, as below, lets the tests pass even if a timeout or a random gRPC channel close happens:
- proto_config_.set_failure_mode_allow(true);
In that case the request is forwarded as if the ext_proc filter did not exist.
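For reference, a sketch of the equivalent YAML filter configuration (field names taken from the `envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor` proto; the cluster name is a placeholder):

```yaml
http_filters:
- name: envoy.filters.http.ext_proc
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor
    failure_mode_allow: true   # on error/timeout, continue as if the filter were absent
    message_timeout: 2s        # the value this PR raises from 200ms
    grpc_service:
      envoy_grpc:
        cluster_name: ext_proc_server  # placeholder cluster
```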
IDK what impact that setting has on the realism of the test. At first glance it makes me wonder if the test just isn't really checking that much.
The purpose of simulated-time tests is to enable robust, predictable testing of code with timeouts, including in integration tests. You can force a timeout to occur, or not occur, based on explicit control of time from the test fixture.
Thanks for the comments! There are other ext_proc tests to test timeout behavior, like here and other tests around it:
https://github.com/envoyproxy/envoy/blob/5c9cc8460c582f1a895dbffd974e248f0b2911bc/test/extensions/filters/http/ext_proc/ext_proc_integration_test.cc#L1690C56-L1690C56
streaming_integration_test.cc is not meant to test the timeout scenario. It tests the ext_proc server mutation behaviors when the ext_proc filter is configured in STREAMED or BUFFERED_PARTIAL mode.
The original test cases were written with the assumption that a timeout won't happen and the gRPC channel will not close. But in the tests, both may happen, which leads to flakiness. So, extending the timeout value from 200ms to 2s reduces the timeout probability, and setting failure_mode_allow to true covers the case where either error does happen: the tests can still pass, e.g. the client still receives a 200 instead of a 500.
Depending on how flaky this file still is after the proposed change, we can consider converting these streamed integration tests to unit tests using mocks, similar to what we have done in this PR: #29940.
Signed-off-by: Yanjun Xiang <yanjunxiang@google.com>
…nnel issue Signed-off-by: Yanjun Xiang <yanjunxiang@google.com>
/assign-from @envoyproxy/envoy-maintainers
@envoyproxy/envoy-maintainers assignee is @lizan
@yanavlasov could you take a look since this is potentially blocking the release?
There is one more flaky issue here: sometimes an ext_proc server response is somehow sent to the Envoy listener port, which triggers the ext_proc filter decodeHeaders()/decodeData() code to be called again unexpectedly and causes the test failure. I am still trying to understand how this happens. It happens pretty rarely based on my local test runs.
So my reading of the current PR is that it's fairly assuredly an improvement and should be good to land - currently this is one of the only remaining release blockers. If the remaining issues are known (and esp. if we resolve them quickly) then it should be easy to backport any fixes; happy to help there. tldr - let's land unless we think this will worsen the situation.
[ RUN ] StreamingProtocols/StreamingIntegrationTest.PostAndProcessBufferedRequestBody/IPv4_GoogleGrpc_NoDeferredProcessing
[2023-10-18 04:32:59.387][12][info][testing] [test/integration/fake_upstream.cc:665] starting fake server at 127.0.0.1:36013. UDP=false codec=http2
[2023-10-18 04:32:59.388][12][debug][misc] [test/integration/base_integration_test.cc:194] Setting up file-based LDS
[2023-10-18 04:32:59.391][12][debug][misc] [test/config/utility.cc:1045] Not overriding preset port
[2023-10-18 04:32:59.404][12][debug][misc] [test/integration/base_integration_test.cc:236] Running Envoy with configuration:
static_resources:
...
[2023-10-18 04:32:59.540][337][info][main] [source/server/server.cc:942] starting main dispatch loop
[2023-10-18 04:32:59.551][12][debug][testing] [test/integration/base_integration_test.cc:418] registered 'http' as port 34783.  <<<<<< 34783 is the listener port
[2023-10-18 04:32:59.586][341][trace][ext_proc] [source/extensions/filters/http/ext_proc/ext_proc.cc:420] decodeData(1000): end_stream = false
[2023-10-18 04:32:59.586][341][trace][ext_proc] [source/extensions/filters/http/ext_proc/ext_proc.cc:300] onData: Buffering
[2023-10-18 04:32:59.586][341][trace][ext_proc] [source/extensions/filters/http/ext_proc/ext_proc.cc:422] decodeData returning 1
[2023-10-18 04:32:59.586][341][trace][http] [source/common/http/filter_manager.cc:704] [Tags: "ConnectionId":"47","StreamId":"4634651194543927278"] decode data called: filter=envoy.filters.http.ext_proc status=1
[2023-10-18 04:32:59.586][341][trace][http2] [source/common/http/http2/codec_impl.cc:1061] [Tags: "ConnectionId":"47"] about to recv frame type=0, flags=0, stream_id=1
[2023-10-18 04:32:59.586][341][trace][http2] [source/common/http/http2/codec_impl.cc:1087] [Tags: "ConnectionId":"47"] recv frame type=0
[2023-10-18 04:32:59.586][341][trace][http2] [source/common/http/http2/codec_impl.cc:2138] [Tags: "ConnectionId":"47"] track inbound frame type=0 flags=0 length=1000 padding_length=0
[2023-10-18 04:32:59.586][341][trace][misc] [source/common/event/scaled_range_timer_manager_impl.cc:60] enableTimer called on 0x25feff532fc0 for 300000ms, min is 300000ms
[2023-10-18 04:32:59.586][341][trace][ext_proc] [source/extensions/filters/http/ext_proc/ext_proc.cc:420] decodeData(1000): end_stream = false
[2023-10-18 04:32:59.586][341][trace][ext_proc] [source/extensions/filters/http/ext_proc/ext_proc.cc:300] onData: Buffering
[2023-10-18 04:32:59.586][341][trace][ext_proc] [source/extensions/filters/http/ext_proc/ext_proc.cc:422] decodeData returning 1
[2023-10-18 04:32:59.586][341][trace][http] [source/common/http/filter_manager.cc:704] [Tags: "ConnectionId":"47","StreamId":"4634651194543927278"] decode data called: filter=envoy.filters.http.ext_proc status=1
[2023-10-18 04:32:59.586][341][trace][http2] [source/common/http/http2/codec_impl.cc:989] [Tags: "ConnectionId":"47"] dispatched 16144 bytes
[2023-10-18 04:32:59.586][341][trace][connection] [source/common/network/connection_impl.cc:614] [Tags: "ConnectionId":"47"] socket event: 3
[2023-10-18 04:32:59.586][341][trace][connection] [source/common/network/connection_impl.cc:737] [Tags: "ConnectionId":"47"] write ready
[2023-10-18 04:32:59.586][341][trace][connection] [source/common/network/connection_impl.cc:654] [Tags: "ConnectionId":"47"] read ready. dispatch_buffered_data=0
[2023-10-18 04:32:59.587][341][trace][connection] [source/common/network/raw_buffer_socket.cc:25] [Tags: "ConnectionId":"47"] read returns: 9
[2023-10-18 04:32:59.587][341][trace][connection] [source/common/network/raw_buffer_socket.cc:39] [Tags: "ConnectionId":"47"] read error: Resource temporarily unavailable, code: 0
[2023-10-18 04:32:59.587][341][trace][http2] [source/common/http/http2/codec_impl.cc:954] [Tags: "ConnectionId":"47"] dispatching 9 bytes
[2023-10-18 04:32:59.587][341][trace][http2] [source/common/http/http2/codec_impl.cc:1061] [Tags: "ConnectionId":"47"] about to recv frame type=0, flags=1, stream_id=1
[2023-10-18 04:32:59.587][341][trace][http2] [source/common/http/http2/codec_impl.cc:1087] [Tags: "ConnectionId":"47"] recv frame type=0
[2023-10-18 04:32:59.587][341][trace][http2] [source/common/http/http2/codec_impl.cc:2138] [Tags: "ConnectionId":"47"] track inbound frame type=0 flags=1 length=0 padding_length=0
[2023-10-18 04:32:59.587][341][debug][http] [source/common/http/conn_manager_impl.cc:1177] [Tags: "ConnectionId":"47","StreamId":"4634651194543927278"] request end stream
[2023-10-18 04:32:59.587][341][trace][misc] [source/common/event/scaled_range_timer_manager_impl.cc:60] enableTimer called on 0x25feff532fc0 for 300000ms, min is 300000ms
[2023-10-18 04:32:59.587][341][trace][ext_proc] [source/extensions/filters/http/ext_proc/ext_proc.cc:420] decodeData(0): end_stream = true
[2023-10-18 04:32:59.587][341][debug][ext_proc] [source/extensions/filters/http/ext_proc/ext_proc.cc:290] Sending request body message
[2023-10-18 04:32:59.587][341][debug][ext_proc] [source/extensions/filters/http/ext_proc/ext_proc.cc:538] Sending a body chunk of 99000 bytes, end_stram true
[2023-10-18 04:32:59.587][341][debug][ext_proc] [source/extensions/filters/http/ext_proc/processor_state.cc:30] Traffic direction INBOUND: 2000 ms timer enabled
[2023-10-18 04:32:59.587][341][trace][grpc] [source/common/grpc/google_async_client_impl.cc:233] Queued message to write (99010 bytes)
[2023-10-18 04:32:59.587][341][trace][grpc] [source/common/grpc/google_async_client_impl.cc:276] Write op dispatched
[2023-10-18 04:32:59.587][341][trace][ext_proc] [source/extensions/filters/http/ext_proc/ext_proc.cc:422] decodeData returning 3
[2023-10-18 04:32:59.587][341][trace][http] [source/common/http/filter_manager.cc:704] [Tags: "ConnectionId":"47","StreamId":"4634651194543927278"] decode data called: filter=envoy.filters.http.ext_proc status=3
[2023-10-18 04:32:59.587][342][trace][grpc] [source/common/grpc/google_async_client_impl.cc:61] completionThread CQ event 3 true
[2023-10-18 04:32:59.587][341][trace][http2] [source/common/http/http2/codec_impl.cc:989] [Tags: "ConnectionId":"47"] dispatched 9 bytes
[2023-10-18 04:32:59.587][341][trace][connection] [source/common/network/connection_impl.cc:614] [Tags: "ConnectionId":"48"] socket event: 3
[2023-10-18 04:32:59.587][341][trace][connection] [source/common/network/connection_impl.cc:737] [Tags: "ConnectionId":"48"] write ready
[2023-10-18 04:32:59.587][341][trace][connection] [source/common/network/connection_impl.cc:654] [Tags: "ConnectionId":"48"] read ready. dispatch_buffered_data=0
[2023-10-18 04:32:59.587][341][trace][connection] [source/common/network/raw_buffer_socket.cc:25] [Tags: "ConnectionId":"48"] read returns: 510
[2023-10-18 04:32:59.587][341][trace][connection] [source/common/network/raw_buffer_socket.cc:39] [Tags: "ConnectionId":"48"] read error: Resource temporarily unavailable, code: 0
[2023-10-18 04:32:59.587][341][trace][http2] [source/common/http/http2/codec_impl.cc:954] [Tags: "ConnectionId":"48"] dispatching 510 bytes
[2023-10-18 04:32:59.587][341][trace][http2] [source/common/http/http2/codec_impl.cc:1061] [Tags: "ConnectionId":"48"] about to recv frame type=4, flags=1, stream_id=0
[2023-10-18 04:32:59.587][341][trace][http2] [source/common/http/http2/codec_impl.cc:2138] [Tags: "ConnectionId":"48"] track inbound frame type=4 flags=1 length=0 padding_length=0
[2023-10-18 04:32:59.587][341][trace][http2] [source/common/http/http2/codec_impl.cc:1087] [Tags: "ConnectionId":"48"] recv frame type=4
[2023-10-18 04:32:59.587][341][trace][http2] [source/common/http/http2/codec_impl.cc:1061] [Tags: "ConnectionId":"48"] about to recv frame type=1, flags=4, stream_id=1
[2023-10-18 04:32:59.587][341][trace][http2] [source/common/http/http2/codec_impl.cc:2138] [Tags: "ConnectionId":"48"] track inbound frame type=1 flags=4 length=289 padding_length=0
[2023-10-18 04:32:59.588][341][debug][http] [source/common/http/conn_manager_impl.cc:391] [Tags: "ConnectionId":"48"] new stream
[2023-10-18 04:32:59.588][341][trace][misc] [source/common/event/scaled_range_timer_manager_impl.cc:60] enableTimer called on 0x25feff53f180 for 300000ms, min is 300000ms
[2023-10-18 04:32:59.588][341][trace][http2] [source/common/http/http2/codec_impl.cc:1087] [Tags: "ConnectionId":"48"] recv frame type=1
[2023-10-18 04:32:59.588][341][debug][http] [source/common/http/conn_manager_impl.cc:1194] [Tags: "ConnectionId":"48","StreamId":"3222402010705002608"] request headers complete (end_stream=false):
':path', '/envoy.service.ext_proc.v3.ExternalProcessor/Process'
':authority', '127.0.0.1:34783'  <<<<<< I don't see this message in a good test run. So not sure where it is coming from.
':method', 'POST'
':scheme', 'http'
'content-type', 'application/grpc'
'te', 'trailers'
'grpc-accept-encoding', 'identity, deflate, gzip'
'user-agent', 'grpc-c++/1.56.2 grpc-c/33.0.0 (linux; chttp2)'
'x-request-id', 'ac11bbc7-5c95-4364-8ec0-fafbe03c4f46'
[2023-10-18 04:32:59.588][341][debug][connection] [./source/common/network/connection_impl.h:98] [Tags: "ConnectionId":"48"] current connecting state: false
[2023-10-18 04:32:59.588][342][trace][grpc] [source/common/grpc/google_async_client_impl.cc:61] completionThread CQ event 2 true
[2023-10-18 04:32:59.588][341][trace][ext_proc] [source/extensions/filters/http/ext_proc/ext_proc.cc:232] decodeHeaders: end_stream = false  <<<<<<< ext_proc decodeHeaders() is called again here
[2023-10-18 04:32:59.588][341][debug][ext_proc] [source/extensions/filters/http/ext_proc/ext_proc.cc:162] Opening gRPC stream to external processor
[2023-10-18 04:32:59.589][342][trace][grpc] [source/common/grpc/google_async_client_impl.cc:61] completionThread CQ event 0 true
[2023-10-18 04:32:59.589][341][debug][ext_proc] [source/extensions/filters/http/ext_proc/processor_state.cc:30] Traffic direction INBOUND: 2000 ms timer enabled
[2023-10-18 04:32:59.589][341][debug][ext_proc] [source/extensions/filters/http/ext_proc/ext_proc.cc:224] Sending headers message
[2023-10-18 04:32:59.589][341][trace][grpc] [source/common/grpc/google_async_client_impl.cc:233] Queued message to write (375 bytes)
[2023-10-18 04:32:59.589][341][trace][ext_proc] [source/extensions/filters/http/ext_proc/ext_proc.cc:244] decodeHeaders returning 1
[2023-10-18 04:32:59.589][341][trace][http] [source/common/http/filter_manager.cc:572] [Tags: "ConnectionId":"48","StreamId":"3222402010705002608"] decode headers called: filter=envoy.filters.http.ext_proc status=1
[2023-10-18 04:32:59.589][341][trace][misc] [source/common/event/scaled_range_timer_manager_impl.cc:60] enableTimer called on 0x25feff53f180 for 300000ms, min is 300000ms
[2023-10-18 04:32:59.589][341][trace][http2] [source/common/http/http2/codec_impl.cc:1061] [Tags: "ConnectionId":"48"] about to recv frame type=0, flags=0, stream_id=1
[2023-10-18 04:32:59.589][341][trace][http2] [source/common/http/http2/codec_impl.cc:1087] [Tags: "ConnectionId":"48"] recv frame type=0
[2023-10-18 04:32:59.589][341][trace][http2] [source/common/http/http2/codec_impl.cc:2138] [Tags: "ConnectionId":"48"] track inbound frame type=0 flags=0 length=194 padding_length=0
[2023-10-18 04:32:59.589][341][trace][misc] [source/common/event/scaled_range_timer_manager_impl.cc:60] enableTimer called on 0x25feff53f180 for 300000ms, min is 300000ms
[2023-10-18 04:32:59.589][341][trace][ext_proc] [source/extensions/filters/http/ext_proc/ext_proc.cc:420] decodeData(194): end_stream = false  <<<<<<< ext_proc decodeData() is called again, which eventually leads to the test failure
[2023-10-18 04:32:59.589][341][trace][ext_proc] [source/extensions/filters/http/ext_proc/ext_proc.cc:263] Header processing still in progress -- holding body data
[2023-10-18 04:32:59.589][341][debug][ext_proc] [source/extensions/filters/http/ext_proc/processor_state.cc:400] Watermark raised on decoding
....
[2023-10-18 04:33:01.601][341][trace][main] [source/common/event/dispatcher_impl.cc:124] clearing deferred deletion list (size=2)
test/extensions/filters/http/ext_proc/streaming_integration_test.cc:232: Failure
Expected equality of these values:
body_req.request_body().body().size()
Which is: 194
total_size
Which is: 99000
Stack trace:
0x2124710: Envoy::Extensions::HttpFilters::ExternalProcessing::StreamingIntegrationTest_PostAndProcessBufferedRequestBody_Test::TestBody()::$_3::operator()()
0x2123db2: std::__invoke_impl<>()
0x2123d42: std::__invoke_r<>()
0x2123c32: std::_Function_handler<>::_M_invoke()
0x220a865: std::function<>::operator()()
0x220a39b: Envoy::Extensions::HttpFilters::ExternalProcessing::ProcessorWrapper::Process()
0x220c028: envoy::service::ext_proc::v3::ExternalProcessor::Service::Service()::$_0::operator()()
0x220bfd5: std::__invoke_impl<>()
0x220bf1f: std::__invoke_r<>()
0x220bddf: std::_Function_handler<>::_M_invoke()
0x221ec42: std::function<>::operator()()
0x221eb97: grpc::internal::BidiStreamingHandler<>::BidiStreamingHandler()::{lambda()#1}::operator()()
0x221eb3d: std::__invoke_impl<>()
0x221eaaa: std::__invoke_r<>()
0x221e85a: std::_Function_handler<>::_M_invoke()
0x222080d: std::function<>::operator()()
0x2220786: grpc::internal::TemplatedBidiStreamingHandler<>::RunHandler()::{lambda()#1}::operator()()
0x221f0d4: grpc::internal::CatchingFunctionHandler<>()
0x221e37f: grpc::internal::TemplatedBidiStreamingHandler<>::RunHandler()
0x326f11a: grpc::Server::SyncRequest::ContinueRunAfterInterception()
0x326ece2: grpc::Server::SyncRequest::Run()
0x326e7f3: grpc::Server::SyncRequestThreadManager::DoWork()
0x3291cf0: grpc::ThreadManager::MainWorkLoop()
0x3291a0c: grpc::ThreadManager::WorkerThread::Run()
0x3292689: grpc::ThreadManager::WorkerThread::WorkerThread()::$_0::operator()()
0x3292665: grpc::ThreadManager::WorkerThread::WorkerThread()::$_0::__invoke()
0x3a0b315: grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix()::{lambda()#1}::operator()()
0x3a0b1e5: grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix()::{lambda()#1}::__invoke()
0x7fb6318293ec: (unknown)
[2023-10-18 04:33:03.206][12][debug][init] [source/common/init/watcher_impl.cc:31] init manager RTDS destroyed
With this change, the flakiness went down to 0.1% based on my local testing with runs_per_test = 1000. We can land it first and then take time to root-cause the above issue. In the worst case, we can completely disable the test target of streaming_integration_test. A significant number of STREAMED mode tests were added to filter_test.cc recently, which gives us pretty good coverage there.
Not a big fan of disabling tests, but promptly unbreaking CI and fixing the tests after makes sense.
ext_proc streaming_integration_test is flaky due to:
Solution:
Sometimes, the autonomous fake upstream also leads to flaky test failures during test shutdown and the destructor of autonomous_upstream. The long-term solution to avoid the flakiness and re-enable the above tests is to convert streaming_integration_test to not use the test server and autonomous upstream, and instead use a normal fake upstream as in ext_proc_integration_test.cc. Or, maybe even simpler, convert these into unit tests and put them in filter_test.cc.