Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QuicHttpIntegrationTest fails about 20% of the time under clang-tsan due to FakeUpstream object lifetime problems #16493

Closed
antoniovicente opened this issue May 14, 2021 · 7 comments
Assignees
Labels
area/quic area/test flakes bug priority/high stale stalebot believes this issue/PR has not been touched recently

Comments

@antoniovicente
Copy link
Contributor

antoniovicente commented May 14, 2021

$ bazel test --config=remote --config=clang-tsan --runs_per_test=10 //test/integration:quic_http_integration_test
//test/integration:quic_http_integration_test FAILED in 12 out of 60 in 118.7s

There are several different failures modes, but they boil down to the fake upstream running callbacks in the dispatcher that refer to objects that are already deleted or possibly double deleting objects.

Test cases I have seen fail:
QuicHttpIntegrationTests/QuicHttpIntegrationTest.DownstreamReadDisabledOnGiantPost
QuicHttpIntegrationTests/QuicHttpIntegrationTest.LargeFlowControlOnAndGiantBody

WARNING: ThreadSanitizer: heap-use-after-free (virtual call vs free) (pid=17)
Read of size 8 at 0x7b4800030130 by thread T18:
>0 Envoy::Http::Http1::ConnectionImpl::flushOutput(bool) (quic_http_integration_test+0x62de878)
>1 Envoy::Http::Http1::StreamEncoderImpl::encodeHeadersBase(Envoy::Http::RequestOrResponseHeaderMap const&, std::__1::optional, bool, bool) (quic_http_integration_test+0x62de439)
>2 Envoy::Http::Http1::ResponseEncoderImpl::encodeHeaders(Envoy::Http::ResponseHeaderMap const&, bool) (quic_http_integration_test+0x62e1cf7)
>3 non-virtual thunk to Envoy::Http::Http1::ResponseEncoderImpl::encodeHeaders(Envoy::Http::ResponseHeaderMap const&, bool) (quic_http_integration_test+0x62e1dc2)
>4 Envoy::FakeStream::encodeHeaders(Envoy::Http::HeaderMap const&, bool)::$_1::operator()() const (quic_http_integration_test+0x3d486f6)
>5 decltype(std::__1::forward<Envoy::FakeStream::encodeHeaders(Envoy::Http::HeaderMap const&, bool)::$_1&>(fp)()) std::__1::__invoke<Envoy::FakeStream::encodeHeaders(Envoy::Http::HeaderMap const&, bool)::$_1&>(Envoy::FakeStream::encodeHeaders(Envoy::Http::HeaderMap const&, bool)::$_1&) (quic_http_integration_test+0x3d48590)
>6 void std::__1::__invoke_void_return_wrapper::__call<Envoy::FakeStream::encodeHeaders(Envoy::Http::HeaderMap const&, bool)::$_1&>(Envoy::FakeStream::encodeHeaders(Envoy::Http::HeaderMap const&, bool)::$_1&) (quic_http_integration_test+0x3d484f0)
>7 std::__1::__function::__alloc_func<Envoy::FakeStream::encodeHeaders(Envoy::Http::HeaderMap const&, bool)::$_1, std::__1::allocator<Envoy::FakeStream::encodeHeaders(Envoy::Http::HeaderMap const&, bool)::$_1>, void ()>::operator()() (quic_http_integration_test+0x3d48490)
>8 std::__1::__function::__func<Envoy::FakeStream::encodeHeaders(Envoy::Http::HeaderMap const&, bool)::$_1, std::__1::allocator<Envoy::FakeStream::encodeHeaders(Envoy::Http::HeaderMap const&, bool)::$_1>, void ()>::operator()() (quic_http_integration_test+0x3d46b4f)
>9 std::__1::__function::__value_func<void ()>::operator()() const (quic_http_integration_test+0x3c04b86)
>10 std::__1::function<void ()>::operator()() const (quic_http_integration_test+0x3c04618)
>11 Envoy::Event::DispatcherImpl::runPostCallbacks() (quic_http_integration_test+0x78a908c)
.... many more stack frames...
>26 Envoy::FakeUpstream::threadRoutine() (quic_http_integration_test+0x3d3e1d4)

WARNING: ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) (pid=17)
Write of size 8 at 0x7b4800030130 by thread T23:
>0 Envoy::Network::ConnectionImpl::~ConnectionImpl() (quic_http_integration_test+0x79521cc)
>1 Envoy::Network::ServerConnectionImpl::~ServerConnectionImpl() (quic_http_integration_test+0x797e3df)
>2 Envoy::Network::ServerConnectionImpl::~ServerConnectionImpl() (quic_http_integration_test+0x797d416)
>3 Envoy::Network::ServerConnectionImpl::~ServerConnectionImpl() (quic_http_integration_test+0x797d48f)
>4 virtual thunk to Envoy::Network::ServerConnectionImpl::~ServerConnectionImpl() (quic_http_integration_test+0x797d71a)
>5 std::__1::default_deleteEnvoy::Network::Connection::operator()(Envoy::Network::Connection*) const (quic_http_integration_test+0x6a31b76)
>6 std::__1::unique_ptr<Envoy::Network::Connection, std::__1::default_deleteEnvoy::Network::Connection >::reset(Envoy::Network::Connection*) (quic_http_integration_test+0x6a31a90)
>7 std::__1::unique_ptr<Envoy::Network::Connection, std::__1::default_deleteEnvoy::Network::Connection >::~unique_ptr() (quic_http_integration_test+0x6a23fac)
>8 Envoy::Server::ActiveTcpConnection::~ActiveTcpConnection() (quic_http_integration_test+0x6a19fd1)
>9 Envoy::Server::ActiveTcpConnection::~ActiveTcpConnection() (quic_http_integration_test+0x6a1a0ef)
>10 std::__1::default_deleteEnvoy::Event::DeferredDeletable::operator()(Envoy::Event::DeferredDeletable*) const (quic_http_integration_test+0x4ba2686)
>11 std::__1::unique_ptr<Envoy::Event::DeferredDeletable, std::__1::default_deleteEnvoy::Event::DeferredDeletable >::reset(Envoy::Event::DeferredDeletable*) (quic_http_integration_test+0x4ba2620)
>12 Envoy::Event::DispatcherImpl::clearDeferredDeleteList() (quic_http_integration_test+0x78a38d5)
... lots of stack frames ...
>27 Envoy::FakeUpstream::threadRoutine() (quic_http_integration_test+0x3d3e1d4)

@antoniovicente
Copy link
Contributor Author

antoniovicente commented May 14, 2021

cc @alyssawilk @yanavlasov

@antoniovicente
Copy link
Contributor Author

@htuch

Seems like tests are hitting the following error. Would be good to understand the source of it and if we should consider some adjustments to retransmit logic in cases where the proxy appears overloaded in order to mitigate. Can we gather some info on number of retransmits?

Server: Closing connection: 728740b16f0e6ea2, with error: QUIC_TOO_MANY_RTOS (85), and details: Network blackhole detected

Also, can we disable the flaky tests until this issue is resolved? Right now many PRs are blocked on the high failure rate this issue is causing on the coverage and tsan CIs

@danzh2010
Copy link
Contributor

I couldn't reproduce these in local machine, though I didn't setup RBE.
Following test didn't time out:
$ bazel test test/integration:quic_http_integration_test --test_filter="DownstreamReadDisabledOnGiantPost" --config=clang-tsan --runs_per_test=100

Might be there are some RBE UDP socket setup difference.

Will disable these tests for now.

@antoniovicente
Copy link
Contributor Author

antoniovicente commented May 17, 2021

The timeouts seem to be due to the client QUIC stream being already reset when we call waitForEndStream. I don't understand how this could be happening.

I think you need to setup RBE to repo this issue, please look into using it.

alyssawilk pushed a commit that referenced this issue May 18, 2021
Commit Message: these tests fails under TSAN CI, see #16493
Signed-off-by: Dan Zhang <danzh@google.com>
@danzh2010
Copy link
Contributor

I was able to reproduce the flakiness using RBE, and verified that #16180 fixes the flaky tests.

ntgsx92 pushed a commit to ntgsx92/envoy that referenced this issue May 18, 2021
Commit Message: these tests fails under TSAN CI, see envoyproxy#16493
Signed-off-by: Dan Zhang <danzh@google.com>
Signed-off-by: Sixiang Gu <sgu@twitter.com>
alyssawilk pushed a commit that referenced this issue May 24, 2021
Commit Message: To prevent long event loop when too many UDP packets are in the queue, limit how many packets to read in each event loop. If haven't finished reading, artifacts a READ event to continue in the next event loop.
Additional Description:
Add numPacketsExpectedPerEventLoop() callback to UdpListenerCallback, so that QUIC listener can tell how many packets it wants to read in each loop. The actually number of packets read are still bound by MAX_NUM_PACKETS_PER_EVENT_LOOP (6000).
Quic listener returns numPacketsExpectedPerEventLoop() based on number of connections it has at the moment and the configured envoy::config::listener::QuicProtocolOptions.packets_to_read_to_connection_count_ratio.
Made InjectableSingleton really thread safe.

Risk Level: medium, other than quic listener, other UdpListenerCallbacks return max size_t for numPacketsExpectedPerEventLoop(). This will cause those callbacks to read 6000 packets per READ event.
Testing: added udp listener unit tests.

Fixes #16335 #16278
Part of #16198 #16493
Signed-off-by: Dan Zhang <danzh@google.com>
@github-actions
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Jun 18, 2021
@github-actions
Copy link

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

leyao-daily pushed a commit to leyao-daily/envoy that referenced this issue Sep 30, 2021
Commit Message: these tests fails under TSAN CI, see envoyproxy#16493
Signed-off-by: Dan Zhang <danzh@google.com>
leyao-daily pushed a commit to leyao-daily/envoy that referenced this issue Sep 30, 2021
Commit Message: To prevent long event loop when too many UDP packets are in the queue, limit how many packets to read in each event loop. If haven't finished reading, artifacts a READ event to continue in the next event loop.
Additional Description:
Add numPacketsExpectedPerEventLoop() callback to UdpListenerCallback, so that QUIC listener can tell how many packets it wants to read in each loop. The actually number of packets read are still bound by MAX_NUM_PACKETS_PER_EVENT_LOOP (6000).
Quic listener returns numPacketsExpectedPerEventLoop() based on number of connections it has at the moment and the configured envoy::config::listener::QuicProtocolOptions.packets_to_read_to_connection_count_ratio.
Made InjectableSingleton really thread safe.

Risk Level: medium, other than quic listener, other UdpListenerCallbacks return max size_t for numPacketsExpectedPerEventLoop(). This will cause those callbacks to read 6000 packets per READ event.
Testing: added udp listener unit tests.

Fixes envoyproxy#16335 envoyproxy#16278
Part of envoyproxy#16198 envoyproxy#16493
Signed-off-by: Dan Zhang <danzh@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/quic area/test flakes bug priority/high stale stalebot believes this issue/PR has not been touched recently
Projects
None yet
Development

No branches or pull requests

2 participants