Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flake in IpVersionsClientType/ExtensionDiscoveryIntegrationTest.ReloadBoth/1 #26254

Closed
alyssawilk opened this issue Mar 22, 2023 · 5 comments · Fixed by #26348
Closed

flake in IpVersionsClientType/ExtensionDiscoveryIntegrationTest.ReloadBoth/1 #26254

alyssawilk opened this issue Mar 22, 2023 · 5 comments · Fixed by #26348
Assignees

Comments

@alyssawilk
Copy link
Contributor

https://dev.azure.com/cncf/envoy/_build/results?buildId=131697&view=logs&j=4e2e4662-2d21-5726-b4bc-9ff5cd6a179b&t=393ef8cf-9eb2-5a8a-e529-cceb42dc347e

[2023-03-21 18:17:38.829][753][critical][backtrace] [./source/server/backtrace.h:96] #3: Envoy::ThreadLocal::InstanceImpl::SlotImpl::SlotImpl() [0x315310d]
[2023-03-21 18:17:38.838][753][critical][backtrace] [./source/server/backtrace.h:96] #4: std::__1::__shared_ptr_emplace<>::__on_zero_shared() [0x709c8d7]
[2023-03-21 18:17:38.846][753][critical][backtrace] [./source/server/backtrace.h:96] #5: std::__1::__function::__func<>::destroy() [0x709d1f1]
[2023-03-21 18:17:38.855][753][critical][backtrace] [./source/server/backtrace.h:96] #6: std::__1::__optional_destruct_base<>::
__optional_destruct_base() [0x60d9934]
[2023-03-21 18:17:38.863][753][critical][backtrace] [./source/server/backtrace.h:96] #7: Envoy::Filter::DynamicFilterConfigProviderImpl<>::ThreadLocalConfig::~ThreadLocalConfig() [0x60db108]
[2023-03-21 18:17:38.871][753][critical][backtrace] [./source/server/backtrace.h:96] #8: std::__1::__shared_ptr_emplace<>::__on_zero_shared() [0x60db01e]
[2023-03-21 18:17:38.880][753][critical][backtrace] [./source/server/backtrace.h:96] #9: std::__1::__function::__func<>::operator()() [0x315081f]
[2023-03-21 18:17:38.888][753][critical][backtrace] [./source/server/backtrace.h:96] #10: Envoy::Event::DispatcherImpl::runPostCallbacks() [0x72fca0c]
[2023-03-21 18:17:38.897][753][critical][backtrace] [./source/server/backtrace.h:96] #11: std::__1::__function::__func<>::operator()() [0x730764d]
[2023-03-21 18:17:38.905][753][critical][backtrace] [./source/server/backtrace.h:96] #12: Envoy::Event::SchedulableCallbackImpl::SchedulableCallbackImpl()::$_0::__invoke() [0x7a28609]
[2023-03-21 18:17:38.913][753][critical][backtrace] [./source/server/backtrace.h:96] #13: event_process_active_single_queue [0x8569a63]
[2023-03-21 18:17:38.922][753][critical][backtrace] [./source/server/backtrace.h:96] #14: event_base_loop [0x8557152]
[2023-03-21 18:17:38.930][753][critical][backtrace] [./source/server/backtrace.h:96] #15: Envoy::Event::LibeventScheduler::run() [0x7a25c4e]
[2023-03-21 18:17:38.938][753][critical][backtrace] [./source/server/backtrace.h:96] #16: Envoy::Event::DispatcherImpl::run() [0x72fc611]
[2023-03-21 18:17:38.947][753][critical][backtrace] [./source/server/backtrace.h:96] #17: Envoy::Server::WorkerImpl::threadRoutine() [0x3299d52]
[2023-03-21 18:17:38.955][753][critical][backtrace] [./source/server/backtrace.h:96] #18: std::__1::__function::__func<>::operator()() [0x329ffb9]
[2023-03-21 18:17:38.964][753][critical][backtrace] [./source/server/backtrace.h:96] #19: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x8518f5c]
[2023-03-21 18:17:38.964][753][critical][backtrace] [./source/server/backtrace.h:96] #20: start_thread [0x7f148e39a609]

@alyssawilk
Copy link
Contributor Author

bazel test --config=remote //test/integration:extension_discovery_integration_test --runs_per_test=1000 does repro this one.
@kyessenov can you take a look?

@kyessenov
Copy link
Contributor

This one is super-odd - runOnAllThreads is not posting to a worker thread correctly...

@kyessenov
Copy link
Contributor

The root cause is a logic bug that still allows filter configs to be destroyed on workers, likely the same reason #24574 and other v8 !IsMainThread assertions.

@alyssawilk
Copy link
Contributor Author

Nice debug - thanks for the in-flight fix!

kyessenov added a commit that referenced this issue Mar 30, 2023
Commit Message: Ensure extension config is always deleted on the main thread.
Additional Description: There is an implicit data race with the following interleaving:

TLS runOnAllThreads invoked to clear the extension config.
TLS deleted, a post is invoked on all workers to delete its storage.
TLS runOnAllThreads on workers does nothing because of step 2).
TLS runOnAllThreads runs the completion callback on the main thread.
TLS deletion post invoked on some worker, deleting the last instance of the extension config.
The workaround is to prevent 2) from happening while runOnAllThreads is running. This is done by capturing TLS in the completion callback, and explicitly resetting it and the associated filter config in the callback.

Note that we cannot rely on the main_config shared pointer destruction implicitly. It seems the destructor can still execute on a worker, so we explicitly empty the content in the PR. It's not clear to me why some worker has a copy of the function, but it's hard to track down without AnyInvocable or move-only functions.

Risk Level: low
Testing: run the flakey test 10000 times, all pass
Docs Changes: none
Release Notes: none
Platform Specific Features: none
Fixes: #26254
RiverPhillips pushed a commit to RiverPhillips/envoy that referenced this issue Apr 7, 2023
Commit Message: Ensure extension config is always deleted on the main thread.
Additional Description: There is an implicit data race with the following interleaving:

TLS runOnAllThreads invoked to clear the extension config.
TLS deleted, a post is invoked on all workers to delete its storage.
TLS runOnAllThreads on workers does nothing because of step 2).
TLS runOnAllThreads runs the completion callback on the main thread.
TLS deletion post invoked on some worker, deleting the last instance of the extension config.
The workaround is to prevent 2) from happening while runOnAllThreads is running. This is done by capturing TLS in the completion callback, and explicitly resetting it and the associated filter config in the callback.

Note that we cannot rely on the main_config shared pointer destruction implicitly. It seems the destructor can still execute on a worker, so we explicitly empty the content in the PR. It's not clear to me why some worker has a copy of the function, but it's hard to track down without AnyInvocable or move-only functions.

Risk Level: low
Testing: run the flakey test 10000 times, all pass
Docs Changes: none
Release Notes: none
Platform Specific Features: none
Fixes: envoyproxy#26254

Signed-off-by: River Phillips <riverphillips1@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants