Skip to content

RLQS: Improve TlsStore persistence after filter factory deletion#44690

Open
bsurber wants to merge 7 commits into
envoyproxy:mainfrom
bsurber:finish-rlqs-fix
Open

RLQS: Improve TlsStore persistence after filter factory deletion#44690
bsurber wants to merge 7 commits into
envoyproxy:mainfrom
bsurber:finish-rlqs-fix

Conversation

@bsurber
Copy link
Copy Markdown
Contributor

@bsurber bsurber commented Apr 27, 2026

Commit Message: Branch off of #44236 with some cleanup and improved coverage in filter_persistence_test.cc

Overall, this lets filter instances in worker threads hold shared_ptr ownership of TlsStores and the Global RLQS client (GlobalRateLimitClientImpl), guaranteeing their existence even if the filter factory (previous unique owner of both) gets deleted via an xDS push.

Operations posted to the main thread by the GlobalRateLimitClientImpl also now check to make sure they haven't outlived the original global client.

Additional Description:
Risk Level: minor (haven't been reproducible outside of direct coverage testing)
Testing: unit testing
Docs Changes:
Release Notes:
Platform Specific Features:

yanavlasov and others added 5 commits April 2, 2026 14:20
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
…TlsStore deletion outside of the main thread

Signed-off-by: Brian Surber <bsurber@google.com>
@bsurber bsurber marked this pull request as draft April 27, 2026 21:46
Signed-off-by: Brian Surber <bsurber@google.com>
@bsurber
Copy link
Copy Markdown
Contributor Author

bsurber commented May 14, 2026

The only way to get coverage on the non-main-thread deletion check is to block filter execution until a config change pushes through to delete the filter's factory.

From experimentation, I've confirmed that the LDS config updates in filter_persistence_test.cc actually get blocked while the filter is still in-process on the worker thread (done by injecting a blocking callback in my experiments).

That behavior may be specific to the integration test framework or may actually match production, but it doesn't feel great to rely on that assumption. Either way, this does block the full coverage improvement.

@bsurber
Copy link
Copy Markdown
Contributor Author

bsurber commented May 14, 2026

/coverage

@repokitteh-read-only
Copy link
Copy Markdown

Coverage for this Pull Request will be rendered here:

https://storage.googleapis.com/envoy-cncf-pr/44690/coverage/index.html

For comparison, current coverage on main branch is here:

https://storage.googleapis.com/envoy-cncf-postsubmit/main/coverage/index.html

The coverage results are (re-)rendered each time the CI Envoy/Checks (coverage) job completes.

🐱

Caused by: a #44690 (comment) was created by @bsurber.

see: more, trace.

Signed-off-by: Brian Surber <bsurber@google.com>
@bsurber bsurber marked this pull request as ready for review May 15, 2026 21:59
@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/coverage-shephards: FYI only for changes made to (test/coverage.yaml).
envoyproxy/coverage-shephards assignee is @RyanTheOptimist

🐱

Caused by: #44690 was ready_for_review by bsurber.

see: more, trace.

@RyanTheOptimist
Copy link
Copy Markdown
Contributor

The only way to get coverage on the non-main-thread deletion check is to block filter execution until a config change pushes through to delete the filter's factory.

From experimentation, I've confirmed that the LDS config updates in filter_persistence_test.cc actually get blocked while the filter is still in-process on the worker thread (done by injecting a blocking callback in my experiments).

That behavior may be specific to the integration test framework or may actually match production, but it doesn't feel great to rely on that assumption. Either way, this does block the full coverage improvement.

With an issue like this which involves multiple threads and object lifetime, it seems vital to have complete test coverage. Otherwise we run the risk of discovering crash bugs in prod. Cooking up an integration test to do this seems very wise. if you've been able to get an LDS config update to do this, that sounds great. But are you saying that the new changes in test/extensions/filters/http/rate_limit_quota/filter_persistence_test.cc don't cover it?

@bsurber
Copy link
Copy Markdown
Contributor Author

bsurber commented May 21, 2026

The only way to get coverage on the non-main-thread deletion check is to block filter execution until a config change pushes through to delete the filter's factory.
From experimentation, I've confirmed that the LDS config updates in filter_persistence_test.cc actually get blocked while the filter is still in-process on the worker thread (done by injecting a blocking callback in my experiments).
That behavior may be specific to the integration test framework or may actually match production, but it doesn't feel great to rely on that assumption. Either way, this does block the full coverage improvement.

With an issue like this which involves multiple threads and object lifetime, it seems vital to have complete test coverage. Otherwise we run the risk of discovering crash bugs in prod. Cooking up an integration test to do this seems very wise. if you've been able to get an LDS config update to do this, that sounds great. But are you saying that the new changes in test/extensions/filters/http/rate_limit_quota/filter_persistence_test.cc don't cover it?

Whoops, I just noticed that I accidentally forgot to remove this unhelpful TestDeletingTlsStoreFromWorkerThread test, which was from a previous attempt to get coverage. This attempt couldn't get coverage because main_dispatcher->isThreadSafe() is True for test threads.

So the only way to get coverage is to force the filter to block until it has definitely outlived the filter factory, but the LDS push to delete the filter factory was blocked by the running filter (no stat updates, no change in shared_ptr owner count, etc).

@RyanTheOptimist
Copy link
Copy Markdown
Contributor

I'm apprehensive about the lack of coverage so I'd like to get some additional opinions:
/assign @yanavlasov
/assign @tyxia

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants