Fix: main thread: notifyThreads() and worker thread: loadDnsCacheEntryWithForceRefresh() race condition.#44515
Fix: main thread: notifyThreads() and worker thread: loadDnsCacheEntryWithForceRefresh() race condition.#44515yuehaii wants to merge 3 commits into
Conversation
…yWithForceRefresh() race condition. Signed-off-by: hai.yue <20416005+yuehaii@users.noreply.github.com>
|
Hi @yuehaii, welcome and thank you for your contribution. We will try to review your Pull Request as quickly as possible. In the meantime, please take a look at the contribution guidelines if you have not done so already. |
There was a problem hiding this comment.
Would it be possible for you to adda regression test in the integration test? TSAN tests haven't hit this race meaning that the existing test suite is not comprehensive enough to catch this. In other words, can you find a test case that detects the race with TSAN?
|
a specific test sounds like a good idea - but im not sure its correct that tsan hasnt hit this before |
|
altho that is marked as asan - i added a fix for that last week - not sure if the best/correct/total fix tho |
|
@yuehaii ci failures look real |
|
/gemini review |
|
bot analysis suggests this is a different issue to the one i posted above - but there have been several issues with dfp/dns bot also suggested this needs a bit of work to land, but it does fix a real issue |
There was a problem hiding this comment.
Code Review
This pull request updates the DNS cache implementation to handle race conditions where a host resolution might complete during handle registration. It modifies ThreadLocalHostInfo to store a reference to the event dispatcher and adds logic in loadDnsCacheEntryWithForceRefresh to post deferred notifications if a host is already resolved. A review comment suggests refining the notification logic to prevent sending stale data when ignore_cached_entries is set, ensuring that forced refreshes behave correctly.
… cases Signed-off-by: hai.yue <20416005+yuehaii@users.noreply.github.com>
|
Ping @mattklein123 (and I kicked off the blocked CI) |
|
Format check says there's stuff to be fixed. /wait |
Signed-off-by: hai.yue <20416005+yuehaii@users.noreply.github.com>
|
/gemini retest |
hi @ravenblackx . I have fixed the format issue. Can you please help trigger the CI again? |
Commit Message:
dfp: fix race between DNS resolve completion and handle registration
Additional Description:
There is a race issue in loadDnsCacheEntryWithForceRefresh: the main thread may
call notifyThreads() before the worker thread creates its loadDnsCacheEntryWithForceRefresh.
Fix by re-checking primary_hosts_ under the read lock immediately after
inserting the handle. If the host has already completed its first
resolution at that point, post an onHostMapUpdate directly to the
worker's own dispatcher. This ensures the notification fires after
decodeHeaders() returns and the filter is safely suspended at
StopAllIterationAndWatermark, regardless of main-thread scheduling.
To support the deferred post, ThreadLocalHostInfo is updated to store a
reference to its per-thread Event::Dispatcher (passed through the
tls_slot_.set() lambda).
Verified by the new ParallelRequestsWithFakeResolver integration test,
which reproduces the race by blocking DNS resolution until a second
parallel request is in flight.
Risk Level: Low
Testing: manual testing with below command:
bazel-bin/test/extensions/filters/http/dynamic_forward_proxy/proxy_filter_integration_test --gtest_filter="ParallelRequestsWithFakeResolver" 2>&1
Docs Changes: NA
Release Notes: NA
Platform Specific Features: NA