Fix: main thread: notifyThreads() and worker thread: loadDnsCacheEntryWithForceRefresh() race condition.#44515

Open
yuehaii wants to merge 3 commits into envoyproxy:main from yuehaii:main

Contributor

@yuehaii yuehaii commented Apr 17, 2026

  • Commit Message:

    dfp: fix race between DNS resolve completion and handle registration

  • Additional Description:

    There is a race in loadDnsCacheEntryWithForceRefresh: the main thread may
    call notifyThreads() before the worker thread has registered its handle in
    loadDnsCacheEntryWithForceRefresh, so that handle is never notified.

    Fix by re-checking primary_hosts_ under the read lock immediately after
    inserting the handle. If the host has already completed its first
    resolution at that point, post an onHostMapUpdate directly to the
    worker's own dispatcher. This ensures the notification fires after
    decodeHeaders() returns and the filter is safely suspended at
    StopAllIterationAndWatermark, regardless of main-thread scheduling.

    To support the deferred post, ThreadLocalHostInfo is updated to store a
    reference to its per-thread Event::Dispatcher (passed through the
    tls_slot_.set() lambda).

    Verified by the new ParallelRequestsWithFakeResolver integration test,
    which reproduces the race by blocking DNS resolution until a second
    parallel request is in flight.

    Race sample: two parallel requests, A and B, for the same domain, with DFP enabled:
    A[worker] --> loadDnsCacheEntryWithForceRefresh, registers its notification handle
    A[main] --> startCacheLoad --> not finished yet
    B[main] --> startCacheLoad --> firstResolveComplete() == false, returns at L230 in dns_cache_impl.cc
    A[main] --> startCacheLoad finishes --> notifies A's registration // B is not registered yet
    B[worker] --> loadDnsCacheEntryWithForceRefresh, registers its notification handle // never notified until the DNS timeout
    A[worker] --> receives the DNS notification

  • Risk Level: Low

  • Testing: manual testing with the command below:
    bazel-bin/test/extensions/filters/http/dynamic_forward_proxy/proxy_filter_integration_test --gtest_filter="ParallelRequestsWithFakeResolver" 2>&1

    [==========] Running 2 tests from 1 test suite.
    [----------] Global test environment set-up.
    [----------] 2 tests from IpVersions/ProxyFilterIntegrationTest
    [ RUN ] IpVersions/ProxyFilterIntegrationTest.ParallelRequestsWithFakeResolver/IPv4
    [external/abseil-cpp/absl/flags/internal/flag.cc : 148] RAW: Restore saved value of envoy_quic_always_support_server_preferred_address to: true
    [external/abseil-cpp/absl/flags/internal/flag.cc : 148] RAW: Restore saved value of envoy_reloadable_features_no_extension_lookup_by_name to: true
    [external/abseil-cpp/absl/flags/internal/flag.cc : 148] RAW: Restore saved value of envoy_reloadable_features_runtime_initialized to: false
    [external/abseil-cpp/absl/flags/internal/flag.cc : 148] RAW: Restore saved value of envoy_quic_always_support_server_preferred_address to: true
    [external/abseil-cpp/absl/flags/internal/flag.cc : 148] RAW: Restore saved value of envoy_reloadable_features_no_extension_lookup_by_name to: true
    [external/abseil-cpp/absl/flags/internal/flag.cc : 148] RAW: Restore saved value of envoy_reloadable_features_runtime_initialized to: false
    [ OK ] IpVersions/ProxyFilterIntegrationTest.ParallelRequestsWithFakeResolver/IPv4 (370 ms)
    [ RUN ] IpVersions/ProxyFilterIntegrationTest.ParallelRequestsWithFakeResolver/IPv6
    [external/abseil-cpp/absl/flags/internal/flag.cc : 148] RAW: Restore saved value of envoy_quic_always_support_server_preferred_address to: true
    [external/abseil-cpp/absl/flags/internal/flag.cc : 148] RAW: Restore saved value of envoy_reloadable_features_no_extension_lookup_by_name to: true
    [external/abseil-cpp/absl/flags/internal/flag.cc : 148] RAW: Restore saved value of envoy_reloadable_features_runtime_initialized to: false
    [external/abseil-cpp/absl/flags/internal/flag.cc : 148] RAW: Restore saved value of envoy_quic_always_support_server_preferred_address to: true
    [external/abseil-cpp/absl/flags/internal/flag.cc : 148] RAW: Restore saved value of envoy_reloadable_features_no_extension_lookup_by_name to: true
    [external/abseil-cpp/absl/flags/internal/flag.cc : 148] RAW: Restore saved value of envoy_reloadable_features_runtime_initialized to: false
    [ OK ] IpVersions/ProxyFilterIntegrationTest.ParallelRequestsWithFakeResolver/IPv6 (170 ms)
    [----------] 2 tests from IpVersions/ProxyFilterIntegrationTest (542 ms total)

    [----------] Global test environment tear-down
    [==========] 2 tests from 1 test suite ran. (542 ms total)
    [ PASSED ] 2 tests.

  • Docs Changes: NA

  • Release Notes: NA

  • Platform Specific Features: NA
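
For readers following the race sample under Additional Description, the lost notification and the re-check fix can be modelled in a few lines. This is a minimal single-threaded sketch, not the code in this PR: ToyDnsCache, FakeDispatcher, and notificationsForLateRegistration are hypothetical names, a single std::mutex stands in for the real lock around primary_hosts_, and the losing interleaving is replayed deterministically instead of with real threads.

```cpp
// Toy model of the race. None of these types are Envoy's real classes.
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a per-worker Event::Dispatcher: callbacks are queued and
// only run when the worker loop calls run().
class FakeDispatcher {
public:
  void post(std::function<void()> cb) { pending_.push(std::move(cb)); }
  void run() {
    while (!pending_.empty()) {
      auto cb = std::move(pending_.front());
      pending_.pop();
      cb();
    }
  }

private:
  std::queue<std::function<void()>> pending_;
};

class ToyDnsCache {
public:
  // "Main thread": mark the host resolved and notify the handles that are
  // registered at this moment. Handles registered later are not seen.
  void completeResolution(const std::string& host) {
    std::vector<std::function<void()>> to_notify;
    {
      std::lock_guard<std::mutex> lock(mu_);
      resolved_[host] = true;
      to_notify = handles_[host];
    }
    for (auto& cb : to_notify) {
      cb();
    }
  }

  // "Worker thread": insert the handle, then (optionally) re-check whether
  // the first resolution already completed and post a deferred notification.
  void registerHandle(const std::string& host, FakeDispatcher& dispatcher,
                      std::function<void()> on_resolved, bool recheck_after_insert) {
    bool already_resolved = false;
    {
      std::lock_guard<std::mutex> lock(mu_);
      handles_[host].push_back(on_resolved); // stays registered for later updates
      already_resolved = resolved_[host];
    }
    if (recheck_after_insert && already_resolved) {
      dispatcher.post(std::move(on_resolved));
    }
  }

private:
  std::mutex mu_;
  std::map<std::string, bool> resolved_;
  std::map<std::string, std::vector<std::function<void()>>> handles_;
};

// Replays request B's losing interleaving: resolution completes first, then
// B registers. Returns how many times B's callback fired.
int notificationsForLateRegistration(bool recheck_after_insert) {
  ToyDnsCache cache;
  FakeDispatcher worker_dispatcher;
  int fired = 0;
  cache.completeResolution("example.com"); // main thread finishes first
  cache.registerHandle("example.com", worker_dispatcher, [&fired] { ++fired; },
                       recheck_after_insert); // B registers too late
  worker_dispatcher.run(); // worker loop drains posted callbacks
  return fired;
}
```

In this model, notificationsForLateRegistration(false) leaves B's callback unfired, which corresponds to B waiting until the DNS timeout, while notificationsForLateRegistration(true) fires it once from the worker's own queue. Posting instead of calling the callback inline mirrors the description's point that the notification should run only after decodeHeaders() has returned.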

…yWithForceRefresh() race condition.

Signed-off-by: hai.yue <20416005+yuehaii@users.noreply.github.com>
@repokitteh-read-only

Hi @yuehaii, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #44515 was opened by yuehaii.

see: more, trace.

Member

@mathetake mathetake left a comment

Would it be possible for you to add a regression test to the integration tests? TSAN tests haven't hit this race, which means the existing test suite is not comprehensive enough to catch it. In other words, can you find a test case that detects the race with TSAN?

@phlax
Member

phlax commented Apr 20, 2026

A specific test sounds like a good idea, but I'm not sure it's correct that TSAN hasn't hit this before.

@phlax
Member

phlax commented Apr 20, 2026

#44491

Although that one is marked as ASAN. I added a fix for it last week, but I'm not sure it is the best, correct, or complete fix.

@phlax
Member

phlax commented Apr 20, 2026

@yuehaii the CI failures look real.

@phlax
Member

phlax commented Apr 20, 2026

/gemini review

@phlax
Member

phlax commented Apr 20, 2026

Bot analysis suggests this is a different issue from the one I posted above, but there have been several issues with DFP/DNS.

The bot also suggested this needs a bit of work to land, but it does fix a real issue.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the DNS cache implementation to handle race conditions where a host resolution might complete during handle registration. It modifies ThreadLocalHostInfo to store a reference to the event dispatcher and adds logic in loadDnsCacheEntryWithForceRefresh to post deferred notifications if a host is already resolved. A review comment suggests refining the notification logic to prevent sending stale data when ignore_cached_entries is set, ensuring that forced refreshes behave correctly.
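
One way to read the review comment about ignore_cached_entries is as a guard on the immediate notification. The sketch below is only an illustration of that reading: shouldPostImmediateUpdate and its parameters are hypothetical and do not come from the PR or from Envoy's API.

```cpp
#include <cassert>

// Hypothetical helper, not Envoy's API. Decides whether a freshly registered
// handle should get an immediate, deferred notification built from data that
// is already in the cache.
bool shouldPostImmediateUpdate(bool first_resolve_complete, bool ignore_cached_entries) {
  // A forced refresh should wait for the new resolution instead of being
  // handed the existing (possibly stale) entry.
  if (ignore_cached_entries) {
    return false;
  }
  // Otherwise notify right away only if the first resolution already finished.
  return first_resolve_complete;
}
```

With this guard, a handle registered with ignore_cached_entries set would still rely on the regular notification path once the refreshed resolution completes.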

Comment thread source/extensions/common/dynamic_forward_proxy/dns_cache_impl.cc Outdated
… cases

Signed-off-by: hai.yue <20416005+yuehaii@users.noreply.github.com>
@yuehaii yuehaii temporarily deployed to external-contributors April 26, 2026 03:56 — with GitHub Actions Inactive
@ravenblackx
Contributor

Ping @mattklein123 (and I kicked off the blocked CI)

@ravenblackx
Contributor

Format check says there's stuff to be fixed.

/wait

Signed-off-by: hai.yue <20416005+yuehaii@users.noreply.github.com>
@yuehaii yuehaii had a problem deploying to external-contributors May 1, 2026 13:57 — with GitHub Actions Error
@yuehaii
Contributor Author

yuehaii commented May 8, 2026

/gemini retest

@yuehaii
Contributor Author

yuehaii commented May 8, 2026

> Format check says there's stuff to be fixed.
>
> /wait

Hi @ravenblackx. I have fixed the format issue. Can you please help trigger the CI again?

7 participants