
fix(dfp): cap DNS failure backoff to bound eviction delay after touch()/alarm race#44540

Closed
Copilot wants to merge 3 commits into main from copilot/fix-dns-cache-race-condition

Contributor

Copilot AI commented Apr 20, 2026

DnsCacheImpl::onReResolveAlarm checks now - last_used > host_ttl to decide between eviction and re-resolve. If touch() lands just before the alarm fires, the host is considered active and startResolve runs. When DNS then fails, the exponential backoff (default ~60s) schedules the next eviction check far beyond last_used + host_ttl, deferring eviction by an unbounded number of cycles. This caused flakes in UseCacheFileShortTtl, StreamPersistAcrossShortTtlResFail, and UseCacheFileShortTtlHostActive under slow builders (MSAN).

Changes

  • dns_cache_impl.cc, finishResolve: On DNS failure, cap the backoff interval at host_ttl - elapsed_since_last_use + 1ms. If the host is already past its eviction TTL when finishResolve runs, schedule a near-immediate check (1ms). The DNS success path is unchanged.

    uint64_t refresh_interval = primary_host_info->failure_backoff_strategy_->nextBackOffMs();
    const auto elapsed_since_last_use =
        now_for_eviction_check - primary_host_info->host_info_->lastUsedTime();
    if (elapsed_since_last_use < host_ttl_) {
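      // Time left until the host becomes evictable, plus 1ms so the check lands past the TTL.
      const uint64_t until_eviction_ms =
          std::chrono::duration_cast<std::chrono::milliseconds>(host_ttl_ - elapsed_since_last_use)
              .count() + 1;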
      refresh_interval = std::min(refresh_interval, until_eviction_ms);
    } else {
      refresh_interval = 1; // already past eviction TTL
    }

    Worst-case eviction delay is now bounded to one DNS query timeout after the race (~7s with default 5s dns_query_timeout + 2s cache_ttl), within the default 10s test budget.

  • dns_cache_impl_test.cc, TTLTouchRaceEviction: Regression test using simulated time. Calls touch() at T+1900ms, just before the alarm fires at T+2000ms (host_ttl=1s), triggering startResolve. Verifies that on DNS failure the backoff is capped to 901ms (not the uncapped 2000ms), and that eviction fires after that interval.
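
The capping rule described in the two bullets above can be sketched as a standalone function. This is a minimal illustration, not the code in the PR: the name cappedFailureBackoffMs and its parameter list are hypothetical, and the real change is applied inline in finishResolve using the cache's own members.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Hypothetical helper (not an Envoy API): returns the delay in milliseconds
// before the next alarm after a failed DNS resolution.
uint64_t cappedFailureBackoffMs(uint64_t backoff_ms, std::chrono::milliseconds host_ttl,
                                std::chrono::milliseconds elapsed_since_last_use) {
  if (elapsed_since_last_use >= host_ttl) {
    // The host is already past its eviction TTL: check again almost immediately.
    return 1;
  }
  // Time left until the host becomes evictable, plus 1ms so that the check
  // lands strictly after the TTL has elapsed.
  const uint64_t until_eviction_ms =
      static_cast<uint64_t>((host_ttl - elapsed_since_last_use).count()) + 1;
  // Never wait longer than the backoff strategy asked for.
  return std::min(backoff_ms, until_eviction_ms);
}
```

With the regression test's numbers (host_ttl=1s, touch() 100ms before the alarm, uncapped backoff 2000ms), this returns 901, matching the interval the test expects.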

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • dl.google.com
    • Triggering command: /build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/bin/java bazel(envoy) --add-opens=java.base/java.lang=ALL-UNNAMED -Xverify:none -Djava.util.logging.config.file=/build/bazel_root/base/javalog.properties -Dcom.google.devtools.build.lib.util.LogHandlerQuerier.class=com.google.devtools.build.lib.util.SimpleLogHandler$HandlerQuerier -XX:-MaxFDLimit -Djava.library.path=/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib:/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib/server:/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/ -Dfile.encoding=ISO-8859-1 -Duser.country= -Duser.language= -Duser.variant= -Xmx3g -DBAZEL_TRACK_SOURCE_DIRECTORIES=1 -Djavax.net.ssl.trustStore=/tmp/custom-cacerts -Djavax.net.ssl.trustStorePassword=changeit -jar /build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/A-server.jar --max_idle_secs=10800 --noshutdown_on_low_sys_mem --connect_timeout_secs=30 (dns block)
    • Triggering command: /home/REDACTED/.cache/envoy-bazel/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/bin/java bazel(envoy) --add-opens=java.base/java.lang=ALL-UNNAMED -Xverify:none -Djava.util.logging.config.file=/home/REDACTED/.cache/envoy-bazel/bazel_root/base/javalog.properties -Dcom.google.devtools.build.lib.util.LogHandlerQuerier.class=com.google.devtools.build.lib.util.SimpleLogHandler$HandlerQuerier -XX:-MaxFDLimit -Djava.library.path=/home/REDACTED/.cache/envoy-bazel/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib:/home/REDACTED/.cache/envoy-bazel/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib/server:/home/REDACTED/. 3c4955e/node_modules/.bin/node (dns block)
  • ftpmirror.gnu.org
    • Triggering command: /build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/bin/java bazel(envoy) --add-opens=java.base/java.lang=ALL-UNNAMED -Xverify:none -Djava.util.logging.config.file=/build/bazel_root/base/javalog.properties -Dcom.google.devtools.build.lib.util.LogHandlerQuerier.class=com.google.devtools.build.lib.util.SimpleLogHandler$HandlerQuerier -XX:-MaxFDLimit -Djava.library.path=/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib:/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib/server:/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/ -Dfile.encoding=ISO-8859-1 -Duser.country= -Duser.language= -Duser.variant= -Xmx3g -DBAZEL_TRACK_SOURCE_DIRECTORIES=1 -Djavax.net.ssl.trustStore=/tmp/custom-cacerts -Djavax.net.ssl.trustStorePassword=changeit -jar /build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/A-server.jar --max_idle_secs=10800 --noshutdown_on_low_sys_mem --connect_timeout_secs=30 (dns block)
  • mirror.bazel.build
    • Triggering command: /build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/bin/java bazel(envoy) --add-opens=java.base/java.lang=ALL-UNNAMED -Xverify:none -Djava.util.logging.config.file=/build/bazel_root/base/javalog.properties -Dcom.google.devtools.build.lib.util.LogHandlerQuerier.class=com.google.devtools.build.lib.util.SimpleLogHandler$HandlerQuerier -XX:-MaxFDLimit -Djava.library.path=/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib:/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib/server:/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/ -Dfile.encoding=ISO-8859-1 -Duser.country= -Duser.language= -Duser.variant= -Xmx3g -DBAZEL_TRACK_SOURCE_DIRECTORIES=1 -Djavax.net.ssl.trustStore=/tmp/custom-cacerts -Djavax.net.ssl.trustStorePassword=changeit -jar /build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/A-server.jar --max_idle_secs=10800 --noshutdown_on_low_sys_mem --connect_timeout_secs=30 (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

Background

Issue #44426 describes flakes in three DFP integration tests:

  • ProxyFilterIntegrationTest.UseCacheFileShortTtl
  • ProxyFilterIntegrationTest.StreamPersistAcrossShortTtlResFail
  • ProxyFilterIntegrationTest.UseCacheFileShortTtlHostActive

The root cause (as identified in PR #44538 and #44505) is a race in DnsCacheImpl::onReResolveAlarm:

https://github.com/envoyproxy/envoy/blob/main/source/extensions/common/dynamic_forward_proxy/dns_cache_impl.cc#L287-L302

void DnsCacheImpl::onReResolveAlarm(const std::string& host) {
  ASSERT(main_thread_dispatcher_.isThreadSafe());
  auto& primary_host = getPrimaryHost(host);
  const std::chrono::steady_clock::duration now_duration =
      main_thread_dispatcher_.timeSource().monotonicTime().time_since_epoch();
  auto last_used_time = primary_host.host_info_->lastUsedTime();
  if ((now_duration - last_used_time) > host_ttl_) {
    removeHost(host, primary_host, true);
  } else {
    startResolve(host, primary_host);
  }
}

If a request's touch() call lands just before the alarm fires, (now - last_used) <= host_ttl, so instead of evicting we go into startResolve. For a hostname that doesn't resolve (test scenario) this means:

  1. Fire DNS query → wait dns_query_timeout (~5s default)
  2. Query fails → schedule next alarm after exponential backoff
  3. On the next alarm: check again, maybe evict

A single failed cycle already costs ~7s (cache_ttl + dns_query_timeout + backoff). Two cycles exceed the default test wait timeout, causing the flake under slow builders (MSAN).
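
To make the race concrete, the branch taken by onReResolveAlarm can be modelled as a pure function of three times. This is only a sketch: AlarmAction and decideOnAlarm are hypothetical names introduced for illustration, and the values in the note below are example inputs, not measurements.

```cpp
#include <chrono>

// Hypothetical model of the two outcomes of onReResolveAlarm.
enum class AlarmAction { Evict, ReResolve };

// Mirrors the comparison in onReResolveAlarm: evict only when the host has
// been idle for strictly longer than host_ttl, otherwise re-resolve.
AlarmAction decideOnAlarm(std::chrono::milliseconds now, std::chrono::milliseconds last_used,
                          std::chrono::milliseconds host_ttl) {
  return (now - last_used) > host_ttl ? AlarmAction::Evict : AlarmAction::ReResolve;
}
```

With host_ttl=1000ms and the alarm firing at 2000ms, a host last used at 0ms is evicted, while a host touched at 1900ms is re-resolved instead. That second branch is the one that leads into the failed DNS cycle described above.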

The short-term workaround (PRs #44505, #44538) either widens the wait timeout, bumps cache_ttl, or shortens dns_query_timeout. None address the underlying issue.

Investigation task

Investigate whether the logic in onReResolveAlarm is structurally correct, and propose a fix. Questions to consider:

  1. Is startResolve the right action when touch() ran recently? The host is actively being used, and if the previous resolution succeeded we already have a valid address. Starting a new DNS query gambles a full dns_query_timeout on an operation that may be unnecessary.

  2. When startResolve fails, when does the next eviction check happen? Currently the failure backoff schedules the next re-resolve, not the next eviction check. This couples two orthogonal concerns (re-resolution freshness vs. cache expiry).

  3. Would it be reasonable to re-arm the refresh timer with a short interval (e.g., a fraction of host_ttl) when touch() was recent, rather than kicking off a full DNS query? This decouples eviction from re-resolution.

  4. Is there a cleaner approach, e.g., making eviction run on its own timer independent of the re-resolve alarm?
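
One possible reading of questions 3 and 4 is an eviction check that computes its own next deadline from last_used, independent of DNS outcomes. The sketch below assumes that design. nextEvictionCheck is a hypothetical name, and using the remaining idle time (rather than a fixed fraction of host_ttl) as the re-arm interval is an assumption made here for illustration. It requires C++17 for std::optional.

```cpp
#include <chrono>
#include <optional>

// Hypothetical helper: returns std::nullopt when the host should be evicted
// now, otherwise the delay until the next eviction check should run.
std::optional<std::chrono::milliseconds> nextEvictionCheck(std::chrono::milliseconds now,
                                                           std::chrono::milliseconds last_used,
                                                           std::chrono::milliseconds host_ttl) {
  const std::chrono::milliseconds idle = now - last_used;
  if (idle > host_ttl) {
    return std::nullopt; // idle for longer than host_ttl: evict
  }
  // Re-arm so the next check fires 1ms after the host would become evictable,
  // assuming no further touch() happens in between.
  return host_ttl - idle + std::chrono::milliseconds(1);
}
```

Under this scheme a late touch() only pushes the eviction check back by at most host_ttl + 1ms, regardless of how long a DNS query or its failure backoff takes.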

Deliverable

A PR that:

  1. Documents the race clearly in commit message / PR body.
  2. Implements a fix that removes the dependency on "winning the race" to get eviction to happen within a bounded number of cache_ttl intervals.
  3. Adds a unit test in test/extensions/common/dynamic_forward_proxy/dns_cache_impl_test.cc that simulates the race (e.g., by calling touch() between alarm schedule and fire, or by using simulated time) and verifies eviction happens deterministically.
  4. Ideally, once merged, allows the workarounds in #44505 ("test: Fix dfp/dns flakes"), or whichever lands, to be simplified back to the original form (single waitForCounterGe("host_removed", 1) with default timeout).

Do not change the public config surface (e.g. don't add new proto fields) unless strictly necessary. Prefer a minimal internal logic change.

Related

Note that PR #44515 touches the same files. Ensure changes here don't conflict, and coordinate a rebase if both are in flight.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: so trying to resolve the linked flake issue - you previously suggested me the changes in @envoyproxy/envoy/pull/44505

another you has suggested that is a crap fix and instead suggested #44538

wdyt?

annoyingly - despite that i previously was able to repro this flake i cant now - so not much point in testing against the fix branch

Assistant: Here's my take on the two approaches:

The two fixes

#44538 (jwendell): Bumps dns_cache_ttl_ from 2s → 5s in the three short-TTL tests. One-liner × 3.

#44505 (yours):

  • Sets dns_query_timeout=1 (default is much higher) so each failed re-resolve cycle is short
  • Adds waitForCounterGe("dns_cache.foo.dns_query_failure", 1) as an explicit sync barrier before waitForCounterGe("host_removed", 1)
  • Also drops size = "large" and rbe_pool = "2core" from the BUILD file (looks unrelated/accidental)

Root cause ...

This pull request was created from Copilot chat.

@repokitteh-read-only

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #44540 was opened by Copilot.


Copilot AI and others added 2 commits April 20, 2026 20:49
…()/alarm race

When touch() lands just before onReResolveAlarm fires, the host is considered
active (now - last_used <= host_ttl) and startResolve runs. Previously, after
a DNS failure the exponential backoff scheduled the next eviction check far
beyond last_used + host_ttl (e.g., default 60s backoff with host_ttl=1s),
causing test flakes under slow builders (MSAN).

Fix: cap the failure-backoff interval at host_ttl - elapsed_since_use + 1ms in
finishResolve so the eviction check fires promptly once the host has been idle
for host_ttl. When DNS fails and the host is already past its eviction TTL
(elapsed >= host_ttl), schedule an immediate check (1ms).

Add regression test TTLTouchRaceEviction that simulates the race by calling
touch() just before the re-resolve alarm fires and verifies the capped backoff
interval and deterministic eviction.

Agent-Logs-Url: https://github.com/envoyproxy/envoy/sessions/58111b85-6936-435c-8eeb-f72bb8224559

Co-authored-by: phlax <454682+phlax@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix race condition in DnsCacheImpl for DFP tests" to "fix(dfp): cap DNS failure backoff to bound eviction delay after touch()/alarm race" Apr 20, 2026
Copilot AI requested a review from phlax April 20, 2026 20:51
@phlax
Member

phlax commented Apr 21, 2026

closing in favour of #44542

@phlax phlax closed this Apr 21, 2026
@phlax phlax deleted the copilot/fix-dns-cache-race-condition branch April 30, 2026 21:10