fix(dfp): cap DNS failure backoff to bound eviction delay after touch()/alarm race#44540
Closed
Copilot wants to merge 3 commits into
Closed
fix(dfp): cap DNS failure backoff to bound eviction delay after touch()/alarm race#44540Copilot wants to merge 3 commits into
Copilot wants to merge 3 commits into
Conversation
…()/alarm race When touch() lands just before onReResolveAlarm fires, the host is considered active (now - last_used <= host_ttl) and startResolve runs. Previously, after a DNS failure the exponential backoff scheduled the next eviction check far beyond last_used + host_ttl (e.g., default 60s backoff with host_ttl=1s), causing test flakes under slow builders (MSAN). Fix: cap the failure-backoff interval at host_ttl - elapsed_since_use + 1ms in finishResolve so the eviction check fires promptly once the host has been idle for host_ttl. When DNS fails and the host is already past its eviction TTL (elapsed >= host_ttl), schedule an immediate check (1ms). Add regression test TTLTouchRaceEviction that simulates the race by calling touch() just before the re-resolve alarm fires and verifies the capped backoff interval and deterministic eviction. Agent-Logs-Url: https://github.com/envoyproxy/envoy/sessions/58111b85-6936-435c-8eeb-f72bb8224559 Co-authored-by: phlax <454682+phlax@users.noreply.github.com>
Agent-Logs-Url: https://github.com/envoyproxy/envoy/sessions/58111b85-6936-435c-8eeb-f72bb8224559 Co-authored-by: phlax <454682+phlax@users.noreply.github.com>
Copilot
AI
changed the title
[WIP] Fix race condition in DnsCacheImpl for DFP tests
fix(dfp): cap DNS failure backoff to bound eviction delay after touch()/alarm race
Apr 20, 2026
Member
|
closing in favour of #44542 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
DnsCacheImpl::onReResolveAlarmchecksnow - last_used > host_ttlto decide between eviction and re-resolve. Iftouch()lands just before the alarm fires, the host is considered active andstartResolveruns. When DNS then fails, the exponential backoff (default ~60s) schedules the next eviction check far beyondlast_used + host_ttl, deferring eviction by an unbounded number of cycles. This caused flakes inUseCacheFileShortTtl,StreamPersistAcrossShortTtlResFail, andUseCacheFileShortTtlHostActiveunder slow builders (MSAN).Changes
dns_cache_impl.cc—finishResolve: On DNS failure, cap the backoff interval athost_ttl - elapsed_since_last_use + 1ms. If the host is already past its eviction TTL whenfinishResolveruns, schedule a near-immediate check (1ms). The DNS success path is unchanged.Worst-case eviction delay is now bounded to one DNS query timeout after the race (~7s with default 5s
dns_query_timeout+ 2scache_ttl), within the default 10s test budget.dns_cache_impl_test.cc—TTLTouchRaceEviction: Regression test using simulated time. Callstouch()at T+1900ms just before the alarm fires at T+2000ms (host_ttl=1s), triggeringstartResolve. Verifies that on DNS failure the backoff is capped to 901ms (not the uncapped 2000ms), and that eviction fires after that interval.Warning
Firewall rules blocked me from connecting to one or more addresses (expand for details)
I tried to connect to the following addresses, but was blocked by firewall rules:
dl.google.com/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/bin/java bazel(envoy) --add-opens=java.base/java.lang=ALL-UNNAMED -Xverify:none -Djava.util.logging.config.file=/build/bazel_root/base/javalog.properties -Dcom.google.devtools.build.lib.util.LogHandlerQuerier.class=com.google.devtools.build.lib.util.SimpleLogHandler$HandlerQuerier -XX:-MaxFDLimit -Djava.library.path=/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib:/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib/server:/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/ -Dfile.encoding=ISO-8859-1 -Duser.country= -Duser.language= -Duser.variant= -Xmx3g -DBAZEL_TRACK_SOURCE_DIRECTORIES=1 -Djavax.net.ssl.trustStore=/tmp/custom-cacerts -Djavax.net.ssl.trustStorePassword=changeit -jar /build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/A-server.jar --max_idle_secs=10800 --noshutdown_on_low_sys_mem --connect_timeout_secs=30(dns block)/home/REDACTED/.cache/envoy-bazel/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/bin/java bazel(envoy) --add-opens=java.base/java.lang=ALL-UNNAMED -Xverify:none -Djava.util.logging.config.file=/home/REDACTED/.cache/envoy-bazel/bazel_root/base/javalog.properties -Dcom.google.devtools.build.lib.util.LogHandlerQuerier.class=com.google.devtools.build.lib.util.SimpleLogHandler$HandlerQuerier -XX:-MaxFDLimit -Djava.library.path=/home/REDACTED/.cache/envoy-bazel/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib:/home/REDACTED/.cache/envoy-bazel/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib/server:/home/REDACTED/. 3c4955e/node_modules/.bin/node(dns block)ftpmirror.gnu.org/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/bin/java bazel(envoy) --add-opens=java.base/java.lang=ALL-UNNAMED -Xverify:none -Djava.util.logging.config.file=/build/bazel_root/base/javalog.properties -Dcom.google.devtools.build.lib.util.LogHandlerQuerier.class=com.google.devtools.build.lib.util.SimpleLogHandler$HandlerQuerier -XX:-MaxFDLimit -Djava.library.path=/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib:/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib/server:/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/ -Dfile.encoding=ISO-8859-1 -Duser.country= -Duser.language= -Duser.variant= -Xmx3g -DBAZEL_TRACK_SOURCE_DIRECTORIES=1 -Djavax.net.ssl.trustStore=/tmp/custom-cacerts -Djavax.net.ssl.trustStorePassword=changeit -jar /build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/A-server.jar --max_idle_secs=10800 --noshutdown_on_low_sys_mem --connect_timeout_secs=30(dns block)mirror.bazel.build/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/bin/java bazel(envoy) --add-opens=java.base/java.lang=ALL-UNNAMED -Xverify:none -Djava.util.logging.config.file=/build/bazel_root/base/javalog.properties -Dcom.google.devtools.build.lib.util.LogHandlerQuerier.class=com.google.devtools.build.lib.util.SimpleLogHandler$HandlerQuerier -XX:-MaxFDLimit -Djava.library.path=/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib:/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/embedded_tools/jdk/lib/server:/build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/ -Dfile.encoding=ISO-8859-1 -Duser.country= -Duser.language= -Duser.variant= -Xmx3g -DBAZEL_TRACK_SOURCE_DIRECTORIES=1 -Djavax.net.ssl.trustStore=/tmp/custom-cacerts -Djavax.net.ssl.trustStorePassword=changeit -jar /build/bazel_root/install/fb2a7f6d344d2f4e335882534df59296/A-server.jar --max_idle_secs=10800 --noshutdown_on_low_sys_mem --connect_timeout_secs=30(dns block)If you need me to access, download, or install something from one of these locations, you can either:
Original prompt
Background
Issue #44426 describes flakes in three DFP integration tests:
ProxyFilterIntegrationTest.UseCacheFileShortTtlProxyFilterIntegrationTest.StreamPersistAcrossShortTtlResFailProxyFilterIntegrationTest.UseCacheFileShortTtlHostActiveThe root cause (as identified in PR #44538 and #44505) is a race in
DnsCacheImpl::onReResolveAlarm:https://github.com/envoyproxy/envoy/blob/main/source/extensions/common/dynamic_forward_proxy/dns_cache_impl.cc#L287-L302
If a request's
touch()call lands just before the alarm fires,(now - last_used) <= host_ttl, so instead of evicting we go intostartResolve. For a hostname that doesn't resolve (test scenario) this means:dns_query_timeout(~5s default)A single failed cycle already costs ~7s (
cache_ttl+dns_query_timeout+ backoff). Two cycles exceed the default test wait timeout, causing the flake under slow builders (MSAN).The short-term workaround (PRs #44505, #44538) either widens the wait timeout, bumps
cache_ttl, or shortensdns_query_timeout. None address the underlying issue.Investigation task
Investigate whether the logic in
onReResolveAlarmis structurally correct, and propose a fix. Questions to consider:Is
startResolvethe right action whentouch()ran recently? The host is actively being used, and if the previous resolution succeeded we already have a valid address. Starting a new DNS query gambles a fulldns_query_timeouton an operation that may be unnecessary.When
startResolvefails, when does the next eviction check happen? Currently the failure backoff schedules the next re-resolve, not the next eviction check. This couples two orthogonal concerns (re-resolution freshness vs. cache expiry).Would it be reasonable to re-arm the refresh timer with a short interval (e.g., a fraction of
host_ttl) whentouch()was recent, rather than kicking off a full DNS query? This decouples eviction from re-resolution.Is there a cleaner approach, e.g., making eviction run on its own timer independent of the re-resolve alarm?
Deliverable
A PR that:
cache_ttlintervals.test/extensions/common/dynamic_forward_proxy/dns_cache_impl_test.ccthat simulates the race (e.g., by callingtouch()between alarm schedule and fire, or by using simulated time) and verifies eviction happens deterministically.waitForCounterGe("host_removed", 1)with default timeout).Do not change the public config surface (e.g. don't add new proto fields) unless strictly necessary. Prefer a minimal internal logic change.
Related
ProxyFilterIntegrationTest.UseCacheFileShortTtlHostActive/is flakey #44426Note the #44515 PR touches the same files — ensure changes here don't conflict; coordinate rebase if both are in flight.
The following is the prior conversation context from the user's chat exploration (may be truncated):
User: so trying to resolve the linked flake issue - you previously suggested me the changes in @envoyproxy/envoy/pull/44505
another you has suggested that is a crap fix and instead suggested #44538
wdyt?
annoyingly - despite that i previously was able to repro this flake i cant now - so not much point in testing against the fix branch
Assistant: Here's my take on the two approaches:
The two fixes
#44538 (jwendell): Bumps
dns_cache_ttl_from 2s → 5s in the three short-TTL tests. One-liner × 3.#44505 (yours):
dns_query_timeout=1(default is much higher) so each failed re-resolve cycle is shortwaitForCounterGe("dns_cache.foo.dns_query_failure", 1)as an explicit sync barrier beforewaitForCounterGe("host_removed", 1)size = "large"andrbe_pool = "2core"from the BUILD file (looks unrelated/accidental)Root cause ...
This pull request was created from Copilot chat.