tshttpproxy: degrade gracefully when WPAD is unavailable#118
Draft
deansheather wants to merge 2 commits into
Draft
tshttpproxy: degrade gracefully when WPAD is unavailable#118deansheather wants to merge 2 commits into
deansheather wants to merge 2 commits into
Conversation
On Windows, tailscaled invokes WinHttpGetProxyForUrl on every outbound
HTTP request to discover a proxy via WPAD (DHCP option 252 / DNS A
wpad.<domain>). On networks where WPAD is not configured this call
blocks for up to 5s and then returns ERROR_WINHTTP_AUTODETECTION_FAILED
or ERROR_WINHTTP_UNABLE_TO_DOWNLOAD_SCRIPT. The previous 10s
negative-cache window was repeatedly torn down by InvalidateCache() on
every netmon link change, so control / DERP / log uploads spent most of
their time waiting on dead WPAD probes.
This change:
* Bumps the negative-cache backoff from 10s to 5 minutes for
autodetection-failed / script-download-failed, and to 30s for the
overall context timeout. InvalidateCache() still resets it on link
changes, so we re-probe WPAD whenever the network actually changes.
* Stops returning a stale cachedProxy.val on context-deadline timeout.
A previous network's PAC-resolved proxy is the wrong answer when
roaming to a network without WPAD; return (nil, nil) so the caller
dials direct.
* Maps every unmapped WinHTTP error to (nil, nil) plus a backoff
instead of returning the raw error. The package-level
ProxyFromEnvironment already discards sysProxyFromEnv errors, but
making the contract explicit at the platform layer prevents future
regressions and makes the intent obvious in the source.
WPAD continues to work unchanged on networks where it is configured;
this change only affects how we degrade when it is not.
Adds a cross-platform regression test that injects a failing
sysProxyFromEnv hook and asserts ProxyFromEnvironment returns
(nil, nil), respects an active noProxyUntil backoff, and re-probes after
InvalidateCache().
References: tailscale#17055, tailscale#10215
deansheather
commented
May 8, 2026
Member
Author
deansheather
left a comment
There was a problem hiding this comment.
Solid diagnosis of the WPAD-on-non-WPAD-network problem, and the longer backoffs + named constants land cleanly. The deep-review found a couple of structural concerns worth a second look before this merges, plus some test coverage gaps in the exact paths the PR changes.
This review contains findings that may need attention before merge. 1 P1, 4 P2, 2 P3, 4 nits across 9 inline comments.
Summary of changes in response to review: * Add an InvalidateCache epoch counter (tshttpproxy.go) and snapshot it at the start of every WPAD probe. The Windows error/timeout branches only commit negative-cache state when the epoch hasn't advanced, so a stale probe from a previous network can no longer stomp noProxyUntil and lock out WPAD on a freshly-changed network for 5 minutes. * Restore an epoch-gated cachedProxy fallback on context-deadline timeout. When the cached proxy was discovered under the current epoch and a probe times out, return the cached value instead of dialing direct — this keeps users on networks that legitimately require a proxy from silently bypassing it during transient WinHTTP slowness. When the cache is stale or empty, we still fall through to direct with a short backoff, which addresses the original 'stranded on a stale proxy after roaming' concern. * Make ProxyFromEnvironment's contract explicit: it now drops sysProxyFromEnv errors via an explicit '_' instead of relying on variable shadowing. This was load-bearing for the Windows fix; a future refactor could have silently re-propagated WinHTTP errors and broken outbound traffic on WPAD-less hosts. * Refactor proxyFromWinHTTPOrCache to take an injectable winHTTPLookup package var and add Windows-targeted tests covering: success path (with epoch stamp), autodetect-failed, unmapped error, timeout with no cached proxy, timeout with fresh cached proxy, timeout with stale-epoch cached proxy, and an explicit regression test for the 'stale probe stomps fresh InvalidateCache' race. * Rename backoff constants for consistency (wpadTimeoutBackoff -> wpadTimeoutFailedBackoff, etc.), fold the ERROR_INVALID_PARAMETER 1-hour value into the named-constant block, and drop the unmapped-error backoff from 5 minutes to 1 minute so a one-off transient WinHTTP error doesn't suppress autodetection for longer than necessary. * Trim the duplicate Coder fork: block comment inside the function body; the rationale lives once on the const block now. * Rename the cross-platform test's misleading wantErr to syntheticErr (the value is intentionally swallowed, not asserted on).
Member
Author
|
/coder-agents-review |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
On Windows, tailscaled invokes
WinHttpGetProxyForUrlon every outbound HTTP request to discover a proxy via WPAD (DHCP option 252 / DNS Awpad.<domain>). On networks where WPAD is not configured, this call blocks for up to 5s and returnsERROR_WINHTTP_AUTODETECTION_FAILEDorERROR_WINHTTP_UNABLE_TO_DOWNLOAD_SCRIPT.The existing 10s negative-cache window was repeatedly torn down by
InvalidateCache()on every netmon link change, so control / DERP / log uploads spent most of their time waiting on dead WPAD probes on hosts without WPAD.Goal: use WPAD if it's available; fall back to direct (or env-var/static proxy) if it's not. No WPAD requirement.
References upstream: tailscale#17055, tailscale#10215.
Changes
All in
net/tshttpproxy/tshttpproxy_windows.go:ERROR_WINHTTP_AUTODETECTION_FAILEDandERROR_WINHTTP_UNABLE_TO_DOWNLOAD_SCRIPT.InvalidateCache()still resets it on real link changes, so legitimate WPAD setups keep working — we just stop re-blocking on a 5s probe for every individual HTTP request.cachedProxy.valfrom a prior network — wrong answer when roaming onto a network without WPAD. Now returns(nil, nil)with a 30s backoff.(nil, nil)+ 5min backoff. The package-levelProxyFromEnvironmentalready swallowssysProxyFromEnverrors via shadowed-err in its wrapper, but making the contract explicit at the platform layer documents intent and prevents regressions.WPAD continues to work unchanged on networks where it is configured; this PR only changes how we degrade when it is not.
Test
Adds a cross-platform regression test (
TestProxyFromEnvironment_sysProxyError) that injects a failingsysProxyFromEnvhook and assertsProxyFromEnvironmentreturns(nil, nil), respects an activenoProxyUntilbackoff, and re-probes afterInvalidateCache().Validation
go test ./net/tshttpproxy/...✅go build ./...(Linux) ✅GOOS=windows go build ./cmd/tailscaled/... ./cmd/tailscale/...✅GOOS=windows go vet ./net/tshttpproxy/...✅Scope intentionally not included
WinHttpGetIEProxyConfigForCurrentUserfast-path that would skip WinHTTP entirely when WPAD is administratively disabled in IE settings / group policy.lpszProxy) when only "Use a proxy server" is configured.Both are pure optimizations and require new syscall bindings +
zsyscall_windows.goregeneration. The user-visible goal (don't require WPAD) is achieved by this PR; those are candidates for a follow-up if telemetry shows it's worth the diff.Behavior change to call out
If a customer today is failing fast due to an unmapped WinHTTP error, they will now silently dial direct instead. The error is still logged (rate-limited) and
winhttp_proxy_err_otherstill increments, so it's observable.