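Report of connect attempts bugs
Relevant configs:
- proxy.config.http.connect_attempts_max_retries
- proxy.config.http.connect_attempts_max_retries_suspect_server (replaces the deprecated proxy.config.http.connect_attempts_max_retries_down_server; see PR Fix connect attempt retries #13102)
- proxy.config.http.connect_attempts_rr_retries
- proxy.config.http.down_server.cache_time (the fail_window)
PRs: #12846, #12880, #13092, #13102
Net effect between 10.1.x and later
On 10.1.x and earlier, the three connect_attempts_* configs behaved erratically:
- connect_attempts_max_retries_down_server was effectively unreachable because the is_down() check was inverted (Fix HostDBInfo::is_down condition #12846).
- Single-address origins ignored down-state entirely: select_best_http unconditionally returned the first (and only) entry (Check state of HostDBInfo #12880).
- The "mark host DOWN" threshold was gated by connect_attempts_rr_retries instead of connect_attempts_max_retries (Fix connect attempt retries #13102).
- The fail-window clock used client_request_time instead of the actual failure time, so the window could start in the past or be pre-expired (Fix connect attempt retries #13102).
- After select_next_rr(), dst_addr was not updated, so RR "switches" silently kept hitting the same target (Fix connect attempt retries #13102).
- mark_host_failure() ran at end-of-txn against the post-switch dns_info.active, so failures on A could be charged to B (Fix connect attempt retries #13102).
The combined effect of PRs #12846, #12880, #13092, and #13102 is a coherent model in which each config has a well-defined role tied to an explicit UP / DOWN / SUSPECT state for the active target.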
Worked examples
All examples use the ATS defaults:
| Config | Default |
| --- | --- |
| connect_attempts_max_retries | 3 |
| connect_attempts_max_retries_suspect_server | 1 |
| connect_attempts_rr_retries | 3 |
| down_server.cache_time (fail_window) | 60s |
Example 1 — Single-address origin goes offline
Client sends repeated requests to an origin whose only A record points to a down IP. How many
connect attempts does ATS make before the IP is marked DOWN in HostDB?
10.1.x behavior — IP marked DOWN after 12 connect attempts (3 failed transactions).
Why: mark_host_failure() is only called from do_hostdb_update_if_necessary(), which is only
reached at the end of a transaction (via handle_server_connection_not_open). So fail_count
increments once per failed transaction, not once per attempt. The threshold is connect_attempts_rr_retries (default 3) — the wrong config, but that's what 10.1.x uses.
Per-transaction there are 1 + connect_attempts_max_retries = 4 attempts.
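| Attempt | Txn | fail_count after | marked DOWN? |
| --- | --- | --- | --- |
| 1 | 1 | 0 | no |
| 2 | 1 | 0 | no |
| 3 | 1 | 0 | no |
| 4 | 1 | 1 (end of txn) | no |
| 5 | 2 | 1 | no |
| 6 | 2 | 1 | no |
| 7 | 2 | 1 | no |
| 8 | 2 | 2 (end of txn) | no |
| 9 | 3 | 2 | no |
| 10 | 3 | 2 | no |
| 11 | 3 | 2 | no |
| 12 | 3 | 3 ≥ 3 | yes |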
And because select_best_http() returns info[0] unconditionally for single-address records
(#12880), every one of those 12 attempts actually hits the network. Even after the IP is marked
DOWN, subsequent requests keep trying it because the single-address code path ignores down-state.
Later behavior — IP marked DOWN after 4 connect attempts (1 failed transaction).
Why: do_hostdb_update_if_necessary() is now called after each failure in handle_response_from_server(), so fail_count increments once per attempt. The threshold is max_retries + 1 = 4 (for an UP target). Per-transaction there are still 1 + max_retries = 4
attempts, so the threshold is reached on the last attempt of the first failed transaction.
| Attempt | Txn | fail_count after | marked DOWN? |
| --- | --- | --- | --- |
| 1 | 1 | 1 | no |
| 2 | 1 | 2 | no |
| 3 | 1 | 3 | no |
| 4 | 1 | 4 ≥ 4 | yes |
Subsequent requests within fail_window (60s): select_best_http() now honors down-state for
single-address records, so HostDB lookup returns nullptr and the SM reports OriginDown
immediately — zero connect attempts.
After fail_window elapses, the entry is SUSPECT. The next request gets connect_attempts_max_retries_suspect_server + 1 = 2 attempts (1 probe + 1 retry). Success → UP;
failure → back to DOWN for another 60s.
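The two attempt counts in Example 1 can be reproduced with a small simulation. This is only a sketch of the counting logic described above, not ATS code: `CountModel` and `attempts_until_down` are hypothetical names, and the threshold is passed in by the caller (`connect_attempts_rr_retries` for 10.1.x, `max_retries + 1` for the later behavior).

```cpp
// Hypothetical counting models; the names are illustrative, not ATS identifiers.
enum class CountModel { PerTransaction, PerAttempt };

// Simulates back-to-back failing transactions against a single dead IP and
// returns how many connect attempts are made before fail_count reaches the
// mark-down threshold.
int attempts_until_down(CountModel model, int max_retries, int threshold)
{
  int fail_count = 0;
  int attempts   = 0;
  while (true) {
    // Each transaction makes 1 initial attempt plus max_retries retries.
    for (int i = 0; i < 1 + max_retries; ++i) {
      ++attempts;
      // Later behavior: fail_count is bumped after every failed attempt.
      if (model == CountModel::PerAttempt && ++fail_count >= threshold) {
        return attempts;
      }
    }
    // 10.1.x behavior: fail_count is bumped once, at the end of the transaction.
    if (model == CountModel::PerTransaction && ++fail_count >= threshold) {
      return attempts;
    }
  }
}
```

With the defaults, `attempts_until_down(CountModel::PerTransaction, 3, 3)` walks through three full transactions, while `attempts_until_down(CountModel::PerAttempt, 3, 4)` stops on the last attempt of the first one.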
Example 2 — Round-robin with one bad backend (A fails, B/C healthy)
Defaults apply: rr_retries = 3, max_retries = 3. A is unresponsive, B and C are fine. Initial
active = A.
10.1.x behavior — client gets 502, and the wrong host is blamed.
Two compounding bugs: (a) after select_next_rr() is called, neither dns_info.addr nor server_info.dst_addr is updated, so the next connect still goes to A's address. (b) mark_host_failure() is called once at end-of-txn against dns_info.active, which by that point is
the post-switch host (B), so B's fail_count is incremented for A's failures.
| Attempt | dst_addr (target) | dns_info.active | retry_attempts after | (retry+1) % 3 == 0? | Result |
| --- | --- | --- | --- | --- | --- |
| 1 | A | A | 1 | no | fail |
| 2 | A | A | 2 | no | fail |
| 3 | A | A → switched to B (select_next_rr); dst_addr untouched | 3 | yes | fail |
| 4 | A (still!) | B | 3 ≥ 3, give up | n/a | fail → 502 to client |
End of txn: mark_host_failure() increments B.fail_count even though B was never tried. Repeated
requests do this 3 times → B gets marked DOWN while A (the actually-broken host) keeps getting
hit on every transaction.
Later behavior — client gets 200; A's failures are correctly recorded against A.
do_hostdb_update_if_necessary() runs after each failure (so the correct dns_info.active is in
scope), and after select_next_rr() the SM explicitly assigns dns_info.addr and server_info.dst_addr to the new target.
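| Attempt | dst_addr (target) | dns_info.active | A.fail_count after | retry_attempts after | RR switch? | Result |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | A | A | 1 | 1 | no | fail |
| 2 | A | A | 2 | 2 | no | fail |
| 3 | A | A | 3 | 3 | yes → active=B, dst_addr=B | fail |
| 4 | B | B | 3 (unchanged) | n/a | n/a | success → 200 |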
A finishes this transaction with fail_count = 3 (not yet at the threshold of max_retries + 1 = 4). The next request that hashes to A bumps it to 4 and marks A DOWN; subsequent
requests within fail_window skip A entirely until the window expires.
If B and C had also been DOWN (no selectable alternate), select_next_rr() returns false and the SM
stays on A using its remaining per-host retry budget instead of giving up — new in #13102.
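The difference between the two Example 2 tables comes down to two things: whether the connect target follows the RR switch, and when the failure is charged. The following is a simplified model under stated assumptions, not the actual state machine: `TxnResult` and `run_txn` are hypothetical, host A (index 0) always fails, B and C always succeed, and the RR switch is modeled as "every third attempt".

```cpp
#include <array>

struct TxnResult {
  char connected_to;            // 'A', 'B', 'C', or '-' when every attempt failed
  std::array<int, 3> failures;  // failures charged to A, B, C in this transaction
};

// Models one transaction with rr_retries = 3 and max_retries = 3.
// fixed == false approximates the 10.1.x behavior, fixed == true the later one.
TxnResult run_txn(bool fixed)
{
  const int rr_retries   = 3;
  const int max_attempts = 4;  // 1 + max_retries
  int active = 0;              // stands in for dns_info.active
  int dst    = 0;              // stands in for server_info.dst_addr
  TxnResult result{'-', {0, 0, 0}};

  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (dst != 0) {  // only A (index 0) is unresponsive
      result.connected_to = static_cast<char>('A' + dst);
      return result;
    }
    if (fixed) {
      ++result.failures[active];  // charged immediately, against the host that failed
    }
    if (attempt % rr_retries == 0) {
      active = (active + 1) % 3;  // stands in for select_next_rr()
      if (fixed) {
        dst = active;  // the later code also updates the connect target
      }
    }
  }
  if (!fixed) {
    ++result.failures[active];  // end-of-txn update hits the post-switch host
  }
  return result;
}
```

In this model `run_txn(false)` never leaves A and charges one failure to B, while `run_txn(true)` connects to B on the fourth attempt with three failures recorded against A.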
Later behavior
Failure timestamp is ts_clock::now() at the moment mark_down is called, so the full fail_window is honored regardless of how long the transaction took to decide the host was bad.
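To see why the choice of timestamp matters, consider how much of the fail window is left at the moment the failure is recorded. This is an illustrative sketch only: `Clock` is a stand-in for `ts_clock`, and `remaining_window` is a hypothetical helper.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;  // assumption: stand-in for ts_clock

// Returns how much of fail_window is still left at `now` when the failure
// was stamped at `stamp`.
std::chrono::seconds remaining_window(Clock::time_point stamp, Clock::time_point now,
                                      std::chrono::seconds fail_window)
{
  auto elapsed = std::chrono::duration_cast<std::chrono::seconds>(now - stamp);
  return elapsed >= fail_window ? std::chrono::seconds{0} : fail_window - elapsed;
}
```

If a request arrives at time t0 and the host is only judged bad 90 seconds later, stamping the failure with the request time leaves nothing of a 60-second window, whereas stamping it with the current time leaves the full 60 seconds.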
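Summary of bug-fix PRs
PR #12846 — Fix HostDBInfo::is_down condition
Bug. HostDBInfo::is_down() had an inverted comparison. The pre-fix code reported a host as DOWN only after fail_window had elapsed, the opposite of the intent. During the fail window (when the host should be considered blocked) it returned false.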
Behavior change. A failed host is now correctly treated as DOWN for the entire fail_window
following last_failure, and returns to UP only after the window has elapsed. Every downstream path
that calls is_down() (retry budget, RR selection, negative-cached check) is impacted.
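The inversion can be illustrated with a pair of predicates. These are hypothetical reconstructions for illustration, not the code from the PR: the function names, the use of a zero timestamp for "no known failure", and the exact boundary comparison are all assumptions.

```cpp
#include <chrono>

using Seconds = std::chrono::seconds;

// Assumed shape of the pre-fix check: DOWN only once the window has elapsed.
bool is_down_inverted(Seconds now, Seconds last_failure, Seconds fail_window)
{
  return last_failure != Seconds{0} && now > last_failure + fail_window;
}

// Assumed shape of the corrected check: DOWN while still inside the window.
bool is_down_fixed(Seconds now, Seconds last_failure, Seconds fail_window)
{
  return last_failure != Seconds{0} && now <= last_failure + fail_window;
}
```

For a failure at t=100s and a 60s window, the inverted form answers "not down" at t=130s and "down" at t=200s, which is exactly backwards from the fixed form.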
PR #12880 — Check state of HostDBInfo in select_best_http
Bug. For single-address (non round-robin) records, HostDBRecord::select_best_http() bypassed the state check, so a single-address host that was marked down would still be returned from HostDB and reused for the next request.
Behavior change. Single-address origins now honor down_server state the same way RR origins
do: if the only target is DOWN and still within the fail window, HostDB lookup fails (nullptr),
the SM reports OriginDown, and no connection attempt is made until the window expires. The gold
autest dns_host_down was updated accordingly (second request returns 500 instead of 502 once the
first failure marked the IP down).
PR #13092 — Clarify HostDBInfo state
Not a bug fix but a model refactor. It introduces an explicit tri-state for upstream health and reshapes the API around it:
enum class HostDBInfo::State { UP, DOWN, SUSPECT };
- UP: no known failure; normal selection.
- DOWN: _last_failure is set and now is within fail_window; not eligible.
- SUSPECT: fail_window has elapsed; a probe is permitted. A successful response transitions it to UP via mark_up(); another failure returns it to DOWN.
API renames: is_alive → is_up, mark_active_server_alive → mark_active_server_up. mark_down now takes fail_window. select() is replaced by callers using is_down() directly. last_failure / fail_count become private atomics (_last_failure, _fail_count).
Behavior change. The previous two-state model collapsed SUSPECT into UP once fail_window
expired, which meant a recovering host was treated identically to a never-failed host for retry
sizing. With SUSPECT explicit, callers (notably the retry machinery in #13102) can apply connect_attempts_max_retries_suspect_server to probing traffic and connect_attempts_max_retries
to UP traffic.
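A minimal sketch of the tri-state model, assuming the state can be derived from the last failure time alone. `HostInfo` is a hypothetical simplification: it ignores fail_count and atomics, takes the window in `state()` rather than in `mark_down()`, and uses a zero timestamp to mean "no known failure".

```cpp
#include <chrono>

enum class State { UP, DOWN, SUSPECT };

struct HostInfo {
  std::chrono::seconds last_failure{0};  // assumption: 0 means "no known failure"

  State state(std::chrono::seconds now, std::chrono::seconds fail_window) const
  {
    if (last_failure == std::chrono::seconds{0}) {
      return State::UP;
    }
    // Inside the window the host is blocked; afterwards a probe is allowed.
    return now < last_failure + fail_window ? State::DOWN : State::SUSPECT;
  }

  void mark_down(std::chrono::seconds now) { last_failure = now; }
  void mark_up() { last_failure = std::chrono::seconds{0}; }
};
```

A host marked down at t=100s reads as DOWN at t=130s and SUSPECT at t=161s with a 60s window; a successful probe calls `mark_up()` and returns it to UP, while a failed probe calls `mark_down()` again and restarts the window.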
PR #13102 — Fix connect attempt retries
Bugs. The retry machinery in HttpTransact::handle_response_from_server() and HttpSM::mark_host_failure() had several interlocking issues:
- Wrong threshold used to mark host down. mark_host_failure() called increment_fail_count(..., connect_attempts_rr_retries, ...). The RR switch-over config was reused as the "mark down" threshold, so a host was marked DOWN after connect_attempts_rr_retries failures instead of connect_attempts_max_retries + 1 total attempts.
- Wrong clock for failure timestamp. do_hostdb_update_if_necessary() recorded the failure at t_state.client_request_time (request-receipt time), not ts_clock::now(). For long requests the fail window started in the past, sometimes expiring before the failure was even recorded.
- RR exhaustion short-circuited all retries. When the connect_attempts_rr_retries boundary was hit and no other RR member was selectable, the old code gave up, even though the current target still had per-host retry budget.
- connect_attempts_max_retries_down_server had no effect in practice. It was only consulted through is_server_negative_cached(), which depended on the inverted is_down() from Fix HostDBInfo::is_down condition #12846. In many cases it was either never applied or applied when it shouldn't have been.
- connect_attempts_max_retries_down_server == 0 blocked the probe. A separate code path in do_http_server_open() refused to connect at all when the target was negative-cached and the config was zero, preventing the SUSPECT-state probe that the fail window was designed to allow.
- Retry config range allowed overflow. With a permitted range of [0-255], uint8_t threshold = max_retries + 1 wraps to 0.
- Config name didn't match its actual effect. The "down_server" config never applied to DOWN-state origins (DOWN-state retries are now hardcoded to 0); it controls the probe budget for SUSPECT-state origins. Renamed to proxy.config.http.connect_attempts_max_retries_suspect_server to align with HostDBInfo::State::SUSPECT. The previous proxy.config.http.connect_attempts_max_retries_down_server record is kept and marked deprecated; if only it is set, its value is mirrored forward into the new record with a warning, and if both are set the new record wins. The legacy TS_CONFIG_HTTP_CONNECT_ATTEMPTS_MAX_RETRIES_DOWN_SERVER plugin enum continues to work, since it aliases to the same internal field as the new _SUSPECT_SERVER enum.
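The overflow item in the list above is easy to demonstrate in isolation. The helper names below are hypothetical; the sketch only shows the arithmetic of adding 1 to a value stored in a `uint8_t`, and how clamping the configured value to 254 first avoids the wrap.

```cpp
#include <cstdint>

// Total attempts = retries + 1, narrowed back into uint8_t.
// With max_retries == 255 the sum is 256, which wraps to 0.
std::uint8_t attempt_threshold(std::uint8_t max_retries)
{
  return static_cast<std::uint8_t>(max_retries + 1);
}

// Clamping the configured value to [0, 254] keeps retries + 1 representable.
std::uint8_t clamped_attempt_threshold(int configured)
{
  int clamped = configured < 0 ? 0 : (configured > 254 ? 254 : configured);
  return static_cast<std::uint8_t>(clamped + 1);
}
```

A threshold of 0 is the dangerous case, because a "fail_count >= threshold" style comparison would then be satisfied immediately.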
Behavior change.
- Retry budget is now driven by HostDB state (via the new helper HttpTransact::origin_server_connect_attempts_max_retries):
  - UP → connect_attempts_max_retries
  - DOWN → 0 (bail out immediately)
  - SUSPECT → connect_attempts_max_retries_suspect_server
- mark_host_failure uses max_retries + 1 as the attempt budget and ts_clock::now() as the failure time.
- RR switch and per-host retry are separated: if RR has no selectable alternate, the SM stays on the current target and keeps retrying up to the per-host budget rather than giving up.
- The do_http_server_open() early-bail on "negative cached + down_server=0" is removed; the DOWN vs SUSPECT state is now the single source of truth.
- proxy.config.http.connect_attempts_max_retries, …_max_retries_suspect_server, and …_rr_retries are clamped to [0-254] to avoid the uint8_t overflow when adding 1 for the total-attempt count.
- A new warning is logged at config reconfigure when connect_attempts_max_retries_suspect_server == 0 and connect_attempts_rr_retries > 0, because that combination prevents any probe of a recovering (SUSPECT) origin.
- proxy.config.http.connect_attempts_max_retries_down_server is now a deprecated alias of …_suspect_server. The struct field OverridableHttpConfigParams::connect_attempts_max_retries_down_server was removed; both records.yaml entries and both TSOverridableConfigKey enum values write into a single field, so existing operator configs and plugins keep working while the codebase has one source of truth.
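The state-driven budget and the deprecated-alias rule can be summarized in a few lines. This is a sketch under assumptions rather than the real helper: `RetryConfig`, `max_retries_for`, and `resolve_suspect_retries` are hypothetical names, and unset records are modeled with `std::optional`.

```cpp
#include <optional>

enum class State { UP, DOWN, SUSPECT };

struct RetryConfig {
  int max_retries                = 3;  // connect_attempts_max_retries
  int max_retries_suspect_server = 1;  // connect_attempts_max_retries_suspect_server
};

// Per-host retry budget chosen from the target's HostDB state.
int max_retries_for(State state, const RetryConfig &cfg)
{
  switch (state) {
  case State::UP:
    return cfg.max_retries;
  case State::SUSPECT:
    return cfg.max_retries_suspect_server;
  case State::DOWN:
    return 0;  // bail out immediately
  }
  return 0;
}

// Resolves the deprecated down_server record against the new suspect_server one.
int resolve_suspect_retries(std::optional<int> suspect_server, std::optional<int> down_server,
                            int default_value = 1)
{
  if (suspect_server) {
    return *suspect_server;  // the new record wins when both are set
  }
  if (down_server) {
    return *down_server;  // legacy value is mirrored forward
  }
  return default_value;
}
```

With the defaults, an UP target gets 3 retries, a SUSPECT target gets 1, and a DOWN target gets none; a config that sets only the legacy record to 2 resolves to 2.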