Skip to content

Connect Attempts Bugs and Fixes #13169

@masaori335

Description

@masaori335

Report of connect attempts bugs

  • affected versions: 10.1.x and smaller

Relevant configs:

  • proxy.config.http.connect_attempts_max_retries
  • proxy.config.http.connect_attempts_max_retries_suspect_server
    (replaces the deprecated proxy.config.http.connect_attempts_max_retries_down_server
    see PR Fix connect attempt retries #13102)
  • proxy.config.http.connect_attempts_rr_retries
  • proxy.config.http.down_server.cache_time (the fail_window)

PRs:

Net effect between 10.1.x and later

On 10.1.x and earlier, the three connect_attempts_* configs behaved erratically:

The combined effect of PRs #12846, #12880, #13092, and #13102 is a coherent model in which each
config has a well-defined role tied to an explicit UP / DOWN / SUSPECT state for the active
target.

Worked examples

All examples use the ATS defaults:

Config Default
connect_attempts_max_retries 3
connect_attempts_max_retries_suspect_server 1
connect_attempts_rr_retries 3
down_server.cache_time (fail_window) 60s

Example 1 — Single-address origin goes offline

Client sends repeated requests to an origin whose only A record points to a down IP. How many
connect attempts does ATS make before the IP is marked DOWN in HostDB?

10.1.x behavior — IP marked DOWN after 12 connect attempts (3 failed transactions).

Why: mark_host_failure() is only called from do_hostdb_update_if_necessary(), which is only
reached at the end of a transaction (via handle_server_connection_not_open). So fail_count
increments once per failed transaction, not once per attempt. The threshold is
connect_attempts_rr_retries (default 3) — the wrong config, but that's what 10.1.x uses.
Per-transaction there are 1 + connect_attempts_max_retries = 4 attempts.

Attempt Txn fail_count after marked DOWN?
1 1 0 no
2 1 0 no
3 1 0 no
4 1 1 (end of txn) no
5 2 1 no
6 2 1 no
7 2 1 no
8 2 2 (end of txn) no
9 3 2 no
10 3 2 no
11 3 2 no
12 3 3 ≥ 3 yes

And because select_best_http() returns info[0] unconditionally for single-address records
(#12880), every one of those 12 attempts actually hits the network. Even after the IP is marked
DOWN, subsequent requests keep trying it because the single-address code path ignores down-state.

Later behavior — IP marked DOWN after 4 connect attempts (1 failed transaction).

Why: do_hostdb_update_if_necessary() is now called after each failure in
handle_response_from_server(), so fail_count increments once per attempt. The threshold is
max_retries + 1 = 4 (for an UP target). Per-transaction there are still 1 + max_retries = 4
attempts, so the threshold is reached on the last attempt of the first failed transaction.

Attempt Txn fail_count after marked DOWN?
1 1 1 no
2 1 2 no
3 1 3 no
4 1 4 ≥ 4 yes

Subsequent requests within fail_window (60s): select_best_http() now honors down-state for
single-address records, so HostDB lookup returns nullptr and the SM reports OriginDown
immediately — zero connect attempts.

After fail_window elapses, the entry is SUSPECT. The next request gets
connect_attempts_max_retries_suspect_server + 1 = 2 attempts (1 probe + 1 retry). Success → UP;
failure → back to DOWN for another 60s.

Example 2 — Round-robin with one bad backend (A fails, B/C healthy)

Defaults apply: rr_retries = 3, max_retries = 3. A is unresponsive, B and C are fine. Initial
active = A.

10.1.x behavior — client gets 502, and the wrong host is blamed.

Two compounding bugs: (a) after select_next_rr() is called, neither dns_info.addr nor
server_info.dst_addr is updated, so the next connect still goes to A's address. (b)
mark_host_failure() is called once at end-of-txn against dns_info.active, which by that point is
the post-switch host (B), so B's fail_count is incremented for A's failures.

Attempt dst_addr (target) dns_info.active retry_attempts after (retry+1) % 3 == 0? Result
1 A A 1 no fail
2 A A 2 no fail
3 A A → switched to B (select_next_rr); dst_addr untouched 3 yes fail
4 A (still!) B 3 ≥ 3, give up n/a fail → 502 to client

End of txn: mark_host_failure() increments B.fail_count even though B was never tried. Repeated
requests do this 3 times → B gets marked DOWN while A (the actually-broken host) keeps getting
hit on every transaction.

Later behavior — client gets 200; A's failures are correctly recorded against A.

do_hostdb_update_if_necessary() runs after each failure (so the correct dns_info.active is in
scope), and after select_next_rr() the SM explicitly assigns dns_info.addr and
server_info.dst_addr to the new target.

Attempt dst_addr (target) dns_info.active A.fail_count after retry_attempts after RR switch? Result
1 A A 1 1 no fail
2 A A 2 2 no fail
3 A A 3 3 yes → active=B, dst_addr=B fail
4 B B 3 (unchanged) n/a n/a success → 200

A finishes this transaction with fail_count = 3 (not yet at the threshold of
max_retries + 1 = 4). The next request that hashes to A bumps it to 4 and marks A DOWN; subsequent
requests within fail_window skip A entirely until the window expires.

If B and C had also been DOWN (no selectable alternate), select_next_rr() returns false and the SM
stays on A using its remaining per-host retry budget instead of giving up — new in #13102.

Later behavior

  • Failure timestamp is ts_clock::now() at the moment mark_down is called, so the full
    fail_window is honored regardless of how long the transaction took to decide the host was bad.

Summary of bug-fix PRs

PR #12846 — Fix HostDBInfo::is_down condition

Bug. HostDBInfo::is_down() had an inverted comparison:

// Before
return (last_fail != TS_TIME_ZERO) && (last_fail + fail_window < now);
// After
return (last_fail != TS_TIME_ZERO) && (now <= last_fail + fail_window);

The pre-fix code reported a host as DOWN only after fail_window had elapsed — the opposite of
intent. During the fail window (when the host should be considered blocked) it returned false.

Behavior change. A failed host is now correctly treated as DOWN for the entire fail_window
following last_failure, and returns to UP only after the window has elapsed. Every downstream path
that calls is_down() (retry budget, RR selection, negative-cached check) is impacted.

PR #12880 — Check state of HostDBInfo in select_best_http

Bug. For single-address (non round-robin) records, HostDBRecord::select_best_http() bypassed
the state check:

// Before
best_alive = &info[0];                       // unconditional
// After
if (info[0].select(now, fail_window)) {
  best_alive = &info[0];
}

A single-address host that was marked down would still be returned from HostDB and reused for the
next request.

Behavior change. Single-address origins now honor down_server state the same way RR origins
do: if the only target is DOWN and still within the fail window, HostDB lookup fails (nullptr),
the SM reports OriginDown, and no connection attempt is made until the window expires. The gold
autest dns_host_down was updated accordingly (second request returns 500 instead of 502 once the
first failure marked the IP down).

PR #13092 — Clarify HostDBInfo state

Not a bug fix — a model/refactor. Introduces an explicit tri-state for upstream health and
reshapes the API around it:

enum class HostDBInfo::State { UP, DOWN, SUSPECT };
  • UP — no known failure; normal selection.
  • DOWN_last_failure set, now is within fail_window; not eligible.
  • SUSPECTfail_window has elapsed; a probe is permitted. A successful response transitions
    it to UP via mark_up(); another failure returns it to DOWN.

API renames: is_aliveis_up, mark_active_server_alivemark_active_server_up.
mark_down now takes fail_window. select() is replaced by callers using is_down() directly.
last_failure / fail_count become private atomics (_last_failure, _fail_count).

Behavior change. The previous two-state model collapsed SUSPECT into UP once fail_window
expired, which meant a recovering host was treated identically to a never-failed host for retry
sizing. With SUSPECT explicit, callers (notably the retry machinery in #13102) can apply
connect_attempts_max_retries_suspect_server to probing traffic and connect_attempts_max_retries
to UP traffic.

PR #13102 — Fix connect attempt retries

Bugs. The retry machinery in HttpTransact::handle_response_from_server() and
HttpSM::mark_host_failure() had several interlocking issues:

  1. Wrong threshold used to mark host down. mark_host_failure() called
    increment_fail_count(..., connect_attempts_rr_retries, ...). The RR switch-over config was
    reused as the "mark down" threshold, so a host was marked DOWN after
    connect_attempts_rr_retries failures instead of connect_attempts_max_retries + 1 total
    attempts.
  2. Wrong clock for failure timestamp. do_hostdb_update_if_necessary() recorded the failure at
    t_state.client_request_time (request-receipt time), not ts_clock::now(). For long requests
    the fail window started in the past, sometimes expiring before the failure was even recorded.
  3. RR exhaustion short-circuited all retries. When the connect_attempts_rr_retries boundary
    was hit and no other RR member was selectable, the old code gave up, even though the current
    target still had per-host retry budget.
  4. connect_attempts_max_retries_down_server had no effect in practice. It was only consulted
    through is_server_negative_cached(), which depended on the inverted is_down() from Fix HostDBInfo::is_down condition #12846. In
    many cases it was either never applied or applied when it shouldn't have been.
  5. connect_attempts_max_retries_down_server == 0 blocked the probe. A separate code path in
    do_http_server_open() refused to connect at all when the target was negative-cached and the
    config was zero — preventing the SUSPECT-state probe that the fail window was designed to
    allow.
  6. Retry config range allowed overflow. [0-255] with uint8_t threshold = max_retries + 1
    wraps to 0.
  7. Config name didn't match its actual effect. The "down_server" config never applied to
    DOWN-state origins (DOWN-state retries are now hardcoded to 0); it controls the probe budget for
    SUSPECT-state origins. Renamed to proxy.config.http.connect_attempts_max_retries_suspect_server
    to align with HostDBInfo::State::SUSPECT. The previous
    proxy.config.http.connect_attempts_max_retries_down_server record is kept and marked
    deprecated; if only it is set, its value is mirrored forward into the new record with a warning,
    and if both are set the new record wins. The legacy
    TS_CONFIG_HTTP_CONNECT_ATTEMPTS_MAX_RETRIES_DOWN_SERVER plugin enum continues to work — it
    aliases to the same internal field as the new _SUSPECT_SERVER enum.

Behavior change.

  • Retry budget is now driven by HostDB state (via the new helper
    HttpTransact::origin_server_connect_attempts_max_retries):
    • UPconnect_attempts_max_retries
    • DOWN → 0 (bail out immediately)
    • SUSPECTconnect_attempts_max_retries_suspect_server
  • mark_host_failure uses max_retries + 1 as the attempt budget and ts_clock::now() as the
    failure time.
  • RR switch and per-host retry are separated: if RR has no selectable alternate, the SM stays on
    the current target and keeps retrying up to the per-host budget rather than giving up.
  • The do_http_server_open() early-bail on "negative cached + down_server=0" is removed — DOWN vs
    SUSPECT is now the single source of truth.
  • proxy.config.http.connect_attempts_max_retries, …_max_retries_suspect_server, and
    …_rr_retries clamped to [0-254] to avoid the uint8_t overflow when adding 1 for the
    total-attempt count.
  • A new warning is logged at config reconfigure when
    connect_attempts_max_retries_suspect_server == 0 and connect_attempts_rr_retries > 0, because
    that combination prevents any probe of a recovering (SUSPECT) origin.
  • proxy.config.http.connect_attempts_max_retries_down_server is now a deprecated alias of
    …_suspect_server. The struct field
    OverridableHttpConfigParams::connect_attempts_max_retries_down_server was removed; both
    records.yaml entries and both TSOverridableConfigKey enum values write into a single field, so
    existing operator configs and plugins keep working while the codebase has one source of truth.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No fields configured for Bug.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions