Connect Attempts Bugs and Fixes

# Report of connect attempts bugs

- affected versions: 10.1.x and smaller

Relevant configs:

- `proxy.config.http.connect_attempts_max_retries`
- `proxy.config.http.connect_attempts_max_retries_suspect_server`
  (replaces the deprecated `proxy.config.http.connect_attempts_max_retries_down_server` —
  see PR #13102)
- `proxy.config.http.connect_attempts_rr_retries`
- `proxy.config.http.down_server.cache_time` (the `fail_window`)

PRs:

- https://github.com/apache/trafficserver/pull/12846
- https://github.com/apache/trafficserver/pull/12880
- https://github.com/apache/trafficserver/pull/13092
- https://github.com/apache/trafficserver/pull/13102

## Net effect between 10.1.x and later

On 10.1.x and earlier, the three `connect_attempts_*` configs behaved erratically:

- `connect_attempts_max_retries_down_server` was effectively unreachable because the `is_down()`
  check was inverted (#12846).
- Single-address origins ignored down-state entirely — `select_best_http` unconditionally returned
  the first (and only) entry (#12880).
- The "mark host DOWN" threshold was gated by `connect_attempts_rr_retries` instead of
  `connect_attempts_max_retries` (#13102).
- The fail-window clock used `client_request_time` instead of the actual failure time, so the window
  could start in the past or be pre-expired (#13102).
- After `select_next_rr()`, `dst_addr` was not updated, so RR "switches" silently kept hitting the
  same target (#13102).
- `mark_host_failure()` ran at end-of-txn against the post-switch `dns_info.active`, so failures on
  A could be charged to B (#13102).
- RR exhaustion terminated retries early even when the current target still had per-host budget
  (#13102).

The combined effect of PRs #12846, #12880, #13092, and #13102 is a coherent model in which each
config has a well-defined role tied to an explicit `UP / DOWN / SUSPECT` state for the active
target.

### Worked examples

All examples use the ATS defaults:

| Config                                          | Default |
|-------------------------------------------------|---------|
| `connect_attempts_max_retries`                  | 3       |
| `connect_attempts_max_retries_suspect_server`   | 1       |
| `connect_attempts_rr_retries`                   | 3       |
| `down_server.cache_time` (`fail_window`)        | 60s     |

#### Example 1 — Single-address origin goes offline

Client sends repeated requests to an origin whose only A record points to a down IP. How many
connect attempts does ATS make before the IP is marked DOWN in HostDB?

**10.1.x behavior — IP marked DOWN after 12 connect attempts (3 failed transactions).**

Why: `mark_host_failure()` is only called from `do_hostdb_update_if_necessary()`, which is only
reached at the end of a transaction (via `handle_server_connection_not_open`). So `fail_count`
increments **once per failed transaction**, not once per attempt. The threshold is
`connect_attempts_rr_retries` (default `3`) — the wrong config, but that's what 10.1.x uses.
Per-transaction there are `1 + connect_attempts_max_retries = 4` attempts.

| Attempt | Txn | `fail_count` after | marked DOWN? |
|---------|-----|--------------------|--------------|
| 1       | 1   | 0                  | no           |
| 2       | 1   | 0                  | no           |
| 3       | 1   | 0                  | no           |
| 4       | 1   | 1 (end of txn)     | no           |
| 5       | 2   | 1                  | no           |
| 6       | 2   | 1                  | no           |
| 7       | 2   | 1                  | no           |
| 8       | 2   | 2 (end of txn)     | no           |
| 9       | 3   | 2                  | no           |
| 10      | 3   | 2                  | no           |
| 11      | 3   | 2                  | no           |
| **12**  | 3   | 3 ≥ 3              | **yes**      |

And because `select_best_http()` returns `info[0]` unconditionally for single-address records
(#12880), every one of those 12 attempts actually hits the network. Even after the IP is marked
DOWN, subsequent requests keep trying it because the single-address code path ignores down-state.

**Later behavior — IP marked DOWN after 4 connect attempts (1 failed transaction).**

Why: `do_hostdb_update_if_necessary()` is now called after **each** failure in
`handle_response_from_server()`, so `fail_count` increments **once per attempt**. The threshold is
`max_retries + 1 = 4` (for an UP target). Per-transaction there are still `1 + max_retries = 4`
attempts, so the threshold is reached on the last attempt of the first failed transaction.

| Attempt | Txn | `fail_count` after | marked DOWN? |
|---------|-----|--------------------|--------------|
| 1       | 1   | 1                  | no           |
| 2       | 1   | 2                  | no           |
| 3       | 1   | 3                  | no           |
| **4**   | 1   | 4 ≥ 4              | **yes**      |

Subsequent requests within `fail_window` (60s): `select_best_http()` now honors down-state for
single-address records, so HostDB lookup returns `nullptr` and the SM reports `OriginDown`
immediately — **zero connect attempts**.

After `fail_window` elapses, the entry is SUSPECT. The next request gets
`connect_attempts_max_retries_suspect_server + 1 = 2` attempts (1 probe + 1 retry). Success → UP;
failure → back to DOWN for another 60s.

#### Example 2 — Round-robin with one bad backend (A fails, B/C healthy)

Defaults apply: `rr_retries = 3`, `max_retries = 3`. A is unresponsive, B and C are fine. Initial
active = A.

**10.1.x behavior — client gets 502, and the wrong host is blamed.**

Two compounding bugs: (a) after `select_next_rr()` is called, neither `dns_info.addr` nor
`server_info.dst_addr` is updated, so the next connect still goes to A's address. (b)
`mark_host_failure()` is called once at end-of-txn against `dns_info.active`, which by that point is
the *post-switch* host (B), so B's `fail_count` is incremented for A's failures.

| Attempt | dst_addr (target) | dns_info.active | retry_attempts after | `(retry+1) % 3 == 0`? | Result |
|---------|-------------------|-----------------|----------------------|----------------------|--------|
| 1       | A                 | A               | 1                    | no                   | fail   |
| 2       | A                 | A               | 2                    | no                   | fail   |
| 3       | A                 | A → switched to B (`select_next_rr`); dst_addr untouched | 3 | **yes** | fail |
| 4       | A (still!)        | B               | 3 ≥ 3, give up       | n/a                  | fail → 502 to client |

End of txn: `mark_host_failure()` increments `B.fail_count` even though B was never tried. Repeated
requests do this 3 times → **B gets marked DOWN** while A (the actually-broken host) keeps getting
hit on every transaction.

**Later behavior — client gets 200; A's failures are correctly recorded against A.**

`do_hostdb_update_if_necessary()` runs after each failure (so the correct `dns_info.active` is in
scope), and after `select_next_rr()` the SM explicitly assigns `dns_info.addr` and
`server_info.dst_addr` to the new target.

| Attempt | dst_addr (target) | dns_info.active | A.fail_count after | retry_attempts after | RR switch? | Result |
|---------|-------------------|-----------------|--------------------|----------------------|------------|--------|
| 1       | A                 | A               | 1                  | 1                    | no         | fail   |
| 2       | A                 | A               | 2                  | 2                    | no         | fail   |
| 3       | A                 | A               | 3                  | 3                    | **yes** → active=B, dst_addr=B | fail |
| **4**   | **B**             | B               | 3 (unchanged)      | n/a                  | n/a        | **success → 200** |

A finishes this transaction with `fail_count = 3` (not yet at the threshold of
`max_retries + 1 = 4`). The next request that hashes to A bumps it to 4 and marks A DOWN; subsequent
requests within `fail_window` skip A entirely until the window expires.

If B and C had also been DOWN (no selectable alternate), `select_next_rr()` returns false and the SM
stays on A using its remaining per-host retry budget instead of giving up — new in #13102.

**Later behavior**

- Failure timestamp is `ts_clock::now()` at the moment `mark_down` is called, so the full
  `fail_window` is honored regardless of how long the transaction took to decide the host was bad.

## Summary of bug-fix PRs

### PR #12846 — Fix `HostDBInfo::is_down` condition

**Bug.** `HostDBInfo::is_down()` had an inverted comparison:

```cpp
// Before
return (last_fail != TS_TIME_ZERO) && (last_fail + fail_window < now);
// After
return (last_fail != TS_TIME_ZERO) && (now <= last_fail + fail_window);
```

The pre-fix code reported a host as DOWN only *after* `fail_window` had elapsed — the opposite of
intent. During the fail window (when the host should be considered blocked) it returned `false`.

**Behavior change.** A failed host is now correctly treated as DOWN for the entire `fail_window`
following `last_failure`, and returns to UP only after the window has elapsed. Every downstream path
that calls `is_down()` (retry budget, RR selection, negative-cached check) is impacted.

### PR #12880 — Check state of HostDBInfo in `select_best_http`

**Bug.** For single-address (non round-robin) records, `HostDBRecord::select_best_http()` bypassed
the state check:

```cpp
// Before
best_alive = &info[0];                       // unconditional
// After
if (info[0].select(now, fail_window)) {
  best_alive = &info[0];
}
```

A single-address host that was marked down would still be returned from HostDB and reused for the
next request.

**Behavior change.** Single-address origins now honor `down_server` state the same way RR origins
do: if the only target is DOWN and still within the fail window, HostDB lookup fails (`nullptr`),
the SM reports `OriginDown`, and no connection attempt is made until the window expires. The gold
autest `dns_host_down` was updated accordingly (second request returns 500 instead of 502 once the
first failure marked the IP down).

### PR #13092 — Clarify `HostDBInfo` state

**Not a bug fix — a model/refactor.** Introduces an explicit tri-state for upstream health and
reshapes the API around it:

```cpp
enum class HostDBInfo::State { UP, DOWN, SUSPECT };
```

- **UP** — no known failure; normal selection.
- **DOWN** — `_last_failure` set, `now` is within `fail_window`; not eligible.
- **SUSPECT** — `fail_window` has elapsed; a probe is permitted. A successful response transitions
  it to UP via `mark_up()`; another failure returns it to DOWN.

API renames: `is_alive` → `is_up`, `mark_active_server_alive` → `mark_active_server_up`.
`mark_down` now takes `fail_window`. `select()` is replaced by callers using `is_down()` directly.
`last_failure` / `fail_count` become private atomics (`_last_failure`, `_fail_count`).

**Behavior change.** The previous two-state model collapsed SUSPECT into UP once `fail_window`
expired, which meant a recovering host was treated identically to a never-failed host for retry
sizing. With SUSPECT explicit, callers (notably the retry machinery in #13102) can apply
`connect_attempts_max_retries_suspect_server` to probing traffic and `connect_attempts_max_retries`
to UP traffic.

### PR #13102 — Fix connect attempt retries

**Bugs.** The retry machinery in `HttpTransact::handle_response_from_server()` and
`HttpSM::mark_host_failure()` had several interlocking issues:

1. **Wrong threshold used to mark host down.** `mark_host_failure()` called
   `increment_fail_count(..., connect_attempts_rr_retries, ...)`. The RR switch-over config was
   reused as the "mark down" threshold, so a host was marked DOWN after
   `connect_attempts_rr_retries` failures instead of `connect_attempts_max_retries + 1` total
   attempts.
2. **Wrong clock for failure timestamp.** `do_hostdb_update_if_necessary()` recorded the failure at
   `t_state.client_request_time` (request-receipt time), not `ts_clock::now()`. For long requests
   the fail window started in the past, sometimes expiring before the failure was even recorded.
3. **RR exhaustion short-circuited all retries.** When the `connect_attempts_rr_retries` boundary
   was hit and no other RR member was selectable, the old code gave up, even though the current
   target still had per-host retry budget.
4. **`connect_attempts_max_retries_down_server` had no effect in practice.** It was only consulted
   through `is_server_negative_cached()`, which depended on the inverted `is_down()` from #12846. In
   many cases it was either never applied or applied when it shouldn't have been.
5. **`connect_attempts_max_retries_down_server == 0` blocked the probe.** A separate code path in
   `do_http_server_open()` refused to connect at all when the target was negative-cached and the
   config was zero — preventing the SUSPECT-state probe that the fail window was designed to
   allow.
6. **Retry config range allowed overflow.** `[0-255]` with `uint8_t` threshold = `max_retries + 1`
   wraps to 0.
7. **Config name didn't match its actual effect.** The "down_server" config never applied to
   DOWN-state origins (DOWN-state retries are now hardcoded to 0); it controls the probe budget for
   SUSPECT-state origins. Renamed to `proxy.config.http.connect_attempts_max_retries_suspect_server`
   to align with `HostDBInfo::State::SUSPECT`. The previous
   `proxy.config.http.connect_attempts_max_retries_down_server` record is kept and marked
   deprecated; if only it is set, its value is mirrored forward into the new record with a warning,
   and if both are set the new record wins. The legacy
   `TS_CONFIG_HTTP_CONNECT_ATTEMPTS_MAX_RETRIES_DOWN_SERVER` plugin enum continues to work — it
   aliases to the same internal field as the new `_SUSPECT_SERVER` enum.

**Behavior change.**

- Retry budget is now driven by HostDB state (via the new helper
  `HttpTransact::origin_server_connect_attempts_max_retries`):
  - `UP`      → `connect_attempts_max_retries`
  - `DOWN`    → 0 (bail out immediately)
  - `SUSPECT` → `connect_attempts_max_retries_suspect_server`
- `mark_host_failure` uses `max_retries + 1` as the attempt budget and `ts_clock::now()` as the
  failure time.
- RR switch and per-host retry are separated: if RR has no selectable alternate, the SM stays on
  the current target and keeps retrying up to the per-host budget rather than giving up.
- The `do_http_server_open()` early-bail on "negative cached + down_server=0" is removed — DOWN vs
  SUSPECT is now the single source of truth.
- `proxy.config.http.connect_attempts_max_retries`, `…_max_retries_suspect_server`, and
  `…_rr_retries` clamped to `[0-254]` to avoid the `uint8_t` overflow when adding 1 for the
  total-attempt count.
- A new warning is logged at config reconfigure when
  `connect_attempts_max_retries_suspect_server == 0` and `connect_attempts_rr_retries > 0`, because
  that combination prevents any probe of a recovering (SUSPECT) origin.
- `proxy.config.http.connect_attempts_max_retries_down_server` is now a deprecated alias of
  `…_suspect_server`. The struct field
  `OverridableHttpConfigParams::connect_attempts_max_retries_down_server` was removed; both
  records.yaml entries and both `TSOverridableConfigKey` enum values write into a single field, so
  existing operator configs and plugins keep working while the codebase has one source of truth.


Config	Default
`connect_attempts_max_retries`	3
`connect_attempts_max_retries_suspect_server`	1
`connect_attempts_rr_retries`	3
`down_server.cache_time` (`fail_window`)	60s

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Connect Attempts Bugs and Fixes #13169

Report of connect attempts bugs

Net effect between 10.1.x and later

Worked examples

Example 1 — Single-address origin goes offline

Example 2 — Round-robin with one bad backend (A fails, B/C healthy)

Summary of bug-fix PRs

PR #12846 — Fix `HostDBInfo::is_down` condition

PR #12880 — Check state of HostDBInfo in `select_best_http`

PR #13092 — Clarify `HostDBInfo` state

PR #13102 — Fix connect attempt retries

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Attempt	Txn	`fail_count` after	marked DOWN?
1	1	0	no
2	1	0	no
3	1	0	no
4	1	1 (end of txn)	no
5	2	1	no
6	2	1	no
7	2	1	no
8	2	2 (end of txn)	no
9	3	2	no
10	3	2	no
11	3	2	no
12	3	3 ≥ 3	yes

Attempt	dst_addr (target)	dns_info.active	retry_attempts after	`(retry+1) % 3 == 0`?	Result
1	A	A	1	no	fail
2	A	A	2	no	fail
3	A	A → switched to B (`select_next_rr`); dst_addr untouched	3	yes	fail
4	A (still!)	B	3 ≥ 3, give up	n/a	fail → 502 to client

Attempt	dst_addr (target)	dns_info.active	A.fail_count after	retry_attempts after	RR switch?	Result
1	A	A	1	1	no	fail
2	A	A	2	2	no	fail
3	A	A	3	3	yes → active=B, dst_addr=B	fail
4	B	B	3 (unchanged)	n/a	n/a	success → 200

Connect Attempts Bugs and Fixes #13169

Description

Report of connect attempts bugs

Net effect between 10.1.x and later

Worked examples

Example 1 — Single-address origin goes offline

Example 2 — Round-robin with one bad backend (A fails, B/C healthy)

Summary of bug-fix PRs

PR #12846 — Fix HostDBInfo::is_down condition

PR #12880 — Check state of HostDBInfo in select_best_http

PR #13092 — Clarify HostDBInfo state

PR #13102 — Fix connect attempt retries

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

PR #12846 — Fix `HostDBInfo::is_down` condition

PR #12880 — Check state of HostDBInfo in `select_best_http`

PR #13092 — Clarify `HostDBInfo` state