Skip to content

fix(wsclient): surface dominant failure mode in fallback bail message#54

Merged
sanity merged 3 commits into
mainfrom
honest-fallback-errors
May 17, 2026
Merged

fix(wsclient): surface dominant failure mode in fallback bail message#54
sanity merged 3 commits into
mainfrom
honest-fallback-errors

Conversation

@sanity
Copy link
Copy Markdown
Contributor

@sanity sanity commented May 16, 2026

Problem

`get_state_with_legacy_fallback` swallows each probe's actual error into a `tracing::debug!` (silent in release) and bails with the generic `"no state found at current contract key or any of N legacy keys"` regardless of cause. This misrepresents transient gateway slowdowns (every probe times out at 180s) as permanent data loss — operators reading the freenet-stdlib mirror's 2026-05-14 Matrix alerts saw the misleading message and burned investigation time before realizing the contract still existed and the GETs were just timing out during peer-routing discovery.

#52 (Ian's recent NotFound handling in `get_state` + `update_state`) addressed the wire-level "NotFound looks like empty state" angle, but the fallback-level error message is still generic. After the gateway-side root cause fix in freenet/freenet-core#4133, this diagnostic safety net catches any future failure mode of similar shape.

Solution

Add `ProbeOutcome` classification (`Empty` / `NotFound` / `Timeout` / `OtherError`) and aggregate per-probe outcomes. The final bail message distinguishes:

  • All timeouts → `"GET timed out on all N probe(s) after Xs each; gateway or peer routing may be unhealthy"` — clear transient signal.
  • All NotFound / empty → original-style `"no state found"` message preserved for the genuine data-loss case.
  • Mixed → per-probe summary so the operator sees exactly what each probe returned.

Classification reads `get_state`'s `bail!` message strings; pinned by `probe_outcome_classifies_get_state_errors` so a future change to those strings without updating this matcher trips the test rather than silently degrading the diagnostic.

No behavior change for success paths or legacy-migration semantics.

Testing

Four new unit tests in `wsclient::tests`:

  • `probe_outcome_classifies_get_state_errors` (pin guard)
  • `format_fallback_failure_all_timeouts_says_timed_out` (regression for the freenet-stdlib mirror false-alarm case)
  • `format_fallback_failure_all_not_found_says_no_state` (preserves legitimate not-found message)
  • `format_fallback_failure_mixed_surfaces_per_probe_detail` (per-probe summary)

All 65 freenet-git tests pass. fmt + clippy clean.

Release

Version bump `0.1.21 → 0.1.22` on top of `f91b516` (NotFound in `update_state`) which landed after the `release: 0.1.21` commit but before publish. Tag + `cargo publish` after merge.

[AI-assisted - Claude]

sanity added 2 commits May 16, 2026 11:30
`get_state_with_legacy_fallback` swallowed each probe's actual error
into a `tracing::debug!` (silent in release) and then bailed with the
generic "no state found at current contract key or any of N legacy
keys" regardless of cause. This historically misrepresented transient
gateway slowdowns (every probe times out at 180s) as permanent data
loss — operators reading the freenet-stdlib mirror's 2026-05-14 Matrix
alerts saw the misleading message and burned investigation time before
realizing the contract still existed and the GETs were just timing out
during peer-routing discovery.

Add `ProbeOutcome` classification (Empty / NotFound / Timeout /
OtherError) and aggregate per-probe outcomes. The final bail message
distinguishes:

- **All timeouts**: "GET timed out on all N probe(s) after Xs each;
  gateway or peer routing may be unhealthy" — clear transient signal.
- **All NotFound/empty**: original-style "no state found" message
  preserved for the genuine data-loss case.
- **Mixed**: per-probe summary so the operator sees exactly what each
  probe returned ("current key X: timeout; legacy key 0 (Y): NotFound").

The classification reads `get_state`'s `bail!` message strings; pinned
by `probe_outcome_classifies_get_state_errors` so a future change to
those strings without updating this matcher trips the test rather than
silently degrading the diagnostic.

Tests cover (a) classification, (b) all-timeout message, (c) all-
not-found message, (d) mixed-outcome detail. No behavior change for
the success paths or for the legacy-migration fallback semantics —
`get_state_with_legacy_fallback`'s contract still treats any per-probe
error as "try the next legacy key, then bail", just with a much more
useful final message.

Version bump 0.1.21 -> 0.1.22 to cover this on top of the f91b516
follow-up (NotFound in update_state) that landed after the 0.1.21
release commit.

[AI-assisted - Claude]
Skeptical-reviewer PR #54 M1: the string-match classifier could
silently regress if get_state's bail! messages were edited without
updating from_get_state_err -- the test re-asserted the literal
strings rather than driving through the actual production sites.

Extract the prefix/suffix into GET_TIMEOUT_PREFIX and
GET_NOT_FOUND_SUFFIX consts, use them from both the bail sites in
get_state and the classifier in from_get_state_err. The test reuses
the same consts so a future edit to either bail site must also update
the const, and any reader can grep one symbol to find all coupled
sites.

[AI-assisted - Claude]
@sanity
Copy link
Copy Markdown
Contributor Author

sanity commented May 16, 2026

Multi-model review (skeptical) raised one medium-severity concern:

M1 (resolved in 31fa2e2): The string-match classifier could silently regress if `get_state`'s `bail!` messages were edited without updating `from_get_state_err`. The test re-asserted the literal strings rather than driving through the actual production sites, so a drift would leave both classifier and test passing while production regressed to the generic 'mixed outcomes' message.

Fix: extracted `GET_TIMEOUT_PREFIX` and `GET_NOT_FOUND_SUFFIX` as `const`s used at both the `bail!` sites in `get_state` and the classifier in `from_get_state_err`. The test now builds error strings using the same `const`s — so a future edit to either bail message must also update the `const`, and the classifier picks up the new prefix automatically.

Other findings deferred:

  • M2 (NotFound substring matches UPDATE's bail too) — `get_state` is the only caller today; if anyone wires UPDATE into a fallback path later, this would need a more specific match. Acceptable for now.
  • L1 (no distinct Transport variant) — `OtherError` carries the underlying message in mixed-outcomes detail; sufficient until WS-disconnect becomes a common enough failure mode to warrant its own bucket.
  • L2 (empty-state semantics) — preserved existing migration-fallback behavior; no change needed.

[AI-assisted - Claude]

…anch

Codex PR #54 review:

P3 #1: the all-timeout message hardcoded "push aborted to avoid
overwriting unknown state", misleading when get_state_with_legacy_
fallback is called from non-push paths (rescue, fetch_repo_state).
Reword to "operation aborted, state on the network is unknown".

P3 #2: an all-OtherError outcome (e.g. WebSocket closed -> every send
GET fails the same way) was falling into the mixed-outcomes branch
even though the failures are uniform, contradicting the dominant-
outcome contract documented in the function's rustdoc. Add a dedicated
all-transport-errors branch that surfaces the first underlying error
verbatim with a count.

Adds format_fallback_failure_all_transport_errors_says_transport
regression test pinning the new branch.

[AI-assisted - Claude]
@sanity
Copy link
Copy Markdown
Contributor Author

sanity commented May 16, 2026

Codex review found two P3 issues, both fixed in d7e9d27:

  • P3 Wire up WS API publish in freenet-git create #1 (resolved): all-timeout message hardcoded "push aborted" wording. Misleading when called from non-push paths (`rescue`, `fetch_repo_state`). Reworded to "operation aborted, state on the network is unknown".
  • P3 Implement git-remote-freenet helper binary #2 (resolved): all-OtherError outcomes (e.g. WebSocket closed → every `send GET` fails the same way) were falling into the mixed-outcomes branch even though the failures are uniform — contradicting the dominant-outcome contract. Added a dedicated transport-error branch that surfaces the first underlying error verbatim.

`format_fallback_failure_all_transport_errors_says_transport` test added to pin the new branch.

All 31 wsclient tests pass.

[AI-assisted - Claude]

@sanity sanity merged commit 536cf43 into main May 17, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant