fix(wsclient): surface dominant failure mode in fallback bail message#54
Conversation
`get_state_with_legacy_fallback` swallowed each probe's actual error
into a `tracing::debug!` (silent in release) and then bailed with the
generic "no state found at current contract key or any of N legacy
keys" regardless of cause. This historically misrepresented transient
gateway slowdowns (every probe times out at 180s) as permanent data
loss — operators reading the freenet-stdlib mirror's 2026-05-14 Matrix
alerts saw the misleading message and burned investigation time before
realizing the contract still existed and the GETs were just timing out
during peer-routing discovery.
Add `ProbeOutcome` classification (Empty / NotFound / Timeout /
OtherError) and aggregate per-probe outcomes. The final bail message
distinguishes:
- **All timeouts**: "GET timed out on all N probe(s) after Xs each;
gateway or peer routing may be unhealthy" — clear transient signal.
- **All NotFound/empty**: original-style "no state found" message
preserved for the genuine data-loss case.
- **Mixed**: per-probe summary so the operator sees exactly what each
probe returned ("current key X: timeout; legacy key 0 (Y): NotFound").
The classification reads `get_state`'s `bail!` message strings; pinned
by `probe_outcome_classifies_get_state_errors` so a future change to
those strings without updating this matcher trips the test rather than
silently degrading the diagnostic.
Tests cover (a) classification, (b) all-timeout message, (c) all-
not-found message, (d) mixed-outcome detail. No behavior change for
the success paths or for the legacy-migration fallback semantics —
`get_state_with_legacy_fallback`'s contract still treats any per-probe
error as "try the next legacy key, then bail", just with a much more
useful final message.
Version bump 0.1.21 -> 0.1.22 to cover this on top of the f91b516
follow-up (NotFound in update_state) that landed after the 0.1.21
release commit.
[AI-assisted - Claude]
Skeptical-reviewer PR #54 M1: the string-match classifier could silently regress if get_state's bail! messages were edited without updating from_get_state_err -- the test re-asserted the literal strings rather than driving through the actual production sites. Extract the prefix/suffix into GET_TIMEOUT_PREFIX and GET_NOT_FOUND_SUFFIX consts, use them from both the bail sites in get_state and the classifier in from_get_state_err. The test reuses the same consts so a future edit to either bail site must also update the const, and any reader can grep one symbol to find all coupled sites. [AI-assisted - Claude]
|
Multi-model review (skeptical) raised one medium-severity concern: M1 (resolved in 31fa2e2): The string-match classifier could silently regress if `get_state`'s `bail!` messages were edited without updating `from_get_state_err`. The test re-asserted the literal strings rather than driving through the actual production sites, so a drift would leave both classifier and test passing while production regressed to the generic 'mixed outcomes' message. Fix: extracted `GET_TIMEOUT_PREFIX` and `GET_NOT_FOUND_SUFFIX` as `const`s used at both the `bail!` sites in `get_state` and the classifier in `from_get_state_err`. The test now builds error strings using the same `const`s — so a future edit to either bail message must also update the `const`, and the classifier picks up the new prefix automatically. Other findings deferred:
[AI-assisted - Claude] |
…anch Codex PR #54 review: P3 #1: the all-timeout message hardcoded "push aborted to avoid overwriting unknown state", misleading when get_state_with_legacy_ fallback is called from non-push paths (rescue, fetch_repo_state). Reword to "operation aborted, state on the network is unknown". P3 #2: an all-OtherError outcome (e.g. WebSocket closed -> every send GET fails the same way) was falling into the mixed-outcomes branch even though the failures are uniform, contradicting the dominant- outcome contract documented in the function's rustdoc. Add a dedicated all-transport-errors branch that surfaces the first underlying error verbatim with a count. Adds format_fallback_failure_all_transport_errors_says_transport regression test pinning the new branch. [AI-assisted - Claude]
|
Codex review found two P3 issues, both fixed in d7e9d27:
`format_fallback_failure_all_transport_errors_says_transport` test added to pin the new branch. All 31 wsclient tests pass. [AI-assisted - Claude] |
Problem
`get_state_with_legacy_fallback` swallows each probe's actual error into a `tracing::debug!` (silent in release) and bails with the generic `"no state found at current contract key or any of N legacy keys"` regardless of cause. This misrepresents transient gateway slowdowns (every probe times out at 180s) as permanent data loss — operators reading the freenet-stdlib mirror's 2026-05-14 Matrix alerts saw the misleading message and burned investigation time before realizing the contract still existed and the GETs were just timing out during peer-routing discovery.
#52 (Ian's recent NotFound handling in `get_state` + `update_state`) addressed the wire-level "NotFound looks like empty state" angle, but the fallback-level error message is still generic. After the gateway-side root cause fix in freenet/freenet-core#4133, this diagnostic safety net catches any future failure mode of similar shape.
Solution
Add `ProbeOutcome` classification (`Empty` / `NotFound` / `Timeout` / `OtherError`) and aggregate per-probe outcomes. The final bail message distinguishes:
Classification reads `get_state`'s `bail!` message strings; pinned by `probe_outcome_classifies_get_state_errors` so a future change to those strings without updating this matcher trips the test rather than silently degrading the diagnostic.
No behavior change for success paths or legacy-migration semantics.
Testing
Four new unit tests in `wsclient::tests`:
All 65 freenet-git tests pass. fmt + clippy clean.
Release
Version bump `0.1.21 → 0.1.22` on top of `f91b516` (NotFound in `update_state`) which landed after the `release: 0.1.21` commit but before publish. Tag + `cargo publish` after merge.
[AI-assisted - Claude]