fix(rescue): handle ContractResponse::NotFound in get_state + update_state (closes GET-side 180s hang)#52
Merged
Merged
Conversation
Pre-v0.2.56 the host never surfaced `ContractResponse::NotFound` on the WS API for client-initiated GETs. PR freenet/freenet-core#4076 (in v0.2.56) introduced NotFound as the terminal response when the new task-per-tx GET driver exhausts every peer reachable from the gateway's ring. This client's `get_state` recv loop only matched `GetResponse` and `UpdateNotification`; `NotFound` fell into the `_ => /* log + continue */` fallback, deadlocking the loop until the rescue's outer 180s timeout fired with a misleading "no state found …" error. Concrete symptom: the rescue-demos cron started failing every 12h on the freenet-stdlib and freenet-git history-mode cells after the production gateway upgraded to v0.2.57 on 2026-05-13; the matrix dev-channel alert "rescue-demos run failed (trigger: schedule)" recurred 4× before the cause was nailed down (matrix message dated 2026-05-15 02:37). Fix: extract a unit-testable `dispatch_get_response` helper that classifies each inbound `HostResponse` against the requested `instance_id`, and add a `Terminal(Vec::new())` arm for matching `NotFound`. Empty-bytes propagation slots cleanly into the existing `get_state_with_legacy_fallback` semantics — `Ok(empty)` already means "fall through to the next legacy probe, then bail with 'no state found at current contract key or any of N legacy keys'", so the user-visible error is unchanged but the bail happens in <1s instead of 180s. Demonstrated locally against the live nova gateway: a rescue against `freenet:ZZZZZZZZZZZZ/no-such-repo` returns the expected `error: no state found …` in 0.7s; pre-fix the same command hangs until the 180s timeout. Verified by 5 new unit tests for `dispatch_get_response` covering: the matching `GetResponse` → state path, unrelated-key skip, the new matching `NotFound` → terminal-empty path, unrelated-key NotFound skip, and the general unrelated `SubscribeResponse` skip arm. [AI-assisted - Claude]
[AI-assisted - Claude]
…t error Review feedback on PR #52 from four parallel reviewers (Codex, code-first, skeptical, big-picture) converged on two issues with the initial NotFound fix: 1. Conflating NotFound with `Ok(Vec::new())` was information-losing for callers other than `get_state_with_legacy_fallback`. In particular `get_pack` BLAKE3-verifies returned bytes against an expected pack hash; an empty payload would surface as a misleading "pack content hash mismatch: got af1349…" error instead of the actual "contract not found" cause. 2. `update_state` has the structurally identical `_ => /* ignore */` catch-all that swallows NotFound. freenet-core's task-per-tx UPDATE driver emits NotFound on retry exhaustion the same way GET does, so push paths deadlock the same way until the outer ws timeout fires. Per the "finish-the-fix" rule, the parallel bug belongs in this PR. This commit: - Splits `GetDispatch::Terminal(Vec<u8>)` into `State(Vec<u8>)` + `NotFound`; `get_state` now `bail!`s on NotFound with a clear "contract … not found on the network" message. The pre-existing `Err(_)` arm in `get_state_with_legacy_fallback` keeps the legacy migration semantics intact, so the only behavior change for the fallback caller is a clearer log message. - Adds `dispatch_update_response` mirroring `dispatch_get_response`, with the same NotFound -> bail handling. `update_state` now bails on NotFound for the requested instance_id rather than swallowing the response and timing out. - Adds a regression-guard test `dispatch_get_response_preserves_empty_state_distinct_from_not_found` pinning the design: an actual `GetResponse` with empty state bytes must surface as `State(empty)`, not `NotFound` — so a future refactor that re-collapses the two paths breaks the test rather than silently producing the BLAKE3-mismatch footgun. - Adds 7 new tests for `dispatch_update_response` mirroring the GET coverage matrix. Also runs `cargo fmt --all` to fix the formatting failure the fmt-check workflow flagged on the previous commit (a2489e2). [AI-assisted - Claude]
This was referenced May 16, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
rescue-demosfailure mode: aContractResponse::NotFoundfrom the gateway was hitting the_ => /* log + continue */arms in bothwsclient::get_stateandwsclient::update_state, deadlocking the recv loop until the 180s rescue / push timeout fired with the misleading "no state found at current contract key …" error.dispatch_get_response/dispatch_update_responsehelpers (mirroring the pre-existingdispatch_put_responsepattern) and a 3-variant outcome:State(bytes)/Successfor happy-path,NotFoundfor the regression case,Continuefor unrelated messages.NotFoundfor ourinstance_idsurfaces as a distinctErrfromget_state/update_state, NOT as anOk(Vec::new())sentinel. This preserves the legacy-fallback's existing migration semantics (it already swallowsErr(_)and tries the next probe) while avoiding an information-losing footgun for the otherget_statecaller,get_pack, which BLAKE3-verifies returned bytes and would have surfaced NotFound as"pack content hash mismatch: got af1349…"if NotFound were collapsed to empty bytes.What this fix does NOT cover
The most recent
rescue-demosfailure (2026-05-15 18:51 UTC, run 25935580526) was a PUT-side timeout (put_pack attempt 1/3 failed: timed out waiting for PUT confirmation after 180s), not the GET-NotFound deadlock. After this PR merges, the GET-NotFound class of failures (3 of the last 4 cells) will stop, but the PUT-timeout class may keep firing until that path is investigated separately.The same NotFound-handling bug exists in other clients (river
scripts/add_member.rs, harvestui/src/gateway/response_handler.rs). These are out-of-repo and will be filed as separate tracking issues.Reproducer
Against the live nova gateway, before this PR:
After this PR (same command):
Test plan
dispatch_get_response, 8 fordispatch_update_response) covering: matching response → terminal-success, unrelated-key skip, matchingNotFound→ distinctNotFoundoutcome (regression guards), unrelated-keyNotFoundskip, empty-state-vs-not-found disambiguation (prevents the BLAKE3-mismatch footgun),UpdateNotificationhandling,HostResponse::Okhandling, and unrelatedSubscribeResponseskip armcargo test --workspace— 61+15+25+8+8+19+8+6+4 = 154 tests pass)cargo fmt --all --checkpasses (fixes the fmt-check failure on the first push)freenet-stdlibcontract still completes when state is presentmirror-to-freenet.ymlworkflows infreenet-coreandfreenet-stdlib, plusrescue-demos.ymlin this repo, allcargo install freenet-git --lockedwith no version pin, so they pick up the new release on the next run automatically — no workflow edits needed.[AI-assisted - Claude]