fix(rescue): handle ContractResponse::NotFound in get_state + update_state (closes GET-side 180s hang)#52

Merged: sanity merged 3 commits into main from fix-not-found-handling on May 16, 2026

@sanity sanity commented May 16, 2026

Summary

  • Closes the dominant rescue-demos failure mode: a ContractResponse::NotFound from the gateway was hitting the _ => /* log + continue */ arms in both wsclient::get_state and wsclient::update_state, deadlocking the recv loop until the 180s rescue / push timeout fired with the misleading "no state found at current contract key …" error.
  • The NotFound variant has existed in the WS API since 2024 (freenet-core PR #2369), but freenet-core PR #4076 migrated the GET path to the task-per-tx driver, which now emits NotFound consistently on retry exhaustion. The bug was latent for ~2 years; it became visible after the production gateway upgraded to v0.2.57 (2026-05-13 22:40 UTC). Confirmed against CI logs: failing runs show exactly 180.5s elapsed before the bail.
  • The fix introduces dispatch_get_response / dispatch_update_response helpers (mirroring the pre-existing dispatch_put_response pattern), each returning a 3-variant outcome: State(bytes) (GET) or Success (UPDATE) for the happy path, NotFound for the regression case, and Continue for unrelated messages.
  • NotFound for our instance_id surfaces as a distinct Err from get_state / update_state, NOT as an Ok(Vec::new()) sentinel. This preserves the legacy-fallback's existing migration semantics (it already swallows Err(_) and tries the next probe) while avoiding an information-losing footgun for the other get_state caller, get_pack, which BLAKE3-verifies returned bytes and would have surfaced NotFound as "pack content hash mismatch: got af1349…" if NotFound were collapsed to empty bytes.
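The dispatch pattern described in the bullets above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `HostResponse` enum here is a simplified stand-in for the real freenet-stdlib types (which carry more variants and fields), keys are plain strings, the receive loop is modeled as an iterator instead of a WebSocket, and the exact error strings are assumptions.

```rust
// Simplified stand-in for the real WS API response type (assumption:
// the real enum lives in freenet-stdlib and has a different shape).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum HostResponse {
    GetResponse { key: String, state: Vec<u8> },
    NotFound { instance_id: String },
    SubscribeResponse { key: String },
}

// Three-way outcome of classifying one inbound message.
#[derive(Debug, PartialEq)]
enum GetDispatch {
    State(Vec<u8>), // matching GetResponse: terminal success
    NotFound,       // matching NotFound: terminal failure
    Continue,       // anything else: keep receiving
}

// Pure function, so it can be unit-tested without a gateway.
fn dispatch_get_response(resp: &HostResponse, wanted: &str) -> GetDispatch {
    match resp {
        HostResponse::GetResponse { key, state } if key == wanted => {
            GetDispatch::State(state.clone())
        }
        HostResponse::NotFound { instance_id } if instance_id == wanted => {
            GetDispatch::NotFound
        }
        // Unrelated keys and unrelated message kinds are skipped.
        _ => GetDispatch::Continue,
    }
}

// Hypothetical receive loop: the iterator models the socket, and running
// out of messages models the outer timeout.
fn get_state(
    inbound: impl IntoIterator<Item = HostResponse>,
    wanted: &str,
) -> Result<Vec<u8>, String> {
    for resp in inbound {
        match dispatch_get_response(&resp, wanted) {
            GetDispatch::State(bytes) => return Ok(bytes),
            GetDispatch::NotFound => {
                return Err(format!("contract {wanted} not found on the network"))
            }
            GetDispatch::Continue => continue,
        }
    }
    Err("timed out waiting for GET response".to_string())
}

fn main() {
    let inbound = vec![
        HostResponse::SubscribeResponse { key: "other".into() },
        HostResponse::NotFound { instance_id: "wanted".into() },
    ];
    // NotFound for the requested key ends the loop immediately with an Err.
    println!("{:?}", get_state(inbound, "wanted"));
}
```

Keeping the classification in a pure function is what makes the "empty state is not NotFound" distinction testable: a `GetResponse` with zero-length state maps to `State(vec![])`, while only a matching `NotFound` maps to `NotFound`.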

What this fix does NOT cover

The most recent rescue-demos failure (2026-05-15 18:51 UTC, run 25935580526) was a PUT-side timeout (put_pack attempt 1/3 failed: timed out waiting for PUT confirmation after 180s), not the GET-NotFound deadlock. After this PR merges, the GET-NotFound class of failures (3 of the last 4 cells) will stop, but the PUT-timeout class may keep firing until that path is investigated separately.

The same NotFound-handling bug exists in other clients (river scripts/add_member.rs, harvest ui/src/gateway/response_handler.rs). These are out-of-repo and will be filed as separate tracking issues.

Reproducer

Against the live nova gateway, before this PR:

$ time freenet-git rescue freenet:ZZZZZZZZZZZZ/no-such-repo \
    --ws-url 'ws://127.0.0.1:7509/v1/contract/command?encodingProtocol=native'
… DEBUG ignoring non-GET response while waiting other=NotFound{…}
error: timed out waiting for GET response after 180s   # 180+s

After this PR (same command):

error: no state found at current contract key or any of 0 legacy keys   # 0.7s
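The "any of 0 legacy keys" wording comes from the legacy-fallback wrapper, which tries the current key and then each legacy key before giving up. A rough sketch of that behavior, assuming a hypothetical signature in which the per-key fetch is passed in as a closure (the real `get_state_with_legacy_fallback` talks to the gateway and its signature is not shown in this PR):

```rust
// Hypothetical shape of the fallback: probe the current key, then each
// legacy key, treating both Err(_) and empty state as "try the next one".
fn get_state_with_legacy_fallback<F>(
    current: &str,
    legacy: &[&str],
    mut get_state: F,
) -> Result<Vec<u8>, String>
where
    F: FnMut(&str) -> Result<Vec<u8>, String>,
{
    for key in std::iter::once(current).chain(legacy.iter().copied()) {
        match get_state(key) {
            Ok(bytes) if !bytes.is_empty() => return Ok(bytes),
            // A fast NotFound error lands here, so each probe costs
            // well under a second instead of a full timeout.
            Ok(_) | Err(_) => continue,
        }
    }
    Err(format!(
        "no state found at current contract key or any of {} legacy keys",
        legacy.len()
    ))
}

fn main() {
    // Every probe reports NotFound and there are no legacy keys.
    let result = get_state_with_legacy_fallback("current", &[], |key| {
        Err(format!("contract {key} not found on the network"))
    });
    println!("{:?}", result);

    // The current key is gone, but a legacy key still holds state.
    let migrated = get_state_with_legacy_fallback("current", &["old"], |key| {
        if key == "old" {
            Ok(vec![1, 2, 3])
        } else {
            Err("not found".to_string())
        }
    });
    println!("{:?}", migrated);
}
```

Because the wrapper already treats `Err(_)` as "try the next probe", surfacing NotFound as an error leaves the final message unchanged and only shortens the time to reach it.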

Test plan

  • 14 new unit tests (6 for dispatch_get_response, 8 for dispatch_update_response) covering: matching response → terminal-success, unrelated-key skip, matching NotFound → distinct NotFound outcome (regression guards), unrelated-key NotFound skip, empty-state-vs-not-found disambiguation (prevents the BLAKE3-mismatch footgun), UpdateNotification handling, HostResponse::Ok handling, and unrelated SubscribeResponse skip arm
  • Full workspace test suite passes (cargo test --workspace — 61+15+25+8+8+19+8+6+4 = 154 tests pass)
  • cargo fmt --all --check passes (fixes the fmt-check failure on the first push)
  • End-to-end against the live nova gateway: rescue against a non-existent contract now fast-fails in 0.7s; rescue against the real freenet-stdlib contract still completes when state is present
  • Once merged: publish 0.1.21 to crates.io. The cron mirror-to-freenet.yml workflows in freenet-core and freenet-stdlib, plus rescue-demos.yml in this repo, all cargo install freenet-git --locked with no version pin, so they pick up the new release on the next run automatically — no workflow edits needed.

[AI-assisted - Claude]

sanity added 3 commits May 15, 2026 19:39
Pre-v0.2.56 the host never surfaced `ContractResponse::NotFound` on the
WS API for client-initiated GETs. PR freenet/freenet-core#4076 (in
v0.2.56) introduced NotFound as the terminal response when the new
task-per-tx GET driver exhausts every peer reachable from the gateway's
ring. This client's `get_state` recv loop only matched `GetResponse` and
`UpdateNotification`; `NotFound` fell into the `_ => /* log + continue */`
fallback, deadlocking the loop until the rescue's outer 180s timeout
fired with a misleading "no state found …" error.

Concrete symptom: the rescue-demos cron started failing every 12h on
the freenet-stdlib and freenet-git history-mode cells after the
production gateway upgraded to v0.2.57 on 2026-05-13; the matrix
dev-channel alert "rescue-demos run failed (trigger: schedule)"
recurred 4× before the cause was nailed down (matrix message dated
2026-05-15 02:37).

Fix: extract a unit-testable `dispatch_get_response` helper that
classifies each inbound `HostResponse` against the requested
`instance_id`, and add a `Terminal(Vec::new())` arm for matching
`NotFound`. Empty-bytes propagation slots cleanly into the existing
`get_state_with_legacy_fallback` semantics — `Ok(empty)` already
means "fall through to the next legacy probe, then bail with 'no
state found at current contract key or any of N legacy keys'", so
the user-visible error is unchanged but the bail happens in <1s
instead of 180s.

Demonstrated locally against the live nova gateway: a rescue against
`freenet:ZZZZZZZZZZZZ/no-such-repo` returns the expected
`error: no state found …` in 0.7s; pre-fix the same command hangs
until the 180s timeout.

Verified by 5 new unit tests for `dispatch_get_response` covering:
the matching `GetResponse` → state path, unrelated-key skip, the
new matching `NotFound` → terminal-empty path, unrelated-key NotFound
skip, and the general unrelated `SubscribeResponse` skip arm.

[AI-assisted - Claude]
…t error

Review feedback on PR #52 from four parallel reviewers (Codex,
code-first, skeptical, big-picture) converged on two issues with the
initial NotFound fix:

1. Conflating NotFound with `Ok(Vec::new())` was information-losing
   for callers other than `get_state_with_legacy_fallback`. In
   particular `get_pack` BLAKE3-verifies returned bytes against an
   expected pack hash; an empty payload would surface as a misleading
   "pack content hash mismatch: got af1349…" error instead of the
   actual "contract not found" cause.

2. `update_state` has the structurally identical `_ => /* ignore */`
   catch-all that swallows NotFound. freenet-core's task-per-tx
   UPDATE driver emits NotFound on retry exhaustion the same way GET
   does, so push paths deadlock the same way until the outer ws
   timeout fires. Per the "finish-the-fix" rule, the parallel bug
   belongs in this PR.
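Point 1 can be illustrated with a small sketch of a hash-verifying caller. Everything here is hypothetical: `toy_digest` is a 64-bit FNV-1a checksum standing in for BLAKE3 so the example needs no external crates, and this `get_pack` takes the fetch result as an argument instead of calling the gateway.

```rust
// Toy FNV-1a digest; a stand-in for BLAKE3, not the real verification.
fn toy_digest(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

// Hypothetical caller that verifies fetched bytes against an expected digest.
fn get_pack(fetched: Result<Vec<u8>, String>, expected: u64) -> Result<Vec<u8>, String> {
    // A distinct Err from the fetch propagates unchanged.
    let bytes = fetched?;
    let got = toy_digest(&bytes);
    if got != expected {
        return Err(format!("pack content hash mismatch: got {got:016x}"));
    }
    Ok(bytes)
}

fn main() {
    let expected = toy_digest(b"real pack bytes");

    // Sentinel design: NotFound collapsed to empty bytes. The caller hashes
    // the empty payload and reports a mismatch, hiding the real cause.
    let sentinel = get_pack(Ok(Vec::new()), expected);
    println!("{:?}", sentinel);

    // Distinct-error design: the real cause reaches the user.
    let distinct = get_pack(Err("contract not found on the network".to_string()), expected);
    println!("{:?}", distinct);
}
```

With the sentinel, the reported digest is simply the hash of empty input, which is consistent with the "got af1349…" prefix quoted above (the BLAKE3 hash of the empty string begins with those hex digits).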

This commit:
- Splits `GetDispatch::Terminal(Vec<u8>)` into `State(Vec<u8>)` +
  `NotFound`; `get_state` now `bail!`s on NotFound with a clear
  "contract … not found on the network" message. The pre-existing
  `Err(_)` arm in `get_state_with_legacy_fallback` keeps the legacy
  migration semantics intact, so the only behavior change for the
  fallback caller is a clearer log message.
- Adds `dispatch_update_response` mirroring `dispatch_get_response`,
  with the same NotFound -> bail handling. `update_state` now bails
  on NotFound for the requested instance_id rather than swallowing
  the response and timing out.
- Adds a regression-guard test
  `dispatch_get_response_preserves_empty_state_distinct_from_not_found`
  pinning the design: an actual `GetResponse` with empty state bytes
  must surface as `State(empty)`, not `NotFound` — so a future
  refactor that re-collapses the two paths breaks the test rather
  than silently producing the BLAKE3-mismatch footgun.
- Adds 7 new tests for `dispatch_update_response` mirroring the GET
  coverage matrix.

Also runs `cargo fmt --all` to fix the formatting failure the
fmt-check workflow flagged on the previous commit (a2489e2).

[AI-assisted - Claude]
@sanity sanity changed the title from "fix(rescue): handle ContractResponse::NotFound (closes rescue-demos 180s hang)" to "fix(rescue): handle ContractResponse::NotFound in get_state + update_state (closes GET-side 180s hang)" on May 16, 2026
@sanity sanity merged commit 1ab015e into main May 16, 2026
5 checks passed
@sanity sanity deleted the fix-not-found-handling branch May 16, 2026 13:42