fix(relay): avoid websocket writes in stall watchdog #697

Merged
tlongwell-block merged 2 commits into main from max/relay-ideal-reconnect-fix
May 21, 2026
@tlongwell-block tlongwell-block commented May 21, 2026

Summary

  • replace the active NIP-01 watchdog probe with passive inbound-idle liveness tracking
  • keep the connection generation guard so stale websocket callbacks cannot mutate/reset the current socket
  • add a deterministic Playwright regression that simulates plugin websocket sends wedging and verifies the watchdog performs no writes while half-open

Why

tauri-plugin-websocket 2.4.2 holds its global ConnectionManager mutex across write.send(...).await. The watchdog probe introduced in #623 writes periodically while idle; during a WARP half-open state that send can park in poll_flush while holding the mutex. Subsequent reconnects cannot register their writer/read loop, so AUTH never reaches JS and the app stays red forever.
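
The wedge described above can be modelled in a few lines. This is a hedged TypeScript sketch of the locking pattern only, not the plugin's actual Rust code: `AsyncMutex`, `runScenario`, and the event strings are hypothetical names invented for illustration. It shows that when a send holds a shared lock across an awaited flush that never completes, a later connect that needs the same lock never gets to register its writer.

```typescript
// Hypothetical model of the wedge; none of these names come from
// tauri-plugin-websocket. The real plugin is written in Rust.

// Minimal async mutex: acquire() resolves with a release function.
class AsyncMutex {
  private tail: Promise<void> = Promise.resolve();

  acquire(): Promise<() => void> {
    let release: () => void = () => {};
    const next = new Promise<void>((resolve) => {
      release = resolve;
    });
    // The caller gets the lock once every earlier holder has released.
    const ready = this.tail.then(() => release);
    this.tail = this.tail.then(() => next);
    return ready;
  }
}

// Runs one send followed by one connect against a shared lock and
// returns the events observed after a short wait.
async function runScenario(flush: Promise<void>): Promise<string[]> {
  const manager = new AsyncMutex();
  const events: string[] = [];

  // Models send: the lock is held across the awaited flush.
  const send = async (): Promise<void> => {
    const release = await manager.acquire();
    try {
      events.push("send: lock held");
      await flush; // on a half-open socket this may never resolve
      events.push("send: flushed");
    } finally {
      release();
    }
  };

  // Models connect: it needs the same lock to register the new writer.
  const connect = async (): Promise<void> => {
    const release = await manager.acquire();
    events.push("connect: writer registered");
    release();
  };

  void send();
  void connect();
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  return events;
}
```

Calling `runScenario(new Promise<void>(() => {}))` models the half-open case: the returned events contain only the send taking the lock. Calling `runScenario(Promise.resolve())` models a healthy socket, where the connect registers its writer right after the flush.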

The fix removes the trigger: liveness checks never write to the suspect socket. We instead record inbound frames (including relay heartbeat pings) and mark the connection stalled only after 60s with no inbound traffic.
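
A minimal sketch of the passive approach, assuming a watchdog object that is polled from a timer. Only `recordInbound` and the 60s threshold come from this PR; the class name, the `check` method, the injected clock, and the `onStall` callback are assumptions made for illustration and may not match the real `relayStallWatchdog` module.

```typescript
// Hypothetical sketch of inbound-idle liveness tracking.
const STALL_AFTER_MS = 60_000; // 60s with no inbound traffic

class PassiveStallWatchdog {
  private lastInboundAt: number;
  private stalled = false;
  private readonly onStall: () => void;
  private readonly now: () => number;

  // The clock is injectable so the behaviour can be tested deterministically.
  constructor(onStall: () => void, now: () => number = () => Date.now()) {
    this.onStall = onStall;
    this.now = now;
    this.lastInboundAt = now();
  }

  // Call for every inbound frame, including relay heartbeat pings.
  recordInbound(): void {
    this.lastInboundAt = this.now();
    this.stalled = false;
  }

  // Call periodically. Reads the clock only; never writes to the socket.
  check(): void {
    if (this.stalled) return; // report each stall once
    if (this.now() - this.lastInboundAt >= STALL_AFTER_MS) {
      this.stalled = true;
      this.onStall();
    }
  }
}
```

Because `check` only compares timestamps, it cannot park in a flush on a dead socket. The trade-off is that detection depends on the relay sending something at least once per window, which the heartbeat pings are assumed to provide.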

Verification

  • pnpm --dir desktop check
  • pnpm --dir desktop typecheck
  • pnpm --dir desktop build
  • node --test desktop/src/shared/api/relayStallWatchdog.test.mjs desktop/src/shared/api/relayReconnectPolicy.test.mjs desktop/src/shared/api/relayConnectionStateEmitter.test.mjs desktop/src/shared/api/relayClientShared.test.mjs
  • pnpm --dir desktop exec playwright test tests/e2e/relay-reconnect.spec.ts --project=smoke
  • pre-commit hook: desktop-check, desktop-tauri-fmt, mobile-check, rust-fmt, web-check
  • pre-push hook: rust-fmt, web-check, desktop-tauri-fmt, desktop-check, web-build, rust-clippy, rust-tests, mobile-check, desktop-build, mobile-test, desktop-tauri-check

@tlongwell-block tlongwell-block requested a review from a team as a code owner May 21, 2026 14:56
tlongwell-block and others added 2 commits May 21, 2026 11:04
The websocket `Channel` callback closure is bound to a specific
connection attempt. After `resetConnection()` swaps the socket, the
old `Channel` can still deliver buffered/in-flight callbacks against
the new socket — most painfully a stale AUTH challenge from the
dead connection getting answered against the live one, or vice
versa, which keeps us in a reconnect loop.

Tag every `Channel<unknown>` with a monotonic `connectionGeneration`
captured at creation. `handleWsMessage` and `handleAuthChallenge`
drop messages whose generation no longer matches the current
connection. The generation bumps in both `connect()` and
`resetConnection()` so disconnects also invalidate in-flight
callbacks.
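
The guard can be sketched as follows. This is an illustrative TypeScript reduction, not the code in this PR: the `RelayConnection` class, the `delivered` array, and the callback shape are assumptions, while `connectionGeneration`, `connect`, `resetConnection`, and `handleWsMessage` reuse the names from the description above.

```typescript
// Hypothetical sketch of a generation guard for stale callbacks.
class RelayConnection {
  private connectionGeneration = 0;
  readonly delivered: string[] = [];

  // Starts a new attempt and returns a callback bound to its generation.
  connect(): (payload: string) => void {
    this.connectionGeneration += 1;
    const generation = this.connectionGeneration;
    return (payload: string) => this.handleWsMessage(generation, payload);
  }

  // Bumping here invalidates callbacks still in flight from the old socket.
  resetConnection(): void {
    this.connectionGeneration += 1;
  }

  private handleWsMessage(generation: number, payload: string): void {
    if (generation !== this.connectionGeneration) return; // stale: drop it
    this.delivered.push(payload);
  }
}
```

A callback obtained before a reset keeps its old generation number, so anything it delivers afterwards is dropped rather than applied to the replacement socket.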

Signed-off-by: Tyler Longwell <109685178+tlongwell-block@users.noreply.github.com>
Co-authored-by: npub1mprnacetjua2xx3p5eddmhxyk6wv929ymm5py8kd2xfxurxahspqqlgyta <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
Co-authored-by: Dawn (sprout agent) <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>

The previous stall watchdog issued a periodic NIP-01 REQ ("are you
still there?") and treated a missing EOSE as a stalled socket. That
write is the trigger for a much worse failure on Warp / VPN-asleep
half-open sockets: the tauri-plugin-websocket `send` command holds
the global connection-manager mutex across the underlying
`poll_flush`, so a probe parked on a dead socket blocks every
subsequent `connect` from registering its writer in the map. The
replacement socket's read loop never starts → no AUTH challenge ever
reaches JS → 8s AUTH timeout → reset → repeat forever. We've
reproduced this as the WARP wedge.

Remove the active probe entirely. Track inbound activity instead:
`recordInbound()` is called from `handleWsMessage` for every frame
(including relay heartbeat pings, which are observable as Channel
messages even when not surfaced as nostr-protocol payloads). After
60s with no inbound frame at all we declare a stall and call back
into the client, which tears down the socket so the existing
reconnect path runs. The watchdog itself performs zero writes,
which means it cannot trigger the plugin wedge it's trying to
detect.

A deterministic Playwright regression in
`desktop/tests/e2e/relay-reconnect.spec.ts` simulates the plugin
symptom directly by hanging `plugin:websocket|send` in the e2e
bridge and verifies the watchdog does not write into the half-open
socket while reconnect proceeds.

Signed-off-by: Tyler Longwell <109685178+tlongwell-block@users.noreply.github.com>
Co-authored-by: npub1mprnacetjua2xx3p5eddmhxyk6wv929ymm5py8kd2xfxurxahspqqlgyta <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
Co-authored-by: Dawn (sprout agent) <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
@tlongwell-block tlongwell-block force-pushed the max/relay-ideal-reconnect-fix branch from 2ee4346 to 74fee72 Compare May 21, 2026 15:05
@tlongwell-block tlongwell-block merged commit 4373a13 into main May 21, 2026
15 checks passed
@tlongwell-block tlongwell-block deleted the max/relay-ideal-reconnect-fix branch May 21, 2026 16:56
tlongwell-block added a commit that referenced this pull request May 21, 2026
Bring pulse-front-back up to date with main prior to opening a PR.

Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
Co-authored-by: Dawn (sprout agent) <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>

* origin/main: (35 commits)
  feat(sprout-agent): auto-fallback to Databricks OAuth (#699)
  fix(relay): avoid websocket writes in stall watchdog (#697)
  feat(sprout-agent): Databricks provider with OAuth 2.0 PKCE auth (#698)
  Add Ubuntu desktop release artifacts (#693)
  chore(deps): update rust crate tokio to v1.52.3 (#658)
  chore(deps): update all non-major dependencies (#650)
  chore(deps): update rust crate sherpa-onnx to v1.13.2 (#657)
  chore(deps): update dependency nostr-tools to v2.23.5 (#681)
  chore(deps): update tanstack-router monorepo (#659)
  chore(deps): update rust crate dashmap to v6.2.1 (#652)
  chore(deps): update rust crate tower-http to v0.6.11 (#647)
  chore(deps): update rust crate reqwest to v0.13.3 (#639)
  chore(deps): update rust crate sherpa-onnx to v1.12.40 (#640)
  chore(deps): update dependency @tanstack/react-query to v5.100.11 (#635)
  fix(deps): update rust crate sha2 to 0.11 (#665)
  fix(deps): update rust crate bzip2 to 0.6 (#661)
  chore(deps): update rust crate uuid to v1.23.1 (#648)
  chore(deps): update rust crate tauri-plugin-dialog to v2.7.1 (#644)
  chore(deps): update tanstack-router monorepo (#649)
  chore(deps): update rust crate tokio to v1.51.3 (#646)
  ...