fix(relay): avoid websocket writes in stall watchdog #697
Merged
The websocket `Channel` callback closure is bound to a specific connection attempt. After `resetConnection()` swaps the socket, the old `Channel` can still deliver buffered/in-flight callbacks against the new socket, most painfully a stale AUTH challenge from the dead connection getting answered against the live one, or vice versa, which keeps us in a reconnect loop.
Tag every `Channel<unknown>` with a monotonic `connectionGeneration` captured at creation. `handleWsMessage` and `handleAuthChallenge` drop messages whose generation no longer matches the current connection. The generation bumps in both `connect()` and `resetConnection()` so disconnects also invalidate in-flight callbacks.
Signed-off-by: Tyler Longwell <109685178+tlongwell-block@users.noreply.github.com>
Co-authored-by: npub1mprnacetjua2xx3p5eddmhxyk6wv929ymm5py8kd2xfxurxahspqqlgyta <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
Co-authored-by: Dawn (sprout agent) <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
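The generation tagging described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class name `GenerationGuardedClient`, the `delivered` array, and the callback shape are hypothetical, and the real signatures of `connect()`, `resetConnection()`, and `handleWsMessage` are not shown on this page.

```typescript
// Sketch only: a client that drops callbacks from superseded connections.
class GenerationGuardedClient {
  // Monotonic counter; bumped on every connect and every reset.
  private connectionGeneration = 0;

  // Hypothetical sink so the effect of the guard is observable.
  readonly delivered: string[] = [];

  // Returns a callback bound to the generation current at creation time,
  // standing in for the per-connection Channel callback.
  connect(): (payload: string) => void {
    this.connectionGeneration += 1;
    const generation = this.connectionGeneration;
    return (payload: string) => this.handleWsMessage(generation, payload);
  }

  // A reset also bumps the generation, so callbacks still in flight
  // from the old connection are invalidated even before a new connect.
  resetConnection(): void {
    this.connectionGeneration += 1;
  }

  private handleWsMessage(generation: number, payload: string): void {
    if (generation !== this.connectionGeneration) {
      return; // stale: this callback belongs to a dead connection
    }
    this.delivered.push(payload);
  }
}
```

With this shape, a callback created before `resetConnection()` becomes a no-op afterwards, so a late AUTH challenge from the old socket cannot be answered against the new one.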
The previous stall watchdog issued a periodic NIP-01 REQ ("are you
still there?") and treated a missing EOSE as a stalled socket. That
write is the trigger for a much worse failure on Warp / VPN-asleep
half-open sockets: the tauri-plugin-websocket `send` command holds
the global connection-manager mutex across the underlying
`poll_flush`, so a probe parked on a dead socket blocks every
subsequent `connect` from registering its writer in the map. The
replacement socket's read loop never starts → no AUTH challenge ever
reaches JS → 8s AUTH timeout → reset → repeat forever. We've
reproduced this as the WARP wedge.
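The wedge described above can be illustrated with a toy model. This is not the plugin's code (the plugin is written in Rust); it is a hypothetical, synchronous simplification in which a single lock guards a writer map, `send` keeps the lock until its flush completes, and `connect` needs the same lock to register a writer. All names here are invented for the illustration.

```typescript
// Toy model only: one lock shared by connect and send.
type SendResult = "sent" | "parked" | "blocked";
type ConnectResult = "registered" | "blocked";

class ToyConnectionManager {
  private lockHeld = false;
  readonly writers = new Map<number, string>();

  // Registering a writer needs the lock briefly.
  connect(id: number): ConnectResult {
    if (this.lockHeld) {
      return "blocked"; // in the real system this would wait indefinitely
    }
    this.lockHeld = true;
    this.writers.set(id, "writer");
    this.lockHeld = false;
    return "registered";
  }

  // A send holds the lock until the flush finishes. On a half-open
  // socket the flush never finishes, so the lock is never released.
  send(id: number, flushCompletes: boolean): SendResult {
    if (this.lockHeld) {
      return "blocked";
    }
    this.lockHeld = true;
    if (!flushCompletes) {
      return "parked"; // lock stays held
    }
    this.lockHeld = false;
    return "sent";
  }
}
```

In this model, one `send(1, false)` on a dead socket is enough to make every later `connect` report `"blocked"`, which mirrors why the replacement socket never gets a writer or a read loop.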
Remove the active probe entirely. Track inbound activity instead:
`recordInbound()` is called from `handleWsMessage` for every frame
(including relay heartbeat pings, which are observable as Channel
messages even when not surfaced as nostr-protocol payloads). After
60s with no inbound frame at all we declare a stall and call back
into the client, which tears down the socket so the existing
reconnect path runs. The watchdog itself performs zero writes,
which means it cannot trigger the plugin wedge it's trying to
detect.
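A write-free watchdog of this kind might look roughly like the sketch below. Only `recordInbound()` and the 60s threshold come from the commit message; the class name, the `check()` method, the explicit timestamps, and the fire-once behavior are assumptions made so the example is deterministic and self-contained.

```typescript
// Sketch only: a stall watchdog that observes inbound traffic and never writes.
class InboundStallWatchdog {
  private lastInboundAt: number;
  private fired = false;
  private readonly onStall: () => void;
  private readonly stallAfterMs: number;

  constructor(onStall: () => void, startedAt: number, stallAfterMs: number = 60_000) {
    this.onStall = onStall;
    this.lastInboundAt = startedAt;
    this.stallAfterMs = stallAfterMs;
  }

  // Called for every inbound frame, heartbeat pings included.
  recordInbound(now: number): void {
    this.lastInboundAt = now;
    this.fired = false;
  }

  // Called periodically (for example from a timer). Returns true only on
  // the call that declares the stall, and invokes onStall exactly once
  // per quiet period. Performs no socket writes.
  check(now: number): boolean {
    if (this.fired || now - this.lastInboundAt < this.stallAfterMs) {
      return false;
    }
    this.fired = true;
    this.onStall();
    return true;
  }
}
```

In a real client, `check` would presumably be driven by an interval timer using the current time, and `onStall` would tear down the socket so the existing reconnect path runs.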
A deterministic Playwright regression in
`desktop/tests/e2e/relay-reconnect.spec.ts` simulates the plugin
symptom directly by hanging `plugin:websocket|send` in the e2e
bridge and verifies the watchdog does not write into the half-open
socket while reconnect proceeds.
Signed-off-by: Tyler Longwell <109685178+tlongwell-block@users.noreply.github.com>
Co-authored-by: npub1mprnacetjua2xx3p5eddmhxyk6wv929ymm5py8kd2xfxurxahspqqlgyta <d8473ee32b973aa31a21a65adddcc4b69cc2a8a4dee8121ecd51926e0cddbc02@sprout-oss.stage.blox.sqprod.co>
Co-authored-by: Dawn (sprout agent) <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
2ee4346 to 74fee72
tlongwell-block added a commit that referenced this pull request on May 21, 2026
Bring pulse-front-back up to date with main prior to opening a PR.
Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
Co-authored-by: Dawn (sprout agent) <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
* origin/main: (35 commits)
  feat(sprout-agent): auto-fallback to Databricks OAuth (#699)
  fix(relay): avoid websocket writes in stall watchdog (#697)
  feat(sprout-agent): Databricks provider with OAuth 2.0 PKCE auth (#698)
  Add Ubuntu desktop release artifacts (#693)
  chore(deps): update rust crate tokio to v1.52.3 (#658)
  chore(deps): update all non-major dependencies (#650)
  chore(deps): update rust crate sherpa-onnx to v1.13.2 (#657)
  chore(deps): update dependency nostr-tools to v2.23.5 (#681)
  chore(deps): update tanstack-router monorepo (#659)
  chore(deps): update rust crate dashmap to v6.2.1 (#652)
  chore(deps): update rust crate tower-http to v0.6.11 (#647)
  chore(deps): update rust crate reqwest to v0.13.3 (#639)
  chore(deps): update rust crate sherpa-onnx to v1.12.40 (#640)
  chore(deps): update dependency @tanstack/react-query to v5.100.11 (#635)
  fix(deps): update rust crate sha2 to 0.11 (#665)
  fix(deps): update rust crate bzip2 to 0.6 (#661)
  chore(deps): update rust crate uuid to v1.23.1 (#648)
  chore(deps): update rust crate tauri-plugin-dialog to v2.7.1 (#644)
  chore(deps): update tanstack-router monorepo (#649)
  chore(deps): update rust crate tokio to v1.51.3 (#646)
  ...
Summary
Why
`tauri-plugin-websocket` 2.4.2 holds its global `ConnectionManager` mutex across `write.send(...).await`. The watchdog probe introduced in #623 writes periodically while idle; during a WARP half-open state that send can park in `poll_flush` while holding the mutex. Subsequent reconnects cannot register their writer/read loop, so AUTH never reaches JS and the app stays red forever.
The fix removes the trigger: liveness checks never write to the suspect socket. We instead record inbound frames (including relay heartbeat pings) and mark the connection stalled only after 60s with no inbound traffic.
Verification
- `pnpm --dir desktop check`
- `pnpm --dir desktop typecheck`
- `pnpm --dir desktop build`
- `node --test desktop/src/shared/api/relayStallWatchdog.test.mjs desktop/src/shared/api/relayReconnectPolicy.test.mjs desktop/src/shared/api/relayConnectionStateEmitter.test.mjs desktop/src/shared/api/relayClientShared.test.mjs`
- `pnpm --dir desktop exec playwright test tests/e2e/relay-reconnect.spec.ts --project=smoke`