Skip to content

fix(desktop): actually close relay sockets — plugin:websocket|disconnect does not exist#1481

Merged
wesbillman merged 2 commits into
mainfrom
brain/relay-disconnect-fix
Jul 2, 2026
Merged

fix(desktop): actually close relay sockets — plugin:websocket|disconnect does not exist#1481
wesbillman merged 2 commits into
mainfrom
brain/relay-disconnect-fix

Conversation

@wesbillman

Copy link
Copy Markdown
Collaborator

Problem

Frequent relay disconnects when switching workspaces via the workspace rail (reported by Wes on staging).

Root cause: tauri-plugin-websocket 2.4.2 registers only connect and send commands, but the relay clients invoked plugin:websocket|disconnect in three places to tear down sockets. Every one of those calls rejected with "command not found" and was swallowed by .catch(() => {})no relay socket was ever actually closed in production:

  1. Every workspace switch leaked the previous authenticated relay connection (with live server-side subscriptions).
  2. The workspace-rail unread observer leaked up to one socket per inactive workspace per 30s poll.
  3. Every stall-watchdog reset / reconnect cycle leaked one more.

Observed live on a running app: 92 → 142 established TLS sockets over 65s, including 59 connections to a single relay edge IP. The zombie pile triggered CDN edge resets/throttling (Connection reset without closing handshake, HTTP error: 302 Found on reconnect) — the "frequent relay disconnects" symptom.

Why tests never caught it: the e2e bridge mocked a working plugin:websocket|disconnect handler, so tests exercised a disconnect that doesn't exist in production.

Fix

  • New closeWebSocket() helper that closes the socket the way the plugin's own JS API does: send a Close frame through plugin:websocket|send, which completes the close handshake and drops the connection from the plugin's manager. Fire-and-forget (not awaited) so a hung socket can't block callers via the plugin's global send mutex.
  • Converted all three call sites (relayClientSession.ts ×2, readOnlyRelayClient.ts).
  • Removed the e2e bridge's mock disconnect handler and updated the onboarding e2e assertion to expect the Close frame.
  • Added a regression test that fails if any source file ever invokes plugin:websocket|disconnect again.

Testing

  • Unit tests: 1547/1547 pass; typecheck, biome, file-size checks clean.
  • e2e smoke: 349 passed, 2 failed — both failures (channels.spec.ts:867, scroll-history.spec.ts:289) reproduce identically on a pristine origin/main worktree, i.e. pre-existing flakes unrelated to this change.

…ect does not exist

tauri-plugin-websocket registers only `connect` and `send`, but the relay
clients invoked `plugin:websocket|disconnect` to tear down sockets. Every
call rejected with "command not found" and was swallowed, so no socket was
ever closed: each workspace switch leaked the previous authenticated relay
connection, the workspace-rail unread observer leaked one socket per
inactive workspace per 30s poll, and every reconnect cycle leaked one more.
The zombie pile (observed: 59 sockets to one relay edge IP, growing) then
triggered edge resets/throttling — the "frequent relay disconnects when
switching workspaces" report.

Close the socket the way the plugin's own JS API does: send a Close frame
through `plugin:websocket|send`, which completes the close handshake and
drops the connection from the plugin's manager.

The e2e bridge mocked a working `plugin:websocket|disconnect`, so tests
passed while production silently no-opped. Remove the mock handler and add
a regression test that fails if any source file ever invokes the
nonexistent command again.

Co-authored-by: Brain <21994759fc7a6fa6b965551d35cfd7897d262f2495467f2d78694ddcfa6a5c7e@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Wes <wesbillman@users.noreply.github.com>

@wpfleger96 wpfleger96 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve. Verified the root cause against the crate source (tauri-plugin-websocket 2.4.2 in Cargo.lock): src/lib.rs:291 registers generate_handler![connect, send] only — there is genuinely no disconnect command, so every plugin:websocket|disconnect invoke was rejecting and getting swallowed. Nothing was closing in production.

The Close-frame fix is correct end-to-end:

  • The { type: "Close", data: { code, reason } } payload matches the plugin's WebSocketMessage::Close(CloseFrame { code: u16, reason: String }) deserialization (lib.rs:139-141, 256-258).
  • The plugin drops the socket from its manager map when the close handshake completes (lib.rs:205-207), exactly as the doc comment states.
  • Fire-and-forget (void, unawaited) is well-justified: ConnectionManager is a single Mutex<HashMap> (lib.rs:72) and send holds that lock across its .await (lib.rs:249-261), so awaiting a Close on a wedged socket could stall other sockets' sends behind the global lock.

The e2e bridge change is consistent — __BUZZ_E2E_COMMAND_PAYLOADS__ is populated at e2eBridge.ts:7338 with the {command, payload} shape the onboarding spec now polls for. The regression guard is a nice touch. Two optional nits inline; neither blocks.

* tauri-plugin-websocket 2.4.2 registers only `connect` and `send` — there is
* no `disconnect` command, so invoking one rejects and the socket leaks. Close
* the way the plugin's own JS API does: send a Close frame; the plugin's read
* loop drops the connection when the close handshake completes.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (🟢): The comment says the plugin "drops the connection when the close handshake completes." Worth being precise — the plugin removes the writer entry only when its read loop receives a Message::Close back (lib.rs:205-207), i.e. when the peer echoes our Close. If a relay never echoes it (half-open / dead peer), the entry lingers until the underlying TCP read stream errors out and terminates read.for_each. That's the plugin's own behavior and strictly better than the old code (which closed nothing), but "leak-free" ultimately depends on the peer or TCP layer eventually terminating the read stream, not on our send returning. Not a blocker — just flagging so the comment doesn't over-promise.

type: "Close",
data: { code: 1000, reason },
},
}).then(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (🟢): The rejection handler silently swallows all errors. That's the right default for ConnectionNotFound (socket already gone — nothing to release), but it also hides genuinely unexpected failures (e.g. a malformed payload if the plugin's schema ever changes). Consider a console.debug in the catch so it's greppable during future debugging without adding noise. Optional.

…rejections

Co-authored-by: Brain <21994759fc7a6fa6b965551d35cfd7897d262f2495467f2d78694ddcfa6a5c7e@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Wes <wesbillman@users.noreply.github.com>
@wesbillman wesbillman enabled auto-merge (squash) July 2, 2026 22:46
@wesbillman wesbillman merged commit e70dd1a into main Jul 2, 2026
25 checks passed
@wesbillman wesbillman deleted the brain/relay-disconnect-fix branch July 2, 2026 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants