
rpc: bound WebSocket write with wsPingInterval timeout #20923

Merged: yperbasis merged 3 commits into main from lupin012/rpc-ws-write-timeout, May 1, 2026

Conversation

lupin012 (Contributor) commented Apr 30, 2026

Without a deadline, a WebSocket write can block indefinitely when the peer stops reading (e.g. server sends StatusMessageTooBig and closes). The blocked write holds the write mutex, preventing the read path from acquiring it to send its own close-frame response — a permanent deadlock that manifests as a 59-minute hang in TestWebsocketLargeCall under -race.

Fix: in websocketCodec.WriteJSON, when the caller provides no context deadline, wrap the context with a wsPingInterval (60 s) timeout before passing it to jsonCodec.WriteJSON. This ensures SetWriteDeadline is set to at most 60 s, matching the dead-connection detection window used by the ping loop.
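Concretely, the described change has roughly the following shape. This is a minimal sketch, not the actual rpc/websocket.go: the codec layering is assumed from the geth-style design the PR text implies, and the jsonCodec stub merely marks where the real deadline handling lives.

```go
package rpc

import (
	"context"
	"time"
)

// wsPingInterval mirrors the ping loop's dead-connection window (60 s).
const wsPingInterval = 60 * time.Second

// jsonCodec stands in for the real codec; its WriteJSON derives
// SetWriteDeadline from ctx (body elided).
type jsonCodec struct{}

func (c *jsonCodec) WriteJSON(ctx context.Context, v any) error {
	// ... real code sets the write deadline from ctx, falling back to
	// defaultWriteTimeout when ctx carries none ...
	return nil
}

type websocketCodec struct {
	*jsonCodec
}

// WriteJSON caps deadline-less writes at wsPingInterval so a blocked
// write releases the write mutex within the ping loop's window.
func (wc *websocketCodec) WriteJSON(ctx context.Context, v any) error {
	if _, ok := ctx.Deadline(); !ok {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, wsPingInterval)
		defer cancel()
	}
	return wc.jsonCodec.WriteJSON(ctx, v)
}
```

Note that ctx.Deadline's ok result distinguishes "caller set a deadline" from "apply the cap", so explicit caller deadlines, even ones longer than 60 s, still pass through untouched.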

lupin012 marked this pull request as ready for review April 30, 2026 09:30
lupin012 requested a review from canepat as a code owner April 30, 2026 09:30
yperbasis requested a review from Copilot April 30, 2026 16:10
Copilot AI (Contributor) left a comment


Pull request overview

This PR aims to prevent WebSocket RPC deadlocks caused by a blocked write holding the websocket library’s internal write mutex when the peer stops reading, by ensuring writes have a timeout even when no write deadline is configured.

Changes:

  • Add a fallback write timeout (wsPingInterval) in wsConnAdapter.encode when no write deadline is set (see the sketch after this list).
  • Refactor context/cancel handling so a cancel func is always deferred.
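The first bullet describes the revision that the follow-up commit below reworks. A minimal sketch of that approach, assuming coder/websocket underneath (the review comment further down references it); the adapter's fields are invented for illustration and this is not the actual diff:

```go
package rpc

import (
	"context"
	"sync/atomic"
	"time"

	"github.com/coder/websocket"
	"github.com/coder/websocket/wsjson"
)

const wsPingInterval = 60 * time.Second

// wsConnAdapter is trimmed to what the sketch needs; the field names
// are assumptions, not the real struct.
type wsConnAdapter struct {
	conn     *websocket.Conn
	deadline atomic.Value // time.Time stored by SetWriteDeadline
}

func (a *wsConnAdapter) encode(v any) error {
	ctx := context.Background()
	var cancel context.CancelFunc
	if dl, _ := a.deadline.Load().(time.Time); dl.IsZero() {
		// Fallback: no write deadline configured, so cap the write.
		ctx, cancel = context.WithTimeout(ctx, wsPingInterval)
	} else {
		ctx, cancel = context.WithDeadline(ctx, dl)
	}
	defer cancel() // assigned on every path, hence always deferred
	return wsjson.Write(ctx, a.conn, v)
}
```

Assigning cancel on both branches is what lets a single defer cover every path, which is the cancel-handling refactor the second bullet refers to.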


Review thread on rpc/websocket.go (outdated)
lupin012 and others added 2 commits April 30, 2026 20:16
SetWriteDeadline is always called with a non-zero value before encode()
(either from the ctx deadline or defaultWriteTimeout=10min), so the
dl.IsZero() branch in encode() was dead code and the deadlock fix
never ran.

Move the wsPingInterval cap to websocketCodec.WriteJSON where the
"no deadline" decision is made: wrap ctx with a 60s timeout when the
caller provides no deadline, so jsonCodec.WriteJSON uses it instead
of the 10-minute fallback. This ensures a blocked write releases the
write mutex within wsPingInterval, unblocking the read path's
close-frame response.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
yperbasis (Member) commented

Thanks for chasing this — the test has been a real pain. But I think this fix is mistargeted and won't address the actual hang. Two issues:

1. The premise — that wsConnAdapter.encode runs without a deadline — isn't accurate.

Every WebSocket write goes through jsonCodec.WriteJSON (rpc/json.go:249-258), which calls c.conn.SetWriteDeadline(deadline) before invoking encode, falling back to defaultWriteTimeout = 10 * time.Minute (rpc/json.go:42) when the caller's context has no deadline. By the time encode reads a.deadline, it's always non-zero in the WebSocket path, so the dl.IsZero() branch this PR is patching is unreachable from any WebSocket call site. The change is effectively a no-op in production.

(If we genuinely want to tighten the cap from 10 min → 60 s, the right place is defaultWriteTimeout in rpc/json.go:42, not a dead branch in wsConnAdapter.encode.)
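For reference, the path described above has roughly this shape in geth-derived codebases; a simplified sketch with the codec's conn trimmed to one method, not the exact rpc/json.go source:

```go
package rpc

import (
	"context"
	"time"
)

const defaultWriteTimeout = 10 * time.Minute // the rpc/json.go fallback

// deadlineWriter trims the codec's conn to the one method that matters here.
type deadlineWriter interface {
	SetWriteDeadline(t time.Time) error
}

type jsonCodec struct {
	conn deadlineWriter
}

func (c *jsonCodec) WriteJSON(ctx context.Context, v any) error {
	deadline, ok := ctx.Deadline()
	if !ok {
		// No caller deadline: fall back to ten minutes.
		deadline = time.Now().Add(defaultWriteTimeout)
	}
	if err := c.conn.SetWriteDeadline(deadline); err != nil {
		return err
	}
	return c.encode(v)
}

func (c *jsonCodec) encode(v any) error {
	// actual marshalling and write elided
	return nil
}
```

Since ok is false only when the caller's context carries no deadline, the deadline handed to the adapter is always non-zero by the time encode runs, which is why the dl.IsZero() branch never fires.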

2. The hang in run 25155735041 doesn't match a stuck write.

The dump puts the test goroutine in (*requestOp).wait (rpc/client.go:144) — which is past c.send, with the write already returned. Client.dispatch is idle in select (rpc/client.go:597). There are zero goroutines in coder/websocket Read or Write — both returned. That isn't consistent with Conn.Write holding the write mutex for 59 minutes; it's consistent with the dispatch loop having processed readErr but failing to close the in-flight op's resp channel.

What I think is actually happening, and what I've put up in #20932:

  • dispatch's readErr handler (rpc/client.go:609-614) calls conn.close(err, lastOp).
  • cancelAllRequests interprets a non-nil inflightReq as "preserve this one for reconnect" and skips closing its resp.
  • The comment right above that call says the opposite: "all pending requests must be cancelled, including any that might still be considered in-flight."
  • The select between c.readErr and the buffered c.reqSent is non-deterministic. When readErr wins, the in-flight op is orphaned: removed from respWait, lastOp reset to nil on the next iteration, op.resp never closed → op.wait blocks forever. (A runnable toy of this race follows the list.)
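A self-contained toy that mimics the shape of this race; every name is illustrative and none of it is the actual rpc/client.go code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type op struct{ resp chan string }

func main() {
	readErr := make(chan error, 1)
	reqSent := make(chan *op, 1) // buffered, like c.reqSent

	inflight := &op{resp: make(chan string)}
	reqSent <- inflight                  // send path: write already succeeded
	readErr <- errors.New("read failed") // read path: connection died

	// Both channels are ready, so this select picks at random, like
	// dispatch's race between c.readErr and the buffered c.reqSent.
	select {
	case err := <-readErr:
		// readErr won. If teardown "preserves" the in-flight op for a
		// reconnect instead of closing inflight.resp, the op is orphaned.
		fmt.Println("readErr first:", err)
	case o := <-reqSent:
		// reqSent won. The op is registered, and a later readErr
		// cancels it correctly; model that by closing resp.
		close(o.resp)
		fmt.Println("reqSent first: op registered and cancelled")
	}

	select {
	case <-inflight.resp:
		fmt.Println("op.wait returns")
	case <-time.After(500 * time.Millisecond):
		fmt.Println("op.resp never closed: op.wait would block forever")
	}
}
```

Because reqSent is buffered, the send completes even before dispatch consumes it, which is exactly what lets readErr win and orphan the op.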

That race fits the goroutine dump exactly and explains the flake's long history.

Happy to be wrong here — if you can point to a different failing dump where a coder/websocket Write is actually parked for the full duration, I'd love to see it. Otherwise, I'd suggest closing this in favor of #20932.

yperbasis added this pull request to the merge queue May 1, 2026
Merged via the queue into main with commit a32e5c7 May 1, 2026
38 checks passed
yperbasis deleted the lupin012/rpc-ws-write-timeout branch May 1, 2026 08:32