rpc: bound WebSocket write with wsPingInterval timeout #20923
Without a deadline, wsConnAdapter.encode holds the write mutex indefinitely when the peer stops reading (e.g. server sends StatusMessageTooBig and closes). This blocks the read path from acquiring the write mutex to send its own close-frame response, causing a permanent deadlock that manifests as a 59-minute hang in TestWebsocketLargeCall under -race. Fix: when no explicit write deadline is set, cap the write at wsPingInterval (60 s) — the same window used by the ping loop to detect dead connections. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
This PR aims to prevent WebSocket RPC deadlocks caused by a blocked write holding the websocket library’s internal write mutex when the peer stops reading, by ensuring writes have a timeout even when no write deadline is configured.
Changes:
- Add a fallback write timeout (`wsPingInterval`) in `wsConnAdapter.encode` when no write deadline is set.
- Refactor context/cancel handling so a cancel func is always deferred.
SetWriteDeadline is always called with a non-zero value before encode() (either from the ctx deadline or defaultWriteTimeout=10min), so the dl.IsZero() branch in encode() was dead code and the deadlock fix never ran. Move the wsPingInterval cap to websocketCodec.WriteJSON where the "no deadline" decision is made: wrap ctx with a 60s timeout when the caller provides no deadline, so jsonCodec.WriteJSON uses it instead of the 10-minute fallback. This ensures a blocked write releases the write mutex within wsPingInterval, unblocking the read path's close-frame response. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Thanks for chasing this; the test has been a real pain. But I think this fix is mistargeted and won't address the actual hang. Two issues:

1. The premise, that Every WebSocket write goes through (If we genuinely want to tighten the cap from 10 min → 60 s, the right place is
2. The hang in run 25155735041 doesn't match a stuck write. The dump puts the test goroutine in

What I think is actually happening, and what I've put up in #20932:

That race fits the goroutine dump exactly and explains the flake's long history. Happy to be wrong here: if you can point to a different failing dump where a
Without a deadline, a WebSocket write can block indefinitely when the peer stops reading (e.g. server sends `StatusMessageTooBig` and closes). The blocked write holds the write mutex, preventing the read path from acquiring it to send its own close-frame response. The result is a permanent deadlock that manifests as a 59-minute hang in `TestWebsocketLargeCall` under `-race`.

Fix: in `websocketCodec.WriteJSON`, when the caller provides no context deadline, wrap the context with a `wsPingInterval` (60 s) timeout before passing it to `jsonCodec.WriteJSON`. This ensures `SetWriteDeadline` is set to at most 60 s, matching the dead-connection detection window used by the ping loop.