
rpc: bound WebSocket write with wsPingInterval timeout #20923

Merged: yperbasis merged 3 commits into main from lupin012/rpc-ws-write-timeout, May 1, 2026

Conversation

lupin012 (Contributor) commented Apr 30, 2026

Without a deadline, a WebSocket write can block indefinitely when the peer stops reading (e.g. server sends StatusMessageTooBig and closes). The blocked write holds the write mutex, preventing the read path from acquiring it to send its own close-frame response — a permanent deadlock that manifests as a 59-minute hang in TestWebsocketLargeCall under -race.

Fix: in websocketCodec.WriteJSON, when the caller provides no context deadline, wrap the context with a wsPingInterval (60 s) timeout before passing it to jsonCodec.WriteJSON. This ensures SetWriteDeadline is set to at most 60 s, matching the dead-connection detection window used by the ping loop.
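Concretely, the described change has roughly the following shape. This is a minimal sketch, not the actual rpc/websocket.go: the codec layering is assumed from the geth-style design the PR text implies, and the jsonCodec stub merely marks where the real deadline handling lives.

```go
package rpc

import (
	"context"
	"time"
)

// wsPingInterval mirrors the ping loop's dead-connection window (60 s).
const wsPingInterval = 60 * time.Second

// jsonCodec stands in for the real codec; its WriteJSON derives
// SetWriteDeadline from ctx (body elided).
type jsonCodec struct{}

func (c *jsonCodec) WriteJSON(ctx context.Context, v any) error {
	// ... real code sets the write deadline from ctx, falling back to
	// defaultWriteTimeout when ctx carries none ...
	return nil
}

type websocketCodec struct {
	*jsonCodec
}

// WriteJSON caps deadline-less writes at wsPingInterval so a blocked
// write releases the write mutex within the ping loop's window.
func (wc *websocketCodec) WriteJSON(ctx context.Context, v any) error {
	if _, ok := ctx.Deadline(); !ok {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, wsPingInterval)
		defer cancel()
	}
	return wc.jsonCodec.WriteJSON(ctx, v)
}
```

Note that ctx.Deadline's ok result distinguishes "caller set a deadline" from "apply the cap", so explicit caller deadlines, even ones longer than 60 s, still pass through untouched.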

lupin012 marked this pull request as ready for review April 30, 2026 09:30
lupin012 requested a review from canepat as a code owner April 30, 2026 09:30
yperbasis requested a review from Copilot April 30, 2026 16:10
Copilot AI (Contributor) left a comment


Pull request overview

This PR aims to prevent WebSocket RPC deadlocks caused by a blocked write holding the websocket library’s internal write mutex when the peer stops reading, by ensuring writes have a timeout even when no write deadline is configured.

Changes:

  • Add a fallback write timeout (wsPingInterval) in wsConnAdapter.encode when no write deadline is set (see the sketch after this list).
  • Refactor context/cancel handling so a cancel func is always deferred.
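The first bullet describes the revision that the follow-up commit below reworks. A minimal sketch of that approach, assuming coder/websocket underneath (the review comment further down references it); the adapter's fields are invented for illustration and this is not the actual diff:

```go
package rpc

import (
	"context"
	"sync/atomic"
	"time"

	"github.com/coder/websocket"
	"github.com/coder/websocket/wsjson"
)

const wsPingInterval = 60 * time.Second

// wsConnAdapter is trimmed to what the sketch needs; the field names
// are assumptions, not the real struct.
type wsConnAdapter struct {
	conn     *websocket.Conn
	deadline atomic.Value // time.Time stored by SetWriteDeadline
}

func (a *wsConnAdapter) encode(v any) error {
	ctx := context.Background()
	var cancel context.CancelFunc
	if dl, _ := a.deadline.Load().(time.Time); dl.IsZero() {
		// Fallback: no write deadline configured, so cap the write.
		ctx, cancel = context.WithTimeout(ctx, wsPingInterval)
	} else {
		ctx, cancel = context.WithDeadline(ctx, dl)
	}
	defer cancel() // assigned on every path, hence always deferred
	return wsjson.Write(ctx, a.conn, v)
}
```

Assigning cancel on both branches is what lets a single defer cover every path, which is the cancel-handling refactor the second bullet refers to.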


Review thread on rpc/websocket.go (outdated)
lupin012 and others added 2 commits April 30, 2026 20:16
SetWriteDeadline is always called with a non-zero value before encode()
(either from the ctx deadline or defaultWriteTimeout=10min), so the
dl.IsZero() branch in encode() was dead code and the deadlock fix
never ran.

Move the wsPingInterval cap to websocketCodec.WriteJSON where the
"no deadline" decision is made: wrap ctx with a 60s timeout when the
caller provides no deadline, so jsonCodec.WriteJSON uses it instead
of the 10-minute fallback. This ensures a blocked write releases the
write mutex within wsPingInterval, unblocking the read path's
close-frame response.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
yperbasis (Member) commented

Thanks for chasing this — the test has been a real pain. But I think this fix is mistargeted and won't address the actual hang. Two issues:

1. The premise — that wsConnAdapter.encode runs without a deadline — isn't accurate.

Every WebSocket write goes through jsonCodec.WriteJSON (rpc/json.go:249-258), which calls c.conn.SetWriteDeadline(deadline) before invoking encode, falling back to defaultWriteTimeout = 10 * time.Minute (rpc/json.go:42) when the caller's context has no deadline. By the time encode reads a.deadline, it's always non-zero in the WebSocket path, so the dl.IsZero() branch this PR is patching is unreachable from any WebSocket call site. The change is effectively a no-op in production.

(If we genuinely want to tighten the cap from 10 min → 60 s, the right place is defaultWriteTimeout in rpc/json.go:42, not a dead branch in wsConnAdapter.encode.)
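For reference, the path described above has roughly this shape in geth-derived codebases; a simplified sketch with the codec's conn trimmed to one method, not the exact rpc/json.go source:

```go
package rpc

import (
	"context"
	"time"
)

const defaultWriteTimeout = 10 * time.Minute // the rpc/json.go fallback

// deadlineWriter trims the codec's conn to the one method that matters here.
type deadlineWriter interface {
	SetWriteDeadline(t time.Time) error
}

type jsonCodec struct {
	conn deadlineWriter
}

func (c *jsonCodec) WriteJSON(ctx context.Context, v any) error {
	deadline, ok := ctx.Deadline()
	if !ok {
		// No caller deadline: fall back to ten minutes.
		deadline = time.Now().Add(defaultWriteTimeout)
	}
	if err := c.conn.SetWriteDeadline(deadline); err != nil {
		return err
	}
	return c.encode(v)
}

func (c *jsonCodec) encode(v any) error {
	// actual marshalling and write elided
	return nil
}
```

Since ok is false only when the caller's context carries no deadline, the deadline handed to the adapter is always non-zero by the time encode runs, which is why the dl.IsZero() branch never fires.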

2. The hang in run 25155735041 doesn't match a stuck write.

The dump puts the test goroutine in (*requestOp).wait (rpc/client.go:144) — which is past c.send, with the write already returned. Client.dispatch is idle in select (rpc/client.go:597). There are zero goroutines in coder/websocket Read or Write — both returned. That isn't consistent with Conn.Write holding the write mutex for 59 minutes; it's consistent with the dispatch loop having processed readErr but failing to close the in-flight op's resp channel.

What I think is actually happening, and what I've put up in #20932:

  • dispatch's readErr handler (rpc/client.go:609-614) calls conn.close(err, lastOp).
  • cancelAllRequests interprets a non-nil inflightReq as "preserve this one for reconnect" and skips closing its resp.
  • The comment right above that call says the opposite: "all pending requests must be cancelled, including any that might still be considered in-flight."
  • The select between c.readErr and the buffered c.reqSent is non-deterministic. When readErr wins, the in-flight op is orphaned: removed from respWait, lastOp reset to nil on the next iteration, op.resp never closed → op.wait blocks forever. (A runnable toy of this race follows the list.)
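A self-contained toy that mimics the shape of this race; every name is illustrative and none of it is the actual rpc/client.go code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type op struct{ resp chan string }

func main() {
	readErr := make(chan error, 1)
	reqSent := make(chan *op, 1) // buffered, like c.reqSent

	inflight := &op{resp: make(chan string)}
	reqSent <- inflight                  // send path: write already succeeded
	readErr <- errors.New("read failed") // read path: connection died

	// Both channels are ready, so this select picks at random, like
	// dispatch's race between c.readErr and the buffered c.reqSent.
	select {
	case err := <-readErr:
		// readErr won. If teardown "preserves" the in-flight op for a
		// reconnect instead of closing inflight.resp, the op is orphaned.
		fmt.Println("readErr first:", err)
	case o := <-reqSent:
		// reqSent won. The op is registered, and a later readErr
		// cancels it correctly; model that by closing resp.
		close(o.resp)
		fmt.Println("reqSent first: op registered and cancelled")
	}

	select {
	case <-inflight.resp:
		fmt.Println("op.wait returns")
	case <-time.After(500 * time.Millisecond):
		fmt.Println("op.resp never closed: op.wait would block forever")
	}
}
```

Because reqSent is buffered, the send completes even before dispatch consumes it, which is exactly what lets readErr win and orphan the op.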

That race fits the goroutine dump exactly and explains the flake's long history.

Happy to be wrong here — if you can point to a different failing dump where a coder/websocket Write is actually parked for the full duration, I'd love to see it. Otherwise, I'd suggest closing this in favor of #20932.

yperbasis added this pull request to the merge queue May 1, 2026
Merged via the queue into main with commit a32e5c7 May 1, 2026
38 checks passed
yperbasis deleted the lupin012/rpc-ws-write-timeout branch May 1, 2026 08:32