rpc: cancel in-flight op on fatal read error to fix Call hang by yperbasis · Pull Request #20932 · erigontech/erigon

yperbasis · 2026-04-30T18:24:36Z

Summary

Fix a long-standing race in Client.dispatch (rpc/client.go) where a fatal readErr racing against reqSent could orphan the in-flight op's resp channel, leaving op.wait blocked forever on a channel nobody would ever close.
The readErr handler was passing lastOp to cancelAllRequests, which deliberately preserves the inflight op's resp — directly contradicting the comment immediately above:
```
// A read error is fatal for the connection, and all pending requests
// must be cancelled, including any that might still be considered in-flight.
```
Pass nil instead so the inflight op is cancelled too. Clear lastOp so a concurrent reconnect doesn't re-register an op whose resp is already closed (a later op.resp <- batch would panic). Nil-guard addRequestOp in the reconnect handler accordingly.
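
The effect of the argument change can be sketched with a stripped-down model. This is not the Erigon code: `requestOp`, `cancelAllRequests`, and the hypothetical helper `inflightCancelled` below are simplified stand-ins that assume only the behaviour described above (the op passed as `inflight` keeps its `resp` channel open, every other op has it closed).

```go
package main

import (
	"errors"
	"fmt"
)

// requestOp is a simplified stand-in for the client's pending-request record.
type requestOp struct {
	id   int
	err  error
	resp chan struct{} // closed when the op is cancelled
}

// cancelAllRequests empties the pending map and closes resp on every op
// except inflight, which is assumed to be handled by the caller.
func cancelAllRequests(pending map[int]*requestOp, err error, inflight *requestOp) {
	for id, op := range pending {
		delete(pending, id)
		if op == inflight {
			continue // preserved on purpose: this is what orphaned the op
		}
		op.err = err
		close(op.resp)
	}
}

// inflightCancelled models the readErr branch and reports whether the
// in-flight op's resp channel ended up closed.
func inflightCancelled(passLastOp bool) bool {
	lastOp := &requestOp{id: 2, resp: make(chan struct{})}
	pending := map[int]*requestOp{
		1: {id: 1, resp: make(chan struct{})},
		2: lastOp,
	}
	var inflight *requestOp
	if passLastOp {
		inflight = lastOp // behaviour before the fix
	}
	cancelAllRequests(pending, errors.New("read error"), inflight)

	// Non-blocking check: a receive on a closed channel succeeds immediately.
	select {
	case <-lastOp.resp:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println("passing lastOp cancels in-flight op:", inflightCancelled(true))
	fmt.Println("passing nil cancels in-flight op:   ", inflightCancelled(false))
}
```

In the model, a waiter blocked on `<-lastOp.resp` is only released in the second case, which corresponds to the hang described above.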

Why

Surfaced as the long-running CI flake #16875 — TestWebsocketLargeCall hanging for ~59 minutes until the test timeout fires. From a recent failing run's goroutine dump:

the test goroutine sat in (*requestOp).wait (rpc/client.go:144) for 58 minutes — i.e., past c.send, with the write already returned;
Client.dispatch was idle in select (rpc/client.go:597) for the same 58 minutes;
there were no goroutines inside coder/websocket Read or Write — both had returned.

The only way to reach that state is: readErr fired, dispatch processed it before/instead of reqSent's lastOp = nil, cancelAllRequests skipped the in-flight op (because lastOp was passed as inflightReq), and so op.resp was never closed. The select between c.readErr and c.reqSent (which is buffered, size 1) is a coin flip — that's the flake.
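
The "coin flip" follows from how Go's `select` behaves when more than one case is ready: the runtime picks one of the ready cases pseudo-randomly. A minimal sketch, assuming two buffered channels that merely borrow the names `readErr` and `reqSent` (in the real client the channels are fed by separate goroutines, and `countWins` is a hypothetical helper):

```go
package main

import (
	"errors"
	"fmt"
)

// countWins makes both channels ready before every select and counts
// which case the runtime picks.
func countWins(rounds int) (readErrWins, reqSentWins int) {
	readErr := make(chan error, 1)
	reqSent := make(chan error, 1)
	for i := 0; i < rounds; i++ {
		readErr <- errors.New("read failed")
		reqSent <- nil // the write itself succeeded

		select {
		case <-readErr:
			readErrWins++
			<-reqSent // drain the other channel for the next round
		case <-reqSent:
			reqSentWins++
			<-readErr
		}
	}
	return readErrWins, reqSentWins
}

func main() {
	r, s := countWins(1000)
	fmt.Printf("readErr first: %d, reqSent first: %d\n", r, s)
}
```

Both counters end up non-zero in practice, so any logic that is only correct when `reqSent` is handled first will fail intermittently.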

Test plan

make lint — clean
make erigon integration — builds
go test -race ./rpc/... — full suite passes
go test -race -count=10 -run '^TestWebsocketLargeCall$' ./rpc/ — 10/10 pass
CI race-tests / core-rpc — should be green

🤖 Generated with Claude Code

…n close

The readErr handler in dispatch was passing lastOp to cancelAllRequests, which deliberately preserves the inflight op's resp channel, contradicting the comment right above that says all pending requests must be cancelled. When readErr won the select against reqSent, the in-flight op was orphaned: respWait was emptied, lastOp was reset to nil, but op.resp was never closed, so op.wait blocked forever on a channel that no one would ever close or send on. This was the underlying race behind the long-running flake of TestWebsocketLargeCall (#16875).

Pass nil to cancelAllRequests on a fatal read error so the inflight op is cancelled along with the rest, and clear lastOp so a concurrent reconnect doesn't re-register an op whose resp is already closed (a later send on it would panic). Nil-guard addRequestOp in the reconnected handler for the same reason.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Fixes a race in the RPC websocket client dispatch loop where a fatal read error could leave an in-flight request orphaned, causing Call/BatchCall to hang indefinitely waiting on an unclosed op.resp.

Changes:

Treat fatal readErr as cancelling all pending requests (including the in-flight lastOp) by passing nil to conn.close.
Clear lastOp after fatal read error to prevent reconnect from re-registering an op whose resp channel has already been closed.
Add a nil-guard when re-registering lastOp on reconnect.
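
The reason the second and third changes matter can be shown in isolation: in Go, a send on a closed channel panics, so an op whose `resp` was closed must never be handed back to the response path. The sketch below uses simplified, hypothetical types (`requestOp`, `reRegister`, `sendPanics`) and is not the actual reconnect handler.

```go
package main

import "fmt"

// requestOp is a simplified stand-in; resp is buffered so that a send on an
// open channel does not block in this demo.
type requestOp struct {
	id   int
	resp chan string
}

// reRegister models the reconnect path: lastOp is nil once a fatal read
// error has cancelled it, so there is nothing to put back.
func reRegister(pending map[int]*requestOp, lastOp *requestOp) {
	if lastOp == nil {
		return
	}
	pending[lastOp.id] = lastOp
}

// sendPanics reports whether delivering a response on ch panics.
func sendPanics(ch chan string) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	ch <- "response"
	return false
}

func main() {
	open := make(chan string, 1)
	closed := make(chan string, 1)
	close(closed)

	fmt.Println("send on open resp panics:  ", sendPanics(open))
	fmt.Println("send on closed resp panics:", sendPanics(closed))

	pending := map[int]*requestOp{}
	reRegister(pending, nil) // cleared lastOp: no-op
	fmt.Println("ops re-registered:", len(pending))
}
```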

After the readErr branch clears lastOp, a write error already in flight on the send goroutine can still arrive on c.reqSent. Without a guard, removeRequestOp(nil) would dereference op.ids and panic. Per Copilot review on #20932.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Giulio2002

LGTM — small, targeted fix for a reconnect/read-error edge case: clear the cancelled in-flight op and guard the later re-register/remove paths to avoid reusing a dead response channel.

yperbasis requested review from canepat and lupin012 as code owners April 30, 2026 18:24

yperbasis mentioned this pull request Apr 30, 2026

rpc: bound WebSocket write with wsPingInterval timeout #20923

Merged

yperbasis added the RPC label Apr 30, 2026

yperbasis requested review from anacrolix and Copilot April 30, 2026 18:37

Copilot started reviewing on behalf of yperbasis April 30, 2026 18:38

Copilot AI reviewed Apr 30, 2026

Comment thread rpc/client.go

Comment thread rpc/client.go

yperbasis enabled auto-merge April 30, 2026 19:04

yperbasis added the flaky test label Apr 30, 2026

Merge branch 'main' into fix/test-websocket-large-call-hang

19c8127

Giulio2002 approved these changes May 1, 2026

yperbasis added this pull request to the merge queue May 1, 2026

Merged via the queue into main with commit ce3a410 May 1, 2026
38 checks passed

yperbasis deleted the fix/test-websocket-large-call-hang branch May 1, 2026 22:41

yperbasis merged 3 commits into main from fix/test-websocket-large-call-hang
