
rpc: cancel in-flight op on fatal read error to fix Call hang #20932

Merged
yperbasis merged 3 commits into main from fix/test-websocket-large-call-hang on May 1, 2026
@yperbasis
Member

Summary

  • Fix a long-standing race in Client.dispatch (rpc/client.go) where a fatal readErr racing against reqSent could orphan the in-flight op's resp channel, leaving op.wait blocked forever on a channel nobody would ever close.
  • The readErr handler was passing lastOp to cancelAllRequests, which deliberately preserves the inflight op's resp — directly contradicting the comment immediately above:
    // A read error is fatal for the connection, and all pending requests
    // must be cancelled, including any that might still be considered in-flight.
    
    Pass nil instead so the inflight op is cancelled too. Clear lastOp so a concurrent reconnect doesn't re-register an op whose resp is already closed (a later op.resp <- batch would panic). Nil-guard addRequestOp in the reconnect handler accordingly.
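
The change in behaviour can be sketched as follows. This is a minimal illustration, not the code in rpc/client.go: the requestOp fields, the pending map, and the cancelled and newOp helpers are simplified, hypothetical stand-ins. It only shows that passing the in-flight op leaves its resp channel open, while passing nil closes it along with the rest.

```go
package main

import "fmt"

// requestOp is a simplified, hypothetical stand-in for the type in rpc/client.go.
type requestOp struct {
	ids  []string
	resp chan string // closed on cancellation so that waiters unblock
}

// cancelAllRequests drops every pending op and closes its resp channel,
// except for inflight, which is left open. Passing nil cancels everything.
func cancelAllRequests(pending map[string]*requestOp, inflight *requestOp) {
	closed := make(map[*requestOp]bool) // an op may be registered under several ids
	for id, op := range pending {
		delete(pending, id)
		if op == inflight || closed[op] {
			continue
		}
		closed[op] = true
		close(op.resp)
	}
}

// cancelled reports whether op.resp has been closed, without blocking.
// Nothing is ever sent on resp in this sketch, so a successful receive means "closed".
func cancelled(op *requestOp) bool {
	select {
	case _, ok := <-op.resp:
		return !ok
	default:
		return false
	}
}

// newOp builds an op registered under a single id (helper for this sketch only).
func newOp(id string) *requestOp {
	return &requestOp{ids: []string{id}, resp: make(chan string, 1)}
}

func main() {
	// Old behaviour: the in-flight op is passed through and survives.
	a, inflight := newOp("1"), newOp("2")
	cancelAllRequests(map[string]*requestOp{"1": a, "2": inflight}, inflight)
	fmt.Println(cancelled(a), cancelled(inflight))

	// Fixed behaviour: nil means "cancel the in-flight op too".
	b, inflight2 := newOp("3"), newOp("4")
	cancelAllRequests(map[string]*requestOp{"3": b, "4": inflight2}, nil)
	fmt.Println(cancelled(b), cancelled(inflight2))
}
```

In the first call the preserved op stays open, so anything blocked in its wait would keep blocking; in the second call every op is closed.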

Why

Surfaced as the long-running CI flake #16875: TestWebsocketLargeCall hanging for ~59 minutes until the test timeout fires. From a recent failing run's goroutine dump:

  • the test goroutine sat in (*requestOp).wait (rpc/client.go:144) for 58 minutes — i.e., past c.send, with the write already returned;
  • Client.dispatch was idle in select (rpc/client.go:597) for the same 58 minutes;
  • there were no goroutines inside coder/websocket Read or Write — both had returned.

The only way to reach that state is: readErr fired; dispatch processed it before (or instead of) the reqSent case that sets lastOp = nil; cancelAllRequests skipped the in-flight op because lastOp was passed as inflightReq; and so op.resp was never closed. When both c.readErr and c.reqSent (which is buffered, size 1) are ready, the select between them is a coin flip, and that is the flake.
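
The coin flip is a property of Go's select statement: when several cases are ready, one is chosen pseudo-randomly. The sketch below illustrates just that property. The channel names echo the ones above, but the functions are hypothetical, and both channels are buffered here (an assumption of this sketch) so that both are guaranteed ready before the select runs.

```go
package main

import (
	"errors"
	"fmt"
)

// raceOnce makes both channels ready before the select runs and reports
// which case fired. Buffering both channels is an assumption of this sketch.
func raceOnce() string {
	readErr := make(chan error, 1)
	reqSent := make(chan error, 1)
	readErr <- errors.New("read failed")
	reqSent <- nil

	select {
	case <-readErr:
		return "readErr"
	case <-reqSent:
		return "reqSent"
	}
}

// tally runs raceOnce n times and counts how often each case won.
func tally(n int) map[string]int {
	wins := make(map[string]int)
	for i := 0; i < n; i++ {
		wins[raceOnce()]++
	}
	return wins
}

func main() {
	wins := tally(1000)
	fmt.Println("readErr first:", wins["readErr"], "reqSent first:", wins["reqSent"])
}
```

Over many trials both cases win a substantial share of the time, which is consistent with a hang that shows up only intermittently in CI.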

Test plan

  • make lint — clean
  • make erigon integration — builds
  • go test -race ./rpc/... — full suite passes
  • go test -race -count=10 -run '^TestWebsocketLargeCall$' ./rpc/ — 10/10 pass
  • CI race-tests / core-rpc — should be green

🤖 Generated with Claude Code

…n close

The readErr handler in dispatch was passing lastOp to cancelAllRequests,
which deliberately preserves the inflight op's resp channel — contradicting
the comment right above that says all pending requests must be cancelled.
When readErr won the select against reqSent, the in-flight op was orphaned:
respWait was emptied, lastOp was reset to nil, but op.resp was never closed,
so op.wait blocked forever on a channel that no one would ever close or
send on. This was the underlying race behind the long-running flake of
TestWebsocketLargeCall (#16875).

Pass nil to cancelAllRequests on a fatal read error so the inflight op is
cancelled along with the rest, and clear lastOp so a concurrent reconnect
doesn't re-register an op whose resp is already closed (a later send on it
would panic). Nil-guard addRequestOp in the reconnected handler for the
same reason.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Fixes a race in the RPC websocket client dispatch loop where a fatal read error could leave an in-flight request orphaned, causing Call/BatchCall to hang indefinitely waiting on an unclosed op.resp.

Changes:

  • Treat fatal readErr as cancelling all pending requests (including the in-flight lastOp) by passing nil to conn.close.
  • Clear lastOp after fatal read error to prevent reconnect from re-registering an op whose resp channel has already been closed.
  • Add a nil-guard when re-registering lastOp on reconnect.
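
Why re-registering a cancelled op is dangerous can be shown with a small sketch. The types and helpers below (requestOp, trySend, reRegister) are hypothetical simplifications rather than the code in rpc/client.go; they rely only on the Go rule that sending on a closed channel panics.

```go
package main

import "fmt"

// requestOp is a simplified, hypothetical stand-in for the real type.
type requestOp struct {
	ids  []string
	resp chan string
}

// trySend attempts op.resp <- v and reports whether the send panicked.
func trySend(op *requestOp, v string) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	op.resp <- v
	return false
}

// reRegister models the reconnect path: it puts lastOp back into the pending
// set, but only if there still is one (the nil-guard).
func reRegister(pending map[string]*requestOp, lastOp *requestOp) {
	if lastOp == nil {
		return
	}
	for _, id := range lastOp.ids {
		pending[id] = lastOp
	}
}

func main() {
	op := &requestOp{ids: []string{"1"}, resp: make(chan string, 1)}
	close(op.resp) // cancelled by the read-error path
	fmt.Println("send on cancelled op panics:", trySend(op, "batch"))

	var lastOp *requestOp // cleared after the fatal read error
	pending := map[string]*requestOp{}
	reRegister(pending, lastOp)
	fmt.Println("ops re-registered:", len(pending))
}
```

With lastOp cleared, the guarded reRegister is a no-op, so no later response can be routed to the closed channel.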

Comment thread rpc/client.go
After the readErr branch clears lastOp, a write error already in flight
on the send goroutine can still arrive on c.reqSent. Without a guard,
removeRequestOp(nil) would dereference op.ids and panic.

Per Copilot review on #20932.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator

@Giulio2002 Giulio2002 left a comment

LGTM — small, targeted fix for a reconnect/read-error edge case: clear the cancelled in-flight op and guard the later re-register/remove paths to avoid reusing a dead response channel.

@yperbasis yperbasis added this pull request to the merge queue May 1, 2026
Merged via the queue into main with commit ce3a410 May 1, 2026
38 checks passed
@yperbasis yperbasis deleted the fix/test-websocket-large-call-hang branch May 1, 2026 22:41