Stop the WSConn :not_connected retry storm in NodeProxy #11
Merged
Conversation
WSConn.send_request/3 returns {:error, :not_connected} both when the
underlying WSConn process is dead and when it is still alive but its
async TCP/TLS/WS handshake has not finished within the 500 ms
Globals.await budget.
Since commit ae7d7a7, NodeProxy.send_request/6 has evicted the conn and
asked ensure_connections to spawn a fresh one on every failure. In the
'still handshaking' case that fresh WSConn was also mid-handshake, so
the next request hit the same path and the warning
Failed to send request to #PID<...>: ... {:error, :not_connected}
followed by
** (RuntimeError) RPC error in ChainImpl.eth_getBlockByNumber(...):
{:error, :disconnect}
repeated forever instead of healing once the in-flight handshake
completed.
Only evict (and trigger ensure_connections) when the WSConn pid is
actually dead. If it is still alive, leave it in the pool. Either its
handshake finishes on its own, or its own ping watchdog tears it down
and the existing :DOWN monitor in NodeProxy cleans up.
Adds a regression test that covers all three branches of the new
handle_failed_send/2 helper.
Co-authored-by: Dominic Letz <dominicletz@users.noreply.github.com>
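For illustration, here is a minimal sketch of the eviction decision this commit describes. It is not the PR's actual code: the pool shape (a key => pid map) and the {:keep, _}/{:respawn, _} return convention are assumptions made for a self-contained example.

```elixir
# Sketch only: the real helper is RemoteChain.NodeProxy.handle_failed_send/2.
defmodule EvictionSketch do
  def handle_failed_send(connections, pid) do
    if Process.alive?(pid) do
      # Alive but presumably mid-handshake: keep it in the pool. The
      # handshake either completes on its own, or the conn's ping watchdog
      # kills it and NodeProxy's :DOWN monitor prunes the pool.
      {:keep, connections}
    else
      # Actually dead: evict now and signal that ensure_connections
      # should spawn a replacement.
      {:respawn, Map.reject(connections, fn {_key, p} -> p == pid end)}
    end
  end
end
```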
Code Review
This pull request introduces a more resilient connection management strategy in RemoteChain.NodeProxy by preventing the eager eviction of connections that are still alive but mid-handshake. By checking Process.alive?/1 before removal, the system avoids a 'retry storm' of constant connection respawning. Comprehensive regression tests were added to verify this behavior for both standard and fallback connections. I have no feedback to provide.
Same root cause as the previous commit -- WSConn.send_request/3
returning {:error, :not_connected} for an alive but not-yet-connected
WSConn -- but a complementary symptom:
00:15:45.880 [warning] Failed to send request to #PID<0.109448870.0>: ... {:error, :not_connected}
00:15:45.881 [info] Awaiting undefined key: {RemoteChain.WSConn, #PID<0.109449407.0>}
** (RuntimeError) RPC error in Chains.Moonbeam.eth_getBlockByNumber(...): {:error, :disconnect}
The two log lines reference *different* WSConn pids 1 ms apart: the
first call to NodeProxy.handle_call({:rpc, ...}) randomly picked one
mid-handshake conn and timed out, then RPCCache.rpc/3's retry randomly
picked another mid-handshake conn and timed out again.
NodeProxy.handle_call({:rpc, ...}) used to do
Enum.random(Map.values(state.connections)) without any readiness
filter. Whenever the pool contained one or more handshaking WSConns
(right after start-up, after a fallback was added, after a slow
upstream rejected the previous conn), every random pick paid the full
500 ms Globals.await budget and returned :not_connected, and the
warning kept repeating instead of healing.
Fix:
* WSConn.ready?/1 -- non-blocking, side-effect-free readiness check
based on Globals.get({WSConn, pid}). Unlike send_request/3 it
never registers a Globals waiter, so it produces no
'Awaiting undefined key' / zombie-timeout log noise.
* NodeProxy.pick_connection/1 -- prefers ready WSConns; only falls
back to the full pool when none are ready (preserving the
empty-pool crash so callers still see a well-defined
:disconnect rather than a silent hang).
Adds three regression tests covering pick_connection/1's preference
logic, its all-handshaking fallback, and WSConn.ready?/1 itself
(including a refute_receive proving it does not register a waiter).
Co-authored-by: Dominic Letz <dominicletz@users.noreply.github.com>
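A minimal sketch of the two helpers this commit describes. The Globals key {RemoteChain.WSConn, pid} comes from the log line above; the module wrapper and the nil-on-missing return of Globals.get/1 are assumptions, not the PR's actual code.

```elixir
defmodule ReadinessSketch do
  # Non-blocking probe: a conn counts as ready once handle_connect/2 has
  # registered it in Globals. Globals.get/1, unlike Globals.await/2,
  # registers no waiter, so probing many handshaking pids stays silent.
  def ready?(pid), do: Globals.get({RemoteChain.WSConn, pid}) != nil

  # Prefer ready conns; fall back to the full pool only when none are
  # ready. Enum.random/1 raises on an empty list, which preserves the
  # empty-pool crash so callers still see a well-defined :disconnect.
  def pick_connection(connections) do
    pids = Map.values(connections)

    case Enum.filter(pids, &ready?/1) do
      [] -> Enum.random(pids)
      ready -> Enum.random(ready)
    end
  end
end
```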
Fixes two related symptoms with the same root cause: a WSConn whose async TCP/TLS/WS handshake hasn't completed yet returns {:error, :not_connected} from RemoteChain.WSConn.send_request/3, and NodeProxy reacts to that as if the connection were dead, perpetuating the failure instead of letting the in-flight handshake finish.

Symptoms
Symptom 1 — re-spawn storm

Symptom 2 — request-routing churn

Two distinct WSConn pids appear in two consecutive log lines: one had WSConn.send_request/3 time out, then RPCCache.rpc/3's retry picked a second still-handshaking WSConn from the pool and timed out again.

In both cases the messages keep repeating without healing.
Root cause
RemoteChain.WSConn.send_request/3 returns {:error, :not_connected} in two distinct situations:

1. The underlying WSConn process is dead (it never completed handle_connect).
2. The process is alive but its handshake hasn't finished within the Globals.await({WSConn, pid}, 500) budget. This is normal during the first few hundred ms after WSConn.start/3, or any time the upstream RPC node is slow to accept the connection.

Two pieces of NodeProxy conflated case (2) with case (1):

1. NodeProxy.send_request/6 treated every failed send as a dead connection: it replied {:error, :disconnect}, evicted the conn, and asked ensure_connections to spawn a fresh WSConn. The fresh WSConn was also mid-handshake → next request hits the same path → the loop never heals.
2. NodeProxy.handle_call({:rpc, ...}) picked a connection blindly with Enum.random(Map.values(state.connections)), with no readiness filter. When the pool contained any still-handshaking conns (right after start-up, after a fallback was added, after a slow upstream rejected the previous conn), every random pick paid the full 500 ms Globals.await budget and failed.

Fix
Two complementary changes — both in this PR — that target the same root cause:
* NodeProxy.handle_failed_send/2 (commit 1) — only evict (and re-arm ensure_connections) when the WSConn pid is actually dead (Process.alive?/1 returns false). If it's still alive, leave it in the pool: either its handshake completes on its own and subsequent requests succeed, or its own ping watchdog (WSConn.handle_info(:ping, _)) tears it down and the existing :DOWN monitor in NodeProxy cleans the pool up.
* NodeProxy.pick_connection/1 (commit 2) — prefer WSConns whose handshake has already completed. Falls back to the full pool only when none are ready, preserving the empty-pool behaviour so callers still see a well-defined {:error, :disconnect} rather than a silent hang.

This relies on a new helper, RemoteChain.WSConn.ready?/1, a non-blocking, side-effect-free check based on Globals.get({WSConn, pid}). Unlike WSConn.send_request/3 it does not register a waiter in Globals, so it produces no "Awaiting undefined key" or zombie-timeout log noise even when called for many handshaking pids.

Both decisions are extracted to public helpers (@doc false) so they can be unit-tested without standing up real websockets.

Tests
test/remote_chain/node_proxy_test.exs covers seven cases across three groups:

* describe "handle_failed_send/2": all three branches of the helper. A dead conn is evicted and the caller gets the {:error, :disconnect} fallback, for both standard and fallback_url connections; an alive conn is left in the pool.
* describe "pick_connection/1": its preference for ready conns, and its fallback to the full pool when none are ready.
* describe "WSConn.ready?/1": returns false for an alive but non-registered (still handshaking) pid, with a refute_receive proving the call does not register a Globals waiter, and returns true once handle_connect/2 has registered the conn.
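To make the refute_receive point concrete, here is a hedged sketch of what the ready?/1 no-waiter case could look like (the real cases live in test/remote_chain/node_proxy_test.exs; the setup here is illustrative):

```elixir
defmodule WSConnReadySketchTest do
  use ExUnit.Case, async: true

  test "ready?/1 is false for a handshaking pid and registers no waiter" do
    # Stand-in for a WSConn that never finishes its handshake: alive,
    # but never registered in Globals by handle_connect/2.
    pid = spawn(fn -> Process.sleep(:infinity) end)

    refute RemoteChain.WSConn.ready?(pid)

    # send_request/3 would have parked this process as a Globals waiter;
    # ready?/1 must not, so no stray reply ever reaches our mailbox.
    refute_receive _, 100
  end
end
```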
UndefinedFunctionErrorforWSConn.ready?/1/NodeProxy.handle_failed_send/2), and pass with the fix: