fix(stability): guard handleConnected against transport-swap race (#3247) #3248
When the transport's 'connect' event scheduled handleConnected but a parallel disconnect/reconnect path nulled `this.transport` during one of the awaits inside the handler (notifyNodeConnected, channel snapshot), the eventual `await this.sendWantConfigId()` threw "Transport not initialized". The catch block treated that as a fatal post-connect reset, armed the cooldown, emitted disconnect, and called handleDisconnected, which on the very next reconnect produced the same race again, deterministically.

Reporters saw a 3×/min reconnect loop on otherwise-healthy TCP sessions, with the device's config response arriving ~300 ms AFTER MeshMonitor had already declared the socket dead.

Fix:

* Capture `const transportAtConnect = this.transport` at handleConnected entry; bail with a debug log if it's already null.
* In the catch block, compare `this.transport` against the captured reference. If they differ (the transport was swapped/nulled during the handshake), the failing send was against a stale generation, so bail silently. A new connect cycle is already in flight (or the source is being shut down); doing another teardown here would reproduce #3247.
* Genuine transport-layer send failures (e.g. tcpTransport throwing "Not connected to TCP server" on a real socket close) still hit the existing post-reset cooldown path.

Tests: src/server/meshtasticManager.connectRace.test.ts covers three cases: null-at-entry bail, mid-handshake swap bail (the exact #3247 scenario), and the counter-case where a stable-reference send failure still triggers the legitimate teardown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude finished @Yeraze's task

Code Review Complete ✅

I've thoroughly reviewed this PR and it represents an excellent solution to a complex race condition. Here's my detailed feedback:

Todo List:
✅ Code Quality & Best Practices

Excellent implementation with several highlights:
✅ Race Condition Analysis

The fix correctly addresses the core issue in

Root Cause Identified:

    // meshtasticManager.ts:1237-1240
    const transportAtConnect = this.transport;
    if (!transportAtConnect) {
      logger.debug('🟡 [connect-race] handleConnected fired with no transport — skipping handshake (#3247)');
      return;
    }

Race Detection Logic:

    // meshtasticManager.ts:1350-1353
    if (this.transport !== transportAtConnect) {
      logger.debug(`🟡 [connect-race] sendWantConfigId aborted — transport replaced during handshake (${msg}) (#3247)`);
      return;
    }

This elegantly distinguishes between:
✅ Test Coverage

The test suite in
Test quality highlights:
✅ Performance Considerations
✅ Security Assessment
Patch release combining the MeshCore automation suite (auto-announce, auto-responder, timer triggers; #3249), auto-acknowledge automation (#3245), the handleConnected transport-swap race fix that was producing deterministic 3×/min reconnect loops on TCP Meshtastic sources (#3248 / #3247), MeshCore neighbor publicKey input validation (#3240), CodeQL security remediations covering polynomial-ReDoS, log-injection and regex-DoS (#3246), and a translations update (#3208).

Updated CHANGELOG.md (converted [Unreleased] MeshCore-automation entry into a full [4.8.1] release section covering all six commits since 4.8.0), CLAUDE.md version header, and version across all 5 version files (package.json, package-lock.json, helm/meshmonitor/Chart.yaml, desktop/src-tauri/tauri.conf.json, desktop/package.json).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

Fixes the per-minute reconnect loop reported in #3247. On TCP sources, the `Connection status: connected` handler would fire, then within 17–40 ms log `Connection status: disconnected`, followed by `Initial sendWantConfigId failed (Transport not initialized) — treating as transient post-connect reset`. The underlying TCP socket was healthy (the device's config response would arrive ~300 ms after MeshMonitor had already declared the connection dead), so the loop was a pure internal state race. It re-fired deterministically 3×/min, producing ~15 spurious log lines and 3 spurious disconnect/reconnect notifications per minute per affected node.

Root cause
The transport's `'connect'` event scheduled `handleConnected`, which awaits several things (`notifyNodeConnected`, channel snapshot) before reaching `await this.sendWantConfigId()`. During those awaits, a parallel disconnect handler could null `this.transport`. `sendWantConfigId`'s own `if (!this.transport)` guard then threw `Transport not initialized`. The catch block treated that as "transient post-connect reset," armed the cooldown, called `transport?.disconnect()`, emitted a disconnect, and ran `handleDisconnected()`, which on the next reconnect reproduced the same race, deterministically.

The conflation was between two error classes:

* `Transport not initialized` (thrown by `meshtasticManager.ts:1618`): our own state race, transport is fine, new connect already in flight.
* `Not connected to TCP server` (thrown by `tcpTransport.ts:310`): genuine socket-layer failure, existing teardown is correct.

The original `try/catch` treated both the same.

Fix
* Capture `const transportAtConnect = this.transport` at `handleConnected` entry; bail with a debug log if it's already null.
* In the catch block, compare `this.transport` against `transportAtConnect`. If they differ (transport was swapped/nulled during the handshake), bail silently: no cooldown, no disconnect emit, no `handleDisconnected` call.

Tests
`src/server/meshtasticManager.connectRace.test.ts` (new) covers:

* `handleConnected` fires with `transport` already null → silent bail, no side effects.
* `handleConnected` fires with `transport` set, but a mid-await disconnect nulls it before `sendWantConfigId` runs → silent bail (the exact #3247 scenario).
* `handleConnected` with stable transport reference, `send()` throws → existing teardown + cooldown still fires.

All 3 pass. Full server suite: 3089 pass / 3 fail; the 3 failures are pre-existing on `main` (`mqttBrokerManager.test.ts` zero-hop encode) and unrelated to this fix, verified by stashing the change and re-running.

Test plan
Fixes #3247
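
For readers who want the shape of the guard without opening the diff, here is a minimal, self-contained sketch of the capture-and-compare pattern described under Fix. It is not the MeshMonitor code: `ManagerSketch`, `Transport`, `HandshakeOutcome`, `preSendWork`, and `teardownCount` are hypothetical stand-ins, and the real handler's cooldown, event emission, and logging are reduced to a counter and a return value.

```typescript
// Illustrative only: a simplified stand-in for the manager, not the actual implementation.
interface Transport {
  send(payload: string): Promise<void>;
}

type HandshakeOutcome = 'ok' | 'skipped-no-transport' | 'stale-generation' | 'torn-down';

class ManagerSketch {
  transport: Transport | null = null;
  teardownCount = 0; // how many times the simplified teardown path ran

  // Stand-in for the awaits that precede the send (notifyNodeConnected, channel snapshot).
  private readonly preSendWork: () => Promise<void>;

  constructor(preSendWork: () => Promise<void>) {
    this.preSendWork = preSendWork;
  }

  private async sendWantConfigId(): Promise<void> {
    if (!this.transport) throw new Error('Transport not initialized');
    await this.transport.send('want_config_id');
  }

  async handleConnected(): Promise<HandshakeOutcome> {
    // Capture the transport generation this handshake belongs to.
    const transportAtConnect = this.transport;
    if (!transportAtConnect) return 'skipped-no-transport';

    try {
      await this.preSendWork();
      await this.sendWantConfigId();
      return 'ok';
    } catch {
      // The reference changed while we were awaiting: the failure belongs to a
      // stale generation, so skip the teardown entirely.
      if (this.transport !== transportAtConnect) return 'stale-generation';

      // Same reference: treat it as a genuine send failure and tear down.
      this.teardownCount += 1;
      this.transport = null;
      return 'torn-down';
    }
  }
}
```

The comparison is by object identity, so it also covers the case where the transport was replaced by a new instance rather than nulled. A `'stale-generation'` result leaves `teardownCount` untouched, which mirrors the "no cooldown, no disconnect emit" behaviour listed above.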
🤖 Generated with Claude Code