Skip to content

fix(stability): guard handleConnected against transport-swap race (#3247)#3248

Merged
Yeraze merged 1 commit into
mainfrom
fix/3247-connect-handshake-race
May 28, 2026
Merged

fix(stability): guard handleConnected against transport-swap race (#3247)#3248
Yeraze merged 1 commit into
mainfrom
fix/3247-connect-handshake-race

Conversation

@Yeraze
Copy link
Copy Markdown
Owner

@Yeraze Yeraze commented May 28, 2026

Summary

Fixes the per-minute reconnect loop reported in #3247. On TCP sources, the Connection status: connected handler would fire, then within 17–40 ms log Connection status: disconnected, followed by Initial sendWantConfigId failed (Transport not initialized) — treating as transient post-connect reset. The underlying TCP socket was healthy (the device's config response would arrive ~300 ms after MeshMonitor had already declared the connection dead), so the loop was a pure internal state race — and it re-fired deterministically 3×/min, producing ~15 spurious log lines and 3 spurious disconnect/reconnect notifications per minute per affected node.

Root cause

The transport's 'connect' event scheduled handleConnected, which awaits several things (notifyNodeConnected, channel snapshot) before reaching await this.sendWantConfigId(). During those awaits, a parallel disconnect handler could null this.transport. sendWantConfigId's own if (!this.transport) guard then threw Transport not initialized. The catch block treated that as "transient post-connect reset," armed the cooldown, called transport?.disconnect(), emitted a disconnect, and ran handleDisconnected() — which on the next reconnect reproduced the same race, deterministically.

The conflation was between two error classes:

  • Transport not initialized (thrown by meshtasticManager.ts:1618) — our own state race, transport is fine, new connect already in flight.
  • Not connected to TCP server (thrown by tcpTransport.ts:310) — genuine socket-layer failure, existing teardown is correct.

The original try/catch treated both the same.

Fix

  • Capture const transportAtConnect = this.transport at handleConnected entry; bail with a debug log if it's already null.
  • In the catch block, compare this.transport against transportAtConnect. If they differ (transport was swapped/nulled during the handshake), bail silently — no cooldown, no disconnect emit, no handleDisconnected call.
  • Genuine transport-layer failures (stable reference, real send throw) still hit the existing post-reset cooldown path unchanged.

Tests

src/server/meshtasticManager.connectRace.test.ts (new) covers:

  1. handleConnected fires with transport already null → silent bail, no side effects.
  2. handleConnected fires with transport set, but a mid-await disconnect nulls it before sendWantConfigId runs → silent bail (the exact [BUG] Per-minute reconnect loop: 'Transport not initialized' race tears down healthy TCP sessions #3247 scenario).
  3. Counter-test: handleConnected with stable transport reference, send() throws → existing teardown + cooldown still fires.

All 3 pass. Full server suite: 3089 pass / 3 fail; the 3 failures are pre-existing on main (mqttBrokerManager.test.ts zero-hop encode), unrelated to this fix — verified by stashing the change and re-running.

Test plan

Fixes #3247

🤖 Generated with Claude Code

)

When the transport's 'connect' event scheduled handleConnected but a
parallel disconnect/reconnect path nulled `this.transport` during one
of the awaits inside the handler (notifyNodeConnected, channel
snapshot), the eventual `await this.sendWantConfigId()` threw
"Transport not initialized". The catch block treated that as a fatal
post-connect reset, armed the cooldown, emitted disconnect, and called
handleDisconnected — which on the very next reconnect produced the
same race again, deterministically. Reporters saw a 3×/min reconnect
loop on otherwise-healthy TCP sessions, with the device's config
response arriving ~300 ms AFTER MeshMonitor had already declared the
socket dead.

Fix:
* Capture `const transportAtConnect = this.transport` at handleConnected
  entry; bail with a debug log if it's already null.
* In the catch block, compare `this.transport` against the captured
  reference. If they differ (the transport was swapped/nulled during
  the handshake), the failing send was against a stale generation —
  bail silently. A new connect cycle is already in flight (or the
  source is being shut down); doing another teardown here would
  reproduce #3247.
* Genuine transport-layer send failures (e.g. tcpTransport throwing
  "Not connected to TCP server" on a real socket close) still hit the
  existing post-reset cooldown path.

Tests: src/server/meshtasticManager.connectRace.test.ts covers three
cases — null-at-entry bail, mid-handshake swap bail (the exact #3247
scenario), and the counter-case where a stable-reference send failure
still triggers the legitimate teardown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 28, 2026

Claude finished @Yeraze's task —— View job


Code Review Complete ✅

I've thoroughly reviewed this PR and it represents an excellent solution to a complex race condition. Here's my detailed feedback:

Todo List:

  • Read the modified meshtasticManager.ts file
  • Read the new test file
  • Analyze the race condition fix implementation
  • Review code quality and best practices
  • Check for potential bugs or edge cases
  • Evaluate performance considerations
  • Assess security implications
  • Review test coverage completeness
  • Provide comprehensive feedback

✅ Code Quality & Best Practices

Excellent implementation with several highlights:

  • Clear problem identification: The root cause analysis clearly separates the two error classes (Transport not initialized vs Not connected to TCP server)
  • Surgical fix: The solution is minimal and targeted, introducing only the necessary guards without changing broader logic
  • Defensive programming: The transportAtConnect pattern protects against the exact race scenario
  • Documentation: Comprehensive inline comments explain the race scenario and fix rationale

✅ Race Condition Analysis

The fix correctly addresses the core issue in handleConnected() at lines 1225-1368:

Root Cause Identified:

// meshtasticManager.ts:1237-1240
const transportAtConnect = this.transport;
if (!transportAtConnect) {
  logger.debug('🟡 [connect-race] handleConnected fired with no transport — skipping handshake (#3247)');
  return;
}

Race Detection Logic:

// meshtasticManager.ts:1350-1353
if (this.transport !== transportAtConnect) {
  logger.debug(`🟡 [connect-race] sendWantConfigId aborted — transport replaced during handshake (${msg}) (#3247)`);
  return;
}

This elegantly distinguishes between:

  1. Internal race condition (transport swapped during await chain) → silent bail
  2. Genuine transport failure (stable reference, real send error) → existing teardown

✅ Test Coverage

The test suite in meshtasticManager.connectRace.test.ts is comprehensive:

  1. Early null check (lines 171-189): Verifies handler bails when transport is null at entry
  2. Mid-handshake race (lines 191-236): Tests the exact [BUG] Per-minute reconnect loop: 'Transport not initialized' race tears down healthy TCP sessions #3247 scenario where transport is nulled during notifyNodeConnected
  3. Genuine failure counter-test (lines 238-270): Ensures legitimate send errors still trigger proper teardown

Test quality highlights:

  • Proper mocking of all dependencies
  • Clear scenario descriptions
  • Comprehensive assertions covering both positive and negative cases
  • Tests actual behavioral changes, not just code coverage

✅ Performance Considerations

  • Minimal overhead: Only adds one reference capture and comparison
  • No additional async operations: The fix doesn't introduce new await points
  • Efficient early returns: Bail-out paths avoid unnecessary work
  • Zero impact on happy path: Normal connections proceed unchanged

✅ Security Assessment

  • No security implications: This is a pure stability fix with no auth/data handling changes
  • No new attack vectors: Early returns reduce potential for state corruption
  • Defensive coding: Guards against race conditions that could lead to unstable states

⚠️ Minor Considerations

Edge Case Coverage:

  1. Multiple rapid disconnects: The fix handles the common case but rapid successive disconnect/reconnect cycles could potentially create new edge cases. However, this seems unlikely given the typical TCP connection lifecycle.

  2. Memory references: The transportAtConnect reference is properly scoped and doesn't create memory leaks.

✅ Overall Assessment

This is a high-quality fix that:

  • Solves a specific, well-documented production issue
  • Uses minimal, surgical changes
  • Includes comprehensive test coverage
  • Maintains backward compatibility
  • Follows existing code patterns and conventions

The race condition analysis is thorough, the implementation is clean, and the test coverage gives confidence in the fix. The 3×/min reconnection loop was a significant stability issue, and this solution addresses it without introducing new risks.

Recommendation: Approve and merge

The fix demonstrates excellent debugging skills, clean implementation, and thorough testing. It should resolve the reported stability issues without impacting normal operation.


@Yeraze Yeraze merged commit 54ef737 into main May 28, 2026
19 checks passed
@Yeraze Yeraze deleted the fix/3247-connect-handshake-race branch May 28, 2026 20:07
Yeraze added a commit that referenced this pull request May 28, 2026
Patch release combining the MeshCore automation suite (auto-announce,
auto-responder, timer triggers — #3249), auto-acknowledge automation
(#3245), the handleConnected transport-swap race fix that was producing
deterministic 3×/min reconnect loops on TCP Meshtastic sources (#3248
/ #3247), MeshCore neighbor publicKey input validation (#3240), CodeQL
security remediations covering polynomial-ReDoS, log-injection and
regex-DoS (#3246), and a translations update (#3208).

Updated CHANGELOG.md (converted [Unreleased] MeshCore-automation entry
into a full [4.8.1] release section covering all six commits since
4.8.0), CLAUDE.md version header, and version across all 5 version
files (package.json, package-lock.json, helm/meshmonitor/Chart.yaml,
desktop/src-tauri/tauri.conf.json, desktop/package.json).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Per-minute reconnect loop: 'Transport not initialized' race tears down healthy TCP sessions

1 participant