fix(stability): guard handleConnected against transport-swap race (#3247) by Yeraze · Pull Request #3248 · Yeraze/meshmonitor

Yeraze · 2026-05-28T19:32:01Z

Summary

Fixes the per-minute reconnect loop reported in #3247. On TCP sources, the Connection status: connected handler would fire, then within 17–40 ms log Connection status: disconnected, followed by Initial sendWantConfigId failed (Transport not initialized) — treating as transient post-connect reset. The underlying TCP socket was healthy (the device's config response would arrive ~300 ms after MeshMonitor had already declared the connection dead), so the loop was a pure internal state race — and it re-fired deterministically 3×/min, producing ~15 spurious log lines and 3 spurious disconnect/reconnect notifications per minute per affected node.

Root cause

The transport's 'connect' event scheduled handleConnected, which awaits several things (notifyNodeConnected, channel snapshot) before reaching await this.sendWantConfigId(). During those awaits, a parallel disconnect handler could null this.transport. sendWantConfigId's own if (!this.transport) guard then threw Transport not initialized. The catch block treated that as "transient post-connect reset," armed the cooldown, called transport?.disconnect(), emitted a disconnect, and ran handleDisconnected() — which on the next reconnect reproduced the same race, deterministically.

The conflation was between two error classes:

Transport not initialized (thrown by meshtasticManager.ts:1618) — our own state race, transport is fine, new connect already in flight.
Not connected to TCP server (thrown by tcpTransport.ts:310) — genuine socket-layer failure, existing teardown is correct.

The original try/catch treated both the same.

Fix

Capture const transportAtConnect = this.transport at handleConnected entry; bail with a debug log if it's already null.
In the catch block, compare this.transport against transportAtConnect. If they differ (transport was swapped/nulled during the handshake), bail silently — no cooldown, no disconnect emit, no handleDisconnected call.
Genuine transport-layer failures (stable reference, real send throw) still hit the existing post-reset cooldown path unchanged.

Tests

src/server/meshtasticManager.connectRace.test.ts (new) covers:

handleConnected fires with transport already null → silent bail, no side effects.
handleConnected fires with transport set, but a mid-await disconnect nulls it before sendWantConfigId runs → silent bail (the exact [BUG] Per-minute reconnect loop: 'Transport not initialized' race tears down healthy TCP sessions #3247 scenario).
Counter-test: handleConnected with stable transport reference, send() throws → existing teardown + cooldown still fires.

All 3 pass. Full server suite: 3089 pass / 3 fail; the 3 failures are pre-existing on main (mqttBrokerManager.test.ts zero-hop encode), unrelated to this fix — verified by stashing the change and re-running.

Test plan

New regression tests pass
Pre-existing test suite unaffected (verified pre-existing failures are independent)
Reporter ([BUG] Per-minute reconnect loop: 'Transport not initialized' race tears down healthy TCP sessions #3247) confirms 3×/min reconnect loop is gone in their deployment

Fixes #3247

🤖 Generated with Claude Code

) When the transport's 'connect' event scheduled handleConnected but a parallel disconnect/reconnect path nulled `this.transport` during one of the awaits inside the handler (notifyNodeConnected, channel snapshot), the eventual `await this.sendWantConfigId()` threw "Transport not initialized". The catch block treated that as a fatal post-connect reset, armed the cooldown, emitted disconnect, and called handleDisconnected — which on the very next reconnect produced the same race again, deterministically. Reporters saw a 3×/min reconnect loop on otherwise-healthy TCP sessions, with the device's config response arriving ~300 ms AFTER MeshMonitor had already declared the socket dead. Fix: * Capture `const transportAtConnect = this.transport` at handleConnected entry; bail with a debug log if it's already null. * In the catch block, compare `this.transport` against the captured reference. If they differ (the transport was swapped/nulled during the handshake), the failing send was against a stale generation — bail silently. A new connect cycle is already in flight (or the source is being shut down); doing another teardown here would reproduce #3247. * Genuine transport-layer send failures (e.g. tcpTransport throwing "Not connected to TCP server" on a real socket close) still hit the existing post-reset cooldown path. Tests: src/server/meshtasticManager.connectRace.test.ts covers three cases — null-at-entry bail, mid-handshake swap bail (the exact #3247 scenario), and the counter-case where a stable-reference send failure still triggers the legitimate teardown. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-28T19:32:15Z

Claude finished @Yeraze's task —— View job

Code Review Complete ✅

I've thoroughly reviewed this PR and it represents an excellent solution to a complex race condition. Here's my detailed feedback:

Todo List:

Read the modified meshtasticManager.ts file
Read the new test file
Analyze the race condition fix implementation
Review code quality and best practices
Check for potential bugs or edge cases
Evaluate performance considerations
Assess security implications
Review test coverage completeness
Provide comprehensive feedback

✅ Code Quality & Best Practices

Excellent implementation with several highlights:

Clear problem identification: The root cause analysis clearly separates the two error classes (Transport not initialized vs Not connected to TCP server)
Surgical fix: The solution is minimal and targeted, introducing only the necessary guards without changing broader logic
Defensive programming: The transportAtConnect pattern protects against the exact race scenario
Documentation: Comprehensive inline comments explain the race scenario and fix rationale

✅ Race Condition Analysis

The fix correctly addresses the core issue in handleConnected() at lines 1225-1368:

Root Cause Identified:

// meshtasticManager.ts:1237-1240
const transportAtConnect = this.transport;
if (!transportAtConnect) {
  logger.debug('🟡 [connect-race] handleConnected fired with no transport — skipping handshake (#3247)');
  return;
}

Race Detection Logic:

// meshtasticManager.ts:1350-1353
if (this.transport !== transportAtConnect) {
  logger.debug(`🟡 [connect-race] sendWantConfigId aborted — transport replaced during handshake (${msg}) (#3247)`);
  return;
}

This elegantly distinguishes between:

Internal race condition (transport swapped during await chain) → silent bail
Genuine transport failure (stable reference, real send error) → existing teardown

✅ Test Coverage

The test suite in meshtasticManager.connectRace.test.ts is comprehensive:

Early null check (lines 171-189): Verifies handler bails when transport is null at entry
Mid-handshake race (lines 191-236): Tests the exact [BUG] Per-minute reconnect loop: 'Transport not initialized' race tears down healthy TCP sessions #3247 scenario where transport is nulled during notifyNodeConnected
Genuine failure counter-test (lines 238-270): Ensures legitimate send errors still trigger proper teardown

Test quality highlights:

Proper mocking of all dependencies
Clear scenario descriptions
Comprehensive assertions covering both positive and negative cases
Tests actual behavioral changes, not just code coverage

✅ Performance Considerations

Minimal overhead: Only adds one reference capture and comparison
No additional async operations: The fix doesn't introduce new await points
Efficient early returns: Bail-out paths avoid unnecessary work
Zero impact on happy path: Normal connections proceed unchanged

✅ Security Assessment

No security implications: This is a pure stability fix with no auth/data handling changes
No new attack vectors: Early returns reduce potential for state corruption
Defensive coding: Guards against race conditions that could lead to unstable states

⚠️ Minor Considerations

Edge Case Coverage:

Multiple rapid disconnects: The fix handles the common case but rapid successive disconnect/reconnect cycles could potentially create new edge cases. However, this seems unlikely given the typical TCP connection lifecycle.
Memory references: The transportAtConnect reference is properly scoped and doesn't create memory leaks.

✅ Overall Assessment

This is a high-quality fix that:

Solves a specific, well-documented production issue
Uses minimal, surgical changes
Includes comprehensive test coverage
Maintains backward compatibility
Follows existing code patterns and conventions

The race condition analysis is thorough, the implementation is clean, and the test coverage gives confidence in the fix. The 3×/min reconnection loop was a significant stability issue, and this solution addresses it without introducing new risks.

Recommendation: Approve and merge ✅

The fix demonstrates excellent debugging skills, clean implementation, and thorough testing. It should resolve the reported stability issues without impacting normal operation.

Patch release combining the MeshCore automation suite (auto-announce, auto-responder, timer triggers — #3249), auto-acknowledge automation (#3245), the handleConnected transport-swap race fix that was producing deterministic 3×/min reconnect loops on TCP Meshtastic sources (#3248 / #3247), MeshCore neighbor publicKey input validation (#3240), CodeQL security remediations covering polynomial-ReDoS, log-injection and regex-DoS (#3246), and a translations update (#3208). Updated CHANGELOG.md (converted [Unreleased] MeshCore-automation entry into a full [4.8.1] release section covering all six commits since 4.8.0), CLAUDE.md version header, and version across all 5 version files (package.json, package-lock.json, helm/meshmonitor/Chart.yaml, desktop/src-tauri/tauri.conf.json, desktop/package.json). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Yeraze merged commit 54ef737 into main May 28, 2026
19 checks passed

Yeraze deleted the fix/3247-connect-handshake-race branch May 28, 2026 20:07

Yeraze mentioned this pull request May 28, 2026

chore: bump version to 4.8.1 with changelog and docs #3250

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(stability): guard handleConnected against transport-swap race (#3247)#3248

fix(stability): guard handleConnected against transport-swap race (#3247)#3248
Yeraze merged 1 commit into
mainfrom
fix/3247-connect-handshake-race

Yeraze commented May 28, 2026

Uh oh!

github-actions Bot commented May 28, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Yeraze commented May 28, 2026

Summary

Root cause

Fix

Tests

Test plan

Uh oh!

github-actions Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Complete ✅

✅ Code Quality & Best Practices

✅ Race Condition Analysis

✅ Test Coverage

✅ Performance Considerations

✅ Security Assessment

⚠️ Minor Considerations

✅ Overall Assessment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented May 28, 2026 •

edited

Loading