
Fix flaky e2e test: use PID-based unique socket paths to eliminate bridge races #156

Merged
jancurn merged 5 commits into main from claude/fix-flaky-e2e-test-KyUeH
Apr 10, 2026

@jancurn jancurn commented Apr 9, 2026

Summary

  • Each bridge instance now gets a unique socket path based on its PID (@session.1234.sock), eliminating races where an exiting bridge deletes a new bridge's socket
  • ensureBridgeReady sets lastConnectionAttemptAt before restarting, preventing parallel CLI processes from triggering concurrent background restarts via consolidateSessions
  • mcpc clean now removes orphaned PID-based socket files (older than 5 min) from the bridges directory

Root cause

When e2e tests run with --parallel 6 and a shared home directory, a parallel CLI process running mcpc --json triggers consolidateSessions() → reconnectCrashedSessions(). This starts a background bridge restart for a session whose bridge was just killed by another test. Both the background restart and the test's own restart spawn bridges that share the same socket path (@session-name.sock). When the losing bridge exits, its cleanup() deletes the socket file, which by then belongs to the winning bridge, causing ENOENT.
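The race can be reproduced in miniature without real sockets. The sketch below is illustrative only: it models the bridges directory as an in-memory map, and `FakeBridge`, `sharedPath`, and `uniquePath` are hypothetical names, not code from this PR.

```typescript
// In-memory stand-in for the bridges directory: socket path -> PID of the owning bridge.
const sockets = new Map<string, number>();

// Hypothetical path helpers; the real getSocketPath() may build paths differently.
const sharedPath = (session: string): string => `${session}.sock`;
const uniquePath = (session: string, pid: number): string => `${session}.${pid}.sock`;

class FakeBridge {
  readonly pid: number;
  readonly socketPath: string;

  constructor(pid: number, socketPath: string) {
    this.pid = pid;
    this.socketPath = socketPath;
    // "Listen": a later bridge at the same path replaces the earlier entry.
    sockets.set(socketPath, pid);
  }

  cleanup(): void {
    // Unconditional delete, regardless of which bridge currently owns the path.
    sockets.delete(this.socketPath);
  }
}

// Shared path: the losing bridge's cleanup removes the winner's socket.
const loser = new FakeBridge(1001, sharedPath("@demo"));
const winner = new FakeBridge(1002, sharedPath("@demo"));
loser.cleanup();
const sharedSurvives = sockets.has(winner.socketPath); // false

// PID-based paths: each bridge only ever deletes its own entry.
sockets.clear();
const loser2 = new FakeBridge(1001, uniquePath("@demo", 1001));
const winner2 = new FakeBridge(1002, uniquePath("@demo", 1002));
loser2.cleanup();
const uniqueSurvives = sockets.has(winner2.socketPath); // true
```

With a shared path, correctness depends on the order in which the two bridges start and exit; with per-PID paths that ordering no longer matters.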

Changes

  • getSocketPath(name, pid?) — accepts optional PID to produce unique paths
  • Bridge constructor — uses process.pid for its socket path; cleanup safely deletes only its own unique socket
  • startBridge — computes socket path after spawn (when PID is known); no longer needs pre-spawn socket cleanup
  • ensureBridgeReady — reads session PID to find the right socket; sets lastConnectionAttemptAt before restarting to block parallel restarts; uses PID from restartBridge return value directly
  • SessionClient reconnect — uses PID from restartBridge return value directly (no extra getSession call)
  • consolidateSessions — cleans up PID-based sockets when clearing dead PIDs and removing expired sessions
  • cleanupOrphanedSockets() — new function that globs the bridges directory and removes stale socket files not matching any active session+PID, using mtime to avoid racing with just-spawned bridges
  • mcpc clean — wired up orphaned socket cleanup into both default and --all modes
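As a rough illustration of the orphaned-socket cleanup described above, the following sketch lists a directory, skips files that belong to an active session+PID, and only deletes files whose mtime is older than a threshold. The function name, its parameters, and the way the active set is passed in are assumptions made for this example; the PR's cleanupOrphanedSockets() may be shaped differently.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical signature, not the PR's actual API.
function cleanupOrphanedSocketsSketch(
  bridgesDir: string,
  activeSocketNames: Set<string>, // e.g. "@session.1234.sock" for each live session+PID
  maxAgeMs: number = 5 * 60 * 1000,
  now: number = Date.now(),
): string[] {
  const removed: string[] = [];
  for (const name of fs.readdirSync(bridgesDir)) {
    if (!name.endsWith(".sock")) continue;
    if (activeSocketNames.has(name)) continue; // owned by a live bridge
    const fullPath = path.join(bridgesDir, name);
    let mtimeMs: number;
    try {
      mtimeMs = fs.statSync(fullPath).mtimeMs;
    } catch {
      continue; // file vanished between readdir and stat
    }
    // Too fresh: may belong to a bridge that was just spawned and is not recorded yet.
    if (now - mtimeMs < maxAgeMs) continue;
    try {
      fs.unlinkSync(fullPath);
      removed.push(name);
    } catch {
      // Another process removed it first; nothing left to do.
    }
  }
  return removed.sort();
}

// Usage on a throwaway directory, with regular files standing in for sockets.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "bridges-"));
const tenMinutesAgo = new Date(Date.now() - 10 * 60 * 1000);
for (const name of ["@a.100.sock", "@a.200.sock", "@b.300.sock"]) {
  fs.writeFileSync(path.join(dir, name), "");
}
fs.utimesSync(path.join(dir, "@a.100.sock"), tenMinutesAgo, tenMinutesAgo); // old and orphaned
fs.utimesSync(path.join(dir, "@a.200.sock"), tenMinutesAgo, tenMinutesAgo); // old but active
// "@b.300.sock" keeps its fresh mtime and is not in the active set.
const removed = cleanupOrphanedSocketsSketch(dir, new Set(["@a.200.sock"]));
const remaining = fs.readdirSync(dir).sort();
```

Only the old, unowned file is removed; the fresh orphan is left for a later run, which is the point of the mtime check.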

Test plan

  • npm run lint passes
  • npm run build passes
  • npm run test:unit — 500/500 pass
  • sessions/mcp-session e2e test passes (previously flaky)
  • sessions/close, sessions/restart, sessions/lifecycle all pass
  • Full parallel e2e suite (--parallel 6) — no regressions from this change

https://claude.ai/code/session_0138WV4ffyNZzBFuzoTDzuBn

claude added 3 commits April 9, 2026 13:06
The `restartSession` function calls `showServerDetails` after successfully
starting a new bridge. Under CI load, this health check can fail transiently
(bridge socket not yet ready), causing the entire restart command to fail
even though the restart itself succeeded.

Make `showServerDetails` non-fatal in the restart path: catch errors,
reset session status to 'active' (since the bridge was started), and
show a warning instead of failing. This fixes the flaky
sessions/mcp-session e2e test.

https://claude.ai/code/session_0138WV4ffyNZzBFuzoTDzuBn
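
The non-fatal health check from this commit could look roughly like the sketch below. It is not the PR's code: `restartWithNonFatalDetails`, `RestartOutcome`, and the callback parameters are hypothetical stand-ins for the real restartSession and showServerDetails wiring.

```typescript
interface RestartOutcome {
  status: "active";
  warning?: string;
}

// Hypothetical wrapper: bridge start failures still propagate,
// but a failed details lookup is downgraded to a warning.
async function restartWithNonFatalDetails(
  startBridge: () => Promise<void>,
  showServerDetails: () => Promise<void>,
): Promise<RestartOutcome> {
  await startBridge(); // a real restart failure is still fatal
  try {
    await showServerDetails();
    return { status: "active" };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    // The bridge was started, so the session stays 'active' and the caller only warns.
    return {
      status: "active",
      warning: `Session restarted, but server details are unavailable: ${message}`,
    };
  }
}

// Usage: a transient failure in the details step no longer fails the restart.
restartWithNonFatalDetails(
  async () => {},
  async () => {
    throw new Error("connect ENOENT");
  },
).then((outcome) => console.log(outcome.status, outcome.warning));
```
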
When tests run in parallel with a shared home directory, a background
bridge reconnect (triggered by consolidateSessions in a parallel CLI
process) can race with an explicit restart. The exiting bridge's
cleanup() would delete the socket file at this.socketPath — but a
NEW bridge for the same session may have already created its socket
at that same path. This causes the new bridge's socket to vanish,
producing the ENOENT error seen in the flaky sessions/mcp-session test.

Fix: stop deleting the socket file in the bridge's cleanup(). Stale
sockets are already cleaned up in three other places:
- startBridge() removes old sockets before spawning a new bridge
- createSocketServer() removes old sockets before listening
- consolidateSessions() removes sockets for expired/unauthorized sessions

https://claude.ai/code/session_0138WV4ffyNZzBFuzoTDzuBn
When tests run in parallel with a shared home directory, a background
reconnect (triggered by consolidateSessions in a parallel CLI process)
can race with an explicit restart. Both start bridges for the same
session, and the exiting bridge's cleanup deletes the new bridge's
socket at the shared path, causing ENOENT.

Fix by giving each bridge instance a unique socket path based on its
PID: `@session-name.<pid>.sock`. Since each bridge owns its own path,
cleanup never conflicts with other bridges.

Changes:
- getSocketPath() accepts optional `pid` parameter for unique paths
- Bridge uses process.pid to compute its socket path
- startBridge computes socket path after spawn (when PID is known)
- ensureBridgeReady reads session PID to find the right socket
- ensureBridgeReady sets lastConnectionAttemptAt before restarting,
  preventing parallel consolidateSessions from triggering concurrent
  background restarts for the same session
- consolidateSessions cleans up PID-based sockets when clearing dead
  bridge PIDs and removing expired sessions
- SessionClient reconnect logic re-reads session after restartBridge
  to get the new PID-based socket path

https://claude.ai/code/session_0138WV4ffyNZzBFuzoTDzuBn
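
One way to picture the lastConnectionAttemptAt guard is sketched below. The commit only states that ensureBridgeReady sets the timestamp before restarting; the cooldown constant, the `shouldBackgroundReconnect` check, and the in-memory `SessionRecord` (standing in for the shared session store) are assumptions made for illustration.

```typescript
interface SessionRecord {
  name: string;
  pid?: number;
  lastConnectionAttemptAt?: number; // epoch milliseconds
}

// Hypothetical cooldown; the real threshold is not given in this PR.
const RECONNECT_COOLDOWN_MS = 10000;

// What a consolidation pass in another CLI process might check before reconnecting.
function shouldBackgroundReconnect(session: SessionRecord, now: number): boolean {
  const last = session.lastConnectionAttemptAt;
  return last === undefined || now - last >= RECONNECT_COOLDOWN_MS;
}

function ensureBridgeReadySketch(
  session: SessionRecord,
  now: number,
  restartBridge: (name: string) => number, // returns the new bridge PID
): string {
  // Stamp first, so a parallel consolidation pass sees a recent attempt and skips this session.
  session.lastConnectionAttemptAt = now;
  const pid = restartBridge(session.name);
  session.pid = pid;
  // Use the returned PID directly to derive the socket name.
  return `${session.name}.${pid}.sock`;
}

const session: SessionRecord = { name: "@demo" };
const t0 = 1000000;
const before = shouldBackgroundReconnect(session, t0); // true
const socketName = ensureBridgeReadySketch(session, t0, () => 4242); // "@demo.4242.sock"
const during = shouldBackgroundReconnect(session, t0 + 500); // false
const later = shouldBackgroundReconnect(session, t0 + RECONNECT_COOLDOWN_MS); // true
```

The ordering is what matters: if the timestamp were written after the restart, a second process could still observe a stale value during the restart window.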
@jancurn jancurn changed the title from "Handle transient failures when displaying server details after restart" to "Fix flaky e2e test: use PID-based unique socket paths to eliminate bridge races" Apr 10, 2026
claude added 2 commits April 10, 2026 11:58
- Use the PID returned by restartBridge() directly instead of
  re-reading sessions.json via getSession()
- Add cleanupOrphanedSockets() to remove stale PID-based socket files
  from the bridges directory during `mcpc clean`. Only deletes sockets
  older than 5 minutes (configurable) to avoid racing with a bridge
  that was just spawned.
- Wire orphaned socket cleanup into cleanStale/cleanAll and display

https://claude.ai/code/session_0138WV4ffyNZzBFuzoTDzuBn
Every call site of getSocketPath() always passes a PID. Making the pid parameter optional would silently
produce a path without the PID suffix that no bridge ever listens on.

https://claude.ai/code/session_0138WV4ffyNZzBFuzoTDzuBn
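
The difference between an optional and a required pid can be shown with two small variants. Both functions and the `BRIDGES_DIR` constant are hypothetical and exist only for this comparison; the real getSocketPath() and bridges directory may differ.

```typescript
import * as path from "node:path";

// Hypothetical bridges directory, used only for this example.
const BRIDGES_DIR = path.join("/tmp", "mcpc-demo", "bridges");

// Optional pid: a forgotten argument still compiles and yields a path without the PID suffix.
function getSocketPathOptional(sessionName: string, pid?: number): string {
  const suffix = pid === undefined ? "" : `.${pid}`;
  return path.join(BRIDGES_DIR, `${sessionName}${suffix}.sock`);
}

// Required pid: the compiler rejects any call that omits it.
function getSocketPathRequired(sessionName: string, pid: number): string {
  return path.join(BRIDGES_DIR, `${sessionName}.${pid}.sock`);
}

const forgotten = path.basename(getSocketPathOptional("@session")); // "@session.sock"
const correct = path.basename(getSocketPathRequired("@session", 1234)); // "@session.1234.sock"
// getSocketPathRequired("@session"); // would not compile: expected 2 arguments
```

Moving the mistake from runtime (a connection to a path nobody listens on) to compile time is the design choice this commit describes.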
@jancurn jancurn merged commit 1582882 into main Apr 10, 2026
6 checks passed
@jancurn jancurn deleted the claude/fix-flaky-e2e-test-KyUeH branch April 10, 2026 14:33

3 participants