fix(harness+server): MCP shim hardening + SessionRegistry #4

Merged
OmGuptaIND merged 5 commits into main from OmGuptaIND/mcp-shim-and-session-registry
Apr 21, 2026

@OmGuptaIND
Contributor

Summary

  • MCP shim resolution & lifecycle: standalone Anton MCP shim (anton-mcp-shim.ts) is now spawned via a resolved absolute path (mcp-spawn-config.ts) with explicit probe fallback; adds a 4-state connection machine (idle/connecting/authed/lost) with a single transitionToLost funnel, per-session IPC auth, and a ping/pong liveness loop.
  • SessionRegistry: bounded LRU registry partitioned into conversation / routine / ephemeral pools. Replaces the unbounded Map<string, Session> in server.ts. delete() awaits shutdown() so codex + shim subprocesses are reaped; pin() / unpin() protect sessions while a turn is streaming; onEvict cleans up IPC auth + context maps before the subprocess dies. Includes a race guard for id-reuse between eviction-dispatch and the async shutdown body.
  • Webhook session disposal: the webhook runner now cleans up pending plan_confirm / ask_user interactions when a session is evicted, resolving them with { approved: false } and no feedback string. The feedback string previously leaked into model input as plan-revision text or ask_user answers.
  • Handshake race fix (new commit): if the server sends bye in the same readline chunk as auth_ok, transitionToLost runs synchronously inside the readline handler while ensureAuthed is still awaiting doConnect(). A post-await state check now prevents overwriting lost with authed and handing callers a half-closed socket.
  • IPC write safety: safeWrite guards against writing to destroyed sockets; bye notifications are sent best-effort before subprocess kill.
  • Docs: specs/features/SESSION_LIFECYCLE.md (new), HARNESS_ARCHITECTURE.md updated with MCP hardening section.
  • Integration checks: check-mcp-shim.ts, check-mcp-shim-runtime.ts, check-session-registry.ts cover happy-path, reconnect/backoff, eviction, pin/unpin, race-guard, and handshake state-flap.

Test plan

  • pnpm -r build passes
  • pnpm --filter @anton/agent-core check:mcp-shim passes
  • pnpm --filter @anton/agent-core check:mcp-shim-runtime passes
  • pnpm --filter @anton/agent-core check:session-registry passes
  • Manual: start a harness chat, kill the server-side IPC mid-turn — shim backs off and reconnects; no dangling codex/shim processes on the box
  • Manual: fill conversation pool past capacity — LRU victim is evicted, its codex subprocess exits, IPC auth entry drops
  • Manual: Slack/Telegram plan_confirm while session is evicted — plan_confirm resolves as rejected, no "Session evicted." leaks into the next model turn

🤖 Generated with Claude Code

OmGuptaIND and others added 5 commits April 21, 2026 15:03
Two production bugs surfaced on the VPS deploy, fixed here together
because the server.ts wiring overlaps.

MCP shim path resolution — the previous code composed the shim path
from `homedir() + '../node_modules/...'`, which resolved to
`/home/anton/node_modules/...` on the VPS where the actual install sits
at `/opt/anton/node_modules/...`. Harness sessions spawned `codex
app-server` with a non-existent shim and the model hallucinated tool
calls against a connector surface that couldn't actually reach anything.

Replaced with a single-source-of-truth `buildMcpSpawnConfig()` that
resolves the shim via this module's own `import.meta.url`, uses
`process.execPath` instead of the literal `"node"` (systemd services
don't inherit PATH), and exposes the resolved `shimPath` for
diagnostics. Added `probeMcpShim()` — a 5s JSON-RPC `initialize`
round-trip run on boot and every 60s; failures are logged with stderr
tail + shim dir for ops visibility. When the probe fails the server
omits the capability block and passes an empty connector list to
harness sessions so the model stops believing it has tools it can't
call. The shim now embeds the package version in its `initialize`
response so version skew between a partial rsync deploy and the host
binary shows up as a warn log.
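
A minimal sketch of the path-resolution idea described above, not the actual `buildMcpSpawnConfig()`. The `SpawnConfigSketch` shape, the `buildSpawnConfigSketch` name, and the assumption that the compiled shim sits next to the calling module are all hypothetical. The module URL is passed in as a parameter so the sketch runs anywhere; at a real call site it would be `import.meta.url`.

```typescript
import { existsSync } from "node:fs";
import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url";

// Hypothetical shape; the real McpSpawnConfig may carry different fields.
interface SpawnConfigSketch {
  command: string; // absolute path of the running Node binary
  args: string[]; // arguments passed to that binary
  shimPath: string; // exposed for diagnostics
  shimExists: boolean; // cheap pre-check before any probe runs
}

// Resolve the shim relative to the module's own location rather than
// homedir() or the working directory. The default file name is an
// assumption based on the compiled artifact mentioned in this PR.
function buildSpawnConfigSketch(
  moduleUrl: string,
  shimFile = "anton-mcp-shim.js",
): SpawnConfigSketch {
  const moduleDir = dirname(fileURLToPath(moduleUrl));
  const shimPath = join(moduleDir, shimFile);
  return {
    // process.execPath avoids a PATH lookup for the literal "node".
    command: process.execPath,
    args: [shimPath],
    shimPath,
    shimExists: existsSync(shimPath),
  };
}
```

Because the result depends only on where the module lives on disk, the same code resolves correctly whether the install sits under a home directory or under `/opt`.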

Harness session opts now take `mcp: { socketPath, authToken, spawn:
McpSpawnConfig }` instead of flat `socketPath`/`shimPath`/`authToken`
— keeps the spawn config in one place.

Session lifecycle — `handleSessionDestroy` deleted the Map entry but
never awaited `session.shutdown()`, so the codex app-server subprocess
and its MCP shim child survived until the host process died. Over
hours of normal use a VPS accumulated tens of orphaned node + codex
processes. Nothing ever evicted sessions the client forgot about
either, so the set grew without bound at ~30 MB RSS per harness
session.

Added `SessionRegistry<T extends Shutdownable>` — a bounded
LRU-ordered store with partitioned pools (conversation / routine /
ephemeral, each with independent capacity and recency floor). Every
session type goes through it: Pi SDK Session, HarnessSession,
CodexHarnessSession, and agent runs. `put()` fires background eviction
when the target pool is over capacity, skipping pinned or
within-recency-floor entries. `delete()` awaits `shutdown()` so the
destroy handler can block on cleanup. `pin()`/`unpin()` wrap active
turns so mid-stream eviction can't rip a session out. `shutdownAll()`
fans out on graceful server shutdown.
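
The registry contract above can be illustrated with a single-pool sketch. This is not the shipped `SessionRegistry`: the class name `LruRegistrySketch` is hypothetical, and partitioned pools, the recency floor, `onEvict`, and `shutdownAll()` are left out to keep it short. It assumes a `Map` whose insertion order doubles as recency order.

```typescript
interface Shutdownable {
  shutdown?: () => Promise<void> | void;
}

class LruRegistrySketch<T extends Shutdownable> {
  // Map iteration order is insertion order, so the first key is the LRU entry.
  private entries = new Map<string, T>();
  private pins = new Map<string, number>(); // id -> active pin count
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get size(): number {
    return this.entries.size;
  }

  get(id: string): T | undefined {
    const session = this.entries.get(id);
    if (session !== undefined) {
      // Re-insert to mark the entry as most recently used.
      this.entries.delete(id);
      this.entries.set(id, session);
    }
    return session;
  }

  put(id: string, session: T): void {
    // Replace-in-place: an existing entry under the same id is not shut down.
    this.entries.delete(id);
    this.entries.set(id, session);
    void this.evictOverflow(); // background eviction, not awaited by the caller
  }

  pin(id: string): void {
    this.pins.set(id, (this.pins.get(id) ?? 0) + 1);
  }

  unpin(id: string): void {
    const remaining = (this.pins.get(id) ?? 0) - 1;
    if (remaining > 0) this.pins.set(id, remaining);
    else this.pins.delete(id);
  }

  // Awaits shutdown() so a destroy handler can block on subprocess cleanup.
  async delete(id: string): Promise<boolean> {
    const session = this.entries.get(id);
    if (session === undefined) return false;
    this.entries.delete(id);
    this.pins.delete(id);
    await session.shutdown?.();
    return true;
  }

  private async evictOverflow(): Promise<void> {
    for (const id of [...this.entries.keys()]) {
      if (this.entries.size <= this.capacity) break;
      if (this.pins.has(id)) continue; // never evict a session mid-turn
      await this.delete(id).catch(() => {}); // a throwing shutdown must not stop eviction
    }
  }
}
```

A caller would typically wrap a streaming turn in `pin(id)` and `unpin(id)` inside a `try`/`finally`, so the pin is released even when the turn throws.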

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
check-mcp-shim: buildMcpSpawnConfig shape, getExpectedShimVersion
non-empty, probe round-trip against the compiled shim, version match,
clean-fail (not hang) on missing binary + missing shim path. Runs via
`pnpm --filter @anton/agent-core check:mcp-shim` (compiles first
because the probe needs dist/harness/anton-mcp-shim.js).

check-session-registry: put/get, delete-awaits-shutdown, partitioned
pools, LRU eviction, pinning, recency-floor warn path,
replace-in-place (no shutdown), peek no-touch, shutdownAll,
missing-shutdown no-op, throwing-shutdown continues. Runs via `pnpm
--filter @anton/agent-core check:session-registry`.

Follows the existing tsx-runnable script convention in
src/harness/__fixtures__/check.ts — no test framework added.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
SESSION_LIFECYCLE.md describes the registry contract: categories
(conversation/routine/ephemeral), pool defaults, method invariants,
eviction policy, pinning during active turns, server-shutdown fan-out,
and the failure-mode history that motivated it (destroy-leak,
unbounded accumulation).

HARNESS_ARCHITECTURE.md: added an "MCP shim spawn + health" section
covering buildMcpSpawnConfig invariants (execPath, import.meta.url),
probeMcpShim behavior + cadence, and capability-block gating on probe
failure. Delivery-status table gains two rows for the shipped MCP
hardening + session-lifecycle work, with a pointer to the new spec.
Key file map picks up mcp-spawn-config.ts and session-registry.ts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… disposal

SessionRegistry
- runEviction: yield a microtask before checking for id-reuse so the
  race guard actually defers. Previously the sync portion of the
  async function called onEvict before put(sameId) could repopulate
  the slot, wiping the replacement's bookkeeping.
- Replacement detection skips onEvict but still shuts down the old
  session so the subprocess is reaped.
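
The race guard described in these two bullets might look roughly like the following. It is a sketch under assumptions: `EvictionSketch`, its `evictedIds` array (standing in for the `onEvict` hook), and the public `entries` map are hypothetical and exist only to make the ordering visible.

```typescript
interface Victim {
  shutdown(): Promise<void> | void;
}

class EvictionSketch {
  readonly entries = new Map<string, Victim>();
  // Stands in for the onEvict hook: records ids whose bookkeeping was wiped.
  readonly evictedIds: string[] = [];

  // Removes the entry synchronously, then finishes cleanup asynchronously.
  evict(id: string): Promise<void> {
    const victim = this.entries.get(id);
    if (victim === undefined) return Promise.resolve();
    this.entries.delete(id);
    return this.runEviction(id, victim);
  }

  private async runEviction(id: string, victim: Victim): Promise<void> {
    // Yield one microtask so a synchronous re-registration of the same id
    // by the caller lands before the id-reuse check below.
    await Promise.resolve();
    const replaced = this.entries.has(id);
    if (!replaced) {
      // Only wipe per-id bookkeeping when nobody re-registered the id.
      this.evictedIds.push(id);
    }
    // The old session is always shut down so its subprocess is reaped.
    await victim.shutdown();
  }
}
```

Without the `await Promise.resolve()`, the check would run in the synchronous part of the async function, before the caller has any chance to repopulate the slot, which matches the failure described above.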

MCP IPC handler
- safeWrite short-circuits on destroyed/writableEnded sockets and
  wraps conn.write in try/catch so EPIPE on a dead peer no longer
  bubbles up as an unhandled error.
- rl.on('line', …) wraps the async processLine in an IIFE with its
  own try/catch so thrown errors from late-arriving frames never
  escape to the event emitter.
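
A sketch of the write guard from the first bullet, assuming the connection is a Node.js `Writable` (a `net.Socket` is one). The name `safeWriteSketch`, the boolean return value, and the ping frame format are assumptions for illustration; a `PassThrough` stands in for the socket.

```typescript
import { PassThrough, Writable } from "node:stream";

// Returns true when the frame was handed to the stream, false when the
// write was skipped or threw synchronously.
function safeWriteSketch(conn: Writable, frame: string): boolean {
  // Short-circuit on streams that can no longer accept data.
  if (conn.destroyed || conn.writableEnded) return false;
  try {
    conn.write(frame);
    return true;
  } catch {
    // A synchronous failure on a dead peer is swallowed, not rethrown.
    return false;
  }
}

// Usage with an in-memory stream standing in for the IPC socket.
const sink = new PassThrough();
const wroteWhileOpen = safeWriteSketch(sink, '{"method":"ping"}\n');
sink.destroy();
const wroteAfterDestroy = safeWriteSketch(sink, '{"method":"ping"}\n');
```

Asynchronous write errors still arrive through the stream's `error` event, so a real handler would also need an `error` listener on the socket.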

Webhook agent-runner
- New SessionDisposer callback lets the server own the actual
  SessionRegistry.delete. disposeSession() drops the local Map
  entry, clears pendingInteractions (timeout + reject), clears
  progressStates timers, then awaits the server disposer.
- Wired through all four eviction sites (evictSession,
  switchAllSessionModels cross-provider, handleMenuAction m:s:*,
  getOrCreateSession). Previously deleting just the local Map entry
  orphaned codex/claude-code subprocesses on every /model switch.
- Swallow settled.catch(() => {}) on idle queue tails to prevent
  stale unhandledRejection on long-lived chains.
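
The disposal ordering in the first bullet can be sketched as follows. `RunnerSketch`, the `PendingInteraction` shape, and keying pending interactions by session id are assumptions made for illustration, and the progress-timer step is omitted. The pending interaction is settled with `{ approved: false }`, which is the behavior this PR ends up with.

```typescript
interface PendingInteraction {
  resolve: (result: { approved: boolean }) => void;
  timeout: ReturnType<typeof setTimeout>;
}

// The server owns the real registry delete; the runner only calls back into it.
type SessionDisposer = (sessionId: string) => Promise<void>;

class RunnerSketch {
  readonly sessions = new Map<string, unknown>();
  readonly pending = new Map<string, PendingInteraction>();
  private disposer: SessionDisposer;

  constructor(disposer: SessionDisposer) {
    this.disposer = disposer;
  }

  async disposeSession(sessionId: string): Promise<void> {
    // 1. Drop the local entry so no new work is routed to this session.
    this.sessions.delete(sessionId);

    // 2. Settle any waiting plan_confirm / ask_user interaction.
    const waiting = this.pending.get(sessionId);
    if (waiting !== undefined) {
      clearTimeout(waiting.timeout);
      this.pending.delete(sessionId);
      // No feedback string: only the rejection flag reaches the caller.
      waiting.resolve({ approved: false });
    }

    // 3. Let the server shut the session down and reap its subprocess.
    await this.disposer(sessionId);
  }
}
```

Routing every eviction site through one method like this is what keeps a `/model` switch from leaving a subprocess behind.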

server.ts
- Registry onEvict hook wipes activeTurns, mcpIpcServer auth,
  harnessSessionContexts, and harnessExtractionCursor symmetrically
  when the registry evicts a session.
- handleSessionDestroy post-await id-reuse guard skips tail cleanup
  and persisted-session deletion when a new session was created for
  the same id during the shutdown await.
- WebhookAgentRunner constructor gets an async disposer that
  mirrors handleSessionProviderSwitch ordering.

Tests
- New session-registry case: eviction→replace race confirms onEvict
  is skipped for the re-registered id but still fires for the
  legitimately evicted one. 18/18 pass.
- New mcp-shim-runtime fixture: 5 integration checks covering happy
  path, log notifications, connection drop reconnect, bye
  notification reconnect, and auth-failure clean error.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… leak

- anton-mcp-shim: after await doConnect(), re-check state before setting
  'authed'. A server-sent `bye` in the same readline chunk as auth_ok
  triggers transitionToLost synchronously while we're awaited; without
  this guard we'd overwrite 'lost' with 'authed' and hand callers a
  half-closed socket.
- webhook agent-runner: disposeSession resolves pending plan_confirm /
  ask_user interactions with { approved: false } only. The prior
  feedback string was forwarded to the model as plan-revision input or
  the first question's answer.
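
The post-await guard from the first bullet, as a minimal sketch. The state names come from this PR; the `ShimConnectionSketch` class, the injected `doConnect` callback, and the error message are hypothetical.

```typescript
type ConnState = "idle" | "connecting" | "authed" | "lost";

class ShimConnectionSketch {
  state: ConnState = "idle";
  private doConnect: () => Promise<void>;

  constructor(doConnect: () => Promise<void>) {
    this.doConnect = doConnect;
  }

  // Single funnel for every path that loses the connection.
  transitionToLost(): void {
    this.state = "lost";
  }

  // Reads the state through a method so TypeScript does not assume the
  // value assigned before the await is still current after it.
  private current(): ConnState {
    return this.state;
  }

  async ensureAuthed(): Promise<void> {
    if (this.state === "authed") return;
    this.state = "connecting";
    await this.doConnect();
    // A `bye` handled while we were awaiting may already have moved the
    // state to "lost"; do not overwrite it with "authed".
    const after = this.current();
    if (after !== "connecting") {
      throw new Error(`connection became ${after} during handshake`);
    }
    this.state = "authed";
  }
}
```

Throwing here lets the caller fall into its normal reconnect path instead of writing to a socket the server has already closed.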

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@OmGuptaIND OmGuptaIND merged commit 402c6a0 into main Apr 21, 2026
@OmGuptaIND OmGuptaIND deleted the OmGuptaIND/mcp-shim-and-session-registry branch April 21, 2026 14:27
OmGuptaIND added a commit that referenced this pull request Apr 22, 2026
### Other
- fix(harness+desktop): coalesce streaming text deltas, add detached turns, cross-surface invariant (#6)
- feat(settings): mark Claude CLI as coming soon, simplify provider form (#5)
- fix(harness+server): MCP shim hardening + SessionRegistry (#4)
- fix(caddy): preserve /health and /status paths upstream to sidecar (#3)