fix(harness+server): MCP shim hardening + SessionRegistry #4
Merged
OmGuptaIND merged 5 commits into main on Apr 21, 2026
Two production bugs surfaced on the VPS deploy, fixed here together
because the server.ts wiring overlaps.
MCP shim path resolution — the previous code composed the shim path
from `homedir() + '../node_modules/...'`, which resolved to
`/home/anton/node_modules/...` on the VPS where the actual install sits
at `/opt/anton/node_modules/...`. Harness sessions spawned `codex
app-server` with a non-existent shim and the model hallucinated tool
calls against a connector surface that couldn't actually reach anything.
Replaced with a single-source-of-truth `buildMcpSpawnConfig()` that
resolves the shim via this module's own `import.meta.url`, uses
`process.execPath` instead of the literal `"node"` (systemd services
don't inherit PATH), and exposes the resolved `shimPath` for
diagnostics. Added `probeMcpShim()` — a 5s JSON-RPC `initialize`
round-trip run on boot and every 60s; failures are logged with stderr
tail + shim dir for ops visibility. When the probe fails the server
omits the capability block and passes an empty connector list to
harness sessions so the model stops believing it has tools it can't
call. The shim now embeds the package version in its `initialize`
response so version skew between a partial rsync deploy and the host
binary shows up as a warn log.
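A minimal sketch of the path-resolution idea described above, not the code from this PR. The `McpSpawnConfig` field names, the `moduleUrl` parameter, and the assumption that the compiled shim sits next to the calling module are all hypothetical; in real use the caller would pass its own `import.meta.url`.

```typescript
import { fileURLToPath } from "node:url";
import * as path from "node:path";

// Hypothetical shape: the PR text only confirms that `shimPath` is exposed.
interface McpSpawnConfig {
  command: string; // absolute path of the Node binary running this process
  args: string[]; // arguments passed to that binary
  shimPath: string; // resolved shim location, kept for diagnostics
}

// `moduleUrl` is expected to be the caller's import.meta.url. Taking it as a
// parameter keeps the function easy to exercise with a fixed file:// URL.
function buildMcpSpawnConfig(
  moduleUrl: string,
  shimFile = "anton-mcp-shim.js", // assumed to live beside the calling module
): McpSpawnConfig {
  // Anchor on the module's own location, never on homedir() or cwd.
  const moduleDir = path.dirname(fileURLToPath(moduleUrl));
  const shimPath = path.join(moduleDir, shimFile);
  // process.execPath is absolute, so no PATH lookup is needed under systemd.
  return { command: process.execPath, args: [shimPath], shimPath };
}
```

Because both the binary and the shim are absolute paths derived from the running install, the result is the same whether the package lives under `/opt/anton` or a home directory.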
Harness session opts now take `mcp: { socketPath, authToken, spawn:
McpSpawnConfig }` instead of flat `socketPath`/`shimPath`/`authToken`
— keeps the spawn config in one place.
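The option-shape change can be pictured roughly as follows. This is an illustrative sketch only: apart from `socketPath`, `shimPath`, `authToken`, `spawn`, and `McpSpawnConfig`, every name here (`LegacyMcpOpts`, `connectors`, `toSessionOpts`, `probeOk`) is an assumption, and the connector gating simply mirrors the probe-failure behavior described earlier in this message.

```typescript
// Hypothetical field layout for the spawn config.
interface McpSpawnConfig {
  command: string;
  args: string[];
  shimPath: string;
}

// Old flat shape (assumed type name).
interface LegacyMcpOpts {
  socketPath: string;
  shimPath: string;
  authToken: string;
}

// New nested shape: everything needed to spawn the shim travels together.
interface HarnessSessionOpts {
  mcp: { socketPath: string; authToken: string; spawn: McpSpawnConfig };
  connectors: string[]; // assumed name for the connector list
}

function toSessionOpts(
  legacy: LegacyMcpOpts,
  command: string,
  connectors: string[],
  probeOk: boolean,
): HarnessSessionOpts {
  return {
    mcp: {
      socketPath: legacy.socketPath,
      authToken: legacy.authToken,
      spawn: { command, args: [legacy.shimPath], shimPath: legacy.shimPath },
    },
    // If the last probe failed, advertise no connectors at all.
    connectors: probeOk ? connectors : [],
  };
}
```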
Session lifecycle — `handleSessionDestroy` deleted the Map entry but
never awaited `session.shutdown()`, so the codex app-server subprocess
and its MCP shim child survived until the host process died. Over
hours of normal use a VPS accumulated tens of orphaned node + codex
processes. Nothing ever evicted sessions the client forgot about
either, so the set grew without bound at ~30 MB RSS per harness
session.
Added `SessionRegistry<T extends Shutdownable>` — a bounded
LRU-ordered store with partitioned pools (conversation / routine /
ephemeral, each with independent capacity and recency floor). Every
session type goes through it: Pi SDK Session, HarnessSession,
CodexHarnessSession, and agent runs. `put()` fires background eviction
when the target pool is over capacity, skipping pinned or
within-recency-floor entries. `delete()` awaits `shutdown()` so the
destroy handler can block on cleanup. `pin()`/`unpin()` wrap active
turns so mid-stream eviction can't rip a session out. `shutdownAll()`
fans out on graceful server shutdown.
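The registry contract above could be sketched along these lines. This is a simplified illustration under stated assumptions, not the shipped `SessionRegistry`: the `PoolLimits` and `Entry` shapes, the injectable `now` clock, and the use of `Map` insertion order for LRU are choices made for the sketch.

```typescript
// Sessions may or may not expose shutdown(); a missing one is a no-op.
interface Shutdownable {
  shutdown?: () => Promise<void> | void;
}

type Pool = "conversation" | "routine" | "ephemeral";

interface PoolLimits {
  capacity: number;
  recencyFloorMs: number; // entries touched more recently than this are spared
}

interface Entry<T> {
  session: T;
  pool: Pool;
  lastUsed: number;
  pins: number;
}

class SessionRegistry<T extends Shutdownable> {
  // Map keeps insertion order, so re-inserting on access leaves the least
  // recently used entry at the front.
  private readonly entries = new Map<string, Entry<T>>();

  constructor(
    private readonly limits: Record<Pool, PoolLimits>,
    private readonly now: () => number = () => Date.now(),
  ) {}

  put(id: string, session: T, pool: Pool): void {
    this.entries.delete(id); // replace in place, without shutting down
    this.entries.set(id, { session, pool, lastUsed: this.now(), pins: 0 });
    void this.evictOverCapacity(pool); // background eviction
  }

  get(id: string): T | undefined {
    const entry = this.entries.get(id);
    if (!entry) return undefined;
    entry.lastUsed = this.now();
    this.entries.delete(id);
    this.entries.set(id, entry); // move to the most-recent position
    return entry.session;
  }

  pin(id: string): void {
    const entry = this.entries.get(id);
    if (entry) entry.pins += 1;
  }

  unpin(id: string): void {
    const entry = this.entries.get(id);
    if (entry && entry.pins > 0) entry.pins -= 1;
  }

  // Resolves only after the session's shutdown() has finished.
  async delete(id: string): Promise<boolean> {
    const entry = this.entries.get(id);
    if (!entry) return false;
    this.entries.delete(id);
    await entry.session.shutdown?.();
    return true;
  }

  async shutdownAll(): Promise<void> {
    const all = Array.from(this.entries.values());
    this.entries.clear();
    await Promise.all(
      all.map(async (entry) => {
        try { await entry.session.shutdown?.(); } catch { /* keep going */ }
      }),
    );
  }

  private async evictOverCapacity(pool: Pool): Promise<void> {
    const { capacity, recencyFloorMs } = this.limits[pool];
    const inPool = Array.from(this.entries).filter(([, e]) => e.pool === pool);
    let excess = inPool.length - capacity;
    for (const [id, entry] of inPool) {
      if (excess <= 0) break;
      if (this.entries.get(id) !== entry) continue; // replaced or deleted meanwhile
      if (entry.pins > 0 || this.now() - entry.lastUsed < recencyFloorMs) continue;
      this.entries.delete(id);
      excess -= 1;
      try { await entry.session.shutdown?.(); } catch { /* keep evicting */ }
    }
  }
}
```

A turn handler would call `pin(id)` before streaming and `unpin(id)` in a `finally` block, so an eviction triggered by another client's `put()` skips the active session.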
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
check-mcp-shim: buildMcpSpawnConfig shape, getExpectedShimVersion
non-empty, probe round-trip against the compiled shim, version match,
clean-fail (not hang) on missing binary + missing shim path. Runs via
`pnpm --filter @anton/agent-core check:mcp-shim` (compiles first
because the probe needs dist/harness/anton-mcp-shim.js).
check-session-registry: put/get, delete-awaits-shutdown, partitioned
pools, LRU eviction, pinning, recency-floor warn path,
replace-in-place (no shutdown), peek no-touch, shutdownAll,
missing-shutdown no-op, throwing-shutdown continues. Runs via
`pnpm --filter @anton/agent-core check:session-registry`.
Follows the existing tsx-runnable script convention in
src/harness/__fixtures__/check.ts; no test framework added.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
SESSION_LIFECYCLE.md describes the registry contract: categories
(conversation/routine/ephemeral), pool defaults, method invariants,
eviction policy, pinning during active turns, server-shutdown fan-out,
and the failure-mode history that motivated it (destroy-leak,
unbounded accumulation).
HARNESS_ARCHITECTURE.md: added an "MCP shim spawn + health" section
covering buildMcpSpawnConfig invariants (execPath, import.meta.url),
probeMcpShim behavior + cadence, and capability-block gating on probe
failure. Delivery-status table gains two rows for the shipped MCP
hardening + session-lifecycle work, with a pointer to the new spec.
Key file map picks up mcp-spawn-config.ts and session-registry.ts.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… disposal
SessionRegistry
- runEviction: yield a microtask before checking for id-reuse so the
race guard actually defers. Previously the sync portion of the
async function called onEvict before put(sameId) could repopulate
the slot, wiping the replacement's bookkeeping.
- Replacement detection skips onEvict but still shuts down the old
session so the subprocess is reaped.
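The race guard in the two items above might look roughly like this. It is a stripped-down sketch with hypothetical names (`MiniRegistry`, `evict`); only `runEviction`, `onEvict`, and `put` are taken from the text, and their real signatures are not shown in this PR.

```typescript
interface Shutdownable {
  shutdown(): Promise<void>;
}

class MiniRegistry<T extends Shutdownable> {
  private readonly slots = new Map<string, T>();

  constructor(private readonly onEvict: (id: string) => void) {}

  put(id: string, session: T): void {
    this.slots.set(id, session);
  }

  // Removes the victim synchronously, then finishes the eviction in the
  // background. The returned promise is exposed here only to make the
  // behavior observable.
  evict(id: string): Promise<void> {
    const victim = this.slots.get(id);
    if (!victim) return Promise.resolve();
    this.slots.delete(id);
    return this.runEviction(id, victim);
  }

  private async runEviction(id: string, victim: T): Promise<void> {
    // Yield one microtask so a put(id, ...) issued synchronously right after
    // the eviction was dispatched lands before the id-reuse check below.
    await Promise.resolve();
    const replaced = this.slots.has(id);
    // Only wipe bookkeeping for ids that nobody re-claimed.
    if (!replaced) this.onEvict(id);
    // The old session is shut down either way so its subprocess is reaped.
    await victim.shutdown();
  }
}
```

Without the `await Promise.resolve()`, the synchronous prefix of `runEviction` would reach `onEvict` before the caller's next statement had a chance to re-register the id.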
MCP IPC handler
- safeWrite short-circuits on destroyed/writableEnded sockets and
wraps conn.write in try/catch so EPIPE on a dead peer no longer
bubbles up as an unhandled error.
- rl.on('line', …) wraps the async processLine in an IIFE with its
own try/catch so thrown errors from late-arriving frames never
escape to the event emitter.
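Both guards above can be sketched without a real socket. The `WritableLike` interface and the `guardLineHandler` name are assumptions for this sketch; a Node `net.Socket` exposes the same `destroyed`, `writableEnded`, and `write` members, so it would satisfy the interface structurally.

```typescript
// Minimal structural view of a socket; net.Socket has these members.
interface WritableLike {
  destroyed: boolean;
  writableEnded: boolean;
  write(chunk: string): boolean;
}

// Returns true if the frame was handed to the socket, false if it was dropped.
function safeWrite(conn: WritableLike, frame: string): boolean {
  if (conn.destroyed || conn.writableEnded) return false; // peer already gone
  try {
    conn.write(frame);
    return true;
  } catch {
    return false; // a throwing write on a dead peer is swallowed here
  }
}

// Wraps an async line processor so a rejection can never escape into the
// event emitter that invokes the handler.
function guardLineHandler(
  processLine: (line: string) => Promise<void>,
  onError: (err: unknown) => void,
): (line: string) => void {
  return (line) => {
    void (async () => {
      try {
        await processLine(line);
      } catch (err) {
        onError(err);
      }
    })();
  };
}
```

With a readline interface the wiring would be along the lines of `rl.on("line", guardLineHandler(processLine, logError))`, where `logError` stands in for whatever logger the handler uses.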
Webhook agent-runner
- New SessionDisposer callback lets the server own the actual
SessionRegistry.delete. disposeSession() drops the local Map
entry, clears pendingInteractions (timeout + reject), clears
progressStates timers, then awaits the server disposer.
- Wired through all four eviction sites (evictSession,
switchAllSessionModels cross-provider, handleMenuAction m:s:*,
getOrCreateSession). Previously deleting just the local Map entry
orphaned codex/claude-code subprocesses on every /model switch.
- Swallow settled.catch(() => {}) on idle queue tails to prevent
stale unhandledRejection on long-lived chains.
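The disposal order in the first item above could be sketched as follows. The class name, the map layouts, and the one-pending-interaction-per-session simplification are assumptions; only `SessionDisposer`, `disposeSession`, `pendingInteractions`, and the ordering of the steps come from the text.

```typescript
// Callback supplied by the server; it owns the real registry delete.
type SessionDisposer = (sessionId: string) => Promise<void>;

interface PendingInteraction {
  timer: ReturnType<typeof setTimeout>;
  reject: (reason: Error) => void;
}

class AgentRunnerSketch {
  readonly sessions = new Map<string, unknown>();
  // Simplification: at most one pending interaction per session.
  readonly pendingInteractions = new Map<string, PendingInteraction>();
  readonly progressTimers = new Map<string, ReturnType<typeof setInterval>>();

  constructor(private readonly disposer: SessionDisposer) {}

  async disposeSession(sessionId: string): Promise<void> {
    // 1. Drop the local entry so no new work is routed to this session.
    this.sessions.delete(sessionId);

    // 2. Settle anything still waiting on the user.
    const pending = this.pendingInteractions.get(sessionId);
    if (pending) {
      clearTimeout(pending.timer);
      pending.reject(new Error("session disposed"));
      this.pendingInteractions.delete(sessionId);
    }

    // 3. Stop progress timers that would otherwise keep firing.
    const progress = this.progressTimers.get(sessionId);
    if (progress) {
      clearInterval(progress);
      this.progressTimers.delete(sessionId);
    }

    // 4. Let the server shut the session down and reap its subprocess.
    await this.disposer(sessionId);
  }
}
```

Routing every eviction site through one method like this is what keeps a `/model` switch from leaving a subprocess behind: the local `Map` delete and the server-side shutdown can no longer drift apart.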
server.ts
- Registry onEvict hook wipes activeTurns, mcpIpcServer auth,
harnessSessionContexts, and harnessExtractionCursor symmetrically
when the registry evicts a session.
- handleSessionDestroy post-await id-reuse guard skips tail cleanup
and persisted-session deletion when a new session was created for
the same id during the shutdown await.
- WebhookAgentRunner constructor gets an async disposer that
mirrors handleSessionProviderSwitch ordering.
Tests
- New session-registry case: eviction→replace race confirms onEvict
is skipped for the re-registered id but still fires for the
legitimately evicted one. 18/18 pass.
- New mcp-shim-runtime fixture: 5 integration checks covering happy
path, log notifications, connection drop reconnect, bye
notification reconnect, and auth-failure clean error.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… leak
- anton-mcp-shim: after await doConnect(), re-check state before setting
'authed'. A server-sent `bye` in the same readline chunk as auth_ok
triggers transitionToLost synchronously while we're awaited; without
this guard we'd overwrite 'lost' with 'authed' and hand callers a
half-closed socket.
- webhook agent-runner: disposeSession resolves pending plan_confirm /
ask_user interactions with { approved: false } only. The prior
feedback string was forwarded to the model as plan-revision input or
the first question's answer.
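The handshake re-check from the first item above can be illustrated with a small state machine. This is a sketch under assumptions: the class name, `getState()`, and the injected `doConnect` are hypothetical, and the real shim's transitions are richer than what is shown here.

```typescript
type ConnState = "idle" | "connecting" | "authed" | "lost";

class ShimConnectionSketch {
  private state: ConnState = "idle";

  constructor(private readonly doConnect: () => Promise<void>) {}

  getState(): ConnState {
    return this.state;
  }

  // Single funnel for every path that loses the connection.
  transitionToLost(): void {
    this.state = "lost";
  }

  async ensureAuthed(): Promise<void> {
    if (this.getState() === "authed") return;
    this.state = "connecting";
    await this.doConnect();
    // A `bye` handled while we were awaiting may already have moved us to
    // "lost". Re-read the state instead of assuming it is still "connecting".
    if (this.getState() !== "connecting") {
      throw new Error(`connection ${this.getState()} during handshake`);
    }
    this.state = "authed";
  }
}
```

The state is read through `getState()` after the await on purpose: reading the field directly would let TypeScript keep the narrowed `"connecting"` type from the earlier assignment and treat the check as pointless.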
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
OmGuptaIND added a commit that referenced this pull request on Apr 22, 2026
### Other
- fix(harness+desktop): coalesce streaming text deltas, add detached turns, cross-surface invariant (#6)
- feat(settings): mark Claude CLI as coming soon, simplify provider form (#5)
- fix(harness+server): MCP shim hardening + SessionRegistry (#4)
- fix(caddy): preserve /health and /status paths upstream to sidecar (#3)
Summary
- The MCP shim (`anton-mcp-shim.ts`) is now spawned via a resolved absolute path (`mcp-spawn-config.ts`) with explicit probe fallback; adds a 4-state connection machine (`idle`/`connecting`/`authed`/`lost`) with a single `transitionToLost` funnel, per-session IPC auth, and a ping/pong liveness loop.
- `SessionRegistry`: a bounded LRU store with `conversation`/`routine`/`ephemeral` pools. Replaces the unbounded `Map<string, Session>` in `server.ts`. `delete()` awaits `shutdown()` so codex + shim subprocesses are reaped; `pin()`/`unpin()` protect sessions while a turn is streaming; `onEvict` cleans up IPC auth + context maps before the subprocess dies. Includes a race guard for id-reuse between eviction-dispatch and the async shutdown body.
- Webhook agent-runner: on dispose, pending interactions are resolved with `{ approved: false }` (no feedback string, which was leaking into model input as plan-revision / ask_user answers).
- Handshake race: when the server sends `bye` in the same readline chunk as `auth_ok`, `transitionToLost` runs synchronously inside readline while `ensureAuthed` is awaited on `doConnect()`. Added a post-await state check so we don't overwrite `lost` with `authed` and hand callers a half-closed socket.
- `safeWrite` guards against writing to destroyed sockets; `bye` notifications are sent best-effort before subprocess kill.
- `specs/features/SESSION_LIFECYCLE.md` (new), `HARNESS_ARCHITECTURE.md` updated with MCP hardening section.
- `check-mcp-shim.ts`, `check-mcp-shim-runtime.ts`, `check-session-registry.ts` cover happy-path, reconnect/backoff, eviction, pin/unpin, race-guard, and handshake state-flap.
Test plan
- `pnpm -r build` passes
- `pnpm --filter @anton/agent-core check:mcp-shim` passes
- `pnpm --filter @anton/agent-core check:mcp-shim-runtime` passes
- `pnpm --filter @anton/agent-core check:session-registry` passes
- `conversation` pool pushed past capacity: LRU victim is evicted, its codex subprocess exits, IPC auth entry drops
🤖 Generated with Claude Code