fix(framework): eliminate ephemeral-zombie source + respect retryable + HealthModule #32
Merged
Anarchid merged 1 commit on May 21, 2026
Three interrelated fixes, plus a new HealthModule, addressing the
"subagent zombie" pattern that locked production concurrency slots for
7 days at a stretch:
1. driveStream: reorder `agent.reset()` before `emitTrace('inference:completed')`
`emitTrace` is synchronous (in-line listener invocation). Previously the
trace fired BEFORE `agent.reset()` because reset was guarded behind the
`await dispatchSpeech` call. Listeners that gated on
`agent.state.status === 'idle'` (notably `runEphemeralToCompletion`)
observed `streaming` at the trace boundary, failed their idle check, and
never resolved their promise. The SubagentModule's await on that promise
then held its concurrency slot indefinitely.
Production traces showed `search-notion-universal-driver` as a zombie for
7 days, holding 1 of 5 slots; combined with other latent zombies, slot
exhaustion timed out 3 new spawns at 120s each before the demand-driven
reaper fired.
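A minimal sketch of the ordering problem, not the framework's actual code. `SketchAgent`, `SketchTracer`, and `observedStatus` are hypothetical stand-ins that assume only what is described above: listeners are invoked synchronously, and `reset()` moves the status from `streaming` to `idle`.

```typescript
// Hypothetical, simplified stand-ins; none of these are the framework's real types.
type Status = "idle" | "streaming";
type Listener = (event: string) => void;

class SketchAgent {
  status: Status = "streaming";

  reset(): void {
    this.status = "idle";
  }
}

class SketchTracer {
  private listeners: Listener[] = [];

  on(listener: Listener): void {
    this.listeners.push(listener);
  }

  // Synchronous: every listener runs to completion before emit() returns.
  emit(event: string): void {
    for (const listener of this.listeners) listener(event);
  }
}

// Returns the status a listener sees at the 'inference:completed' boundary.
function observedStatus(resetFirst: boolean): Status {
  const agent = new SketchAgent();
  const tracer = new SketchTracer();
  let seen: Status = "streaming";
  tracer.on((event) => {
    if (event === "inference:completed") seen = agent.status;
  });
  if (resetFirst) {
    agent.reset(); // new order: reset, then emit
    tracer.emit("inference:completed");
  } else {
    tracer.emit("inference:completed"); // old order: emit, then reset
    agent.reset();
  }
  return seen;
}

console.log(observedStatus(false)); // "streaming": an idle-gated listener never resolves
console.log(observedStatus(true)); // "idle"
```

Because the listener runs inside `emit`, whatever happens after `emit` returns is invisible to it; only the call order decides what it observes.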
2. runEphemeralToCompletion: drop status gate + add completion watchdog
The `agent.state.status === 'idle'` gate is removed: `inference:completed`
is only emitted from `case 'complete'`, which is terminal by construction
(mid-tool-cycle uses `inference:stream_resumed`, not `completed`). The
gate was order-fragile defense for a bug that's now fixed at the source,
so dropping it is belt-and-suspenders. A new 15-minute idle-deadline
watchdog rejects and cleans up if no trace event addressed to the agent
arrives for 15 minutes after inference starts. Every addressed event
refreshes the deadline, so long-running streams aren't penalized.
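The refresh-on-event behaviour can be sketched as below. This is an illustration under assumptions, not the code in this PR: `IdleDeadline`, `touch`, and `expired` are hypothetical names, and time is passed in explicitly so the example stays deterministic.

```typescript
// Hypothetical sketch of an idle-deadline watchdog; names are illustrative only.
const FIFTEEN_MINUTES_MS = 15 * 60 * 1000;

class IdleDeadline {
  private readonly idleLimitMs: number;
  private deadlineMs: number;

  constructor(startMs: number, idleLimitMs: number = FIFTEEN_MINUTES_MS) {
    this.idleLimitMs = idleLimitMs;
    this.deadlineMs = startMs + idleLimitMs;
  }

  // Call for every trace event addressed to the watched agent.
  touch(nowMs: number): void {
    this.deadlineMs = nowMs + this.idleLimitMs;
  }

  // True once idleLimitMs has passed since the start or the last touch().
  expired(nowMs: number): boolean {
    return nowMs >= this.deadlineMs;
  }
}

const minute = 60 * 1000;
const watchdog = new IdleDeadline(0);
watchdog.touch(10 * minute); // addressed event at t = 10 min
watchdog.touch(20 * minute); // addressed event at t = 20 min
console.log(watchdog.expired(34 * minute)); // false: last event was 14 min ago
console.log(watchdog.expired(35 * minute)); // true: 15 min of silence
```

In a real runner the timestamps would come from a clock, and a timer would check `expired` (or be rescheduled on each `touch`) and reject the pending promise.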
3. DefaultErrorPolicy: respect MembraneError.retryable
Previously the policy retried inferences blindly up to maxRetries=3 with
exponential backoff, ignoring the membrane's `retryable` classification.
400 invalid_request errors (e.g. orphan tool_use_id from compression bugs)
are guaranteed not to change between retries; the framework was burning
4 inferences per error (1 + 3 retries) before giving up. Now: if
MembraneError.retryable is false, the error is terminal on attempt 0. The
policy also honors `retryAfterMs` when present (rate limit hints from
providers).
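A hedged sketch of the decision logic described above. `SketchMembraneError`, `RetryDecision`, and `decideRetry` are hypothetical; only the `retryable` and `retryAfterMs` fields and `maxRetries=3` come from this description. The 1000 ms base delay, and the choice to let `retryAfterMs` replace the computed backoff, are assumptions made for illustration.

```typescript
// Hypothetical shape; only `retryable` and `retryAfterMs` are named in the PR.
interface SketchMembraneError {
  message: string;
  retryable: boolean;
  retryAfterMs?: number;
}

type RetryDecision =
  | { action: "fail" }
  | { action: "retry"; delayMs: number };

// `attempt` is 0 for the first failure. baseDelayMs is an assumed value.
function decideRetry(
  error: SketchMembraneError,
  attempt: number,
  maxRetries: number = 3,
  baseDelayMs: number = 1000,
): RetryDecision {
  // Non-retryable errors are terminal immediately, even on attempt 0.
  if (!error.retryable) return { action: "fail" };
  // Retry budget exhausted.
  if (attempt >= maxRetries) return { action: "fail" };
  // Exponential backoff, unless the provider supplied a hint (assumed to win).
  const backoffMs = baseDelayMs * 2 ** attempt;
  return { action: "retry", delayMs: error.retryAfterMs ?? backoffMs };
}

const badRequest: SketchMembraneError = { message: "400 invalid_request", retryable: false };
const rateLimited: SketchMembraneError = { message: "rate limited", retryable: true, retryAfterMs: 2500 };

console.log(decideRetry(badRequest, 0)); // { action: "fail" }
console.log(decideRetry(rateLimited, 0)); // { action: "retry", delayMs: 2500 }
```

With this shape, a non-retryable 400 costs one inference instead of four.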
4. HealthModule (new): framework self-introspection tool
`health--snapshot` returns a structured snapshot — recent inference
counts, last N errors with details, per-agent token totals, subagent
registry summary (read directly from chronicle state), module list.
Read-only; designed for agents to self-diagnose after a concurrency
timeout or 400 burst without needing the operator to ssh in.
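As a rough illustration of how the inference portion of such a snapshot could be assembled: the field names below (`window`, `lookback`, `inferencesInWindow`, `successCount`, `errorCount`, `recentErrors`) mirror the example snapshot in the PR description, while `InferenceRecord` and `summarizeInferences` are hypothetical and do not reflect how HealthModule actually stores or reads its data.

```typescript
// Hypothetical input record; the real module's storage format is not shown in the PR.
interface InferenceRecord {
  timestamp: number;
  agentName: string;
  error?: string; // present only when the inference failed
}

interface RecentError {
  timestamp: number;
  agentName: string;
  error: string;
}

interface InferenceSummary {
  window: { lookback: number; inferencesInWindow: number };
  inferences: { successCount: number; errorCount: number; recentErrors: RecentError[] };
}

// Summarizes the last `lookback` records without mutating the input (read-only).
function summarizeInferences(records: InferenceRecord[], lookback: number): InferenceSummary {
  const recent = lookback > 0 ? records.slice(-lookback) : [];
  const recentErrors: RecentError[] = [];
  for (const r of recent) {
    if (r.error !== undefined) {
      recentErrors.push({ timestamp: r.timestamp, agentName: r.agentName, error: r.error });
    }
  }
  return {
    window: { lookback, inferencesInWindow: recent.length },
    inferences: {
      successCount: recent.length - recentErrors.length,
      errorCount: recentErrors.length,
      recentErrors,
    },
  };
}

// Made-up sample data for the sketch.
const sample: InferenceRecord[] = [
  { timestamp: 1, agentName: "clerk" },
  { timestamp: 2, agentName: "clerk", error: "400 invalid_request" },
  { timestamp: 3, agentName: "clerk" },
];
console.log(JSON.stringify(summarizeInferences(sample, 2)));
```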
Tests: all 136 framework tests pass. The completion-watchdog and
retry-policy changes don't have direct test coverage yet; adding
regression tests is part of follow-up work. The production behavioral
change is verifiable via the new health snapshot tool.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Three interrelated fixes for the "subagent zombie" pattern that locked production concurrency slots for 7 days at a stretch, plus a new HealthModule for self-introspection.
1. `driveStream`: reorder `agent.reset()` before `emitTrace('inference:completed')`

`emitTrace` is synchronous (inline listener invocation). Previously the trace fired before `agent.reset()` because reset was guarded behind the `await dispatchSpeech` call. Listeners gating on `agent.state.status === 'idle'` (notably `runEphemeralToCompletion`) observed `streaming` at the trace boundary, failed their idle check, and never resolved their promise. The SubagentModule's `await` on that promise then held its concurrency slot indefinitely.

Production trace: `search-notion-universal-driver` held a slot for 7 days before a fresh spawn finally tripped the demand-driven reaper. Combined with other latent zombies, slot exhaustion timed out 3 new spawns at 120s each.

2. `runEphemeralToCompletion`: drop the status gate + add a completion watchdog

The `agent.state.status === 'idle'` gate is removed: `inference:completed` is only emitted from `case 'complete'`, which is terminal by construction (mid-tool-cycle uses `inference:stream_resumed`, not `completed`). The gate was order-fragile defense for a bug now fixed at the source.

3. `DefaultErrorPolicy`: respect `MembraneError.retryable`

Previously the policy retried inferences blindly up to `maxRetries=3` with exponential backoff, ignoring the membrane's `retryable` classification. 400 `invalid_request` errors (e.g. orphan `tool_use_id` from compression bugs) are guaranteed not to change between retries; the framework was burning 4 inferences per error (1 + 3 retries) before giving up.

Now: if `MembraneError.retryable === false`, the error is terminal on attempt 0. The policy also honors `retryAfterMs` when present (rate-limit hints).

Companion PR in membrane: classify 400s as `invalid_request` type (instead of falling through to `unknown`). Both PRs work independently: `unknown` errors are also `retryable: false`, so the framework fix triggers either way.

4. `HealthModule` (new): framework self-introspection

New built-in module exposing `health--snapshot`:

    {
      "window": { "lookback": 20, "inferencesInWindow": 20 },
      "inferences": {
        "successCount": 18,
        "errorCount": 2,
        "recentErrors": [{ "timestamp": ..., "agentName": "clerk", "error": "400 ..." }]
      },
      "tokenTotalsByAgent": { "clerk": { "input": 2983925, ... } },
      "subagents": [
        { "name": "search-notion-universal-driver", "runtimeSeconds": 605000, "silentSeconds": 605000, "status": "running" }
      ],
      "modules": ["subagent", "workspace", "health", ...]
    }

Read-only. Designed for agents to self-diagnose after a concurrency timeout or 400 burst without operator ssh-in. Reads `modules/subagent/state` directly from chronicle, so it works with any host's SubagentModule layout.

Why now
Postmortem of production triumvirate run May 8–21:
Test plan
- `tsc --noEmit` clean
- `health--snapshot` from a live host

🤖 Generated with Claude Code