fix(framework): eliminate ephemeral-zombie source + respect retryable + HealthModule by Tengro · Pull Request #32 · anima-research/agent-framework

Tengro · 2026-05-21T09:41:40Z

Summary

Three interrelated fixes for the "subagent zombie" pattern that locked production concurrency slots for 7 days at a stretch, plus a new HealthModule for self-introspection.

1. `driveStream`: reorder `agent.reset()` before `emitTrace('inference:completed')`

emitTrace is synchronous (inline listener invocation). Previously the trace fired before agent.reset() because reset was guarded behind the await dispatchSpeech call. Listeners gating on agent.state.status === 'idle' (notably runEphemeralToCompletion) observed streaming at the trace boundary, failed their idle check, and never resolved their promise. The SubagentModule's await on that promise then held its concurrency slot indefinitely.

Production trace: search-notion-universal-driver held a slot for 7 days before a fresh spawn finally tripped the demand-driven reaper. Combined with other latent zombies, slot exhaustion timed out 3 new spawns at 120s each.
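The ordering problem can be reproduced in isolation. The sketch below is a minimal illustration, not the framework's code: `TinyAgent`, `onTrace`, and `observedStatus` are hypothetical stand-ins, and the only behavior taken from the description is that `emitTrace` runs its listeners synchronously.

```typescript
// Hypothetical, minimal stand-ins; not the framework's actual classes.
type Status = 'idle' | 'streaming';
type Listener = (event: string) => void;

class TinyAgent {
  status: Status = 'streaming';
  private listeners: Listener[] = [];

  onTrace(listener: Listener): void {
    this.listeners.push(listener);
  }

  // Synchronous emit: every listener runs before emitTrace returns.
  emitTrace(event: string): void {
    for (const listener of this.listeners) listener(event);
  }

  reset(): void {
    this.status = 'idle';
  }
}

// Returns the status a listener observes at the 'inference:completed' boundary.
function observedStatus(resetFirst: boolean): Status | undefined {
  const agent = new TinyAgent();
  let seen: Status | undefined;
  agent.onTrace((event) => {
    if (event === 'inference:completed') seen = agent.status;
  });

  if (resetFirst) {
    agent.reset(); // new order: reset, then emit
    agent.emitTrace('inference:completed');
  } else {
    agent.emitTrace('inference:completed'); // old order: emit, then reset
    agent.reset();
  }
  return seen;
}

const oldOrder = observedStatus(false); // listener sees 'streaming'; an idle gate fails
const newOrder = observedStatus(true); // listener sees 'idle'; an idle gate passes
```

With the old order, a listener that only resolves when it sees `idle` has no later event to wake it, which is consistent with the never-resolving promise described above.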

2. `runEphemeralToCompletion`: drop the status gate + add a completion watchdog

The agent.state.status === 'idle' gate is removed: inference:completed is only emitted from case 'complete', which is terminal by construction (mid-tool-cycle uses inference:stream_resumed, not completed). The gate was order-fragile defense for a bug now fixed at the source.
A new 15-minute idle-deadline watchdog rejects and cleans up if no trace event addressed to the agent arrives for 15 minutes after inference starts. Every addressed event refreshes the deadline, so long-running streams aren't penalized.
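The refresh-on-activity behavior can be sketched as a deadline tracker. This is an assumption-laden illustration: `IdleDeadline`, `refresh`, and `expired` are hypothetical names, and time is passed in explicitly so the logic is easy to follow. The actual watchdog presumably uses timers and rejects the pending promise.

```typescript
// Hypothetical sketch of an idle-deadline watchdog; names are illustrative.
const IDLE_LIMIT_MS = 15 * 60 * 1000; // 15 minutes, as described above

class IdleDeadline {
  private deadlineMs: number;
  private readonly limitMs: number;

  constructor(startMs: number, limitMs: number = IDLE_LIMIT_MS) {
    this.limitMs = limitMs;
    this.deadlineMs = startMs + limitMs;
  }

  // Call for every trace event addressed to the watched agent.
  refresh(nowMs: number): void {
    this.deadlineMs = nowMs + this.limitMs;
  }

  // True once no addressed event has arrived within the limit.
  expired(nowMs: number): boolean {
    return nowMs >= this.deadlineMs;
  }
}

const minute = 60 * 1000;
const watchdog = new IdleDeadline(0);

// A stream that emits an event every 10 minutes never trips the watchdog,
// even though its total runtime exceeds 15 minutes.
watchdog.refresh(10 * minute);
watchdog.refresh(20 * minute);
const aliveAt25 = watchdog.expired(25 * minute); // false: deadline is minute 35

// Silence from minute 20 onward trips it at minute 35.
const deadAt35 = watchdog.expired(35 * minute); // true
```

The point of an idle deadline over a fixed total timeout is that only silence counts against the agent, not runtime.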

3. `DefaultErrorPolicy`: respect `MembraneError.retryable`

Previously the policy retried inferences blindly up to maxRetries=3 with exponential backoff, ignoring the membrane's retryable classification. 400 invalid_request errors (e.g. orphan tool_use_id from compression bugs) are guaranteed not to change between retries; the framework was burning 4 inferences per error (1 + 3 retries) before giving up.

Now, if MembraneError.retryable === false, the error is terminal on attempt 0. The policy also honors retryAfterMs when present (rate-limit hints).

Companion PR in membrane: classify 400s as invalid_request type (instead of falling through to unknown). Both PRs work independently — unknown errors are also retryable: false, so the framework fix triggers either way.
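A rough sketch of the decision logic described above, under stated assumptions: `ErrorLike`, `Decision`, `decide`, and the 1-second backoff base are hypothetical and not taken from the framework or from membrane. Only `retryable`, `retryAfterMs`, and `maxRetries=3` come from the description.

```typescript
// Hypothetical shapes; the real MembraneError and policy types may differ.
interface ErrorLike {
  retryable: boolean;
  retryAfterMs?: number;
}

type Decision =
  | { action: 'fail' }
  | { action: 'retry'; delayMs: number };

const MAX_RETRIES = 3; // maxRetries from the description above
const BASE_DELAY_MS = 1000; // assumed base for exponential backoff

function decide(error: ErrorLike, attempt: number): Decision {
  // Non-retryable errors are terminal immediately, even on attempt 0.
  if (!error.retryable) return { action: 'fail' };
  // Retry budget exhausted.
  if (attempt >= MAX_RETRIES) return { action: 'fail' };
  // Prefer the provider's hint; otherwise back off exponentially.
  const delayMs = error.retryAfterMs ?? BASE_DELAY_MS * 2 ** attempt;
  return { action: 'retry', delayMs };
}

// Counts inferences spent on one error that never succeeds.
function inferencesSpent(error: ErrorLike): number {
  let attempt = 0;
  while (decide(error, attempt).action === 'retry') attempt++;
  return attempt + 1;
}

const spentOnInvalidRequest = inferencesSpent({ retryable: false }); // 1
const spentOnTransient = inferencesSpent({ retryable: true }); // 4 (1 + 3 retries)
```

Checking `retryable` before the retry budget is what turns the 4-inference loop into a single failed inference for errors that cannot succeed on retry.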

4. `HealthModule` (new): framework self-introspection

New built-in module exposing health--snapshot:

{
  "window": { "lookback": 20, "inferencesInWindow": 20 },
  "inferences": {
    "successCount": 18,
    "errorCount": 2,
    "recentErrors": [{ "timestamp": ..., "agentName": "clerk", "error": "400 ..." }]
  },
  "tokenTotalsByAgent": { "clerk": { "input": 2983925, ... } },
  "subagents": [
    { "name": "search-notion-universal-driver", "runtimeSeconds": 605000,
      "silentSeconds": 605000, "status": "running" }
  ],
  "modules": ["subagent", "workspace", "health", ...]
}

Read-only. Designed for agents to self-diagnose after a concurrency timeout or 400 burst without an operator having to ssh in. Reads modules/subagent/state directly from chronicle, so it works with any host's SubagentModule layout.
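As an illustration of how a consumer might use the snapshot, the sketch below types a subset of the fields shown in the example and flags long-silent subagents. The interfaces are partial and hypothetical, `suspectedZombies` and the 15-minute threshold are assumptions, and the `fresh-worker` entry is made up for contrast.

```typescript
// Partial, hypothetical typing of the snapshot; only fields shown in the
// example above are included, and the real schema may carry more.
interface SubagentEntry {
  name: string;
  runtimeSeconds: number;
  silentSeconds: number;
  status: string;
}

interface HealthSnapshot {
  inferences: { successCount: number; errorCount: number };
  subagents: SubagentEntry[];
}

// Flags running subagents that have been silent longer than a threshold.
function suspectedZombies(snapshot: HealthSnapshot, silentLimitSeconds: number): string[] {
  return snapshot.subagents
    .filter((s) => s.status === 'running' && s.silentSeconds > silentLimitSeconds)
    .map((s) => s.name);
}

const snapshot: HealthSnapshot = {
  inferences: { successCount: 18, errorCount: 2 },
  subagents: [
    // Values from the example snapshot above (605000 s is roughly 7 days).
    { name: 'search-notion-universal-driver', runtimeSeconds: 605000, silentSeconds: 605000, status: 'running' },
    // Hypothetical healthy entry, added for contrast.
    { name: 'fresh-worker', runtimeSeconds: 40, silentSeconds: 5, status: 'running' },
  ],
};

const zombies = suspectedZombies(snapshot, 15 * 60);
const errorRate =
  snapshot.inferences.errorCount /
  (snapshot.inferences.successCount + snapshot.inferences.errorCount);
```

Here `zombies` contains only `search-notion-universal-driver`, and `errorRate` is 2 errors out of 20 inferences.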

Why now

Postmortem of production triumvirate run May 8–21:

11 inference failures from compression-induced orphan tool_use_ids (clerk × 7, reviewer × 4), each triggering 4-attempt retry loops
1 subagent zombie holding a concurrency slot for 7 days
3 new spawns timed out at 120s each waiting for the held slots
Researcher had no in-band way to diagnose the slot exhaustion until the demand-driven reaper happened to fire

Test plan

tsc --noEmit clean
All 136 framework tests pass (no regressions)
Verify health--snapshot from a live host
Trigger a 400 in dev and confirm error policy returns immediately (no retry storm)
Force-stall an ephemeral and confirm completion watchdog rejects after 15min

🤖 Generated with Claude Code

Three interrelated changes addressing the "subagent zombie" pattern that locked production concurrency slots for 7 days at a stretch:

1. driveStream: reorder `agent.reset()` before `emitTrace('inference:completed')`

`emitTrace` is synchronous (inline listener invocation). Previously the trace fired BEFORE `agent.reset()` because reset was guarded behind the `await dispatchSpeech` call. Listeners that gated on `agent.state.status === 'idle'` (notably `runEphemeralToCompletion`) observed `streaming` at the trace boundary, failed their idle check, and never resolved their promise. The SubagentModule's await on that promise then held its concurrency slot indefinitely.

Production traces showed `search-notion-universal-driver` zombie for 7 days holding 1 of 5 slots; combined with other latent zombies, slot exhaustion timed out 3 new spawns at 120s each before the demand-driven reaper fired.

2. runEphemeralToCompletion: drop status gate + add completion watchdog

The `agent.state.status === 'idle'` gate is removed: `inference:completed` is only emitted from `case 'complete'`, which is terminal by construction (mid-tool-cycle uses `inference:stream_resumed`, not `completed`). The gate was order-fragile defense for a bug that's now fixed at the source; keeping it as belt-and-suspenders, plus a new 15-minute idle-deadline watchdog that rejects + cleans up if no trace event addressed to the agent arrives for 15 minutes after inference starts. Every addressed event refreshes the deadline, so long-running streams aren't penalized.

3. DefaultErrorPolicy: respect MembraneError.retryable

Retried inferences blindly up to maxRetries=3 with exponential backoff, ignoring the membrane's `retryable` classification. 400 invalid_request errors (e.g. orphan tool_use_id from compression bugs) are guaranteed not to change between retries; the framework was burning 4 inferences per error (1 + 3 retries) before giving up. Now: if MembraneError.retryable is false, terminal on attempt 0. Honors `retryAfterMs` when present (rate limit hints from providers).

4. HealthModule (new): framework self-introspection tool

`health--snapshot` returns a structured snapshot: recent inference counts, last N errors with details, per-agent token totals, subagent registry summary (read directly from chronicle state), module list. Read-only; designed for agents to self-diagnose after a concurrency timeout or 400 burst without needing the operator to ssh in.

Tests: all 136 framework tests pass. The compaction-watchdog and retry-policy changes don't have direct test coverage yet; adding regression tests is part of follow-up work; the production behavioral change is verifiable via the new health snapshot tool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tengro mentioned this pull request May 21, 2026

fix(subagent): periodic reaper + lastActivityAt + slot-holder diagnostics anima-research/connectome-host#29

Merged

5 tasks

Anarchid merged commit e0a4206 into anima-research:main May 21, 2026

Anarchid mentioned this pull request May 21, 2026

Trace events have become load-bearing for control flow in runEphemeralToCompletion #33

Open

4 tasks

Anarchid merged 1 commit into anima-research:main from Tengro:fix/zombie-cleanup-and-retry-policy
