Skip to content

fix(framework): eliminate ephemeral-zombie source + respect retryable + HealthModule#32

Merged
Anarchid merged 1 commit into
anima-research:mainfrom
Tengro:fix/zombie-cleanup-and-retry-policy
May 21, 2026
Merged

fix(framework): eliminate ephemeral-zombie source + respect retryable + HealthModule#32
Anarchid merged 1 commit into
anima-research:mainfrom
Tengro:fix/zombie-cleanup-and-retry-policy

Conversation

@Tengro
Copy link
Copy Markdown
Collaborator

@Tengro Tengro commented May 21, 2026

Summary

Three interrelated fixes for the "subagent zombie" pattern that locked production concurrency slots for 7 days at a stretch, plus a new HealthModule for self-introspection.

1. driveStream: reorder agent.reset() before emitTrace('inference:completed')

emitTrace is synchronous (inline listener invocation). Previously the trace fired before agent.reset() because reset was guarded behind the await dispatchSpeech call. Listeners gating on agent.state.status === 'idle' (notably runEphemeralToCompletion) observed streaming at the trace boundary, failed their idle check, and never resolved their promise. The SubagentModule's await on that promise then held its concurrency slot indefinitely.

Production trace: search-notion-universal-driver held a slot for 7 days before a fresh spawn finally tripped the demand-driven reaper. Combined with other latent zombies, slot exhaustion timed out 3 new spawns at 120s each.

2. runEphemeralToCompletion: drop the status gate + add a completion watchdog

  • The agent.state.status === 'idle' gate is removed: inference:completed is only emitted from case 'complete', which is terminal by construction (mid-tool-cycle uses inference:stream_resumed, not completed). The gate was order-fragile defense for a bug now fixed at the source.
  • New 15-minute idle-deadline watchdog rejects + cleans up if no trace event addressed to the agent arrives for 15 minutes after inference starts. Every addressed event refreshes the deadline, so long-running streams aren't penalized.

3. DefaultErrorPolicy: respect MembraneError.retryable

Retried inferences blindly up to maxRetries=3 with exponential backoff, ignoring the membrane's retryable classification. 400 invalid_request errors (e.g. orphan tool_use_id from compression bugs) are guaranteed not to change between retries; the framework was burning 4 inferences per error (1 + 3 retries) before giving up.

Now: if MembraneError.retryable === false, terminal on attempt 0. Honors retryAfterMs when present (rate-limit hints).

Companion PR in membrane: classify 400s as invalid_request type (instead of falling through to unknown). Both PRs work independently — unknown errors are also retryable: false, so the framework fix triggers either way.

4. HealthModule (new): framework self-introspection

New built-in module exposing health--snapshot:

{
  \"window\": { \"lookback\": 20, \"inferencesInWindow\": 20 },
  \"inferences\": {
    \"successCount\": 18,
    \"errorCount\": 2,
    \"recentErrors\": [{ \"timestamp\": ..., \"agentName\": \"clerk\", \"error\": \"400 ...\" }]
  },
  \"tokenTotalsByAgent\": { \"clerk\": { \"input\": 2983925, ... } },
  \"subagents\": [
    { \"name\": \"search-notion-universal-driver\", \"runtimeSeconds\": 605000,
      \"silentSeconds\": 605000, \"status\": \"running\" }
  ],
  \"modules\": [\"subagent\", \"workspace\", \"health\", ...]
}

Read-only. Designed for agents to self-diagnose after a concurrency timeout or 400 burst without operator ssh-in. Reads modules/subagent/state directly from chronicle, so it works with any host's SubagentModule layout.

Why now

Postmortem of production triumvirate run May 8–21:

  • 11 inference failures from compression-induced orphan tool_use_ids (clerk × 7, reviewer × 4), each triggering 4-attempt retry loops
  • 1 subagent zombie holding a concurrency slot for 7 days
  • 3 new spawns timed out at 120s each waiting for the held slots
  • Researcher had no in-band way to diagnose the slot exhaustion until the demand-driven reaper happened to fire

Test plan

  • tsc --noEmit clean
  • All 136 framework tests pass (no regressions)
  • Verify health--snapshot from a live host
  • Trigger a 400 in dev and confirm error policy returns immediately (no retry storm)
  • Force-stall an ephemeral and confirm completion watchdog rejects after 15min

🤖 Generated with Claude Code

Three interrelated changes addressing the "subagent zombie" pattern that
locked production concurrency slots for 7 days at a stretch:

1. driveStream: reorder `agent.reset()` before `emitTrace('inference:completed')`

   `emitTrace` is synchronous (in-line listener invocation). Previously the
   trace fired BEFORE `agent.reset()` because reset was guarded behind the
   `await dispatchSpeech` call. Listeners that gated on
   `agent.state.status === 'idle'` (notably `runEphemeralToCompletion`)
   observed `streaming` at the trace boundary, failed their idle check, and
   never resolved their promise. The SubagentModule's await on that promise
   then held its concurrency slot indefinitely.

   Production traces showed `search-notion-universal-driver` zombie for
   7 days holding 1 of 5 slots; combined with other latent zombies, slot
   exhaustion timed out 3 new spawns at 120s each before the demand-driven
   reaper fired.

2. runEphemeralToCompletion: drop status gate + add completion watchdog

   The `agent.state.status === 'idle'` gate is removed: `inference:completed`
   is only emitted from `case 'complete'`, which is terminal by construction
   (mid-tool-cycle uses `inference:stream_resumed`, not `completed`). The
   gate was order-fragile defense for a bug that's now fixed at the source;
   keeping it as belt-and-suspenders, plus a new 15-minute idle-deadline
   watchdog that rejects + cleans up if no trace event addressed to the
   agent arrives for 15 minutes after inference starts. Every addressed
   event refreshes the deadline, so long-running streams aren't penalized.

3. DefaultErrorPolicy: respect MembraneError.retryable

   Retried inferences blindly up to maxRetries=3 with exponential backoff,
   ignoring the membrane's `retryable` classification. 400 invalid_request
   errors (e.g. orphan tool_use_id from compression bugs) are guaranteed
   not to change between retries; the framework was burning 4 inferences
   per error (1 + 3 retries) before giving up. Now: if MembraneError.retryable
   is false, terminal on attempt 0. Honors `retryAfterMs` when present
   (rate limit hints from providers).

4. HealthModule (new): framework self-introspection tool

   `health--snapshot` returns a structured snapshot — recent inference
   counts, last N errors with details, per-agent token totals, subagent
   registry summary (read directly from chronicle state), module list.
   Read-only; designed for agents to self-diagnose after a concurrency
   timeout or 400 burst without needing the operator to ssh in.

Tests: all 136 framework tests pass. The compaction-watchdog and retry-
policy changes don't have direct test coverage yet — adding regression
tests is part of follow-up work; the production behavioral change is
verifiable via the new health snapshot tool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants