fix(adapters): kill CLI process tree on timeout + skip MCP for tool-less profiles by whoabuddy · Pull Request #6 · aibtcdev/agent-runtime

whoabuddy · 2026-06-03T18:57:29Z

Problem

The agent-cli timeout path killed only the direct child (child.kill("SIGTERM")). claude-code — and any CLI that spawns stdio MCP servers or other descendants — leaves those descendants holding the stdout/stderr pipes. The run-once dispatcher reads those pipes, so the parent process never exits and the systemd oneshot service hangs in activating forever, freezing all dispatch for that agent.

Observed (2026-06-03): a single council-lens-review task on Lumen wedged the dispatcher for 2.5h — heartbeat, email-poll, everything stopped. The hung claude processes ended up in kernel D-state (__flush_work), un-killable, requiring a reboot.

Fix

1. Kill the whole process tree on timeout (fleet-wide fix). Spawn CLI children with detached: true so that each one leads its own process group. On timeout, send SIGTERM to the group, escalate to SIGKILL after a 5s grace period, tear down the stdio pipes, and unref() the child. A hung descendant can no longer hold the dispatcher (or the oneshot service) open, which protects every CLI adapter regardless of what caused the hang.

2. Don't load MCP for tool-less profiles (removes the trigger). buildClaudeArgs passes --strict-mcp-config when the profile declares no aibtc integration (integration_policies.aibtc === "none", e.g. council-lens-review). Read-only reasoning profiles need no MCP tools; loading the user-level ~/.claude.json aibtc stdio MCP is what triggered the teardown hang.
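The timeout path from item 1 can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the names signalGroup, killTree, and spawnWithTimeout are hypothetical, it uses the Node-compatible child_process API, and group signalling via a negative pid assumes a POSIX host.

```typescript
import { spawn, type ChildProcess } from "node:child_process";

const GRACE_MS = 5_000; // the 5s grace period named in the PR description

// Signal the whole process group led by `pid` (POSIX: a negative pid targets the group).
// Returns false if the signal could not be delivered, e.g. the group is already gone.
function signalGroup(pid: number, signal: "SIGTERM" | "SIGKILL"): boolean {
  try {
    process.kill(-pid, signal);
    return true;
  } catch {
    return false;
  }
}

// Hypothetical helper: terminate the child and all of its descendants.
function killTree(child: ChildProcess, graceMs: number = GRACE_MS): void {
  const pid = child.pid;
  if (pid === undefined) return; // spawn failed, nothing to kill
  signalGroup(pid, "SIGTERM");
  // Escalate unconditionally after the grace period; if the group has already
  // exited, signalGroup simply returns false. This timer keeps the event loop
  // alive for at most graceMs so the SIGKILL is actually sent.
  setTimeout(() => signalGroup(pid, "SIGKILL"), graceMs);
  // Stop reading the pipes so a descendant that still holds them open
  // cannot keep the parent waiting for end-of-stream.
  child.stdout?.destroy();
  child.stderr?.destroy();
  child.unref(); // the parent's event loop no longer waits on this child
}

// Hypothetical wrapper: spawn a CLI as the leader of its own process group
// and kill the whole group if it runs longer than timeoutMs.
function spawnWithTimeout(cmd: string, args: string[], timeoutMs: number): ChildProcess {
  const child = spawn(cmd, args, {
    detached: true, // the child becomes a process-group leader
    stdio: ["ignore", "pipe", "pipe"],
  });
  const timer = setTimeout(() => killTree(child), timeoutMs);
  // Cancel the timeout if the group leader exits in time.
  child.once("exit", () => clearTimeout(timer));
  return child;
}
```

Signalling the group rather than the direct child is the key design choice: descendants spawned by the CLI inherit the same process group unless they deliberately create their own, so one signal reaches them all.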

Scope

Only the claude-code driver hits the MCP-teardown variant, and only on VMs whose ~/.claude.json registers a stdio MCP server (today: Lumen). The process-tree-not-killed bug, however, is in the shared timeout path and affects every CLI adapter on every machine. Verification: bunx tsc --noEmit runs clean.
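For the claude-code-specific half of the fix, the argument-building rule might look like the sketch below. The Profile shape and the function signature are assumptions made for illustration; only the integration_policies.aibtc === "none" check and the --strict-mcp-config flag come from the PR text.

```typescript
// Hypothetical profile shape: only integration_policies.aibtc is taken from the PR.
interface Profile {
  name: string;
  integration_policies?: { aibtc?: string };
}

// Sketch of the rule described above: profiles that declare no aibtc integration
// get --strict-mcp-config, so the user-level MCP configuration is not loaded.
// The real buildClaudeArgs signature is not shown in the PR and may differ.
function buildClaudeArgs(profile: Profile, baseArgs: string[] = []): string[] {
  const args = [...baseArgs]; // copy so the caller's array is not mutated
  if (profile.integration_policies?.aibtc === "none") {
    args.push("--strict-mcp-config");
  }
  return args;
}

// Usage: a read-only reasoning profile such as council-lens-review.
const reviewArgs = buildClaudeArgs({
  name: "council-lens-review",
  integration_policies: { aibtc: "none" },
});
```

Because the check is a strict comparison against "none", a profile that omits integration_policies entirely keeps its current behaviour in this sketch; whether the real adapter treats a missing policy the same way is not stated in the PR.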

🤖 Generated with Claude Code

…ess profiles

The agent-cli timeout path called `child.kill("SIGTERM")` on the direct child only. claude-code (and any CLI that spawns stdio MCP servers or other descendants) leaves those descendants holding the stdout/stderr pipes, so the parent run-once process never exits and the systemd oneshot service hangs in `activating` forever, freezing ALL dispatch for that agent.

Observed on Lumen: one council-lens-review task wedged the dispatcher for 2.5h (heartbeat, email-poll, everything stopped), leaving claude procs stuck in kernel D-state.

1. Spawn CLI children `detached: true` and, on timeout, signal the whole process group (SIGTERM, then SIGKILL after a 5s grace), tear down the stdio pipes, and unref the child, so a hung descendant can never hold the dispatcher (and thus the oneshot service) open. Fleet-wide fix: protects every CLI adapter from any hang cause.

2. buildClaudeArgs: pass `--strict-mcp-config` when the profile declares no aibtc integration (`integration_policies.aibtc === "none"`, e.g. council-lens-review). Read-only reasoning profiles don't need MCP tools, and loading the user-level ~/.claude.json aibtc stdio MCP server is what triggered the teardown hang. The timeout fix above is the backstop for every other cause.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

whoabuddy merged commit 66f37c7 into main Jun 3, 2026

whoabuddy merged 1 commit into main from fix/agent-cli-hang-on-timeout
