fix: raise phantom container memory limit from 2G to 8G #52

Merged: mcheemaa merged 1 commit into main from fix/raise-phantom-memory-limit-8g on Apr 12, 2026

Summary

  • Raises phantom container memory cgroup limit from 2 GiB to 8 GiB in both docker-compose.yaml and docker-compose.user.yaml
  • Bumps memory reservation from 256 MiB to 512 MiB to match the new steady-state baseline
  • Fixes SIGKILL cascade on Claude Code judge subprocesses under evolution load

Root cause

The 2 GiB cgroup ceiling could not hold peak LLM-judge concurrency. A post-session evolution cycle spawns up to five concurrent bun + cli.js subprocesses via runJudgeQuery (observation, regression, constitution, safety, consolidation). Each judge subprocess holds 300 to 500 MiB RSS, and they run on top of the main phantom process plus whatever agent query subprocess is in flight. Peak concurrent demand lands between 2.5 and 4 GiB, which crosses the 2 GiB ceiling and triggers the container's memcg OOM killer.
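
The arithmetic behind that claim can be sketched with the figures quoted above. This is a rough back-of-the-envelope check rather than a measurement; the variable names are illustrative, and the RSS of the main phantom process and the agent query subprocess is deliberately left out because it is not broken down here.

```shell
#!/usr/bin/env bash
# Figures from the description: up to five judges, 300 to 500 MiB RSS each,
# against a 2 GiB (2048 MiB) cgroup ceiling.
judges=5
rss_low_mib=300
rss_high_mib=500
limit_mib=2048

judges_low_mib=$(( judges * rss_low_mib ))     # 1500
judges_high_mib=$(( judges * rss_high_mib ))   # 2500

# Memory left under the ceiling for the main phantom process and any
# in-flight agent query subprocess; a negative value means the judges
# alone already exceed the limit.
room_low_mib=$(( limit_mib - judges_low_mib ))    # 548
room_high_mib=$(( limit_mib - judges_high_mib ))  # -452

echo "judges: ${judges_low_mib}-${judges_high_mib} MiB"
echo "room under 2 GiB: ${room_high_mib} to ${room_low_mib} MiB"
```

Even at the low end, only about 548 MiB remains for everything else in the container, and at the high end the judges alone overshoot the ceiling by roughly 452 MiB.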

Phase 1's runJudgeQuery catches the resulting SIGKILL, the engine correctly fails closed on the safety and constitution gates, and the other judges fall back to heuristics, so the main phantom process never crashes. But every LLM judge call after the first kill fails, which defeats the point of enabling judges at all.

Evidence captured live

Observed on the wehshi Specter VM within 20 minutes of enabling LLM judges (after claude login and a restart):

  • docker stats: phantom 2GiB / 2GiB 99.98% 178.54%
  • docker inspect phantom .HostConfig.Memory: 2147483648
  • journalctl -k: repeated Memory cgroup out of memory: Killed process <pid> (bun) events charged to the phantom container's oom_memcg, with anon-rss per killed subprocess ranging 99 MiB to 502 MiB
  • Phantom log stream: 20+ consecutive Claude Code process terminated by signal SIGKILL messages from observation, regression, constitution, safety, and consolidation judges
  • Host free -h: 30 GiB total, 27 GiB available at the time of the kills, so this was strictly a container cap, not a VM sizing problem
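
For anyone repeating these checks on another VM, a minimal sketch of two helpers follows. The function names are hypothetical and not part of the repo; the only commands used are docker inspect, journalctl, and grep, and the helpers assume Docker and journal access on the host.

```shell
#!/usr/bin/env bash
# Convert a byte count, as printed by docker inspect, to whole GiB.
bytes_to_gib() {
  echo $(( $1 / 1024 / 1024 / 1024 ))
}

# Print a container's configured memory limit in GiB (requires Docker).
container_limit_gib() {
  local bytes
  bytes=$(docker inspect "$1" --format '{{.HostConfig.Memory}}') || return 1
  bytes_to_gib "$bytes"
}

# Count kernel memcg OOM events since a given time (requires journal access).
# grep -c exits non-zero on zero matches, hence the trailing "|| true".
memcg_oom_count() {
  journalctl -k --since "$1" --no-pager \
    | grep -c 'Memory cgroup out of memory' || true
}
```

Typical use would be `container_limit_gib phantom` followed by `memcg_oom_count "20 min ago"`.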

Sizing rationale

Hetzner CX53 (the Specter default) ships with 30 GiB RAM. With phantom at 8 GiB, qdrant at 4 GiB, and ollama at 4 GiB, total committed ceilings are 16 GiB, leaving 14 GiB of host headroom for the OS, Docker daemon, and any transient bursts. Actual steady-state phantom RSS is well under 1 GiB, so the 8 GiB cap is a generous upper bound rather than a sustained reservation.
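
The headroom figure is simple subtraction, sketched here with the numbers from the paragraph above (variable names are illustrative only):

```shell
#!/usr/bin/env bash
# Container memory ceilings and host RAM in GiB, as stated in the sizing rationale.
host_gib=30
phantom_gib=8
qdrant_gib=4
ollama_gib=4

committed_gib=$(( phantom_gib + qdrant_gib + ollama_gib ))  # 16
headroom_gib=$(( host_gib - committed_gib ))                # 14

echo "committed=${committed_gib}GiB headroom=${headroom_gib}GiB"
```

These are ceilings, not reservations, so the 16 GiB is only consumed if all three containers hit their caps at once.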

Test plan

  • docker compose up -d phantom on wehshi against the new compose
  • docker inspect phantom --format '{{.HostConfig.Memory}}' reports 8589934592
  • docker stats steady-state shows phantom usage under the new ceiling
  • Trigger evolution cycles via Slack traffic, confirm no SIGKILL log lines from runJudgeQuery
  • journalctl -k --since "10 min ago" shows no new Memory cgroup out of memory events charged to the phantom memcg
  • Existing VMs (mcheema, cheeks) to be rolled forward via docker compose up -d phantom on each
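
The docker inspect step could be scripted for the roll-forward along these lines. This is a sketch under the assumption that the container is named phantom on each VM; gib_to_bytes and verify_limit are hypothetical helper names, not existing tooling.

```shell
#!/usr/bin/env bash
# Convert whole GiB to bytes, the unit docker inspect reports.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Succeed only if the container's configured memory limit matches the
# expected size in GiB (requires Docker on the host).
verify_limit() {
  local name=$1 expected_gib=$2 actual
  actual=$(docker inspect "$name" --format '{{.HostConfig.Memory}}') || return 1
  [ "$actual" = "$(gib_to_bytes "$expected_gib")" ]
}
```

After `docker compose up -d phantom`, running `verify_limit phantom 8 && echo "limit ok"` should print the confirmation only when the limit reads 8589934592.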

Notes

Both compose files are updated because new Specter deploys use docker-compose.user.yaml (Docker Hub image), while source-built deploys use docker-compose.yaml. Keeping them consistent means every future deployment path inherits the new ceiling.
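
One way to keep the two files from drifting is to list their memory settings side by side. The sketch below is an assumption about how that might be done: it greps for the common Compose memory keys (mem_limit, mem_reservation, and memory: under deploy.resources), without claiming which of these the phantom compose files actually use, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Print every memory-related line, prefixed with file name and line number,
# so the settings in several compose files can be compared by eye.
show_memory_settings() {
  grep -HnE 'mem_limit|mem_reservation|memory:' "$@"
}
```

For example, `show_memory_settings docker-compose.yaml docker-compose.user.yaml` prints the matching lines from both files.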

The 2 GiB cgroup ceiling OOM-killed Claude Code judge subprocesses under
evolution load. A post-session evolution cycle spawns up to five concurrent
bun + cli.js subprocesses via runJudgeQuery (observation, regression,
constitution, safety, consolidation), each holding 300 to 500 MiB RSS, on
top of the main phantom process and whatever agent query subprocess is in
flight. Peak concurrent demand is 2.5 to 4 GiB, which crossed the 2 GiB
ceiling and triggered SIGKILLs that phase 1's runJudgeQuery caught and
reported as "Claude Code process terminated by signal SIGKILL", failing
closed on safety and constitution gates and dropping to heuristics on
observation and regression.

Raising the limit to 8 GiB gives generous headroom for peak judge
concurrency on a host with 30 GiB total (Hetzner CX53 default), leaving
14 GiB free after phantom (8G), qdrant (4G) and ollama (4G) caps.
Reservation bumped from 256 MiB to 512 MiB to match the healthier
steady-state baseline.

Root cause observed on the wehshi VM: the SIGKILL cascade began within
20 minutes of enabling LLM judges, journalctl kernel log showed
"Memory cgroup out of memory" events charged to the phantom container's
memcg, and docker stats reported phantom pinned at 2 GiB / 2 GiB at 99.98
percent while the host sat at 27 GiB free.

mcheemaa merged commit d07b739 into main on Apr 12, 2026
1 check passed