Problem
Phantom has no guard against context window bloat. The Agent SDK manages sessions internally but there's no explicit cap on how much of the 1M context window gets consumed. Long Slack threads, loops with many iterations, and accumulated tool output can push well past safe thresholds.
Empirically, model quality degrades noticeably above ~20% context utilization (~200k tokens on Opus 4.6 1M). The system prompt is only ~5k tokens today (0.5%), so the pressure comes from conversation history and tool output accumulation.
Current state
- System prompt: ~5k tokens (identity + environment + security + constitution + role/workflow + instructions + working memory)
- Memory context budget: 50k tokens (configurable in `memory.yaml`)
- Working memory: capped at 75 lines with compaction warning
- Conversation history: unbounded; the SDK replays the full thread
- Loop iterations: unbounded per context; each iteration appends to the same session
- No token counting or budget enforcement anywhere in `src/agent/runtime.ts`
Proposed: context budget system
1. Global context budget config
Add to `phantom.yaml`:

```yaml
context:
  max_utilization_pct: 20       # target ceiling as % of the model context window
  warning_pct: 15               # emit a warning when crossing this threshold
  model_context_tokens: 1000000 # or derive from the model name
```

This gives a hard budget of ~200k tokens at 20%.
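The arithmetic is simple enough to sketch. A minimal helper (hypothetical names, assuming the config shape above) that derives the absolute token budgets:

```typescript
// Proposed shape of the `context` section of phantom.yaml (assumption).
interface ContextConfig {
  max_utilization_pct: number;
  warning_pct: number;
  model_context_tokens: number;
}

// Derive absolute token budgets from percentage-based config.
function deriveBudgets(cfg: ContextConfig) {
  return {
    // Hard ceiling: 20% of 1M = 200k tokens.
    hardCapTokens: Math.floor((cfg.model_context_tokens * cfg.max_utilization_pct) / 100),
    // Warning threshold: 15% of 1M = 150k tokens.
    warnTokens: Math.floor((cfg.model_context_tokens * cfg.warning_pct) / 100),
  };
}
```

Keeping the config in percentages means the same settings stay sensible if the model context window changes.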
2. Token estimation in runtime
Before each `query()` call, estimate current context usage:
- System prompt (relatively static, measure once at startup)
- Conversation history (count messages, estimate tokens)
- Memory context (already budgeted at 50k max)
- Tool output from current session
The Agent SDK doesn't expose token counts directly, so this would need estimation (chars/4 is rough but serviceable) or integration with the Anthropic token counting API.
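A rough sketch of what that estimator could look like, using the chars/4 heuristic (the type names here are illustrative, not an existing runtime API):

```typescript
// Coarse token estimate: ~4 chars per token for English text and code.
// The Anthropic token counting API would give exact numbers at the cost
// of an extra API call per check.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical breakdown of what contributes to a session's context.
interface SessionParts {
  systemPrompt: string;  // relatively static; measure once at startup
  history: string[];     // conversation messages replayed by the SDK
  memoryContext: string; // already capped at 50k tokens upstream
  toolOutput: string[];  // tool results accumulated this session
}

function estimateSessionTokens(parts: SessionParts): number {
  const historyTokens = parts.history.reduce((sum, m) => sum + estimateTokens(m), 0);
  const toolTokens = parts.toolOutput.reduce((sum, o) => sum + estimateTokens(o), 0);
  return (
    estimateTokens(parts.systemPrompt) +
    historyTokens +
    estimateTokens(parts.memoryContext) +
    toolTokens
  );
}
```

Accuracy matters less than monotonicity here: as long as the estimate grows with real usage, a cap with headroom absorbs the error.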
3. Conversation context management
When approaching the budget ceiling in a long Slack thread:
- Summarize older messages (keep recent N turns verbatim, compress earlier ones)
- Or start a fresh SDK session with a context handoff summary
- Emit a warning to the user: "This thread is getting long, I'm compacting my context to stay sharp"
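The first option above can be sketched as a pure compaction step. This assumes a `summarize` callback (a placeholder for an LLM summarization call, not an existing function):

```typescript
// Keep the most recent N turns verbatim; compress everything older into
// a single summary turn produced by the caller-supplied summarizer.
function compactHistory(
  turns: string[],
  keepRecent: number,
  summarize: (older: string[]) => string,
): string[] {
  if (turns.length <= keepRecent) return turns; // nothing to compact
  const older = turns.slice(0, turns.length - keepRecent);
  const recent = turns.slice(turns.length - keepRecent);
  return [summarize(older), ...recent];
}
```

The fresh-session alternative is the same idea taken further: the summary becomes the entire inherited history.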
4. Loop sub-tick continuation (new feature)
This is the most impactful piece. Today, `phantom_loop` runs iterations within a single session context. Each iteration's tool output, reasoning, and results accumulate. A 10-iteration loop exploring a codebase can easily hit 100k+ tokens.
Proposed: context-aware loop ticking
Loop config:

```yaml
loop:
  max_context_pct: 15 # per-tick budget (leave headroom for the tick itself)
```
When a loop tick approaches `max_context_pct`:
- Stop the current iteration cleanly
- Extract a continuation summary: what was accomplished, what remains, key findings
- Queue a sub-tick - a new SDK session that receives:
  - The original loop goal/prompt
  - The continuation summary (compact context handoff)
  - The loop's persistent state (iteration count, accumulated results)
- The sub-tick starts fresh with ~5k system prompt + summary, well under budget
- From the outside, it looks like one continuous loop - the sub-tick boundary is invisible
This is essentially garbage collection for context. The loop keeps running but periodically compacts its working memory into a summary and starts a fresh session.
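One possible shape for that handoff, sketched under the assumption that the summary is a small structured record rendered into the fresh session's opening prompt (all names here are hypothetical, not an existing Phantom API):

```typescript
// Hypothetical sub-tick handoff record.
interface ContinuationSummary {
  goal: string;            // the original loop goal/prompt
  iteration: number;       // persistent loop state carried across ticks
  accomplished: string[];  // what prior iterations completed
  remaining: string[];     // what is left to do
  keyFindings: string[];   // facts the fresh session must not lose
}

// Render the handoff as the opening prompt for the fresh SDK session.
function buildSubTickPrompt(c: ContinuationSummary): string {
  return [
    `Goal: ${c.goal}`,
    `Iteration: ${c.iteration}`,
    `Done so far: ${c.accomplished.join("; ")}`,
    `Remaining: ${c.remaining.join("; ")}`,
    `Key findings: ${c.keyFindings.join("; ")}`,
  ].join("\n");
}
```

A structured record rather than free text makes the handoff inspectable and testable, which also helps answer the "where does it live" question below.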
Key design questions:
- Where does the continuation summary live? Working memory? A loop-specific state file?
- How do we handle tool state that spans ticks (e.g., a cloned repo, a running server)?
- Should sub-tick boundaries be visible to the user (Slack notification) or silent?
- How accurate does the token estimation need to be? Off-by-20% is fine if the cap has headroom.
5. Observability
Add context utilization to `phantom status` and the web dashboard:
- Current session token estimate
- Peak utilization across recent sessions
- Number of sub-tick boundaries triggered in loops
- Warning count (sessions that crossed the warning threshold)
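The metrics above fit a small record updated once per session. A sketch, assuming a hypothetical `ContextMetrics` shape and a per-session update hook:

```typescript
// Hypothetical metrics record backing `phantom status` and the dashboard.
interface ContextMetrics {
  currentSessionTokens: number; // live estimate for the active session
  peakUtilizationPct: number;   // max utilization across recent sessions
  subTickBoundaries: number;    // sub-ticks triggered by loops
  warningCount: number;         // sessions that crossed warning_pct
}

// Fold one finished session's token estimate into the metrics.
function recordSession(
  m: ContextMetrics,
  tokens: number,
  cfg: { model_context_tokens: number; warning_pct: number },
): ContextMetrics {
  const pct = (tokens / cfg.model_context_tokens) * 100;
  return {
    currentSessionTokens: tokens,
    peakUtilizationPct: Math.max(m.peakUtilizationPct, pct),
    subTickBoundaries: m.subTickBoundaries,
    warningCount: m.warningCount + (pct >= cfg.warning_pct ? 1 : 0),
  };
}
```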
Priority
Medium. The system works today because most Slack threads are short and loops have iteration caps. But as Phantom takes on longer autonomous tasks (AEGIS workflow, multi-repo SWE work), context pressure will grow. Better to have the guardrails before we hit the wall.
References
- `src/agent/runtime.ts` - `query()` call, no token budgeting
- `src/loop/runner.ts` - loop iteration runner, no context awareness
- `src/memory/context-builder.ts` - memory context already has a 50k budget (good pattern to follow)
- `src/agent/prompt-assembler.ts` - system prompt assembly (~5k tokens)