feat: context window budget management (20% cap, loop sub-ticks) #24

@electronicBlacksmith

Description

Problem

Phantom has no guard against context window bloat. The Agent SDK manages sessions internally but there's no explicit cap on how much of the 1M context window gets consumed. Long Slack threads, loops with many iterations, and accumulated tool output can push well past safe thresholds.

Empirically, model quality degrades noticeably above ~20% context utilization (~200k tokens on Opus 4.6 1M). The system prompt is only ~5k tokens today (0.5%), so the pressure comes from conversation history and tool output accumulation.

Current state

  • System prompt: ~5k tokens (identity + environment + security + constitution + role/workflow + instructions + working memory)
  • Memory context budget: 50k tokens (configurable in memory.yaml)
  • Working memory: capped at 75 lines with compaction warning
  • Conversation history: unbounded - SDK replays full thread
  • Loop iterations: unbounded per context - each iteration appends to the same session
  • No token counting or budget enforcement anywhere in src/agent/runtime.ts

Proposed: context budget system

1. Global context budget config

Add to phantom.yaml:

context:
  max_utilization_pct: 20    # target ceiling as % of model context window
  warning_pct: 15            # emit a warning when crossing this threshold
  model_context_tokens: 1000000  # or derive from model name

This gives a hard budget of 200k tokens at 20% of a 1M-token window.
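A rough sketch of how the runtime could turn the percentage-based config into absolute token budgets. The interface and function names here are illustrative, not existing Phantom types:

```typescript
// Hypothetical shape of the proposed `context` section of phantom.yaml.
interface ContextBudgetConfig {
  maxUtilizationPct: number;   // hard ceiling, % of the model window
  warningPct: number;          // soft threshold for warnings
  modelContextTokens: number;  // e.g. 1_000_000 for a 1M-token model
}

// Derive absolute token budgets from the percentage-based config.
function resolveBudget(cfg: ContextBudgetConfig) {
  return {
    hardCapTokens: Math.floor((cfg.modelContextTokens * cfg.maxUtilizationPct) / 100),
    warnAtTokens: Math.floor((cfg.modelContextTokens * cfg.warningPct) / 100),
  };
}

const budget = resolveBudget({
  maxUtilizationPct: 20,
  warningPct: 15,
  modelContextTokens: 1_000_000,
});
// budget.hardCapTokens === 200_000, budget.warnAtTokens === 150_000
```

Resolving to absolute token counts once at startup keeps the hot path (per-query checks) to a simple integer comparison.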

2. Token estimation in runtime

Before each query() call, estimate current context usage:

  • System prompt (relatively static, measure once at startup)
  • Conversation history (count messages, estimate tokens)
  • Memory context (already budgeted at 50k max)
  • Tool output from current session

The Agent SDK doesn't expose token counts directly, so this would need estimation (chars/4 is rough but serviceable) or integration with the Anthropic token counting API.
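The chars/4 heuristic could look something like this; `SessionContext` and its fields are illustrative stand-ins for whatever the runtime actually tracks, and the estimator could later be swapped for the Anthropic token counting API:

```typescript
// Rough chars/4 token estimator; off-by-20% is acceptable given the
// headroom built into the cap.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Illustrative shape of a session's context components.
interface SessionContext {
  systemPrompt: string;   // relatively static, measured once at startup
  history: string[];      // conversation turns replayed by the SDK
  memoryContext: string;  // already budgeted at 50k max
  toolOutput: string[];   // accumulated tool results this session
}

// Sum the estimate across every component before each query() call.
function estimateSessionTokens(ctx: SessionContext): number {
  const parts = [ctx.systemPrompt, ctx.memoryContext, ...ctx.history, ...ctx.toolOutput];
  return parts.reduce((sum, part) => sum + estimateTokens(part), 0);
}
```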

3. Conversation context management

When approaching the budget ceiling in a long Slack thread:

  • Summarize older messages (keep recent N turns verbatim, compress earlier ones)
  • Or start a fresh SDK session with a context handoff summary
  • Emit a warning to the user: "This thread is getting long, I'm compacting my context to stay sharp"
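The "keep recent N verbatim, compress earlier ones" strategy could be sketched as below, where `summarize` stands in for an LLM summarization call (all names hypothetical):

```typescript
interface Turn {
  role: "user" | "assistant";
  text: string;
}

// Replace everything older than the last `keepRecent` turns with a
// single summary turn produced by the injected `summarize` function.
function compactHistory(
  history: Turn[],
  keepRecent: number,
  summarize: (turns: Turn[]) => string,
): Turn[] {
  if (history.length <= keepRecent) return history;
  const older = history.slice(0, history.length - keepRecent);
  const recent = history.slice(history.length - keepRecent);
  const summaryTurn: Turn = {
    role: "assistant",
    text: `[Compacted ${older.length} earlier turns] ${summarize(older)}`,
  };
  return [summaryTurn, ...recent];
}
```

Injecting `summarize` keeps the compaction logic testable without a live model call.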

4. Loop sub-tick continuation (new feature)

This is the most impactful piece. Today, phantom_loop runs iterations within a single session context. Each iteration's tool output, reasoning, and results accumulate. A 10-iteration loop exploring a codebase can easily hit 100k+ tokens.

Proposed: context-aware loop ticking

loop config:
  max_context_pct: 15   # per-tick budget (leave headroom for the tick itself)

When a loop tick approaches max_context_pct:

  1. Stop the current iteration cleanly
  2. Extract a continuation summary: what was accomplished, what remains, key findings
  3. Queue a sub-tick - a new SDK session that receives:
    • The original loop goal/prompt
    • The continuation summary (compact context handoff)
    • The loop's persistent state (iteration count, accumulated results)
  4. The sub-tick starts fresh with ~5k system prompt + summary, well under budget
  5. From the outside, it looks like one continuous loop - the sub-tick boundary is invisible

This is essentially garbage collection for context. The loop keeps running but periodically compacts its working memory into a summary and starts a fresh session.
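The tick-and-handoff cycle above could be structured roughly like this. All names (`LoopState`, `TickOutcome`, `runTick`) are illustrative; the real runner lives in src/loop/runner.ts and would drive an actual SDK session per tick:

```typescript
interface LoopState {
  goal: string;        // the original loop prompt, re-sent every sub-tick
  iteration: number;   // persistent across sub-tick boundaries
  results: string[];   // accumulated findings
}

interface TickOutcome {
  done: boolean;
  contextPct: number;            // estimated utilization after this tick
  continuationSummary?: string;  // written when the tick must hand off
}

const MAX_CONTEXT_PCT = 15; // per-tick budget from loop config

// Drive the loop; when a tick reports it is over budget, carry only the
// compact continuation summary into the next (fresh) session.
function runLoop(
  state: LoopState,
  runTick: (state: LoopState, handoff?: string) => TickOutcome,
  maxIterations: number,
): LoopState {
  let handoff: string | undefined;
  while (state.iteration < maxIterations) {
    const outcome = runTick(state, handoff);
    state.iteration++;
    handoff = undefined;
    if (outcome.done) break;
    if (outcome.contextPct >= MAX_CONTEXT_PCT) {
      // Sub-tick boundary: the next runTick starts a fresh SDK session
      // seeded with goal + summary instead of the bloated history.
      handoff = outcome.continuationSummary;
    }
  }
  return state;
}
```

Because `LoopState` persists outside the session, the sub-tick boundary is invisible from the caller's perspective, matching step 5 above.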

Key design questions:

  • Where does the continuation summary live? Working memory? A loop-specific state file?
  • How do we handle tool state that spans ticks (e.g., a cloned repo, a running server)?
  • Should sub-tick boundaries be visible to the user (Slack notification) or silent?
  • How accurate does the token estimation need to be? Off-by-20% is fine if the cap has headroom.

5. Observability

Add context utilization to phantom status and the web dashboard:

  • Current session token estimate
  • Peak utilization across recent sessions
  • Number of sub-tick boundaries triggered in loops
  • Warning count (sessions that crossed the warning threshold)
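An illustrative shape for the metrics that `phantom status` and the dashboard could surface (field and function names are assumptions, not existing code):

```typescript
interface ContextMetrics {
  currentSessionTokens: number;
  peakUtilizationPct: number;
  subTickBoundaries: number;
  warningCount: number;
}

// Fold one utilization sample into the running metrics.
function recordSample(
  m: ContextMetrics,
  tokens: number,
  windowTokens: number,
  warningPct: number,
): ContextMetrics {
  const pct = (tokens / windowTokens) * 100;
  return {
    currentSessionTokens: tokens,
    peakUtilizationPct: Math.max(m.peakUtilizationPct, pct),
    subTickBoundaries: m.subTickBoundaries,
    warningCount: m.warningCount + (pct >= warningPct ? 1 : 0),
  };
}
```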

Priority

Medium. The system works today because most Slack threads are short and loops have iteration caps. But as Phantom takes on longer autonomous tasks (AEGIS workflow, multi-repo SWE work), context pressure will grow. Better to have the guardrails before we hit the wall.

References

  • src/agent/runtime.ts - query() call, no token budgeting
  • src/loop/runner.ts - loop iteration runner, no context awareness
  • src/memory/context-builder.ts - memory context already has a 50k budget (good pattern to follow)
  • src/agent/prompt-assembler.ts - system prompt assembly (~5k tokens)
