feat: context window budget management (20% cap, loop sub-ticks) #24

@electronicBlacksmith

Description

Problem

Phantom has no guard against context window bloat. The Agent SDK manages sessions internally but there's no explicit cap on how much of the 1M context window gets consumed. Long Slack threads, loops with many iterations, and accumulated tool output can push well past safe thresholds.

Empirically, model quality degrades noticeably above ~20% context utilization (~200k tokens on Opus 4.6 1M). The system prompt is only ~5k tokens today (0.5%), so the pressure comes from conversation history and tool output accumulation.

Current state

  • System prompt: ~5k tokens (identity + environment + security + constitution + role/workflow + instructions + working memory)
  • Memory context budget: 50k tokens (configurable in memory.yaml)
  • Working memory: capped at 75 lines with compaction warning
  • Conversation history: unbounded - SDK replays full thread
  • Loop iterations: unbounded per context - each iteration appends to the same session
  • No token counting or budget enforcement anywhere in src/agent/runtime.ts

Proposed: context budget system

1. Global context budget config

Add to phantom.yaml:

context:
  max_utilization_pct: 20    # target ceiling as % of model context window
  warning_pct: 15            # emit a warning when crossing this threshold
  model_context_tokens: 1000000  # or derive from model name

This gives a hard budget of 200k tokens at 20% of a 1M-token window.
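A rough sketch of how the runtime could turn the percentage-based config into absolute token budgets. The interface and function names here are illustrative, not existing Phantom types:

```typescript
// Hypothetical shape of the proposed `context` section of phantom.yaml.
interface ContextBudgetConfig {
  maxUtilizationPct: number;   // hard ceiling, % of the model window
  warningPct: number;          // soft threshold for warnings
  modelContextTokens: number;  // e.g. 1_000_000 for a 1M-token model
}

// Derive absolute token budgets from the percentage-based config.
function resolveBudget(cfg: ContextBudgetConfig) {
  return {
    hardCapTokens: Math.floor((cfg.modelContextTokens * cfg.maxUtilizationPct) / 100),
    warnAtTokens: Math.floor((cfg.modelContextTokens * cfg.warningPct) / 100),
  };
}

const budget = resolveBudget({
  maxUtilizationPct: 20,
  warningPct: 15,
  modelContextTokens: 1_000_000,
});
// budget.hardCapTokens === 200_000, budget.warnAtTokens === 150_000
```

Resolving to absolute token counts once at startup keeps the hot path (per-query checks) to a simple integer comparison.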

2. Token estimation in runtime

Before each query() call, estimate current context usage:

  • System prompt (relatively static, measure once at startup)
  • Conversation history (count messages, estimate tokens)
  • Memory context (already budgeted at 50k max)
  • Tool output from current session

The Agent SDK doesn't expose token counts directly, so this would need estimation (chars/4 is rough but serviceable) or integration with the Anthropic token counting API.
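The chars/4 heuristic could look something like this; `SessionContext` and its fields are illustrative stand-ins for whatever the runtime actually tracks, and the estimator could later be swapped for the Anthropic token counting API:

```typescript
// Rough chars/4 token estimator; off-by-20% is acceptable given the
// headroom built into the cap.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Illustrative shape of a session's context components.
interface SessionContext {
  systemPrompt: string;   // relatively static, measured once at startup
  history: string[];      // conversation turns replayed by the SDK
  memoryContext: string;  // already budgeted at 50k max
  toolOutput: string[];   // accumulated tool results this session
}

// Sum the estimate across every component before each query() call.
function estimateSessionTokens(ctx: SessionContext): number {
  const parts = [ctx.systemPrompt, ctx.memoryContext, ...ctx.history, ...ctx.toolOutput];
  return parts.reduce((sum, part) => sum + estimateTokens(part), 0);
}
```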

3. Conversation context management

When approaching the budget ceiling in a long Slack thread:

  • Summarize older messages (keep recent N turns verbatim, compress earlier ones)
  • Or start a fresh SDK session with a context handoff summary
  • Emit a warning to the user: "This thread is getting long, I'm compacting my context to stay sharp"
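The "keep recent N verbatim, compress earlier ones" strategy could be sketched as below, where `summarize` stands in for an LLM summarization call (all names hypothetical):

```typescript
interface Turn {
  role: "user" | "assistant";
  text: string;
}

// Replace everything older than the last `keepRecent` turns with a
// single summary turn produced by the injected `summarize` function.
function compactHistory(
  history: Turn[],
  keepRecent: number,
  summarize: (turns: Turn[]) => string,
): Turn[] {
  if (history.length <= keepRecent) return history;
  const older = history.slice(0, history.length - keepRecent);
  const recent = history.slice(history.length - keepRecent);
  const summaryTurn: Turn = {
    role: "assistant",
    text: `[Compacted ${older.length} earlier turns] ${summarize(older)}`,
  };
  return [summaryTurn, ...recent];
}
```

Injecting `summarize` keeps the compaction logic testable without a live model call.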

4. Loop sub-tick continuation (new feature)

This is the most impactful piece. Today, phantom_loop runs iterations within a single session context. Each iteration's tool output, reasoning, and results accumulate. A 10-iteration loop exploring a codebase can easily hit 100k+ tokens.

Proposed: context-aware loop ticking

loop config:
  max_context_pct: 15   # per-tick budget (leave headroom for the tick itself)

When a loop tick approaches max_context_pct:

  1. Stop the current iteration cleanly
  2. Extract a continuation summary: what was accomplished, what remains, key findings
  3. Queue a sub-tick - a new SDK session that receives:
    • The original loop goal/prompt
    • The continuation summary (compact context handoff)
    • The loop's persistent state (iteration count, accumulated results)
  4. The sub-tick starts fresh with ~5k system prompt + summary, well under budget
  5. From the outside, it looks like one continuous loop - the sub-tick boundary is invisible

This is essentially garbage collection for context. The loop keeps running but periodically compacts its working memory into a summary and starts a fresh session.
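The tick-and-handoff cycle above could be structured roughly like this. All names (`LoopState`, `TickOutcome`, `runTick`) are illustrative; the real runner lives in src/loop/runner.ts and would drive an actual SDK session per tick:

```typescript
interface LoopState {
  goal: string;        // the original loop prompt, re-sent every sub-tick
  iteration: number;   // persistent across sub-tick boundaries
  results: string[];   // accumulated findings
}

interface TickOutcome {
  done: boolean;
  contextPct: number;            // estimated utilization after this tick
  continuationSummary?: string;  // written when the tick must hand off
}

const MAX_CONTEXT_PCT = 15; // per-tick budget from loop config

// Drive the loop; when a tick reports it is over budget, carry only the
// compact continuation summary into the next (fresh) session.
function runLoop(
  state: LoopState,
  runTick: (state: LoopState, handoff?: string) => TickOutcome,
  maxIterations: number,
): LoopState {
  let handoff: string | undefined;
  while (state.iteration < maxIterations) {
    const outcome = runTick(state, handoff);
    state.iteration++;
    handoff = undefined;
    if (outcome.done) break;
    if (outcome.contextPct >= MAX_CONTEXT_PCT) {
      // Sub-tick boundary: the next runTick starts a fresh SDK session
      // seeded with goal + summary instead of the bloated history.
      handoff = outcome.continuationSummary;
    }
  }
  return state;
}
```

Because `LoopState` persists outside the session, the sub-tick boundary is invisible from the caller's perspective, matching step 5 above.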

Key design questions:

  • Where does the continuation summary live? Working memory? A loop-specific state file?
  • How do we handle tool state that spans ticks (e.g., a cloned repo, a running server)?
  • Should sub-tick boundaries be visible to the user (Slack notification) or silent?
  • How accurate does the token estimation need to be? Off-by-20% is fine if the cap has headroom.

5. Observability

Add context utilization to phantom status and the web dashboard:

  • Current session token estimate
  • Peak utilization across recent sessions
  • Number of sub-tick boundaries triggered in loops
  • Warning count (sessions that crossed the warning threshold)
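An illustrative shape for the metrics that `phantom status` and the dashboard could surface (field and function names are assumptions, not existing code):

```typescript
interface ContextMetrics {
  currentSessionTokens: number;
  peakUtilizationPct: number;
  subTickBoundaries: number;
  warningCount: number;
}

// Fold one utilization sample into the running metrics.
function recordSample(
  m: ContextMetrics,
  tokens: number,
  windowTokens: number,
  warningPct: number,
): ContextMetrics {
  const pct = (tokens / windowTokens) * 100;
  return {
    currentSessionTokens: tokens,
    peakUtilizationPct: Math.max(m.peakUtilizationPct, pct),
    subTickBoundaries: m.subTickBoundaries,
    warningCount: m.warningCount + (pct >= warningPct ? 1 : 0),
  };
}
```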

Priority

Medium. The system works today because most Slack threads are short and loops have iteration caps. But as Phantom takes on longer autonomous tasks (AEGIS workflow, multi-repo SWE work), context pressure will grow. Better to have the guardrails before we hit the wall.

References

  • src/agent/runtime.ts - query() call, no token budgeting
  • src/loop/runner.ts - loop iteration runner, no context awareness
  • src/memory/context-builder.ts - memory context already has a 50k budget (good pattern to follow)
  • src/agent/prompt-assembler.ts - system prompt assembly (~5k tokens)
