
Pipeline Plan 154

ezigus edited this page Mar 15, 2026 · 1 revision

Implementation Plan: Build Loop Context Exhaustion Prevention

Alternatives Considered

Approach A: Token-based threshold monitoring with proactive summarization (CHOSEN)

  • Monitor accumulated LOOP_INPUT_TOKENS + LOOP_OUTPUT_TOKENS per iteration against a configurable threshold (70% of model context window)
  • When threshold is crossed, generate a compressed state summary and trigger a session restart with that summary injected
  • Pros: Uses existing token tracking infrastructure, leverages existing session restart mechanism, minimal new code
  • Blast radius: Small — adds a new check in the main loop, a new summarization function, and a new module file
  • Trade-offs: Token counts from Claude CLI are per-iteration (not cumulative conversation context), so cumulative context usage is an estimate obtained by summing them across iterations

Approach B: Character-based prompt size tracking only

  • Only track prompt character growth across iterations
  • Pros: Simpler, no new dependency on token math
  • Cons: Doesn't account for Claude's internal conversation context accumulation; prompt size alone doesn't reflect total context usage
  • Rejected: Insufficient — the real exhaustion happens in Claude's conversation context, not just our injected prompt

Approach C: Claude CLI --max-tokens monitoring via stderr parsing

  • Parse Claude CLI stderr for context limit warnings
  • Pros: Most accurate
  • Cons: Fragile (depends on CLI output format), reactive not proactive
  • Rejected: We want proactive prevention, not reactive recovery

Design Decision

Approach A — Track cumulative token usage across iterations and proactively trigger summarization + session restart at 70% of the context window. This builds on:

  1. Existing accumulate_loop_tokens() in sw-loop.sh (already tracks per-iteration tokens)
  2. Existing run_loop_with_restarts() session restart mechanism
  3. Existing emit_event telemetry system

The key insight: each Claude CLI invocation receives the full conversation context, so total token usage grows as iterations accumulate. We track cumulative input and output tokens (LOOP_INPUT_TOKENS + LOOP_OUTPUT_TOKENS) as a proxy for context window usage and trigger summarization before the limit is reached.

Files to Modify

| File | Action | Purpose |
| --- | --- | --- |
| scripts/lib/loop-context-monitor.sh | CREATE | New module: context threshold monitoring + state summarization |
| scripts/sw-loop.sh | MODIFY | Source new module, add context check in main loop, wire summarization into restart |
| scripts/lib/loop-iteration.sh | MODIFY | Add cumulative token tracking variable, emit enhanced context metrics |
| scripts/sw-loop-test.sh | MODIFY | Add test cases for context monitoring and summarization |

Implementation Steps

Step 1: Create scripts/lib/loop-context-monitor.sh

New module with these functions:

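# Module guard: skip re-sourcing if the module is already loaded
[[ -n "${_LOOP_CONTEXT_MONITOR_LOADED:-}" ]] && return 0
_LOOP_CONTEXT_MONITOR_LOADED=1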

# Constants
CONTEXT_WINDOW_TOKENS=${CONTEXT_WINDOW_TOKENS:-200000}  # Default: 200k (Opus/Sonnet)
CONTEXT_EXHAUSTION_THRESHOLD=${CONTEXT_EXHAUSTION_THRESHOLD:-70}  # Trigger at 70%
CONTEXT_SUMMARIZATION_TRIGGERED=false

# check_context_exhaustion()
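# - Computes cumulative_tokens = LOOP_INPUT_TOKENS + LOOP_OUTPUT_TOKENS
# - Computes usage_pct = cumulative_tokens * 100 / CONTEXT_WINDOW_TOKENS
#   (multiply before dividing: bash integer division would otherwise truncate to 0)
# - Treats CONTEXT_WINDOW_TOKENS <= 0 as "safe" to avoid division by zero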
# - If usage_pct >= CONTEXT_EXHAUSTION_THRESHOLD:
#   - Emits loop.context_exhaustion_warning event
#   - Returns 0 (threshold crossed)
# - Else returns 1 (safe)

# summarize_loop_state()
# - Writes compressed state to $LOG_DIR/context-summary.md:
#   - Goal (original, not accumulated)
#   - Iteration count
#   - Files modified (from git diff --name-only LOOP_START_COMMIT..HEAD)
#   - Last error summary (from error-summary.json)
#   - Key fixes attempted (from log entries, last 5)
#   - Test results status
# - Returns path to summary file

# get_context_usage_pct()
# - Returns current context usage as integer percentage
# - Used by telemetry and logging
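
The threshold check described above can be sketched as follows. This is a minimal illustration, not the module itself: it assumes LOOP_INPUT_TOKENS and LOOP_OUTPUT_TOKENS are plain integer variables maintained elsewhere (by accumulate_loop_tokens in this plan), and it leaves out the event emission because the emit_event signature is not shown here.

```shell
# Sketch only: integer-safe usage math for the context monitor.
# LOOP_INPUT_TOKENS / LOOP_OUTPUT_TOKENS are assumed to be set by the loop;
# they default to 0 here so the functions are safe to call at any time.

CONTEXT_WINDOW_TOKENS=${CONTEXT_WINDOW_TOKENS:-200000}
CONTEXT_EXHAUSTION_THRESHOLD=${CONTEXT_EXHAUSTION_THRESHOLD:-70}

# Print current usage as an integer percentage (rounded down).
get_context_usage_pct() {
    local window="${CONTEXT_WINDOW_TOKENS:-0}"
    local used=$(( ${LOOP_INPUT_TOKENS:-0} + ${LOOP_OUTPUT_TOKENS:-0} ))
    # Guard against division by zero: report 0% for a non-positive window.
    if [[ "$window" -le 0 ]]; then
        echo 0
        return 0
    fi
    # Multiply before dividing; bash arithmetic truncates toward zero.
    echo $(( used * 100 / window ))
}

# Return 0 when the threshold is reached or exceeded, 1 otherwise.
check_context_exhaustion() {
    local pct
    pct="$(get_context_usage_pct)"
    if [[ "$pct" -ge "$CONTEXT_EXHAUSTION_THRESHOLD" ]]; then
        return 0
    fi
    return 1
}
```

With the defaults shown, the check trips once the accumulated total reaches 140,000 tokens (70% of 200,000).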

Step 2: Wire into main loop (sw-loop.sh)

  1. Source the new module near the top (after other lib sources)
  2. After accumulate_loop_tokens() call in the main loop (~line 2166), add:
    # Context exhaustion prevention
    if check_context_exhaustion; then
        warn "Context usage at $(get_context_usage_pct)% — triggering proactive summarization"
        summarize_loop_state
        STATUS="context_exhaustion"
        write_state
        write_progress
        break  # Exit to restart wrapper
    fi
  3. In run_loop_with_restarts(), handle STATUS="context_exhaustion" as a restart-worthy condition (alongside "stuck_restart")
  4. When restarting after context exhaustion, inject the summary from context-summary.md into the restart context

Step 3: Enhanced token tracking in loop-iteration.sh

Add to run_claude_iteration() after the accumulate_loop_tokens call:

  • Emit loop.context_usage event with: cumulative_input, cumulative_output, usage_pct, threshold
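
One possible shape for the per-iteration event is sketched below. The real emit_event signature is not documented in this plan, so the sketch uses a hypothetical stand-in, emit_event_stub, that prints the event name followed by key=value pairs; emit_context_usage is likewise an illustrative name.

```shell
# Hypothetical stand-in for the project's emit_event; the real signature may differ.
emit_event_stub() {
    local name="$1"
    shift
    printf '%s' "$name"
    local kv
    for kv in "$@"; do
        printf ' %s' "$kv"
    done
    printf '\n'
}

# Emit the four fields named in Step 3 (illustrative helper, not existing code).
emit_context_usage() {
    local in_tok="${LOOP_INPUT_TOKENS:-0}"
    local out_tok="${LOOP_OUTPUT_TOKENS:-0}"
    local window="${CONTEXT_WINDOW_TOKENS:-200000}"
    local threshold="${CONTEXT_EXHAUSTION_THRESHOLD:-70}"
    local pct=0
    # Same integer math as the threshold check, with a zero-window guard.
    if [[ "$window" -gt 0 ]]; then
        pct=$(( (in_tok + out_tok) * 100 / window ))
    fi
    emit_event_stub "loop.context_usage" \
        "cumulative_input=$in_tok" \
        "cumulative_output=$out_tok" \
        "usage_pct=$pct" \
        "threshold=$threshold"
}
```

Emitting the threshold alongside the percentage keeps each event self-describing if the threshold is changed between runs.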

Step 4: Summarization state preservation

In summarize_loop_state():

  • Read ORIGINAL_GOAL (already preserved in sw-loop.sh)
  • Read git diff stat from LOOP_START_COMMIT
  • Extract error patterns from error-summary.json
  • Extract last 5 log entries from LOG_ENTRIES
  • Write to $LOG_DIR/context-summary.md in structured format
  • The restart mechanism already copies error-summary.json and reads progress.md — we add context-summary.md to the restart context injection
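
A rough sketch of the summary writer follows. Several details are assumptions made for illustration: ITERATION and TEST_STATUS are hypothetical variable names, LOG_ENTRIES is treated as a newline-separated string, and the error-summary.json section is omitted because its schema is not shown in this plan. The 2000-character cap follows the mitigation noted under Systematic Debugging Notes.

```shell
# Sketch of summarize_loop_state under the assumptions stated above.
summarize_loop_state() {
    local out="${LOG_DIR:-.}/context-summary.md"
    local max_chars=2000
    local files="(unavailable)"
    local summary

    # List files changed since the loop started; tolerate a missing repo or commit.
    if [[ -n "${LOOP_START_COMMIT:-}" ]]; then
        files="$(git diff --name-only "${LOOP_START_COMMIT}..HEAD" 2>/dev/null)" \
            || files="(unavailable)"
    fi

    summary="$(
        echo "# Loop context summary"
        echo
        echo "Goal: ${ORIGINAL_GOAL:-unknown}"
        echo "Iteration: ${ITERATION:-0}"
        echo "Test status: ${TEST_STATUS:-unknown}"
        echo
        echo "Files modified:"
        echo "$files"
        echo
        echo "Recent log entries:"
        printf '%s\n' "${LOG_ENTRIES:-}" | tail -n 5
    )"

    # Truncate in the shell rather than piping to head, so no writer
    # receives SIGPIPE when pipefail is enabled.
    printf '%s\n' "${summary:0:$max_chars}" > "$out"
    echo "$out"
}
```

Because truncation cuts from the end, the goal and status lines at the top survive even when the file list or log entries are long.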

Step 5: Restart integration

In run_loop_with_restarts() (~line 2389):

  • Add context_exhaustion to the list of restartable statuses
  • When restarting after context_exhaustion:
    • Inject context-summary.md content into GOAL as "## Previous Session Context (Summarized)"
    • Reset token counters
    • Emit loop.context_exhaustion_restart event
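
The restart handling could look roughly like the sketch below. The helper names is_restartable_status and prepare_context_restart are hypothetical and do not exist in sw-loop.sh; the sketch also assumes GOAL is rebuilt from ORIGINAL_GOAL on each restart so summaries from earlier restarts do not pile up. Event emission is left out.

```shell
# Return 0 if the given loop status should trigger a session restart.
is_restartable_status() {
    case "$1" in
        stuck_restart|context_exhaustion) return 0 ;;
        *) return 1 ;;
    esac
}

# Rebuild GOAL with the summary appended, then reset the token counters.
prepare_context_restart() {
    local summary_file="${LOG_DIR:-.}/context-summary.md"
    GOAL="${ORIGINAL_GOAL:-}"
    # Only inject when the summary exists and is non-empty.
    if [[ -s "$summary_file" ]]; then
        GOAL="${GOAL}

## Previous Session Context (Summarized)
$(cat "$summary_file")"
    fi
    LOOP_INPUT_TOKENS=0
    LOOP_OUTPUT_TOKENS=0
}
```

Resetting the counters matters here: without it, the first iteration of the new session would immediately cross the threshold again.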

Step 6: Test coverage

Add to sw-loop-test.sh:

  1. Unit test: threshold calculation — Set LOOP_INPUT_TOKENS/LOOP_OUTPUT_TOKENS to known values, verify check_context_exhaustion returns correctly at <70%, =70%, >70%
  2. Unit test: summarization output — Create mock state (git, error-summary.json, log entries), run summarize_loop_state, verify output contains essential fields
  3. Unit test: context window sizing — Verify CONTEXT_WINDOW_TOKENS defaults and override via env
  4. Integration test: restart trigger — Simulate tokens exceeding threshold, verify loop breaks with context_exhaustion status and emits correct event
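
As an illustration of the boundary tests, the sketch below exercises a small stand-in for check_context_exhaustion with a window of 1000 tokens so the percentages are easy to read. The expect_status helper and the TESTS_FAILED counter are hypothetical; the actual sw-loop-test.sh harness may use different assertion helpers.

```shell
# Minimal function under test (stand-in for the real module).
check_context_exhaustion() {
    local used=$(( ${LOOP_INPUT_TOKENS:-0} + ${LOOP_OUTPUT_TOKENS:-0} ))
    local window="${CONTEXT_WINDOW_TOKENS:-200000}"
    [[ "$window" -gt 0 ]] || return 1
    [[ $(( used * 100 / window )) -ge "${CONTEXT_EXHAUSTION_THRESHOLD:-70}" ]]
}

TESTS_FAILED=0

# expect_status <expected-exit-code> <description>
expect_status() {
    local expected="$1" desc="$2" actual=0
    check_context_exhaustion || actual=$?
    if [[ "$actual" -eq "$expected" ]]; then
        echo "PASS: $desc"
    else
        echo "FAIL: $desc (expected $expected, got $actual)"
        TESTS_FAILED=$(( TESTS_FAILED + 1 ))
    fi
}

CONTEXT_WINDOW_TOKENS=1000
CONTEXT_EXHAUSTION_THRESHOLD=70
LOOP_OUTPUT_TOKENS=0

LOOP_INPUT_TOKENS=699; expect_status 1 "below threshold (69%)"
LOOP_INPUT_TOKENS=700; expect_status 0 "at threshold (70%)"
LOOP_INPUT_TOKENS=701; expect_status 0 "above threshold"
LOOP_INPUT_TOKENS=0;   expect_status 1 "zero tokens reported"
CONTEXT_WINDOW_TOKENS=0; LOOP_INPUT_TOKENS=700; expect_status 1 "zero window is guarded"
```

The last two cases cover the edge cases listed under Critical Paths to Test: zero reported tokens and a zero-sized window.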

Task Checklist

  • Task 1: Create scripts/lib/loop-context-monitor.sh with module guard, constants, check_context_exhaustion(), summarize_loop_state(), get_context_usage_pct()
  • Task 2: Source the new module in sw-loop.sh (near line 28 with other lib sources)
  • Task 3: Add context exhaustion check in main loop after accumulate_loop_tokens call (~line 2166 in run_single_agent_loop)
  • Task 4: Handle context_exhaustion status in run_loop_with_restarts() — allow restart with summary injection
  • Task 5: Add loop.context_exhaustion_warning and loop.context_exhaustion_restart event emissions
  • Task 6: Emit loop.context_usage event per iteration with cumulative token usage percentage
  • Task 7: Add threshold calculation unit tests to sw-loop-test.sh
  • Task 8: Add summarization output unit tests to sw-loop-test.sh
  • Task 9: Add restart trigger integration test to sw-loop-test.sh
  • Task 10: Verify existing tests still pass after changes

Testing Approach

Test Pyramid Breakdown

  • Unit tests (7): Threshold math at boundary values (<70%, =70%, >70%), summarization output validation (4 field checks), context window default/override
  • Integration tests (2): Full loop restart on context exhaustion, event emission verification
  • E2E tests (1): Existing sw-loop-test.sh regression (no breakage)

Coverage Targets

  • 100% branch coverage on check_context_exhaustion(), exercised at 3 boundary cases (under/at/over threshold)
  • 100% coverage on summarize_loop_state() output fields
  • Existing test suite remains green

Critical Paths to Test

  • Happy path: Loop runs under threshold, no summarization triggered
  • Error case 1: Tokens exceed 70% threshold mid-loop — summarization fires, loop breaks gracefully
  • Error case 2: Tokens exceed threshold on first iteration (huge prompt) — handled without crash
  • Edge case 1: Zero tokens reported (jq unavailable) — no false positive trigger
  • Edge case 2: CONTEXT_WINDOW_TOKENS set to 0 — division by zero protection

Risk Analysis

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Token counts are per-iteration, not cumulative conversation context | Underestimates true usage | Accumulate across iterations; use conservative 70% threshold |
| False positive triggers (threshold too aggressive) | Unnecessary restarts | Make threshold configurable via env/config; default 70% is conservative |
| Summary too lossy, critical context dropped | Regression after restart | Include: goal, files modified, error patterns, test status, last 5 log entries |
| Division by zero if CONTEXT_WINDOW_TOKENS=0 | Script crash | Guard with [[ "$window" -gt 0 ]] check |

Definition of Done

  • check_context_exhaustion() correctly identifies when cumulative tokens reach or exceed the configured threshold (default 70% of the context window)
  • summarize_loop_state() produces compressed state with: goal, iteration count, modified files, error patterns, test status
  • Loop continues seamlessly after summarization-triggered restart without losing critical context
  • loop.context_exhaustion_warning event emitted when threshold crossed (observable in events.jsonl)
  • loop.context_exhaustion_restart event emitted when restart occurs
  • Per-iteration loop.context_usage event includes cumulative token percentage
  • All new code has test coverage (threshold boundaries, summarization output, restart trigger)
  • Existing test suite passes without regression
  • Bash 3.2 compatible (no associative arrays, no ${var,,})
  • Uses set -euo pipefail and module guard pattern

Threat Model (STRIDE)

  • Spoofing/Tampering/Repudiation/Elevation: Not applicable — this is internal shell logic, no auth or external input
  • Information Disclosure: context-summary.md contains goal/error text — same sensitivity as existing progress.md. No secrets involved.
  • Denial of Service: False positive summarization could cause unnecessary restarts. Mitigated by configurable threshold and conservative default.

Auth Flow

Not applicable — no authentication involved in this feature.

Input Validation Points

  • CONTEXT_WINDOW_TOKENS from env — validated as integer >0
  • CONTEXT_EXHAUSTION_THRESHOLD from env — validated as integer 1-99
  • Token values from accumulate_loop_tokens — already validated in existing code
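
The env validation could be sketched as below. The helper name validated_int is hypothetical, and two behaviours are assumptions rather than part of the plan: invalid values fall back to the default instead of aborting, and the window is capped at an arbitrary upper bound of 2147483647.

```shell
# Print the value if it is a base-10 integer within [min, max];
# otherwise print the fallback. 10# avoids octal parsing of "08" or "099".
validated_int() {
    local value="$1" min="$2" max="$3" fallback="$4"
    if [[ "$value" =~ ^[0-9]+$ ]] && (( 10#$value >= min && 10#$value <= max )); then
        echo $(( 10#$value ))
    else
        echo "$fallback"
    fi
}

# Window must be an integer > 0; threshold must be an integer in 1-99.
CONTEXT_WINDOW_TOKENS="$(validated_int "${CONTEXT_WINDOW_TOKENS:-}" 1 2147483647 200000)"
CONTEXT_EXHAUSTION_THRESHOLD="$(validated_int "${CONTEXT_EXHAUSTION_THRESHOLD:-}" 1 99 70)"
```

Validating once at module load means the later arithmetic never sees an empty, negative, or non-numeric value.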

Security Checklist

  • No secrets in code
  • No external input from users (internal orchestration only)
  • No network calls added
  • No file path injection risk (all paths derived from existing LOG_DIR/PROJECT_ROOT)

Monitoring Checklist

P0 — Immediate

  • loop.context_exhaustion_warning event fires when expected (threshold crossed)
  • Loop does not crash or hang when summarization triggers

P1 — Short-term

  • loop.context_usage events show monotonically increasing token percentages within a session (counters reset after a context-exhaustion restart)
  • Restart after context exhaustion produces working sessions (not stuck loops)

Anomaly Detection Triggers

  • context_exhaustion_warning firing on iteration 1 = prompt too large or threshold misconfigured
  • Multiple consecutive context_exhaustion_restart events = possible infinite restart loop (guarded by MAX_RESTARTS)

Log Analysis

  • Search for "context_exhaustion" in events.jsonl
  • Verify context-summary.md written before each exhaustion restart
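
A quick way to run this search is sketched below. It relies only on plain substring matching, since the exact events.jsonl schema is not shown in this plan; count_events and report_context_exhaustion are hypothetical helper names.

```shell
# Count lines in an events file that mention a given event name.
count_events() {
    local events_file="$1" name="$2"
    if [[ ! -f "$events_file" ]]; then
        echo 0
        return 0
    fi
    # grep -c prints 0 but exits 1 when nothing matches; tolerate that.
    grep -c -F -- "$name" "$events_file" || true
}

# Print a one-line report of warnings versus restarts.
report_context_exhaustion() {
    local events_file="$1"
    local warnings restarts
    warnings="$(count_events "$events_file" "loop.context_exhaustion_warning")"
    restarts="$(count_events "$events_file" "loop.context_exhaustion_restart")"
    echo "warnings=$warnings restarts=$restarts"
}
```

A warning count well above the restart count would suggest the loop is breaking without the restart wrapper picking up the context_exhaustion status.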

Auto-Rollback Decision Criteria

Not applicable — this is build infrastructure, not a deployed service.

Systematic Debugging Notes

Root Cause Hypothesis (for potential failures)

  1. Token accumulation undercount — Claude CLI may not report all tokens. Likelihood: medium. Evidence: compare loop-tokens.json totals vs Claude dashboard.
  2. Threshold too aggressive — 70% may trigger too early for small tasks. Likelihood: low. Evidence: check if context_exhaustion events fire on simple 2-iteration loops.
  3. Summary injection bloats restart context — If summary is too large, it defeats the purpose. Likelihood: low. Mitigation: cap summary at 2000 chars.

Evidence to Gather

  • Token accumulation values across iterations (from existing loop.context_efficiency events)
  • Actual Claude context window sizes for different models

Fix Strategy

This is a new feature, not a retry of a failed approach. Building on proven patterns: module guard, emit_event, session restart.

Verification Plan

  1. Run sw-loop-test.sh — all tests pass including new ones
  2. Run npm test — full suite green
  3. Manual verification: set CONTEXT_WINDOW_TOKENS=1000, run loop, confirm early summarization trigger
