Pipeline Design 154

ezigus edited this page Mar 15, 2026 · 1 revision


Design: Build loop context exhaustion prevention with proactive summarization

Context

The shipwright loop harness runs Claude Code in repeated iterations until a goal is achieved. Each iteration invokes the Claude CLI, which receives the full accumulated conversation context. As iterations accumulate, Claude's internal context window fills. When it exhausts, the session degrades or fails silently — there is no proactive mechanism to detect and prevent this.

Existing infrastructure we build on:

  • accumulate_loop_tokens() in sw-loop.sh:469 already tracks cumulative LOOP_INPUT_TOKENS and LOOP_OUTPUT_TOKENS across iterations
  • run_loop_with_restarts() in sw-loop.sh:2389 already handles session restarts (used by stuckness detection at line 2370)
  • manage_context_window() in loop-iteration.sh:8 trims the injected prompt but not the Claude conversation context
  • write_progress() in loop-progress.sh:8 and error-summary.json already capture iteration state
  • emit_event() provides telemetry

The gap: No mechanism monitors cumulative token usage against the model's context window or triggers a preemptive restart before context exhaustion occurs.

Decision

Add a new module loop-context-monitor.sh that monitors cumulative token usage as a percentage of the model context window. When usage crosses a configurable threshold (default 70%), generate a compressed state summary and break out of the main loop with a context_exhaustion status, triggering the existing restart mechanism with the summary injected.

Key design choices:

  1. Cumulative tokens as proxy: LOOP_INPUT_TOKENS grows monotonically and correlates with conversation context growth. It underestimates true context usage (doesn't account for Claude's internal reasoning), so the conservative 70% threshold compensates.

  2. Proactive, not reactive — We detect and act before hitting limits, rather than parsing CLI error output after failure.

  3. Reuse existing restart path — The run_loop_with_restarts() wrapper already handles session resets, artifact archival, and state preservation. We add context_exhaustion as a new restartable status alongside stuck_restart.

  4. Structured summary injection — On restart, inject context-summary.md into the goal so the fresh session has compressed context (goal, files changed, error patterns, test status) without full conversation history.
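To make the threshold concrete, here is a small worked example of the arithmetic, assuming the defaults named in this design (a 200000-token window and a 70% threshold). The variable threshold_tokens is introduced only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: defaults are taken from this design and can be overridden
# through the environment.
CONTEXT_WINDOW_TOKENS="${CONTEXT_WINDOW_TOKENS:-200000}"
CONTEXT_EXHAUSTION_THRESHOLD="${CONTEXT_EXHAUSTION_THRESHOLD:-70}"

# Token count at which the threshold is reached (integer arithmetic).
# threshold_tokens is a hypothetical name, not part of the module's API.
threshold_tokens=$(( CONTEXT_WINDOW_TOKENS * CONTEXT_EXHAUSTION_THRESHOLD / 100 ))
echo "restart threshold: ${threshold_tokens} tokens"
```

With the defaults, a restart would be triggered once cumulative usage reaches 140000 tokens.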

Alternatives Considered

  1. Character-based prompt size tracking only — Simpler, but only measures the injected prompt. The real exhaustion happens in Claude's accumulated conversation context across turns, which this approach cannot observe. Rejected as insufficient.

  2. Claude CLI stderr parsing for context limit warnings — Most accurate, but reactive (fires after degradation starts), fragile (depends on CLI output format that can change without notice), and doesn't provide time to summarize state cleanly. Rejected in favor of proactive prevention.

Component Diagram

┌─────────────────────────────────────────────────────────────┐
│                     sw-loop.sh (orchestrator)               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │ loop-         │  │ loop-         │  │ loop-context-    │  │
│  │ iteration.sh  │──│ convergence.sh│  │ monitor.sh [NEW] │  │
│  │               │  │               │  │                  │  │
│  │ Runs Claude   │  │ Detects       │  │ Tracks token %   │  │
│  │ CLI, calls    │  │ stuckness     │  │ Generates state  │  │
│  │ accumulate_   │  │               │  │ summary          │  │
│  │ loop_tokens() │  │               │  │                  │  │
│  └──────┬───────┘  └──────────────┘  └────────┬─────────┘  │
│         │                                      │            │
│         │  tokens accumulate                   │ checks     │
│         ▼                                      │ threshold  │
│  LOOP_INPUT_TOKENS ◄───────────────────────────┘            │
│  LOOP_OUTPUT_TOKENS                                         │
│         │                                                   │
│  ┌──────┴───────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │ loop-         │  │ loop-         │  │ events.jsonl     │  │
│  │ restart.sh    │◄─│ progress.sh   │  │ (telemetry)      │  │
│  │               │  │               │  │                  │  │
│  │ Manages state │  │ Writes        │  │ context_usage    │  │
│  │ reset, archive│  │ progress.md   │  │ context_exhaust  │  │
│  │ summary inject│  │ context-      │  │ _warning/restart │  │
│  │               │  │ summary.md    │  │                  │  │
│  └──────────────┘  └──────────────┘  └──────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Data flow:
  CLI JSON output → accumulate_loop_tokens() → LOOP_INPUT/OUTPUT_TOKENS
       → check_context_exhaustion() → threshold crossed?
           YES → summarize_loop_state() → context-summary.md
                → STATUS="context_exhaustion" → break
                → run_loop_with_restarts() picks up, injects summary, restarts
           NO  → emit context_usage event → continue loop
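The data flow above can be sketched as a loop with a break. This is an illustration of the control flow only: the three functions are stubs with hypothetical bodies, and the per-iteration token figures are made up so that the threshold is crossed after a few iterations.

```shell
#!/usr/bin/env bash
# Control-flow sketch. The stubs stand in for the real shipwright functions;
# their bodies are hypothetical.
LOOP_INPUT_TOKENS=0
LOOP_OUTPUT_TOKENS=0
CONTEXT_WINDOW_TOKENS=200000
CONTEXT_EXHAUSTION_THRESHOLD=70
MAX_ITERATIONS=10
ITERATION=0
STATUS="running"

# Stub: pretend every iteration consumes 40000 input and 5000 output tokens.
accumulate_loop_tokens() {
    LOOP_INPUT_TOKENS=$(( LOOP_INPUT_TOKENS + 40000 ))
    LOOP_OUTPUT_TOKENS=$(( LOOP_OUTPUT_TOKENS + 5000 ))
}

# Stub: returns 0 when the threshold is crossed, 1 when safe.
check_context_exhaustion() {
    local used=$(( LOOP_INPUT_TOKENS + LOOP_OUTPUT_TOKENS ))
    [ $(( used * 100 / CONTEXT_WINDOW_TOKENS )) -ge "$CONTEXT_EXHAUSTION_THRESHOLD" ]
}

# Stub: the real function would write $LOG_DIR/context-summary.md.
summarize_loop_state() { :; }

while [ "$ITERATION" -lt "$MAX_ITERATIONS" ]; do
    ITERATION=$(( ITERATION + 1 ))
    accumulate_loop_tokens
    if check_context_exhaustion; then
        summarize_loop_state
        STATUS="context_exhaustion"
        break   # the restart wrapper takes over from here
    fi
done

echo "stopped at iteration ${ITERATION} with STATUS=${STATUS}"
```

With these made-up figures, usage is 22%, 45%, and 67% after the first three iterations and 90% after the fourth, so the loop breaks at iteration 4.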

Interface Contracts

// --- loop-context-monitor.sh public API ---

// Constants (configurable via environment)
CONTEXT_WINDOW_TOKENS: number    // default: 200000 (Opus/Sonnet context window)
CONTEXT_EXHAUSTION_THRESHOLD: number  // default: 70 (percent)

/**
 * Check if cumulative token usage exceeds the exhaustion threshold.
 * @reads LOOP_INPUT_TOKENS, LOOP_OUTPUT_TOKENS, CONTEXT_WINDOW_TOKENS, CONTEXT_EXHAUSTION_THRESHOLD
 * @sideeffect Emits loop.context_exhaustion_warning event when threshold crossed
 * @returns 0 if threshold crossed (action needed), 1 if safe
 * @errors Never fails — returns 1 (safe) on any arithmetic error
 */
check_context_exhaustion(): ExitCode  // 0 = exhausted, 1 = safe

/**
 * Generate compressed state summary for restart injection.
 * @reads ORIGINAL_GOAL, ITERATION, MAX_ITERATIONS, TEST_PASSED, STATUS,
 *        LOOP_START_COMMIT, LOG_DIR, LOG_ENTRIES
 * @writes $LOG_DIR/context-summary.md
 * @returns 0 always (best-effort; missing data produces partial summary)
 * @output_format Markdown with sections: Goal, Status, Files Modified,
 *                Error Patterns, Recent Log Entries
 * @size_constraint Max 2000 characters (hard truncate with notice)
 */
summarize_loop_state(): ExitCode  // always 0

/**
 * Return current context usage as integer percentage.
 * @reads LOOP_INPUT_TOKENS, LOOP_OUTPUT_TOKENS, CONTEXT_WINDOW_TOKENS
 * @returns 0 always
 * @stdout Integer 0-100+ (can exceed 100 if already over window)
 * @errors Outputs "0" if CONTEXT_WINDOW_TOKENS <= 0 (division-by-zero guard)
 */
get_context_usage_pct(): ExitCode  // always 0; prints percentage to stdout
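
One possible shape for the two arithmetic functions in the contract above, written to be Bash 3.2 compatible. This is a sketch under stated assumptions, not the module's actual code. It assumes usage is the sum of both token counters, uses a hypothetical helper _ctx_int to sanitize values, and falls back to 70 when the threshold is unusable. The event emission is omitted because the signature of emit_event is not shown in this document.

```shell
#!/usr/bin/env bash
# Sketch of get_context_usage_pct() and check_context_exhaustion().
# Assumption: usage = LOOP_INPUT_TOKENS + LOOP_OUTPUT_TOKENS.

CONTEXT_WINDOW_TOKENS="${CONTEXT_WINDOW_TOKENS:-200000}"
CONTEXT_EXHAUSTION_THRESHOLD="${CONTEXT_EXHAUSTION_THRESHOLD:-70}"

# Hypothetical helper: print the argument if it is a plain non-negative
# integer, otherwise print 0. 10# forces base 10 so "0800" is not read as octal.
_ctx_int() {
    case "${1:-}" in
        ''|*[!0-9]*) echo 0 ;;
        *) echo $(( 10#$1 )) ;;
    esac
}

get_context_usage_pct() {
    local in_tok out_tok window
    in_tok=$(_ctx_int "${LOOP_INPUT_TOKENS:-0}")
    out_tok=$(_ctx_int "${LOOP_OUTPUT_TOKENS:-0}")
    window=$(_ctx_int "${CONTEXT_WINDOW_TOKENS:-0}")
    if [ "$window" -le 0 ]; then
        echo 0              # division-by-zero guard
        return 0
    fi
    echo $(( (in_tok + out_tok) * 100 / window ))
    return 0
}

check_context_exhaustion() {
    local pct threshold
    pct=$(get_context_usage_pct)
    threshold=$(_ctx_int "${CONTEXT_EXHAUSTION_THRESHOLD:-70}")
    # Assumption: an unusable threshold falls back to the documented default.
    [ "$threshold" -gt 0 ] || threshold=70
    if [ "$pct" -ge "$threshold" ]; then
        # The real function would also emit loop.context_exhaustion_warning here.
        return 0            # threshold crossed: action needed
    fi
    return 1                # safe
}
```

A caller would typically write `if check_context_exhaustion; then ...; fi`, and `get_context_usage_pct` can be captured with command substitution for the per-iteration telemetry event.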

Data Flow

1. Each iteration:
   run_claude_iteration()
     → Claude CLI produces JSON with .usage.input_tokens / .output_tokens
     → accumulate_loop_tokens() adds to LOOP_INPUT_TOKENS / LOOP_OUTPUT_TOKENS
     → [NEW] emit loop.context_usage event with cumulative percentage
     → [NEW] check_context_exhaustion()
         → If < threshold: continue to next iteration
         → If >= threshold:
              a. Emit loop.context_exhaustion_warning event
              b. summarize_loop_state() → writes $LOG_DIR/context-summary.md
              c. Set STATUS="context_exhaustion"
              d. write_state() + write_progress()
              e. break out of main loop

2. Restart path (run_loop_with_restarts):
   → Detects STATUS is not "complete" and restarts are available
   → [NEW] Copies context-summary.md to restart archive
   → Resets ITERATION, tokens, state variables (existing behavior)
   → [NEW] If context-summary.md exists, prepends to GOAL:
       "## Previous Session Context (Summarized)\n<summary content>"
   → Emits loop.context_exhaustion_restart event
   → Re-enters run_single_agent_loop with fresh context
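
The summary injection step in the restart path could look roughly like the following. The function name inject_context_summary is hypothetical. GOAL, LOG_DIR, and the heading text follow this design, while the demo values are invented for illustration.

```shell
#!/usr/bin/env bash
# Sketch of summary injection on restart. inject_context_summary is a
# hypothetical helper name.

inject_context_summary() {
    local summary_file="${LOG_DIR:-.}/context-summary.md"
    # Only inject if the file exists; otherwise leave GOAL untouched.
    [ -f "$summary_file" ] || return 0
    local summary
    summary=$(cat "$summary_file") || return 0
    GOAL="## Previous Session Context (Summarized)
${summary}

${GOAL}"
    return 0
}

# Demo with made-up values.
LOG_DIR=$(mktemp -d)
GOAL="Fix failing tests"
printf 'Goal: fix failing tests\nStatus: iteration 7 of 20\n' > "$LOG_DIR/context-summary.md"
inject_context_summary
printf '%s\n' "$GOAL"
```

Prepending rather than appending keeps the original goal as the last thing in the prompt; whether that ordering matters in practice is untested here.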

Error Boundaries

| Component | Error | Handling |
| --- | --- | --- |
| check_context_exhaustion() | CONTEXT_WINDOW_TOKENS=0 | Guard: [[ "$window" -gt 0 ]]; return 1 (safe) |
| check_context_exhaustion() | Non-numeric token values | $(( ... )) with ${var:-0} defaults; worst case returns 1 (safe) |
| summarize_loop_state() | Missing git, no commits, no error-summary.json | Each section guarded with ` |
| summarize_loop_state() | Summary exceeds 2000 chars | Hard truncate with ${summary:0:2000} + notice |
| get_context_usage_pct() | Division by zero, missing tokens | Returns "0" on any error |
| Restart injection | context-summary.md missing | Conditional: only inject if file exists |
| Token parsing | jq unavailable | Existing fallback in accumulate_loop_tokens() uses regex; tokens still accumulate |

Error propagation principle: All context monitor functions fail safe — they never cause the loop to crash. False negatives (missing a threshold crossing) are acceptable; false positives (unnecessary restart) are bounded by MAX_RESTARTS.
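
As an illustration of the hard-truncate rule, the sketch below caps a summary at 2000 characters. The helper name truncate_summary and the notice text are hypothetical. It also assumes the notice should count toward the cap, so it reserves room for it before cutting. The design does not say whether the notice counts toward the 2000 characters.

```shell
#!/usr/bin/env bash
# Sketch of the hard-truncate step. truncate_summary and the notice wording
# are hypothetical; the 2000-character cap comes from this design.
SUMMARY_MAX_CHARS=2000

truncate_summary() {
    local summary="$1"
    local notice="
[summary truncated]"
    if [ "${#summary}" -le "$SUMMARY_MAX_CHARS" ]; then
        printf '%s' "$summary"
        return 0
    fi
    # Reserve room for the notice so the total stays within the cap.
    local keep=$(( SUMMARY_MAX_CHARS - ${#notice} ))
    printf '%s%s' "${summary:0:$keep}" "$notice"
    return 0    # best-effort: never fails
}
```

Substring expansion (`${var:offset:length}`) is available in Bash 3.2, so this stays within the compatibility constraint stated in the validation criteria.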

Implementation Plan

Files to create

  • scripts/lib/loop-context-monitor.sh — New module (~80 lines): constants, check_context_exhaustion(), summarize_loop_state(), get_context_usage_pct()

Files to modify

  • scripts/sw-loop.sh — Source new module (line ~43), add context_exhaustion to restartable statuses in run_loop_with_restarts() (line ~2411), inject summary on restart
  • scripts/lib/loop-iteration.sh — After accumulate_loop_tokens call (line 539), add context check and loop.context_usage event emission
  • scripts/sw-loop-test.sh — Add test cases for threshold boundaries, summarization output, restart triggering

Dependencies

  • None new. Uses existing jq (optional), git, awk, emit_event.

Risk areas

  • Token count accuracy: LOOP_INPUT_TOKENS is a proxy for conversation context size. Each Claude CLI invocation starts a fresh conversation, so cumulative tokens track total work done, not actual context window fill. The 70% threshold is deliberately conservative to compensate.
  • Summary lossiness: The 2000-char cap may drop relevant context on complex multi-file changes. Mitigated by preserving error-summary.json and progress.md through restarts (existing behavior).
  • Restart loop: If the summarized context itself pushes tokens high on the first iteration of a restart, it could trigger another immediate exhaustion. Mitigated by MAX_RESTARTS cap (default 3, hard cap 5 at line 2406).

Validation Criteria

  • check_context_exhaustion() returns 0 (exhausted) when cumulative tokens >= 70% of CONTEXT_WINDOW_TOKENS
  • check_context_exhaustion() returns 1 (safe) when cumulative tokens < 70%
  • check_context_exhaustion() returns 1 (safe) when CONTEXT_WINDOW_TOKENS=0 (no division-by-zero crash)
  • check_context_exhaustion() returns 1 (safe) when token counters are 0 (no false positive on fresh loop)
  • summarize_loop_state() produces markdown with goal, iteration count, modified files list, error patterns, and test status
  • summarize_loop_state() output does not exceed 2000 characters
  • Loop breaks with STATUS="context_exhaustion" when threshold is crossed mid-loop
  • run_loop_with_restarts() treats context_exhaustion as restartable (session continues)
  • Summary is injected into GOAL on restart so the fresh session has context
  • loop.context_exhaustion_warning event emitted when threshold crossed
  • loop.context_exhaustion_restart event emitted on restart
  • loop.context_usage event emitted per iteration with usage_pct
  • Existing sw-loop-test.sh tests continue to pass
  • All code is Bash 3.2 compatible (no associative arrays, no ${var,,})

Performance

Baseline Metrics

  • Token accumulation: already runs per-iteration with negligible overhead (<1ms arithmetic)
  • Prompt composition: manage_context_window() already does awk-based trimming per iteration

Optimization Targets

  • Context check overhead: < 1ms per iteration (integer arithmetic only)
  • Summarization: < 100ms when triggered (one git diff --name-only, one file read, one file write)
  • No impact on iteration latency in the common case (threshold not crossed)

Profiling Strategy

Not applicable — this is shell arithmetic and one git command. The bottleneck is Claude CLI execution time (30-120s per iteration), making sub-millisecond monitoring overhead irrelevant.

Benchmark Plan

  • Verify check_context_exhaustion() adds no measurable time by running 100 iterations of the function in a test and confirming < 100ms total
  • Verify summarize_loop_state() completes in < 1s on a repo with 50+ changed files
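
The first benchmark could be sketched with the bash `time` keyword, which is available in Bash 3.2. The check function below is a stand-in that does the same kind of integer arithmetic. The token values and the helper run_100_checks are invented for the example.

```shell
#!/usr/bin/env bash
# Benchmark sketch. check_context_exhaustion is a stand-in; source the real
# module instead to measure the actual function.
export LC_ALL=C   # keep "." as the decimal point in the timing output

LOOP_INPUT_TOKENS=50000
LOOP_OUTPUT_TOKENS=5000
CONTEXT_WINDOW_TOKENS=200000
CONTEXT_EXHAUSTION_THRESHOLD=70

check_context_exhaustion() {
    local pct=$(( (LOOP_INPUT_TOKENS + LOOP_OUTPUT_TOKENS) * 100 / CONTEXT_WINDOW_TOKENS ))
    [ "$pct" -ge "$CONTEXT_EXHAUSTION_THRESHOLD" ]
}

# Hypothetical helper: call the check 100 times.
run_100_checks() {
    local i=0
    while [ "$i" -lt 100 ]; do
        check_context_exhaustion
        i=$(( i + 1 ))
    done
}

# TIMEFORMAT=%R makes `time` print elapsed wall-clock seconds only.
TIMEFORMAT=%R
elapsed=$( { time run_100_checks; } 2>&1 )

# Compare against the 100ms budget with awk, since bash has no float math.
if awk -v t="$elapsed" 'BEGIN { exit !(t < 0.1) }'; then
    echo "ok: 100 checks took ${elapsed}s"
else
    echo "slow: 100 checks took ${elapsed}s"
fi
```

Wall-clock timing is sensitive to machine load, so a test built on this would probably need a generous margin to avoid flaking.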
