Pipeline Design 178

ezigus edited this page Mar 17, 2026 · 3 revisions

Design: Pipeline cost forecast and budget gate with early warning

Context

Shipwright pipelines can run 12 stages, each consuming model tokens at different rates (Opus at $15/$75 per million input/output tokens, Sonnet at $3/$15, Haiku at $0.25/$1.25). Today, estimate_pipeline_cost() in sw-pipeline.sh provides only a rough aggregate estimate (~8K input / ~4K output tokens per stage), and cost_check_budget() in sw-cost.sh:197 checks only whether the daily budget is already exceeded. It cannot predict whether a pipeline about to start will exceed the budget, so operators discover cost overruns after the fact.
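
To make the pricing gap concrete, here is a minimal TypeScript sketch of the flat per-stage arithmetic described above. The prices and the ~8K/~4K token figures come from the paragraph; the names PRICE_PER_MTOK, stageCostUsd and flatPipelineEstimateUsd are hypothetical and do not reflect the actual bash implementation of estimate_pipeline_cost() or cost_calculate().

```typescript
// USD per million tokens (input, output), as quoted in the Context paragraph.
type Model = "opus" | "sonnet" | "haiku";

const PRICE_PER_MTOK: Record<Model, { input: number; output: number }> = {
  opus: { input: 15, output: 75 },
  sonnet: { input: 3, output: 15 },
  haiku: { input: 0.25, output: 1.25 },
};

// Hypothetical helper (not the real cost_calculate()): cost of one stage.
function stageCostUsd(model: Model, inputTokens: number, outputTokens: number): number {
  const price = PRICE_PER_MTOK[model];
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Flat aggregate: every stage is assumed to use ~8K input and ~4K output tokens.
function flatPipelineEstimateUsd(model: Model, stageCount: number): number {
  return stageCount * stageCostUsd(model, 8_000, 4_000);
}
```

Under that flat assumption a 12-stage run comes to roughly $5.04 on Opus, $1.01 on Sonnet and $0.08 on Haiku, which shows how strongly the model assignment drives the estimate.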

Constraints from the codebase:

  • Bash 3.2 compatibility — no associative arrays, no readarray, no ${var,,}
  • All JSON manipulation via jq --arg (no string interpolation)
  • Events written to ~/.shipwright/events.jsonl via emit_event
  • set -euo pipefail in all scripts; grep -c prints 0 but exits non-zero when nothing matches, so a fallback such as || echo 0 yields double output (use || true + ${var:-0})
  • Dashboard is TypeScript/Bun with vitest; shell tests use lib/test-helpers.sh assertions
  • Existing --ignore-budget flag on pipeline start (line ~460 of sw-pipeline.sh)

Decision

Approach B: Per-stage forecast using template model assignments + historical durations.

The forecast engine lives in sw-cost.sh as new functions, called by sw-pipeline.sh before stage execution begins. This keeps cost logic centralized (single responsibility) and reuses the existing cost_calculate() function for per-model pricing.

Data Flow

                    ┌─────────────────────────────┐
                    │    Pipeline Template JSON    │
                    │  (enabled stages + models)   │
                    └──────────────┬──────────────┘
                                   │
                                   ▼
┌──────────────┐     ┌─────────────────────────────┐     ┌──────────────┐
│ events.jsonl │────▶│    cost_forecast() engine    │────▶│ forecast.json│
│  (history)   │     │    in sw-cost.sh             │     │ (artifact)   │
└──────────────┘     └──────────────┬──────────────┘     └──────────────┘
                                    │
                          ┌─────────┴─────────┐
                          ▼                   ▼
                  ┌──────────────┐   ┌────────────────┐
                  │ Budget Gate  │   │ CLI / Dashboard │
                  │ (block/warn) │   │ (display)       │
                  └──────────────┘   └────────────────┘
                          │
                          ▼
              ┌────────────────────────┐
              │ Pipeline runs stages   │
              │ ...                    │
              │ On completion:         │
              │ cost_record_variance() │
              └────────────────────────┘

Component Diagram

┌─────────────────────────────────────────────────────────┐
│                     sw-cost.sh                           │
│                                                          │
│  ┌──────────────────┐  ┌──────────────────────────────┐ │
│  │ cost_calculate()  │  │ cost_forecast()              │ │
│  │ (existing)        │◀─│  - reads template stages     │ │
│  └──────────────────┘  │  - queries event history      │ │
│                         │  - applies complexity mult   │ │
│  ┌──────────────────┐  │  - computes confidence        │ │
│  │ cost_remaining_   │  └──────────────────────────────┘ │
│  │ budget() (exists) │                                    │
│  └──────────────────┘  ┌──────────────────────────────┐ │
│                         │ cost_forecast_display()       │ │
│                         │  - renders table to stdout    │ │
│                         └──────────────────────────────┘ │
│                         ┌──────────────────────────────┐ │
│                         │ cost_record_variance()        │ │
│                         │  - emits forecast vs actual   │ │
│                         └──────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────┐
│                   sw-pipeline.sh                         │
│                                                          │
│  pipeline_start():                                       │
│    1. load_pipeline_config                               │
│    2. cost_forecast → save artifact                      │
│    3. budget_gate (block | warn | pass)                  │
│    4. run stages...                                      │
│    5. cost_record_variance on completion                  │
└─────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────┐
│                   dashboard/server.ts                     │
│  GET /api/costs/forecast — shells to sw cost forecast    │
│  GET /api/status — enriched with forecast from artifact  │
└─────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────┐
│               dashboard/src/views/pipelines.ts           │
│  Queue items display: "Est: $45–$60 (medium confidence)" │
└─────────────────────────────────────────────────────────┘

Interface Contracts

// -- sw-cost.sh outputs (JSON to stdout) --

// cost_forecast(pipeline_config_path, complexity) → stdout
interface CostForecast {
  total_usd: number;       // point estimate
  low_usd: number;         // lower bound (total × 0.8 high, × 0.7 medium, × 0.5 low)
  high_usd: number;        // upper bound (total × 1.2 high, × 1.5 medium, × 2.0 low)
  confidence: "high" | "medium" | "low";
  data_points: number;     // historical runs used
  complexity_multiplier: number;
  stages: Array<{
    id: string;            // e.g. "build", "review"
    model: string;         // e.g. "sonnet", "opus"
    est_duration_s: number;
    est_cost_usd: number;
  }>;
}

// cost_record_variance(forecast_usd, actual_usd, confidence, template, issue) → event emitted
// No return value; writes to events.jsonl

// -- Budget gate return codes (in pipeline_start context) --
// 0 = proceed (under budget or budget unlimited)
// 1 = warn (forecast.total_usd > 50% of remaining, but forecast.high_usd <= remaining; pipeline proceeds with warning)
// 2 = block (forecast.high_usd > remaining and neither --force-start nor --ignore-budget is set; pipeline exits)

// -- Dashboard API --
// GET /api/costs/forecast?pipeline=standard&complexity=5
// Response 200: CostForecast
// Response 400: { error: { code: string, message: string } }

// -- Dashboard types (additions) --
interface QueueItem {
  issue: number;
  title: string;
  score?: number;
  estimated_cost?: number;  // existing field
  factors?: unknown;        // existing field
  forecast?: CostForecast;  // NEW
}
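
The cost_record_variance() contract above only states that an event is emitted. As an illustration, the payload could be assembled along these lines. The field names (forecast_usd, actual_usd, variance_usd, variance_pct) and the helper buildVarianceEvent are assumptions for this sketch; the design only says the event carries forecast, actual and variance fields.

```typescript
type Confidence = "high" | "medium" | "low";

// Assumed payload shape; only forecast/actual/variance are named in this design.
interface ForecastVarianceEvent {
  type: "cost.forecast_variance";
  forecast_usd: number;
  actual_usd: number;
  variance_usd: number;        // actual minus forecast; positive means overrun
  variance_pct: number | null; // null when the forecast is zero
  confidence: Confidence;
  template: string;
  issue: number;
}

// Hypothetical helper mirroring the argument order of cost_record_variance().
function buildVarianceEvent(
  forecastUsd: number,
  actualUsd: number,
  confidence: Confidence,
  template: string,
  issue: number,
): ForecastVarianceEvent {
  const varianceUsd = actualUsd - forecastUsd;
  return {
    type: "cost.forecast_variance",
    forecast_usd: forecastUsd,
    actual_usd: actualUsd,
    variance_usd: varianceUsd,
    variance_pct: forecastUsd > 0 ? (varianceUsd * 100) / forecastUsd : null,
    confidence,
    template,
    issue,
  };
}
```

Recording a signed variance rather than an absolute one keeps it possible to tell systematic under-forecasting from over-forecasting later.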

Error Boundaries

| Component | Error | Handling |
|---|---|---|
| cost_forecast() | No events.jsonl or empty | Falls back to default durations; sets confidence="low" |
| cost_forecast() | Invalid template JSON | Returns error JSON {"error": "..."}, pipeline logs warning and skips gate |
| cost_forecast() | jq not available | Detected at script top; forecast skipped with warning |
| Budget gate | cost_remaining_budget returns "unlimited" | Gate skipped entirely |
| Budget gate | Forecast fails | Pipeline proceeds with warning (forecast is advisory, not blocking-critical) |
| cost_record_variance() | Missing forecast data at completion | Skipped silently (no-op if PIPELINE_FORECAST_USD unset) |
| Dashboard endpoint | sw cost forecast shell-out fails | Returns 500 with error message |

Confidence Calibration

| Level | Data Points | Interval Width | Rationale |
|---|---|---|---|
| High | >= 20 runs | −20% / +20% (×0.8 / ×1.2) | Enough data for stable averages |
| Medium | 5–19 runs | −30% / +50% (×0.7 / ×1.5) | Moderate uncertainty |
| Low | < 5 runs | −50% / +100% (×0.5 / ×2.0) | Cold start, conservative bounds |
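
The calibration table can be expressed as a small lookup. The sketch below is in TypeScript for readability only; the actual engine is planned in bash/awk inside sw-cost.sh, and the names confidenceFor, BOUNDS and forecastInterval are hypothetical.

```typescript
type Confidence = "high" | "medium" | "low";

// Thresholds from the table: >= 20 runs high, 5 to 19 medium, < 5 low.
function confidenceFor(dataPoints: number): Confidence {
  if (dataPoints >= 20) return "high";
  if (dataPoints >= 5) return "medium";
  return "low";
}

// Multipliers applied to the point estimate to get the interval bounds.
const BOUNDS: Record<Confidence, { low: number; high: number }> = {
  high: { low: 0.8, high: 1.2 },
  medium: { low: 0.7, high: 1.5 },
  low: { low: 0.5, high: 2.0 },
};

function forecastInterval(totalUsd: number, dataPoints: number) {
  const confidence = confidenceFor(dataPoints);
  const m = BOUNDS[confidence];
  return { confidence, low_usd: totalUsd * m.low, high_usd: totalUsd * m.high };
}
```

The medium and low intervals are deliberately asymmetric: the upper multiplier grows faster than the lower one shrinks, so sparse history widens the bound that the budget gate checks.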

Budget Gate Logic (precise)

remaining = cost_remaining_budget()
if remaining == "unlimited" → PASS
if FORCE_START || IGNORE_BUDGET → PASS (with audit event)
if forecast.high_usd > remaining → BLOCK (exit 2, suggest --force-start)
if forecast.total_usd > remaining × 0.5 → WARN (continue)
else → PASS
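
As a cross-check of the ordering of these rules, here is a TypeScript sketch of the same decision. It is illustrative only: the real gate would live in bash inside pipeline_start(), and budgetGate, GateResult and the option names are hypothetical. The return codes follow the 0/1/2 convention from the interface contracts.

```typescript
type GateResult = {
  code: 0 | 1 | 2;
  decision: "pass" | "warn" | "block";
  reason: string;
};

function budgetGate(
  forecast: { total_usd: number; high_usd: number },
  remaining: number | "unlimited",
  opts: { forceStart?: boolean; ignoreBudget?: boolean } = {},
): GateResult {
  if (remaining === "unlimited") {
    return { code: 0, decision: "pass", reason: "budget unlimited" };
  }
  if (opts.forceStart || opts.ignoreBudget) {
    // The design calls for an audit event here; emitting it is out of scope.
    return { code: 0, decision: "pass", reason: "override flag set" };
  }
  if (forecast.high_usd > remaining) {
    return { code: 2, decision: "block", reason: "upper bound exceeds remaining budget" };
  }
  if (forecast.total_usd > remaining * 0.5) {
    return { code: 1, decision: "warn", reason: "point estimate above 50% of remaining" };
  }
  return { code: 0, decision: "pass", reason: "within budget" };
}
```

Note that blocking compares the upper bound (high_usd) while warning compares the point estimate (total_usd), so a low-confidence forecast blocks earlier than a high-confidence one with the same total.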

Historical Data Query

Scan events.jsonl for stage.completed events, extract duration per stage name, compute running averages. Limit scan to tail -1000 lines for performance. Group by stage ID. This reuses the existing event format — no new data collection needed.
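
The query described above might look like the following TypeScript sketch. The event field names (type, stage, duration_s) are assumptions, since this page does not show the events.jsonl schema, and averageStageDurations is a hypothetical name; the planned implementation uses tail, grep and jq/awk in bash.

```typescript
// Assumed event shape; the real events.jsonl field names may differ.
interface StageCompletedEvent {
  type: "stage.completed";
  stage: string;
  duration_s: number;
}

function isStageCompleted(value: unknown): value is StageCompletedEvent {
  if (typeof value !== "object" || value === null) return false;
  const ev = value as Record<string, unknown>;
  return (
    ev.type === "stage.completed" &&
    typeof ev.stage === "string" &&
    typeof ev.duration_s === "number"
  );
}

// Averages duration per stage ID over the last `maxLines` non-empty lines.
function averageStageDurations(
  jsonl: string,
  maxLines: number = 1000,
): Record<string, { avg_s: number; count: number }> {
  const lines = jsonl.split("\n").filter((l) => l.trim() !== "").slice(-maxLines);
  const result: Record<string, { avg_s: number; count: number }> = {};
  for (const line of lines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // skip malformed lines rather than failing the forecast
    }
    if (!isStageCompleted(parsed)) continue;
    const prev = result[parsed.stage] ?? { avg_s: 0, count: 0 };
    const count = prev.count + 1;
    // Incremental (running) mean: avg += (x - avg) / n
    result[parsed.stage] = { avg_s: prev.avg_s + (parsed.duration_s - prev.avg_s) / count, count };
  }
  return result;
}
```

A stage with no entry in the result would fall back to the default duration, matching the cold-start behaviour listed under Error Boundaries.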

Alternatives Considered

  1. Simple multiplier (stage_count × flat_rate) — Pros: Trivial, zero dependencies. Cons: Ignores model tier differences (Opus is 60× more expensive than Haiku), ignores stage duration variance (build averages 20min vs intake at 1min), produces estimates so inaccurate they erode trust in the gate. Rejected: too coarse for meaningful go/no-go decisions.

  2. ML regression on historical runs — Pros: Could capture non-linear relationships (e.g., complexity × model × time-of-day). Cons: Requires training infrastructure, minimum ~100 runs for stable regression, adds Python dependency to a shell-native project, massive over-engineering for current data volume (~dozens of runs). Rejected: future enhancement when data justifies it.

Implementation Plan

Files to modify

| File | Lines Changed (est.) | Purpose |
|---|---|---|
| scripts/sw-cost.sh | +200 | cost_forecast(), cost_forecast_display(), cost_record_variance(), forecast CLI subcommand |
| scripts/sw-pipeline.sh | +50 | --force-start flag, forecast + budget gate in pipeline_start(), variance at completion |
| config/event-schema.json | +20 | cost.forecast and cost.forecast_variance event type definitions |
| dashboard/server.ts | +30 | /api/costs/forecast endpoint, forecast in queue enrichment |
| dashboard/src/types/api.ts | +15 | CostForecast interface, extend QueueItem |
| dashboard/src/views/pipelines.ts | +15 | Forecast display on queued items |
| scripts/sw-pipeline-test.sh | +60 | Integration tests for budget gate |

Files to create

| File | Purpose |
|---|---|
| src/cost-forecast.test.js | Unit tests for forecast math and variance tracking |

Dependencies

  • None new. Uses existing jq, awk, bash, vitest.

Risk areas

  • events.jsonl scan performance: Mitigated by tail -1000 + grep filter. If file exceeds ~100K lines, consider indexed lookup (future).
  • pipeline_start() is already ~300 lines: Adding forecast + gate adds ~50 lines of sequential logic. Inserted as a discrete block after load_pipeline_config, before state file creation — minimal entanglement with existing flow.
  • Bash 3.2 float arithmetic: All cost math uses awk (already the pattern in cost_calculate()). No bc dependency.
  • Race condition on budget check: Between forecast check and actual spend, another pipeline could start. Acceptable — the gate is advisory, not transactional. --force-start exists as escape valve.

Validation Criteria

  • shipwright cost forecast --pipeline standard --json returns valid CostForecast JSON
  • shipwright cost forecast --pipeline standard renders human-readable table with per-stage breakdown
  • Cold start (empty events.jsonl): forecast uses defaults, shows "low" confidence
  • With 25+ historical stage.completed events: shows "high" confidence with narrow interval
  • Pipeline start displays forecast before executing stages
  • Pipeline blocked when forecast.high_usd > remaining_budget (exit code 2, message includes --force-start hint)
  • --force-start bypasses gate with audit event emitted
  • --ignore-budget also bypasses forecast gate (backward compatible)
  • cost.forecast event emitted at pipeline start
  • cost.forecast_variance event emitted at pipeline completion with forecast/actual/variance fields
  • All existing tests pass (npm test)
  • New unit tests cover: forecast calculation, confidence thresholds at boundaries (4/5/19/20 data points), complexity multiplier scaling, variance recording
  • New integration tests cover: gate blocks over budget, gate warns at 50-100%, --force-start override, variance event in events.jsonl
  • No Bash 4+ features used (verified by shellcheck or manual review)
  • Dashboard /api/costs/forecast returns valid JSON for all template types
  • Dashboard queue view shows forecast inline for queued items

Frontend Sections

Component Hierarchy

pipelines.ts (view)
  └─ renderQueueTable()
       └─ renderQueueRow(item: QueueItem)
            └─ renderForecastBadge(forecast?: CostForecast)  // NEW
                 - "Est: $45–$60 (medium confidence)"
                 - Color-coded: green (under 50% budget), yellow (50-100%), red (over)

State lives in FleetState.queue[].forecast — fetched from server, no local state management needed. The forecast data flows from GET /api/status through to render.
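
A possible shape for renderForecastBadge is sketched below. It returns an HTML string and takes the remaining budget as a second argument. Both choices are assumptions, as are the CSS class names and the rule that reuses the budget gate thresholds for the colour; the hierarchy above only fixes the function name, the label text and the three colour bands.

```typescript
type Confidence = "high" | "medium" | "low";

interface BadgeForecast {
  total_usd: number;
  low_usd: number;
  high_usd: number;
  confidence: Confidence;
}

// Assumed mapping: reuse the gate thresholds (red = would block, yellow = would warn).
function forecastTone(f: BadgeForecast, remainingUsd: number): "green" | "yellow" | "red" {
  if (f.high_usd > remainingUsd) return "red";
  if (f.total_usd > remainingUsd * 0.5) return "yellow";
  return "green";
}

function renderForecastBadge(f: BadgeForecast | undefined, remainingUsd: number): string {
  if (!f) return ""; // forecast is optional on QueueItem
  const low = Math.round(f.low_usd);
  const high = Math.round(f.high_usd);
  const text = `Est: $${low}–$${high} (${f.confidence} confidence)`;
  const aria = `Estimated cost: $${low} to $${high}, ${f.confidence} confidence`;
  const tone = forecastTone(f, remainingUsd);
  return `<span class="forecast-badge forecast-${tone}" aria-label="${aria}">${text}</span>`;
}
```

Because the values interpolated here are numbers and a fixed set of confidence strings, no HTML escaping is needed in this sketch; that would change if free-form text were ever added to the badge.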

State Management Approach

No new state stores. Forecast data is embedded in the existing FleetState response from /api/status. The queue enrichment in server.ts reads cost-forecast.json from pipeline artifacts when available. Pure props-down data flow.

Accessibility Checklist

  • Forecast badge uses semantic <span> with aria-label="Estimated cost: $45 to $60, medium confidence"
  • Color coding supplemented with text labels (not color-only)
  • Budget warning uses role="alert" for screen reader announcement
  • Table cells use <td> with column headers in <th> (existing pattern)

Responsive Breakpoints

  • 320px: Forecast column hidden; available via row expansion (existing mobile pattern)
  • 768px+: Forecast shown as compact badge: "$45–$60 (M)"
  • 1024px+: Full forecast text: "Est: $45–$60 (medium confidence, 12 runs)"
  • 1440px+: No change from 1024px

The dashboard already uses a responsive table pattern — forecast column follows the same hide/show behavior as existing optional columns.
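
For illustration, the breakpoint rules can be written as a pure function of viewport width. In practice the dashboard would more likely handle this with CSS classes, so treat forecastLabel and its signature as a hypothetical sketch of the intended text at each width rather than the implementation.

```typescript
type Confidence = "high" | "medium" | "low";

interface LabelForecast {
  low_usd: number;
  high_usd: number;
  confidence: Confidence;
  data_points: number;
}

// Returns null when the forecast column is hidden (below 768px).
function forecastLabel(viewportWidthPx: number, f: LabelForecast): string | null {
  const range = `$${Math.round(f.low_usd)}–$${Math.round(f.high_usd)}`;
  if (viewportWidthPx < 768) return null;
  if (viewportWidthPx < 1024) {
    // Compact badge: confidence abbreviated to its first letter.
    return `${range} (${f.confidence.charAt(0).toUpperCase()})`;
  }
  return `Est: ${range} (${f.confidence} confidence, ${f.data_points} runs)`;
}
```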
