Pipeline Design 448

ezigus edited this page May 1, 2026 · 1 revision

Design: Pipeline re-enters build indefinitely after consecutive test-stage failures (no cycling halt)

Context

self_healing_build_test() in scripts/sw-pipeline.sh runs up to BUILD_TEST_RETRIES (default 3) build→test cycles per invocation. When all cycles are exhausted and the function returns 1, external orchestration (the autonomous pipeline runner and daemon) re-invokes the pipeline from scratch, which resets all in-memory counters (STUCKNESS_COUNT, RESTART_COUNT, EXTENSION_COUNT in sw-loop.sh). The existing pipeline-state.md log persists across restarts, but nothing reads it to detect cumulative failure patterns.

Two existing convergence detectors fail here:

  • Same-error detector (consecutive_same_error): resets when error signatures differ across cycles (timestamps, assertion counts change)
  • Plateau detector: resets when prev_fail_count varies

Neither tracks failures across separate self_healing_build_test invocations. The state file log already records ### test (ts)\nfailed (...) entries on every failure — this is the only durable signal that survives restarts.

Constraints:

  • Bash 3.2 compatible (no declare -A, no ${var,,}, no readarray)
  • No new files — _cleanup_run_artifacts() must not need updating
  • No new state schema fields — blast radius on initialize_state() / write_state() is too high
  • Counter must reset automatically when a test stage passes (no explicit reset path)
  • SW_PIPELINE_MAX_BUILD_RETRIES=0 must be a valid escape hatch for automation
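
The Bash 3.2 constraint rules out three common conveniences. A minimal sketch of portable substitutes follows; the helper names (to_lower, read_lines, count_word) and the LINES_READ array are hypothetical and are not taken from sw-pipeline.sh.

```bash
#!/usr/bin/env bash
# Bash 3.2-safe substitutes. All helper names here are hypothetical.

# In place of ${var,,}: lowercase through tr.
to_lower() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

# In place of readarray: fill an indexed array in a while-read loop.
# The result is left in the global array LINES_READ.
read_lines() {
  local line
  LINES_READ=()
  while IFS= read -r line || [[ -n "$line" ]]; do
    LINES_READ[${#LINES_READ[@]}]="$line"
  done < "$1"
}

# In place of a declare -A tally: count matches in a plain word list.
count_word() {
  local needle="$1" word n=0
  shift
  for word in "$@"; do
    if [[ "$word" == "$needle" ]]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}
```

The `|| [[ -n "$line" ]]` clause keeps the final line when the file lacks a trailing newline, which is easy to miss when replacing readarray.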

Decision

Parse the existing pipeline-state.md log section at the top of each self_healing_build_test() while-loop iteration to count trailing consecutive test stage failures. If the count reaches SW_PIPELINE_MAX_BUILD_RETRIES (default: 3), set status: stuck_cycling, write a diagnostic log entry, emit a structured event, and return 1. No new files, no new state schema fields.

Key design choices:

  • Counter lives in the log, not in a variable or file — survives daemon restarts by construction
  • Reset is implicit: mark_stage_complete("test") writes complete to the log; the parser sees complete and resets the trailing count to 0
  • The check fires before each build attempt, so it catches cycling on the next entry after the cap is reached — not after N+1 failures
  • Both entry points get the guard automatically: the check sits inside self_healing_build_test (line ~1483), which self_healing_review_build_test (line ~1758) also invokes
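
To make the log-as-counter idea concrete, here is a minimal sketch of such a parser. It assumes the log grammar shown on this page (a "### <stage> (<timestamp>)" header followed by an outcome line beginning with "complete" or "failed") and keeps a running count rather than building an outcomes string; the actual implementation may differ in both respects.

```bash
#!/usr/bin/env bash
# Sketch only: log grammar and section layout are inferred from this page.

count_consecutive_test_failures() {
  local state_file="${1:-${STATE_FILE:-}}"
  # Pattern kept in a variable and expanded unquoted for Bash 3.2.
  local header_re='^###[[:space:]]+([a-z_]+)[[:space:]]+'
  local in_log=0 current_stage="" count=0 line

  # Missing or unreadable file: print 0 so the caller always gets an integer.
  if [[ -z "$state_file" || ! -r "$state_file" ]]; then
    echo 0
    return 0
  fi

  while IFS= read -r line || [[ -n "$line" ]]; do
    if [[ "$line" == "## Log"* ]]; then
      in_log=1
      continue
    elif [[ "$line" == "## "* ]]; then
      in_log=0          # any other level-2 section ends the log
      continue
    fi
    [[ "$in_log" -eq 1 ]] || continue

    if [[ "$line" =~ $header_re ]]; then
      current_stage="${BASH_REMATCH[1]}"
    elif [[ "$current_stage" == "test" ]]; then
      case "$line" in
        failed*)   count=$((count + 1)); current_stage="" ;;
        complete*) count=0;              current_stage="" ;;
      esac
    fi
  done < "$state_file"

  echo "$count"
  return 0
}
```

Because only test entries touch the counter, build entries interleaved between test failures neither extend nor reset the streak, and a single "complete" test entry brings it back to 0 without any explicit reset path.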

Alternatives Considered

  1. External counter file (ARTIFACTS_DIR/consecutive-test-failures.txt) — Pros: trivially simple read/write. Cons: not automatically cleaned on fresh pipeline start; requires _cleanup_run_artifacts() to preserve it across restarts (contradicts its purpose); silently wrong if ARTIFACTS_DIR is wiped.

  2. State file frontmatter field (consecutive_test_failures: N) — Pros: clean data model, first-class field. Cons: requires modifying initialize_state(), write_state(), and resume_state() in pipeline-state.sh; higher blast radius; field must be manually reset on test pass.

  3. In-memory counter passed as argument — Pros: no I/O. Cons: dies on any restart; doesn't solve the cross-invocation cycling problem at all.


Implementation Plan

Files to Modify

| File | Change |
| --- | --- |
| scripts/sw-pipeline.sh | Add env default (~line 812), add count_consecutive_test_failures() before self_healing_build_test() (~line 1422), add cycling halt check inside while loop (~line 1483), add stuck_cycling to pipeline_status() display (~line 3407) |
| scripts/sw-pipeline-test.sh | Add test_count_consecutive_test_failures_parsing (unit), test_stuck_cycling_halts_after_max_build_retries (E2E), register both in main() |

Files to Create

None.

Dependencies

None — parser uses only Bash builtins and [[ =~ ]].

Risk Areas

| Area | Risk | Mitigation |
| --- | --- | --- |
| BASH_REMATCH regex | Pattern must be POSIX ERE for bash 3.2 | Use ^###[[:space:]]+([a-z_]+)[[:space:]]+ (validated POSIX ERE) |
| State file absent | First cycle: file doesn't exist yet | Guard: `[[ -z "$state_file" |
| Log section boundary | Parser must not count entries outside ## Log | in_log flag gated on ## Log header line |
| resume_state() compatibility | Parser uses same log grammar as line ~774 | Reuse identical regex; no divergence risk |
| stuck_cycling blocking pipeline resume | Automated restart fails with env override | Resume does NOT treat stuck_cycling as terminal; SW_PIPELINE_MAX_BUILD_RETRIES=0 bypasses check |
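
On the BASH_REMATCH risk: from Bash 3.2 onward, quoting the right-hand side of =~ forces a literal string match, so the usual portable idiom is to store the pattern in a variable and expand it unquoted. A small sketch using the pattern from the table follows; stage_of is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Pattern stored in a variable so it is treated as a regex, not a literal.
header_re='^###[[:space:]]+([a-z_]+)[[:space:]]+'

# Prints the stage name from a log header line, or nothing if the line
# is not a stage header.
stage_of() {
  if [[ "$1" =~ $header_re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  fi
}

stage_of '### test (2026-04-30T10:00:00Z)'   # prints: test
stage_of 'failed (exit 1)'                    # prints nothing
```

Note that the pattern requires whitespace after the stage name, so a bare "### test" header with no timestamp would not match.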

Component Diagram

┌─────────────────────────────────────────────────────┐
│                  sw-pipeline.sh                      │
│                                                      │
│  ┌──────────────────────────────────────────────┐   │
│  │         self_healing_build_test()            │   │
│  │                                              │   │
│  │  while [cycle < BUILD_TEST_RETRIES]:         │   │
│  │    ┌──────────────────────┐                 │   │
│  │    │  [NEW] cycling guard │                 │   │
│  │    │  count_consecutive_  │◄── reads ──┐   │   │
│  │    │  test_failures()     │            │   │   │
│  │    └──────┬───────────────┘            │   │   │
│  │           │ N >= cap?                  │   │   │
│  │           ├─ YES → stuck_cycling       │   │   │
│  │           │         ↓                  │   │   │
│  │           │    update_status()         │   │   │
│  │           │    log_stage()             │   │   │
│  │           │    emit_event()            │   │   │
│  │           │    return 1               │   │   │
│  │           │                            │   │   │
│  │           └─ NO → run build → run test │   │   │
│  │                        │               │   │   │
│  │                   mark_stage_failed()  │   │   │
│  │                   write_state() ───────┘   │   │
│  └──────────────────────────────────────────  │   │
│                                               │   │
│  ┌────────────────────────────────────────┐  │   │
│  │   [NEW] count_consecutive_test_        │  │   │
│  │         failures(state_file)           │──┘   │
│  │                                        │      │
│  │   reads: pipeline-state.md §## Log     │      │
│  │   parses: ### test → complete|failed   │      │
│  │   returns: N (trailing fail count)     │      │
│  └────────────────────────────────────────┘      │
│                                                   │
│  ┌────────────────────────────────────────┐      │
│  │        pipeline_status()              │      │
│  │  stuck_cycling → ⚠ yellow icon        │      │
│  └────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────┘

       ▼ persists to / reads from ▼

┌─────────────────────────────────────────────────────┐
│               pipeline-state.md                     │
│  ---                                                │
│  status: stuck_cycling                              │
│  ...                                                │
│  ## Log                                             │
│  ### test (2026-04-30T10:00:00Z)                    │
│  failed (exit 1)                                    │
│  ### test (2026-04-30T10:05:00Z)                    │
│  failed (exit 1)           ← parser counts these   │
│  ### test (2026-04-30T10:10:00Z)                    │
│  failed (exit 1)                                    │
└─────────────────────────────────────────────────────┘

Interface Contracts

# count_consecutive_test_failures
# Input:   state_file (path), optional; defaults to ${STATE_FILE:-}
# Output:  integer N printed to stdout (0 if file absent/empty/no test entries)
# Returns: exit status 0 always (stdout carries the count)
# Errors:  none; any read failure prints 0
# Pre:     state_file may not exist (handled gracefully)
# Post:    N = count of trailing consecutive "failed" test entries in ## Log
#          N resets to 0 on any "complete" test entry
count_consecutive_test_failures() { ... }

# self_healing_build_test (modified)
# Input:  (no change to existing signature)
# New behavior: calls count_consecutive_test_failures() at top of while loop
#               halts with status=stuck_cycling if count >= SW_PIPELINE_MAX_BUILD_RETRIES
# Errors: returns 1 on stuck_cycling (same as existing exhaustion return)
# Events emitted: pipeline.stuck_cycling { issue, consecutive_failures, cap }

# Environment contract:
# SW_PIPELINE_MAX_BUILD_RETRIES (int, default 3)
#   0  → guard disabled; only the per-invocation BUILD_TEST_RETRIES limit applies
#   N  → halt after N consecutive test stage failures across all invocations
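
The environment contract reduces to a two-part condition: the cap must be positive and the count must have reached it. A hedged sketch of that decision in isolation follows; should_halt_cycling is a hypothetical helper, since the design describes the check as inline code at the top of the while loop.

```bash
#!/usr/bin/env bash
# Sketch of the guard's decision only; should_halt_cycling is hypothetical.

SW_PIPELINE_MAX_BUILD_RETRIES="${SW_PIPELINE_MAX_BUILD_RETRIES:-3}"

# Usage: should_halt_cycling <consecutive_failures> [cap]
# Exit status 0 means "halt", non-zero means "proceed with the build".
should_halt_cycling() {
  local failures="$1"
  local cap="${2:-$SW_PIPELINE_MAX_BUILD_RETRIES}"
  # A cap of 0 disables the guard entirely.
  [[ "$cap" -gt 0 && "$failures" -ge "$cap" ]]
}

if should_halt_cycling 3 3; then
  echo "stuck_cycling: 3 consecutive test failures (cap 3)"
fi
```

Testing `cap > 0` first is what makes SW_PIPELINE_MAX_BUILD_RETRIES=0 an escape hatch: without it, any count would satisfy `failures >= 0` and the pipeline would halt immediately.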

Data Flow

[daemon / autonomous runner]
         │
         ▼
sw-pipeline.sh → self_healing_build_test()
         │
         ▼ (top of while loop, each iteration)
count_consecutive_test_failures(pipeline-state.md)
         │
         ├─── reads §## Log section line-by-line
         │         tracks in_log flag, current_stage, outcomes string
         │         appends "pass" or "fail" per test entry
         │
         └─── returns N (trailing consecutive fail count)
                   │
    N < cap ───────┤───────── N >= cap (and cap > 0)
         │                           │
    run build                  update_status("stuck_cycling")
    run test                   log_stage("pipeline", "stuck_cycling: ...")
         │                     write_state() → pipeline-state.md
    test fails                 emit_event("pipeline.stuck_cycling", ...)
         │                     error() + warn() to terminal
    mark_stage_failed("test")  return 1
    log_stage("test","failed")       │
    write_state()              [daemon sees return 1, does not re-invoke]
         │
    loop continues

Error Boundaries

| Component | Errors It Handles | Propagation |
| --- | --- | --- |
| count_consecutive_test_failures() | Missing/unreadable state file, empty log section, malformed lines | Returns 0 (safe default), never propagates; caller always gets an integer |
| Cycling halt check | _consec_failures >= cap | Sets stuck_cycling status, calls return 1; propagates as normal build failure to caller |
| pipeline_status() | Unknown status values | stuck_cycling case added; unknown values fall through to existing default |
| emit_event() | Event emission failure | ` |

Validation Criteria

  • After exactly SW_PIPELINE_MAX_BUILD_RETRIES (default 3) consecutive test-stage failures logged in pipeline-state.md, pipeline exits with status: stuck_cycling
  • stuck_cycling is present in pipeline-state.md after halt
  • Diagnostic log entry in ## Log section names failure count and override command
  • pipeline.stuck_cycling event emitted with consecutive_failures, cap, issue fields
  • SW_PIPELINE_MAX_BUILD_RETRIES=0 disables the guard entirely: no stuck_cycling halt occurs, and only the per-invocation BUILD_TEST_RETRIES limit applies
  • Counter resets to 0 after any test stage complete entry in the log
  • count_consecutive_test_failures returns 0 for: missing file, empty file, no test entries, pass after failures
  • shipwright pipeline resume with SW_PIPELINE_MAX_BUILD_RETRIES=0 proceeds past stuck_cycling state
  • npm test passes with no regressions

Test Pyramid Breakdown

Unit tests: 6 — all in test_count_consecutive_test_failures_parsing

| Case | Input | Expected |
| --- | --- | --- |
| Missing state file | /dev/null/nonexistent | 0 |
| No test entries in log | log with only build entries | 0 |
| Single failure | ### test\nfailed (exit 1) | 1 |
| Three consecutive failures | 3× failed entries | 3 |
| Pass resets counter | 2 failures → 1 pass → 2 failures | 2 |
| Pass after failures | 3 failures → 1 pass | 0 |
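
Fixtures for these cases could be generated rather than hand-written. The sketch below assumes a minimal state file shape (frontmatter followed by a ## Log section); make_state_fixture is a hypothetical helper and the "entry-N" timestamps are placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical fixture helper: writes a minimal state file whose ## Log
# section holds one test entry per argument ("pass" or "fail").
make_state_fixture() {
  local file="$1" outcome i=0
  shift
  {
    echo "---"
    echo "status: running"
    echo "---"
    echo "## Log"
    for outcome in "$@"; do
      i=$((i + 1))
      echo "### test (entry-$i)"
      if [[ "$outcome" == "pass" ]]; then
        echo "complete"
      else
        echo "failed (exit 1)"
      fi
    done
  } > "$file"
}
```

With this helper, the "Pass resets counter" row corresponds to `make_state_fixture "$f" fail fail pass fail fail`, for which the parser is expected to print 2.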

E2E tests: 1 (test_stuck_cycling_halts_after_max_build_retries)

Setup: SW_PIPELINE_MAX_BUILD_RETRIES=2, a mock sw-loop that commits and exits 0, and a mock test that always exits 1. Pre-seed the state file with 1 prior test failure. Run the pipeline. Assert stuck_cycling in the state file after 1 additional failure (total = 2 = cap).
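
One way to provide the mocks is to place stub executables at the front of PATH. How the existing harness in sw-pipeline-test.sh mocks commands is not shown on this page, so the layout below, the make_stub helper, and the run-tests command name are all assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical stub setup; names and layout are illustrative assumptions.

# Usage: make_stub <dir> <command-name> <exit-code>
make_stub() {
  local dir="$1" name="$2" code="$3"
  mkdir -p "$dir"
  printf '#!/bin/sh\nexit %s\n' "$code" > "$dir/$name"
  chmod +x "$dir/$name"
}

mock_bin="$(mktemp -d)"
make_stub "$mock_bin" sw-loop 0      # build loop reports success
make_stub "$mock_bin" run-tests 1    # test command always fails
PATH="$mock_bin:$PATH"
```

Pre-seeding one failure and allowing one more keeps the test short while still exercising the cross-invocation path, since the halt depends on an entry the current run did not write.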

Coverage targets:

  • Parsing function: 100% branch coverage across the 6 unit cases above
  • Cycling halt check: E2E covers the cap-reached path; existing tests cover the cap-not-reached path implicitly
  • pipeline_status() display: covered by existing status display tests once stuck_cycling case is added

Critical paths:

  • Happy path: test passes on cycle 2 → counter resets → no halt
  • Error case 1: test fails N times in one invocation → halt at cap
  • Error case 2: test fails across multiple daemon re-invocations → halt at cap (cross-restart persistence)
  • Edge case 1: SW_PIPELINE_MAX_BUILD_RETRIES=0 → no halt, loop continues
  • Edge case 2: state file absent on first cycle → count=0, loop proceeds normally
