
Pipeline Design 206

ezigus edited this page Mar 20, 2026 · 1 revision

ADR written to .claude/pipeline-artifacts/design.md.

Key findings from the codebase review that shaped the design:

  1. Existing cmd_trace() is broken for current events — it matches pipeline_start/stage_start but the actual schema uses pipeline.started/stage.started. The new cmd_trace_export() is a parallel function rather than a refactor to avoid breaking unknown consumers.

  2. Events already carry ts_epoch (integer seconds), so nanosecond conversion reduces to multiplying by 10^9 and avoids cross-platform date parsing issues entirely.

  3. The dispatch pattern at sw-pipeline.sh:3178 and sw-otel.sh:582 uses a clean case statement — adding new subcommands is mechanical.

  4. Auto-export hooks into pipeline_cleanup_worktree() at line 2071, which is the actual cleanup function in the codebase, rather than at line 2727 as the plan suggested.

  5. Added otel.export_failed event type alongside otel.trace_exported — the plan only had the success event, but failure observability matters for the auto-export path where errors are intentionally swallowed.

Constraints:

  • Bash 3.2 compatible (no associative arrays, no readarray, no ${var,,})
  • Must use jq --arg for JSON construction (never string interpolation)
  • Pure bash + jq — no new dependencies
  • Non-blocking: auto-export must never fail the pipeline
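The jq --arg constraint can be illustrated with a minimal sketch. The helper name build_export_failed_event and the exact field set (type, run_id, error) are assumptions for illustration, not the project's actual emit_event() implementation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build an otel.export_failed event as one JSON line.
# The field names (type, run_id, error) are assumed for illustration.
build_export_failed_event() {
  local run_id="$1" err="$2"
  # --arg passes each value to jq as a string, so quotes, backslashes, and
  # newlines inside $err are escaped by jq rather than corrupting the JSON.
  jq -c -n \
    --arg type "otel.export_failed" \
    --arg run_id "$run_id" \
    --arg error "$err" \
    '{type: $type, run_id: $run_id, error: $error}'
}
```

A call such as build_export_failed_event "run-42" 'curl: (7) "refused"' still yields parseable JSON, which naive string interpolation would not guarantee.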

Decision

Add a new cmd_trace_export() function to scripts/sw-otel.sh that produces spec-compliant OTLP/HTTP JSON. This is a new function alongside the existing cmd_trace() — the existing function is left intact to avoid breaking current consumers.

Core design choices:

  1. New function, not a refactor of cmd_trace(): The existing function has unknown consumers. cmd_trace_export() produces correct OTLP; a future cleanup can deprecate cmd_trace().

  2. Deterministic span/trace IDs via sha256: traceId = sha256(run-id) | head -c 32, spanId = sha256(run-id + stage) | head -c 16. This makes exports idempotent — re-exporting the same run produces identical output, enabling safe retries and diffing.

  3. Run-id matching against both job_id and issue fields: Pipeline events carry job_id; some carry issue number. Grep pre-filters events.jsonl before piping to jq, bounding I/O for large files.

  4. Nanosecond timestamps from ts_epoch, not ISO strings: Events carry ts_epoch (integer seconds). Multiplying by 1000000000 yields nanoseconds. Events have no sub-second precision, so the conversion is exact for our data, and no date parsing is needed.

  5. OTLP attribute encoding: All attributes use the array-of-{key, value} format per the OTLP spec. Values are typed: stringValue for strings, intValue for integers (as strings per proto3 JSON), doubleValue for floats.

  6. Root span from pipeline.started/pipeline.completed; child spans from stage.* events: Each stage span's parentSpanId references the root pipeline span. Skipped stages get SPAN_KIND_INTERNAL with status UNSET. Failed stages get status code 2 (ERROR) with the error message.

  7. Auto-export fires in pipeline_cleanup_worktree() (sw-pipeline.sh:2071): After a successful pipeline completion, if OTEL_EXPORTER_OTLP_ENDPOINT is set, spawn sw-otel.sh trace-export <id> --send with stderr redirected and || true to ensure it never blocks cleanup.

  8. pipeline export subcommand: Thin delegation — parses --format otel (default and only format), forwards remaining args to sw-otel.sh trace-export.

Data flow:

User: shipwright pipeline export --format otel <run-id>
  → sw-pipeline.sh dispatches to sw-otel.sh trace-export <run-id>
  → grep filters events.jsonl by run-id (job_id or issue)
  → jq builds root span from pipeline.started/completed pair
  → jq builds child spans from stage.started → stage.completed/failed/skipped pairs
  → jq assembles OTLP resourceSpans envelope
  → stdout (or --output file, or --send POST to OTLP endpoint)
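The grep-then-jq filtering step in this flow might look like the sketch below. The function name filter_run_events is hypothetical, and unlike the design, this sketch drops malformed lines silently instead of warning on stderr.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: select the events that belong to one run.
# Usage: filter_run_events <run-id> <events-file>
filter_run_events() {
  local run_id="$1" events_file="$2"
  # grep -F is a cheap fixed-string pre-filter. It can over-match (run-id "42"
  # also hits "142"), so jq makes the exact decision afterwards.
  grep -F -- "$run_id" "$events_file" \
    | jq -c -R --arg id "$run_id" '
        fromjson?                      # skip lines that are not valid JSON
        | select(type == "object")
        | select((.job_id | tostring) == $id or (.issue | tostring) == $id)'
}
```

Reading with -R and parsing with fromjson? keeps one bad line from aborting the whole export, and comparing through tostring lets a numeric issue field match a run-id passed as a string.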

Error boundaries:

| Boundary | Behavior |
| --- | --- |
| Malformed event lines | jq returns empty; the line is skipped with a warning to stderr |
| No matching events for run-id | error() + exit 1 |
| Missing jq | error() with install instructions + exit 1 |
| --send fails (curl error) | error() + emit otel.export_failed event + exit 1 |
| Auto-export path failure | Swallowed by `2>/dev/null \|\| true` |
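The auto-export boundary can be sketched as a guard function. The names maybe_auto_export and SW_OTEL_BIN are assumptions for illustration; the real hook lives inside pipeline_cleanup_worktree().

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the non-blocking auto-export guard.
# SW_OTEL_BIN is an assumed override for the script path (handy for testing).
maybe_auto_export() {
  local run_id="$1"
  # Do nothing unless an OTLP endpoint is configured.
  [ -n "${OTEL_EXPORTER_OTLP_ENDPOINT:-}" ] || return 0
  # Failures are swallowed on purpose: stderr is discarded and `|| true`
  # forces a zero status so cleanup always continues.
  "${SW_OTEL_BIN:-sw-otel.sh}" trace-export "$run_id" --send 2>/dev/null || true
}
```

Even when the export command exits non-zero or cannot be found at all, maybe_auto_export returns 0, so the pipeline exit code is unaffected.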

Alternatives Considered

  1. Refactor existing cmd_trace() in-place — Pros: single function, no duplication / Cons: breaks unknown consumers of current output format, riskier change. The current function uses legacy event type names (pipeline_start vs pipeline.started) and non-standard OTLP structure. Migrating it would be a breaking change with unclear blast radius.

  2. New standalone sw-pipeline-export.sh script — Pros: clean separation, independent lifecycle / Cons: duplicates event-reading patterns, EVENTS_FILE path management, emit_event() helpers already in sw-otel.sh. Inconsistent with the existing pattern where all OTel concerns live in sw-otel.sh.

  3. Node.js implementation using @opentelemetry/sdk-trace-base — Pros: official SDK, guaranteed spec compliance / Cons: new dependency, heavier runtime, breaks the pure-bash pattern of the scripts directory. The project's shell scripts intentionally avoid Node dependencies for portability.

Implementation Plan

Files to modify

| File | Lines affected | Change |
| --- | --- | --- |
| scripts/sw-otel.sh | +~150 lines after line 319 | New cmd_trace_export() function |
| scripts/sw-otel.sh | lines 540-577 (help) | Add trace-export to help text |
| scripts/sw-otel.sh | lines 582-611 (dispatch) | Add trace-export case |
| scripts/sw-pipeline.sh | lines 3178-3196 (dispatch) | Add export case |
| scripts/sw-pipeline.sh | lines 340-379 (help) | Add export to help text |
| scripts/sw-pipeline.sh | lines 2071-2105 (cleanup) | Add auto-export hook |
| config/event-schema.json | end of event_types | Add otel.trace_exported and otel.export_failed types |
| scripts/sw-otel-test.sh | +~120 lines | 10 new test cases for trace-export |

Files to create

| File | Purpose |
| --- | --- |
| docs/observability.md | Jaeger/Honeycomb integration guide with Docker setup example |

Dependencies

  • No new dependencies. Uses existing jq, curl, sha256sum/shasum, date, grep.
  • Cross-platform sha256: sha256sum (Linux) or shasum -a 256 (macOS) — add a sha256_hex() helper in the function using the same pattern as compat.sh.
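A minimal sketch of the sha256_hex() helper, together with the deterministic ID derivation from the Decision section, follows. The wrappers trace_id_for and span_id_for are hypothetical names, and the pattern in compat.sh may differ in detail.

```shell
#!/usr/bin/env bash
# Print the lowercase hex SHA-256 of the first argument.
sha256_hex() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | cut -d' ' -f1      # Linux (coreutils)
  elif command -v shasum >/dev/null 2>&1; then
    printf '%s' "$1" | shasum -a 256 | cut -d' ' -f1  # macOS
  else
    echo "sha256_hex: need sha256sum or shasum" >&2
    return 1
  fi
}

# OTLP trace IDs are 16 bytes (32 hex chars); span IDs are 8 bytes (16 hex chars).
# Hypothetical wrappers: trace ID from the run-id, span ID from run-id + stage.
trace_id_for() { sha256_hex "$1" | cut -c1-32; }
span_id_for()  { sha256_hex "$1$2" | cut -c1-16; }
```

printf '%s' is used instead of echo so that no trailing newline enters the hash, which keeps the IDs identical across platforms and across repeated exports.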

Risk areas

| Risk | Severity | Mitigation |
| --- | --- | --- |
| OTLP JSON doesn't validate against Jaeger | Medium | Test against OTLP proto3 JSON spec field names; include a test that validates structure with jq schema checks |
| Nanosecond timestamp precision | Low | Events carry ts_epoch as integer seconds; multiply by 10^9, so there are no cross-platform date issues |
| Large events.jsonl (>100K lines) causes slowness | Low | Pre-filter with grep for run-id before jq parsing; stays under 5s for 100K lines |
| pipeline_cleanup_worktree() modifying shared cleanup path | Low | Auto-export is appended after existing logic, guarded by env var check, non-blocking |

Validation Criteria

  • shipwright otel trace-export <run-id> outputs valid JSON parseable by jq
  • Output has exactly one resourceSpans entry with service.name = "shipwright"
  • Root pipeline span has empty parentSpanId, correct traceId, and nanosecond timestamps
  • Each stage span has parentSpanId equal to root span's spanId
  • Failed stage spans have status.code = 2 (ERROR) with message
  • Completed stage spans have status.code = 1 (OK)
  • Skipped stage spans have status.code = 0 (UNSET)
  • Attributes use OTLP array-of-key-value format with typed values
  • --output <file> writes to file instead of stdout
  • --send POSTs to OTEL_EXPORTER_OTLP_ENDPOINT/v1/traces with Content-Type: application/json
  • Auto-export triggers only when OTEL_EXPORTER_OTLP_ENDPOINT is set
  • Auto-export failure does not affect pipeline exit code
  • Re-exporting same run-id produces identical output (deterministic IDs)
  • Run-id matches both job_id field and issue field
  • Missing run-id returns exit 1 with descriptive error
  • All 10 new tests in sw-otel-test.sh pass
  • All existing tests pass (npm test)
  • No Bash 3.2 incompatibilities (no associative arrays, no readarray)
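Several of the structural criteria above can be checked with one jq expression. The sketch below is a hypothetical check (the name check_otlp_structure is not from the codebase). It assumes spans sit under resourceSpans[0].scopeSpans[].spans[] and that the root is the only span with an empty or missing parentSpanId.

```shell
#!/usr/bin/env bash
# Hypothetical structural check for an exported trace read from stdin.
# Exits 0 when the structure matches, non-zero otherwise.
check_otlp_structure() {
  jq -e '
    (.resourceSpans | length) == 1
    and ([.resourceSpans[0].scopeSpans[].spans[]] as $spans
      | [$spans[] | select((.parentSpanId // "") == "")] as $roots
      # exactly one root span
      | ($roots | length) == 1
      # every span shares the root trace ID
      and all($spans[]; .traceId == $roots[0].traceId)
      # every non-root span points at the root span
      and all($spans[] | select((.parentSpanId // "") != "");
              .parentSpanId == $roots[0].spanId)
      # 64-bit timestamps are encoded as strings
      and all($spans[]; (.startTimeUnixNano | type) == "string"))
  ' >/dev/null
}
```

Usage would be along the lines of shipwright otel trace-export <run-id> | check_otlp_structure, with the exit status deciding pass or fail in a test case.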

Endpoint Specification

CLI endpoint: shipwright pipeline export [--format otel] <run-id>

  • Input: run-id — string matching job_id or issue number in events
  • Output: OTLP JSON to stdout (exit 0), or error to stderr (exit 1)
  • Flags: --format otel (default, only format), --output <file>, --send

CLI endpoint: shipwright otel trace-export <run-id> [--output <file>] [--send]

  • Same behavior (pipeline export delegates here)

Error codes:

  • Exit 0: Successful export
  • Exit 1: Missing argument, no matching events, jq unavailable, or --send failure
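The thin delegation from pipeline export to trace-export could be sketched as follows. The function name pipeline_export and the SW_OTEL_BIN override are assumptions for illustration, and this sketch only recognizes --format as the leading argument, as in the usage line above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the `pipeline export` delegation.
# SW_OTEL_BIN is an assumed override naming the script to delegate to.
pipeline_export() {
  local format="otel"
  # Consume a leading "--format <name>"; everything else is forwarded as-is.
  if [ "${1:-}" = "--format" ]; then
    if [ $# -lt 2 ]; then
      echo "pipeline export: --format requires a value" >&2
      return 1
    fi
    format="$2"
    shift 2
  fi
  if [ "$format" != "otel" ]; then
    echo "pipeline export: unsupported format '$format' (only 'otel')" >&2
    return 1
  fi
  if [ $# -eq 0 ]; then
    echo "pipeline export: missing <run-id>" >&2
    return 1
  fi
  "${SW_OTEL_BIN:-sw-otel.sh}" trace-export "$@"
}
```

Because the remaining arguments pass through as "$@", flags such as --output <file> and --send reach trace-export unchanged, and its exit status becomes the exit status of pipeline export.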

Rate Limiting: N/A (CLI tool, not a service).

Versioning: N/A (internal CLI with no external API contract).

Monitoring Checklist

Not applicable — this is a local CLI export tool, not a deployed service.

Self-monitoring is handled by:

  • otel.trace_exported event emitted on successful export (captures run_id, endpoint, spans_count)
  • otel.export_failed event emitted on --send failure (captures run_id, error)
  • Both events are queryable via existing shipwright otel metrics

Anomaly Detection / Log Analysis / Auto-Rollback

Not applicable — additive CLI feature with no deployed runtime component. Failures are local and immediately visible to the user. The auto-export path is explicitly non-blocking (|| true), so there is no production blast radius to monitor.
