
Pipeline Plan 206

ezigus edited this page Mar 20, 2026 · 1 revision

Plan written to .claude/pipeline-artifacts/plan.md.

Summary of the approach:

  1. Enhance sw-otel.sh with a new cmd_trace_export() function that builds proper OTLP JSON with parent/child span hierarchy, nanosecond timestamps, and typed attributes (string/int/double)
  2. Add pipeline export subcommand in sw-pipeline.sh that delegates to sw-otel.sh trace-export
  3. Auto-export hook in pipeline completion path — fires when OTEL_EXPORTER_OTLP_ENDPOINT is set
  4. 5 files touched: sw-otel.sh (core logic), sw-pipeline.sh (CLI + auto-export), event-schema.json (new event type), sw-otel-test.sh (10 test cases), docs/observability.md (Jaeger guide)

Key design decisions:

  • Reuse sw-otel.sh rather than creating a new script — it already has trace-building infrastructure
  • Run-id matching supports both job_id and issue number for flexibility
  • Deterministic span IDs from sha256 of run-id + stage name for reproducibility
  • Non-blocking auto-export — failures logged but never fail the pipeline
  • Pure bash + jq, so no new dependencies
  1. Auto-export on pipeline completion if OTEL_EXPORTER_ENDPOINT set
  2. Integration example with Jaeger in docs/observability.md

Design Alternatives

Approach A: Enhance existing sw-otel.sh cmd_trace()

  • Modify the existing cmd_trace() function to accept run-id, build proper span hierarchy with attributes
  • Add pipeline export subcommand in sw-pipeline.sh that delegates to sw-otel.sh
  • Pros: Reuses existing infrastructure, minimal new files, consistent with codebase patterns
  • Cons: Makes sw-otel.sh larger

Approach B: New dedicated sw-pipeline-export.sh script

  • Create a new standalone script for OTLP export
  • Pros: Clean separation
  • Cons: Duplicates event-reading logic, more files to maintain, inconsistent with the existing sw-otel.sh which already has trace building

Chosen: Approach A — enhancing sw-otel.sh minimizes blast radius and builds on existing code. The pipeline export subcommand in sw-pipeline.sh will delegate to sw-otel.sh trace-export.

Risk Assessment

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| OTLP JSON format doesn't match spec | Medium | Use exact field names from OTLP protobuf spec; validate against Jaeger |
| Large events.jsonl causes slow export | Low | Filter by run-id early with grep before jq parsing |
| Auto-export fails silently on completion | Medium | Log export errors but don't fail the pipeline; emit event |
| Bash 3.2 compatibility issues | Low | Avoid associative arrays; use indexed arrays and jq |
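
The Bash 3.2 mitigation can be illustrated with a small sketch. It assumes the exporter needs to look up a start time per stage; the array and function names (STAGE_NAMES, STAGE_STARTS, stage_start_put, stage_start_get) are hypothetical and do not come from sw-otel.sh.

```shell
#!/usr/bin/env bash
# Bash 3.2 has no "declare -A", so two parallel indexed arrays stand in
# for a stage -> start-time map. All names here are hypothetical.
STAGE_NAMES=()
STAGE_STARTS=()

# stage_start_put <stage> <start-time>: insert or overwrite an entry.
stage_start_put() {
  local i
  for ((i = 0; i < ${#STAGE_NAMES[@]}; i++)); do
    if [ "${STAGE_NAMES[$i]}" = "$1" ]; then
      STAGE_STARTS[$i]="$2"
      return 0
    fi
  done
  STAGE_NAMES+=("$1")
  STAGE_STARTS+=("$2")
}

# stage_start_get <stage>: print the stored value, return 1 if unknown.
stage_start_get() {
  local i
  for ((i = 0; i < ${#STAGE_NAMES[@]}; i++)); do
    if [ "${STAGE_NAMES[$i]}" = "$1" ]; then
      printf '%s\n' "${STAGE_STARTS[$i]}"
      return 0
    fi
  done
  return 1
}
```

Lookups are linear, which should be acceptable for the handful of stages in one pipeline run. Anything larger is probably better left to jq.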

Dependency Analysis

Depends on:

  • scripts/lib/helpers.sh: emit_event(), output helpers, now_iso()
  • scripts/sw-otel.sh — existing trace/metrics infrastructure
  • scripts/sw-pipeline.sh — pipeline completion hook, subcommand dispatch
  • config/event-schema.json — event type definitions
  • ~/.shipwright/events.jsonl — event data source

Depended on by: Nothing (new feature, additive only)


Architecture Decision Record

Context

Pipeline executions generate rich event data in events.jsonl but there's no way to visualize execution flow in standard observability tools. The existing cmd_trace() in sw-otel.sh builds a rudimentary OTLP structure but lacks run-id filtering, proper span IDs, attributes, and parent/child relationships.

Decision

Enhance sw-otel.sh with a new trace-export subcommand that builds proper OTLP JSON from events. Add pipeline export as a thin delegation. Hook auto-export into the pipeline completion path.

Consequences

  • Users can export any pipeline run as OTLP traces viewable in Jaeger/Honeycomb
  • Auto-export on completion enables continuous observability without manual steps
  • No new dependencies; pure bash + jq implementation

Component Diagram

┌─────────────────────┐
│  sw-pipeline.sh     │
│  (export subcommand)│
└────────┬────────────┘
         │ delegates
         ▼
┌─────────────────────┐     ┌──────────────────┐
│  sw-otel.sh         │────▶│ events.jsonl     │
│  (trace-export)     │     │ (event source)   │
└────────┬────────────┘     └──────────────────┘
         │ outputs
         ▼
┌─────────────────────┐     ┌──────────────────┐
│  OTLP JSON          │────▶│ OTLP Collector   │
│  (stdout or file)   │     │ Jaeger/Honeycomb │
└─────────────────────┘     └──────────────────┘

Auto-export path:
┌─────────────────────┐
│  sw-pipeline.sh     │
│  pipeline_cleanup() │──── if OTEL_EXPORTER_OTLP_ENDPOINT set ──▶ sw-otel.sh trace-export --send
└─────────────────────┘

Interface Contracts

// OTLP JSON output structure (OTLP/HTTP JSON encoding)
interface OTLPTraceExport {
  resourceSpans: [{
    resource: {
      attributes: Array<{key: string, value: {stringValue: string}}>
    },
    scopeSpans: [{
      scope: { name: string, version: string },
      spans: Span[]
    }]
  }]
}

interface Span {
  traceId: string           // 32 hex chars
  spanId: string            // 16 hex chars
  parentSpanId: string      // 16 hex chars (empty for root)
  name: string              // "pipeline" or stage name
  kind: number              // 1 = SPAN_KIND_INTERNAL
  startTimeUnixNano: string // nanosecond epoch as string
  endTimeUnixNano: string   // nanosecond epoch as string
  status: {
    code: number            // 0=UNSET, 1=OK, 2=ERROR
    message?: string
  }
  attributes: Array<{
    key: string,
    value: {stringValue?: string, intValue?: string, doubleValue?: number}
  }>
}

// CLI interface
// shipwright pipeline export --format otel <run-id>
// shipwright otel trace-export <run-id> [--output <file>]
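
The startTimeUnixNano and endTimeUnixNano fields in the interface above are nanosecond epochs carried as strings. A minimal sketch of the conversion follows, assuming events carry second-precision ISO-8601 UTC timestamps such as 2026-03-20T12:00:00Z. The helper name iso_to_unix_nano is hypothetical, and sub-second precision is not handled.

```shell
#!/usr/bin/env bash
# iso_to_unix_nano <iso-8601-utc>: print a nanosecond-epoch string.
# Hypothetical helper; assumes timestamps like 2026-03-20T12:00:00Z.
iso_to_unix_nano() {
  local iso="$1" secs
  if secs=$(date -u -d "$iso" +%s 2>/dev/null); then
    :  # GNU date (Linux)
  elif secs=$(date -u -j -f "%Y-%m-%dT%H:%M:%SZ" "$iso" +%s 2>/dev/null); then
    :  # BSD date (macOS)
  else
    echo "cannot parse timestamp: $iso" >&2
    return 1
  fi
  # Append nine zeros rather than multiplying, and keep the result a string:
  # the OTLP JSON encoding represents 64-bit integers as strings.
  printf '%s000000000\n' "$secs"
}
```

For example, `iso_to_unix_nano 2000-01-01T00:00:00Z` prints 946684800000000000.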

Data Flow

1. User runs: shipwright pipeline export --format otel <run-id>
2. sw-pipeline.sh delegates to: sw-otel.sh trace-export <run-id>
3. sw-otel.sh reads ~/.shipwright/events.jsonl
4. Filters events by job_id or issue number matching <run-id>
5. Builds root span from pipeline.started → pipeline.completed events
6. Builds child spans from stage.started → stage.completed/failed events
7. Attaches attributes (cost, template, outcome, etc.) from event fields
8. Outputs OTLP JSON to stdout (or file with --output)
9. If --send flag or auto-export: POST to OTEL_EXPORTER_OTLP_ENDPOINT/v1/traces
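
Steps 3 and 4 of the flow above could look roughly like the sketch below. The field names job_id and issue come from this plan; the helper name filter_run_events and everything else about the event shape are assumptions. Malformed lines are dropped silently here, so the stderr warning described under Error Boundaries is not implemented.

```shell
#!/usr/bin/env bash
# filter_run_events <events-file> <run-id>
# Hypothetical helper: print the events that belong to one run, one per line.
filter_run_events() {
  local file="$1" run_id="$2"
  # Cheap fixed-string prefilter so jq only parses lines mentioning the id.
  # "|| true" keeps a no-match grep from failing the pipeline under set -e.
  { grep -F -- "$run_id" "$file" || true; } |
    # -R reads raw lines; "fromjson?" skips malformed ones instead of aborting.
    # "objects" discards valid JSON that is not an event object.
    jq -cR --arg id "$run_id" '
      fromjson?
      | objects
      | select((.job_id // "" | tostring) == $id
               or (.issue // "" | tostring) == $id)'
}
```

The exact comparison in jq matters because the grep prefilter is only a substring match: a run-id of 206 also lets issue 2060 through, and the select() step is what rejects it.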

Error Boundaries

  • Event parsing errors: Skip malformed lines, continue processing (warn to stderr)
  • Missing run-id: Error with "No events found for run-id: X", exit 1
  • OTLP export failure: Log error, emit otel.export_failed event, exit 1 (non-fatal in auto-export path)
  • Missing jq: Error with install instructions, exit 1

Files to Modify

| File | Action | Purpose |
|------|--------|---------|
| scripts/sw-otel.sh | Modify | Add trace-export subcommand with proper OTLP span building |
| scripts/sw-pipeline.sh | Modify | Add export subcommand, add auto-export hook at completion |
| config/event-schema.json | Modify | Add otel.trace_exported event type |
| scripts/sw-otel-test.sh | Modify | Add tests for trace-export |
| docs/observability.md | Create | Jaeger integration guide |

Implementation Steps

Step 1: Add trace-export command to sw-otel.sh

Add a new cmd_trace_export() function that:

  1. Accepts <run-id> argument (matches against job_id or issue fields)
  2. Reads events.jsonl and filters events for the given run
  3. Generates deterministic trace ID from run-id (sha256 truncated to 32 hex chars)
  4. Generates deterministic span IDs from run-id + stage name (sha256 truncated to 16 hex chars)
  5. Builds root span from pipeline.started/pipeline.completed events
  6. Builds child spans from stage.started/stage.completed/stage.failed/stage.skipped events
  7. Converts ISO timestamps to nanosecond epoch (OTLP requirement)
  8. Attaches proper OTLP-format attributes using {key, value: {stringValue/intValue}} encoding
  9. Outputs valid OTLP JSON
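
The deterministic IDs in items 3 and 4 can be sketched as follows. The helper names and the "run-id:stage" join format are assumptions made for illustration; the plan only specifies a truncated sha256.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for deterministic OTLP IDs.

# _sha256_hex <string>: print the hex digest of the string (no newline hashed).
_sha256_hex() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | cut -d' ' -f1
  else
    printf '%s' "$1" | shasum -a 256 | cut -d' ' -f1  # macOS fallback
  fi
}

# trace_id_for <run-id>: 32 hex chars, stable for a given run.
trace_id_for() {
  _sha256_hex "$1" | cut -c1-32
}

# span_id_for <run-id> <stage>: 16 hex chars, stable per run and stage.
# The ":" separator is an assumption of this sketch.
span_id_for() {
  _sha256_hex "$1:$2" | cut -c1-16
}
```

Because the IDs depend only on their inputs, exporting the same run twice yields the same trace and span IDs. The root span could use a fixed pseudo-stage such as `span_id_for "$run_id" pipeline`, and stage spans would then set parentSpanId to that value.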

Root span attributes:

  • pipeline.issue, pipeline.template, pipeline.result, pipeline.total_cost, pipeline.iterations, pipeline.agent_id

Stage span attributes:

  • stage.name, stage.outcome, stage.duration_s, stage.error_class
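
One way to produce the typed {key, value} attribute encoding with jq is sketched below. The helper name otlp_attr and its third argument are hypothetical. Note that intValue is emitted as a string, matching the interface contract, while doubleValue is a JSON number.

```shell
#!/usr/bin/env bash
# otlp_attr <key> <value> [string|int|double]
# Hypothetical helper: print one OTLP attribute object as compact JSON.
otlp_attr() {
  local key="$1" val="$2" kind="${3:-string}"
  case "$kind" in
    int)
      # 64-bit integers are carried as strings in the JSON encoding.
      jq -cn --arg k "$key" --arg v "$val" '{key: $k, value: {intValue: $v}}' ;;
    double)
      # --argjson parses the value, so it is emitted as a JSON number.
      jq -cn --arg k "$key" --argjson v "$val" '{key: $k, value: {doubleValue: $v}}' ;;
    *)
      jq -cn --arg k "$key" --arg v "$val" '{key: $k, value: {stringValue: $v}}' ;;
  esac
}
```

For example, `otlp_attr stage.name build` prints `{"key":"stage.name","value":{"stringValue":"build"}}`.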

Add --output <file> flag to write to file instead of stdout. Add --send flag to POST to OTEL_EXPORTER_OTLP_ENDPOINT/v1/traces.

Update the help text and main dispatch.
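
A minimal sketch of what --send might do, assuming the OTLP JSON has already been written to a file. The function name send_otlp and the 10-second timeout are assumptions; the curl options themselves are standard.

```shell
#!/usr/bin/env bash
# send_otlp <otlp-json-file>
# Hypothetical helper: POST a payload to the OTLP/HTTP traces endpoint.
send_otlp() {
  local payload="$1"
  local endpoint="${OTEL_EXPORTER_OTLP_ENDPOINT:-}"
  if [ -z "$endpoint" ]; then
    echo "OTEL_EXPORTER_OTLP_ENDPOINT is not set" >&2
    return 1
  fi
  # --fail turns HTTP 4xx/5xx into a non-zero exit status;
  # --max-time bounds how long an unreachable collector can block the caller.
  # "${endpoint%/}" strips one trailing slash before appending the path.
  curl --silent --show-error --fail --max-time 10 \
    -X POST "${endpoint%/}/v1/traces" \
    -H 'Content-Type: application/json' \
    --data-binary "@$payload" >/dev/null
}
```

With a local collector this would typically be called as `OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 send_otlp trace.json`, where 4318 is the conventional OTLP/HTTP port.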

Step 2: Add export subcommand to sw-pipeline.sh

Add a pipeline_export() function and dispatch case:

  1. Parse --format otel flag (only format for now, default to otel)
  2. Accept positional <run-id> argument
  3. Delegate to bash "$SCRIPT_DIR/sw-otel.sh" trace-export "$run_id" "$@"
  4. Add to show_help() output
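
The delegation described in this step might be sketched as below. This is not the actual sw-pipeline.sh code: the error messages and the choice to forward only --output and --send are assumptions. SCRIPT_DIR is expected to be set by the surrounding script.

```shell
#!/usr/bin/env bash
# pipeline_export [--format otel] <run-id> [--output <file>] [--send]
# Sketch only: validates arguments, then delegates to sw-otel.sh.
pipeline_export() {
  local format="otel" run_id=""
  local extra=()
  while [ $# -gt 0 ]; do
    case "$1" in
      --format)
        [ $# -ge 2 ] || { echo "--format requires a value" >&2; return 1; }
        format="$2"; shift 2 ;;
      --output)
        [ $# -ge 2 ] || { echo "--output requires a file" >&2; return 1; }
        extra+=("$1" "$2"); shift 2 ;;
      --send)
        extra+=("$1"); shift ;;
      -*)
        echo "Unknown option: $1" >&2; return 1 ;;
      *)
        run_id="$1"; shift ;;
    esac
  done
  if [ "$format" != "otel" ]; then
    echo "Unsupported format: $format (only otel is available)" >&2
    return 1
  fi
  if [ -z "$run_id" ]; then
    echo "Usage: shipwright pipeline export --format otel <run-id>" >&2
    return 1
  fi
  # The ${extra[@]+...} form avoids an unbound-variable error on Bash 3.2
  # when no extra flags were given and "set -u" is active.
  bash "$SCRIPT_DIR/sw-otel.sh" trace-export "$run_id" ${extra[@]+"${extra[@]}"}
}
```

Checking the argument count before each `shift 2` matters: a trailing `--format` with no value would otherwise leave the loop spinning, because a failed shift does not consume the argument.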

Step 3: Add auto-export on pipeline completion

In sw-pipeline.sh pipeline_cleanup() function (around line 2727, after successful completion event):

  1. Check if OTEL_EXPORTER_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT is set
  2. If set, run bash "$SCRIPT_DIR/sw-otel.sh" trace-export "$SHIPWRIGHT_PIPELINE_ID" --send 2>/dev/null || true
  3. Emit otel.trace_exported event on success
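
A non-blocking hook along these lines could be sketched as follows. The function name auto_export_trace is hypothetical, and the echo lines stand in for emit_event() and the project's logging helpers, whose signatures are not shown in this plan. Unlike the one-liner in item 2, the sketch uses if/else so that success and failure can be told apart.

```shell
#!/usr/bin/env bash
# auto_export_trace <run-id>
# Hypothetical hook body: always returns 0 so the pipeline result is unaffected.
auto_export_trace() {
  local run_id="$1"
  # Accept either variable name; prefer the standard OTLP one.
  local endpoint="${OTEL_EXPORTER_OTLP_ENDPOINT:-${OTEL_EXPORTER_ENDPOINT:-}}"

  # No endpoint configured means auto-export is disabled.
  [ -n "$endpoint" ] || return 0

  # Pass the resolved endpoint down so the exporter sees one consistent name.
  if OTEL_EXPORTER_OTLP_ENDPOINT="$endpoint" \
      bash "$SCRIPT_DIR/sw-otel.sh" trace-export "$run_id" --send >/dev/null 2>&1; then
    # Placeholder for emitting the otel.trace_exported event.
    echo "otel.trace_exported run_id=$run_id"
  else
    # Placeholder for logging; the failure is reported but not propagated.
    echo "WARN: OTLP auto-export failed for run $run_id" >&2
  fi
  return 0
}
```

In pipeline_cleanup() this would be a single call such as `auto_export_trace "$SHIPWRIGHT_PIPELINE_ID"` after the completion event.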

Step 4: Update event schema

Add to config/event-schema.json:

"otel.trace_exported": {
  "required": ["run_id"],
  "optional": ["endpoint", "spans_count"]
}

Step 5: Write tests

Add to scripts/sw-otel-test.sh:

  1. Test trace-export with synthetic events (pipeline.started + stage events + pipeline.completed)
  2. Validate OTLP JSON structure (resourceSpans, scopeSpans, spans array)
  3. Validate span parent/child relationships (stage spans reference root span)
  4. Validate span attributes are present
  5. Validate timestamps are nanosecond epoch strings
  6. Test with issue number as run-id
  7. Test with job_id as run-id
  8. Test with no matching events (error case)
  9. Test --output flag writes to file
  10. Test --send flag behavior (mock curl)
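
Test 3 (parent/child relationships) could be expressed as a jq assertion over the exported file. The helper name assert_children_reference_root is hypothetical. It treats a span with an empty or missing parentSpanId as a root and requires exactly one of them.

```shell
#!/usr/bin/env bash
# assert_children_reference_root <otlp-json-file>
# Hypothetical test helper: succeeds when there is exactly one root span and
# every other span names it as parentSpanId.
assert_children_reference_root() {
  jq -e '
    [.resourceSpans[].scopeSpans[].spans[]] as $spans
    | [$spans[] | select((.parentSpanId // "") == "")] as $roots
    | ($roots | length) == 1
      and ([$spans[]
            | select((.parentSpanId // "") != "")
            | .parentSpanId == $roots[0].spanId] | all)
  ' "$1" >/dev/null
}
```

`jq -e` sets the exit status from the final boolean, so the helper can be used directly in an `if` or followed by `|| fail`.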

Step 6: Create docs/observability.md

Write integration guide with:

  1. Prerequisites (Jaeger Docker setup)
  2. Running a pipeline and exporting traces
  3. Auto-export configuration
  4. Viewing traces in Jaeger UI
  5. Honeycomb setup alternative

Task Checklist

  • Task 1: Implement cmd_trace_export() in sw-otel.sh with OTLP span building, run-id filtering, proper attributes
  • Task 2: Add trace-export subcommand dispatch and help text in sw-otel.sh
  • Task 3: Add export subcommand to sw-pipeline.sh delegating to sw-otel.sh
  • Task 4: Add auto-export hook in sw-pipeline.sh pipeline_cleanup() on completion
  • Task 5: Add otel.trace_exported event type to config/event-schema.json
  • Task 6: Write tests for trace-export in sw-otel-test.sh
  • Task 7: Create docs/observability.md with Jaeger integration example
  • Task 8: Run npm test (specifically sw-otel-test.sh) and fix any failures

Testing Approach

Test Pyramid Breakdown

  • Unit tests (8): OTLP JSON structure validation, span relationships, attribute encoding, timestamp conversion, run-id filtering, error handling (in sw-otel-test.sh)
  • Integration tests (2): Full pipeline export flow via sw-pipeline.sh export, auto-export mock (in sw-otel-test.sh)
  • E2E tests (0): Skipped — requires actual OTLP collector; covered by Jaeger docs example

Coverage Targets

  • All span types: root pipeline span, completed stage span, failed stage span, skipped stage span
  • All attribute types: string, integer, double values
  • Error paths: missing run-id, no events, malformed events

Critical Paths to Test

  • Happy path: Export a complete pipeline run with 3+ stages as OTLP JSON, validate Jaeger-compatible structure
  • Error case 1: Run-id with no matching events returns error
  • Error case 2: Events with missing fields (graceful degradation)
  • Edge case 1: Pipeline with only a started event (no completion), in which case spans have an empty endTimeUnixNano
  • Edge case 2: Multiple pipeline runs for same issue — exports all as separate traces

Endpoint Specification

CLI endpoint: shipwright pipeline export --format otel <run-id>

  • Input: run-id (string — job_id or issue number)
  • Output: OTLP JSON to stdout (exit 0) or error message to stderr (exit 1)
  • Flags: --output <file>, --send (POST to OTLP endpoint)

CLI endpoint: shipwright otel trace-export <run-id>

  • Same as above (pipeline export delegates here)

Error codes:

  • Exit 1: Missing run-id argument
  • Exit 1: No events found for run-id
  • Exit 1: jq not available
  • Exit 1: --send failed (OTLP endpoint unreachable)

Rate Limiting: Not applicable (CLI tool, not a server)

Versioning: Not applicable (internal CLI, no breaking API surface)


Monitoring Checklist

Not applicable — This is a CLI export tool, not a deployed service. However:

  • The otel.trace_exported event enables monitoring export frequency via existing shipwright otel metrics
  • Auto-export failures are logged to events.jsonl for post-mortem analysis

Performance Considerations

Baseline Metrics

  • Current cmd_trace() scans entire events.jsonl (~1-10K lines typical) in <1s
  • No performance regression expected for typical event volumes

Optimization Targets

  • Export should complete in <5s for events.jsonl files up to 100K lines
  • Memory usage bounded by jq streaming (not loading full file into bash variables)

Profiling Strategy

  • Not needed for initial implementation — bash+jq is well-understood
  • If events.jsonl grows very large, could add date-range pre-filtering with grep

Benchmark Plan

  • Test with a synthetic 10K-line events.jsonl as a first check against the <5s target (the target itself covers files up to 100K lines)
  • No formal benchmarking needed for this feature scope
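
A synthetic input for the timing check could be generated as sketched below. The event fields (type, job_id, issue) and the helper name gen_synthetic_events are assumptions, and the lines are only meant to exercise run-id filtering, not to form realistic traces.

```shell
#!/usr/bin/env bash
# gen_synthetic_events <line-count> <out-file>
# Hypothetical helper: write N one-line JSON events spread over 50 fake runs.
gen_synthetic_events() {
  local n="$1" out="$2"
  awk -v n="$n" 'BEGIN {
    for (i = 1; i <= n; i++)
      printf "{\"type\":\"stage.completed\",\"job_id\":\"job-%d\",\"issue\":%d}\n", i % 50, i % 50
  }' > "$out"
}
```

A timing run might then look like `gen_synthetic_events 10000 /tmp/events.jsonl` followed by `time shipwright otel trace-export job-7 > /dev/null`, with the events file location adjusted to wherever the exporter reads from.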

Definition of Done

  • shipwright pipeline export --format otel <run-id> outputs valid OTLP JSON
  • Each pipeline stage is a child span of the root pipeline span
  • Spans have correct timestamps (nanosecond epoch), status codes, and attributes
  • Span attributes include: stage name, template, outcome, cost, iteration count
  • Trace attributes include: issue number, repo, total cost, success/failure
  • Auto-export triggers on pipeline completion when OTEL_EXPORTER_ENDPOINT is set
  • docs/observability.md documents Jaeger integration
  • All existing tests pass (npm test)
  • New tests cover happy path, error cases, and edge cases
