Pipeline Plan 206

Plan written to .claude/pipeline-artifacts/plan.md.

Summary of the approach:

Enhance sw-otel.sh with a new cmd_trace_export() function that builds proper OTLP JSON with parent/child span hierarchy, nanosecond timestamps, and typed attributes (string/int/double)
Add pipeline export subcommand in sw-pipeline.sh that delegates to sw-otel.sh trace-export
Auto-export hook in pipeline completion path — fires when OTEL_EXPORTER_OTLP_ENDPOINT is set
5 files touched: sw-otel.sh (core logic), sw-pipeline.sh (CLI + auto-export), event-schema.json (new event type), sw-otel-test.sh (10 test cases), docs/observability.md (Jaeger guide)

Key design decisions:

Reuse sw-otel.sh rather than creating a new script — it already has trace-building infrastructure
Run-id matching supports both job_id and issue number for flexibility
Deterministic span IDs from sha256 of run-id + stage name for reproducibility
Non-blocking auto-export — failures logged but never fail the pipeline
Pure bash + jq: no new dependencies
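The deterministic-ID decision can be illustrated with a minimal sketch. The helper names (hash_hex, trace_id_for, span_id_for) and the "run-id:stage" separator are assumptions for illustration and are not taken from sw-otel.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the first $2 hex characters of the
# SHA-256 digest of the string $1.
hash_hex() {
  local input="$1" length="$2" digest
  if command -v sha256sum >/dev/null 2>&1; then
    digest=$(printf '%s' "$input" | sha256sum)
  else
    # macOS ships shasum rather than sha256sum
    digest=$(printf '%s' "$input" | shasum -a 256)
  fi
  # The digest is the leading field of the output, so a prefix is pure hex.
  printf '%s\n' "${digest:0:$length}"
}

# Trace ID: 32 hex chars derived from the run-id alone.
trace_id_for() { hash_hex "$1" 32; }

# Span ID: 16 hex chars derived from run-id plus stage name
# (the ":" separator is an assumption).
span_id_for() { hash_hex "$1:$2" 16; }
```

Because the IDs depend only on their inputs, exporting the same run twice would yield identical trace and span IDs, which is what makes re-exports reproducible.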

Auto-export on pipeline completion if OTEL_EXPORTER_ENDPOINT is set
Integration example with Jaeger in docs/observability.md

Design Alternatives

Approach A: Enhance existing sw-otel.sh cmd_trace()

Modify the existing cmd_trace() function to accept run-id, build proper span hierarchy with attributes
Add pipeline export subcommand in sw-pipeline.sh that delegates to sw-otel.sh
Pros: Reuses existing infrastructure, minimal new files, consistent with codebase patterns
Cons: Makes sw-otel.sh larger

Approach B: New dedicated sw-pipeline-export.sh script

Create a new standalone script for OTLP export
Pros: Clean separation
Cons: Duplicates event-reading logic, more files to maintain, inconsistent with the existing sw-otel.sh which already has trace building

Chosen: Approach A — enhancing sw-otel.sh minimizes blast radius and builds on existing code. The pipeline export subcommand in sw-pipeline.sh will delegate to sw-otel.sh trace-export.

Risk Assessment

Risk	Likelihood	Mitigation
OTLP JSON format doesn't match spec	Medium	Use exact field names from OTLP protobuf spec; validate against Jaeger
Large events.jsonl causes slow export	Low	Filter by run-id early with grep before jq parsing
Auto-export fails silently on completion	Medium	Log export errors but don't fail the pipeline; emit event
Bash 3.2 compatibility issues	Low	Avoid associative arrays; use indexed arrays and jq

Dependency Analysis

Depends on:

scripts/lib/helpers.sh — emit_event(), output helpers, now_iso()
scripts/sw-otel.sh — existing trace/metrics infrastructure
scripts/sw-pipeline.sh — pipeline completion hook, subcommand dispatch
config/event-schema.json — event type definitions
~/.shipwright/events.jsonl — event data source

Depended on by: Nothing (new feature, additive only)

Architecture Decision Record

Context

Pipeline executions generate rich event data in events.jsonl but there's no way to visualize execution flow in standard observability tools. The existing cmd_trace() in sw-otel.sh builds a rudimentary OTLP structure but lacks run-id filtering, proper span IDs, attributes, and parent/child relationships.

Decision

Enhance sw-otel.sh with a new trace-export subcommand that builds proper OTLP JSON from events. Add pipeline export as a thin delegation. Hook auto-export into the pipeline completion path.

Consequences

Users can export any pipeline run as OTLP traces viewable in Jaeger/Honeycomb
Auto-export on completion enables continuous observability without manual steps
No new dependencies; pure bash + jq implementation

Component Diagram

┌─────────────────────┐
│  sw-pipeline.sh     │
│  (export subcommand)│
└────────┬────────────┘
         │ delegates
         ▼
┌─────────────────────┐     ┌──────────────────┐
│  sw-otel.sh         │────▶│ events.jsonl     │
│  (trace-export)     │     │ (event source)   │
└────────┬────────────┘     └──────────────────┘
         │ outputs
         ▼
┌─────────────────────┐     ┌──────────────────┐
│  OTLP JSON          │────▶│ OTLP Collector   │
│  (stdout or file)   │     │ Jaeger/Honeycomb │
└─────────────────────┘     └──────────────────┘

Auto-export path:
┌─────────────────────┐
│  sw-pipeline.sh     │
│  pipeline_cleanup() │──── if OTEL_EXPORTER_ENDPOINT set ──▶ sw-otel.sh trace-export --send
└─────────────────────┘

Interface Contracts

// OTLP JSON output structure (OTLP/HTTP JSON encoding)
interface OTLPTraceExport {
  resourceSpans: [{
    resource: {
      attributes: Array<{key: string, value: {stringValue: string}}>
    },
    scopeSpans: [{
      scope: { name: string, version: string },
      spans: Span[]
    }]
  }]
}

interface Span {
  traceId: string           // 32 hex chars
  spanId: string            // 16 hex chars
  parentSpanId: string      // 16 hex chars (empty for root)
  name: string              // "pipeline" or stage name
  kind: number              // 1 = SPAN_KIND_INTERNAL
  startTimeUnixNano: string // nanosecond epoch as string
  endTimeUnixNano: string   // nanosecond epoch as string
  status: {
    code: number            // 0=UNSET, 1=OK, 2=ERROR
    message?: string
  }
  attributes: Array<{
    key: string,
    value: {stringValue?: string, intValue?: string, doubleValue?: number}
  }>
}

// CLI interface
// shipwright pipeline export --format otel <run-id>
// shipwright otel trace-export <run-id> [--output <file>] [--send]
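As an illustration of the Span shape above, a stage span could be assembled with jq roughly as follows. The function build_stage_span and its argument order are hypothetical; only the JSON field names come from the interface. The status code is hard-coded to 1 (OK) to keep the sketch short.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emits one span as compact JSON.
# Args: trace_id span_id parent_id name start_ns end_ns duration_s
# duration_s is assumed to be an integer number of seconds.
build_stage_span() {
  local trace_id="$1" span_id="$2" parent_id="$3" name="$4"
  local start_ns="$5" end_ns="$6" duration_s="$7"
  jq -n -c \
    --arg traceId "$trace_id" --arg spanId "$span_id" \
    --arg parentSpanId "$parent_id" --arg name "$name" \
    --arg startNs "$start_ns" --arg endNs "$end_ns" \
    --argjson duration "$duration_s" \
    '{
      traceId: $traceId, spanId: $spanId, parentSpanId: $parentSpanId,
      name: $name, kind: 1,
      startTimeUnixNano: $startNs, endTimeUnixNano: $endNs,
      status: {code: 1},
      attributes: [
        {key: "stage.name", value: {stringValue: $name}},
        # OTLP JSON carries 64-bit integers as strings
        {key: "stage.duration_s", value: {intValue: ($duration | tostring)}}
      ]
    }'
}
```

Passing timestamps with --arg keeps them as JSON strings, matching the "nanosecond epoch as string" contract.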

Data Flow

1. User runs: shipwright pipeline export --format otel <run-id>
2. sw-pipeline.sh delegates to: sw-otel.sh trace-export <run-id>
3. sw-otel.sh reads ~/.shipwright/events.jsonl
4. Filters events by job_id or issue number matching <run-id>
5. Builds root span from pipeline.started → pipeline.completed events
6. Builds child spans from stage.started → stage.completed/failed events
7. Attaches attributes (cost, template, outcome, etc.) from event fields
8. Outputs OTLP JSON to stdout (or file with --output)
9. If --send flag or auto-export: POST to <endpoint>/v1/traces, where <endpoint> is OTEL_EXPORTER_OTLP_ENDPOINT or OTEL_EXPORTER_ENDPOINT
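Steps 3 and 4 of the flow above might look like the following sketch. It assumes each event is one JSON object per line with job_id and issue fields (as named in this plan); the function name filter_run_events is hypothetical. Malformed lines are skipped silently here, whereas the plan also calls for a warning on stderr.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the events whose job_id or issue
# matches the given run-id, one compact JSON object per line.
filter_run_events() {
  local run_id="$1" events_file="$2"
  jq -R -c --arg id "$run_id" '
    fromjson?                      # skip lines that are not valid JSON
    | select(type == "object")     # ignore valid JSON that is not an event
    | select(
        [.job_id, .issue]
        | map(select(. != null) | tostring)   # compare numbers as strings
        | any(. == $id)
      )
  ' "$events_file"
}
```

Reading with -R and parsing via fromjson? lets one bad line be dropped without aborting the whole export.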

Error Boundaries

Event parsing errors: Skip malformed lines, continue processing (warn to stderr)
Missing run-id: Error with "No events found for run-id: X", exit 1
OTLP export failure: Log error, emit otel.export_failed event, exit 1 (non-fatal in auto-export path)
Missing jq: Error with install instructions, exit 1
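A minimal sketch of how two of these boundaries could be enforced. The helper names require_cmd and require_events, and the wording of the install hint, are assumptions; only the "No events found for run-id: X" message comes from the list above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fails with a hint when a required tool is missing.
require_cmd() {
  local cmd="$1" hint="$2"
  if ! command -v "$cmd" >/dev/null 2>&1; then
    echo "Error: $cmd is required. $hint" >&2
    return 1
  fi
}

# Hypothetical helper: fails when the filtered event text is empty.
require_events() {
  local run_id="$1" events="$2"
  if [ -z "$events" ]; then
    echo "No events found for run-id: $run_id" >&2
    return 1
  fi
}
```

A caller could chain them, for example `require_cmd jq "Install it with your package manager." || exit 1`, so each failure exits with status 1 and a message on stderr.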

Files to Modify

File	Action	Purpose
`scripts/sw-otel.sh`	Modify	Add `trace-export` subcommand with proper OTLP span building
`scripts/sw-pipeline.sh`	Modify	Add `export` subcommand, add auto-export hook at completion
`config/event-schema.json`	Modify	Add `otel.trace_exported` event type
`scripts/sw-otel-test.sh`	Modify	Add tests for trace-export
`docs/observability.md`	Create	Jaeger integration guide

Implementation Steps

Step 1: Add `trace-export` command to `sw-otel.sh`

Add a new cmd_trace_export() function that:

Accepts <run-id> argument (matches against job_id or issue fields)
Reads events.jsonl and filters events for the given run
Generates deterministic trace ID from run-id (sha256 truncated to 32 hex chars)
Generates deterministic span IDs from run-id + stage name (sha256 truncated to 16 hex chars)
Builds root span from pipeline.started/pipeline.completed events
Builds child spans from stage.started/stage.completed/stage.failed/stage.skipped events
Converts ISO timestamps to nanosecond epoch (OTLP requirement)
Attaches proper OTLP-format attributes using {key, value: {stringValue/intValue/doubleValue}} encoding
Outputs valid OTLP JSON
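The timestamp conversion in the list above could be sketched as follows. It assumes second-precision UTC timestamps of the form 2024-01-01T00:00:00Z; the actual output format of now_iso() is not shown in this plan, and the function name iso_to_unix_nano is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: converts an ISO 8601 UTC timestamp to a
# nanosecond epoch string. Sub-second precision is not handled.
iso_to_unix_nano() {
  local iso="$1" normalized secs
  normalized="${iso%Z}"            # drop the trailing Z
  normalized="${normalized/T/ }"   # "YYYY-MM-DD HH:MM:SS"
  if secs=$(date -u -d "$normalized" +%s 2>/dev/null); then
    :  # GNU date
  else
    # BSD/macOS date needs an explicit input format
    secs=$(date -j -u -f "%Y-%m-%d %H:%M:%S" "$normalized" +%s 2>/dev/null) || return 1
  fi
  # Append nine zeros as text to avoid any integer overflow concerns.
  printf '%s000000000\n' "$secs"
}
```

For example, `iso_to_unix_nano 2024-01-01T00:00:00Z` prints 1704067200000000000, which is already a string suitable for startTimeUnixNano.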

Root span attributes:

pipeline.issue, pipeline.template, pipeline.result, pipeline.total_cost, pipeline.iterations, pipeline.agent_id

Stage span attributes:

stage.name, stage.outcome, stage.duration_s, stage.error_class

Add --output <file> flag to write to file instead of stdout. Add --send flag to POST to OTEL_EXPORTER_OTLP_ENDPOINT/v1/traces.

Update the help text and main dispatch.

Step 2: Add `export` subcommand to `sw-pipeline.sh`

Add a pipeline_export() function and dispatch case:

Parse --format otel flag (only format for now, default to otel)
Accept positional <run-id> argument
Delegate to bash "$SCRIPT_DIR/sw-otel.sh" trace-export "$run_id" "$@"
Add to show_help() output
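One possible shape for pipeline_export(), as a sketch only. The SW_OTEL_DELEGATE variable is a hypothetical test seam that is not part of the plan; without it the function delegates to sw-otel.sh as described above. Passing --output and --send through is an assumption based on the flags listed for this command.

```shell
#!/usr/bin/env bash
pipeline_export() {
  local format="otel" run_id=""
  local passthrough=()
  while [ $# -gt 0 ]; do
    case "$1" in
      --format)
        [ $# -ge 2 ] || { echo "Error: --format requires a value" >&2; return 1; }
        format="$2"; shift 2 ;;
      --output)
        [ $# -ge 2 ] || { echo "Error: --output requires a file" >&2; return 1; }
        passthrough+=("$1" "$2"); shift 2 ;;
      --send)
        passthrough+=("$1"); shift ;;
      -*)
        echo "Error: unknown option: $1" >&2; return 1 ;;
      *)
        [ -z "$run_id" ] || { echo "Error: unexpected argument: $1" >&2; return 1; }
        run_id="$1"; shift ;;
    esac
  done
  [ "$format" = "otel" ] || { echo "Error: unsupported format: $format" >&2; return 1; }
  [ -n "$run_id" ] || { echo "Error: missing <run-id>" >&2; return 1; }

  # ${arr[@]+"${arr[@]}"} keeps an empty array safe under set -u on Bash 3.2.
  if [ -n "${SW_OTEL_DELEGATE:-}" ]; then
    # Hypothetical override so the dispatch can be tested with a stub.
    "$SW_OTEL_DELEGATE" trace-export "$run_id" ${passthrough[@]+"${passthrough[@]}"}
  else
    bash "$SCRIPT_DIR/sw-otel.sh" trace-export "$run_id" ${passthrough[@]+"${passthrough[@]}"}
  fi
}
```

Only indexed arrays are used, in line with the Bash 3.2 constraint noted in the risk assessment.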

Step 3: Add auto-export on pipeline completion

In sw-pipeline.sh pipeline_cleanup() function (around line 2727, after successful completion event):

Check if OTEL_EXPORTER_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT is set
If set, run bash "$SCRIPT_DIR/sw-otel.sh" trace-export "$SHIPWRIGHT_PIPELINE_ID" --send and check its exit status: on failure, log the error and continue (the export must never fail the pipeline)
Emit otel.trace_exported event on success
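The non-blocking behaviour of this hook can be sketched as below. The function maybe_auto_export, its exporter argument, and the precedence between the two endpoint variables are assumptions; a real hook would call emit_event where this sketch prints a status word.

```shell
#!/usr/bin/env bash
# Hypothetical hook: $1 is the run-id, $2 is the command used to export
# (in the pipeline this would wrap sw-otel.sh).
maybe_auto_export() {
  local run_id="$1" exporter="$2"
  # Precedence between the two variables is an assumption.
  local endpoint="${OTEL_EXPORTER_OTLP_ENDPOINT:-${OTEL_EXPORTER_ENDPOINT:-}}"
  [ -n "$endpoint" ] || return 0   # nothing configured, nothing to do

  if "$exporter" trace-export "$run_id" --send >/dev/null 2>&1; then
    # Placeholder for emitting otel.trace_exported
    echo "exported"
  else
    # Log the failure, but never propagate it to the pipeline
    echo "warning: OTLP auto-export failed for run $run_id" >&2
  fi
  return 0
}
```

Branching on the exporter's exit status keeps the success event and the failure log distinct, while the unconditional `return 0` guarantees the pipeline result is unaffected.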

Step 4: Update event schema

Add to config/event-schema.json:

"otel.trace_exported": {
  "required": ["run_id"],
  "optional": ["endpoint", "spans_count"]
}

Step 5: Write tests

Add to scripts/sw-otel-test.sh:

Test trace-export with synthetic events (pipeline.started + stage events + pipeline.completed)
Validate OTLP JSON structure (resourceSpans, scopeSpans, spans array)
Validate span parent/child relationships (stage spans reference root span)
Validate span attributes are present
Validate timestamps are nanosecond epoch strings
Test with issue number as run-id
Test with job_id as run-id
Test with no matching events (error case)
Test --output flag writes to file
Test --send flag behavior (mock curl)
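For the mock-curl case, one approach is to shadow curl with a shell function that records its arguments. The sketch below assumes a send_trace function standing in for the --send path; that name, the test function name, and the localhost:4318 endpoint are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the --send code path.
send_trace() {
  local endpoint="$1" payload="$2"
  curl -sS -X POST -H "Content-Type: application/json" \
    --data "$payload" "${endpoint%/}/v1/traces"
}

# Test double: a function named curl takes precedence over the binary
# and writes each argument on its own line.
curl() {
  printf '%s\n' "$@" > "$CURL_ARGS_FILE"
}

test_send_posts_to_traces_path() {
  local rc=0
  CURL_ARGS_FILE=$(mktemp)
  send_trace "http://localhost:4318/" '{"resourceSpans":[]}'
  # The URL must end in /v1/traces without a doubled slash
  grep -qxF 'http://localhost:4318/v1/traces' "$CURL_ARGS_FILE" || rc=1
  # The payload must be passed through unchanged
  grep -qxF '{"resourceSpans":[]}' "$CURL_ARGS_FILE" || rc=1
  rm -f "$CURL_ARGS_FILE"
  return $rc
}
```

No network access is needed, so the test can run in CI without a collector.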

Step 6: Create docs/observability.md

Write integration guide with:

Prerequisites (Jaeger Docker setup)
Running a pipeline and exporting traces
Auto-export configuration
Viewing traces in Jaeger UI
Honeycomb setup alternative

Task Checklist

Task 1: Implement cmd_trace_export() in sw-otel.sh with OTLP span building, run-id filtering, proper attributes
Task 2: Add trace-export subcommand dispatch and help text in sw-otel.sh
Task 3: Add export subcommand to sw-pipeline.sh delegating to sw-otel.sh
Task 4: Add auto-export hook in sw-pipeline.sh pipeline_cleanup() on completion
Task 5: Add otel.trace_exported event type to config/event-schema.json
Task 6: Write tests for trace-export in sw-otel-test.sh
Task 7: Create docs/observability.md with Jaeger integration example
Task 8: Run npm test (specifically sw-otel-test.sh) and fix any failures

Testing Approach

Test Pyramid Breakdown

Unit tests (8): OTLP JSON structure validation, span relationships, attribute encoding, timestamp conversion, run-id filtering, error handling (in sw-otel-test.sh)
Integration tests (2): Full pipeline export flow via sw-pipeline.sh export, auto-export mock (in sw-otel-test.sh)
E2E tests (0): Skipped — requires actual OTLP collector; covered by Jaeger docs example

Coverage Targets

All span types: root pipeline span, completed stage span, failed stage span, skipped stage span
All attribute types: string, integer, double values
Error paths: missing run-id, no events, malformed events

Critical Paths to Test

Happy path: Export a complete pipeline run with 3+ stages as OTLP JSON, validate Jaeger-compatible structure
Error case 1: Run-id with no matching events returns error
Error case 2: Events with missing fields (graceful degradation)
Edge case 1: Pipeline with only started event (no completion) — spans have empty endTime
Edge case 2: Multiple pipeline runs for same issue — exports all as separate traces

Endpoint Specification

CLI endpoint: shipwright pipeline export --format otel <run-id>

Input: run-id (string — job_id or issue number)
Output: OTLP JSON to stdout (exit 0) or error message to stderr (exit 1)
Flags: --output <file>, --send (POST to OTLP endpoint)

CLI endpoint: shipwright otel trace-export <run-id>

Same as above (pipeline export delegates here)

Error codes:

Exit 1: Missing run-id argument
Exit 1: No events found for run-id
Exit 1: jq not available
Exit 1: --send failed (OTLP endpoint unreachable)

Rate Limiting: Not applicable (CLI tool, not a server)
Versioning: Not applicable (internal CLI, no breaking API surface)

Monitoring Checklist

Not applicable — This is a CLI export tool, not a deployed service. However:

The otel.trace_exported event enables monitoring export frequency via existing shipwright otel metrics
Auto-export failures are logged to events.jsonl for post-mortem analysis

Performance Considerations

Baseline Metrics

Current cmd_trace() scans entire events.jsonl (~1-10K lines typical) in <1s
No performance regression expected for typical event volumes

Optimization Targets

Export should complete in <5s for events.jsonl files up to 100K lines
Memory usage bounded by jq streaming (not loading full file into bash variables)

Profiling Strategy

Not needed for initial implementation — bash+jq is well-understood
If events.jsonl grows very large, could add date-range pre-filtering with grep

Benchmark Plan

Test with synthetic 10K-line events.jsonl to validate <5s target
No formal benchmarking needed for this feature scope
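A rough way to produce the synthetic file and exercise the grep pre-filter is sketched below. The event shape, the compact JSON spacing, and both function names are assumptions made for benchmarking only.

```shell
#!/usr/bin/env bash
# Writes $1 synthetic event lines to file $2, spread over 100 job ids.
make_synthetic_events() {
  local count="$1" out="$2"
  awk -v n="$count" 'BEGIN {
    for (i = 0; i < n; i++)
      printf "{\"type\":\"stage.completed\",\"job_id\":\"job-%d\",\"issue\":%d}\n", i % 100, i % 100
  }' > "$out"
}

# Cheap fixed-string pre-filter to run before any jq parsing.
# Assumes events are written without spaces around the colon.
prefilter_by_job() {
  local run_id="$1" events_file="$2"
  grep -F "\"job_id\":\"$run_id\"" "$events_file"
}

# Example timing run (SECONDS is a bash builtin counter):
#   events=$(mktemp); make_synthetic_events 10000 "$events"
#   SECONDS=0
#   prefilter_by_job "job-7" "$events" | wc -l
#   echo "elapsed: ${SECONDS}s"
```

Including the closing quote in the pattern prevents job-7 from also matching job-70 through job-79.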

Definition of Done

shipwright pipeline export --format otel <run-id> outputs valid OTLP JSON
Each pipeline stage is a child span of the root pipeline span
Spans have correct timestamps (nanosecond epoch), status codes, and attributes
Span attributes include: stage name, template, outcome, cost, iteration count
Trace attributes include: issue number, repo, total cost, success/failure
Auto-export triggers on pipeline completion when OTEL_EXPORTER_ENDPOINT is set
docs/observability.md documents Jaeger integration
All existing tests pass (npm test)
New tests cover happy path, error cases, and edge cases
