Add optional OpenTelemetry trace export for job lifecycle #4465

Draft
stefanpenner wants to merge 1 commit into actions:master from stefanpenner:otel-trace-recorder

Summary

  • Adds an OpenTelemetry trace recorder that implements the existing listener.MetricsRecorder interface alongside the Prometheus exporter
  • When configured with an OTLP endpoint, emits three child spans per completed job: runner.queue, runner.startup, runner.execution
  • Uses a CompositeRecorder to fan out to both Prometheus and OTel when both are enabled
  • Zero behavior change when otel_endpoint is not set
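
The fan-out idea behind CompositeRecorder can be sketched as follows. This is a minimal illustration, not the PR's code: `Recorder` is a hypothetical one-method stand-in for listener.MetricsRecorder (whose real method set is not shown here), and `countingRecorder` is a toy backend invented for the example.

```go
package main

import "fmt"

// Recorder is a hypothetical, trimmed-down stand-in for
// listener.MetricsRecorder; the real interface has its own methods.
type Recorder interface {
	RecordJobCompleted(jobID int64)
}

// CompositeRecorder forwards every call to each wrapped recorder, in order.
type CompositeRecorder struct {
	recorders []Recorder
}

// NewCompositeRecorder skips nil interface values, so a disabled backend
// can simply be passed as nil.
func NewCompositeRecorder(rs ...Recorder) *CompositeRecorder {
	c := &CompositeRecorder{}
	for _, r := range rs {
		if r != nil {
			c.recorders = append(c.recorders, r)
		}
	}
	return c
}

// RecordJobCompleted delegates the event to every wrapped recorder.
func (c *CompositeRecorder) RecordJobCompleted(jobID int64) {
	for _, r := range c.recorders {
		r.RecordJobCompleted(jobID)
	}
}

// countingRecorder is a toy backend that only counts how often it is called.
type countingRecorder struct {
	name  string
	calls int
}

func (r *countingRecorder) RecordJobCompleted(jobID int64) { r.calls++ }

func main() {
	prom := &countingRecorder{name: "prometheus"}
	otel := &countingRecorder{name: "otel"}

	rec := NewCompositeRecorder(prom, nil, otel)
	rec.RecordJobCompleted(42)

	fmt.Println(prom.calls, otel.calls) // prints "1 1"
}
```

Because the composite satisfies the same interface as its members, the caller holds a single recorder regardless of how many backends are enabled.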

Motivation

GitHub Actions workflows can be reconstructed as OpenTelemetry traces (workflow → job → step), but there's a visibility gap between "job was queued" and "step started executing." ARC has the timestamps that explain this gap — QueueTime, ScaleSetAssignTime, RunnerAssignTime, FinishTime — but currently only exposes them as Prometheus histogram aggregates.

This PR emits those timestamps as individual trace spans, giving per-job visibility into:

  • Queue wait — time before ARC acquires the job
  • Runner startup — time for pod creation and runner registration
  • Execution — time spent running the job

Trace correlation

Spans use deterministic IDs (TraceID = MD5(runID-attempt), SpanID = BigEndian(jobID)) that are compatible with tools like otel-explorer which reconstruct workflow traces from the GitHub API. ARC's runner spans automatically merge into the same trace as the workflow/job/step spans — no correlation configuration needed.

Configuration

{
  "otel_endpoint": "otel-collector.monitoring:4318",
  "otel_insecure": true
}

Or via Helm values:

listenerTemplate:
  spec:
    containers:
      - name: listener
        env:
          - name: OTEL_ENDPOINT
            value: "otel-collector.monitoring:4318"
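
For illustration, the two JSON fields could be read and used to gate the recorder roughly like this. The `otelConfig` struct, its Go field names, and the `enabled` helper are hypothetical; only the JSON keys otel_endpoint and otel_insecure come from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// otelConfig mirrors only the two JSON keys named in this PR.
// The struct and its Go field names are hypothetical.
type otelConfig struct {
	OTelEndpoint string `json:"otel_endpoint"`
	OTelInsecure bool   `json:"otel_insecure"`
}

// enabled reports whether trace export should be wired up at all.
// An empty endpoint means existing behavior stays unchanged.
func (c otelConfig) enabled() bool { return c.OTelEndpoint != "" }

// parseConfig decodes the listener config JSON into otelConfig,
// ignoring any keys it does not know about.
func parseConfig(raw []byte) (otelConfig, error) {
	var c otelConfig
	err := json.Unmarshal(raw, &c)
	return c, err
}

func main() {
	raw := []byte(`{"otel_endpoint": "otel-collector.monitoring:4318", "otel_insecure": true}`)
	cfg, err := parseConfig(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.enabled(), cfg.OTelInsecure) // prints "true true"

	empty, _ := parseConfig([]byte(`{}`))
	fmt.Println(empty.enabled()) // prints "false"
}
```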

Files changed

| File | Change |
| --- | --- |
| cmd/ghalistener/metrics/otel.go | OTelRecorder implementing listener.MetricsRecorder |
| cmd/ghalistener/metrics/composite.go | CompositeRecorder fan-out wrapper |
| cmd/ghalistener/metrics/otel_test.go | Tests for recorder, composite, and deterministic IDs |
| cmd/ghalistener/main.go | Wire OTel recorder alongside Prometheus |
| cmd/ghalistener/config/config.go | Add otel_endpoint and otel_insecure fields |
| go.mod / go.sum | Add OTel SDK dependencies |

Test plan

  • All existing metrics tests pass
  • New tests: 3-span emission, missing timestamps, common attributes, run attempt override, no-op methods, composite delegation, deterministic ID stability
  • go build ./cmd/ghalistener/ compiles cleanly
  • Integration test with real ARC deployment + OTel Collector
  • Verify spans appear in Tempo/Jaeger alongside workflow traces

🤖 Generated with Claude Code

Adds an OTel trace recorder that implements the existing
MetricsRecorder interface. When configured with an OTLP endpoint,
the listener emits three child spans per completed job:

  - runner.queue:     QueueTime → ScaleSetAssignTime
  - runner.startup:   ScaleSetAssignTime → RunnerAssignTime
  - runner.execution: RunnerAssignTime → FinishTime

Spans use deterministic trace/span IDs (MD5 of runID-attempt,
big-endian jobID) compatible with tools that reconstruct GitHub
Actions workflows as OpenTelemetry traces.

Configuration: set otel_endpoint (and optionally otel_insecure)
in the listener config JSON, or pass via Helm values. When no
endpoint is configured, behavior is unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>