[grafana-otel-advisor] OTel improvement: emit gh-aw.engine.id and gen_ai.system on the setup span #32563

@github-actions

OTel Instrumentation Improvement: emit gh-aw.engine.id and gen_ai.system on the setup span

Analysis Date: 2026-05-16
Priority: Medium
Effort: Small (< 2h)

Problem

The gh-aw.<jobName>.setup span (built in actions/setup/js/send_otlp_span.cjs:962 by sendJobSetupSpan) is emitted without the gh-aw.engine.id and gen_ai.system attributes, even though the workflow declares an engine. This happens because the Setup Scripts step runs before the Generate agentic run info step in the compiled lock file (see pkg/workflow/compiler_yaml_step_generation.go:130-199). At setup time:

  1. /tmp/gh-aw/aw_info.json does not yet exist, so awInfo.engine_id and awInfo.context.engine_id resolve to empty strings.
  2. GH_AW_INFO_ENGINE_ID is injected only into the env block of the generate_aw_info step (pkg/workflow/compiler_yaml.go:801) — it is not injected into the env block of the Setup Scripts step.

The result: resolveEngineId(awInfo) returns "" at setup time, and the if (engineId) guard at actions/setup/js/send_otlp_span.cjs:1048-1052 skips pushing both gen_ai.system and gh-aw.engine.id. A DevOps engineer querying Tempo/Grafana for "p95 setup latency by engine" cannot answer the question from the setup span alone. They must join setup spans to the conclusion span by trace ID, which adds a second lookup per trace and fails whenever only the setup span survives (e.g., when the agent step is cancelled before conclusion).
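To make the failure mode concrete, here is a minimal Go model of the runtime behaviour described above. It is only a sketch: the real logic lives in JavaScript in send_otlp_span.cjs, the lookup order (aw_info fields first, then the environment variable) is inferred from this issue, and the AwInfo type, the injected getenv function, and the single entry in the system map are hypothetical stand-ins for illustration.

```go
package main

import "fmt"

// AwInfo is a hypothetical Go mirror of the two aw_info.json fields
// that the issue says are consulted for the engine ID.
type AwInfo struct {
	EngineID        string
	ContextEngineID string
}

// resolveEngineID models the assumed fallback chain: aw_info fields first,
// then the GH_AW_INFO_ENGINE_ID environment variable.
func resolveEngineID(info AwInfo, getenv func(string) string) string {
	if info.EngineID != "" {
		return info.EngineID
	}
	if info.ContextEngineID != "" {
		return info.ContextEngineID
	}
	return getenv("GH_AW_INFO_ENGINE_ID")
}

// engineAttributes models the if (engineId) guard: an empty engine ID
// produces no attributes at all.
func engineAttributes(engineID string, systemMap map[string]string) map[string]string {
	attrs := map[string]string{}
	if engineID == "" {
		return attrs
	}
	system := systemMap[engineID]
	if system == "" {
		system = engineID // same fallback as ENGINE_TO_SYSTEM_MAP[engineId] || engineId
	}
	attrs["gen_ai.system"] = system
	attrs["gh-aw.engine.id"] = engineID
	return attrs
}

func main() {
	// Hypothetical mapping; the real ENGINE_TO_SYSTEM_MAP contents are not shown in this issue.
	systemMap := map[string]string{"claude": "anthropic"}

	// Today: no aw_info.json yet and no env var on the setup step.
	noEnv := func(string) string { return "" }
	fmt.Println(len(engineAttributes(resolveEngineID(AwInfo{}, noEnv), systemMap)))

	// After the proposed change: the env var is visible to the setup step.
	withEnv := func(key string) string {
		if key == "GH_AW_INFO_ENGINE_ID" {
			return "claude"
		}
		return ""
	}
	fmt.Println(engineAttributes(resolveEngineID(AwInfo{}, withEnv), systemMap))
}
```

The first call mirrors the current setup step, where every source is empty and the guard drops both attributes. The second shows that exposing the environment variable is enough for both attributes to appear.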

Why This Matters (DevOps Perspective)
  • Unblocks engine-segmented setup latency dashboards. Grafana's GenAI / Application Observability panels filter on gen_ai.system — without it on setup spans, the panel under-counts cold-start time.
  • Improves cancelled-run diagnostics. When a job is cancelled during setup, the conclusion span never fires. The setup span is the only surviving signal for that run, and right now it carries no engine identity at all.
  • Reduces MTTR for noisy-neighbor incidents. "Is the claude setup phase slow today or is it all engines?" requires gh-aw.engine.id on the setup span itself; joining via trace ID adds latency and fails on partial traces.

Current Behavior

Live evidence from this workflow run (trace d945112102984b62d8c85d2bf1dc6ba3, span gh-aw.agent.setup, workflow uses claude engine):

Resource attributes present: ✅ service.name, service.version, github.repository, github.run_id, github.event_name, deployment.environment, etc.

Span attributes present on gh-aw.agent.setup:

gh-aw.episode.id
gh-aw.episode.kind
gh-aw.event_name
gh-aw.hop.id
gh-aw.job.name
gh-aw.repository
gh-aw.run.actor
gh-aw.run.attempt
gh-aw.run.id
gh-aw.staged
gh-aw.workflow.name
gh-aw.workflow_call.id

Missing from the setup span: gh-aw.engine.id, gen_ai.system.

The code path that should set them (actions/setup/js/send_otlp_span.cjs:1048-1052):

const engineId = resolveEngineId(awInfo); // returns "" at setup time
// ...
if (engineId) {
  const genAiSystem = ENGINE_TO_SYSTEM_MAP[engineId] || engineId;
  attributes.push(buildAttr("gen_ai.system", genAiSystem));
  attributes.push(buildAttr("gh-aw.engine.id", engineId));
}

The compiler-side env block that omits GH_AW_INFO_ENGINE_ID (pkg/workflow/compiler_yaml_step_generation.go:185-198):

lines = append(lines,
  "        env:\n",
  fmt.Sprintf("          GH_AW_SETUP_WORKFLOW_NAME: %q\n", data.Name),
  fmt.Sprintf("          GH_AW_CURRENT_WORKFLOW_REF: %s\n", buildSetupWorkflowRefExpr(data)),
)
if v := getVersionForSetup(data); v != "" {
  lines = append(lines, fmt.Sprintf("          GH_AW_INFO_VERSION: %q\n", v))
}
// no GH_AW_INFO_ENGINE_ID here

Proposed Change

Inject GH_AW_INFO_ENGINE_ID into the Setup Scripts step's env block in generateSetupStep. The engine ID is already in scope on data.EngineConfig.ID / data.AI (see pkg/workflow/compiler_yaml.go:721-725), so this is a small, mechanical addition with no new lookups.

// pkg/workflow/compiler_yaml_step_generation.go (both script-mode branch and dev/release branch)
if data != nil {
  // existing GH_AW_SETUP_WORKFLOW_NAME / GH_AW_CURRENT_WORKFLOW_REF lines ...

  // NEW: propagate engine ID so the setup span carries gh-aw.engine.id and gen_ai.system.
  engineID := ""
  if data.EngineConfig != nil && data.EngineConfig.ID != "" {
    engineID = data.EngineConfig.ID
  } else if data.AI != "" {
    engineID = data.AI
  }
  if engineID != "" {
    lines = append(lines, fmt.Sprintf("          GH_AW_INFO_ENGINE_ID: %q\n", engineID))
  }
}

No runtime JS change is needed — resolveEngineId(awInfo) already falls back to process.env.GH_AW_INFO_ENGINE_ID at actions/setup/js/send_otlp_span.cjs:178. This fix just makes the env var visible to the setup step.

Expected Outcome

After this change, every gh-aw.<jobName>.setup span (agent, activation, safe-outputs, conclusion, threat-detection, etc.) carries gh-aw.engine.id and gen_ai.system from the moment it is created:

  • In Grafana / Tempo: TraceQL { span.gh-aw.engine.id = "claude" && name =~ ".*\\.setup" } returns just claude setup spans. Span-metrics generators can now break out p95 setup latency per engine.
  • In Honeycomb / Datadog: gen_ai.system populates the native GenAI service panels for setup spans, not only for conclusion/agent spans.
  • In the JSONL mirror: /tmp/gh-aw/otel.jsonl shows the engine on the first span of every job — useful when the conclusion span never gets written (cancelled / timed-out runs).
  • For on-call: a single span search by engine answers "is this slow for everyone or just one engine?" without a trace-ID join.

Implementation Steps
  • Edit pkg/workflow/compiler_yaml_step_generation.go: add GH_AW_INFO_ENGINE_ID env injection in both the script-mode branch (around line 142) and the dev/release-mode branch (around line 185) of generateSetupStep.
  • Update pkg/workflow/setup_step_version_test.go (and pkg/workflow/observability_otlp_test.go if it asserts on setup-step env) to expect the new env line for engines copilot, claude, codex, gemini.
  • Verify actions/setup/js/action_setup_otlp.test.cjs covers the path where process.env.GH_AW_INFO_ENGINE_ID is set and asserts that the resulting span attributes include gh-aw.engine.id and gen_ai.system. If not, add the assertion.
  • Recompile golden fixtures: make recompile (or equivalent) to regenerate pkg/workflow/testdata/**.golden so the new env line is present.
  • Run make test-unit and cd actions/setup/js && npx vitest run.
  • Run make fmt.
  • Open a PR referencing this issue.
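
The test update in the steps above could take roughly the following shape. This is a hedged sketch: hasEnvLine is a hypothetical helper, and the step text is built inline as a placeholder for real compiler output, so the actual assertions in setup_step_version_test.go may look different.

```go
package main

import (
	"fmt"
	"strings"
)

// hasEnvLine reports whether the rendered YAML contains an env entry of the
// form `key: "value"`, ignoring indentation. Hypothetical helper.
func hasEnvLine(yaml, key, value string) bool {
	want := fmt.Sprintf("%s: %q", key, value)
	for _, line := range strings.Split(yaml, "\n") {
		if strings.TrimSpace(line) == want {
			return true
		}
	}
	return false
}

func main() {
	// Engines named in the implementation steps.
	engines := []string{"copilot", "claude", "codex", "gemini"}
	for _, engine := range engines {
		// Placeholder for the compiled setup step; a real test would
		// obtain this text from the compiler instead.
		step := fmt.Sprintf(
			"        env:\n          GH_AW_SETUP_WORKFLOW_NAME: %q\n          GH_AW_INFO_ENGINE_ID: %q\n",
			"demo", engine,
		)
		fmt.Printf("%s: %v\n", engine, hasEnvLine(step, "GH_AW_INFO_ENGINE_ID", engine))
	}
}
```

Matching on the trimmed line rather than a substring avoids false positives from keys that merely share a prefix with GH_AW_INFO_ENGINE_ID.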

Evidence from Live Grafana Data

Tempo backend status: tempo_traceql-search against grafanacloud-traces returned 0 traces over the last 7 days for both {} and {resource.service.name="gh-aw"}. The Grafana Cloud Tempo instance bound to this MCP is not the production OTLP destination for this repository, so the live tracing-backend playbook was not directly usable. Falling back to telemetry-source priority #2 in the otel-queries skill (/tmp/gh-aw/otel.jsonl) yielded a current, real span produced by this very workflow run.

JSONL evidence (this run):

  • traceId: d945112102984b62d8c85d2bf1dc6ba3
  • spanId: d43a7a6b0171531b
  • name: gh-aw.agent.setup
  • workflow uses claude engine (confirmed in .github/workflows/daily-grafana-otel-instrumentation-advisor.lock.yml:132 GH_AW_INFO_ENGINE_ID: "claude")
  • Span attribute keys (12 total): gh-aw.episode.id, gh-aw.episode.kind, gh-aw.event_name, gh-aw.hop.id, gh-aw.job.name, gh-aw.repository, gh-aw.run.actor, gh-aw.run.attempt, gh-aw.run.id, gh-aw.staged, gh-aw.workflow.name, gh-aw.workflow_call.id
  • Not present: gh-aw.engine.id, gen_ai.system

This is reproducible on every gh-aw run — the gap is structural, not an outlier.

Related Files
  • pkg/workflow/compiler_yaml_step_generation.go (primary change — inject env in generateSetupStep)
  • pkg/workflow/compiler_yaml.go (reference for how engineID is resolved at compile time, lines 721–725)
  • actions/setup/js/send_otlp_span.cjs (runtime — resolveEngineId at line 177, attribute push at 1048)
  • actions/setup/js/action_setup_otlp.cjs (entry point that triggers sendJobSetupSpan)
  • actions/setup/js/action_setup_otlp.test.cjs (add coverage)
  • pkg/workflow/setup_step_version_test.go (golden assertions for setup-step env)
  • pkg/workflow/testdata/**.golden (regenerated lock-file fixtures)

Generated by the Daily Grafana OTel Instrumentation Advisor workflow

  • expires on May 23, 2026, 5:37 AM UTC
