[otel-advisor] OTel improvement: source gh-aw.effective_tokens from agent_usage.json (native cost metric is on 0 spans) #35900

@github-actions

📡 OTel Instrumentation Improvement: make gh-aw.effective_tokens reliable by reading the durable agent_usage.json artifact

Analysis Date: 2026-05-30
Priority: High
Effort: Small (< 2h)

Problem

sendJobConclusionSpan in actions/setup/js/send_otlp_span.cjs is supposed to emit gh-aw.effective_tokens — gh-aw's engine-agnostic per-run token-cost metric — on every conclusion span. But it sources the value only from the GH_AW_EFFECTIVE_TOKENS environment variable:

// send_otlp_span.cjs:1745-1747
const rawET = process.env.GH_AW_EFFECTIVE_TOKENS || "";
const effectiveTokens = rawET ? parseInt(rawET, 10) : NaN;

That env var is exported via core.exportVariable (i.e. $GITHUB_ENV) by parse_token_usage.cjs inside the agent job, but it is not propagated into the OTLP conclusion post-step environment (it is only explicitly wired into the safe_outputs job via needs.agent.outputs.effective_tokens). The result: when the conclusion span is built, GH_AW_EFFECTIVE_TOKENS is unset, effectiveTokens is NaN, and the attribute is silently dropped.

Live telemetry confirms this is a 100% silent failure, not a rollout blip:

  • Sentry spans dataset (github org, gh-aw project): has:gh-aw.effective_tokens → 0 spans over the last 30 days.
  • In the last 24h: 0 of 344 gh-aw.agent.conclusion spans carry gh-aw.effective_tokens or gen_ai.usage.total_tokens — across all engines (copilot 215, claude 62, codex 34, pi 11, gemini 11, antigravity 11).

A DevOps engineer therefore cannot answer "how many tokens / how much did this run cost?" from OTel today — the one native, engine-agnostic cost attribute gh-aw emits never reaches the backend.

Why This Matters (DevOps Perspective)

gh-aw.effective_tokens is the single attribute that normalizes cost across all engines (copilot, claude, codex, gemini, pi, antigravity). The OTel GenAI gen_ai.usage.* attributes are engine-dependent and, per live data, reach only ~4% of runs (434 gh-aw.agent.conclusion spans over 30 days, 0 in the last 24h) because they depend on a result event in agent-stdio.log that several engines never emit.

With gh-aw.effective_tokens reliably present, these become possible with no per-engine special-casing:

  • Dashboards: sum(gh-aw.effective_tokens) per workflow / per engine / per day — token burn-down and cost attribution.
  • Alerts: page when a workflow's effective tokens spike vs. its baseline (run-away agent detection).
  • Triage: correlate gh-aw.run.status:failure (68/24h) with token consumption to spot timeouts caused by context exhaustion.

Today all of these silently return empty, which reads as "zero cost" rather than "no data" — the most dangerous kind of observability gap.
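
The "zero cost" versus "no data" distinction can be sketched in a few lines. This is a minimal illustration, not a backend query: the span record shape and the helper names naiveSum and sumWithCoverage are assumptions made for the example. It shows how a plain sum over spans that lack the attribute reports 0, while a coverage-aware aggregate reports that nothing was measured.

```javascript
// Hypothetical span records; the shape is illustrative, not a backend schema.
// Neither span carries gh-aw.effective_tokens, which matches the live data.
const spans = [
  { workflow: "triage", attributes: { "gh-aw.engine.id": "copilot" } },
  { workflow: "triage", attributes: { "gh-aw.engine.id": "claude" } },
];

// Naive aggregate: a missing attribute silently collapses to 0.
function naiveSum(records) {
  return records.reduce(
    (total, s) => total + (s.attributes["gh-aw.effective_tokens"] || 0),
    0
  );
}

// Coverage-aware aggregate: returns null when no span carried the attribute,
// and reports how many spans contributed to the total.
function sumWithCoverage(records) {
  const values = records
    .map(s => s.attributes["gh-aw.effective_tokens"])
    .filter(v => Number.isFinite(v) && v > 0);
  return {
    total: values.length > 0 ? values.reduce((a, b) => a + b, 0) : null,
    covered: values.length,
    spans: records.length,
  };
}

console.log(naiveSum(spans));        // reads as "zero cost"
console.log(sumWithCoverage(spans)); // reads as "no data": total is null, covered is 0
```

A dashboard built on the first form cannot tell a free run from an unmeasured one, which is why restoring the attribute matters more than tuning the queries.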

Current Behavior

The conclusion span already reads the durable agent_usage.json artifact — but only for gen_ai.usage.*, ignoring the effective_tokens field that the very same file contains:

// send_otlp_span.cjs:1745-1747  — effective tokens: ENV ONLY
const rawET = process.env.GH_AW_EFFECTIVE_TOKENS || "";
const effectiveTokens = rawET ? parseInt(rawET, 10) : NaN;

// send_otlp_span.cjs:1905-1906  — emitted only when env was present (it never is)
if (!isNaN(effectiveTokens) && effectiveTokens > 0) {
  attributes.push(buildAttr("gh-aw.effective_tokens", effectiveTokens));
}

// send_otlp_span.cjs:2092  — agent_usage.json IS read here, but only for gen_ai.usage.*
const agentUsage = readJSONIfExists("/tmp/gh-aw/agent_usage.json") || runtimeMetrics.tokenUsage || {};

Meanwhile parse_token_usage.cjs writes the value to disk every run:

// parse_token_usage.cjs:129-142
const agentUsage = {
  input_tokens: summary.totalInputTokens,
  output_tokens: summary.totalOutputTokens,
  cache_read_tokens: summary.totalCacheReadTokens,
  cache_write_tokens: summary.totalCacheWriteTokens,
  effective_tokens: effectiveTokens,            // <-- durable, but never read by the OTLP span
  ...(primaryModel ? { primary_model: primaryModel } : {}),
};
fs.writeFileSync(AGENT_USAGE_PATH, JSON.stringify(agentUsage) + "\n");  // /tmp/gh-aw/agent_usage.json
if (effectiveTokens > 0) {
  core.exportVariable("GH_AW_EFFECTIVE_TOKENS", String(effectiveTokens));  // <-- not visible in the post-step
}

Proposed Change

Fall back to the on-disk agent_usage.json artifact (already bundled in the agent artifact and present on disk for every job) when the env var is missing, mirroring how gen_ai.usage.* is already sourced. Also gate the attribute to the agent job so sum(gh-aw.effective_tokens) is not inflated across the multiple downstream jobs that download the same artifact.

// Proposed: actions/setup/js/send_otlp_span.cjs (~line 1745)
// Prefer the GH_AW_EFFECTIVE_TOKENS env var, but fall back to the durable
// agent_usage.json artifact: the env var is exported to GITHUB_ENV inside the
// agent job and is NOT visible in the OTLP conclusion post-step, so relying on
// it alone drops the attribute on 100% of spans.
const rawET = process.env.GH_AW_EFFECTIVE_TOKENS || "";
let effectiveTokens = rawET ? parseInt(rawET, 10) : NaN;
if (!(Number.isFinite(effectiveTokens) && effectiveTokens > 0)) {
  const usageForET = readJSONIfExists("/tmp/gh-aw/agent_usage.json");
  if (usageForET && typeof usageForET.effective_tokens === "number" && usageForET.effective_tokens > 0) {
    effectiveTokens = usageForET.effective_tokens;
  }
}

// Proposed: actions/setup/js/send_otlp_span.cjs (~line 1905)
// Gate to the agent job to avoid double-counting across downstream jobs that
// also have agent_usage.json on disk (same rationale as gen_ai.usage.* below).
if (jobName === "agent" && Number.isFinite(effectiveTokens) && effectiveTokens > 0) {
  attributes.push(buildAttr("gh-aw.effective_tokens", effectiveTokens));
}
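
For trying the fallback outside the action, the two proposed snippets can be combined into a standalone sketch. The function names resolveEffectiveTokens and shouldEmitEffectiveTokens are hypothetical, and the readJSONIfExists below is a simplified stand-in for the repository helper of the same name (assumed here to return null on any read or parse error; the real helper may behave differently). The sketch writes a temporary agent_usage.json instead of touching /tmp/gh-aw.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Simplified stand-in for the repository's readJSONIfExists helper (assumption).
function readJSONIfExists(filePath) {
  try {
    return JSON.parse(fs.readFileSync(filePath, "utf8"));
  } catch {
    return null;
  }
}

// Hypothetical wrapper: prefer the env var, fall back to agent_usage.json.
function resolveEffectiveTokens(env, usagePath) {
  const rawET = env.GH_AW_EFFECTIVE_TOKENS || "";
  const fromEnv = rawET ? parseInt(rawET, 10) : NaN;
  if (Number.isFinite(fromEnv) && fromEnv > 0) return fromEnv;

  const usage = readJSONIfExists(usagePath);
  if (usage && typeof usage.effective_tokens === "number" && usage.effective_tokens > 0) {
    return usage.effective_tokens;
  }
  return NaN;
}

// Hypothetical gate: only the agent job emits the attribute.
function shouldEmitEffectiveTokens(jobName, effectiveTokens) {
  return jobName === "agent" && Number.isFinite(effectiveTokens) && effectiveTokens > 0;
}

// Exercise the three paths against a temporary usage file.
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "gh-aw-"));
const usagePath = path.join(tmpDir, "agent_usage.json");
fs.writeFileSync(usagePath, JSON.stringify({ effective_tokens: 4200 }) + "\n");

const fromFile = resolveEffectiveTokens({}, usagePath); // env unset: file wins
const fromEnv = resolveEffectiveTokens({ GH_AW_EFFECTIVE_TOKENS: "99" }, usagePath); // env wins
const missing = resolveEffectiveTokens({}, path.join(tmpDir, "absent.json")); // neither source

fs.rmSync(tmpDir, { recursive: true, force: true });
console.log({ fromFile, fromEnv, missing });
```

Keeping the env var as the first choice preserves today's behavior wherever it is set (for example the safe_outputs job), so the fallback only changes spans that currently drop the attribute.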

Expected Outcome

After this change:

  • In Grafana / Honeycomb / Datadog / Sentry: gh-aw.effective_tokens becomes present on gh-aw.agent.conclusion spans for every engine, enabling sum/avg/p95 token-cost dashboards and threshold alerts with no per-engine special-casing.
  • In the JSONL mirror: the agent conclusion span gains a populated gh-aw.effective_tokens attribute, so post-hoc artifact debugging shows run cost without a live collector.
  • For on-call engineers: failed/timed-out runs can be correlated with token burn (context-exhaustion timeouts become visible).

(Note: Sentry's EAP currently types gh-aw.* custom attributes as string fields, so avg()/sum() in Sentry still reject them. This is a Sentry schema-inference behavior, not a gh-aw wire-format bug, so it is out of scope here. Grafana/Honeycomb/Datadog aggregate them fine.)

Implementation Steps

  • Edit actions/setup/js/send_otlp_span.cjs: add the agent_usage.json fallback for effectiveTokens (~line 1745) and gate emission to jobName === "agent" (~line 1905).
  • Update actions/setup/js/send_otlp_span.test.cjs to assert gh-aw.effective_tokens is emitted from agent_usage.json when GH_AW_EFFECTIVE_TOKENS is unset, and is absent on non-agent jobs.
  • Run make test-unit (or cd actions/setup/js && npx vitest run send_otlp_span) to confirm tests pass.
  • Run make fmt to ensure formatting.
  • Open a PR referencing this issue.
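
The "absent on non-agent jobs" assertion in the steps above guards against double counting. A rough sketch of the arithmetic follows; the list of jobs that have agent_usage.json on disk is an assumption for illustration (the issue only says "multiple downstream jobs"), and emittedTokens is a hypothetical helper, not code from the repository.

```javascript
// Assumption: these jobs all see the same agent_usage.json for one run.
const jobsWithArtifact = ["agent", "detection", "safe_outputs", "conclusion"];
const effectiveTokens = 4200; // value each job would read from the artifact

// Hypothetical helper: tokens a job's conclusion span would report.
function emittedTokens(jobName, gated) {
  const allowed = gated ? jobName === "agent" : true;
  const valid = Number.isFinite(effectiveTokens) && effectiveTokens > 0;
  return allowed && valid ? effectiveTokens : 0;
}

// sum(gh-aw.effective_tokens) for the run, without and with the agent-job gate.
const ungatedSum = jobsWithArtifact.reduce((t, j) => t + emittedTokens(j, false), 0);
const gatedSum = jobsWithArtifact.reduce((t, j) => t + emittedTokens(j, true), 0);

console.log({ ungatedSum, gatedSum });
```

Without the gate the run's cost is multiplied by the number of jobs holding the artifact; with it, the sum equals the single value written by parse_token_usage.cjs.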

Evidence from Live OTel Data (Sentry / Grafana)

Backend used: Sentry spans dataset — org github, project gh-aw, region https://us.sentry.io. (Grafana has a Tempo datasource grafanacloud-ghaw-traces, but the Grafana MCP build available to this run exposes only list_datasources/get_datasource — no tempo_traceql-search/tempo_get-trace — so Tempo trace querying was not possible. Noted as a backend/tooling limitation; Sentry provided sufficient evidence.)

Pipeline is healthy (rules out a broad export problem):

  • span.name:gh-aw.* over 24h → setup+conclusion spans for activation (348), conclusion (348), agent (345 setup / 344 conclusion), pre_activation (287), safe_outputs (282), detection (234).
  • Trace continuity intact: trace 797e7af5c08fc5b14427502603b2e4b0 joins gh-aw lifecycle spans with mcp.tool_call / gateway.backend.execute children under one trace.
  • gh-aw.run.status populated: success 1994 / failure 68 (24h).
  • Resource attributes verified present (HEAD local mirror + Sentry): service.version, github.repository, github.run_id, github.event_name, deployment.environment.

The gap:

  • has:gh-aw.effective_tokens → 0 spans / 30 days (and 0 / 24h).
  • span.name:gh-aw.agent.conclusion has:gh-aw.effective_tokens grouped by gh-aw.engine.id → No results (24h).
  • For contrast, gen_ai.usage.total_tokens appears on only 434 gh-aw.agent.conclusion spans / 30d (copilot 297, claude 111, codex 26; gemini/pi/antigravity 0) and on 0 in the last 24h, confirming that token telemetry is broadly missing and that the engine-agnostic effective_tokens is the right metric to make reliable.
  • The dedicated gh-aw.agent.agent span (intended token carrier) returns 0 results / 30 days, so the conclusion-span fallback is the only carrier — making its effective_tokens source the highest-leverage fix.

Related Files

  • actions/setup/js/send_otlp_span.cjs (lines 1745-1747, 1905-1906, 2092)
  • actions/setup/js/parse_token_usage.cjs (lines 129-142 — writes effective_tokens into agent_usage.json)
  • actions/setup/js/action_conclusion_otlp.cjs (conclusion-span entrypoint; passes startMs only)
  • actions/setup/js/send_otlp_span.test.cjs (unit tests to extend)
  • pkg/workflow/compiler_safe_outputs_job.go (line 672 — the only job explicitly wired with GH_AW_EFFECTIVE_TOKENS)

Generated by 📊 Daily OTel Instrumentation Advisor · opus48 4.1M ·

  • expires on Jun 6, 2026, 10:00 AM UTC
