📡 OTel Instrumentation Improvement: make gh-aw.effective_tokens reliable by reading the durable agent_usage.json artifact
Analysis Date: 2026-05-30
Priority: High
Effort: Small (< 2h)
Problem
sendJobConclusionSpan in actions/setup/js/send_otlp_span.cjs is supposed to emit gh-aw.effective_tokens — gh-aw's engine-agnostic per-run token-cost metric — on every conclusion span. But it sources the value only from the GH_AW_EFFECTIVE_TOKENS environment variable:
// send_otlp_span.cjs:1745-1747
const rawET = process.env.GH_AW_EFFECTIVE_TOKENS || "";
const effectiveTokens = rawET ? parseInt(rawET, 10) : NaN;
That env var is exported via core.exportVariable (i.e. $GITHUB_ENV) by parse_token_usage.cjs inside the agent job, but it is not propagated into the OTLP conclusion post-step environment (it is only explicitly wired into the safe_outputs job via needs.agent.outputs.effective_tokens). The result: when the conclusion span is built, GH_AW_EFFECTIVE_TOKENS is unset, effectiveTokens is NaN, and the attribute is silently dropped.
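To make the failure mode concrete, here is a minimal sketch of the env-only logic lifted into standalone functions. The names `effectiveTokensFromEnv` and `wouldEmit`, the plain-object `env` parameter, and the sample value are illustrative assumptions; none of them exist in send_otlp_span.cjs.

```javascript
// Hypothetical helper mirroring the env-only parsing shown above.
// `env` stands in for process.env so the behavior can be exercised directly.
function effectiveTokensFromEnv(env) {
  const rawET = env.GH_AW_EFFECTIVE_TOKENS || "";
  return rawET ? parseInt(rawET, 10) : NaN;
}

// Mirrors the emission guard: the attribute is pushed only for a positive number.
function wouldEmit(effectiveTokens) {
  return !isNaN(effectiveTokens) && effectiveTokens > 0;
}

// Variable set (made-up sample value): the attribute would be emitted.
const inAgentStep = effectiveTokensFromEnv({ GH_AW_EFFECTIVE_TOKENS: "48213" });
// Variable unset, as reported for the conclusion post-step: NaN, nothing emitted.
const inPostStep = effectiveTokensFromEnv({});

console.log(inAgentStep, wouldEmit(inAgentStep)); // 48213 true
console.log(inPostStep, wouldEmit(inPostStep)); // NaN false
```

No error or warning is raised on the second path, which is why the gap only shows up as missing data in the backend.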
Live telemetry confirms this is a 100% silent failure, not a rollout blip:
- Sentry spans dataset (github org, gh-aw project): has:gh-aw.effective_tokens → 0 spans over the last 30 days.
- In the last 24h: 0 of 344 gh-aw.agent.conclusion spans carry gh-aw.effective_tokens or gen_ai.usage.total_tokens, across all engines (copilot 215, claude 62, codex 34, pi 11, gemini 11, antigravity 11).
A DevOps engineer therefore cannot answer "how many tokens / how much did this run cost?" from OTel today — the one native, engine-agnostic cost attribute gh-aw emits never reaches the backend.
Why This Matters (DevOps Perspective)
gh-aw.effective_tokens is the single attribute that normalizes cost across all engines (copilot, claude, codex, gemini, pi, antigravity). The OTel GenAI gen_ai.usage.* attributes are engine-dependent and, per live data, reach only ~4% of runs (434 gh-aw.agent.conclusion spans over 30 days, 0 in the last 24h) because they depend on a result event in agent-stdio.log that several engines never emit.
With gh-aw.effective_tokens reliably present, these become possible with no per-engine special-casing:
- Dashboards: sum(gh-aw.effective_tokens) per workflow / per engine / per day, for token burn-down and cost attribution.
- Alerts: page when a workflow's effective tokens spike vs. its baseline (run-away agent detection).
- Triage: correlate gh-aw.run.status:failure (68/24h) with token consumption to spot timeouts caused by context exhaustion.
Today all of these silently return empty, which reads as "zero cost" rather than "no data" — the most dangerous kind of observability gap.
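As an illustration of the dashboard and alert ideas above, the following sketch aggregates gh-aw.effective_tokens over exported span records in plain JavaScript. The flattened span shape, the function names, the 3x spike factor, and all token values are assumptions made up for the example; a real backend would express the same thing in its own query language.

```javascript
// Assumed input: span records flattened to { attributes: { key: value } }.
function sumTokensByEngine(spans) {
  const totals = new Map();
  for (const span of spans) {
    const engine = span.attributes["gh-aw.engine.id"];
    const tokens = span.attributes["gh-aw.effective_tokens"];
    // Skip spans without the attribute so "no data" is never counted as zero.
    if (typeof tokens !== "number") continue;
    totals.set(engine, (totals.get(engine) || 0) + tokens);
  }
  return totals;
}

// Returns null (not false) when there is no measurement or no baseline,
// so callers can tell "no data" apart from "no spike".
function isSpike(tokens, baseline, factor = 3) {
  if (typeof tokens !== "number" || !(baseline > 0)) return null;
  return tokens > baseline * factor;
}

const spans = [
  { attributes: { "gh-aw.engine.id": "copilot", "gh-aw.effective_tokens": 1200 } },
  { attributes: { "gh-aw.engine.id": "copilot", "gh-aw.effective_tokens": 800 } },
  { attributes: { "gh-aw.engine.id": "claude", "gh-aw.effective_tokens": 5000 } },
  { attributes: { "gh-aw.engine.id": "codex" } }, // attribute missing, as it is today
];

const totals = sumTokensByEngine(spans);
console.log(totals.get("copilot")); // 2000
console.log(totals.has("codex")); // false
console.log(isSpike(5000, 1000)); // true
console.log(isSpike(undefined, 1000)); // null
```

Keeping a missing attribute distinct from a zero value is the design point here: an engine with no data drops out of the totals instead of appearing to cost nothing.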
Current Behavior
The conclusion span already reads the durable agent_usage.json artifact — but only for gen_ai.usage.*, ignoring the effective_tokens field that the very same file contains:
// send_otlp_span.cjs:1745-1747 (effective tokens: ENV ONLY)
const rawET = process.env.GH_AW_EFFECTIVE_TOKENS || "";
const effectiveTokens = rawET ? parseInt(rawET, 10) : NaN;

// send_otlp_span.cjs:1905-1906 (emitted only when env was present; it never is)
if (!isNaN(effectiveTokens) && effectiveTokens > 0) {
  attributes.push(buildAttr("gh-aw.effective_tokens", effectiveTokens));
}

// send_otlp_span.cjs:2092 (agent_usage.json IS read here, but only for gen_ai.usage.*)
const agentUsage = readJSONIfExists("/tmp/gh-aw/agent_usage.json") || runtimeMetrics.tokenUsage || {};
Meanwhile parse_token_usage.cjs writes the value to disk every run:
// parse_token_usage.cjs:129-142
const agentUsage = {
  input_tokens: summary.totalInputTokens,
  output_tokens: summary.totalOutputTokens,
  cache_read_tokens: summary.totalCacheReadTokens,
  cache_write_tokens: summary.totalCacheWriteTokens,
  effective_tokens: effectiveTokens, // <-- durable, but never read by the OTLP span
  ...(primaryModel ? { primary_model: primaryModel } : {}),
};
fs.writeFileSync(AGENT_USAGE_PATH, JSON.stringify(agentUsage) + "\n"); // /tmp/gh-aw/agent_usage.json
if (effectiveTokens > 0) {
  core.exportVariable("GH_AW_EFFECTIVE_TOKENS", String(effectiveTokens)); // <-- not visible in the post-step
}
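The write and read sides of this artifact can be exercised in isolation. The sketch below writes a usage object with the same field names to a temporary directory and reads it back with a tolerant reader. `readJSONOrNull` is a hypothetical stand-in for `readJSONIfExists`, whose actual implementation is not shown in this issue, and the token counts are arbitrary sample values.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical stand-in for readJSONIfExists: returns null on a missing or
// malformed file instead of throwing.
function readJSONOrNull(filePath) {
  try {
    return JSON.parse(fs.readFileSync(filePath, "utf8"));
  } catch {
    return null;
  }
}

// Use a temp dir rather than /tmp/gh-aw so the sketch has no side effects.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "agent-usage-"));
const usagePath = path.join(dir, "agent_usage.json");

// Same field names as the object written by parse_token_usage.cjs.
const written = {
  input_tokens: 1000,
  output_tokens: 250,
  cache_read_tokens: 400,
  cache_write_tokens: 0,
  effective_tokens: 1375, // arbitrary sample value, not a computed cost
};
fs.writeFileSync(usagePath, JSON.stringify(written) + "\n");

const readBack = readJSONOrNull(usagePath);
const missing = readJSONOrNull(path.join(dir, "does_not_exist.json"));

console.log(readBack.effective_tokens); // 1375
console.log(missing); // null

// Clean up the temporary directory.
fs.rmSync(dir, { recursive: true, force: true });
```

Because the file survives independently of any step's environment, a reader like this sees the value wherever the file is present on disk.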
Proposed Change
Fall back to the on-disk agent_usage.json artifact (already bundled in the agent artifact and present on disk for every job) when the env var is missing, mirroring how gen_ai.usage.* is already sourced. Also gate the attribute to the agent job so sum(gh-aw.effective_tokens) is not inflated across the multiple downstream jobs that download the same artifact.
// Proposed: actions/setup/js/send_otlp_span.cjs (~line 1745)
// Prefer the GH_AW_EFFECTIVE_TOKENS env var, but fall back to the durable
// agent_usage.json artifact: the env var is exported to GITHUB_ENV inside the
// agent job and is NOT visible in the OTLP conclusion post-step, so relying on
// it alone drops the attribute on 100% of spans.
const rawET = process.env.GH_AW_EFFECTIVE_TOKENS || "";
let effectiveTokens = rawET ? parseInt(rawET, 10) : NaN;
if (!(Number.isFinite(effectiveTokens) && effectiveTokens > 0)) {
  const usageForET = readJSONIfExists("/tmp/gh-aw/agent_usage.json");
  if (usageForET && typeof usageForET.effective_tokens === "number" && usageForET.effective_tokens > 0) {
    effectiveTokens = usageForET.effective_tokens;
  }
}

// Proposed: actions/setup/js/send_otlp_span.cjs (~line 1905)
// Gate to the agent job to avoid double-counting across downstream jobs that
// also have agent_usage.json on disk (same rationale as gen_ai.usage.* below).
if (jobName === "agent" && Number.isFinite(effectiveTokens) && effectiveTokens > 0) {
  attributes.push(buildAttr("gh-aw.effective_tokens", effectiveTokens));
}
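Pulled out of the span builder, the two proposed snippets amount to a small pure function plus a gate, which is also convenient to unit test. The sketch below is one assumed way to factor it; `resolveEffectiveTokens`, `shouldEmitEffectiveTokens`, and the injected `readJSON` parameter are hypothetical names that do not exist in send_otlp_span.cjs, and the token values are made up.

```javascript
// Hypothetical refactoring of the proposed change into testable functions.
// `readJSON` is injected so tests do not need a real file on disk.
function resolveEffectiveTokens(env, readJSON, usagePath = "/tmp/gh-aw/agent_usage.json") {
  const rawET = env.GH_AW_EFFECTIVE_TOKENS || "";
  const fromEnv = rawET ? parseInt(rawET, 10) : NaN;
  if (Number.isFinite(fromEnv) && fromEnv > 0) return fromEnv;

  // Fall back to the durable artifact when the env var is missing or invalid.
  const usage = readJSON(usagePath);
  if (usage && typeof usage.effective_tokens === "number" && usage.effective_tokens > 0) {
    return usage.effective_tokens;
  }
  return NaN;
}

// Only the agent job should carry the attribute, to avoid double-counting.
function shouldEmitEffectiveTokens(jobName, effectiveTokens) {
  return jobName === "agent" && Number.isFinite(effectiveTokens) && effectiveTokens > 0;
}

// Stub readers standing in for a present and an absent agent_usage.json.
const fakeFile = () => ({ effective_tokens: 1375 });
const noFile = () => null;

const envWins = resolveEffectiveTokens({ GH_AW_EFFECTIVE_TOKENS: "900" }, fakeFile);
const fileFallback = resolveEffectiveTokens({}, fakeFile);
const nothing = resolveEffectiveTokens({}, noFile);

console.log(envWins, fileFallback, nothing); // 900 1375 NaN
console.log(shouldEmitEffectiveTokens("agent", fileFallback)); // true
console.log(shouldEmitEffectiveTokens("safe_outputs", fileFallback)); // false
console.log(shouldEmitEffectiveTokens("agent", nothing)); // false
```

Keeping the env var as the first choice preserves existing behavior wherever it is set, while the gate keeps sum(gh-aw.effective_tokens) at one contribution per run.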
Expected Outcome
After this change:
- In Grafana / Honeycomb / Datadog / Sentry: gh-aw.effective_tokens becomes present on gh-aw.agent.conclusion spans for every engine, enabling sum/avg/p95 token-cost dashboards and threshold alerts with no per-engine special-casing.
- In the JSONL mirror: the agent conclusion span gains a populated gh-aw.effective_tokens attribute, so post-hoc artifact debugging shows run cost without a live collector.
- For on-call engineers: failed/timed-out runs can be correlated with token burn (context-exhaustion timeouts become visible).
(Note: Sentry's EAP currently types gh-aw.* custom attributes as string fields, so avg()/sum() in Sentry still rejects them — a Sentry schema-inference behavior, not a gh-aw wire-format bug, so out of scope here. Grafana/Honeycomb/Datadog aggregate them fine.)
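Implementation Steps
- In actions/setup/js/send_otlp_span.cjs: add the agent_usage.json fallback for effectiveTokens (~line 1745) and gate emission to jobName === "agent" (~line 1905).
- Extend actions/setup/js/send_otlp_span.test.cjs to assert gh-aw.effective_tokens is emitted from agent_usage.json when GH_AW_EFFECTIVE_TOKENS is unset, and is absent on non-agent jobs.
- Run make test-unit (or cd actions/setup/js && npx vitest run send_otlp_span) to confirm tests pass.
- Run make fmt to ensure formatting.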
Evidence from Live OTel Data (Sentry / Grafana)
Backend used: Sentry spans dataset — org github, project gh-aw, region https://us.sentry.io. (Grafana has a Tempo datasource grafanacloud-ghaw-traces, but the Grafana MCP build available to this run exposes only list_datasources/get_datasource — no tempo_traceql-search/tempo_get-trace — so Tempo trace querying was not possible. Noted as a backend/tooling limitation; Sentry provided sufficient evidence.)
Pipeline is healthy (rules out a broad export problem):
- span.name:gh-aw.* over 24h → setup+conclusion spans for activation (348), conclusion (348), agent (345 setup / 344 conclusion), pre_activation (287), safe_outputs (282), detection (234).
- Trace continuity intact: trace 797e7af5c08fc5b14427502603b2e4b0 joins gh-aw lifecycle spans with mcp.tool_call / gateway.backend.execute children under one trace.
- gh-aw.run.status populated: success 1994 / failure 68 (24h).
- Resource attributes verified present (HEAD local mirror + Sentry): service.version, github.repository, github.run_id, github.event_name, deployment.environment.
The gap:
- has:gh-aw.effective_tokens → 0 spans / 30 days (and 0 / 24h).
- span.name:gh-aw.agent.conclusion has:gh-aw.effective_tokens grouped by gh-aw.engine.id → No results (24h).
- For contrast, gen_ai.usage.total_tokens appears on only 434 gh-aw.agent.conclusion spans / 30d (copilot 297, claude 111, codex 26; gemini/pi/antigravity 0) and 0 in the last 24h. This confirms that token telemetry is broadly missing and that the engine-agnostic effective_tokens is the right metric to make reliable.
- The dedicated gh-aw.agent.agent span (intended token carrier) returns 0 results / 30 days, so the conclusion-span fallback is the only carrier, which makes its effective_tokens source the highest-leverage fix.
Related Files
- actions/setup/js/send_otlp_span.cjs (lines 1745-1747, 1905-1906, 2092)
- actions/setup/js/parse_token_usage.cjs (lines 129-142; writes effective_tokens into agent_usage.json)
- actions/setup/js/action_conclusion_otlp.cjs (conclusion-span entrypoint; passes startMs only)
- actions/setup/js/send_otlp_span.test.cjs (unit tests to extend)
- pkg/workflow/compiler_safe_outputs_job.go (line 672; the only job explicitly wired with GH_AW_EFFECTIVE_TOKENS)
Generated by 📊 Daily OTel Instrumentation Advisor · opus48 4.1M · ◷