Executive Summary
In the last 24 hours, agentic workflows in github/gh-aw emitted 5,475 token-bearing LLM API spans consuming 144.86M tokens (142.75M input / 2.12M output). The token mix is heavily input-dominated — agents push large prompts (full repo context, logs) and receive comparatively small completions. claude-sonnet-4.6 carries 81.9 % of total tokens across 2,418 spans, with gpt-5.4-mini a distant second at 11.5 %.
The top 10 individual runs account for ~26 % of the day's tokens (~38.2M), led by a single Daily Firewall Logs Collector and Reporter run that consumed 10.75M tokens — 22× the median trace and ~7 % of the daily total on its own. This is the strongest single optimization target.
Observability gap: gh-aw.workflow.name is set on gen_ai parent spans but is not propagated to the http.client children that carry gen_ai.usage.* token counts. Direct sum(gen_ai.usage.total_tokens) by gh-aw.workflow.name returns null; per-workflow attribution required cross-referencing via trace IDs. The figures below report top runs (one trace each) rather than workflow rollups for that reason.
Key Metrics
| Metric | Value |
|---|---:|
| Events analyzed (token-bearing http.client spans) | 5,475 |
| Events with token data | 5,475 |
| gen_ai parent spans (24h) | 3,879 |
| Total input tokens | 142,746,607 |
| Total output tokens | 2,118,233 |
| Total tokens | 144,864,840 |
| Unique workflows seen (parent-span attribute, top 50) | 50+ |
| Avg tokens / event | 26,459 |
| P95 tokens / event | 67,859 |
| Errors dataset events (24h) | 0 |
| Logs dataset events (24h) | 0 |
Tokens by Model
| Model | Spans | Input | Output | Total | Share |
|---|---:|---:|---:|---:|---:|
| claude-sonnet-4.6 | 2,418 | 117,679,870 | 905,133 | 118,585,003 | 81.9 % |
| gpt-5.4-mini-2026-03-17 | 679 | 16,378,706 | 279,620 | 16,658,326 | 11.5 % |
| claude-haiku-4.5 | 122 | 4,715,764 | 35,735 | 4,751,499 | 3.3 % |
| gpt-5.5-2026-04-23 | 49 | 1,927,071 | 28,942 | 1,956,013 | 1.4 % |
| claude-opus-4-7 | 2,182 | 875,372 | 845,218 | 1,720,590 | 1.2 % |
| claude-sonnet-4.5 | 22 | 1,070,870 | 23,361 | 1,094,231 | 0.8 % |
| gpt-4.1-2025-04-14 | 3 | 98,954 | 224 | 99,178 | 0.07 % |
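The Share column is each model's total divided by the day's grand total. A minimal sketch of that arithmetic, using only the Total values from the table above (the helper name `share_pct` is illustrative, not part of any report tooling):

```python
# Total tokens per model, copied from the "Tokens by Model" table.
totals = {
    "claude-sonnet-4.6": 118_585_003,
    "gpt-5.4-mini-2026-03-17": 16_658_326,
    "claude-haiku-4.5": 4_751_499,
    "gpt-5.5-2026-04-23": 1_956_013,
    "claude-opus-4-7": 1_720_590,
    "claude-sonnet-4.5": 1_094_231,
    "gpt-4.1-2025-04-14": 99_178,
}

# The per-model totals should add up to the "Total tokens" key metric.
grand_total = sum(totals.values())


def share_pct(tokens: int) -> float:
    """Percentage of the day's tokens attributable to one model."""
    return 100.0 * tokens / grand_total


shares = {model: share_pct(tokens) for model, tokens in totals.items()}

for model, pct in shares.items():
    print(f"{model:26s} {pct:6.2f} %")
```

Summing the rows this way also doubles as a consistency check between the model table and the Key Metrics total.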
Top 10 Workflow Runs by Token Consumption
Resolved via trace_id joins between token-bearing http.client spans and the matching gen_ai parent that carries gh-aw.workflow.name / gh-aw.run.id.
| # | Workflow | Run | LLM Spans | Input | Output | Total |
|---:|---|---|---:|---:|---:|---:|
| 1 | Daily Firewall Logs Collector and Reporter | §26381491116 | 155 | 10,698,174 | 51,879 | 10,750,053 |
| 2 | daily-experiment-report | §26392702536 | 56 | 4,689,831 | 41,421 | 4,731,252 |
| 3 | Daily Syntax Error Quality Check | §26391962230 | 74 | 3,952,294 | 10,647 | 3,962,941 |
| 4 | Dead Code Removal Agent | §26364187746 | 62 | 3,724,824 | 17,330 | 3,742,154 |
| 5 | Daily Testify Uber Super Expert | §26368957854 | 50 | 3,007,447 | 17,496 | 3,024,943 |
| 6 | Q | §26364669602 | 49 | 2,588,363 | 10,226 | 2,598,589 |
| 7 | Daily Compiler Threat Spec Optimizer | §26381611018 | 43 | 2,492,355 | 13,216 | 2,505,571 |
| 8 | Copilot CLI Deep Research Agent | §26384338048 | 39 | 2,284,317 | 19,531 | 2,303,848 |
| 9 | Layout Specification Maintainer | §26391872642 | 38 | 2,285,306 | 10,070 | 2,295,376 |
| 10 | Daily Compiler Quality Check | §26381848944 | 35 | 2,246,855 | 18,009 | 2,264,864 |
Top-10 combined: ~38.2M tokens, ~26 % of daily total.
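The trace_id join used to resolve the table above can be sketched roughly as follows. This is an illustration under assumptions, not the query the report actually ran: spans are modelled as plain dictionaries, the `trace_id` key and the function name `attribute_tokens` are hypothetical, and the sample values are invented.

```python
from collections import defaultdict


def attribute_tokens(token_spans, parent_spans):
    """Roll up token counts per (workflow, run) by joining on trace_id."""
    # Step 1: map each trace to the identity carried by its gen_ai parent span.
    identity_by_trace = {}
    for span in parent_spans:
        name = span.get("gh-aw.workflow.name")
        if name is not None:
            identity_by_trace[span["trace_id"]] = (name, span.get("gh-aw.run.id"))

    # Step 2: sum tokens from the http.client children under that identity.
    totals = defaultdict(int)
    for span in token_spans:
        key = identity_by_trace.get(span["trace_id"], ("(unattributed)", None))
        totals[key] += span["gen_ai.usage.total_tokens"]
    return dict(totals)


# Invented sample data: trace "t3" has no parent carrying a workflow name.
parents = [
    {"trace_id": "t1", "gh-aw.workflow.name": "workflow-a", "gh-aw.run.id": "100"},
    {"trace_id": "t2", "gh-aw.workflow.name": None, "gh-aw.run.id": None},
]
children = [
    {"trace_id": "t1", "gen_ai.usage.total_tokens": 1000},
    {"trace_id": "t1", "gen_ai.usage.total_tokens": 500},
    {"trace_id": "t3", "gen_ai.usage.total_tokens": 70},
]

rollup = attribute_tokens(children, parents)
print(rollup)
```

Keeping an explicit "(unattributed)" bucket makes the size of the unresolved long tail visible instead of silently dropping it.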
Data Quality and Gaps
- Workflow attribute not propagated to LLM spans. gh-aw.workflow.name and gh-aw.run.id are populated on span.op:gen_ai parent spans only. The span.op:http.client children that carry gen_ai.usage.input_tokens / output_tokens / total_tokens have these attributes as null. As a result, sum(gen_ai.usage.total_tokens) by gh-aw.workflow.name over the spans dataset returns a single null bucket with 5,458 spans / 144.73M tokens. Per-workflow attribution required iterating over the top traces by token sum and querying trace:<id> has:gh-aw.workflow.name to resolve each one, which is feasible for the top N but not for the long tail.
- The transaction field is null on all token-bearing spans, so transaction-level rollup is also unavailable.
- The errors and logs datasets returned zero events for 24h, confirmed via empty list_events(dataset=errors) and list_events(dataset=logs). This is either silent success (no failures captured) or an instrumentation gap; the two cannot be distinguished from telemetry alone.
- Token-precedence note: all data was sourced from gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.usage.total_tokens. No ai.*_tokens or usage.*_tokens aliases were present, so there is no double-counting risk in this window.
- gen_ai parent spans (3,879) vs token-bearing spans (5,475): the count gap is expected, since a single agent turn often issues multiple LLM HTTP calls (tool-use loops, retries). Long-tail workflows below the top 25 traces are aggregated into model/global totals but not attributed by name.
- One unresolved workflow name in the top spans dataset: a bucket labelled [Filtered] (20 spans) appeared in the gen_ai parent aggregation, indicating Sentry data-scrubbing redacted the workflow name for that run.
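The first gap above comes down to where an attribute is attached. In OTLP, resource attributes sit on the `resourceSpans` entry and describe every span nested under it, while span attributes belong to a single span. The sketch below builds a minimal OTLP/JSON payload to show that shape. The workflow name, run ID, scope name and span IDs are placeholders, timing fields are omitted for brevity, and whether the backend exposes resource attributes on each span for aggregation is an assumption that would need to be verified.

```python
import json


def otlp_attr(key: str, value: str) -> dict:
    """Encode one string attribute in OTLP/JSON key-value form."""
    return {"key": key, "value": {"stringValue": value}}


def build_payload(workflow_name: str, run_id: str, spans: list) -> dict:
    """Wrap spans in one resourceSpans entry that carries the workflow identity."""
    return {
        "resourceSpans": [
            {
                # Resource attributes are shared by every span nested below,
                # including auto-instrumented http.client children.
                "resource": {
                    "attributes": [
                        otlp_attr("gh-aw.workflow.name", workflow_name),
                        otlp_attr("gh-aw.run.id", run_id),
                    ]
                },
                "scopeSpans": [
                    {"scope": {"name": "example-scope"}, "spans": spans}
                ],
            }
        ]
    }


# Placeholder spans: IDs and names are invented, not real telemetry.
example_spans = [
    {"traceId": "a" * 32, "spanId": "b" * 16, "name": "gen_ai parent"},
    {"traceId": "a" * 32, "spanId": "c" * 16, "name": "http.client child"},
]

payload = build_payload("example-workflow", "12345", example_spans)
print(json.dumps(payload, indent=2))
```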
Top Traces by Tokens (raw)
Ranked aggregate over has:gen_ai.usage.total_tokens grouped by trace, top 25:
| Rank | Trace | LLM Spans | Total Tokens |
|---:|---|---:|---:|
| 1 | dd64b9489dfccdacc29707d8b81ff798 | 155 | 10,750,053 |
| 2 | 5f88618ba71e17a9f9c6bf2f3de6b2f7 | 56 | 4,731,252 |
| 3 | 684d3866a4544953a15018858ddc80ec | 74 | 3,962,941 |
| 4 | 9f2914242ceb26a1001c4464fc571052 | 62 | 3,742,154 |
| 5 | 49846a6f751540f74fc88df0287e6137 | 50 | 3,024,943 |
| 6 | f7f54ad1d117ed5e78586ea5595e5467 | 49 | 2,598,589 |
| 7 | a6ccf2d1bf48f51c6768a47cb1c356a2 | 43 | 2,505,571 |
| 8 | 6046653e9a0e04c60565ac03a5ac00cd | 39 | 2,303,848 |
| 9 | a3ec39137084759ec6c31f7c645c8116 | 38 | 2,295,376 |
| 10 | c7ab4cf3e3f84e64ff9e119cbdbdaaa7 | 35 | 2,264,864 |
| 11 | 8a43d206c57044f44309ccaf35c74493 | 33 | 2,025,021 |
| 12 | 9576b78819df48bb5b1a9852bdae93b6 | 49 | 1,926,224 |
| 13 | ef3e9a7757f2046bfc4fdc77a4e65234 | 26 | 1,869,476 |
| 14 | 51870a5df94b26ad9ba336966f4605e5 | 36 | 1,794,813 |
| 15 | 9a6cafb05078be9bbd09785930e92440 | 42 | 1,728,219 |
| 16 | b42f5d6e60cae6c60dbd92caeafcba68 | 68 | 1,698,446 |
| 17 | 1004b6b975a7dd8064bd9a33efa3a75f | 43 | 1,603,427 |
| 18 | 9cc48e3ec74785148163b92e56f2a67e | 28 | 1,587,024 |
| 19 | b49e4fa159f1a277d3e5e74d4e4422ec | 37 | 1,586,080 |
| 20 | aa8cee327ae88a068175babb8de849b4 | 32 | 1,539,675 |
| 21 | 2e0a433b0bfc6df4d7ace21524aa6d6f | 51 | 1,494,140 |
| 22 | 9b528ee883a1d7233a3526bd62d32a83 | 28 | 1,488,746 |
| 23 | 233ab5df05d5e5257d1e5d7f9fd50220 | 41 | 1,485,784 |
| 24 | bc5988a0d81f705718e125e1018bf7e7 | 35 | 1,412,593 |
| 25 | bf468a29c4852bdecf1577aac3ed6819 | 27 | 1,343,796 |
Recommendations
- Propagate workflow identity onto LLM spans. Add gh-aw.workflow.name, gh-aw.workflow.id, and gh-aw.run.id as OTLP resource attributes in actions/setup/js/send_otlp_span.cjs (or wherever the SDK is initialized for agent runs). Today the auto-instrumented http.client spans for the Anthropic / OpenAI SDKs have these attributes as null, which blocks single-query workflow rollups in Sentry. Until this is fixed, daily reports must trace-walk to attribute tokens.
- Investigate Daily Firewall Logs Collector and Reporter run §26381491116. 10.75M tokens in one run (155 LLM calls, ~69K avg per call) is an outlier and the single largest consumer in the window. Likely cause: passing raw firewall logs verbatim into a sonnet prompt on every iteration. Consider chunked summarization with a claude-haiku-4.5 first pass, or pre-aggregating logs before they reach the agent.
- Move bulk scheduled scans off claude-sonnet-4.6. Sonnet drives 81.9 % of all tokens; many of the top-10 consumers are linting / code-review style passes (Daily Syntax Error Quality Check, Daily Compiler Quality Check, Dead Code Removal Agent, Layout Specification Maintainer) where claude-haiku-4.5 produces comparable structured output at roughly 5× lower input cost. Pilot one of these on haiku and compare PR quality before broader rollout.
- Add a per-run token soft cap with checkpoint summarization. 11 of the top 25 traces each exceed 2M tokens, almost entirely input. Introducing a max_prompt_tokens budget that triggers summarization-of-history before re-prompting would clip the long tail of runaway agent loops without changing the model mix.
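The soft cap in the last recommendation could look roughly like the sketch below. It is an assumption-laden illustration, not gh-aw code: `compact_history`, `keep_recent`, and the `count_tokens` / `summarize` callables are hypothetical names, and the stand-ins used in the example (whitespace word count, a one-line placeholder summary) would be replaced by a real tokenizer and a real summarization call.

```python
def compact_history(history, max_prompt_tokens, count_tokens, summarize, keep_recent=2):
    """Return a history that respects the soft cap, summarizing older turns if needed."""
    total = sum(count_tokens(message) for message in history)
    if total <= max_prompt_tokens:
        return history  # under budget: re-prompt with the history unchanged

    # Keep the most recent turns verbatim and fold everything older
    # into a single checkpoint summary.
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]
    if not older:
        return history  # nothing left to summarize; the cap is soft, not hard
    return [summarize(older)] + recent


# Crude stand-ins for illustration only.
def word_count(message):
    return len(message.split())


def placeholder_summary(messages):
    return f"[summary of {len(messages)} messages]"


history = ["a b c d", "e f g h", "i j", "k l"]  # 12 "tokens" by word count
compacted = compact_history(history, 8, word_count, placeholder_summary)
unchanged = compact_history(history, 12, word_count, placeholder_summary)
print(compacted)
```

Because the cap is soft, the compacted history can still exceed the budget when the recent turns alone are large; a stricter variant would need to truncate or summarize those as well.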
Generated by 📊 Daily Token Consumption Report (Sentry OTel) · opus47 12.1M · ◷