Executive Summary
Over the last 24h, github/gh-aw was healthy overall, with isolated failures and material observability gaps. Of 2,553 conclusion spans carrying gh-aw.run.status, 29 (1.14%) were marked failure. These belong to exactly 7 distinct failed runs, one per workflow, so there is no recurring per-workflow failure pattern. The errors and logs datasets in Sentry are empty for this window. No spans report gh-aw.run.status:timeout or cancelled.
The most actionable signal is instrumentation, not runtime: gen_ai.response.finish_reasons, service.version, release, and span.status are absent on all sampled spans. From Sentry alone it is therefore not possible to distinguish a genuine agent failure from a length-truncated response or a runner timeout, nor to pin a regression to a gh-aw release.
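The headline figures (failure rate, distinct failed runs, recurring workflows) can be recomputed from raw span rows with a small helper. This is a minimal sketch under assumptions: the row shape `{ runId, workflow, status }` and the name `summarizeRunStatus` are hypothetical and are not part of gh-aw or the Sentry API.

```javascript
// Hypothetical row shape: { runId: string, workflow: string, status: string }.
// Summarizes conclusion spans the way the executive summary does.
function summarizeRunStatus(spans) {
  const failed = spans.filter((s) => s.status === "failure");

  // Group failed run ids by workflow to spot workflows failing more than once.
  const runsByWorkflow = new Map();
  for (const s of failed) {
    if (!runsByWorkflow.has(s.workflow)) runsByWorkflow.set(s.workflow, new Set());
    runsByWorkflow.get(s.workflow).add(s.runId);
  }

  const recurringWorkflows = [...runsByWorkflow]
    .filter(([, runs]) => runs.size > 1)
    .map(([workflow]) => workflow);

  return {
    totalSpans: spans.length,
    failedSpans: failed.length,
    // Percentage rounded to two decimals; 0 when there are no spans.
    failureRatePct: spans.length
      ? Number(((failed.length / spans.length) * 100).toFixed(2))
      : 0,
    distinctFailedRuns: new Set(failed.map((s) => s.runId)).size,
    recurringWorkflows,
  };
}
```

With the counts in this report, 29 failed spans out of 2,553 gives 1.14% under this rounding. An empty `recurringWorkflows` list corresponds to the "one failed run per workflow" observation.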
Top Reliability Findings
| Priority | Workflow | Problem | Evidence | Next Action |
| --- | --- | --- | --- | --- |
| P1 | All gh-aw spans | gen_ai.response.finish_reasons missing on every span (24h, 5,187 spans) | has:gen_ai.response.finish_reasons returns 0 results; emit-side only fires when runtimeMetrics.stopReason is parsed from agent stdio log (send_otlp_span.cjs:1795-1798) | Backfill stopReason for copilot/codex engines so finish-reason-based failure classification works |
| P1 | All gh-aw spans | service.version / release null on every span (5,187/5,187) | has:service.version → 0 results; has:release → 1 group, all null; send_otlp_span.cjs:319-324 only emits service.version when scopeVersion && scopeVersion !== "unknown" | Resolve a real version (git SHA or release tag) and pass as scopeVersion; map to Sentry release for regression correlation |
| P2 | Daily Code Metrics and Trend Tracking Agent | Failed run on claude engine, agent.conclusion 13.7 min | run 26119008282, trace 2f5411f17313c95e911449cd270b8854, gh-aw.agent.conclusion 819,746 ms | Inspect run logs; without finish_reasons cannot distinguish model truncation vs. runtime error |
| P2 | Safe Output Health Monitor | Failed run on claude engine, agent.conclusion 12.8 min | run 26143483059, trace 254dd9d49f37603cdc9825c5c1ef4f91, gh-aw.agent.conclusion 766,181 ms | Same: needs finish-reason or runner-side conclusion attribute |
| P2 | Documentation Noob Tester | Failed run on copilot engine, agent.conclusion 9.3 min | run 26142529471, trace a052ce52143244b41c0714e4331d9e68, gh-aw.agent.conclusion 559,485 ms | Inspect; copilot engine is over-represented (5/7 of failures) |
| P3 | Sentry errors + logs datasets | Empty for 24h | Both list_events queries on errors and logs returned 0 results | Inconclusive: gh-aw does not currently emit error events or logs to Sentry; consider mirroring exporter failures as Sentry events to surface them outside the trace surface |
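The service.version finding hinges on the guard `scopeVersion && scopeVersion !== "unknown"`. The sketch below shows one way a caller could resolve a usable value before that guard runs. It is an illustration, not the code in send_otlp_span.cjs: the helper names and the `GH_AW_VERSION` variable are hypothetical. `GITHUB_SHA` is a default GitHub Actions variable, but it identifies the commit that triggered the workflow, which is not necessarily the gh-aw release in use.

```javascript
// Hypothetical helper: return the first candidate that would pass the
// `scopeVersion && scopeVersion !== "unknown"` guard, or undefined.
function resolveScopeVersion(env = process.env) {
  // GH_AW_VERSION is an assumed variable name used only for illustration.
  const candidates = [env.GH_AW_VERSION, env.GITHUB_SHA];
  for (const candidate of candidates) {
    const value = typeof candidate === "string" ? candidate.trim() : "";
    if (value && value !== "unknown") return value;
  }
  return undefined;
}

// Builds OTLP/JSON resource attributes, applying the same guard the report quotes.
function buildResourceAttributes(serviceName, scopeVersion) {
  const attributes = [{ key: "service.name", value: { stringValue: serviceName } }];
  if (scopeVersion && scopeVersion !== "unknown") {
    attributes.push({ key: "service.version", value: { stringValue: scopeVersion } });
  }
  return attributes;
}
```

If every candidate is missing or "unknown", the attribute is still omitted, which keeps the current behaviour but makes the omission an explicit fallback rather than the only path.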
Representative Traces
Continuity was verified in each trace below: all expected conclusion sub-spans (agent.conclusion → detection.conclusion → safe_outputs.conclusion → conclusion.conclusion) share the same trace and gh-aw.run.id. Trace continuity is healthy; only the runtime outcome cannot be fully characterised.
- Safe Output Health Monitor (claude, failed): trace · run §26143483059 · agent.conclusion 766,181 ms · 4 conclusion spans marked failure.
- Documentation Noob Tester (copilot, failed): trace · run §26142529471 · agent.conclusion 559,485 ms.
- Daily Code Metrics and Trend Tracking Agent (claude, failed): trace · run §26119008282 · agent.conclusion 819,746 ms.
- Other failed runs (4 conclusion spans each, all gh-aw.run.status:failure): §26146345218 (PR Sous Chef), §26148162505 (Sub-Issue Closer), §26111831694 (Daily SPDD Spec Planner), §26155398474 (Contribution Check).
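The continuity check applied to these traces (one trace id, one gh-aw.run.id, every expected conclusion sub-span present) can be expressed as a short function. This is a sketch under assumptions: the span shape `{ name, traceId, runId }`, the function name, and the fully prefixed span names in the default list are illustrative and should be adjusted to whatever the query surface actually returns.

```javascript
// Assumed default span names, using the gh-aw.* prefix seen in the latency table.
const EXPECTED_CONCLUSION_SPANS = [
  "gh-aw.agent.conclusion",
  "gh-aw.detection.conclusion",
  "gh-aw.safe_outputs.conclusion",
  "gh-aw.conclusion.conclusion",
];

// Hypothetical span shape: { name: string, traceId: string, runId: string }.
function checkTraceContinuity(spans, expectedNames = EXPECTED_CONCLUSION_SPANS) {
  const traceIds = new Set(spans.map((s) => s.traceId));
  const runIds = new Set(spans.map((s) => s.runId));
  const present = new Set(spans.map((s) => s.name));
  const missing = expectedNames.filter((name) => !present.has(name));

  return {
    // Continuous only if there is exactly one trace, one run, and no gaps.
    continuous:
      spans.length > 0 && traceIds.size === 1 && runIds.size === 1 && missing.length === 0,
    traceIds: [...traceIds],
    runIds: [...runIds],
    missing,
  };
}
```

Returning the offending ids and missing names, rather than a bare boolean, makes it easier to tell a broken trace context apart from a job that never ran.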
Latency landscape (informational, not a regression)
Top span groups by count, 24h window (avg / max duration):
| Span | Count | Avg (ms) | Max (ms) |
| --- | --- | --- | --- |
| gh-aw.agent.conclusion | 359 | 277,567 | 1,554,868 |
| gh-aw.pre_activation.conclusion | 603 | 20,400 | 766,509 |
| gh-aw.detection.conclusion | 260 | 60,105 | 212,912 |
| gh-aw.safe_outputs.conclusion | 327 | 9,726 | 90,927 |
| gh-aw.upload_assets.conclusion | 9 | 36,663 | 48,142 |
The 25.9-min gh-aw.agent.conclusion max is a success (Copilot Agent Prompt Clustering Analysis, run §26157150065), so the top-end latency reflects legitimate long agentic work, not timeouts. Six of the seven longest agent.conclusion spans (>15 min) are gh-aw.run.status:success.
Recommendations
- Wire gen_ai.response.finish_reasons on every agent-job conclusion span, not only when an engine writes stop_reason to agent-stdio.log. For copilot/codex engines, derive length / tool_use / end_turn from runner outcome or set a sentinel unknown so length-truncation is queryable. Code site: actions/setup/js/send_otlp_span.cjs:1787-1799.
- Emit a real service.version (commit SHA or release tag) for OTLP resource attrs so Sentry can populate release and surface "this regression started at version X". Code site: actions/setup/js/send_otlp_span.cjs:319-324.
- Triage the 7 failed runs above against runner logs to confirm whether agent.conclusion durations >9 min represent genuine model failure or runner-side process termination; this validates whether recommendation #1 would have caught them.
- Add a periodic gh-aw.run.status:failure dashboard panel grouped by gh-aw.engine.id; the current sample shows 5/7 of failures on the copilot engine vs. 2/7 on claude, which is worth watching but not yet a confirmed regression.
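The first recommendation (always emit a finish reason, falling back to a sentinel) could look roughly like the sketch below. It is not the logic in send_otlp_span.cjs: the function names and the `outcomeMap` parameter are hypothetical, and any mapping from a runner outcome to a finish reason is an assumption that would need validating per engine. The attribute encoding follows the OTLP/JSON array form.

```javascript
// Hypothetical helper: prefer the parsed stopReason, then an optional
// caller-supplied runner-outcome mapping, then the sentinel "unknown".
function resolveFinishReasons(runtimeMetrics, runnerOutcome, outcomeMap = new Map()) {
  const parsed = runtimeMetrics && runtimeMetrics.stopReason;
  if (typeof parsed === "string" && parsed.trim()) return [parsed.trim()];

  const derived = outcomeMap.get(runnerOutcome);
  return [derived || "unknown"];
}

// Encodes the reasons as an OTLP/JSON string-array attribute.
function toFinishReasonsAttribute(reasons) {
  return {
    key: "gen_ai.response.finish_reasons",
    value: { arrayValue: { values: reasons.map((r) => ({ stringValue: r })) } },
  };
}
```

A call such as `resolveFinishReasons(undefined, "failure")` yields `["unknown"]`, so the attribute is present on every span while "no data" stays distinguishable from a real `length` or `end_turn` value.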
Notes
- Sentry MCP tool surface: search_events and get_trace_details are not available in this build of the MCP server; trace inspection used list_events with trace:<id> filter, which preserves full continuity verification.
- Run-status mapping clarification: the prompt's checklist named gh_aw.workflow_name, but the emit-side attribute key is gh-aw.workflow.name (send_otlp_span.cjs:1105,1744). The dotted form is present on every span; the underscored form does not exist. Treat this as a docs/spec inconsistency, not an instrumentation gap.
- span.status field is null on all sampled spans. OTLP status.code is set on the emit side (send_otlp_span.cjs:1725-1738), but Sentry's span-search surface does not appear to map it to the span.status column for this project. gh-aw.run.status is the reliable failure indicator on this backend.
- No timeout or cancelled values seen on gh-aw.run.status in 24h. Either no such runs occurred, or agentConclusion / workflowRunConclusion never produced those raw values in the window (send_otlp_span.cjs:1720-1729).
- Datasets errors and logs are empty for this window. This is an explicit observability finding, not a silent skip: gh-aw does not currently emit non-trace telemetry to Sentry.
- Inconclusive runtime outcome: per the operating contract, the 7 failed runs are reported as confirmed failures + confirmed instrumentation gap (no finish-reason, no release pinning) rather than as confirmed timeouts. The cluster is small and one-shot per workflow, so it is not yet a recurring pattern.
Generated by 🚨 Daily Reliability Review · ● 15.5M · ◷