Executive Summary
Window: last 24h ending 2026-05-20 (UTC). Sentry org github, project gh-aw. Spans dataset is healthy (5,078 spans; 4,998 gen_ai, 80 default). errors and logs datasets are empty for this project over the window — treat that as a calm signal, not necessarily a clean bill of health.
No timeouts, cancellations, or OTLP exporter failures were observed. 29 spans across 7 workflows carry gh-aw.run.status:failure. The same trace can contain both failure and success spans, so this represents per-span runtime outcome rather than a count of failed runs.
Four instrumentation gaps recur across the window and limit what conclusions can be drawn from spans alone: span.status is null on every span (OTLP status.code → Sentry mapping not applied to spans dataset), release is null on every span (resource service.version not surfacing as Sentry release), service.version is not queryable via has:, and gen_ai.response.finish_reasons is absent on all 5,078 spans (emit-side gate on runtimeMetrics.stopReason — see actions/setup/js/send_otlp_span.cjs:1795-1797).
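Each of the four gaps above is in effect an attribute-coverage check over the spans dataset. The following is a minimal sketch of such a check, assuming spans have been exported as plain objects with an `attributes` map. The span shape, the sample data, and the `attributeCoverage` helper are illustrative assumptions, not part of Sentry or gh-aw.

```javascript
// Hypothetical span shape: { trace: string, attributes: { [name]: value } }.
// For each attribute name, counts how many spans carry a non-null value.
function attributeCoverage(spans, attributeNames) {
  const coverage = {};
  for (const name of attributeNames) {
    const present = spans.filter((span) => {
      const value = span.attributes ? span.attributes[name] : undefined;
      return value !== undefined && value !== null;
    }).length;
    coverage[name] = { present, total: spans.length };
  }
  return coverage;
}

// Made-up sample data for illustration only.
const sampleSpans = [
  { trace: "t1", attributes: { "gh-aw.run.status": "success", "span.status": null } },
  { trace: "t1", attributes: { "gh-aw.run.status": "failure", "span.status": null } },
];

const coverageReport = attributeCoverage(sampleSpans, [
  "gh-aw.run.status",
  "span.status",
  "gen_ai.response.finish_reasons",
]);
console.log(coverageReport);
```

An attribute whose `present` count is zero while `total` is non-zero corresponds to the kind of project-wide gap reported here, as opposed to a sparse but working attribute.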
Top Reliability Findings
| Priority | Workflow | Problem | Evidence | Next Action |
| --- | --- | --- | --- | --- |
| P1 | Daily Code Metrics and Trend Tracking Agent | Mixed-status run: trace contains both failure and success gen_ai spans | 5 failure spans / 24h, trace 2f5411f17313c95e911449cd270b8854 | Inspect agent loop; capture failing tool/output for the failure spans |
| P2 | PR Sous Chef | 4 failure spans / 24h; also exhibits a long ~19m gen_ai span on a different trace | 4 failure spans (trace f7ba070a7742343f13dd9eb23ea7d3ba); 1,162s span on trace 870f84211d4adcadbce47409aad81f73 | Verify whether failures correlate with long-running iterations |
| P2 | Safe Output Health Monitor | 4 failure spans on one trace | trace 254dd9d49f37603cdc9825c5c1ef4f91 | Check safe-output emission path; relevant since safe-outputs are this project's primary write surface |
| P2 | Sub-Issue Closer, Documentation Noob Tester, Contribution Check, Daily SPDD Spec Planner | 4 failure spans each | aggregate of gh-aw.run.status:failure spans by workflow | Group failures by workflow to see whether they share a common failing engine/event |
| P3 | Test Quality Sentinel | Token outlier: 834,846 total tokens (818,761 input / 16,085 output) on one trace | trace 50762cc694e924c0eb15d76837bf098d | Add prompt caching or scope reduction; high input:output ratio suggests caching wins |
| P3 | Copilot Agent Prompt Clustering Analysis | Latency outlier: single gen_ai span 1,402,449ms (~23.4 min) | trace cdee6c6becf338409e2aa12d3c335f91 | Confirm whether this is intended (long agent loop) or a single stuck LLM call; with gen_ai.response.finish_reasons absent, truncation cannot be diagnosed |
| P3 (gap) | All workflows | gen_ai.response.finish_reasons absent on all 5,078 spans | has:gen_ai.response.finish_reasons → no results | Verify runtimeMetrics.stopReason is being parsed; emit a default value (stop / unknown) so truncation is detectable |
| P3 (gap) | All workflows | release and service.version not queryable on spans (5,078 spans, all null) | has:release → 5,078 with release:null; has:service.version → no results | Confirm resource attribute → Sentry release mapping; without it, regression-to-deploy correlation is blocked |
| P3 (gap) | All workflows | span.status null on every span | aggregate by span.status returns only null | OTLP status.code set in send_otlp_span.cjs is not flowing through to the spans dataset; queries must rely on gh-aw.run.status instead |
Representative Traces
Mixed-status run — Daily Code Metrics and Trend Tracking Agent
Trace 2f5411f17313c95e911449cd270b8854 (2026-05-19 19:05-19:21 UTC). Across the trace, the agent emits multiple gen_ai spans; some carry gh-aw.run.status:success and others gh-aw.run.status:failure, indicating per-span runtime outcome rather than a single run-level verdict. Continuity by trace filter is intact (all spans share the trace id and the gh-aw.workflow.name attribute).
View: https://github.sentry.io/explore/traces/trace/2f5411f17313c95e911449cd270b8854
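Because gh-aw.run.status is a per-span outcome, counting failed runs requires rolling spans up by trace first. A minimal sketch of that rollup follows, assuming each span has been reduced to its trace id and gh-aw.run.status value. The input shape, the sample trace ids, and the "mixed" verdict label are illustrative assumptions.

```javascript
// Hypothetical input: one { trace, status } object per span, where status is
// the span's gh-aw.run.status value.
function summarizeTraces(spans) {
  const byTrace = new Map();
  for (const { trace, status } of spans) {
    const counts = byTrace.get(trace) || { success: 0, failure: 0 };
    if (status === "success" || status === "failure") counts[status] += 1;
    byTrace.set(trace, counts);
  }
  const summary = [];
  for (const [trace, counts] of byTrace) {
    let verdict = "unknown";
    if (counts.failure > 0 && counts.success > 0) verdict = "mixed";
    else if (counts.failure > 0) verdict = "failure";
    else if (counts.success > 0) verdict = "success";
    summary.push({ trace, ...counts, verdict });
  }
  return summary;
}

// Made-up sample spans for illustration only.
const traceSummary = summarizeTraces([
  { trace: "trace-a", status: "success" },
  { trace: "trace-a", status: "failure" },
  { trace: "trace-a", status: "failure" },
  { trace: "trace-b", status: "success" },
]);
console.log(traceSummary);
```

Under this rollup the sample yields one mixed trace and one successful trace, even though two spans are tagged failure, which is the distinction between failure spans and failed runs drawn above.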
Latency outlier — Copilot Agent Prompt Clustering Analysis
Single gen_ai span 320f6ae05e19c650 on trace cdee6c6becf338409e2aa12d3c335f91 with span.duration = 1,402,449ms (~23.4 min). No gen_ai.response.finish_reasons attribute, so we cannot tell from telemetry alone whether this was a hit-the-ceiling agent loop, a stuck call, or expected long-running work.
View: https://github.sentry.io/explore/traces/trace/cdee6c6becf338409e2aa12d3c335f91
Token outlier — Test Quality Sentinel
Trace 50762cc694e924c0eb15d76837bf098d reports 834,846 total tokens (818,761 input / 16,085 output) across multiple gen_ai spans. Input-heavy ratio is a classic prompt-caching candidate.
View: https://github.sentry.io/explore/traces/trace/50762cc694e924c0eb15d76837bf098d
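The input-heavy claim can be checked with simple arithmetic on the reported totals. The sketch below uses the token counts from this trace; the `tokenProfile` helper is a hypothetical name introduced for illustration.

```javascript
// Token counts reported for trace 50762cc694e924c0eb15d76837bf098d.
const usage = { input: 818761, output: 16085 };

// Hypothetical helper: derives total, input share, and input:output ratio.
function tokenProfile({ input, output }) {
  const total = input + output;
  return {
    total,
    inputShare: input / total,
    inputToOutput: input / output,
  };
}

const profile = tokenProfile(usage);
console.log(profile.total); // 834846
console.log((profile.inputShare * 100).toFixed(1) + "% input"); // 98.1% input
console.log(profile.inputToOutput.toFixed(1) + ":1 input:output"); // 50.9:1 input:output
```

Roughly 98% of the tokens are input, about 51 input tokens for every output token, which is why caching or scope reduction on the prompt side is the likelier lever.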
Failure cluster — PR Sous Chef
Trace f7ba070a7742343f13dd9eb23ea7d3ba (2026-05-20 06:50-07:03 UTC) contains 4 spans tagged gh-aw.run.status:failure. Separately, trace 870f84211d4adcadbce47409aad81f73 shows the same workflow ran a ~19.4 min gen_ai span; correlation between the two is plausible but not provable without release/service.version to anchor versions.
View: https://github.sentry.io/explore/traces/trace/f7ba070a7742343f13dd9eb23ea7d3ba
Recommendations
- Emit gen_ai.response.finish_reasons with a default. In actions/setup/js/send_otlp_span.cjs:1795-1797, the attribute is only emitted when runtimeMetrics.stopReason is present or the agent times out. As a result, 0 of 5,078 spans surface this field in 24h, which makes truncation undetectable. Emit ["unknown"] or ["stop"] when no stop reason is parsed, so gen_ai.response.finish_reasons:length filters become meaningful.
- Restore Sentry release correlation. All 5,078 spans return release:null. service.version is set as a resource attribute at send_otlp_span.cjs:322 but is not appearing as Sentry release. Verify the resource→release mapping in the OTLP→Sentry ingest path; without it, deploy regressions cannot be attributed.
- Make OTLP status.code visible as span.status. Every span returns span.status:null, so reliability queries must currently rely on the gh-aw-specific gh-aw.run.status attribute. Either document this as the official failure signal, or fix the OTLP status code mapping so generic Sentry tooling works.
- Triage the 7 failing workflows. All seven (Daily Code Metrics, Sub-Issue Closer, Documentation Noob Tester, Safe Output Health Monitor, Contribution Check, PR Sous Chef, Daily SPDD Spec Planner) carry 4-5 failure-tagged spans each in 24h. Start with Safe Output Health Monitor since it gates the project's primary write surface.
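The first recommendation could look roughly like the sketch below. It is not the code at send_otlp_span.cjs:1795-1797, which is not reproduced in this report. The helper name `buildFinishReasonsAttribute`, its parameters, and the "timeout" label are assumptions made for illustration; only the attribute key and the runtimeMetrics.stopReason field come from the findings above. The return value uses the OTLP/JSON array encoding for attribute values.

```javascript
// Hypothetical helper (assumption): always returns a finish_reasons attribute,
// falling back to "unknown" when no stop reason was parsed.
function buildFinishReasonsAttribute(runtimeMetrics, timedOut) {
  let reason = "unknown"; // default so the attribute is never missing
  const parsed = runtimeMetrics ? runtimeMetrics.stopReason : undefined;
  if (typeof parsed === "string" && parsed !== "") {
    reason = parsed;
  } else if (timedOut) {
    reason = "timeout"; // assumed label; the real value may differ
  }
  return {
    key: "gen_ai.response.finish_reasons",
    value: { arrayValue: { values: [{ stringValue: reason }] } },
  };
}

// Example: no stop reason parsed and no timeout.
const fallbackAttribute = buildFinishReasonsAttribute(undefined, false);
console.log(JSON.stringify(fallbackAttribute));

// Example: a parsed stop reason is passed through unchanged.
const truncatedAttribute = buildFinishReasonsAttribute({ stopReason: "length" }, false);
console.log(JSON.stringify(truncatedAttribute));
```

The sketch defaults to "unknown" rather than "stop" so that a missing stop reason is not reported as a clean completion; whether that trade-off suits this project is a judgment call for the maintainers.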
Notes
- get_trace_details is not available in this Sentry MCP build (Error -32602: unknown tool); trace continuity was instead verified via list_events filtered by trace:<id>.
- errors and logs datasets returned no results for the 24h window. That is an explicit observability finding, not implicit health. If errors are expected (e.g., from a Sentry SDK in the agent runtime), the SDK may not be reporting.
- All findings about runtime outcome rely on gh-aw.run.status because span.status is null project-wide. This is reported as an instrumentation/mapping gap, not a runtime failure.
- gen_ai.usage.total_tokens is populated and queryable, so token budget tracking is functional.
- No OTLP exporter or auth failures were observed in the spans, errors, or logs datasets for the window.
Generated by 🚨 Daily Reliability Review · ● 6.7M · ◷