[reliability] Daily Reliability Review - 2026-05-20 #33517

Executive Summary

Window: last 24h ending 2026-05-20 (UTC). Sentry org github, project gh-aw. Spans dataset is healthy (5,078 spans; 4,998 gen_ai, 80 default). errors and logs datasets are empty for this project over the window — treat that as a calm signal, not necessarily a clean bill of health.

No timeouts, cancellations, or OTLP exporter failures were observed. 29 spans across 7 workflows carry gh-aw.run.status:failure. A single trace can contain both failure and success spans, so this figure counts per-span runtime outcomes, not failed runs.

Four instrumentation gaps recur across the window and limit what conclusions can be drawn from spans alone:

  • span.status is null on every span (the OTLP status.code → Sentry mapping is not applied to the spans dataset).
  • release is null on every span (resource service.version is not surfacing as Sentry release).
  • service.version is not queryable via has:.
  • gen_ai.response.finish_reasons is absent on all 5,078 spans (emit-side gate on runtimeMetrics.stopReason; see actions/setup/js/send_otlp_span.cjs:1795-1797).

Top Reliability Findings

| Priority | Workflow | Problem | Evidence | Next Action |
| --- | --- | --- | --- | --- |
| P1 | Daily Code Metrics and Trend Tracking Agent | Mixed-status run: trace contains both failure and success gen_ai spans | 5 failure spans / 24h, trace 2f5411f17313c95e911449cd270b8854 | Inspect agent loop; capture the failing tool/output for the failure spans |
| P2 | PR Sous Chef | 4 failure spans / 24h; also exhibits a long (~19.4 min) gen_ai span on a different trace | 4 failure spans (trace f7ba070a7742343f13dd9eb23ea7d3ba); 1,162s span on trace 870f84211d4adcadbce47409aad81f73 | Verify whether failures correlate with long-running iterations |
| P2 | Safe Output Health Monitor | 4 failure spans on one trace | trace 254dd9d49f37603cdc9825c5c1ef4f91 | Check the safe-output emission path; relevant because safe-outputs are this project's primary write surface |
| P2 | Sub-Issue Closer, Documentation Noob Tester, Contribution Check, Daily SPDD Spec Planner | 4 failure spans each | aggregate count of gh-aw.run.status:failure spans per workflow | Group failures by workflow and check whether they share a common failing engine or event |
| P3 | Test Quality Sentinel | Token outlier: 834,846 total tokens (818,761 input / 16,085 output) on one trace | trace 50762cc694e924c0eb15d76837bf098d | Add prompt caching or reduce scope; the high input:output ratio suggests caching would help |
| P3 | Copilot Agent Prompt Clustering Analysis | Latency outlier: single gen_ai span of 1,402,449ms (~23.4 min) | trace cdee6c6becf338409e2aa12d3c335f91 | Confirm whether this is an intended long agent loop or a single stuck LLM call; without finish_reasons, truncation (finish_reasons:length) cannot be diagnosed |
| P3 (gap) | All workflows | gen_ai.response.finish_reasons present on 0 of 5,078 spans | has:gen_ai.response.finish_reasons → no results | Verify runtimeMetrics.stopReason is being parsed; emit a default value (stop / unknown) so truncation is detectable |
| P3 (gap) | All workflows | release and service.version not queryable on spans (5,078 spans, all null) | has:release → 5,078 with release:null; has:service.version → no results | Confirm the resource attribute → Sentry release mapping; without it, regression-to-deploy correlation is blocked |
| P3 (gap) | All workflows | span.status null on every span | aggregate by span.status returns only null | The OTLP status.code set in send_otlp_span.cjs is not flowing through to the spans dataset; queries must rely on gh-aw.run.status instead |

Representative Traces

Mixed-status run — Daily Code Metrics and Trend Tracking Agent

Trace 2f5411f17313c95e911449cd270b8854 (2026-05-19 19:05-19:21 UTC). Across the trace, the agent emits multiple gen_ai spans; some carry gh-aw.run.status:success and others gh-aw.run.status:failure, indicating per-span runtime outcome rather than a single run-level verdict. Continuity by trace filter is intact (all spans share the trace id and the gh-aw.workflow.name attribute).

View: https://github.sentry.io/explore/traces/trace/2f5411f17313c95e911449cd270b8854
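The per-span versus per-run distinction can be made concrete with a small sketch. It assumes span rows have already been exported as plain objects; the `trace` field name, the sample rows, and the `classifyTraces` helper are all hypothetical, and only the gh-aw.run.status attribute comes from this report.

```javascript
// Hypothetical helper: groups span rows by trace and derives a trace-level verdict.
// Only "gh-aw.run.status" is a real attribute name; "trace" is an assumed field.
function classifyTraces(spans) {
  const byTrace = new Map();
  for (const span of spans) {
    const counts = byTrace.get(span.trace) ?? { success: 0, failure: 0 };
    const status = span["gh-aw.run.status"];
    if (status === "success" || status === "failure") counts[status] += 1;
    byTrace.set(span.trace, counts);
  }

  const result = {};
  for (const [trace, counts] of byTrace) {
    let verdict = "unknown";
    if (counts.failure > 0 && counts.success > 0) verdict = "mixed";
    else if (counts.failure > 0) verdict = "failure";
    else if (counts.success > 0) verdict = "success";
    result[trace] = { ...counts, verdict };
  }
  return result;
}

// Made-up sample rows, not data from the review window.
const sample = [
  { trace: "trace-a", "gh-aw.run.status": "success" },
  { trace: "trace-a", "gh-aw.run.status": "failure" },
  { trace: "trace-a", "gh-aw.run.status": "failure" },
  { trace: "trace-b", "gh-aw.run.status": "success" },
];
const verdicts = classifyTraces(sample);
console.log(verdicts);
```

In this sample, trace-a contributes two failure spans but is a single mixed-status run, which is why a failure-span count should not be read as a failed-run count.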

Latency outlier — Copilot Agent Prompt Clustering Analysis

Single gen_ai span 320f6ae05e19c650 on trace cdee6c6becf338409e2aa12d3c335f91 with span.duration = 1,402,449ms (~23.4 min). No gen_ai.response.finish_reasons attribute, so we cannot tell from telemetry alone whether this was a hit-the-ceiling agent loop, a stuck call, or expected long-running work.

View: https://github.sentry.io/explore/traces/trace/cdee6c6becf338409e2aa12d3c335f91
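For reference, the minute figures quoted in this report follow from a plain unit conversion. The `msToMinutes` helper below is hypothetical and only illustrates the arithmetic, applied to this span and to the 1,162s span reported for PR Sous Chef.

```javascript
// Hypothetical helper: converts a duration in milliseconds to minutes, one decimal place.
function msToMinutes(durationMs) {
  return (durationMs / 60000).toFixed(1);
}

const clusteringSpan = msToMinutes(1402449); // span 320f6ae05e19c650
const sousChefSpan = msToMinutes(1162 * 1000); // 1,162s span on the PR Sous Chef trace
console.log(clusteringSpan, sousChefSpan);
```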

Token outlier — Test Quality Sentinel

Trace 50762cc694e924c0eb15d76837bf098d reports 834,846 total tokens (818,761 input / 16,085 output) across multiple gen_ai spans. Input-heavy ratio is a classic prompt-caching candidate.

View: https://github.sentry.io/explore/traces/trace/50762cc694e924c0eb15d76837bf098d
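The input-heavy ratio can be quantified from the two token counts above. This is a minimal sketch; the `tokenProfile` helper is hypothetical, and how much caching would actually save depends on how much of the input repeats across calls, which the span data here does not show.

```javascript
// Hypothetical helper: summarizes how input-heavy a trace's token usage is.
// Note: outputTokens of 0 would yield Infinity for inputPerOutput.
function tokenProfile(inputTokens, outputTokens) {
  const total = inputTokens + outputTokens;
  return {
    total,
    inputShare: inputTokens / total, // fraction of all tokens that are input
    inputPerOutput: inputTokens / outputTokens, // input tokens per output token
  };
}

const profile = tokenProfile(818761, 16085);
console.log(profile.total); // 834846
console.log((profile.inputShare * 100).toFixed(1)); // 98.1
console.log(profile.inputPerOutput.toFixed(1)); // 50.9
```

Roughly 98% of the trace's tokens are input, about 51 input tokens for every output token.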

Failure cluster — PR Sous Chef

Trace f7ba070a7742343f13dd9eb23ea7d3ba (2026-05-20 06:50-07:03 UTC) contains 4 spans tagged gh-aw.run.status:failure. Separately, trace 870f84211d4adcadbce47409aad81f73 shows the same workflow ran a ~19.4 min gen_ai span; correlation between the two is plausible but not provable without release/service.version to anchor versions.

View: https://github.sentry.io/explore/traces/trace/f7ba070a7742343f13dd9eb23ea7d3ba

Recommendations

  1. Emit gen_ai.response.finish_reasons with a default. In actions/setup/js/send_otlp_span.cjs:1795-1797, the attribute is only emitted when runtimeMetrics.stopReason is present or the agent times out. As a result, 0 of 5,078 spans surfaced this field in the 24h window, which makes truncation undetectable. Emit ["unknown"] or ["stop"] when no stop reason is parsed, so that gen_ai.response.finish_reasons:length filters become meaningful.
  2. Restore Sentry release correlation. All 5,078 spans return release:null. service.version is set as a resource attribute at send_otlp_span.cjs:322 but is not appearing as Sentry release. Verify the resource→release mapping in the OTLP→Sentry ingest path; without it, deploy regressions cannot be attributed.
  3. Make OTLP status.code visible as span.status. Every span returns span.status:null, so reliability queries must currently rely on the gh-aw-specific gh-aw.run.status attribute. Either document this as the official failure signal, or fix the OTLP status code mapping so generic Sentry tooling works.
  4. Triage the 7 failing workflows. All seven (Daily Code Metrics, Sub-Issue Closer, Documentation Noob Tester, Safe Output Health Monitor, Contribution Check, PR Sous Chef, Daily SPDD Spec Planner) carry 4-5 failure-tagged spans each in 24h. Start with Safe Output Health Monitor since it gates the project's primary write surface.
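Recommendation 1 could look roughly like the sketch below. It is not the code in send_otlp_span.cjs: the `finishReasonsAttribute` helper and its `fallback` parameter are hypothetical, the timeout branch is omitted, and the only things taken from this report are the attribute name and the runtimeMetrics.stopReason field. The returned object uses the OTLP JSON key/value shape with an array of string values.

```javascript
// Hypothetical helper; an illustration of "always emit, with a default", not the project's code.
function finishReasonsAttribute(runtimeMetrics, fallback = "unknown") {
  const raw = runtimeMetrics?.stopReason;
  // Fall back when the stop reason is missing, non-string, or blank.
  const reason =
    typeof raw === "string" && raw.trim() !== "" ? raw.trim() : fallback;
  return {
    key: "gen_ai.response.finish_reasons",
    value: { arrayValue: { values: [{ stringValue: reason }] } },
  };
}

const withReason = finishReasonsAttribute({ stopReason: "length" });
const withoutReason = finishReasonsAttribute({});
console.log(JSON.stringify(withReason));
console.log(JSON.stringify(withoutReason));
```

With a default in place, every gen_ai span carries the attribute, so a query for truncated responses returns a meaningful empty result rather than an empty result caused by missing data.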

Notes

  • get_trace_details is not available in this Sentry MCP build (Error -32602: unknown tool); trace continuity was instead verified via list_events filtered by trace:<id>.
  • errors and logs datasets returned no results for the 24h window — that's an explicit observability finding, not implicit health. If errors are expected (e.g., from a Sentry SDK in the agent runtime), the SDK may not be reporting.
  • All findings about runtime outcome rely on gh-aw.run.status because span.status is null project-wide. This is reported as an instrumentation/mapping gap, not a runtime failure.
  • gen_ai.usage.total_tokens is populated and queryable, so token budget tracking is functional.
  • No OTLP exporter or auth failures were observed in the spans, errors, or logs datasets for the window.
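Until span.status is populated, downstream tooling that reads exported span rows might resolve an outcome with an explicit fallback, as in this sketch. The `spanOutcome` helper and the row shape are hypothetical; the two attribute names are the ones discussed in this report.

```javascript
// Hypothetical helper: prefers the generic span.status and falls back to
// the gh-aw-specific attribute, recording which source was used.
function spanOutcome(span) {
  const generic = span["span.status"];
  if (generic !== null && generic !== undefined) {
    return { source: "span.status", value: generic };
  }
  const custom = span["gh-aw.run.status"];
  if (custom !== null && custom !== undefined) {
    return { source: "gh-aw.run.status", value: custom };
  }
  return { source: "none", value: "unknown" };
}

// Made-up row mirroring the current state: span.status null, run status set.
const current = spanOutcome({ "span.status": null, "gh-aw.run.status": "failure" });
const neither = spanOutcome({});
console.log(current, neither);
```

Recording the source alongside the value makes it visible when the mapping gap is fixed, because results start reporting span.status instead of the fallback.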

Generated by 🚨 Daily Reliability Review · ● 6.7M ·

  • expires on May 22, 2026, 11:07 AM UTC
