[reliability] Daily Reliability Review - 2026-05-20 #33525

@github-actions

Executive Summary

Over the last 24h, github/gh-aw was healthy overall, with isolated failures and material observability gaps. Across 2,553 conclusion spans carrying gh-aw.run.status, 29 spans (1.14%) were marked failure, representing exactly 7 distinct failed runs (one per workflow), with no recurring per-workflow failure pattern. The errors and logs datasets in Sentry are empty for this window. No spans report gh-aw.run.status:timeout or cancelled.

The most actionable signal is instrumentation, not runtime: gen_ai.response.finish_reasons, service.version, release, and span.status are absent on all sampled spans, so it is not possible from Sentry alone to distinguish a genuine agent failure from a length-truncated response or a runner-timeout, nor to pin a regression to a gh-aw release.
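
To make the gap concrete, here is a minimal sketch of the kind of classification that becomes possible once finish reasons are present. The function name, the return labels, and the plain-object span shape are assumptions for illustration; nothing here is taken from gh-aw's code.

```javascript
// Hypothetical classifier (not part of gh-aw). `span` is assumed to be a plain
// object of span attributes keyed by attribute name.
function classifyFailure(span) {
  if (span["gh-aw.run.status"] !== "failure") {
    return "not_a_failure";
  }
  const reasons = span["gen_ai.response.finish_reasons"];
  if (!Array.isArray(reasons) || reasons.length === 0) {
    // Today's situation: the attribute is absent, so a truncated response
    // and a genuine agent failure look identical.
    return "unclassified";
  }
  if (reasons.includes("length")) {
    return "truncated_response";
  }
  return "other_failure";
}

// A failed span as it looks today, without finish reasons.
console.log(classifyFailure({ "gh-aw.run.status": "failure" }));
```

With the attribute missing, every failed span falls into the unclassified bucket, which matches the limitation described above.
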

Top Reliability Findings

| Priority | Workflow | Problem | Evidence | Next Action |
|---|---|---|---|---|
| P1 | All gh-aw spans | gen_ai.response.finish_reasons missing on every span (24h, 5,187 spans) | has:gen_ai.response.finish_reasons returns 0 results; emit-side only fires when runtimeMetrics.stopReason is parsed from agent stdio log (send_otlp_span.cjs:1795-1798) | Backfill stopReason for copilot/codex engines so finish-reason-based failure classification works |
| P1 | All gh-aw spans | service.version / release null on every span (5,187/5,187) | has:service.version → 0 results; has:release → 1 group, all null; send_otlp_span.cjs:319-324 only emits service.version when scopeVersion && scopeVersion !== "unknown" | Resolve a real version (git SHA or release tag) and pass as scopeVersion; map to Sentry release for regression correlation |
| P2 | Daily Code Metrics and Trend Tracking Agent | Failed run on claude engine, agent.conclusion 13.7 min | run 26119008282, trace 2f5411f17313c95e911449cd270b8854, gh-aw.agent.conclusion 819,746 ms | Inspect run logs; without finish_reasons cannot distinguish model truncation vs. runtime error |
| P2 | Safe Output Health Monitor | Failed run on claude engine, agent.conclusion 12.8 min | run 26143483059, trace 254dd9d49f37603cdc9825c5c1ef4f91, gh-aw.agent.conclusion 766,181 ms | Same: needs finish-reason or runner-side conclusion attribute |
| P2 | Documentation Noob Tester | Failed run on copilot engine, agent.conclusion 9.3 min | run 26142529471, trace a052ce52143244b41c0714e4331d9e68, gh-aw.agent.conclusion 559,485 ms | Inspect; copilot engine is over-represented (5/7 of failures) |
| P3 | Sentry errors + logs datasets | Empty for 24h | Both list_events queries on errors and logs returned 0 results | Inconclusive: gh-aw does not currently emit error events or logs to Sentry; consider mirroring exporter failures as Sentry events to surface them outside the trace surface |
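
The service.version finding can be sketched as follows. This is an illustration only, not the code in send_otlp_span.cjs: the helper names, the GH_AW_RELEASE_TAG variable, and the service.name value are assumptions, while GITHUB_SHA is the commit SHA that GitHub Actions exposes to jobs. The guard mirrors the condition quoted in the table.

```javascript
// Hypothetical helpers; the real emit-side logic may differ.
function resolveScopeVersion(env) {
  // Prefer an explicit release tag (GH_AW_RELEASE_TAG is an invented name),
  // then fall back to the commit SHA provided by GitHub Actions.
  const candidate = env.GH_AW_RELEASE_TAG || env.GITHUB_SHA || "";
  return candidate.trim() || "unknown";
}

function buildResourceAttributes(scopeVersion) {
  // OTLP/JSON attribute shape; the service.name value is an assumption.
  const attributes = [{ key: "service.name", value: { stringValue: "gh-aw" } }];
  // Same guard as quoted in the findings table: skip empty or "unknown" versions.
  if (scopeVersion && scopeVersion !== "unknown") {
    attributes.push({ key: "service.version", value: { stringValue: scopeVersion } });
  }
  return attributes;
}

const version = resolveScopeVersion(process.env);
console.log(buildResourceAttributes(version));
```

When neither variable is set, the resolver returns "unknown" and the attribute is omitted, which is consistent with the null service.version observed on every span.
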

Representative Traces

Verified continuity in each trace below — all expected conclusion sub-spans (agent.conclusion → detection.conclusion → safe_outputs.conclusion → conclusion.conclusion) share the same trace and gh-aw.run.id. Trace continuity is healthy; only the runtime outcome is impossible to fully characterise.

  • Safe Output Health Monitor (claude, failed): trace 254dd9d49f37603cdc9825c5c1ef4f91 · run §26143483059 · agent.conclusion 766,181 ms · 4 conclusion spans marked failure.
  • Documentation Noob Tester (copilot, failed): trace a052ce52143244b41c0714e4331d9e68 · run §26142529471 · agent.conclusion 559,485 ms.
  • Daily Code Metrics and Trend Tracking Agent (claude, failed): trace 2f5411f17313c95e911449cd270b8854 · run §26119008282 · agent.conclusion 819,746 ms.
  • Other failed runs (4 conclusion spans each, all gh-aw.run.status:failure): §26146345218 (PR Sous Chef), §26148162505 (Sub-Issue Closer), §26111831694 (Daily SPDD Spec Planner), §26155398474 (Contribution Check).
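
Because each failed run contributes several conclusion spans, a per-engine tally has to count runs rather than spans. The sketch below shows one way to do that; the span shape (runId, engine, status) is an assumption, and the sample covers only the three runs whose engine is stated in this report.

```javascript
// Hypothetical tally: count distinct failed runs per engine, not failed spans.
function countFailedRunsByEngine(spans) {
  const engineByRun = new Map();
  for (const span of spans) {
    if (span.status === "failure") {
      // Keyed by run ID, so repeated spans from the same run collapse to one entry.
      engineByRun.set(span.runId, span.engine);
    }
  }
  const counts = {};
  for (const engine of engineByRun.values()) {
    counts[engine] = (counts[engine] || 0) + 1;
  }
  return counts;
}

// Sample input: two spans from one run, plus one span each from two other runs.
const sampleSpans = [
  { runId: "26143483059", engine: "claude", status: "failure" },
  { runId: "26143483059", engine: "claude", status: "failure" },
  { runId: "26119008282", engine: "claude", status: "failure" },
  { runId: "26142529471", engine: "copilot", status: "failure" },
];
console.log(countFailedRunsByEngine(sampleSpans)); // { claude: 2, copilot: 1 }
```
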

Latency landscape (informational, not a regression)

Top span groups by count, 24h window (avg / max duration):

| Span | Count | Avg (ms) | Max (ms) |
|---|---|---|---|
| gh-aw.agent.conclusion | 359 | 277,567 | 1,554,868 |
| gh-aw.pre_activation.conclusion | 603 | 20,400 | 766,509 |
| gh-aw.detection.conclusion | 260 | 60,105 | 212,912 |
| gh-aw.safe_outputs.conclusion | 327 | 9,726 | 90,927 |
| gh-aw.upload_assets.conclusion | 9 | 36,663 | 48,142 |

The 25.9-min gh-aw.agent.conclusion max is a success (Copilot Agent Prompt Clustering Analysis, run §26157150065), so the top-end latency reflects legitimate long agentic work, not timeouts. Six of the seven longest agent.conclusion spans (>15 min) are gh-aw.run.status:success.
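
The minute figures quoted in this report follow directly from the millisecond durations. A small sketch of the conversion, with a hypothetical helper name:

```javascript
// Hypothetical helper: convert a span duration in milliseconds to minutes,
// rounded to one decimal place for display.
function msToMinutes(ms) {
  return (ms / 60000).toFixed(1);
}

// Durations taken from this report.
const durationsMs = {
  "max gh-aw.agent.conclusion": 1554868,
  "Daily Code Metrics and Trend Tracking Agent": 819746,
  "Safe Output Health Monitor": 766181,
  "Documentation Noob Tester": 559485,
};

for (const [label, ms] of Object.entries(durationsMs)) {
  console.log(`${label}: ${msToMinutes(ms)} min`);
}
```

This yields 25.9, 13.7, 12.8, and 9.3 minutes respectively, matching the values stated in the findings.
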

Recommendations

  1. Wire gen_ai.response.finish_reasons on every agent-job conclusion span, not only when an engine writes stop_reason to agent-stdio.log. For copilot/codex engines, derive length / tool_use / end_turn from runner outcome or set a sentinel unknown so length-truncation is queryable. Code site: actions/setup/js/send_otlp_span.cjs:1787-1799.
  2. Emit a real service.version (commit SHA or release tag) for OTLP resource attrs so Sentry can populate release and surface "this regression started at version X". Code site: actions/setup/js/send_otlp_span.cjs:319-324.
  3. Triage the 7 failed runs above against runner logs to confirm whether agent.conclusion durations >9 min represent genuine model failure or runner-side process termination; this validates whether recommendation #1 would have caught them.
  4. Add a periodic gh-aw.run.status:failure dashboard panel grouped by gh-aw.engine.id — current sample shows 5/7 of failures on copilot engine vs. 2/7 on claude, which is worth watching but not yet a confirmed regression.
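
Recommendation 1 could look roughly like the sketch below: always attach a finish-reasons attribute, falling back to a sentinel when no stop reason was parsed. The function names are hypothetical and this is not the code at the cited site; the attribute is encoded as an OTLP/JSON string array.

```javascript
// Hypothetical: normalise a parsed stop reason (possibly undefined) into a list.
function finishReasonsFor(stopReason) {
  if (typeof stopReason === "string" && stopReason.trim() !== "") {
    return [stopReason.trim()];
  }
  // Sentinel so spans without a parsed stop reason remain queryable.
  return ["unknown"];
}

// Build the span attribute in OTLP/JSON form (array of string values).
function finishReasonsAttribute(stopReason) {
  return {
    key: "gen_ai.response.finish_reasons",
    value: {
      arrayValue: {
        values: finishReasonsFor(stopReason).map((reason) => ({ stringValue: reason })),
      },
    },
  };
}

console.log(JSON.stringify(finishReasonsAttribute(undefined)));
console.log(JSON.stringify(finishReasonsAttribute("end_turn")));
```

With a sentinel in place, a has:gen_ai.response.finish_reasons query would match every agent-job conclusion span, and the unknown share itself becomes a measurable instrumentation metric.
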

Notes

  • Sentry MCP tool surface: search_events and get_trace_details are not available in this build of the MCP server; trace inspection used list_events with trace:<id> filter, which preserves full continuity verification.
  • Run-status mapping clarification: the prompt's checklist named gh_aw.workflow_name, but the emit-side attribute key is gh-aw.workflow.name (send_otlp_span.cjs:1105,1744). The dotted form is present on every span; the underscored form does not exist. Treat this as a docs/spec inconsistency, not an instrumentation gap.
  • span.status field is null on all sampled spans. OTLP status.code is set on the emit side (send_otlp_span.cjs:1725-1738), but Sentry's span-search surface does not appear to map it to the span.status column for this project. gh-aw.run.status is the reliable failure indicator on this backend.
  • No timeout or cancelled values seen on gh-aw.run.status in 24h. Either no such runs occurred, or agentConclusion / workflowRunConclusion never produced those raw values in the window (send_otlp_span.cjs:1720-1729).
  • The errors and logs datasets are empty for this window. This is an explicit observability finding, not a silent skip: gh-aw does not currently emit non-trace telemetry to Sentry.
  • Inconclusive runtime outcome: per the operating contract, the 7 failed runs are reported as confirmed failures + confirmed instrumentation gap (no finish-reason, no release pinning) rather than as confirmed timeouts. The cluster is small and one-shot per workflow, so it is not yet a recurring pattern.

Generated by 🚨 Daily Reliability Review. Expires on May 22, 2026, 12:17 PM UTC.
