[reliability] Daily Reliability Review - 2026-05-20 #33525

### Executive Summary

Overall health for the last 24h on `github/gh-aw` is **healthy with isolated failures and material observability gaps**. Across **2,553** conclusion spans carrying `gh-aw.run.status`, **29 spans (1.14%) were marked `failure`**. They map to exactly **7 distinct failed runs**, one per workflow, so there is no recurring per-workflow failure pattern. The `errors` and `logs` datasets in Sentry are **empty** for this window, and no spans report `gh-aw.run.status:timeout` or `cancelled`.
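
The headline ratio can be reproduced from raw span rows with a small sketch like the one below. The helper name `summarizeRunStatus` and the span shape (a flat object keyed by attribute name) are assumptions for illustration; only the attribute keys and the counts 29 and 2,553 come from this report.

```javascript
// Hypothetical helper: summarize conclusion spans by `gh-aw.run.status`.
// Assumes each span is a flat object keyed by attribute name.
function summarizeRunStatus(spans) {
  const withStatus = spans.filter(s => s["gh-aw.run.status"] !== undefined);
  const failed = withStatus.filter(s => s["gh-aw.run.status"] === "failure");
  // Several conclusion spans belong to one run, so count distinct run IDs.
  const failedRunIds = new Set(failed.map(s => s["gh-aw.run.id"]));
  const ratePct =
    withStatus.length === 0 ? 0 : (failed.length / withStatus.length) * 100;
  return {
    total: withStatus.length,
    failedSpans: failed.length,
    failedRuns: failedRunIds.size,
    failureRatePct: Number(ratePct.toFixed(2)),
  };
}

// Reproduce the reported ratio from the counts stated above.
const reportedRatePct = Number(((29 / 2553) * 100).toFixed(2));
console.log(reportedRatePct); // 1.14
```

Counting distinct `gh-aw.run.id` values rather than spans is what keeps a single failed run from being reported several times.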

The most actionable signal is **instrumentation**, not runtime: `gen_ai.response.finish_reasons`, `service.version`, `release`, and `span.status` are absent on **all** sampled spans. As a result, Sentry alone cannot distinguish a genuine agent failure from a length-truncated response or a runner timeout, and cannot pin a regression to a gh-aw release.

### Top Reliability Findings

| Priority | Workflow | Problem | Evidence | Next Action |
| --- | --- | --- | --- | --- |
| P1 | _All gh-aw spans_ | `gen_ai.response.finish_reasons` missing on every span (24h, 5,187 spans) | `has:gen_ai.response.finish_reasons` returns 0 results; emit-side only fires when `runtimeMetrics.stopReason` is parsed from agent stdio log (`send_otlp_span.cjs:1795-1798`) | Backfill `stopReason` for copilot/codex engines so finish-reason-based failure classification works |
| P1 | _All gh-aw spans_ | `service.version` / `release` null on every span (5,187/5,187) | `has:service.version` → 0 results; `has:release` → 1 group, all `null`; `send_otlp_span.cjs:319-324` only emits `service.version` when `scopeVersion && scopeVersion !== "unknown"` | Resolve a real version (git SHA or release tag) and pass as `scopeVersion`; map to Sentry `release` for regression correlation |
| P2 | Daily Code Metrics and Trend Tracking Agent | Failed run on claude engine, agent.conclusion 13.7 min | run [26119008282](https://github.com/github/gh-aw/actions/runs/26119008282), trace `2f5411f17313c95e911449cd270b8854`, `gh-aw.agent.conclusion` 819,746 ms | Inspect run logs; without `finish_reasons` cannot distinguish model truncation vs. runtime error |
| P2 | Safe Output Health Monitor | Failed run on claude engine, agent.conclusion 12.8 min | run [26143483059](https://github.com/github/gh-aw/actions/runs/26143483059), trace `254dd9d49f37603cdc9825c5c1ef4f91`, `gh-aw.agent.conclusion` 766,181 ms | Same: needs finish-reason or runner-side conclusion attribute |
| P2 | Documentation Noob Tester | Failed run on copilot engine, agent.conclusion 9.3 min | run [26142529471](https://github.com/github/gh-aw/actions/runs/26142529471), trace `a052ce52143244b41c0714e4331d9e68`, `gh-aw.agent.conclusion` 559,485 ms | Inspect; copilot engine is over-represented (5/7 of failures) |
| P3 | Sentry `errors` + `logs` datasets | Empty for 24h | Both `list_events` queries on `errors` and `logs` returned 0 results | Inconclusive — gh-aw does not currently emit error events or logs to Sentry; consider mirroring exporter failures as Sentry events to surface them outside the trace surface |
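
The `service.version` finding can be illustrated with a minimal sketch of a version resolver feeding the guard quoted above. Everything here is an assumption except the `scopeVersion && scopeVersion !== "unknown"` condition: `resolveScopeVersion`, `buildResourceAttributes`, the `GH_AW_RELEASE_TAG` variable, and the `service.name` value are hypothetical, while `GITHUB_SHA` is the commit SHA that GitHub Actions sets by default.

```javascript
// Hypothetical resolver: prefer a release tag, fall back to the commit SHA.
// GH_AW_RELEASE_TAG is an illustrative name, not a known gh-aw variable.
function resolveScopeVersion(env) {
  const candidates = [env.GH_AW_RELEASE_TAG, env.GITHUB_SHA];
  for (const value of candidates) {
    const trimmed = typeof value === "string" ? value.trim() : "";
    if (trimmed && trimmed !== "unknown") return trimmed;
  }
  return "unknown";
}

// Builds OTLP/JSON resource attributes. The guard mirrors the condition
// quoted in the findings table; the service name is an assumption.
function buildResourceAttributes(scopeVersion) {
  const attrs = [{ key: "service.name", value: { stringValue: "gh-aw" } }];
  if (scopeVersion && scopeVersion !== "unknown") {
    attrs.push({ key: "service.version", value: { stringValue: scopeVersion } });
  }
  return attrs;
}

const attrs = buildResourceAttributes(resolveScopeVersion(process.env));
console.log(attrs.map(a => a.key));
```

With a resolver of this kind, `service.version` would only be omitted when neither source is available, which makes the `null` case in Sentry an exception rather than the default.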

### Representative Traces

<details>
<summary>View representative traces</summary>

Continuity was verified for each trace below: all expected conclusion sub-spans (`agent.conclusion → detection.conclusion → safe_outputs.conclusion → conclusion.conclusion`) share the same `trace` and `gh-aw.run.id`. Trace continuity is healthy; only the runtime outcome cannot be fully characterised.

- **Safe Output Health Monitor (claude, failed)** — [trace](https://github.sentry.io/explore/traces/trace/254dd9d49f37603cdc9825c5c1ef4f91) · run [§26143483059](https://github.com/github/gh-aw/actions/runs/26143483059) · agent.conclusion 766,181 ms · 4 conclusion spans marked `failure`.
- **Documentation Noob Tester (copilot, failed)** — [trace](https://github.sentry.io/explore/traces/trace/a052ce52143244b41c0714e4331d9e68) · run [§26142529471](https://github.com/github/gh-aw/actions/runs/26142529471) · agent.conclusion 559,485 ms.
- **Daily Code Metrics and Trend Tracking Agent (claude, failed)** — [trace](https://github.sentry.io/explore/traces/trace/2f5411f17313c95e911449cd270b8854) · run [§26119008282](https://github.com/github/gh-aw/actions/runs/26119008282) · agent.conclusion 819,746 ms.
- Other failed runs (4 conclusion spans each, all `gh-aw.run.status:failure`): [§26146345218](https://github.com/github/gh-aw/actions/runs/26146345218) (PR Sous Chef), [§26148162505](https://github.com/github/gh-aw/actions/runs/26148162505) (Sub-Issue Closer), [§26111831694](https://github.com/github/gh-aw/actions/runs/26111831694) (Daily SPDD Spec Planner), [§26155398474](https://github.com/github/gh-aw/actions/runs/26155398474) (Contribution Check).
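
The continuity check described above could be scripted roughly as follows. This is a sketch under assumptions: the span shape (`name`, `trace`, and `gh-aw.run.id` fields) and the helper `checkTraceContinuity` are hypothetical, and the fully qualified span names are inferred from the `gh-aw.` prefix used in the latency table.

```javascript
// Expected conclusion sub-spans; the full names are inferred, not confirmed.
const EXPECTED_SPANS = [
  "gh-aw.agent.conclusion",
  "gh-aw.detection.conclusion",
  "gh-aw.safe_outputs.conclusion",
  "gh-aw.conclusion.conclusion",
];

// Hypothetical check: one trace ID, one run ID, and no missing sub-span.
function checkTraceContinuity(spans, expected = EXPECTED_SPANS) {
  const traceIds = new Set(spans.map(s => s.trace));
  const runIds = new Set(spans.map(s => s["gh-aw.run.id"]));
  const names = new Set(spans.map(s => s.name));
  const missing = expected.filter(name => !names.has(name));
  return {
    continuous: traceIds.size === 1 && runIds.size === 1 && missing.length === 0,
    missing,
    traceIds: [...traceIds],
    runIds: [...runIds],
  };
}

// Example input built from the Safe Output Health Monitor identifiers above.
const sample = EXPECTED_SPANS.map(name => ({
  name,
  trace: "254dd9d49f37603cdc9825c5c1ef4f91",
  "gh-aw.run.id": "26143483059",
}));
console.log(checkTraceContinuity(sample).continuous); // true
```

An empty result set is reported as not continuous, so a trace that returns no spans is never mistaken for a healthy one.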

</details>

<details>
<summary>View latency landscape (informational, not a regression)</summary>

Top span groups by count, 24h window (avg / max duration):

| Span | Count | Avg (ms) | Max (ms) |
| --- | ---: | ---: | ---: |
| gh-aw.agent.conclusion | 359 | 277,567 | 1,554,868 |
| gh-aw.pre_activation.conclusion | 603 | 20,400 | 766,509 |
| gh-aw.detection.conclusion | 260 | 60,105 | 212,912 |
| gh-aw.safe_outputs.conclusion | 327 | 9,726 | 90,927 |
| gh-aw.upload_assets.conclusion | 9 | 36,663 | 48,142 |

The 25.9-min `gh-aw.agent.conclusion` max is a **success** (`Copilot Agent Prompt Clustering Analysis`, run [§26157150065](https://github.com/github/gh-aw/actions/runs/26157150065)), so the top-end latency reflects legitimate long agentic work, not timeouts. Six of the seven longest agent.conclusion spans (>15 min) are `gh-aw.run.status:success`.
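
The minute figures quoted in this report follow from the millisecond durations by a plain unit conversion, sketched here with a hypothetical `msToMinutes` helper:

```javascript
// Convert a span duration in milliseconds to minutes, one decimal place.
function msToMinutes(ms) {
  return Number((ms / 60000).toFixed(1));
}

console.log(msToMinutes(1554868)); // 25.9 (longest agent.conclusion, a success)
console.log(msToMinutes(819746));  // 13.7 (Daily Code Metrics and Trend Tracking Agent)
console.log(msToMinutes(766181));  // 12.8 (Safe Output Health Monitor)
console.log(msToMinutes(559485));  // 9.3  (Documentation Noob Tester)
```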

</details>

### Recommendations

1. **Wire `gen_ai.response.finish_reasons` on every agent-job conclusion span**, not only when an engine writes `stop_reason` to `agent-stdio.log`. For copilot/codex engines, derive `length` / `tool_use` / `end_turn` from runner outcome or set a sentinel `unknown` so length-truncation is queryable. Code site: `actions/setup/js/send_otlp_span.cjs:1787-1799`.
2. **Emit a real `service.version`** (commit SHA or release tag) for OTLP resource attrs so Sentry can populate `release` and surface "this regression started at version X". Code site: `actions/setup/js/send_otlp_span.cjs:319-324`.
3. **Triage the 7 failed runs above against runner logs** to confirm whether agent.conclusion durations >9 min represent genuine model failure or runner-side process termination; this validates whether recommendation `#1` would have caught them.
4. **Add a periodic `gh-aw.run.status:failure` dashboard panel grouped by `gh-aw.engine.id`** — current sample shows 5/7 of failures on copilot engine vs. 2/7 on claude, which is worth watching but not yet a confirmed regression.
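
Recommendation 1 could take roughly the following shape. This is a minimal sketch, not the code in `send_otlp_span.cjs`: the helpers `resolveFinishReasons` and `finishReasonsAttribute` are hypothetical, and the only names taken from this report are `runtimeMetrics.stopReason`, the attribute key, and the `unknown` sentinel.

```javascript
// Hypothetical helper: use the parsed stop reason when one exists,
// otherwise fall back to a sentinel so the attribute is always present.
function resolveFinishReasons(runtimeMetrics) {
  const stopReason = runtimeMetrics && runtimeMetrics.stopReason;
  if (typeof stopReason === "string" && stopReason.trim() !== "") {
    return [stopReason.trim()];
  }
  return ["unknown"];
}

// Encodes the value as an OTLP/JSON string-array attribute.
function finishReasonsAttribute(runtimeMetrics) {
  const values = resolveFinishReasons(runtimeMetrics).map(reason => ({
    stringValue: reason,
  }));
  return {
    key: "gen_ai.response.finish_reasons",
    value: { arrayValue: { values } },
  };
}

console.log(JSON.stringify(finishReasonsAttribute({ stopReason: "end_turn" })));
console.log(JSON.stringify(finishReasonsAttribute(undefined)));
```

Emitting the sentinel unconditionally would make `has:gen_ai.response.finish_reasons` match every agent conclusion span, so spans with a missing stop reason become queryable instead of invisible.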

### Notes

<details>
<summary>View notes</summary>

- **Sentry MCP tool surface**: `search_events` and `get_trace_details` are not available in this build of the MCP server; trace inspection used `list_events` with a `trace:<id>` filter instead, which was sufficient for full continuity verification.
- **Attribute-name clarification**: the prompt's checklist named `gh_aw.workflow_name`, but the emit-side attribute key is `gh-aw.workflow.name` (`send_otlp_span.cjs:1105,1744`). The hyphenated, dotted key is present on every span; the underscored form does not exist. Treat this as a docs/spec inconsistency, not an instrumentation gap.
- **`span.status` field is null on all sampled spans**. OTLP `status.code` is set on the emit side (`send_otlp_span.cjs:1725-1738`), but Sentry's span-search surface does not appear to map it to the `span.status` column for this project. `gh-aw.run.status` is the reliable failure indicator on this backend.
- **No `timeout` or `cancelled` values** seen on `gh-aw.run.status` in 24h. Either no such runs occurred, or `agentConclusion` / `workflowRunConclusion` never produced those raw values in the window (`send_otlp_span.cjs:1720-1729`).
- **Datasets `errors` and `logs` are empty for this window**: this is an explicit observability finding, not a silent skip. gh-aw does not currently emit non-trace telemetry to Sentry.
- **Inconclusive runtime outcome**: per the operating contract, the 7 failed runs are reported as **confirmed failures + confirmed instrumentation gap** (no finish-reason, no release pinning) rather than as confirmed timeouts. The cluster is small and one-shot per workflow, so it is **not yet a recurring pattern**.
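
The "one-shot per workflow" observation can be checked mechanically by grouping failed runs by workflow. The sketch below assumes a flat span object keyed by attribute name, and `findRecurringFailures` is a hypothetical helper; the attribute keys are the ones named in this report.

```javascript
// Hypothetical helper: list workflows with more than one distinct failed run.
function findRecurringFailures(spans) {
  const runsByWorkflow = new Map();
  for (const span of spans) {
    if (span["gh-aw.run.status"] !== "failure") continue;
    const workflow = span["gh-aw.workflow.name"];
    if (!runsByWorkflow.has(workflow)) runsByWorkflow.set(workflow, new Set());
    // A Set collapses the several conclusion spans of one run into one entry.
    runsByWorkflow.get(workflow).add(span["gh-aw.run.id"]);
  }
  return [...runsByWorkflow]
    .filter(([, runs]) => runs.size > 1)
    .map(([workflow, runs]) => ({ workflow, failedRuns: runs.size }));
}

// Two failure spans from the same run do not count as a recurrence.
const oneShot = [
  { "gh-aw.workflow.name": "PR Sous Chef", "gh-aw.run.id": "26146345218", "gh-aw.run.status": "failure" },
  { "gh-aw.workflow.name": "PR Sous Chef", "gh-aw.run.id": "26146345218", "gh-aw.run.status": "failure" },
];
console.log(findRecurringFailures(oneShot).length); // 0
```

An empty result corresponds to the situation described in this report: failures exist, but no workflow has failed in more than one run within the window.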

**References:**
- [§26143483059](https://github.com/github/gh-aw/actions/runs/26143483059) — Safe Output Health Monitor (claude, failure)
- [§26119008282](https://github.com/github/gh-aw/actions/runs/26119008282) — Daily Code Metrics (claude, failure)
- [§26142529471](https://github.com/github/gh-aw/actions/runs/26142529471) — Documentation Noob Tester (copilot, failure)

</details>







> Generated by [🚨 Daily Reliability Review](https://github.com/github/gh-aw/actions/runs/26161474928) · ● 15.5M · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fdaily-reliability-review%22&type=issues)
> - [x] expires  on May 22, 2026, 12:17 PM UTC
