[reliability] Daily Reliability Review - 2026-05-20

### Executive Summary

Window: last 24h ending 2026-05-20 (UTC). Sentry org `github`, project `gh-aw`. Spans dataset is healthy (5,078 spans; 4,998 `gen_ai`, 80 `default`). `errors` and `logs` datasets are **empty** for this project over the window — treat that as a calm signal, not necessarily a clean bill of health.

No timeouts, cancellations, or OTLP exporter failures were observed. **29 spans across 7 workflows** carry `gh-aw.run.status:failure`. The same trace can contain both `failure` and `success` spans, so this represents per-span runtime outcome rather than a count of failed runs.
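
The difference between failure-tagged spans and traces that contain a failure can be made explicit by grouping spans per trace. The following is a minimal sketch, assuming spans have been exported as dictionaries keyed by `trace`, `gh-aw.workflow.name`, and `gh-aw.run.status`; the export shape, the `summarize_failures` helper, and the sample rows are hypothetical and are not output from the Sentry API.

```python
from collections import defaultdict


def summarize_failures(spans):
    """Return per-workflow failure-span counts and the set of mixed-status traces."""
    failure_spans = defaultdict(int)      # workflow name -> failure-tagged span count
    statuses_by_trace = defaultdict(set)  # trace id -> statuses seen on that trace

    for span in spans:
        status = span.get("gh-aw.run.status")
        statuses_by_trace[span["trace"]].add(status)
        if status == "failure":
            failure_spans[span["gh-aw.workflow.name"]] += 1

    # A trace is "mixed" when it carries both failure and success spans.
    mixed = {
        trace
        for trace, seen in statuses_by_trace.items()
        if {"failure", "success"} <= seen
    }
    return dict(failure_spans), mixed


# Hypothetical sample rows, for illustration only.
sample = [
    {"trace": "t1", "gh-aw.workflow.name": "wf-a", "gh-aw.run.status": "failure"},
    {"trace": "t1", "gh-aw.workflow.name": "wf-a", "gh-aw.run.status": "success"},
    {"trace": "t2", "gh-aw.workflow.name": "wf-b", "gh-aw.run.status": "failure"},
]
counts, mixed_traces = summarize_failures(sample)
print(counts, mixed_traces)
```

Here `t1` is reported as mixed while `t2` is not, which is the distinction the span count alone cannot show.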

Four instrumentation gaps recur across the window and limit what conclusions can be drawn from spans alone: `span.status` is null on every span (OTLP `status.code` → Sentry mapping not applied to spans dataset), `release` is null on every span (resource `service.version` not surfacing as Sentry `release`), `service.version` is not queryable via `has:`, and `gen_ai.response.finish_reasons` is absent on all 5,078 spans (emit-side gate on `runtimeMetrics.stopReason` — see `actions/setup/js/send_otlp_span.cjs:1795-1797`).

### Top Reliability Findings

| Priority | Workflow | Problem | Evidence | Next Action |
| --- | --- | --- | --- | --- |
| P1 | Daily Code Metrics and Trend Tracking Agent | Mixed-status run: trace contains both `failure` and `success` `gen_ai` spans | 5 failure spans / 24h, trace `2f5411f17313c95e911449cd270b8854` | Inspect agent loop; capture failing tool/output for the failure spans |
| P2 | PR Sous Chef | 4 `failure` spans/24h; also exhibits a long ~19m `gen_ai` span on a different trace | 4 failure spans (trace `f7ba070a7742343f13dd9eb23ea7d3ba`); 1,162s span on trace `870f84211d4adcadbce47409aad81f73` | Verify whether failures correlate with long-running iterations |
| P2 | Safe Output Health Monitor | 4 `failure` spans on one trace | trace `254dd9d49f37603cdc9825c5c1ef4f91` | Check safe-output emission path; relevant since safe-outputs are this project's primary write surface |
| P2 | Sub-Issue Closer, Documentation Noob Tester, Contribution Check, Daily SPDD Spec Planner | 4 `failure` spans each | aggregate of `gh-aw.run.status:failure` spans by workflow | Group failures by workflow to see whether they share a common failing engine/event |
| P3 | Test Quality Sentinel | Token outlier: 834,846 total tokens (818,761 input / 16,085 output) on one trace | trace `50762cc694e924c0eb15d76837bf098d` | Add prompt caching or scope reduction; high input:output ratio suggests caching wins |
| P3 | Copilot Agent Prompt Clustering Analysis | Latency outlier: single `gen_ai` span 1,402,449ms (~23.4 min) | trace `cdee6c6becf338409e2aa12d3c335f91` | Confirm whether this is an intended long agent loop or a single stuck LLM call; with `gen_ai.response.finish_reasons` absent, truncation cannot be diagnosed from telemetry |
| P3 (gap) | All workflows | `gen_ai.response.finish_reasons` present on **0 of 5,078** spans | `has:gen_ai.response.finish_reasons` → no results | Verify `runtimeMetrics.stopReason` is being parsed; emit a default value (`stop` / `unknown`) so truncation is detectable |
| P3 (gap) | All workflows | `release` and `service.version` not queryable on spans (5,078 spans, all null) | `has:release` → 5,078 with `release:null`; `has:service.version` → no results | Confirm resource attribute → Sentry release mapping; without it, regression-to-deploy correlation is blocked |
| P3 (gap) | All workflows | `span.status` null on every span | aggregate by `span.status` returns only `null` | OTLP `status.code` set in `send_otlp_span.cjs` is not flowing through to the spans dataset; queries must rely on `gh-aw.run.status` instead |
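
As a consistency check, the per-workflow failure counts in the table can be tallied against the summary figure of 29 spans across 7 workflows. This sketch only re-adds the numbers reported above; the dictionary is hand-transcribed from the table, not queried from Sentry.

```python
# Failure-tagged span counts per workflow, transcribed from the findings table.
failure_spans = {
    "Daily Code Metrics and Trend Tracking Agent": 5,
    "PR Sous Chef": 4,
    "Safe Output Health Monitor": 4,
    "Sub-Issue Closer": 4,
    "Documentation Noob Tester": 4,
    "Contribution Check": 4,
    "Daily SPDD Spec Planner": 4,
}

total = sum(failure_spans.values())
print(f"{total} failure spans across {len(failure_spans)} workflows")
# prints "29 failure spans across 7 workflows"
```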

### Representative Traces

<details>
<summary>Mixed-status run — Daily Code Metrics and Trend Tracking Agent</summary>

Trace `2f5411f17313c95e911449cd270b8854` (2026-05-19 19:05-19:21 UTC). Across the trace, the agent emits multiple `gen_ai` spans; some carry `gh-aw.run.status:success` and others `gh-aw.run.status:failure`, indicating per-span runtime outcome rather than a single run-level verdict. Continuity by `trace` filter is intact (all spans share the trace id and the `gh-aw.workflow.name` attribute).

View: https://github.sentry.io/explore/traces/trace/2f5411f17313c95e911449cd270b8854

</details>

<details>
<summary>Latency outlier — Copilot Agent Prompt Clustering Analysis</summary>

Single `gen_ai` span `320f6ae05e19c650` on trace `cdee6c6becf338409e2aa12d3c335f91` with `span.duration` = 1,402,449ms (~23.4 min). No `gen_ai.response.finish_reasons` attribute, so we cannot tell from telemetry alone whether this was a hit-the-ceiling agent loop, a stuck call, or expected long-running work.

View: https://github.sentry.io/explore/traces/trace/cdee6c6becf338409e2aa12d3c335f91

</details>

<details>
<summary>Token outlier — Test Quality Sentinel</summary>

Trace `50762cc694e924c0eb15d76837bf098d` reports 834,846 total tokens (818,761 input / 16,085 output) across multiple `gen_ai` spans. Input-heavy ratio is a classic prompt-caching candidate.
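
The input-heavy ratio can be quantified directly from the two reported token counts. A small sketch of the arithmetic, using only the figures above (variable names are illustrative):

```python
input_tokens = 818_761
output_tokens = 16_085
total_tokens = input_tokens + output_tokens  # matches the reported 834,846

input_share = input_tokens / total_tokens       # fraction of tokens that are input
input_to_output = input_tokens / output_tokens  # input tokens per output token

print(f"total={total_tokens:,} input_share={input_share:.1%} ratio={input_to_output:.1f}:1")
# prints "total=834,846 input_share=98.1% ratio=50.9:1"
```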

View: https://github.sentry.io/explore/traces/trace/50762cc694e924c0eb15d76837bf098d

</details>

<details>
<summary>Failure cluster — PR Sous Chef</summary>

Trace `f7ba070a7742343f13dd9eb23ea7d3ba` (2026-05-20 06:50-07:03 UTC) contains 4 spans tagged `gh-aw.run.status:failure`. Separately, trace `870f84211d4adcadbce47409aad81f73` shows the same workflow ran a ~19.4 min `gen_ai` span; correlation between the two is plausible but not provable without `release`/`service.version` to anchor versions.

View: https://github.sentry.io/explore/traces/trace/f7ba070a7742343f13dd9eb23ea7d3ba

</details>

### Recommendations

1. **Emit `gen_ai.response.finish_reasons` with a default.** In `actions/setup/js/send_otlp_span.cjs:1795-1797`, the attribute is only emitted when `runtimeMetrics.stopReason` is present or the agent times out. As a result, 0/5,078 spans surface this field in 24h, which makes truncation undetectable. Emit `["unknown"]` or `["stop"]` when no stop reason is parsed, so `:length` filters become meaningful.
2. **Restore Sentry `release` correlation.** All 5,078 spans return `release:null`. `service.version` is set as a resource attribute at `send_otlp_span.cjs:322` but is not appearing as Sentry `release`. Verify the resource→release mapping in the OTLP→Sentry ingest path; without it, deploy regressions cannot be attributed.
3. **Make OTLP `status.code` visible as `span.status`.** Every span returns `span.status:null`, so reliability queries must currently rely on the gh-aw-specific `gh-aw.run.status` attribute. Either document this as the official failure signal, or fix the OTLP status code mapping so generic Sentry tooling works.
4. **Triage the 7 failing workflows.** All seven (`Daily Code Metrics`, `Sub-Issue Closer`, `Documentation Noob Tester`, `Safe Output Health Monitor`, `Contribution Check`, `PR Sous Chef`, `Daily SPDD Spec Planner`) carry 4-5 failure-tagged spans each in 24h. Start with `Safe Output Health Monitor` since it gates the project's primary write surface.
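
Recommendation 1 amounts to replacing a conditional emit with an unconditional one that falls back to a sentinel value. The sketch below illustrates that logic in Python rather than the emitter's JavaScript; `finish_reasons_attribute` and `DEFAULT_FINISH_REASON` are hypothetical names, and the actual gate in `send_otlp_span.cjs` may be structured differently.

```python
from typing import Optional

# Assumed sentinel; the recommendation suggests either "unknown" or "stop".
DEFAULT_FINISH_REASON = "unknown"


def finish_reasons_attribute(stop_reason: Optional[str]) -> dict:
    """Always return the attribute, falling back to a sentinel when nothing was parsed."""
    reason = stop_reason.strip() if stop_reason else ""
    return {"gen_ai.response.finish_reasons": [reason or DEFAULT_FINISH_REASON]}


print(finish_reasons_attribute("length"))
print(finish_reasons_attribute(None))
```

Choosing `unknown` over `stop` as the default would keep "no stop reason parsed" distinguishable from a genuine normal completion, at the cost of one extra value for dashboards to handle.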

### Notes

- `get_trace_details` is **not available** in this Sentry MCP build (`Error -32602: unknown tool`); trace continuity was instead verified via `list_events` filtered by `trace:<id>`.
- `errors` and `logs` datasets returned **no results** for the 24h window — that's an explicit observability finding, not implicit health. If errors are expected (e.g., from a Sentry SDK in the agent runtime), the SDK may not be reporting.
- All findings about runtime outcome rely on `gh-aw.run.status` because `span.status` is null project-wide. This is reported as an instrumentation/mapping gap, not a runtime failure.
- `gen_ai.usage.total_tokens` is populated and queryable, so token budget tracking is functional.
- No OTLP exporter or auth failures were observed in the spans, errors, or logs datasets for the window.
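
Until the `span.status` mapping is fixed, analysis code can prefer the generic field and fall back to `gh-aw.run.status` when it is null, while recording which signal was used. A minimal sketch, assuming spans are available as dictionaries; `resolve_outcome` is a hypothetical helper, not part of gh-aw or Sentry.

```python
def resolve_outcome(span: dict) -> tuple:
    """Return (outcome, source), preferring span.status over gh-aw.run.status."""
    status = span.get("span.status")
    if status is not None:
        return status, "span.status"
    run_status = span.get("gh-aw.run.status")
    if run_status is not None:
        return run_status, "gh-aw.run.status"
    # Neither signal is present on this span.
    return "unknown", "none"


# With span.status null project-wide, every span currently takes the fallback path.
print(resolve_outcome({"span.status": None, "gh-aw.run.status": "failure"}))
```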

**References:**
- [§26158216534](https://github.com/github/gh-aw/actions/runs/26158216534)

> Generated by [🚨 Daily Reliability Review](https://github.com/github/gh-aw/actions/runs/26158216534) · ● 6.7M · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fdaily-reliability-review%22&type=issues)
> - [x] expires on May 22, 2026, 11:07 AM UTC
