fix(otlp): always emit gen_ai.response.finish_reasons; use GITHUB_SHA as service.version fallback#33528
Conversation
…SHA as service.version fallback - gen_ai.response.finish_reasons: always present on agent conclusion spans. Previously omitted when the engine didn't write a `stop_reason` to agent-stdio.log (copilot, codex). Now falls back to "unknown" so length-truncation is always queryable in Sentry/dashboards. - service.version: add GITHUB_SHA as a final fallback in both the setup and conclusion span paths. When no gh-aw release tag is available the commit SHA is used, enabling Sentry to populate `release` and surface per-commit regressions. Addresses P1 findings from the 2026-05-20 Daily Reliability Review. Co-authored-by: mnkiefer <8320933+mnkiefer@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Updates OTLP span emission to improve reliability telemetry completeness by ensuring gen_ai.response.finish_reasons is always present on agent-related spans and by providing a deterministic service.version fallback when a gh-aw version isn’t available.
Changes:
- Always emit
gen_ai.response.finish_reasonson agent spans, defaulting to"unknown"whenstop_reasonis not available (and"timeout"for timed-out runs). - Fall back to
GITHUB_SHAforservice.version/scope.versionin both job setup and job conclusion spans when gh-aw version sources are absent. - Update and add tests covering the new fallbacks and the “unknown” finish-reason behavior.
Show a summary per file
| File | Description |
|---|---|
| actions/setup/js/send_otlp_span.cjs | Adds GITHUB_SHA version fallback for setup/conclusion payload scope/resource versioning; ensures finish reasons are always emitted for agent jobs. |
| actions/setup/js/send_otlp_span.test.cjs | Adds/updates tests validating service.version fallback to GITHUB_SHA and finish_reasons=["unknown"] behavior when stop_reason is missing. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 2/2 changed files
- Comments generated: 0
|
✅ PR Code Quality Reviewer completed the code quality review. No action needed: PR #33528 was already merged before this review. Code quality is excellent - no issues found. Changes properly address P1 observability gaps with comprehensive test coverage. |
|
🧪 Test Quality Sentinel completed test quality analysis. |
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR #33528 does not have the 'implementation' label and has 0 new lines of code in default business logic directories (≤100 threshold). |
|
✅ PR Code Quality Reviewer completed the code quality review. |
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR #33528 does not have the 'implementation' label and has 0 new lines of code in default business logic directories (threshold: 100). |
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /diagnose and /tdd based on this being an observability bug fix with strong test coverage.
Key Themes
Strengths:
- ✅ Root cause properly addressed: The fix directly tackles the conditional emission that left 5,187 spans without
finish_reasons - ✅ Sentry regression correlation: Adding GITHUB_SHA fallback provides the missing commit anchor
- ✅ Test coverage is excellent: New tests for GITHUB_SHA fallback on both setup and conclusion spans
- ✅ Test names are descriptive: Clear intent (e.g., "falls back to GITHUB_SHA for service.version when no gh-aw version is available")
- ✅ Regression prevention: Updated existing test to verify "unknown" is emitted, preventing future regressions
Opportunities:
- Sentinel value clarity:
"unknown"is generic — consider"no_result_entry"to make dashboards more actionable - Edge case testing: Add test for when GITHUB_SHA itself is unavailable (local dev, non-GitHub runners)
- Sentry readability: Consider truncating GITHUB_SHA to 7-8 chars for cleaner Sentry release displays
- Test documentation: Add comments distinguishing agent vs conclusion span assertions in the updated test
/diagnose Highlights
This fix follows the discipline loop well:
- ✅ Reproduce: Daily reliability review identified the gap (5,187 missing spans)
- ✅ Root cause: Copilot/Codex engines don't emit result entries
- ✅ Fix: Changed conditional logic to always-emit with sentinel
- ✅ Regression test: Tests now verify the attribute is always present
One gap: No test for the final fallback ("unknown") when all env vars are missing — this would complete the regression coverage.
/tdd Highlights
Test structure is clean and follows best practices:
- ✅ Descriptive names that read as specifications
- ✅ Arrange-Act-Assert structure is clear
- ✅ Tests verify both positive and negative cases
⚠️ Missing edge case: GITHUB_SHA unavailable (see inline comment)
Verdict
APPROVED — This is a solid observability fix with excellent test coverage. The inline suggestions are refinements, not blockers. The fix directly addresses the P1 gaps and provides strong regression protection.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · ● 4.5M
| const agentFinishAttr = agentSpan.attributes.find(a => a.key === "gen_ai.response.finish_reasons"); | ||
| expect(agentFinishAttr).toBeDefined(); | ||
| expect(agentFinishAttr.value.arrayValue.values).toEqual([{ stringValue: "unknown" }]); | ||
|
|
There was a problem hiding this comment.
[/tdd] The test verifies that gen_ai.response.finish_reasons is present with value ["unknown"], but it does not verify the conclusion span behavior when stop_reason is absent.
The PR description mentions this affected "5/7 of the failed runs" — the test should validate both the agent span and the conclusion span independently to ensure they both emit the attribute. Currently, lines 3084-3087 test the conclusion span, but consider adding a comment clarifying which span is under test.
Suggestion: Add an explicit assertion comment:
// Agent span should always emit finish_reasons, even when stop_reason is absent
const agentFinishAttr = agentSpan.attributes.find(a => a.key === "gen_ai.response.finish_reasons");
// Conclusion span should mirror the same behavior
const conclusionFinishAttr = conclusionSpan.attributes.find(a => a.key === "gen_ai.response.finish_reasons");| // use "unknown" as a sentinel for engines (e.g. copilot, codex) that do not emit | ||
| // a result entry at all, so that gen_ai.response.finish_reasons is always present | ||
| // and length-truncation is always queryable in Sentry/dashboards. | ||
| const effectiveStopReason = runtimeMetrics.stopReason || (isAgentTimedOut ? "timeout" : "unknown"); |
There was a problem hiding this comment.
[/diagnose] The fix correctly addresses the root cause by making gen_ai.response.finish_reasons always present. However, the sentinel value "unknown" is now conflated with two distinct scenarios:
- Engines that never emit a result entry (Copilot, Codex)
- Edge cases where the span is emitted before the agent writes a result entry
Consider: Would "no_result_entry" be a more specific sentinel than "unknown"? This would make dashboards more actionable — "unknown" could mean many things, while "no_result_entry" explicitly identifies the Copilot/Codex engine limitation.
Alternative: Document in a comment that "unknown" specifically means "engine did not emit result entry", distinguishing it from other potential unknown states.
| const readFileSpy = vi.spyOn(fs, "readFileSync").mockImplementation(() => { | ||
| throw Object.assign(new Error("ENOENT"), { code: "ENOENT" }); | ||
| }); | ||
|
|
There was a problem hiding this comment.
[/tdd] This test validates the GITHUB_SHA fallback by mocking readFileSync to simulate missing aw_info.json. Excellent regression coverage!
Edge case to consider: What happens when GITHUB_SHA itself is undefined or empty? This could occur in local development or non-GitHub-hosted runners.
Suggestion: Add a test case:
it("falls back to 'unknown' for service.version when GITHUB_SHA is also unavailable", async () => {
const mockFetch = vi.fn().mockResolvedValue({ ok: true, status: 200, statusText: "OK" });
vi.stubGlobal("fetch", mockFetch);
process.env.GH_AW_OTLP_ENDPOINTS = JSON.stringify([{ url: "(traces.example.com/redacted)" }]);
delete process.env.GITHUB_SHA;
delete process.env.GH_AW_INFO_VERSION;
delete process.env.GH_AW_INFO_CLI_VERSION;
const readFileSpy = vi.spyOn(fs, "readFileSync").mockImplementation(() => {
throw Object.assign(new Error("ENOENT"), { code: "ENOENT" });
});
await sendJobConclusionSpan("gh-aw.job.conclusion");
readFileSpy.mockRestore();
const body = JSON.parse(mockFetch.mock.calls[0][1].body);
const resourceAttrs = body.resourceSpans[0].resource.attributes;
expect(resourceAttrs).toContainEqual({ key: "service.version", value: { stringValue: "unknown" } });
});This ensures the fallback chain is fully tested.
| @@ -1627,7 +1627,7 @@ async function sendJobConclusionSpan(spanName, options = {}) { | |||
| const effectiveTokens = rawET ? parseInt(rawET, 10) : NaN; | |||
There was a problem hiding this comment.
[/diagnose] The GITHUB_SHA fallback correctly provides a commit anchor for Sentry. One observability enhancement to consider:
Current behavior: GITHUB_SHA is a 40-character full commit hash. Sentry release identifiers are typically shorter for readability in dashboards.
Suggestion: Consider truncating to the first 7-8 characters when using GITHUB_SHA as a fallback:
const gitShaFallback = process.env.GITHUB_SHA?.substring(0, 8);
const version =
awInfo.agent_version || awInfo.version ||
process.env.GH_AW_INFO_VERSION ||
awInfo.cli_version ||
process.env.GH_AW_INFO_CLI_VERSION ||
gitShaFallback ||
"unknown";This would make Sentry releases more readable while still providing a unique commit anchor. However, full SHAs are also valid — document the decision either way.
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
|
🧪 Test Quality Sentinel completed test quality analysis. |
There was a problem hiding this comment.
Code Quality Review
This PR delivers clean observability improvements with excellent test coverage. Both changes address real gaps in telemetry data collection.
Strengths:
- Clear problem statement with concrete impact (5,187 spans missing finish-reason)
- Comprehensive test coverage for both changes
- Consistent fallback pattern applied across setup and conclusion spans
- Well-documented rationale in comments
No blocking issues identified. The code is production-ready.
Note: Reviewed post-merge as PR was already merged at review time.
🔎 Code quality review by PR Code Quality Reviewer · ● 2.6M
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /diagnose and /tdd based on this P1 observability bug fix.
Key Themes
✅ Root cause properly addressed: The fix targets the actual problem (Copilot/Codex not emitting type: "result" entries) rather than symptoms
✅ Strong test coverage: Added 2 new tests for GITHUB_SHA fallback behavior and updated existing tests to reflect new "always present" behavior for finish_reasons
✅ Clear instrumentation: Inline comments explain the priority fallback chain and why "unknown" exists as a sentinel value
Positive Highlights
- ✅ Test naming is descriptive and reads like specifications (e.g., "emits gen_ai.response.finish_reasons=[unknown] on the agent span when stop_reason is absent")
- ✅ Assertions are precise — using
.find()and.toBeDefined()rather than indirect checks - ✅ Both code paths (setup span + conclusion span) are validated for the
GITHUB_SHAfallback - ✅ The PR description clearly documents the impact (5,187 spans affected) and the fix priority
Minor Suggestion (Optional)
I posted one inline comment suggesting a priority order validation test to document the stopReason → timeout → unknown fallback chain in executable form. This is a nice-to-have, not a blocker — the current tests already cover each case individually.
/diagnose Perspective
This fix demonstrates the disciplined debugging loop well:
- Reproduce: Identified 5,187 spans missing the attribute
- Minimize: Narrowed to Copilot/Codex engines specifically
- Hypothesize: Engines don't emit
type: "result"entries - Fix: Add
"unknown"sentinel as final fallback - Regression test: Updated existing test + added new
GITHUB_SHAtests
/tdd Perspective
Test quality is excellent:
- Tests follow Arrange/Act/Assert pattern consistently
- Mock setup is clear and isolated per test case
- Assertions check both the attribute presence and its exact structure
- Test names are self-documenting
Verdict
No blocking issues. This is a well-executed observability fix with strong test coverage and clear documentation. The optional suggestion is a refinement, not a requirement. Ready to merge once the [WIP] tag is removed.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · ● 3.4M
| }); | ||
|
|
||
| it("omits gen_ai.response.finish_reasons from the agent span when stop_reason is absent in agent-stdio.log", async () => { | ||
| it("emits gen_ai.response.finish_reasons=[unknown] on the agent span when stop_reason is absent from agent-stdio.log", async () => { |
There was a problem hiding this comment.
[/tdd] Consider adding a test that explicitly validates the documented priority order for finish_reasons: stopReason → "timeout" → "unknown". Currently, each case is tested in isolation, but the fallback chain isn't verified in a single test.
Example:
it("applies finish_reasons priority: stopReason → timeout → unknown", async () => {
const mockFetch = vi.fn().mockResolvedValue({ ok: true, status: 200, statusText: "OK" });
vi.stubGlobal("fetch", mockFetch);
process.env.GH_AW_OTLP_ENDPOINTS = JSON.stringify([{ url: "(traces.example.com/redacted)" }]);
// Case 1: stop_reason present (highest priority)
const stdio1 = JSON.stringify({ type: "result", result: { stop_reason: "end_turn" } }) + "\n";
fs.writeFileSync("/tmp/gh-aw/test/agent-stdio.log", stdio1, "utf-8");
await sendJobConclusionSpan("gh-aw.job.conclusion");
let finishAttr = JSON.parse(mockFetch.mock.calls[0][1].body)
.resourceSpans[0].scopeSpans[0].spans[0].attributes
.find(a => a.key === "gen_ai.response.finish_reasons");
expect(finishAttr.value.arrayValue.values).toEqual([{ stringValue: "end_turn" }]);
// Case 2: timeout (no stop_reason, agent timed out)
fs.writeFileSync("/tmp/gh-aw/test/agent-stdio.log", "", "utf-8");
process.env.GITHUB_AW_AGENT_TIMEOUT_TRIGGERED = "true";
await sendJobConclusionSpan("gh-aw.job.conclusion");
finishAttr = JSON.parse(mockFetch.mock.calls[1][1].body)
.resourceSpans[0].scopeSpans[0].spans[0].attributes
.find(a => a.key === "gen_ai.response.finish_reasons");
expect(finishAttr.value.arrayValue.values).toEqual([{ stringValue: "timeout" }]);
// Case 3: unknown (no stop_reason, not timed out)
delete process.env.GITHUB_AW_AGENT_TIMEOUT_TRIGGERED;
await sendJobConclusionSpan("gh-aw.job.conclusion");
finishAttr = JSON.parse(mockFetch.mock.calls[2][1].body)
.resourceSpans[0].scopeSpans[0].spans[0].attributes
.find(a => a.key === "gen_ai.response.finish_reasons");
expect(finishAttr.value.arrayValue.values).toEqual([{ stringValue: "unknown" }]);
});This would serve as executable documentation of the fallback chain and catch regressions if the priority order ever changes unintentionally.
🧪 Test Quality Sentinel ReportTest Quality Score: 90/100✅ Excellent test quality
Test Classification Details
AnalysisAll three test changes in this PR are high-quality design tests that verify behavioral contracts: ✅ Test 1: GITHUB_SHA fallback for service.version (setup span)What design invariant does this test enforce? What would break if deleted? Why it's a design test: ✅ Test 2: gen_ai.response.finish_reasons always presentWhat design invariant does this test enforce? What would break if deleted? Why it's a design test: ✅ Test 3: GITHUB_SHA fallback for service.version (conclusion span)What design invariant does this test enforce? What would break if deleted? Why it's a design test: Test Inflation NoteWhile this exceeds the 2:1 threshold (resulting in a 10-point penalty), the inflation is contextually justified:
Language SupportTests analyzed:
Verdict
📖 Understanding Test ClassificationsDesign Tests (High Value) verify what the system does:
Implementation Tests (Low Value) verify how the system does it:
Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.
|
Two P1 observability gaps identified in the 2026-05-20 Daily Reliability Review left 5,187 spans with no finish-reason classification and no version anchor for regression correlation.
gen_ai.response.finish_reasons— always present on agent spansPreviously emitted only when the engine wrote a
type: "result"entry toagent-stdio.log. Copilot and Codex do not emit this entry, so 5/7 of the failed runs had no finish-reason at all — impossible to distinguish model truncation from runner termination.Priority order is now: engine
stop_reason→"timeout"(timed-out conclusion) →"unknown"(sentinel for engines that never emit a result entry).service.version—GITHUB_SHAas final fallbackBoth the setup and conclusion span paths fell through to
"unknown"when no gh-aw release tag was available;buildOTLPResourceAttributesexplicitly skips"unknown", so Sentry'sreleasefield was never populated.GITHUB_SHAis now the last resort before"unknown", giving Sentry a commit anchor for every run.