📡 OTel Instrumentation Improvement: Emit error attributes for partial failures and cancelled conclusions
Analysis Date: 2026-04-21
Priority: High
Effort: Small (< 2h)
Problem
In sendJobConclusionSpan (actions/setup/js/send_otlp_span.cjs lines 704–754), error-enrichment attributes (gh-aw.error.count, gh-aw.error.messages) and OTel exception span events are gated behind isAgentFailure, which is only true when GH_AW_AGENT_CONCLUSION is "failure" or "timed_out".
Two important scenarios produce silent or misleading spans:
- Partial failures on "success" conclusions: When the agent concludes as "success" but agent_output.json contains a non-empty errors[] (items the agent attempted but failed to create), the conclusion span records statusCode=OK with zero error attributes. A DevOps engineer reviewing the trace sees a green span with no indication that any items failed.
- Cancelled runs: When GH_AW_AGENT_CONCLUSION === "cancelled", isAgentFailure is false, so statusCode=1 (OK) is emitted. Cancelled runs are visually indistinguishable from fully successful runs in Grafana/Honeycomb/Datadog, making it impossible to alert on or trend cancellation rates.
Why This Matters (DevOps Perspective)
- Silent partial failures are invisible at query time. Engineers cannot write a Grafana panel or Honeycomb derived column to count "runs with output errors" because gh-aw.error.count is absent from success-concluded spans.
- Cancelled runs mask systemic issues. A spike in cancellations (e.g., from GitHub Actions quota exhaustion or upstream timeouts) shows up as a flat, healthy success rate, delaying root-cause discovery and inflating MTTR.
- Debugging requires artifact downloads. Without error attributes on partial-failure spans, engineers must download the agent_output.json artifact and parse it manually instead of reading the span directly in the tracing backend.
Current Behavior
// actions/setup/js/send_otlp_span.cjs lines 704–754
const isAgentFailure = agentConclusion === "failure" || agentConclusion === "timed_out";
const statusCode = isAgentFailure ? 2 : 1; // cancelled → OK (misleading)
// ... later ...
if (isAgentFailure && errorMessages.length > 0) {
  // ❌ Skipped for "success" with errors, and for "cancelled"
  attributes.push(buildAttr("gh-aw.error.count", outputErrors.length));
  attributes.push(buildAttr("gh-aw.error.messages", errorMessages.join(" | ")));
}
// buildSpanEvents() also guards on isAgentFailure:
const buildSpanEvents = eventTimeMs => {
  if (!isAgentFailure) {
    return []; // ❌ No exception events for partial failures on "success"
  }
  // ...
};
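The gating above can be reproduced in isolation. The following is a minimal sketch rather than the real function: currentConclusionStatus is a hypothetical helper that mirrors only the two guards quoted above, and the { message } shape of each error entry is an assumption.

```javascript
// Hypothetical helper mirroring only the guards quoted above; this is not the
// real sendJobConclusionSpan. The { message } shape of each error is assumed.
function currentConclusionStatus(agentConclusion, outputErrors) {
  const isAgentFailure = agentConclusion === "failure" || agentConclusion === "timed_out";
  const statusCode = isAgentFailure ? 2 : 1; // OTLP status: 1 = OK, 2 = ERROR
  const errorMessages = outputErrors.map(e => String(e.message));

  const attributes = {};
  if (isAgentFailure && errorMessages.length > 0) {
    attributes["gh-aw.error.count"] = outputErrors.length;
    attributes["gh-aw.error.messages"] = errorMessages.join(" | ");
  }
  return { statusCode, attributes };
}

const errors = [{ message: "failed to create issue" }];
console.log(currentConclusionStatus("success", errors)); // OK status, no error attributes
console.log(currentConclusionStatus("cancelled", [])); // OK status, same as a clean success
console.log(currentConclusionStatus("failure", errors)); // ERROR status, with error attributes
```

Under these assumptions, the first two calls return the same status code as a fully successful run, which is the gap this issue describes.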
Proposed Change
// actions/setup/js/send_otlp_span.cjs (proposed update)
const isAgentFailure = agentConclusion === "failure" || agentConclusion === "timed_out";
// Treat cancelled as a non-OK outcome so it is distinct from success in backends
const isAgentCancelled = agentConclusion === "cancelled";
const statusCode = (isAgentFailure || isAgentCancelled) ? 2 : 1;
let statusMessage = isAgentFailure ? `agent ${agentConclusion}` :
  isAgentCancelled ? "agent cancelled" : undefined;
// ... after building errorMessages ...
// Emit error attributes whenever there are errors, regardless of conclusion
if (errorMessages.length > 0) {
  if (isAgentFailure && errorMessages.length > 0) {
    statusMessage = `agent ${agentConclusion}: ${errorMessages[0]}`.slice(0, 256);
  }
  attributes.push(buildAttr("gh-aw.error.count", outputErrors.length));
  attributes.push(buildAttr("gh-aw.error.messages", errorMessages.join(" | ")));
}
// Include exception events for partial failures too (not just hard failures)
const buildSpanEvents = eventTimeMs => {
  if (outputErrors.length === 0) {
    return [];
  }
  // ... (rest unchanged) ...
};
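To make the proposed rules easier to check, they can be pulled into a pure function. This is a sketch under stated assumptions, not the actual change: proposedConclusionStatus is a hypothetical name, buildAttr here is a stand-in whose OTLP-style return shape is assumed, and the event objects and the { message } shape of each error entry are illustrative only.

```javascript
// Stand-in for the real buildAttr helper; the OTLP-style return shape is an assumption.
function buildAttr(key, value) {
  return typeof value === "number"
    ? { key, value: { intValue: value } }
    : { key, value: { stringValue: String(value) } };
}

// Hypothetical pure function applying the proposed rules; not the real sendJobConclusionSpan.
function proposedConclusionStatus(agentConclusion, outputErrors) {
  const isAgentFailure = agentConclusion === "failure" || agentConclusion === "timed_out";
  const isAgentCancelled = agentConclusion === "cancelled";
  const statusCode = isAgentFailure || isAgentCancelled ? 2 : 1; // 1 = OK, 2 = ERROR
  let statusMessage = isAgentFailure
    ? `agent ${agentConclusion}`
    : isAgentCancelled
      ? "agent cancelled"
      : undefined;

  // Assumed shape: each output error carries a message field.
  const errorMessages = outputErrors.map(e => String(e.message));
  const attributes = [];
  if (errorMessages.length > 0) {
    if (isAgentFailure) {
      statusMessage = `agent ${agentConclusion}: ${errorMessages[0]}`.slice(0, 256);
    }
    attributes.push(buildAttr("gh-aw.error.count", outputErrors.length));
    attributes.push(buildAttr("gh-aw.error.messages", errorMessages.join(" | ")));
  }

  // One "exception" event per output error, for any conclusion (event shape is assumed).
  const events = errorMessages.map(message => ({
    name: "exception",
    attributes: [buildAttr("exception.message", message)],
  }));

  return { statusCode, statusMessage, attributes, events };
}

const partial = proposedConclusionStatus("success", [{ message: "failed to create issue" }]);
console.log(partial.statusCode, partial.attributes.length, partial.events.length);

const cancelled = proposedConclusionStatus("cancelled", []);
console.log(cancelled.statusCode, cancelled.statusMessage);
```

In this sketch a "success" run with output errors keeps statusCode 1 but now carries the gh-aw.error.* attributes and exception events, while a cancelled run gets statusCode 2 even when it has no output errors.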
Expected Outcome
After this change:
- In Grafana / Honeycomb / Datadog: gh-aw.error.count > 0 becomes a reliable signal for any run with output errors, whether concluded as success, failure, or cancelled. Engineers can create alerts and SLO burn-rate panels based on this attribute without needing artifact-level access. Spans with gh-aw.agent.conclusion = "cancelled" will appear as errors (red) rather than successes (green), making cancellation spikes detectable.
- In the JSONL mirror: Cancelled runs will have status.code = 2, and partial-failure runs will carry populated gh-aw.error.* attributes, making offline post-hoc debugging from artifacts significantly richer.
- For on-call engineers: A single span query (gh-aw.error.count > 0 AND gh-aw.agent.conclusion = "success") immediately surfaces all workflows with silent partial failures, reducing artifact-chase time from minutes to seconds.
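The same query can be approximated offline against the JSONL mirror. The sketch below is illustrative only: it assumes one span object per line with OTLP-style attributes ([{ key, value: { stringValue | intValue } }]), which may not match the real mirror format, and findSilentPartialFailures and attrValue are hypothetical names.

```javascript
// Assumed record shape: one span per line with OTLP-style attributes.
// The real JSONL mirror format may differ.
function attrValue(span, key) {
  const attr = (span.attributes || []).find(a => a.key === key);
  if (!attr) return undefined;
  const v = attr.value || {};
  // OTLP JSON may encode integers as strings, so normalise to a number.
  if (v.intValue !== undefined) return Number(v.intValue);
  return v.stringValue;
}

// Returns spans that concluded as "success" but still report output errors.
function findSilentPartialFailures(jsonl) {
  return jsonl
    .split("\n")
    .filter(line => line.trim() !== "")
    .map(line => JSON.parse(line))
    .filter(
      span =>
        attrValue(span, "gh-aw.error.count") > 0 &&
        attrValue(span, "gh-aw.agent.conclusion") === "success"
    );
}

// Made-up sample spans; the names "a", "b", "c" are placeholders.
const conclusion = value => ({ key: "gh-aw.agent.conclusion", value: { stringValue: value } });
const errorCount = n => ({ key: "gh-aw.error.count", value: { intValue: n } });
const sample = [
  JSON.stringify({ name: "a", attributes: [conclusion("success"), errorCount(2)] }),
  JSON.stringify({ name: "b", attributes: [conclusion("success")] }),
  JSON.stringify({ name: "c", attributes: [conclusion("failure"), errorCount(1)] }),
].join("\n");

console.log(findSilentPartialFailures(sample).map(span => span.name)); // only "a" matches
```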
Implementation Steps
- Edit actions/setup/js/send_otlp_span.cjs:
  - Add the isAgentCancelled constant (line ~705)
  - Update statusCode to include isAgentCancelled (line ~706)
  - Update the statusMessage initialization to cover cancelled (line ~707)
  - Leave the if (isAgentFailure && errorMessages.length > 0) guard at line 715 to update statusMessage only for failures (keep as-is)
  - Change the if (isAgentFailure && errorMessages.length > 0) guard at lines 751–754 to if (errorMessages.length > 0) so attributes are emitted for all non-empty error lists
  - Change if (!isAgentFailure) { return []; } in buildSpanEvents (line ~807) to if (outputErrors.length === 0) { return []; }
- Update actions/setup/js/send_otlp_span.test.cjs (or equivalent) to assert:
  - A "success" conclusion with output errors emits the gh-aw.error.count and gh-aw.error.messages attributes
  - A "cancelled" conclusion emits statusCode = 2 and statusMessage = "agent cancelled"
- Run cd actions/setup/js && npx vitest run to confirm tests pass
- Run make fmt to ensure formatting
Evidence from Live Sentry Data
The Sentry MCP server has no configured tools in this environment (empty tool list at /home/runner/work/_temp/gh-aw/mcp-cli/tools/sentry.json), so live span sampling was not possible. This recommendation is based on static code analysis of the current instrumentation.
The gap is confirmed by the code: isAgentFailure on line 704 of send_otlp_span.cjs is the sole gate for all error-diagnostic attributes, and "cancelled" is explicitly absent from its condition. No conditional branch in the function emits gh-aw.error.count or gh-aw.error.messages for agentConclusion === "success" with non-empty outputErrors.
Related Files
- actions/setup/js/send_otlp_span.cjs: primary change site (lines 704–830)
- actions/setup/js/action_conclusion_otlp.cjs: calls sendJobConclusionSpan; no changes needed
- Test file for send_otlp_span.cjs: assertions for new behavior
Generated by the Daily OTel Instrumentation Advisor workflow