[otel-advisor] OTel improvement: emit error attributes for partial failures and cancelled conclusions #27685

@github-actions

📡 OTel Instrumentation Improvement: Emit error attributes for partial failures and cancelled conclusions

Analysis Date: 2026-04-21
Priority: High
Effort: Small (< 2h)

Problem

In sendJobConclusionSpan (actions/setup/js/send_otlp_span.cjs lines 704–754), error-enrichment attributes (gh-aw.error.count, gh-aw.error.messages) and OTel exception span events are gated behind isAgentFailure, which is only true when GH_AW_AGENT_CONCLUSION is "failure" or "timed_out".

Two important scenarios produce silent or misleading spans:

  1. Partial failures on "success" conclusions — When the agent concludes as "success" but agent_output.json contains non-empty errors[] (items the agent attempted but failed to create), the conclusion span records statusCode=OK with zero error attributes. A DevOps engineer reviewing the trace sees a green span with no indication that any items failed.

  2. Cancelled runs — When GH_AW_AGENT_CONCLUSION === "cancelled", isAgentFailure is false, so statusCode=1 (OK) is emitted. Cancelled runs are visually indistinguishable from fully successful runs in Grafana/Honeycomb/Datadog, making it impossible to alert on or trend cancellation rates.

Why This Matters (DevOps Perspective)

  • Silent partial failures are invisible at query time. Engineers cannot write a Grafana panel or Honeycomb derived column to count "runs with output errors" because gh-aw.error.count is absent from success-concluded spans.
  • Cancelled runs mask systemic issues. A spike in cancellations (e.g., from GitHub Actions quota exhaustion or upstream timeouts) shows up as a flat, healthy success rate — delaying root-cause discovery and inflating MTTR.
  • Debugging requires artifact downloads. Without error attributes on partial-failure spans, engineers must download the agent_output.json artifact and parse it manually instead of reading the span directly in the tracing backend.

Current Behavior

// actions/setup/js/send_otlp_span.cjs  lines 704–754

const isAgentFailure = agentConclusion === "failure" || agentConclusion === "timed_out";
const statusCode = isAgentFailure ? 2 : 1;  // cancelled → OK (misleading)

// ... later ...

if (isAgentFailure && errorMessages.length > 0) {
  // ❌ Skipped for "success" with errors, and for "cancelled"
  attributes.push(buildAttr("gh-aw.error.count", outputErrors.length));
  attributes.push(buildAttr("gh-aw.error.messages", errorMessages.join(" | ")));
}

// buildSpanEvents() also guards on isAgentFailure:
const buildSpanEvents = eventTimeMs => {
  if (!isAgentFailure) {
    return [];   // ❌ No exception events for partial failures on "success"
  }
  // ...
};
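
The gating above can be condensed into a small runnable sketch that reproduces both gaps. The function name currentConclusionFields and the plain-object attribute shape are assumptions for illustration only; the real code builds attributes with buildAttr inside sendJobConclusionSpan and counts outputErrors rather than errorMessages.

```javascript
// Hypothetical condensation of the current gating; not the actual source.
function currentConclusionFields(agentConclusion, errorMessages) {
  const isAgentFailure = agentConclusion === "failure" || agentConclusion === "timed_out";
  const statusCode = isAgentFailure ? 2 : 1; // 1 = OK, 2 = ERROR
  const attributes = {};
  if (isAgentFailure && errorMessages.length > 0) {
    attributes["gh-aw.error.count"] = errorMessages.length;
    attributes["gh-aw.error.messages"] = errorMessages.join(" | ");
  }
  return { statusCode, attributes };
}

const cleanSuccess = currentConclusionFields("success", []);
const partialFailure = currentConclusionFields("success", ["failed to create issue"]);
const cancelled = currentConclusionFields("cancelled", []);

// A clean success, a partial failure, and a cancellation all serialize identically.
console.log(JSON.stringify(cleanSuccess) === JSON.stringify(partialFailure)); // prints true
console.log(JSON.stringify(cleanSuccess) === JSON.stringify(cancelled)); // prints true
```

Under these assumptions, a backend has nothing on the span itself to tell the three outcomes apart.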

Proposed Change

// actions/setup/js/send_otlp_span.cjs — proposed update

const isAgentFailure = agentConclusion === "failure" || agentConclusion === "timed_out";
// Treat cancelled as a non-OK outcome so it is distinct from success in backends
const isAgentCancelled = agentConclusion === "cancelled";
const statusCode = (isAgentFailure || isAgentCancelled) ? 2 : 1;
let statusMessage = isAgentFailure ? `agent ${agentConclusion}` :
                    isAgentCancelled ? "agent cancelled" : undefined;

// ... after building errorMessages ...

// Emit error attributes whenever there are errors, regardless of conclusion
if (errorMessages.length > 0) {
  if (isAgentFailure && errorMessages.length > 0) {
    statusMessage = `agent ${agentConclusion}: ${errorMessages[0]}`.slice(0, 256);
  }
  attributes.push(buildAttr("gh-aw.error.count", outputErrors.length));
  attributes.push(buildAttr("gh-aw.error.messages", errorMessages.join(" | ")));
}

// Include exception events for partial failures too (not just hard failures)
const buildSpanEvents = eventTimeMs => {
  if (outputErrors.length === 0) {
    return [];
  }
  // ... (rest unchanged) ...
};
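
To make the proposal easier to check in isolation, the same logic can be sketched as a pure function. The function name proposedConclusionFields, the stand-in buildAttr, and the assumption that each output error is a string or an object with a message field are all illustrative; the real helper and error shape in send_otlp_span.cjs may differ.

```javascript
// Hypothetical stand-in for the real buildAttr helper; the actual attribute
// encoding in send_otlp_span.cjs may differ.
const buildAttr = (key, value) => ({ key, value });

// Sketch of the proposed status and attribute logic as a pure function.
function proposedConclusionFields(agentConclusion, outputErrors) {
  // Assumption: each output error is a string or an object with a `message` field.
  const errorMessages = outputErrors
    .map(e => (typeof e === "string" ? e : e && e.message))
    .filter(Boolean);

  const isAgentFailure = agentConclusion === "failure" || agentConclusion === "timed_out";
  const isAgentCancelled = agentConclusion === "cancelled";
  const statusCode = isAgentFailure || isAgentCancelled ? 2 : 1; // 1 = OK, 2 = ERROR

  let statusMessage;
  if (isAgentFailure) statusMessage = `agent ${agentConclusion}`;
  else if (isAgentCancelled) statusMessage = "agent cancelled";

  const attributes = [];
  if (errorMessages.length > 0) {
    // Only hard failures fold the first error into the status message.
    if (isAgentFailure) {
      statusMessage = `agent ${agentConclusion}: ${errorMessages[0]}`.slice(0, 256);
    }
    attributes.push(buildAttr("gh-aw.error.count", outputErrors.length));
    attributes.push(buildAttr("gh-aw.error.messages", errorMessages.join(" | ")));
  }
  return { statusCode, statusMessage, attributes };
}

console.log(proposedConclusionFields("success", [{ message: "failed to create issue" }]));
console.log(proposedConclusionFields("cancelled", []));
```

In this sketch a "success" conclusion with errors keeps statusCode 1 but gains both gh-aw.error.* attributes, while a "cancelled" conclusion gets statusCode 2 even when no errors were recorded.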

Expected Outcome

After this change:

  • In Grafana / Honeycomb / Datadog: gh-aw.error.count > 0 becomes a reliable signal for any run with output errors — whether concluded as success, failure, or cancelled. Engineers can create alerts and SLO burn-rate panels based on this attribute without needing artifact-level access. gh-aw.agent.conclusion = "cancelled" spans will appear as errors (red) rather than successes (green), making cancellation spikes detectable.
  • In the JSONL mirror: Cancelled runs will have status.code = 2, and partial-failure runs will keep status.code = 1 but carry populated gh-aw.error.* attributes, making offline post-hoc debugging from artifacts significantly richer.
  • For on-call engineers: A single span query (gh-aw.error.count > 0 AND gh-aw.agent.conclusion = "success") immediately surfaces all workflows with silent partial failures, reducing artifact-chase time from minutes to seconds.
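
The same filter can be approximated offline against the JSONL mirror. The sketch below assumes that each line is one span object whose attributes array uses the OTLP JSON key/value encoding; the real mirror layout is not shown in this issue and may differ. The function names attrValue and findSilentPartialFailures and the sample data are hypothetical.

```javascript
// Assumption: each JSONL line is one span object whose `attributes` array uses
// the OTLP JSON key/value encoding. The real mirror layout may differ.
function attrValue(span, key) {
  const attr = (span.attributes || []).find(a => a.key === key);
  if (!attr || !attr.value) return undefined;
  const v = attr.value;
  if (v.stringValue !== undefined) return v.stringValue;
  if (v.intValue !== undefined) return Number(v.intValue); // int64 may arrive as a string
  return undefined;
}

// Return spans that concluded as "success" but still recorded output errors.
function findSilentPartialFailures(jsonlText) {
  return jsonlText
    .split("\n")
    .filter(line => line.trim() !== "")
    .map(line => JSON.parse(line))
    .filter(
      span =>
        attrValue(span, "gh-aw.agent.conclusion") === "success" &&
        attrValue(span, "gh-aw.error.count") > 0
    );
}

// Illustrative data only; not taken from a real run.
const sample = [
  JSON.stringify({
    name: "run-a",
    attributes: [
      { key: "gh-aw.agent.conclusion", value: { stringValue: "success" } },
      { key: "gh-aw.error.count", value: { intValue: "2" } },
    ],
  }),
  JSON.stringify({
    name: "run-b",
    attributes: [{ key: "gh-aw.agent.conclusion", value: { stringValue: "success" } }],
  }),
].join("\n");

console.log(findSilentPartialFailures(sample).map(s => s.name)); // prints [ 'run-a' ]
```

A span without gh-aw.error.count is skipped because undefined > 0 is false, which matches how such spans would drop out of the backend query.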

Implementation Steps

  • In actions/setup/js/send_otlp_span.cjs:
    • Add isAgentCancelled constant (line ~705)
    • Update statusCode to include isAgentCancelled (line ~706)
    • Update statusMessage initialization to cover cancelled (line ~707)
    • Leave the if (isAgentFailure && errorMessages.length > 0) guard at line 715 as-is, so that statusMessage is enriched with the first error only for failures
    • Change if (isAgentFailure && errorMessages.length > 0) guard at lines 751–754 to if (errorMessages.length > 0) so attributes are emitted for all non-empty error lists
    • Change if (!isAgentFailure) { return []; } in buildSpanEvents (line ~807) to if (outputErrors.length === 0) { return []; }
  • Update actions/setup/js/send_otlp_span.test.cjs (or equivalent) to assert:
    • A "success" conclusion with errors produces gh-aw.error.count and gh-aw.error.messages attributes
    • A "cancelled" conclusion produces statusCode = 2 and statusMessage = "agent cancelled"
  • Run cd actions/setup/js && npx vitest run to confirm tests pass
  • Run make fmt to ensure formatting
  • Open a PR referencing this issue
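
The two test assertions listed above can be sketched as a small contract check. This version uses a hand-rolled check helper so that it runs on its own; in the repository the same assertions would presumably be written with vitest's expect. The names check, checkConclusionContract, and stubFields, and the (agentConclusion, outputErrors) signature, are assumptions, and stubFields is only a stand-in to be replaced by an adapter around the real span output.

```javascript
// Minimal assertion helper so the sketch has no dependencies.
function check(condition, message) {
  if (!condition) throw new Error(message);
}

// Hypothetical contract: `fields` is any function with the signature
// (agentConclusion, outputErrors) => { statusCode, statusMessage, attributes },
// where attributes is an array of { key, value } pairs.
function checkConclusionContract(fields) {
  const keysOf = result => result.attributes.map(a => a.key);

  // A "success" conclusion with errors must still carry both error attributes.
  const partial = fields("success", [{ message: "failed to create issue" }]);
  check(keysOf(partial).includes("gh-aw.error.count"), "missing gh-aw.error.count");
  check(keysOf(partial).includes("gh-aw.error.messages"), "missing gh-aw.error.messages");

  // A "cancelled" conclusion must be reported as an error with a fixed message.
  const cancelled = fields("cancelled", []);
  check(cancelled.statusCode === 2, "cancelled should have statusCode 2");
  check(cancelled.statusMessage === "agent cancelled", "unexpected statusMessage");
}

// Stand-in implementation so the sketch runs; swap in the real code under test.
const stubFields = (conclusion, errors) => ({
  statusCode: conclusion === "success" ? 1 : 2,
  statusMessage: conclusion === "cancelled" ? "agent cancelled" : undefined,
  attributes:
    errors.length > 0
      ? [
          { key: "gh-aw.error.count", value: errors.length },
          { key: "gh-aw.error.messages", value: errors.map(e => e.message).join(" | ") },
        ]
      : [],
});

checkConclusionContract(stubFields);
```

Passing an implementation that mirrors the current behavior (no attributes on "success", statusCode 1 on "cancelled") makes the check throw, which is the regression the new tests are meant to catch.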

Evidence from Live Sentry Data

The Sentry MCP server has no configured tools in this environment (empty tool list at /home/runner/work/_temp/gh-aw/mcp-cli/tools/sentry.json), so live span sampling was not possible. This recommendation is based on static code analysis of the current instrumentation.

The gap is confirmed by the code: isAgentFailure on line 704 of send_otlp_span.cjs is the sole gate for all error-diagnostic attributes, and "cancelled" is explicitly absent from its condition. No conditional branch in the function emits gh-aw.error.count or gh-aw.error.messages for agentConclusion === "success" with non-empty outputErrors.

Related Files

  • actions/setup/js/send_otlp_span.cjs — primary change site (lines 704–830)
  • actions/setup/js/action_conclusion_otlp.cjs — calls sendJobConclusionSpan; no changes needed
  • Test file for send_otlp_span.cjs — assertions for new behavior

Generated by the Daily OTel Instrumentation Advisor workflow

  • expires on Apr 28, 2026, 9:28 PM UTC
