[otel-advisor] OTel improvement: emit error attributes for partial failures and cancelled conclusions

### 📡 OTel Instrumentation Improvement: Emit error attributes for partial failures and cancelled conclusions

**Analysis Date**: 2026-04-21
**Priority**: High
**Effort**: Small (< 2h)

### Problem

In `sendJobConclusionSpan` (`actions/setup/js/send_otlp_span.cjs` lines 704–754), error-enrichment attributes (`gh-aw.error.count`, `gh-aw.error.messages`) and OTel exception span events are gated behind `isAgentFailure`, which is only `true` when `GH_AW_AGENT_CONCLUSION` is `"failure"` or `"timed_out"`.

Two important scenarios produce silent or misleading spans:

1. **Partial failures on "success" conclusions** — When the agent concludes as `"success"` but `agent_output.json` contains non-empty `errors[]` (items the agent attempted but failed to create), the conclusion span records `statusCode=OK` with zero error attributes. A DevOps engineer reviewing the trace sees a green span with no indication that any items failed.

2. **Cancelled runs** — When `GH_AW_AGENT_CONCLUSION === "cancelled"`, `isAgentFailure` is `false`, so `statusCode=1 (OK)` is emitted. Cancelled runs are visually indistinguishable from fully successful runs in Grafana/Honeycomb/Datadog, making it impossible to alert on or trend cancellation rates.

### Why This Matters (DevOps Perspective)

- **Silent partial failures are invisible at query time.** Engineers cannot write a Grafana panel or Honeycomb derived column to count "runs with output errors" because `gh-aw.error.count` is absent from success-concluded spans.
- **Cancelled runs mask systemic issues.** A spike in cancellations (e.g., from GitHub Actions quota exhaustion or upstream timeouts) shows up as a flat, healthy success rate — delaying root-cause discovery and inflating MTTR.
- **Debugging requires artifact downloads.** Without error attributes on partial-failure spans, engineers must download the `agent_output.json` artifact and parse it manually instead of reading the span directly in the tracing backend.

### Current Behavior

```javascript
// actions/setup/js/send_otlp_span.cjs  lines 704–754

const isAgentFailure = agentConclusion === "failure" || agentConclusion === "timed_out";
const statusCode = isAgentFailure ? 2 : 1;  // cancelled → OK (misleading)

// ... later ...

if (isAgentFailure && errorMessages.length > 0) {
  // ❌ Skipped for "success" with errors, and for "cancelled"
  attributes.push(buildAttr("gh-aw.error.count", outputErrors.length));
  attributes.push(buildAttr("gh-aw.error.messages", errorMessages.join(" | ")));
}

// buildSpanEvents() also guards on isAgentFailure:
const buildSpanEvents = eventTimeMs => {
  if (!isAgentFailure) {
    return [];   // ❌ No exception events for partial failures on "success"
  }
  // ...
};
```

### Proposed Change

```javascript
// actions/setup/js/send_otlp_span.cjs — proposed update

const isAgentFailure = agentConclusion === "failure" || agentConclusion === "timed_out";
// Treat cancelled as a non-OK outcome so it is distinct from success in backends
const isAgentCancelled = agentConclusion === "cancelled";
const statusCode = (isAgentFailure || isAgentCancelled) ? 2 : 1;
let statusMessage = isAgentFailure ? `agent ${agentConclusion}` :
                    isAgentCancelled ? "agent cancelled" : undefined;

// ... after building errorMessages ...

// Emit error attributes whenever there are errors, regardless of conclusion
if (errorMessages.length > 0) {
  if (isAgentFailure) {
    statusMessage = `agent ${agentConclusion}: ${errorMessages[0]}`.slice(0, 256);
  }
  attributes.push(buildAttr("gh-aw.error.count", outputErrors.length));
  attributes.push(buildAttr("gh-aw.error.messages", errorMessages.join(" | ")));
}

// Include exception events for partial failures too (not just hard failures)
const buildSpanEvents = eventTimeMs => {
  if (outputErrors.length === 0) {
    return [];
  }
  // ... (rest unchanged) ...
};
```
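The proposed logic above can be exercised in isolation. The following is a minimal sketch, not the actual implementation: `deriveConclusionStatus` is a hypothetical pure helper that does not exist in `send_otlp_span.cjs`, `buildAttr` is replaced by a stand-in whose return shape is assumed, and the error count is taken from `errorMessages` on the assumption that there is one message per output error.

```javascript
// Hypothetical helper for illustration only; not part of send_otlp_span.cjs.
// Stand-in for the real buildAttr, whose return shape is an assumption here.
const buildAttr = (key, value) => ({ key, value });

/**
 * Derive the span status and error attributes from the agent conclusion and
 * the error messages extracted from agent_output.json.
 */
function deriveConclusionStatus(agentConclusion, errorMessages) {
  const isAgentFailure = agentConclusion === "failure" || agentConclusion === "timed_out";
  const isAgentCancelled = agentConclusion === "cancelled";

  // OTLP status codes: 1 = OK, 2 = ERROR.
  const statusCode = isAgentFailure || isAgentCancelled ? 2 : 1;
  let statusMessage;
  if (isAgentFailure) statusMessage = `agent ${agentConclusion}`;
  else if (isAgentCancelled) statusMessage = "agent cancelled";

  const attributes = [];
  if (errorMessages.length > 0) {
    // Only hard failures get the first error appended to the status message.
    if (isAgentFailure) {
      statusMessage = `agent ${agentConclusion}: ${errorMessages[0]}`.slice(0, 256);
    }
    // Assumes one message per output error (the source counts outputErrors).
    attributes.push(buildAttr("gh-aw.error.count", errorMessages.length));
    attributes.push(buildAttr("gh-aw.error.messages", errorMessages.join(" | ")));
  }
  return { statusCode, statusMessage, attributes };
}

const partial = deriveConclusionStatus("success", ["failed to create issue"]);
const cancelled = deriveConclusionStatus("cancelled", []);
console.log(partial.statusCode, partial.attributes.length); // 1 2
console.log(cancelled.statusCode, cancelled.statusMessage); // 2 agent cancelled
```

Note that in this sketch a partial failure keeps `statusCode = 1` and is distinguished only by its error attributes, while a cancelled run is marked as an error even when no output errors were recorded.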

### Expected Outcome

After this change:

- **In Grafana / Honeycomb / Datadog**: `gh-aw.error.count > 0` becomes a reliable signal for _any_ run with output errors — whether concluded as success, failure, or cancelled. Engineers can create alerts and SLO burn-rate panels based on this attribute without needing artifact-level access. `gh-aw.agent.conclusion = "cancelled"` spans will appear as errors (red) rather than successes (green), making cancellation spikes detectable.
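- **In the JSONL mirror**: Cancelled runs will have `status.code = 2`, and any run with output errors will carry populated `gh-aw.error.*` attributes. Partial failures on `"success"` keep `status.code = 1` under this proposal, so they are identified by `gh-aw.error.count > 0` rather than by status. Both signals make offline post-hoc debugging from artifacts richer.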
- **For on-call engineers**: A single span query (`gh-aw.error.count > 0 AND gh-aw.agent.conclusion = "success"`) immediately surfaces all workflows with silent partial failures, reducing artifact-chase time from minutes to seconds.
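The same query can be approximated offline against the JSONL mirror. The sketch below assumes, purely for illustration, that each mirror line is a JSON object with a flat `attributes` map; the actual mirror layout is not described in this issue and may differ. `findSilentPartialFailures` is a hypothetical name.

```javascript
// Hypothetical helper; the JSONL record layout (a flat `attributes` object
// per line) is an assumption and may not match the real mirror format.
function findSilentPartialFailures(jsonlText) {
  return jsonlText
    .split("\n")
    .filter(line => line.trim() !== "")
    .map(line => JSON.parse(line))
    .filter(span => {
      const attrs = span.attributes || {};
      // A missing count becomes NaN, which fails the > 0 comparison.
      return Number(attrs["gh-aw.error.count"]) > 0 && attrs["gh-aw.agent.conclusion"] === "success";
    });
}

// Made-up input: only the second span is a silent partial failure.
const sampleJsonl = [
  JSON.stringify({ name: "run-a", attributes: { "gh-aw.agent.conclusion": "success" } }),
  JSON.stringify({ name: "run-b", attributes: { "gh-aw.agent.conclusion": "success", "gh-aw.error.count": 2 } }),
  JSON.stringify({ name: "run-c", attributes: { "gh-aw.agent.conclusion": "failure", "gh-aw.error.count": 1 } }),
].join("\n");

console.log(findSilentPartialFailures(sampleJsonl).map(span => span.name)); // [ 'run-b' ]
```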

<details>
<summary><b>Implementation Steps</b></summary>

- [ ] In `actions/setup/js/send_otlp_span.cjs`:
  - Add `isAgentCancelled` constant (line ~705)
  - Update `statusCode` to include `isAgentCancelled` (line ~706)
  - Update `statusMessage` initialization to cover cancelled (line ~707)
  - Keep the `if (isAgentFailure && errorMessages.length > 0)` guard at line 715 as-is, so `statusMessage` is enriched with the first error only for failures
  - Change `if (isAgentFailure && errorMessages.length > 0)` guard at lines 751–754 to `if (errorMessages.length > 0)` so attributes are emitted for all non-empty error lists
  - Change `if (!isAgentFailure) { return []; }` in `buildSpanEvents` (line ~807) to `if (outputErrors.length === 0) { return []; }`
- [ ] Update `actions/setup/js/send_otlp_span.test.cjs` (or equivalent) to assert:
  - A "success" conclusion with errors produces `gh-aw.error.count` and `gh-aw.error.messages` attributes
  - A "cancelled" conclusion produces `statusCode = 2` and `statusMessage = "agent cancelled"`
- [ ] Run `cd actions/setup/js && npx vitest run` to confirm tests pass
- [ ] Run `make fmt` to ensure formatting
- [ ] Open a PR referencing this issue

</details>
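The two test assertions listed in the steps could take roughly the following shape. This sketch uses Node's built-in `assert` instead of vitest so that it runs standalone, and it checks hand-written span objects in the OTLP/JSON attribute encoding (`{ key, value: { stringValue | intValue } }`). How the real tests capture the span emitted by `sendJobConclusionSpan` is not shown in this issue; `getAttr` and both sample spans are hypothetical.

```javascript
const { strictEqual, ok } = require("node:assert");

// Hypothetical lookup helper for attributes in OTLP/JSON KeyValue encoding.
function getAttr(span, key) {
  const entry = (span.attributes || []).find(attr => attr.key === key);
  if (!entry) return undefined;
  const v = entry.value;
  // intValue may arrive as a string in OTLP/JSON, so normalise it to a number.
  return v.intValue !== undefined ? Number(v.intValue) : v.stringValue;
}

// Hand-written stand-ins for spans the updated code would be expected to emit.
const partialFailureSpan = {
  status: { code: 1 },
  attributes: [
    { key: "gh-aw.agent.conclusion", value: { stringValue: "success" } },
    { key: "gh-aw.error.count", value: { intValue: "2" } },
    { key: "gh-aw.error.messages", value: { stringValue: "create_issue failed | add_comment failed" } },
  ],
};
const cancelledSpan = {
  status: { code: 2, message: "agent cancelled" },
  attributes: [{ key: "gh-aw.agent.conclusion", value: { stringValue: "cancelled" } }],
};

// A "success" conclusion with errors carries both error attributes.
strictEqual(getAttr(partialFailureSpan, "gh-aw.error.count"), 2);
ok(getAttr(partialFailureSpan, "gh-aw.error.messages").includes(" | "));

// A "cancelled" conclusion is reported as ERROR with a fixed message.
strictEqual(cancelledSpan.status.code, 2);
strictEqual(cancelledSpan.status.message, "agent cancelled");
strictEqual(getAttr(cancelledSpan, "gh-aw.error.count"), undefined);
```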

### Evidence from Live Sentry Data

The Sentry MCP server has no configured tools in this environment (empty tool list at `/home/runner/work/_temp/gh-aw/mcp-cli/tools/sentry.json`), so live span sampling was not possible. This recommendation is based on static code analysis of the current instrumentation.

The gap is confirmed by the code: `isAgentFailure` on line 704 of `send_otlp_span.cjs` is the sole gate for all error-diagnostic attributes, and its condition does not include `"cancelled"`. No branch in the function emits `gh-aw.error.count` or `gh-aw.error.messages` when `agentConclusion === "success"` and `outputErrors` is non-empty.

### Related Files

- `actions/setup/js/send_otlp_span.cjs` — primary change site (lines 704–830)
- `actions/setup/js/action_conclusion_otlp.cjs` — calls `sendJobConclusionSpan`; no changes needed
- Test file for `send_otlp_span.cjs` — assertions for new behavior

---

*Generated by the [Daily OTel Instrumentation Advisor](https://github.com/github/gh-aw/actions/runs/24747134243) workflow*







> Generated by [Daily OTel Instrumentation Advisor](https://github.com/github/gh-aw/actions/runs/24747134243/agentic_workflow) · ● 168K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fdaily-otel-instrumentation-advisor%22&type=issues)
> - [x] expires on Apr 28, 2026, 9:28 PM UTC

[otel-advisor] OTel improvement: emit error attributes for partial failures and cancelled conclusions #27685

📡 OTel Instrumentation Improvement: Emit error attributes for partial failures and cancelled conclusions

Problem

Why This Matters (DevOps Perspective)

Current Behavior

Proposed Change

Expected Outcome

Evidence from Live Sentry Data

Related Files

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[otel-advisor] OTel improvement: emit error attributes for partial failures and cancelled conclusions #27685

Description

📡 OTel Instrumentation Improvement: Emit error attributes for partial failures and cancelled conclusions

Problem

Why This Matters (DevOps Perspective)

Current Behavior

Proposed Change

Expected Outcome

Evidence from Live Sentry Data

Related Files

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions