[otel-advisor] OTel improvement: emit agent execution span for timed-out runs where agent_output.json is absent

### 📡 OTel Instrumentation Improvement: emit `gh-aw.agent.agent` span for timed-out runs

**Analysis Date**: 2026-04-19
**Priority**: High
**Effort**: Small (< 2h)

### Problem

The `gh-aw.agent.agent` sub-span — which measures pure AI execution latency — is **only emitted when `agent_output.json` exists with a valid mtime**. For timed-out runs (`GH_AW_AGENT_CONCLUSION=timed_out`), the agent process is killed before `agent_output.json` is written, so `fs.statSync` throws and `agentEndMs` stays `null`. The guard condition on line 837 of `send_otlp_span.cjs` then fails silently, and **no agent span is emitted for the most operationally critical failure mode**.

A DevOps engineer today cannot answer: *"Did this workflow time out after 5 minutes (misconfigured) or after 50 minutes (model ran long)?"* — that distinction is invisible in timed-out traces.

### Why This Matters (DevOps Perspective)

Timed-out runs are the failure mode most likely to hide cost and latency regressions. Without the agent span for timeouts:

- **Grafana / Honeycomb / Datadog**: you cannot plot AI execution duration for failed runs, making it impossible to set duration-based alerts that catch runaway agents before they exhaust budget.
- **MTTR**: engineers triaging a timeout must mentally subtract setup overhead from the conclusion span duration rather than reading the AI latency directly.
- **Trace consistency**: successful traces have 3 spans (setup, agent, conclusion); timed-out traces have only 2 (setup, conclusion). The missing span breaks span-count-based dashboards and makes trace shapes inconsistent.

### Current Behavior

```javascript
// actions/setup/js/send_otlp_span.cjs (lines 827–837)
const agentStartMs = options.startMs;
let agentEndMs = null;
try {
  agentEndMs = fs.statSync("/tmp/gh-aw/agent_output.json").mtimeMs;
} catch {
  // agent_output.json may not exist for non-agent jobs; skip dedicated span.
}

if (jobName === "agent" && typeof agentStartMs === "number" && agentStartMs > 0
    && typeof agentEndMs === "number" && agentEndMs > agentStartMs) {
  // ... emit agent span (never reached for timed-out runs)
}
```

For `GH_AW_AGENT_CONCLUSION=timed_out`, `agent_output.json` is absent → `statSync` throws → `agentEndMs` is `null` → the `typeof agentEndMs === "number"` guard fails → **no agent span emitted**.
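The failure path can be reproduced in isolation. The sketch below is illustrative only: `readMtimeMs` is a hypothetical helper, and a temporary directory stands in for `/tmp/gh-aw`. It shows `fs.statSync` throwing for a missing file, which leaves the end time `null`, and returning a numeric `mtimeMs` once the file exists.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper mirroring the try/catch above: returns the file's
// mtime in milliseconds, or null when the file cannot be stat'ed.
function readMtimeMs(filePath) {
  try {
    return fs.statSync(filePath).mtimeMs;
  } catch {
    return null;
  }
}

// A temp directory stands in for /tmp/gh-aw (assumption for this demo).
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "gh-aw-demo-"));
const outputPath = path.join(demoDir, "agent_output.json");

// Timed-out case: the agent was killed before writing its output file.
const absentMtime = readMtimeMs(outputPath);

// Normal case: the file exists, so mtimeMs is a number.
fs.writeFileSync(outputPath, "{}");
const presentMtime = readMtimeMs(outputPath);

console.log({ absentMtime, presentType: typeof presentMtime });

// Clean up the temp directory.
fs.rmSync(demoDir, { recursive: true, force: true });
```

With `absentMtime` equal to `null`, the `typeof agentEndMs === "number"` check in the guard cannot pass, which is the gap described above.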

### Proposed Change

Fall back to `nowMs()` as the agent span end time when the run is a timed-out failure and `agent_output.json` is absent. This bounds the AI execution duration to `[setup-end, conclusion-start]`, which is a useful upper bound even if slightly larger than the true agent wall-clock time.

```javascript
// Proposed change to actions/setup/js/send_otlp_span.cjs (around line 827)
const agentStartMs = options.startMs;
let agentEndMs = null;
try {
  agentEndMs = fs.statSync("/tmp/gh-aw/agent_output.json").mtimeMs;
} catch {
  // agent_output.json absent (e.g. timed-out run where the agent process was killed
  // before writing output): fall back to nowMs() so the agent span still bounds
  // execution duration. Only do this for agent failures — non-agent jobs (safe-outputs,
  // activation) should not emit an agent span.
  if (isAgentFailure && jobName === "agent"
      && typeof agentStartMs === "number" && agentStartMs > 0) {
    agentEndMs = nowMs();
  }
}

if (jobName === "agent" && typeof agentStartMs === "number" && agentStartMs > 0
    && typeof agentEndMs === "number" && agentEndMs > agentStartMs) {
  // ... emit agent span — now also runs for timed-out jobs
}
```
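The fallback rule can also be written as a pure function, which makes it testable without touching the filesystem or the clock. This is a minimal sketch, not the code in `send_otlp_span.cjs`: `resolveAgentEndMs` and its parameter names are hypothetical, and the mtime and current time are passed in rather than read from `fs.statSync` and `nowMs()`.

```javascript
// Hypothetical pure-function form of the proposed fallback.
//   mtimeMs:       mtime of agent_output.json, or null if the file is absent
//   fallbackNowMs: current time, used only when the fallback applies
function resolveAgentEndMs({ jobName, isAgentFailure, agentStartMs, mtimeMs, fallbackNowMs }) {
  // A real mtime always wins; the fallback is only for the absent-file case.
  if (typeof mtimeMs === "number") return mtimeMs;
  const hasStart = typeof agentStartMs === "number" && agentStartMs > 0;
  return isAgentFailure && jobName === "agent" && hasStart ? fallbackNowMs : null;
}

// Example values (assumptions for illustration only).
const startMs = Date.UTC(2026, 3, 19, 12, 0, 0);
const fiftyMinutesLater = startMs + 50 * 60 * 1000;

// Timed-out agent job: the span end falls back to "now".
const timedOutEnd = resolveAgentEndMs({
  jobName: "agent",
  isAgentFailure: true,
  agentStartMs: startMs,
  mtimeMs: null,
  fallbackNowMs: fiftyMinutesLater,
});
const timedOutMinutes = (timedOutEnd - startMs) / 60000;

// Non-agent job with no output file: still no end time, so no agent span.
const nonAgentEnd = resolveAgentEndMs({
  jobName: "safe-outputs",
  isAgentFailure: true,
  agentStartMs: startMs,
  mtimeMs: null,
  fallbackNowMs: fiftyMinutesLater,
});

console.log({ timedOutMinutes, nonAgentEnd });
```

Passing the clock in as a parameter is a design choice for the sketch only; it keeps the example deterministic. The proposed change above calls `nowMs()` directly.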

### Expected Outcome

After this change:

- **In Grafana / Honeycomb / Datadog**: timed-out traces now have 3 spans (setup, agent, conclusion), matching successful traces. You can plot `gh-aw.agent.agent` span duration across all outcomes and alert when AI latency exceeds a threshold regardless of whether the run succeeded.
- **In the JSONL mirror**: `otel.jsonl` gains an `agent` span entry for every timed-out run, improving post-hoc artifact-based debugging.
- **For on-call engineers**: "How long did the AI run before timing out?" becomes a one-click query on the `gh-aw.agent.agent` span duration rather than a manual subtraction from conclusion span duration.

<details>
<summary><b>Implementation Steps</b></summary>

- [ ] In `actions/setup/js/send_otlp_span.cjs` (lines 828–836): update the catch block to set `agentEndMs = nowMs()` when `isAgentFailure && jobName === "agent" && typeof agentStartMs === "number" && agentStartMs > 0`
- [ ] Update `actions/setup/js/send_otlp_span.test.cjs` (around line 1614, the `"does not emit a dedicated agent span when agent_output mtime is unavailable"` test): add a sibling test that asserts an agent span IS emitted when `GH_AW_AGENT_CONCLUSION=timed_out` and `statSync` throws
- [ ] Keep the existing test at line 1614 but scope it to non-failure cases (e.g. `GH_AW_AGENT_CONCLUSION` unset) to preserve the "non-agent jobs skip the span" invariant
- [ ] Run `cd actions/setup/js && npx vitest run` to confirm tests pass
- [ ] Run `make fmt` to ensure formatting
- [ ] Open a PR referencing this issue

</details>

### Evidence from Live Sentry Data

No live Sentry data was available: the Sentry MCP server returned 0 available tools during this analysis run and could not be queried. The finding is based entirely on static code analysis of `send_otlp_span.cjs` (lines 827–858). The gap is confirmed by the existing test at line 1614 of `send_otlp_span.test.cjs`, which explicitly asserts that no agent span is emitted when `statSync` throws. That test passes today, so the missing span is documented, seemingly intentional behavior. No comparable test asserts that the span IS emitted for timed-out failure runs.

### Related Files

- `actions/setup/js/send_otlp_span.cjs` — primary change (lines 827–837)
- `actions/setup/js/send_otlp_span.test.cjs` — add test for timed-out agent span emission
- `actions/setup/js/action_conclusion_otlp.cjs` — no change needed (orchestrates `sendJobConclusionSpan` which handles the logic)

---

*Generated by the [Daily OTel Instrumentation Advisor](https://github.com/github/gh-aw/actions/runs/24639358428) workflow*

> Generated by [Daily OTel Instrumentation Advisor](https://github.com/github/gh-aw/actions/runs/24639358428/agentic_workflow) · ● 186.3K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fdaily-otel-instrumentation-advisor%22&type=issues)
> - [x] expires on Apr 26, 2026, 9:24 PM UTC
