[otel-advisor] OTel improvement: capture per-endpoint OTLP export failure details (host + status + reason)

### 📡 OTel Instrumentation Improvement: capture per-endpoint OTLP export failure details

**Analysis Date**: 2026-05-16 
**Priority**: High 
**Effort**: Small (< 2h)

### Problem

`send_otlp_span.cjs` records a *count* of failed OTLP exports in `/tmp/gh-aw/otlp-export-errors.count` (via `recordOTLPExportError()` at `send_otlp_span.cjs:862` and `:870`) and surfaces it on the conclusion span as `gh-aw.otlp.export_errors`. The count tells operators *that* exports failed, but it does not tell them **which endpoint failed**, **what the HTTP status was**, or **what the error message said**. With multiple endpoints fanned out by `sendOTLPToAllEndpoints` (one entry per backend in `GH_AW_OTLP_ENDPOINTS`), the counter cannot distinguish between "Sentry is broken" and "Grafana is broken" — both yield the same opaque integer.

This is not a hypothetical gap. Running this very advisor produced concrete evidence in the local JSONL mirror **during this run**:

- `/tmp/gh-aw/otlp-export-errors.count` = `2`
- `/tmp/gh-aw/otel.jsonl` contains exactly **one** `gh-aw.agent.setup` span (`trace_id=f6b8e8f341d8728e5f8d82cee504af03`) with no metadata about the two failures
- There is no way to determine, from the artifacts alone, which of the two configured endpoints rejected the export or what the response code was

<details>
<summary>Why This Matters (DevOps Perspective)</summary>

OTLP export is **silently best-effort by design** — errors are deliberately swallowed so they cannot break the workflow (`send_otlp_span.cjs:825–874`). That is correct for availability, but it shifts the entire diagnostic burden onto post-hoc inspection. Today the on-call path for "why is half my data missing from Grafana?" is:

1. Open the failed workflow run in the GitHub UI
2. Expand the relevant job, locate the setup or post step
3. Scroll the log for `OTLP export ... failed` lines
4. Cross-reference the message back to which endpoint URL it came from

That path **does not work** once the run has been garbage-collected, when the user has only the artifact, or when the failure happens on a worker that produces large logs. The fix is to make the failure mode queryable from the same artifacts and spans that the advisor already trusts as the source of truth. With per-endpoint failure data:

- An alert can fire on `gh-aw.otlp.export.last_failure.endpoint_host = "sentry-otlp.example.com"` instead of a generic counter
- A Grafana panel can group failures by host and HTTP status, separating auth (401/403) from transport (DNS / timeout) from quota (429)
- MTTR for an exporter outage drops from "page someone to read raw logs" to "read one attribute"
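
The auth / transport / quota split mentioned above could be derived from the recorded status alone. The following is a minimal sketch, not code from the repository: `classifyExportFailure` is a hypothetical helper, and it assumes the convention from the proposed change that a status of `0` means no HTTP response was received.

```javascript
// Hypothetical helper: maps a recorded failure status to a coarse category.
// Assumption: status 0 means the request never produced an HTTP response.
function classifyExportFailure(status) {
  if (status === 0) return "transport"; // e.g. DNS failure, timeout, connection reset
  if (status === 401 || status === 403) return "auth";
  if (status === 429) return "quota";
  if (status >= 500) return "server";
  return "other";
}

console.log(classifyExportFailure(403));
console.log(classifyExportFailure(0));
```

A dashboard or alert rule would more likely express the same buckets in its own query language; the sketch only shows that host plus status is enough to separate them.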

</details>

<details>
<summary>Current Behavior</summary>

The retry loop only increments a counter and emits a `console.warn`. No structured artifact is written, and no endpoint identity survives:

```javascript
// actions/setup/js/send_otlp_span.cjs:856–872 (current)
if (response.ok) {
  return;
}
const msg = `HTTP ${response.status} ${response.statusText}`;
if (attempt < maxRetries) {
  console.warn(`OTLP export attempt ${attempt + 1}/${maxRetries + 1} failed: ${msg}, retrying...`);
} else {
  console.warn(`OTLP export failed after ${maxRetries + 1} attempts: ${msg}`);
  recordOTLPExportError();
}
// ... (catch branch is symmetric — also only logs + bumps the counter)
```

```javascript
// actions/setup/js/send_otlp_span.cjs:1216–1223 (current)
function recordOTLPExportError() {
  try {
    fs.mkdirSync("/tmp/gh-aw", { recursive: true });
    fs.writeFileSync(OTLP_EXPORT_ERRORS_PATH, String(readOTLPExportErrorCount() + 1));
  } catch {
    // Export-health tracking is best-effort only.
  }
}
```

The conclusion span at `send_otlp_span.cjs:1533` reads only that count:

```javascript
attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));
```

</details>

<details>
<summary>Proposed Change</summary>

1. Extract the endpoint **host** from the URL (avoid leaking full URLs with embedded credentials) and pass it, plus the failure reason, into `recordOTLPExportError`.
2. In addition to incrementing the count, append a structured JSONL line to `/tmp/gh-aw/otlp-export-errors.jsonl` so the failure list is debuggable from the artifact alone.
3. Add a `readLastOTLPExportError()` helper that reads the last line of that file.
4. Surface three new conclusion-span attributes (`gh-aw.otlp.export.last_failure.endpoint_host`, `gh-aw.otlp.export.last_failure.status`, and `gh-aw.otlp.export.last_failure.reason`) so a single span query identifies the culprit.

```javascript
// actions/setup/js/send_otlp_span.cjs — proposed

const OTLP_EXPORT_ERRORS_JSONL_PATH = "/tmp/gh-aw/otlp-export-errors.jsonl";

function extractEndpointHost(url) {
  try {
    return new URL(url).host;
  } catch {
    return "";
  }
}

function recordOTLPExportError({ endpoint, status, reason } = {}) {
  try {
    fs.mkdirSync("/tmp/gh-aw", { recursive: true });
    fs.writeFileSync(OTLP_EXPORT_ERRORS_PATH, String(readOTLPExportErrorCount() + 1));
    const entry = {
      ts: new Date().toISOString(),
      endpoint_host: endpoint ? extractEndpointHost(endpoint) : "",
      status: typeof status === "number" ? status : 0,
      reason: typeof reason === "string" ? reason.slice(0, MAX_ATTR_VALUE_LENGTH) : "",
    };
    fs.appendFileSync(OTLP_EXPORT_ERRORS_JSONL_PATH, JSON.stringify(entry) + "\n");
  } catch {
    // Export-health tracking is best-effort only.
  }
}

function readLastOTLPExportError() {
  try {
    const lines = fs.readFileSync(OTLP_EXPORT_ERRORS_JSONL_PATH, "utf8").split("\n").filter(Boolean);
    return lines.length > 0 ? JSON.parse(lines[lines.length - 1]) : null;
  } catch {
    return null;
  }
}
```

Update the two `recordOTLPExportError()` call sites in `sendOTLPSpan` (lines 862 and 870) to pass the endpoint and reason:

```javascript
recordOTLPExportError({ endpoint: url, status: response.status, reason: msg });
// and in the catch branch:
recordOTLPExportError({ endpoint: url, status: 0, reason: msg });
```

Enrich the conclusion span near `send_otlp_span.cjs:1533`:

```javascript
attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));
const lastExportError = readLastOTLPExportError();
if (lastExportError) {
  if (lastExportError.endpoint_host) {
    attributes.push(buildAttr("gh-aw.otlp.export.last_failure.endpoint_host", lastExportError.endpoint_host));
  }
  if (typeof lastExportError.status === "number" && lastExportError.status > 0) {
    attributes.push(buildAttr("gh-aw.otlp.export.last_failure.status", lastExportError.status));
  }
  if (lastExportError.reason) {
    attributes.push(buildAttr("gh-aw.otlp.export.last_failure.reason", lastExportError.reason));
  }
}
```

The host (not the full URL) is intentional — the URL may carry path tokens, and the existing `sanitizeAttrs` regex (`send_otlp_span.cjs:660`) does not redact URLs, so emitting the host avoids any chance of credential leakage.
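
To illustrate what `new URL(url).host` keeps and drops, here is a small standalone check. The endpoint below is entirely made up (credentials, port, and token included) and is not taken from any real configuration.

```javascript
// Illustrative URL only: the userinfo, port, path, and token are invented.
const endpoint = "https://user:s3cret@otlp.example.com:4318/v1/traces?token=abc123";
const parsed = new URL(endpoint);

// `host` drops userinfo, path, and query, but keeps a non-default port.
console.log(parsed.host); // otlp.example.com:4318

// `hostname` additionally drops the port.
console.log(parsed.hostname); // otlp.example.com
```

Keeping the port via `host` can help tell apart two collectors on the same machine; if that distinction is not wanted, `hostname` would be the narrower choice.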

</details>

<details>
<summary>Expected Outcome</summary>

After this change:

- **In Grafana / Honeycomb / Datadog**: queries like `gh-aw.otlp.export.last_failure.endpoint_host = "<host>"` and group-by `gh-aw.otlp.export.last_failure.status` become possible. Alerting on a *specific* failing endpoint replaces alerting on "any export failed somewhere."
- **In the JSONL mirror**: `/tmp/gh-aw/otlp-export-errors.jsonl` exists alongside the existing `.count` file, giving the artifact downloader a per-failure record (timestamp + host + status + reason) with no live collector required.
- **For on-call engineers**: the conclusion span answers "which backend rejected my data, and what did it say" in one trace lookup, rather than requiring a log-dive into the failing job.
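
As a rough illustration of working from the artifact alone, the sketch below groups JSONL entries by host and status. `summarizeExportErrors` is a hypothetical helper, and the sample lines are invented data in the entry shape proposed above (`ts`, `endpoint_host`, `status`, `reason`).

```javascript
// Hypothetical helper: counts failures per "<host> <status>" pair from JSONL text.
function summarizeExportErrors(jsonlText) {
  const counts = new Map();
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue;
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip a truncated or malformed line
    }
    if (!entry || typeof entry !== "object") continue;
    const key = `${entry.endpoint_host || "unknown"} ${entry.status || 0}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Invented sample data, not output from a real run.
const sample = [
  '{"ts":"2026-05-16T09:45:00.000Z","endpoint_host":"sentry-otlp.example.com","status":401,"reason":"HTTP 401 Unauthorized"}',
  '{"ts":"2026-05-16T09:45:01.000Z","endpoint_host":"sentry-otlp.example.com","status":401,"reason":"HTTP 401 Unauthorized"}',
  '{"ts":"2026-05-16T09:45:02.000Z","endpoint_host":"grafana-otlp.example.com","status":0,"reason":"fetch failed"}',
].join("\n");

for (const [key, count] of summarizeExportErrors(sample)) {
  console.log(`${key}: ${count}`);
}
```

Against a downloaded artifact, the same function could be fed the contents of `otlp-export-errors.jsonl` read with `fs.readFileSync(path, "utf8")`.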

</details>

<details>
<summary>Implementation Steps</summary>

- [ ] Edit `actions/setup/js/send_otlp_span.cjs`:
  - Add `OTLP_EXPORT_ERRORS_JSONL_PATH` constant near `OTLP_EXPORT_ERRORS_PATH` (line 1159)
  - Add `extractEndpointHost(url)` helper
  - Change `recordOTLPExportError()` signature to accept `{ endpoint, status, reason }` and append a JSONL entry
  - Add `readLastOTLPExportError()` helper near `readOTLPExportErrorCount` (line 1201)
  - Update both call sites at lines 862 and 870 to pass `{ endpoint: url, status, reason }`
  - Enrich the conclusion-span attribute block near line 1533 with the three new attributes
- [ ] Update `actions/setup/js/send_otlp_span.test.cjs` to assert the new attributes are present when failures are simulated and absent when no failures occurred
- [ ] Run `cd actions/setup/js && npx vitest run send_otlp_span.test.cjs action_conclusion_otlp.test.cjs` to confirm tests pass
- [ ] Run `make fmt` to ensure formatting
- [ ] Open a PR referencing this issue
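
One possible way to make the "present when failures are simulated, absent otherwise" assertion easy to write is to isolate the attribute selection as a pure function. This is only a sketch: `lastFailureAttrs` is a hypothetical name that does not exist in the current file, and it returns plain `[key, value]` pairs rather than calling `buildAttr`.

```javascript
// Hypothetical pure helper: returns [key, value] pairs for the last recorded
// failure, or an empty array when no failure was recorded.
function lastFailureAttrs(lastError) {
  const prefix = "gh-aw.otlp.export.last_failure";
  const attrs = [];
  if (!lastError) return attrs;
  if (lastError.endpoint_host) {
    attrs.push([`${prefix}.endpoint_host`, lastError.endpoint_host]);
  }
  // A status of 0 (no HTTP response) is omitted, mirroring the proposed span logic.
  if (typeof lastError.status === "number" && lastError.status > 0) {
    attrs.push([`${prefix}.status`, lastError.status]);
  }
  if (lastError.reason) {
    attrs.push([`${prefix}.reason`, lastError.reason]);
  }
  return attrs;
}

// No recorded failure: nothing to attach.
console.log(lastFailureAttrs(null).length);

// Transport failure with invented values: host and reason only, no status.
const transportFailure = { endpoint_host: "otlp.example.com", status: 0, reason: "fetch failed" };
console.log(lastFailureAttrs(transportFailure).map(([key]) => key));
```

A test could then assert on the returned keys directly, without stubbing the file system or the export call.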

</details>

<details>
<summary>Evidence from Live Telemetry</summary>

Live JSONL mirror collected during this advisor run (workflow run [§25958660960](https://github.com/github/gh-aw/actions/runs/25958660960)). This is telemetry source #2 in `.github/skills/otel-queries/SKILL.md` (the local JSONL mirror); the static playbook fell back to it because the Sentry MCP CLI exposed an empty tool list in this environment:

- `/tmp/gh-aw/otlp-export-errors.count` → **`2`** (proves real export failures occurred during this very run)
- `/tmp/gh-aw/otel.jsonl` → exactly one span:
  - `trace_id`: `f6b8e8f341d8728e5f8d82cee504af03`
  - `span name`: `gh-aw.agent.setup`
  - `parent_span_id`: `941d27aac50e588e`
  - `status.code`: `1` (OK)
  - All five canonical resource attributes (`service.version=2.1.142`, `github.repository=github/gh-aw`, `github.run_id=25958660960`, `github.event_name=schedule`, `deployment.environment=production`) are present
- The span carries no reference to the two failures, and the JSONL file does not include any record of the failed endpoints or HTTP responses

The gap is therefore confirmed by the live data, not just by code inspection.

</details>

<details>
<summary>Related Files</summary>

- `actions/setup/js/send_otlp_span.cjs` (primary change)
- `actions/setup/js/send_otlp_span.test.cjs` (test updates)
- `actions/setup/js/action_conclusion_otlp.cjs` (no change — relies on `sendJobConclusionSpan`)
- `actions/setup/js/generate_observability_summary.cjs` (optional follow-up: surface `last_failure.endpoint_host` in the job summary)
- `.github/skills/otel-queries/SKILL.md` (optional follow-up: add an "OTLP export health" query shape)

</details>

---

*Generated by the [Daily OTel Instrumentation Advisor](https://github.com/github/gh-aw/actions/runs/25958660960) workflow*

> Generated by [📊 Daily OTel Instrumentation Advisor](https://github.com/github/gh-aw/actions/runs/25958660960) · ● 8.5M · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fdaily-otel-instrumentation-advisor%22&type=issues)
> - [x] expires on May 23, 2026, 9:45 AM UTC

[otel-advisor] OTel improvement: capture per-endpoint OTLP export failure details (host + status + reason) #32597