📡 OTel Instrumentation Improvement: capture per-endpoint OTLP export failure details
Analysis Date: 2026-05-16
Priority: High
Effort: Small (< 2h)
Problem
send_otlp_span.cjs records a count of failed OTLP exports in /tmp/gh-aw/otlp-export-errors.count (via recordOTLPExportError() at send_otlp_span.cjs:862 and :870) and surfaces it on the conclusion span as gh-aw.otlp.export_errors. The count tells operators that exports failed, but it does not tell them which endpoint failed, what the HTTP status was, or what the error message said. With multiple endpoints fanned out by sendOTLPToAllEndpoints (one entry per backend in GH_AW_OTLP_ENDPOINTS), the counter cannot distinguish between "Sentry is broken" and "Grafana is broken" — both yield the same opaque integer.
This is not a hypothetical gap. Running this very advisor produced concrete evidence in the local JSONL mirror during this run:
- /tmp/gh-aw/otlp-export-errors.count = 2
- /tmp/gh-aw/otel.jsonl contains exactly one gh-aw.agent.setup span (trace_id=f6b8e8f341d8728e5f8d82cee504af03) with no metadata about the two failures
- There is no way to determine, from the artifacts alone, which of the two configured endpoints rejected the export or what the response code was
Why This Matters (DevOps Perspective)
OTLP export is silently best-effort by design — errors are deliberately swallowed so they cannot break the workflow (send_otlp_span.cjs:825–874). That is correct for availability, but it shifts the entire diagnostic burden onto post-hoc inspection. Today the on-call path for "why is half my data missing from Grafana?" is:
- Open the failed workflow run in the GitHub UI
- Expand the relevant job, locate the setup or post step
- Scroll the log for OTLP export ... failed lines
- Cross-reference the message back to which endpoint URL it came from
That path does not work once the run's logs have been garbage-collected, when the user only has the artifact, or when the failure happens on a worker that produces large logs. The fix is to make the failure mode queryable from the same artifacts and spans that the advisor already trusts as the source of truth. With per-endpoint failure data:
- An alert can fire on gh-aw.otlp.export.last_failure.endpoint_host = "sentry-otlp.example.com" instead of a generic counter
- A Grafana panel can group failures by host and HTTP status, separating auth (401/403) from transport (DNS / timeout) from quota (429)
- MTTR for an exporter outage drops from "page someone to read raw logs" to "read one attribute"
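The grouping described above can be sketched as follows. This is a minimal illustration and not part of the proposed patch: classifyExportFailure and groupFailures are hypothetical helper names, the sample records are made up, and the record shape (endpoint_host, status, reason) is assumed to match the JSONL entry proposed in this issue, with status 0 standing for a failure that produced no HTTP response.

```javascript
// Hypothetical helper: map an HTTP status to a coarse failure class.
// Status 0 is assumed to mean "no HTTP response" (DNS, timeout, reset).
function classifyExportFailure(status) {
  if (status === 401 || status === 403) return "auth";
  if (status === 429) return "quota";
  if (status === 0) return "transport";
  return "other";
}

// Hypothetical helper: count failures per "<host> <class>" key.
function groupFailures(entries) {
  const groups = {};
  for (const { endpoint_host, status } of entries) {
    const key = `${endpoint_host || "unknown"} ${classifyExportFailure(status)}`;
    groups[key] = (groups[key] || 0) + 1;
  }
  return groups;
}

// Made-up sample records, for illustration only.
const sampleFailures = [
  { endpoint_host: "sentry-otlp.example.com", status: 401, reason: "HTTP 401 Unauthorized" },
  { endpoint_host: "sentry-otlp.example.com", status: 403, reason: "HTTP 403 Forbidden" },
  { endpoint_host: "grafana-otlp.example.com", status: 0, reason: "fetch failed" },
];

console.log(groupFailures(sampleFailures));
```

A backend query would normally do this grouping server-side; the point of the sketch is only that host plus status is enough to separate the three failure classes.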
Current Behavior
The retry loop only persists a count and a console.warn. No structured artifact is written, and no endpoint identity survives:
// actions/setup/js/send_otlp_span.cjs:856–872 (current)
if (response.ok) {
  return;
}
const msg = `HTTP ${response.status} ${response.statusText}`;
if (attempt < maxRetries) {
  console.warn(`OTLP export attempt ${attempt + 1}/${maxRetries + 1} failed: ${msg}, retrying...`);
} else {
  console.warn(`OTLP export failed after ${maxRetries + 1} attempts: ${msg}`);
  recordOTLPExportError();
}
// ... (catch branch is symmetric: it also only logs and bumps the counter)

// actions/setup/js/send_otlp_span.cjs:1216–1223 (current)
function recordOTLPExportError() {
  try {
    fs.mkdirSync("/tmp/gh-aw", { recursive: true });
    fs.writeFileSync(OTLP_EXPORT_ERRORS_PATH, String(readOTLPExportErrorCount() + 1));
  } catch {
    // Export-health tracking is best-effort only.
  }
}
The conclusion span at send_otlp_span.cjs:1533 reads only that count:
attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));
Proposed Change
- Pass the endpoint URL, the HTTP status, and the failure reason into recordOTLPExportError, which records only the endpoint host (to avoid leaking full URLs with embedded credentials).
- In addition to incrementing the count, append a structured JSONL line to /tmp/gh-aw/otlp-export-errors.jsonl so the failure list is debuggable from the artifact alone.
- Add a readLastOTLPExportError() helper that reads the last line of that file.
- Surface three new conclusion-span attributes (gh-aw.otlp.export.last_failure.endpoint_host, gh-aw.otlp.export.last_failure.status, and gh-aw.otlp.export.last_failure.reason) so a single span query identifies the culprit.
// actions/setup/js/send_otlp_span.cjs (proposed)
const OTLP_EXPORT_ERRORS_JSONL_PATH = "/tmp/gh-aw/otlp-export-errors.jsonl";

// Returns host[:port] only, so userinfo, path, and query never reach the record.
function extractEndpointHost(url) {
  try {
    return new URL(url).host;
  } catch {
    return "";
  }
}

// Bumps the failure counter and appends one JSONL record per failed export.
// MAX_ATTR_VALUE_LENGTH is assumed to be defined elsewhere in this file.
function recordOTLPExportError({ endpoint, status, reason } = {}) {
  try {
    fs.mkdirSync("/tmp/gh-aw", { recursive: true });
    fs.writeFileSync(OTLP_EXPORT_ERRORS_PATH, String(readOTLPExportErrorCount() + 1));
    const entry = {
      ts: new Date().toISOString(),
      endpoint_host: endpoint ? extractEndpointHost(endpoint) : "",
      status: typeof status === "number" ? status : 0, // 0 = no HTTP response
      reason: typeof reason === "string" ? reason.slice(0, MAX_ATTR_VALUE_LENGTH) : "",
    };
    fs.appendFileSync(OTLP_EXPORT_ERRORS_JSONL_PATH, JSON.stringify(entry) + "\n");
  } catch {
    // Export-health tracking is best-effort only.
  }
}

// Returns the most recent failure record, or null if none can be read.
function readLastOTLPExportError() {
  try {
    const lines = fs.readFileSync(OTLP_EXPORT_ERRORS_JSONL_PATH, "utf8").split("\n").filter(Boolean);
    return lines.length > 0 ? JSON.parse(lines[lines.length - 1]) : null;
  } catch {
    return null;
  }
}
Update the two recordOTLPExportError() call sites in sendOTLPSpan (lines 862 and 870) to pass the endpoint, status, and reason:
recordOTLPExportError({ endpoint: url, status: response.status, reason: msg });
// and in the catch branch:
recordOTLPExportError({ endpoint: url, status: 0, reason: msg });
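To show how the two call sites behave end to end, here is a minimal runnable sketch. It is a simplified stand-in, not the real sendOTLPSpan: exportWithRetry is a hypothetical name, fetchImpl and record are injected so nothing touches the network or the filesystem, the endpoint URL is made up, and the HTTP and catch branches are folded into a single record call for brevity.

```javascript
// Simplified stand-in for the retry loop in sendOTLPSpan; not the real function.
// fetchImpl and record are injected so the sketch runs without a network.
async function exportWithRetry(url, body, { fetchImpl, record, maxRetries = 2 }) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    let status = 0; // 0 means no HTTP response was received
    let msg;
    try {
      const response = await fetchImpl(url, { method: "POST", body });
      if (response.ok) return true;
      status = response.status;
      msg = `HTTP ${response.status} ${response.statusText}`;
    } catch (err) {
      msg = err instanceof Error ? err.message : String(err);
    }
    if (attempt === maxRetries) {
      // Final attempt failed: hand over endpoint, status, and reason.
      record({ endpoint: url, status, reason: msg });
    }
  }
  return false;
}

// Usage with fake fetch implementations (hypothetical endpoint URL).
const recorded = [];
const record = entry => recorded.push(entry);
const failing503 = async () => ({ ok: false, status: 503, statusText: "Service Unavailable" });
const throwing = async () => {
  throw new Error("connect ETIMEDOUT");
};

const done = (async () => {
  await exportWithRetry("https://otlp.example.com/v1/traces", "{}", { fetchImpl: failing503, record });
  await exportWithRetry("https://otlp.example.com/v1/traces", "{}", { fetchImpl: throwing, record });
  console.log(recorded);
})();
```

Each exhausted export yields exactly one record: the HTTP failure carries its status code, while the thrown error is recorded with status 0 and the error message as the reason.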
Enrich the conclusion span near send_otlp_span.cjs:1533:
attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));
const lastExportError = readLastOTLPExportError();
if (lastExportError) {
  if (lastExportError.endpoint_host) {
    attributes.push(buildAttr("gh-aw.otlp.export.last_failure.endpoint_host", lastExportError.endpoint_host));
  }
  if (typeof lastExportError.status === "number" && lastExportError.status > 0) {
    attributes.push(buildAttr("gh-aw.otlp.export.last_failure.status", lastExportError.status));
  }
  if (lastExportError.reason) {
    attributes.push(buildAttr("gh-aw.otlp.export.last_failure.reason", lastExportError.reason));
  }
}
The host (not the full URL) is intentional — the URL may carry path tokens, and the existing sanitizeAttrs regex (send_otlp_span.cjs:660) does not redact URLs, so emitting the host avoids any chance of credential leakage.
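A quick way to see what survives host extraction is to run the helper against a URL that carries secrets in several places. The endpoint URLs below are made up for illustration; the helper body is the same as in the proposal above.

```javascript
// Same helper as in the proposal: keep only host[:port] from an endpoint URL.
function extractEndpointHost(url) {
  try {
    return new URL(url).host;
  } catch {
    return "";
  }
}

// Hypothetical endpoint with userinfo, a path token, and a query secret.
const leaky = "https://user:s3cret@otlp.example.com:4318/v1/traces/tok_123?key=abc";

console.log(extractEndpointHost(leaky)); // otlp.example.com:4318
console.log(extractEndpointHost("https://otlp.example.com/v1/traces")); // otlp.example.com
console.log(JSON.stringify(extractEndpointHost("not a url"))); // ""
```

Note that URL.host keeps a non-default port, which helps distinguish two collectors on the same machine; URL.hostname would drop it if a port-free value is preferred.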
Expected Outcome
After this change:
- In Grafana / Honeycomb / Datadog: queries like gh-aw.otlp.export.last_failure.endpoint_host = "<host>" and group-by gh-aw.otlp.export.last_failure.status become possible. Alerting on a specific failing endpoint replaces alerting on "any export failed somewhere."
- In the JSONL mirror: /tmp/gh-aw/otlp-export-errors.jsonl exists alongside the existing .count file, giving the artifact downloader a per-failure record (timestamp + host + status + reason) with no live collector required.
- For on-call engineers: the conclusion span answers "which backend rejected my data, and what did it say" in one trace lookup, rather than requiring a log-dive into the failing job.
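As an illustration of the artifact-only workflow, the sketch below reads a failure log and prints one line per record. It is a hedged example rather than part of the change: readExportErrors is a hypothetical helper, the records are made up, and the file is written to a temporary directory instead of /tmp/gh-aw so the example is safe to run anywhere.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical reader: parse a JSONL failure log, skipping malformed lines.
function readExportErrors(file) {
  let text;
  try {
    text = fs.readFileSync(file, "utf8");
  } catch {
    return []; // No file means no recorded failures.
  }
  const entries = [];
  for (const line of text.split("\n")) {
    if (!line.trim()) continue;
    try {
      entries.push(JSON.parse(line));
    } catch {
      // A partially written line should not hide the other records.
    }
  }
  return entries;
}

// Build a throwaway artifact with made-up records so the sketch is runnable.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "otlp-errors-"));
const artifact = path.join(dir, "otlp-export-errors.jsonl");
fs.writeFileSync(
  artifact,
  [
    JSON.stringify({ ts: "2026-05-16T00:00:00.000Z", endpoint_host: "sentry-otlp.example.com", status: 401, reason: "HTTP 401 Unauthorized" }),
    "{truncated",
    JSON.stringify({ ts: "2026-05-16T00:00:05.000Z", endpoint_host: "grafana-otlp.example.com", status: 0, reason: "fetch failed" }),
  ].join("\n") + "\n"
);

const errors = readExportErrors(artifact);
for (const e of errors) {
  console.log(`${e.ts} ${e.endpoint_host} status=${e.status} ${e.reason}`);
}
```

Unlike the proposed readLastOTLPExportError, which parses only the final line, this reader tolerates a truncated line in the middle of the file; whether that extra tolerance is worth it for the conclusion span is a design choice left to the implementer.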
Implementation Steps
- In actions/setup/js/send_otlp_span.cjs:
  - Add the OTLP_EXPORT_ERRORS_JSONL_PATH constant near OTLP_EXPORT_ERRORS_PATH (line 1159)
  - Add the extractEndpointHost(url) helper
  - Change the recordOTLPExportError() signature to accept { endpoint, status, reason } and append a JSONL entry
  - Add the readLastOTLPExportError() helper near readOTLPExportErrorCount (line 1201)
  - Pass { endpoint: url, status, reason } at both call sites
- Update actions/setup/js/send_otlp_span.test.cjs to assert the new attributes are present when failures are simulated and absent when no failures occurred
- Run cd actions/setup/js && npx vitest run send_otlp_span.test.cjs action_conclusion_otlp.test.cjs to confirm tests pass
- Run make fmt to ensure formatting
Evidence from Live Telemetry
Live JSONL mirror collected during this advisor run (workflow run §25958660960), telemetry source #2 in .github/skills/otel-queries/SKILL.md (the local JSONL mirror, used because the Sentry MCP CLI exposed an empty tool list in this environment, so the static playbook fell back to local artifacts):
- /tmp/gh-aw/otlp-export-errors.count → 2 (proves real export failures occurred during this very run)
- /tmp/gh-aw/otel.jsonl → exactly one span:
  - trace_id: f6b8e8f341d8728e5f8d82cee504af03
  - span name: gh-aw.agent.setup
  - parent_span_id: 941d27aac50e588e
  - status.code: 1 (OK)
  - All five canonical resource attributes (service.version=2.1.142, github.repository=github/gh-aw, github.run_id=25958660960, github.event_name=schedule, deployment.environment=production) are present
- The span carries no reference to the two failures, and the JSONL file does not include any record of the failed endpoints or HTTP responses
The gap is therefore confirmed by the live data, not just by code inspection.
Related Files
- actions/setup/js/send_otlp_span.cjs (primary change)
- actions/setup/js/send_otlp_span.test.cjs (test updates)
- actions/setup/js/action_conclusion_otlp.cjs (no change; relies on sendJobConclusionSpan)
- actions/setup/js/generate_observability_summary.cjs (optional follow-up: surface last_failure.endpoint_host in the job summary)
- .github/skills/otel-queries/SKILL.md (optional follow-up: add an "OTLP export health" query shape)
Generated by the Daily OTel Instrumentation Advisor workflow