[otel-advisor] OTel improvement: capture per-endpoint OTLP export failure details (host + status + reason) #32597

@github-actions

📡 OTel Instrumentation Improvement: capture per-endpoint OTLP export failure details

Analysis Date: 2026-05-16
Priority: High
Effort: Small (< 2h)

Problem

send_otlp_span.cjs records a count of failed OTLP exports in /tmp/gh-aw/otlp-export-errors.count (via recordOTLPExportError() at send_otlp_span.cjs:862 and :870) and surfaces it on the conclusion span as gh-aw.otlp.export_errors. The count tells operators that exports failed, but it does not tell them which endpoint failed, what the HTTP status was, or what the error message said. With multiple endpoints fanned out by sendOTLPToAllEndpoints (one entry per backend in GH_AW_OTLP_ENDPOINTS), the counter cannot distinguish between "Sentry is broken" and "Grafana is broken" — both yield the same opaque integer.

This is not a hypothetical gap. This very advisor run left concrete evidence in the local JSONL mirror:

  • /tmp/gh-aw/otlp-export-errors.count = 2
  • /tmp/gh-aw/otel.jsonl contains exactly one gh-aw.agent.setup span (trace_id=f6b8e8f341d8728e5f8d82cee504af03) with no metadata about the two failures
  • There is no way to determine, from the artifacts alone, which of the two configured endpoints rejected the export or what the response code was

Why This Matters (DevOps Perspective)

OTLP export is silently best-effort by design — errors are deliberately swallowed so they cannot break the workflow (send_otlp_span.cjs:825–874). That is correct for availability, but it shifts the entire diagnostic burden onto post-hoc inspection. Today the on-call path for "why is half my data missing from Grafana?" is:

  1. Open the failed workflow run in the GitHub UI
  2. Expand the relevant job, locate the setup or post step
  3. Scroll the log for OTLP export ... failed lines
  4. Cross-reference the message back to which endpoint URL it came from

That path breaks down once the run's logs have been garbage-collected, when the user has only the artifact, or when the failing worker produces logs too large to scan by hand. The fix is to make the failure mode queryable from the same artifacts and spans that the advisor already trusts as the source of truth. With per-endpoint failure data:

  • An alert can fire on gh-aw.otlp.export.last_failure.endpoint_host = "sentry-otlp.example.com" instead of a generic counter
  • A Grafana panel can group failures by host and HTTP status, separating auth (401/403) from transport (DNS / timeout) from quota (429)
  • MTTR for an exporter outage drops from "page someone to read raw logs" to "read one attribute"
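
The auth / transport / quota split in the second bullet could be expressed as a small classifier. This is a minimal sketch, not part of the proposal: `classifyExportFailure` is a hypothetical name, the sample hosts are made up, and it assumes the convention used later in this issue that a status of 0 means no HTTP response was received.

```javascript
// Hypothetical helper: bucket an export failure by HTTP status.
// Assumes status 0 means "no HTTP response" (DNS failure, timeout, reset).
function classifyExportFailure(status) {
  if (status === 0) return "transport";
  if (status === 401 || status === 403) return "auth";
  if (status === 429) return "quota";
  return "other";
}

// Made-up failure records in the shape proposed for the JSONL file.
const failures = [
  { endpoint_host: "sentry-otlp.example.com", status: 401 },
  { endpoint_host: "grafana-otlp.example.com", status: 0 },
  { endpoint_host: "sentry-otlp.example.com", status: 403 },
];

// Count failures per "host|class" key, as a dashboard group-by would.
const byHostAndClass = {};
for (const f of failures) {
  const key = `${f.endpoint_host}|${classifyExportFailure(f.status)}`;
  byHostAndClass[key] = (byHostAndClass[key] || 0) + 1;
}
console.log(byHostAndClass);
```

A backend would normally do this grouping in its query language; the sketch only shows that host plus status is enough information to make the split.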

Current Behavior

The retry loop only persists a count and a console.warn. No structured artifact is written, and no endpoint identity survives:

// actions/setup/js/send_otlp_span.cjs:856–872 (current)
if (response.ok) {
  return;
}
const msg = `HTTP ${response.status} ${response.statusText}`;
if (attempt < maxRetries) {
  console.warn(`OTLP export attempt ${attempt + 1}/${maxRetries + 1} failed: ${msg}, retrying...`);
} else {
  console.warn(`OTLP export failed after ${maxRetries + 1} attempts: ${msg}`);
  recordOTLPExportError();
}
// ... (catch branch is symmetric — also only logs + bumps the counter)

// actions/setup/js/send_otlp_span.cjs:1216–1223 (current)
function recordOTLPExportError() {
  try {
    fs.mkdirSync("/tmp/gh-aw", { recursive: true });
    fs.writeFileSync(OTLP_EXPORT_ERRORS_PATH, String(readOTLPExportErrorCount() + 1));
  } catch {
    // Export-health tracking is best-effort only.
  }
}

The conclusion span at send_otlp_span.cjs:1533 reads only that count:

attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));

Proposed Change

  1. Extract the endpoint host from the URL (avoid leaking full URLs with embedded credentials) and pass it, plus the failure reason, into recordOTLPExportError.
  2. In addition to incrementing the count, append a structured JSONL line to /tmp/gh-aw/otlp-export-errors.jsonl so the failure list is debuggable from the artifact alone.
  3. Add a readLastOTLPExportError() helper that reads the last line of that file.
  4. Surface three new conclusion-span attributes (gh-aw.otlp.export.last_failure.endpoint_host, gh-aw.otlp.export.last_failure.status, and gh-aw.otlp.export.last_failure.reason) so a single span query identifies the culprit.
// actions/setup/js/send_otlp_span.cjs — proposed

const OTLP_EXPORT_ERRORS_JSONL_PATH = "/tmp/gh-aw/otlp-export-errors.jsonl";

function extractEndpointHost(url) {
  try {
    return new URL(url).host;
  } catch {
    return "";
  }
}

function recordOTLPExportError({ endpoint, status, reason } = {}) {
  try {
    fs.mkdirSync("/tmp/gh-aw", { recursive: true });
    fs.writeFileSync(OTLP_EXPORT_ERRORS_PATH, String(readOTLPExportErrorCount() + 1));
    const entry = {
      ts: new Date().toISOString(),
      endpoint_host: endpoint ? extractEndpointHost(endpoint) : "",
      status: typeof status === "number" ? status : 0,
      reason: typeof reason === "string" ? reason.slice(0, MAX_ATTR_VALUE_LENGTH) : "",
    };
    fs.appendFileSync(OTLP_EXPORT_ERRORS_JSONL_PATH, JSON.stringify(entry) + "\n");
  } catch {
    // Export-health tracking is best-effort only.
  }
}

function readLastOTLPExportError() {
  try {
    const lines = fs.readFileSync(OTLP_EXPORT_ERRORS_JSONL_PATH, "utf8").split("\n").filter(Boolean);
    return lines.length > 0 ? JSON.parse(lines[lines.length - 1]) : null;
  } catch {
    return null;
  }
}

Update the two recordOTLPExportError() call sites in sendOTLPSpan (lines 862 and 870) to pass the endpoint and reason:

recordOTLPExportError({ endpoint: url, status: response.status, reason: msg });
// and in the catch branch:
recordOTLPExportError({ endpoint: url, status: 0, reason: msg });
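
In the catch branch there is no HTTP response, so the reason has to come from the thrown error. The sketch below shows one way to build it, assuming the exporter uses Node's built-in fetch, which usually reports network failures as a `TypeError` with the message "fetch failed" and the underlying error attached as `cause`. `describeExportError` is a hypothetical helper, not something that exists in send_otlp_span.cjs.

```javascript
// Hypothetical helper: build a short reason string from a thrown fetch error.
// Assumes Node's built-in fetch, where the useful detail (for example the
// ENOTFOUND or ECONNREFUSED code) usually lives on err.cause.
function describeExportError(err) {
  if (!(err instanceof Error)) return String(err);
  const cause = err.cause;
  const code = cause && typeof cause === "object" && typeof cause.code === "string" ? cause.code : "";
  return code ? `${err.message} (${code})` : err.message;
}

// Simulated DNS failure, shaped like what fetch would throw.
const dnsError = new TypeError("fetch failed", {
  cause: Object.assign(new Error("getaddrinfo ENOTFOUND otlp.invalid"), { code: "ENOTFOUND" }),
});

console.log(describeExportError(dnsError)); // fetch failed (ENOTFOUND)
```

Including the cause code is what would let a dashboard separate DNS failures from timeouts when the status is 0.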

Enrich the conclusion span near send_otlp_span.cjs:1533:

attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));
const lastExportError = readLastOTLPExportError();
if (lastExportError) {
  if (lastExportError.endpoint_host) {
    attributes.push(buildAttr("gh-aw.otlp.export.last_failure.endpoint_host", lastExportError.endpoint_host));
  }
  if (typeof lastExportError.status === "number" && lastExportError.status > 0) {
    attributes.push(buildAttr("gh-aw.otlp.export.last_failure.status", lastExportError.status));
  }
  if (lastExportError.reason) {
    attributes.push(buildAttr("gh-aw.otlp.export.last_failure.reason", lastExportError.reason));
  }
}

The host (not the full URL) is intentional: the URL may carry path tokens, and the existing sanitizeAttrs regex (send_otlp_span.cjs:660) does not redact URLs. Emitting only the host keeps credentials embedded in the userinfo, path, or query string out of the span attribute and the JSONL file.
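
The behavior of `URL.host` that extractEndpointHost relies on can be checked directly. The URL below is made up for illustration and packs secrets into every part that the host property is expected to drop.

```javascript
// Sample URL with secrets in userinfo, path, and query (all made up).
const sample = new URL("https://user:s3cret@otlp.example.com:4318/v1/traces/tok_abc?key=xyz");

console.log(sample.host);     // otlp.example.com:4318
console.log(sample.hostname); // otlp.example.com

// None of the secret-bearing parts survive in host.
const leaked = ["user", "s3cret", "tok_abc", "xyz"].filter(s => sample.host.includes(s));
console.log(leaked.length === 0); // true
```

Note that `host` keeps a non-default port while `hostname` does not. Keeping the port helps when two collectors share a hostname; `hostname` would be the choice if ports are considered noise.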

Expected Outcome

After this change:

  • In Grafana / Honeycomb / Datadog: queries like gh-aw.otlp.export.last_failure.endpoint_host = "<host>" and group-by gh-aw.otlp.export.last_failure.status become possible. Alerting on a specific failing endpoint replaces alerting on "any export failed somewhere."
  • In the JSONL mirror: /tmp/gh-aw/otlp-export-errors.jsonl exists alongside the existing .count file, giving the artifact downloader a per-failure record (timestamp + host + status + reason) with no live collector required.
  • For on-call engineers: the conclusion span answers "which backend rejected my data, and what did it say" in one trace lookup, rather than requiring a log-dive into the failing job.
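
To illustrate the artifact-only workflow from the second bullet, here is a hedged sketch of an offline summary over the proposed JSONL file. `summarizeExportErrors` is a hypothetical helper, the entries are made up, and the demo writes to a throwaway temp directory rather than /tmp/gh-aw.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: count failures per "host status" pair in a JSONL file.
// Malformed lines are skipped, mirroring the best-effort nature of the file.
function summarizeExportErrors(jsonlPath) {
  const counts = new Map();
  let text = "";
  try {
    text = fs.readFileSync(jsonlPath, "utf8");
  } catch {
    return counts; // a missing file means no recorded failures
  }
  for (const line of text.split("\n")) {
    if (!line.trim()) continue;
    try {
      const { endpoint_host = "", status = 0 } = JSON.parse(line);
      const key = `${endpoint_host || "(unknown)"} ${status}`;
      counts.set(key, (counts.get(key) || 0) + 1);
    } catch {
      // Skip partially written or corrupt lines.
    }
  }
  return counts;
}

// Demo against a throwaway file with made-up entries, not the real artifact.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "otlp-errors-"));
const file = path.join(dir, "otlp-export-errors.jsonl");
const ts = "2026-05-16T00:00:00.000Z";
const lines = [
  JSON.stringify({ ts, endpoint_host: "sentry-otlp.example.com", status: 401, reason: "HTTP 401 Unauthorized" }),
  JSON.stringify({ ts, endpoint_host: "grafana-otlp.example.com", status: 0, reason: "fetch failed" }),
  JSON.stringify({ ts, endpoint_host: "sentry-otlp.example.com", status: 401, reason: "HTTP 401 Unauthorized" }),
  '{"ts":"truncated', // simulates a partially written line
];
fs.writeFileSync(file, lines.join("\n") + "\n");

const summary = summarizeExportErrors(file);
console.log(summary);
fs.rmSync(dir, { recursive: true, force: true });
```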

Implementation Steps

  • Edit actions/setup/js/send_otlp_span.cjs:
    • Add OTLP_EXPORT_ERRORS_JSONL_PATH constant near OTLP_EXPORT_ERRORS_PATH (line 1159)
    • Add extractEndpointHost(url) helper
    • Change recordOTLPExportError() signature to accept { endpoint, status, reason } and append a JSONL entry
    • Add readLastOTLPExportError() helper near readOTLPExportErrorCount (line 1201)
    • Update both call sites at lines 862 and 870 to pass { endpoint: url, status, reason }
    • Enrich the conclusion-span attribute block near line 1533 with the three new attributes
  • Update actions/setup/js/send_otlp_span.test.cjs to assert the new attributes are present when failures are simulated and absent when no failures occurred
  • Run cd actions/setup/js && npx vitest run send_otlp_span.test.cjs action_conclusion_otlp.test.cjs to confirm tests pass
  • Run make fmt to ensure formatting
  • Open a PR referencing this issue

Evidence from Live Telemetry

Live JSONL mirror collected during this advisor run (workflow run §25958660960) — telemetry source #2 in .github/skills/otel-queries/SKILL.md (the local JSONL mirror, used because the Sentry MCP CLI exposed an empty tool list in this environment, so the static playbook fell back to local artifacts):

  • /tmp/gh-aw/otlp-export-errors.count → 2 (proves real export failures occurred during this very run)
  • /tmp/gh-aw/otel.jsonl → exactly one span:
    • trace_id: f6b8e8f341d8728e5f8d82cee504af03
    • span name: gh-aw.agent.setup
    • parent_span_id: 941d27aac50e588e
    • status.code: 1 (OK)
    • All five canonical resource attributes (service.version=2.1.142, github.repository=github/gh-aw, github.run_id=25958660960, github.event_name=schedule, deployment.environment=production) are present
  • The span carries no reference to the two failures, and the JSONL file does not include any record of the failed endpoints or HTTP responses

The gap is therefore confirmed by the live data, not just by code inspection.

Related Files
  • actions/setup/js/send_otlp_span.cjs (primary change)
  • actions/setup/js/send_otlp_span.test.cjs (test updates)
  • actions/setup/js/action_conclusion_otlp.cjs (no change — relies on sendJobConclusionSpan)
  • actions/setup/js/generate_observability_summary.cjs (optional follow-up: surface last_failure.endpoint_host in the job summary)
  • .github/skills/otel-queries/SKILL.md (optional follow-up: add an "OTLP export health" query shape)

Generated by the Daily OTel Instrumentation Advisor workflow

  • expires on May 23, 2026, 9:45 AM UTC
