
Capture per-endpoint OTLP export failure details on conclusion spans#32631

Merged
pelikhan merged 10 commits into main from copilot/otel-advisor-capture-otlp-export-details on May 16, 2026


Copilot AI (Contributor) commented May 16, 2026

OTLP export failures were only surfaced as a single counter, which made multi-endpoint failures opaque. When one backend rejected exports, operators could not tell which collector failed or why from the run artifacts or conclusion span.

  • Persist per-endpoint failure details

    • Extend OTLP export error recording to append a JSONL entry for each terminal failure.
    • Each entry captures:
      • collector host
      • HTTP status when available
      • failure reason/message
  • Surface details on conclusion spans

    • Keep the existing gh-aw.otlp.export_errors count.
    • Add gh-aw.otlp.export_error_details so the conclusion span carries endpoint-specific failure context alongside the aggregate count.
  • Preserve current export behavior

    • Export failures remain non-fatal.
    • Multi-endpoint fan-out behavior is unchanged; the improvement is observability only.
  • Focused coverage

    • Add tests for:
      • recording host/status/reason on failed OTLP exports
      • emitting aggregated per-endpoint failure details on conclusion spans
      • coexistence with the existing JSONL mirror and failure counter
Example:

```js
recordOTLPExportError({
  endpoint: "https://collector.example.com:4318",
  status: 401,
  reason: "Unauthorized",
});

// Resulting conclusion span attribute:
// "gh-aw.otlp.export_error_details": "collector.example.com:4318 status=401 reason=Unauthorized"
```
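A minimal sketch of how one such JSONL entry could be written, assuming a Node.js runtime and that the host is derived with `new URL(endpoint).host` (which keeps a non-default port such as `:4318`). `recordExportErrorDetail`, `hostFromEndpoint`, and the `filePath` parameter are hypothetical names for illustration, not the code in `send_otlp_span.cjs`.

```javascript
const fs = require("fs");
const path = require("path");

// Path named in the PR; the real module may resolve it differently.
const OTLP_EXPORT_ERROR_DETAILS_PATH = "/tmp/gh-aw/otlp-export-errors.jsonl";

// Reduce an endpoint URL to "host[:port]"; returns "" for input that is not a URL.
function hostFromEndpoint(endpoint) {
  try {
    return new URL(endpoint).host;
  } catch {
    return "";
  }
}

// Hypothetical recorder: append one JSONL line per terminal export failure.
// Returns true when a line was written; never throws (best-effort diagnostics).
function recordExportErrorDetail({ endpoint, status, reason }, filePath = OTLP_EXPORT_ERROR_DETAILS_PATH) {
  const host = hostFromEndpoint(endpoint);
  const trimmedReason = typeof reason === "string" ? reason.trim() : "";
  if (host === "" || trimmedReason === "") {
    return false;
  }
  const entry = {
    host,
    // Keep status only when it looks like a real HTTP status code.
    ...(Number.isInteger(status) && status > 0 ? { status } : {}),
    reason: trimmedReason,
  };
  try {
    fs.mkdirSync(path.dirname(filePath), { recursive: true });
    fs.appendFileSync(filePath, `${JSON.stringify(entry)}\n`);
    return true;
  } catch {
    // A failed write must not turn a non-fatal export failure into a fatal one.
    return false;
  }
}
```

Skipping entries with an empty host or reason means a reader applying a guard like isValidOTLPExportErrorDetail never has to discard them later.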

Copilot AI and others added 2 commits May 16, 2026 14:31
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Improve OTel export failure details for endpoints" to "Capture per-endpoint OTLP export failure details on conclusion spans" May 16, 2026
Copilot AI requested a review from pelikhan May 16, 2026 14:34
@pelikhan pelikhan marked this pull request as ready for review May 16, 2026 14:38
Copilot AI review requested due to automatic review settings May 16, 2026 14:38
Copilot AI (Contributor) left a comment

Pull request overview

Adds per-endpoint OTLP export failure detail capture and surfaces those details on conclusion spans.

Changes:

  • Records terminal OTLP export failures with endpoint host, status, and reason in a JSONL file.
  • Adds gh-aw.otlp.export_error_details to conclusion spans from persisted failure details.
  • Updates tests for failure detail recording, JSONL mirror coexistence, and conclusion span attributes.
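A sketch of how both attributes could sit side by side on the conclusion span, written in the OTLP/JSON KeyValue shape (`stringValue`, `intValue`, with 64-bit integers encoded as decimal strings). `exportErrorAttributes` is a hypothetical helper; the real module builds attributes with its own `buildAttr`, and omitting the details attribute when the summary is empty is an assumption.

```javascript
// Hypothetical helper: the aggregate count plus the per-endpoint summary.
function exportErrorAttributes(errorCount, detailsSummary) {
  const attributes = [
    // OTLP/JSON encodes 64-bit integers as decimal strings.
    { key: "gh-aw.otlp.export_errors", value: { intValue: String(errorCount) } },
  ];
  // Assumption: an empty summary means "no details recorded", so the attribute is skipped.
  if (detailsSummary !== "") {
    attributes.push({
      key: "gh-aw.otlp.export_error_details",
      value: { stringValue: detailsSummary },
    });
  }
  return attributes;
}
```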
Summary per file

| File | Description |
| --- | --- |
| actions/setup/js/send_otlp_span.cjs | Implements OTLP export error detail persistence and conclusion span enrichment. |
| actions/setup/js/send_otlp_span.test.cjs | Adds and updates tests for export error details and mirror behavior. |

Copilot's findings


  • Files reviewed: 2/2 changed files
  • Comments generated: 6

Comment thread actions/setup/js/send_otlp_span.cjs Outdated
```js
/**
 * @returns {entry is OTLPExportErrorDetail}
 */
function isValidOTLPExportErrorDetail(entry) {
  return entry !== null && typeof entry === "object" && !Array.isArray(entry) &&
    typeof entry.host === "string" && entry.host.trim() !== "" &&
    typeof entry.reason === "string" && entry.reason.trim() !== "";
}
```
Comment thread actions/setup/js/send_otlp_span.cjs Outdated
```js
return content
  .split("\n")
  .filter(line => line.trim() !== "")
  .map(line => JSON.parse(line))
```
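Parsing the whole file with a single `.map(line => JSON.parse(line))` means one malformed line (for example a partially written last entry) makes the entire read throw. A sketch of per-line parsing that skips bad lines instead; `readExportErrorDetails` and `isValidDetail` are hypothetical names, and the shape check copies the isValidOTLPExportErrorDetail guard quoted in this review.

```javascript
const fs = require("fs");

// Same shape check as the isValidOTLPExportErrorDetail guard.
function isValidDetail(entry) {
  return (
    entry !== null &&
    typeof entry === "object" &&
    !Array.isArray(entry) &&
    typeof entry.host === "string" &&
    entry.host.trim() !== "" &&
    typeof entry.reason === "string" &&
    entry.reason.trim() !== ""
  );
}

// Hypothetical reader: parse each JSONL line independently so one corrupted
// line is skipped instead of aborting the whole read.
function readExportErrorDetails(filePath) {
  let content;
  try {
    content = fs.readFileSync(filePath, "utf8");
  } catch {
    return []; // A missing file means no failures were recorded.
  }
  const details = [];
  for (const line of content.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const entry = JSON.parse(line);
      if (isValidDetail(entry)) details.push(entry);
    } catch {
      // Skip malformed lines.
    }
  }
  return details;
}
```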
Comment thread actions/setup/js/send_otlp_span.cjs Outdated
```js
recordOTLPExportError({
  endpoint,
  ...(Number.isInteger(response.status) && response.status > 0 ? { status: response.status } : {}),
  reason: response.statusText || msg,
});
```
Comment on lines 880 to +881
```diff
 console.warn(`OTLP export error after ${maxRetries + 1} attempts: ${msg}`);
-recordOTLPExportError();
+recordOTLPExportError({ endpoint, reason: msg });
```
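The HTTP-error path and the exception path hand the recorder different arguments: only the former has a status. A sketch that makes the two shapes explicit; `describeExportFailure` and the `{ response }` / `{ error }` outcome objects are hypothetical, and the `HTTP 502`-style fallback reason is an assumption (HTTP/2 responses carry no reason phrase, so `statusText` can be empty).

```javascript
// Hypothetical helper: turn one terminal export outcome into the argument
// passed to the recorder. `outcome` is either { response } for a non-2xx
// HTTP reply or { error } for a thrown exception (e.g. a rejected fetch).
function describeExportFailure(endpoint, outcome) {
  if (outcome.response) {
    const { status, statusText } = outcome.response;
    return {
      endpoint,
      ...(Number.isInteger(status) && status > 0 ? { status } : {}),
      // Assumed fallback when the server sent no reason phrase.
      reason: statusText || `HTTP ${status}`,
    };
  }
  const message = outcome.error instanceof Error ? outcome.error.message : String(outcome.error);
  // Network errors have no HTTP status, so the field is omitted entirely.
  return { endpoint, reason: message };
}
```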
Comment thread actions/setup/js/send_otlp_span.cjs Outdated
```js
if (details.length === 0) {
  return "";
}
return details.map(formatOTLPExportErrorDetail).join(" | ");
```
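A sketch of a per-entry formatter that reproduces the example value from the PR description (`collector.example.com:4318 status=401 reason=Unauthorized`) and the `" | "` join. `formatDetail` and `formatDetails` are hypothetical; leaving out `status=` when no status was recorded is an assumption.

```javascript
// Hypothetical formatter: "HOST status=CODE reason=TEXT", with the
// status part omitted when the entry has no status (assumption).
function formatDetail({ host, status, reason }) {
  const parts = [host];
  if (status !== undefined) parts.push(`status=${status}`);
  parts.push(`reason=${reason}`);
  return parts.join(" ");
}

// Join all entries; an empty list yields "" like the quoted snippet.
function formatDetails(details) {
  return details.length === 0 ? "" : details.map(formatDetail).join(" | ");
}
```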
Comment on lines +1172 to +1176
```js
/**
 * Path to the persisted OTLP export failure detail log.
 * @type {string}
 */
const OTLP_EXPORT_ERROR_DETAILS_PATH = "/tmp/gh-aw/otlp-export-errors.jsonl";
```
github-actions bot mentioned this pull request May 16, 2026
github-actions bot (Contributor) commented

🧪 Test Quality Sentinel Report

Test Quality Score: 92/100

Excellent test quality

| Metric | Value |
| --- | --- |
| New/modified tests analyzed | 2 |
| ✅ Design tests (behavioral contracts) | 2 (100%) |
| ⚠️ Implementation tests (low value) | 0 (0%) |
| Tests with error/edge cases | 2 (100%) |
| Duplicate test clusters | 0 |
| Test inflation detected | No (ratio 0.35:1) |
| 🚨 Coding-guideline violations | None |

Test Classification Details

| Test | File | Classification | Issues Detected |
| --- | --- | --- | --- |
| "records OTLP export failure host, status, and reason details" | actions/setup/js/send_otlp_span.test.cjs | ✅ Design | None; verifies the file-write side effect with the correct host/status/reason payload |
| "emits gh-aw.otlp.export_error_details on the conclusion job span" | actions/setup/js/send_otlp_span.test.cjs | ✅ Design | None; asserts the observable span attribute value and format |

Scoring Breakdown

| Component | Score | Reason |
| --- | --- | --- |
| Behavioral Coverage (40 pts) | 40/40 | Both new tests verify observable outputs and side effects |
| Error/Edge Case Coverage (30 pts) | 28/30 | Both cover failure paths; minor deduction because test 1 uses only a single error code |
| Low Duplication (20 pts) | 20/20 | No duplicate assertion clusters detected |
| Proportional Growth (10 pts) | 4/10 | Test file added 44 lines vs. 126 in production, well under the 2:1 inflation threshold; deduction applied because the new tests cover only 2 of the several new code paths (e.g., no test for export_error_details truncation when details exceed a character limit, if any) |

Final score: 92/100


Flagged Tests — Requires Review

No tests flagged. Both new tests are strong behavioral tests.

Minor observation — Test 1 (records OTLP export failure host, status, and reason details) asserts via a fs.appendFileSync spy. Spying on Node.js built-in I/O is an accepted pattern in this codebase; the assertion on the content written (the exact JSON object with host, status, reason keys) makes this a behavioral contract rather than an implementation detail.


Language Support

Tests analyzed:

  • 🐹 Go (*_test.go): 0 tests
  • 🟨 JavaScript (*.test.cjs): 2 new tests (vitest)

Verdict

Check passed. 0% of new tests are implementation tests (threshold: 30%). Both new tests directly verify the behavioral contract introduced by this PR: OTLP export failure details are (1) persisted to a JSONL file and (2) surfaced as a formatted gh-aw.otlp.export_error_details span attribute on conclusion spans.


📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls (mocking internals)
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

References: §25964635944

🧪 Test quality analysis by Test Quality Sentinel · ● 7.3M ·

github-actions bot (Contributor) left a comment

Skills-Based Review 🧠

Applied /tdd (feature addition with tests) and a light /zoom-out pass (observability pipeline changes).

Key Themes

  • Test coverage gap on the exception path: The new test exercises the HTTP-error branch of sendOTLPSpan but not the catch branch (network exception), which also calls recordOTLPExportError with different arguments. One more test would give full behavioral coverage of both paths. See inline comment.
  • Silent truncation of aggregate attribute: formatOTLPExportErrorDetails has no explicit length cap; the sanitization layer will silently truncate mid-entry when there are many failing endpoints. Adding an explicit .slice(0, MAX_ATTR_VALUE_LENGTH) (or capping entry count) makes the boundary intentional and testable. See inline comment.
  • Minor normalization inconsistency: reason is sliced but not trimmed, while host is trimmed. Easily fixed and worth a small test for the whitespace edge case.

Positive Highlights

  • ✅ Clean decomposition into small, single-purpose helpers (getOTLPExportErrorHost, isValidOTLPExportErrorStatus, normalizeOTLPExportErrorDetail, formatOTLPExportErrorDetail) — each is testable in isolation.
  • ✅ The fix to the existing JSONL-mirror retry test (using find instead of toHaveBeenCalledOnce) is a nice improvement that makes that test correctly scoped to its one concern.
  • ✅ Graceful degradation: all recording is best-effort (catch {}) with no change to export behavior.
  • ✅ The isValidOTLPExportErrorDetail guard on readOTLPExportErrorDetails means a corrupted JSONL file can't cause a conclusion-span failure.

Verdict

No blocking issues; the two substantive observations (exception-path test, truncation boundary) are worth addressing but the core implementation is solid.

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · ● 5.3M

```js
expect(appendSpy).toHaveBeenCalledWith(
  "/tmp/gh-aw/otlp-export-errors.jsonl",
  `${JSON.stringify({ host: "collector.example.com:4318", status: 401, reason: "Unauthorized" })}\n`,
);

appendSpy.mockRestore();
});
```

[/tdd] The new test covers the HTTP-error response path but not the network exception path (the catch block in sendOTLPSpan). Both branches now call recordOTLPExportError with slightly different arguments — the catch path omits status. A second test exercising a rejected fetch promise would complete the behavioral specification:

```js
it("records OTLP export failure details on network error", async () => {
  const mockFetch = vi.fn().mockRejectedValue(new Error("connect ECONNREFUSED"));
  vi.stubGlobal("fetch", mockFetch);
  const appendSpy = vi.spyOn(fs, "appendFileSync").mockImplementation(() => {});

  await sendOTLPSpan("(collector.example.com/redacted)", {}, { maxRetries: 0, skipJSONL: true });

  expect(appendSpy).toHaveBeenCalledWith(
    "/tmp/gh-aw/otlp-export-errors.jsonl",
    expect.stringContaining('"host":"collector.example.com:4318"')
  );
  appendSpy.mockRestore();
});
```

Without this, a regression where only one branch records details would go undetected.

Comment thread actions/setup/js/send_otlp_span.cjs Outdated
```js
if (details.length === 0) {
  return "";
}
return details.map(formatOTLPExportErrorDetail).join(" | ");
```

[/tdd] formatOTLPExportErrorDetails joins all endpoint strings with " | " but applies no explicit length cap before the value is passed to buildAttr. With enough failing endpoints the string can exceed MAX_ATTR_VALUE_LENGTH (1024), and it will be silently truncated mid-entry by the sanitizeOTLPPayload layer — leaving an operator with partial, potentially unparseable output.

Consider capping the aggregate string explicitly so the truncation is predictable:

```js
function formatOTLPExportErrorDetails() {
  const details = readOTLPExportErrorDetails();
  if (details.length === 0) return "";
  return details
    .map(formatOTLPExportErrorDetail)
    .join(" | ")
    .slice(0, MAX_ATTR_VALUE_LENGTH);
}
```

Alternatively, limit the number of entries read (e.g. details.slice(0, 10)) to keep each entry intact.
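The two suggestions can also be combined: keep whole entries while they fit the 1024-character budget and stop before the first one that would overflow, so no entry is cut mid-way. A sketch, with `joinWithinBudget` as a hypothetical name; an oversized first entry is still sliced so the attribute is never empty when failures exist, and the gh-aw.otlp.export_errors count still reports the total when later entries are dropped.

```javascript
// Assumption: 1024 matches the MAX_ATTR_VALUE_LENGTH mentioned in the review.
const MAX_ATTR_VALUE_LENGTH = 1024;
const SEPARATOR = " | ";

// Hypothetical variant: keep only whole formatted entries that fit the budget.
function joinWithinBudget(entries, maxLength = MAX_ATTR_VALUE_LENGTH) {
  let result = "";
  for (const entry of entries) {
    const candidate = result === "" ? entry : `${result}${SEPARATOR}${entry}`;
    if (candidate.length > maxLength) {
      // Only a lone oversized first entry is truncated; otherwise stop at an entry boundary.
      return result === "" ? entry.slice(0, maxLength) : result;
    }
    result = candidate;
  }
  return result;
}
```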

Comment thread actions/setup/js/send_otlp_span.cjs Outdated
```js
return {
  host: detail.host.trim(),
  ...(isValidOTLPExportErrorStatus(detail.status) ? { status: detail.status } : {}),
  reason: detail.reason.slice(0, MAX_ATTR_VALUE_LENGTH),
};
```

[/tdd] normalizeOTLPExportErrorDetail trims host (detail.host.trim()) but only slices reason — leaving potential leading/trailing whitespace in the stored value. This is a minor inconsistency but could produce unexpected output if a statusText value happens to contain surrounding spaces.

```js
// suggested
reason: detail.reason.trim().slice(0, MAX_ATTR_VALUE_LENGTH),
```
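A sketch of the normalizer with the suggested change applied, plus the whitespace edge case the review asked to test. `normalizeDetail` and `isValidStatus` are hypothetical stand-ins for normalizeOTLPExportErrorDetail and isValidOTLPExportErrorStatus, and the status rule (positive integer) is borrowed from the `Number.isInteger(response.status) && response.status > 0` check quoted earlier in this review.

```javascript
// Assumption: same 1024-character limit as MAX_ATTR_VALUE_LENGTH in the review.
const MAX_ATTR_VALUE_LENGTH = 1024;

// Assumed rule: a usable HTTP status is a positive integer.
function isValidStatus(status) {
  return Number.isInteger(status) && status > 0;
}

// Hypothetical normalizer: trim both host and reason before length-capping,
// so surrounding whitespace never reaches the span attribute.
function normalizeDetail(detail) {
  return {
    host: detail.host.trim(),
    ...(isValidStatus(detail.status) ? { status: detail.status } : {}),
    reason: detail.reason.trim().slice(0, MAX_ATTR_VALUE_LENGTH),
  };
}
```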

pelikhan (Collaborator) commented

@copilot review all comments

Copilot AI (Contributor, Author) commented May 16, 2026

> @copilot review all comments

Addressed the actionable review items in 9f8b848.

pelikhan (Collaborator) commented

@copilot review all comments and review comments

Copilot AI and others added 2 commits May 16, 2026 15:11
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
This reverts commit 839423a.

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
pelikhan (Collaborator) commented

@copilot merge main and recompile

Copilot AI (Contributor, Author) commented May 16, 2026

> @copilot review all comments and review comments

Done in 9f8b848. I re-checked the remaining PR comments/review threads, and the actionable items are covered there (typecheck-safe guard, per-line JSONL parsing, curl/non-2xx reason handling, reason sanitization, bounded conclusion attribute, and artifact upload path).

> @copilot lint js, fix https://github.com/github/gh-aw/actions/runs/25964635976/job/76326091878#step:6:1

Addressed in 9f8b848. I re-ran npm run typecheck and the targeted send_otlp_span.test.cjs suite locally; both pass. npm run lint:cjs still reports the pre-existing parse_copilot_log.test.cjs formatting issue, but the touched OTLP files pass Prettier.

Copilot AI and others added 4 commits May 16, 2026 15:18
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
This reverts commit 21a596d.

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
…capture-otlp-export-details

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI (Contributor, Author) commented May 16, 2026

> @copilot merge main and recompile

Done in 1e4dfa4 and 8726b34.

@pelikhan pelikhan merged commit a54099c into main May 16, 2026
@pelikhan pelikhan deleted the copilot/otel-advisor-capture-otlp-export-details branch May 16, 2026 15:34


Development

Successfully merging this pull request may close these issues.

[otel-advisor] OTel improvement: capture per-endpoint OTLP export failure details (host + status + reason)

3 participants