Skip to content

OTel: propagate GenAI semantic attrs to gh-aw.agent.conclusion spans#33245

Merged
pelikhan merged 3 commits into
mainfrom
copilot/grafana-otel-advisor-otel-improvement-surface-gen
May 19, 2026
Merged

OTel: propagate GenAI semantic attrs to gh-aw.agent.conclusion spans#33245
pelikhan merged 3 commits into
mainfrom
copilot/grafana-otel-advisor-otel-improvement-surface-gen

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented May 19, 2026

gh-aw currently emits GenAI semantic attributes only on the dedicated gh-aw.<job>.agent span, but production traces frequently surface only gh-aw.agent.conclusion. As a result, inference spans are not consistently recognized by OTel GenAI dashboards.

  • What changed

    • Added GenAI inference attributes to the shared attribute set used by agent conclusion spans (jobName === "agent"):
      • gen_ai.operation.name = "chat"
      • gen_ai.workflow.name (when available)
      • gen_ai.response.finish_reasons (when stop_reason is available)
    • This ensures gh-aw.agent.conclusion carries required/expected GenAI semantics even when the dedicated agent span is absent.
  • Deduplication cleanup

    • Removed redundant per-agent-span pushes for the same GenAI attrs in the dedicated span block.
    • Dedicated agent span now inherits these fields from the shared attribute list, keeping behavior aligned across both span types.
  • Test updates

    • Extended send_otlp_span.test.cjs assertions to validate GenAI attrs on the conclusion span:
      • present for operation/workflow when available
      • present/absent for finish reasons based on runtime stop_reason
if (jobName === "agent") {
  attributes.push(buildAttr("gen_ai.operation.name", "chat"));
  if (workflowName) attributes.push(buildAttr("gen_ai.workflow.name", workflowName));
  if (runtimeMetrics.stopReason) {
    attributes.push(buildArrayAttr("gen_ai.response.finish_reasons", [runtimeMetrics.stopReason]));
  }
}

Copilot AI and others added 2 commits May 19, 2026 09:08
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title [WIP] Improve OTel instrumentation to surface GenAI attributes on spans OTel: propagate GenAI semantic attrs to gh-aw.agent.conclusion spans May 19, 2026
Copilot AI requested a review from pelikhan May 19, 2026 09:17
@pelikhan pelikhan marked this pull request as ready for review May 19, 2026 09:18
Copilot AI review requested due to automatic review settings May 19, 2026 09:18
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Propagates OpenTelemetry GenAI semantic attributes onto gh-aw.agent.conclusion spans so inference activity is recognized by GenAI dashboards even when the dedicated agent sub-span is missing.

Changes:

  • Add gen_ai.operation.name, gen_ai.workflow.name, and gen_ai.response.finish_reasons (when available) to the shared attribute set for jobName === "agent" conclusion spans.
  • Remove redundant GenAI attribute pushes from the dedicated agent span path so it inherits from the shared attribute list.
  • Extend unit tests to assert GenAI attributes on the conclusion span (plus several workflow/action lockfile updates).
Show a summary per file
File Description
actions/setup/js/send_otlp_span.cjs Adds GenAI semantic attrs to agent conclusion shared attributes; deduplicates agent-span-specific pushes.
actions/setup/js/send_otlp_span.test.cjs Adds assertions that conclusion spans include GenAI semantic attrs (operation/workflow/finish reasons).
.github/workflows/stale-repo-identifier.lock.yml Updates pinned github/stale-repos action SHA/version in the lock workflow.
.github/workflows/smoke-otel-backends.lock.yml Adjusts Datadog MCP header value interpolation to use DD_APP_KEY.
.github/workflows/daily-geo-optimizer.lock.yml Updates pinned actions/checkout and actions/setup-python versions/SHAs for the geo audit job.
.github/aw/actions-lock.json Adds lock entries for additional action versions referenced by the updated lock workflows.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 6/6 changed files
  • Comments generated: 2

expect(conclusionAttrs["gen_ai.operation.name"]).toBe("chat");
expect(conclusionAttrs["gen_ai.workflow.name"]).toBe("otel-advisor");
});

Comment on lines 492 to 495
- name: Run stale-repos
id: stale-repos
uses: github/stale-repos@56718cc2bafc8c0fc5fd66751e0b32a33ed36c22 # v9.0.11 (source v9.0.8)
uses: github/stale-repos@5f2e18fc5432823f96c1feb69327f665c2acab59 # v9.0.8
env:
@github-actions github-actions Bot mentioned this pull request May 19, 2026
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skills-Based Review 🧠

Applied /tdd and /zoom-out based on the observability refactor + test coverage in this PR.

Key Themes

  • Test coverage gap: The new jobName === "agent" conditional lacks a negative test validating that non-agent jobs don't receive gen_ai.* attributes
  • Documentation context: Inline comments could better explain the production issue motivating this change and provide explicit line references for deduplication
  • OTel compliance: Strong adherence to OpenTelemetry GenAI semantic conventions

Positive Highlights

  • Excellent test discipline: All three existing tests extended to validate conclusion span attributes — both presence (when data available) and absence (when not)
  • Clean deduplication: Removing redundant attribute pushes improves maintainability and establishes a single source of truth
  • Spec-driven design: GenAI attributes follow OTel semantic conventions precisely (operation.name, workflow.name, finish_reasons)
  • Edge case coverage: Tests validate both stop_reason present and absent scenarios

Verdict

The implementation is solid and the test coverage is thorough for the happy path. The suggestions above are minor improvements to guard against regressions and improve code archaeology for future maintainers.

Learn more about the skills applied:

  • /tdd — Test-driven development: red-green-refactor loop
  • /zoom-out — Architectural consistency and broader context

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · ● 3.3M


const conclusionBody = JSON.parse(mockFetch.mock.calls[1][1].body);
const conclusionSpan = conclusionBody.resourceSpans[0].scopeSpans[0].spans[0];
const conclusionKeys = conclusionSpan.attributes.map(a => a.key);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] The new if (jobName === "agent") conditional at line 1750 lacks an explicit negative test. Consider adding a test case that validates gen_ai.* attributes are NOT added to non-agent conclusion spans:

it("does not add gen_ai attributes to non-agent conclusion spans", async () => {
  fs.writeFileSync(logPath, "job_name=activation\n");
  await sendJobConclusionSpan("gh-aw.agent.conclusion");
  
  const conclusionBody = JSON.parse(mockFetch.mock.calls[0][1].body);
  const conclusionSpan = conclusionBody.resourceSpans[0].scopeSpans[0].spans[0];
  const keys = conclusionSpan.attributes.map(a => a.key);
  expect(keys).not.toContain("gen_ai.operation.name");
  expect(keys).not.toContain("gen_ai.workflow.name");
});

This guards against accidental attribute leakage to non-LLM jobs.

}
if (jobName === "agent") {
// Emit OTel GenAI semantic attributes on agent conclusion spans even when the
// dedicated agent sub-span is missing, so LLM activity remains discoverable.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/zoom-out] The comment mentions "when the dedicated agent sub-span is missing" — is this a known production issue? Consider adding a reference to a GitHub issue, runbook, or ADR explaining:

  • Why the dedicated agent span might be absent in production traces
  • What monitoring alerts or dashboards motivated this change
  • Whether this is a temporary workaround or permanent architecture

This helps future maintainers understand the context when they encounter this code during debugging.

agentAttributes.push(buildArrayAttr("gen_ai.response.finish_reasons", [runtimeMetrics.stopReason]));
}
// gen_ai.operation.name / gen_ai.workflow.name / gen_ai.response.finish_reasons
// are already included via the shared attributes list above.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/zoom-out] This comment could be more specific. Instead of "shared attributes list above", consider:

// gen_ai.operation.name, gen_ai.workflow.name, and gen_ai.response.finish_reasons
// are already included via the jobName === "agent" block (lines 1750-1757).

Explicit line references make it easier for future maintainers to trace the deduplication during code review or debugging.

@github-actions
Copy link
Copy Markdown
Contributor

🧪 Test Quality Sentinel Report

Test Quality Score: 80/100

Excellent test quality

Metric Value
New/modified tests analyzed 3
✅ Design tests (behavioral contracts) 3 (100%)
⚠️ Implementation tests (low value) 0 (0%)
Tests with error/edge cases 1 (33%)
Duplicate test clusters 0
Test inflation detected No
🚨 Coding-guideline violations 0

Test Classification Details

Test File Classification Issues Detected
GenAI attributes on conclusion span (operation & workflow) send_otlp_span.test.cjs:2823 ✅ Design None
GenAI finish_reasons present when stop_reason available send_otlp_span.test.cjs:2979 ✅ Design None
GenAI finish_reasons absent when stop_reason missing send_otlp_span.test.cjs:3014 ✅ Design None

Strengths

All tests verify behavioral contracts — Every test addition validates observable OTel span attributes sent to endpoints, not internal implementation details.

Both positive and negative cases covered — Tests verify attributes are present when expected AND absent when they shouldn't be (e.g., finish_reasons only when stop_reason is available).

Follows existing test patterns — Assertions parse the actual OTLP payload from mock fetch calls, ensuring end-to-end correctness.

Proportional test growth — 17 lines of test code validate 3 different conditional branches in production code that actually decreased in size (refactoring that moved attributes to shared list).


Suggestions for Improvement

While the test quality is excellent, here are opportunities to strengthen edge case coverage:

Missing edge case: jobName !== "agent" path

The production code has this conditional:

if (jobName === "agent") {
  attributes.push(buildAttr("gen_ai.operation.name", "chat"));
  // ...
}

Consider adding a test that verifies GenAI attributes are NOT added to conclusion spans when jobName is something other than "agent" (e.g., "setup", "teardown"). This would validate the guard condition and catch regressions if someone accidentally removes the jobName === "agent" check.

Example test:

it("does not add GenAI attributes to non-agent job conclusion spans", async () => {
  // Test with jobName="setup" or another non-agent job
  // Assert gen_ai.operation.name is NOT present on conclusion span
});

This would bring edge case coverage from 33% to 50%, improving the overall score to 85/100.


Language Support

Tests analyzed:

  • 🟨 JavaScript (*.test.cjs): 3 test additions (vitest)

Verdict

Check passed. 0% of new tests are implementation tests (threshold: 30%). All test additions verify behavioral contracts and observable outputs.


📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls (mocking internals)
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

🧪 Test quality analysis by Test Quality Sentinel · ● 7.4M ·

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Test Quality Sentinel: 80/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). All tests verify behavioral contracts with observable outputs.

@pelikhan pelikhan merged commit 176ca18 into main May 19, 2026
61 of 67 checks passed
@pelikhan pelikhan deleted the copilot/grafana-otel-advisor-otel-improvement-surface-gen branch May 19, 2026 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

3 participants