Skip to content

refactor(workflow): per-step OTEL spans; delete CallToolInternal StartToolSpan#696

Merged
QuentinBisson merged 2 commits into
mainfrom
feat/workflow-step-spans
May 19, 2026
Merged

refactor(workflow): per-step OTEL spans; delete CallToolInternal StartToolSpan#696
QuentinBisson merged 2 commits into
mainfrom
feat/workflow-step-spans

Conversation

@QuentinBisson
Copy link
Copy Markdown
Contributor

@QuentinBisson QuentinBisson commented May 18, 2026

Why

CallToolInternal in internal/aggregator/server.go was opening a tool.<name> span via instrument.StartToolSpan for every internal dispatch. Two problems:

  1. Duplicate span on the MCP-wire path — mcp-go's tool-handler middleware already opens an identical tool.<name> span, so a regular tools/call over the wire produced two same-named child spans.
  2. Wrong layer on the workflow path — the workflow executor is the logical action ("running step X"), but the span lived in the aggregator dispatch function. A span at that layer doesn't carry workflow context.

What

  • New internal/workflow/tracing.go::startStepSpan opens a workflow.step span per executed step (and per step-condition tool call). Attributes: workflow.name, workflow.step.id, mcp.tool.name. Span outcome is recorded from the tool result (IsError) and any handler error.
  • internal/workflow/executor.go wraps both CallToolInternal sites (condition + step) in the new helper.
  • instrument.StartToolSpan and its tests are deleted. The unit-level TestMetrics_HistogramExemplarAttachesTraceID that relied on StartToolSpan is also dropped — the production wiring is covered end-to-end by TestMCPServerOptions_HistogramExemplarAttachesToolHandlerSpan (added in refactor(aggregator): adopt mcp-go native OTEL tracing hooks #684). instrument.TracerName and instrument.AttrToolName stay shared.
  • CallToolInternal becomes a transparent dispatch function with no span of its own. Direct API callers that want span granularity open one in their own layer.

Trace shape after this PR

workflow execution (caller's span, if any)
  └─ workflow.step (workflow.name=…, step.id=…, mcp.tool.name=…)   ← new
       └─ mcp client SendRequest "mcp.tools/call" (kind=Client)     ← from #685

For the MCP-wire tools/call path, the duplicate tool.<name> child span goes away — mcp-go's own middleware (adopted in #684) is the only owner of that span now.

Review follow-ups

  • TestStartStepSpan table-drives the three outcome branches (err / IsError / ok) and asserts workflow.name, workflow.step.id, and mcp.tool.name land on the span. Captured via tracetest.InMemoryExporter.
  • Workflow tracer scope folded into the shared instrument.TracerName (github.com/giantswarm/muster) — span name and workflow.* attributes already encode the emitter, so a workflow-local scope was redundant.
  • Narrative comment in executor.go stripped (the doc on startStepSpan carries the present invariant).

Stack

This PR is now stacked on #685. Land order: #685 (server + client tracing, includes the TracerName rename) → #696 (this PR).

Validation

  • go test ./... clean (every package).
  • make lint clean (no new issues in changed files).
  • 121/121 behavioural scenarios pass via ./muster test --parallel 50 --base-port 30000.

@QuentinBisson QuentinBisson force-pushed the feat/workflow-step-spans branch from 0cf2e10 to 540662b Compare May 18, 2026 15:34
@QuentinBisson QuentinBisson marked this pull request as ready for review May 18, 2026 18:10
@QuentinBisson QuentinBisson requested a review from a team as a code owner May 18, 2026 18:10
Copy link
Copy Markdown
Contributor Author

@QuentinBisson QuentinBisson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Light review (DDD / SOLID / YAGNI / DRY / don't-reinvent / tests).

Blocking — coverage regression

  • No test for startStepSpan (internal/workflow/tracing.go). You correctly deleted TestStartToolSpan, but didn't replace it. The new helper has three distinct outcome branches (err, IsError, ok) and three required attributes (workflow.name, workflow.step.id, mcp.tool.name) — exactly the table-driven test shape the deleted test had. This is a real coverage regression on a span the user-facing trace explicitly relies on.

Non-blocking

  • internal/workflow/tracing.go imports internal/aggregator/instrument for two constants. TracerName = "github.com/giantswarm/muster/internal/aggregator" — that's the wrong scope for spans emitted from internal/workflow. Use a workflow-local tracer name ("github.com/giantswarm/muster/internal/workflow") so trace consumers can filter by emitting package. Keep AttrToolName shared — it's a semantic key, not a scope.

Suggestion

  • The narrative comment inside executor.go ("wrapped in a workflow.step span so the trace shows the workflow → step → backend hierarchy") is exactly the historical/roadmap commentary your CLAUDE.md says to strip — it explains the change, not a present invariant. The doc on startStepSpan already covers what the span does.

OAuth test-constant churn from the stack still bleeds in.

@QuentinBisson QuentinBisson force-pushed the feat/workflow-step-spans branch from 657a48a to 7d4cc6a Compare May 19, 2026 12:15
@QuentinBisson QuentinBisson changed the base branch from main to feat/adopt-mcp-go-otel-client May 19, 2026 12:15
@QuentinBisson QuentinBisson force-pushed the feat/adopt-mcp-go-otel-client branch from 3655808 to f8a6ad6 Compare May 19, 2026 12:24
@QuentinBisson QuentinBisson force-pushed the feat/workflow-step-spans branch from 7d4cc6a to 8e4a6e3 Compare May 19, 2026 12:24
Base automatically changed from feat/adopt-mcp-go-otel-client to main May 19, 2026 12:28
@QuentinBisson QuentinBisson force-pushed the feat/workflow-step-spans branch from eb6f4fa to 5b56768 Compare May 19, 2026 12:49
@QuentinBisson QuentinBisson changed the base branch from main to chore/otel-scope-package May 19, 2026 12:50
@QuentinBisson QuentinBisson force-pushed the chore/otel-scope-package branch from 6aa3571 to d0839d5 Compare May 19, 2026 13:01
@QuentinBisson QuentinBisson force-pushed the feat/workflow-step-spans branch 2 times, most recently from 23eb622 to dcde1a1 Compare May 19, 2026 13:07
Base automatically changed from chore/otel-scope-package to main May 19, 2026 13:08
…tToolSpan

CallToolInternal was opening a "tool.<name>" span around every internal
dispatch via instrument.StartToolSpan. That duplicated the tool-handler
span mcp-go already opens on the MCP-wire path and put the only
dispatch-level span for the workflow path on the wrong layer (the
workflow executor is the logical action, not the aggregator dispatch).

Move the span into internal/workflow.startStepSpan so each workflow
step (and each step condition) gets a workflow.step span carrying the
workflow name, step ID, and tool name. Emit under the shared
instrument.TracerName so the scope is consistent with server-side and
client-side mcp-go spans. Drop StartToolSpan along with its unit test
and the TestMetrics_HistogramExemplarAttachesTraceID variant that
relied on it — the production wiring is already covered end-to-end by
TestMCPServerOptions_HistogramExemplarAttachesToolHandlerSpan.

CallToolInternal becomes a transparent dispatch function. Non-workflow
callers that want a dispatch-level span open one in their own layer.
Add a table-driven test asserting the three outcome branches of
startStepSpan (err / IsError / ok) and the three required attributes
(workflow.name, workflow.step.id, mcp.tool.name) land on the
workflow.step span. Captured via tracetest.InMemoryExporter.

The InstrumentationScope assertion pins the span to
instrument.TracerName so a regression that splits the workflow
package onto a separate scope fails the test.
@QuentinBisson QuentinBisson force-pushed the feat/workflow-step-spans branch from dcde1a1 to e9f5002 Compare May 19, 2026 13:10
@QuentinBisson QuentinBisson enabled auto-merge (squash) May 19, 2026 13:14
@QuentinBisson QuentinBisson merged commit c7c2f1b into main May 19, 2026
4 of 5 checks passed
@QuentinBisson QuentinBisson deleted the feat/workflow-step-spans branch May 19, 2026 13:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants