refactor(workflow): per-step OTEL spans; delete CallToolInternal StartToolSpan #696
Merged
QuentinBisson
commented
May 19, 2026
Contributor
Author
QuentinBisson
left a comment
Light review (DDD / SOLID / YAGNI / DRY / don't-reinvent / tests).
Blocking — coverage regression
- No test for startStepSpan (internal/workflow/tracing.go). You correctly deleted TestStartToolSpan, but didn't replace it. The new helper has three distinct outcome branches (err, IsError, ok) and three required attributes (workflow.name, workflow.step.id, mcp.tool.name), which is exactly the table-driven test shape the deleted test had. This is a real coverage regression on a span the user-facing trace explicitly relies on.
Non-blocking
- internal/workflow/tracing.go imports internal/aggregator/instrument for two constants. TracerName = "github.com/giantswarm/muster/internal/aggregator" is the wrong scope for spans emitted from internal/workflow. Use a workflow-local tracer name ("github.com/giantswarm/muster/internal/workflow") so trace consumers can filter by emitting package. Keep AttrToolName shared: it's a semantic key, not a scope.
Suggestion
- The narrative comment inside executor.go ("wrapped in a workflow.step span so the trace shows the workflow → step → backend hierarchy") is exactly the historical/roadmap commentary your CLAUDE.md says to strip: it explains the change, not a present invariant. The doc on startStepSpan already covers what the span does.
- OAuth test-constant churn from the stack still bleeds in.
paurosello
approved these changes
May 19, 2026
…tToolSpan

CallToolInternal was opening a "tool.<name>" span around every internal dispatch via instrument.StartToolSpan. That duplicated the tool-handler span mcp-go already opens on the MCP-wire path and put the only dispatch-level span for the workflow path on the wrong layer (the workflow executor is the logical action, not the aggregator dispatch).

Move the span into internal/workflow.startStepSpan so each workflow step (and each step condition) gets a workflow.step span carrying the workflow name, step ID, and tool name. Emit under the shared instrument.TracerName so the scope is consistent with server-side and client-side mcp-go spans.

Drop StartToolSpan along with its unit test and the TestMetrics_HistogramExemplarAttachesTraceID variant that relied on it; the production wiring is already covered end-to-end by TestMCPServerOptions_HistogramExemplarAttachesToolHandlerSpan.

CallToolInternal becomes a transparent dispatch function. Non-workflow callers that want a dispatch-level span open one in their own layer.
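To make the per-step span behaviour concrete, here is a minimal stdlib-only sketch of the pattern this commit message describes. It is a model, not the code in internal/workflow/tracing.go: stepSpan, toolResult, callTool, and runStep are hypothetical stand-ins, the outcome labels ("error", "tool_error", "ok") are assumed, and the real helper presumably uses the OpenTelemetry span API, which the source does not show.

```go
package main

import (
	"errors"
	"fmt"
)

// toolResult is a hypothetical stand-in for an MCP tool result; only the
// IsError flag matters for the span outcome.
type toolResult struct {
	IsError bool
}

// stepSpan is a hypothetical in-memory stand-in for an OTEL span.
type stepSpan struct {
	Name       string
	Attributes map[string]string
	Outcome    string // assumed labels: "error", "tool_error", "ok"
	Ended      bool
}

// startStepSpan models the helper: it opens a workflow.step span carrying
// the three attributes and returns a function that records the outcome
// from the tool result and handler error, then ends the span.
func startStepSpan(workflowName, stepID, toolName string) (*stepSpan, func(*toolResult, error)) {
	span := &stepSpan{
		Name: "workflow.step",
		Attributes: map[string]string{
			"workflow.name":    workflowName,
			"workflow.step.id": stepID,
			"mcp.tool.name":    toolName,
		},
	}
	end := func(res *toolResult, err error) {
		switch {
		case err != nil: // handler error
			span.Outcome = "error"
		case res != nil && res.IsError: // tool reported failure in its result
			span.Outcome = "tool_error"
		default:
			span.Outcome = "ok"
		}
		span.Ended = true
	}
	return span, end
}

// callTool is a hypothetical dispatch function standing in for
// CallToolInternal; it opens no span of its own.
func callTool(name string) (*toolResult, error) {
	if name == "" {
		return nil, errors.New("empty tool name")
	}
	return &toolResult{IsError: name == "failing_tool"}, nil
}

// runStep wraps one dispatch in a step span, the way the executor is
// described as wrapping its CallToolInternal sites.
func runStep(workflowName, stepID, toolName string) *stepSpan {
	span, end := startStepSpan(workflowName, stepID, toolName)
	res, err := callTool(toolName)
	end(res, err)
	return span
}

func main() {
	for _, tool := range []string{"some_tool", "failing_tool", ""} {
		span := runStep("deploy", "step-1", tool)
		fmt.Printf("%s tool=%q outcome=%s\n", span.Name, span.Attributes["mcp.tool.name"], span.Outcome)
	}
}
```

The point of the shape is that the dispatch function stays span-free and the caller that owns the logical action (here, the workflow step) owns the span.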
Add a table-driven test asserting the three outcome branches of startStepSpan (err / IsError / ok) and the three required attributes (workflow.name, workflow.step.id, mcp.tool.name) land on the workflow.step span. Captured via tracetest.InMemoryExporter. The InstrumentationScope assertion pins the span to instrument.TracerName so a regression that splits the workflow package onto a separate scope fails the test.
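The table-driven shape this commit message describes can be sketched as follows. This is a stdlib-only model under stated assumptions: exporter and recordedSpan are hypothetical stand-ins for tracetest.InMemoryExporter and its span stubs, recordStep stands in for opening and ending a workflow.step span, and tracerName uses the value the PR description quotes for instrument.TracerName. The scope is passed in so the sketch can show a split scope being caught.

```go
package main

import (
	"errors"
	"fmt"
)

// tracerName is the value the PR description quotes for instrument.TracerName.
const tracerName = "github.com/giantswarm/muster"

// recordedSpan is a hypothetical stand-in for an exported span stub.
type recordedSpan struct {
	Scope      string
	Name       string
	Attributes map[string]string
	Failed     bool
}

// exporter is a hypothetical stand-in for an in-memory span exporter.
type exporter struct {
	spans []recordedSpan
}

// recordStep models one workflow.step span being opened under the given
// scope and ended with the outcome of the tool call.
func recordStep(exp *exporter, scope, wf, step, tool string, isError bool, err error) {
	exp.spans = append(exp.spans, recordedSpan{
		Scope: scope,
		Name:  "workflow.step",
		Attributes: map[string]string{
			"workflow.name":    wf,
			"workflow.step.id": step,
			"mcp.tool.name":    tool,
		},
		Failed: err != nil || isError,
	})
}

// checkCases runs the three outcome branches against spans emitted under
// scope and returns one message per failed assertion.
func checkCases(scope string) []string {
	cases := []struct {
		name       string
		isError    bool
		err        error
		wantFailed bool
	}{
		{"handler error", false, errors.New("boom"), true},
		{"tool result IsError", true, nil, true},
		{"ok", false, nil, false},
	}
	wantAttrs := map[string]string{
		"workflow.name":    "deploy",
		"workflow.step.id": "step-1",
		"mcp.tool.name":    "some_tool",
	}
	var failures []string
	for _, tc := range cases {
		exp := &exporter{}
		recordStep(exp, scope, "deploy", "step-1", "some_tool", tc.isError, tc.err)
		if len(exp.spans) != 1 {
			failures = append(failures, fmt.Sprintf("%s: got %d spans, want 1", tc.name, len(exp.spans)))
			continue
		}
		got := exp.spans[0]
		// Pin the instrumentation scope so a split scope fails the check.
		if got.Scope != tracerName {
			failures = append(failures, fmt.Sprintf("%s: scope %q, want %q", tc.name, got.Scope, tracerName))
		}
		if got.Name != "workflow.step" {
			failures = append(failures, fmt.Sprintf("%s: span name %q", tc.name, got.Name))
		}
		for key, want := range wantAttrs {
			if got.Attributes[key] != want {
				failures = append(failures, fmt.Sprintf("%s: %s = %q, want %q", tc.name, key, got.Attributes[key], want))
			}
		}
		if got.Failed != tc.wantFailed {
			failures = append(failures, fmt.Sprintf("%s: failed = %v, want %v", tc.name, got.Failed, tc.wantFailed))
		}
	}
	return failures
}

func main() {
	fmt.Println("shared scope failures:", len(checkCases(tracerName)))
	fmt.Println("split scope failures:", len(checkCases(tracerName+"/internal/workflow")))
}
```

In an actual test file the same table would normally run under testing.T with one subtest per case, reading spans back from the real in-memory exporter rather than from this stand-in.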
Why
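CallToolInternal in internal/aggregator/server.go was opening a tool.<name> span via instrument.StartToolSpan for every internal dispatch. Two problems:
- It duplicated the tool-handler span mcp-go already opens on the MCP-wire path, also a tool.<name> span, so a regular tools/call over the wire produced two same-named child spans.
- It put the only dispatch-level span for the workflow path on the wrong layer: the workflow executor is the logical action, not the aggregator dispatch.
What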
- internal/workflow/tracing.go::startStepSpan opens a workflow.step span per executed step (and per step-condition tool call). Attributes: workflow.name, workflow.step.id, mcp.tool.name. Span outcome is recorded from the tool result (IsError) and any handler error.
- internal/workflow/executor.go wraps both CallToolInternal sites (condition + step) in the new helper.
- instrument.StartToolSpan and its tests are deleted. The unit-level TestMetrics_HistogramExemplarAttachesTraceID that relied on StartToolSpan is also dropped; the production wiring is covered end-to-end by TestMCPServerOptions_HistogramExemplarAttachesToolHandlerSpan (added in refactor(aggregator): adopt mcp-go native OTEL tracing hooks #684).
- instrument.TracerName and instrument.AttrToolName stay shared.
- CallToolInternal becomes a transparent dispatch function with no span of its own. Direct API callers that want span granularity open one in their own layer.
Trace shape after this PR
For the MCP-wire tools/call path, the duplicate tool.<name> child span goes away: mcp-go's own middleware (adopted in #684) is the only owner of that span now.
Review follow-ups
- TestStartStepSpan table-drives the three outcome branches (err / IsError / ok) and asserts workflow.name, workflow.step.id, and mcp.tool.name land on the span. Captured via tracetest.InMemoryExporter.
- Scope stays on the shared instrument.TracerName (github.com/giantswarm/muster): span name and workflow.* attributes already encode the emitter, so a workflow-local scope was redundant.
- Narrative comment in executor.go stripped (the doc on startStepSpan carries the present invariant).
Stack
This PR is now stacked on #685. Land order: #685 (server + client tracing, includes the TracerName rename) → #696 (this PR).
Validation
- go test ./... clean (every package).
- make lint clean (no new issues in changed files).
- ./muster test --parallel 50 --base-port 30000.