Implement a user tracing API for custom spans#6608
Conversation
|
ProviderModelNotFoundError |
|
@danlapid Bonk workflow failed. Check the logs for details. View workflow run · To retry, trigger Bonk again. |
9acd0cb to
b35566d
Compare
|
/bonk review this PR |
There was a problem hiding this comment.
Review Summary
Well-structured fix for a real async-context propagation bug in user tracing. The UserTraceSpanHolder indirection is the right approach for decoupling span identity from span lifetime - cleanly avoids extending BaseTracer lifetime into the IoContext's delete queue on actor IoContexts.
Findings:
- [MEDIUM] Pre-existing gap:
HibernatableWebSocketCustomEvent::run(),TailStreamTarget::report(), andJsRpcTargetBaseall dispatch to user JS handlers viacontext.run()without eithermakeAsyncTraceScopeormakeUserAsyncTraceScope. Not introduced by this PR, but these are actor-relevant entrypoints that would benefit from the same treatment. Consider a follow-up. - [LOW]
makeUserAsyncTraceScopealways allocates aUserTraceSpanHolder+IoOwn+ opaque V8 wrapper even when tracing is inactive, unlikemakeAsyncTraceScopewhich elides the heap allocation viadummySpanParentwithNullDisposerwhen the span is unobserved. See inline comment for suggested optimization. - [LOW]
UserTraceSpanHolder::getSpan()could beconst- minor style nit, see inline.
Test expectation changes are correct: the getAlarm call inside waitForAlarm() (after await promise resumes) is logically part of fetch()'s async call graph and now correctly attributes to the fetch trace.
This review was generated by an AI assistant and may contain inaccuracies.
|
Review posted on PR #6608. Here's a summary of the findings: Key analysis:
|
Merging this PR will improve performance by 39.14%
Performance Changes
Comparing Footnotes
|
9b18411 to
4508530
Compare
|
/bonk review this PR commit by commit |
There was a problem hiding this comment.
Review Summary
Thorough rework that correctly solves the async-context propagation problem for user tracing spans. The UserTraceSpanHolder indirection is the right design for decoupling span identity from span lifetime on actor IoContexts, and the enterSpan callback-based API is cleaner than the old startSpan/end() approach. The hierarchy tests (including async nested spans across microtask boundaries, and runtime-created fetch spans nesting under user spans) provide confidence in correctness.
Findings:
- [HIGH]
enterSpancapturesjsSpanviaaddRef()in both the success and failure promise continuations. If the promise never settles (not just a rare corner case -- this happens with abandoned promises in actors), multipleRef<Span>closures hold strong refs. TheIoOwn<SpanImpl>inside them will be destroyed via the delete queue eventually, but only at actor shutdown. See inline comment. - [MEDIUM] After
SpanImpl::end()clears the async context holder, code running in that same async context that callsgetCurrentUserTraceSpan()will getSpanParent(nullptr), silently dropping any further child spans. See inline comment. - [MEDIUM]
tracing.d.tsdeclaresenterSpanas a bare function rather than a method on the default export. See inline comment. - [LOW]
makeUserAsyncTraceScopealways allocates even when no tracer is active. Likely cause of the 29.9% CodSpeed regression. See inline comment. - [LOW]
getTracing()allocates a new instance on every call; verifyJSG_LAZY_INSTANCE_PROPERTYcaches it. See inline comment.
This review was generated by an AI assistant and may contain inaccuracies.
|
Review posted successfully on PR #6608. Here's a summary of the findings: 5 inline comments posted across 4 files:
|
f38e799 to
82fb934
Compare
|
The generated output of |
82fb934 to
ad5df7c
Compare
|
/bonk review this PR commit by commit |
|
UnknownError: ProviderInitError |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #6608 +/- ##
==========================================
- Coverage 66.56% 66.48% -0.09%
==========================================
Files 405 405
Lines 117564 117768 +204
Branches 19388 19423 +35
==========================================
+ Hits 78258 78296 +38
- Misses 27729 27889 +160
- Partials 11577 11583 +6 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
mar-cf
left a comment
There was a problem hiding this comment.
re: makeUserTraceSpan intentionally unchanged
I'm genuinely unsure if that's correct. On one hand, I see why it's skipped - it would really only be useful situationally and incurs a cost. But, there's also a benefit to a consistent concepts of getting a "current" span context that is reflective of sync -> async span context.
For example, if storage.transaction() creates a runtime span and the user callback inside it calls other traced APIs, those child spans would parent under the root rather than the transaction span - because the transaction span doesn't push into async context.
I always imagined it would be a part of this change. Do you have a strong opinion on this?
That's a very fair point and a good example, but there's very few of those and we're not making anything any worse - I'd rather deal with it in a followup |
|
Looked at the changes – I have few things to suggest in addition to what was pointed out previously. I didn't look closely at memory management in the second commit which looks like it can be tricky to get right, but it looks like the previous reviewers did.
|
ad5df7c to
e4af585
Compare
|
/bonk review this PR commit by commit |
|
UnknownError: ProviderInitError |
1 similar comment
|
UnknownError: ProviderInitError |
|
/bonk review this PR commit by commit |
|
UnknownError: ProviderInitError |
88aeca7 to
d3dc855
Compare
…cing
User tracing was asymmetric with internal/Jaeger tracing in a way that
produced incorrect parent/child relationships in streaming tail worker
events.
## Background
workerd has two tracing systems:
* Internal (Jaeger) tracing uses `jsg::AsyncContextFrame::StorageKey`
to propagate the current span through JS async boundaries. The root
is pushed at every handler entrypoint via
`IoContext::makeAsyncTraceScope(lock)`, and `getCurrentTraceSpan()`
reads from whichever AsyncContextFrame is live when a span is built.
* User tracing (visible in streaming tail worker events) did *not* use
AsyncContextFrame. `currentUserTraceSpan` was a plain member of
`IoContext_IncomingRequest` set once in `delivered()`, and
`getCurrentUserTraceSpan()` returned it via
`getCurrentIncomingRequest()` - which just walks the IoContext's
intrusive list and returns the most recently delivered request.
So user tracing had no awareness of the JS async call graph. Whichever
request was delivered most recently on a given IoContext "won" as the
parent for any span created anywhere on that IoContext, even for
spans originating from a promise continuation that logically belongs
to a different, earlier request. This is most visible on Durable
Objects where one IoContext serves many requests, but applies any
time a promise chain outlives the dispatch of another request on the
same IoContext.
## What this change does
### New AsyncContextFrame key and `makeUserAsyncTraceScope`
`Worker::Isolate` owns a second `AsyncContextFrame::StorageKey`,
`userTraceAsyncContextKey`, alongside the existing
`traceAsyncContextKey`. A new `IoContext::makeUserAsyncTraceScope` is
the exact analogue of `makeAsyncTraceScope`: it returns an RAII
`StorageScope` that pushes a frame carrying the user-trace root. V8
captures the frame when a promise continuation is enqueued inside the
scope, so the continuation sees the same root when it later runs.
`makeUserAsyncTraceScope` is called immediately after
`makeAsyncTraceScope` at every call site in the codebase:
`WorkerEntrypoint::request`/`connect`/`runScheduled`/`runAlarmImpl`/
`test`, `QueueCustomEventImpl::run`, and `sendTracesToExportedHandler`.
### `getCurrentUserTraceSpan` is now async-context-aware
If the JS lock is held, look up the current AsyncContextFrame under
the user-trace key and return `holder->getSpan()` (see below).
Otherwise fall back to the current IncomingRequest's root, preserving
behavior for code paths without the JS lock.
### `UserTraceSpanHolder` - preventing tracer lifetime extension
A naive implementation (`IoOwn<SpanParent>` keyed off the new storage
key) works for stateless workers but is catastrophically broken for
actors. The `UserSpanObserver` holds a `kj::Own<BaseTracer>` via its
submitter. Things in `IoOwn<T>` live on the IoContext's delete queue
and are only destroyed at IoContext teardown - for an actor, that can
be hours or days after any individual request. And the `BaseTracer`'s
destructor is what emits the final `outcome` streaming-tail event, so
a leaked tracer means the request's tail stream never receives an
`outcome`.
`UserTraceSpanHolder` is a refcounted indirection that decouples the
identity of the slot from the lifetime of its contents:
class UserTraceSpanHolder final: public kj::Refcounted {
SpanParent getSpan(); // null if cleared
void setSpan(SpanParent);
void clear();
private:
kj::Maybe<SpanParent> span;
};
The IncomingRequest owns the holder, populates it in `delivered()`,
and `clear()`s it in its destructor before `metrics`
(`RequestObserverWithTracer`) is destroyed. Other references to the
holder - specifically the one placed in `IoOwn<UserTraceSpanHolder>`
via the AsyncContextFrame - observe the same object but cannot
prevent the span and its tracer ref from being released at
end-of-request. The ordering of `clear()` before `metrics` destruction
ensures the tracer's refcount hits zero right after
`setOutcome` is called, so its destructor emits the outcome promptly.
### `makeUserTraceSpan` intentionally unchanged
Same semantics as before and as `makeTraceSpan`: build a span whose
parent is `getCurrentUserTraceSpan()`, but do *not* push it. The ~45
existing `makeUserTraceSpan` call sites in `api/*.c++` are unchanged
and continue to parent under the IncomingRequest root via the
fallback path when no enclosing user-trace scope has been pushed.
Future work can (a) opt specific call sites into
`makeUserAsyncTraceScope` where nested scopes are desired and (b) add
a JS user-tracing API analogous to `InternalJaegerTrace::enterSpan`
that wraps user callbacks in a `makeUserAsyncTraceScope`.
## Test expectation change
`src/workerd/api/tests/tail-worker-test.js` expectations for the
actor-alarm test (expected lines 111 and 112) are updated. This test
is the only place in the workerd test corpus that exercises
cross-request async context propagation for user spans, and its old
expectations encoded the pre-existing bug.
The test case (`actor-alarms-test.js`) does 1 setAlarm + 3 getAlarm:
async fetch() {
await setAlarm(time) [F1]
await getAlarm() [F2]
await waitForAlarm(time):
await promise [W1] yields
// alarm() runs here, calls getAlarm() [A1], resolves promise
await getAlarm() [W2]
}
Under the old code, [W2] ran after alarm() had been delivered on the
same IoContext. `getCurrentIncomingRequest()` returned the alarm
request (most recently delivered), so [W2]'s span was parented under
alarm's root even though it is logically part of fetch's call chain.
Under the new code, fetch's entrypoint pushed a
`makeUserAsyncTraceScope` seeded with fetch's root holder. V8
captured that frame when [W1] suspended. When [W2] runs on resumption,
the frame is restored and `getCurrentUserTraceSpan()` returns fetch's
root.
Old expectations: fetch=[F1, F2], alarm=[A1, W2] // W2 misattributed
New expectations: fetch=[F1, F2, W2], alarm=[A1]
Total getAlarm count is unchanged (3). Only attribution changed, and
the new attribution matches the async call graph - which is what every
other tracing system (internal Jaeger in workerd itself,
OpenTelemetry, Sentry, Zipkin) does, and what developers expect when
reading their code.
This is a user-observable change in streaming tail worker event
structure for any case where a promise chain from one request is
resumed after another request on the same IoContext has been
delivered. If staged rollout is desired, gating the async-context
lookup in `getCurrentUserTraceSpan()` behind an autogate or compat
flag would be a small targeted change; this commit does not add one -
defer to the reviewer.
Replaces `tracing.startSpan(name) -> JsSpan` with `tracing.enterSpan(name, callback, ...args)`, which pushes the new span onto the AsyncContextFrame for the callback's duration and auto-ends it on return, throw, or promise settlement. This makes nested spans and runtime-generated spans (e.g. `fetch`) correctly parent on the surrounding user span. Child spans are wrapped in a fresh UserTraceSpanHolder before being pushed onto the AsyncContextFrame; the holder is cleared by SpanImpl::end() so actor IoContexts don't leak tracer refs across requests. `Span` and `SpanImpl` live in a new `workerd::api::user_tracing` sub-namespace to avoid collision with `workerd::Span`. The JS class name is still `Span` via `JSG_NESTED_TYPE_NAMED`. `withSpan` in tracing-helpers.ts becomes a one-line passthrough over `enterSpan`, so downstream callers (d1, images, etc.) need no changes. Adds tracing-hierarchy-test which verifies parent/child relationships in the streaming-tail event stream for nested enterSpan (sync and async), fetch-inside-enterSpan, fetch-inside-nested-enterSpan, and sibling enterSpan calls.
The class name is what users see in generated .d.ts output and in
`typeof` on the JS-side object. The 'Module' suffix was a vestige
from when the API was only reachable as an ESM import
(`import { tracing } from 'cloudflare:workers'`); it no longer
describes the type accurately.
Pure refactor: no behavior change and no new JS surface.
The module-registration helpers (`registerTracingModule`,
`getInternalTracingModuleBundle`) keep the 'Module' suffix — they
describe the JS module registration, not the class.
d3dc855 to
c374a86
Compare
Adds `ctx.tracing` (gated on the `workerdExperimental` compat flag
in `getTracing`, which returns `undefined` outside the flag) and a
`tracing` named export on `cloudflare:workers` that re-exports the
`cloudflare-internal:tracing` module.
The BUILD.bazel change folds `tracing.{h,c++}` into the
`api:srcs` / `:hdrs` filegroups and drops the standalone
`wd_cc_library(name = "tracing-module")` target. This is the
idiomatic workaround for the api<->io circular dep documented in
`src/workerd/io/BUILD.bazel`: `global-scope.c++` now needs the
full `Tracing` definition to implement `getTracing`, and
`global-scope.c++` lives inside `//src/workerd/io` via that
filegroup, so a standalone tracing-module library that (transitively)
depends on `//src/workerd/io` would cycle.
c374a86 to
fa4180d
Compare
This PR implements a user tracing API for custom spans.
The PR is split to 4 separate logical commits for review convenience as well as for sane history management.
The commits are as follows:
import { tracing } from 'cloudflare:workers'