Skip to content

Implement a user tracing API for custom spans#6608

Merged
danlapid merged 4 commits intomainfrom
dlapid/asyncContextUserTracing
Apr 22, 2026
Merged

Implement a user tracing API for custom spans#6608
danlapid merged 4 commits intomainfrom
dlapid/asyncContextUserTracing

Conversation

@danlapid
Copy link
Copy Markdown
Collaborator

@danlapid danlapid commented Apr 19, 2026

This PR implements a user tracing API for custom spans.
The PR is split to 4 separate logical commits for review convenience as well as for sane history management.

The commits are as follows:

  1. Switch the user tracing instrumentation to be async local storage based in order to be able to have correct span hierarchy.
  2. Move our tracing module to use the new API correctly.
  3. Rename the tracing module to tracing in preparation of making it available as an API and not just a module.
  4. Make the tracing api available in ctx.tracing and import { tracing } from 'cloudflare:workers'

@danlapid danlapid requested review from a team as code owners April 19, 2026 02:11
@danlapid danlapid requested review from fhanau, jasnell and mar-cf April 19, 2026 02:11
@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Apr 19, 2026

ProviderModelNotFoundError

github run

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Apr 19, 2026

@danlapid Bonk workflow failed. Check the logs for details.

View workflow run · To retry, trigger Bonk again.

@danlapid danlapid force-pushed the dlapid/asyncContextUserTracing branch from 9acd0cb to b35566d Compare April 19, 2026 02:14
@danlapid
Copy link
Copy Markdown
Collaborator Author

/bonk review this PR

Copy link
Copy Markdown
Contributor

@ask-bonk ask-bonk Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Well-structured fix for a real async-context propagation bug in user tracing. The UserTraceSpanHolder indirection is the right approach for decoupling span identity from span lifetime - cleanly avoids extending BaseTracer lifetime into the IoContext's delete queue on actor IoContexts.

Findings:

  • [MEDIUM] Pre-existing gap: HibernatableWebSocketCustomEvent::run(), TailStreamTarget::report(), and JsRpcTargetBase all dispatch to user JS handlers via context.run() without either makeAsyncTraceScope or makeUserAsyncTraceScope. Not introduced by this PR, but these are actor-relevant entrypoints that would benefit from the same treatment. Consider a follow-up.
  • [LOW] makeUserAsyncTraceScope always allocates a UserTraceSpanHolder + IoOwn + opaque V8 wrapper even when tracing is inactive, unlike makeAsyncTraceScope which elides the heap allocation via dummySpanParent with NullDisposer when the span is unobserved. See inline comment for suggested optimization.
  • [LOW] UserTraceSpanHolder::getSpan() could be const - minor style nit, see inline.

Test expectation changes are correct: the getAlarm call inside waitForAlarm() (after await promise resumes) is logically part of fetch()'s async call graph and now correctly attributes to the fetch trace.


This review was generated by an AI assistant and may contain inaccuracies.

Comment thread src/workerd/io/io-context.c++
Comment thread src/workerd/io/io-context.h Outdated
@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Apr 19, 2026

Review posted on PR #6608. Here's a summary of the findings:

Key analysis:

  1. Correctness: The core approach is sound. UserTraceSpanHolder as a refcounted indirection correctly decouples the identity of the async-context slot from the lifetime of the SpanParent/BaseTracer. The clear() in ~IncomingRequest before metrics destruction ensures the tracer's refcount drops to zero at end-of-request, not at actor shutdown. All 7 makeAsyncTraceScope call sites are properly paired with makeUserAsyncTraceScope.

  2. [MEDIUM] Pre-existing missing entrypoints: Three handler dispatch paths (HibernatableWebSocketCustomEvent::run(), TailStreamTarget::report(), JsRpcTargetBase) call context.run() without either trace scope. Not introduced by this PR, but worth a follow-up since these are actor-relevant.

  3. [LOW] Allocation optimization: makeUserAsyncTraceScope always allocates even when no tracer is active, unlike makeAsyncTraceScope which has a dummySpanParent/NullDisposer fast path.

  4. [LOW] Style: getSpan() on UserTraceSpanHolder could be const.

  5. Test expectations: Verified correct - the getAlarm at the end of waitForAlarm() is part of fetch()'s async call chain and now properly attributes to fetch instead of alarm.

github run

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented Apr 19, 2026

Merging this PR will improve performance by 39.14%

⚡ 1 improved benchmark
✅ 71 untouched benchmarks
⏩ 129 skipped benchmarks1

Performance Changes

Benchmark BASE HEAD Efficiency
simpleStringBody[Response] 27.2 µs 19.5 µs +39.14%

Comparing dlapid/asyncContextUserTracing (fa4180d) with main (764b4e8)

Open in CodSpeed

Footnotes

  1. 129 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@danlapid danlapid requested a review from a team as a code owner April 19, 2026 23:26
@danlapid danlapid requested a review from NuroDev April 19, 2026 23:26
@danlapid danlapid force-pushed the dlapid/asyncContextUserTracing branch from 9b18411 to 4508530 Compare April 19, 2026 23:35
@danlapid danlapid changed the title Make user tracing propagate through JS async context, like internal tracing Implement a user tracing API for custom spans Apr 19, 2026
@danlapid
Copy link
Copy Markdown
Collaborator Author

/bonk review this PR commit by commit

Copy link
Copy Markdown
Contributor

@ask-bonk ask-bonk Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Thorough rework that correctly solves the async-context propagation problem for user tracing spans. The UserTraceSpanHolder indirection is the right design for decoupling span identity from span lifetime on actor IoContexts, and the enterSpan callback-based API is cleaner than the old startSpan/end() approach. The hierarchy tests (including async nested spans across microtask boundaries, and runtime-created fetch spans nesting under user spans) provide confidence in correctness.

Findings:

  • [HIGH] enterSpan captures jsSpan via addRef() in both the success and failure promise continuations. If the promise never settles (not just a rare corner case -- this happens with abandoned promises in actors), multiple Ref<Span> closures hold strong refs. The IoOwn<SpanImpl> inside them will be destroyed via the delete queue eventually, but only at actor shutdown. See inline comment.
  • [MEDIUM] After SpanImpl::end() clears the async context holder, code running in that same async context that calls getCurrentUserTraceSpan() will get SpanParent(nullptr), silently dropping any further child spans. See inline comment.
  • [MEDIUM] tracing.d.ts declares enterSpan as a bare function rather than a method on the default export. See inline comment.
  • [LOW] makeUserAsyncTraceScope always allocates even when no tracer is active. Likely cause of the 29.9% CodSpeed regression. See inline comment.
  • [LOW] getTracing() allocates a new instance on every call; verify JSG_LAZY_INSTANCE_PROPERTY caches it. See inline comment.

This review was generated by an AI assistant and may contain inaccuracies.

Comment thread src/workerd/api/tracing.c++
Comment thread src/workerd/api/tracing.c++ Outdated
Comment thread src/workerd/io/io-context.c++
Comment thread src/cloudflare/internal/tracing.d.ts Outdated
Comment thread src/workerd/api/global-scope.c++
@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Apr 19, 2026

Review posted successfully on PR #6608. Here's a summary of the findings:

5 inline comments posted across 4 files:

  1. [HIGH] src/workerd/api/tracing.c++:297 -- Abandoned promises in actors can accumulate SpanImpl + IoOwn + closure memory in the delete queue until actor shutdown. Each abandoned enterSpan keeps alive the original jsSpan ref plus two lambda captures with addRef(). Suggested adding auto-end when the owning IncomingRequest is destroyed, or a watchdog timeout.

  2. [MEDIUM] src/workerd/api/tracing.c++:83 -- After SpanImpl::end() clears the async context holder, any code in that async context calling getCurrentUserTraceSpan() gets SpanParent(nullptr). Analyzed the nested-span edge cases and concluded the StorageScope RAII makes this safe in practice, but suggested a clarifying comment.

  3. [MEDIUM] src/cloudflare/internal/tracing.d.ts:24 -- enterSpan is declared as a bare function but is actually a method on the Tracing class default export. The .d.ts should match the actual access pattern (method on instance, not top-level function).

  4. [LOW] src/workerd/io/io-context.c++:1095 -- makeUserAsyncTraceScope always heap-allocates even when no tracer is active. This is the likely cause of the 29.9% CodSpeed regression on simpleStringBody[Response]. Suggested adapting the dummySpanParent + NullDisposer pattern from makeAsyncTraceScope.

  5. [LOW] src/workerd/api/global-scope.c++:88 -- getTracing() allocates a new Tracing on every call. Since it uses JSG_LAZY_INSTANCE_PROPERTY this should be cached, but worth verifying.

github run

Copy link
Copy Markdown
Collaborator

@jasnell jasnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just because they're working more actively around tracing, I'd like either @mar-cf or @fhanau to review also but also good for this to merge.

@danlapid danlapid force-pushed the dlapid/asyncContextUserTracing branch 2 times, most recently from f38e799 to 82fb934 Compare April 20, 2026 23:46
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 20, 2026

The generated output of @cloudflare/workers-types matches the snapshot in types/generated-snapshot 🎉

@danlapid danlapid force-pushed the dlapid/asyncContextUserTracing branch from 82fb934 to ad5df7c Compare April 21, 2026 01:03
@danlapid
Copy link
Copy Markdown
Collaborator Author

/bonk review this PR commit by commit

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Apr 21, 2026

UnknownError: ProviderInitError

github run

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 22.58065% with 192 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.48%. Comparing base (764b4e8) to head (fa4180d).

Files with missing lines Patch % Lines
src/workerd/api/tracing.c++ 0.00% 175 Missing ⚠️
src/workerd/api/global-scope.c++ 0.00% 6 Missing ⚠️
src/workerd/server/server.c++ 66.66% 4 Missing and 2 partials ⚠️
src/workerd/io/io-context.c++ 84.61% 1 Missing and 3 partials ⚠️
src/workerd/io/io-context.h 85.71% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6608      +/-   ##
==========================================
- Coverage   66.56%   66.48%   -0.09%     
==========================================
  Files         405      405              
  Lines      117564   117768     +204     
  Branches    19388    19423      +35     
==========================================
+ Hits        78258    78296      +38     
- Misses      27729    27889     +160     
- Partials    11577    11583       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@mar-cf mar-cf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: makeUserTraceSpan intentionally unchanged

I'm genuinely unsure if that's correct. On one hand, I see why it's skipped - it would really only be useful situationally and incurs a cost. But, there's also a benefit to a consistent concepts of getting a "current" span context that is reflective of sync -> async span context.

For example, if storage.transaction() creates a runtime span and the user callback inside it calls other traced APIs, those child spans would parent under the root rather than the transaction span - because the transaction span doesn't push into async context.

I always imagined it would be a part of this change. Do you have a strong opinion on this?

Comment thread types/generated-snapshot/latest/index.d.ts
@danlapid
Copy link
Copy Markdown
Collaborator Author

danlapid commented Apr 21, 2026

For example, if storage.transaction() creates a runtime span and the user callback inside it calls other traced APIs, those child spans would parent under the root rather than the transaction span - because the transaction span doesn't push into async context.

That's a very fair point and a good example, but there's very few of those and we're not making anything any worse - I'd rather deal with it in a followup

@fhanau
Copy link
Copy Markdown
Contributor

fhanau commented Apr 21, 2026

Looked at the changes – I have few things to suggest in addition to what was pointed out previously. I didn't look closely at memory management in the second commit which looks like it can be tricky to get right, but it looks like the previous reviewers did.

  • Non-blocking, also mentioned by bot: For some event types we do not set up an async trace scope for internal or user tracing, notably hibernation, jsRpc and tailStream events. I looked at implementing this in STOR-4395 Improve span parenting for Hibernate events, add TODOs for other events #4947 but couldn't get a review for it. If this PR considers the previous implementation as "catastrophically broken for actors", we should implement this for all events that support them.
  • Non-blocking nit: I'd prefer to keep the tracing-module.c++ file name and EW_TRACING_MODULE_ISOLATE_TYPES the same, tracing-module.c++ is more descriptive and is easier to distinguish from trace.c++ than tracing.c++.
  • Non-blocking nit: Some of the tests are pretty verbose, tracing-hierarchy-instrumentation-test.js‎ could be simplified a bit by consolidating the different cases, and I believe tracing-hierarchy-mock.js‎ can be folded into tracing-hierarchy-mock.js‎.

Copy link
Copy Markdown
Member

@NuroDev NuroDev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from a types PoV

@danlapid danlapid force-pushed the dlapid/asyncContextUserTracing branch from ad5df7c to e4af585 Compare April 22, 2026 00:36
@danlapid
Copy link
Copy Markdown
Collaborator Author

/bonk review this PR commit by commit

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Apr 22, 2026

UnknownError: ProviderInitError

github run

1 similar comment
@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Apr 22, 2026

UnknownError: ProviderInitError

github run

@danlapid
Copy link
Copy Markdown
Collaborator Author

/bonk review this PR commit by commit

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Apr 22, 2026

UnknownError: ProviderInitError

github run

@danlapid danlapid force-pushed the dlapid/asyncContextUserTracing branch 2 times, most recently from 88aeca7 to d3dc855 Compare April 22, 2026 10:55
…cing

User tracing was asymmetric with internal/Jaeger tracing in a way that
produced incorrect parent/child relationships in streaming tail worker
events.

## Background

workerd has two tracing systems:

* Internal (Jaeger) tracing uses `jsg::AsyncContextFrame::StorageKey`
  to propagate the current span through JS async boundaries. The root
  is pushed at every handler entrypoint via
  `IoContext::makeAsyncTraceScope(lock)`, and `getCurrentTraceSpan()`
  reads from whichever AsyncContextFrame is live when a span is built.

* User tracing (visible in streaming tail worker events) did *not* use
  AsyncContextFrame. `currentUserTraceSpan` was a plain member of
  `IoContext_IncomingRequest` set once in `delivered()`, and
  `getCurrentUserTraceSpan()` returned it via
  `getCurrentIncomingRequest()` - which just walks the IoContext's
  intrusive list and returns the most recently delivered request.

So user tracing had no awareness of the JS async call graph. Whichever
request was delivered most recently on a given IoContext "won" as the
parent for any span created anywhere on that IoContext, even for
spans originating from a promise continuation that logically belongs
to a different, earlier request. This is most visible on Durable
Objects where one IoContext serves many requests, but applies any
time a promise chain outlives the dispatch of another request on the
same IoContext.

## What this change does

### New AsyncContextFrame key and `makeUserAsyncTraceScope`

`Worker::Isolate` owns a second `AsyncContextFrame::StorageKey`,
`userTraceAsyncContextKey`, alongside the existing
`traceAsyncContextKey`. A new `IoContext::makeUserAsyncTraceScope` is
the exact analogue of `makeAsyncTraceScope`: it returns an RAII
`StorageScope` that pushes a frame carrying the user-trace root. V8
captures the frame when a promise continuation is enqueued inside the
scope, so the continuation sees the same root when it later runs.

`makeUserAsyncTraceScope` is called immediately after
`makeAsyncTraceScope` at every call site in the codebase:
`WorkerEntrypoint::request`/`connect`/`runScheduled`/`runAlarmImpl`/
`test`, `QueueCustomEventImpl::run`, and `sendTracesToExportedHandler`.

### `getCurrentUserTraceSpan` is now async-context-aware

If the JS lock is held, look up the current AsyncContextFrame under
the user-trace key and return `holder->getSpan()` (see below).
Otherwise fall back to the current IncomingRequest's root, preserving
behavior for code paths without the JS lock.

### `UserTraceSpanHolder` - preventing tracer lifetime extension

A naive implementation (`IoOwn<SpanParent>` keyed off the new storage
key) works for stateless workers but is catastrophically broken for
actors. The `UserSpanObserver` holds a `kj::Own<BaseTracer>` via its
submitter. Things in `IoOwn<T>` live on the IoContext's delete queue
and are only destroyed at IoContext teardown - for an actor, that can
be hours or days after any individual request. And the `BaseTracer`'s
destructor is what emits the final `outcome` streaming-tail event, so
a leaked tracer means the request's tail stream never receives an
`outcome`.

`UserTraceSpanHolder` is a refcounted indirection that decouples the
identity of the slot from the lifetime of its contents:

    class UserTraceSpanHolder final: public kj::Refcounted {
      SpanParent getSpan();    // null if cleared
      void setSpan(SpanParent);
      void clear();
     private:
      kj::Maybe<SpanParent> span;
    };

The IncomingRequest owns the holder, populates it in `delivered()`,
and `clear()`s it in its destructor before `metrics`
(`RequestObserverWithTracer`) is destroyed. Other references to the
holder - specifically the one placed in `IoOwn<UserTraceSpanHolder>`
via the AsyncContextFrame - observe the same object but cannot
prevent the span and its tracer ref from being released at
end-of-request. The ordering of `clear()` before `metrics` destruction
ensures the tracer's refcount hits zero right after
`setOutcome` is called, so its destructor emits the outcome promptly.

### `makeUserTraceSpan` intentionally unchanged

Same semantics as before and as `makeTraceSpan`: build a span whose
parent is `getCurrentUserTraceSpan()`, but do *not* push it. The ~45
existing `makeUserTraceSpan` call sites in `api/*.c++` are unchanged
and continue to parent under the IncomingRequest root via the
fallback path when no enclosing user-trace scope has been pushed.
Future work can (a) opt specific call sites into
`makeUserAsyncTraceScope` where nested scopes are desired and (b) add
a JS user-tracing API analogous to `InternalJaegerTrace::enterSpan`
that wraps user callbacks in a `makeUserAsyncTraceScope`.

## Test expectation change

`src/workerd/api/tests/tail-worker-test.js` expectations for the
actor-alarm test (expected lines 111 and 112) are updated. This test
is the only place in the workerd test corpus that exercises
cross-request async context propagation for user spans, and its old
expectations encoded the pre-existing bug.

The test case (`actor-alarms-test.js`) does 1 setAlarm + 3 getAlarm:

    async fetch() {
      await setAlarm(time)                   [F1]
      await getAlarm()                       [F2]
      await waitForAlarm(time):
        await promise                        [W1] yields
          // alarm() runs here, calls getAlarm() [A1], resolves promise
        await getAlarm()                     [W2]
    }

Under the old code, [W2] ran after alarm() had been delivered on the
same IoContext. `getCurrentIncomingRequest()` returned the alarm
request (most recently delivered), so [W2]'s span was parented under
alarm's root even though it is logically part of fetch's call chain.

Under the new code, fetch's entrypoint pushed a
`makeUserAsyncTraceScope` seeded with fetch's root holder. V8
captured that frame when [W1] suspended. When [W2] runs on resumption,
the frame is restored and `getCurrentUserTraceSpan()` returns fetch's
root.

    Old expectations:  fetch=[F1, F2], alarm=[A1, W2]  // W2 misattributed
    New expectations:  fetch=[F1, F2, W2], alarm=[A1]

Total getAlarm count is unchanged (3). Only attribution changed, and
the new attribution matches the async call graph - which is what every
other tracing system (internal Jaeger in workerd itself,
OpenTelemetry, Sentry, Zipkin) does, and what developers expect when
reading their code.

This is a user-observable change in streaming tail worker event
structure for any case where a promise chain from one request is
resumed after another request on the same IoContext has been
delivered. If staged rollout is desired, gating the async-context
lookup in `getCurrentUserTraceSpan()` behind an autogate or compat
flag would be a small targeted change; this commit does not add one -
defer to the reviewer.
Replaces `tracing.startSpan(name) -> JsSpan` with
`tracing.enterSpan(name, callback, ...args)`, which pushes the new span
onto the AsyncContextFrame for the callback's duration and auto-ends it
on return, throw, or promise settlement. This makes nested spans and
runtime-generated spans (e.g. `fetch`) correctly parent on the
surrounding user span.

Child spans are wrapped in a fresh UserTraceSpanHolder before being
pushed onto the AsyncContextFrame; the holder is cleared by
SpanImpl::end() so actor IoContexts don't leak tracer refs across
requests.

`Span` and `SpanImpl` live in a new `workerd::api::user_tracing`
sub-namespace to avoid collision with `workerd::Span`. The JS class
name is still `Span` via `JSG_NESTED_TYPE_NAMED`.

`withSpan` in tracing-helpers.ts becomes a one-line passthrough over
`enterSpan`, so downstream callers (d1, images, etc.) need no changes.

Adds tracing-hierarchy-test which verifies parent/child relationships
in the streaming-tail event stream for nested enterSpan (sync and
async), fetch-inside-enterSpan, fetch-inside-nested-enterSpan, and
sibling enterSpan calls.
The class name is what users see in generated .d.ts output and in
`typeof` on the JS-side object. The 'Module' suffix was a vestige
from when the API was only reachable as an ESM import
(`import { tracing } from 'cloudflare:workers'`); it no longer
describes the type accurately.

Pure refactor: no behavior change and no new JS surface.
The module-registration helpers (`registerTracingModule`,
`getInternalTracingModuleBundle`) keep the 'Module' suffix — they
describe the JS module registration, not the class.
@danlapid danlapid force-pushed the dlapid/asyncContextUserTracing branch from d3dc855 to c374a86 Compare April 22, 2026 11:10
Adds `ctx.tracing` (gated on the `workerdExperimental` compat flag
in `getTracing`, which returns `undefined` outside the flag) and a
`tracing` named export on `cloudflare:workers` that re-exports the
`cloudflare-internal:tracing` module.

The BUILD.bazel change folds `tracing.{h,c++}` into the
`api:srcs` / `:hdrs` filegroups and drops the standalone
`wd_cc_library(name = "tracing-module")` target. This is the
idiomatic workaround for the api<->io circular dep documented in
`src/workerd/io/BUILD.bazel`: `global-scope.c++` now needs the
full `Tracing` definition to implement `getTracing`, and
`global-scope.c++` lives inside `//src/workerd/io` via that
filegroup, so a standalone tracing-module library that (transitively)
depends on `//src/workerd/io` would cycle.
@danlapid danlapid force-pushed the dlapid/asyncContextUserTracing branch from c374a86 to fa4180d Compare April 22, 2026 12:00
@danlapid danlapid merged commit dc25e37 into main Apr 22, 2026
22 of 23 checks passed
@danlapid danlapid deleted the dlapid/asyncContextUserTracing branch April 22, 2026 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants