[Diagnostics]: High-performance EventSource runtime async profiler. #127238
lateralusX wants to merge 4 commits into dotnet:main
Conversation
Commit adds `AsyncProfilerBufferedEventSource` - a high-performance EventSource for async method profiling that uses per-thread buffered event emission with centralized flush coordination.

Key design:

- Per-thread event buffers with lock-free acquire/release for zero-contention writes on the hot path.
- Delta timestamp encoding using compressed variable-length integers, reducing per-event timestamp overhead from 8 bytes to typically 1-2 bytes under load.
- Delta IP encoding using compressed variable-length integers, reducing bytes used per frame IP.
- Variable-length compressed integers using LEB128 and zigzag encoding.
- Centralized `AsyncThreadContextCache` with background flush timer for idle and dead thread buffer reclamation.
- Continuation wrapper table for compact async callstack representation, mapping runtime IPs to table indices. This makes it possible to match sync callstacks captured by an OS CPU profiler with the resume async callstack event.

Event types cover the full async lifecycle:

- Async context: create/resume/suspend/complete.
- Async method: resume/complete.
- Exception unwind: unhandled/handled.
- Async callstacks: create/resume/suspend.

Buffer management:

- Configurable buffer size via `DOTNET_AsyncProfilerBufferedEventSource_EventBufferSize` (default 16 KB - 256 bytes).
- Optimized buffer serialization methods; low overhead serializing events.
- SyncPoint mechanism for coordinated config changes across writer threads.
- BlockContext flag for safe flush-thread access to live thread buffers, with a 100 ms spin timeout to prevent flush thread stalls.

Integration:

- Async profiler wired into `AsyncInstrumentation` alongside the async debugger.
- EventSource + AsyncProfiler, cross-runtime support. Runtime-specific parts implemented in a CoreCLR-specific source file.
- Mono stub for potential future platform support (AsyncV1).
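The zigzag + LEB128 delta encoding mentioned above can be illustrated with a short language-neutral sketch (Python here for brevity; this is a minimal model of the technique, not the actual buffer layout or helper names used in `AsyncProfiler.cs`):

```python
def zigzag_encode(value: int) -> int:
    """Map signed deltas to unsigned ints so small negatives stay small."""
    return (value << 1) ^ (value >> 63)

def zigzag_decode(value: int) -> int:
    return (value >> 1) ^ -(value & 1)

def leb128_encode(value: int) -> bytes:
    """Unsigned LEB128: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def leb128_decode(data: bytes) -> int:
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            break
        shift += 7
    return result

# A small timestamp delta costs 1 byte instead of a full 8-byte QPC value.
encoded = leb128_encode(zigzag_encode(42))
assert len(encoded) == 1
assert zigzag_decode(leb128_decode(encoded)) == 42
```

Zigzag keeps small negative deltas (which can occur for IP deltas) in the 1-byte range, which is why it is paired with LEB128 rather than encoding the raw signed value.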
Includes comprehensive test coverage (AsyncProfilerTests) validating event correctness, buffer serialization, delta encoding, callstack capture, config changes, and multi-threaded stress scenarios.
- Fix 0 length room for callstack.
- Have AsyncEventHeader return header start index.
- Add qpc and utc time to metadata to make time conversion possible.
- Harden buffer allocation failure.
- AI review feedback.
- Adjust tests.
Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag
Pull request overview
This PR introduces a new high-performance, buffered EventSource-based async profiler pipeline for Runtime Async (AsyncV2), integrates it into the CoreCLR runtime-async dispatch loop (including continuation-wrapper anchoring), and adds a comprehensive test suite to validate the emitted buffer format and event semantics.
Changes:
- Add `AsyncProfilerBufferedEventSource` + shared `AsyncProfiler` buffering/encoding infrastructure in CoreLib.
- Integrate async-profiler emission into CoreCLR runtime-async dispatching (resume/suspend/complete, unwinds, callstack capture, wrapper dispatch).
- Add `AsyncProfilerTests` (CoreCLR-only) and wire them into `System.Threading.Tasks.Tests`; add Mono stubs and project wiring for multi-runtime builds.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/mono/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncProfiler.Mono.cs | Mono stub implementation for async-profiler partials (wrappers + syncpoint). |
| src/mono/System.Private.CoreLib/System.Private.CoreLib.csproj | Includes the new Mono stub file in Mono CoreLib build. |
| src/libraries/System.Runtime/tests/System.Threading.Tasks.Tests/System.Threading.Tasks.Tests.csproj | Adds AsyncProfilerTests to the System.Runtime task tests (non-Mono). |
| src/libraries/System.Runtime/tests/System.Threading.Tasks.Tests/System.Runtime.CompilerServices/AsyncProfilerTests.cs | New test suite validating buffer header, event stream, callstack encoding, wrapper anchoring, flushing, etc. |
| src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncProfilerBufferedEventSource.cs | New buffered EventSource emitting async-profiler buffers and receiving commands (update/flush). |
| src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncProfiler.cs | Shared async-profiler buffering, encoding (varint/zigzag), per-thread context caching, and config/syncpoint logic. |
| src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncInstrumentation.cs | Ensures async-profiler EventSource is initialized so flags/config can be established. |
| src/libraries/System.Private.CoreLib/src/System.Private.CoreLib.Shared.projitems | Adds shared CoreLib compilation items for the new AsyncProfiler sources. |
| src/coreclr/nativeaot/System.Private.CoreLib/src/System.Private.CoreLib.csproj | Includes CoreCLR async-profiler integration file for NativeAOT CoreLib. |
| src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncProfiler.CoreCLR.cs | CoreCLR-specific async-profiler integration: wrapper table, callstack capture, and syncpoint resume replay. |
| src/coreclr/System.Private.CoreLib/System.Private.CoreLib.csproj | Includes the CoreCLR async-profiler integration file in CoreCLR CoreLib build. |
| src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs | Wires async-profiler emission into runtime-async dispatcher and adds wrapper-dispatch path. |
```csharp
public static long[] GetContinuationWrapperIPs()
{
    return new long[COUNT];
```
On Mono, GetContinuationWrapperIPs() returns a long[COUNT] that will be all zeros, but metadata emission still reports wrapperCount = COUNT. If the event source is enabled on Mono, consumers will see wrapper IPs that look valid-but-unusable. Consider returning an empty array (wrapperCount=0) or otherwise clearly signaling “wrappers not supported” for Mono until real wrapper IPs can be provided.
```diff
-    return new long[COUNT];
+    return Array.Empty<long>();
```
```csharp
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe void ResumeAsyncCallstacks(AsyncThreadContext context)
{
    //Write recursivly all the resume async callstack events.
```
Typo in comment: "recursivly" should be "recursively" (and add a space after // for readability).
```diff
-    //Write recursivly all the resume async callstack events.
+    // Write recursively all the resume async callstack events.
```
```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Tracing;
using System.Reflection;
```
using System.Reflection; appears unused in this file (no references found). Please remove it to avoid CS8019/unused-using warnings (which can be treated as build breaks in some configurations).
```diff
-using System.Reflection;
```
```csharp
lock (CacheLock)
{
    Flush(true);
```
Cleanup takes lock (CacheLock) and then calls Flush(true), but Flush itself also takes the same lock. This double-locking is redundant and makes the locking scheme harder to reason about; consider splitting into a FlushCore(...) that assumes the lock is already held, or removing the outer lock and only locking around timer/cache state that truly needs it.
```csharp
lock (CacheLock)
{
    Flush(false);
```
PeriodicFlush takes lock (CacheLock) and then calls Flush(false), but Flush also takes the same lock. Consider refactoring to avoid nested locking (e.g., FlushCore under a single lock) to reduce overhead and simplify concurrency reasoning.
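The refactoring suggested in the two comments above can be sketched language-agnostically (Python below for brevity; the class and method names are hypothetical, not the actual CoreLib API): a public `flush` acquires the lock exactly once and delegates to a core routine that assumes the lock is already held.

```python
import threading

class ContextCache:
    """Hypothetical sketch of the FlushCore pattern:
    the lock is taken exactly once at the public boundary."""

    def __init__(self):
        self._lock = threading.Lock()
        self._buffers: list[bytes] = []
        self.emitted: list[bytes] = []

    def add(self, buf: bytes) -> None:
        with self._lock:
            self._buffers.append(buf)

    def flush(self) -> None:
        # Public API: lock once, then call the core that assumes the lock.
        with self._lock:
            self._flush_core()

    def cleanup(self) -> None:
        # No nested locking: cleanup reuses _flush_core under its own lock
        # and mutates other cache state guarded by the same lock.
        with self._lock:
            self._flush_core()
            self._buffers.clear()

    def _flush_core(self) -> None:
        # Precondition: self._lock is held by the caller.
        self.emitted.extend(self._buffers)
        self._buffers.clear()
```

This keeps each public entry point at exactly one lock acquisition, which avoids the redundant re-entry and makes the concurrency invariants explicit in the `_flush_core` precondition.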
```csharp
/// </summary>
protected override void OnEventCommand(EventCommandEventArgs command)
{
    if (command.Command == (EventCommand)FlushCommand || command.Command == EventCommand.SendManifest)
```
OnEventCommand treats EventCommand.SendManifest the same as the custom flush command and forces a full CaptureState() flush. SendManifest can be issued when a listener connects just to request metadata; flushing buffered async-profiler data at that point adds avoidable overhead and can produce unexpected emissions. Consider limiting CaptureState() to the explicit FlushCommand and handling SendManifest like other built-in commands (or no-op).
```diff
-    if (command.Command == (EventCommand)FlushCommand || command.Command == EventCommand.SendManifest)
+    if (command.Command == (EventCommand)FlushCommand)
```
Motivation
TPL includes support for capturing compiler async (AsyncV1) events to stitch together async callstacks in tools like PerfView, VS .NET Async Profiler, and Application Insights Profiler.
The challenge with using TPL events to profile async-heavy workloads is that they are verbose and produce a lot of data, creating too much overhead on the profiled process and skewing measurements.
Each TPL event is written into ETW/EventPipe/UserEvents, causing latency (a kernel call) as well as additional data (~100-byte header). Even a small event without a stack takes 200–500 ns to emit at 100+ bytes. TPL tracking of async execution generates heavy traffic on the eventing subsystem, increasing the risk of dropping events.
TPL overhead (synthetic benchmark)
TPL depends on a complete chain of events to recreate async callstacks — losing any events makes post-processing unreliable.
There have long been ideas about a more lightweight approach to tracking async method execution, making it possible to recreate async callstacks for sync callstacks captured by external tools like OS CPU samplers and profilers.
With the introduction of runtime async (AsyncV2), it was decided to revisit this and see what we could do to improve the profiler experience of async code. The async profiler is not tied to runtime async methods (AsyncV2), so it will be able to handle compiler async methods (AsyncV1) as well, but this PR focuses on AsyncV2. Follow-up PRs will add AsyncV1 support, making it possible to use the new async profiler to collect both AsyncV1 and AsyncV2 async callstacks.
Design
NOTE: All formats introduced by this PR are currently considered internal and can be changed without notice.
This PR adds
`AsyncProfilerBufferedEventSource` — a high-performance EventSource for async method profiling that uses per-thread buffered event emission with centralized flush coordination.

Core architecture
- Centralized `AsyncThreadContextCache` with background flush timer for idle and dead thread buffer reclamation.

Event types
Events cover the full async lifecycle:
Buffer management
- Configurable buffer size via `DOTNET_AsyncProfilerBufferedEventSource_EventBufferSize` (default 16 KB − 256 bytes).

Integration
- Async profiler wired into `AsyncInstrumentation` alongside the async debugger.

Test coverage
Comprehensive test suite (`AsyncProfilerTests`) validating event correctness, buffer serialization, delta encoding, callstack capture, config changes, and multi-threaded stress scenarios.

Performance results
Overhead comparison: async profiler vs. TPL
Data volume
ETL file size is down ~10× for scenarios capturing data to recreate async callstacks. For the 1M/s scenario over 20 seconds:
VS CPU profiling visibility
Running the 1M/s scenario under VS CPU profiling, none of the async profiler methods stand out — most are in the 0.01–0.03% self-CPU range with very low sample counts. The instrumented `DispatchContinuations` function shows ~2% overhead compared to the uninstrumented version. Even in very heavy async workloads, the async profiler does not pollute VS CPU profiling output.

Continuation wrapper optimization
My initial ambition was to track the async callstack on any thread at any point using a single event including the resumed async callstack executed through the dispatch loop. Initially that strategy hit issues due to ambiguity mapping methods between sync callstacks collected by the OS CPU sampler and the resumed async callstack active at that time.
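The anchoring scheme this section describes, dispatching each continuation through one of a small set of pre-generated indexed wrapper functions so that an external sampler sees an identifying frame, can be modeled with a short sketch. This is a hypothetical Python model of the idea, not the CoreCLR implementation; all names, the event tuples, and the reset semantics are illustrative only:

```python
WRAPPER_COUNT = 32  # pre-generated wrappers are capped and recycled

def _make_wrapper(index: int):
    # Each wrapper is a distinct function, so its frame (in the real
    # runtime, its IP) identifies `index` in a sampled sync callstack.
    def wrapper(continuation):
        return continuation()
    wrapper.__name__ = f"ContinuationWrapper{index}"
    return wrapper

WRAPPERS = [_make_wrapper(i) for i in range(WRAPPER_COUNT)]

def dispatch(continuations, events):
    """Dispatch loop sketch: anchor each resume with a wrapper index and
    emit a reset event whenever the wrapper table wraps around."""
    next_index = 0
    for continuation in continuations:
        if next_index == 0:
            events.append(("reset",))  # signal wrapper reuse to parsers
        events.append(("resume_async_callstack", next_index))
        WRAPPERS[next_index](continuation)
        next_index = (next_index + 1) % WRAPPER_COUNT
```

A parser correlating an OS sampler stack against the event stream matches the wrapper frame's index to the most recent `resume_async_callstack` event with that index after the last reset, yielding the full async callstack for the sampled thread.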
This can be solved using `CompleteAsyncMethod` events to recreate async callstacks at any point in time. `CompleteAsyncMethod` is just a signal event consuming a couple of bytes in the event buffer, but it introduces ~30–40 ns per completed method. The major cost is capturing the QPC (~15–20 ns, platform-dependent); the rest is raw memcpy + delta encoding. Reading the timestamp directly via a CPU instruction could bring this down to ~10 ns total — a potential future optimization (it would require a JIT intrinsic).

Instead of continuing to optimize `CompleteAsyncMethod`, I revisited the original problem: if we inject an anchor into the sync callstack captured by external tools, we can use it to tie into the async callstack emitted via the resume async callstack event. Since we control the dispatch loop running each continuation, it's possible to call through an indexed wrapper that places enough information in the sync callstack to identify the currently resumed continuation.

With up to 255 frames in an async callstack, the pre-generated wrappers are capped at 32 and recycled, emitting a reset event into the stream to signal reuse to parsers. This mechanism makes it possible to recreate any async callstack tied to sync callstacks captured by external tools using a single event (`ResumeAsyncCallstack`).

Calling through the wrapper costs ~5 ns per method resume — compared to ~30–40 ns for `CompleteAsyncMethod` plus increased output size, this ended up as a very successful optimization.

Timestamp correlation
All async profiler events are emitted using the existing EventSource infrastructure, meaning it's possible to listen on the event stream using in-proc `EventListener`s, `ICorProfiler`, as well as external ETW/EventPipe/UserEvent clients.

When events are used to recreate async callstacks for other events capturing sync callstacks emitted into the same event subsystem, all events share the same timestamp infrastructure. If events are emitted using different timestamp infrastructure that is not in sync, timestamps need to be re-synchronized and adjusted before use.
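To convert a buffered event's raw QPC value to wall clock time, a consumer can use the reference QPC, QPC frequency, and reference UTC time carried in the stream's metadata event (described in this section). A sketch of the arithmetic follows; the parameter names are assumptions, and .NET ticks are 100 ns units:

```python
TICKS_PER_SECOND = 10_000_000  # .NET ticks are 100 ns units

def qpc_to_utc_ticks(event_qpc: int,
                     reference_qpc: int,
                     qpc_frequency: int,
                     reference_utc_ticks: int) -> int:
    """Convert an event's raw QPC value to UTC ticks using the
    reference point carried in the stream's metadata event."""
    elapsed_qpc = event_qpc - reference_qpc
    # Multiply before dividing to avoid losing sub-second precision.
    elapsed_ticks = elapsed_qpc * TICKS_PER_SECOND // qpc_frequency
    return reference_utc_ticks + elapsed_ticks

# Example: with a 10 MHz QPC, an event 15,000,000 QPC units after the
# reference point lands 1.5 s (15,000,000 ticks) past the reference UTC time.
utc = qpc_to_utc_ticks(event_qpc=15_000_000,
                       reference_qpc=0,
                       qpc_frequency=10_000_000,
                       reference_utc_ticks=638_000_000_000_000_000)
assert utc == 638_000_000_000_000_000 + 15_000_000
```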
Each buffered event uses the machine's QPC infrastructure (`Stopwatch.GetTimestamp`), and each event buffer includes the timestamps of its first and last events. The metadata event emitted at the beginning of the stream includes a reference QPC + QPC frequency + reference UTC time in ticks, making it possible to convert all buffered events to wall clock time.

Future work