[Diagnostics]: High-performance EventSource runtime async profiler. #127238

Open
lateralusX wants to merge 4 commits into dotnet:main from lateralusX:lateralusX/async-profiler

Conversation

@lateralusX
Member

Motivation

TPL includes support for capturing compiler async (AsyncV1) events used to stitch together async callstacks in tools like PerfView, the VS .NET Async Profiler, and Application Insights Profiler.

The challenge with using TPL events to profile async-heavy workloads is that they are verbose and produce a lot of data, adding significant overhead to the profiled process and skewing measurements.

Each TPL event is written into ETW/EventPipe/UserEvents, incurring latency (a kernel call) as well as additional data (~100-byte header). Even a small event without a stack takes 200–500 ns to emit and occupies 100+ bytes. TPL's tracking of async execution generates heavy traffic on the eventing subsystem, increasing the risk of dropped events.

TPL overhead (synthetic benchmark)

| Async resume/suspend rate | Throughput drop | ETL size (20 s) | Dropped events |
|---|---|---|---|
| 1M/s | >75% | | severe (requires enlarged ETW buffers) |
| 100K/s | ~45% | 3+ GB | high (requires enlarged ETW buffers) |
| 10K/s | ~10% | | moderate |

TPL depends on a complete chain of events to recreate async callstacks — losing any events makes post-processing unreliable.

For quite some time there have been ideas about a more lightweight approach to tracking async method execution, making it possible to recreate async callstacks for sync callstacks captured by external tools such as OS CPU samplers and profilers.

With the introduction of runtime async (AsyncV2), it was decided to revisit this and see what we could do to improve the profiler experience of async code. The async profiler is not tied to runtime async methods (AsyncV2), so it will be able to handle compiler async methods (AsyncV1) as well, but this PR focuses on AsyncV2. Follow-up PRs will add AsyncV1 support, making it possible to use the new async profiler to collect both AsyncV1 and AsyncV2 async callstacks.


Design

NOTE: All formats introduced by this PR are currently considered internal and can be changed without notice.

This PR adds AsyncProfilerBufferedEventSource — a high-performance EventSource for async method profiling that uses per-thread buffered event emission with centralized flush coordination.

Core architecture

  • Per-thread event buffers with lock-free acquire/release for zero-contention writes on the hot path.
  • Delta timestamp encoding using compressed variable-length integers (LEB128 + zigzag), reducing per-event timestamp overhead from 8 bytes to typically 1–2 bytes under load.
  • Delta IP encoding using compressed variable-length integers, reducing bytes per frame IP.
  • Centralized AsyncThreadContextCache with background flush timer for idle and dead thread buffer reclamation.
  • Continuation wrapper table for compact async callstack representation, mapping runtime IPs to table indices — enables matching sync callstacks captured by OS CPU profilers through the resume async callstack event.
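
The delta timestamp/IP encoding in the bullets above can be sketched as follows. This is an illustrative Python model only — the actual implementation is C# inside CoreLib, and the function names here are invented:

```python
def zigzag(v: int) -> int:
    """Map a signed delta to unsigned so small magnitudes stay small
    (zigzag encoding over 64-bit values)."""
    return (v << 1) ^ (v >> 63)

def leb128(u: int) -> bytes:
    """Unsigned LEB128: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        b = u & 0x7F
        u >>= 7
        if u:
            out.append(b | 0x80)  # more bytes follow
        else:
            out.append(b)
            return bytes(out)

def encode_timestamps(timestamps):
    """Emit each timestamp as a zigzag+LEB128 delta against the previous
    one; under load consecutive QPC values differ by small amounts, so
    each delta typically fits in 1-2 bytes instead of 8."""
    prev = 0
    out = bytearray()
    for ts in timestamps:
        out += leb128(zigzag(ts - prev))
        prev = ts
    return bytes(out)
```

The same varint scheme is applied to frame IPs, where deltas between nearby code addresses are likewise small.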

Event types

Events cover the full async lifecycle:

| Category | Events |
|---|---|
| Async context | create, resume, suspend, complete |
| Async method | resume, complete |
| Exception unwind | unhandled, handled |
| Async callstacks | create, resume, suspend |

Buffer management

  • Configurable buffer size via DOTNET_AsyncProfilerBufferedEventSource_EventBufferSize (default 16 KB − 256 bytes).
  • Optimized buffer serialization with low overhead.
  • SyncPoint mechanism for coordinated config changes across writer threads.
  • BlockContext flag for safe flush-thread access to live thread buffers with 100 ms spin timeout to prevent flush thread stalls.
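
As a rough model of the BlockContext handshake above (a hedged Python sketch — names and structure are illustrative, not the runtime's actual code), the flush thread spins until the writer has left the buffer or the 100 ms deadline passes:

```python
import time

SPIN_TIMEOUT_S = 0.1  # mirrors the 100 ms timeout described above

def try_block_context(is_writer_active, timeout_s=SPIN_TIMEOUT_S):
    """Flush-thread side: spin until the writer thread has left its
    buffer or the timeout expires. Returns True when it is safe for the
    flush thread to read the buffer."""
    deadline = time.monotonic() + timeout_s
    while is_writer_active():
        if time.monotonic() >= deadline:
            return False  # bail out rather than stall the flush thread
        time.sleep(0)     # yield; a real implementation would spin/pause
    return True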

Integration

  • Async profiler wired into AsyncInstrumentation alongside the async debugger.
  • Cross-runtime design: EventSource + AsyncProfiler with CoreCLR-specific parts in a dedicated source file.
  • Mono stub for potential future platform support (AsyncV1).

Test coverage

Comprehensive test suite (AsyncProfilerTests) validating event correctness, buffer serialization, delta encoding, callstack capture, config changes, and multi-threaded stress scenarios.


Performance results

Overhead comparison: async profiler vs. TPL

| Async resume/suspend rate | TPL overhead | Async profiler overhead | Improvement |
|---|---|---|---|
| 1M/s | >75% | ~17% | ~4× |
| 100K/s | ~45% | <2% | ~20× |
| 10K/s | ~10% | <0.2% (noise) | ~50× |

Note: The 1M/s scenario is extreme — the benchmark does virtually no work, only exercising the internal async dispatch loop. Any real user code in the async methods will quickly reduce the relative overhead.

Data volume

ETL file size is down ~10× for scenarios capturing data to recreate async callstacks. For the 1M/s scenario over 20 seconds:

| | ETL size | Dropped events |
|---|---|---|
| TPL | 3+ GB | many (default ETW settings) |
| Async profiler | ~330 MB | none (default ETW settings) |

VS CPU profiling visibility

Running the 1M/s scenario under VS CPU profiling, none of the async profiler methods stand out — most are in the 0.01–0.03% self-CPU range with very low sample counts. The instrumented DispatchContinuations function shows ~2% overhead compared to the uninstrumented version. Even in very heavy async workloads, the async profiler does not pollute VS CPU profiling output.


Continuation wrapper optimization

My initial ambition was to track the async callstack on any thread at any point using a single event that includes the resumed async callstack executed through the dispatch loop. That strategy hit issues due to ambiguity when mapping methods between sync callstacks collected by the OS CPU sampler and the resumed async callstack active at the time.

This can be solved using CompleteAsyncMethod events to recreate async callstacks at any point in time. CompleteAsyncMethod is just a signal event consuming a couple of bytes in the event buffer, but it introduces ~30–40 ns per completed method. The major cost is capturing the QPC (~15–20 ns, platform-dependent); the rest is raw memcpy + delta encoding. Reading the timestamp directly via CPU instruction could bring this down to ~10 ns total — a potential future optimization (would require a JIT intrinsic).

Instead of continuing to optimize CompleteAsyncMethod, I revisited the original problem: if we inject an anchor in the sync callstack captured by external tools, we can use it to tie into the async callstack emitted via the resume async callstack event. Since we control the dispatch loop running each continuation, it's possible to call through an indexed wrapper that places enough information in the sync callstack to identify the current resumed continuation.

With up to 255 frames in an async callstack, the pre-generated wrappers are capped at 32 and recycled, emitting a reset event into the stream to signal reuse to parsers. This mechanism makes it possible to recreate any async callstack tied to sync callstacks captured by external tools using a single event (ResumeAsyncCallstack).

Calling through the wrapper costs ~5 ns per method resume — compared to ~30–40 ns for CompleteAsyncMethod plus increased output size, this ended up as a very successful optimization.
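
A toy model of the wrapper recycling described above (illustrative Python — the real table holds pre-generated call wrappers anchored in the sync callstack, not integers, and `WRAPPER_COUNT`/`WrapperTable` are invented names):

```python
WRAPPER_COUNT = 32  # cap from the text; deeper async stacks recycle indices

class WrapperTable:
    """Hand out wrapper indices per resumed continuation frame and emit a
    reset event whenever the table wraps, so stream parsers know earlier
    indices are being reused."""
    def __init__(self, emit_event):
        self._emit = emit_event
        self._next = 0

    def acquire(self):
        if self._next == WRAPPER_COUNT:
            self._emit("reset")  # signal index reuse to parsers
            self._next = 0
        idx = self._next
        self._next += 1
        return idx
```

A parser that sees a sync callstack passing through wrapper index N looks up the most recent ResumeAsyncCallstack event for N since the last reset.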


Timestamp correlation

All async profiler events are emitted using existing EventSource infrastructure, meaning it's possible to listen on the event stream using in-proc EventListeners, ICorProfiler, as well as external ETW/EventPipe/UserEvent clients.

When these events are used to recreate async callstacks for other events that capture sync callstacks and are emitted into the same event subsystem, all events share the same timestamp infrastructure. If events are emitted using timestamp infrastructures that are not in sync, timestamps need to be re-synchronized and adjusted before use.

Each buffered event uses the machine's QPC infrastructure (Stopwatch.GetTimestamp), and each event buffer includes the timestamp of first and last event. The metadata event emitted at the beginning of the stream includes a reference QPC + QPC frequency + reference UTC time in ticks, making it possible to convert all buffered events to wall clock time.
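
Given those metadata fields, the wall-clock conversion can be sketched like this (Python for illustration; the parameter names are invented, and .NET UTC ticks are assumed to be 100 ns units counted from 0001-01-01):

```python
from datetime import datetime, timedelta, timezone

def qpc_to_utc(event_qpc, ref_qpc, qpc_freq, ref_utc_ticks):
    """Convert a buffered event's QPC timestamp to wall-clock UTC using
    the reference QPC + QPC frequency + reference UTC ticks carried by
    the metadata event."""
    # Elapsed time since the reference point, in 100 ns .NET ticks.
    elapsed_ticks = (event_qpc - ref_qpc) * 10_000_000 // qpc_freq
    utc_ticks = ref_utc_ticks + elapsed_ticks
    # Split into whole seconds + sub-second remainder to keep integer
    # precision over the large tick values involved.
    secs, rem = divmod(utc_ticks, 10_000_000)
    return datetime(1, 1, 1, tzinfo=timezone.utc) + timedelta(
        seconds=secs, microseconds=rem // 10)
```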


Future work

  • AsyncV1 support will come as follow-up PR(s).

Follow-up commits:

  • Fix 0-length room for callstack.
  • Have AsyncEventHeader return header start index.
  • Add QPC and UTC time to metadata to make time conversion possible.
  • Harden buffer allocation failure.
  • AI review feedback.
  • Adjust tests.
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag
See info in area-owners.md if you want to be subscribed.

Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a new high-performance, buffered EventSource-based async profiler pipeline for Runtime Async (AsyncV2), integrates it into the CoreCLR runtime-async dispatch loop (including continuation-wrapper anchoring), and adds a comprehensive test suite to validate the emitted buffer format and event semantics.

Changes:

  • Add AsyncProfilerBufferedEventSource + shared AsyncProfiler buffering/encoding infrastructure in CoreLib.
  • Integrate async-profiler emission into CoreCLR runtime-async dispatching (resume/suspend/complete, unwinds, callstack capture, wrapper dispatch).
  • Add AsyncProfilerTests (CoreCLR-only) and wire them into System.Threading.Tasks.Tests; add Mono stubs and project wiring for multi-runtime builds.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| src/mono/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncProfiler.Mono.cs | Mono stub implementation for async-profiler partials (wrappers + syncpoint). |
| src/mono/System.Private.CoreLib/System.Private.CoreLib.csproj | Includes the new Mono stub file in the Mono CoreLib build. |
| src/libraries/System.Runtime/tests/System.Threading.Tasks.Tests/System.Threading.Tasks.Tests.csproj | Adds AsyncProfilerTests to the System.Runtime task tests (non-Mono). |
| src/libraries/System.Runtime/tests/System.Threading.Tasks.Tests/System.Runtime.CompilerServices/AsyncProfilerTests.cs | New test suite validating buffer header, event stream, callstack encoding, wrapper anchoring, flushing, etc. |
| src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncProfilerBufferedEventSource.cs | New buffered EventSource emitting async-profiler buffers and receiving commands (update/flush). |
| src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncProfiler.cs | Shared async-profiler buffering, encoding (varint/zigzag), per-thread context caching, and config/syncpoint logic. |
| src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncInstrumentation.cs | Ensures the async-profiler EventSource is initialized so flags/config can be established. |
| src/libraries/System.Private.CoreLib/src/System.Private.CoreLib.Shared.projitems | Adds shared CoreLib compilation items for the new AsyncProfiler sources. |
| src/coreclr/nativeaot/System.Private.CoreLib/src/System.Private.CoreLib.csproj | Includes the CoreCLR async-profiler integration file for the NativeAOT CoreLib. |
| src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncProfiler.CoreCLR.cs | CoreCLR-specific async-profiler integration: wrapper table, callstack capture, and syncpoint resume replay. |
| src/coreclr/System.Private.CoreLib/System.Private.CoreLib.csproj | Includes the CoreCLR async-profiler integration file in the CoreCLR CoreLib build. |
| src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs | Wires async-profiler emission into the runtime-async dispatcher and adds the wrapper-dispatch path. |


```csharp
public static long[] GetContinuationWrapperIPs()
{
    return new long[COUNT];
```

Copilot AI Apr 21, 2026


On Mono, GetContinuationWrapperIPs() returns a long[COUNT] that will be all zeros, but metadata emission still reports wrapperCount = COUNT. If the event source is enabled on Mono, consumers will see wrapper IPs that look valid-but-unusable. Consider returning an empty array (wrapperCount=0) or otherwise clearly signaling “wrappers not supported” for Mono until real wrapper IPs can be provided.

Suggested change:

```diff
- return new long[COUNT];
+ return Array.Empty<long>();
```

```csharp
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe void ResumeAsyncCallstacks(AsyncThreadContext context)
{
    //Write recursivly all the resume async callstack events.
```

Copilot AI Apr 21, 2026


Typo in comment: "recursivly" should be "recursively" (and add a space after // for readability).

Suggested change:

```diff
- //Write recursivly all the resume async callstack events.
+ // Write recursively all the resume async callstack events.
```

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Tracing;
using System.Reflection;
```

Copilot AI Apr 21, 2026


using System.Reflection; appears unused in this file (no references found). Please remove it to avoid CS8019/unused-using warnings (which can be treated as build breaks in some configurations).

Suggested change:

```diff
- using System.Reflection;
```

Comment on lines +1044 to +1047:

```csharp
lock (CacheLock)
{
    Flush(true);
```

Copilot AI Apr 21, 2026


Cleanup takes lock (CacheLock) and then calls Flush(true), but Flush itself also takes the same lock. This double-locking is redundant and makes the locking scheme harder to reason about; consider splitting into a FlushCore(...) that assumes the lock is already held, or removing the outer lock and only locking around timer/cache state that truly needs it.

Comment on lines +1060 to +1063:

```csharp
lock (CacheLock)
{
    Flush(false);
```

Copilot AI Apr 21, 2026


PeriodicFlush takes lock (CacheLock) and then calls Flush(false), but Flush also takes the same lock. Consider refactoring to avoid nested locking (e.g., FlushCore under a single lock) to reduce overhead and simplify concurrency reasoning.

```csharp
/// </summary>
protected override void OnEventCommand(EventCommandEventArgs command)
{
    if (command.Command == (EventCommand)FlushCommand || command.Command == EventCommand.SendManifest)
```

Copilot AI Apr 21, 2026


OnEventCommand treats EventCommand.SendManifest the same as the custom flush command and forces a full CaptureState() flush. SendManifest can be issued when a listener connects just to request metadata; flushing buffered async-profiler data at that point adds avoidable overhead and can produce unexpected emissions. Consider limiting CaptureState() to the explicit FlushCommand and handling SendManifest like other built-in commands (or no-op).

Suggested change:

```diff
- if (command.Command == (EventCommand)FlushCommand || command.Command == EventCommand.SendManifest)
+ if (command.Command == (EventCommand)FlushCommand)
```
