[Diagnostics]: High-performance EventSource runtime async profiler. #127238

Open
lateralusX wants to merge 4 commits into dotnet:main from lateralusX:lateralusX/async-profiler

Conversation

@lateralusX
Member

Motivation

TPL includes support for capturing compiler async (AsyncV1) events used to stitch together async callstacks in tools like PerfView, the VS .NET Async Profiler, and Application Insights Profiler.

The challenge with using TPL events to profile async-heavy workloads is that they are verbose and produce a lot of data, adding significant overhead to the profiled process and skewing measurements.

Each TPL event is written into ETW/EventPipe/UserEvents, incurring latency (a kernel call) as well as additional data (~100-byte header). Even a small event without a stack takes 200–500 ns to emit and occupies 100+ bytes. TPL's tracking of async execution generates heavy traffic on the eventing subsystem, increasing the risk of dropped events.

TPL overhead (synthetic benchmark)

| Async resume/suspend rate | Throughput drop | ETL size (20 s) | Dropped events |
|---|---|---|---|
| 1M/s | >75% | | severe (requires enlarged ETW buffers) |
| 100K/s | ~45% | 3+ GB | high (requires enlarged ETW buffers) |
| 10K/s | ~10% | | moderate |

TPL depends on a complete chain of events to recreate async callstacks — losing any events makes post-processing unreliable.

For quite some time there have been ideas about a more lightweight approach to tracking async method execution, making it possible to recreate async callstacks for sync callstacks captured by external tools such as OS CPU samplers and profilers.

With the introduction of runtime async (AsyncV2), it was decided to revisit this and see what we could do to improve the profiler experience of async code. The async profiler is not tied to runtime async methods (AsyncV2), so it will be able to handle compiler async methods (AsyncV1) as well, but this PR focuses on AsyncV2. Follow-up PRs will add AsyncV1 support, making it possible to use the new async profiler to collect both AsyncV1 and AsyncV2 async callstacks.


Design

NOTE: All formats introduced by this PR are currently considered internal and can be changed without notice.

This PR adds AsyncProfilerBufferedEventSource — a high-performance EventSource for async method profiling that uses per-thread buffered event emission with centralized flush coordination.

Core architecture

  • Per-thread event buffers with lock-free acquire/release for zero-contention writes on the hot path.
  • Delta timestamp encoding using compressed variable-length integers (LEB128 + zigzag), reducing per-event timestamp overhead from 8 bytes to typically 1–2 bytes under load.
  • Delta IP encoding using compressed variable-length integers, reducing bytes per frame IP.
  • Centralized AsyncThreadContextCache with background flush timer for idle and dead thread buffer reclamation.
  • Continuation wrapper table for compact async callstack representation, mapping runtime IPs to table indices — enables matching sync callstacks captured by OS CPU profilers through the resume async callstack event.
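
The delta timestamp/IP encoding in the bullets above can be sketched as follows. This is an illustrative Python model only — the actual implementation is C# inside CoreLib, and the function names here are invented:

```python
def zigzag(v: int) -> int:
    """Map a signed delta to unsigned so small magnitudes stay small
    (zigzag encoding over 64-bit values)."""
    return (v << 1) ^ (v >> 63)

def leb128(u: int) -> bytes:
    """Unsigned LEB128: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        b = u & 0x7F
        u >>= 7
        if u:
            out.append(b | 0x80)  # more bytes follow
        else:
            out.append(b)
            return bytes(out)

def encode_timestamps(timestamps):
    """Emit each timestamp as a zigzag+LEB128 delta against the previous
    one; under load consecutive QPC values differ by small amounts, so
    each delta typically fits in 1-2 bytes instead of 8."""
    prev = 0
    out = bytearray()
    for ts in timestamps:
        out += leb128(zigzag(ts - prev))
        prev = ts
    return bytes(out)
```

The same varint scheme is applied to frame IPs, where deltas between nearby code addresses are likewise small.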

Event types

Events cover the full async lifecycle:

| Category | Events |
|---|---|
| Async context | create, resume, suspend, complete |
| Async method | resume, complete |
| Exception unwind | unhandled, handled |
| Async callstacks | create, resume, suspend |

Buffer management

  • Configurable buffer size via DOTNET_AsyncProfilerBufferedEventSource_EventBufferSize (default 16 KB − 256 bytes).
  • Optimized buffer serialization with low overhead.
  • SyncPoint mechanism for coordinated config changes across writer threads.
  • BlockContext flag for safe flush-thread access to live thread buffers with 100 ms spin timeout to prevent flush thread stalls.
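
As a rough model of the BlockContext handshake above (a hedged Python sketch — names and structure are illustrative, not the runtime's actual code), the flush thread spins until the writer has left the buffer or the 100 ms deadline passes:

```python
import time

SPIN_TIMEOUT_S = 0.1  # mirrors the 100 ms timeout described above

def try_block_context(is_writer_active, timeout_s=SPIN_TIMEOUT_S):
    """Flush-thread side: spin until the writer thread has left its
    buffer or the timeout expires. Returns True when it is safe for the
    flush thread to read the buffer."""
    deadline = time.monotonic() + timeout_s
    while is_writer_active():
        if time.monotonic() >= deadline:
            return False  # bail out rather than stall the flush thread
        time.sleep(0)     # yield; a real implementation would spin/pause
    return True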

Integration

  • Async profiler wired into AsyncInstrumentation alongside the async debugger.
  • Cross-runtime design: EventSource + AsyncProfiler with CoreCLR-specific parts in a dedicated source file.
  • Mono stub for potential future platform support (AsyncV1).

Test coverage

Comprehensive test suite (AsyncProfilerTests) validating event correctness, buffer serialization, delta encoding, callstack capture, config changes, and multi-threaded stress scenarios.


Performance results

Overhead comparison: async profiler vs. TPL

| Async resume/suspend rate | TPL overhead | Async profiler overhead | Improvement |
|---|---|---|---|
| 1M/s | >75% | ~17% | ~4× |
| 100K/s | ~45% | <2% | ~20× |
| 10K/s | ~10% | <0.2% (noise) | ~50× |

Note: The 1M/s scenario is extreme — the benchmark does virtually no work, only exercising the internal async dispatch loop. Any real user code in the async methods will quickly reduce the relative overhead.

Data volume

ETL file size is down ~10× for scenarios capturing data to recreate async callstacks. For the 1M/s scenario over 20 seconds:

| | ETL size | Dropped events |
|---|---|---|
| TPL | 3+ GB | many (default ETW settings) |
| Async profiler | ~330 MB | none (default ETW settings) |

VS CPU profiling visibility

Running the 1M/s scenario under VS CPU profiling, none of the async profiler methods stand out — most are in the 0.01–0.03% self-CPU range with very low sample counts. The instrumented DispatchContinuations function shows ~2% overhead compared to the uninstrumented version. Even in very heavy async workloads, the async profiler does not pollute VS CPU profiling output.


Continuation wrapper optimization

My initial ambition was to track the async callstack on any thread at any point using a single event that includes the resumed async callstack executed through the dispatch loop. That strategy hit issues due to ambiguity when mapping methods between sync callstacks collected by the OS CPU sampler and the resumed async callstack active at the time.

This can be solved using CompleteAsyncMethod events to recreate async callstacks at any point in time. CompleteAsyncMethod is just a signal event consuming a couple of bytes in the event buffer, but it introduces ~30–40 ns per completed method. The major cost is capturing the QPC (~15–20 ns, platform-dependent); the rest is raw memcpy + delta encoding. Reading the timestamp directly via CPU instruction could bring this down to ~10 ns total — a potential future optimization (would require a JIT intrinsic).

Instead of continuing to optimize CompleteAsyncMethod, I revisited the original problem: if we inject an anchor in the sync callstack captured by external tools, we can use it to tie into the async callstack emitted via the resume async callstack event. Since we control the dispatch loop running each continuation, it's possible to call through an indexed wrapper that places enough information in the sync callstack to identify the current resumed continuation.

With up to 255 frames in an async callstack, the pre-generated wrappers are capped at 32 and recycled, emitting a reset event into the stream to signal reuse to parsers. This mechanism makes it possible to recreate any async callstack tied to sync callstacks captured by external tools using a single event (ResumeAsyncCallstack).

Calling through the wrapper costs ~5 ns per method resume — compared to ~30–40 ns for CompleteAsyncMethod plus increased output size, this ended up as a very successful optimization.
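
A toy model of the wrapper recycling described above (illustrative Python — the real table holds pre-generated call wrappers anchored in the sync callstack, not integers, and `WRAPPER_COUNT`/`WrapperTable` are invented names):

```python
WRAPPER_COUNT = 32  # cap from the text; deeper async stacks recycle indices

class WrapperTable:
    """Hand out wrapper indices per resumed continuation frame and emit a
    reset event whenever the table wraps, so stream parsers know earlier
    indices are being reused."""
    def __init__(self, emit_event):
        self._emit = emit_event
        self._next = 0

    def acquire(self):
        if self._next == WRAPPER_COUNT:
            self._emit("reset")  # signal index reuse to parsers
            self._next = 0
        idx = self._next
        self._next += 1
        return idx
```

A parser that sees a sync callstack passing through wrapper index N looks up the most recent ResumeAsyncCallstack event for N since the last reset.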


Timestamp correlation

All async profiler events are emitted using existing EventSource infrastructure, meaning it's possible to listen on the event stream using in-proc EventListeners, ICorProfiler, as well as external ETW/EventPipe/UserEvent clients.

When these events are used to recreate async callstacks for other events that capture sync callstacks and are emitted into the same event subsystem, all events share the same timestamp infrastructure. If events are emitted using timestamp infrastructures that are not in sync, timestamps need to be re-synchronized and adjusted before use.

Each buffered event uses the machine's QPC infrastructure (Stopwatch.GetTimestamp), and each event buffer includes the timestamp of first and last event. The metadata event emitted at the beginning of the stream includes a reference QPC + QPC frequency + reference UTC time in ticks, making it possible to convert all buffered events to wall clock time.
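
Given those metadata fields, the wall-clock conversion can be sketched like this (Python for illustration; the parameter names are invented, and .NET UTC ticks are assumed to be 100 ns units counted from 0001-01-01):

```python
from datetime import datetime, timedelta, timezone

def qpc_to_utc(event_qpc, ref_qpc, qpc_freq, ref_utc_ticks):
    """Convert a buffered event's QPC timestamp to wall-clock UTC using
    the reference QPC + QPC frequency + reference UTC ticks carried by
    the metadata event."""
    # Elapsed time since the reference point, in 100 ns .NET ticks.
    elapsed_ticks = (event_qpc - ref_qpc) * 10_000_000 // qpc_freq
    utc_ticks = ref_utc_ticks + elapsed_ticks
    # Split into whole seconds + sub-second remainder to keep integer
    # precision over the large tick values involved.
    secs, rem = divmod(utc_ticks, 10_000_000)
    return datetime(1, 1, 1, tzinfo=timezone.utc) + timedelta(
        seconds=secs, microseconds=rem // 10)
```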


Future work

  • AsyncV1 support will come as follow-up PR(s).

Follow-up commits:

  • Fix 0-length room for callstack.
  • Have AsyncEventHeader return header start index.
  • Add QPC and UTC time to metadata to make time conversion possible.
  • Harden buffer allocation failure.
  • AI review feedback.
  • Adjust tests.
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag
See info in area-owners.md if you want to be subscribed.

Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a new high-performance, buffered EventSource-based async profiler pipeline for Runtime Async (AsyncV2), integrates it into the CoreCLR runtime-async dispatch loop (including continuation-wrapper anchoring), and adds a comprehensive test suite to validate the emitted buffer format and event semantics.

Changes:

  • Add AsyncProfilerBufferedEventSource + shared AsyncProfiler buffering/encoding infrastructure in CoreLib.
  • Integrate async-profiler emission into CoreCLR runtime-async dispatching (resume/suspend/complete, unwinds, callstack capture, wrapper dispatch).
  • Add AsyncProfilerTests (CoreCLR-only) and wire them into System.Threading.Tasks.Tests; add Mono stubs and project wiring for multi-runtime builds.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| src/mono/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncProfiler.Mono.cs | Mono stub implementation for async-profiler partials (wrappers + syncpoint). |
| src/mono/System.Private.CoreLib/System.Private.CoreLib.csproj | Includes the new Mono stub file in the Mono CoreLib build. |
| src/libraries/System.Runtime/tests/System.Threading.Tasks.Tests/System.Threading.Tasks.Tests.csproj | Adds AsyncProfilerTests to the System.Runtime task tests (non-Mono). |
| src/libraries/System.Runtime/tests/System.Threading.Tasks.Tests/System.Runtime.CompilerServices/AsyncProfilerTests.cs | New test suite validating buffer header, event stream, callstack encoding, wrapper anchoring, flushing, etc. |
| src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncProfilerBufferedEventSource.cs | New buffered EventSource emitting async-profiler buffers and receiving commands (update/flush). |
| src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncProfiler.cs | Shared async-profiler buffering, encoding (varint/zigzag), per-thread context caching, and config/syncpoint logic. |
| src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncInstrumentation.cs | Ensures the async-profiler EventSource is initialized so flags/config can be established. |
| src/libraries/System.Private.CoreLib/src/System.Private.CoreLib.Shared.projitems | Adds shared CoreLib compilation items for the new AsyncProfiler sources. |
| src/coreclr/nativeaot/System.Private.CoreLib/src/System.Private.CoreLib.csproj | Includes the CoreCLR async-profiler integration file for the NativeAOT CoreLib. |
| src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncProfiler.CoreCLR.cs | CoreCLR-specific async-profiler integration: wrapper table, callstack capture, and syncpoint resume replay. |
| src/coreclr/System.Private.CoreLib/System.Private.CoreLib.csproj | Includes the CoreCLR async-profiler integration file in the CoreCLR CoreLib build. |
| src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncHelpers.CoreCLR.cs | Wires async-profiler emission into the runtime-async dispatcher and adds the wrapper-dispatch path. |


```csharp
public static long[] GetContinuationWrapperIPs()
{
    return new long[COUNT];
```

Copilot AI Apr 21, 2026


On Mono, GetContinuationWrapperIPs() returns a long[COUNT] that will be all zeros, but metadata emission still reports wrapperCount = COUNT. If the event source is enabled on Mono, consumers will see wrapper IPs that look valid-but-unusable. Consider returning an empty array (wrapperCount=0) or otherwise clearly signaling “wrappers not supported” for Mono until real wrapper IPs can be provided.

Suggested change:

```diff
- return new long[COUNT];
+ return Array.Empty<long>();
```

```csharp
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe void ResumeAsyncCallstacks(AsyncThreadContext context)
{
    //Write recursivly all the resume async callstack events.
```

Copilot AI Apr 21, 2026


Typo in comment: "recursivly" should be "recursively" (and add a space after // for readability).

Suggested change:

```diff
- //Write recursivly all the resume async callstack events.
+ // Write recursively all the resume async callstack events.
```

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Tracing;
using System.Reflection;
```

Copilot AI Apr 21, 2026


using System.Reflection; appears unused in this file (no references found). Please remove it to avoid CS8019/unused-using warnings (which can be treated as build breaks in some configurations).

Suggested change:

```diff
- using System.Reflection;
```

Comment on lines +1044 to +1047:

```csharp
lock (CacheLock)
{
    Flush(true);
```

Copilot AI Apr 21, 2026


Cleanup takes lock (CacheLock) and then calls Flush(true), but Flush itself also takes the same lock. This double-locking is redundant and makes the locking scheme harder to reason about; consider splitting into a FlushCore(...) that assumes the lock is already held, or removing the outer lock and only locking around timer/cache state that truly needs it.

Comment on lines +1060 to +1063:

```csharp
lock (CacheLock)
{
    Flush(false);
```

Copilot AI Apr 21, 2026


PeriodicFlush takes lock (CacheLock) and then calls Flush(false), but Flush also takes the same lock. Consider refactoring to avoid nested locking (e.g., FlushCore under a single lock) to reduce overhead and simplify concurrency reasoning.

```csharp
/// </summary>
protected override void OnEventCommand(EventCommandEventArgs command)
{
    if (command.Command == (EventCommand)FlushCommand || command.Command == EventCommand.SendManifest)
```

Copilot AI Apr 21, 2026


OnEventCommand treats EventCommand.SendManifest the same as the custom flush command and forces a full CaptureState() flush. SendManifest can be issued when a listener connects just to request metadata; flushing buffered async-profiler data at that point adds avoidable overhead and can produce unexpected emissions. Consider limiting CaptureState() to the explicit FlushCommand and handling SendManifest like other built-in commands (or no-op).

Suggested change:

```diff
- if (command.Command == (EventCommand)FlushCommand || command.Command == EventCommand.SendManifest)
+ if (command.Command == (EventCommand)FlushCommand)
```
