Skip to content

Fix OTel telemetry perf regression in CLI restore#54063

Draft
marcpopMSFT wants to merge 1 commit intomainfrom
fix/otel-telemetry-perf
Draft

Fix OTel telemetry perf regression in CLI restore#54063
marcpopMSFT wants to merge 1 commit intomainfrom
fix/otel-telemetry-perf

Conversation

@marcpopMSFT
Copy link
Copy Markdown
Member

  • Defer OTel provider initialization to after TELEMETRY_OPTOUT check
  • Remove AddHttpClientInstrumentation() that was hooking all NuGet HTTP calls
  • Remove unused MeterProvider (no Meter defined in CLI, pure overhead)
  • Gate OTLP exporter behind explicit OTEL_EXPORTER_OTLP_ENDPOINT config
  • Add thread-safe double-checked lock for provider initialization

- Defer OTel provider initialization to after TELEMETRY_OPTOUT check
- Remove AddHttpClientInstrumentation() that was hooking all NuGet HTTP calls
- Remove unused MeterProvider (no Meter defined in CLI, pure overhead)
- Gate OTLP exporter behind explicit OTEL_EXPORTER_OTLP_ENDPOINT config
- Add thread-safe double-checked lock for provider initialization

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 24, 2026 15:59
@marcpopMSFT
Copy link
Copy Markdown
Member Author

Perf summary with this fix:

BEFORE FIX (preview 4 SDK):
   Default (telemetry on):  AVG 36.41s  MIN 23.56s  MAX 61.55s
   Telemetry opted-out:     AVG 20.98s  MIN 19.50s  MAX 23.14s
 
 AFTER FIX (locally-built SDK):
   Default (telemetry on):  AVG 29.46s  MIN 25.35s  MAX 36.31s

Improvements:

 - ~19% faster on average (36.4→29.5s)
 - MAX dropped from
  61.5→36.3s (eliminated extreme spikes from OTLP timeout)
 - Much more consistent variance

Remaining gap (~8.5s vs opt-out) is primarily from:

 1. Azure Monitor exporter (~4-5s) — still initializes storage dir, HTTP client, flushes at shutdown
 2. TracerProvider base cost (~3-4s) — in-memory exporter, sampler, activity source plumbing

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a performance regression in dotnet CLI restore telemetry by deferring OpenTelemetry (OTel) provider setup until after telemetry opt-out is evaluated and by reducing default instrumentation/export overhead.

Changes:

  • Defer OTel TracerProvider initialization until after TELEMETRY_OPTOUT is checked, with a guarded one-time initialization path.
  • Remove HTTP/runtime instrumentation and the (unused) metrics provider to avoid hooking/overhead during NuGet restore.
  • Only enable OTLP exporting when an explicit OTEL_EXPORTER_OTLP_*_ENDPOINT environment variable is configured.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/Cli/dotnet/dotnet.csproj Removes OTel instrumentation package references that were contributing to restore overhead.
src/Cli/dotnet/Telemetry/TelemetryClient.cs Defers and gates tracer provider initialization; removes metrics provider and unconditional OTLP/http/runtime instrumentation.

Comment on lines +99 to +103
if (s_tracerProvider is not null)
{
return;
}

// Only add the OTLP exporter when the user has explicitly configured an endpoint.
var otlpEndpoint = Env.GetEnvironmentVariable("OTEL_EXPORTER_OTLP_ENDPOINT")
?? Env.GetEnvironmentVariable("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT");
if (!string.IsNullOrEmpty(otlpEndpoint))
Comment on lines +117 to +123
// Only add the OTLP exporter when the user has explicitly configured an endpoint.
var otlpEndpoint = Env.GetEnvironmentVariable("OTEL_EXPORTER_OTLP_ENDPOINT")
?? Env.GetEnvironmentVariable("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT");
if (!string.IsNullOrEmpty(otlpEndpoint))
{
tracerBuilder.AddOtlpExporter();
}
Comment on lines +118 to +119
var otlpEndpoint = Env.GetEnvironmentVariable("OTEL_EXPORTER_OTLP_ENDPOINT")
?? Env.GetEnvironmentVariable("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT");
Copy link
Copy Markdown
Member

@baronfel baronfel Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is bad for usability - most folks won't be defining these, they'll just be using tools that bind to the conventional OTel ports of 4317 and 4318. Put another way, this is very much not the expectation for the OTel ecosystem.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still learning how all of this works, but all of the OLTP exported data doesn't go anywhere we can observe since data-x-platform only takes in traces which are produced from logging events (activity events get sent as well). If this is only powering the aspire dashboard for local perf investigations, I don't think we should incur the perf hit of having this exporter by default. What do you think?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on conversation, this is only pushing or sending anything if there is a specific otel endpoint configured and running such as localhost:4317/traces

Copy link
Copy Markdown
Member

@MiYanni MiYanni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhhhh. My gut reaction is this wouldn't work. Here's some context:

The design of this has a duality between being static and also being an instance. I maintained that during the transition to OTel since it seemed more difficult to redesign it during the migration. That's why there is a "setup" of the static instances of these provider builders, so they are not duplicated. The changes in this PR are always creating new builders on each new telemetry client instance, which I don't know the impact of that.

That's why this is written this way:

s_metricsProvider ??= s_metricsProviderBuilder.Build();
s_tracerProvider ??= s_tracerProviderBuilder.Build();

That's in instance code, so that when they are built, they are only built once. This code is a mess, but I did the best with what was there.

Here, I believe Copilot thinks the OtlpExporter is the problem. If that is the case, then it should just be optional behind the DOTNET_CLI_TELEMETRY_DISABLE_TRACE_EXPORT env-var, like I did for the Azure Monitor exporter. But that really won't change the "normal" case of just using the released SDK with telemetry turned on.

Overall, I don't understand what Copilot is trying to accomplish here. Based on the comment in the code it added, it looks like it is trying to defer telemetry setup until we know if telemetry is disabled or not. AFAIK, I did that by not building the providers until it is known. Making the provider builders statically shouldn't be incurring a cost, unless I misunderstand what the point of the builders are. .Build() should be where costs are incurred, which is already behind the OTel enabled flag.

And deferring this logic won't fix the perf impact for the normal use of the official SDK build (since telemetry is on). That doesn't make any sense.

Copy link
Copy Markdown
Member

@MiYanni MiYanni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this PR just deleted the meter provider. @baronfel Correct me if I'm wrong, but we're going to use that for AOT perf investigation, right? We could put that behind an environment variable if it is causing perf impact.

@baronfel
Copy link
Copy Markdown
Member

It's an alternative to ETW events, yeah - but nothing was really hooked up to it yet aside from . NET runtime metrics IIRC. things like GC diagnostics and assembly loads.

@MiYanni MiYanni marked this pull request as draft April 24, 2026 21:31
var tracerBuilder = Sdk.CreateTracerProviderBuilder()
.ConfigureResource(r => { r.AddService("dotnet-cli", serviceVersion: Product.Version); })
.AddSource(Activities.Source.Name)
.AddInMemoryExporter(s_activities)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that this also might be the root cause based on conversations with @baronfel and @MiYanni

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants