Add MetricsEventSource #54333

noahfalk · 2021-06-17T13:47:01Z

This feature is still a work in progress, but I wanted to let others
see it in its current state while I refine it.

Our out-of-process tools like dotnet-counters and dotnet-monitor need
access to the metrics produced by the new Meter APIs without
requiring the app to take any dependency on a separate OpenTelemetry
library. System.Diagnostics.Metrics EventSource is a new source designed
to let those tools access this data. The EventSource includes high-performance
in-proc pre-aggregation capable of observing
millions of instrument invocations/sec/thread with low CPU overhead.

This change does not create any new BCL API surface; the aggregated
data is exposed solely by subscribing to the EventSource, for example
through ETW, EventPipe, LTTng, or EventListener. Anyone who wants
in-process APIs to consume the data can use either MeterListener
for unaggregated data or a library such as OpenTelemetry for
pre-aggregated data.
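
For readers unfamiliar with the in-process option mentioned above, here is a minimal sketch of consuming unaggregated measurements with MeterListener. The meter name `Sketch.Meter`, the instrument name `requests`, and the `MeterListenerSketch` wrapper are hypothetical and exist only for illustration; this is not code from the PR.

```csharp
using System;
using System.Diagnostics.Metrics;

static class MeterListenerSketch
{
    // Records two measurements on a hypothetical counter and returns the
    // sum that the listener observed.
    public static long Run()
    {
        using var meter = new Meter("Sketch.Meter"); // hypothetical meter name
        Counter<long> requests = meter.CreateCounter<long>("requests");

        long total = 0;
        using var listener = new MeterListener();

        // Opt in only to instruments published by the meter we care about.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == "Sketch.Meter")
            {
                l.EnableMeasurementEvents(instrument);
            }
        };

        // Invoked synchronously for each unaggregated measurement of type long.
        listener.SetMeasurementEventCallback<long>(
            (instrument, measurement, tags, state) => total += measurement);

        listener.Start();

        requests.Add(1);
        requests.Add(2);
        return total;
    }
}
```

Calling `MeterListenerSketch.Run()` returns 3, because each `Add` call is delivered to the callback with no aggregation in between.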

Todo list:

- I think we need a configuration on the EventSource to limit the max number of metrics to track. This ensures that even if someone goes nuts with tags we won't accidentally OOM the app trying to insert all those aggregations into the dictionaries.

ghost · 2021-06-17T13:47:08Z

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	noahfalk
Assignees:	-
Labels:	`area-System.Diagnostics.Tracing`
Milestone:	-

noahfalk · 2021-06-17T13:53:07Z

@tarekgh @dotnet/dotnet-diag @cijothomas @reyang @victlu @wiktork @jander-msft @shirhatti
I'm still cleaning this up but giving a heads up on what is coming. If you have any feedback on the broad functionality/design/structure I'd much appreciate it. Feedback on implementation details is welcome too, but you may find it more efficient to let me clean it up more and switch to a non-draft PR before bothering to dig in at that level.

noahfalk · 2021-06-29T13:57:28Z

Thanks for the nice review @gfoidl! I think almost all of it has been applied and a few parts were rendered moot from other changes.

- Made some adjustments to the events
- Added a bunch of tests
- Fixed all the bugs I found with those tests
- Misc refactoring
- PR feedback

noahfalk · 2021-06-29T14:51:53Z

I got most of the changes I wanted made and rebased it on main so now it is probably more reasonable to review it.

tarekgh · 2021-06-29T19:01:03Z

...ies/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/Metrics/AggregationManager.cs

+            // This explicitly uses a Thread and not a Task so that metrics still work
+            // even when an app is experiencing thread-pool starvation. Although we
+            // can't make in-proc metrics robust to everything, this is a common enough
+            // problem in .NET apps that it feels worthwhile to take the precaution.


CC @stephentoub just in case he has any comment.

How many of these AggregationManager instances will there be in a process, and thus how many of these threads?

At any given time there should be at most one instance of AggregationManager and one thread. MetricsEventSource disposes the old one (which joins the thread) before creating a new one.
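
The pattern discussed in this thread (a dedicated thread rather than a thread-pool Task, joined on dispose) can be sketched roughly as follows. `PeriodicCollector` and its members are hypothetical names, not the PR's AggregationManager, and the sketch leaves out everything except the threading shape.

```csharp
using System;
using System.Threading;

// Hypothetical illustration only; not the PR's AggregationManager.
sealed class PeriodicCollector : IDisposable
{
    private readonly ManualResetEventSlim _stop = new ManualResetEventSlim(false);
    private readonly TimeSpan _interval;
    private readonly Action _collect;
    private readonly Thread _thread;

    public PeriodicCollector(TimeSpan interval, Action collect)
    {
        _interval = interval;
        _collect = collect;
        // A dedicated thread keeps running even if the thread pool is starved.
        _thread = new Thread(Run) { IsBackground = true, Name = "CollectorSketch" };
        _thread.Start();
    }

    private void Run()
    {
        // Wait returns false on timeout, meaning it is time to collect again,
        // and true once Dispose has signaled the stop event.
        while (!_stop.Wait(_interval))
        {
            _collect();
        }
    }

    public void Dispose()
    {
        _stop.Set();
        _thread.Join(); // the old thread is fully gone before a replacement is created
        _stop.Dispose();
    }
}
```

Disposing the old collector before constructing a new one keeps the process at no more than one such thread at a time.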

stephentoub · 2021-06-30T15:45:28Z

...braries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/Metrics/ObjectSequence.cs

+            Value1 = value1;
+        }
+
+        public override int GetHashCode() => Value1?.GetHashCode() ?? 0;


Is Value1 ever written outside of the ctor? If no, it should be readonly. If yes, does any code depend on this GetHashCode being stable, e.g. are these ever put into a dictionary?

Is Value1 ever written outside of the ctor? Yes
If yes, does any code depend on this GetHashCode being stable? Yes

The path that modifies these after construction is here: https://github.com/dotnet/runtime/pull/54333/files#diff-37b757b2c75dad499405642004b936213328f52d7becdec52b3cde0d7948a54bR411-R431

I think the key invariant is that the values are never changed after inserting into the dictionary.
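
To illustrate the invariant in general terms: a dictionary stores its own copy of a struct key, so a scratch key can be mutated and reused for lookups as long as the copy held by the dictionary never changes. The sketch below uses a hypothetical `Key1` struct loosely modeled on the one under review; it is not the PR's code path.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical one-element key, loosely modeled on the struct under review.
struct Key1 : IEquatable<Key1>
{
    public object Value1;

    public bool Equals(Key1 other) => object.Equals(Value1, other.Value1);
    public override bool Equals(object obj) => obj is Key1 other && Equals(other);
    public override int GetHashCode() => Value1?.GetHashCode() ?? 0;
}

static class KeyReuseSketch
{
    public static bool Run()
    {
        var store = new Dictionary<Key1, string>();

        var scratch = new Key1();
        scratch.Value1 = "a";
        store[scratch] = "first"; // the dictionary keeps its own copy of the struct

        scratch.Value1 = "b";     // mutates only the local scratch copy

        // The stored entry is still found under "a", and nothing is stored under "b".
        return store.ContainsKey(new Key1 { Value1 = "a" }) && !store.ContainsKey(scratch);
    }
}
```

`KeyReuseSketch.Run()` returns true. The hash code only has to be stable for the copy that lives inside the dictionary.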

josalem · 2021-06-30T23:41:43Z

...ies/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/Metrics/AggregationManager.cs

+            if (genericDefType == typeof(Counter<>))
+            {
+                return () => new RateSumAggregator();
+            }
+            else if (genericDefType == typeof(ObservableCounter<>))
+            {
+                return () => new RateAggregator();
+            }
+            else if (genericDefType == typeof(ObservableGauge<>))
+            {
+                return () => new LastValue();
+            }
+            else if (genericDefType == typeof(Histogram<>))
+            {
+                return () => new ExponentialHistogramAggregator(DefaultHistogramConfig);
+            }
+            else
+            {
+                return null;
+            }


nit: switch statement/expression on genericDefType?

I don't think that works. case typeof(typename) gives an error that the value isn't constant. Pattern matching on the instrument's type would require knowing the exact closed type, I think, but I only know the open generic type definition.
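
For what it's worth, a switch expression can still be used here if each arm carries a `when` guard, since the guard is an ordinary runtime comparison rather than a constant pattern. A minimal sketch, assuming C# 8 or later; `GenericDefSwitchSketch` and the string labels are hypothetical stand-ins for the aggregator factories in the diff:

```csharp
using System;
using System.Diagnostics.Metrics;

static class GenericDefSwitchSketch
{
    // Maps an instrument type to a label naming the kind of aggregation the
    // diff above picks for it. The labels are illustrative only.
    public static string Describe(Type instrumentType)
    {
        Type genericDef = instrumentType.IsGenericType
            ? instrumentType.GetGenericTypeDefinition()
            : null;

        return genericDef switch
        {
            Type t when t == typeof(Counter<>) => "rate-sum",
            Type t when t == typeof(ObservableCounter<>) => "rate",
            Type t when t == typeof(ObservableGauge<>) => "last-value",
            Type t when t == typeof(Histogram<>) => "histogram",
            _ => null,
        };
    }
}
```

Whether this reads better than the if/else chain is a matter of taste; the behavior is the same.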

josalem · 2021-07-01T16:12:26Z

...raries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/Metrics/AggregatorStore.cs

+            else if (stateUnion is MultiSizeLabelNameDictionary<TAggregator> aggsMultiSize)
+            {
+                aggsMultiSize.Collect(visitFunc);
+            }


Seconding this! I think the switch matches better with the sum type nature of the state variable. But, again, it's personal preference.

josalem · 2021-07-01T18:03:16Z

...es/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/Metrics/LastValueAggregator.cs

+            lock (this)
+            {
+                LastValueStatistics stats = new LastValueStatistics(_lastValue);
+                _lastValue = null;
+                return stats;
+            }


Does Update need this lock too? If it isn't locking also, I'm not sure if we need this lock here. We could swap for an interlocked exchange or something.

Yeah, you could do this one with Interlocked.Exchange instead of the lock. I left it as a lock because it was simple and I don't anticipate this code being on the hot path. The hot paths are InstrumentState.Update(), AggregatorStore.GetAggregator(), RateSumAggregator.Update() and Histogram.Update(). Everything else is unlikely to go above a few hundred invocations/sec.

Everything else is unlikely to go above a few hundred invocations/sec.

Which methods might be invoked several hundred times a second? Ones that might allocate?

Yeah, if someone requested to collect metrics once per second and enabled a few hundred metrics then the InstrumentState.Collect() path would be running a few hundred times per second. InstrumentState.Collect() invokes AggregatorStore.Collect() which in turn invokes Aggregator.Collect(). There are a few allocations down that path.

None of that is going to happen automatically: an engineer needs to run a diagnostic tool to turn it on, specify that they want collections every second, and specify a bunch of metrics they want collected.
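
As a rough illustration of the lock-free alternative raised in this thread: Interlocked.Exchange has no overload for a nullable double, so the sketch below assumes `double.NaN` can serve as the "no value recorded" sentinel. That sentinel is an assumption of this sketch (a genuine NaN measurement would be indistinguishable from "no value"), and `LastValueSketch` is a hypothetical stand-in rather than the PR's aggregator.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for a last-value aggregator; not the PR's implementation.
sealed class LastValueSketch
{
    // Assumption of this sketch: NaN means "nothing recorded since the last collect".
    private double _lastValue = double.NaN;

    public void Update(double value) => Volatile.Write(ref _lastValue, value);

    // Atomically takes the current value and resets the slot to the sentinel.
    public double? Collect()
    {
        double previous = Interlocked.Exchange(ref _lastValue, double.NaN);
        return double.IsNaN(previous) ? (double?)null : previous;
    }
}
```

After `Update(3.5)`, the first `Collect()` returns 3.5 and the next one returns null until another update arrives.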

josalem · 2021-07-01T18:18:34Z

...ies/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/Metrics/MetricsEventSource.cs

+    /// ad-hoc monitoring for the new Instrument APIs. This source only supports one listener
+    /// at a time. Each new listener will overwrite the configuration about which metrics


Rather than having multiple sessions overwrite the configuration in .NET 6, could we keep track of a union of the requested metrics?

All sessions would observe additions to the metric collection and increases in the frequency.

e.g.,

Session A enables Metric1;Metric2 @ 10s intervals.
Session B enables Metric1;Metric3 @ 5s intervals.

then both sessions would see Metric1;Metric2;Metric3 @ 5s intervals.

Session B disables.

Session A continues to see Metric1;Metric2;Metric3 @ 5s intervals.

Then in .NET 7 we could attempt to make changes unique to a session, which would require more bookkeeping, as your comment says.
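
The proposal above amounts to a shared configuration that only ever grows: the metric set is a union and the interval is the shortest one requested so far. A minimal sketch, with `SharedMetricConfigSketch` and its members being hypothetical names:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical bookkeeping for the "union of sessions" idea; not part of the PR.
sealed class SharedMetricConfigSketch
{
    private readonly SortedSet<string> _metrics = new SortedSet<string>(StringComparer.Ordinal);

    // Shortest interval requested so far, or null if no session has enabled anything.
    public TimeSpan? Interval { get; private set; }

    public IReadOnlyCollection<string> Metrics => _metrics;

    // Adds a session's metrics to the shared set and tightens the interval if needed.
    public void Enable(TimeSpan interval, params string[] metrics)
    {
        _metrics.UnionWith(metrics);
        if (Interval == null || interval < Interval.Value)
        {
            Interval = interval;
        }
    }
}
```

Enabling Metric1;Metric2 at 10s and then Metric1;Metric3 at 5s leaves the shared state at Metric1;Metric2;Metric3 at 5s. There is deliberately no shrink step, matching the example in which Session A keeps seeing the combined set after Session B disables.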

- bugfix to use invariant culture
- support time series and histogram limits

noahfalk · 2021-07-06T22:12:37Z

Thanks for all the review feedback everyone! I think everything has been addressed at this point. There are a few comments above where I answered a question but left the thread unresolved so that folks could see the answers.

If there is any remaining feedback please let me know, otherwise I plan to hit the merge button tomorrow.

Also, as a heads up, I've got at least one more change I am thinking of making in a separate PR to improve how the EventSource handles exceptions that are thrown from user-provided callbacks.

AndyAyersMS · 2022-03-24T20:50:50Z

This bit of code treats two consecutive fields of an object as a span of the two fields (similarly for related types).

I suspect this is treading on dangerous territory and may lead to incorrect optimizations by the JIT, especially if later on there is code that intermingles references from the AsSpan and the individual fields. I don't see that happening anywhere, but generally speaking the JIT will not properly handle the types of aliasing this can introduce.

runtime/src/libraries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/Metrics/ObjectSequence.cs

Lines 32 to 37 in 5705c98

    internal partial struct ObjectSequence2 : IEquatable<ObjectSequence2>, IObjectSequence
    {
        public object? Value1;
        public object? Value2;

        public ObjectSequence2(object? value1, object? value2)

runtime/src/libraries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/Metrics/ObjectSequence.netcore.cs

Lines 22 to 30 in 5705c98

    internal partial struct ObjectSequence2 : IEquatable<ObjectSequence2>, IObjectSequence
    {
        public Span<object?> AsSpan()
        {
            return MemoryMarshal.CreateSpan(ref Value1, 2);
        }

        public override int GetHashCode() => HashCode.Combine(Value1, Value2);
    }
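
One possible way to sidestep the aliasing concern, offered only as a sketch and not as what the PR or any follow-up actually does, is to expose indexed access through a switch over the fields instead of reinterpreting adjacent fields as a span. `TwoObjectsSketch` is a hypothetical type:

```csharp
using System;

// Hypothetical two-element sequence with indexed access that never forms a
// span over adjacent fields.
struct TwoObjectsSketch
{
    public object Value1;
    public object Value2;

    public object this[int index]
    {
        get
        {
            switch (index)
            {
                case 0: return Value1;
                case 1: return Value2;
                default: throw new ArgumentOutOfRangeException(nameof(index));
            }
        }
        set
        {
            switch (index)
            {
                case 0: Value1 = value; break;
                case 1: Value2 = value; break;
                default: throw new ArgumentOutOfRangeException(nameof(index));
            }
        }
    }
}
```

The trade-off is a branch per access instead of a span view; whether that matters on the hot paths named earlier in the thread would need measuring.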

dotnet-issue-labeler bot added the area-System.Diagnostics.Tracing label Jun 17, 2021

This was referenced Jun 17, 2021

dotnet-counters support for the new System.Diagnostics.Metrics APIs dotnet/diagnostics#2373

Merged

Feature: Support Percentiles and multi-dimension metrics in dotnet-counters dotnet/diagnostics#2368

Closed

gfoidl reviewed Jun 17, 2021

noahfalk force-pushed the metric_event_source branch from f5945e0 to 87c8fca Compare June 29, 2021 14:42

MetricEventSource cleanup

f871425

noahfalk force-pushed the metric_event_source branch from 87c8fca to f871425 Compare June 29, 2021 14:45

noahfalk marked this pull request as ready for review June 29, 2021 14:49

cijothomas reviewed Jun 29, 2021

tarekgh reviewed Jun 29, 2021

ViktorHofer reviewed Jun 29, 2021