Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MetricsEventSource #54333

Merged
merged 7 commits into from
Jul 7, 2021
Merged

Add MetricsEventSource #54333

merged 7 commits into from
Jul 7, 2021

Conversation

noahfalk
Copy link
Member

@noahfalk noahfalk commented Jun 17, 2021

The feature is still a work in progress but wanted to let others
see it in its current state while I am refining it.

Our out-of-process tools like dotnet-counters and dotnet-monitor need
access to the metrics produced by the new Meter APIs without
requiring the app to take any dependency on a separate OpenTelemetry
library. System.Diagnostics.Metrics EventSource is a new source designed
to let those tools access this data. The EventSource includes high
performance in-proc pre-aggregation capable of observing
millions of instrument invocations/sec/thread with low CPU overhead.

This change does not create any new BCL API surface, the aggregated
data is solely exposed by subscribing to the EventSource such as
using ETW, EventPipe, Lttng, or EventListener. For anyone wanting
in-process APIs to consume the data they could either use MeterListener
for unaggregated data or a library such as OpenTelemetry for
pre-aggregated data.

Todo list:

  • - I think we need a configuration on the EventSource to limit the max number of metrics to track. This ensures that even if someone goes nuts with tags we won't accidentally OOM the app trying to insert all those aggregations into the dictionaries.

@ghost
Copy link

ghost commented Jun 17, 2021

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti
See info in area-owners.md if you want to be subscribed.

Issue Details

The feature is still a work in progress but wanted to let others
see it in its current state while I am refining it.

Our out-of-process tools like dotnet-counters and dotnet-monitor need
access to the metrics produced by the new Meter APIs without
requiring the app to take any dependency on a separate OpenTelemetry
library. System.Diagnostics.Metrics EventSource is a new source designed
to let those tools access this data. The EventSource includes high
performance in-proc pre-aggregation capable of observing
millions of instrument invocations/sec/thread with low CPU overhead.

This change does not create any new BCL API surface, the aggregated
data is solely exposed by subscribing to the EventSource such as
using ETW, EventPipe, Lttng, or EventListener. For anyone wanting
in-process APIs to consume the data they could either use MeterListener
for unaggregated data or a library such as OpenTelemetry for
pre-aggregated data.

Author: noahfalk
Assignees: -
Labels:

area-System.Diagnostics.Tracing

Milestone: -

@noahfalk
Copy link
Member Author

@tarekgh @dotnet/dotnet-diag @cijothomas @reyang @victlu @wiktork @jander-msft @shirhatti
I'm still cleaning this up but giving a heads up on what is coming. If you have any feedback on the broad functionality/design/structure I'd much appreciate it. Feedback on implementation details are welcome too but you may find it more efficient to let me clean it more and switch to non-draft PR before bothering to dig in at that level.

The feature is still a work in progress but wanted to let others
see it in its current state while I am refining it.

Our out-of-process tools like dotnet-counters and dotnet-monitor need
access to the metrics produced by the new Meter APIs without
requiring the app to take any dependency on a separate OpenTelemetry
library. System.Diagnostics.Metrics EventSource is a new source designed
to let those tools access this data. The EventSource includes high
performance in-proc pre-aggregation capable of observing
millions of instrument invocations/sec/thread with low CPU overhead.

This change does not create any new BCL API surface, the aggregated
data is solely exposed by subscribing to the EventSource such as
using ETW, EventPipe, Lttng, or EventListener. For anyone wanting
in-process APIs to consume the data they could either use MeterListener
for unaggregated data or a library such as OpenTelemetry for
pre-aggregated data.
@noahfalk
Copy link
Member Author

Thanks for the nice review @gfoidl! I think almost all of it has been applied and a few parts were rendered moot from other changes.

- Made some adjustments to the events
- Added a bunch of tests
- Fixed all the bugs I found with those tests
- Misc refactoring
- PR feedback
@noahfalk noahfalk marked this pull request as ready for review June 29, 2021 14:49
@noahfalk
Copy link
Member Author

I got most of the changes I wanted made and rebased it on main so now it is probably more reasonable to review it.

// This explicitly uses a Thread and not a Task so that metrics still work
// even when an app is experiencing thread-pool starvation. Although we
// can't make in-proc metrics robust to everything, this is a common enough
// problem in .NET apps that it feels worthwhile to take the precaution.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC @stephentoub just in case he has any comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many of these AggregationManager instances will there be in a process, and thus how many of these threads?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At any given time there should be at most one instance of AggregationManager and one thread. MetricsEventSource disposes the old one (which joins the thread) before creating new one.

Value1 = value1;
}

public override int GetHashCode() => Value1?.GetHashCode() ?? 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Value1 ever written outside of the ctor? If no, it should be readonly. If yes, does any code depend on this GetHashCode being stable, e.g. are these ever put into a dictionary?

Copy link
Member Author

@noahfalk noahfalk Jul 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Value1 ever written outside of the ctor? Yes
If yes, does any code depend on this GetHashCode being stable? Yes

The path that modifies these after construction is here: https://github.com/dotnet/runtime/pull/54333/files#diff-37b757b2c75dad499405642004b936213328f52d7becdec52b3cde0d7948a54bR411-R431

I think the key invariant is that the values are never changed after inserting into the dictionary.

Comment on lines 279 to 298
if (genericDefType == typeof(Counter<>))
{
return () => new RateSumAggregator();
}
else if (genericDefType == typeof(ObservableCounter<>))
{
return () => new RateAggregator();
}
else if (genericDefType == typeof(ObservableGauge<>))
{
return () => new LastValue();
}
else if (genericDefType == typeof(Histogram<>))
{
return () => new ExponentialHistogramAggregator(DefaultHistogramConfig);
}
else
{
return null;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: switch statement/expression on genericDefType?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it works? case typeof(typename) gives an error that the value isn't constant. Type matching on instrument would require I know the exact type I think but I only know a partial type

else if (stateUnion is MultiSizeLabelNameDictionary<TAggregator> aggsMultiSize)
{
aggsMultiSize.Collect(visitFunc);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seconding this! I think the switch matches better with the sum type nature of the state variable. But, again, it's personal preference.

Comment on lines +17 to +22
lock (this)
{
LastValueStatistics stats = new LastValueStatistics(_lastValue);
_lastValue = null;
return stats;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Update need this lock too? If it isn't locking also, I'm not sure if we need this lock here. We could swap for an interlocked exchange or something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you could do this one with an InterlockedExchange instead of the lock. I left it as lock because it was simple and I don't anticipate this code is on the hot path. The hot paths are InstrumentState.Update(), AggregatorStore.GetAggregator(), RateSumAggregator.Update() and Histogram.Update(). Everything else is unlikely to go above a few hundred invocations/sec.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything else is unlikely to go above a few hundred invocations/sec.

Which methods might be invoked several hundred times a second? Ones that might allocate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which methods might be invoked several hundred times a second? Ones that might allocate?

Yeah, if someone requested to collect metrics once per second and enabled a few hundred metrics then the InstrumentState.Collect() path would be running a few hundred times per second. InstrumentState.Collect() invokes AggregatorStore.Collect() which in turn invokes Aggregator.Collect(). There are a few allocations down that path.

None of that is going to happen automatically, an engineer needed to run a diagnostic tool to turn it on, they needed to specify they wanted collections every second, and they needed to specify a bunch of metrics they wanted collected.

Comment on lines +15 to +16
/// ad-hoc monitoring for the new Instrument APIs. This source only supports one listener
/// at a time. Each new listener will overwrite the configuration about which metrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than having multiple session overwrite the configuration in 6, could we keep track of a union of requested metrics?

All sessions would observe additions to the metric collection and increases in the frequency.

e.g.,

Session A enables Metric1;Metric2 @ 10s intervals.
Session B enables Metric1;Metric3 @ 5s intervals.

then both sessions would see Metric1;Metric2;Metric3 @ 5s intervals.

Session B disables.

Session A continues to see Metric1;Metric2;Metric3 @ 5s intervals.

Then in 7 we could attempt to make changes unique to a session which would require more bookkeeping like your comment says.

- bugfix to use invariant culture
- support time series and histogram limits
@noahfalk
Copy link
Member Author

noahfalk commented Jul 6, 2021

Thanks for all the review feedback everyone! I think everything has been addressed at this point. There are a few comments above where someone asked a question and I answered it that I didn't resolve so that folks could see the answers to their questions.

If there is any remaining feedback please let me know, otherwise I plan to hit the merge button tomorrow.

Also as a heads up I've got at least one more change I am thinking to make in a separate PR to improve how the EventSource handles exceptions that are thrown from user provided callbacks.

@noahfalk noahfalk merged commit ab2ae56 into dotnet:main Jul 7, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Aug 6, 2021
@AndyAyersMS
Copy link
Member

AndyAyersMS commented Mar 24, 2022

This bit of code treats two consecutive fields of an object as a span of the two fields (similarly for related types).

I suspect this is treading on dangerous territory and may lead to incorrect optimizations by the jit, especially if later on there is code that intermingles references from the AsSpan and the individual fields. I don't see that happening anywhere but generally speaking the jit will not properly handle the types of aliasing this can introduce.

internal partial struct ObjectSequence2 : IEquatable<ObjectSequence2>, IObjectSequence
{
public object? Value1;
public object? Value2;
public ObjectSequence2(object? value1, object? value2)

internal partial struct ObjectSequence2 : IEquatable<ObjectSequence2>, IObjectSequence
{
public Span<object?> AsSpan()
{
return MemoryMarshal.CreateSpan(ref Value1, 2);
}
public override int GetHashCode() => HashCode.Combine(Value1, Value2);
}

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants