Metrics for System.Runtime #85372
Comments
CC @reyang |
@JamesNK FYI in OpenTelemetry .NET we've implemented these:
Having these in future versions of the runtime would be awesome. The existing OpenTelemetry instrumentation libraries can do a runtime detection and leverage the runtime instrumentation if it is there. Eventually, as old versions of the runtime get deprecated, we'll land in a better situation where we don't need a separate instrumentation library because things are "baked in". |
I think OTel would still have something because .NET counters won't follow the OTel naming standard. However, the implementation should be very simple because built-in counters will provide all the information needed. |
@JamesNK is this work you were planning to pursue yourself or just recording the request? |
Recording the request. |
@JamesNK I moved this to future milestone. Please let me know if there is strong demand to have this in .NET 8.0. |
The main reason to implement these as metrics in .NET 8 is so that we can wean people off EventCounters and onto metrics instead. As these are the main process-wide counters, getting them converted will be a major signal towards that goal. There are likely few counters here that need many dimensions, as most are process-wide. We should evaluate the work in comparison to the infrastructure needed to implement it. |
Hey folks! I am interested in trying to help implement this. |
Hi @omajid, glad to have help! I'm guessing that most of the work on this feature will be investigating design options and trying to get a consensus on the best design rather than writing the implementation code. If that is something you are interested in taking a stab at, that's great. If you would rather have someone else work through the design first, that's fine too, but I don't necessarily know when that would occur. If you did want to pursue the design part, these are the major questions that come to mind right now:
My hunch is that, yes, some kinds of changes are going to be appealing but we need to figure out what are the impacts of different kinds of changes, is there anything we can do to make migration easier, and then figure out which changes seem worthwhile. For (3) my guess is that we would make it a static singleton, but we need to figure out how that intersects with DI Meter work and the new Meter config work so there might be stalls in there where one design needs to wait for stuff to resolve in the other, or they have to be resolved simultaneously. I think there is design inspiration we could take from https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Process. For changes/removals to existing counters there is also some past discussion here: #77530 So if all this sounds like something you still want to dive into I think a first step would be to create an initial proposal (in a gist or a PR'ed markdown doc) describing what instruments would be exposed. Thanks! |
Hey, @noahfalk! Thanks for the various links. These are great questions. I have been prototyping an implementation and I came up with some similar questions (and some possible answers to what you asked). I wouldn't mind helping with the design, though I am not a runtime or OpenTelemetry expert. Advice from anyone more familiar with this is more than welcome. I have been looking at OpenTelemetry's docs as a great starting point from which to evaluate design ideas.
I think if we are creating a Metrics based implementation for first-class support for OpenTelemetry, we should take advantage of that and provide similar (or additional) information, but in a way that is easier to consume and/or feels more natural for anyone looking to consume it via an OpenTelemetry-compatible tool. The opentelemetry-dotnet-contrib docs almost match the existing EventCounters of System.Runtime, with a few differences.
This isn't currently listed as something used in the OpenTelemetry docs, and isn't done in the EventCounter implementation either. So I think we can pass on this for a first stab? If we find some good use cases, we should consider adding Histograms for those.
Yes. In fact, I think we have to. Otherwise we provide OpenTelemetry-compatible metrics but violate all assumptions in the ecosystem, making things harder to parse and use. For example, all our metrics via EventCounters have a single name, but OpenTelemetry expects metrics to be namespaced via dots:
There's also prior art in the form of https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Runtime which does a great job creating a hierarchy, in the form of Though I see that https://github.com/dotnet/runtime/pull/85447/files does things differently and I am not sure why.
I think we should? We aren't putting our users on a path to success by including things that are confusing or likely to be misinterpreted, especially in a new/fresh design. I also think we should leave out various
I hadn't really thought about this. The current design (eg, looking at the output of
This shouldn't matter from a usage point of view, right? Could we make a static singleton for now and later switch to DI without breaking users? |
No worries. I think it's fine to toss out ideas and then get feedback on them. If we need folks with certain areas of expertise we'll try to find them. Ultimately, if there is no consensus forming and a contentious decision needs to be made, I can make it.
If we add a histogram in the future where previously the Meter had no similar instrument defined, that seems straightforward and easy to postpone. What feels less straightforward would be adding a counter or gauge now, then later deciding it would have been better to define that instrument as a histogram. For example, we might propose an ObservableGauge that reports an average GC pause duration, then later think: oops, maybe that should have been a GC pause duration histogram instead.
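The trade-off described above can be sketched as follows. This is a minimal illustration, not a proposed design: the meter name `Sketch.GcPause` and both instrument names are hypothetical, and the pause values are passed in by the caller rather than read from the GC. It feeds the same samples to an ObservableGauge that reports an average and to a Histogram that receives each value, then shows what a `MeterListener` observes from each.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class PauseInstrumentSketch
{
    // Feeds the same pause samples to both instrument shapes and returns what a listener sees.
    public static (double GaugeAverage, List<double> HistogramSamples) Run(double[] pausesMs)
    {
        using var meter = new Meter("Sketch.GcPause"); // hypothetical meter name
        double total = 0;
        int count = 0;

        // Option A: an ObservableGauge that only ever reports the running average.
        meter.CreateObservableGauge<double>(
            "sketch.gc.pause.average",
            () => count == 0 ? 0.0 : total / count,
            unit: "ms");

        // Option B: a Histogram that receives every individual pause.
        Histogram<double> histogram =
            meter.CreateHistogram<double>("sketch.gc.pause.duration", unit: "ms");

        double gaugeAverage = 0;
        var histogramSamples = new List<double>();

        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, l) =>
        {
            // Only subscribe to instruments from the sketch meter.
            if (ReferenceEquals(instrument.Meter, meter)) l.EnableMeasurementEvents(instrument);
        };
        listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
        {
            if (instrument is Histogram<double>) histogramSamples.Add(value);
            else gaugeAverage = value;
        });
        listener.Start();

        foreach (double pause in pausesMs)
        {
            total += pause;
            count++;
            histogram.Record(pause);
        }

        listener.RecordObservableInstruments(); // polls the gauge callback once
        return (gaugeAverage, histogramSamples);
    }
}
```

The listener sees only a single averaged number from the gauge, while the histogram hands over every sample, so counts, sums, and percentiles can all be derived downstream. That is why switching shapes later would effectively mean publishing a second, overlapping instrument.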
A common pattern that has arisen with the OTel work is that .NET will have some pre-existing convention or naming scheme, then OTel defines a new scheme that isn't consistent. No matter which one we choose to use it will always be inconsistent with something, either inconsistent with OTel recommendations or inconsistent with a .NET developer's past experience of the platform. When this happens we try to make a judgement call about which behavior more .NET developers are going to prefer in the long run, and often we do wind up favoring .NET self-consistency instead of OTel consistency. The pattern we've landed on in other places (example) is that we are staying consistent with .NET metric naming convention rather than switching to OTel naming convention. I'm expecting we'd do the same here. For folks who want something that conforms tightly to OTel naming and semantic conventions, the instrumentation packages from OTel such as https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Runtime package better fill that role right now. I expect what we'll want to build fairly soon (but not as part of this PR) are mechanisms that make schema conversion very easy so that users can get the data into whatever shape they need.
Whoops, that is actually the one I meant to link to above rather than Process. Glad you found it anyways :)
Above I mentioned how we can't be consistent with both past precedent and with OTel, so we had to make a choice. As to why we chose this way, a few reasons:
Yeah, that sounds pretty reasonable to me as well.
I think the one place this came up in the past was in the discussion of GC-related metrics. One set of customers will ask for fairly detailed metrics in a specific area, but I worried that if we add too much to the runtime Meter it will be confusing for users who have simple needs. I think the direction things were going, though, is that we shouldn't worry too much about adding more detailed metrics as long as most users are seeing the metrics via dashboards and docs that can guide them in slowly, rather than seeing raw dumps of every available instrument.
One area it might matter is with the Meter config work. For example in logging there is no concept of a static singleton |
A couple of questions:
|
Recently I've been assuming that the runtime counters will be a static singleton Meter defined in System.Diagnostics.DiagnosticSource, so no dependency on DI there. However, if you want to listen to it via the Meter config work, that would take a DI dependency. Other ways, such as MeterListener, OpenTelemetry, or external tools, do not require DI.
I think a good portion of the counters are the result of EventCounters compensating for not having dimensions. For example there are 5 different heap size metrics (gen0, gen1, gen2, LOH, POH) that could probably be a single metric with a dimension. I'd suggest we start with a design that doesn't explicitly namespace them and if it still feels overwhelming then we think on how we'd split them.
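As a rough sketch of the "single metric with a dimension" idea above, the five per-generation heap sizes could be reported by one observable instrument that tags each measurement with its generation. The instrument name `sketch.gc.heap.size`, the tag key `generation`, and the choice of `GCGenerationInfo.SizeAfterBytes` as the value are all assumptions for illustration, not the agreed design.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class HeapSizeSketch
{
    // Same ordering the GC uses for GCMemoryInfo.GenerationInfo.
    private static readonly string[] s_genNames = { "gen0", "gen1", "gen2", "loh", "poh" };

    // One callback yields one measurement per generation, distinguished only by a tag.
    public static IEnumerable<Measurement<long>> GetHeapSizes()
    {
        GCMemoryInfo info = GC.GetGCMemoryInfo();
        ReadOnlySpan<GCGenerationInfo> generations = info.GenerationInfo;
        var result = new List<Measurement<long>>(generations.Length);
        for (int i = 0; i < generations.Length && i < s_genNames.Length; i++)
        {
            result.Add(new Measurement<long>(
                generations[i].SizeAfterBytes,
                new KeyValuePair<string, object?>("generation", s_genNames[i])));
        }
        return result;
    }

    // Registers the single tagged instrument on a caller-supplied Meter.
    public static ObservableGauge<long> Register(Meter meter) =>
        meter.CreateObservableGauge<long>("sketch.gc.heap.size", GetHeapSizes, unit: "By");
}
```

A consumer that wants the old per-generation view filters on the tag, and one that wants the total sums across it, which is the flexibility the separate EventCounters could not offer.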
Probably. I like OpenTelemetry's approach of separating these metrics as 'Process' metrics rather than 'Runtime'. These high-level stats like CPU-usage and VM-usage are measurements the OS tracks for all processes rather than anything specific to a language runtime like Java, Python, or .NET.
I'm hoping we don't have a huge number of them + we can provide better guidance than we currently do. Today our docs mostly say "Here is what each counter measures". I think we should get to the point where the docs say "These are the counters we think are most useful for health monitoring, here is a default dashboard you can use, here is how you might use this data to start an investigation of different common problems..."
I'm hoping we aren't going to have so many that more elaborate naming schemes are needed, but I certainly don't rule it out. I'd propose starting with something that looks like OTel's runtime metrics but using .NET's traditional naming conventions.
I think looking at what OpenTelemetry did with runtime metrics is a good starting point. I'm guessing we'd land somewhere quite similar. |
From #79459 (comment):
|
cc @Maoni0 for the last metric |
@noahfalk There is an ongoing discussion at open-telemetry/semantic-conventions#956 about introducing a semantic convention for .NET CLR runtime metrics to align with the existing Java conventions. I'm looking at whether this is something I (and Elastic) can contribute to. Would this be a useful exercise as the groundwork for this issue in terms of planning the instrumentation types, attributes etc? Regarding naming, it sounds like the preference will be to use .NET naming conventions for any new metrics. Is an option to choose the naming based on an "opt-in to semantic conventions" environment variable practical? Has anything like that been done elsewhere in the runtime? I've got some cycles coming up to spend on this effort, so looking at places I can help get stuck in and try to add some value. It would certainly be nice to get to a place where we have built-in metrics and additional contrib libraries are not needed. |
Hey Steve, thanks for reaching out and volunteering! I very much agree it would be nice to get these in the runtime by default. I think this item was in a list of items that @tarekgh was tracking. @tarekgh - any concerns if Steve were to run with this? Just so you are aware, I know .NET 9 probably feels far off, but we've got about 2 months where it's not too hard to get PRs merged. Once we hit July the bar starts going up, and it's probably not long until feature-sized work not yet merged gets automatically postponed until .NET 10.
The naming discussion above is out-of-date now. I think somewhere around July last year we made a shift in strategy, renamed all the metrics using OTel conventions, and pushed to get all conventions we were depending on marked stable quickly before .NET 8 shipped. So far I'm feeling pretty happy we did that because it eliminated the bifurcation between .NET naming conventions and OTel naming conventions which was causing confusion. As for the runtime metrics I think it gives us a clear path - we'd use OTel naming conventions only. I know above there was also some discussion about which metrics should we have and how to organize them. I think I was leaving it too open-ended before. Now I'd suggest we should adopt the conventions already implemented by OTel's runtime instrumentation as the presumptive design. If folks have feedback or want to propose changes to that design we can certainly do so. How does that sound? |
I am fine having Steve start it. Please keep me involved, as I guess I still need to handle the design review. |
@noahfalk That sounds great, and much of the complexity is resolved now. I'd be happy to specify the planned metrics and attributes in the semantic conventions based on what the OTel SDK already generates today. In terms of my time, I'd be happy to contribute to the implementation too. I wasn't expecting the agreements to move forward quite this fast! I'm OOO this week and the last two weeks in May, but I'll try to focus on this in the week between and lay the groundwork at least. |
@tarekgh Do you want me to create a new API review issue for the proposed new public types needed to implement this? |
No, please use this issue to add the proposal on the top. If you cannot access that, please paste it as a comment and I'll move it to the top when we all agree on the shape of the proposal. We need to have one place to look. Thanks! |
@noahfalk / @tarekgh - I've opened an initial semantic convention PR to propose adding experimental runtime metrics. |
@noahfalk, had you envisioned a plan for the static We could port almost as-is from the contrib code, but that has the advantage that the extension method used to add the instrumentation can also trigger the static initialisation. For a likely use case with OTel, if we did it like the code below, nothing would happen because the Meter isn't actually created.

    using System;
    using System.Collections.Generic;
    using System.Diagnostics.Metrics;
    using OpenTelemetry;
    using OpenTelemetry.Metrics;

    // Subscribing by name alone never touches RuntimeMetrics, so its static constructor never runs.
    using var meterProvider = Sdk.CreateMeterProviderBuilder()
        .AddMeter("Microsoft.Diagnostics.Runtime")
        .AddConsoleExporter((_, m) => m.PeriodicExportingMetricReaderOptions.ExportIntervalMilliseconds = 10000)
        .Build();

    GC.Collect(0);
    Console.ReadKey();

    internal static class RuntimeMetrics
    {
        public const string MeterName = "Microsoft.Diagnostics.Runtime";

        private static readonly Meter s_meter = new(MeterName);
        private static readonly string[] GenNames = ["gen0", "gen1", "gen2", "loh", "poh"];

        static RuntimeMetrics()
        {
            _ = s_meter.CreateObservableCounter(
                "process.runtime.dotnet.gc.collections.count",
                GetGarbageCollectionCounts,
                description: "Number of garbage collections that have occurred since the process started.");
        }

        private static IEnumerable<Measurement<long>> GetGarbageCollectionCounts()
        {
            // GC.CollectionCount(gen) also counts collections of higher generations,
            // so subtract the higher generation's total to get this generation's own count.
            long collectionsFromHigherGeneration = 0;
            for (int gen = 2; gen >= 0; --gen)
            {
                long collectionsFromThisGeneration = GC.CollectionCount(gen);
                yield return new(collectionsFromThisGeneration - collectionsFromHigherGeneration, new KeyValuePair<string, object?>("generation", GenNames[gen]));
                collectionsFromHigherGeneration = collectionsFromThisGeneration;
            }
        }
    }
|
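To make the initialisation problem above concrete, here is a small sketch of how a static class with a static constructor stays dormant until one of its members is touched. The type names `InitLog` and `SketchRuntimeMetrics`, the meter name, and the instrument name are hypothetical; the only point is the ordering of events.

```csharp
using System;
using System.Diagnostics.Metrics;

public static class InitLog
{
    // Set when the sketch meter's static constructor runs.
    public static bool MeterCreated;
}

public static class SketchRuntimeMetrics
{
    private static readonly Meter s_meter = new("Sketch.Runtime"); // hypothetical name

    static SketchRuntimeMetrics()
    {
        // The instrument only exists once something triggers this constructor.
        _ = s_meter.CreateObservableCounter<long>(
            "sketch.gc.collections",
            () => (long)GC.CollectionCount(0));
        InitLog.MeterCreated = true;
    }

    // Calling any static member forces the static constructor to run first.
    public static void EnsureInitialized() { }
}
```

Reading `InitLog.MeterCreated` before calling `SketchRuntimeMetrics.EnsureInitialized()` should yield `false`, because naming a Meter in a subscription is not an access to the class that creates it. Some explicit trigger is needed, which is the role the extension method plays in the contrib package.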
@stevejgordon I am wondering why we need to have this in OpenTelemetry? Why can't we list the detailed proposal here and then proceed with that? |
@tarekgh I'm trying to drive both the implementation here and also getting CLR metrics into the conventions as a first-class citizen, as JVM recently did. I'm happy to summarise the metrics names/types, etc., here also as we work on the implementation. |
Never mind. I looked at the PR and I see it is just a doc.
What if we force anyone who creates a

    public MeterListener()
    {
        EnsureBuiltinMetersInitialized();
    }

    static void EnsureBuiltinMetersInitialized()
    {
        RuntimeMeter.Initialize();
    }
|
@noahfalk would your suggestion make |
I highly recommend getting input from OTEL experts, e.g. @lmolkova, on counters, tags and naming. It was a great help when putting together aspnetcore metrics. Also, we should document the metrics on learn.microsoft.com and in the OTEL semantic conventions docs. With aspnetcore metrics there are lightweight docs on learn.microsoft.com, with links to the detailed docs in the OTEL semantic conventions for people who want more detail. |
dotnet-counters connects to the MetricsEventSource which uses a MeterListener internally to obtain the data. There shouldn't be any alternative path to get the data from arbitrary Meters (excluding truly shady approaches like private reflection).
+1. I think that is the path we are already on by virtue of posting the sem-conv proposal in the OTel repo. |
@noahfalk Yeah, I was heading in that direction myself, although I was hoping to avoid it if there was some clever way. One thing I did consider this morning was whether |
Btw @lmolkova is currently out, but she is scheduled to be back in a week. I'm glad to get other feedback but I do want to get her feedback specifically on this one :) |
I'm out for two weeks starting on Monday, but I will keep an eye on these discussions. I'll continue playing with a POC to implement this, and once we have a design, we can determine what (if anything) needs to be prepared for API review, etc. |
@noahfalk @stevejgordon Looking at the PR open-telemetry/semantic-conventions#1035, I see the proposal is missing at least three metrics compared to what we expose in https://learn.microsoft.com/en-us/dotnet/core/diagnostics/available-counters#systemruntime-counters.
Is it intentional we don't want to include these? |
@tarekgh, not really. @noahfalk suggested starting with the metrics exposed via the existing OTel contrib library, so I didn't review the runtime event source. We can consider proposing those, too, or they could be added later. @noahfalk An alternative implementation I've been thinking about last night is whether we should consider adding an |
Can those GC events be received by an EventListener? I'd imagine that the events are fired to ETW from the unmanaged part of the runtime, rather than by EventSource, and that makes them invisible to EventListener. |
@KalleOlaviNiemitalo, I believe they are piped through and can be observed as per this post from @Maoni0. The reason I am considering this as an option is it opens the door to collecting GC duration and perhaps some other useful metrics if we base at least some of them on these richer events. |
I see; the events are buffered in unmanaged EventPipe code, and a thread in managed code pulls them via EventPipeInternal.GetNextEvent, so the runtime doesn't need to call managed code in the middle of garbage collection. |
I would deliberately not include this one. OpenTelemetry includes WorkingSet, Cpu, and other OS-level metrics in a separate group of process metrics. I think it's fine if we had a built-in implementation of process metrics too; I just wouldn't lump them into the same Meter as the runtime metrics.
I'd be fine with it as long as @Maoni0 is. It also raises the question if we only want the gen0 value of this or do we want higher generation budgets too.
This metric has history as being confusing and I think folks would be better off observing the rate of change in the clr.gc.pause.time metric. We did look at adding this to the OTel metrics in the past and decided against it. Some past discussion.
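A rough sketch of what "observing the rate of change" of a cumulative pause-time metric could look like on the consumer side: take two samples of the running total and divide the pause delta by the elapsed wall-clock time. The `Sample` type and method names are hypothetical, and `GC.GetTotalPauseDuration()` (available from .NET 7) is used here only as an assumed stand-in for the value such a metric would report.

```csharp
using System;
using System.Diagnostics;

public static class GcPauseRate
{
    // A cumulative reading: total pause time so far, and when it was taken.
    public readonly record struct Sample(TimeSpan TotalPause, TimeSpan Timestamp);

    // Fraction of wall-clock time spent paused between two cumulative samples.
    public static double PauseFraction(Sample earlier, Sample later)
    {
        double elapsedSeconds = (later.Timestamp - earlier.Timestamp).TotalSeconds;
        if (elapsedSeconds <= 0) return 0; // no interval to measure over
        return (later.TotalPause - earlier.TotalPause).TotalSeconds / elapsedSeconds;
    }

    // Captures a sample from the running process (requires .NET 7 or later).
    public static Sample Capture(Stopwatch clock) =>
        new(GC.GetTotalPauseDuration(), clock.Elapsed);
}
```

Because the caller picks the two samples, the interval is explicit and adjustable, which avoids the ambiguity about which window a pre-computed percentage refers to.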
Although functionally it works I'd worry you are going to incur higher perf overheads for no clear benefit. Creating the first EventListener in the process requires a thread to pump the events for a callback + blocks of virtual memory are allocated to store the buffered events prior to dispatching them. |
Catching up on the discussion here. Having CLR metrics would be great! I've shared some feedback on the open-telemetry/semantic-conventions#1035 and happy to help polish names and attributes. One thing I hope we can discuss more is if translating existing counters is enough. I believe in some cases we can do better - specifically when we want to measure duration or a distribution of something:
Some of these can be addressed incrementally, but if we'd rather eventually report GC pause time as a histogram, we should not add it as a counter now. Adding a histogram later would introduce duplication, because all the counts can be derived from the histogram. |
Today there are event counters for System.Runtime: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/available-counters#systemruntime-counters
Metrics should be added to this area. Advantages:
MeterListener.
System.Diagnostics.Metrics. For example, opentelemetry-net.
What instruments should we have?
exception-count counter include the exception type name as a tag? Then tooling can provide a breakdown of not just the total exception count but the exception count grouped by type.
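The exception-type tag idea above could be sketched like this. The meter name `Sketch.Exceptions`, the instrument name, and the tag key `type` are hypothetical, and exceptions are passed in explicitly rather than hooked from the runtime. The listener stands in for the tooling that would group the count by type.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class ExceptionCountSketch
{
    private static readonly Meter s_meter = new("Sketch.Exceptions"); // hypothetical name
    private static readonly Counter<long> s_exceptions =
        s_meter.CreateCounter<long>("sketch.exceptions.count");

    // Adds one to the counter, tagged with the exception's type name.
    public static void Record(Exception ex) =>
        s_exceptions.Add(1, new KeyValuePair<string, object?>("type", ex.GetType().Name));

    // Plays the role of tooling: listens to the counter and totals it per type tag.
    public static Dictionary<string, long> CountByType(IEnumerable<Exception> exceptions)
    {
        var totals = new Dictionary<string, long>();
        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == "Sketch.Exceptions") l.EnableMeasurementEvents(instrument);
        };
        listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
        {
            foreach (KeyValuePair<string, object?> tag in tags)
            {
                if (tag.Key == "type" && tag.Value is string name)
                {
                    totals.TryGetValue(name, out long current);
                    totals[name] = current + value;
                }
            }
        });
        listener.Start();

        foreach (Exception ex in exceptions) Record(ex);
        return totals;
    }
}
```

Summing the per-type totals still gives the overall exception count, so the tag adds the breakdown without losing what the untagged counter reports today.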