
Add Runtime Performance Counters #11227

Closed · 15 of 19 tasks
noahfalk opened this issue Oct 11, 2018 · 40 comments
Labels: area-Tracing-coreclr, enhancement (Product code improvement that does NOT require public API changes/additions)

Comments

@noahfalk
Member

Historically, .NET supported emitting certain default performance counter data into the Windows Performance Counter system. On non-Windows systems there is no such facility, but in its place we now have EventCounter for counters and EventPipe as an underlying transmission method. The goal is to instrument the runtime with counters similar to those we had in the past, and have that data emitted via EventPipe. It is a non-goal (of this issue at least) to transmit the data to the Windows Performance Counter system.

This task tracks:

  • Identify a set of runtime counters we initially want to support. We should consider what was originally supported in .NET Framework. It is OK if we start with only a high-value subset, then add others later as needed. (See Vance's comments below for more detail.)
  • Ensure that managed wrappers exist to query the underlying data we care about. For example, Process.VirtualMemorySize64 already exists, whereas GC.PercentTimeInGC needs to be created.
  • Modify EventCounter / RuntimeEventSource to publish counters that are backed by the APIs.

Proposed list of runtime counters for .NET Core 3 (see https://github.com/dotnet/diagnostics/issues/83#issuecomment-446729009):

System Counters

  • Working set memory
  • CPU
  • Handles

Exception performance counters

  • Exceptions thrown

Thread performance counters (see https://github.com/dotnet/corefx/issues/35500)

  • Threads
  • TPL threads
  • Total queue length
  • work items processed
  • Lock contention

Memory

  • Bytes in all heaps
  • Gen 0/1/2 collections
  • % time in GC
  • Gen 0/1/2 and LOH size
  • Allocation rate

JIT

  • IL bytes Jitted

Loader

  • Number of Assemblies currently loaded
@noahfalk
Member Author

cc @brianrob @vancem @dotnet/dotnet-diag

@vancem
Contributor

vancem commented Oct 11, 2018

I believe it is pretty fair to say that most of our .NET Framework counters are not very useful. Thus we should only add the ones we believe are known to be valuable (we can add others later).

Also, a number of our counters are conceptually better as events. Things that happen rarely are typically better done as events (many of the GC performance counters fall into this category, as do exceptions …)

Counters make sense in the following cases

  1. If an event is too expensive because the value changes too frequently. In this case the counters are basically the 'low overhead' monitoring of the event (e.g. requests/sec).
  2. Counters have a certain value because they can be displayed easily (name and number). We should resist doing too much of this, however, as it is redundant with information from events.
  3. To expose underlying OS counters in a platform neutral way (e.g. Process CPU time, Process Memory, etc.).

Note that my expectation is that all information exposed through a counter should already be exposed to managed code in some way without the counter. For example, all the process-level stuff (CPU time, memory, ...) is available via System.Diagnostics.Process. BytesInAllHeaps is exposed via GC.GetTotalMemory(). We should keep this pattern (e.g. if we wish to expose the JIT counters we should have a System.Runtime.JIT class with APIs that let you get at the numbers). APIs are much easier to deal with than subscribing to things that poll and serialize (and you have to deserialize); we don't want that for in-proc cases.

I think we should try hard to keep as much of the logic in managed code. This should be easy because we have already exposed all the information via managed APIs. Thus all we need is for (managed) RuntimeEventSource to propagate the values to the EventCounter logic. We will need an event on EventCounter that fires just before an EventCounter will log its payload. This allows something like RuntimeEventSource to update the variables just before that.
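For illustration, that pre-write hook might look something like the sketch below. The CounterWriteStarted event and its wiring are hypothetical names, not shipped API (only WriteMetric exists on EventCounter today); the point is just that RuntimeEventSource refreshes the value right before the counter serializes its payload:

    // Hypothetical addition to EventCounter: raised just before the counter writes its payload.
    public event EventHandler CounterWriteStarted;

    // RuntimeEventSource could then refresh the value only when a payload is about to be written:
    gcHeapCounter.CounterWriteStarted += (sender, args) =>
        gcHeapCounter.WriteMetric(GC.GetTotalMemory(false));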

Here is a strawman list of counters we should start off with

  1. This Process Process.VirtualMemorySize64
  2. This Process Process.WorkingSet64
  3. This Process Process.PrivateMemorySize64
  4. This Process Process.TotalProcessorTime
  5. GC.GetTotalMemory(false)
  6. GC.PercentTimeInGC NEW to managed
  7. ExceptionsThrown NEW to managed
  8. LockContentions NEW to managed
  9. NumberOfThreads NEW to Managed
  10. Number of ThreadPoolThreads NEW to Managed
  11. Number of IL Bytes compiled NEW to Managed
  12. Number of Methods JIT compiled NEW to Managed

There are others we want, probably NetworkBytes, # HTTP requests, and some counters at the ASP.NET level.

There are undoubtedly more; we should add them here. If we have more than 50 we have probably made too many.

@noahfalk
Member Author

Looking a little closer at how EventCounter is implemented, I agree: it would be nice to reuse more of it than I was initially thinking, and surfacing just the managed API query for the raw number isn't hard. I'll update my task list at the top. I see you are also proposing that EventCounter effectively adds a pull model, allowing metrics to be updated just in time in response to an event. I like this. Building on your idea, I suggest that rather than doing a generic event on EventCounter, we let each counter have an update mechanism:

public EventCounter(string name, EventSource source, Func<float> getMetricValueFunc)

This allows us to do a fine-grained pull which could let us efficiently support returning subsets of the metrics, or returning some metrics at higher polling rates than others. It also seems like an easy programming model rather than requiring the user to declare the counter, register/unregister for an event, then implement an event handler that computes the metric and calls SetMetric to push the update.

public class MyEventSource : EventSource
{
    // GC.GetNumGCs() does not exist; GC.CollectionCount(0) stands in here for a "number of GCs" query.
    private EventCounter gcCounter;
    public MyEventSource() => gcCounter = new EventCounter("# of GCs", this, () => GC.CollectionCount(0));
}
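For context, here is a minimal sketch of how an in-proc consumer could enable such a counter and control its polling rate. The listener class and its payload handling are illustrative only; "EventCounterIntervalSec" is the argument EventCounter already uses to control how often it writes:

using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;

internal sealed class CounterListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "MyEventSource")
        {
            // Ask the counters on this source to report once per second.
            EnableEvents(source, EventLevel.LogAlways, EventKeywords.All,
                new Dictionary<string, string> { ["EventCounterIntervalSec"] = "1" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // Counter payloads arrive as "EventCounters" events carrying a dictionary payload.
        Console.WriteLine($"{eventData.EventName}: {eventData.PayloadNames?.Count ?? 0} payload fields");
    }
}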

And slightly separately, any interest in renaming EventCounter to something more 'metric'-ish? EventCounter appears to be an arbitrary tuple (name, float), and it is entirely up to the user whether or not they use it for counting events. Maybe that ship has already sailed, but it feels weird to call a metric that tracks the total size of the GC heap an 'Event Counter'. If we are doing other work to add multi-dimensional metrics, maybe that would be an opportunity to create a more generically named type?

@vancem
Contributor

vancem commented Dec 18, 2018

Please also see https://github.com/dotnet/diagnostics/issues/83 for a discussion of counters that are requested. My recommendation is to start with a few of the most uncontroversial ones and then create batches that we sign off on before you go implement them.

I like @noahfalk's design (adding an EventCounter constructor that takes a fetcher delegate) as a way of implementing counters that can be pulled.

@vancem
Contributor

vancem commented Dec 18, 2018

To @noahfalk's comment, I agree that EventMetric is a better name than EventCounter. Our naming was influenced by Windows, which calls them counters (they really are metrics as well). I would not deal with trying to rename it at this point. If at any point we decide the name is just too problematic, we can make a superclass that effectively 'renames' the type, but I think that having two names makes things worse. Either way, I would not address it now.

@sywhang sywhang self-assigned this Dec 18, 2018
@vancem
Contributor

vancem commented Dec 18, 2018

Several of the first Metrics should be % Time in GC, Allocation Rate, so our first step is to expose these.

We can simply expose these as properties on the GC class, but these values can be highly misleading because they are only updated when GCs happen.

One way of solving this is, instead of exposing properties, to have an EVENT 'GarbageCollection' which you can subscribe to that gets called when garbage collections happen (it will be triggered by the finalizer thread after a GC). This event can have a GCInformation class passed to it which contains things like the CPU time consumed (in and out of the GC), the number of allocations, the time span since the last GC, etc. It can also have things like PercentTimeInGC, which does the math on the other numbers (and is documented to do that). The result is more accurate and won't 'lie' as much.

@Maoni0 do you have any comments on the idea of a managed callback when GCs happen (for performance monitoring purposes)?

Including @terrajobst @jkotas @stephentoub in case they wish to comment on the idea of having

    // Will be called back shortly after a garbage collection happens. This is here for
    // performance monitoring purposes.
    static event Action<GCInformation> GarbageCollection;

    class GCInformation
    {
        int Generation { get; }                 // Gen 0, 1, or 2 GC
        int CPUMSecUsedSinceLastGC { get; }
        int GCCPUMSecUsedSinceLastGC { get; }
        DateTime GCTime { get; }
        TimeSpan TimeSinceLastGC { get; }
        int BytesAllocatedSinceLastGC { get; }
        // Probably more...

        // These are derived from the values above.
        double PercentTimeInGC { get; }
        double AllocationRate { get; }
        // We will probably want more.
    }

We can quibble about the details. The main point is that we have an event that fires on every GC and it has information that allows you to compute 'counter' values.
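For concreteness, here is one way the derived properties in the GCInformation sketch could be computed from the raw fields. The exact formulas are an assumption for illustration, not part of the proposal:

    // Fragment of the GCInformation sketch above.
    double PercentTimeInGC =>
        CPUMSecUsedSinceLastGC == 0 ? 0
            : 100.0 * GCCPUMSecUsedSinceLastGC / CPUMSecUsedSinceLastGC;

    // Bytes allocated per second over the window since the previous GC.
    double AllocationRate =>
        TimeSinceLastGC.TotalSeconds == 0 ? 0
            : BytesAllocatedSinceLastGC / TimeSinceLastGC.TotalSeconds;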

@jkotas
Member

jkotas commented Dec 19, 2018

@vancem I would like to clarify the relationship of the GCInformation event and the regular EventPipe/EventCounter proposed above. It seems that the GCInformation event would be redundant with what you can do with EventPipe/EventCounter already. The added value is that GCInformation is more convenient to use and maybe a bit more lightweight. Is this correct?

I think that it is fine to have direct managed APIs for the most basic metrics, even if the APIs are redundant with tracing.

@vancem
Contributor

vancem commented Dec 19, 2018

Generally speaking, monitoring infrastructure like EventSource, EventCounters, DiagnosticSource, ... is actually relatively inconvenient to use as a way of getting information in process. This is because they serialize/standardize the information, and thus you have to do some sort of conversion/deserialization to get at it. This is fine when such standardization/serialization serves a purpose (e.g. getting out of the process), but it is pretty inconvenient in-proc compared to 'normal' APIs.

Moreover since EventCounter is managed, there needs to be SOME way of getting the raw data into managed code.

These two observations lead to a design where raw data is always exposed as a managed API (e.g. GC.GetTotalMemory) and then we simply use an EventCounter to also 'publish' this data.

If you look at EventCounter, its main 'value' is

  1. Serializing the data so it can leave the process.
  2. Computing statistics for it (average, standard deviation, ...)
  3. Hooking into the EventSource control logic so that the rate of update is user controlled.

So arguably there is no redundancy. The 'raw' API provides the actual values, and the EventCounter/EventSource provide a uniform way of pushing these values to other places (typically out of process).

You COULD imagine making the raw data APIs private and thus only allowing the data to be exposed via EventCounter/EventSource, but as mentioned before, this makes getting at the data in-proc pretty inconvenient, so I think the better pattern is to make the raw APIs public. (And many of the needed raw APIs already exist for this reason.)

The case of the GC is a bit more complex because the raw data that the GC generates is actually more event-like than metric-like, since it is updated relatively rarely (during GCs). Thus two designs are possible:

  1. Expose things like PercentTimeInGC as a static property of the GC class (this makes it metric-like). This works, but these values tend to be misleading to users because they change in a very coarse fashion.
  2. Expose an event when GCs happens and hang the data off the callback argument. This makes it clear how the raw data is being computed and that it is very coarse (it only changes on GC events).

I could live with either of these, but arguably (2) is more accurate and gives more information (you actually know when the GCs are happening). If I had to choose I would pick (2), but as I said, I could live with either, so I want to get others' perspectives.

@jkotas
Member

jkotas commented Dec 19, 2018

Makes sense. This should just go through the same path as any other new public APIs: https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/api-review-process.md

@vancem
Contributor

vancem commented Dec 19, 2018

Makes sense. This should just go through the same path as any other new public APIs: https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/api-review-process.md

Yes. We definitely would be going through the review process. It is best to have a preferred option settled by the time you come in for review. Do you have an opinion on whether a flat API or a callback on each GC is preferred?

@jkotas
Member

jkotas commented Dec 19, 2018

I think the callback would be better.

@vancem
Contributor

vancem commented Dec 21, 2018

There are some non-trivial design decisions to be made. I have thought about the issues a bit, and I am proposing straw-man solutions below. Some issues are

  1. How many EventSources we are talking about for these counters and what are their names?
  2. There is a dependency problem. Right now EventCounters live in System.Diagnostics.Tracing, but
    much of the code that needs to be instrumented lives in System.Private.Corelib and
    System.Private.Corelib can't reference other DLLs, so that is a problem.
  3. To get process level CPU and Memory metrics, we need System.Diagnostics.Process.dll
    again we have an issue with dependencies.
  4. This should all be pay-for-play (if no one is listening, it is close to zero cost).
  5. Some of the metrics are meant to accumulate (like Lock contention, exceptions, allocations ...) and
    some are not (% time in GC, Queue Length ...) We would like to distinguish these so that the viewer
    can do the right thing (accumulate them over a period of time or not).
  6. We have to expose all the needed information to managed code. New properties will be needed.
    and in particular the design issue with GC data mentioned above needs resolution.

Here is the strawman:

  1. We will add a new EventSource to System.Private.Corelib whose class name is System.Diagnostics.Tracing.ManagedRuntimeEventSource (it can be placed next to System.Diagnostics.Tracing.RuntimeEventSource), and whose 'Name' attribute is 'System.Runtime' (that is, it does NOT have EventSource in its name; including it would be redundant). We can't use System.Diagnostics.Tracing.RuntimeEventSource for this because it is 'magic' (it is a place-holder for the native EventPipe events).
  2. We will put all the runtime-related counters in here, including the ones for process-level things (CPU, working set, etc.), because all of these counters 'exist' for ANY runtime instance (counters that you can imagine not being loaded (e.g. ASP.NET stuff) need a different EventSource).
  3. We create the new API EventCounter(string name, EventSource source, Func<float> getMetricValueFunc) on EventCounter.
  4. We also create the new API EventCounter(string name, EventSource source, bool isAccumulator) which marks the counter as an accumulator. This value is passed in the payload for the counter and tells the viewer that it should sum the values over an interval of time, rather than show the average. We will use this on the 'counter like' counters like exceptions and allocations.
  5. We solve the dependency problem by using lazy reflection. We create a class 'ProcessRuntimeEventCounters' with a static method 'AddCounters(EventSource)' in System.Diagnostics.Process.dll. This new API is public but does NOT belong in the reference assembly (it is only accessed via reflection). In the ManagedRuntimeEventSource, when counters are turned on, it uses reflection to try to load System.Diagnostics.Process.dll and run AddCounters.
  6. We add the following process counters, using the names in System.Diagnostics.Process as much as possible:
    • Process.TotalProcessorTime
    • Process.WorkingSet
    • Process.VirtualMemorySize
    • Process.HandleCount
    • Process.ThreadCount
  7. We play the same dependency trick we used in (5) for all counters in ManagedRuntimeCounters. Thus we create a class 'RuntimeEventCounters' in System.Diagnostics.Tracing that does all the GC, lock, and ThreadPool counters. This method will again be called lazily via reflection when counters are enabled. We can implement
    • GC.TotalMemory
    • GC.Gen0.CollectionCount
    • GC.Gen1.CollectionCount
    • GC.Gen2.CollectionCount
    • Exception.Count
      without implementing new APIs.
  8. Implement new static managed APIs to expose
    • ThreadPool.GetQueueLength()
    • ThreadPool.GetWorkCompleted()
    • ThreadPool.GetThreadCount() // returns sum of IO and normal worker thread
    • Monitor.LockContentions
    • GC.GarbageCollection (calls back after a GC; see above).
  9. Use those to implement more counters (of the same names); a rough sketch follows below.
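To make the strawman concrete, here is a minimal sketch (an illustration only, not the actual implementation) of how such an EventSource might lazily create its counters when a listener enables it. It assumes the Func<float> constructor from item 3; the class body, counter names, and the direct Process call are placeholders (in the real design that Process call is what the item-5 reflection trick, or code duplicated into CoreLib, would stand in for):

using System;
using System.Diagnostics;
using System.Diagnostics.Tracing;

[EventSource(Name = "System.Runtime")]
internal sealed class ManagedRuntimeEventSource : EventSource
{
    private EventCounter _workingSet;
    private EventCounter _gcTotalMemory;
    private EventCounter _gen0Collections;

    protected override void OnEventCommand(EventCommandEventArgs command)
    {
        if (command.Command == EventCommand.Enable && _workingSet == null)
        {
            // Counters are created only when someone is listening (pay-for-play, item 4).
            // The Func<float> overload is the proposed constructor from item 3, not an existing API.
            _workingSet = new EventCounter("working-set-mb", this,
                () => Process.GetCurrentProcess().WorkingSet64 / (1024f * 1024f));
            _gcTotalMemory = new EventCounter("gc-total-memory-mb", this,
                () => GC.GetTotalMemory(false) / (1024f * 1024f));
            _gen0Collections = new EventCounter("gen-0-gc-count", this,
                () => GC.CollectionCount(0));
        }
    }
}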

@jkotas
Member

jkotas commented Dec 21, 2018

We have been using the reflection hacks in a very targeted way to break really bad legacy dependency cycles. This is forward looking design that is taking a major bet on them. It does not sound right to me. These hacks are incompatible with tree-shaking and they make the system hard to reason about.

I think that we should do this cleanly without reflection hacks. It does not look that bad actually:

  • Move EventCounter to CoreLib. There are 100s of kB of C# for EventSource and friends in CoreLib already; EventCounter is 20 kB. Adding an extra 20 kB of C# to CoreLib is fine.
  • Duplicate the logic to get the 4 process-related counters in CoreLib. System.Diagnostics.Process is one of those kitchen-sink components that do everything and depend on everything. A small code duplication to avoid it is fine and simplifies things in the end. We have similar small code duplications in other places as well.

@Maoni0
Member

Maoni0 commented Dec 21, 2018

sorry I'm OOF so haven't been catching up on this thread. I will take a look when I'm back in Jan. thanks.

@sywhang
Contributor

sywhang commented Dec 26, 2018

@vancem @jkotas Sorry I was out last week due to a family emergency so I couldn't respond to this thread earlier.

Quick question - I saw that dotnet/coreclr#11036 removed EventCounters from System.Private.CoreLib and moved them to System.Diagnostics.Tracing. Was there any reason for doing this? (The PR doesn't mention why we did it.) I was wondering if there's anything we'd be breaking if we moved it back to System.Private.CoreLib.

@jkotas
Member

jkotas commented Dec 27, 2018

Was there any reason for doing this?

I assume that it was done to make the code sharable between runtimes and NuGet packages. The code had to live in CoreFX to be shareable between runtimes and NuGet packages. We have found that it is too constraining and relaxed this restriction by inventing CoreLib shared partition: https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/README.md .

You should be able to move EventCounter back to CoreCLR/CoreLib and share the code with the OOB EventSource package via the CoreLib shared partition.

@vancem
Contributor

vancem commented Dec 28, 2018

Still out-of-office, but catching up on this thread.

@jkotas I too am not particularly happy with the use of reflection. As you point out there is a 'straightforward' solution (pull the process functionality down into the core, as well as the EventCounter).
Indeed I thought about this possibility (it is the most conceptually simple), but it does have some snags that make me at least ambivalent toward it.

  1. It means that EventCounters are now in System.Private.Corelib, and thus can't be updated out of band (without updating System.Private.Corelib, which pretty much means updating the runtime). It is actually KNOWN that the EventCounter API as it currently stands is likely to need updates (we have seen it in just this case, but we also know there is a desire for 'multidimensional' counters). Thus there is value in having EventCounters in a package that could be updated more easily. (Countering this is the fact that we don't currently publish an out-of-band version of System.Diagnostics.Tracing, and that frankly runtimes get updates relatively quickly, although it is not clear our customers do that, especially between V2 and V3.) (@sywhang this, along with the desire for this to work on Desktop, was the reason EventCounters were not just put beside EventSource.)

  2. While pushing things into Corelib makes things 'cleaner' for .NET Core, it makes things more complex from a .NET Desktop / .NET Standard point of view. We will have to continue to carry the EventCounter payload in System.Diagnostics.Tracing for existing versions and maybe Desktop for a long time (maybe forever). This creates its own sort of subtlety. Sure, we can share the code, but we have not really gained overall simplicity by doing it.

  3. The simpler solution does involve cloned code. As you point out it is not huge, and we can share the actual code needing cloning, but it is still not an ideal situation. Note that we may need to do something with the networking code similar to System.Diagnostics.Process.

Now these are not 'killer' problems, but they at least made me pause. The other notion in play here is that this problem of dependency inversion seems like a problem worth solving in a generic way. Note that the contract from the runtime to other things is very simple (activate something), can be optional (a failure at runtime is not the runtime's fault), and is conceptually similar to any other config that needs to run code to do its job (they too are doing reflection).

In particular, our ideal situation is that if users ask for it (by building things just so) we would like to support an HTTP 'port' that supports diagnostics as REST APIs (that would allow the UI to literally be a web page that gets its data from this port). This feature has this dependency inversion issue on steroids (we need all the HTTP security goo, as well as a little HTTP server ...). Clearly we don't want all that in the bowels of the runtime; we just want it to be 'activatable' from the base runtime if the user configures the app appropriately.

There was a reason I labeled my proposal above a 'strawman'. I am ambivalent about it myself. Nevertheless we had to start SOMEWHERE, and this conversation is part of making progress. @jkotas given this new information, what do you think?

@jkotas
Member

jkotas commented Dec 29, 2018

The System.Diagnostics.Tracing package cannot be evolved anymore because it has been tightly coupled with the runtimes ("inboxed"). API additions for inboxed packages can only ship with new versions of the runtimes. It is no accident that the System.Diagnostics.Tracing package has been removed from the build. Note that Microsoft.Diagnostics.Tracing.EventSource, which re-ships the APIs under a different identity, is fine because it is a nice clean standalone NuGet package, not tightly coupled with the runtime.

Yes, we should be optimizing for making things "cleaner" for .NET Core where the future platform innovation is going to be. It is different from what we have been doing in the past where both design and agility were hampered by optimizing for backporting to existing runtimes. Here is similar discussion in different context: https://github.com/dotnet/corefx/issues/32640#issuecomment-427618911

Note that we may need to do something with the networking code similar to System.Diagnostics.Process.

Could you please elaborate? I assume that sockets can register their own event counters when you use them for the first time. We should not need to have anything that knows about networking event counters in corelib.

if users ask for it (by building things just so) we would like to support a HTTP 'port' that supports diagnostics as REST apis

I am fine with injection of dependencies that the app explicitly asks for, e.g. by initializing the REST api in the app Main method; or via the startup hooks that we have added recently. It is different from the baseline EventCounters. The baseline EventCounters are included by default, without explicit opt-in.

@Maoni0
Member

Maoni0 commented Jan 3, 2019

@vancem a lot of the things you mentioned for GC are already available via native ETW events (not CPU time - GetThreadTimes is not cheap, and since we can already calculate this with CPU sample events we didn't go the GetThreadTimes route; if you want to debug it you'll need to collect CPU samples anyway). Are you proposing that we should include all this info in one event that's fired after a GC, in addition to the events we already fire today? Sorry if I misunderstood, I have to admit I did not read this (long) thread in its entirety.

@noahfalk
Member Author

noahfalk commented Jan 3, 2019

We also create the new API EventCounter(string name, EventSource source, bool isAccumulator) which marks the counter as an accumulator. This value is passed in the payload for the counter and tells the viewer that it should sum the values over an interval of time, rather than show the average.

I feel a bit leery about growing the scope to include semantic hints that suggest how viewers should summarize data over a period of time. In a simple world we could provide no hints and viewers could do any of:
a) display the instantaneous values in text
b) graph values over time
c) allow the viewer's user to specify both an aggregation interval and the aggregation function
d) hard code specific aggregations of specific counters in a pre-defined view

My goal is to simplify and this seems like a spot that we can invest in later as needed.

@vancem
Contributor

vancem commented Jan 3, 2019

Sorry for the delay.

@jkotas - I want to follow up with you offline to understand our versioning options. Suffice it to say here that your suggestion to pull EventCounters into the framework is 'on the table'.

@Maoni0 - The overarching goal is to simply implement some EventCounters that expose GC information. EventCounters do more than just expose information; they do it in a pretty uniform way (as a named stream of numbers). Unlike events, they are very uniform and thus the presentation of the data is uniform (a graph). There are already tools/dashboards (like the Azure portal) that know how to deal with data in this form, as well as an expectation by users to have such things. Thus the idea is to have a set of metrics that pretty much mirror the useful perf-counter GC metrics we have on Windows.

The issue is how to do that. All this GC information starts out in native code, but the EventCounter logic is in managed code, so we have to deal with that somehow. Also, arguably, users do care about GC statistics (self-monitoring apps are not uncommon), and frankly the most natural way of doing this is to have a 'normal' managed API for it (in general, exposing information to managed code is a good thing). This is what the 'GCInformation' proposal is. It is meant to simply expose what is going out as an ETW/EventPipe event to managed code in the natural way. (Ignore the CPU metrics; I did not realize we did not already do that.)

are you proposing that we should include all this info in one event that's fired after a GC, in addition to the events we already fire today?

So in a word, yes, that is the proposal. With this information exposed to managed code, it is very straightforward to push the information out via EventCounters.

@noahfalk

I feel a bit leery about growing the scope to include semantic hints that suggest how viewers should summarize data over a period of time. In a simple world we could provide no hints and viewers could do any of ...

I too would feel more comfortable if EventCounters only provide the data and 'something else' dealt with how it is presented. However leaving this information out seems to be a disservice to users (it is strictly less convenient), with not a lot of gain (we can feel good about being 'minimal'). As a practical matter, a particular metric really does know if it is useful to simply display the value (e.g. % time in GC), or aggregate it over some time interval (e.g. RequestCount). By not allowing some way of passing this 'meta data' you end up forcing the end tool to guess at it (either by naming convention or forcing the user to tell them).

It seems to me that having meta-data associated with a counter (in addition to its name, which arguably is a piece of meta-data) is very useful and frankly not very problematic for us (our job is to simply pass it through; what it does on the other end is not our concern). It seems like a worthy request.

What do you think?

@noahfalk
Member Author

noahfalk commented Jan 4, 2019

Thanks Vance! Here are a few concerns that came to mind once we start adding display hints. Certainly all of the issues below could be tackled with appropriate diligence; it just made me feel that this wasn't entirely trivial, and I wasn't sold that the value was that high. I don't recall seeing typical counter viewers/dashboards that automatically compute these derived interval statistics on arbitrary counters without requiring user configuration. Do you know of a typical viewer that does this?

a) You suggested that all counters would either be averaged or summed over an interval, but there are counters for which neither of these operations seems appropriate. For example imagine a # of exceptions counter where each data point emitted is the absolute number of exceptions that have been thrown since the app began. Presumably the customer either wants the most recent value (the total exceptions) or they want the number of exceptions thrown in a specific interval. Computing that latter option is neither a sum nor an average, it is end_interval_value - begin_interval_value.
Another example would be counters that represent high or low water marks in which case max() or min() would probably be a more desirable interval summary.
b) Providing the metadata with the counters is some additional complexity that provides opportunities for people to make mistakes. For example, my # of exceptions counter above intuitively feels like an accumulator, so if I didn't know better I might mark it as one. If that means the viewer starts summing the values, users won't get the results they expect. Given that not all viewers have to honor the hints or display interval statistics at all, it may not be apparent that I've made the mistake, depending on what viewer I use to test with.
c) If the viewer wants to show the count of things which happened recently, it also has to decide some duration that represents "recently". We are providing a hint that the user might care about a recent interval rather than the aggregate number, so should we also provide a hint about what recent interval makes sense? If not, have we saved the user much work if they still have to configure the interval manually?
d) If a counter is a count, the user still might want to see the total count rather than the recent count. Should viewers trust our hint, or show both, or let the user decide?

@Maoni0
Member

Maoni0 commented Jan 4, 2019

@vancem got it. @noahfalk and I also chatted about this. if we mostly just want to provide the kind of info that perf counters provide, the code is mostly in gc\gcee.cpp - GCHeap::UpdatePreGCCounters and GCHeap::UpdatePostGCCounters under the ENABLE_PERF_COUNTERS define. these are called in do_pre_gc and do_post_gc. do_pre_gc is called on the user thread that triggered a GC for WKS GC and on one of the SVR GC threads for SVR GC. do_post_gc is called in various places -

  • for a Background GC this is called on the Background GC thread for WKS GC and on one of the Background GC threads for SVR GC

  • for a blocking GC this is called on the user thread that triggered the GC for WKS GC and on one of the SVR GC threads for SVR GC.

the info is already conveniently stored in the data structure provided by perf counters - and this data structure is only updated during pre/post GC, so that means the values remain unchanged in between GCs and you can freely read it after a GC is done.

let me know if you need more info. As pointed out already, out of these threads the SVR GC threads are the only ones that are not managed threads.

@vancem
Contributor

vancem commented Jan 4, 2019

@noahfalk - The short answer: I am willing to drop the 'isAccumulator' feature. Longer answer below.

Fundamentally, I am just trying to provide the information to make a dashboard much like https://portal.azure.com (which has things like Server Response Time (Avg) and Server Requests (Sum)). Notice that this portal has to do different things for these two metrics (one is averaged, the other summed, over an interval of time). The question is how the dashboard decided that the first one should be averaged and the second summed. Clearly users can change this, so it is just about providing defaults. You can also obviously live without it at the expense of more work setting up the dashboards. Since most dashboards are actually inherited anyway, I agree the value is small.

I looked at the ApplicationInsights APIs for doing metrics and it seems they do not provide any meta-data, so they must force the user to provide this (or guess at it via names or something), so I think it is OK that we do the same (we can add the feature later).

@sywhang
Contributor

sywhang commented Jan 4, 2019

@vancem @noahfalk Not to get too sidetracked from this discussion but it seems like this whole uncertainty about what EventCounter should look like and what other APIs we should expose in it probably means we should allow it to be updated out-of-band... That makes me kinda pause working on this (which I was doing based on the assumption that we would be moving EventCounter back to CoreLib). Would it be possible to chat about this offline some time next week?

@noahfalk
Member Author

noahfalk commented Jan 5, 2019

That makes me kinda pause working on this

Regardless of what happens with EventCounters, there is a chunk of work that I don't think is contentious to just create managed APIs that expose the data we need in-proc. It is item 8 in Vance's proposal above. I'd suggest starting there, and you could easily build tests/demos for the APIs that have no dependency on EventCounter.

Would it be possible to chat about this offline some time next week?

Happy to chat on it offline (and I assume Vance is too).

we should allow it to be updated out-of-band

I think aside from shipping in CoreLib, there are two options you might have been referring to as OOB:

a) Code isn't in CoreLib assembly, but some other assembly that is part of the .Net Core runtime package
b) Code is in an assembly that ships on NuGet completely independent of the runtime

I'm not convinced either (a) or (b) buys us all that much in terms of agility, so I still lean towards Jan's position that we should put it in CoreLib and put the source in the shared partition. In particular:

  1. Whether it's through reflection or not, we are unlikely to create public EventCounter APIs right now that the code in CoreLib doesn't need to call. The moment CoreLib takes a dependency on the API surface, it would be very awkward to break those APIs at any point in the future. I don't see that the bar for making changes is going to be that different regardless of which assembly the code is in.
  2. .Net Core ships every 6-12 months, with servicing bug fixes at a faster cadence, which feels good enough for the kind of follow-up feature work/bug fixes I'd imagine EventCounter could have (assuming we keep it simple, and I'm advocating we should).
  3. OOB makes it a little easier to drop an assembly entirely in the future if we want to stop using it, but given its already small size and the dependency we want to take on it, that seems unlikely. We also expose everything in CoreLib via forwarders, so it remains an implementation detail we can change later if we want to move the code back out to some other in-box assembly.
  4. OOB makes it easier to share a single binary across multiple runtime implementations, but I'd like to believe sharing source in the shared partition is also a pretty reasonable way to do sharing, and it doesn't come with all the overhead of building delegate and reflection wrappers for light-up/dependency inversion.

@kouvel
Member

kouvel commented Jan 15, 2019

The thread ones look like they wouldn't be too difficult to implement. I'm thinking something like below, with some questions/comments:

  • static int Thread.Count - Total number of threads that currently exist
    • Should this include only threads that may run managed code, or all threads including wait, timer, GC threads, etc.?
    • Should this include external threads that the runtime knows about but did not create, like with reverse-p/invoke from a separately created thread?
  • static int ThreadPool.ThreadCount - Number of worker + IO threads that currently exist
    • I'm thinking this would not reflect how many threads hill climbing thinks we should have, but rather just the total that were created, which would be a more stable value
  • static int ThreadPool.ActiveThreadCount - Number of worker + IO threads that are currently processing work; I thought it may be useful to have
    • The gap from this to ThreadCount would roughly give the number of threads currently waiting for work
    • Often verbose for pushing an event, could send updates periodically at some frequency, maybe at hill climbing and gate thread intervals
  • static int ThreadPool.PendingWorkItemCount - Number of work items currently in the queue
    • Could get a rough but reasonable value from scanning the global and local queues
    • Capped to int.MaxValue
    • Could be a relatively expensive call, periodic polling may be ok
    • Very verbose for pushing an event, could send updates periodically at some frequency, maybe from the gate thread or a timer callback
  • static int ThreadPool.CompletedWorkItemCount - Number of work items that have completed since the start of the process
    • Value may wrap around, changes over units of time would be plottable and would allow calculating throughput and average work item duration
    • Looks like we already track this info for hill climbing, could get a rough but reasonable value
    • Could be a relatively expensive call, periodic polling may be ok
    • Very verbose for pushing an event, could send updates similarly as for PendingWorkItemCount
  • static int Monitor.LockContentionCount - Number of times so far in the process that a thread had to wait for a lock due to contention
    • Value may wrap around, changes over units of time would be plottable and would allow calculating frequency of contention, but not average duration of contention
    • Would only include contention from Monitor locks, not from anything else like Crst
    • Contention happens a lot more frequently in NetCore than in NetFx due to less spinning, the values would not be comparable
    • Could consider decreasing the frequency in some way, [edited] need to think about it more
    • Could be very verbose for pushing an event currently, though it may be possible to decrease the frequency
    • An alternative (maybe in addition) is to track total duration of waits if that would be more interesting. A high jump in that over units of time may more directly indicate a different type of problem.

Some of these we don't have in ETW event data, or their meaning may be different. Is it the intention to have all data exposed in this fashion also be included in ETW events for finer debugging?

@kouvel
Member

kouvel commented Jan 15, 2019

Also it seems like some tracking may be appropriate to enable only while the event counter is enabled. For an API to be able to poll at any time, the info would have to be tracked all the time. For example, if we wanted to track the duration of waits for a lock, getting two timestamps may not be negligible overhead on the wait path if the wait does not last too long, and it seems to be a bit slower on Linux. Do you think it would be worth including APIs to enable/disable tracking, maybe on a case-by-case basis or by area or something like that?

Tracking ThreadPool.CompletedWorkItemCount is something we're already paying for (for hill climbing), and it can be made relatively cheaper than it is now, but that's in a very hot path where a few extra memory accesses can make a difference; for example, if we were to use a different thread pool implementation, it may not be necessary to track this when tracking is not enabled.

@jkotas
Member

jkotas commented Jan 15, 2019

Total number of threads that currently exist

https://github.com/dotnet/coreclr/issues/20372#issuecomment-449193835 suggested that this should be the same number as Process.ThreadCount. It means the total number of threads in the process.

@vancem
Contributor

vancem commented Jan 16, 2019

@kouvel - first you should know that we get to define what these things are, and so it is completely reasonable to have a conversation as the implementation proceeds as to what the exact meaning is. Some general principles are

  1. The goal of the counters is to diagnose various performance and functionality problems.
  2. We are trying NOT to define new things; if there are existing definitions we should be biased toward using them.
  3. In particular, if the Desktop runtime has a PerfCounter like it, we should be biased toward using it (unless we have reason to believe it was a mistake).
  4. Getting these values should be VERY inexpensive. If that is not true, we should see if we can collect something else for which it is (e.g. count the errors rather than the successes).
  5. We want this to be easy to implement (so for everything we do add, we have to believe it is valuable).

Note that all these values will be PULL (we probably just have a struct with pointers on both the managed and native side; we update it on the native side and probe it from the managed side).

On the specific questions:

static int Thread.Count
I am conflicted. System.Thread is about managed threads, so a static there should be the count of managed threads (those with managed IDs). Our strategy for thread/process stuff is to clone what we need out of System.Diagnostics.Process, so we just need a PRIVATE API in CoreLib for the OS thread count, and we can drop the need for Thread.Count (our counter will use the OS count). Comments?

static int ThreadPool.ThreadCount - Number of worker + IO threads that currently exist
I'm thinking this would not reflect how many threads hill climbing thinks we should have, but rather just the total that were created, which would be a more stable value

Yes, this should be the number of threads 'owned' by the Threadpool (regardless of what they are doing).

static int ThreadPool.PendingWorkItemCount - Number of work items currently in the queue
Could get a rough but reasonable value from scanning the global and local queues

Yes. I too worry about its cost. However, polling should happen at a reasonable interval (e.g. 1 sec or less). If you have ideas on a better metric (one that is cheaper but lets you know whether work is backing up or not) we should discuss.

static int ThreadPool.CompletedWorkItemCount - Number of work items that have completed since the start of the process

Probably should be a long (unless that is problematic). Note that the EventCounter will do a delta (it will compute the number of completed work items since the last counter write). This is what users want to see (a monotonically increasing number is not), but as a static variable deltas don't work (you don't know if others have looked at the variable), so a cumulative one is reasonable. (Note that overflow is not that big of a deal, because deltas still make sense.)
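As a concrete illustration of the delta idea (a sketch of what the counter infrastructure or a consumer might do, not actual EventCounter code), assuming the proposed ThreadPool.CompletedWorkItemCount static exists:

using System;
using System.Threading;

static class WorkItemRateSample
{
    static void Main()
    {
        // Poll the cumulative counter once per second and report only the per-interval delta,
        // which is what users actually want to plot.
        long previous = ThreadPool.CompletedWorkItemCount;   // proposed static, monotonically increasing
        while (true)
        {
            Thread.Sleep(TimeSpan.FromSeconds(1));
            long current = ThreadPool.CompletedWorkItemCount;
            long completedThisInterval = current - previous;  // the delta stays meaningful even if the raw value wraps
            previous = current;
            Console.WriteLine($"work items completed/sec: {completedThisInterval}");
        }
    }
}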

static int Monitor.LockContentionCount - Number of times so far in the process that a thread had to wait for a lock due to contention

Again, this could be a long, but it is OK if an int is significantly easier. It should be like the Desktop counter (thus only managed locks, just a count).

Having total duration may be useful, but we don't need it right now (we are mostly trying to get to parity with Desktop).

Also it seems like some tracking may be appropriate to enable only while the event counter is enabled

I am trying to avoid this for simplicity's sake. The intent is that we may expose the APIs above publicly, and they have no notion of being 'turned on'. We could invent one, but it would be a stumbling block...

@kouvel
Member

kouvel commented Jan 16, 2019

Ok thanks, some notes:

  • Can exclude Thread.Count and lock wait duration for now
  • Can make CompletedWorkItemCount and LockContentionCount long, a single value may be interesting at times. May as well make PendingWorkItemCount long as well.
  • For PendingWorkItemCount, I can't think of a better way to track whether the thread pool queue is getting backed up; a 1 sec interval should be ok
  • For now LockContentionCount would just be incremented each time a wait is necessary. Frequent contention would not necessarily indicate a problem; the user would have to determine how large a jump in how short a time is worth investigating.

@vancem
Contributor

vancem commented Jan 17, 2019

I sent e-mail to @sywhang, but I wanted to log it here as well.

The EventSource that we should hang the GC and JIT EventCounters on should be a new one called System.Runtime. In particular, its declaration should be:

[EventSource(Name = "System.Runtime")]
internal sealed class ManagedRuntimeEventSource : EventSource

The name of the class is not important. RuntimeEventSource is already taken by the EventSource that supports the native events, which is why I called this one 'ManagedRuntimeEventSource'. Note that I would actually prefer if you renamed the current one 'NativeRuntimeEventSource', and then you can use 'RuntimeEventSource' for the managed one.

@kouvel
Member

kouvel commented Feb 15, 2019

RE ThreadPool.ThreadCount, for CoreRT, it looks like we don't have visibility into how many threads currently exist in the Windows thread pool. With any thread pool implementation it would be possible to tell how many threads are currently processing work.

Both values could be useful when available. For now, couple of options:

  • ThreadCount would reflect the number of existing thread pool threads when that info is available. When it's not available, it would reflect the number of thread pool threads currently processing work.
  • Or, in addition to or instead of ThreadCount, expose ActiveThreadCount that reflects the number of threads that are currently processing work

I'm leaning towards exposing ActiveThreadCount, without exposing ThreadCount, as that is a more long-term value that can be provided with any implementation, and that value may be more interesting anyway.

Thoughts?

@kouvel
Member

kouvel commented Feb 17, 2019

I think it is possible to track the number of existing threads in the thread pool, and it may be cheaper than tracking the active thread count; taking a closer look.

@davidfowl
Member

I'd like to add another counter and API to this list. We need the ability to get the number of timers in the timer queue (all queues) and there should also be a counter for that.

@kouvel
Member

kouvel commented Jun 6, 2019

I'll take a look. Not sure if it's too late for API additions, but will see.

@davidfowl
Member

@kouvel Worst case, it doesn't need to be a public API (like 80% of the GC counters 😄)

@sywhang
Contributor

sywhang commented Jun 16, 2019

Active timer counter has been added via dotnet/coreclr#25186 and dotnet/diagnostics#342

@danmoseley
Member

cc @preethikurup to make sure she's aware of this for 1st party purposes.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@StephenBonikowsky StephenBonikowsky moved this from In progress to Triage Needed in .NET Core impacting internal partners May 12, 2020
@tommcdon tommcdon moved this from Triage Needed P1 (Requested for 3.1 or 5.0) to 5.0 Release - Committed in .NET Core impacting internal partners May 15, 2020
@tommcdon
Member

The remaining work on this PR is to add managed APIs for .NET Core counters. This is tracked here: #36571.

.NET Core impacting internal partners automation moved this from 5.0 Release - Committed to Done May 15, 2020
.NET Core Diagnostics automation moved this from Backlog to Done May 15, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 15, 2020