Improve scalability of monitored Counter / UpDownCounter #91566
Conversation
Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti

Issue Details

Avoid locking around every Update. Instead, use interlocked operations, and for CounterAggregator, partition the value being updated into one per core rather than one all-up. Without the partitioning, the compare-exchange loop can actually be measurably slower than the use of locking under heavy contention. With the partitioning, given we're dealing with floating-point, this can conceivably end up changing the reported values slightly, since we end up changing the order of operations (but with concurrency, such order of operations was already non-deterministic).

Contributes to dotnet/aspnetcore#50412
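To make the approach concrete, here is a minimal sketch of the per-core partitioning described above, assuming a 64-byte cache line. The class name PartitionedDoubleCounter and its members are invented for illustration; the actual CounterAggregator in the diff differs in detail (for example, it caps the number of slots, as discussed later in this thread).

using System;
using System.Runtime.InteropServices;
using System.Threading;

internal sealed class PartitionedDoubleCounter
{
    // Pad each slot to a full cache line so cores updating adjacent slots don't false-share.
    [StructLayout(LayoutKind.Explicit, Size = 64)]
    private struct PaddedDouble
    {
        [FieldOffset(0)]
        public double Value;
    }

    private readonly PaddedDouble[] _deltas = new PaddedDouble[Environment.ProcessorCount];

    public void Add(double value)
    {
        // Pick a slot based on the core the thread is currently running on.
        ref double slot = ref _deltas[Thread.GetCurrentProcessorId() % _deltas.Length].Value;

        // Lock-free add via a compare-exchange loop; it only retries when another
        // thread raced to update the same slot.
        double current;
        do
        {
            current = slot;
        }
        while (Interlocked.CompareExchange(ref slot, current + value, current) != current);
    }

    public double GetAndReset()
    {
        // Sum and zero every slot. Additions now happen per slot, so the floating-point
        // order of operations differs slightly from a single shared accumulator,
        // as noted in the description above.
        double sum = 0;
        for (int i = 0; i < _deltas.Length; i++)
        {
            sum += Interlocked.Exchange(ref _deltas[i].Value, 0);
        }
        return sum;
    }
}

The benchmark included with the issue details: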
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using System.Diagnostics.Tracing;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("Error", "StdDev", "Median", "RatioSD", "Job")]
public class Tests
{
    const int Iters = 1_000_000;

    private Meter _meter;
    private UpDownCounter<int> _upDownCounter;
    private Counter<int> _counter;
    private MetricsEventListener _listener;

    [GlobalSetup]
    public void Setup()
    {
        _meter = new Meter("Example");
        _upDownCounter = _meter.CreateUpDownCounter<int>("upDownCounter");
        _counter = _meter.CreateCounter<int>("counter");
        _listener = new MetricsEventListener();
    }

    [GlobalCleanup]
    public void Cleanup()
    {
        _listener.Dispose();
        _meter.Dispose();
    }

    [Benchmark]
    public void UpDownCounter_Serial()
    {
        for (int i = 0; i < Iters; i++)
        {
            _upDownCounter.Add(1);
            _upDownCounter.Add(-1);
        }
    }

    [Benchmark]
    public void UpDownCounter_Parallel()
    {
        Parallel.For(0, Iters, i =>
        {
            _upDownCounter.Add(1);
            _upDownCounter.Add(-1);
        });
    }

    [Benchmark]
    public void Counter_Serial()
    {
        for (int i = 0; i < Iters; i++)
        {
            _counter.Add(1);
            _counter.Add(1);
        }
    }

    [Benchmark]
    public void Counter_Parallel()
    {
        Parallel.For(0, Iters, i =>
        {
            _counter.Add(1);
            _counter.Add(1);
        });
    }

    private sealed class MetricsEventListener : EventListener
    {
        protected override void OnEventSourceCreated(EventSource eventSource)
        {
            if (eventSource.Name == "System.Diagnostics.Metrics")
            {
                // Enables MetricsEventSource aggregation for the two instruments under test.
                EnableEvents(eventSource, EventLevel.LogAlways, EventKeywords.All, new Dictionary<string, string>() { { "Metrics", "Example\\upDownCounter;Example\\counter" } });
            }
        }
    }
}
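To reproduce locally, the switcher above can be run in the usual BenchmarkDotNet way (assuming it is the entry point of a Release-built console project that references BenchmarkDotNet). Note that the MetricsEventListener is what causes MetricsEventSource to aggregate these instruments; without a listener enabled, the Add calls are close to no-ops and the contended path isn't exercised.

dotnet run -c Release -- --filter "*Tests*"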
Avoid locking around every Update. Instead, use interlocked operations, and for CounterAggregator, partition the value being updated into one per core rather than one all-up. (351333b to 83e59cb)
LGTM. It will be good to have @noahfalk take a look too.
Sorry, I was OOF last week and am getting caught up. I like this as something people could opt into, but I worry about the memory usage if we make this the default implementation. In the case of the 80-core machines mentioned in the issue, we are trading an 8-byte double for a reference to an array of 80 * cache_line_size bytes, 80 * 64 ≈ 5KB per time series I think? MetricsEventSource has a default tracking limit of 1000 time series. A time series is one combination of counter name + all dimension values, so it's not that hard to hit the time series limit. For people running this in production it could be a +5MB VM cost. In some cases that's probably no big deal, but in a container with tight limits that might be a significant regression or an OOM. In terms of fixes I think we could:
I'd like to find out if they are hitting similar contention issues with the lock on histograms. Histograms are a much bigger data structure, so defaulting them to one copy per core probably isn't viable unless it is opt-in or we change the data structure to make them grow more dynamically than they already do.
I wonder if it's better to just remove the padding than to cap it at 8, as the main beneficiaries of this optimization are many-core systems.
If that delivers a better perf result I am fine with it, though I still think we need to cap the total memory used for these counters at around 512KB (a somewhat arbitrarily selected value). If we had no padding, presumably that raises the cap to 64 doubles per time series (512KB across 1000 time series is roughly 512 bytes, i.e. 64 unpadded doubles, each).
A microbenchmark to play with (on a system with many cores): https://gist.github.com/EgorBo/fb944b46bf9b1ffd35b01bb3b5726e7f
What about an adaptive solution? If the number of dimensions increases beyond a limit, then switch back to the old behavior to save memory. And if a counter has a lot of dimensions, there is probably less contention on the lock, since updates are spread out across the dimensions. (Is there a lock per dimension? If there isn't, then ignore this.)
I worry that the assignment of which time series are optimized for memory and which are optimized for latency would appear largely random, based on the creation order of the different time series. Under high contention the perf cliff seems severe enough that I'd rather get an error asking the dev running the perf scenario to raise the memory limits vs. having unexplained significant changes in perf. From what I can see in the perf numbers, I think the contention impacts on @EgorBo's ARM multi-core machine are much larger than what I assume @stephentoub measured on x64.
Today there is one lock per time series. (Time series being a unique combination of instrument + all the dimension values, which I think is what you meant.)
@noahfalk, what would you like me to do with this? Should I do "cap parallelism at 8"?
Yeah, I think capping at 8 is a fine place to start. I'd still be interested in making further improvements, but it's likely to get more involved. I assume you'd prefer to bank your incremental improvements and move on vs. working on more complicated options?
Yes, assuming that's acceptable to you, that's my preference.
LGTM, thanks @stephentoub!
/// The array is limited to a semi-arbitrary limit of 8 in order to avoid excessive memory
/// consumption when many counters are being used.
/// </remarks>
private readonly PaddedDouble[] _deltas = new PaddedDouble[Math.Min(Environment.ProcessorCount, 8)];
@stephentoub shouldn't this be a max if the comment is accurate?
I believe both the comment and code are correct. We want to increase the number up to the processor count but not go above 8, so we want the processor count or 8, whichever is lower.
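A quick illustration of why Min is the right call here (core counts are hypothetical):

// Math.Min grows the slot count with the machine but never lets it exceed 8.
int slotsSmall = Math.Min(4, 8);   // 4 slots on a 4-core machine
int slotsLarge = Math.Min(80, 8);  // 8 slots on an 80-core machine (capped)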
@bencyoung-Fignum Min is enforcing the limit; here it means 8 or fewer.
Doh you are right, apologies!