Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to use MemoryDiagnoser properly? #1250

Closed
luckerby opened this issue Sep 25, 2019 · 9 comments
Closed

How to use MemoryDiagnoser properly? #1250

luckerby opened this issue Sep 25, 2019 · 9 comments

Comments

@luckerby
Copy link

@luckerby luckerby commented Sep 25, 2019

I'm benchmarking some very simple C# code that essentially boxes 10 mil int values.

            const int noNumbers = 10000000;  // 10 mil
            object o = null;
            for (int i = 0; i < noNumbers; i++)
            {
                o = i;
            }
            int v = (int)o;

The machine that's running this is a Windows 10 64-bit box, with no particular changes done to the .NET Framework 4.7.2, so GC should be running in workstation mode.

After running BenchmarkDotNet against the code above (AnyCPU/Prefer 32-bit mode) I get the output below, signalling that no GC occurred for either gen0/1/2:
image

However if I take the same code and place it in a new C# project (AnyCPU/Prefer 32-bit mode) in the Main function and run this, I get a considerably different outcome. Note the 57 gen0 GCs reported by PerfView:
image

I've checked the code generated by BenchmarkDotNet using the [DisassemblyDiagnoser(printAsm: true, printSource: true)] attribute and I've also looked at the code generated by the standalone app using WinDbg. Here they are side-by-side, showing identical assembly instructions (minus the hex addresses, of course):
image

Details about the runtime used, as reported by BenchmarkDotNet :
image

The difference between the number of gen0 GCs is quite big. Am I doing something wrong here, that would explain the results ?

@WojciechNagorski

This comment has been minimized.

Copy link
Collaborator

@WojciechNagorski WojciechNagorski commented Sep 25, 2019

Did you try to use different GC mode settings of your benchmark configuration?
https://benchmarkdotnet.org/articles/samples/IntroGcMode.html

@luckerby

This comment has been minimized.

Copy link
Author

@luckerby luckerby commented Sep 26, 2019

I was happy that the same GC mode was used in both my standalone app and in the BenchmarkDotNet profiled one - Concurrent Workstation. For my app I know this is the case, since there's no specific entry in the App.config file that would change the default, while for BenchmarkDotNet it's clearly mentioned in the output visible in my original post.
If I go ahead and change the GC mode - eg to Server - I'm worried of running into other issues (eg multiple managed heaps, one per core).
I've also thought of trying to profile the memory used by the child process invoked by BenchmarkDotNet when it's doing the profiling. JetBrains dotMemory would work just fine, but the trouble is I don't know how to hook to the child process only (representing my app); plus I imagine this is invoked a lot of times, which complicates matters.

@luckerby

This comment has been minimized.

Copy link
Author

@luckerby luckerby commented Sep 27, 2019

I went ahead and ran PerfView at the same time that the BenchmarkDotNet profiling was running, in order to capture the ETW for the GC. Sure enough, there are a lot of gen0 GCs happening, but BenchmarkDotNet doesn't report any at the end (below). Is there some sort of threshold maybe over which a GC is reported in the results ?

image

@adamsitnik

This comment has been minimized.

Copy link
Member

@adamsitnik adamsitnik commented Oct 2, 2019

Hello everyone!

To capture the number of total allocated bytes on Full .NET Framework we need to enable monitoring for AppDomain:

if (RuntimeInformation.IsFullFramework)
AppDomain.MonitoringIsEnabled = true;

This adds some overhead, this is why we do it in a single extra iteration run after all benchmark iterations:

(GcStats workGcHasDone, ThreadingStats threadingStats) = includeExtraStats
? GetExtraStats(new IterationData(IterationMode.Workload, IterationStage.Actual, 0, invokeCount, UnrollFactor))
: (GcStats.Empty, ThreadingStats.Empty);

In this particular case the problem is that GCs occure during the benchmark iterations, but GC is self-learning, changes the generation budgets and does not perfom any work when we run the one extra iteration at the end.

The fix would be to capture the GC counts for every benchmark workload iteration, and just extend the data with the total number of allocated bytes captured in the last iteration. I currently don't have time to fix it, but I am going to mark this issue as up-for-grabs and hope that some #Hacktoberfest contributors are going to fix it. If not, I am going to fix it after Dotnetos Conference ;P

@WojciechNagorski

This comment has been minimized.

Copy link
Collaborator

@WojciechNagorski WojciechNagorski commented Oct 9, 2019

I wanted to fix this issue but my investigation shows that there is no bug.

My benchmark:

using BenchmarkDotNet.Attributes;

namespace BenchmarkDotNet.Samples
{
    [MemoryDiagnoser] // <-- I added MemoryDiagnoser here
    public class Issue1250
    {
        [Benchmark]
        public int Benchmark()
        {
            const int noNumbers = 10000000;  // 10 mil
            object o = null;
            for (int i = 0; i < noNumbers; i++)
            {
                o = i;
            }
            return (int)o;
        }
    }
}

Then I run benchmark from the console:

dotnet run -c Release -f net472 -- --filter *Issue1250*

I got a result:
image
image

When I run this benchmark without [MemoryDiagnoser] I got a result:
image
image

@luckerby Did you use [MemoryDiagnoser] in your benchmark? if not, could you check if that solves your problem?

@adamsitnik I think that BenchmarkDotNet should not print information about GC if it doesn't run extra iteration. I mean this line:

GC:  0 0 0 0 0
Threading:  0 0 0

Extra iteration runs only when user adds MemoryDiagnoser or ThreadingDiagnoser. In other cases we shouldn't print these lines:

outWriter.WriteLine(GCStats.ToOutputLine());
outWriter.WriteLine(ThreadingStats.ToOutputLine());

Or even more precisely:
GC: 0 0 0 0 0 should print only with MemoryDiagnoser
Threading: 0 0 0 should print only with ThreadingDiagnoser
@adamsitnik What do you think?

@luckerby

This comment has been minimized.

Copy link
Author

@luckerby luckerby commented Oct 9, 2019

@wojtpl2 I'm on a different machine than the one I used to initially run the code, and to be honest I don't remember the attributes used, aside from the disassembly one.

However, I've recreated the project, and first used MemoryDiagnoser against the class with the method under test, and got the results below similar to yours:
image

Removing the MemoryDiagnoser attribute resulted in an outcome, again, similar to yours:
image

The value of gen0 collections however looks fishy - if I understood correctly the output in BenchmarkDotNet is how many collections are done per 1000 runs, which would indicate an average of less than 1 collection.

I will check in a few hours how the original code looked like - particularly if the MemoryDiagnoser was used or not - just to be sure.

@luckerby

This comment has been minimized.

Copy link
Author

@luckerby luckerby commented Oct 9, 2019

@wojtpl2 The original code didn't have the MemoryDiagnoser attribute applied either, which makes the outcomes so far consistent.

I've looked at the source code, but couldn't figure this one out: how can you tell that not using [MemoryDiagnoser] results in includeExtraStats eventually being null when the stats get prepared ?

(GcStats workGcHasDone, ThreadingStats threadingStats) = includeExtraStats
? GetExtraStats(new IterationData(IterationMode.Workload, IterationStage.Actual, 0, invokeCount, UnrollFactor))
: (GcStats.Empty, ThreadingStats.Empty);

@luckerby

This comment has been minimized.

Copy link
Author

@luckerby luckerby commented Oct 11, 2019

Running again on the original machine, and I've used just one single run - using [SimpleJob(RunStrategy.ColdStart, targetCount: 1)] in order not to allow the GC to tweak itself in any way by "observing" subsequent runs of the same code. A result of 57 is shown for gc0 collections:

image

Immediately running PerfView againt the individual project doing the same exact thing shows again 57 gc0, just as the value observed ealier, and the same seen in the first post of this thread (number is 114 since both GC start and stop is captured):

image

Therefore things are quite consistent.

As a side note, I initially thought that 57 in BenchmarkDotNet's output is the number per 1000 runs, but I've later discovered that's just the case in the function output, not the overall GC stats.

@adamsitnik

This comment has been minimized.

Copy link
Member

@adamsitnik adamsitnik commented Oct 14, 2019

If you want BDN to produce memory statistics then you have to use MemoryDiagnoser, either via attribute, fluent config API or console line argument. The final results are printed in the Summary table, the intermediate output is only for the purpose of the Host and Benchmark process exchanging information.

You can read more about MemoryDiagnoser in our docs and on my blog

@adamsitnik adamsitnik closed this Oct 14, 2019
@adamsitnik adamsitnik changed the title No GCs Invoked During Benchmarking How to use MemoryDiagnoser properly? Oct 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.