
High differences between run for GC Diagnoser #133

Closed
xoofx opened this issue Apr 1, 2016 · 21 comments
@xoofx
Member

xoofx commented Apr 1, 2016

Trying to play with the GCDiagnoser, I'm getting weird results on consecutive runs: the Gen0/Gen1/Gen2/Bytes Allocated/Op figures vary a lot (30-40%), making comparisons quite useless... I would not expect the figures to change that much (afaik the GC is not non-deterministic to that degree).

Have you noticed this issue, or am I doing something wrong there?

@adamsitnik
Member

I have been running our integration tests plenty of times; they contain a simple test for the GC Diagnoser, and the results were always the same.

The .NET GC is self-tuning (changing Gen sizes etc.), but we always start a new process, so the results should be almost the same as long as the OS has similar resources available.

Could you share a sample benchmark with the config that you use?

@mattwarren
Contributor

Can you post the logs from 2 runs with a different # of GC collections? It'll be easier to see what's going on.


@xoofx
Member Author

xoofx commented Apr 1, 2016

The output I'm getting while running multiple times is like this:

               Method |    Median |    StdDev |  Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
--------------------- |---------- |---------- |------- |------ |------ |------------------- |
    TestCommonMarkNet | 4.8020 ms | 0.0326 ms | 208.54 | 14.34 | 89.28 |       1,117,194.48 |
 TestCommonMarkNetNew | 5.7072 ms | 0.0541 ms | 169.87 | 84.31 | 72.44 |       1,124,892.65 |
          TestMarkdig | 6.4552 ms | 0.0454 ms | 115.00 | 75.00 | 55.00 |         851,150.77 |
     TestMarkdownDeep | 8.4201 ms | 0.5206 ms | 176.43 | 76.75 | 56.46 |       1,039,481.68 |

               Method |    Median |    StdDev |  Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
--------------------- |---------- |---------- |------- |------ |------ |------------------- |
    TestCommonMarkNet | 4.7802 ms | 0.0365 ms | 117.28 |  8.10 | 50.15 |         738,831.91 |
 TestCommonMarkNetNew | 5.7256 ms | 0.0317 ms | 176.00 | 87.42 | 75.26 |       1,371,935.85 |
          TestMarkdig | 6.2702 ms | 0.0673 ms |  91.59 | 59.74 | 43.81 |         797,976.80 |
     TestMarkdownDeep | 7.9212 ms | 0.0405 ms | 164.00 | 71.00 | 52.00 |       1,134,079.22 |


               Method |    Median |    StdDev |  Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
--------------------- |---------- |---------- |------- |------ |------ |------------------- |
    TestCommonMarkNet | 4.8237 ms | 0.0504 ms | 152.82 | 10.51 | 65.42 |       1,117,194.48 |
 TestCommonMarkNetNew | 5.6900 ms | 0.0490 ms | 138.97 | 68.97 | 59.27 |       1,255,915.01 |
          TestMarkdig | 6.3441 ms | 0.0963 ms |  70.75 | 46.02 | 33.66 |         713,802.94 |
     TestMarkdownDeep | 7.8991 ms | 0.0547 ms | 164.00 | 71.00 | 52.00 |       1,314,925.80 |    


               Method |    Median |    StdDev |  Gen 0 |  Gen 1 |  Gen 2 | Bytes Allocated/Op |
--------------------- |---------- |---------- |------- |------- |------- |------------------- |
    TestCommonMarkNet | 4.7855 ms | 0.0920 ms | 194.88 |  13.53 |  83.23 |         963,242.75 |
 TestCommonMarkNetNew | 5.7061 ms | 0.0456 ms | 249.01 | 123.83 | 106.91 |       1,522,547.54 |
          TestMarkdig | 6.3259 ms | 0.1434 ms | 115.00 |  75.00 |  55.00 |         785,682.84 |
     TestMarkdownDeep | 7.9409 ms | 0.0626 ms | 173.00 |  75.00 |  55.00 |         938,949.44 |

               Method |    Median |    StdDev |  Gen 0 |  Gen 1 | Gen 2 | Bytes Allocated/Op |
--------------------- |---------- |---------- |------- |------- |------ |------------------- |
    TestCommonMarkNet | 4.7873 ms | 0.0881 ms | 188.43 |  13.16 | 80.36 |         992,620.58 |
 TestCommonMarkNetNew | 5.7098 ms | 0.0235 ms | 212.03 | 105.35 | 90.77 |       1,381,368.58 |
          TestMarkdig | 6.4275 ms | 0.0694 ms | 113.54 |  74.13 | 54.42 |         827,121.71 |
     TestMarkdownDeep | 7.9162 ms | 0.0472 ms | 173.00 |  75.00 | 55.00 |       1,000,627.30 |

As you can see, the figures sometimes jump around for Gen0/Gen1/Gen2 (and even more for Bytes Allocated/Op), which can even change the ranking between runs.

When I run, for example, a simple benchmark (without using BenchmarkDotNet) testing one solution (like TestCommonMarkNet) and output GC.CollectionCount(0)/GC.CollectionCount(1)/GC.CollectionCount(2), the GC counts are fixed and do not change between runs.

@xoofx
Member Author

xoofx commented Apr 2, 2016

Looking at the code, one thing I'm not sure about: I would expect the GCDiagnoser to only process GC events for the IterationMode.MainTarget period. Typically there are other periods that seem to be non-deterministic (like IdleTarget; not sure what it does, but it looks like it adapts to the process) and we don't want them as part of the measurement. Also, I'm not sure it is good to have a variable operation count for different methods when measuring GC events...
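To make the suggestion concrete, the filtering could look something like the sketch below. This is a hypothetical shape only: the hook names (`OnIterationStart`, `OnGcEvent`, `GcEventArgs`) and fields are illustrative, not the actual diagnoser API.

```csharp
// Hypothetical sketch (names are illustrative, not the real diagnoser API):
// only accumulate GC statistics while the benchmark is in the MainTarget
// phase, so Pilot/Idle/Warmup phases don't pollute the numbers.
private IterationMode currentMode;

public void OnIterationStart(IterationMode mode) => currentMode = mode;

public void OnGcEvent(GcEventArgs e)
{
    if (currentMode != IterationMode.MainTarget)
        return; // ignore events from Pilot, Warmup, IdleTarget, etc.
    totalAllocatedBytes += e.AllocatedBytes;
    gen0Collections += e.Gen0Count;
}
```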

@xoofx
Member Author

xoofx commented Apr 2, 2016

Ok, so after tweaking the usage of BenchmarkDotNet and using Mode.SingleRun, which disables the Pilot phase, the numbers are completely stable (and much more comparable between the methods).

Though I'm not sure I used the API correctly: I had to patch the MethodInvoker.Invoke method to hardcode invokeCount = 64, because when I tried to set the attribute on each method ([Benchmark(OperationsPerInvoke = 64)]), the methods were actually not executed and gave completely invalid results... so it's difficult to understand what's going on here...

I used the following config:

var config = ManualConfig.Create(DefaultConfig.Instance);
var gcDiagnoser = new GCDiagnoser();
config.Add(new Job { Mode = Mode.SingleRun, LaunchCount = 2, WarmupCount = 2, TargetCount = 10 });
config.Add(gcDiagnoser);
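For context, a config like the one above is typically handed to the runner as shown in this minimal sketch (`MyBenchmarks` is a placeholder class name, not from the original report):

```csharp
// Minimal usage sketch; MyBenchmarks is a placeholder benchmark class.
// The config built above (SingleRun job + GCDiagnoser) is passed to the runner:
var summary = BenchmarkRunner.Run<MyBenchmarks>(config);
```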

@mattwarren
Contributor

As you can see, the figures sometimes jump around for Gen0/Gen1/Gen2 (and even more for Bytes Allocated/Op), which can even change the ranking between runs

The Gen 0/1/2 values are scaled/normalised based on the min value across the benchmarks (see https://github.com/PerfDotNet/BenchmarkDotNet/blob/master/BenchmarkDotNet.Diagnostics/GCDiagnoser.cs#L174), so it's not surprising they vary as the # of runs/Ops changes.
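To illustrate the scaling idea (a simplified sketch of the concept, not the exact code behind that link): each benchmark's collection counts are rescaled to a common operation count before display, so when the adaptive pilot phase picks different op counts per run, the displayed values move too. The numbers below are invented for illustration.

```csharp
// Simplified illustration of the normalisation idea (not the real code):
// rescale each benchmark's collection count to the smallest total op count,
// so the displayed Gen 0/1/2 values are comparable across benchmarks.
long minOps = Math.Min(2_000_000, 1_000_000);        // 1,000,000
double gen0A = 400 * ((double)minOps / 2_000_000);   // 200.0
double gen0B = 100 * ((double)minOps / 1_000_000);   // 100.0
// Because the adaptive total op counts differ between runs,
// these scaled values differ between runs as well.
```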

However the # Bytes is per Op, so it should be more stable.

@mattwarren
Contributor

Yeah 'OperationsPerInvoke' isn't going to help, that's for something else.
However you should be able to use 'IterationTime' on the Job, see
https://github.com/PerfDotNet/BenchmarkDotNet/blob/master/BenchmarkDotNet/Jobs/Job.cs#L28

Looking at the code, one thing I'm not sure is that I would expect the GCDiagnoser to only process GC events for the IterationMode.MainTarget period: Typically there are other periods that seem to be non deterministic (like the IdleTarget, not sure what it does, but it looks like it is adapting to the process) and we don't want them as part of the measurement. Also, not sure that it is good to have variable #operation count for different method when measuring GC event

Good point, I need to have a think about how to best handle this.

@xoofx
Member Author

xoofx commented Apr 2, 2016

Setting IterationTime doesn't work for me:

var config = ManualConfig.Create(DefaultConfig.Instance);
var gcDiagnoser = new GCDiagnoser();
config.Add(new Job { Mode = Mode.SingleRun, LaunchCount = 2, WarmupCount = 2, IterationTime = 64, TargetCount = 10 });
config.Add(gcDiagnoser);

It results in only 1 op:

MainTarget 1: 1 op, 4785568.28 ns, 4.7856 ms/op
MainTarget 2: 1 op, 4660543.99 ns, 4.6605 ms/op

@xoofx
Member Author

xoofx commented Apr 2, 2016

Looking at MethodInvoker.cs:92, IterationTime is only used in Throughput mode, not in SingleRun.
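The dispatch being described has roughly this shape (a paraphrased sketch, not the real source; `ChooseInvokeCountFor` is a hypothetical helper name):

```csharp
// Rough sketch of the behaviour described above (paraphrased, not real code):
// IterationTime only influences the invoke count in Throughput mode;
// SingleRun always performs a single invocation per iteration.
switch (job.Mode)
{
    case Mode.Throughput:
        invokeCount = ChooseInvokeCountFor(job.IterationTime); // auto-tuned by the pilot phase
        break;
    case Mode.SingleRun:
        invokeCount = 1; // IterationTime is ignored here
        break;
}
```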

@xoofx
Member Author

xoofx commented Apr 2, 2016

Also, on a side note, I had to compile BenchmarkDotNet.Diagnostics myself, as it is not published to NuGet. Is that an oversight?

@xoofx
Member Author

xoofx commented Apr 4, 2016

Hm, ok, maybe Diagnostics cannot be released because xproj+project.json doesn't yet support embedding native files... argh... It seems we will have to wait for dotnet/cli :(

@mattwarren
Contributor

Hm, ok, maybe Diagnostics cannot be released because xproj+project.json doesn't yet support embedding native files... argh... It seems we will have to wait for dotnet/cli :(

Hmm, I didn't realise that. @adamsitnik or @AndreyAkinshin, does this block us from releasing a BenchmarkDotNet.Diagnostics nuget package?

@mattwarren
Contributor

@xoofx are you happy with the workaround you have, i.e. using a custom Config?

Or would you want BenchmarkDotNet to be more stable/consistent across runs?

@xoofx
Member Author

xoofx commented Apr 5, 2016

Hmm, I didn't realise that. @adamsitnik or @AndreyAkinshin, does this block us from releasing a BenchmarkDotNet.Diagnostics nuget package?

Actually, I was able to ship native DLLs via packInclude in project.json, along with an msbuild .targets file for net20-net4x+ platforms, so that is finally possible at least.
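The packInclude approach mentioned here maps destination folders in the package to source globs. The fragment below is a hypothetical illustration only; the folder names and glob patterns are placeholders, not taken from an actual project:

```json
{
  "packInclude": {
    "build/": "msbuild/*.targets",
    "runtimes/win/native/": "native/*.dll"
  }
}
```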

Or would you want BenchmarkDotNet to be more stable/consistent across runs?

Hm, actually, no: as I said above, IterationTime is not working, so I still had to patch BenchmarkDotNet.
At the very least, when using GCDiagnoser, it is only useful to run in a stable mode; otherwise the figures are completely useless.
I'm wondering if it is really good to have this adaptive iteration count computed per method... I would expect the first run to return its adaptive result and the other measured methods to use the exact same iteration count. But I understand that in scenarios where figures change a lot between methods, that would not be ideal (if a very fast method is used as the baseline for the count, a much slower method would take much longer to run)... so yeah, not sure how to deal with this.

@mattwarren
Contributor

Hm, actually, no: as I said above, IterationTime is not working, so I still had to patch BenchmarkDotNet.

IterationTime is in milliseconds, so you might need to increase it (depending on how long your benchmark lasts).
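Under that reading, a larger time budget would be set on the Job like this (a sketch; 512 ms is an arbitrary illustrative value, and Throughput mode is assumed since that is where IterationTime is consulted):

```csharp
// Sketch: IterationTime is a per-iteration time budget in milliseconds,
// so a larger value lets the pilot phase pick a bigger invoke count.
// 512 is an arbitrary illustrative value, not a recommendation.
config.Add(new Job { Mode = Mode.Throughput, IterationTime = 512, LaunchCount = 2, TargetCount = 10 });
```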

@xoofx
Member Author

xoofx commented Apr 5, 2016

Ah, ok 😅, but how does that translate to a fixed count then? I'm not sure I understand the meaning of this parameter...

@xoofx
Member Author

xoofx commented May 25, 2016

Back on the subject: is there any way to run the same loop count for all benchmarks (I guess SingleRun mode, with no setup/pilot phase) to get a predictable test with MemoryDiagnoser?

I tried

var config = ManualConfig.Create(DefaultConfig.Instance);
var gcDiagnoser = new MemoryDiagnoser();
config.Add(new Job { Mode = Mode.SingleRun, LaunchCount = 2, WarmupCount = 2, IterationTime = 1024, TargetCount = 10 });
config.Add(gcDiagnoser);

But it doesn't work...

Very annoying, as so far I can't get reliable tests with the GC. If you have a workaround that doesn't require modifying the code, that would help, thanks!

@xoofx
Member Author

xoofx commented May 25, 2016

For reference, the patch I used to get stable results

@mattwarren
Contributor

@AndreyAkinshin, what are your thoughts on this, what's a safe way to make the # of Ops more consistent across runs? I don't want to break the core timings to make diagnostics work!

As a test I added the following line of code:

logger.WriteLineError($"// Run, Diagnostic - {allRuns.Sum(r => r.Operations):N0} Ops in total, across {allRuns.Count:N0} Runs");

And across 3 runs I got the following:

// Run, Diagnostic - 2,949,116 Ops in total, across 58 Runs
         Method | ArraySize |    Median |    StdDev |    Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
--------------- |---------- |---------- |---------- |--------- |------ |------ |------------------- |
      GetSquare |      5000 | 3.5959 us | 0.0267 us | 7,085.00 |     - |     - |           8,872.17 |
 GetSquareStack |      5000 | 3.5182 us | 0.0235 us |        - |     - |     - |               0.44 |

// Run, Diagnostic - 2,818,044 Ops in total, across 56 Runs
         Method | ArraySize |    Median |    StdDev |    Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
--------------- |---------- |---------- |---------- |--------- |------ |------ |------------------- |
      GetSquare |      5000 | 3.5873 us | 0.0274 us | 7,567.04 |     - |     - |           9,254.91 |
 GetSquareStack |      5000 | 3.5191 us | 0.0253 us |        - |     - |     - |               0.46 |

// Run, Diagnostic - 3,538,940 Ops in total, across 67 Runs
         Method | ArraySize |    Median |    StdDev |    Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
--------------- |---------- |---------- |---------- |--------- |------ |------ |------------------- |
      GetSquare |      5000 | 4.0673 us | 0.3156 us | 9,702.31 |     - |     - |           9,448.04 |
 GetSquareStack |      5000 | 3.4928 us | 0.0117 us |        - |     - |     - |               0.39 |

This is the problem, the total number of Ops/Runs changes, which in turn means the Bytes Allocated/Op also varies.

Currently @xoofx uses the config below, plus this patch

var config = ManualConfig.Create(DefaultConfig.Instance);
var gcDiagnoser = new MemoryDiagnoser();
config.Add(new Job { Mode = Mode.SingleRun, LaunchCount = 2, WarmupCount = 2, TargetCount = 10 });
config.Add(gcDiagnoser);

What's the best way to get this behaviour into BenchmarkDotNet reliably? What if we explicitly set long invokeCount = 64, but only for Diagnostic runs? We already discard the timings from these runs anyway.
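The proposal amounts to something like the following sketch (hedged, not actual BenchmarkDotNet code; `isDiagnosticRun` and `pilotStage.ChooseInvokeCount` are hypothetical names):

```csharp
// Hedged sketch of the proposal (not actual BenchmarkDotNet code):
// pin the invoke count for diagnostic runs only, so allocation numbers
// are computed over a fixed, deterministic number of operations, while
// normal timing runs keep the adaptive pilot behaviour.
long invokeCount = isDiagnosticRun
    ? 64                                  // fixed op count for diagnostics
    : pilotStage.ChooseInvokeCount();     // adaptive, as today (hypothetical name)
```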

@redknightlois
Contributor

What you can do there is normalize everything. The Bytes Allocated/Op should be similar across runs (there is no reason for it to change), so normalization can give you a single value that is the average of all of them.
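The averaging idea can be sketched like this (illustrative only, not library code; the sample values are the GetSquare figures from the three runs quoted above):

```csharp
using System.Linq;

// Sketch of the averaging idea: take the per-run Bytes Allocated/Op values
// and report their mean, smoothing out run-to-run variation in op counts.
double[] bytesPerOpPerRun = { 8872.17, 9254.91, 9448.04 }; // from the runs above
double averaged = bytesPerOpPerRun.Average();              // ≈ 9,191.71
```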

@mattwarren
Contributor

I'm not sure that just normalising them will help here, because unless I'm mistaken about what you're proposing, we already do that?

In MemoryDiagnoser.cs we have this code:

return (result.AllocatedBytes / (double)result.TotalOperations).ToString("N2");

@adamsitnik adamsitnik added this to the v0.10.1 milestone Nov 18, 2016