BenchmarkDotNet (arguably) slightly overcorrects for overhead #1133

Open
Zhentar opened this issue Apr 19, 2019 · 3 comments · May be fixed by #2334
Zhentar commented Apr 19, 2019

The test overhead deduction causes BDN to underreport benchmark execution times (as users are likely to interpret them). The magnitude will vary depending on hardware and the nature of the test code, but should generally be on the order of 0.5–1.0 ns.

I've noticed hints of this for a while, but only recently came to recognize what was occurring well enough to design a test that could clearly and consistently reproduce it.

Test Code

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.437 (1809/October2018Update/Redstone5)
Intel Core i5-6600K CPU 3.50GHz (Skylake), 1 CPU, 3 logical and 3 physical cores
.NET Core SDK=3.0.100-preview5-011351
  [Host]     : .NET Core 3.0.0-preview5-27617-04 (CoreCLR 4.6.27617.71, CoreFX 4.700.19.21614), 64bit RyuJIT
  DefaultJob : .NET Core 3.0.0-preview5-27617-04 (CoreCLR 4.6.27617.71, CoreFX 4.700.19.21614), 64bit RyuJIT
| Method         |      Mean |     Error |    StdDev |
|--------------- |----------:|----------:|----------:|
| OneIncrement   | 0.0000 ns | 0.0000 ns | 0.0000 ns |
| TwoIncrement   | 0.1049 ns | 0.0203 ns | 0.0180 ns |
| ThreeIncrement | 0.0771 ns | 0.0222 ns | 0.0208 ns |
| FourIncrement  | 0.3379 ns | 0.0161 ns | 0.0143 ns |
| FiveIncrement  | 0.6445 ns | 0.0265 ns | 0.0248 ns |
| SixIncrement   | 0.9659 ns | 0.0133 ns | 0.0111 ns |
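
The linked test code isn't reproduced in the issue body; below is a minimal sketch of what the increment benchmarks likely look like. The class name, field name, and exact method bodies are assumptions for illustration, not the issue author's actual code.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Hypothetical reconstruction of the increment benchmarks: each method adds one
// more dependent increment of a field, so each extra increment should add
// roughly one CPU cycle of latency.
public class IncrementBenchmarks
{
    private int _value;

    [Benchmark] public int OneIncrement()   { _value++; return _value; }
    [Benchmark] public int TwoIncrement()   { _value++; _value++; return _value; }
    [Benchmark] public int ThreeIncrement() { _value++; _value++; _value++; return _value; }
    [Benchmark] public int FourIncrement()  { _value++; _value++; _value++; _value++; return _value; }
    [Benchmark] public int FiveIncrement()  { _value++; _value++; _value++; _value++; _value++; return _value; }
    [Benchmark] public int SixIncrement()   { _value++; _value++; _value++; _value++; _value++; _value++; return _value; }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<IncrementBenchmarks>();
}
```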

One, two, and three increments are all basically the same. But past that, there's a linear 0.32 ns (roughly 1 CPU cycle) increase in execution time for each additional increment. The first three are "free" because they are able to run alongside the benchmark overhead instructions in the CPU pipeline, adding no effective latency to the test harness. The execution time doesn't increase until all of that capacity has been filled and the test flips from test-harness bound to test-code bound.

To an extent, this behavior isn't really wrong: after all, the code will likely be running on a pipelined superscalar CPU in the real world, too. But the test harness code is probably abnormally independent of the test subject, since it doesn't interact with the results at all.

I don't have any ideas about what could/should be done regarding this in general. One thing that would help would be adding an option to use calli with function pointers instead of delegates in the in-process emit toolchain; reducing the total magnitude of the benchmark overhead shrinks the space in which latency can hide.
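
For concreteness, here is a rough sketch of the difference, using C# 9 function-pointer syntax purely for illustration (the in-process emit toolchain would emit the calli directly in IL); none of the names below come from BDN's generated code, and the workload is static only so that `&Workload` is legal C#.

```csharp
using System;

public static unsafe class InvocationSketch
{
    private static int _counter;

    public static int Workload() => ++_counter;

    public static int MeasureWithDelegate(int invocationCount)
    {
        Func<int> invoke = Workload;       // each call dispatches through a delegate (callvirt)
        int result = 0;
        for (int i = 0; i < invocationCount; i++)
            result ^= invoke();
        return result;
    }

    public static int MeasureWithFunctionPointer(int invocationCount)
    {
        delegate*<int> invoke = &Workload; // each call goes through a function pointer (calli)
        int result = 0;
        for (int i = 0; i < invocationCount; i++)
            result ^= invoke();
        return result;
    }
}
```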

p.s. I think it's pretty great that BDN is so accurate that I can detect under-counting by three cycles

adamsitnik commented Oct 23, 2020

Hi @Zhentar

Thank you for the great input, and apologies for such a long delay in responding.

> One thing that would help would be adding an option to use calli with function pointers instead of delegates

We are using delegates on purpose: to prevent the [Benchmark] method from getting inlined (and from having other optimizations such as constant folding applied).

Example:

private void Demo(int invocationCount)
{
    Stopwatch stopwatch = Stopwatch.StartNew();

    int result = 0;
    Func<int> @delegate = Sample;

    // The loop body is unrolled by 8, so advance the counter by 8 to keep
    // the total number of invocations equal to invocationCount.
    for (int i = 0; i < invocationCount; i += 8)
    {
        result ^= @delegate.Invoke(); result ^= @delegate.Invoke();
        result ^= @delegate.Invoke(); result ^= @delegate.Invoke();
        result ^= @delegate.Invoke(); result ^= @delegate.Invoke();
        result ^= @delegate.Invoke(); result ^= @delegate.Invoke();
    }

    stopwatch.Stop();

    // Consuming the XOR-ed result keeps the calls from being eliminated as dead code.
    ConsumeTheResult(result);

    ReportTime(stopwatch.ElapsedTicks / invocationCount);
}

[Benchmark]
public int Sample() // some math logic

Could get optimized to:

private void Demo(int invocationCount)
{
    Stopwatch stopwatch = Stopwatch.StartNew();

    int result = Sample();

    stopwatch.Stop();

    ConsumeTheResult(result);

    ReportTime(stopwatch.ElapsedTicks / invocationCount); // a lie
}

@AndyAyersMS would switching from delegates (callvirt) to function pointers (calli) start inlining the benchmarks?

@AndyAyersMS

Calls via function pointers would not get inlined currently.


timcassell commented May 17, 2023

Delegates can get inlined, so perhaps the reasoning for doing this is outdated. Function pointers may also get inlined in the future. Maybe a separate method with NoInlining applied to it, which then calls the benchmark method directly, would be better?
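
A minimal sketch of that idea, assuming a hypothetical generated wrapper (none of these names come from BDN's actual codegen):

```csharp
using System.Runtime.CompilerServices;

public class NoInliningWrapperSketch
{
    // Stand-in for the user's [Benchmark] method.
    public int Sample() => 42;

    // Hypothetical generated wrapper: the call from the measurement loop into
    // this method cannot be inlined, so the benchmark body cannot be folded
    // into the loop, while Sample() itself may still be inlined here,
    // avoiding delegate-dispatch overhead.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public int InvokeWorkload() => Sample();
}
```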
