Constructor Time Resulting In Mixed Results #1304

Open · luckerby opened this issue Oct 31, 2019 · 6 comments
luckerby commented Oct 31, 2019

Under BDN v0.11.5, consider a class with 2 methods marked as benchmarks and one constructor that essentially just burns CPU time, without any bearing on the 2 methods, like this:

    using System;
    using System.Collections;
    using BenchmarkDotNet.Attributes;

    [DisassemblyDiagnoser(printAsm: true, printSource: true)]
    [MemoryDiagnoser]
    public class UnderTest
    {
        const int noNumbers = 10000000;  // 10 mil
        object[] sourceArray = null;

        [Benchmark]
        public void DoNothing()
        {

        }

        [Benchmark]
        public void JustDefineAnArrayList()
        {
            ArrayList arraylist = new ArrayList(noNumbers);
        }       

        public UnderTest()
        {
            // Burn CPU and allocate: populate a 16M-element object[] with boxed ints,
            // kept alive for the lifetime of the benchmark class via the sourceArray field
            Random random = new Random(1);
            sourceArray = new object[16777216];

            for (int i = 0; i < 16777216; i++)
            {
                sourceArray[i] = random.Next(10);
            }
        }
    }

If I run this with BenchmarkRunner.Run<UnderTest>() I get the results below:
[image: benchmark results with sourceArray as a class-level field]
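For reference, the runner entry point is just the standard one; a minimal sketch (the Program class name is mine):

    using BenchmarkDotNet.Running;

    public class Program
    {
        public static void Main(string[] args)
        {
            // Runs all [Benchmark] methods of UnderTest in a separate process
            BenchmarkRunner.Run<UnderTest>();
        }
    }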

Clearly, either (1) the constructor time is being taken into account, or (2) BDN knows that DoNothing is essentially an empty method, since the results for that method are OK. The 43 ms for JustDefineAnArrayList is rather high based on my experience on this particular machine, but let's continue.

Let's now simply move the declaration of sourceArray from the class level into the constructor itself. Running the benchmark again shows the time reported for the method that merely declares the ArrayList dropping roughly 10x, yet the DoNothing method is still reported at near-zero time:
[image: benchmark results with sourceArray declared inside the constructor]
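For clarity, the only change in this second variant is the constructor (a sketch; everything else stays as above, and the sourceArray field is removed):

    public UnderTest()
    {
        Random random = new Random(1);
        // sourceArray is now a local, so the populated 16M-element array
        // becomes unreachable as soon as the constructor returns
        object[] sourceArray = new object[16777216];

        for (int i = 0; i < 16777216; i++)
        {
            sourceArray[i] = random.Next(10);
        }
    }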

I don't understand why the operations belonging to the constructor - which by themselves don't operate on data needed by either benchmarked method - affect the result of one of the methods but not the other (the empty one).

The disassembled code for the 2 methods and the constructor is identical in both cases.

I'm assuming the difference is due to some dead code elimination in the "cumulative" effect of the constructor + method under benchmark (after all, from a logical standpoint sourceArray can be ignored when defined within the constructor itself), but since I only get the individual assembly code for each, I can't really tell.

I also don't really think BDN can tell that DoNothing is indeed an empty method, since it's measuring the time just as usual, coming up with 1-2 ns/op in its corresponding benchmark cycle.

Also, the GC doesn't look to be to blame here, at least judging by the number of collections: there are 6 gen2 GCs for 32 ops in the first set of results and 127 gen2 GCs for 128 ops in the second set.

The machine itself is not under load, and the results are reproducible on this, as well as on a different machine.

adamsitnik (Member) commented Oct 31, 2019

Could you please share your BDN config? It looks like you are configuring BDN to run only a single iteration, which is very bad. Performance is never a single number; you should be comparing distributions with at least 15 iterations.

[image: excerpt of the posted results]
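For reference, a single-iteration setup typically looks something like this (an illustration only; the parameter names below follow the 0.11.x SimpleJob attribute and differ in newer versions):

    // Forces a single launch, a single warmup and a single measured iteration -
    // exactly the kind of config that makes the reported numbers unreliable
    [SimpleJob(launchCount: 1, warmupCount: 1, targetCount: 1)]
    [MemoryDiagnoser]
    public class UnderTest
    {
        // ... benchmarks as above ...
    }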

adamsitnik (Member) commented Oct 31, 2019

Could you please use the default config, re-run the benchmarks and share the results?

luckerby (Author) commented Oct 31, 2019

Re-running using the default config. Results are given in the same order as the original post:

[image: default-config results, sourceArray as a class-level field]

[image: default-config results, sourceArray declared in the constructor]

@adamsitnik I was getting the same results using the default config in my initial tests, but then tried different things and dropped the number of iterations to 1, since I could replicate the results under the default config every time and also needed to save time. I should have been clearer in my initial post, sorry about that.

luckerby (Author) commented Nov 3, 2019

I did some profiling (dotTrace/PerfView) against the 2 scenarios and got a glimpse into what's actually going on.

Before diving into each scenario:

  • JustDefineAnArrayList should allocate around 38 MB on each run, equivalent to 40 mil bytes (10 mil elements x 4 bytes per reference slot on the 32-bit target platform) plus a few bytes of overhead for the internal ArrayList structures (see the quick arithmetic sketch right after this list)
  • dotTrace will mark blocking GCs in brown
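The quick arithmetic behind the ~38 MB figure (a throwaway sketch, not part of the benchmark; ArrayList(capacity) allocates an object[capacity] up front):

    using System;

    class AllocationMath
    {
        static void Main()
        {
            const long elements = 10_000_000;      // noNumbers
            const long bytesPerSlot = 4;           // reference size on a 32-bit target
            long bytes = elements * bytesPerSlot;  // 40,000,000 bytes
            // 40,000,000 / (1024 * 1024) ≈ 38.1 MB
            Console.WriteLine($"{bytes:N0} bytes ≈ {bytes / (1024.0 * 1024.0):F1} MB");
        }
    }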

The "slow" scenario first, namely the one with sourceArray defined at the class level.
BDN chooses 32 ops/iteration and starts running. A dotTrace profiling session is taken while BDN is in "// BeforeActualRun", showing an overall view of what goes on:
[image: dotTrace timeline overview for the slow scenario]

The memory allocations are highlighted, and the first one after the blocking GC completes is in fact a group of 6 allocation operations, as the tooltip explains. The blocking GC is part of what BDN itself invokes after each iteration. Let's expand the highlighted allocation and see the 6 allocations in detail:
[image: dotTrace detail of the 6 consecutive ArrayList allocations]

So there are 6 ArrayList allocations happening one after the other, in under 2 ms. Zooming out somewhere between the last and the first snapshots shows the approximate duration of each background gen2 GC that follows each group of allocations - roughly 300 ms:
[image: dotTrace view of the ~300 ms background gen2 GCs following each allocation group]

A PerfView trace is done on a separate run, during the same "// BeforeActualRun", and yields the results below. Note the LOH size and its survival rate across generations:
[image: PerfView GC stats showing LOH size and survival rate across generations]

Another view showing the duration of the GCs:
[image: PerfView view of GC durations]

The duration of the individual GCs themselves is roughly equal, regardless of whether they're induced (groups of 4 on the Main thread) or background GCs (groups of 6 on the GC thread). Summing the allocation ticks that occur between consecutive groups of induced GCs gives 32, exactly the number of ops a BDN iteration consisted of. When averaging, the long gen2 background GCs (each in excess of 300 ms) weigh in heavily, as the BDN results show.

Conclusion: there are clusters of allocations performed quickly (5-6 at a time, all in under 2 ms), each followed by a lengthy gen2 background GC, which results in the high average times seen. The duration of these GCs is most likely explained by the 16+ million element populated object[] array that sits in the LOH from the start and whose references the GC has to follow. The allocations come in "clumps" simply because the LOH budget is increased, courtesy of the same initial populated object[] array.
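To sanity-check this, a quick stand-alone sketch (separate from the benchmark; class and method names are mine) that times a blocking gen2 collection with and without the big populated array rooted:

    using System;
    using System.Diagnostics;

    class Gen2CostDemo
    {
        static object[] root;   // keeps the populated array reachable across collections

        static void Main()
        {
            MeasureGen2("nothing large rooted");

            root = new object[16777216];
            var random = new Random(1);
            for (int i = 0; i < root.Length; i++)
                root[i] = random.Next(10);      // 16M+ boxed ints the GC has to trace

            MeasureGen2("16M-element populated object[] rooted");
        }

        static void MeasureGen2(string label)
        {
            var sw = Stopwatch.StartNew();
            GC.Collect(2, GCCollectionMode.Forced, blocking: true);
            sw.Stop();
            Console.WriteLine($"{label}: blocking gen2 GC took {sw.ElapsedMilliseconds} ms");
        }
    }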

Now for the second scenario, where sourceArray is defined at the constructor level.
BDN chooses 256 ops/iteration and starts running. A dotTrace profiling session similar to the first scenario's is taken while BDN is in "// BeforeActualRun", showing an overall view of what goes on:
[image: dotTrace timeline overview for the second scenario]

This time there's only one isolated allocation, followed by a gen2 background GC (which also triggers a blocking ephemeral collection) that takes around 5 ms.

Once the 256 ops are over, BDN forces a GC, captured below as 4 distinct GCs. Note the short duration of these, as opposed to the first scenario:
[image: dotTrace view of the BDN-forced GCs at the end of the iteration]
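For context, the usual force-a-full-cleanup idiom looks like the following (a sketch of the general pattern, not BDN's actual code); a couple of induced collections like these at the end of an iteration would show up as the distinct GC events seen here:

    // Generic forced-cleanup pattern (illustrative, not BDN's source):
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();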

A PerfView trace is done on a separate run, during the same "// BeforeActualRun" (BDN crashes eventually, but works long enough to get enough data), and yields the results below.
[image: PerfView GC stats for the second scenario]

Since there are 256 ops/iteration, there's an overall majority of background gen2 GCs over the isolated ones triggered by BDN at the end of each iteration. The LOH size is close to 0, and there's nothing that gets added during the lifetime of this scenario.

Conclusion: since sourceArray never makes it out of the constructor, all that's allocated before each GC is an empty ArrayList, which internally translates to an empty object[] array. There's nothing for the GC to follow in terms of live/dead objects, so the GC is quite fast.

There are some interesting implications based on the above: if one raises the LOH budget from the very beginning, one can end up with results that are faster than they would have been without that artificial budget increase, provided the number of ops/iteration is chosen so that no GC is triggered under the "inflated" budget. For the first scenario, had the number of ops been 5, the only GC triggered would have been the one initiated by BDN, whose time isn't counted in the results. I tried testing this assumption, but I'm unsure how to control the number of ops/iteration (probably by setting unrollFactor, but I couldn't find a way to do it with attributes - see the config sketch further below). Nonetheless, a slightly altered version of the "slow" scenario proves the point: if the method declaring the ArrayList is expanded to include a Thread.Sleep(500), the results become quite interesting:
[image: benchmark results for the Thread.Sleep(500) variant]

dotTrace view over 3 consecutive iterations:
[image: dotTrace view over 3 consecutive iterations]

PerfView output during "// BeforeActualRun", showing only the BDN-induced GCs at the end of each iteration:
[image: PerfView output showing only the BDN-induced GCs at iteration end]

The reason for the 501 ms result is that BDN automatically chooses 1 op/iteration, and this time no GC occurs during the method itself, since BDN's own GC after each op makes sure we never quite reach the threshold required to trigger one.
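Regarding controlling the number of ops/iteration mentioned earlier: presumably a custom config along these lines would do it (a sketch; I haven't verified the exact fluent method names against this BDN version):

    using BenchmarkDotNet.Configs;
    using BenchmarkDotNet.Jobs;
    using BenchmarkDotNet.Running;

    public class Program
    {
        public static void Main()
        {
            // Pin the number of operations per iteration explicitly;
            // invocationCount must be divisible by unrollFactor
            var config = ManualConfig.Create(DefaultConfig.Instance)
                .With(Job.Default
                    .WithInvocationCount(5)
                    .WithUnrollFactor(1));

            BenchmarkRunner.Run<UnderTest>(config);
        }
    }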

All this being said, it doesn't look like BDN is at fault here, but the GC - contrary to what I thought in the initial post. The problem is that BDN outputs the results, and without a more-than-minimal understanding of the GC's inner workings (budgets, LOH, generations etc) some of the results displayed would look flat-out wrong, as shown above.

adamsitnik (Member) commented Nov 4, 2019

@luckerby thank you for the full explanation.

"GC's inner workings (budgets, LOH, generations etc)"

This is one of the reasons why BDN runs every benchmark in a stand-alone process - to make sure that things like gen budgets and the self-tuning nature of the GC do not affect other benchmarks.

But when it comes to things allocated in ctor/setup and the way they affect the benchmark iterations - there is very little we can do.

luckerby (Author) commented Nov 6, 2019

@adamsitnik - I was thinking of adding some "look-at-this-tricky-sample-where-you-wouldn't-expect-these-strange-results" example somewhere in the docs, so that people who benchmark and get "wrong" results because of the GC don't quickly point their finger at BDN.

A counter-argument could be that if one uses BDN, it's kind of implied that one has some familiarity with performance work and fine-tuning in general, which in turn means they should already have an idea about how the GC works in the .NET world. And even if they don't know all the nitty-gritty details, at least they're aware of the "known unknowns".

I was doing this - ironically - as part of looking into how much time the GC takes for a specific scenario involving ArrayLists, and simply needed one populated to use as a source; that's when I stumbled across this issue. And when I got to the sample involving Sleep and saw the results, it still left me baffled, with a "this can't be right" reaction. But that was only because the sample was simple and I knew what the outcome should have been. Had there been two complex methods benchmarked against each other, I would have just looked at the results and concluded "well, if BDN says it is so, it has to be so", particularly if both showed positive and comparable numbers in the gen0/1/2 BDN output, just as the issue in this thread did.
