GC Optimizations Round 2

dsimcha edited this page Dec 6, 2011 · 12 revisions

Time for a second round of GC optimizations, now that I've had ~9 months to think of all the ideas I missed the last time I dug into this. I decided to take new baseline benchmarks, this time with the profiling instrumentation I added a while back, since changes in DMD, druntime, my configuration, etc. could change these. These baselines were done on stock DMD 2.056. Unfortunately, it looks like the two randomized benchmarks have gotten somewhat slower for reasons I don't understand.

Also note that I've removed HugeSingle from the benchmark list because the problem it was meant to address is thoroughly solved and it would only waste space.

In interpreting these benchmarks, please note that there is sometimes substantial variance between runs and small but useful improvements may get buried in noise when looking at total benchmark time. Pay close attention to the time spent on whatever stage the optimization was meant to improve.

RandLarge:  Done 1000 iter in 8586 milliseconds.
		Total GC prep time:  156 milliseconds
		Total mark time:  1337 milliseconds
		Total sweep time:  484 milliseconds
		Total page recovery time:  16 milliseconds
		Grand total GC time:  1993 milliseconds
RandSmall:  Done 1000 iter in 2008 milliseconds.
		Total GC prep time:  205 milliseconds
		Total mark time:  581 milliseconds
		Total sweep time:  171 milliseconds
		Total page recovery time:  217 milliseconds
		Grand total GC time:  1174 milliseconds
Tree1:  43 seconds
		Total GC prep time:  78 milliseconds
		Total mark time:  5954 milliseconds
		Total sweep time:  16309 milliseconds
		Total page recovery time:  794 milliseconds
		Grand total GC time:  23135 milliseconds

First optimization: get rid of the multiple traversals of the scan-bit data structure in most cases. The old design avoids excessive recursion and stack usage on very deep heap graphs, but it does so by trading speed for space. In the new version, a limited amount of recursion is allowed before transitioning to the old algorithm.

RandLarge:  Done 1000 iter in 7940 milliseconds.
		Total GC prep time:  110 milliseconds
		Total mark time:  907 milliseconds
		Total sweep time:  264 milliseconds
		Total page recovery time:  0 milliseconds
		Grand total GC time:  1281 milliseconds
RandSmall:  Done 1000 iter in 2169 milliseconds.
		Total GC prep time:  93 milliseconds
		Total mark time:  466 milliseconds
		Total sweep time:  201 milliseconds
		Total page recovery time:  317 milliseconds
		Grand total GC time:  1077 milliseconds
Tree1:  42 seconds
		Total GC prep time:  63 milliseconds
		Total mark time:  3827 milliseconds
		Total sweep time:  17080 milliseconds
		Total page recovery time:  629 milliseconds
		Grand total GC time:  21599 milliseconds

The improvements in overall performance are small and mostly buried in noise, but the improvements in mark performance (the stage where the world has to be stopped, and therefore the most critical stage for real-world GC-bound application performance) are significant.


Next, we rewrite rt_finalize, which lives in lifetime.d and takes care of object finalization, to cut out some work that's useless when it's called from the GC. Since the RandLarge and RandSmall benchmarks don't have finalizers, any changes in these are just measurement noise. However, we get about a 10% total benchmark improvement and a 20% GC time improvement in the Tree1 benchmark, which does use finalization heavily.

RandLarge:  Done 1000 iter in 9071 milliseconds.
		Total GC prep time:  205 milliseconds
		Total mark time:  2031 milliseconds
		Total sweep time:  340 milliseconds
		Total page recovery time:  32 milliseconds
		Grand total GC time:  2608 milliseconds
RandSmall:  Done 1000 iter in 1984 milliseconds.
		Total GC prep time:  155 milliseconds
		Total mark time:  441 milliseconds
		Total sweep time:  143 milliseconds
		Total page recovery time:  200 milliseconds
		Grand total GC time:  939 milliseconds
Tree1:  38 seconds
		Total GC prep time:  142 milliseconds
		Total mark time:  4483 milliseconds
		Total sweep time:  10234 milliseconds
		Total page recovery time:  658 milliseconds
		Grand total GC time:  15517 milliseconds

This one won't matter much but it was too low-hanging to pass up: We don't need to clear the finalize bit in the sweeping code because we already called testClear() on it a few lines above in all of these cases.

RandLarge:  Done 1000 iter in 8486 milliseconds.
		Total GC prep time:  126 milliseconds
		Total mark time:  1330 milliseconds
		Total sweep time:  404 milliseconds
		Total page recovery time:  30 milliseconds
		Grand total GC time:  1890 milliseconds
RandSmall:  Done 1000 iter in 1944 milliseconds.
		Total GC prep time:  143 milliseconds
		Total mark time:  295 milliseconds
		Total sweep time:  246 milliseconds
		Total page recovery time:  251 milliseconds
		Grand total GC time:  935 milliseconds
Tree1:  36 seconds
		Total GC prep time:  94 milliseconds
		Total mark time:  4004 milliseconds
		Total sweep time:  9640 milliseconds
		Total page recovery time:  830 milliseconds
		Grand total GC time:  14568 milliseconds

In Gcx.setBits() a bunch of bit twiddling is done multiple times. Getting rid of the redundant work should have no effect on deallocation speed but might improve allocation speed a little.

RandLarge:  Done 1000 iter in 8629 milliseconds.
		Total GC prep time:  158 milliseconds
		Total mark time:  1544 milliseconds
		Total sweep time:  456 milliseconds
		Total page recovery time:  0 milliseconds
		Grand total GC time:  2158 milliseconds
RandSmall:  Done 1000 iter in 2038 milliseconds.
		Total GC prep time:  94 milliseconds
		Total mark time:  436 milliseconds
		Total sweep time:  266 milliseconds
		Total page recovery time:  219 milliseconds
		Grand total GC time:  1015 milliseconds
Tree1:  35 seconds
		Total GC prep time:  48 milliseconds
		Total mark time:  3823 milliseconds
		Total sweep time:  9877 milliseconds
		Total page recovery time:  766 milliseconds
		Grand total GC time:  14514 milliseconds

I had some insight into why some of the benchmarks magically got slower. It might be related to the NO_INTERIOR bit I added a few months back. This bit is a rarely used hack to get around false-pointer issues: it instructs the GC to ignore pointers to the interior of a block of memory and only honor those pointing to the base of the block. It is most useful in D's associative arrays, which were previously a false-pointer magnet. The fix here is to change from eager to lazy instantiation of the nointerior GCBits. This should only affect the large-allocation benchmark, since nointerior was only implemented for large object pools.

RandLarge:  Done 1000 iter in 8244 milliseconds.
		Total GC prep time:  78 milliseconds
		Total mark time:  1381 milliseconds
		Total sweep time:  423 milliseconds
		Total page recovery time:  15 milliseconds
		Grand total GC time:  1897 milliseconds
RandSmall:  Done 1000 iter in 1948 milliseconds.
		Total GC prep time:  111 milliseconds
		Total mark time:  388 milliseconds
		Total sweep time:  233 milliseconds
		Total page recovery time:  252 milliseconds
		Grand total GC time:  984 milliseconds
Tree1:  35 seconds
		Total GC prep time:  96 milliseconds
		Total mark time:  4241 milliseconds
		Total sweep time:  9869 milliseconds
		Total page recovery time:  607 milliseconds
		Grand total GC time:  14813 milliseconds

Apply the same tricks that we applied to setBits() to clrBits():

RandLarge:  Done 1000 iter in 8332 milliseconds.
		Total GC prep time:  187 milliseconds
		Total mark time:  1034 milliseconds
		Total sweep time:  393 milliseconds
		Total page recovery time:  16 milliseconds
		Grand total GC time:  1630 milliseconds
RandSmall:  Done 1000 iter in 1887 milliseconds.
		Total GC prep time:  155 milliseconds
		Total mark time:  533 milliseconds
		Total sweep time:  93 milliseconds
		Total page recovery time:  246 milliseconds
		Grand total GC time:  1027 milliseconds
Tree1:  35 seconds
		Total GC prep time:  155 milliseconds
		Total mark time:  3955 milliseconds
		Total sweep time:  9751 milliseconds
		Total page recovery time:  842 milliseconds
		Grand total GC time:  14703 milliseconds

When sweeping small objects, a lot of time is spent clearing bits. This optimization batches these clears together to take advantage of word-level bitwise parallelism; it was too complex to implement for large objects. Again, the improvements in overall benchmark time are buried in noise, but notice the improvements in sweep times, especially for the Tree1 benchmark. (Edit: these benchmarks are after a bugfix and a few tweaks.)

RandLarge:  Done 1000 iter in 8822 milliseconds.
		Total GC prep time:  158 milliseconds
		Total mark time:  1395 milliseconds
		Total sweep time:  391 milliseconds
		Total page recovery time:  0 milliseconds
		Grand total GC time:  1944 milliseconds
RandSmall:  Done 1000 iter in 1974 milliseconds.
		Total GC prep time:  140 milliseconds
		Total mark time:  297 milliseconds
		Total sweep time:  282 milliseconds
		Total page recovery time:  360 milliseconds
		Grand total GC time:  1079 milliseconds
Tree1:  36 seconds
		Total GC prep time:  141 milliseconds
		Total mark time:  3927 milliseconds
		Total sweep time:  6933 milliseconds
		Total page recovery time:  675 milliseconds
		Grand total GC time:  11676 milliseconds