Druntime GC Optimization Fork


This is a temporary fork of druntime for testing garbage collector optimizations. I've created a set of benchmarks, located in the gcBench directory.

Please note that each benchmark was run only once per patch, so the timings are subject to a small amount of random noise; small differences mean essentially nothing.

A Tree2 benchmark also exists, but it randomly runs in either 12 or 0 seconds no matter what patches are applied, for reasons I don't understand (see Bugzilla 5650). Therefore, I don't include this benchmark on the wiki.

Here are the baseline timings before any modifications:

SingleHuge: Collected a 1000 megabyte heap in 197 milliseconds.

LargeRand: Done 1000 iter in 41005 milliseconds.

SmallRand: Done 1000 iter in 4745 milliseconds.

Tree1: 62 seconds


The first changeset was a batch of first-order large-allocation optimizations. These really should have been several patches instead of one, but I made them before I realized the value of using Git and documenting each patch separately. They consist of: keeping GCBits per page instead of per 16 bytes in large-allocation pools; storing, rather than recomputing, each B_PAGEPLUS page's offset from its B_PAGE page; and having each pool track how many free pages it has.
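To make the shape of these changes concrete, here is a minimal sketch in D, assuming illustrative names (Pool, pagetable, bPageOffsets, freepages) and a 4 KB page; it mirrors the description above rather than the actual druntime code:

```d
enum PAGESIZE = 4096;
enum Bins : ubyte { B_FREE, B_PAGE, B_PAGEPLUS }

struct Pool
{
    // Mark bits are now kept per page for large-allocation pools rather
    // than per 16-byte granule, shrinking those bitmaps by a factor of 256.
    Bins[] pagetable;    // one bin value per page
    uint[] bPageOffsets; // B_PAGEPLUS pages: stored distance back to the B_PAGE page
    size_t freepages;    // cached count of free pages in this pool

    // Record a large allocation of n pages starting at page `pagenum`.
    void setLargeAlloc(size_t pagenum, size_t n)
    {
        pagetable[pagenum] = Bins.B_PAGE;
        foreach (i; 1 .. n)
        {
            pagetable[pagenum + i] = Bins.B_PAGEPLUS;
            bPageOffsets[pagenum + i] = cast(uint) i; // stored once, never recomputed
        }
        freepages -= n;
    }
}
```

Here are the timings after these first-order large-allocation optimizations: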

SingleHuge: Collected a 1000 megabyte heap in 0 milliseconds.

LargeRand: Done 1000 iter in 18232 milliseconds.

SmallRand: Done 1000 iter in 2904 milliseconds.

Tree1: 51 seconds


The next patch improved the search for large allocation blocks by using the newly stored B_PAGE offset information.
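A hedged sketch of how stored offsets can speed up that search, reusing the Pool sketch above; it additionally assumes the head B_PAGE entry records the allocation's length in pages, so the scan can leap over whole allocations at once (that detail is my assumption, not necessarily the literal patch):

```d
// Search `pool` for a run of `n` consecutive free pages.
// Returns the starting page index, or size_t.max if none exists.
size_t searchFreeRun(ref Pool pool, size_t n)
{
    size_t i = 0;
    while (i + n <= pool.pagetable.length)
    {
        if (pool.pagetable[i] == Bins.B_PAGE)
        {
            // Assumed: the head entry stores the run length,
            // letting us skip the entire allocation in one step.
            i += pool.bPageOffsets[i];
        }
        else if (pool.pagetable[i] == Bins.B_PAGEPLUS)
        {
            // Jump back to the head page, then over the whole allocation.
            immutable head = i - pool.bPageOffsets[i];
            i = head + pool.bPageOffsets[head];
        }
        else // B_FREE: count the run
        {
            size_t run = 1;
            while (run < n && pool.pagetable[i + run] == Bins.B_FREE)
                ++run;
            if (run == n)
                return i;
            i += run;
        }
    }
    return size_t.max;
}
```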

SingleHuge: Collected a 1000 megabyte heap in 1 milliseconds.

LargeRand: Done 1000 iter in 7324 milliseconds.

SmallRand: Done 1000 iter in 3215 milliseconds.

Tree1: 52 seconds


Next, I added a searchStart variable so the free-page search can bypass large runs of already-allocated pages within a pool.
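The idea, sketched with assumed names and reusing the Bins enum from the first sketch (the real patch's policy for updating searchStart may differ):

```d
struct HintedPool
{
    Bins[] pagetable;
    size_t searchStart; // first page index worth examining

    // Find and reserve `n` free pages; returns size_t.max on failure.
    size_t allocPages(size_t n)
    {
        for (size_t i = searchStart; i + n <= pagetable.length; )
        {
            size_t run = 0;
            while (run < n && pagetable[i + run] == Bins.B_FREE)
                ++run;
            if (run == n)
            {
                pagetable[i .. i + n] = Bins.B_PAGE; // simplified marking
                if (i == searchStart)
                    searchStart = i + n; // hop over what we just allocated
                return i;
            }
            i += run + 1; // skip past the blocking page
        }
        return size_t.max;
    }
}
```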

SingleHuge: Collected a 1000 megabyte heap in 1 milliseconds.

LargeRand: Done 1000 iter in 7766 milliseconds.

SmallRand: Done 1000 iter in 2490 milliseconds.

Tree1: 50 seconds


Next on the agenda: Get rid of freebits for large object pools, because they're not used anywhere. This should have a very small impact, but it's too trivial to pass up.

SingleHuge: Collected a 1000 megabyte heap in 0 milliseconds.

LargeRand: Done 1000 iter in 7296 milliseconds.

SmallRand: Done 1000 iter in 2354 milliseconds.

Tree1: 49 seconds


This one doesn't seem to make much of a difference in practice, but it seemed silly not to try: Precompute the bin size table using CTFE for sizes up to B_PAGE.
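For illustration, a CTFE-built table of the same flavor; the bin sizes and names here are placeholders, not the exact druntime values:

```d
enum PAGESIZE = 4096;
enum uint[] binSizes = [16, 32, 64, 128, 256, 512, 1024, 2048]; // example bins

// Maps a request size to the index of the smallest bin that fits
// (binSizes.length means "too big; needs page allocation"). As a
// module-level immutable, the initializer runs entirely at compile time.
immutable ubyte[PAGESIZE + 1] sizeToBin = buildSizeToBin();

ubyte[PAGESIZE + 1] buildSizeToBin()
{
    ubyte[PAGESIZE + 1] table;
    foreach (size; 0 .. table.length)
    {
        ubyte bin = cast(ubyte) binSizes.length;
        foreach_reverse (i, b; binSizes)
            if (size <= b)
                bin = cast(ubyte) i;
        table[size] = bin;
    }
    return table;
}

static assert(sizeToBin[17] == 1); // 17 bytes land in the 32-byte bin
```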

SingleHuge: Collected a 1000 megabyte heap in 0 milliseconds.

LargeRand: Done 1000 iter in 7215 milliseconds.

SmallRand: Done 1000 iter in 2385 milliseconds.

Tree1: 48 seconds


This one helps a little: Use bsf instead of a bit-by-bit loop to figure out how far to advance to the next block to mark.
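The technique in isolation, sketched generically rather than as the actual GCBits code:

```d
import core.bitop : bsf;

enum BITS_PER_WORD = size_t.sizeof * 8;

// Index of the first set bit at or after `start`, or size_t.max if none.
size_t nextSetBit(const(size_t)[] words, size_t start)
{
    size_t w = start / BITS_PER_WORD;
    if (w >= words.length)
        return size_t.max;

    // Mask off bits below `start`, then let bsf locate the next set bit
    // in one instruction instead of testing bits one at a time.
    auto cur = words[w] & (size_t.max << (start % BITS_PER_WORD));
    while (true)
    {
        if (cur != 0)
            return w * BITS_PER_WORD + bsf(cur);
        if (++w >= words.length)
            return size_t.max;
        cur = words[w];
    }
}
```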

SingleHuge: Collected a 1000 megabyte heap in 1 milliseconds.

LargeRand: Done 1000 iter in 7255 milliseconds.

SmallRand: Done 1000 iter in 2227 milliseconds.

Tree1: 47 seconds


Next patch: Add a Pool* entry to List and return the Pool from bigAlloc() so we can stop running findPool() on every allocation. I can't figure out why this one doesn't seem to matter, but again, it's trivial to implement and would probably show up on a benchmark that creates a sufficiently large number of pools.
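The shape of the change, sketched (List is druntime's free-list node; the field layout and helper below are illustrative):

```d
struct Pool { /* pagetable, mark bits, ... */ }

struct List
{
    List* next;
    Pool* pool; // new: back-pointer to the owning pool
}

// The allocation fast path, before and after:
Pool* poolFor(List* l)
{
    // before: return findPool(l);  // address-range search over all pools
    return l.pool;                  // after: a single load
}
```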

SingleHuge: Collected a 1000 megabyte heap in 0 milliseconds.

LargeRand: Done 1000 iter in 7629 milliseconds.

SmallRand: Done 1000 iter in 2262 milliseconds.

Tree1: 48 seconds


Get rid of a little bit of dead code (see http://lists.puremagic.com/pipermail/phobos/2011-February/004565.html).

SingleHuge: Collected a 1000 megabyte heap in 1 milliseconds.

LargeRand: Done 1000 iter in 7093 milliseconds.

SmallRand: Done 1000 iter in 2254 milliseconds.

Tree1: 46 seconds


The next patch stores pool-specific change information in the marking code:
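My reading of this change, sketched: the mark phase historically looped while a single anychanges flag was set, rescanning everything; tracking changes per pool lets a pass skip pools in which nothing new was marked. All names below are assumptions:

```d
struct MarkPool
{
    bool anyChanges; // did the previous pass newly mark anything here?
    // ... pagetable, mark/scan bits ...
}

void markLoop(MarkPool[] pools)
{
    bool again = true;
    while (again)
    {
        again = false;
        foreach (ref pool; pools)
        {
            if (!pool.anyChanges)
                continue;            // nothing new since last pass: skip
            pool.anyChanges = false;
            // ... rescan this pool; whenever an object in some pool p is
            // newly marked, set p.anyChanges = true and again = true ...
        }
    }
}
```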

SingleHuge: Collected a 1000 megabyte heap in 0 milliseconds.

LargeRand: Done 1000 iter in 7101 milliseconds.

SmallRand: Done 1000 iter in 1991 milliseconds.

Tree1: 47 seconds


One true micro-optimization: Use bit shifts instead of multiplication and division, since we divide a lot in performance-critical inner loops.
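The transformation, illustrated: when the divisor is a runtime-selected power of two (16-byte granules for small objects, 4096-byte pages for large ones), the division in the hot loop becomes a shift:

```d
size_t granuleIndex(size_t offset, bool isLargeObject)
{
    // before: return offset / (isLargeObject ? 4096 : 16);
    immutable shift = isLargeObject ? 12 : 4;
    return offset >> shift;
}
```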

SingleHuge: Collected a 1000 megabyte heap in 0 milliseconds.

LargeRand: Done 1000 iter in 7044 milliseconds.

SmallRand: Done 1000 iter in 1878 milliseconds.

Tree1: 44 seconds


Another micro-optimization: Store shiftBy to cut down on branching.
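A follow-up sketch: compute the shift once when the pool is created and store it, rather than re-deriving it with a branch on every lookup:

```d
struct ShiftedPool
{
    ubyte shiftBy; // 4 for small-object pools, 12 for large-object pools

    size_t granuleIndex(size_t offset) const
    {
        return offset >> shiftBy; // branch-free hot path
    }
}
```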

SingleHuge: Collected a 1000 megabyte heap in 0 milliseconds.

LargeRand: Done 1000 iter in 7200 milliseconds.

SmallRand: Done 1000 iter in 1852 milliseconds.

Tree1: 44 seconds