Round 3

For the third round of GC optimizations I'm going to focus on the global lock that's required for malloc(). This lock is bad for obvious reasons when it's contended, but the overhead is substantial even when it's not. Here are some baseline benchmarks for the GC that ships with the DMD 2.058 beta. These are slightly better than the ones in GC Optimization Round 2, probably because of improved inlining in 2.058.

RandLarge:  Done 1000 iter in 8431 milliseconds.
		Total GC prep time:  107 milliseconds
		Total mark time:  1470 milliseconds
		Total sweep time:  453 milliseconds
		Total page recovery time:  0 milliseconds
		Grand total GC time:  2030 milliseconds
RandSmall:  Done 1000 iter in 1991 milliseconds.
		Total GC prep time:  218 milliseconds
		Total mark time:  315 milliseconds
		Total sweep time:  217 milliseconds
		Total page recovery time:  265 milliseconds
		Grand total GC time:  1015 milliseconds
Tree1:  32 seconds
		Total GC prep time:  109 milliseconds
		Total mark time:  3766 milliseconds
		Total sweep time:  6705 milliseconds
		Total page recovery time:  563 milliseconds
		Grand total GC time:  11143 milliseconds

If I comment out the lock acquisition for malloc(), which I can do because the benchmarks are single-threaded, we see a substantial performance boost. Look at the total runtime for Tree1. It's about 9 seconds shorter. That means that the Tree1 benchmark is spending about 28% of its time just acquiring and releasing the global GC lock.

RandLarge:  Done 1000 iter in 8291 milliseconds.
		Total GC prep time:  31 milliseconds
		Total mark time:  1239 milliseconds
		Total sweep time:  393 milliseconds
		Total page recovery time:  31 milliseconds
		Grand total GC time:  1694 milliseconds
RandSmall:  Done 1000 iter in 1726 milliseconds.
		Total GC prep time:  156 milliseconds
		Total mark time:  438 milliseconds
		Total sweep time:  171 milliseconds
		Total page recovery time:  279 milliseconds
		Grand total GC time:  1044 milliseconds
Tree1:  23 seconds
		Total GC prep time:  156 milliseconds
		Total mark time:  3832 milliseconds
		Total sweep time:  6419 milliseconds
		Total page recovery time:  565 milliseconds
		Grand total GC time:  10972 milliseconds

Long term, the global lock needs to go, at least for most small allocations. To summarize how small (<1 page) allocations currently work: there is a free list for each power-of-two bucket size. When the free list is populated, an allocation just pops a node off of it and returns. When the free list is empty, a new page is allocated to that bucket size from the pool of free pages; it is divided into blocks of the correct size, and these are added to the free list.
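Here's a minimal sketch of that scheme. The names (smallAlloc, freeLists, bucketSize) are mine, malloc stands in for taking a page from the GC's page pool, and the real druntime code differs in detail; this only illustrates the idea.

```d
import core.stdc.stdlib : malloc;

enum PAGESIZE = 4096;

// Free-list node; the link is stored inside the free block itself.
struct List { List* next; }

// One free list per power-of-two bucket size (16, 32, ..., 2048 bytes here).
List*[8] freeLists;

size_t bucketSize(size_t bucket) { return cast(size_t)(16 << bucket); }

void* smallAlloc(size_t bucket)
{
    // Fast path: pop a block off this bucket's free list.
    if (auto p = freeLists[bucket])
    {
        freeLists[bucket] = p.next;
        return p;
    }

    // Slow path: take a fresh page (malloc stands in for the GC's page
    // pool), carve it into blocks, and thread all but the first onto
    // the free list.
    auto page = cast(ubyte*) malloc(PAGESIZE);
    immutable size = bucketSize(bucket);
    for (size_t off = size; off + size <= PAGESIZE; off += size)
    {
        auto node = cast(List*)(page + off);
        node.next = freeLists[bucket];
        freeLists[bucket] = node;
    }
    return page;   // the first block of the freshly carved page
}
```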

My tentative plan is to make these free lists thread-local and make each page owned by a single thread. The global lock would only need to be acquired when allocating a new page. For use cases like building trees, which use lots of very small allocations, this would only be a small fraction of the time. This is more difficult in practice than it sounds, though. To make it work, you either need to allow sweeping to run concurrently with allocating and allow threads to be paused for marking while in the middle of allocating, or you need to guarantee mutual exclusion of these processes. Read-write locks seem at first glance like the ideal thing to use here. If allocating from a thread-local free list, you would acquire the reader lock. If collecting, you would acquire the writer lock. In practice, read-write locks have substantially more overhead than vanilla mutexes, so they won't work. It seems like I have to make sweeping run concurrently with allocating and guarantee that a thread can be paused for marking while in the middle of allocating. I only have a vague idea of how to make this work and needless to say, it's a messy problem.
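For illustration only, here is roughly what the fast path would look like with thread-local free lists. Module-level variables are thread-local by default in D, so per-thread lists fall out naturally. The names are hypothetical, malloc again stands in for the page pool, and the sketch deliberately ignores the hard part described above, namely how marking and sweeping coordinate with these per-thread lists.

```d
import core.stdc.stdlib : malloc;
import core.sync.mutex : Mutex;

enum PAGESIZE = 4096;
struct List { List* next; }

// Module-level variables are thread-local by default in D, so simply not
// marking the free lists __gshared gives every thread its own set.
List*[8] tlsFreeLists;

// The page pool stays shared, guarded by a single global lock
// (hypothetical name).
__gshared Mutex pageLock;

shared static this() { pageLock = new Mutex(); }

void* smallAllocTLS(size_t bucket, size_t blockSize)
{
    // Fast path: the list belongs to this thread, so no lock is taken.
    if (auto p = tlsFreeLists[bucket])
    {
        tlsFreeLists[bucket] = p.next;
        return p;
    }

    // Slow path: the global lock is held only while claiming a new page,
    // which this thread then owns outright.
    pageLock.lock();
    auto page = cast(ubyte*) malloc(PAGESIZE);   // stand-in for the page pool
    pageLock.unlock();

    // Carve the page up exactly as before, onto the thread-local list.
    for (size_t off = blockSize; off + blockSize <= PAGESIZE; off += blockSize)
    {
        auto node = cast(List*)(page + off);
        node.next = tlsFreeLists[bucket];
        tlsFreeLists[bucket] = node;
    }
    return page;
}
```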

As a temporary and easier-to-implement solution, I looked into reducing the overhead of acquiring the global lock. Right now the lock is actually a ClassInfo object, which can be synchronized on. This is horribly inefficient because the compiler inserts code to check whether the monitor has been instantiated (it's instantiated lazily) and then goes through an extra layer of indirection to actually acquire the lock. The alternative is to use core.sync.mutex.Mutex, which didn't exist when the current code was written. Additionally, I created a final subclass of it to de-virtualize the calls to lock() and unlock(). This gets rid of a bunch of extra checks and indirection. A microbenchmark of simply locking and unlocking both flavors of lock in a loop indicates that the Mutex-based lock is about 40% faster.
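A sketch of the idea, with hypothetical names (the actual change in druntime may look different):

```d
import core.sync.mutex : Mutex;

// A final subclass of core.sync.mutex.Mutex, so that lock() and unlock()
// calls made through the GCMutex static type can be devirtualized and the
// lazy-monitor check that synchronizing on a ClassInfo incurs is avoided.
final class GCMutex : Mutex {}

__gshared GCMutex gcLock;

shared static this() { gcLock = new GCMutex(); }

// Usage in the allocator entry points: plain, direct calls.
void* lockedAlloc(size_t size)
{
    gcLock.lock();
    scope (exit) gcLock.unlock();
    return doAlloc(size);   // hypothetical: the actual allocation work
}

void* doAlloc(size_t size) { return null; }   // placeholder
```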

Anyhow, here are the GC benchmarks with this optimization enabled. The ~3 second speedup on Tree1 is reproducible, though the speedup on RandLarge is not and is likely noise.

RandLarge:  Done 1000 iter in 7679 milliseconds.
		Total GC prep time:  94 milliseconds
		Total mark time:  908 milliseconds
		Total sweep time:  390 milliseconds
		Total page recovery time:  30 milliseconds
		Grand total GC time:  1422 milliseconds
RandSmall:  Done 1000 iter in 1901 milliseconds.
		Total GC prep time:  140 milliseconds
		Total mark time:  485 milliseconds
		Total sweep time:  252 milliseconds
		Total page recovery time:  279 milliseconds
		Grand total GC time:  1156 milliseconds
Tree1:  29 seconds
		Total GC prep time:  123 milliseconds
		Total mark time:  3751 milliseconds
		Total sweep time:  6592 milliseconds
		Total page recovery time:  671 milliseconds
		Grand total GC time:  11137 milliseconds