Optimize GC mark phase implementation #1497
Conversation
|
Very nice! Did you benchmark with gdc/ldc as well? I'll run this code through some of my allocation-heavy programs later today and see if I can verify the performance improvements as well. |
|
No. I haven't tried other compilers. Took me long enough to install DMD, don't want to waste time on that. I suppose the gains will still be there on LDC and GDC, perhaps slightly diminished, but it's unlikely any optimizer would pick these up. |
|
BTW, what compile command did you use? I tested with a moderately allocation-heavy benchmark (compiled with …). Will test with another program that's more allocation-heavy... |
|
I compiled dmd, druntime, and phobos with … It doesn't really matter how your test program was compiled, as the GC profiler tracks timings for druntime code, which was previously compiled in release config. The timings I report are the mark phase timings from the output of the test program running with "--DRT-gcopt=profile:1". |
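A minimal sketch of how such a measurement can be reproduced (the test program below is illustrative; only the runtime flag comes from this thread):
// bench.d: allocate heavily so the mark phase has measurable work
void main()
{
    int[][] arrays;
    foreach (i; 0 .. 100_000)
        arrays ~= new int[](64); // each array is a GC allocation
}
// Compile with dmd bench.d, then run:
//   ./bench "--DRT-gcopt=profile:1"
// druntime prints a GC profile summary, including mark phase timings, at exit.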
|
Ah, you're right, I wasn't thinking it through. Anyway, I'm testing with a more allocation-heavy program (it builds an associative array with about 950,000 entries), and I'm still not seeing a significant performance improvement. In fact, there's a tiny performance degradation: with the current GC, the mark time is about 396 ms (reported by …). But having said that, this may be a side effect of this particular program's memory usage pattern: it allocates a lot of AA entries, but they never become garbage until the end of the program. So that probably skews the results quite a bit. I'll have to try something else with a more "typical" memory usage pattern, I suppose. |
|
Benchmarking with … (first set of numbers is from the current GC, second set from your micro-optimized GC). Seems that the mark time does show some improvement (~530 ms to ~487 ms). |
|
You have to run and measure it at least 10 times and average the results to see a difference. These measurements are very sensitive to factors like OS scheduling, background I/O, and whatnot. It is very unlikely that these changes would degrade performance. |
|
I did run each test program several times (not up to 10 times, though), and the numbers seem quite consistent for both cases. In any case, I realize that the above test cases are highly biased, since even the … |
|
How does it perform with the druntime/benchmark suite? If you have good benchmarks, please consider adding them. I see 3 optimizations: …
|
We usually use the minimum, as the "noise" always adds to the measurement. |
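A small sketch of that measurement discipline in D (the run count and the workload are placeholders):
import std.algorithm.searching : minElement;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

void main()
{
    long[10] samples;
    foreach (ref s; samples)
    {
        auto sw = StopWatch(AutoStart.yes);
        // ... workload under test goes here ...
        s = sw.peek.total!"msecs";
    }
    // Scheduling and I/O noise only ever add time, so the minimum
    // is the least-contaminated estimate of the true cost.
    writefln("min: %s ms", samples[].minElement);
}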
|
Alright, here are my latest results. I wrote a program that generates random trees (depth up to 16) of nodes that contain ints and up to 4 children each, and performs random pruning and grafting of subtrees (with some probability of discarding subtrees altogether, to generate extra garbage). Each tree undergoes 1000 such random mutations and is discarded afterwards. This is repeated 200,000 times in the outer loop to give the GC a good workout.
A total of 10 runs were performed for each of the current GC and the micro-optimized GC. The minimum mark time for the current GC is 156 ms, whereas the minimum mark time for the micro-optimized GC is 119 ms: a good 20-30% improvement. Due to the random nature of the benchmark, though, these minimum times may not reflect the actual performance ratio (some runs may have a lower GC workload than others), so I also took the average mark times: current GC, 159.8 ms; micro-optimized GC, 122.7 ms. Again, a consistent 20-30% improvement. So I think this is good evidence that these optimizations are worthwhile. |
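quickfur's program isn't posted in the thread, but a rough sketch of such a benchmark might look like this (tree shape, probabilities, and subtree sizes are guesses, not the actual parameters):
import std.random : uniform;

class Node
{
    int value;
    Node[4] children; // up to 4 children; null means empty slot
}

Node makeTree(int depth)
{
    auto n = new Node;
    n.value = uniform(0, int.max);
    if (depth > 1)
        foreach (ref c; n.children)
            if (uniform(0, 4) == 0) // sparse branching; probability is a guess
                c = makeTree(depth - 1);
    return n;
}

void mutate(Node root)
{
    // Walk down to a random node...
    auto n = root;
    while (uniform(0, 2) == 0)
    {
        auto c = n.children[uniform(0, 4)];
        if (c is null)
            break;
        n = c;
    }
    // ...then either prune a random subtree (creating garbage)
    // or graft a freshly built one in its place.
    n.children[uniform(0, 4)] = uniform(0, 4) == 0 ? null : makeTree(uniform(1, 6));
}

void main()
{
    foreach (i; 0 .. 200_000) // outer loop per the description above
    {
        auto tree = makeTree(16); // depth up to 16
        foreach (j; 0 .. 1_000) // 1000 random mutations per tree
            mutate(tree);
        // the whole tree becomes garbage when it goes out of scope
    }
}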
|
@rainers How do you run the druntime benchmarks? I did a … |
|
@quickfur I am glad you reached the same conclusion as I have. This one is not for free, and comes at the expense of 2 pointers per pool, but my tests show it's well worth it. |
|
Unfortunately this seems to have broken the druntime unittests (…). |
|
Right! I'll take care of that quickly. What does your benchmark say with this new commit? |
|
PoolTable unit-tests fixed. |
Run … |
|
Using my random trees test: minimum mark times are 127 ms (micro-opt) vs. 156 ms (current). Seems the overall improvement has dropped a bit compared to the previous commit; it's closer to 20% now. Not sure how much can be attributed to noise and the inherent randomness of the benchmark, and how much is an actual difference, though. |
|
Did another run of 10; the numbers are pretty much the same, with ±1 ms on minimum and average times. This is slightly worse than the original commit. |
|
@rainers Thanks! Running the benchmarks now. Will post results when they are ready. |
|
Here are the benchmark results. Current GC: … Micro-optimized GC: … |
|
Looks like other than … |
|
Are these with/without the second commit? Thanks a lot for helping with this @quickfur :) |
|
The official benchmarks were run with the second commit. |
(force-pushed from a8f92e4 to d164b0f)
|
Here are the results without the second commit: … Seems the picture isn't quite so simple after all: some cases show improvement, others show degradation, relative to the test with the second commit. Of course, both micro-opt versions consistently improve on, or are on par with, the current GC. |
|
I suppose this is ready to be merged if wanted. I will improve on it further over the weekend, if I have time. I still have some ideas that could yield more quick wins, but those will be another PR. |
|
Since I'm not familiar enough with the GC code to do justice to the code review (I just like running benchmarks :-P), I'll leave it to @rainers to decide whether to merge. |
The con* tests are a bit shaky because they test concurrent allocations. If you use … Note that you cannot use rdmd in this case, because it will consider …
|
I'm seeing improvements here, too. The inlining of findPool does not seem to contribute to that, though; actually, my test showed better results without inlining... I guess this is the usual pitfall of mobile processors. I'll have to retry on a desktop PC... |
|
Another note regarding findPool: it would be best if we could find a better lookup mechanism than binary search. I tried (2-dimensional) page tables in the past, but that turned out even slower because of the non-locality of accesses. Maybe we can come up with a good hash function of the address to pick the pool. To avoid complicating work on this, it would also be good if the function were not manually inlined. |
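One possible direction, sketched only to make the idea concrete: a hashed, direct-mapped cache in front of the existing binary search. Pool, its fields, and all constants here are stand-ins, not druntime's actual types:
struct Pool { void* baseAddr; void* topAddr; }

enum cacheBits = 10;
enum chunkShift = 20; // assumes pool addresses vary in their high bits

Pool*[1 << cacheBits] poolCache; // direct-mapped; null = empty slot

size_t slotOf(const void* p) nothrow
{
    // Hash the high bits of the address into a cache slot.
    return (cast(size_t) p >> chunkShift) & ((1 << cacheBits) - 1);
}

Pool* findPool(Pool*[] pools, void* p) nothrow
{
    immutable slot = slotOf(p);
    auto cached = poolCache[slot];
    if (cached !is null && p >= cached.baseAddr && p < cached.topAddr)
        return cached; // O(1) hit, no search

    // Miss: fall back to binary search over the sorted pool table.
    size_t low = 0, high = pools.length;
    while (low < high)
    {
        immutable mid = (low + high) / 2;
        auto pool = pools[mid];
        if (p < pool.baseAddr)
            high = mid;
        else if (p >= pool.topAddr)
            low = mid + 1;
        else
        {
            poolCache[slot] = pool; // remember for next lookup
            return pool;
        }
    }
    return null;
}
Whether this beats plain binary search would depend on the hit rate and cache pressure, which is exactly the locality concern raised above.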
|
Don't waste time on findPool; it's hard to beat binary search for the small numbers of pools involved. I've already spent too much time trying. |
else break;
...
if (low > high)
    continue Lnext;
Can we avoid the marginal inlining of findPool?
I would really prefer to keep it.
It might seem that it's not contributing, but it is helping. Its impact is proportional to the number of times the outer branch (minAddr < p < maxAddr) is taken, right?
It's not hard to imagine legitimate cases where this branch would be taken very frequently:
- Very large pointer-based data structures, like trees, doubly linked lists, ...
- Unfortunate data sets, i.e. 64-bit numbers that happen to be in range
- Both of the previous combined
One could still argue that when the outer branch is taken, the binary search is not the dominant factor, so I constructed a new test case to prove my theory: for every 16 bytes allocated, the first 8 bytes contain a pointer to a random address in range. The results show that this is indeed the case: the scan is now ~6x slower, because the outer branch is taken much more frequently, and the inlined version is about 25% faster than the non-inlined version.
Even in my originally posted test case, with no regard for memory content, the difference is ~1%, which is still a pretty significant improvement, I think.
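Not the actual test program, but a sketch of the described pattern (the block size matches the description; the block count is illustrative):
import core.memory : GC;
import std.random : uniform;

void main()
{
    enum blockCount = 1_000_000;
    auto blocks = new void**[blockCount];
    foreach (i; 0 .. blockCount)
    {
        // A 16-byte GC allocation, conservatively scanned for pointers.
        blocks[i] = cast(void**) GC.malloc(16);
        // First 8 bytes: a pointer to a random earlier block, so the
        // in-range branch of the mark loop is taken on almost every word.
        *blocks[i] = i > 0 ? cast(void*) blocks[uniform(0, i)] : null;
    }
    GC.collect(); // the mark phase now chases pointer-dense memory
}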
I mean manual inlining, not inlining in general.
We could change findPool to a static function and call findPool(pools) if that's what is currently preventing inlining; there is also pragma(inline, true).
|
FreeBSD 32 is failing, but it seems unrelated to this. Is that known to be a flaky test, or is there something I must do? |
(force-pushed from 2bb7eff to 9f5b08e)
|
It seems DMD doesn't want to inline it. Tried my version of the loop as well, with and without nothrow and pure, but nothing will do. Ideas?
pragma(inline, true) T* findPool(T)(T** pools, size_t highpool, void* p) pure nothrow
{
    if (highpool == 0)
        return pools[0];
    size_t low = 0;
    size_t high = highpool;
    while (low <= high)
    {
        size_t mid = (low + high) >> 1;
        auto pool = pools[mid];
        if (p < pool.baseAddr)
            high = mid - 1;
        else if (p >= pool.topAddr)
            low = mid + 1;
        else
            return pool;
    }
    return null;
} |
|
Maybe a string mixin would do in this case? |
|
It should be able to inline it. @9rnsr ? |
|
I was investigating why it's not inlining, and it seems that the culprits are the two nested return statements. I am not at all familiar with the DMD code, so I might be very wrong, but even something this extremely simple trips it up. Have a look at:
if (cond)
    return 1;
return 2; |
|
Sometime last year, when I tested dmd's inlining, I noticed (to my dismay) that something as simple as the above will prevent inlining, whereas inserting an else allows it. This may have been fixed since, I don't remember now, but it does show that dmd's inliner is rather temperamental, and little things can upset it. Sometimes it helps to do trivial refactorings of the code like the above; there might be a particular combination that prods the inliner in the right direction. (Boy do I wish for the day we don't have to do this anymore...) |
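For concreteness, a reconstruction of the two shapes being contrasted (illustrative functions, not the code actually tested):
// Reportedly refused by dmd's inliner at the time: an early return
// followed by a fall-through return.
int pick(int x)
{
    if (x > 0)
        return 1;
    return 2;
}

// The same logic with an explicit else, which the inliner accepted.
int pickWithElse(int x)
{
    if (x > 0)
        return 1;
    else
        return 2;
}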
|
Let's merge it as is. The inliner is crap and it doesn't make sense to optimize specifically for dmd, but the GC performance is important enough to still do it. |
Optimize GC mark phase implementation
There are some low-hanging micro optimizations for the GC.
This is an average 24.7% reduction in running time for the mark phase on my test case. The speedup should translate well to other cases, as it simply improves throughput by minimizing branching and inlining the pool search.
This is my test program:
Timings were computed over 20 runs of this program, with and without the patch, passing "--DRT-gcopt=profile:1" and recording the mark phase time.

this patch
mean: 295.15 ms
stdev: 42.65

current master
mean: 392.00 ms
stdev: 31.19