
Optimize GC mark phase implementation #1497

Merged: 2 commits into dlang:master on Feb 19, 2016

Conversation

marcioapm
Contributor

There are some low-hanging micro-optimizations for the GC.

On my test case, this gives an average 24.7% reduction in mark-phase running time. The speedup should translate well to other cases, since it simply improves throughput by minimizing branching and inlining the pool search.

This is my test program:

import core.memory;

// large prime used to derive pseudo-random allocation sizes
enum Prime = cast(uint)3147655439;

void main() {
    uint start = 1000;
    void* lptr = null;
    GC.collect();
    foreach (i; 0 .. 2_000_000) {
        // pseudo-random size in [8, 8K)
        auto size = cast(uint)(start++ * Prime) % (8 << 10);
        if (size < 8) size = 8;
        auto ptr = GC.malloc(size);
        // chain the blocks so the mark phase has pointers to follow
        if (lptr) *cast(void**)lptr = ptr;
        lptr = ptr;
    }
    lptr = null;    // clear the last live reference before the final collection
    GC.collect();
}

Timings were computed over 20 runs of this program with and without the patch, passing "--DRT-gcopt=profile:1" and recording the mark phase time.

with this patch
mean: 295.15 ms
stdev: 42.65 ms

current master
mean: 392.00 ms
stdev: 31.19 ms

@quickfur
Member

Very nice! Did you benchmark with gdc/ldc as well?

I'll run this code through some of my allocation-heavy programs later today and see if I can verify the performance improvements as well.

@marcioapm
Contributor Author

No, I haven't tried other compilers. It took me long enough to install DMD; I don't want to spend more time on that. I suppose the gains will still be there on LDC and GDC, perhaps slightly diminished, but it's unlikely any optimizer would perform these transformations on its own anyway.

@quickfur
Member

BTW, what compile command did you use? I tested with a moderately allocation-heavy benchmark (compiled with dmd -O -inline -release) and it seems the overall performance difference is slight.

Will test with another program that's more allocation-heavy...

@marcioapm
Contributor Author

I compiled dmd, druntime and phobos with make -j4 -f posix.mak, which builds the release configuration by default, I think.

It doesn't really matter how your test program was compiled: the GC profiler tracks timings for druntime code, which was itself built in the release configuration.

The timings I report are the mark phase timings from the output of the test program running with "--DRT-gcopt=profile:1" as argument.

@quickfur
Member

Ah you're right, I wasn't thinking it through.

Anyway, I'm testing with a more allocation-heavy program (it builds an associative array with about 950,000 entries), and I'm still not seeing a significant performance improvement. In fact, there's a tiny performance degradation: with the current GC, the mark time is about 396 ms (reported by --DRT-gcopt=profile:1); with your micro-optimized GC, the mark time is about 422 ms. I'm not sure what's going on here... maybe I'll have to look into the generated assembly to figure out why this is happening.

But having said that, this may be a side-effect of this particular program's memory usage pattern: it allocates a lot of AA entries but they never become garbage until the end of the program. So that probably skews the results quite a bit. I'll have to try something else with a more "typical" memory usage pattern, I suppose.

@quickfur
Member

Benchmarking with std.csv, reading a .csv file with 2.1 million records:

benchmark-current
std.csv read 2126884 records
std.csv: 22419 msecs
        Number of collections:  10
        Total GC prep time:  120 milliseconds
        Total mark time:  530 milliseconds
        Total sweep time:  571 milliseconds
        Total page recovery time:  578 milliseconds
        Max Pause Time:  306 milliseconds
        Grand total GC time:  1801 milliseconds
GC summary: 1526 MB,   10 GC 1801 ms, Pauses  650 ms <  306 ms
benchmark-microopt
std.csv read 2126884 records
std.csv: 22320 msecs
        Number of collections:  10
        Total GC prep time:  119 milliseconds
        Total mark time:  487 milliseconds
        Total sweep time:  580 milliseconds
        Total page recovery time:  580 milliseconds
        Max Pause Time:  284 milliseconds
        Grand total GC time:  1768 milliseconds
GC summary: 1526 MB,   10 GC 1768 ms, Pauses  607 ms <  284 ms

(First set of numbers are from the current GC, second set from your micro-optimized GC.) Seems that the mark time does show some improvement (~530ms to ~487ms).

@marcioapm
Contributor Author

You have to run and measure it at least 10 times and average the results to see a difference. These measurements are very sensitive to factors like OS scheduling, background I/O, and whatnot.

It is very unlikely that these changes would degrade performance.

@quickfur
Member

I did run each test program several times (not up to 10 times, though), and the numbers seem quite consistent for both cases.

In any case, I realize that the above test cases are highly biased, since even the std.csv test case basically just loads the 2 million records and exits. I'm in the middle of writing a more extensive, GC-heavy test program that will run for longer periods of time, with lots of collectible objects in the interim, in order to have a more "representative" GC use case. I'll post the results when they are ready.

@rainers
Member

rainers commented Feb 11, 2016

How does it perform with the druntime/benchmark suite?

If you have good benchmarks, please consider adding them.

I see 3 optimizations:

  • move some "constants" to stack variables (see the sketch after this list): this is good to let the optimizer know these are not changed from elsewhere. I don't think it will reserve registers for these, though.
  • move the loop check to where the stackpos actually changes: also good
  • inline findPool: not so great because of the code duplication. Let's hope that recent work by @9rnsr will enable the inliner to do it in this case, too.
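
A minimal sketch of the first two points, under illustrative names (GCFields, Range and markSketch are stand-ins, not the PR's actual gc.d code):

struct Range { void** pbot, ptop; }
struct GCFields { void* minAddr, maxAddr; }

void markSketch(ref GCFields gc, Range rng)
{
    // (1) hoist effectively-constant fields into stack variables so the
    //     optimizer can see that nothing inside the loop modifies them
    const minAddr = gc.minAddr;
    const maxAddr = gc.maxAddr;

    Range[32] stack;
    size_t stackPos = 0;

    for (;;)
    {
        for (auto p = rng.pbot; p < rng.ptop; ++p)
        {
            const ptr = *p;
            if (ptr < minAddr || ptr >= maxAddr)
                continue;                    // cheap reject, no pool lookup

            // ... findPool(ptr), test/set the mark bit, and if the object may
            // contain pointers push its extent onto `stack`; the overflow
            // check (2) belongs right there, where stackPos changes, instead
            // of being re-evaluated at the top of every iteration
        }
        if (stackPos == 0)
            break;                           // nothing left to scan
        rng = stack[--stackPos];
    }
}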

@rainers
Member

rainers commented Feb 11, 2016

> You have to run and measure it at least 10 times and average the results to see a difference. These measurements are very sensitive to factors like OS scheduling, background I/O, and whatnot.

We usually use the minimum, since the "noise" always adds to the measurement.

@quickfur
Member

Alright, here are my latest results. I wrote a program that generates random trees (depth up to 16) of nodes that contain ints and up to 4 children each, and performs random pruning and grafting of subtrees (with some probability of discarding subtrees altogether, to generate extra garbage). Each tree undergoes 1000 such random mutations, and is discarded afterwards. This is repeated 200,000 times in the outer loop to give the GC a good workout.
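
The benchmark program itself wasn't posted; a rough sketch along these lines, with guessed probabilities, seed, and structure, might look like:

import std.random;

// a node holds an int and up to 4 children
class Node
{
    int val;
    Node[4] kids;
}

// build a random tree: each child slot is filled with probability 1/4,
// down to a depth limit of 16
Node genTree(ref Random rng, int depth)
{
    auto n = new Node;
    n.val = uniform!int(rng);
    if (depth < 16)
        foreach (ref k; n.kids)
            if (uniform(0, 4, rng) == 0)
                k = genTree(rng, depth + 1);
    return n;
}

// one random mutation: descend to a random slot, then prune it, graft a
// freshly generated subtree, or build a subtree and drop it immediately
// (extra garbage)
void mutate(ref Random rng, Node root)
{
    auto n = root;
    for (int depth = 0; depth < 15; ++depth)
    {
        const i = uniform(0, 4, rng);
        auto child = n.kids[i];
        if (child is null || uniform(0, 4, rng) == 0)
        {
            const action = uniform(0, 3, rng);
            if (action == 0)
                n.kids[i] = null;                       // prune
            else if (action == 1)
                n.kids[i] = genTree(rng, depth + 1);    // graft
            else
                genTree(rng, depth + 1);                // discard at once
            return;
        }
        n = child;
    }
}

void main()
{
    auto rng = Random(42);
    foreach (iter; 0 .. 200_000)
    {
        auto root = genTree(rng, 0);
        foreach (m; 0 .. 1000)
            mutate(rng, root);
        // root becomes unreachable here, so the whole tree turns into garbage
    }
}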

A total of 10 runs were performed for each of the current GC and the micro-optimized GC.

The minimum mark time for the current GC is 156 ms, whereas the minimum mark time for the micro-optimized GC is 119 ms: roughly a 24% improvement.

Due to the random nature of the benchmark, though, these minimum times may not reflect the actual current/micro-opt performance ratio (some runs may have a lower GC workload than others). So I also took the average mark times: current GC 159.8 ms, micro-optimized GC 122.7 ms. Again, a consistent improvement of roughly 23%.

So I think this is good evidence that these optimizations are worthwhile.

@quickfur
Member

@rainers How do you run the druntime benchmarks? I did a make -f posix.mak unittest and then tried running test/profile/generated/linux/64/profile but it aborted, saying "Memory allocation failed". Same with profilegc.

@marcioapm
Contributor Author

@quickfur I am glad you reached the same conclusion as I have.
To celebrate, would you mind running your benchmark again with this new commit? :)

This one is not for free, and comes at the expense of 2 pointers per pool, but my tests show it's well worth it.

@quickfur
Member

Unfortunately this seems to have broken druntime unittests (make -f posix.mak unittest):

src/gc/pooltable.d(234): Error: cannot implicitly convert expression (pooltable.opSlice(0LU, 6LU)) of type PoolInfo[] to MockPool*[]

@marcioapm
Contributor Author

Right! I'll take care of that quickly. What does your benchmark say with this new commit?

@marcioapm
Contributor Author

PoolTable unit-tests fixed.

@rainers
Member

rainers commented Feb 11, 2016

> @rainers How do you run the druntime benchmarks?

Run rdmd runbench in the druntime/benchmark folder. runbench -h for a short help on command line options.

@quickfur
Member

Using my random trees test:

Minimum mark times: 127 ms (micro-opt); 156 ms (current).
Average mark times: 130 ms (micro-opt); 160 ms (current).

Seems the overall improvement has dropped a bit compared to the previous commit; it's closer to 20% now. Not sure how much of that is noise and the inherent randomness of the benchmark, and how much is a real difference, though.

@quickfur
Member

Did another run of 10; the numbers are pretty much the same, with ±1 ms on minimum and average times. This is slightly worse than the original commit.

@quickfur
Member

@rainers Thanks! Running the benchmarks now. Will post results when they are ready.

@quickfur
Member

Here are the benchmark results:

Current GC:

R slist            2.172 s
R concpu           1.127 s
R words            1.384 s
R huge_single      0.007 s
R rand_small       0.776 s
R conmsg           2.505 s
R vdparser         2.398 s
R rand_large       0.342 s
R tree1            0.799 s
R conalloc         0.756 s
R testgc3          2.229 s
R conappend        0.009 s
R dlist            2.237 s
R tree2            1.299 s

Micro-optimized GC:

R slist            2.127 s
R concpu           0.885 s
R words            1.240 s
R huge_single      0.007 s
R rand_small       0.753 s
R conmsg           1.637 s
R vdparser         2.385 s
R rand_large       0.343 s
R tree1            0.782 s
R conalloc         0.750 s
R testgc3          1.984 s
R conappend        0.009 s
R dlist            2.191 s
R tree2            1.222 s

@quickfur
Member

Looks like other than huge_single and conappend, which have equal performance, everything else is consistently better with this PR applied.

@marcioapm
Contributor Author

Are these with/without the second commit?
I will remove the last commit, as I can consistently confirm a slight performance degradation with it.

Thanks a lot for helping with this @quickfur :)

@quickfur
Member

The official benchmarks were run with the second commit.

@quickfur
Member

Here are the results without the second commit:

R slist            2.127 s
R concpu           0.971 s
R words            1.261 s
R huge_single      0.007 s
R rand_small       0.756 s
R conmsg           2.513 s
R vdparser         2.397 s
R rand_large       0.333 s
R tree1            0.761 s
R conalloc         0.756 s
R testgc3          1.956 s
R conappend        0.008 s
R dlist            2.200 s
R tree2            1.271 s

Seems the picture isn't quite so simple after all. Some cases show improvement, others show degradation, relative to the test with the second commit. Of course, both micro-opt versions consistently improve on or are on par with the current GC.

@marcioapm
Contributor Author

I suppose this is ready to be merged if wanted.

I will improve on it further over the weekend, if I have time. I still have some ideas that could be further quick wins, but those will go into another PR.

@quickfur
Member

Since I'm not familiar enough with the GC code to do justice to the code review (I just like running benchmarks :-P), I'll leave it to @rainers to decide whether to merge.

@rainers
Member

rainers commented Feb 12, 2016

> Looks like other than huge_single and conappend, which have equal performance, everything else is consistently better with this PR applied.

The con* tests are a bit shaky because they test concurrent allocations.

If you use runbench gcbench -- --DRT-gcopt=profile=1 you get the GC summary of the best run printed, too:

R dlist            2.969 s,    22 MB,   52 GC  435 ms, Pauses  265 ms <    5 ms
R huge_single      0.019 s,  1501 MB,    3 GC    0 ms, Pauses    0 ms <    0 ms
R rand_large       0.423 s,   165 MB,  892 GC  171 ms, Pauses   61 ms <    0 ms
R rand_small       0.688 s,    22 MB,  426 GC  203 ms, Pauses   76 ms <    0 ms
R slist            2.896 s,    22 MB,   52 GC  398 ms, Pauses  234 ms <    4 ms
R testgc3          2.427 s,   247 MB,   11 GC  588 ms, Pauses  413 ms <   83 ms
R tree1            0.875 s,    35 MB,   27 GC  285 ms, Pauses  171 ms <   12 ms
R tree2            1.691 s,     1 MB,  216 GC  130 ms, Pauses   11 ms <    0 ms
R vdparser         2.956 s,    70 MB,   15 GC  356 ms, Pauses  217 ms <   21 ms
R words            1.723 s,   396 MB,    6 GC   24 ms, Pauses   24 ms <   22 ms

Note that you cannot use rdmd in this case, because it will treat --DRT-gcopt=profile=1 as an option to itself rather than to the program being executed.

@rainers
Member

rainers commented Feb 12, 2016

I'm seeing improvements here, too. The inlining of findPool does not seem to contribute to that, though; actually, my test showed better results without inlining... I guess this is the usual pitfall of mobile processors. I'll have to retry on a desktop PC...

@rainers
Member

rainers commented Feb 12, 2016

Another note regarding findPool: it would be best if we could find a better lookup mechanism than binary search. I tried (2-dim) page tables in the past but that turned out even slower because of non-locality of accesses. Maybe we can come up with some good hash function of the address to pick the pool.

So as not to complicate work on this, it would also be good if the function were not manually inlined.
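
Purely as an illustration of the hash-the-address idea (not something this PR does; PoolStub and PoolLookup are made-up names, and a real in-GC version could not use the GC-allocated built-in AA):

struct PoolStub { void* baseAddr, topAddr; }

struct PoolLookup
{
    enum pageShift = 12;              // assuming 4 KB pool pages
    PoolStub*[size_t] byPage;         // hash of page index -> owning pool

    // register every page of a pool once, when the pool is created
    void addPool(PoolStub* pool)
    {
        for (auto a = cast(size_t) pool.baseAddr;
             a < cast(size_t) pool.topAddr; a += size_t(1) << pageShift)
            byPage[a >> pageShift] = pool;
    }

    // expected O(1) lookup instead of the binary search in findPool
    PoolStub* find(const void* p)
    {
        if (auto hit = (cast(size_t) p >> pageShift) in byPage)
            return *hit;
        return null;
    }
}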

@MartinNowak
Member

Don't waste time on findPool; it's hard to beat binary search for the small number of pools, and I've already spent too much time trying.
Also, we should get rid of pools at some point and simply use a page allocator.

else break;

if (low > high)
continue Lnext;
Member


Can we avoid the manual inlining of findPool?

Contributor Author


I would really prefer to keep it.

It might seem that it's not contributing, but it is helping. Its impact is proportional to the number of times the outer (minAddr < p < maxAddr) branch is taken, right?

It's not hard to imagine legitimate cases where this branch would be taken very frequently:
- very large pointer-based data structures, like trees or doubly linked lists
- unfortunate data sets, e.g. 64-bit integers that happen to fall in range
- both of the above combined

One could still argue that when the outer branch is taken, the binary search is not the dominant factor, so I constructed a new test case to prove my theory: for every 16 bytes allocated, the first 8 bytes contain a pointer to a random address in range. The results show that this is indeed the case. The scan is now ~6x slower, because the outer branch is now taken much more frequently, and the inlined version is about ~25% faster than the non-inlined version.

Even in my originally posted test case, which takes no care over memory contents, the difference is ~1%, which is still a pretty significant improvement, I think.
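
A sketch of such a pointer-heavy test case (illustrative only; the original wasn't posted, so the block count and seed are arbitrary):

import core.memory;
import std.random;

void main()
{
    enum blocks = 2_000_000;
    auto rng = Random(1);

    void*[] all;
    all.reserve(blocks);

    // allocate 16-byte blocks and remember the lowest/highest addresses,
    // which approximate the GC heap range
    size_t lo = size_t.max, hi = 0;
    foreach (i; 0 .. blocks)
    {
        auto p = GC.malloc(16);
        all ~= p;
        const a = cast(size_t) p;
        if (a < lo) lo = a;
        if (a > hi) hi = a;
    }

    // store a pointer to a random in-range address in the first 8 bytes of
    // each block, so the (minAddr <= p < maxAddr) branch is taken constantly
    foreach (p; all)
        *cast(void**) p = cast(void*) uniform(lo, hi, rng);

    GC.collect();
}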

Member


I mean manual inlining, not inlining in general.
We could change findPool to a static function and call findPool(pools) if that is what currently prevents inlining; there is also pragma(inline, true).

@marcioapm
Contributor Author

FreeBSD 32 is failing, but it seems unrelated to this change. Is that known to be a flaky test, or is there something I need to do?

@marcioapm
Contributor Author

It seems DMD doesn't want to inline it. I tried my version of the loop as well, with and without nothrow and pure, but nothing works. Ideas?

pragma(inline, true)
T* findPool(T)(T** pools, size_t highpool, void* p) pure nothrow
{
    // single pool: the caller is assumed to have already checked p against
    // the overall heap bounds (minAddr/maxAddr)
    if (highpool == 0)
        return pools[0];

    // binary search over the address-sorted pool table
    size_t low = 0;
    size_t high = highpool;
    while (low <= high)
    {
        size_t mid = (low + high) >> 1;
        auto pool = pools[mid];
        if (p < pool.baseAddr)
            high = mid - 1;
        else if (p >= pool.topAddr)
            low = mid + 1;
        else
            return pool;
    }
    return null;
}

@etcimon
Contributor

etcimon commented Feb 17, 2016

Maybe a string mixin would do in this case?

@ibuclaw
Member

ibuclaw commented Feb 17, 2016

It should be able to inline it. @9rnsr ?

@marcioapm
Contributor Author

I was investigating why it's not inlining and it seems that the culprits are the two nested return statements.

I am not at all familiar with the DMD code, so I might be very wrong, but even the extremely simple nextpow2() in druntime/src/aaA.d fails to inline due to this.
Turning nextpow2() into a single-statement function lets DMD inline it, but I can't really do that elegantly for the binary search.
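
For illustration only (this is not the actual aaA.d code), the kind of single-statement rewrite that appeases the inliner, per the discussion here:

import core.bitop : bsr;

// multi-statement form with an early return: the shape dmd's inliner rejects
size_t nextPow2a(size_t n) pure nothrow @nogc
{
    if (n <= 1)
        return 1;
    return size_t(1) << (bsr(n - 1) + 1);   // ignores overflow near size_t.max
}

// single-expression form: folding the branch into ?: makes it inlinable
size_t nextPow2b(size_t n) pure nothrow @nogc
{
    return n <= 1 ? 1 : size_t(1) << (bsr(n - 1) + 1);
}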

Have a look at dmd/src/inline.d line 203 to see why.
According to line 109 it seems like it can also handle exactly this pattern:

if (cond)
  return 1;
return 2;

@quickfur
Member

Sometime last year when I tested dmd's inlining, I noticed (to my dismay) that something as simple as:

if (cond)
   return a;
return b;

will prevent inlining, whereas inserting an else will make it inlineable:

if (cond)
    return a;
else
    return b;

This may have been fixed since, I don't remember now, but it does show that dmd's inliner is rather temperamental, and little things can upset it. Sometimes it helps to do trivial refactorings of the code like the above; there might be a particular combination that prods the inliner in the right direction. (Boy do I wish for the day we don't have to do this anymore...)

@MartinNowak
Member

Let's merge it as is. The inliner is crap and it doesn't make sense to optimize specifically for dmd, but the GC performance is important enough to still do it.

MartinNowak added a commit that referenced this pull request Feb 19, 2016
Optimize GC mark phase implementation
MartinNowak merged commit a438c2c into dlang:master on Feb 19, 2016