This repository was archived by the owner on Oct 12, 2022. It is now read-only.

Conversation

rainers (Member) commented Sep 18, 2017

These are a few optimizations I recently noticed while staring at the precise GC.
Benchmarks are not very stable, but show an improvement of 1 to 5 %. The graph shows the incremental performance change from bottom to top in percent, unfortunately not in the order of the commits:

[image: grafik (benchmark graph)]

@rainers rainers changed the title Gc micro-optimizations GC micro-optimizations Sep 18, 2017
dlang-bot (Contributor)

Thanks for your pull request, @rainers!

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

ibuclaw (Member) commented Sep 18, 2017

Do we have a set of benchmark tests for the GC? How feasible would it be to walk through the full history of the GC implementation, running these tests after every change and pushing the results out to a statistics / analysis tool?

I'm not convinced that we can really accurately gauge the performance of the GC in the current way these sorts of PRs are done - usually by comparing the difference between A and B.

rainers (Member, Author) commented Sep 18, 2017

Do we have a set of benchmark tests for the GC?

They are in druntime/benchmark/gcbench

How feasible would it be to walk through the full history of the GC implementation, running these tests after every change and pushing the results out to a statistics / analysis tool?

That's what https://blog.thecybershadow.net/2015/05/05/is-d-slim-yet/ did, though not for the GC benchmarks. Unfortunately it is currently broken because the data set has grown too large.

MartinNowak (Member) commented Sep 23, 2017

I'm not convinced that we can really accurately gauge the performance of the GC in the current way these sorts of PRs are done - usually by comparing the difference between A and B.

? We do run those comparisons for every GC change, so we should keep getting better, at least regarding the GC implementation.
What would be helpful is more application benchmarks instead of micro-benchmarks. So far only the old VisualD parser fits that category.

MartinNowak (Member) left a review comment

Amazing that we can still squeeze out percents with micro-optimizations.


//if (log) debug(PRINTF) printf("\tmark %p\n", p);
if (p >= minAddr && p < maxAddr)
if (cast(size_t)(p - minAddr) < rngAddr)
Member

Would be better named memRange or memSize, it's not an address.

// local stack is full, push it to the global stack
assert(stackPos == stack.length);
toscan.push(ScanRange(p1, p2));
if (p1 + 1 < p2)
Member

That would be better added in the above loops where we break without incrementing the pointer.
In fact you can just factor this out as common code between all branches.

if (bin < B_PAGE)
{
    // ...
    biti = offsetBase >> pool.shiftBy;
    base = pool.baseAddr + offsetBase;
    top = base + binsize[bin];
}
else if (bin == B_PAGE)
{
    // ...
    if(!pointsToBase && pool.nointerior.nbits && pool.nointerior.test(biti))
        continue;
}
else
{
    // ...
}

if (!pool.mark.set(biti) && !pool.noscan.test(biti))
{
    stack[stackPos++] = ScanRange(base, top);
    if (stackPos == stack.length)
    {
        // local stack is full, push it to the global stack
        if (++p1 < p2)
            toscan.push(ScanRange(p1, p2));
        break;
    }
}

Could help a bit to better fit the normal loop body in the µop-cache.

Member Author

That would be better added in the above loops where we break without incrementing the pointer.

Incrementing could be done unconditionally, but I think having a simple common subexpression here might be better than writing back the value to p1.

In fact you can just factor this out as common code between all branches.

That way you have to calculate top unconditionally. I guess it could be slightly better to also move the calculation of base into the conditional.

Member Author

Could help a bit to better fit the normal loop body in the µop-cache.

toscan.push() increases the loop size quite a bit because ScanRange.grow() is inlined. Disabling that with pragma(inline, false) didn't have any impact on performance, though.

if (!pool.mark.set(biti) && !pool.noscan.test(biti))
{
top = base + binsize[bin];
goto LaddRange;
Member

There you go :).

// continue with last stack entry
p1 = cast(void**)base;
p2 = cast(void**)top;
goto LnextBody; // skip increment and check
Member

That's a second loop around/inside the loop, can we keep reusing the Lagain part for that?
Maybe by hoisting the next variable and setting it here next = ScanRange(base, top), then jumping to the p1 = cast(void**)next.pbot; part.
Seems a bit cleaner to me and the case of the local stack overflowing is rare enough to not worry about 2 cycles or so.

p2 = cast(void**)next.ptop;
// printf(" pop [%p..%p] (%#zx)\n", p1, p2, cast(size_t)p2 - cast(size_t)p1);
goto Lagain;
goto LnextBody;
Member

remove Lagain label


enum shiftBySmall = 4;
enum shiftByLarge = 12;
uint shiftBy; // shift count for the divisor used for determining bit indices.
Member

Nice, how about using an enum?

enum ShiftBy : ubyte
{
    small = 4,
    large = 12,
}
ShiftBy shiftBy;

size_t pn = offset / PAGESIZE;
Bins bin = cast(Bins)pool.pagetable[pn];
void* base = void;
void* top = void;
Member

Nesting loops is really just a bad habit ;).

// because it's ignored for small object pools anyhow.
auto offsetBase = offset & notbinsize[bin];
biti = offsetBase >> pool.shiftBySmall;
base = pool.baseAddr + offsetBase;
Member

Common code, less icache/uop-cache pressure :).


rainers (Member, Author) commented Sep 23, 2017

Just reran the tests, haven't seen that OOM error before.

The autotester also hangs rather consistently for Linux_32. I wasn't able to reproduce both failures locally, though. No idea what could be causing this.

dnadlinger (Contributor)

Off-topic: How does GC performance compare between DMD, GDC and LDC?

rainers (Member, Author) commented Sep 23, 2017

Off-topic: How does GC performance compare between DMD, GDC and LDC?

According to my recent tests LDC seems about 40% faster, both in overall time and GC time. I don't have a reasonable GDC version on Windows.

rainers (Member, Author) commented Sep 23, 2017

Addressed all comments but the loop changes. I'll try another simplification.

no idea why this is necessary but the dyaml test fails without it in commit f658df8
rainers (Member, Author) commented Sep 26, 2017

Finally figured out the cause of the failures, though I don't get why clearing pcache is necessary. @MartinNowak, do you remember why it needed to be included in the Lagain loop?

The mark loop is now simplified to only use forward jumps and standard backward continuation. I also noticed a few more redundant operations.

ibuclaw (Member) commented Sep 26, 2017

Finally figured out the cause of the failures, though I don't get why clearing pcache is necessary.

Just a guess from me after looking at the loop: could it happen that you have two ranges next to each other, i.e. pcache is both the end of the previous and the start of the current range?

Clearing seems reasonable anyway if you are already altering p1/p2 pointers.

MartinNowak (Member)

Finally figured out the cause of the failures, though I don't get why clearing pcache is necessary.

Hihi, changing sth. w/o understanding why doesn't count as "figuring out" ;).

@MartinNowak, do you remember why it needed to be included in the Lagain loop?

No I merely preserved the existing semantics.
MartinNowak@85f392d#diff-6f1ab0423fff9dcd084ecf9a677dc426R2391

Actually it looks confusing that we don't mark a small 16-byte bin

if (bin < B_PAGE)

just because it's on the same page (4 KB) as the previous element:

if ((cast(size_t)p & ~cast(size_t)(PAGESIZE-1)) == pcache)
    continue;

Am I misreading this?

MartinNowak (Member) commented Sep 26, 2017

Ah, it's a bit easier to see that pcache isn't set for < B_PAGE pages back when this code wasn't as regressed as it is currently:
www.dsource.org/projects/tango/browser/trunk/tango/core/rt/gc/basic/gcx.d#L2243
Nonetheless, pcache is a crappy name for the last marked (full) page.

// For the NO_INTERIOR attribute. This tracks whether
// the pointer is an interior pointer or points to the
// base address of a block.
bool pointsToBase = (base == sentinel_sub(p));
MartinNowak (Member), Sep 26, 2017

Was it necessary to remove that name?
Even dmd should be able to optimize this and deciphering base != sentinel_sub(p) isn't that trivial.

Member Author

Even dmd should be able to optimize

Unfortunately, it doesn't just use the comparison in the following if, but stores the result to a byte with sete and uses that, adding a number of gratuitous instructions.

MartinNowak (Member)

I don't see why pcache needs to be reset. What's ugly is that it's set before marking, but shouldn't be a problem.
Honestly we should stop throwing time at this crappy code. The algorithmic deficiencies are a much bigger problem and the codebase is beyond repair/maintainability.

rainers (Member, Author) commented Sep 27, 2017

Thanks @ibuclaw and @MartinNowak for looking into this.

Just a guess from me after looking at the loop: could it happen that you have two ranges next to each other, i.e. pcache is both the end of the previous and the start of the current range?

I don't see how pcache can point to two ranges at the same time.

For my own sanity: pcache points to the last GC-managed 4 KB page that was found to be referenced and lies inside a large pool, i.e. there is no other object on that page. As the object containing the page has been marked and its range has been added to the stack for scanning, there is no need to look at further pointers referencing the same page.
AFAICT that should still hold if the mark loop switches to another range. The test failures suggest it doesn't, though.

rainers (Member, Author) commented Sep 27, 2017

What's ugly is that it's set before marking, but shouldn't be a problem.

I think it's not so bad, as this shortcuts tests for pointers into the same page even if it is already marked. Otherwise it would only work for references following the first hit.

Honestly we should stop throwing time at this crappy code. The algorithmic deficiencies are a much bigger problem and the codebase is beyond repair/maintainability.

I'm not sure it is so bad. Last time I checked the GC was faster for manually managed memory than C's malloc/free on Windows (not considering the additional benefit of not having to call addRange/removeRange).

My actual motivation was improving the precise GC, though, so that it would pass your review ;-) Adding these seemingly simple optimizations only there would bias the benchmark results if they were not also applied to the standard GC.

rainers (Member, Author) commented Sep 27, 2017

Here's a graph of new benchmarks of the time spent in the GC [ms] for master, the version of the initial PR (intermediate) and the current PR (microopt):

[image: grafik (benchmark graph)]

MartinNowak (Member) commented Sep 29, 2017

I'm not sure it is so bad. Last time I checked the GC was faster for manually managed memory than C's malloc/free on Windows (not considering the additional benefit of not having to call addRange/removeRange).

Are you comparing that against dmc's libc malloc/free?

MartinNowak (Member)

BTW, I've recently started to use plots with error intervals for benchmarks; should I write a short R plot script for the runbench tool as well?

MartinNowak (Member)

My actual motivation was improving the precise GC, though, so that it would pass your review ;-) Adding these seemingly simple optimizations there only would bias the benchmark results if not applied to the standard GC, too.

The requirements for that remain straightforward, only very little performance decrease for existing programs, and only small API/maintainability commitments.

@MartinNowak MartinNowak merged commit a54f055 into dlang:master Sep 29, 2017
rainers (Member, Author) commented Sep 30, 2017

The requirements for that remain straightforward, only very little performance decrease for existing programs, and only small API/maintainability commitments.

That's why I said I should have left these micro-optimizations for the precise GC, to compensate for some performance losses ;-)

rainers (Member, Author) commented Oct 1, 2017

Are you comparing that against dmc's libc malloc/free?

That comparison was made some time ago; this is the respective comment: #739 (comment). I'm not sure whether the dmc lib was already just a wrapper for the Windows HeapAlloc functions, as with the MS libraries. The Windows functions might have improved in the meantime, too.
