This repository was archived by the owner on Oct 12, 2022. It is now read-only.

Conversation

rainers (Member) commented Sep 18, 2017

These are a few optimizations I recently noticed while staring at the precise GC.
Benchmarks are not very stable, but show an improvement of 1 to 5 %. The graph shows the incremental performance change from bottom to top in percent, unfortunately not in the order of the commits:

[image: grafik (benchmark graph)]

@rainers rainers changed the title Gc micro-optimizations GC micro-optimizations Sep 18, 2017
dlang-bot (Contributor)

Thanks for your pull request, @rainers!

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

ibuclaw (Member) commented Sep 18, 2017

Do we have a set of benchmark tests for the GC? How feasible would it be to walk through the full history of the GC implementation, running these tests after every change and pushing the results out to a statistics / analysis tool?

I'm not convinced that we can really accurately gauge the performance of the GC in the current way these sorts of PRs are done - usually by comparing the difference between A and B.

rainers (Member, Author) commented Sep 18, 2017

Do we have a set of benchmark tests for the GC?

They are in druntime/benchmark/gcbench

How feasible would it be to walk through the full history of the GC implementation, running these tests after every change and pushing the results out to a statistics / analysis tool?

That's what https://blog.thecybershadow.net/2015/05/05/is-d-slim-yet/ did, though not for the GC benchmarks. Unfortunately it is currently broken because the data set has grown too large.

MartinNowak (Member) commented Sep 23, 2017

I'm not convinced that we can really accurately gauge the performance of the GC in the current way these sorts of PRs are done - usually by comparing the difference between A and B.

? We do run those comparisons for every GC change, so we should keep getting better, at least regarding the GC implementation.
What would be helpful is more application benchmarks instead of micro-benchmarks. So far only the old VisualD parser fits that category.

MartinNowak (Member) left a review comment

Amazing that we can still squeeze out percents with micro-optimizations.


//if (log) debug(PRINTF) printf("\tmark %p\n", p);
if (p >= minAddr && p < maxAddr)
if (cast(size_t)(p - minAddr) < rngAddr)
Member

Would be better named memRange or memSize, it's not an address.

// local stack is full, push it to the global stack
assert(stackPos == stack.length);
toscan.push(ScanRange(p1, p2));
if (p1 + 1 < p2)
Member

That would be better added in the above loops where we break without incrementing the pointer.
In fact you can just factor this out as common code between all branches.

if (bin < B_PAGE)
{
    // ...
    biti = offsetBase >> pool.shiftBy;
    base = pool.baseAddr + offsetBase;
    top = base + binsize[bin];
}
else if (bin == B_PAGE)
{
    // ...
    if(!pointsToBase && pool.nointerior.nbits && pool.nointerior.test(biti))
        continue;
}
else
{
    // ...
}

if (!pool.mark.set(biti) && !pool.noscan.test(biti))
{
    stack[stackPos++] = ScanRange(base, top);
    if (stackPos == stack.length)
    {
        // local stack is full, push it to the global stack
        if (++p1 < p2)
            toscan.push(ScanRange(p1, p2));
        break;
    }
}

Could help a bit to better fit the normal loop body in the µop-cache.

Member Author

That would be better added in the above loops where we break without incrementing the pointer.

Incrementing could be done unconditionally, but I think having a simple common subexpression here might be better than writing back the value to p1.

In fact you can just factor this out as common code between all branches.

That way you have to calculate top unconditionally. I guess it could be slightly better to also move the calculation of base into the conditional.

Member Author

Could help a bit to better fit the normal loop body in the µop-cache.

toscan.push() increases the loop size quite a bit because ScanRange.grow() is inlined. Disabling that with pragma(inline, false) didn't have any impact on performance, though.

if (!pool.mark.set(biti) && !pool.noscan.test(biti))
{
top = base + binsize[bin];
goto LaddRange;
Member

There you go :).

// continue with last stack entry
p1 = cast(void**)base;
p2 = cast(void**)top;
goto LnextBody; // skip increment and check
Member

That's a second loop around/inside the loop, can we keep reusing the Lagain part for that?
Maybe by hoisting the next variable and setting it here next = ScanRange(base, top), then jumping to the p1 = cast(void**)next.pbot; part.
Seems a bit cleaner to me and the case of the local stack overflowing is rare enough to not worry about 2 cycles or so.

p2 = cast(void**)next.ptop;
// printf(" pop [%p..%p] (%#zx)\n", p1, p2, cast(size_t)p2 - cast(size_t)p1);
goto Lagain;
goto LnextBody;
Member

remove Lagain label


enum shiftBySmall = 4;
enum shiftByLarge = 12;
uint shiftBy; // shift count for the divisor used for determining bit indices.
Member

Nice, how about using an enum?

enum ShiftBy : ubyte
{
    small = 4,
    large = 12,
}
ShiftBy shiftBy;

size_t pn = offset / PAGESIZE;
Bins bin = cast(Bins)pool.pagetable[pn];
void* base = void;
void* top = void;
Member

Nesting loops is really just a bad habit ;).

// because it's ignored for small object pools anyhow.
auto offsetBase = offset & notbinsize[bin];
biti = offsetBase >> pool.shiftBySmall;
base = pool.baseAddr + offsetBase;
Member

Common code, less icache/uop-cache pressure :).


rainers (Member, Author) commented Sep 23, 2017

Just reran the tests, haven't seen that OOM error before.

The autotester also hangs rather consistently for Linux_32. I wasn't able to reproduce both failures locally, though. No idea what could be causing this.

dnadlinger (Contributor)

Off-topic: How does GC performance compare between DMD, GDC and LDC?

rainers (Member, Author) commented Sep 23, 2017

Off-topic: How does GC performance compare between DMD, GDC and LDC?

According to my recent tests LDC seems about 40% faster, both in overall time and GC time. I don't have a reasonable GDC version on Windows.

rainers (Member, Author) commented Sep 23, 2017

Addressed all comments but the loop changes. I'll try another simplification.

no idea why this is necessary but the dyaml test fails without it in commit f658df8
rainers (Member, Author) commented Sep 26, 2017

Finally figured out the cause of the failures, though I don't get why clearing pcache is necessary. @MartinNowak, do you remember why it needed to be included in the Lagain loop?

The mark loop is now simplified to only use forward jumps and standard backward continuation. I also noticed a few more redundant operations.

ibuclaw (Member) commented Sep 26, 2017

Finally figured out the cause of the failures, though I don't get why clearing pcache is necessary.

Just a guess from me after looking at the loop: could it happen that you have two ranges next to each other, i.e. pcache is both the end of the previous and the start of the current range?

Clearing seems reasonable anyway if you are already altering p1/p2 pointers.

MartinNowak (Member)

Finally figured out the cause of the failures, though I don't get why clearing pcache is necessary.

Hihi, changing sth. w/o understanding why doesn't count as "figuring out" ;).

@MartinNowak, do you remember why it needed to be included in the Lagain loop?

No I merely preserved the existing semantics.
MartinNowak@85f392d#diff-6f1ab0423fff9dcd084ecf9a677dc426R2391

Actually it looks confusing that we don't mark a small 16-byte bin

if (bin < B_PAGE)

just because it's on the same page (4 KB) as the previous element:

if ((cast(size_t)p & ~cast(size_t)(PAGESIZE-1)) == pcache)
    continue;

Am I misreading this?

MartinNowak (Member) commented Sep 26, 2017

Ah, it's a bit easier to see that pcache isn't set for < B_PAGE pages back when this code wasn't as regressed as it is currently:
www.dsource.org/projects/tango/browser/trunk/tango/core/rt/gc/basic/gcx.d#L2243
Nonetheless, pcache is a crappy name for the last marked (full) page.

// For the NO_INTERIOR attribute. This tracks whether
// the pointer is an interior pointer or points to the
// base address of a block.
bool pointsToBase = (base == sentinel_sub(p));
MartinNowak (Member), Sep 26, 2017

Was it necessary to remove that name?
Even dmd should be able to optimize this and deciphering base != sentinel_sub(p) isn't that trivial.

Member Author

Even dmd should be able to optimize

Unfortunately, it doesn't just use the comparison in the following if, but stores the result to a byte with sete and uses that, adding a number of gratuitous instructions.

MartinNowak (Member)

I don't see why pcache needs to be reset. What's ugly is that it's set before marking, but shouldn't be a problem.
Honestly we should stop throwing time at this crappy code. The algorithmic deficiencies are a much bigger problem and the codebase is beyond repair/maintainability.

rainers (Member, Author) commented Sep 27, 2017

Thanks @ibuclaw and @MartinNowak for looking into this.

Just a guess from me after looking at the loop: could it happen that you have two ranges next to each other, i.e. pcache is both the end of the previous and the start of the current range?

I don't see how pcache can point to two ranges at the same time.

For my own sanity: pcache points to the last GC-managed 4 KB page that was found to be referenced and lies inside a large pool, i.e. there is no other object on that page. As the object containing the page has been marked and its range has been added to the stack for scanning, there is no need to look at further pointers referencing the same page.
AFAICT that should still hold if the mark loop switches to another range. The test failures suggest it doesn't, though.

rainers (Member, Author) commented Sep 27, 2017

What's ugly is that it's set before marking, but shouldn't be a problem.

I think it's not so bad, as this shortcuts tests for pointers into the same page even if it is already marked. Otherwise it would only work for references following the first hit.

Honestly we should stop throwing time at this crappy code. The algorithmic deficiencies are a much bigger problem and the codebase is beyond repair/maintainability.

I'm not sure it is so bad. Last time I checked the GC was faster for manually managed memory than C's malloc/free on Windows (not considering the additional benefit of not having to call addRange/removeRange).

My actual motivation was improving the precise GC, though, so that it would pass your review ;-) Adding these seemingly simple optimizations only there would bias the benchmark results if they were not also applied to the standard GC.

rainers (Member, Author) commented Sep 27, 2017

Here's a graph of new benchmarks of the time spent in the GC [ms] for master, the version of the initial PR (intermediate) and the current PR (microopt):

[image: grafik (benchmark graph)]

MartinNowak (Member) commented Sep 29, 2017

I'm not sure it is so bad. Last time I checked the GC was faster for manually managed memory than C's malloc/free on Windows (not considering the additional benefit of not having to call addRange/removeRange).

Are you comparing that against dmc's libc malloc/free?

MartinNowak (Member)

BTW, I've recently started to use plots with error intervals for benchmarks; should I write a short R plot script for the runbench tool as well?

MartinNowak (Member)

My actual motivation was improving the precise GC, though, so that it would pass your review ;-) Adding these seemingly simple optimizations there only would bias the benchmark results if not applied to the standard GC, too.

The requirements for that remain straightforward, only very little performance decrease for existing programs, and only small API/maintainability commitments.

@MartinNowak MartinNowak merged commit a54f055 into dlang:master Sep 29, 2017
rainers (Member, Author) commented Sep 30, 2017

The requirements for that remain straightforward, only very little performance decrease for existing programs, and only small API/maintainability commitments.

That's why I said I should have left these micro-optimizations for the precise GC, to compensate for some performance losses ;-)

rainers (Member, Author) commented Oct 1, 2017

Are you comparing that against dmc's libc malloc/free?

That comparison was made some time ago; this is the respective comment: #739 (comment). I'm not sure whether the dmc lib was already just a wrapper for the Windows HeapAlloc functions, as with the MS libraries. The Windows functions might have improved in the meantime, too.
