fix Issue 10403 - memchr optimization for std.algorithm.find by monarchdodra · Pull Request #1492 · dlang/phobos

monarchdodra · 2013-08-18T10:33:39Z

This improves the implementation of find!(R, E) as well as find!(pred, R).

strings:

Improvements to UTF8 and UTF16 are particularly spectacular, with performance improvements up to ×20 for normal strings (read "hello world").

These performance improvements to string test that the needle fit inside a single code unit. However, even if it doesn't, I still get a rough 100% performance improvement in the general case, with or without default pred.

[u]byte[]

Implemented 10403: Same as with strings, the performance improvement is spectacular when it triggers.

Standard arrays:

I added a path for raw arrays, which uses a foreach rather than a loop. It makes little difference for dmd -inline, but I get a bit more than 100% performance improvement for non -inline, so worth it. (also, stangelly enough, gdc -inline gets a big benefit. Don't know why...)

monarchdodra · 2013-08-18T10:35:48Z

Benches.

These were done on a linux 64, with both dmd and gdc, tested with and without -inline. gdc results were mostly in line with dmd's so I did not publish them (except when different):

UTF8:
{
    find("hello world", ' ');
        dmd
            newFind:    731ms.
            oldFind:  11875ms.
        dmd -inline
            newFind:    646ms.
            oldFind:   9359ms.
        gdc
            NewFind:    808ms.
            oldFind:   8900ms.
        gdc -inline
            newFind:    817ms.
            oldFind:   8944ms.

    find("日日日本本語", '語');
        dmd
            newFind:  10336ms.
            oldFind:  16065ms.
        dmd -inline
            newFind:   5868ms.
            oldFind:  11645ms.

    find!"a==b"("hello world", ' ');
        dmd
            newFind:   3029ms.
            oldFind:   5799ms.
        dmd -inline
            newFind:   1949ms.
            oldFind:   3213ms.
}

UTF16:
{
    find("hello world"w, ' ');
        dmd
            newFind:   1321ms.
            oldFind:  12333ms.
        dmd -inline
            newFind:   1561ms.
            oldFind:   9728ms.

    find("日日日本本語"w, '\U00010000');
        dmd
            newFind:   4863ms.
            oldFind:  10508ms.
        dmd -inline
            newFind:   2613ms.
            oldFind:   8434ms.

    find!"a==b"("日日日本本語", '語');
        dmd
            newFind:   9152ms.
            oldFind:  13035ms.
        dmd -inline
            newFind:   7538ms.
            oldFind:  10094ms.
}

Arrays:
{
    find([1, 2, 3, 4, 5],  5);
        dmd
            newFind:   1899ms.
            oldFind:   4168ms.
        dmd -inline
            newFind:   1218ms.
            oldFind:   1366ms.
        dgc
            newFind:   1813ms.
            oldFind:   4074ms.
        gdc -inline
            newFind:   1830ms.
            oldFind:   4068ms.
}

monarchdodra · 2013-08-18T10:39:28Z

Implementation :

InputRange find(alias pred = "a == b", InputRange, Element)(InputRange haystack, Element needle)
if (isInputRange!InputRange &&
    is (typeof(binaryFun!pred(haystack.front, needle)) : bool))
{
    alias R = InputRange;
    alias E = Element;
    alias predFun = binaryFun!pred;
    static if (is(typeof(pred == "a == b")))
        enum isDefaultPred = pred == "a == b";
    else
        enum isDefaultPred = false;
    enum  isIntegralNeedle = isSomeChar!E || isIntegral!E || isBoolean!E;

    alias EType  = ElementType!R;

    static if (isNarrowString!R)
    {
        alias EEType = ElementEncodingType!R;
        alias UEEType = Unqual!EEType;

        //These are two special cases which can search without decoding the UTF stream.
        static if (isDefaultPred && isIntegralNeedle)
        {
            //This special case deals with UTF8 search, when the needle
            //is represented by a single code point.
            //Note: "needle <= 0x7F" properly handles sign via unsigned promotion
            static if (is(UEEType == char))
            {
                if (!__ctfe && needle <= 0x7F)
                {
                    R trustedmemchr() @trusted nothrow
                    {
                        auto ptr = memchr(haystack.ptr, needle, haystack.length);
                        return ptr ?
                             haystack[ptr - haystack.ptr .. $] :
                             haystack[$ .. $];
                    }
                    return trustedmemchr();
                }
            }

            //Ditto, but for UTF16
            static if (is(UEEType == wchar))
            {
                if (needle <= 0xD7FF || (0xE000 <= needle && needle <= 0xFFFF))
                {
                    foreach (i, ref EEType e; haystack)
                    {
                        if (e == needle)
                            return haystack[i .. $];
                    }
                    return haystack[$ .. $];
                }
            }
        }

        //Previous conditonal optimizations did not succeed. Fallback to
        //unconditional implementations
        static if (isDefaultPred)
        {
            //In case of default pred, it is faster to do string/string search.
            UEEType[is(UEEType == char) ? 4 : 2] buf;

            size_t len = encode(buf, needle);
            //TODO: Make find!(R, R) @safe
            R trustedFindRR() @trusted {return std.algorithm.find(haystack, cast(R)buf[0 .. len]);}
            return trustedFindRR();
        }
        else
        {
            //Explicit pred. Manually decode: That is the fastest implementation.
            immutable len = haystack.length;
            size_t i = 0, next = 0;
            while (next < len)
            {
                if (predFun(decode(haystack, next), needle))
                    return haystack[i .. $];
                i = next;
            }
            return haystack[$ .. $];
        }
    }
    else static if (isArray!R)
    {
        //10403 optimization
        static if (isDefaultPred && isIntegral!EType && EType.sizeof == 1 && isIntegralNeedle)
        {
            R findHelper() @trusted nothrow
            {
                EType* ptr = null;
                //Note: we use "min/max" to handle sign miss-match.
                if (min(EType.min, needle) == EType.min, needle && max(EType.max, needle) == EType.max)
                    ptr = cast(EType*) memchr(haystack.ptr, needle, haystack.length);

                return ptr ?
                    haystack[ptr - haystack.ptr .. $] :
                    haystack[$ .. $];
            }

            if (!__ctfe)
                return findHelper();
        }

        //Default implementation.
        foreach (i, ref e; haystack)
            if (predFun(e, needle))
                return haystack[i .. $];
        return haystack[$ .. $];
    }
    else
    {
        //Everything else. Walk.
        for ( ; !haystack.empty; haystack.popFront() )
        {
            if (predFun(haystack.front, needle))
                break;
        }
        return haystack;
    }
}

Implementation note: I used named functions for @trusted, as opposed to inline lambdas, as the inline lambdas seem to be conflicting with the ability to inline.

ibuclaw · 2013-08-18T14:58:36Z

I assume when you say you tested gdc with -inline I assume you mean -finline-functions. Other than that, looks good to me.

monarchdodra · 2013-08-18T16:31:55Z

I assume you mean

No, I used -inline. That said, I see no difference when adding -finline-functions. Also, I used -O, and I'm noticing that -O3 is better. I think I may be having some problems with assert: -fno-assert seems to have no effect :/ I think I botched the gdc tests... sorry.

What would be the gdc equivalent of -O -release -inline?

ibuclaw · 2013-08-18T17:24:17Z

What would be the gdc equivalent of -O -release -inline?

Simply -O2 -frelease - function inlining is turned on at -O2.

monarchdodra · 2013-08-20T07:47:19Z

Simply -O2 -frelease - function inlining is turned on at -O2.

Thanks. That does improve the performance. I'm still getting roughly the same results, in terms of relative improvements, so the performance improvements aren't worth noting compared to DMD's.

The only place where DMD and GDC differ is that I'm getting roughly the same performance with arrays on dmd for a foreach/for( ; !empty; popFront) loop (with -inline).

GDC seems to be having more problems with the for( ; !empty; popFront) loops (takes twice as much time to iterate), but on the other hand, it optimizes the foreach loop more aggressively.

In any case, it means using foreach is, at worst, a performance improvement.

monarchdodra · 2013-08-20T07:52:28Z

Also, Disclaimer:

The "old" implementation is subject to bug http://d.puremagic.com/issues/show_bug.cgi?id=10848 : inlined lambdas aren't inlined.

I reused parts of the old implementation, but fixing said bug. This, in a certain way, this makes my comparisons biased in the sense that I should have fixed the old implementation too before comparing with it.

Overall, it means that the old time for the implementations for UTF8/UTF16 without a default pred should roughly be halved.

Even with this, the new implementation is mostly still either massively better, or just as good.

quickfur · 2013-08-27T03:44:27Z

Awesome performance improvement, I like it. :) A few notes:

I'm not sure about stringing static if() if (__ctfe) if (...) all in a single line. Isn't there a better way of formatting this so that it's a little more readable?
I thought some time ago Walter wanted template parameters like R to be rewritten as something more descriptive, like InputRange, ForwardRange, etc., so the the docs are more readable. Or has that changed now that ddoc includes the signature constraints?

Other than that, LGTM.

monarchdodra · 2013-09-01T07:17:00Z

Awesome performance improvement, I like it. :)

@quickfur Thanks :)

I'm not sure about stringing static if() if (__ctfe) if (...) all in a single line. Isn't there a better way of formatting this so that it's a little more readable?

You are right. I fixed it up, and it should be more readable.

I thought some time ago Walter wanted template parameters like R to be rewritten as something more descriptive, like InputRange, ForwardRange, etc., so the the docs are more readable. Or has that changed now that ddoc includes the signature constraints?

Hum... I had though of the R vs Range debate, where the word Range, imo, buys us nothing. However, I like naming it what it is a lot (eg: ForwardRange haystack).

So I changed it accordingly.

Other than that, LGTM.

quickfur · 2013-09-02T05:08:10Z

LGTM.

quickfur · 2013-09-13T04:41:07Z

Are any of the other Phobos committers looking at this pull? It has been green for the past little while, but seems to be stuck in continually being retested as other pulls on the first page get pulled first.

monarchdodra · 2013-09-14T18:09:34Z

It has been green for the past little while, but seems to be stuck in continually being retested as other pulls on the first page get pulled first.

Well, it's been green for the past 10 days, when I re-based (except for a little window when I broke dmd with my core.bitop fix).

Do any of the reviewers wait for 10/10 green? I just check the history personally. There are too many open pull (thread failures) to wait for 10/10 green to pull, IMO.

jmdavis · 2013-09-15T18:59:42Z

Whether I wait for 10/10 green depends on the pull. If it's sufficiently complex and something else has been merged since the last time it had 10/10 green, then it risks breaking the build, whereas if it's simple enough, then odds of it breaking the build are low if it's been green on most of the systems and has been green over most of its history.

ghost · 2013-09-15T19:02:26Z

@monarchdodra: I think you're relatively safe, and you can easily make a revert pull if something goes wrong. Compare this to the mess from a few days ago when master didn't build for an entire day because someone went to sleep after pulling.

dnadlinger · 2013-10-29T15:25:35Z

std/algorithm.d

If you are actually going to use a "real" (i.e. not static) nested function anyway, is there any reason not to just go for a delegate literal?

Lambdas do not get inlined by DMD for some reason. There is a pull to enforce such optimization.
dlang/dmd#2483

Ah, I thought the issue concerned all nested functions. Thanks for the hint.

What he said. Making this a lambda murders performance. In particular: http://d.puremagic.com/issues/show_bug.cgi?id=10864

As of right now, appender is actually slower than ~= because of this.

That said I like the explicitness of the named nested function anyways. But I also think it'd be even better as a static nested function.

monarchdodra · 2013-10-31T20:19:47Z

@klickverbot : Points addressed.

fix Issue 10403 - memchr optimization for std.algorithm.find

dnadlinger · 2013-10-31T20:32:24Z

Awesome!

monarchdodra · 2013-10-31T20:48:55Z

Awesome!

TYVM for the merge :)

9rnsr · 2013-12-04T07:19:37Z

std/algorithm.d

@monarchdodra This line causes a regression 11603.
The issue will cause when needle is a 1-byte zero.

And, why comma operator is used? It would be an another bug.
if (min(EType.min, needle) == EType.min, <-- hera

This looks like an copy/paste edit botch when writing my conditions. Sorry. Thanks for fixing.

monarchdodra mentioned this pull request Aug 25, 2013

Fix bugs using fitsInCodeUnit #1517

Merged

dnadlinger reviewed Oct 29, 2013
View reviewed changes

monarchdodra added 2 commits October 31, 2013 18:08

fix Issue 10403 - memchr optimization for std.algorithm.find

67e3c32

Tweak find implementation

8372da4

dnadlinger added a commit that referenced this pull request Oct 31, 2013

Merge pull request #1492 from monarchdodra/findImprov

482d0e1

fix Issue 10403 - memchr optimization for std.algorithm.find

dnadlinger merged commit 482d0e1 into dlang:master Oct 31, 2013

9rnsr reviewed Dec 4, 2013
View reviewed changes

Uh oh!

Conversation

monarchdodra commented Aug 18, 2013

Uh oh!

monarchdodra commented Aug 18, 2013

Uh oh!

monarchdodra commented Aug 18, 2013

Uh oh!

ibuclaw commented Aug 18, 2013

Uh oh!

monarchdodra commented Aug 18, 2013

Uh oh!

ibuclaw commented Aug 18, 2013

Uh oh!

monarchdodra commented Aug 20, 2013

Uh oh!

monarchdodra commented Aug 20, 2013

Uh oh!

quickfur commented Aug 27, 2013

Uh oh!

monarchdodra commented Sep 1, 2013

Uh oh!

quickfur commented Sep 2, 2013

Uh oh!

quickfur commented Sep 13, 2013

Uh oh!

monarchdodra commented Sep 14, 2013

Uh oh!

jmdavis commented Sep 15, 2013

Uh oh!

ghost commented Sep 15, 2013

Uh oh!

dnadlinger Oct 29, 2013

Choose a reason for hiding this comment

Uh oh!

DmitryOlshansky Oct 29, 2013

Choose a reason for hiding this comment

Uh oh!

dnadlinger Oct 29, 2013

Choose a reason for hiding this comment

Uh oh!

monarchdodra Oct 29, 2013

Choose a reason for hiding this comment

Uh oh!

monarchdodra commented Oct 31, 2013

Uh oh!

dnadlinger commented Oct 31, 2013

Uh oh!

monarchdodra commented Oct 31, 2013

Uh oh!

9rnsr Dec 4, 2013

Choose a reason for hiding this comment

Uh oh!

monarchdodra Dec 4, 2013

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants