Skip to content

fix Issue 10403 - memchr optimization for std.algorithm.find#1492

Merged
dnadlinger merged 2 commits intodlang:masterfrom
monarchdodra:findImprov
Oct 31, 2013
Merged

fix Issue 10403 - memchr optimization for std.algorithm.find#1492
dnadlinger merged 2 commits intodlang:masterfrom
monarchdodra:findImprov

Conversation

@monarchdodra
Copy link
Collaborator

This improves the implementation of find!(R, E) as well as find!(pred, R).

  • strings:

Improvements to UTF8 and UTF16 are particularly spectacular, with performance improvements up to ×20 for normal strings (read "hello world").

These performance improvements to string test that the needle fit inside a single code unit. However, even if it doesn't, I still get a rough 100% performance improvement in the general case, with or without default pred.

  • [u]byte[]

Implemented 10403: Same as with strings, the performance improvement is spectacular when it triggers.

  • Standard arrays:

I added a path for raw arrays, which uses a foreach rather than a loop. It makes little difference for dmd -inline, but I get a bit more than 100% performance improvement for non -inline, so worth it. (also, stangelly enough, gdc -inline gets a big benefit. Don't know why...)

@monarchdodra
Copy link
Collaborator Author

  • Benches.

These were done on a linux 64, with both dmd and gdc, tested with and without -inline. gdc results were mostly in line with dmd's so I did not publish them (except when different):

UTF8:
{
    find("hello world", ' ');
        dmd
            newFind:    731ms.
            oldFind:  11875ms.
        dmd -inline
            newFind:    646ms.
            oldFind:   9359ms.
        gdc
            NewFind:    808ms.
            oldFind:   8900ms.
        gdc -inline
            newFind:    817ms.
            oldFind:   8944ms.

    find("日日日本本語", '');
        dmd
            newFind:  10336ms.
            oldFind:  16065ms.
        dmd -inline
            newFind:   5868ms.
            oldFind:  11645ms.

    find!"a==b"("hello world", ' ');
        dmd
            newFind:   3029ms.
            oldFind:   5799ms.
        dmd -inline
            newFind:   1949ms.
            oldFind:   3213ms.
}

UTF16:
{
    find("hello world"w, ' ');
        dmd
            newFind:   1321ms.
            oldFind:  12333ms.
        dmd -inline
            newFind:   1561ms.
            oldFind:   9728ms.

    find("日日日本本語"w, '\U00010000');
        dmd
            newFind:   4863ms.
            oldFind:  10508ms.
        dmd -inline
            newFind:   2613ms.
            oldFind:   8434ms.

    find!"a==b"("日日日本本語", '');
        dmd
            newFind:   9152ms.
            oldFind:  13035ms.
        dmd -inline
            newFind:   7538ms.
            oldFind:  10094ms.
}

Arrays:
{
    find([1, 2, 3, 4, 5],  5);
        dmd
            newFind:   1899ms.
            oldFind:   4168ms.
        dmd -inline
            newFind:   1218ms.
            oldFind:   1366ms.
        dgc
            newFind:   1813ms.
            oldFind:   4074ms.
        gdc -inline
            newFind:   1830ms.
            oldFind:   4068ms.
}

@monarchdodra
Copy link
Collaborator Author

  • Implementation :
InputRange find(alias pred = "a == b", InputRange, Element)(InputRange haystack, Element needle)
if (isInputRange!InputRange &&
    is (typeof(binaryFun!pred(haystack.front, needle)) : bool))
{
    alias R = InputRange;
    alias E = Element;
    alias predFun = binaryFun!pred;
    static if (is(typeof(pred == "a == b")))
        enum isDefaultPred = pred == "a == b";
    else
        enum isDefaultPred = false;
    enum  isIntegralNeedle = isSomeChar!E || isIntegral!E || isBoolean!E;

    alias EType  = ElementType!R;

    static if (isNarrowString!R)
    {
        alias EEType = ElementEncodingType!R;
        alias UEEType = Unqual!EEType;

        //These are two special cases which can search without decoding the UTF stream.
        static if (isDefaultPred && isIntegralNeedle)
        {
            //This special case deals with UTF8 search, when the needle
            //is represented by a single code point.
            //Note: "needle <= 0x7F" properly handles sign via unsigned promotion
            static if (is(UEEType == char))
            {
                if (!__ctfe && needle <= 0x7F)
                {
                    R trustedmemchr() @trusted nothrow
                    {
                        auto ptr = memchr(haystack.ptr, needle, haystack.length);
                        return ptr ?
                             haystack[ptr - haystack.ptr .. $] :
                             haystack[$ .. $];
                    }
                    return trustedmemchr();
                }
            }

            //Ditto, but for UTF16
            static if (is(UEEType == wchar))
            {
                if (needle <= 0xD7FF || (0xE000 <= needle && needle <= 0xFFFF))
                {
                    foreach (i, ref EEType e; haystack)
                    {
                        if (e == needle)
                            return haystack[i .. $];
                    }
                    return haystack[$ .. $];
                }
            }
        }

        //Previous conditonal optimizations did not succeed. Fallback to
        //unconditional implementations
        static if (isDefaultPred)
        {
            //In case of default pred, it is faster to do string/string search.
            UEEType[is(UEEType == char) ? 4 : 2] buf;

            size_t len = encode(buf, needle);
            //TODO: Make find!(R, R) @safe
            R trustedFindRR() @trusted {return std.algorithm.find(haystack, cast(R)buf[0 .. len]);}
            return trustedFindRR();
        }
        else
        {
            //Explicit pred. Manually decode: That is the fastest implementation.
            immutable len = haystack.length;
            size_t i = 0, next = 0;
            while (next < len)
            {
                if (predFun(decode(haystack, next), needle))
                    return haystack[i .. $];
                i = next;
            }
            return haystack[$ .. $];
        }
    }
    else static if (isArray!R)
    {
        //10403 optimization
        static if (isDefaultPred && isIntegral!EType && EType.sizeof == 1 && isIntegralNeedle)
        {
            R findHelper() @trusted nothrow
            {
                EType* ptr = null;
                //Note: we use "min/max" to handle sign miss-match.
                if (min(EType.min, needle) == EType.min, needle && max(EType.max, needle) == EType.max)
                    ptr = cast(EType*) memchr(haystack.ptr, needle, haystack.length);

                return ptr ?
                    haystack[ptr - haystack.ptr .. $] :
                    haystack[$ .. $];
            }

            if (!__ctfe)
                return findHelper();
        }

        //Default implementation.
        foreach (i, ref e; haystack)
            if (predFun(e, needle))
                return haystack[i .. $];
        return haystack[$ .. $];
    }
    else
    {
        //Everything else. Walk.
        for ( ; !haystack.empty; haystack.popFront() )
        {
            if (predFun(haystack.front, needle))
                break;
        }
        return haystack;
    }
}

Implementation note: I used named functions for @trusted, as opposed to inline lambdas, as the inline lambdas seem to be conflicting with the ability to inline.

@ibuclaw
Copy link
Member

ibuclaw commented Aug 18, 2013

I assume when you say you tested gdc with -inline I assume you mean -finline-functions. Other than that, looks good to me.

@monarchdodra
Copy link
Collaborator Author

I assume you mean

No, I used -inline. That said, I see no difference when adding -finline-functions. Also, I used -O, and I'm noticing that -O3 is better. I think I may be having some problems with assert: -fno-assert seems to have no effect :/ I think I botched the gdc tests... sorry.

What would be the gdc equivalent of -O -release -inline?

@ibuclaw
Copy link
Member

ibuclaw commented Aug 18, 2013

What would be the gdc equivalent of -O -release -inline?

Simply -O2 -frelease - function inlining is turned on at -O2.

@monarchdodra
Copy link
Collaborator Author

Simply -O2 -frelease - function inlining is turned on at -O2.

Thanks. That does improve the performance. I'm still getting roughly the same results, in terms of relative improvements, so the performance improvements aren't worth noting compared to DMD's.

The only place where DMD and GDC differ is that I'm getting roughly the same performance with arrays on dmd for a foreach/for( ; !empty; popFront) loop (with -inline).

GDC seems to be having more problems with the for( ; !empty; popFront) loops (takes twice as much time to iterate), but on the other hand, it optimizes the foreach loop more aggressively.

In any case, it means using foreach is, at worst, a performance improvement.

@monarchdodra
Copy link
Collaborator Author

Also, Disclaimer:

The "old" implementation is subject to bug http://d.puremagic.com/issues/show_bug.cgi?id=10848 : inlined lambdas aren't inlined.

I reused parts of the old implementation, but fixing said bug. This, in a certain way, this makes my comparisons biased in the sense that I should have fixed the old implementation too before comparing with it.

Overall, it means that the old time for the implementations for UTF8/UTF16 without a default pred should roughly be halved.

Even with this, the new implementation is mostly still either massively better, or just as good.

@quickfur
Copy link
Member

Awesome performance improvement, I like it. :) A few notes:

  1. I'm not sure about stringing static if() if (__ctfe) if (...) all in a single line. Isn't there a better way of formatting this so that it's a little more readable?

  2. I thought some time ago Walter wanted template parameters like R to be rewritten as something more descriptive, like InputRange, ForwardRange, etc., so the the docs are more readable. Or has that changed now that ddoc includes the signature constraints?

Other than that, LGTM.

@monarchdodra
Copy link
Collaborator Author

Awesome performance improvement, I like it. :)

@quickfur Thanks :)

  1. I'm not sure about stringing static if() if (__ctfe) if (...) all in a single line. Isn't there a better way of formatting this so that it's a little more readable?

You are right. I fixed it up, and it should be more readable.

  1. I thought some time ago Walter wanted template parameters like R to be rewritten as something more descriptive, like InputRange, ForwardRange, etc., so the the docs are more readable. Or has that changed now that ddoc includes the signature constraints?

Hum... I had though of the R vs Range debate, where the word Range, imo, buys us nothing. However, I like naming it what it is a lot (eg: ForwardRange haystack).

So I changed it accordingly.

Other than that, LGTM.

@quickfur
Copy link
Member

quickfur commented Sep 2, 2013

LGTM.

@quickfur
Copy link
Member

Are any of the other Phobos committers looking at this pull? It has been green for the past little while, but seems to be stuck in continually being retested as other pulls on the first page get pulled first.

@monarchdodra
Copy link
Collaborator Author

It has been green for the past little while, but seems to be stuck in continually being retested as other pulls on the first page get pulled first.

Well, it's been green for the past 10 days, when I re-based (except for a little window when I broke dmd with my core.bitop fix).

Do any of the reviewers wait for 10/10 green? I just check the history personally. There are too many open pull (thread failures) to wait for 10/10 green to pull, IMO.

@jmdavis
Copy link
Member

jmdavis commented Sep 15, 2013

Whether I wait for 10/10 green depends on the pull. If it's sufficiently complex and something else has been merged since the last time it had 10/10 green, then it risks breaking the build, whereas if it's simple enough, then odds of it breaking the build are low if it's been green on most of the systems and has been green over most of its history.

@ghost
Copy link

ghost commented Sep 15, 2013

@monarchdodra: I think you're relatively safe, and you can easily make a revert pull if something goes wrong. Compare this to the mess from a few days ago when master didn't build for an entire day because someone went to sleep after pulling.

std/algorithm.d Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are actually going to use a "real" (i.e. not static) nested function anyway, is there any reason not to just go for a delegate literal?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lambdas do not get inlined by DMD for some reason. There is a pull to enforce such optimization.
dlang/dmd#2483

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I thought the issue concerned all nested functions. Thanks for the hint.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What he said. Making this a lambda murders performance. In particular: http://d.puremagic.com/issues/show_bug.cgi?id=10864

As of right now, appender is actually slower than ~= because of this.

That said I like the explicitness of the named nested function anyways. But I also think it'd be even better as a static nested function.

@monarchdodra
Copy link
Collaborator Author

@klickverbot : Points addressed.

dnadlinger added a commit that referenced this pull request Oct 31, 2013
fix Issue 10403 - memchr optimization for std.algorithm.find
@dnadlinger dnadlinger merged commit 482d0e1 into dlang:master Oct 31, 2013
@dnadlinger
Copy link
Contributor

Awesome!

@monarchdodra
Copy link
Collaborator Author

Awesome!

TYVM for the merge :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@monarchdodra This line causes a regression 11603.
The issue will cause when needle is a 1-byte zero.

And, why comma operator is used? It would be an another bug.
if (min(EType.min, needle) == EType.min, <-- hera

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like an copy/paste edit botch when writing my conditions. Sorry. Thanks for fixing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants