fix Issue 10403 - memchr optimization for std.algorithm.find#1492
fix Issue 10403 - memchr optimization for std.algorithm.find#1492dnadlinger merged 2 commits intodlang:masterfrom monarchdodra:findImprov
Conversation
These were done on a linux 64, with both dmd and gdc, tested with and without -inline. gdc results were mostly in line with dmd's so I did not publish them (except when different): UTF8:
{
find("hello world", ' ');
dmd
newFind: 731ms.
oldFind: 11875ms.
dmd -inline
newFind: 646ms.
oldFind: 9359ms.
gdc
NewFind: 808ms.
oldFind: 8900ms.
gdc -inline
newFind: 817ms.
oldFind: 8944ms.
find("日日日本本語", '語');
dmd
newFind: 10336ms.
oldFind: 16065ms.
dmd -inline
newFind: 5868ms.
oldFind: 11645ms.
find!"a==b"("hello world", ' ');
dmd
newFind: 3029ms.
oldFind: 5799ms.
dmd -inline
newFind: 1949ms.
oldFind: 3213ms.
}
UTF16:
{
find("hello world"w, ' ');
dmd
newFind: 1321ms.
oldFind: 12333ms.
dmd -inline
newFind: 1561ms.
oldFind: 9728ms.
find("日日日本本語"w, '\U00010000');
dmd
newFind: 4863ms.
oldFind: 10508ms.
dmd -inline
newFind: 2613ms.
oldFind: 8434ms.
find!"a==b"("日日日本本語", '語');
dmd
newFind: 9152ms.
oldFind: 13035ms.
dmd -inline
newFind: 7538ms.
oldFind: 10094ms.
}
Arrays:
{
find([1, 2, 3, 4, 5], 5);
dmd
newFind: 1899ms.
oldFind: 4168ms.
dmd -inline
newFind: 1218ms.
oldFind: 1366ms.
dgc
newFind: 1813ms.
oldFind: 4074ms.
gdc -inline
newFind: 1830ms.
oldFind: 4068ms.
} |
InputRange find(alias pred = "a == b", InputRange, Element)(InputRange haystack, Element needle)
if (isInputRange!InputRange &&
is (typeof(binaryFun!pred(haystack.front, needle)) : bool))
{
alias R = InputRange;
alias E = Element;
alias predFun = binaryFun!pred;
static if (is(typeof(pred == "a == b")))
enum isDefaultPred = pred == "a == b";
else
enum isDefaultPred = false;
enum isIntegralNeedle = isSomeChar!E || isIntegral!E || isBoolean!E;
alias EType = ElementType!R;
static if (isNarrowString!R)
{
alias EEType = ElementEncodingType!R;
alias UEEType = Unqual!EEType;
//These are two special cases which can search without decoding the UTF stream.
static if (isDefaultPred && isIntegralNeedle)
{
//This special case deals with UTF8 search, when the needle
//is represented by a single code point.
//Note: "needle <= 0x7F" properly handles sign via unsigned promotion
static if (is(UEEType == char))
{
if (!__ctfe && needle <= 0x7F)
{
R trustedmemchr() @trusted nothrow
{
auto ptr = memchr(haystack.ptr, needle, haystack.length);
return ptr ?
haystack[ptr - haystack.ptr .. $] :
haystack[$ .. $];
}
return trustedmemchr();
}
}
//Ditto, but for UTF16
static if (is(UEEType == wchar))
{
if (needle <= 0xD7FF || (0xE000 <= needle && needle <= 0xFFFF))
{
foreach (i, ref EEType e; haystack)
{
if (e == needle)
return haystack[i .. $];
}
return haystack[$ .. $];
}
}
}
//Previous conditonal optimizations did not succeed. Fallback to
//unconditional implementations
static if (isDefaultPred)
{
//In case of default pred, it is faster to do string/string search.
UEEType[is(UEEType == char) ? 4 : 2] buf;
size_t len = encode(buf, needle);
//TODO: Make find!(R, R) @safe
R trustedFindRR() @trusted {return std.algorithm.find(haystack, cast(R)buf[0 .. len]);}
return trustedFindRR();
}
else
{
//Explicit pred. Manually decode: That is the fastest implementation.
immutable len = haystack.length;
size_t i = 0, next = 0;
while (next < len)
{
if (predFun(decode(haystack, next), needle))
return haystack[i .. $];
i = next;
}
return haystack[$ .. $];
}
}
else static if (isArray!R)
{
//10403 optimization
static if (isDefaultPred && isIntegral!EType && EType.sizeof == 1 && isIntegralNeedle)
{
R findHelper() @trusted nothrow
{
EType* ptr = null;
//Note: we use "min/max" to handle sign miss-match.
if (min(EType.min, needle) == EType.min, needle && max(EType.max, needle) == EType.max)
ptr = cast(EType*) memchr(haystack.ptr, needle, haystack.length);
return ptr ?
haystack[ptr - haystack.ptr .. $] :
haystack[$ .. $];
}
if (!__ctfe)
return findHelper();
}
//Default implementation.
foreach (i, ref e; haystack)
if (predFun(e, needle))
return haystack[i .. $];
return haystack[$ .. $];
}
else
{
//Everything else. Walk.
for ( ; !haystack.empty; haystack.popFront() )
{
if (predFun(haystack.front, needle))
break;
}
return haystack;
}
}Implementation note: I used named functions for |
|
I assume when you say you tested gdc with |
No, I used What would be the gdc equivalent of |
Simply |
Thanks. That does improve the performance. I'm still getting roughly the same results, in terms of relative improvements, so the performance improvements aren't worth noting compared to DMD's. The only place where DMD and GDC differ is that I'm getting roughly the same performance with arrays on dmd for a GDC seems to be having more problems with the In any case, it means using |
|
Also, Disclaimer: The "old" implementation is subject to bug http://d.puremagic.com/issues/show_bug.cgi?id=10848 : inlined lambdas aren't inlined. I reused parts of the old implementation, but fixing said bug. This, in a certain way, this makes my comparisons biased in the sense that I should have fixed the old implementation too before comparing with it. Overall, it means that the old time for the implementations for UTF8/UTF16 without a default pred should roughly be halved. Even with this, the new implementation is mostly still either massively better, or just as good. |
|
Awesome performance improvement, I like it. :) A few notes:
Other than that, LGTM. |
@quickfur Thanks :)
You are right. I fixed it up, and it should be more readable.
Hum... I had though of the So I changed it accordingly.
|
|
LGTM. |
|
Are any of the other Phobos committers looking at this pull? It has been green for the past little while, but seems to be stuck in continually being retested as other pulls on the first page get pulled first. |
Well, it's been green for the past 10 days, when I re-based (except for a little window when I broke dmd with my core.bitop fix). Do any of the reviewers wait for 10/10 green? I just check the history personally. There are too many open pull (thread failures) to wait for 10/10 green to pull, IMO. |
|
Whether I wait for 10/10 green depends on the pull. If it's sufficiently complex and something else has been merged since the last time it had 10/10 green, then it risks breaking the build, whereas if it's simple enough, then odds of it breaking the build are low if it's been green on most of the systems and has been green over most of its history. |
|
@monarchdodra: I think you're relatively safe, and you can easily make a revert pull if something goes wrong. Compare this to the mess from a few days ago when master didn't build for an entire day because someone went to sleep after pulling. |
std/algorithm.d
Outdated
There was a problem hiding this comment.
If you are actually going to use a "real" (i.e. not static) nested function anyway, is there any reason not to just go for a delegate literal?
There was a problem hiding this comment.
Lambdas do not get inlined by DMD for some reason. There is a pull to enforce such optimization.
dlang/dmd#2483
There was a problem hiding this comment.
Ah, I thought the issue concerned all nested functions. Thanks for the hint.
There was a problem hiding this comment.
What he said. Making this a lambda murders performance. In particular: http://d.puremagic.com/issues/show_bug.cgi?id=10864
As of right now, appender is actually slower than ~= because of this.
That said I like the explicitness of the named nested function anyways. But I also think it'd be even better as a static nested function.
|
@klickverbot : Points addressed. |
fix Issue 10403 - memchr optimization for std.algorithm.find
|
Awesome! |
TYVM for the merge :) |
There was a problem hiding this comment.
@monarchdodra This line causes a regression 11603.
The issue will cause when needle is a 1-byte zero.
And, why comma operator is used? It would be an another bug.
if (min(EType.min, needle) == EType.min, <-- hera
There was a problem hiding this comment.
This looks like an copy/paste edit botch when writing my conditions. Sorry. Thanks for fixing.
This improves the implementation of
find!(R, E)as well asfind!(pred, R).Improvements to UTF8 and UTF16 are particularly spectacular, with performance improvements up to ×20 for normal strings (read
"hello world").These performance improvements to string test that the needle fit inside a single code unit. However, even if it doesn't, I still get a rough 100% performance improvement in the general case, with or without default pred.
[u]byte[]Implemented 10403: Same as with strings, the performance improvement is spectacular when it triggers.
I added a path for raw arrays, which uses a foreach rather than a loop. It makes little difference for
dmd -inline, but I get a bit more than 100% performance improvement for non-inline, so worth it. (also, stangelly enough,gdc -inlinegets a big benefit. Don't know why...)