
Optimize binary match from 10% up to 70x #1803

Merged
merged 1 commit into erlang:master from josevalim:jv-sb on Aug 16, 2018

Conversation

9 participants
@josevalim
Contributor

josevalim commented Apr 29, 2018

The idea is to use memchr on the first lookup for binary:match/2 and also after every match on binary:matches/2.

This speeds up binary matching and binary splitting by 4x in some cases and by 70x in other scenarios (when the last character in the needle does not occur in the subject).

The reason to use memchr is that it is highly optimized on most modern operating systems, often implemented in assembly and using SIMD instructions.

The implementation uses the reduction count to figure out how many bytes should be read with memchr.
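To make the approach concrete, here is a minimal, self-contained sketch of the memchr-driven lookup (the function find_first and its signature are illustrative, not the actual emulator code): memchr jumps between occurrences of the needle's last byte, and a full comparison runs only at those candidate positions.

```c
#include <stddef.h>
#include <string.h>

/* Simplified sketch (not the actual OTP code): use memchr to jump to
 * candidate positions of the needle's last byte, then verify the full
 * needle, instead of advancing one byte at a time. Returns the match
 * offset or -1 when there is no match. */
static long find_first(const unsigned char *hay, size_t hlen,
                       const unsigned char *needle, size_t nlen)
{
    size_t j = 0;
    if (nlen == 0 || nlen > hlen)
        return -1;
    while (j <= hlen - nlen) {
        /* Jump straight to the next occurrence of the last needle byte. */
        const unsigned char *p =
            memchr(hay + j + nlen - 1, needle[nlen - 1], hlen - j - nlen + 1);
        if (p == NULL)
            return -1;                 /* last byte absent: no match at all */
        j = (size_t)(p - hay) - (nlen - 1);
        if (memcmp(hay + j, needle, nlen) == 0)
            return (long)j;            /* full needle verified */
        j++;                           /* false positive: resume after it */
    }
    return -1;
}
```

When the needle's last byte never occurs in the subject, the whole search collapses into a single memchr sweep, which is the scenario behind the largest speedups reported below.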

Huge thanks to @k0kubun for pointing me towards SIMD and memchr. And thanks to @citizen428 for the pairing.

PS: I know the OTP team is busy this week so feel free to let this sit for a while. :)

@josevalim

Contributor

josevalim commented Apr 29, 2018

Benchmark

Each measurement is the average of 100 operations.

Large binary with no match

subject = "ugbcfuysabfuqyfikgfsdalpaskfhgjsdgfjwsalp" x 1_000_000
pattern = "o"

With this patch

binary:match/2 not precompiled: 3512.24us
binary:match/2 precompiled: 3226.8us

binary:split/2 not precompiled: 3745.6us
binary:split/2 precompiled: 3264.24us

binary:matches/2 not precompiled: 2945.78us
binary:matches/2 precompiled: 3138.5us

binary:split/2 not precompiled: 2855.92us
binary:split/2 precompiled: 2967.32us

Without this patch

binary:match/2 not precompiled: 223504.4us
binary:match/2 precompiled: 222565.64us

binary:split/2 not precompiled: 222964.44us
binary:split/2 precompiled: 224047.88us

binary:matches/2 not precompiled: 224721.34us
binary:matches/2 precompiled: 224562.2us

binary:split/2 not precompiled: 224625.42us
binary:split/2 precompiled: 224629.42us

Results

About 70 times faster.

Large binary with multiple single byte matches

subject = "abchfkxhfjksdhfjksahjfhasdjkfhdksjfhjkadshfjkdshfjkerpoqkfl\n" x 1_000_000
pattern = "\n"

The goal is to test a single-byte pattern that occurs from time to time.

With this patch

binary:match/2 not precompiled: 0.88us
binary:match/2 precompiled: 0.28us

binary:split/2 not precompiled: 0.84us
binary:split/2 precompiled: 0.28us

binary:matches/2 not precompiled: 104144.64us
binary:matches/2 precompiled: 104321.64us

binary:split/2 not precompiled: 116378.16us
binary:split/2 precompiled: 117309.24us

Without this patch

binary:match/2 not precompiled: 3.36us
binary:match/2 precompiled: 1.08us

binary:split/2 not precompiled: 1.06us
binary:split/2 precompiled: 0.66us

binary:matches/2 not precompiled: 445137.84us
binary:matches/2 precompiled: 432233.44us

binary:split/2 not precompiled: 437167.26us
binary:split/2 precompiled: 436216.9us

Results

About 4 times faster.

Large binary with sequential three byte matches

subject = "abc" x 1_000_000
pattern = "abc"

The goal is to test a pattern that occurs over and over in sequence.

With this patch

binary:match/2 not precompiled: 1.48us
binary:match/2 precompiled: 0.22us

binary:split/2 not precompiled: 0.8us
binary:split/2 precompiled: 0.32us

binary:matches/2 not precompiled: 109711.36us
binary:matches/2 precompiled: 107959.44us

binary:split/2 not precompiled: 104394.38us
binary:split/2 precompiled: 101978.8us

Without this patch

binary:match/2 not precompiled: 1.24us
binary:match/2 precompiled: 0.44us

binary:split/2 not precompiled: 0.8us
binary:split/2 precompiled: 0.34us

binary:matches/2 not precompiled: 101125.82us
binary:matches/2 precompiled: 104963.38us

binary:split/2 not precompiled: 101220.5us
binary:split/2 precompiled: 101323.14us

Results

No measurable difference.

@josevalim josevalim changed the title Optimize binary match from 10% up to 4x (or more depending on config) Optimize binary match from 10% up to 80x Apr 29, 2018

@josevalim josevalim changed the title Optimize binary match from 10% up to 80x Optimize binary match from 10% up to 70x Apr 29, 2018

@josevalim josevalim force-pushed the josevalim:jv-sb branch from 1768b08 to 0e9993f Apr 29, 2018

*reductions = reds;
return (m == 0) ? BF_NOT_FOUND : BF_OK;
}
if (reds == 0) {


@josevalim

josevalim Apr 29, 2018

Contributor

Before, we did if (--reds == 0), but since memchr may set reductions to 0, we need to check for an explicit zero before decrementing. This means we may return 0 as the number of reductions when this procedure exits! If this is not desired or not allowed, then we need to address it.

So to ask this very clearly: can we return 0 reductions from this function in the OK case?
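A toy model of the ordering being discussed (the helper should_yield and the Uint alias are illustrative, not the emulator's API): the explicit zero check must come before the decrement, because memchr-based accounting may have consumed the remaining reductions exactly.

```c
typedef unsigned long Uint;  /* stand-in for the emulator's Uint */

/* Returns 1 when the search loop should stop and yield to the
 * scheduler. The explicit zero check comes first: a bare --reds on an
 * unsigned count that is already 0 would wrap around to a huge value,
 * and the scheduler would not get a chance to run for a long time. */
static int should_yield(Uint *reds)
{
    if (*reds == 0)
        return 1;
    if (--(*reds) == 0)
        return 1;
    return 0;
}
```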


@garazdawi

garazdawi May 14, 2018

Contributor

yes

reds -= (pos_pointer - haystack - needle_last - j);
}
}
if (j > len - blen) {


@josevalim

josevalim Apr 29, 2018

Contributor

This condition was part of the while condition, but I had to move it inside the loop since the memchr check may set j out of range.

}

if (characters == 0) {


@josevalim

josevalim Apr 29, 2018

Contributor

This was a duplicate check, so I have moved it up.


while (j <= len - blen) {
if (--reds == 0) {
for(;;) {


@josevalim

josevalim Apr 29, 2018

Contributor

I have moved this to a for(;;) instead of a while to make it consistent with the implementation for bm_find_all.

@josevalim

Contributor

josevalim commented Apr 29, 2018

My previous implementation had a bug in which the number of reductions went below zero, never giving the scheduler a chance to stop, which affected earlier results. I have fixed this bug and updated the benchmark results that now show gains up to 70x.

@josevalim josevalim force-pushed the josevalim:jv-sb branch from 0e9993f to 649a0a2 Apr 29, 2018

@josevalim

Contributor

josevalim commented Apr 29, 2018

Finally, I have pushed a separate commit that adds a reduction factor when using memchr. This further improves performance by about 10% in the no-match case, making it closer to 80x faster. The reasoning is that memchr is cheap compared to Boyer-Moore.

reds -= mem_read / MC_LOOP_FACTOR;
} else {
j = pos_pointer - haystack - needle_last;
reds -= (pos_pointer - haystack - needle_last - j) / MC_LOOP_FACTOR;


@potatosalad

potatosalad Apr 30, 2018

Contributor

Maybe I'm reading this wrong, but won't this always be equivalent to reds -= (j - j) / MC_LOOP_FACTOR;? Or in other words: reds -= 0;?


@potatosalad

potatosalad Apr 30, 2018

Contributor

Is this perhaps supposed to be something like this?

reds -= (pos_pointer - &haystack[j] - needle_last) / MC_LOOP_FACTOR;
j = pos_pointer - haystack - needle_last;


@josevalim

josevalim May 1, 2018

Contributor

@potatosalad good catch, fixed it, thanks!
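The ordering bug can be reproduced in miniature (delta_buggy and delta_fixed are hypothetical names): once j has been assigned the new position, subtracting j from that same position always yields zero, so no reductions get charged for the bytes memchr skipped.

```c
#include <stddef.h>

/* Buggy order from the original diff: j is assigned first, so the
 * delta (new position minus old j) is always zero. */
static long delta_buggy(size_t *j, size_t candidate)
{
    *j = candidate;
    return (long)(candidate - *j);   /* always 0 */
}

/* Fixed order: measure the advance against the old j, then update. */
static long delta_fixed(size_t *j, size_t candidate)
{
    long advance = (long)(candidate - *j);
    *j = candidate;
    return advance;
}
```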

@josevalim josevalim force-pushed the josevalim:jv-sb branch 2 times, most recently from 7dc9ea6 to 5a011a6 May 1, 2018

@garazdawi garazdawi self-assigned this May 2, 2018

@garazdawi

Contributor

garazdawi commented May 2, 2018

Looks good after a cursory look. I'll take a closer look once rc1 is done.

Would it be possible to add the benchmarks that you have written to stdlib_bench_SUITE, so that we can track them and make sure that we don't get any performance regressions?

@josevalim

Contributor

josevalim commented May 2, 2018

@garazdawi oh, I did not know we had a bench suite! I will do that.

@garazdawi
Contributor

garazdawi left a comment

I have looked through the code in greater detail and it looks good. I'll put it into tests tomorrow.


@garazdawi garazdawi added the testing label May 15, 2018

@garazdawi

Contributor

garazdawi commented May 16, 2018

Test cases fail in stdlib's string_SUITE after applying this PR.

@garazdawi garazdawi removed the testing label May 16, 2018

@garazdawi

Contributor

garazdawi commented May 16, 2018

Also, we were talking a bit about this optimization. If you want to optimize this even further, you could use memmem on platforms that support it.

@josevalim

Contributor

josevalim commented May 16, 2018

@garazdawi I will look at the tests soon and figure out why they didn't fail locally.

Regarding memmem, I have checked its implementation, and the one I saw used memchr underneath, in an optimization very similar to the one in this PR. The only difference is that they add a factor: if the pattern you are looking for has more than 32 bytes, they claim BM is better than memchr. However, I could not reproduce this behaviour, even for longer patterns.

So in the end, I thought adding a third path wouldn't be that beneficial.
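For reference, the memchr-underneath strategy described above can be sketched portably (find_sub is a hypothetical helper, not OTP or libc code): scan for the needle's first byte with memchr, then verify the rest with memcmp.

```c
#include <stddef.h>
#include <string.h>

/* Portable sketch of the memmem strategy discussed above: jump between
 * occurrences of the needle's first byte with memchr and only run a
 * full comparison at those positions. Several libc memmem
 * implementations use a similar memchr fast path for short needles. */
static const void *find_sub(const void *hay, size_t hlen,
                            const void *ned, size_t nlen)
{
    const unsigned char *h = hay;
    const unsigned char *n = ned;
    if (nlen == 0)
        return hay;                    /* empty needle matches at start */
    while (hlen >= nlen) {
        const unsigned char *p = memchr(h, n[0], hlen - nlen + 1);
        if (p == NULL)
            return NULL;               /* first byte absent: no match */
        if (memcmp(p, n, nlen) == 0)
            return p;                  /* full needle verified */
        hlen -= (size_t)(p - h) + 1;   /* resume just past the candidate */
        h = p + 1;
    }
    return NULL;
}
```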

But talking about optimizations, @michalmuskala and I want to look into optimizing the multi-pattern algorithm in binary:match to use the PCMPESTRI and PCMPESTRM SIMD instructions. Those instructions allow looking up multiple bytes in a string. So it should work similar to the current BM optimization, except we are applying it to a list of patterns.

@garazdawi

Contributor

garazdawi commented May 16, 2018

It is the find/split/replace test cases that fail. They seem to fail on all of our test machines, including my own. There are also a bunch of other test cases that fail in various applications; the compiler also seems to misbehave on some platforms, as it produces invalid code.

@fenollp

Contributor

fenollp commented May 16, 2018

Not exactly on topic, but how realistic would it be to implement SIMD operations for binary comprehensions?
Surely there are a couple of encoder/decoder implementations on the BEAM that would benefit from vectorized - + * & | ^ << >> operations? BC semantics seem perfect for such an acceleration.

@garazdawi

Contributor

garazdawi commented May 16, 2018

One of the problems with using SIMD in the interpreter is that for it to work we have to sacrifice local tracing on those functions, or make the compiler emit both optimized and traceable code that can be switched during run-time. Regardless, it is quite an undertaking. HiPE on the other hand could (and maybe does? I'm not sure) use SIMD instructions, as it has already sacrificed tracing.

@garazdawi

Contributor

garazdawi commented May 16, 2018

Found these backtraces in our debug builds:

Thread 1 (Thread 11869):
#0  0x00007f09126e21b5 in raise () from /lib/libc.so.6
#1  0x00007f09126e4fc0 in abort () from /lib/libc.so.6
#2  0x000000000068db84 in erl_assert_error (expr=0x7f8e65 "mem_read > 0", func=0x7f9520 "bm_find_all_non_overlapping", file=0x7f8e40 "beam/erl_bif_binary.c", line=908) at sys/unix/sys.c:940
#3  0x0000000000666bf0 in bm_find_all_non_overlapping (ctx=0x7f08d00bc9a0, haystack=0x7f08cf903ea8 "=PROGRESS REPORT==== 16-May-2018::01:34:29.849613 ===\n    supervisor: {local,sasl_safe_sup}\n    started: [{pid,<0.14176.0>},\n", ' ' , "{id,alarm_handler},\n", ' ' , "{mfargs,{alarm_handler,star"...) at beam/erl_bif_binary.c:908
#4  0x0000000000668704 in do_binary_find (p=0x7f08d19444c8, subject=139675683751098, ctxp=0x7f08d00bc998, pat_bin=0x7f08d1dc2c50, ctx_bin=0x0, res_term=0x7f08d00bc988) at beam/erl_bif_binary.c:1414
#5  0x0000000000669486 in binary_split (p=0x7f08d19444c8, arg1=139675683751098, arg2=139675868277050, arg3=139675868277073) at beam/erl_bif_binary.c:1578
#6  0x0000000000669662 in binary_split_3 (A__p=0x7f08d19444c8, BIF__ARGS=0x7f08d1fc0100, A__I=0x7f08cdd1a7c8) at beam/erl_bif_binary.c:1602
#7  0x000000000044127a in process_main (x_reg_array=0x7f08d1fc0100, f_reg_array=0x7f08d1fc2180) at x86_64-unknown-linux-gnu/debug/smp/beam_hot.h:310
#8  0x000000000048296c in sched_thread_func (vesdp=0x7f08d0d826c0) at beam/erl_process.c:8349
#9  0x00000000007324b3 in thr_wrapper (vtwd=0x7fffd749d1e0) at pthread/ethread.c:118
#10 0x00007f0912c208ca in start_thread () from /lib/libpthread.so.0
#11 0x00007f091277fb6d in clone () from /lib/libc.so.6
#12 0x0000000000000000 in ?? ()

@garazdawi garazdawi added the waiting label May 17, 2018

@josevalim josevalim force-pushed the josevalim:jv-sb branch from 5a011a6 to 82e246c May 19, 2018

@josevalim

Contributor

josevalim commented May 19, 2018

I have also added benchmarks in a separate commit. However, the benchmarks I have added are failing:

TEST INFO: 1 test(s), 1 case(s) in 1 suite(s)

Testing tests.stdlib_test.stdlib_bench_SUITE.match_single_pattern_no_match: Starting test, 1 test cases

*** ERROR *** Invalid return value from stdlib_bench_SUITE:match_single_pattern_no_match/0: {comment,
                                                                                             "291"}
Testing tests.stdlib_test.stdlib_bench_SUITE.match_single_pattern_no_match: *** SKIPPED test case 1 of 1 ***
Testing tests.stdlib_test.stdlib_bench_SUITE.match_single_pattern_no_match: TEST COMPLETE, 0 ok, 0 failed, 1 skipped of 1 test cases

I don't understand why they are failing, as they mimic the structure of existing benchmarks, so I have pushed the code in case somebody knows what is going on. It is in a separate commit for now.

@@ -157,41 +162,59 @@ norm_data(Config) ->

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

match_single_pattern_no_match() ->


@garazdawi

garazdawi May 21, 2018

Contributor

This should be an arity 1 function.


@josevalim

josevalim May 21, 2018

Contributor

Ah, that was it! Thank you. Fixed, squashed and pushed.

@josevalim josevalim force-pushed the josevalim:jv-sb branch from 3629092 to 4a4f522 May 21, 2018

@archseer


archseer commented May 30, 2018

Sorry for interrupting, but is there any chance of this being included in the OTP 21 release? (since the final RC just went out yesterday)

@garazdawi garazdawi added testing and removed waiting labels May 30, 2018

@garazdawi

Contributor

garazdawi commented May 30, 2018

We haven't decided yet, so yes there is still a chance that it will be included. It depends on what further testing finds and if we find the risk of including it acceptable.

@seriyps

Contributor

seriyps commented May 31, 2018

I guess the ticket https://bugs.erlang.org/browse/ERL-374 could be closed if this PR is merged.

@sverker sverker removed the testing label Jun 1, 2018

@sverker

Contributor

sverker commented Jun 1, 2018

I removed this from our test runs over the weekend, as it produces quite a lot of core dumps, especially on the debug emulator.

@josevalim

Contributor

josevalim commented Jun 1, 2018

Thanks @sverker! Is there any command I may run to reproduce those failures? What is the simplest way to run the emulator on debug mode on my machine?

EDIT: I found that I can cd erts/emulator && make debug. Is this correct? Is there a way to check that I am indeed running the debug emulator?

@tuncer

Contributor

tuncer commented Jun 2, 2018

@josevalim wrote:

Is there a way to check that I am indeed running the debug emulator?

Development emulators are distinct executables with a different name and are started with cerl (e.g. bin/cerl -debug). That means you will find beam.debug.smp in your process list. Further, the shell will greet you with extra info: [type-assertions] [debug-compiled] [lock-checking].

@garazdawi

Contributor

garazdawi commented Jun 4, 2018

This is the stack trace when it fails. This PR will not make it into OTP 21.0, so there is no rush in fixing it.

Thread 1 (process 23829):
#0  0xb7d9cb50 in memchr () from /lib/libc.so.6
#1  0x082070bc in bm_find_all_non_overlapping (ctx=0xb5a3f098, haystack=0xb467fe68 "3tC43jÅFx76yceOVÄJOn3Gj3Gt2NöTooKzVkT1q5KQot4SiADx7tZGXnSKNidJsbwnUmqMLYkbbQwqDlRIhFnf23KvIYJSWcY7zGnUWäwfk9AM8soVZhpHrS3jmuO3ö21VzMp8äLfDIÄåWågEhQLfmxrVInIz5köOwv79jl7xVKQuBkn8HkIWJoÖEXVZfCS4aNRUktÄÄ"...) at beam/erl_bif_binary.c:911
#2  0x082061d1 in do_binary_find (p=0xb71886a8, subject=3040654090, ctxp=0xb5a3f148, pat_bin=0xb467aa70, ctx_bin=0x0, res_term=0xb5a3f140) at beam/erl_bif_binary.c:1417
#3  0x08208589 in binary_split (p=0xb71886a8, arg1=3040654090, arg2=3040826894, arg3=3071002713) at beam/erl_bif_binary.c:1581
#4  0x080886ac in process_main (x_reg_array=0xb76000c0, f_reg_array=0xb7601100) at i686-pc-linux-gnu/opt/smp/beam_hot.h:310
#5  0x08095f36 in sched_thread_func (vesdp=0xb6501900) at beam/erl_process.c:8355
#6  0x082a210b in thr_wrapper (vtwd=0xbfa3b298) at pthread/ethread.c:118
#7  0xb7e612ab in start_thread () from /lib/libpthread.so.0
#8  0xb7df058e in clone () from /lib/libc.so.6
@garazdawi

Contributor

garazdawi commented Jul 31, 2018

Hello! Do you have any plans to finish this PR?

@josevalim

Contributor

josevalim commented Jul 31, 2018

@garazdawi yes I do! We just got the Elixir v1.7 release out, which means I can finally focus on other issues.

@garazdawi

Contributor

garazdawi commented Jul 31, 2018

Great!

Optimize binary match
The idea is to use memchr on the first lookup for
binary:match/2 and also after every match on binary:matches/2.
We only use memchr in case of matches because benchmarks
showed that using memchr even when we had false positives
could negatively affect performance.

This speeds up binary matching and binary splitting by 4x
in some cases and by 70x in other scenarios (when the last
character in the needle does not occur in the subject).

The reason to use memchr is that it is highly specialized
in most modern operating systems, often defaulting to
SIMD operations.

The implementation uses the reduction count to figure out
how many bytes should be read with memchr. We could increase
those numbers but they do not seem to make a large difference.

@josevalim josevalim force-pushed the josevalim:jv-sb branch from 4a4f522 to db41ff2 Aug 7, 2018

@josevalim

Contributor

josevalim commented Aug 7, 2018

@garazdawi force pushed. This should be good now. I ran the emulator and stdlib suites and I wrote some property-based tests that ran locally and they found nothing. 👍

@garazdawi garazdawi added testing and removed waiting labels Aug 10, 2018

@garazdawi garazdawi merged commit cb3eead into erlang:master Aug 16, 2018

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
license/cla Contributor License Agreement is signed.
Details
@garazdawi

Contributor

garazdawi commented Aug 16, 2018

Merged! Thanks for your contribution.
