
Generic Needle #26
Closed · wants to merge 2 commits

Conversation

@avitex commented Feb 3, 2021

  • Moves benches to a separate crate
    • Decreases the compile time for CI tests
    • I didn't have an LLVM env for sse4-strstr and wanted to run the tests (was too lazy)
  • Makes the library no_std
  • Uses AsRef<[u8]> instead of Needle (this should be optimised like Needle)
  • Merges in work from @zakcutner for memcmp

Note: I haven't seen how Needle was implemented by @marmeladema; I only saw the trait, so there could be work missing from this PR.

@marmeladema (Collaborator)

Thanks! I'll try to take a look soon but I am pretty busy right now.

@zakcutner (Contributor)

Sorry for not taking a look at this sooner, I really appreciate your contribution! I like your ideas (making the crate no_std is particularly cool); thanks for all the work you've put into this.

You probably saw that in #8 I bounded Needle with AsRef<[u8]> although I didn't use AsRef<[u8]> directly as you have done. I probably should have explained this somewhere in the PR, but this is because of the match on the needle size that is used to specialise memcmp. Taking your code as an example...

```rust
let comparison = match self.size() {
    2 => memcmp::memcmp1(chunk, needle),
    3 => memcmp::memcmp2(chunk, needle),
    4 => memcmp::memcmp3(chunk, needle),
    5 => memcmp::memcmp4(chunk, needle),
    6 => memcmp::memcmp5(chunk, needle),
    7 => memcmp::memcmp6(chunk, needle),
    8 => memcmp::memcmp7(chunk, needle),
    9 => memcmp::memcmp8(chunk, needle),
    10 => memcmp::memcmp9(chunk, needle),
    11 => memcmp::memcmp10(chunk, needle),
    12 => memcmp::memcmp11(chunk, needle),
    13 => memcmp::memcmp12(chunk, needle),
    n => memcmp::memcmp(chunk, needle, n - 1),
};
```

If the generic needle parameter is set to something like `[u8; 4]`, the idea is that the compiler is smart enough to optimise the code above to something like...

```rust
let comparison = memcmp::memcmp3(chunk, needle);
```

If instead the generic is Box<[u8]> or &[u8] (essentially anything with a size known only at runtime), then the match statement can't be optimised out. It might still be faster to use the specialised memcmp4 rather than memcmp if the size at runtime happens to be 4; however, the match statement becomes a control-flow hazard which can potentially outweigh this performance gain. Moreover, if we're using DynamicAvx2Searcher then we know the needle size is greater than 13 (otherwise it would have called one of the specialised searchers), so there is no advantage to testing the needle size at all.
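To see concretely why this matters, here is a minimal, self-contained sketch (the `Size` trait and function names are made up for illustration, not sliceslice's actual `Needle`): with a const-generic array, the size is a compile-time constant and the match collapses, while a slice's length is only known at runtime.

```rust
// Sketch only: an illustrative size-carrying trait, not sliceslice's Needle.
trait Size {
    fn size(&self) -> usize;
}

impl<const N: usize> Size for [u8; N] {
    // A compile-time constant for each monomorphised N.
    fn size(&self) -> usize {
        N
    }
}

impl Size for &[u8] {
    // Known only at runtime, so the match below survives codegen.
    fn size(&self) -> usize {
        self.len()
    }
}

fn pick_comparison<S: Size>(needle: &S) -> &'static str {
    // For [u8; 4] this match constant-folds to the single `4` arm;
    // for &[u8] it compiles to a real branch on the runtime length.
    match needle.size() {
        4 => "memcmp3 (specialised)",
        _ => "generic memcmp",
    }
}

fn main() {
    let fixed: [u8; 4] = *b"rust";
    let dynamic: &[u8] = b"rust";
    assert_eq!(pick_comparison(&fixed), "memcmp3 (specialised)");
    assert_eq!(pick_comparison(&dynamic), "memcmp3 (specialised)");
}
```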

Interestingly, from a few small benchmarks, I found that specialising memcmp at runtime is often still faster despite the added control flow. However, while this is useful to investigate, I think we should avoid completely redefining all memory comparisons at the moment, as the performance is most likely dependent on the workload and target platform (it seems Rust is using bcmp under the hood) in any case. Either way, I think it is useful to have a Needle trait in general, to add flexibility for cases where we wish to perform optimisations only for constant-sized needles.

My generic-array branch made a few unrelated changes and is also hardcoded for needles up to length 32. Now that const generics are a thing, I've tried to extract purely the needle-related changes into a temporary new branch (zak/generic-needle) for reference if you want to take a look.

Thanks again for your work on this 😄 @marmeladema, what do you think? It would definitely be cool to make the API more generic.

@marmeladema (Collaborator)

So I think there are actually 3 different things to discuss in this PR:

  • Moving the benches into a separate crate: it might be worthwhile to do, but not really to speed up CI, as I would want the bench code to remain compiled to ensure it still works. Does moving the code into its own (unpublished) crate speed up cargo download/build time when sliceslice is used as a dependency? I think this change deserves its own PR.
  • Making the crate no_std compatible: I personally don't have much experience with this, since all the projects I work on use std, but I am definitely not against it either, as long as it does not impact usability, debuggability or performance. This change also deserves its own PR.
  • Making the API more generic: there are different proposals floating around for this and I haven't settled on which version to go forward with. I think to make progress we probably need to list the different use cases we want to address.

@marmeladema (Collaborator)

@avitex I took the liberty of re-using your work and moving the benches into a sub-crate in #27! Thank you very much for the initial work 👍

@zakcutner (Contributor)

I agree; it seems to me that the changes are all independent of one another, so they can be implemented individually.

> Moving the benches into a separate crate: it might be worthwhile to do, but not really to speed up CI, as I would want the bench code to remain compiled to ensure it still works. Does moving the code into its own (unpublished) crate speed up cargo download/build time when sliceslice is used as a dependency? I think this change deserves its own PR.

I cannot see any reason this would improve download/build time, because these dependencies are listed in dev-dependencies and so should not be required when sliceslice is used as a dependency. However, thinking a little more about this, it does seem unnecessary to compile what are mostly benchmarks of other crates when we've only made a change to sliceslice. In the future we may wish to compare sliceslice's performance against even more crates (I'd be interested in comparing performance against regex, aho-corasick and bstr), but this would make our CI even slower.

One option would be to make the dependencies on these other crates optional and not compile those benchmarks in CI, but as you pointed out, we still want to check that everything compiles without errors. An alternative approach I've seen used in the askama crate is to move the comparison benchmarks to a completely separate repo, while keeping a benchmark of sliceslice only in the main repo to detect regressions. Each repo could have its own CI, meaning we would only need to compile our comparison benchmarks when we actually update them (rather than on every change to sliceslice, as is done currently). I'm not sure how established this idea is, though, and of course the downside is that when we update sliceslice, we would also need to update the second repo to benchmark the new version.
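For the separate-crate option, a hypothetical layout could look like the following sketch (crate names and versions are made up; this is not the actual sliceslice workspace):

```toml
# ./Cargo.toml — the library stays at the root and gains a workspace section.
[workspace]
members = ["bench"]

# ./bench/Cargo.toml — benches live here, so building or testing the library
# alone never compiles criterion or the comparison crates.
[package]
name = "sliceslice-bench"
version = "0.0.0"
edition = "2018"
publish = false            # never released to crates.io

[dependencies]
sliceslice = { path = ".." }

[dev-dependencies]
criterion = "0.3"
```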

> Making the crate no_std compatible: I personally don't have much experience with this, since all the projects I work on use std, but I am definitely not against it either, as long as it does not impact usability, debuggability or performance. This change also deserves its own PR.

My understanding is that this should make no difference as long as we do not use anything that is incompatible with no_std, although like you I have not worked with this before.
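For what it's worth, the common pattern is to gate std behind an on-by-default Cargo feature; here is a minimal sketch of a lib.rs (not sliceslice's actual code, and assuming a `std = []` feature declared as a default feature in Cargo.toml):

```rust
// Sketch of the usual `no_std` + optional `std` feature pattern.
#![cfg_attr(not(feature = "std"), no_std)]

// When `std` is disabled, allocation types come from the `alloc` crate.
#[cfg(not(feature = "std"))]
extern crate alloc;
#[cfg(not(feature = "std"))]
use alloc::boxed::Box;

// Owns a copy of the needle without requiring `std`.
pub fn boxed_needle(needle: &[u8]) -> Box<[u8]> {
    needle.to_vec().into_boxed_slice()
}
```

Because the `std` feature would be on by default, existing users would be unaffected; only embedded users would opt out with `default-features = false`.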

@marmeladema (Collaborator)

Making the needle generic has been done in #31.
I guess the last remaining bit proposed here is making the crate no_std compatible.

@BurntSushi commented May 1, 2021

> (I'd be interested in comparing performance against regex, aho-corasick and bstr)

FWIW, memchr 2.4.0 now has an implementation of memmem, which is used inside regex and bstr. (aho-corasick should probably use it when it only needs to search for one pattern, but usually you don't use aho-corasick for such cases anyway.)

Aspects of the memmem implementation in the memchr crate are very similar to this one. Indeed, the release of this crate, and the poor performance of bstr in some cases in comparison to it, was the motivation for writing it. The main differences are that it uses a background distribution of byte frequencies to speed up common searches, dynamically disables poorly performing prefilters, and uses Two-Way to guarantee additive time complexity.
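For anyone wanting to compare, the new module can be exercised like this (a small usage sketch against memchr 2.4's memmem API; the example strings are mine):

```rust
use memchr::memmem;

fn main() {
    let haystack = b"sliceslice and memchr both search substrings";

    // One-off search: builds a searcher internally and runs it once.
    assert_eq!(memmem::find(haystack, b"memchr"), Some(15));

    // Prebuilt searcher: amortises construction across many haystacks.
    let finder = memmem::Finder::new("substrings");
    assert_eq!(finder.find(haystack), Some(34));
}
```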

@marmeladema (Collaborator)

@BurntSushi glad to hear we were a motivation! Very cool work! 👍

I have added some benchmarks against your new implementation in #37.
If you have some time to take a look, that would be great :) Obviously, synthetic micro-benchmarks are often not the best way to measure performance, but for what it's worth, sliceslice seems substantially faster on those.
Additionally, I have run our internal benchmarks against our regex-heavy test suite with the new version of the regex crate that uses memchr::memmem, and I have noticed a 2% regression in instruction count. Not a big deal, but I figured you might be interested. Unfortunately I did not have the time to investigate further.

@BurntSushi

Ah interesting. Here are my benchmarks (limited to measurements with a 5% or more difference):

```
$ critcmp bench/runs/2021-05-01_regex-and-bstr/raw.json -g 'memmem/[^/]+/(.*)' -f '/(krate|sliceslice)/' -t5
group                                                                2021-05-01/memmem/krate/               2021-05-01/memmem/sliceslice/
-----                                                                ------------------------               -----------------------------
oneshot/code-rust-library/never-fn-quux                              1.00     41.0±0.12µs    37.4 GB/sec    1.15     47.0±0.06µs    32.7 GB/sec
oneshot/code-rust-library/never-fn-strength-paren                    1.00     48.9±0.11µs    31.4 GB/sec    1.07     52.1±0.06µs    29.5 GB/sec
oneshot/code-rust-library/rare-fn-from-str                           1.00     14.0±0.01µs   109.3 GB/sec    1.42     20.0±0.05µs    76.7 GB/sec
oneshot/huge-en/never-all-common-bytes                               1.11     25.7±0.05µs    22.3 GB/sec    1.00     23.1±0.03µs    24.8 GB/sec
oneshot/huge-en/rare-huge-needle                                     1.54     39.5±0.05µs    14.5 GB/sec    1.00     25.6±0.05µs    22.3 GB/sec
oneshot/huge-en/rare-long-needle                                     1.00     18.8±0.02µs    30.4 GB/sec    1.41     26.4±0.04µs    21.6 GB/sec
oneshot/huge-en/rare-medium-needle                                   1.00     18.8±0.02µs    30.4 GB/sec    2.20     41.4±0.06µs    13.8 GB/sec
oneshot/huge-en/rare-sherlock-holmes                                 1.00     17.4±0.02µs    32.8 GB/sec    1.47     25.5±0.04µs    22.4 GB/sec
oneshot/huge-ru/never-john-watson                                    1.00     17.4±0.02µs    32.9 GB/sec    3.97     68.9±0.12µs     8.3 GB/sec
oneshot/huge-ru/rare-sherlock                                        1.00     17.6±0.02µs    32.4 GB/sec    2.11     37.2±0.06µs    15.4 GB/sec
oneshot/huge-ru/rare-sherlock-holmes                                 1.00     15.3±0.01µs    37.4 GB/sec    3.90     59.6±0.09µs     9.6 GB/sec
oneshot/huge-zh/never-john-watson                                    1.00     16.4±0.02µs    34.9 GB/sec    1.60     26.1±0.02µs    21.9 GB/sec
oneshot/huge-zh/rare-sherlock                                        1.00     18.0±0.03µs    31.8 GB/sec    1.23     22.2±0.02µs    25.8 GB/sec
oneshot/huge-zh/rare-sherlock-holmes                                 1.00     18.5±0.02µs    31.0 GB/sec    1.56     28.8±0.02µs    19.8 GB/sec
oneshot/pathological-defeat-simple-vector-freq/rare-alphabet         1.00     68.7±0.06µs     9.8 GB/sec    6.03    413.9±0.96µs  1658.9 MB/sec
oneshot/pathological-defeat-simple-vector-repeated/rare-alphabet     1.00   1229.7±2.07µs   558.4 MB/sec    2.17      2.7±0.00ms   257.0 MB/sec
oneshot/pathological-defeat-simple-vector/rare-alphabet              1.77    208.6±0.72µs     2.5 GB/sec    1.00    117.8±0.19µs     4.3 GB/sec
oneshot/pathological-md5-huge/never-no-hash                          1.00      5.7±0.02µs    24.9 GB/sec    1.62      9.2±0.02µs    15.4 GB/sec
oneshot/pathological-md5-huge/rare-last-hash                         1.00      6.0±0.02µs    23.5 GB/sec    1.49      8.9±0.02µs    15.8 GB/sec
oneshot/pathological-repeated-rare-small/never-tricky                1.21     62.5±0.16ns    14.9 GB/sec    1.00     51.7±0.27ns    18.0 GB/sec
oneshot/teeny-en/never-all-common-bytes                              1.00     24.1±0.04ns  1110.3 MB/sec    1.19     28.7±0.03ns   929.5 MB/sec
oneshot/teeny-en/never-john-watson                                   1.00     24.5±0.05ns  1089.4 MB/sec    1.18     29.0±0.03ns   921.7 MB/sec
oneshot/teeny-en/never-some-rare-bytes                               1.00     24.3±0.03ns  1100.1 MB/sec    1.14     27.7±0.02ns   965.0 MB/sec
oneshot/teeny-en/never-two-space                                     1.00     23.8±0.04ns  1121.8 MB/sec    1.17     27.8±0.04ns   961.9 MB/sec
oneshot/teeny-en/rare-sherlock                                       1.00     19.9±0.02ns  1340.9 MB/sec    1.41     28.1±0.04ns   951.8 MB/sec
oneshot/teeny-en/rare-sherlock-holmes                                1.00     26.1±0.04ns  1024.4 MB/sec    1.69     43.9±0.14ns   607.7 MB/sec
oneshot/teeny-ru/never-john-watson                                   1.00     33.1±0.03ns  1211.9 MB/sec    1.12     36.9±0.07ns  1085.1 MB/sec
oneshot/teeny-ru/rare-sherlock                                       1.00     27.3±0.05ns  1467.6 MB/sec    1.09     29.7±0.05ns  1348.2 MB/sec
oneshot/teeny-ru/rare-sherlock-holmes                                1.00     35.0±0.06ns  1144.9 MB/sec    1.13     39.4±0.06ns  1016.3 MB/sec
oneshot/teeny-zh/never-john-watson                                   1.00     26.9±0.06ns  1100.9 MB/sec    1.26     33.9±0.04ns   871.2 MB/sec
oneshot/teeny-zh/rare-sherlock                                       1.00     19.2±0.03ns  1540.4 MB/sec    1.46     27.9±0.08ns  1058.5 MB/sec
oneshot/teeny-zh/rare-sherlock-holmes                                1.00     28.8±0.03ns  1025.1 MB/sec    1.60     46.2±0.09ns   639.4 MB/sec
prebuilt/code-rust-library/never-fn-quux                             1.00     41.3±0.04µs    37.2 GB/sec    1.13     46.5±0.07µs    33.0 GB/sec
prebuilt/code-rust-library/never-fn-strength-paren                   1.00     48.1±0.07µs    31.9 GB/sec    1.08     52.1±0.09µs    29.5 GB/sec
prebuilt/code-rust-library/rare-fn-from-str                          1.00     14.0±0.02µs   109.6 GB/sec    1.43     20.0±0.02µs    76.9 GB/sec
prebuilt/huge-en/never-all-common-bytes                              1.11     25.6±0.05µs    22.3 GB/sec    1.00     23.0±0.04µs    24.9 GB/sec
prebuilt/huge-en/rare-huge-needle                                    1.48     37.4±0.05µs    15.3 GB/sec    1.00     25.3±0.04µs    22.6 GB/sec
prebuilt/huge-en/rare-long-needle                                    1.00     17.9±0.02µs    31.9 GB/sec    1.48     26.4±0.02µs    21.6 GB/sec
prebuilt/huge-en/rare-medium-needle                                  1.00     18.7±0.02µs    30.5 GB/sec    2.20     41.2±0.08µs    13.9 GB/sec
prebuilt/huge-en/rare-sherlock-holmes                                1.00     17.6±0.02µs    32.5 GB/sec    1.45     25.4±0.03µs    22.5 GB/sec
prebuilt/huge-ru/never-john-watson                                   1.00     17.3±0.01µs    33.0 GB/sec    3.98     68.9±0.09µs     8.3 GB/sec
prebuilt/huge-ru/rare-sherlock                                       1.00     17.6±0.03µs    32.5 GB/sec    2.11     37.0±0.04µs    15.4 GB/sec
prebuilt/huge-ru/rare-sherlock-holmes                                1.00     15.3±0.02µs    37.2 GB/sec    3.87     59.4±0.08µs     9.6 GB/sec
prebuilt/huge-zh/never-john-watson                                   1.00     16.3±0.02µs    35.0 GB/sec    1.60     26.1±0.02µs    21.9 GB/sec
prebuilt/huge-zh/rare-sherlock                                       1.00     17.9±0.02µs    31.8 GB/sec    1.22     22.0±0.02µs    26.0 GB/sec
prebuilt/huge-zh/rare-sherlock-holmes                                1.00     18.4±0.02µs    31.1 GB/sec    1.56     28.7±0.04µs    19.9 GB/sec
prebuilt/pathological-defeat-simple-vector-freq/rare-alphabet        1.00     68.3±0.07µs     9.8 GB/sec    6.06    414.0±0.59µs  1658.5 MB/sec
prebuilt/pathological-defeat-simple-vector-repeated/rare-alphabet    1.00   1227.6±2.00µs   559.4 MB/sec    2.17      2.7±0.00ms   257.3 MB/sec
prebuilt/pathological-defeat-simple-vector/rare-alphabet             1.77    207.9±0.39µs     2.5 GB/sec    1.00    117.7±0.17µs     4.4 GB/sec
prebuilt/pathological-md5-huge/never-no-hash                         1.00      5.6±0.01µs    25.4 GB/sec    1.64      9.1±0.02µs    15.5 GB/sec
prebuilt/pathological-md5-huge/rare-last-hash                        1.00      5.9±0.01µs    23.9 GB/sec    1.51      8.9±0.04µs    15.8 GB/sec
prebuilt/pathological-repeated-rare-small/never-tricky               1.20     36.0±0.04ns    25.9 GB/sec    1.00     30.1±0.23ns    31.0 GB/sec
prebuilt/sliceslice-i386/words                                       1.00     23.2±0.04ms        0 B/sec    1.15     26.6±0.02ms        0 B/sec
prebuilt/sliceslice-words/words                                      1.06     81.1±0.06ms        0 B/sec    1.00     76.8±0.09ms        0 B/sec
prebuilt/teeny-en/never-all-common-bytes                             1.56      9.1±0.01ns     2.9 GB/sec    1.00      5.8±0.07ns     4.5 GB/sec
prebuilt/teeny-en/never-john-watson                                  1.77      9.1±0.02ns     2.9 GB/sec    1.00      5.1±0.00ns     5.1 GB/sec
prebuilt/teeny-en/never-some-rare-bytes                              1.68      9.0±0.01ns     2.9 GB/sec    1.00      5.4±0.01ns     4.8 GB/sec
prebuilt/teeny-en/never-two-space                                    1.67      8.8±0.01ns     3.0 GB/sec    1.00      5.3±0.05ns     5.0 GB/sec
prebuilt/teeny-en/rare-sherlock                                      2.06      9.5±0.01ns     2.7 GB/sec    1.00      4.6±0.03ns     5.6 GB/sec
prebuilt/teeny-en/rare-sherlock-holmes                               1.00     10.5±0.01ns     2.5 GB/sec    1.77     18.5±0.05ns  1439.6 MB/sec
prebuilt/teeny-ru/never-john-watson                                  1.10      8.1±0.01ns     4.8 GB/sec    1.00      7.4±0.01ns     5.3 GB/sec
prebuilt/teeny-ru/rare-sherlock                                      1.56      9.5±0.02ns     4.1 GB/sec    1.00      6.1±0.01ns     6.4 GB/sec
prebuilt/teeny-ru/rare-sherlock-holmes                               1.28     12.6±0.02ns     3.1 GB/sec    1.00      9.8±0.01ns     4.0 GB/sec
prebuilt/teeny-zh/never-john-watson                                  1.23      9.1±0.01ns     3.2 GB/sec    1.00      7.3±0.01ns     3.9 GB/sec
prebuilt/teeny-zh/rare-sherlock                                      2.38     10.2±0.03ns     2.8 GB/sec    1.00      4.3±0.03ns     6.7 GB/sec
prebuilt/teeny-zh/rare-sherlock-holmes                               1.25     19.5±0.04ns  1516.0 MB/sec    1.00     15.7±0.02ns  1888.1 MB/sec
```

(You can run that same command on a checkout of the memchr repo without having to run the benchmarks.)

Two of the benchmarks, prebuilt/sliceslice-i386/words and prebuilt/sliceslice-words/words, are meant to be replicas of your long_haystack and short_haystack benchmarks, respectively. It is interesting that they are so much closer in my harness than in yours. (When I run the new benchmarks from your PR, I can indeed see a noticeable benefit towards sliceslice.) I can't immediately spot any differences in the benchmarking code, though. With that said, I am tracking a large number of other benchmarks too.

Note that I do expect sliceslice to have a bit better latency in many cases. It isn't quite handling the same cases as the memmem implementation in memchr, nor does it provide the same guarantees.

> Additionally, I have run our internal benchmarks against our regex-heavy test suite with the new version of the regex crate that uses memchr::memmem, and I have noticed a 2% regression in instruction count. Not a big deal, but I figured you might be interested. Unfortunately I did not have the time to investigate further.

Yeah I don't track perf regressions by instruction count, so this is difficult for me to interpret. I do expect some cases to get slightly slower since memchr on an actually rare byte is hard to beat with the "Generic SIMD" algorithm. But the "Generic SIMD" algorithm has far fewer weaknesses.

@marmeladema (Collaborator)

Ok so it seems we are not actually measuring the same things.

I never actually measure wall-clock time because, in my experience, it's very hard to get reproducible results on any recent Intel x86_64 hardware, whereas measuring instructions, even though it is far from perfect, at least gives pretty reproducible results. To get acceptable results you would need to configure your kernel to isolate a physical core, remove IRQs for that core, and disable all frequency-scaling features (like turbo mode), and even with all of that, I cannot get reproducible results on my laptop with an Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz.

From run to run of the exact same binary with cargo criterion, I can get up to ±50% in wall-clock time, which makes me not really trust any numbers based on wall-clock measurement :/
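Concretely, the kind of setup described above looks roughly like this on Linux (standard cpupower/perf/taskset invocations; the benchmark binary path is a placeholder):

```
# Pin the CPU frequency governor and disable turbo (intel_pstate systems):
$ sudo cpupower frequency-set --governor performance
$ echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# Pin the run to one core and count retired user-space instructions
# instead of measuring wall-clock time:
$ taskset -c 3 perf stat -e instructions:u ./target/release/my-bench
```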

@BurntSushi commented May 3, 2021

Yeah, I don't experience that same level of noise. But to be clear, this isn't just an instruction-count thing: the wall-clock time differences in your benchmarks are bigger than in mine, and those differences are stable for me. (I run my benchmarks on an otherwise idle machine with the CPU frequency governor set to performance, and that's generally been good enough for me.)

Digging more, I've spotted one inconsistency. What I thought was your "long haystack" benchmark wasn't quite correct. Some time ago, it changed from using i386.txt to haystack, where the latter is considerably shorter. Upon adding a benchmark using your haystack instead of i386, I can indeed reproduce that sliceslice is a bit faster on that benchmark. It looks like your inlined memcmp is helping there. :-) My implementation is also getting poor codegen for its hot loop, which is something I've struggled with during development. Compare this from my benchmarks

[screenshot: codegen-mybench]

and this in yours

[screenshot: codegen-yourbench]

(Those are both from memchr::memmem. The former is from my harness, whereas the latter is from yours. Ironically, this is also a great example of why I'm not a huge fan of instruction counts: they both have the same number of instructions, but the "better" codegen has one more jump and two fewer loads from memory.)

So I think there's still some stuff to work out here to convince the compiler to give better codegen more consistently. But it's been pretty hit or miss for me. :-/

BurntSushi added a commit to BurntSushi/memchr that referenced this pull request May 3, 2021
It turns out that upstream's 'long_haystack' benchmark changed from using
the i386 corpus to something different (and also smaller). We keep the
i386 benchmark, but add a benchmark corresponding to sliceslice's
current 'long_haystack' benchmark.

See: cloudflare/sliceslice-rs#26
@marmeladema (Collaborator)

> Digging more, I've spotted one inconsistency. What I thought was your "long haystack" benchmark wasn't quite correct. Some time ago, it changed from using i386.txt to haystack, where the latter is considerably shorter.

Wow, I am not sure that was an intended change :/ I will have to look at it more closely.

@marmeladema (Collaborator)

Ok @BurntSushi, thank you very much for finding the unintended benchmark change! I have pushed a fix and updated the results in #37.

@BurntSushi

@marmeladema Aye, thanks. I think our measurements roughly align now, with the only remaining unknown being the difference in codegen for memchr::memmem. I'll see about digging into that later.

I would definitely encourage you to take a look at the benchmarks in memchr. There are a lot of them, and I tried to document some portion of them in README files: https://github.com/BurntSushi/memchr/tree/master/bench/data --- there are a lot of interesting cases to consider, although I'm sure the benchmarks in sliceslice more closely reflect your specific workload.

@marmeladema (Collaborator)

@BurntSushi yes, I definitely will! Those are very good to have, and the results you posted are incredibly useful, because it seems there are quite a few big outliers that are not handled properly in our implementation.

@marmeladema mentioned this pull request May 3, 2021
@marmeladema (Collaborator)

Closing this; it seems this PR has already been the place of too much unrelated discussion ^^
I have created #39 so we don't forget about making the crate no_std compatible "someday".

@BurntSushi I am not closing this to end the discussion; we can probably continue in a dedicated issue or somewhere else more appropriate 👍

@BurntSushi

@marmeladema Aye no worries. I think we hit a good point anyway! Very happy to get pinged and talk about stuff! And would always love to add more benchmarks.
