Some performance tweaks #19

wants to merge 7 commits into
vmg commented Aug 4, 2012


I wanted to play around with your awesome bloom library for some internal super-secret GitHub stuff (HAH), but as it turns out, I found it a tad lacking on raw performance, despite being all ballin' and stuff when it comes to memory usage.

I ran some instrumentation on the library, made some minor (major?) changes, and wanted to share them back. Here's the rationale:

"So I was like, let me trace this, yo"


This is the first instrumentation I ran with the original code (unmodified). The corpus is the words file that ships with Mac OS 10.7:

$ wc /usr/share/dict/words
  235886  235886 2493109 /usr/share/dict/words

Clearly, the two main bottlenecks are:

  • 53.7% (1274.0ms) + 9.3% (221.0ms) + 2.9% (70.0ms) spent just in core hashing (without counting the hash finalization for MD5). As it turns
  • More than 5% spent in lseek, when the code only uses lseek in one place... (??)

Everything else in the trace was noise, so I decided to go HAM on just these two things. The lowest hanging fruit is certainly MD5: It is not a general purpose hash function, so its use in a bloom filter is ill-advised.

Sex & the CityHash

My first choice for a replacement was Google's CityHash (ayuup, I drank the Google koolaid -- I'm a moron and deserve to be mocked). I left the original commit in the branch for reference.

This simple change halved the runtime, but traces were still showing way too much time spent in hashing. The cause? Well, the bloom filter requires a pretty wide nfunc size for most corpuses if you want reasonable error rates, but CityHash has only two hash modes: Either 64 bit, or 128 bit. Neither of these modes is optimal for bloom filter operation.

  • 64-bit hash output is (supposedly) optimized for small strings, which is not our target corpus at GitHub, although it should perform well with the words file of this synthetic benchmark. In practice, we end up performing too many calls to CityHash to fill the nfunc spectrum because of the small output size.

  • 128-bit (also know as brain-damage-mode) performs poorly for small strings (by poorly I mean worse than other highly-optimized general purpose hashes) and doesn't offer any other specific advantages besides the adequate output word size.

To top off this disaster, CityHash doesn't really have a native "seeded mode". The seed API performs a standard hash and then an extra iteration (??) on top of the result to mix in the seed, instead of seeding the standard hash initially.

...So I killed CityHash with fire.

Enter Murmur

MurmurHash has always been my favorite real-world hash function, and in retrospect I should have skipped City and gone straight for it.

It offers brilliant performance for all kind of string values and is always linear with string sizes without requiring special-casing for short strings. It also takes alignment issues into account.

To top it off, Murmur doesn't return the hash value on the stack/registers but writes it directly to a provided buffer. This makes it exceedingly easy to fill the bloom->hashes buffer with a lot of random data and perform the modularization incrementally.

    for (i = 0; i < bloom->nsalts; i++, hashes += 4) {
        MurmurHash3_x64_128(key, key_len, bloom->salts[i], hashes);
        hashes[0] = hashes[0] % bloom->counts_per_func;
        hashes[1] = hashes[1] % bloom->counts_per_func;
        hashes[2] = hashes[2] % bloom->counts_per_func;
        hashes[3] = hashes[3] % bloom->counts_per_func;

(note that we have aligned the hashes buffer to 16-bytes to prevent corner-case overflow checks). This is simple and straightforward, and makes my nipples tingle. n salts, and each salt throws 128 bits. Wrap'em and we're done here!

Enlarge your files

After dropping in an optimal hash function, the instrumentation showed a hilariously high percent of time spent in the kernel performing lseeks. I wondered where it was coming from...

        for (; size < new_size; size++) {
            if (lseek(fd, size, SEEK_SET) < 0) {
                perror("Error, calling lseek() to set file size");
                return NULL;

Apparently the code to resize a file on the filesystem was performing an absolute seek for every single byte that the file had to be increased. This is... heuh... I don't know if this is for compatibility reasons, but the POSIX standard defines a very very handy ftruncate call:

The truncate() and ftruncate() functions cause the regular file named by path or referenced by fd to be truncated to a size of precisely length bytes.

If the file previously was larger than this size, the extra data is lost. If the file previously was shorter, it is extended, and the extended part reads as null bytes ('\0').

This works wonders on both Mac OS X and Linux, and lets the kernel fill the file efficiently with those pesky NULL bytes, even in highly fragmented filesystems. After replacing the lseek calls with a ftruncate, all kernel operations (including the mmaps) became noise in the instrumentation. Awesome!

This is where we're at now


As far as I'm concerned, the instrumentation trace has been obliterated.

  • Murmur cannot be made any faster, that's the way it is.
  • hash_func is stalling with all the modulo operations (even though they have no interdeps and should be going simultaneously on the pipeline, I think...). There are no SIMD modulo instructions, so I don't see how to work around this.
  • All the small bumps there come from the actual test program, not the library itself. Mostly strchr for splitting up the words in the dictionary file.
  • bitmap_check and bitmap_increment are tiny and fast. Nothing to do here. :/
  • Everything else is noise.

Also, binary strings

This is not performance related (at least not directly), but it totally bummed me that the API was requiring NULL-terminated strings, specially since I'm pretty sure you wrote this to be wrapped from dynamic languages, and all these languages incur on a penalty when asking for a NULL-terminated string (see: Python string slices yo, that's some memory being duped all over the place for NULL-termination) instead of grabbing the raw buffer + it's length.

I've changed the API accordingly, adding a len argument to all calls. Obviously, NULL-terminated strings can still be used by passing strlen(string) in the external API, instead of performing the measurement internally like before.

Final benchmarks

Averaged 8 runs for the original code, words is still the corpus.

Run 1: 2.182463s
Run 2: 2.177441s
Run 3: 2.174175s
Run 4: 2.178066s
Run 5: 2.190548s
Run 6: 2.179080s
Run 7: 2.180691s
Run 8: 2.184210s
AVG: 2.180834

Averaged 8 runs for the updated code, same corpus.

Run 1: 0.321654s
Run 2: 0.310658s
Run 3: 0.314666s
Run 4: 0.307526s
Run 5: 0.311680s
Run 6: 0.316963s
Run 7: 0.307528s
Run 8: 0.309479s
AVG: 0.312519

700% increase on this synthetic benchmark. For our specific corpus (bigger count, strings significantly larger than dictionary words), I get a 1300% increase. This is basically Murmur at work. Results may vary.

Hey you piece of shit did you break the filtering?

Don't think so. Murmur generates very high quality entropy, high enough to come close to MD5 for all measurements.

It's on my TODO list to perform some tests and see if there's an statistically significant variance on the amount of false positives between the two hash functions. Anecdotally, for the words dataset, MD5 was generating 1859 positives, while Murmur decreased that to 1815. THIS IS NOT SCIENTIFIC.

Common sense tells us that MD5, being cryptographically sound, should always stay ahead on pure entropy measurement, but the avalanching properties of Murmur are gorgeous. So I'm happy with this. 100% Pinkie Pie approved.


Ayup, as far as I'm concerned this now has acceptable performance to start building big stuff with it. I may look into making this even faster when I can play with more of our real-world data.

I understand these are very serious changes coming out of fucking nowhere, so I don't expect this to be merged straight away. Feel free to look at them with a magnifying glass, test it, see how it performs with your corpus (I assume they are links?), call me a moron and set my ass on 🔥... Anyway, you know how the song goes.

Hey, I just met you, and this is crazy, but I rewrote your bloom hashes, so merge me, maybe?

vmg added some commits Aug 4, 2012
@vmg vmg Switch MD5 with City 82a791d
@vmg vmg ...and this is how you resize a file in Unix
"The truncate() and ftruncate() functions cause the regular file named
by path or referenced by fd to be truncated to a size of precisely
length bytes.

If the file previously was larger than this size, the extra data is
lost. If the file previously was shorter, it is extended, and the
extended part reads as null bytes ('\0')."
@vmg vmg Add support for binary keys
There's no need for the keys to be NULL-terminated... That's so from the

Also, note that Python strings (and pretty much any dynamic language)
already keep length information. This will save us quite a few
`strlen` calls.
@vmg vmg Ok, disregard City that was a bad idea 21ae03e
@vmg vmg Documentation is nice! f88ec1c
@vmg vmg Oops! Indentation! cc4b676
@vmg vmg Better seeds? 86a865c

You should have seen how slow it was in pure python...

Anyway, I like the switch to a more efficient hash function, I like the change from NULL-terminated string to buffer+length, I like the fix of that crazy fseek loop.

Right now I'm tracking down what I think might be a bug, but (depending on what @hines and @mreiferson think) this could go in soon after.


I notice that 3 different murmur hash functions are included, and 2 of them are used... maybe the x64_128 one isn't very appropriate for generating the seeds, but it would be nice if just one hash function was provided, and used.



Honestly though, thanks for this excellent contribution! We're really excited to see all the interest in this project and we're glad GitHub can find some use for it.

Also, I'm on board with these changes.

Can't argue much with the raw speed improvements of the change to the hash function + file resize.

I also completely agree with the API change to take length args. It is more robust, was overlooked, and we might as well make the breaking changes now while the project is young.

We'll need to do some testing early this week and as @ploxiln mentioned theres a possible bug being investigated so we'll bring this in soon.

Thanks and keep 'em coming!

vmg commented Aug 6, 2012


Glad you like the changes! Sorry it took me a while to answer, I was watching SCIENCE.

I notice that 3 different murmur hash functions are included, and 2 of them are used... maybe the x64_128 one isn't very appropriate for generating the seeds, but it would be nice if just one hash function was provided, and used.

I see what you mean. I brought Murmur mostly intact from the original C++ sources (just with minimal translation to make it build as C), but it would make things simpler if we were just using the same hash for generating the salts and the key hashes. I'll look into this.

Regarding Murmur_x86_128 vs Murmur_x64_128: we could certainly drop the x86 version if you're not concerned about 32-bit performance (I am not). If you plan to target x86 systems, I can conditionally swap the appropriate hash at build time. You'll get a nice performance boost from the smaller (internal) word size.


We think you should drop the x86 version; we're not concerned about absolute best 32-bit performance, and it'll be faster than md5 on x86 anyway.

@mreiferson mreiferson commented on the diff Aug 6, 2012
- }
- bloom->num_salts = div;
- bloom->salts = calloc(div, SALT_SIZE);
- for (i = 0; i < div; i++) {
- struct cvs_MD5Context context;
- unsigned char checksum[16];
- cvs_MD5Init (&context);
- cvs_MD5Update (&context, (unsigned char *) &i, sizeof(int));
- cvs_MD5Final (checksum, &context);
- memcpy(bloom->salts + i * SALT_SIZE, &checksum, SALT_SIZE);
+ const uint32_t root = 0xba11742c;
+ const uint32_t seed = 0xd5702acb;
+ int i, num_salts = bloom->nfuncs / 4;
+ if (bloom->nfuncs % 4)
mreiferson Aug 6, 2012

we should document this better in the README...

we use astyle on our C stuff, specifically this command line:

astyle --style=1tbs --lineend=linux --convert-tabs --preserve-date \
        --fill-empty-lines --pad-header --indent-switches           \
        --align-pointer=name --align-reference=name --pad-oper -n <file(s)>

the rest of your changes look fine, but this line would get the curly brace police all over it :)

also, we notably dont run astyle on code we've imported that isn't ours (like the md5/murmur files)

ploxiln Aug 6, 2012

by the way, we use revision 353 from the astyle sourceforge svn repo; it has some fixes not present in the latest release; I'll definitly make note of this stuff in the README

mreiferson Aug 6, 2012

now documented in the README

jinghao Aug 9, 2012

Micro-optimization: %4 is the same as & 3

klickverbot Aug 9, 2012

@jinghao: Wouldn't any decent compiler backend transform integer arithmetic with constant parameters into the most efficient representation on the target automatically?

ploxiln Aug 9, 2012

In case anyone had doubts, yes, gcc optimises trivial short computations like x/4 and x%4, even without -O2. For demonstration:


void new_salts(counting_bloom_t *bloom)
    int div = bloom->nfuncs / 4;
    int mod = bloom->nfuncs % 4;

the assembly generated

objdump -d build/test_dablooms | less
000000000040184c <new_salts>:
  40184c:       55                             push   %rbp
  40184d:       48 89 e5                       mov    %rsp,%rbp
  401850:       48 81 ec a0 00 00 00           sub    $0xa0,%rsp
  401857:       48 89 bd 68 ff ff ff           mov    %rdi,-0x98(%rbp)
  40185e:       48 8b 85 68 ff ff ff           mov    -0x98(%rbp),%rax
  401865:       48 8b 40 28                    mov    0x28(%rax),%rax
  401869:       48 c1 e8 02                    shr    $0x2,%rax                /* ">> 2" */
  40186d:       89 45 fc                       mov    %eax,-0x4(%rbp)
  401870:       48 8b 85 68 ff ff ff           mov    -0x98(%rbp),%rax
  401877:       48 8b 40 28                    mov    0x28(%rax),%rax
  40187b:       83 e0 03                       and    $0x3,%eax                /* "& 3" */

@vmg can you please rebase on current master (there's going to be conflicts in the test_libdablooms.c due to recently merged changes of mine, sorry), and squash into 3 commits:

1) replace lseek() loop with lseek() || ftruncate()
2) replace md5 hash with murmur 128-bit hash
3) change interface to take buffer-length instead of null-terminated string (this will require fixes to the new also)

thanks :)

vmg commented Aug 9, 2012

Sorry for the delay, I've just landed on SF. I'll rebase the PR as soon as the jet lag allows.

(...why is this in the frontpage of HN?)

@jinghao jinghao commented on the diff Aug 9, 2012
+ uint32_t h4 = seed;
+ uint32_t c1 = 0x239b961b;
+ uint32_t c2 = 0xab0e9789;
+ uint32_t c3 = 0x38b34ae5;
+ uint32_t c4 = 0xa1e38b93;
+ int i;
+ //----------
+ // body
+ const uint32_t * blocks = (const uint32_t *)(data + nblocks*16);
+ for(i = -nblocks; i; i++) {
+ uint32_t k1 = getblock(blocks,i*4+0);
jinghao Aug 9, 2012

Micro-optimization: << 2 is the same as * 4

(i << 4) | 1 is the same as i*4 + 1

same with 2, 3

klickverbot Aug 9, 2012

This is in MurmurHash code, so if you think this is worth the loss in readability, you might want to open an issue at But then again, don't underestimate optimizing compilers – in my experience, most backends recognize simple peephole optimizations like this just fine.

@jinghao jinghao commented on the diff Aug 9, 2012
+ uint64_t h1 = seed;
+ uint64_t h2 = seed;
+ uint64_t c1 = BIG_CONSTANT(0x87c37b91114253d5);
+ uint64_t c2 = BIG_CONSTANT(0x4cf5ad432745937f);
+ int i;
+ //----------
+ // body
+ const uint64_t * blocks = (const uint64_t *)(data);
+ for(i = 0; i < nblocks; i++) {
+ uint64_t k1 = getblock(blocks,i*2+0);
jinghao Aug 9, 2012

i << 1
(i << 1) | 1

mattgodbolt Aug 9, 2012

As discussed above, compilers are smart enough to do appropriate transforms of this type nowadays. In fact, the (i<<1)|1 is a slight pessimization in my tests.



This is the best pull request I have ever seen. You obvisouly know your stuff @vmg the development community should very grateful for your contributions


Note that a Bloom filter only needs two hash functions: you can get an arbitrary number of hash values from a linear combination of the output of the two functions. This does not harm the accuracy of the Bloom filter compared to having many independent hash functions. See this paper for details:


"(...why is this in the frontpage of HN?)"

Nipple touching and SCIENCE watching.

This was referenced Aug 10, 2012

I've submitted a rebased version of this pull request as #39. I've kept authorship credits to @vmg (but please let me know if you don't want your name on the changes i made while rebasing and squashing)


We really liked this pull request, and it's now merged, although as #39. Thanks!

@ploxiln ploxiln closed this Aug 16, 2012

Oh shit. Sorry guys, I've spent the last week mostly sick, so this went totally over my head. Thanks tons for taking the time to squash and merge this, @ploxiln.

I'm going to strike back with faster hashing thanks to linear combination. Stay tuned.


might you be talking about something similar to #41 - if so, we're already looking at that


@vmg what tool did you use to generate this:


That is Apple's Instruments, in profiling mode, for Mac OS X 10.7.


best pull request ever.

sorry for the necropost, but I was digging for examples for some of my devs and I had to comment.

@mehulkar mehulkar referenced this pull request in halls-of-fame/github Sep 1, 2015

Some performance tweaks #37


@vmg This PR is referenced in this article Effective pull requests and other good practices for teams using github as excellent pull request.

I like your style btw. 😎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment