Some performance tweaks #19

Closed
wants to merge 7 commits

11 participants

Vicent Marti, Pierce Lopez, Matt Reiferson, Justin Hines, Nicholas Curtis, Tony Finch, J. Randall Hunt, Sylvester Jakubowski, Jinghao Yan, David Nadlinger, Matt Godbolt
Vicent Marti
vmg commented August 04, 2012

HAI FRIENDLY ENGINEERS OF BITLY I COME IN PEACE

I wanted to play around with your awesome bloom library for some internal super-secret GitHub stuff (HAH), but as it turns out, I found it a tad lacking on raw performance, despite being all ballin' and stuff when it comes to memory usage.

I ran some instrumentation on the library, made some minor (major?) changes, and wanted to share them back. Here's the rationale:

"So I was like, let me trace this, yo"

before

This is the first instrumentation I ran with the original code (unmodified). The corpus is the words file that ships with Mac OS 10.7:

$ wc /usr/share/dict/words
  235886  235886 2493109 /usr/share/dict/words

Clearly, the two main bottlenecks are:

  • 53.7% (1274.0ms) + 9.3% (221.0ms) + 2.9% (70.0ms) spent just in core hashing (without counting the hash finalization for MD5).
  • More than 5% spent in lseek, when the code only uses lseek in one place... (??)

Everything else in the trace was noise, so I decided to go HAM on just these two things. The lowest hanging fruit is certainly MD5: It is not a general purpose hash function, so its use in a bloom filter is ill-advised.

Sex & the CityHash

My first choice for a replacement was Google's CityHash (ayuup, I drank the Google koolaid -- I'm a moron and deserve to be mocked). I left the original commit in the branch for reference.

This simple change halved the runtime, but traces were still showing way too much time spent in hashing. The cause? Well, the bloom filter requires a pretty wide nfunc size for most corpuses if you want reasonable error rates, but CityHash has only two hash modes: Either 64 bit, or 128 bit. Neither of these modes is optimal for bloom filter operation.

  • 64-bit hash output is (supposedly) optimized for small strings, which is not our target corpus at GitHub, although it should perform well with the words file of this synthetic benchmark. In practice, we end up performing too many calls to CityHash to fill the nfunc spectrum because of the small output size.

  • 128-bit (also known as brain-damage-mode) performs poorly for small strings (by poorly I mean worse than other highly-optimized general purpose hashes) and doesn't offer any other specific advantage besides the adequate output word size.

To top off this disaster, CityHash doesn't really have a native "seeded mode". The seed API performs a standard hash and then an extra iteration (??) on top of the result to mix in the seed, instead of seeding the standard hash initially.

...So I killed CityHash with fire.

Enter Murmur

MurmurHash has always been my favorite real-world hash function, and in retrospect I should have skipped City and gone straight for it.

It offers brilliant performance for all kinds of string values, scales linearly with string size without requiring special-casing for short strings, and takes alignment issues into account.

To top it off, Murmur doesn't return the hash value on the stack/registers but writes it directly to a provided buffer. This makes it exceedingly easy to fill the bloom->hashes buffer with a lot of random data and perform the modularization incrementally.

    for (i = 0; i < bloom->nsalts; i++, hashes += 4) {
        MurmurHash3_x64_128(key, key_len, bloom->salts[i], hashes);
        hashes[0] = hashes[0] % bloom->counts_per_func;
        hashes[1] = hashes[1] % bloom->counts_per_func;
        hashes[2] = hashes[2] % bloom->counts_per_func;
        hashes[3] = hashes[3] % bloom->counts_per_func;
    }

(Note that we have aligned the hashes buffer to 16 bytes to avoid corner-case overflow checks.) This is simple and straightforward, and makes my nipples tingle. n salts, and each salt throws 128 bits. Wrap 'em and we're done here!
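The fill-and-reduce loop can be sketched end-to-end. Here `toy_hash128` is a made-up stand-in for MurmurHash3_x64_128 (far weaker mixing, illustration only); the shape of `fill_hashes` matches the loop above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for MurmurHash3_x64_128: writes 128 bits (four uint32_t
 * words) into `out`. NOT the real Murmur mixer -- just enough
 * deterministic entropy to illustrate the fill loop. */
static void toy_hash128(const char *key, size_t len, uint32_t seed, uint32_t *out)
{
    uint64_t h = 0x9e3779b97f4a7c15ULL ^ seed;
    size_t i;
    for (i = 0; i < len; i++) {
        h ^= (unsigned char)key[i];
        h *= 0x100000001b3ULL; /* FNV-style multiply */
    }
    out[0] = (uint32_t)h;
    out[1] = (uint32_t)(h >> 32);
    out[2] = (uint32_t)(h * 0x9e3779b1U);
    out[3] = (uint32_t)((h >> 16) * 0x85ebca77U);
}

/* Fill `nsalts * 4` bucket indexes: one 128-bit hash per salt,
 * each 32-bit word reduced modulo counts_per_func. */
static void fill_hashes(const char *key, size_t len, const uint32_t *salts,
                        int nsalts, uint32_t counts_per_func, uint32_t *hashes)
{
    int i;
    for (i = 0; i < nsalts; i++, hashes += 4) {
        toy_hash128(key, len, salts[i], hashes);
        hashes[0] %= counts_per_func;
        hashes[1] %= counts_per_func;
        hashes[2] %= counts_per_func;
        hashes[3] %= counts_per_func;
    }
}
```

Each salt yields four bucket indexes, all guaranteed to land inside [0, counts_per_func).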

Enlarge your files

After dropping in an optimal hash function, the instrumentation showed a hilariously high percent of time spent in the kernel performing lseeks. I wondered where it was coming from...

        for (; size < new_size; size++) {
            if (lseek(fd, size, SEEK_SET) < 0) {
                perror("Error, calling lseek() to set file size");
                free_bitmap(bitmap);
                close(fd);
                return NULL;
            }
        }

Apparently the code to resize a file on the filesystem was performing an absolute seek for every single byte that the file had to be increased. This is... heuh... I don't know if this is for compatibility reasons, but the POSIX standard defines a very very handy ftruncate call:

The truncate() and ftruncate() functions cause the regular file named by path or referenced by fd to be truncated to a size of precisely length bytes.

If the file previously was larger than this size, the extra data is lost. If the file previously was shorter, it is extended, and the extended part reads as null bytes ('\0').

This works wonders on both Mac OS X and Linux, and lets the kernel fill the file efficiently with those pesky NULL bytes, even on highly fragmented filesystems. After replacing the lseek calls with an ftruncate, all kernel operations (including the mmaps) became noise in the instrumentation. Awesome!

This is where we're at now

after

As far as I'm concerned, the instrumentation trace has been obliterated.

  • Murmur cannot be made any faster, that's the way it is.
  • hash_func is stalling with all the modulo operations (even though they have no interdeps and should be going simultaneously on the pipeline, I think...). There are no SIMD modulo instructions, so I don't see how to work around this.
  • All the small bumps there come from the actual test program, not the library itself. Mostly strchr for splitting up the words in the dictionary file.
  • bitmap_check and bitmap_increment are tiny and fast. Nothing to do here. :/
  • Everything else is noise. :sparkles:

Also, binary strings

This is not performance related (at least not directly), but it totally bummed me that the API required NULL-terminated strings, especially since I'm pretty sure you wrote this to be wrapped from dynamic languages, and all these languages incur a penalty when asked for a NULL-terminated string (see: Python string slices, yo; that's some memory being duped all over the place for NULL-termination) instead of handing over the raw buffer + its length.

I've changed the API accordingly, adding a len argument to all calls. Obviously, NULL-terminated strings can still be used by passing strlen(string) in the external API, instead of performing the measurement internally like before.
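The practical difference is easy to show: with an explicit length, keys containing embedded NUL bytes are processed in full, while strlen-based measurement silently stops at the first one. The `checksum` helper here is hypothetical, standing in for any (key, len) API call:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper standing in for any (key, len) style API:
 * it processes exactly `len` bytes, embedded NULs included. */
static uint32_t checksum(const char *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;
    for (i = 0; i < len; i++) {
        sum = sum * 31 + (unsigned char)buf[i];
    }
    return sum;
}
```

A caller holding a C string just passes strlen(s); a caller holding a binary buffer passes the real length and loses nothing.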

Final benchmarks

Averaged 8 runs for the original code, words is still the corpus.

Run 1: 2.182463s
Run 2: 2.177441s
Run 3: 2.174175s
Run 4: 2.178066s
Run 5: 2.190548s
Run 6: 2.179080s
Run 7: 2.180691s
Run 8: 2.184210s
AVG: 2.180834

Averaged 8 runs for the updated code, same corpus.

Run 1: 0.321654s
Run 2: 0.310658s
Run 3: 0.314666s
Run 4: 0.307526s
Run 5: 0.311680s
Run 6: 0.316963s
Run 7: 0.307528s
Run 8: 0.309479s
AVG: 0.312519

A ~7x speedup on this synthetic benchmark. For our specific corpus (bigger count, strings significantly larger than dictionary words), I get a ~13x speedup. This is basically Murmur at work. Results may vary.

Hey you piece of shit did you break the filtering?

Don't think so. Murmur generates very high quality entropy, high enough to come close to MD5 for all measurements.

It's on my TODO list to perform some tests and see if there's a statistically significant difference in the number of false positives between the two hash functions. Anecdotally, for the words dataset, MD5 was generating 1859 positives, while Murmur decreased that to 1815. THIS IS NOT SCIENTIFIC.

Common sense tells us that MD5, being cryptographically sound, should always stay ahead on pure entropy measurement, but the avalanching properties of Murmur are gorgeous. So I'm happy with this. 100% Pinkie Pie approved.

THAT'S IT

Ayup, as far as I'm concerned this now has acceptable performance to start building big stuff with it. I may look into making this even faster when I can play with more of our real-world data.

I understand these are very serious changes coming out of fucking nowhere, so I don't expect this to be merged straight away. Feel free to look at them with a magnifying glass, test it, see how it performs with your corpus (I assume they are links?), call me a moron and set my ass on :fire:... Anyway, you know how the song goes.

Hey, I just met you, and this is crazy, but I rewrote your bloom hashes, so merge me, maybe?

added some commits August 04, 2012
Vicent Marti Switch MD5 with City 82a791d
Vicent Marti ...and this is how you resize a file in Unix
"The truncate() and ftruncate() functions cause the regular file named
by path or referenced by fd to be truncated to a size of precisely
length bytes.

If the file previously was larger than this size, the extra data is
lost. If the file previously was shorter, it is extended, and the
extended part reads as null bytes ('\0')."
d2397a4
Vicent Marti Add support for binary keys
There's no need for the keys to be NULL-terminated... That's so from the
90s!

Also, note that Python strings (and pretty much any dynamic language)
already keep length information. This will save us quite a few
`strlen` calls.
03155a0
Vicent Marti Ok, disregard City that was a bad idea 21ae03e
Vicent Marti Documentation is nice! f88ec1c
Vicent Marti Oops! Indentation! cc4b676
Vicent Marti Better seeds? 86a865c
Pierce Lopez
Collaborator

You should have seen how slow it was in pure python...

Anyway, I like the switch to a more efficient hash function, I like the change from NULL-terminated strings to buffer+length, and I like the fix for that crazy lseek loop.

Right now I'm tracking down what I think might be a bug, but (depending on what @hines and @mreiferson think) this could go in soon after.

Pierce Lopez
Collaborator

I notice that 3 different murmur hash functions are included, and 2 of them are used... maybe the x64_128 one isn't very appropriate for generating the seeds, but it would be nice if just one hash function was provided, and used.

Matt Reiferson
Collaborator

tl;dr


Honestly though, thanks for this excellent contribution! We're really excited to see all the interest in this project and we're glad GitHub can find some use for it.

Also, I'm on board with these changes.

Can't argue much with the raw speed improvements of the change to the hash function + file resize.

I also completely agree with the API change to take length args. It is more robust, was overlooked, and we might as well make the breaking changes now while the project is young.

We'll need to do some testing early this week, and as @ploxiln mentioned there's a possible bug being investigated, so we'll bring this in soon.

Thanks and keep 'em coming!

Justin Hines
Collaborator
Vicent Marti
vmg commented August 06, 2012

:sparkles: Yey :sparkles:

Glad you like the changes! Sorry it took me a while to answer, I was watching SCIENCE.

I notice that 3 different murmur hash functions are included, and 2 of them are used... maybe the x64_128 one isn't very appropriate for generating the seeds, but it would be nice if just one hash function was provided, and used.

I see what you mean. I brought Murmur mostly intact from the original C++ sources (just with minimal translation to make it build as C), but it would make things simpler if we were just using the same hash for generating the salts and the key hashes. I'll look into this.

Regarding Murmur_x86_128 vs Murmur_x64_128: we could certainly drop the x86 version if you're not concerned about 32-bit performance (I am not). If you plan to target x86 systems, I can conditionally swap the appropriate hash at build time. You'll get a nice performance boost from the smaller (internal) word size.
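A hedged sketch of what that build-time swap could look like; the macro and function are invented for illustration and are not dablooms' actual build machinery:

```c
#include <assert.h>
#include <stdint.h>

/* Pick the hash variant's internal word width at build time, based on
 * the target architecture. Illustrative only. */
#if defined(__x86_64__) || defined(_M_X64) || defined(__aarch64__)
#  define BLOOM_HASH_IS_64BIT 1
#else
#  define BLOOM_HASH_IS_64BIT 0
#endif

static int hash_width_bits(void)
{
#if BLOOM_HASH_IS_64BIT
    return 64; /* would dispatch to MurmurHash3_x64_128 */
#else
    return 32; /* would dispatch to MurmurHash3_x86_128 */
#endif
}
```

Both variants emit 128 bits; only the internal word size (and therefore the per-arch throughput) differs.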

Pierce Lopez
Collaborator

We think you should drop the x86 version; we're not concerned about absolute best 32-bit performance, and it'll be faster than md5 on x86 anyway.

Matt Reiferson mreiferson commented on the diff August 06, 2012
src/dablooms.c
((38 lines not shown))
-    }
-    bloom->num_salts = div;
-    bloom->salts = calloc(div, SALT_SIZE);
-    for (i = 0; i < div; i++) {
-        struct cvs_MD5Context context;
-        unsigned char checksum[16];
-        cvs_MD5Init (&context);
-        cvs_MD5Update (&context, (unsigned char *) &i, sizeof(int));
-        cvs_MD5Final (checksum, &context);
-        memcpy(bloom->salts + i * SALT_SIZE, &checksum, SALT_SIZE);
+    const uint32_t root = 0xba11742c;
+    const uint32_t seed = 0xd5702acb;
+
+    int i, num_salts = bloom->nfuncs / 4;
+
+    if (bloom->nfuncs % 4)
Matt Reiferson Collaborator

we should document this better in the README...

we use astyle on our C stuff, specifically this command line:

astyle --style=1tbs --lineend=linux --convert-tabs --preserve-date \
        --fill-empty-lines --pad-header --indent-switches           \
        --align-pointer=name --align-reference=name --pad-oper -n <file(s)>

the rest of your changes look fine, but this line would get the curly brace police all over it :)

also, we notably don't run astyle on code we've imported that isn't ours (like the md5/murmur files)

Pierce Lopez Collaborator
ploxiln added a note August 06, 2012

by the way, we use revision 353 from the astyle sourceforge svn repo; it has some fixes not present in the latest release. I'll definitely make note of this stuff in the README

Matt Reiferson Collaborator

now documented in the README

Jinghao Yan
jinghao added a note August 09, 2012

Micro-optimization: %4 is the same as & 3

@jinghao: Wouldn't any decent compiler backend transform integer arithmetic with constant parameters into the most efficient representation on the target automatically?

Pierce Lopez Collaborator
ploxiln added a note August 09, 2012

In case anyone had doubts, yes, gcc optimises trivial short computations like x/4 and x%4, even without -O2. For demonstration:

src/dablooms.c:new_salts()

void new_salts(counting_bloom_t *bloom)
{
    int div = bloom->nfuncs / 4;
    int mod = bloom->nfuncs % 4;
...

the assembly generated

objdump -d build/test_dablooms | less
...
000000000040184c <new_salts>:
  40184c:       55                             push   %rbp
  40184d:       48 89 e5                       mov    %rsp,%rbp
  401850:       48 81 ec a0 00 00 00           sub    $0xa0,%rsp
  401857:       48 89 bd 68 ff ff ff           mov    %rdi,-0x98(%rbp)
  40185e:       48 8b 85 68 ff ff ff           mov    -0x98(%rbp),%rax
  401865:       48 8b 40 28                    mov    0x28(%rax),%rax
  401869:       48 c1 e8 02                    shr    $0x2,%rax                /* ">> 2" */
  40186d:       89 45 fc                       mov    %eax,-0x4(%rbp)
  401870:       48 8b 85 68 ff ff ff           mov    -0x98(%rbp),%rax
  401877:       48 8b 40 28                    mov    0x28(%rax),%rax
  40187b:       83 e0 03                       and    $0x3,%eax                /* "& 3" */
Pierce Lopez
Collaborator

@vmg can you please rebase on current master (there's going to be conflicts in the test_libdablooms.c due to recently merged changes of mine, sorry), and squash into 3 commits:

1) replace lseek() loop with lseek() || ftruncate()
2) replace md5 hash with murmur 128-bit hash
3) change interface to take buffer-length instead of null-terminated string (this will require fixes to the new test_pydablooms.py also)

thanks :)

Vicent Marti
vmg commented August 08, 2012

Sorry for the delay, I've just landed in SF. I'll rebase the PR as soon as the jet lag allows.

(...why is this in the frontpage of HN?)

Jinghao Yan jinghao commented on the diff August 09, 2012
src/murmur.c
((125 lines not shown))
+	uint32_t h4 = seed;
+
+	uint32_t c1 = 0x239b961b;
+	uint32_t c2 = 0xab0e9789;
+	uint32_t c3 = 0x38b34ae5;
+	uint32_t c4 = 0xa1e38b93;
+
+	int i;
+
+	//----------
+	// body
+
+	const uint32_t * blocks = (const uint32_t *)(data + nblocks*16);
+
+	for(i = -nblocks; i; i++) {
+		uint32_t k1 = getblock(blocks,i*4+0);
Jinghao Yan
jinghao added a note August 09, 2012

Micro-optimization: << 2 is the same as * 4

(i << 2) | 1 is the same as i*4 + 1

same with 2, 3

This is in MurmurHash code, so if you think this is worth the loss in readability, you might want to open an issue at http://code.google.com/p/smhasher/issues/list. But then again, don't underestimate optimizing compilers – in my experience, most backends recognize simple peephole optimizations like this just fine.
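For unsigned operands the two forms really are interchangeable (the shift leaves the low two bits zero, so OR-ing in a constant below the shift amount equals adding it), which is exactly the kind of rewrite compilers apply on their own. A quick sanity check:

```c
#include <assert.h>
#include <stdint.h>

/* (i << 2) leaves the low two bits zero, so ((i << 2) | j) == i*4 + j
 * for any j in 0..3. Shown for unsigned operands only: left-shifting
 * negative signed values is undefined behavior in C. */
static uint32_t times4_plus(uint32_t i, uint32_t j)
{
    return (i << 2) | j;
}
```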

Jinghao Yan jinghao commented on the diff August 09, 2012
src/murmur.c
((226 lines not shown))
+
+	uint64_t h1 = seed;
+	uint64_t h2 = seed;
+
+	uint64_t c1 = BIG_CONSTANT(0x87c37b91114253d5);
+	uint64_t c2 = BIG_CONSTANT(0x4cf5ad432745937f);
+
+	int i;
+
+	//----------
+	// body
+
+	const uint64_t * blocks = (const uint64_t *)(data);
+
+	for(i = 0; i < nblocks; i++) {
+		uint64_t k1 = getblock(blocks,i*2+0);
Jinghao Yan
jinghao added a note August 09, 2012

i << 1
(i << 1) | 1

As discussed above, compilers are smart enough to do appropriate transforms of this type nowadays. In fact, the (i<<1)|1 is a slight pessimization in my tests.

Cf. http://url.godbolt.org/shiftVsMultiply

Nicholas Curtis

This is the best pull request I have ever seen. You obviously know your stuff, @vmg; the development community should be very grateful for your contributions.

Tony Finch

Note that a Bloom filter only needs two hash functions: you can get an arbitrary number of hash values from a linear combination of the output of the two functions. This does not harm the accuracy of the Bloom filter compared to having many independent hash functions. See this paper for details: http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf
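The linear-combination trick from that paper (Kirsch-Mitzenmacher double hashing) is easy to sketch: derive every index as g_i(x) = h1(x) + i * h2(x) mod m, from only two base hash computations. The FNV-style base hashes below are placeholders, not anything dablooms actually uses:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder base hash (FNV-1a with a seed), standing in for the two
 * independent hash functions the scheme requires. */
static uint32_t base_hash(const char *key, size_t len, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    size_t i;
    for (i = 0; i < len; i++) {
        h = (h ^ (unsigned char)key[i]) * 16777619u;
    }
    return h;
}

/* Kirsch-Mitzenmacher double hashing: g_i(x) = h1(x) + i*h2(x) (mod m).
 * Two base hashes yield any number of bloom filter indexes. */
static void double_hash_indexes(const char *key, size_t len,
                                uint32_t nfuncs, uint32_t m, uint32_t *out)
{
    uint32_t h1 = base_hash(key, len, 0x9747b28cu);
    uint32_t h2 = base_hash(key, len, 0x85ebca6bu);
    uint32_t i;
    for (i = 0; i < nfuncs; i++) {
        out[i] = (uint32_t)((h1 + (uint64_t)i * h2) % m);
    }
}
```

With this, the per-key hashing cost stops growing with nfuncs, which is presumably the "faster hashing thanks to linear combination" vmg mentions below.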

Deleted user


"(...why is this in the frontpage of HN?)"

Nipple touching and SCIENCE watching.


Pierce Lopez ploxiln referenced this pull request August 10, 2012
Closed

Faster hash_func #38

Pierce Lopez ploxiln referenced this pull request August 13, 2012
Merged

Vmg perf branch #39

Pierce Lopez
Collaborator

I've submitted a rebased version of this pull request as #39. I've kept authorship credits to @vmg (but please let me know if you don't want your name on the changes I made while rebasing and squashing)

Pierce Lopez
Collaborator

We really liked this pull request, and it's now merged, albeit as #39. Thanks!

Pierce Lopez ploxiln closed this August 15, 2012
Vicent Marti
vmg commented August 17, 2012

Oh shit. Sorry guys, I've spent the last week mostly sick, so this went totally over my head. Thanks tons for taking the time to squash and merge this, @ploxiln. :sparkles:

I'm going to strike back with faster hashing thanks to linear combination. Stay tuned.

Pierce Lopez
Collaborator

might you be talking about something similar to #41 - if so, we're already looking at that

J. Randall Hunt

@vmg what tool did you use to generate this:
trace

Vicent Marti
vmg commented August 31, 2012

That is Apple's Instruments, in profiling mode, for Mac OS X 10.7. :sparkles:

Sylvester Jakubowski

best pull request ever.

sorry for the necropost, but I was digging for examples for some of my devs and I had to comment.


Showing 7 unique commits by 1 author.

2  Makefile
@@ -59,7 +59,7 @@ PY_MOD_DIR := $(shell $(PYTHON) -c "import distutils.sysconfig ; print(distutils
 PY_FLAGS = --build-lib=$(PY_BLDDIR) --build-temp=$(PY_BLDDIR)
 PY_BLD_ENV = INCPATH="$(SRCDIR) $(INCPATH)" LIBPATH="$(BLDDIR) $(LIBPATH)"
 
-SRCS_LIBDABLOOMS = md5.c dablooms.c
+SRCS_LIBDABLOOMS = murmur.c dablooms.c
 SRCS_TESTS = test_dablooms.c
 WORDS = /usr/share/dict/words
 OBJS_LIBDABLOOMS = $(patsubst %.c, $(BLDDIR)/%.o, $(SRCS_LIBDABLOOMS))
18  pydablooms/pydablooms.c
@@ -51,37 +51,39 @@ static int Dablooms_init(Dablooms *self, PyObject *args, PyObject *kwds)
 static PyObject *check(Dablooms *self, PyObject *args)
 {
     const char *hash;
-    if (!PyArg_ParseTuple(args, "s", &hash)) {
+    int len;
+
+    if (!PyArg_ParseTuple(args, "s#", &hash, &len)) {
         return NULL;
     }
-    return Py_BuildValue("i", scaling_bloom_check(self->filter, hash));
+    return Py_BuildValue("i", scaling_bloom_check(self->filter, hash, len));
 }
 
 static PyObject *add(Dablooms *self, PyObject *args, PyObject *kwds)
 {
     const char *hash;
-    int id;
+    int id, len;
 
     static char *kwlist[] = {"hash", "id", NULL};
 
-    if (! PyArg_ParseTupleAndKeywords(args, kwds, "|si", kwlist, &hash, &id)) {
+    if (! PyArg_ParseTupleAndKeywords(args, kwds, "|s#i", kwlist, &hash, &len, &id)) {
         return NULL;
     }
 
-    return Py_BuildValue("i", scaling_bloom_add(self->filter, hash, id));
+    return Py_BuildValue("i", scaling_bloom_add(self->filter, hash, len, id));
 }
 
 static PyObject *delete(Dablooms *self, PyObject *args, PyObject *kwds)
 {
     const char *hash;
-    int id;
+    int id, len;
     static char *kwlist[] = {"hash", "id", NULL};
 
-    if (! PyArg_ParseTupleAndKeywords(args, kwds, "|si", kwlist, &hash, &id)) {
+    if (! PyArg_ParseTupleAndKeywords(args, kwds, "|s#i", kwlist, &hash, &len, &id)) {
         return NULL;
     }
 
-    return Py_BuildValue("i", scaling_bloom_remove(self->filter, hash, id));
+    return Py_BuildValue("i", scaling_bloom_remove(self->filter, hash, len, id));
 }
 
 static PyObject *flush(Dablooms *self, PyObject *args, PyObject *kwds)
150  src/dablooms.c
@@ -12,14 +12,13 @@
12 12
 #include <sys/mman.h>
13 13
 #include <unistd.h>
14 14
 
15  
-#include "md5.h"
  15
+#include "murmur.h"
16 16
 #include "dablooms.h"
17 17
 
18 18
 #define DABLOOMS_VERSION "0.8.1"
19 19
 
20 20
 #define HEADER_BYTES (2*sizeof(uint32_t))
21 21
 #define SCALE_HEADER_BYTES (3*sizeof(uint64_t))
22  
-#define SALT_SIZE 16
23 22
 
24 23
 const char *dablooms_version(void)
25 24
 {
@@ -45,16 +44,9 @@ bitmap_t *bitmap_resize(bitmap_t *bitmap, size_t old_size, size_t new_size)
45 44
     
46 45
     /* Write something to the end of the file to insure allocated the space */
47 46
     if (size == old_size) {
48  
-        for (; size < new_size; size++) {
49  
-            if (lseek(fd, size, SEEK_SET) < 0) {
50  
-                perror("Error, calling lseek() to set file size");
51  
-                free_bitmap(bitmap);
52  
-                close(fd);
53  
-                return NULL;
54  
-            }
55  
-        }
56  
-        if (write(fd, "", 1) < 0) {
57  
-            perror("Error, writing last byte of the file");
  47
+        if (lseek(fd, new_size, SEEK_SET) < 0 ||
  48
+                ftruncate(fd, (off_t)new_size) < 0) {
  49
+            perror("Error increasing file size with ftruncate");
58 50
             free_bitmap(bitmap);
59 51
             close(fd);
60 52
             return NULL;
@@ -184,58 +176,76 @@ int bitmap_flush(bitmap_t *bitmap)
184 176
     }
185 177
 }
186 178
 
187  
-/* Each function has a unique salt, so we need at least nfuncs salts.
188  
- * An MD5 hash is 16 bytes long, and each salt only needds to be 4 bytes
189  
- * Thus we can proportion 4 salts per each md5 hash we create as a salt.
  179
+/*
  180
+ * Build some sexy new salts for the bloom filter. How?
  181
+ *
  182
+ * With Murmur3_128, we turn a key and a 4-byte salt into a 16 bytes
  183
+ * hash; this hash can be split in four 4-byte hashes, which are
  184
+ * the target size for our bloom filter.
  185
+ *
  186
+ * Hence if we require `nfunc` 4-byte hashes, we need to generate
  187
+ * `nfunc` / 4 different salts (this number in rounded upwards for
  188
+ * the cases where `nfunc` doesn't divide evenly, and we only need
  189
+ * to take 1, 2 or 3 words from the 128-bit hash seeded with the
  190
+ * last salt).
  191
+ *
  192
+ * We build these salts incrementally using Murmur3_32 (4-byte output,
  193
+ * matches our target salt size). The intitial salt is a function
  194
+ * of a predefined root; consequent salts are chained on top of the
  195
+ * first one using the same seed but xor'ed with the salt index.
  196
+ *
  197
+ * Note that this salt generation is stable, i.e. will always remain
  198
+ * the same between different instantiations of a filter. There is
  199
+ * no pure randomness involved.
190 200
  */
191  
-void new_salts(counting_bloom_t *bloom)
  201
+static void new_salts(counting_bloom_t *bloom)
192 202
 {
193  
-    int div = bloom->nfuncs / 4;
194  
-    int mod = bloom->nfuncs % 4;
195  
-    int i;
196  
-    
197  
-    if (mod) {
198  
-        div += 1;
199  
-    }
200  
-    bloom->num_salts = div;
201  
-    bloom->salts = calloc(div, SALT_SIZE);
202  
-    for (i = 0; i < div; i++) {
203  
-        struct cvs_MD5Context context;
204  
-        unsigned char checksum[16];
205  
-        cvs_MD5Init (&context);
206  
-        cvs_MD5Update (&context, (unsigned char *) &i, sizeof(int));
207  
-        cvs_MD5Final (checksum, &context);
208  
-        memcpy(bloom->salts + i * SALT_SIZE, &checksum, SALT_SIZE);
  203
+    const uint32_t root = 0xba11742c;
  204
+    const uint32_t seed = 0xd5702acb;
  205
+
  206
+    int i, num_salts = bloom->nfuncs / 4;
  207
+
  208
+    if (bloom->nfuncs % 4)
  209
+        num_salts++;
  210
+
  211
+    bloom->salts = calloc(num_salts, sizeof(uint32_t));
  212
+    bloom->nsalts = num_salts;
  213
+
  214
+    /* initial salt, seeded from root */
  215
+    MurmurHash3_x86_32((char *)&root, sizeof(uint32_t), seed, bloom->salts);
  216
+
  217
+    for (i = 1; i < num_salts; i++) {
  218
+        /* remaining salts are chained on top */
  219
+        uint32_t base = bloom->salts[i - 1] ^ i;
  220
+        MurmurHash3_x86_32((char *)&base, sizeof(uint32_t), seed, bloom->salts + i);
209 221
     }
210 222
 }
211 223
 
212  
-/* We are are using the salts, adding them to the new md5 hash, adding the key,
213  
- * converting said md5 hash to 4 byte indexes
  224
+/*
  225
+ * Perform the actual hashing for `key`
  226
+ *
  227
+ * We get one 128-bit hash for every salt we've previously
  228
+ * allocated. From this 128-bit hash, we get 4 32-bit hashes
  229
+ * with our target size; we need to wrap them around
  230
+ * individually.
  231
+ *
  232
+ * Note that there are no overflow checks for the cases where
  233
+ * we have a non-multiple of 4 number of hashes, because we've
  234
+ * allocated the `hashes` array in 16-byte boundaries. In these
  235
+ * cases, the remaining 1, 2 or 3 hashes will simply not be
  236
+ * accessed.
214 237
  */
215  
-unsigned int *hash_func(counting_bloom_t *bloom, const char *key, unsigned int *hashes)
  238
+static void hash_func(counting_bloom_t *bloom, const char *key, size_t key_len, uint32_t *hashes)
216 239
 {
  240
+    int i;
217 241
 
218  
-    int i, j, hash_cnt, hash;
219  
-    unsigned char *salts = bloom->salts;
220  
-    hash_cnt = 0;
221  
-    
222  
-    for (i = 0; i < bloom->num_salts; i++) {
223  
-        struct cvs_MD5Context context;
224  
-        unsigned char checksum[16];
225  
-        cvs_MD5Init(&context);
226  
-        cvs_MD5Update(&context, salts + i * SALT_SIZE, SALT_SIZE);
227  
-        cvs_MD5Update(&context, (unsigned char *)key, strlen(key));
228  
-        cvs_MD5Final(checksum, &context);
229  
-        for (j = 0; j < sizeof(checksum); j += 4) {
230  
-            if (hash_cnt >= (bloom->nfuncs)) {
231  
-                break;
232  
-            }
233  
-            hash = *(uint32_t *)(checksum + j);
234  
-            hashes[hash_cnt] = hash % bloom->counts_per_func;
235  
-            hash_cnt++;
236  
-        }
  242
+    for (i = 0; i < bloom->nsalts; i++, hashes += 4) {
  243
+        MurmurHash3_x64_128(key, key_len, bloom->salts[i], hashes);
  244
+        hashes[0] = hashes[0] % bloom->counts_per_func;
  245
+        hashes[1] = hashes[1] % bloom->counts_per_func;
  246
+        hashes[2] = hashes[2] % bloom->counts_per_func;
  247
+        hashes[3] = hashes[3] % bloom->counts_per_func;
237 248
     }
238  
-    return hashes;
239 249
 }
240 250
 
241 251
 int free_counting_bloom(counting_bloom_t *bloom)
@@ -278,8 +288,12 @@ counting_bloom_t *counting_bloom_init(unsigned int capacity, double error_rate,
278 288
     bloom->counts_per_func = (int) ceil(capacity * fabs(log(error_rate)) / (bloom->nfuncs * pow(log(2), 2)));
279 289
     bloom->size = ceil(bloom->nfuncs * bloom->counts_per_func);
280 290
     bloom->num_bytes = (int) ceil(bloom->size / 2 + HEADER_BYTES);
281  
-    bloom->hashes = calloc(bloom->nfuncs, sizeof(unsigned int));
  291
+
282 292
     new_salts(bloom);
  293
+
  294
+    /* hashes; make sure they are always allocated as a multiple of 16
  295
+     * to skip the overflow check when generating 128-bit hashes */
  296
+    bloom->hashes = malloc(bloom->nsalts * 16);
283 297
     
284 298
     return bloom;
285 299
 }
@@ -318,12 +332,12 @@ counting_bloom_t *counting_bloom_from_file(unsigned capacity, double error_rate,
     return cur_bloom;
 }
 
-int counting_bloom_add(counting_bloom_t *bloom, const char *s)
+int counting_bloom_add(counting_bloom_t *bloom, const char *s, size_t len)
 {
     unsigned int index, i, offset;
     unsigned int *hashes = bloom->hashes;
     
-    hash_func(bloom, s, hashes);
+    hash_func(bloom, s, len, hashes);
     
     for (i = 0; i < bloom->nfuncs; i++) {
         offset = i * bloom->counts_per_func;
@@ -335,12 +349,12 @@ int counting_bloom_add(counting_bloom_t *bloom, const char *s)
     return 0;
 }
 
-int counting_bloom_remove(counting_bloom_t *bloom, const char *s)
+int counting_bloom_remove(counting_bloom_t *bloom, const char *s, size_t len)
 {
     unsigned int index, i, offset;
     unsigned int *hashes = bloom->hashes;
     
-    hash_func(bloom, s, hashes);
+    hash_func(bloom, s, len, hashes);
     
     for (i = 0; i < bloom->nfuncs; i++) {
         offset = i * bloom->counts_per_func;
@@ -352,12 +366,12 @@ int counting_bloom_remove(counting_bloom_t *bloom, const char *s)
     return 0;
 }
 
-int counting_bloom_check(counting_bloom_t *bloom, const char *s)
+int counting_bloom_check(counting_bloom_t *bloom, const char *s, size_t len)
 {
     unsigned int index, i, offset;
     unsigned int *hashes = bloom->hashes;
     
-    hash_func(bloom, s, hashes);
+    hash_func(bloom, s, len, hashes);
     
     for (i = 0; i < bloom->nfuncs; i++) {
         offset = i * bloom->counts_per_func;
@@ -425,7 +439,7 @@ counting_bloom_t *new_counting_bloom_from_scale(scaling_bloom_t *bloom, uint32_t
 }
 
 
-int scaling_bloom_add(scaling_bloom_t *bloom, const char *s, uint32_t id)
+int scaling_bloom_add(scaling_bloom_t *bloom, const char *s, size_t len, uint32_t id)
 {
     int i;
     int nblooms = bloom->num_blooms;
@@ -444,14 +458,14 @@ int scaling_bloom_add(scaling_bloom_t *bloom, const char *s, uint32_t id)
     if ((*bloom->header->max_id) < id) {
         (*bloom->header->max_id) = id;
     }
-    counting_bloom_add(cur_bloom, s);
+    counting_bloom_add(cur_bloom, s, len);
     
     (*bloom->header->posseq) ++;
     
     return 1;
 }
 
-int scaling_bloom_remove(scaling_bloom_t *bloom, const char *s, uint32_t id)
+int scaling_bloom_remove(scaling_bloom_t *bloom, const char *s, size_t len, uint32_t id)
 {
     counting_bloom_t *cur_bloom;
     int id_diff, i;
@@ -461,7 +475,7 @@ int scaling_bloom_remove(scaling_bloom_t *bloom, const char *s, uint32_t id)
         id_diff = id - (*cur_bloom->header->id);
         if (id_diff >= 0) {
             (*bloom->header->preseq)++;
-            counting_bloom_remove(cur_bloom, s);
+            counting_bloom_remove(cur_bloom, s, len);
             (*bloom->header->posseq)++;
             return 1;
         }
@@ -469,13 +483,13 @@ int scaling_bloom_remove(scaling_bloom_t *bloom, const char *s, uint32_t id)
     return 0;
 }
 
-int scaling_bloom_check(scaling_bloom_t *bloom, const char *s)
+int scaling_bloom_check(scaling_bloom_t *bloom, const char *s, size_t len)
 {
     int i;
     counting_bloom_t *cur_bloom;
     for (i = bloom->num_blooms - 1; i >= 0; i--) {
         cur_bloom = bloom->blooms[i];
-        if (counting_bloom_check(cur_bloom, s)) {
+        if (counting_bloom_check(cur_bloom, s, len)) {
             return 1;
         }
     }
src/dablooms.h (20 changed lines)
@@ -35,9 +35,11 @@
     unsigned int capacity;
     unsigned int offset;
     unsigned int counts_per_func;
-    unsigned int num_salts;
-    unsigned char *salts;
-    unsigned int *hashes;
+
+    uint32_t *salts;
+    uint32_t *hashes;
+
+    size_t nsalts;
     size_t nfuncs;
     size_t size;
     size_t num_bytes;
@@ -48,9 +50,9 @@
 int free_counting_bloom(counting_bloom_t *bloom);
 counting_bloom_t *new_counting_bloom(unsigned int capacity, double error_rate, const char *filename);
 counting_bloom_t *new_counting_bloom_from_file(unsigned int capacity, double error_rate, const char *filename);
-int counting_bloom_add(counting_bloom_t *bloom, const char *s);
-int counting_bloom_remove(counting_bloom_t *bloom, const char *s);
-int counting_bloom_check(counting_bloom_t *bloom, const char *s);
+int counting_bloom_add(counting_bloom_t *bloom, const char *s, size_t len);
+int counting_bloom_remove(counting_bloom_t *bloom, const char *s, size_t len);
+int counting_bloom_check(counting_bloom_t *bloom, const char *s, size_t len);
 
 
 typedef struct {
@@ -74,8 +76,8 @@
 scaling_bloom_t *new_scaling_bloom(unsigned int capacity, double error_rate, const char *filename, uint32_t id);
 scaling_bloom_t *new_scaling_bloom_from_file(unsigned int capacity, double error_rate, const char *filename);
 int free_scaling_bloom(scaling_bloom_t *bloom);
-int scaling_bloom_add(scaling_bloom_t *bloom, const char *s, uint32_t id);
-int scaling_bloom_remove(scaling_bloom_t *bloom, const char *s, uint32_t id);
-int scaling_bloom_check(scaling_bloom_t *bloom, const char *s);
+int scaling_bloom_add(scaling_bloom_t *bloom, const char *s, size_t len, uint32_t id);
+int scaling_bloom_remove(scaling_bloom_t *bloom, const char *s, size_t len, uint32_t id);
+int scaling_bloom_check(scaling_bloom_t *bloom, const char *s, size_t len);
 int scaling_bloom_flush(scaling_bloom_t *bloom);
 #endif
src/md5.c (332 deleted lines)
@@ -1,332 +0,0 @@
-/*
- * This code implements the MD5 message-digest algorithm.
- * The algorithm is due to Ron Rivest.  This code was
- * written by Colin Plumb in 1993, no copyright is claimed.
- * This code is in the public domain; do with it what you wish.
- *
- * Equivalent code is available from RSA Data Security, Inc.
- * This code has been tested against that, and is equivalent,
- * except that you don't need to include two pages of legalese
- * with every copy.
- *
- * To compute the message digest of a chunk of bytes, declare an
- * MD5Context structure, pass it to MD5Init, call MD5Update as
- * needed on buffers full of bytes, and then call MD5Final, which
- * will fill a supplied 16-byte array with the digest.
- */
-
-/* This code was modified in 1997 by Jim Kingdon of Cyclic Software to
-   not require an integer type which is exactly 32 bits.  This work
-   draws on the changes for the same purpose by Tatu Ylonen
-   <ylo@cs.hut.fi> as part of SSH, but since I didn't actually use
-   that code, there is no copyright issue.  I hereby disclaim
-   copyright in any changes I have made; this code remains in the
-   public domain.  */
-
-/* Note regarding cvs_* namespace: this avoids potential conflicts
-   with libraries such as some versions of Kerberos.  No particular
-   need to worry about whether the system supplies an MD5 library, as
-   this file is only about 3k of object code.  */
-
-#ifdef HAVE_CONFIG_H
-#include "config.h"
-#endif
-
-#include <string.h>	/* for memcpy() and memset() */
-
-/* Add prototype support.  */
-#ifndef PROTO
-#if defined (USE_PROTOTYPES) ? USE_PROTOTYPES : defined (__STDC__)
-#define PROTO(ARGS) ARGS
-#else
-#define PROTO(ARGS) ()
-#endif
-#endif
-
-#include "md5.h"
-
-/* Little-endian byte-swapping routines.  Note that these do not
-   depend on the size of datatypes such as cvs_uint32, nor do they require
-   us to detect the endianness of the machine we are running on.  It
-   is possible they should be macros for speed, but I would be
-   surprised if they were a performance bottleneck for MD5.  */
-
-static cvs_uint32
-getu32 (addr)
-     const unsigned char *addr;
-{
-	return (((((unsigned long)addr[3] << 8) | addr[2]) << 8)
-		| addr[1]) << 8 | addr[0];
-}
-
-static void
-putu32 (data, addr)
-     cvs_uint32 data;
-     unsigned char *addr;
-{
-	addr[0] = (unsigned char)data;
-	addr[1] = (unsigned char)(data >> 8);
-	addr[2] = (unsigned char)(data >> 16);
-	addr[3] = (unsigned char)(data >> 24);
-}
-
-/*
- * Start MD5 accumulation.  Set bit count to 0 and buffer to mysterious
- * initialization constants.
- */
-void
-cvs_MD5Init (ctx)
-     struct cvs_MD5Context *ctx;
-{
-	ctx->buf[0] = 0x67452301;
-	ctx->buf[1] = 0xefcdab89;
-	ctx->buf[2] = 0x98badcfe;
-	ctx->buf[3] = 0x10325476;
-
-	ctx->bits[0] = 0;
-	ctx->bits[1] = 0;
-}
-
-/*
- * Update context to reflect the concatenation of another buffer full
- * of bytes.
- */
-void
-cvs_MD5Update (ctx, buf, len)
-     struct cvs_MD5Context *ctx;
-     unsigned char const *buf;
-     unsigned len;
-{
-	cvs_uint32 t;
-
-	/* Update bitcount */
-
-	t = ctx->bits[0];
-	if ((ctx->bits[0] = (t + ((cvs_uint32)len << 3)) & 0xffffffff) < t)
-		ctx->bits[1]++;	/* Carry from low to high */
-	ctx->bits[1] += len >> 29;
-
-	t = (t >> 3) & 0x3f;	/* Bytes already in shsInfo->data */
-
-	/* Handle any leading odd-sized chunks */
-
-	if ( t ) {
-		unsigned char *p = ctx->in + t;
-
-		t = 64-t;
-		if (len < t) {
-			memcpy(p, buf, len);
-			return;
-		}
-		memcpy(p, buf, t);
-		cvs_MD5Transform (ctx->buf, ctx->in);
-		buf += t;
-		len -= t;
-	}
-
-	/* Process data in 64-byte chunks */
-
-	while (len >= 64) {
-		memcpy(ctx->in, buf, 64);
-		cvs_MD5Transform (ctx->buf, ctx->in);
-		buf += 64;
-		len -= 64;
-	}
-
-	/* Handle any remaining bytes of data. */
-
-	memcpy(ctx->in, buf, len);
-}
-
-/*
- * Final wrapup - pad to 64-byte boundary with the bit pattern 
- * 1 0* (64-bit count of bits processed, MSB-first)
- */
-void
-cvs_MD5Final (digest, ctx)
-     unsigned char digest[16];
-     struct cvs_MD5Context *ctx;
-{
-	unsigned count;
-	unsigned char *p;
-
-	/* Compute number of bytes mod 64 */
-	count = (ctx->bits[0] >> 3) & 0x3F;
-
-	/* Set the first char of padding to 0x80.  This is safe since there is
-	   always at least one byte free */
-	p = ctx->in + count;
-	*p++ = 0x80;
-
-	/* Bytes of padding needed to make 64 bytes */
-	count = 64 - 1 - count;
-
-	/* Pad out to 56 mod 64 */
-	if (count < 8) {
-		/* Two lots of padding:  Pad the first block to 64 bytes */
-		memset(p, 0, count);
-		cvs_MD5Transform (ctx->buf, ctx->in);
-
-		/* Now fill the next block with 56 bytes */
-		memset(ctx->in, 0, 56);
-	} else {
-		/* Pad block to 56 bytes */
-		memset(p, 0, count-8);
-	}
-
-	/* Append length in bits and transform */
-	putu32(ctx->bits[0], ctx->in + 56);
-	putu32(ctx->bits[1], ctx->in + 60);
-
-	cvs_MD5Transform (ctx->buf, ctx->in);
-	putu32(ctx->buf[0], digest);
-	putu32(ctx->buf[1], digest + 4);
-	putu32(ctx->buf[2], digest + 8);
-	putu32(ctx->buf[3], digest + 12);
-	memset(ctx, 0, sizeof(ctx));	/* In case it's sensitive */
-}
-
-#ifndef ASM_MD5
-
-/* The four core functions - F1 is optimized somewhat */
-
-/* #define F1(x, y, z) (x & y | ~x & z) */
-#define F1(x, y, z) (z ^ (x & (y ^ z)))
-#define F2(x, y, z) F1(z, x, y)
-#define F3(x, y, z) (x ^ y ^ z)
-#define F4(x, y, z) (y ^ (x | ~z))
-
-/* This is the central step in the MD5 algorithm. */
-#define MD5STEP(f, w, x, y, z, data, s) \
-	( w += f(x, y, z) + data, w &= 0xffffffff, w = w<<s | w>>(32-s), w += x )
-
-/*
- * The core of the MD5 algorithm, this alters an existing MD5 hash to
- * reflect the addition of 16 longwords of new data.  MD5Update blocks
- * the data and converts bytes into longwords for this routine.
- */
-void
-cvs_MD5Transform (buf, inraw)
-     cvs_uint32 buf[4];
-     const unsigned char inraw[64];
-{
-	register cvs_uint32 a, b, c, d;
-	cvs_uint32 in[16];
-	int i;
-
-	for (i = 0; i < 16; ++i)
-		in[i] = getu32 (inraw + 4 * i);
-
-	a = buf[0];
-	b = buf[1];
-	c = buf[2];
-	d = buf[3];
-
-	MD5STEP(F1, a, b, c, d, in[ 0]+0xd76aa478,  7);
-	MD5STEP(F1, d, a, b, c, in[ 1]+0xe8c7b756, 12);
-	MD5STEP(F1, c, d, a, b, in[ 2]+0x242070db, 17);
-	MD5STEP(F1, b, c, d, a, in[ 3]+0xc1bdceee, 22);
-	MD5STEP(F1, a, b, c, d, in[ 4]+0xf57c0faf,  7);
-	MD5STEP(F1, d, a, b, c, in[ 5]+0x4787c62a, 12);
-	MD5STEP(F1, c, d, a, b, in[ 6]+0xa8304613, 17);
-	MD5STEP(F1, b, c, d, a, in[ 7]+0xfd469501, 22);
-	MD5STEP(F1, a, b, c, d, in[ 8]+0x698098d8,  7);
-	MD5STEP(F1, d, a, b, c, in[ 9]+0x8b44f7af, 12);
-	MD5STEP(F1, c, d, a, b, in[10]+0xffff5bb1, 17);
-	MD5STEP(F1, b, c, d, a, in[11]+0x895cd7be, 22);
-	MD5STEP(F1, a, b, c, d, in[12]+0x6b901122,  7);
-	MD5STEP(F1, d, a, b, c, in[13]+0xfd987193, 12);
-	MD5STEP(F1, c, d, a, b, in[14]+0xa679438e, 17);
-	MD5STEP(F1, b, c, d, a, in[15]+0x49b40821, 22);
-
-	MD5STEP(F2, a, b, c, d, in[ 1]+0xf61e2562,  5);
-	MD5STEP(F2, d, a, b, c, in[ 6]+0xc040b340,  9);
-	MD5STEP(F2, c, d, a, b, in[11]+0x265e5a51, 14);
-	MD5STEP(F2, b, c, d, a, in[ 0]+0xe9b6c7aa, 20);
-	MD5STEP(F2, a, b, c, d, in[ 5]+0xd62f105d,  5);
-	MD5STEP(F2, d, a, b, c, in[10]+0x02441453,  9);
-	MD5STEP(F2, c, d, a, b, in[15]+0xd8a1e681, 14);
-	MD5STEP(F2, b, c, d, a, in[ 4]+0xe7d3fbc8, 20);
-	MD5STEP(F2, a, b, c, d, in[ 9]+0x21e1cde6,  5);
-	MD5STEP(F2, d, a, b, c, in[14]+0xc33707d6,  9);
-	MD5STEP(F2, c, d, a, b, in[ 3]+0xf4d50d87, 14);
-	MD5STEP(F2, b, c, d, a, in[ 8]+0x455a14ed, 20);
-	MD5STEP(F2, a, b, c, d, in[13]+0xa9e3e905,  5);
-	MD5STEP(F2, d, a, b, c, in[ 2]+0xfcefa3f8,  9);
-	MD5STEP(F2, c, d, a, b, in[ 7]+0x676f02d9, 14);
-	MD5STEP(F2, b, c, d, a, in[12]+0x8d2a4c8a, 20);
-
-	MD5STEP(F3, a, b, c, d, in[ 5]+0xfffa3942,  4);
-	MD5STEP(F3, d, a, b, c, in[ 8]+0x8771f681, 11);
-	MD5STEP(F3, c, d, a, b, in[11]+0x6d9d6122, 16);
-	MD5STEP(F3, b, c, d, a, in[14]+0xfde5380c, 23);
-	MD5STEP(F3, a, b, c, d, in[ 1]+0xa4beea44,  4);
-	MD5STEP(F3, d, a, b, c, in[ 4]+0x4bdecfa9, 11);
-	MD5STEP(F3, c, d, a, b, in[ 7]+0xf6bb4b60, 16);
-	MD5STEP(F3, b, c, d, a, in[10]+0xbebfbc70, 23);
-	MD5STEP(F3, a, b, c, d, in[13]+0x289b7ec6,  4);
-	MD5STEP(F3, d, a, b, c, in[ 0]+0xeaa127fa, 11);
-	MD5STEP(F3, c, d, a, b, in[ 3]+0xd4ef3085, 16);
-	MD5STEP(F3, b, c, d, a, in[ 6]+0x04881d05, 23);
-	MD5STEP(F3, a, b, c, d, in[ 9]+0xd9d4d039,  4);
-	MD5STEP(F3, d, a, b, c, in[12]+0xe6db99e5, 11);
-	MD5STEP(F3, c, d, a, b, in[15]+0x1fa27cf8, 16);
-	MD5STEP(F3, b, c, d, a, in[ 2]+0xc4ac5665, 23);
-
-	MD5STEP(F4, a, b, c, d, in[ 0]+0xf4292244,  6);
-	MD5STEP(F4, d, a, b, c, in[ 7]+0x432aff97, 10);
-	MD5STEP(F4, c, d, a, b, in[14]+0xab9423a7, 15);
-	MD5STEP(F4, b, c, d, a, in[ 5]+0xfc93a039, 21);
-	MD5STEP(F4, a, b, c, d, in[12]+0x655b59c3,  6);
-	MD5STEP(F4, d, a, b, c, in[ 3]+0x8f0ccc92, 10);
-	MD5STEP(F4, c, d, a, b, in[10]+0xffeff47d, 15);
-	MD5STEP(F4, b, c, d, a, in[ 1]+0x85845dd1, 21);
-	MD5STEP(F4, a, b, c, d, in[ 8]+0x6fa87e4f,  6);
-	MD5STEP(F4, d, a, b, c, in[15]+0xfe2ce6e0, 10);
-	MD5STEP(F4, c, d, a, b, in[ 6]+0xa3014314, 15);
-	MD5STEP(F4, b, c, d, a, in[13]+0x4e0811a1, 21);
-	MD5STEP(F4, a, b, c, d, in[ 4]+0xf7537e82,  6);
-	MD5STEP(F4, d, a, b, c, in[11]+0xbd3af235, 10);
-	MD5STEP(F4, c, d, a, b, in[ 2]+0x2ad7d2bb, 15);
-	MD5STEP(F4, b, c, d, a, in[ 9]+0xeb86d391, 21);
-
-	buf[0] += a;
-	buf[1] += b;
-	buf[2] += c;
-	buf[3] += d;
-}
-#endif
-
-#ifdef TEST
-/* Simple test program.  Can use it to manually run the tests from
-   RFC1321 for example.  */
-#include <stdio.h>
-
-int
-main (int argc, char **argv)
-{
-	struct cvs_MD5Context context;
-	unsigned char checksum[16];
-	int i;
-	int j;
-
-	if (argc < 2)
-	{
-		fprintf (stderr, "usage: %s string-to-hash\n", argv[0]);
-		exit (1);
-	}
-	for (j = 1; j < argc; ++j)
-	{
-		printf ("MD5 (\"%s\") = ", argv[j]);
-		cvs_MD5Init (&context);
-		cvs_MD5Update (&context, argv[j], strlen (argv[j]));
-		cvs_MD5Final (checksum, &context);
-		for (i = 0; i < 16; i++)
-		{
-			printf ("%02x", (unsigned int) checksum[i]);
-		}
-		printf ("\n");
-	}
-	return 0;
-}
-#endif /* TEST */
src/md5.h (39 deleted lines)
@@ -1,39 +0,0 @@
-/* See md5.c for explanation and copyright information.  */
-
-/*
- * $FreeBSD: src/contrib/cvs/lib/md5.h,v 1.2 1999/12/11 15:10:02 peter Exp $
- */
-
-/* Add prototype support.  */
-#ifndef PROTO
-#if defined (USE_PROTOTYPES) ? USE_PROTOTYPES : defined (__STDC__)
-#define PROTO(ARGS) ARGS
-#else
-#define PROTO(ARGS) ()
-#endif
-#endif
-
-#ifndef MD5_H
-#define MD5_H
-
-/* Unlike previous versions of this code, uint32 need not be exactly
-   32 bits, merely 32 bits or more.  Choosing a data type which is 32
-   bits instead of 64 is not important; speed is considerably more
-   important.  ANSI guarantees that "unsigned long" will be big enough,
-   and always using it seems to have few disadvantages.  */
-typedef unsigned long cvs_uint32;
-
-struct cvs_MD5Context {
-	cvs_uint32 buf[4];
-	cvs_uint32 bits[2];
-	unsigned char in[64];
-};
-
-void cvs_MD5Init PROTO ((struct cvs_MD5Context *context));
-void cvs_MD5Update PROTO ((struct cvs_MD5Context *context,
-			   unsigned char const *buf, unsigned len));
-void cvs_MD5Final PROTO ((unsigned char digest[16],
-			  struct cvs_MD5Context *context));
-void cvs_MD5Transform PROTO ((cvs_uint32 buf[4], const unsigned char in[64]));
-
-#endif /* !MD5_H */
src/murmur.c (300 added lines)
@@ -0,0 +1,300 @@
+//-----------------------------------------------------------------------------
+// MurmurHash3 was written by Austin Appleby, and is placed in the public
+// domain. The author hereby disclaims copyright to this source code.
+
+// Note - The x86 and x64 versions do _not_ produce the same results, as the
+// algorithms are optimized for their respective platforms. You can still
+// compile and run any of them on any platform, but your performance with the
+// non-native version will be less than optimal.
+
+#include "murmur.h"
+
+#define	FORCE_INLINE __attribute__((always_inline))
+
+FORCE_INLINE uint32_t rotl32 ( uint32_t x, int8_t r )
+{
+	return (x << r) | (x >> (32 - r));
+}
+
+FORCE_INLINE uint64_t rotl64 ( uint64_t x, int8_t r )
+{
+	return (x << r) | (x >> (64 - r));
+}
+
+#define	ROTL32(x,y)	rotl32(x,y)
+#define ROTL64(x,y)	rotl64(x,y)
+
+#define BIG_CONSTANT(x) (x##LLU)
+
+#define getblock(x, i) (x[i])
+
+//-----------------------------------------------------------------------------
+// Finalization mix - force all bits of a hash block to avalanche
+
+FORCE_INLINE uint32_t fmix32(uint32_t h)
+{
+	h ^= h >> 16;
+	h *= 0x85ebca6b;
+	h ^= h >> 13;
+	h *= 0xc2b2ae35;
+	h ^= h >> 16;
+
+	return h;
+}
+
+//----------
+
+FORCE_INLINE uint64_t fmix64(uint64_t k)
+{
+	k ^= k >> 33;
+	k *= BIG_CONSTANT(0xff51afd7ed558ccd);
+	k ^= k >> 33;
+	k *= BIG_CONSTANT(0xc4ceb9fe1a85ec53);
+	k ^= k >> 33;
+
+	return k;
+}
+
+//-----------------------------------------------------------------------------
+
+void MurmurHash3_x86_32 ( const void * key, int len,
+		uint32_t seed, void * out )
+{
+	const uint8_t * data = (const uint8_t*)key;
+	const int nblocks = len / 4;
+
+	uint32_t h1 = seed;
+
+	uint32_t c1 = 0xcc9e2d51;
+	uint32_t c2 = 0x1b873593;
+
+	int i;
+
+	//----------
+	// body
+
+	const uint32_t * blocks = (const uint32_t *)(data + nblocks*4);
+
+	for(i = -nblocks; i; i++) {
+		uint32_t k1 = getblock(blocks,i);
+
+		k1 *= c1;
+		k1 = ROTL32(k1,15);
+		k1 *= c2;
+
+		h1 ^= k1;
+		h1 = ROTL32(h1,13); 
+		h1 = h1*5+0xe6546b64;
+	}
+
+	//----------
+	// tail
+
+	const uint8_t * tail = (const uint8_t*)(data + nblocks*4);
+
+	uint32_t k1 = 0;
+
+	switch(len & 3) {
+		case 3: k1 ^= tail[2] << 16;
+		case 2: k1 ^= tail[1] << 8;
+		case 1: k1 ^= tail[0];
+				k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
+	}
+
+	//----------
+	// finalization
+
+	h1 ^= len;
+
+	h1 = fmix32(h1);
+
+	*(uint32_t*)out = h1;
+} 
+
+//-----------------------------------------------------------------------------
+
+void MurmurHash3_x86_128 ( const void * key, const int len,
+		uint32_t seed, void * out )
+{
+	const uint8_t * data = (const uint8_t*)key;
+	const int nblocks = len / 16;
+
+	uint32_t h1 = seed;
+	uint32_t h2 = seed;
+	uint32_t h3 = seed;
+	uint32_t h4 = seed;
+
+	uint32_t c1 = 0x239b961b; 
+	uint32_t c2 = 0xab0e9789;
+	uint32_t c3 = 0x38b34ae5; 
+	uint32_t c4 = 0xa1e38b93;
+
+	int i;
+
+	//----------
+	// body
+
+	const uint32_t * blocks = (const uint32_t *)(data + nblocks*16);
+
+	for(i = -nblocks; i; i++) {
+		uint32_t k1 = getblock(blocks,i*4+0);
+		uint32_t k2 = getblock(blocks,i*4+1);
+		uint32_t k3 = getblock(blocks,i*4+2);
+		uint32_t k4 = getblock(blocks,i*4+3);
+
+		k1 *= c1; k1  = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
+
+		h1 = ROTL32(h1,19); h1 += h2; h1 = h1*5+0x561ccd1b;
+
+		k2 *= c2; k2  = ROTL32(k2,16); k2 *= c3; h2 ^= k2;
+
+		h2 = ROTL32(h2,17); h2 += h3; h2 = h2*5+0x0bcaa747;
+
+		k3 *= c3; k3  = ROTL32(k3,17); k3 *= c4; h3 ^= k3;
+
+		h3 = ROTL32(h3,15); h3 += h4; h3 = h3*5+0x96cd1c35;
+
+		k4 *= c4; k4  = ROTL32(k4,18); k4 *= c1; h4 ^= k4;
+
+		h4 = ROTL32(h4,13); h4 += h1; h4 = h4*5+0x32ac3b17;
+	}
+
+	//----------
+	// tail
+
+	const uint8_t * tail = (const uint8_t*)(data + nblocks*16);
+
+	uint32_t k1 = 0;
+	uint32_t k2 = 0;
+	uint32_t k3 = 0;
+	uint32_t k4 = 0;
+
+	switch(len & 15) {
+		case 15: k4 ^= tail[14] << 16;
+		case 14: k4 ^= tail[13] << 8;
+		case 13: k4 ^= tail[12] << 0;
+				 k4 *= c4; k4  = ROTL32(k4,18); k4 *= c1; h4 ^= k4;
+
+		case 12: k3 ^= tail[11] << 24;
+		case 11: k3 ^= tail[10] << 16;
+		case 10: k3 ^= tail[ 9] << 8;
+		case  9: k3 ^= tail[ 8] << 0;
+				 k3 *= c3; k3  = ROTL32(k3,17); k3 *= c4; h3 ^= k3;
+
+		case  8: k2 ^= tail[ 7] << 24;
+		case  7: k2 ^= tail[ 6] << 16;
+		case  6: k2 ^= tail[ 5] << 8;
+		case  5: k2 ^= tail[ 4] << 0;
+				 k2 *= c2; k2  = ROTL32(k2,16); k2 *= c3; h2 ^= k2;
+
+		case  4: k1 ^= tail[ 3] << 24;
+		case  3: k1 ^= tail[ 2] << 16;
+		case  2: k1 ^= tail[ 1] << 8;
+		case  1: k1 ^= tail[ 0] << 0;
+				 k1 *= c1; k1  = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
+	}
+
+	//----------
+	// finalization
+
+	h1 ^= len; h2 ^= len; h3 ^= len; h4 ^= len;
+
+	h1 += h2; h1 += h3; h1 += h4;
+	h2 += h1; h3 += h1; h4 += h1;
+
+	h1 = fmix32(h1);
+	h2 = fmix32(h2);
+	h3 = fmix32(h3);
+	h4 = fmix32(h4);
+
+	h1 += h2; h1 += h3; h1 += h4;
+	h2 += h1; h3 += h1; h4 += h1;
+
+	((uint32_t*)out)[0] = h1;
+	((uint32_t*)out)[1] = h2;
+	((uint32_t*)out)[2] = h3;
+	((uint32_t*)out)[3] = h4;
+}
+
+//-----------------------------------------------------------------------------
+
+void MurmurHash3_x64_128 ( const void * key, const int len,
+		const uint32_t seed, void * out )
+{
+	const uint8_t * data = (const uint8_t*)key;
+	const int nblocks = len / 16;
+
+	uint64_t h1 = seed;
+	uint64_t h2 = seed;
+
+	uint64_t c1 = BIG_CONSTANT(0x87c37b91114253d5);
+	uint64_t c2 = BIG_CONSTANT(0x4cf5ad432745937f);
+
+	int i;
+
+	//----------
+	// body
+
+	const uint64_t * blocks = (const uint64_t *)(data);
+
+	for(i = 0; i < nblocks; i++) {
+		uint64_t k1 = getblock(blocks,i*2+0);
+		uint64_t k2 = getblock(blocks,i*2+1);
+
+		k1 *= c1; k1  = ROTL64(k1,31); k1 *= c2; h1 ^= k1;
+
+		h1 = ROTL64(h1,27); h1 += h2; h1 = h1*5+0x52dce729;
+
+		k2 *= c2; k2  = ROTL64(k2,33); k2 *= c1; h2 ^= k2;
+
+		h2 = ROTL64(h2,31); h2 += h1; h2 = h2*5+0x38495ab5;
+	}
+
+	//----------
+	// tail
+
+	const uint8_t * tail = (const uint8_t*)(data + nblocks*16);
+
+	uint64_t k1 = 0;
+	uint64_t k2 = 0;
+
+	switch(len & 15) {
+		case 15: k2 ^= ((uint64_t)tail[14]) << 48;
+		case 14: k2 ^= ((uint64_t)tail[13]) << 40;
+		case 13: k2 ^= ((uint64_t)tail[12]) << 32;
+		case 12: k2 ^= ((uint64_t)tail[11]) << 24;
+		case 11: k2 ^= ((uint64_t)tail[10]) << 16;
+		case 10: k2 ^= ((uint64_t)tail[ 9]) << 8;
+		case  9: k2 ^= ((uint64_t)tail[ 8]) << 0;
+				 k2 *= c2; k2  = ROTL64(k2,33); k2 *= c1; h2 ^= k2;
+
+		case  8: k1 ^= ((uint64_t)tail[ 7]) << 56;
+		case  7: k1 ^= ((uint64_t)tail[ 6]) << 48;
+		case  6: k1 ^= ((uint64_t)tail[ 5]) << 40;
+		case  5: k1 ^= ((uint64_t)tail[ 4]) << 32;
+		case  4: k1 ^= ((uint64_t)tail[ 3]) << 24;
+		case  3: k1 ^= ((uint64_t)tail[ 2]) << 16;
+		case  2: k1 ^= ((uint64_t)tail[ 1]) << 8;
+		case  1: k1 ^= ((uint64_t)tail[ 0]) << 0;
+				 k1 *= c1; k1  = ROTL64(k1,31); k1 *= c2; h1 ^= k1;
+	}
+
+	//----------
+	// finalization
+
+	h1 ^= len; h2 ^= len;
+
+	h1 += h2;
+	h2 += h1;
+
+	h1 = fmix64(h1);
+	h2 = fmix64(h2);
+
+	h1 += h2;
+	h2 += h1;
+
+	((uint64_t*)out)[0] = h1;
+	((uint64_t*)out)[1] = h2;