
Some performance tweaks #19

Closed
wants to merge 7 commits into from

11 participants

Vicent Marti Pierce Lopez Matt Reiferson Justin Hines Nicholas Curtis Tony Finch J. Randall Hunt Sylvester Jakubowski David Nadlinger Jinghao Yan Matt Godbolt
Vicent Marti
vmg commented

HAI FRIENDLY ENGINEERS OF BITLY I COME IN PEACE

I wanted to play around with your awesome bloom library for some internal super-secret GitHub stuff (HAH), but as it turns out, I found it a tad lacking on raw performance, despite being all ballin' and stuff when it comes to memory usage.

I ran some instrumentation on the library, made some minor (major?) changes, and wanted to share them back. Here's the rationale:

"So I was like, let me trace this, yo"

before

This is the first instrumentation I ran with the original code (unmodified). The corpus is the words file that ships with Mac OS 10.7:

$ wc /usr/share/dict/words
  235886  235886 2493109 /usr/share/dict/words

Clearly, the two main bottlenecks are:

  • 53.7% (1274.0ms) + 9.3% (221.0ms) + 2.9% (70.0ms) spent just in core hashing (without counting the hash finalization for MD5).
  • More than 5% spent in lseek, when the code only uses lseek in one place... (??)

Everything else in the trace was noise, so I decided to go HAM on just these two things. The lowest-hanging fruit is certainly MD5: it's a cryptographic hash, much slower than a good general-purpose one, so its use in a bloom filter is ill-advised.

Sex & the CityHash

My first choice for a replacement was Google's CityHash (ayuup, I drank the Google koolaid -- I'm a moron and deserve to be mocked). I left the original commit in the branch for reference.

This simple change halved the runtime, but traces were still showing way too much time spent in hashing. The cause? Well, the bloom filter requires a pretty wide nfunc size for most corpuses if you want reasonable error rates, but CityHash has only two hash modes: either 64-bit or 128-bit. Neither mode is optimal for bloom filter operation.

  • 64-bit hash output is (supposedly) optimized for small strings, which is not our target corpus at GitHub, although it should perform well with the words file of this synthetic benchmark. In practice, we end up performing too many calls to CityHash to fill the nfunc spectrum because of the small output size.

  • 128-bit (also known as brain-damage-mode) performs poorly for small strings (by poorly I mean worse than other highly-optimized general-purpose hashes) and doesn't offer any other specific advantages besides the adequate output word size.

To top off this disaster, CityHash doesn't really have a native "seeded mode". The seed API performs a standard hash and then an extra iteration (??) on top of the result to mix in the seed, instead of seeding the standard hash initially.

...So I killed CityHash with fire.

Enter Murmur

MurmurHash has always been my favorite real-world hash function, and in retrospect I should have skipped City and gone straight for it.

It offers brilliant performance for all kinds of string values, scales linearly with string size without special-casing short strings, and takes alignment issues into account.

To top it off, Murmur doesn't return the hash value on the stack/registers but writes it directly to a provided buffer. This makes it exceedingly easy to fill the bloom->hashes buffer with a lot of random data and perform the modulo reduction incrementally.

    for (i = 0; i < bloom->nsalts; i++, hashes += 4) {
        MurmurHash3_x64_128(key, key_len, bloom->salts[i], hashes);
        hashes[0] = hashes[0] % bloom->counts_per_func;
        hashes[1] = hashes[1] % bloom->counts_per_func;
        hashes[2] = hashes[2] % bloom->counts_per_func;
        hashes[3] = hashes[3] % bloom->counts_per_func;
    }

(Note that we have aligned the hashes buffer to 16 bytes to prevent corner-case overflow checks.) This is simple and straightforward, and makes my nipples tingle. n salts, and each salt throws 128 bits. Wrap 'em and we're done here!
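The bookkeeping behind that loop is just ceiling division: nfuncs 4-byte hashes need ceil(nfuncs / 4) salts, and the hashes buffer gets one full 16-byte slot per salt. A minimal sketch of the idea (helper names here are mine, not the library's):

```c
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Each MurmurHash3_x64_128 call fills one 16-byte slot (four
 * uint32_t hashes), so we need ceil(nfuncs / 4) salts. */
static size_t salts_needed(size_t nfuncs)
{
    return (nfuncs + 3) / 4;
}

/* Size the buffer in whole 16-byte slots, so the last 128-bit
 * write can't overflow even when nfuncs % 4 != 0. */
static uint32_t *alloc_hashes(size_t nfuncs)
{
    return malloc(salts_needed(nfuncs) * 4 * sizeof(uint32_t));
}
```

With nfuncs = 9, for example, you get 3 salts and a 48-byte buffer; the final 3 hashes of the last slot are simply never read.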

Enlarge your files

After dropping in an optimal hash function, the instrumentation showed a hilariously high percent of time spent in the kernel performing lseeks. I wondered where it was coming from...

        for (; size < new_size; size++) {
            if (lseek(fd, size, SEEK_SET) < 0) {
                perror("Error, calling lseek() to set file size");
                free_bitmap(bitmap);
                close(fd);
                return NULL;
            }
        }

Apparently the code to resize a file on the filesystem was performing an absolute seek for every single byte the file had to grow by. This is... heuh... I don't know if this is for compatibility reasons, but the POSIX standard defines a very, very handy ftruncate call:

The truncate() and ftruncate() functions cause the regular file named by path or referenced by fd to be truncated to a size of precisely length bytes.

If the file previously was larger than this size, the extra data is lost. If the file previously was shorter, it is extended, and the extended part reads as null bytes ('\0').

This works wonders on both Mac OS X and Linux, and lets the kernel fill the file efficiently with those pesky NULL bytes, even on highly fragmented filesystems. After replacing the lseek loop with a single ftruncate call, all kernel operations (including the mmaps) became noise in the instrumentation. Awesome!
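The replacement boils down to one syscall. Here's a sketch of the idea (the grow_file helper name is mine; in the actual patch this lives inline in bitmap_resize):

```c
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <stdio.h>
#include <assert.h>

/* Grow (or shrink) the file backing the bitmap to new_size in a
 * single syscall, instead of one lseek per byte. Extended bytes
 * read back as '\0', exactly what the bitmap wants. */
static int grow_file(int fd, off_t new_size)
{
    if (ftruncate(fd, new_size) < 0) {
        perror("Error increasing file size with ftruncate");
        return -1;
    }
    return 0;
}
```

Usage: open the backing file, call grow_file(fd, new_size), then mmap as before.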

This is where we're at now

after

As far as I'm concerned, the instrumentation trace has been obliterated.

  • Murmur cannot be made any faster, that's the way it is.
  • hash_func is stalling on all the modulo operations (even though they have no interdependencies and should be executing in parallel in the pipeline, I think...). There are no SIMD modulo instructions, so I don't see how to work around this.
  • All the small bumps there come from the actual test program, not the library itself. Mostly strchr for splitting up the words in the dictionary file.
  • bitmap_check and bitmap_increment are tiny and fast. Nothing to do here. :/
  • Everything else is noise. :sparkles:

Also, binary strings

This is not performance related (at least not directly), but it totally bummed me that the API required NULL-terminated strings, especially since I'm pretty sure you wrote this to be wrapped from dynamic languages. All of those languages incur a penalty when asked for a NULL-terminated string (see: Python string slices, yo; that's memory being duped all over the place just for NULL-termination) instead of handing over the raw buffer + its length.

I've changed the API accordingly, adding a len argument to all calls. Obviously, NULL-terminated strings can still be used by passing strlen(string) at the call site, instead of performing the measurement internally as before.
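A tiny illustration of what the length argument buys you (the helper names here are hypothetical, not the dablooms API): a strlen-based API silently truncates keys with embedded zero bytes, while a buffer+length API sees the whole buffer.

```c
#include <string.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical helpers, just to show the difference. */
static size_t key_len_cstr(const char *key)
{
    return strlen(key);     /* stops at the first '\0' */
}

static size_t key_len_binary(const char *key, size_t len)
{
    (void)key;              /* length is caller-supplied */
    return len;             /* embedded '\0' bytes are fine */
}
```

This is also why the Python wrapper can switch from the "s" format (NUL-terminated) to "s#" (buffer + length) in PyArg_ParseTuple, as the diff below does.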

Final benchmarks

Averaged 8 runs for the original code, words is still the corpus.

Run 1: 2.182463s
Run 2: 2.177441s
Run 3: 2.174175s
Run 4: 2.178066s
Run 5: 2.190548s
Run 6: 2.179080s
Run 7: 2.180691s
Run 8: 2.184210s
AVG: 2.180834

Averaged 8 runs for the updated code, same corpus.

Run 1: 0.321654s
Run 2: 0.310658s
Run 3: 0.314666s
Run 4: 0.307526s
Run 5: 0.311680s
Run 6: 0.316963s
Run 7: 0.307528s
Run 8: 0.309479s
AVG: 0.312519

A ~7x speedup on this synthetic benchmark. For our specific corpus (bigger count, strings significantly longer than dictionary words), I get a ~13x speedup. This is basically Murmur at work. Results may vary.

Hey you piece of shit did you break the filtering?

Don't think so. Murmur generates very high quality entropy, high enough to come close to MD5 for all measurements.

It's on my TODO list to run some tests and see if there's a statistically significant difference in the number of false positives between the two hash functions. Anecdotally, for the words dataset, MD5 was generating 1859 positives, while Murmur decreased that to 1815. THIS IS NOT SCIENTIFIC.

Common sense tells us that MD5, being cryptographically sound, should always stay ahead on pure entropy measurement, but the avalanching properties of Murmur are gorgeous. So I'm happy with this. 100% Pinkie Pie approved.

THAT'S IT

Ayup, as far as I'm concerned this now has acceptable performance to start building big stuff with it. I may look into making this even faster when I can play with more of our real-world data.

I understand these are very serious changes coming out of fucking nowhere, so I don't expect this to be merged straight away. Feel free to look at them with a magnifying glass, test it, see how it performs with your corpus (I assume they are links?), call me a moron and set my ass on :fire:... Anyway, you know how the song goes.

Hey, I just met you, and this is crazy, but I rewrote your bloom hashes, so merge me, maybe?

vmg added some commits
Vicent Marti vmg Switch MD5 with City 82a791d
Vicent Marti vmg ...and this is how you resize a file in Unix
"The truncate() and ftruncate() functions cause the regular file named
by path or referenced by fd to be truncated to a size of precisely
length bytes.

If the file previously was larger than this size, the extra data is
lost. If the file previously was shorter, it is extended, and the
extended part reads as null bytes ('\0')."
d2397a4
Vicent Marti vmg Add support for binary keys
There's no need for the keys to be NULL-terminated... That's so from the
90s!

Also, note that Python strings (and pretty much any dynamic language)
already keep length information. This will save us quite a few
`strlen` calls.
03155a0
Vicent Marti vmg Ok, disregard City that was a bad idea 21ae03e
Vicent Marti vmg Documentation is nice! f88ec1c
Vicent Marti vmg Oops! Indentation! cc4b676
Vicent Marti vmg Better seeds? 86a865c
Pierce Lopez
Owner

You should have seen how slow it was in pure python...

Anyway, I like the switch to a more efficient hash function, I like the change from NULL-terminated strings to buffer+length, and I like the fix for that crazy lseek loop.

Right now I'm tracking down what I think might be a bug, but (depending on what @hines and @mreiferson think) this could go in soon after.

Pierce Lopez
Owner

I notice that 3 different murmur hash functions are included, and 2 of them are used... maybe the x64_128 one isn't very appropriate for generating the seeds, but it would be nice if just one hash function was provided, and used.

Matt Reiferson
Owner

tl;dr


Honestly though, thanks for this excellent contribution! We're really excited to see all the interest in this project and we're glad GitHub can find some use for it.

Also, I'm on board with these changes.

Can't argue much with the raw speed improvements of the change to the hash function + file resize.

I also completely agree with the API change to take length args. It is more robust, was overlooked, and we might as well make the breaking changes now while the project is young.

We'll need to do some testing early this week, and as @ploxiln mentioned there's a possible bug being investigated, so we'll bring this in soon.

Thanks and keep 'em coming!

Justin Hines
Owner
Vicent Marti
vmg commented

:sparkles: Yey :sparkles:

Glad you like the changes! Sorry it took me a while to answer, I was watching SCIENCE.

I notice that 3 different murmur hash functions are included, and 2 of them are used... maybe the x64_128 one isn't very appropriate for generating the seeds, but it would be nice if just one hash function was provided, and used.

I see what you mean. I brought Murmur mostly intact from the original C++ sources (just with minimal translation to make it build as C), but it would make things simpler if we were just using the same hash for generating the salts and the key hashes. I'll look into this.

Regarding Murmur_x86_128 vs Murmur_x64_128: we could certainly drop the x86 version if you're not concerned about 32-bit performance (I am not). If you plan to target x86 systems, I can conditionally swap the appropriate hash at build time. You'll get a nice performance boost from the smaller (internal) word size.
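The conditional swap could be as simple as a preprocessor check on the target word size. A sketch (the constant name is mine; the real functions are MurmurHash3_x64_128 and MurmurHash3_x86_128, here we only record which one would be picked):

```c
#include <stdint.h>
#include <assert.h>

/* Pick the Murmur variant whose internal word size matches the
 * build target: x64_128 on 64-bit, x86_128 on 32-bit. */
#if UINTPTR_MAX > 0xffffffffUL
static const int bloom_hash_word_bits = 64;  /* -> MurmurHash3_x64_128 */
#else
static const int bloom_hash_word_bits = 32;  /* -> MurmurHash3_x86_128 */
#endif
```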

Pierce Lopez
Owner

We think you should drop the x86 version; we're not concerned about absolute best 32-bit performance, and it'll be faster than md5 on x86 anyway.

Matt Reiferson mreiferson commented on the diff
src/dablooms.c
((38 lines not shown))
- }
- bloom->num_salts = div;
- bloom->salts = calloc(div, SALT_SIZE);
- for (i = 0; i < div; i++) {
- struct cvs_MD5Context context;
- unsigned char checksum[16];
- cvs_MD5Init (&context);
- cvs_MD5Update (&context, (unsigned char *) &i, sizeof(int));
- cvs_MD5Final (checksum, &context);
- memcpy(bloom->salts + i * SALT_SIZE, &checksum, SALT_SIZE);
+ const uint32_t root = 0xba11742c;
+ const uint32_t seed = 0xd5702acb;
+
+ int i, num_salts = bloom->nfuncs / 4;
+
+ if (bloom->nfuncs % 4)
Matt Reiferson Owner

we should document this better in the README...

we use astyle on our C stuff, specifically this command line:

astyle --style=1tbs --lineend=linux --convert-tabs --preserve-date \
        --fill-empty-lines --pad-header --indent-switches           \
        --align-pointer=name --align-reference=name --pad-oper -n <file(s)>

the rest of your changes look fine, but this line would get the curly brace police all over it :)

also, we notably don't run astyle on code we've imported that isn't ours (like the md5/murmur files)

Pierce Lopez Owner
ploxiln added a note

by the way, we use revision 353 from the astyle sourceforge svn repo; it has some fixes not present in the latest release. I'll definitely make note of this stuff in the README

Matt Reiferson Owner

now documented in the README

Jinghao Yan
jinghao added a note

Micro-optimization: %4 is the same as & 3

@jinghao: Wouldn't any decent compiler backend transform integer arithmetic with constant parameters into the most efficient representation on the target automatically?

Pierce Lopez Owner
ploxiln added a note

In case anyone had doubts, yes, gcc optimises trivial short computations like x/4 and x%4, even without -O2. For demonstration:

src/dablooms.c:new_salts()

void new_salts(counting_bloom_t *bloom)
{
    int div = bloom->nfuncs / 4;
    int mod = bloom->nfuncs % 4;
...

the assembly generated

objdump -d build/test_dablooms | less
...
000000000040184c <new_salts>:
  40184c:       55                             push   %rbp
  40184d:       48 89 e5                       mov    %rsp,%rbp
  401850:       48 81 ec a0 00 00 00           sub    $0xa0,%rsp
  401857:       48 89 bd 68 ff ff ff           mov    %rdi,-0x98(%rbp)
  40185e:       48 8b 85 68 ff ff ff           mov    -0x98(%rbp),%rax
  401865:       48 8b 40 28                    mov    0x28(%rax),%rax
  401869:       48 c1 e8 02                    shr    $0x2,%rax                /* ">> 2" */
  40186d:       89 45 fc                       mov    %eax,-0x4(%rbp)
  401870:       48 8b 85 68 ff ff ff           mov    -0x98(%rbp),%rax
  401877:       48 8b 40 28                    mov    0x28(%rax),%rax
  40187b:       83 e0 03                       and    $0x3,%eax                /* "& 3" */
Pierce Lopez
Owner

@vmg can you please rebase on current master (there's going to be conflicts in the test_libdablooms.c due to recently merged changes of mine, sorry), and squash into 3 commits:

1) replace lseek() loop with lseek() || ftruncate()
2) replace md5 hash with murmur 128-bit hash
3) change interface to take buffer-length instead of null-terminated string (this will require fixes to the new test_pydablooms.py also)

thanks :)

Vicent Marti
vmg commented

Sorry for the delay, I've just landed on SF. I'll rebase the PR as soon as the jet lag allows.

(...why is this in the frontpage of HN?)

Jinghao Yan jinghao commented on the diff
src/murmur.c
((125 lines not shown))
+ uint32_t h4 = seed;
+
+ uint32_t c1 = 0x239b961b;
+ uint32_t c2 = 0xab0e9789;
+ uint32_t c3 = 0x38b34ae5;
+ uint32_t c4 = 0xa1e38b93;
+
+ int i;
+
+ //----------
+ // body
+
+ const uint32_t * blocks = (const uint32_t *)(data + nblocks*16);
+
+ for(i = -nblocks; i; i++) {
+ uint32_t k1 = getblock(blocks,i*4+0);
Jinghao Yan
jinghao added a note

Micro-optimization: << 2 is the same as * 4

(i << 4) | 1 is the same as i*4 + 1

same with 2, 3

This is in MurmurHash code, so if you think this is worth the loss in readability, you might want to open an issue at http://code.google.com/p/smhasher/issues/list. But then again, don't underestimate optimizing compilers – in my experience, most backends recognize simple peephole optimizations like this just fine.

Jinghao Yan jinghao commented on the diff
src/murmur.c
((226 lines not shown))
+
+ uint64_t h1 = seed;
+ uint64_t h2 = seed;
+
+ uint64_t c1 = BIG_CONSTANT(0x87c37b91114253d5);
+ uint64_t c2 = BIG_CONSTANT(0x4cf5ad432745937f);
+
+ int i;
+
+ //----------
+ // body
+
+ const uint64_t * blocks = (const uint64_t *)(data);
+
+ for(i = 0; i < nblocks; i++) {
+ uint64_t k1 = getblock(blocks,i*2+0);
Jinghao Yan
jinghao added a note

i << 1
(i << 1) | 1

As discussed above, compilers are smart enough to do appropriate transforms of this type nowadays. In fact, the (i<<1)|1 is a slight pessimization in my tests.

CF http://url.godbolt.org/shiftVsMultiply

Nicholas Curtis

This is the best pull request I have ever seen. You obviously know your stuff, @vmg. The development community should be very grateful for your contributions.

Tony Finch

Note that a Bloom filter only needs two hash functions: you can get an arbitrary number of hash values from a linear combination of the output of the two functions. This does not harm the accuracy of the Bloom filter compared to having many independent hash functions. See this paper for details: http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf
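The trick Tony references (Kirsch and Mitzenmacher) derives all nfuncs index values from just two base hashes: g_i(x) = h1(x) + i*h2(x) mod m. A sketch:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Derive nfuncs bloom indexes from two base hashes h1, h2:
 *   g_i = (h1 + i * h2) mod m
 * The widening cast keeps i * h2 from overflowing 32 bits. */
static void double_hash_indexes(uint32_t h1, uint32_t h2,
                                size_t nfuncs, uint32_t m,
                                uint32_t *out)
{
    size_t i;
    for (i = 0; i < nfuncs; i++)
        out[i] = (uint32_t)((h1 + (uint64_t)i * h2) % m);
}
```

In dablooms terms, one Murmur3_128 call could supply both h1 and h2 from its two 64-bit halves, replacing the whole per-salt hashing loop.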

Deleted user

"(...why is this in the frontpage of HN?)"

Nipple touching and SCIENCE watching.

Pierce Lopez
Owner

I've submitted a rebased version of this pull request as #39. I've kept authorship credits to @vmg (but please let me know if you don't want your name on the changes I made while rebasing and squashing).

Pierce Lopez
Owner

We really liked this pull request, and it's now merged, although as #39. Thanks!

Pierce Lopez ploxiln closed this
Vicent Marti

Oh shit. Sorry guys, I've spent the last week mostly sick, so this went totally over my head. Thanks tons for taking the time to squash and merge this, @ploxiln. :sparkles:

I'm going to strike back with faster hashing thanks to linear combination. Stay tuned.

Pierce Lopez
Owner

might you be talking about something similar to #41 - if so, we're already looking at that

J. Randall Hunt

@vmg what tool did you use to generate this:
trace

Vicent Marti

That is Apple's Instruments, in profiling mode, for Mac OS X 10.7. :sparkles:

Sylvester Jakubowski

best pull request ever.

sorry for the necropost, but I was digging for examples for some of my devs and I had to comment.

Commits on Aug 4, 2012
  1. Vicent Marti vmg: Switch MD5 with City
  2. Vicent Marti vmg: ...and this is how you resize a file in Unix
  3. Vicent Marti vmg: Add support for binary keys
  4. Vicent Marti vmg: Ok, disregard City that was a bad idea
  5. Vicent Marti vmg: Documentation is nice!
  6. Vicent Marti vmg: Oops! Indentation!
Commits on Aug 5, 2012
  1. Vicent Marti vmg: Better seeds?
2  Makefile
@@ -59,7 +59,7 @@ PY_MOD_DIR := $(shell $(PYTHON) -c "import distutils.sysconfig ; print(distutils
PY_FLAGS = --build-lib=$(PY_BLDDIR) --build-temp=$(PY_BLDDIR)
PY_BLD_ENV = INCPATH="$(SRCDIR) $(INCPATH)" LIBPATH="$(BLDDIR) $(LIBPATH)"
-SRCS_LIBDABLOOMS = md5.c dablooms.c
+SRCS_LIBDABLOOMS = murmur.c dablooms.c
SRCS_TESTS = test_dablooms.c
WORDS = /usr/share/dict/words
OBJS_LIBDABLOOMS = $(patsubst %.c, $(BLDDIR)/%.o, $(SRCS_LIBDABLOOMS))
18 pydablooms/pydablooms.c
@@ -51,37 +51,39 @@ static int Dablooms_init(Dablooms *self, PyObject *args, PyObject *kwds)
static PyObject *check(Dablooms *self, PyObject *args)
{
const char *hash;
- if (!PyArg_ParseTuple(args, "s", &hash)) {
+ int len;
+
+ if (!PyArg_ParseTuple(args, "s#", &hash, &len)) {
return NULL;
}
- return Py_BuildValue("i", scaling_bloom_check(self->filter, hash));
+ return Py_BuildValue("i", scaling_bloom_check(self->filter, hash, len));
}
static PyObject *add(Dablooms *self, PyObject *args, PyObject *kwds)
{
const char *hash;
- int id;
+ int id, len;
static char *kwlist[] = {"hash", "id", NULL};
- if (! PyArg_ParseTupleAndKeywords(args, kwds, "|si", kwlist, &hash, &id)) {
+ if (! PyArg_ParseTupleAndKeywords(args, kwds, "|s#i", kwlist, &hash, &len, &id)) {
return NULL;
}
- return Py_BuildValue("i", scaling_bloom_add(self->filter, hash, id));
+ return Py_BuildValue("i", scaling_bloom_add(self->filter, hash, len, id));
}
static PyObject *delete(Dablooms *self, PyObject *args, PyObject *kwds)
{
const char *hash;
- int id;
+ int id, len;
static char *kwlist[] = {"hash", "id", NULL};
- if (! PyArg_ParseTupleAndKeywords(args, kwds, "|si", kwlist, &hash, &id)) {
+ if (! PyArg_ParseTupleAndKeywords(args, kwds, "|s#i", kwlist, &hash, &len, &id)) {
return NULL;
}
- return Py_BuildValue("i", scaling_bloom_remove(self->filter, hash, id));
+ return Py_BuildValue("i", scaling_bloom_remove(self->filter, hash, len, id));
}
static PyObject *flush(Dablooms *self, PyObject *args, PyObject *kwds)
150 src/dablooms.c
@@ -12,14 +12,13 @@
#include <sys/mman.h>
#include <unistd.h>
-#include "md5.h"
+#include "murmur.h"
#include "dablooms.h"
#define DABLOOMS_VERSION "0.8.1"
#define HEADER_BYTES (2*sizeof(uint32_t))
#define SCALE_HEADER_BYTES (3*sizeof(uint64_t))
-#define SALT_SIZE 16
const char *dablooms_version(void)
{
@@ -45,16 +44,9 @@ bitmap_t *bitmap_resize(bitmap_t *bitmap, size_t old_size, size_t new_size)
/* Write something to the end of the file to insure allocated the space */
if (size == old_size) {
- for (; size < new_size; size++) {
- if (lseek(fd, size, SEEK_SET) < 0) {
- perror("Error, calling lseek() to set file size");
- free_bitmap(bitmap);
- close(fd);
- return NULL;
- }
- }
- if (write(fd, "", 1) < 0) {
- perror("Error, writing last byte of the file");
+ if (lseek(fd, new_size, SEEK_SET) < 0 ||
+ ftruncate(fd, (off_t)new_size) < 0) {
+ perror("Error increasing file size with ftruncate");
free_bitmap(bitmap);
close(fd);
return NULL;
@@ -184,58 +176,76 @@ int bitmap_flush(bitmap_t *bitmap)
}
}
-/* Each function has a unique salt, so we need at least nfuncs salts.
- * An MD5 hash is 16 bytes long, and each salt only needds to be 4 bytes
- * Thus we can proportion 4 salts per each md5 hash we create as a salt.
+/*
+ * Build some sexy new salts for the bloom filter. How?
+ *
+ * With Murmur3_128, we turn a key and a 4-byte salt into a 16 bytes
+ * hash; this hash can be split in four 4-byte hashes, which are
+ * the target size for our bloom filter.
+ *
+ * Hence if we require `nfunc` 4-byte hashes, we need to generate
+ * `nfunc` / 4 different salts (this number in rounded upwards for
+ * the cases where `nfunc` doesn't divide evenly, and we only need
+ * to take 1, 2 or 3 words from the 128-bit hash seeded with the
+ * last salt).
+ *
+ * We build these salts incrementally using Murmur3_32 (4-byte output,
+ * matches our target salt size). The intitial salt is a function
+ * of a predefined root; consequent salts are chained on top of the
+ * first one using the same seed but xor'ed with the salt index.
+ *
+ * Note that this salt generation is stable, i.e. will always remain
+ * the same between different instantiations of a filter. There is
+ * no pure randomness involved.
*/
-void new_salts(counting_bloom_t *bloom)
+static void new_salts(counting_bloom_t *bloom)
{
- int div = bloom->nfuncs / 4;
- int mod = bloom->nfuncs % 4;
- int i;
-
- if (mod) {
- div += 1;
- }
- bloom->num_salts = div;
- bloom->salts = calloc(div, SALT_SIZE);
- for (i = 0; i < div; i++) {
- struct cvs_MD5Context context;
- unsigned char checksum[16];
- cvs_MD5Init (&context);
- cvs_MD5Update (&context, (unsigned char *) &i, sizeof(int));
- cvs_MD5Final (checksum, &context);
- memcpy(bloom->salts + i * SALT_SIZE, &checksum, SALT_SIZE);
+ const uint32_t root = 0xba11742c;
+ const uint32_t seed = 0xd5702acb;
+
+ int i, num_salts = bloom->nfuncs / 4;
+
+ if (bloom->nfuncs % 4)
+ num_salts++;
+
+ bloom->salts = calloc(num_salts, sizeof(uint32_t));
+ bloom->nsalts = num_salts;
+
+ /* initial salt, seeded from root */
+ MurmurHash3_x86_32((char *)&root, sizeof(uint32_t), seed, bloom->salts);
+
+ for (i = 1; i < num_salts; i++) {
+ /* remaining salts are chained on top */
+ uint32_t base = bloom->salts[i - 1] ^ i;
+ MurmurHash3_x86_32((char *)&base, sizeof(uint32_t), seed, bloom->salts + i);
}
}
-/* We are are using the salts, adding them to the new md5 hash, adding the key,
- * converting said md5 hash to 4 byte indexes
+/*
+ * Perform the actual hashing for `key`
+ *
+ * We get one 128-bit hash for every salt we've previously
+ * allocated. From this 128-bit hash, we get 4 32-bit hashes
+ * with our target size; we need to wrap them around
+ * individually.
+ *
+ * Note that there are no overflow checks for the cases where
+ * we have a non-multiple of 4 number of hashes, because we've
+ * allocated the `hashes` array in 16-byte boundaries. In these
+ * cases, the remaining 1, 2 or 3 hashes will simply not be
+ * accessed.
*/
-unsigned int *hash_func(counting_bloom_t *bloom, const char *key, unsigned int *hashes)
+static void hash_func(counting_bloom_t *bloom, const char *key, size_t key_len, uint32_t *hashes)
{
+ int i;
- int i, j, hash_cnt, hash;
- unsigned char *salts = bloom->salts;
- hash_cnt = 0;
-
- for (i = 0; i < bloom->num_salts; i++) {
- struct cvs_MD5Context context;
- unsigned char checksum[16];
- cvs_MD5Init(&context);
- cvs_MD5Update(&context, salts + i * SALT_SIZE, SALT_SIZE);
- cvs_MD5Update(&context, (unsigned char *)key, strlen(key));
- cvs_MD5Final(checksum, &context);
- for (j = 0; j < sizeof(checksum); j += 4) {
- if (hash_cnt >= (bloom->nfuncs)) {
- break;
- }
- hash = *(uint32_t *)(checksum + j);
- hashes[hash_cnt] = hash % bloom->counts_per_func;
- hash_cnt++;
- }
+ for (i = 0; i < bloom->nsalts; i++, hashes += 4) {
+ MurmurHash3_x64_128(key, key_len, bloom->salts[i], hashes);
+ hashes[0] = hashes[0] % bloom->counts_per_func;
+ hashes[1] = hashes[1] % bloom->counts_per_func;
+ hashes[2] = hashes[2] % bloom->counts_per_func;
+ hashes[3] = hashes[3] % bloom->counts_per_func;
}
- return hashes;
}
int free_counting_bloom(counting_bloom_t *bloom)
@@ -278,8 +288,12 @@ counting_bloom_t *counting_bloom_init(unsigned int capacity, double error_rate,
bloom->counts_per_func = (int) ceil(capacity * fabs(log(error_rate)) / (bloom->nfuncs * pow(log(2), 2)));
bloom->size = ceil(bloom->nfuncs * bloom->counts_per_func);
bloom->num_bytes = (int) ceil(bloom->size / 2 + HEADER_BYTES);
- bloom->hashes = calloc(bloom->nfuncs, sizeof(unsigned int));
+
new_salts(bloom);
+
+ /* hashes; make sure they are always allocated as a multiple of 16
+ * to skip the overflow check when generating 128-bit hashes */
+ bloom->hashes = malloc(bloom->nsalts * 16);
return bloom;
}
@@ -318,12 +332,12 @@ counting_bloom_t *counting_bloom_from_file(unsigned capacity, double error_rate,
return cur_bloom;
}
-int counting_bloom_add(counting_bloom_t *bloom, const char *s)
+int counting_bloom_add(counting_bloom_t *bloom, const char *s, size_t len)
{
unsigned int index, i, offset;
unsigned int *hashes = bloom->hashes;
- hash_func(bloom, s, hashes);
+ hash_func(bloom, s, len, hashes);
for (i = 0; i < bloom->nfuncs; i++) {
offset = i * bloom->counts_per_func;
@@ -335,12 +349,12 @@ int counting_bloom_add(counting_bloom_t *bloom, const char *s)
return 0;
}
-int counting_bloom_remove(counting_bloom_t *bloom, const char *s)
+int counting_bloom_remove(counting_bloom_t *bloom, const char *s, size_t len)
{
unsigned int index, i, offset;
unsigned int *hashes = bloom->hashes;
- hash_func(bloom, s, hashes);
+ hash_func(bloom, s, len, hashes);
for (i = 0; i < bloom->nfuncs; i++) {
offset = i * bloom->counts_per_func;
@@ -352,12 +366,12 @@ int counting_bloom_remove(counting_bloom_t *bloom, const char *s)
return 0;
}
-int counting_bloom_check(counting_bloom_t *bloom, const char *s)
+int counting_bloom_check(counting_bloom_t *bloom, const char *s, size_t len)
{
unsigned int index, i, offset;
unsigned int *hashes = bloom->hashes;
- hash_func(bloom, s, hashes);
+ hash_func(bloom, s, len, hashes);
for (i = 0; i < bloom->nfuncs; i++) {
offset = i * bloom->counts_per_func;
@@ -425,7 +439,7 @@ counting_bloom_t *new_counting_bloom_from_scale(scaling_bloom_t *bloom, uint32_t
}
-int scaling_bloom_add(scaling_bloom_t *bloom, const char *s, uint32_t id)
+int scaling_bloom_add(scaling_bloom_t *bloom, const char *s, size_t len, uint32_t id)
{
int i;
int nblooms = bloom->num_blooms;
@@ -444,14 +458,14 @@ int scaling_bloom_add(scaling_bloom_t *bloom, const char *s, uint32_t id)
if ((*bloom->header->max_id) < id) {
(*bloom->header->max_id) = id;
}
- counting_bloom_add(cur_bloom, s);
+ counting_bloom_add(cur_bloom, s, len);
(*bloom->header->posseq) ++;
return 1;
}
-int scaling_bloom_remove(scaling_bloom_t *bloom, const char *s, uint32_t id)
+int scaling_bloom_remove(scaling_bloom_t *bloom, const char *s, size_t len, uint32_t id)
{
counting_bloom_t *cur_bloom;
int id_diff, i;
@@ -461,7 +475,7 @@ int scaling_bloom_remove(scaling_bloom_t *bloom, const char *s, uint32_t id)
id_diff = id - (*cur_bloom->header->id);
if (id_diff >= 0) {
(*bloom->header->preseq)++;
- counting_bloom_remove(cur_bloom, s);
+ counting_bloom_remove(cur_bloom, s, len);
(*bloom->header->posseq)++;
return 1;
}
@@ -469,13 +483,13 @@ int scaling_bloom_remove(scaling_bloom_t *bloom, const char *s, uint32_t id)
return 0;
}
-int scaling_bloom_check(scaling_bloom_t *bloom, const char *s)
+int scaling_bloom_check(scaling_bloom_t *bloom, const char *s, size_t len)
{
int i;
counting_bloom_t *cur_bloom;
for (i = bloom->num_blooms - 1; i >= 0; i--) {
cur_bloom = bloom->blooms[i];
- if (counting_bloom_check(cur_bloom, s)) {
+ if (counting_bloom_check(cur_bloom, s, len)) {
return 1;
}
}
20 src/dablooms.h
@@ -35,9 +35,11 @@ typedef struct {
unsigned int capacity;
unsigned int offset;
unsigned int counts_per_func;
- unsigned int num_salts;
- unsigned char *salts;
- unsigned int *hashes;
+
+ uint32_t *salts;
+ uint32_t *hashes;
+
+ size_t nsalts;
size_t nfuncs;
size_t size;
size_t num_bytes;
@@ -48,9 +50,9 @@ typedef struct {
int free_counting_bloom(counting_bloom_t *bloom);
counting_bloom_t *new_counting_bloom(unsigned int capacity, double error_rate, const char *filename);
counting_bloom_t *new_counting_bloom_from_file(unsigned int capacity, double error_rate, const char *filename);
-int counting_bloom_add(counting_bloom_t *bloom, const char *s);
-int counting_bloom_remove(counting_bloom_t *bloom, const char *s);
-int counting_bloom_check(counting_bloom_t *bloom, const char *s);
+int counting_bloom_add(counting_bloom_t *bloom, const char *s, size_t len);
+int counting_bloom_remove(counting_bloom_t *bloom, const char *s, size_t len);
+int counting_bloom_check(counting_bloom_t *bloom, const char *s, size_t len);
typedef struct {
@@ -74,8 +76,8 @@ typedef struct {
scaling_bloom_t *new_scaling_bloom(unsigned int capacity, double error_rate, const char *filename, uint32_t id);
scaling_bloom_t *new_scaling_bloom_from_file(unsigned int capacity, double error_rate, const char *filename);
int free_scaling_bloom(scaling_bloom_t *bloom);
-int scaling_bloom_add(scaling_bloom_t *bloom, const char *s, uint32_t id);
-int scaling_bloom_remove(scaling_bloom_t *bloom, const char *s, uint32_t id);
-int scaling_bloom_check(scaling_bloom_t *bloom, const char *s);
+int scaling_bloom_add(scaling_bloom_t *bloom, const char *s, size_t len, uint32_t id);
+int scaling_bloom_remove(scaling_bloom_t *bloom, const char *s, size_t len, uint32_t id);
+int scaling_bloom_check(scaling_bloom_t *bloom, const char *s, size_t len);
int scaling_bloom_flush(scaling_bloom_t *bloom);
#endif
332 src/md5.c
@@ -1,332 +0,0 @@
-/*
- * This code implements the MD5 message-digest algorithm.
- * The algorithm is due to Ron Rivest. This code was
- * written by Colin Plumb in 1993, no copyright is claimed.
- * This code is in the public domain; do with it what you wish.
- *
- * Equivalent code is available from RSA Data Security, Inc.
- * This code has been tested against that, and is equivalent,
- * except that you don't need to include two pages of legalese
- * with every copy.
- *
- * To compute the message digest of a chunk of bytes, declare an
- * MD5Context structure, pass it to MD5Init, call MD5Update as
- * needed on buffers full of bytes, and then call MD5Final, which
- * will fill a supplied 16-byte array with the digest.
- */
-
-/* This code was modified in 1997 by Jim Kingdon of Cyclic Software to
- not require an integer type which is exactly 32 bits. This work
- draws on the changes for the same purpose by Tatu Ylonen
- <ylo@cs.hut.fi> as part of SSH, but since I didn't actually use
- that code, there is no copyright issue. I hereby disclaim
- copyright in any changes I have made; this code remains in the
- public domain. */
-
-/* Note regarding cvs_* namespace: this avoids potential conflicts
- with libraries such as some versions of Kerberos. No particular
- need to worry about whether the system supplies an MD5 library, as
- this file is only about 3k of object code. */
-
-#ifdef HAVE_CONFIG_H
-#include "config.h"
-#endif
-
-#include <string.h> /* for memcpy() and memset() */
-
-/* Add prototype support. */
-#ifndef PROTO
-#if defined (USE_PROTOTYPES) ? USE_PROTOTYPES : defined (__STDC__)
-#define PROTO(ARGS) ARGS
-#else
-#define PROTO(ARGS) ()
-#endif
-#endif
-
-#include "md5.h"
-
-/* Little-endian byte-swapping routines. Note that these do not
- depend on the size of datatypes such as cvs_uint32, nor do they require
- us to detect the endianness of the machine we are running on. It
- is possible they should be macros for speed, but I would be
- surprised if they were a performance bottleneck for MD5. */
-
-static cvs_uint32
-getu32 (addr)
- const unsigned char *addr;
-{
- return (((((unsigned long)addr[3] << 8) | addr[2]) << 8)
- | addr[1]) << 8 | addr[0];
-}
-
-static void
-putu32 (data, addr)
- cvs_uint32 data;
- unsigned char *addr;
-{
- addr[0] = (unsigned char)data;
- addr[1] = (unsigned char)(data >> 8);
- addr[2] = (unsigned char)(data >> 16);
- addr[3] = (unsigned char)(data >> 24);
-}
-
-/*
- * Start MD5 accumulation. Set bit count to 0 and buffer to mysterious
- * initialization constants.
- */
-void
-cvs_MD5Init (ctx)
- struct cvs_MD5Context *ctx;
-{
- ctx->buf[0] = 0x67452301;
- ctx->buf[1] = 0xefcdab89;
- ctx->buf[2] = 0x98badcfe;
- ctx->buf[3] = 0x10325476;
-
- ctx->bits[0] = 0;
- ctx->bits[1] = 0;
-}
-
-/*
- * Update context to reflect the concatenation of another buffer full
- * of bytes.
- */
-void
-cvs_MD5Update (ctx, buf, len)
- struct cvs_MD5Context *ctx;
- unsigned char const *buf;
- unsigned len;
-{
- cvs_uint32 t;
-
- /* Update bitcount */
-
- t = ctx->bits[0];
- if ((ctx->bits[0] = (t + ((cvs_uint32)len << 3)) & 0xffffffff) < t)
- ctx->bits[1]++; /* Carry from low to high */
- ctx->bits[1] += len >> 29;
-
- t = (t >> 3) & 0x3f; /* Bytes already in shsInfo->data */
-
- /* Handle any leading odd-sized chunks */
-
- if ( t ) {
- unsigned char *p = ctx->in + t;
-
- t = 64-t;
- if (len < t) {
- memcpy(p, buf, len);
- return;
- }
- memcpy(p, buf, t);
- cvs_MD5Transform (ctx->buf, ctx->in);
- buf += t;
- len -= t;
- }
-
- /* Process data in 64-byte chunks */
-
- while (len >= 64) {
- memcpy(ctx->in, buf, 64);
- cvs_MD5Transform (ctx->buf, ctx->in);
- buf += 64;
- len -= 64;
- }
-
- /* Handle any remaining bytes of data. */
-
- memcpy(ctx->in, buf, len);
-}
-
-/*
- * Final wrapup - pad to 64-byte boundary with the bit pattern
- * 1 0* (64-bit count of bits processed, MSB-first)
- */
-void
-cvs_MD5Final (digest, ctx)
- unsigned char digest[16];
- struct cvs_MD5Context *ctx;
-{
- unsigned count;
- unsigned char *p;
-
- /* Compute number of bytes mod 64 */
- count = (ctx->bits[0] >> 3) & 0x3F;
-
- /* Set the first char of padding to 0x80. This is safe since there is
- always at least one byte free */
- p = ctx->in + count;
- *p++ = 0x80;
-
- /* Bytes of padding needed to make 64 bytes */
- count = 64 - 1 - count;
-
- /* Pad out to 56 mod 64 */
- if (count < 8) {
- /* Two lots of padding: Pad the first block to 64 bytes */
- memset(p, 0, count);
- cvs_MD5Transform (ctx->buf, ctx->in);
-
- /* Now fill the next block with 56 bytes */
- memset(ctx->in, 0, 56);
- } else {
- /* Pad block to 56 bytes */
- memset(p, 0, count-8);
- }
-
- /* Append length in bits and transform */
- putu32(ctx->bits[0], ctx->in + 56);
- putu32(ctx->bits[1], ctx->in + 60);
-
- cvs_MD5Transform (ctx->buf, ctx->in);
- putu32(ctx->buf[0], digest);
- putu32(ctx->buf[1], digest + 4);
- putu32(ctx->buf[2], digest + 8);
- putu32(ctx->buf[3], digest + 12);
- memset(ctx, 0, sizeof(ctx)); /* In case it's sensitive */
-}
-
-#ifndef ASM_MD5
-
-/* The four core functions - F1 is optimized somewhat */
-
-/* #define F1(x, y, z) (x & y | ~x & z) */
-#define F1(x, y, z) (z ^ (x & (y ^ z)))
-#define F2(x, y, z) F1(z, x, y)
-#define F3(x, y, z) (x ^ y ^ z)
-#define F4(x, y, z) (y ^ (x | ~z))
-
-/* This is the central step in the MD5 algorithm. */
-#define MD5STEP(f, w, x, y, z, data, s) \
- ( w += f(x, y, z) + data, w &= 0xffffffff, w = w<<s | w>>(32-s), w += x )
-
-/*
- * The core of the MD5 algorithm, this alters an existing MD5 hash to
- * reflect the addition of 16 longwords of new data. MD5Update blocks
- * the data and converts bytes into longwords for this routine.
- */
-void
-cvs_MD5Transform (buf, inraw)
- cvs_uint32 buf[4];
- const unsigned char inraw[64];
-{
- register cvs_uint32 a, b, c, d;
- cvs_uint32 in[16];
- int i;
-
- for (i = 0; i < 16; ++i)
- in[i] = getu32 (inraw + 4 * i);
-
- a = buf[0];
- b = buf[1];
- c = buf[2];
- d = buf[3];
-
- MD5STEP(F1, a, b, c, d, in[ 0]+0xd76aa478, 7);
- MD5STEP(F1, d, a, b, c, in[ 1]+0xe8c7b756, 12);
- MD5STEP(F1, c, d, a, b, in[ 2]+0x242070db, 17);
- MD5STEP(F1, b, c, d, a, in[ 3]+0xc1bdceee, 22);
- MD5STEP(F1, a, b, c, d, in[ 4]+0xf57c0faf, 7);
- MD5STEP(F1, d, a, b, c, in[ 5]+0x4787c62a, 12);
- MD5STEP(F1, c, d, a, b, in[ 6]+0xa8304613, 17);
- MD5STEP(F1, b, c, d, a, in[ 7]+0xfd469501, 22);
- MD5STEP(F1, a, b, c, d, in[ 8]+0x698098d8, 7);
- MD5STEP(F1, d, a, b, c, in[ 9]+0x8b44f7af, 12);
- MD5STEP(F1, c, d, a, b, in[10]+0xffff5bb1, 17);
- MD5STEP(F1, b, c, d, a, in[11]+0x895cd7be, 22);
- MD5STEP(F1, a, b, c, d, in[12]+0x6b901122, 7);
- MD5STEP(F1, d, a, b, c, in[13]+0xfd987193, 12);
- MD5STEP(F1, c, d, a, b, in[14]+0xa679438e, 17);
- MD5STEP(F1, b, c, d, a, in[15]+0x49b40821, 22);
-
- MD5STEP(F2, a, b, c, d, in[ 1]+0xf61e2562, 5);
- MD5STEP(F2, d, a, b, c, in[ 6]+0xc040b340, 9);
- MD5STEP(F2, c, d, a, b, in[11]+0x265e5a51, 14);
- MD5STEP(F2, b, c, d, a, in[ 0]+0xe9b6c7aa, 20);
- MD5STEP(F2, a, b, c, d, in[ 5]+0xd62f105d, 5);
- MD5STEP(F2, d, a, b, c, in[10]+0x02441453, 9);
- MD5STEP(F2, c, d, a, b, in[15]+0xd8a1e681, 14);
- MD5STEP(F2, b, c, d, a, in[ 4]+0xe7d3fbc8, 20);
- MD5STEP(F2, a, b, c, d, in[ 9]+0x21e1cde6, 5);
- MD5STEP(F2, d, a, b, c, in[14]+0xc33707d6, 9);
- MD5STEP(F2, c, d, a, b, in[ 3]+0xf4d50d87, 14);
- MD5STEP(F2, b, c, d, a, in[ 8]+0x455a14ed, 20);
- MD5STEP(F2, a, b, c, d, in[13]+0xa9e3e905, 5);
- MD5STEP(F2, d, a, b, c, in[ 2]+0xfcefa3f8, 9);
- MD5STEP(F2, c, d, a, b, in[ 7]+0x676f02d9, 14);
- MD5STEP(F2, b, c, d, a, in[12]+0x8d2a4c8a, 20);
-
- MD5STEP(F3, a, b, c, d, in[ 5]+0xfffa3942, 4);
- MD5STEP(F3, d, a, b, c, in[ 8]+0x8771f681, 11);
- MD5STEP(F3, c, d, a, b, in[11]+0x6d9d6122, 16);
- MD5STEP(F3, b, c, d, a, in[14]+0xfde5380c, 23);
- MD5STEP(F3, a, b, c, d, in[ 1]+0xa4beea44, 4);
- MD5STEP(F3, d, a, b, c, in[ 4]+0x4bdecfa9, 11);
- MD5STEP(F3, c, d, a, b, in[ 7]+0xf6bb4b60, 16);
- MD5STEP(F3, b, c, d, a, in[10]+0xbebfbc70, 23);
- MD5STEP(F3, a, b, c, d, in[13]+0x289b7ec6, 4);
- MD5STEP(F3, d, a, b, c, in[ 0]+0xeaa127fa, 11);
- MD5STEP(F3, c, d, a, b, in[ 3]+0xd4ef3085, 16);
- MD5STEP(F3, b, c, d, a, in[ 6]+0x04881d05, 23);
- MD5STEP(F3, a, b, c, d, in[ 9]+0xd9d4d039, 4);
- MD5STEP(F3, d, a, b, c, in[12]+0xe6db99e5, 11);
- MD5STEP(F3, c, d, a, b, in[15]+0x1fa27cf8, 16);
- MD5STEP(F3, b, c, d, a, in[ 2]+0xc4ac5665, 23);
-
- MD5STEP(F4, a, b, c, d, in[ 0]+0xf4292244, 6);
- MD5STEP(F4, d, a, b, c, in[ 7]+0x432aff97, 10);
- MD5STEP(F4, c, d, a, b, in[14]+0xab9423a7, 15);
- MD5STEP(F4, b, c, d, a, in[ 5]+0xfc93a039, 21);
- MD5STEP(F4, a, b, c, d, in[12]+0x655b59c3, 6);
- MD5STEP(F4, d, a, b, c, in[ 3]+0x8f0ccc92, 10);
- MD5STEP(F4, c, d, a, b, in[10]+0xffeff47d, 15);
- MD5STEP(F4, b, c, d, a, in[ 1]+0x85845dd1, 21);
- MD5STEP(F4, a, b, c, d, in[ 8]+0x6fa87e4f, 6);
- MD5STEP(F4, d, a, b, c, in[15]+0xfe2ce6e0, 10);
- MD5STEP(F4, c, d, a, b, in[ 6]+0xa3014314, 15);
- MD5STEP(F4, b, c, d, a, in[13]+0x4e0811a1, 21);
- MD5STEP(F4, a, b, c, d, in[ 4]+0xf7537e82, 6);
- MD5STEP(F4, d, a, b, c, in[11]+0xbd3af235, 10);
- MD5STEP(F4, c, d, a, b, in[ 2]+0x2ad7d2bb, 15);
- MD5STEP(F4, b, c, d, a, in[ 9]+0xeb86d391, 21);
-
- buf[0] += a;
- buf[1] += b;
- buf[2] += c;
- buf[3] += d;
-}
-#endif
-
-#ifdef TEST
-/* Simple test program. Can use it to manually run the tests from
- RFC1321 for example. */
-#include <stdio.h>
-
-int
-main (int argc, char **argv)
-{
- struct cvs_MD5Context context;
- unsigned char checksum[16];
- int i;
- int j;
-
- if (argc < 2)
- {
- fprintf (stderr, "usage: %s string-to-hash\n", argv[0]);
- exit (1);
- }
- for (j = 1; j < argc; ++j)
- {
- printf ("MD5 (\"%s\") = ", argv[j]);
- cvs_MD5Init (&context);
- cvs_MD5Update (&context, argv[j], strlen (argv[j]));
- cvs_MD5Final (checksum, &context);
- for (i = 0; i < 16; i++)
- {
- printf ("%02x", (unsigned int) checksum[i]);
- }
- printf ("\n");
- }
- return 0;
-}
-#endif /* TEST */
39 src/md5.h
@@ -1,39 +0,0 @@
-/* See md5.c for explanation and copyright information. */
-
-/*
- * $FreeBSD: src/contrib/cvs/lib/md5.h,v 1.2 1999/12/11 15:10:02 peter Exp $
- */
-
-/* Add prototype support. */
-#ifndef PROTO
-#if defined (USE_PROTOTYPES) ? USE_PROTOTYPES : defined (__STDC__)
-#define PROTO(ARGS) ARGS
-#else
-#define PROTO(ARGS) ()
-#endif
-#endif
-
-#ifndef MD5_H
-#define MD5_H
-
-/* Unlike previous versions of this code, uint32 need not be exactly
- 32 bits, merely 32 bits or more. Choosing a data type which is 32
- bits instead of 64 is not important; speed is considerably more
- important. ANSI guarantees that "unsigned long" will be big enough,
- and always using it seems to have few disadvantages. */
-typedef unsigned long cvs_uint32;
-
-struct cvs_MD5Context {
- cvs_uint32 buf[4];
- cvs_uint32 bits[2];
- unsigned char in[64];
-};
-
-void cvs_MD5Init PROTO ((struct cvs_MD5Context *context));
-void cvs_MD5Update PROTO ((struct cvs_MD5Context *context,
- unsigned char const *buf, unsigned len));
-void cvs_MD5Final PROTO ((unsigned char digest[16],
- struct cvs_MD5Context *context));
-void cvs_MD5Transform PROTO ((cvs_uint32 buf[4], const unsigned char in[64]));
-
-#endif /* !MD5_H */
300 src/murmur.c
@@ -0,0 +1,300 @@
+//-----------------------------------------------------------------------------
+// MurmurHash3 was written by Austin Appleby, and is placed in the public
+// domain. The author hereby disclaims copyright to this source code.
+
+// Note - The x86 and x64 versions do _not_ produce the same results, as the
+// algorithms are optimized for their respective platforms. You can still
+// compile and run any of them on any platform, but your performance with the
+// non-native version will be less than optimal.
+
+#include "murmur.h"
+
+#define FORCE_INLINE __attribute__((always_inline))
+
+FORCE_INLINE uint32_t rotl32 ( uint32_t x, int8_t r )
+{
+ return (x << r) | (x >> (32 - r));
+}
+
+FORCE_INLINE uint64_t rotl64 ( uint64_t x, int8_t r )
+{
+ return (x << r) | (x >> (64 - r));
+}
+
+#define ROTL32(x,y) rotl32(x,y)
+#define ROTL64(x,y) rotl64(x,y)
+
+#define BIG_CONSTANT(x) (x##LLU)
+
+#define getblock(x, i) (x[i])
+
+//-----------------------------------------------------------------------------
+// Finalization mix - force all bits of a hash block to avalanche
+
+FORCE_INLINE uint32_t fmix32(uint32_t h)
+{
+ h ^= h >> 16;
+ h *= 0x85ebca6b;
+ h ^= h >> 13;
+ h *= 0xc2b2ae35;
+ h ^= h >> 16;
+
+ return h;
+}
+
+//----------
+
+FORCE_INLINE uint64_t fmix64(uint64_t k)
+{
+ k ^= k >> 33;
+ k *= BIG_CONSTANT(0xff51afd7ed558ccd);
+ k ^= k >> 33;
+ k *= BIG_CONSTANT(0xc4ceb9fe1a85ec53);
+ k ^= k >> 33;
+
+ return k;
+}
+
+//-----------------------------------------------------------------------------
+
+void MurmurHash3_x86_32 ( const void * key, int len,
+ uint32_t seed, void * out )
+{
+ const uint8_t * data = (const uint8_t*)key;
+ const int nblocks = len / 4;
+
+ uint32_t h1 = seed;
+
+ uint32_t c1 = 0xcc9e2d51;
+ uint32_t c2 = 0x1b873593;
+
+ int i;
+
+ //----------
+ // body
+
+ const uint32_t * blocks = (const uint32_t *)(data + nblocks*4);
+
+ for(i = -nblocks; i; i++) {
+ uint32_t k1 = getblock(blocks,i);
+
+ k1 *= c1;
+ k1 = ROTL32(k1,15);
+ k1 *= c2;
+
+ h1 ^= k1;
+ h1 = ROTL32(h1,13);
+ h1 = h1*5+0xe6546b64;
+ }
+
+ //----------
+ // tail
+
+ const uint8_t * tail = (const uint8_t*)(data + nblocks*4);
+
+ uint32_t k1 = 0;
+
+ switch(len & 3) {
+ case 3: k1 ^= tail[2] << 16;
+ case 2: k1 ^= tail[1] << 8;
+ case 1: k1 ^= tail[0];
+ k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
+ }
+
+ //----------
+ // finalization
+
+ h1 ^= len;
+
+ h1 = fmix32(h1);
+
+ *(uint32_t*)out = h1;
+}
+
+//-----------------------------------------------------------------------------
+
+void MurmurHash3_x86_128 ( const void * key, const int len,
+ uint32_t seed, void * out )
+{
+ const uint8_t * data = (const uint8_t*)key;
+ const int nblocks = len / 16;
+
+ uint32_t h1 = seed;
+ uint32_t h2 = seed;
+ uint32_t h3 = seed;
+ uint32_t h4 = seed;
+
+ uint32_t c1 = 0x239b961b;
+ uint32_t c2 = 0xab0e9789;
+ uint32_t c3 = 0x38b34ae5;
+ uint32_t c4 = 0xa1e38b93;
+
+ int i;
+
+ //----------
+ // body
+
+ const uint32_t * blocks = (const uint32_t *)(data + nblocks*16);
+
+ for(i = -nblocks; i; i++) {
+ uint32_t k1 = getblock(blocks,i*4+0);
Jinghao Yan
jinghao added a note

Micro-optimization: << 2 is the same as * 4.

(i << 2) | 1 is the same as i*4 + 1; same with 2 and 3.

This is in MurmurHash code, so if you think this is worth the loss in readability, you might want to open an issue at http://code.google.com/p/smhasher/issues/list. But then again, don't underestimate optimizing compilers – in my experience, most backends recognize simple peephole optimizations like this just fine.
+ uint32_t k2 = getblock(blocks,i*4+1);
+ uint32_t k3 = getblock(blocks,i*4+2);
+ uint32_t k4 = getblock(blocks,i*4+3);
+
+ k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
+
+ h1 = ROTL32(h1,19); h1 += h2; h1 = h1*5+0x561ccd1b;
+
+ k2 *= c2; k2 = ROTL32(k2,16); k2 *= c3; h2 ^= k2;
+
+ h2 = ROTL32(h2,17); h2 += h3; h2 = h2*5+0x0bcaa747;
+
+ k3 *= c3; k3 = ROTL32(k3,17); k3 *= c4; h3 ^= k3;
+
+ h3 = ROTL32(h3,15); h3 += h4; h3 = h3*5+0x96cd1c35;
+
+ k4 *= c4; k4 = ROTL32(k4,18); k4 *= c1; h4 ^= k4;
+
+ h4 = ROTL32(h4,13); h4 += h1; h4 = h4*5+0x32ac3b17;
+ }
+
+ //----------
+ // tail
+
+ const uint8_t * tail = (const uint8_t*)(data + nblocks*16);
+
+ uint32_t k1 = 0;
+ uint32_t k2 = 0;
+ uint32_t k3 = 0;
+ uint32_t k4 = 0;
+
+ switch(len & 15) {
+ case 15: k4 ^= tail[14] << 16;
+ case 14: k4 ^= tail[13] << 8;
+ case 13: k4 ^= tail[12] << 0;
+ k4 *= c4; k4 = ROTL32(k4,18); k4 *= c1; h4 ^= k4;
+
+ case 12: k3 ^= tail[11] << 24;
+ case 11: k3 ^= tail[10] << 16;
+ case 10: k3 ^= tail[ 9] << 8;
+ case 9: k3 ^= tail[ 8] << 0;
+ k3 *= c3; k3 = ROTL32(k3,17); k3 *= c4; h3 ^= k3;
+
+ case 8: k2 ^= tail[ 7] << 24;
+ case 7: k2 ^= tail[ 6] << 16;
+ case 6: k2 ^= tail[ 5] << 8;
+ case 5: k2 ^= tail[ 4] << 0;
+ k2 *= c2; k2 = ROTL32(k2,16); k2 *= c3; h2 ^= k2;
+
+ case 4: k1 ^= tail[ 3] << 24;
+ case 3: k1 ^= tail[ 2] << 16;
+ case 2: k1 ^= tail[ 1] << 8;
+ case 1: k1 ^= tail[ 0] << 0;
+ k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
+ }
+
+ //----------
+ // finalization
+
+ h1 ^= len; h2 ^= len; h3 ^= len; h4 ^= len;
+
+ h1 += h2; h1 += h3; h1 += h4;
+ h2 += h1; h3 += h1; h4 += h1;
+
+ h1 = fmix32(h1);
+ h2 = fmix32(h2);
+ h3 = fmix32(h3);
+ h4 = fmix32(h4);
+
+ h1 += h2; h1 += h3; h1 += h4;
+ h2 += h1; h3 += h1; h4 += h1;
+
+ ((uint32_t*)out)[0] = h1;
+ ((uint32_t*)out)[1] = h2;
+ ((uint32_t*)out)[2] = h3;
+ ((uint32_t*)out)[3] = h4;
+}
+
+//-----------------------------------------------------------------------------
+
+void MurmurHash3_x64_128 ( const void * key, const int len,
+ const uint32_t seed, void * out )
+{
+ const uint8_t * data = (const uint8_t*)key;
+ const int nblocks = len / 16;
+
+ uint64_t h1 = seed;
+ uint64_t h2 = seed;
+
+ uint64_t c1 = BIG_CONSTANT(0x87c37b91114253d5);
+ uint64_t c2 = BIG_CONSTANT(0x4cf5ad432745937f);
+
+ int i;
+
+ //----------
+ // body
+
+ const uint64_t * blocks = (const uint64_t *)(data);
+
+ for(i = 0; i < nblocks; i++) {
+ uint64_t k1 = getblock(blocks,i*2+0);
Jinghao Yan
jinghao added a note

i << 1
(i << 1) | 1

As discussed above, compilers are smart enough to do appropriate transforms of this type nowadays. In fact, the (i<<1)|1 is a slight pessimization in my tests.

Cf. http://url.godbolt.org/shiftVsMultiply
+ uint64_t k2 = getblock(blocks,i*2+1);
+
+ k1 *= c1; k1 = ROTL64(k1,31); k1 *= c2; h1 ^= k1;
+
+ h1 = ROTL64(h1,27); h1 += h2; h1 = h1*5+0x52dce729;
+
+ k2 *= c2; k2 = ROTL64(k2,33); k2 *= c1; h2 ^= k2;
+
+ h2 = ROTL64(h2,31); h2 += h1; h2 = h2*5+0x38495ab5;
+ }
+
+ //----------
+ // tail
+
+ const uint8_t * tail = (const uint8_t*)(data + nblocks*16);
+
+ uint64_t k1 = 0;
+ uint64_t k2 = 0;
+
+ switch(len & 15) {
+ case 15: k2 ^= ((uint64_t)tail[14]) << 48;
+ case 14: k2 ^= ((uint64_t)tail[13]) << 40;
+ case 13: k2 ^= ((uint64_t)tail[12]) << 32;
+ case 12: k2 ^= ((uint64_t)tail[11]) << 24;
+ case 11: k2 ^= ((uint64_t)tail[10]) << 16;
+ case 10: k2 ^= ((uint64_t)tail[ 9]) << 8;
+ case 9: k2 ^= ((uint64_t)tail[ 8]) << 0;
+ k2 *= c2; k2 = ROTL64(k2,33); k2 *= c1; h2 ^= k2;
+
+ case 8: k1 ^= ((uint64_t)tail[ 7]) << 56;
+ case 7: k1 ^= ((uint64_t)tail[ 6]) << 48;
+ case 6: k1 ^= ((uint64_t)tail[ 5]) << 40;
+ case 5: k1 ^= ((uint64_t)tail[ 4]) << 32;
+ case 4: k1 ^= ((uint64_t)tail[ 3]) << 24;
+ case 3: k1 ^= ((uint64_t)tail[ 2]) << 16;
+ case 2: k1 ^= ((uint64_t)tail[ 1]) << 8;
+ case 1: k1 ^= ((uint64_t)tail[ 0]) << 0;
+ k1 *= c1; k1 = ROTL64(k1,31); k1 *= c2; h1 ^= k1;
+ }
+
+ //----------
+ // finalization
+
+ h1 ^= len; h2 ^= len;
+
+ h1 += h2;
+ h2 += h1;
+
+ h1 = fmix64(h1);
+ h2 = fmix64(h2);
+
+ h1 += h2;
+ h2 += h1;
+
+ ((uint64_t*)out)[0] = h1;
+ ((uint64_t*)out)[1] = h2;
+}
+
+//-----------------------------------------------------------------------------
14 src/murmur.h
@@ -0,0 +1,14 @@
+//-----------------------------------------------------------------------------
+// MurmurHash3 was written by Austin Appleby, and is placed in the public
+// domain. The author hereby disclaims copyright to this source code.
+
+#ifndef _MURMURHASH3_H_
+#define _MURMURHASH3_H_
+
+#include <stdint.h>
+
+void MurmurHash3_x86_32 ( const void * key, int len, uint32_t seed, void * out );
+void MurmurHash3_x86_128 ( const void * key, int len, uint32_t seed, void * out );
+void MurmurHash3_x64_128 ( const void * key, int len, uint32_t seed, void * out );
+
+#endif // _MURMURHASH3_H_
8 src/test_dablooms.c
@@ -50,7 +50,7 @@ int test_scale(const char * filepath)
for (i = 0; fgets(word, 128, fp); i++) {
if (word != NULL) {
chomp_line(word);
- scaling_bloom_add(bloom, word, i);
+ scaling_bloom_add(bloom, word, strlen(word), i);
}
}
@@ -59,7 +59,7 @@ int test_scale(const char * filepath)
if (word != NULL) {
if (iremove % 5 == 0) {
chomp_line(word);
- scaling_bloom_remove(bloom, word, iremove);
+ scaling_bloom_remove(bloom, word, strlen(word), iremove);
}
}
}
@@ -75,13 +75,13 @@ int test_scale(const char * filepath)
if (word != NULL) {
chomp_line(word);
if (i % 5 == 0) {
- if (!(scaling_bloom_check(bloom, word))) {
+ if (!(scaling_bloom_check(bloom, word, strlen(word)))) {
not_exist_pass ++;
} else {
not_exist_fail ++;
}
} else {
- if (scaling_bloom_check(bloom, word)) {
+ if (scaling_bloom_check(bloom, word, strlen(word))) {
exist_pass ++;
} else {
fprintf(stderr, "%s\n", word);