MurmurHash3 information and brief performance results
(Note - this information is a bit stale, and refers to an earlier version of Murmur3. I'll rewrite it when I have time.)
MurmurHash3 is the successor to MurmurHash2. It comes in 3 variants - a 32-bit version that targets low latency for hash table use and two 128-bit versions for generating unique identifiers for large blocks of data, one each for x86 and x64 platforms.
MurmurHash3's mix functions are based on this snippet -
k *= c1; k = rotl(k,r1); k *= c2; h ^= k; h = rotl(h,r1); h = h*m1+n1;
k is a block of the key,
h is a block of the hash state, and
cN are constants.
For each block of the key, we pre-mix it using two constants and a rotate, xor it into the hash block, and them mix the hash block using a rotate and a multiply-add.
MurmurHash3's 32-bit finalizer is
h ^= h >> 16; h *= 0x85ebca6b; h ^= h >> 13; h *= 0xc2b2ae35; h ^= h >> 16;
and its 64-bit finalizer is
h ^= h >> 33; h *= 0xff51afd7ed558ccd; h ^= h >> 33; h *= 0xc4ceb9fe1a85ec53; h ^= h >> 33;
The constants for the finalizers were generated by a simple simulated-annealing algorithm, and both avalanche all bits of
h to within 0.25% bias.
The 128-bit variants mix multiple blocks of key data in parallel. To ensure that all the intermediate hash blocks affect each other, Murmurhash3 does a few simple operations interleaved with the block mix -
The 64-bit, 2-block inter-mix is
h1 += h2; h2 = _rotl64(h2,41); h2 += h1;
The 32-bit, 4-block inter-mix is
h1 += h2; h1 += h3; h1 += h4; h1 = _rotl(h1,17); h2 += h1; h3 += h1; h4 += h1;
hN is one block of the hash value.
Bulk speed test, hashing an 8-byte-aligned 256k block
Results are from an Intel Core 2 Quad Q9650 running at 3.0 ghz, running on a single core.
|SuperFastHash_x86_32||1224 mb/sec (1)|
|MurmurHash2_x86_64||3352 mb/sec (2)|
|MurmurHash3_x64_128||5058 mb/sec (3)|
(1) - SuperFastHash has very poor collision properties, which have been documented elsewhere.
(2) - MurmurHash2_x86_64 computes two 32-bit results in parallel and mixes them at the end, which is fast but means that collision resistance is only as good as a 32-bit hash. I suggest avoiding this variant.
(3) - That's about 1.68 bytes per cycle, or about 9.5 cycles per 16-byte chunk. The inner loop is 20 instructions long, so we're sustaining over 2 instructions per cycle. Hooray for modern platforms with fast 64-bit multipliers and superscalar architectures. :)