MurmurHash3 and XxHash implementations
desmondyeung committed Aug 21, 2019
1 parent 4d3a815 commit 8371464
Showing 30 changed files with 1,925 additions and 2 deletions.
11 changes: 11 additions & 0 deletions .gitignore
@@ -1,2 +1,13 @@
/.idea
.DS_Store
target
*.orig
*#
*~
.#*
.*.swp
*.vim
.ensime*
*.class
*.log
*/*.iml
7 changes: 7 additions & 0 deletions .scalafmt.conf
@@ -0,0 +1,7 @@
version = 2.0.1
style = defaultWithAlign
maxColumn = 120
rewrite.rules = [ RedundantBraces, RedundantParens, SortImports, AvoidInfix, PreferCurlyFors ]
spaces.inImportCurlyBraces = false
danglingParentheses = true
align.openParenCallSite = false
107 changes: 105 additions & 2 deletions README.md
@@ -1,2 +1,105 @@
# scala-hashing
Hashing Functions for Scala
# Scala-Hashing

Fast non-cryptographic hash functions for Scala. This library provides APIs for computing 32-bit and 64-bit hashes.

Currently implemented hash functions
* [MurmurHash3](https://github.com/aappleby/smhasher) (32-bit)
* [XxHash](https://github.com/Cyan4973/xxHash) (32-bit and 64-bit)

Hash functions in this library can be accessed through either a standard API for hashing primitives, byte arrays, and Java ByteBuffers (direct and non-direct), or a streaming API for hashing stream-like sources such as InputStreams, Java NIO Channels, and Akka Streams. Hash functions should produce consistent output regardless of platform or endianness.
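
One common way to keep hash output independent of the host's byte order is to decode input words in a fixed order. The sketch below is only an illustration of that idea using `java.nio.ByteBuffer`; it is not this library's implementation (which uses `sun.misc.Unsafe`), and the names `LittleEndianRead`, `readLongLE`, and `readLongManual` are hypothetical.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object LittleEndianRead {

  // Reads 8 bytes starting at `offset` as a little-endian Long.
  // The byte order is set explicitly, so the result does not depend on the host CPU.
  def readLongLE(bytes: Array[Byte], offset: Int): Long =
    ByteBuffer.wrap(bytes, offset, 8).order(ByteOrder.LITTLE_ENDIAN).getLong()

  // The same read written out by hand: the byte at the highest index
  // becomes the most significant byte of the result.
  def readLongManual(bytes: Array[Byte], offset: Int): Long = {
    var result = 0L
    var i      = 7
    while (i >= 0) {
      result = (result << 8) | (bytes(offset + i) & 0xffL)
      i -= 1
    }
    result
  }
}
```

Both helpers should agree on any input, for example `LittleEndianRead.readLongLE(Array[Byte](1, 2, 3, 4, 5, 6, 7, 8), 0)` and the manual variant both yield `0x0807060504030201L`.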

This library uses the `sun.misc.Unsafe` API internally. I might explore using the `VarHandle` API introduced in Java 9 in the future, but am currently still supporting Java 8.

## Performance

Benchmarked against various other open-source implementations
* [Guava](https://github.com/google/guava) (MurmurHash3)
* [LZ4 Java](https://github.com/lz4/lz4-java) (XxHash32 and XxHash64 - Includes JNI binding, pure Java, and Java+Unsafe implementations)
* [Scala](https://github.com/scala/scala) (Scala's built-in `scala.util.hashing.MurmurHash3`)
* [Zero-Allocation-Hashing](https://github.com/OpenHFT/Zero-Allocation-Hashing) (XxHash64)

### MurmurHash3_32
![MurmurHash3_32](https://github.com/desmondyeung/scala-hashing/blob/master/bench/src/main/resource/results/XxHash64.png)

### XxHash32
![XxHash32](https://github.com/desmondyeung/scala-hashing/blob/master/bench/src/main/resource/results/XxHash32.png)

### XxHash64
![XxHash64](https://github.com/desmondyeung/scala-hashing/blob/master/bench/src/main/resource/results/XxHash64.png)


### Running Locally

Benchmarks are located in the `bench` subproject and can be run using the [sbt-jmh](https://github.com/ktoso/sbt-jmh) plugin.

To run all benchmarks with default settings
```sbt
bench/jmh:run
```
To run a specific benchmark with custom settings
```sbt
bench/jmh:run -f 2 -wi 5 -i 5 XxHash64Bench
```

## Examples

This library defines the interfaces `Hash32` and `StreamingHash32` for computing 32-bit hashes and `Hash64` and `StreamingHash64` for computing 64-bit hashes. Classes extending `StreamingHash32` or `StreamingHash64` are not thread-safe.

The public API for `Hash64` and `StreamingHash64` can be seen below
```scala
import java.nio.ByteBuffer

trait Hash64 {
  def hashByte(input: Byte, seed: Long): Long
  def hashInt(input: Int, seed: Long): Long
  def hashLong(input: Long, seed: Long): Long
  def hashByteArray(input: Array[Byte], seed: Long): Long
  def hashByteArray(input: Array[Byte], offset: Int, length: Int, seed: Long): Long
  def hashByteBuffer(input: ByteBuffer, seed: Long): Long
  def hashByteBuffer(input: ByteBuffer, offset: Int, length: Int, seed: Long): Long
}

trait StreamingHash64 {
  def reset(): Unit
  def value: Long
  def updateByteArray(input: Array[Byte], offset: Int, length: Int): Unit
  def updateByteBuffer(input: ByteBuffer, offset: Int, length: Int): Unit
}
```
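
Code written against an interface of this shape can stay independent of the concrete hash function. The following is a minimal sketch under stated assumptions: `Hash64Like` is a local stand-in that copies two method signatures from `Hash64` so the block compiles on its own, `ToyFnv1a64` is a placeholder FNV-1a style implementation (not XxHash64 and not part of this library), and `hashString` is a hypothetical helper.

```scala
import java.nio.charset.StandardCharsets

// Local stand-in mirroring two methods of the Hash64 trait shown above.
trait Hash64Like {
  def hashByteArray(input: Array[Byte], offset: Int, length: Int, seed: Long): Long

  // Assumed convenience overload: hash the whole array.
  def hashByteArray(input: Array[Byte], seed: Long): Long =
    hashByteArray(input, 0, input.length, seed)
}

// Placeholder FNV-1a style hash, used only so the sketch runs. NOT XxHash64.
object ToyFnv1a64 extends Hash64Like {
  def hashByteArray(input: Array[Byte], offset: Int, length: Int, seed: Long): Long = {
    var h = 0xcbf29ce484222325L ^ seed
    var i = offset
    while (i < offset + length) {
      h ^= (input(i) & 0xffL)
      h *= 0x100000001b3L
      i += 1
    }
    h
  }
}

object HashStrings {
  // Hypothetical helper: encode the string as UTF-8 and hash the bytes.
  def hashString(h: Hash64Like, s: String, seed: Long): Long =
    h.hashByteArray(s.getBytes(StandardCharsets.UTF_8), seed)
}
```

With the real library on the classpath, the same helper could presumably take a `Hash64` such as `XxHash64` instead of the stand-in.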

Using the standard API
```scala
import com.desmondyeung.hashing.XxHash64
import java.nio.ByteBuffer

// hash a Long
val longHash = XxHash64.hashLong(123, seed = 0)

// hash an Array[Byte]
val arrayHash = XxHash64.hashByteArray(Array[Byte](123), seed = 0)

// hash a ByteBuffer
val bufferHash = XxHash64.hashByteBuffer(ByteBuffer.wrap(Array[Byte](123)), seed = 0)
```

Using the streaming API
```scala
import com.desmondyeung.hashing.StreamingXxHash64
import java.nio.ByteBuffer
import java.io.FileInputStream

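val checksum = StreamingXxHash64(seed = 0)
val channel  = new FileInputStream("/path/to/file.txt").getChannel
val chunk    = ByteBuffer.allocate(1024)

try {
  // read the file in 1 KiB chunks and feed each chunk to the streaming hash
  var bytesRead = channel.read(chunk)
  while (bytesRead > 0) {
    checksum.updateByteBuffer(chunk, 0, bytesRead)
    chunk.rewind() // move the position back to 0 so the next read refills the buffer
    bytesRead = channel.read(chunk)
  }
} finally channel.close() // release the file handle even if reading fails

val hash = checksum.value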
```
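
The same chunked pattern applies to a plain `InputStream`. The sketch below is self-contained and makes several assumptions: `StreamingHash64Like` is a local stand-in copying three method signatures from `StreamingHash64`, `Crc32Standin` wraps `java.util.zip.CRC32` purely as a runnable placeholder (it is a checksum, not XxHash64), and `hashStream` is a hypothetical helper.

```scala
import java.io.InputStream
import java.util.zip.CRC32

// Local stand-in mirroring part of the StreamingHash64 trait shown earlier.
trait StreamingHash64Like {
  def reset(): Unit
  def value: Long
  def updateByteArray(input: Array[Byte], offset: Int, length: Int): Unit
}

// Placeholder backed by CRC32 so the sketch runs without the library.
final class Crc32Standin extends StreamingHash64Like {
  private val crc = new CRC32
  def reset(): Unit = crc.reset()
  def value: Long   = crc.getValue
  def updateByteArray(input: Array[Byte], offset: Int, length: Int): Unit =
    crc.update(input, offset, length)
}

object StreamHashing {
  // Feeds the stream to the hash in chunks of `bufferSize` bytes and returns the final value.
  def hashStream(in: InputStream, h: StreamingHash64Like, bufferSize: Int = 8192): Long = {
    val buf = new Array[Byte](bufferSize)
    h.reset()
    var n = in.read(buf)
    while (n != -1) {
      h.updateByteArray(buf, 0, n)
      n = in.read(buf)
    }
    h.value
  }
}
```

The result should not depend on the chunk size. Because streaming instances hold mutable state and are not thread-safe, each thread would need its own instance.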

## License

Licensed under the Apache License, Version 2.0 (the "License").
27 changes: 27 additions & 0 deletions bench/src/main/resource/results/Murmur3Hash_32BenchResult.txt
@@ -0,0 +1,27 @@
[info] # Run complete. Total time: 00:09:42
[info] REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
[info] why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
[info] experiments, perform baseline and negative tests that provide experimental control, make sure
[info] the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
[info] Do not assume the numbers tell you what you want them to tell.
[info] Benchmark (inputSize) Mode Cnt Score Error Units
[info] MurmurHash3_32Bench.com_desmondyeung_hashing 8 thrpt 5 143053204.310 ± 1793248.431 ops/s
[info] MurmurHash3_32Bench.com_desmondyeung_hashing 128 thrpt 5 19167534.091 ± 254169.664 ops/s
[info] MurmurHash3_32Bench.com_desmondyeung_hashing 512 thrpt 5 5561943.424 ± 85516.455 ops/s
[info] MurmurHash3_32Bench.com_desmondyeung_hashing 1024 thrpt 5 2929010.066 ± 80201.084 ops/s
[info] MurmurHash3_32Bench.com_desmondyeung_hashing 1536 thrpt 5 1943152.471 ± 82023.180 ops/s
[info] MurmurHash3_32Bench.com_desmondyeung_hashing 2048 thrpt 5 1403510.923 ± 44546.385 ops/s
[info] MurmurHash3_32Bench.com_google_common_hash 8 thrpt 5 116084764.014 ± 3715825.165 ops/s
[info] MurmurHash3_32Bench.com_google_common_hash 128 thrpt 5 11915395.823 ± 1434301.027 ops/s
[info] MurmurHash3_32Bench.com_google_common_hash 512 thrpt 5 3158079.416 ± 134154.390 ops/s
[info] MurmurHash3_32Bench.com_google_common_hash 1024 thrpt 5 1657552.706 ± 83520.818 ops/s
[info] MurmurHash3_32Bench.com_google_common_hash 1536 thrpt 5 1095388.998 ± 35546.813 ops/s
[info] MurmurHash3_32Bench.com_google_common_hash 2048 thrpt 5 830621.543 ± 6247.160 ops/s
[info] MurmurHash3_32Bench.scala_util_hashing 8 thrpt 5 92785579.984 ± 2564661.931 ops/s
[info] MurmurHash3_32Bench.scala_util_hashing 128 thrpt 5 16740708.874 ± 446764.432 ops/s
[info] MurmurHash3_32Bench.scala_util_hashing 512 thrpt 5 4769417.979 ± 55659.336 ops/s
[info] MurmurHash3_32Bench.scala_util_hashing 1024 thrpt 5 2362250.706 ± 112984.131 ops/s
[info] MurmurHash3_32Bench.scala_util_hashing 1536 thrpt 5 1625387.712 ± 76083.874 ops/s
[info] MurmurHash3_32Bench.scala_util_hashing 2048 thrpt 5 1219616.089 ± 65617.926 ops/s
[success] Total time: 585 s, completed Aug 20, 2019, 10:24:00 PM
sbt:Hashing>
Binary file added bench/src/main/resource/results/XxHash32.png
33 changes: 33 additions & 0 deletions bench/src/main/resource/results/XxHash32BenchResults.txt
@@ -0,0 +1,33 @@
[info] # Run complete. Total time: 00:12:58
[info] REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
[info] why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
[info] experiments, perform baseline and negative tests that provide experimental control, make sure
[info] the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
[info] Do not assume the numbers tell you what you want them to tell.
[info] Benchmark (inputSize) Mode Cnt Score Error Units
[info] XxHash32Bench.com_desmondyeung_hashing 8 thrpt 5 181898630.061 ± 2086886.490 ops/s
[info] XxHash32Bench.com_desmondyeung_hashing 128 thrpt 5 44161794.692 ± 499905.676 ops/s
[info] XxHash32Bench.com_desmondyeung_hashing 512 thrpt 5 14488496.855 ± 107840.819 ops/s
[info] XxHash32Bench.com_desmondyeung_hashing 1024 thrpt 5 7523643.321 ± 104558.725 ops/s
[info] XxHash32Bench.com_desmondyeung_hashing 1536 thrpt 5 5005641.604 ± 52267.655 ops/s
[info] XxHash32Bench.com_desmondyeung_hashing 2048 thrpt 5 3789585.515 ± 31996.067 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_jni 8 thrpt 5 7328440.319 ± 275575.767 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_jni 128 thrpt 5 5930950.315 ± 303366.575 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_jni 512 thrpt 5 3968051.273 ± 149722.290 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_jni 1024 thrpt 5 2771389.170 ± 33486.356 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_jni 1536 thrpt 5 2148733.148 ± 145690.835 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_jni 2048 thrpt 5 1720267.164 ± 77320.929 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_pure 8 thrpt 5 103689821.011 ± 2704414.707 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_pure 128 thrpt 5 19236302.722 ± 730586.182 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_pure 512 thrpt 5 5823303.478 ± 224930.690 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_pure 1024 thrpt 5 3066582.769 ± 150944.281 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_pure 1536 thrpt 5 2076760.547 ± 65112.334 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_pure 2048 thrpt 5 1582100.654 ± 74129.324 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_unsafe 8 thrpt 5 134161752.760 ± 5035419.628 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_unsafe 128 thrpt 5 40852921.273 ± 2042634.150 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_unsafe 512 thrpt 5 12788488.138 ± 472049.488 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_unsafe 1024 thrpt 5 7085539.188 ± 935685.614 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_unsafe 1536 thrpt 5 4837437.179 ± 248957.476 ops/s
[info] XxHash32Bench.net_jpountz_xxhash_unsafe 2048 thrpt 5 3613914.693 ± 80983.807 ops/s
[success] Total time: 782 s, completed Aug 20, 2019, 10:01:32 PM
sbt:Hashing>
Binary file added bench/src/main/resource/results/XxHash64.png
39 changes: 39 additions & 0 deletions bench/src/main/resource/results/XxHash64BenchResults.txt
@@ -0,0 +1,39 @@
[info] # Run complete. Total time: 00:16:12
[info] REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
[info] why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
[info] experiments, perform baseline and negative tests that provide experimental control, make sure
[info] the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
[info] Do not assume the numbers tell you what you want them to tell.
[info] Benchmark (inputSize) Mode Cnt Score Error Units
[info] XxHash64Bench.com_desmondyeung_hashing 8 thrpt 5 190852324.077 ± 5658859.753 ops/s
[info] XxHash64Bench.com_desmondyeung_hashing 128 thrpt 5 58888583.803 ± 634119.370 ops/s
[info] XxHash64Bench.com_desmondyeung_hashing 512 thrpt 5 24604486.760 ± 275765.670 ops/s
[info] XxHash64Bench.com_desmondyeung_hashing 1024 thrpt 5 13836850.373 ± 184524.530 ops/s
[info] XxHash64Bench.com_desmondyeung_hashing 1536 thrpt 5 9591339.692 ± 82457.246 ops/s
[info] XxHash64Bench.com_desmondyeung_hashing 2048 thrpt 5 7352860.162 ± 117217.475 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_jni 8 thrpt 5 7152589.007 ± 447581.357 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_jni 128 thrpt 5 6490503.456 ± 230647.108 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_jni 512 thrpt 5 5585357.802 ± 77475.894 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_jni 1024 thrpt 5 4853306.754 ± 276307.907 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_jni 1536 thrpt 5 4089062.237 ± 127466.632 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_jni 2048 thrpt 5 3665968.804 ± 72010.797 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_pure 8 thrpt 5 93522990.590 ± 2323318.279 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_pure 128 thrpt 5 19458130.129 ± 1155318.184 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_pure 512 thrpt 5 6057375.874 ± 204330.214 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_pure 1024 thrpt 5 3094880.189 ± 48461.958 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_pure 1536 thrpt 5 2150755.068 ± 231286.136 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_pure 2048 thrpt 5 1642711.800 ± 57553.147 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_unsafe 8 thrpt 5 138411866.674 ± 8485795.143 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_unsafe 128 thrpt 5 46821796.895 ± 459016.261 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_unsafe 512 thrpt 5 20018540.106 ± 132129.200 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_unsafe 1024 thrpt 5 11040708.417 ± 170272.302 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_unsafe 1536 thrpt 5 8353461.187 ± 934291.061 ops/s
[info] XxHash64Bench.net_jpountz_xxhash_unsafe 2048 thrpt 5 5886326.566 ± 73892.885 ops/s
[info] XxHash64Bench.net_openhft_hashing 8 thrpt 5 154836584.889 ± 1832416.876 ops/s
[info] XxHash64Bench.net_openhft_hashing 128 thrpt 5 52375849.975 ± 2606768.814 ops/s
[info] XxHash64Bench.net_openhft_hashing 512 thrpt 5 23084417.332 ± 526239.888 ops/s
[info] XxHash64Bench.net_openhft_hashing 1024 thrpt 5 12772614.056 ± 565939.397 ops/s
[info] XxHash64Bench.net_openhft_hashing 1536 thrpt 5 9319278.528 ± 286309.114 ops/s
[info] XxHash64Bench.net_openhft_hashing 2048 thrpt 5 6963195.027 ± 77335.953 ops/s
[success] Total time: 976 s, completed Aug 20, 2019, 9:32:29 PM
sbt:Hashing>
@@ -0,0 +1,54 @@
/*
* Copyright 2019 Desmond Yeung
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package com.desmondyeung.bench

import org.openjdk.jmh.annotations._
import com.desmondyeung.hashing.MurmurHash3_32
import com.google.common.hash.Hashing
import scala.util.hashing.MurmurHash3

import java.util.concurrent.TimeUnit

@BenchmarkMode(Array(Mode.Throughput))
@Fork(1)
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@State(Scope.Thread)
class MurmurHash3_32Bench {

  var input: Array[Byte] = _

  @Param(Array("8", "128", "512", "1024", "1536", "2048"))
  var inputSize: Int = _

  @Setup
  def prepare: Unit = {
    input = new Array[Byte](inputSize)
    scala.util.Random.nextBytes(input)
  }

  val guava = Hashing.murmur3_32(0)

  @Benchmark
  def com_desmondyeung_hashing: Int = MurmurHash3_32.hashByteArray(input, 0)

  @Benchmark
  def com_google_common_hash: Int = guava.hashBytes(input).asInt

  @Benchmark
  def scala_util_hashing(): Int = MurmurHash3.bytesHash(input, 0)
}
59 changes: 59 additions & 0 deletions bench/src/main/scala/com/desmondyeung/bench/XxHash32Bench.scala
@@ -0,0 +1,59 @@
/*
* Copyright 2019 Desmond Yeung
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package com.desmondyeung.bench

import org.openjdk.jmh.annotations._
import com.desmondyeung.hashing.XxHash32
import net.openhft.hashing.LongHashFunction
import net.jpountz.xxhash.XXHashFactory

import java.util.concurrent.TimeUnit

@BenchmarkMode(Array(Mode.Throughput))
@Fork(1)
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@State(Scope.Thread)
class XxHash32Bench {

  var input: Array[Byte] = _

  @Param(Array("8", "128", "512", "1024", "1536", "2048"))
  var inputSize: Int = _

  @Setup
  def prepare: Unit = {
    input = new Array[Byte](inputSize)
    scala.util.Random.nextBytes(input)
  }

  val jpountzJni    = XXHashFactory.nativeInstance.hash32()
  val jpountzUnsafe = XXHashFactory.unsafeInstance.hash32()
  val jpountzPure   = XXHashFactory.safeInstance.hash32()

  @Benchmark
  def com_desmondyeung_hashing: Int = XxHash32.hashByteArray(input, 0)

  @Benchmark
  def net_jpountz_xxhash_jni: Int = jpountzJni.hash(input, 0, inputSize, 0)

  @Benchmark
  def net_jpountz_xxhash_pure: Int = jpountzPure.hash(input, 0, inputSize, 0)

  @Benchmark
  def net_jpountz_xxhash_unsafe: Int = jpountzUnsafe.hash(input, 0, inputSize, 0)
}
