Bloom Filter implementation in Clojure.
A bloom filter is a probabilistic data structure for testing set membership. A deterministic data structure for testing set membership is a
java.util.Set, some other languages would use a hash (or map) data structure with a boolean as the value and the target strings as the key. As the number of entries in your set grows, so will the memory usage of these data structures. Bloom filters sacrifice determinism in favor of significantly lower memory usage. Bloom filters do not suffer from false negatives (indicating that a given string is not in the set when it is in fact in the set). They do suffer from false positives (indicating that the string is in the set when it is in fact not), though the false positive rate can be estimated fairly accurately, more importantly, the size of the filter and the number of hashes can be chosen to give you a target false positive rate.
examples/words.clj example demonstrates that calculating these values based on an estimated set size and a target false positive rate gets you within a factor of about three of the desired rate.
A use for such a filter might be to quickly test if a given string is not a dictionary word, helping to ensure a user’s password is not a dictionary word.
This implementation uses a Java
java.util.BitSet as the bit array implementation and provides several helpers for different hashing functions.
This example code is contained in the file
examples/words.clj it initializes
a bloom filter with the dictionary from
/usr/local/dict/words. You can run
this example with the file
run.sh in the project root directory.
(ns words (:require [ clojure.contrib.duck-streams :as ds] [com.github.kyleburton.clj-bloom :as bf])) (def *words-file* "/usr/share/dict/words") (defn all-words  (ds/read-lines *words-file*)) (defn add-words-to-filter! [filter] (dorun (doseq [word (all-words)] (bf/add! filter (.toLowerCase word))))) (defn run [hash-fn] (let [filter (bf/make-bloom-filter (* 10 1024 1024) hash-fn)] (add-words-to-filter! filter) (dorun (doseq [w (.split "The quick brown ornithopter hyper-jumped over the lazy trollusk" "\\s+")] (if (bf/include? filter (.toLowerCase w)) (prn (format "HIT: '%s' in the filter" w)) (prn (format "MISS: '%s' not in the filter" w))))))) (defn make-words-filter [m-expected-entries k-hashes hash-fn] (let [flt (bf/make-bloom-filter m-expected-entries (bf/make-permuted-hash-fn (or hash-fn bf/make-hash-fn-hash-code) (map str (range 0 k-hashes))))] (add-words-to-filter! flt) flt)) (defn words-fp-rate [count flt] (loop [times count fps 0] (cond (= 0 times) fps (bf/include? flt (str times)) (recur (dec times) (inc fps)) :else (recur (dec times) fps)))) (defn report-fp-rate [count flt] (let [rate (words-fp-rate count flt)] (/ rate count 1.0))) (def word-flt-1pct (make-words-filter 2875518 7 bf/make-hash-fn-hash-code)) (printf "n=2875518, k=10, p=0.01: Java's hashCode: fp=%f cardinality=%d\n" (report-fp-rate 100000 word-flt-1pct) (.cardinality (:bitarray word-flt-1pct))) (def word-flt-crc32-1pct (make-words-filter 2875518 7 bf/make-hash-fn-crc32)) (printf "n=2875518, k=10, p=0.01: CRC32: fp=%f cardinality=%d\n" (report-fp-rate 100000 word-flt-crc32-1pct) (.cardinality (:bitarray word-flt-crc32-1pct))) (def word-flt-adler32-1pct (make-words-filter 2875518 7 bf/make-hash-fn-adler32)) (printf "n=2875518, k=10, p=0.01: Adler32: fp=%f cardinality=%d\n" (report-fp-rate 100000 word-flt-adler32-1pct) (.cardinality (:bitarray word-flt-adler32-1pct))) (def word-flt-md5-1pct (make-words-filter 2875518 7 bf/make-hash-fn-md5)) (printf "n=2875518, k=10, p=0.01: MD5: fp=%f cardinality=%d\n" (report-fp-rate 100000 word-flt-md5-1pct) (.cardinality (:bitarray word-flt-md5-1pct))) (def word-flt-sha1-1pct (make-words-filter 2875518 7 bf/make-hash-fn-sha1)) (printf "n=2875518, k=10, p=0.01: SHA1: fp=%f cardinality=%d\n" (report-fp-rate 100000 word-flt-sha1-1pct) (.cardinality (:bitarray word-flt-sha1-1pct)))
/usr/share/dict/words on my system is 24,86,813 bytes while the recommended size of this filter for a 1% FP rate is 2,875,518 bytes – a nearly ten fold reduction in required memory. If the entries were larger the memory savings would be larger as well.
The Wikipedia article on Bloom Filters discusses methods for determining optimal values for the number of hashes,
k, the size of the bit array,
m, given an expected number of entries
n and a target false positive probability,
This library provides a functions to help you determine those parameters for your data set. Following along with the
words.clj example, the
/usr/share/dict/words file on my system contains about 234,936 words. This gives us
n = 234936 entries, if we can tolerate a 1% error rate, the size of the bit array should be:
(num-bits-for-entries-and-fp-probability 234936 0.01) => 2251875
or about 2.2 million bits. Given this bit array size, we can now compute an optimal number of hashes:
(num-hash-fns-for-entries-and-bits 234936 2251875) => 6.643855378585771
rounding to 7. You can obtain both of these values in one call with:
(optimal-n-and-k 300000 0.01) => [2875518 7]
Based on the benchmarks in the
examples/words.clj file, I do not recommend using Java’s
hashCode function for use with the bloom filter. It results in a wildly higher false positive rate then the other hashes. My recommendation is to use the values reported by
optimal-n-and-k as a starting point and measure your actual false positive rate, keeping in mind the ‘4.8 bits per key’ rule as a way to decrease the false-positive rate by an order of magnitude.
There is also a helper function
make-optimal-filter that takes an estimated number of entries, a desired false-positive probability and an optional hash builder function.
words.clj example shows how to create your own hash function. As part of the definition of a bloom filter, a number of hashes
k are executed to determine the bits to set. The interface for this hash function should take a
java.lang.String and an
int. The string being the data to compute the hash for, and the
int being the size in bits (
n) of the bit array. The hash function must return the sequence of bit locations to be set in the filter for the given input string.
The current implementation uses a
java.util.BitSet, which is limited to
2^31 - 1 bits. This puts some boundaries on the number of elements and the false-positive probability.
java.math.BigInteger have this limitation. I will look into implementing a version of the filter that uses
byte arrays under the hood so that this limitation can be lifted.
If you’re using Leiningen, add the following to your
<dependencies> <dependency> <groupId>com.github.kyleburton</groupId> <artifactId>clj-bloom</artifactId> <version>1.0.1</version> </dependency> ... </dependencies>
To build using Leiningen:
clj-bloom$ lein test clj-bloom$ lein jar clj-bloom$ lein uberjar
- Bloom Filter on Wikipedia
- Bloom Filters: A Powerful Tool
- Counting Bloom Filter implemented in Ruby
- Scalable Data-sets: Bloom Filters in Ruby
- Bloom Filters – the math