Bloom filter implementation in Crystal lang
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
samples
spec
src
.gitignore Make task to run benchmarks Feb 9, 2016
.travis.yml
CHANGELOG.md
LICENSE
Makefile
README.md
shard.yml

README.md

Bloom Filter Build Status

Implementation of Bloom Filter in Crystal lang.

Installation

Add this to your application's shard.yml:

dependencies:
  bloom_filter:
    github: crystal-community/bloom_filter

Usage

Basic

require "bloom_filter"

# Create filter with bitmap size of 32 bytes and 3 hash functions.
filter = BloomFilter.new(bytesize = 32, hash_num = 3)

# Insert elements
filter.insert("Esperanto")
filter.insert("Toki Pona")

# Check elements presence
filter.has?("Esperanto")  # => true
filter.has?("Toki Pona")  # => true
filter.has?("Englsh")     # => false

Creating a filter with optimal parameters

Based on your needs(expected number of items and desired probability of false positives), your can create an optimal bloom filter:

# Create a filter, that with one million inserted items, gives 2% of false positives for #has? method
filter = BloomFilter.new_optimal(1_000_000, 0.02)
filter.bytesize # => 1017796 (993Kb)
filter.hash_num # => 6

Dumping into a file and loading

It's possible to save existing bloom filter as a binary file and then load it back.

filter = BloomFilter.new_optimal(2, 0.01)
filter.insert("Esperanto")
filter.dump_file("/tmp/bloom_languages")

loaded_filter = BloomFilter.load_file("/tmp/bloom_languages")
loaded_filter.has?("Esperanto") # => true
loaded_filter.has?("English")   # => false

Union and intersection

Having two filters of the same size and number of hash functions, it's possible to perform union and intersection operations:

f1 = BloomFilter.new(32, 3)
f1.insert("Esperanto")
f1.insert("Spanish")

f2 = BloomFilter.new(32, 3)
f2.insert("Esperanto")
f2.insert("English")

# Union
f3 = f1 | f2
f3.has?("Esperanto") # => true
f3.has?("Spanish")   # => true
f3.has?("English")   # => true

# Intersection
f4 = f1 & f2
f4.has?("Esperanto") # => true
f4.has?("Spanish")   # => false
f4.has?("English")   # => false

Visualization

If you want to see how your filter looks like, you can visualize it:

f1 = BloomFilter.new(16, 2)
f1.insert("Esperanto")
puts "f1 = (Esperanto)"
puts f1.visualize

f2 = BloomFilter.new(16, 2)
f2.insert("Spanish")
puts "f2 = (Spanish)"
puts f2.visualize

f3 = f1 | f2
puts "f3 = f1 | f2 = (Esperanto, Spanish)"
puts f3.visualize

Output:

f1 = (Esperanto)
░░░░░░░░ ░░░░░░█░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░
░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░█ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░

f2 = (Spanish)
░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░
░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░█░ ░█░░░░░░

f3 = f1 | f2 = (Esperanto, Spanish)
░░░░░░░░ ░░░░░░█░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░
░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░█ ░░░░░░░░ ░░░░░░█░ ░█░░░░░░

In this way, you can actually see which bits are set:)

Benchmark

Performance of Bloom filter depends on the following parameters:

  • Size of the filter
  • Number of hash functions
  • Length of the input string

To run benchmark from ./samples/benchmark.cr, simply run make task:

$ make benchmark

Number of items: 100000000
Filter size: 117005Kb
Hash functions: 7
String size: 13

                     user     system      total        real
insert           0.004227   0.000000   0.004227 (  2.769349)
has? (present)   0.007980   0.000000   0.007980 (  5.223778)
has? (missing)   0.004318   0.000000   0.004318 (  2.829521)

Contributors