A simple bloom filter written in Python
Python
Switch branches/tags
Nothing to show
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
.gitignore
README.md
__init__.py
bloom.py
requirements.txt
words.txt

README.md

Bloom

A simple bloom filter written in Python using the Murmur hash.

Bloom Filter

A bloom filter is a probabalistic data structure that tests for element membership in a set. The element is either probably in the set, or is definitely not in the set. You can refer to the Wikipedia article here and a good tutorial for the basics of bloom filters can be found here.

Thanks to this article for the single algorithm hashing strategy I use for this project.

How It Works

####Create an instance of the Bloom class

b_f = Bloom(expected-number-of-elements, false-positive, number-of-hashes)

Based on the number of expected elements, the false positive rate, and the number of hashes given by the user, the bit array size will be calculated. The bloom filter will attempt to have a false positive rate approximately equal to the given false positive rate.

####Add elements to the bloom filter

b_f.add('key')

####Query for membership

b_f.query('key')

This will return one of two possible things:

'Key' is possibily in the set and there is a false positive probability of calculated-false-positive-rate

or

'Key' is definitely not in the set

Short Example

b_f = Bloom(100, .1, 5)
b_f.add("Matt")
b_f.add("Daria")

b_f.query("Matt")
b_f.query("KitKat") # KitKat is my cat.

returns

'Matt' is possibily in the set and there is a false positive probability of 0.100925190275
'KitKat' is definitely not in the set

Notes

The probability of a false positive is:

(1 - e^(-kn/m))^k

  • where k is the number of hashing functions
  • n is the number of expected elements inserted into the filter
  • and m is the number of bits allocated in the bit array

The Bloom class takes as its arguments n, fp (the probability for a false positive), and k.

It calculates m thusly:

m = 1 / (1 - (1 - p ^(1 / k))^(1 / (k * n)))

###Todos:

  • Test with words.txt, which contains over 200000 words.
  • Make the bloom filter serializable using Pickle.
  • Unit testing using Nose.