Prototype of how we can use a Bloom filter for @reddit’s recently consumed feature. This package provides a Memcache-backed Bloom filter with an API similar to Python sets.
Bloom filters are a powerful data structure that help you to answer the question, “Have I seen this element before?” but not the question, “What are all of the elements that I've seen before?” So think of Bloom filters as Python sets that you can add elements to and use to test element membership, but that you can’t iterate through or get elements back out of.
Bloom filters are probabilistic, which means that they can sometimes generate false positives (as in, they may report that you’ve seen a particular element before even though you haven’t). But they will never generate false negatives (so every time that they report that you haven’t seen a particular element before, you really must never have seen it). You can tune your acceptable false positive probability, though at the expense of the storage size and the element insertion/lookup time of your Bloom filter.
Instantiate a BloomFilter
:
>>> from bloom import BloomFilter
>>> dilberts = BloomFilter(
... num_values=1000,
... false_positives=0.001,
... key='dilberts',
... )
Here, num_values
represents the number of elements that you expect to insert
into your BloomFilter
, and false_positives
represents your acceptable false
positive probability. Using these two parameters, BloomFilter
automatically
computes its own storage size and number of times to run its hash functions on
element insertion/lookup such that it can guarantee a false positive rate at or
below what you can tolerate, given that you’re going to insert your
specified number of elements.
Insert an element into the BloomFilter
:
>>> dilberts.add('rajiv')
This BloomFilter
implementation supports elements of any type that can be
dumped as JSON.
Test for membership in the BloomFilter
:
>>> 'rajiv' in dilberts
True
>>> 'raj' in dilberts
False
>>> 'dan' in dilberts
False
See how many elements you’ve inserted into the BloomFilter
:
>>> len(dilberts)
1
Note that BloomFilter.__len__()
is an approximation, so please don’t
rely on it for anything important like financial systems or cat gif websites.
Insert multiple elements into the BloomFilter
:
>>> dilberts.update({'raj', 'dan'})
I recommend using BloomFilter.update()
to insert multiple elements into the
BloomFilter
(over repeated BloomFilter.add()
calls) as
BloomFilter.update()
inserts all of the elements and then stores the
BloomFilter
to Memcache once (rather than inserting one element, storing the
BloomFilter
to Memcache, inserting another element, storing the BloomFilter
to Memcache again, etc.).
Remove all of the elements from the BloomFilter
:
>>> dilberts.clear()