## Bloom Filter

The Bloom filter and its relative, the count-min sketch, are probabilistic data structures: the Bloom filter tests set membership, while the count-min sketch estimates how many times an item has been seen. Both are extremely memory efficient, at the cost of false positives (or, for the count-min sketch, overestimated counts). The false-positive rate (FPR) is predictable: it can be calculated (or estimated) from a few simple properties of the data structure and its dataset (more on this later). Both data structures are straightforward to implement, which is a huge bonus for folks like me: I don't trust most bioinformatics software not to be full of bugs, so simple and elegant data structures that are easy to debug put me at ease.
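
As a rough illustration of how the FPR can be estimated, here is a minimal sketch for a filter made of several separate tables, each addressed by its own hash. It assumes the hashes are uniform and independent, which real hash functions only approximate. The name `estimate_fpr` and its parameters are made up for this example.

```python
def estimate_fpr(table_sizes, n_items):
    """Estimate the false-positive rate of a multi-table Bloom filter.

    Assumes each of `n_items` distinct items sets one uniformly random
    bit in every table, independently of the other tables.
    """
    fpr = 1.0
    for size in table_sizes:
        # Probability that a given bit in this table is set after n_items insertions.
        p_set = 1.0 - (1.0 - 1.0 / size) ** n_items
        # A false positive needs a set bit in every table.
        fpr *= p_set
    return fpr


# 97, 89, 83 and 79 are the four largest primes below 100.
sizes = [97, 89, 83, 79]
for n_items in (4, 40, 400):
    print(n_items, estimate_fpr(sizes, n_items))
```

The estimate climbs toward 1 as the number of stored items approaches and then exceeds the table sizes, which is why the tables have to be sized for the dataset.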

In bioinformatics, the Bloom filter has proven quite useful for tracking which k-mers occur in a dataset (counting them is where the count-min sketch comes in). With noisy data, the existence of some number of false positives is more or less irrelevant, and is an obvious trade-off for datasets that would require hundreds to thousands of gigabytes of memory to store exactly. Even more importantly, a Bloom filter storing k-mers implicitly encodes a de Bruijn graph, which can be used for all manner of sequence traversal operations (most commonly, {gen,transcript,metagen}ome assembly).
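
To make the implicit de Bruijn graph concrete: two k-mers are connected when they overlap by k-1 bases, so the possible successors of a k-mer can be found by dropping its first base, appending each of A, C, G and T, and asking which candidates are present. The sketch below assumes only that the container supports `in`. A plain `set` stands in for the filter, and the helper names are hypothetical.

```python
def successors(kmer, members, alphabet='ACGT'):
    """Return the k-mers in `members` that overlap `kmer` by k-1 bases on the right."""
    suffix = kmer[1:]
    return [suffix + base for base in alphabet if suffix + base in members]


def predecessors(kmer, members, alphabet='ACGT'):
    """Return the k-mers in `members` that overlap `kmer` by k-1 bases on the left."""
    prefix = kmer[:-1]
    return [base + prefix for base in alphabet if base + prefix in members]


# All 4-mers of the toy sequence ATGCGCA, stored in a set as a stand-in.
seq = 'ATGCGCA'
k = 4
kmers = set(seq[i:i + k] for i in range(len(seq) - k + 1))

print(successors('ATGC', kmers))    # ['TGCG']
print(predecessors('GCGC', kmers))  # ['TGCG']
```

With a real Bloom filter in place of the set, a false positive would show up here as a spurious edge in the graph.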

### Code

In [82]:
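def is_prime(n):
    ''' Trial-division primality test; fine for the table sizes used here. '''
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    # Only odd divisors up to sqrt(n) need checking.
    for i in range(3, int(n ** 0.5) + 1, 2):
        if n % i == 0:
            return False
    return True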

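def get_primes(start, n_primes):
    ''' Return up to `n_primes` odd primes <= `start`, largest first.

    Fewer primes are returned if `start` is too small to supply them all.
    '''
    primes = []
    cur_num = start
    # Step down through odd candidates only (so 2 is never returned).
    if cur_num % 2 == 0:
        cur_num -= 1
    while len(primes) != n_primes and cur_num > 0:
        if is_prime(cur_num):
            primes.append(cur_num)
        cur_num -= 2
    return primes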

In [83]:
class BloomFilter(object):
    ''' Super-basic Bloom filter. Interface is similar to the built-in `set`.
    
    Attributes:
        n_tables (int): Number of tables in the filter.
        primes (list): Sizes of the tables (and for modulus hashing).
        tables (list): List of bool lists storing item existence.
    '''
    
    def __init__(self, n_bins=int(1e6), n_tables=4):
        ''' Construct a new BloomFilter.
        
        The BloomFilter takes the `n_bins` argument and finds the `n_tables` largest
        prime numbers no greater than it. Those primes are the table sizes in the data structure.
        
        Args:
            n_bins (int): Upper bound on the size of the largest table.
            n_tables (int): Number of tables to use.
        '''
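        self.n_tables = n_tables
        self.primes = get_primes(n_bins, n_tables)
        # Too small an `n_bins` yields fewer primes than requested, which would
        # make every membership test fail, so refuse to build such a filter.
        if len(self.primes) != n_tables:
            raise ValueError('n_bins is too small for %d tables' % n_tables)
        self.tables = [[False] * b for b in self.primes]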
    
    def get_indices(self, item):
        ''' Hash the object and mod it against the `n_tables` primes.
        
        Args:
            item (object): The object to be hashed
        '''
        return [hash(item) % p for p in self.primes]
    
    def add(self, item):
        ''' Add an item to the filter.
        
        Args:
            item (object): The object to be inserted.
        '''
        for n, idx in enumerate(self.get_indices(item)):
            self.tables[n][idx] = True
    
    def __contains__(self, item):
        ''' Built-in; check if the item is in the filter. An item is in the filter
        if its entry in each of `n_tables` tables is True.
        
        Args:
            item (object): The item to check.
        '''
        # At first I went golfing, but this is for demonstration purposes, sooo....
        # return sum([1 for n, idx in enumerate(self.get_indices(item)) if self.tables[n][idx]]) == self.n_tables
        count = 0
        for n, idx in enumerate(self.get_indices(item)):
            if self.tables[n][idx]:
                count += 1
        return count == self.n_tables
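
One caveat with `hash(item)`: in Python 3, string hashes are salted per process unless `PYTHONHASHSEED` is set, so the same k-mer can land in different bins from one run to the next. That is harmless for an in-memory filter, but it matters if the tables are ever saved and reloaded. Below is a minimal Python 3 sketch of a reproducible alternative, assuming string items and using `hashlib`. The functions `stable_hash` and `get_indices` are stand-alone illustrations, not part of the class above.

```python
import hashlib


def stable_hash(item):
    """Hash a string to a non-negative int that is the same in every process."""
    digest = hashlib.sha1(item.encode('utf-8')).digest()
    # Use the first 8 bytes of the digest as a 64-bit integer.
    return int.from_bytes(digest[:8], 'big')


def get_indices(item, primes):
    """One bin index per table: the stable hash modulo each table size."""
    h = stable_hash(item)
    return [h % p for p in primes]


print(get_indices('ATGCGC', [97, 89, 83, 79]))
```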

### Demonstration

In [88]:
hexamers = ['ATGCGC', 'GTCGGT', 'CGGTAG', 'AGGTCG', 'GGGGGG']
b = BloomFilter(100, 4)

In [89]:
for kmer in hexamers[:-1]:
    b.add(kmer)

In [90]:
for kmer in hexamers[:-1]:
    print(kmer in b)

True
True
True
True


In [91]:
print(hexamers[-1] in b)

False
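
The demonstration above stores only four k-mers, so a false positive is very unlikely to show up. As a rough sketch of how the rate grows once the tables fill, the snippet below inserts random k-mers into tables of the same sizes and queries k-mers known to be absent. It is a compact stand-in rather than the `BloomFilter` class itself: the helper names, the SHA-1 based hash, and the choice of 12-mers are all assumptions made for this example.

```python
import hashlib
import random


def indices(item, sizes):
    """Stable hash of a string, reduced modulo each table size."""
    h = int(hashlib.sha1(item.encode('utf-8')).hexdigest(), 16)
    return [h % s for s in sizes]


def measure_fpr(sizes, n_items, n_queries, k=12, seed=0):
    """Insert random k-mers, then query k-mers known to be absent."""
    rng = random.Random(seed)

    def random_kmer():
        return ''.join(rng.choice('ACGT') for _ in range(k))

    tables = [[False] * s for s in sizes]
    inserted = set()
    while len(inserted) < n_items:
        inserted.add(random_kmer())
    for kmer in inserted:
        for table, idx in zip(tables, indices(kmer, sizes)):
            table[idx] = True

    hits = 0
    trials = 0
    while trials < n_queries:
        kmer = random_kmer()
        if kmer in inserted:
            continue  # only count queries that are truly absent
        trials += 1
        if all(table[idx] for table, idx in zip(tables, indices(kmer, sizes))):
            hits += 1
    return hits / float(n_queries)


sizes = [97, 89, 83, 79]
for n_items in (4, 40, 400):
    print(n_items, measure_fpr(sizes, n_items, n_queries=2000))
```

Inserted items are always reported as present; only absent items can be misreported, and that happens more often as the number of stored k-mers approaches the table sizes.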
