# Bloom Filter

Please refer to the <a href="https://drive.google.com/file/d/18uY18kqGPMcW67bAjNiltTk6sSr6Wh2B/view?usp=sharing">lecture slides</a> for the theoretical details.

To implement a Bloom filter, we need a family of hash functions and a bitarray. 

## The hash function (MurmurHash 3)

We can use a widely used hash function, <a href="https://en.wikipedia.org/wiki/MurmurHash">MurmurHash 3</a> (available in Python as the mmh3 package), with $k$ different seeds to obtain $k$ hash functions.

In [None]:
import math
import numpy as np
import mmh3

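# mmh3.hash(element, seed) returns a signed 32-bit integer by default
print(mmh3.hash('ISI Kolkata',1))
print(mmh3.hash('ISI Kolkata',2))
# The same element and seed always give the same hash value
print(mmh3.hash('ISI Kolkata',1))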
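
Since the hash values printed above may be negative, it helps to check that reducing them modulo the number of bits still yields a valid index. The following is a minimal sketch using only built-in Python; the list `signed_hashes` holds hypothetical values from the signed 32-bit range, not actual MurmurHash outputs.

```python
n_bits = 10

# Hypothetical hash values covering the signed 32-bit range (not real mmh3 outputs)
signed_hashes = [-2147483648, -7, 0, 7, 2147483647]

# Python's % with a positive modulus always returns a value in [0, n_bits)
positions = [h % n_bits for h in signed_hashes]
print(positions)
```

Because Python's `%` never returns a negative result for a positive modulus, no extra handling of the sign is needed before indexing into the bit array.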

## The bitarray

We will use the bitarray package from the Python Package Index (PyPI).

In [None]:
from bitarray import bitarray 

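# An initial test: allocate 10 bits
# (depending on the bitarray version, the initial values may be arbitrary)
bit_array = bitarray(10)
print(bit_array)

# Set all bits to 0 explicitly
bit_array.setall(0)
print(bit_array)

# Set the bit at index 3
bit_array[3] = 1
print(bit_array)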

## The Bloom Filter class

We are now ready to implement the Bloom filter class. Its parameters are nBits (the number of bits) and nHash (the number of hash functions). The number of items, nItems, is not a parameter of the class; it is only used later to choose nBits.

In [None]:
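class BloomFilter(object):
    """A Bloom filter backed by a bitarray, with mmh3 seeds 0..nHash-1 as hash functions."""

    def __init__(self, nBits, nHash):
        self.nBits = nBits
        self.nHash = nHash
        self.bits = bitarray(self.nBits)
        self.bits.setall(0)

    def add(self, item):
        # Set the bit selected by each of the nHash hash functions
        for i in range(self.nHash):
            hash_value = mmh3.hash(item, i) % self.nBits
            self.bits[hash_value] = True

    def test(self, item):
        # If any selected bit is 0, the item was definitely never added
        for i in range(self.nHash):
            hash_value = mmh3.hash(item, i) % self.nBits
            if self.bits[hash_value] == 0:
                return False
        # All bits are set: the item is probably present (false positives are possible)
        return True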
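
The defining property of this structure is that an added item is never reported as absent, while an item that was never added may occasionally be reported as present. The sketch below illustrates the first half of that property using only the standard library. It is an assumption-laden stand-in, not the class above: salted `hashlib.blake2b` digests replace mmh3, a `bytearray` (one byte per bit, for simplicity) replaces bitarray, and the names `TinyBloom` and `_positions` are hypothetical.

```python
import hashlib


class TinyBloom:
    """Illustrative stand-in for a Bloom filter, standard library only."""

    def __init__(self, n_bits, n_hash):
        self.n_bits = n_bits
        self.n_hash = n_hash
        self.bits = bytearray(n_bits)  # one byte per bit; all zero initially

    def _positions(self, item):
        # One salted BLAKE2b digest per seed plays the role of mmh3.hash(item, seed)
        for seed in range(self.n_hash):
            digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                     salt=seed.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def test(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


members = ["kolkata", "delhi", "mumbai", "chennai"]
tiny = TinyBloom(n_bits=200, n_hash=3)
for city in members:
    tiny.add(city)

# Every added item must be reported as present (no false negatives)
all_members_found = all(tiny.test(city) for city in members)
print(all_members_found)
```

Adding an item sets every bit that a later test of the same item inspects, and bits are never cleared, which is why false negatives cannot occur.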

## Constructing our Bloom Filter

Let us take a text file "words.txt" containing about 466,000 English words and treat these words as our member set. We will construct the filter from these words and then check whether a new word is reported as present in the set.

In [None]:
nItems = 466000
nBits = nItems * 20
nHash = 10

bloom_filter = BloomFilter(nBits, nHash)
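
To see how the choice of 10 hash functions sits relative to other values, one can evaluate the standard approximation $(1 - e^{-kn/m})^k$ of the false positive rate, which is minimized near $k = (m/n)\ln 2$. The following is a rough sketch under that approximation; the helper name `approx_fp_rate` is hypothetical.

```python
import math


def approx_fp_rate(n_items, n_bits, n_hash):
    """Standard approximation (1 - e^{-kn/m})^k of the false positive rate."""
    return (1 - math.exp(-n_hash * n_items / n_bits)) ** n_hash


n_items = 466000
n_bits = n_items * 20

# Number of hash functions that minimizes the approximation: (m/n) * ln 2
k_opt = (n_bits / n_items) * math.log(2)
print(f"optimal k is about {k_opt:.2f}")

# Compare a few choices of k, including the value 10 used in this notebook
for k in (5, 10, 14, 20):
    print(k, approx_fp_rate(n_items, n_bits, k))
```

Under this approximation, 10 hash functions at 20 bits per item is slightly below the optimum of about 14, trading a somewhat higher false positive rate for fewer hash computations per operation.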

Now we add the member words to the filter. 

In [None]:
memberFile = open('words.txt', 'r')
count = 0

# Stop once nItems words have been added
while count < nItems:
    # read the next line
    line = memberFile.readline()

    # if the line is empty, we have reached the end of the file
    if not line:
        break

    # Otherwise strip the trailing newline and add the word
    bloom_filter.add(line.strip())
    count += 1
memberFile.close()

## Testing: checking for false positives

Now we synthesize negative examples and test our filter with them. To make sure each example is negative (not present in the list of about 466,000 English words), we simply add the prefix 'moolb' ('bloom' reversed) to each word.

In [None]:
testFile = open('words.txt', 'r')
count = 0
countFalsePositive = 0
prefix = 'moolb'

# Stop once nItems words have been tested
while count < nItems:
    # read the next line
    line = testFile.readline()

    # if the line is empty, we have reached the end of the file
    # (check this before adding the prefix, which would make the string non-empty)
    if not line:
        break

    # Otherwise test the prefixed word with the filter
    count += 1
    word = prefix + line.strip()
    isPresent = bloom_filter.test(word)

    if isPresent:
        countFalsePositive += 1
testFile.close()

print("Percentage of false positives: ", (countFalsePositive*100/count))

We can validate this result by computing the expected false positive rate using the standard approximation $p \approx (1 - e^{-kn/m})^k$, where $n$ is nItems, $m$ is nBits and $k$ is nHash.

In [None]:
expectedP = (1 - np.exp(-nHash*nItems/nBits))**nHash
print("Expected percentage of false positives: ", expectedP*100)
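
The same approximation can be read in the other direction to size a filter for a target false positive rate: when the number of hash functions is chosen optimally, $m = -n \ln p / (\ln 2)^2$. The sketch below applies this standard sizing formula; the helper name `size_for_target` and the target rates are illustrative assumptions, not part of the notebook.

```python
import math


def size_for_target(n_items, target_fp):
    """Approximate number of bits and hash functions for a target false positive rate."""
    n_bits = math.ceil(-n_items * math.log(target_fp) / (math.log(2) ** 2))
    n_hash = max(1, round((n_bits / n_items) * math.log(2)))
    return n_bits, n_hash


n_items = 466000
for target in (0.01, 0.001, 0.0001):
    n_bits, n_hash = size_for_target(n_items, target)
    # Print the target rate, total bits, hash count and bits per item
    print(target, n_bits, n_hash, round(n_bits / n_items, 2))
```

By this estimate, the 20 bits per item used in this notebook is roughly what a target rate of 0.01% calls for.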