## BLOOM FILTERS

### Motivation

A Bloom filter is a data structure that allows rapidly checking whether a given element belongs to a known set.
With a large data set, a **linear search** would be very slow. Since we only want a "yes/no" answer, one way of solving this problem is to store only 1s and 0s that represent whether an element exists in the set, rather than the data itself.

While this is very efficient, the downside of this structure is its probabilistic nature: it can report false positives. A **false positive** occurs when, for a given value, the filter returns 1 by mistake when it should return 0. The false positive rate increases as the fraction of positions set to 1 increases. For a filter with $m$ positions, $k$ hash functions and $n$ inserted elements, the false positive rate can be approximated by:

$ p \approx \left(1 - e^{-kn/m}\right)^{k} $

On the other hand, Bloom filters never generate **false negatives**.
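
As a rough illustration, the standard approximation $(1 - e^{-kn/m})^k$ can be evaluated numerically to see how the rate grows as the filter fills up. This is only a sketch: the function name `false_positive_rate` and the values of `m`, `k` and `n` below are hypothetical and chosen for illustration.

```python
import math

def false_positive_rate(m, k, n):
    """Approximate false positive rate for m bits, k hash functions, n inserted items."""
    return (1.0 - math.exp(-k * n / m)) ** k

m, k = 10_000, 4  # hypothetical filter: 10,000 bits, 4 hash functions
counts = (500, 1000, 2000, 4000)
rates = [false_positive_rate(m, k, n) for n in counts]

# The estimate rises as more items are inserted into the same filter
for n, rate in zip(counts, rates):
    print(f"n={n:5d}  estimated false positive rate={rate:.4f}")
```

With `m` and `k` fixed, every additional insertion sets more positions to 1, so the estimate can only increase.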


In our implementation, *nhash* hash functions are computed over each value. Each of them returns a position in the Bloom filter (an integer in [0, m-1]), which we set to 1. Then, when checking whether an element is in the set, the filter returns ***True*** only if all *nhash* positions are set to 1. Combining several positions helps lower the number of false positives. It is important that the hash functions are independent of each other; if they were perfectly correlated, it would be the same as using a single hash function. We use MurmurHash3 (through the `mmh3` package), which is fast and robust, and obtain the *nhash* different hash functions by passing a different seed to each call.
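
The idea can be sketched on a toy filter before looking at the full class. To keep the sketch runnable without third-party packages, it uses `hashlib.blake2b` with a per-seed salt as a stand-in for MurmurHash3; the helper names (`positions`, `add`, `check`) and the sizes `m = 64` and `nhash = 3` are assumptions made only for this example.

```python
import hashlib

def positions(item, nhash, m):
    """Return nhash positions in [0, m-1], one per seed (salted BLAKE2b as a stand-in hash)."""
    result = []
    for seed in range(nhash):
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=seed.to_bytes(8, "little"),  # a different salt per seed gives a different hash function
        ).digest()
        result.append(int.from_bytes(digest, "little") % m)
    return result

m, nhash = 64, 3   # hypothetical toy sizes
bits = [0] * m     # the filter: m positions, all 0 at the start

def add(item):
    for pos in positions(item, nhash, m):
        bits[pos] = 1

def check(item):
    # True only if every one of the nhash positions is set
    return all(bits[pos] == 1 for pos in positions(item, nhash, m))

inserted = ["Dune", "Emma", "Ulysses"]
for title in inserted:
    add(title)

print(all(check(t) for t in inserted))  # inserted items are always found (no false negatives)
print(check("Hamlet"))                  # usually False; True here would be a false positive
```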

In [None]:
import sys
!{sys.executable} -m pip install mmh3

In [None]:
import numpy
import mmh3
 
class BloomFilter(object):
    def __init__(self, nitems, false_positive_prob):
        # Size of the bloom filter: m = -n * ln(p) / (ln 2)^2
        self.size = int(-(nitems * numpy.log(false_positive_prob)) / (numpy.log(2) ** 2))
        # Number of hash functions: k = (m / n) * ln 2, with at least one
        self.nhash = max(1, int((self.size / nitems) * numpy.log(2)))

        # Initialize the bloom filter as an array of self.size positions, all 0;
        # dtype=bool uses one byte per position instead of an 8-byte float
        self.bloomf = numpy.zeros(self.size, dtype=bool)
 
    def add(self, item):
        for i in range(self.nhash):
            h = mmh3.hash(item, i) % self.size #i = seed
            self.bloomf[h] = 1
 
    def check(self, item):
        for i in range(self.nhash):

            h = mmh3.hash(item, i) % self.size

            if self.bloomf[h] == 0:
                return False

        return True

In [6]:
import pandas as pd

chunks = pd.read_csv("dataset/library-collection-inventory.csv", chunksize=1000000)
our_set = pd.concat(chunks)
print(type(our_set))

In [None]:
import pandas as pd

print(our_set.head())

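# Select the second column by position (assumed to hold the titles);
# drop missing values and convert to str, since mmh3.hash expects str or bytes
titles = our_set.iloc[:, 1].dropna().astype(str)

print(titles)

n = len(titles) #number of items to add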
p = 0.05 #false positive probability

bloomf = BloomFilter(n,p)

print("Size of bit array:{}".format(bloomf.size))
print("Number of hash functions:{}".format(bloomf.nhash))

for item in titles:
    bloomf.add(item)

while True:
    inpt = input("Search: ")
    if not inpt:
        break

    if bloomf.check(inpt):
        # A positive answer may be a false positive
        print("The introduced query is probably in the dataset")
    else:
        print("We're sorry, but we didn't find a result.")

