# Lecture 2

# Part 1: A crash course in object-oriented programming

This part is for those of you who have never done any object-oriented programming. That is, in Python, you have never used the keyword `class`. Object-oriented programming is a huge topic on which there are entire courses. Here I am going to give one example programmed in a non-object-oriented way, then show how to make it object-oriented, and hopefully you will see the benefits.

The example I will use is this: Suppose you wanted to have the ability to deal with fractions. You want to create them, print them, and do arithmetic with them. And, importantly, you always want them to be in lowest terms: $\frac{6}{15}$ should be $\frac{2}{5}$. Additionally, the denominator should never be negative.

## Without objects

What is a fraction? Two numbers. There are two obvious ways to glue together two numbers in Python: as a list or as a tuple. So we have a choice to make: should $\frac{2}{3}$ be $[2,3]$ or $(2,3)$? The choice comes down to whether we want our fractions to be changeable or not (the technical term is mutable). Both would be reasonable choices, but we will choose the tuple representation: numbers and strings in Python are immutable, so this better matches what is there already. As a bonus, it makes them hashable.
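
To make the remark about hashability concrete, here is a small sketch (the variable names are only for illustration): a tuple can serve as a dictionary key or a set element, while a list cannot.

```python
# A tuple of two ints is immutable and therefore hashable.
two_thirds = (2, 3)
names = {two_thirds: "two thirds"}   # works: tuples can be dictionary keys
seen = {(1, 2), (2, 3)}              # works: tuples can be set elements

# A list is mutable, so Python refuses to hash it.
try:
    hash([2, 3])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

print(names[(2, 3)])       # two thirds
print((2, 3) in seen)      # True
print(list_is_hashable)    # False
```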

Given that we have made this decision, we need to decide what functions to write, and then write them.

The first, and most complex, function to write is a function that takes a numerator and denominator and returns a tuple containing the fraction. We will call this `fractionMake`. It is complicated as we need to make sure the fraction is in lowest terms, there is no division by zero, and the denominator is not negative. We also code a helper function, `gcd`, which implements the [method of Euclid](https://en.wikipedia.org/wiki/Euclidean_algorithm) to compute the GCD, which is needed to get the fraction into lowest terms.

Then we will have a number of functions that take one or two tuples that represent fractions, and do things like add, compare, and convert to string. These are all straightforward.


In [1]:
def gcd(x,y):
    if x == 0 :
        return y
    return gcd(y%x,x)

def fractionMake(numerator,denominator):
    if denominator==0:
        raise ZeroDivisionError(str(numerator)+"/"+str(denominator))       
    g=gcd(numerator,denominator)
    numerator=numerator//g
    denominator=denominator//g
    if denominator<0:
        numerator *= -1
        denominator *= -1
    return (numerator,denominator)

def fractionGetNumerator(fraction):
    return fraction[0]

def fractionGetDenominator(fraction):
    return fraction[1]

def fractionToString(fraction):
    numerator,denominator=fraction
    return str(numerator)+'/'+str(denominator)

def fractionAdd(fractionA,fractionB):
    numeratorA,denominatorA=fractionA
    numeratorB,denominatorB=fractionB
    return fractionMake(numeratorA*denominatorB+numeratorB*denominatorA,
                    denominatorA*denominatorB)

def fractionEquals(fractionA,fractionB):
    return fractionA==fractionB

def fractionLessThan(fractionA,fractionB):
    numeratorA,denominatorA=fractionA
    numeratorB,denominatorB=fractionB
    return numeratorA*denominatorB < numeratorB*denominatorA

Here is some sample code for computing the partial sums of $\displaystyle \sum_{i=1}^{9}\frac{2}{i}$:

In [2]:
x=fractionMake(0,1)
for i in range(1,10):
    x=fractionAdd(x,fractionMake(2,i))
    print(fractionToString(x))

2/1
3/1
11/3
25/6
137/30
49/10
363/70
761/140
7129/1260


## With objects

Notice that in the above code, `fractionMake` makes a fraction, and all the other functions take a fraction as the first parameter.

In object-oriented programming, the idea is to bundle data and the functions that act on that data together into an *object*. In the object, the functions that act on the data are called *methods*.

To make the above code object-oriented, we do the following:
- Start with `class fraction1:` and indent the rest. `fraction1` is the name of the class.
- Replace `fractionMake` with the special function `__init__`. This is called a constructor. This no longer returns anything.
- All functions have `self` as their first parameter.
- We can use `self.XXX` to store anything we want inside the fraction. In our case we want to store the numerator and denominator, so we use `self.numerator` and `self.denominator` for this purpose, replacing all of the indexing and unpacking of the tuple.
- We remove `fraction` from the names of the functions, as this is redundant now that they are in the `fraction1` class.
- To better match `self`, the functions that take two fractions will call the second one `other`.


In [35]:

class fraction1:
    def __init__(self,numerator,denominator):
        if denominator==0:
            raise ZeroDivisionError(str(numerator)+"/"+str(denominator))       
        g=gcd(numerator,denominator)
        numerator=numerator//g
        denominator=denominator//g
        if denominator<0:
            numerator *= -1
            denominator *= -1
        self.numerator=numerator
        self.denominator=denominator

    def getNumerator(self):
        return self.numerator

    def getDenominator(self):
        return self.denominator

    def toString(self):
        return str(self.numerator)+'/'+str(self.denominator)

    def add(self,other):
        return fraction1(self.numerator*other.denominator+other.numerator*self.denominator,
                        self.denominator*other.denominator)

    def equals(self,other):
        return self.numerator==other.numerator and self.denominator==other.denominator
    
    def lessThan(self,other):
        # Cross-multiply; this is valid because denominators are always positive.
        return self.numerator*other.denominator < other.numerator*self.denominator



Now observe how the small piece of sample code changes:
    
- `fractionMake(0,1)` has been replaced with `fraction1(0,1)` where `fraction1` is the name of the class.
- `fractionToString(x)` becomes `x.toString()`
- `fractionAdd(x,fractionMake(2,i))` becomes `x.add(fraction1(2,i))`

These reflect the fact that now `x` is not a tuple but an *instance* of a class, in this case `fraction1`. Observe that when we write `x.add(y)` the method `add(self,other)` is called with `self` being `x` and `other` being `y`.
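
The correspondence between `x.add(y)` and the method's parameters can be seen directly in the following sketch. The class `Point` is hypothetical and exists only to keep the example self-contained; the same holds for any class.

```python
class Point:
    """A hypothetical two-field class, used only to illustrate method calls."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def add(self, other):
        return Point(self.x + other.x, self.y + other.y)

p = Point(1, 2)
q = Point(10, 20)

r1 = p.add(q)          # usual form: p is passed as self, q as other
r2 = Point.add(p, q)   # the same call written out explicitly

print(r1.x, r1.y)      # 11 22
print(r2.x, r2.y)      # 11 22
```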

In [34]:
x=fractionMake(0,1)
for i in range(1,10):
    x=fractionAdd(x,fractionMake(2,i))
    print(fractionToString(x))
    
x=fraction1(0,1)
for i in range(1,10):
    x=x.add(fraction1(2,i))
    print(x.toString())

2/1
3/1
11/3
25/6
137/30
49/10
363/70
761/140
7129/1260
2/1
3/1
11/3
25/6
137/30
49/10
363/70
761/140
7129/1260

## Magic methods

You have certainly seen functions such as `str(x)` and `hash(x)`. You have also certainly seen expressions like `2+3`. You have probably only used them on built-in classes such as numbers, strings, and lists. However, you can get these and many more to work for classes you create through the use of magic methods. 

- `str(x)` works if you have defined `__str__(self)`
- `hash(x)` works if you have defined `__hash__(self)`
- `x+y` works if you have defined `__add__(self,other)`

There are many more; see the documentation. Most operators and many built-in functions in Python have a corresponding magic method.

We now re-write the above, using the magic methods, along with a few other changes:

- We put the `gcd` function inside the class. Note that this does not have `self` as its first parameter. It works exactly the same as when it was outside of the class, except that we need to call it with `Fraction.` first. Although gcd does not directly act on the data of our class, as it is only used by our class, it logically belongs inside it.
- The function `gcd` as well as `self.numerator` and `self.denominator` are meant to be used only by the methods of the class. We do not want the user of the class to change `self.numerator` directly. Thus we rename them with a leading underscore. In Python this is a convention rather than an enforced rule: such names can still be reached from outside the class, but the underscore signals that they should not be.
- We add a `__hash__`, which simply hashes the tuple of the numerator and denominator. Since every fraction is stored in lowest terms, equal fractions get equal hashes.


In [41]:
class Fraction:
    def _gcd(x,y):
        if x == 0 :
            return y
        return Fraction._gcd(y%x,x)
    def __init__(self,numerator,denominator):
        if denominator==0:
            raise ZeroDivisionError(str(numerator)+"/"+str(denominator))       
        g=Fraction._gcd(numerator,denominator)
        numerator=numerator//g
        denominator=denominator//g
        if denominator<0:
            numerator *= -1
            denominator *= -1
        self._numerator=numerator
        self._denominator=denominator
    def __str__(self):
        return str(self._numerator)+'/'+str(self._denominator)
    def __repr__(self):
        return str(self)
    def getNumerator(self):
        return self._numerator
    def getDenominator(self):
        return self._denominator
    def __eq__(self,other):
        return self._numerator==other._numerator and self._denominator==other._denominator
    def __add__(self,other):
        return Fraction(self._numerator*other._denominator+other._numerator*self._denominator,
                        self._denominator*other._denominator)
    def __mul__(self,other):
        return Fraction(self._numerator*other._numerator,self._denominator*other._denominator)
    def __truediv__(self,other):   # Python 3 calls __truediv__ for the / operator
        return Fraction(self._numerator*other._denominator,self._denominator*other._numerator)
    def __hash__(self):
        return hash( (self._numerator,self._denominator) )
        

In [42]:
S=[Fraction(1,n) for n in range(1,10)]
print(S)
print(sum(S,Fraction(0,1))) # See how sum uses our __add__ function
print(Fraction(3,4)==Fraction(6,9))
print(hash(Fraction(5,10)))

[1/1, 1/2, 1/3, 1/4, 1/5, 1/6, 1/7, 1/8, 1/9]
7129/2520
False
-3550055125485641917


There is still much more work to make this into a finished class. All the arithmetic and comparison operators need to be written, and it would be good if we could do things like add fractions to integers.
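
As a rough sketch of what some of that remaining work could look like, the hypothetical class `Frac` below (a stand-in for `Fraction`, using `math.gcd` instead of our own `_gcd`) accepts plain integers in additions through `__radd__` and gets the remaining comparison operators from `functools.total_ordering`. This is one possible design under those assumptions, not the only one.

```python
from functools import total_ordering
from math import gcd

@total_ordering
class Frac:
    """Hypothetical stand-in for the Fraction class above."""
    def __init__(self, numerator, denominator=1):
        if denominator == 0:
            raise ZeroDivisionError(f"{numerator}/{denominator}")
        if denominator < 0:
            numerator, denominator = -numerator, -denominator
        g = gcd(numerator, denominator)
        self._n, self._d = numerator // g, denominator // g

    @staticmethod
    def _coerce(value):
        # Treat a plain int as value/1; anything else is unsupported.
        if isinstance(value, Frac):
            return value
        if isinstance(value, int):
            return Frac(value, 1)
        return None

    def __add__(self, other):
        other = Frac._coerce(other)
        if other is None:
            return NotImplemented
        return Frac(self._n * other._d + other._n * self._d, self._d * other._d)

    __radd__ = __add__   # int + Frac: addition is commutative, so reuse __add__

    def __eq__(self, other):
        other = Frac._coerce(other)
        if other is None:
            return NotImplemented
        return (self._n, self._d) == (other._n, other._d)

    def __lt__(self, other):
        other = Frac._coerce(other)
        if other is None:
            return NotImplemented
        # Cross-multiplication is valid because denominators are positive.
        return self._n * other._d < other._n * self._d

    def __hash__(self):
        # Whole numbers hash like the equal int, so equal values hash equally.
        return hash(self._n) if self._d == 1 else hash((self._n, self._d))

    def __repr__(self):
        return f"{self._n}/{self._d}"

print(Frac(1, 2) + 1)                    # 3/2
print(1 + Frac(1, 2))                    # 3/2
print(sum([Frac(1, 2), Frac(1, 3)]))     # 5/6, sum starts from the int 0
print(Frac(1, 3) <= Frac(1, 2))          # True, __le__ comes from total_ordering
print(sorted([Frac(3, 4), Frac(1, 2), Frac(2, 3)]))  # [1/2, 2/3, 3/4]
```

Returning `NotImplemented` (rather than raising) lets Python try the other operand's method, which is how `1 + Frac(1, 2)` ends up in `__radd__`.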

# Part 2: Storing unordered data

There are many different types of data, but the simplest is unordered data.

This is where each data item has some key, and all you want to do is maintain a collection of data and support simple operations such as insert, delete, and membership queries. By membership queries, I mean queries where you give a key and ask whether there is a data item stored with this key; the answer is true or false. This is often called a set.

A slightly more powerful structure allows key,value pairs. Here the key is what we search for, and the value is additional data that goes along for the ride. When we search for a key in such a structure, if a key,value pair is in the structure, the value is returned. Such a structure is often called an associative array, in Python this is a dictionary.

The most efficient way to implement both of these structures is through a data structure known as a hash table.

Note that the approaches we describe today are not appropriate for data where you want to do non-exact queries, for example finding data close to a query or within a certain range. These will be topics for another day.


# Hash tables

Hash tables are the go-to structure when you want to quickly store and retrieve items by an exact search. That is, they are great if you want to know "Is the item with key XXX in the table?" but useless if you want to know "Give me the largest item with key at most XXX" or "Tell me how many items have keys at most XXX". For these operations a structure such as a binary search tree is needed.

Hash tables are very simple. Let $n$ be the current number of items in a hash table. A hash table is simply a list of $k$ buckets, where $k$ is typically chosen to be approximately $cn$ for some small constant $c$.

To insert an item `x` with hash value `h(x)`, we simply put it in the bucket `h(x) mod k`. To find an item `x` with hash `h(x)` we simply exhaustively search the bucket `h(x) mod k`.

Assuming a perfect hash function and $k=cn$, the expected size of each bucket is $\frac{n}{k}=\frac{1}{c}$, and given an item $x$, the expected size of the `h(x) mod k`th bucket is at most $1+\frac{1}{c}$ (the item itself plus the expected number of other items that land there). Thus each bucket can be represented as a list, and when we search we can just use an exhaustive search, as the lists are expected to be small.
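
The bucket sizes can be checked with a quick simulation. The sketch below models an idealized hash function by a uniform random choice of bucket (an assumption; real hash functions only approximate this) and measures both the average bucket size, which is $n/k$, and the average size of the bucket a stored item finds itself in. The function name and parameters are made up for this illustration.

```python
import random

def bucket_statistics(n, c, seed=0):
    """Throw n random keys into k = c*n buckets and measure bucket sizes.

    Assumes an idealized hash function, modeled by a uniform random choice.
    """
    rng = random.Random(seed)
    k = c * n
    sizes = [0] * k
    where = []
    for _ in range(n):
        b = rng.randrange(k)
        sizes[b] += 1
        where.append(b)
    average_bucket = sum(sizes) / k                           # exactly n/k
    average_seen_by_item = sum(sizes[b] for b in where) / n   # bucket size from an item's point of view
    return average_bucket, average_seen_by_item

avg_bucket, avg_seen = bucket_statistics(n=100_000, c=2)
print(avg_bucket)   # 0.5
print(avg_seen)     # close to 1 + (n-1)/k, i.e. about 1.5
```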

Below I present a Python implementation of this idea. This is a class that stores key,value pairs, where the key must be hashable but the value can be anything. The operations are:

- `H=HashTable(size)` (`__init__`) makes a new hash table with the given number of buckets
- `H[k]=v` (`__setitem__`) if there is a pair `(k,v')` stored already, replaces `v'` with `v`, otherwise adds `(k,v)` to the table
- `H[k]` (`__getitem__`) returns the value if `k` is in the table, and raises `KeyError` otherwise
- `k in H` (`__contains__`) returns whether there is a pair stored with key `k`
- `del H[k]` (`__delitem__`) deletes the pair with key `k`, and raises `KeyError` if there is none
- `str(H)` (`__str__`) shows the data currently stored
- `repr(H)` (`__repr__`) shows the current structure of the hash table

The implementation stores the hash table itself as a list `self._A`. This is private, as the user of the class should never directly access the hash table. Each bucket is also a list. Each item in a bucket is a list of size two storing `[key,value]` (note we want to be able to change `value`, so we don't use a tuple).


In [18]:
class HashTable:
    def __init__(self,size):
        self._A=[ [] for i in range(size)]
        
    def _bucket(self,key):
        #print(hash(key))
        #print(hash(key)%10)
        return self._A[hash(key)%len(self._A)]
    
    def __setitem__(self,key,value):
        bucket=self._bucket(key)
        for item in bucket:
            if item[0]==key:
                item[1]=value
                return
        bucket.append([key,value])
        
    def __getitem__(self,key):
        for k,v in self._bucket(key):
            if k==key:
                return v
        raise KeyError(key)
        
    def __contains__(self,key):
        for k,v in self._bucket(key):
            if k==key:
                return True
        return False
    
    def __delitem__(self,key):
        bucket=self._bucket(key)
        for i in range(len(bucket)):
            if bucket[i][0]==key:
                del bucket[i]
                return
        raise KeyError(str(key))
        
    def __str__(self):
        items=["{"]
        for bucket in self._A:
            for x in bucket:
                items.append(x)
        items.append('}')
        return "".join((str(x) for x in items))

    def __repr__(self):
        rep=""
        for i in range(len(self._A)):
            rep+=str(i)+": "+str(self._A[i])+"\n"
        return rep

In [21]:
H=HashTable(10)
for i in range(10):
    H["Data"+str(i)]="Value"+str(i)
print(repr(H))

print(3 in H)
print('Data3' in H)
print(hash('Data3')%10)
del H["Data1"]
H["Data5"]="Replaced"
print(H["Data2"])
print(repr(H))
print(str(H))

0: []
1: [['Data1', 'Value1'], ['Data8', 'Value8']]
2: [['Data2', 'Value2'], ['Data4', 'Value4']]
3: [['Data6', 'Value6']]
4: [['Data0', 'Value0']]
5: []
6: []
7: []
8: [['Data3', 'Value3'], ['Data5', 'Value5']]
9: [['Data7', 'Value7'], ['Data9', 'Value9']]

False
True
8
Value2
0: []
1: [['Data8', 'Value8']]
2: [['Data2', 'Value2'], ['Data4', 'Value4']]
3: [['Data6', 'Value6']]
4: [['Data0', 'Value0']]
5: []
6: []
7: []
8: [['Data3', 'Value3'], ['Data5', 'Replaced']]
9: [['Data7', 'Value7'], ['Data9', 'Value9']]

{['Data8', 'Value8']['Data2', 'Value2']['Data4', 'Value4']['Data6', 'Value6']['Data0', 'Value0']['Data3', 'Value3']['Data5', 'Replaced']['Data7', 'Value7']['Data9', 'Value9']}


# Python dictionaries

Python dictionaries are just hash tables. Of course they are more complete than the few lines of code above. You do not need to give the size of the hash table in advance, as they resize automatically as the number of items changes. They also support various forms of iteration (so code like `for x in H:` works). I won't describe how to do this, but certainly read about it elsewhere if you are interested.
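
To give a feel for what automatic resizing involves, here is a minimal sketch of a chained hash table that doubles its number of buckets whenever it holds as many items as buckets. The class `ResizingTable` and its growth rule are assumptions made for illustration; CPython's `dict` uses a different and more elaborate layout.

```python
class ResizingTable:
    """Hypothetical chained hash table that doubles its bucket list when full."""
    def __init__(self, size=8):
        self._A = [[] for _ in range(size)]
        self._count = 0

    def _bucket(self, key):
        return self._A[hash(key) % len(self._A)]

    def _resize(self, new_size):
        old_items = [item for bucket in self._A for item in bucket]
        self._A = [[] for _ in range(new_size)]
        for key, value in old_items:          # every item must be re-hashed
            self._bucket(key).append([key, value])

    def __setitem__(self, key, value):
        for item in self._bucket(key):
            if item[0] == key:
                item[1] = value
                return
        if self._count >= len(self._A):       # keep at most one item per bucket on average
            self._resize(2 * len(self._A))
        self._bucket(key).append([key, value])
        self._count += 1

    def __getitem__(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def __len__(self):
        return self._count

    def __iter__(self):
        # Yield every key, bucket by bucket; this is what makes "for x in T" work.
        for bucket in self._A:
            for k, _ in bucket:
                yield k

T = ResizingTable(size=2)
for i in range(100):
    T[i] = i * i
print(len(T), len(T._A))              # 100 128
print(T[7])                           # 49
print(sorted(T) == list(range(100)))  # True
```

Each resize touches every stored item, but because the size doubles each time, the total re-hashing work over many insertions stays proportional to the number of insertions.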

In [5]:
H={}
H["This"]="that"
print(H["This"])

that


# Bit arrays

Suppose you wanted to have an array of boolean or 0/1 values. If you use a list in Python, this will take up a lot of space. Here I will show you how to create a simple class that stores an array of bits compactly. The key is to make use of Python's [low-level arrays](https://docs.python.org/3/library/array.html). Normal lists in Python are very flexible and store any type of data, whereas with low-level arrays you need to say in advance what kind of data will be stored.

We will store an array of bytes, each of which is 8 bits. We use the bit operations to extract out individual bits.
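
Before the full class, here is a small sketch of the bit operations involved, using a `bytearray` for brevity (the helper `locate` is a made-up name): integer division by 8 finds the byte, and a shift by the remainder builds a mask that selects one bit inside it.

```python
data = bytearray(3)          # 3 bytes = 24 bits, all zero

def locate(bit_index):
    """Return (byte position, mask) for a bit index, with 8 bits per byte."""
    return bit_index // 8, 1 << (bit_index % 8)

byte_pos, mask = locate(11)  # bit 11 lives in byte 1, at position 3 inside it
print(byte_pos, bin(mask))   # 1 0b1000

data[byte_pos] |= mask                 # set the bit to 1
print((data[byte_pos] & mask) != 0)    # True: the bit is now set
data[byte_pos] ^= mask                 # flip it back to 0
print((data[byte_pos] & mask) != 0)    # False
print(list(data))                      # [0, 0, 0]
```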

In [44]:
import array
import sys

class BitArray:
    def __init__(self,size):
        self._bits=8
        self._sizeInBits=size
        self._A=array.array("B",(0 for i in range(self._index(size)+1)))
    def _index(self,index):
        return index//self._bits
    def _mask(self,index):
        return 1<<(index%self._bits)        
    def __setitem__(self,indexInBits,value):
        if self[indexInBits]!=value:
            self._A[self._index(indexInBits)]^=self._mask(indexInBits)
    def __getitem__(self,indexInBits):
        if  self._A[self._index(indexInBits)] & self._mask(indexInBits) == 0:
            return 0
        else:
            return 1
    def __repr__(self):
        return "".join(str(self[i]) for i in range(self._sizeInBits))
    def __sizeof__(self):
        return sys.getsizeof(self._A)

    

In [45]:
B=BitArray(20)
print(repr(B))
B[3]=1
B[4]=1
print(repr(B))
B[3]=0
print(repr(B))

00000000000000000000
00011000000000000000
00001000000000000000


In [46]:
def spaceTest(n):
    s=n*2
    B=BitArray(s)
    L=[0]*s
    D={}
    S=set()
    for i in range(0,s,2):
        B[i]=1
        L[i]=1
        D[i]=1
        S.add(i)
    print("Space in bytes to store ",s//2," numbers from 0..",s)
    print("Bit array:",sys.getsizeof(B))
    print("List:",sys.getsizeof(L))
    print("Dictionary:",sys.getsizeof(D))
    print("Set:",sys.getsizeof(S))
    

spaceTest(1000)


Space in bytes to store  1000  numbers from 0.. 2000
Bit array: 345
List: 16056
Dictionary: 36960
Set: 32984


So in summary, our bit arrays are roughly 45 times more space efficient than Python lists and roughly 100 times more space efficient than dictionaries or sets.

# Bloom Filters

Our bit arrays were super compact, but of limited value. Usually we want to store things more general than an array of boolean values.

Here, we introduce the concept of a Bloom filter. A Bloom filter can store a set of anything, but is very, very compact. It supports inserting elements and checking to see if an element has been inserted. The tradeoff is that it is sometimes wrong. The good news is that the error only occurs in one direction: if you ask whether $x$ is in the filter and it is there, the answer is always yes. But if $x$ is not in the filter, it will usually say no, and there is a chance it will say yes.

The structure is very simple. It is initialized with two parameters: `k`, the size of the filter in bits, and `h`, the number of hash functions. The structure is just a bit array of `k` bits. When you want to insert an item $x$, you set the $hash_i(x)$th element of the bit array (taken mod `k`) to one, for each of the $h$ different hash functions. To see if $x$ is in the filter, you check whether the $hash_i(x)$th element of the bit array is one for every one of the $h$ hash functions.


In [28]:
class Bloom:
    def __init__(self,sizeInBits,hashCount):
        self._A=BitArray(sizeInBits)
        self._sizeInBits=sizeInBits
        self._hashCount=hashCount
    def add(self,item):
        for h in range(self._hashCount):
            self._A[hash((h,item))%self._sizeInBits]=1
    def __contains__(self,item):
        return all(( self._A[hash((h,item)) % self._sizeInBits] for h in range(self._hashCount)))
    def __repr__(self):
        return repr(self._A)

Now, let's test it. We will create a Bloom filter of size 80 bits and use three hash functions. We will store 10 random strings of length 20. Stored as plain characters these would take up 200 bytes, but we are using only 10 bytes. Let's see how it works:

In [54]:
import random

def randomStringOfChars(length):
    letters = "qwertyuiopasdfghjklzxcvbnm"   # each of the 26 lowercase letters once
    return "".join((random.choice(letters) for i in range(length)))

A=[]
B=Bloom(80,3)

for i in range(10):
    s=randomStringOfChars(20)
    A.append(s)
    B.add(s)
    print("Adding ",s,"\n",repr(B))
print("Checking those in the filter")
for s in A:
    print("Checking ",s," ",s in B)
print("Checking random strings")
for s in (randomStringOfChars(20) for i in range(10)):
    print("Checking ",s," ",s in B)    

Adding  aghafthdtmvupzlfhsuv 
 00000000000000000000000000000000000000000000000000110001000000000000000000000000
Adding  rjrstzcnvpjmxqaewser 
 00000000100000000000000000000000000000010000100000110001000000000000000000000000
Adding  fhheislljbsjnvibgjzl 
 00000000100000000000000000000000000010010000100000110001000000000001000010000000
Adding  zomspweytnxjjqlglylu 
 00000000100000000000100000000000100010010000100000110001000000010001000010000000
Adding  nqxncemlnizocgimklcm 
 00000000100000000000100000000000100011010100100000110001000000010001100010000000
Adding  bntlqoflkpnmqidciswa 
 00000000100010001000100000000000100011010101100000110001000000010001100010000000
Adding  exigzqbydcvwtnxjopwp 
 00000000100010001000100000000000100011011101100000110001000000010001100110001000
Adding  xpidfuwajvkiaguwmpcx 
 00000000100010001010100000000000100011011101100000110011000000010001100110001000
Adding  pjtemgsznrixuqucmdmu 
 0000000011001000101010000000000010001101110110000011001100010001000110011

We can also try computing the false positive rate as a function of the various parameters.

In [33]:
def TestBloomFalsePositivePercent(tableSize,itemsToAdd,hashCount,trials):
    B=Bloom(tableSize,hashCount)
    for i in range(itemsToAdd):
        B.add(randomStringOfChars(100))
    falsePositive=0
    for i in range(trials):
        # A fresh random string was (almost certainly) never added, so a hit is a false positive.
        s=randomStringOfChars(100)
        if s in B:
            falsePositive+=1
    return 100*falsePositive/trials
            

In [15]:
print("Percent False positive ",TestBloomFalsePositivePercent(20000,2000,8,1000),"%")

Percent False positive  1.8 %


But, let's stop the experimenting for a minute. As the structure is so simple we can analyze it using math. There are several questions we want to answer:

- What is the chance of a false positive, given a specific Bloom filter?
- What is the optimal number of hash functions?
- How big should the Bloom filter be to achieve a specified error rate?

## Chance of false positive

We have three parameters:

- `k` is the size of the Bloom filter, in bits
- `h` is the number of hash functions
- `n` is the number of items stored

We start with a basic assumption on hash functions: for any $i \in [k]$, any $x$, and any hash function $hash$, $Pr[hash(x)=i]=\frac{1}{k}$ and thus $Pr[hash(x) \neq i]=1-\frac{1}{k}$.

Now, what is the chance that, with $h$ hash functions $hash_1,\ldots,hash_h$, none of them hashes $x$ to $i$? As we assume independent hash functions, this can be computed as follows:

$\overbrace{Pr[ \forall_{\ell \in [h]} hash_\ell(x) \neq i] = \prod_{\ell \in [h]} Pr[ hash_\ell(x) \neq i]}^{\text{Assuming the events are independent}}=\left(1-\frac{1}{k}\right)^h$. 


The above is for a single insertion, where $h$ bits are set. For $n$ insertions, where $hn$ bits are set, the chance that a single bit remains 0 is:


$ (1-\frac{1}{k})^{hn} $

On its own you don't have a good sense of what this is.
Whenever we have an expression like this it is good to have the following fact in mind:

$ \displaystyle \lim_{x \rightarrow \infty}\left(1-\frac{1}{x}\right)^x = \frac{1}{e} $

Thus we can rewrite using the fact as an approximation:

$Pr[\text{A bit is 0 after n insertions}] = (1-\frac{1}{k})^{hn}= (1-\frac{1}{k})^{k\cdot\frac{hn}{k}}\approx e^{\frac{-hn}{k}}$

Of course, from this we get the probability that a bit is 1:

$Pr[\text{A bit is 1 after n insertions}] \approx 1- e^{\frac{-hn}{k}}$

When do we have a false positive? When we search for some $x$ that has not been inserted, and all of the bits $hash_\ell(x)$ for $\ell \in [h]$ are one. Since we know the probability that a single bit is 1, and we assume the bits are independent, we can just multiply (see the note at the bottom of this section):

$Pr[\text{A false positive after $n$ insertions}] \approx \left(1- e^{\frac{-hn}{k}} \right)^h$
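
To get a feel for the numbers, the sketch below evaluates this formula, together with the version before the $e$ approximation, for the same parameters as the experiment above ($k=20000$, $n=2000$, $h=8$). The function names are made up for this illustration, and both functions still rely on the independence assumption.

```python
import math

def false_positive_exact(k, n, h):
    """(1 - (1 - 1/k)^(h*n))^h, still assuming the bits are independent."""
    return (1 - (1 - 1 / k) ** (h * n)) ** h

def false_positive_approx(k, n, h):
    """The approximation (1 - e^(-h*n/k))^h."""
    return (1 - math.exp(-h * n / k)) ** h

k, n, h = 20000, 2000, 8
print(100 * false_positive_exact(k, n, h))
print(100 * false_positive_approx(k, n, h))   # both are a little under 1 percent
```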

Using calculus, this is minimized when $h = \frac{k}{n}\ln 2 \approx 0.69 \frac{k}{n}$. At this point the error rate is

$Error=\left(1- e^{\frac{-n\frac{k}{n}\ln 2}{k}} \right)^{\frac{k}{n}\ln 2}=
\left(1-e^{-\ln 2}\right)^{\frac{k}{n}\ln 2}
=\left( \frac{1}{2} \right)^{\frac{k}{n}\ln 2}$

Taking logarithms:

$ \ln Error =  \frac{k}{n}\ln 2 \ln \frac{1}{2} = -\frac{k}{n} \ln^2 2  \approx -\frac{1}{2}\cdot \frac{k}{n} $

This can be re-written in terms of $\frac{k}{n}$, the bits-per-item:

$ \frac{k}{n}= - \frac{1}{\ln^2 2} \ln Error $

Thus for 1% error, $-\frac{\ln 0.01}{\ln^2 2} \approx 10$ bits per item and $-\frac{\ln 0.01}{\ln 2} \approx 7$ hash functions should work.

For 0.01% error, $\approx 19$ bits per item and 13 hash functions should work.
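
These two figures can be reproduced with a few lines of code. The helper `bloom_parameters` is a hypothetical name; it simply evaluates the two formulas for a given error rate, before rounding to whole bits and whole hash functions.

```python
import math

def bloom_parameters(error_rate):
    """Bits per item and number of hash functions suggested by the formulas above."""
    bits_per_item = -math.log(error_rate) / math.log(2) ** 2
    hash_count = -math.log(error_rate) / math.log(2)
    return bits_per_item, hash_count

for eps in (0.01, 0.0001):
    bits, hashes = bloom_parameters(eps)
    print(eps, round(bits, 2), round(hashes, 2))
# 0.01 9.59 6.64
# 0.0001 19.17 13.29
```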

Note from above: The bits are not really independent, but independence here is a good approximation. For example, if I have two buckets and I put a coin randomly in one of them, whether the coin is in bucket one and whether it is in bucket two are not independent, since if the coin is in bucket one, it is not in bucket two! However, if you have 2000 buckets and 1000 coins, each in a different bucket, there is a $1/2$ chance of a given bucket having a coin. If you look in one bucket and there is a coin, the chance of any other given bucket having a coin goes down to $999/1999 \approx 0.49975$, which is very close to the $0.5$ you would get if you assumed independence. Bloom filters can be fully analyzed without this shortcut using something known as the Azuma–Hoeffding inequality, which we do not do here.


## Summary

So with an error rate $\epsilon$, use a Bloom filter with $-\frac{\ln \epsilon}{\ln^2 2}$ bits per item, and $-\frac{\ln \epsilon}{\ln 2}$ hash functions. Using big-O, the space usage of a Bloom filter is $O(n \log \frac{1}{\epsilon})$, and the query time is $O\left( \log \frac{1}{\epsilon} \right)$. 

Disadvantages:
- Errors
- No deletion
- No key,value pairs. Just keys.

## Example: URL

Suppose you had a company that kept a database of bad URLs, and you wanted to offer customers a product that would check whether the web pages they visit are on the list and warn them. Suppose your list has one million bad URLs and the average length is 70 characters. Without Bloom filters you would have these two main options:

- Transmit the list of bad URLs to all users. This would be about 70MB of raw data. The user's software would query it. This has a disadvantage of a large file that must be transferred. A large file is likely to not be stored in memory on the computer and so queries could become slow.

- Keep the list of bad URLs on the server. The user queries the server for every URL. This requires lots of data transfers, and the checks will take a while because of network latency.

With Bloom filters, there is a third option:

- Give the user a Bloom filter with a one percent error rate, which uses 10 bits ($\frac{10}{8}$ bytes) per item. The total size is only about 1.25MB. Every time there is a hit in the Bloom filter, the server is queried, but for URLs that are not on the list this happens only about 1% of the time.
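
The third option can be sketched in code as follows. Everything here is hypothetical: `TinyBloom` is a small self-contained filter that uses salted SHA-256 digests as its hash functions (chosen so that results do not depend on Python's per-process string hashing), the URLs are made up, and the "server" is just a set in the same program.

```python
import hashlib

class TinyBloom:
    """Hypothetical Bloom filter using salted SHA-256 as its family of hash functions."""
    def __init__(self, size_in_bits, hash_count):
        self._size = size_in_bits
        self._hash_count = hash_count
        self._bits = bytearray((size_in_bits + 7) // 8)

    def _positions(self, item):
        for i in range(self._hash_count):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self._size

    def add(self, item):
        for p in self._positions(item):
            self._bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self._bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Server side: the authoritative list (a stand-in with made-up URLs).
bad_urls = {f"http://bad{i}.example/" for i in range(1000)}

# Client side: a filter with 10 bits per item and 7 hash functions.
client_filter = TinyBloom(10 * len(bad_urls), 7)
for url in bad_urls:
    client_filter.add(url)

server_queries = 0

def is_bad(url):
    """Ask the server only when the local filter reports a possible hit."""
    global server_queries
    if url not in client_filter:
        return False                 # definitely not on the list
    server_queries += 1              # possible hit: confirm with the server
    return url in bad_urls

good_urls = [f"http://good{i}.example/" for i in range(10000)]
flagged = sum(is_bad(u) for u in good_urls)
print(flagged)                            # 0: the server check removes the false positives
print(server_queries)                     # number of false positives, roughly 1% of 10000
print(all(is_bad(u) for u in bad_urls))   # True
```

The filter itself occupies 1250 bytes here, and the server is contacted only for the bad URLs and for the small fraction of good URLs that the filter wrongly reports.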






# Homework

Bloom filters do not support deletion. One way to support deletion is to replace the array of single bits with an array that can store small integers. When we insert, the integers are incremented, and when we delete, they are decremented.

Your tasks are:

- Code a Bloom filter class that supports deletion, where `del B[k]` would delete `k` from `B`. For simplicity, use one byte for each element of the array, so a count in the range 0..255 can be stored. Make sure to generate an error if the maximum count is exceeded.

- How many things can be inserted without causing the structure to fail? Try to make a precise mathematical statement.

Hint: Use this classic result from probability theory, known as balls-and-bins: if you throw $n$ balls randomly into $n$ bins, all bins will contain at most $\frac{e \ln n}{\ln \ln n}$ balls with probability at least $1-\frac{1}{n^{0.35}}$.









