## Introduction

In this notebook we look at hash tables and Bloom filters and implement them in Python to build a concrete understanding.

### Hash Tables

Hash tables store key-value pairs. The goal is to support lookup and insertion in constant time. We are specifically interested in the following three operations:

- Lookup: Given a key, return the corresponding value for the key
- Insert: Given a key, insert (or replace) the corresponding value in the table
- Delete: Given the key, delete the corresponding value

Arrays let us insert, look up and delete by index in constant time. So if we can convert each key to an integer index into an array such that there is a one-to-one relation between keys and indices, we can guarantee constant-time operations, at the cost of space.
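The array idea above can be sketched as follows. This is only an illustration, assuming keys are non-negative integers below a hypothetical bound `universe_size`; the class name `DirectAddressTable` is not part of the notebook, and `None` is used to mark an empty slot.

```python
class DirectAddressTable:
    """Array-backed table: the key itself is the array index."""

    def __init__(self, universe_size):
        # One slot per possible key, so space grows with the size of the
        # universe, not with the number of keys actually stored.
        self.slots = [None] * universe_size

    def insert(self, key, value):
        self.slots[key] = value      # O(1): direct index

    def lookup(self, key):
        return self.slots[key]       # O(1): direct index

    def delete(self, key):
        self.slots[key] = None       # O(1): clear the slot


table = DirectAddressTable(universe_size=100)
table.insert(42, "answer")
print(table.lookup(42))   # answer
table.delete(42)
print(table.lookup(42))   # None
```

Every operation is a single index access, but the table reserves a slot for every possible key whether or not it is ever used.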

---

**Quiz 12.1**

Suppose all our keys are strings of length 25 containing only lowercase English letters. Then there are $26^{25}$ possible keys, and an array that large is impractical.
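To get a feel for the size, here is a quick back-of-the-envelope calculation. The 1 byte per slot figure is an assumption chosen to be unrealistically generous; a real slot holding a value or a pointer would be larger.

```python
universe_size = 26 ** 25      # number of distinct 25-letter lowercase strings
bytes_per_slot = 1            # assumption: an unrealistically small slot
total_bytes = universe_size * bytes_per_slot

print(f"{universe_size:.3e} possible keys")
print(f"{total_bytes / 10**18:.3e} exabytes at {bytes_per_slot} byte per slot")
```

The count is on the order of $10^{35}$, far beyond any storage we could allocate.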

---

Continuing with the case above, let $U$ be the universal set containing all possible strings of length 25, and let $S$ be the subset of keys we intend to insert into the hash table. Creating one large array for every possible key in $U$ is prohibitive. We therefore need a structure that uses space linear in $\mid S \mid$ and still gets $\theta(1)$ complexity for all operations.

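Let us now implement the 2-Sum problem (given an array and a target sum, decide whether two elements of the array add up to the target) in three ways: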

- 1: Naive, linear scan of the array for each number, with complexity $\theta(n^2)$
- 2: Sort and binary search, with complexity $\theta(n \log n)$
- 3: Use a hash map for lookup, with complexity $\theta(n)$



In [1]:
def naive_2sum(arr, expected_sum):
    for i, n1 in enumerate(arr):
        expectedn2 = expected_sum - n1
        for idx in range(i):
            n2 = arr[idx]
            if n2 == expectedn2:
                return "yes"
            
    return "no"

def better_2sum(arr, expected_sum):
    import bisect
    # Work on a sorted copy so the caller's list is not reordered
    arr = sorted(arr)
    
    # Binary search arr[0:max_idx] to see if num exists there
    # runs in O(log n) time
    def is_present(num, max_idx):
        idx = bisect.bisect_left(arr, num, lo = 0, hi = max_idx)
        return idx != max_idx and arr[idx] == num
    
    for idx, n1 in enumerate(arr):
        expectedn2 = expected_sum - n1
        # Search only the numbers before the current index, excluding the current one,
        # so that a number is never paired with itself
        if is_present(expectedn2, idx):
            return "yes"
        
    return "no"

def optimal_2sum(arr, expected_sum):
    s = set()
    # A set can be viewed as a hash table with no associated values; we only care about the keys
    for n1 in arr:
        expectedn2 = expected_sum - n1 
        if expectedn2 in s:
            return "yes"
        s.add(n1)
    
    return "no"

Let's test the above functions.

In [2]:
arr  = [2, 2, 1, 5, 3, 9]

# Should return yes: there is one match, 1 + 9 = 10
print('(naive_2sum, better_2sum, optimal_2sum)', \
            naive_2sum(arr, 10),\
            better_2sum(arr, 10),\
            optimal_2sum(arr, 10))

# Should return yes: two pairs match, 2 + 2 and 3 + 1 both add up to 4
print('(naive_2sum, better_2sum, optimal_2sum)', \
            naive_2sum(arr, 4),\
            better_2sum(arr, 4),\
            optimal_2sum(arr, 4))

# Should return no as no numbers sum to 100
print('(naive_2sum, better_2sum, optimal_2sum)', \
            naive_2sum(arr, 100),\
            better_2sum(arr, 100),\
            optimal_2sum(arr, 100))

# Should return no: the only way to reach 18 is 9 + 9, and 9 occurs only once
print('(naive_2sum, better_2sum, optimal_2sum)', \
            naive_2sum(arr, 18),\
            better_2sum(arr, 18),\
            optimal_2sum(arr, 18))


(naive_2sum, better_2sum, optimal_2sum) yes yes yes
(naive_2sum, better_2sum, optimal_2sum) yes yes yes
(naive_2sum, better_2sum, optimal_2sum) no no no
(naive_2sum, better_2sum, optimal_2sum) no no no


The implementations look OK. Let's write a function that reads lines from a file and returns them as a list of integers, and a function to test an implementation on a given file.

In [3]:
def load_(file):
    # Read one integer per line, skipping blank lines
    with open(file) as f:
        return [int(line) for line in f if line.strip()]

    
def test(file, interval_start, interval_end, two_sum_impl):
    arr = load_(file)
    # Count the targets in [interval_start, interval_end], inclusive, that have a match
    return sum(1 for target in range(interval_start, interval_end + 1) if two_sum_impl(arr, target) == 'yes')

        

In [4]:
# Try out all functions
import time
for two_sum_impl in [naive_2sum, better_2sum, optimal_2sum]:
    start = int(time.time() * 1000)
    res = test('problem12.4test.txt', 3, 10, two_sum_impl)
    end = int(time.time() * 1000)
    print('Result from', two_sum_impl.__name__, 'returned', res, 'in', (end - start), 'ms')

Result from naive_2sum returned 8 in 0 ms
Result from better_2sum returned 8 in 1 ms
Result from optimal_2sum returned 8 in 1 ms


---

OK, the above results look as expected. Let's try the big file now.

---

In [5]:
# Try the n log n and linear solutions
import time

start = int(time.time() * 1000)
res = test('problem12.4.txt', -10000, 10000, optimal_2sum)
end = int(time.time() * 1000)
print('Result from optimal_2sum returned', res, 'in', (end - start), 'ms')

start = int(time.time() * 1000)
res = test('problem12.4.txt', -10000, 10000, better_2sum)
end = int(time.time() * 1000)
print('Result from better_2sum returned', res, 'in', (end - start), 'ms')


Result from optimal_2sum returned 427 in 7370780 ms
Result from better_2sum returned 427 in 27375272 ms


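Even the linear-time solution took close to 2 hours (7,370,780 ms), and the $\theta(n \log n)$ solution using binary search took about 7.6 hours (27,375,272 ms). Note that each implementation is called once for each of the 20,001 targets in $[-10000, 10000]$, so the total work is 20,001 times the per-call cost. The result, 427, is hopefully correct.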
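One way to avoid repeating a full pass per target is to sort once and, for each element, binary search for the range of partners whose sum lands inside the interval. This is a different technique from the three implementations above, shown only as a sketch. The function name `count_targets` is hypothetical, and it keeps the notebook's convention that a pair means two different positions in the array. Its running time depends on how many pairs fall inside the interval, so how much it helps on the big file is an assumption about that file's contents.

```python
import bisect


def count_targets(arr, interval_start, interval_end):
    """Count targets t in [interval_start, interval_end] for which two
    elements at different positions of arr sum to t."""
    a = sorted(arr)
    found = set()
    for i, x in enumerate(a):
        # Partners y must satisfy interval_start - x <= y <= interval_end - x.
        # Searching only to the right of i avoids pairing an element with
        # itself and avoids visiting each pair twice.
        lo = bisect.bisect_left(a, interval_start - x, i + 1)
        hi = bisect.bisect_right(a, interval_end - x, i + 1)
        for j in range(lo, hi):
            found.add(x + a[j])
    return len(found)


arr = [2, 2, 1, 5, 3, 9]
print(count_targets(arr, 3, 10))   # 7
```

On the small array every target from 3 to 10 except 9 is reachable, which gives 7.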

---

**Quiz 12.2**

The solution that uses binary search to look up each value takes $\theta(n \log n)$: sorting takes $\theta(n \log n)$, each binary search takes $\theta(\log n)$, and we repeat the lookup $n$ times. See the ``better_2sum`` implementation.

---

### Hash table implementations

Let us come back to the two sets of interest: $U$, the universe of all possible keys, and $S$, the subset of keys we actually want to store. For example, for IPv4 addresses $\mid U \mid$ is $2^{32}$. There are two simple ways to store and retrieve the keys, arrays and linked lists. The following table summarizes their space and time complexities.

|Data Structure|Space Complexity|Time Complexity|
|:-------------|-----|--------|
|Array|$\theta({\mid U \mid})$|$\theta(1)$|
|Linked List|$\theta({\mid S\mid})$|$\theta({\mid S\mid})$ |

As we see above, using an array for a large universe is impractical, and using a linked list for even a moderately sized set of keys makes lookup inefficient. We want the best of both worlds, and the following are the complexities we aim to achieve:


|Data Structure|Space Complexity|Time Complexity|
|:-------------|-----|--------|
|Hash Table|$\theta({\mid S \mid})$|$\theta(1)$|

A hash table starts by allocating an array of $n$ buckets (we will assume $n$ is fixed for simplicity; real implementations can grow the table and rehash the keys, which is an expensive operation for a large set $S$). The next step is to use a hash function $h: U \rightarrow \{0, 1, 2, \ldots, n - 1\}$ to map each key to an index in $[0, n - 1]$, the bucket where the key and its corresponding value are stored.

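Since the hash table is much smaller than $U$, and the set $S$ is in most cases not known beforehand, no fixed mapping can avoid collisions for every choice of $S$. A collision is when two different keys get the same hash code. By the pigeonhole principle, some bucket must receive more than one key from $U$ whenever $\mid U \mid > n$.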
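A small sketch of the mapping and of a forced collision, assuming Python's built-in `hash` as the hash function; the helper name `bucket_index` is illustrative. With 11 keys and only 7 buckets, at least one bucket has to hold more than one key.

```python
def bucket_index(key, n):
    """Map a key to one of n buckets using Python's built-in hash."""
    # In Python, % with a positive n always yields a value in [0, n - 1].
    return hash(key) % n


n = 7
keys = list(range(10, 21))       # 11 keys, more than the 7 buckets
buckets = {}
for key in keys:
    buckets.setdefault(bucket_index(key, n), []).append(key)

for index in sorted(buckets):
    print(index, buckets[index])
```

Small non-negative integers hash to themselves in CPython, so here keys 10 and 17 share bucket 3, 11 and 18 share bucket 4, and so on.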

The birthday paradox is a related and interesting problem: how many people must be in a group before there is a 50% chance that two of them share a birthday? The number is surprisingly low. With approximately 23 people, which is about $1.177 \sqrt{365}$, the chance of two people sharing a birthday already reaches 50%. The proof and explanation can be found at [this](https://betterexplained.com/articles/understanding-the-birthday-paradox/) URL. This is also the answer to quiz 12.3.
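The figure of 23 can be checked directly. The sketch below assumes birthdays are independent and uniformly distributed over 365 days; the function name `collision_probability` is illustrative.

```python
def collision_probability(people, days=365):
    """Probability that at least two of `people` share a birthday,
    assuming birthdays are independent and uniform over `days`."""
    p_all_distinct = 1.0
    for i in range(people):
        # The (i + 1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


people = 1
while collision_probability(people) < 0.5:
    people += 1
print(people, round(collision_probability(people), 4))   # 23 0.5073
```

The same calculation applies to a hash table with `days` buckets: collisions become likely once the number of keys is around the square root of the number of buckets.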

Along similar lines, since a hash table has a limited number of buckets, we expect to see collisions. How do we deal with them? We will look at a couple of ways, starting with an option called chaining.

In chaining, each bucket holds a linked list, and every key that hashes to the bucket is added to that list. Unless most keys end up in the same bucket, which essentially degenerates the table into a single linked list, we should get constant-time complexity for lookup. As an optimization, HashMap in Java converts a bucket's linked list to a tree beyond a threshold, reducing the worst-case lookup time within that bucket to $\theta(\log n)$.


Let's look at a simple Python implementation of chaining.

In [55]:
class Entry:
    # Stores the key, value and next item in the linked list created in the bucket
    #
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.next = None
        
    def __repr__(self):
        # Compare against None explicitly so falsy keys such as 0 still print
        if self.key is not None:
            return '(' + str(self.key) + ' : ' + str(self.value) + ') -> ' + str(self.next)
        else:
            return '(None : None)'
        
        
class ChainedHashTable:  
    
    
    def __init__(self, size = 10):
        # Dummy entry in the bucket
        self.buckets = [Entry(None, None) for _ in range(size)]
        self.size = size
    
    def __bucket__(self, key):
        return abs(hash(key)) % self.size
    
    def put(self, key, value):
        # Puts a key-value pair in the hash table; the key must not be None
        if key is not None:
            # The newest entry becomes the head of the bucket's linked list.
            # This gives constant-time insertion, since no traversal is needed.
            # An earlier entry with the same key stays in the chain but is
            # shadowed, because get() returns the first match from the head.
            bucket = self.__bucket__(key)
            entry = Entry(key, value)
            entry.next = self.buckets[bucket]
            self.buckets[bucket] = entry
        else:
            raise ValueError('Non None key expected')
            
    def get(self, key):
        val = None
        bucket = self.__bucket__(key)
        entry = self.buckets[bucket]
        while entry:
            if key == entry.key:
                val = entry.value
                break
                
            entry = entry.next
        return val

In [56]:
ht = ChainedHashTable(7)
print('Inserting values [10, 20] in ChainedHashTable')
for i in range(10, 21):
    ht.put(i, i)
print('\nBuckets of the ChainedHashTable are')
for entry in ht.buckets:
    print(entry)
print('\nRetrieving values in range [5, 25] from ChainedHashTable')
for k in range(5, 26):
    print('Key:', k, '->', ht.get(k))


Inserting values [10, 20] in ChainedHashTable

Buckets of the ChainedHashTable are
(14 : 14) -> (None : None)
(15 : 15) -> (None : None)
(16 : 16) -> (None : None)
(17 : 17) -> (10 : 10) -> (None : None)
(18 : 18) -> (11 : 11) -> (None : None)
(19 : 19) -> (12 : 12) -> (None : None)
(20 : 20) -> (13 : 13) -> (None : None)

Retrieving values in range [5, 25] from ChainedHashTable
Key: 5 -> None
Key: 6 -> None
Key: 7 -> None
Key: 8 -> None
Key: 9 -> None
Key: 10 -> 10
Key: 11 -> 11
Key: 12 -> 12
Key: 13 -> 13
Key: 14 -> 14
Key: 15 -> 15
Key: 16 -> 16
Key: 17 -> 17
Key: 18 -> 18
Key: 19 -> 19
Key: 20 -> 20
Key: 21 -> None
Key: 22 -> None
Key: 23 -> None
Key: 24 -> None
Key: 25 -> None
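
The `ChainedHashTable` above covers insert and lookup but not delete, and it never removes an older entry for a key that is inserted again. The following is a minimal alternative sketch that adds both behaviors. It swaps the hand-built linked list for plain Python lists as chains, and the class name `ListChainedHashTable` is hypothetical.

```python
class ListChainedHashTable:
    """Chained hash table whose chains are Python lists of [key, value] pairs."""

    def __init__(self, size=10):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def _chain(self, key):
        # Pick the chain for this key; % with a positive size is never negative.
        return self.buckets[hash(key) % self.size]

    def put(self, key, value):
        chain = self._chain(key)
        for pair in chain:
            if pair[0] == key:
                pair[1] = value      # key already present: replace in place
                return
        chain.append([key, value])

    def get(self, key, default=None):
        for k, v in self._chain(key):
            if k == key:
                return v
        return default

    def delete(self, key):
        """Remove key if present; return True when something was removed."""
        chain = self._chain(key)
        for i, pair in enumerate(chain):
            if pair[0] == key:
                del chain[i]
                return True
        return False


ht = ListChainedHashTable(7)
for i in range(10, 21):
    ht.put(i, i)
ht.put(10, "ten")        # replaces the earlier value for key 10
print(ht.get(10))        # ten
print(ht.delete(17))     # True
print(ht.get(17))        # None
print(ht.buckets[3])     # [[10, 'ten']]
```

The trade-off is that `put` now scans the chain to look for an existing key, so insertion costs as much as lookup instead of being a constant-time prepend.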
