## Introduction

In this notebook we will look at Hash Tables and Bloom Filters, and implement them in Python to get a concrete understanding.

### Hash Tables

Hash Tables store key-value pairs. The goal is to allow constant-time lookup, insertion and deletion. We are specifically interested in the following three operations:

- Lookup: Given a key, return the corresponding value for the key
- Insert: Given a key, insert (or replace) the corresponding value in the table
- Delete: Given the key, delete the corresponding value

Arrays let us insert, look up and delete by index in constant time. Thus, if we can convert each key to an integer index into the array such that there is a one-to-one relation between keys and indices, we can guarantee constant-time operations at the cost of space.
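
To make the array idea concrete, here is a minimal sketch of such a table, assuming the keys are already integers in the range `0` to `universe_size - 1`. The class name `DirectAddressTable`, its method names and the `universe_size` parameter are illustrative choices, not something defined elsewhere in this notebook.

```python
class DirectAddressTable:
    """Array-backed table in which the key itself is the array index."""

    _EMPTY = object()  # sentinel, so that None can still be stored as a value

    def __init__(self, universe_size):
        # One slot per possible key: space grows with the universe,
        # not with the number of keys actually stored.
        self.slots = [self._EMPTY] * universe_size

    def insert(self, key, value):
        self.slots[key] = value        # a single array write

    def lookup(self, key):
        value = self.slots[key]        # a single array read
        if value is self._EMPTY:
            raise KeyError(key)
        return value

    def delete(self, key):
        self.slots[key] = self._EMPTY  # reset the slot


table = DirectAddressTable(universe_size=100)
table.insert(42, "answer")
print(table.lookup(42))
table.delete(42)
```

Every operation touches exactly one slot, which is where the constant-time guarantee comes from. The price is the size of `slots`.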

---

**Quiz 12.1**

Suppose all our keys are strings of length 25 containing only lowercase English letters. Then there are $26^{25}$ possible keys, and an array that large is impractical.

---
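
As a quick sanity check of the size in the quiz, the count can be computed directly. $26^{25}$ is roughly $2.4 \times 10^{35}$. The one-byte-per-slot figure below is only an assumption chosen to give a lower bound on memory.

```python
num_keys = 26 ** 25  # one array slot per possible 25-letter lowercase key

print(f"possible keys: {num_keys:.3e}")

# Assume (optimistically) that each slot needs just one byte.
bytes_per_tib = 2 ** 40
print(f"memory at 1 byte per slot: {num_keys / bytes_per_tib:.3e} TiB")
```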

Continuing with the above case, let U be the universal set containing all possible strings of length 25, and let S be the subset of keys we intend to insert in the Hash Table. We saw that creating one large array for all possible keys in the universe U is prohibitive. We thus need a structure that consumes linear space, on the order of $\mid S \mid$, and still achieves $\Theta(1)$ complexity for all operations, at least on average.
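
One common way to get there is a hash table with separate chaining. The sketch below is a minimal illustration under some assumptions: the class name `ChainedHashTable`, the initial bucket count and the resize threshold are arbitrary choices, and Python's built-in `hash()` stands in for the hash function. Operations take constant time on average only when the hash function spreads the keys evenly; in the worst case a single bucket holds every key and operations degrade to linear time.

```python
class ChainedHashTable:
    """Hash table using separate chaining: each bucket is a list of pairs."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0

    def _bucket(self, key):
        # hash() maps the key to an integer; the modulo folds that
        # integer into the range of valid bucket indices.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # key already present: replace
                return
        bucket.append((key, value))
        self.size += 1
        # Keep chains short: grow once there are more than 2 items per bucket.
        if self.size > 2 * len(self.buckets):
            self._resize()

    def lookup(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self.size -= 1
                return
        raise KeyError(key)

    def _resize(self):
        # Double the number of buckets and redistribute every pair.
        items = [item for bucket in self.buckets for item in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        for k, v in items:
            self._bucket(k).append((k, v))


table = ChainedHashTable()
table.insert("apple", 1)
table.insert("apple", 2)  # replaces the earlier value
print(table.lookup("apple"))
```

The space used is proportional to the number of stored keys, not to the size of the universe U.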

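Let us now implement the 2-Sum problem (given an array and a target sum, decide whether two distinct elements of the array add up to the target) in three ways:

- 1: Naive: for each number, linearly scan the array for its complement, with complexity $\Theta(n^2)$
- 2: Sort and binary search, with complexity $\Theta(n \log n)$
- 3: Use a hash-based set for lookup, with complexity $\Theta(n)$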



In [6]:
def naive_2sum(arr, expected_sum):
    for i, n1 in enumerate(arr):
        expectedn2 = expected_sum - n1
        for idx in range(i):
            n2 = arr[idx]
            if n2 == expectedn2:
                return "yes"
            
    return "no"

def better_2sum(arr, expected_sum):
    import bisect
    # Sort a copy so the caller's list is not reordered as a side effect
    arr = sorted(arr)
    
    # Binary search the array to see if the provided number exists in arr
    # runs in O(log n) time
    def is_present(num, max_idx):
        idx = bisect.bisect_left(arr, num, lo = 0, hi = max_idx)
        return idx != max_idx and arr[idx] == num
    
    for idx, n1 in enumerate(arr):
        expectedn2 = expected_sum - n1
        # Search only the numbers before the current index, excluding the
        # current one, so that an element is never paired with itself
        if is_present(expectedn2, idx):
            return "yes"
        
    return "no"

def optimal_2sum(arr, expected_sum):
    s = set()
    # A set can be viewed as a Hash Table with no associated values; we are only interested in the keys
    for n1 in arr:
        expectedn2 = expected_sum - n1 
        if expectedn2 in s:
            return "yes"
        s.add(n1)
    
    return "no"

Let's test the above functions.

In [7]:
arr  = [2, 2, 1, 5, 3, 9]

# Should return yes: exactly one pair matches, 1 + 9 = 10
print('(naive_2sum, better_2sum, optimal_2sum)',
      naive_2sum(arr, 10),
      better_2sum(arr, 10),
      optimal_2sum(arr, 10))

# Should return yes: two pairs match, 2 + 2 and 1 + 3 both add to 4
print('(naive_2sum, better_2sum, optimal_2sum)',
      naive_2sum(arr, 4),
      better_2sum(arr, 4),
      optimal_2sum(arr, 4))

# Should return no: no two numbers sum to 100
print('(naive_2sum, better_2sum, optimal_2sum)',
      naive_2sum(arr, 100),
      better_2sum(arr, 100),
      optimal_2sum(arr, 100))

# Should return no: 18 would need 9 + 9, but 9 appears only once
print('(naive_2sum, better_2sum, optimal_2sum)',
      naive_2sum(arr, 18),
      better_2sum(arr, 18),
      optimal_2sum(arr, 18))


(naive_2sum, better_2sum, optimal_2sum) yes yes yes
(naive_2sum, better_2sum, optimal_2sum) yes yes yes
(naive_2sum, better_2sum, optimal_2sum) no no no
(naive_2sum, better_2sum, optimal_2sum) no no no


This implementation looks OK. Let's write a function that reads the lines of a file and returns them as a list of integers, and a function that tests an implementation on a given file.

In [8]:
def load_(file):
    with open(file) as f:
        # Iterate over the file line by line and skip blank lines
        return [int(line) for line in f if line.strip()]

    
def test(file, interval_start, interval_end, two_sum_impl):
    arr = load_(file)
    # Count the targets in [interval_start, interval_end], inclusive, that have a matching pair
    return sum(1 for target in range(interval_start, interval_end + 1) if two_sum_impl(arr, target) == 'yes')

        

In [9]:
# Try out all functions
import time
for two_sum_impl in [naive_2sum, better_2sum, optimal_2sum]:
    # perf_counter() is a monotonic clock intended for measuring elapsed time
    start = time.perf_counter()
    res = test('problem12.4test.txt', 3, 10, two_sum_impl)
    elapsed_ms = int((time.perf_counter() - start) * 1000)
    print('Result from', two_sum_impl.__name__, 'returned', res, 'in', elapsed_ms, 'ms')

Result from naive_2sum returned 8 in 3 ms
Result from better_2sum returned 8 in 1 ms
Result from optimal_2sum returned 8 in 1 ms


---

OK, the above results look as expected. Let's try the big file now.

---

In [10]:
# Only run the linear-time solution on the big file
import time

start = int(time.time() * 1000)
res = test('problem12.4.txt', -10000, 10000, optimal_2sum)
end = int(time.time() * 1000)
print('Result from optimal_2sum returned', res, 'in', (end - start), 'ms')

Result from optimal_2sum returned 427 in 7018253 ms


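Even the linear-time solution took close to 2 hours (7,018,253 ms), because `test` runs a full linear pass over the array once for each of the 20,001 targets in [-10000, 10000]. Hopefully the result, 427, is correct.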
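
One inexpensive tweak, offered as an untested sketch and not as a measured fix, is to build the hash-based index once and reuse it for every target, instead of rebuilding a set inside each call. The function name `count_targets` is hypothetical. It uses `collections.Counter` so that a pair of equal values (such as 2 + 2) is accepted only when the value really occurs at least twice, matching the behaviour of the functions above. The total work is still proportional to the number of targets times the number of distinct values, and it has not been run on `problem12.4.txt`, so no speedup is claimed here.

```python
from collections import Counter


def count_targets(arr, interval_start, interval_end):
    """Count targets in [interval_start, interval_end] reachable as a sum
    of two distinct elements of arr (equal values allowed if repeated)."""
    counts = Counter(arr)  # built once and shared by all targets
    found = 0
    for target in range(interval_start, interval_end + 1):
        for x, occurrences in counts.items():
            y = target - x
            # y must exist; if y == x we need a second copy of that value
            if y in counts and (y != x or occurrences >= 2):
                found += 1
                break  # one matching pair is enough for this target
    return found


print(count_targets([2, 2, 1, 5, 3, 9], 3, 10))
```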