Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Note that this Pre-class Work is estimated to take **45 minutes**.

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = "MUHAMMAD ABDURREHMAN ASIF"
COLLABORATORS = ""

---

# CS110 Pre-class Work - Hash tables and hash functions

## Part A. Direct Address Tables [time estimate: 10 minutes]

As the first step in setting up a crossword solving algorithm you need to create 2 direct address tables, one to store all the “up” answers - whether correct or not - and one to store all the “across” answers. Write python code to create a direct address table that allows you to:

1. initialize N empty guesses
2. set a guess for the i-th entry
3. clear an incorrect guess for the i-th entry


In [1]:
def init(N):
    return [None for i in range(N)]   # one empty (None) slot per entry
    
def make_guess(table,entry,guess):
    table[entry] = guess
    
def delete_guess(table,entry):
    table[entry] = None
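
A short usage sketch of the three operations follows. It assumes `init` returns the list it builds; the number of clues and the guesses are made up for illustration.

```python
def init(N):
    # one slot per entry, all initially empty (None)
    return [None for _ in range(N)]

def make_guess(table, entry, guess):
    table[entry] = guess

def delete_guess(table, entry):
    table[entry] = None

across = init(5)                 # hypothetical puzzle with 5 "across" entries
make_guess(across, 2, "hash")    # guess for entry 2
make_guess(across, 4, "table")   # guess for entry 4
delete_guess(across, 2)          # entry 2 turned out to be wrong
print(across)                    # [None, None, None, None, 'table']
```

Each operation touches exactly one slot by its index, which is why all three run in constant time.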
    
    

## Part B. Social Security [time estimate: 3 minutes]

Could we use a direct address table to store a country's entire set of social security numbers (aka id numbers)? Why or why not?


In theory it is possible, but it would be severely inefficient and inappropriate for this kind of problem. A direct address table reserves one slot for every possible key, not only for the keys in use, so a nine-digit number such as a US SSN would need $10^9$ slots no matter how many IDs have actually been issued. A large share of those slots would sit empty, and the table would also need a way to clear the entries of IDs that are no longer valid. The set of issued numbers keeps growing as more people receive SSNs, but the table has to be allocated at full size from the start. Lookups, insertions and deletions would each still take constant time, so speed is not the issue; the cost is memory. The space complexity is proportional to the size of the key universe rather than the number of stored records, which would be extremely burdensome on memory.
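
The space argument can be made concrete with a rough back-of-the-envelope sketch. The nine-digit key format matches US SSNs; the population figure and the 8 bytes per slot are assumptions chosen only for illustration.

```python
KEY_DIGITS = 9                      # US SSNs have nine digits
ASSUMED_POPULATION = 300_000_000    # hypothetical number of issued IDs
ASSUMED_BYTES_PER_SLOT = 8          # assumption: one 64-bit reference per slot

slots = 10 ** KEY_DIGITS            # one slot per possible key
table_bytes = slots * ASSUMED_BYTES_PER_SLOT
used_fraction = ASSUMED_POPULATION / slots

print(f"slots needed: {slots:,}")
print(f"table size:   {table_bytes / 1e9:.0f} GB for the references alone")
print(f"slots in use: {used_fraction:.0%}")
```

Under these assumptions the table occupies the same memory whether one ID or every ID has been issued, and the records themselves come on top of that.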

## Part C. Chained Hash-table [time estimate: 32 minutes]

### Question 1 [time estimate: 7 minutes]

Using the code in the cell below, complete the missing sections of code. You should copy and paste the code in an additional cell and fill in the code there.

In [2]:
import random
import string


def randomword(length):
    return ''.join(random.choice(string.ascii_lowercase) for i in range(length))


def empty_hash_table(N):
    return [[] for n in range(N)]


def add_to_hash_table(hash_table, item, hash_function):
    N = len(hash_table)
    # YOUR CODE HERE
    return hash_table


def contains(hash_table, item, hash_function):
    N = len(hash_table)
    # YOUR CODE HERE
    # return true if the item has already been stored in the hash_table


def remove(hash_table, item, hash_function):
    if not contains(hash_table, item, hash_function):
        raise ValueError()
    # YOUR CODE HERE
    return hash_table


# Hash Functions
def hash_str1(string):
    ans = 0
    for chr in string:
        ans += ord(chr)
    return ans

def hash_str2(string):
    ans=int(ord(string[0]))
    for ix in range(1, len(string)):
        ans = ans ^ ord(string[ix]) 
    return int(bin(ans).split('b')[1])

def hash_str3(string):
    ans = 0
    for chr in string:
        ans = ans * 128 + ord(chr)
    return ans

def hash_str4(string):
    random.seed(ord(string[0]))
    return random.getrandbits(32)

In [3]:
import random
import string


def randomword(length):
    return ''.join(random.choice(string.ascii_lowercase) for i in range(length))


def empty_hash_table(N):
    return [[] for n in range(N)]


def add_to_hash_table(hash_table, item, hash_function):
    N = len(hash_table)
    idx = hash_function(item) % N   # map the hash value to a slot index in [0, N)
    hash_table[idx].append(item)    # chaining: append the item to that slot's list
    return hash_table


def contains(hash_table, item, hash_function):
    N = len(hash_table)
    idx = hash_function(item) % N   # same slot computation as in add_to_hash_table
    # return True if the item has already been stored in the hash_table
    return item in hash_table[idx]


def remove(hash_table, item, hash_function):
    if not contains(hash_table, item, hash_function):
        raise ValueError()
    idx = hash_function(item) % len(hash_table)   # slot that holds the item (must be reduced modulo the table size)
    hash_table[idx].remove(item)                  # delete the first occurrence of the item from the chain
    return hash_table


# Hash Functions
def hash_str1(string):
    ans = 0
    for chr in string:
        ans += ord(chr)
    return ans

def hash_str2(string):
    ans=int(ord(string[0]))
    for ix in range(1, len(string)):
        ans = ans ^ ord(string[ix]) 
    return int(bin(ans).split('b')[1])

def hash_str3(string):
    ans = 0
    for chr in string:
        ans = ans * 128 + ord(chr)
    return ans

def hash_str4(string):
    random.seed(ord(string[0]))
    return random.getrandbits(32)
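
As a quick check of the three table operations, here is a minimal sketch on a tiny table. It assumes the slot index is always the hash value modulo the table size, including inside `remove`; the words and the table size of 7 are made up.

```python
def empty_hash_table(N):
    return [[] for _ in range(N)]

def hash_str1(string):
    return sum(ord(c) for c in string)

def add_to_hash_table(hash_table, item, hash_function):
    idx = hash_function(item) % len(hash_table)
    hash_table[idx].append(item)
    return hash_table

def contains(hash_table, item, hash_function):
    idx = hash_function(item) % len(hash_table)
    return item in hash_table[idx]

def remove(hash_table, item, hash_function):
    if not contains(hash_table, item, hash_function):
        raise ValueError()
    idx = hash_function(item) % len(hash_table)
    hash_table[idx].remove(item)
    return hash_table

table = empty_hash_table(7)
for word in ["ab", "ba", "cat"]:          # "ab" and "ba" have the same character sum
    add_to_hash_table(table, word, hash_str1)

print(contains(table, "ba", hash_str1))   # True
remove(table, "ab", hash_str1)
print(contains(table, "ab", hash_str1))   # False
print(contains(table, "ba", hash_str1))   # True: the colliding word is still stored
```

The table is deliberately smaller than the hash values, so the example would fail with an `IndexError` if any operation skipped the modulo step.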

### Question 2 [time estimate: 2 minutes]

Using the code, create 100,000 words of 10 characters each.

In [4]:
words = []
total_words = 100000
for i in range(total_words):
    words.append(randomword(10))

### Question 3 [time estimate: 2 minutes]

Create four chained hash-tables with 5000 slots.

In [5]:
hash_t1 = empty_hash_table(5000)
hash_t2 = empty_hash_table(5000)
hash_t3 = empty_hash_table(5000)
hash_t4 = empty_hash_table(5000)

### Question 4 [time estimate: 2 minutes]

Store all the words in each chained hash table using each of the different hash functions.

In [6]:
for x in words:
    hash_t1 = add_to_hash_table(hash_t1, x, hash_str1)
    hash_t2 = add_to_hash_table(hash_t2, x, hash_str2)
    hash_t3 = add_to_hash_table(hash_t3, x, hash_str3)
    hash_t4 = add_to_hash_table(hash_t4, x, hash_str4)

### Question 5 [time estimate: 4 minutes]

Measure the number of collisions for each hash function.

In [7]:
collision_t1 = 0
collision_t2 = 0
collision_t3 = 0
collision_t4 = 0

# table 1: count the slots that hold more than one word (i.e. slots with at least one collision)
for i in range(len(hash_t1)):
    if len(hash_t1[i]) > 1:
        collision_t1 += 1
        
# table 2
for i in range(len(hash_t2)):
    if len(hash_t2[i]) > 1:
        collision_t2 += 1

# table 3
for i in range(len(hash_t3)):
    if len(hash_t3[i]) > 1:
        collision_t3 += 1
        
# table 4        
for i in range(len(hash_t4)):
    if len(hash_t4[i]) > 1:
        collision_t4 += 1
        
        
print(collision_t1)
print(collision_t2)
print(collision_t3)
print(collision_t4)

163
16
5000
26
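
The small counts for the second and fourth hash functions appear to follow from how few distinct values those functions can produce for ten lowercase letters. The sketch below checks this on a fresh random sample (seeded for repeatability; the sample size of 2,000 is an arbitrary choice), so it is an illustration rather than a rerun of the cells above.

```python
import random
import string

def hash_str2(s):
    ans = ord(s[0])
    for ch in s[1:]:
        ans ^= ord(ch)
    return int(bin(ans).split('b')[1])   # binary digits re-read as a decimal number

def hash_str4(s):
    random.seed(ord(s[0]))               # depends only on the first character
    return random.getrandbits(32)

N = 5000
# Private generator, so hash_str4 reseeding the global one does not disturb the sampling
rng = random.Random(0)
sample = [''.join(rng.choice(string.ascii_lowercase) for _ in range(10))
          for _ in range(2000)]

slots2 = {hash_str2(w) % N for w in sample}
slots4 = {hash_str4(w) % N for w in sample}
print(len(slots2), len(slots4))          # at most 16 and at most 26
```

Every lowercase letter has a code between 97 and 122, so all of them share the same two top bits, and XOR-ing an even number of letters cancels those bits and leaves a value from 0 to 31. Re-reading the binary digits as a decimal number gives at most 11111, and modulo 5000 the values 10000 to 11111 fold onto 0 to 1111, which leaves 16 possible slots. `hash_str4` can produce at most one value per first letter, hence at most 26 slots.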


### Question 6 [time estimate: 5 minutes]

For each of the hash functions, how many elements are in a bucket on average (if it is not empty)?


In [8]:
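# Approximate average bucket size: the total number of words divided by the number of slots
# that hold more than one word (counted above). Slots holding exactly one word are not
# counted as non-empty here, so this can slightly overestimate the true average.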

elem_1 = total_words/collision_t1
elem_2 = total_words/collision_t2
elem_3 = total_words/collision_t3
elem_4 = total_words/collision_t4

print(round(elem_1))
print(round(elem_2))
print(round(elem_3))
print(round(elem_4))


613
6250
20
3846
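
For comparison, a more direct way to get the average size of the non-empty buckets is to divide the number of stored items by the number of non-empty buckets. The helper below is a sketch; `bucket_stats` is a hypothetical name, and the tiny table is made up so the result can be checked by hand.

```python
def bucket_stats(hash_table):
    sizes = [len(bucket) for bucket in hash_table if bucket]   # non-empty buckets only
    items = sum(sizes)
    non_empty = len(sizes)
    collisions = items - non_empty      # every item beyond the first in a bucket collided
    average = items / non_empty if non_empty else 0.0
    return non_empty, collisions, average

toy_table = [["ab", "ba"], [], ["cat"], ["dog", "god", "odg"]]
print(bucket_stats(toy_table))          # (3, 3, 2.0)
```

Counting collisions as items minus non-empty buckets also separates a bucket with two words from a bucket with two thousand, which a count of "buckets with more than one word" cannot do.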


### Question 7 [time estimate: 5 minutes]

Time how long it takes to find elements that are in each hash table.


In [9]:
import time

t1 = []
t2 = []
t3 = []
t4 = []


for i in words:
    # table 1
    start = time.time()
    contains(hash_t1, i, hash_str1)
    end = time.time()
    t1.append(end - start)
    
    # table 2
    start = time.time()
    contains(hash_t2, i, hash_str2)
    end = time.time()
    t2.append(end - start)
    
    # table 3
    start = time.time()
    contains(hash_t3, i, hash_str3)
    end = time.time()
    t3.append(end - start)
    
    # table 4
    start = time.time()
    contains(hash_t4, i, hash_str4)
    end = time.time()
    t4.append(end - start)
    

print(sum(t1)/len(t1))   # hash table 1
print(sum(t2)/len(t2))   # hash table 2
print(sum(t3)/len(t3))   # hash table 3
print(sum(t4)/len(t4))   # hash table 4


2.9979758262634277e-05
0.00013620682716369628
7.318201065063476e-06
0.00010261783123016357
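
Individual lookups take microseconds, which is close to the resolution that `time.time()` offers on some systems. A possible alternative, sketched here on a small stand-in table rather than the tables above, is to time a whole batch with `time.perf_counter()` and divide by the number of lookups. The hash function `hash_len` is a deliberately simple assumption for this sketch.

```python
import time

def contains(hash_table, item, hash_function):
    idx = hash_function(item) % len(hash_table)
    return item in hash_table[idx]

def hash_len(s):
    # stand-in hash for the sketch: the length of the string
    return len(s)

table = [[] for _ in range(10)]
stored = ["a" * k for k in range(1, 50)]
for w in stored:
    table[hash_len(w) % len(table)].append(w)

start = time.perf_counter()              # monotonic, high-resolution clock
for w in stored:
    contains(table, w, hash_len)
elapsed = time.perf_counter() - start

print(elapsed / len(stored))             # average seconds per lookup
```

Timing the batch once also keeps the overhead of the two clock calls out of each individual measurement.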


### Question 8 [time estimate: 5 minutes]

For each hash table, time how long it takes to find 10,000 elements that have not been stored.

In [10]:
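stored = set(words)   # set membership is O(1) on average, unlike scanning the 100,000-word list
new_words = []
for i in range(10000):
    candidate = randomword(10)
    while candidate in stored:   # redraw until the word is not one of the stored words
        candidate = randomword(10)
    new_words.append(candidate)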

In [11]:
t1 = []
t2 = []
t3 = []
t4 = []


for i in new_words:
    # table 1
    start = time.time()
    contains(hash_t1, i, hash_str1)
    end = time.time()
    t1.append(end - start)
    
    # table 2
    start = time.time()
    contains(hash_t2, i, hash_str2)
    end = time.time()
    t2.append(end - start)
    
    # table 3
    start = time.time()
    contains(hash_t3, i, hash_str3)
    end = time.time()
    t3.append(end - start)
    
    # table 4
    start = time.time()
    contains(hash_t4, i, hash_str4)
    end = time.time()
    t4.append(end - start)
    

print(sum(t1)/len(t1))   # hash table 1
print(sum(t2)/len(t2))   # hash table 2
print(sum(t3)/len(t3))   # hash table 3
print(sum(t4)/len(t4))   # hash table 4


7.34879732131958e-05
0.00031232476234436033
8.565711975097656e-06
0.000228204607963562
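
The unsuccessful lookups above take roughly twice as long as the successful ones for the tables with long chains. One plausible explanation is that `item in bucket` stops at the first match for a stored word but must scan the whole chain for a missing one. The sketch below counts comparisons on a made-up chain to illustrate this; `comparisons` is a hypothetical helper.

```python
def comparisons(bucket, item):
    # number of elements a linear scan inspects before it stops
    for count, stored in enumerate(bucket, start=1):
        if stored == item:
            return count
    return len(bucket)

bucket = [f"w{i}" for i in range(100)]    # one chain with 100 stored words

hits = [comparisons(bucket, w) for w in bucket]
print(sum(hits) / len(hits))              # 50.5 on average for stored words
print(comparisons(bucket, "absent"))      # 100 for a word that is not stored
```

For a chain of length n, a successful search inspects about (n + 1) / 2 elements on average and an unsuccessful one inspects all n, which matches the factor of about two seen in the timings.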
