Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Note that this Pre-class Work is estimated to take **45 minutes**.

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Enjui Chang"
COLLABORATORS = ""

---

# CS110 Pre-class Work - Hash tables and hash functions

## Part A. Direct Address Tables [time estimate: 10 minutes]

As the first step in setting up a crossword-solving algorithm, you need to create two direct address tables: one to store all the “up” answers (whether correct or not) and one to store all the “across” answers. Write Python code to create a direct address table that allows you to:

1. initialize N empty guesses
2. set a guess for the i-th entry
3. clear an incorrect guess for the i-th entry


In [15]:
# create two direct address tables
up = []
across = []

# initialize N empty guesses
def initialize_guess(N, table):
    # slice assignment updates the caller's list in place;
    # a plain `table = ...` would only rebind the local name
    table[:] = [None for i in range(N)]
    return table

# set a guess for the i-th entry
def set_guesses(table, guess, i):
    table[i] = guess
    return table

# clear an incorrect guess for the i-th entry
def clear_incorrect(table, answer, i):
    # clear the guess if incorrect
    if table[i] != answer:
        table[i] = None
    return table
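
As a quick check of the three operations, the sketch below restates them in a compact, self-contained form and runs them on a small table. The helper names (`make_table`, `set_guess`, `clear_if_incorrect`) and the sample answers are illustrative assumptions, not part of the assignment.

```python
# Minimal sketch of a direct address table for crossword guesses.
# Helper names and sample data are hypothetical.

def make_table(n):
    """Return a table with n empty slots (None means 'no guess')."""
    return [None] * n

def set_guess(table, i, guess):
    """Store a guess in slot i; the clue number acts as the key."""
    table[i] = guess

def clear_if_incorrect(table, i, answer):
    """Empty slot i when its guess does not match the answer."""
    if table[i] != answer:
        table[i] = None

across = make_table(5)
set_guess(across, 2, "hash")
set_guess(across, 4, "tree")
clear_if_incorrect(across, 2, "hash")   # correct guess, kept
clear_if_incorrect(across, 4, "heap")   # wrong guess, cleared
print(across)  # [None, None, 'hash', None, None]
```

Every operation indexes the list directly, so each one takes constant time regardless of the table size.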



## Part B. Social Security [time estimate: 3 minutes]

Could we use a direct address table to store a country's entire set of social security numbers (aka id numbers)? Why or why not?


It is possible, but it would be very memory-inefficient. A direct address table needs one slot for every possible key, not only for the keys that are actually in use, so its space complexity is $O(|U|)$, where $U$ is the universe of possible social security numbers. For example, 9-digit numbers would require $10^9$ slots no matter how many numbers have been issued. Speed is not the problem: insertion, deletion, and search each index the table directly and take $O(1)$ time.
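
To put a rough number on the memory cost, the arithmetic below assumes 9-digit ID numbers and 8 bytes per slot (the size of one reference in a list on a 64-bit Python build). Both figures are assumptions for illustration, and `issued` is a made-up placeholder for the count of numbers in use.

```python
# Back-of-the-envelope estimate; every input here is an assumption.
DIGITS = 9                 # assumed length of an ID number
BYTES_PER_SLOT = 8         # one reference in a list on a 64-bit build
issued = 300_000_000       # hypothetical count of numbers in use

slots = 10 ** DIGITS       # one slot per possible key
table_bytes = slots * BYTES_PER_SLOT
used_fraction = issued / slots

print(f"slots needed: {slots:,}")                     # 1,000,000,000
print(f"table size:   {table_bytes / 1e9:.0f} GB")    # 8 GB
print(f"slots in use: {used_fraction:.0%}")           # 30%
```

The table size depends only on the key length, so most of the slots stay empty whenever the issued numbers are a small share of the possible ones.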

## Part C. Chained Hash-table [time estimate: 32 minutes]

### Question 1 [time estimate: 7 minutes]

Complete the missing sections of the code in the cell below. Copy and paste the code into an additional cell and fill in the code there.

In [3]:
import random
import string


def randomword(length):
    return ''.join(random.choice(string.ascii_lowercase) for i in range(length))


def empty_hash_table(N):
    return [[] for n in range(N)]


def add_to_hash_table(hash_table, item, hash_function):
    N = len(hash_table)
    # YOUR CODE HERE
    return hash_table


def contains(hash_table, item, hash_function):
    N = len(hash_table)
    # YOUR CODE HERE
    # return true if the item has already been stored in the hash_table


def remove(hash_table, item, hash_function):
    if not contains(hash_table, item, hash_function):
        raise ValueError()
    # YOUR CODE HERE
    return hash_table


# Hash Functions
def hash_str1(string):
    ans = 0
    for chr in string:
        ans += ord(chr)
    return ans

def hash_str2(string):
    ans=int(ord(string[0]))
    for ix in range(1, len(string)):
        ans = ans ^ ord(string[ix]) 
    return int(bin(ans).split('b')[1])

def hash_str3(string):
    ans = 0
    for chr in string:
        ans = ans * 128 + ord(chr)
    return ans

def hash_str4(string):
    random.seed(ord(string[0]))
    return random.getrandbits(32)

In [4]:
import random
import string


def randomword(length):
    return ''.join(random.choice(string.ascii_lowercase) for i in range(length))


def empty_hash_table(N):
    return [[] for n in range(N)]

# add new item to the hash table
def add_to_hash_table(hash_table, item, hash_function):
    N = len(hash_table)
    hash_table[hash_function(item)%N].append(item)
    
    return hash_table

# search to find an element
def contains(hash_table, item, hash_function):
    N = len(hash_table)
    
    # return true if the item has already been stored in the hash_table
    for i in hash_table[hash_function(item)%N]:
        if i == item:
            return True
    return False
    
# remove the element
def remove(hash_table, item, hash_function):
    N = len(hash_table)
    if not contains(hash_table, item, hash_function):
        raise ValueError()
    for i in range(len(hash_table[hash_function(item)%N])):
        if hash_table[hash_function(item)%N][i] == item:
            del hash_table[hash_function(item)%N][i]
            break
    return hash_table


# Hash Functions
def hash_str1(string):
    ans = 0
    for chr in string:
        ans += ord(chr)
    return ans

def hash_str2(string):
    ans=int(ord(string[0]))
    for ix in range(1, len(string)):
        ans = ans ^ ord(string[ix]) 
    return int(bin(ans).split('b')[1])

def hash_str3(string):
    ans = 0
    for chr in string:
        ans = ans * 128 + ord(chr)
    return ans

def hash_str4(string):
    random.seed(ord(string[0]))
    return random.getrandbits(32)
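
A small usage check for the three operations, assuming a 4-slot table and a handful of sample words. The functions are restated in compact form so the cell runs on its own; `hash_sum` follows the same idea as `hash_str1`, and the names and sample words are illustrative.

```python
def empty_table(n):
    return [[] for _ in range(n)]

def hash_sum(s):
    # same idea as hash_str1: add up the character codes
    return sum(ord(c) for c in s)

def add(table, item, h):
    table[h(item) % len(table)].append(item)

def contains(table, item, h):
    return item in table[h(item) % len(table)]

def remove(table, item, h):
    if not contains(table, item, h):
        raise ValueError(item)
    table[h(item) % len(table)].remove(item)

table = empty_table(4)
for word in ["listen", "silent", "hash"]:
    add(table, word, hash_sum)

# anagrams have equal character sums, so they always share a bucket
same_bucket = hash_sum("listen") % 4 == hash_sum("silent") % 4
remove(table, "listen", hash_sum)
print(same_bucket,
      contains(table, "silent", hash_sum),
      contains(table, "listen", hash_sum))  # True True False
```

Removing one word leaves the other word in the same chain untouched, which is the behavior chaining relies on.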

### Question 2 [time estimate: 2 minutes]

Using the code, create 100,000 words of 10 characters each.

In [5]:
# create 100,000 words of 10 characters each
words = []
for i in range(100000):
    words.append(randomword(10))

### Question 3 [time estimate: 2 minutes]

Create four chained hash-tables with 5000 slots.

In [6]:
# create empty hash tables
table1 = empty_hash_table(5000)
table2 = empty_hash_table(5000)
table3 = empty_hash_table(5000)
table4 = empty_hash_table(5000)

### Question 4 [time estimate: 2 minutes]

Store all the words in each chained hash table using each of the different hash functions.

In [7]:
# add the words into the hash table
for i in words:
    add_to_hash_table(table1,i,hash_str1)
    add_to_hash_table(table2,i,hash_str2)
    add_to_hash_table(table3,i,hash_str3)
    add_to_hash_table(table4,i,hash_str4)

### Question 5 [time estimate: 4 minutes]

Measure the number of collisions for each hash function.

In [8]:
# initialize storage
num_col_1 = 0
num_col_2 = 0
num_col_3 = 0
num_col_4 = 0

# run through each hash table and
# add the number of collisions (length-1) in each bucket
# that holds more than one element

for i in range(len(table1)):
    if len(table1[i])>1:
        num_col_1+= len(table1[i])-1
        
for i in range(len(table2)):
    if len(table2[i])>1:
        num_col_2+= len(table2[i])-1
        
for i in range(len(table3)):
    if len(table3[i])>1:
        num_col_3+= len(table3[i])-1
        
for i in range(len(table4)):
    if len(table4[i])>1:
        num_col_4+= len(table4[i])-1

# print the result
print("Number of collision for hash function 1:",num_col_1)
print("Number of collision for hash function 2:",num_col_2)
print("Number of collision for hash function 3:",num_col_3)
print("Number of collision for hash function 4:",num_col_4)

Number of collision for hash function 1: 99819
Number of collision for hash function 2: 99984
Number of collision for hash function 3: 95000
Number of collision for hash function 4: 99974
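
The four loops above follow the same pattern, so the count can be wrapped in one helper. This is a sketch of that refactoring; `count_collisions` is a hypothetical name and the small table is made-up sample data. It uses the same definition as above: a bucket holding k items contributes k - 1 collisions.

```python
def count_collisions(table):
    """Sum (bucket size - 1) over all non-empty buckets."""
    return sum(len(bucket) - 1 for bucket in table if bucket)

sample = [["a", "b", "c"], [], ["d"], ["e", "f"]]
print(count_collisions(sample))  # 2 + 0 + 1 = 3
```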


### Question 6 [time estimate: 5 minutes]

For each of the hash functions, how many elements are in a bucket on average (if it is not empty)?


In [9]:
# initialize storage
bucket1 = []
bucket2 = []
bucket3 = []
bucket4 = []

# run through each hash table and append the number of elements in each bucket
for i in range(len(table1)):
    if len(table1[i])>0:
        bucket1.append(len(table1[i]))
        
for i in range(len(table2)):
    if len(table2[i])>0:
        bucket2.append(len(table2[i]))
        
for i in range(len(table3)):
    if len(table3[i])>0:
        bucket3.append(len(table3[i]))
        
for i in range(len(table4)):
    if len(table4[i])>0:
        bucket4.append(len(table4[i]))

# divide the total number of elements by the number of non-empty buckets
avg_bucket1 = sum(bucket1)/len(bucket1)
avg_bucket2 = sum(bucket2)/len(bucket2)
avg_bucket3 = sum(bucket3)/len(bucket3)
avg_bucket4 = sum(bucket4)/len(bucket4)

# print the result
print("Average elements in each bucket for hash function 1:",avg_bucket1)
print("Average elements in each bucket for hash function 2:",avg_bucket2)
print("Average elements in each bucket for hash function 3:",avg_bucket3)
print("Average elements in each bucket for hash function 4:",avg_bucket4)

Average elements in each bucket for hash function 1: 552.4861878453039
Average elements in each bucket for hash function 2: 6250.0
Average elements in each bucket for hash function 3: 20.0
Average elements in each bucket for hash function 4: 3846.153846153846
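
One way to see where these averages come from is to count how many distinct buckets each hash function can reach. The sketch below restates the four hash functions from Question 1 and applies them to a fresh sample of random 10-letter lowercase words; the sample size, the seed, and the helper `reachable_buckets` are assumptions for illustration. Because `hash_str4` seeds the generator with only the first character, it can reach at most 26 buckets for lowercase words, which is consistent with the average of 100000/26 ≈ 3846 reported above.

```python
import random
import string

def hash_str1(s):
    return sum(ord(c) for c in s)

def hash_str2(s):
    ans = ord(s[0])
    for c in s[1:]:
        ans ^= ord(c)
    return int(bin(ans).split('b')[1])

def hash_str3(s):
    ans = 0
    for c in s:
        ans = ans * 128 + ord(c)
    return ans

def hash_str4(s):
    random.seed(ord(s[0]))
    return random.getrandbits(32)

# Local generator, so hash_str4's reseeding of the global one has no effect
rng = random.Random(0)
sample_words = ["".join(rng.choice(string.ascii_lowercase) for _ in range(10))
                for _ in range(20000)]

def reachable_buckets(words, h, n_slots=5000):
    """Number of distinct bucket indices that h produces for these words."""
    return len({h(w) % n_slots for w in words})

counts = {h.__name__: reachable_buckets(sample_words, h)
          for h in (hash_str1, hash_str2, hash_str3, hash_str4)}
print(counts)
```

A hash function that reaches only a few buckets forces long chains, however many slots the table has.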


### Question 7 [time estimate: 5 minutes]

Time how long it takes to find elements that are in each hash table.


In [16]:
import time

# initialize storage for time
time1 = []
time2 = []
time3 = []
time4 = []

# measure the search time for each hash table 
for i in words:
    
    # hash table 1
    start = time.time() 
    contains(table1, i, hash_str1) # call the search function
    end = time.time()
    time1.append(end-start) # append the time to storage
    
    # hash table 2
    start = time.time() 
    contains(table2, i, hash_str2) # call the search function
    end = time.time()
    time2.append(end-start) # append the time to storage
    
    # hash table 3
    start = time.time() 
    contains(table3, i, hash_str3) # call the search function
    end = time.time()
    time3.append(end-start) # append the time to the list
    
    # hash table 4
    start = time.time() 
    contains(table4, i, hash_str4) # call the search function
    end = time.time()
    time4.append(end-start) # append the time to storage

# print the result
print("Search time for hash table 1:",sum(time1)/len(time1),"sec")
print("Search time for hash table 2:",sum(time2)/len(time2),"sec")
print("Search time for hash table 3:",sum(time3)/len(time3),"sec")
print("Search time for hash table 4:",sum(time4)/len(time4),"sec")
    

Search time for hash table 1: 0.0001265348482131958 sec
Search time for hash table 2: 0.0005671722221374511 sec
Search time for hash table 3: 9.714887142181397e-06 sec
Search time for hash table 4: 0.0003960433793067932 sec
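
`time.time()` can have coarse resolution for operations that take only a few microseconds. A possible alternative, sketched below, times the whole batch of lookups with `time.perf_counter()` and divides by the number of lookups. The helper `average_lookup_time`, the hash function `hash_sum`, and the tiny sample table are hypothetical, and the timings it prints will differ from the figures above.

```python
import time

def contains(table, item, h):
    return item in table[h(item) % len(table)]

def average_lookup_time(table, items, h):
    """Average seconds per lookup, timed over the whole batch."""
    start = time.perf_counter()
    for item in items:
        contains(table, item, h)
    return (time.perf_counter() - start) / len(items)

def hash_sum(s):
    return sum(ord(c) for c in s)

items = ["alpha", "beta", "gamma", "delta"]
table = [[] for _ in range(8)]
for item in items:
    table[hash_sum(item) % len(table)].append(item)

avg = average_lookup_time(table, items * 1000, hash_sum)
print(f"{avg:.2e} sec per lookup")
```

Timing the batch as a whole also keeps the overhead of the clock calls out of each individual measurement.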


### Question 8 [time estimate: 5 minutes]

For each hash table, time how long it takes to find 10,000 elements that have not been stored.

In [20]:
# create storage for words that are not stored
not_words = []

# a set gives fast membership tests against the stored words
stored_words = set(words)

# generate 10,000 words that are not stored
for i in range(10000):
    new_word = randomword(10)
    while new_word in stored_words: # if new_word is stored, generate another word
        new_word = randomword(10)
    not_words.append(new_word)

# initialize storage for time
time1 = []
time2 = []
time3 = []
time4 = []

# measure the search time for each hash table 
for i in not_words:
    
    # hash table 1
    start = time.time() 
    contains(table1, i, hash_str1) # call the search function
    end = time.time()
    time1.append(end-start) # append the time to storage
    
    # hash table 2
    start = time.time() 
    contains(table2, i, hash_str2) # call the search function
    end = time.time()
    time2.append(end-start) # append the time to storage
    
    # hash table 3
    start = time.time() 
    contains(table3, i, hash_str3) # call the search function
    end = time.time()
    time3.append(end-start) # append the time to the list
    
    # hash table 4
    start = time.time() 
    contains(table4, i, hash_str4) # call the search function
    end = time.time()
    time4.append(end-start) # append the time to storage

# print the result
print("Search time for hash table 1:",sum(time1)/len(time1),"sec")
print("Search time for hash table 2:",sum(time2)/len(time2),"sec")
print("Search time for hash table 3:",sum(time3)/len(time3),"sec")
print("Search time for hash table 4:",sum(time4)/len(time4),"sec")

Search time for hash table 1: 0.0001397992253303528 sec
Search time for hash table 2: 0.000615766475200653 sec
Search time for hash table 3: 1.1469275951385498e-05 sec
Search time for hash table 4: 0.000426600329875946 sec
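
For reference, the standard analysis of chaining expresses the expected search cost in terms of the load factor $\alpha$. It relies on the simple uniform hashing assumption, an idealization that, judging by the bucket counts in Question 6, only `hash_str3` appears to approximate here. With $n = 100000$ stored words and $m = 5000$ slots:

$$\alpha = \frac{n}{m} = \frac{100000}{5000} = 20, \qquad \text{expected search time} = \Theta(1 + \alpha)$$

An unsuccessful search has to scan an entire chain, so under this assumption it examines about $\alpha = 20$ elements on average, which agrees with the average bucket size of 20.0 measured for hash function 3.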
