## Assignment 4

For this task, first download the file `students.txt` [1] from MyLearn and place it in your Jupyter environment, in the same directory as this notebook. The following code block reads the data into a list (a dynamic data structure):

[1] See https://matrikel.adbk.de/matrikel

In [3]:
import random
import math

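# Read all lines; the with-block closes the file afterwards
with open('students.txt', 'r') as file:
    raw = file.readlines()

# Keep the first four semicolon-separated fields of each line
data = []
for entry in raw:
    e = entry.split(';')
    data.append([e[0], e[1], e[2], e[3]])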

The following lines provide a class `Student`. A `Student` is described by a name, the year of enrolment (`registrationYear`), a consecutive enrolment number (`enrollmentNr`), and the field of study (`major`).

In [4]:
class Student:
    def __init__(self, rY, eNr, n, m):
        self.registrationYear = rY
        self.enrollmentNr = eNr
        self.name = n
        self.major = m

## Task 1

Implement the function `id()`, which returns a unique identifier for a student. This identifier is a hash code computed from the student's name.

* To compute the hash code, implement the Java hash code function for strings from the lecture slides. 
* The resulting integer value is to be returned.
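
To make the computation concrete, here is a minimal sketch of the base-31 polynomial hash on plain strings. `poly_hash` and `java_hash` are hypothetical names used only for this illustration. The first variant lets Python integers grow without bound; the second wraps the result into the signed 32-bit range, which is what Java's `int` arithmetic does. Which of the two the lecture slides intend is an assumption left open here.

```python
def poly_hash(text):
    """Base-31 polynomial hash: s[0]*31^(n-1) + ... + s[n-1], unbounded."""
    hc = 0
    for ch in text:
        hc = 31 * hc + ord(ch)
    return hc


def java_hash(text):
    """Same polynomial, wrapped to a signed 32-bit value as Java's int does.

    Assumes characters from the Basic Multilingual Plane, where ord()
    equals the UTF-16 code unit that Java works with.
    """
    hc = 0
    for ch in text:
        hc = (31 * hc + ord(ch)) & 0xFFFFFFFF  # keep the lowest 32 bits
    # Reinterpret the unsigned 32-bit value as signed
    return hc - 2**32 if hc >= 2**31 else hc


print(poly_hash("ab"))  # 97 * 31 + 98 = 3105
print(java_hash("ab"))  # 3105, short strings do not overflow
```

For short strings both variants agree; they only diverge once the polynomial exceeds the 32-bit range.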

In [5]:
def id(student):
    hc = 0

    for ch in student.name:
        hc = 31 * hc + ord(ch)

    return int(hc)


In [6]:
# DO NOT ALTER OR DELETE THIS CELL! 
example = Student("1932", "00011", "Max Mustermann", "Malerei")

In [7]:
id(example)

1959631739172568543302

## Task 2a

Complete the function `N()`. The task of `N()` is to determine the size of a hash table based on the following "rules of thumb": 

- The basis of the calculation is the expected maximum number of keys in the hash table, `expected`.
- The size should exceed this number by a factor between 1.3 (minimum) and 1.4 (maximum).
- The size should be a prime number. Use `isPrime()` for the implementation of `N()`.
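
As a point of comparison, the rules above can also be applied deterministically by scanning the allowed interval for its first prime. The following is only a sketch under that assumption, not the expected solution: `table_size` and `is_prime` are hypothetical names chosen for this illustration, and the bounds use integer arithmetic to avoid floating-point rounding.

```python
def is_prime(n):
    """Trial division; hypothetical helper for this sketch."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def table_size(expected):
    """Return the smallest prime in [1.3 * expected, 1.4 * expected], or None."""
    lower = -(-13 * expected // 10)  # ceil(1.3 * expected)
    upper = 14 * expected // 10      # floor(1.4 * expected)
    for candidate in range(lower, upper + 1):
        if is_prime(candidate):
            return candidate
    return None  # no prime in the interval, e.g. for expected <= 3


print(table_size(100))  # 131
print(table_size(3))    # None
```

Unlike a randomized search, this variant always returns the same size for the same input, which makes results easier to reproduce.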

In [8]:
def isPrime(n):
    if n < 2:  # 0 and 1 are not prime
        return False
    i = 2
    while i * i <= n:  # trial division up to sqrt(n)
        if n % i == 0:
            return False
        i += 1
    return True

In [9]:
def N(expected):
    # For expected <= 3 no integer between 1.3 * expected and 1.4 * expected
    # is prime (the only candidate is 4), and the sampling below would not catch that.
    if expected <= 3:
        return False
    # The random factor is rounded to one digit fewer than `expected` has;
    # note that for a single-digit `expected` this rounds the factor to 1.0.
    digits = len(str(expected)) - 1
    counter = 0
    while True:
        possible_size = int(expected * round(random.uniform(1.3, 1.4), digits))
        counter += 1
        if isPrime(possible_size):
            return possible_size
        if counter > 100000:  # give up instead of looping forever if no prime is found
            return False

In [10]:
N(100000)

131041

In [11]:
# DO NOT ALTER OR DELETE THIS CELL! 
result = N(100)
assert result == 131 or result == 137 or result == 139

## Task 2b

Complete the function `add`. The task of `add` is to insert a `student` into a hash table based on the student's `id()`. `add` should use separate chaining (Verkettung) to handle collisions.

- To do this, use the function `id()` to get a hash code.
- Calculate the position in the array based on this hashcode.
- Return `True` if the student was successfully inserted, `False` if it was already included.
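
The position calculation and the chaining step can be illustrated on a toy table. This is a minimal sketch that uses made-up integer hash codes in place of `Student` objects and an assumed table size of 7.

```python
# Toy table with 7 slots; None marks an empty slot
table = [None] * 7

# Made-up hash codes: 12 and 19 collide because 12 % 7 == 19 % 7 == 5
for code in (12, 19, 3):
    idx = code % len(table)   # position in the array
    if table[idx] is None:    # first key in this slot: start a new chain
        table[idx] = []
    table[idx].append(code)   # colliding keys share the chain

print(table[5])  # [12, 19]
print(table[3])  # [3]
```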

In [12]:
def contains(book, s):
    id_ = id(s)
    idx = id_ % len(book)  # slot the student would live in

    # Walk the chain in that slot, if there is one
    if book[idx] is not None:
        for i in book[idx]:
            if id(i) == id_:
                return True

    return False


def add(book, s):
    if contains(book, s):
        return False

    id_ = id(s)
    idx = id_ % len(book)

    # Create the chain on first use, then append to it
    if book[idx] is None:
        book[idx] = []
    book[idx].append(s)

    return True

In [13]:
# DO NOT ALTER OR DELETE THIS CELL!
example = Student("1932", "00011", "Max Mustermann", "Malerei")
test = [None] * N(5)  # we use None to indicate empty positions in the hash table
assert add(test, example) == True
index = id(example) % len(test)
assert example == test[index][0]


In the following code, the hash table `book` is created and filled.
- The size of the hash table is determined by using the function `N()`. At the beginning, the individual positions are filled with `None`.
- The function `add` is called for all entries in `data` (i.e. the data from `students.txt`).

In [14]:
book = [None] * N(len(data))
for entry in data:
    if entry != None and len(entry) == 4:  # test if the current entry is a valid student
        student = Student(entry[0].strip(), entry[1].strip(), entry[2].strip(), entry[3].strip())
        add(book, student)

## Task 2c

Complete the function `get`.
- The return value is the `student` being searched for.
- If the student cannot be found in the hash table that was passed in, `None` is returned.

In [15]:
def get(HT, sId):
    idx = sId % len(HT)  # slot the id maps to

    # Search the chain in that slot, if there is one
    if HT[idx] is not None:
        for student in HT[idx]:
            if id(student) == sId:
                return student

    return None

In [16]:
# DO NOT ALTER OR DELETE THIS CELL!

test_student = Student("1820", "00569", "Franz Fidel Herz", "Malerei")
test_id = id(test_student)
assert get(book, test_id).name == "Franz Fidel Herz"

# test a random student which should not be included
test_student = Student("1820", "00569", "Franz Fidel Hofer", "Malerei")
assert get(book, id(test_student)) == None


## Task 2d

Compute the **load factor** of the hash table. The load factor is defined as `n_used_slots / n_total_slots`.

- Complete the function `get_load_factor()`
- The return value is a float.

Note:
* A load factor of 1 means that no slot is left empty, which is the ideal situation for a well-implemented hash table using separate chaining.
* A load factor < 1 means that there are empty slots. If items had to be added to a list in another slot instead, list traversals become longer and some memory is wasted.
* For instance, if all items collide and occupy a single slot, the effective load factor would be 0.01 (for a table size of 100) and performance would suffer heavily.
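
The two extremes from the note can be checked on toy tables. This is a small sketch using the definition above (`n_used_slots / n_total_slots`); `used_slot_ratio` and the toy contents are hypothetical and stand in for the real hash table.

```python
def used_slot_ratio(table):
    """Fraction of slots that hold a chain (i.e. are not None)."""
    used = sum(1 for slot in table if slot is not None)
    return used / len(table)


# Worst case: every item collides into slot 0 of a 100-slot table
crowded = [None] * 100
crowded[0] = ["a", "b", "c"]

# Best case: every slot holds exactly one item
spread = [[i] for i in range(100)]

print(used_slot_ratio(crowded))  # 0.01
print(used_slot_ratio(spread))   # 1.0
```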

In [17]:
def get_load_factor(book):
    n_used_slots = 0
    for slot in book:  # each slot is either None or a list of students
        if slot is not None:
            n_used_slots += 1

    return n_used_slots / len(book)  # true division already yields a float

In [18]:
# DO NOT ALTER OR DELETE THIS CELL!

# Load factor calculated
book = [None] * N(len(data))
for entry in data:
    if entry != None and len(entry) == 4:  # test if the current entry is a valid student
        student = Student(entry[0].strip(), entry[1].strip(), entry[2].strip(), entry[3].strip())
        add(book, student)

loadFactor = get_load_factor(book)
assert loadFactor < 0.6 and loadFactor > 0.5