## Today's Agenda
- Hash Tables

# Hashing 

### The Map ADT (A.K.A. Dictionary or Associative Array)

- A map consists of (key, value) pairs
    - distinct keys, unordered
    - Each key, say $k$, may occur only once in the map
    - maps keys to values
    - Values are retrieved from the map via the key
    - Values may be modified
    - Key, value pairs may be removed
    - For example, name → address; word → definition
    
- Some operations are:
    - Usual operations: size() and isEmpty()
    - Search operations: Find($k$) (or Get($k$)) returns $v$
    - Add an entry: Insert($k$,$v$) (or Put($k$,$v$))
    - Delete an entry: Delete($k$) (or Remove($k$)) returns $v$
    - Each operation must define what happens when the key is already present (Insert) or absent (Find, Delete)
    
- A common implementation of maps is the **hash table**
- A hash table is a data structure (the map is the abstract data type) designed for $\mathcal{O}(1)$ expected-time Find and Insert operations
- Basic idea

<img src="images/week-07/hash1a.png">

<img src="images/week-07/hash_function.jpg">

- **Key:** a unique integer that is used for indexing the values
- **Value:** the data associated with a key
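
As a rough illustration of the Map ADT operations listed above, here is a minimal sketch that delegates to Python's built-in `dict`. The class name `DictMap` and the choice to return `None` for a missing key are assumptions made for this example, not part of the ADT definition.

```python
class DictMap:
    """Map ADT sketch backed by Python's built-in dict."""

    def __init__(self):
        self._data = {}

    def size(self):
        return len(self._data)

    def is_empty(self):
        return len(self._data) == 0

    def find(self, k):
        # Returns the value stored under k, or None if k is absent
        # (an assumption of this sketch; raising an error is also common).
        return self._data.get(k)

    def insert(self, k, v):
        # Inserting an existing key overwrites its value,
        # so each key occurs only once.
        self._data[k] = v

    def delete(self, k):
        # Removes k and returns its value, or None if k was absent.
        return self._data.pop(k, None)


m = DictMap()
m.insert("word", "definition")
m.insert("word", "new definition")   # same key: the value is replaced
print(m.size(), m.find("word"))      # 1 new definition
print(m.delete("missing"))           # None
```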


### Hash tables
- There are $m$ possible keys ($m$ typically large, even infinite)
- We expect our table to have only $n$ items
- $n$ is much less than $m$ (often written $n << m$)

- Many dictionaries have this property
    - Compiler: All possible identifiers allowed by the language vs. those used in some file of one program
    - Database: All possible student names vs. students enrolled

## Limited Set of Map Operations
- Note that with maps there are limited operations defined
    - Insert, Find, and Delete
    - Note that **no ordering of elements** is implied
- For many real-life applications, this limited set of operations is all that is needed

## Direct Address Tables
- Direct addressing using an array is very fast
- Assume
    - keys are integers in the set $U=\{0,1,\cdots ,m-1\}$
    - $m$ is small
    - no two elements have the same key
    - Then just store each element at array location $A[k]$ (a bucket for the key value $k$)
    - Search, Insert, and Delete are trivial operations
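
The assumptions above can be sketched as a direct-address table in a few lines. The class name `DirectAddressTable` is hypothetical, and the sketch uses `None` to mark an empty slot, so it cannot distinguish a stored `None` from a missing key.

```python
class DirectAddressTable:
    """Direct addressing: key k lives in slot k of an array of size m."""

    def __init__(self, m):
        self.slots = [None] * m      # one slot per possible key 0..m-1

    def insert(self, k, value):
        self.slots[k] = value        # O(1): index directly by the key

    def find(self, k):
        return self.slots[k]         # None means "no element with key k"

    def delete(self, k):
        value = self.slots[k]
        self.slots[k] = None
        return value


t = DirectAddressTable(8)
t.insert(3, "CENG 101")
print(t.find(3))     # CENG 101
print(t.find(5))     # None
```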

<img src="images/week-07/hash1.png">

## Some Issues
-  If most keys in $U$ are used
    - direct addressing can work very well (m small)
- The number of possible keys, $m = |U|$, may be much larger than the number of elements actually stored ($|U|$ much greater than $|K|$)
    - the table is very sparse and wastes space
    - in worst case, table too large to have in memory
- If most keys in $U$ are not used
    - need to map $U$ to a smaller set closer in size to $K$

## Mapping the Keys

<img src="images/week-07/hash2.png">

## Hashing Schemes
- We want to store $N$ items in a table of size $M$, at a location computed from the key $K$
- **Hash function**
    - Method for computing table index from key
- Need of a **collision** resolution strategy
    - How to handle two keys that hash to the same index
    
    
- *An ideal hash function:*
    - Fast to compute
    - “Rarely” hashes two “used” keys to the same index
    
<img src="images/week-07/hash1a.png">

## “Find” an Element in an Array

- Data records can be stored in arrays.
    - $A[0] = \{“IND 110”, Size 89\}$
    - $A[3] = \{“CENG 101”, Size 26\}$
    - $A[17] = \{“CENG 373”, Size 42\}$
- What is class size for CENG 101?
    - Linear search the array – $\mathcal{O}(n)$ worst case time
    - Binary search - $\mathcal{O}(\log{n})$ worst case

## Go Directly to the Element
- What if we could directly index into the array using the key?
    - $A[“CENG 101”] = \{Size 26\}$
- Main idea behind hash tables
    - Use a key based on some aspect of the data to index directly into an array
    - $\mathcal{O}(1)$ time to access records

## Indexing into Hash Table

- Need a fast hash function to convert the element key (string or number) to an integer (the hash value), i.e., a map from $U$ to an index
    - Then use this value to index into an array
    - $Hash(“CENG 101”) = 3$, $Hash(“CENG 373”) = 17$ (the array positions used in the earlier example)
- Output of the hash function
    - must always be less than size of array
    - should be as evenly distributed as possible

## Choosing the Hash Function
- What properties do we want from a hash function?
    - Want universe of hash values to be distributed randomly to minimize collisions
    - Don’t want systematic non-random pattern in selection of keys to lead to systematic collisions
    - Want hash value to depend on all values in entire key and their positions

## The Key Values are Important
- Notice that one issue with all the hash functions is that the actual content of the key set matters
- The elements in K (the keys that are used) are quite possibly a restricted subset of $U$, not just a random collection
    - variable names, words in the English language, reserved keywords, telephone numbers, etc.

## Simple Hashes
- It's possible to have very simple hash functions if you are certain of your keys
- For example,
    - suppose we know that the keys $s$ will be real numbers uniformly distributed over $0 \le s < 1$
    - Then a very fast, very good hash function is
        - $hash(s) = floor(s·m)$
        - where $m$ is the size of the table
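
A small sketch of this hash function follows. The function name `unit_interval_hash`, the fixed seed, and the sample count are assumptions for illustration; the code simply counts how many uniformly drawn keys fall into each of the $m$ buckets.

```python
import math
import random


def unit_interval_hash(s, m):
    """Map a real key 0 <= s < 1 to a bucket index in 0..m-1."""
    return math.floor(s * m)


m = 10
random.seed(0)                       # fixed seed so reruns give the same counts
counts = [0] * m
for _ in range(10_000):
    counts[unit_interval_hash(random.random(), m)] += 1

# For uniformly distributed keys, each bucket should receive
# roughly 10_000 / m = 1000 keys.
print(counts)
```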

## Example of a Very Simple Mapping
- $hash(s) = floor(s·m)$ maps from $0\le s < 1$ to $0,1,\cdots , m-1$ with $m = 10$

<img src="images/week-07/hash3.png">

Note the even distribution. There are **collisions**.

<img src="images/week-07/hash8.png">

We will deal with collisions later.

## Perfect Hashing
-  In some cases it's possible to map a known set of keys uniquely to a set of index values
- You must know every single key beforehand and be able to derive a function that works *one-to-one*.
<img src="images/week-07/hash4.png">

## Hashing integers
- Key space is composed of integers
- Simple and most common hash function
\begin{equation}
hash(key)= key\mod{TableSize}
\end{equation}

### Example
- $TableSize=10$
- Key values are $7,18,41,34,10$, and added to hash table in this order. Then,

In [1]:
from IPython.display import HTML, display
display(HTML("<table><tr><td><img src='images/week-07/hashEx11.png'></td><td><img src='images/week-07/hashEx12.png'></td><td><img src='images/week-07/hashEx13.png'></td><td><img src='images/week-07/hashEx14.png'></td><td><img src='images/week-07/hashEx15.png'></td><td><img src='images/week-07/hashEx16.png'></td></tr></table>"))

## Modulus Hash Function

- One solution for a less constrained key set
    - modular arithmetic
- $a \mod{size}$, where $size$ represents hash table size.
    - returns remainder when $a$ is divided by $size$
    - If Table size is $251$
    - $408 \mod{251} = 157$
    - $352 \mod{251} = 101$
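
The arithmetic above can be checked directly with Python's `%` operator. The helper name `mod_hash` is an assumption for this sketch; the last line reuses the keys from the earlier $TableSize = 10$ example.

```python
def mod_hash(key, table_size):
    """Remainder of key divided by table_size, used as the table index."""
    return key % table_size


print(mod_hash(408, 251))    # 157
print(mod_hash(352, 251))    # 101

# Keys from the earlier example with TableSize = 10
print([mod_hash(k, 10) for k in (7, 18, 41, 34, 10)])    # [7, 8, 1, 4, 0]
```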

## Modulo Mapping
- $a \mod{m}$ maps from integers to $0,1,\cdots ,m-1$
    - Is it one to one? **No**
    - Is it onto? Yes (for every bucket there is a possible key)
    
<img src="images/week-07/hash5.png">

## Hashing Integers
- If keys are integers, we can use the hash function:
    - $hash(key) = key \mod{TableSize}$
- **Problem 1:** What if $TableSize$ is 11 and all keys are two-digit numbers with a repeated digit (e.g., $22, 33,\cdots$), that is, multiples of 11?
    - all keys map to the same index
    - Need to pick $TableSize$ carefully: often, a prime number
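
Problem 1 is easy to reproduce. In the sketch below the key list and the alternative table size 13 are assumptions chosen for illustration: keys that are all multiples of the table size land in slot 0, while a table size that shares no factor with the key pattern spreads them out.

```python
keys = [22, 33, 44, 55, 66, 77]       # every key is a multiple of 11

# TableSize = 11: every key hashes to slot 0
print([k % 11 for k in keys])         # [0, 0, 0, 0, 0, 0]

# TableSize = 13 (hypothetical alternative): the same keys spread out
print([k % 13 for k in keys])         # [9, 7, 5, 3, 1, 12]
```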

## Nonnumerical Keys
- Many hash functions assume that the universe of keys is the natural numbers $N=\{0,1,2,\cdots\}$
- Need to find a function to convert the actual key to a natural number quickly and effectively before or during the hash calculation
- Generally work with the ASCII character codes when converting strings to numbers

## Characters to Integers

- If keys are strings, we can get an integer by adding up the ASCII values of the characters in the key
- We are converting a possibly very long string $c_{0}c_{1}c_{2}\cdots c_{n}$ to a relatively small number $(c_{0}+c_{1}+c_{2}+\cdots +c_{n}) \mod{size}$.

<img src="images/week-07/hash6.png">

## Hash Must be Onto Table
- **Problem 2:** What if $TableSize$ is 10,000 and all keys are 8 or fewer characters long?
    - chars have values between 0 and 127
    - Keys will hash only to positions 0 through $8\times 127 = 1016$
- Need to distribute keys over the entire table or the extra space is wasted

## Problems with Adding Characters
- Problems with adding up character values for string keys
    - If string keys are short, they will not hash evenly to all of the hash table
    - Different character combinations hash to same value
        - “abc”, “bca”, and “cab” all add up to the same value, **COLLISION** occurs.  
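
The collision among the three permutations can be seen in a short sketch. `additive_hash` follows the character-sum scheme described above. `positional_hash` is not from these notes; it is one common way (a polynomial hash with an assumed base of 31) to make the result depend on character positions as well, and the table size 10007 is likewise an assumption.

```python
def additive_hash(key, table_size):
    """Sum of character codes, reduced modulo the table size."""
    return sum(ord(c) for c in key) % table_size


def positional_hash(key, table_size, base=31):
    """Polynomial hash: each character is weighted by its position."""
    h = 0
    for c in key:
        h = (h * base + ord(c)) % table_size
    return h


words = ["abc", "bca", "cab"]
size = 10007

# Same characters, same sum: all three collide
print([additive_hash(w, size) for w in words])      # [294, 294, 294]

# Position-dependent hash separates the permutations
print([positional_hash(w, size) for w in words])
```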

## Collisions
- A collision occurs when two different keys hash to the same value
    - For example, if $TableSize = 17$, the keys 18 and 35 hash to the same value for the $\mod{17}$ hash function
    - $18 \mod{17} = 1$ and $35 \mod{17} = 1$
- Cannot store both data records in the same slot in array

## Collision-avoidance
- With $hash(key) = key \mod{TableSize}$, the number of collisions depends on
    - The key values to be inserted into hash table
    - $TableSize$
    
- Larger table-size tends to help, but not always
    - Example: Suppose key values: $70, 24, 56, 43, 10$
    - With $TableSize$ $10$ and $60$ we obtain the same collision: $70$ and $10$ land in the same slot in both cases.
- Strategy: Pick table size to be prime. **Exercise: Why?**
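
To check the example above, the following sketch groups the keys by slot for both table sizes and reports the groups that collide. The helper names `buckets` and `colliding_groups` are assumptions for illustration.

```python
from collections import defaultdict


def buckets(keys, table_size):
    """Group keys by the slot they hash to under key mod table_size."""
    table = defaultdict(list)
    for k in keys:
        table[k % table_size].append(k)
    return dict(table)


def colliding_groups(keys, table_size):
    """Return the groups of keys that share a slot."""
    return [group for group in buckets(keys, table_size).values()
            if len(group) > 1]


keys = [70, 24, 56, 43, 10]
print(colliding_groups(keys, 10))    # [[70, 10]]
print(colliding_groups(keys, 60))    # [[70, 10]]
```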

## Collision Resolution
- Separate Chaining
    - Use data structure (such as a linked list) to store multiple items that hash to the same slot
- Open addressing (or probing)
    - search for empty slots , e.g., using a second function and store item in first empty slot that is found

## Resolution by Chaining
- **Chaining:** All keys that map to the same table location are kept in a list (a.k.a. a “chain” or “bucket”)
- Each hash table cell holds pointer to linked list of records with same hash value
- Collision: Insert item into linked list
- To Find an item: compute hash value, then do Find on linked list
- Note that there are potentially as many as $TableSize$ lists

<img src="images/week-07/hash7.png">

#### Example

insert $10, 22, 107, 12, 42$ with $\mod{}$ hashing and $TableSize = 10$.

In [2]:
from IPython.display import HTML, display
display(HTML("<table><tr><td><img src='images/week-07/hashEx21.png'></td><td><img src='images/week-07/hashEx22.png'></td><td><img src='images/week-07/hashEx23.png'></td><td><img src='images/week-07/hashEx24.png'></td><td><img src='images/week-07/hashEx25.png'></td><td><img src='images/week-07/hashEx26.png'></td></tr></table>"))

## Why Lists?
- Can use List ADT for Find/Insert/Delete in linked list
    - $\mathcal{O}(M)$ runtime where $M$ is the number of elements in the particular chain
- Can also use Binary Search Trees
    - $\mathcal{O}(\log{M})$ time instead of $\mathcal{O}(M)$
    - But the number of elements to search through, $M$, should be small (otherwise the hashing function is bad or the table is too small)
    - generally not worth the overhead of BSTs

## Load Factor of a Hash Table
- Let $N$ be the number of items to be stored
- Load factor \begin{equation}\lambda =\frac{N}{TableSize}\end{equation}
    - $TableSize = 101$ and $N =505$, then $\lambda = 5$
    - $TableSize = 101$ and $N =10$, then $\lambda \approx 0.1$
- The average chain length equals $\lambda$, so the average time for accessing an item becomes $\mathcal{O}(1)+\mathcal{O}(\lambda)$
    - Want $\lambda$ to be smaller than 1 but close to 1 if the hashing function is good (i.e. $TableSize≈ N$)
    - With chaining hashing continues to work for $\lambda > 1$
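
A short sketch of these quantities, assuming a chained table stored as a list of Python lists (the function name `load_factor` is hypothetical). It also checks on the earlier chaining example that the chain length averaged over all slots equals $\lambda$.

```python
def load_factor(n_items, table_size):
    """lambda = N / TableSize."""
    return n_items / table_size


print(load_factor(505, 101))              # 5.0
print(round(load_factor(10, 101), 3))     # 0.099

# Chained table from the earlier example: TableSize = 10, mod hashing
table = [[] for _ in range(10)]
for key in (10, 22, 107, 12, 42):
    table[key % 10].append(key)

avg_chain = sum(len(chain) for chain in table) / len(table)
print(avg_chain, load_factor(5, 10))      # 0.5 0.5
```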

## Resolution by Open Addressing
- No links, all keys are in the table
    - reduced overhead saves space
- When searching for $x$, check locations $h_0(x),h_1(x),h_2(x),\cdots$, until either
    - $x$ is found; or
    - we find an empty location ($x$ not present)
- Variants of open addressing differ in which probe sequence they use.

### Example
- Simple approach: If slot $h(key)= key \mod{TableSize}$ is already full,
    - try $(h(key) + 1) \mod{}TableSize$. If full,
    - try $(h(key) + 2) \mod{}TableSize$. If full,
    - try $(h(key) + 3) \mod{}TableSize$. If full,...
    
- insert $38, 19, 8, 109, 10$

In [3]:
from IPython.display import HTML, display
display(HTML("<table><tr><td><img src='images/week-07/hashEx31.png'></td><td><img src='images/week-07/hashEx32.png'></td><td><img src='images/week-07/hashEx33.png'></td><td><img src='images/week-07/hashEx34.png'></td><td><img src='images/week-07/hashEx35.png'></td></tr></table>"))

## Probing hash tables
- Trying the next spot is called **probing** (also called **open addressing**).
- In the above example, we just did *linear probing*.
- $i$th probe was \begin{equation}(h(key) + i) \mod{TableSize}\end{equation}
where $h(key)= key \mod{TableSize}$
- In general, we have some **probe function** $F$ and use
\begin{equation}(h(key) + F(i)) \mod{TableSize}\end{equation}
- Open addressing does poorly with high load factor $\lambda$
     - So want larger tables
     - Too many probes means no more $\mathcal{O}(1)$

## Cell Full? Keep Looking.
- $h_{i}(x)=(h(x)+F(i)) \mod{TableSize},\quad i=0,1,2,\cdots$
    - Define $F(0) = 0$
    - $h(x)= x \mod{TableSize}$
- $F$ is the collision resolution function.
    - Some possibilities:
        - Linear: $F(i) = i$
        - Quadratic: $F(i) = i^{2}$
        - Double Hashing: $F(i) = i·Hash_{2}(x)$

## Linear Probing
- When searching for $k$, check locations $h(k),h(k)+1, h(k)+2,\cdots \mod{TableSize}$ until either
    - $k$ is found; or
    - we find an empty location ($k$ not present)
- If table is very sparse, almost like separate chaining.
- When table starts filling, we get clustering but still constant average search time.
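- Full table ⇒ a search for a missing key (or an insert) never reaches an empty slot, so the probe loop runs forever unless the number of probes is bounded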
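
A minimal sketch of linear probing, assuming integer keys, `None` as the empty-slot marker, and no deletions. The function names are hypothetical. The probe loop is bounded by the table size, which avoids the infinite loop on a full table. The example inserts $38, 19, 8, 109, 10$, assuming $TableSize = 10$.

```python
def linear_probe_insert(table, key):
    """Insert key, probing h(key), h(key)+1, ... (mod TableSize)."""
    size = len(table)
    home = key % size
    for i in range(size):                  # at most TableSize probes
        idx = (home + i) % size
        if table[idx] is None or table[idx] == key:
            table[idx] = key
            return idx
    raise RuntimeError("hash table is full")


def linear_probe_find(table, key):
    """Return the slot holding key, or None if key is not present."""
    size = len(table)
    home = key % size
    for i in range(size):
        idx = (home + i) % size
        if table[idx] is None:             # empty slot: key is not present
            return None
        if table[idx] == key:
            return idx
    return None                            # scanned the whole (full) table


table = [None] * 10
for k in (38, 19, 8, 109, 10):
    linear_probe_insert(table, k)
print(table)   # [8, 109, 10, None, None, None, None, None, 38, 19]
```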

## Primary Clustering Problem
- Once a block of a few contiguous occupied positions emerges in table, it becomes a “target” for subsequent collisions
- As clusters grow, they also merge to form larger clusters.
- Primary clustering: elements that hash to different cells probe same alternative cells

<img src="images/week-07/hashPrimaryClustering.png">

## Quadratic Probing
- $h_{i}(x)=(h(x)+F(i)) \mod{TableSize},\quad i=0,1,2,\cdots$
- $F(i)=i^{2}$
- When searching for $k$, check locations $h(k),h(k)+1^{2}, h(k)+2^{2},\cdots \mod{TableSize}$ until either
    - $k$ is found; or
    - we find an empty location ($k$ not present)
- No primary clustering but secondary clustering possible

### Example
- $TableSize=10$ and insert $89,18,49,58,79$
- $h_{i}(x)=(h(x)+i^2) \mod{TableSize}$
- $h_{0}(89)=(h(89)+0^2) \mod{10}=9$
- $h_{0}(18)=(h(18)+0^2) \mod{10}=8$
- $h_{0}(49)=(h(49)+0^2) \mod{10}=9$, full try next
- $h_{1}(49)=(h(49)+1^2) \mod{10}=0$

In [4]:
from IPython.display import HTML, display
display(HTML("<table><tr><td><img src='images/week-07/hashEx41.png'></td><td><img src='images/week-07/hashEx42.png'></td><td><img src='images/week-07/hashEx43.png'></td><td><img src='images/week-07/hashEx44.png'></td><td><img src='images/week-07/hashEx45.png'></td><td><img src='images/week-07/hashEx46.png'></td></tr></table>"))
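
The quadratic probing example can be replayed in code. This is a sketch under the same assumptions as the example (integer keys, $h(x) = x \mod{TableSize}$, $F(i) = i^2$); the function name is hypothetical. Note that quadratic probing may fail to reach a free slot even when the table is not full, so the loop gives up after $TableSize$ probes.

```python
def quadratic_probe_insert(table, key):
    """Insert key, probing h(key) + i^2 (mod TableSize) for i = 0, 1, 2, ..."""
    size = len(table)
    home = key % size
    for i in range(size):
        idx = (home + i * i) % size
        if table[idx] is None:
            table[idx] = key
            return idx
    raise RuntimeError("no free slot found along the probe sequence")


table = [None] * 10
for k in (89, 18, 49, 58, 79):
    quadratic_probe_insert(table, k)
print(table)   # [49, None, 58, 79, None, None, None, None, 18, 89]
```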

## Double Hashing
- Double Hashing: $F(i) = i·g(x)$
- When searching for $k$, check locations $h(k),h(k)+g(k), h(k)+2\times g(k),\cdots \mod{TableSize}$ until either
    - $k$ is found; or
    - we find an empty location ($k$ not present)
- General formula: $h_{i}(k)=(h(k)+i\times g(k)) \mod{TableSize},\quad i=0,1,2,\cdots$
-  Must be careful about $g(k)$
    - Make sure that $g(k)$ can not be $0$.
    - For example
        - $h(k)=k\mod{p}$,
        - $g(k)=q-(k\mod{q})$
        - $2<q<p$,
        - $p$ and $q$ are prime numbers.
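
A sketch of double hashing with the two functions above. The concrete primes $p = 7$ (also used as the table size) and $q = 5$, the keys, and the function names are assumptions chosen for illustration. Because $g(k)$ lies in $1,\ldots,q$, it is never $0$.

```python
P = 7    # table size, prime (assumed value)
Q = 5    # second prime with 2 < Q < P (assumed value)


def h(k):
    return k % P


def g(k):
    return Q - (k % Q)          # always in 1..Q, so never 0


def probe_sequence(k, count):
    """First `count` slots examined for key k."""
    return [(h(k) + i * g(k)) % P for i in range(count)]


def double_hash_insert(table, k):
    for i in range(P):
        idx = (h(k) + i * g(k)) % P
        if table[idx] is None:
            table[idx] = k
            return idx
    raise RuntimeError("no free slot found")


table = [None] * P
for k in (14, 21, 28):           # all three keys have h(k) = 0
    double_hash_insert(table, k)
print(table)                     # [14, None, 28, None, 21, None, None]
print(probe_sequence(21, 4))     # [0, 4, 1, 5]
```

Although the three keys share the same home slot, their step sizes $g(k)$ differ, so they follow different probe sequences.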

## Rehashing
- If table gets too full, create a bigger table and copy everything
- Rehashing makes a new, **bigger table**
- With chaining, we get to decide what “too full” means
    - Keep load factor reasonable (e.g., < 1)?
    - Consider average or max size of non-empty chains.
- For probing, half-full is a good rule of thumb
- New table size
    - Twice-as-big is a good idea, except that won’t be prime!
    - So go about twice-as-big, but prime.
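
The "about twice as big, but prime" rule can be sketched as follows, assuming a chained table stored as a list of lists of `(key, value)` pairs with integer keys. The helper names `is_prime`, `next_prime`, and `rehash` are hypothetical, and trial division is used only because table sizes here are small.

```python
def is_prime(n):
    """Trial division; adequate for small table sizes."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime(n):
    """Smallest prime greater than or equal to n."""
    while not is_prime(n):
        n += 1
    return n


def rehash(chains):
    """Build a table about twice as big (prime size) and re-insert every pair."""
    new_size = next_prime(2 * len(chains))
    new_chains = [[] for _ in range(new_size)]
    for chain in chains:
        for key, value in chain:
            # Indices must be recomputed because TableSize changed.
            new_chains[key % new_size].append((key, value))
    return new_chains


old = [[] for _ in range(11)]
for key in (11, 22, 33):                   # all collide in slot 0 of the old table
    old[key % 11].append((key, str(key)))

new = rehash(old)
print(len(new))                            # 23
print([i for i, c in enumerate(new) if c]) # [10, 11, 22]
```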

In [None]:
# Hash function: maps an integer key to a bucket
# index of the global HashTable (created below).
def Hashing(keyvalue):
    return keyvalue % len(HashTable)                                                  

In [None]:
# Insert Function to add
# values to the hash table
def insert(Hashtable, keyvalue, value):
      
    hash_key = Hashing(keyvalue)
    Hashtable[hash_key].append(value)
  

In [None]:
# Function to display hashtable
def display_hash(hashTable):
      
    for i in range(len(hashTable)):
        print(i, end = " ")
          
        for j in hashTable[i]:
            print("-->", end = " ")
            print(j, end = " ")
              
        print()

In [None]:
# Creating Hashtable as 
# a nested list.
HashTable = [[] for _ in range(10)]

In [None]:
display_hash (HashTable)

In [None]:
# Driver Code
insert(HashTable, 10, 'Ankara')

In [None]:
insert(HashTable, 20, 'Kayseri')

In [None]:
insert(HashTable, 25, 'İstanbul')
insert(HashTable, 9, 'Trabzon')
insert(HashTable, 21, 'İzmir')
insert(HashTable, 21, 'Muğla')

In [None]:
display_hash (HashTable)

In [None]:
HashTable[Hashing(21)]

In [1]:
class HashTable(object):
    def __init__(self, length=10):
        # Initiate our array with empty values.
        self.array = [None] * length
    
    def hash(self, key):
        """Get the index of our array for a specific string key"""
        length = len(self.array)
        return hash(key) % length
        
    def add(self, key, value):
        """Add a value to our array by its key"""
        index = self.hash(key)
        if self.array[index] is not None:
            # This index already contain some values.
            # This means that this add MIGHT be an update
            # to a key that already exist. Instead of just storing
            # the value we have to first look if the key exist.
            for kvp in self.array[index]:
                # If key is found, then update
                # its current value to the new value.
                if kvp[0] == key:
                    kvp[1] = value
                    break
            else:
                # This else belongs to the for loop: it runs only
                # if no break was hit, meaning no existing key was
                # found, so we can simply add the pair to the end.
                self.array[index].append([key, value])
        else:
            # This index is empty. We should initiate 
            # a list and append our key-value-pair to it.
            self.array[index] = []
            self.array[index].append([key, value])
            
        if self.is_full():
            self.double()
    
    def get(self, key):
        """Get a value by key"""
        index = self.hash(key)
        if self.array[index] is None:
            raise KeyError()
        else:
            # Loop through all key-value-pairs
            # and find if our key exist. If it does 
            # then return its value.
            for kvp in self.array[index]:
                if kvp[0] == key:
                    return kvp[1]
            
            # If no return was done during loop,
            # it means key didn't exist.
            raise KeyError()
            
    
    def is_full(self):
        """Determines if the HashTable is too populated."""
        items = 0
        # Count how many indexes in our array
        # that is populated with values.
        for item in self.array:
            if item is not None:
                items += 1
        # Return bool value based on if the 
        # amount of populated items are more 
        # than half the length of the list.
        return items > len(self.array)/2
        
    def double(self):
        """Double the list length and re-add values"""
        ht2 = HashTable(length=len(self.array)*2)
        for i in range(len(self.array)):
            if self.array[i] is None:
                continue
            
            # Since our list is now a different length,
            # we need to re-add all of our values to 
            # the new list for its hash to return correct
            # index.
            for kvp in self.array[i]:
                ht2.add(kvp[0], kvp[1])
        
        # Finally we just replace our current list with 
        # the new list of values that we created in ht2.
        self.array = ht2.array

In [2]:
h = HashTable()

In [3]:
h.add(10, 'Ankara')

In [4]:
h.get(10)

'Ankara'

In [5]:
h.add(25, 'İstanbul')
h.add(20, 'Kayseri')
h.add(9, 'Trabzon')
h.add(21, 'İzmir')
h.add(21, 'Muğla')

In [7]:
h.get(20)

'Kayseri'

In [8]:
h.get(21)

'Muğla'

In [None]:
h.hash(21)

In [None]:
h.is_full()