# Hashes and Hashtables

<img src="https://ohmydish.com/wp-content/uploads/2015/04/Hash-browns.jpg" style="width:500px">

<span style='color:#D3D3D3'><span>https://</span>ohmydish.com/wp-content/uploads/2015/04/Hash-browns.jpg </span>

<style>
    p {font-size:large};
</style>

### How can I run this notebook?
- Save the `.ipynb` file to a folder you want to work in
- Install Docker Desktop
- From your working folder, run
    ```bash
    docker run -it -p 8888:8888 -v "$(pwd):/data" sehrig/cling jupyter-notebook
    ```
  - This command assumes a *Unix-like* environment (macOS or Linux).
  If you are on Windows, the `-v <path>:/data` argument will look a little different.
    - This might help you: <https://stackoverflow.com/questions/41485217/mount-current-directory-as-a-volume-in-docker-on-windows-10> 
- Follow the instructions in your terminal (i.e. open the URL that appears that starts with `127.0.0.1`)
- Open the notebook and have at it!
- You can learn more about Jupyter Notebooks and kernels on the internet.

<br />
<br />
<br />

<br />
<br />
<br />

What if I told you that you could have a collection that performs insert and lookup in O(1)?

<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

### **Hashtables!**

<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

## Value vs location
The key is to store and retrieve the items by their *value* and not their *location*.

Location is relative to the other items, so it changes as the number of items changes. That means you **KNOW** there will be an `n` somewhere in `O(...n...)`.

Maybe you can reduce the impact (e.g. `O(log n)`), but you can never get rid of it when you depend on location to store an item.

We want a process for storing and retrieving items based on value or identity, which are independent of the other items -> `O(1)`

<br />
<br />
<br />

<br />
<br />
<br />

## How do we do this?

The idea is to take an object and turn it into an integer. 

Any non-negative integer can be turned into a valid index via the modulus operator (e.g. `num % length`).

We can use that index to store the object in an array.

The process of converting an object to an integer is called *hashing*. 

A function that does this is called a *hash function*.
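
The "integer to index" step can be sketched as a tiny helper. This is only an illustration: `toIndex` is a hypothetical name that does not appear elsewhere in this notebook, and the negative-number handling is an extra precaution rather than something the later cells rely on.

```cpp
// Hypothetical helper (not part of the notebook): map any int to a valid index.
int toIndex(int hashValue, int length) {
    // In C++, % keeps the sign of the left operand, so -7 % 10 is -7.
    int r = hashValue % length;
    // Shift negative remainders into the range [0, length).
    return r < 0 ? r + length : r;
}
```

For example, `toIndex(102, 10)` gives `2`, and `toIndex(-7, 10)` gives `3` instead of an out-of-range `-7`.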

<br />
<br />
<br />

<br />
<br />
<br />

## What makes a good hash function?

The following converts a string into an integer:

In [1]:
int hash(char* foo) {
    return 7;
}



<br />
<br />
<br />

<br />
<br />
<br />

Let's see it in action

In [2]:
char foo[] = "foo";
hash(foo)

(int) 7


In [3]:
char bar[] = "bar";
hash(bar)

(int) 7


<br />
<br />
<br />

<br />
<br />
<br />

How about this one?

In [4]:
int hash2(char* foo) {
    return rand();
}



In [5]:
hash2(foo)

(int) 596516649


In [6]:
hash2(bar)

(int) 1189641421


<br />
<br />
<br />

This is better than the constant hash function, but...

In [7]:
hash2(foo)

(int) 1025202362


...still not good.

<br />
<br />
<br />

Obviously, these hash functions are not very useful.

**What makes a good hash function?**

<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

## Good hash functions
- Deterministic
- Uniform
- Fast
  - You can't store a value in `O(1)` if it takes `O(n)` to compute the hash...
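
To make the three properties concrete, here is a sketch of one well-known string hash (often called djb2). It is not the hash used in the rest of this notebook, and `hashDjb2` is a name chosen just for this example. It is deterministic (no randomness), it mixes every character so similar strings tend to spread out, and its cost depends on the length of the string rather than on the number of items in the table.

```cpp
#include <cstdint>

// One well-known string hash, often called djb2. Not the notebook's own hash.
std::uint32_t hashDjb2(const char* text) {
    std::uint32_t h = 5381;  // conventional starting value
    for (const char* p = text; *p != '\0'; ++p) {
        // h * 33 + character; unsigned arithmetic wraps instead of overflowing
        h = h * 33u + static_cast<unsigned char>(*p);
    }
    return h;
}
```

Calling `hashDjb2("foo")` twice returns the same value both times, unlike `hash2`, and `"foo"` and `"bar"` get different values, unlike `hash`.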
  

<br />
<br />
<br />

<br />
<br />
<br />

## Hash tables

You can store an item by its hash.

Thus, the item is stored based on its *value* and not its *location*.

### Setup

In [8]:
#include <cstdio>



In [10]:
const int capacity = 10;
char* table[capacity];
int size;
char EMPTY[] = "";

In [11]:
void printTable() {
    for (int i = 0; i < capacity; i++) {
        std::fprintf(stdout, "%0d: %s\n", i, table[i]);
    }
}



In [12]:
void initTable() {
    for (int i = 0; i < capacity; i++) { table[i] = EMPTY; };
    size = 0;
    printTable();
}



In [13]:
initTable()

0: 
1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 


(void) @0x7fa88bffdd78


### Define `hashStr` , `add`, and `search`

In [14]:
int hashStr(char* item) {
    return (int) item[0];
}



In [15]:
char* add(char* item) {
    table[hashStr(item) % capacity] = item;
    size++;
    return item;
}



In [16]:
char* search(char* item) {
    int pos = hashStr(item) % capacity;
    return table[pos];
}



### Try it out!

In [17]:
char foobar[] = "foobar";
char alice[] = "alice";
char bob[] = "bob";
char ernie[] = "ernie";
char patty[] = "patty";



In [18]:
add(foobar); printTable()

0: 
1: 
2: foobar
3: 
4: 
5: 
6: 
7: 
8: 
9: 


(void) @0x7fa88bffdd78


In [19]:
search(foobar)

(char *) "foobar"


Add more stuff:

In [20]:
add(alice);
add(bob);
printTable();

0: 
1: 
2: foobar
3: 
4: 
5: 
6: 
7: alice
8: bob
9: 


(void) @0x7fa88bffdd78


Observe how using `%` causes the items to "wrap around" the array:

In [21]:
add(ernie); add(patty); printTable()

0: 
1: ernie
2: patty
3: 
4: 
5: 
6: 
7: alice
8: bob
9: 


(void) @0x7fa88bffdd78


<br />
<br />
<br />

What happened to `foobar`!?

<br />
<br />
<br />

In [22]:
search(foobar)

(char *) "patty"


<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

## Collisions

What happens when two different objects end up at the same index (same hash modulo table size)? Collision! 

How do you know you have a collision between two items? 

Do we allow duplicate items?
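
One possible way to tell these cases apart is sketched below. Everything here is an assumption made for illustration: `SlotCheck` and `checkSlot` are hypothetical names, and the sketch compares string *contents* with `std::strcmp` to spot duplicates, whereas the cells in this notebook compare pointers.

```cpp
#include <cstring>

enum class SlotCheck { Free, Duplicate, Collision };

// Hypothetical helper: classify what adding `item` to `slot` would mean.
// `emptyMarker` plays the role of the notebook's EMPTY sentinel.
SlotCheck checkSlot(const char* slot, const char* item, const char* emptyMarker) {
    if (slot == emptyMarker) return SlotCheck::Free;              // nothing stored yet
    if (std::strcmp(slot, item) == 0) return SlotCheck::Duplicate; // same value already here
    return SlotCheck::Collision;                                   // a different value owns the slot
}
```

A slot holding `"foobar"` reports `Collision` for `"patty"` and `Duplicate` for another `"foobar"`. Whether a duplicate should be rejected or stored again is a design decision the table has to make.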

<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

## Dealing with collisions

### Strategy 1
Just use a bigger table.

In [23]:
hashStr(foobar) % 10

(int) 2


In [24]:
hashStr(patty) % 10

(int) 2


In [25]:
hashStr(foobar) % 1000

(int) 102


In [26]:
hashStr(patty) % 1000

(int) 112


Problem solved.

<span style='font-size:50px'>💪🏻</span>

<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

But what if you need to store 9999 items? What if you don't know how many items you need? How can you guarantee you still won't have collisions?

<span style='font-size:50px'>😶 🤔</span>
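
A short sketch of why a bigger table is not a guarantee: if two items have the *same hash*, they share an index for every table size. The helper names below (`firstCharHash`, `sameIndex`) are made up for this example; `firstCharHash` simply repeats the first-character idea from `hashStr` so the sketch stands alone.

```cpp
// Same first-character hash as the notebook's hashStr, repeated so the sketch stands alone.
int firstCharHash(const char* item) {
    return static_cast<int>(item[0]);
}

// True when two items land on the same index for a given table size.
bool sameIndex(const char* a, const char* b, int tableSize) {
    return firstCharHash(a) % tableSize == firstCharHash(b) % tableSize;
}
```

`sameIndex("foobar", "patty", 10)` is true and becomes false at size 1000, but `sameIndex("fred", "fran", n)` stays true for any `n`, because both hash to the value of `'f'`.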

<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

### Strategy 2
If you go to store an item, and the item's position is already taken, just stick it somewhere else. 

In [27]:
char* addNoCollisions(char* item) {
    int pos = hashStr(item) % capacity;
    while (table[pos] != EMPTY) { pos = rand() % capacity; }
    table[pos] = item;
    size++;
    return item;
}



In [28]:
initTable()

0: 
1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 


(void) @0x7fa88bffdd78


In [29]:
addNoCollisions(foobar); addNoCollisions(patty); printTable()

0: 
1: 
2: foobar
3: 
4: 
5: 
6: 
7: patty
8: 
9: 


(void) @0x7fa88bffdd78


Problem solved.

<span style='font-size:50px'>💪🏻</span>

<br />
<br />
<br />

In [30]:
search(foobar)

(char *) "foobar"


In [31]:
search(patty)

(char *) "foobar"


Oops.

<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

### Storing colliding items, while being able to find them again

*Linear probing* is an algorithm for storing **and** retrieving items in a hash table that resolves collisions in a deterministic way.

In [32]:
char* addLinear(char* item) {
    int pos = hashStr(item) % capacity;
    while (table[pos] != EMPTY) { pos = (pos + 1) % capacity; } /* why is % important here? */
    table[pos] = item;
    size++;
    return item;
}



In [33]:
char* searchLinear_almost(char* item) {
    int pos = hashStr(item) % capacity;
    while (table[pos] != item) { pos = (pos + 1) % capacity; }
    return table[pos];
}



Do you see the bug?

<br />
<br />
<br />

<br />
<br />
<br />

In [34]:
char* searchLinear(char* item) {
    int pos = hashStr(item) % capacity;
    while (table[pos] != item && table[pos] != EMPTY) { pos = (pos + 1) % capacity; }
    return table[pos];
}



In [35]:
initTable()

0: 
1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 


(void) @0x7fa88bffdd78


In [36]:
addLinear(foobar); addLinear(patty); printTable()

0: 
1: 
2: foobar
3: patty
4: 
5: 
6: 
7: 
8: 
9: 


(void) @0x7fa88bffdd78


In [37]:
searchLinear(foobar)

(char *) "foobar"


In [38]:
searchLinear(patty)

(char *) "patty"


<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

Did you catch the bug? 

<span style='font-size:50px'>🐛</span>

<br />
<br />
<br />

What happens if our table is full and we try to search?

How can you solve this?

<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

In [39]:
char fred[] = "fred";
char fran[] = "fran";
char fry[] = "fry";
char fickle[] = "fickle";
char fellow[] = "fellow";
char paul[] = "paul";
char peter[] = "peter";
char pedro[] = "pedro";



In [40]:
char pam[] = "pam";



In [41]:
initTable()

0: 
1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 


(void) @0x7fa88bffdd78


In [42]:
addLinear(foobar); addLinear(patty); addLinear(fred); addLinear(fran); addLinear(fry);
addLinear(fickle); addLinear(fellow); addLinear(paul); addLinear(peter); addLinear(pedro);
printTable()

0: peter
1: pedro
2: foobar
3: patty
4: fred
5: fran
6: fry
7: fickle
8: fellow
9: paul


(void) @0x7fa88bffdd78


In [43]:
char* searchLinearBetter(char* item) {
    int attempt = 0;
    int pos = hashStr(item) % capacity;
    
    /* fill in the blanks */
    
    while (table[pos] != item && table[pos] != EMPTY && attempt++ < 100) { 
        pos = (pos + 1) % capacity; 
    }
    if (attempt < 100) {
        return table[pos];
    } else {
        std::fprintf(stdout, "GAVE UP!");
        return EMPTY;
    }
}



In [44]:
searchLinearBetter(pam)

GAVE UP!

(char *) ""
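
The magic number `100` above works for this table, but the bound can also be tied to the table size: after `capacity` probes every slot has been visited once. The sketch below is one way to write that. `findLinear` is a hypothetical name, and the table, capacity, and empty marker are passed as parameters (an assumption made so the example is self-contained) instead of using the notebook's globals.

```cpp
// Hypothetical variant of searchLinearBetter: table, capacity, and emptyMarker
// are passed in so the sketch stands alone.
// Returns the index where `item` is stored, or -1 if it is not in the table.
int findLinear(char* const table[], int capacity, const char* emptyMarker, const char* item) {
    int pos = static_cast<int>(item[0]) % capacity;  // same hash as hashStr
    for (int attempt = 0; attempt < capacity; ++attempt) {
        if (table[pos] == item) return pos;          // found (pointer identity, as in the notebook)
        if (table[pos] == emptyMarker) return -1;    // hit a gap: item was never stored
        pos = (pos + 1) % capacity;                  // wrap around the end of the array
    }
    return -1;                                       // visited every slot once: table is full
}
```

Returning an index (or `-1`) rather than the stored pointer also makes "not found" unambiguous.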


<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

### Revisiting collisions
The more items the hash table holds, the more likely a collision becomes.

In [45]:
char richard[] = "richard";
char quentin[] = "quentin";



In [46]:
initTable()

0: 
1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 


(void) @0x7fa88bffdd78


In [47]:
addLinear(fred); addLinear(patty); printTable()

0: 
1: 
2: fred
3: patty
4: 
5: 
6: 
7: 
8: 
9: 


(void) @0x7fa88bffdd78


In [48]:
hashStr(quentin) % capacity

(int) 3


In [49]:
addLinear(quentin); printTable();

0: 
1: 
2: fred
3: patty
4: quentin
5: 
6: 
7: 
8: 
9: 


(void) @0x7fa88bffdd78


In [50]:
hashStr(richard) % capacity

(int) 4


In [51]:
addLinear(richard); printTable();

0: 
1: 
2: fred
3: patty
4: quentin
5: richard
6: 
7: 
8: 
9: 


(void) @0x7fa88bffdd78


What is happening here?

<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

In [52]:
printTable()

0: 
1: 
2: fred
3: patty
4: quentin
5: richard
6: 
7: 
8: 
9: 


(void) @0x7fa88bffdd78


What is the probability that the next item added to the table will end up in position **0**?

What about position **6**?

As more items are added to the table, what happens to the expected time complexity?

How can you avoid clumping?
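
The probability questions above can be explored with a small sketch. It assumes, hypothetically, that every home position is equally likely (a uniform hash), and `landingCounts` is a name invented for this example. For each slot it counts how many home positions would end up there after linear probing.

```cpp
#include <vector>

// For each slot, count how many home positions would end up there under
// linear probing. Dividing by the table size gives the probability,
// assuming (hypothetically) that every home position is equally likely.
std::vector<int> landingCounts(const std::vector<bool>& occupied) {
    const int capacity = static_cast<int>(occupied.size());
    std::vector<int> counts(capacity, 0);
    for (int home = 0; home < capacity; ++home) {
        int pos = home;
        int steps = 0;
        while (occupied[pos] && steps < capacity) {
            pos = (pos + 1) % capacity;  // probe the next slot, wrapping around
            ++steps;
        }
        if (steps < capacity) counts[pos]++;  // skip when the table is full
    }
    return counts;
}
```

With slots 2 through 5 occupied in a table of 10, slot 6 collects 5 of the 10 home positions while slot 0 collects only 1. The slot right after a clump is the most likely to be filled next, so clumps tend to grow.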

<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

### Expansion and rehashing
As noted before, one way to reduce collisions is to have a bigger table.

If you're going to expand the table, how does the algorithm differ from that used to expand an array-list? Why?

<br />
<br />
<br />

In [53]:
hashStr(foobar) % 10

(int) 2


In [55]:
hashStr(foobar) % 101

(int) 1
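
Because the index depends on the capacity, items cannot simply be copied slot-for-slot into a bigger array. A minimal sketch of the re-insertion step follows. `rehash` is a hypothetical function, it uses `std::vector` and `nullptr` for empty slots instead of the notebook's fixed array and `EMPTY`, and it assumes the new capacity is larger than the number of stored items.

```cpp
#include <vector>

// Hypothetical sketch: grow a linear-probing table and re-insert every item.
// An empty slot is represented by nullptr here rather than the notebook's EMPTY.
// Assumes newCapacity is larger than the number of stored items.
std::vector<const char*> rehash(const std::vector<const char*>& oldTable, int newCapacity) {
    std::vector<const char*> newTable(newCapacity, nullptr);
    for (const char* item : oldTable) {
        if (item == nullptr) continue;                      // nothing stored here
        int pos = static_cast<int>(item[0]) % newCapacity;  // index depends on the NEW capacity
        while (newTable[pos] != nullptr) {                  // linear probing, as in addLinear
            pos = (pos + 1) % newCapacity;
        }
        newTable[pos] = item;
    }
    return newTable;
}
```

Rehashing a 10-slot table that holds `foobar` at index 2 and `patty` at index 3 into 101 slots moves them to indexes 1 and 11.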


<br />
<br />
<br />

<br />
<br />
<br />

### Quadratic probing
Another way to address collisions is to use a different algorithm for probing.
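
As a rough sketch of the idea, quadratic probing steps away from the home position by 1, 4, 9, ... instead of 1, 2, 3, .... The helper below only lists the slots that would be visited. `quadraticProbes` is a hypothetical name, and `home + i*i` is one common form of the probe sequence, not the only one.

```cpp
#include <vector>

// Slots visited by quadratic probing: home, home+1, home+4, home+9, ... (mod capacity).
std::vector<int> quadraticProbes(int home, int capacity, int attempts) {
    std::vector<int> visited;
    for (int i = 0; i < attempts; ++i) {
        visited.push_back((home + i * i) % capacity);
    }
    return visited;
}
```

Starting from home position 2 in a table of 10, the first five probes are 2, 3, 6, 1, 8, where linear probing would visit 2, 3, 4, 5, 6. One caveat: this sequence does not necessarily reach every slot. With a capacity of 10 it only ever reaches six distinct slots, so an insert can fail even when the table has room.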



<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

### Deletion

In [56]:
char* remove(char* item) {
    int attempt = 0;
    int pos = hashStr(item) % capacity;
    
    /* fill in the blanks to address high-capacity scenario */
    
    while (table[pos] != item && table[pos] != EMPTY && attempt++ < 100) { 
        pos = (pos + 1) % capacity; 
    }
    if (attempt < 100) {
        char* value = table[pos];
        table[pos] = EMPTY;
        return value;
    } else {
        std::fprintf(stdout, "GAVE UP!");
        return EMPTY;
    }
}



In [57]:
initTable()

0: 
1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 


(void) @0x7fa88bffdd78


In [58]:
addLinear(fred); addLinear(fran); addLinear(foobar); printTable()

0: 
1: 
2: fred
3: fran
4: foobar
5: 
6: 
7: 
8: 
9: 


(void) @0x7fa88bffdd78


In [59]:
searchLinearBetter(foobar)

(char *) "foobar"


In [60]:
remove(fred); printTable()

0: 
1: 
2: 
3: fran
4: foobar
5: 
6: 
7: 
8: 
9: 


(void) @0x7fa88bffdd78


In [61]:
searchLinearBetter(foobar)

(char *) ""


<br />
<br />
<br />

<span style='font-size:35px'>🤦🏻‍♂️</span>

What is happening here? 

How can you address it?

<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

### Strategy for deletion
- When you remove an item, store a dummy value that indicates it used to be populated
  - When searching for an item that originally collided, you know to keep looking

Can dummy items be replaced by new items? Under what conditions?

<br/>
<details>
    <summary>How do deleted (dummy) items affect the time complexity of insert, search, and delete? ⬇️ </summary>
    <ul>
        <li>Deleted items don't decrease search time, as you still need to traverse them</li>
        <li>Deleted items don't decrease storage, as you are still storing the dummies</li>
        <li>You cannot replace a deleted item with a new one until you've probed the whole chain to be sure it's not already there.</li>
        <li>The only way to efficiently clean up is to reduce the table and rehash.</li>
    </ul>
</details>
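
The dummy-value strategy (often called a *tombstone*) might look like the sketch below. All names here are hypothetical (`EMPTY_SLOT`, `DELETED_SLOT`, `findWithTombstones`, `removeWithTombstone`), and the table is passed as a parameter so the example does not depend on the notebook's globals.

```cpp
// Hypothetical sketch of tombstone deletion for a linear-probing table.
// EMPTY_SLOT and DELETED_SLOT are sentinel arrays compared by address.
char EMPTY_SLOT[] = "";
char DELETED_SLOT[] = "<deleted>";

// Returns the index of `item`, or -1. Tombstones are skipped, not treated as gaps.
int findWithTombstones(char* const table[], int capacity, const char* item) {
    int pos = static_cast<int>(item[0]) % capacity;
    for (int attempt = 0; attempt < capacity; ++attempt) {
        if (table[pos] == item) return pos;
        if (table[pos] == EMPTY_SLOT) return -1;  // a true gap ends the search
        pos = (pos + 1) % capacity;               // occupied or tombstone: keep probing
    }
    return -1;
}

// Replaces `item` with a tombstone so later searches keep probing past it.
bool removeWithTombstone(char* table[], int capacity, const char* item) {
    int pos = findWithTombstones(table, capacity, item);
    if (pos < 0) return false;
    table[pos] = DELETED_SLOT;
    return true;
}
```

With `fred`, `fran`, and `foobar` stored at indexes 2, 3, and 4, removing `fred` leaves a tombstone at index 2, and a search for `foobar` still probes past it and finds index 4.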

<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

## Traversal

If you traverse a hash table, in what order do the items come out?

Is this meaningful?

<br />
<br />
<br />

<br />
<br />
<br />

Can you make a data structure that preserves insert-order while maintaining `O(1)` complexity?
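
One possible answer, sketched under some assumptions: pair a hash-based set with a separate sequence that records the order of insertion. `OrderedSet` is a hypothetical type built on the standard library's `std::unordered_set` and `std::vector` rather than on the table from this notebook, and it only supports adding. Removing in `O(1)` would need more machinery, such as a linked list.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical sketch: a set that remembers insertion order.
// std::unordered_set gives average O(1) membership; the vector records order.
struct OrderedSet {
    std::unordered_set<std::string> members;
    std::vector<std::string> order;

    // Returns true if the item was newly added.
    bool add(const std::string& item) {
        if (!members.insert(item).second) return false;  // already present
        order.push_back(item);                           // amortized O(1)
        return true;
    }

    bool contains(const std::string& item) const {
        return members.count(item) > 0;
    }
};
```

Traversing `order` yields the items in the sequence they were added, while `contains` still runs in average constant time.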

<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

## Conclusion
Hash tables are awesome, but collisions have to be dealt with in order to make things work.

Open addressing (via linear or quadratic probing) is one strategy for addressing collisions (*pun not intended* 😆).

But there are some issues with that strategy. What issues have we reviewed today?

<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />

Is there a better way?

<br />
<br />
<br />

<br />
<br />
<br />

<br />
<br />
<br />