# Hash

In [1]:
#include <string>
#include <sstream>
#include <iostream>
#include <vector>
#include <list>
#include <algorithm>  // find
using namespace std;

## The Table

In [2]:
string table[10];

In [3]:
void reset_table() {
    for (int i = 0; i < 10; i++) {
        table[i] = "";
    }    
}

In [4]:
void print_table() {
    for (int i = 0; i < 10; i++) {
        cout << i << ": " << table[i] << endl;
    }
}

In [5]:
reset_table();
print_table()

0: 
1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 


## The Hash Function

In [6]:
int string_hash(string const& value) {
    int result = 0;
    for (auto c : value) {
        result += int(c);
    }
    return result;
}

In [7]:
string_hash("foobar")

633

In [8]:
string_hash("I love cs235!")

976

## Table + Hash Function

In [16]:
void add_item(string const& item, int (*the_function)(string const&)) {
    table[the_function(item) % 10] = item;
}

In [17]:
int another_hash(string const& item) {
    return int(item[0]);
}

In [18]:
string_hash("bazquux")

784

In [19]:
reset_table();
add_item("foobar", &another_hash);
add_item("bazquux", &another_hash);
add_item("win!", &another_hash);
print_table()

0: 
1: 
2: foobar
3: 
4: 
5: 
6: 
7: 
8: bazquux
9: win!


In [25]:
reset_table();
add_item("foobar", &string_hash);
add_item("foobar", &string_hash);
add_item("bazquux", &string_hash);
add_item("win!", &string_hash);
print_table()

0: 
1: 
2: 
3: foobar
4: bazquux
5: 
6: 
7: win!
8: 
9: 


In [22]:
bool has_item(string const& item, int (*hash)(string const&)) {
    return table[hash(item) % 10] == item;
}

In [23]:
has_item("foobar", &string_hash)

true

In [24]:
has_item("frobnicate", &string_hash)

false

In [26]:
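void remove_item(string const& item, int (*hash)(string const&)) {
    int pos = hash(item) % 10;
    // Only clear the slot if it holds this item; another item may share the slot
    if (table[pos] == item) { table[pos] = ""; }
}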

In [27]:
remove_item("foobar", &string_hash);
print_table()

0: 
1: 
2: 
3: 
4: bazquux
5: 
6: 
7: win!
8: 
9: 


## Introducing: The HashTable

- A **hash function** converts a value into an integer
- A **hash table** uses a hash function to determine the location in which to store the value

What is the big-O complexity to add, remove, or look up a value?

- Converting a value into an index takes $O(1)$ time with respect to the number of items in the table
- Adding, removing, or looking up a value then takes only a constant number of additional operations

$O(1)$!

## Hash Functions: *Revisited*

In [28]:
int hash_7(string const& value) {
    return 7;
}

In [30]:
int rand_hash(string const& value) {
    return rand();
}

In [31]:
hash_7("foo")

7

In [32]:
rand_hash("bar")

674016736

In [33]:
rand_hash("bar")

1181762074

### Hash Function Qualities

The choice of hash function matters. What kind of function do we want?

- **Determinism**: the same value will ALWAYS yield the same hashcode
  - No `rand` in the hash function!
  
- **Efficiency**: the hashcode can be computed quickly.
  - If it takes longer to compute the hashcode than to insert into a BST, that's no good.
 
- **Defined range**: the distribution covers the full defined range
  - If my array is 1000 slots long, but my hash function only produces values between 0..10, that's no good.

- **Uniformity**: the hashcodes are uniformly distributed across the full possible space
  - If my hash function tends to output even numbers but not odd numbers, that's no good.
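
To make the uniformity point concrete, here is a minimal sketch of a hash that takes character order into account, so that anagrams do not automatically share a hashcode the way they do with a plain character sum. The name `poly_hash` and the multiplier 33 are illustrative assumptions, not part of the notebook's own code.

```cpp
#include <string>

// Hypothetical alternative to a plain character sum: each step multiplies the
// running total by 33 before adding the next character, so order matters.
int poly_hash(std::string const& value) {
    unsigned int result = 0;  // unsigned arithmetic wraps around instead of overflowing
    for (unsigned char c : value) {
        result = result * 33 + c;
    }
    // Mask off the sign bit so the result is non-negative and safe to use with %
    return static_cast<int>(result & 0x7fffffffu);
}
```

With this sketch, `poly_hash("foo") % 10` is 2 and `poly_hash("oof") % 10` is 4, whereas a character sum gives both strings the same value.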



## Hash Tables: *Revisited*

In [34]:
reset_table();
print_table();

0: 
1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 


In [35]:
add_item("foo", &string_hash);
add_item("oof", &string_hash);
print_table();

0: 
1: 
2: 
3: 
4: oof
5: 
6: 
7: 
8: 
9: 


Is it possible to build a hash function that will never produce collisions?

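Not in general. There are usually far more possible values than slots in the table, so at least two values must share a slot. We will always need to handle collisions.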

How should we do it?

## HashTable Collisions

One strategy is to use "probing".

If the slot an item is assigned is occupied, you follow a deterministic algorithm to find another slot. 

This gets complicated. Don't use probing.
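
For reference, the simplest such algorithm is linear probing: if the assigned slot is taken, try the next slot, wrapping around at the end of the array. The sketch below is illustrative only; `probe_table`, `probe_add`, and `probe_contains` are hypothetical names, and `string_hash` is repeated from earlier so the block stands alone.

```cpp
#include <string>

const int PROBE_CAPACITY = 10;
std::string probe_table[PROBE_CAPACITY];  // "" marks an empty slot

// Same additive hash as earlier in the notebook
int string_hash(std::string const& value) {
    int result = 0;
    for (auto c : value) {
        result += int(c);
    }
    return result;
}

// Store item in its assigned slot, or in the next free slot after it.
// Returns the slot used, or -1 if the table is full.
int probe_add(std::string const& item) {
    int start = string_hash(item) % PROBE_CAPACITY;
    for (int step = 0; step < PROBE_CAPACITY; step++) {
        int pos = (start + step) % PROBE_CAPACITY;
        if (probe_table[pos].empty() || probe_table[pos] == item) {
            probe_table[pos] = item;
            return pos;
        }
    }
    return -1;
}

// Follow the same probe sequence; an empty slot means the item was never stored.
bool probe_contains(std::string const& item) {
    int start = string_hash(item) % PROBE_CAPACITY;
    for (int step = 0; step < PROBE_CAPACITY; step++) {
        int pos = (start + step) % PROBE_CAPACITY;
        if (probe_table[pos].empty()) { return false; }
        if (probe_table[pos] == item) { return true; }
    }
    return false;
}
```

Removal is where the complication shows up: simply emptying a slot would break the search for any item that was probed past it, so a probing table needs extra bookkeeping such as tombstone markers.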

### Chaining

Instead of storing the items directly, each slot stores a list of items. 

First determine the slot an item should go in, then search the list in that slot. 

As long as the number of items assigned to the same slot stays small, the performance doesn't degrade.

When the number of items gets closer to the capacity of the array, it's time to grow the array.

```
0: foo, quux
1:
2: bar
3: baz, zip
4: 
5:
6: win
7: win!
8: cs235
9: abc
```

### Growing

- Create a new array
- Re-add each item to the table

Why not simply copy the lists over to the new array? Why do we need to re-add each item individually?

Assume an array size of 10. The hashcodes `1812` and `7502` will end up in the same slot:

In [36]:
1812 % 10

2

In [37]:
7502 % 10

2

But when I increase the array size to 20, these same hashcodes now fall in different slots:

In [38]:
1812 % 20

12

In [39]:
7502 % 20

2

In [40]:
template<class T>
class HashSet {
    int capacity;
    int size;
    list<T>* table;
    int (*hash)(T const&);
    
    public:

    HashSet(int (*hash)(T const&)) : capacity(10), size(0), hash(hash) {
        // Each slot is initialized with default value of list<T>
        //  which is an empty list
        table = new list<T>[capacity];
    }
    ~HashSet() {
        delete[] table;
    }
    
    string to_string() const {
        stringstream ss;
        for (int i = 0; i < capacity; i++) {
            ss << i << ": ";
            for (auto item : table[i]) {
                ss << item << ",";
            }
            ss << endl;
        }
        return ss.str();
    }
    
    void print() const {
        cout << to_string() << endl;
    }
    
    bool add(T const& value) {
        if (size > capacity * 0.8) {
            _grow();
        }
        
        int pos = hash(value) % capacity;

        // Is this item already in the table?
        list<T> &bucket = table[pos];
        bool found = (std::find(bucket.begin(), bucket.end(), value) != bucket.end());

        if (found) {
            return false;
        }
        
        bucket.push_back(value);
        size++;
        return true;
    }
    
    bool remove(T const& value) {
        int pos = hash(value) % capacity;
        
        auto &bucket = table[pos];
        
        auto iter = std::find(bucket.begin(), bucket.end(), value);

        if (iter == bucket.end()) {
            return false;
        }
        
        bucket.erase(iter);
        size--;
        return true;
    }
    
    bool contains(T const& value) const {
        int pos = hash(value) % capacity;

        // Is this item already in the table?
        auto &bucket = table[pos];
        return (std::find(bucket.begin(), bucket.end(), value) != bucket.end());
    }
    
    void _grow() {
        auto old_table = table;
        auto old_capacity = capacity;
        capacity *= 2;
        table = new list<T>[capacity];
        size = 0;
        for (int i = 0; i < old_capacity; i++) {
            for (auto item : old_table[i]) {
                add(item);
            }
        }
        delete[] old_table;
    }
};

In [41]:
HashSet<string> set(string_hash);

In [42]:
for (auto word : {"foo", "bar", "quux", "win", "stuff"}) {
    set.add(word);
}

In [46]:
string_hash("win")

334

In [44]:
set.print()

0: 
1: 
2: stuff,
3: 
4: foo,win,
5: 
6: 
7: quux,
8: 
9: bar,



In [47]:
for (auto word : {"abc", "bcd", "cde", "def"}) {
    set.add(word);
}

In [48]:
set.print()

0: cde,
1: 
2: stuff,
3: def,
4: foo,win,abc,
5: 
6: 
7: quux,bcd,
8: 
9: bar,



In [49]:
for (auto word : {"efg"}) {
    set.add(word);
}

In [50]:
set.print()

0: cde,
1: 
2: 
3: def,
4: foo,
5: 
6: efg,
7: quux,
8: 
9: bar,
10: 
11: 
12: stuff,
13: 
14: win,abc,
15: 
16: 
17: bcd,
18: 
19: 



In [51]:
set.remove("abc")

true

In [53]:
string_hash("abc")

294

In [52]:
set.print()

0: cde,
1: 
2: 
3: def,
4: foo,
5: 
6: efg,
7: quux,
8: 
9: bar,
10: 
11: 
12: stuff,
13: 
14: win,
15: 
16: 
17: bcd,
18: 
19: 



## Iteration order

When you iterate through the values of a hash table, in what order do they come out?

In [54]:
set.print()

0: cde,
1: 
2: 
3: def,
4: foo,
5: 
6: efg,
7: quux,
8: 
9: bar,
10: 
11: 
12: stuff,
13: 
14: win,
15: 
16: 
17: bcd,
18: 
19: 
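
A minimal sketch of why the order looks scrambled: iterating a chained table walks the slots from 0 upward, so items come out ordered by hashcode modulo capacity rather than by insertion order or alphabetically. The helper `iteration_order` is a hypothetical name introduced for illustration; `string_hash` is repeated from earlier so the block stands alone.

```cpp
#include <list>
#include <string>
#include <vector>

// Same additive hash as earlier in the notebook
int string_hash(std::string const& value) {
    int result = 0;
    for (auto c : value) {
        result += int(c);
    }
    return result;
}

// Place each word in its bucket, then read the buckets back from slot 0 upward.
std::vector<std::string> iteration_order(std::vector<std::string> const& words, int capacity) {
    std::vector<std::list<std::string>> buckets(capacity);
    for (auto const& word : words) {
        buckets[string_hash(word) % capacity].push_back(word);
    }
    std::vector<std::string> order;
    for (auto const& bucket : buckets) {
        for (auto const& word : bucket) {
            order.push_back(word);
        }
    }
    return order;
}
```

Called with `{"foo", "bar", "quux", "win", "stuff"}` and a capacity of 10, this returns `stuff, foo, win, quux, bar`, matching the slot order printed earlier. With a capacity of 20 the same words come out as `foo, quux, bar, stuff, win`, so the order also changes whenever the table grows.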



## Big O

What is the big-O for add, remove, and contains?

- Computing the position is $O(1)$
- Finding the bucket is $O(1)$
- Assuming the hash function uniformly distributes the data, then the probability that there is a collision will be small
  - You can tune the grow parameter to improve performance
- Growing re-adds all $n$ items, but because the capacity doubles each time, it happens only about once every $n$ additions, so the amortized complexity is $O(1)$
- All together: $O(1)$
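
The amortized claim can be checked with a small simulation. The sketch below assumes the same policy as the `HashSet` above (start at capacity 10, double when the size exceeds 80% of capacity) and counts how many items get re-added in total. `count_rehash_moves` is a hypothetical helper, not part of the class.

```cpp
// Count how many items are re-added by growth while inserting n distinct items.
long count_rehash_moves(long n) {
    long capacity = 10;
    long size = 0;
    long moves = 0;
    for (long i = 0; i < n; i++) {
        if (size > capacity * 0.8) {
            moves += size;   // growing re-adds every item currently stored
            capacity *= 2;
        }
        size++;
    }
    return moves;
}
```

For 1000 insertions this reports 1023 re-adds in total, so the extra work stays within a small constant multiple of $n$.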

What are the pathological cases for a hashtable?

- All the items end up in the same bucket: $O(n)$
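
To see the pathological case concretely, the sketch below measures the longest bucket a given hash function would produce for a set of words. `longest_chain` is a hypothetical helper; `string_hash` and `hash_7` are repeated from earlier cells so the block stands alone.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Same additive hash as earlier in the notebook
int string_hash(std::string const& value) {
    int result = 0;
    for (auto c : value) {
        result += int(c);
    }
    return result;
}

// Same constant hash as earlier in the notebook
int hash_7(std::string const&) {
    return 7;
}

// Length of the longest bucket if every word were added to a chained table.
int longest_chain(std::vector<std::string> const& words,
                  int (*hash)(std::string const&),
                  int capacity) {
    std::vector<int> counts(capacity, 0);
    int longest = 0;
    for (auto const& word : words) {
        int count = ++counts[hash(word) % capacity];
        longest = std::max(longest, count);
    }
    return longest;
}
```

For the words `foo`, `bar`, `quux`, `win`, `stuff` and a capacity of 10, `string_hash` gives a longest bucket of 2, while `hash_7` puts all 5 words in one bucket, so every lookup has to scan the whole list.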

## Hash Maps

To turn a set into a map, you store key-value pairs instead of just values.

In [55]:
template<class T>
ostream& operator<<(ostream& stream, list<T> const& things) {
    for (auto const& thing : things) {
        stream << thing << " ";
    }
    return stream;
}

In [58]:
list<string> stuff = {"foo", "bar", "baz"};
cout << stuff << endl;

foo bar baz 


In [None]:
/*
map<string, string> foo;
foo.add("key", "something");
foo["key"] = "value";
// operator[] must return a reference to the stored value,
// so the assignment above replaces "something" with "value"
*/

In [60]:
template<class K, class V>
class HashMap {
    struct KeyValue {
        K key;
        V value;
        KeyValue(K key, V value) : key(key), value(value) {}
        bool operator==(KeyValue const& other) const {
            // We'll say two key-values are the same if their keys are the same
            return key == other.key;
        }
    };
    
    int capacity;
    int size;
    list<KeyValue>* table;
    int (*hash)(K const&);
    
    public:
    HashMap(int (*hash)(K const&)) : capacity(10), size(0), hash(hash) {
        table = new list<KeyValue>[capacity];
    }
    ~HashMap() { delete[] table; }
    
    string to_string() const {
        stringstream ss;
        for (int i = 0; i < capacity; i++) {
            ss << i << ": ";
            for (auto kv : table[i]) {
                ss << kv.key << ":" << kv.value << ", ";
            }
            ss << endl;
        }
        return ss.str();
    }
    
    void print() const {
        cout << to_string() << endl;
    }
    
    list<KeyValue>& _get_bucket(K const& key) {
        int pos = hash(key) % capacity;        
        return table[pos];
    }
    
    bool add(K key, V value) {
        if (size > capacity * 0.8) {
            _grow();
        }
        
        KeyValue kv(key, value);
        auto &bucket = _get_bucket(key);
        auto iter = std::find(bucket.begin(), bucket.end(), kv);

        if (iter != bucket.end()) {
            // There is already a KV with the same key
            // Is the value the same?
            if (iter->value == value) {
                // Yes, same value.
                // So return false.
                return false;
            } else {
                // No, different value.
                // So replace the value and return true
                iter->value = value;
                return true;
            }
        }
        
        // The key is not already in the table
        // So add the KV to the assigned bucket
        bucket.push_back(kv);
        size++;
        return true;
    }
    
    V& operator[](K const& key) {
        if (size > capacity * 0.8) { _grow(); }
        
        // Use default value V() for now
        KeyValue kv(key, V());
        
        auto &bucket = _get_bucket(key);
        auto iter = std::find(bucket.begin(), bucket.end(), kv);

        if (iter != bucket.end()) {
            return iter->value;
            
        } else {
            // The key is not already in the table
            // So add the KV to the assigned bucket
            iter = bucket.insert(iter, kv);
            size++;
            return iter->value;
        }
    }
    
    void _grow() {
        auto old_table = table;
        auto old_capacity = capacity;
        capacity *= 2;
        table = new list<KeyValue>[capacity];
        size = 0;
        for (int i = 0; i < old_capacity; i++) {
            for (auto key_value : old_table[i]) {
                add(key_value.key, key_value.value);
            }
        }
        delete[] old_table;
    }
};

In [61]:
HashMap<string, int> my_map(&string_hash);

In [64]:
my_map.add("foo", 7)

true

In [65]:
my_map.print()

0: 
1: 
2: 
3: 
4: foo:7, 
5: 
6: 
7: 
8: 
9: 



In [66]:
my_map["quux"] = 12

12

In [67]:
my_map["bar"]

0

In [68]:
my_map.print()

0: 
1: 
2: 
3: 
4: foo:7, 
5: 
6: 
7: quux:12, 
8: 
9: bar:0, 



In [69]:
my_map["foo"] = 8

8

In [70]:
my_map.print()

0: 
1: 
2: 
3: 
4: foo:8, 
5: 
6: 
7: quux:12, 
8: 
9: bar:0, 



In [71]:
my_map["bar"] = 7;
my_map["quux"] = 9;
my_map["stuff"] = 3;

In [73]:
my_map.print()

0: 
1: 
2: stuff:3, 
3: 
4: foo:8, 
5: 
6: 
7: quux:9, 
8: 
9: bar:7, 



In [74]:
int char_hash(char const& c) {
    return int(c);
}

In [75]:
HashMap<char, list<string>> stuff(&char_hash);

In [76]:
for (auto word : {"some", "words", "and", "stuff", "to", "include", "when", "that", "makes", "sense"}) {
    stuff[word[0]].push_back(word);
}

In [77]:
stuff.print()

0: 
1: 
2: 
3: 
4: 
5: s:some stuff sense , i:include , 
6: t:to that , 
7: a:and , 
8: 
9: w:words when , m:makes , 



In [78]:
stuff['s']

{ "some", "stuff", "sense" }

In [79]:
stuff['a']

{ "and" }

In [80]:
stuff['i']

{ "include" }

## How to Hash Anything

In [81]:
template <typename T>
int hashme(T const& param) {
    // Sum the raw bytes of the object's in-memory representation
    unsigned char const* ptr = reinterpret_cast<unsigned char const*>(&param);
    int sum = 0;
    for (size_t i = 0; i < sizeof(param); i++) {
        sum += ptr[i];
    }
    return sum;
}

In [82]:
hashme("hello")

532

In [83]:
sizeof(8)

4

In [84]:
hashme(876)

111

In [85]:
struct Foo {
    int foo;
    string bar;
    Foo(int foo, string bar) : foo(foo), bar(bar) {}
};
Foo foo(7, "world");
hashme(foo)

1403
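
One caveat with summing raw bytes: an object such as `Foo` contains a `std::string`, whose in-memory bytes can include a pointer to its character buffer or unused buffer space, so two `Foo` values with equal fields are not guaranteed to produce the same sum. A common alternative is to combine the hashes of the individual fields. The sketch below is one possible way to do that; `foo_hash` and the multiplier 31 are illustrative assumptions, and `Foo` and `string_hash` are repeated from earlier so the block stands alone.

```cpp
#include <string>

struct Foo {
    int foo;
    std::string bar;
    Foo(int foo, std::string bar) : foo(foo), bar(bar) {}
};

// Same additive hash as earlier in the notebook
int string_hash(std::string const& value) {
    int result = 0;
    for (auto c : value) {
        result += int(c);
    }
    return result;
}

// Hypothetical field-wise hash: depends only on the values of the fields,
// not on where the object or its string buffer lives in memory.
int foo_hash(Foo const& value) {
    unsigned int result = static_cast<unsigned int>(value.foo);
    result = result * 31 + static_cast<unsigned int>(string_hash(value.bar));
    // Mask off the sign bit so the result is non-negative and safe to use with %
    return static_cast<int>(result & 0x7fffffffu);
}
```

Here `foo_hash(Foo(7, "world"))` is 769 no matter where the object is stored, which satisfies the determinism requirement from earlier.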

## Key Ideas

- Hash functions convert a value into an integer
- Hash tables use hash functions to store and look up values in $O(1)$ time on average
- Hash maps use hash tables to store key-value pairs