# 1) Hashing

* Maintain evolving set of entities.
* Insert, delete, lookup in $O(1)$

## Implementation

**Given:** Universe $U$ (of all possible entities). Maintain evolving set $S \subseteq U$.

1. Pick $n$ = # of buckets with $n \sim |S|$ (or resized dynamically).
2. Choose hash function $h: U \rightarrow \{0, 1, ..., n - 1\}$.
3. Use array $A$ of length $n$, store $x$ in $A[h(x)]$.

### Resolving Collisions

**Collision:** Distinct $x, y \in U$ such that $h(x) = h(y)$.

#### 1) Chaining
* Keep linked list in each bucket.
* Given a key x, perform insert/delete/lookup in list in $A[h(x)]$

#### 2) Open Addressing
* Only one element per bucket.
* The hash function now specifies a *probe sequence*. Keep trying slots in that order until an open slot is found.
* **Example:** Linear Probing, Double Hashing
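
As a rough illustration of open addressing, here is a minimal linear-probing sketch. The class name `LinearProbingSet`, the use of Python's built-in `hash`, and the decision to omit deletion and resizing are all assumptions made for brevity, not part of the notes.

```python
class LinearProbingSet:
    """Minimal open-addressing set with linear probing (no deletion, no resizing)."""

    _EMPTY = object()  # sentinel marking an unused slot

    def __init__(self, capacity=8):
        self.slots = [self._EMPTY] * capacity
        self.capacity = capacity

    def _probe(self, key):
        # Probe sequence: start at h(key), then step to the next slot, wrapping around.
        start = hash(key) % self.capacity
        for step in range(self.capacity):
            yield (start + step) % self.capacity

    def insert(self, key):
        for i in self._probe(key):
            if self.slots[i] is self._EMPTY:
                self.slots[i] = key  # first open slot wins
                return
            if self.slots[i] == key:
                return  # already present
        raise RuntimeError("table is full")

    def contains(self, key):
        for i in self._probe(key):
            if self.slots[i] is self._EMPTY:
                return False  # an empty slot ends the probe sequence
            if self.slots[i] == key:
                return True
        return False


s = LinearProbingSet(capacity=8)
for key in (1, 9, 17):  # all three hash to slot 1 when capacity is 8
    s.insert(key)
print(s.contains(9), s.contains(25))
```

The three colliding keys end up in consecutive slots 1, 2, 3, which is the clustering behaviour that makes linear probing sensitive to the load factor.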

### Hash Functions 
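* With chaining, insert/delete/lookup take $O(\text{list length})$. For $m$ stored objects the list length ranges from $\frac{m}{n}$ (evenly spread) to $m$ (everything in one bucket). Insert at the head of a chain is $O(1)$.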
* => Performance depends on hash function!
    * Should spread data evenly (gold standard: random hashing).
    * Easy/fast to store and evaluate.

---

# 2) Universal Hashing

## Load Factor
The load factor of a hash table is:
$\alpha = \frac{\text{# of obj in HT}}{\text{# of buckets of HT}}$

**Note:**
* $\alpha = O(1)$ is necessary condition for operations to run in constant time.
* With open addressing, we need $\alpha \ll 1$.

## Pathological Data Sets
* No guarantee of hash function to spread data evenly.
* Because: $h: U \rightarrow \{0, 1, ..., n - 1\}$
    * There exists a bucket $i$, such that at least $\frac{|U|}{n}$ elements of $U$ hash into $i$ under $h$.
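
The pigeonhole argument can be made concrete for a small universe. The sketch below uses a hypothetical fixed hash function `h` and a toy universe of 1000 integers (both chosen only for illustration), finds the fullest bucket, and collects the keys that land in it as a pathological data set.

```python
from collections import Counter


def fullest_bucket(h, universe):
    """Return (bucket, count) for the bucket that receives the most keys of the universe."""
    counts = Counter(h(x) for x in universe)
    return counts.most_common(1)[0]


n = 10                       # number of buckets
universe = range(1000)       # toy universe U


def h(x):
    # Hypothetical fixed hash function; any fixed function works for the argument.
    return (31 * x + 7) % n


bucket, size = fullest_bucket(h, universe)
pathological = [x for x in universe if h(x) == bucket]

print(bucket, size)          # size is at least |U| / n
```

Inserting only keys from `pathological` puts every object into one chain, so lookups degrade to a linear scan.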
    
### Solution:

1. Use a cryptographic hash function (infeasible to reverse engineer a pathological data set).
2. Use randomization:
    * Design a family $H$ of hash functions such that: For all data sets $S$: "almost all" functions $h \in H$ spread $S$ out "pretty evenly".
    * => **Universal Family**
    
## Universal Hash Functions

**Definition:**
* Let $H$ be a set of hash functions from $U$ to $\{0, 1, ..., n - 1\}$.
* $H$ is **universal** if and only if:
    * For all $x, y \in U$ (with $x \neq y$): $Pr(h(x) = h(y)) \leq \frac{1}{n}$
* when $h$ is chosen uniformly at random from $H$.

### Example: Hashing IP Addresses

* Let $U$ = IP addresses with form $(x_1, x_2, x_3, x_4)$ with $x_i \in \{0, 1, ..., 255\}$
* Let $n$ = a prime (e.g. small multiple of # of obj in hash table).

**Construction:**
* Define one hash function $h_a$ per 4-tuple $a = (a_1, a_2, a_3, a_4)$ with each $a_i \in \{0, ..., n - 1\}$ ($n^4$ such functions).

$\Rightarrow h_a(x_1, x_2, x_3, x_4) = (a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4) \mod n$

$\Rightarrow H = \{h_a | a_1, a_2, a_3, a_4 \in \{0, 1, ..., n - 1\}\}$
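
A minimal sketch of this construction, assuming IP addresses are given as 4-tuples of integers. The helper name `random_ip_hash` and the prime $n = 257$ are illustrative choices, not part of the notes.

```python
import random


def random_ip_hash(n, rng=random):
    """Draw h_a uniformly at random from H by picking a in {0, ..., n-1}^4."""
    a = tuple(rng.randrange(n) for _ in range(4))

    def h(ip):
        # h_a(x1, x2, x3, x4) = (a1*x1 + a2*x2 + a3*x3 + a4*x4) mod n
        return sum(a_i * x_i for a_i, x_i in zip(a, ip)) % n

    return h


n = 257  # a prime, picked here only as an example
h = random_ip_hash(n, random.Random(0))
print(h((192, 168, 0, 1)))
```

Storing $h_a$ only requires the four coefficients, which is what makes the family easy to store and fast to evaluate.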


**Theorem:** This family is universal.

* Consider distinct IP addresses $x = (x_1, ..., x_4)$ and $y = (y_1, ..., y_4)$. Assume (without loss of generality) $x_4 \neq y_4$.
* **Question:** What is the probability of a collision, $Pr[h_a(x) = h_a(y)]$?

$h_a(x) = h_a(y) \Leftrightarrow (a_1 x_1 + ... + a_4 x_4) \mod n = (a_1 y_1 + ... + a_4 y_4) \mod n$

$\Leftrightarrow a_4(x_4 - y_4) \mod n = \sum_{i = 1}^3 a_i (y_i - x_i) \mod n$

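=> With $a_1, a_2, a_3$ fixed arbitrarily, how many choices of $a_4$ satisfy:

$a_4(x_4 - y_4) \mod n = \sum_{i = 1}^3 a_i (y_i - x_i) \mod n$

=> **Claim:** The left-hand side is equally likely to be any of $\{0, 1, ..., n - 1\}$ since:
* $x_4 \neq y_4$ ($x_4 - y_4 \neq 0 \mod n$, which needs $n$ not to divide $x_4 - y_4$, e.g. $n > 255$).
* $n$ is prime.
* $a_4$ is uniform at random.

=> Exactly one of the $n$ choices of $a_4$ matches the (fixed) right-hand side, so $Pr[h_a(x) = h_a(y)] = \frac{1}{n}$.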
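
The counting step can be checked by brute force for a small prime. The sketch below assumes $n = 257$ (so that $n > 255$ and $x_4 - y_4 \neq 0 \mod n$); the function name and the sample addresses are hypothetical.

```python
def count_colliding_a4(x, y, a123, n):
    """Count the a4 in {0, ..., n-1} for which h_a(x) == h_a(y), with a1..a3 fixed."""
    # Right-hand side: sum_{i=1}^{3} a_i * (y_i - x_i) mod n (zip stops after three terms)
    rhs = sum(a * (y_i - x_i) for a, x_i, y_i in zip(a123, x, y)) % n
    # Left-hand side: a4 * (x4 - y4) mod n, tried for every possible a4
    return sum(1 for a4 in range(n) if (a4 * (x[3] - y[3])) % n == rhs)


n = 257
x = (192, 168, 0, 1)
y = (192, 168, 7, 200)

for a123 in [(0, 0, 0), (1, 2, 3), (256, 17, 99)]:
    print(a123, count_colliding_a4(x, y, a123, n))
```

For each fixed $(a_1, a_2, a_3)$ exactly one value of $a_4$ produces a collision, which corresponds to a collision probability of $\frac{1}{n}$.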

---

# 3) Constant Time Guarantee of Chaining

**Given:** Hash table with chaining, hash function $h$ chosen uniformly at random from universal family $H$.

**Theorem:** [Carter-Wegmann 1979]
* All operations run in $O(1)$ time for every data set $S$.
    * **Caveats**: 1) In expectation over random choice of $h$. 2) Assumes $|S| = O(n)$. 3) Assumes $O(1)$ to evaluate hash function.
    
**Proof:**
* Analyze an unsuccessful lookup (all other operations are only faster).
* Let $S$ = data set with $|S| = O(n)$.
* Consider a lookup for $x \notin S$.
* Running Time: $O(1) + O(\text{list length in }A[h(x)])$ (hash time + list traversal).

**General Decomposition Principle:**
1. Identify random variable $y$ that you care about.
2. Express $y$ as sum of indicator random variables ($\in \{0, 1\}$): $\sum_{l=1}^m x_l$
3. Apply linearity of expectation: $E(y) = \sum_{l=1}^{m} Pr(x_l = 1)$

**1)** Let $L$ = length of list in $A[h(x)]$.

**2)** For $y \in S (x \neq y)$, define $z_y = 1$ if $h(x) = h(y)$ else $z_y = 0$.
* Note: $L = \sum_{y \in S} z_y$

**3)**: $E(L) = \sum_{y \in S} E(z_y)$

$\Rightarrow E(L) = \sum_{y \in S} E(z_y) = \sum_{y \in S} Pr[h(x) = h(y)]$

*Note:* $Pr[h(x) = h(y)] \leq \frac{1}{n}$ since $H$ is universal.

$\Rightarrow E(L) \leq \sum_{y \in S} \frac{1}{n} = \frac{|S|}{n} = \alpha = O(1)$
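
The bound can be sanity-checked empirically. The sketch below reuses the IP-address family from the previous section, with $n = 257$, a random data set of about 128 addresses and 2000 trials; all of these numbers and names are illustrative assumptions. It averages the chain length hit by a lookup for some $x \notin S$ over random choices of $h$.

```python
import random


def average_chain_length(S, x, n, trials, rng):
    """Average number of y in S with h(y) == h(x), over random h_a from the family."""
    total = 0
    for _ in range(trials):
        a = [rng.randrange(n) for _ in range(4)]  # fresh random hash function per trial

        def h(ip):
            return sum(a_i * x_i for a_i, x_i in zip(a, ip)) % n

        hx = h(x)
        total += sum(1 for y in S if h(y) == hx)
    return total / trials


rng = random.Random(1)
n = 257
S = set()
while len(S) < 128:
    S.add(tuple(rng.randrange(256) for _ in range(4)))

x = (1, 2, 3, 4)
S.discard(x)                 # make sure the lookup is unsuccessful
alpha = len(S) / n

avg = average_chain_length(S, x, n, trials=2000, rng=rng)
print(alpha, avg)            # avg should be close to alpha
```

This is only a simulation of the expectation, so individual runs fluctuate around $\alpha$; the theorem is about the average over the choice of $h$, not about any single $h$.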

---

# 4) Open Addressing Performance

## Double Hashing

**Heuristic Assumption:** All $n!$ probe sequences are equally likely.

**Observation:** Under heuristic assumption expect: Insertion time is $\sim \frac{1}{1 - \alpha}$

**Proof:** A random probe finds an empty slot with probability $1 - \alpha$.
* => Insertion time ~ the number $N$ of coin flips to get "heads" where $Pr(\text{"heads"}) = 1 - \alpha$

**Note:** $E(N) = 1 + \alpha \cdot E(N) \Rightarrow E(N) = \frac{1}{1 - \alpha}$
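
The coin-flip argument is easy to simulate. The following sketch assumes $\alpha = 0.5$ and 20000 trials (arbitrary choices) and compares the empirical mean number of flips with $\frac{1}{1 - \alpha}$.

```python
import random


def flips_until_heads(p_heads, rng):
    """Number of coin flips up to and including the first heads."""
    n_flips = 1
    while rng.random() >= p_heads:  # tails: the probed slot was occupied
        n_flips += 1
    return n_flips


alpha = 0.5
trials = 20000
rng = random.Random(0)

mean_flips = sum(flips_until_heads(1 - alpha, rng) for _ in range(trials)) / trials
print(mean_flips, 1 / (1 - alpha))
```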

## Linear Probing

Heuristic assumption of double hashing is completely false for linear probing!

**New Assumption**: Initial probe uniformly at random, independent for different keys.

**Theorem:** [Knuth 1962]
* Under above assumption, expected insertion time is $\sim \frac{1}{(1  - \alpha)^2}$

---

# 5) POC Chaining Implementation

```python
class Entry:
    """A key/value pair stored in a bucket's chain."""

    def __init__(self, key, value):
        self.key = key
        self.value = value

    def __str__(self):
        return '<{}->{}>'.format(self.key, self.value)

    def __repr__(self):
        return self.__str__()


class HashMap:
    """Hash map with chaining: each bucket holds a Python list of entries."""

    def __init__(self, capacity=7):
        self.buckets = [None] * capacity
        self.capacity = capacity

    def put(self, key, value):
        index = self._get_index(hash(key))
        if self.buckets[index] is None:
            self.buckets[index] = [Entry(key, value)]
        else:
            for entry in self.buckets[index]:
                if entry.key == key:
                    entry.value = value  # Update existing key
                    return
            # New key: insert at the front of the chain
            self.buckets[index].insert(0, Entry(key, value))

    def get(self, key):
        index = self._get_index(hash(key))
        if not self.buckets[index]:
            raise KeyError(key)
        for entry in self.buckets[index]:
            if entry.key == key:
                return entry.value
        raise KeyError(key)

    def contains(self, key):
        try:
            self.get(key)
        except KeyError:
            return False
        return True

    def delete(self, key):
        index = self._get_index(hash(key))
        if not self.buckets[index]:
            raise KeyError(key)
        for list_index, entry in enumerate(self.buckets[index]):
            if entry.key == key:
                self.buckets[index].pop(list_index)
                return
        raise KeyError(key)

    def _get_index(self, hash_value):
        # Map the hash value to a bucket index
        return hash_value % self.capacity
```

```python
h = HashMap()
print('Add 2, 13, 23, 41:')
h.put(2, 4)
h.put(13, 4)
h.put(23, 5)
h.put(41, 7)
print(h.buckets)
h.delete(23)
h.delete(2)
h.delete(41)
print('Deleted 23, 2, 41:')
print(h.buckets)
print('Add 23:')
h.put(23, 8)
print(h.buckets)
```

Output:

```
Add 2, 13, 23, 41:
[None, None, [<23->5>, <2->4>], None, None, None, [<41->7>, <13->4>]]
Deleted 23, 2, 41:
[None, None, [], None, None, None, [<13->4>]]
Add 23:
[None, None, [<23->8>], None, None, None, [<13->4>]]
```


---

# 6) Bloom Filter

* Space-efficient data structure to test whether an element is in a set.
* No false negatives, but false positives are possible: if $x$ was inserted, `lookup(x)` is guaranteed to succeed, but `lookup(x)` may also succeed for an $x$ that was never inserted.

## Operations

**Initialize:** 1) Array of $n$ bits. 2) $k$ hash functions $h_1, ..., h_k$.

* **Insert(x):** For $i = 1,2, ..., k$ set $A[h_i(x)] = 1$
* **Lookup(x):** Return $True \Leftrightarrow A[h_i(x)] = 1$ for every $i = 1, 2, ..., k$
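
A minimal sketch of these two operations follows. Deriving the $k$ hash functions by salting SHA-256 with the index $i$ is an assumption made here for convenience; the notes do not prescribe how $h_1, ..., h_k$ are obtained.

```python
import hashlib


class BloomFilter:
    """Bloom filter over an array of n_bits bits with k hash functions."""

    def __init__(self, n_bits, k):
        self.n_bits = n_bits
        self.k = k
        self.bits = [0] * n_bits

    def _positions(self, item):
        # h_i(item): hash the item together with the index i (illustrative choice)
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def lookup(self, item):
        # True only if all k bits are set; may be a false positive
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(n_bits=1024, k=5)
for word in ("alpha", "beta", "gamma"):
    bf.insert(word)
print(bf.lookup("beta"))
```

Bits are only ever set, never cleared, which is why inserted items are always found and why deletion is not supported in this basic form.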

### Heuristic Analysis

=> Should be a trade-off between space and error (false positives) probability.

**Assume:** All $h_i(x)$'s are uniformly random and independent.

**Given:** Array $A$ with $n$ bits, insert data set $S$ into Bloom Filter.

For each bit of $A$, the probability that it has been set to $1$ is: $1 - (1 - \frac{1}{n})^{k |S|}$

*Note:* Since $1 - \frac{1}{n} \approx e^{-1/n}$ for large $n$, this is approximately $1 - e^{(-k |S|) / n} = 1 - e^{-k / b}$, where $b = \frac{n}{|S|}$ is the number of bits per object.

=> Under assumption, for $x \notin S$, the probability of a false positive (all $k$ bits $A[h_i(x)]$ are set) is:

=> $\approx (1 - e^{-k / b})^k = \text{Error rate } \epsilon$

### How to set k?

For a fixed $b$, $\epsilon$ is minimized by setting: $k \sim (\ln 2) \cdot b$

$\Rightarrow \epsilon \sim (1/2)^{(\ln 2) b}$ (exponentially small in $b$!)

$\Rightarrow b \sim 1.44 \log_2 \frac{1}{\epsilon}$

#### Example:

With $b = 8$, choose $k = 5$ or $k = 6$. Error probability is only 2%!
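
The numbers in this example can be reproduced from the heuristic formula $\epsilon = (1 - e^{-k/b})^k$, reading $b$ as the number of bits per stored object ($b = n / |S|$). The function name below is a hypothetical helper.

```python
import math


def false_positive_rate(b, k):
    """Heuristic Bloom filter error rate for b bits per object and k hash functions."""
    return (1 - math.exp(-k / b)) ** k


b = 8
print("optimal k (real-valued):", math.log(2) * b)
for k in (5, 6):
    print(k, false_positive_rate(b, k))
```

Both $k = 5$ and $k = 6$ give an error rate of roughly 2%, since the real-valued optimum $(\ln 2) \cdot 8 \approx 5.5$ lies between them.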