# Hash functions

Suppose we have a universe of $M$ possible items (say, every possible 100-character string). Arrays support constant-time lookup (more or less), but having one array slot for each potential element of the universe is infeasible. If we only want to store roughly $N$ of these $M$ items, where $M$ is much larger than $N$, we can get away with a table of size around $N$.

Each time we store or retrieve an item, we do so via a sort of nickname, a hash value, determined by a hash function.

* For example, we might have H("This hash function idea is brilliant") = 534216
* So we would store the string in position 534216

This has the advantage of being fairly fast, provided the hash function $H$ can be evaluated quickly.
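As a minimal sketch of the idea above, the snippet below maps a string to a table position. The helper name `table_position` and the choice of SHA-256 are assumptions made for illustration only: SHA-256 is used because it gives the same value on every run, not because it is the fastest option for a hash table.

```python
import hashlib

def table_position(item: str, table_size: int) -> int:
    """Map a string to a slot index in range(table_size) (illustrative helper)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    # Interpret the digest as a big integer and reduce it to a table index.
    return int.from_bytes(digest, "big") % table_size

table_size = 1_000_000
table = [None] * table_size

key = "This hash function idea is brilliant"
pos = table_position(key, table_size)
table[pos] = key  # store the string at its hashed position
```

Looking the string up later means recomputing `table_position(key, table_size)` and reading that slot, so the cost is dominated by evaluating the hash function.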

Let's think immediately about what can go wrong:

* The hash function might be slow: perhaps lots of arithmetic operations
* Several items might be hashed to the same position in the table/array, confusing the search; this is called a **collision**

The three main desired properties of a hash function are:

* It spreads items evenly
* It is fast
* A given key must be **consistently** hashed to the same value

Hash functions are deterministic, but we will often argue about their behavior from a random point of view. We can choose the hash function randomly from a family of hash functions.

We'll start by analyzing a special data structure called a **Bloom Filter**.

Consider a stream comprising a sequence of items to insert, with no deletions. You are then asked whether a certain item is present. You must answer "yes" or "no". You want to answer as accurately as possible, but you also want to **save space**.

One method would be to keep a fairly small array $A$ of Boolean values in memory:

* Initially, set all values of $A$ to false
* As each item $x$ arrives, calculate $h(x)$, then set $A[h(x)]$ to true

At the end, if asked whether an item $y$ is present in the stream, you check whether $A[h(y)]$ is true. **Trouble** is, there might have been some other value $z\neq y$ with $h(z)=h(y)$.

One resolution is to have a **family** of $k$ hash functions $h_1, \dots, h_k$. When item $x$ arrives:

* Set **all** of $A[h_1(x)], A[h_2(x)], \dots, A[h_k(x)]$ to true

Later, when item $y$ is queried:

* Check whether **all** of $A[h_1(y)], A[h_2(y)], \dots, A[h_k(y)]$ are true

Again, if $y$ is in the stream, the system will say "yes". But it might also say "yes" for a $y$ that never appeared, if by an unfortunate clash every position that $y$ hashes to was already set to True by other items.
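The insert and query steps above can be sketched as follows. This is only an illustration under stated assumptions: the class name `BloomFilter`, the method names, and the trick of simulating the $k$ hash functions by salting SHA-256 with the function index are all choices made here, not part of the notes.

```python
import hashlib

class BloomFilter:
    """Illustrative Bloom filter: n Boolean cells and k simulated hash functions."""

    def __init__(self, n_cells: int, k_hashes: int) -> None:
        self.n = n_cells
        self.k = k_hashes
        self.cells = [False] * n_cells

    def _positions(self, item: str):
        # Assumption: h_i(item) is simulated by hashing the pair (i, item).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.n

    def add(self, item: str) -> None:
        # Set every cell that the item hashes to.
        for pos in self._positions(item):
            self.cells[pos] = True

    def might_contain(self, item: str) -> bool:
        # "Yes" only if all k cells are set; this can be a false positive.
        return all(self.cells[pos] for pos in self._positions(item))

bf = BloomFilter(n_cells=1000, k_hashes=4)
for name in ["Tamara", "Sarah", "Melissa"]:
    bf.add(name)
```

Every inserted name is reported as present, so there are no false negatives. A name that was never inserted is usually rejected, but `might_contain` can still return `True` for it if all of its cells happen to have been set by other items.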

Let's estimate the probability of this.

Say there are $m$ distinct elements in the stream and $n$ cells in the array. Each item is hashed by $k$ different hash functions.

The probability that a particular cell $i$ in the array remains False after $m$ elements is:

* The probability that the "random" hashed value of each item, under each of the $k$ hash functions, ends up in some location other than $i$
* That is, $(1-\frac{1}{n})^{km}$, which is bounded above (and closely approximated) by our old friend $e^{-km/n}$; call this quantity $p$

Careful analysis shows that the events "bit 1 in the array is True" and "bit 2 in the array is True" are not independent. But we'll assume for now that they are, to simplify the math.

So, if we query an item $z$ that was **not** in fact in the stream, then the probability of (falsely) reporting that it was in the stream is the probability that all $k$ hashed positions are True:

$$(1-(1-\frac{1}{n})^{km})^k\approx(1-e^{-km/n})^k=(1-p)^k$$
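To get a feel for the numbers, the sketch below evaluates both sides of the approximation above. The function names and the parameter values ($n = 10{,}000$ cells, $m = 1{,}000$ items, $k = 7$ hash functions) are illustrative assumptions, and both formulas rely on the independence assumption made earlier.

```python
import math

def fp_rate_exact_form(n: int, m: int, k: int) -> float:
    """(1 - (1 - 1/n)^(k*m))^k, under the independence assumption."""
    return (1.0 - (1.0 - 1.0 / n) ** (k * m)) ** k

def fp_rate_approx(n: int, m: int, k: int) -> float:
    """(1 - e^(-k*m/n))^k, the exponential approximation."""
    return (1.0 - math.exp(-k * m / n)) ** k

exact = fp_rate_exact_form(n=10_000, m=1_000, k=7)
approx = fp_rate_approx(n=10_000, m=1_000, k=7)
print(f"exact form: {exact:.6f}, approximation: {approx:.6f}")
```

With these values the two expressions agree closely, and the estimated false positive probability is a little under 1%.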

Suppose we know the size of the Boolean array and (a good estimate of) the number of **distinct** items in the stream: how do we choose the number of hash functions?

With more hash functions, there's a greater chance of "finding" a False bit when querying an item that wasn't in the stream. On the other hand, with more hash functions, more of the bits get turned True, and each insertion and query takes more time.

If we take the logarithm of the false positive probability, we get $k\ln(1-e^{-km/n})$.

After a bit of calculus, we find that this is minimized when $k=(\ln2)\frac{n}{m}$, and the probability of a false positive becomes $(\frac{1}{2})^{(\ln 2)n/m}\approx 0.6185^{n/m}$.

To see this, let $u=1-e^{-km/n}$, so that $e^{-km/n}=1-u$ and $k=-\frac{n}{m}\ln{(1-u)}$. Then $k\ln(1-e^{-km/n})=-\frac{n}{m}\ln{(1-u)}\ln u$. Differentiating $\ln{(1-u)}\ln u$ with respect to $u$ and setting the result to zero gives $\frac{-\ln{u}}{1-u}+\frac{\ln{(1-u)}}{u}=0$.

Rearranging, we have $u\ln{u}=(1-u)\ln{(1-u)}$. For $0<u<1$ this holds only when $u=1-u$, therefore $u=\frac{1}{2}$, and so $k=-\frac{n}{m}\ln{\frac{1}{2}}=(\ln 2)\frac{n}{m}$.

Another curious fact is that the false positive rate is minimized when $p=\frac{1}{2}$, that is, when we expect about half of the bits in the array to still be False.
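As a quick numerical sanity check of the optimum, the sketch below compares the formula $k=(\ln 2)\frac{n}{m}$ against a brute-force search over integer values of $k$. The function name and the values $n = 10{,}000$, $m = 1{,}000$ are illustrative assumptions.

```python
import math

def fp_rate(n: int, m: int, k: float) -> float:
    """Approximate false positive probability (1 - e^(-k*m/n))^k."""
    return (1.0 - math.exp(-k * m / n)) ** k

n, m = 10_000, 1_000

# Optimum from the calculus argument; in general not an integer.
k_star = math.log(2) * n / m

# Brute-force search over integer numbers of hash functions.
best_k = min(range(1, 31), key=lambda k: fp_rate(n, m, k))

print(f"k* = {k_star:.3f}, best integer k = {best_k}")
print(f"rate at k* = {fp_rate(n, m, k_star):.6f}")
print(f"0.6185^(n/m) = {0.6185 ** (n / m):.6f}")
```

Since the number of hash functions must be an integer, in practice one would round $k^*$ to a nearby integer; here the best integer choice is the one closest to $k^*$.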

# Frequent items

Let's take a short breather from the randomized approaches, and focus on a deterministic algorithm for finding frequent items in a stream.

Say we have a sequence of $m$ items, and we want to record those that occur more than $m/k$ times. A fundamental setting is determining a **majority** element (the case $k=2$).

Consider the sequence "Tamara, Sarah, Melissa, Sarah, Emily, Sarah, Sarah"

The name "Sarah" occurs four times out of the seven names in the sequence, a majority. How can a computer determine this really efficiently?

## Misra-Gries algorithm

We keep track of just $k-1$ items, with a counter for each: a very small amount of space!

```
For each new item x:
    If x is a tracked item:
        Increment its counter
    Else if fewer than k-1 items are tracked:
        Add x to the tracked items, with a count of 1
    Else: // we already have k-1 tracked items and x is not one of them
        Decrement the count of every tracked item
        Evict every tracked item that now has count 0
```

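At the end of the stream, return all tracked items. Every item that occurs more than $m/k$ times is guaranteed to be among them, although some tracked items may be less frequent than that.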

The space required is proportional to the product of $k$ and $\max\{\log(\text{universe size}), \log(\text{stream length})\}$, since for each tracked item we need to record the item itself (a value up to the universe size) as well as its count (a value up to the stream length).

We only need one pass through the data.
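The Misra-Gries procedure can be sketched in Python as follows, applied to the sequence of names from earlier with $k=2$ (a single counter, which is the majority setting). The function name `misra_gries` and the use of a dictionary for the counters are choices made here for illustration.

```python
def misra_gries(stream, k):
    """Return the tracked items and their counters after one pass, using at most k-1 counters."""
    counters = {}
    for x in stream:
        if x in counters:
            # x is already tracked: increment its counter.
            counters[x] += 1
        elif len(counters) < k - 1:
            # There is a free slot: start tracking x.
            counters[x] = 1
        else:
            # All k-1 slots are taken and x is not tracked:
            # decrement every counter and evict those that reach 0.
            for item in list(counters):
                counters[item] -= 1
                if counters[item] == 0:
                    del counters[item]
    return counters

names = ["Tamara", "Sarah", "Melissa", "Sarah", "Emily", "Sarah", "Sarah"]
candidates = misra_gries(names, k=2)
print(candidates)  # {'Sarah': 1}
```

The surviving counter value (1) is lower than Sarah's true frequency (4), because decrements along the way reduce it. If exact counts are needed, or if we want to confirm that a tracked item really is frequent, a second pass that counts only the tracked items would do it.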