# Lecture 4 Hash Functions
__Math 3280: Data Mining__

__Outline__
1. Basics of Hash Functions
2. When two entries get the same index
3. Tips on Hash Functions
4. Hash Functions with Text
5. Intro to MapReduce and Supercomputers

__Reading__ 
* Leskovec, Section 1.3.2

## Basics of Hash Functions

When we need to search for a particular value, we could simply go through all the values until we find the one we want. This is called a __linear search__. For small datasets, this works just fine. But for large datasets, this is inefficient.
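
As a minimal sketch of a linear search in Python (the function name `linear_search` and the sample IDs are chosen here for illustration only):

```python
def linear_search(values, target):
    """Return the index of target in values, or -1 if it is absent."""
    for index, value in enumerate(values):
        if value == target:
            return index
    return -1

# Sample patient IDs (illustrative values).
patient_ids = [100, 186, 152, 199, 103, 127, 175, 131, 114, 148]

# 131 is the 8th entry, so 8 comparisons are needed before it is found.
print(linear_search(patient_ids, 131))  # 7
```

In the worst case, every entry is checked, so the cost grows in proportion to the size of the dataset.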

A __hash function__ takes some key value related to the data and produces a __bucket number__, or a __hash-key__. That is, we take something intuitive about the data (ID, name, timestamp,...) and do some calculation on it to determine what bucket, or place in our array, the data should be stored. Then when we want to recall that data, we do the same calculation, and we know exactly where that data is stored.

The following three examples demonstrate how one common hash function works and present a potential issue.

*Example 1*:
> You have data for 10 patients that you want to store in the database.
> * Their IDs are:
>   * [100, 186, 152, 199, 103, 127, 175, 131, 114, 148]
> * To determine the bucket to store the data in (the hash-key), take the modulus of each ID with the number of elements (10)
> $$f_h(x) = x \% n$$
>   * [0, 6, 2, 9, 3, 7, 5, 1, 4, 8]
> * Store the data:
>   * `ID = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]`
>   * `ID = [100, ___, ___, ___, ___, ___, ___, ___, ___, ___]`
>   * `ID = [100, ___, ___, ___, ___, ___, 186, ___, ___, ___]`
>   * `ID = [100, ___, 152, ___, ___, ___, 186, ___, ___, ___]`
>   * `ID = [100, ___, 152, ___, ___, ___, 186, ___, ___, 199]`
>   * `ID = [100, ___, 152, 103, ___, ___, 186, ___, ___, 199]`
>   * `ID = [100, ___, 152, 103, ___, ___, 186, 127, ___, 199]`
>   * `ID = [100, ___, 152, 103, ___, 175, 186, 127, ___, 199]`
>   * `ID = [100, 131, 152, 103, ___, 175, 186, 127, ___, 199]`
>   * `ID = [100, 131, 152, 103, 114, 175, 186, 127, ___, 199]`
>   * `ID = [100, 131, 152, 103, 114, 175, 186, 127, 148, 199]`
> * If you want patient 186, take the modulus $f_h(186) = 186 \% 10 = 6$. The data is in bucket 6 for all lists.
>   * `ID[6] = 186`, `name[6]`, `weight[6]`, ...
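
Example 1 can be sketched in Python as follows. This is only an illustration: the names `bucket_for` and `id_table` are ours, and the sketch assumes, as in Example 1, that no two IDs land in the same bucket (a repeated bucket would simply be overwritten here).

```python
def bucket_for(key, n):
    """Modulo hash: map an integer key to a bucket index in 0..n-1."""
    return key % n

ids = [100, 186, 152, 199, 103, 127, 175, 131, 114, 148]
n = len(ids)

# One slot per bucket; None marks an empty bucket.
id_table = [None] * n
for patient_id in ids:
    id_table[bucket_for(patient_id, n)] = patient_id

print(id_table)  # [100, 131, 152, 103, 114, 175, 186, 127, 148, 199]

# Lookup repeats the same calculation instead of scanning the list.
slot = bucket_for(186, n)
print(slot, id_table[slot])  # 6 186
```

The same `slot` would index any parallel lists (names, weights, and so on) that store the rest of each patient's record.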

## When two entries get the same index

Sometimes, our hash function will cause two or more entries to receive the same hash-key; this is called a __collision__. For example, $f_h(114) = 114\%10 = 4$ and $f_h(124) = 124\%10 = 4$.

We start by going to the bucket indicated by our hash-key, just as before. If that bucket is already filled, move to the next index. Sometimes, you may have to advance multiple indices before finding an empty bucket. This strategy is commonly known as __linear probing__.

When we recall the information, then the index from our hash-key becomes a starting point for a linear search. If the correct entry is in the bucket from our calculation, then no search is required. If the correct entry is not in the bucket from our calculation, then we look at the next bucket, then the next, and so on until we find the right information.

*Example 2*:

This example follows the same procedure as Example 1, but notice that some of the calculations repeat bucket numbers:
> You have data for 10 patients that you want to store in the database.
> * Their IDs are:
>   * [245, 287, 261, 295, 233, 209, 276, 284, 260, 221]
> * To determine the bucket to store the data in, take the modulus of each ID with the number of elements (10)
>   * [5, 7, 1, 5, 3, 9, 6, 4, 0, 1]
> * Store the data:
>   * `ID = [___, ___, ___, ___, ___, 245, ___, ___, ___, ___]`
>   * `ID = [___, ___, ___, ___, ___, 245, ___, 287, ___, ___]`
>   * `ID = [___, 261, ___, ___, ___, 245, ___, 287, ___, ___]`
> * The next is 295 going into bucket 5. But bucket 5 is already filled. So, fill the next bucket.
>   * `ID = [___, 261, ___, ___, ___, 245, 295, 287, ___, ___]`
>   * `ID = [___, 261, ___, 233, ___, 245, 295, 287, ___, ___]`
>   * `ID = [___, 261, ___, 233, ___, 245, 295, 287, ___, 209]`
> * The next is 276 going into bucket 6. But bucket 6 is already filled. So, go to the next bucket, but that is also filled. Just keep going and fill the next available bucket.
>   * `ID = [___, 261, ___, 233, ___, 245, 295, 287, 276, 209]`
>   * `ID = [___, 261, ___, 233, 284, 245, 295, 287, 276, 209]`
>   * `ID = [260, 261, ___, 233, 284, 245, 295, 287, 276, 209]`
> * The next is 221 going into bucket 1. But bucket 1 is already filled. So, fill the next bucket.
>   * `ID = [260, 261, 221, 233, 284, 245, 295, 287, 276, 209]`
> * If you want patient 233, take the modulus $f_h(233) = 233 \% 10 = 3$. The data is in bucket 3 for all lists.
>   * `ID[3] = 233`, `name[3]`, `weight[3]`, ...
> * If you want patient 276, take the modulus $f_h(276) = 276 \% 10 = 6$. But this time, the data isn't in bucket 6. Go to bucket 6 and start a linear search from there.
>   * `ID[6] = 295`
>   * `ID[7] = 287`
>   * `ID[8] = 276` is a match!
>   * The data is in bucket 8 for all lists.
>   * `ID[8] = 276`, `name[8]`, `weight[8]`, ...
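
The store-and-recall procedure from Example 2 can be sketched in Python as follows. The function names `insert` and `find` are ours. Two details are assumptions that the example does not spell out: the search wraps around to bucket 0 after the last bucket, and entries are never deleted (so an empty bucket means the key is absent).

```python
def insert(table, key):
    """Place key in the first empty bucket at or after key % n (wrapping around)."""
    n = len(table)
    start = key % n
    for step in range(n):
        slot = (start + step) % n
        if table[slot] is None:
            table[slot] = key
            return slot
    raise ValueError("table is full")

def find(table, key):
    """Return (slot, buckets_checked); slot is None if the key is absent."""
    n = len(table)
    start = key % n
    for step in range(n):
        slot = (start + step) % n
        if table[slot] is None:   # empty bucket: the key was never stored
            return None, step + 1
        if table[slot] == key:
            return slot, step + 1
    return None, n

ids = [245, 287, 261, 295, 233, 209, 276, 284, 260, 221]
table = [None] * len(ids)
for patient_id in ids:
    insert(table, patient_id)

print(table)             # [260, 261, 221, 233, 284, 245, 295, 287, 276, 209]
print(find(table, 233))  # (3, 1): found in its own bucket
print(find(table, 276))  # (8, 3): buckets 6, 7, and 8 were checked
```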


## Tips on Hash Functions

Because a hash function can assign the same hash-key to more than one entry, a few tips help reduce how often this happens.
1. Make the array larger than it needs to be
    * If the array is larger, then there are more possible hash-keys, reducing the chance of repeats
    * If $f_h(x)$ does repeat a hash-key, an empty bucket is more likely to be nearby, which shortens the linear search when one is needed
2. Make the array size ($n$) a prime number
    * With a prime number of buckets, values that follow a regular pattern are less likely to pile up in the same few buckets
    * If $n$ shares a common factor with most of the values being hashed, the resulting hash-keys are distributed nonrandomly across the buckets, so a prime number of buckets is preferred

    > Suppose your population consists only of even numbers. If $n=10$, then the only buckets that can be filled directly are $0, 2, 4, 6,$ and $8$. However, if we choose $n=11$, then the even integers are spread evenly, with a probability of $1/11$ for each bucket.

    * Be sure to consider the case when the prime number $n$ is a factor of most values in your population. If this is the case, just choose a different prime number.
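
The effect of a prime array size can be checked with a short Python sketch. The helper `bucket_counts` and the sample key ranges are illustrative choices of ours, not part of the reading.

```python
from collections import Counter

def bucket_counts(keys, n):
    """Count how many keys land in each of the n buckets under key % n."""
    counts = Counter(key % n for key in keys)
    return [counts[bucket] for bucket in range(n)]

even_keys = range(0, 220, 2)  # 110 even keys: 0, 2, ..., 218

# n = 10 shares the factor 2 with every key, so odd buckets stay empty.
print(bucket_counts(even_keys, 10))  # [22, 0, 22, 0, 22, 0, 22, 0, 22, 0]

# n = 11 is prime, and the same keys spread evenly.
print(bucket_counts(even_keys, 11))  # [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]

# Caveat: if the prime itself divides the keys, everything lands in bucket 0.
multiples_of_11 = range(0, 110, 11)  # 0, 11, ..., 99
print(bucket_counts(multiples_of_11, 11))  # [10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```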

## Hash Functions with Text

When we are dealing with text, we have to find a way to convert text into numerical values. A simple example would be to convert each letter in the text into its appropriate ASCII code.

*Example 3*:

In this example, we use names instead of IDs.
> You have data for 5 patients that you want to store in the database.
> * Their names are:
>   * [Jon, Sue, Sam, Dan, Ted]
> * Create a numerical value by adding the ASCII codes for each character in the name. Then take the modulus of that result with the number of patients (5).
>   * 'Jon' = 74 + 111 + 110 = 295 --> 295 mod 5 = 0
>   * 'Sue' = 83 + 117 + 101 = 301 --> 301 mod 5 = 1
>   * 'Sam' = 83 +  97 + 109 = 289 --> 289 mod 5 = 4
>   * 'Dan' = 68 +  97 + 110 = 275 --> 275 mod 5 = 0
>   * 'Ted' = 84 + 101 + 100 = 285 --> 285 mod 5 = 0
> * Store the data:
>   * `ID = [Jon, ___, ___, ___, ___]`
>   * `ID = [Jon, Sue, ___, ___, ___]`
>   * `ID = [Jon, Sue, ___, ___, Sam]`
>   * `ID = [Jon, Sue, Dan, ___, Sam]`
>   * `ID = [Jon, Sue, Dan, Ted, Sam]`

Notice that in this last example, finding the record for Ted takes four tests (buckets 0, 1, 2, and 3), about as many as a plain linear search. Some data points will have this issue. For the most part, though, this is a very straightforward hash function that simplifies the search process: on the whole, fewer tests are needed to find a name, and the savings matter most for large datasets.
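
Example 3 can be reproduced with a short Python sketch. The names `name_hash` and `tests_needed` are ours; `ord` returns a character's Unicode code point, which matches the ASCII code for these letters. As an assumption not stated in the example, the search for an empty bucket wraps around to bucket 0 at the end of the array.

```python
def name_hash(name, n):
    """Sum the character codes of name, then reduce modulo n."""
    return sum(ord(ch) for ch in name) % n

names = ["Jon", "Sue", "Sam", "Dan", "Ted"]
n = len(names)
table = [None] * n
tests_needed = {}  # buckets examined before each name found its place

for name in names:
    slot = name_hash(name, n)
    tests = 1
    while table[slot] is not None:  # bucket taken: step to the next one
        slot = (slot + 1) % n
        tests += 1
    table[slot] = name
    tests_needed[name] = tests

print(table)         # ['Jon', 'Sue', 'Dan', 'Ted', 'Sam']
print(tests_needed)  # {'Jon': 1, 'Sue': 1, 'Sam': 1, 'Dan': 3, 'Ted': 4}
```

Because nothing is ever removed from the table, looking up a name later examines the same buckets that were examined when it was stored, so `tests_needed` also gives the cost of each lookup.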

## Intro to MapReduce and Supercomputers

In this section, we look at the computing requirements when dealing with big data.

When dealing with large computations, such as large-scale models, one computer will not be enough.
> As an undergraduate, I created a model of air pollution in North Salt Lake. It was a 24-hour model covering a 100-km$^2$ area. It took well over 1 hour on my computer to get the results. Imagine how much more time it would have taken as a 3-dimensional model covering the entire planet... By the time my computer finished a forecast model, the event being forecasted would have happened weeks ago.

To handle large computations, we utilize __parallel processing__, where several processors are linked together and work on parts of the problem simultaneously. This helps the calculations to complete in far less time.
* Each processor is called a __node__.
* The collection of nodes is called a __supercomputer__.

In data science, however, we are not only dealing with large computations, but with large amounts of data as well. 
* For example, large-scale Web services, such as Google or Amazon, are continually dealing with large amounts of data and customer interactions.

To handle this, we use not only the processors on each node, but the storage space as well. 
* These systems are known as __computing clusters__.
* The software to manage the data and queries is a __distributed file system__.

### How to cool a Supercomputer or Computing Cluster
Computers burning out due to excessive heat is a major concern. With that many computers running all at once, the temperature in the room rises rapidly.
* Most computing clusters are stored in independent rooms with extremely powerful air conditioning units
  * If the AC goes out, the computer must be shut down immediately
* Some large companies have developed other methods
  * Immersion Cooling - Using fluids instead of fans and air as a coolant
    * https://submer.com/blog/what-is-immersion-cooling/
    * https://en.wikipedia.org/wiki/Aquasar
    * https://www.datacenterknowledge.com/sustainability/submerged-supercomputer-named-world-s-most-efficient-system-in-green-500