# Hash Tables

## Array with custom index (Associative)

Let's recall what an ordinary array looks like. It is a container that holds some number of elements, where each element has an index - a number by which we (or the array) can find that element. In \[5,4,6,7\] the element 6 has index 2 (as you remember, indices are counted from 0). But what if we want to take the element with index 'car' from the array 'person'? I mean, what if we just want to ask for person\['car'\] and get the model of that person's car? The interface for such an array is called an 'associative array'.

![associative](images/associative.png)

First of all, it is a dynamic array whose memory is not fixed. **You can add elements to it even if it is full.** Secondly, it is actually not an array but a container of ("key", "value") pairs. In simple words, the key is the index of the array and the value... is the value). As with other containers, there are several implementations of the associative array, but most of them share the same interface:


 1. Construct - O(n).
 2. Add element ('key','value') - add an element with index 'key' and the given value to the array. There are no exact time limits for this operation, but in the best-known implementations it takes O(1) on average (hash tables) or O(logN) in the worst case (search trees).
 3. Delete element ('key') - mark the memory of the element with index 'key' as empty. The value is deleted.
 4. Find \[key\] - find the element with index 'key'.

As I said before, there are several implementations of the Associative array. 

The first one is to create a dedicated data structure (a record with a fixed set of named fields) for such arrays. It works quite simply and is easy to describe, but in a sense it is immutable: if you want to add a value with a key that is not described in the structure, you have to create a new structure for such data.

There is a simpler and more effective method - use an array that stores (key, value) pairs as its values. If the array is kept sorted by key, binary search finds any element in O(logN).
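The sorted-array idea can be sketched in a few lines. This is only an illustration under my own assumptions: the class name `SortedPairs` and its methods are made up for this example, and the pairs are stored as two parallel lists (keys and values) so that the standard `bisect` module can search the keys directly.

```python
import bisect


class SortedPairs:
    """Associative array kept as (key, value) pairs sorted by key (illustrative sketch)."""

    def __init__(self):
        self._keys = []    # keys, always kept in sorted order
        self._values = []  # values, aligned with self._keys by position

    def add(self, key, value):
        i = bisect.bisect_left(self._keys, key)    # binary search, O(logN)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value                # key already present: overwrite
        else:
            self._keys.insert(i, key)              # shifting the tail costs O(N)
            self._values.insert(i, value)

    def find(self, key):
        i = bisect.bisect_left(self._keys, key)    # binary search, O(logN)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        raise KeyError(key)


person = SortedPairs()
person.add('name', 'John')
person.add('car', 'Volvo')
print(person.find('car'))  # Volvo
```

Note the trade-off: the search is O(logN), but inserting a new key still has to shift elements, so adding is O(N) in this sketch.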

The third one is to use a search tree that stores a (key, value) pair in every node. We will discuss it in the next lecture.

Finally, we can use hash tables. A hash table literally answers queries like \[Object\], where the object can be an instance of almost any class. The only requirement is that the hash function must be able to work with these objects. Wait, hash function? Yep, here we come to the concept of a hash table.

## Hash table

Ok, formally a hash table is one more container, created as an **array with fast (O(1) on average) access and composite indices** (the ability to use any class as an index). Usually an array uses numbers as indices not because we like numbers (do we like numbers?) but because it is the simplest way to find elements in memory (just add the element number, multiplied by the element size, to the address of the beginning of the array). So how can we find elements in a hash table with a complex key? The answer is the hash function, and it is the main idea of the hash table.

This function takes a key as an argument and returns a number - the real index of the element in an array. So a hash table is an ordinary array where we do not take the indices directly but calculate them from the keys using a hash function. Sounds good, but we need to find such a function. Moreover, this function should give us a unique index for every key. But what if there are millions or billions of possible keys, or more? Say the key is a word of ten or twenty symbols: for twenty lowercase English letters alone there are 26^20 variants. Quite a lot. On top of that, how do we make such a function truly unique if the keys contain no digits and are generally quite complex?

Honestly, there is no way. Because there are so many possible keys, instead of trying to achieve uniqueness we will adjust our array and hash function to minimize such matches. First, let's take some function that turns our keys into numbers. For example, for string keys, let's calculate the sum of the symbol codes using an encoding table. Take the word 'wow' in UTF8:

In [6]:
a = ord('w')
b = ord('o')

print(str(a) + " " + str(b) + " " + str(a+b+a))


119 111 349


In [7]:
# "wow" - 349 
# [0, ......349]

As we can see, each symbol is a number in machine representation, and when we work with strings we actually work with numeric arrays. To find out which number encodes each character, you can look at the encoding table or just ask Python).

![utf8](images/utf8.png)

Above you can see some characters of the UTF8 table. The full Unicode character set has room for more than a million code points (further down the table there are musical symbols and Chinese characters). So now we can calculate a number for each word, but there are many words with an identical sum of characters. What do we do about that? Well, first, our array has a limited size, and it is not a good idea to resize it just because we have two elements with keys 1 and 47482587629842346987698. To solve this, let's remember modulo division. By definition, taking a number a modulo b means taking the remainder of dividing a by b.

**For example**, dividing 15 by 4 gives the remainder 3 (because 4\*3 + 3 = 15). So we have our array size and the key number that we got from the hash function, and we can just calculate the key number modulo the array size, which gives us the actual index in the array. And if the array is not yet filled to the level at which its size has to be increased, we just put the element at that index.
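Put together, the two steps (key to number, number to index) might look like the sketch below. The function name `char_sum_hash` and the table size of 8 are my own choices for the illustration, not something fixed by the method.

```python
def char_sum_hash(key):
    """Illustrative hash function: the sum of the character codes of a string."""
    return sum(ord(ch) for ch in key)


table_size = 8                  # assumed array size for this example
table = [None] * table_size     # None marks an empty cell

key = 'wow'
index = char_sum_hash(key) % table_size   # 349 % 8 == 5
table[index] = (key, 'some value')        # store the pair at the computed index

print(char_sum_hash(key))  # 349
print(index)               # 5
```

Storing the whole (key, value) pair rather than only the value matters later: when two keys land in the same cell, the stored key is the only way to tell them apart.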

![hashBase](images/hashBase.gif)

But now we have another problem - collisions. Sometimes the hash function gives us equal numbers for different strings. And more often, different numbers become equal once we take them modulo the array size. There are two well-known ways to solve this problem.
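Both kinds of collision are easy to reproduce with the character-sum hash. The snippet below is a small demonstration, assuming an array of size 8 and a helper named `char_sum_hash` that I define here only for the example.

```python
def char_sum_hash(key):
    """Illustrative hash function: the sum of the character codes of a string."""
    return sum(ord(ch) for ch in key)


size = 8  # assumed array size

# Kind 1: the same characters in a different order give the same sum.
print(char_sum_hash('wow'), char_sum_hash('oww'))        # 349 349

# Kind 2: different sums that still land in the same cell after the modulo.
print(char_sum_hash('ab'), char_sum_hash('ab') % size)   # 195 3
print(char_sum_hash('ef'), char_sum_hash('ef') % size)   # 203 3
```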

## Separate chaining

The first method is called separate chaining. The main idea is that we turn our array of elements into an array of cells, where each cell stores a list. When two elements get the same hash function value, we just put them both into the list (at the beginning of the list) in the cell whose index equals the hash function result. To get an element, we take the list from the cell whose index equals the hash function value and walk through it, comparing against our original key (the original one, because the hash function value is the same for all of them), until we find the element.

![chaining](images/chaining.png)
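A minimal sketch of separate chaining, assuming string keys and the character-sum hash from earlier. The class name `ChainedTable` and its methods are hypothetical names for this illustration; Python lists stand in for the linked lists in the picture.

```python
class ChainedTable:
    """Hash table with separate chaining (illustrative sketch, string keys only)."""

    def __init__(self, size=8):
        self.cells = [[] for _ in range(size)]   # every cell holds its own list

    def _index(self, key):
        # Character-sum hash taken modulo the number of cells.
        return sum(ord(ch) for ch in key) % len(self.cells)

    def add(self, key, value):
        chain = self.cells[self._index(key)]
        for i, (stored_key, _) in enumerate(chain):
            if stored_key == key:                # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.insert(0, (key, value))            # new pairs go to the beginning

    def find(self, key):
        for stored_key, value in self.cells[self._index(key)]:
            if stored_key == key:                # compare the original keys
                return value
        raise KeyError(key)


table = ChainedTable()
table.add('wow', 1)
table.add('oww', 2)          # same hash as 'wow', so it shares cell 5
print(table.find('wow'), table.find('oww'))   # 1 2
print(table.cells[5])                         # [('oww', 2), ('wow', 1)]
```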

Ok, it looks like a working solution. But let's look at the asymptotics. If too many keys have the same hash function value, we get a long list in one of the cells while other cells stay empty. That costs a lot of memory and quite some time to search for elements: about `O(k)`, where k is the average length of a list. But we still want to keep it around O(1). One idea is to recreate (resize) the array when k becomes too big (for example, when k\*n reaches N, where n is the number of non-empty cells and N is the array size). The second idea is to use all the memory of the array to store the elements and not occupy additional memory with lists. Both of these ideas bring us to the second method.
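The resize idea can be sketched as follows. Everything here is an assumption made for the illustration: the helper names `char_sum_hash` and `rehash`, the starting size of 4, the doubling policy, and the threshold (resize once the number of stored elements reaches the number of cells).

```python
def char_sum_hash(key):
    """Illustrative hash function: the sum of the character codes of a string."""
    return sum(ord(ch) for ch in key)


def rehash(cells, new_size):
    """Build a bigger array of chains and re-insert every stored pair."""
    new_cells = [[] for _ in range(new_size)]
    for chain in cells:
        for key, value in chain:
            # Indices change because the modulus changed.
            new_cells[char_sum_hash(key) % new_size].append((key, value))
    return new_cells


cells = [[] for _ in range(4)]
count = 0
for key in ['ab', 'bc', 'cd', 'de', 'ef']:
    cells[char_sum_hash(key) % len(cells)].append((key, len(key)))
    count += 1
    if count >= len(cells):                 # assumed threshold: elements == cells
        cells = rehash(cells, 2 * len(cells))

print(len(cells))                           # number of cells after resizing
print(max(len(chain) for chain in cells))   # longest chain
```

A single resize costs O(N), but because the size doubles each time, resizes become rarer as the table grows.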

## Open addressing

Instead of creating lists inside the array, we resolve collisions by shifting along the array. For example, if two elements have equal hash function values, we put the first one into the cell whose index equals the hash function value and the next element into the following cell.

![openAdd1](images/openAdd1.png)


As we see in the example above, John and Sandra have the same hash function value, and because of that we put Sandra into the next cell. This method is easy to implement, but it still has problems when we try to add or find an element at the beginning of a long run of filled cells. If the array is 60% or 70% full, a search can take about O(N) steps in the worst case, which is quite a lot.
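Shifting to the next cell (often called linear probing) can be sketched like this. The class name `LinearProbingTable` is hypothetical, the hash is the same character-sum function as before, and deletion is deliberately left out because it needs extra care with this scheme.

```python
class LinearProbingTable:
    """Open addressing with a shift of 1 cell (illustrative sketch, no deletion)."""

    def __init__(self, size=8):
        self.cells = [None] * size               # None marks an empty cell

    def _home(self, key):
        # Character-sum hash taken modulo the array size.
        return sum(ord(ch) for ch in key) % len(self.cells)

    def add(self, key, value):
        size = len(self.cells)
        start = self._home(key)
        for step in range(size):
            i = (start + step) % size            # wrap around at the end
            if self.cells[i] is None or self.cells[i][0] == key:
                self.cells[i] = (key, value)
                return i                         # index where the pair was stored
        raise RuntimeError('table is full')

    def find(self, key):
        size = len(self.cells)
        start = self._home(key)
        for step in range(size):
            i = (start + step) % size
            if self.cells[i] is None:            # empty cell: the key is not stored
                break
            if self.cells[i][0] == key:
                return self.cells[i][1]
        raise KeyError(key)


table = LinearProbingTable()
print(table.add('wow', 'first'))    # 5
print(table.add('oww', 'second'))   # 6, because cell 5 is already taken
print(table.find('oww'))            # second
```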

How can we solve the problem above? Let's try to move away from the colliding element not by 1 step but by q steps: $h_1(s) = (h(s) + q \cdot k) \bmod size$, where k is the number of jumps made so far, and we keep increasing k until we find an empty cell. With this, we are much less likely to form large groups of elements in one place. But now we have a new problem: after a few jumps we can come back to the cell we started from (because when we reach the end of the array, the next step teleports us to the beginning and we continue the search from there).




In [8]:
a = [1, 2, 3, 4, 5, 6, 7, 8]
print(len(a))
print()
i = 2
q = 4
k = 0
for k in range(len(a)):
    index = (i + q*k) % len(a)
    print(a[index])

8

3
7
3
7
3
7
3
7


This way, if those two cells are already occupied, we will never find a place for an element with a hash function value of '2'. Ok, then we should choose q so that it lets us scan all the cells of the array. Let's remember prime numbers. By definition, they have no divisors except 1 and themselves. Take the number 7, for example: we can divide it only by 1 and 7. But what we need is not prime numbers as such, but numbers that are relatively prime (coprime) to the array size. With them, every time we wrap around to the beginning we move through the array along new cells. And that is quite simple to arrange: just take an odd q (because the size is usually chosen as a power of two). Let's try.

In [9]:
a = [1, 2, 3, 4, 5, 6, 7, 8]
print(len(a))
print()
i = 2
q = 3
k = 0
for k in range(len(a)):
    index = (i + q*k) % len(a)
    print(a[index])

8

3
6
1
4
7
2
5
8
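To check the coprimality argument for every step size at once, here is a small experiment. The helper name `visited_cells` is made up for this sketch; `math.gcd` is the standard greatest common divisor function.

```python
from math import gcd


def visited_cells(start, q, size):
    """Set of indices reached from `start` when jumping by q, with wrap-around."""
    return {(start + q * k) % size for k in range(size)}


size = 8
for q in range(1, size):
    cells = visited_cells(2, q, size)
    # Columns: step q, gcd(q, size), number of distinct cells reached.
    print(q, gcd(q, size), len(cells))
```

For size 8 the odd steps (gcd equal to 1) reach all 8 cells, while q = 2 and q = 6 reach only 4 cells and q = 4 only 2. In general the number of cells reached is size divided by gcd(q, size).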


That is better. Still, we can end up with a big group of elements in a small part of the table. It is almost impossible to defeat this completely, just as with choosing a good pivot in quick sort. But there are many different schemes that try to reduce the likelihood of collisions. One of the most famous is `double hashing`. In this method we take not one but two different hash functions. Really different. For example, for a string the first hash function calculates the sum of the elements and the second one takes the middle element multiplied by the string length. And don't forget to make it relatively prime to the table size -> h2(s) = (s\[middle\]\*len(s))\*2 + 1. The resulting hash function is h(s) = (h1(s) + h2(s)\*k) % size, where k is the step.

In [10]:
a = ['ab', 'bc', 'cd', 'de', 'ef', 'fg', 'gh', 'hk']
print(len(a))
print()
key = a[1]                                      # 'bc'
i = ord(key[0]) + ord(key[1])                   # h1: sum of the character codes
q = len(key)*ord(key[len(key) // 2])*2 + 1      # h2: always odd, so coprime with size 8
for k in range(len(a)):
    index = (i + q*k) % len(a)
    print(a[index])

8

fg
cd
hk
ef
bc
gh
de
ab
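The same two functions can drive a complete table. Below is a sketch under several assumptions of mine: the class name `DoubleHashingTable`, non-empty string keys, a table size that is a power of two (so any odd step is coprime with it), and no deletion.

```python
class DoubleHashingTable:
    """Open addressing with double hashing (illustrative sketch, no deletion)."""

    def __init__(self, size=8):
        # Size is assumed to be a power of two, so every odd step visits all cells.
        self.cells = [None] * size

    @staticmethod
    def _h1(key):
        return sum(ord(ch) for ch in key)                    # sum of character codes

    @staticmethod
    def _h2(key):
        return ord(key[len(key) // 2]) * len(key) * 2 + 1    # always odd

    def _probe(self, key):
        size = len(self.cells)
        start, step = self._h1(key), self._h2(key)
        for k in range(size):
            yield (start + step * k) % size                  # (h1 + h2*k) % size

    def add(self, key, value):
        for i in self._probe(key):
            if self.cells[i] is None or self.cells[i][0] == key:
                self.cells[i] = (key, value)
                return i                                     # index actually used
        raise RuntimeError('table is full')

    def find(self, key):
        for i in self._probe(key):
            if self.cells[i] is None:                        # empty cell: not stored
                break
            if self.cells[i][0] == key:
                return self.cells[i][1]
        raise KeyError(key)


table = DoubleHashingTable()
print(table.add('wow', 1))   # 5
print(table.add('oww', 2))   # 0: cell 5 is taken, and the step for 'oww' is 715 % 8 == 3
print(table.find('oww'))     # 2
```

Although 'wow' and 'oww' start from the same cell, their steps differ (the middle characters are different), so their probe sequences split apart after the first collision.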


I almost forgot. Python has a built-in hash table. It is called a dictionary. I am sure you have already worked with them.

In [11]:
a = {'ab':1,
     'bc':2,
     'cd':3,
     'de':4}

print(a['bc'])

2
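The interface operations from the beginning of the lecture map directly onto the dictionary. A short tour using only built-in behaviour (the keys and values are made up for the example):

```python
person = {'name': 'John'}        # construct
person['car'] = 'Volvo'          # add element ('key', 'value')
print(person['car'])             # find of [key] -> Volvo
print('age' in person)           # membership test -> False
del person['car']                # delete element ('key')
print(person.get('car'))         # get() returns None for a missing key
person[(1, 2)] = 'tuple key'     # any hashable object can be a key, not only strings
print(person[(1, 2)])            # tuple key
```

Unhashable objects such as lists cannot be used as keys: `person[[1, 2]] = 'x'` raises `TypeError`.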
