# Hashing and Symbol Tables
### Hashing
**Hashing** is the process of passing data of arbitrary size to a function and getting back a small value in a limited range. That function is called a **hash function**. The value it returns can then be used as an index into a hash table. We will use hashing to convert strings to integers, although any other data type that can be converted into an integer could be hashed in the same way. Let's look at an example to better understand the concept. We want to hash the expression `hello world`; that is, we want to get a numeric value that we could say *represents* the string.

We can obtain the ordinal value of any character with the `ord()` function. For example, `ord('f')` returns 102. To get the hash of the whole string, we could simply sum the ordinal values of its characters.

In [1]:
sum(map(ord, "hello world"))

1116

The resulting numeric value, 1116, is called the **hash of the string**. The following table shows the ordinal value of each character in `hello world` and how they add up to the hash value `1116`.

| Character | Value |
| :-------- | :---- |
| `h` | 104 |
| `e` | 101 |
| `l` | 108 |
| `l` | 108 |
| `o` | 111 |
| ` ` | 32 |
| `w` | 119 |
| `o` | 111 |
| `r` | 114 |
| `l` | 108 |
| `d` | 100 |
| sum: | 1116 |

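Because addition is commutative, changing the order of the characters yields the same hash value: `world hello` also hashes to 1116. So does `gello xorld`, where `h` and `w` are replaced with `g` and `x`, because one ordinal value drops by one while the other rises by one.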
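These collisions are easy to check. The following is a small sketch; the helper name `simple_hash` is made up for this illustration and simply wraps the sum shown earlier.

```python
def simple_hash(s):
    """Sum the ordinal values of every character in s."""
    return sum(map(ord, s))


# Reordering the words, or swapping h/w for g/x, leaves the total unchanged.
for phrase in ("hello world", "world hello", "gello xorld"):
    print("{}: {}".format(phrase, simple_hash(phrase)))
```

All three lines print the same value, 1116, which is exactly the kind of collision a better hash function should make less likely.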
### Perfect hashing function
A **perfect hashing function** is one that gives a unique hash value for every string. In practice, most hashing functions are imperfect and produce collisions: the function gives the same hash value to more than one string, so it is not a one-to-one mapping between strings and hash values. Hashing functions need to be very fast, and they map an unbounded set of strings onto a limited range of values, so a function that gives a unique hash value for every possible string is normally not achievable. We therefore accept that two or more strings may have the same hash value, and we look for a strategy to resolve collisions rather than for a perfect hash function.

To avoid the collisions of the previous example, we can multiply the ordinal value of each character by a multiplier that increases as we progress through the string. The hash value of the string is then the sum of these multiplied ordinal values.

The implementation of this concept is shown here:

In [2]:
def myhash(s):
    mult = 1
    hv = 0
    for ch in s:
        hv += mult * ord(ch)
        mult += 1
    return hv

for item in {'hello world', 'world hello', 'gello xorld'}:
    print("{}: {}".format(item, myhash(item)))

world hello: 6616
gello xorld: 6742
hello world: 6736


In [3]:
# Still not a perfect hash:
strs = {'ad', 'ga'}

for item in strs:
    print("{}: {}".format(item, myhash(item)))

ga: 297
ad: 297


### Hash tables
A **hash table** is a data structure in which elements are accessed by a key rather than by an index number, unlike in **lists** and **arrays**. The data items are stored as key/value pairs, similar to dictionaries. A hash table uses a hashing function to find the index position where an element should be stored and from which it is later retrieved. This gives us fast lookups, since the index is computed directly from the hash value of the key.

Each position in the hash table is often called a **slot** or **bucket** and can store one element. Each data item, in the form of a `(key, value)` pair, is stored in the hash table at a position decided by the hash value of its key.

To implement the hash table, we start by creating a class to hold hash table items. These need to have a key and a value since our hash table is a `{key: value}` store:

```python
class HashItem:
    def __init__(self, key, value):
        self.key = key
        self.value = value
```

This gives us a very simple way to store items. Next, we start working on the hash table class itself. As usual, we start with a constructor:

```python
class HashTable:
    def __init__(self):
        self.size = 256
        self.slots = [None for i in range(self.size)]
        self.count = 0
```

The hash table uses a standard Python list to store its elements. We set the size of the hash table to 256 to start with; later, we will look at strategies for growing the table as it fills up. The constructor initializes a list of 256 elements, all set to `None`. These are the positions where the elements are to be stored, the slots or buckets. Finally, we add a counter, `count`, for the number of actual elements in the hash table.

It is important to note the difference between the size and the count of a table. The size of a table refers to the number of slots in the table. The count of the table refers to the number of slots that are filled, meaning the number of actual `{key: value}` pairs that have been added to the table.

Next, we add a hashing function to the table. We can reuse the `myhash()` function from earlier, which sums the ordinal values of the characters weighted by their position, with a slight change. Since our hash table has 256 slots, we need a hashing function that returns a value in the range 0 to 255, the valid indices of the list. A good way of doing this is to return the remainder of dividing the hash value by the size of the table, since that remainder is always an integer between 0 and 255.

As the hashing function is only meant to be used internally by the class, we put an underscore at the beginning of the name to indicate this. This is a normal Python convention for indicating that something is meant for internal use. Here is the implementation of the `_hash()` function:
```python
    def _hash(self, key):
        mult = 1
        hv = 0
        for ch in key:
            hv += mult * ord(ch)
            mult += 1
        return hv % self.size
```
For the time being, we are going to assume that keys are strings. We shall discuss how one can use non-string keys later. For now, the `_hash()` function is going to generate the hash value for a string.
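To see the effect of the modulo step in isolation, here is a minimal sketch. The standalone function `slot_for` is a hypothetical helper written for this illustration; it computes the same weighted sum as `_hash()` but takes the table size as a parameter.

```python
def slot_for(key, size=256):
    """Weighted sum of ordinals, reduced to a slot index in range(size)."""
    hv = 0
    for position, ch in enumerate(key, start=1):
        hv += position * ord(ch)
    return hv % size


print(slot_for("ad"))           # 297 % 256 = 41
print(slot_for("ga"))           # 297 % 256 = 41, the same slot as "ad"
print(slot_for("hello world"))  # 6736 % 256 = 80
```

Note that `ad` and `ga` still land in the same slot, so the table needs a way to handle collisions.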
### Storing elements in a hash table
To store elements in the hash table, we add them with the `put()` function and retrieve them with the `get()` function. First, we will look at the implementation of `put()`. We start by wrapping the key and the value in a `HashItem` object and then compute the hash value of the key.

Here is the start of the `put()` function:
```python
    def put(self, key, value):
        item = HashItem(key, value)
        h = self._hash(key)
```
Once we know the hash value of the key, it will be used to find the position where the element should be stored in the hash table. Hence, we need to find an empty slot. We start at the slot that corresponds to the hash value of the key. If that slot is empty, we insert our item there.

However, if the slot is not empty and the key of the item is not the same as our current key, then we have a collision. It means that we have a hash value for the item that is the same as some previously stored item in the table. This is where we need to figure out a way to handle a conflict.

One way of resolving this kind of collision is to find another free slot, starting from the position of the collision; this collision resolution process is called **open addressing**. The simplest approach is to add one to the hash value where the collision occurred and check that slot, wrapping around to the start of the table by taking the result modulo the table size. We repeat this until we find a free slot. This systematic, slot-by-slot way of resolving collisions is called **linear probing**.

Let's consider the following code:
```python
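        # Probe forward until we reach an empty slot or the slot holding this key.
        while self.slots[h] is not None:
            if self.slots[h].key == key:  # compare keys by value, not identity
                break
            h = (h + 1) % self.size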
```
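The preceding loop checks whether the slot at `h` is occupied. If it is occupied by a different key, we move on to the next slot using the method described. When the loop ends, `h` refers either to an empty slot or to the slot that already holds our key. If the slot is empty, we are storing a new element, so we increase the count by one. Finally, we insert the item into the list at that position: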
```python
        if self.slots[h] is None:
            self.count += 1
        self.slots[h] = item
```
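The probing steps can be traced on a plain list. This is only a sketch under simplifying assumptions: entries are stored as `(key, value)` tuples instead of `HashItem` objects, and `slot_for` and `probe_insert` are hypothetical helper names introduced for this illustration.

```python
def slot_for(key, size):
    """Weighted sum of ordinals, reduced to a slot index in range(size)."""
    hv = 0
    for position, ch in enumerate(key, start=1):
        hv += position * ord(ch)
    return hv % size


def probe_insert(slots, key, value):
    """Insert (key, value) using linear probing and return the slot used.

    Assumes the list has at least one free slot; otherwise the loop
    would never terminate.
    """
    h = slot_for(key, len(slots))
    while slots[h] is not None and slots[h][0] != key:
        h = (h + 1) % len(slots)
    slots[h] = (key, value)
    return h


slots = [None] * 256
print(probe_insert(slots, "ad", "do not"))   # 41
print(probe_insert(slots, "ga", "collide"))  # 42, because slot 41 is taken
```

Both keys hash to slot 41, so the second insert moves one position forward to slot 42.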
### Retrieving elements from the hash table
Retrieving an element from the hash table means returning the value stored for a given key. Here, we will discuss the implementation of the retrieval method, `get()`.

First, we compute the hash of the given key. We then look at the slot at that position in the table. If the key stored there matches the key we are looking for, the corresponding `value` is returned. If it does not match, we add 1 to the hash value, wrapping around at the end of the table just as we did when storing the data, and look at the next slot. We keep looking until we find our key or reach an empty slot, which means the key is not in the table.

To implement `get()`, we start by calculating the hash of the key and then probe from that slot as described. Here is the implementation of the `get()` method:
```python
    def get(self, key):
        h = self._hash(key)     # compute the hash for the given key
        while self.slots[h] is not None:
            if self.slots[h].key == key:
                return self.slots[h].value
            h = (h + 1) % self.size
        return None
```
Finally, we return `None` if the key was not found in the table. Another good alternative may be to raise an exception in the case the key does not exist in the table.
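As a sketch of the exception-raising alternative, the lookup below raises `KeyError` for a missing key. It is not the chapter's `get()` method: it works on a plain list of `(key, value)` tuples, the names `slot_for` and `lookup` are hypothetical, and it additionally limits the number of probes so that a completely full table cannot cause an endless loop.

```python
def slot_for(key, size):
    """Weighted sum of ordinals, reduced to a slot index in range(size)."""
    hv = 0
    for position, ch in enumerate(key, start=1):
        hv += position * ord(ch)
    return hv % size


def lookup(slots, key):
    """Return the value stored for key, or raise KeyError if it is absent."""
    size = len(slots)
    h = slot_for(key, size)
    for _ in range(size):          # visit each slot at most once
        entry = slots[h]
        if entry is None:          # an empty slot ends the probe sequence
            break
        if entry[0] == key:
            return entry[1]
        h = (h + 1) % size
    raise KeyError(key)


slots = [None] * 256
slots[41] = ("ad", "do not")
slots[42] = ("ga", "collide")

print(lookup(slots, "ga"))         # collide
try:
    lookup(slots, "worst")
except KeyError as exc:
    print("missing key:", exc)
```

Raising an exception makes it possible to tell a missing key apart from a key whose stored value happens to be `None`.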
### Testing the hash table
To test our hash table, we create a `HashTable`, store a few elements in it, and then try to retrieve them. We will also try to `get()` a key that does not exist. We also use the two strings `ad` and `ga`, which collided under our hashing function, to check that the collision is properly resolved.

In [1]:
class HashItem:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        
    
class HashTable:
    def __init__(self):
        self.size = 256
        self.slots = [None for i in range(self.size)]
        self.count = 0
        
    def _hash(self, key):
        mult = 1
        hv = 0
        for ch in key:
            hv += mult * ord(ch)
            mult += 1
        return hv % self.size
    
    def put(self, key, value):
        item = HashItem(key, value)
        h = self._hash(key)
        
        while self.slots[h] is not None:
            if self.slots[h].key == key:
                break
            h = (h + 1) % self.size
        if self.slots[h] is None:
            self.count += 1
        self.slots[h] = item
    
    def get(self, key):
        h = self._hash(key)
        while self.slots[h] is not None:
            if self.slots[h].key == key:
                return self.slots[h].value
            h = (h + 1) % self.size
        return None
    
    def __setitem__(self, key, value):
        self.put(key, value)
    
    def __getitem__(self, key):
        return self.get(key)

In [3]:
ht = HashTable()
ht.put("good", "eggs")
ht.put("better", "ham")
ht.put("best", "spam")
ht.put("ad", "do not")
ht.put("ga", "collide")

for key in ("good", "better", "best", "worst", "ad", "ga"):
    v = ht.get(key)
    print(v)

eggs
ham
spam
None
do not
collide


### Using `[]` with the hash table
Calling `put()` and `get()` is not very convenient. We would prefer to index our hash table with square brackets, as we do with a list or a dictionary. For example, we would like to be able to write `ht["good"]` instead of `ht.get("good")` to retrieve an element from the table.

This can be done easily with the special methods `__setitem__()` and `__getitem__()`. See the code implemented above:
```python
    def __setitem__(self, key, value):
        return self.put(key, value)
    
    def __getitem__(self, key):
        return self.get(key)
```
The test code now looks like the following:

In [4]:
ht = HashTable()
ht['good'] = 'eggs'
ht['better'] = 'ham'
ht['best'] = 'spam'
ht['ad'] = 'do not'
ht['ga'] = 'collide'

for key in ("good", "better", "best", "worst", "ad", "ga"):
    v = ht.get(key)
    print(v)
    
print("The number of elements is: {}".format(ht.count))

eggs
ham
spam
None
do not
collide
The number of elements is: 5


### Non-string keys
In most real-world applications, strings are used for the keys. However, if necessary, you could use any other Python type. If you create your own class that you want to use as a key, you will need to override the special `__hash__()` method of that class, along with `__eq__()`, so that equal objects produce the same, reliable hash value.

Note that you would still have to calculate the modulo of the hash value and the size of the hash table to get the slot. That calculation should happen in the hash table and not in the key class, since the table knows its own size.
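A minimal sketch of such a key class is shown here. The `Point` class and the `slot_for` helper are hypothetical names used only for this illustration; the key computes its own hash, and the table-side helper reduces it to a slot.

```python
class Point:
    """Hypothetical key class used only for this illustration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Equal points must produce equal hashes, so hash the same
        # fields that __eq__ compares.
        return hash((self.x, self.y))


def slot_for(key, size=256):
    # The table, not the key, reduces the hash to a slot index.
    return hash(key) % size


a = Point(1, 2)
b = Point(1, 2)
print(a == b, slot_for(a) == slot_for(b))  # True True
```

Python's `%` operator returns a non-negative result for a positive table size, so the slot index is valid even when `hash()` returns a negative number.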
### Growing a hash table
In our example, we fixed the hash table size at 256. As we add elements to the hash table, the empty slots fill up, and at some point the table will be full. To avoid such a situation, we can grow the table when it starts to get full.

To decide when to grow the hash table, we compare the size and the count of the table. `size` is the total number of slots and `count` is the number of slots that contain elements, so if `count` is equal to `size`, the table is full. The **load factor** of the hash table indicates what fraction of the available slots is in use, and it is generally used to decide when to expand the table. It is computed by dividing the number of **used** slots by the **total** number of slots in the table:

$\text{load factor} = \frac{n}{k}$, where $n$ is the number of used slots and $k$ is the total number of slots.

As the load factor approaches 1, the table is close to full and we need to grow it. It is better to grow the table before it gets almost full, because the retrieval of elements becomes slow as the table fills up. A load factor of 0.75 is a reasonable threshold at which to grow the table.

The next question is how much we should increase the size of the table. One strategy could be to double the size of the table.
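The following sketch shows one way the doubling strategy could look. It is an illustration rather than the chapter's implementation: the class name `GrowingTable`, the `MAX_LOAD` constant, the small initial size of 8, and the helper methods `_find_slot()` and `_grow()` are all assumptions. Entries are stored as `(key, value)` tuples to keep the example short.

```python
class GrowingTable:
    MAX_LOAD = 0.75  # grow once the load factor exceeds this value

    def __init__(self, size=8):
        self.size = size
        self.slots = [None] * size
        self.count = 0

    def _hash(self, key):
        hv = 0
        for position, ch in enumerate(key, start=1):
            hv += position * ord(ch)
        return hv % self.size

    def _find_slot(self, key):
        # Linear probing: stop at an empty slot or at the slot holding key.
        h = self._hash(key)
        while self.slots[h] is not None and self.slots[h][0] != key:
            h = (h + 1) % self.size
        return h

    def _grow(self):
        # Slot positions depend on the table size, so every existing
        # item has to be re-inserted into the larger list.
        old_items = [entry for entry in self.slots if entry is not None]
        self.size *= 2
        self.slots = [None] * self.size
        self.count = 0
        for key, value in old_items:
            self.put(key, value)

    def put(self, key, value):
        h = self._find_slot(key)
        if self.slots[h] is None:
            self.count += 1
        self.slots[h] = (key, value)
        if self.count / self.size > self.MAX_LOAD:
            self._grow()

    def get(self, key):
        h = self._find_slot(key)
        return None if self.slots[h] is None else self.slots[h][1]


table = GrowingTable()
for i in range(20):
    table.put("key{}".format(i), i)

print(table.size, table.count, table.get("key7"))  # 32 20 7
```

The table doubles from 8 to 16 slots when the seventh item arrives and from 16 to 32 when the thirteenth arrives, so the load factor never stays above 0.75.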
### Open addressing
The collision resolution mechanism we used in our example was linear probing, which is an example of an *open addressing* strategy. Linear probing is simple because it works within a fixed number of slots. There are other open addressing strategies as well; they all share the idea that there is an array of slots. When we want to insert a key, we check whether its slot already holds an item. If it does, we look for the next available slot.

If we have a hash table that contains 256 slots, then 256 is the maximum number of elements it can hold. Moreover, as the load factor increases, it takes longer to find the insertion point for a new element. Because of this, we look at another strategy for resolving collisions, *chaining*.
### Chaining
Chaining is another method to handle the problem of collision in hash tables. It solves this problem by allowing each slot in the hash table to store a reference to many items at the position of a collision. So, at the index of a collision, we are allowed to store many items in the hash table.

When an element is inserted, it will be appended to the list that corresponds to that element's hash value. That is, if you have two elements that both have a hash value of 1075, both of these elements would be added to the list that exists in the `1075 % 256 = 51` slot of the hash table. Chaining then avoids conflict by allowing multiple elements to have the same hash value. Hence, there is no limit on the number of elements that can be stored in a hash table, whereas, in the case of linear probing, we had to fix the size of the table, which we need to later grow when the table is filled up, depending upon the load factor. Moreover, the hash table can hold more values than the number of available slots, since each slot holds a list that can grow.

However, chaining has a problem: it becomes inefficient when the list at a particular hash value location grows long. When a slot holds many items, searching them can get very slow, since we have to do a linear search through the list until we find the element that has the key we want. This slows down retrieval, which is not good, since hash tables are meant to be efficient. The following diagram demonstrates a linear search through list items until we find a match.

So, retrieval is slow when a particular position in a hash table has many entries. This can be addressed by replacing the list with a data structure that supports fast searching and retrieval. A good choice is a **BST**, which provides fast retrieval, as we discussed in the previous chapter. An unbalanced search tree can itself degrade to a linear search, so we need to ensure that our BST is self-balancing.
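Here is a minimal sketch of chaining with plain Python lists as buckets. The class name `ChainedHashTable` and the deliberately tiny table size of 4 are assumptions chosen to force collisions; the buckets here are simple lists, not the self-balancing trees mentioned above.

```python
class ChainedHashTable:
    def __init__(self, size=4):
        self.size = size
        self.slots = [[] for _ in range(size)]  # one list (chain) per slot
        self.count = 0

    def _hash(self, key):
        hv = 0
        for position, ch in enumerate(key, start=1):
            hv += position * ord(ch)
        return hv % self.size

    def put(self, key, value):
        bucket = self.slots[self._hash(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:        # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))        # new key: extend the chain
        self.count += 1

    def get(self, key):
        # Linear search through the chain stored at the key's slot.
        for existing_key, value in self.slots[self._hash(key)]:
            if existing_key == key:
                return value
        return None


table = ChainedHashTable(size=4)
for word in ("good", "better", "best", "ad", "ga", "worst"):
    table.put(word, word.upper())

print(table.count, table.size)  # 6 4
print(table.get("ga"))          # GA
```

The table holds six items in four slots, which open addressing could not do without growing first.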
### Symbol tables
Symbol tables are used by compilers and interpreters to keep track of the symbols that have been declared and to keep information about them. Symbol tables are often built using hash tables since it is important to efficiently retrieve a symbol from the table.

Here is an example:
```python
name = "Joe"
age = 27
```
Here, we have two symbols, `name` and `age`. They belong to a namespace, which could be `__main__`, but it could also be the name of a module if you placed them there. Each symbol has a `value`; for example, the `name` symbol has the value `Joe`, and the `age` symbol has the value `27`. A symbol table allows the compiler or the interpreter to look up these values. So, the `name` and `age` symbols become keys in the hash table, and all of the other information associated with them becomes the `value` of the symbol table entry.

Variables are not the only symbols: functions and classes are also treated as symbols and are added to the symbol table, so that they can be looked up there whenever they need to be accessed. For example, the `greet()` function and two variables are stored in the symbol table in the diagram on page 199.

In Python, each module that is loaded has its own symbol table. The symbol table is given the name of that module. This way, modules act as namespaces. We can have multiple symbols of the same name as long as they exist in different symbol tables, and we can access them through the appropriate symbol table.
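Python's standard library exposes the compiler's symbol tables through the `symtable` module, which gives a quick way to inspect them. The sketch below is only an illustration: the source string, including the body of `greet()`, is made up for the example, and `"<example>"` is an arbitrary file name label.

```python
import symtable

# Hypothetical module source containing two variables and one function.
source = '''
name = "Joe"
age = 27

def greet():
    return "Hello, " + name
'''

# Build the module-level symbol table without executing the code.
table = symtable.symtable(source, "<example>", "exec")

identifiers = sorted(table.get_identifiers())
print(identifiers)                   # ['age', 'greet', 'name']

greet_symbol = table.lookup("greet")
print(greet_symbol.is_namespace())   # True: a function introduces its own namespace
```

Each identifier acts as a key, and the object returned by `lookup()` carries the information recorded about that symbol.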