## Hashing Introduction

Suppose we want to design a system that stores employee records keyed by phone number, and we want the following operations to be efficient:

1. Insert a phone number and corresponding information.
2. Search for a phone number and fetch the information.
3. Delete a phone number and related information.

We can think of using the following data structures to maintain information about different phone numbers. 
 
1. Array of phone numbers and records.
2. Linked List of phone numbers and records.
3. Balanced binary search tree with phone numbers as keys.
4. Direct Access Table.

For **arrays and linked lists**, we need to search in a linear fashion, which becomes costly as the number of records grows. If we use an array and keep the data sorted, a phone number can be found in O(log n) time using binary search, but insert and delete become costly, O(n) in the worst case, because elements must be shifted to maintain sorted order.
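
As a rough illustration of the sorted-array option, the sketch below uses Python's standard `bisect` module. The `SortedPhoneBook` class, its method names, and the sample records are hypothetical and exist only for this example.

```python
import bisect


class SortedPhoneBook:
    """Sorted parallel lists: O(log n) search, O(n) insert and delete."""

    def __init__(self):
        self.numbers = []   # phone numbers, kept in ascending order
        self.records = []   # records[i] belongs to numbers[i]

    def _find(self, number):
        # Binary search for the leftmost position where number could go.
        i = bisect.bisect_left(self.numbers, number)
        found = i < len(self.numbers) and self.numbers[i] == number
        return i, found

    def insert(self, number, record):
        i, found = self._find(number)
        if found:
            self.records[i] = record
        else:
            # list.insert shifts every later element, which costs O(n).
            self.numbers.insert(i, number)
            self.records.insert(i, record)

    def search(self, number):
        i, found = self._find(number)
        return self.records[i] if found else None

    def delete(self, number):
        i, found = self._find(number)
        if found:
            # Deleting also shifts the tail of the list, which costs O(n).
            del self.numbers[i]
            del self.records[i]


book = SortedPhoneBook()
book.insert(9876543210, "Asha")
book.insert(9123456780, "Ravi")
print(book.search(9123456780))  # Ravi
book.delete(9123456780)
print(book.search(9123456780))  # None
```

The lookup itself is logarithmic, but the shifting done by `insert` and `del` is what makes updates linear.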
 

 
With a **balanced binary search tree**, search, insert and delete can all be guaranteed to run in O(log n) time.

Another solution is a direct access table: we make a big array and use phone numbers as indexes into it. An array entry is NIL if the phone number is not present; otherwise it stores a pointer to the record for that phone number. In terms of time complexity this solution is the best of all, since every operation takes O(1) time. For example, to insert a phone number, we create a record with the details of the given phone number, use the phone number as the index, and store a pointer to the created record in the table. 
This solution has practical limitations. The first problem is that the extra space required is huge: if a phone number has n digits, the table needs O(m * 10^n) space, where m is the size of a pointer to a record. Another problem is that an integer type in a programming language may not be able to store n digits.
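
To make the idea concrete, here is a minimal sketch of a direct access table over a deliberately tiny key universe (3-digit "phone numbers"). The class name `DirectAccessTable` and the sample record are assumptions made for illustration; Python's `None` stands in for NIL.

```python
class DirectAccessTable:
    """Direct access table for keys that are n-digit decimal numbers."""

    def __init__(self, digits):
        # One slot per possible key: 10**digits slots, whether used or not.
        self.table = [None] * (10 ** digits)

    def insert(self, number, record):
        self.table[number] = record      # O(1)

    def search(self, number):
        return self.table[number]        # O(1); None plays the role of NIL

    def delete(self, number):
        self.table[number] = None        # O(1)


table = DirectAccessTable(digits=3)      # allocates 1,000 slots
table.insert(421, {"name": "Asha"})
print(table.search(421))                 # {'name': 'Asha'}
print(table.search(422))                 # None
print(len(table.table))                  # 1000

# The same scheme for 10-digit phone numbers would need this many slots:
print(10 ** 10)                          # 10000000000
```

Every operation is a single array access, but the table size grows with the number of possible keys rather than with the number of keys actually stored.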

Because of these limitations, a direct access table cannot always be used. Hashing can be used in almost all such situations and, in practice, performs extremely well compared to the data structures above (array, linked list, balanced BST). With hashing we get O(1) search time on average (under reasonable assumptions) and O(n) in the worst case.

*Hashing is an improvement over the direct access table. The idea is to use a hash function that converts a given phone number, or any other key, to a smaller number, and to use that smaller number as the index in a table called a hash table.*

**Hash Function:** A function that converts a given key, such as a big phone number, to a small practical integer value. The mapped integer value is used as an index in the hash table. In simple terms, a hash function maps a big number or string to a small integer that can be used as an index in the hash table. 
A good hash function should have the following properties:

1. Efficiently computable. 
2. Should uniformly distribute the keys (Each table position equally likely for each key) 

For example, for phone numbers a bad hash function is to take the first three digits, since many numbers tend to share the same prefix. A better function is to take the last three digits. Note that this may not be the best hash function; there may be better ways.
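
The difference between the two choices can be seen on a small experiment. The sketch below is only illustrative: the sample numbers are synthetic (generated from a made-up starting number and step so that they share a common prefix), and the function names `first_three` and `last_three` are hypothetical.

```python
from collections import Counter

# Synthetic sample: 1,000 ten-digit numbers that share a common prefix.
# The starting number and the step are arbitrary values for illustration.
phones = [9845012345 + 7919 * i for i in range(1000)]


def first_three(number):
    # "Bad" hash: the leading three digits.
    return int(str(number)[:3])


def last_three(number):
    # "Better" hash: the trailing three digits.
    return number % 1000


for h in (first_three, last_three):
    used = Counter(h(p) for p in phones)
    print(h.__name__, "distinct slots:", len(used),
          "largest bucket:", max(used.values()))
```

On this sample the leading digits fall into only two slots, while the trailing digits are spread over all 1,000 slots. Real phone data will not be this regular, so the experiment should be read as a demonstration of the tendency rather than a measurement.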

**Hash Table:** An array that stores pointers to records corresponding to a given phone number. An entry in hash table is NIL if no existing phone number has hash function value equal to the index for the entry. 

**Collision Handling:** Since a hash function maps a big key to a small number, two keys may produce the same value. The situation where a newly inserted key maps to an already occupied slot in the hash table is called a collision, and it must be handled using some collision handling technique. The following are ways to handle collisions:
 

- Chaining: The idea is to make each cell of the hash table point to a linked list of records that have the same hash function value. Chaining is simple, but requires additional memory outside the table.
- Open Addressing: In open addressing, all elements are stored in the hash table itself. Each table entry contains either a record or NIL. When searching for an element, we examine the table slots one by one until the desired element is found or it is clear that the element is not in the table.
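
Open addressing can be sketched as follows, assuming linear probing (checking the next slot, wrapping around) as the probe sequence. The class `LinearProbingTable` is hypothetical, and deletion is left out on purpose because it needs extra care (for example, marking removed slots) so that later searches do not stop too early.

```python
class LinearProbingTable:
    """Open addressing with linear probing (fixed size, no deletion)."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size       # each slot: None or (key, value)

    def _probe(self, key):
        # Yield slot indices starting at the home slot, wrapping around.
        start = hash(key) % self.size
        for step in range(self.size):
            yield (start + step) % self.size

    def insert(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def search(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                return None              # empty slot: the key is not present
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None


table = LinearProbingTable(7)
# In CPython, hash(n) == n for small non-negative integers,
# so 10, 17 and 24 all have home slot 3 and must be probed apart.
table.insert(10, "a")
table.insert(17, "b")
table.insert(24, "c")
print(table.slots)
print(table.search(24))   # c
print(table.search(31))   # None
```

All records live in the table itself, at the cost of longer probe sequences as the table fills up.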

### What are Hash Functions and How to choose a good Hash Function?

#### What is a Hash Function? 

A hash function converts a given key, such as a big phone number, to a small practical integer value. The mapped integer value is used as an index in the hash table. In simple terms, a hash function maps a big number or string to a small integer that can be used as the index in the hash table.

#### What is meant by Good Hash Function? 

A good hash function should have the following properties: 

1. Efficiently computable.
2. Should uniformly distribute the keys (Each table position equally likely for each key)

**For example:** For phone numbers, a bad hash function is to take the first three digits. A better function is to take the last three digits. Note that this may not be the best hash function; there may be better ways.

In practice, we can often employ **heuristic techniques** to create a hash function that performs well. Qualitative information about the distribution of the keys may be useful in this design process. In general, a hash function should depend on every single bit of the key, so that two keys that differ in only one bit or one group of bits (regardless of whether the group is at the beginning, end, or middle of the key or present throughout the key) hash into different values. Thus, a hash function that simply extracts a portion of a key is not suitable. Similarly, if two keys are simply digit or character permutations of each other (such as 139 and 319), they should also hash into different values.
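
The permutation point can be illustrated with two toy hash functions. Both functions below are hypothetical examples, and the table size 101 and base 31 are arbitrary choices. No hash function can separate every pair of keys, so the second one only shows how position-dependence helps with permutations such as 139 and 319.

```python
def digit_sum_hash(key, table_size):
    # Ignores digit order, so permutations of the same digits always collide.
    return sum(int(d) for d in str(key)) % table_size


def positional_hash(key, table_size, base=31):
    # Polynomial (Horner-style) hash: each digit's contribution
    # depends on its position in the key.
    h = 0
    for d in str(key):
        h = (h * base + int(d)) % table_size
    return h


for key in (139, 319):
    print(key, digit_sum_hash(key, 101), positional_hash(key, 101))
# 139 13 53
# 319 13 95
```

The digit-sum hash sends both keys to slot 13, while the positional hash places them in different slots.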

Two common heuristic methods are hashing by division (the mod method) and hashing by multiplication:

1. The mod method: 
    - In this method for creating hash functions, we map a key into one of the slots of table by taking the remainder of key divided by table_size. That is, the hash function is 

    ```
    h(key) = key mod table_size

    i.e. key % table_size
    ```

    - Since it requires only a single division operation, hashing by division is quite fast.
    - When using the division method, we usually avoid certain values of table_size. For example, table_size should not be a power of 2: if table_size = 2^p, then h(key) is just the p lowest-order bits of key. Unless we know that all low-order p-bit patterns are equally likely, we are better off designing the hash function to depend on all the bits of the key.
    - The division method usually gives its best results when the table size is prime. However, even if table_size is prime, an additional restriction is called for. If r is the number of possible character codes on a computer, and table_size is a prime such that r % table_size equals 1, then h(key) = key % table_size is simply the sum of the character codes in the key, mod table_size. Keys whose characters are permutations of each other then always collide.
    - Suppose r = 256 and table_size = 17, so that r % table_size = 256 % 17 = 1.
    - For key = 37599 = 146 * 256 + 223, the hash is `37599 % 17 = 12`, which equals (146 + 223) % 17.
    - For key = 573 = 2 * 256 + 61, the hash is also `573 % 17 = 12`, which equals (2 + 61) % 17.
    - Hence, with this hash function, different keys can have the same hash. This is called a collision.
    - A prime not too close to an exact power of 2 is often a good choice for table_size.
2. The multiplication method: 
    - In multiplication method, we multiply the key k by a constant real number c in the range 0 < c < 1 and extract the fractional part of k * c.
    - Then we multiply this value by table_size m and take the floor of the result. It can be represented as
       ```
        h(k) = floor (m * (k * c mod 1))
                             or
        h(k) = floor (m * frac (k * c))
        ```
    - Here floor(x), available in the C standard library header `math.h`, yields the integer part of the real number x, and frac(x) yields the fractional part: frac(x) = x - floor(x).
    - An advantage of the multiplication method is that the value of m is not critical. We typically choose it to be a power of 2 (m = 2^p for some integer p), since the function is then easy to implement on most computers.
    - Suppose that the word size of the machine is w bits and that key fits into a single word.
    - We restrict c to be a fraction of the form s / 2^w, where s is an integer in the range 0 < s < 2^w.
    - We first multiply key by the w-bit integer s = c * 2^w. The result is a 2w-bit value
        ```
        r1 * 2^w + r0

        where r1 = high-order word of the product
              r0 = low-order word of the product
        ```

    - The p-bit hash value h(key) consists of the p most significant bits of r0.

    - Although this method works with any value of the constant c, it works better with some values than with others. Knuth suggests that `c ≈ (sqrt(5) - 1) / 2 = 0.618033988...` is likely to work reasonably well.
    - Suppose key = 123456, p = 14, m = 2^14 = 16384, and w = 32. Adapting Knuth's suggestion, we choose c to be the fraction of the form s / 2^32 that is closest to (sqrt(5) - 1) / 2, so that c = 2654435769 / 2^32. Then key * s = 327706022297664 = (76300 * 2^32) + 17612864, so r1 = 76300 and r0 = 17612864. The 14 most significant bits of r0 yield the value h(key) = 67.
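
The worked numbers for both methods can be checked with a short sketch. The function names are hypothetical, and `byte_sum` treats an integer key as a sequence of base-256 "characters", which is an assumption made to mirror the r = 256 example. The constant `s = 2654435769` is the integer for which s / 2^32 approximates (sqrt(5) - 1) / 2.

```python
def division_hash(key, table_size):
    # The mod method: remainder of key divided by table_size.
    return key % table_size


def byte_sum(key):
    # Sum of the base-256 digits ("characters") of a non-negative key.
    total = 0
    while key:
        total += key % 256
        key //= 256
    return total


def multiplication_hash(key, p, w=32, s=2654435769):
    # key * s is a 2w-bit product r1 * 2**w + r0.
    r0 = (key * s) % (2 ** w)        # low-order word of the product
    return r0 >> (w - p)             # the p most significant bits of r0


# Division method with table_size = 17, where 256 % 17 == 1:
print(division_hash(37599, 17), division_hash(573, 17))   # 12 12
print(byte_sum(37599) % 17, byte_sum(573) % 17)            # 12 12

# Multiplication method with key = 123456, p = 14, w = 32:
print(multiplication_hash(123456, p=14))                   # 67
```

The second line of output confirms that, for this table size, the division hash equals the sum of the key's base-256 digits mod 17, which is why this choice of table_size is best avoided.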

## Hash Map

Hash maps are indexed data structures. A hash map uses a hash function to compute, from a key, an index into an array of buckets or slots. The value is stored in the bucket with the corresponding index. The key is unique and immutable. Think of a hash map as a cabinet whose drawers are labeled with what is stored in them. For example, when storing user information, we can use the email address as the key and map the values for that user, such as the first name and last name, to a bucket.

The hash function is the core of a hash map implementation. It takes the key and translates it to the index of a bucket in the bucket list. Ideally, hashing would produce a different index for each key. However, collisions can occur. When hashing yields an index that is already in use, we can either let the bucket hold multiple key-value pairs in a list, or rehash to find another slot.

In Python, dictionaries are examples of hash maps. We'll implement a hash map from scratch in order to learn how to build and customize such data structures for optimizing search.
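
For comparison, the built-in `dict` already supports insert, lookup and delete by key. The short example below uses only standard `dict` operations; the variable name `records` and the sample values are made up for illustration.

```python
records = {}

# Insert a key-value pair; assigning to an existing key updates its value.
records["gfg@example.com"] = "some value"
records["gfg@example.com"] = "updated value"

# Look up a key, with a default for keys that are not present.
print(records.get("gfg@example.com"))                          # updated value
print(records.get("missing@example.com", "No record found"))   # No record found

# Remove a key; the second argument avoids a KeyError if it is absent.
records.pop("gfg@example.com", None)
print(len(records))                                             # 0
```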

The hash map design will include the following functions:

- set_val(key, value): Inserts a key-value pair into the hash map. If the key already exists in the hash map, updates its value.
- get_val(key): Returns the value to which the specified key is mapped, or “No record found” if this map contains no mapping for the key.
- delete_val(key): Removes the mapping for the specific key if the hash map contains the mapping for the key.

Below is the implementation:


```python
class HashTable:

    # Create an empty bucket list of the given size
    def __init__(self, size):
        self.size = size
        self.hash_table = self.create_buckets()

    def create_buckets(self):
        return [[] for _ in range(self.size)]

    # Insert a key-value pair into the hash map
    def set_val(self, key, val):

        # Get the index from the key using the hash function
        hashed_key = hash(key) % self.size

        # Get the bucket corresponding to the index
        bucket = self.hash_table[hashed_key]

        found_key = False
        for index, record in enumerate(bucket):
            record_key, record_val = record

            # Check if the bucket already holds the key to be inserted
            if record_key == key:
                found_key = True
                break

        # If the key is already in the bucket, update its value;
        # otherwise append the new key-value pair to the bucket
        if found_key:
            bucket[index] = (key, val)
        else:
            bucket.append((key, val))

    # Return the value stored for a specific key
    def get_val(self, key):

        # Get the index from the key using the hash function
        hashed_key = hash(key) % self.size

        # Get the bucket corresponding to the index
        bucket = self.hash_table[hashed_key]

        found_key = False
        for index, record in enumerate(bucket):
            record_key, record_val = record

            # Check if the bucket holds the key being searched
            if record_key == key:
                found_key = True
                break

        # If the key was found, return its value;
        # otherwise indicate that there was no record
        if found_key:
            return record_val
        else:
            return "No record found"

    # Remove the key-value pair with a specific key
    def delete_val(self, key):

        # Get the index from the key using the hash function
        hashed_key = hash(key) % self.size

        # Get the bucket corresponding to the index
        bucket = self.hash_table[hashed_key]

        found_key = False
        for index, record in enumerate(bucket):
            record_key, record_val = record

            # Check if the bucket holds the key to be deleted
            if record_key == key:
                found_key = True
                break
        if found_key:
            bucket.pop(index)

    # Print the buckets of the hash map
    def __str__(self):
        return "".join(str(item) for item in self.hash_table)


hash_table = HashTable(50)

# Insert some values
hash_table.set_val('gfg@example.com', 'some value')
print(hash_table)
print()

hash_table.set_val('portal@example.com', 'some other value')
print(hash_table)
print()

# Search/access a record with a key
print(hash_table.get_val('portal@example.com'))
print()

# Delete or remove a value
hash_table.delete_val('portal@example.com')
print(hash_table)
```

Output (the bucket positions can differ from run to run, because Python randomizes string hashes per process):

```
[][][][][][][][][('gfg@example.com', 'some value')][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][]

[][][][][][][][('portal@example.com', 'some other value')][('gfg@example.com', 'some value')][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][]

some other value

[][][][][][][][][('gfg@example.com', 'some value')][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][]
```


#### Time Complexity:

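Indexing into memory takes constant time, and computing the hash takes constant time. Hence search in a hash map takes constant time, **O(1)**, on average, provided the hash function spreads the keys evenly across the buckets. In the worst case, when every key lands in the same bucket, search degrades to O(n).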
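
The gap between the average and the worst case comes down to how long the buckets get. The sketch below is a simplified stand-in for a chained hash map: the helper names `build_buckets` and `longest_chain` and the two hash functions passed in are all hypothetical.

```python
def build_buckets(keys, size, hash_fn):
    # Distribute keys into `size` chained buckets using hash_fn.
    buckets = [[] for _ in range(size)]
    for key in keys:
        buckets[hash_fn(key) % size].append(key)
    return buckets


def longest_chain(buckets):
    # A search scans one bucket, so the longest bucket bounds the cost.
    return max(len(b) for b in buckets)


keys = range(1000)
spread = build_buckets(keys, 50, hash_fn=lambda k: k)    # keys spread evenly
clumped = build_buckets(keys, 50, hash_fn=lambda k: 0)   # every key collides

print(longest_chain(spread))    # 20
print(longest_chain(clumped))   # 1000
```

With an even spread, a search inspects at most 20 of the 1,000 keys, and that number stays small if the table grows along with the data. With a degenerate hash function, the single bucket behaves like an unsorted list.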