### Hash Function
A hash function is any function that maps data of arbitrary size to a fixed-size value. The value returned by such a function is called a *hash value*, *hash*, or *digest*. Hash values are commonly used in conjunction with a *hash table*. A good hash function has the following properties:
- it is deterministic: it always returns the same hash value for the same input
- equal inputs therefore have the same hash; unequal inputs should, as far as possible, have different hashes
- it is uniform: it distributes hash values evenly over its range
- it produces fixed-size output
- for security-sensitive uses it should be non-invertible, i.e. given a hash one cannot feasibly determine the input used to generate it
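
Some of these properties can be observed on a real hash function. The sketch below uses SHA-256 from Python's standard `hashlib` module purely as an illustration; it is not the function discussed in this section, and `digest_of` is a hypothetical helper name.

```python
import hashlib

def digest_of(text):
    # SHA-256 always yields 32 bytes, i.e. 64 hexadecimal characters
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

short = digest_of("a")
long_ = digest_of("a" * 10_000)

print(len(short), len(long_))            # output size does not depend on input size
print(digest_of("a") == short)           # same input, same hash
print(digest_of("a") != digest_of("b"))  # different inputs, different hashes here
```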

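A sample hash function: in Java, a string's hash is calculated in the following manner (the Python version below uses unbounded integers, so it does not reproduce Java's 32-bit overflow):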

In [1]:
def hash_string(input):
    hash = 0; j = 1;
    for i in input:
        hash += ord(i)*(31**(len(input)-j))
        j += 1
    return hash

hash_string('ABC')

64578

### Hash Table
A hash table is a data structure that maps keys to values. The key is used to calculate a hash, and that hash acts as the index at which the value is stored. The image below represents a hash table:

![hash table](https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/Hash_table_3_1_1_0_1_0_0_SP.svg/315px-Hash_table_3_1_1_0_1_0_0_SP.svg.png)

### Collisions
It is possible that different inputs may have the same hash, for example,

In [2]:
print('Hash for Aa is ', hash_string('Aa'))
print('Hash for BB is ', hash_string('BB'))

Hash for Aa is  2112
Hash for BB is  2112


There are several methods to resolve collisions. In a typical hash table the index, $index = f(key, array\_size)$, is calculated in the following two steps:
$$hash = hash\_func(key)$$
$$index = hash \% array\_size$$

The **load_factor** of a hash table is $load\_factor = \frac{n}{k}$, where $n$ is the number of occupied entries and $k$ is the total number of buckets (array_size).
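
As a small worked example of the two formulas above, the sketch below maps the hash values computed earlier (`64578` for `'ABC'` and `2112` for `'Aa'`) onto a table of 16 buckets. The function names `bucket_index` and `load_factor` and the table size are assumptions made for illustration.

```python
def bucket_index(key_hash, array_size):
    # Second step of the index calculation: reduce the hash to a valid index
    return key_hash % array_size

def load_factor(occupied, array_size):
    # n / k: occupied entries divided by the number of buckets
    return occupied / array_size

array_size = 16
print(bucket_index(64578, array_size))  # 2
print(bucket_index(2112, array_size))   # 0
print(load_factor(14, array_size))      # 0.875
```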

The following methods are used to resolve collisions:
- **separate chaining:** each bucket contains a linked list of all entries whose keys map to that bucket. The cost of a lookup depends on the average number of keys per bucket. The worst case occurs when all the items are stored in the same bucket, which is equivalent to searching a plain list. Other data structures, such as a self-balancing BST, can be used instead of a linked list.
![separate chaining](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/Hash_table_5_0_1_1_1_1_1_LL.svg/450px-Hash_table_5_0_1_1_1_1_1_LL.svg.png)

- **open addressing:** in case of a collision, we probe other buckets until an empty bucket is found. The drawback is that the number of entries that can be stored is at most the number of buckets. The next bucket can be found in the following ways:
    - **linear probing:** look at the next bucket, then the one after it, and so on until a vacant bucket is found
    - **quadratic probing:** the offset from the original index grows quadratically with the number of attempts, e.g. $+1, +4, +9, \ldots$
    - **double hashing:** the step between probes is computed by a second hash function applied to the key
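
The three probing strategies differ only in how the index of the $i$-th attempt is derived. A minimal sketch follows, assuming a table of 11 buckets, an initial hash of 7 and a second hash value of 3; all function names and numbers are illustrative.

```python
def linear_probe(h, i, size):
    # i-th attempt: step one bucket at a time
    return (h + i) % size

def quadratic_probe(h, i, size):
    # i-th attempt: offset grows as i squared
    return (h + i * i) % size

def double_hash_probe(h, h2, i, size):
    # i-th attempt: step size h2 comes from a second hash of the key;
    # h2 must be non-zero and should be coprime with size
    return (h + i * h2) % size

size = 11
h = 7
print([linear_probe(h, i, size) for i in range(5)])          # [7, 8, 9, 10, 0]
print([quadratic_probe(h, i, size) for i in range(5)])       # [7, 8, 0, 5, 1]
print([double_hash_probe(h, 3, i, size) for i in range(5)])  # [7, 10, 2, 5, 8]
```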

### Chained Hash Table
![Chained Hash Table](https://i.imgur.com/XVb6Pnn.png)

In the above example, $n = 14$ is the current occupancy, whereas $t = 16$ is the size of the array. The hash value of a data item $x$ is $y = hash(x)$, with $y \in \{0, 1, \ldots, t-1\}$.  
It is necessary to make sure the lists do not get long, so we maintain $n \le t$.

```java
List<T>[] table;
int n;	// Number of items in the table

public boolean add(T x) {
    // Return false if the element already exists
    if(find(x))
        return false;

    // We follow the below rule so that
    // the lists do not become too long.
    // Or we can resize based on predecided
    // load factor.
    if(n+1 > table.length)
        resize(); // This method resizes the table array
                  // and reinserts all values

    table[hash(x)].add(x);
    n++;
    return true;
}

public T remove(T x) {
    // Iterator is used because we are modifying list
    // during iteration
    Iterator<T> iterator = table[hash(x)].iterator();
    while(iterator.hasNext()) {
        T temp = iterator.next();
        if(temp.equals(x)) {
            iterator.remove();
            n--;
            return temp;
        }
    }
    return null;
}

public boolean find(T x) {
    for(T t: table[hash(x)]) {
        if(t.equals(x)) {
            return true;
        }
    }
    return false;
}
```
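
For readers following the Python cells, the same structure can be sketched in Python. This is only an illustration under stated assumptions: the class name `ChainedHashSet` is hypothetical, Python's built-in `hash()` stands in for the hash function, and `_resize` simply doubles the array, which is one possible resizing policy.

```python
class ChainedHashSet:
    def __init__(self, size=4):
        self.table = [[] for _ in range(size)]  # one list per bucket
        self.n = 0                              # number of items stored

    def _bucket(self, x):
        # Built-in hash() reduced modulo the current array size
        return self.table[hash(x) % len(self.table)]

    def find(self, x):
        return x in self._bucket(x)

    def add(self, x):
        if self.find(x):
            return False
        # Keep n <= len(table) so the lists stay short
        if self.n + 1 > len(self.table):
            self._resize()
        self._bucket(x).append(x)
        self.n += 1
        return True

    def remove(self, x):
        bucket = self._bucket(x)
        if x in bucket:
            bucket.remove(x)
            self.n -= 1
            return x
        return None

    def _resize(self):
        # Double the array and reinsert every item
        items = [x for bucket in self.table for x in bucket]
        self.table = [[] for _ in range(2 * len(self.table))]
        for x in items:
            self._bucket(x).append(x)

s = ChainedHashSet()
for v in range(10):
    s.add(v)
print(s.n, len(s.table))  # 10 16
```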

### Linear Hash Table
In a linear hash table, we follow this process to insert a value $x$:
1. Check whether the position $i = hash(x)$ is vacant; if so, insert the value at that index
2. If that position is already occupied, try to store the value at $(i + 1)\ mod\ table.length$
3. If the index from the previous step is also occupied, go to $(i + 2)\ mod\ table.length$
4. Keep incrementing until a vacant position is found

Whenever we remove an item from the hash table, we replace it with a dummy value `del`. This `del` item indicates that the index was previously occupied, so searches must continue past it. In total, we store three different types of values:
- data values: actual values in the USet that we are representing
- null values: at array locations where no data has ever been stored; and
- del values: at array locations where data was once stored but that has since been deleted.  

In a linear hash table we maintain that $table.length \ge 2q$, where $q$ is the number of data or del values.

```java
T[] table;
int n; // total number of filled spots
int q; // total number of filled or del spots
T del = (T) new Object();

public boolean find(T x) {
    int i = hash(x);

    // Probe until an empty (null) slot is reached. Because
    // table.length >= 2q, at least one null slot always exists,
    // so the loop terminates. Checking for null first also
    // avoids calling equals() on an empty slot.
    while(table[i] != null) {
        if(table[i] != del && x.equals(table[i]))
            return true;
        i = (i + 1)%(table.length);
    }

    return false;
}

public boolean add(T x) {
    if(find(x))
        return false;

    if(table.length < 2*(q+1))
        resize();

    int i = hash(x);
    // Stop at the first slot that is either empty or del
    while(table[i] != null && table[i] != del) {
        i = (i + 1)%table.length;
    }

    if(table[i] == null)
        q++;
    n++;

    table[i] = x;
    return true;
}

public T remove(T x) {
    int i = hash(x);

    // Probe until an empty (null) slot is reached; del slots
    // are skipped so that items stored past them stay reachable
    while(table[i] != null) {
        T y = table[i];
        if(y != del && x.equals(y)) {
            table[i] = del;
            n--; // q is unchanged: the slot still counts as del
            if(8*n < table.length)
                resize();
            return y;
        }
        i = (i + 1)%(table.length);
    }

    return null;
}
```
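
The role of the `del` marker is easiest to see on a small example. The sketch below is a Python illustration under stated assumptions: `LinearHashSet` and `DEL` are hypothetical names, the built-in `hash()` stands in for the hash function, and the shrinking step (`8*n < table.length`) is left out for brevity.

```python
DEL = object()  # tombstone marking a slot whose item was deleted

class LinearHashSet:
    def __init__(self, size=8):
        self.table = [None] * size
        self.n = 0  # number of data values
        self.q = 0  # number of data or DEL values

    def _start(self, x):
        return hash(x) % len(self.table)

    def find(self, x):
        i = self._start(x)
        while self.table[i] is not None:       # stop only at a never-used slot
            if self.table[i] is not DEL and self.table[i] == x:
                return True
            i = (i + 1) % len(self.table)
        return False

    def add(self, x):
        if self.find(x):
            return False
        if 2 * (self.q + 1) > len(self.table):  # keep len(table) >= 2q
            self._resize()
        i = self._start(x)
        while self.table[i] is not None and self.table[i] is not DEL:
            i = (i + 1) % len(self.table)
        if self.table[i] is None:               # reusing a DEL slot leaves q as is
            self.q += 1
        self.n += 1
        self.table[i] = x
        return True

    def remove(self, x):
        i = self._start(x)
        while self.table[i] is not None:
            if self.table[i] is not DEL and self.table[i] == x:
                self.table[i] = DEL
                self.n -= 1
                return x
            i = (i + 1) % len(self.table)
        return None

    def _resize(self):
        # Double the array and reinsert the data values, dropping tombstones
        items = [y for y in self.table if y is not None and y is not DEL]
        self.table = [None] * (2 * len(self.table))
        self.n = self.q = 0
        for y in items:
            self.add(y)

s = LinearHashSet()
for v in (1, 9, 17):      # all three map to index 1 in a table of size 8
    s.add(v)
s.remove(9)               # slot 2 becomes DEL instead of None
print(s.find(17))         # True: the search continues past the tombstone
print(s.n, s.q)           # 2 3
```

If the removed slot were reset to `None` instead, the search for `17` would stop at index 2 and wrongly report that the item is absent.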

### Problems
**Q 1:** Given an array, find whether there exists a subarray whose elements sum to zero.  
**Answer:** A naive approach is to go through all the subarrays, but there is a better approach. Consider the array `2,4,-3,-1,5,-1`. Its prefix sum array is `2,6,3,2,7,6`. The sum rises from 2 and later comes back to 2, which means the elements in between (`4,-3,-1`) sum to zero. In general, if a value repeats in the prefix sum array, we can conclude that there exists a subarray with sum zero.  
There is a corner case. Consider the array `4,-3,-1,2,7`. Its prefix array is `4,1,0,2,9`. None of the elements of the prefix array occurs more than once, yet the subarray `4,-3,-1` has sum zero. This happens when a prefix sum is itself zero, so that case has to be checked separately.

In [None]:
def exists_zero_sum_subarray(A):
    # Generate prefix array
    prefix = [0] * len(A)
    sum = 0
    for i in range(len(A)):
        sum += A[i]
        # If sum at any time is 0
        if sum == 0:
            return True
        prefix[i] = sum

    # Use a set to keep track of the prefix sums seen so far
    seen = set()
    for p in prefix:
        if p in seen:
            return True
        seen.add(p)

    return False

**Q 2:** This question is an extension of the one above. Here we have to return the length of the longest subarray with sum equal to zero.  
**Answer:** To solve this problem, we also store the index of the first occurrence of each prefix sum in the map. So the answer is:

In [1]:
def longest_zero_sum_subarray(A):
    max_len = 0
    
    # Generate prefix array
    prefix = [0] * len(A)
    sum = 0
    for i in range(len(A)):
        sum += A[i]
        # If sum at any time is 0
        if sum == 0:
            cur_len = (i+1)
            if cur_len > max_len:
                max_len = cur_len
        prefix[i] = sum

    # Use a map to keep track of index
    index_map = {}
    for i in range(len(prefix)):
        if prefix[i] in index_map:
            if i - index_map[prefix[i]] > max_len:
                max_len = i - index_map[prefix[i]]
        else:
            index_map[prefix[i]] = i

    return max_len

A = [2,4,-3,-1,5,-1]
print(longest_zero_sum_subarray(A))

4


**Q 3:** This question is a generalization of the one above. Instead of the sum being zero, find the length of the longest subarray with sum $K$.  
**Answer:** In the above problem, we looked for indices satisfying `prefix[j] - prefix[i] = 0`. Here we modify the condition to `prefix[j] - prefix[i] = K`, or equivalently `prefix[j] - K = prefix[i]`.

In [1]:
def longest_K_sum_subarray(A, K):
    max_len = 0
    
    # Generate prefix array
    prefix = [0] * len(A)
    sum = 0
    for i in range(len(A)):
        sum += A[i]
        # If sum at any time is K
        if sum == K:
            cur_len = (i+1)
            if cur_len > max_len:
                max_len = cur_len
        prefix[i] = sum

    # Use a map to keep track of index
    index_map = {}
    for i in range(len(prefix)):
        if prefix[i] - K in index_map:
            if i - index_map[prefix[i] - K] > max_len:
                max_len = i - index_map[prefix[i] - K]

        if prefix[i] not in index_map:
            index_map[prefix[i]] = i

    return max_len

A = [1,2,-3,3,-1,2,4]
K = 3
print(longest_K_sum_subarray(A,K))

K = 0
print(longest_K_sum_subarray(A,K))

A = [1,3,15,10,20,23,3]
K = 48
print(longest_K_sum_subarray(A,K))

5
3
4


In the above problem, if the array contained only non-negative numbers, we could have solved it using two pointers (since the prefix sum array would then be sorted in ascending order).

In [4]:
def longest_K_sum_subarray_two_pointers(A, K):
    max_len = 0
    
    # Generate prefix array with a leading 0, so that
    # prefix[j] - prefix[i] is the sum of A[i:j]. Without the
    # leading 0, subarrays starting at index 0 would be missed.
    prefix = [0] * (len(A) + 1)
    for i in range(len(A)):
        prefix[i+1] = prefix[i] + A[i]

    # Use two pointers to solve prefix[j] - prefix[i] = K
    i = 0
    j = 0
    while(j < len(prefix)):
        if prefix[j] - prefix[i] == K:
            if j - i > max_len:
                max_len = j - i
            j += 1
        elif prefix[j] - prefix[i] < K:
            j += 1
        else:
            i += 1

    return max_len

A = [1,3,15,10,20,23,3]
K = 48
print(longest_K_sum_subarray_two_pointers(A,K))

4


**Q 4:** Given an array, a special pair is a pair of indices `(i, j)` such that `A[i] == A[j]` and `|i - j|` is minimum. Return the indices of the special pair. For example, in the array `2,4,6,2,3,12,3` both 2 and 3 are repeated, but the distance between the 3s is the smallest, so we return `(4,6)` as the answer.  
**Answer:**

In [5]:
def special_pair(A):
    index_map = {}
    
    min_len = len(A)
    pair = None
    
    for i in range(len(A)):
        if A[i] not in index_map:
            index_map[A[i]] = i
        else:
            cur_len = i - index_map[A[i]]
            if cur_len < min_len:
                min_len = cur_len
                pair = (index_map[A[i]], i)
            index_map[A[i]] = i
    
    return pair

A = [2,4,6,2,3,12,3]
print(special_pair(A))

(4, 6)


**Q 5:** Given an array A, for example `13,4,3,1,12,11,5,6,2`, return the length of the longest subsequence consisting of consecutive numbers. In this example the answer is 6, corresponding to the subsequence `1,2,3,4,5,6`.  
**Answer:** In the example above, we can see that there are two clusters of consecutive numbers: `1,2,3,4,5,6` and `11,12,13`. If we can identify the starting element of each cluster, we can solve the problem: once a number `x` is identified as a start (because `x - 1` is not in the array), we check whether `x + 1` is present in the array, then `x + 2`, and so on.

In [7]:
def longest_consecutive(A):
    # Add all elements to a hash set
    present = set()
    for i in A:
        present.add(i)

    # Iterate over A and check if an element is
    # left boundary.
    max_count = 0
    for i in range(len(A)):
        if (A[i] - 1) not in present: # A[i] is left boundary
            j = A[i]
            count = 0
            while j in present: # Check how many consecutive numbers are there
                count += 1
                j += 1
            if count > max_count:
                max_count = count

    return max_count

**Q 6:** Given an array of strings, find how many palindromes can be formed by concatenating two strings from the array at a time. For example, consider the array `['abcd', 'dcba', 'lls', 's', 'ssll']`. The palindromes formed are: `abcddcba, slls, llsssll, dcbaabcd`  
**Answer:** The naive approach is to go through all possible pairs and check whether the concatenated string is a palindrome, which takes $O(n^2k)$ time. There is a better approach: for every string, look up each of its prefixes and suffixes in a map of reversed strings, and check whether the remaining part is a palindrome. The time complexity in this case is $O(nk^2)$, where $k$ is the average length of a string.

In [8]:
def palindromic_pairs(A):
    # Function to return if a string is a palindrome
    def is_palindrome(input):
        if input == '':
            return True
        
        i = 0
        j = len(input) - 1
        while(i <= j):
            if input[i] != input[j]:
                return False
            i += 1
            j -= 1

        return True

    # Form a reverse map, value is the index
    rev_map = {}
    for i in range(len(A)):
        rev_map[''.join(list(reversed(A[i])))] = i

    # Answer set
    answer = set()

    # Prefix + Rest + Reverse of prefix
    for i in range(len(A)):
        for j in range(1, len(A[i]) + 1):
            prefix = A[i][:j]
            rest = A[i][j:]
            if prefix in rev_map and i != rev_map[prefix]:
                if is_palindrome(rest):
                    answer.add(A[i] + (''.join(list(reversed(prefix)))))

    # Reverse of Postfix + Rest + Postfix
    for i in range(len(A)):
        for j in range(1, len(A[i]) + 1):
            postfix = A[i][-j:]
            rest = A[i][:-j]
            if postfix in rev_map and i != rev_map[postfix]:
                if is_palindrome(rest):
                    answer.add((''.join(list(reversed(postfix)))) + A[i])

    return len(answer)
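
For comparison, the naive approach mentioned in the answer can be sketched as follows. It is useful mainly as a reference for checking the faster version on small inputs; the function name `palindromic_pairs_naive` is hypothetical, and it returns the set of palindromes rather than only their count.

```python
def palindromic_pairs_naive(A):
    answer = set()
    for i in range(len(A)):
        for j in range(len(A)):
            if i == j:
                continue
            candidate = A[i] + A[j]
            # A string is a palindrome if it equals its own reverse
            if candidate == candidate[::-1]:
                answer.add(candidate)
    return answer

found = palindromic_pairs_naive(['abcd', 'dcba', 'lls', 's', 'ssll'])
print(sorted(found))  # ['abcddcba', 'dcbaabcd', 'llsssll', 'slls']
```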

### Rolling Hash
Consider a hash function similar to the one defined at the very beginning of this page:

In [9]:
def hash(input):
    hash_ = 0
    prime = 31
    for i in range(len(input)):
        hash_ += ord(input[i])*(prime**i)
    return hash_

Now consider the string 'abcde'. If we know that $hash("abc") = x$, then we can calculate $hash("bcd")$ in one step: $hash("bcd") = y = \frac{x - ord("a")}{prime} + ord("d") \cdot prime^2$. The new character is multiplied by $prime^2$ because the length of the window is $3$, so its last character has weight $prime^{3-1}$. Similarly, now that we know $hash("bcd") = y$, we can calculate $hash("cde")$ in one step.
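
The update rule can be checked numerically. In the sketch below, `poly_hash` recomputes the hash from scratch using the same weighting as the function above, and `roll` applies the one-step update; both names are hypothetical.

```python
PRIME = 31

def poly_hash(text):
    # Same scheme as above: character i is weighted by PRIME**i
    return sum(ord(c) * PRIME**i for i, c in enumerate(text))

def roll(prev_hash, outgoing, incoming, length):
    # Drop the outgoing character (weight PRIME**0), shift the remaining
    # terms down by one power, then add the incoming character at the
    # highest weight. The division is exact, so // loses nothing.
    return (prev_hash - ord(outgoing)) // PRIME + ord(incoming) * PRIME**(length - 1)

x = poly_hash("abc")
y = roll(x, "a", "d", 3)
z = roll(y, "b", "e", 3)
print(y == poly_hash("bcd"), z == poly_hash("cde"))  # True True
```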

Using the above concept, we can solve the problem of finding a pattern in a string. This is the *Rabin-Karp* algorithm, and an implementation is given below:

In [10]:
def find_in_string(source, target):
    # Return -1 if target is longer
    if len(target) > len(source):
        return -1

    # Return -1 if either empty
    if len(target) == 0 or len(source) == 0:
        return -1

    # Function to calculate hash from a string
    def get_hash(string):
        prime = 37
        i = 0
        hash = 0
        for t in string:
            hash += ord(t)*(prime**i)
            i += 1

        return hash

    # Calculate hash of target
    hash_target = get_hash(target)

    prime = 37
    start = -1
    # We employ rolling hash
    for i in range(len(source) - len(target) + 1):
        # For the first set of t characters, we calculate the
        # hash using the earlier defined function
        if i == 0:
            hash_current = get_hash(source[i:i+len(target)])
        # For other set of characters, we calculate new hash using
        # previous hash
        else:
            hash_current = (hash_current - ord(source[i-1]))//prime + ord(source[i+len(target)-1])*(prime**(len(target)-1))

        # Even if the hashes are the same we need to compare source
        # and target because of hash collisions. On a mismatch we
        # keep scanning instead of giving up.
        if hash_current == hash_target:
            if source[i:i+len(target)] == target:
                start = i
                break

    return start

**Q 7:** Given an array containing a permutation of the natural numbers from 1 to N, find the largest arrangement possible after at most K swaps.  
**Answer:** The approach is to pick the next largest number in each iteration and swap it to its required place. So in the first iteration, we pick the largest number and put it at index 0. If a number is already present at its required position, go to the next largest number without increasing the swap count.

In [11]:
def k_swaps(A, K):
    # Generate index map
    index_map = {}
    for i in range(len(A)):
        index_map[A[i]] = i

    # Find the largest number, that is N
    # since A has natural numbers till N
    N = len(A)

    i = 0
    swaps = 0
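    # Stop when the swaps are used up or every position has been
    # processed (otherwise index_map[N - i] fails once i reaches N)
    while(swaps < K and i < N):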
        # Each iteration we pick the next largest number
        # and its corresponding index
        from_ = index_map[N - i]
        
        # If the next largest number is already at its
        # required position, skip to next largest number.
        # No swapping required
        if A[i] == N - i:
            i += 1
            continue

        # We have to swap the current largest number to
        # this index
        to_ = i

        # Swap
        A[from_], A[to_] = A[to_], A[from_]

        # Update index_map after swapping
        index_map[A[to_]] = to_
        index_map[A[from_]] = from_

        i += 1
        swaps += 1

    return A