# Elements of Programming Interviews

## Hash Tables

A hash table is a data structure used to store keys, optionally with corresponding values. Inserts, deletes, and lookups run in $O(1)$ time on average.

**Underlying idea**: 

* Store the keys in an array.
    * The location of the key in the array is based on its "hash code".
    * The **hash code** is an **integer** computed from the key by a hash function.
* If two keys map to the same location, a "collision" is said to occur.
    * To deal with collisions, one maintains a linked list of objects at each array location.
* If the hash function does a good job of spreading objects across the underlying array and takes $O(1)$ time to compute
    * on average, **lookups**, **insertions**, and **deletions** have $O(1+n/m)$ time complexity
        * n = number of objects
        * m = length of the array
    * When $n/m$ grows large, rehashing can be applied: a larger array is allocated and the objects are moved to it, which takes $O(n+m)$ time.
        * If rehashing is done infrequently, its amortized cost is low.
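
The ideas above can be sketched as a toy hash table. This is only an illustration under stated assumptions: the class name `ChainedHashTable`, the parameters `num_slots` and `max_load`, and the doubling policy are hypothetical choices, Python lists stand in for the linked lists at each array location, and the built-in `hash()` plays the role of the hash function.

```python
class ChainedHashTable:
    """Toy hash table with chaining (illustrative sketch, not production code)."""

    def __init__(self, num_slots=8, max_load=2.0):
        # One chain per array location; a Python list stands in for a linked list.
        self._slots = [[] for _ in range(num_slots)]
        self._size = 0
        self._max_load = max_load  # rehash once n/m exceeds this value

    def __len__(self):
        return self._size

    def _chain(self, key):
        # The array location is derived from the key's hash code.
        return self._slots[hash(key) % len(self._slots)]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:  # key already present: overwrite its value
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self._size += 1
        if self._size / len(self._slots) > self._max_load:
            self._rehash()

    def lookup(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                self._size -= 1
                return
        raise KeyError(key)

    def _rehash(self):
        # O(n + m): allocate a larger array and move every object into it.
        old_items = [item for chain in self._slots for item in chain]
        self._slots = [[] for _ in range(2 * len(self._slots))]
        for k, v in old_items:
            self._chain(k).append((k, v))


table = ChainedHashTable()
for i in range(100):
    table.insert(i, i * i)
print(table.lookup(7))  # 49
```

Doubling the array on each rehash keeps rehashing infrequent, which is what makes its amortized cost low.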
        
### The Hard Requirement of a Hash Function

Equal keys must have equal hash codes. This is easy to get wrong, e.g., by writing a hash function that is based on an object's address rather than its contents, or by including profiling data.
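
A minimal sketch of the address-based mistake follows. Both classes are hypothetical and exist only for this illustration: `AddressHashedKey` compares by contents but hashes by `id()`, while `ContentHashedKey` hashes the same field that equality uses.

```python
class AddressHashedKey:
    """Broken on purpose: equality uses contents, the hash uses the address."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, AddressHashedKey) and self.name == other.name

    def __hash__(self):
        return id(self)  # differs for two distinct live objects


class ContentHashedKey:
    """Equality and hash are both derived from the same contents."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, ContentHashedKey) and self.name == other.name

    def __hash__(self):
        return hash(self.name)


stored_bad, probe_bad = AddressHashedKey('ann'), AddressHashedKey('ann')
print(stored_bad == probe_bad)     # True: the keys are equal...
print(probe_bad in {stored_bad})   # False: ...but the lookup misses

stored_ok, probe_ok = ContentHashedKey('ann'), ContentHashedKey('ann')
print(probe_ok in {stored_ok})     # True
```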

### Softer Requirement of a Hash Function

A hash function should spread keys, i.e., the hash codes for a subset of objects should be uniformly distributed across the underlying array. In addition, a hash function should be efficient to compute.
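
To make "spreading" concrete, the sketch below counts how many keys land in each array location for two hash functions. Everything here is an assumption chosen for illustration: the helper names, the 16 locations, and the `user0` ... `user999` keys.

```python
import collections


def first_char_hash(s):
    # Poor choice: only the first character influences the hash code.
    return ord(s[0]) if s else 0


def polynomial_hash(s, mult=997):
    # Every character influences the hash code.
    h = 0
    for c in s:
        h = h * mult + ord(c)
    return h


def bucket_sizes(keys, hash_fn, num_buckets):
    """Return the number of keys that map to each of num_buckets locations."""
    counts = collections.Counter(hash_fn(k) % num_buckets for k in keys)
    return [counts[b] for b in range(num_buckets)]


keys = ['user%d' % i for i in range(1000)]
poor = bucket_sizes(keys, first_char_hash, 16)
better = bucket_sizes(keys, polynomial_hash, 16)
print(max(poor))    # 1000: every key starts with 'u', so all keys collide
print(max(better))  # size of the fullest location under the polynomial hash
```

With the first-character hash, every lookup degenerates into a scan of one long chain, which is exactly what the softer requirement is meant to prevent.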

### Designing a Hash Function Suitable for Strings

* The hash function should examine all the characters in the string.
* It should give a large range of values.
* It should not let one character dominate (e.g., if we simply cast characters to integers and multiplied them, a single $0$ would result in a hash code of $0$).
* We would also like a rolling hash function, one in which if a character is deleted from the front of the string, and another added to the end, the new hash code can be computed in $O(1)$ time.

The following function has these properties:


In [1]:
import functools

def string_hash(s, modulus):
    MULT = 997
    # Reduce modulo `modulus` after every step so the hash code stays in [0, modulus).
    return functools.reduce(lambda v, c: (v * MULT + ord(c)) % modulus, s, 0)

string_hash('araks', 10)

6
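
The rolling property can be sketched as follows. The helper `rolling_hashes` is a hypothetical name introduced here, and `string_hash` is restated with the modulus applied at every step so that the block runs on its own. Removing the front character subtracts its contribution, `ord(c) * MULT**(k-1)`, and appending a character is one more multiply-and-add step.

```python
import functools

MULT = 997


def string_hash(s, modulus):
    return functools.reduce(lambda v, c: (v * MULT + ord(c)) % modulus, s, 0)


def rolling_hashes(text, k, modulus):
    """Yield the hash code of every length-k window of text, left to right."""
    if k <= 0 or k > len(text):
        return
    # Weight of the front character of a window: MULT^(k-1) mod modulus.
    lead_weight = pow(MULT, k - 1, modulus)
    h = string_hash(text[:k], modulus)
    yield h
    for i in range(k, len(text)):
        # Delete text[i - k] from the front, then add text[i] to the end: O(1).
        h = (h - ord(text[i - k]) * lead_weight) % modulus
        h = (h * MULT + ord(text[i])) % modulus
        yield h


text, k, modulus = 'abracadabra', 4, 1000003
rolled = list(rolling_hashes(text, k, modulus))
direct = [string_hash(text[i:i + k], modulus) for i in range(len(text) - k + 1)]
print(rolled == direct)  # True
```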

## Hash Table Boot Camp

### An Application of Hash Tables

**Problem**: Suppose you were asked to write a program that takes as input a set of words and returns groups of anagrams for those words. Each group must contain at least two words.

**Solution**: Given any string, its sorted version can be used as a unique identifier for the anagram group it belongs to. Our algorithm adds sort(*s*) for each string *s* to a hash table as a key, whose value is the list of strings that share that sorted form.
* Anytime you need to store a set of strings, a hash table is an excellent choice.

In [2]:
import collections

def find_anagram(dictionary_of_words):
    sorted_string_to_anagram = collections.defaultdict(list)
    for s in dictionary_of_words:
        sorted_string_to_anagram[''.join(sorted(s))].append(s)
        
    return [group for group in sorted_string_to_anagram.values() if len(group) >= 2]

find_anagram(['listen', 'levis', 'debitcard', 'badcredit', 'elvis', 'silent', 'money', 'lives'])

[['listen', 'silent'], ['levis', 'elvis', 'lives'], ['debitcard', 'badcredit']]

**Time complexity:** $O(nm\log{m})$, where $n$ is the number of strings and $m$ is the maximum string length. Explanation: 
* There are $n$ calls to sort, so sorting all keys has time complexity $O(nm\log{m})$.
* There are $n$ insertions into the hash table, each hashing a key of length at most $m$, so insertions add a time complexity of $O(nm)$.

### Design of a Hashable Class

Consider a class that represents contacts. 
* Assume each contact is a string. 
* Suppose that individual contacts are to be stored in a list and it's possible that the list contains duplicates. 
* Multiplicity is not important, i.e., three repetitions of the same contact is the same as a single instance of that contact.

Two contacts should be equal if they contain the same set of strings, regardless of the ordering of the strings within the underlying list.  

In order to be able to store contacts in a hash table, we first need to explicitly define equality, which we can do by forming sets from the lists and comparing the sets.  

**Note**: The hash function and equals method below are very inefficient, since every call rebuilds a set from the underlying list. In practice, it would be advisable to cache the underlying set and the hash code, remembering to invalidate these values on updates.

In [3]:
class ContactList:
    def __init__(self, names):
        '''
        names is a list of strings
        '''
        self.names = names
    
    def __hash__(self):
        '''
        Conceptually we want to hash the set of names.
        Since the set type is mutable, it cannot be hashed. Therefore we use frozenset.
        '''
        return hash(frozenset(self.names))
    
    def __eq__(self, other):
        return set(self.names) == set(other.names)
    
def merge_contact_lists(contacts):
    '''
    contacts is a list of ContactList.
    '''
    return list(set(contacts))

contact_list_1 = ContactList(['john', 'james'])
contact_list_2 = ContactList(['ann', 'david', 'ann', 'mike'])
contact_list_3 = ContactList(['james', 'john', 'james'])
merged_contact_lists = merge_contact_lists([contact_list_1, contact_list_2, contact_list_3])

# Create a dictionary whose keys are the ContactList objects and whose values are the lengths of the contact lists.
contacts_dict = collections.defaultdict(int)
for item in merged_contact_lists:
    contacts_dict[item] = len(item.names)

for key in contacts_dict:
    print(key, contacts_dict[key])

<__main__.ContactList object at 0x104b43a20> 4
<__main__.ContactList object at 0x104b43b38> 2


**Notice** that, with our definition of equality, there are only 2 key/value pairs in our dictionary.  
**Let's** try the same without our definition of equality.

In [4]:
class ContactList:
    def __init__(self, names):
        '''
        names is a list of strings
        '''
        self.names = names
    
    def __hash__(self):
        '''
        Conceptually we want to hash the set of names.
        Since the set type is mutable, it cannot be hashed. Therefore we use frozenset.
        '''
        return hash(frozenset(self.names))
    
#     def __eq__(self, other):
#         return set(self.names) == set(other.names)
    
def merge_contact_lists(contacts):
    '''
    contacts is a list of ContactList.
    '''
    return list(set(contacts))

contact_list_1 = ContactList(['john', 'james'])
contact_list_2 = ContactList(['ann', 'david', 'ann', 'mike'])
contact_list_3 = ContactList(['james', 'john', 'james'])
merged_contact_lists = merge_contact_lists([contact_list_1, contact_list_2, contact_list_3])

# Create a dictionary whose keys are the ContactList objects and whose values are the lengths of the contact lists.
contacts_dict = collections.defaultdict(int)
for item in merged_contact_lists:
    contacts_dict[item] = len(item.names)

for key in contacts_dict:
    print(key, contacts_dict[key])

<__main__.ContactList object at 0x104b43550> 3
<__main__.ContactList object at 0x104b43630> 4
<__main__.ContactList object at 0x104b43668> 2


**Notice** that without the definition of equality there are 3 key/value pairs in our dictionary: ['john', 'james'] and ['james', 'john', 'james'] still have equal hash codes, but the default equality compares object identity, so they are no longer considered equal to each other.

**What if** now we try skipping the definition of the hash function?

In [5]:
class ContactList:
    def __init__(self, names):
        '''
        names is a list of strings
        '''
        self.names = names
    
#     def __hash__(self):
#         '''
#         Conceptually we want to hash the set of names.
#         Since the set type is mutable, it cannot be hashed. Therefore we use frozenset.
#         '''
#         return hash(frozenset(self.names))
    
    def __eq__(self, other):
        return set(self.names) == set(other.names)
    
def merge_contact_lists(contacts):
    '''
    contacts is a list of ContactList.
    '''
    return list(set(contacts))

contact_list_1 = ContactList(['john', 'james'])
contact_list_2 = ContactList(['ann', 'david', 'ann', 'mike'])
contact_list_3 = ContactList(['james', 'john', 'james'])
merged_contact_lists = merge_contact_lists([contact_list_1, contact_list_2, contact_list_3])

# Create a dictionary whose keys are the ContactList objects and whose values are the lengths of the contact lists.
contacts_dict = collections.defaultdict(int)
for item in merged_contact_lists:
    contacts_dict[item] = len(item.names)

for key in contacts_dict:
    print(key, contacts_dict[key])

TypeError: unhashable type: 'ContactList'

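**Unhashable type**, of course: in Python 3, a class that defines `__eq__` without also defining `__hash__` has its `__hash__` set to `None`, so its instances cannot be placed in a set or used as dictionary keys.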
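
Returning to the earlier note on efficiency, one possible way to cache the underlying set and the hash code is sketched below. This is an assumption-laden variant, not the version used in the cells above: the class name `CachedContactList`, the `add` method, and the private attribute names are all hypothetical.

```python
class CachedContactList:
    def __init__(self, names):
        '''
        names is a list of strings
        '''
        self._names = list(names)
        self._name_set = None  # cached frozenset of names, built lazily
        self._hash = None      # cached hash code, built lazily

    def _as_set(self):
        if self._name_set is None:
            self._name_set = frozenset(self._names)
        return self._name_set

    def add(self, name):
        # Every update must invalidate both cached values.
        self._names.append(name)
        self._name_set = None
        self._hash = None

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(self._as_set())
        return self._hash

    def __eq__(self, other):
        if not isinstance(other, CachedContactList):
            return NotImplemented
        return self._as_set() == other._as_set()


a = CachedContactList(['john', 'james'])
b = CachedContactList(['james', 'john', 'james'])
print(a == b, len({a, b}))  # True 1
```

Note that updating an object after it has been stored in a set or used as a dictionary key changes its hash code, so the container may no longer find it. The cache only helps with repeated hashing and comparison between updates.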

## Know Your Hash Table Libraries

Hash table-based data structures in Python:
* set
* dict
* collections.defaultdict
* collections.Counter

The difference between set and the other three is that set simply stores keys, whereas the others store key-value pairs. All have the property that they do not allow for duplicate keys, unlike, for example, list.

* Accessing the value associated with a key that is not present in a **dict** leads to a **KeyError** exception.
* However, **collections.defaultdict** returns (and stores) the default value produced by the type or factory that was specified when the collection was instantiated, e.g., 0 for int.
* **collections.Counter** is used for counting the number of occurrences of keys, with a number of set-like operations, as illustrated below.

In [6]:
c = collections.Counter(a=3, b=1)
d = collections.Counter(a=1, b=2)

# add two counters together
c + d

Counter({'a': 4, 'b': 3})

In [7]:
# Subtract (keeping only positive counts)
c - d

Counter({'a': 2})

In [8]:
# Intersection
c & d

Counter({'a': 1, 'b': 1})

In [9]:
# Union
c | d

Counter({'a': 3, 'b': 2})
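
The behaviour on missing keys described earlier, a KeyError for dict versus a default value for collections.defaultdict, can be checked with a short sketch (the variable names are illustrative only):

```python
import collections

plain = {'a': 1}
try:
    plain['missing']
except KeyError as exc:
    print('KeyError:', exc)  # KeyError: 'missing'

counts = collections.defaultdict(int)  # int() produces the default value 0
counts['missing'] += 1
print(counts['missing'])  # 1
print(counts['other'])    # 0
print(sorted(counts))     # ['missing', 'other']
```

Merely reading `counts['other']` inserts the key, which is convenient for counting but can be surprising when the intent is only to test for membership; `'other' in counts` does not insert anything.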

* The most common operations for set are *s.add(42), s.remove(42), s.discard(123), x in s*, as well as *s <= t* (is s a subset of t), and *s - t* (elements in s that are not in t).
* The basic operations on the three key-value collections are similar to those on set.
    * Difference: iteration over key-value collections yields the keys.
        * To iterate over the key-value pairs: items()
        * To iterate over values: values()
        * To iterate over keys: keys() (returns a view of the keys)
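
A brief sketch of the operations listed above, using made-up sets `s` and `t` and a made-up dictionary `ages`:

```python
s = {1, 2, 3}
t = {2, 3, 4, 5}

s.add(42)
s.remove(42)    # would raise KeyError if 42 were absent
s.discard(123)  # no error even though 123 is absent

print(2 in s)       # True
print(s <= t)       # False: 1 is in s but not in t
print({2, 3} <= t)  # True
print(s - t)        # {1}

ages = {'ann': 31, 'bob': 27}
print(list(ages))           # ['ann', 'bob'] (plain iteration yields keys)
print(list(ages.items()))   # [('ann', 31), ('bob', 27)]
print(list(ages.values()))  # [31, 27]
print(list(ages.keys()))    # ['ann', 'bob']
```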
        
**Mutable containers are not hashable**.  
**Note** that the built-in ***hash()*** function can greatly simplify the implementation of a hash function for a user-defined class, i.e., implementing `__hash__(self)`.
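
As a minimal sketch of both points, the hypothetical `Point` class below delegates its hash function to the built-in hash() applied to a tuple of its fields, and the final lines show that a mutable container such as a list is rejected by hash():

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Hash exactly the fields that equality compares.
        return hash((self.x, self.y))


print(Point(1, 2) in {Point(1, 2)})  # True

try:
    hash([1, 2])  # lists are mutable, hence not hashable
except TypeError as exc:
    print('TypeError:', exc)
```

Hashing a tuple of the compared fields keeps the hash function consistent with equality, which is the hard requirement stated earlier.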