# **Hash Table Boot Camp**
---
- data structure used to store keys
    - optionally with corresponding values
    - good data structure to store dictionaries
- Inserts, Deletes, Lookups -> all `O(1)` average time
- Keys do not have to be in order 
- common mistake: updating a key while it is still in the hash table 
    - consequence -> lookup for that key can now fail 
        - even though it is still in the hash table 
    - if you have to update a key:
        - first remove it 
        - then update it 
        - finally add it back 
    - this ensures it is moved to the correct array location
    - avoid using mutable objects as keys 
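
The mutable-key mistake above can be reproduced in a few lines. This is a minimal sketch: `Account` is a hypothetical class invented for illustration, and its hash depends on a field that can change.

```python
class Account:
    """Hypothetical mutable key: the hash depends on a field that can change."""

    def __init__(self, number: int):
        self.number = number

    def __hash__(self) -> int:
        return hash(self.number)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Account) and self.number == other.number


acct = Account(1)
balances = {acct: 100}

# Mistake: mutate the key while it is still stored in the table.
acct.number = 2
# Neither the old nor the new value finds the entry, although it is still stored.
lost = Account(1) not in balances and Account(2) not in balances
still_stored = len(balances) == 1

# Safe sequence: remove the key, update it, then add it back.
acct.number = 1                 # undo the bad mutation for the demonstration
balance = balances.pop(acct)    # 1. remove
acct.number = 2                 # 2. update
balances[acct] = balance        # 3. add back
found = Account(2) in balances
```

Using immutable keys (tuples, strings, frozensets) avoids the problem entirely.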
    
---
### **"Hash Code"**
   - integer computed from the key by a **hash function**
   - used to pick the array location (also known as a 'slot') where the key is stored
        - if hash function chosen well: 
            - objects distributed uniformly across the array locations

---
### **"Collision"**
   - two keys map to the same location 
   - handle by maintaining a linked list of objects at each array location 
   - if the **hash function** spreads objects well and takes `O(1)` time to compute:
        - lookups, insertions, and deletions have `O(1 + n/m)` time complexity
            - `n` = number of objects
            - `m` = length of the array 
            - `load` = `n/m`
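
Collision handling by chaining can be sketched as a toy table. This is an illustration under assumptions, not a production design: `ChainedHashTable` and its method names are hypothetical, and Python lists stand in for the linked lists.

```python
from typing import Any, List, Optional, Tuple


class ChainedHashTable:
    """Toy hash table that resolves collisions with one chain per slot."""

    def __init__(self, num_slots: int = 8):
        self.slots: List[List[Tuple[Any, Any]]] = [[] for _ in range(num_slots)]
        self.size = 0

    def _chain(self, key: Any) -> List[Tuple[Any, Any]]:
        # The hash code picks the slot; colliding keys share the same chain.
        return self.slots[hash(key) % len(self.slots)]

    def insert(self, key: Any, value: Any) -> None:
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:              # key already present: overwrite its value
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self.size += 1

    def lookup(self, key: Any) -> Optional[Any]:
        for k, v in self._chain(key):  # walks one chain, about n/m entries on average
            if k == key:
                return v
        return None

    def delete(self, key: Any) -> bool:
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                self.size -= 1
                return True
        return False

    def load(self) -> float:
        return self.size / len(self.slots)   # n / m


table = ChainedHashTable(num_slots=4)
for key in (1, 5, 9, 2):        # 1, 5 and 9 all map to slot 1 when m = 4
    table.insert(key, key * 10)
```

Here three of the four keys collide in slot 1, so a lookup for `9` walks a chain of length 3 while the overall load is `4 / 4 = 1.0`.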

---
### **"Rehashing"**
   - if the `load = n/m` grows large
        - `#objects / len(array)`
   - a new array with a larger number of locations is allocated 
   - objects are moved to the new array 
   - expensive at `O(n + m)` time complexity 
   - if done infrequently -> amortized cost is low 
   - not really an issue outside of realtime systems
        - even then... separate threads can do the rehashing
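
Rehashing a chained table can be sketched as follows. The `rehash` helper and the `Slots` alias are hypothetical names for illustration, and the table is assumed to be a list of chains of `(key, value)` pairs.

```python
from typing import Any, List, Tuple

Slots = List[List[Tuple[Any, Any]]]


def rehash(slots: Slots, new_num_slots: int) -> Slots:
    """Move every (key, value) pair into a freshly allocated, larger array."""
    new_slots: Slots = [[] for _ in range(new_num_slots)]
    for chain in slots:                 # visits every old slot: O(m)
        for key, value in chain:        # visits every stored object: O(n)
            new_slots[hash(key) % new_num_slots].append((key, value))
    return new_slots


# Four objects in four slots: load = 4 / 4 = 1.0, with a chain of length 3.
old = [[], [(1, "a"), (5, "b"), (9, "c")], [(2, "d")], []]
new = rehash(old, 8)                    # load drops to 4 / 8 = 0.5
longest_chain = max(len(chain) for chain in new)
```

Doubling the array each time keeps rehashes rare, which is what makes the amortized cost per insertion low.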

---
### **"Hash Function"**
   - distributes objects uniformly across array locations
   - crucial element to hash table 
   - equal keys must have equal hash codes 
   - should be efficient to compute 
   
---

---
### Hash Function for a String
- should examine all characters in the string
- should give a large range of values
- should not let one character dominate
- **Rolling Hash Function**
    - if one character is deleted from the front of the string and another added to the end
    - the new hash code can be computed in `O(1)` time 

In [None]:
import functools

In [None]:
def str_hash(s: str, modulus: int) -> int: 
    mult = 997
    # ord() returns the integer Unicode code point of a character 
    return functools.reduce(lambda v, c: (v * mult + ord(c)) % modulus, s, 0)

In [None]:
string = "bobby"
mod = 3

In [None]:
str_hash(string,mod)
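
The rolling property can be sketched on top of the same polynomial hash. `roll_hash`, the modulus value, and the sample text are assumptions made for this illustration; the power `mult^(width-1)` is computed once so that each update is `O(1)`.

```python
import functools


def str_hash(s: str, modulus: int, mult: int = 997) -> int:
    # Same polynomial hash as above, with the multiplier exposed as a parameter.
    return functools.reduce(lambda v, c: (v * mult + ord(c)) % modulus, s, 0)


def roll_hash(old_hash: int, dropped: str, added: str, high_power: int,
              modulus: int, mult: int = 997) -> int:
    # Remove the contribution of the leading character...
    without_first = (old_hash - ord(dropped) * high_power) % modulus
    # ...then shift the rest up one position and append the new character.
    return (without_first * mult + ord(added)) % modulus


modulus = 1_000_003                         # illustrative modulus (an assumption)
text, width = "abracadabra", 4
high_power = pow(997, width - 1, modulus)   # mult^(width - 1) mod modulus, computed once

window_hash = str_hash(text[:width], modulus)
rolled = [window_hash]
for i in range(width, len(text)):
    window_hash = roll_hash(window_hash, text[i - width], text[i], high_power, modulus)
    rolled.append(window_hash)
```

Each entry of `rolled` equals the hash of the corresponding window computed from scratch, but costs constant time instead of time proportional to the window width.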

---
---
## Applications of Hash Tables 
- anagrams: reorganizing letters to form new words
    - "eleven plus two" is an anagram for "twelve plus one"
    - do not depend on ordering of characters in the strings
    - sort the characters in the strings
        - if the sorted characters match, they are anagrams 
        - iterate through the strings 
        - compare each string with remaining strings 
            - if anagrams, do not consider the second string again 
        - `O(n² m log m)` algo 
            - `n` = number of strings 
            - `m` = max string length 
        - map strings to a representative 
            - sorted version of a string can be used as a unique identifier for its anagram group 
            - map the sorted string to the anagrams that share it 
        - sorted strings are the keys 
            - final algo proceeds by adding each string `s` in the dictionary to a hash table under the key `sort(s)` 
        - values are arrays of corresponding strings from original input 
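
For contrast with the hash table approach, the pairwise `O(n² m log m)` comparison described above might look like this. It is a rough sketch; `find_anagrams_brute_force` is a hypothetical name.

```python
from typing import List


def find_anagrams_brute_force(words: List[str]) -> List[List[str]]:
    used = [False] * len(words)      # marks strings already placed in a group
    groups: List[List[str]] = []
    for i, first in enumerate(words):
        if used[i]:
            continue
        group = [first]
        for j in range(i + 1, len(words)):
            # Sorting both strings costs O(m log m) for each of the O(n^2) pairs.
            if not used[j] and sorted(words[j]) == sorted(first):
                group.append(words[j])
                used[j] = True       # do not consider this string again
        if len(group) >= 2:
            groups.append(group)
    return groups


sample = ["debitcard", "elvis", "silent", "badcredit", "lives",
          "freedom", "listen", "levis", "money"]
brute_force_groups = find_anagrams_brute_force(sample)
```

The hash table version removes the inner loop: each string is sorted once and dropped into the bucket keyed by its sorted form.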

In [1]:
from typing import DefaultDict, List
import collections

anagrams = ["debitcard", "elvis", "silent", "badcredit", "lives", "freedom", "listen", "levis", "money"] 

In [2]:
def find_anagrams(dictionary: List[str]) -> List[List[str]]:
    
    sorted_string_to_anagrams: DefaultDict[ str, List[str]] = collections.defaultdict(list)
        
    for s in dictionary:
        sorted_string_to_anagrams[''.join(sorted(s))].append(s)
    
    return [group for group in sorted_string_to_anagrams.values() if len(group) >= 2]

In [10]:
# freedom and money not shown because they have no buddy anagrams
find_anagrams(anagrams)

[['debitcard', 'badcredit'], ['elvis', 'lives', 'levis'], ['silent', 'listen']]

In [11]:
def diff_but_same(anagrams):
    
    groupedWords = collections.defaultdict(list)
    
    # puts all anagrams into dictionary with keys as the sorted words 
    for word in anagrams:
        groupedWords[''.join(sorted(word))].append(word)
    
    # print anagrams together 
    for group in groupedWords.values():
        print(" ".join(group))

In [5]:
diff_but_same(anagrams)

debitcard badcredit
elvis lives levis
silent listen
freedom
money


##### Time Complexity: `O(n m log m)`
- `n` = number of strings 
    - `n` calls to sort -> `O(n m log m)`
    - `n` insertions into hash table -> `O(n m)`
    - together still `O(n m log m)` because the sorting term dominates 
- `m` = max string length 

---
### Variant: `O(n m)` algorithm for the same problem
- `n` = number of strings
- `m` = max string length 
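- use a **hash map** keyed by character counts instead of sorted strings
    - counting characters is `O(m)` per string, sorting is `O(m log m)`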

In [6]:
from typing import List
from collections import defaultdict
from collections import Counter

anagrams = ["debitcard", "elvis", "silent", "badcredit", "lives", "freedom", "listen", "levis", "money"]

In [7]:
def shorter_time_anagrams(anagrams: list) -> list:
    
    # defaultdict(list) creates a new empty list the first time a key is looked up 
    m = defaultdict(list)
    
    for word in anagrams:
        # Counter(word) counts the frequency of each character in the string 
        # frozenset(...) turns the (character, count) pairs into an immutable, hashable key 
        # all anagrams of a word have the same counts, so they share the same key 
        m[frozenset(Counter(word).items())].append(word)
        
    return list(m.values())

In [9]:
shorter_time_anagrams(anagrams)

[['debitcard', 'badcredit'],
 ['elvis', 'lives', 'levis'],
 ['silent', 'listen'],
 ['freedom'],
 ['money']]

---
---
## Design a Hashable Class 
- Contacts: each contact is a string
    - hard requirement that the individual contacts be stored in a list 
        - duplicates are allowed
        - two contact lists are equal if they contain the same set of strings 
        - original ordering does not matter 
        - multiplicity is not important: 3 instances of one contact is the same as one instance of that contact 
    - explicitly define equality: form sets from the lists and compare the sets 
        - hash function should depend on the strings present but not their ordering 
- the hash function and equals methods shown here are inefficient: each call rebuilds a set 
    - better to cache the underlying set and the hash code
    - invalidate these cached values on updates 

In [13]:
class ContactList: 
    def __init__(self, names: List[str]):
        self.names = names 
    
    def __hash__(self):
        return hash(frozenset(self.names))
    
    def __eq__(self, other):
        return set(self.names) == set(other.names)
    
def merge_contact_list(contacts: List[ContactList]) -> List[ContactList]:
    # a set keeps one representative per group of equal contact lists
    return list(set(contacts))

In [16]:
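# note: these are plain strings rather than ContactList objects,
# so this call simply deduplicates the individual names
contacts = ['sammy', 'louis', 'katherine', 'marco', 'katherine', 'sammy', 'davis', 'louis']

merge_contact_list(contacts)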

['katherine', 'sammy', 'davis', 'louis', 'marco']
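
To exercise `__hash__` and `__eq__` themselves, the input needs to be `ContactList` objects. A minimal sketch, with sample names chosen for illustration:

```python
from typing import List


class ContactList:
    def __init__(self, names: List[str]):
        self.names = names

    def __hash__(self) -> int:
        # frozenset ignores both ordering and duplicates
        return hash(frozenset(self.names))

    def __eq__(self, other: object) -> bool:
        return isinstance(other, ContactList) and set(self.names) == set(other.names)


def merge_contact_list(contacts: List[ContactList]) -> List[ContactList]:
    return list(set(contacts))


a = ContactList(['sammy', 'louis', 'sammy'])   # duplicate entry, different order
b = ContactList(['louis', 'sammy'])
c = ContactList(['katherine', 'marco'])

merged = merge_contact_list([a, b, c])         # a and b collapse into one entry
merged_name_sets = {frozenset(cl.names) for cl in merged}
```

`a` and `b` compare equal and hash equally, so the merged result holds two contact lists rather than three.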

##### Time Complexity: `O(n)`
- `n` = number of strings in the contact list 

##### Hash Codes: 
- often cached for performance 
- caveat that the cache must be cleared if object fields have been updated 
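
Caching with invalidation could be sketched like this. `CachedContactList` and its `add` method are hypothetical; the point is only that every update clears the cached set and hash code.

```python
from typing import FrozenSet, List, Optional


class CachedContactList:
    """Contact list that caches its set view and hash code until the next update."""

    def __init__(self, names: List[str]):
        self._names = list(names)
        self._name_set: Optional[FrozenSet[str]] = None
        self._hash: Optional[int] = None

    def _as_set(self) -> FrozenSet[str]:
        if self._name_set is None:            # rebuild only after an update
            self._name_set = frozenset(self._names)
        return self._name_set

    def add(self, name: str) -> None:
        self._names.append(name)
        self._name_set = None                 # invalidate both cached values
        self._hash = None

    def __hash__(self) -> int:
        if self._hash is None:
            self._hash = hash(self._as_set())
        return self._hash

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, CachedContactList):
            return NotImplemented
        return self._as_set() == other._as_set()


x = CachedContactList(['sammy', 'louis'])
y = CachedContactList(['louis', 'sammy', 'louis'])
equal_before = (x == y) and (hash(x) == hash(y))

x.add('davis')                                # update clears the cache
equal_after = (x == y)
```

Because `add` changes the hash code, an object should be removed from any hash table before it is updated and re-inserted afterwards.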