# 8. Hashing with Chaining

- **Dictionary - ADT**
    - maintains a set of items each with a key
    - operations:
        - insert(item) --> overwrites any existing item with the same key
        - delete(item)
        - search(key): returns the item with the given key, or reports an error if it does not exist
    - each operation takes O(log n) with an AVL tree
    - goal: can we do search in O(1)?
    - Python Dictionaries:
        - D[key] --> search
        - D[key] = val --> insert
        - del D[key] --> delete
        - item = (key, value)
    - Motivations:
        - document distance
        - databases
        - compilers and interpreters
        - network router and server
        - ----------- more subtle -----------
        - substring
        - string commonalities
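The three ADT operations listed under "Python Dictionaries" can be tried directly on the built-in `dict`. This is only a minimal sketch; the key and values are made up for illustration.

```python
# Built-in dict used as the Dictionary ADT; the key and values are illustrative.
D = {}
D["algorithms"] = 6006        # insert
D["algorithms"] = 6046        # insert again: overwrites the item with the same key
found = D["algorithms"]       # search
del D["algorithms"]           # delete

try:
    D["algorithms"]           # search for a key that no longer exists
    missing_raises = False
except KeyError:
    missing_raises = True     # Python reports the error as a KeyError
```

Searching for a missing key raises `KeyError`, which is Python's version of "report error if the key does not exist".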
- How do dictionaries work:
    - Simple approach:
        - direct-access table 
        - store items in array indexed by key
        - PROBLEMS:
            - 1. keys may not be integers
            - 2. hogs memory
        - SOLUTIONS:
            - 1. prehashing (Python calls this `hash`, BUT IT IS NOT HASHING in the sense used below)
                - maps keys to non-negative integers 
                - in theory, keys are finite and discrete (strings of bits), so this is always possible
                - in PYTHON:
                    - hash(x) --> maps x to an integer
                    - issue: hash(x) can equal hash(y) even though x != y
                    - ideally hash(x) == hash(y) iff x == y
                - the prehash of a key must never change while the key is in the table, otherwise search looks in the wrong place
            - 2. hashing
                - "to cut into pieces and mix around"
                - reduce universe U of all keys to a reasonable size _m_ for table
                - idea: m = Θ(n)
                    - size of table proportional to # of keys in dictionary
                - PROBLEMS:
                    - 2 keys that map to same spot in hash table = collision
                - SOLUTION:
                    - chaining:
                        - if multiple keys map to the same spot, we store all of those items in a linked list (chain) at that spot
                    - worst case: all n keys land in one slot, so search is O(n); in practice it works well because the hash spreads keys out as if at random
        - so we end up with a process called hashing with chaining
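The pieces above (prehash, reduce to a slot, chain on collision) can be put together in a small class. This is a toy sketch, not how Python's own `dict` is implemented: the class name `ChainedHashTable`, the default table size, and the use of `hash(key) % m` as the hash function are all assumptions made for illustration.

```python
class ChainedHashTable:
    """Toy hash table with chaining (illustrative only)."""

    def __init__(self, m=8):
        self.m = m                               # number of slots
        self.slots = [[] for _ in range(m)]      # one chain (a Python list) per slot

    def _slot(self, key):
        # Prehash with the built-in hash(), then reduce to a slot index in 0..m-1.
        return hash(key) % self.m

    def insert(self, key, value):
        chain = self.slots[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)          # overwrite the existing item
                return
        chain.append((key, value))               # new key: add to the chain

    def search(self, key):
        for k, v in self.slots[self._slot(key)]:
            if k == key:
                return v
        raise KeyError(key)                      # key does not exist

    def delete(self, key):
        chain = self.slots[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return
        raise KeyError(key)
```

With `m = 8`, the integer keys 1 and 9 both reduce to slot 1, so inserting both puts two items in the same chain and search has to walk that chain.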
- Why are operations constant in dictionaries?
    - ASSUMPTION: Simple uniform hashing
        - each key is equally likely to be hashed to any slot of the table, independent of where other keys are hashing.
    - ANALYSIS:
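        - expected length of a chain for n keys and m slots = 1/m + 1/m + ... (n terms, since each key lands in a given slot with probability 1/m) = n/m = load factor
        - As long as m = Ω(n) (for example m = Θ(n)), n/m will be O(1) (constant)
        - running time for insert, delete, search (search is hardest)
            - expected running time = O(1 + n/m): O(1) to compute the hash and index the table, plus O(n/m) to walk the chain
            - this is constant for all operations when n/m = O(1)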
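The n/m claim can be checked empirically. The sketch below stands in for simple uniform hashing by assigning each key a uniformly random slot; the function name `chain_lengths`, the seed, and the values of n, m, and the number of trials are arbitrary choices for illustration.

```python
import random

def chain_lengths(n, m, rng):
    """Drop n keys into m slots uniformly at random; return each slot's chain length."""
    lengths = [0] * m
    for _ in range(n):
        lengths[rng.randrange(m)] += 1
    return lengths

rng = random.Random(0)            # fixed seed so the run is repeatable
n, m, trials = 1000, 500, 200     # load factor n/m = 2

# Average length of one particular chain (slot 0) over many independent tables.
avg_slot0 = sum(chain_lengths(n, m, rng)[0] for _ in range(trials)) / trials
print(n / m, avg_slot0)           # the two numbers should be close
```

Doubling m while holding n fixed should roughly halve the observed average, which is why keeping the table size proportional to the number of keys keeps chains short.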
- Hash functions:
    - Division Method:
        - h(k) = k mod m
        - problematic if m and the keys share common factors (e.g. if m is even and every key is even, only the even slots are ever used)
        - works a lot of the time but not always
    - Multiplication Method:
        - h(k) = [(a * k) mod 2^w] >> (w-r)
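            - a = integer (should be odd)
                - in between 2^(w-1) and 2^w
            - w = number of bits in a machine word
            - r = number of bits in the slot index, i.e. m = 2^r
        - (a * k) mod 2^w keeps the rightmost w bits of the product; the shift then keeps the leftmost r of those bits
        - those bits depend on many bits of k, so the resulting slot looks well mixed
        - you end up with a number between 0 and m-1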
    - Universal Hashing: (theoretical)
        - h(k) = [(ak + b) mod p] mod m
            - a and b = random numbers between 0 and p-1
            - p = prime number > |U| --> only done once per table
        - for worst-case keys k1 != k2
            - probability of a collision, taken over the random choice of a and b
                - Pr{h(k1) = h(k2)} = 1/m
                    - so the expected number of collisions with any given key is the load factor n/m. this is the ideal situation
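The three hash functions above can be written out as short Python functions. This is a sketch under stated assumptions: the function names and the example parameters (`a=181`, `w=8`, `r=3`, `p=101`, `m=10`) are illustrative and not from the notes, and the universal hash assumes keys come from the universe 0..100 so that p = 101 is a prime larger than |U|.

```python
import random

def h_division(k, m):
    """Division method: h(k) = k mod m."""
    return k % m

def h_multiplication(k, a, w, r):
    """Multiplication method: leftmost r bits of the rightmost w bits of a*k (m = 2**r)."""
    return ((a * k) % (1 << w)) >> (w - r)

def make_universal_hash(p, m, rng):
    """Universal hashing: pick a and b at random once, then h(k) = ((a*k + b) mod p) mod m."""
    a = rng.randrange(p)      # random integer in 0..p-1
    b = rng.randrange(p)      # random integer in 0..p-1
    return lambda k: ((a * k + b) % p) % m

# Example parameters are illustrative assumptions.
print(h_division(10, 7))                         # 10 mod 7
print(h_multiplication(5, a=181, w=8, r=3))      # a slot in 0..7
h = make_universal_hash(p=101, m=10, rng=random.Random(0))
print(h(42))                                     # a slot in 0..9
```

Note that `make_universal_hash` draws a and b once and returns a fixed function, so a given key always maps to the same slot for the lifetime of the table.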