# Chapter 12: Hash Tables

A hash table is a data structure that stores keys, optionally with corresponding values. Inserts, deletes, and lookups run in O(1) time on average.

* Keys are stored in an array
* Each key is stored in an array location (a "slot") based on its "hash code"
* The integer location is determined from the key by a hash function, chosen to distribute the keys uniformly
* Two keys mapping to the same location is called a "collision"
* Collisions are dealt with by maintaining a linked list of objects at each array location
* If the hash function spreads the keys well, the time complexity of lookups, insertions, and deletes is O(1 + n/m), where n is the number of objects and m is the length of the array
* If n/m grows large, rehash to move to a larger array; rehashing is expensive, taking O(n + m) time


* Inserting and deleting are more efficient (assuming infrequent rehashing) than for a BST
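
The collision handling and rehashing described above can be sketched as a toy class. This is only an illustration, not how Python's built-in dict works: the class name, the starting size of 8 slots, and the resize threshold of n/m > 2 are all assumptions, and Python lists stand in for the linked lists.

```python
class ChainedHashTable:
    """Toy hash table using separate chaining (illustrative names throughout)."""

    def __init__(self, num_slots=8):
        self.slots = [[] for _ in range(num_slots)]  # one chain per slot
        self.size = 0  # n, the number of stored keys

    def _chain(self, key):
        # hash() supplies the hash code; the modulus maps it to a slot.
        return self.slots[hash(key) % len(self.slots)]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # overwrite an existing key
                return
        chain.append((key, value))
        self.size += 1
        if self.size > 2 * len(self.slots):  # assumed threshold: n/m > 2
            self._rehash(2 * len(self.slots))

    def lookup(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                self.size -= 1
                return
        raise KeyError(key)

    def _rehash(self, num_slots):
        # O(n + m): allocate the new array and re-insert every pair.
        old_items = [pair for chain in self.slots for pair in chain]
        self.slots = [[] for _ in range(num_slots)]
        for k, v in old_items:
            self._chain(k).append((k, v))
```

Each operation walks a single chain, which is where the O(1 + n/m) bound comes from.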

Requirements of a Hash Function:
* equal keys have equal hash codes
* distributes keys uniformly across the array
* efficient to compute


* if a key must change, remove it first, update it, then add it back
* avoid using mutable objects as keys
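
A small sketch of why this matters, using a hypothetical `Account` class whose hash depends on a mutable field. The behavior shown is what CPython's dict does with small integer hashes; the class and variable names are assumptions made for illustration.

```python
class Account:
    """Hypothetical key type whose hash depends on a mutable field."""

    def __init__(self, number):
        self.number = number

    def __eq__(self, other):
        return isinstance(other, Account) and self.number == other.number

    def __hash__(self):
        return hash(self.number)


balances = {}
acct = Account(1)
balances[acct] = 100

# Unsafe: the entry was filed under hash(1), but the key now hashes as 2,
# so neither the old nor the new value of the key finds it.
acct.number = 2
lost = Account(1) not in balances and Account(2) not in balances

# Safe: remove the entry, update the key, then add it back.
acct.number = 1                  # restore the key so the entry can be found
value = balances.pop(acct)
acct.number = 2
balances[acct] = value
found = Account(2) in balances
```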

**Hash Function for Strings:**
* examine all the characters in the string
* give a large range of values
* don't let one character dominate
* we want a rolling hash function: if one character is deleted from the front and another is added to the end, the new hash code can be computed in O(1) time

Example:


In [1]:
import functools

def string_hash(s, modulus):
    MULT = 997
    # Treat the string as a base-MULT number, reduced mod `modulus` at each step.
    return functools.reduce(lambda v, c: (v * MULT + ord(c)) % modulus, s, 0)
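
The rolling property can be sketched as follows. The helper `roll` and its parameter names are assumptions added for illustration; the block redefines the hash function so that it runs on its own. The weight of the leading character, MULT^(k-1) mod modulus, is computed once, so each window update is O(1).

```python
import functools

MULT = 997  # same multiplier as the example above


def string_hash(s, modulus):
    return functools.reduce(lambda v, c: (v * MULT + ord(c)) % modulus, s, 0)


def roll(h, old_char, new_char, top_power, modulus):
    """Drop old_char from the front of the window and append new_char."""
    h = (h - ord(old_char) * top_power) % modulus  # remove the leading term
    return (h * MULT + ord(new_char)) % modulus    # shift and add the new one


text, k, modulus = "abracadabra", 4, 1000003
top_power = pow(MULT, k - 1, modulus)  # weight of the window's first character

h = string_hash(text[:k], modulus)
rolled = [h]
for i in range(k, len(text)):
    h = roll(h, text[i - k], text[i], top_power, modulus)
    rolled.append(h)

# Recomputing every window from scratch gives the same values, at O(k) each.
direct = [string_hash(text[i:i + k], modulus) for i in range(len(text) - k + 1)]
```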

**Tips for Hash Tables**
* Hash tables have the best theoretical and real-world performance for lookup, insert, and delete: O(1) on average
* The average insert takes O(1) time, but a single insert can take O(n) if the table has to be resized
* Consider using a hash code as a signature to enhance performance, e.g., to filter out candidates
* Consider using a precomputed lookup table instead of if-then code for mappings
* Be sure to understand the relationship between logical equality and the fields the hash function must inspect
* You may need a multimap (a map that contains multiple values for a single key) or a bidirectional map
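
The last tip can be sketched with standard-library containers. The data is made up for illustration, and the bidirectional map assumes the values are unique.

```python
import collections

words = ["apple", "avocado", "banana", "blueberry", "cherry"]

# Multimap: one key maps to a list of values.
by_initial = collections.defaultdict(list)
for w in words:
    by_initial[w[0]].append(w)

# Bidirectional map: two dicts kept in sync (assumes values are unique).
code_to_name = {"us": "United States", "fr": "France"}
name_to_code = {name: code for code, name in code_to_name.items()}
```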

**Hash table Libraries**
* common hash table data structures - set, dict, collections.defaultdict, and collections.Counter
    * set only stores keys while the others store key-value pairs, none allow for duplicate keys
    * dict raises a KeyError if you try to access a key that doesn't exist
    * defaultdict returns the default value (e.g., an empty list) instead
    * Counter is used for counting the number of occurrences of keys
    * iterating over a dict yields its keys; to iterate over key-value pairs use items(), to iterate over values use values()
* set operations are
    * add
    * remove
    * discard
    * x in s (membership test)
    * s <= t (is s a subset of t) 
    * s - t (elements in s not in t)
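
A short sketch of the behaviors listed above, using only the standard library. The variable names are illustrative.

```python
import collections

d = {"a": 1}
missing_raises = False
try:
    d["missing"]               # plain dict: a missing key raises KeyError
except KeyError:
    missing_raises = True

dd = collections.defaultdict(list)
dd["x"].append(1)              # "x" is created with an empty list first

c = collections.Counter("banana")   # counts each character

s, t = {1, 2}, {1, 2, 3}
s.discard(99)                  # no error; s.remove(99) would raise KeyError
is_subset = s <= t
difference = t - s
```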
    

* for a user-defined class, implement \_\_hash\_\_(self) together with a consistent \_\_eq\_\_(self, other)
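
A minimal sketch of a hashable user-defined class. `Contact` and its fields are hypothetical. The point is that the hash inspects exactly the fields that define equality, so equal objects always get equal hash codes.

```python
class Contact:
    """Hypothetical class: two contacts are equal if name and email match."""

    def __init__(self, name, email, notes=""):
        self.name = name
        self.email = email
        self.notes = notes  # not part of equality, so not part of the hash

    def __eq__(self, other):
        return (isinstance(other, Contact)
                and (self.name, self.email) == (other.name, other.email))

    def __hash__(self):
        return hash((self.name, self.email))


seen = {Contact("Ada", "ada@example.com", notes="first")}
duplicate = Contact("Ada", "ada@example.com", notes="second")
different = Contact("Ada", "ada@example.org")
```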

## 12.1 Test for Palindromic Permutations
A palindrome is a string that reads the same forwards and backwards, e.g., "level", "rotator", and "foobaraboof".

Write a program to test whether the letters forming a string can be permuted to form a palindrome. For example, "edified" can be permuted to form "deified".

In [26]:
# idea: count the characters with a hash table (Counter); the string can be
# permuted into a palindrome if at most one character has an odd count
import collections

def palindrome_permutation(s):
    counts = collections.Counter(s)
    return sum(v % 2 for v in counts.values()) <= 1

assert palindrome_permutation("level")
assert palindrome_permutation("footaratoof")
assert palindrome_permutation("rotator")
assert palindrome_permutation("edified")
assert not palindrome_permutation("edifiedt")
assert palindrome_permutation("rotaator")
assert not palindrome_permutation("rotattor")
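
An alternative sketch that tracks only the characters seen an odd number of times, so no counts are stored. The function name is an assumption; the test is the same "at most one odd character" condition as above.

```python
def can_form_palindrome(s):
    odd = set()  # characters seen an odd number of times so far
    for c in s:
        if c in odd:
            odd.remove(c)  # second occurrence pairs up with the first
        else:
            odd.add(c)
    return len(odd) <= 1
```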

## 12.2 Is an anonymous letter constructible?
Write a program that takes text for an anonymous letter and text for a magazine and determines if it is possible to write the anonymous letter using the magazine.

The letter could be written using the magazine if for each character in the letter the number of times it appears is no more than the number of times it appears in the magazine.

In [52]:
import collections

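def letter_construct(letter, magazine):
    # Spaces are ignored on both sides.
    letter = letter.replace(" ", "")
    magazine = magazine.replace(" ", "")
    if len(letter) > len(magazine):
        return False
    # Characters still needed to write the letter.
    needed = collections.Counter(letter)
    if not needed:
        return True  # an empty letter is always constructible
    for ch in magazine:
        if needed[ch] > 0:
            needed[ch] -= 1
            if needed[ch] == 0:
                del needed[ch]
                if not needed:
                    return True  # stop early once every character is covered
    return False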
    
assert(letter_construct("test me", "text") == False)
assert(letter_construct("test me", "etextsm") == True)
assert(letter_construct("test me", "textsm") == False)
assert(letter_construct("test me", "etextmsasdfasdoijasdfijm") == True)
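
A more compact variant, sketched under the same assumption that spaces are ignored. Subtracting one Counter from another keeps only positive counts, so an empty result means the magazine covers every character. The function name is illustrative.

```python
import collections


def letter_constructible(letter, magazine):
    needed = collections.Counter(letter.replace(" ", ""))
    available = collections.Counter(magazine.replace(" ", ""))
    # Counter subtraction drops zero and negative counts.
    shortfall = needed - available
    return not shortfall
```

This version always reads the whole magazine, whereas the loop above can stop as soon as the letter is covered.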

## 12.3 Implement an ISBN Cache
The ISBN is a unique commercial book identifier. It is a string of length 10. The first 9 characters are digits; the last character is a check character: the sum of the first 9 digits, mod 11, with 10 represented by X.

Create a cache for looking up prices of books identified by their ISBN. Implement lookup, insert, and remove methods. Use the Least Recently Used (LRU) policy for cache eviction. If an ISBN is already present, insert should not change the price, but should update that entry to be the most recently used entry. Lookup should also update that entry to be the most recently used entry.

In [124]:
# create an object to store ISBN, price, and last used timestamp
# store in hash table, hashed on ISBN
# brainstorm ways to have eviction be faster than O(n) - could do it with more storage using a heap on time stamp
import time

def ISBN_10digit(ISBN_first9):
    """
    input: 
        ISBN_first9 - first 9 digits of an ISBN as a string
    returns: 
        10 digit ISBN as a string where the last digit is the sum of the first 9 mod 11, with 10 replaced with X
    """
    # str replaces Python 2's basestring, which does not exist in Python 3
    if not isinstance(ISBN_first9, str) or len(ISBN_first9) != 9:
        return "ISBN_first9 must be a string of length 9"
    nums = [int(i) for i in ISBN_first9]
    last_digit = sum(nums) % 11
    if last_digit == 10:
        last_digit = 'X'
    return str(ISBN_first9) + str(last_digit)

assert(ISBN_10digit("111111111") == "1111111119")
assert(ISBN_10digit("111111112") == "111111112X")
assert(ISBN_10digit("111119112") == "1111191127")
assert(ISBN_10digit(1122341) == "ISBN_first9 must be a string of length 9")
assert(ISBN_10digit("1122341") == "ISBN_first9 must be a string of length 9")

class ISBN_item:
    """A cached book: its ISBN, price, and last-used timestamp."""

    def __init__(self, ISBN, price):
        self.ISBN = ISBN
        self.price = price
        self.last_used = time.time()

    def set_last_used(self):
        self.last_used = time.time()

    def get_price(self):
        # Reading the price counts as a use.
        self.set_last_used()
        return self.price

    def get_last_used(self):
        return self.last_used

    def get_ISBN(self):
        return self.ISBN

a = ISBN_item(ISBN_10digit("111111111"), 9)

In [157]:
import collections
import time

class ISBN_cache:
    def __init__(self, max_cache):
        # A plain dict: membership tests must not create entries as a side effect.
        self.ISBN_dict = {}
        self.max_cache = max_cache

    def find_oldest(self):
        # O(n) scan for the entry with the smallest last-used timestamp.
        return min(self.ISBN_dict,
                   key=lambda isbn: self.ISBN_dict[isbn].get_last_used())

    def insert_ISBN(self, ISBN, price):
        if ISBN in self.ISBN_dict:
            # already in the dict, don't update anything but the last_used time
            self.ISBN_dict[ISBN].set_last_used()
            return
        # Evict only when a new entry is being added to a full cache.
        if len(self.ISBN_dict) >= self.max_cache:
            self.remove(self.find_oldest())
        self.ISBN_dict[ISBN] = ISBN_item(ISBN, price)

    def lookup(self, ISBN):
        if ISBN not in self.ISBN_dict:
            return "ISBN doesn't exist"
        # get_price() also refreshes the last_used time
        return self.ISBN_dict[ISBN].get_price()

    def remove(self, ISBN):
        self.ISBN_dict.pop(ISBN)
    
a = ISBN_cache(5)
a.insert_ISBN(ISBN_10digit("111111111"), 9)
#print a.ISBN_dict["1111111119"].get_last_used()
time.sleep(.5)
assert(a.lookup("1111111119") == 9)
#print a.ISBN_dict["1111111119"].get_last_used()
a.remove("1111111119")
assert(a.lookup("1111111119") == "ISBN doesn't exist")

a.insert_ISBN(ISBN_10digit("111111111"), 9)
a.insert_ISBN(ISBN_10digit("121111111"), 10)
a.insert_ISBN(ISBN_10digit("131111111"), 11)
a.insert_ISBN(ISBN_10digit("141111111"), 12)
a.insert_ISBN(ISBN_10digit("151111111"), 13)
a.insert_ISBN(ISBN_10digit("161111111"), 14)
assert(a.lookup("1111111119") == "ISBN doesn't exist")
assert(a.lookup("121111111X") == 10)
a.insert_ISBN(ISBN_10digit("171111111"), 15)
assert(a.lookup("121111111X") == 10)
assert(a.lookup("1311111111") == "ISBN doesn't exist")

a.insert_ISBN(ISBN_10digit("121111111"), 50)
assert(a.lookup("121111111X") == 10)
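
One way to make eviction O(1) instead of scanning for the oldest timestamp is to let an ordered hash table track recency. The following is a sketch in Python 3 using `collections.OrderedDict`. The class and method names are assumptions, it returns None for a missing ISBN rather than an error string, and it assumes a capacity of at least 1.

```python
import collections


class LRUPriceCache:
    """Sketch of an LRU cache keyed on ISBN (assumes capacity >= 1)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.prices = collections.OrderedDict()  # least recently used first

    def lookup(self, isbn):
        if isbn not in self.prices:
            return None
        self.prices.move_to_end(isbn)  # mark as most recently used
        return self.prices[isbn]

    def insert(self, isbn, price):
        if isbn in self.prices:
            self.prices.move_to_end(isbn)  # keep the existing price
            return
        if len(self.prices) >= self.capacity:
            self.prices.popitem(last=False)  # evict the least recently used
        self.prices[isbn] = price

    def remove(self, isbn):
        # Returns True if the ISBN was present.
        return self.prices.pop(isbn, None) is not None
```

Both `move_to_end` and `popitem(last=False)` run in constant time, so lookup, insert, and remove are all O(1).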

## 12.6 Find the Nearest Repeated Entries in an Array

People do not like reading text in which a word is used multiple times in a short paragraph. You are to write a program which helps identify such a problem. 

Write a program which takes as input an array and finds the distance between a closest pair of equal entries. For example, if s = ["All", "work", "and", "no", "play", "makes", "for", "no", "work", "no", "fun", "and", "no", "results"], then the 2nd and 3rd occurrences of "no" are the closest pair.

In [164]:
# add each word to a hash table, keeping only its most recent location
# separately store the current smallest distance and its 2 locations
def nearest_words(s):
    min_word = None
    first = None
    second = None
    word_dict = {}  # word -> index of its most recent occurrence
    
    for i, word in enumerate(s):
        if word in word_dict:
            if first is None or i - word_dict[word] < second - first:
                first = word_dict[word]
                second = i
                min_word = word
        word_dict[word] = i
    return (min_word, first, second)

nearest_words( ["All", "work", "and", "no", "play", "makes", "for", "no", "work", "no", "fun", "and", "no", "results"])

('no', 7, 9)