# Hash Table: Supported Operations

Purpose:
 - Think of it like an array: an array is very good at pulling the value at a particular index, and a hash table does the same for arbitrary keys.
 - Maintain a (possibly evolving) set of data (transactions, people associated with data, IP addresses, etc.)
 - Does not maintain order for its elements. 
 - In Python, the built-in `dict` is implemented as a hash table.
 
Operations, all done using a "key":
 - Insert - add a new record
 - Delete - delete an existing record
 - Lookup - check for a particular record. Basically, give the hash table a key, and the table returns the data associated with that key.
 - All operations run in O(1) time, which is remarkable.
 - Only true for properly implemented hash tables. May have to write my own hash table in the future.
 - The O(1) guarantee also only holds under certain conditions (e.g. the data is not pathological for the chosen hash function).
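
A minimal sketch of the three operations using Python's built-in `dict`. The IP-address keys and visit counts are made-up illustration data, not from any real system.

```python
# Python's built-in dict used as a hash table.
# Keys and values below are hypothetical illustration data.
visits = {}

# Insert: add a new record under a key.
visits["10.0.0.1"] = 3
visits["10.0.0.2"] = 7

# Lookup: give the table a key, get back the associated data.
count = visits["10.0.0.1"]
missing = visits.get("10.0.0.9")    # .get returns None for a key never inserted

# Delete: remove an existing record by key.
del visits["10.0.0.2"]
has_second = "10.0.0.2" in visits   # membership test is also a lookup
```

Each of these is a single hash-and-probe step on average, which is where the O(1) claim comes from.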
 
Example Applications:
 - De-Duplication:
     - Given a "stream" of objects (e.g. a linear scan through a big file, or objects arriving in real time).
     - Goal: remove the duplicates, i.e. keep track of the unique objects (like reporting unique visitors to a web site).
     - Solution: using a hash table, when a new object x arrives, look up x in the hash table. If it is not found, insert it into the table.
 - 2-Sum Problem:
     - Given an array A of n numbers and a target sum t.
     - Goal: determine whether or not there are 2 numbers x and y in the array with x + y = t.
     - Solutions:
         - Worst naive one: look at all n choose 2 sums and see if any matches t. O(n^2).
         - 2nd solution: sort A upfront (O(nlogn)). Then, for each entry x, binary search for y = t - x. Each search is O(logn) in a sorted array, so the n searches take O(nlogn) and the total is O(nlogn).
     - Hash Table Solution:
         - Note: hash tables speed up solutions that require repeated lookups.
         - Looking above, sorting the array was really just to make lookups easier. Instead of that, can hash.
         - Insert all numbers into a hash table (constant time each, O(n) total).
         - For each element x, look for t - x in the hash table (constant time for each x). Total: O(n).
 - Historical Applications: symbol tables in compilers. 
 - Blocking network traffic (such as blocking a spam IP address)
 - Search algorithms (e.g. game tree exploration, like a chess game):
     - Use a hash table to avoid exploring any configuration (e.g. an arrangement of chess pieces) more than once.
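
A rough sketch of the de-duplication and 2-Sum ideas, using Python's `set` (a hash table without values). The function names are my own. One deliberate difference from the notes: `has_two_sum` inserts each number after checking it, rather than inserting everything upfront, so that an element is never paired with itself when t = 2x.

```python
def deduplicate(stream):
    """Yield each object from the stream the first time it appears."""
    seen = set()
    for x in stream:
        if x not in seen:   # Lookup
            seen.add(x)     # Insert
            yield x


def has_two_sum(numbers, t):
    """Return True if two entries at different positions sum to t."""
    seen = set()
    for x in numbers:
        if t - x in seen:   # constant-time lookup for the partner of x
            return True
        seen.add(x)         # insert only after the check
    return False


unique = list(deduplicate([3, 1, 3, 2, 1]))
found = has_two_sum([1, 4, 7, 10], 11)
```

Both functions make one pass over the input with O(1) work per element on average, so each is O(n) overall.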

# Implementation Details:

## High-Level Idea:

Setup: Identify the universe U, i.e. all possible configurations of whatever is being stored (e.g. all IP addresses, all names, all chess board configurations, etc.). U is generally huge: the universe of IPv4 addresses has size 2^32.

Goal: Want to maintain an evolving set S contained in U. S is much smaller than U, generally a reasonable size.

Naive Solutions:
 - Array-based solution: an array indexed by U, with one slot for every element in the universe. O(1) operations but O(|U|) space, which is infeasible.
 - List-based solution: O(|S|) space but O(|S|) lookup.
 
Solution (Kind of like array-based):
 - Pick n = # of "buckets", with n ~ |S|. Maybe n = 2|S|. Each entry of the array is one "bucket." Assume that while S itself can change, the size of S does not change too much.
 - Choose a hash function h: U -> {0, 1, 2, ..., n-1}. It takes a key as input and outputs a position in the array, i.e. it maps the universe to buckets and tells you which position should store a given key.
 - Use array A of length n, store x in A[h(x)]
     - What about collisions? n < |U|, so some distinct keys x and y must have h(x) = h(y). How do we deal with it? Pretty much unavoidable.
     - For example (birthday paradox): with only 23 people with random birthdays, there is about a 50% chance that 2 people have the same birthday. On a planet with k days in a year, it takes about sqrt(k) people to reach 50%: n people form about n^2/2 pairs, each pair matches with probability 1/k, so matches start to show up once n^2 is around k.
     - Hash functions effectively do a compression from U down to the n buckets.
 - Resolving Collisions (both fairly practical):
     - Separate Chaining - keep a linked list in each bucket. Each bucket can hold a list with an unbounded # of elements (in principle).
         - Given a key/object x, perform Insert/Delete/Lookup in the list in bucket A[h(x)].
     - Open Addressing - only one object per bucket.
         - The hash function now specifies a probe sequence: a series of positions in the array to try until an open one is found.
         - EX: linear probing (try position x, then x+1, x+2, etc., wrapping around at the end of the array).
         - EX: double hashing, 2 different hash functions. The first gives the starting position; if that is taken, the 2nd gives an additive shift (say the first returns x and the 2nd returns 23, then use 23 as an increment: try x, x+23, x+46, etc.).
     - Open Addressing saves space compared to Separate Chaining, but deletion is more difficult.
         - If need be, implement both.
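
A minimal sketch of separate chaining, under a few assumptions of mine: each bucket is a plain Python list instead of a linked list, the bucket count is fixed (no resizing), and Python's built-in `hash()` followed by `%` stands in for the hash function h. The class and method names are hypothetical.

```python
class ChainedHashTable:
    """Hash table with separate chaining; each bucket is a Python list."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # hash() produces an integer; % maps it to a bucket index 0..n-1.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def lookup(self, key):
        for k, v in self._bucket(key):   # scan only the one bucket
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False


table = ChainedHashTable(num_buckets=8)
table.insert(0, "zero")
table.insert(8, "eight")   # 0 and 8 collide: both land in bucket 0
```

The colliding keys 0 and 8 end up in the same chain, and lookups still find the right one by comparing keys inside the bucket.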

### The Hash Function

What makes one good?

Note: In a hash table with chaining, Insert is O(1) (assuming the new object x is inserted at the front of the list in bucket A[h(x)]).
 - Delete/Lookup are O(list length).
 - To look up x, go to bucket h(x) in O(1) time, then do an exhaustive search of the list in that bucket, which is O(list length). With m objects and n buckets, the list length can be anywhere from m/n (perfectly even spread) to m (everything in one bucket). The worst hash function maps every x to the same value.
 - Point: Performance depends on the choice of the hash function. It should distribute data well across buckets. True for both types of collision resolution.
     - Analogous situation with open addressing: want probe sequences that find an open slot (or the target key) after only a few probes.
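
To make the m/n versus m range concrete, here is a small sketch that counts chain lengths under two hash functions. The keys (the integers 0 to 99) and the helper name `chain_lengths` are made up for illustration.

```python
def chain_lengths(keys, h, n):
    """Return how many keys land in each of the n buckets under hash function h."""
    lengths = [0] * n
    for x in keys:
        lengths[h(x)] += 1
    return lengths


n = 10
keys = range(100)   # m = 100 hypothetical integer keys

worst = chain_lengths(keys, lambda x: 0, n)      # every key maps to bucket 0
even = chain_lengths(keys, lambda x: x % n, n)   # spreads these particular keys evenly

longest_worst = max(worst)   # m = 100
longest_even = max(even)     # m / n = 10
```

With the constant hash function, every lookup degrades to a scan of all m objects; with the even spread, a lookup scans only m/n.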
     
Properties of a "Good" Hash Function:
 - Should lead to good performance, i.e. spread data out as evenly as possible among buckets (gold standard: completely random hashing).
 - The hash function itself must be easy to store and very fast to evaluate.
 - Very difficult to design. Even now, still debated and done in different ways.

Bad Hash Functions:
 - Easy to design, and bad functions lead to very bad performance.
 - EX: Keys = phone numbers (10 digits in US). Universe size is 10^10, very large.
     - Choose n = 1000. Hash function should take a phone number and spit out a number between 0 and 999.
     - Really bad function: use the 3 most significant digits of the phone number (the area code) as the bucket.
         - Wastes space in the hash table because not all buckets are used, and some buckets will be very full.
     - Mediocre: use the last 3 digits of the phone number. Not guaranteed to be uniformly distributed, vulnerable to patterns.
 - EX: What if keys = memory locations (will be multiples of a power of 2 because of byte alignment)?
     - Bad hash function: h(x) = x % 1000, the remainder of the address divided by 1000. With x in base 10, this basically takes the last 3 digits.
         - All odd buckets are guaranteed to be empty: every multiple of a power of 2 is even, and an even number mod 1000 is still even.
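
A quick check of the memory-location example, assuming (my choice) that the addresses are consecutive multiples of 8. The address list is synthetic.

```python
# Hypothetical memory addresses: 10,000 consecutive multiples of 8.
addresses = [8 * i for i in range(10_000)]

n = 1000
used = {x % n for x in addresses}   # set of buckets that receive at least one key

odd_buckets_used = [b for b in used if b % 2 == 1]
fraction_used = len(used) / n
```

Because 8 and 1000 share the factor 8, only buckets that are multiples of 8 are ever hit: 125 of the 1000 buckets, and no odd ones.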
     
Quick-And-Dirty Hash Funcs:
 - Think of the design in 2 separate parts: Objects -> Integers -> Buckets, i.e. "Hash Code" + "Compression Function."
     - The hash code is sometimes skipped (EX: phone numbers are already integers). Otherwise, the hash code takes something like a string and converts it to an integer.
     - The compression function is something like the % operation above.
 - Need to choose a good number of buckets n.
     - Quick and dirty: use modulus compression. How to choose n?
     - Want to make sure no buckets are guaranteed to be empty (refer to memory EX above).
     - The number of buckets should not share any factors with the data being hashed. Easiest way: make the # of buckets a prime, which has no nontrivial factors.
     - The number of buckets should be comparable to the size of the set being stored. Pick a prime within some constant factor of the set size.
     - Want the prime not to be too close to patterns in the data:
         - EX: patterns may show up in base 10 for phone numbers (area codes), and memory locations have base 2 patterns. So avoid primes too close to a power of 10 or a power of 2.
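
A sketch of the two-part design, with assumptions labeled: the hash code is a simple base-31 polynomial over character codes (my choice, not something these notes prescribe), and the compression function is `%` with a prime bucket count of 997. All function names are hypothetical.

```python
def hash_code(s):
    """Objects -> Integers: a simple polynomial hash code (base 31) for strings."""
    code = 0
    for ch in s:
        code = code * 31 + ord(ch)
    return code


def compress(code, n):
    """Integers -> Buckets: modulus compression into n buckets."""
    return code % n


def h(key, n):
    """Full hash function: hash code followed by compression."""
    return compress(hash_code(key), n)


# Same synthetic multiples-of-8 keys, compressed with a non-prime vs a prime n.
addresses = [8 * i for i in range(10_000)]
buckets_used_1000 = len({compress(x, 1000) for x in addresses})
buckets_used_997 = len({compress(x, 997) for x in addresses})
```

With n = 1000 only 125 buckets are reachable, while the prime 997 shares no factor with 8, so all 997 buckets get used.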
         
This is by no means state-of-the-art.
         