Note: Skipped Advanced Union-Find (optional series)

# Introduction and Motivation

Binary Code - maps each character of an alphabet Sigma to a binary string. 
 - EX: Sigma = a to z plus various punctuation (size 32 overall, for example). To encode, use the 32 5-bit binary strings (2^5 = 32).
     - Can we do better? Yes. If some characters of Sigma are much more frequent than others, use a variable-length code.
     
Ambiguity EX: Suppose Sigma = {A,B,C,D}. Fixed-length encoding would be {00, 01, 10, 11}, i.e. 2 bits per character.
 - Use instead the variable-length encoding {0, 01, 10, 1}. Try to get away with 1 bit for some characters.
 - But what if someone sends the bits 001? What is the original sequence? Not enough information to figure it out: it can be AAD or AB. Variable length creates ambiguity. Unclear where one symbol ends and the next one begins.
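
To make the ambiguity concrete, here is a minimal sketch that enumerates every way a bit string can be parsed under a given code. The function name `all_decodings` and the dict representation of the code are assumptions made for illustration only.

```python
def all_decodings(bits, code):
    """Return every symbol sequence whose encoding under `code` equals `bits`."""
    if not bits:
        return [""]  # exactly one way to decode the empty string
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            # Consume this codeword, then decode the remainder in every possible way.
            for rest in all_decodings(bits[len(word):], code):
                results.append(symbol + rest)
    return results


ambiguous = {"A": "0", "B": "01", "C": "10", "D": "1"}
print(sorted(all_decodings("001", ambiguous)))  # ['AAD', 'AB']
```

Two parses come back for 001, which is exactly the AAD vs. AB problem above.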
 

**Prefix-Free Codes**

Problem: With variable-length codes, not clear where one character ends and the next begins

Solution: Prefix-Free Codes - Make sure that for every pair i,j in Sigma, neither of the encodings f(i), f(j) is a prefix of the other

Example: Can encode {A,B,C,D} as {0, 10, 110, 111}. No codeword is a prefix of any other. This can take advantage of non-uniform frequencies in an alphabet.
 - Let's say the frequencies are: A = 0.6, B = 0.25, C = 0.1, D = 0.05.
 - Performance = expected number of bits needed to encode one symbol
 - Fixed Length Performance: every symbol costs 2 bits, so 2 bits per character.
 - Variable Length Performance = 1 * 0.6 + 2 * 0.25 + 3 * (0.1 + 0.05) = 1.55 bits per character
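
The arithmetic above can be double-checked with a short sketch. The helper names `is_prefix_free` and `average_length` are made up for this example; the codes and frequencies are the ones from the notes.

```python
from itertools import permutations


def is_prefix_free(code):
    """True if no codeword is a prefix of a different symbol's codeword."""
    return not any(u.startswith(v) for u, v in permutations(code.values(), 2))


def average_length(code, freq):
    """Expected number of bits per symbol: sum of p_i * len(f(i))."""
    return sum(freq[s] * len(code[s]) for s in code)


freq = {"A": 0.6, "B": 0.25, "C": 0.1, "D": 0.05}
fixed = {"A": "00", "B": "01", "C": "10", "D": "11"}
variable = {"A": "0", "B": "10", "C": "110", "D": "111"}

print(is_prefix_free(fixed), is_prefix_free(variable))
print(average_length(fixed, freq))     # about 2.0
print(average_length(variable, freq))  # about 1.55 (up to float rounding)
```

The ambiguous code {0, 01, 10, 1} from the earlier example fails the `is_prefix_free` check, since 0 is a prefix of 01.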
 
So, we know variable length encoding can improve performance. What variable length code to use?

# Huffman Codes

Useful to think of codes as trees. 

Goal: Best binary prefix-free encoding for a given set of character frequencies.

Useful Fact: Binary codes are basically binary trees.
 - EX: Sigma = {A,B,C,D}. Fixed length: Each char of alphabet is leaf in the final tree. Root to Leaf path gives it the encoding. 0 is Left, 1 is Right. A = 00, B = 01, C = 10, D = 11. 
 - With variable encoding (not prefix-free), A = 0, B = 01. The first 0 leads to a node labeled A. Following a 1 from there leads to B. Ambiguity arises exactly when a labeled node is an internal node of the tree (here, A).
 - Prefix-Free Code: Not a perfectly balanced tree, but labels appear only at the leaves. Can draw out A = 0, B = 10, C = 110, D = 111. Each bit goes down one level of the tree.
 
General Idea:
 - For each i in Sigma, exactly one node is labeled "i"
 - Encoding of each i is bits along path from root to the node "i".
 - Prefix-Free iff labeled nodes = the leaves. (One codeword being a prefix of another means one labeled node is an ancestor of another.)
 - To decode: Start at root; on 0 go left, on 1 go right. On reaching a leaf, output the symbol i for that leaf, then start at the root again.
     - Encoding length of i = depth of i in the tree.
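
The decoding procedure above can be sketched as follows. This is one possible representation, not the only one: the tree is a nested dict keyed by "0"/"1", leaves are stored as the symbol itself, and symbols are assumed to be strings. The names `build_tree`, `encode`, and `decode` are hypothetical.

```python
def build_tree(code):
    """Build the code tree from a prefix-free {symbol: codeword} dict."""
    root = {}
    for symbol, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})  # walk down, creating internal nodes
        node[word[-1]] = symbol              # the last bit leads to the leaf
    return root


def encode(text, code):
    return "".join(code[s] for s in text)


def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node[bit]            # "0" = left child, "1" = right child
        if isinstance(node, str):   # reached a leaf: emit symbol, restart at root
            out.append(node)
            node = root
    if node is not root:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(out)


code = {"A": "0", "B": "10", "C": "110", "D": "111"}
tree = build_tree(code)
bits = encode("ABCAD", code)
print(bits, "->", decode(bits, tree))
```

Because labels sit only at leaves, the decoder never has to guess where a codeword ends.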

Problem:
 - Input: Probability pi for each character i in Sigma
 - Notation: If T = tree with leaves corresponding to symbols of sigma, then L(T) = Sum(pi * depth of i in T) = avg encoding length.

## The Greedy Algorithm

Question: What's a principled approach for building a tree with leaves corresponding to symbols of Sigma:
 - Natural but Suboptimal - Top-Down/Divide and Conquer
     - Split symbols into Sigma 1, Sigma 2 each with ~ 50% total frequency
     - Recursively compute Ti for Sigma i, return the two trees connected under a common root.
 - Huffman's Optimal Idea:
     - Build tree bottom-up, start with leaves of tree and then do successive mergers, each step take two sub-trees and link them together as sub-trees under common internal node.  
     - Intuitively clear, systematic way to build a tree with a prescribed set of leaves.
         - Merge creates a new internal node, then merges 2 subtrees under it. Drops # subtrees by 1. Cool. 
         - Start with N leaves, do N-1 successive merges. Introduce N-1 new unlabeled internal nodes, creates single tree. 
     
Question: Which pair of symbols is "safe" to merge?
 - Observe: final encoding length of i = # of mergers its subtree endures.
     - Mergers increase encoding length of symbols by 1. Gives way to progress with greedy heuristic. 
 - Have N original symbols, must pick 2 to merge. So, should merge symbols least frequent first (increases average by the least). 
 
How to Recurse?
 - 1st iteration of algorithm merges symbols a and b. Merging these two forces the algorithm to output a tree where a and b have the same parent (i.e. are siblings). Their encodings are identical in length.
     - In recursion, treat as the same symbol. Introduce new meta-symbol ab, representing all occurrences of either one. The new frequency of ab will be prob(a) + prob(b)

Huffman's Algorithm:
 - If len(Sigma) = 2, return the tree with a root and the two symbols as its children (one encoded as 0, the other as 1)
 - Let a,b in Sigma have the smallest frequencies
 - Let Sigma` = Sigma with a,b replaced by new symbol ab
 - Define pab = pa + pb
 - Recursively compute T` (for the alphabet Sigma`)
 - Extend T` (with leaves = Sigma`) to a tree T with leaves = Sigma by splitting leaf ab into the two leaves a and b
 - Return T
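
A minimal sketch of the recursive procedure above, assuming symbols are strings and frequencies are given as a dict with at least two entries. It returns codewords rather than an explicit tree; representing the meta-symbol ab as the tuple `(a, b)` is an illustrative choice, not something prescribed by the notes.

```python
def huffman(freq):
    """Return {symbol: codeword} for a dict of symbol frequencies (len >= 2)."""
    if len(freq) == 2:
        x, y = freq
        return {x: "0", y: "1"}           # base case: one bit per symbol
    # The two symbols with the smallest frequencies (linear scan or sort).
    a, b = sorted(freq, key=freq.get)[:2]
    # Sigma' = Sigma with a, b replaced by the meta-symbol ab, p_ab = p_a + p_b.
    meta = (a, b)
    smaller = {s: p for s, p in freq.items() if s not in (a, b)}
    smaller[meta] = freq[a] + freq[b]
    code = huffman(smaller)               # T' for the alphabet Sigma'
    # Extend T' to T: split leaf ab into the two leaves a and b.
    prefix = code.pop(meta)
    code[a] = prefix + "0"
    code[b] = prefix + "1"
    return code


freq = {"A": 0.6, "B": 0.25, "C": 0.1, "D": 0.05}
code = huffman(freq)
print({s: len(w) for s, w in code.items()})  # lengths: A 1, B 2, C 3, D 3
```

On the running example this gives codeword lengths 1, 2, 3, 3, matching the 1.55 bits per character computed earlier. The exact 0/1 labels depend on tie-breaking and are not unique.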

## Correctness Proof

Theorem: Huffman's algorithm computes a binary tree that minimizes the avg encoding length
 - L(T) = Sum(pi * depth i)
 
Proof (By Induction on size n of alphabet Sigma, assume n >= 2):
 - Base Case: when n = 2, algorithm outputs the optimal tree (1 bit per symbol)
 - Inductive Step: Fix integer n = |Sigma| > 2.
 - By Inductive Hypothesis: Algorithm solves the smaller subproblem (for Sigma` with |Sigma`| = n - 1) optimally
 
Inductive Step:
 - Let Sigma` = Sigma with a,b replaced by meta-symbol ab. a,b have the smallest freqs, pab = pa + pb.
     - With ab, commits a,b to be siblings in final tree. Can split meta-node ab: replace it with an internal node w/children a, b.
     - Can go between these two forms, i.e. combine or split ab. 
     - Let Xab = trees for Sigma that have a,b as siblings. 
 - Important: For every such pair T` and T, L(T) - L(T`) is pa[depth of a in T] + pb[depth of b in T] - pab[depth of ab in T`] (all other symbols have the same depth in both trees, so their terms cancel).
     - Let d = depth of ab in T`
     - = (pa + pb)(d + 1) - (pa + pb)(d) = pa + pb
     - The two trees are exactly the same except that a,b in T sit one level below where ab sits in T`.
     - Key, the difference between two avg encoding lengths is some constant without depending on which trees we started with. 
         - Doesnt matter if tree perfectly balanced or lopsided. 
         - So, avg encoding length preserved up to a universal constant
     - Preserves objective function up to a constant
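
A quick numeric check of the "constant difference" claim, under the assumption that a tree can be summarized by the depth of each leaf. The two depth maps below are hand-picked valid trees for Sigma` = {ab, c, d}; the names `avg_len` and `split_ab` are invented for this sketch.

```python
def avg_len(depths, p):
    """L(T) = sum of p_i * depth of i in T."""
    return sum(p[s] * depths[s] for s in depths)


def split_ab(depths_prime):
    """Map a tree T' for Sigma' to T: leaf ab becomes an internal node with children a, b."""
    depths = {s: d for s, d in depths_prime.items() if s != "ab"}
    depths["a"] = depths["b"] = depths_prime["ab"] + 1
    return depths


p = {"a": 0.05, "b": 0.1, "c": 0.25, "d": 0.6}
p_prime = {"ab": p["a"] + p["b"], "c": p["c"], "d": p["d"]}

# Two differently shaped trees for Sigma' (ab deep in one, shallow in the other).
tree1 = {"d": 1, "c": 2, "ab": 2}
tree2 = {"ab": 1, "c": 2, "d": 2}

for t_prime in (tree1, tree2):
    t = split_ab(t_prime)
    print(avg_len(t, p) - avg_len(t_prime, p_prime))  # about 0.15 = pa + pb both times
```

The difference is pa + pb no matter where ab sits, so minimizing L(T`) over trees for Sigma` is the same as minimizing L(T) over Xab.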
         
Recall Inductive Hypothesis: Huffman's algorithm computes a tree T`.hat that minimizes L(T`) for Sigma`.
 - Corresponding tree T.hat minimizes L(T) for Sigma over all trees in Xab.
 - By minimizing avg encoding length over all feasible solutions for the smaller subproblem, the recursive call is actually minimizing the avg encoding length for the original problem (original alphabet Sigma) over a subset of the feasible solutions, namely Xab.
 - Right now, we get the best possible tree among those in which a and b happen to be siblings. That is bad if there is no optimal solution in which a and b are siblings.
 
Key Lemma: There is an optimal tree for Sigma in Xab (i.e. a and b were safe to merge: some optimal tree has a and b, the two lowest-frequency symbols, as siblings)
 - Intuition: Can make an optimal tree no worse by pushing a and b as deep as possible (since a and b have smallest frequencies)
 - By exchange argument. Let T* be any tree that minimizes L(T) for Sigma.
 - Let x, y be siblings at the deepest level of T*. a,b are somewhere as leaves in tree
 - Exchange: obtain T.hat from T* by swapping labels a <-> x and b <-> y
 - T.hat is in Xab (a,b are now siblings at the bottom) by choice of x,y
 - Will show that L(T.hat) <= L(T*). This shows T.hat is also optimal and completes the proof.
     - Reason: the depths of a,b and x,y swapped. 
     - L(T*) - L(T.hat) = (px - pa) * (depth of x in T* - depth of a in T*) + (py - pb) * (depth of y in T* - depth of b in T*)
     - Since a and b have the lowest frequencies, px - pa and py - pb are both >= 0.
         - Since x and y are at the deepest level of T*, (depth of x in T* - depth of a in T*) and (depth of y in T* - depth of b in T*) are also >= 0.
         - All four factors are non-negative, so the whole sum is non-negative.
         - Thus, L(T*) >= L(T.hat) which means T.hat is also optimal
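
For reference, the difference formula used in the exchange argument can be expanded term by term. This is a worked version of the same step, writing d_i for the depth of i in T* (notation introduced only here) and assuming a, b, x, y are four distinct leaves; all other symbols keep their depth and cancel.

```latex
\begin{align*}
L(T^*) - L(\hat{T})
  &= \bigl(p_a d_a + p_x d_x + p_b d_b + p_y d_y\bigr)
   - \bigl(p_a d_x + p_x d_a + p_b d_y + p_y d_b\bigr) \\
  &= (p_x - p_a)(d_x - d_a) + (p_y - p_b)(d_y - d_b) \;\ge\; 0
\end{align*}
```

The first bracket is the contribution of the four symbols in T*; the second is their contribution after the label swap, where a sits at depth d_x, x at depth d_a, and likewise for b and y.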
         
QED

## Implementation and Running Time

Naive Implementation: O(n^2) where n = |Sigma|
 - Total number of recursive calls is linear in the size of the alphabet
 - Each call searches for the minimum-frequency symbols a and b, which is also linear time. Thus, O(n^2) overall
 - Note the repeated minimum computations: that is the signal to use a Heap
 
Speed-Up with Heap (to perform repeated min comp):
 - Use keys = frequencies
 - After extracting the two smallest-frequency symbols, re-Insert the new meta-symbol.
 - Iterative, O(nlogn) implementation
 - Even Faster: (non-trivial) Sorting + O(n) additional work. Still O(nlogn) but with smaller constants.
     - If sorting as pre-processing step, do not need to use Heap data structure for this. Can use a queue, likely use two queues. 
     - 1 sort, linear work with 2 queues after. 
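
Both speed-ups can be sketched side by side. This is an illustrative version under a few assumptions: symbols are strings, a leaf is stored as the symbol itself and an internal node as a 2-tuple, and the function names are hypothetical. The heap version uses a counter as tie-breaker so the heap never compares trees.

```python
import heapq
from collections import deque
from itertools import count


def huffman_tree_heap(freq):
    """O(n log n): repeatedly extract the two minima and re-insert their merge."""
    tie = count()
    heap = [(p, next(tie), s) for s, p in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tie), (t1, t2)))
    return heap[0][2]


def huffman_tree_two_queues(freq):
    """One sort, then linear work: merged weights come out in nondecreasing order."""
    leaves = deque(sorted(((p, s) for s, p in freq.items()), key=lambda t: t[0]))
    merged = deque()

    def pop_min():
        # The overall minimum is always at the front of one of the two queues.
        if not merged or (leaves and leaves[0][0] <= merged[0][0]):
            return leaves.popleft()
        return merged.popleft()

    while len(leaves) + len(merged) > 1:
        p1, t1 = pop_min()
        p2, t2 = pop_min()
        merged.append((p1 + p2, (t1, t2)))
    return merged[0][1] if merged else leaves[0][1]


def depths(tree, d=0):
    """Helper for checking results: depth of every leaf (= codeword length)."""
    if isinstance(tree, str):
        return {tree: d}
    left, right = tree
    return {**depths(left, d + 1), **depths(right, d + 1)}


freq = {"A": 0.6, "B": 0.25, "C": 0.1, "D": 0.05}
print(depths(huffman_tree_heap(freq)))
print(depths(huffman_tree_two_queues(freq)))
```

On the running example both versions give depths 1, 2, 3, 3 for A, B, C, D. With ties in the frequencies the two trees may differ in shape, but the average encoding length should be the same.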