In [1]:
# setup
from IPython.display import display, HTML
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))
display(HTML(open('rise.css').read()))

# imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid", font_scale=1.5, rc={'figure.figsize':(12, 6)})


# CMPS 2200
# Introduction to Algorithms

## Huffman Coding Implementation


## Fixed-length encoding and Variable-length encoding


Suppose we are given a document $D$ written over the alphabet $\Sigma = \{\sigma_1, \ldots, \sigma_k\}$. Our goal is to create a binary encoding of $\Sigma$ that represents $D$ with as few bits as possible. 

Example: Suppose the alphabet is $\Sigma=\{A, B, C, D\}$ and the document is $D = \langle A, A, A, A, A, A, A, A, A, B, C, D\rangle$. 

The `fixed-length encoding` could be 

|$$\sigma$$|$$e(\sigma)$$         |
|-------|-----------------------|
| A     | 00 |
| B     | 01 |
| C     | 10 | 
| D     | 11 |

Let $f: \Sigma \rightarrow \mathbb{R}$ give the number of times each character appears in $D$. A `variable-length encoding` could be

|$$\sigma$$ |$$f(\sigma)$$| $$e'(\sigma)$$|
|-------|--|-------------|
| A     | 9 | 0   |
| B     | 1 |10  |
| C     | 1 |110 | 
| D     | 1 |111 |
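
To make the saving concrete, the sketch below encodes the example document with both tables and compares the number of bits. The helper name `encode` and the variable names are illustrative choices, not part of the original notebook.

```python
# Document and code tables copied from the example above.
doc = ['A'] * 9 + ['B', 'C', 'D']
fixed = {'A': '00', 'B': '01', 'C': '10', 'D': '11'}
variable = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}

def encode(document, table):
    """Concatenate the codeword of every character in the document."""
    return ''.join(table[ch] for ch in document)

fixed_bits = encode(doc, fixed)
variable_bits = encode(doc, variable)

print(len(fixed_bits))     # 12 characters * 2 bits = 24
print(len(variable_bits))  # 9*1 + 1*2 + 1*3 + 1*3 = 17
```

For this document the variable-length table needs 17 bits instead of 24, because the most frequent character gets the shortest codeword.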

In [2]:
from collections import Counter

## the same counts could also be computed with `map` and `reduce`

cnt = Counter()
for word in ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'C', 'B', 'D', 'B', 'C']:
    cnt[word] += 1


## print the overall counter
print(cnt)

## print each frequency per key
for c in cnt.keys():
    print('The character is', c, 'with its frequency as', cnt[c])


Counter({'A': 9, 'B': 3, 'C': 2, 'D': 1})
The character is A with its frequency as 9
The character is B with its frequency as 3
The character is C with its frequency as 2
The character is D with its frequency as 1


## Encodings as Trees


<img src = "encoding_trees.jpg" width="60%">

Every prefix-free encoding $e$ can be represented by a tree $T_e$. 

The depth of each character $d_T(\sigma)$ in the tree determines how many bits are needed to encode $\sigma$.

<span style="color:red">Question:</span> What is the tree depth for fixed-length coding?

So the optimal compression of $D$ can be achieved by identifying the encoding tree $T$ that minimizes:

$$C(T) = \sum_{\sigma\in\Sigma} f(\sigma)\cdot d_T(\sigma)$$
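
As a rough illustration of this cost function, the sketch below represents an encoding tree as nested tuples (an assumption made here for brevity: a leaf is a one-character string and an internal node is a pair) and evaluates $C(T)$ for two candidate trees, using the frequencies of the example document. The names `depths` and `cost` are hypothetical helpers.

```python
def depths(tree, depth=0):
    """Map each leaf character to its depth in a nested-tuple tree."""
    if isinstance(tree, str):          # a leaf holds a single character
        return {tree: depth}
    left, right = tree                 # an internal node is a pair of subtrees
    result = depths(left, depth + 1)
    result.update(depths(right, depth + 1))
    return result

def cost(tree, freq):
    """C(T) = sum over characters of f(sigma) * d_T(sigma)."""
    d = depths(tree)
    return sum(freq[ch] * d[ch] for ch in freq)

freq = {'A': 9, 'B': 1, 'C': 1, 'D': 1}
balanced = (('A', 'B'), ('C', 'D'))    # every character at the same depth
skewed = ('A', ('B', ('C', 'D')))      # frequent character close to the root

print(cost(balanced, freq))  # 24
print(cost(skewed, freq))    # 17
```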


### Huffman Coding [1951]


The main idea of `Huffman Coding` is to choose the two **least** frequent characters $x$ and $y$ and make them sibling leaves in the final encoding tree. We then remove $x$ and $y$ from $\Sigma$, add a *new* character $z$ with frequency $f(z) = f(x)+f(y)$, and recurse to compute a tree $T'$. The final tree $T$ is $T'$ with the leaf $z$ replaced by the subtree that has $x$ and $y$ as siblings. 

<img src="huffman_example.jpg" width="60%">


### Implementation using Priority Queues

A *priority queue* maintains a set of elements drawn from a totally ordered set. It must support at least the following basic operations:

- *deleteMin*: Identify the element with the minimum value, remove it, and return it.

- *insert(x, s)*: Insert a new element $x$ with initial value $s$.

Priority queues can be implemented with a variety of data structures, such as linked lists, arrays, balanced trees, and heaps.
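
As one possible illustration, Python's standard `heapq` module provides both operations on a plain list: `heappush` plays the role of *insert* and `heappop` the role of *deleteMin*. Storing each element as a `(value, element)` tuple is a choice made for this sketch so that the value decides the ordering.

```python
import heapq

heap = []                                  # heapq keeps a plain list in heap order
for freq, ch in [(9, 'A'), (3, 'B'), (2, 'C'), (1, 'D')]:
    heapq.heappush(heap, (freq, ch))       # insert

first = heapq.heappop(heap)                # deleteMin
second = heapq.heappop(heap)               # deleteMin again
print(first, second)                       # (1, 'D') (2, 'C')
```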

### Leftist min-Heap

In a min-heap, every node is no larger than either of its children (the heap property). This means that the root of a tree with the heap property is always the minimum element. A `Leftist` heap additionally keeps its trees leaning to the left: at every node, the shortest path down to a missing child through the right child is no longer than the one through the left child. For a binary tree:

<img src="heap_property_fixed_examples.png" width="70%">


Maintaining the heap property after an insertion or deletion takes time proportional to the depth of the tree, because elements are swapped upward or downward along a single path starting at the point of modification. The work for insertion and deletion is $O(\log n)$. <span style="color:red">[<a href="https://github.com/allan-tulane/cmps2200-slides/blob/main/module-06-greedy/greedy-02.ipynb">Proof is optional</a>]</span> 
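
The following is a minimal sketch of a leftist min-heap, assuming the standard formulation in which every node stores a `rank` (the length of the shortest path down to a missing child) and both operations are built on a recursive `merge`. The class and function names are illustrative and do not come from the course code.

```python
class LeftistNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.rank = 1          # shortest path down to a missing child

def rank(node):
    return node.rank if node else 0

def merge(a, b):
    """Merge two leftist heaps and return the root of the result."""
    if a is None:
        return b
    if b is None:
        return a
    if b.key < a.key:                  # keep the smaller root on top
        a, b = b, a
    a.right = merge(a.right, b)        # always descend along the right spine
    if rank(a.left) < rank(a.right):   # restore the leftist property
        a.left, a.right = a.right, a.left
    a.rank = rank(a.right) + 1
    return a

def insert(root, key):
    return merge(root, LeftistNode(key))

def delete_min(root):
    """Return (minimum key, root of the remaining heap)."""
    return root.key, merge(root.left, root.right)

root = None
for k in [9, 3, 2, 1]:
    root = insert(root, k)

out = []
while root is not None:
    k, root = delete_min(root)
    out.append(k)
print(out)  # [1, 2, 3, 9]
```

Because `merge` only recurses along right children, its running time is bounded by the lengths of the two right spines, which the leftist property keeps short.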

### Huffman Coding (Cont'd)

We need to efficiently retrieve the next two smallest frequency nodes.


1. Initialize a min-heap with character frequencies $f(\sigma)$


![huffman-heap-2.png](huffman-heap-2.png)


Then, repeat until a single node remains in the heap:

2. Call `deleteMin` twice to get the two least frequent nodes $x$ and $y$
3. Create a new node $z$ with frequency $f(x) + f(y)$
4. Make $x$ and $y$ children of $z$ in the tree.
5. Call `insert` to add $z$ to the heap
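
The loop described above can be sketched end to end with the standard `heapq` module. This is one possible implementation rather than the course's reference solution: the function name `huffman_codes`, the nested-tuple tree representation, and the tie-breaking counter are all assumptions made for this example, and the input is assumed to be non-empty.

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Return a dict mapping each character to its codeword."""
    tiebreak = count()   # unique counter so that trees are never compared directly
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)                       # initialize the min-heap
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)        # deleteMin: least frequent node
        fy, _, y = heapq.heappop(heap)        # deleteMin: second least frequent
        z = (x, y)                            # new node with x and y as children
        heapq.heappush(heap, (fx + fy, next(tiebreak), z))  # insert z
    _, _, root = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, str):             # leaf: record its codeword
            codes[node] = prefix or '0'       # handles a one-character alphabet
        else:
            walk(node[0], prefix + '0')       # left edge is labeled 0
            walk(node[1], prefix + '1')       # right edge is labeled 1
    walk(root, '')
    return codes

freq = {'A': 9, 'B': 3, 'C': 2, 'D': 1}
codes = huffman_codes(freq)
total = sum(freq[ch] * len(codes[ch]) for ch in freq)
print(codes)
print(total)  # 24
```

When frequencies tie, a different tie-breaking rule may yield different codewords, but the total cost $C(T)$ stays the same.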

In [8]:
import queue
from collections import Counter

class TreeNode(object):
    # we assume data is a tuple (frequency, character)
    def __init__(self, left=None, right=None, data=None):
        self.left = left
        self.right = right
        self.data = data
    def __lt__(self, other):
        return(self.data < other.data)
    def children(self):
        return((self.left, self.right))
    

## https://docs.python.org/3/library/queue.html
## Counter({'A': 9, 'B': 3, 'C': 2, 'D': 1})

def priority_queue(f):
    p = queue.PriorityQueue()
    # construct heap from frequencies, the initial items should be
    # the leaves of the final tree
    for c in f.keys():
        p.put(TreeNode(None,None,(f[c], c)))        
    return p
    

p = priority_queue(cnt)

# deleteMin twice: the two least frequent characters
l = p.get()
print(l.data[0], l.data[1])

r = p.get()
print(r.data[0], r.data[1])

# merge them under a new node 'E' whose frequency is the sum of its children
p.put(TreeNode(l, r, (l.data[0] + r.data[0], 'E')))


print('New Tree\n')

while p.qsize() > 0:
    e = p.get()
    print(e.data[0], e.data[1])
    if e.data[1] == 'E':
        l = e.left
        r = e.right
        print('The children of E are', l.data[1], r.data[1])


1 D
2 C
New Tree

3 B
3 E
The children of E are D C
9 A
