In [1]:
# setup
from IPython.display import display, HTML
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))
display(HTML(open('rise.css').read()))

# imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid", font_scale=1.5, rc={'figure.figsize':(12, 6)})


# CMPS 2200
# Introduction to Algorithms

## Greedy Algorithms - Data Compression


We will now look at *Greedy* algorithms. The greedy framework is very simple: 

- Let $\mathcal{X}$ be possible choices for the solution. Initialize solution $S=\emptyset$. 
- Select $x\in\mathcal{X}$ according to a `greedy criterion` $C(x)$ and set $S := S \cup \{x\}, \mathcal{X} := \mathcal{X} - \{x\}$.
- Repeat until solution is complete.

> **Example**: Selection Sort
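
To make the framework concrete, here is a minimal sketch of selection sort written in the greedy style above. The function name `selection_sort` and the choice to keep the solution $S$ as an ordered list are assumptions made for illustration.

```python
def selection_sort(items):
    """Sort by repeatedly making the greedy choice: take the smallest remaining item."""
    remaining = list(items)   # the set of possible choices, X
    solution = []             # the partial solution S (kept as an ordered list)
    while remaining:          # repeat until the solution is complete
        x = min(remaining)    # greedy criterion C(x): the smallest remaining value
        remaining.remove(x)   # X := X - {x}
        solution.append(x)    # S := S ∪ {x}
    return solution


print(selection_sort([5, 2, 9, 1, 5]))
```

Each step commits to one choice and never revisits it, which is what makes the approach greedy.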



## Data Compression



<img src="data_compression.png" width="70%">

## Binary Encoding

<img src="encoding.png" width="50%">

## Fixed-length encoding

Suppose we are given a document $D$ written over the alphabet $\Sigma = \{\sigma_1, \ldots, \sigma_k\}$. Our goal is to create a binary encoding of $\Sigma$ that represents $D$ with as few bits as possible. Of course, the encoding must assign a distinct codeword to each character of $\Sigma$.

Example: Suppose the alphabet is $\Sigma=\{A, B, C, D\}$ and the document is $D = \langle A, A, A, A, A, A, A, A, A, B, C, D\rangle$. 

The naive encoding could be 

|$$\sigma$$|$$e(\sigma)$$         |
|-------|-----------------------|
| A     | 00 |
| B     | 01 |
| C     | 10 | 
| D     | 11 |

This is a **fixed-length** encoding of $\Sigma$. What is the number of bits required to encode the entire document with this encoding?


The length of the document with this encoding is $2\cdot 12 = 24$. The encoding is:

$e(D) = "000000000000000000011011"$
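
As a quick check, the fixed-length encoding can be applied mechanically. This is a small sketch; the names `fixed_code`, `doc`, and `encoded_fixed` are chosen here for illustration.

```python
# The fixed-length code from the table: every character gets 2 bits.
fixed_code = {'A': '00', 'B': '01', 'C': '10', 'D': '11'}

# The example document: nine A's followed by B, C, D.
doc = ['A'] * 9 + ['B', 'C', 'D']

# Concatenate the codeword of each character in order.
encoded_fixed = ''.join(fixed_code[ch] for ch in doc)
print(encoded_fixed)
print(len(encoded_fixed), "bits")
```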

<span style="color:red">**Question**:</span> For a fixed-length encoding of an alphabet $\Sigma$ with $k$ characters, what is the minimum number of bits needed per character? For example, if we need to encode 5 characters, how many bits does each character need?

## Variable-length encoding


Fixed-length encoding doesn't account for redundancy in the document. 

Let $f: \Sigma \rightarrow \mathbb{R}$ give **the number of times** each character appears in $D$; this is easily computed in $O(|D|)$ work. 

Intuitively, we should use the character frequencies to choose the encoding: the more frequent a character, the shorter its code should be.


$D = \langle A, A, A, A, A, A, A, A, A, B, C, D\rangle$

|$$\sigma$$ | $$f(\sigma)$$|
|-------|---------------|
| A     | 9 |
| B     | 1 |
| C     | 1 | 
| D     | 1 |



So, following that logic, we could come up with a code like this:

|$$\sigma$$ |$$e'(\sigma)$$|
|-------|---------------|
| A     | 0   |
| B     | 1  |
| C     | 00 | 
| D     | 11 |

Is this a valid encoding? <span style="color:red">Why?</span>

<br><br>
How should we decode `11`? It's ambiguous between `BB` and `D`.
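
One way to see the ambiguity is to list every way a bit string can be split into codewords. The helper `all_decodings` below is a hypothetical brute-force sketch, not an efficient decoder.

```python
def all_decodings(bits, code):
    """Return every way to split `bits` into codewords of `code`."""
    if not bits:
        return [[]]  # exactly one way to decode the empty string
    results = []
    for ch, word in code.items():
        if bits.startswith(word):
            # Use `word` for the first character, then decode the remainder.
            for rest in all_decodings(bits[len(word):], code):
                results.append([ch] + rest)
    return results


ambiguous_code = {'A': '0', 'B': '1', 'C': '00', 'D': '11'}
print(all_decodings('11', ambiguous_code))
```

Any bit string with more than one decoding shows that the code cannot be used as is.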

<br><br>
<br><br>
<br><br>

<br><br>
<br><br>

Instead, we could use:


|$$\sigma$$ |$$e'(\sigma)$$|
|-------|---------------|
| A     | 0   |
| B     | 10  |
| C     | 110 | 
| D     | 111 |

This is a **variable-length** encoding, where each character may be encoded by a different number of bits.

This leads to an encoding of $D = \langle A, A, A, A, A, A, A, A, A, B, C, D\rangle$ as:

<br><br><br>

$e'(D) = "00000000010110111"$

<br><br>

This has length $1\cdot 9 + 2\cdot 1 + 3\cdot 1 + 3\cdot 1 = 17$ bits, compared with 24 bits for the fixed-length encoding. 
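
The sketch below encodes the example document with this code and decodes it again by reading bits left to right. The function names `encode` and `decode` are assumptions for illustration. The simple left-to-right decoder relies on no codeword being a prefix of another.

```python
prefix_code = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
doc = ['A'] * 9 + ['B', 'C', 'D']


def encode(doc, code):
    """Concatenate the codeword of every character in the document."""
    return ''.join(code[ch] for ch in doc)


def decode(bits, code):
    """Read bits left to right, emitting a character whenever a codeword completes."""
    inverse = {word: ch for ch, word in code.items()}
    decoded, current = [], ''
    for bit in bits:
        current += bit
        if current in inverse:
            decoded.append(inverse[current])
            current = ''
    return decoded


bits = encode(doc, prefix_code)
print(bits, len(bits))
print(decode(bits, prefix_code) == doc)
```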


<span style="color:red">**Question**</span>: Can you come up with other solutions?



In general, the cost of a given encoding $e$ is 

$$C(e) = \sum_{i=0}^{|D|-1} |e(D[i])| = \sum_{\sigma\in\Sigma} f(\sigma)\cdot |e(\sigma)|.$$

Over all possible valid encodings $e: \Sigma \rightarrow \{0,1\}^*$, we want to find a variable-length encoding $e_*$ so that $C(e_*)$ is minimized.
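
The two ways of writing the cost, position by position or grouped by character, can be compared directly. The function names below are hypothetical, and the two codes are the ones from the earlier tables.

```python
from collections import Counter


def cost_by_position(doc, code):
    """Sum the codeword length of each position in the document."""
    return sum(len(code[ch]) for ch in doc)


def cost_by_frequency(doc, code):
    """Sum f(sigma) * |e(sigma)| over the characters that occur in the document."""
    freq = Counter(doc)
    return sum(freq[ch] * len(code[ch]) for ch in freq)


doc = ['A'] * 9 + ['B', 'C', 'D']
fixed_code = {'A': '00', 'B': '01', 'C': '10', 'D': '11'}
prefix_code = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}

for name, code in [('fixed', fixed_code), ('variable', prefix_code)]:
    print(name, cost_by_position(doc, code), cost_by_frequency(doc, code))
```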



## Prefix-free Encodings as Trees


<img src = "encoding_trees.jpg" width="60%">

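An encoding is **prefix-free** if no codeword is a prefix of another codeword; this is what makes decoding unambiguous. Every prefix-free encoding $e$ can be represented by a binary tree $T_e$. Note that only the leaves represent characters.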

The depth of each character $d_T(\sigma)$ in the tree determines how many bits are needed to encode $\sigma$.

So the optimal compression of $D$ can be achieved by identifying the encoding tree $T$ that minimizes:

$$C(T) = \sum_{\sigma\in\Sigma} f(\sigma)\cdot d_T(\sigma)$$
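
As a small illustration, an encoding tree can be written as nested pairs, with a string for each leaf. This representation, the convention that the left branch is `0` and the right branch is `1`, and the name `codes_from_tree` are all assumptions for this sketch.

```python
def codes_from_tree(tree, prefix=''):
    """Walk the tree; the path to each leaf (left = '0', right = '1') is its codeword."""
    if isinstance(tree, str):      # a leaf holds a single character
        return {tree: prefix}
    left, right = tree             # an internal node is a pair of subtrees
    codes = codes_from_tree(left, prefix + '0')
    codes.update(codes_from_tree(right, prefix + '1'))
    return codes


# Tree for the variable-length code used earlier.
tree = ('A', ('B', ('C', 'D')))
freq = {'A': 9, 'B': 1, 'C': 1, 'D': 1}

codes = codes_from_tree(tree)
# The depth of a leaf equals the length of its codeword.
cost = sum(freq[ch] * len(codes[ch]) for ch in freq)
print(codes)
print(cost)
```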



We will come up with a greedy algorithm for constructing $T$ and show that it is optimal.


Intuitively, when constructing an encoding tree, the higher a character's frequency, the shorter its path from the root should be.

How about if we sort the frequencies in descending order and then assign tree positions in this order? But how do we guarantee the highest frequency characters have a short depth? 


[<a href='https://en.wikipedia.org/wiki/Shannon%E2%80%93Fano_coding'>Shannon-Fano Coding</a>] We could split the characters into two sets of roughly equal total frequency and recurse on each set, so that the more frequent characters end up at lower depth. This divide-and-conquer approach was developed by Shannon and Fano... but is not optimal.







David Huffman (as a graduate student in Robert Fano's class at MIT) came up with a *bottom-up* greedy algorithm as a class project and was able to prove that it was optimal, in the sense that no other prefix code can achieve a shorter average code length.

He invented his coding algorithm in 1951, so it has been around for about 70 years. Yet it is still ubiquitous!

Both **ZIP** and **MP3** file formats make use of Huffman coding as a last step after numerous preprocessing steps.

### Huffman Coding



The main idea of `Huffman Coding` is to choose the two **least** frequent characters $x$ and $y$ and create a subtree with $x$ and $y$ as sibling leaves for the final encoding. We then remove $x$ and $y$ from $\Sigma$ and add a *new* character $z$ with frequency $f(x)+f(y)$, and recurse to compute a tree $T'$. The final tree $T$ is just $T'$ with $z$ replaced by the subtree with $x, y$ as siblings. 
<br><br><br>
<img src="huffman_example.jpg" width="60%">

#### One More Example

|Character  | Frequency |
|---|--------|
|a  | 5      |
|b  | 9      |
|c  | 12     |
|d  | 13     |
|e  | 16     |
|f  | 45     |
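
The recursive description above can be sketched directly on this table. In this sketch the "new character" $z$ is represented by the pair `(x, y)` itself, so replacing $z$ by the subtree happens implicitly. The function names and the nested-pair tree representation are assumptions, and ties are broken arbitrarily.

```python
def huffman_recursive(freq):
    """freq maps each subtree (a character or a pair of subtrees) to its frequency."""
    if len(freq) == 1:
        return next(iter(freq))                # only one subtree left: the final tree
    x, y = sorted(freq, key=freq.get)[:2]      # the two least frequent subtrees
    z = (x, y)                                 # new node with x and y as siblings
    reduced = {k: v for k, v in freq.items() if k not in (x, y)}
    reduced[z] = freq[x] + freq[y]             # z has frequency f(x) + f(y)
    return huffman_recursive(reduced)          # recurse on the smaller alphabet


def leaf_depths(tree, depth=0):
    """Return the depth of every character in the tree."""
    if isinstance(tree, str):
        return {tree: depth}
    left, right = tree
    depths = leaf_depths(left, depth + 1)
    depths.update(leaf_depths(right, depth + 1))
    return depths


freq = {'a': 5, 'b': 9, 'c': 12, 'd': 13, 'e': 16, 'f': 45}
tree = huffman_recursive(freq)
depths = leaf_depths(tree)
cost = sum(freq[ch] * depths[ch] for ch in freq)
print(tree)
print(depths)
print(cost)
```

Sorting at every step keeps the sketch short but is not efficient.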

    

### How to implement Huffman Coding? -  Using Priority Queues

A *priority queue* is an abstract data type, typically implemented with a tree-based structure, that matches well with greedy algorithms since it allows for efficient **insertions**, **removals** and **updates** of items. 


For simplicity, we'll assume that we are always seeking the minimum-value element from the priority queue. The priority queue data structure needs to support some basic operations:

- *deleteMin*: Identify the element with minimum value and remove it. 

- *insert(x, s)*: insert a new element $x$ with initial value $s$.
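
In Python, the standard-library `heapq` module provides these two operations on a plain list: `heappush` plays the role of *insert* and `heappop` the role of *deleteMin*. The sketch below stores `(value, element)` pairs so that items are ordered by value; the variable names are chosen for illustration.

```python
import heapq

pq = []  # an empty priority queue, stored as a list

# insert(x, s): push the pair (s, x) so that ordering is by the value s.
heapq.heappush(pq, (9, 'b'))
heapq.heappush(pq, (5, 'a'))
heapq.heappush(pq, (12, 'c'))

# deleteMin: remove and return the pair with the smallest value.
smallest = heapq.heappop(pq)
print(smallest)
print(len(pq), "items remain")
```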

 


### The Heap Property

The *min-heap property* for a tree states that every node in the tree is smaller than or equal to each of its children. This means that the root of a tree with the heap property is always the minimum element. So for a binary tree:

<img src="heap_property_fixed_examples.jpg" width="70%">

Notice that a **binary heap** is less restrictive than a **binary search tree** since the left and right subtrees can be swapped.

> Maintaining the heap property after an insertion or deletion requires time at most proportional to the depth of the tree, because we restore it by swapping elements along a single path from the point of modification, either upward toward the root or downward toward a leaf.


> Since a complete binary tree with $n$ nodes has depth $O(\log n)$, this leads to $O(\log n)$ work per operation.
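
The sketch below illustrates the upward case. It assumes the common array layout of a binary heap, in which the parent of index `i` is at index `(i - 1) // 2`; the function names are hypothetical.

```python
import math


def is_min_heap(a):
    """Check that every element is at least as large as its parent."""
    return all(a[(i - 1) // 2] <= a[i] for i in range(1, len(a)))


def heap_insert(a, value):
    """Append `value`, then swap it upward until the heap property holds.

    Returns the number of swaps, which is at most the depth of the tree.
    """
    a.append(value)
    i = len(a) - 1
    swaps = 0
    while i > 0 and a[(i - 1) // 2] > a[i]:
        parent = (i - 1) // 2
        a[parent], a[i] = a[i], a[parent]
        i = parent
        swaps += 1
    return swaps


heap = [5, 9, 12, 13, 16, 45]
swaps = heap_insert(heap, 1)   # the new minimum travels all the way to the root
print(heap, swaps)
print(is_min_heap(heap), swaps <= math.floor(math.log2(len(heap))))
```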

We need to efficiently retrieve the next two smallest frequency nodes.


1. Initialize a min-heap with character frequencies $f(\sigma)$


![huffman-heap-2.png](huffman-heap-2.png)


Then, repeat:

2. Call `deleteMin` twice to get the two least frequent nodes $x$ and $y$
3. Create a new node $z$ with frequency $f(x) + f(y)$
4. Make $x$ and $y$ children of $z$ in the tree.
5. Call `insert` to add $z$ to the heap

How many times will this repeat if $|\Sigma| = n$?

<br><br>

<span style="color:red">**Question**:</span> What are the work and span of this algorithm? What are the corresponding recurrences?


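Because each iteration removes two nodes and inserts one, the number of nodes in the heap decreases by 1 per iteration. We start with $n = |\Sigma|$ nodes and stop when a single node remains, so this repeats $n-1$ times.

The cost of 2 calls to `deleteMin` and one call to `insert` is at most about $3 \lg n$, which is $O(\lg n)$.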

Thus, total work is $O(n \lg n)$. 

We unfortunately have not exposed any parallelism in this algorithm, so the span is also $O(n \lg n)$.
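
The loop above can be sketched with Python's `heapq` module, counting iterations so that the number of repetitions can be checked. This is a minimal sketch, not the course's reference implementation: the function name, the nested-pair tree representation, and the tie-breaking counter are assumptions, and the input is assumed to be non-empty.

```python
import heapq
from itertools import count


def huffman_with_heap(freq):
    """Build a Huffman tree with a min-heap; return (tree, number of loop iterations)."""
    tiebreak = count()  # unique counter so equal frequencies never compare subtrees
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)                      # initialize the min-heap with f(sigma)
    iterations = 0
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)       # deleteMin: least frequent node
        fy, _, y = heapq.heappop(heap)       # deleteMin: second least frequent node
        z = (x, y)                           # new node z with children x and y
        heapq.heappush(heap, (fx + fy, next(tiebreak), z))  # insert z with f(x) + f(y)
        iterations += 1
    return heap[0][2], iterations


freq = {'a': 5, 'b': 9, 'c': 12, 'd': 13, 'e': 16, 'f': 45}
tree, iterations = huffman_with_heap(freq)
print(tree)
print(iterations, "iterations for", len(freq), "characters")
```

Each iteration performs a constant number of heap operations, which matches the work bound above.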

In [None]:
import math, queue
from collections import Counter

## we can also do this with `map` and `reduce`
D = ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'C', 'B', 'D', 'B', 'C', 'B']
# Counter tallies how many times each character occurs in D
cnt = Counter(D)

## print the frequency of each character
for c in cnt:
    print('The character is', c, 'with its frequency as', cnt[c])

    


In [None]:
class TreeNode:
    # we assume data is a tuple (frequency, character)
    def __init__(self, left=None, right=None, data=None):
        self.left = left
        self.right = right
        self.data = data

    def __lt__(self, other):
        # compare nodes by their data tuples, so the lowest frequency comes out first
        return self.data < other.data

    def children(self):
        return (self.left, self.right)


p = queue.PriorityQueue()
# construct heap from frequencies, the initial items should be
# the leaves of the final tree
for c in cnt.keys():
    p.put(TreeNode(None, None, (cnt[c], c)))

## print the priority queue in increasing order of frequency
## note: get() removes each item, so the queue is empty afterwards
for i in range(p.qsize()):
    print(p.qsize())
    print(p.get().data)