# <font color=blue>Learn More About Huffman Trees</font>
## <font color=blue>Teach One Another</font>


## <font color=red>**TODO** Study a Simple Huffman Tree Representation</font>


As a follow-up to Question 3 on the reading quiz, use the tree and the Python code given below to decode the following message:


```00010010000111010111011011010111100110000010111101100011101011110```


Also, decode the longer message:


```0001001000111001110101110110110101111001100000101111011000111010111100111```


and explain the reason for the (perhaps) surprising result!


In [None]:
tree = [[['n', 't'], [['i', 'o'], [' ', '']]], ['e', ['h', 'r']]]

In [None]:
def is_leaf(node):
  return isinstance(node, str)

def get_left(node):
  return node[0] if isinstance(node, list) else None

def get_right(node):
  return node[1] if isinstance(node, list) else None

def decode(encoded, tree):
  # Peel one symbol at a time off the front of the bit string.
  message = ''
  while len(encoded):
    (leaf, encoded) = find_leaf(encoded, tree)
    message += leaf
  return message

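def find_leaf(encoded, tree):
  # Follow one bit per level until a leaf is reached; return the leaf
  # and the bits that remain.
  if is_leaf(tree):
    return (tree, encoded)
  if not encoded:
    # Ran out of bits partway down the tree: emit nothing for the partial code.
    return ('', encoded)
  first_char, rest_of_encoded = encoded[0], encoded[1:]
  if first_char == '0':
    return find_leaf(rest_of_encoded, get_left(tree))
  if first_char == '1':
    return find_leaf(rest_of_encoded, get_right(tree))
  # Anything other than '0' or '1' is not a valid bit.
  raise ValueError(f'Invalid bit {first_char!r} in encoded message')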

In [None]:
print(decode('00010010000111010111011011010111100110000010111101100011101011110', tree))

In [None]:
print(decode('0001001000111001110101110110110101111001100000101111011000111010111100111', tree))
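
Decoding walks the tree from the root to a leaf; going the other way, the code for each symbol is simply the path of 0s (left) and 1s (right) that leads to it. The sketch below shows that idea on a tiny, made-up tree in the same nested-list format. The names `build_code_table`, `encode`, and `toy_tree` are illustrative assumptions, not part of the notebook's code.

```python
def build_code_table(node, prefix=''):
    """Map each leaf symbol of a nested-list tree to its bit string."""
    if isinstance(node, str):
        # A leaf: the path taken so far is this symbol's code.
        return {node: prefix}
    table = {}
    table.update(build_code_table(node[0], prefix + '0'))  # left branch
    table.update(build_code_table(node[1], prefix + '1'))  # right branch
    return table

def encode(message, table):
    """Concatenate the code of every symbol in the message."""
    return ''.join(table[symbol] for symbol in message)

toy_tree = [['a', 'b'], 'c']  # hypothetical three-symbol tree
toy_table = build_code_table(toy_tree)
print(toy_table)                 # {'a': '00', 'b': '01', 'c': '1'}
print(encode('cab', toy_table))  # 10001
```

Running `build_code_table` on the `tree` defined above is one way to check a decoding done by hand.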

## <font color=red>Step It Up</font>


Recall the code from [Prepare for Huffman Trees](https://colab.research.google.com/github/byui-cse/cse280-02F21/blob/main/prepare-for-huffman-trees.ipynb).


Here's a better version that takes a more object-oriented approach, using the **anytree**, **collections**, and **queue** modules:


In [None]:
!pip install anytree

In [None]:
from anytree import Node, RenderTree, PreOrderIter, Walker
from anytree.util import leftsibling, rightsibling
from anytree.exporter.dotexporter import UniqueDotExporter
from collections import Counter
from queue import PriorityQueue
from math import ceil, floor, log

class HuffmanTreeNode(Node):
  def __lt__(self, other):
    return self.count < other.count

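def build_initial_queue(message):
  # Count each symbol, then queue one leaf node per symbol.
  # The node with the lowest count comes out of the queue first.
  count_dict = Counter(message)
  print(f'Distinct symbols: {len(count_dict)}')
  q = PriorityQueue()
  for key, val in count_dict.items():
    q.put(HuffmanTreeNode(key, count = val))
  return q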

def new_internal_node(left, right):
  return HuffmanTreeNode('', children = [left, right], count = left.count + right.count)

def make_huffman_tree(Q):
  while Q.qsize() > 1:
    left = Q.get()
    right = Q.get()
    Q.put(new_internal_node(left, right))

  return Q.get()

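def get_codes(root):
  # Leaves are the nodes without children; each one holds a symbol.
  leaves = [node for node in PreOrderIter(root, filter_=lambda n: not n.children)]
  codes = {}
  w = Walker()
  for leaf in leaves:
    # walk() returns (upwards, common, downwards); the upwards part lists
    # the nodes from the leaf up to, but not including, the root.
    path = w.walk(leaf, root)[0]
    code = ''
    for node in path:
      # A node with a left sibling is a right child (bit 1); otherwise bit 0.
      # Bits are prepended because the path runs from leaf to root.
      code = ('1' if leftsibling(node) else '0') + code
    codes[leaf.name] = (code, leaf.count)
  return codes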

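def calc_compression_ratio(f, v):
  # f = size with a fixed-length code, v = size with the variable-length
  # (Huffman) code; the result is the percentage of space saved.
  return 100 * (f - v) / f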

def get_encoded_size(codes):
  return sum([len(code) * count for _, (code, count) in codes.items()])

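def get_fixed_size(codes):
  num_keys = len(codes)
  # Bits per symbol in a fixed-length code: ceil(log2(num_keys)), computed
  # exactly with integer arithmetic. Use at least 1 bit so a one-symbol
  # alphabet does not produce a zero-size baseline (and a division by zero).
  num_bits_per_key = max(1, (num_keys - 1).bit_length())
  return sum(num_bits_per_key * count for _, count in codes.values())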

def report(codes):
  cr = calc_compression_ratio(get_fixed_size(codes), get_encoded_size(codes))
  print(f'\nCompression ratio: {cr:.2f}%')
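
To make the arithmetic behind `report` concrete, here is a small worked example. It uses a hand-written, hypothetical code table in the same `{symbol: (code, count)}` shape that `get_codes` returns; the symbols and counts come from the word "abracadabra" and are for illustration only.

```python
from math import ceil, log2

# Hypothetical code table: {symbol: (code, count)} for 'abracadabra'.
example_codes = {
    'a': ('0', 5),
    'b': ('10', 2),
    'r': ('110', 2),
    'c': ('1110', 1),
    'd': ('1111', 1),
}

# Fixed-length baseline: every symbol costs ceil(log2(5)) = 3 bits.
bits_per_symbol = ceil(log2(len(example_codes)))
fixed_bits = sum(bits_per_symbol * count for _, count in example_codes.values())

# Variable-length size: each symbol costs the length of its own code.
variable_bits = sum(len(code) * count for code, count in example_codes.values())

ratio = 100 * (fixed_bits - variable_bits) / fixed_bits
print(f'fixed: {fixed_bits} bits, variable: {variable_bits} bits, saved: {ratio:.2f}%')
# fixed: 33 bits, variable: 23 bits, saved: 30.30%
```

Note that the baseline here is a fixed-length code sized to the alphabet, not 8 bits per character.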

Recall Tuesday's task about compressing a larger message?


What is the highest compression ratio achievable using Huffman encoding to compress the entire [Standard Works](https://byui-cse.github.io/cse280-course/scriptures.txt)?!

In [None]:
!curl -s -O https://byui-cse.github.io/cse280-course/scriptures.txt

In [None]:
!curl -s -O https://byui-cse.github.io/cse280-course/scriptures.csv

In [None]:
!head -20 scriptures.csv

In [None]:
def try_The_Word():
  global root  # kept global so that later cells can render the tree
  with open('scriptures.txt') as f:
    message = f.read()

  root = make_huffman_tree(build_initial_queue(message))
  report(get_codes(root))

In [None]:
try_The_Word()

In [None]:
# And if you love things visual...
print(RenderTree(root))

In [None]:
# Dot is a beautiful thing...
UniqueDotExporter(root).to_picture('scriptures_encoded.png')

## <font color=red>**TODO** Compare to Standard</font>

Compare the compression ratio you just saw to the ratio achieved by another compression method, gzip:


In [None]:
!cp scriptures.txt scriptures.tmp
!gzip scriptures.tmp
!wc -c scriptures.txt
!wc -c scriptures.tmp.gz

Using the reported sizes in bytes of these two files, what is the compression ratio?
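
One way to turn two byte counts into a ratio is to apply the same formula as `calc_compression_ratio`. The sketch below does that with Python's `gzip` module on a short in-memory sample, which is only a stand-in for `scriptures.txt`; `savings_percent` and `sample` are hypothetical names. Keep in mind when comparing that `wc -c` counts 8-bit bytes, whereas `report` above measures against a fixed-length code of `ceil(log2(n))` bits per symbol, so the two percentages do not share a baseline.

```python
import gzip

def savings_percent(original_size, compressed_size):
    """Percentage of space saved, same formula as calc_compression_ratio."""
    return 100 * (original_size - compressed_size) / original_size

# Hypothetical stand-in text; substitute the byte counts reported by wc -c.
sample = b'neither here nor there ' * 200
compressed = gzip.compress(sample)

print(f'original:   {len(sample)} bytes')
print(f'compressed: {len(compressed)} bytes')
print(f'saved:      {savings_percent(len(sample), len(compressed)):.2f}%')
```

Highly repetitive text like this sample compresses far better than real prose, so expect a different figure for the actual file.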


What if we were to use PREcalculated frequency counts?

Published frequency tables typically count only the 26 uppercase letters, with no punctuation or other special characters. Unless you reduce the text to letters or add counts for the other characters, these three sources are FYI only:

[Too small a sample](http://pi.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html)

[A better sample](https://www.sttmedia.com/characterfrequency-english)

[From analyzing the Concise Oxford Dictionary](https://www3.nd.edu/~busiforc/handouts/cryptography/letterfrequencies.html)
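
If you do want to try a letters-only table, the text has to be reduced to the same alphabet first. A minimal sketch of that manipulation, assuming only the ASCII letters A to Z should survive; `letters_only` is a hypothetical helper, not part of the notebook's code.

```python
from collections import Counter

def letters_only(text):
    """Uppercase the text and drop everything except the letters A-Z."""
    return ''.join(ch for ch in text.upper() if 'A' <= ch <= 'Z')

sample = 'Neither here, nor there!'
reduced = letters_only(sample)
counts = Counter(reduced)

print(reduced)                # NEITHERHERENORTHERE
print(counts.most_common(3))  # [('E', 6), ('R', 4), ('H', 3)]
```

The resulting string can be passed to `build_initial_queue` in place of the raw message.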

## <font color=red>**TODO** Go Above and Beyond --- Study a Little History, Read, and Draw</font>


The Huffman Tree Algorithm is a thing of beauty for three reasons:


1. It is easy to understand and implement --- a classic greedy algorithm.
2. It is provably optimal among methods that encode each symbol separately.
3. David Huffman was inspired!
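
To illustrate the first point, the greedy step (repeatedly merge the two least frequent subtrees) fits in a few lines of standard-library Python. This is a sketch using `heapq` rather than the anytree-based code above; `huffman_codes` is a hypothetical name, and each heap entry carries a partial code table instead of a tree node.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(message):
    """Return {symbol: bit string}, built greedily from symbol counts."""
    tiebreak = count()  # unique number so tied counts never compare the dicts
    heap = [(n, next(tiebreak), {sym: ''}) for sym, n in Counter(message).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Greedy choice: take the two subtrees with the smallest counts.
        n0, _, left = heapq.heappop(heap)
        n1, _, right = heapq.heappop(heap)
        # Merging pushes every symbol one level deeper: prefix a bit.
        merged = {s: '0' + c for s, c in left.items()}
        merged.update({s: '1' + c for s, c in right.items()})
        heapq.heappush(heap, (n0 + n1, next(tiebreak), merged))
    return heap[0][2] if heap else {}

text = 'abracadabra'
codes = huffman_codes(text)
total_bits = sum(len(codes[s]) for s in text)
print(codes)
print(total_bits)  # 23
```

The individual codes depend on how ties are broken, but the total encoded length does not. A message with a single distinct symbol gets the empty code, an edge case this sketch does not treat specially.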


From the [History section of Wikipedia's page on Huffman coding](https://en.wikipedia.org/wiki/Huffman_coding#History):


> In 1951, David A. Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code. Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree and quickly proved this method the most efficient.


> In doing so, Huffman outdid Fano, who had worked with information theory inventor Claude Shannon to develop a similar code. Building the tree from the bottom up guaranteed optimality, unlike top-down Shannon-Fano coding.


Read [A Method for the Construction of Minimum-Redundancy Codes](http://compression.ru/download/articles/huff/huffman_1952_minimum-redundancy-codes.pdf) (David Huffman's original paper) and by drawing pictures, make the connection between trees and rivers vivid in your mind!
