Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Note that this Pre-class Work is estimated to take **40 minutes**.

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Enjui Chang"
COLLABORATORS = "Cameron Watts"

---

# CS110 Pre-class Work - Huffman codes

In this Pre-class work we will apply Huffman's algorithm in file compression. 

## Question 1 [time estimate: 2 minutes]
Below is the utility function for downloading a text file from a URL. 

In [2]:
from urllib.request import urlopen
import shutil
import gzip
import os

# Download the file if need be:
def download_file(url, filename):
    if not os.path.exists(filename):
        response = urlopen(url + filename)
        shutil.copyfileobj(
            gzip.GzipFile(fileobj=response), open(filename+'.txt', 'wb'))

url = "http://www.gutenberg.org/ebooks/"
filename = "100.txt.utf-8"

download_file(url, filename) 

Your tasks:

1. Run the cell so that the file "100.txt.utf-8" is downloaded to your local machine. Please allow some time for the code to complete.
2. Check that the file "100.txt.utf-8" has been downloaded to your computer.
3. Open and view the file with your favorite text editor. 
4. In the cell below, write down the size of the downloaded file (for example, 1.2GB)

5.5MB

## Question 2 [time estimate: 8 minutes]

Now, as a bit of an interlude, we will get familiar with the `bitarray` Python library, which is helpful for completing this pre-class work. Go to this [link](https://pypi.org/project/bitarray/) and read the examples in the first three cell of section **Using the module**. Once you complete the reading task, please complete the function `get_bit_array` in the cell below.

In [3]:
from bitarray import bitarray

class Node(object):
    """
    A node in a binary tree that represents a prefix code.
    
    Attributes
    ----------
    freq : float
        Frequency of the character
    symb : str
        A character in the file 
    parent : a node, optional
        Parent of the current node in the tree
    lchild : a node, optional
        Left child node of the current node in the tree
    rchild: a node, optional
        Right child node of the current node in the tree

    """
    
    def __init__(self, freq, symb, parent = None, lchild = None, rchild = None):
        self.freq = freq
        self.symb = symb
        self.parent = parent
        self.lchild = lchild
        self.rchild = rchild
    
    def __lt__(self, other):
        """
        This function allow us push/insert a node into a heap as well as
        extract/pop the minimum node from a heap. 
        
        You can brush up your memory on heaps by visiting this link:
        https://docs.python.org/3.0/library/heapq.html
        
        Note
        ----
        nodeA < nodeB returns True if nodeA.freq < nodeB.freq
        
        """
        return self.freq < other.freq
    
    
def get_bitarray(node):
    """
    Given a node in the tree, determines the corresponding codeword for character
    node.symb, using the rule in Cormen et al.: the binary codeword for a character 
    is the simple path from the root to that character, where 0 means “go to the 
    left child” and 1 means “go to the right child.
    
    Parameters
    ----------
    node: a node
        A node whose codeword represented by the tree is of interest
    
    Returns
    -------
    a : bitarray
        A bit array that represents the codeword. 
        
    Example
    -------
    If the codeword is 01001, then a is bitarray('01001')
    
    """
    # empty list for storage
    a = []
    
    # when there is a parent, interatively run the function
    while node.parent:
    
        # if the node is the left child return 1
        if node.parent.lchild == node:
            a.append(True)
            

        # if the node is the left child return 0
        else:
            node.parent.rchild == node
            a.append(False)
        
        # change the pointer to the parent node
        node = node.parent
        
    # reverse becuase how append works
    a.reverse()
    
    # classify as bitarray
    a = bitarray(a)
    return a

## Question 3 [time estimate: 10 minutes]

Complete the following function that builds a Huffman code, making use of `get_bitarray` and the module `heapq`. 

In [4]:
import heapq
def encode(symb2freq):
    """
    Huffman encode the given dict mapping symbols to weights. 
    
    Parameters
    ----------
    symb2freq : dict 
        A dictionary that maps a symbol/character to the probability frequency 
        in the text file. 
    
    Returns
    -------
    out : dict
         A dictionary that maps a symbol/charcater to a bitarray that represents the 
         codeword for that symbol. 
         
    Examples
    --------
    symb2freq = {'a': .3, 'b':.6, 'c': .1}. This means that symbol 'a' appears with 
    frequency 30%, symbol 'b' 60%, and symbol 'c' 10%.
        
    out = {'a': bitarray('01'), 'b': bitarray('11'), 'c': bitarray('101')}.
    
    """
    
    # find the length
    n = len(symb2freq)
    
    # intialize storage
    min_heap = []
    all_leaf = []
    
    # create min_heap
    for key, freq in symb2freq.items():
        x = Node(freq,key)
        heapq.heappush(min_heap,x) # push into heap
        all_leaf.append(x) # additional list for dictionary construction
    
        
    # build the tree 
    for i in range(n-1):
        
        # pop the the min nodes
        x = heapq.heappop(min_heap)
        y = heapq.heappop(min_heap)
        
        # construct a new nodes as the parents of x and y
        z = Node(x.freq+y.freq, None, None, x, y)
        
        x.parent = z
        y.parent = z
        
        # push the new node in the heap
        heapq.heappush(min_heap,z)
    
    # create output dict
    out = {}
    for i in all_leaf:
        out[i.symb] = get_bitarray(i)
        
    return out

## Question 4 [time estimate: 7 minutes]

Below you are given three functions to 1) build a frequency table for a file, 2) compress a file, and 3) decompress a file. Make use of these functions to do the following:

1. Create a compressed version of file `100.txt.utf-8.txt` that is named `100.txt.utf-8.txt.huff`.
2. Decompress `100.txt.utf-8.txt.huff` to file `100.txt.utf-8.txt.huff.dehuff.txt`. 

In [5]:
from collections import defaultdict 
import pickle

# build a frequency table:
def build_freq(filename):
    freq = defaultdict(int)
    with open(filename, 'rb') as f:
        for line in f:
            for char in line:
                freq[char] += 1
    total = float(sum(freq.values()))
    return {char: count / total for (char, count) in freq.items()}


# Now compress the file:
def compress(filename, encoding, compressed_name=None):
    if compressed_name is None:
        compressed_name = filename + ".huff"
    output = bitarray()
    with open(filename, 'rb') as f:
        for line in f:
            for char in line:
                output.extend(encoding[char])
    N = len(output)
    with open(compressed_name, 'wb') as f:
        pickle.dump(N, f)
        pickle.dump(encoding, f)
        output.tofile(f)


# Now decompress the file:
def decompress(filename, decompressed_name=None):
    if decompressed_name is None:
        decompressed_name = filename + ".dehuff.txt"
    with open(filename, 'rb') as f:
        N = pickle.load(f)
        encoding = pickle.load(f)
        bits = bitarray()
        bits.fromfile(f)
        bits = bits[:N]

    # Totally cheating here and using a builtin method:
    output = bits.decode(encoding)
    with open(decompressed_name, 'wb') as f:
        f.write(bytes(output))

# build the frequent table
freq_table = build_freq("100.txt.utf-8")

# encode the frequent table
encoded = encode(freq_table)

# compress the file
compress("100.txt.utf-8", encoded)

# decompress the file
decompress("100.txt.utf-8.huff")

## Question 5 [time estimate: 3 minutes] 

Give your answer in the cell below:
1. Report the size of the compressed file and the decompressed file in the cell below.
2. How does the size of the decompressed file compare to the size of the original file (`100.txt.utf-8.txt`)?
3. Visually skim the decompressed file and the original file. Do they appear identical?

1. 100.txt.utf-8.txt: 5.5 MB
2. 100.txt.utf-8.txt.huff: 3.2 MB
3. 100.txt.utf-8.txt.dehuff.txt: 5.5 MB

The decompressed file is the same size as the original file and they appear identical to each other.

## Question 6 [time estimate: 10 minutes]

Compute and print out:
1. The percentage of 1’s in the compressed version
2. The percentage of 1’s in the uncompressed version

In [6]:
# modify the frequency table to include the count of each character
def build_freq_modified(filename):
    freq = defaultdict(int)
    with open(filename, 'rb') as f:
        for line in f:
            for char in line:
                freq[char] += 1
    total = float(sum(freq.values()))
    return {char: count for (char, count) in freq.items()}

# build the modified frequency table
freq_table_uncompress = build_freq_modified("100.txt.utf-8")
freq_table_compress = build_freq_modified("100.txt.utf-8.huff")

In [7]:
# compressed version
total_ones = 0
total_zeros = 0
total_bits = 0

# encode the characters within this frequency table
encoded = encode(freq_table_compress)

for key, count in freq_table_compress.items():

    # calculate the 1s, 0s, and bits for each character 
    ones = 0
    zeros = 0
    bits = 0
    
    for bit in encoded[key]:
        bits += 1
        if bit:
            ones += 1
        else:
            zeros += 1
    
    # multiply by the count of each character
    total_ones += ones * count
    total_zeros += zeros * count
    total_bits += bits * count

# print the result
print("---Compressed---")
print("Percentage of 1s:", total_ones * 100 / total_bits, "%")
print("Percentage of 0s:", total_zeros * 100 / total_bits, "%")

# uncompressed version
total_ones = 0
total_zeros = 0
total_bits = 0

# encode the characters within this frequency table
encoded = encode(freq_table_uncompress)

for key, count in freq_table_uncompress.items():

    # calculate the 1s, 0s, and bits for each character 
    ones = 0
    zeros = 0
    bits = 0
    
    for bit in encoded[key]:
        bits += 1
        if bit:
            ones += 1
        else:
            zeros += 1
    
    # multiply by the count of each character
    total_ones += ones * count
    total_zeros += zeros * count
    total_bits += bits * count

# print the result
print("---Uncompressed---")
print("Percentage of 1s:", total_ones * 100 / total_bits, "%")
print("Percentage of 0s:", total_zeros * 100 / total_bits, "%")

---Compressed---
Percentage of 1s: 47.76543165862384 %
Percentage of 0s: 52.23456834137616 %
---Uncompressed---
Percentage of 1s: 46.29646894848983 %
Percentage of 0s: 53.70353105151017 %
