<a href="https://colab.research.google.com/github/davidludington/comp363assignments/blob/main/363_SP_24_Huffman_David_Ludington.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Huffman encoding

## Objective

Encode an input string with a variable-length code based on [Huffman's technique](https://drive.google.com/file/d/1alRJwxo6tYZZA81tStR4pk5OZkFDIua8/view?usp=drive_link). The Huffman algorithm is:

```text
Initialize: forest of leaf nodes, each containing a
symbol from the input string and its frequency.

while forest has more than 1 node:

  remove two nodes with the lowest frequencies.
  
  create a new node with no symbol and the sum of
  these two lowest frequencies

  add the removed nodes as left and right children
  to the new node

  add the new node to the forest

The path from the remaining node in the forest to each
leaf node is the Huffman code of the symbol in that node.
```

As we discussed in class, the technique is conceptually straightforward. However, its implementation presents some challenges.
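
To make the loop concrete, here is a minimal sketch that runs it on bare frequencies for `'HELLO WORLD'`, without building any tree. The function name `trace_merges` is hypothetical, and Python's `heapq` module is assumed as the forest. It relies on a known property of Huffman trees: the sum of all merged frequencies equals the length, in bits, of the encoded message.

```python
import heapq
from collections import Counter

def trace_merges(message: str) -> int:
    """Runs the Huffman loop on frequencies only; returns the encoded length in bits."""
    forest = list(Counter(message).values())  # one frequency per distinct symbol
    heapq.heapify(forest)                     # min-heap: lowest frequencies come out first
    total_bits = 0
    while len(forest) > 1:
        a = heapq.heappop(forest)             # lowest frequency
        b = heapq.heappop(forest)             # second lowest frequency
        print(f"merge {a} + {b} -> {a + b}")
        total_bits += a + b                   # a merge adds one bit to every symbol below it
        heapq.heappush(forest, a + b)
    return total_bits

print(trace_merges("HELLO WORLD"))
```

The sketch only tracks numbers, so it cannot produce codes; it is useful mainly as a sanity check on the length of an encoded message.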

## How to represent symbol frequencies

The string `'HELLO WORLD'` has eight symbols: `H`, `E`, `L`, `O`, space, `W`, `R`, and `D`. Their frequencies are 1, 1, 3, 2, 1, 1, 1, and 1. In other words, each symbol appears once except for `L` and `O`, which appear three and two times, respectively.

We discussed two possible ways to represent the frequencies in an implementation: as a humble array (a Python list) or as a hash table (a Python dictionary).
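
As a point of comparison for the dictionary option, Python's standard library offers `collections.Counter`, a dictionary subclass that counts hashable items. The sketch below is an aside under that assumption, not part of the approach discussed in class.

```python
from collections import Counter

# Counter is a dict subclass: keys are symbols, values are their frequencies
frequency = Counter("HELLO WORLD")

print(frequency["L"])   # frequency of 'L'
print(frequency["O"])   # frequency of 'O'
print(len(frequency))   # number of distinct symbols
```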

In [None]:
def frequencies_dict(message: str) -> dict:
  """Returns the symbol frequency of a string as a dictionary."""
  frequency = dict() # Initialize a dictionary
  if message is not None and len(message) > 0: # Input not null and not empty
    for symbol in message: # For every symbol in the string
      if symbol in frequency: # If symbol already in dictionary
        frequency[symbol] += 1 # increase frequency for this symbol
      else: # Symbol not in dictionary yet
        frequency[symbol] = 1 # Initialize frequency for this symbol
  return frequency # Done

In [None]:
def frequencies_list(message: str) -> list:
  """Returns the symbol frequency of a string as a list."""
  space = ord(' ') # lowest ASCII value to consider
  tilde = ord('~') # highest ASCII value to consider
  from_space_to_tilde = tilde-space+1 # Number of ASCII values to consider, both ends included
  frequency = [0] * from_space_to_tilde # Array for ASCII values to consider
  if message is not None and len(message) > 0: # Input not null and not empty
    for symbol in message: # For every symbol in the string
      symbol_ascii = ord(symbol)-space # Symbol ASCII shifted for array indexing
      if 0 <= symbol_ascii < from_space_to_tilde: # ASCII symbol within range
        frequency[symbol_ascii] += 1 # Update frequency for this symbol
  return frequency # Done

In [None]:
class Huffman_Node:
  """Plain node suitable for binary trees. The node stores a frequency and a
  symbol. When no symbol is given, the node considers it null, and stores only
  the frequency."""
  # Constructor
  def __init__(self, frequency, symbol=None):
    self.symbol = symbol
    self.frequency = frequency
    self.left = None
    self.right = None
  # Override < operator to compare nodes based only on frequency
  def __lt__(self, other):
    return self.frequency < other.frequency

# Your assignment

Write as many methods as you feel are needed to complete the encoding of a string.

Your deliverables are:

* the encoded message
* the encoding table (so that whoever receives the encoded message can decode it),
* a compression report (see below for details),
* a method that decodes an encoded message (when the encoding table is available).

The compression report should be a *formatted* string that, when printed, will display the following:

```
Input string length: 1,234 characters
8-bit storage required: 9,872 bits
Encoded string length: 456 bits
Net compression: 95.4%
```

The input string length is just `len(message)`. The 8-bit storage required is `8*len(message)`. The encoded string length is the number of 0s and 1s required to replace each symbol in the message with the corresponding Huffman code. Net compression compares the encoded length with the 8-bit storage (both measured in bits) and is defined as

$$100\times\left(1-\frac{\text{Encoded string length}}{\text{8-bit storage required}}\right)$$
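
The report's arithmetic can be sketched as follows, using the sample numbers as input. The sketch assumes net compression compares the encoded length with the 8-bit storage size, both in bits; the helper name `net_compression` is hypothetical. The `:,` format specifier produces the thousands separators shown in the report.

```python
def net_compression(input_length: int, encoded_length: int) -> float:
    """Percentage of bits saved relative to 8-bit storage (assumed definition)."""
    eight_bit_storage = 8 * input_length
    return 100 * (1 - encoded_length / eight_bit_storage)

input_length, encoded_length = 1234, 456  # numbers from the sample report
print(f"Input string length: {input_length:,} characters")
print(f"8-bit storage required: {8 * input_length:,} bits")
print(f"Encoded string length: {encoded_length:,} bits")
print(f"Net compression: {net_compression(input_length, encoded_length):.1f}%")
```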

## Anticipated challenges

Once you finish the while-loop in the pseudocode, you'll need to traverse every path from the root node to each symbol, to obtain its Huffman code. If the input string is `'HELLO WORLD'`, the final tree will be the following.


![huffman-complete](https://drive.google.com/uc?id=1axiu-SAImK4yTIFBoUwMLfWvYnVLTUOl)

Obtaining the codes (for example, `LLLL` for `'H'`, `LLLR` for `'E'`, etc.) is probably the most challenging part of the implementation. Once you have your encoding table, the original message `'HELLO WORLD'` will be encoded as:

```text
LLLLLLLRLRLRLLRRLLRLRLLRRRLLRRRR
```
The Huffman encoding takes 32 bits (if we convert `L` and `R` directions to 0s and 1s). This is significantly shorter than the plain ASCII encoding, which requires 88 bits:

```text
0100100001000101010011000100110001001111001000000101011101001111010100100100110001000100
```
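
The 88-bit string above can be reproduced with a short sketch that writes each character's ASCII value as 8 binary digits. The variable names are illustrative.

```python
message = "HELLO WORLD"

# format(value, "08b") writes an integer in binary, zero-padded to 8 digits
ascii_bits = "".join(format(ord(symbol), "08b") for symbol in message)

print(ascii_bits)
print(len(ascii_bits), "bits")
```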

To obtain the codes you need to traverse the tree, saving the path directions for each symbol. Remember that the tree is just a graph. Consider the root node as the starting vertex, in a *reachability* scenario. Intuitively, we know that every leaf node is reachable from the starting vertex in this graph.

We looked at graph traversals for reachability studies a few weeks ago. The overall technique is to use some data structure (we called it a *bag*) to save nodes to explore later. When that bag is accessed in stack fashion, the traversal is called *depth-first search* (DFS).

DFS will get you to each leaf node. The trick now is to maintain a string, for each leaf node, accumulating the left/right choices leading to that node. You may also want to replace left/right with 0/1 to obtain a true binary representation.
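
One way to sketch this idea is an iterative DFS whose bag is a Python list used as a stack, where each entry pairs a node with the path that reached it. The `Node` class, the `codes_by_dfs` function, and the three-leaf example tree are all hypothetical stand-ins for illustration; they are not the assignment's solution.

```python
class Node:
    """Hypothetical minimal tree node: leaves carry a symbol, internal nodes do not."""
    def __init__(self, symbol=None, left=None, right=None):
        self.symbol = symbol
        self.left = left
        self.right = right

def codes_by_dfs(root: Node) -> dict:
    """Collects the root-to-leaf path of every symbol using an explicit stack."""
    codes = {}
    bag = [(root, "")]                 # stack of (node, path taken so far)
    while bag:
        node, path = bag.pop()         # last in, first out: depth-first order
        if node is None:
            continue
        if node.symbol is not None:    # leaf: the accumulated path is its code
            codes[node.symbol] = path
            continue
        bag.append((node.right, path + "1"))  # right choice recorded as 1
        bag.append((node.left, path + "0"))   # left choice recorded as 0
    return codes

# Toy tree: 'A' is the root's left child; 'B' and 'C' sit under the right child
tree = Node(left=Node("A"), right=Node(left=Node("B"), right=Node("C")))
print(codes_by_dfs(tree))
```

Storing the path next to each node in the bag means no separate bookkeeping is needed when the traversal backtracks.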

**If you have any doubts** about your DFS code, please share your notebook with me and let me help.




In [None]:
import heapq
def build_huffman_tree(frequencies: dict) -> Huffman_Node:
    """Builds a Huffman tree given a dictionary of symbol frequencies.
    Returns None when the dictionary is empty."""
    # Initialize the forest with one leaf node per symbol
    priority_queue = [Huffman_Node(freq, sym) for sym, freq in frequencies.items()]
    if not priority_queue: # No symbols, no tree
        return None
    heapq.heapify(priority_queue) # Convert the list into a min-heap ordered by frequency
    # Combine the two lowest-frequency nodes until only the root remains
    while len(priority_queue) > 1:
        left = heapq.heappop(priority_queue)
        right = heapq.heappop(priority_queue)
        merged = Huffman_Node(left.frequency + right.frequency)
        merged.left = left
        merged.right = right
        heapq.heappush(priority_queue, merged)
    return priority_queue[0]

In [None]:
def generate_encoding_table(root: Huffman_Node, prefix="", encoding_table=None) -> dict:
    """Generates an encoding table from a Huffman tree."""
    if encoding_table is None:
        encoding_table = {} # Start with an empty table, shared by the recursive calls
    if root is None:
        return encoding_table # Empty subtree: nothing to add
    if root.symbol is not None: # Leaf node: the path so far is the symbol's code
        # A tree with a single leaf has an empty path; use "0" so the code is not empty
        encoding_table[root.symbol] = prefix or "0"
        return encoding_table
    generate_encoding_table(root.left, prefix + "0", encoding_table)
    generate_encoding_table(root.right, prefix + "1", encoding_table)
    return encoding_table

In [None]:
def encode_message(message: str, encoding_table: dict) -> str:
    """Encodes a message using the given encoding table."""
    # Join the per-symbol codes in one pass instead of repeated concatenation
    return "".join(encoding_table[symbol] for symbol in message)

def decode_message(encoded_message: str, encoding_table: dict) -> str:
    """Decodes an encoded message using the given encoding table."""
    reverse_encoding_table = {code: symbol for symbol, code in encoding_table.items()}
    decoded_message = ""
    current_code = ""
    for bit in encoded_message:
        current_code += bit
        if current_code in reverse_encoding_table:
            decoded_message += reverse_encoding_table[current_code]
            current_code = ""
    return decoded_message

def compression_report(message: str, encoded_message: str) -> str:
    """Generates a compression report."""
    input_length = len(message)
    eight_bit_storage = 8 * input_length
    encoded_length = len(encoded_message)
    # Compare bits with bits: encoded length against 8-bit storage
    net_compression = 100 * (1 - encoded_length / eight_bit_storage) if eight_bit_storage else 0.0
    report = f"Input string length: {input_length:,} characters\n"
    report += f"8-bit storage required: {eight_bit_storage:,} bits\n"
    report += f"Encoded string length: {encoded_length:,} bits\n"
    report += f"Net compression: {net_compression:.1f}%\n"
    return report

In [None]:
message = "the quick brown fox jumpes over them "

frequencies = frequencies_dict(message)

root = build_huffman_tree(frequencies)

encoding_table = generate_encoding_table(root)

encoded_message = encode_message(message, encoding_table)

decoded_message = decode_message(encoded_message, encoding_table)

report = compression_report(message, encoded_message)

print("Original message:", message)
print("Encoded message:", encoded_message)
print("Decoded message:", decoded_message)
print("\nCompression Report:")
print(report)


Original message: the quick brown fox jumpes over them 
Encoded message: 11001000010111100111011001111110000010111100110100011010100101000111011101101000111110001001101001001001010101011111010101110110001111100100001011001111
Decoded message: the quick brown fox jumpes over them 

Compression Report:
Input string length: 37 characters
8-bit storage required: 296 bits
Encoded string length: 152 bits
Net compression: 48.6%

