# Huffman Coding

- **Created by Andrés Segura Tinoco**  
- **Created on June 20, 2019**

In computer science and information theory, a **Huffman Code** is a particular type of optimal prefix code that is commonly used for lossless data compression. The output from Huffman's algorithm can be viewed as a variable-length code table for encoding a source symbol (such as a character in a file). The algorithm derives this table from the estimated probability or frequency of occurrence (weight) for each possible value of the source symbol. <a href='#link_one'>[1]</a>

In [1]:
# Load Python libraries
import pandas as pd
from collections import Counter

## 1. Huffman Code from Scratch

In [2]:
# Class HuffmanCode from scratch
class HuffmanCode:
    
    # Return a Huffman code for an ensemble with distribution p
    def get_code(self, p_symbols):
        
        # Init validation
        n = len(p_symbols)
        if n == 0:
            return dict()
        elif n == 1:
            return dict(zip(p_symbols.keys(), ['0']))
        
        # Ensure probabilities sum to 1
        self.normalize_weights(p_symbols)
        
        # Returns Huffman code
        return self._get_code(p_symbols)
    
    # (Private) Calculate Huffman code
    def _get_code(self, p):
        
        # Base case of only two symbols, assign 0 or 1 arbitrarily
        if len(p) == 2:
            return dict(zip(p.keys(), ['0', '1']))
        
        # Create a new distribution by merging lowest prob pair
        p_prime = p.copy()
        s1, s2 = self.lowest_prob_pair(p)
        p1, p2 = p_prime.pop(s1), p_prime.pop(s2)
        p_prime[s1 + s2] = p1 + p2
        
        # Recurse and construct code on new distribution
        code = self._get_code(p_prime)
        symbol = s1 + s2
        s1s2 = code.pop(symbol)
        code[s1], code[s2] = s1s2 + '0', s1s2 + '1'
        
        return code
    
    # Return pair of symbols from distribution p with lowest probabilities
    def lowest_prob_pair(self, p):
        
        # Ensure there are at least 2 symbols in the dist.
        if len(p) >= 2:
            sorted_p = sorted(p.items(), key=lambda x: x[1])
            return sorted_p[0][0], sorted_p[1][0]
        
        return None, None
    
    # Rescale the weights in place so that they add up to t_weight (1.0 by default)
    def normalize_weights(self, p_symbols, t_weight=1.0):
        n = sum(p_symbols.values())
        
        if n != t_weight:
            for s in p_symbols:
                p_symbols[s] = p_symbols[s] * t_weight / n

In [3]:
# Create Huffman Code instance
hc = HuffmanCode()
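
The class above re-sorts the whole distribution at every recursion step. As a point of comparison, here is a minimal sketch of the same greedy merge done iteratively with a priority queue. The function name `huffman_code_heap` and the tie-breaking counter are illustrative assumptions and are not part of the notebook's `HuffmanCode` class.

```python
import heapq
from itertools import count

def huffman_code_heap(weights):
    """Build a Huffman code table from a {symbol: weight} mapping (sketch)."""
    if not weights:
        return {}
    if len(weights) == 1:
        return {symbol: '0' for symbol in weights}

    # The counter breaks ties so that the dicts are never compared
    tie_breaker = count()
    # Each heap entry is (weight, tie_breaker, {symbol: partial code word})
    heap = [(w, next(tie_breaker), {s: ''}) for s, w in weights.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Pop the two lightest subtrees and merge them under a new node
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in left.items()}
        merged.update({s: '1' + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tie_breaker), merged))

    return heap[0][2]

# Usage with an illustrative six-symbol distribution
weights = {'a': 0.05, 'b': 0.09, 'c': 0.12, 'd': 0.13, 'e': 0.16, 'f': 0.45}
heap_code = huffman_code_heap(weights)
print({s: len(c) for s, c in heap_code.items()})
```

The individual code words may differ from those produced by the recursive class, because the 0/1 assignment at each merge is arbitrary, but the code lengths should agree whenever there are no ties between weights.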

### Simple Example

In [4]:
# Alphabet with 1 symbol
sample_1 = { 'a': 1.0 }
hc.get_code(sample_1)

{'a': '0'}

In [5]:
# Alphabet with 3 symbols and total probability less than 1
sample_2 = { 'a': 0.6, 'b': 0.25, 'c': 0.1 }
hc.get_code(sample_2)

{'a': '0', 'c': '10', 'b': '11'}

In [6]:
# Alphabet with 6 symbols and total probability equal to 1.0
sample_3 = { 'a': 0.05, 'b': 0.09, 'c': 0.12, 'd': 0.13, 'e': 0.16, 'f': 0.45 }
hc.get_code(sample_3)

{'f': '0', 'e': '111', 'c': '100', 'd': '101', 'a': '1100', 'b': '1101'}
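
A Huffman code is a prefix code: no code word is the beginning of another, which is what makes the encoded bit stream decodable without separators. The sketch below checks that property for the six-symbol table shown above and runs an encode/decode round trip. The helpers `is_prefix_free`, `encode` and `decode` are hypothetical names introduced here for illustration only.

```python
def is_prefix_free(code):
    """Return True if no code word is a prefix of another one."""
    words = sorted(code.values())
    # After sorting, a prefix is always immediately followed by a word it prefixes
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

def encode(symbols, code):
    """Concatenate the code words of a sequence of symbols."""
    return ''.join(code[s] for s in symbols)

def decode(bits, code):
    """Decode a bit string by matching code words greedily, bit by bit."""
    inverse = {word: symbol for symbol, word in code.items()}
    symbols, buffer = [], ''
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            symbols.append(inverse[buffer])
            buffer = ''
    if buffer:
        raise ValueError('trailing bits do not form a complete code word')
    return symbols

# Code table copied from the six-symbol example
code = {'f': '0', 'e': '111', 'c': '100', 'd': '101', 'a': '1100', 'b': '1101'}

bits = encode('face', code)
print(is_prefix_free(code))  # True
print(bits)                  # 01100100111
print(decode(bits, code))    # ['f', 'a', 'c', 'e']
```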

### Complex Example

In [7]:
# Read a file as raw bytes
def get_image_bytes(file_path):
    with open(file_path, 'rb') as f:
        return bytearray(f.read())

In [8]:
# Loading an example image
file_path = "../data/img/example-2.png"
low_byte_list = get_image_bytes(file_path)

In [9]:
# Size of the image file (KB)
round(len(low_byte_list) / 1024, 2)

2884.01

In [10]:
# Count how often each byte value occurs in term_list
def get_term_freq(term_list):
    term_freq = {}
    terms_count = dict(Counter(term_list))
    
    for key, value in terms_count.items():
        # Byte values are integers; use the matching character as dictionary key
        if isinstance(key, int):
            key = chr(key)
        term_freq[key] = value
    
    return term_freq

In [11]:
# Alphabet with 256 symbols
term_freq = get_term_freq(low_byte_list)
len(term_freq)

256

In [12]:
# Normalize term frequency
n = sum(term_freq.values())
for term in term_freq:
    term_freq[term] = term_freq[term] / n
sum(term_freq.values())

0.9999999999999999

In [13]:
# Get Huffman coding
h_code = hc.get_code(term_freq)

In [14]:
# Showing data
codes = pd.DataFrame([term_freq, h_code]).T
codes.reset_index(level=0, inplace=True)
codes.columns = ["code", "frequency", "binary"]
codes.head(20)

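| | code | frequency | binary |
|---|---|---|---|
| 0 | `\x00` | 0.00324222 | 110111 |
| 1 | `\x01` | 0.00321953 | 110100 |
| 2 | `\x02` | 0.00275022 | 1001 |
| 3 | `\x03` | 0.00405015 | 10010101 |
| 4 | `\x04` | 0.00258937 | 111100011 |
| 5 | `\x05` | 0.00314335 | 101110 |
| 6 | `\x06` | 0.00338816 | 1000110 |
| 7 | `\x07` | 0.00461089 | 11001101 |
| 8 | `\x08` | 0.00267674 | 111111100 |
| 9 | `\t` | 0.00325069 | 111011 |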


In [15]:
# Calculate the simple (unweighted) and the frequency-weighted average code length
msg_size_current = 8
msg_size_simple = 0
msg_size_weighted = 0

for key, value in h_code.items():
    msg_size_simple += len(value)
    msg_size_weighted += len(value) * term_freq[key]
        
msg_size_simple = msg_size_simple / len(h_code)

In [16]:
# Current message size average (bits per symbol)
msg_size_current

8

In [17]:
# Simple message size average (bits per symbol)
msg_size_simple

8.02734375

In [18]:
# Weighted message size average (bits per symbol)
msg_size_weighted

7.997494263047528

In [19]:
# Space saving (%) relative to the fixed 8 bits per symbol
compress_rate = (msg_size_current - msg_size_weighted) / msg_size_current
round(compress_rate * 100, 2)

0.03
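
The weighted average code length of a Huffman code is bounded by the Shannon entropy H of the distribution: it is at least H and less than H + 1 bits per symbol. PNG data is already DEFLATE-compressed, so its byte distribution is close to uniform, which is consistent with the very small saving above. The following sketch illustrates the bound on the six-symbol example rather than on the image; `shannon_entropy` and `average_code_length` are hypothetical helper names, not functions from the notebook.

```python
import math

def shannon_entropy(p):
    """Entropy in bits per symbol of a {symbol: probability} mapping."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def average_code_length(p, code):
    """Probability-weighted average code word length in bits per symbol."""
    return sum(p[s] * len(code[s]) for s in p)

# Distribution and code table copied from the six-symbol example
p = {'a': 0.05, 'b': 0.09, 'c': 0.12, 'd': 0.13, 'e': 0.16, 'f': 0.45}
code = {'f': '0', 'e': '111', 'c': '100', 'd': '101', 'a': '1100', 'b': '1101'}

entropy = shannon_entropy(p)
avg_len = average_code_length(p, code)
print(round(entropy, 2), round(avg_len, 2))  # 2.22 2.24
```

Here the code costs about 0.02 bits per symbol more than the entropy, compared with the 3 bits per symbol a fixed-length code for six symbols would need.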

## 2. Compress File with Huffman Code

In [20]:
# Compressing file with Huffman code
compress_list = []

for symbol in low_byte_list:
    key = chr(symbol)
    new_symbol = h_code[key]
    compress_list.append(new_symbol)

compress_file = "".join(compress_list)
print(compress_file[:500])

00110011000111011001001001111011011101011111011110010010111110111100110111001101110011011101110101010000010001100111101001100000100001101110011011111001101001111010011011100110111111100011010011111111111000100011000110111001101110011011100100101001101101011011011001111001101110011011100110111001101001010000000000100011110110010010000110111101010011000000001111111110101110011011100110111001101111111000111000111100111100100000110011110000110111001101110101111110111100100100000000011010100110010111


In [21]:
# Size of the compressed data (KB), assuming 8 bits are packed into each byte
round(len(compress_file) / 8 / 1024, 2)

2883.1
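
The size above assumes that the bits are packed eight to a byte, whereas `compress_file` is a Python string that spends one character per bit. A minimal sketch of that packing step follows, under the assumption that the number of padding bits is stored alongside the data (together with the code table) so that decoding can drop them. `pack_bits` and `unpack_bits` are hypothetical helpers, not part of the notebook.

```python
def pack_bits(bits):
    """Pack a string of '0'/'1' characters into bytes; return (data, padding)."""
    padding = (8 - len(bits) % 8) % 8
    padded = bits + '0' * padding
    data = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return data, padding

def unpack_bits(data, padding):
    """Inverse of pack_bits: rebuild the bit string and drop the padding bits."""
    bits = ''.join(format(byte, '08b') for byte in data)
    return bits[:len(bits) - padding] if padding else bits

# Round trip on a short illustrative bit string (11 bits, so 5 padding bits)
bits = '01100100111'
data, padding = pack_bits(bits)
print(data.hex(), padding)                 # 64e0 5
print(unpack_bits(data, padding) == bits)  # True
```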

## References

<a name='link_one' href='https://en.wikipedia.org/wiki/Huffman_coding' target='_blank'>[1]</a> Wikipedia - Huffman coding.  
