# Huffman Coding

- **Created by Andrés Segura Tinoco**  
- **Created on June 20, 2019**

In computer science and information theory, a **Huffman Code** is a particular type of optimal prefix code that is commonly used for lossless data compression. The output from Huffman's algorithm can be viewed as a variable-length code table for encoding a source symbol (such as a character in a file). The algorithm derives this table from the estimated probability or frequency of occurrence (weight) for each possible value of the source symbol. <a href='#link_one'>[1]</a>
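
As a small illustration of where the weights come from, the sketch below estimates symbol probabilities by counting characters in a text sample. The helper name `symbol_weights` and the sample string are assumptions made for this example; they are not part of the original notebook.

```python
from collections import Counter

def symbol_weights(text):
    """Return a dict mapping each character in text to its relative frequency."""
    counts = Counter(text)
    total = sum(counts.values())
    return {symbol: count / total for symbol, count in counts.items()}

# Hypothetical sample text; any non-empty string works
weights = symbol_weights("abracadabra")
```

The resulting dictionary maps each symbol to a weight, which is the input shape a Huffman coder expects.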

## Huffman Code from Scratch

In [1]:
# Class HuffmanCode from Scratch
class HuffmanCode:
    
    # Return a Huffman code for an ensemble with distribution p
    def get_code(self, p_symbols):
        
        # Init validation
        N = len(p_symbols)
        if N == 0:
            return dict()
        elif N == 1:
            return dict(zip(p_symbols.keys(), ['0']))
        
        # Ensure probabilities sum to 1
        self.normalize_weights(p_symbols)
        
        # Build and return the Huffman code
        return self._get_code(p_symbols)
    
    # (Private) Calculate Huffman code
    def _get_code(self, p):
        
        # Base case of only two symbols, assign 0 or 1 arbitrarily
        if len(p) == 2:
            return dict(zip(p.keys(), ['0', '1']))
        
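        # Create a new distribution by merging the lowest-probability pair.
        # The merged node is keyed by a tuple rather than s1 + s2, so it cannot
        # collide with an existing symbol (e.g. 'a' + 'b' vs. a symbol 'ab')
        p_prime = p.copy()
        s1, s2 = self.lowest_prob_pair(p)
        p1, p2 = p_prime.pop(s1), p_prime.pop(s2)
        merged = (s1, s2)
        p_prime[merged] = p1 + p2
        
        # Recurse and construct code on new distribution
        code = self._get_code(p_prime)
        s1s2 = code.pop(merged)
        code[s1], code[s2] = s1s2 + '0', s1s2 + '1'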
        
        return code
    
    # Return pair of symbols from distribution p with lowest probabilities
    def lowest_prob_pair(self, p):
        
        # Ensure there are at least 2 symbols in the distribution
        if len(p) >= 2:
            sorted_p = sorted(p.items(), key=lambda x: x[1])
            return sorted_p[0][0], sorted_p[1][0]
        
        return (None, None)
    
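    # Rescale the weights in place so that they add up to t_weight (1 by default)
    def normalize_weights(self, p_symbols, t_weight=1.0):
        n = sum(p_symbols.values())
        
        if n <= 0:
            raise ValueError('The total weight must be positive')
        
        if n != t_weight:
            for s in p_symbols:
                p_symbols[s] = p_symbols[s] * t_weight / n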
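
The class above re-sorts the whole distribution at every merge. A common alternative keeps the pending nodes in a min-heap. The following is a minimal sketch of that variant, not the notebook's method: the function name `huffman_code_heap` is hypothetical, and the counter is only a tie-breaker so that the heap never has to compare the dictionaries it carries.

```python
import heapq
from itertools import count

def huffman_code_heap(weights):
    """Build a prefix code from a dict of symbol -> weight using a min-heap."""
    if not weights:
        return {}
    if len(weights) == 1:
        return {symbol: '0' for symbol in weights}

    tie = count()  # unique sequence number breaks ties between equal weights
    # Each heap entry holds the partial codes of all symbols under that node
    heap = [(w, next(tie), {s: ''}) for s, w in weights.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Pop the two lightest nodes and merge them under a new parent
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in codes1.items()}
        merged.update({s: '1' + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))

    return heap[0][2]

# Same weights as the six-symbol example in this notebook
heap_code = huffman_code_heap(
    {'a': 0.05, 'b': 0.09, 'c': 0.12, 'd': 0.13, 'e': 0.16, 'f': 0.45}
)
```

Weights do not need to be normalized here, because only their relative order matters when choosing which nodes to merge.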

## Example execution

In [2]:
# Create Huffman Code instance
hc = HuffmanCode()

In [3]:
# Alphabet with 1 symbol
sample_1 = { 'a': 1.0 }
hc.get_code(sample_1)

{'a': '0'}

In [4]:
# Alphabet with 3 symbols and total probability less than 1
sample_2 = { 'a': 0.6, 'b': 0.25, 'c': 0.1 }
hc.get_code(sample_2)

{'a': '0', 'c': '10', 'b': '11'}

In [5]:
# Alphabet with 6 symbols and total probability equal to 1.0
sample_3 = { 'a': 0.05, 'b': 0.09, 'c': 0.12, 'd': 0.13, 'e': 0.16, 'f': 0.45 }
hc.get_code(sample_3)

{'f': '0', 'e': '111', 'c': '100', 'd': '101', 'a': '1100', 'b': '1101'}
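
To show how such a table is used, here is a small sketch that encodes a string with the code from the last example and decodes it again. The `encode` and `decode` helpers are assumptions added for illustration; decoding bit by bit works because no codeword is a prefix of another.

```python
def encode(text, code):
    """Concatenate the codeword of every symbol in text."""
    return ''.join(code[symbol] for symbol in text)

def decode(bits, code):
    """Read bits left to right, emitting a symbol whenever a codeword completes."""
    inverse = {word: symbol for symbol, word in code.items()}
    decoded, buffer = [], ''
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            decoded.append(inverse[buffer])
            buffer = ''
    if buffer:
        raise ValueError('Trailing bits do not form a complete codeword')
    return ''.join(decoded)

# Code table copied from the output above
code = {'f': '0', 'e': '111', 'c': '100', 'd': '101', 'a': '1100', 'b': '1101'}
bits = encode('face', code)      # '01100100111'
restored = decode(bits, code)    # 'face'
```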

## Metrics
- Time
- Size
- Distribution type: normal, uniform
- Entropy
- Parallelism
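
Two of these metrics can be sketched directly: the entropy of the source and the average codeword length of a given code. For a Huffman code over a distribution with entropy H, the average length L satisfies H ≤ L < H + 1. The function names below are hypothetical, and the sketch assumes the probabilities already sum to 1.

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits of a distribution given as symbol -> probability."""
    return -sum(w * log2(w) for w in p.values() if w > 0)

def average_code_length(p, code):
    """Expected number of bits per symbol when encoding with the given code."""
    return sum(w * len(code[s]) for s, w in p.items())

# Distribution and code table taken from the six-symbol example
p = {'a': 0.05, 'b': 0.09, 'c': 0.12, 'd': 0.13, 'e': 0.16, 'f': 0.45}
code = {'f': '0', 'e': '111', 'c': '100', 'd': '101', 'a': '1100', 'b': '1101'}

h = entropy(p)
avg = average_code_length(p, code)   # about 2.24 bits per symbol
```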

## References

<a name='link_one' href='https://en.wikipedia.org/wiki/Huffman_coding' target='_blank'>[1]</a> Wikipedia - Huffman coding.
