# Basic Huffman Compression

In [1]:
import os

When computers save text documents, they usually use an encoding such as ASCII or UTF-8. In the simplest case each character is stored as one byte, so there are $256$ possible values to choose from (UTF-8 uses more than one byte for characters outside ASCII). The aim is to make a passage of text take fewer bits to store, and we can do this with a method called Huffman compression. The method loosely resembles symmetric encryption in that the same key is needed to encode and decode, the key in this case being the binary tree. It is not a secure cipher, though.

This method of compression is lossless, and the implementation below only works on text. Lossy compression is not suitable for text, since discarding information would change or remove words. The code below was written using Test Driven Development and uses the Python test framework 'unittest'. Additionally, it was designed to use as few imported modules as possible, so it can be run with only the Python standard library.

### How it works

When compressing, we need to create a Huffman binary tree, which acts like an encryption and decryption key. The tree is created as follows:

<ol>
    <li> Count up how many times each character occurs (find the letter frequency). </li>
    <li> Order the characters by count (the code below sorts from lowest count to highest).</li>
    <li> Take the two least common characters and add their frequencies together. This forms a branch.  </li>
    <li> Insert the branch back into the list so that the list is still in order. </li>
    <li> Keep taking the two least common elements until only one is left.</li>
</ol>
We now have our Huffman Binary Tree.😊
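
The steps above can be sketched independently of the class further down. This is a minimal sketch and not the notebook's implementation: it assumes a non-empty text, uses `collections.Counter` and `heapq` in place of the bubble sort, and the names `build_tree_sketch` and `code_lengths` are made up for illustration. Ties may be broken differently, so the tree shape can differ from the one `Huffman.make_tree` produces, although the code lengths are equally good.

```python
import heapq
from collections import Counter
from itertools import count


def build_tree_sketch(text):
    """Build a nested-list tree: end nodes are characters, branches are [left, right]."""
    tie_breaker = count()  # unique number so equal frequencies never compare the nodes
    heap = [(freq, next(tie_breaker), char) for char, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Take the two least common entries and merge them into a branch.
        freq_a, _, node_a = heapq.heappop(heap)
        freq_b, _, node_b = heapq.heappop(heap)
        heapq.heappush(heap, (freq_a + freq_b, next(tie_breaker), [node_a, node_b]))
    return heap[0][2]


def code_lengths(tree, depth=0):
    """Return {character: number of bits in its code} for a nested-list tree."""
    if isinstance(tree, str):
        return {tree: depth}
    lengths = {}
    for child in tree:
        lengths.update(code_lengths(child, depth + 1))
    return lengths


tree = build_tree_sketch("Bell777")
print(tree)
print(code_lengths(tree))
```

For 'Bell777' the most common character, '7', ends up one step from the root and the rarest characters, 'B' and 'e', three steps away.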

When uncompressing a compressed file we need the binary tree. Reading the compressed file's bits is a little like reading Morse code. In Morse code there is a brief pause between characters, which allows every node in the tree to represent a character. Since a stream of bits has no pauses, only the end nodes of the tree can store characters; this guarantees that no character's code is the start of another character's code. Each binary digit tells us whether to go left or right in the tree, and when we reach an end node we output its character and start again from the top.
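
The walk through the tree can be sketched in a few lines. This is an illustrative sketch rather than the class method itself; the function name `decode_bits_sketch` is an assumption, and the tree uses the same nested-list layout as the tests further down.

```python
def decode_bits_sketch(bits, tree):
    """Decode a string of '0'/'1' characters using a nested-list tree."""
    decoded = []
    node = tree
    for bit in bits:
        node = node[int(bit)]      # 0 -> left child, 1 -> right child
        if isinstance(node, str):  # reached an end node, so a character is complete
            decoded.append(node)
            node = tree            # start again from the top of the tree
    return "".join(decoded)


example_tree = [[["a", "b"], "c"], ["d", "e"]]
print(decode_bits_sketch("000001011011", example_tree))  # abcde
```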

Below is the Huffman Compression code.

In [2]:
from tkinter.filedialog import askopenfilename


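def binary_splitter(string, number):
    # Pad with zeros, cut the string into chunks of `number` bits and convert
    # each chunk to an int.
    # Note: len(string) % number is not always the padding needed to reach a
    # multiple of `number` (that would be -len(string) % number), so the last
    # chunk can be shorter than `number` bits.
    string += "0" * (len(string) % number)
    split_string = []
    for i in range(0, len(string), number):
        temp_add = string[i:i + number]
        temp_add = int(temp_add, 2)
        split_string.append(temp_add)
    return split_string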


def get_binary_string(byte_array):
    # Render each byte as exactly 8 binary digits, e.g. 5 -> "00000101".
    return "".join(format(byte, "08b") for byte in byte_array)


class Huffman:
    def __init__(self, uncompressed_file="myFile.txt", compressed_file="Byte_File",
                 binary_tree_file="binary_tree.txt", file_browser=True):
        self.uncompressed_file = uncompressed_file
        self.compressed_file = compressed_file
        self.chars = []
        self.all_letters = []
        self.numbers = []
        self.binary_tree = []
        self.binary_string = []
        self.binary_tree_location = binary_tree_file
        self.lookup_table = []
        self.file_browser = file_browser

    def add_letter(self, letter):
        self.all_letters.append(letter)
        if self.chars.count(letter) == 1:
            index_of_element = self.chars.index(letter)
            self.numbers[index_of_element] += 1
        else:
            self.chars.append(letter)
            self.numbers.append(1)

    def bubble_sort(self):
        swapped = True
        while swapped:
            swapped = False
            for i in range(0, len(self.numbers) - 1):
                if self.numbers[i] > self.numbers[i + 1]:
                    temp_number = self.numbers[i]
                    temp_char = self.chars[i]
                    self.numbers[i] = self.numbers[i + 1]
                    self.chars[i] = self.chars[i + 1]
                    self.numbers[i + 1] = temp_number
                    self.chars[i + 1] = temp_char
                    swapped = True

    def read_uncompressed_file(self):
        file = open(self.uncompressed_file, "r", encoding='utf8')
        for elements in file.read():
            self.add_letter(elements)
        file.close()

    def make_tree(self):
        temp_chars = self.chars[:]  # So it passes by value(Makes copy), and not by reference
        temp_num = self.numbers[:]
        while len(self.chars) > 2:
            total_number = self.numbers[0] + self.numbers[1]
            add_section = [self.chars[0], self.chars[1]]
            del self.chars[0:2]
            del self.numbers[0:2]
            self.chars.insert(0, add_section)
            self.numbers.insert(0, total_number)
            self.bubble_sort()

        self.binary_tree = self.chars
        self.chars = temp_chars
        self.numbers = temp_num

    def write_compressed_file(self):
        file = open(self.compressed_file, 'wb')

        byte_array = binary_splitter(self.binary_string, 8)
        file.write(bytes(byte_array))
        file.close()
        self.binary_string = ""

    def read_compressed_file(self):
        self.binary_string = []
        file = open(self.compressed_file, 'rb')
        byte_array = file.read()
        self.binary_string = get_binary_string(byte_array)
        file.close()

    def save_binary_tree(self):
        file = open(self.binary_tree_location, "w", encoding='utf8')
        file.write(str(self.binary_tree))
        file.close()

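    def read_binary_tree(self):
        import ast  # local import keeps this method self-contained
        file = open(self.binary_tree_location, "r", encoding='utf8')
        # literal_eval only parses Python literals, so it cannot run code stored in the file
        self.binary_tree = ast.literal_eval(file.read())
        file.close()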

    def write_uncompressed_file(self):
        file = open(self.uncompressed_file, "w", encoding='utf8')
        for element in self.all_letters:
            file.write(element)
        file.close()

    def create_binary_store_string(self):
        self.lookup_table = []
        self.find_children("", self.binary_tree)
        for element in self.all_letters:
            for lookup in self.lookup_table:
                if lookup[0] == element:
                    self.binary_string += lookup[1]
                    break
        self.binary_string = "".join(self.binary_string)

    def find_children(self, binary_location, children):
        # children[0] is reached with a "0" and children[1] with a "1".
        for digit, child in zip("01", children):
            if isinstance(child, str):
                # End node: record the character with its full bit path.
                self.lookup_table.append([child, binary_location + digit])
            else:
                # Branch: keep walking down the nested list.
                self.find_children(binary_location + digit, child)

    def read_stored_binary_string(self):
        self.all_letters = []
        temp_tree = self.binary_tree[:]
        for digit in self.binary_string:
            temp_tree = temp_tree[:][int(digit)]
            if type(temp_tree) == str:
                self.all_letters.append(temp_tree)
                temp_tree = self.binary_tree[:]

    def refresh(self):
        self.chars = []
        self.all_letters = []
        self.numbers = []
        self.binary_tree = []
        self.binary_string = []
        self.lookup_table = []

    def compress(self):
        self.refresh()
        self.select()
        self.read_uncompressed_file()
        self.bubble_sort()
        self.make_tree()
        self.save_binary_tree()
        self.create_binary_store_string()
        self.write_compressed_file()
        self.refresh()

    def select(self):
        if self.file_browser:
            print("Pick, uncompressed file")
            self.uncompressed_file = askopenfilename()
            print("Pick, compressed file")
            self.compressed_file = askopenfilename()

    def uncompress(self):
        self.refresh()
        self.select()
        self.read_compressed_file()
        self.read_binary_tree()
        self.read_stored_binary_string()
        self.write_uncompressed_file()
        self.refresh()

In [3]:
#My tests
import unittest

path = 'HuffmanCompression/'
class TestHuffman(unittest.TestCase):
    def test_add_letter_adds_new_letter_to_array_and_stores_one_as_number(self):
        comp = Huffman()
        comp.add_letter("m")
        self.assertEqual("m", comp.chars[0])
        self.assertEqual(1, comp.numbers[0])

    def test_add_letter_adds_to_count_of_letter_with_letter_already_in_array(self):
        comp = Huffman()
        comp.add_letter("M")
        comp.add_letter("M")
        comp.add_letter("M")
        self.assertEqual("M", comp.chars[0])
        self.assertEqual(3, comp.numbers[0])

    def test_add_letter_adds_each_letter_to_an_array(self):
        comp = Huffman()
        comp.add_letter("M")
        comp.add_letter("M")
        comp.add_letter("M")
        self.assertEqual(["M", "M", "M"], comp.all_letters)

    def test_bubble_sort_sorts_data_into_increasing_order_on_number_of_elements(self):
        comp = Huffman()
        comp.chars = ["d", "e", "J", "5", "u", "f", "0"]
        comp.numbers = [10, 8, 2, 2, 15, 12, 5]
        expected_chars = ["J", "5", "0", "e", "d", "f", "u"]
        expected_numbers = [2, 2, 5, 8, 10, 12, 15]
        comp.bubble_sort()
        self.assertEqual(expected_chars, comp.chars)
        self.assertEqual(expected_numbers, comp.numbers)

    def test_reading_text_file_and_adding_and_ordering_arrays(self):
        comp = Huffman(path + "Test.txt")
        comp.read_uncompressed_file()
        comp.bubble_sort()
        self.assertEqual(["H", "e", " ", "\n", "w", "r",
                          "d", "o", "l"], comp.chars)
        self.assertEqual([1, 1, 1, 1, 1, 1, 1, 2, 3], comp.numbers)

    def test_make_tree_creates_the_correct_array_to_store_the_tree_in(self):
        comp = Huffman()
        comp.chars = ["a", "b", "c", "d", "e"]
        comp.numbers = [1, 2, 3, 4, 5]
        comp.make_tree()
        self.assertEqual([[["a", "b"], "c"], ["d", "e"]], comp.binary_tree)

    def test_make_tree_does_not_modify_chars_and_numbers_once_finished(self):
        comp = Huffman()
        comp.chars = ["a", "b", "c", "d", "e"]
        comp.numbers = [1, 2, 3, 4, 5]
        comp.make_tree()
        self.assertEqual(["a", "b", "c", "d", "e"], comp.chars)
        self.assertEqual([1, 2, 3, 4, 5], comp.numbers)

    def test_create_binary_store_string_creates_correct_string_of_0_and_1(self):
        comp = Huffman()
        comp.all_letters = ["a", "b", "c", "d", "e"]
        comp.binary_tree = [[["a", "b"], "c"], ["d", "e"]]
        comp.create_binary_store_string()
        self.assertEqual("000001011011", comp.binary_string)

    def test_write_compressed_file_and_read_compressed_file_so_that_they_work_together(self):
        comp = Huffman()
        comp.compressed_file = path + "Testing_Byte_File"
        comp.binary_string = "000001011011"
        comp.write_compressed_file()
        comp.binary_string = ""
        comp.read_compressed_file()
        self.assertEqual("0000010110110000", comp.binary_string)

    def test_splitter_splits_string_into_array_of_elements_with_length_8(self):
        actual = binary_splitter("000001011011", 8)
        self.assertEqual([0b00000101, 0b10110000], actual)

    def test_get_binary_string_returns_the_correct_string_which_will_contain_0_or_1_for_given_byte_array(self):
        actual1 = get_binary_string([255, 15])
        actual2 = get_binary_string([5, 176])
        self.assertEqual("1111111100001111", actual1)
        self.assertEqual("0000010110110000", actual2)

    def test_save_binary_tree_and_read_binary_tree_so_they_work_together(self):
        comp = Huffman()
        comp.compressed_file = path + "Testing_Byte_File"
        comp.binary_tree_location = path + "binary_tree_test.txt"
        comp.binary_tree = [[["a", "b"], "c"], ["d", "e"]]
        comp.save_binary_tree()
        comp.binary_tree = []
        comp.read_binary_tree()
        self.assertEqual([[["a", "b"], "c"], ["d", "e"]], comp.binary_tree)

    def test_write_uncompressed_file_and_read_uncompressed_file_work_together(self):
        comp = Huffman(path + "Testing.txt")
        comp.all_letters = ["a", "b", "c", "d", "e"]
        comp.write_uncompressed_file()
        comp.all_letters = []
        comp.read_uncompressed_file()
        self.assertEqual(["a", "b", "c", "d", "e"], comp.all_letters)

    def test_find_children_creates_the_correct_lookup_table(self):
        comp = Huffman()
        comp.binary_tree = [[["a", "b"], "c"], ["d", "e"]]
        comp.find_children("", comp.binary_tree)
        self.assertEqual([["a", "000"], ["b", "001"], ['c', '01'], 
                          ['d', '10'], ['e', '11']], comp.lookup_table)

    def test_read_stored_string_returns_correct_string_when_reading_binary_string_and_binary_tree(self):
        comp = Huffman()
        comp.binary_tree = [[["a", "b"], "c"], ["d", "e"]]
        comp.binary_string = "000001011011"
        comp.read_stored_binary_string()
        self.assertEqual(["a", "b", "c", "d", "e"], comp.all_letters)
        

In [4]:
#Run tests
suite = unittest.TestLoader().loadTestsFromTestCase(TestHuffman)
unittest.TextTestRunner().run(suite)

...............
----------------------------------------------------------------------
Ran 15 tests in 0.023s

OK


<unittest.runner.TextTestResult run=15 errors=0 failures=0>

### Example: User Name

Here is a simple example of compressing someone's user name. The user name is 'Bell777'.
First we list the characters in order of frequency.

>\begin{align*}
&\text{Character},& &\text{'$7$'},& &\text{'l'},& &\text{'B'},& &\text{'e'},&\\
&\text{Number},& &3,& &2,& &1,& &1,& 
\end{align*}

This will form the binary tree.

In [5]:
# Sketch of the tree (left branch = 0, right branch = 1):
#
#   Start
#   /    \
# '7'     .
#        / \
#       /  'l'
#      .
#     / \
#   'B' 'e'
#
# This will be stored in the file as
expected_binary_tree = ['7', [['B', 'e'], 'l']]

In [6]:
#Test correct binary tree formed
my_Huffman = Huffman(uncompressed_file=path + "Example_User_Name.txt",
                     compressed_file=path + "Compressed_User_Name",
                     binary_tree_file=path + "bin_tree_User_Name.txt",
                     file_browser=False)

my_Huffman.compress()
my_Huffman.read_binary_tree()
assert my_Huffman.binary_tree == expected_binary_tree

From the binary tree we have
> 'B' = $100$, 'e' = $101$, 'l' = $11$, '$7$' = $0$

So 'Bell777' = $1001011111000$
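
The encoding above can be checked by hand with a small lookup table. This is only a sketch: the dictionary `codes` simply copies the four codes listed above, and `encode_sketch` is a hypothetical helper, not part of the `Huffman` class.

```python
# Codes read off the binary tree above (left = 0, right = 1).
codes = {"B": "100", "e": "101", "l": "11", "7": "0"}


def encode_sketch(text, codes):
    """Join the code of every character into one string of bits."""
    return "".join(codes[char] for char in text)


bits = encode_sketch("Bell777", codes)
print(bits)       # 1001011111000
print(len(bits))  # 13
```

Stored one byte per character, the same seven characters would take $7 \times 8 = 56$ bits, not counting the space needed for the tree.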

In [7]:
my_Huffman.read_compressed_file()
my_Huffman.binary_string

'100101111100000000000000'

In [8]:
uncompress_Huffman = Huffman(uncompressed_file=path + "Testsed_Example_User_Name.txt",
                             compressed_file=path + "Compressed_User_Name",
                             binary_tree_file=path + "bin_tree_User_Name.txt",
                             file_browser=False)

uncompress_Huffman.uncompress()
uncompress_Huffman.read_uncompressed_file()
print(uncompress_Huffman.all_letters)

['B', 'e', 'l', 'l', '7', '7', '7', '7', '7', '7', '7', '7', '7', '7', '7', '7', '7', '7']


As can be seen, this is not 'Bell777'. The encoded string is only $13$ bits long, but the file is written in whole bytes, so the end is filled with zeros; here $24$ bits are read back. Because '$7$' has the code $0$, each of the $11$ padding zeros is decoded as an extra '7'. This can be fixed by adding a stop character as an end node, which tells the decoder that the string has finished.

With bigger texts the effect is small and will not always occur, since the extra bits may not reach an end node. It may seem that repeating this process would make the file grow each time, but after one compression and uncompression the file stays a constant size: the binary string is then already a multiple of $8$ bits, provided the extra letters do not change the binary tree. The bigger the passage of text, the less likely the tree is to change. Even so, this should be fixed in the future.
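
One alternative to a stop character is to record how many bits are meaningful and discard the rest when reading. The sketch below shows that idea under stated assumptions: it is a different technique from the stop character suggested above, the helper names `pack_bits_sketch` and `unpack_bits_sketch` are hypothetical, and how the bit count would be stored on disk is left open.

```python
def pack_bits_sketch(bit_string):
    """Return (bit_count, payload), padding the bits to a whole number of bytes."""
    padding = (-len(bit_string)) % 8  # zeros needed to reach a multiple of 8
    padded = bit_string + "0" * padding
    payload = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return len(bit_string), payload


def unpack_bits_sketch(bit_count, payload):
    """Rebuild the bit string and drop the padding using the stored bit count."""
    bits = "".join(format(byte, "08b") for byte in payload)
    return bits[:bit_count]


bit_count, payload = pack_bits_sketch("1001011111000")
print(bit_count, list(payload))                # 13 [151, 192]
print(unpack_bits_sketch(bit_count, payload))  # 1001011111000
```

With the padding removed before decoding, the trailing zeros are never read as extra '7' characters.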

### The Lord of The Examples

Here we will test that the program works on large texts and look at its compression statistics. We will be using the book The Lord of the Rings: The Fellowship of the Ring.

In [9]:
book_path = "Books/"
huff_path = "HuffmanCompression/"

Lord_Huffman = Huffman(uncompressed_file=book_path + "The Lord Of The Rings The Fellowship of the Ring.txt",
                       compressed_file=huff_path + "Compressed_LOTR",
                       binary_tree_file=huff_path + "The_Lord_Of_The_Binary_Trees.txt",
                       file_browser=False)

Lord_Huffman.compress()

#Here is the binary tree in raw format
Lord_Huffman.read_binary_tree()
print(Lord_Huffman.binary_tree)

[[' ', [[[['.', ','], ['c', [['T', ['’', 'A']], 'v']]], 'h'], ['n', 'o']]], [[[[[['k', [[[['x', 'q'], 'R'], 'F'], ['S', ['E', ['C', 'Y']]]]], 'y'], [['\n', ['I', [[[['"', ['(', ')']], 'D'], ['O', [['ó', [['á', [[[['ы', 'н'], ['к', 'з']], '6'], [[[['у', 'ь'], ['_', 'г']], '*'], [['Z', 'м'], [['…', '='], '©']]]]], 'ú']], 'z']]], ['‘', ['L', 'N']]]]], 'm']], 'a'], ['t', ['l', ['f', 'g']]]], [[[[[["'", ['B', ['`', 'M']]], 'p'], 'w'], 'd'], [['u', [[[['W', 'G'], [';', [[[[['”', [['8', ['л', ['в', ['|', ['б', 'х']]]]], ['/', '7']]], 'K'], [['“', 'J'], 'U']], 'P'], ':']]], [['H', '!'], ['-', ['?', ['j', [[[[['â', ['п', ['о', 'с']]], 'ë'], ['3', ['5', [['и', 'т'], ['‚', 'р']]]]], [['9', '2'], ['é', 'Q']]], [[['4', 'û'], 'V'], [[[['е', 'ä'], 'í'], [['а', ['Ó', 'д']], '0']], '1']]]]]]]], 'b']], 's']], [['r', 'i'], 'e']]]]


In [10]:
# Now to uncompress the file
Un_Lord_Huffman = Huffman(uncompressed_file=book_path + "Comp and Uncomp LOTR.txt",
                          compressed_file=huff_path + "Compressed_LOTR",
                          binary_tree_file=huff_path + "The_Lord_Of_The_Binary_Trees.txt",
                          file_browser=False)
Un_Lord_Huffman.uncompress()

# View the first part of the file
Un_Lord_Huffman.read_uncompressed_file()
print(''.join(Un_Lord_Huffman.all_letters[0:10000]))

John Ronald Reuel Tolkien
The Lord Of The Rings:
The Fellowship of the Ring
(1954)
© J.R.R.Tolkien, 1954

E-Text: Greylib

Contents
Foreword

 

Prologue

    1. Concerning Hobbits

    2. Concerning Pipe-weed

    3. Of the Ordering of the Shire

    4. Of the Finding of the Ring

    5. Note on the Shire Records

Book I

    Chapter 1. A Long-expected Party

    Chapter 2. The Shadow of the Past

    Chapter 3. Three is Company

    Chapter 4. A Short Cut to Mushrooms

    Chapter 5. A Conspiracy Unmasked

    Chapter 6. The Old Forest

    Chapter 7. In the House of Tom Bombadil

    Chapter 8. Fog on the Barrow-Downs

    Chapter 9. At the Sign of The Prancing Pony

    Chapter 10. Strider

    Chapter 11. A Knife in the Dark

    Chapter 12. Flight to the Ford

Book II

    Chapter 1. Many Meetings

    Chapter 2. The Council of Elrond

    Chapter 3. The Ring Goes South

    Chapter 4. A Journey in the Dark

    Chapter 5. The Bridge of Khazad-dûm

    Chapter 6. Lothlórien

    

In [11]:
# Open the file in the Books folder to see the whole book

# Compare the file sizes in bytes
comp_file_size = os.stat(huff_path + "Compressed_LOTR").st_size
normal_file_size = os.stat(book_path + "The Lord Of The Rings The Fellowship of the Ring.txt").st_size
binary_tree_size = os.stat(huff_path + "The_Lord_Of_The_Binary_Trees.txt").st_size
print(" Compressed LOTR Byte Size: " + str(comp_file_size))
print("     Normal LOTR Byte Size: " + str(normal_file_size))
print("Binary Tree LOTR Byte Size: " + str(binary_tree_size))


comp_ratio = (comp_file_size+binary_tree_size)/normal_file_size
print("Compression Ratio: " + str(comp_ratio))

 Compressed LOTR Byte Size: 559895
     Normal LOTR Byte Size: 1013626
Binary Tree LOTR Byte Size: 867
Compression Ratio: 0.5532237728708617


### Conclusion of Huffman Compression

Clearly this method is not efficient for short texts, as the binary tree has to be stored alongside the compressed data. For longer texts, Huffman proved that this is an optimal way of assigning $0$'s and $1$'s when every character gets its own code made of a whole number of bits. So how do Word and zip files do better? This is a basic Huffman compression algorithm with only single characters as end nodes of the binary tree; if we worked with blocks bigger than one character, better results could be obtained.
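
One way to judge how close a character-by-character code gets to the best possible is to compare its average code length with the Shannon entropy of the character frequencies. The sketch below does this for the 'Bell777' example; it is an illustration only, the function name `entropy_bits_per_char` is an assumption, and the $13$ bits are taken from the user name example above.

```python
import math
from collections import Counter


def entropy_bits_per_char(text):
    """Shannon entropy of the character frequencies, in bits per character."""
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in Counter(text).values())


text = "Bell777"
huffman_bits_per_char = 13 / len(text)  # 13 bits for 7 characters, from the example
print(round(entropy_bits_per_char(text), 3))
print(round(huffman_bits_per_char, 3))
```

The Huffman average can never be lower than the entropy, and for this example the two values are close.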