#### Combinatorial Optimization 

1. Edit Distance
2. Huffman Codes

#### 1. Edit Distance


#### Implementation of the Edit Distance in Python

In [153]:
# We define the "cost" function, which takes five parameters.
# x and y are the strings we will compare.
# ins, dlt, sub are the costs associated with each of the operations (insert, delete, substitute).
# The Levenshtein distance corresponds to this function with ins, dlt, sub all set to 1.
# The function returns the table of subproblem costs together with the minimum cost needed
# to convert the string x into y, taking into account the cost of each operation.

def cost(x,y, ins, dlt, sub):
    
    # Prepend a placeholder character to each string so that index 0
    # represents the empty prefix and x[i], y[j] are 1-indexed.
    x = ' ' + x 
    y = ' ' + y
    
    # Dictionary storing the edit distance for every pair of prefixes:
    # d[i, j] is the minimum cost of turning the first i characters of x
    # into the first j characters of y. It plays the role of the DP matrix.
    d = {} 

    # The lengths include the placeholder, i.e. number of characters + 1
    Y = len(y) 
    X = len(x) 
    
    # Converting the first i characters of x into the empty string
    # requires i deletions, each at cost dlt
    for i in range(X): 
        d[i, 0] = i * dlt 
    
    # Building the first j characters of y from the empty string
    # requires j insertions, each at cost ins
    for j in range(Y):
        d[0, j] = j * ins 
    
    for j in range(1, Y):
        for i in range(1, X):
            if x[i] == y[j]: 
                # Matching characters need no edit operation,
                # so the cost is carried over from the diagonal neighbour
                d[i, j] = d[i-1, j-1] 
            else:
                # Otherwise take the cheapest of deleting x[i] (upper neighbour),
                # inserting y[j] (left neighbour) or substituting (diagonal neighbour)
                d[i, j] = min(d[i-1, j] + dlt, d[i, j-1] + ins, d[i-1, j-1] + sub) 
    
    # Return the table d and the minimum edit distance
    return d, d[X-1, Y-1] 
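
When only the distance is needed and not the path, the full table does not have to be kept. The following is a minimal sketch of the same recurrence that stores just two rows; the function name `edit_distance_two_rows` and its default costs are assumptions made for this illustration and are not part of the notebook.

```python
def edit_distance_two_rows(x, y, ins=1, dlt=1, sub=1):
    """Weighted edit distance from x to y, keeping only two rows of the table."""
    # prev[j]: cost of turning the first i-1 characters of x into y[:j]
    prev = [j * ins for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        # First column: delete the first i characters of x
        curr = [i * dlt]
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Matching characters: carry over the diagonal cost
                curr.append(prev[j - 1])
            else:
                curr.append(min(prev[j] + dlt,       # delete x[i-1]
                                curr[j - 1] + ins,   # insert y[j-1]
                                prev[j - 1] + sub))  # substitute
        prev = curr
    return prev[-1]


print(edit_distance_two_rows("kitten", "sitting"))           # 3
print(edit_distance_two_rows("kitten", "sitting", 2, 2, 1))  # 4
```

The trade-off is that the sequence of operations can no longer be reconstructed, since backtracking needs the whole table.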

#### Implementation of Backtracking in Python

In [154]:
# get_path reconstructs, from the bottom-right corner of the table back to the top-left,
# one sequence of operations that transforms string x into y.
# d is the dictionary of minimum edit distances for all prefix pairs, as returned by cost.
# The neighbour comparisons use the raw table entries without adding operation costs,
# which fits the unit-cost tables used in the examples.
    
def get_path(d, x, y): 
    
    # Start at the bottom-right entry of the table
    i = len(x)
    j = len(y)
    
    # steps collects the operations in reverse order (bottom-right to top-left)
    steps = "" 
    
    # Walk back until the top row or the left column is reached
    while i != 0 and j != 0: 
        
        # The diagonal neighbour is the smallest of the three neighbours: move diagonally
        if d[i-1, j-1] == min(d[i-1, j], d[i, j-1], d[i-1, j-1]): 
            
            if d[i-1, j-1] == d[i, j]: 
                # Same cost as the diagonal neighbour: the characters match, no operation
                steps += ' null' 
            else:
                # The cost increased along the diagonal: record a substitution
                steps += ' subs'
            
            # Move both pointers, equivalent to a diagonal step
            i -= 1 
            j -= 1 
        
        # The left neighbour d[i, j-1] is the smallest: record an insertion
        elif d[i, j-1] == min(d[i-1, j], d[i, j-1], d[i-1, j-1]):
            steps += " ins" 
            j -= 1
        else:
            # The upper neighbour d[i-1, j] is the smallest: record a deletion
            steps += " del"
            i -= 1
    
    # Left border reached (j == 0): the remaining i characters of x must be deleted
    if i != 0 and j == 0:
        steps += " del" * i 
    
    # Top border reached (i == 0): the remaining j characters of y must be inserted
    elif i == 0 and j != 0:
        steps += " ins" * j 
    
    # Split the string into a list and reverse it, so that the operations
    # read from the start of the strings to their end
    step_list = steps.split()       
    step_list.reverse() 
    
    # Return the list of operations that converts x into y   
    return step_list 
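
A returned operation list can be checked by replaying it on `x`. The sketch below is one way to do that; `apply_ops` is a hypothetical helper introduced here, and it assumes the labels `null`, `subs`, `del` and `ins` used by `get_path`.

```python
def apply_ops(x, y, ops):
    """Replay a list of edit operations on x; y supplies inserted/substituted characters."""
    out = []
    i = j = 0  # i: next unread character of x, j: next character of y to produce
    for op in ops:
        if op == 'null':                 # characters match, copy from x
            out.append(x[i])
            i += 1
            j += 1
        elif op == 'subs':               # replace x[i] by y[j]
            out.append(y[j])
            i += 1
            j += 1
        elif op in ('ins', 'insert'):    # accept either spelling of the insertion label
            out.append(y[j])
            j += 1
        elif op == 'del':                # skip x[i]
            i += 1
        else:
            raise ValueError(f"unknown operation: {op}")
    return ''.join(out)


print(apply_ops("abc", "bcd", ['del', 'null', 'null', 'ins']))  # bcd
```

If the replayed result equals `y`, the operation list is a valid transformation; whether it is also optimal has to be checked against the computed distance.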

#### Example 1: DNA

In [155]:
x ="ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA"
y = "TACTAGCTTACTTACCCATCAGGTTTTAGAGATGGCAACCA"

# We first run the algorithm with equal costs for each operation
d, distance = cost(x,y,1,1,1)
# We print the minimum edit distance 
distance 

10

The edit distance with unit costs for the two DNA strings is 10. One optimal sequence of operations converting X into Y is the following:

In [156]:
# We print one of the optimal sequences of operations transforming X into Y.
get_path(d,x,y)

['del',
 'del',
 'null',
 'null',
 'null',
 'null',
 'null',
 'null',
 'subs',
 'null',
 'null',
 'null',
 'null',
 'null',
 'null',
 'null',
 'null',
 'subs',
 'subs',
 'null',
 'null',
 'null',
 'null',
 'null',
 'null',
 'null',
 'del',
 'del',
 'null',
 'null',
 'null',
 'null',
 'null',
 'null',
 'null',
 'del',
 'subs',
 'null',
 'subs',
 'null',
 'null',
 'null',
 'null',
 'null',
 'null',
 'null']

We now calculate the minimum edit distance with a cost of 2 for insertion and deletion and a cost of 1 for substitution.

In [157]:
# We calculate the minimum edit distance with insertion and deletion cost = 2 and substitution cost = 1
d, distance = cost(x,y,2,2,1) 

# Print the distance
distance

15

The edit distance is now 15, due to the higher cost of deletion and insertion.

#### Example 2: Proteins

In [158]:
x ="ASRPRSGVPAQSDSDPCQNLAATPIPSRPPSSQSCQKCRADARQGRWGP"
y = "SGAPGQRGEPGPQGHAGAPGPPGPPGSDG"

#We first run the algorithm with equal costs for each operation
d, distance = cost(x,y,1,1,1) 

#We print the minimum edit distance 
distance 

36

The edit distance between the two protein strings is 36.

In [159]:
# We print one of the optimal sequences of operations transforming X into Y.
get_path(d,x,y)

['del',
 'null',
 'del',
 'del',
 'del',
 'del',
 'null',
 'subs',
 'null',
 'subs',
 'null',
 'subs',
 'subs',
 'subs',
 'subs',
 'subs',
 'subs',
 'null',
 'subs',
 'subs',
 'null',
 'subs',
 'subs',
 'null',
 'subs',
 'null',
 'subs',
 'subs',
 'null',
 'null',
 'subs',
 'null',
 'del',
 'del',
 'del',
 'del',
 'del',
 'del',
 'del',
 'del',
 'null',
 'del',
 'del',
 'del',
 'null',
 'del',
 'del',
 'del',
 'del']

We now calculate the minimum edit distance with a cost of 2 for insertion and deletion and a cost of 1 for substitution in the protein case.

In [160]:
#We calculate the minimum edit distance with cost of insertion and deletion = 2 and substitution=1
d, distance = cost(x,y,2,2,1) 
distance

56

The distance is now 56, which is significantly larger.

### 2. Huffman Codes

#### The data compression problem
At the heart of every computer lies the representation of symbols using binary digits (bits). In computers, letters of the alphabet are not stored as such, but are represented by a combination of zeros and ones. 

On the one hand, this encoding requires a unified standard, so that machines can communicate with each other. That is why standardized encodings such as ASCII or UTF are of great importance. In a fixed-length encoding, every symbol is represented by the same number of bits. ASCII, for example, uses 7 bits per symbol, which allows the representation of $2^7 = 128$ different symbols. Such a standard ensures that files can be exchanged between different machines without loss of information. On the other hand, encoding every symbol with 7 bits is quite wasteful from a storage perspective (Kleinberg & Tardos, 2008). Encoding the string "hello world" using 7 bits for each symbol (including the space) results in a total of 77 bits. This seems rather large, given that the original string has only 11 symbols. It would be better to encode every symbol with the smallest number of bits possible. A solution to this problem is variable-length encoding (Kleinberg & Tardos, 2008). The basic idea of variable-length encoding is that symbols which occur often in a given document are encoded with fewer bits than symbols which occur rarely. 
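
The arithmetic behind the 77 bits can be reproduced in a few lines. The helper `fixed_length_bits` below is a hypothetical name used only for this illustration; when no width is given, it assumes the smallest fixed width that can still distinguish all distinct symbols of the text.

```python
def fixed_length_bits(text, bits_per_symbol=None):
    """Total number of bits needed to store text with a fixed-length code."""
    if bits_per_symbol is None:
        # Smallest width that can distinguish all distinct symbols of the text
        distinct = len(set(text))
        bits_per_symbol = max(1, (distinct - 1).bit_length())
    return len(text) * bits_per_symbol


print(fixed_length_bits("hello world", 7))  # 7-bit ASCII: 11 * 7 = 77
print(fixed_length_bits("hello world"))     # 8 distinct symbols, 3 bits each: 33
```

Even the tightest fixed-length code spends the same number of bits on rare and frequent symbols, which is the inefficiency that variable-length encoding addresses.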

The question is, however, how to ensure that an encoded string can be decoded unambiguously. Let's say we've encoded "a" as "1", "b" as "0" and "c" as "01". The string "abc" would then be encoded as "1001". However, an algorithm which tries to decode this string traverses the zeros and ones and returns a letter as soon as it finds a match. In the first step, the algorithm finds the "1" and returns "a". In the second step, it finds the "0" and returns "b"; so far so good. In the third step, however, the algorithm again stops at the "0" and returns another "b", and the final "1" is decoded as "a". The decoded string is then "abba" and not "abc". 
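
The failure can be reproduced with a small first-match decoder. This is only an illustrative sketch; `first_match_decode` is a name made up for this example.

```python
def first_match_decode(bits, code):
    """Decode by emitting a symbol as soon as the buffered bits match a codeword."""
    inverse = {codeword: symbol for symbol, codeword in code.items()}
    decoded, buffer = '', ''
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            decoded += inverse[buffer]
            buffer = ''
    return decoded


# The ambiguous code from the text: the codeword of "b" is a prefix of the codeword of "c"
code = {'a': '1', 'b': '0', 'c': '01'}
encoded = ''.join(code[s] for s in 'abc')

print(encoded)                            # 1001
print(first_match_decode(encoded, code))  # abba
```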

A way to deal with this is the use of a so-called prefix code: we have to make sure that no codeword is a prefix of another codeword. It is no problem, however, if a codeword is a substring of another codeword, as long as it's not a prefix (Kleinberg & Tardos, 2008).
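
Whether a given code has this property can be tested mechanically. The sketch below uses the hypothetical helper `is_prefix_free`; it relies on the fact that, in lexicographic order, a codeword that is a prefix of any other codeword is also a prefix of its immediate successor.

```python
def is_prefix_free(code):
    """Return True if no codeword in the dictionary is a prefix of another one."""
    words = sorted(code.values())
    # After sorting, it is enough to compare each codeword with its successor
    return not any(nxt.startswith(cur) for cur, nxt in zip(words, words[1:]))


print(is_prefix_free({'a': '1', 'b': '0', 'c': '01'}))   # False: "0" is a prefix of "01"
print(is_prefix_free({'a': '0', 'b': '10', 'c': '11'}))  # True
```

Note that the second code still contains "0" as a substring of "10", which is harmless because it is not a prefix.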

So the problem at hand is twofold: how do we get a code that can be used to encode and decode a given set of symbols unambiguously, and how do we ensure that the encoding is of minimal length?

The encoding length is defined as follows (Kleinberg & Tardos, 2008).
Given a text $T$, each symbol $s \in T$ occurs with a relative frequency $f_s$. The encoding of $s$ is $c(s)$, where $c(s)$ is a binary string. The average number of bits per symbol required to encode the text is then:

$$\sum_{s \in T} f_s \, \lvert c(s) \rvert $$ 

where $\lvert c(s) \rvert$ denotes the length of the encoding $c(s)$ in bits and the sum runs over the distinct symbols of $T$.
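
As a small worked example of this formula, the sketch below compares a fixed-length and a variable-length code for a three-symbol alphabet. The frequencies and codes are made up for illustration, and `average_bit_length` is a hypothetical helper name.

```python
def average_bit_length(freqs, code):
    """Sum of frequency times codeword length over all symbols."""
    return sum(freqs[s] * len(code[s]) for s in freqs)


freqs = {'a': 0.5, 'b': 0.25, 'c': 0.25}
fixed = {'a': '00', 'b': '01', 'c': '10'}
variable = {'a': '0', 'b': '10', 'c': '11'}

print(average_bit_length(freqs, fixed))     # 2.0
print(average_bit_length(freqs, variable))  # 1.5
```

Giving the most frequent symbol the shortest codeword lowers the average from 2 bits to 1.5 bits per symbol.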

#### 2.1.1 The Huffman Code 

Huffman (1952) proposes a greedy algorithm as a solution to this problem. The idea behind Huffman's encoding can be represented using a binary tree (Kleinberg & Tardos, 2008). In the binary trees considered here, each node that is not a leaf has two children. We can imagine a tree in which each leaf is associated with one symbol of the alphabet that we want to encode. From such a binary tree, a prefix code arises naturally (Kleinberg & Tardos, 2008). For a given symbol, one follows the path from the root to its leaf and writes down a 0 for every step to a left child and a 1 for every step to a right child. The resulting bit string is the codeword of the symbol (Kleinberg & Tardos, 2008). No codeword is a prefix of another codeword, since all symbols sit in the leaves of the tree and the path to one leaf never passes through another leaf. Consequently, finding a binary tree for a given alphabet is equivalent to finding a prefix code for the alphabet.

Further, the tree structure also determines the length of the encoding. The length of the path from the root to the leaf of a given symbol $x$ equals the length of its encoding in bits, that is, the depth of the leaf in the tree. Huffman (1952) presents an algorithm to find the optimal prefix code.
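
The correspondence between root-to-leaf paths and codewords can be illustrated on a tiny tree. In this sketch a tree is assumed to be either a single-character symbol (a leaf) or a pair `(left, right)`; this representation and the name `codes_from_tree` are chosen for the example only.

```python
def codes_from_tree(tree, prefix=''):
    """Read off the codeword of every leaf: 0 for a left step, 1 for a right step."""
    if isinstance(tree, str):
        # A leaf: the path taken so far is the codeword of this symbol
        return {tree: prefix}
    left, right = tree
    codes = codes_from_tree(left, prefix + '0')
    codes.update(codes_from_tree(right, prefix + '1'))
    return codes


# "a" and "b" sit at depth 2, "c" at depth 1
tree = (('a', 'b'), 'c')
print(codes_from_tree(tree))  # {'a': '00', 'b': '01', 'c': '1'}
```

The length of each codeword equals the depth of its leaf, so shallow leaves should hold frequent symbols.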

Given an alphabet $A$, take the two lowest-frequency symbols $x^*$ and $y^*$ and construct a "meta-symbol" out of them whose frequency is the sum of the frequencies of $x^*$ and $y^*$. This meta-symbol replaces $x^*$ and $y^*$ in the alphabet, which is thereby reduced by one symbol. Recursively repeat this merging until the alphabet has been reduced to two symbols. Once this state has been reached, label one symbol with 1 and the other with 0. Now "unpack" the meta-symbols in reverse order: each time, the two merged symbols receive the codeword of their meta-symbol, extended by a 0 for one and a 1 for the other. 

The resulting binary tree yields an optimal prefix code. What is the greedy part of this algorithm? At every step, the algorithm only considers the two lowest-frequency symbols of the current alphabet. At the moment of merging, it is not yet clear where in the tree these two symbols will end up. Nevertheless, the algorithm results in an optimal binary tree and prefix code, as we discuss now.

#### 2.1.2 Proof of Correctness

Our proof of correctness follows the exposition by Kleinberg and Tardos (2008).
The algorithm is invoked recursively on smaller and smaller alphabets until it reaches an alphabet with two symbols. This base case is trivial, since an optimal encoding represents one symbol with 1 and the other with 0. Now suppose by induction that the algorithm produces an optimal code for all alphabets of size $k-1$. We have to prove that it then also produces an optimal code for alphabets of size $k$ (Kleinberg & Tardos, 2008).

The algorithm merges the two lowest-frequency symbols $y^*, z^* \in A$ into a meta-symbol $\omega$ and is invoked recursively on the smaller alphabet $A'$. By induction, it returns an optimal prefix code for $A'$, represented by a binary tree $T'$. It then extends $T'$ to $T$ by unpacking $\omega$, that is, by adding the symbols $y^*$ and $z^*$ as children of $\omega$.

How does the average bit length change from $T'$ to $T$? The depth of every symbol other than $y^*$ and $z^*$ is the same in $T$ as in $T'$. The depth of $y^*$ and $z^*$ in $T$ is one larger than the depth of $\omega$ in $T'$. Using $f_{\omega}= f_{y^*} + f_{z^*}$, we can see that the average bit length of $T$, $ABL(T)$, is equal to the average bit length of $T'$ plus the frequency of $\omega$ (Kleinberg & Tardos, 2008):

$$ ABL(T) = f_{\omega} + ABL(T') $$
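
The step from the depths to this identity can be written out term by term. The following derivation is a sketch in the notation of this section; $\mathrm{depth}_T(x)$ is a symbol introduced here for illustration and denotes the depth of the leaf of $x$ in the tree $T$.

```latex
\begin{aligned}
ABL(T) &= \sum_{x \in A} f_x \,\mathrm{depth}_T(x) \\
       &= f_{y^*}\,\mathrm{depth}_T(y^*) + f_{z^*}\,\mathrm{depth}_T(z^*)
          + \sum_{x \neq y^*, z^*} f_x \,\mathrm{depth}_T(x) \\
       &= (f_{y^*} + f_{z^*})\,\bigl(1 + \mathrm{depth}_{T'}(\omega)\bigr)
          + \sum_{x \neq y^*, z^*} f_x \,\mathrm{depth}_{T'}(x) \\
       &= f_{\omega} + f_{\omega}\,\mathrm{depth}_{T'}(\omega)
          + \sum_{x \neq y^*, z^*} f_x \,\mathrm{depth}_{T'}(x) \\
       &= f_{\omega} + ABL(T')
\end{aligned}
```

The last step uses that the symbols of $A'$ are exactly $\omega$ together with all symbols of $A$ other than $y^*$ and $z^*$.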

We can now prove by contradiction that the Huffman code for a given alphabet achieves the minimum average number of bits per symbol. Assume that the tree $T$ produced by Huffman's algorithm is not optimal. Then there is an optimal tree $Z$ with $ABL(Z) < ABL(T)$, and $Z$ can be chosen such that the leaves $y^*$ and $z^*$ are siblings (Kleinberg & Tardos, 2008). Deleting the leaves $y^*$ and $z^*$ from $Z$ and labelling their former parent with the meta-symbol $\omega$ yields a tree $Z'$ that defines a prefix code for the smaller alphabet $A'$. By the same argument as above, $ABL(Z) = ABL(Z') + f_{\omega}$. Hence $ABL(T') = ABL(T) - f_{\omega} > ABL(Z) - f_{\omega} = ABL(Z')$, which would mean that $T'$ is not optimal for $A'$. This is a contradiction, since by induction $T'$ is optimal.

#### 2.1.3 Complexity

In our implementation, the complexity is $O(k^2)$, where $k$ is the number of symbols in the alphabet. At every merging step, the algorithm works on a smaller alphabet, and at every step the two symbols with the lowest frequencies have to be identified. Finding the symbol with the lowest frequency takes time $O(k)$, since the whole alphabet has to be traversed. In total, there are $k-1$ merges, so the total time is $O(k^2)$ (Kleinberg & Tardos, 2008). 
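
The linear scan for the minimum can be avoided with a priority queue, which brings the tree construction down to $O(k \log k)$. The following sketch uses Python's `heapq` module and is an alternative to the approach described above, not the implementation used in this notebook; the function names and the nested-tuple tree representation are assumptions, and the alphabet is assumed to be non-empty.

```python
import heapq
from itertools import count


def huffman_tree_heap(freqs):
    """Build a Huffman tree as nested (left, right) tuples using a binary heap."""
    tie = count()  # tie-breaker so that equal frequencies never compare tree nodes
    heap = [(f, next(tie), symbol) for symbol, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Each pop and push costs O(log k) instead of an O(k) scan
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    return heap[0][2]


def code_lengths(tree, depth=0):
    """Codeword length of every symbol, i.e. the depth of its leaf."""
    if not isinstance(tree, tuple):
        return {tree: depth}
    lengths = code_lengths(tree[0], depth + 1)
    lengths.update(code_lengths(tree[1], depth + 1))
    return lengths


freqs = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
print(code_lengths(huffman_tree_heap(freqs)))  # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
```

For the small alphabets of natural-language texts the difference is negligible, so the simpler quadratic version is a reasonable choice here.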

However, it is important to note that in our implementation the frequencies of the symbols also have to be computed. This can be done in time linear in the length of the text, since every symbol in the text has to be counted once. One could imagine a case in which the original text consists of only a few distinct letters, as in the case of DNA, but is very long. Given a DNA string with three distinct letters but length 1000, counting the frequencies would take longer than building the binary tree, which has only three leaves. 
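
A single pass over the text is enough to obtain all frequencies. The sketch below uses `collections.Counter` from the standard library; the function name `symbol_frequencies` is hypothetical, and the text is assumed to be non-empty.

```python
from collections import Counter


def symbol_frequencies(text):
    """Relative frequency of every symbol, computed in one pass over the text."""
    counts = Counter(text)  # counts every symbol exactly once per occurrence
    total = len(text)
    return {symbol: n / total for symbol, n in counts.items()}


print(symbol_frequencies("AACG"))  # {'A': 0.5, 'C': 0.25, 'G': 0.25}
```

Calling `text.count` once per distinct symbol, by contrast, scans the whole text for each symbol and therefore takes time proportional to the text length times the alphabet size.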

#### 2.1.4 Implementation of the Huffman Code

#### Helper Functions for the Huffman Code Function
First, we define helper functions which will be used later in the final function.

In [100]:
# Helper function for getting the alphabet of the current text. 
# The alphabet maps every symbol in the current text to its relative frequency.
def get_alphabet(text):
    
    # Initialize an empty dictionary in which symbol : frequency pairs are stored
    alphabet = {}
    
    # For every unique symbol in the text (including punctuation and spaces),
    # add a symbol : count(symbol) / len(text) pair to the dictionary.
    for letter in set(text):
        alphabet[letter] = text.count(letter) / len(text)
    
    # Return the alphabet
    return alphabet

# Helper function for assigning the actual prefix code to a given set of nodes in a tree.
# It receives the nodes in the form of a dictionary, the symbol of the node from which
# to start (initially the root), the dictionary in which the codes are collected and a prefix string. 
# The default value of the prefix is the empty string. 
def assign_code(nodes, symbol, code, prefix = ''):
    
    # Get the children of the node
    children = nodes[symbol]
    
    # Set up an empty tree which will be built up through the recursive calls. 
    tree = {}
    
    # If the returned children are two, then we are at a meta symbol (a node).
    # In case we encounter a meta symbol add two entries to the tree 0 and 1.
    # Recursively call the assign_code function, but for the first child add
    # a 0 to the prefix string and 1 for the second.
    # This ensures that the optimal prefix code is handed down until a leaf is reached
    if len(children) == 2:
        
        tree['0'] = assign_code(nodes, children[0], code, prefix +'0')
        tree['1'] = assign_code(nodes, children[1], code, prefix +'1')
        return tree
    else:
        # Once the recursive calls have reached a leaf, save the accumulated 
        # prefix string in the code dictionary.
        code[symbol] = prefix
        return symbol

#### Function for getting the optimal prefix using Huffman Codes

The following function takes a text as an input and returns the optimal prefix code as defined by Huffman (1952).

In [103]:
# Function for getting the optimal prefix of a given text
def huffman_code(text):
    
    # Get the alphabet of the input text. The frequencies of the symbols will be needed to 
    # form the prefixes.
    alphabet = get_alphabet(text.lower())
    
    # Set up empty dictionary. Each entry in the dictionary will represent a node in the
    # Optimal prefix tree.
    nodes =  {}
    
    # Initialize the tree, assigning all symbols of the alphabet to a leaf in the tree.
    for symbol in alphabet.keys():
        nodes[symbol] = []
    
    
    # Now comes the step of generating "meta-symbols".
    # As long as the alphabet contains more than one symbol, take the two symbols "x" and "y" with
    # the lowest frequencies f_x, f_y and merge them into a meta-symbol "xy" with the frequency f_x + f_y.
    # Remove the symbols x and y from the alphabet and replace them with their meta-symbol. Further, add
    # the meta-symbol to the set of nodes. These meta-symbols are the internal nodes of the final tree. 
    # The while loop finishes when there is only one meta-symbol left, which has frequency one. This 
    # will be the root of the tree.
    
    # Repeat until alphabet has been shrunk to length 1
    while len(alphabet) > 1:
        
        # Return the two letters / meta-letters x and y with the lowest frequency of the current 
        # instance of the alphabet. This will be merged into a new meta letter and added to the set of nodes
       
        # Get the symbol with the lowest frequency, this will be merged to a node with the second lowest.
        x  = min(alphabet, key=alphabet.get)
        
        # Get the corresponding frequency and remove the symbol from the dictionary
        freq = alphabet.pop(x)
        
        # Get the symbol with the lowest frequency from the NEW dictionary , which was the one
        # with the second lowest frequency before
        y = min(alphabet, key = alphabet.get)
        
        # Get the frequency of the second lowest, add it to the frequency of the lowest and remove it from 
        # alphabet 
        freq += alphabet.pop(y)
    
        # Add the new "meta" - letter "xy" to the alphabet with frequency f_x + f_y
        alphabet[x+y] = freq
    
        # Add the new meta letter "xy" to the tree. The meta letter has key "xy" and a list 
        # of the symbols x and y as a value.
        nodes[x+y] = [x, y]
        
    # Once the nodes have been built, the codes can be assigned by traversing the tree, starting
    # from the root. The root is the only symbol left in the alphabet, i.e. the last meta-symbol
    # added in the while loop (or the single symbol of a one-symbol text, for which the loop never runs).
    root = next(iter(alphabet))
    
    # Empty dictionary which will store the prefix codes.
    code = {}
    
    # Call the helper function that returns the complete tree.
    # The code dictionary is filled as a side effect of the call to assign_code.
    tree = assign_code(nodes, root, code, prefix='')
    
    # A one-symbol text would otherwise get the empty codeword; give it one bit instead.
    if len(code) == 1:
        code[root] = '0'
    
    # Return the code dictionary 
    return code

#### 2.1.5 Discussion and output of the algorithm for the given data

Here we show the output of the algorithm for the two texts. (Note that we assume that the text is one line. In case of line breaks this would be encoded as "\n".)


In [134]:
# String of the Shakespeare text
hamlet ='O all you host of heaven! O earth! What else? And shall I couple '\
             'hell? Oh, fie! Hold, hold, my heart, And you, my sinews, grow not instant '\
             'old, But bear me stiffly up. Remember thee! Ay, thou poor ghost, whiles '\
             'memory holds a seat In this distracted globe. Remember thee! Yea, from '\
             'the table of my memory I’ll wipe away all trivial fond records, All saws '\
             'of books, all forms, all pressures past That youth and observation copied '\
             'there, And thy commandment all alone shall live Within the book and volume '\
             'of my brain, Unmixed with baser matter. Yes, by heaven! O most pernicious '\
             'woman! O villain, villain, smiling, damned villain! My tables! Meet it is '\
             'I set it down That one may smile, and smile, and be a villain. At least '\
            'I’m sure it may be so in Denmark. So, uncle, there you are. Now to my word.'

# String of the Goethe text
faust = "Habe nun, ach! Philosophie, Juristerei und Medizin, Und leider auch Theologie" \
        "Durchaus studiert, mit heissem Bemühn. Da steh ich nun, ich armer Tor! Und bin so klug als wie" \
        "zuvor; Heisse Magister, heisse Doktor gar Und ziehe schon an die zehen Jahr Herauf, herab und" \
        "quer und krumm Meine Schüler an der Nase herum Und sehe, dass wir nichts wissen können! Das"\
        "will mir schier das Herz verbrennen. Zwar bin ich gescheiter als all die Laffen, Doktoren, Magister,"\
        "Schreiber und Pfaffen; Mich plagen keine Skrupel noch Zweifel, Fürchte mich weder vor Hölle noch"\
        "Teufel Dafür ist mir auch alle Freud entrissen, Bilde mir nicht ein, was Rechts zu wissen, Bilde mir" \
        "nicht ein, ich könnte was lehren, Die Menschen zu bessern und zu bekehren. Auch hab ich weder" \
        "Gut noch Geld, Noch Ehr und Herrlichkeit der Welt; Es möchte kein Hund so länger leben! Drum "\
        "hab ich mich der Magie ergeben, Ob mir durch Geistes Kraft und Mund Nicht manch Geheimnis" \
        "würde kund; Dass ich nicht mehr mit saurem Schweiss Zu sagen brauche, was ich nicht weiss; Dass" \
        "ich erkenne, was die Welt Im Innersten zusammenhält, Schau alle Wirkenskraft und Samen, Und"\
        'tu nicht mehr in Worten kramen.'

# Get the optimal prefix code for Hamlet using the huffman_code function
hamlet_prefix = huffman_code(hamlet)

# Get the optimal prefix code for Faust using the huffman_code function
faust_prefix = huffman_code(faust)

In [135]:
# Displaying as data frames
import pandas as pd

hamlet_pd = pd.DataFrame.from_dict(hamlet_prefix,orient = 'index')
hamlet_pd.columns = ['Prefix Code Hamlet']

faust_pd = pd.DataFrame.from_dict(faust_prefix, orient='index')
faust_pd.columns = ['Prefix Code Faust']

display(faust_pd)

| Symbol | Prefix Code Faust |
|---|---|
| u | 0 |
| a | 1 |
| z | 1000 |
| f | 1001 |
| g | 1010 |
| q | 1011000 |
| ä | 1011001 |
| ! | 101101 |
| ö | 101110 |
| . | 101111 |


In [136]:
display(hamlet_pd)

| Symbol | Prefix Code Hamlet |
|---|---|
| (space) | 0 |
| f | 10000 |
| v | 10001 |
| y | 1001 |
| i | 101 |
| t | 110 |
| d | 1110 |
| , | 1111 |
| l | 1000 |
| o | 1001 |


#### 2.2 Algorithm for decoding a text T, given a prefix.

Having shown how to reach an optimal prefix encoding, the question is now how to decode an encoded text, given a prefix code. 

#### 2.2.1 Decoding Algorithm

The proposed algorithm traverses the encoded input string, which consists only of the characters 0 and 1. At the beginning, it reads the first bit and looks it up in the given dictionary. If this bit alone encodes a symbol, as defined in the dictionary, the symbol is appended to an output string and the pointer moves one step further.
If the first bit doesn't encode a symbol, the next bit is appended to it, resulting in a longer bit string, and the lookup in the dictionary is repeated. This process continues until the bit string encodes a symbol. 

Having found a codeword and appended the corresponding symbol to the output string, the algorithm continues at the first bit that hasn't been read yet. It finishes once it has reached the end of the encoded string. The algorithm is in a sense greedy, since it always returns the first, and therefore shortest, match it encounters: it starts from a single bit and only takes the next available bit if it doesn't find a matching key in the dictionary.

#### 2.2.2 Proof of Correctness

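It is easy to see that the algorithm terminates, since it traverses the encoded string from left to right and stops once it has read the whole string. It remains to show that the decoding actually yields the original text. This follows from the prefix property. Let $s_1$ be the first symbol of the original text, so that the encoded string starts with $c(s_1)$. While the algorithm reads the first $\mid c(s_1) \mid$ bits, every shorter bit string it has built is a proper prefix of $c(s_1)$ and therefore, by the prefix property, not the codeword of any symbol. The first match in the dictionary is thus $c(s_1)$ itself, and $s_1$ is returned. The algorithm then restarts on the remaining bits, which are the encoding of the text without its first symbol, so the claim follows by induction on the length of the text.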

#### 2.2.3 Complexity

The complexity of the decoding algorithm is linear in the number of bits $n$ of the encoded string. The algorithm visits each bit exactly once and then checks whether the current bit string is a key in the dictionary. A dictionary lookup takes expected time $O(1)$ in Python if the length of the codewords is treated as a constant. So the total complexity is $O(n)$. 
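
The one-step-per-bit behaviour becomes explicit if the decoder walks the code tree instead of looking up growing bit strings. The following is a sketch of that alternative, not the decoding function used in this notebook; `build_trie` and `decode_with_trie` are hypothetical names, and the code is assumed to be prefix-free with single-character symbols and non-empty codewords.

```python
def build_trie(code):
    """Turn a symbol -> codeword dictionary into a nested dictionary keyed by '0'/'1'."""
    root = {}
    for symbol, bits in code.items():
        node = root
        for b in bits[:-1]:
            node = node.setdefault(b, {})
        node[bits[-1]] = symbol  # the last bit leads to the leaf holding the symbol
    return root


def decode_with_trie(code, encoded):
    """Decode by following one tree edge per bit and restarting at the root after each leaf."""
    root = build_trie(code)
    node = root
    out = []
    for b in encoded:
        node = node[b]
        if isinstance(node, str):  # reached a leaf: emit its symbol
            out.append(node)
            node = root
    if node is not root:
        raise ValueError("encoded string ends in the middle of a codeword")
    return ''.join(out)


code = {'a': '0', 'b': '10', 'c': '11'}
print(decode_with_trie(code, '01011'))  # abc
```

Every bit triggers exactly one dictionary access on a single-character key, so the running time is linear in $n$ regardless of the codeword lengths.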

#### 2.2.4 Implementation in Python

The following code shows the implementation of the decoding algorithm in Python 3.

In [None]:
# Function for decoding a given text using a prefix map in the form of a dictionary.
def decode(prefix, encoded):
    
    # Invert the code map so that codewords become keys and symbols become values.
    # The comprehension builds a new dictionary and leaves the original map untouched.
    prefix_symbol = {codeword: symbol for symbol, codeword in prefix.items()}
    
    # Set up an empty string to which the decoded text will be appended
    decoded = ''
    
    # Index of the next unread bit
    i = 0
    
    # While the end of the encoded string hasn't been reached:
    while i < len(encoded):
        
        # Start a new candidate codeword with the current bit
        bits = encoded[i]
        
        # As long as the candidate is not a key in the inverted map
        while bits not in prefix_symbol:
            
            # Increase the counter by one
            i += 1
            
            # Stop with a clear error if the input ends in the middle of a codeword
            if i == len(encoded):
                raise ValueError('encoded string ends with an incomplete codeword')
            
            # Concatenate the next bit to the candidate codeword
            bits += encoded[i]
    
        # Once the candidate is a key in the map, append the symbol 
        # corresponding to that key to the decoded text.
        decoded += prefix_symbol[bits]
    
        # Increase the counter by one, so that the next iteration starts at the
        # first bit that hasn't been read yet
        i += 1
    
    # Return the decoded text
    return decoded

#### Function for encoding a text, given a prefix map in the form of a dictionary

The following function takes a text and a prefix code as an input and returns the encoded text. This function is for demonstrating the decoding algorithm.

In [143]:
# Function for encoding a text given a prefix map in the form of a dictionary
def encode(text,code):
    
    # Initialize a null string, the encoded text will be appended to this string
    encoded = ''
    
    # Loop through the symbols in the text, making sure that all of them are lower case
    for symbol in text.lower():
        # For every symbol, get the corresponding code using the dictionary and concatenate it 
        # to the string
        encoded += code[symbol]
    
    # Return the encoded string. 
    return encoded

#### 2.2.5 Output of the algorithm
 
We will now run the algorithm on the two given texts. (Note that all letters will be interpreted as lower case.)

The encoded version of the Shakespeare text using the prefix code which was shown before has the following form:

In [138]:
# Encode the Hamlet text using the encoding function
hamlet_encoded = encode(hamlet,hamlet_prefix)
hamlet_encoded

'100100101110001000000100110011010100011000100111111011000100101000000110001110101101000111101111011010110010010011101011110010110110001101011001010011100010110110001110100011111111010101110100101111110011100011111110001011100010000001010010100011001101010110101010001110001100011101000100010101110100100111000011110001000001011110110101100110001001100001110011110011000100110000111001111001101101001001100011101011110010110011110010111111001110000100110011010100111100110110100100111110101111101110101001111110111100101011111100110011010010011110100101100001011111011111011010111111001100010011000011100111100110100101010011000110100111010111100100110111110001111101100101010000010000100001001001010101101010101011000110011110110111110110111101001110110010001101100011101110110101100101101001011110001101100010011010100011010101001100111001001010111111000100111111011001111001010011100001011000111011111001101111101101110011100101001001100010011000011101111100101100111111110101101100001011111000011

We will now decode the text using the decoding function.

In [140]:
decode(hamlet_prefix,hamlet_encoded)

'o all you host of heaven! o earth! what else? and shall i couple hell? oh, fie! hold, hold, my heart, and you, my sinews, grow not instant old, but bear me stiffly up. remember thee! ay, thou poor ghost, whiles memory holds a seat in this distracted globe. remember thee! yea, from the table of my memory i’ll wipe away all trivial fond records, all saws of books, all forms, all pressures past that youth and observation copied there, and thy commandment all alone shall live within the book and volume of my brain, unmixed with baser matter. yes, by heaven! o most pernicious woman! o villain, villain, smiling, damned villain! my tables! meet it is i set it down that one may smile, and smile, and be a villain. at least i’m sure it may be so in denmark. so, uncle, there you are. now to my word.'

Except for the lower- and uppercase differences, the texts are the same. The same holds for the Goethe text as shown in the following section.

In [141]:
# Encode the Faust text using the encoding function
faust_encoded = encode(faust,faust_prefix)
faust_encoded

'100000010110100101111011000010111101011110001110011000001011011111101000010001010011001100010011110001110100001000101001011010111111010011000000111101000111001101001110101010111000010111101111110010010110111010001000101010111101011110000101111011111011000101010110110100111111000100001100110001111001110000101100010110011000100101010100101101100000111110011000000100000011111001110011000011011101001001111001111010111110010101010011111100001010100011001101010010111011010010100101101001010001011001011111111101100011110011100110101000111101011001100011110110000101111010111110101100110001110001011110010010011111110011110001011100101101111000010111101111101101010101011111001111000111101101101100000000101011100010110000111111100001010010001000000011010011111000101111101000111110000101010001100110101111001000010010101010001110011010011111010111110000101010001100110101111101111000101101110011110001011111100101000010111111000010111101111100100010100101000010111001111001100011000110111110001101111

In [142]:
# Decode the text using the decoding function
decode(faust_prefix,faust_encoded)

'habe nun, ach! philosophie, juristerei und medizin, und leider auch theologiedurchaus studiert, mit heissem bemühn. da steh ich nun, ich armer tor! und bin so klug als wiezuvor; heisse magister, heisse doktor gar und ziehe schon an die zehen jahr herauf, herab undquer und krumm meine schüler an der nase herum und sehe, dass wir nichts wissen können! daswill mir schier das herz verbrennen. zwar bin ich gescheiter als all die laffen, doktoren, magister,schreiber und pfaffen; mich plagen keine skrupel noch zweifel, fürchte mich weder vor hölle nochteufel dafür ist mir auch alle freud entrissen, bilde mir nicht ein, was rechts zu wissen, bilde mirnicht ein, ich könnte was lehren, die menschen zu bessern und zu bekehren. auch hab ich wedergut noch geld, noch ehr und herrlichkeit der welt; es möchte kein hund so länger leben! drum hab ich mich der magie ergeben, ob mir durch geistes kraft und mund nicht manch geheimniswürde kund; dass ich nicht mehr mit saurem schweiss zu sagen brauche, was

## 3. References

Huffman, D. A. (1952). A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9), 1098-1101.