#### Combinatorial Optimization 

1. Edit Distance
2. Huffman Codes

##### General Set-Up
- Write text of at most 6 pages containing a discussion of the problem, the proposed algorithm, and a proof of correctness.
- Include references!!
- Write complete computer code, each line commented indicating meaning and flow of the algorithm
- Check correctness
- Report output of the algorithm on the data sets provided

Deliverable:
Single PDF file with a discussion of the problem and the proposed algorithm, a proof of correctness, the complexity as a function of input size, and a brief discussion of the paradigm (greedy, dynamic programming, divide and conquer), followed by code, references, and output.


#### 1. Edit Distance

1. Discussion of the problem
Given two strings of text, measure the distance between them using the following operations:

- D: Deletion
- I: Insertion
- S: Substitution

The edit distance $d(X,Y)$ is the minimum number of these operations needed to transform $X$ into $Y$.

3. Proof of correctness
4. Complexity
5. Implementation of the code
6. Output


Use dynamic programming: solve smaller subproblems (distances between prefixes of the two strings) and combine their solutions.
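
Written out as a recurrence, the subproblems are the distances between prefixes. This is a sketch: $D(i,j)$ is a symbol introduced here for illustration and denotes the edit distance between the first $i$ characters of $X$ and the first $j$ characters of $Y$, and all three operations are assumed to cost 1.

$$D(i,0) = i, \qquad D(0,j) = j$$

$$D(i,j) = \min \begin{cases} D(i-1,j) + 1 & \text{(deletion)} \\ D(i,j-1) + 1 & \text{(insertion)} \\ D(i-1,j-1) + [x_i \neq y_j] & \text{(match or substitution)} \end{cases}$$

Here $[x_i \neq y_j]$ is 1 if the two characters differ and 0 otherwise, and $d(X,Y) = D(|X|,|Y|)$. There are $(|X|+1)(|Y|+1)$ subproblems and each is solved in constant time from three smaller ones.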

In [8]:
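def levenshteinDistance(s1, s2):
    # Make s1 the shorter string so that each stored row has length min(len(s1), len(s2)) + 1
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    # Row for the empty prefix of s2: the distance between s1[:i] and '' is i
    distances = list(range(len(s1) + 1))
    # Process s2 one character at a time, keeping only the previous row
    for i2, c2 in enumerate(s2):
        # First entry of the new row: the distance between '' and s2[:i2+1] is i2 + 1
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                # Characters match: carry over the diagonal value at no extra cost
                distances_.append(distances[i1])
            else:
                # One operation plus the cheapest neighbour: diagonal (substitution),
                # the cell above, or the cell to the left (insertion / deletion)
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        # The new row becomes the previous row for the next character of s2
        distances = distances_
    # The last entry of the last row is the distance between the full strings
    return distances[-1]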

In [9]:
levenshteinDistance('ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA','TACTAGCTTACTTACCCATCAGGTTTTAGAGATGGCAACCA')

10
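
For the "check correctness" step, one option is to cross-check the row-based function against a plain full-table version on small inputs whose distance is known. The sketch below is only an illustration: the name `edit_distance_table` and the test strings are not part of the assignment, and unit cost for all three operations is assumed.

```python
def edit_distance_table(x, y):
    """Full-table edit distance (illustrative name, unit costs assumed)."""
    m, n = len(x), len(y)
    # d[i][j] holds the edit distance between x[:i] and y[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters of x[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[m][n]


print(edit_distance_table("kitten", "sitting"))  # 3
print(edit_distance_table("flaw", "lawn"))       # 2
print(edit_distance_table("", "abc"))            # 3
```

The table uses $O(|X| \cdot |Y|)$ time and space; keeping only the previous row, as `levenshteinDistance` does, reduces the space to the length of the shorter string plus one.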

#### 2. Huffman Codes

##### Discussion of the problem

- Find the minimal encoding for a given text.
- Encoding means writing the text using bits (zeros and ones).
- A naive approach would be to use a fixed number of bits for each symbol in the alphabet. Example: to encode 32 symbols, use 5 bits per symbol, since $2^5 = 32$. (Kleinberg, Tardos)
- The question is: Do we really need to use 5 bits for every symbol? Some symbols are used more often than others. From a data storage perspective it would be wasteful to encode all symbols with the same number of bits. In the example above, we would need $32 \cdot 5$ bits to encode the alphabet together with the space, comma, period, question mark and exclamation point.
- We could instead use a smaller number of bits to encode the symbols that are more frequent! For example, "a" is a lot more frequent than "x" in English text.
- The question is now how to find an optimal encoding, i.e. one whose encoded text has minimum size.
- On the other hand, the encoding should ensure that a coded text can be decoded unambiguously.

Solution: Use Variable-Length encoding schemes

In comes the prefix code: Say we want to transmit a message using only zeros and ones. The message looks like one long string of zeros and ones. Let's say we've encoded "a" as "1", "b" as "0" and "c" as "01". The string "abc" would then be encoded as "1001". However, an algorithm that decodes this string reads the bits from left to right and returns a letter as soon as it finds a match. In the first step it finds the "1" and returns "a". In the second step it finds the "0" and returns "b"; so far so good. In the third step, however, it again stops at the "0" and returns another "b", and the final "1" is decoded as "a". The decoded string is then "abba" and not "abc". The issue is that the encoding for "b" is a prefix of the encoding for "c". We therefore have to find an encoding that maps letters to bit strings so that no bit string is the prefix of another. It is no problem, however, if a bit string is a substring of another bit string, as long as it is not a prefix! (Kleinberg, Tardos).
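
The "abba" example can be reproduced in a few lines. This is a minimal sketch: the helper names `is_prefix_free` and `greedy_decode` and the repaired code `good_code` are made up for illustration, and the decoder simply emits a letter as soon as the bits read so far match a code word, as described above.

```python
def is_prefix_free(code_words):
    """Return True if no code word is a prefix of a different code word."""
    words = list(code_words.values())
    for i, u in enumerate(words):
        for j, v in enumerate(words):
            if i != j and v.startswith(u):
                return False
    return True


def greedy_decode(bits, code_words):
    """Read bits left to right and emit a symbol as soon as a code word matches."""
    inverse = {word: symbol for symbol, word in code_words.items()}
    out, buffer = [], ''
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            out.append(inverse[buffer])
            buffer = ''
    return ''.join(out)


bad_code = {'a': '1', 'b': '0', 'c': '01'}    # '0' is a prefix of '01'
good_code = {'a': '1', 'b': '00', 'c': '01'}  # hypothetical prefix-free repair

for name, cw in (('bad', bad_code), ('good', good_code)):
    bits = ''.join(cw[s] for s in 'abc')
    print(name, is_prefix_free(cw), bits, greedy_decode(bits, cw))
# bad False 1001 abba
# good True 10001 abc
```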

The question is now: How do we get an optimal set of prefixes, so that 

1: We can encode and decode a given set of symbols unambiguously
2: The encoding takes up minimal space.

How is minimal space defined?

Each symbol $s$ of the text $T$ occurs with a given relative frequency $f_s$. The encoding of $s$ is $c(s)$, where $c(s)$ is a binary string. The average number of bits required per symbol of the text is then:

$$\sum_{s \in T} f_s \, \lvert c(s) \rvert$$

where $\lvert c(s) \rvert$ denotes the length of the encoding $c(s)$, i.e. its number of bits.
The question is now: How do we find the optimal encoding?
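
As a small numerical illustration of this cost, the sketch below evaluates the sum for the five-symbol frequency table and the code that appear in the notebook cells further down, and compares it with a fixed-length code. The function name `average_bits_per_symbol` is made up for this example.

```python
import math


def average_bits_per_symbol(freqs, code_words):
    """Sum of f_s * |c(s)| over all symbols s (freqs are relative frequencies)."""
    return sum(freqs[s] * len(code_words[s]) for s in freqs)


example_freqs = {'a': 0.32, 'b': 0.25, 'c': 0.2, 'd': 0.18, 'e': 0.05}
example_code = {'c': '00', 'e': '010', 'd': '011', 'b': '10', 'a': '11'}

variable_cost = average_bits_per_symbol(example_freqs, example_code)
# A fixed-length code needs ceil(log2(number of symbols)) bits for every symbol
fixed_cost = math.ceil(math.log2(len(example_freqs)))

print(round(variable_cost, 2), fixed_cost)  # 2.23 3
```

With these frequencies the variable-length code needs about 2.23 bits per symbol on average, compared with 3 bits for a fixed-length code over five symbols.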

##### The Huffman Code - Algorithm

What to discuss here:
- Just the algorithm
- Proof of correctness (-> it is a prefix code, it is minimal, etc) in the second part.

Use a greedy method to construct an optimal prefix code! (Kleinberg, Tardos).

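- Use binary trees
- The number of leaves is equal to the size of the alphabet (the unique symbols in the text)
- Each leaf is labeled with a distinct letter of the alphabet
- This binary tree naturally describes a prefix code
- For each letter, follow the path from the root to the leaf labeled with that letter; every step to one child contributes a "0" and every step to the other child a "1", and the resulting bit string is the letter's code (Kleinberg, Tardos)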
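
To make the link between tree and prefix code concrete, here is a minimal sketch that reads the code off a tree. It assumes the nested-dictionary representation that shows up in the output cells of this notebook (inner nodes are dicts with keys `'0'` and `'1'`, leaves are the symbols themselves); the function name `codes_from_tree` is made up for illustration.

```python
def codes_from_tree(tree, prefix=''):
    """Collect {symbol: bit string} by walking every root-to-leaf path."""
    if isinstance(tree, str):
        # Leaf: the bits collected on the way down are the symbol's code
        return {tree: prefix}
    codes = {}
    for bit in ('0', '1'):
        # Inner node: descend into both children, extending the prefix by one bit
        codes.update(codes_from_tree(tree[bit], prefix + bit))
    return codes


sample_tree = {'0': 'c', '1': {'0': 'a', '1': 'b'}}
print(codes_from_tree(sample_tree))  # {'c': '0', 'a': '10', 'b': '11'}
```

Because every symbol sits at a leaf, the path to one symbol can never continue on to another symbol, which is exactly the prefix property.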








In [211]:
# Defining a helper function for getting the alphabet of the current text. 
# The alphabet maps every symbol in the current text to its relative frequency.
def get_alphabet(text):
    
    # Initialize an empty dictionary in which symbols and their relative frequencies are stored as key : value pairs
    alphabet = {}
    
    # For every unique symbol in the text (including punctuation and spaces), add a
    # symbol : count(symbol) / len(text) pair to the dict.
    for letter in set(text):
        alphabet.update({letter: text.count(letter) / len(text)})
    
    # Return the filled dictionary
    return alphabet

# Defining a helper function for assigning the actual prefix code to a given set of nodes in a tree.
# The helper function receives a set of nodes in the form of a dictionary.




# Function for getting the optimal prefix of a given text
def get_prefix(text):
    
    # Get the alphabet of the input text. The frequencies of the symbols will be needed to 
    # form the prefixes.
    
    alphabet = get_alphabet(text)
    
    # Set up empty dictionary. Each entry in the dictionary will represent a node in the
    # Optimal prefix tree.
    nodes =  {}
    
    # Initialize the tree, assigning all symbols of the alphabet to a leaf in the tree.
    for symbol in alphabet.keys():
        nodes[symbol] = []
        
    # Now comes the step of generating "meta-symbols"
    # As long as the alphabet contains more than one symbol, take the two symbols "x" and "y" with
    # the lowest frequency f_x , f_y and merge them into a 'meta-symbol' "xy" with the frequency f_x + f_y
    # Remove the letters x and y from the alphabet and replace them with their "meta"-symbol'. Further, add
    # the meta symbols to the set of nodes. These meta symbols are actual nodes in the final tree. 
    # The while loop finishes when there's only one 'meta-symbol' left which has frequency one. This 
    # will be the root of the tree.
    
    # Repeat until alphabet has been shrunk to length 1
    while len(alphabet) > 1:
        
        # Sort the current instance of the alphabet in ascending order of frequency so that the
        # symbols / meta-symbols with the lowest frequency can be extracted.
        # This returns a list of (symbol, frequency) pairs whose first entry has the lowest frequency
        sorted_alphabet = sorted(alphabet.items(),key=lambda x:x[1]) 
        
        # Take the two letters / meta-letters x and y with the lowest frequency of the current 
        # instance of the alphabet. These will be merged into a new meta-letter and added to the set of nodes
        
        # Symbol with lowest frequency (just the symbol)
        x = sorted_alphabet[0][0]
        
        # Symbol with second-lowest frequency (just the symbol)
        y = sorted_alphabet[1][0]
        
        # Delete them from the alphabet, making a new combined letter 'xy' with the combined frequency
        # dict.pop(foo) removes the 'foo' entry from the dict, returning its value
        # Adding alphabet.pop(x) + alphabet.pop(y) together removes x and y from the dict, while 
        # summing their frequencies. The sum of the frequencies is assigned to a 
        # new 'meta'-letter with key "xy" and frequency f_x + f_y
        alphabet[x+y] = alphabet.pop(x) + alphabet.pop(y) 
    
        # Add the new meta letter "xy" to the tree. The meta letter has key "xy" and a list 
        # of the symbols x and y as a value.
        nodes[x+y] = [x, y]
        
    # Once the nodes have been built, the codes can be assigned by traversing the tree. Starting
    # from the root node which is the last meta-letter that has been added in the while loop.
    # Set up root node
    root = x + y
    
    # Set up an empty dictionary which will store the symbols and prefix codes 
    # This will be handed down to the helper function, which will fill it up
    code = {}

        
        
    return code
    

In [212]:
get_prefix(text)

{'a': [], 'b': [], 'c': [], 'cb': ['c', 'b'], 'acb': ['a', 'cb']}
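
A side note on `get_alphabet`: calling `text.count` once per distinct symbol scans the text once for every symbol. A possible alternative, sketched here under the assumption that relative frequencies are wanted, counts all symbols in a single pass with `collections.Counter`. The name `symbol_frequencies` is made up for illustration and is not used elsewhere in this notebook.

```python
from collections import Counter


def symbol_frequencies(text):
    """Return {symbol: relative frequency} using a single pass over the text."""
    counts = Counter(text)  # symbol -> number of occurrences
    total = len(text)
    return {symbol: n / total for symbol, n in counts.items()}


print(symbol_frequencies('aab'))
```

For the short texts in this assignment the difference is negligible; it only matters for long inputs with many distinct symbols.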

In [170]:
    



# Call the helper function assign_code (only valid inside the function body, so commented out here)
# tree = assign_code(nodes, root, code)   # assignment of the code for the given binary tree
# return code, tree


# Start at the root of the tree and look up its children.
# This is either a list of two (then we are at an inner node) or an empty list (then we've reached a leaf)
children = nodes[root]

# If we are at an inner node (list is of length two), split the tree:
# the first child goes on branch '0', the second child on branch '1'
if len(children) == 2:
    tree['0'] = children[0]
    tree['1'] = children[1]
nodes

{'a': [], 'b': [], 'c': [], 'cb': ['c', 'b'], 'acb': ['a', 'cb']}

In [126]:
# Get the children of the current node (saved as a list of the meta node)
childs = nodes[root]

# Set up empty tree
tree = {}

# If the length of the childs is 2, recursively call assign_code
# if len(childs) == 2:
#        tree['0'] = assign_code(nodes, childs[0], result, prefix+'0')
#        tree['1'] = assign_code(nodes, childs[1], result, prefix+'1')     
#        return tree
# If the length of the childs is not two, assign no prefix
code[label] = prefix
label

SyntaxError: invalid syntax (<ipython-input-126-99b9c1ba350b>, line 13)

In [203]:

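def Huffman_code(_vals):
    # Work on a copy so the caller's frequency dictionary is left untouched
    vals = _vals.copy()
    # nodes maps every symbol / meta-symbol to the list of its children
    nodes = {}
    for n in vals.keys(): # leaf initialization: leaves have no children
        nodes[n] = []

    while len(vals) > 1: # binary tree creation (assumes at least two symbols)
        # Sort ascending by frequency so the two rarest symbols come first
        s_vals = sorted(vals.items(), key=lambda x:x[1]) 
        a1 = s_vals[0][0]
        a2 = s_vals[1][0]
        # Merge the two rarest symbols into a meta-symbol with the summed frequency
        vals[a1+a2] = vals.pop(a1) + vals.pop(a2)
        nodes[a1+a2] = [a1, a2]        
    code = {}
    # The last meta-symbol created is the root of the tree
    root = a1+a2
    tree = assign_code(nodes, root, code)   # assignment of the code for the given binary tree      
    return code, tree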
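
`Huffman_code` relies on a helper `assign_code` whose definition does not appear in these notes. The following is a reconstruction based on the commented-out fragment in the scratch cell `In [126]`, so treat it as an assumption about the helper rather than its actual definition. The parameter names `label`, `result` and `prefix` are taken from that fragment; `demo_nodes` and `demo_code` are names made up for the usage example.

```python
def assign_code(nodes, label, result, prefix=''):
    """Walk the tree below `label`, filling `result` with {symbol: bit string}.

    Returns the subtree as a nested dict ('0' / '1' -> child) or, for a leaf,
    the symbol itself.
    """
    childs = nodes[label]  # [] for a leaf, [first, second] for a meta-symbol
    if len(childs) == 2:
        tree = {}
        # The first child gets a '0' appended to its code, the second child a '1'
        tree['0'] = assign_code(nodes, childs[0], result, prefix + '0')
        tree['1'] = assign_code(nodes, childs[1], result, prefix + '1')
        return tree
    # Leaf: the bits collected on the way down are the code of this symbol
    result[label] = prefix
    return label


# Usage on the node dictionary printed earlier in this notebook
demo_nodes = {'a': [], 'b': [], 'c': [], 'cb': ['c', 'b'], 'acb': ['a', 'cb']}
demo_code = {}
print(assign_code(demo_nodes, 'acb', demo_code))  # {'0': 'a', '1': {'0': 'c', '1': 'b'}}
print(demo_code)                                  # {'a': '0', 'c': '10', 'b': '11'}
```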


In [204]:
alphabet = {'a' : 0.32,
            'b' : 0.25,
            'c' : 0.2,
            'd' : 0.18,
            'e' : 0.05}

code, tree = Huffman_code(alphabet)
code

{'c': '00', 'e': '010', 'd': '011', 'b': '10', 'a': '11'}
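
For the complexity discussion: `Huffman_code` re-sorts all remaining symbols before every merge, which costs $O(k \log k)$ per merge and $O(k^2 \log k)$ overall for $k$ distinct symbols. A priority queue brings this down to $O(k \log k)$. The sketch below shows that variant with the standard-library `heapq` module. It is an alternative formulation, not the code used for the results in this notebook, and the name `huffman_heap` is made up for illustration.

```python
import heapq
from itertools import count


def huffman_heap(freqs):
    """Return {symbol: bit string} for a dict {symbol: frequency}.

    Illustrative sketch; assumes at least two symbols, all of them strings.
    """
    tiebreak = count()  # unique counter so the heap never has to compare subtrees
    # Heap entries are (frequency, tiebreak, subtree); a leaf subtree is the symbol itself
    heap = [(f, next(tiebreak), s) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # rarest subtree
        f2, _, t2 = heapq.heappop(heap)  # second-rarest subtree
        # Merge them into an inner node with the summed frequency
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))
    _, _, root = heap[0]

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # inner node: (first child, second child)
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
        else:                        # leaf: a symbol
            codes[node] = prefix

    walk(root, '')
    return codes


heap_code = huffman_heap({'a': 0.32, 'b': 0.25, 'c': 0.2, 'd': 0.18, 'e': 0.05})
print({s: len(w) for s, w in heap_code.items()})
```

On this input the code word lengths are 2 bits for `a`, `b`, `c` and 3 bits for `d`, `e`, the same lengths as in the output above. In general, ties between equal frequencies may be broken differently by the two variants, which can change individual code words but not the average code length.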

In [None]:
text = """O all you host of heaven! O earth! What else? And shall I couple hell? Oh, fie! Hold,
hold, my heart, And you, my sinews, grow not instant old, But bear me stiffly up. Remember
thee! Ay, thou poor ghost, whiles memory holds a seat In this distracted globe. Remember thee!
Yea, from the table of my memory I’ll wipe away all trivial fond records, All saws of books, all
forms, all pressures past That youth and observation copied there, And thy commandment all
alone shall live Within the book and volume of my brain, Unmixed with baser matter. Yes, by
heaven! O most pernicious woman! O villain, villain, smiling, damned villain! My tables! Meet it
is I set it down That one may smile, and smile, and be a villain. At least I’m sure it may be so in
Denmark. So, uncle, there you are. Now to my word."""

text2 = """Habe nun, ach! Philosophie, Juristerei und Medizin, Und leider auch Theologie
Durchaus studiert, mit heissem Bem¨uhn. Da steh ich nun, ich armer Tor! Und bin so klug als wie
zuvor; Heisse Magister, heisse Doktor gar Und ziehe schon an die zehen Jahr Herauf, herab und
quer und krumm Meine Schüler an der Nase herum Und sehe, dass wir nichts wissen können! Das
will mir schier das Herz verbrennen. Zwar bin ich gescheiter als all die Laffen, Doktoren, Magister,
Schreiber und Pfaffen; Mich plagen keine Skrupel noch Zweifel, F¨urchte mich weder vor Hölle noch
Teufel Dafür ist mir auch alle Freud entrissen, Bilde mir nicht ein, was Rechts zu wissen, Bilde mir
nicht ein, ich könnte was lehren, Die Menschen zu bessern und zu bekehren. Auch hab ich weder
Gut noch Geld, Noch Ehr und Herrlichkeit der Welt; Es m¨ochte kein Hund so l¨anger leben! Drum
hab ich mich der Magie ergeben, Ob mir durch Geistes Kraft und Mund Nicht manch Geheimnis
w¨urde kund; Dass ich nicht mehr mit saurem Schweiss Zu sagen brauche, was ich nicht weiss; Dass
ich erkenne, was die Welt Im Innersten zusammenh¨alt, Schau alle Wirkenskraft und Samen, Und
tu nicht mehr in Worten kramen."""

In [71]:
# Test element:
alphabet = get_alphabet(text2.lower())

code, tree = Huffman_code(alphabet)

encoded = ''.join([code[t] for t in text2.lower()])
#print('Encoded text:',encoded)
len(encoded)

KeyError: 'h'

In [69]:
alphabet

{'z': 12,
 'r': 69,
 'u': 50,
 'j': 2,
 'v': 3,
 'e': 131,
 'd': 45,
 ';': 5,
 'i': 77,
 'o': 22,
 'n': 83,
 '\n': 12,
 'c': 43,
 'f': 13,
 'g': 15,
 'p': 5,
 '¨': 8,
 'ö': 2,
 ' ': 185,
 'a': 51,
 'q': 1,
 'w': 20,
 'k': 17,
 ',': 23,
 'ü': 1,
 'b': 17,
 's': 65,
 't': 38,
 'm': 35,
 'h': 72,
 'l': 32,
 '.': 4,
 '!': 4}

In [44]:
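decoded = []
i = 0
while i < len(encoded): # decoding using the binary tree
    ch = encoded[i]                 # first bit of the next code word
    act = tree[ch]                  # step from the root to the matching child
    while not isinstance(act, str): # inner nodes are dicts, leaves are symbols
        i += 1
        ch = encoded[i]             # read the next bit
        act = act[ch]               # and descend one level
    decoded.append(act)             # a leaf was reached: emit its symbol
    i += 1                          # move on to the first bit of the next code word

print('Decoded text:',''.join(decoded))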

Decoded text: Hello Georgi and Maia, i don't know why but it works.
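
One way to check the encoder and decoder against each other is a round trip: encode a text, decode the bits, and compare with the original. The sketch below wraps both directions in functions; the names `encode_text` and `decode_bits` and the sample message are made up for illustration, and the small tree and code are the three-symbol example printed at the end of this notebook.

```python
def encode_text(message, code_words):
    """Concatenate the code words of all symbols in the message."""
    return ''.join(code_words[s] for s in message)


def decode_bits(bits, tree):
    """Follow the bits down the tree; emit a symbol and restart at every leaf."""
    out = []
    node = tree
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):  # reached a leaf
            out.append(node)
            node = tree            # restart at the root for the next code word
    return ''.join(out)


rt_tree = {'0': 'c', '1': {'0': 'a', '1': 'b'}}
rt_code = {'c': '0', 'a': '10', 'b': '11'}

rt_bits = encode_text('abcab', rt_code)
print(rt_bits)                        # 101101011
print(decode_bits(rt_bits, rt_tree))  # abcab
```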


In [16]:
import subprocess

def draw_tree(tree, prefix = ''):    
    # Return the Graphviz DOT description of the subtree reached via the bit string `prefix`
    if isinstance(tree, str): # Leaf description: symbol and its code
        descr = 'N%s [label="%s:%s", fontcolor=blue, fontsize=16, width=2, shape=box];\n'%(prefix, tree, prefix)
    else: # Node description
        descr = 'N%s [label="%s"];\n'%(prefix, prefix)
        for child in tree.keys():
            descr += draw_tree(tree[child], prefix = prefix+child)
            descr += 'N%s -> N%s;\n'%(prefix,prefix+child)
    return descr


# Write the DOT file and render it to PNG with the Graphviz `dot` command-line tool
with open('graph.dot','w') as f:
    f.write('digraph G {\n')
    f.write(draw_tree(tree))
    f.write('}') 
subprocess.call(['dot', '-Tpng', 'graph.dot', '-o', 'graph.png'])



0

In [17]:
tree

{'0': 'c', '1': {'0': 'a', '1': 'b'}}

In [18]:
code

{'c': '0', 'a': '10', 'b': '11'}