Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [2]:
NAME = ""
IMMATRICULATION_NUMBER = ""

---

# Exercise 6: String RePair

The goals of this exercise are:  
* to get an understanding of the RePair compression technique applied to strings
* to have a basis for implementing tree compression, graph compression, and recompression based on RePair 

## 1. Compute a most frequent digram of a string 

Implement a function <code>findMFD</code> that can be called by 

<code>mfd, occs = findMFD(inputString)</code>

and that returns a string containing a most frequent digram occurring in an input string and the number of occurrences of the digram within the input string. For example, 

<code>mfd, occs = findMFD("abcabc")</code>

should return either the tuple ("ab", 2) or the tuple ("bc", 2) for mfd and occs, but not both. 

<b>Hint: </b>
To simplify your implementation, you may assume that the input string does not contain any overlapping digrams (e.g. “aa” in “aaa”). 
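
The overlap issue named in the hint can be made concrete with a small sketch. The helper names `count_naive` and `count_non_overlapping` are hypothetical and not part of the exercise interface; they only contrast the two ways of counting.

```python
def count_naive(digram, text):
    # Count every position where the digram starts, overlaps included.
    return sum(1 for i in range(len(text) - 1) if text[i:i+2] == digram)

def count_non_overlapping(digram, text):
    # str.count scans left to right and continues after each match,
    # so overlapping occurrences are not counted twice.
    return text.count(digram)

print(count_naive("aa", "aaa"))            # 2
print(count_non_overlapping("aa", "aaa"))  # 1
```

For strings without overlapping digrams, such as "abcabc", both counts agree, which is why the simplification is harmless there.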


In [8]:
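def findMFD(inputString):
    # Count every digram (pair of adjacent characters), overlaps included.
    digrams_count = {}
    for i in range(len(inputString)-1):
        digram = inputString[i:i+2]
        digrams_count[digram] = digrams_count.get(digram, 0) + 1
    # Strings shorter than two characters contain no digram at all;
    # returning ("", 0) here is an assumption, the task leaves this case open.
    if not digrams_count:
        return "", 0
    # max returns the first digram that reaches the highest count.
    max_key = max(digrams_count, key=digrams_count.get)
    return max_key, digrams_count[max_key]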

In [9]:
assert findMFD('abababab')==('ab',4)
assert findMFD('peterpiperpickedapeckofpickledpeppersapeckofpickledpepperspeterpiperpickedifpeterpiperpickedapeckofpickledpepperswheresthepeckofpickledpepperspeterpiperpicked')==('pe',20)
assert findMFD('iwishtowishthewishyouwishtowishbutifyouwishthewishthewitchwishesiwontwishthewishyouwishtowish')==('wi',13)

## 2. Replace all digram occurrences of a given most frequent digram of a string 

Implement a function <code>replaceMFD</code> that can be called by 

<code>newString = replaceMFD(digram,inputString,nonterminal)</code>

and that replaces all occurrences of a given most frequent digram in a given input string by a given nonterminal symbol. 

For example, 

<code>newString = replaceMFD("bc","abcabc","A")</code>

should return the new string "aAaA". 

<b>Hint: </b>
Whereas in the previous exercise you did not have to consider overlapping digram occurrences (i.e., you were allowed to count the digram 'aa' in the string 'aaa' as 2 occurrences, although these two cannot both be replaced), the function <code>replaceMFD</code> has to replace overlapping occurrences correctly. This means that <code>replaceMFD("aa","aaa","A")</code> can return "Aa" or "aA" but must not return "AA", i.e., if we substitute the digram back for the nonterminal, the original inputString has to be returned.
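
One way to satisfy this requirement is an explicit left-to-right scan that skips past each replaced digram. The following is a minimal sketch under that assumption; the name `replace_left_to_right` is hypothetical and only illustrates the round-trip property described in the hint.

```python
def replace_left_to_right(digram, text, nonterminal):
    # Scan left to right; after a match, jump past both characters so that
    # the next match cannot overlap the one just replaced.
    out = []
    i = 0
    while i < len(text):
        if text[i:i+2] == digram:
            out.append(nonterminal)
            i += 2
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)

replaced = replace_left_to_right("aa", "aaa", "A")
print(replaced)                     # Aa
print(replaced.replace("A", "aa"))  # aaa
```

Substituting the digram back for the nonterminal restores the input, which is exactly the condition that rules out "AA".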


In [12]:
def replaceMFD(digram, inputString: str, nonterminal):
    # str.replace scans left to right and replaces non-overlapping matches only,
    # so "aaa" with digram "aa" becomes "Aa" and never "AA".
    return inputString.replace(digram, nonterminal)

In [13]:
assert replaceMFD("bc","abcabc","A")=="aAaA"

assert replaceMFD("ab","abcababdabcdabc","A")=="AcAAdAcdAc"
assert replaceMFD("Ac","AcAAdAcdAc","B")=="BAAdBdB"
assert replaceMFD("dB","BAAdBdB","C")=="BAACC"

assert replaceMFD("ab","abdabcababdabdabcabdaabc","A")=="AdAcAAdAdAcAdaAc"
assert replaceMFD("Ad","AdAcAAdAdAcAdaAc","B")=="BAcABBAcBaAc"
assert replaceMFD("Ac","BAcABBAcBaAc","C")=="BCABBCBaC"
assert replaceMFD("BC","BCABBCBaC","D")=="DABDBaC"

assert replaceMFD("aa","aaa","A")=="aA" or replaceMFD("aa","aaa","A")=="Aa"

## 3. Compute the grammar generated by RePair for a given input string 

Implement a function <code>repair</code> that can be called by 

<code>rules = repair(inputString)</code>

and that computes and returns the grammar that RePair generates for the input string. 

For example, 

<code>rules = repair("abcabcbc")</code>

should return the following list of rules: 

<code>["A-->bc", "B-->aA", "C-->BBA"] </code>

Remember that RePair searches for a most frequent digram (e.g. “bc”) that occurs at least twice in the input. If such a digram exists, RePair selects a fresh nonterminal (e.g. ‘A’) and replaces all occurrences of the selected digram by the new nonterminal. Furthermore, RePair extends the generated grammar by a rule for the nonterminal and the digram (“A-->bc” in this example). 

Replacement of all occurrences of a selected most frequent digram is continued as long as a digram occurs more than once. 

Finally, if no digram occurs more than once in the current string (e.g. “BBA”), RePair adds a start rule with that current string as its right-hand side to the grammar (“C-->BBA” in this example). 

<b>Hints:</b> 
* As before, to simplify your implementation, you may treat the naive count of overlapping digrams (e.g. “aa” counted twice in “aaa”) as correct. 
* Furthermore, you may assume that the initial input string contains only lowercase letters (‘a’..‘z’) and that the uppercase letters ‘A’, ‘B’, … can be used as nonterminals. 
* You do not have to implement the "pruning", i.e., you do not have to remove nonterminals that do not contribute to the compression, e.g. because they are used at most once.
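
The loop described above can be traced step by step. The sketch below is self-contained and uses a hypothetical helper `repair_trace` (not part of the exercise interface) that counts digrams naively with `collections.Counter` and records the string after every replacement.

```python
from collections import Counter

def repair_trace(text):
    # Returns the list of (nonterminal, digram, string after replacement)
    # steps together with the final string for the start rule.
    steps = []
    next_symbol = ord('A')
    while len(text) >= 2:
        # Naive digram count, overlaps included (see the first hint).
        counts = Counter(text[i:i+2] for i in range(len(text) - 1))
        digram, occurrences = counts.most_common(1)[0]
        if occurrences < 2:
            break
        nonterminal = chr(next_symbol)
        next_symbol += 1
        text = text.replace(digram, nonterminal)
        steps.append((nonterminal, digram, text))
    return steps, text

steps, start_rhs = repair_trace("abcabcbc")
for nonterminal, digram, current in steps:
    print(f"{nonterminal}-->{digram}   string is now {current}")
print(f"start rule right-hand side: {start_rhs}")
```

For "abcabcbc" the trace first replaces "bc" by "A" (giving "aAaAA"), then "aA" by "B" (giving "BBA"), and stops because no digram in "BBA" occurs twice.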



In [16]:
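def repair(inputString):
    grammar = {}

    def get_unused_nonterminal():
        # Nonterminals are handed out in the order 'A', 'B', ..., 'Z'.
        for i in range(26):
            candidate = chr(ord('A')+i)
            if candidate not in grammar:
                return candidate
        # Assumption: more than 26 rules are out of scope for this exercise.
        raise ValueError("ran out of nonterminals 'A'..'Z'")

    while True:
        digram, count = findMFD(inputString)
        # Stop as soon as no digram occurs at least twice.
        if count <= 1:
            break
        nonterminal = get_unused_nonterminal()
        grammar[nonterminal] = digram
        inputString = replaceMFD(digram, inputString, nonterminal)

    # The remaining string becomes the right-hand side of the start rule.
    start = get_unused_nonterminal()
    grammar[start] = inputString

    return [f"{key}-->{value}" for key, value in grammar.items()]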

In [17]:
assert repair("abcabcbc")==["A-->bc", "B-->aA", "C-->BBA"]
assert repair("abcababdabcdabc")==['A-->ab', 'B-->Ac', 'C-->dB', 'D-->BAACC']
assert repair("abdabcababdabdabcabdaabc")==['A-->ab', 'B-->Ad', 'C-->Ac', 'D-->BC', 'E-->DABDBaC']

## 4. Compute a decompression method for the grammar generated by RePair 

Implement a function <code>decomp</code> that can be called by

<code>decompressedString = decomp(rules)</code>

and that computes and returns the decompressed string from a list of rules that have been computed by your RePair compression algorithm. 

For example, if rules contains the following list of rules

<code>["A-->bc", "B-->aA", "C-->BBA"] </code>

the call of 

<code>decompressedString = decomp(rules)</code>

should return the string "abcabcbc".

<b>Hint:</b> 
To simplify your implementation, you can assume that the position of the start rule in the given list of rules is fixed, i.e., the start rule is always the last rule.
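
Decompression can also be viewed as a recursive expansion of the start symbol. The following is a minimal sketch of that view, assuming the rule format "X-->rhs" from above; the helper name `expand` is hypothetical.

```python
def expand(symbol, grammar):
    # Terminals (symbols without a rule) expand to themselves.
    if symbol not in grammar:
        return symbol
    # Nonterminals expand to the concatenated expansions of their right-hand side.
    return ''.join(expand(s, grammar) for s in grammar[symbol])

rules = ["A-->bc", "B-->aA", "C-->BBA"]
grammar = dict(rule.split('-->', 1) for rule in rules)
start = rules[-1].split('-->', 1)[0]  # the start rule is the last rule
print(expand(start, grammar))  # abcabcbc
```

The recursion terminates because every rule produced by RePair only refers to nonterminals that were introduced before it.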


In [22]:
def decomp(rules):
    # Parse each "X-->rhs" rule into a (nonterminal, right-hand side) pair.
    parsed = [rule.split('-->', 1) for rule in rules]
    # The start rule is the last one; begin with its nonterminal.
    result = parsed[-1][0]
    # Expand from the last rule backwards: every rule only references
    # nonterminals introduced by earlier rules, so one pass suffices.
    for lhs, rhs in reversed(parsed):
        result = result.replace(lhs, rhs)
    return result

In [23]:
assert decomp(['A-->bc', 'B-->aA', 'C-->BBA'])=="abcabcbc"
assert decomp(['A-->ab', 'B-->Ac', 'C-->dB', 'D-->BAACC'])=="abcababdabcdabc"
assert decomp(['A-->ab', 'B-->Ad', 'C-->Ac', 'D-->BC', 'E-->DABDBaC'])=="abdabcababdabdabcabdaabc"

def test(S):
    assert decomp(repair(S))==S, "Test failed for input String " + S

import random

def get_random_string(length):
    # Draw from a small alphabet ('a'..'d') so that repeated digrams are likely.
    letters = [chr(c) for c in range(ord('a'), ord('a')+4)]
    return ''.join(random.choice(letters) for _ in range(length))
    
test("abracadabra")
test("tobeornottobeortobeornot")
test("yabbadabbadoo")
test("iwishtowishthewishyouwishtowishbutifyouwishthewishthewitchwishesiwontwishthewishyouwishtowish")

for i in range(10):
    test(get_random_string(24))
