Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = ""
IMMATRICULATION_NUMBER = ""

---

# Exercise 9: Recompression of String Grammars

Goals of this exercise:  
* Get a better understanding of the building blocks of recompression of string grammars
* Get a better understanding of counting digrams in string grammars
* Get a better understanding of computing rule versions

## First and last terminal generated by a grammar rule 

Write a method `first` that can be called by 

```
first(Grammar, Symbol)
```

and that returns the terminal symbol itself if `Symbol` is a terminal and that returns the first terminal of the decompressed string generated by `Symbol` if `Symbol` is a nonterminal defined in the `Grammar`. 

For example, for `Grammar =  {"A":"fBBBg", "B":"CC", "C":"gf"}`, the following calls of first shall give the following results:
```
first(Grammar, 'f'):	'f'
first(Grammar, 'C'):	'g'
first(Grammar, 'B'):	'g'	
first(Grammar, 'A'):	'f'
```

You shall assume that `'A'`...`'Z'` are all possible nonterminal symbols, that `'a'`...`'z'` are all possible terminal symbols, and that each nonterminal passed to `first` or `last` is defined by a grammar rule in the `Grammar`. 

Write also a method `last` that can be called by 
	
```
last(Grammar, Symbol)
```
    
and that returns the terminal symbol itself if `Symbol` is a terminal and that returns the last terminal of the decompressed string generated by `Symbol` if `Symbol` is a nonterminal defined in the `Grammar`. You shall assume that each nonterminal used as a parameter of a call of last is defined by a grammar rule in the `Grammar`. 

For example, for `Grammar =  {"A":"fBBBg", "B":"CC", "C":"gf"}`, the following calls of last shall give the following results:
```
last( Grammar, 'g'):	'g'
last( Grammar, 'C'):	'f'
last( Grammar, 'B'):	'f'
last( Grammar, 'A'):	'g'
```

In [2]:
def first(G, NT):                       # return first char generated by rule of NT
    while NT in G:
        NT = G[NT][0]
    return NT


In [3]:
def last(G, NT):                        # return last char generated by rule of NT
    while NT in G:
        NT = G[NT][-1]
    return NT


In [4]:
Grammar =  {"A":"fBBBg", "B":"CC", "C":"gf"}

assert first(Grammar, 'f') == 'f'
assert first(Grammar, 'C') == 'g'
assert first(Grammar, 'B') == 'g'    
assert first(Grammar, 'A') == 'f'

In [5]:
Grammar =  {"A":"fBBBg", "B":"CC", "C":"gf"}

assert last( Grammar, 'g') == 'g'
assert last( Grammar, 'C') == 'f'
assert last( Grammar, 'B') == 'f'
assert last( Grammar, 'A') == 'g'
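
As a sanity check, the expected values above can be reproduced by brute force: decompress the symbol completely and look at the first and last character of the result. This is only a sketch. `expand` is a hypothetical helper name that is not part of the exercise, and it is practical only for small grammars, because the decompressed string can be exponentially longer than the grammar.

```python
def expand(G, symbol):
    """Fully decompress `symbol` (hypothetical helper, small grammars only)."""
    if symbol not in G:                      # terminals expand to themselves
        return symbol
    # a nonterminal expands to the concatenation of its expanded right-hand side
    return "".join(expand(G, s) for s in G[symbol])


Grammar = {"A": "fBBBg", "B": "CC", "C": "gf"}

for symbol in "fCBA":
    text = expand(Grammar, symbol)
    print(symbol, text, text[0], text[-1])   # symbol, expansion, first and last terminal
```

The loop-based `first` and `last` avoid this blow-up because they follow only the leftmost or rightmost symbol of each rule.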

## Compute direct and indirect calls of each grammar rule
Implement a method `computeUsage` that can be called by 
```
directAndIndirectCalls = computeUsage(Grammar)
```
and that computes, for each rule in the grammar, how often the rule is called directly or indirectly 
when the start rule is called exactly once. 
To simplify the implementation, you shall assume that the non-terminal symbols of the grammar rules are given
in a topologically sorted order: 
each rule with a non-terminal symbol nt only calls rules whose 
non-terminal symbol is lexicographically greater than nt. Furthermore, the start rule has the 
lexicographically smallest non-terminal symbol.
For example, `computeUsage(Grammar)` should return the following results for the following grammars:

- `computeUsage({"A":"fBBBg","B":"DD","D":"gf"})=={"A":1, "B":3, "D":6}`
- `computeUsage({"A":"fBDBBg","B":"CDC","C":"DD","D":"EE","E":"gf"})=={"A":1, "B":3, "C":6, "D":16, "E":32}`
- `computeUsage({"A":"CdBBhi","B":"eC","C":"fg"})=={"A":1, "B":2, "C":3}`

In [6]:
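def computeUsage(G):                   # how often is each rule directly or indirectly called?
    NTs = sorted(G.keys())             # topological order: callers come before callees
    usage = {}
    usage[NTs[0]] = 1                  # the start rule is called exactly once
    for NT in NTs:
        rule = G[NT]
        for char in rule:
            if char in NTs:
                # every call of NT causes one call of char per occurrence in the rule
                usage[char] = usage.get(char, 0) + usage[NT]
    
    return usage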

In [7]:

assert computeUsage({"A":"fBBBg","B":"DD","D":"gf"})=={"A":1, "B":3, "D":6}
assert computeUsage({"A":"fBBBg","B":"CC","C":"DD","D":"gf"})=={"A":1, "B":3, "C":6,"D":12}
assert computeUsage({"A":"fBBBg","B":"CC","C":"DD","D":"EE","E":"gf"})=={"A":1, "B":3, "C":6,"D":12,"E":24}
assert computeUsage({"A":"fBDBBg","B":"CDC","C":"DD","D":"EE","E":"gf"})=={"A":1, "B":3, "C":6,"D":16,"E":32}
assert computeUsage({"A":"CdBBhi", "B":"eC", "C":"fg"})=={"A":1, "B":2, "C":3}
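
The counts can be checked by hand. In `{"A":"fBDBBg","B":"CDC","C":"DD","D":"EE","E":"gf"}`, `D` occurs once in `A` (called 1 time), once in `B` (called 3 times) and twice in `C` (called 6 times), which gives 1·1 + 3·1 + 6·2 = 16. A minimal sketch that confirms this by walking every single call is shown below. `usageByExpansion` is a hypothetical name, not part of the exercise, and its running time grows with the size of the decompressed string, so it is meant for small test grammars only.

```python
from collections import Counter


def usageByExpansion(G, start):
    """Count rule calls by visiting every call once (hypothetical brute-force helper)."""
    calls = Counter()

    def visit(symbol):
        if symbol not in G:          # terminals are not rules, nothing to count
            return
        calls[symbol] += 1           # one more call of this rule
        for s in G[symbol]:          # every symbol of the right-hand side is visited in turn
            visit(s)

    visit(start)
    return dict(calls)


G = {"A": "fBDBBg", "B": "CDC", "C": "DD", "D": "EE", "E": "gf"}
print(usageByExpansion(G, "A"))
```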

## Counting the number of digram occurrences

Implement a method `digramOccs` that can be called by
```
digramCounts = digramOccs( Grammar)
```

and that computes how often each digram consisting of two terminals occurs in the string generated by the grammar. The number of digram occurrences shall be computed by the digram counting technique presented in the lecture, i.e., without decompressing the grammar.

For example, calling `digramOccs( Grammar)` with the following parameters shall give the following results:
- `digramOccs({"A":"fBBBg","B":"DD","D":"gf"})=={"fg":7,"gf":6}`
- `digramOccs({"A":"fBDBBg","B":"CDC","C":"DD","D":"EE","E":"gf"})=={"fg":33,"gf":32}`
- `digramOccs({"A":"CdBBhi", "B":"eC", "C":"fg"})=={"fg":3,"ef":2,"gd":1,"de":1,"ge":1,"gh":1,"hi":1}`

In [8]:
    
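def digramOccs(G):                     # count all digram occurrences in G
    occs = {}
    usage = computeUsage(G)            # number of calls of each rule
    for NT in G:
        rule = G[NT]
        for i in range(len(rule)-1):   # every pair of neighbouring symbols in the rule
            digram = [rule[i], rule[i+1]]
            digram[0] = last(G, digram[0])     # last terminal generated by the left symbol
            digram[1] = first(G, digram[1])    # first terminal generated by the right symbol
            digram = "".join(digram)
            # the digram occurs once per call of the rule
            occs[digram] = occs.get(digram, 0) + usage[NT]
    return occs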

In [9]:
assert digramOccs({"A":"fBBBg","B":"DD","D":"gf"})=={"fg":7,"gf":6}
assert digramOccs({"A":"fBBBg","B":"CC","C":"DD","D":"gf"})=={"fg":13,"gf":12}
assert digramOccs({"A":"fBBBg","B":"CC","C":"DD","D":"EE","E":"gf"})=={"fg":25,"gf":24}
assert digramOccs({"A":"fBDBBg","B":"CDC","C":"DD","D":"EE","E":"gf"})=={"fg":33,"gf":32}
assert digramOccs({"A":"CdBBhi", "B":"eC", "C":"fg"})=={"fg":3,"ef":2,"gd":1,"de":1,"ge":1,"gh":1,"hi":1}
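
For small grammars, the expected counts can be cross-checked by doing exactly what the task forbids: decompress the grammar and count the digrams in the resulting string. The sketch below is meant only as a test oracle. `expand` and `digramsByExpansion` are hypothetical helper names that are not part of the exercise.

```python
from collections import Counter


def expand(G, symbol):
    """Fully decompress `symbol` (hypothetical helper, small grammars only)."""
    if symbol not in G:
        return symbol
    return "".join(expand(G, s) for s in G[symbol])


def digramsByExpansion(G, start):
    """Count all digrams in the decompressed string (brute-force test oracle)."""
    text = expand(G, start)
    # every pair of neighbouring characters is one digram occurrence
    return dict(Counter(text[i:i + 2] for i in range(len(text) - 1)))


G = {"A": "CdBBhi", "B": "eC", "C": "fg"}
print(expand(G, "A"))
print(digramsByExpansion(G, "A"))
```

Note that this counts overlapping occurrences, which makes no difference for the digrams in these examples because each of them consists of two different terminals.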

## Identifying all rule versions that have to be generated


Write a method `identifyVersions` that can be called by 

```
versions = identifyVersions(Grammar, NT, digram) 
```

and that returns all versions of each non-terminal symbol on the right-hand side of the rule with non-terminal symbol `NT` that are needed to isolate the occurrences of `digram` in that rule. If for a rule with non-terminal symbol R the version R-fl (i.e. R without its first and last terminal symbol) is needed, add ("R","fl") to the result set. If the version R-l is needed, add ("R","l"); if the version R-f is needed, add ("R","f"); and if R is needed in its original form, add ("R","").

For example, let  `Grammar = {"A":"CdBBBhi", "B":"eC", "C":"fg"}`.

Then, the following calls of `identifyVersions` shall return the following results:
```
1. identifyVersions( Grammar, "A", "hi" ) == {("C",""),("B","")}
2. identifyVersions( Grammar, "A", "gd" ) == {("C","l"),("B","")}
3. identifyVersions( Grammar, "A", "de" ) == {("C",""),("B","f"),("B","")}
4. identifyVersions( Grammar, "A", "gh" ) == {("C",""),("B","l"),("B","")}
5. identifyVersions( Grammar, "A", "ge" ) == {("C",""),("B","l"),("B","fl"),("B","f")}
6. identifyVersions( Grammar, "A", "ef" ) == {("C",""),("B","")}
7. identifyVersions( Grammar, "B", "ef" ) == {("C","f")}
8. identifyVersions( Grammar, "B", "fg" ) == {("C","")}
9. identifyVersions( Grammar, "C", "fg" ) == set()
```

**Remarks:**
1.	Position 5 in the rule "A":"CdBBBhi" is a digram occurrence of "hi". However, as 'h' and 'i' at positions 5 and 6 are terminals, the digram "hi" is already isolated, i.e., no versions need to be produced; we need C and B in their original form. 
2.	Position 0 in the rule is a digram occurrence of "gd", because the last terminal generated by 'C' at position 0 is 'g'. We need C in the form ("C","l") and B only in its original form.
3.	Position 1 in the rule is a digram occurrence of "de", because the first terminal generated by 'B' at position 2 is 'e'. We need B in the form ("B","f") and in its original form, as the later B's in rule A do not produce the digram "de", and we need C in its original form.
4.	Position 4 in the rule is a digram occurrence of "gh", because the last terminal generated by 'B' at position 4 is 'g'. We need B in the form ("B","l") and in its original form, as the other B's in rule A do not produce the digram "gh", and we need C in its original form.
5.	Position 2 in the rule is a digram occurrence of "ge", because the last terminal generated by 'B' at position 2 is 'g' and the first terminal generated by 'B' at position 3 is 'e'. A further occurrence is at position 3, between the B's at positions 3 and 4. Thus, the B at position 2 loses its last terminal, the B at position 3 loses both its first and last terminal, and the B at position 4 loses its first terminal: we need B in the three forms ("B","l"), ("B","fl") and ("B","f") and C in its original form.
6.	There is no digram occurrence for "ef" in the rule "A":"CdBBBhi". Thus, we return all non-terminal symbols in their original form.
7.	Position 0 in the rule "B":"eC" is a digram occurrence of "ef", because the first terminal generated by 'C' at position 1 is 'f'. Therefore, we need C in the form ("C","f").
8.	There is no digram occurrence for "fg" in the rule "B":"eC". Thus, we need C in its original form.
9.	Position 0 in the rule "C":"fg" is a digram occurrence of "fg". However, as 'f' and 'g' at positions 0 and 1 are terminals, the digram "fg" is already isolated. As rule C does not contain non-terminal symbols, we return the empty set.


**Hint:**
* Use your procedures `first` and `last` from the first subtask. 


In [10]:
def identifyVersions(G,NT,mfd):  # which versions are needed for the given grammar rule?
    # mfd is the digram whose occurrences shall be isolated (`digram` in the task description)
    rule = G[NT]
    versions = set()
    digrams = []                 # digrams[i]: digram at the boundary of rule[i] and rule[i+1]
    for i in range(len(rule)-1):
        digram = [rule[i], rule[i+1]]
        digram[0] = last(G, digram[0])
        digram[1] = first(G, digram[1])
        digram = "".join(digram)
        digrams.append(digram)
    for i, char in enumerate(rule):
        if char not in G:        # terminals need no version
            continue
        version = ""
        if i > 0:
            left_digram = digrams[i-1]
            if mfd == left_digram:       # occurrence ends in char: cut its first terminal
                version += "f"
        if i < len(rule)-1:
            right_digram = digrams[i]
            if mfd == right_digram:      # occurrence starts in char: cut its last terminal
                version += "l"
        versions.add((char, version))

    return versions

In [11]:
Grammar = {"A":"CdBBBhi", "B":"eC", "C":"fg"}

assert identifyVersions( Grammar, "A", "hi" ) == {("C",""),("B","")}
assert identifyVersions( Grammar, "A", "gd" ) == {("C","l"),("B","")}
assert identifyVersions( Grammar, "A", "de" ) == {("C",""),("B","f"),("B","")}
assert identifyVersions( Grammar, "A", "gh" ) == {("C",""),("B","l"),("B","")}
assert identifyVersions( Grammar, "A", "ge" ) == {("C",""),("B","l"),("B","fl"),("B","f")}
assert identifyVersions( Grammar, "A", "ef" ) == {("C",""),("B","")}
assert identifyVersions( Grammar, "B", "ef" ) == {("C","f")}
assert identifyVersions( Grammar, "B", "fg" ) == {("C","")}
assert identifyVersions( Grammar, "C", "fg" ) == set()
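
To follow the positions used in the remarks, it can help to list the digram at every boundary between two neighbouring symbols of a rule. The sketch below does this for rule A. `firstTerminal`, `lastTerminal` and `boundaryDigrams` are hypothetical names chosen so that the block runs on its own; the first two simply restate `first` and `last`.

```python
def firstTerminal(G, symbol):
    """First terminal generated by `symbol` (same idea as `first`)."""
    while symbol in G:
        symbol = G[symbol][0]
    return symbol


def lastTerminal(G, symbol):
    """Last terminal generated by `symbol` (same idea as `last`)."""
    while symbol in G:
        symbol = G[symbol][-1]
    return symbol


def boundaryDigrams(G, NT):
    """Digram at position i is formed by rule[i] and rule[i+1] (hypothetical helper)."""
    rule = G[NT]
    return [lastTerminal(G, rule[i]) + firstTerminal(G, rule[i + 1])
            for i in range(len(rule) - 1)]


Grammar = {"A": "CdBBBhi", "B": "eC", "C": "fg"}
for position, digram in enumerate(boundaryDigrams(Grammar, "A")):
    print(position, digram)
```

A non-terminal at position `i` gets an `'f'` if the digram at position `i-1` is the one to isolate, and an `'l'` if the digram at position `i` is.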



## Create a version of a given rule
**_(approx. 12 lines in Python)_**

Write a method `createVersion` that can be called by 
```
String = createVersion(Grammar, NT, fl)
```
and that returns the version defined by `fl` for the rule with non-terminal symbol `NT`. 

If `fl` is the empty string, you can simply return the right-hand side of the rule with non-terminal symbol `NT`. If it contains the letter `'f'`, you have to return a version without the first terminal symbol; if it contains the letter `'l'`, you have to return a version without the last terminal symbol. As a consequence, you have to return a version without the first and the last terminal symbol if the parameter `fl` contains both letters `'f'` and `'l'`.

In order to remove the first (or last) symbol:
- If the first (last) symbol is a terminal symbol, remove it.
- If the first symbol is a non-terminal symbol `'R'` and you want to remove the first symbol, replace it by `'(R,f)'` to represent the non-terminal symbol for the new rule containing the content of the rule with non-terminal symbol `R` without its first symbol.
- If the last symbol is a non-terminal symbol `'R'` and you want to remove the last symbol, replace it by `'(R,l)'` to represent the non-terminal symbol for the new rule containing the content of the rule with non-terminal symbol `R` without its last symbol.

For example, for `Grammar = {"A":"CdBBhi", "B":"eC", "C":"fg"}` the following calls of `createVersion` shall return the following strings:

```
1. createVersion(Grammar,"A","")=="CdBBhi"
2. createVersion(Grammar,"A","f")=="(C,f)dBBhi"
3. createVersion(Grammar,"A","l")=="CdBBh"
4. createVersion(Grammar,"A","lf")=="(C,f)dBBh"

5. createVersion(Grammar,"B","")=="eC"
6. createVersion(Grammar,"B","f")=="C"
7. createVersion(Grammar,"B","l")=="e(C,l)"
8. createVersion(Grammar,"B","lf")=="(C,l)"

9. createVersion(Grammar,"C","")=="fg"
10. createVersion(Grammar,"C","f")=="g"
11. createVersion(Grammar,"C","l")=="f"
12. createVersion(Grammar,"C","lf")==""
```



In [18]:
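def createVersion(G,NT,fl):                  # create the specified version
    rule = G[NT]
    if "f" in fl:                            # drop the first terminal
        if rule[0] in G:                     # leading non-terminal R becomes (R,f)
            rule = f"({rule[0]},f){rule[1:]}"
        else:                                # leading terminal is simply removed
            rule = rule[1:]
    if "l" in fl:                            # drop the last terminal
        if rule[-1] in G:                    # trailing non-terminal R becomes (R,l)
            rule = f"{rule[:-1]}({rule[-1]},l)"
        else:                                # trailing terminal is simply removed
            rule = rule[:-1]
    # print(f"{NT} -> {rule}")
    return rule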

In [19]:
Grammar = {"A":"CdBBhi", "B":"eC", "C":"fg"}

assert createVersion(Grammar,"A","")=="CdBBhi"
assert createVersion(Grammar,"A","f")=="(C,f)dBBhi"
assert createVersion(Grammar,"A","l")=="CdBBh"
assert createVersion(Grammar,"A","lf")=="(C,f)dBBh"

assert createVersion(Grammar,"B","")=="eC"
assert createVersion(Grammar,"B","f")=="C"
assert createVersion(Grammar,"B","l")=="e(C,l)"
assert createVersion(Grammar,"B","lf")=="(C,l)"

assert createVersion(Grammar,"C","")=="fg"
assert createVersion(Grammar,"C","f")=="g"
assert createVersion(Grammar,"C","l")=="f"
assert createVersion(Grammar,"C","lf")==""
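
One way to convince yourself that the expected strings are right is to decompress them and compare the result with the decompressed original rule minus its first and/or last character. The sketch below does this for a few of the expected strings listed above. `expand` and `expandVersion` are hypothetical helper names, and the regular expression assumes the token format `(R,f)` and `(R,l)` with a single uppercase letter `R`, as used in this exercise.

```python
import re


def expand(G, symbol):
    """Fully decompress `symbol` (hypothetical helper, small grammars only)."""
    if symbol not in G:
        return symbol
    return "".join(expand(G, s) for s in G[symbol])


def expandVersion(G, version):
    """Decompress a version string that may contain tokens (R,f) or (R,l)."""
    parts = []
    # each match is either a token (R,f) / (R,l) or a single plain symbol
    for nt, cut, plain in re.findall(r"\(([A-Z]),([fl])\)|(.)", version):
        if nt:
            text = expand(G, nt)
            parts.append(text[1:] if cut == "f" else text[:-1])
        else:
            parts.append(expand(G, plain))
    return "".join(parts)


G = {"A": "CdBBhi", "B": "eC", "C": "fg"}
# (non-terminal, fl) -> expected string taken from the examples above
checks = {("A", "f"): "(C,f)dBBhi", ("A", "lf"): "(C,f)dBBh",
          ("B", "l"): "e(C,l)", ("B", "lf"): "(C,l)", ("C", "lf"): ""}

for (NT, fl), version in checks.items():
    full = expand(G, NT)
    start = 1 if "f" in fl else 0
    end = len(full) - (1 if "l" in fl else 0)
    assert expandVersion(G, version) == full[start:end]
```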


