# Taaltheorie & Taalverwerking 2017 - Assignment 3

In [1]:
# FILL THIS IN FOR YOUR GROUP, also name your file as: tttv_ass3_<group>_<name1>_<name2>.ipynb
# Group        : G
# Name - UvaID : Deborah Lambregts - 11318643
# Name - UvaID : Bram Otten - 10992456
# Date         : 23-04-2017

In [2]:
import nltk
from nltk import CFG
from nltk.grammar import FeatureGrammar
from nltk.parse import RecursiveDescentParser, FeatureEarleyChartParser

# Function that works for multiple types of parsers (You are free to use something else if you want.)
def check_sentence(parser, sentence):
    print("--------------------------------------------------")
    print("Checking if provided sentence matches the grammar:")
    print(sentence)
    if isinstance(sentence, str):
        sentence = sentence.split()
    tree_found = False
    results = parser.parse(sentence)
    for tree in results:
        tree_found = True
        print(tree)
    if not tree_found:
        print(sentence, "Does not match the provided grammar.")
    print("--------------------------------------------------")
    return tree_found

The main goal of this set of exercises is to implement the shift-reduce algorithm for bottom-up parsing. We will use the following grammar for most of our examples, which is a slightly modified version of the grammar we used in the exercises last week:

<table>
<tr>
    <td>Phrase structure rules</td>
    <td>Lexicon</td>
</tr>
<tr>
    <td>S $\rightarrow$ NP VP</td>
    <td>Det $\rightarrow$ *the*</td>
</tr>
<tr>
    <td>NP $\rightarrow$ Det N</td>
    <td>N $\rightarrow$ *journalist* | *detective*</td>
</tr>
<tr>
    <td>VP $\rightarrow$ V NP</td>
    <td>V $\rightarrow$ *interviews* | *photographs*</td>
</tr>
<tr>
    <td>V $\rightarrow$ V C V</td>
    <td>C $\rightarrow$ *and* | *or*</td>
</tr>
</table>


This grammar licenses sentences such as *the detective interviews the journalist* as well as sentences such as *the journalist interviews and photographs the detective*.


### Question 1 (3 pts)

Encode the above grammar as a CFG named **cfg_1**, using NLTK. Then try to parse the following sentence: *The detective interviews and photographs the journalist.*


In [3]:
# Finish the declaration of cfg_1
cfg_1 = CFG.fromstring("""
  S -> NP VP
  
  NP -> Det N
  VP -> V NP
  V -> V C V
  
  Det -> 'the'
  N -> 'journalist' | 'detective'
  V -> 'interviews' | 'photographs'
  C -> 'and' | 'or'
""")

# Use RecursiveDescentParser for this example.
cfg_1_parser = RecursiveDescentParser(cfg_1)
# The following inputs should produce the corresponding results
print(check_sentence(cfg_1_parser, 'the detective interviews and photographs the journalist')) # RecursionError

--------------------------------------------------
Checking if provided sentence matches the grammar:
the detective interviews and photographs the journalist


RecursionError: maximum recursion depth exceeded while calling a Python object

As you will see, this does not work for the *RecursiveDescentParser*.

If it does work, try to parse the string *the detective interviews* instead, which should fail, and continue with the exercise. 
Trace the execution of the query. Document and explain what happens by referring to the properties of the grammar rules and the parsing strategy employed by the parser (_Hint_: the error traces contain comments that could help you. Scroll through them and see whether you can spot a pattern). Be thorough in your answer.

**Answers:**


In [4]:
"""
The grammar rule V -> V C V is left-recursive: its right-hand side starts with V itself.
The recursive descent parser works top-down and depth-first. When it expands a V
with this rule, the leftmost symbol is a V again, and no word of the input has been consumed.
That first V of the new V C V is expanded into V C V again, and so on.
Nothing ever stops this, so it goes on until Python's recursion limit is hit
(the RecursionError shown above).
"""

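The runaway expansion described in the answer can be made visible without NLTK. The following is a minimal sketch, not the parser's actual code: it assumes the parser tries `V -> V C V` before the lexical rules (the order in which `cfg_1` lists them), and the helper name `first_choice_trace` is made up for this illustration.

```python
# The three rules for V, in the order in which cfg_1 lists them.
V_RULES = [["V", "C", "V"], ["interviews"], ["photographs"]]


def first_choice_trace(form, steps):
    """Expand the leftmost V with its first rule `steps` times.

    This mimics what a depth-first parser does before it ever looks at
    the alternative rules or at the input words.
    """
    trace = [form]
    for _ in range(steps):
        i = form.index("V")                         # leftmost V
        form = form[:i] + V_RULES[0] + form[i + 1:]
        trace.append(form)
    return trace


for form in first_choice_trace(["V", "NP"], 3):
    print(" ".join(form))
# V NP
# V C V NP
# V C V C V NP
# V C V C V C V NP
```

Every step makes the sentential form longer while the leftmost symbol stays `V`, so no input word is ever matched and the recursion has no reason to stop.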
Would a top-down breadth-first parsing strategy have the same problem as the *RecursiveDescentParser* parser? Explain why.

**Answers:**

In [5]:
"""
No, the 'top-down-ness' itself is not the problem; the depth-first order is.
A breadth-first parser explores all derivations of a given depth before it
goes deeper, so for a grammatical sentence it returns the shallowest tree
before it goes deeper and deeper down the V C V hole.
For an ungrammatical sentence, a naive breadth-first parser would still keep
expanding V -> V C V forever, unless the search is bounded in some way.
"""

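To make the breadth-first argument concrete, here is a minimal sketch of a top-down, breadth-first recognizer over leftmost derivations for the same grammar. It is an illustration under assumptions, not an NLTK parser: the names `GRAMMAR`, `bfs_recognize` and `max_forms` are invented here, and `max_forms` is a search budget added only so that ungrammatical input does not loop forever.

```python
from collections import deque

# Same grammar as cfg_1, written as (left-hand side, right-hand side) pairs.
GRAMMAR = [
    ("S", ["NP", "VP"]),
    ("NP", ["Det", "N"]),
    ("VP", ["V", "NP"]),
    ("V", ["V", "C", "V"]),
    ("Det", ["the"]),
    ("N", ["journalist"]), ("N", ["detective"]),
    ("V", ["interviews"]), ("V", ["photographs"]),
    ("C", ["and"]), ("C", ["or"]),
]
NONTERMINALS = {left for left, _ in GRAMMAR}


def bfs_recognize(words, start="S", max_forms=10_000):
    """Return True if a derivation of `words` is found within the budget.

    False means "not found after visiting max_forms sentential forms",
    which is not a proof that no derivation exists.
    """
    queue = deque([[start]])
    visited = 0
    while queue and visited < max_forms:
        form = queue.popleft()          # oldest form first: breadth-first
        visited += 1
        # Position of the leftmost nonterminal, or None if there is none.
        idx = next((i for i, s in enumerate(form) if s in NONTERMINALS), None)
        if idx is None:
            if form == words:
                return True
            continue
        # Prune forms whose terminal prefix already disagrees with the input.
        if form[:idx] != words[:idx]:
            continue
        for left, right in GRAMMAR:
            if left == form[idx]:
                queue.append(form[:idx] + right + form[idx + 1:])
    return False


sentence = "the detective interviews and photographs the journalist".split()
print(bfs_recognize(sentence))                            # True
print(bfs_recognize("the detective interviews".split()))  # False (budget used up)
```

The left-recursive forms `V C V C V ...` are still generated, but they wait in the queue while the shorter derivations are checked first, so the grammatical sentence is recognized. For the ungrammatical string, only the budget ends the search.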
### Question 2 (1 pt)
Encode the same grammar using the following notation. Each rule is represented by a fact of the form **rule\[Left\] = Right**, where **Left** stands for an atom representing the lefthand side of a rule, and **Right** stands for the list of terminal and nonterminal symbols on the righthand side of the rule. For the sake of keeping our sanity in Python we will not store the rules individually, but we store them in a dictionary where **Left** forms the key, and **Right** is stored in a list for each **Left**. For example, the first and the last rule of our grammar would be added as follows:


In [6]:
def add_rule(rules, left, right):
    
    # If the key does not already exist, initialize it with a list.
    if left not in rules:
        rules[left] = []
    
    rules[left].append(right)

rules = dict()
add_rule(rules, 's', ['np', 'vp'])

add_rule(rules, 'np', ['det', 'n'])
add_rule(rules, 'vp', ['v', 'np'])
add_rule(rules, 'v', ['v', 'c', 'v'])
add_rule(rules, 'det', ['the'])
add_rule(rules, 'n', ['journalist'])
add_rule(rules, 'n', ['detective'])
add_rule(rules, 'v', ['interviews'])
add_rule(rules, 'v', ['photographs'])
add_rule(rules, 'c', ['and'])
add_rule(rules, 'c', ['or'])

print(rules)

{'s': [['np', 'vp']], 'np': [['det', 'n']], 'vp': [['v', 'np']], 'v': [['v', 'c', 'v'], ['interviews'], ['photographs']], 'det': [['the']], 'n': [['journalist'], ['detective']], 'c': [['and'], ['or']]}
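The reduce step of a shift-reduce parser needs this dictionary in the opposite direction: given a right-hand side, which left-hand sides can produce it? A minimal sketch of such a lookup follows. The helper name `lefts_for` is hypothetical, and the dictionary is repeated in shortened form only so that the block runs on its own.

```python
# Shortened copy of the rules dictionary, so the block is self-contained.
rules = {
    's': [['np', 'vp']],
    'np': [['det', 'n']],
    'v': [['v', 'c', 'v'], ['interviews'], ['photographs']],
    'n': [['journalist'], ['detective']],
}


def lefts_for(rules, right):
    """Return every left-hand side that has `right` among its right-hand sides."""
    return [left for left, rights in rules.items() if right in rights]


print(lefts_for(rules, ['det', 'n']))                 # ['np']
print(lefts_for(rules, ['interviews']))               # ['v']
print(lefts_for(rules, ['journalist', 'detective']))  # []
```

The last query returns an empty list because `'n'` has two separate one-word right-hand sides, not one two-word right-hand side.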


### Question 3 (5 pts)
Now implement the shift-reduce algorithm as a function **shift_reduce(rules, atoms_list, goal)**. The first argument holds the rules of the CFG, the second argument represents the string of words to be parsed (a list of atoms), and the third argument represents the constituent(s) that those words should be parsed as, i.e., the parsing goal (also a list of atoms, possibly just a single atom). 

For example, this indicates that this input string of words can be parsed as a sentence: 

    shift_reduce(rules, 'the journalist interviews and interviews and interviews the detective'.split(), ['s']) # True

This indicates that this string of words cannot be parsed as the two constituents NP and VP:

    shift_reduce(rules, 'the journalist interviews'.split(), ['np', 'vp']) # False
    
Proceed as follows. Rather than implementing **shift_reduce(rules, atoms_list, goal)** directly, give it an optional parameter *mem_list*: **shift_reduce(rules, atoms_list, goal, mem_list=[])**, where *mem\_list* represents the memory component (also a list of atoms). The function should have the following basic structure: 
1. The base case: if the current memory component is identical to the parsing goal and no words are left to shift, you are done. 
2. A recursive case implementing the shift operation: remove the first word from the current sentence and shift it onto the memory stack (i.e., append it to the end of the list representing the memory component). 
3. A recursive case implementing the reduce operation: find a rule whose right-hand side matches the last few elements on the memory stack, remove those elements from the memory, and put the left-hand side of the chosen rule in their place. 
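The shift and reduce operations listed above can be followed step by step with a small trace. This is a sketch under a simplifying assumption: it reduces greedily whenever it can and never backtracks, which happens to be enough for this grammar with goal `s` but is not a full solution to the exercise. The names `RULES` and `greedy_trace` are made up for the illustration.

```python
# The grammar of Question 2 in dictionary form, repeated so the block runs alone.
RULES = {
    's': [['np', 'vp']],
    'np': [['det', 'n']],
    'vp': [['v', 'np']],
    'v': [['v', 'c', 'v'], ['interviews'], ['photographs']],
    'det': [['the']],
    'n': [['journalist'], ['detective']],
    'c': [['and'], ['or']],
}


def greedy_trace(rules, words):
    """Return (action, memory) snapshots of a greedy shift-reduce run."""
    memory, steps = [], []
    words = list(words)
    while True:
        reduced = False
        for left, rights in rules.items():
            for right in rights:
                # Reduce: the end of the memory matches a right-hand side.
                if memory[-len(right):] == right:
                    memory = memory[:-len(right)] + [left]
                    steps.append(("reduce " + left, list(memory)))
                    reduced = True
                    break
            if reduced:
                break
        if reduced:
            continue
        if not words:
            return steps            # nothing left to reduce or shift
        # Shift: move the next word onto the memory.
        memory.append(words.pop(0))
        steps.append(("shift", list(memory)))


for action, memory in greedy_trace(RULES, 'the detective'.split()):
    print(f"{action:12}{memory}")
# The last snapshot printed has memory ['np'].
```

A greedy run like this fails for goals such as `['det', 'n']`, because it always reduces `det n` to `np`. That is why the recursive formulation with backtracking is asked for.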

In [37]:
def shift_reduce(rules, atoms_list, goal, mem_list=[]):
    # mem_list is never modified in place (new lists are built instead),
    # so the shared default [] cannot leak memory from one call to the next.

    # Base case: all words are consumed and the memory equals the goal.
    if not atoms_list and mem_list == goal:
        return True

    # Reduce: if the end of the memory matches a right-hand side,
    # replace that part by the left-hand side and continue from there.
    for left_hand, right_hands in rules.items():
        for right_hand in right_hands:
            size = len(right_hand)
            if mem_list[-size:] == right_hand:
                reduced = mem_list[:-size] + [left_hand]
                if shift_reduce(rules, atoms_list, goal, reduced):
                    return True

    # Shift: move the next word onto the memory stack.
    # This is reached only if no reduction led to the goal (backtracking).
    if atoms_list:
        return shift_reduce(rules, atoms_list[1:], goal, mem_list + [atoms_list[0]])

    return False

print(shift_reduce(rules, 'the journalist interviews and interviews and interviews the detective'.split(), ['s'])) # True
print(shift_reduce(rules, 'the journalist interviews'.split(), ['np', 'vp'])) # False

True
False
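One pitfall of the prescribed signature is the mutable default `mem_list=[]`: Python creates that list once, so an implementation that appends to it in place carries memory over from one call to the next. A minimal sketch of the effect, using two hypothetical helper functions that are not part of the assignment:

```python
def leaky(word, mem_list=[]):
    mem_list.append(word)       # mutates the single shared default list
    return mem_list


def safe(word, mem_list=None):
    if mem_list is None:        # a fresh list for every call
        mem_list = []
    return mem_list + [word]    # builds a new list, leaves the argument alone


print(leaky('the'))         # ['the']
print(leaky('detective'))   # ['the', 'detective']  <- leftover from the first call
print(safe('the'))          # ['the']
print(safe('detective'))    # ['detective']
```

Either remedy works on its own: use `None` as the default, or keep `[]` but never modify the list in place.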


Show that your function **shift_reduce(rules, atoms_list, goal)** (leaving *mem\_list* empty) works as intended by giving the output for three (interesting) queries that differ from the queries shown above.

In [38]:
# Three queries that differ from the ones above; the comments give the expected results.
print(shift_reduce(rules, 'the detective photographs the journalist'.split(), ['s'])) # True: plain transitive sentence
print(shift_reduce(rules, 'the journalist interviews or photographs'.split(), ['np', 'v'])) # True: two-constituent goal with a coordinated verb
print(shift_reduce(rules, 'the detective the journalist interviews'.split(), ['s'])) # False: wrong word order