# LELA70331 Computational Linguistics Week 11

This week we are going to take a look at syntactic parsing.


We will once again use tools from NLTK, which we need to import as follows:

In [None]:
import nltk
from nltk.parse.generate import generate
from nltk import CFG, Tree
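nltk.download('punkt')
# recent NLTK releases look for the 'punkt_tab' resource instead of 'punkt'
nltk.download('punkt_tab')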


We can define phrase structure grammars using rewrite rules (see the week 10 lecture for a definition) as follows:

In [None]:
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | Pronoun
    VP -> V NP 
    Det -> 'the' 
    Pronoun -> 'I'
    N -> 'dishes'  
    V -> 'washed'
 """)
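
To check how NLTK has read a grammar string, it can help to print the start symbol and the individual rules. The following is a minimal sketch; the name `demo_grammar` is our own, chosen so that it does not overwrite `grammar` above. Note that a rule written with `|` is stored as several separate productions.

```python
from nltk import CFG

# a copy of the toy grammar above under a different (illustrative) name
demo_grammar = CFG.fromstring("""
    S -> NP VP
    NP -> Det N | Pronoun
    VP -> V NP
    Det -> 'the'
    Pronoun -> 'I'
    N -> 'dishes'
    V -> 'washed'
""")

# the left-hand side of the first rule is taken as the start symbol
print(demo_grammar.start())

# each alternative separated by "|" becomes its own production
demo_productions = demo_grammar.productions()
for production in demo_productions:
    print(production)
```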

We can then "parse" tokenised input sentences as follows:

In [None]:
# define sentence and tokenize it
sent = 'I washed the dishes'
sent = nltk.word_tokenize(sent)
# use a parser to generate all possible syntax trees for the input sentence given our grammar
parser = nltk.ChartParser(grammar)
# print out all analyses
for tree in parser.parse(sent):
    tree.pretty_print()
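
When a sentence contains a word that has no rule in the grammar, the chart parser does not return an empty list of trees; it raises a `ValueError`. The sketch below illustrates this on a small grammar of our own (`small_grammar` and the other names are illustrative). It splits the sentence on whitespace rather than calling `word_tokenize`, purely to keep the example free of downloads.

```python
from nltk import CFG, ChartParser

small_grammar = CFG.fromstring("""
    S -> NP VP
    NP -> Det N | Pronoun
    VP -> V NP
    Det -> 'the'
    Pronoun -> 'I'
    N -> 'dishes'
    V -> 'washed'
""")
small_parser = ChartParser(small_grammar)

# 'They' and 'car' have no lexical rules in small_grammar
tokens = 'They washed the car'.split()

coverage_error = None
try:
    trees = list(small_parser.parse(tokens))
except ValueError as error:
    # the message names the words that the grammar does not cover
    coverage_error = error
    print(error)
```

Reading the error message is a quick way to see which lexical rules still need to be added.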

And we can generate from the grammar as follows:

In [None]:
for sentence in generate(grammar):
    print(' '.join(sentence))

Activity: Update the grammar so that it will parse "They washed the car". You can use the "|" symbol to list alternative words or symbol sequences on the right-hand side of a rule, e.g. V -> 'washed' | 'threw'

Activity: Update the grammar so that it will parse "The boy and his dog enter the park". Note: the same non-terminal symbol is permitted to appear on both the left-hand and the right-hand side of the same rule.

Activity: Generate from the grammar again. Why does it crash?
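
One possible way to keep generation finite once a grammar contains a recursive rule is to pass the `depth` argument of `generate`, which discards derivations deeper than the given limit. This is a sketch on a separate small grammar of our own (`recursive_grammar` is an illustrative name), not necessarily the fix used in class.

```python
from nltk import CFG
from nltk.parse.generate import generate

# NP appears on both sides of a rule, so the grammar licenses infinitely many sentences
recursive_grammar = CFG.fromstring("""
    S -> NP VP
    NP -> NP Conj NP | Det N
    VP -> V NP
    Conj -> 'and'
    Det -> 'the'
    N -> 'boy' | 'dog'
    V -> 'sees'
""")

# limit the depth of the derivations so that generation terminates
bounded_sentences = [' '.join(words) for words in generate(recursive_grammar, depth=5)]
for sentence in bounded_sentences:
    print(sentence)
```

The `n` argument of `generate` can be used as well to cap the number of sentences returned.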

Activity: Now we have fixed that, generate from the grammar again. What problems do you notice in the output? What might the solution be?

Activity: Update the grammar so that it will correctly parse the sentence "I washed the dishes on the counter". The intended interpretation is that the dishes were formerly on the counter and the washing took place in the sink. So the correct parse is as follows.

<img src="https://drive.google.com/uc?id=160-bmjhw_Jk6FNBaCCOtJxyXuCL5I7GG">





In [None]:
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | Pronoun
    VP -> V NP 
    Det -> 'the' 
    Pronoun -> 'I'
    N -> 'dishes' 
    V -> 'washed'
 """)

In [None]:
sent = 'I washed the dishes on the counter'
sent = nltk.word_tokenize(sent)
parser = nltk.ChartParser(grammar)
for tree in parser.parse(sent):
    tree.pretty_print()

Activity: Now add rules to the same grammar so that it also gives the correct analysis of the sentence "I washed my hair in the shower".

In [None]:
sentences = ['I washed the dishes on the counter', 'I washed my hair in the shower']
parser = nltk.ChartParser(grammar)
for sent in sentences:
    for tree in parser.parse(nltk.word_tokenize(sent)):
        tree.pretty_print()

## Probabilistic grammar
Even very simple grammars can allow multiple analyses of a simple sentence, and the number of analyses tends to grow as the grammar gets bigger. We therefore need a way to prefer one parse over the others. One way to accomplish this is with a probabilistic grammar, in which each rule is assigned a probability and the probabilities of all rules with the same left-hand side sum to 1.

In [None]:
grammar = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [0.25]
    NP -> NP PP [0.25]
    NP -> N PP [0.25]
    NP -> Pronoun [0.25]
    PP -> P NP [1.0]
    VP -> V NP [0.5]
    VP -> VP PP [0.5]
    Det -> 'the' [0.5]
    Det -> 'my' [0.5]
    Pronoun -> 'I' [1.0]
    N -> 'dishes' [0.25]
    N -> 'sink' [0.25]
    N -> 'breakfast' [0.25]
    N -> 'pyjamas' [0.25]
    V -> 'washed' [0.5]
    V -> 'ate' [0.5]
    P -> 'in' [1.0]
 """)

In [None]:
import re

sentences = ['I ate my breakfast in my pyjamas', 'I washed the dishes in the sink']
parser = nltk.ViterbiParser(grammar)
for sent in sentences:
    for tree in parser.parse_all(nltk.word_tokenize(sent)):
        # strip the trailing "(p=...)" probability annotation before drawing the tree
        tree_str = re.sub(r"\(p[^\)]+\)", "", str(tree))
        nltk.Tree.fromstring(tree_str).pretty_print()
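
The probability that a probabilistic grammar assigns to a tree is the product of the probabilities of the rules used to build it. The sketch below checks this on a small grammar of our own (`toy_pcfg` and the other names are illustrative, and the probabilities are made up for the example).

```python
import math
from nltk import PCFG, ViterbiParser

toy_pcfg = PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [0.5]
    NP -> Pronoun [0.5]
    VP -> V NP [1.0]
    Det -> 'the' [1.0]
    Pronoun -> 'I' [1.0]
    N -> 'dishes' [1.0]
    V -> 'washed' [1.0]
""")

# the Viterbi parser yields the single most probable tree
best_tree = next(ViterbiParser(toy_pcfg).parse('I washed the dishes'.split()))

# look up the probability of every rule by its left- and right-hand side
rule_prob = {(p.lhs(), p.rhs()): p.prob() for p in toy_pcfg.productions()}

# multiply the probabilities of the rules that occur in the best tree
product_of_rules = math.prod(
    rule_prob[(p.lhs(), p.rhs())] for p in best_tree.productions()
)

print(best_tree.prob(), product_of_rules)
```

Here the two NP rules are the only ones with a probability below 1, and each is used once, so both numbers come out as 0.5 × 0.5.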


Activity: Change the probabilities so that the parser assigns the correct analysis to "I washed the dishes in the sink".

Getting the correct solution for both sentences at the same time requires an additional change to the form of the grammar. Any ideas what might work?

## Treebanks and grammar induction

Just writing these few small toy grammars has been quite involved. Writing full grammars that will have wide coverage is extremely difficult. We therefore learn them from corpora that have been annotated with syntax trees, known as treebanks.

Some treebanks are built into NLTK and we can load an example as follows:

In [None]:
from nltk.corpus import treebank
nltk.download('treebank')

We can inspect an example tree as follows:

In [None]:
t = treebank.parsed_sents('wsj_0001.mrg')[0]
t.pretty_print()

We can learn a grammar from treebank data as follows. 

First we have to make a slight change to the format of the trees and collect the rules (productions) that they use:

In [None]:
productions = []
for item in treebank.fileids():
    for tree in treebank.parsed_sents(item):
        # perform optional tree transformations, e.g.:
        # collapse unary chains A-B-C into A-B+C
        tree.collapse_unary(collapsePOS=False)
        # binarise: turn A->(B,C,D) into A->B,C+D->D
        tree.chomsky_normal_form(horzMarkov=2)
        # collect the rules used in the transformed tree
        productions += tree.productions()
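
To see what the binarisation step does without downloading the treebank, one can apply it to a single hand-written tree. This is a small sketch; the tree and the name `demo_tree` are our own inventions for illustration.

```python
from nltk import Tree

# a hand-written tree whose NP node has four children
demo_tree = Tree.fromstring(
    "(S (NP (Det the) (Adj big) (Adj red) (N dog)) (VP (V barked)))"
)
children_before = max(len(subtree) for subtree in demo_tree.subtrees())

# binarise in place; new intermediate nodes are introduced under NP
demo_tree.chomsky_normal_form(horzMarkov=2)
children_after = max(len(subtree) for subtree in demo_tree.subtrees())

print(children_before, children_after)
demo_tree.pretty_print()
```

After the transformation no node has more than two children, which keeps the number of distinct rules in the induced grammar smaller.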

And then we can "induce" a probabilistic grammar as follows.

In [None]:
# import Nonterminal directly so that the earlier variable `grammar` is not overwritten by the module
from nltk import induce_pcfg, Nonterminal
S = Nonterminal('S')
grammar_PCFG = induce_pcfg(S, productions)
print(grammar_PCFG)
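
`induce_pcfg` estimates each rule's probability as the number of times the rule occurs divided by the number of times its left-hand side occurs. The sketch below shows this on a two-tree "treebank" that we made up for illustration (`toy_treebank` and the other names are our own).

```python
from nltk import Nonterminal, Tree, induce_pcfg

# two hand-written trees standing in for a treebank
toy_treebank = [
    Tree.fromstring("(S (NP (Pronoun I)) (VP (V washed) (NP (Det the) (N dishes))))"),
    Tree.fromstring("(S (NP (Pronoun I)) (VP (V ate) (NP (Pronoun it))))"),
]

# collect the rules used in every tree
toy_productions = []
for toy_tree in toy_treebank:
    toy_productions += toy_tree.productions()

toy_induced = induce_pcfg(Nonterminal('S'), toy_productions)

# NP is expanded four times: three times as Pronoun and once as Det N
np_rules = {p.rhs(): p.prob() for p in toy_induced.productions(lhs=Nonterminal('NP'))}
for rhs, prob in np_rules.items():
    print(rhs, prob)
```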

In [None]:
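import re

# ViterbiParser raises a ValueError if a word in the sentence does not occur in the induced grammar
sentences = ['I drive in the city']
parser = nltk.ViterbiParser(grammar_PCFG)
for sent in sentences:
    for tree in parser.parse_all(nltk.word_tokenize(sent)):
        # strip the trailing "(p=...)" probability annotation before drawing the tree
        tree_str = re.sub(r"\(p[^\)]+\)", "", str(tree))
        nltk.Tree.fromstring(tree_str).pretty_print()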