# Parsing 

Note that this chapter is heavily influenced by the structure and content of [Mike Collins' PCFG lecture](http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/pcfgs.pdf). 

In many NLP applications it is useful to understand the syntactic structure of a sentence: where are the verbs, what are their subjects and objects, and which phrases form coherent sub-structures of the sentence? Understanding this structure helps a machine translate more effectively from Japanese to English, or understand the query ["who is the president of the united state"](https://www.google.co.uk/search?q=who+is+the+president+of+the+united+state&oq=who+is+the+president+of+the+united+state&aqs=chrome..69i57j0l5.252j0j4&sourceid=chrome&es_sm=119&ie=UTF-8) and execute it against a database. 

In linguistics these questions are asked in the field of **syntax**, from the Greek syntaxis (arrangement). There are three core concepts:

* **Constituency**: groups of words act as single units.
* **Grammatical Relations**: subject, direct object, indirect object etc. 
* **Subcategorization**: restrictions on the type of phrases that go with certain words.

## Context Free Grammars
A common approach to capturing constituency, grammatical relations and subcategorization is based on [Context Free Grammars](https://www.cs.rochester.edu/~nelson/courses/csc_173/grammars/cfg.html) (CFGs). At a high level, these grammars assume that legal sentences can be derived by repeatedly and _independently_ expanding abstract symbols (such as "NounPhrase" or "Adjective") into more concrete sequences of symbols (such as "Adjective Noun" or "green") until each symbol is a concrete word. 

More formally, a CFG is a 4-tuple \\(G=(N,\Sigma,R,S)\\) where

  * \\(N\\) is a set of _non-terminal symbols_.
  * \\(\Sigma\\) is a set of _terminal symbols_.
  * \\(R\\) is a finite set of _rules_ \\(X \rightarrow Y_1 Y_2\ldots Y_n\\) where \\(X \in N\\) and \\(Y_i \in N \cup \Sigma\\). 
  * \\(S \in N\\) is a _start symbol_. 

Before we show examples, let us define a Python data structure for CFGs.

In [18]:
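class CFG:
    """A context-free grammar G = (N, Sigma, R, S)."""

    def __init__(self, n, sigma, r, s):
        self.n = n          # set of non-terminal symbols
        self.sigma = sigma  # set of terminal symbols
        self.r = r          # rules as (left-hand side, right-hand side) pairs
        self.s = s          # start symbol

    @classmethod
    def from_rules(cls, rules, s='S'):
        # Every symbol that appears on a left-hand side is a non-terminal.
        non_terminals = {rule[0] for rule in rules}
        # Symbols that occur only on right-hand sides are terminals.
        right_hand_symbols = {node for rule in rules for node in rule[1]}
        terminals = {n for n in right_hand_symbols if n not in non_terminals}
        return cls(non_terminals, terminals, rules, s)

    def _repr_html_(self):
        # Render the rules as an HTML table, one row per rule, for notebook display.
        rules = ["<tr><td>{}</td><td>{}</td></tr>".format(rule[0], " ".join(rule[1])) for rule in self.r]
        return "<table>{}</table>".format("".join(rules))

# A two-rule grammar; the notebook displays it via _repr_html_.
CFG.from_rules([('S', ['NP', 'VP']), ('NP', ['DT', 'Nom'])])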

| Left-hand side | Right-hand side |
|---|---|
| S | NP VP |
| NP | DT Nom |


Let us now create an example CFG.

In [21]:
cfg = CFG.from_rules([('S', ['NP_p','VP_p']),('S',['NP_s','VP_s']), 
                      ('NP_p', ['Matko', 'raps']),
                      ('VP_p', ['are', 'ADJ']),
                      ('NP_s', ['Matko']),
                      ('VP_s', ['raps', 'in', 'StatNLP']),
                      ('ADJ', ['silly'])
                     ])

cfg

| Left-hand side | Right-hand side |
|---|---|
| S | NP_p VP_p |
| S | NP_s VP_s |
| NP_p | Matko raps |
| VP_p | are ADJ |
| NP_s | Matko |
| VP_s | raps in StatNLP |
| ADJ | silly |
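
The split into non-terminals and terminals for this grammar can be checked against the formal definition. The following is a minimal sketch, assuming rules are plain (left-hand side, right-hand side) pairs as above; `check_cfg` is a hypothetical helper introduced here for illustration and is not part of the `CFG` class.

```python
# The example rules, repeated here so that the snippet runs on its own.
rules = [('S', ['NP_p', 'VP_p']), ('S', ['NP_s', 'VP_s']),
         ('NP_p', ['Matko', 'raps']),
         ('VP_p', ['are', 'ADJ']),
         ('NP_s', ['Matko']),
         ('VP_s', ['raps', 'in', 'StatNLP']),
         ('ADJ', ['silly'])]


def check_cfg(n, sigma, r, s):
    """Hypothetical helper: test the conditions of the 4-tuple definition."""
    start_ok = s in n                                  # S is a non-terminal
    lhs_ok = all(lhs in n for lhs, _ in r)             # every X is in N
    rhs_ok = all(y in n or y in sigma                  # every Y_i is in N or Sigma
                 for _, rhs in r for y in rhs)
    return start_ok and lhs_ok and rhs_ok


# Same inference as in from_rules: left-hand sides are non-terminals,
# all remaining right-hand-side symbols are terminals.
non_terminals = {lhs for lhs, _ in rules}
terminals = {y for _, rhs in rules for y in rhs} - non_terminals

print(sorted(terminals))
print(check_cfg(non_terminals, terminals, rules, 'S'))
```

Note that the terminal `raps` occurs on the right-hand side of both `NP_p` and `VP_s`, so the same word can play different syntactic roles depending on the rule that introduces it.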


## (Left-most) Derivation
A left-most derivation given a CFG \\(G\\) is a sequence of strings \\(s_1 \ldots s_n\\) such that 

* \\(s_1 = S\\), that is, the first string consists only of the start symbol.
* \\(s_n \in \Sigma^*\\), that is, the last string consists of only terminals.
* Each \\(s_i\\) for \\(i > 1\\) is generated from \\(s_{i-1}\\) by replacing its left-most non-terminal \\(\alpha\\) with the right-hand side of some rule in \\(R\\) that has \\(\alpha\\) as its left-hand side. 

Let us write some code that puts this definition into action and generates random derivations based on a grammar. 

In [None]:
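import random


def generate_deriv(cfg, sentence, results=None):
    """Sample a random left-most derivation, starting from the symbol list `sentence`."""
    # The derivation so far; on the first call it holds only the initial string.
    actual_result = (sentence,) if results is None else results
    # Find the left-most non-terminal and its position, if there is one.
    non_terminals = ((t, i) for i, t in enumerate(sentence) if t in cfg.n)
    first_non_terminal, first_index = next(non_terminals, (None, -1))
    if first_non_terminal is not None:
        # Sample one of the rules that expand this non-terminal and apply it.
        relevant_rules = [rule for rule in cfg.r if rule[0] == first_non_terminal]
        sampled_rule = random.choice(relevant_rules)
        new_sentence = sentence[:first_index] + sampled_rule[1] + sentence[first_index+1:]
        return generate_deriv(cfg, new_sentence, actual_result + (new_sentence,))
    else:
        # Only terminals are left, so the derivation is complete.
        return actual_result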

Let us generate an example derivation.

In [69]:
generate_deriv(cfg, [cfg.s])

(['S'],
 ['NP_s', 'VP_s'],
 ['Matko', 'VP_s'],
 ['Matko', 'raps', 'in', 'StatNLP'])
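
Because the example grammar is small and has no recursive rules, all of its left-most derivations can also be enumerated instead of sampled. The following is a minimal sketch under that assumption; `all_derivations` and its `max_steps` cut-off are hypothetical names introduced here, and the rules are repeated as plain pairs so that the snippet runs on its own.

```python
rules = [('S', ['NP_p', 'VP_p']), ('S', ['NP_s', 'VP_s']),
         ('NP_p', ['Matko', 'raps']),
         ('VP_p', ['are', 'ADJ']),
         ('NP_s', ['Matko']),
         ('VP_s', ['raps', 'in', 'StatNLP']),
         ('ADJ', ['silly'])]


def all_derivations(rules, sentence, max_steps=10):
    """Yield every left-most derivation from `sentence` that needs at most max_steps rewrites."""
    non_terminals = {lhs for lhs, _ in rules}
    # Position of the left-most non-terminal, or None if only terminals remain.
    index = next((i for i, t in enumerate(sentence) if t in non_terminals), None)
    if index is None:
        yield (sentence,)
        return
    if max_steps == 0:
        return  # cut-off reached: abandon this branch (guards against recursive grammars)
    for lhs, rhs in rules:
        if lhs == sentence[index]:
            new_sentence = sentence[:index] + rhs + sentence[index + 1:]
            for rest in all_derivations(rules, new_sentence, max_steps - 1):
                yield (sentence,) + rest


derivations = list(all_derivations(rules, ['S']))
for derivation in derivations:
    print(" => ".join(" ".join(step) for step in derivation))
```

For this grammar the enumeration yields exactly two derivations, one for each `S` rule, since every other non-terminal has a single expansion.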

## Parse Trees
Derivations can be compactly represented as trees, where each non-leaf node corresponds to an expanded left-hand side and its children to the symbols on the right-hand side of the rule that was applied.
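
To make the correspondence concrete, a left-most derivation can be converted into a nested tree structure. The following is a minimal sketch, assuming trees are represented as `(label, children)` tuples with plain strings as leaves; `derivation_to_tree` is a hypothetical helper introduced here. It relies on the fact that a left-most derivation expands non-terminals in depth-first, left-to-right order.

```python
def derivation_to_tree(derivation, non_terminals):
    """Turn a left-most derivation into a nested (label, children) tree."""
    # Recover the right-hand side that was substituted at each step.
    expansions = []
    for current, following in zip(derivation, derivation[1:]):
        # Position of the left-most non-terminal in the current string.
        k = next(i for i, t in enumerate(current) if t in non_terminals)
        # One symbol was replaced, so the right-hand side has this length.
        rhs_len = len(following) - len(current) + 1
        expansions.append(following[k:k + rhs_len])
    remaining = iter(expansions)

    def build(symbol):
        if symbol not in non_terminals:
            return symbol  # terminals become leaves
        # The next unused expansion belongs to this node (pre-order traversal).
        rhs = next(remaining)
        return (symbol, [build(child) for child in rhs])

    return build(derivation[0][0])


# The derivation sampled above, with the non-terminals of the example grammar.
derivation = (['S'],
              ['NP_s', 'VP_s'],
              ['Matko', 'VP_s'],
              ['Matko', 'raps', 'in', 'StatNLP'])
non_terminals = {'S', 'NP_p', 'VP_p', 'NP_s', 'VP_s', 'ADJ'}

tree = derivation_to_tree(derivation, non_terminals)
print(tree)
```

Here the root `S` has the two children `NP_s` and `VP_s`, and the words of the sentence appear as leaves in their original order.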
