## Exercise 5
**Due: Saturday 18th**

The goal of this exercise is to familiarize you with treebank-style trees, grammars, and parsing. For this, we will use NLTK (the Natural Language Toolkit), a Python library. NLTK can also be used for many other NLP tasks, such as POS tagging, sentiment analysis, and word sense disambiguation.

### Installing and running NLTK

You can download and install the latest version of NLTK from http://nltk.org/.

The next step is to import NLTK into your environment and download the corpora used in this exercise. Calling `nltk.download()` with no arguments opens the interactive downloader, where you can select the `treebank` package. To carry out this step, type:
```
>>> import nltk
>>> nltk.download()
```

NLTK contains a number of corpora that come with parse trees, that is, each sentence in the corpus has already been annotated with its syntactic structure. The Penn Treebank is an example of such a parsed corpus. It contains several sub-corpora, including the Wall Street Journal (WSJ) corpus.

Import the portion of the WSJ corpus provided in NLTK:
```
>>> from nltk.corpus import treebank
```
The sample of the treebank shipped with NLTK contains 199 articles:
```
>>> len(treebank.fileids())
```
Each article contains a number of sentences, each of which has been parsed. The parsed sentences are provided as a list of trees. Print out the first sentence of the first article:
```
>>> t = treebank.parsed_sents('wsj_0001.mrg')[0]
>>> print(t)
```     
We can use `nltk.tree` to draw the trees, making them easier to read (`draw()` opens the tree in a separate Tkinter window):

```
>>> from nltk import tree
>>> t.draw()
```     


There are many more things that can be done with nltk.tree. The source code contains a “demo” section which outlines some of these things. If you are interested in learning more about nltk.tree please visit: http://www.nltk.org/_modules/nltk/tree.html
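
To give a flavour of what a tree object offers, here is a minimal sketch that rebuilds one noun phrase from the second sentence with `Tree.fromstring`, so it runs without the corpus download. It assumes NLTK is installed; the variable name `np_tree` is made up for this illustration.

```python
from nltk import Tree

# One NP subtree copied from the second sentence of wsj_0001.mrg.
np_tree = Tree.fromstring("(NP (DT the) (NNP Dutch) (VBG publishing) (NN group))")

print(np_tree.label())    # category of the root node
print(np_tree.leaves())   # the words, left to right
print(np_tree.pos())      # (word, POS tag) pairs

# Each local tree contributes one production: one for the NP node
# and one for each of the four preterminals.
for production in np_tree.productions():
    print(production)
```

The productions listed by the last loop are exactly the events that get counted when a grammar is read off a treebank.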

Print the first two trees from the treebank. Write down by hand the PCFG that can be derived from just these two trees using MLE. (Marks 3)
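
For reference, the maximum likelihood estimate of a rule probability is a relative frequency: the number of times the rule is used in the trees, divided by the number of times its left-hand side is expanded. The notation below is a sketch, with $\mathrm{count}(\cdot)$ taken over the two trees only:

$$
P(A \to \beta) = \frac{\mathrm{count}(A \to \beta)}{\sum_{\gamma} \mathrm{count}(A \to \gamma)} = \frac{\mathrm{count}(A \to \beta)}{\mathrm{count}(A)}
$$

For example, the tag DT occurs three times in the two trees, twice over "the" and once over "a", so

$$
P(\mathrm{DT} \to \textit{the}) = \frac{2}{3} \approx 0.667, \qquad P(\mathrm{DT} \to \textit{a}) = \frac{1}{3} \approx 0.333 .
$$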

```
import nltk
from nltk.corpus import treebank
from nltk import tree

# First parsed sentence of the first article
t = treebank.parsed_sents('wsj_0001.mrg')[0]
print(t)
print("-----")

# Second parsed sentence of the same article
t = treebank.parsed_sents('wsj_0001.mrg')[1]
print(t)
```

```
(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
-----
(S
  (NP-SBJ (NNP Mr.) (NNP Vinken))
  (VP
    (VBZ is)
    (NP-PRD
      (NP (NN chairman))
      (PP
        (IN of)
        (NP
          (NP (NNP Elsevier) (NNP N.V.))
          (, ,)
          (NP (DT the) (NNP Dutch) (VBG publishing) (NN group))))))
  (. .))
```


### Deduced rules

S -> NP-SBJ, VP, '.' : 1.0

---

NP-SBJ -> NP, ',', ADJP, ',' : 0.5

NP-SBJ -> NNP, NNP : 0.5

---

NP -> NN : 0.125

NP -> NP, ',', NP : 0.125

NP -> DT, NNP, VBG, NN : 0.125

NP -> DT, NN : 0.125

NP -> DT, JJ, NN : 0.125

NP -> NNP, NNP : 0.25

NP -> CD, NNS : 0.125

---

VP -> VBZ, NP-PRD : 0.333

VP -> MD, VP : 0.333

VP -> VB, NP, PP-CLR, NP-TMP : 0.333

---

ADJP -> NP, JJ : 1.0

---

PP-CLR -> IN, NP : 1.0

---

NP-TMP -> NNP, CD : 1.0

---

NP-PRD -> NP, PP : 1.0

--- 

PP -> IN, NP : 1.0

### Deduced lexicon

NNP -> Pierre : 0.125

NNP -> Vinken : 0.25

NNP -> Mr. : 0.125

NNP -> Dutch : 0.125

NNP -> Elsevier : 0.125

NNP -> N.V. : 0.125

NNP -> Nov. : 0.125

---

CD -> 61 : 0.5

CD -> 29 : 0.5

---

NNS -> years : 1.0

---

JJ -> old : 0.5

JJ -> nonexecutive : 0.5

---

MD -> will : 1.0

---

VB -> join : 1.0

VBZ -> is : 1.0

VBG -> publishing : 1.0

---

DT -> the : 0.667

DT -> a : 0.333

---

NN -> board : 0.25

NN -> director : 0.25

NN -> chairman : 0.25 

NN -> group : 0.25

---

IN -> as : 0.5

IN -> of : 0.5

---
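
As a cross-check on the hand-derived tables above, the same relative-frequency counts can be computed mechanically. The following is a minimal sketch in plain Python that reads the two bracketed trees as strings, so it needs neither NLTK nor the corpus download. The helper names `read_node` and `productions` are made up for this illustration. Unlike the tables above, it treats `,` and `.` as tags that rewrite to the punctuation token, so they show up as extra lexical rules.

```python
from collections import Counter

# The two trees printed earlier, copied as bracketed strings.
TREES = [
    """(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,)
            (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,))
          (VP (MD will)
            (VP (VB join) (NP (DT the) (NN board))
              (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
              (NP-TMP (NNP Nov.) (CD 29))))
          (. .))""",
    """(S (NP-SBJ (NNP Mr.) (NNP Vinken))
          (VP (VBZ is)
            (NP-PRD (NP (NN chairman))
              (PP (IN of)
                (NP (NP (NNP Elsevier) (NNP N.V.)) (, ,)
                  (NP (DT the) (NNP Dutch) (VBG publishing) (NN group))))))
          (. .))""",
]


def read_node(tokens, i):
    """Read the node whose "(" is at tokens[i]; return (node, next index)."""
    label = tokens[i + 1]
    i += 2
    children = []
    while tokens[i] != ")":
        if tokens[i] == "(":
            child, i = read_node(tokens, i)
            children.append(child)
        else:
            children.append(tokens[i])  # a word (terminal)
            i += 1
    return (label, children), i + 1


def productions(node):
    """Yield one (lhs, rhs) pair per local tree; words are quoted."""
    label, children = node
    yield label, tuple(c[0] if isinstance(c, tuple) else f"'{c}'" for c in children)
    for c in children:
        if isinstance(c, tuple):
            yield from productions(c)


rule_counts = Counter()
for text in TREES:
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    root, _ = read_node(tokens, 0)
    rule_counts.update(productions(root))

# count(A): how often each left-hand side is expanded.
lhs_counts = Counter()
for (lhs, _), n in rule_counts.items():
    lhs_counts[lhs] += n

# MLE: count(A -> beta) / count(A)
probs = {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {', '.join(rhs)} : {p:.3f}")
```

With NLTK available, `nltk.induce_pcfg` applied to the productions of the two trees should produce the same estimates; the plain-Python version is used here only to keep the counting visible.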

