# Part of Speech Tagging

For the following exercises, use the training portion of the Brown corpus as defined by the following code:

In [1]:
import nltk
nltk.download("brown")
from nltk.corpus import brown
import random
random.seed(0)
tagged_sents = list(brown.tagged_sents())
random.shuffle(tagged_sents)
train = tagged_sents[:10000]
test = tagged_sents[10000:10200]

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


Recall how the training data is structured: it is a list of sentences, and each sentence is a list of (word, tag) pairs. For example, the first sentence of the training set is:

In [2]:
train[0]

[('Stars', 'NNS-HL'), ('for', 'IN-HL'), ('marriage', 'NN-HL')]

### Exercise: Find all labels and all vocabulary

Using the training data (not the test data!), find the following information:

1. What is the vocabulary size?
2. What is the number of distinct labels?
3. What is the total number of words?
4. What is the total number of labels?


Vocabulary size is 23192
Label set size is 329
Number of words is 203375
Number of labels is 203375
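One possible way to obtain these four counts is sketched below. It is illustrated on a two-sentence toy corpus rather than on `train`, and it assumes that word forms are compared case-sensitively; the helper name `corpus_statistics` is hypothetical and not part of NLTK.

```python
def corpus_statistics(tagged_sents):
    """Return (vocabulary size, label set size, number of words, number of labels).

    Assumption: words are compared case-sensitively, so 'The' and 'the'
    count as two different vocabulary entries.
    """
    words = [word for sent in tagged_sents for word, _ in sent]
    labels = [label for sent in tagged_sents for _, label in sent]
    return len(set(words)), len(set(labels)), len(words), len(labels)


# Toy corpus with the same structure as `train`
toy = [[('my', 'PRP$'), ('sentence', 'NN')],
       [('my', 'PRP$'), ('second', 'OD'), ('sentence', 'NN')]]

vocab_size, label_set_size, n_words, n_labels = corpus_statistics(toy)
print("Vocabulary size is", vocab_size)
print("Label set size is", label_set_size)
print("Number of words is", n_words)
print("Number of labels is", n_labels)
```

Since every word carries exactly one label, the total number of words and the total number of labels are always equal.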


### Exercise: Find all word bigrams and all label bigrams

In the exercises below we compute statistics on which word is most likely to follow another word, which word or label is most likely to begin a sentence, and which is most likely to end one. A common way to collect statistics on the first and last token of a sentence is to add a special symbol such as `$` at the beginning and the end of each sentence. This is called *padding*. For this exercise, generate the word bigrams and the label bigrams after padding with the `$` symbol. For example, if the corpus is the following pair of sentences:

```
[[('my','PRP$'),('sentence','NN')],[('my','PRP$'),('second','OD'),('sentence','NN')]]
```

Then the word bigrams are:

```
[('$','my'),('my','sentence'),('sentence','$'),('$','my'),('my','second'),('second','sentence'),('sentence','$')]
```


And the label bigrams are:

```
[('$','PRP$'),('PRP$','NN'),('NN','$'),('$','PRP$'),('PRP$','OD'),('OD','NN'),('NN','$')]
```
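The padding step can be sketched as follows, using the two example sentences above. The helper names `padded_bigrams` and `corpus_bigrams` are hypothetical; this is one possible approach, not the only one.

```python
def padded_bigrams(sequence, pad='$'):
    """Return the bigrams of `sequence` after adding `pad` at both ends."""
    padded = [pad] + list(sequence) + [pad]
    return list(zip(padded, padded[1:]))


def corpus_bigrams(tagged_sents, pad='$'):
    """Return (word bigrams, label bigrams) for a list of tagged sentences."""
    word_bigrams, label_bigrams = [], []
    for sent in tagged_sents:
        word_bigrams.extend(padded_bigrams([word for word, _ in sent], pad))
        label_bigrams.extend(padded_bigrams([label for _, label in sent], pad))
    return word_bigrams, label_bigrams


corpus = [[('my', 'PRP$'), ('sentence', 'NN')],
          [('my', 'PRP$'), ('second', 'OD'), ('sentence', 'NN')]]
word_bigrams, label_bigrams = corpus_bigrams(corpus)
print(word_bigrams)
print(label_bigrams)
```

Each sentence is padded separately, so no bigram spans a sentence boundary directly; the boundary always shows up as a pair containing `$`.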

### Exercise: Most common words and labels beginning and ending a sentence

1. What are the most common words beginning a sentence? 
2. What are the most common words ending a sentence?
3. What are the most common PoS labels beginning a sentence?
4. What are the most common PoS labels ending a sentence?
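A minimal sketch for these four questions is shown below. It counts the first and last token of each sentence directly with `collections.Counter`, which is equivalent to looking at the bigrams that contain the padding symbol. The function name `sentence_edge_counts` is hypothetical, and the data is a toy corpus rather than `train`.

```python
from collections import Counter


def sentence_edge_counts(tagged_sents):
    """Count the words and labels that begin and end each sentence."""
    first_words, last_words = Counter(), Counter()
    first_labels, last_labels = Counter(), Counter()
    for sent in tagged_sents:
        if not sent:  # skip empty sentences, if any
            continue
        first_words[sent[0][0]] += 1
        first_labels[sent[0][1]] += 1
        last_words[sent[-1][0]] += 1
        last_labels[sent[-1][1]] += 1
    return first_words, last_words, first_labels, last_labels


toy = [[('my', 'PRP$'), ('sentence', 'NN')],
       [('my', 'PRP$'), ('second', 'OD'), ('sentence', 'NN')]]
first_words, last_words, first_labels, last_labels = sentence_edge_counts(toy)
print(first_words.most_common(5))
print(last_words.most_common(5))
print(first_labels.most_common(5))
print(last_labels.most_common(5))
```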

### Exercise: Most common PoS after another PoS

Write a function `most_common_label(label, n)` that, given a PoS label and a number `n`, prints the `n` most common labels that follow the given label.

In [11]:
most_common_label('$', 5)

[('AT', 1421), ('PPS', 1097), ('IN', 831), ('``', 763), ('RB', 681)]

In [12]:
most_common_label('NN', 3)

[('IN', 7362), ('.', 3499), (',', 3263)]
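One way to structure such a function is to precompute, for every label, a counter of the labels that follow it. The sketch below does this on a toy corpus; the helpers `build_followers` and `following_labels` and the module-level `followers` table are assumptions made for illustration, and in the exercise the table would be built from `train`.

```python
from collections import Counter, defaultdict


def build_followers(tagged_sents, pad='$'):
    """Map each label to a Counter of the labels that follow it (with padding)."""
    followers = defaultdict(Counter)
    for sent in tagged_sents:
        labels = [pad] + [label for _, label in sent] + [pad]
        for previous, current in zip(labels, labels[1:]):
            followers[previous][current] += 1
    return followers


toy = [[('my', 'PRP$'), ('sentence', 'NN')],
       [('my', 'PRP$'), ('second', 'OD'), ('sentence', 'NN')]]
followers = build_followers(toy)


def following_labels(label, n):
    """Return the n most common labels that follow `label`."""
    return followers.get(label, Counter()).most_common(n)


def most_common_label(label, n):
    """Print the n most common labels that follow `label`."""
    print(following_labels(label, n))


most_common_label('$', 5)
most_common_label('NN', 3)
```

Building the table once keeps each call cheap, which matters when the function is called repeatedly on the full training set.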

# Named Entity Recognition

The lectures of week 6 described several possible features for training a classifier for the task of named entity recognition. Implement feature extractors that focus on some of these features and train a Naive Bayes classifier. Use the training and test files from the CoNLL 2002 data.

In [13]:
import nltk
nltk.download("conll2002")
from nltk.corpus import conll2002

[nltk_data] Downloading package conll2002 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2002.zip.


In [14]:
conll2002.fileids()

['esp.testa', 'esp.testb', 'esp.train', 'ned.testa', 'ned.testb', 'ned.train']

In [15]:
train = conll2002.iob_sents('esp.train')
test = conll2002.iob_sents('esp.testa')

In [16]:
train[0]

[('Melbourne', 'NP', 'B-LOC'),
 ('(', 'Fpa', 'O'),
 ('Australia', 'NP', 'B-LOC'),
 (')', 'Fpt', 'O'),
 (',', 'Fc', 'O'),
 ('25', 'Z', 'O'),
 ('may', 'NC', 'O'),
 ('(', 'Fpa', 'O'),
 ('EFE', 'NC', 'B-ORG'),
 (')', 'Fpt', 'O'),
 ('.', 'Fp', 'O')]

### Exercise: Distribution of IOB tags

Calculate the distribution of IOB tags both in the train and the test set and confirm that the distributions are similar.
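A possible sketch of the calculation: count the third element of every token and normalise by the total. The function name `iob_distribution` is hypothetical, and the example runs on the single sentence shown above instead of the full `train` and `test` sets.

```python
from collections import Counter


def iob_distribution(iob_sents):
    """Return a dict mapping each IOB tag to its relative frequency."""
    counts = Counter(iob for sent in iob_sents for _, _, iob in sent)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in counts.items()}


sample = [[('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O'),
           ('Australia', 'NP', 'B-LOC'), (')', 'Fpt', 'O'),
           (',', 'Fc', 'O'), ('25', 'Z', 'O'), ('may', 'NC', 'O'),
           ('(', 'Fpa', 'O'), ('EFE', 'NC', 'B-ORG'), (')', 'Fpt', 'O'),
           ('.', 'Fp', 'O')]]

distribution = iob_distribution(sample)
for tag in sorted(distribution):
    print(f"{tag}: {distribution[tag]:.3f}")
```

Calling the same function on both sets and printing the results side by side makes it straightforward to compare the two distributions.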

### Exercise: Feature extractor

Implement an NLTK feature extractor that extracts the following features of a word:

* The part of speech (this is the second element of each token in the input data).
* True if the first letter is a capital letter, False otherwise.
* True if the word contains a digit, False otherwise.
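A feature extractor in the NLTK style is simply a function that returns a dictionary of feature names and values. The sketch below covers the three features listed above; the function name `ner_features` and the feature keys are arbitrary choices, and the function assumes that it receives the word and its part of speech as separate arguments.

```python
def ner_features(word, pos):
    """Return a feature dictionary for one token."""
    return {
        'pos': pos,
        # word[:1] avoids an IndexError on an empty string
        'capitalised': word[:1].isupper(),
        'has_digit': any(char.isdigit() for char in word),
    }


print(ner_features('Melbourne', 'NP'))
print(ner_features('25', 'Z'))
```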

### Exercise: Classification

Train an NLTK Naive Bayes classifier with the above features and test the accuracy on the test set. Compare the results with the system provided in the NER notebook of the lectures of week 6.
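The training and evaluation steps could look roughly like the sketch below, which uses `nltk.NaiveBayesClassifier.train` and `nltk.classify.accuracy`. To keep it runnable without downloading any data, it trains on the single sentence shown earlier and evaluates on a two-token test sentence invented for illustration, so the resulting accuracy is meaningless. The helpers `ner_features` and `to_featuresets` are hypothetical names. In the exercise, the toy lists would be replaced by `train` and `test`.

```python
import nltk


def ner_features(word, pos):
    """Feature dictionary for one token (same three features as above)."""
    return {
        'pos': pos,
        'capitalised': word[:1].isupper(),
        'has_digit': any(char.isdigit() for char in word),
    }


def to_featuresets(iob_sents):
    """Flatten IOB sentences into a list of (features, IOB tag) pairs."""
    return [(ner_features(word, pos), iob)
            for sent in iob_sents
            for word, pos, iob in sent]


toy_train = [[('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O'),
              ('Australia', 'NP', 'B-LOC'), (')', 'Fpt', 'O'),
              (',', 'Fc', 'O'), ('25', 'Z', 'O'), ('may', 'NC', 'O'),
              ('(', 'Fpa', 'O'), ('EFE', 'NC', 'B-ORG'), (')', 'Fpt', 'O'),
              ('.', 'Fp', 'O')]]
# Invented test sentence, for illustration only
toy_test = [[('Madrid', 'NP', 'B-LOC'), (',', 'Fc', 'O')]]

classifier = nltk.NaiveBayesClassifier.train(to_featuresets(toy_train))
accuracy = nltk.classify.accuracy(classifier, to_featuresets(toy_test))
print("Accuracy:", accuracy)
```

Note that this setup classifies each token independently, so the classifier sees no information about neighbouring words or tags.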