# Part of Speech Tagging

For the exercises below, use the training portion of the Brown corpus as defined by this code:

In [2]:
import nltk
nltk.download("brown")
from nltk.corpus import brown
import random
random.seed(0)
tagged_sents = list(brown.tagged_sents())
random.shuffle(tagged_sents)
train = tagged_sents[:10000]
test = tagged_sents[10000:10200]

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


Remember how the training set is structured: it is a list of sentences, and each sentence is a list of `(word, label)` pairs. For example, the first sentence of the training set is:

In [3]:
train[0]

[('Stars', 'NNS-HL'), ('for', 'IN-HL'), ('marriage', 'NN-HL')]

### Exercise: Find all labels and all vocabulary

Using the training data (not the test data!), find the following information:

1. What is the vocabulary size?
2. What is the number of distinct labels?
3. What is the total number of words?
4. What is the total number of labels?
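
As a starting point, the four quantities can be computed on a toy corpus that has the same structure as `train`. This is a minimal sketch rather than a reference solution: the names `corpus_stats` and `toy_train` are hypothetical, and the code uses only built-in Python so it can be checked without downloading the corpus.

```python
def corpus_stats(tagged_sents):
    """Return (vocabulary size, distinct labels, total words, total labels)."""
    # Flatten the list of sentences into one list of words and one of labels.
    words = [word for sent in tagged_sents for (word, _) in sent]
    labels = [label for sent in tagged_sents for (_, label) in sent]
    # Sets keep one copy of each item, so their sizes count distinct items.
    return len(set(words)), len(set(labels)), len(words), len(labels)


# Hypothetical toy corpus with the same shape as `train`.
toy_train = [
    [('my', 'PRP$'), ('sentence', 'NN')],
    [('my', 'PRP$'), ('second', 'OD'), ('sentence', 'NN')],
]

print(corpus_stats(toy_train))
```

Because every word carries exactly one label, the last two numbers always coincide. The sketch is case-sensitive, so `The` and `the` would count as two vocabulary items.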


### Exercise: Find all word bigrams and all label bigrams

In the exercises below we will collect statistics about which word most often follows another word, which words and labels most often begin a sentence, and which most often end one. A common way to gather statistics about the first and last token of a sentence is to pad each sentence with a special symbol, such as `$`, at the beginning and at the end. For this exercise, generate the word bigrams and the label bigrams after padding with the `$` symbol. For example, if the corpus is the pair of sentences:

```
[[('my','PRP$'),('sentence','NN')],[('my','PRP$'),('second','OD'),('sentence','NN')]]
```

Then the word bigrams are:

```
[('$','my'),('my','sentence'),('sentence','$'),('$','my'),('my','second'),('second','sentence'),('sentence','$')]
```


And the label bigrams are:

```
[('$','PRP$'),('PRP$','NN'),('NN','$'),('$','PRP$'),('PRP$','OD'),('OD','NN'),('NN','$')]
```
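
The padding step can be sketched as follows. This is one possible approach, assuming a hypothetical helper `padded_bigrams`; it pairs each item of the padded sequence with its successor using `zip`, and it reproduces the two bigram lists for the example corpus.

```python
def padded_bigrams(tokens, pad='$'):
    """Return the bigrams of `tokens` after adding `pad` at both ends."""
    padded = [pad] + list(tokens) + [pad]
    # zip pairs each element with the one that follows it.
    return list(zip(padded, padded[1:]))


corpus = [
    [('my', 'PRP$'), ('sentence', 'NN')],
    [('my', 'PRP$'), ('second', 'OD'), ('sentence', 'NN')],
]

# Pad each sentence separately, then concatenate the per-sentence bigrams.
word_bigrams = [bg for sent in corpus
                for bg in padded_bigrams([word for word, _ in sent])]
label_bigrams = [bg for sent in corpus
                 for bg in padded_bigrams([label for _, label in sent])]

print(word_bigrams)
print(label_bigrams)
```

Padding each sentence on its own, rather than the flattened corpus, is what keeps the last word of one sentence from being paired with the first word of the next.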

### Exercise: Most common words and labels beginning and ending a sentence

1. What are the most common words beginning a sentence? 
2. What are the most common words ending a sentence?
3. What are the most common PoS labels beginning a sentence?
4. What are the most common PoS labels ending a sentence?
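
With padded bigrams, sentence-initial items are those that follow `$` and sentence-final items are those that precede it. The sketch below counts both with `collections.Counter` on the word bigrams of the two-sentence example; the helper name `boundary_counts` is hypothetical, and the same function would work unchanged on label bigrams.

```python
from collections import Counter


def boundary_counts(bigrams, pad='$'):
    """Count the items that begin and end sentences in padded bigrams."""
    # An item begins a sentence when the padding symbol precedes it.
    starts = Counter(second for first, second in bigrams
                     if first == pad and second != pad)
    # An item ends a sentence when the padding symbol follows it.
    ends = Counter(first for first, second in bigrams
                   if second == pad and first != pad)
    return starts, ends


word_bigrams = [('$', 'my'), ('my', 'sentence'), ('sentence', '$'),
                ('$', 'my'), ('my', 'second'), ('second', 'sentence'),
                ('sentence', '$')]

starts, ends = boundary_counts(word_bigrams)
print(starts.most_common(1))  # [('my', 2)]
print(ends.most_common(1))    # [('sentence', 2)]
```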

### Exercise: Most common PoS after another PoS

Write a function `most_common_label(label, n)` that, given a PoS label and a number `n`, prints the `n` most common labels that follow the given label. In your implementation, first create an NLTK `ConditionalFreqDist` from the label bigrams, and then implement the function so that it uses that `ConditionalFreqDist`.
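
A `ConditionalFreqDist` built from `(condition, sample)` pairs maps each condition to a frequency distribution over the samples seen with it. The sketch below illustrates the idea on the label bigrams of the two-sentence example, not on the Brown training set; the variable name `cfd_labels` is an assumption.

```python
import nltk

label_bigrams = [('$', 'PRP$'), ('PRP$', 'NN'), ('NN', '$'),
                 ('$', 'PRP$'), ('PRP$', 'OD'), ('OD', 'NN'), ('NN', '$')]

# Each bigram is read as (condition, sample): the first label is the
# condition and the label that follows it is the sample.
cfd_labels = nltk.ConditionalFreqDist(label_bigrams)


def most_common_label(label, n):
    """Print the n labels that most often follow `label`, with counts."""
    for next_label, count in cfd_labels[label].most_common(n):
        print(next_label, count)


most_common_label('NN', 1)
```

In this toy corpus `NN` is always followed by the padding symbol, so the call prints `$ 2`.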

### Optional exercise: Generate random sentences

The following function generates text greedily from a seed word: at each step it prints the word that most often follows the previous one. As the output shows, the generated text quickly falls into a repeating cycle and never reaches the end-of-sentence symbol `$`.

In [21]:
cfd_words = nltk.ConditionalFreqDist(XX)
def generate_text(seed, n):
    prev = seed
    for i in range(n):
        if prev == '$':
            return
        print(prev)
        prev = cfd_words[prev].max()

In [22]:
generate_text('the', 20)

the
same
time
,
and
the
same
time
,
and
the
same
time
,
and
the
same
time
,
and


Can you implement an improved version that is stochastic? That is, instead of printing the most common next word, each iteration picks the next word at random, with probability proportional to how often that word follows the previous word in the training data. Two consecutive calls to the function might print, for example:
```python
In []: generate_text_stochastic('the',20)
the
women
struggling
to
twist
my
life
span
everything
she
couldn't
be
noble
singular
.
In []: generate_text_stochastic('the',20)
the
uncomfortably
close
.
```
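
One way to make a single step random is weighted sampling with `random.choices`, using the bigram counts as weights. The sketch below shows only that step, on the toy word bigrams from earlier. The helper names `build_successors` and `sample_next` are hypothetical, and a standard-library `defaultdict(Counter)` stands in for the `ConditionalFreqDist` so that the example runs without NLTK.

```python
import random
from collections import Counter, defaultdict


def build_successors(bigrams):
    """Map each word to a Counter of the words that follow it."""
    successors = defaultdict(Counter)
    for prev, nxt in bigrams:
        successors[prev][nxt] += 1
    return successors


def sample_next(successors, prev, rng=random):
    """Draw a successor of `prev` with probability proportional to its count.

    Assumes `prev` was seen in the training bigrams; an unseen word has no
    successors to draw from.
    """
    counts = successors[prev]
    words = list(counts)
    weights = list(counts.values())
    return rng.choices(words, weights=weights, k=1)[0]


word_bigrams = [('$', 'my'), ('my', 'sentence'), ('sentence', '$'),
                ('$', 'my'), ('my', 'second'), ('second', 'sentence'),
                ('sentence', '$')]
successors = build_successors(word_bigrams)

rng = random.Random(0)  # seeded generator so runs are repeatable
print(sample_next(successors, 'my', rng))
```

Calling `sample_next` repeatedly, and stopping when it returns `$` or after `n` words, gives the stochastic behaviour described above.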

In [41]:
def generate_text_stochastic(seed, n):
  # Write your code here

In [45]:
generate_text_stochastic('The',20)

The
two
parallel
to
perpetual
easement
to
maintain
or
any
piece
of
mankind's
thrusts
to
Greville
in
themselves
,
picking


# Embedding Part of Speech Tagging in a Web Application

The lecture notes and notebooks show how you can use NLTK to implement a part of speech tagger. In this exercise you will incorporate a tagger into a web application.

The following file is a simple web application that uses [Python's bottle framework](http://bottlepy.org/docs/dev/index.html) and NLTK to present a Web form where a person can input text. When the form is submitted, the application annotates the text with its parts of speech and shows the result to the user.

* [pos.py](pos.py)

### Exercise: Try out the web application

Download the web application and run it on your computer. The terminal will show a URL. Open that URL in a browser and try out the application.

### Exercise: Add your own PoS tagger

Replace NLTK's tagger in the web application with an implementation of your own tagger. Remember that you only need to train the tagger once, and then you can use the trained model each time the user sends text with the form.
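
The "train once, reuse per request" pattern can be sketched as below. Everything here is an assumption made for illustration: the contents of `pos.py` are not shown in this document, so the request handler itself is omitted, and the names `train_unigram_model`, `MODEL` and `tag_text` are hypothetical. The toy tagger simply assigns each word its most frequent training label, and whitespace splitting stands in for proper tokenisation.

```python
from collections import Counter, defaultdict


def train_unigram_model(tagged_sents):
    """Return a dict mapping each word to its most frequent label."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, label in sent:
            counts[word][label] += 1
    return {word: labels.most_common(1)[0][0]
            for word, labels in counts.items()}


# Hypothetical toy training data; the exercise would use `train` instead.
TOY_TRAIN = [
    [('my', 'PRP$'), ('sentence', 'NN')],
    [('my', 'PRP$'), ('second', 'OD'), ('sentence', 'NN')],
]

# Training runs once, when the module is loaded, not on every request.
MODEL = train_unigram_model(TOY_TRAIN)


def tag_text(text, model=MODEL, default='NN'):
    """Tag whitespace-separated words, using `default` for unknown words."""
    return [(word, model.get(word, default)) for word in text.split()]


print(tag_text('my second sentence'))
```

A form handler in the web application would then call `tag_text` on the submitted text, so each request only performs dictionary lookups.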