In [1]:
import nltk
from nltk.corpus import brown
import pprint

## A Simple Baseline Tagger ##
Keep in mind that the Brown corpus is already tagged. The simplest possible tagger assigns the same tag, the **most likely** one, to every token. This gives us a baseline to compare other taggers against. So let's use the data we have to find the most frequent tag in the news portion of the corpus.

In [2]:
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]

We can use FreqDist and max() to find the **most likely tag** for English according to the Brown corpus: FreqDist counts how many times each tag has been assigned to a word in this corpus, and max() returns the tag with the highest count.

In [3]:
nltk.FreqDist(tags).max()

'NN'
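
To make the counting step concrete, here is a minimal sketch of the same idea using only the standard library. The `tagged_words` list is a small hand-made stand-in (an assumption for illustration, not real corpus output) for `brown.tagged_words(categories='news')`.

```python
from collections import Counter

# Hypothetical toy data standing in for brown.tagged_words(categories='news').
tagged_words = [
    ("the", "AT"), ("jury", "NN"), ("said", "VBD"),
    ("the", "AT"), ("election", "NN"), ("was", "BEDZ"),
    ("a", "AT"), ("success", "NN"), ("in", "IN"), ("Atlanta", "NP"),
    ("this", "DT"), ("year", "NN"),
]

# Count how often each tag occurs, ignoring the words themselves.
tag_counts = Counter(tag for (word, tag) in tagged_words)

# The most frequent tag is the one a baseline tagger would assign everywhere.
most_likely_tag, count = tag_counts.most_common(1)[0]
print(most_likely_tag, count)
```

On this toy list the winner is also `NN`, but only because the list was written that way; the real answer comes from the corpus counts above.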

Now that we know empirically which is the most likely tag for English, we can make a baseline tagger that automatically assigns the most likely tag when we don't know what else to do.

In [4]:
default_tagger = nltk.DefaultTagger('NN')
raw = r'''what will this silly tagger do?'''
tokens = nltk.word_tokenize(raw)
print (default_tagger.tag(tokens))

[('what', 'NN'), ('will', 'NN'), ('this', 'NN'), ('silly', 'NN'), ('tagger', 'NN'), ('do', 'NN'), ('?', 'NN')]


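## Train a Unigram Tagger From Pre-Tagged Text ##
Now train a unigram tagger on the news portion of the Brown corpus. A unigram tagger gives each word the tag it received most often in the training data; a word it has never seen (like "tagger" below) gets None.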

In [5]:
brown_tagged_sents = brown.tagged_sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
print (unigram_tagger.tag(tokens))

[('what', 'WDT'), ('will', 'MD'), ('this', 'DT'), ('silly', 'JJ'), ('tagger', None), ('do', 'DO'), ('?', '.')]


### Separate Training From Testing Data ###
But really we need to separate training and testing data; otherwise we would be evaluating the tagger on sentences it has already seen. Python's slicing operator makes this *really* easy. Here we use the first 90% of the sentences for training and the remaining 10% for testing.

In [6]:
def create_data_sets(tagged_sents):
    size = int(len(tagged_sents) * 0.9)
    train_sents = tagged_sents[:size]
    test_sents = tagged_sents[size:]
    return train_sents, test_sents

sample_sents = brown.tagged_sents(categories='news')
train_sents, test_sents = create_data_sets(sample_sents)

unigram_tagger = nltk.UnigramTagger(train_sents)
print (unigram_tagger.tag(tokens))

[('what', 'WDT'), ('will', 'MD'), ('this', 'DT'), ('silly', 'JJ'), ('tagger', None), ('do', 'DO'), ('?', '.')]
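
As a quick sanity check on the slicing logic, the sketch below runs the same kind of split on ten made-up sentences. The helper name `split_sents`, its `train_fraction` parameter, and the toy data are assumptions for illustration only.

```python
def split_sents(tagged_sents, train_fraction=0.9):
    """Return (leading part, trailing part) of a sequence, split by fraction."""
    size = int(len(tagged_sents) * train_fraction)
    return tagged_sents[:size], tagged_sents[size:]

# Hypothetical stand-in for a list of tagged sentences.
toy_sents = [[("sentence", "NN"), (str(i), "CD")] for i in range(10)]

toy_train, toy_test = split_sents(toy_sents)
print(len(toy_train), len(toy_test))

# The two slices cover the data exactly once, in the original order.
print(toy_train + toy_test == toy_sents)
```

Because slicing keeps the original order, the test set is simply the last 10% of the sentences; nothing is shuffled.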


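### Evaluation Metric ###
NLTK's taggers have a handy evaluation function built right in! evaluate() runs your tagger over the test sentences and compares its output with the gold-standard tags from the Brown corpus. The score shown below is computed over the entire test collection. (In recent NLTK releases the same method is also available as accuracy(), and evaluate() is marked as deprecated.)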

In [7]:
print ("%0.3f" % unigram_tagger.evaluate(test_sents))


0.812


### Question ###
What is this evaluation metric measuring?
* Answer: 



Which tags did it get wrong? If you want to compare the gold standard tags with what the tagger produced, here is some code to do it (written by Jason Ost, MIMS from 2014). The first column is the word, the second is the tag from the gold standard, and the third is the tag the tagger assigned.

The last element is a little tricky. The tagger's tag() function expects a list of words as input, so you have to enclose "w" in square brackets. It returns a list of tagged words (as two-element tuples), so you have to take the second element of the first tuple, which is the predicted tag. This works because the unigram tagger looks at each word in isolation.

In [8]:
[(w, t, unigram_tagger.tag([w])[0][1]) for w, t in test_sents[3]]

[('For', 'IN', 'IN'),
 ('18', 'CD', 'CD'),
 ('months', 'NNS', 'NNS'),
 (',', ',', ','),
 ('Hamilton', 'NP', None),
 ('Holmes', 'NP', None),
 (',', ',', ','),
 ('19', 'CD', 'CD'),
 (',', ',', ','),
 ('and', 'CC', 'CC'),
 ('Charlayne', 'NP', None),
 ('Hunter', 'NP', 'NP-TL'),
 (',', ',', ','),
 ('18', 'CD', 'CD'),
 (',', ',', ','),
 ('had', 'HVD', 'HVD'),
 ('tried', 'VBN', 'VBN'),
 ('to', 'TO', 'TO'),
 ('get', 'VB', 'VB'),
 ('into', 'IN', 'IN'),
 ('the', 'AT', 'AT'),
 ('university', 'NN', 'NN'),
 ('.', '.', '.')]

In [9]:
[(w, t, unigram_tagger.tag([w])[0][1]) for w, t in test_sents[16]]

[('Negro', 'NP', 'NP'),
 ('lawyers', 'NNS', 'NNS'),
 ('dug', 'VBD', 'VBN'),
 ('into', 'IN', 'IN'),
 ('the', 'AT', 'AT'),
 ('records', 'NNS', 'NNS'),
 ('of', 'IN', 'IN'),
 ('300', 'CD', 'CD'),
 ('white', 'JJ', 'JJ'),
 ('students', 'NNS', 'NNS'),
 (',', ',', ','),
 ('found', 'VBD', 'VBN'),
 ('that', 'CS', 'CS'),
 ('many', 'AP', 'AP'),
 ('were', 'BED', 'BED'),
 ('hardly', 'RB', 'RB'),
 ('interviewed', 'VBN', 'VBD'),
 ('at', 'IN', 'IN'),
 ('all', 'ABN', 'ABN'),
 ('--', '--', '--'),
 ('and', 'CC', 'CC'),
 ('few', 'AP', 'AP'),
 ('had', 'HVD', 'HVD'),
 ('academic', 'JJ', 'JJ'),
 ('records', 'NNS', 'NNS'),
 ('as', 'QL', 'CS'),
 ('good', 'JJ', 'JJ'),
 ('as', 'CS', 'CS'),
 ('Hamilton', 'NP', None),
 ('Holmes', 'NP', None),
 ('.', '.', '.')]

## Train an N-Gram Tagger With Backoff ##

Below is code for a bigram tagger with backoff. When it encounters a token, it:
1. Tries tagging the token with the bigram tagger.
2. If the bigram tagger is unable to find a tag for the token, tries the unigram tagger.
3. If the unigram tagger is also unable to find a tag, uses the default tagger.

In [19]:
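def build_backoff_tagger(train_sents):
    # Each tagger falls back to the one before it when it has no tag to offer.
    t0 = nltk.DefaultTagger('NN')                     # last resort: always 'NN'
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)  # most frequent tag per word
    t2 = nltk.BigramTagger(train_sents, backoff=t1)   # word plus previous tag
    return t2

ngram_tagger = build_backoff_tagger(train_sents)
bigram_tagger = ngram_tagger  # alias used in the comparison code below
print("%0.3f" % ngram_tagger.evaluate(test_sents))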

0.845
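
The three-step backoff order described above can be sketched in plain Python. This is a simplified illustration, not NLTK's implementation: the function `tag_with_backoff` and the two lookup tables are hypothetical, and the tables are written by hand instead of being learned from training sentences.

```python
def tag_with_backoff(tokens, bigram_table, unigram_table, default_tag="NN"):
    """Tag tokens left to right: bigram lookup, then unigram, then default."""
    tagged = []
    prev_tag = None  # there is no previous tag at the start of a sentence
    for word in tokens:
        tag = bigram_table.get((prev_tag, word))  # 1. previous tag + word
        if tag is None:
            tag = unigram_table.get(word)         # 2. the word on its own
        if tag is None:
            tag = default_tag                     # 3. give up and use the default
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

# Hypothetical hand-written tables; a real tagger learns these from data.
unigram_table = {"the": "AT", "to": "TO", "race": "NN", "won": "VBD"}
bigram_table = {("TO", "race"): "VB"}

print(tag_with_backoff(["the", "race"], bigram_table, unigram_table))
print(tag_with_backoff(["to", "race"], bigram_table, unigram_table))
print(tag_with_backoff(["Hamilton", "won"], bigram_table, unigram_table))
```

In the second call the bigram table overrides the unigram tag for "race" because the previous tag is `TO`; in the third, "Hamilton" is in neither table, so it falls through to the default.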


If you want to compare the output of your trained bigram tagger to the gold standard, here is code for that as well, again courtesy of Jason Ost. This is a little more complicated, since you need to give the tagger not only the current word but also the one before it. For the first word in a sentence there is no previous word, so nltk.bigrams() supplies a padding element instead.

In [11]:
[(w2, t2, bigram_tagger.tag([w1,w2])[1][1]) 
 for (w1, t1), (w2, t2) in nltk.bigrams(test_sents[16], pad_left=True, pad_symbol=(None, None))]

[('Negro', 'NP', 'NP'),
 ('lawyers', 'NNS', 'NNS'),
 ('dug', 'VBD', 'VBN'),
 ('into', 'IN', 'IN'),
 ('the', 'AT', 'AT'),
 ('records', 'NNS', 'NNS'),
 ('of', 'IN', 'IN'),
 ('300', 'CD', 'CD'),
 ('white', 'JJ', 'JJ'),
 ('students', 'NNS', 'NNS'),
 (',', ',', ','),
 ('found', 'VBD', 'VBD'),
 ('that', 'CS', 'CS'),
 ('many', 'AP', 'AP'),
 ('were', 'BED', 'BED'),
 ('hardly', 'RB', 'RB'),
 ('interviewed', 'VBN', 'VBD'),
 ('at', 'IN', 'IN'),
 ('all', 'ABN', 'ABN'),
 ('--', '--', '--'),
 ('and', 'CC', 'CC'),
 ('few', 'AP', 'AP'),
 ('had', 'HVD', 'HVD'),
 ('academic', 'JJ', 'JJ'),
 ('records', 'NNS', 'NNS'),
 ('as', 'QL', 'CS'),
 ('good', 'JJ', 'JJ'),
 ('as', 'CS', 'CS'),
 ('Hamilton', 'NP', 'NN'),
 ('Holmes', 'NP', 'NN'),
 ('.', '.', '.')]
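
An alternative to tagging word pairs is to tag the whole sentence once and line the result up with the gold tags. The sketch below assumes only that the tagger is a callable taking a list of words and returning (word, tag) pairs; `compare_with_gold`, `toy_tag`, and `toy_lexicon` are hypothetical names, and the toy tagger is a plain dictionary lookup so the example runs without a trained model.

```python
def compare_with_gold(tag_fn, gold_sent, errors_only=False):
    """Return (word, gold_tag, predicted_tag) rows for one tagged sentence."""
    words = [word for (word, tag) in gold_sent]
    predicted = tag_fn(words)  # tag the sentence in one call
    rows = [(word, gold, guess)
            for (word, gold), (_, guess) in zip(gold_sent, predicted)]
    if errors_only:
        rows = [row for row in rows if row[1] != row[2]]
    return rows

# Hypothetical stand-in tagger: looks each word up in a fixed dictionary.
toy_lexicon = {"lawyers": "NNS", "dug": "VBN", "into": "IN",
               "the": "AT", "records": "NNS"}

def toy_tag(words):
    return [(w, toy_lexicon.get(w)) for w in words]

gold = [("lawyers", "NNS"), ("dug", "VBD"), ("into", "IN"),
        ("the", "AT"), ("records", "NNS")]

print(compare_with_gold(toy_tag, gold, errors_only=True))
```

With an NLTK tagger you would pass its tag method, for example `compare_with_gold(bigram_tagger.tag, test_sents[16])`. The predictions may differ from the pairwise listing above, because the tagger then sees the tags it assigned earlier in the sentence as context.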

## EXERCISE: Train and Evaluate a Trigram Tagger ##

Modify build_backoff_tagger() to build a backoff trigram tagger.  Evaluate the results.  How does it do compared to the previous backoff tagger?

In [20]:
def build_trigram_backoff_tagger(train_sents):
    # Same chain as build_backoff_tagger(), with a trigram tagger on top.
    # It has its own name so the bigram version stays available to later cells.
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    t3 = nltk.TrigramTagger(train_sents, backoff=t2)
    return t3

trigram_tagger = build_trigram_backoff_tagger(train_sents)
print("%0.3f" % trigram_tagger.evaluate(test_sents))

0.843


## EXERCISE: Train a Simplified Tagger ##
Train and evaluate a bigram backoff tagger like the one above, but using the universal tagset for the Brown corpus (or make a tagset of your own by discarding all but the first character of each tag name). This tagger has fewer distinctions to make, but each tag covers a wider range of words. Evaluate its performance. How does it compare to the previous tagger?

In [13]:
train, test = create_data_sets(brown.tagged_sents(categories='news', tagset='universal'))

ut = build_backoff_tagger(train)
print ("%0.3f" % ut.evaluate(test))

0.846


In [14]:
[(w2, t2, ut.tag([w1,w2])[1][1]) 
 for (w1, t1), (w2, t2) in nltk.bigrams(test[16], pad_left=True, pad_symbol=(None, None))]

[('Negro', 'NOUN', 'NOUN'),
 ('lawyers', 'NOUN', 'NOUN'),
 ('dug', 'VERB', 'VERB'),
 ('into', 'ADP', 'ADP'),
 ('the', 'DET', 'DET'),
 ('records', 'NOUN', 'NOUN'),
 ('of', 'ADP', 'ADP'),
 ('300', 'NUM', 'NUM'),
 ('white', 'ADJ', 'ADJ'),
 ('students', 'NOUN', 'NOUN'),
 (',', '.', '.'),
 ('found', 'VERB', 'VERB'),
 ('that', 'ADP', 'ADP'),
 ('many', 'ADJ', 'ADJ'),
 ('were', 'VERB', 'VERB'),
 ('hardly', 'ADV', 'ADV'),
 ('interviewed', 'VERB', 'VERB'),
 ('at', 'ADP', 'ADP'),
 ('all', 'PRT', 'PRT'),
 ('--', '.', '.'),
 ('and', 'CONJ', 'CONJ'),
 ('few', 'ADJ', 'ADJ'),
 ('had', 'VERB', 'VERB'),
 ('academic', 'ADJ', 'ADJ'),
 ('records', 'NOUN', 'NOUN'),
 ('as', 'ADV', 'ADP'),
 ('good', 'ADJ', 'ADJ'),
 ('as', 'ADP', 'ADP'),
 ('Hamilton', 'NOUN', 'NN'),
 ('Holmes', 'NOUN', 'NN'),
 ('.', '.', '.')]
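
Note that in the output above the unknown names "Hamilton" and "Holmes" receive `NN`, the default tag hard-coded in build_backoff_tagger(). `NN` is not a universal tag, so those tokens count as errors even though the gold tag is `NOUN`. The same care is needed for the other option mentioned in the exercise, keeping only the first character of each tag. Here is a minimal sketch of that truncation; `truncate_tags` and the toy sentences are hypothetical.

```python
def truncate_tags(tagged_sents, keep=1):
    """Return a copy of the sentences with each tag cut to its first characters."""
    return [[(word, tag[:keep]) for (word, tag) in sent]
            for sent in tagged_sents]

# Hypothetical stand-in for brown.tagged_sents(categories='news').
toy_sents = [
    [("the", "AT"), ("lawyers", "NNS"), ("dug", "VBD")],
    [("Hamilton", "NP"), ("had", "HVD"), ("tried", "VBN")],
]

simplified = truncate_tags(toy_sents)
print(simplified)

# The set of tags that remain after truncation.
tagset = sorted({tag for sent in simplified for (_, tag) in sent})
print(tagset)
```

With a tagset like this, a default tag of `N` would be the natural counterpart to `NN`; with the universal tagset it would be `NOUN`. Whether changing the default improves the score here is something to check by re-running the evaluation.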

## Evaluating a Tagger by Looking at Tags that Follow Tags ##
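(For this exercise, use your regular tagger, not the simplified one.) The word **to** is frequently mistagged, since it can be an infinitive marker (TO) or a preposition (IN); it can be helpful to inspect the context it occurs in. This code tags the editorial portion of the corpus and shows how often each tag *follows* the word when the tagger gives it a particular tag.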

In [15]:
def examine_tag_contexts(tagger, target_word, target_tag):
    test_sents = [tagger.tag(sent) for sent in brown.sents(categories='editorial')]
    tags = [b[1] for test_sent in test_sents 
            for (a,b) in nltk.bigrams(test_sent)
            if a[0] == target_word and a[1] == target_tag]
    fd = nltk.FreqDist(tags)
    print ("Tags that follow the target word and tag " + target_word + " and " + target_tag)
    fd.tabulate(20)
examine_tag_contexts(ngram_tagger, 'to', 'TO')
examine_tag_contexts(ngram_tagger, 'to', 'IN')                                               

Tags that follow the target word and tag to and TO
  VB   NN   AT   BE   DO  PP$   JJ  PPO   NP   HV  NNS NN-TL   DT   CD  VBN JJ-TL   CS  VBG  WDT   `` 
 361  220  155   83   28   28   25   18   18   12    9    7    7    6    6    6    6    6    5    5 
Tags that follow the target word and tag to and IN
  VB   NN   AT   BE   NP   CD   JJ  PPO  NNS   DT   DO   HV  PP$  ABN   `` NN-TL  DTS  DTI  VBD NNS-TL 
 142  114  103   29   14   13    9    9    9    6    5    4    4    3    3    3    3    3    3    2 
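
The counting inside examine_tag_contexts() can be tried out on a few hand-tagged sentences. This sketch uses `zip(sent, sent[1:])` in place of `nltk.bigrams()` and `collections.Counter` in place of `FreqDist`; the function name `following_tags` and the toy sentences are assumptions for illustration.

```python
from collections import Counter

def following_tags(tagged_sents, target_word, target_tag):
    """Count the tags of tokens that come right after (target_word, target_tag)."""
    counts = Counter()
    for sent in tagged_sents:
        # Pair every token with its right-hand neighbour.
        for (word, tag), (_, next_tag) in zip(sent, sent[1:]):
            if word == target_word and tag == target_tag:
                counts[next_tag] += 1
    return counts

# Hypothetical hand-tagged sentences standing in for tagger output.
toy_tagged = [
    [("I", "PPSS"), ("want", "VB"), ("to", "TO"), ("go", "VB")],
    [("went", "VBD"), ("to", "IN"), ("the", "AT"), ("store", "NN")],
    [("to", "TO"), ("be", "BE")],
    [("hope", "VB"), ("to", "TO"), ("win", "VB")],
]

print(following_tags(toy_tagged, "to", "TO").most_common())
print(following_tags(toy_tagged, "to", "IN").most_common())
```

In the toy data every "to"/IN is followed by a noun phrase, as you would expect of a preposition. In the real output above, "to"/IN is followed by VB 142 times, which is the kind of pattern worth inspecting for tagging errors.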
