Now we shall see how to train our own custom tagger on a corpus of tagged sentences, such as Treebank. There are many different ways to train custom taggers; some are discussed here: http://www.nltk.org/book/ch05.htm. We shall focus on the ones we studied in the textbook.

Below we load tagged sentences from the Treebank corpus and collect all unique tags and unique words. The unique tags will become the states of the HMM and the unique words will be its observation symbols. Let's analyze this:

In [None]:
import nltk
from nltk.corpus import treebank



# Training sentences extracted from the Treebank corpus
# Let's take the first 100 sentences only
trainData = treebank.tagged_sents()[:100]

# Extract distinct words (observations) and tags (states)
allStates = set()           # tags in our case
observationSymbols = set()  # vocabulary in our case
for t in trainData:
    for (word, tag) in t:
        allStates.add(tag)
        observationSymbols.add(word)

allStates = list(allStates)
observationSymbols = list(observationSymbols)

print("Total States (tags): ", len(allStates))
print("Total Observation Symbols (Vocab): ", len(observationSymbols))

print("********** A sample sentence **********")
print(trainData[3])



Let's build an HMM by training it on the data extracted above. There are two ways to train an HMM: a supervised method and an unsupervised method. In the supervised method we know both the observation sequences and their states in the training data, and we estimate the probabilities using a maximum likelihood estimate (MLE). In the unsupervised method we use the Baum-Welch algorithm to train the HMM on observation sequences only (the states are not present in the data). It learns the probabilities itself by iterating over the data.

We shall use the supervised method, as we have both the word (observation) sequences and the tags (states). In addition to training, we also test the model on a test sentence to generate POS (part-of-speech) tags.
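
To make the supervised estimate concrete, here is a minimal pure-Python sketch of counting-based MLE on a made-up three-sentence corpus. The toy sentences, the `<s>` start marker, and the helper name `mle` are illustrative assumptions; they are not part of NLTK or Treebank.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (made up for illustration; not taken from Treebank)
toy_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

transition_counts = defaultdict(Counter)  # previous tag -> counts of next tag
emission_counts = defaultdict(Counter)    # tag -> counts of emitted word

for sent in toy_sents:
    prev_tag = "<s>"  # hypothetical start-of-sentence marker
    for word, tag in sent:
        transition_counts[prev_tag][tag] += 1
        emission_counts[tag][word] += 1
        prev_tag = tag


def mle(counter, key):
    """Relative frequency: count(key) / total count."""
    total = sum(counter.values())
    return counter[key] / total if total else 0.0


p_nn_given_dt = mle(transition_counts["DT"], "NN")    # P(NN | DT)
p_dog_given_nn = mle(emission_counts["NN"], "dog")    # P(dog | NN)
p_fish_given_nn = mle(emission_counts["NN"], "fish")  # unseen word
print(p_nn_given_dt, p_dog_given_nn, p_fish_given_nn)
```

Note that the unseen word "fish" receives probability zero under every tag, which is the weakness of the unsmoothed estimate.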

In [None]:
from nltk.probability import MLEProbDist
from nltk.tag import hmm

# Unsmoothed maximum likelihood estimator (relative frequencies)
smoothingFunction = lambda fdist, bins: MLEProbDist(fdist, bins)

# Train the HMM on the tagged sentences
trainer = hmm.HiddenMarkovModelTrainer(states=allStates, symbols=observationSymbols)
tagger = trainer.train_supervised(trainData, estimator=smoothingFunction)
print(tagger)

# Tag a test sentence
test = "Cigarette has caused a high percentage of cancer."
print(tagger.tag(nltk.word_tokenize(test)))

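Look closely at the output above and compare the tags with those of the sample Treebank sentence printed by the first code cell.

All the tags are incorrect. We have the same JJR tag for everything. The likely cause is that the unsmoothed MLE assigns zero probability to any word that never occurred in the training data, so every candidate tag sequence for a sentence containing such a word scores zero and the tagger cannot prefer one over another.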


Let's try to fix this by using Laplace smoothing; see the code below. When you run it, you'll find that most of the tags are correct when compared with the original tags shown above. However, some tags are still incorrect. Can you determine which ones? The accuracy on the test sentence is approximately 70-80% (just count how many tags are correct).

In [None]:
from nltk.probability import LaplaceProbDist
test = "Cigarette has caused a high percentage of cancer."

# Laplace (add-one) smoothing instead of the unsmoothed MLE
smoothingFunction = lambda fdist, bins: LaplaceProbDist(fdist, bins)
trainer = hmm.HiddenMarkovModelTrainer(states=allStates, symbols=observationSymbols)
tagger = trainer.train_supervised(trainData, estimator=smoothingFunction)

# Prints the basic data about the tagger
print(tagger)
print(tagger.tag(nltk.word_tokenize(test)))
#print (tagger.tag("Most parts of speech can
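
The effect of Laplace smoothing can be seen on a single emission distribution. The sketch below compares the unsmoothed estimate with the add-one estimate (count + 1) / (N + B), where N is the total count and B is the number of bins. The counts, the bin count of 6, and the helper names `mle_prob` and `laplace_prob` are made up for illustration.

```python
from collections import Counter


def mle_prob(counts, word):
    """Unsmoothed relative frequency: count / N."""
    total = sum(counts.values())
    return counts[word] / total


def laplace_prob(counts, word, bins):
    """Add-one estimate: (count + 1) / (N + B), with B possible outcomes."""
    total = sum(counts.values())
    return (counts[word] + 1) / (total + bins)


# Toy emission counts for a single tag (made-up numbers)
nn_counts = Counter({"cancer": 3, "percentage": 1})
vocab_size = 6  # assumed number of distinct observation symbols (bins)

mle_unseen = mle_prob(nn_counts, "Cigarette")                  # 0 / 4
lap_unseen = laplace_prob(nn_counts, "Cigarette", vocab_size)  # 1 / 10
lap_seen = laplace_prob(nn_counts, "cancer", vocab_size)       # 4 / 10
print(mle_unseen, lap_unseen, lap_seen)
```

With smoothing, an unseen word keeps a small non-zero probability under each tag, so the transition probabilities can still guide the tagger towards a sensible tag.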


Let's try to improve our accuracy by training on a larger number of sentences. With approximately 2000 sentences, we are almost there (see below). At the end, we shall also compare the output with that of NLTK's pre-trained tagger.

In [None]:
import nltk
from nltk.probability import LaplaceProbDist
from nltk.corpus import treebank
from nltk.tag import hmm


# Use the first 2000 tagged sentences this time
trainData2 = treebank.tagged_sents()[:2000]

allStates2 = set()           # tags in our case
observationSymbols2 = set()  # vocabulary in our case
for t in trainData2:
    for (word, tag) in t:
        allStates2.add(tag)
        observationSymbols2.add(word)

allStates2 = list(allStates2)
observationSymbols2 = list(observationSymbols2)

print("Total States (tags): ", len(allStates2))
print("Total Observation Symbols (Vocab): ", len(observationSymbols2))

smoothingFunction = lambda fdist, bins: LaplaceProbDist(fdist, bins)

trainer = hmm.HiddenMarkovModelTrainer(states=allStates2,
                                       symbols=observationSymbols2)
tagger = trainer.train_supervised(trainData2, estimator=smoothingFunction)

print(tagger)

test = "Cigarette has caused a high percentage of cancer."

print("\n**** Test: Output of HMM tagger based on Viterbi algorithm ****")
print(tagger.tag(nltk.word_tokenize(test)))

print("\n**** Comparison with NLTK's trained tagger ****")
print(nltk.pos_tag(nltk.word_tokenize(test)))
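
For readers who want to see what Viterbi decoding does, here is a minimal log-space sketch on a toy HMM. It is not NLTK's implementation; the function name `viterbi`, the three tags, and all probabilities below are made-up assumptions chosen so the example can be checked by hand.

```python
import math


def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for `words` (log-space Viterbi)."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][s]: log-probability of the best path that ends in state s at word i
    best = [{s: log(start_p[s]) + log(emit_p[s].get(words[0], 0.0))
             for s in states}]
    back = [{}]  # back[i][s]: predecessor of s on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: best[i - 1][r] + log(trans_p[r][s]))
            best[i][s] = (best[i - 1][prev] + log(trans_p[prev][s])
                          + log(emit_p[s].get(words[i], 0.0)))
            back[i][s] = prev

    # Trace back from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))


# Toy model (all numbers are made up for illustration)
states = ["DT", "NN", "VBZ"]
start_p = {"DT": 0.8, "NN": 0.1, "VBZ": 0.1}
trans_p = {
    "DT": {"DT": 0.05, "NN": 0.9, "VBZ": 0.05},
    "NN": {"DT": 0.1, "NN": 0.2, "VBZ": 0.7},
    "VBZ": {"DT": 0.5, "NN": 0.3, "VBZ": 0.2},
}
emit_p = {
    "DT": {"the": 0.9, "a": 0.1},
    "NN": {"dog": 0.5, "cat": 0.4, "barks": 0.1},
    "VBZ": {"barks": 0.6, "sleeps": 0.4},
}

tags = viterbi(["the", "dog", "barks"], states, start_p, trans_p, emit_p)
print(tags)
```

In this toy model the word "barks" can be emitted by both NN and VBZ; the transition probabilities from the preceding NN are what settle the choice.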

Exercise 3.2: Train an HMM on the sentences of the Brown corpus. Divide the data set (tagged sentences) into 70% for training and 30% for testing. Find the accuracy of your trained HMM on the sentences in the test data. To determine the accuracy, you will need to compare the predicted tags of the words with their original tags in the test data.


Exercise 3.3: Using the same test data as in Exercise 3.2, predict the tags with NLTK's tagger and determine the accuracy of its predictions.
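
As a starting point for both exercises, the split and the accuracy calculation can be sketched independently of any tagger. The helper names `split_sentences` and `token_accuracy` and the toy data are hypothetical; plugging in the Brown corpus and the trained tagger is left to you.

```python
def split_sentences(tagged_sents, train_percent=70):
    """Split a list of tagged sentences into (train, test) by position."""
    cut = len(tagged_sents) * train_percent // 100
    return tagged_sents[:cut], tagged_sents[cut:]


def token_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, pred in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, pred_tag) in zip(gold, pred):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0


# Toy check with made-up data: one of four tags is wrong
gold = [[("a", "DT"), ("dog", "NN")], [("it", "PRP"), ("barks", "VBZ")]]
pred = [[("a", "DT"), ("dog", "VB")], [("it", "PRP"), ("barks", "VBZ")]]
accuracy = token_accuracy(gold, pred)
print(accuracy)

train, test_sents = split_sentences(list(range(10)))
print(len(train), len(test_sents))
```

One thing to watch in Exercise 3.3: the Brown corpus uses its own tagset by default, while `nltk.pos_tag` returns Penn Treebank tags, so the two are not directly comparable. Mapping both to a common tagset (for example by passing `tagset="universal"` to `brown.tagged_sents` and to `nltk.pos_tag`) is one possible way around this.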