# HMMs in Natural Language Processing

In this example we'll see how HMMs perform in one of the tasks they're most widely used for: part-of-speech tagging. For that purpose, we'll use the NLTK library, which provides a variety of tools for NLP.

In [None]:
import random

import numpy as np
import nltk
from nltk.corpus import treebank
from nltk.tag import hmm
from nltk.stem import PorterStemmer

# Machine-specific NLTK data location; adjust or remove this line on your system.
nltk.data.path.append('/home/marcin/.nltk_data')

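As the NLTK datasets are huge, they aren't shipped with the library, so you have to download them first! Called with no arguments, `nltk.download()` opens an interactive downloader where you can pick the Penn Treebank sample used below.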

In [None]:
nltk.download()

Now, we can move on to our dataset:

In [None]:
random.seed(0)
data = list(treebank.tagged_sents()[:4000])
random.shuffle(data)
train_data = data[:3000]
test_data = data[3000:]

len(train_data), len(test_data)

In [None]:
train_data[0]

Those uppercase tags don't tell us a lot! Let's collect every tag that appears in the training data:

In [None]:
all_tags = set()

for sentence in train_data:
    for word, tag in sentence:
        all_tags.add(tag)

all_tags

Thankfully, NLTK can also tell us what they mean:

In [None]:
for tag in all_tags:
    print(tag)
    nltk.help.upenn_tagset(tag)
    print()

Now that we understand the dataset, let's train an HMM on it:

In [None]:
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_data)
tagger
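
To get a feel for what supervised training has to estimate, here is a minimal sketch that builds the three tables of an HMM (start, transition, and emission probabilities) from plain relative frequencies. The toy sentences and the `estimate_hmm` helper are made up for illustration; this is not NLTK's implementation, only the counting idea behind it.

```python
from collections import Counter, defaultdict

# Toy tagged corpus in the same (word, tag) format as train_data (made-up sentences).
toy_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

def normalize(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}

def estimate_hmm(sents):
    """Return (start, transition, emission) relative-frequency estimates."""
    start = Counter()             # tag of the first word in each sentence
    trans = defaultdict(Counter)  # previous tag -> next tag
    emit = defaultdict(Counter)   # tag -> word
    for sent in sents:
        prev = None
        for word, tag in sent:
            if prev is None:
                start[tag] += 1
            else:
                trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return (
        normalize(start),
        {tag: normalize(c) for tag, c in trans.items()},
        {tag: normalize(c) for tag, c in emit.items()},
    )

start_p, trans_p, emit_p = estimate_hmm(toy_sents)
print(emit_p["DT"])   # how often each word appears with the tag DT
print(trans_p["DT"])  # which tags follow DT, and how often
```

The hidden states are the tags and the observed symbols are the words, which is why the trained tagger exposes them as `_states` and `_symbols`.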

In [None]:
tagger._states
# tagger._symbols

In [None]:
print(tagger.tag("Joe met Joanne in Delhi .".split()))

print(tagger.tag("Chicago is the birthplace of Ginny".split()))
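
Tagging a sentence with an HMM amounts to finding the most probable tag sequence for the observed words, which is typically done with the Viterbi algorithm. The sketch below is a small standalone version in log space. The three-tag model and all of its probabilities are invented for illustration, and the function only handles words listed in `emit_p`.

```python
import math

# Hand-picked toy model (all numbers are assumptions, not learned from data).
states = ["DT", "NN", "VBZ"]
start_p = {"DT": 0.8, "NN": 0.1, "VBZ": 0.1}
trans_p = {
    "DT": {"DT": 0.05, "NN": 0.9, "VBZ": 0.05},
    "NN": {"DT": 0.1, "NN": 0.2, "VBZ": 0.7},
    "VBZ": {"DT": 0.6, "NN": 0.3, "VBZ": 0.1},
}
emit_p = {
    "DT": {"the": 0.9, "dog": 0.05, "barks": 0.05},
    "NN": {"the": 0.05, "dog": 0.8, "barks": 0.15},
    "VBZ": {"the": 0.05, "dog": 0.15, "barks": 0.8},
}

def viterbi(words):
    """Return the most probable (word, tag) sequence under the toy model."""
    # best[i][s]: log-probability of the best tag path ending in state s at position i
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][words[0]]) for s in states}]
    back = []  # back[i][s]: best predecessor of state s at position i + 1
    for word in words[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans_p[p][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][word]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return list(zip(words, path))

print(viterbi("the dog barks".split()))
# [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]
```

Working with log-probabilities avoids numerical underflow on long sentences, since the raw products shrink very quickly.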

How does the tagger do on the training data?

In [None]:
tagger.test(train_data)
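
The number to look at here is the tagging accuracy: the fraction of tokens whose predicted tag matches the gold tag. A minimal sketch of that metric, assuming gold and predicted sentences come as aligned lists of `(word, tag)` pairs; `token_accuracy` is a hypothetical helper, not part of NLTK.

```python
def token_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, pred in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, pred_tag) in zip(gold, pred):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0

# Made-up example: one of three tags is wrong.
gold = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]]
pred = [[("the", "DT"), ("dog", "NN"), ("barks", "NN")]]
print(token_accuracy(gold, pred))
```

With the real tagger, the predictions could be produced with something like `[tagger.tag([word for word, _ in sent]) for sent in train_data]`.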

How about the test data?

In [None]:
tagger.test(test_data)

Well, this sucks.
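
A plausible explanation, assuming the tagger was trained with its default unsmoothed (maximum-likelihood) estimates: a word that never appeared in the training data gets an emission probability of zero under every tag, so the model has nothing to go on for sentences containing it. The sketch below contrasts that with Lidstone (additive) smoothing on invented counts; the helper names, `gamma`, and `vocab_size` are all assumptions chosen for illustration.

```python
from collections import Counter

def mle_prob(counts, word):
    """Unsmoothed relative frequency: unseen words get probability 0."""
    return counts[word] / sum(counts.values())

def lidstone_prob(counts, word, gamma, bins):
    """Additive smoothing: every one of the `bins` possible words gets gamma extra counts."""
    return (counts[word] + gamma) / (sum(counts.values()) + gamma * bins)

# Invented emission counts for a single tag.
nn_counts = Counter({"dog": 3, "cat": 2})
vocab_size = 10  # assumed number of distinct words the tag could emit

print(mle_prob(nn_counts, "birthplace"))                        # 0.0
print(lidstone_prob(nn_counts, "birthplace", 0.1, vocab_size))  # small but non-zero
```

In NLTK, `train_supervised` accepts an `estimator` argument, so a smoothed tagger could be trained with something along the lines of `estimator=lambda fd, bins: nltk.probability.LidstoneProbDist(fd, 0.1, bins)`. That variant is not run in this notebook, so treat it as a direction to try rather than a measured improvement.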

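Let's play with the data a bit to see if we can improve those results. We'll stem the words, i.e. reduce each one to its root form, so that different inflections of the same word become a single symbol and the datasets get more uniform: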

In [None]:
porter = PorterStemmer()
porter.stem('intelligence')

In [None]:
stemmed_train_data = [[(porter.stem(word), tag) for word, tag in sent] for sent in train_data]
stemmed_test_data = [[(porter.stem(word), tag) for word, tag in sent] for sent in test_data]

train_data[0], stemmed_train_data[0]
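
One way to check whether a normalization step like this makes the two splits more alike is to measure the out-of-vocabulary rate, i.e. the share of test tokens that never occur in the training data. The sketch below uses made-up sentences and `str.lower` as a stand-in normalizer; `oov_rate` is a hypothetical helper.

```python
def oov_rate(train_sents, test_sents, normalize=lambda word: word):
    """Share of test tokens whose normalized form is absent from the training data."""
    vocab = {normalize(word) for sent in train_sents for word, _ in sent}
    test_words = [normalize(word) for sent in test_sents for word, _ in sent]
    unseen = sum(word not in vocab for word in test_words)
    return unseen / len(test_words) if test_words else 0.0

# Made-up data: the splits differ in capitalization and inflection.
toy_train = [[("The", "DT"), ("dogs", "NNS"), ("bark", "VBP")]]
toy_test = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]]

print(oov_rate(toy_train, toy_test))             # raw words
print(oov_rate(toy_train, toy_test, str.lower))  # after lowercasing
```

On the real data, passing `porter.stem` as `normalize` would compare `oov_rate(train_data, test_data)` against the stemmed version.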

In [None]:
stemmed_tagger = trainer.train_supervised(stemmed_train_data)
stemmed_tagger

In [None]:
stemmed_tagger.test(stemmed_train_data)

In [None]:
stemmed_tagger.test(stemmed_test_data)