# HMMs in Natural Language Processing

In this example we'll see how HMMs perform in one of the tasks they're most widely used for: part-of-speech tagging. For that purpose, we'll use the NLTK library, which provides a variety of NLP tools.

In [None]:
import numpy as np
import random

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import nltk
from nltk.corpus import treebank
from nltk.tag import hmm
from nltk.stem import PorterStemmer
# Extra search path for NLTK data on the author's machine; adjust or remove it on yours.
nltk.data.path.append('/home/marcin/.nltk_data')


As the NLTK datasets are huge, you have to download them first!

In [None]:
# Opens NLTK's interactive downloader; the `treebank` corpus is the one used below.
nltk.download()

Now, we can move on to our dataset:

In [None]:
random.seed(0)
data = list(treebank.tagged_sents()[:4000])
random.shuffle(data)
train_data = data[:3000]
test_data = data[3000:]

len(train_data), len(test_data)

In [None]:
train_data[0]

Those tags in capitals don't tell us much! Let's inspect them:

In [None]:
all_tags = set()

for sentence in train_data:
    for word, tag in sentence:
        all_tags.add(tag)

all_tags

Thankfully, NLTK can also tell us what they mean:

In [None]:
for tag in all_tags:
    print(tag)
    nltk.help.upenn_tagset(tag)
    print()

Now that we understand the dataset, let's train an HMM on it:

In [None]:
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_data)
tagger
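
To get a feel for what supervised training amounts to, here is a minimal sketch that counts initial tags, tag-to-tag transitions and tag-to-word emissions on a toy corpus and turns them into relative frequencies. The corpus and the helper names (`count_hmm_statistics`, `normalize`) are made up for illustration; this is not NLTK's implementation, just the same idea in plain Python.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (hypothetical, not taken from the treebank).
toy_corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

def count_hmm_statistics(tagged_sentences):
    """Count initial tags, tag-to-tag transitions and tag-to-word emissions."""
    initial = Counter()
    transitions = defaultdict(Counter)
    emissions = defaultdict(Counter)
    for sentence in tagged_sentences:
        previous_tag = None
        for word, tag in sentence:
            if previous_tag is None:
                initial[tag] += 1                    # first tag of the sentence
            else:
                transitions[previous_tag][tag] += 1  # previous tag -> current tag
            emissions[tag][word] += 1                # current tag -> observed word
            previous_tag = tag
    return initial, transitions, emissions

def normalize(counter):
    """Turn raw counts into relative frequencies (maximum-likelihood estimates)."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}

initial, transitions, emissions = count_hmm_statistics(toy_corpus)
print(normalize(initial))            # distribution over sentence-initial tags
print(normalize(transitions["DT"]))  # which tags follow a determiner
print(normalize(emissions["NN"]))    # which words a noun emits
```

With plain relative frequencies, any word that never appeared with a given tag gets probability zero for that tag.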

In [None]:
tagger._states
# tagger._symbols

In [None]:
print(tagger.tag("Joe met Joanne in Delhi .".split()))

print(tagger.tag("Chicago is the birthplace of Ginny".split()))
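
Tagging a sentence with an HMM means finding the most probable tag sequence, which is typically done with the Viterbi algorithm. Below is a minimal log-space sketch on a hand-written toy model; the probabilities, the tag set and the `viterbi` helper are all assumptions made for illustration and have nothing to do with the parameters of the tagger trained above.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` (log-space Viterbi)."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][tag]: log-probability of the best path that ends in `tag` at position i.
    best = [{tag: log(start_p.get(tag, 0)) + log(emit_p[tag].get(words[0], 0))
             for tag in tags}]
    back = [{}]  # back[i][tag]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for tag in tags:
            prev_tag = max(tags, key=lambda p: best[i - 1][p] + log(trans_p[p].get(tag, 0)))
            best[i][tag] = (best[i - 1][prev_tag]
                            + log(trans_p[prev_tag].get(tag, 0))
                            + log(emit_p[tag].get(words[i], 0)))
            back[i][tag] = prev_tag

    # Start from the best final tag and follow the back-pointers.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Hypothetical toy model: every tag needs an entry in trans_p and emit_p.
tags = ["DT", "NN", "VBZ"]
start_p = {"DT": 0.8, "NN": 0.2}
trans_p = {"DT": {"NN": 1.0}, "NN": {"VBZ": 0.7, "NN": 0.3}, "VBZ": {"DT": 1.0}}
emit_p = {"DT": {"the": 1.0}, "NN": {"dog": 0.6, "barks": 0.4}, "VBZ": {"barks": 1.0}}

print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
```

Note that "barks" could be emitted by either `NN` or `VBZ` in this toy model; the transition probabilities are what tip the decision.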

How does the tagger do on the training data?

In [None]:
tagger.test(train_data)

How about the test data?

In [None]:
tagger.test(test_data)

Well, this sucks.
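
One likely contributor is unseen words: with plain relative-frequency (maximum-likelihood) estimates, a word that never occurred in training has emission probability zero under every tag. The sketch below contrasts that with Lidstone (add-`gamma`) smoothing. The counts, `gamma` and the vocabulary size are made-up values, and the two helper functions are illustrative rather than NLTK's own classes.

```python
from collections import Counter

def mle_prob(counts, word):
    """Maximum-likelihood estimate: relative frequency, zero for unseen words."""
    total = sum(counts.values())
    return counts[word] / total

def lidstone_prob(counts, word, gamma, bins):
    """Lidstone estimate: add `gamma` to every count, assuming `bins` possible words."""
    total = sum(counts.values())
    return (counts[word] + gamma) / (total + gamma * bins)

# Hypothetical emission counts for the tag NNP seen during training.
nnp_counts = Counter({"Joe": 3, "Delhi": 1})
vocabulary_size = 10  # assumed number of distinct words the model should allow for

# "Ginny" was never seen with this tag.
print(mle_prob(nnp_counts, "Ginny"))
print(lidstone_prob(nnp_counts, "Ginny", gamma=0.1, bins=vocabulary_size))
```

As far as we know, `train_supervised` accepts an optional `estimator` argument that can be used to plug in a smoothed distribution such as `nltk.probability.LidstoneProbDist`; we don't explore that here.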

Let's play with the data a bit to see if we can improve those results. We'll stem the words to make the datasets more uniform:

In [None]:
porter = PorterStemmer()
porter.stem('intelligence')

In [None]:
def to_stemmed(data):
    return [ [(porter.stem(word), tag) for word, tag in sent] for sent in data]

In [None]:
stemmed_train_data = to_stemmed(train_data)
stemmed_test_data = to_stemmed(test_data)

train_data[0], stemmed_train_data[0]
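
One way to check whether stemming really makes the datasets "more uniform" is to measure how many test tokens never occur in the training set, before and after stemming. Here is a small sketch; `oov_rate` is a hypothetical helper, and `crude_stem` is a deliberately naive stand-in for a real stemmer (it is not the Porter algorithm).

```python
def oov_rate(train_sentences, test_sentences):
    """Fraction of test tokens whose word form never occurs in the training sentences."""
    train_vocab = {word for sentence in train_sentences for word, _ in sentence}
    test_words = [word for sentence in test_sentences for word, _ in sentence]
    unseen = sum(1 for word in test_words if word not in train_vocab)
    return unseen / len(test_words)

def crude_stem(word):
    """Naive suffix stripper used only for this toy example (not the Porter stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stem_all(sentences):
    """Apply the toy stemmer to every word, keeping the tags untouched."""
    return [[(crude_stem(word), tag) for word, tag in sentence] for sentence in sentences]

# Toy data: the test words are inflected differently from the training words.
toy_train = [[("walking", "VBG"), ("dogs", "NNS")]]
toy_test = [[("walked", "VBD"), ("dog", "NN")]]

print(oov_rate(toy_train, toy_test))                      # before stemming
print(oov_rate(stem_all(toy_train), stem_all(toy_test)))  # after stemming
```

In the notebook, the same comparison would be `oov_rate(train_data, test_data)` against `oov_rate(stemmed_train_data, stemmed_test_data)`.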

In [None]:
stemmed_tagger = trainer.train_supervised(stemmed_train_data)
stemmed_tagger

In [None]:
stemmed_tagger.test(stemmed_train_data)

In [None]:
stemmed_tagger.test(stemmed_test_data)

Let's see if we can train other classifiers on the data. 

First, some utilities to transform the data into numerical features. We won't try too hard - each data point will consist of the word in question and the previous word, which is roughly the amount of context a first-order HMM takes into account (although the HMM conditions on the previous tag rather than the previous word):

In [None]:
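def token_to_features(sentence, index):
    """Describe the token at `index` by its word and the preceding word."""
    result = {
        'word': sentence[index],
        # An empty string marks the start of the sentence.
        'prev_word': '' if index == 0 else sentence[index - 1]
    }
    return result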

In [None]:
def untag(tagged_sentence):
    return [w for w, t in tagged_sentence]

In [None]:
def to_X_y(tagged_sentences):
    X, y = [], []
    for tagged in tagged_sentences:
        untagged = untag(tagged)
        for index in range(len(tagged)):
            X.append(token_to_features(untagged, index))
            y.append(tagged[index][1])
 
    return np.array(X), np.array(y)

In [None]:
# `data` holds at most 4000 sentences, so this slice keeps all of them.
dataset = data[:10000]
# dataset = to_stemmed(data)

X_dict, y = to_X_y(dataset)

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(X_dict)

split = int(len(y) * 0.7)
X_train = X[:split]
X_test = X[split:]
y_train = y[:split]
y_test = y[split:]

X_train.shape, y_train.shape, X_test.shape, y_test.shape
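
For string-valued features, `DictVectorizer` builds one column per `name=value` pair and marks it with a 1 when the pair is present, which is why `X` ends up so wide. The following sketch mimics that encoding in plain Python with dense lists; `one_hot_encode` is a hypothetical helper written for illustration, and the real class returns a sparse matrix instead.

```python
def one_hot_encode(feature_dicts):
    """Map string-valued feature dicts to 0/1 rows, one column per `name=value` pair."""
    columns = sorted({f"{name}={value}"
                      for features in feature_dicts
                      for name, value in features.items()})
    index = {column: i for i, column in enumerate(columns)}
    rows = []
    for features in feature_dicts:
        row = [0] * len(columns)
        for name, value in features.items():
            row[index[f"{name}={value}"]] = 1
        rows.append(row)
    return columns, rows

# Features for the toy sentence "the dog barks", in the same shape as token_to_features.
toy_features = [
    {"word": "the", "prev_word": ""},
    {"word": "dog", "prev_word": "the"},
    {"word": "barks", "prev_word": "dog"},
]

columns, rows = one_hot_encode(toy_features)
print(columns)
for row in rows:
    print(row)
```

Every row has exactly two non-zero entries, one for the word and one for the previous word, no matter how large the vocabulary gets.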

We'll use three very simple classifiers, none of which is recurrent. The only sequential context comes from the data points themselves, which contain the $n^{\text{th}}$ and $(n-1)^{\text{th}}$ words:

In [None]:
decision_tree = DecisionTreeClassifier()
linear_model = LogisticRegression()
neural_network = MLPClassifier(verbose=True, max_iter=10)

decision_tree, linear_model, neural_network

In [None]:
for classifier in [decision_tree, linear_model, neural_network]:
    classifier.fit(X_train, y_train)
    y_pred_train = classifier.predict(X_train)
    y_pred_test = classifier.predict(X_test)
    print('train', accuracy_score(y_train, y_pred_train))
    print('test', accuracy_score(y_test, y_pred_test))
    print()
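
A single accuracy number hides which tags the classifiers get wrong. As a possible follow-up, here is a sketch that breaks accuracy down by gold tag; `per_tag_accuracy` is a hypothetical helper, and the toy labels are made up.

```python
from collections import Counter

def per_tag_accuracy(y_true, y_pred):
    """Accuracy computed separately for every gold tag."""
    correct, total = Counter(), Counter()
    for gold, predicted in zip(y_true, y_pred):
        total[gold] += 1
        if gold == predicted:
            correct[gold] += 1
    return {tag: correct[tag] / total[tag] for tag in total}

def overall_accuracy(y_true, y_pred):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    matches = sum(1 for gold, predicted in zip(y_true, y_pred) if gold == predicted)
    return matches / len(y_true)

# Toy labels: the verb is mistaken for a noun.
toy_gold = ["DT", "NN", "VBZ", "NN"]
toy_pred = ["DT", "NN", "NN", "NN"]

print(overall_accuracy(toy_gold, toy_pred))
print(per_tag_accuracy(toy_gold, toy_pred))
```

In the notebook this could be called as `per_tag_accuracy(y_test, y_pred_test)` inside the loop above.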

Et voilà. Looks like there's a good reason why HMMs aren't as hot a topic anymore :P