# Parts of Speech Tagging

This notebook provides an introduction to part-of-speech (POS) tagging in Python using NLTK.

The notebook shows how to use the following algorithms:

*   N-Gram Taggers and Backoffs
*   Averaged Perceptron Tagger
*   Hidden Markov Model
*   Conditional Random Fields

## Initialize NLTK

Download some of the resources that NLTK needs.

In [None]:
import nltk
nltk.download('book')

## Load the tagged dataset

NLTK's built-in loader is used to load the Treebank corpus, a dataset in which every word is labeled with its part of speech. This labeled dataset is used to train and evaluate the automatic taggers.

The loader returns a list of sentences, and each sentence is a list of tuples. Each tuple contains two elements: the word and its tag, in that order.

The dataset is split in an 80-20 ratio. The first split lets the algorithms learn the tagging patterns, while the second split is used to evaluate each tagger on sentences it has not seen.

In [None]:
DATA = nltk.corpus.treebank.sents()
DATA_TAGGED = nltk.corpus.treebank.tagged_sents()

In [None]:
train_split = int(len(DATA_TAGGED) * 0.80)
DATA_TRAIN = DATA_TAGGED[:train_split]
DATA_TEST = DATA_TAGGED[train_split:]

In [None]:
len(DATA), len(DATA_TAGGED), len(DATA_TRAIN), len(DATA_TEST)
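
To make the data layout concrete without downloading anything, here is a minimal sketch on a hand-made toy corpus. The sentences, tags, and the names `toy_tagged`, `toy_untagged`, `toy_train`, and `toy_test` are made up for illustration; only the shape (a list of sentences, each a list of `(word, tag)` tuples) and the slicing mirror the cells above.

```python
# Toy stand-in for the tagged corpus: a list of sentences,
# each sentence being a list of (word, tag) tuples.
toy_tagged = [
    [("The", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("A", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("Dogs", "NNS"), ("bark", "VBP")],
    [("She", "PRP"), ("runs", "VBZ")],
    [("They", "PRP"), ("run", "VBP")],
]

# Untagged view: keep only the word from every (word, tag) pair.
toy_untagged = [[word for word, _ in sent] for sent in toy_tagged]

# 80-20 split by sentence position, using the same slicing as above.
split = int(len(toy_tagged) * 0.80)
toy_train, toy_test = toy_tagged[:split], toy_tagged[split:]

print(toy_untagged[0])
print(len(toy_train), len(toy_test))
```

Note that the split is positional, so the test sentences are simply the last 20% of the corpus rather than a random sample.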

## N-Gram Taggers

An N-Gram tagger resolves tagging ambiguity by counting: for each word, taken together with the tags of the N-1 preceding tokens, it assigns the tag that occurred most often in that context in the training data.

### Unigram Taggers

A unigram tagger is an N-Gram tagger with N = 1: it ignores the surrounding context and assigns each word the tag it carried most often in the training data. This is the usual baseline for resolving tagging ambiguity.

In [None]:
unigram_tagger = nltk.tag.UnigramTagger(DATA_TRAIN)
unigram_tagger.accuracy(DATA_TEST)

In [None]:
unigram_pos_tags = unigram_tagger.tag_sents(DATA)
unigram_pos_tags[0]
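
The counting idea behind a unigram tagger can be sketched in a few lines of plain Python. This is an illustrative sketch, not NLTK's implementation: the helpers `train_unigram`, `tag_unigram`, and `accuracy`, and the toy sentences, are hypothetical names and data introduced here.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carried most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_unigram(model, words):
    """Tag each word independently; words never seen in training get None."""
    return [(word, model.get(word)) for word in words]

def accuracy(model, tagged_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    total = correct = 0
    for sent in tagged_sents:
        predicted = tag_unigram(model, [word for word, _ in sent])
        for (_, gold), (_, guess) in zip(sent, predicted):
            total += 1
            correct += gold == guess
    return correct / total

toy_train = [
    [("the", "DT"), ("book", "NN")],
    [("a", "DT"), ("book", "NN")],
    [("they", "PRP"), ("book", "VB"), ("flights", "NNS")],
]
model = train_unigram(toy_train)
print(tag_unigram(model, ["they", "book", "the", "hotel"]))
# [('they', 'PRP'), ('book', 'NN'), ('the', 'DT'), ('hotel', None)]
```

In this toy run, `book` is always tagged `NN` because that tag is more frequent, even after `they`, and the unseen word `hotel` gets no tag at all.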

### Backoffs

To improve the unigram tagger, a backoff can be defined to handle unknown words. Since most open-class words are nouns, a tagger that labels every word as a noun (`NN`) can be used as the backoff.

In [None]:
default_tagger = nltk.tag.DefaultTagger('NN')
unigram_tagger_backoff = nltk.tag.UnigramTagger(DATA_TRAIN, backoff=default_tagger)
unigram_tagger_backoff.accuracy(DATA_TRAIN), unigram_tagger_backoff.accuracy(DATA_TEST)
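
Conceptually, the backoff only matters for words the main tagger cannot handle. A minimal sketch of that fallback, assuming a hypothetical word-to-tag table `lookup` that stands in for a trained unigram model:

```python
def tag_with_backoff(lookup, words, default_tag="NN"):
    """Use the lookup table when the word is known, else fall back to default_tag."""
    return [(word, lookup.get(word, default_tag)) for word in words]

# Hypothetical word-to-tag table standing in for a trained unigram model.
lookup = {"the": "DT", "dog": "NN", "barks": "VBZ"}

print(tag_with_backoff(lookup, ["the", "wolf", "barks"]))
# [('the', 'DT'), ('wolf', 'NN'), ('barks', 'VBZ')]
```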

### Bigram and Trigram Taggers

Bigram and trigram taggers generalize the unigram tagger. However, since they condition on a longer context, many contexts in the test data never appear in the training data, so they may perform worse when the training set is small.

In [None]:
bigram_tagger = nltk.tag.BigramTagger(DATA_TRAIN)
bigram_tagger.accuracy(DATA_TRAIN), bigram_tagger.accuracy(DATA_TEST)

In [None]:
trigram_tagger = nltk.tag.TrigramTagger(DATA_TRAIN)
trigram_tagger.accuracy(DATA_TRAIN), trigram_tagger.accuracy(DATA_TEST)

### Chaining Backoffs

Like the unigram tagger, bigram and trigram taggers can be given a backoff to handle the contexts they have not seen. A backoff can have a backoff of its own, creating a chain of taggers.

In [None]:
bigram_tagger_backoff = nltk.tag.BigramTagger(DATA_TRAIN, backoff=unigram_tagger_backoff)
bigram_tagger_backoff.accuracy(DATA_TRAIN), bigram_tagger_backoff.accuracy(DATA_TEST)

In [None]:
trigram_tagger_backoff = nltk.tag.TrigramTagger(DATA_TRAIN, backoff=bigram_tagger_backoff)
trigram_tagger_backoff.accuracy(DATA_TRAIN), trigram_tagger_backoff.accuracy(DATA_TEST)
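
The chain can be pictured as a series of lookups, each tried only when the previous one has no answer. The sketch below is a plain-Python illustration, not NLTK's code; it assumes the bigram context is the previous tag plus the current word, and the names `train_tables` and `tag_chain` and the toy sentences are hypothetical.

```python
from collections import Counter, defaultdict

def most_common_tag(table):
    """Reduce a {context: Counter of tags} table to {context: best tag}."""
    return {key: tags.most_common(1)[0][0] for key, tags in table.items()}

def train_tables(tagged_sents):
    """Build bigram (previous tag, word) and unigram (word) lookup tables."""
    bigram_counts, unigram_counts = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None  # None marks the start of a sentence
        for word, tag in sent:
            bigram_counts[(prev_tag, word)][tag] += 1
            unigram_counts[word][tag] += 1
            prev_tag = tag
    return most_common_tag(bigram_counts), most_common_tag(unigram_counts)

def tag_chain(bigram, unigram, words, default_tag="NN"):
    """Try the bigram table, then the unigram table, then the default tag."""
    tagged, prev_tag = [], None
    for word in words:
        tag = bigram.get((prev_tag, word))
        if tag is None:
            tag = unigram.get(word, default_tag)
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

toy_train = [
    [("the", "DT"), ("book", "NN")],
    [("they", "PRP"), ("book", "VB"), ("flights", "NNS")],
    [("a", "DT"), ("book", "NN")],
]
bigram, unigram = train_tables(toy_train)
print(tag_chain(bigram, unigram, ["they", "book", "the", "hotel"]))
# [('they', 'PRP'), ('book', 'VB'), ('the', 'DT'), ('hotel', 'NN')]
```

Here the bigram table tags `book` as `VB` after a pronoun, the unseen context `('VB', 'the')` falls back to the unigram table, and the unknown word `hotel` falls through to the default tag.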

## Averaged Perceptron Tagger

The averaged perceptron tagger is based on an [article by Matthew Honnibal](https://explosion.ai/blog/part-of-speech-pos-tagger-in-python).

As of version `3.5`, NLTK uses the Averaged Perceptron Tagger as its default tagger, so `nltk.pos_tag` and `nltk.pos_tag_sents` use it by default. However, the algorithm can also be invoked explicitly.

### Pretrained Model

NLTK provides a pretrained model for the Averaged Perceptron Tagger, so it can be used without any training. The pretrained model uses the Penn Treebank tagset, so make sure the test data uses the same tagset before evaluating.

In [None]:
default_pos_tags = nltk.pos_tag_sents(DATA)
default_pos_tags[0]

In [None]:
perceptron_pretrained = nltk.perceptron.PerceptronTagger()
perceptron_pretrained.accuracy(DATA_TEST)

In [None]:
perceptron_pos_tags = perceptron_pretrained.tag_sents(DATA)
perceptron_pos_tags[0]

### Training

NLTK also provides a way to train the Averaged Perceptron Tagger on your own data.

In [None]:
perceptron_trained = nltk.perceptron.PerceptronTagger(load=False)
perceptron_trained.train(DATA_TRAIN, nr_iter=5)
perceptron_trained.accuracy(DATA_TRAIN), perceptron_trained.accuracy(DATA_TEST)
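
For intuition about what training does, here is a heavily simplified sketch of an averaged perceptron tagger in plain Python. It is not NLTK's implementation: the feature set is a tiny illustrative one, the training data is not shuffled between iterations, and the names `features`, `predict`, `train`, and `tag` are hypothetical. The core idea it shows is that weights are nudged only on mistakes, and the final model uses the average of the weights over all training steps.

```python
def features(words, i, prev_tag):
    """A few illustrative features for the word at position i."""
    word = words[i]
    return ["bias", f"word={word.lower()}", f"suffix={word[-2:]}", f"prev_tag={prev_tag}"]

def predict(weights, feats, tags):
    """Score each tag as the sum of its active feature weights and pick the best."""
    scores = {tag: 0.0 for tag in tags}
    for feat in feats:
        for tag, weight in weights.get(feat, {}).items():
            scores[tag] += weight
    # tags is sorted, so ties go to the alphabetically first tag.
    return max(tags, key=lambda tag: scores[tag])

def train(tagged_sents, n_iter=5):
    """Return (averaged weights, sorted tag list) learned from tagged sentences."""
    tags = sorted({tag for sent in tagged_sents for _, tag in sent})
    weights, totals, steps = {}, {}, 0
    for _ in range(n_iter):
        for sent in tagged_sents:
            words = [word for word, _ in sent]
            prev_tag = "<s>"
            for i, (_, gold) in enumerate(sent):
                feats = features(words, i, prev_tag)
                guess = predict(weights, feats, tags)
                if guess != gold:
                    # Mistake: reward the gold tag, penalize the wrong guess.
                    for feat in feats:
                        row = weights.setdefault(feat, {})
                        row[gold] = row.get(gold, 0.0) + 1.0
                        row[guess] = row.get(guess, 0.0) - 1.0
                # Add the current weights to the running totals after every token.
                for feat, row in weights.items():
                    total_row = totals.setdefault(feat, {})
                    for tag, weight in row.items():
                        total_row[tag] = total_row.get(tag, 0.0) + weight
                steps += 1
                prev_tag = guess
    averaged = {feat: {tag: total / steps for tag, total in row.items()}
                for feat, row in totals.items()}
    return averaged, tags

def tag(weights, tags, words):
    """Tag words left to right, feeding each predicted tag into the next step."""
    tagged, prev_tag = [], "<s>"
    for i, word in enumerate(words):
        guess = predict(weights, features(words, i, prev_tag), tags)
        tagged.append((word, guess))
        prev_tag = guess
    return tagged

toy_train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
weights, tags = train(toy_train, n_iter=5)
print(tag(weights, tags, ["the", "dog", "sleeps"]))
```

Summing the full weight table after every token is wasteful; it is written this way only to keep the averaging step easy to read.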

## Hidden Markov Models

A Hidden Markov Model (HMM) treats the tags of a tagging problem as the hidden states of a Markov model and the words as the observations emitted by those states.

NLTK's implementation lets you either train the model from data or provide the HMM's probability matrices directly.

In [None]:
hmm_trainer = nltk.hmm.HiddenMarkovModelTrainer()
hmm = hmm_trainer.train_supervised(DATA_TRAIN)

In [None]:
hmm.accuracy(DATA_TRAIN), hmm.accuracy(DATA_TEST)

In [None]:
hmm_pos_tags = hmm.tag_sents(DATA)
hmm_pos_tags[0]
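
To show how the HMM matrices turn into tags, here is a small Viterbi decoder in plain Python. It is a sketch under stated assumptions, not NLTK's code: the start, transition, and emission probabilities are made-up toy numbers, unseen words get a small constant probability `unk`, the input is assumed to be non-empty, and raw probabilities are multiplied for readability (a practical implementation would typically work in log space).

```python
def viterbi(words, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable state (tag) sequence for a non-empty word list."""
    # Each column maps a state to (best path probability, previous state).
    columns = [{s: (start_p[s] * emit_p[s].get(words[0], unk), None) for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (columns[-1][p][0] * trans_p[p][s] * emit_p[s].get(word, unk), p)
                for p in states
            )
        columns.append(column)
    # Backtrack from the best final state to recover the path.
    path = [max(states, key=lambda s: columns[-1][s][0])]
    for column in reversed(columns[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))

# Made-up toy parameters for a two-tag model.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.8, "VERB": 0.2}
trans_p = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.6, "VERB": 0.4},
}
emit_p = {
    "NOUN": {"dogs": 0.5, "bark": 0.1},
    "VERB": {"dogs": 0.05, "bark": 0.5},
}

print(viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p))
# ['NOUN', 'VERB']
```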

## Conditional Random Fields

A Conditional Random Field (CRF) generalizes logistic regression to sequence data. As with logistic regression, you can define arbitrary features that are used to predict the label of each element of a sequence.

This feature requires the installation of [`python-crfsuite`](https://github.com/scrapinghub/python-crfsuite).

### Predefined Features

Out of the box, NLTK uses its own CRF features if you do not provide any.

In [None]:
crf_default = nltk.crf.CRFTagger()
crf_default.train(DATA_TRAIN, '../models/crf_default.tag')
crf_default.accuracy(DATA_TRAIN), crf_default.accuracy(DATA_TEST)

In [None]:
crf_pos_tags = crf_default.tag_sents(DATA)
crf_pos_tags[0]

### Custom Features

While NLTK allows providing a custom function to generate features, the API does not allow using the previous states (tags) as features.

The feature function must accept two arguments: the word list `tokens` and the index of the current word `idx`. It should return a list of strings. The strings act as flags: a feature is on for a word if its name appears in the list.

For example, if a word has the feature list `['CAPS', 'SUF_ly']`, then the features `CAPS` and `SUF_ly` are on for that word. In practice, this could mean that the word is in all caps and that it ends in "ly".
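
As a minimal sketch of that example, the hypothetical function `flag_features` below follows the same `(tokens, idx)` signature and switches on exactly the two flags mentioned. The function name and the rules behind each flag are assumptions made for illustration.

```python
def flag_features(tokens, idx):
    """Return the names of the features that are on for tokens[idx]."""
    token = tokens[idx]
    flags = []
    if token.isupper():
        flags.append("CAPS")    # the word is written in all caps
    if token.lower().endswith("ly"):
        flags.append("SUF_ly")  # the word ends in "ly"
    return flags

tokens = ["PLEASE", "REPLY", "quickly"]
print([flag_features(tokens, i) for i in range(len(tokens))])
# [['CAPS'], ['CAPS', 'SUF_ly'], ['SUF_ly']]
```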

In [None]:
def custom_crf_features(tokens, idx):
    feature_list = []
    token = tokens[idx]

    # WORDS: the current token plus its immediate neighbours
    feature_list.append(f'WORD_{token}')
    # Check the bounds explicitly: tokens[-1] does not raise IndexError,
    # it silently wraps around to the last token of the sentence.
    if idx > 0:
        feature_list.append(f'WORD-1_{tokens[idx-1]}')
    if idx < len(tokens) - 1:
        feature_list.append(f'WORD+1_{tokens[idx+1]}')

    # SUFFIX: the last 1 to 3 characters, when the token is long enough
    if len(token) > 1:
        feature_list.append("SUF_" + token[-1:])
    if len(token) > 2:
        feature_list.append("SUF_" + token[-2:])
    if len(token) > 3:
        feature_list.append("SUF_" + token[-3:])

    return feature_list

In [None]:
crf_custom = nltk.crf.CRFTagger(feature_func=custom_crf_features)
crf_custom.train(DATA_TRAIN, '../models/crf_custom.tag')
crf_custom.accuracy(DATA_TRAIN), crf_custom.accuracy(DATA_TEST)