# Hidden Markov Model for Part-of-Speech Tagging 

This Jupyter notebook illustrates how to use the module developed in this repository. The goal is to train an HMM-based POS tagger on German data. The code provides clean and elegant solutions to common issues in this approach, such as zero-valued emission and transition probabilities and the handling of unknown words. Let's start.

In [1]:
# general imports
import nltk
from nltk.corpus.reader import ConllCorpusReader

# import the tagger class (explicit import instead of a wildcard)
from HMMTagger import HMMTagger

Specify the data directory, the file names, and the column types of the CoNLL files.

In [2]:
directory    = 'data/'
train_fileid = 'de-train.tt'
test_fileid  = 'de-test.t'
columntypes  = ['words', 'pos'] 

## (1) Navigating through the HMM probabilistic components

Let's create an object of the HMMTagger class and check the POS tags it can recognize. First, we need a corpus reader for the training data.

In [3]:
# create a CoNLL corpus reader object
train_corpus = ConllCorpusReader(
    directory, train_fileid, columntypes, tagset='universal', encoding='utf8'
)
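
For readers unfamiliar with the format: the files used here hold one token per line, with the word and its tag in separate columns, and a blank line between sentences (the same layout the tagger output is written in later in this notebook). Below is a minimal pure-Python sketch of such a reader, assuming tab-separated columns. The helper name `read_two_column` and the sample text are made up for illustration; `ConllCorpusReader` remains the reader actually used here.

```python
def read_two_column(text):
    """Parse 'word<TAB>tag' lines into a list of sentences.

    Hypothetical helper for illustration only; a blank line ends a sentence.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # blank line: close the current sentence (if any)
            if current:
                sentences.append(current)
                current = []
            continue
        word, tag = line.split('\t')
        current.append((word, tag))
    if current:
        # the last sentence may not be followed by a blank line
        sentences.append(current)
    return sentences


# made-up sample with two sentences
sample = "das\tDET\nHaus\tNOUN\n\nes\tPRON\nregnet\tVERB\n"
print(read_two_column(sample))
```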

The constructor of the HMMTagger class takes as input an object of ConllCorpusReader, a class for systematically processing files in the CoNLL format.

In [4]:
POSTagger = HMMTagger(train_corpus, smoothing='add_one')

POSTagger.tagset

{'.',
 'ADJ',
 'ADP',
 'ADV',
 'CONJ',
 'DET',
 'NOUN',
 'NUM',
 'PRON',
 'PRT',
 'VERB',
 'X'}

Here, we choose add-one (a.k.a. Laplace) smoothing in order to avoid zero-valued probabilities. Thus, the parameters of the HMM components (that is, the initial, transition, and emission probabilities) are smoothed by adding one count to each event. In other words, we pretend that each event occurs one more time than its actual occurrence count.

Moreover, the tagger recognizes 12 POS tags, namely the tags used in the training data.
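
To make the add-one idea concrete, here is a minimal sketch of the textbook formula P(e) = (count(e) + 1) / (N + |E|), where N is the total count and |E| the size of the event space. The function name `add_one_probs` and the toy counts are assumptions for illustration; this is not necessarily how HMMTagger implements it internally.

```python
from collections import Counter


def add_one_probs(counts, event_space):
    """Return add-one smoothed probabilities over event_space."""
    total = sum(counts.get(e, 0) for e in event_space) + len(event_space)
    return {e: (counts.get(e, 0) + 1) / total for e in event_space}


# toy counts of sentence-initial tags (made-up numbers)
toy_counts = Counter({'DET': 5, 'NOUN': 3, 'PRON': 2})
toy_tagset = ['DET', 'NOUN', 'PRON', 'VERB']

probs = add_one_probs(toy_counts, toy_tagset)
print(probs['VERB'])        # the unseen event still gets non-zero mass
print(sum(probs.values()))  # the distribution stays normalized
```

Note that every event receives the same extra count, regardless of how large the event space is.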

Now, let's check some of the HMM parameters in depth. Let's start with the initial probabilities, that is, the probability of each tag occurring at the start of a sentence.

In [5]:
for tag in POSTagger.tagset:
    print("P({0:>5}|<S>) = {1:.10f}".format(tag, POSTagger.initials.P[tag]))

P( PRON|<S>) = 0.1429582449
P(  ADV|<S>) = 0.1176928521
P(  NUM|<S>) = 0.0288747346
P( NOUN|<S>) = 0.1462137297
P(  PRT|<S>) = 0.0015569710
P(  DET|<S>) = 0.2613588110
P( VERB|<S>) = 0.0148619958
P(    .|<S>) = 0.0037508846
P(    X|<S>) = 0.0009200283
P(  ADJ|<S>) = 0.0428874735
P(  ADP|<S>) = 0.2121019108
P( CONJ|<S>) = 0.0268223638


We can see that P(DET|<S>) is the highest probability, which is quite reasonable since it is very common to start a sentence with a determiner in German. It is also common in German to start a sentence with a preposition, noun, or pronoun.

The initial probabilities come from a single distribution. Therefore, if the implementation is correct, these probability values should sum to one.

In [6]:
sum(POSTagger.initials.P[tag] for tag in POSTagger.tagset)

1.0

Great! That suggests the initial probabilities (including smoothing) are properly normalized.

Now let's peek into the transition probabilities. HMM transitions are conditioned on the POS tags. That is, we can think of the transition probabilities as 12 different distributions (one per tag), where the event space of each distribution is the set of POS tags.

Let's check the transition probabilities for the determiner tag 'DET'.
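
The "one distribution per tag" view can be sketched as follows: count tag bigrams, then normalize each row separately. This is a minimal illustration with add-one smoothing on toy data; the function name `transition_probs` and the data layout are assumptions, not the internals of HMMTagger.

```python
from collections import Counter, defaultdict


def transition_probs(tag_sequences, tagset):
    """Estimate P(next_tag | prev_tag) with add-one smoothing."""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        # count each adjacent (previous tag, next tag) pair
        for prev, nxt in zip(tags, tags[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev in tagset:
        # each previous tag gets its own normalized distribution
        total = sum(counts[prev].values()) + len(tagset)
        probs[prev] = {t: (counts[prev][t] + 1) / total for t in tagset}
    return probs


# toy tag sequences (made up)
toy_tags = [['DET', 'NOUN', 'VERB'], ['DET', 'ADJ', 'NOUN', 'VERB']]
toy_tagset = ['ADJ', 'DET', 'NOUN', 'VERB']
T = transition_probs(toy_tags, toy_tagset)
for prev in toy_tagset:
    print(prev, round(sum(T[prev].values()), 10))
```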

In [7]:
for tag in POSTagger.tagset:
    print("P({0:>5}|DET) = {1:.10f}".format(tag, POSTagger.transitions['DET'].P[tag]))

P( PRON|DET) = 0.0055608624
P(  ADV|DET) = 0.0118381550
P(  NUM|DET) = 0.0107123362
P( NOUN|DET) = 0.7300764192
P(  PRT|DET) = 0.0004776201
P(  DET|DET) = 0.0008528930
P( VERB|DET) = 0.0002046943
P(    .|DET) = 0.0084606987
P(    X|DET) = 0.0002046943
P(  ADJ|DET) = 0.2179312227
P(  ADP|DET) = 0.0134415939
P( CONJ|DET) = 0.0002388100


It is not surprising that the conditional probability P(NOUN|DET) is the highest, since it is very natural for a noun to follow a determiner.

Let's check the transition probabilities for the tag 'VERB'.

In [8]:
for tag in POSTagger.tagset:
    print("P({0:>5}|VERB) = {1:.10f}".format(tag, POSTagger.transitions['VERB'].P[tag]))

P( PRON|VERB) = 0.1428773969
P(  ADV|VERB) = 0.0838266048
P(  NUM|VERB) = 0.0190692234
P( NOUN|VERB) = 0.0504377415
P(  PRT|VERB) = 0.0101017262
P(  DET|VERB) = 0.1250132917
P( VERB|VERB) = 0.0805656967
P(    .|VERB) = 0.3084393719
P(    X|VERB) = 0.0002835572
P(  ADJ|VERB) = 0.0323964130
P(  ADP|VERB) = 0.1088505299
P( CONJ|VERB) = 0.0381384468


Here, the conditional probability P(.|VERB) is the highest, which might be surprising. A likely reason is that many German verbs are positioned at the end of a clause or sentence, right before a comma or the final period (the '.' tag covers punctuation marks in general).

Now let's move to the emission probabilities, which estimate the probability of an observation being emitted from each state. In POS tagging, the observations are the words and the states are the POS tags. Like the transition probabilities, we can think of them as 12 different probability distributions, one per POS tag. The difference is that the event space for the emission probabilities is the set of all possible observations, which is the word vocabulary in our case.

Let's check the probability of the word 'der' being emitted from the tag 'DET'.

In [9]:
print("P('der'|DET) = {0:.10f}".format(POSTagger.emissions['DET'].P['der']))

P('der'|DET) = 0.0920676830


This value is actually quite high if we consider that the event space is very large (the size of the word vocabulary of the training data). If we check the probability of another word being emitted from the tag 'DET', it would be much smaller. 

In [10]:
print("P('klein'|DET) = {0:.10f}".format(POSTagger.emissions['DET'].P['klein']))

P('klein'|DET) = 0.0000126937


This is the case because the word 'klein' has never been tagged as 'DET' in the training data. However, add-one smoothing has prevented a zero-valued emission probability here, which is very desirable: zero probabilities can cause runtime errors (for example, if logarithms are taken) and rule out entire tag sequences. Let's see what value we get for the word 'klein' given the tag 'ADJ', or adjective.

In [11]:
print("P('klein'|ADJ) = {0:.10f}".format(POSTagger.emissions['ADJ'].P['klein']))

P('klein'|ADJ) = 0.0001885999


That makes sense! The word 'klein' is an adjective, so the probability P('klein'|ADJ) should be much higher than P('klein'|DET).

Another source of zero-valued emission probabilities is unknown words. These words were never observed in the training data, thus their emission probability under each tag is zero. To solve this problem, the HMMTagger class adds a pseudo-word, '<UNK>', to the word vocabulary before applying smoothing.

To make sure that the pseudo-word '<UNK>' is actually in the vocabulary, let's do the following.

In [12]:
'<UNK>' in POSTagger.word_vocab

True

During decoding, every unknown word is replaced by this '<UNK>' token. Therefore, the emission probabilities of any unknown word are the values allocated to the token '<UNK>'. Let's check the emission probability of '<UNK>' being emitted from the tag 'DET'.

In [13]:
print("P('<UNK>'|DET) = {0:.10f}".format(POSTagger.emissions['DET'].P['<UNK>']))

P('<UNK>'|DET) = 0.0000126937


It can be observed that the probabilities P('<UNK>'|DET) and P('klein'|DET) are equal. The reason is that neither the word 'klein' nor the pseudo-word '<UNK>', which was added to the word vocabulary before smoothing, has ever been observed with the tag 'DET' in the training data. Thus, both receive the probability mass allocated to unseen events.
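
The mechanism described above can be sketched in a few lines: add '<UNK>' to the vocabulary before smoothing, then map out-of-vocabulary words to it at decoding time. The helper names `emission_probs` and `map_unknown`, as well as the toy counts, are hypothetical and only mirror the behaviour described in this notebook.

```python
from collections import Counter

UNK = '<UNK>'


def emission_probs(word_counts, vocab):
    """Add-one smoothed P(word | tag) for one tag over vocab (incl. <UNK>)."""
    total = sum(word_counts[w] for w in vocab) + len(vocab)
    return {w: (word_counts[w] + 1) / total for w in vocab}


def map_unknown(words, vocab):
    """Replace out-of-vocabulary words with the <UNK> pseudo-word."""
    return [w if w in vocab else UNK for w in words]


# toy data: words seen with DET, and the full training vocabulary
det_counts = Counter({'der': 6, 'die': 4})
vocab = {'der', 'die', 'klein', UNK}  # <UNK> is added before smoothing

P_det = emission_probs(det_counts, vocab)
print(P_det['klein'] == P_det[UNK])   # both were never seen with DET
print(map_unknown(['der', 'GitHub'], vocab))
```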

## (2) Decoding word sequences 
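
The `decode` method used in this section is not shown in the notebook. HMM taggers commonly find the best tag sequence with the Viterbi algorithm, so here is a minimal log-space sketch of it on a two-tag toy model. The function signature and the dictionary layout are assumptions made for illustration and may differ from what HMMTagger does internally.

```python
import math


def viterbi(words, tagset, initial, transition, emission):
    """Return the most probable tag sequence for words.

    initial[t], transition[prev][t], and emission[t][word] are assumed to be
    smoothed (non-zero) probabilities; logs avoid numerical underflow.
    """
    if not words:
        return []
    tags = list(tagset)
    # best[i][t]: log-probability of the best path ending in tag t at position i
    best = [{t: math.log(initial[t]) + math.log(emission[t][words[0]])
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # pick the best previous tag for reaching tag t
            prev = max(tags,
                       key=lambda p: best[i - 1][p] + math.log(transition[p][t]))
            best[i][t] = (best[i - 1][prev] + math.log(transition[prev][t])
                          + math.log(emission[t][words[i]]))
            back[i][t] = prev
    # follow the back-pointers from the best final tag
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# toy model with made-up probabilities
toy_tagset = ['DET', 'NOUN']
toy_initial = {'DET': 0.8, 'NOUN': 0.2}
toy_transition = {'DET': {'DET': 0.1, 'NOUN': 0.9},
                  'NOUN': {'DET': 0.5, 'NOUN': 0.5}}
toy_emission = {'DET': {'das': 0.9, 'Haus': 0.1},
                'NOUN': {'das': 0.1, 'Haus': 0.9}}
toy_path = viterbi(['das', 'Haus'], toy_tagset, toy_initial,
                   toy_transition, toy_emission)
print(toy_path)
```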

Now, we can use the POS Tagger that has been trained to tag actual sentences. 

In [14]:
words = 'die Hölle ist leer , alle Teufel sind hier.  .'.split()
tags = POSTagger.decode(words)

for word, tag in zip(words, tags):
    print("{0:15} {1:5}".format(word, tag))

die             DET  
Hölle           NOUN 
ist             VERB 
leer            VERB 
,               .    
alle            PRON 
Teufel          NOUN 
sind            VERB 
hier.           VERB 
.               .    


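That's cool, but the tagger makes some errors: the adjective 'leer' is tagged as VERB, and so is the last word. Note that the token 'hier.' was not split from its period in the input, so it is most likely handled as an unknown word. Let's investigate the tagger's performance on another sentence.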

In [15]:
words = 'Im Juli 2012 erhielt GitHub eine Investition von 100 Millionen .'.split()
tags = POSTagger.decode(words)

for word, tag in zip(words, tags):
    print("{0:15} {1:5}".format(word, tag))

Im              ADP  
Juli            NOUN 
2012            NUM  
erhielt         VERB 
GitHub          ADP  
eine            DET  
Investition     NOUN 
von             ADP  
100             NUM  
Millionen       NOUN 
.               .    


It seems like all words have been tagged correctly, except for the proper noun 'GitHub', which has been incorrectly tagged as 'ADP', or adposition (this category groups prepositions and postpositions).

Why was this word confusing? We can check if the word 'GitHub' is actually an unknown word. 

In [16]:
'GitHub' in POSTagger.word_vocab

False

The word 'GitHub' is indeed an unknown word, so emission probabilities won't help much in deciding the correct tag here.

However, add-one smoothing is not an optimal smoothing technique. A better technique is the so-called absolute discounting. The HMMTagger class supports absolute discounting as well, but we need to choose the discounting parameter d, which is a floating-point value in the interval [0, 1]. Let's create another POS tagger with absolute discounting as the smoothing technique.

In [17]:
POSTagger_disc = HMMTagger(train_corpus, smoothing='abs_disc', d=0.4)

We will use the new tagger object to tag the same sentence as before and display the tags.

In [18]:
tags_disc = POSTagger_disc.decode(words)

for word, tag in zip(words, tags_disc):
    print("{0:15} {1:5}".format(word, tag))

Im              ADP  
Juli            NOUN 
2012            NUM  
erhielt         VERB 
GitHub          NOUN 
eine            DET  
Investition     NOUN 
von             ADP  
100             NUM  
Millionen       NOUN 
.               .    


This time the word 'GitHub' has been tagged correctly as a 'NOUN'. It seems that the choice of smoothing technique is crucial for the performance of the HMM-based POS tagger. Let's check our first sentence with the new absolute discounting-based tagger (this time with 'hier' properly separated from the final period).

In [19]:
words = 'die Hölle ist leer , alle Teufel sind hier  .'.split()
tags = POSTagger_disc.decode(words)

for word, tag in zip(words, tags):
    print("{0:15} {1:5}".format(word, tag))

die             DET  
Hölle           NOUN 
ist             VERB 
leer            ADJ  
,               .    
alle            PRON 
Teufel          NOUN 
sind            VERB 
hier            ADV  
.               .    


Much better indeed: 'leer' is now tagged as ADJ and 'hier' as ADV. Absolute discounting as a smoothing technique might take some time to understand, but it works noticeably better than add-one smoothing here.
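
As a rough illustration of the idea, the sketch below subtracts a constant d from every non-zero count and spreads the reserved mass uniformly over the unseen events. This is one common textbook variant; the function name `abs_discount_probs` and the uniform redistribution are assumptions, and the HMMTagger implementation may redistribute the mass differently.

```python
def abs_discount_probs(counts, event_space, d=0.4):
    """Absolute discounting with uniform redistribution to unseen events.

    Assumes integer counts and 0 <= d <= 1, so discounted counts stay >= 0.
    """
    total = sum(counts.get(e, 0) for e in event_space)
    seen = [e for e in event_space if counts.get(e, 0) > 0]
    unseen = [e for e in event_space if counts.get(e, 0) == 0]
    if total == 0:
        # no observations at all: fall back to a uniform distribution
        return {e: 1 / len(event_space) for e in event_space}
    if not unseen:
        # nothing to redistribute to: plain relative frequencies
        return {e: counts[e] / total for e in event_space}
    reserved = d * len(seen) / total  # mass taken away from seen events
    probs = {e: (counts[e] - d) / total for e in seen}
    probs.update({e: reserved / len(unseen) for e in unseen})
    return probs


# toy counts (made-up numbers)
toy_counts = {'NOUN': 7, 'ADJ': 3}
toy_tagset = ['NOUN', 'ADJ', 'VERB', 'ADV']
probs = abs_discount_probs(toy_counts, toy_tagset, d=0.4)
print(probs)
print(sum(probs.values()))
```

Unlike add-one smoothing, the amount of mass moved to unseen events depends on d and on the number of distinct seen events, not on the size of the event space, which matters a lot when the event space is a large word vocabulary.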

## (3) POS Taggers Evaluation

Now we have two POS taggers: one uses add-one smoothing and the other uses absolute discounting. We can quantitatively evaluate the performance of the two taggers using a disjoint test set. Let's read the test data.

In [20]:
test_corpus = ConllCorpusReader(
    directory, test_fileid, ['words'], tagset='universal', encoding='utf8'
)

Use the tagger on the test data and save the tagged words to a file.

In [21]:
with open('tagged_add_one.tt', 'w') as op_file:
    for sent in test_corpus.sents():
        predicted_tags = POSTagger.decode(sent)

        for w, tag in zip(sent, predicted_tags):
            op_file.write(w + '\t' + tag + '\n')

        op_file.write('\n')

To evaluate the POS tagger, the evaluation script can be used as follows. The file 'data/de-eval.tt' contains the gold annotations of the test data.

In [22]:
%run eval.py data/de-eval.tt tagged_add_one.tt


Comparing gold file "data/de-eval.tt" and system file "tagged_add_one.tt"

Precision, recall, and F1 score:

  DET 0.7190 0.9808 0.8297
 NOUN 0.8934 0.8442 0.8681
 VERB 0.8613 0.8630 0.8621
  ADP 0.8598 0.9852 0.9182
    . 0.9243 0.9803 0.9515
 CONJ 0.9464 0.8652 0.9040
 PRON 0.8111 0.7862 0.7985
  ADV 0.8847 0.7506 0.8121
  ADJ 0.7809 0.6111 0.6857
  NUM 1.0000 0.5593 0.7173
  PRT 0.9468 0.8111 0.8737

Accuracy: 0.8608



The tagger based on add-one smoothing gives about 86.1% accuracy, which is not impressive for the POS tagging problem. Now let's check the performance of the tagger with absolute discounting as the smoothing technique.

In [23]:
with open('tagged_abs_disc.tt', 'w') as op_file:
    for sent in test_corpus.sents():
        predicted_tags = POSTagger_disc.decode(sent)

        for w, tag in zip(sent, predicted_tags):
            op_file.write(w + '\t' + tag + '\n')

        op_file.write('\n')

And there we go...

In [24]:
%run eval.py data/de-eval.tt tagged_abs_disc.tt


Comparing gold file "data/de-eval.tt" and system file "tagged_abs_disc.tt"

Precision, recall, and F1 score:

  DET 0.9092 0.9761 0.9415
 NOUN 0.8476 0.9835 0.9105
 VERB 0.9605 0.8712 0.9137
  ADP 0.9632 0.9762 0.9697
    . 0.9983 0.9992 0.9987
 CONJ 0.9544 0.8974 0.9250
 PRON 0.9391 0.8309 0.8817
  ADV 0.9234 0.7893 0.8511
  ADJ 0.7993 0.6485 0.7160
  NUM 0.9906 0.7778 0.8714
  PRT 0.8730 0.8730 0.8730
    X 0.2000 0.0909 0.1250

Accuracy: 0.9136



Et voilà! Absolute discounting is way better: the accuracy went up from 86.1% to 91.4%.
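
For reference, scores like the ones printed above (per-tag precision, recall, and F1, plus overall accuracy) can be computed along these lines. This is a minimal sketch on toy tag lists; the function name `evaluate` is hypothetical, and this is not the actual `eval.py` script, whose details are not shown in this notebook.

```python
from collections import Counter


def evaluate(gold_tags, pred_tags):
    """Return overall accuracy and per-tag (precision, recall, F1)."""
    if len(gold_tags) != len(pred_tags):
        raise ValueError("gold and predicted sequences differ in length")
    correct = Counter()
    gold_n, pred_n = Counter(gold_tags), Counter(pred_tags)
    for g, p in zip(gold_tags, pred_tags):
        if g == p:
            correct[g] += 1
    scores = {}
    for tag in sorted(set(gold_tags) | set(pred_tags)):
        # precision: correct / predicted; recall: correct / gold
        precision = correct[tag] / pred_n[tag] if pred_n[tag] else 0.0
        recall = correct[tag] / gold_n[tag] if gold_n[tag] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[tag] = (precision, recall, f1)
    accuracy = sum(correct.values()) / len(gold_tags)
    return accuracy, scores


# toy example (made up): one ADJ is mistagged as VERB
gold = ['DET', 'NOUN', 'VERB', 'ADJ']
pred = ['DET', 'NOUN', 'VERB', 'VERB']
accuracy, scores = evaluate(gold, pred)
print("Accuracy: {0:.4f}".format(accuracy))
for tag, (p, r, f) in scores.items():
    print("{0:>5} {1:.4f} {2:.4f} {3:.4f}".format(tag, p, r, f))
```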

## (4) Final remarks 

In this Jupyter notebook, I shared some ideas about the HMM-based POS tagger I developed for German. There are a few things that I would like to investigate or improve in the tagger.

* Investigate the impact of the discounting parameter on the performance. 
* Develop a trigram HMM. That is, transition probabilities should be conditioned on the two previous states instead of a single previous state.
* Visualise the Viterbi matrix.
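
To give a feel for the trigram idea in the list above, here is a small sketch of how transition counts could be collected when conditioning on the two previous tags. It is only an illustration of the planned extension under assumed names (`trigram_transition_counts`, with '<S>' used as start padding); nothing like this exists in the current tagger.

```python
from collections import Counter, defaultdict


def trigram_transition_counts(tag_sequences, start='<S>'):
    """Count tag transitions conditioned on the two previous tags."""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        # pad with two start symbols so the first tags have a full history
        padded = [start, start] + list(tags)
        for t1, t2, t3 in zip(padded, padded[1:], padded[2:]):
            counts[(t1, t2)][t3] += 1
    return counts


# toy tag sequences (made up)
toy_tags = [['DET', 'NOUN', 'VERB'], ['DET', 'ADJ', 'NOUN', 'VERB']]
counts = trigram_transition_counts(toy_tags)
print(counts[('<S>', 'DET')])
print(counts[('DET', 'NOUN')])
```

With 12 tags there are 144 possible histories instead of 12, so the counts become much sparser and smoothing matters even more.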
