*The following is a very naive implementation of a hidden Markov model (HMM) for part-of-speech (POS) tagging.*<br/>
*Several improvements are possible: word preprocessing, handling out-of-vocabulary (OOV) words, etc.*<br/>
*Current state-of-the-art transformer models do POS tagging with an accuracy of ~98%.*<br/>
*I was able to achieve approximately 70% matching with this statistical model.*<br/>

In [1]:
import nltk

In [2]:
dataset = nltk.corpus.treebank

In [3]:
tagged_sents = list(dataset.tagged_sents())

In [4]:
tagged_sents[:5]

[[('Pierre', 'NNP'),
  ('Vinken', 'NNP'),
  (',', ','),
  ('61', 'CD'),
  ('years', 'NNS'),
  ('old', 'JJ'),
  (',', ','),
  ('will', 'MD'),
  ('join', 'VB'),
  ('the', 'DT'),
  ('board', 'NN'),
  ('as', 'IN'),
  ('a', 'DT'),
  ('nonexecutive', 'JJ'),
  ('director', 'NN'),
  ('Nov.', 'NNP'),
  ('29', 'CD'),
  ('.', '.')],
 [('Mr.', 'NNP'),
  ('Vinken', 'NNP'),
  ('is', 'VBZ'),
  ('chairman', 'NN'),
  ('of', 'IN'),
  ('Elsevier', 'NNP'),
  ('N.V.', 'NNP'),
  (',', ','),
  ('the', 'DT'),
  ('Dutch', 'NNP'),
  ('publishing', 'VBG'),
  ('group', 'NN'),
  ('.', '.')],
 [('Rudolph', 'NNP'),
  ('Agnew', 'NNP'),
  (',', ','),
  ('55', 'CD'),
  ('years', 'NNS'),
  ('old', 'JJ'),
  ('and', 'CC'),
  ('former', 'JJ'),
  ('chairman', 'NN'),
  ('of', 'IN'),
  ('Consolidated', 'NNP'),
  ('Gold', 'NNP'),
  ('Fields', 'NNP'),
  ('PLC', 'NNP'),
  (',', ','),
  ('was', 'VBD'),
  ('named', 'VBN'),
  ('*-1', '-NONE-'),
  ('a', 'DT'),
  ('nonexecutive', 'JJ'),
  ('director', 'NN'),
  ('of', 'IN'),
  ('this'

In [5]:
len(tagged_sents)

3914

In [6]:
n = len(tagged_sents)
train_data = tagged_sents[:int(0.75*n)]
test_data = tagged_sents[int(0.75*n):]
print(len(train_data), len(test_data))

2935 979


In [7]:
word_label_pair_train = [i for sent in train_data for i in sent]
print(len(word_label_pair_train))

75784


In [8]:
vocab_train = list(set([pair[0] for pair in word_label_pair_train]))
tags = list(set([pair[1] for pair in word_label_pair_train]))
print(f"Number of tokens: {len(vocab_train)}, and the number of tags are: {len(tags)}")

Number of tokens: 10634, and the number of tags are: 46


In [9]:
print(tags)

['VBP', '-LRB-', 'JJ', '#', 'SYM', ',', '``', 'DT', 'WP$', '$', 'POS', 'UH', 'RB', 'NN', 'MD', "''", 'VB', 'CC', 'VBN', 'NNP', 'JJR', 'PRP', '-NONE-', 'PDT', 'TO', '.', 'WDT', 'RBS', 'IN', 'JJS', ':', 'CD', 'RP', 'RBR', '-RRB-', 'EX', 'NNPS', 'VBD', 'FW', 'VBG', 'NNS', 'VBZ', 'WP', 'WRB', 'LS', 'PRP$']


In [72]:
word_to_index = {word: i for i, word in enumerate(vocab_train)}
index_to_word = {i: word for i, word in enumerate(vocab_train)}
tags_to_index = {tag: i for i, tag in enumerate(tags)}
index_to_tags = {i: tag for i, tag in enumerate(tags)}

print(word_to_index["will"], index_to_word[word_to_index["will"]])

847 will


**A description of each tag can be found here: <br/>
[link](https://www.eecis.udel.edu/~vijay/cis889/ie/pos-set.pdf)**

**Emission probabilities**: P(w|t) = C(w, t)/C(t), the probability of word w given tag t

In [67]:
import numpy as np

tags_count = len(tags)
vocab_size = len(vocab_train)
emmission_counts = np.ones((tags_count, vocab_size)) # add-one smoothing

In [68]:
def emmision_count(word, tag, data=word_label_pair_train):
    """
    Parameters:
        word: (string),
        tag: (string),
        data: (list of tuples {(word, tag)})
    Returns:
        (count_word_as_tag, count_tag): (int, int)
    """

    tag_appearance = [tup for tup in data if tup[1] == tag]
    count_tag = len(tag_appearance)
    word_as_tag = [tup for tup in tag_appearance if tup[0] == word]
    count_word_as_tag = len(word_as_tag)
    return count_word_as_tag, count_tag

In [69]:
count_will_as_MD, count_MD = emmision_count("will", "MD")
print(f"Count of word 'will' as tag 'MD' is {count_will_as_MD} and count of tag 'MD' is {count_MD} | percentage: {count_will_as_MD/count_MD * 100}")

Count of word 'will' as tag 'MD' is 202 and count of tag 'MD' is 681 | percentage: 29.662261380323052


In [73]:
# updating the emmission_counts matrix
for i, tag in enumerate(tags):
    for j, word in enumerate(vocab_train):
        emmission_counts[i, j] += emmision_count(word, tag)[0]
    print(f"{index_to_tags[i]} is done")

VBP is done
-LRB- is done
JJ is done
# is done
SYM is done
, is done
`` is done
DT is done
WP$ is done
$ is done
POS is done
UH is done
RB is done
NN is done
MD is done
'' is done
VB is done
CC is done
VBN is done
NNP is done
JJR is done
PRP is done
-NONE- is done
PDT is done
TO is done
. is done
WDT is done
RBS is done
IN is done
JJS is done
: is done
CD is done
RP is done
RBR is done
-RRB- is done
EX is done
NNPS is done
VBD is done
FW is done
VBG is done
NNS is done
VBZ is done
WP is done
WRB is done
LS is done
PRP$ is done


In [74]:
emmission_counts[tags_to_index["MD"], word_to_index["will"]]

203.0
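The nested loop above rescans the whole training list for every (tag, word) combination, which is why it needs progress messages. As a rough sketch of an alternative, the same counts can be gathered in one pass with `collections.Counter`. The toy `pairs` list and the names `vocab`, `tagset`, `w2i` and `t2i` are illustrative stand-ins, not objects from this notebook.

```python
from collections import Counter
import numpy as np

# Toy stand-in for word_label_pair_train (hypothetical data, not from the treebank)
pairs = [("the", "DT"), ("dog", "NN"), ("will", "MD"), ("run", "VB"),
         ("the", "DT"), ("will", "NN")]

vocab = sorted({w for w, _ in pairs})
tagset = sorted({t for _, t in pairs})
w2i = {w: i for i, w in enumerate(vocab)}
t2i = {t: i for i, t in enumerate(tagset)}

# Count every (word, tag) pair in a single pass over the data
pair_counts = Counter(pairs)

counts = np.ones((len(tagset), len(vocab)))  # start at one: add-one smoothing
for (word, tag), c in pair_counts.items():
    counts[t2i[tag], w2i[word]] += c

print(counts[t2i["MD"], w2i["will"]])  # 2.0 = 1 (smoothing) + 1 (observed)
```

On the real data the same idea would apply to `word_label_pair_train`, `word_to_index` and `tags_to_index`, and should fill the matrix in seconds rather than minutes.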

In [104]:
print(index_to_word[np.argmax(emmission_counts[tags_to_index["MD"]])], 
      index_to_word[np.argmax(emmission_counts[tags_to_index["DT"]])],
      index_to_word[np.argmax(emmission_counts[tags_to_index["JJ"]])])

will the new


In [109]:
# to obtain the probabilities we need C(t) for every tag
# (the word argument is irrelevant here, only the tag count is used)
count_tags = np.zeros(tags_count)
for i, tag in enumerate(tags):
    count_tags[i] = emmision_count("", tag)[1]

In [122]:
count_tags = count_tags.reshape(-1, 1)

emmission_probs = emmission_counts/count_tags

In [123]:
emmission_counts[tags_to_index["MD"], word_to_index["will"]], emmission_probs[tags_to_index["MD"], word_to_index["will"]]

(203.0, 0.29809104258443464)
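Note that the counts were initialised to one (add-one smoothing) but are divided by the raw tag count C(t), so each row of `emmission_probs` sums to slightly more than 1. The usual add-one estimate divides by C(t) + V, where V is the vocabulary size. The sketch below shows the difference on a small made-up count matrix; `raw_counts` and its values are hypothetical.

```python
import numpy as np

# Toy raw counts C(w, t): 2 tags x 3 words (hypothetical numbers)
raw_counts = np.array([[3.0, 0.0, 1.0],
                       [0.0, 2.0, 0.0]])
vocab_size = raw_counts.shape[1]

smoothed = raw_counts + 1                           # add-one smoothing
tag_totals = raw_counts.sum(axis=1, keepdims=True)  # C(t), one row per tag

# Dividing by C(t) alone leaves rows that sum to more than 1
unnormalised = smoothed / tag_totals
# Dividing by C(t) + V gives a proper distribution for each tag
probs = smoothed / (tag_totals + vocab_size)

print(unnormalised.sum(axis=1))  # [1.75 2.5 ]
print(probs.sum(axis=1))         # [1. 1.]
```

Since Viterbi only compares scores, the unnormalised version still produces a ranking, but it weights rare tags more heavily than a normalised estimate would.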

**Transition probabilities**: P(t2|t1) = C(t1, t2)/C(t1), the probability of tag t2 given that the previous tag is t1

In [29]:
def get_transition_proba(tag2, tag1, data=word_label_pair_train):
    """
    Parameters:
        tag2: (string) the tag being predicted,
        tag1: (string) the preceding tag,
        data: (list of tuples {(word, tag)})
    Returns:
        percentage: (float) P(t2|t1)*100
    """
    count_tag1 = 0
    count_tag1_tag2 = 0
    tag_list = [tup[1] for tup in data]
    for t1, t2 in zip(tag_list, tag_list[1:]):
        if t1 == tag1:
            count_tag1 += 1
            if t2 == tag2:
                count_tag1_tag2 += 1
    return count_tag1_tag2/count_tag1 * 100

In [30]:
print("Transition probability from MD to VB (%) is: ", get_transition_proba("VB", "MD")) # prob of a verb following a modal verb

Transition probability from MD to VB (%) is:  80.76358296622614


In [31]:
# build the transition matrix
transition_probs = np.zeros((tags_count, tags_count)) # transition_probs[i, j] = P(t_j|t_i)
for i in range(len(tags)):
    for j in range(len(tags)):
        transition_probs[i, j] = get_transition_proba(tags[j], tags[i]) / 100

In [58]:
transition_probs[tags_to_index["MD"], tags_to_index["VB"]]

0.8076358296622614
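Two things about the construction above are worth noting: `get_transition_proba` rescans the full tag list for each of the 46 x 46 pairs, and because it works on the flattened list, the last tag of one sentence is counted as preceding the first tag of the next. A possible single-pass variant that counts bigrams inside each sentence only is sketched below. The toy `sents` data and the names `tagset`, `t2i`, `trans_counts` and `trans_probs` are assumptions made for the example.

```python
import numpy as np

# Toy tagged sentences standing in for train_data (hypothetical data)
sents = [[("the", "DT"), ("dog", "NN"), ("will", "MD"), ("run", "VB")],
         [("dogs", "NN"), ("will", "MD"), ("bark", "VB")]]

tagset = sorted({t for s in sents for _, t in s})
t2i = {t: i for i, t in enumerate(tagset)}

trans_counts = np.zeros((len(tagset), len(tagset)))
for sent in sents:
    sent_tags = [t for _, t in sent]
    # pair each tag with its successor inside the same sentence only
    for t1, t2 in zip(sent_tags, sent_tags[1:]):
        trans_counts[t2i[t1], t2i[t2]] += 1

row_totals = trans_counts.sum(axis=1, keepdims=True)
# rows with no outgoing transition stay all-zero instead of dividing by zero
trans_probs = np.divide(trans_counts, row_totals,
                        out=np.zeros_like(trans_counts), where=row_totals > 0)

print(trans_probs[t2i["MD"], t2i["VB"]])  # 1.0
```

The `where=` guard also sidesteps the `ZeroDivisionError` that a tag with no successor would otherwise trigger.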

![viterbi](./images/viterbi.png)

**VITERBI ALGORITHM**

In [44]:
initial_state_probs = [] # initial_state_probs[i] = prob that a sentence starts with tag i, approximated here by P(tag i | ".")
for tag in tags:
    initial_state_probs.append(get_transition_proba(tag, ".")/100)

In [48]:
print(np.argmax(initial_state_probs), tags_to_index["DT"])

7 7
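Using the transition out of "." is an approximation: not every sentence ends with a period, and the very first sentence has no predecessor. A more direct estimate, sketched below under the assumption that the data is still available sentence by sentence (as `train_data` is here), counts the tag of the first token of each sentence. The toy data and the names `toy_sents`, `toy_tags` and `toy_initial_probs` are illustrative only.

```python
from collections import Counter
import numpy as np

# Toy tagged sentences standing in for train_data (hypothetical data)
toy_sents = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", ".")],
             [("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ"), (".", ".")],
             [("dogs", "NNS"), ("bark", "VBP"), (".", ".")]]

toy_tags = sorted({t for sent in toy_sents for _, t in sent})

# Count the tag of the first token of every sentence
first_tag_counts = Counter(sent[0][1] for sent in toy_sents)
toy_initial_probs = np.array([first_tag_counts[t] for t in toy_tags]) / len(toy_sents)

print(dict(zip(toy_tags, toy_initial_probs)))
```

Here "DT" gets probability 2/3 and "NNS" gets 1/3, while tags that never open a sentence get 0.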


In [138]:
def Viterbi(sentence):
    """
        Parameters:
            sentence: (list of strings)
        Returns:
            best_path: (list of ints) most likely tag index for each word
            best_path_prob: (float) probability of that tag sequence
    """
    viterbi = np.zeros((tags_count, len(sentence)))
    backpointer = np.zeros((tags_count, len(sentence)), dtype=int)
    for i in range(tags_count):
        # an OOV first word has no emission estimate, so use the initial probability alone
        emission = emmission_probs[i, word_to_index[sentence[0]]] if sentence[0] in word_to_index else 1.0
        viterbi[i, 0] = initial_state_probs[i] * emission
        backpointer[i, 0] = 0
    for t in range(1, len(sentence)):
        for s in range(tags_count):
            # score of reaching tag s at position t from every possible previous tag
            scores = viterbi[:, t-1] * transition_probs[:, s]
            backpointer[s, t] = np.argmax(scores)
            # words missing from the train vocab (OOV) are scored by transitions only
            if sentence[t] not in word_to_index:
                viterbi[s, t] = np.max(scores)
                continue
            viterbi[s, t] = np.max(scores) * emmission_probs[s, word_to_index[sentence[t]]]
    best_path_prob = np.max(viterbi[:, -1])
    best_path_pointer = np.argmax(viterbi[:, -1])
    best_path = [best_path_pointer]
    # follow the backpointers from the last word to the first
    for i in range(len(sentence)-1, 0, -1):
        best_path_pointer = backpointer[best_path_pointer, i]
        best_path.append(best_path_pointer)
    best_path.reverse()
    return best_path, best_path_prob

In [139]:
# let us test it on a sentence from the training split (index 78)
sentence = tagged_sents[78]
sentence_words = [tup[0] for tup in sentence]
sentence_tags = [tup[1] for tup in sentence]

best_path, _ = Viterbi(sentence_words)

print("The sentence is: ", sentence_words)
print(f"act.:{sentence_tags}")
print(f"pre.:{[index_to_tags[i] for i in best_path]}")
print(f"The percentage match is: {sum([1 if i==j else 0 for i, j in zip(sentence_tags, [index_to_tags[i] for i in best_path])])/len(sentence_tags) * 100}")

The sentence is:  ['The', 'governor', 'could', "n't", 'make', 'it', ',', 'so', 'the', 'lieutenant', 'governor', 'welcomed', 'the', 'special', 'guests', '.']
act.:['DT', 'NN', 'MD', 'RB', 'VB', 'PRP', ',', 'IN', 'DT', 'NN', 'NN', 'VBD', 'DT', 'JJ', 'NNS', '.']
pre.:['DT', 'NN', 'MD', 'RB', 'VB', 'PRP', ',', 'RB', 'DT', 'FW', '-RRB-', ':', 'DT', 'JJ', 'NNS', '.']
The percentage match is: 75.0
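Multiplying many small probabilities can underflow to zero on long sentences, at which point `argmax` no longer distinguishes between tags. A common remedy is to run Viterbi on log probabilities and add instead of multiply. The following is a minimal sketch of that idea on a made-up two-state model. The function `viterbi_log` and the toy `init`, `trans` and `emit` arrays are assumptions for illustration and are not part of the notebook above.

```python
import numpy as np

def viterbi_log(obs, init, trans, emit):
    """Most likely state sequence for observation indices `obs`, computed in log space."""
    with np.errstate(divide="ignore"):  # log(0) = -inf is intended here
        log_init, log_trans, log_emit = np.log(init), np.log(trans), np.log(emit)
    n_states, n_steps = len(init), len(obs)
    score = np.full((n_states, n_steps), -np.inf)
    back = np.zeros((n_states, n_steps), dtype=int)
    score[:, 0] = log_init + log_emit[:, obs[0]]
    for t in range(1, n_steps):
        # cand[i, j] = best score ending in state i at t-1, then moving to state j
        cand = score[:, t - 1][:, None] + log_trans
        back[:, t] = np.argmax(cand, axis=0)
        score[:, t] = np.max(cand, axis=0) + log_emit[:, obs[t]]
    # trace the backpointers from the best final state
    path = [int(np.argmax(score[:, -1]))]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return path[::-1], float(np.max(score[:, -1]))

# Toy model: states 0 = DT, 1 = NN; words 0 = "the", 1 = "dog" (hypothetical numbers)
init = np.array([0.8, 0.2])
trans = np.array([[0.1, 0.9],
                  [0.5, 0.5]])
emit = np.array([[0.9, 0.1],
                 [0.2, 0.8]])

path, log_prob = viterbi_log([0, 1], init, trans, emit)
print(path, np.exp(log_prob))

# A 1000-word input would underflow as a plain product but stays finite in log space
long_path, long_log_prob = viterbi_log([0, 1] * 500, init, trans, emit)
print(long_log_prob)
```

The inner loop over target tags is also replaced by one broadcasted addition per time step, which should be noticeably faster with 46 tags.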


In [140]:
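# testing on 5 randomly chosen sentences (drawn from the whole corpus, train and test)
import random
rnd = [random.randrange(len(tagged_sents)) for i in range(5)]  # randrange excludes the upper bound, so the index is always valid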
for i in rnd:
    sentence = tagged_sents[i]
    sentence_words = [tup[0] for tup in sentence]
    sentence_tags = [tup[1] for tup in sentence]
    best_path, _ = Viterbi(sentence_words)
    print("The sentence is: ", sentence_words)
    print(f"act.:{sentence_tags}")
    print(f"pre.:{[index_to_tags[i] for i in best_path]}")
    print(f"The percentage match is: {sum([1 if i==j else 0 for i, j in zip(sentence_tags, [index_to_tags[i] for i in best_path])])/len(sentence_tags) * 100}")
    print("\n")

The sentence is:  ['Commonwealth', 'Edison', 'now', 'faces', 'an', 'additional', 'court-ordered', 'refund', 'on', 'its', 'summer\\/winter', 'rate', 'differential', 'collections', 'that', 'the', 'Illinois', 'Appellate', 'Court', 'has', 'estimated', '*T*-1', 'at', '$', '140', 'million', '*U*', '.']
act.:['NNP', 'NNP', 'RB', 'VBZ', 'DT', 'JJ', 'JJ', 'NN', 'IN', 'PRP$', 'JJ', 'NN', 'JJ', 'NNS', 'IN', 'DT', 'NNP', 'NNP', 'NNP', 'VBZ', 'VBN', '-NONE-', 'IN', '$', 'CD', 'CD', '-NONE-', '.']
pre.:['LS', '-RRB-', ':', 'LS', '-RRB-', ':', 'LS', '-RRB-', ':', 'LS', '-RRB-', ':', 'LS', '-RRB-', 'IN', 'DT', 'NNP', 'SYM', 'NNP', 'VBZ', 'VBN', '-NONE-', 'IN', '$', 'CD', 'CD', '-NONE-', '.']
The percentage match is: 46.42857142857143


The sentence is:  ['Analysts', 'were', 'disappointed', '*-112', 'that', 'the', 'enthusiasm', '*ICH*-2', '0', 'investors', 'showed', '*T*-1', 'for', 'stocks', 'in', 'the', 'wake', 'of', 'Georgia-Pacific', "'s", '$', '3.18', 'billion', '*U*', 'bid', 'for', 'Great', 'North