## Part-of-Speech (POS) Tagging

The code below predicts the part-of-speech (POS) tag for each word in a given sentence.
I built the tagger as a Hidden Markov Model (HMM), which predicts a POS tag for every word.

### What is a POS tag?

In corpus linguistics, part-of-speech tagging (also called grammatical tagging or word-category disambiguation) is the process of marking up a word in a corpus as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph.

### Dataset

The dataset is the file `data.txt`, included alongside the code.

#### Dataset Description
##### Sample Tuple
b100-5507

Mr.	NOUN
<br>
Podger	NOUN
<br>
had	VERB
<br>
thanked	VERB
<br>
him	PRON
<br>
gravely	ADV
<br>
,	.
<br>
and	CONJ
<br>
now	ADV
<br>
he	PRON
<br>
made	VERB
<br>
use	NOUN
<br>
of	ADP
<br>
the	DET
<br>
advice	NOUN
<br>
.	.
<br>
##### Explanation
The first token "b100-5507" is just a key and acts like an identifier to indicate the beginning of a sentence.
<br>
The other tokens have a (Word, POS Tag) pairing.
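
To make the layout concrete, here is a minimal sketch of parsing this format from a string. The sample text, the keys `s-001` and `s-002`, and the helper name `parse_blocks` are made up for illustration; the sketch assumes a key line, then one tab-separated word and tag per line, with a blank line between sentences.

```python
from collections import namedtuple

Sentence = namedtuple("Sentence", "words tags")

# Hypothetical two-sentence sample in the assumed data.txt layout:
# a key line, then "word<TAB>tag" lines, with a blank line between sentences.
SAMPLE = (
    "s-001\n"
    "Mr.\tNOUN\n"
    "Podger\tNOUN\n"
    "had\tVERB\n"
    "\n"
    "s-002\n"
    "he\tPRON\n"
    "made\tVERB\n"
    "use\tNOUN\n"
    ".\t.\n"
)

def parse_blocks(text):
    """Map each sentence key to a Sentence(words, tags) tuple."""
    sentences = {}
    for block in text.strip().split("\n\n"):
        key, *token_lines = block.split("\n")
        # Each remaining line holds one word and its tag, separated by a tab.
        pairs = [line.strip().split("\t") for line in token_lines if line.strip()]
        words, tags = zip(*pairs)
        sentences[key] = Sentence(words, tags)
    return sentences

parsed = parse_blocks(SAMPLE)
print(parsed["s-002"].words)
print(parsed["s-002"].tags)
```

Keeping words and tags as two parallel tuples makes it easy to pass the words to a model and compare the result with the tags position by position.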

__The POS tags are:__
.
<br>
ADJ
<br>
ADP
<br>
ADV
<br>
CONJ
<br>
DET
<br>
NOUN
<br>
NUM
<br>
PRON
<br>
PRT
<br>
VERB
<br>
X

__Note__
<br>
__.__ marks punctuation and special characters such as '.' and ','
<br>
__X__ marks tokens that are mostly not part of the English vocabulary.
The others are standard POS tags.

In [1]:
# Importing all required libraries

import numpy as np
from itertools import chain
from collections import defaultdict, namedtuple, OrderedDict
from pomegranate import State, HiddenMarkovModel, DiscreteDistribution
import random

In [2]:
# Read all data from the given dataset.
# The functions and classes in this block read and parse the data file and the tags.
# They make data processing (reading, parsing) and the train/test split easier.

Sentence = namedtuple("Sentence", "words tags")

def read_data(filename):
    with open(filename, 'r') as f:
        sentence_lines = [l.split("\n") for l in f.read().split("\n\n")]
    return OrderedDict(((s[0], Sentence(*zip(*[l.strip().split("\t")
                        for l in s[1:]]))) for s in sentence_lines if s[0]))

def read_tags(filename):
    with open(filename, 'r') as f:
        tags = f.read().split("\n")
    return frozenset(tags)

class Subset(namedtuple("BaseSet", "sentences keys vocab X tagset Y N stream")):
    def __new__(cls, sentences, keys):
        word_sequences = tuple([sentences[k].words for k in keys])
        tag_sequences = tuple([sentences[k].tags for k in keys])
        wordset = frozenset(chain(*word_sequences))
        tagset = frozenset(chain(*tag_sequences))
        N = sum(1 for _ in chain(*(sentences[k].words for k in keys)))
        stream = tuple(zip(chain(*word_sequences), chain(*tag_sequences)))
        return super().__new__(cls, {k: sentences[k] for k in keys}, keys, wordset, word_sequences,
                               tagset, tag_sequences, N, stream.__iter__)

    def __len__(self):
        return len(self.sentences)

    def __iter__(self):
        return iter(self.sentences.items())

class Dataset(namedtuple("_Dataset", "sentences keys vocab X tagset Y training_tags_set testing_tags_set N stream")):
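    def __new__(cls, tagset, datafile, train_test_split=0.8, seed=112890):
        # The first argument is the set of tag names itself (no tag file is read),
        # so it is named tagset and stored on the dataset as-is.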
        sentences = read_data(datafile)
        keys = tuple(sentences.keys())
        wordset = frozenset(chain(*[s.words for s in sentences.values()]))
        word_sequences = tuple([sentences[k].words for k in keys])
        tag_sequences = tuple([sentences[k].tags for k in keys])
        N = sum(1 for _ in chain(*(s.words for s in sentences.values())))
        
        # split data into train/test sets
        _keys = list(keys)
        if seed is not None: random.seed(seed)
        random.shuffle(_keys)
        split = int(train_test_split * len(_keys))
        training_tags_data = Subset(sentences, _keys[:split])
        testing_tags_data = Subset(sentences, _keys[split:])
        stream = tuple(zip(chain(*word_sequences), chain(*tag_sequences)))
        return super().__new__(cls, dict(sentences), keys, wordset, word_sequences, tagset,
                               tag_sequences, training_tags_data, testing_tags_data, N, stream.__iter__)

    def __len__(self):
        return len(self.sentences)

    def __iter__(self):
        return iter(self.sentences.items())
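
The train/test split in `Dataset.__new__` shuffles the sentence keys with a fixed seed and cuts the list at the requested fraction. The following is a minimal sketch of that idea on its own. The helper name `split_keys` and the keys are hypothetical, and the sketch uses a local `random.Random` generator rather than the notebook's global `random.seed`, so it illustrates the technique rather than reproducing the notebook's exact ordering.

```python
import random

def split_keys(keys, train_fraction=0.8, seed=112890):
    """Shuffle a copy of keys with a fixed seed and cut it into train/test lists."""
    shuffled = list(keys)
    rng = random.Random(seed)  # local generator; the global random state is untouched
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical sentence keys, for illustration only.
keys = ["s-{:03d}".format(i) for i in range(10)]
train_keys, test_keys = split_keys(keys)
print(len(train_keys), len(test_keys))  # → 8 2
```

Because the seed is fixed, calling `split_keys` again with the same keys returns the same split, which keeps accuracy figures comparable between runs.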

In [3]:
# Pre-process the data.
# tags_data holds the parsed words and tags, with 80% of the sentences used for training.
tagset = {'PRON', 'ADJ', 'NUM', 'DET', 'NOUN', 'VERB', 'X', 'CONJ', 'ADV', 'PRT', 'ADP', '.'}
tags_data = Dataset(tagset, "data.txt", train_test_split=0.8)
tags_data.training_tags_set[0]

{'b100-35433': Sentence(words=('Whenever', 'artists', ',', 'indeed', ',', 'turned', 'to', 'actual', 'representations', 'or', 'molded', 'three-dimensional', 'figures', ',', 'which', 'were', 'rare', 'down', 'to', '800', 'B.C.', ',', 'they', 'tended', 'to', 'reflect', 'reality', '(', 'see', 'Plate', '6a', ',', '9b', ')', ';', ';'), tags=('ADV', 'NOUN', '.', 'ADV', '.', 'VERB', 'ADP', 'ADJ', 'NOUN', 'CONJ', 'VERB', 'ADJ', 'NOUN', '.', 'DET', 'VERB', 'ADJ', 'PRT', 'ADP', 'NUM', 'NOUN', '.', 'PRON', 'VERB', 'PRT', 'VERB', 'NOUN', '.', 'VERB', 'NOUN', 'NUM', '.', 'NUM', '.', '.', '.')),
 'b100-16721': Sentence(words=('For', 'almost', 'two', 'months', ',', 'the', 'defendant', 'and', 'the', 'world', 'heard', 'from', 'individuals', 'escaped', 'from', 'the', 'grave', 'about', 'fathers', 'and', 'mothers', ',', 'graybeards', ',', 'adolescents', ',', 'babies', ',', 'starved', ',', 'beaten', 'to', 'death', ',', 'strangled', ',', 'machine-gunned', ',', 'gassed', ',', 'burned', '.'), tags=('ADP', 'ADV'

In [4]:
# Check how the data looks after processing, using the first record of the dataset.
# X contains the words and Y contains the corresponding tags.

print(tags_data.X[0])
print(tags_data.Y[0])

('Mr.', 'Podger', 'had', 'thanked', 'him', 'gravely', ',', 'and', 'now', 'he', 'made', 'use', 'of', 'the', 'advice', '.')
('NOUN', 'NOUN', 'VERB', 'VERB', 'PRON', 'ADV', '.', 'CONJ', 'ADV', 'PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN', '.')


In [5]:
# Count how often each (tag, word) pair occurs in two parallel sequences.
# The result is a nested dictionary: counts[tag][word].

def pair_counts(tags, words):
    d = defaultdict(lambda: defaultdict(int))
    for tag, word in zip(tags, words):
        d[tag][word] += 1
    return d

# Return the input sequence with each unknown word replaced by the literal string 'nan'.
def replace_unknown(sequence):
    return [w if w in tags_data.training_tags_set.vocab else 'nan' for w in sequence]

# Decode one sentence with Viterbi and return the tag names,
# dropping the model's start and end states from the path.
def simplify_decoding(X, model):
    _, state_path = model.viterbi(replace_unknown(X))
    return [state[1].name for state in state_path[1:-1]]

# Return a dictionary that maps each unique value in the input sequences
# to the number of times it occurs across all sequences.

import itertools
def unigram_counts(sequences):
    sequences = itertools.chain.from_iterable(sequences)
    dictionary = {}
    
    for i_seq in sequences:
        if i_seq in dictionary.keys():
            dictionary[i_seq] += 1
        else:
            dictionary[i_seq] = 1
    
    return dictionary


tag_unigrams = unigram_counts(tags_data.training_tags_set.Y)


# Return a dictionary that maps each unique value to the number of
# sequences in which it appears as the first element.

def starting_counts(sequences):
    start_tags = []
    for i_seq in sequences:
        start_tags.append(i_seq[0])
    
    dictionary = {}
    for i_tag in start_tags:
        if i_tag in dictionary.keys():
            dictionary[i_tag] += 1
        else:
            dictionary[i_tag] = 1
    
    return dictionary


tag_starts = starting_counts(tags_data.training_tags_set.Y)

# Return a dictionary that maps each unique value to the number of
# sequences in which it appears as the last element.

def ending_counts(sequences):
    end_tags = []
    for i_seq in sequences:
        end_tags.append(i_seq[-1])
    
    dictionary = {}
    for i_tag in end_tags:
        if i_tag in dictionary.keys():
            dictionary[i_tag] += 1
        else:
            dictionary[i_tag] = 1
    
    return dictionary

tag_ends = ending_counts(tags_data.training_tags_set.Y)

import nltk

# Return a dictionary that maps each pair of adjacent values in the input
# sequence to the number of times that pair occurs.
def bigram_counts(sequence):
    dictionary = {}
    # Use the argument here, not the global `tags` list.
    bigrams = list(nltk.bigrams(sequence))
    for i_bigram in bigrams:
        if i_bigram in dictionary.keys():
            dictionary[i_bigram] += 1
        else:
            dictionary[i_bigram] = 1
            
    return dictionary


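# Flatten the training tag sequences into one list.
# Note: bigrams taken over this flat list also count pairs that span a
# sentence boundary (last tag of one sentence, first tag of the next).
tags = []
for i in tags_data.training_tags_set.Y:
    for j in i:
        tags.append(j)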
        
tag_bigrams = bigram_counts(tags)
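
The four count tables can also be gathered in a single pass with `collections.Counter`. This is a sketch under stated assumptions, not the notebook's code: the helper name `tag_counts` and the toy tag sequences are made up, and bigrams are counted within each sentence, whereas the cell above counts them over one flattened list.

```python
from collections import Counter

def tag_counts(tag_sequences):
    """Count unigrams, within-sentence bigrams, start tags and end tags."""
    unigrams, bigrams = Counter(), Counter()
    starts, ends = Counter(), Counter()
    for seq in tag_sequences:
        if not seq:
            continue  # skip empty sentences
        unigrams.update(seq)
        # zip(seq, seq[1:]) pairs each tag with its successor in the same sentence
        bigrams.update(zip(seq, seq[1:]))
        starts[seq[0]] += 1
        ends[seq[-1]] += 1
    return unigrams, bigrams, starts, ends

# Hypothetical toy data: two tagged sentences.
toy = [
    ("DET", "NOUN", "VERB", "."),
    ("PRON", "VERB", "DET", "NOUN", "."),
]
unigrams, bigrams, starts, ends = tag_counts(toy)
print(unigrams["NOUN"])          # → 2
print(bigrams[("DET", "NOUN")])  # → 2
print(bigrams[(".", "PRON")])    # → 0
print(starts["PRON"], ends["."]) # → 1 2
```

A `Counter` returns 0 for pairs that were never seen, so looking up a missing bigram does not raise a `KeyError`.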

In [6]:
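# Construct the HMM.

# Create the base HMM model.
basic_hmm_model = HiddenMarkovModel(name="base-hmm-tagger")

# Fetch tags and words from the data stream.
# Note: tags_data.stream() covers the whole dataset, including the test split,
# while tag_unigrams is counted on the training split only.
# tags_data.training_tags_set.stream() would restrict the counts to training data.
tags = [tag for _, tag in tags_data.stream()]
words = [word for word, _ in tags_data.stream()]

# Count how often each word is emitted by each tag.
emission_counts = pair_counts(tags, words)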

states = {}


for i_tag in tags_data.tagset:
    
    emission_probabilities = dict()
    for i_word, i_occurance in emission_counts[i_tag].items(): 
        emission_probabilities[i_word] = i_occurance / tag_unigrams[i_tag] 
    
    tag_distribution = DiscreteDistribution(emission_probabilities) 
    state = State(tag_distribution, name=i_tag)
    states[i_tag] = state
    
    # Add the state to the model
    basic_hmm_model.add_state(state)


for tag in tags_data.tagset:
    state = states[tag]
    
    # Share of sentences that start with this tag
    start_probability = tag_starts[tag] / sum(tag_starts.values())
    
    # Transition from the model's start state to this tag
    basic_hmm_model.add_transition(basic_hmm_model.start, state, start_probability)
    
    # Share of sentences that end with this tag
    end_probability = tag_ends[tag] / sum(tag_ends.values())
    
    # Transition from this tag to the model's end state
    basic_hmm_model.add_transition(state, basic_hmm_model.end, end_probability)



for tag_1 in tags_data.tagset:
    
    state_1 = states[tag_1]
    
    # Running total of the outgoing transition probabilities for tag_1
    # (kept for inspection only; the model does not use it)
    sum_of_probabilities = 0


    for tag_2 in tags_data.tagset:
        state_2 = states[tag_2]
        bigram = (tag_1, tag_2)
        
        # Transition probability from tag_1 to tag_2
        transition_probability = tag_bigrams[bigram] / tag_unigrams[tag_1]
        
        # Add it to the running total
        sum_of_probabilities += transition_probability
        
        # Add the transition to the model
        basic_hmm_model.add_transition(state_1, state_2, transition_probability)



basic_hmm_model.bake()
print("Nodes or States: ", basic_hmm_model.node_count())
print("Number of Edges: ", basic_hmm_model.edge_count())


Nodes or States:  14
Number of Edges:  168
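
For reference, the quantities passed to the model in the cell above can be written as count ratios. The notation is an assumption introduced here for readability: $C(\cdot)$ stands for a count from the tables built earlier, $w$ for a word, and $t$ for a tag.

```latex
\begin{align*}
P(w \mid t) &= \frac{C(t, w)}{C(t)}
  && \text{emission of word } w \text{ by tag } t \\
P(t \mid \text{start}) &= \frac{C_{\text{start}}(t)}{\sum_{t'} C_{\text{start}}(t')}
  && \text{start state to tag } t \\
P(t_2 \mid t_1) &= \frac{C(t_1, t_2)}{C(t_1)}
  && \text{tag } t_1 \text{ to tag } t_2 \\
e(t) &= \frac{C_{\text{end}}(t)}{\sum_{t'} C_{\text{end}}(t')}
  && \text{tag } t \text{ to end state}
\end{align*}
```

Note that $e(t)$, as computed in the code, is the share of sentences that end in $t$, not the probability $C_{\text{end}}(t) / C(t)$ that a given occurrence of $t$ ends a sentence.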


In [7]:
# Model accuracy evaluation

def accuracy(X, Y, model):
    correct = total_predictions = 0
    for observations, actual_tags in zip(X, Y):
        try:
            most_likely_tags = simplify_decoding(observations, model)
            correct += sum(p == t for p, t in zip(most_likely_tags, actual_tags))
        except Exception:
            # Decoding failed for this sentence; its tokens still count
            # toward the total, so they are scored as wrong.
            pass
        total_predictions += len(observations)
    return correct / total_predictions

hmm_training_acc = accuracy(tags_data.training_tags_set.X, tags_data.training_tags_set.Y, basic_hmm_model)
print("Training accuracy basic hmm model: {:.2f}%".format(100 * hmm_training_acc))

hmm_testing_acc = accuracy(tags_data.testing_tags_set.X, tags_data.testing_tags_set.Y, basic_hmm_model)
print("Testing accuracy basic hmm model: {:.2f}%".format(100 * hmm_testing_acc))

Training accuracy basic hmm model: 97.53%
Testing accuracy basic hmm model: 96.16%
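
The predictions scored above come from `model.viterbi`, which searches for the most likely tag path. As an illustration of that search, here is a small pure-Python Viterbi in log space. It is a sketch, not pomegranate's implementation: the function signature, the three-tag toy model, and all probabilities are invented for the example, and the end state is left out for brevity.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state path for the observations."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][s]: best log-probability of any path that ends in state s at step i
    best = [{s: log(start_p.get(s, 0)) + log(emit_p[s].get(observations[0], 0))
             for s in states}]
    back = [{}]  # back[i][s]: predecessor of s on that best path
    for i in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[i - 1][p] + log(trans_p[p].get(s, 0))) for p in states),
                key=lambda item: item[1],
            )
            best[i][s] = score + log(emit_p[s].get(observations[i], 0))
            back[i][s] = prev

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for i in range(len(observations) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Hypothetical toy model, for illustration only.
states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.8, "NOUN": 0.2}
trans_p = {
    "DET": {"NOUN": 1.0},
    "NOUN": {"VERB": 0.8, "NOUN": 0.2},
    "VERB": {"DET": 0.6, "NOUN": 0.4},
}
emit_p = {
    "DET": {"the": 1.0},
    "NOUN": {"dog": 0.5, "barks": 0.1, "walk": 0.4},
    "VERB": {"barks": 0.6, "walk": 0.4},
}
print(viterbi(["the", "dog", "barks"], states, start_p, trans_p, emit_p))
# → ['DET', 'NOUN', 'VERB']
```

Working with log-probabilities turns products into sums, which avoids numeric underflow on long sentences.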


In [8]:
# Test the tagger on a single sentence

key = "b100-935"
print(key)
print("Predicted Tags:\n-----------------")
print(simplify_decoding(tags_data.sentences[key].words, basic_hmm_model))
print()
print("Actual Tags:\n--------------")
print(tags_data.sentences[key].tags)
print("\n")

b100-935
Predicted Tags:
-----------------
['CONJ', 'PRT', 'VERB', 'PRT', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'PRT', 'ADV', 'ADV', 'DET', 'NOUN', 'VERB', 'VERB', '.', 'CONJ', 'DET', 'NOUN', 'PRON', 'VERB', 'VERB', '.']

Actual Tags:
--------------
--------------
('CONJ', 'PRT', 'VERB', 'PRT', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADP', 'ADP', 'ADV', 'ADV', 'DET', 'NOUN', 'VERB', 'VERB', '.', 'CONJ', 'DET', 'NOUN', 'PRON', 'VERB', 'VERB', '.')
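
The two tag lists above are long enough that the difference is hard to spot by eye. A short sketch for locating mismatches, using the predicted and actual tags printed above (the variable names are chosen here for illustration):

```python
# Tag sequences copied from the output above for sentence b100-935.
predicted = ['CONJ', 'PRT', 'VERB', 'PRT', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN',
             'ADP', 'PRT', 'ADV', 'ADV', 'DET', 'NOUN', 'VERB', 'VERB', '.',
             'CONJ', 'DET', 'NOUN', 'PRON', 'VERB', 'VERB', '.']
actual = ('CONJ', 'PRT', 'VERB', 'PRT', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN',
          'ADP', 'ADP', 'ADV', 'ADV', 'DET', 'NOUN', 'VERB', 'VERB', '.',
          'CONJ', 'DET', 'NOUN', 'PRON', 'VERB', 'VERB', '.')

# Collect (position, predicted tag, actual tag) wherever the two disagree.
mismatches = [(i, p, a) for i, (p, a) in enumerate(zip(predicted, actual)) if p != a]
print(mismatches)  # → [(10, 'PRT', 'ADP')]

sentence_accuracy = 1 - len(mismatches) / len(actual)
print("{:.2f}%".format(100 * sentence_accuracy))  # → 96.00%
```

For this sentence the model gets 24 of 25 tags right; the single error is a `PRT` predicted where the reference tag is `ADP`.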




### Binod Suman Academy at YouTube