## Part-of-Speech (POS) Tagging Using a Hidden Markov Model
#### This activity was made by Mark Andrei Encanto and Ethan Gabriel Soncio for our NLP course


1. Importing Libraries
2. Dataset Description
3. Training the Model
4. Testing and Evaluation


### Importing Libraries
We import the libraries needed to implement and evaluate the Hidden Markov Model (HMM):
- `nltk` for natural language processing tasks; here it provides the POS-tagged corpora.
- `defaultdict` from the `collections` module, to count occurrences of tags, words, and tag transitions.
- `sklearn.metrics` for the accuracy score, classification report, and confusion matrix.
- `seaborn` and `matplotlib` for plotting.

In [3]:
import nltk
from collections import defaultdict
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.corpus import brown
from nltk.corpus import treebank
from nltk.corpus import conll2000


### Dataset
We use the following resources from the NLTK library:
- `treebank`, `brown`, and `conll2000`: corpora that contain part-of-speech tagged sentences.
- `universal_tagset`: a mapping to a simplified tagset for POS tagging.

In [4]:
# For treebank corpus
nltk.download('treebank')
nltk.download('universal_tagset')

# For brown corpus
nltk.download('brown')
nltk.download('universal_tagset')

# For conll2000 corpus
nltk.download('conll2000')
nltk.download('universal_tagset')

[nltk_data] Downloading package treebank to C:\Users\Mark Andrei
[nltk_data]     Encanto\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package universal_tagset to C:\Users\Mark
[nltk_data]     Andrei Encanto\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\universal_tagset.zip.
[nltk_data] Downloading package brown to C:\Users\Mark Andrei
[nltk_data]     Encanto\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.
[nltk_data] Downloading package universal_tagset to C:\Users\Mark
[nltk_data]     Andrei Encanto\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package conll2000 to C:\Users\Mark Andrei
[nltk_data]     Encanto\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\conll2000.zip.
[nltk_data] Downloading package universal_tagset to C:\Users\Mark
[nltk_data]     Andrei Encanto\AppData\Roaming\nltk_data...
[nltk_data

True

In [5]:
# Load dataset
treebank_corpus = treebank.tagged_sents(tagset='universal')
brown_corpus = brown.tagged_sents(tagset='universal')
conll_corpus = conll2000.tagged_sents(tagset='universal')
sentences = treebank_corpus + brown_corpus + conll_corpus
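
The evaluation cell later in this notebook takes its test sentences from the end of `sentences`, which is the same list the counts are estimated from. A minimal sketch of holding out a test split before training is shown here, assuming a fixed-seed shuffle; `split_sentences`, `test_fraction`, and `seed` are hypothetical names that do not appear in the notebook, and the toy sentences stand in for the NLTK corpora.

```python
import random

def split_sentences(tagged_sents, test_fraction=0.1, seed=42):
    """Shuffle a copy of the tagged sentences and split it into (train, test).

    Hypothetical helper, not part of the original notebook.
    """
    shuffled = list(tagged_sents)          # copy so the input order is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-in for the combined corpus: each sentence is a list of (word, tag) pairs.
toy_sentences = [
    [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("She", "PRON"), ("runs", "VERB")],
    [("A", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("Birds", "NOUN"), ("sing", "VERB")],
]

train_sents, test_sents = split_sentences(toy_sentences, test_fraction=0.25)
print(len(train_sents), len(test_sents))
```

With a split like this, the counting loop would iterate over `train_sents` only, and the evaluation would use `test_sents`, so the reported accuracy would reflect sentences the model has not seen.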

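### Training the Model
We estimate the HMM parameters by counting: how often each tag starts a sentence, how often one tag follows another (transitions), and how often each tag produces each word (emissions). Normalizing the counts turns them into probabilities.

In [6]: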
from collections import defaultdict

# Raw counts gathered from the tagged sentences
transition = defaultdict(lambda: defaultdict(int))  # transition[prev_tag][tag]
emission = defaultdict(lambda: defaultdict(int))    # emission[tag][word]
start_tag_count = defaultdict(int)                  # tag of the first word in each sentence
tag_count = defaultdict(int)                        # overall tag frequency

for sent in sentences:
    prev_tag = None
    for i, (word, tag) in enumerate(sent):
        word = word.lower()  # case-fold so "The" and "the" share one emission count
        emission[tag][word] += 1
        tag_count[tag] += 1
        if i == 0:
            start_tag_count[tag] += 1
        if prev_tag is not None:
            transition[prev_tag][tag] += 1
        prev_tag = tag

def normalize(d):
    """Convert a dict of counts into a dict of relative frequencies."""
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

# Each count divided by its row total
start_prob = normalize(start_tag_count)
transition_prob = {t1: normalize(t2) for t1, t2 in transition.items()}
emission_prob = {tag: normalize(words) for tag, words in emission.items()}
all_tags = list(tag_count.keys())
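
To make the estimates concrete, the sketch below repeats the same count-and-normalize steps on a three-sentence toy corpus so the resulting probabilities can be checked by hand. The corpus and the `toy_` names are invented for illustration and are not taken from the NLTK data.

```python
from collections import defaultdict

# Invented toy corpus: each sentence is a list of (word, tag) pairs.
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]

toy_transition = defaultdict(lambda: defaultdict(int))
toy_emission = defaultdict(lambda: defaultdict(int))
toy_start = defaultdict(int)

for sent in toy_corpus:
    prev_tag = None
    for i, (word, tag) in enumerate(sent):
        toy_emission[tag][word.lower()] += 1
        if i == 0:
            toy_start[tag] += 1
        if prev_tag is not None:
            toy_transition[prev_tag][tag] += 1
        prev_tag = tag

def normalize(d):
    """Convert a dict of counts into a dict of relative frequencies."""
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

toy_start_prob = normalize(toy_start)
toy_transition_prob = {t1: normalize(t2) for t1, t2 in toy_transition.items()}
toy_emission_prob = {tag: normalize(words) for tag, words in toy_emission.items()}

print(toy_start_prob)              # DET starts 2 of the 3 sentences
print(toy_transition_prob["DET"])  # DET is always followed by NOUN here
print(toy_emission_prob["NOUN"])   # three distinct nouns, one occurrence each
```

VERB never precedes another tag in this toy corpus, so it has no row in `toy_transition_prob`. The decoder has to handle such gaps, which is why the Viterbi function looks probabilities up with `.get(..., 1e-6)` rather than indexing directly.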


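To tag a sentence, we use the Viterbi algorithm, which finds the most probable tag sequence under the estimated start, transition, and emission probabilities. Events that were never seen in training get a small fallback probability of `1e-6`.

In [7]: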
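def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words` under the HMM."""
    if not words:
        return []  # nothing to tag; avoids an IndexError on words[0]

    # Unseen starts, transitions, and emissions fall back to a small floor
    # probability so that one unknown word does not zero out every path.
    floor = 1e-6
    words = [w.lower() for w in words]  # emissions were counted on lowercased words

    V = [{}]   # V[t][tag]: probability of the best path ending in `tag` at position t
    path = {}  # path[tag]: the best tag sequence found so far that ends in `tag`

    for tag in tags:
        V[0][tag] = start_p.get(tag, floor) * emit_p.get(tag, {}).get(words[0], floor)
        path[tag] = [tag]

    for t in range(1, len(words)):
        V.append({})
        new_path = {}
        for curr_tag in tags:
            # Choose the previous tag that maximizes the path probability into curr_tag
            prob, prev_tag = max((V[t-1][pt] * trans_p.get(pt, {}).get(curr_tag, floor) *
                                  emit_p.get(curr_tag, {}).get(words[t], floor), pt)
                                 for pt in tags)
            V[t][curr_tag] = prob
            new_path[curr_tag] = path[prev_tag] + [curr_tag]
        path = new_path

    # Each entry in `path` already holds its full sequence, so no backtrace is needed
    prob, final_tag = max((V[-1][tag], tag) for tag in tags)
    return path[final_tag]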
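
Multiplying many probabilities below 1 can underflow to `0.0` on long sentences, at which point every path looks equally bad. A common remedy is to add log-probabilities instead of multiplying probabilities. The following is a sketch of that variant, not the notebook's method: `viterbi_log` and its `floor` parameter are hypothetical names, and the toy model is invented for illustration. It also stores backpointers instead of copying the full path at every step.

```python
import math

def viterbi_log(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Viterbi decoding with log-probabilities and backpointers (illustrative sketch)."""
    if not words:
        return []
    words = [w.lower() for w in words]

    def lp(table, key):
        # Log of a looked-up probability; unseen events get the small floor value.
        return math.log(table.get(key, floor))

    # score[t][tag]: best log-probability of any path ending in `tag` at position t
    score = [{tag: lp(start_p, tag) + lp(emit_p.get(tag, {}), words[0]) for tag in tags}]
    back = [{}]  # back[t][tag]: previous tag on that best path

    for t in range(1, len(words)):
        score.append({})
        back.append({})
        for curr in tags:
            best_prev = max(
                tags,
                key=lambda pt: score[t - 1][pt] + lp(trans_p.get(pt, {}), curr),
            )
            score[t][curr] = (score[t - 1][best_prev]
                              + lp(trans_p.get(best_prev, {}), curr)
                              + lp(emit_p.get(curr, {}), words[t]))
            back[t][curr] = best_prev

    # Follow the backpointers from the best final tag to recover the sequence.
    last = max(tags, key=lambda tag: score[-1][tag])
    result = [last]
    for t in range(len(words) - 1, 0, -1):
        result.append(back[t][result[-1]])
    return result[::-1]

# Invented toy model, small enough to check by hand.
toy_tags = ["DET", "NOUN", "VERB"]
toy_start = {"DET": 0.8, "NOUN": 0.2}
toy_trans = {"DET": {"NOUN": 1.0}, "NOUN": {"VERB": 0.9, "NOUN": 0.1}}
toy_emit = {"DET": {"the": 1.0}, "NOUN": {"dog": 0.5, "cat": 0.5}, "VERB": {"barks": 1.0}}

print(viterbi_log(["The", "dog", "barks"], toy_tags, toy_start, toy_trans, toy_emit))
# → ['DET', 'NOUN', 'VERB']
```

Because the logarithm is monotonic, the best path is the same as in the product form whenever the products do not underflow, although exact ties may be broken differently.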


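### Testing and Evaluation
We tag the last 100 sentences of the combined corpus with the Viterbi decoder and compare the predicted tags against the gold tags, token by token.

In [9]: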
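# Note: these 100 sentences are also part of the data the model was trained on,
# so the scores below measure fit to seen data rather than generalization.
test_sentences = sentences[-100:]
X_test = [[w for w, t in sent] for sent in test_sentences]  # words only
y_true = [[t for w, t in sent] for sent in test_sentences]  # gold tags

# Tag each test sentence with the trained HMM
y_pred = []
for sent in X_test:
    pred_tags = viterbi(sent, all_tags, start_prob, transition_prob, emission_prob)
    y_pred.append(pred_tags)

# Flatten to token level for the sklearn metrics
y_true_flat = [tag for sent in y_true for tag in sent]
y_pred_flat = [tag for sent in y_pred for tag in sent]

acc = accuracy_score(y_true_flat, y_pred_flat)
print(f"HMM POS Tagging Accuracy: {acc:.4f}")

print(classification_report(y_true_flat, y_pred_flat, zero_division=0))

# Sorted tag labels (defined here but not used in this cell)
labels = sorted(set(y_true_flat + y_pred_flat))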

HMM POS Tagging Accuracy: 0.9508
              precision    recall  f1-score   support

           .       1.00      1.00      1.00       315
         ADJ       0.90      0.96      0.93       167
         ADP       0.91      0.94      0.92       250
         ADV       0.93      0.91      0.92        55
        CONJ       1.00      1.00      1.00        59
         DET       0.85      0.98      0.91       205
        NOUN       0.99      0.96      0.97       779
         NUM       1.00      1.00      1.00       151
        PRON       1.00      0.53      0.70        58
         PRT       0.76      0.76      0.76        78
        VERB       0.96      0.97      0.96       321

    accuracy                           0.95      2438
   macro avg       0.94      0.91      0.92      2438
weighted avg       0.95      0.95      0.95      2438
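
The report shows PRON with precision 1.00 but recall 0.53, so nearly half of the true pronouns are assigned some other tag. One way to see which tags absorb those errors is to count what was predicted for each token whose gold tag is PRON. The sketch below does this with `collections.Counter`; `predictions_for_tag` is a hypothetical helper name, and the two short lists are invented stand-ins for `y_true_flat` and `y_pred_flat`.

```python
from collections import Counter

def predictions_for_tag(y_true, y_pred, tag):
    """Count the predicted tags for every token whose true tag is `tag`."""
    return Counter(pred for true, pred in zip(y_true, y_pred) if true == tag)

# Invented stand-ins for y_true_flat and y_pred_flat.
toy_true = ["PRON", "VERB", "PRON", "NOUN", "PRON", "DET"]
toy_pred = ["PRON", "VERB", "DET", "NOUN", "DET", "DET"]

pron_predictions = predictions_for_tag(toy_true, toy_pred, "PRON")
print(pron_predictions.most_common())
# → [('DET', 2), ('PRON', 1)]
```

In the toy data the missed pronouns are tagged DET. Whether that holds for the real predictions would need to be checked by calling the helper on `y_true_flat` and `y_pred_flat`.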

