# NLP Answers

- **Questions**: [Here](../data/exercise_3/HW3.docx)
- **Answer Set**: No. 03
- **Full Name**: Mohammad Hosein Nemati
- **Student Code**: `610300185`

---

## Basics

In this section, we carry out some basic setup steps:

### Libraries

Before we begin, we import the required libraries:

In [3]:
import warnings

import re

import numpy as np
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt

import nltk
import nltk.corpus.reader.conll as nltkconll

import sklearn.base as skbase
import sklearn.utils as skutils
import sklearn.pipeline as skpipeline
import sklearn.preprocessing as skprocessing
import sklearn.model_selection as skselection
import sklearn.feature_extraction.text as sktext

import hmm.hmm as hmm

warnings.filterwarnings("ignore", category=UserWarning)
sk.set_config(display="diagram")

### Dataset

Now we load the corpus, which is in `CoNLL` format, and store the `TrainSet` and `TestSet`.  
Next, we define some functions that replace infrequent words (those seen at most once in the `TrainSet`) with the **OOV** token:

In [4]:
train_reader = nltkconll.ConllCorpusReader("../lib", ["Train.txt"], ("words", "pos"))
test_reader = nltkconll.ConllCorpusReader("../lib", ["Test.txt"], ("words", "pos"))

words_frequency = nltk.FreqDist(train_reader.words())

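def sents(reader):
    """Yield each sentence with rare words (train frequency <= 1) replaced by "OOV"."""
    for sent in reader.sents():
        yield [
            # FreqDist is a Counter, so unseen words have a count of 0.
            word if words_frequency[word] > 1 else "OOV"
            for word in sent
        ]

def words(reader):
    """Yield the flat word sequence with rare words replaced by "OOV"."""
    for word in reader.words():
        if words_frequency[word] > 1:
            yield word
        else:
            yield "OOV"

def tags(reader):
    """Yield the flat tag sequence, aligned with the output of `words`."""
    for sent in reader.tagged_sents():
        for token in sent:
            yield token[1]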
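
To make the effect of the frequency threshold concrete, here is a minimal sketch on a toy corpus. It uses `collections.Counter` in place of `nltk.FreqDist` (which is a `Counter` subclass). The names `toy_train`, `toy_test` and `replace_rare` are hypothetical and exist only for this illustration:

```python
from collections import Counter

# Hypothetical toy corpora; the notebook itself reads Train.txt and Test.txt.
toy_train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
toy_test = [["the", "bird", "sat"]]

# Frequencies are counted on the training data only.
frequency = Counter(word for sent in toy_train for word in sent)

def replace_rare(sentences, frequency, min_count=2):
    """Replace words seen fewer than `min_count` times in training with "OOV"."""
    return [
        [word if frequency[word] >= min_count else "OOV" for word in sent]
        for sent in sentences
    ]

train_oov = replace_rare(toy_train, frequency)
test_oov = replace_rare(toy_test, frequency)
print(train_oov)
print(test_oov)
```

Because the same frequency table is applied to both sets, a test word that never occurred in training (count 0) is also mapped to `"OOV"`, so every test word has an entry in the vocabulary learned from the training set.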

---

## Problem

Now, we use the implemented `HMM` class and train it on the **Sequence of Words** and **Sequence of Tags** of the `TrainSet`.  
Then, we use its `predict` method, which relies on the `viterbi` algorithm, to decode the **Sequence of Words** of the `TestSet` and obtain the predicted **Sequence of Tags**.  
Next, we compute the accuracy of the predicted **Sequence of Tags**:

In [5]:
word_encoder = skprocessing.LabelEncoder().fit(list(words(train_reader)))
tag_encoder = skprocessing.LabelEncoder().fit(list(tags(train_reader)))
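
The `fit` and `predict` calls in this section receive one flat array of integer codes plus an array of sentence lengths. The sketch below illustrates that representation in plain Python. It assumes that the encoder maps the sorted distinct symbols to `0..n-1` (which is how `LabelEncoder` behaves) and that the lengths are used to recover sentence boundaries; the latter is an assumption about `HMMEstimator`, whose source is not shown here. The helpers `fit_encoder` and `split_by_lengths` are hypothetical:

```python
def fit_encoder(symbols):
    """Map each distinct symbol to an integer, in sorted order."""
    classes = sorted(set(symbols))
    return {symbol: index for index, symbol in enumerate(classes)}

def split_by_lengths(flat, lengths):
    """Cut a flat sequence back into sentences using per-sentence lengths."""
    sentences, start = [], 0
    for length in lengths:
        sentences.append(flat[start:start + length])
        start += length
    return sentences

# Hypothetical toy sentences, already passed through OOV replacement.
toy_sents = [["the", "OOV", "sat"], ["OOV", "sat"]]
flat_words = [word for sent in toy_sents for word in sent]
lengths = [len(sent) for sent in toy_sents]

encoder = fit_encoder(flat_words)
encoded = [encoder[word] for word in flat_words]
print(encoded, lengths)
print(split_by_lengths(encoded, lengths))
```

Keeping the data flat avoids ragged arrays, and the lengths array is enough to stop transitions from being counted across sentence boundaries.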

In [10]:
model = hmm.HMMEstimator(n_iter=0).fit(
    np.array(word_encoder.transform(list(words(train_reader)))),
    np.array(tag_encoder.transform(list(tags(train_reader)))),
    np.array([len(sent) for sent in sents(train_reader)])
)

predicts = model.predict(
    np.array(word_encoder.transform(list(words(test_reader)))),
    np.array([len(sent) for sent in sents(test_reader)])
)

real_tags = list(tags(test_reader))
predicted_tags = tag_encoder.inverse_transform(predicts)

count = 0
for i in range(len(real_tags)):
    if real_tags[i] == predicted_tags[i]:
        count += 1

print(f"Accuracy: {(count / len(real_tags)) * 100}")

Accuracy: 90.44935602823776
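
The `hmm.HMMEstimator` class itself is not shown in this answer set. As a rough illustration of the two steps described above (supervised training by counting, then Viterbi decoding), here is a minimal pure-Python sketch of a textbook first-order HMM tagger. It is not the implementation used in the notebook: the add-one smoothing, the toy data, and the names `train_hmm` and `viterbi` are all assumptions made for this example:

```python
import math
from collections import Counter

def train_hmm(tagged_sents, smoothing=1.0):
    """Estimate log-probabilities of a first-order HMM by smoothed counting."""
    tags = sorted({tag for sent in tagged_sents for _, tag in sent})
    vocab = sorted({word for sent in tagged_sents for word, _ in sent})
    start, trans, emit = Counter(), Counter(), Counter()
    trans_total, emit_total = Counter(), Counter()
    for sent in tagged_sents:
        start[sent[0][1]] += 1
        for word, tag in sent:
            emit[tag, word] += 1
            emit_total[tag] += 1
        for (_, prev), (_, cur) in zip(sent, sent[1:]):
            trans[prev, cur] += 1
            trans_total[prev] += 1

    def log_ratio(count, total, size):
        # Additive smoothing keeps unseen events at a small non-zero probability.
        return math.log((count + smoothing) / (total + smoothing * size))

    log_start = {t: log_ratio(start[t], len(tagged_sents), len(tags)) for t in tags}
    log_trans = {(p, c): log_ratio(trans[p, c], trans_total[p], len(tags))
                 for p in tags for c in tags}
    log_emit = {(t, w): log_ratio(emit[t, w], emit_total[t], len(vocab))
                for t in tags for w in vocab}
    return tags, log_start, log_trans, log_emit

def viterbi(words, tags, log_start, log_trans, log_emit):
    """Return the most probable tag sequence for `words` (all must be in the vocabulary)."""
    # best[i][t] is the best log-probability of any tag path ending in t at position i.
    best = [{t: log_start[t] + log_emit[t, words[0]] for t in tags}]
    back = []
    for word in words[1:]:
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + log_trans[p, cur])
            scores[cur] = best[-1][prev] + log_trans[prev, cur] + log_emit[cur, word]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical toy training data as (word, tag) pairs.
toy_train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_hmm(toy_train)
predicted = viterbi(["the", "cat", "barks"], *model)
print(predicted)
```

Working in log space turns the products of probabilities into sums, which avoids numerical underflow on long sentences. The sketch decodes one sentence at a time, whereas the notebook passes the whole flat sequence together with the sentence lengths.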


---