# NLP Answers

- **Answer Set**: No. 06
- **Full Name**: Mohammad Hosein Nemati
- **Student Code**: `610300185`

---

## Introduction

In this problem, we are going to fine-tune the pre-trained **ParsBERT** model for a **Token Classification (NER)** task on a **Custom** dataset.  
Then we will compare the reported metrics to those of the models previously trained in `exercise 3`.

In the first step, we will import some useful libraries.

In [None]:
import warnings

import math as math
import hazm as hazm
import nltk as nltk
import nltk.corpus.reader.conll as nltkconll

import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import sklearn.preprocessing as skprocessing

warnings.filterwarnings("ignore", category=UserWarning)

---

## Loading

First of all, we must load our dataset and then shuffle the records.

In [None]:
train_reader = nltkconll.ConllCorpusReader("../lib", ["Train.txt"], ("words", "pos"))
test_reader = nltkconll.ConllCorpusReader("../lib", ["Test.txt"], ("words", "pos"))

def rows(reader):
    for sent in reader.tagged_sents():
        words, tags = zip(*sent)
        yield [list(words), list(tags)]

train_frame = pd.DataFrame(rows(train_reader), columns=["words", "tags"])
train_frame = train_frame.sample(frac=1, random_state=313)

test_frame = pd.DataFrame(rows(test_reader), columns=["words", "tags"])
test_frame = test_frame.sample(frac=1, random_state=313)

train_frame
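
To make the loading step more concrete, here is a minimal sketch of what reading a two-column CoNLL-style file amounts to. It assumes one `word tag` pair per line and a blank line between sentences. The sample text, its transliterated placeholder words, and the `read_conll` helper are illustrative assumptions, not the actual contents of `Train.txt` or the internals of `ConllCorpusReader`.

```python
# Hypothetical two-column CoNLL-style sample (placeholder words, not real data):
# one "word tag" pair per line, blank line between sentences.
SAMPLE = """\
Ali B-PER
be O
Tehran B-LOC
raft O

u O
amad O
"""

def read_conll(text):
    """Yield (words, tags) for each sentence in two-column CoNLL-style text."""
    words, tags = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if words:
                yield words, tags
                words, tags = [], []
            continue
        word, tag = line.split()[:2]
        words.append(word)
        tags.append(tag)
    if words:
        # Flush the last sentence when the text has no trailing blank line.
        yield words, tags

sentences = list(read_conll(SAMPLE))
for sentence_words, sentence_tags in sentences:
    print(sentence_words, sentence_tags)
```

Each yielded pair corresponds to one row of the `words` and `tags` columns built above.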

---

## Preprocessing

In [None]:
def tags(reader):
    for sent in reader.tagged_sents():
        for token in sent:
            yield token[1]
            
tag_encoder = skprocessing.LabelEncoder().fit(list(tags(train_reader)))

In [None]:
train_frame["tags"] = train_frame["tags"].apply(lambda tags : tag_encoder.transform(tags))
test_frame["tags"] = test_frame["tags"].apply(lambda tags : tag_encoder.transform(tags))

In [None]:
from datasets import Dataset
from transformers import AutoTokenizer
from transformers import DataCollatorForTokenClassification

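def tokenize(ds):
    # Words are already split, so the tokenizer only breaks them into sub-word pieces.
    tokens = tokenizer(ds["words"], truncation=True, padding=True, max_length=512, is_split_into_words=True)
    labels = []

    for i, label in enumerate(ds["tags"]):
        # word_ids maps each sub-word token to its source word (None for special and padding tokens).
        word_ids = tokens.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special and padding tokens are marked with -100.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # The first sub-word of each word carries the word's tag.
                label_ids.append(label[word_idx])
            else:
                # Remaining sub-words of the same word are marked with -100.
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokens["labels"] = labels
    return tokens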

tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-fa-base-uncased")

trainset = Dataset.from_pandas(train_frame).map(tokenize, batched=True)
testset = Dataset.from_pandas(test_frame).map(tokenize, batched=True)

collator = DataCollatorForTokenClassification(tokenizer=tokenizer, return_tensors="tf")
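
The label alignment inside `tokenize` can be seen on a tiny example without loading the tokenizer. The sketch below repeats the same rule on a hand-written `word_ids` list. The list, the label values, and the `align_labels` helper name are made up for illustration. By Hugging Face convention, positions labelled `-100` are skipped when the loss is computed.

```python
IGNORE_INDEX = -100  # label value that the loss skips (Hugging Face convention)

def align_labels(word_ids, word_labels):
    """Give each word's label to its first sub-word token and mask the rest."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens and padding have no source word.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First sub-word of a new word keeps the word-level label.
            aligned.append(word_labels[word_id])
        else:
            # Later sub-words of the same word are masked.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned

# Hypothetical word_ids for: [CLS], word 0, word 1 (two pieces), word 2, [SEP], [PAD]
word_ids = [None, 0, 1, 1, 2, None, None]
word_labels = [3, 0, 5]  # made-up encoded tags, one per word

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 3, 0, -100, 5, -100, -100]
```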

---

## Splitting

In [None]:
train_set = trainset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    batch_size=8,
    collate_fn=collator,
)

test_set = testset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    batch_size=8,
    collate_fn=collator,
)

---

## Training

In [None]:
from transformers import TFAutoModelForTokenClassification

model = TFAutoModelForTokenClassification.from_pretrained("HooshvareLab/bert-fa-base-uncased", num_labels=len(tag_encoder.classes_))

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    metrics=["accuracy"],
)

In [None]:
model.fit(train_set, epochs=4)

---

## Testing

In [None]:
result = model.evaluate(test_set)

print(result)

As we can see, the **ParsBERT** pre-trained model returned much better accuracy after fine-tuning than the models trained in `exercise 3`.
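
When comparing accuracy figures, it can help to check them against a token-level accuracy that ignores the positions labelled `-100` (special tokens, padding, and non-initial sub-words). The following is a minimal sketch of such a check. The `masked_accuracy` helper and the sample predictions and labels are illustrative assumptions, not the metric computed by `model.evaluate`.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the score

def masked_accuracy(predictions, labels):
    """Fraction of correctly predicted labels, skipping IGNORE_INDEX positions."""
    correct = 0
    total = 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            if label == IGNORE_INDEX:
                continue
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0

# Made-up predicted class ids and aligned labels for two short sequences.
predictions = [[0, 3, 0, 1, 5, 0], [0, 2, 2, 0, 0, 0]]
labels = [[-100, 3, 0, -100, 4, -100], [-100, 2, 1, -100, -100, -100]]

score = masked_accuracy(predictions, labels)
print(f"masked token accuracy: {score:.2f}")
```

Only five positions carry a real label in this toy input, so the masked positions do not affect the score.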

---