# NLP Answers

- **Answer Set**: No. 06
- **Full Name**: Mohammad Hosein Nemati
- **Student Code**: `610300185`

---

## Introduction

In this problem, we fine-tune the pre-trained **ParsBERT** model for a **Token Classification (NER)** task on a **custom** dataset.  
Then we compare the reported metrics with those of the models trained previously in `exercise 3`.

In the first step, we will import some useful libraries.

In [2]:
import warnings

import math
import hazm
import nltk
import nltk.corpus.reader.conll as nltkconll

import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import sklearn.preprocessing as skprocessing

warnings.filterwarnings("ignore", category=UserWarning)

---

## Loading

First of all, we must load our dataset and then shuffle the records.

In [42]:
train_reader = nltkconll.ConllCorpusReader("../lib", ["Train.txt"], ("words", "pos"))
test_reader = nltkconll.ConllCorpusReader("../lib", ["Test.txt"], ("words", "pos"))

def rows(reader):
    for sent in reader.tagged_sents():
        words, tags = zip(*sent)
        yield [list(words), list(tags)]

train_frame = pd.DataFrame(rows(train_reader), columns=["words", "tags"])
train_frame = train_frame.sample(frac=1, random_state=313)

test_frame = pd.DataFrame(rows(test_reader), columns=["words", "tags"])
# Shuffle the test frame itself (sampling from train_frame here would overwrite the test data).
test_frame = test_frame.sample(frac=1, random_state=313)

train_frame

| index | words | tags |
| --- | --- | --- |
| 64 | [مثالهاي, بالا, از, سميعيان, است, .] | [N, ADJ, P, N, V, DELM] |
| 4861 | [او, ", راو, شانكار, ", هندي, بود, كه, سهتار, ... | [PRO, DELM, N, N, DELM, ADJ, V, CON, N, V, DELM] |
| 99 | [در, واژگان, تنها, اطلاعات, منحصربه, هر, مدخل,... | [P, N, ADV, N, ADJ, QUA, N, ADJ, ADJ, V, CON, ... |
| 6015 | [مثلاً, فردي, را, در, نظر, بگيريد, كه, صبح, از... | [CON, N, P, P, N, V, CON, N, P, N, V, DELM, P,... |
| 1396 | [بهطور, سرمايهگذاران, به, دنبال, يك, محيط, تجا... | [CON, N, P, N, N, N, ADJ, ADJ, P, N, V, V, CON... |
| ... | ... | ... |
| 4122 | [اين, كه, شدني, نيست, ،, يعني, نميتوانيم, بگوي... | [PRO, CON, ADJ, V, DELM, CON, V, V, DET, N, N,... |
| 2632 | [2, -, الكتريكي, .] | [N, DELM, ADJ, DELM] |
| 6183 | [ساوال, :, شكاف, لب, و, كام, يك, بيماري, مادرز... | [N, DELM, N, N, CON, N, N, N, ADJ, V, DELM, AD... |
| 6570 | [در, دوره, مظفرالدينشاه, هر, كدام, از, محلههاي... | [P, N, N, QUA, N, P, N, ADJ, N, P, SPEC, N, CO... |
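
The `rows` helper relies on `zip(*sent)` to split each tagged sentence into parallel word and tag lists. A minimal sketch of that step, using a hypothetical two-sentence sample (the sentences and the name `rows_from_sents` are made up for illustration and are not taken from `Train.txt`):

```python
# Hypothetical tagged sentences in the shape returned by tagged_sents():
# each sentence is a list of (word, tag) pairs.
sample_sents = [
    [("book", "N"), ("is", "V"), (".", "DELM")],
    [("good", "ADJ"), ("book", "N")],
]


def rows_from_sents(sents):
    """Yield [words, tags] pairs, mirroring the rows() helper above."""
    for sent in sents:
        words, tags = zip(*sent)  # unzip the (word, tag) pairs
        yield [list(words), list(tags)]


sample_rows = list(rows_from_sents(sample_sents))
print(sample_rows[0])
```

Each yielded item becomes one row of the data frame, so a sentence and its tag sequence always stay aligned when the rows are shuffled.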


---

## Preprocessing

In [43]:
def tags(reader):
    for sent in reader.tagged_sents():
        for token in sent:
            yield token[1]
            
tag_encoder = skprocessing.LabelEncoder().fit(list(tags(train_reader)))

In [44]:
train_frame["tags"] = train_frame["tags"].apply(lambda tags : tag_encoder.transform(tags))
test_frame["tags"] = test_frame["tags"].apply(lambda tags : tag_encoder.transform(tags))
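
The two cells above turn string tags into integer ids. The sketch below is a pure-Python stand-in for that idea, assuming (as `LabelEncoder` does, to my knowledge) that classes are numbered in sorted order. The names `fit_label_map` and `encode` are hypothetical helpers, not part of scikit-learn.

```python
def fit_label_map(tag_sequences):
    """Map each distinct tag to an integer id, in sorted order."""
    classes = sorted({tag for seq in tag_sequences for tag in seq})
    return {tag: idx for idx, tag in enumerate(classes)}


def encode(seq, label_map):
    """Replace every tag by its id; raises KeyError for an unseen tag."""
    return [label_map[tag] for tag in seq]


# Hypothetical training tags, used only to build the mapping.
label_map = fit_label_map([["N", "V", "DELM"], ["ADJ", "N"]])
encoded = encode(["ADJ", "N", "V"], label_map)

print(label_map)
print(encoded)
```

Because the mapping is built from the training tags only, a tag that appears only in the test file cannot be encoded, which is worth checking before transforming `test_frame`.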

In [45]:
from datasets import Dataset
from transformers import AutoTokenizer
from transformers import DataCollatorForTokenClassification

def tokenize(ds):
    tokens = tokenizer(ds["words"], truncation=True, padding=True, max_length=512, is_split_into_words=True)
    labels = []

    for i, label in enumerate(ds["tags"]):
        word_ids = tokens.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    
    tokens["labels"] = labels
    return tokens

tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-fa-base-uncased")

trainset = Dataset.from_pandas(train_frame).map(tokenize, batched=True)
testset = Dataset.from_pandas(test_frame).map(tokenize, batched=True)

collator = DataCollatorForTokenClassification(tokenizer=tokenizer, return_tensors="tf")
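
The inner loop of `tokenize` keeps the label of the first sub-token of each word and assigns `-100` to special tokens, padding, and the remaining sub-tokens. A minimal standalone sketch of that alignment follows; the `word_ids` list is hand-written for illustration rather than produced by the real tokenizer, and `align_labels` is a hypothetical helper name.

```python
IGNORE_INDEX = -100  # label value used above for positions to skip


def align_labels(word_ids, word_labels):
    """Return one label per sub-token, keeping only each word's first piece."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special or padding token: no word behind it.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First sub-token of a new word: keep the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Continuation sub-token of the same word.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical layout: [special] w0 w1 w1 w2 [special] [padding]
word_ids = [None, 0, 1, 1, 2, None, None]
aligned = align_labels(word_ids, [5, 7, 9])
print(aligned)  # [-100, 5, 7, -100, 9, -100, -100]
```

Labelling only the first sub-token means each original word contributes exactly one labelled position, so per-token scores stay comparable with word-level models.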


---

## Splitting

In [46]:
train_set = trainset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    batch_size=8,
    collate_fn=collator,
)

test_set = testset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    batch_size=8,
    collate_fn=collator,
)
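
Each batch of 8 examples is passed through the collator, which pads the sequences in a batch to a common length. The sketch below illustrates that padding step in pure Python under stated assumptions: a padding token id of `0`, a label padding value of `-100`, and a hypothetical helper name `pad_batch`. It is not the `DataCollatorForTokenClassification` implementation.

```python
def pad_batch(batch, pad_token_id=0, label_pad_id=-100):
    """Pad every example in the batch to the length of the longest one."""
    max_len = max(len(example["input_ids"]) for example in batch)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for example in batch:
        length = len(example["input_ids"])
        extra = max_len - length
        padded["input_ids"].append(example["input_ids"] + [pad_token_id] * extra)
        # 1 marks real tokens, 0 marks padding.
        padded["attention_mask"].append([1] * length + [0] * extra)
        # Padded positions get the ignore label so they can be skipped later.
        padded["labels"].append(example["labels"] + [label_pad_id] * extra)
    return padded


# Hypothetical token ids and labels, for illustration only.
batch = [
    {"input_ids": [2, 15, 27, 4], "labels": [-100, 3, 1, -100]},
    {"input_ids": [2, 15, 4], "labels": [-100, 3, -100]},
]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
print(padded["labels"])
```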

---

## Training

In [12]:
from transformers import TFAutoModelForTokenClassification

model = TFAutoModelForTokenClassification.from_pretrained("HooshvareLab/bert-fa-base-uncased", num_labels=len(tag_encoder.classes_))

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

All model checkpoint layers were used when initializing TFBertForTokenClassification.

Some layers of TFBertForTokenClassification were not initialized from the model checkpoint at HooshvareLab/bert-fa-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
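
The labels built during preprocessing contain `-100` for special, padding, and continuation sub-tokens, and a loss for this setup normally has to skip those positions. Whether the `SparseCategoricalCrossentropy` instance above does so depends on the TensorFlow version and its arguments, which is not verified here. The sketch below is a pure-Python illustration of the masking idea, with a hypothetical helper name `masked_cross_entropy`; it is not the loss the notebook uses.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean softmax cross-entropy over positions whose label is not ignored."""
    total, count = 0.0, 0
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # skip special, padding, and continuation positions
        peak = max(token_logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in token_logits))
        total += log_norm - token_logits[label]
        count += 1
    return total / count if count else 0.0


# Hypothetical logits for three positions and two classes.
logits = [[2.0, 0.0], [0.0, 0.0], [0.0, 3.0]]
labels = [0, -100, 1]
loss = masked_cross_entropy(logits, labels)
print(round(loss, 4))
```

Only the first and third positions contribute to the mean; the middle position is ignored regardless of its logits.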


In [None]:
model.fit(train_set, epochs=4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fc5d461b550>

---

## Testing

In [13]:
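# Evaluate on the prepared test dataset. The traceback recorded below came from
# an earlier call, model.evaluate([1,2]), which passed a plain list instead of a dataset.
result = model.evaluate(test_set)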

print(result)

AttributeError: in user code:

    File "c:\Users\KoLiBer\Documents\Workspace\nlpexercises\.venv\lib\site-packages\keras\engine\training.py", line 1557, in test_function  *
        return step_function(self, iterator)
    File "c:\Users\KoLiBer\Documents\Workspace\nlpexercises\.venv\lib\site-packages\keras\engine\training.py", line 1546, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "c:\Users\KoLiBer\Documents\Workspace\nlpexercises\.venv\lib\site-packages\keras\engine\training.py", line 1535, in run_step  **
        outputs = model.test_step(data)
    File "c:\Users\KoLiBer\Documents\Workspace\nlpexercises\.venv\lib\site-packages\transformers\modeling_tf_utils.py", line 1492, in test_step
        y = {key: val for key, val in x.items() if key in label_kwargs}

    AttributeError: 'Tensor' object has no attribute 'items'


As we can see, the **ParsBERT** pre-trained model returned much better accuracy after fine-tuning than the models trained in `exercise 3`.
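
For a token-level comparison with the `exercise 3` models, predictions have to be decoded back to tag names and scored only on positions that carry a real label. A minimal sketch, assuming per-token logits and the `-100` ignore convention from preprocessing; the helper `decode_and_score`, the logits, and the class list below are hypothetical and do not come from the trained model.

```python
def decode_and_score(logits, labels, classes, ignore_index=-100):
    """Return predicted tag names and accuracy over non-ignored positions."""
    predicted_tags = []
    correct = 0
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # position has no gold label
        # Index of the highest logit is the predicted class id.
        best = max(range(len(token_logits)), key=token_logits.__getitem__)
        predicted_tags.append(classes[best])
        correct += int(best == label)
    accuracy = correct / len(predicted_tags) if predicted_tags else 0.0
    return predicted_tags, accuracy


# Hypothetical class names, logits, and gold labels.
classes = ["ADJ", "N", "V"]
logits = [[0.1, 2.0, 0.3], [0.0, 0.0, 0.0], [1.5, 0.2, 0.1], [0.2, 0.1, 0.9]]
labels = [1, -100, 2, 2]

predicted_tags, accuracy = decode_and_score(logits, labels, classes)
print(predicted_tags)
print(accuracy)
```

With the real model, `classes` would correspond to `tag_encoder.classes_`, and the logits would come from the model's predictions on `test_set`.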

---