<h2 style="text-align: center">344.075 KV: Natural Language Processing (WS2024)</h2>
<h1 style="color:rgb(0,120,170)">Assignment 3</h1>
<h2 style="color:rgb(0,120,170)">Document Classification with PyTorch and BERT</h2>

<b>Terms of Use</b><br>
This material is prepared for educational purposes at the Johannes Kepler University (JKU) Linz, and is exclusively provided to the registered students of the mentioned course at JKU. It is strictly forbidden to distribute the current file, the contents of the assignment, and its solution. The use or reproduction of this manuscript is only allowed for educational purposes in non-profit organizations, while in this case, the explicit prior acceptance of the author(s) is required.


**Authors:** Shah Nawaz, Shahed Masoudian<br>


<h2>Table of contents</h2>
<ol>
    <a href="#section-general-guidelines"><li style="font-size:large;font-weight:bold">General Guidelines</li></a>
    <a href="#section-tensorboard"><li style="font-size:large;font-weight:bold">Bonus Task: Logging and Publishing Experiment Results (2 extra points)</li></a>
    <a href="#section-taskA"><li style="font-size:large;font-weight:bold">Task A: Document Classification with PyTorch (25 points)</li></a>
    <a href="#section-taskB"><li style="font-size:large;font-weight:bold">Task B: Document Classification with BERT (15 points)</li></a>
    
    
</ol>

<a name="section-general-guidelines"></a><h2 style="color:rgb(0,120,170)">General Guidelines</h2>

### Assignment objective
This assignment aims to provide the practice needed to learn the principles of deep learning programming in NLP using PyTorch. To this end, Task A offers the opportunity to become familiar with PyTorch programming by implementing a "simple" document (sentence) classification model, and Task B extends this classifier with a BERT model. As the assignment requires working with PyTorch and Huggingface Transformers, please familiarize yourself with these libraries using any available teaching resources, in particular the libraries' documentation. The assignment has **40 points** in total and also offers **2 extra points**, which can compensate for any points lost elsewhere.

This Notebook encompasses all aspects of the assignment, namely the descriptions of tasks as well as your solutions and reports. Feel free to add any required cell for solutions. The cells can contain code, reports, charts, tables, or any other material, required for the assignment. Feel free to provide the solutions in an interactive and visual way!

Please discuss any unclear point in the assignment in the provided forum in MOODLE. You are also encouraged to answer your peers' questions. However, when submitting a post, avoid providing solutions. Please let the tutor(s) know should you find any error or unclarity in the assignment.


### Libraries & Dataset

The assignment should be implemented with recent versions of `Python`, `PyTorch`, and `transformers`. Any standard Python library can be used, as long as the library is free and can be installed simply with `pip` or `conda`. Examples of potentially useful libraries are `scikit-learn`, `numpy`, `scipy`, `gensim`, `nltk`, `spaCy`, and `AllenNLP`. Use the latest stable version of each library.

To conduct the experiments, we use a subset of the `HumSet` dataset [1] (https://blog.thedeep.io/humset/). `HumSet` is created by the DEEP (https://www.thedeep.io) project – an open source platform which aims to facilitate processing of textual data for international humanitarian response organizations. The platform enables the classification of text excerpts, extracted from news and reports into a set of domain specific classes. The provided dataset contains the classes (labels) referring to the humanitarian sectors like agriculture, health, and protection. The dataset contains an overall number of 17,301 data points.

Download the dataset from the Moodle page of the course.

The provided zip file consists of the following files:
- `thedeep.subset.train.txt`: Train set in csv format with three fields: sentence_id, text, and label.
- `thedeep.subset.validation.txt`: Validation set in csv format with three fields: sentence_id, text, and label.
- `thedeep.subset.test.txt`: Test set in csv format with three fields: sentence_id, text, and label.
- `thedeep.subset.label.txt`: Captions of the labels.
- `thedeep.ToU.txt`: Terms of use of the dataset.

[1] HumSet: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crises Response
*Selim Fekih, Nicolo' Tamagnone, Benjamin Minixhofer, Ranjan Shrestha, Ximena Contla, Ewan Oglethorpe and Navid Rekabsaz.*
In Findings of the 2022 Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP), December 2022.


### Submission

Each group should submit the following two files:

- One Jupyter Notebook file (`.ipynb`), containing all the code, results, visualizations, etc. **In the submitted Notebook, all the results and visualizations should already be present, and can be observed simply by loading the Notebook in a browser.** The Notebook must be self-contained, meaning that (if necessary) one can run all the cells from top to bottom without any error. Do not forget to put in your names and student numbers in the first cell of the Notebook.
- The HTML file (`.html`) achieved from exporting the Jupyter Notebook to HTML (Download As HTML).

You do not need to include the data files in the submission.



<a name="section-tensorboard"></a><h2 style="color:rgb(0,120,170)">Bonus Task: Logging and Publishing Experiment Results (2 extra points)</h2>

In all experiments of this assignment, use an experiment monitoring tool such as [`TensorBoard`](https://www.tensorflow.org/tensorboard) or [`wandb`](https://wandb.ai) to log and store all useful information about the training and evaluation of the models. Feel free to log any important aspect, in particular the changes in evaluation results on the validation set, in training loss, and in learning rate.

After finalizing all experiments and removing any unnecessary ones, **provide the URL to the results monitoring page below**.

For instance, if using [`TensorBoard.dev`](https://tensorboard.dev), you can run the following command in the folder of log files: `tensorboard dev upload --name my_exp --logdir path/to/output_dir`, and copy the URL it prints, which points to the TensorBoard console.


**URL :** *EDIT!*

<a name="section-taskA"></a><h2 style="color:rgb(0,120,170)">Task A: Document Classification with PyTorch (25 points)</h2>

The aim of this task is identical to that of Assignment 2 - Task B, namely to design a document classification model that exploits pre-trained word embeddings. You may of course reuse the preprocessed text, the dictionary, or any other relevant code or processing steps from the previous assignments.

In this task, you implement a document classification model using PyTorch, which, given a document/sentence (consisting of a set of words), predicts the corresponding class. Before getting started with coding, have a look at the <a href="#section-tensorboard">optional task</a>, as you may want to already include `TensorBoard` in the code. The implementation of the classifier should cover the points below.

**Preprocessing and dictionary (1 point):** Following previous assignments, load the train, validation, and test datasets, apply necessary preprocessing steps, and create a dictionary of words.

**Data batching (4 points):** Using the dictionary, create batches for any given dataset (train/validation/test). Each batch is a two-dimensional matrix of shape *batch-size* × *max-document-length*, containing the IDs of the words in the corresponding documents. *Batch-size* and *max-document-length* are two hyper-parameters and can be set to any appropriate values (*batch-size* must be greater than 1 and *max-document-length* at least 50 words). If a document has more than *max-document-length* words, only the first *max-document-length* words should be kept.
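
The truncation rule above can be sketched in plain Python so that it is easy to check by hand. This is only an illustration: the helper names `pad_or_truncate` and `make_batches`, the toy dictionary, and the ID conventions (0 for padding, 1 for unknown words) are assumptions, and padding shorter documents is itself an assumption that the fixed matrix shape implies.

```python
# Illustrative sketch; helper names and ID conventions are assumptions.
PAD_ID = 0  # assumed ID reserved for padding
OOV_ID = 1  # assumed ID reserved for out-of-vocabulary words


def pad_or_truncate(tokens, word_to_id, max_document_length):
    """Map tokens to IDs, keep the first max_document_length, pad the rest."""
    ids = [word_to_id.get(token, OOV_ID) for token in tokens]
    ids = ids[:max_document_length]
    return ids + [PAD_ID] * (max_document_length - len(ids))


def make_batches(documents, word_to_id, batch_size, max_document_length):
    """Yield batches as lists of equal-length rows of word IDs."""
    rows = [pad_or_truncate(doc, word_to_id, max_document_length) for doc in documents]
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


toy_dictionary = {"food": 2, "water": 3, "health": 4}
toy_documents = [["food", "water"], ["health", "unknown", "food", "water"], ["water"]]
batches = list(make_batches(toy_documents, toy_dictionary, batch_size=2, max_document_length=3))
print(batches)
```

In a PyTorch pipeline the same logic would typically live in a `Dataset` or a collate function, with each batch converted to a `torch.long` tensor.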

**Word embedding lookup (2 points):** Using `torch.nn.Embedding`, create a lookup for the embeddings of all the words in the dictionary. The lookup is in fact a matrix, which maps the ID of each word to the corresponding word vector. Similar to Assignment 2, use the pre-trained vectors of a word embedding model (like word2vec or GloVe) to initialize the word embeddings of the lookup. Keep in mind that the embedding of each word in the lookup must match the correct vector in the pretrained word embeddings. If a word in the lookup does not exist in the pretrained word embeddings, its vector should be initialized randomly.
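
As a rough illustration of the row-to-word matching, the sketch below fills a small weight matrix from a toy "pretrained" table and wraps it in `nn.Embedding.from_pretrained`. The four-word dictionary, the three-dimensional vectors, and the uniform range for random rows are all made-up assumptions; real GloVe or word2vec vectors would be loaded from file.

```python
import torch
import torch.nn as nn

# Toy stand-ins (assumptions): a 4-word dictionary and 3-dimensional "pretrained" vectors.
toy_dictionary = {"<PAD>": 0, "<OOV>": 1, "food": 2, "water": 3}
toy_pretrained = {"food": [0.1, 0.2, 0.3], "water": [0.4, 0.5, 0.6]}
embedding_dim = 3

torch.manual_seed(0)
# Start from small random values so words without a pretrained vector stay random.
weights = torch.empty(len(toy_dictionary), embedding_dim).uniform_(-0.1, 0.1)
for word, idx in toy_dictionary.items():
    if word in toy_pretrained:
        # Row idx must hold the vector of exactly this word.
        weights[idx] = torch.tensor(toy_pretrained[word])

# freeze=False keeps the vectors trainable.
lookup = nn.Embedding.from_pretrained(weights, freeze=False)
print(lookup(torch.tensor([2, 3])))
```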

**Model definition (3 points):** Define the class `ClassificationAverageModel` as a PyTorch model. In the initialization procedure, the model receives the word embedding lookup and registers it as part of the model's parameters. These embedding parameters should be trainable, meaning that the word vectors get updated during model training. Feel free to add any other parameters to the model that might be necessary for the functionality explained in the following.

**Forward function (5 points):** The forward function of the model receives a batch of data, and first fetches the corresponding embeddings of the word IDs in the batch using the lookup. Similar to Assignment 2, the embedding of a document is created by calculating the *element-wise mean* of the embeddings of the document's words. Formally, given the document $d$, consisting of words $\left[ v_1, v_2, ..., v_{|d|} \right]$, the document representation $\mathbf{e}_d$ is defined as:

<center><div>$\mathbf{e}_d = \frac{1}{|d|}\sum_{i=1}^{|d|}{\mathbf{e}_{v_i}}$</div></center>

where $\mathbf{e}_{v}$ is the vector of the word $v$, and $|d|$ is the length of the document. An important point in the implementation of this formula is that the documents in the batch might have different lengths and therefore each document should be divided by its corresponding $|d|$. Finally, this document embedding is utilized to predict the probability of the output classes, done by applying a linear transformation from the embeddings size to the number of classes, followed by Softmax. The linear transformation also belongs to the model's parameters and will be learned in training.
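
The per-document division by $|d|$ can be sketched on a toy batch. The example assumes that padding uses ID 0 and uses a hand-made lookup table so the result can be verified by hand; the variable names are illustrative only.

```python
import torch

PAD_ID = 0  # assumed padding ID

# Two documents with true lengths 2 and 3, padded to length 4.
input_ids = torch.tensor([[1, 2, 0, 0],
                          [2, 2, 1, 0]])
# Toy lookup: word i maps to [i, 10 * i]; the PAD row is deliberately non-zero.
table = torch.tensor([[9.0, 9.0], [1.0, 10.0], [2.0, 20.0]])
embeddings = table[input_ids]                          # (batch, length, dim)

mask = (input_ids != PAD_ID).float()                   # 1 for real words, 0 for padding
summed = (embeddings * mask.unsqueeze(-1)).sum(dim=1)  # sum over real words only
lengths = mask.sum(dim=1, keepdim=True).clamp(min=1)   # |d| per document; clamp avoids 0/0
doc_embeddings = summed / lengths                      # (batch, dim)
print(doc_embeddings)
```

A plain `embeddings.mean(dim=1)` would instead divide every document by the padded length and mix the padding vectors into the average.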

**Loss Function and optimization (2 points):** The loss between the predicted and the actual classes is calculated using Negative Log Likelihood or Cross Entropy. Update the model's parameters using any appropriate optimization mechanism such as Adam.
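
The two loss options are closely related in PyTorch: `nn.CrossEntropyLoss` expects the raw outputs of the linear layer and applies log-softmax internally, while `nn.NLLLoss` expects log-probabilities. The sketch below, with random toy logits and labels as assumptions, shows that both routes give the same value.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 12)            # toy batch: 4 documents, 12 classes
targets = torch.tensor([0, 3, 11, 5])  # toy gold labels

# Option 1: cross entropy on the raw linear outputs.
ce = nn.CrossEntropyLoss()(logits, targets)
# Option 2: negative log likelihood on log-probabilities.
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
print(ce.item(), nll.item())

# Class probabilities, if needed for reporting, come from a plain softmax.
probs = F.softmax(logits, dim=1)
```

One consequence worth keeping in mind: if the forward function already returns softmax probabilities, passing them to `nn.CrossEntropyLoss` would apply the normalization twice.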

**Early Stopping (2 points):** After each epoch, evaluate the model on the *validation set* using accuracy. If the evaluation result (accuracy) improves, save the model as the best performing one so far. If the results are not improving after a certain number of evaluation rounds (set as another hyper-parameter) or if training reaches a certain number of epochs, terminate the training procedure.
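
The stopping rule can be sketched independently of any model. The `EarlyStopper` class below is a hypothetical helper, and the accuracy values are made up purely to exercise the logic.

```python
class EarlyStopper:
    """Track the best validation accuracy and signal when to stop (hypothetical helper)."""

    def __init__(self, patience):
        self.patience = patience  # evaluation rounds allowed without improvement
        self.best_accuracy = float("-inf")
        self.rounds_without_improvement = 0

    def update(self, accuracy):
        """Return True if this accuracy is a new best (the caller should save the model)."""
        if accuracy > self.best_accuracy:
            self.best_accuracy = accuracy
            self.rounds_without_improvement = 0
            return True
        self.rounds_without_improvement += 1
        return False

    @property
    def should_stop(self):
        return self.rounds_without_improvement >= self.patience


# Made-up validation accuracies, one per epoch.
fake_accuracies = [0.50, 0.58, 0.57, 0.61, 0.60, 0.59, 0.60, 0.62]
max_epochs = 10
stopper = EarlyStopper(patience=3)
saved_at = []
for epoch, accuracy in enumerate(fake_accuracies[:max_epochs], start=1):
    if stopper.update(accuracy):
        saved_at.append(epoch)  # in real training: save the model checkpoint here
    if stopper.should_stop:
        break
print(epoch, saved_at, stopper.best_accuracy)
```

With a patience of 3, training in this toy run ends before the last accuracy value is ever seen, which illustrates the trade-off behind choosing the patience hyper-parameter.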

**Test Set Evaluation (1 point):** After finishing the training, load the (already stored) best performing model, and use it for class prediction on the test set.
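
A minimal sketch of restoring the stored parameters before prediction is shown below. It assumes the best model was saved via its `state_dict`; a single linear layer stands in for the classifier, an in-memory buffer stands in for the checkpoint file, and the test batch is random.

```python
import io

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(100, 12)  # toy stand-in for the trained classifier

# Save only the parameters; the buffer plays the role of a checkpoint file.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)

# Later: rebuild the same architecture and restore the best parameters.
buffer.seek(0)
best_model = nn.Linear(100, 12)
best_model.load_state_dict(torch.load(buffer))
best_model.eval()  # switch to evaluation mode

features = torch.randn(5, 100)  # toy test batch
gold = torch.tensor([0, 1, 2, 3, 4])
with torch.no_grad():  # no gradients are needed for evaluation
    predictions = best_model(features).argmax(dim=1)
accuracy = (predictions == gold).float().mean().item()
print(f"toy test accuracy: {accuracy:.2f}")
```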

**Reporting (1 point):** During loading and processing the collection, provide sufficient information and examples about the data and the applied processing steps. Report the results of the best performing model on the validation and test set in a table.

**Overall functionality of the training procedure (4 points).**


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import spacy
import string
from collections import defaultdict
import matplotlib.pyplot as plt
import seaborn as sns
from torch.utils.data import DataLoader, Dataset
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.metrics import accuracy_score, classification_report


nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

# Preprocessing and dictionary

In [None]:
trainData = pd.read_csv("/content/drive/My Drive/NLP/thedeep.subset.train.txt", delimiter=",", names=['sentence_id', 'text', 'label'])
valData = pd.read_csv("/content/drive/My Drive/NLP/thedeep.subset.validation.txt", delimiter=",", names=['sentence_id', 'text', 'label'])
testData = pd.read_csv("/content/drive/My Drive/NLP/thedeep.subset.test.txt", delimiter=",", names=['sentence_id', 'text', 'label'])
label_mapping = {}
with open("/content/drive/My Drive/NLP/thedeep.labels.txt", "r", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().split(',')
        if len(parts) >= 2:
            label_id = parts[0]
            label_name = ','.join(parts[1:])  # Handle label names with commas
            label_mapping[label_id] = label_name

In [None]:
trainData.head()

Unnamed: 0,sentence_id,text,label
0,5446,In addition to the immediate life-saving inter...,9
1,8812,There are approximately 2.6 million people cla...,3
2,16709,"While aid imports have held up recently, comme...",5
3,3526,Heavy rainfalls as well as onrush of water fro...,0
4,4928,"Based on field reports 9 , the main production...",3


In [None]:
# drop ID sentence
trainData = trainData.drop(['sentence_id'], axis=1)
valData = valData.drop(['sentence_id'], axis=1)
testData = testData.drop(['sentence_id'], axis=1)

In [None]:
# print the new Head of train data to see results
trainData.head()

Unnamed: 0,text,label
0,In addition to the immediate life-saving inter...,9
1,There are approximately 2.6 million people cla...,3
2,"While aid imports have held up recently, comme...",5
3,Heavy rainfalls as well as onrush of water fro...,0
4,"Based on field reports 9 , the main production...",3


In [None]:
nlp = spacy.load('en_core_web_sm')
stemmer = PorterStemmer()

def pp_text(text):
    #lower casing, removing punctuation and numbers
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)

    #Tokenize and remove stopwords
    doc = nlp(text)
    tokens = [token.text for token in doc if not token.is_stop]

    #stemming
    stemTokens = [stemmer.stem(token) for token in tokens]

    return stemTokens

#apply preprocessing
trainData['clean_text'] = trainData['text'].apply(pp_text)
valData['clean_text'] = valData['text'].apply(pp_text)
testData['clean_text'] = testData['text'].apply(pp_text)


In [None]:
print('The initial sentence was=', testData['text'][0])
print('The cleaned sentence is=', testData['clean_text'][0])

The initial sentence was= Overall 30% decrease in MAM Children admissions from 12,879 in April 2016 to 9,047 in April 2017
The cleaned sentence is= ['overal', ' ', 'decreas', 'mam', 'children', 'admiss', ' ', 'april', ' ', ' ', 'april']


In [None]:
def clean(tokens):
    return [token for token in tokens if token.strip()]

trainData['clean_text'] = trainData['clean_text'].apply(clean)
valData['clean_text'] = valData['clean_text'].apply(clean)
testData['clean_text'] = testData['clean_text'].apply(clean)

In [None]:
#Now let's see if that worked out
print("Sample cleaned data from trainData:")
print(trainData['clean_text'].head(2))
print("\nSample cleaned data from valData:")
print(valData['clean_text'].head(2))
print("\nSample cleaned data from testData:")
print(testData['clean_text'].head(2))

Sample cleaned data from trainData:
0    [addit, immedi, lifesav, intervent, unicef, ta...
1    [approxim, million, peopl, classifi, phase, mi...
Name: clean_text, dtype: object

Sample cleaned data from valData:
0    [veteran, threw, roadblock, main, northbound, ...
1    [water, depart, complain, lack, skill, worker,...
Name: clean_text, dtype: object

Sample cleaned data from testData:
0    [overal, decreas, mam, children, admiss, april...
1    [fear, ebola, led, attack, health, worker, apr...
Name: clean_text, dtype: object


In [None]:
vocab = defaultdict(int)
for tokens in trainData['clean_text']:
    for word in tokens:
        vocab[word] += 1

#Check the initial vocabulary size and display some sample tokens
initial_vocab_size = len(vocab)
print(f"Initial vocabulary size: {initial_vocab_size}")
print("Top 10 most frequent tokens:", sorted(vocab.items(), key=lambda x: x[1], reverse=True)[:10])


Initial vocabulary size: 23270
Top 10 most frequent tokens: [('case', 5594), ('report', 5113), ('food', 4262), ('peopl', 3881), ('area', 3524), ('children', 2953), ('water', 2557), ('health', 2477), ('increas', 2258), ('includ', 2220)]


In [None]:
sample_index = 1
print(f"Sample sentence is: {testData['text'][sample_index]}")
print('-------------------------------------------------------------------------------------------------------------------------------------------')
#print the tokens of the sentence after preprocessing (vocabulary reduction and OOV handling follow in later cells)
print(f"The tokens of the sentence are: {testData['clean_text'][sample_index]}")


Sample sentence is: In 2014, fear of Ebola also led to attacks on health workers. In April 2014, an angry crowd attacked an Ebola treatment center in Macenta, 425 kilometers southeast of Guinea’s capital, Conakry, run by Doctors Without Borders (Medecins Sans Frontieres or MSF), which it accused of bringing Ebola to the city. In August 2014, people in N’Zérékoré, Guinea’s second largest city, protested spraying a market with disinfectant that they believed was infected with the Ebola virus and rioted, injuring over 50 people, including security forces. Law enforcement agencies in Congo should ensure that they can quickly, adequately, and appropriately respond if similar attacks occur.
-------------------------------------------------------------------------------------------------------------------------------------------
The tokens of the sentence are: ['fear', 'ebola', 'led', 'attack', 'health', 'worker', 'april', 'angri', 'crowd', 'attack', 'ebola', 'treatment', 'center', 'macenta',

In [None]:
#frequency threshold to filter tokens
freqThreshold = 1
vocabReduced = {token for token, freq in vocab.items() if freq > freqThreshold}

initial_vocab_size = len(vocab)
reduced_vocab_size = len(vocabReduced)
print(f"Initial vocabulary size: {initial_vocab_size}")
print(f"Reduced vocabulary size after applying frequency threshold: {reduced_vocab_size}")

Initial vocabulary size: 23270
Reduced vocabulary size after applying frequency threshold: 11198


In [None]:
#handle out-of-vocabulary (OOV) tokens by replacing them with the "<OOV>" marker
def replace_oov(tokens, vocab):
    return [token if token in vocab else "<OOV>" for token in tokens]

trainData['clean_text'] = trainData['clean_text'].apply(lambda tokens: replace_oov(tokens, vocabReduced))
valData['clean_text'] = valData['clean_text'].apply(lambda tokens: replace_oov(tokens, vocabReduced))
testData['clean_text'] = testData['clean_text'].apply(lambda tokens: replace_oov(tokens, vocabReduced))

sample_index = 1
print(f"Original sentence: {testData['text'][sample_index]}")
print(f"Tokens after preprocessing and OOV handling: {testData['clean_text'][sample_index]}")

Original sentence: In 2014, fear of Ebola also led to attacks on health workers. In April 2014, an angry crowd attacked an Ebola treatment center in Macenta, 425 kilometers southeast of Guinea’s capital, Conakry, run by Doctors Without Borders (Medecins Sans Frontieres or MSF), which it accused of bringing Ebola to the city. In August 2014, people in N’Zérékoré, Guinea’s second largest city, protested spraying a market with disinfectant that they believed was infected with the Ebola virus and rioted, injuring over 50 people, including security forces. Law enforcement agencies in Congo should ensure that they can quickly, adequately, and appropriately respond if similar attacks occur.
Tokens after preprocessing and OOV handling: ['fear', 'ebola', 'led', 'attack', 'health', 'worker', 'april', 'angri', 'crowd', 'attack', 'ebola', 'treatment', 'center', '<OOV>', 'kilomet', 'southeast', 'guinea', 'capit', 'conakri', 'run', 'doctor', 'border', 'medecin', 'san', 'frontier', 'msf', 'accus', 'b

In [None]:
# Function to calculate TF-IDF weights
def calculate_tfidf(corpus, vocab):
    doc_count = len(corpus)
    tfidf_vectors = []
    idf = {}

    # Compute IDF for each term in vocab
    for token in vocab:
        doc_freq = sum(1 for doc in corpus if token in doc)
        idf[token] = np.log((doc_count + 1) / (doc_freq + 1)) + 1

    # Compute TF-IDF for each document
    for tokens in corpus:
        tfidf_vector = np.zeros(len(vocab))
        term_counts = defaultdict(int)

        # Term frequency
        for token in tokens:
            term_counts[token] += 1

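        doc_len = max(len(tokens), 1)  # guard against division by zero for empty documents
        for idx, token in enumerate(vocab):
            # .get() avoids inserting missing tokens into the defaultdict
            tf = term_counts.get(token, 0) / doc_len
            tfidf_vector[idx] = tf * idf[token]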

        tfidf_vectors.append(tfidf_vector)

    return np.array(tfidf_vectors)

# Generate TF-IDF vectors
tfidf_train = calculate_tfidf(trainData['clean_text'], vocabReduced)
tfidf_val = calculate_tfidf(valData['clean_text'], vocabReduced)
tfidf_test = calculate_tfidf(testData['clean_text'], vocabReduced)


In [None]:
# Function to calculate BM25 weights
def calculate_bm25(corpus, vocab, k=1.5, b=0.75):
    doc_count = len(corpus)
    avg_doc_len = np.mean([len(doc) for doc in corpus])
    bm25_vectors = []
    idf = {}

    # Compute IDF for each term
    for token in vocab:
        doc_freq = sum(1 for doc in corpus if token in doc)
        idf[token] = np.log((doc_count - doc_freq + 0.5) / (doc_freq + 0.5) + 1)

    # Compute BM25 scores
    for tokens in corpus:
        bm25_vector = np.zeros(len(vocab))
        term_counts = defaultdict(int)

        for token in tokens:
            term_counts[token] += 1

        doc_len = len(tokens)
        for idx, token in enumerate(vocab):
            tf = term_counts.get(token, 0)  # .get() avoids inserting missing tokens into the defaultdict
            if tf == 0:
                continue  # the score stays 0 for terms absent from the document
            bm25_vector[idx] = idf[token] * ((tf * (k + 1)) / (tf + k * (1 - b + b * (doc_len / avg_doc_len))))

        bm25_vectors.append(bm25_vector)

    return np.array(bm25_vectors)

# Generate BM25 vectors
bm25_train = calculate_bm25(trainData['clean_text'], vocabReduced)
bm25_val = calculate_bm25(valData['clean_text'], vocabReduced)
bm25_test = calculate_bm25(testData['clean_text'], vocabReduced)


In [None]:
# Function to calculate sparsity rate
def calculate_sparsity(vectors):
    total_elements = vectors.size
    zero_elements = np.count_nonzero(vectors == 0)
    sparsity_rate = (zero_elements / total_elements) * 100
    return sparsity_rate

# Calculate and print sparsity rates
sparsity_tfidf_train = calculate_sparsity(tfidf_train)
sparsity_tfidf_val = calculate_sparsity(tfidf_val)
sparsity_tfidf_test = calculate_sparsity(tfidf_test)

sparsity_bm25_train = calculate_sparsity(bm25_train)
sparsity_bm25_val = calculate_sparsity(bm25_val)
sparsity_bm25_test = calculate_sparsity(bm25_test)

print(f"Sparsity rate of TF-IDF vectors for train set: {sparsity_tfidf_train:.2f}%")
print(f"Sparsity rate of TF-IDF vectors for validation set: {sparsity_tfidf_val:.2f}%")
print(f"Sparsity rate of TF-IDF vectors for test set: {sparsity_tfidf_test:.2f}%")

print(f"Sparsity rate of BM25 vectors for train set: {sparsity_bm25_train:.2f}%")
print(f"Sparsity rate of BM25 vectors for validation set: {sparsity_bm25_val:.2f}%")
print(f"Sparsity rate of BM25 vectors for test set: {sparsity_bm25_test:.2f}%")


Sparsity rate of TF-IDF vectors for train set: 99.74%
Sparsity rate of TF-IDF vectors for validation set: 99.74%
Sparsity rate of TF-IDF vectors for test set: 99.74%
Sparsity rate of BM25 vectors for train set: 99.74%
Sparsity rate of BM25 vectors for validation set: 99.74%
Sparsity rate of BM25 vectors for test set: 99.74%


# Data batching

In [None]:
class DocumentDataset(Dataset):
    def __init__(self, documents, labels, dictionary, max_document_length):
        """
        :param documents: List of preprocessed documents (list of tokens).
        :param labels: List of corresponding labels.
        :param dictionary: Dictionary mapping words to IDs.
        :param max_document_length: Maximum number of tokens per document.
        """
        self.documents = documents
        self.labels = labels
        self.dictionary = dictionary
        self.max_document_length = max_document_length

    def __len__(self):
        return len(self.documents)

    def __getitem__(self, idx):
        document = self.documents[idx]
        label = self.labels[idx]

        # Convert words to IDs, handle OOV and padding
        word_ids = [self.dictionary.get(word, self.dictionary["<OOV>"]) for word in document]
        word_ids = word_ids[:self.max_document_length]

        padding = [self.dictionary["<PAD>"]] * (self.max_document_length - len(word_ids))
        word_ids.extend(padding)

        return torch.tensor(word_ids, dtype=torch.long), torch.tensor(label, dtype=torch.long)


def create_batches(documents, labels, dictionary, batch_size, max_document_length, shuffle=True):
    """
    Creates batches using the DocumentDataset.

    :param documents: List of preprocessed documents (list of tokens).
    :param labels: List of labels corresponding to the documents.
    :param dictionary: Dictionary mapping words to IDs.
    :param batch_size: Size of each batch.
    :param max_document_length: Maximum length of a document.
    :param shuffle: Whether to reshuffle the data every epoch (useful for training; can be disabled for validation/test).
    :return: DataLoader
    """
    dataset = DocumentDataset(documents, labels, dictionary, max_document_length)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)
    return dataloader

if __name__ == "__main__":
    # Generating dictionary from the reduced vocabulary
    dictionary = {word: idx for idx, word in enumerate(vocabReduced, start=2)}
    dictionary["<PAD>"] = 0  # Padding token
    dictionary["<OOV>"] = 1  # Out-of-vocabulary token

    # Preprocessed dataset
    train_documents = trainData['clean_text'].tolist()  # List of tokenized documents
    train_labels = trainData['label'].tolist()  # Corresponding labels

    # Hyperparameters
    batch_size = 4
    max_document_length = 50

    # Creating batches
    train_loader = create_batches(train_documents, train_labels, dictionary, batch_size, max_document_length)

    # Displaying batches
    for batch_idx, (word_ids, batch_labels) in enumerate(train_loader):
        print(f"Batch {batch_idx + 1}")
        print("Word IDs:", word_ids)
        print("Labels:", batch_labels)
        break


Batch 1
Word IDs: tensor([[ 7765,  8113,  1358, 10060,  7707,  8339,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [ 5100,  8243,  1058,  6607,  1606,   717,  4052,  3001,  7707,  2913,
          7197,  3001, 10023,  3963,  6607,  4418,  8707, 10732,  5104,  6962,
          6418,  7023,  6013,  5196, 11038,  9831,   376,  8860,  1606,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [ 5889,  7898,   853,  9232,   853,  6150,  6935,  8860,  4229,  8112,
          9030,  8747,  9741,  8915,  8544,   119,  8747,  9352,  8001,  5428,
         10794,     0,     0,   

# Word embedding lookup

In [None]:
def load_pretrained_embeddings(embedding_path, embedding_dim):
    """
    Load pre-trained word embeddings from a file.

    :param embedding_path: Path to the pre-trained embeddings file.
    :param embedding_dim: Dimensionality of the embeddings.
    :return: A dictionary mapping words to their embedding vectors.
    """
    embeddings_index = {}
    with open(embedding_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = vector
    return embeddings_index

def create_embedding_matrix(dictionary, embeddings_index, embedding_dim):
    """
    Create an embedding matrix where each row corresponds to a word ID in the dictionary.

    :param dictionary: Dictionary mapping words to IDs.
    :param embeddings_index: Pre-trained word embeddings.
    :param embedding_dim: Dimensionality of the embeddings.
    :return: A PyTorch embedding layer.
    """
    vocab_size = len(dictionary)
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    for word, idx in dictionary.items():
        if word in embeddings_index:
            embedding_matrix[idx] = embeddings_index[word]
        else:
            embedding_matrix[idx] = np.random.uniform(-0.1, 0.1, embedding_dim)

    return torch.tensor(embedding_matrix, dtype=torch.float32)

if __name__ == "__main__":
    embedding_path = "/content/drive/My Drive/NLP/glove.6B.100d.txt"
    embedding_dim = 100

    # Loading pre-trained embeddings
    embeddings_index = load_pretrained_embeddings(embedding_path, embedding_dim)

    # Generating the embedding matrix
    embedding_matrix = create_embedding_matrix(dictionary, embeddings_index, embedding_dim)

    # Creating PyTorch embedding layer
    embedding_layer = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)

    print(f"Embedding matrix shape: {embedding_matrix.shape}")
    print(f"Example embedding for '<PAD>': {embedding_matrix[dictionary['<PAD>']]}")


Embedding matrix shape: torch.Size([11200, 100])
Example embedding for '<PAD>': tensor([ 0.0260, -0.0175, -0.0525, -0.0034,  0.0345, -0.0733,  0.0166, -0.0394,
         0.0415, -0.0380,  0.0474, -0.0601, -0.0004, -0.0283,  0.0757,  0.0986,
         0.0325, -0.0703,  0.0127,  0.0423, -0.0759,  0.0078, -0.0315, -0.0555,
         0.0590,  0.0349,  0.0849,  0.0690, -0.0102,  0.0831, -0.0291, -0.0753,
        -0.0137, -0.0967,  0.0162,  0.0838, -0.0372, -0.0860,  0.0200,  0.0760,
         0.0484, -0.0427, -0.0137,  0.0273, -0.0182,  0.0892, -0.0260, -0.0651,
         0.0640, -0.0945,  0.0050,  0.0211,  0.0791,  0.0742, -0.0407, -0.0880,
         0.0984,  0.0990,  0.0535, -0.0410,  0.0071, -0.0625, -0.0579, -0.0229,
        -0.0893,  0.0145, -0.0042,  0.0665,  0.0478, -0.0670, -0.0222,  0.0281,
         0.0524, -0.0144, -0.0255, -0.0423, -0.0866,  0.0827,  0.0460,  0.0370,
         0.0871, -0.0373, -0.0645, -0.0210,  0.0082, -0.0730,  0.0392,  0.0730,
        -0.0180, -0.0475,  0.0941,  0.08

# Model definition


In [None]:
class ClassificationAverageModel(nn.Module):
    def __init__(self, embedding_layer, embedding_dim, num_classes):
        """
        :param embedding_layer: Pre-trained word embedding lookup (nn.Embedding).
        :param embedding_dim: Dimensionality of word embeddings.
        :param num_classes: Number of output classes for classification.
        """
        super(ClassificationAverageModel, self).__init__()
        self.embedding = embedding_layer
        self.embedding_dim = embedding_dim
        self.num_classes = num_classes

        # Defining a fully connected layer
        self.fc = nn.Linear(self.embedding_dim, self.num_classes)

    def forward(self, input_ids):
        """
        Forward pass through the model.

        :param input_ids: Input tensor of word IDs (batch_size x max_document_length).
        :return: Logits for each class (batch_size x num_classes).
        """
        # Get embeddings for input word IDs
        embeddings = self.embedding(input_ids)  # (batch_size x max_document_length x embedding_dim)

        # Mean pooling: Average over the max_document_length dimension
        pooled_embeddings = embeddings.mean(dim=1)  # (batch_size x embedding_dim)

        # Pass the pooled embeddings through the fully connected layer
        logits = self.fc(pooled_embeddings)  # (batch_size x num_classes)

        return logits

if __name__ == "__main__":
    num_classes = 12
    embedding_dim = 100
    model = ClassificationAverageModel(embedding_layer, embedding_dim, num_classes)
    batch_size = 4
    max_document_length = 50
    input_ids = torch.randint(0, len(dictionary), (batch_size, max_document_length))
    logits = model(input_ids)
    print(f"Logits shape: {logits.shape}")


Logits shape: torch.Size([4, 12])


# Forward function

In [None]:
class ClassificationAverageModel(nn.Module):
    def __init__(self, embedding_layer, embedding_dim, num_classes):
        """
        :param embedding_layer: Pre-trained word embedding lookup (nn.Embedding).
        :param embedding_dim: Dimensionality of word embeddings.
        :param num_classes: Number of output classes for classification.
        """
        super(ClassificationAverageModel, self).__init__()
        self.embedding = embedding_layer
        self.embedding_dim = embedding_dim
        self.num_classes = num_classes
        self.fc = nn.Linear(self.embedding_dim, self.num_classes)

    def forward(self, input_ids):
        """
        Forward pass through the model.

        :param input_ids: Input tensor of word IDs (batch_size x max_document_length).
        :return: Probabilities for each class (batch_size x num_classes).
        """
        # Getting embeddings for input word IDs
        embeddings = self.embedding(input_ids)  # (batch_size x max_document_length x embedding_dim)

        # Creating a mask for padding (PAD = 0)
        mask = (input_ids != 0).float()  # (batch_size x max_document_length)

        # Calculating the sum of embeddings for each document
        sum_embeddings = torch.sum(embeddings * mask.unsqueeze(-1), dim=1)  # (batch_size x embedding_dim)

        # Calculating the length of each document (number of non-PAD tokens)
        doc_lengths = torch.sum(mask, dim=1).unsqueeze(-1)  # (batch_size x 1)

        # Avoiding division by zero for empty documents
        doc_lengths = torch.clamp(doc_lengths, min=1)

        # Calculating the mean embeddings for each document
        doc_embeddings = sum_embeddings / doc_lengths  # (batch_size x embedding_dim)

        # Passing the document embeddings through the fully connected layer
        logits = self.fc(doc_embeddings)  # (batch_size x num_classes)

        # Applying Softmax to get class probabilities
        probabilities = torch.softmax(logits, dim=1)  # (batch_size x num_classes)

        return probabilities

if __name__ == "__main__":
    num_classes = 12
    embedding_dim = 100

    # Instantiate the model
    model = ClassificationAverageModel(embedding_layer, embedding_dim, num_classes)

    batch_size = 4
    max_document_length = 50
    input_ids = torch.randint(0, len(dictionary), (batch_size, max_document_length))

    probabilities = model(input_ids)
    print(f"Probabilities shape: {probabilities.shape}")
    print(f"Probabilities: {probabilities}")


Probabilities shape: torch.Size([4, 12])
Probabilities: tensor([[0.0747, 0.0852, 0.0834, 0.0798, 0.0796, 0.0843, 0.0786, 0.1057, 0.0753,
         0.0833, 0.0792, 0.0908],
        [0.0791, 0.0903, 0.0838, 0.0757, 0.0783, 0.0877, 0.0829, 0.0899, 0.0742,
         0.0867, 0.0770, 0.0944],
        [0.0782, 0.0903, 0.0850, 0.0831, 0.0782, 0.0827, 0.0804, 0.0916, 0.0763,
         0.0811, 0.0776, 0.0956],
        [0.0758, 0.0833, 0.0858, 0.0786, 0.0786, 0.0882, 0.0830, 0.0937, 0.0754,
         0.0851, 0.0830, 0.0895]], grad_fn=<SoftmaxBackward0>)
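
The masking step in `forward` matters because padded positions would otherwise pull the document embedding toward the `<PAD>` vector. The following is a minimal sketch of that effect, using plain Python lists instead of tensors so the arithmetic stays visible. The token IDs and two-dimensional vectors are made up for illustration, and `naive_mean` and `masked_mean` are hypothetical helper names, not part of the notebook.

```python
PAD_ID = 0  # the notebook reserves ID 0 for '<PAD>'


def naive_mean(vectors):
    """Average all vectors, including those at padded positions."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def masked_mean(token_ids, vectors, pad_id=PAD_ID):
    """Average only the vectors whose token ID is not the padding ID."""
    kept = [v for t, v in zip(token_ids, vectors) if t != pad_id]
    if not kept:
        # Empty document: return zeros, which matches clamping the length to 1
        return [0.0] * len(vectors[0])
    return naive_mean(kept)


# One document with two real tokens followed by two <PAD> tokens (illustrative values)
token_ids = [5, 9, 0, 0]
vectors = [[1.0, 3.0], [3.0, 5.0], [0.5, 0.5], [0.5, 0.5]]

plain = naive_mean(vectors)               # [1.25, 2.25], diluted by the padding
masked = masked_mean(token_ids, vectors)  # [2.0, 4.0], real tokens only
print(plain, masked)
```

The shorter a document is relative to `max_document_length`, the larger the difference between the two averages becomes.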


# Loss Function and optimization

In [None]:
num_classes = 12
embedding_dim = 100
model = ClassificationAverageModel(embedding_layer, embedding_dim, num_classes)
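# Note: nn.CrossEntropyLoss applies log-softmax internally and expects raw logits.
# forward() above already returns softmax probabilities, so softmax is applied twice here.
loss_function = nn.CrossEntropyLoss()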
optimizer = optim.Adam(model.parameters(), lr=1e-3)
epochs = 5
for epoch in range(epochs):
    model.train()
    total_loss = 0.0
    for batch_idx, (input_ids, labels) in enumerate(train_loader):
        predictions = model(input_ids)
        loss = loss_function(predictions, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch + 1}/{epochs}, Loss: {total_loss / len(train_loader):.4f}")


Epoch 1/5, Loss: 2.0700
Epoch 2/5, Loss: 1.8799
Epoch 3/5, Loss: 1.8305
Epoch 4/5, Loss: 1.8106
Epoch 5/5, Loss: 1.7958
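
The loss above decreases slowly and stays above 1.7. One plausible contributor is that `nn.CrossEntropyLoss` applies a log-softmax to its input, while `forward` already returns softmax probabilities. The sketch below reimplements the per-example cross-entropy in plain Python (the `softmax` and `cross_entropy` helpers are illustrative, not the notebook's code) and shows that, when the inputs are probabilities, the loss for 12 classes cannot drop below roughly 1.62, however confident the model is.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target):
    """Per-example cross-entropy on raw scores: -log(softmax(scores)[target])."""
    return -math.log(softmax(scores)[target])


num_classes = 12
target = 0

# A very confident and correct prediction, expressed as logits
logits = [20.0] + [0.0] * (num_classes - 1)
probs = softmax(logits)

loss_on_logits = cross_entropy(logits, target)  # close to 0
loss_on_probs = cross_entropy(probs, target)    # softmax applied twice

# Best case with probabilities as input: a one-hot vector [1, 0, ..., 0]
lower_bound = math.log(math.e + (num_classes - 1)) - 1.0

print(f"loss on logits:        {loss_on_logits:.6f}")
print(f"loss on probabilities: {loss_on_probs:.4f}")
print(f"lower bound:           {lower_bound:.4f}")
```

The logged losses (2.07 down to 1.70) sit just above this bound, which is consistent with the double softmax but does not prove it is the only cause. Two common remedies are to return logits from `forward` and apply softmax only when probabilities are needed, or to keep the softmax and train with `nn.NLLLoss` on the logarithm of the probabilities.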


# Early Stopping

In [None]:
# Validation DataLoader Creation
val_dataset = DocumentDataset(
    valData['clean_text'].tolist(),  # Validation documents
    valData['label'].tolist(),      # Validation labels
    dictionary,                     # Dictionary mapping words to IDs
    max_document_length             # Maximum document length
)

val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

# Loss Function and Optimizer
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Early Stopping Hyperparameters
patience = 3  # Number of epochs to wait for improvement
best_accuracy = 0.0  # Best validation accuracy observed so far
epochs_without_improvement = 0  # Counter for epochs without improvement
max_epochs = 20  # Maximum number of epochs

# Training with Early Stopping
for epoch in range(max_epochs):
    model.train()  # Set model to training mode
    total_loss = 0.0

    # Training on batches
    for batch_idx, (input_ids, labels) in enumerate(train_loader):
        # Forward pass
        predictions = model(input_ids)

        # Compute loss
        loss = loss_function(predictions, labels)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Track loss
        total_loss += loss.item()

    # Validation Accuracy
    model.eval()  # Set model to evaluation mode
    val_predictions = []
    val_labels = []

    with torch.no_grad():
        for input_ids, labels in val_loader:
            predictions = model(input_ids)
            val_predictions.extend(torch.argmax(predictions, dim=1).cpu().numpy())
            val_labels.extend(labels.cpu().numpy())

    # Calculating validation accuracy
    val_accuracy = accuracy_score(val_labels, val_predictions)
    print(f"Epoch {epoch + 1}, Loss: {total_loss / len(train_loader):.4f}, Validation Accuracy: {val_accuracy:.4f}")

    # Checking for improvement
    if val_accuracy > best_accuracy:
        best_accuracy = val_accuracy
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "best_model.pth")  # Save the best model
        print(f"Best model saved with accuracy: {best_accuracy:.4f}")
    else:
        epochs_without_improvement += 1

    # Early Stopping
    if epochs_without_improvement >= patience:
        print("Early stopping triggered.")
        break


Epoch 1, Loss: 1.7734, Validation Accuracy: 0.7939
Best model saved with accuracy: 0.7939
Epoch 2, Loss: 1.7501, Validation Accuracy: 0.7997
Best model saved with accuracy: 0.7997
Epoch 3, Loss: 1.7313, Validation Accuracy: 0.8032
Best model saved with accuracy: 0.8032
Epoch 4, Loss: 1.7182, Validation Accuracy: 0.8012
Epoch 5, Loss: 1.7084, Validation Accuracy: 0.8001
Epoch 6, Loss: 1.7009, Validation Accuracy: 0.7985
Early stopping triggered.
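
The patience logic of the loop above can be isolated into a small helper, which makes it easy to check against the logged accuracies. This is a sketch only: the class name `EarlyStopping` and its `step` method are hypothetical and not part of the notebook.

```python
class EarlyStopping:
    """Track the best validation score and signal when patience runs out."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.epochs_without_improvement = 0

    def step(self, score, epoch):
        """Record one validation score; return True when training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Validation accuracies copied from the log above
val_accuracies = [0.7939, 0.7997, 0.8032, 0.8012, 0.8001, 0.7985]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, accuracy in enumerate(val_accuracies, start=1):
    if stopper.step(accuracy, epoch):
        stopped_at = epoch
        break

print(f"best epoch: {stopper.best_epoch}, stopped at epoch: {stopped_at}")
```

Replaying the log reproduces the behaviour seen above: the best checkpoint comes from epoch 3, and training stops after epoch 6.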


# Test Set Evaluation

In [None]:
checkpoint = torch.load("best_model.pth", weights_only=True)
model.load_state_dict(checkpoint)
model.eval()  # Setting model to evaluation mode

# Test DataLoader Creation
test_dataset = DocumentDataset(
    testData['clean_text'].tolist(),  # Test documents
    testData['label'].tolist(),      # Test labels
    dictionary,                     # Dictionary mapping words to IDs
    max_document_length             # Maximum document length
)

test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Testing set evaluation
test_predictions = []
test_labels = []

with torch.no_grad():
    for input_ids, labels in test_loader:
        predictions = model(input_ids)
        test_predictions.extend(torch.argmax(predictions, dim=1).cpu().numpy())
        test_labels.extend(labels.cpu().numpy())

# Calculating accuracy
test_accuracy = accuracy_score(test_labels, test_predictions)
print(f"Test Accuracy: {test_accuracy:.4f}")
print("Classification Report:")
print(classification_report(test_labels, test_predictions, zero_division=0))


Test Accuracy: 0.7992
Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        45
           1       0.37      0.36      0.36       107
           2       0.90      0.85      0.88       123
           3       0.77      0.88      0.82       405
           4       0.93      0.91      0.92       635
           5       0.44      0.40      0.42       121
           6       0.00      0.00      0.00        35
           7       0.72      0.69      0.70        45
           8       0.90      0.89      0.90       112
           9       0.82      0.88      0.85       615
          10       0.70      0.77      0.73       180
          11       0.80      0.78      0.79       172

    accuracy                           0.80      2595
   macro avg       0.61      0.62      0.61      2595
weighted avg       0.77      0.80      0.79      2595
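
The gap between the macro and weighted averages follows directly from the class sizes. As a quick check, the sketch below recomputes both averages from the per-class F1-scores and supports printed in the report. Because the inputs are already rounded to two decimals, the results can differ from the report in the last digit.

```python
# Per-class F1-scores and supports, copied from the classification report above
f1_scores = [0.00, 0.36, 0.88, 0.82, 0.92, 0.42, 0.00, 0.70, 0.90, 0.85, 0.73, 0.79]
supports = [45, 107, 123, 405, 635, 121, 35, 45, 112, 615, 180, 172]

# Macro average: every class counts equally, regardless of its size
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: every class counts in proportion to its number of examples
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(f"macro F1:    {macro_f1:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
```

Classes 0 and 6 contribute a zero to the macro average with the same weight as class 4, whereas in the weighted average they account for only 80 of the 2,595 test examples.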



# Reporting

### **1. Loading and Data Details**
- **Dataset Information**:
  - Training Set: 12,110 examples
  - Validation Set: 2,596 examples
  - Test Set: 2,595 examples

- **Class Distributions**:
  - Training Labels: Example: `{4: 2829, 9: 2657, 3: 2079, ...}`
  - Validation Labels: Example: `{4: 665, 9: 546, 3: 420, ...}`
  - Test Labels: Example: `{4: 635, 9: 615, 3: 405, ...}`

- **Raw Data Example** (Before preprocessing):
  - **Text**: "In addition to the immediate life-saving interventions, UNICEF is taking action to protect..."
  - **Label**: 9


### **2. Preprocessing Steps**
- **Steps Performed**:
  - Lowercasing, punctuation removal, and numeric removal.
  - Tokenization using `spacy` and stopword removal.
  - Word stemming using `PorterStemmer`.
  - Handling Out-of-Vocabulary (OOV) words by replacing rare words with `<OOV>`.
  - Truncating/padding to `max_document_length = 50`.

- **Vocabulary Size**:
  - Initial Vocabulary Size: 23,270
  - Reduced Vocabulary Size (After Filtering Rare Words): 11,198

- **Examples of Preprocessing**:
  - **Original Text**: "In addition to the immediate life-saving interventions, UNICEF is taking action..."
  - **After Preprocessing**: `["addit", "immedi", "lifesav", "intervent", "unicef", "action"]`
  - **After OOV Handling**: `["addit", "immedi", "lifesav", "intervent", "unicef", "<OOV>"]`


### **3. Model Performance**
#### Best-Performing Model:
- Model: **ClassificationAverageModel** with pre-trained GloVe embeddings.
- Early stopping used with a patience of 3 epochs.

#### Results:
| Metric                | Validation Set | Test Set    |
|-----------------------|----------------|-------------|
| **Accuracy**          | 80.3%          | 79.9%       |
| **Macro Avg F1-Score**| 64.0%          | 61.0%       |
| **Weighted Avg F1**   | 79.0%          | 79.0%       |



# Overall Functionality of the Training Procedure

#### **Key Components of the Training Procedure**
1. **Data Preparation and Preprocessing:**
   - Training, validation, and test sets were loaded and preprocessed with the same pipeline.
   - Preprocessing steps included:
     - Lowercasing, removing punctuation and numbers, tokenizing, removing stopwords, stemming, and handling out-of-vocabulary (OOV) words.
     - Padding and truncation were applied to standardize document lengths.
   - The dataset covers 12 classes and is imbalanced: label 4 has 2,829 training examples, while the smallest classes have only 35 to 45 examples in the test set.

2. **Embedding Initialization:**
   - Pre-trained GloVe embeddings (100-dimensional) were used to initialize the embedding layer.
   - Words missing in the GloVe vocabulary were initialized with random embeddings.

3. **Model Architecture:**
   - The model, `ClassificationAverageModel`, used a trainable embedding layer.
   - Document embeddings were computed as the element-wise mean of word embeddings for each document.
   - A fully connected layer with a Softmax activation was used to map document embeddings to class probabilities.

4. **Training with Early Stopping:**
   - **Loss Function:** CrossEntropyLoss was used to calculate the loss between predicted and true labels.
   - **Optimizer:** The Adam optimizer was employed with a learning rate of 0.001.
   - **Early Stopping Mechanism:**
     - Validation accuracy was monitored after each epoch.
     - If validation accuracy did not improve for 3 consecutive epochs (patience = 3), training was terminated early.
     - The best-performing model based on validation accuracy was saved for later evaluation.

5. **Evaluation:**
   - Training loss and validation accuracy were monitored after each epoch to check that the model was learning.
   - The test set was evaluated using the stored best-performing model.

#### **Training Progress**
- **Epoch Results:**
  - Epoch 1: Loss = **1.7734**, Validation Accuracy = **0.7939**
  - Epoch 2: Loss = **1.7501**, Validation Accuracy = **0.7997**
  - Epoch 3: Loss = **1.7313**, Validation Accuracy = **0.8032**
  - Epoch 4: Loss = **1.7182**, Validation Accuracy = **0.8012**
  - Epoch 5: Loss = **1.7084**, Validation Accuracy = **0.8001**
  - Epoch 6: Loss = **1.7009**, Validation Accuracy = **0.7985**
  - Early stopping was triggered after **6 epochs**: validation accuracy peaked at epoch 3 and did not improve during the following 3 epochs (patience = 3).

#### **Key Observations**
- **Validation Accuracy Stabilization:**
  - Validation accuracy plateaued at roughly **80%**, peaking at **80.32%** in epoch 3.
  - After epoch 3, validation accuracy declined slightly while the training loss kept falling. Early stopping ended the run at epoch 6 and kept the epoch-3 checkpoint.

- **Test Set Evaluation:**
  - **Test Accuracy:** **79.92%**
  - **Classification Report:**
    - Macro Average Precision: **0.61**
    - Macro Average Recall: **0.62**
    - Macro Average F1-Score: **0.61**
    - Weighted Average F1-Score: **0.79**
  - Performance is strong on the large classes (F1-score of 0.92 for class 4 and 0.85 for class 9) but poor on the smallest ones: classes 0 and 6 have an F1-score of 0.00.

#### **Conclusion**
The training procedure works end to end and reaches a test accuracy of **79.92%**. Early stopping ended training once validation accuracy stopped improving. The gap between the macro F1-score (0.61) and the weighted F1-score (0.79) shows that the model handles frequent classes well and rare classes poorly. Further improvements could include addressing class imbalance and exploring more advanced architectures.

# Task B: Document Classification with BERT (15 points)

This task aims to conduct the same document classification as Task A, but now by utilizing a pre-trained BERT model. Feel free to reuse any code from the previous task. The implementation of the classifier should cover the points below.

**Loading BERT model (2 points):** Use the `transformers` library from `huggingface` to load a (small) pre-trained BERT model. Select a BERT model according to your available resources. The available models can be found [here](https://huggingface.co/models) and [here](https://github.com/google-research/bert).

**BERT tokenization (3 points):** For training BERT models, we do not need to create a dictionary anymore, as a BERT model already contains an internal subword dictionary. Following the instructions in the `transformers` documentation, tokenize the data using the BERT model's tokenizer.

**Model definition and forward function (5 points):** Define the class **`ClassificationBERTModel`** as a PyTorch model. In the initialization procedure, the model receives the loaded BERT model and stores it as the model's parameter. The parameters of the BERT model should be trainable. The forward function of the model receives a batch of data, passes this batch to BERT, and obtains the corresponding document embeddings from the output of BERT. Similar to the previous task, the document embeddings are used for classification by linearly transforming them to vectors with the size of the number of classes, followed by applying Softmax.

**Training and overall functionality (3 points):** Train the model in a similar fashion to the previous task, namely with the proper loss function, optimization, and early stopping.

**Test Set Evaluation (1 point):** After finishing the training, load the (already stored) best performing model, and use it for class prediction on the test set.

**Reporting (1 point):** Report the results of the best performing model on the validation and test set in a table.

# Import Libraries

In [None]:
import numpy as np  # used by compute_metrics below
import torch
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score

# Load BERT Tokenizer and Model

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(label_mapping))

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
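
The cell above relies on `BertForSequenceClassification`, which already bundles a classification head. The task description instead asks for a custom `ClassificationBERTModel`. The following is a minimal sketch of what such a wrapper could look like, under several assumptions: it uses the final hidden state of the `[CLS]` token as the document embedding, it returns raw logits (so that `nn.CrossEntropyLoss` can be applied directly, with softmax used only when probabilities are needed), and it builds a tiny randomly initialised BERT from a hypothetical `BertConfig` so that it runs without downloading weights. In practice the encoder would come from `BertModel.from_pretrained(...)`.

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel


class ClassificationBERTModel(nn.Module):
    def __init__(self, bert_model, num_classes):
        super().__init__()
        # Stored as a submodule, so the BERT weights are trainable parameters
        self.bert = bert_model
        self.fc = nn.Linear(bert_model.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Document embedding: final hidden state of the first ([CLS]) token
        doc_embeddings = outputs.last_hidden_state[:, 0]  # (batch_size x hidden_size)
        return self.fc(doc_embeddings)                    # (batch_size x num_classes)


# Tiny configuration with made-up sizes, only to keep the sketch self-contained
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64)
model = ClassificationBERTModel(BertModel(config), num_classes=12)
model.eval()

input_ids = torch.randint(0, 100, (4, 10))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    logits = model(input_ids, attention_mask)
    probabilities = torch.softmax(logits, dim=1)

print(logits.shape)
```

Mean pooling over the non-padded token states, as in Task A, would be an equally valid choice of document embedding.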


# Tokenize the Datasets

In [None]:
def tokenize_dataset(dataset):
    return tokenizer(dataset['text'].tolist(), padding='max_length', truncation=True, max_length=512)

train_encodings = tokenize_dataset(trainData)
val_encodings = tokenize_dataset(valData)
test_encodings = tokenize_dataset(testData)

# Create Dataset Classes

In [None]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = Dataset(train_encodings, trainData['label'].tolist())
val_dataset = Dataset(val_encodings, valData['label'].tolist())
test_dataset = Dataset(test_encodings, testData['label'].tolist())

# Set Up Training Arguments

In [None]:
import os
# Ensure the output directories exist
os.makedirs('./logs', exist_ok=True)
os.makedirs('./results', exist_ok=True)
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    learning_rate=2e-5,
    eval_strategy='epoch',
    save_strategy='epoch',
    logging_dir='./logs',
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    save_total_limit=2,
)


# Define the Metrics Function

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {'accuracy': accuracy_score(labels, predictions)}

# Initialize Trainer

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

# Train the Model

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.7463,0.608219,0.825501
2,0.3786,0.528681,0.857473
3,0.3428,0.546909,0.854391


TrainOutput(global_step=2271, training_loss=0.6986200648459607, metrics={'train_runtime': 3715.2561, 'train_samples_per_second': 9.779, 'train_steps_per_second': 0.611, 'total_flos': 9559682889523200.0, 'train_loss': 0.6986200648459607, 'epoch': 3.0})

# Evaluate on Test Set

In [None]:
test_results = trainer.evaluate(test_dataset)
print(f"Test Accuracy: {test_results['eval_accuracy']}")

Test Accuracy: 0.8485549132947977


### 1. Summary of Training and Validation Metrics

| Epoch | Training Loss | Validation Loss | Validation Accuracy |
|-------|---------------|-----------------|---------------------|
| 1     | 0.7463        | 0.6082          | 0.8255              |
| 2     | 0.3786        | 0.5287          | 0.8575              |
| 3     | 0.3428        | 0.5469          | 0.8544              |

The training loss consistently decreases across epochs, indicating that the model is learning from the training data. However, the validation loss shows a slight increase in the third epoch, suggesting potential overfitting. Despite this, the validation accuracy remains stable, indicating that the model generalizes reasonably well.

### 2. Overall Training Metrics

- **Global Step:** 2271
- **Training Loss:** 0.6986
- **Training Runtime:** 3715.26 seconds
- **Train Samples per Second:** 9.78
- **Train Steps per Second:** 0.61
- **Total FLOPs:** 9.5596828895232e+15

These metrics describe the computational cost of the run: at about 0.61 steps per second, the three epochs took roughly 62 minutes.
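
The reported step count and throughput can be reproduced from the dataset size and the training arguments. The sketch below assumes a single device and no gradient accumulation, which the notebook does not state explicitly. The training-set size is taken from the Task A report.

```python
import math

num_train_examples = 12_110   # training-set size from the Task A report
batch_size = 16               # per_device_train_batch_size
num_epochs = 3                # num_train_epochs
warmup_steps = 500            # warmup_steps
train_runtime_s = 3715.2561   # train_runtime from TrainOutput

# The last, smaller batch of each epoch still counts as one optimizer step
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

warmup_fraction = warmup_steps / total_steps
samples_per_second = num_train_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"warmup fraction: {warmup_fraction:.2f}")
print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.3f}")
```

Under these assumptions the total matches the reported global step of 2271, and the 500 warmup steps cover about 22% of training, which is a sizeable share for a three-epoch run.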

### 3. Test Set Performance

- **Test Accuracy:** 0.8486

The model achieves an accuracy of approximately 84.86% on the test set. Because `load_best_model_at_end=True` and `metric_for_best_model='accuracy'` are set, this evaluation uses the checkpoint with the highest validation accuracy (epoch 2, 0.8575).

### 4. Discussion

The model demonstrates satisfactory performance with a test accuracy of 84.86%. However, the slight increase in validation loss during the third epoch suggests that the model might be overfitting. To address this, potential improvements could include:

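- **Early Stopping:** The run above trains for a fixed 3 epochs. Halting training once the validation metric stops improving, for example with the `EarlyStoppingCallback` from `transformers`, would allow more epochs without overfitting.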
- **Regularization Techniques:** Such as dropout or L2 regularization.
- **Hyperparameter Tuning:** Exploring different learning rates or batch sizes.

Additionally, visualizing the training and validation loss and accuracy over epochs would provide a clearer understanding of the model's learning dynamics. Including metrics like precision, recall, and F1-score, especially if classes are imbalanced, would offer a more comprehensive evaluation.

### 5. Conclusion

In conclusion, the BERT-based model effectively classifies documents with a test accuracy of 84.86%. While there are indications of slight overfitting, the model generalizes well. Future work could focus on refining the model to improve generalization and explore additional performance metrics for a more nuanced evaluation.