# Finetuning XLM Roberta for Amharic Text Classification





#### Flow of the notebook

1. Importing Python Libraries and preparing the environment
2. Loading data
3. Preparing the Dataset and Dataloader
4. Creating the Neural Network for Fine Tuning
5. Fine Tuning the Model
6. Validating the Model Performance
7. Saving the model

#### Data Details

The dataset used is the Amharic News Classification Dataset. It was preprocessed following the steps in this [notebook](https://github.com/IsraelAbebe/An-Amharic-News-Text-classification-Dataset/blob/main/Amharic-News-Text-classification-Baseline.ipynb) and saved to a separate CSV file.
Note: one row has a missing category, and that row is dropped as well. The processed article column and the category column are saved to a CSV file, which is loaded into this notebook.

The language model used is XLM-Roberta (you can also use XLM-R large or other models). XLM stands for cross-lingual language model, i.e. the model has been pre-trained on many languages.



### Importing Python Libraries and preparing the environment

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Importing the libraries needed
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import torch
import seaborn as sns
import transformers
import json
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
# from transformers import RobertaModel, RobertaTokenizer
from transformers import AutoTokenizer, XLMRobertaModel
import logging
logging.basicConfig(level=logging.ERROR)

from sklearn.metrics import confusion_matrix, classification_report

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Setting up the device for GPU usage

from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

In [None]:
# load in the Amharic News Dataset preprocessed with steps from the github notebook
# train_df = pd.read_csv('normalised_data_2.csv')

dir = '/content/drive/MyDrive/omdenaEthiopianNLP/'
data_dir = dir + 'data/'
train_path = data_dir + 'normalised_data_2.csv'
train_df = pd.read_csv(train_path)

In [None]:
train_df.shape

(51482, 2)

In [None]:
train_df.head()

Unnamed: 0,article,category
0,ብርሀን ፈይሳየኢትዮጵያ ቦክስ ፌዴሬሽን በየአመቱ የሚያዘጋጀው የክለቦች ቻ...,ስፖርት
1,የአዲስ ዘመን ጋዜጣ ቀደምት ዘገባዎች በእጅጉ ተነባቢ ዛሬም ላገኛቸው በ...,መዝናኛ
2,ቦጋለ አበበየአዲስ አበባ ከተማ አስተዳደር ስፖርት ኮሚሽን ከኢትዮጵያ አረ...,ስፖርት
3,ብርሀን ፈይሳአዲስ አበባ የኢትዮጵያ ፕሪምየር ሊግ በሼር ካምፓኒ እንዲተዳ...,ስፖርት
4,ቦጋለ አበበ የኢትዮጵያ ኦሊምፒክ ኮሚቴ አርባ አምስተኛ መደበኛ ጠቅላላ ጉ...,ስፖርት


In [None]:
# 6 categories (nan removed)
train_df['category'].unique()

array(['ስፖርት', 'መዝናኛ', 'ሀገር አቀፍ ዜና', 'ቢዝነስ', 'ዓለም አቀፍ ዜና', 'ፖለቲካ'],
      dtype=object)

In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51482 entries, 0 to 51481
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   article   51474 non-null  object
 1   category  51482 non-null  object
dtypes: object(2)
memory usage: 804.5+ KB


In [None]:
# apparently there are some articles that are just an empty string
# from the github preprocessing notebook. When an empty string is saved
# and reloaded, it becomes NA as above
train_df = train_df.dropna(subset='article').reset_index(drop=True)
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51474 entries, 0 to 51473
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   article   51474 non-null  object
 1   category  51474 non-null  object
dtypes: object(2)
memory usage: 804.4+ KB


In [None]:
train_df.category = train_df.category.astype('category')
# get mappings between categories and indices
# (the names now match the direction of each mapping)
idx_to_cat = dict(enumerate(train_df['category'].cat.categories))
cat_to_idx = {v: k for k, v in idx_to_cat.items()}
cat_to_idx

{'ሀገር አቀፍ ዜና': 0, 'መዝናኛ': 1, 'ስፖርት': 2, 'ቢዝነስ': 3, 'ዓለም አቀፍ ዜና': 4, 'ፖለቲካ': 5}

In [None]:
# map the categories to indices according to the map above
train_df['label'] = train_df['category'].map(cat_to_idx)
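
As a plain-Python illustration of the mapping above (a sketch, not part of the original pipeline): pandas appears to order string categories lexicographically, so sorting the six category names reproduces the same indices. The names `category_names`, `label_of` and `name_of` are hypothetical.

```python
# The six category names that appear in the dataset.
category_names = ['ስፖርት', 'መዝናኛ', 'ሀገር አቀፍ ዜና', 'ቢዝነስ', 'ዓለም አቀፍ ዜና', 'ፖለቲካ']

# category name -> integer label, assigned in sorted order
label_of = {name: idx for idx, name in enumerate(sorted(category_names))}

# integer label -> category name, useful for decoding predictions later
name_of = {idx: name for name, idx in label_of.items()}

print(label_of)
print(name_of[2])
```

Keeping the inverse mapping around makes it easy to turn a predicted index back into a readable category name at inference time.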

In [None]:
new_df = train_df[['article', 'label']]
new_df

Unnamed: 0,article,label
0,ብርሀን ፈይሳየኢትዮጵያ ቦክስ ፌዴሬሽን በየአመቱ የሚያዘጋጀው የክለቦች ቻ...,2
1,የአዲስ ዘመን ጋዜጣ ቀደምት ዘገባዎች በእጅጉ ተነባቢ ዛሬም ላገኛቸው በ...,1
2,ቦጋለ አበበየአዲስ አበባ ከተማ አስተዳደር ስፖርት ኮሚሽን ከኢትዮጵያ አረ...,2
3,ብርሀን ፈይሳአዲስ አበባ የኢትዮጵያ ፕሪምየር ሊግ በሼር ካምፓኒ እንዲተዳ...,2
4,ቦጋለ አበበ የኢትዮጵያ ኦሊምፒክ ኮሚቴ አርባ አምስተኛ መደበኛ ጠቅላላ ጉ...,2
...,...,...
51469,በ2011 በጀት አመት የተከናወኑ የውጭ ዲፕሎማሲያዊ ተግባራት ስኬታማ እን...,5
51470,አቶ አገኘሁ ተሻገር የአማራ ክልል የሰላም ግንባታና የህዝብ ደህንነት ቢሮ...,5
51471,የአማራ ክልል ምክር ቤት የ230 ዳኞችን ሹመት አፀደቀየአማራ ክልል ምክር...,5
51472,በዘንድሮ በጀት አመት ከ4 ቢሊዮን ችግኝ በላይ ለመትከል እቅድ መያዙ ይታ...,0


In [None]:
new_df.article.str.split().str.len().describe()

count    51474.000000
mean       247.917609
std        241.103201
min          0.000000
25%        102.000000
50%        178.000000
75%        313.000000
max       6738.000000
Name: article, dtype: float64

### Preparing the Dataset and Dataloader

PyTorch's ```Dataset``` lets you use pre-loaded datasets as well as your own data. ```Dataset``` stores the samples and their corresponding labels, and ```DataLoader``` wraps an iterable around the ```Dataset``` to enable easy access to the samples. The ```DataLoader``` feeds the data to the neural network in batches during training. ([Docs](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html))


#### *AmharicData* Dataset Class
- This class accepts the dataframe as input and generates the tokenized output that the model uses for training. 
- The [tokenizer](https://huggingface.co/docs/transformers/model_doc/xlm-roberta#transformers.XLMRobertaTokenizer) tokenizes the data in the `article` column of the dataframe. 
- The tokenizer's `encode_plus` method performs the tokenization and generates the necessary outputs, namely `ids` and `attention_mask`.
- `targets` is the encoded category. 
- The *AmharicData* class is used to create the datasets for training and for validation.


#### Dataloader
- A Dataloader is used to create the training and validation dataloaders, which load data into the neural network in a defined manner. This is needed because the whole dataset cannot be loaded into memory at once, so the amount of data loaded into memory and passed to the neural network has to be controlled.
- This control is achieved using parameters such as `batch_size` and `max_len`.
- The training and validation dataloaders are used in the training and validation parts of the flow respectively.
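
To make the role of `max_len` concrete, here is a minimal plain-Python sketch of what padding and truncating to a fixed length produce. It is a simplification: the real tokenizer also keeps the closing special token when it truncates. The function name `pad_or_truncate`, the token ids, and the default `pad_id=1` (taken from the `padding_idx=1` shown in the model summary further down) are assumptions for illustration.

```python
def pad_or_truncate(token_ids, max_len, pad_id=1):
    """Return (ids, attention_mask), both exactly max_len long."""
    ids = list(token_ids)[:max_len]   # drop tokens beyond max_len
    mask = [1] * len(ids)             # 1 marks a real token
    n_pad = max_len - len(ids)
    ids += [pad_id] * n_pad           # fill the remainder with the padding id
    mask += [0] * n_pad               # 0 marks padding, which attention ignores
    return ids, mask


# A short sequence is padded, a long one is cut off (made-up token ids).
short_ids, short_mask = pad_or_truncate([0, 581, 2], max_len=6)
long_ids, long_mask = pad_or_truncate(range(10), max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up with the same length, the dataloader can stack `batch_size` of them into one rectangular tensor.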

In [None]:
# Defining some key variables that will be used later on in the training
MAX_LEN = 256
TRAIN_BATCH_SIZE = 16
VALID_BATCH_SIZE = 4
# EPOCHS = 1
LEARNING_RATE = 5e-6
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base', truncation=True)

A custom Dataset class must implement three functions: ```__init__```, ```__len__```, and ```__getitem__```.

In [None]:
class AmharicData(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = self.data['article']
        self.targets = self.data['label']
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        # text = " ".join(text.split())

        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            return_token_type_ids=True,
            truncation=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.long)  # class index for CrossEntropyLoss
        }

In [None]:
# stratify split (keep the same % of each class in train & val set)
# set random state so that it is reproducible
train_data, test_data = train_test_split(new_df, test_size=0.2, random_state=0, stratify=new_df['label'])
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)

print("FULL Dataset: {}".format(new_df.shape))
print("TRAIN Dataset: {}".format(train_data.shape))
print("TEST Dataset: {}".format(test_data.shape))

training_set = AmharicData(train_data, tokenizer, MAX_LEN)
testing_set = AmharicData(test_data, tokenizer, MAX_LEN)

FULL Dataset: (51474, 2)
TRAIN Dataset: (41179, 2)
TEST Dataset: (10295, 2)


In [None]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)
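
The split sizes and the number of batches each loader yields can be checked with a little arithmetic. This is a sketch under two assumptions: that the test share is rounded up with the remainder going to training, and that the loaders keep the final partial batch (the `DataLoader` default). All variable names are illustrative.

```python
import math

n_rows = 51474          # rows left after dropping empty articles
test_fraction = 0.2
train_batch_size = 16
valid_batch_size = 4

# Assumption: the test share is rounded up, the rest is used for training.
n_test = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_test

# With drop_last=False the last, smaller batch is still yielded.
train_batches = math.ceil(n_train / train_batch_size)
test_batches = math.ceil(n_test / valid_batch_size)

print(n_train, n_test)
print(train_batches, test_batches)
```

The split sizes agree with the shapes printed above, and the number of test batches agrees with the `2574it` progress counter in the recorded validation output.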

<a id='section04'></a>
### Creating the Neural Network for Fine Tuning

#### Neural Network
 - We will create a neural network with the `XLMRClass`. 
 - This network has the XLMR language model followed by a `Linear` layer with a ReLU, a `dropout` and finally a `Linear` layer that produces the final outputs. 
 - The final layer's outputs are compared with the news category to determine the accuracy of the model's predictions. (The size of the intermediate layer, 128, is chosen arbitrarily; the final layer has one output per category.) 
 - We will create an instance of the network called `model`. This instance is used for training and is then saved as the final trained model for future inference. 
 
#### Loss Function and Optimizer
 - The `Loss Function` and `Optimizer` are defined in the fine-tuning section below.
 - The `Loss Function` is used to calculate the difference between the output produced by the model and the actual output. 
 - The `Optimizer` is used to update the weights of the neural network to improve its performance.
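
As a rough illustration of what the loss measures, the sketch below computes cross-entropy for a single example in plain Python. It is not the PyTorch implementation, which by default averages this quantity over a batch; the helper name `cross_entropy` and the logits are made up.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    peak = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


# A model with no preference among the 6 classes.
uniform_loss = cross_entropy([0.0] * 6, target=2)
# A model that strongly favours the correct class.
confident_loss = cross_entropy([0.0, 0.0, 5.0, 0.0, 0.0, 0.0], target=2)

print(uniform_loss, confident_loss)
```

An untrained six-class classifier should therefore start near ln 6 ≈ 1.79, which is roughly in line with the training loss of about 1.85 logged early in the first epoch.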

In [None]:
### this is a sample of how to load

# tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
# model = XLMRobertaModel.from_pretrained("xlm-roberta-base")

# inputs = tokenizer("ብርሀን ፈይሳየኢትዮጵያ ቦክስ ፌዴሬሽን በየአመቱ የሚያዘጋጀው የክለቦች ", return_tensors="pt")
# outputs = model(**inputs)

In [None]:
class XLMRClass(torch.nn.Module):
    def __init__(self):
        super(XLMRClass, self).__init__()
        # XLMR model
        self.l1 = XLMRobertaModel.from_pretrained("xlm-roberta-base")
        # add a fully connected layer on top
        self.pre_classifier = torch.nn.Linear(768, 128)
        self.dropout = torch.nn.Dropout(0.3)
        # classification layer
        self.classifier = torch.nn.Linear(128, 6)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_state = output_1[0]              # [batch_size, seq_len, 768]
        pooler = hidden_state[:, 0]             # [batch_size, 768]                   
        pooler = self.pre_classifier(pooler)    # [batch_size, 128]
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

In [None]:
model = XLMRClass()
model.to(device)

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaModel: ['lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


XLMRClass(
  (l1): XLMRobertaModel(
    (embeddings): XLMRobertaEmbeddings(
      (word_embeddings): Embedding(250002, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): XLMRobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x XLMRobertaLayer(
          (attention): XLMRobertaAttention(
            (self): XLMRobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): XLMRobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (Laye

<a id='section05'></a>
### Fine Tuning the Model
 
Here we define a training function that trains the model on the training dataset created above for a specified number of epochs (`EPOCHS`). One epoch is one complete pass of the training data through the network. 

The following events happen in this function to fine-tune the neural network:
- The dataloader passes data to the model in batches of the configured batch size. 
- The output from the model and the actual category are compared to calculate the loss. 
- The loss value is used to optimize the weights of the neurons in the network.
- Every 200 steps, the running loss and accuracy are printed to the console.

In [None]:
# Creating the loss function and optimizer
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)
# optimizer = torch.optim.AdamW(params=model.parameters(), lr=LEARNING_RATE)

In [None]:
from transformers.optimization import get_linear_schedule_with_warmup

In [None]:
EPOCHS = 5
t_total = len(training_loader) * EPOCHS
pct_warmup_steps = 0.2
num_warmup_steps = int(t_total * pct_warmup_steps)
# learning rate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
                            optimizer, 
                            num_warmup_steps, 
                            t_total)
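
The shape of this schedule can be sketched in plain Python. The function below is an illustrative re-implementation of a linear warmup followed by a linear decay, not the library code itself; `linear_warmup_decay` is a hypothetical name, and the step counts assume 2574 training batches per epoch (41179 examples at batch size 16).

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)  # ramp up from 0 to 1
    remaining = num_training_steps - step
    # ramp down from 1 to 0 over the remaining steps
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-6
steps_per_epoch = 2574                  # assumed: ceil(41179 / 16)
t_total = steps_per_epoch * 5           # 12870 steps over 5 epochs
num_warmup_steps = int(t_total * 0.2)   # 2574 steps

for step in (0, 1287, 2574, 7722, 12870):
    print(step, base_lr * linear_warmup_decay(step, num_warmup_steps, t_total))
```

With these settings the warmup lasts exactly one epoch: the learning rate only reaches the configured 5e-6 at the end of the first epoch and then decays linearly to zero.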

In [None]:
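def calcuate_accuracy(preds, targets):
    # despite the name, this returns the number of correct predictions in the batch;
    # the callers divide by the number of examples to get the accuracy
    n_correct = (preds == targets).sum().item()
    return n_correct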

In [None]:
class EarlyStopping():
    """
    Early stopping to stop the training when the loss does not improve after
    certain epochs.
    """
    def __init__(self, patience=3, min_delta=0, best_loss=None):
        """
        :param patience: how many epochs to wait before stopping when loss is
               not improving
        :param min_delta: minimum difference between new loss and old loss for
               new loss to be considered as an improvement
        :param best_loss: initial best loss (None means the first loss seen is used)
        """
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        # use the value passed by the caller instead of discarding it
        self.best_loss = best_loss
        self.early_stop = False

    def __call__(self, val_loss):

        if self.best_loss is None:
            self.best_loss = val_loss

        elif self.best_loss - val_loss > self.min_delta:
            self.best_loss = val_loss
            # reset counter if validation loss improves
            self.counter = 0

        else:
            # no sufficient improvement (also covers a difference exactly equal to min_delta)
            self.counter += 1
            print(f"INFO: Early stopping counter {self.counter} of {self.patience}")
            if self.counter >= self.patience:
                print('INFO: Early stopping')
                self.early_stop = True
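
To see how patience plays out, the following standalone sketch replays the same counting rule over a made-up sequence of validation losses. The helper `first_stop_epoch` is hypothetical and exists only for this illustration.

```python
def first_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which early stopping triggers, or None."""
    best_loss = None
    counter = 0
    for epoch, loss in enumerate(val_losses):
        if best_loss is None or best_loss - loss > min_delta:
            best_loss = loss   # improvement: remember it and reset the counter
            counter = 0
        else:
            counter += 1       # no sufficient improvement this epoch
            if counter >= patience:
                return epoch
    return None


losses = [0.50, 0.40, 0.42, 0.41, 0.43]  # made-up validation losses
print(first_stop_epoch(losses, patience=1))
print(first_stop_epoch(losses, patience=2))
print(first_stop_epoch(losses, patience=5))
```

Note that the loss at epoch 3 (0.41) is lower than at epoch 2 but still above the best value (0.40), so it counts as a non-improving epoch.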

In [None]:
def valid(model, testing_loader):
    # do not update model weights
    model.eval()
    n_correct = 0; tr_loss = 0; nb_tr_steps = 0; nb_tr_examples = 0

    preds = []
    labels = []

    # we don't do gradient descent
    with torch.no_grad():
        for _, data in tqdm(enumerate(testing_loader, 0)):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
            targets = data['targets'].to(device, dtype = torch.long)
            # get prediction
            # no squeeze(): it would drop the batch dimension for a batch of size 1
            outputs = model(ids, mask, token_type_ids)
            # get loss
            loss = loss_function(outputs, targets)
            tr_loss += loss.item()
            big_val, big_idx = torch.max(outputs.data, dim=1)
            # get accuracy
            n_correct += calcuate_accuracy(big_idx, targets)

            nb_tr_steps += 1
            nb_tr_examples+=targets.size(0)

            preds.extend(big_idx.cpu().numpy())
            labels.extend(targets.cpu().numpy())
            
            if _%1000==0:
                loss_step = tr_loss/nb_tr_steps
                accu_step = (n_correct*100)/nb_tr_examples
                print(f"Validation Loss per 1000 steps: {loss_step}")
                print(f"Validation Accuracy per 1000 steps: {accu_step}")

        
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Validation Loss Epoch: {epoch_loss}")
    print(f"Validation Accuracy Epoch: {epoch_accu}")

    return epoch_loss, epoch_accu, preds, labels


In [None]:
# Defining the training function on the train dataset

def train_model(EPOCHS):
    
    min_val_loss = float('inf')
    estop = EarlyStopping(best_loss=min_val_loss, patience=1, min_delta=0)

    num_steps = 0

    for epoch in range(EPOCHS):
        print(f"epoch: {epoch}")

        model.train()

        tr_loss = 0
        n_correct = 0
        nb_tr_steps = 0
        nb_tr_examples = 0
        
        for _, data in tqdm(enumerate(training_loader, 0)):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.long)

            # forward pass through the model
            outputs = model(ids, mask, token_type_ids)
            # calculate loss
            loss = loss_function(outputs, targets)
            tr_loss += loss.item()
            # to get prediction accuracy
            big_val, big_idx = torch.max(outputs.data, dim=1)
            n_correct += calcuate_accuracy(big_idx, targets)

            nb_tr_steps += 1
            nb_tr_examples+=targets.size(0)
            
            if _ > 0 and _%200==0:
                loss_step = tr_loss/nb_tr_steps
                accu_step = (n_correct*100)/nb_tr_examples 
                print(f"Training Loss per 200 steps: {loss_step}")
                print(f"Training Accuracy per 200 steps: {accu_step}")

            # back prop
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            num_steps += 1

        print(f'The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}')
        epoch_loss = tr_loss/nb_tr_steps
        epoch_accu = (n_correct*100)/nb_tr_examples
        print(f"Training Loss Epoch: {epoch_loss}")
        print(f"Training Accuracy Epoch: {epoch_accu}")

        # evaluate model
        print("evaluating model")
        eval_loss, acc, test_preds, test_labels = valid(model, testing_loader)
        # save best model
        # but continue training till end
        estop(eval_loss)
        if eval_loss <= estop.best_loss:
            print("Saving model...")
            torch.save(model, f"{dir}model6_{epoch}")
        else:
            if EPOCHS == 1:
                print("Saving model coz only one epoch...")
                torch.save(model, f"{dir}model6_{epoch}")
            else:
                print("loss up")
                if estop.early_stop:
                    break
    
    # note that we are returning the last model here
    # which is not necessarily the best
    return model

In [None]:
model = train_model(EPOCHS)

epoch: 0


200it [02:28,  1.33it/s]

Training Loss per 200 steps: 1.849469919109819
Training Accuracy per 200 steps: 13.059701492537313


279it [03:27,  1.34it/s]

In [None]:
import gc 
gc.collect()

0

<a id='section06'></a>
### Model evaluation/ validation

During the validation stage we pass the unseen data (the testing dataset) to the model. This step determines how well the model performs on unseen data. 

This unseen data was separated out during the dataset creation stage. 
During the validation stage the weights of the model are not updated. Only the final output is compared to the actual value. This comparison is then used to calculate the accuracy of the model. 

In [None]:
best_model = torch.load('/content/drive/MyDrive/omdenaEthiopianNLP/model3_2')
val_loss, acc, test_preds, test_labels = valid(best_model, testing_loader)
print("Accuracy on test data = %0.2f%%" % acc)

2it [00:00,  8.24it/s]

Validation Loss per 1000 steps: 0.7176218032836914
Validation Accuracy per 1000 steps: 75.0


1003it [01:07, 14.49it/s]

Validation Loss per 1000 steps: 0.36253464922958334
Validation Accuracy per 1000 steps: 86.7132867132867


2004it [02:16, 15.42it/s]

Validation Loss per 1000 steps: 0.37871530492078403
Validation Accuracy per 1000 steps: 85.84457771114442


2574it [02:54, 14.71it/s]


Validation Loss Epoch: 0.3788331244441173
Validation Accuracy Epoch: 85.69208353569694
Accuracy on test data = 85.69%


Since the classes are imbalanced, we also look at precision and recall for each class.

In [None]:
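print("validation set")

# NOTE: sklearn's signature is (y_true, y_pred), but predictions are passed first here.
# In the output below, rows are therefore predicted classes, support counts predictions,
# and the precision and recall columns are swapped relative to the usual convention.
print(confusion_matrix(test_preds, test_labels))
print(classification_report(test_preds, test_labels))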

validation set
[[3585   21    4  145  110  313]
 [  26   99    0    6    7    4]
 [  39    3 2073    8    8   24]
 [ 199    4    0  595    9  180]
 [  95    0    3    4 1150   24]
 [ 191    0    2   20   24 1320]]
              precision    recall  f1-score   support

           0       0.87      0.86      0.86      4178
           1       0.78      0.70      0.74       142
           2       1.00      0.96      0.98      2155
           3       0.76      0.60      0.67       987
           4       0.88      0.90      0.89      1276
           5       0.71      0.85      0.77      1557

    accuracy                           0.86     10295
   macro avg       0.83      0.81      0.82     10295
weighted avg       0.86      0.86      0.86     10295
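
The per-class figures can be recomputed by hand from the confusion matrix. The sketch below is plain Python and assumes, based on the call `confusion_matrix(test_preds, test_labels)`, that rows are predicted classes and columns are true classes; under that reading, precision is a row-wise ratio and recall a column-wise ratio. Variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Assumed orientation: rows = predicted class, columns = true class.
cm = [
    [3585,   21,    4,  145,  110,  313],
    [  26,   99,    0,    6,    7,    4],
    [  39,    3, 2073,    8,    8,   24],
    [ 199,    4,    0,  595,    9,  180],
    [  95,    0,    3,    4, 1150,   24],
    [ 191,    0,    2,   20,   24, 1320],
]
n_classes = len(cm)

predicted_totals = [sum(row) for row in cm]  # row sums
true_totals = [sum(cm[r][c] for r in range(n_classes)) for c in range(n_classes)]  # column sums

for c in range(n_classes):
    precision = cm[c][c] / predicted_totals[c]  # correct / everything predicted as c
    recall = cm[c][c] / true_totals[c]          # correct / everything that truly is c
    print(c, round(precision, 2), round(recall, 2))

accuracy = sum(cm[c][c] for c in range(n_classes)) / sum(predicted_totals)
print(round(accuracy, 4))
```

Under this reading, class 1 has a precision of about 0.70 and a recall of about 0.78, the reverse of the column labels in the printed report, while the overall accuracy is unaffected by the orientation.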



In [None]:
# valid() returns (loss, accuracy, predictions, labels)
_, acc, train_preds, train_labels = valid(model, training_loader)
print("Accuracy on train data = %0.2f%%" % acc)

print("train set")

# sklearn expects (y_true, y_pred)
print(confusion_matrix(train_labels, train_preds))
print(classification_report(train_labels, train_preds))

<a id='section07'></a>
### Saving the Trained Model 

In [None]:
output_model_file = dir + 'model1'
output_vocab_file = dir

model_to_save = model
torch.save(model_to_save, output_model_file)
tokenizer.save_vocabulary(output_vocab_file)

print('All files saved')

All files saved


In [None]:
# sample to reload the model to make sure that it works
model2 = torch.load(output_model_file)
# valid() returns (loss, accuracy, predictions, labels)
_, acc, _, _ = valid(model2, testing_loader)
print("Accuracy on test data = %0.2f%%" % acc)

3it [00:00,  9.95it/s]

Validation Loss per 1000 steps: 2.6054577827453613
Validation Accuracy per 1000 steps: 25.0


1003it [01:14, 12.86it/s]

Validation Loss per 1000 steps: 0.46177849973543555
Validation Accuracy per 1000 steps: 82.31768231768231


2003it [02:26, 12.46it/s]

Validation Loss per 1000 steps: 0.4586494524340922
Validation Accuracy per 1000 steps: 82.27136431784108


2574it [03:05, 13.88it/s]

Validation Loss Epoch: 0.4612638504724159
Validation Accuracy Epoch: 82.10781932977173
Accuracy on test data = 82.11%





Note: this is just a sample flow of fine-tuning XLM-Roberta for text classification. Some of the choices for hyperparameters are quite arbitrary and I have not experimented with different settings. We should be able to get better results with more tuning.

Things we can potentially do:
- hyperparameter tuning for LR, epochs, network architecture for classification layer
- weighted CELoss?
- use different base models
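
For the weighted cross-entropy idea, one common starting point is inverse-frequency ("balanced") class weights. The sketch below is only an illustration of that heuristic, not something that was tried in this notebook. It uses the true-class counts of the test split, read off as the column sums of the confusion matrix above, and assumes the stratified training split has the same class proportions.

```python
# True-class counts in the 10,295-article test split (column sums of the
# confusion matrix). Assumption: the stratified train split has the same proportions.
class_counts = [4135, 127, 2082, 778, 1308, 1865]

n_samples = sum(class_counts)
n_classes = len(class_counts)

# Inverse-frequency weights: the rarer the class, the larger its weight.
class_weights = [n_samples / (n_classes * count) for count in class_counts]

for label, weight in enumerate(class_weights):
    print(label, round(weight, 2))

# These values could then be passed as a float tensor to
# torch.nn.CrossEntropyLoss(weight=...) in place of the unweighted loss.
```

Whether such weights actually help would need to be checked against the per-class precision and recall, since very large weights on a tiny class can hurt the majority classes.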

# Results


model1: max_len=256, LR=5e-6 (no LR scheduler), epochs=1+ (stopped halfway), train_acc = 82.1%, val_acc = 83.3%

model2: max_len=256, LR=5e-6 (LR scheduler), epochs=2, train_acc = 81.6%, val_acc = 82.6%

model3: max_len=256, LR=1e-5 (LR scheduler), epochs=5 (best at 2), train_acc = 90.0%, val_acc = 85.7%

model4: max_len=128, LR=1e-5 (LR scheduler), epochs=5 (best at 4), train_acc = 90.4%, val_acc = 85.3%

model5: max_len=256, LR=1e-5 (LR scheduler), epochs=5, classifier_hidden_size=256, train_acc = 91.5%, val_acc = 86.3%; at epoch=4, train_acc = 90.5%, val_acc = 86.1%


# Old code

In [None]:
# Defining the training function on the train dataset

def train(epoch):
    tr_loss = 0
    n_correct = 0
    nb_tr_steps = 0
    nb_tr_examples = 0
    model.train()
    for _, data in tqdm(enumerate(training_loader, 0)):
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.long)

        # forward pass through the model
        outputs = model(ids, mask, token_type_ids)
        # calculate loss
        loss = loss_function(outputs, targets)
        tr_loss += loss.item()
        # to get prediction accuracy
        big_val, big_idx = torch.max(outputs.data, dim=1)
        n_correct += calcuate_accuracy(big_idx, targets)

        nb_tr_steps += 1
        nb_tr_examples+=targets.size(0)
        
        if _%200==0:
            loss_step = tr_loss/nb_tr_steps
            accu_step = (n_correct*100)/nb_tr_examples 
            print(f"Training Loss per 200 steps: {loss_step}")
            print(f"Training Accuracy per 200 steps: {accu_step}")

        # back prop
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}')
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Training Loss Epoch: {epoch_loss}")
    print(f"Training Accuracy Epoch: {epoch_accu}")

    return model