<a href="https://colab.research.google.com/github/epcl2/omdenaEthiopiaNLP/blob/master/XLMR_for_Amharic_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning XLM-RoBERTa for Amharic Text Classification





#### Flow of the notebook

1. Importing Python Libraries and preparing the environment
2. Loading data
3. Preparing the Dataset and Dataloader
4. Creating the Neural Network for Fine Tuning
5. Fine Tuning the Model
6. Validating the Model Performance
7. Saving the model

#### Data Details

The dataset is the Amharic News Classification Dataset. It was preprocessed following the steps in the [baseline notebook](https://github.com/IsraelAbebe/An-Amharic-News-Text-classification-Dataset/blob/main/Amharic-News-Text-classification-Baseline.ipynb), and the processed `article` and `category` columns were saved to a new CSV file, which is loaded into this notebook.
Note: one row has a missing category; that row was dropped as well.

The language model used is XLM-RoBERTa base (you can also use XLM-R large or other models). XLM stands for cross-lingual language model, i.e. the model has been pre-trained on text in many languages.



### Importing Python Libraries and preparing the environment

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Importing the libraries needed
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import torch
import seaborn as sns
import transformers
import json
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
# from transformers import RobertaModel, RobertaTokenizer
from transformers import AutoTokenizer, XLMRobertaModel
import logging
logging.basicConfig(level=logging.ERROR)

from sklearn.metrics import confusion_matrix, classification_report

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Setting up the device for GPU usage

from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

In [None]:
# load in the Amharic News Dataset preprocessed with steps from the github notebook
# train_df = pd.read_csv('normalised_data_2.csv')

dir = '/content/drive/MyDrive/omdenaEthiopianNLP/'
data_dir = dir + 'data/'
train_path = data_dir + 'normalised_data_2.csv'
train_df = pd.read_csv(train_path)

In [None]:
train_df.shape

(51482, 2)

In [None]:
train_df.head()

| | article | category |
|---|---|---|
| 0 | ብርሀን ፈይሳየኢትዮጵያ ቦክስ ፌዴሬሽን በየአመቱ የሚያዘጋጀው የክለቦች ቻ... | ስፖርት |
| 1 | የአዲስ ዘመን ጋዜጣ ቀደምት ዘገባዎች በእጅጉ ተነባቢ ዛሬም ላገኛቸው በ... | መዝናኛ |
| 2 | ቦጋለ አበበየአዲስ አበባ ከተማ አስተዳደር ስፖርት ኮሚሽን ከኢትዮጵያ አረ... | ስፖርት |
| 3 | ብርሀን ፈይሳአዲስ አበባ የኢትዮጵያ ፕሪምየር ሊግ በሼር ካምፓኒ እንዲተዳ... | ስፖርት |
| 4 | ቦጋለ አበበ የኢትዮጵያ ኦሊምፒክ ኮሚቴ አርባ አምስተኛ መደበኛ ጠቅላላ ጉ... | ስፖርት |


In [None]:
# 6 categories (nan removed)
train_df['category'].unique()

array(['ስፖርት', 'መዝናኛ', 'ሀገር አቀፍ ዜና', 'ቢዝነስ', 'ዓለም አቀፍ ዜና', 'ፖለቲካ'],
      dtype=object)

In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51482 entries, 0 to 51481
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   article   51474 non-null  object
 1   category  51482 non-null  object
dtypes: object(2)
memory usage: 804.5+ KB


In [None]:
# Some articles are empty strings after the preprocessing in the GitHub notebook.
# An empty string saved to CSV is read back as NaN, hence the 8 missing
# values in the `article` column above.
train_df = train_df.dropna(subset=['article']).reset_index(drop=True)
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51474 entries, 0 to 51473
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   article   51474 non-null  object
 1   category  51474 non-null  object
dtypes: object(2)
memory usage: 804.4+ KB


In [None]:
train_df.category = train_df.category.astype('category')
# get mappings between categories and indices
idx_to_cat = dict(enumerate(train_df['category'].cat.categories))
cat_to_idx = {v: k for k, v in idx_to_cat.items()}
cat_to_idx

{'ሀገር አቀፍ ዜና': 0, 'መዝናኛ': 1, 'ስፖርት': 2, 'ቢዝነስ': 3, 'ዓለም አቀፍ ዜና': 4, 'ፖለቲካ': 5}

In [None]:
# map the categories to indices according to the map above
train_df['label'] = train_df['category'].map(cat_to_idx)

In [None]:
new_df = train_df[['article', 'label']]
new_df

| | article | label |
|---|---|---|
| 0 | ብርሀን ፈይሳየኢትዮጵያ ቦክስ ፌዴሬሽን በየአመቱ የሚያዘጋጀው የክለቦች ቻ... | 2 |
| 1 | የአዲስ ዘመን ጋዜጣ ቀደምት ዘገባዎች በእጅጉ ተነባቢ ዛሬም ላገኛቸው በ... | 1 |
| 2 | ቦጋለ አበበየአዲስ አበባ ከተማ አስተዳደር ስፖርት ኮሚሽን ከኢትዮጵያ አረ... | 2 |
| 3 | ብርሀን ፈይሳአዲስ አበባ የኢትዮጵያ ፕሪምየር ሊግ በሼር ካምፓኒ እንዲተዳ... | 2 |
| 4 | ቦጋለ አበበ የኢትዮጵያ ኦሊምፒክ ኮሚቴ አርባ አምስተኛ መደበኛ ጠቅላላ ጉ... | 2 |
| ... | ... | ... |
| 51469 | በ2011 በጀት አመት የተከናወኑ የውጭ ዲፕሎማሲያዊ ተግባራት ስኬታማ እን... | 5 |
| 51470 | አቶ አገኘሁ ተሻገር የአማራ ክልል የሰላም ግንባታና የህዝብ ደህንነት ቢሮ... | 5 |
| 51471 | የአማራ ክልል ምክር ቤት የ230 ዳኞችን ሹመት አፀደቀየአማራ ክልል ምክር... | 5 |
| 51472 | በዘንድሮ በጀት አመት ከ4 ቢሊዮን ችግኝ በላይ ለመትከል እቅድ መያዙ ይታ... | 0 |


### Preparing the Dataset and Dataloader

A PyTorch `Dataset` stores the samples and their corresponding labels, and works with pre-loaded datasets as well as your own data. A `DataLoader` wraps an iterable around the `Dataset` and feeds the samples to the neural network in batches during training. ([Docs](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html))
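
To make the division of labour concrete, here is a minimal pure-Python sketch of the protocol. `ToyDataset` and `iterate_batches` are hypothetical names used only for illustration; the real `DataLoader` additionally collates samples into tensors and can use worker processes.

```python
import random


class ToyDataset:
    """Map-style dataset: it has a length and can be indexed."""

    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):
        return {"text": self.texts[index], "label": self.labels[index]}


def iterate_batches(dataset, batch_size, shuffle=False, seed=0):
    """Yield lists of samples, roughly what a DataLoader does before collating."""
    order = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [dataset[i] for i in order[start:start + batch_size]]


toy = ToyDataset(["a", "b", "c", "d", "e"], [0, 1, 0, 2, 1])
batches = list(iterate_batches(toy, batch_size=2))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)  # [2, 2, 1]
```

The last batch is smaller whenever the dataset size is not a multiple of the batch size.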


#### *AmharicData* Dataset Class
- This class accepts the dataframe as input and generates the tokenized output that the model uses for training.
- The [tokenizer](https://huggingface.co/docs/transformers/model_doc/xlm-roberta#transformers.XLMRobertaTokenizer) tokenizes the text in the `article` column of the dataframe.
- The tokenizer's `encode_plus` method performs the tokenization and returns `input_ids`, `attention_mask` and `token_type_ids`, which the class exposes as `ids`, `mask` and `token_type_ids`.
- `targets` is the encoded category (the integer `label`).
- The *AmharicData* class is used to create the datasets for training and for validation.
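
The following sketch shows what padding and truncation to a fixed length do to one sequence, and how the attention mask marks real tokens versus padding. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and the special-token ids are assumptions (`PAD_ID = 1` agrees with `padding_idx=1` in the printed XLM-R embedding layer).

```python
PAD_ID = 1              # assumed pad id, consistent with padding_idx=1 in the model printout
BOS_ID, EOS_ID = 0, 2   # assumed ids for the <s> and </s> special tokens


def pad_or_truncate(token_ids, max_len):
    """Return (ids, mask), each exactly max_len long, wrapped in <s> ... </s>."""
    body = token_ids[: max_len - 2]      # leave room for the two special tokens
    ids = [BOS_ID] + body + [EOS_ID]
    mask = [1] * len(ids)                # 1 = real token
    n_pad = max_len - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad   # 0 = padding


short_ids, short_mask = pad_or_truncate([101, 102, 103], max_len=8)
long_ids, long_mask = pad_or_truncate(list(range(100, 120)), max_len=8)

print(short_ids, short_mask)  # [0, 101, 102, 103, 2, 1, 1, 1] [1, 1, 1, 1, 1, 0, 0, 0]
print(long_ids, long_mask)    # [0, 100, 101, 102, 103, 104, 105, 2] [1, 1, 1, 1, 1, 1, 1, 1]
```

With `MAX_LEN = 256`, articles longer than 254 sub-word tokens lose their tail, so the classifier only sees the beginning of long articles.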


#### Dataloader
- A `DataLoader` is created for training and another for validation; each loads data into the neural network in a defined manner. This is needed because the whole dataset cannot be loaded into memory at once, so the amount of data loaded into memory and passed to the neural network at each step has to be controlled.
- This control comes from `batch_size` (samples per step, set on the `DataLoader`) and `max_len` (tokens per sample, set on the dataset).
- The training and validation dataloaders are used in the training and validation parts of the flow respectively.
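
As a quick sanity check on these parameters, the number of steps per epoch follows directly from the dataset size and the batch size. The sketch below uses the split sizes reported later in this notebook (41,179 training rows and 10,295 test rows) and the batch sizes 16 and 4; `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)


train_steps = steps_per_epoch(41179, 16)
test_steps = steps_per_epoch(10295, 4)
last_train_batch = 41179 - (train_steps - 1) * 16

print(train_steps, test_steps, last_train_batch)  # 2574 2574 11
```

Both loaders happen to need 2,574 steps, which agrees with the `2574it` counters in the progress bars of the recorded runs.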

In [None]:
# Defining some key variables that will be used later on in the training
MAX_LEN = 256
TRAIN_BATCH_SIZE = 16
VALID_BATCH_SIZE = 4
# EPOCHS = 1
LEARNING_RATE = 5e-6 #1e-05
# truncation is requested per call inside the Dataset class, not when loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')

A custom Dataset class must implement three functions: `__init__`, `__len__`, and `__getitem__`.

In [None]:
class AmharicData(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = self.data['article']
        self.targets = self.data['label']
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        # text = " ".join(text.split())

        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            return_token_type_ids=True,
            truncation=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.long)  # class index, as CrossEntropyLoss expects
        }

In [None]:
# stratify split (keep the same % of each class in train & val set)
# set random state so that it is reproducible
train_data, test_data = train_test_split(new_df, test_size=0.2, random_state=0, stratify=new_df['label'])
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)

print("FULL Dataset: {}".format(new_df.shape))
print("TRAIN Dataset: {}".format(train_data.shape))
print("TEST Dataset: {}".format(test_data.shape))

training_set = AmharicData(train_data, tokenizer, MAX_LEN)
testing_set = AmharicData(test_data, tokenizer, MAX_LEN)

FULL Dataset: (51474, 2)
TRAIN Dataset: (41179, 2)
TEST Dataset: (10295, 2)


In [None]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

<a id='section04'></a>
### Creating the Neural Network for Fine Tuning

#### Neural Network
 - We will create the neural network as the `XLMRClass` class.
 - The network consists of the XLM-R language model followed by a `Linear` layer with a ReLU activation, a `dropout` layer and a final `Linear` classification layer that produces the outputs.
 - The outputs of the final layer are compared to the news category to determine the accuracy of the model's predictions. (The size of the intermediate `Linear` layer is chosen arbitrarily.)
 - We will create an instance of the network called `model`. This instance is used for training and is then saved for future inference.
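
To get a feel for how small the added head is compared with the pre-trained encoder, the sketch below counts its trainable parameters. The sizes 768, 128 and 6 are the ones used by `XLMRClass` in this notebook; `linear_params` is a hypothetical helper.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


HIDDEN_SIZE = 768   # width of the XLM-R base output vectors
PRE_SIZE = 128      # intermediate layer size (an arbitrary choice)
NUM_CLASSES = 6     # news categories

pre_classifier_params = linear_params(HIDDEN_SIZE, PRE_SIZE)
classifier_params = linear_params(PRE_SIZE, NUM_CLASSES)
head_params = pre_classifier_params + classifier_params

print(pre_classifier_params, classifier_params, head_params)  # 98432 774 99206
```

ReLU and dropout add no parameters, so roughly 0.1 million new weights are trained from scratch while the rest are fine-tuned from the pre-trained checkpoint.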
 
#### Loss Function and Optimizer
 - The `Loss Function` and `Optimizer` are defined in the fine-tuning section below, just before the training loop.
 - The `Loss Function` is used to calculate the difference between the output produced by the model and the actual output.
 - The `Optimizer` is used to update the weights of the neural network to improve its performance.
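
For a single example, the cross-entropy loss used in this notebook is the negative log-probability that the softmax of the model outputs assigns to the true class. The sketch below computes it in plain Python; `cross_entropy` is a hypothetical helper and the logits are made up.

```python
import math


def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed with the log-sum-exp trick for stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# A model that cannot tell the 6 classes apart scores them equally: loss = ln(6).
uniform_loss = cross_entropy([0.0] * 6, target=2)

# A model that strongly prefers the correct class gets a loss close to zero.
confident_loss = cross_entropy([0.0, 0.0, 5.0, 0.0, 0.0, 0.0], target=2)

print(round(uniform_loss, 4))    # 1.7918
print(round(confident_loss, 4))  # 0.0331
```

A loss near ln(6) ≈ 1.79 at the very start of training is therefore what one would expect from a freshly initialised 6-class head.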

In [None]:
# this is a sample of how to load
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = XLMRobertaModel.from_pretrained("xlm-roberta-base")

inputs = tokenizer("ብርሀን ፈይሳየኢትዮጵያ ቦክስ ፌዴሬሽን በየአመቱ የሚያዘጋጀው የክለቦች ", return_tensors="pt")
outputs = model(**inputs)

Downloading pytorch_model.bin:   0%|          | 0.00/1.12G [00:00<?, ?B/s]

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
class XLMRClass(torch.nn.Module):
    def __init__(self):
        super(XLMRClass, self).__init__()
        # XLMR model
        self.l1 = XLMRobertaModel.from_pretrained("xlm-roberta-base")
        # add a fully connected layer on top
        self.pre_classifier = torch.nn.Linear(768, 128)
        self.dropout = torch.nn.Dropout(0.3)
        # classification layer
        self.classifier = torch.nn.Linear(128, 6)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_state = output_1[0]              # [batch_size, seq_len, 768]
        pooler = hidden_state[:, 0]             # [batch_size, 768]                   
        pooler = self.pre_classifier(pooler)    # [batch_size, 128]
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

In [None]:
model = XLMRClass()
model.to(device)

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaModel: ['lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


XLMRClass(
  (l1): XLMRobertaModel(
    (embeddings): XLMRobertaEmbeddings(
      (word_embeddings): Embedding(250002, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): XLMRobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x XLMRobertaLayer(
          (attention): XLMRobertaAttention(
            (self): XLMRobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): XLMRobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (Laye

<a id='section05'></a>
### Fine Tuning the Model
 
Here we define a training function that trains the model on the training dataset created above for a specified number of epochs (`EPOCHS`). One epoch is one complete pass of the training data through the network.

The following events happen in this function to fine-tune the neural network:
- The dataloader passes data to the model in batches of the configured batch size.
- The output from the model and the actual category are compared to calculate the loss.
- The loss value is used to update the weights of the network.
- Every 200 steps the running loss and accuracy are printed to the console.
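
The loss and accuracy printed during training are running averages over all batches seen so far in the epoch, not values for the latest batch alone. A minimal sketch of that bookkeeping, with made-up per-batch numbers and a hypothetical `running_metrics` helper:

```python
def running_metrics(batch_losses, batch_correct, batch_sizes):
    """Yield (mean loss so far, accuracy in % so far) after each batch."""
    total_loss = 0.0
    n_correct = 0
    n_seen = 0
    triples = zip(batch_losses, batch_correct, batch_sizes)
    for step, (loss, correct, size) in enumerate(triples, start=1):
        total_loss += loss
        n_correct += correct
        n_seen += size
        yield total_loss / step, 100.0 * n_correct / n_seen


# Three hypothetical batches of 16 examples each.
history = list(running_metrics([1.7, 1.5, 1.3], [5, 8, 10], [16, 16, 16]))

first_loss, first_acc = history[0]
last_loss, last_acc = history[-1]
print(first_acc)           # 31.25  (5 correct out of 16)
print(round(last_acc, 2))  # 47.92  (23 correct out of 48)
```

Because early, poorly fitted batches stay in the average, the running numbers lag behind the model's current performance within an epoch.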

In [None]:
# Creating the loss function and optimizer
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)
# optimizer = torch.optim.AdamW(params=model.parameters(), lr=LEARNING_RATE)

In [None]:
def calcuate_accuracy(preds, targets):
    # returns the number of correct predictions in the batch (a count, not a percentage)
    n_correct = (preds == targets).sum().item()
    return n_correct

In [None]:
# Defining the training function on the train dataset

def train(epoch):
    tr_loss = 0
    n_correct = 0
    nb_tr_steps = 0
    nb_tr_examples = 0
    model.train()
    for _, data in tqdm(enumerate(training_loader, 0)):
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.long)

        # forward pass through the model
        outputs = model(ids, mask, token_type_ids)
        # calculate loss
        loss = loss_function(outputs, targets)
        tr_loss += loss.item()
        # to get prediction accuracy
        big_val, big_idx = torch.max(outputs.data, dim=1)
        n_correct += calcuate_accuracy(big_idx, targets)

        nb_tr_steps += 1
        nb_tr_examples+=targets.size(0)
        
        if _%200==0:
            loss_step = tr_loss/nb_tr_steps
            accu_step = (n_correct*100)/nb_tr_examples 
            print(f"Training Loss per 200 steps: {loss_step}")
            print(f"Training Accuracy per 200 steps: {accu_step}")

        # back prop
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}')
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Training Loss Epoch: {epoch_loss}")
    print(f"Training Accuracy Epoch: {epoch_accu}")

    return model

In [None]:
EPOCHS = 2
for epoch in range(EPOCHS):
    model = train(epoch)



Training Loss per 200 steps: 1.7176302671432495
Training Accuracy per 200 steps: 31.25


200it [02:42,  1.25it/s]

Training Loss per 200 steps: 1.3745471580111566
Training Accuracy per 200 steps: 50.96393034825871


400it [05:23,  1.24it/s]

Training Loss per 200 steps: 1.2122218044024156
Training Accuracy per 200 steps: 56.37468827930174


600it [08:03,  1.24it/s]

Training Loss per 200 steps: 1.0911903603501407
Training Accuracy per 200 steps: 61.044093178036604


800it [10:44,  1.25it/s]

Training Loss per 200 steps: 1.0067466441983737
Training Accuracy per 200 steps: 64.36485642946317


1000it [13:25,  1.24it/s]

Training Loss per 200 steps: 0.9427728222145305
Training Accuracy per 200 steps: 66.5084915084915


1200it [16:05,  1.24it/s]

Training Loss per 200 steps: 0.8964338474949631
Training Accuracy per 200 steps: 68.16194837635304


1400it [18:46,  1.24it/s]

Training Loss per 200 steps: 0.86088426900412
Training Accuracy per 200 steps: 69.29871520342613


1600it [21:27,  1.24it/s]

Training Loss per 200 steps: 0.8268191566845539
Training Accuracy per 200 steps: 70.35056214865709


1800it [24:07,  1.25it/s]

Training Loss per 200 steps: 0.8024712898445289
Training Accuracy per 200 steps: 71.07856746252082


2000it [26:48,  1.26it/s]

Training Loss per 200 steps: 0.7782853730078758
Training Accuracy per 200 steps: 71.88905547226386


2200it [29:29,  1.25it/s]

Training Loss per 200 steps: 0.7579267015346394
Training Accuracy per 200 steps: 72.54373012267152


2400it [32:09,  1.24it/s]

Training Loss per 200 steps: 0.7387943864147993
Training Accuracy per 200 steps: 73.21688879633486


2574it [34:28,  1.24it/s]


The Total Accuracy for Epoch 0: 73.62490589863765
Training Loss Epoch: 0.7250420163532468
Training Accuracy Epoch: 73.62490589863765


0it [00:00, ?it/s]

Training Loss per 200 steps: 0.1657491773366928
Training Accuracy per 200 steps: 100.0


200it [02:40,  1.24it/s]

Training Loss per 200 steps: 0.49785100954089
Training Accuracy per 200 steps: 81.06343283582089


400it [05:21,  1.25it/s]

Training Loss per 200 steps: 0.4876916906111258
Training Accuracy per 200 steps: 81.51496259351622


600it [08:02,  1.24it/s]

Training Loss per 200 steps: 0.48663421440094756
Training Accuracy per 200 steps: 81.62437603993344


800it [10:42,  1.25it/s]

Training Loss per 200 steps: 0.4874851251307052
Training Accuracy per 200 steps: 81.68695380774032


1000it [13:23,  1.24it/s]

Training Loss per 200 steps: 0.49145743490515886
Training Accuracy per 200 steps: 81.49350649350649


1199it [16:03,  1.24it/s]


KeyboardInterrupt: ignored

In [None]:
import gc 
gc.collect()

8034

<a id='section06'></a>
### Validating the Model Performance

During the validation stage we pass unseen data (the testing dataset) to the model. This step determines how well the model performs on unseen data.

This unseen data was separated during the dataset creation stage.
During the validation stage the weights of the model are not updated. Only the final output is compared to the actual value. This comparison is then used to calculate the accuracy of the model.
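
The comparison itself is simple: the predicted class is the index of the largest model output, and accuracy is the share of predictions that match the labels. A plain-Python sketch with made-up outputs for three classes (`argmax` and `count_correct` are hypothetical helpers):

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def count_correct(logits_batch, targets):
    """Return (number of correct predictions, predicted class indices)."""
    preds = [argmax(row) for row in logits_batch]
    return sum(p == t for p, t in zip(preds, targets)), preds


logits = [
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 0.9],
    [0.4, 0.3, 0.2],
]
targets = [1, 0, 1, 0]

n_correct, preds = count_correct(logits, targets)
accuracy = 100.0 * n_correct / len(targets)
print(preds, n_correct, accuracy)  # [1, 0, 2, 0] 3 75.0
```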

In [None]:
def valid(model, testing_loader):
    # switch to evaluation mode (this disables dropout)
    model.eval()
    n_correct = 0; tr_loss = 0; nb_tr_steps = 0; nb_tr_examples = 0

    preds = []
    labels = []

    # gradients are not needed for evaluation, so the weights stay unchanged
    with torch.no_grad():
        for _, data in tqdm(enumerate(testing_loader, 0)):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
            targets = data['targets'].to(device, dtype = torch.long)
            # get predictions; the batch dimension is kept so a final batch of size 1 still works
            outputs = model(ids, mask, token_type_ids)
            # get loss
            loss = loss_function(outputs, targets)
            tr_loss += loss.item()
            big_val, big_idx = torch.max(outputs.data, dim=1)
            # get accuracy
            n_correct += calcuate_accuracy(big_idx, targets)

            nb_tr_steps += 1
            nb_tr_examples+=targets.size(0)

            preds.extend(big_idx.cpu().numpy())
            labels.extend(targets.cpu().numpy())
            
            if _%1000==0:
                loss_step = tr_loss/nb_tr_steps
                accu_step = (n_correct*100)/nb_tr_examples
                print(f"Validation Loss per 1000 steps: {loss_step}")
                print(f"Validation Accuracy per 1000 steps: {accu_step}")
        
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Validation Loss Epoch: {epoch_loss}")
    print(f"Validation Accuracy Epoch: {epoch_accu}")
    
    return epoch_accu, preds, labels


In [None]:
acc, test_preds, test_labels = valid(model, testing_loader)
print("Accuracy on test data = %0.2f%%" % acc)

2it [00:00,  8.31it/s]

Validation Loss per 1000 steps: 0.42143529653549194
Validation Accuracy per 1000 steps: 75.0


1004it [01:13, 15.20it/s]

Validation Loss per 1000 steps: 0.45159824552213573
Validation Accuracy per 1000 steps: 82.64235764235764


2002it [02:21, 15.04it/s]

Validation Loss per 1000 steps: 0.4586626543891037
Validation Accuracy per 1000 steps: 82.3088455772114


2574it [02:59, 14.31it/s]

Validation Loss Epoch: 0.4612595424289065
Validation Accuracy Epoch: 82.10781932977173
Accuracy on test data = 82.11%





Since the classes are imbalanced, we also look at precision and recall for each class.

In [None]:
print("validation set")

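# scikit-learn expects (y_true, y_pred). The output recorded below came from a run
# with the two arguments reversed, so in that output each row is a predicted class
# and the precision and recall columns are swapped.
print(confusion_matrix(test_labels, test_preds))
print(classification_report(test_labels, test_preds))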

validation set
[[3377   52   48  123  110  296]
 [   7   57    0    2    1    1]
 [  23    3 2024    1   19    9]
 [ 332    9    5  631   32  239]
 [ 131    6    4    2 1114   70]
 [ 265    0    1   19   32 1250]]
              precision    recall  f1-score   support

           0       0.82      0.84      0.83      4006
           1       0.45      0.84      0.58        68
           2       0.97      0.97      0.97      2079
           3       0.81      0.51      0.62      1248
           4       0.85      0.84      0.85      1327
           5       0.67      0.80      0.73      1567

    accuracy                           0.82     10295
   macro avg       0.76      0.80      0.76     10295
weighted avg       0.83      0.82      0.82     10295
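
When reading the matrix printed above, note that each row sums to the `support` value in the report (row 1 sums to 68, for example). That is what happens when predictions are passed as the first argument of `confusion_matrix`: rows are then predicted classes and columns are true classes. Under that assumption, the sketch below recomputes per-class precision and recall by hand; `per_class_scores` is a hypothetical helper.

```python
# Test-set matrix copied from the output above.
# Assumption: rows = predicted class, columns = true class.
matrix = [
    [3377, 52, 48, 123, 110, 296],
    [7, 57, 0, 2, 1, 1],
    [23, 3, 2024, 1, 19, 9],
    [332, 9, 5, 631, 32, 239],
    [131, 6, 4, 2, 1114, 70],
    [265, 0, 1, 19, 32, 1250],
]


def per_class_scores(m):
    """Return a list of (precision, recall, true_count) per class."""
    k = len(m)
    scores = []
    for i in range(k):
        n_predicted = sum(m[i])                     # row sum
        n_actual = sum(m[r][i] for r in range(k))   # column sum
        hits = m[i][i]
        precision = hits / n_predicted if n_predicted else 0.0
        recall = hits / n_actual if n_actual else 0.0
        scores.append((precision, recall, n_actual))
    return scores


scores = per_class_scores(matrix)
for idx, (precision, recall, n_actual) in enumerate(scores):
    print(idx, round(precision, 2), round(recall, 2), n_actual)
```

Under this reading, class 1 (the rarest) has precision of about 0.84 but recall of only about 0.45 over 127 true examples, i.e. the reverse of the two columns shown in the report.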



In [None]:
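acc, train_preds, train_labels = valid(model, training_loader)
print("Accuracy on train data = %0.2f%%" % acc)

print("train set")

# scikit-learn expects (y_true, y_pred). The output recorded below came from a run
# with the two arguments reversed, so in that output each row is a predicted class
# and the precision and recall columns are swapped.
print(confusion_matrix(train_labels, train_preds))
print(classification_report(train_labels, train_preds))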

1it [00:00,  3.54it/s]

Validation Loss per 1000 steps: 0.6121636629104614
Validation Accuracy per 1000 steps: 68.75


1001it [04:28,  3.84it/s]

Validation Loss per 1000 steps: 0.4278519282100739
Validation Accuracy per 1000 steps: 83.61638361638362


2001it [08:57,  3.77it/s]

Validation Loss per 1000 steps: 0.4306458607967617
Validation Accuracy per 1000 steps: 83.23650674662669


2574it [11:30,  3.73it/s]

Validation Loss Epoch: 0.4303299752212395
Validation Accuracy Epoch: 83.31431069234318
Accuracy on test data = 83.31%
train set
[[13710   248   175   349   443  1081]
 [   15   166     1    11     1     3]
 [   89    16  8132     5    45    14]
 [ 1290    27     8  2639   117   942]
 [  440    50     7    16  4488   245]
 [  994     1     3    95   140  5173]]
              precision    recall  f1-score   support

           0       0.83      0.86      0.84     16006
           1       0.33      0.84      0.47       197
           2       0.98      0.98      0.98      8301
           3       0.85      0.53      0.65      5023
           4       0.86      0.86      0.86      5246
           5       0.69      0.81      0.75      6406

    accuracy                           0.83     41179
   macro avg       0.76      0.81      0.76     41179
weighted avg       0.84      0.83      0.83     41179






<a id='section07'></a>
### Saving the Trained Model 

In [None]:
output_model_file = dir + 'model1'
output_vocab_file = dir

model_to_save = model
torch.save(model_to_save, output_model_file)
tokenizer.save_vocabulary(output_vocab_file)

print('All files saved')

All files saved


In [None]:
# sample to reload the model to make sure that it works
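# the file holds the whole pickled model, so recent PyTorch versions need weights_only=False
model2 = torch.load(output_model_file, weights_only=False)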
acc, _, _ = valid(model2, testing_loader)
print("Accuracy on test data = %0.2f%%" % acc)

3it [00:00,  9.95it/s]

Validation Loss per 1000 steps: 2.6054577827453613
Validation Accuracy per 1000 steps: 25.0


1003it [01:14, 12.86it/s]

Validation Loss per 1000 steps: 0.46177849973543555
Validation Accuracy per 1000 steps: 82.31768231768231


2003it [02:26, 12.46it/s]

Validation Loss per 1000 steps: 0.4586494524340922
Validation Accuracy per 1000 steps: 82.27136431784108


2574it [03:05, 13.88it/s]

Validation Loss Epoch: 0.4612638504724159
Validation Accuracy Epoch: 82.10781932977173
Accuracy on test data = 82.11%





Note: this is just a sample flow for fine-tuning XLM-RoBERTa for text classification. Some of the hyperparameter choices are quite arbitrary, and I have not experimented with different settings. We should be able to get better results with more tuning.

Things we can potentially do:
- hyperparameter tuning for the learning rate, number of epochs, and the architecture of the classification layers
- a class-weighted cross-entropy loss to counter the class imbalance
- use different base models
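
As a starting point for the weighted-loss idea, one common heuristic (an assumption here, not something tried in this notebook) is to weight each class by `total / (num_classes * class_count)`, so rare classes contribute more to the loss. The counts below are the column sums of the test confusion matrix printed earlier, assuming its columns are the true classes; in practice the weights should be computed from the training labels.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * count) for each class."""
    total = sum(counts)
    k = len(counts)
    return [total / (k * c) for c in counts]


# True-class counts for labels 0..5 (column sums of the test confusion matrix).
true_counts = [4135, 127, 2082, 778, 1308, 1865]
weights = balanced_class_weights(true_counts)
print([round(w, 2) for w in weights])

# These weights could then be passed to the loss, for example:
#   loss_function = torch.nn.CrossEntropyLoss(
#       weight=torch.tensor(weights, dtype=torch.float).to(device))
```

The rarest class (label 1) gets a weight of about 13.5, while the most frequent class (label 0) gets about 0.41.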