<a href="https://colab.research.google.com/github/claudiu14c/NLI-RoBERTa/blob/main/nlu_cw_transformer_training_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# README

## The model
This is a model for Natural Language Inference (NLI). It uses a pre-trained RoBERTa model. RoBERTa has the same architecture as BERT, but uses a byte-level BPE tokenizer instead of WordPiece. Some hyperparameters and the pre-training objectives are also changed: in particular, RoBERTa is not pre-trained with the Next Sentence Prediction task.

A classification head consisting of a dense layer, a drop-out layer and another dense layer is added on top of this. The pre-trained RoBERTa model produces encodings of the input. The classification head takes them and produces a score (logit) for each class (class 0 -> no implication, class 1 -> hypothesis implies premise); a softmax turns these scores into class probabilities.
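
The head described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it reads the encoding of the first token (`<s>`), uses a tanh activation between the two dense layers, and skips drop-out because it does nothing at inference time. The sizes and random weights are toy values, not those of RoBERTa.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def classification_head(encodings, w_dense, b_dense, w_out, b_out):
    """Dense -> tanh -> dense, applied to the first-token encoding.

    Drop-out is omitted: it is the identity at inference time.
    """
    first_token = encodings[:, 0, :]                  # (batch, hidden)
    hidden = np.tanh(first_token @ w_dense + b_dense)
    return hidden @ w_out + b_out                     # (batch, num_labels) logits

rng = np.random.default_rng(0)
batch, seq_len, hidden_size, num_labels = 2, 4, 8, 2  # toy sizes (assumption)
encodings = rng.normal(size=(batch, seq_len, hidden_size))

logits = classification_head(
    encodings,
    rng.normal(size=(hidden_size, hidden_size)), np.zeros(hidden_size),
    rng.normal(size=(hidden_size, num_labels)), np.zeros(num_labels),
)
probs = softmax(logits)  # one probability per class, each row sums to 1
print(probs)
```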

This entire model is fine-tuned on our data set. Both the parameters of the RoBERTa model and those of the classification head are updated.

## Credits

This architecture was selected based on a similar model's performance on the [RTE benchmark](https://paperswithcode.com/sota/natural-language-inference-on-rte). The code is inspired by [this article](https://pchanda.github.io/Roberta-FineTuning-for-Classification/), where a model with the same architecture was fine-tuned for classifying molecules.

## Fine-tuned model location

After fine-tuning, the model was stored on Google Drive at [this location](https://drive.google.com/file/d/1-IJSt2HGH9Dqbu6NBuHr61ndV1r4g-3H/view?usp=sharing). It can be downloaded and used directly in the notebook, but uploading it to Colab takes more than 15 minutes, so during development the code loaded the model through a Google Drive link. To test this notebook, replace that link with the location of the downloaded model in the evaluation step.

## Evaluation
The model is evaluated both during and after training. Training is stopped if the model performance on the evaluation set stops improving. The evaluation metrics computed are:

*  Accuracy
*  Macro and Weighted Precision, Recall and F1 scores
*  Matthews Correlation Coefficient (MCC)
*  TP, TN, FP, FN
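
All of these metrics can be derived from the four confusion-matrix counts. The sketch below is a plain-Python illustration, not the notebook's own implementation (which relies on scikit-learn). The helper names are hypothetical, and undefined ratios are mapped to 0, which is an assumption chosen to mirror the warnings printed during training. The counts used are those of the final evaluation reported at the end of this notebook.

```python
import math

def safe_div(num, den):
    """Return num / den, or 0.0 when the ratio is undefined."""
    return num / den if den else 0.0

def f1(p, r):
    return safe_div(2 * p * r, p + r)

def metrics_from_counts(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    # Per-class precision and recall (class 1 = positive, class 0 = negative).
    p1, r1 = safe_div(tp, tp + fp), safe_div(tp, tp + fn)
    p0, r0 = safe_div(tn, tn + fn), safe_div(tn, tn + fp)
    f1_1, f1_0 = f1(p1, r1), f1(p0, r0)
    n1, n0 = tp + fn, tn + fp  # number of gold examples per class
    return {
        "accuracy": (tp + tn) / total,
        "macro_p": (p0 + p1) / 2,
        "macro_r": (r0 + r1) / 2,
        "macro_f1": (f1_0 + f1_1) / 2,
        "weighted_p": (n0 * p0 + n1 * p1) / total,
        "weighted_r": (n0 * r0 + n1 * r1) / total,
        "weighted_f1": (n0 * f1_0 + n1 * f1_1) / total,
        "mcc": safe_div(
            tp * tn - fp * fn,
            math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        ),
    }

# Counts from the final evaluation: tp=4, tn=6, fp=2, fn=1.
result = metrics_from_counts(tp=4, tn=6, fp=2, fn=1)
for name, value in result.items():
    print(name, round(value * 100, 2))
```

Multiplied by 100 and rounded to two decimals, these values agree with the dictionary printed by the final evaluation cell (accuracy 76.92, MCC 53.67).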

# Fine-tuning a RoBERTa model with a classification head on top
## Pre-requisites



*   The training and evaluation data should be in csv files with 3 columns labeled 'premise', 'hypothesis', and 'label'.
*   The training data csv should be called 'train.csv', and the evaluation data csv should be called 'dev.csv'. Both should be uploaded to Colab. Alternatively, one can modify the *tr_location* and *dev_location* variables, which hold both the location and name of the csv files. Links to Google Drive can be used.
*   The final evaluation can be done either on the model fine-tuned in this notebook or on another model loaded from the file system or Google Drive. Since this notebook had already been run, the fine-tuned model is currently loaded from Google Drive. To test the evaluation step, please download the fine-tuned model from [here](https://drive.google.com/file/d/1-IJSt2HGH9Dqbu6NBuHr61ndV1r4g-3H/view?usp=sharing) and set the variable *MODEL_PATH* to the location of the downloaded model.
* A GPU environment must be used for training.
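
Before uploading the data, it can be useful to check that a csv file has the expected layout. The following is a minimal sketch using only the standard library; `check_nli_csv` and the two sample rows are hypothetical and not part of the notebook, and the check that labels are 0 or 1 assumes the binary labelling used here.

```python
import csv
import io

REQUIRED_COLUMNS = ("premise", "hypothesis", "label")

def check_nli_csv(handle):
    """Validate an open csv file and return the number of data rows."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing column(s): {missing}")
    n_rows = 0
    for row in reader:
        if row["label"] not in ("0", "1"):
            raise ValueError(f"row {n_rows + 1}: label must be 0 or 1, got {row['label']!r}")
        n_rows += 1
    return n_rows

# Hypothetical two-row file; quoted fields may contain commas.
sample = io.StringIO(
    "premise,hypothesis,label\n"
    '"A man plays a guitar, smiling.",A person makes music.,1\n'
    "The shop is closed today.,The shop is open today.,0\n"
)
n_rows = check_nli_csv(sample)
print(n_rows)
```

For a file on disk, the same function could be called on `open('train.csv', newline='')`.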


Import the required libraries and configuration files for RoBERTa.

In [106]:
import os
import numpy as np
import pandas as pd
import transformers
import torch
from torch.utils.data import (
    Dataset,
    DataLoader,
    RandomSampler,
    SequentialSampler
)

import math
from transformers import  (
    BertPreTrainedModel,
    RobertaConfig,
    RobertaTokenizerFast
)

# AdamW is taken from PyTorch: the transformers implementation is deprecated.
from torch.optim import AdamW
from transformers.optimization import get_linear_schedule_with_warmup

from scipy.special import softmax
from torch.nn import CrossEntropyLoss

from sklearn.metrics import (
    confusion_matrix,
    matthews_corrcoef,
    roc_curve,
    auc,
    average_precision_score,
    precision_score,
    recall_score,
    f1_score
)


from transformers.models.roberta.modeling_roberta import (
    RobertaClassificationHead,
    RobertaConfig,
    RobertaModel,
)

Define a function that reads the training (or dev) data from a csv file into a dataframe.
Important: the data is pre-processed and separation tokens (< s >, < /s >) are added. These tokens are the equivalents of [CLS] and [SEP] in a classic BERT model.
The data is saved into a dataframe of the following form:


*   Column 1: 'text': '< s > premise < /s > hypothesis < /s >'
*   Column 2: 'label': label (0/1)
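
As a small illustration of this format, the sketch below builds the 'text' field for two made-up examples in plain Python. It mirrors the string concatenation used in `get_data`, but `join_pair` and the sample rows are hypothetical.

```python
def join_pair(premise, hypothesis):
    """Join one premise/hypothesis pair with RoBERTa's <s> and </s> markers."""
    return f" <s> {premise} </s> {hypothesis} </s> "

# Hypothetical rows with the three expected columns.
rows = [
    {"premise": "A man is playing a guitar.", "hypothesis": "A person makes music.", "label": 1},
    {"premise": "The shop is closed today.", "hypothesis": "The shop is open today.", "label": 0},
]

processed = [
    {"text": join_pair(row["premise"], row["hypothesis"]), "label": row["label"]}
    for row in rows
]

for item in processed:
    print(repr(item["text"]), item["label"])
```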


In [107]:
pd.set_option('display.max_colwidth', None)

def get_data(location):
  df = pd.read_csv(location)

  #show some data before processing
  print("\nBefore:\n")
  print("Columns: ", df.columns)
  print("\nFirst entry:\n ", df.iloc[1])

  #join the premise and hypothesis columns using separation tokens <s> and </s>
  df['text'] = " <s> " + df['premise'] + " </s> " + df['hypothesis'] + " </s> "
  df.drop(columns=['premise','hypothesis'], inplace=True)
  df = df[['text', 'label']]

  #show some data after processing
  print("\nAfter:\n")
  print("Columns: ", df.columns)
  print("\nFirst entry:\n ", df.iloc[1])

  return df

Read and process train and dev data.

The *tr_location* and *dev_location* variables hold the paths (or links) from which the data is read. Drive links can be used.

In this example, it is assumed that the training and evaluation data are called "train.csv" and "dev.csv" and have been uploaded to Colab.

In [108]:
tr_location = 'train.csv'
dev_location = 'dev.csv'

print('Training data')
train_df = get_data(tr_location)

print('\n\n\nEvaluation data:')
dev_df = get_data(dev_location)

Training data

Before:

Columns:  Index(['premise', 'hypothesis', 'label'], dtype='object')

First entry:
  premise       Buchanan's  The Democrats and Republicans have become too similar and bland.
hypothesis                                              THe parties will never be similar.
label                                                                                    0
Name: 1, dtype: object

After:

Columns:  Index(['text', 'label'], dtype='object')

First entry:
  text      <s> Buchanan's  The Democrats and Republicans have become too similar and bland. </s> THe parties will never be similar. </s> 
label                                                                                                                                  0
Name: 1, dtype: object



Evaluation data:

Before:

Columns:  Index(['premise', 'hypothesis', 'label'], dtype='object')

First entry:
  premise       He really shook up my whole mindset, Broker says. 
hypothesis               His mindset never c

Define the hyperparameters.

In [109]:
model_name = 'FacebookAI/roberta-base'
num_labels = 2
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer_name = model_name

max_seq_length = 128
train_batch_size = 64
dev_batch_size = 64
warmup_ratio = 0.06
weight_decay=10**(-5)
gradient_accumulation_steps = 1
num_train_epochs = 10
learning_rate = 1e-05
adam_epsilon = 1e-08

Add a classification head containing one linear layer, one drop-out layer and another linear layer on top of the RoBERTa encodings.

In [110]:
class RobertaClassifier(BertPreTrainedModel):
    def __init__(self, config):
        super(RobertaClassifier, self).__init__(config)
        self.num_labels = config.num_labels
        self.roberta = RobertaModel(config)
        self.classifier = RobertaClassificationHead(config)


    def forward(self, input_ids, attention_mask, labels):
        outputs = self.roberta(input_ids,attention_mask=attention_mask)
        sequence_output = outputs[0]
        logits = self.classifier(sequence_output)

        outputs = (logits,) + outputs[2:]

        loss_fct = CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)

Load a pre-trained RoBERTa model with the default configuration and its tokenizer.

In [111]:
config_class = RobertaConfig
model_class = RobertaClassifier
tokenizer_class = RobertaTokenizerFast

config = config_class.from_pretrained(model_name, num_labels=num_labels)

model = model_class.from_pretrained(model_name, config=config)
print('Model=\n',model,'\n')

tokenizer = tokenizer_class.from_pretrained(tokenizer_name, do_lower_case=False)
print('Tokenizer=',tokenizer,'\n')

Some weights of RobertaClassifier were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model=
 RobertaClassifier(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (Layer

Class that tokenizes the input data and converts it to PyTorch tensors.

In [112]:
class NliDataset(Dataset):

    def __init__(self, data, tokenizer):
        text, labels = data
        self.examples = tokenizer(text=text,text_pair=None,truncation=True,padding="max_length",
                                  max_length=max_seq_length,return_tensors="pt")
        print(self.examples['input_ids'].shape)
        self.labels = torch.tensor(labels, dtype=torch.long)


    def __len__(self):
        return len(self.examples["input_ids"])

    def __getitem__(self, index):
        return {key: self.examples[key][index] for key in self.examples}, self.labels[index]

Apply it to our data.

In [113]:
train_examples = (train_df.iloc[:, 0].astype(str).tolist(), train_df.iloc[:, 1].tolist())
train_dataset = NliDataset(train_examples,tokenizer)

dev_examples = (dev_df.iloc[:, 0].astype(str).tolist(), dev_df.iloc[:, 1].tolist())
dev_dataset = NliDataset(dev_examples,tokenizer)

torch.Size([13, 128])
torch.Size([13, 128])
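
The shape `[13, 128]` means 13 examples, each padded or truncated to `max_seq_length = 128` tokens. The sketch below illustrates that step in plain Python. It is a simplification: the real tokenizer also handles special tokens when truncating, the token ids are made up, and `pad_id=1` is taken from the `padding_idx=1` shown in the model printout.

```python
def pad_or_truncate(token_ids, max_length, pad_id=1):
    """Return (input_ids, attention_mask), both of length max_length."""
    ids = list(token_ids[:max_length])      # cut sequences that are too long
    mask = [1] * len(ids)                   # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

# Hypothetical token ids for a short and a long sequence.
short_ids, short_mask = pad_or_truncate([0, 100, 200, 2], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask tells the model which positions to ignore, so padding does not influence the encodings.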


Group training and dev data into batches.

In [114]:
def get_inputs_dict(batch):
    inputs = {key: value.squeeze(1).to(device) for key, value in batch[0].items()}
    inputs["labels"] = batch[1].to(device)
    return inputs

In [115]:
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset,sampler=train_sampler,batch_size=train_batch_size)

dev_sampler = SequentialSampler(dev_dataset)
dev_dataloader = DataLoader(dev_dataset, sampler=dev_sampler, batch_size=dev_batch_size)

Custom parameter groups for the AdamW optimiser: weight decay is applied to all parameters except biases and LayerNorm weights.

In [116]:
t_total = len(train_dataloader) // gradient_accumulation_steps * num_train_epochs
optimizer_grouped_parameters = []
custom_parameter_names = set()
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters.extend(
    [
        {
            "params": [
                p
                for n, p in model.named_parameters()
                if n not in custom_parameter_names and not any(nd in n for nd in no_decay)
            ],
            "weight_decay": weight_decay,
        },
        {
            "params": [
                p
                for n, p in model.named_parameters()
                if n not in custom_parameter_names and any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.0,
        },
    ]
)

warmup_steps = math.ceil(t_total * warmup_ratio)
optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, eps=adam_epsilon)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total)
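
To make the effect of the warm-up schedule concrete, here is a plain-Python sketch of the learning-rate multiplier that a linear schedule with warm-up applies at each step: a linear ramp from 0 to 1 during warm-up, then a linear decay to 0. The function name is hypothetical, and `batches_per_epoch = 1` is an assumption that matches the small run shown in this notebook.

```python
import math

def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimiser step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

batches_per_epoch = 1      # assumption: one batch per epoch, as in the run shown here
num_train_epochs = 10
warmup_ratio = 0.06
learning_rate = 1e-05

t_total = batches_per_epoch * num_train_epochs
warmup_steps = math.ceil(t_total * warmup_ratio)

lrs = [
    learning_rate * linear_schedule_multiplier(step, warmup_steps, t_total)
    for step in range(t_total)
]
print(warmup_steps)
print(lrs)
```

With so few steps the warm-up lasts a single step, so the learning rate starts at 0, reaches its peak at the second step and then decays.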



Method to compute the desired evaluation metrics:


*  Accuracy
*  Macro and Weighted Precision, Recall and F1 scores
*  Matthews Correlation Coefficient (MCC)
*  TP, TN, FP, FN



In [117]:
def print_confusion_matrix(result):
    print('confusion matrix:')
    print('            predicted    ')
    print('          0     |     1')
    print('    ----------------------')
    print('   0 | ',format(result['tn'],'5d'),' | ',format(result['fp'],'5d'))
    print('gl -----------------------')
    print('   1 | ',format(result['fn'],'5d'),' | ',format(result['tp'],'5d'))
    print('---------------------------------------------------')

def compute_metrics(preds, labels):
    assert len(preds) == len(labels)

    #round everything to 2 decimals
    accuracy = round(np.mean(preds == labels) * 100, 2)
    mcc = round(matthews_corrcoef(labels, preds) * 100, 2)

    #Macro and weighted precision, recall and F1 scores
    macro_p = round(precision_score(labels, preds, average='macro') * 100, 2)
    macro_r = round(recall_score(labels, preds, average='macro') * 100, 2)
    macro_f1 = round(f1_score(labels, preds, average='macro') * 100, 2)
    w_macro_p = round(precision_score(labels, preds, average='weighted') * 100, 2)
    w_macro_r = round(recall_score(labels, preds, average='weighted') * 100, 2)
    w_macro_f1 = round(f1_score(labels, preds, average='weighted') * 100, 2)

    tn, fp, fn, tp = confusion_matrix(labels, preds, labels=[0, 1]).ravel()
    return (
        {
            **{"accuracy": accuracy,
               "Macro-P": macro_p,
              "Macro-R": macro_r,
              "Macro-F1": macro_f1,
              "W Macro-P": w_macro_p,
              "W Macro-R": w_macro_r,
              "W Macro-F1": w_macro_f1,
              "MCC": mcc,
              "tp": tp, "tn": tn, "fp": fp, "fn": fn},
        }
    )

Training loop for fine-tuning. Training was stopped once results started stagnating on the dev set (in this case, at the 4th epoch). You can choose where the model is saved by changing the *location* variable.

In [118]:
model.to(device)

model.zero_grad()

print("epochs ", num_train_epochs)

for epoch in range(num_train_epochs):

    model.train()
    epoch_loss = []
    print("epoch ", epoch)
    print("batches: ", len(train_dataloader))

    for i,batch in enumerate(train_dataloader):
        print("batch nr ", i)
        batch = get_inputs_dict(batch)
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        loss.backward()
        optimizer.step()
        scheduler.step()
        model.zero_grad()
        epoch_loss.append(loss.item())

    #save the model
    location = "test-model.pt"
    # location = "/content/drive/MyDrive/NLU/test-model" + str(epoch) + ".pt"
    torch.save(model, location)

    #evaluate model with dev_df at the end of the epoch.
    eval_loss = 0.0
    nb_eval_steps = 0
    n_batches = len(dev_dataloader)
    preds = np.empty((len(dev_dataset), num_labels))
    out_label_ids = np.empty((len(dev_dataset)))
    model.eval()

    print(len(dev_dataloader))
    for i,dev_batch in enumerate(dev_dataloader):
        with torch.no_grad():
            print(i)
            dev_batch = get_inputs_dict(dev_batch)
            input_ids = dev_batch['input_ids'].to(device)
            attention_mask = dev_batch['attention_mask'].to(device)
            labels = dev_batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            tmp_eval_loss, logits = outputs[:2]
            eval_loss += tmp_eval_loss.item()

        nb_eval_steps += 1
        start_index = dev_batch_size * i
        end_index = start_index + dev_batch_size if i != (n_batches - 1) else len(dev_dataset)
        preds[start_index:end_index] = logits.detach().cpu().numpy()
        out_label_ids[start_index:end_index] = dev_batch["labels"].detach().cpu().numpy()

    eval_loss = eval_loss / nb_eval_steps
    model_outputs = preds
    preds = np.argmax(preds, axis=1)
    result = compute_metrics(preds, out_label_ids)

    print('epoch',epoch,'Training avg loss',np.mean(epoch_loss))
    print('epoch',epoch,'Dev set avg loss',eval_loss)
    print(result)
    print_confusion_matrix(result)
    print('---------------------------------------------------\n')

epochs  10
epoch  0
batches:  1
batch nr  0
1
0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


epoch 0 Training avg loss 0.697169303894043
epoch 0 Dev set avg loss 0.7247381210327148
{'accuracy': 38.46, 'Macro-P': 19.23, 'Macro-R': 50.0, 'Macro-F1': 27.78, 'W Macro-P': 14.79, 'W Macro-R': 38.46, 'W Macro-F1': 21.37, 'MCC': 0.0, 'tp': 5, 'tn': 0, 'fp': 8, 'fn': 0}
confusion matrix:
            predicted    
          0     |     1
    ----------------------
   0 |      0  |      8
gl -----------------------
   1 |      0  |      5
---------------------------------------------------
---------------------------------------------------

epoch  1
batches:  1
batch nr  0
1
0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


epoch 1 Training avg loss 0.6776463389396667
epoch 1 Dev set avg loss 0.7238956689834595
{'accuracy': 38.46, 'Macro-P': 19.23, 'Macro-R': 50.0, 'Macro-F1': 27.78, 'W Macro-P': 14.79, 'W Macro-R': 38.46, 'W Macro-F1': 21.37, 'MCC': 0.0, 'tp': 5, 'tn': 0, 'fp': 8, 'fn': 0}
confusion matrix:
            predicted    
          0     |     1
    ----------------------
   0 |      0  |      8
gl -----------------------
   1 |      0  |      5
---------------------------------------------------
---------------------------------------------------

epoch  2
batches:  1
batch nr  0
1
0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


epoch 2 Training avg loss 0.6801583170890808
epoch 2 Dev set avg loss 0.7239802479743958
{'accuracy': 38.46, 'Macro-P': 19.23, 'Macro-R': 50.0, 'Macro-F1': 27.78, 'W Macro-P': 14.79, 'W Macro-R': 38.46, 'W Macro-F1': 21.37, 'MCC': 0.0, 'tp': 5, 'tn': 0, 'fp': 8, 'fn': 0}
confusion matrix:
            predicted    
          0     |     1
    ----------------------
   0 |      0  |      8
gl -----------------------
   1 |      0  |      5
---------------------------------------------------
---------------------------------------------------

epoch  3
batches:  1
batch nr  0
1
0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


epoch 3 Training avg loss 0.6631960272789001
epoch 3 Dev set avg loss 0.7245372533798218
{'accuracy': 38.46, 'Macro-P': 19.23, 'Macro-R': 50.0, 'Macro-F1': 27.78, 'W Macro-P': 14.79, 'W Macro-R': 38.46, 'W Macro-F1': 21.37, 'MCC': 0.0, 'tp': 5, 'tn': 0, 'fp': 8, 'fn': 0}
confusion matrix:
            predicted    
          0     |     1
    ----------------------
   0 |      0  |      8
gl -----------------------
   1 |      0  |      5
---------------------------------------------------
---------------------------------------------------

epoch  4
batches:  1
batch nr  0
1
0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


epoch 4 Training avg loss 0.6675573587417603
epoch 4 Dev set avg loss 0.7255657911300659
{'accuracy': 38.46, 'Macro-P': 19.23, 'Macro-R': 50.0, 'Macro-F1': 27.78, 'W Macro-P': 14.79, 'W Macro-R': 38.46, 'W Macro-F1': 21.37, 'MCC': 0.0, 'tp': 5, 'tn': 0, 'fp': 8, 'fn': 0}
confusion matrix:
            predicted    
          0     |     1
    ----------------------
   0 |      0  |      8
gl -----------------------
   1 |      0  |      5
---------------------------------------------------
---------------------------------------------------

epoch  5
batches:  1
batch nr  0
1
0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


epoch 5 Training avg loss 0.6578407883644104
epoch 5 Dev set avg loss 0.7267816066741943
{'accuracy': 38.46, 'Macro-P': 19.23, 'Macro-R': 50.0, 'Macro-F1': 27.78, 'W Macro-P': 14.79, 'W Macro-R': 38.46, 'W Macro-F1': 21.37, 'MCC': 0.0, 'tp': 5, 'tn': 0, 'fp': 8, 'fn': 0}
confusion matrix:
            predicted    
          0     |     1
    ----------------------
   0 |      0  |      8
gl -----------------------
   1 |      0  |      5
---------------------------------------------------
---------------------------------------------------

epoch  6
batches:  1
batch nr  0
1
0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


epoch 6 Training avg loss 0.6854851841926575
epoch 6 Dev set avg loss 0.7278547286987305
{'accuracy': 38.46, 'Macro-P': 19.23, 'Macro-R': 50.0, 'Macro-F1': 27.78, 'W Macro-P': 14.79, 'W Macro-R': 38.46, 'W Macro-F1': 21.37, 'MCC': 0.0, 'tp': 5, 'tn': 0, 'fp': 8, 'fn': 0}
confusion matrix:
            predicted    
          0     |     1
    ----------------------
   0 |      0  |      8
gl -----------------------
   1 |      0  |      5
---------------------------------------------------
---------------------------------------------------

epoch  7
batches:  1
batch nr  0
1
0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


epoch 7 Training avg loss 0.6683212518692017
epoch 7 Dev set avg loss 0.7289440631866455
{'accuracy': 38.46, 'Macro-P': 19.23, 'Macro-R': 50.0, 'Macro-F1': 27.78, 'W Macro-P': 14.79, 'W Macro-R': 38.46, 'W Macro-F1': 21.37, 'MCC': 0.0, 'tp': 5, 'tn': 0, 'fp': 8, 'fn': 0}
confusion matrix:
            predicted    
          0     |     1
    ----------------------
   0 |      0  |      8
gl -----------------------
   1 |      0  |      5
---------------------------------------------------
---------------------------------------------------

epoch  8
batches:  1
batch nr  0
1
0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


epoch 8 Training avg loss 0.7108211517333984
epoch 8 Dev set avg loss 0.7299705743789673
{'accuracy': 38.46, 'Macro-P': 19.23, 'Macro-R': 50.0, 'Macro-F1': 27.78, 'W Macro-P': 14.79, 'W Macro-R': 38.46, 'W Macro-F1': 21.37, 'MCC': 0.0, 'tp': 5, 'tn': 0, 'fp': 8, 'fn': 0}
confusion matrix:
            predicted    
          0     |     1
    ----------------------
   0 |      0  |      8
gl -----------------------
   1 |      0  |      5
---------------------------------------------------
---------------------------------------------------

epoch  9
batches:  1
batch nr  0
1
0
epoch 9 Training avg loss 0.647945761680603
epoch 9 Dev set avg loss 0.7304792404174805
{'accuracy': 38.46, 'Macro-P': 19.23, 'Macro-R': 50.0, 'Macro-F1': 27.78, 'W Macro-P': 14.79, 'W Macro-R': 38.46, 'W Macro-F1': 21.37, 'MCC': 0.0, 'tp': 5, 'tn': 0, 'fp': 8, 'fn': 0}
confusion matrix:
            predicted    
          0     |     1
    ----------------------
   0 |      0  |      8
gl -----------------------

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
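
The loop as written runs for all `num_train_epochs` and contains no automatic stopping rule. One possible way to automate the stopping criterion is a patience-based check on the dev loss, sketched below. `EarlyStopper`, `patience` and `min_delta` are hypothetical names and values, not part of the notebook; the losses are the first five dev-set losses printed above, rounded to four decimals.

```python
class EarlyStopper:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, dev_loss):
        if dev_loss < self.best_loss - self.min_delta:
            self.best_loss = dev_loss   # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

dev_losses = [0.7247, 0.7239, 0.7240, 0.7245, 0.7256]

stopper = EarlyStopper(patience=2)
stopped_at = None
for epoch, loss in enumerate(dev_losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_loss)
```

In the training loop, the check would go after the dev-set evaluation of each epoch, followed by a `break` when it returns `True`.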


# Evaluate the model

Load evaluation data with labels.

In [119]:
#read and process the data
dev_location = 'dev.csv'
print('\n\n\nEvaluation data:')
test_df = get_data(dev_location)

#the batch size that gets processed at once during testing
test_batch_size = 64

#process the data in the required format
test_examples = (test_df.iloc[:, 0].astype(str).tolist(), test_df.iloc[:, 1].tolist())
test_dataset = NliDataset(test_examples,tokenizer)

#split into batches
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=test_batch_size)




Evaluation data:

Before:

Columns:  Index(['premise', 'hypothesis', 'label'], dtype='object')

First entry:
  premise       He really shook up my whole mindset, Broker says. 
hypothesis               His mindset never changed, Broker said.
label                                                          0
Name: 1, dtype: object

After:

Columns:  Index(['text', 'label'], dtype='object')

First entry:
  text      <s> He really shook up my whole mindset, Broker says.  </s> His mindset never changed, Broker said. </s> 
label                                                                                                             0
Name: 1, dtype: object
torch.Size([13, 128])


Load a model from Drive or from the file storage. (We can use the model from the previous step - just comment out this code in that case).

In [120]:
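MODEL_PATH = "/content/drive/MyDrive/NLU/roberta-model4.pt"
# The file holds the whole pickled model (saved with torch.save(model, ...)),
# so recent PyTorch versions need weights_only=False to unpickle it.
model = torch.load(MODEL_PATH, map_location=torch.device('cpu'), weights_only=False)
# model = torch.load(PATH)
model.eval()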

RobertaClassifier(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): L

Make the predictions

In [121]:
model.to(device)

model.zero_grad()

eval_loss = 0.0
nb_eval_steps = 0
n_batches = len(test_dataloader)
preds = np.empty((len(test_dataset), num_labels))
out_label_ids = np.empty((len(test_dataset)))
model.eval()

print(len(test_dataloader))
for i,test_batch in enumerate(test_dataloader):
  with torch.no_grad():
      if i%10 ==0:
        print(i)
      test_batch = get_inputs_dict(test_batch)
      input_ids = test_batch['input_ids'].to(device)
      attention_mask = test_batch['attention_mask'].to(device)
      labels = test_batch['labels'].to(device)
      outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
      tmp_eval_loss, logits = outputs[:2]
      eval_loss += tmp_eval_loss.item()

  nb_eval_steps += 1
  start_index = test_batch_size * i
  end_index = start_index + test_batch_size if i != (n_batches - 1) else len(test_dataset)
  preds[start_index:end_index] = logits.detach().cpu().numpy()
  out_label_ids[start_index:end_index] = test_batch["labels"].detach().cpu().numpy()

1
0
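
The prediction loop writes each batch of logits into a pre-allocated array, and the last batch may be smaller than the others. The NumPy sketch below isolates that indexing logic. Random numbers stand in for the model outputs, and the batch size of 5 is an illustrative assumption chosen so that the last batch is incomplete.

```python
import math
import numpy as np

num_examples, num_labels = 13, 2   # 13 matches the dev set shown in this notebook
batch_size = 5                     # illustrative; the notebook uses 64

rng = np.random.default_rng(0)
fake_logits = rng.normal(size=(num_examples, num_labels))  # stand-in for model outputs

n_batches = math.ceil(num_examples / batch_size)
preds = np.empty((num_examples, num_labels))
for i in range(n_batches):
    start_index = batch_size * i
    # The last batch may be smaller, so its slice ends at the dataset length.
    end_index = start_index + batch_size if i != n_batches - 1 else num_examples
    preds[start_index:end_index] = fake_logits[start_index:end_index]

# The predicted class is the index of the largest logit in each row.
predicted_classes = np.argmax(preds, axis=1)
print(predicted_classes)
```

Taking the argmax of the logits gives the same class as taking the argmax of the softmax probabilities, so no softmax is needed for the predictions.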


Evaluate the model

In [122]:
eval_loss = eval_loss / nb_eval_steps

preds = np.argmax(preds, axis=1)
print(preds)
print(out_label_ids)

result = compute_metrics(preds, out_label_ids)

print(result)
print_confusion_matrix(result)

[1 0 1 0 1 1 0 0 0 1 0 1 0]
[0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 1. 0.]
{'accuracy': 76.92, 'Macro-P': 76.19, 'Macro-R': 77.5, 'Macro-F1': 76.36, 'W Macro-P': 78.39, 'W Macro-R': 76.92, 'W Macro-F1': 77.2, 'MCC': 53.67, 'tp': 4, 'tn': 6, 'fp': 2, 'fn': 1}
confusion matrix:
            predicted    
          0     |     1
    ----------------------
   0 |      6  |      2
gl -----------------------
   1 |      1  |      4
---------------------------------------------------
