<a href="https://colab.research.google.com/github/claudiu14c/NLI-RoBERTa/blob/main/nlu_cw_transformer_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# README

## The model
This is a model for Natural Language Inference (NLI) built on a pre-trained RoBERTa model. RoBERTa has the same architecture as BERT, but uses a byte-level BPE tokenizer instead of WordPiece. Some hyperparameters and the pre-training objectives also differ: in particular, RoBERTa is not pre-trained on the Next Sentence Prediction task.

A classification head consisting of a dense layer, a dropout layer and another dense layer is added on top of this. The pre-trained RoBERTa model produces encodings of the input. The classification head takes them and produces a score for each class (class 0 -> no implication, class 1 -> hypothesis implies premise).
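
The prediction code in this notebook takes the argmax of the two class scores returned by the model. If probabilities are wanted instead, a softmax over those scores is the usual conversion. The following is a minimal sketch in plain Python; the scores are made up for illustration and `softmax` here is a local helper, not the notebook's own code.

```python
import math

def softmax(logits):
    """Convert a list of raw class scores into probabilities that sum to 1."""
    # Subtract the maximum first so exp() cannot overflow on large scores.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for one premise/hypothesis pair: [class 0, class 1]
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)

# The predicted class is the one with the highest probability,
# which is also the one with the highest raw score.
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

Because softmax preserves the ordering of the scores, taking the argmax of the raw scores or of the probabilities gives the same class.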

The entire model has been fine-tuned on our dataset: both the parameters of the RoBERTa base model and those of the initially untrained classification head were updated.

## Credits

The architecture was selected based on a similar model's performance on the [RTE benchmark](https://paperswithcode.com/sota/natural-language-inference-on-rte). The code is inspired by [this article](https://pchanda.github.io/Roberta-FineTuning-for-Classification/), where a model with the same architecture was fine-tuned for classifying molecules.

## Fine-tuned model location

After fine-tuning, the model was stored in the cloud at [this location](https://drive.google.com/file/d/1-IJSt2HGH9Dqbu6NBuHr61ndV1r4g-3H/view?usp=sharing). It can be downloaded and used directly in the notebook, but uploading it to Colab takes more than 15 minutes, so the code loads the model from a Google Drive path during testing. To test this notebook yourself, replace that path with the location of the downloaded model on your machine.

# Demo a fine-tuned RoBERTa model

## Pre-requisites



*   The test data should be a CSV file with 2 columns labelled 'premise' and 'hypothesis'.
*   The test data CSV should be called 'test.csv' and should be uploaded to Colab. Alternatively, one can modify the *test_location* variable, which holds both the location and the name of the CSV file. Paths on Google Drive can be used.
*   The fine-tuned model is currently loaded from Google Drive. To test this notebook, download the model from [here](https://drive.google.com/file/d/1-IJSt2HGH9Dqbu6NBuHr61ndV1r4g-3H/view?usp=sharing) and set the variable *PATH* to the location of the downloaded model.
*   A GPU runtime is encouraged for large test files, but it is not needed for fewer than 1000 test samples.
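
Before running the notebook it can help to confirm that the test file has the two expected columns. The following is a small sketch using only the standard library; `missing_columns` and the sample rows are hypothetical and not part of the notebook.

```python
import csv
import io

REQUIRED_COLUMNS = ("premise", "hypothesis")

def missing_columns(csv_file):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file has no header at all
    return [name for name in REQUIRED_COLUMNS if name not in header]

# Hypothetical one-row file standing in for test.csv
sample = io.StringIO("premise,hypothesis\nA man sleeps.,A man is awake.\n")
print(missing_columns(sample))  # → []
```

For a real file, the same function can be called on an open handle, for example `with open('test.csv', newline='') as f: print(missing_columns(f))`.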








Import the required libraries.

In [24]:
import os
import numpy as np
import pandas as pd
import transformers
import torch
from torch.utils.data import (
    Dataset,
    DataLoader,
    RandomSampler,
    SequentialSampler
)

import math
from transformers import  (
    BertPreTrainedModel,
    RobertaConfig,
    RobertaTokenizerFast
)

from transformers.optimization import (
    AdamW,
    get_linear_schedule_with_warmup
)

from scipy.special import softmax
from torch.nn import CrossEntropyLoss

from sklearn.metrics import (
    confusion_matrix,
    matthews_corrcoef,
    roc_curve,
    auc,
    average_precision_score,
)

# RobertaConfig is already imported from transformers above
from transformers.models.roberta.modeling_roberta import (
    RobertaClassificationHead,
    RobertaModel,
)

Define some hyperparameters.

In [25]:
tokenizer_name = 'FacebookAI/roberta-base'
num_labels = 2
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

max_seq_length = 128
train_batch_size = 64
test_batch_size = 64
warmup_ratio = 0.06
weight_decay = 1e-05
gradient_accumulation_steps = 1
num_train_epochs = 10
learning_rate = 1e-05
adam_epsilon = 1e-08
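
Most of these values only matter during fine-tuning. As a worked example of how `warmup_ratio` is commonly turned into a number of warmup steps for a linear schedule, here is a sketch in plain Python. The training-set size of 6400 is a hypothetical figure, and rounding up with `math.ceil` is an assumption about the convention, not something this notebook confirms.

```python
import math

def warmup_steps(n_examples, batch_size, epochs, accumulation_steps, warmup_ratio):
    """Return (total optimizer steps, warmup steps) for a linear warmup schedule."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    # One optimizer step is taken every `accumulation_steps` batches.
    steps_per_epoch = batches_per_epoch // accumulation_steps
    total_steps = steps_per_epoch * epochs
    return total_steps, math.ceil(total_steps * warmup_ratio)

# 6400 training examples is hypothetical; the other arguments are the
# hyperparameter values defined in the cell above.
total, warmup = warmup_steps(6400, 64, 10, 1, 0.06)
print(total, warmup)
```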

Load the tokenizer.

In [26]:
tokenizer_class = RobertaTokenizerFast
tokenizer = tokenizer_class.from_pretrained(tokenizer_name, do_lower_case=False)
print('Tokenizer=',tokenizer,'\n')

Tokenizer= RobertaTokenizerFast(name_or_path='FacebookAI/roberta-base', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	50264: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
} 



Redefine the model class

In [27]:
class RobertaClassifier(BertPreTrainedModel):
    def __init__(self, config):
        super(RobertaClassifier, self).__init__(config)
        self.num_labels = config.num_labels
        self.roberta = RobertaModel(config)
        self.classifier = RobertaClassificationHead(config)


    def forward(self, input_ids, attention_mask):
        outputs = self.roberta(input_ids,attention_mask=attention_mask)
        sequence_output = outputs[0]
        logits = self.classifier(sequence_output)

        # Keep any extra outputs (hidden states, attentions) after the logits.
        outputs = (logits,) + outputs[2:]

        return outputs  # logits, (hidden_states), (attentions)

Class that tokenizes the input data and converts it to PyTorch tensors.

In [28]:
class NliDataset(Dataset):
    def __init__(self, text, tokenizer):
        self.examples = tokenizer(text=text,text_pair=None,truncation=True,padding="max_length",
                                  max_length=max_seq_length,return_tensors="pt")
        print(self.examples['input_ids'].shape)

    def __len__(self):
        return len(self.examples["input_ids"])

    def __getitem__(self, index):
        return {key: self.examples[key][index] for key in self.examples}
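
With `padding="max_length"` and `truncation=True`, every example ends up exactly `max_seq_length` tokens long, and the attention mask marks which positions are real tokens. The sketch below shows that idea in plain Python. It is a simplification: the real tokenizer also keeps the closing `</s>` when it truncates, which this hypothetical `pad_or_truncate` helper does not.

```python
PAD_ID = 1  # id of <pad> in the tokenizer output shown above

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                   # drop anything beyond max_length
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad             # pad on the right
    attention_mask = [1] * len(kept) + [0] * n_pad  # 0 marks padding positions
    return input_ids, attention_mask

# Hypothetical token ids; 0 and 2 are the ids of <s> and </s>
ids, mask = pad_or_truncate([0, 100, 200, 2], max_length=6)
print(ids)   # → [0, 100, 200, 2, 1, 1]
print(mask)  # → [1, 1, 1, 1, 0, 0]
```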

Load and process the test data.

The 'premise' and 'hypothesis' columns are joined into a single 'text' column of the form '<s> premise </s> hypothesis </s>'.

The test data has no label column.

Group the test data into batches.

In [29]:
pd.set_option('display.max_colwidth', None)

def get_data(location):
  df = pd.read_csv(location)

  #show some data before processing
  print("\nBefore:\n")
  print("Columns: ", df.columns)
  print("\nFirst entry:\n ", df.iloc[1])

  #join the premise and hypothesis columns using separation tokens <s> and </s>
  df['text'] = " <s> " + df['premise'] + " </s> " + df['hypothesis'] + " </s> "
  df.drop(columns=['premise','hypothesis'], inplace=True)
  df = df[['text']]

  #show some data after processing
  print("\nAfter:\n")
  print("Columns: ", df.columns)
  print("\nFirst entry:\n ", df.iloc[1])

  return df
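
One thing to watch for in the join above: if a cell is empty, pandas reads it as NaN, string concatenation leaves NaN for that row, and the later `astype(str)` turns it into the literal text 'nan'. The sketch below shows the same text layout for a single pair in plain Python, treating a missing value as an empty string. That handling is an assumption about the desired behaviour, and `build_text` is a hypothetical helper in which `None` stands in for a missing cell.

```python
def build_text(premise, hypothesis):
    """Join one pair in the same layout as the 'text' column built by get_data."""
    # Assumption: a missing cell (None) is treated as an empty string.
    premise = "" if premise is None else str(premise)
    hypothesis = "" if hypothesis is None else str(hypothesis)
    return " <s> " + premise + " </s> " + hypothesis + " </s> "

print(repr(build_text("A man sleeps.", "A man is awake.")))
print(repr(build_text("A man sleeps.", None)))
```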

In [30]:
#read and process the data
test_location = 'test.csv'
print('\n\nTest data:')
test_df = get_data(test_location)

#the batch size that gets processed at once during testing
test_batch_size = 64

test_examples = (test_df.iloc[:, 0].astype(str).tolist())
test_dataset = NliDataset(test_examples,tokenizer)

test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=test_batch_size)



Test data:

Before:

Columns:  Index(['premise', 'hypothesis'], dtype='object')

First entry:
  premise       He really shook up my whole mindset, Broker says. 
hypothesis               His mindset never changed, Broker said.
Name: 1, dtype: object

After:

Columns:  Index(['text'], dtype='object')

First entry:
  text     <s> He really shook up my whole mindset, Broker says.  </s> His mindset never changed, Broker said. </s> 
Name: 1, dtype: object
torch.Size([20, 128])
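
The shape above shows 20 test samples, so with a batch size of 64 the data loader yields a single batch. The prediction loop later in this notebook relies on the same arithmetic to decide which rows of the output array each batch fills. A small sketch in plain Python follows; `batch_slices` is a hypothetical helper and 150 is a made-up sample count.

```python
import math

def batch_slices(n_samples, batch_size):
    """Return the (start, end) row range each batch occupies, in order."""
    n_batches = math.ceil(n_samples / batch_size)
    # The last batch may be shorter, so its end is capped at n_samples.
    return [(i * batch_size, min((i + 1) * batch_size, n_samples))
            for i in range(n_batches)]

print(batch_slices(20, 64))   # → [(0, 20)]
print(batch_slices(150, 64))  # → [(0, 64), (64, 128), (128, 150)]
```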


Load the fine-tuned model

In [31]:
PATH = "/content/drive/MyDrive/NLU/roberta-model4.pt"
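# The file stores the whole pickled model object, not just a state dict, so
# weights_only=False is needed on PyTorch versions where it defaults to True.
# Only load files from a source you trust.
model = torch.load(PATH, map_location=device, weights_only=False)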
# model = torch.load(PATH)
model.eval()

RobertaClassifier(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): L

Make the test predictions.

In [32]:
model.to(device)
model.eval()

n_batches = len(test_dataloader)
preds = np.empty((len(test_dataset), num_labels))

print(n_batches)
for i, test_batch in enumerate(test_dataloader):
    # report progress every 10 batches
    if i % 10 == 0:
        print(i)

    # inference only, so no gradients are needed
    with torch.no_grad():
        input_ids = test_batch['input_ids'].to(device)
        attention_mask = test_batch['attention_mask'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs[0]  # shape: (batch size, num_labels)

    # copy this batch's logits into the matching rows of preds
    start_index = test_batch_size * i
    end_index = start_index + len(logits)
    preds[start_index:end_index] = logits.cpu().numpy()

# keep the raw scores before reducing them to class indices
model_outputs = preds

# predicted class = index of the largest score in each row
preds = np.argmax(preds, axis=1)
print(preds)

1
0
[1 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 1 0]
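
To read the predictions more easily, the class indices can be mapped back to the label meanings given in "The model" section and counted. The sketch below does this in plain Python for the 20 predictions printed above; `LABEL_NAMES` and `example_preds` are illustrative names, not variables from the notebook.

```python
from collections import Counter

# Label meanings as described in "The model" section of this README
LABEL_NAMES = {0: "no implication", 1: "implication"}

# Predictions copied from the output above
example_preds = [1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0]

named = [LABEL_NAMES[p] for p in example_preds]
counts = Counter(named)
print(counts["implication"], counts["no implication"])  # → 8 12
```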


Write predictions to a csv file.

In [33]:
preds_df = pd.DataFrame({'prediction':preds})
preds_df.to_csv('test-predictions.csv', index=False)
# preds_df.to_csv('drive/MyDrive/NLU/test-predictions.csv', index=False)