I used two different pretrained language models from HuggingFace to train on this dataset. 

*   bert-base-cased, fine-tuned for the downstream task of sentence classification. The corresponding model class in the HuggingFace repo is **BertForSequenceClassification**. 
*   distilbert-base-cased. To this model, I added 2 linear layers separated by a ReLU non-linearity and a dropout layer.

I received the following metrics on the **test dataset**:

###  **BertForSequenceClassification**: 
>  1. Precision =  0.9159170903402425
>  2. Recall =  0.9349301397205588
>  3. F1 score =  0.9253259581193203
>  4. Accuracy = 0.9244

###  **Sentence Classifier: distilbert-base-cased + 2 Linear Layers + Dropout + ReLU**:
>  1. Precision =  0.9021823850350741
>  2. Recall =  0.9241516966067864
>  3. F1 score =  0.9130349043581149
>  4. Accuracy = 0.9118

Since both models achieve the required metric of 0.90, I am submitting the DistilBERT model as my submission. I am also attaching the notebook, model and config files for BertForSequenceClassification in the extras folder, since it scores better and helped me make some decisions for my final submission.

Performance metrics of the Sentence Classifier on the validation dataset:
>  1. Precision =  0.9033646322378717
>  2. Recall =  0.9184566428003182
>  3. F1 score =  0.9108481262327416
>  4. Accuracy = 0.9096

Performance metrics of the Sentence Classifier on the entire training dataset:
>  1. Precision =  0.9993494144730257
>  2. Recall =  0.9993994294579851
>  3. F1 score =  0.9993744213397393
>  4. Accuracy = 0.999375

>**Note: I chose DistilBERT because it is easier to train, given the constraint that the runtime gets disconnected after ~6+ hours of training, and because I wanted the loss to start converging. For the DistilBERT and linear layer combination, I saw the loss converge to the order of 1e-4 after the 5th epoch. I could only train ~2-3 epochs for the BERT base model, but with DistilBERT I could train and run inference for more than 5 epochs, and was sure of the loss convergence. DistilBERT also does not use token_type_ids in its tokenizer encodings.**




The entire process can be divided into: 
# 1. **Dataloader design**: 
 
 I read the CSV files as pandas dataframes and wrapped them in a custom PyTorch Dataset, which is then served through the PyTorch DataLoader class. This also involves the design of the tokenizer.
> **Tokeniser Design :** I chose the cased version of the tokeniser to preserve extra information in reviews like "BAD" instead of "bad". 
> I chose a max-length of 256 because the average review length was 231.33 words on the training set.
> I chose a batch size of 16 (32 gave an out-of-memory error).
> To obtain the CLS token that the linear layer is applied to, add_special_tokens=True.

# 2. **Classifier Design**:

* I tried both bert-base and DistilBERT as the language model for tokenizing and fine-tuning. After two epochs, the F1 score was 0.92 for bert-base-cased and 0.90 for the DistilBERT model. However, the loss was still decreasing for both models. DistilBERT with two linear layers took approximately 30 minutes for one epoch of training plus validation on 1600 samples, while BertForSequenceClassification took 150 minutes. That is a relative F1 gain of about 2.2% (0.92 vs. 0.90) for a 400% increase in training time. This is the reason DistilBERT was chosen as the language model for fine-tuning.

* The CLS token of the pretrained language model is the first vector in the output encodings, and has a size of 16 x 768 for each batch (i.e. a size of 768 for each review). I chose 1 extra hidden linear layer with a size of 20, a ReLU non-linearity, and a dropout of 0.2. (I followed: Link: [Parameter-Efficient Transfer Learning for NLP](https://arxiv.org/pdf/1902.00751.pdf)). The dropout value of 0.2 also matches the seq_classif_dropout value in the HuggingFace DistilBERT config parameters for the sequence classification task. 
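
As a rough check on the size of this classification head, the sketch below counts its trainable parameters. It assumes the layer sizes stated above (768 to 20 to 2, each layer with a bias term); `linear_params` is a helper name introduced here for illustration and is not part of the notebook.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

hidden_size = 768   # DistilBERT hidden size ("dim" in the model config)
head_hidden = 20    # width of the extra hidden layer
num_classes = 2     # positive / negative

# ReLU and dropout have no trainable parameters, so only the two linear layers count.
head_params = linear_params(hidden_size, head_hidden) + linear_params(head_hidden, num_classes)
print(head_params)  # 15422
```

Under these assumptions the head adds only about 15 thousand parameters on top of the pretrained encoder.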

# 3. **Optimizer, Learning Rate Scheduler And Loss Function Design**:

*  Following the recommendation in the paper above and the Huggingface guide on sequence classification, the optimizer used is AdamW (the Adam variant provided by the transformers library). The learning rate decays linearly from a maximum value of 5e-5, with no warmup. [Huggingface article link on how to train a sequence classifier.](https://huggingface.co/transformers/training.html/) 

* Since this is a classification problem, standard PyTorch CrossEntropyLoss is used as the loss function.
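
The linear decay can be sketched in plain Python as follows. This is only an illustration of the schedule's shape, assuming zero warmup steps, 40000 training reviews, a batch size of 16 and 5 epochs (the values used in this notebook); `linear_decay_lr` is a hypothetical helper, not the scheduler object returned by the transformers library.

```python
def linear_decay_lr(step: int, total_steps: int,
                    peak_lr: float = 5e-5, warmup_steps: int = 0) -> float:
    """Learning rate at a given optimizer step for linear warmup then linear decay."""
    if step < warmup_steps:
        return peak_lr * (step / max(1, warmup_steps))
    remaining = max(0, total_steps - step)
    return peak_lr * (remaining / max(1, total_steps - warmup_steps))

batches_per_epoch = 40000 // 16          # 2500 batches of 16 reviews
total_steps = 5 * batches_per_epoch      # one scheduler step per batch, 5 epochs
print(total_steps)                       # 12500

print(linear_decay_lr(0, total_steps))                  # 5e-05 (start of training)
print(linear_decay_lr(total_steps // 2, total_steps))   # 2.5e-05 (halfway)
print(linear_decay_lr(total_steps, total_steps))        # 0.0 (end of training)
```

Because the schedule is tied to the total step count, changing the number of epochs or the batch size changes how fast the learning rate falls.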

# 4. **Training and Validation Loop:**

* It seemed sensible to freeze the language model and train only the hidden layer parameters for the fine-tuning task, which is a standard practice in transfer learning. However, the HuggingFace sequence classification guide mentions that fine-tuning the entire model gives better results.

    > "Note that if you are used to freezing the body of your pretrained model (like in computer vision) the above may seem a bit strange, as we are directly fine-tuning the whole model without taking any precaution. It actually works better this way for Transformers model (so this is not an oversight on our side. If you’re not familiar with what “freezing the body” of the model means, forget you read this paragraph."
  
* For training the following steps are used:


```
       1. Read one batch of tokenized reviews and labels from the train dataloader. 
       2. Move these tensors to the GPU.
       3. Zero the accumulated gradients (PyTorch accumulates gradients across backward passes by default).
       4. Run the model to get the predictions.
       5. Calculate the loss.
       6. Back-propagate the loss.
       7. Update the parameters.
       8. Update the learning rate.
       9. Repeat this process for each batch in the training dataloader (covers the entire training set).
       10. This is 1 epoch of training. Calculate the F1 score and accuracy metrics on the validation set. Here, I use only 1600 samples of the validation set to save time.
       11. Repeat this process for 5 epochs. 
       12. Save the model after this process concludes.

```
It was noted that the loss for each batch of training samples falls to the order of 1e-4 for the fifth epoch.

# 5. **Performance of the final model on Test, Validation and Training Sets:**

* I used sklearn to evaluate the precision, recall and F1 scores. The saved model is loaded and run on the entire test set. The predictions for each batch are calculated and appended to a list. The collected predictions and labels are finally moved to the CPU, and the sklearn API is called on them to calculate Precision, Recall, F1 score and Accuracy.
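
For reference, the four metrics can be written out from the confusion counts as in the sketch below. It is a plain-Python illustration of the definitions for binary labels, not the sklearn implementation; `binary_metrics`, `f1_from` and the toy prediction and label lists are made up for this example.

```python
def binary_metrics(preds, labels):
    """Precision, recall, F1 and accuracy for 0/1 predictions and labels."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_from(precision, recall)
    accuracy = (tp + tn) / len(labels)
    return precision, recall, f1, accuracy

def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Toy data (hypothetical): 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
toy_preds  = [1, 0, 1, 1, 0, 0, 1, 0]
toy_labels = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(toy_preds, toy_labels))  # (0.75, 0.75, 0.75, 0.75)

# Consistency check on the test-set numbers reported in this document.
reported_f1 = f1_from(0.9021823850350741, 0.9241516966067864)
print(round(reported_f1, 4))  # 0.913
```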

Now, I will explain code in each section and put the important results in this PDF.




---



## Install Huggingface Dependencies. 

In [7]:
! pip install tokenizers
! pip install transformers



## Mount GDrive 

In [6]:
# Mount Google Drive to this notebook
# The purpose is to allow your code to access to your files
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [8]:
cd drive/MyDrive/nlp_assignments/assignment1/

/content/drive/MyDrive/nlp_assignments/assignment1


In [9]:
import pandas as pd
import numpy as np

### Load the Dataset as pandas df and calculate average word size for each review in the training dataset. This will help us in setting the maximum word size for the bert tokenizer.

In [8]:
train_df=pd.read_csv('/content/drive/My Drive/nlp_assignments/assignment1/imdbdataset/Train.csv')
test_df=pd.read_csv('/content/drive/My Drive/nlp_assignments/assignment1/imdbdataset/Test.csv')
val_df=pd.read_csv('/content/drive/My Drive/nlp_assignments/assignment1/imdbdataset/Valid.csv')
print(train_df.shape)
print(test_df.shape)
print(val_df.shape)
# print(train_df[:5])
print(train_df['text'].size)
words=0
for i in range(train_df['text'].size):
  words+=len(train_df['text'][i].split())
print('avg words in training dataset: ',words/train_df['text'].size)
print(train_df[0:5])

(40000, 2)
(5000, 2)
(5000, 2)
40000
avg words in training dataset:  231.33925
                                                text  label
0  I grew up (b. 1965) watching and loving the Th...      0
1  When I put this movie in my DVD player, and sa...      0
2  Why do people who do not know what a particula...      0
3  Even though I have great interest in Biblical ...      0
4  Im a die hard Dads Army fan and nothing will e...      1
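
The mean alone does not show how many reviews exceed a given length limit, so a slightly fuller summary can help when choosing max_length. The sketch below is one possible way to compute it; `length_summary` and `toy_reviews` are hypothetical and stand in for the real training column. It counts whitespace-separated words, and the tokenizer typically produces more tokens than words, so the fraction it reports should be read as an optimistic estimate of how many reviews fit.

```python
from statistics import mean, median

def length_summary(texts, limit=256):
    """Return (mean, median, fraction of texts longer than `limit`) in words."""
    counts = [len(text.split()) for text in texts]
    over_limit = sum(1 for c in counts if c > limit) / len(counts)
    return mean(counts), median(counts), over_limit

# Hypothetical stand-in for train_df['text']: 2, 300 and 8 words.
toy_reviews = [
    "good movie",
    "bad " * 300,
    "an average film with a few nice scenes",
]
avg_words, median_words, frac_over = length_summary(toy_reviews)
print(avg_words, median_words, frac_over)
```

With the real data, the same function could be called as `length_summary(train_df['text'])`.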


### Load the pretrained Tokenizer. This will be the distilled bert cased.

In [11]:
from transformers import BertTokenizer, BertModel,DistilBertTokenizer,DistilBertModel
encoder = DistilBertTokenizer.from_pretrained('distilbert-base-cased')


### Visualize encodings. This will be present in our dataloader getitem function.

In [12]:
encodings=encoder(train_df['text'][1], add_special_tokens=True, padding='max_length', max_length=256,truncation=True)
encodings

{'input_ids': [101, 1332, 146, 1508, 1142, 2523, 1107, 1139, 4173, 1591, 117, 1105, 2068, 1205, 1114, 170, 1884, 2391, 1105, 1199, 13228, 117, 146, 1125, 1199, 11471, 119, 146, 1108, 4717, 1115, 1142, 2523, 1156, 4651, 1199, 1104, 1103, 2012, 118, 1827, 1104, 1103, 1148, 2523, 131, 138, 10732, 6758, 8794, 117, 1363, 8342, 1642, 117, 6548, 1490, 2641, 117, 6276, 3789, 1105, 170, 5642, 118, 3919, 5945, 119, 1252, 117, 1106, 1139, 10866, 117, 1136, 1251, 1104, 1142, 1110, 1106, 1129, 1276, 1107, 17793, 131, 22644, 112, 188, 11121, 119, 6467, 146, 2373, 1199, 3761, 1148, 117, 146, 1547, 1136, 1138, 1151, 1177, 1519, 1205, 119, 1109, 1378, 24950, 1209, 1129, 2002, 1106, 1343, 1150, 1138, 1562, 1103, 1148, 2523, 117, 1105, 1150, 4927, 1122, 3120, 1111, 1103, 1827, 3025, 119, 133, 9304, 120, 135, 133, 9304, 120, 135, 1332, 1103, 1148, 2741, 2691, 117, 1240, 1107, 1111, 170, 4900, 1191, 1128, 1198, 3015, 17793, 131, 22644, 112, 188, 11121, 1121, 1103, 3934, 118, 1692, 1120, 1240, 1469, 6581, 2

### Dataloader design:
##### For each item (`__getitem__`) we will return the tokenized encodings. The length operator will be standard.


In [13]:
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader

class imdbdataset(Dataset):
  def __init__(self, df, encoder):
    self.text=df['text']
    self.labels=df['label']
    self.encoder=encoder
  
  def __len__(self):
    return self.text.size

  def __getitem__(self,item):
    text=self.text[item]
    label=self.labels[item]
    # Use the tokenizer stored on the dataset rather than the notebook-level global
    encoding=self.encoder(text, add_special_tokens=True, padding='max_length', max_length=256,truncation=True, return_tensors="pt")
    return encoding['input_ids'].squeeze(0), encoding['attention_mask'].squeeze(0),label
  





### Launch the Training and validation dataset and dataloader

In [9]:
train_dataset=imdbdataset(train_df, encoder)
train_dataloader=DataLoader(train_dataset, batch_size=16)
val_dataset=imdbdataset(val_df, encoder)
val_dataloader=DataLoader(val_dataset, batch_size=16)

### Visualize one sample of outputs from the dataloader. Input IDs are the token IDs of each review, padded or truncated to a length of 256. The attention mask is 0 at the padded positions of reviews shorter than 256 tokens. There are 16 items in each batch.

In [15]:
inputids, attentionmasks, label=next(iter(train_dataloader))
print(inputids)
print(attentionmasks)
print(label)

tensor([[ 101,  146, 2580,  ...,    0,    0,    0],
        [ 101, 1332,  146,  ..., 1210, 3426,  102],
        [ 101, 2009, 1202,  ...,    0,    0,    0],
        ...,
        [ 101, 6155, 3635,  ...,    0,    0,    0],
        [ 101,  138, 1544,  ...,    0,    0,    0],
        [ 101, 1409, 1128,  ...,    0,    0,    0]])
tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])
tensor([0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1])
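
The relationship between input IDs and the attention mask can be illustrated with a small plain-Python sketch. It is a simplification, not the tokenizer's own logic: `pad_and_mask` is a hypothetical helper, it uses pad id 0 (the pad_token_id of this model), and it truncates naively, whereas the real tokenizer keeps the closing special token (id 102) when it truncates, as the second row of the output above shows.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Pad (or naively truncate) token ids to max_length and build the attention mask."""
    ids = list(token_ids[:max_length])
    padding = max_length - len(ids)
    input_ids = ids + [pad_id] * padding
    # 1 marks a real token the model should attend to, 0 marks padding.
    attention_mask = [1] * len(ids) + [0] * padding
    return input_ids, attention_mask

# A short encoded review: [CLS]=101, two word pieces, [SEP]=102.
input_ids, attention_mask = pad_and_mask([101, 146, 2580, 102], max_length=6)
print(input_ids)       # [101, 146, 2580, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```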


### Load the model. Here the distilled lightweight version of bert will be used. I checked the model configuration and found that each token has a size of 768 in the hidden layer and that the suggested sequence classification dropout is 0.2. The shape of the CLS token is also printed in this cell.

In [16]:
model = DistilBertModel.from_pretrained('distilbert-base-cased')
print(model.config)
hidden_state=model(inputids,attentionmasks)
print(hidden_state[0][:,0,:].shape)

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


DistilBertConfig {
  "_name_or_path": "distilbert-base-cased",
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.11.2",
  "vocab_size": 28996
}

torch.Size([16, 768])


### This is the Sentence Classifier Class. In the forward pass, the CLS token from the distilled version of Bert is passed through a linear layer with hidden size 20, a ReLU, a dropout of 0.2, and another linear layer with 2 output dimensions (1 for positive and 1 for negative).

In [17]:
import torch
import torch.nn as nn
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
class sentence_classifier(nn.Module):
  def __init__(self):
    super(sentence_classifier,self).__init__()
    self.DBert = DistilBertModel.from_pretrained('distilbert-base-cased')
    self.classifier = nn.Sequential(
          nn.Linear(768, 20),
          nn.ReLU(),
          nn.Dropout(0.2),
          nn.Linear(20, 2)
        )
  def forward(self, input_ids, attention_masks):
    hidden_state=self.DBert(input_ids,attention_masks)
    cls=hidden_state[0][:,0,:]
    logits=self.classifier(cls)
    return logits
# d_model=model.to(device)

cuda:0


### This initialization function initialises the model to the sentence classifier class, the optimizer to AdamW, and the learning rate scheduler to decay linearly from 5e-5. All 3 objects are returned.

In [None]:
from transformers import AdamW
from transformers import get_scheduler
def init_model():
  imdb_classifier=sentence_classifier()
  imdb_classifier.to(device)
  optimizer = AdamW(imdb_classifier.parameters(), lr=5e-5)
  num_epochs = 5
  num_training_steps = num_epochs * len(train_dataloader)
  print(num_training_steps)
  lr_scheduler = get_scheduler(
      "linear",
      optimizer=optimizer,
      num_warmup_steps=0,
      num_training_steps=num_training_steps
  )
  return imdb_classifier,optimizer,lr_scheduler

### Initialize the loss function to Cross Entropy Loss since this is a classification problem.

In [None]:
loss_fn=nn.CrossEntropyLoss()
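
To make the loss concrete, the sketch below computes cross entropy from raw logits in plain Python. It is an illustration of the formula, not PyTorch's implementation; `cross_entropy` and `batch_loss` are hypothetical helper names. With its default settings, `nn.CrossEntropyLoss` averages the per-sample losses over the batch, which is what `batch_loss` mimics.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class for one sample."""
    largest = max(logits)  # subtract the max for numerical stability
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[target]

def batch_loss(batch_logits, targets):
    """Mean of the per-sample losses over the batch."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

# Equal logits mean a 50/50 guess, so the loss is log(2).
print(cross_entropy([0.0, 0.0], 0))

# Two confident, correct predictions give a small mean loss.
print(batch_loss([[2.0, 0.0], [0.0, 2.0]], [0, 1]))
```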

### Use sklearn to compute metrics. This function expects 2 lists, and prints the Precision, Recall, F1 Score and Accuracy values: 

1.   Prediction List: The class predicted by the model for each review (0 or 1), taken as the argmax of the output logits.
2.   Labels List: Correct labels from the dataloader class. 

**Note: This operation is performed on the CPU and not the GPU, hence the lists are passed to the CPU before calling this function.**



In [18]:
!pip install scikit-learn
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
def compute_metrics(preds,labels):
  print(preds,labels)
  precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
  acc = accuracy_score(labels, preds)
  print("Precision = ", precision)
  print("recall = ", recall)
  print("F1 score = ", f1)
  print("accuracy = ", acc)



### Training Loop. The batch loss (the Cross Entropy Loss averaged over the 16 samples of the batch) is printed every 99 batches, i.e. after roughly every 1600 training samples. After every epoch, a validation cycle is performed on 1600 samples, and the resulting metrics are printed. There is scope to perform grid search and tune the hyperparameters here, but the results generated after the first epoch are satisfactory.

In [None]:
import itertools
imdb_model,optimizer,lr_scheduler=init_model()
num_epochs=5
for epoch in range(num_epochs):
  # Switch back to training mode every epoch; the validation pass below sets eval mode,
  # which would otherwise leave dropout disabled for all later epochs.
  imdb_model.train()
  rev_predictions=[]
  rev_labels=[]
  step=0  # batch counter (named step to avoid shadowing the built-in iter)
  for input_ids, attentionmasks,labels in train_dataloader:
    step+=1
    # Move the batch to the training device
    d_inputids=input_ids.to(device)
    d_attentionmasks=attentionmasks.to(device)
    d_labels=labels.to(device)
    # Clear gradients left over from the previous batch
    imdb_model.zero_grad()
    logits = imdb_model(d_inputids, d_attentionmasks)
    loss=loss_fn(logits, d_labels)
    if (step%99==0):
      print(step)
      print(loss)
    loss.backward()
    optimizer.step()
    lr_scheduler.step()
  print("1 epoch terminated. Evaluating on validation....")
  imdb_model.eval()
  # Validate on the first 100 batches (1600 samples) to save time
  for input_ids, attentionmasks,labels in itertools.islice(val_dataloader,100):
    d_inputids=input_ids.to(device)
    d_attentionmasks=attentionmasks.to(device)
    d_labels=labels.to(device)
    with torch.no_grad():
      logits = imdb_model(d_inputids, d_attentionmasks)
    predictions = torch.argmax(logits, dim=-1)
    rev_predictions.extend(predictions)
    rev_labels.extend(d_labels)
  compute_metrics(torch.stack(rev_predictions).cpu(), torch.stack(rev_labels).cpu())  
# print(rev_labels,rev_predictions)
# torch.save(imdb_model.state_dict(),'/content/drive/My Drive/nlp_assignments/assignment1/pytorch_model_distillbert.bin')



Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


12500
99
tensor(0.3428, device='cuda:0', grad_fn=<NllLossBackward>)
198
tensor(0.1465, device='cuda:0', grad_fn=<NllLossBackward>)
297
tensor(0.4405, device='cuda:0', grad_fn=<NllLossBackward>)
396
tensor(0.2062, device='cuda:0', grad_fn=<NllLossBackward>)
495
tensor(0.2421, device='cuda:0', grad_fn=<NllLossBackward>)
594
tensor(0.5980, device='cuda:0', grad_fn=<NllLossBackward>)
693
tensor(0.2064, device='cuda:0', grad_fn=<NllLossBackward>)
792
tensor(0.3631, device='cuda:0', grad_fn=<NllLossBackward>)
891
tensor(0.8403, device='cuda:0', grad_fn=<NllLossBackward>)
990
tensor(0.1605, device='cuda:0', grad_fn=<NllLossBackward>)
1089
tensor(0.1320, device='cuda:0', grad_fn=<NllLossBackward>)
1188
tensor(0.3393, device='cuda:0', grad_fn=<NllLossBackward>)
1287
tensor(0.3542, device='cuda:0', grad_fn=<NllLossBackward>)
1386
tensor(0.1650, device='cuda:0', grad_fn=<NllLossBackward>)
1485
tensor(0.2359, device='cuda:0', grad_fn=<NllLossBackward>)
1584
tensor(0.1936, device='cuda:0', grad_fn=

### This Command saves the trained model.

In [None]:
torch.save(imdb_model,'/content/drive/My Drive/nlp_assignments/assignment1/pytorch_model_distillbert.bin')

### Check accuracy and F1 score on the Test Set

In [7]:
test_dataset=imdbdataset(test_df, encoder)
test_dataloader=DataLoader(test_dataset, batch_size=16)
num_iters=16
load_model= torch.load('/content/drive/My Drive/nlp_assignments/assignment1/pytorch_model_distillbert.bin')
load_model.to(device)
import itertools
load_model.eval()
rev_predictions=[]
rev_labels=[]
for input_ids, attentionmasks,labels in test_dataloader:
  d_inputids=input_ids.to(device)
  d_attentionmasks=attentionmasks.to(device)
  d_labels=labels.to(device)
  with torch.no_grad():
    outputs = load_model(d_inputids, d_attentionmasks)
  predictions = torch.argmax(outputs, dim=-1)
  rev_predictions.extend(predictions)
  rev_labels.extend(d_labels)
print(len(rev_predictions))
print(len(rev_labels))
compute_metrics(torch.stack(rev_predictions).cpu(), torch.stack(rev_labels).cpu())  

5000
5000
tensor([0, 0, 0,  ..., 0, 0, 0]) tensor([0, 0, 0,  ..., 0, 0, 0])
Precision =  0.9021823850350741
recall =  0.9241516966067864
F1 score =  0.9130349043581149
accuracy =  0.9118


### Check accuracy and F1 score on the Validation set

In [None]:
load_model= torch.load('/content/drive/My Drive/nlp_assignments/assignment1/pytorch_model_distillbert.bin')
load_model.to(device)
import itertools
load_model.eval()
rev_predictions=[]
rev_labels=[]
for input_ids, attentionmasks,labels in val_dataloader:
  d_inputids=input_ids.to(device)
  d_attentionmasks=attentionmasks.to(device)
  d_labels=labels.to(device)
  with torch.no_grad():
    outputs = load_model(d_inputids, d_attentionmasks)
  predictions = torch.argmax(outputs, dim=-1)
  rev_predictions.extend(predictions)
  rev_labels.extend(d_labels)
print(len(rev_predictions))
print(len(rev_labels))
compute_metrics(torch.stack(rev_predictions).cpu(), torch.stack(rev_labels).cpu())  

5000
5000
tensor([0, 0, 0,  ..., 1, 1, 1]) tensor([0, 0, 0,  ..., 1, 1, 1])
Precision =  0.9033646322378717
recall =  0.9184566428003182
F1 score =  0.9108481262327416
accuracy =  0.9096


### Check accuracy and F1 score on the Train Set

In [10]:
load_model= torch.load('/content/drive/My Drive/nlp_assignments/assignment1/pytorch_model_distillbert.bin')
load_model.to(device)
import itertools
load_model.eval()
rev_predictions=[]
rev_labels=[]
for input_ids, attentionmasks,labels in train_dataloader:
  d_inputids=input_ids.to(device)
  d_attentionmasks=attentionmasks.to(device)
  d_labels=labels.to(device)
  with torch.no_grad():
    outputs = load_model(d_inputids, d_attentionmasks)
  predictions = torch.argmax(outputs, dim=-1)
  rev_predictions.extend(predictions)
  rev_labels.extend(d_labels)
print(len(rev_predictions))
print(len(rev_labels))
compute_metrics(torch.stack(rev_predictions).cpu(), torch.stack(rev_labels).cpu())  

40000
40000
tensor([0, 0, 0,  ..., 0, 1, 1]) tensor([0, 0, 0,  ..., 0, 1, 1])
Precision =  0.9993494144730257
recall =  0.9993994294579851
F1 score =  0.9993744213397393
accuracy =  0.999375
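
The three evaluation cells above repeat the same loop over different dataloaders, so it could be factored into one helper. The sketch below is one possible shape for it, assuming PyTorch is installed; `collect_predictions`, `ToyClassifier` and `toy_batches` are names invented for this example, and the toy model only stands in for the saved classifier so that the snippet runs on CPU without the dataset.

```python
import torch
from torch import nn

def collect_predictions(model, dataloader, device):
    """Run the model over a dataloader; return predictions and labels as CPU tensors."""
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for input_ids, attention_masks, labels in dataloader:
            logits = model(input_ids.to(device), attention_masks.to(device))
            all_preds.append(torch.argmax(logits, dim=-1).cpu())
            all_labels.append(torch.as_tensor(labels).cpu())
    return torch.cat(all_preds), torch.cat(all_labels)

class ToyClassifier(nn.Module):
    """Hypothetical stand-in: predicts class 1 when the first token id is odd."""
    def forward(self, input_ids, attention_masks):
        is_odd = (input_ids[:, 0] % 2 == 1).float()
        return torch.stack([1 - is_odd, is_odd], dim=1)

# Two toy batches of (input_ids, attention_mask, labels), shaped like the dataloader output.
toy_batches = [
    (torch.tensor([[1, 0], [2, 0]]), torch.ones(2, 2, dtype=torch.long), torch.tensor([1, 0])),
    (torch.tensor([[3, 0], [4, 0]]), torch.ones(2, 2, dtype=torch.long), torch.tensor([1, 1])),
]
preds, labels = collect_predictions(ToyClassifier(), toy_batches, torch.device("cpu"))
print(preds.tolist())   # [1, 0, 1, 0]
print(labels.tolist())  # [1, 0, 1, 1]
```

In the notebook, the returned tensors could be passed straight to `compute_metrics`, for example `compute_metrics(*collect_predictions(load_model, test_dataloader, device))`.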


In [None]:
!apt-get -qq install texlive texlive-xetex texlive-latex-extra pandoc
!pip install --quiet pypandoc

In [None]:
!jupyter nbconvert --to PDF "/content/drive/MyDrive/Colab Notebooks/Assignment1_apoorvgarg_db.ipynb"

[NbConvertApp] Converting notebook /content/drive/MyDrive/Colab Notebooks/Assignment1_apoorvgarg_db.ipynb to PDF


# Section 1 : Inference Block. Run inference with this block. Just change the paths of the model (model_path, line 15) and the data CSV file (test_file_path, line 16). Please remove the Google Drive import and drive.mount command if the data and model are not shared using Google Drive. On running the block, the accuracy, precision, recall and F1 score metrics will be printed to stdout.

In [11]:
! pip install tokenizers
! pip install transformers
!pip install scikit-learn
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
import numpy as np
from transformers import BertTokenizer, BertModel,DistilBertTokenizer,DistilBertModel
import torch
import torch.nn as nn
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
model_path='/content/drive/MyDrive/nlp_assignments/assignment1/pytorch_model_distillbert.bin'
test_file_path='/content/drive/My Drive/nlp_assignments/assignment1/imdbdataset/Test.csv'
test_df=pd.read_csv(test_file_path)
encoder = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
class imdbdataset(Dataset):
  def __init__(self, df, encoder):
    self.text=df['text']
    self.labels=df['label']
    self.encoder=encoder
  
  def __len__(self):
    return self.text.size

  def __getitem__(self,item):
    text=self.text[item]
    label=self.labels[item]
    encoding=self.encoder(text, add_special_tokens=True, padding='max_length', max_length=256,truncation=True, return_tensors="pt")
    return encoding['input_ids'].squeeze(0), encoding['attention_mask'].squeeze(0),label

def compute_metrics(preds,labels):
  print(preds,labels)
  precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
  acc = accuracy_score(labels, preds)
  print("Precision = ", precision)
  print("recall = ", recall)
  print("F1 score = ", f1)
  print("accuracy = ", acc)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
test_dataset=imdbdataset(test_df, encoder)
test_dataloader=DataLoader(test_dataset, batch_size=16)
class sentence_classifier(nn.Module):
  def __init__(self):
    super(sentence_classifier,self).__init__()
    self.DBert = DistilBertModel.from_pretrained('distilbert-base-cased')
    self.classifier = nn.Sequential(
          nn.Linear(768, 20),
          nn.ReLU(),
          nn.Dropout(0.2),
          nn.Linear(20, 2)
        )
  def forward(self, input_ids, attention_masks):
    hidden_state=self.DBert(input_ids,attention_masks)
    cls=hidden_state[0][:,0,:]
    logits=self.classifier(cls)
    return logits
load_model= torch.load(model_path)
load_model.to(device)
import itertools
load_model.eval()
rev_predictions=[]
rev_labels=[]
for input_ids, attentionmasks,labels in test_dataloader:
  d_inputids=input_ids.to(device)
  d_attentionmasks=attentionmasks.to(device)
  d_labels=labels.to(device)
  with torch.no_grad():
    outputs = load_model(d_inputids, d_attentionmasks)
  predictions = torch.argmax(outputs, dim=-1)
  rev_predictions.extend(predictions)
  rev_labels.extend(d_labels)
print(len(rev_predictions))
print(len(rev_labels))
compute_metrics(torch.stack(rev_predictions).cpu(), torch.stack(rev_labels).cpu())  

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
5000
5000
tensor([0, 0, 0,  ..., 0, 0, 0]) tensor([0, 0, 0,  ..., 0, 0, 0])
Precision =  0.9021823850350741
recall =  0.9241516966067864
F1 score =  0.9130349043581149
accuracy =  0.9118
