# Sentiment analysis - fine-tuning BERT

In this notebook we'll look at the process of fine-tuning a pretrained [BERT](https://arxiv.org/abs/1810.04805) model to detect the sentiment of a piece of text. Our goal is to classify the polarity of IMDB movie reviews, using a dataset from this [Kaggle source](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/notebooks). The techniques we'll discuss apply not only to sentiment classification but to text classification in general.

<div>
<img src="https://github.com/andrejmiscic/NLP-workshop/raw/master/figures/classification.PNG" width="700"/>
</div>

First things first, let's make sure we have a GPU instance in this Colab session:
- `Edit -> Notebook settings -> Hardware accelerator` must be set to GPU
- if needed, reinitialize the session by clicking `Connect` in the top right corner

After the session is initialized, we can check our assigned GPU with the following command (fingers crossed it's a Tesla P100!!):

In [None]:
!nvidia-smi

Let's download two helper scripts used in this notebook and install some additional libraries: *transformers* for the BERT implementation and *gdown* for downloading files from Google Drive.

In [None]:
!wget https://github.com/andrejmiscic/NLP-workshop/raw/master/utils/text_classification_utils.py
!wget https://github.com/andrejmiscic/NLP-workshop/raw/master/utils/trainer.py

In [None]:
!pip install transformers
!pip install gdown

In [None]:
import gc
import os

import numpy as np
import pandas as pd
import torch
import torch.nn as nn

from text_classification_utils import TextClassificationDataset, collate_batch_to_tensors, seq_cls_evaluate
from sklearn.model_selection import train_test_split
from trainer import Trainer, RunConfig
from transformers import DistilBertConfig, DistilBertModel, DistilBertTokenizerFast

## Data

Let's take a look at our dataset of IMDB reviews:

In [None]:
path_to_train_csv = "https://raw.githubusercontent.com/andrejmiscic/NLP-workshop/master/Data/IMDB-reviews/imdb_train.csv"

df = pd.read_csv(path_to_train_csv)
class_list = sorted(df["label"].unique().tolist())
label2id = {label: i for i, label in enumerate(class_list)}
id2label = {i: label for i, label in enumerate(class_list)}

pd.set_option('display.max_colwidth', None)  # None disables truncation (-1 is deprecated)
print(df.head())
print(f"Classes: {class_list}")

Notice that our reviews fall into two polarity classes: *positive* and *negative*. This is therefore a binary sequence classification task. Below we prepare the data for training.
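
The integer encoding used in that preparation can be sketched in isolation. The label strings below are hypothetical stand-ins for the `label` column, so treat this as an illustration of the mapping rather than the actual data.

```python
# Hypothetical labels standing in for the "label" column of the dataframe.
labels = ["positive", "negative", "negative", "positive"]

# Sorting makes the mapping deterministic across runs.
class_list = sorted(set(labels))
label2id = {label: i for i, label in enumerate(class_list)}
id2label = {i: label for label, i in label2id.items()}

# Equivalent in spirit to df["label"].map(label2id) on a pandas column.
encoded = [label2id[label] for label in labels]
# Round trip back to the original strings.
decoded = [id2label[i] for i in encoded]

print(label2id)  # {'negative': 0, 'positive': 1}
print(encoded)   # [1, 0, 0, 1]
```

Because the class list is sorted, the same mapping is recovered whenever the same set of labels is present, which is what lets us rebuild it later from the test set.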

In [None]:
df["label"] = df["label"].map(label2id)
train_df, val_df = train_test_split(df, test_size=0.2, stratify=df["label"])

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased", do_lower_case=True)

max_len = 512
train_dataset = TextClassificationDataset(train_df["text"].tolist(), train_df["label"].tolist(), tokenizer, max_len)
val_dataset = TextClassificationDataset(val_df["text"].tolist(), val_df["label"].tolist(), tokenizer, max_len)
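
To build some intuition for what `max_len` and batching imply, here is a minimal pure-Python sketch of truncation, padding and attention masks. The helper `pad_batch`, the padding id and the token ids are all made up for illustration; we assume `collate_batch_to_tensors` does something along these lines, but its actual implementation lives in `text_classification_utils.py` and may differ.

```python
PAD_ID = 0  # assumed padding id; the real value is tokenizer.pad_token_id


def pad_batch(sequences, max_len, pad_id=PAD_ID):
    """Truncate each sequence to max_len, then pad to the longest one in the batch."""
    truncated = [seq[:max_len] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask


# Made-up token ids for a short and a longer review.
batch = [[101, 2023, 102], [101, 2023, 3185, 2003, 2204, 102]]
input_ids, attention_mask = pad_batch(batch, max_len=5)

print(input_ids)       # [[101, 2023, 102, 0, 0], [101, 2023, 3185, 2003, 2204]]
print(attention_mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Note that this sketch simply cuts off the tail of a long sequence, whereas a real tokenizer with truncation enabled keeps the special tokens in place.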

## Model

As for Named Entity Recognition, we are working with DistilBERT, a smaller model than base BERT that is trained by knowledge distillation and retains most of BERT's performance. As mentioned during the lectures, BERT has a special token (*CLS*) whose representation we use as input to a classifier. During pretraining this representation is trained on the next sentence prediction task, so out of the box it is not a useful sequence representation. That's where fine-tuning comes in: we train a classifier together with the pretrained BERT model to achieve good performance.

An architecture for sequence classification is already implemented in the *transformers* library: [*DistilBertForSequenceClassification*](https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification), but for demonstration purposes we reimplement DistilBERT with a classification head below.

In [None]:
class DistilBertTextClassificationModel(nn.Module):
    def __init__(self, bert_config, num_classes, dropout_prob=0.1):
        super(DistilBertTextClassificationModel, self).__init__()

        self.bert = DistilBertModel(bert_config)
        self.dropout = nn.Dropout(dropout_prob)
        self.classification_layer = nn.Linear(in_features=bert_config.hidden_size, out_features=num_classes)

    def forward(self, input_ids, attention_mask=None, labels=None):
        outputs = self.bert(input_ids, attention_mask)[0]
        cls = outputs[:, 0, :]  # [CLS] is the first token of the sequence
        cls = self.dropout(cls)  # to mitigate overfitting
        logits = self.classification_layer(cls)  # classify

        if labels is None:
            return logits

        loss = nn.CrossEntropyLoss()(logits, labels)
        return (loss, logits)

    def load(self, path_to_dir):
        self.bert = DistilBertModel.from_pretrained(path_to_dir)
        model_path = os.path.join(path_to_dir, "model.tar")
        if os.path.exists(model_path):
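            # map to CPU first so a checkpoint saved on a GPU also loads on a CPU-only machine
            checkpoint = torch.load(model_path, map_location="cpu")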
            self.dropout.load_state_dict(checkpoint["dropout"])
            self.classification_layer.load_state_dict(checkpoint["cls"])
        else:
            print("No model.tar in provided directory, only loading bert model.")

    def save_pretrained(self, path_to_dir):
        self.bert.save_pretrained(path_to_dir)
        torch.save(
            {"dropout": self.dropout.state_dict(), "cls": self.classification_layer.state_dict()},
            os.path.join(path_to_dir, "model.tar")
        )
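
The loss returned by `forward` is the cross-entropy between the logits and the true labels. As a rough sketch of what `nn.CrossEntropyLoss` computes with its default mean reduction, here is the same quantity in plain Python. The function name `cross_entropy` and the example logits are made up for illustration.

```python
import math


def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class, computed from raw logits."""
    total = 0.0
    for row, label in zip(logits, labels):
        m = max(row)  # subtract the maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        # -log softmax(row)[label] == log_sum_exp - row[label]
        total += log_sum_exp - row[label]
    return total / len(labels)


# Two made-up examples: a confident correct prediction and a coin flip.
logits = [[2.0, -1.0], [0.0, 0.0]]
labels = [0, 1]
loss = cross_entropy(logits, labels)
print(f"{loss:.4f}")
```

The confident, correct example contributes very little to the loss, while the undecided one contributes `log(2)`, which is why the loss drops as the classifier becomes both correct and confident.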

## Training

We have now implemented everything we need to start fine-tuning. We can save the fine-tuned models to our Colab instance (available under `/content/`) or we can connect our Google Drive to Colab and use it as external storage. If you want to do the latter, run the cell below and follow the instructions.

In [None]:
# optional if you want to save your models to Google Drive
from google.colab import drive
drive.mount("/content/drive/")

In [None]:
run_config = RunConfig(
    learning_rate = 3e-5,
    batch_size = 32,  # start with 32 and decrease if you get CUDA out of memory exception
    num_epochs = 3,
    output_dir = "/content/drive/MyDrive/NLP-workshop/BERT-sentiment/",
    collate_fn = collate_batch_to_tensors
)
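
To get a feel for how `batch_size` and `num_epochs` translate into optimizer steps, here is a small sketch. The helper `total_training_steps` and the dataset size are hypothetical, and we assume the last, partially filled batch of each epoch is kept; the actual `Trainer` may handle this differently.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of batches processed over the whole run, keeping the last partial batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Hypothetical dataset of 1000 examples with the settings from run_config.
steps = total_training_steps(num_examples=1000, batch_size=32, num_epochs=3)
print(steps)  # 96
```

Halving the batch size roughly doubles the number of steps per epoch, which is worth keeping in mind when lowering it to avoid memory errors.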

Instantiate the model and start training!

In [None]:
model = DistilBertTextClassificationModel(
    DistilBertConfig.from_pretrained("distilbert-base-uncased"), 
    num_classes=len(class_list)
)
model.load("distilbert-base-uncased")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

In [None]:
trainer = Trainer(model)
trainer.train(train_dataset, val_dataset, device, run_config)

If you happen to get a CUDA out of memory exception, do the following:
- cause another exception so Python doesn't hold any references to the trainer or model, e.g. run the cell below that raises a ZeroDivisionError
- run the cell after it, which empties the GPU cache
- decrease `batch_size` in `run_config` and rerun that cell
- reinstantiate the model and rerun training

In [None]:
1 / 0

In [None]:
model = None
trainer = None
gc.collect()
torch.cuda.empty_cache()
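
The manual recovery procedure can also be automated. The sketch below halves the batch size until training fits, assuming the out-of-memory condition surfaces as a `RuntimeError` whose message contains "out of memory". Both `find_working_batch_size` and `fake_training` are hypothetical helpers; in practice `run_training` would also have to rebuild the model and empty the GPU cache before each attempt.

```python
def find_working_batch_size(run_training, start=32, minimum=1):
    """Halve the batch size until run_training(batch_size) stops running out of memory."""
    batch_size = start
    while batch_size >= minimum:
        try:
            run_training(batch_size)
            return batch_size
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise  # unrelated error: do not swallow it
        # We are outside the except block here, so the exception is released.
        batch_size //= 2
    raise RuntimeError("No batch size fit into memory")


def fake_training(batch_size):
    # Stand-in for trainer.train(...): pretend anything above 8 does not fit.
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory (simulated)")


print(find_working_batch_size(fake_training))  # 8
```

Retrying only after leaving the `except` block mirrors the reason for the `1 / 0` trick: the failed attempt should no longer be referenced when the next one starts.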

## Evaluation

With this procedure we've now fine-tuned a model to predict the polarity of a review. For the purposes of this workshop we've fine-tuned a model in advance so we can analyze it. Load it by running the cell below.

In [None]:
!mkdir -p /content/bert-imdb
!gdown -O /content/bert-imdb/config.json https://drive.google.com/uc?id=1-5Z1EvvyYdXr73fQf7_nTyeltewWt5WV
!gdown -O /content/bert-imdb/model.tar https://drive.google.com/uc?id=1-S8Ii5SeazeqOWtI0Wx9JiSap5-mV4GQ
!gdown -O /content/bert-imdb/pytorch_model.bin https://drive.google.com/uc?id=1-R-lyZL53rY5wdfW2IWKUPFNDznc61QP

Let's instantiate all the objects we need for evaluation: model, dataset, tokenizer, etc.

In [None]:
path_to_test_csv = "https://raw.githubusercontent.com/andrejmiscic/NLP-workshop/master/Data/IMDB-reviews/imdb_test.csv"

df = pd.read_csv(path_to_test_csv)
class_list = sorted(df["label"].unique().tolist())
label2id = {label: i for i, label in enumerate(class_list)}
id2label = {i: label for i, label in enumerate(class_list)}
df["label"] = df["label"].map(label2id)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased", do_lower_case=True)
max_len = 512
test_dataset = TextClassificationDataset(df["text"].tolist(), df["label"].tolist(), tokenizer, max_len)

In [None]:
# only run if you want to use the model we've already fine-tuned for you
model = DistilBertTextClassificationModel(
    DistilBertConfig.from_pretrained("distilbert-base-uncased"), 
    num_classes=len(class_list)
)
model.load("/content/bert-imdb/")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

Evaluating our fine-tuned model on the test dataset:

In [None]:
log_loss, accuracy = seq_cls_evaluate(model, test_dataset, device, batch_size=64)
print(f"\nTest log loss = {log_loss:.4f}\nTest accuracy = {accuracy:.4f}")
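
For reference, the two reported metrics can be sketched in plain Python. We assume here that `seq_cls_evaluate` reports the mean negative log-likelihood and the fraction of correct predictions; the helper `log_loss_and_accuracy` and the probabilities below are made up for illustration.

```python
import math


def log_loss_and_accuracy(probabilities, labels):
    """Mean negative log-probability of the true class and fraction of correct argmax predictions."""
    nll = [-math.log(p[y]) for p, y in zip(probabilities, labels)]
    predictions = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
    correct = [pred == y for pred, y in zip(predictions, labels)]
    return sum(nll) / len(labels), sum(correct) / len(labels)


# Made-up class probabilities for three reviews; the last one is misclassified.
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]
log_loss, accuracy = log_loss_and_accuracy(probs, labels)
print(f"log loss = {log_loss:.4f}, accuracy = {accuracy:.4f}")
```

Accuracy only counts whether the top class is right, while log loss also penalizes confident mistakes, so the two numbers complement each other.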

Nice, we achieve a relatively good accuracy of 0.93. We can now experiment with the model and write some custom reviews.

In [None]:
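def predict_review_sentiment(review: str):
    # truncate long reviews so they fit into the model's maximum input length
    enc = tokenizer(review, truncation=True, max_length=max_len)
    inputs = {"input_ids": torch.tensor(enc["input_ids"], dtype=torch.long).unsqueeze(0).to(device),
              "attention_mask": torch.tensor(enc["attention_mask"], dtype=torch.long).unsqueeze(0).to(device)}
    model.eval()  # disable dropout so predictions are deterministic
    with torch.no_grad():
        probabilities = nn.functional.softmax(model(**inputs), dim=1).cpu().numpy()
    prediction = int(np.argmax(probabilities))  # plain int for the dictionary lookup
    print(review)
    print(f"Sentiment: {id2label.get(prediction)}")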

In [None]:
predict_review_sentiment("I think this movie is good.")
predict_review_sentiment("I don't think this movie is good.")
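
The prediction step boils down to a softmax over the logits followed by an argmax. Here is a minimal sketch in plain Python; the logits are made up, and the `id2label` mapping is assumed to match the one built from the sorted class names.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "negative", 1: "positive"}  # assumed mapping
logits = [-1.2, 2.3]  # made-up model output for a single review

probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(id2label[predicted], [round(p, 3) for p in probs])
```

Since softmax preserves the ordering of the logits, taking the argmax of the raw logits would give the same label; the softmax is only needed when we want the probabilities themselves.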