# Sentiment analysis - fine-tuning BERT

In this notebook we'll take a look at the process needed to fine-tune a pretrained [BERT](https://arxiv.org/abs/1810.04805) model to detect the sentiment of a piece of text. Our goal will be to classify the polarity of IMDB movie reviews. For this purpose we'll be working with a dataset from this [Kaggle source](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/notebooks). The techniques we'll discuss also apply to text classification in general.

First things first, let's make sure we have a GPU instance in this Colab session:
- `Edit -> Notebook settings -> Hardware accelerator` must be set to GPU
- if needed, reinitialize the session by clicking `Connect` in the top right corner

After the session is initialized, we can check our assigned GPU with the following command (fingers crossed it's a Tesla P100!!):

In [None]:
!nvidia-smi

Let's install some additional libraries: *transformers* for the BERT implementation and *gdown* for downloading files from Google Drive.

In [None]:
!pip install transformers
!pip install gdown

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from torch.utils.tensorboard import SummaryWriter
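from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated and absent from recent releases
from transformers import BertModel, BertConfig, BertTokenizerFast, get_linear_schedule_with_warmup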
from transformers import DistilBertConfig, DistilBertModel, DistilBertTokenizerFast
from sklearn.model_selection import train_test_split
import os
from tqdm import tqdm, trange
import numpy as np
import pandas as pd
from dataclasses import dataclass
import gc

## Data

Let's take a look at our dataset of IMDB reviews:

In [None]:
path_to_train_csv = "https://raw.githubusercontent.com/andrejmiscic/NLP-workshop/master/Data/imdb_train.csv"

df = pd.read_csv(path_to_train_csv)
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated in recent pandas
print(df.head())

Notice that our reviews fall into two polarity classes: *positive* and *negative*. This is therefore a binary sequence classification task. Below we implement a dataset holder and a collate function to combine the batch.

In [None]:
class TextClassificationDataset(Dataset):
    def __init__(self, inputs, labels, tokenizer, max_len):
        super(TextClassificationDataset, self).__init__()

        # encodes the inputs to input_ids and attention_mask
        encoded_inputs = tokenizer(inputs, max_length=max_len, padding="max_length", truncation=True)
        self.data = list(zip(encoded_inputs["input_ids"], encoded_inputs["attention_mask"], labels))

    def __getitem__(self, i):
        return self.data[i]

    def __len__(self):
        return len(self.data)

def collate_batch_to_tensors(inputs):
    batch = {"input_ids": torch.tensor([dat[0] for dat in inputs], dtype=torch.long),
            "attention_mask": torch.tensor([dat[1] for dat in inputs], dtype=torch.long),
            "labels": torch.tensor([dat[2] for dat in inputs], dtype=torch.long)}
    return batch
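
To make the roles of `truncation=True`, `padding="max_length"` and `attention_mask` more concrete, here is a minimal toy sketch that does not use the real tokenizer. The helper `pad_and_mask`, the token ids and `pad_id=0` are assumptions made for illustration; the actual tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
from typing import Dict, List


def pad_and_mask(token_ids: List[int], max_len: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Truncate or right-pad a list of token ids to exactly max_len entries."""
    kept = token_ids[:max_len]    # rough analogue of truncation=True
    n_pad = max_len - len(kept)   # rough analogue of padding="max_length"
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# illustrative ids only, not produced by a real tokenizer
short = pad_and_mask([101, 2307, 3185, 102], max_len=6)
long = pad_and_mask([101, 1037, 2200, 2146, 3319, 102], max_len=4)
print(short)
print(long)
```

Because every example ends up with the same length, the collate function can stack the lists directly into tensors.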

## Model

As with Named Entity Recognition, we are working with DistilBERT, a smaller model than base BERT that is trained by knowledge distillation and retains most of BERT's performance. As mentioned during the lectures, BERT has a special token (*CLS*) whose representation we use as input to a classifier. During pretraining, this representation is trained on the task of next sentence prediction, so out of the box it is not useful as a sequence representation. That's where fine-tuning comes in: we train a classifier together with the pretrained BERT model to achieve good performance.

An architecture for sequence classification is already implemented in the *transformers* library: [*DistilBertForSequenceClassification*](https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification), but for demonstration purposes we reimplement DistilBERT with a classification head below.

In [None]:
class DistilBertTextClassificationModel(nn.Module):
    def __init__(self, bert_config, num_classes, dropout_prob=0.1):
        super(DistilBertTextClassificationModel, self).__init__()

        self.bert = DistilBertModel(bert_config)
        self.dropout = nn.Dropout(dropout_prob)
        self.classification_layer = nn.Linear(in_features=bert_config.hidden_size, out_features=num_classes)

    def forward(self, input_ids, attention_mask=None, labels=None):
        outputs = self.bert(input_ids, attention_mask)[0]
        cls = outputs[:, 0, :]  # cls is the first token of the sequence
        cls = self.dropout(cls)  # to mitigate overfitting
        logits = self.classification_layer(cls)  # classify

        if labels is None:
            return logits

        # compute the loss
        loss = nn.CrossEntropyLoss()(logits, labels)

        return (loss, logits)

    def load(self, path_to_dir):
        self.bert = DistilBertModel.from_pretrained(path_to_dir)
        model_path = os.path.join(path_to_dir, "model.tar")
        if os.path.exists(model_path):
            checkpoint = torch.load(model_path, map_location="cpu")  # load on CPU so this also works without a GPU
            self.dropout.load_state_dict(checkpoint["dropout"])
            self.classification_layer.load_state_dict(checkpoint["cls"])
        else:
            print("No model.tar in provided directory, only loading bert model.")

    def save_pretrained(self, path_to_dir):
        self.bert.save_pretrained(path_to_dir)
        torch.save(
            {"dropout": self.dropout.state_dict(), "cls": self.classification_layer.state_dict()},
            os.path.join(path_to_dir, "model.tar")
        )
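
The loss in `forward` is the cross-entropy between the classifier's logits and the true label. As a rough illustration of what that number means, here is a pure-Python sketch for a single example. The helper names `softmax` and `cross_entropy` are ours, and `nn.CrossEntropyLoss` additionally averages over the batch by default.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], label: int) -> float:
    """Negative log-probability that softmax assigns to the true class."""
    return -math.log(softmax(logits)[label])


# index 0 = negative, index 1 = positive, as in sentiment_to_label
print(cross_entropy([0.0, 0.0], label=1))   # undecided model: loss equals log(2)
print(cross_entropy([-3.0, 3.0], label=1))  # confident and correct: loss close to 0
print(cross_entropy([-3.0, 3.0], label=0))  # confident and wrong: large loss
```

A freshly initialized classification head should start near the undecided case, so a training loss around 0.69 in the first steps is expected for two classes.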

Below we implement the Trainer class that contains the main train loop.

In [None]:
class Trainer:
  def __init__(self, model):
    self.model = model

  def train(self, train_dataset, val_dataset, device, run_config):
    self.model = self.model.to(device)
    # create output folder if it doesn't yet exist
    if not os.path.isdir(run_config.output_dir): 
      os.makedirs(run_config.output_dir)
    
    # train dataloader will serve us the training data in batches
    train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), 
                                  batch_size=run_config.batch_size, collate_fn=run_config.collate_fn)
    
    # optimizer and scheduler that modifies the learning rate during the training
    optimizer = AdamW(self.model.parameters(), lr=run_config.learning_rate)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=run_config.num_warmup_steps,
                                                num_training_steps=len(train_dataloader)*run_config.num_epochs)
    
    print("Training started:")
    print(f"\tNum examples = {len(train_dataset)}")
    print(f"\tNum Epochs = {run_config.num_epochs}")

    global_step = 0  # to save after every save_steps if save_steps is >= 0

    train_iterator = trange(0, int(run_config.num_epochs), desc="Epoch")
    for epoch in train_iterator:
      epoch_iterator = tqdm(train_dataloader, desc="Iteration", position=0, leave=True)
      self.model.train()
      epoch_losses = []
      for step, inputs in enumerate(epoch_iterator):
        # move batch to GPU
        if isinstance(inputs, dict):
            for k, v in inputs.items():
                inputs[k] = v.to(device)
        else:
            inputs = inputs.to(device)

        # forward pass - model also outputs a computed loss
        outputs = self.model(**inputs)
        loss = outputs[0]

        epoch_losses.append(loss.item())

        # backward pass - backpropagation
        self.model.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

        epoch_iterator.set_description(f"Training loss = {loss.item():.4f}")

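        if run_config.save_steps > -1 and global_step > 0 and global_step % run_config.save_steps == 0:
          # name the checkpoint by global_step: `step` restarts every epoch and would overwrite earlier saves
          output_dir = os.path.join(run_config.output_dir, f"Step_{global_step}")
          self.model.save_pretrained(output_dir)
          test_loss = self.evaluate(self.model, val_dataset, device, run_config)
          print(f"After step {global_step + 1}: val loss = {test_loss}")
          self.model.train()  # evaluate() switched the model to eval mode; re-enable dropout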

        global_step += 1
      
      if run_config.save_each_epoch:
        output_dir = os.path.join(run_config.output_dir, f"Epoch_{epoch + 1}")
        self.model.save_pretrained(output_dir)  # use the trainer's model, not the notebook-level global
        test_loss = self.evaluate(self.model, val_dataset, device, run_config)
        print(f"After epoch {epoch + 1}: val loss ={test_loss}")

  def evaluate(self, model, test_dataset, device, run_config):
    test_dataloader = DataLoader(test_dataset, sampler=SequentialSampler(test_dataset),
                                 batch_size=run_config.batch_size, collate_fn=run_config.collate_fn)
    self.model.eval()
    losses = []
    for inputs in tqdm(test_dataloader, desc="Evaluating", position=0, leave=True):
      # move batch to GPU
      if isinstance(inputs, dict):
        for k, v in inputs.items():
          inputs[k] = v.to(device)
      else:
        inputs = inputs.to(device)

      with torch.no_grad():
        loss = model(**inputs)[0]
      losses.append(loss.item())

    return np.mean(losses)
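
The scheduler used in `train` changes the learning rate on every optimizer step. The sketch below mirrors the shape we understand `get_linear_schedule_with_warmup` to produce (a linear ramp-up followed by a linear decay to zero); it is our own illustration, not the library code, and `linear_warmup_factor` and the step counts are made-up names and values.

```python
def linear_warmup_factor(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # ramp up linearly from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # afterwards decay linearly, reaching 0 at the final step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 3e-5
total_steps = 10  # hypothetical; in the Trainer this is len(train_dataloader) * num_epochs
for step in range(total_steps + 1):
    factor = linear_warmup_factor(step, num_warmup_steps=2, num_training_steps=total_steps)
    print(f"step {step:2d}: lr = {base_lr * factor:.2e}")
```

With the default `num_warmup_steps = 1` in `RunConfig`, warmup is effectively skipped and the learning rate simply decays linearly over training.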

*RunConfig* holds the parameters for training/testing:

In [None]:
@dataclass
class RunConfig:
  learning_rate: float
  batch_size: int
  num_epochs: int
  num_warmup_steps: int = 1
  save_steps: int = -1
  save_each_epoch: bool = True
  output_dir: str = "/content/"
  collate_fn: None = None

## Training

We have now implemented everything we need to start fine-tuning. We can save the fine-tuned models to our Colab instance (available under `/content/`) or we can connect our Google Drive to Colab and use it as external storage. If you want to do the latter, run the cell below and follow the instructions.

In [None]:
from google.colab import drive
drive.mount("/content/drive/")

In [None]:
run_config = RunConfig(
    learning_rate = 3e-5,
    batch_size = 4,  # start with 32 and decrease if you get CUDA out of memory exception
    num_epochs = 3,
    save_each_epoch = True,
    output_dir = "/content/drive/My Drive/NLP-workshop/BERT-sentiment/",
    collate_fn = collate_batch_to_tensors
)

In [None]:
path_to_train_csv = "https://raw.githubusercontent.com/andrejmiscic/NLP-workshop/master/Data/imdb_train.csv"

sentiment_to_label = {"negative": 0, "positive": 1}
label_to_sentiment = {0: "negative", 1: "positive"}

df = pd.read_csv(path_to_train_csv)
df["label"] = df["sentiment"].map(sentiment_to_label)
train_df, val_df = train_test_split(df, test_size=0.2, stratify=df["label"])
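
The `stratify` argument asks for a split in which both parts keep the same class proportions as the full dataset. To show the idea without scikit-learn, here is a small pure-Python sketch. `stratified_split` is a hypothetical helper written for this illustration and is not how `train_test_split` is implemented.

```python
import random
from collections import Counter, defaultdict
from typing import List, Sequence, Tuple


def stratified_split(labels: Sequence[int], test_size: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_indices, val_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)  # take the same fraction from every class
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


labels = [0] * 50 + [1] * 50  # toy, perfectly balanced labels
train_idx, val_idx = stratified_split(labels, test_size=0.2)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in val_idx))
```

Keeping the class balance identical in both parts means the validation loss is measured on the same label distribution the model was trained on.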

In [None]:
max_len = 512  # max length of input, pretrained model only supports max_len up to 512, use smaller values for faster training

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased", do_lower_case=True)

train_dataset = TextClassificationDataset(train_df["review"].tolist(), train_df["label"].tolist(), tokenizer, max_len)
val_dataset = TextClassificationDataset(val_df["review"].tolist(), val_df["label"].tolist(), tokenizer, max_len)

Instantiate the model and start training!

In [None]:
model = DistilBertTextClassificationModel(DistilBertConfig.from_pretrained("distilbert-base-uncased"), num_classes=2)
model.load("distilbert-base-uncased")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

In [None]:
trainer = Trainer(model)
trainer.train(train_dataset, val_dataset, device, run_config)

If you happen to get a CUDA out of memory exception, do the following:
- cause another exception so Python doesn't hold any references to the trainer or model, e.g. run the cell below that raises a ZeroDivisionError
- run the cell below that empties GPU cache
- decrease the batch_size in run_config and rerun that cell
- reinstantiate the model and rerun training

In [None]:
1/0

In [None]:
model = None
trainer = None
gc.collect()
torch.cuda.empty_cache()
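
For the "decrease the batch_size" step, one option is to derive a new config from the old one instead of editing the cell by hand. The sketch below uses `dataclasses.replace` on a stand-in class; `ToyRunConfig` and `with_smaller_batch` are hypothetical names, and the same call should work on any dataclass with a `batch_size` field.

```python
from dataclasses import dataclass, replace


@dataclass
class ToyRunConfig:  # hypothetical stand-in that reuses a few RunConfig field names
    learning_rate: float
    batch_size: int
    num_epochs: int


def with_smaller_batch(config, factor: int = 2):
    """Return a copy of the config with batch_size divided by `factor` (never below 1)."""
    return replace(config, batch_size=max(1, config.batch_size // factor))


config = ToyRunConfig(learning_rate=3e-5, batch_size=32, num_epochs=3)
smaller = with_smaller_batch(config)
print(config.batch_size, smaller.batch_size)
```

The original config is left untouched, so it stays easy to see which batch size a given run actually used.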

## Evaluation

With this procedure we've now fine-tuned a model to predict the polarity of a review. For the purposes of this workshop we've fine-tuned a model in advance so we can analyze it. Load it by running the cell below.

In [None]:
!mkdir /content/bert-imdb
!gdown https://drive.google.com/uc?id=10itTQ54hZd7G4t66KomAcjyMbfNJSSrm
!gdown https://drive.google.com/uc?id=10hEbs1zAeOBcG-wNLgSBS4iO3Aby2YoX
!gdown -O /content/bert-imdb/config.json https://drive.google.com/uc?id=1-vFT2MCAep0MHKgPOjpwYin0HOOsVhPt
!gdown -O /content/bert-imdb/model.tar https://drive.google.com/uc?id=1-xMaX3dYJ2cal1LsQhlnex9jXpHWHIL8
!gdown -O /content/bert-imdb/pytorch_model.bin https://drive.google.com/uc?id=1-pklw-2XLo3IKbl59ybuZnqz2FwOYaDn

Let's instantiate all the objects we need for evaluation: model, dataset, tokenizer, etc.

In [None]:
path_to_test_csv = "https://raw.githubusercontent.com/andrejmiscic/NLP-workshop/master/Data/imdb_test.csv"

sentiment_to_label = {"negative": 0, "positive": 1}
label_to_sentiment = {0: "negative", 1: "positive"}

df = pd.read_csv(path_to_test_csv)
df["label"] = df["sentiment"].map(sentiment_to_label)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased", do_lower_case=True)
test_dataset = TextClassificationDataset(df["review"].tolist(), df["label"].tolist(), tokenizer, 512)

model = DistilBertTextClassificationModel(DistilBertConfig.from_pretrained("distilbert-base-uncased"), num_classes=2)
model.load("/content/bert-imdb/")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

The following code will compute log loss and accuracy on the test set.

In [None]:
def evaluate(model, test_dataset, device, run_config):
    test_dataloader = DataLoader(test_dataset, sampler=SequentialSampler(test_dataset),
                                 batch_size=run_config.batch_size, collate_fn=run_config.collate_fn)
    model.eval()
    ce_losses, acc_losses = [], []
    with torch.no_grad():
      for inputs in tqdm(test_dataloader, desc="Evaluating", position=0, leave=True):
          # move batch to GPU
          if isinstance(inputs, dict):
            for k, v in inputs.items():
              inputs[k] = v.to(device)
          else:
            inputs = inputs.to(device)

          loss, logits_y = model(**inputs)
          ce_losses.append(loss.item())
          # argmax over logits gives the same class as argmax over softmax; no squeeze, so a final batch of size 1 still works
          pred_y = logits_y.argmax(dim=1).cpu().numpy()
          true_y = inputs["labels"].cpu().numpy()
          acc_losses.append(np.mean(pred_y == true_y))

    return np.mean(ce_losses), np.mean(acc_losses)
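
One detail worth knowing about the function above: it averages per-batch accuracies, so a smaller final batch carries as much weight as a full one. The toy sketch below contrasts that with counting correct predictions over all examples. Both helper names and the data are made up for illustration.

```python
from typing import List, Sequence, Tuple

Batch = Tuple[List[int], List[int]]  # (predictions, true labels)


def batch_mean_accuracy(batches: Sequence[Batch]) -> float:
    """Average of per-batch accuracies: every batch counts equally."""
    per_batch = [sum(p == t for p, t in zip(pred, true)) / len(true) for pred, true in batches]
    return sum(per_batch) / len(per_batch)


def example_level_accuracy(batches: Sequence[Batch]) -> float:
    """Correct predictions divided by total examples: every example counts equally."""
    correct = sum(sum(p == t for p, t in zip(pred, true)) for pred, true in batches)
    total = sum(len(true) for _, true in batches)
    return correct / total


# one full batch of 4 (all correct) and a final batch of 1 (wrong)
batches = [([1, 0, 1, 1], [1, 0, 1, 1]), ([0], [1])]
print(batch_mean_accuracy(batches))     # 0.5
print(example_level_accuracy(batches))  # 0.8
```

When the test set size is a multiple of the batch size the two numbers coincide, and with many batches the difference is usually tiny.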

In [None]:
log_loss, accuracy = evaluate(model, test_dataset, device, run_config)
print(f"\nTest log loss = {log_loss:.4f}\nTest accuracy = {accuracy:.4f}")

Nice, we achieve a relatively good accuracy of 0.93. We can now experiment with the model and write some custom reviews.

In [None]:
def predict_review_sentiment(review: str):
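  enc = tokenizer(review, truncation=True, max_length=512)  # the model accepts at most 512 tokens
  inputs = {"input_ids": torch.tensor(enc["input_ids"], dtype=torch.long).unsqueeze(0).to(device),
            "attention_mask": torch.tensor(enc["attention_mask"], dtype=torch.long).unsqueeze(0).to(device)}
  model.eval()  # make sure dropout is disabled at inference time
  with torch.no_grad():
    # int() turns the NumPy integer into a plain Python int for the dictionary lookup
    prediction = int(np.argmax(nn.functional.softmax(model(**inputs), dim=1).cpu().numpy()))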
  print(review)
  print(f"Sentiment: {label_to_sentiment.get(prediction)}")