## The dataset
[IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/)

This is a dataset for binary sentiment classification, where each example contains a movie review along with its positive or negative sentiment label.

We split the dataset into train (80%), validate (10%), and test (10%) sets. 

## Finetuning BERT

`bert-base-uncased` is one of the pre-trained models from the BERT (Bidirectional Encoder Representations from Transformers) family developed by Google AI. It is a widely used variant of BERT, pre-trained on lowercased English text from BooksCorpus and English Wikipedia.

Note that different BERT model variants may support different maximum sequence lengths, so check which limit applies before choosing `max_length`. For `bert-base-uncased` the limit is 512 tokens; this notebook uses a shorter `max_length` of 128.

Hyperparameter choices in this notebook:
- Maximum Sequence Length (`max_length`) = 128
- Batch size (`batch_size`) = 64
- Optimizer: `AdamW`
- Learning rate (`lr`) = 2e-5
- Weight decay (`weight_decay`) = 0.01
- Number of epochs (`epochs`) = 5

## Optimization

[`Adam`](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) and [`AdamW`](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) share the same adaptive update rule but differ in one important respect: how weight decay is applied during training.

- **`Adam`**: In PyTorch's `Adam`, a nonzero `weight_decay` acts as L2 regularization: the term `weight_decay * weight` is added to the gradient before the moment estimates are computed. The decay is therefore rescaled by Adam's adaptive per-parameter step size, so parameters with large gradient magnitudes are decayed less than the coefficient suggests.
- **`AdamW`**: `AdamW` is a variant of `Adam` that decouples weight decay from the gradient-based update. At each step the weights are shrunk directly by `lr * weight_decay * weight`, independently of the moment estimates, which makes the effect of `weight_decay` more straightforward and predictable. This decoupling is considered a more mathematically principled approach and is less prone to incorrect weight decay scaling.
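
To make the difference concrete, here is a minimal pure-Python sketch of a single update step for one scalar weight under each rule. The function names `adam_l2_step` and `adamw_step` and the toy values are illustrative assumptions, not PyTorch's actual implementation; the formulas follow the update rules described in the PyTorch documentation for `Adam` and `AdamW`.

```python
import math

def adam_l2_step(p, g, m, v, t, lr, wd, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step with weight decay folded into the gradient (L2 style)."""
    g = g + wd * p                       # decay term enters the gradient
    m = b1 * m + (1 - b1) * g            # first-moment estimate
    v = b2 * v + (1 - b2) * g * g        # second-moment estimate
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

def adamw_step(p, g, m, v, t, lr, wd, b1=0.9, b2=0.999, eps=1e-8):
    """One AdamW step: the weight is shrunk directly, outside the moment estimates."""
    p = p * (1 - lr * wd)                # decoupled weight decay
    m = b1 * m + (1 - b1) * g            # moments see only the task gradient
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# Toy case (hypothetical values): zero task gradient, so only weight decay moves the weight.
p_adam, _, _ = adam_l2_step(p=1.0, g=0.0, m=0.0, v=0.0, t=1, lr=0.1, wd=0.01)
p_adamw, _, _ = adamw_step(p=1.0, g=0.0, m=0.0, v=0.0, t=1, lr=0.1, wd=0.01)
print(p_adam, p_adamw)
```

In this toy case the coupled rule moves the weight by roughly the full learning rate (to about 0.9), because Adam normalizes the decay term like any other gradient. The decoupled rule shrinks the weight by exactly `lr * wd` (to 0.999).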

In practice, the choice between `Adam` and `AdamW` depends on the specific task and model architecture. When working with models like `BERT` or other transformers, `AdamW` is often preferred due to its more reliable handling of weight decay. However, for simpler models and tasks, `Adam` may still be suitable.

Here, we use `AdamW` with the default value for `weight_decay`, which is 0.01.

## Results
We obtain these metrics on the last epoch:
- `epoch 5: Avg Loss 0.0420, Train Acc 0.9864, Val Acc 0.8902, Test Acc: 0.8874`

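Looking at the full results (scroll down), validation accuracy plateaus at roughly 0.895 to 0.897 from the 2nd epoch onward while training accuracy keeps rising, so we could perhaps "early-stop" after the 2nd epoch to limit overfitting.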
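
As a rough illustration of that idea, the sketch below applies a simple patience-based early-stopping rule to the validation accuracies logged by the training loop further down. The helper `early_stop_epoch` and the `patience` and `min_delta` values are hypothetical choices for this example; the notebook itself trains for all 5 epochs.

```python
def early_stop_epoch(val_scores, patience=1, min_delta=0.005):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once the validation score has failed to improve on the
    best score by at least `min_delta` for `patience` consecutive epochs.
    """
    best_score = float("-inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score - best_score >= min_delta:
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_scores)

# Validation accuracies for epochs 1..5, copied from the training output
val_acc = [0.8794, 0.8950, 0.8962, 0.8966, 0.8902]

best_epoch, stop_epoch = early_stop_epoch(val_acc)
print(f"best epoch: {best_epoch}, training stopped after epoch: {stop_epoch}")
```

With these assumed settings the rule keeps the epoch-2 model and stops after epoch 3. Setting `min_delta=0` would instead keep epoch 4, which has the highest validation accuracy by a small margin.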


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from tqdm import tqdm

In [2]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertConfig, BertForSequenceClassification, BertTokenizer

## Define device (CPU or GPU)
This notebook was run on a MacBook Pro with an Apple M2 Max chip, so the device is set to `"mps"` to use the GPU. On other hardware, use:

```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
```


In [3]:
device = torch.device("mps")

## Load the dataset of movie reviews

In [4]:
df = pd.read_csv("data/IMDB Dataset.csv")
print(f'Number of examples is: {len(df)}')
df.head()

Number of examples is: 50000


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
# df = df.sample(n = 1000, random_state=42)
df['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

## Load BERT tokenizer and model


In [6]:
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)  # 2 labels: positive and negative
# The informational message below indicates that the classification head
# (classifier.weight and classifier.bias) is newly, randomly initialized.

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
# Inspect the model architecture
print(model)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [8]:
# Inspect the maximum sequence length
tokenizer.model_max_length

512

In [9]:
# Inspect the dropout rate applied to the hidden layers of BERT
bert_config = BertConfig.from_pretrained(model_name)
bert_config.hidden_dropout_prob

0.1

## Tokenize reviews

Let's break down the following code:
```python
tokenizer.encode_plus(x, add_special_tokens=True, truncation=True, padding='max_length', max_length=128, return_attention_mask=True)
```

- `add_special_tokens=True`: This parameter instructs the tokenizer to add special tokens to the encoded sequence. For BERT, these are `[CLS]` (the classification token) at the beginning of the sequence and `[SEP]` (the separator token) at the end. The model's output at the `[CLS]` position is what the classification head reads as a summary of the whole sequence, and `[SEP]` marks the end of a segment (and separates the two segments when a pair of texts is encoded).

- `truncation=True`: This parameter indicates that a sequence longer than the specified `max_length` should be truncated to fit within the limit. BERT itself cannot process more than 512 tokens; this notebook uses a tighter limit of 128 tokens.

- `padding='max_length'`: This parameter specifies how to handle padding of sequences. When set to `'max_length'`, the tokenizer pads every sequence to the specified `max_length` with the special padding token (`[PAD]`). Padding is necessary because all sequences in a batch must have the same length to be stacked into a single tensor.

- `max_length=128`: This argument specifies the length of the output sequence after encoding. With truncation enabled, longer inputs are cut to this length; with `padding='max_length'`, shorter inputs are padded up to it.

- `return_attention_mask=True`: This argument instructs the tokenizer to return an attention mask along with the encoded input. The attention mask is a binary mask with 1 for real tokens and 0 for padding tokens, so the model attends to the actual content of the sequence and ignores the padding.
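
The combined effect of these options can be sketched in plain Python on a list of token ids. This is a simplified illustration, not the tokenizer's implementation: the helper `pad_or_truncate` is hypothetical, the content ids in the examples are arbitrary, and only the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) come from the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased

def pad_or_truncate(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, pad, and build the attention mask."""
    content = token_ids[: max_length - 2]            # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + content + [SEP_ID]
    attention_mask = [1] * len(input_ids)            # 1 = real token
    n_pad = max_length - len(input_ids)
    input_ids = input_ids + [PAD_ID] * n_pad
    attention_mask = attention_mask + [0] * n_pad    # 0 = padding
    return input_ids, attention_mask

# A short sequence is padded up to max_length
short_ids, short_mask = pad_or_truncate([2023, 3185, 2001, 2307], max_length=8)
print(short_ids)   # [101, 2023, 3185, 2001, 2307, 102, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]

# A long sequence is truncated so that [CLS] + content + [SEP] fits exactly
long_ids, long_mask = pad_or_truncate(list(range(1000, 1010)), max_length=8)
print(long_ids)    # [101, 1000, 1001, 1002, 1003, 1004, 1005, 102]
print(long_mask)   # [1, 1, 1, 1, 1, 1, 1, 1]
```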

In [10]:
# Define a function to tokenize and encode the text and return both input_ids and attention_mask
def tokenize_and_encode(text):
    encoding = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        truncation=True,
        padding='max_length',
        max_length=128,
        return_attention_mask=True,  # Return attention_mask
    )
    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']
    return input_ids, attention_mask

tqdm.pandas()

# Apply the function to the "review" column
df[['input_ids', 'attention_mask']] = df['review'].progress_apply(lambda x: tokenize_and_encode(x)).apply(pd.Series)

# Convert sentiment labels to integers (positive -> 1, negative -> 0)
df["sentiment_bool"] = df["sentiment"].apply(lambda x: 1 if x == "positive" else 0)

df.head()

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 50000/50000 [01:53<00:00, 439.80it/s]


Unnamed: 0,review,sentiment,input_ids,attention_mask,sentiment_bool
0,One of the other reviewers has mentioned that ...,positive,"[101, 2028, 1997, 1996, 2060, 15814, 2038, 385...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",1
1,A wonderful little production. <br /><br />The...,positive,"[101, 1037, 6919, 2210, 2537, 1012, 1026, 7987...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",1
2,I thought this was a wonderful way to spend ti...,positive,"[101, 1045, 2245, 2023, 2001, 1037, 6919, 2126...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",1
3,Basically there's a family where a little boy ...,negative,"[101, 10468, 2045, 1005, 1055, 1037, 2155, 207...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"[101, 9004, 3334, 4717, 7416, 1005, 1055, 1000...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",1


## Train / Validate / Test split

In [11]:
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)
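
The two-stage split above first holds out 20% of the data and then halves that hold-out set, which yields the 80/10/10 proportions. A small sketch of the arithmetic follows; the helper `split_sizes` is hypothetical, and the rounding rule (the held-out part is rounded up) is an assumption about `train_test_split` that does not matter for these round numbers.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size."""
    n_held_out = math.ceil(test_size * n_samples)  # assumed rounding rule
    return n_samples - n_held_out, n_held_out

n_train, n_temp = split_sizes(50_000, 0.2)   # 80% train, 20% held out
n_val, n_test = split_sizes(n_temp, 0.5)     # held-out part split in half
print(n_train, n_val, n_test)
```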

## Convert data to PyTorch tensors

In [12]:
def get_tensor_data(data):
    out_input_ids = torch.tensor(data["input_ids"].tolist())
    out_attention_masks = torch.tensor(data["attention_mask"].tolist())
    out_labels = torch.tensor(data["sentiment_bool"].tolist())
    return out_input_ids, out_attention_masks, out_labels

train_input_ids, train_attention_masks, train_labels = get_tensor_data(train_df)
val_input_ids, val_attention_masks, val_labels = get_tensor_data(val_df)
test_input_ids, test_attention_masks, test_labels = get_tensor_data(test_df)

## Create DataLoader for efficient batch processing
With `shuffle=True`, the `DataLoader` draws a fresh random permutation of the examples each time it is iterated, that is, at the start of every epoch. Within an epoch each example is visited exactly once, but the order of examples, and therefore the composition of the batches, differs from epoch to epoch. This is common practice in training deep learning models because it keeps batches from being systematically correlated with the original order of the data. The validation and test loaders are left unshuffled, since order does not affect the evaluation metrics.
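
The per-epoch reshuffling can be emulated with the standard library alone. The sketch below is not how `DataLoader` is implemented; `epoch_batches` is a hypothetical helper that shows the behavior: a new permutation per epoch, with every example appearing exactly once.

```python
import random

def epoch_batches(num_examples, batch_size, rng):
    """Return the index batches for one epoch, using a fresh random order."""
    order = list(range(num_examples))
    rng.shuffle(order)                      # new permutation for this epoch
    return [order[i:i + batch_size] for i in range(0, num_examples, batch_size)]

rng = random.Random(0)
epoch_1 = epoch_batches(num_examples=10, batch_size=4, rng=rng)
epoch_2 = epoch_batches(num_examples=10, batch_size=4, rng=rng)

print(epoch_1)
print(epoch_2)
```

Each epoch produces batches of sizes 4, 4, and 2 that together cover all ten indices, in a different order each time.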

In [13]:
train_dataset = TensorDataset(train_input_ids, train_attention_masks, train_labels)
val_dataset = TensorDataset(val_input_ids, val_attention_masks, val_labels)
test_dataset = TensorDataset(test_input_ids, test_attention_masks, test_labels)

In [14]:
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

## Define optimizer and loss function


In [15]:
# Note: setting output_attentions and output_hidden_states to False can help with efficiency
# because it reduces the amount of additional information that the model needs to compute and
# store during forward passes. 
model = BertForSequenceClassification.from_pretrained(model_name, num_labels = 2, 
                                                      output_attentions = False, 
                                                      output_hidden_states = False)

# https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html
# weight_decay (float, optional) – weight decay coefficient (default: 1e-2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=1e-2)  # use default weight_decay
criterion = nn.CrossEntropyLoss() # binary classification

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
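
The `criterion` defined above takes the model's raw logits and the integer labels. As a rough pure-Python sketch of what that computation amounts to (the helper names `cross_entropy` and `predict` and the toy logits are assumptions for illustration, not PyTorch code), the loss is the mean negative log-softmax probability of the correct class, and the prediction is the index of the largest logit:

```python
import math

def cross_entropy(logits_batch, labels):
    """Mean of -log softmax(logits)[label] over the batch."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        mx = max(logits)  # subtract the max for numerical stability
        log_sum_exp = mx + math.log(sum(math.exp(z - mx) for z in logits))
        total += log_sum_exp - logits[label]
    return total / len(labels)

def predict(logits_batch):
    """Index of the largest logit for each example."""
    return [max(range(len(logits)), key=logits.__getitem__) for logits in logits_batch]

# Toy batch of two examples with two classes (0 = negative, 1 = positive)
logits = [[0.5, 1.5], [2.0, -1.0]]
labels = [1, 0]

loss = cross_entropy(logits, labels)
preds = predict(logits)
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(f"loss={loss:.4f}, preds={preds}, accuracy={accuracy}")
```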


In [16]:
model.config.hidden_dropout_prob

0.1

## Model training and validation

In [17]:
# Move model to the device (GPU), which is 'mps' on Mac M2
model.to(device)


In [18]:
epochs = 5
torch.manual_seed(12345)
num_batches = len(train_loader)  # constant across epochs, so compute it once
for epoch in tqdm(range(epochs)):
    # train
    model.train()
    total_loss = 0.0
    correct = 0
    for batch in train_loader:
        inputs, attention_mask, labels = batch
        inputs, attention_mask, labels = inputs.to(device), attention_mask.to(device), labels.to(device)
        optimizer.zero_grad(set_to_none=True)
        # Labels are not passed to the model: the loss is computed once, by `criterion`
        outputs_dict = model(inputs, attention_mask=attention_mask)
        loss = criterion(outputs_dict.logits, labels)
        predicted = outputs_dict.logits.argmax(dim=1)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        # Increment the number of correct predictions
        correct += (predicted == labels).type(torch.float).sum().item()
    train_acc = correct / len(train_loader.dataset)
    avg_train_loss = total_loss / num_batches

    # validate
    val_correct = 0
    model.eval()
    with torch.no_grad():
        for batch in val_loader:
            inputs, attention_mask, labels = batch
            inputs, attention_mask, labels = inputs.to(device), attention_mask.to(device), labels.to(device)
            outputs_dict = model(inputs, attention_mask=attention_mask)
            val_predicted = outputs_dict.logits.argmax(dim=1)
            val_correct += (val_predicted == labels).type(torch.float).sum().item()
    val_acc = val_correct / len(val_loader.dataset)

    print(f'epoch {epoch + 1}: Avg Loss {avg_train_loss:.4f}, Train Acc {train_acc:.4f}, Val Acc {val_acc:.4f}')

 20%|██████████████████████▌                                                                                          | 1/5 [10:46<43:04, 646.14s/it]

epoch 1: Avg Loss 0.3238, Train Acc 0.8562, Val Acc 0.8794


 40%|█████████████████████████████████████████████▏                                                                   | 2/5 [21:34<32:22, 647.61s/it]

epoch 2: Avg Loss 0.2009, Train Acc 0.9195, Val Acc 0.8950


 60%|███████████████████████████████████████████████████████████████████▊                                             | 3/5 [32:16<21:29, 644.88s/it]

epoch 3: Avg Loss 0.1201, Train Acc 0.9565, Val Acc 0.8962


 80%|██████████████████████████████████████████████████████████████████████████████████████████▍                      | 4/5 [42:57<10:43, 643.23s/it]

epoch 4: Avg Loss 0.0660, Train Acc 0.9774, Val Acc 0.8966


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [53:34<00:00, 642.86s/it]

epoch 5: Avg Loss 0.0420, Train Acc 0.9864, Val Acc 0.8902





In [19]:
model.eval()
test_correct = 0
with torch.no_grad():
    for batch in test_loader:
        inputs, attention_mask, labels = batch
        inputs, attention_mask, labels = inputs.to(device), attention_mask.to(device), labels.to(device)
        outputs_dict = model(inputs, attention_mask=attention_mask, labels=labels)
        test_predicted = outputs_dict.logits.argmax(dim=1)
        test_correct += (test_predicted == labels).type(torch.float).sum().item()
test_acc = test_correct / len(test_loader.dataset)

print(f'epoch {epoch + 1}: Avg Loss {avg_train_loss:.4f}, Train Acc {train_acc:.4f}, Val Acc {val_acc:.4f}, Test Acc: {test_acc:.4f}')          


epoch 5: Avg Loss 0.0420, Train Acc 0.9864, Val Acc 0.8902, Test Acc: 0.8874
