# IMDB Sentiment Analysis

## Overview

This notebook builds a sentiment classifier on top of a pretrained DistilBERT model. We use the IMDb movie reviews dataset, loaded through the Hugging Face `datasets` library, for our training and validation data. The output is a classifier that takes a text input and labels it 'Positive' or 'Negative'.

In [3]:
%%capture 
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from datasets import load_dataset
from tqdm import tqdm

from transformers import AutoModelForSequenceClassification, AutoTokenizer

### Text Encoder
First we build a text encoder that takes a sequence of input IDs (produced by the DistilBERT tokenizer) and outputs embeddings of a desired dimension (512 in this case).
Some notes about the encoder implementation:
* We use the pretrained 'distilbert-base-uncased' model rather than training a language model from scratch.
* The projection layer maps DistilBERT's output embeddings to our desired dimension.
* We use the embedding of the first token ([CLS]), which serves as a summary of the whole sequence.
* We apply L2 normalization so that each vector has unit length (norm = 1.0). This prepares our vectors for cosine similarity comparisons with other normalized vectors.

In [4]:
class TextEncoder(nn.Module):
    def __init__(
        self,
        model_name="distilbert-base-uncased",
        embedding_dim=512,
        freeze_bert=False,
    ):
        super().__init__()

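        # loads DistilBERT with a sequence-classification head; forward() only
        # reads the hidden states, so that head goes unused
        self.bert = AutoModelForSequenceClassification.from_pretrained(model_name)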

        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False

        self.projection = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, embedding_dim),
            nn.LayerNorm(embedding_dim),
            nn.Dropout(0.1),
        )

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )

        # use [CLS] token
        cls_embedding = outputs.hidden_states[-1][:, 0]

        # project to desired dimension
        text_embedding = self.projection(cls_embedding)

        # l2 normalize
        text_embedding = F.normalize(text_embedding, p=2, dim=1)

        return text_embedding
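
The link between L2 normalization and cosine similarity can be checked with a small sketch. It uses plain Python rather than PyTorch so the arithmetic is visible; the helper names and the toy vectors are hypothetical and are not part of the notebook. Unlike `F.normalize`, this version does not guard against a zero-length vector.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector so that its L2 norm is 1.0."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# hypothetical toy "embeddings"
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

a_unit = l2_normalize(a)
b_unit = l2_normalize(b)

# after normalization each vector has unit length ...
print(math.sqrt(dot(a_unit, a_unit)), math.sqrt(dot(b_unit, b_unit)))

# ... so a plain dot product gives the cosine similarity of the originals
print(cosine_similarity(a, b), dot(a_unit, b_unit))
```

Once vectors are stored in normalized form, comparing them only needs a dot product, which is why the encoder normalizes its output.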

### Text Classifier
Next we build the text classifier. It uses the encoder from the previous section to compute text embeddings for the IMDb reviews, with the DistilBERT weights frozen (`freeze_bert=True`) so that only the projection and classifier layers are trained. The embeddings are then passed through a linear layer with 2 outputs, one per sentiment class (Negative, Positive). Calling the classifier returns the logits.

In [5]:
class TextClassifier(nn.Module):
    def __init__(self, num_classes=2, embedding_dim=512, **kwargs):
        super().__init__()
        self.text_encoder = TextEncoder(embedding_dim=embedding_dim, freeze_bert=True)
        self.classifier = nn.Linear(embedding_dim, num_classes)

    def forward(self, input_ids, attention_mask=None):
        text_embeddings = self.text_encoder(input_ids, attention_mask)
        logits = self.classifier(text_embeddings)

        return logits
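
Because the encoder is created with `freeze_bert=True`, only the projection and the final linear layer receive gradient updates. The sketch below works out how many parameters that leaves, assuming a hidden size of 768 for 'distilbert-base-uncased'; the helper functions are hypothetical and only mirror the standard parameter counts of `nn.Linear` and `nn.LayerNorm`.

```python
hidden_size = 768    # assumed DistilBERT hidden size
embedding_dim = 512  # projection output, as in the notebook
num_classes = 2      # Negative / Positive


def linear_params(in_features, out_features):
    """Weight matrix plus bias vector of a linear layer."""
    return in_features * out_features + out_features


def layer_norm_params(features):
    """Per-feature scale and shift of a LayerNorm."""
    return 2 * features


# projection = Linear(768 -> 512) + LayerNorm(512); Dropout has no parameters
projection = linear_params(hidden_size, embedding_dim) + layer_norm_params(embedding_dim)

# classifier = Linear(512 -> 2)
classifier = linear_params(embedding_dim, num_classes)

trainable = projection + classifier
print(f"projection: {projection:,}")
print(f"classifier: {classifier:,}")
print(f"trainable:  {trainable:,}")
```

In the notebook itself, the same figure could be checked with `sum(p.numel() for p in model.parameters() if p.requires_grad)`.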

### Load Data
Load the IMDb movie reviews dataset and prepare it for use by our classifier.

In [6]:
def load_imdb_data(max_train_examples=20_000, max_test_examples=500):
    # Load dataset using Hugging Face datasets
    dataset = load_dataset("imdb")

    print("Dataset info: IMDB Reviews - 25k train, 25k test examples")
    print("Features: text (string), label (0=negative, 1=positive)")

    # convert to lists for handling
    train_examples = []
    for i, example in enumerate(dataset["train"]):
        if i >= max_train_examples:
            break
        train_examples.append((example["text"], example["label"]))

    test_examples = []
    for i, example in enumerate(dataset["test"]):
        if i >= max_test_examples:
            break
        test_examples.append((example["text"], example["label"]))

    print(f"Loaded {len(train_examples)} training examples")
    print(f"Loaded {len(test_examples)} test examples")

    # show examples
    print("\nExample training data:")
    for i in range(5):
        text, label = train_examples[i]
        sentiment = "Positive" if label == 1 else "Negative"
        print(f"{sentiment}: {text[:100]}")

    return train_examples, test_examples
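
The five sample rows printed in the run output are all Negative, which suggests the split may be ordered by label. If that is the case, taking the first N examples yields an unbalanced subset. The sketch below illustrates the effect on a hypothetical label-sorted list and shows one possible remedy, shuffling before truncating. The function names and the synthetic data are assumptions made for illustration.

```python
import random
from collections import Counter


def take_first(examples, n):
    """Keep the first n examples, as the loading loop above does."""
    return examples[:n]


def take_shuffled(examples, n, seed=42):
    """Shuffle a copy with a fixed seed, then keep the first n examples."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled[:n]


# hypothetical stand-in for a label-sorted split: 12,500 negatives, then 12,500 positives
examples = [(f"review {i}", 0) for i in range(12_500)]
examples += [(f"review {i}", 1) for i in range(12_500, 25_000)]

first_counts = Counter(label for _, label in take_first(examples, 20_000))
shuffled_counts = Counter(label for _, label in take_shuffled(examples, 20_000))

print("first 20k:   ", dict(first_counts))
print("shuffled 20k:", dict(shuffled_counts))
```

With Hugging Face datasets, calling `dataset["train"].shuffle(seed=...)` before iterating should achieve a similar result without building the list by hand.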

### Data Preprocessor
We preprocess the text data by passing the text inputs through a tokenizer. Each sequence is fixed at a length of 128 tokens: longer texts are truncated, and shorter texts are filled out with the padding token (ID 0) to keep every input the same shape.

In [7]:
class IMDBDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]

        encoding = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )

        return {
            "input_ids": encoding["input_ids"].flatten(),
            "attention_mask": encoding["attention_mask"].flatten(),
            "label": torch.tensor(label, dtype=torch.long),
        }


class TextPreprocessor:
    def __init__(self, tokenizer, max_length=128):
        self.tokenizer = tokenizer
        self.max_length = max_length

    def create_dataloader(self, text_label_pairs, batch_size=32, shuffle=True):
        texts, labels = zip(*text_label_pairs)

        dataset = IMDBDataset(texts, labels, self.tokenizer, self.max_length)
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)

        return dataloader
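
To make the effect of `truncation=True` and `padding="max_length"` concrete, here is a minimal sketch of the same idea on raw token ID lists. The function `pad_or_truncate` and the token IDs are hypothetical. The real tokenizer also inserts special tokens such as [CLS] and [SEP], which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])  # drop anything past max_length
    mask = [1] * len(ids)               # 1 marks a real token
    n_pad = max_length - len(ids)
    # pad on the right; 0 in the mask marks positions the model should ignore
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# a short sequence is padded (made-up token IDs, max_length=6 for readability)
short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=6)
print(short_ids, short_mask)

# a long sequence is truncated
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(long_ids, long_mask)
```

The attention mask is what lets the model tell padding apart from real tokens, which is why the dataset returns it alongside `input_ids`.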

### Model Training
Next we create the training function. It takes the training data, splits it into training and validation sets, creates the tokenizer and classification model, and runs a preprocess -> train -> validate loop with early stopping. It returns the trained model and the tokenizer.

In [None]:
def train_simple_classifier(
    train_data, batch_size=32, num_epochs=3, learning_rate=2e-5
):
    from sklearn.model_selection import train_test_split

    # split training data into train/validation
    train_split, val_split = train_test_split(
        train_data, test_size=0.2, random_state=42
    )
    print(
        f"Training on {len(train_split)} samples, validating on {len(val_split)} samples"
    )

    # set device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Training on device: {device}")

    # initialize tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = TextClassifier(num_classes=2, embedding_dim=512).to(device)

    # create dataloaders
    preprocessor = TextPreprocessor(tokenizer=tokenizer, max_length=128)
    train_dataloader = preprocessor.create_dataloader(
        train_split, batch_size=batch_size, shuffle=True
    )
    val_dataloader = preprocessor.create_dataloader(
        val_split, batch_size=batch_size, shuffle=False
    )

    # setup optimizer and loss function
    optimizer = torch.optim.Adam(
        model.parameters(), lr=learning_rate, weight_decay=1e-3
    )
    criterion = nn.CrossEntropyLoss()

    # early stopping parameters
    best_val_acc = 0
    patience = 3
    patience_counter = 0

    # training loop
    for epoch in range(num_epochs):
        # Training phase
        model.train()
        total_train_loss = 0
        train_correct = 0
        train_total = 0

        train_progress = tqdm(
            train_dataloader, desc=f"Epoch {epoch + 1}/{num_epochs} [Train]"
        )
        for batch in train_progress:
            # move batch to device
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            # forward pass
            optimizer.zero_grad()
            logits = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = criterion(logits, labels)

            # backward pass
            loss.backward()
            optimizer.step()

            # calculate accuracy
            predictions = torch.argmax(logits, dim=1)
            train_correct += (predictions == labels).sum().item()
            train_total += labels.size(0)
            total_train_loss += loss.item()

            # update progress bar
            current_acc = train_correct / train_total
            train_progress.set_postfix(
                {"loss": f"{loss.item():.4f}", "acc": f"{current_acc:.4f}"}
            )

        # validation phase
        model.eval()
        total_val_loss = 0
        val_correct = 0
        val_total = 0

        with torch.no_grad():
            val_progress = tqdm(
                val_dataloader, desc=f"Epoch {epoch + 1}/{num_epochs} [Val]"
            )
            for batch in val_progress:
                input_ids = batch["input_ids"].to(device)
                attention_mask = batch["attention_mask"].to(device)
                labels = batch["label"].to(device)

                logits = model(input_ids=input_ids, attention_mask=attention_mask)
                loss = criterion(logits, labels)

                predictions = torch.argmax(logits, dim=1)
                val_correct += (predictions == labels).sum().item()
                val_total += labels.size(0)
                total_val_loss += loss.item()

                current_val_acc = val_correct / val_total
                val_progress.set_postfix(
                    {
                        "val_loss": f"{loss.item():.4f}",
                        "val_acc": f"{current_val_acc:.4f}",
                    }
                )

        # calculate epoch metrics
        train_loss = total_train_loss / len(train_dataloader)
        train_acc = train_correct / train_total
        val_loss = total_val_loss / len(val_dataloader)
        val_acc = val_correct / val_total

        print(f"Epoch {epoch + 1}:")
        print(f"  Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}")
        print(f"  Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")

        # early stopping check
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            patience_counter = 0
            print(f"  New best validation accuracy: {best_val_acc:.4f}")
        else:
            patience_counter += 1
            print(f"  No improvement. Patience: {patience_counter}/{patience}")

        if patience_counter >= patience:
            print(f"Early stopping after {epoch + 1} epochs")
            break

    return model, tokenizer
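
The early stopping rule in the loop above can be isolated into a few lines. The sketch below replays it over a list of validation accuracies. The function name `epochs_run` is hypothetical, the first list reuses the validation accuracies from the run output, and the second list is made up to show a case where the rule triggers.

```python
def epochs_run(val_accuracies, patience=3):
    """Replay the early stopping rule; return (epochs completed, best accuracy)."""
    best = 0.0
    counter = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            # improvement: remember it and reset the patience counter
            best = acc
            counter = 0
        else:
            counter += 1
        if counter >= patience:
            return epoch, best
    return len(val_accuracies), best


# validation accuracies from the run output: every epoch improves, so no early stop
observed = [0.7710, 0.8057, 0.8095, 0.8117, 0.8133]
print(epochs_run(observed))

# hypothetical plateau: three epochs in a row without a new best
plateau = [0.77, 0.80, 0.79, 0.80, 0.78, 0.81]
print(epochs_run(plateau))
```

Note that the training function returns the model as it stands after the last epoch; it does not restore the weights from the best-scoring epoch.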

### Testing our Model
The testing function takes a list of test texts and prints the predicted sentiment for each.

In [9]:
def test_model(test_texts, model, tokenizer):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.eval()

    # tokenize test texts
    encodings = tokenizer(
        test_texts,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt",
    )

    # move to device
    input_ids = encodings["input_ids"].to(device)
    attention_mask = encodings["attention_mask"].to(device)

    # predict
    with torch.no_grad():
        logits = model(input_ids=input_ids, attention_mask=attention_mask)
        probabilities = F.softmax(logits, dim=1)

    print("Test Results:")
    for i, text in enumerate(test_texts):
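        pos_prob = probabilities[i][1].item()
        sentiment = "Positive" if pos_prob > 0.5 else "Negative"
        # report the probability of the predicted class, not always P(positive)
        confidence = max(pos_prob, 1 - pos_prob)
        print(f"'{text}' -> {sentiment} (confidence: {confidence:.3f})")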
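
The step from logits to a sentiment label can be traced by hand. The sketch below is a plain-Python version of what `F.softmax` followed by the 0.5 threshold does for a single example. The logit values are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# hypothetical logits in [negative, positive] order, matching the label encoding
logits = [-0.4, 0.9]
probs = softmax(logits)

pos_prob = probs[1]
sentiment = "Positive" if pos_prob > 0.5 else "Negative"
print(probs, sentiment)

# with two classes, softmax reduces to a sigmoid of the logit difference
sigmoid_of_diff = 1.0 / (1.0 + math.exp(-(logits[1] - logits[0])))
print(sigmoid_of_diff)
```

With only two classes, thresholding the positive probability at 0.5 picks the same label as taking the argmax of the logits.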

### Run the Training and Test Loop

In [10]:
print("Training simple text classifier...")
train_data, test_data = load_imdb_data(
    max_train_examples=20_000, max_test_examples=1_000
)
model, tokenizer = train_simple_classifier(train_data, num_epochs=5)

print("\nTesting model...")
test_texts = [d[0] for d in test_data[:5]]  # Test on first 5 test examples
test_model(test_texts, model, tokenizer)

Training simple text classifier...
Dataset info: IMDB Reviews - 25k train, 25k test examples
Features: text (string), label (0=negative, 1=positive)
Loaded 20000 training examples
Loaded 1000 test examples

Example training data:
Negative: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it w
Negative: "I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn't matter what one's poli
Negative: If only to avoid making this type of film in the future. This film is interesting as an experiment b
Negative: This film was probably inspired by Godard's Masculin, féminin and I urge you to see that film instea
Negative: Oh, brother...after hearing about this ridiculous film for umpteen years all I can think of is that 
Training on 16000 samples, validating on 4000 samples
Training on device: cpu


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 1/5 [Train]: 100%|██████████| 500/500 [09:31<00:00,  1.14s/it, loss=0.5323, acc=0.6532]
Epoch 1/5 [Val]: 100%|██████████| 125/125 [01:18<00:00,  1.59it/s, val_loss=0.5182, val_acc=0.7710]


Epoch 1:
  Train Loss: 0.6378, Train Acc: 0.6532
  Val Loss: 0.5559, Val Acc: 0.7710
  New best validation accuracy: 0.7710


Epoch 2/5 [Train]: 100%|██████████| 500/500 [08:23<00:00,  1.01s/it, loss=0.4757, acc=0.7827]
Epoch 2/5 [Val]: 100%|██████████| 125/125 [01:13<00:00,  1.70it/s, val_loss=0.4722, val_acc=0.8057]


Epoch 2:
  Train Loss: 0.5457, Train Acc: 0.7827
  Val Loss: 0.4844, Val Acc: 0.8057
  New best validation accuracy: 0.8057


Epoch 3/5 [Train]: 100%|██████████| 500/500 [08:22<00:00,  1.01s/it, loss=0.5280, acc=0.7999]
Epoch 3/5 [Val]: 100%|██████████| 125/125 [01:11<00:00,  1.76it/s, val_loss=0.4235, val_acc=0.8095]


Epoch 3:
  Train Loss: 0.4975, Train Acc: 0.7999
  Val Loss: 0.4509, Val Acc: 0.8095
  New best validation accuracy: 0.8095


Epoch 4/5 [Train]: 100%|██████████| 500/500 [11:14<00:00,  1.35s/it, loss=0.3745, acc=0.8025]
Epoch 4/5 [Val]: 100%|██████████| 125/125 [01:22<00:00,  1.51it/s, val_loss=0.4127, val_acc=0.8117]


Epoch 4:
  Train Loss: 0.4716, Train Acc: 0.8025
  Val Loss: 0.4335, Val Acc: 0.8117
  New best validation accuracy: 0.8117


Epoch 5/5 [Train]: 100%|██████████| 500/500 [09:02<00:00,  1.09s/it, loss=0.4312, acc=0.8084]
Epoch 5/5 [Val]: 100%|██████████| 125/125 [01:34<00:00,  1.32it/s, val_loss=0.4074, val_acc=0.8133]

Epoch 5:
  Train Loss: 0.4517, Train Acc: 0.8084
  Val Loss: 0.4238, Val Acc: 0.8133
  New best validation accuracy: 0.8133

Testing model...
Test Results:
'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It's really difficult to care about the characters here as they are not simply foolish, just missing


