# Transformer for Classification - DistilBERT Sentiment Analysis (Chapter 16 Application) 🚀

---

This notebook demonstrates how the **Transformer Encoder** architecture can be applied to sequence classification by fine-tuning a pre-trained model, **DistilBERT**, on the **IMDB Sentiment Analysis** task. It follows the fine-tuning approach of **Chapter 16: Transformers**.

### 1. The Power of Pre-trained Encoders (BERT/DistilBERT) 🧠

Unlike the RNN approach in Chapter 15, this method uses a vast, pre-trained model:

* **DistilBERT Architecture:** This is a smaller, faster, and highly efficient version of the original **BERT (Bidirectional Encoder Representations from Transformers)** model, built entirely on the **Transformer Encoder** stack. 
* **Bidirectionality:** Unlike the GPT-2 Decoder (which is causal), BERT/DistilBERT uses **Full Bidirectional Attention**. This means the model can look at the entire input sentence simultaneously (words before and after) to build the most contextualized representation for every single word.
* **Pre-training:** The model weights were learned on massive amounts of text data, allowing the model to already understand language, grammar, and context. The task here is merely **Fine-Tuning** those weights for the specific Sentiment Analysis task.
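
The difference between bidirectional (encoder) and causal (decoder) attention can be illustrated with a small, dependency-free sketch. It is not DistilBERT's implementation: the helper names `softmax` and `attention_weights` and the all-zero toy scores are assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal):
    """Row-wise softmax over a square score matrix.

    With causal=True, position i may only attend to positions j <= i
    (decoder style); with causal=False every position sees the whole
    sequence (encoder style).
    """
    n = len(scores)
    weights = []
    for i in range(n):
        row = [
            scores[i][j] if (not causal or j <= i) else float("-inf")
            for j in range(n)
        ]
        weights.append(softmax(row))
    return weights

# Toy scores (assumption): all zeros, so every visible position gets equal weight.
tokens = ["the", "movie", "was", "great"]
scores = [[0.0] * len(tokens) for _ in tokens]

bidirectional = attention_weights(scores, causal=False)
causal = attention_weights(scores, causal=True)

print(bidirectional[0])  # "the" attends to all four tokens
print(causal[0])         # "the" attends only to itself
```

In the bidirectional case the first token already distributes attention over the words that follow it, which is what lets an encoder build a representation of each word from its full context.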

### 2. High-Efficiency Data Pipelining with Hugging Face 📦

The notebook implements a specialized pipeline necessary for Transformer-based models:

* **Specialized Tokenizer (`DistilBertTokenizerFast`):** Every Transformer model requires a tokenizer that matches its pre-training. This tokenizer performs:
    * **Sub-word Tokenization (WordPiece):** Breaks down complex words into smaller, more common units for better vocabulary coverage.
    * **Special Tokens:** Automatically adds the necessary tokens like `[CLS]` (Classification Token, whose final hidden state is used for the entire sequence's prediction) and `[SEP]` (Separator Token).
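* **Dataset Encoding:** The entire IMDB dataset is encoded into numerical token IDs (`input_ids`) and attention masks (`attention_mask`, which tell the model to ignore padding). Unlike BERT, DistilBERT does not use segment IDs (`token_type_ids`).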
* **`Trainer` Utility:** After first fine-tuning with a manual PyTorch training loop (as in Chapters 13 and 15), the notebook repeats the fine-tuning with the high-level **Hugging Face `Trainer`** and **`TrainingArguments`** classes. This utility replaces the manual loop and automatically handles:
    * Batching and gradient accumulation.
    * Optimization and learning rate scheduling.
    * Evaluation and logging.
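
The tokenizer steps listed above (sub-word splitting, special tokens, padding, and the attention mask) can be sketched without downloading any model. The following is a simplified illustration of greedy longest-match-first WordPiece splitting, not the `DistilBertTokenizerFast` implementation; `TOY_VOCAB`, `wordpiece`, and `encode` are hypothetical names, and the eight-entry vocabulary is made up for the example.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first sub-word split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab, max_len):
    """Lowercase, split, add special tokens, pad, and build the attention mask."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens = tokens[: max_len - 1] + ["[SEP]"]  # truncate, always keep [SEP]
    mask = [1] * len(tokens) + [0] * (max_len - len(tokens))
    tokens = tokens + ["[PAD]"] * (max_len - len(tokens))
    return tokens, mask

# Hypothetical toy vocabulary (the real one has 30,522 entries).
TOY_VOCAB = {"[CLS]", "[SEP]", "[PAD]", "[UNK]",
             "un", "##watch", "##able", "movie"}

tokens, mask = encode("Unwatchable movie", TOY_VOCAB, max_len=8)
print(tokens)
print(mask)
```

The mask holds 1 for real tokens and 0 for padding, which is the information the model needs in order to ignore the `[PAD]` positions.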

### 3. Classification Head and Fine-Tuning

* **Model Loading (`DistilBertForSequenceClassification`):** This loads the pre-trained DistilBERT model with an extra, randomly initialized **Classification Head** (the `pre_classifier` and `classifier` linear layers named in the loading message further down) attached to the final hidden state of the `[CLS]` token.
* **Objective:** The training process is **Fine-Tuning**, where the vast pre-trained weights are slightly adjusted to optimize the model's performance specifically on the sentiment labels (0 or 1) of the IMDB reviews.
* **Loss Function:** **`nn.CrossEntropyLoss`** (or its equivalent within the `transformers` library) is used for the binary classification task.
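
As a rough sketch of what the head and the loss compute, the example below maps a `[CLS]` vector to two logits and evaluates the cross-entropy for one label. It is deliberately simplified to a single linear layer, and the 4-dimensional hidden state, the weights, and the helper names `linear` and `cross_entropy` are all assumptions chosen for illustration (DistilBERT's hidden size is 768).

```python
import math

def linear(x, weight, bias):
    """y = W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def cross_entropy(logits, label):
    """Negative log-probability of the true class (log-sum-exp form)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Hypothetical 4-dimensional [CLS] hidden state.
cls_hidden = [0.5, -1.0, 0.25, 2.0]

# Hypothetical head weights: one row per class (0 = negative, 1 = positive).
weight = [[0.1, 0.2, 0.0, -0.3],
          [-0.1, 0.0, 0.4, 0.5]]
bias = [0.0, 0.0]

logits = linear(cls_hidden, weight, bias)
pred = max(range(len(logits)), key=logits.__getitem__)  # argmax over classes
loss = cross_entropy(logits, label=1)

print(logits, pred, round(loss, 4))
```

Fine-tuning lowers this loss by adjusting both the head weights and, more gently, the pre-trained encoder weights that produce the `[CLS]` vector.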

This notebook demonstrates a now-standard approach in NLP: fine-tuning a large, pre-trained Transformer encoder, which reaches high classification accuracy within a few epochs (about 92% validation accuracy after the first epoch in the run recorded below) instead of training an LSTM from scratch.

In [2]:
import time
import gzip
import shutil
import requests

import numpy as np
import pandas as pd
import torch
import torchtext
import torch.nn.functional as F

import transformers
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForSequenceClassification

from transformers import Trainer, TrainingArguments

In [3]:
torch.backends.cudnn.deterministic = True
SEED = 123
torch.manual_seed(SEED)

<torch._C.Generator at 0x2122a5b2750>

In [4]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
NUM_EPCCHS = 3
print(DEVICE)

cuda


In [5]:
url = 'http://github.com/rasbt/machine-learning-book/raw/main/ch08/movie_data.csv.gz'
filename = url.split('/')[-1]

r = requests.get(url)
r.raise_for_status()  # fail early if the download did not succeed
with open(filename, 'wb') as f:
    f.write(r.content)

with gzip.open(filename, 'rb') as f_in:
    with open('movie_data.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

In [5]:
df = pd.read_csv('movie_data.csv')
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [6]:
train_texts = df.iloc[:35000]['review'].values
train_labels = df.iloc[:35000]['sentiment'].values

valid_texts = df.iloc[35000:40000]['review'].values
valid_labels = df.iloc[35000:40000]['sentiment'].values

test_texts = df.iloc[40000:]['review'].values
test_labels = df.iloc[40000:]['sentiment'].values

In [7]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(list(train_texts), truncation= True, padding= True)
valid_encodings = tokenizer(list(valid_texts), truncation= True, padding= True)
test_encodings = tokenizer(list(test_texts), truncation= True, padding= True)

In [8]:
class IMDBdataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # One example: input_ids and attention_mask as tensors, plus its label
        item = {key: torch.tensor(val[idx])
                for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])

        return item

    def __len__(self):
        return len(self.labels)


In [9]:
train_dataset = IMDBdataset(train_encodings, train_labels)
valid_dataset = IMDBdataset(valid_encodings, valid_labels)
test_dataset = IMDBdataset(test_encodings, test_labels)

In [10]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size= 16, shuffle= True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size= 16, shuffle= False)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size= 16, shuffle= False)

In [11]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
model.to(DEVICE)
model.train()

optim = torch.optim.Adam(model.parameters(), lr= 5e-5)

In [14]:
def compute_accuracy(model, dataloader, device):
    with torch.no_grad():
        correct_pred, num_examples = 0, 0
        for batch_idx, batch in enumerate(dataloader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            output = model(input_ids, attention_mask= attention_mask)
            logits = output['logits']
            preds = torch.argmax(logits, 1)
            num_examples += labels.size(0)
            correct_pred += (preds == labels).sum()

        result = (correct_pred.float() / num_examples) * 100
        return result

In [18]:
start_time = time.time()

NUM_EPOCHS = 3

for epoch in range(NUM_EPOCHS):
    model.train()
    for batch_idx , batch in enumerate(train_loader):
        input_ids = batch['input_ids'].to(DEVICE)
        attention_mask = batch['attention_mask'].to(DEVICE)
        labels = batch['labels'].to(DEVICE)

        outputs = model(input_ids, attention_mask= attention_mask, labels= labels)
        loss, logits = outputs['loss'], outputs['logits']

        optim.zero_grad()
        loss.backward()
        optim.step()

        if not batch_idx % 250:
            print(f'Epoch {epoch+1:04d}/{NUM_EPOCHS:04d}'
                  f' | Batch '
                  f'{batch_idx:04d}/{len(train_loader)}'
                  f' | Loss {loss:.4f}'
                 )
    model.eval()

    with torch.set_grad_enabled(False):
        print(f'Training Accuracy: {compute_accuracy(model, train_loader, DEVICE):.2f}%')
        print(f'\nValid Accuracy: {compute_accuracy(model, valid_loader, DEVICE):.2f}%')
        print(f'Time Elapsed: {(time.time() - start_time)/60:.2f} min')
print(f'Total Time: {(time.time() - start_time)/60:.2f} min')
print(f'Test Accuracy: {compute_accuracy(model, test_loader, DEVICE):.2f}%')

Epoch 0001/0003 | Batch 0000/2188 | Loss 0.6832
Epoch 0001/0003 | Batch 0250/2188 | Loss 0.0818
Epoch 0001/0003 | Batch 0500/2188 | Loss 0.1194
Epoch 0001/0003 | Batch 0750/2188 | Loss 0.3069
Epoch 0001/0003 | Batch 1000/2188 | Loss 0.2505
Epoch 0001/0003 | Batch 1250/2188 | Loss 0.1905
Epoch 0001/0003 | Batch 1500/2188 | Loss 0.1401
Epoch 0001/0003 | Batch 1750/2188 | Loss 0.1270
Epoch 0001/0003 | Batch 2000/2188 | Loss 0.2155
Training Accuracy: 96.57%

Valid Accuracy: 92.52%
Time Elapsed: 81.58 min
Epoch 0002/0003 | Batch 0000/2188 | Loss 0.0494
Epoch 0002/0003 | Batch 0250/2188 | Loss 0.0974
Epoch 0002/0003 | Batch 0500/2188 | Loss 0.1154
Epoch 0002/0003 | Batch 0750/2188 | Loss 0.1304
Epoch 0002/0003 | Batch 1000/2188 | Loss 0.0573
Epoch 0002/0003 | Batch 1250/2188 | Loss 0.0937
Epoch 0002/0003 | Batch 1500/2188 | Loss 0.1355
Epoch 0002/0003 | Batch 1750/2188 | Loss 0.0119
Epoch 0002/0003 | Batch 2000/2188 | Loss 0.4044
Training Accuracy: 98.69%

Valid Accuracy: 92.34%
Time Elapsed

In [13]:
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased'
)
model.to(DEVICE)
model.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [14]:
optim = torch.optim.Adam(model.parameters(), lr= 5e-5)

In [15]:
train_args = TrainingArguments(
    output_dir= './results',
    num_train_epochs= 3,
    per_device_train_batch_size= 16,
    per_device_eval_batch_size= 16,
    logging_dir= './logs',
    logging_steps= 10
)
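
The effect of these arguments on the length of the run can be worked out by hand. The arithmetic below is a back-of-the-envelope sketch that assumes a single device and no gradient accumulation (the defaults are not overridden above); the variable names are illustrative.

```python
import math

num_train_examples = 35_000   # rows 0..34999 of movie_data.csv
batch_size = 16               # per_device_train_batch_size
num_epochs = 3                # num_train_epochs
logging_steps = 10

# The last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
num_log_entries = total_steps // logging_steps

print(steps_per_epoch, total_steps, num_log_entries)
```

The 2188 steps per epoch agree with the batch counter printed by the manual training loop earlier in the notebook.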

In [16]:
import evaluate

metric = evaluate.load('accuracy')

Using the latest cached version of the module from C:\Users\98922\.cache\huggingface\modules\evaluate_modules\metrics\evaluate-metric--accuracy\f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Mon Sep 15 01:28:52 2025) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.


In [17]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis= -1)
    return metric.compute(predictions= preds, references= labels)

In [19]:
trainer = Trainer(
    model= model,
    args= train_args,
    train_dataset= train_dataset,
    eval_dataset= test_dataset,
    compute_metrics= compute_metrics,
    optimizers= (optim, None)
)

In [None]:
start_time = time.time()
trainer.train()

In [None]:
print(trainer.evaluate())

In [None]:
model.eval()
model.to(DEVICE)

print(f'Test Accuracy: {compute_accuracy(model, test_loader, DEVICE):.2f}%')

In [None]:
training_args = TrainingArguments(
    'test_trainer',
    evaluation_strategy= 'epoch', ...
)