# Tutorial 10-2: The Pre-trained Giant – "Fine-Tuning BERT"

**Course:** CSEN 342: Deep Learning  
**Topic:** Transformers, BERT, Tokenization, and Transfer Learning

## Objective
In the lecture, we learned about **BERT** (Bidirectional Encoder Representations from Transformers). BERT changed the NLP landscape by providing a pre-trained "language understanding" engine that can be fine-tuned for almost any text task.

In this tutorial, we will:

1.  **Understand Tokenization:** See how BERT breaks words into "subwords" to handle unknown vocabulary.
2.  **Fine-Tune DistilBERT:** Adapt a pre-trained model to classify news articles (AG News dataset).
3.  **Perform Inference:** Build a pipeline to classify your own text.

We will use **DistilBERT**, a distilled version of BERT that is about 40% smaller and 60% faster while retaining roughly 97% of BERT's language-understanding performance, as reported by its authors.

**NOTE**: Run this notebook under the `Transformers Bundle` kernel rather than the class kernel.

---

## Part 0: Robust Setup (The Offline Cache)

Compute nodes often block Python libraries from opening connections, but allow system tools like `wget`. We will manually download the model components and the dataset into a local folder.

In [6]:
# Run if needed...
!python -m pip install --user torch==2.6.0+cu118 torchvision torchaudio transformers accelerate --index-url https://download.pytorch.org/whl/cu118
# Restart the kernel after installation


In [1]:
import os
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import pandas as pd
import numpy as np
from tqdm import tqdm

# Define paths
data_root = '../data'
model_root = '../data/distilbert_local'
os.makedirs(data_root, exist_ok=True)
os.makedirs(model_root, exist_ok=True)

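def download_file(url, save_path):
    """Fetch url to save_path with wget, skipping files that already exist."""
    if os.path.exists(save_path) and os.path.getsize(save_path) > 0:
        return
    print(f"Downloading {os.path.basename(save_path)}...")
    # -q: quiet (suppresses wget's own output), -O: write to this exact path
    exit_code = os.system(f'wget -q -O "{save_path}" "{url}"')
    if exit_code != 0:
        print(f"Error downloading {url}")
        # wget -O creates the file even on failure; remove it so a rerun retries
        if os.path.exists(save_path):
            os.remove(save_path)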

# 1. Download AG News Dataset
dataset_url = "https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv"
dataset_path = os.path.join(data_root, 'ag_news_train.csv')
download_file(dataset_url, dataset_path)

# 2. Download DistilBERT Model Files (Hugging Face Hub)
# We need these specific files to load the model "offline"
base_hf_url = "https://huggingface.co/distilbert-base-uncased/resolve/main/"
files_to_fetch = [
    "config.json",
    "pytorch_model.bin",
    "vocab.txt",
    "tokenizer.json",
    "tokenizer_config.json"
]

for filename in files_to_fetch:
    download_file(base_hf_url + filename, os.path.join(model_root, filename))

print("All files ready.")

All files ready.
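
One caveat with this setup: `wget -O` creates the output file before the transfer starts, so a failed download can leave a zero-byte file behind that still looks "ready". Below is a minimal sketch of a sanity check using only the standard library. The helper name `find_missing_files` is hypothetical and not part of the notebook; the folder and file names are assumed to match the cell above.

```python
import os

def find_missing_files(folder, filenames):
    """Return the names that are absent or zero bytes in folder."""
    missing = []
    for name in filenames:
        path = os.path.join(folder, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing

# Assumed to match model_root and files_to_fetch from the download cell.
expected = ["config.json", "pytorch_model.bin", "vocab.txt",
            "tokenizer.json", "tokenizer_config.json"]
print("Missing or empty:", find_missing_files("../data/distilbert_local", expected))
```

An empty list means every file exists and has content. It does not prove a file is complete, only that the download produced something.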


---

## Part 1: The Tokenizer

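Transformers don't read words; they read **tokens**. 
BERT uses **WordPiece** tokenization: words that are in the vocabulary stay whole (`apple`), while rarer words are split into known subword pieces (for example, `gpu` becomes `gp`, `##u` with the uncased BERT vocabulary). Because almost any word can be assembled from pieces, the model rarely has to fall back to an "unknown" token.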

We load the tokenizer from our local folder.

In [2]:
from transformers import DistilBertTokenizer

# Load from local path
tokenizer = DistilBertTokenizer.from_pretrained(model_root)

# Demonstration
text = "Transformers are fascinating!"
encoded = tokenizer(text, padding=True, truncation=True, return_tensors='pt')

print(f"Original: {text}")
print(f"Tokens:   {tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])}")
print(f"IDs:      {encoded['input_ids'][0]}")

Original: Transformers are fascinating!
Tokens:   ['[CLS]', 'transformers', 'are', 'fascinating', '!', '[SEP]']
IDs:      tensor([  101, 19081,  2024, 17160,   999,   102])


### Discussion
* `[CLS]`: Special classification token added to the start. The model uses the embedding of this token to represent the *entire sentence*.
* `[SEP]`: Separator token at the end.
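* `##` prefix: Marks a token that continues the previous one (e.g. `##ing`). None appear in the output above because every word in this sentence is already in the vocabulary.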
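
To make the `##` convention concrete, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece applies to a single word. The function name and the toy vocabulary are invented for illustration; the real tokenizer also handles lowercasing, punctuation, and a vocabulary of roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into vocab pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for illustration only.
toy_vocab = {"token", "##ization", "##s", "play", "##ing", "apple"}

print(wordpiece_tokenize("apple", toy_vocab))         # ['apple']
print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("zebra", toy_vocab))         # ['[UNK]']
```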

---

## Part 2: The Dataset Class

We wrap the AG News CSV. 

**Classes:** 1-World, 2-Sports, 3-Business, 4-Sci/Tech. (We will map them to 0-3).

In [3]:
class AGNewsDataset(Dataset):
    def __init__(self, csv_path, tokenizer, max_len=64, samples=2000):
        # Load only a subset for tutorial speed
        self.df = pd.read_csv(csv_path, header=None).sample(n=samples, random_state=42)
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        # Columns: 0=Class, 1=Title, 2=Description
        text = str(row[1]) + " " + str(row[2]) 
        label = int(row[0]) - 1 # Convert 1-4 to 0-3
        
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }

# Create Loaders
dataset = AGNewsDataset(dataset_path, tokenizer)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_ds, val_ds = torch.utils.data.random_split(dataset, [train_size, val_size])

train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=16)
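
The `padding='max_length'` and `truncation=True` arguments in `__getitem__` guarantee that every example has exactly `max_len` IDs, which is what lets the `DataLoader` stack them into a batch. A minimal pure-Python sketch of that idea follows. The helper `pad_or_truncate` is hypothetical, and it assumes a padding ID of 0. It also simplifies truncation: the real tokenizer keeps `[SEP]` at the end of a truncated sequence, whereas this sketch just slices.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Force a list of token IDs to exactly max_len and build its attention mask."""
    kept = token_ids[:max_len]                       # truncation: drop the overflow
    n_pad = max_len - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding: fill the tail
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# IDs from the Part 1 output: [CLS] transformers are fascinating ! [SEP]
ids = [101, 19081, 2024, 17160, 999, 102]
print(pad_or_truncate(ids, max_len=8))
print(pad_or_truncate(ids, max_len=4))
```

The attention mask tells the model to ignore the padded positions, which is why the dataset returns it alongside `input_ids`.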

---

## Part 3: The Model & Fine-Tuning

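We use `DistilBertForSequenceClassification`. This class wraps the raw DistilBERT model and adds a small classification head on top of the `[CLS]` token output: a `pre_classifier` linear layer with ReLU and dropout, followed by a `classifier` linear layer that produces one logit per class. These are the layers that the loading message reports as newly initialized.

**Fine-Tuning:** We update *all* weights (DistilBERT + head), but with a small learning rate ($2 \times 10^{-5}$ is a common choice for BERT-style models).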

In [4]:
from transformers import DistilBertForSequenceClassification

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load Model from local path
model = DistilBertForSequenceClassification.from_pretrained(
    model_root, 
    num_labels=4 # 4 classes in AG News
)
model = model.to(device)

optimizer = optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()  # Not used below: passing labels= makes the model compute this loss itself

def train(epochs=1):
    print("Starting Fine-Tuning (this might take 2-3 mins on GPU)...")
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        
        for batch in tqdm(train_loader):
            input_ids = batch['input_ids'].to(device)
            mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            
            optimizer.zero_grad()
            
            # When labels are passed, the HF model returns an output object with .loss and .logits
            outputs = model(input_ids, attention_mask=mask, labels=labels)
            loss = outputs.loss
            
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            
        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}: Loss {avg_loss:.4f}")

train(epochs=1)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at ../data/distilbert_local and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting Fine-Tuning (this might take 2-3 mins on GPU)...


100%|██████████| 100/100 [00:10<00:00,  9.65it/s]

Epoch 1: Loss 0.7517
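
To put the reported loss in context: with integer labels and four classes, the loss returned by the model is the standard cross-entropy, and a model that guesses uniformly would score $\ln 4 \approx 1.386$. An average of 0.7517 after one epoch is therefore already well below chance level. The sketch below computes cross-entropy for a single example in pure Python; the function name and the second set of logits are made up for illustration.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy for one example: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Equal logits mean a uniform guess over 4 classes, so the loss is ln(4).
print(round(cross_entropy([0.0, 0.0, 0.0, 0.0], label=2), 4))  # 1.3863

# Made-up logits that favor the correct class give a lower loss.
print(round(cross_entropy([0.1, 2.0, -0.3, 0.2], label=1), 4))
```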





---

## Part 4: Inference

Let's test the model on new, unseen sentences.

In [5]:
class_names = ["World", "Sports", "Business", "Sci/Tech"]

def predict(text):
    model.eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64, padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        prediction = torch.argmax(logits, dim=1).item()
        
    return class_names[prediction]

# Test Cases
samples = [
    "The stock market crashed today after the new inflation report.",
    "Manchester United scored a late goal to win the championship.",
    "Apple released a new iPhone with a faster processor.",
    "Peace talks in the middle east have stalled again."
]

print("--- Predictions ---")
for s in samples:
    pred = predict(s)
    print(f"Text: {s}")
    print(f"Pred: {pred}\n")

--- Predictions ---
Text: The stock market crashed today after the new inflation report.
Pred: Business

Text: Manchester United scored a late goal to win the championship.
Pred: Sports

Text: Apple released a new iPhone with a faster processor.
Pred: Sci/Tech

Text: Peace talks in the middle east have stalled again.
Pred: World
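
`predict` returns only the top label, which hides how confident the model is. Applying a softmax to the logits turns them into class probabilities. The sketch below does this in pure Python with made-up logits; in the notebook you could instead pass something like `outputs.logits[0].tolist()`. The helper names `softmax` and `rank_classes` are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_classes(logits, class_names):
    """Pair each class name with its probability, most likely first."""
    probs = softmax(logits)
    return sorted(zip(class_names, probs), key=lambda pair: pair[1], reverse=True)

class_names = ["World", "Sports", "Business", "Sci/Tech"]
example_logits = [-0.8, -1.1, 0.4, 2.3]  # made-up values, not real model output

for name, p in rank_classes(example_logits, class_names):
    print(f"{name:<9} {p:.3f}")
```

A prediction whose top two probabilities are close is a good candidate for manual review, especially after only one epoch on 1,600 training examples.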



### Conclusion
You have successfully fine-tuned a Transformer!

**Why this works:**
The pre-trained model already "knows" English grammar, syntax, and some world knowledge from pre-training on large text corpora (English Wikipedia and BookCorpus). Fine-tuning simply teaches it how to map that understanding to your 4 specific categories.

**In the next tutorial:** We will look at *Generative* Transformers (GPT) which can write text instead of just classifying it.