# ML Lab 09: Fine-Tune a Model

Pre-trained models like BERT have already been trained on large text corpora. They capture a lot
about language: grammar, meaning, context. But they don't know *your* task. In this lab, you'll take
DistilBERT (a smaller, faster version of BERT), see what it already knows, then fine-tune it to
classify baseball vs space posts. A couple hundred examples and three epochs later, you should have
a specialist model that comfortably beats the zero-shot baseline.

This is how real ML teams work: start with a pre-trained foundation model, adapt it with your data.

---
## Section 1: What Does a Pre-trained Model Already Know?

DistilBERT was trained on Wikipedia and BookCorpus — millions of documents. It learned to predict
missing words, and in doing so, it learned the *structure* of language.

Let's load it and see: does it already know that baseball sentences are similar to each other,
and different from space sentences?

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

# Load the pre-trained DistilBERT model and tokenizer
print("Loading DistilBERT (this downloads ~260 MB the first time)...")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
base_model = AutoModel.from_pretrained("distilbert-base-uncased")
base_model.eval()
print(f"Model loaded! Parameters: {sum(p.numel() for p in base_model.parameters()):,}")

In [None]:
# Test sentences — two about baseball, two about space
sentences = [
    "The pitcher threw a fastball for a strike in the bottom of the ninth.",
    "The batter hit a grand slam home run to win the championship game.",
    "NASA launched a new satellite to orbit Mars and study its atmosphere.",
    "The Hubble telescope captured images of a distant galaxy cluster.",
]

labels = ["baseball", "baseball", "space", "space"]

# Tokenize and get embeddings
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    outputs = base_model(**inputs)

# Use the [CLS] token embedding as the sentence representation
embeddings = outputs.last_hidden_state[:, 0, :].numpy()
print(f"Embedding shape: {embeddings.shape}")
print(f"Each sentence is represented as a {embeddings.shape[1]}-dimensional vector")

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute pairwise cosine similarity
sim_matrix = cosine_similarity(embeddings)

print("Cosine Similarity Matrix:")
print(f"{'':>12s}", end="")
for i, label in enumerate(labels):
    print(f"  {label}-{i+1:d}", end="")
print()

for i in range(len(sentences)):
    print(f"{labels[i]}-{i+1:d}:     ", end="")
    for j in range(len(sentences)):
        print(f"  {sim_matrix[i][j]:.3f} ", end="")
    print()

# Check: are same-topic sentences more similar?
same_topic = (sim_matrix[0][1] + sim_matrix[2][3]) / 2
diff_topic = (sim_matrix[0][2] + sim_matrix[0][3] + sim_matrix[1][2] + sim_matrix[1][3]) / 4

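print(f"\nAvg same-topic similarity:  {same_topic:.3f}")
print(f"Avg cross-topic similarity: {diff_topic:.3f}")
# Only claim topic grouping if the numbers actually show it
if same_topic > diff_topic:
    print("\nThe model already groups similar topics closer together!")
    print("But the gap is small, and fine-tuning should make it much larger.")
else:
    print("\nNo clear topic grouping on these four sentences; fine-tuning should create one.")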

**Key insight:** The model already understands language structure. Baseball sentences are slightly
more similar to each other than to space sentences. But the gap is small because the model wasn't
trained for *this specific task*. Fine-tuning will amplify this signal.
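
To make the similarity numbers above concrete, here is a minimal sketch of what cosine similarity computes. The three vectors are made-up 3-dimensional stand-ins (hypothetical names and values), not real 768-dimensional DistilBERT embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors (assumption: purely illustrative values)
toy_baseball_1 = [1.0, 0.0, 1.0]
toy_baseball_2 = [1.0, 1.0, 1.0]
toy_space_1 = [0.0, 1.0, 0.0]

same_topic_sim = cosine(toy_baseball_1, toy_baseball_2)
cross_topic_sim = cosine(toy_baseball_1, toy_space_1)

print(f"same-topic:  {same_topic_sim:.3f}")
print(f"cross-topic: {cross_topic_sim:.3f}")
```

A value near 1 means the vectors point in almost the same direction, and 0 means they are orthogonal. Real embeddings rarely separate this cleanly, which is why the lab compares averages.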

---

## Section 2: Zero-Shot vs Fine-Tuned

Before fine-tuning, let's establish a baseline. **Zero-shot classification** uses the model's
general knowledge to classify text into categories it has never been explicitly trained on.

How well can DistilBERT classify baseball vs space posts *without any task-specific training*?

In [None]:
from sklearn.datasets import fetch_20newsgroups

# Load a small test set for our zero-shot baseline
categories = ['rec.sport.baseball', 'sci.space']
test_raw = fetch_20newsgroups(
    subset='test', categories=categories, shuffle=True, random_state=42
)

# Use a small subset for zero-shot (it's slow per-example)
n_zero_shot = 50
zs_texts = test_raw.data[:n_zero_shot]
zs_labels = test_raw.target[:n_zero_shot]

print(f"Zero-shot test set: {n_zero_shot} examples")
print(f"Classes: {test_raw.target_names}")
print(f"  Label 0: {test_raw.target_names[0]}")
print(f"  Label 1: {test_raw.target_names[1]}")

In [None]:
from transformers import pipeline
from tqdm import tqdm

# Zero-shot classification with a DistilBERT checkpoint fine-tuned for NLI
# (natural language inference) on MNLI; this is not the base model loaded in Section 1
print("Running zero-shot classification (this takes a minute)...")
zs_classifier = pipeline(
    "zero-shot-classification",
    model="typeform/distilbert-base-uncased-mnli",
    device=-1  # CPU
)

candidate_labels = ["baseball", "space"]
label_map = {"baseball": 0, "space": 1}

zs_predictions = []
for text in tqdm(zs_texts, desc="Zero-shot"):
    # Keep only the first 512 characters (not tokens) to speed things up
    truncated = text[:512]
    result = zs_classifier(truncated, candidate_labels=candidate_labels)
    predicted_label = result["labels"][0]
    zs_predictions.append(label_map[predicted_label])

zs_predictions = np.array(zs_predictions)

from sklearn.metrics import accuracy_score
zs_accuracy = accuracy_score(zs_labels, zs_predictions)
print(f"\nZero-shot accuracy: {zs_accuracy:.3f} ({zs_accuracy*100:.1f}%)")
print("Not bad for a model that was never trained on this task!")
print("But let's see if fine-tuning can do better.")

**Key insight:** Zero-shot classification relies on general knowledge. It is usually decent but
not great: the model has never seen these specific newsgroup posts or been told what "baseball"
and "space" mean in this context. Fine-tuning teaches it YOUR data.
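
For intuition, here is a rough sketch of how an NLI-based zero-shot classifier picks a label. Each candidate label is turned into a hypothesis sentence, the NLI model scores how strongly the text entails it, and a softmax over those scores gives label probabilities. The template string is assumed to match the pipeline's default, and the logits are invented for illustration rather than taken from the model.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

candidate_labels = ["baseball", "space"]
hypothesis_template = "This example is {}."  # assumption: the pipeline's default template
hypotheses = [hypothesis_template.format(label) for label in candidate_labels]

# Hypothetical entailment logits for one post (made-up numbers, not model output)
toy_entailment_logits = [2.0, 0.0]

label_probs = softmax(toy_entailment_logits)
ranked = sorted(zip(candidate_labels, label_probs), key=lambda pair: pair[1], reverse=True)
top_label = ranked[0][0]

for hypothesis, prob in zip(hypotheses, label_probs):
    print(f"{hypothesis:<28s} p = {prob:.3f}")
print(f"Predicted label: {top_label}")
```

Because the labels are just words in a sentence, changing the wording of `candidate_labels` can change the prediction. The model is judging text, not learning your classes.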

---

## Section 3: Prepare Training Data

To fine-tune, we need labeled training data. We'll use the same 20 Newsgroups dataset, but now
we'll prepare it properly for the HuggingFace Trainer:

1. Load the data
2. Create train/validation/test splits
3. Tokenize with DistilBERT's tokenizer
4. Wrap the encodings in a PyTorch `Dataset`

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

# Load the full dataset
categories = ['rec.sport.baseball', 'sci.space']
train_raw = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
test_raw = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

print(f"Full training set: {len(train_raw.data)} examples")
print(f"Full test set:     {len(test_raw.data)} examples")

# Use a smaller subset to keep training fast on CPU
# In production, you'd use the full dataset
MAX_TRAIN = 200
MAX_TEST = 100

train_texts = train_raw.data[:MAX_TRAIN]
train_labels = train_raw.target[:MAX_TRAIN]

# Split the subset into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    train_texts, train_labels, test_size=0.2, random_state=42, stratify=train_labels
)

X_test = test_raw.data[:MAX_TEST]
y_test = test_raw.target[:MAX_TEST]

print(f"\nSubset for CPU training:")
print(f"  Train:      {len(X_train)} examples")
print(f"  Validation: {len(X_val)} examples")
print(f"  Test:       {len(X_test)} examples")

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize all splits
train_encodings = tokenizer(X_train, truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(X_val, truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(X_test, truncation=True, padding=True, max_length=128)

# Show what tokenization looks like
example_text = X_train[0][:200]
example_tokens = tokenizer.tokenize(example_text)

print(f"Original text (first 200 chars):")
print(f"  {example_text}")
print(f"\nTokenized ({len(example_tokens)} tokens):")
print(f"  {example_tokens[:20]}...")
print(f"\nToken IDs (first 20):")
print(f"  {train_encodings['input_ids'][0][:20]}")
print(f"\nThe tokenizer splits text into subword tokens and converts them to integer IDs.")
print(f"The model only sees numbers — never raw text.")

In [None]:
import torch
from torch.utils.data import Dataset

class NewsGroupDataset(Dataset):
    """PyTorch Dataset for our tokenized newsgroup data."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NewsGroupDataset(train_encodings, y_train.tolist())
val_dataset = NewsGroupDataset(val_encodings, y_val.tolist())
test_dataset = NewsGroupDataset(test_encodings, y_test.tolist())

print(f"Datasets created:")
print(f"  train_dataset: {len(train_dataset)} examples")
print(f"  val_dataset:   {len(val_dataset)} examples")
print(f"  test_dataset:  {len(test_dataset)} examples")

# Peek at one example
sample = train_dataset[0]
print(f"\nSample item keys: {list(sample.keys())}")
print(f"  input_ids shape: {sample['input_ids'].shape}")
print(f"  label: {sample['labels'].item()} ({categories[sample['labels'].item()]})")

**Key insight:** The tokenizer converts raw text into integer IDs that the model understands.
DistilBERT uses *subword tokenization* — common words stay whole, rare words get split into pieces.
This is why it can handle any text, even words it's never seen before.
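
The splitting rule can be sketched as a greedy longest-match-first search, which is the general idea behind WordPiece tokenization. This is a simplified illustration with a tiny hypothetical vocabulary (`toy_vocab`), not DistilBERT's actual tokenizer or vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary for illustration only
toy_vocab = {"the", "pitch", "##er", "##ing", "space", "##craft"}

for word in ["the", "pitcher", "spacecraft", "zzz"]:
    print(word, "->", wordpiece(word, toy_vocab))
```

The real vocabulary also contains single characters and their `##` forms, so in practice almost any word can be split into known pieces and `[UNK]` is rare.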

---

## Section 4: Fine-Tune DistilBERT

Now the main event. We'll:
1. Load `DistilBertForSequenceClassification` — a DistilBERT with a classification head on top
2. Set training arguments (epochs, learning rate, batch size)
3. Train with the HuggingFace `Trainer`
4. Watch the loss decrease

On CPU, this takes about 5-10 minutes with our small dataset. On GPU, it would take seconds.
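
Before training, it helps to know how many optimizer steps to expect. The sketch below works through the arithmetic for this lab's settings (160 training examples after the 80/20 split, batch size 16, 3 epochs). The learning-rate function assumes a linear decay to zero with no warmup, a single device, and no gradient accumulation. That is my understanding of the `Trainer` defaults, not something this lab configures explicitly.

```python
import math

n_train = 160      # 80% of the 200-example subset
batch_size = 16
epochs = 3
peak_lr = 2e-5

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

def linear_decay_lr(step, total, peak):
    """Learning rate after `step` updates under linear decay to zero (assumed schedule)."""
    return peak * max(0.0, (total - step) / total)

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
for step in (0, total_steps // 2, total_steps):
    print(f"step {step:>2d}: lr = {linear_decay_lr(step, total_steps, peak_lr):.2e}")
```

With only 30 updates in total, `logging_steps=5` gives about six loss readings, which is enough to see a trend but not a smooth curve.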

In [None]:
from transformers import DistilBertForSequenceClassification, TrainingArguments, Trainer
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Load DistilBERT with a classification head (2 classes)
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2
)

print(f"Model loaded with classification head")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

# Define a function to compute metrics during training
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average='weighted')
    return {"accuracy": acc, "f1": f1}

In [None]:
# Training arguments — tuned for CPU speed
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # newer transformers releases rename this to eval_strategy
    save_strategy="epoch",
    logging_steps=5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="none",  # Disable W&B/MLflow reporting
    no_cuda=True,       # Force CPU (newer transformers releases prefer use_cpu=True)
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

print("Starting fine-tuning (3 epochs on CPU — ~5-10 minutes)...")
print("Watch the training loss decrease!\n")
train_result = trainer.train()

print(f"\nTraining complete!")
print(f"  Total steps: {train_result.global_step}")
print(f"  Final training loss: {train_result.training_loss:.4f}")

In [None]:
# Evaluate on the test set
print("Evaluating on test set...\n")
test_results = trainer.evaluate(test_dataset)

print(f"Test Results:")
print(f"  Accuracy: {test_results['eval_accuracy']:.3f} ({test_results['eval_accuracy']*100:.1f}%)")
print(f"  F1 Score: {test_results['eval_f1']:.3f}")
print(f"  Loss:     {test_results['eval_loss']:.4f}")

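**Key insight:** The training loss should trend downward over the 3 epochs. Nothing is frozen, so
all of the model's roughly 66 million parameters are updated. With a small learning rate and only a
few dozen steps, though, the pre-trained weights move only slightly, and the freshly initialized
classification head typically changes the most. The lower layers (which capture general language
understanding) tend to change the least.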
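
To read the loss values, it helps to know what scale to expect. Here is a small worked example of the cross-entropy loss used for classification, with hand-picked logits (illustrative values, not model outputs). A two-class model that outputs equal logits scores ln 2 ≈ 0.693, so a loss below that means the model is doing better than a coin flip on average.

```python
import math

def cross_entropy(logits, true_class):
    """Negative log of the softmax probability assigned to the true class."""
    top = max(logits)
    shifted = [z - top for z in logits]  # subtract the max for numerical stability
    log_sum_exp = math.log(sum(math.exp(z) for z in shifted))
    return -(shifted[true_class] - log_sum_exp)

# Hand-picked logits for illustration
untrained_loss = cross_entropy([0.0, 0.0], true_class=0)   # no preference between classes
confident_right = cross_entropy([3.0, -3.0], true_class=0)  # strongly favors the true class
confident_wrong = cross_entropy([3.0, -3.0], true_class=1)  # strongly favors the wrong class

print(f"undecided:       {untrained_loss:.4f}")
print(f"confident right: {confident_right:.4f}")
print(f"confident wrong: {confident_wrong:.4f}")
```

Confident mistakes are penalized far more heavily than hesitant ones, which is why a handful of badly misclassified posts can dominate the average loss.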

---

## Section 5: Compare Results

Time for the payoff. Let's compare zero-shot vs fine-tuned side by side, look at the
classification report, plot the confusion matrix, and visualize the embeddings.
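
The classification report and the confusion matrix are two views of the same counts. As a rough illustration, the sketch below derives accuracy, precision, recall, and F1 from a made-up 2x2 matrix. The numbers in `toy_cm` are invented and are not results from this lab.

```python
# Made-up counts, laid out like sklearn's confusion_matrix: rows = actual, columns = predicted
toy_cm = [
    [45, 5],   # actual baseball: 45 predicted baseball, 5 predicted space
    [3, 47],   # actual space:    3 predicted baseball, 47 predicted space
]

def per_class_scores(cm, k):
    """Precision, recall, and F1 for class k, read straight off the matrix."""
    true_pos = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column total: everything predicted as k
    actual_k = sum(cm[k])                    # row total: everything that really is k
    precision = true_pos / predicted_k
    recall = true_pos / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

toy_accuracy = sum(toy_cm[k][k] for k in range(2)) / sum(sum(row) for row in toy_cm)
print(f"accuracy: {toy_accuracy:.3f}")
for k, name in enumerate(["baseball", "space"]):
    precision, recall, f1 = per_class_scores(toy_cm, k)
    print(f"{name:<10s} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```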

In [None]:
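# Get fine-tuned predictions on the 100-example test set.
# Note: the zero-shot baseline was scored on only the first n_zero_shot (50) of these
# examples, so the comparison below is indicative rather than exactly like-for-like.
ft_output = trainer.predict(test_dataset)
ft_predictions = np.argmax(ft_output.predictions, axis=-1)
ft_accuracy = accuracy_score(y_test, ft_predictions)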

# Build comparison table
print("=" * 50)
print("  Zero-Shot vs Fine-Tuned Comparison")
print("=" * 50)
print(f"")
print(f"  {'Method':<20s} {'Accuracy':>10s}")
print(f"  {'-'*20} {'-'*10}")
print(f"  {'Zero-shot':<20s} {zs_accuracy:>9.1%}")
print(f"  {'Fine-tuned':<20s} {ft_accuracy:>9.1%}")
print(f"  {'-'*20} {'-'*10}")
print(f"  {'Improvement':<20s} {(ft_accuracy - zs_accuracy):>+9.1%}")
print(f"")
print(f"{len(X_train)} training examples and 3 epochs turned")
print(f"a general model into a specialist.")

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Classification report
target_names = ['rec.sport.baseball', 'sci.space']
print("Fine-Tuned Model — Classification Report:")
print(classification_report(y_test, ft_predictions, target_names=target_names))

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, ft_predictions)

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=target_names,
            yticklabels=target_names, ax=ax)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Fine-Tuned DistilBERT — Confusion Matrix')
plt.tight_layout()
plt.show()

### Section 5b: Explore Embeddings with t-SNE

Let's extract the embeddings from the fine-tuned model and visualize them in 2D.
If fine-tuning worked, we should see clear separation between the two classes.

In [None]:
from sklearn.manifold import TSNE

# Extract embeddings from the fine-tuned model
# We need to get the hidden states before the classification head
model.eval()
all_embeddings = []
all_labels = []

with torch.no_grad():
    for i in range(len(test_dataset)):
        item = test_dataset[i]
        input_ids = item['input_ids'].unsqueeze(0)
        attention_mask = item['attention_mask'].unsqueeze(0)
        
        # Get hidden states from the base model (before classification head)
        outputs = model.distilbert(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze().numpy()
        
        all_embeddings.append(cls_embedding)
        all_labels.append(item['labels'].item())

all_embeddings = np.array(all_embeddings)
all_labels = np.array(all_labels)

print(f"Extracted {len(all_embeddings)} embeddings of dimension {all_embeddings.shape[1]}")

# Reduce to 2D with t-SNE
print("Running t-SNE dimensionality reduction...")
tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(all_embeddings) - 1))
embeddings_2d = tsne.fit_transform(all_embeddings)
print("Done!")

In [None]:
# Plot t-SNE visualization
fig, ax = plt.subplots(figsize=(10, 8))

colors = ['#2196F3', '#FF9800']  # Blue for baseball, orange for space
for label_idx in [0, 1]:
    mask = all_labels == label_idx
    ax.scatter(
        embeddings_2d[mask, 0],
        embeddings_2d[mask, 1],
        c=colors[label_idx],
        label=target_names[label_idx],
        alpha=0.7,
        s=50,
        edgecolors='white',
        linewidth=0.5
    )

ax.set_title('Fine-Tuned DistilBERT Embeddings (t-SNE)', fontsize=14)
ax.set_xlabel('t-SNE dimension 1')
ax.set_ylabel('t-SNE dimension 2')
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("If the two colors form separate clusters, the model has learned distinct")
print("representations for each class. Embeddings like these are also the building")
print("blocks of search, recommendation, and RAG systems.")

**Key insight:** The t-SNE plot should show two clearly separated clusters. This means the
fine-tuned model has learned to push baseball posts and space posts into different regions of
embedding space. The better the separation, the easier classification becomes.
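
One way to see why separation makes classification easy: when the clusters are far apart, even a trivial rule such as "assign the nearest class centroid" gets the answer right. The sketch below uses made-up 2-D points (hypothetical values in `toy_points`). In practice you would apply this idea to the original 768-dimensional embeddings, not to t-SNE coordinates, because t-SNE distorts distances.

```python
import math

def centroid(points):
    """Mean position of a list of equal-length points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

# Made-up, well-separated toy points (assumption: purely illustrative)
toy_points = {
    "baseball": [(-2.0, 0.1), (-1.8, -0.3), (-2.2, 0.2)],
    "space": [(2.1, 0.0), (1.9, 0.4), (2.0, -0.4)],
}
centroids = {name: centroid(pts) for name, pts in toy_points.items()}

def nearest_centroid(point):
    """Label of the class whose centroid is closest to the point."""
    return min(centroids, key=lambda name: math.dist(point, centroids[name]))

for query in [(-1.5, 0.5), (1.0, 1.0)]:
    print(query, "->", nearest_centroid(query))
```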

---

## Section 6: Save and Track

A fine-tuned model is only useful if you can reload it later. Let's save it, check the file sizes,
and record the full experiment lineage — everything someone would need to reproduce this result.
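
Reproducing a run also depends on the software environment, so it can be worth recording that next to the hyperparameters. Here is a minimal sketch using only the standard library. The `environment` dictionary and its field names are hypothetical additions, not part of the lab's log format.

```python
import platform
import sys
from importlib import metadata

def package_version(name):
    """Installed version of a package, or a placeholder if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"

# Hypothetical structure; merge it into your own experiment log as you see fit
environment = {
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "seed": 42,  # the random_state used for data loading and splitting in this lab
    "packages": {
        name: package_version(name)
        for name in ("transformers", "torch", "scikit-learn", "numpy")
    },
}

for key, value in environment.items():
    print(f"{key}: {value}")
```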

In [None]:
import os

# Save the fine-tuned model and tokenizer
save_dir = "./fine_tuned_distilbert"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# Check what files were created
print("Saved model files:")
total_size = 0
for fname in sorted(os.listdir(save_dir)):
    fpath = os.path.join(save_dir, fname)
    size = os.path.getsize(fpath)
    total_size += size
    size_mb = size / (1024 * 1024)
    print(f"  {fname:<40s} {size_mb:>8.2f} MB")

print(f"  {'':->40s} {'':->8s}---")
print(f"  {'TOTAL':<40s} {total_size / (1024 * 1024):>8.2f} MB")

In [None]:
from transformers import DistilBertForSequenceClassification, AutoTokenizer

# Reload the model from disk
loaded_model = DistilBertForSequenceClassification.from_pretrained(save_dir)
loaded_tokenizer = AutoTokenizer.from_pretrained(save_dir)
loaded_model.eval()

# Test on new examples
test_sentences = [
    "The shortstop made a diving catch to end the inning.",
    "SpaceX successfully landed its booster rocket after launch.",
    "The umpire called strike three on the final pitch.",
    "The Mars rover collected soil samples from the crater.",
]

inputs = loaded_tokenizer(test_sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    outputs = loaded_model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    probabilities = torch.softmax(outputs.logits, dim=-1)

print("Predictions from the reloaded model:\n")
for text, pred, proba in zip(test_sentences, predictions, probabilities):
    label = target_names[pred.item()]
    confidence = proba[pred.item()].item() * 100
    print(f"  [{label:>22s}] ({confidence:.0f}%) \"{text[:55]}...\"")

In [None]:
import json
from datetime import datetime

# Record full experiment lineage
experiment_log = {
    "experiment": "Lab 09 — Fine-Tune DistilBERT",
    "timestamp": datetime.now().isoformat(),
    "base_model": "distilbert-base-uncased",
    "task": "binary text classification (baseball vs space)",
    "data": {
        "source": "20newsgroups (scikit-learn)",
        "categories": categories,
        "train_size": len(X_train),
        "val_size": len(X_val),
        "test_size": len(X_test),
        "max_length": 128,
    },
    "hyperparameters": {
        "epochs": 3,
        "learning_rate": 2e-5,
        "batch_size": 16,
        "weight_decay": 0.01,
        "optimizer": "AdamW",
    },
    "results": {
        "zero_shot_accuracy": round(zs_accuracy, 3),
        "fine_tuned_accuracy": round(ft_accuracy, 3),
        "fine_tuned_f1": round(test_results['eval_f1'], 3),
        "improvement": round(ft_accuracy - zs_accuracy, 3),
    },
    "model_path": save_dir,
    "model_size_mb": round(total_size / (1024 * 1024), 1),
}

# Save the experiment log
log_path = "./experiment_log.json"
with open(log_path, "w") as f:
    json.dump(experiment_log, f, indent=2)

print("Experiment Log:")
print(json.dumps(experiment_log, indent=2))
print(f"\nSaved to: {log_path}")
print(f"\nFull lineage: base model + training data + hyperparameters = your fine-tuned model.")
print(f"Together with library versions and random seeds, this log is what someone needs to retry the run.")

In [None]:
# Clean up saved files (optional — comment out if you want to keep them)
import shutil

if os.path.exists(save_dir):
    shutil.rmtree(save_dir)
    print(f"Cleaned up {save_dir}")

if os.path.exists(log_path):
    os.remove(log_path)
    print(f"Cleaned up {log_path}")

if os.path.exists("./results"):
    shutil.rmtree("./results")
    print("Cleaned up ./results")

print("\nDone!")

---
## Summary

You just fine-tuned a language model. Here's what you now know:

| Concept | What You Learned |
|---------|------------------|
| **Pre-trained knowledge** | DistilBERT already understands language — it groups similar topics together |
| **Zero-shot baseline** | General knowledge gets you decent accuracy without any training |
| **Fine-tuning** | A couple hundred examples + 3 epochs can substantially improve task-specific performance |
| **Transfer learning** | You're adapting existing knowledge, not learning from scratch |
| **Embeddings** | t-SNE shows clear class separation after fine-tuning |
| **Model lineage** | Base model + data + hyperparameters = reproducible fine-tuned model |

### What's Next?

In **ML Lab 10**, you'll build a RAG pipeline — combining retrieval with generation to answer
questions using your own documents. The embeddings you learned about here are exactly what make
RAG work.