# Fine-Tuning BERT for Sequence Classification

This notebook demonstrates how to fine-tune a pre-trained BERT model for sequence classification using the Hugging Face transformers library. We'll work with the GLUE MRPC (Microsoft Research Paraphrase Corpus) dataset to classify whether two sentences are paraphrases of each other.

## Learning Objectives
- Understand the basic fine-tuning workflow
- Learn how to preprocess data for BERT
- Use the Trainer API for training and evaluation
- Compute metrics to evaluate model performance

## 1. Import Required Libraries

First, let's import the essential libraries we'll need for fine-tuning our model.

In [None]:
# Core PyTorch libraries for training
import torch
from torch.optim import AdamW

# Hugging Face transformers for pre-trained models
from transformers import AutoModelForSequenceClassification, AutoTokenizer

## 2. Basic Fine-Tuning Example

Let's start with a simple example to understand the basic workflow of fine-tuning. We'll create a minimal training loop with just two examples.

In [None]:
# Load a pre-trained BERT model and tokenizer
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Create sample training data (positive sentiment examples)
sequences = [
    "This is a great movie!",
    "This is a fantastic movie!",
]

# Tokenize the input sequences
# padding=True ensures all sequences have the same length
# truncation=True cuts off sequences that are too long
# return_tensors="pt" returns PyTorch tensors
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# Add labels for our examples (1 = positive, 0 = negative)
batch["labels"] = torch.tensor([1, 1])

# Set up optimizer (AdamW is commonly used for transformers)
optimizer = AdamW(model.parameters(), lr=5e-5)

# Forward pass: compute loss
loss = model(**batch).loss
print(f"Initial loss: {loss.item():.4f}")

# Backward pass: compute gradients
loss.backward()

# Update model parameters
optimizer.step()

print("Completed one training step!")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
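For intuition about the `Initial loss` value: with integer class labels and `num_labels=2`, the loss returned by the model is a cross-entropy over the two logits. The sketch below recomputes that quantity in plain Python on made-up logits (the numbers and the helper names `cross_entropy` and `mean_cross_entropy` are illustrative assumptions, not model output or library API). A freshly initialized head tends to produce near-equal logits, so the loss usually starts close to ln 2 ≈ 0.693.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy for one example: -log softmax(logits)[label]."""
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def mean_cross_entropy(batch_logits, labels):
    """Average the per-example losses, as a mean-reduced loss would."""
    losses = [cross_entropy(row, y) for row, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

# Hypothetical logits for two examples, both labeled 1 as in the toy batch.
untrained = [[0.0, 0.0], [0.0, 0.0]]    # uninformative head
confident = [[-2.0, 2.0], [-1.5, 2.5]]  # head that favors class 1
labels = [1, 1]

loss_untrained = mean_cross_entropy(untrained, labels)
loss_confident = mean_cross_entropy(confident, labels)
print(f"untrained: {loss_untrained:.4f}")
print(f"confident: {loss_confident:.4f}")
```

A single optimizer step on two examples will not move the loss much; the point of the cell is the forward, backward, step sequence that the `Trainer` later automates.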

## 3. Loading a Real Dataset

Now let's work with a real dataset from the GLUE benchmark. We'll use the MRPC (Microsoft Research Paraphrase Corpus) dataset which contains pairs of sentences labeled as paraphrases or not.

In [None]:
# Load the GLUE MRPC dataset using Hugging Face datasets library
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")

# Display the dataset structure
print("Dataset structure:")
print(raw_datasets)

# The dataset contains:
# - Training set: 3,668 sentence pairs
# - Validation set: 408 sentence pairs
# - Test set: 1,725 sentence pairs
# Each example has: sentence1, sentence2, label (0=not paraphrase, 1=paraphrase), idx

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

### 3.1 Exploring the Dataset

Let's examine the structure and content of our dataset to understand what we're working with.

In [None]:
# Access the training dataset and examine a specific example
raw_train_dataset = raw_datasets["train"]

# Look at example 14 from the training set
example = raw_train_dataset[14]
print("Example from training set:")
print(f"Sentence 1: {example['sentence1']}")
print(f"Sentence 2: {example['sentence2']}")
print(f"Label: {example['label']} ({'Paraphrase' if example['label'] == 1 else 'Not paraphrase'})")
print(f"Index: {example['idx']}")

{'sentence1': 'Gyorgy Heizler , head of the local disaster unit , said the coach was carrying 38 passengers .',
 'sentence2': 'The head of the local disaster unit , Gyorgy Heizler , said the coach driver had failed to heed red stop lights .',
 'label': 0,
 'idx': 15}

In [None]:
# Examine the features (columns) in the training dataset
print("Training dataset features:")
print(raw_train_dataset.features)

# This shows us the data types and structure:
# - sentence1: string
# - sentence2: string  
# - label: ClassLabel with 2 classes (0, 1)
# - idx: int32 (unique identifier)

{'sentence1': Value('string'),
 'sentence2': Value('string'),
 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
 'idx': Value('int32')}
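The `ClassLabel` feature stores labels as integers and keeps the human-readable names alongside them. A minimal plain-Python sketch of that mapping, using the names printed above (the helpers `int2str` and `str2int` mirror the method names on `ClassLabel`, but this is a stand-in for illustration, not the library implementation):

```python
# Names copied from the features printed above; list position is the label id.
LABEL_NAMES = ["not_equivalent", "equivalent"]

def int2str(label_id):
    """Map an integer label to its name."""
    return LABEL_NAMES[label_id]

def str2int(name):
    """Map a label name back to its integer id."""
    return LABEL_NAMES.index(name)

print(int2str(0))
print(str2int("equivalent"))
```

So a `label` of 0 in the examples means `not_equivalent`, and 1 means `equivalent` (a paraphrase).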

In [None]:
# Look at an example from the validation set
raw_validation_dataset = raw_datasets["validation"]

example_val = raw_validation_dataset[87]
print("Example from validation set:")
print(f"Sentence 1: {example_val['sentence1']}")
print(f"Sentence 2: {example_val['sentence2']}")
print(f"Label: {example_val['label']} ({'Paraphrase' if example_val['label'] == 1 else 'Not paraphrase'})")

{'sentence1': 'However , EPA officials would not confirm the 20 percent figure .',
 'sentence2': 'Only in the past few weeks have officials settled on the 20 percent figure .',
 'label': 0,
 'idx': 812}

In [None]:
# Validation dataset has the same structure as training dataset
print("Validation dataset features:")
print(raw_validation_dataset.features)

{'sentence1': Value('string'),
 'sentence2': Value('string'),
 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
 'idx': Value('int32')}

## 4. Tokenization for BERT

Now we need to convert our text data into tokens that BERT can understand. Let's explore how BERT tokenizes single sentences and sentence pairs.

In [None]:
# Tokenize the first sentence from example 15
sentence1 = raw_datasets["train"]["sentence1"][15]
print(f"Original sentence 1: {sentence1}")

inputs = tokenizer(sentence1, return_tensors="pt")
print(f"Tokenized input IDs: {inputs['input_ids']}")
print(f"Attention mask: {inputs['attention_mask']}")

{'input_ids': tensor([[  101, 24049,  2001,  2087,  3728,  3026,  3580,  2343,  2005,  1996,
          9722,  1004,  4132,  9340, 12439,  2964,  2449,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
# Tokenize the second sentence from the same example
sentence2 = raw_datasets["train"]["sentence2"][15]
print(f"Original sentence 2: {sentence2}")

inputs = tokenizer(sentence2, return_tensors="pt")
print(f"Tokenized input IDs: {inputs['input_ids']}")
print(f"Attention mask: {inputs['attention_mask']}")

{'input_ids': tensor([[  101,  3026,  3580,  2343,  4388, 24049,  1010,  3839,  2132,  1997,
          1996,  9722,  1998,  4132,  9340, 12439,  2964,  3131,  1010,  2097,
          2599,  1996,  2047,  9178,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1]])}

In [None]:
# Tokenize both sentences together (this is what we need for sentence pair tasks)
# BERT uses [CLS] token at start, [SEP] to separate sentences, and [SEP] at end
sentence1 = raw_datasets["train"]["sentence1"][15]
sentence2 = raw_datasets["train"]["sentence2"][15]

inputs = tokenizer(sentence1, sentence2, return_tensors="pt")
print("Combined tokenization:")
print(f"Input IDs: {inputs['input_ids']}")
print(f"Token type IDs: {inputs['token_type_ids']}")  # 0 for sentence1, 1 for sentence2
print(f"Attention mask: {inputs['attention_mask']}")

{'input_ids': tensor([[  101, 24049,  2001,  2087,  3728,  3026,  3580,  2343,  2005,  1996,
          9722,  1004,  4132,  9340, 12439,  2964,  2449,  1012,   102,  3026,
          3580,  2343,  4388, 24049,  1010,  3839,  2132,  1997,  1996,  9722,
          1998,  4132,  9340, 12439,  2964,  3131,  1010,  2097,  2599,  1996,
          2047,  9178,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
# Convert token IDs back to readable tokens to see how BERT processes the text
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Tokens with special BERT tokens:")
for i, token in enumerate(tokens):
    print(f"{i:2d}: {token}")

# Notice: [CLS] at start, [SEP] between sentences, [SEP] at end

['[CLS]',
 'rudder',
 'was',
 'most',
 'recently',
 'senior',
 'vice',
 'president',
 'for',
 'the',
 'developer',
 '&',
 'platform',
 'evan',
 '##gel',
 '##ism',
 'business',
 '.',
 '[SEP]',
 'senior',
 'vice',
 'president',
 'eric',
 'rudder',
 ',',
 'formerly',
 'head',
 'of',
 'the',
 'developer',
 'and',
 'platform',
 'evan',
 '##gel',
 '##ism',
 'unit',
 ',',
 'will',
 'lead',
 'the',
 'new',
 'entity',
 '.',
 '[SEP]']
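The layout of the pair encoding can be reproduced by hand. The sketch below assembles `[CLS] A [SEP] B [SEP]` from the WordPiece tokens listed above and derives the matching `token_type_ids`; `build_pair` is a hypothetical helper written for this illustration, not part of the tokenizer API.

```python
def build_pair(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] and the matching segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)  # no padding yet, so every position is real
    return tokens, token_type_ids, attention_mask

# WordPiece tokens copied from the listing above.
tokens_a = ("rudder was most recently senior vice president for the "
            "developer & platform evan ##gel ##ism business .").split()
tokens_b = ("senior vice president eric rudder , formerly head of the "
            "developer and platform evan ##gel ##ism unit , will lead "
            "the new entity .").split()

tokens, token_type_ids, attention_mask = build_pair(tokens_a, tokens_b)
print(len(tokens), token_type_ids.count(0), token_type_ids.count(1))
```

The counts line up with the tensors printed earlier: 44 positions in total, 19 with segment id 0 and 25 with segment id 1.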

### 4.1 Batch Tokenization

Now let's tokenize the entire dataset efficiently using the `map` function. This applies our tokenization function to all examples in the dataset.

In [None]:
# Define a function to tokenize sentence pairs
def tokenize_function(example):
    # tokenizer automatically handles sentence pairs when given two arguments
    # truncation=True ensures sequences don't exceed BERT's max length (512 tokens)
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

# Apply tokenization to all splits of the dataset
# batched=True processes multiple examples at once for efficiency
tokenized_dataset = raw_datasets.map(tokenize_function, batched=True)

print("Tokenized dataset structure:")
print(tokenized_dataset)
print(f"\nExample tokenized fields: {list(tokenized_dataset['train'][0].keys())}")

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})
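With `batched=True`, the mapped function does not receive one example at a time: it receives a dictionary of lists (a slice of each column) and returns a dictionary of lists that become new columns. The plain-Python sketch below imitates that contract so the data flow is visible; `map_batched` and `toy_tokenize` are hypothetical stand-ins, and the real `map` does considerably more (caching, Arrow storage, multiprocessing).

```python
def map_batched(columns, fn, batch_size=1000):
    """Apply fn to column slices and merge its output as new columns.

    `columns` is a dict of equal-length lists, one list per column.
    """
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # In batched mode fn sees lists of values, not a single row.
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Returned columns are added to (or replace) the existing ones.
    return {**columns, **new_columns}

def toy_tokenize(batch):
    """Stand-in for tokenize_function: whitespace-split each sentence pair."""
    tokens = [s1.split() + s2.split()
              for s1, s2 in zip(batch["sentence1"], batch["sentence2"])]
    return {"tokens": tokens}

toy = {
    "sentence1": ["a b", "c d e", "f"],
    "sentence2": ["g", "h i", "j k"],
    "label": [1, 0, 1],
}
mapped = map_batched(toy, toy_tokenize, batch_size=2)
print(list(mapped.keys()))
print(mapped["tokens"])
```

This is why `tokenize_function` can pass `example["sentence1"]` straight to the tokenizer: in batched mode it is already a list of sentences, which the tokenizer processes in one call.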

## 5. Data Collation and Padding

Since sequences have different lengths, we need to pad them to the same length for batching. The `DataCollatorWithPadding` handles this dynamically.

In [None]:
# Create a data collator that will pad sequences to the same length in each batch
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Get a small sample to demonstrate padding
# Remove non-tensor columns (strings can't be converted to tensors)
samples = tokenized_dataset["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}

# Show the original lengths before padding
print("Original sequence lengths:")
lengths = [len(x) for x in samples["input_ids"]]
print(lengths)
print(f"Min length: {min(lengths)}, Max length: {max(lengths)}")

[50, 59, 47, 67, 59, 50, 62, 32]
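To make the idea of dynamic padding concrete, here is a minimal plain-Python sketch of what padding a batch to its longest member involves. `pad_batch` is a hypothetical helper, the token values are dummies, and `pad_id=0` assumes BERT's `[PAD]` token id; the real collator also pads `token_type_ids`, renames `label` to `labels`, and returns tensors.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad each sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Dummy sequences with the eight lengths reported above (token values are arbitrary).
lengths = [50, 59, 47, 67, 59, 50, 62, 32]
sequences = [[7] * n for n in lengths]

padded = pad_batch(sequences)
print({len(row) for row in padded["input_ids"]})
print([sum(mask) for mask in padded["attention_mask"]])
```

Padding per batch means this batch is padded to 67 tokens rather than to the longest sequence in the whole dataset or to the 512-token model limit, which saves computation on the padded positions.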

In [None]:
# Apply the data collator to create a properly padded batch
batch = data_collator(samples)

# Show the shapes after padding - all sequences now have the same length
print("Batch shapes after padding:")
for key, tensor in batch.items():
    print(f"{key}: {tensor.shape}")

# All sequences are now padded to the same length (the longest in the batch)

{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}

**Try it out:** Replicate the preprocessing on the GLUE SST-2 dataset. It's a little different, since it is composed of single sentences instead of pairs, but the rest of what we did should look the same. For a harder challenge, try to write a preprocessing function that works on any of the GLUE tasks.
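One possible approach to the harder challenge is a lookup table from task name to text column(s), with `None` marking single-sentence tasks. The sketch below assumes the column names shown in the table (worth checking against each dataset's `features`) and uses a hypothetical `recording_tokenizer` stand-in so it runs without downloading a checkpoint; in the notebook you would pass the real `tokenizer` instead.

```python
# Text column(s) for each GLUE task; None means the task has a single sentence.
TASK_TO_KEYS = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

def make_tokenize_function(task, tokenizer):
    """Return a function suitable for Dataset.map(..., batched=True)."""
    key1, key2 = TASK_TO_KEYS[task]

    def tokenize_function(example):
        if key2 is None:
            return tokenizer(example[key1], truncation=True)
        return tokenizer(example[key1], example[key2], truncation=True)

    return tokenize_function

# Stand-in tokenizer so the sketch runs offline; it only records how it was called.
def recording_tokenizer(first, second=None, truncation=False):
    return {"first": first, "second": second, "truncation": truncation}

sst2_fn = make_tokenize_function("sst2", recording_tokenizer)
mrpc_fn = make_tokenize_function("mrpc", recording_tokenizer)
print(sst2_fn({"sentence": ["a fine film"]}))
print(mrpc_fn({"sentence1": ["x"], "sentence2": ["y"]}))
```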

## 6. Setting Up Training with the Trainer API

The Hugging Face `Trainer` class provides a high-level interface for training and evaluation. Let's set up the training arguments and model.

In [None]:
# Set up training arguments
from transformers import TrainingArguments

# Define basic training configuration
# "test-trainer" is the output directory where model checkpoints will be saved
training_args = TrainingArguments(
    output_dir="test-trainer",
    num_train_epochs=3,              # Number of training epochs
    per_device_train_batch_size=16,  # Batch size for training
    per_device_eval_batch_size=64,   # Batch size for evaluation
    warmup_steps=500,                # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # Strength of weight decay
    logging_dir='./logs',            # Directory for storing logs
)

print("Training arguments configured!")
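To see what `warmup_steps=500` does to the learning rate, the sketch below reimplements a linear warmup followed by linear decay, which is the shape of the `Trainer`'s default scheduler as far as this example assumes. The function name `linear_schedule_multiplier`, the peak learning rate of 5e-5, and the total of 690 steps (3,668 examples, batch size 16, 3 epochs, one device) are assumptions made for illustration.

```python
def linear_schedule_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))

base_lr = 5e-5        # assumed peak learning rate
warmup_steps = 500    # from the training arguments above
total_steps = 690     # assumed: ceil(3668 / 16) steps per epoch * 3 epochs

for step in (0, 250, 500, 595, 690):
    lr = base_lr * linear_schedule_multiplier(step, warmup_steps, total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

Under these assumptions the warmup covers roughly the first two of three epochs, so the peak learning rate is reached late in the run. On a dataset this small, a shorter warmup may be worth trying.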

In [None]:
# Load model for sequence classification
from transformers import AutoModelForSequenceClassification

# Load BERT with a classification head for 2 classes (paraphrase vs not paraphrase)
# num_labels=2 adds a classification layer on top of BERT
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

print("Model loaded successfully!")
print(f"Model has {model.num_parameters():,} parameters")



In [None]:
# Create the Trainer object
from transformers import Trainer

trainer = Trainer(
    model=model,                              # The model to train
    args=training_args,                       # Training arguments
    train_dataset=tokenized_dataset["train"], # Training dataset
    eval_dataset=tokenized_dataset["validation"], # Validation dataset
    data_collator=data_collator,              # Data collator for padding
    processing_class=tokenizer,               # Tokenizer for processing
)

print("Trainer configured and ready for training!")

### 6.1 Training the Model

Now let's actually train the model! This will take a few minutes depending on your hardware.

In [None]:
# Start training! This will fine-tune BERT on the MRPC dataset
# The trainer will automatically handle the training loop, loss computation, and optimization
print("Starting training...")
trainer.train()
print("Training completed!")

Step,Training Loss
500,0.5368
1000,0.3229


TrainOutput(global_step=1377, training_loss=0.36212434124582993, metrics={'train_runtime': 110.8924, 'train_samples_per_second': 99.231, 'train_steps_per_second': 12.417, 'total_flos': 405114969714960.0, 'train_loss': 0.36212434124582993, 'epoch': 3.0})
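The `global_step` reported by `trainer.train()` can be sanity-checked with a little arithmetic: it is the number of batches per epoch times the number of epochs. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept; `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a run, assuming one device and a kept last batch."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * epochs

train_size = 3668  # MRPC training split

steps_bs16 = total_optimizer_steps(train_size, batch_size=16, epochs=3)
steps_bs8 = total_optimizer_steps(train_size, batch_size=8, epochs=3)
print(steps_bs16, steps_bs8)
```

With a batch size of 16 this gives 690 steps, while a batch size of 8 gives 1,377. The logged `global_step=1377` therefore suggests the displayed output came from a run with a per-device batch size of 8, so expect different step counts when running the cell with the arguments shown here.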

## 7. Model Evaluation

Now let's evaluate our trained model on the validation set to see how well it performs.

In [None]:
# Get predictions on the validation set
predictions = trainer.predict(tokenized_dataset["validation"])

print(f"Prediction logits shape: {predictions.predictions.shape}")  # (num_examples, num_classes)
print(f"True labels shape: {predictions.label_ids.shape}")         # (num_examples,)
print(f"Number of validation examples: {len(predictions.label_ids)}")

(408, 2) (408,)

In [None]:
# Convert logits to predicted class labels
import numpy as np

# Take the argmax to get the predicted class (0 or 1)
preds = np.argmax(predictions.predictions, axis=-1)

print(f"First 10 predictions: {preds[:10]}")
print(f"First 10 true labels: {predictions.label_ids[:10]}")

# Quick accuracy calculation
accuracy = (preds == predictions.label_ids).mean()
print(f"Simple accuracy: {accuracy:.4f}")
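The model outputs raw logits, not probabilities. Taking the `argmax` of the logits gives the same class as taking the `argmax` of the softmax probabilities, because softmax preserves ordering. A small plain-Python sketch on made-up logits (the values and the helper names are illustrative, not model output):

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical logits for three validation examples (not real model output).
logits = [[1.2, -0.3], [-0.8, 2.1], [0.1, 0.4]]

for row in logits:
    probs = softmax(row)
    print(argmax(row), [round(p, 3) for p in probs])
```

The softmax step is only needed when you want confidence scores; for accuracy and F1 the `argmax` of the logits is enough.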

### 7.1 Computing Proper Metrics

Let's use the official GLUE metrics for proper evaluation of our model's performance.

In [None]:
# Load the official GLUE metric for MRPC
import evaluate

metric = evaluate.load("glue", "mrpc")

# Compute the official metrics (accuracy and F1 score)
results = metric.compute(predictions=preds, references=predictions.label_ids)

print("Official GLUE MRPC metrics:")
for key, value in results.items():
    print(f"{key}: {value:.4f}")

# MRPC uses both accuracy and F1 score as evaluation metrics

{'accuracy': 0.8504901960784313, 'f1': 0.893542757417103}
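To see what the two reported numbers measure, here is a plain-Python sketch of accuracy and binary F1 computed from prediction and reference labels. It assumes F1 is taken with class 1 (`equivalent`) as the positive class; `accuracy_and_f1` is a hypothetical helper and the labels are toy values, not model output.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Accuracy plus F1 for the positive class, from two label lists."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    # F1 = 2*TP / (2*TP + FP + FN); define it as 0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}

# Toy labels, not model output.
toy_preds = [1, 1, 0, 1, 0, 1]
toy_refs = [1, 0, 0, 1, 1, 1]
scores = accuracy_and_f1(toy_preds, toy_refs)
print(scores)
```

Accuracy treats both classes alike, while F1 focuses on how well the positive class is recovered, so the two can differ noticeably when the classes are not balanced.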

In [None]:
# Define a function that computes metrics during training
def compute_metrics(eval_preds):
    """
    Function to compute metrics during evaluation.
    Called automatically by the Trainer.
    """
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    
    # Convert logits to predictions
    predictions = np.argmax(logits, axis=-1)
    
    # Return the computed metrics
    return metric.compute(predictions=predictions, references=labels)

print("Metric computation function defined!")

### 7.2 Setting Up Automatic Evaluation

For better training, we can set up automatic evaluation during training. This helps monitor our model's progress.

In [None]:
# Create new training arguments with evaluation enabled
training_args = TrainingArguments(
    output_dir="test-trainer",
    eval_strategy="epoch",           # Evaluate at the end of each epoch
    save_strategy="epoch",           # Save model at the end of each epoch
    logging_dir='./logs',
    num_train_epochs=3,              # Three epochs, matching the results table below
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    load_best_model_at_end=True,     # Load the best model when training ends
)

# Load a fresh model (this will reset any previous training)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Create new trainer with metric computation
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,  # This enables automatic metric computation
)

print("New trainer with evaluation configured!")



### 7.3 Training with Automatic Evaluation

Now let's train a new model with evaluation happening at the end of each epoch.

In [None]:
# Train the model with automatic evaluation
# You'll see both training loss and evaluation metrics after each epoch
print("Starting training with evaluation...")
trainer.train()
print("Training with evaluation completed!")

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.430748,0.835784,0.88057
2,0.559400,0.413827,0.857843,0.900685
3,0.337600,0.664447,0.855392,0.900506


TrainOutput(global_step=1377, training_loss=0.3823748209724066, metrics={'train_runtime': 116.2305, 'train_samples_per_second': 94.674, 'train_steps_per_second': 11.847, 'total_flos': 405114969714960.0, 'train_loss': 0.3823748209724066, 'epoch': 3.0})
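With `load_best_model_at_end=True` and no `metric_for_best_model` set, the checkpoint comparison is assumed here to use the validation loss, where lower is better. The sketch below applies that rule to the per-epoch numbers from the table above and contrasts it with selecting by F1; `best_epoch` is a hypothetical helper, not the `Trainer`'s internal logic.

```python
# (epoch, validation loss, accuracy, F1) rows copied from the table above.
eval_history = [
    (1, 0.430748, 0.835784, 0.880570),
    (2, 0.413827, 0.857843, 0.900685),
    (3, 0.664447, 0.855392, 0.900506),
]

def best_epoch(history, column, greater_is_better):
    """Return the epoch whose value in `column` is best."""
    pick = max if greater_is_better else min
    return pick(history, key=lambda row: row[column])[0]

by_loss = best_epoch(eval_history, column=1, greater_is_better=False)
by_f1 = best_epoch(eval_history, column=3, greater_is_better=True)
print(by_loss, by_f1)
```

Both criteria point to epoch 2 in this run. The jump in validation loss at epoch 3, while the training loss keeps falling, is a typical sign of overfitting. To select on F1 instead of loss, `TrainingArguments` accepts `metric_for_best_model="f1"`.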

## Summary

Congratulations! You've successfully fine-tuned a BERT model for sequence classification. Here's what we accomplished:

1. **Basic Fine-tuning**: Understood the core workflow with a simple example
2. **Dataset Loading**: Loaded and explored the GLUE MRPC dataset
3. **Tokenization**: Learned how BERT processes sentence pairs
4. **Data Preparation**: Set up efficient batching and padding
5. **Training Setup**: Configured the Trainer API for fine-tuning
6. **Model Training**: Fine-tuned BERT on the paraphrase detection task
7. **Evaluation**: Computed proper metrics to assess performance

### Key Takeaways
- When fine-tuning a pre-trained model, only the classification head is newly (randomly) initialized; all other weights come from the checkpoint
- Tokenization for sentence pairs uses special tokens ([CLS], [SEP])
- The Trainer API simplifies the training process significantly
- Evaluation during training helps monitor progress
- GLUE tasks have specific metrics (accuracy + F1 for MRPC)

### Next Steps
- Try different hyperparameters (learning rate, batch size, epochs)
- Experiment with other GLUE tasks (SST-2, QNLI, etc.)
- Compare different pre-trained models (RoBERTa, DeBERTa)
- Explore parameter-efficient fine-tuning (LoRA, adapters)