# Fine-tuning BERT in PyTorch

## Table of contents

1. [Understanding BERT and transfer learning](#understanding-bert-and-transfer-learning)
2. [Setting up the environment](#setting-up-the-environment)
3. [Loading the pre-trained BERT model](#loading-the-pre-trained-bert-model)
4. [Preparing the dataset](#preparing-the-dataset)
5. [Tokenizing input data for BERT](#tokenizing-input-data-for-bert)
6. [Modifying BERT for fine-tuning](#modifying-bert-for-fine-tuning)
7. [Training the fine-tuned BERT model](#training-the-fine-tuned-bert-model)
8. [Evaluating the fine-tuned BERT model](#evaluating-the-fine-tuned-bert-model)
9. [Experimenting with different fine-tuning strategies](#experimenting-with-different-fine-tuning-strategies)

## Understanding BERT and transfer learning

### **Key concepts**
Fine-tuning BERT (Bidirectional Encoder Representations from Transformers) involves adapting a pre-trained BERT model to a specific downstream task, such as text classification, question answering, or named entity recognition. BERT, based on the Transformer architecture, is pre-trained on massive text corpora using tasks like Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Fine-tuning customizes the model to align with task-specific data while leveraging its general language understanding capabilities.

Key steps in fine-tuning BERT include:
- **Task-specific heads**: Adding classification, regression, or other task-related layers on top of BERT’s pre-trained layers.
- **Layer freezing**: Optionally freezing earlier layers to reduce computational cost while fine-tuning the later layers.
- **Learning rate customization**: Using differential learning rates for pre-trained and task-specific layers.
- **Batch processing**: Efficiently managing large sequences and datasets with PyTorch’s data loaders.

The `transformers` library by Hugging Face simplifies fine-tuning BERT in PyTorch, providing prebuilt models and utilities.

### **Applications**
Fine-tuning BERT is used in a variety of NLP tasks:
- **Text classification**: Sentiment analysis, spam filtering, or topic categorization.
- **Question answering**: Extracting answers to user queries from text passages.
- **Named entity recognition (NER)**: Identifying entities like names, dates, and organizations in text.
- **Text summarization**: Condensing lengthy documents into concise summaries, typically in an extractive setup (selecting key sentences) because BERT is an encoder-only model.
- **Semantic similarity**: Determining the relationship or similarity between text pairs.

### **Advantages**
- **Pretrained knowledge**: Leverages extensive pretraining, reducing the need for large task-specific datasets.
- **Bidirectional context**: Captures dependencies in both directions of a sequence for improved understanding.
- **Efficiency**: Fine-tuning requires fewer resources compared to training from scratch.
- **Versatility**: Easily adapts to diverse NLP tasks with minimal modifications.

### **Challenges**
- **Computational cost**: Although far cheaper than pre-training, fine-tuning still requires significant resources (in practice a GPU), especially for large datasets or extended training sessions.
- **Data dependency**: Small datasets can lead to overfitting, requiring careful regularization and augmentation.
- **Hyperparameter sensitivity**: Learning rates, batch sizes, and optimizer settings require careful tuning for optimal performance.
- **Model complexity**: Managing large models like BERT can be challenging in terms of memory and runtime.

## Setting up the environment


##### **Q1: How do you install the necessary libraries for fine-tuning BERT?**


In [1]:
# !pip install torch transformers datasets scikit-learn tqdm

##### **Q2: How do you import the required modules from the `transformers` library to load BERT and handle tokenization?**


In [2]:
from transformers import BertTokenizer, BertForSequenceClassification

##### **Q3: How do you configure the environment to use GPU for fine-tuning BERT in PyTorch?**

In [3]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(device)

cuda


## Loading the pre-trained BERT model


##### **Q4: How do you load a pre-trained BERT model using Hugging Face’s `transformers` library?**


In [5]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)  # binary classification
model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



##### **Q5: How do you load the corresponding tokenizer for BERT to handle input data preprocessing?**


In [6]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # load BERT tokenizer


##### **Q6: How do you inspect the structure of the pre-trained BERT model to understand its layers and outputs?**

In [8]:
print(model)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

## Preparing the dataset


##### **Q7: How do you load a text classification dataset using Hugging Face’s `datasets` library?**


In [10]:
from datasets import load_dataset

dataset = load_dataset("imdb")  # load IMDB sentiment classification dataset

##### **Q8: How do you split the dataset into training, validation, and test sets for fine-tuning BERT?**


In [11]:
dataset = dataset.rename_column("text", "input_text")  # rename for clarity
dataset = dataset["train"].train_test_split(test_size=0.2)  # 80/20 train/held-out split
held_out = dataset["test"].train_test_split(test_size=0.5)  # split the held-out set ONCE so val and test cannot overlap
dataset["validation"] = held_out["train"]  # 10% of the data
dataset["test"] = held_out["test"]  # remaining 10% of the data
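
As a cross-check on the split logic, the same 80/10/10 partition can be sketched with PyTorch's `random_split`, which divides the data in a single call so the three subsets cannot share examples. This is an alternative technique, not the `datasets` API used above; the toy list stands in for the real dataset and the seed value is an arbitrary assumption.

```python
import torch
from torch.utils.data import random_split

# Toy stand-in for a dataset: 100 integer "examples" (an assumption for illustration).
toy_data = list(range(100))

# One call produces three disjoint subsets; the seeded generator makes it repeatable.
generator = torch.Generator().manual_seed(0)
train_set, val_set, test_set = random_split(toy_data, [80, 10, 10], generator=generator)

# Each Subset stores the indices it draws from the original data.
train_idx = set(train_set.indices)
val_idx = set(val_set.indices)
test_idx = set(test_set.indices)

print(len(train_set), len(val_set), len(test_set))
print(train_idx.isdisjoint(val_idx), val_idx.isdisjoint(test_idx), train_idx.isdisjoint(test_idx))
```

Checking disjointness this way is a cheap guard against evaluation leakage whenever a split is built from more than one random call.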

##### **Q9: How do you preprocess the dataset before passing it to the tokenizer?**

In [12]:
def preprocess_function(examples):
    return tokenizer(examples["input_text"], padding="max_length", truncation=True, max_length=256)  # tokenize inputs

In [13]:
encoded_dataset = dataset.map(preprocess_function, batched=True)  # apply tokenization


## Tokenizing input data for BERT


##### **Q10: How do you use `BertTokenizer` to tokenize input text and convert it into token IDs for BERT?**


In [14]:
sample = "This movie was surprisingly good!"
tokens = tokenizer(sample, padding="max_length", truncation=True, max_length=256, return_tensors="pt")  # tokenize sample
tokens = {key: val.to(device) for key, val in tokens.items()}  # move to device

##### **Q11: How do you handle padding and truncation to ensure all input sequences are the same length before feeding them into BERT?**


In [15]:
# already handled in previous tokenization using padding="max_length" and truncation=True
# example shown again for clarity
tokenizer("A short review", padding="max_length", truncation=True, max_length=256)  # uniform input length

{'input_ids': [101, 1037, 2460, 3319, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

##### **Q12: How do you create attention masks to distinguish between padded and real tokens in the input sequences?**


In [16]:
# the tokenizer returns attention_mask by default (return_tensors only controls the output type)
print(tokens["attention_mask"])  # 1s for real tokens, 0s for padding

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0')
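
To make the rule behind the mask explicit, here is a minimal sketch that rebuilds it by hand from the token IDs. It assumes the padding ID is 0, which matches the `padding_idx=0` shown in the model printout, and reuses the IDs of "A short review" from Q11, shortened to a length of 8 for readability.

```python
import torch

# Token IDs for "A short review" padded to length 8; 0 is the [PAD] ID in bert-base-uncased.
pad_token_id = 0
input_ids = torch.tensor([[101, 1037, 2460, 3319, 102, 0, 0, 0]])

# 1 where there is a real token, 0 where there is padding.
attention_mask = (input_ids != pad_token_id).long()
print(attention_mask)  # tensor([[1, 1, 1, 1, 1, 0, 0, 0]])
```

In practice the tokenizer's own mask should be preferred; the manual version is mainly useful for checking that a custom preprocessing step has not dropped or misaligned it.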


##### **Q13: How do you create a PyTorch `DataLoader` to batch the tokenized input data for efficient training?**

In [17]:
from torch.utils.data import DataLoader

encoded_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])  # format for PyTorch
train_dataloader = DataLoader(encoded_dataset["train"], batch_size=16, shuffle=True)  # training loader
val_dataloader = DataLoader(encoded_dataset["validation"], batch_size=16)  # validation loader
test_dataloader = DataLoader(encoded_dataset["test"], batch_size=16)  # test loader
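
Padding every review to 256 tokens is simple but spends compute on padding. A common alternative is dynamic padding, where each batch is padded only to its longest sequence. The sketch below shows the idea with a hand-written collate function; `collate_dynamic`, `PAD_ID`, and the toy examples are hypothetical names and made-up data, and the examples are assumed to be tokenized without `padding="max_length"`.

```python
import torch
from torch.utils.data import DataLoader

PAD_ID = 0  # [PAD] token ID for bert-base-uncased

def collate_dynamic(batch):
    """Pad every example to the longest sequence in this batch (hypothetical helper)."""
    max_len = max(len(item["input_ids"]) for item in batch)
    input_ids, attention_mask, labels = [], [], []
    for item in batch:
        ids = item["input_ids"]
        pad = max_len - len(ids)
        input_ids.append(ids + [PAD_ID] * pad)             # right-pad the token IDs
        attention_mask.append([1] * len(ids) + [0] * pad)  # mask out the padding
        labels.append(item["label"])
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
        "label": torch.tensor(labels),
    }

# Toy, unpadded examples standing in for tokenized reviews (made-up token IDs).
toy_examples = [
    {"input_ids": [101, 2023, 102], "label": 1},
    {"input_ids": [101, 1037, 2460, 3319, 102], "label": 0},
]
loader = DataLoader(toy_examples, batch_size=2, collate_fn=collate_dynamic)
batch = next(iter(loader))
print(batch["input_ids"].shape)  # torch.Size([2, 5])
```

The `transformers` library ships `DataCollatorWithPadding`, which serves the same purpose and is usually the more convenient choice.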

## Modifying BERT for fine-tuning


##### **Q14: How do you add a classification layer on top of the pre-trained BERT model for text classification tasks?**


In [18]:
# the classification head was already added in Q4 (num_labels=2); inspect it here
print(model.classifier)

Linear(in_features=768, out_features=2, bias=True)


##### **Q15: How do you freeze the BERT base layers and train only the added classification head to avoid overfitting early on?**


In [19]:
for param in model.bert.parameters():
    param.requires_grad = False  # freeze BERT encoder

##### **Q16: How do you unfreeze the BERT base layers after initial training to fine-tune the entire model?**

In [20]:
for param in model.bert.parameters():
    param.requires_grad = True  # unfreeze BERT encoder
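
A quick way to confirm that freezing or unfreezing took effect is to count trainable parameters. The helper below is a minimal sketch; `count_parameters` is a hypothetical name, and the two-layer `toy` module is only a stand-in for a body plus a head, not a real BERT model.

```python
import torch.nn as nn

def count_parameters(module):
    """Return (trainable, total) parameter counts for a module."""
    total = sum(p.numel() for p in module.parameters())
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    return trainable, total

# Tiny stand-in with a "body" and a "head"; not a real BERT model.
toy = nn.ModuleDict({"body": nn.Linear(8, 8), "head": nn.Linear(8, 2)})

for param in toy["body"].parameters():
    param.requires_grad = False  # freeze the body, as done for model.bert above

print(count_parameters(toy))  # (18, 90)
```

Applied to the tutorial's model while `model.bert` is frozen, the trainable count should come down to the classifier alone: the `Linear(in_features=768, out_features=2)` head has 768 × 2 + 2 = 1,538 parameters.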

## Training the fine-tuned BERT model


##### **Q17: How do you define the loss function for training BERT on a text classification task?**


In [21]:
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()  # standard loss for classification
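
`nn.CrossEntropyLoss` expects raw logits and integer class labels, and applies the log-softmax internally. The short sketch below checks this against a manual computation; the logits and labels are made-up values.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5], [0.1, 1.2]])  # made-up scores for two examples
labels = torch.tensor([0, 1])                    # true class indices

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, labels)

# Same value by hand: mean negative log-probability of the true class.
log_probs = torch.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(labels)), labels].mean()

print(torch.allclose(loss, manual))  # True
```

Because the softmax is built into the loss, the model's logits should be passed in directly, without applying `softmax` first.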

##### **Q18: How do you set up the AdamW optimizer with weight decay to update the model’s parameters during training?**


In [23]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)  # pytorch-native AdamW
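
The cell above applies weight decay to every parameter. A common convention in BERT fine-tuning scripts, which is not part of the original cell, is to exempt biases and LayerNorm weights from decay by using parameter groups. The sketch below shows the grouping on a toy module; `ToyBlock` is a hypothetical stand-in whose attribute names imitate BERT's parameter naming.

```python
import torch.nn as nn
from torch.optim import AdamW

class ToyBlock(nn.Module):
    """Tiny stand-in whose parameter names mimic BERT's (dense.*, LayerNorm.*)."""
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(4, 4)
        self.LayerNorm = nn.LayerNorm(4)

toy = ToyBlock()
no_decay = ("bias", "LayerNorm.weight")  # name fragments exempt from weight decay

decay_params = [p for n, p in toy.named_parameters() if not any(nd in n for nd in no_decay)]
no_decay_params = [p for n, p in toy.named_parameters() if any(nd in n for nd in no_decay)]

optimizer = AdamW(
    [
        {"params": decay_params, "weight_decay": 0.01},
        {"params": no_decay_params, "weight_decay": 0.0},
    ],
    lr=2e-5,
)
print([group["weight_decay"] for group in optimizer.param_groups])  # [0.01, 0.0]
```

Whether this grouping helps on a given task is an empirical question; it is shown here only as an option to try alongside the simpler setup.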

##### **Q19: How do you implement the training loop, including the forward pass through BERT, loss calculation, and backpropagation?**


In [24]:
from tqdm import tqdm

def train(model, dataloader, optimizer, loss_fn):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    for batch in tqdm(dataloader):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)  # forward pass
        logits = outputs.logits
        loss = loss_fn(logits, labels)  # compute loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        preds = torch.argmax(logits, dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    
    avg_loss = total_loss / len(dataloader)
    accuracy = correct / total
    print(f"Train Loss: {avg_loss:.4f} | Accuracy: {accuracy:.4f}")

##### **Q20: How do you apply gradient clipping to prevent exploding gradients during the fine-tuning of BERT?**


In [25]:
from torch.nn.utils import clip_grad_norm_

# insert after loss.backward() and before optimizer.step() in the training loop
clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescales gradients in place; returns the total norm before clipping

tensor(0.)
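
`clip_grad_norm_` returns the total gradient norm measured before clipping, so the `tensor(0.)` above is consistent with the call running before any backward pass had produced gradients. The sketch below shows the effect on a toy model with deliberately large gradients; the model, data, and the factor of 100 are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_

torch.manual_seed(0)
toy = nn.Linear(4, 2)  # tiny stand-in model

# Scale the loss up so the gradient norm is well above the clipping threshold.
loss = (toy(torch.randn(8, 4)) ** 2).sum() * 100.0
loss.backward()

norm_before = clip_grad_norm_(toy.parameters(), max_norm=1.0)  # returns the pre-clip norm
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in toy.parameters()))

print(f"before: {norm_before.item():.2f} | after: {norm_after.item():.2f}")
```

Logging the returned norm during training is a simple way to see how often clipping is actually active.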

##### **Q21: How do you track and log the training loss and accuracy over multiple epochs while fine-tuning BERT?**

In [26]:
num_epochs = 3
train_losses = []
train_accuracies = []

for epoch in range(num_epochs):
    print(f"Epoch {epoch+1}/{num_epochs}")
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    for batch in train_dataloader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        loss = loss_fn(logits, labels)

        optimizer.zero_grad()
        loss.backward()
        clip_grad_norm_(model.parameters(), max_norm=1.0)  # apply clipping
        optimizer.step()

        total_loss += loss.item()
        preds = torch.argmax(logits, dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)

    avg_loss = total_loss / len(train_dataloader)
    accuracy = correct / total
    train_losses.append(avg_loss)
    train_accuracies.append(accuracy)
    print(f"Train Loss: {avg_loss:.4f} | Accuracy: {accuracy:.4f}")

Epoch 1/3
Train Loss: 0.2919 | Accuracy: 0.8814
Epoch 2/3
Train Loss: 0.1735 | Accuracy: 0.9441
Epoch 3/3
Train Loss: 0.1070 | Accuracy: 0.9732


## Evaluating the fine-tuned BERT model


##### **Q22: How do you evaluate the fine-tuned BERT model on a validation or test set to calculate performance metrics?**


In [27]:
def evaluate(model, dataloader, loss_fn):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            loss = loss_fn(logits, labels)

            total_loss += loss.item()
            preds = torch.argmax(logits, dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)

    avg_loss = total_loss / len(dataloader)
    accuracy = correct / total
    print(f"Eval Loss: {avg_loss:.4f} | Accuracy: {accuracy:.4f}")

In [28]:
evaluate(model, val_dataloader, loss_fn)

Eval Loss: 0.3460 | Accuracy: 0.9152


##### **Q23: How do you compute additional evaluation metrics for the fine-tuned BERT model?**


In [29]:
from sklearn.metrics import precision_recall_fscore_support

def compute_metrics(model, dataloader):
    model.eval()
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.argmax(logits, dim=1)

            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.cpu().tolist())

    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="binary")
    print(f"Precision: {precision:.4f} | Recall: {recall:.4f} | F1: {f1:.4f}")

In [30]:
compute_metrics(model, val_dataloader)

Precision: 0.9193 | Recall: 0.9073 | F1: 0.9133
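
For reference, the three metrics follow directly from the counts of true positives, false positives, and false negatives. The sketch below computes them by hand for a small set of made-up labels and predictions; `binary_prf` is a hypothetical helper, not part of scikit-learn.

```python
def binary_prf(labels, preds, positive=1):
    """Precision, recall, and F1 for the positive class, computed from raw counts."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Made-up labels and predictions: 3 true positives, 1 false positive, 1 false negative.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
preds = [1, 1, 0, 0, 1, 0, 1, 0]

precision, recall, f1 = binary_prf(labels, preds)
print(f"Precision: {precision:.4f} | Recall: {recall:.4f} | F1: {f1:.4f}")
# Precision: 0.7500 | Recall: 0.7500 | F1: 0.7500
```

The same F1 formula ties the reported numbers together: 2 × 0.9193 × 0.9073 / (0.9193 + 0.9073) ≈ 0.9133.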


##### **Q24: How do you implement a function to perform inference using the fine-tuned BERT model on new text data?**

In [31]:
def predict(text, model, tokenizer, max_length=256):
    model.eval()
    inputs = tokenizer(text, padding="max_length", truncation=True, max_length=max_length, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.softmax(logits, dim=1)
        pred = torch.argmax(probs, dim=1).item()
    
    return pred, probs.squeeze().cpu().tolist()

In [32]:
example_text = "An absolutely wonderful movie with great performances!"
label, probabilities = predict(example_text, model, tokenizer)
print(f"Predicted Label: {label} | Probabilities: {probabilities}")

Predicted Label: 1 | Probabilities: [0.0010631340555846691, 0.9989368319511414]
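
The integer returned by `predict` is easier to read when mapped to a class name. In the IMDB dataset on the Hugging Face Hub, label 0 is negative and label 1 is positive. The sketch below applies that mapping to made-up logits; `id2label` is a name chosen here for illustration.

```python
import torch

# IMDB label convention on the Hugging Face Hub: 0 = negative, 1 = positive.
id2label = {0: "negative", 1: "positive"}

logits = torch.tensor([[-3.4, 3.4]])  # made-up logits for a single review
probs = torch.softmax(logits, dim=1)  # convert logits to class probabilities
pred = int(torch.argmax(probs, dim=1))

print(f"{id2label[pred]} (p={probs[0, pred].item():.4f})")
```

Under this convention, the `Predicted Label: 1` above corresponds to a positive review.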


## Experimenting with different fine-tuning strategies


##### **Q25: How do you experiment with freezing and unfreezing different layers of BERT during fine-tuning to observe their impact on performance?**


In [42]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [43]:
def freeze_bert_layers(model, freeze_until_layer):
    for name, param in model.bert.named_parameters():
        if "encoder.layer." in name:
            layer_num = int(name.split("encoder.layer.")[1].split(".")[0])
            param.requires_grad = layer_num >= freeze_until_layer
        else:
            param.requires_grad = True

def preprocess_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)
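
To see exactly which parameters `freeze_bert_layers` leaves trainable, the name-parsing rule can be applied to a few parameter names in isolation. The sketch below mirrors that rule; `is_trainable` is a hypothetical helper, and the names follow the pattern produced by `model.bert.named_parameters()`.

```python
def is_trainable(name, freeze_until_layer):
    """Mirror the rule in freeze_bert_layers for a single parameter name."""
    if "encoder.layer." in name:
        layer_num = int(name.split("encoder.layer.")[1].split(".")[0])
        return layer_num >= freeze_until_layer
    return True  # embeddings and pooler are left trainable

# Parameter names in the style of model.bert.named_parameters().
names = [
    "embeddings.word_embeddings.weight",
    "encoder.layer.0.attention.self.query.weight",
    "encoder.layer.9.output.dense.bias",
    "encoder.layer.11.output.dense.weight",
    "pooler.dense.weight",
]
for name in names:
    print(f"{name}: trainable={is_trainable(name, freeze_until_layer=10)}")
```

With `freeze_until_layer=10`, encoder layers 0 to 9 are frozen while layers 10 and 11, the embeddings, and the pooler stay trainable. Even `freeze_until_layer=12` therefore still updates the embeddings and pooler alongside the classifier.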

In [44]:
dataset = load_dataset("imdb")["train"].train_test_split(test_size=0.2)
tokenized = dataset.map(preprocess_function, batched=True)
tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
train_loader = DataLoader(tokenized["train"].select(range(1000)), batch_size=16, shuffle=True)
val_loader = DataLoader(tokenized["test"].select(range(300)), batch_size=16)


In [45]:
def train_and_eval(freeze_until_layer):
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2).to(device)
    freeze_bert_layers(model, freeze_until_layer=freeze_until_layer)
    
    optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(2):
        for batch in train_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = loss_fn(outputs.logits, labels)

            optimizer.zero_grad()
            loss.backward()
            clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

    # evaluation
    model.eval()
    total_loss = 0
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = loss_fn(outputs.logits, labels)
            total_loss += loss.item()

            preds = torch.argmax(outputs.logits, dim=1)
            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.cpu().tolist())

    avg_loss = total_loss / len(val_loader)
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="binary")
    print(f"Freeze < {freeze_until_layer} layers --> Loss: {avg_loss:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f} | F1: {f1:.4f}")

In [46]:
for layer in [0, 6, 10, 12]:  # 0 = full model trainable, 12 = all encoder layers frozen (embeddings, pooler, and classifier still train)
    train_and_eval(freeze_until_layer=layer)

Freeze < 0 layers --> Loss: 0.4247 | Precision: 0.7638 | Recall: 0.9870 | F1: 0.8612


Freeze < 6 layers --> Loss: 0.2995 | Precision: 0.8973 | Recall: 0.8506 | F1: 0.8733


Freeze < 10 layers --> Loss: 0.4179 | Precision: 0.8092 | Recall: 0.7987 | F1: 0.8039


Freeze < 12 layers --> Loss: 0.6432 | Precision: 0.6667 | Recall: 0.8442 | F1: 0.7450


##### **Q26: How do you experiment with different learning rates for the classification head and the BERT base model?**



In [48]:
def train_and_eval(base_lr, head_lr):
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2).to(device)

    optimizer = AdamW([
        {"params": model.bert.parameters(), "lr": base_lr},
        {"params": model.classifier.parameters(), "lr": head_lr}
    ], weight_decay=0.01)

    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(2):
        for batch in train_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = loss_fn(outputs.logits, labels)

            optimizer.zero_grad()
            loss.backward()
            clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

    # evaluation
    model.eval()
    total_loss = 0
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = loss_fn(outputs.logits, labels)
            total_loss += loss.item()

            preds = torch.argmax(outputs.logits, dim=1)
            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.cpu().tolist())

    avg_loss = total_loss / len(val_loader)
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="binary")
    print(f"Base LR: {base_lr:.1e} | Head LR: {head_lr:.1e} --> Loss: {avg_loss:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f} | F1: {f1:.4f}")

In [49]:
combinations = [(5e-5, 5e-5), (1e-5, 5e-5), (1e-5, 1e-4), (2e-6, 1e-4)]
for base_lr, head_lr in combinations:
    train_and_eval(base_lr, head_lr)

Base LR: 5.0e-05 | Head LR: 5.0e-05 --> Loss: 0.4350 | Precision: 0.7308 | Recall: 0.9870 | F1: 0.8398


Base LR: 1.0e-05 | Head LR: 5.0e-05 --> Loss: 0.4168 | Precision: 0.7579 | Recall: 0.9351 | F1: 0.8372


Base LR: 1.0e-05 | Head LR: 1.0e-04 --> Loss: 0.3854 | Precision: 0.8198 | Recall: 0.9156 | F1: 0.8650


Base LR: 2.0e-06 | Head LR: 1.0e-04 --> Loss: 0.5275 | Precision: 0.7584 | Recall: 0.7338 | F1: 0.7459


##### **Q27: How do you experiment with different batch sizes and observe their impact on training stability and performance?**


In [50]:
def train_with_batch_size(batch_size):
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2).to(device)
    loss_fn = nn.CrossEntropyLoss()
    optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

    train_subset = tokenized["train"].select(range(1000))
    val_subset = tokenized["test"].select(range(300))
    train_loader = DataLoader(train_subset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_subset, batch_size=batch_size)

    model.train()
    for epoch in range(2):
        for batch in train_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = loss_fn(outputs.logits, labels)

            optimizer.zero_grad()
            loss.backward()
            clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

    # evaluation
    model.eval()
    total_loss = 0
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = loss_fn(outputs.logits, labels)
            total_loss += loss.item()

            preds = torch.argmax(outputs.logits, dim=1)
            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.cpu().tolist())

    avg_loss = total_loss / len(val_loader)
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="binary")
    print(f"Batch Size: {batch_size} --> Loss: {avg_loss:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f} | F1: {f1:.4f}")

In [51]:
for bs in [8, 16, 32]:
    train_with_batch_size(bs)

Batch Size: 8 --> Loss: 0.4070 | Precision: 0.8710 | Recall: 0.8766 | F1: 0.8738


Batch Size: 16 --> Loss: 0.3356 | Precision: 0.8462 | Recall: 0.9286 | F1: 0.8854


Batch Size: 32 --> Loss: 0.3122 | Precision: 0.8537 | Recall: 0.9091 | F1: 0.8805
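
When GPU memory rules out a larger batch, gradient accumulation can emulate one: gradients from several small batches are summed before a single optimizer step. This technique is not used in the cells above. The sketch below checks, on a toy model with made-up data, that two micro-batches of 4 yield the same gradient as one batch of 8, assuming equal-sized micro-batches and no dropout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy = nn.Linear(4, 2)  # tiny stand-in for the model
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))

# Reference: gradient from one batch of 8.
toy.zero_grad()
loss_fn(toy(x), y).backward()
full_grad = toy.weight.grad.clone()

# Accumulation: two micro-batches of 4, each loss divided by the number of steps.
accum_steps = 2
toy.zero_grad()
for xb, yb in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    (loss_fn(toy(xb), yb) / accum_steps).backward()  # gradients add up across calls

print(torch.allclose(full_grad, toy.weight.grad, atol=1e-6))  # True
```

In a real training loop, `optimizer.step()` and `optimizer.zero_grad()` would run once every `accum_steps` micro-batches rather than after every batch.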


##### **Q28: How do you fine-tune BERT with a smaller dataset and apply regularization techniques to prevent overfitting?**


In [52]:
small_train_dataset = tokenized["train"].select(range(300))  # small training set
val_dataset = tokenized["test"].select(range(300))

train_loader = DataLoader(small_train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)

In [54]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.dropout.p = 0.4  # raise dropout on the classification head only (encoder dropout stays at 0.1)
model.to(device)

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()



In [55]:
model.train()
for epoch in range(3):
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs.logits, labels)

        optimizer.zero_grad()
        loss.backward()
        clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

In [56]:
model.eval()
total_loss = 0
all_preds = []
all_labels = []

with torch.no_grad():
    for batch in val_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs.logits, labels)
        total_loss += loss.item()

        preds = torch.argmax(outputs.logits, dim=1)
        all_preds.extend(preds.cpu().tolist())
        all_labels.extend(labels.cpu().tolist())

In [57]:
avg_loss = total_loss / len(val_loader)
precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="binary")
print(f"Small Dataset Eval --> Loss: {avg_loss:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f} | F1: {f1:.4f}")

Small Dataset Eval --> Loss: 0.4103 | Precision: 0.8841 | Recall: 0.7922 | F1: 0.8356


##### **Q29: How do you implement early stopping based on validation performance to prevent overfitting during fine-tuning?**


In [58]:
train_dataset = tokenized["train"].select(range(1000))
val_dataset = tokenized["test"].select(range(300))

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)

In [59]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.dropout.p = 0.3  # optional: classifier-head dropout only (default 0.1); encoder dropout is unchanged
model.to(device)

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()



In [60]:
best_val_loss = float("inf")
patience = 2
counter = 0

for epoch in range(10):
    print(f"Epoch {epoch+1}")
    model.train()
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs.logits, labels)

        optimizer.zero_grad()
        loss.backward()
        clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

    # evaluate
    model.eval()
    val_loss = 0
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = loss_fn(outputs.logits, labels)
            val_loss += loss.item()

            preds = torch.argmax(outputs.logits, dim=1)
            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.cpu().tolist())

    avg_val_loss = val_loss / len(val_loader)
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="binary")
    print(f"Validation Loss: {avg_val_loss:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f} | F1: {f1:.4f}")

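    # early stopping: reset the counter on improvement, otherwise count a "bad" epoch
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        counter = 0
    else:
        counter += 1
        # stop after `patience` consecutive epochs without a new best validation loss;
        # note that the model then holds the last epoch's weights, not the best epoch's
        if counter >= patience:
            print("Early stopping triggered.")
            break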

Epoch 1
Validation Loss: 0.3658 | Precision: 0.8291 | Recall: 0.8506 | F1: 0.8397
Epoch 2
Validation Loss: 0.3070 | Precision: 0.9021 | Recall: 0.8377 | F1: 0.8687
Epoch 3
Validation Loss: 0.4064 | Precision: 0.8774 | Recall: 0.8831 | F1: 0.8803
Epoch 4
Validation Loss: 0.5336 | Precision: 0.8710 | Recall: 0.8766 | F1: 0.8738
Early stopping triggered.
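
The loop above stops at the right time, but it does not record which epoch was best or keep that epoch's weights. The following is a minimal sketch of one way to track both, not the notebook's own method. The `EarlyStopping` class and its `step` method are hypothetical names introduced here, and a plain dict stands in for `model.state_dict()` so the example runs without a GPU or a model. It replays the validation losses printed above.

```python
import copy


class EarlyStopping:
    """Hypothetical helper: tracks the best validation loss and a snapshot of the best state."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.best_state = None
        self.counter = 0

    def step(self, val_loss, epoch, state):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            # Snapshot the state so later epochs cannot overwrite it.
            self.best_state = copy.deepcopy(state)
            self.counter = 0
            return False
        self.counter += 1
        return self.counter >= self.patience


# Replay the validation losses from the run above.
# Assumption: the dict is a stand-in for model.state_dict().
val_losses = [0.3658, 0.3070, 0.4064, 0.5336]
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss, epoch, state={"epoch": epoch}):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch, stopper.best_loss)  # 4 2 0.307
```

In a real training loop, one would presumably pass `model.state_dict()` as `state` and call `model.load_state_dict(stopper.best_state)` after the loop, so that evaluation uses the epoch 2 weights rather than the overfit epoch 4 weights.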


##### **Q30: How do you fine-tune BERT on a specific task and analyze the results?**

In [61]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.to(device)

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()



In [62]:
train_loader = DataLoader(tokenized["train"].select(range(1000)), batch_size=16, shuffle=True)
test_loader = DataLoader(tokenized["test"].select(range(300)), batch_size=16)

In [63]:
model.train()
for epoch in range(3):
    print(f"Epoch {epoch+1}")
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs.logits, labels)

        optimizer.zero_grad()
        loss.backward()
        clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

Epoch 1
Epoch 2
Epoch 3


In [64]:
model.eval()
total_loss = 0
all_preds = []
all_labels = []

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs.logits, labels)
        total_loss += loss.item()

        preds = torch.argmax(outputs.logits, dim=1)
        all_preds.extend(preds.cpu().tolist())
        all_labels.extend(labels.cpu().tolist())

In [65]:
avg_loss = total_loss / len(test_loader)
precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="binary")
print(f"Test Loss: {avg_loss:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f} | F1: {f1:.4f}")

Test Loss: 0.5654 | Precision: 0.8111 | Recall: 0.9481 | F1: 0.8743
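
Recall well above precision, as in the test metrics above, suggests the model produces more false positives than false negatives. Looking at the raw confusion counts makes that explicit. The sketch below is a plain-Python illustration using a small made-up label list, not the notebook's data. `binary_report` is a hypothetical helper name, and class `1` is assumed to be the positive class, which matches the default of `average="binary"`.

```python
def binary_report(y_true, y_pred):
    """Confusion counts and precision/recall/F1 for binary labels (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only (not taken from the run above).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
report = binary_report(y_true, y_pred)
print(report)
```

In the notebook, calling `binary_report(all_labels, all_preds)` after the evaluation cell should give the same precision, recall, and F1 as `precision_recall_fscore_support`, plus the counts behind them.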


In [66]:
# inference function
def predict(text):
    """Return (predicted class index, list of class probabilities) for one string."""
    model.eval()
    # pad/truncate to a fixed 256 tokens and return PyTorch tensors
    inputs = tokenizer(text, padding="max_length", truncation=True, max_length=256, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=1)  # logits -> probabilities over the 2 classes
        pred = torch.argmax(probs, dim=1).item()

    return pred, probs.squeeze().cpu().tolist()

In [67]:
# analyze prediction
text = "A genuinely moving film with outstanding performances."
label, probs = predict(text)
print(f"Inference → Label: {label} | Probabilities: {probs}")

Inference → Label: 1 | Probabilities: [0.005515105556696653, 0.9944848418235779]
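
The raw output is a class index and a probability list, which is easier to read with a label name and a confidence check. The sketch below is one possible wrapper. `describe_prediction`, the `min_confidence` threshold, and the `("negative", "positive")` label names are assumptions introduced here: the mapping `0 = negative, 1 = positive` is a guess based on the review-style input and should be checked against the dataset's label definition.

```python
def describe_prediction(probs, label_names=("negative", "positive"), min_confidence=0.9):
    """Map a probability list to a label name and flag low-confidence predictions."""
    # Index of the most probable class (same result as argmax over the probabilities).
    best = max(range(len(probs)), key=lambda i: probs[i])
    confidence = probs[best]
    return {
        "label": label_names[best],  # assumed mapping: 0 = negative, 1 = positive
        "confidence": confidence,
        "confident": confidence >= min_confidence,
    }


# Probabilities copied from the inference output above.
probs = [0.005515105556696653, 0.9944848418235779]
result = describe_prediction(probs)
print(result["label"], round(result["confidence"], 4), result["confident"])  # positive 0.9945 True
```

Predictions that fall below the threshold could be routed to manual review instead of being accepted automatically; the threshold of 0.9 is arbitrary and would need tuning on validation data.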
