# Fine-Tuning Pre-Trained Text Classifier Models

Having explored how to build text processing pipelines and classifiers from the ground up, you are now ready to leverage more advanced and efficient techniques. Instead of training a model from scratch, which can be computationally intensive and require vast amounts of data, you'll use a pre-trained model. This approach utilizes a model that has already learned rich language patterns from enormous datasets, giving you a powerful head start through transfer learning.

In this lab, you will focus on fine-tuning **DistilBERT**, a lighter and faster version of the BERT model, to classify recipe titles. This process demonstrates how to adapt a general-purpose language model for a specialized task. You'll also see how tools from the Hugging Face ecosystem streamline many of the manual data preparation steps, such as tokenization and padding.

This lab will guide you through the following essential steps:

* Loading the pre-trained DistilBERT model along with its specific tokenizer.
* Preparing the recipe dataset using a custom `Dataset` class and an automated `DataCollatorWithPadding` for efficient batching.
* Implementing two fine-tuning strategies: one where you update the entire model and another, more efficient method where you only train the final few layers.
* Comparing the performance of both methods to evaluate the trade-offs between accuracy and computational cost.
* Testing your models on new, unseen recipe titles to assess their generalization capabilities.

## Imports

In [2]:
!pip install torchmetrics



In [3]:
!git clone https://github.com/fayrouz2/B5.git

Cloning into 'B5'...
remote: Enumerating objects: 785, done.
remote: Counting objects: 100% (85/85), done.
remote: Compressing objects: 100% (48/48), done.
remote: Total 785 (delta 66), reused 37 (delta 37), pack-reused 700 (from 2)
Receiving objects: 100% (785/785), 35.63 MiB | 24.05 MiB/s, done.
Resolving deltas: 100% (270/270), done.


In [4]:
%cd /content/B5/W5_NLP/M2/labs/C2_M3_Lab_4_finetuned_text_classifier


/content/B5/W5_NLP/M2/labs/C2_M3_Lab_4_finetuned_text_classifier


In [5]:
!ls


C2_M3_Lab_4_finetuned_text_classifier.ipynb  helper_utils.py


In [None]:
import random

import numpy as np
import pandas as pd

from sklearn.utils.class_weight import compute_class_weight

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

import transformers

import helper_utils

# Set random seed for reproducibility
SEED = 99
torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


In [None]:
import gdown
import pandas as pd

file_id = '1NZjBHPzTrLahTaWcZ6GUH1GEYCTAODOw'
url = f'https://drive.google.com/uc?id={file_id}'

# Download the file to your machine or the environment you are working in
output = 'recipes_fruit_veg.csv'
gdown.download(url, output, quiet=False)

# Now read it with pandas
df = pd.read_csv(output)
print(df.head())

Downloading...
From (original): https://drive.google.com/uc?id=1NZjBHPzTrLahTaWcZ6GUH1GEYCTAODOw
From (redirected): https://drive.google.com/uc?id=1NZjBHPzTrLahTaWcZ6GUH1GEYCTAODOw&confirm=t&uuid=e9d3b12d-f0cc-413e-80c6-b41117530c58
To: /content/B5/W5_NLP/M2/labs/C2_M3_Lab_4_finetuned_text_classifier/recipes_fruit_veg.csv
100%|██████████| 107M/107M [00:01<00:00, 59.9MB/s] 


                               name      id  minutes  \
0  a bit different  breakfast pizza   31490       30   
1         all in the kitchen  chili  112140      130   
2                alouette  potatoes   59389       45   
3           apple a day  milk shake    5289        0   
4          bananas 4 ice cream  pie   70971      180   

                                         ingredients  \
0  ['prepared pizza crust', 'sausage patty', 'egg...   
1  ['ground beef', 'yellow onions', 'diced tomato...   
2  ['spreadable cheese with garlic and herbs', 'n...   
3  ['milk', 'vanilla ice cream', 'frozen apple ju...   
4  ['chocolate sandwich style cookies', 'chocolat...   

                                               steps   category  
0  ['preheat oven to 425 degrees f', 'press dough...  vegetable  
1  ['brown ground beef in large pot', 'add choppe...  vegetable  
2  ['place potatoes in a large pot of lightly sal...  vegetable  
3  ['combine ingredients in blender', 'cover and ...      frui

## Revisiting Recipe Dataset

You will re-use the recipe dataset from the previous lab. As a reminder, this is a specialized subset of the large [Food.com Recipes and Interactions](https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions), containing titles for recipes that have been clearly classified as either fruit-based or vegetable-based.

### Data Preparation

* Load the `recipes_fruit_veg.csv` file into a pandas DataFrame.
* Create a numerical `label` column from the text categories, mapping `'fruit'` to `0` and `'vegetable'` to `1`.
* Extract the recipe names and numerical labels into two separate lists, `texts` and `labels`.

In [None]:
# Load the filtered dataset into a pandas DataFrame
#df = pd.read_csv("recipes_fruit_veg.csv")

# Create the numerical 'label' column: 0 for 'fruit', 1 for 'vegetable'
df['label'] = 1
df.loc[df['category'] == 'fruit', 'label'] = 0

# Extract the recipe names and labels into lists
df_clean = df.dropna(subset=['name'])
texts = df_clean['name'].tolist()
labels = df_clean['label'].tolist()

# Verify the dataset size and class distribution
print(f"Total samples for classification:  {len(texts)}")
print(f"Fruit recipes:                     {labels.count(0)}, {round(labels.count(0)/(labels.count(0) + labels.count(1)) *100,1)} %")
print(f"Vegetable recipes:                 {labels.count(1)}, {round(labels.count(1)/(labels.count(0) + labels.count(1)) *100,1)} %")

Total samples for classification:  142915
Fruit recipes:                     29148, 20.4 %
Vegetable recipes:                 113767, 79.6 %


### Previewing the `name` and `label` Columns

Your data is now structured with the `name` and `label` columns.

* Run the cell below to review a random sample of these training pairs.

In [None]:
# Set the number of random samples to display.
num_samples = 10

# Display a sample of name and label pairs.
display(df[['name', 'label']].sample(num_samples, random_state=25).style.hide(axis="index"))

name,label
cajun tomato gravy,1
scallop soup,1
chicken piccata light,0
bombay kidney beans,1
surefire siu mai dim sum,1
linguini alla critzos,1
zuppa toscana,1
maine blueberry cake,0
lemon cream cheese coffee cake,0
sweet smoky salmon kabobs,1


## Loading the Pre-trained Transformer

In the previous lab, you built every part of your text classifier from scratch. The core difference in this lab is that you will replace several components you previously had to build yourself with highly optimized tools from the Hugging Face ecosystem.

Specifically, you will be using the [DistilBERT](https://huggingface.co/distilbert-base-uncased) model. This involves loading two key components that are designed to work together:

* **The Pre-trained Model**: This is a powerful neural network, DistilBERT, that has already learned to understand language from a massive amount of text. Its role is to provide a strong foundation of language understanding that you will adapt for your recipe classification task.

* **The Tokenizer**: This is the bridge between your raw text and the model. It will translate your recipe titles into the specific numerical format the model was trained on. Each pre-trained model has its own specific tokenizer, and it is crucial to use the one that matches your model.

* Execute the cell below to download the base DistilBERT model and tokenizer from the Hugging Face Hub.

In [None]:
model_name="distilbert-base-uncased"
model_path="./distilbert-local-base"

# Ensure the model is downloaded
helper_utils.download_bert(model_name, model_path)

Downloading base model 'distilbert-base-uncased' to ./distilbert-local-base...


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


Base model downloaded and saved successfully.


* Load the pre-trained transformer.
    * `num_classes=2`: Attaches a new randomly initialized classification head with 2 output labels, preparing the model for your binary classification task.
    
**Note**: You will see a warning that some weights were "newly initialized." This is expected. It confirms that you have successfully loaded the pre-trained DistilBERT base and attached a new, untrained classification head.  

In [None]:
bert_model, bert_tokenizer = helper_utils.load_bert(model_path, num_classes=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at ./distilbert-local-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading base model from ./distilbert-local-base and adding a new head with 2 classes.
Model and tokenizer loaded successfully.


## Preparing Data for Training

Now that you have your model, tokenizer, and data lists ready, the next step is to structure this data into the objects PyTorch requires for training. This process is simpler than in the previous lab because many of the manual steps you performed before, such as cleaning text with the `preprocess_text` function and building a custom `Vocabulary` class, are no longer necessary.

The Hugging Face tokenizer handles this work for you. It performs the text cleaning, tokenization, and numerical conversion automatically inside the custom `Dataset` class you are about to create. You will define this `Dataset` to wrap your data and then use `DataLoaders` to create iterable batches.

### `RecipeDataset` Dataset Class

You will start by defining a `RecipeDataset` class, the purpose of which is to use your tokenizer to convert a single raw text sample into the required numerical tensors on the fly, right when the model needs it.

* Define the `RecipeDataset`, which will serve as a container for your data and manage the on-the-fly tokenization process.
    * `__init__`: Initializes the dataset by storing your `texts`, `labels`, and the `tokenizer`.
    * `__len__`: Returns the total number of samples in your dataset.
    * `__getitem__`: This is the core method where the on-the-fly processing occurs. For each text sample, a single call to the `tokenizer` performs the preprocessing steps you previously handled manually: it normalizes the text, tokenizes it into sub-words, converts the tokens to numerical IDs using its built-in vocabulary, and creates an attention mask. The method then adds the label to this encoding and returns the resulting dictionary, ready to be batched for the model.

In [None]:
class RecipeDataset(Dataset):
    """
    Custom PyTorch Dataset for text classification.

    This Dataset class stores raw texts and their corresponding labels. It is
    designed to work efficiently with a Hugging Face tokenizer, performing
    tokenization on the fly for each sample when it is requested.
    """
    def __init__(self, texts, labels, tokenizer):
        """
        Initializes the RecipeDataset.

        Args:
            texts: A list of raw text strings.
            labels: A list of integer labels corresponding to the texts.
            tokenizer: A Hugging Face tokenizer instance for processing text.
        """
        # Store the list of raw text strings.
        self.texts = texts
        # Store the list of integer labels.
        self.labels = labels
        # Store the tokenizer instance that will process the text.
        self.tokenizer = tokenizer

    def __len__(self):
        """Returns the total number of samples in the dataset."""
        # Return the size of the dataset based on the number of texts.
        return len(self.texts)

    def __getitem__(self, idx):
        """
        Retrieves and processes one sample from the dataset.

        For a given index, this method fetches the corresponding text and label,
        tokenizes the text, and returns a dictionary of tensors.

        Args:
            idx: The index of the sample to retrieve.

        Returns:
            A dictionary containing the tokenized inputs ('input_ids',
            'attention_mask') and the 'labels' as tensors.
        """
        # Get the raw text and label for the specified index.
        text = self.texts[idx]
        label = self.labels[idx]

        # Tokenize the text, handling tasks like cleaning, numerical conversion,
        # and truncation. Padding is handled later by a DataCollator.
        encoding = self.tokenizer(text, truncation=True, max_length=512)

        # Add the label to the encoding dictionary and convert it to a tensor.
        encoding['labels'] = torch.tensor(label, dtype=torch.long)

        # Return the dictionary containing all processed data for the sample.
        return encoding
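
To make the tokenizer's role more concrete, here is a minimal sketch of greedy longest-match sub-word splitting, the general idea behind the WordPiece scheme that DistilBERT's tokenizer uses. The tiny `TOY_VOCAB` and the functions `wordpiece` and `toy_encode` are hypothetical stand-ins written for this illustration; the real `bert_tokenizer` relies on its own, much larger vocabulary and additional normalization rules.

```python
# Toy vocabulary (hypothetical; the real DistilBERT vocabulary has 30,522 entries).
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "blue": 4, "##berry": 5, "cake": 6, "lemon": 7}


def wordpiece(word, vocab):
    """Greedily split one word into the longest sub-word pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def toy_encode(text, vocab=TOY_VOCAB, max_length=512):
    """Lowercase, split into sub-words, add special tokens, and map to IDs."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens = tokens[:max_length - 1] + ["[SEP]"]  # truncate, then close the sequence
    input_ids = [vocab[token] for token in tokens]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


print(toy_encode("Blueberry cake"))
# {'input_ids': [2, 4, 5, 6, 3], 'attention_mask': [1, 1, 1, 1, 1]}
```

The returned dictionary has the same two keys that the real tokenizer produces for each recipe title, which is why `__getitem__` only needs to add `labels` to it.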

* Create an instance of your `RecipeDataset`.

In [None]:
# Create the full dataset
full_dataset = RecipeDataset(texts, labels, bert_tokenizer)

### Splitting the Data

* Divide your `full_dataset` into an 80% training set and a 20% validation set.

In [None]:
# Split the full dataset into an 80% training set and a 20% validation set.
train_dataset, val_dataset = helper_utils.create_dataset_splits(
    full_dataset, 
    train_split_percentage=0.8
)

# Print the number of samples in each set to verify the split.
print(f"Training samples:   {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")

Training samples:   114332
Validation samples: 28583
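
The internals of `helper_utils.create_dataset_splits` are not shown in this lab, so the following is only a sketch of one common way to produce such a split: shuffle the sample indices once and cut the list at the 80% mark. The function name `split_indices` and its parameters are assumptions made for illustration.

```python
import random


def split_indices(n_samples, train_percent=80, seed=99):
    """Shuffle indices 0..n_samples-1 and split them into train/validation lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG keeps the split reproducible
    n_train = n_samples * train_percent // 100  # integer arithmetic avoids float rounding
    return indices[:n_train], indices[n_train:]


train_idx, val_idx = split_indices(142915)
print(len(train_idx), len(val_idx))  # 114332 28583
```

With 142,915 samples, this reproduces the sizes printed above: 114,332 for training and 28,583 for validation.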


### Create DataLoaders

In the previous lab, you addressed the challenge of batching variable-length text by writing custom `collate_fn` functions to manually pad sequences or create offsets. The Hugging Face `DataCollatorWithPadding` class automates this step for you.

* Use [DataCollatorWithPadding](https://huggingface.co/docs/transformers/en/main_classes/data_collator#transformers.DataCollatorWithPadding) and pass it your `bert_tokenizer`. It will automatically handle the dynamic padding of each batch.

In [None]:
# Data collator handles dynamic padding for each batch
data_collator = transformers.DataCollatorWithPadding(tokenizer=bert_tokenizer)
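
As a rough picture of what dynamic padding means, the sketch below pads every sample only up to the longest sequence in its own batch, rather than to a global maximum length. `pad_batch` is a hypothetical plain-Python stand-in, not the library implementation; the pad ID of `0` follows the `padding_idx=0` shown for DistilBERT's word embeddings, and the token IDs are made up.

```python
def pad_batch(samples, pad_token_id=0):
    """Pad each sample to the length of the longest sequence in this batch."""
    max_len = max(len(sample["input_ids"]) for sample in samples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for sample in samples:
        n_pad = max_len - len(sample["input_ids"])
        batch["input_ids"].append(sample["input_ids"] + [pad_token_id] * n_pad)
        # Padded positions get a 0 in the mask so the model ignores them.
        batch["attention_mask"].append(sample["attention_mask"] + [0] * n_pad)
        batch["labels"].append(sample["labels"])
    return batch


# Two tokenized titles of different lengths (illustrative IDs).
samples = [
    {"input_ids": [2, 4, 5, 6, 3], "attention_mask": [1, 1, 1, 1, 1], "labels": 0},
    {"input_ids": [2, 8, 3], "attention_mask": [1, 1, 1], "labels": 1},
]
batch = pad_batch(samples)
print(batch["input_ids"])       # [[2, 4, 5, 6, 3], [2, 8, 3, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Because recipe titles are short, padding per batch keeps the tensors far smaller than padding everything to the 512-token limit would.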

* Create two `DataLoader` instances, `train_loader` and `val_loader`.
    * `collate_fn=data_collator`: Passes your `data_collator` so that batches are dynamically padded instead of using the default PyTorch collation.

In [None]:
# Set the number of samples to process in each batch.
batch_size = 32

# Create the DataLoader for the training set with `data_collator`
train_loader = DataLoader(train_dataset, 
                          batch_size=batch_size, 
                          shuffle=True, 
                          collate_fn=data_collator
                         )

# Create the DataLoader for the validation set with `data_collator`
val_loader = DataLoader(val_dataset, 
                        batch_size=batch_size, 
                        shuffle=False, 
                        collate_fn=data_collator
                       )

## Training the Model

With the pre-trained DistilBERT model loaded and the `DataLoaders` fully configured, the foundational work is complete. You are now ready to begin the fine-tuning process.

### Addressing Class Imbalance

* Calculate class weights to address the data imbalance in your training set.

In [None]:
# Extract all labels from the training set to calculate class weights for handling imbalance.
train_labels_list = [train_dataset.dataset.labels[i] for i in train_dataset.indices]
    
    
# Use scikit-learn's utility to automatically calculate class weights.
class_weights = compute_class_weight(
    # The strategy for calculating weights. 'balanced' is automatic.
    class_weight='balanced',
    # The array of unique class labels (e.g., [0, 1]).
    classes=np.unique(train_labels_list),
    # The list of all training labels, used to count class frequencies.
    y=train_labels_list
)

# Convert the NumPy array of weights into a PyTorch tensor of type float
class_weights = torch.tensor(class_weights, dtype=torch.float).to(device)

# Print the final weights to verify the calculation.
print("Calculated Class Weights:")
print(f"  - Fruit (Class 0):     {class_weights[0]:.2f}")
print(f"  - Vegetable (Class 1): {class_weights[1]:.2f}")

Calculated Class Weights:
  - Fruit (Class 0):     2.45
  - Vegetable (Class 1): 0.63
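
For reference, the `'balanced'` strategy weights each class by `n_samples / (n_classes * class_count)`, so the rarer class receives the larger weight. The sketch below applies that formula by hand. Note the assumption: it uses the class counts of the full dataset printed earlier, because the exact counts in the training split are not shown, so the values only approximate the ones computed above.

```python
def balanced_class_weights(class_counts):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts)
    n_classes = len(class_counts)
    return [n_samples / (n_classes * count) for count in class_counts]


# Full-dataset counts from earlier in the lab: 29,148 fruit and 113,767 vegetable.
weights = balanced_class_weights([29148, 113767])
print([round(weight, 2) for weight in weights])  # [2.45, 0.63]
```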


### Configuring the Loss Function

* Define `nn.CrossEntropyLoss` as your loss function, and pass your previously calculated `class_weights` tensor to the weight parameter.

In [None]:
# Initialize the CrossEntropyLoss function with the calculated `class_weights`.
loss_function = nn.CrossEntropyLoss(weight=class_weights)
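
To see how the weights enter the loss, the following plain-Python sketch computes a weighted cross-entropy the way `nn.CrossEntropyLoss` does with its default mean reduction: each sample's loss is scaled by the weight of its true class, and the sum is divided by the sum of those weights. The function name `weighted_cross_entropy` and the example logits are illustrative assumptions.

```python
import math


def weighted_cross_entropy(logits, targets, class_weights):
    """Weighted mean of per-sample cross-entropy losses."""
    weighted_sum, weight_total = 0.0, 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the class logits.
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in row))
        sample_loss = log_sum_exp - row[target]  # negative log-softmax of the true class
        weighted_sum += class_weights[target] * sample_loss
        weight_total += class_weights[target]
    return weighted_sum / weight_total


class_weights = [2.45, 0.63]  # fruit, vegetable (values printed above)

# With uninformative logits, every sample costs log(2) regardless of the weights.
uniform_loss = weighted_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1], class_weights)
print(round(uniform_loss, 4))  # 0.6931

# The same wrong prediction is penalized more when the true class is fruit.
wrong_fruit = class_weights[0] * -math.log(0.1)
wrong_vegetable = class_weights[1] * -math.log(0.1)
print(wrong_fruit > wrong_vegetable)  # True
```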

### Baseline Approach: Fine-Tuning the Entire Model

First, you will take the standard approach: fine-tuning the *entire* DistilBERT model. This means that every parameter, from the initial embedding layers to the final classification layer, will have its weights updated during training. Keep in mind that training starts from the pre-trained weights and continues from there, rather than from a random initialization.

This method adapts the whole model to the recipe classification task and will serve as your performance baseline. You will use the `training_loop` function to run the training process and see how well this approach works.

* For each batch, it explicitly unpacks the `input_ids`, `attention_mask`, and `labels` required by the model.
* It then fine-tunes all layers of the DistilBERT model on your dataset.

In [None]:
# Display the source code of the training loop function
helper_utils.display_function(helper_utils.training_loop)

```python
def training_loop(model, train_loader, val_loader, loss_function, num_epochs, device):
    """
    Performs a full training and validation cycle for a PyTorch model.

    Args:
        model: The PyTorch model to be trained.
        train_loader: The DataLoader for the training dataset.
        val_loader: The DataLoader for the validation dataset.
        loss_function: The loss function used for training.
        num_epochs: The total number of epochs to train for.
        device: The computational device ('cuda' or 'cpu') to run on.

    Returns:
        A tuple containing the trained model and a dictionary of the final
        performance metrics from the last validation epoch.
    """
    # Move the model to the specified computational device.
    model.to(device)

    # Initialize the AdamW optimizer with a default learning rate.
    optimizer = optim.AdamW(model.parameters(), lr=5e-5)

    # Determine the number of classes from the model's configuration.
    num_classes = model.config.num_labels

    # Initialize metric objects from torchmetrics for stateful metric calculation.
    val_accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes).to(device)
    val_precision = torchmetrics.Precision(task="multiclass", num_classes=num_classes, average="macro").to(device)
    val_recall = torchmetrics.Recall(task="multiclass", num_classes=num_classes, average="macro").to(device)
    val_f1 = torchmetrics.F1Score(task="multiclass", num_classes=num_classes, average="macro").to(device)

    # Create the main progress bar that iterates over the epochs.
    epoch_loop = tqdm(range(num_epochs), desc="Training Progress")

    # Begin the main training and validation loop.
    for epoch in epoch_loop:

        # --- Training Phase ---
        # Set the model to training mode, which enables layers like dropout.
        model.train()
        # Initialize the accumulated training loss for the epoch.
        train_loss_epoch = 0

        # Create a nested progress bar for the training batches of the current epoch.
        train_inner_loop = tqdm(
            train_loader, desc=f"Epoch {epoch+1}/{num_epochs} Training", leave=False
        )
        # Iterate over the training data batches.
        for batch in train_inner_loop:
            # Unpack the batch and move all tensors to the active device.
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            # Clear any gradients from the previous iteration.
            optimizer.zero_grad()

            # Perform a forward pass to get the model's raw outputs (logits).
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits

            # Calculate the loss for the current batch.
            loss = loss_function(logits, labels)

            # Accumulate the loss and perform backpropagation to compute gradients.
            train_loss_epoch += loss.item()
            loss.backward()

            # Update the model's weights based on the computed gradients.
            optimizer.step()

            # Update the inner progress bar's postfix with the current batch loss.
            train_inner_loop.set_postfix(loss=loss.item())

        # Calculate the average training loss over all batches in the epoch.
        train_loss_epoch /= len(train_loader)

        # --- Validation Phase ---
        # Set the model to evaluation mode, which disables layers like dropout.
        model.eval()
        # Initialize the accumulated validation loss for the epoch.
        val_loss_epoch = 0

        # Create a nested progress bar for the validation batches.
        val_inner_loop = tqdm(
            val_loader, desc=f"Epoch {epoch+1}/{num_epochs} Validation", leave=False
        )
        # Disable gradient calculations to save memory and computations.
        with torch.no_grad():
            # Iterate over the validation data batches.
            for batch in val_inner_loop:
                # Unpack the batch and move tensors to the active device.
                input_ids = batch["input_ids"].to(device)
                attention_mask = batch["attention_mask"].to(device)
                labels = batch["labels"].to(device)

                # Perform a forward pass to get the model's logits.
                outputs = model(input_ids=input_ids, attention_mask=attention_mask)
                logits = outputs.logits

                # Calculate the validation loss for the current batch.
                val_loss = loss_function(logits, labels)
                val_loss_epoch += val_loss.item()

                # Get model predictions and update the metric objects with batch results.
                preds = torch.argmax(logits, dim=-1)
                val_accuracy.update(preds, labels)
                val_precision.update(preds, labels)
                val_recall.update(preds, labels)
                val_f1.update(preds, labels)
        
        # Calculate the average validation loss for the epoch.
        val_loss_epoch /= len(val_loader)

        # --- Logging and Metric Calculation ---
        # Compute the final metrics over the entire validation set for the epoch.
        epoch_acc = val_accuracy.compute()
        epoch_prec = val_precision.compute()
        epoch_recall = val_recall.compute()
        epoch_f1 = val_f1.compute()

        # Reset all metric objects to be ready for the next epoch.
        val_accuracy.reset()
        val_precision.reset()
        val_recall.reset()
        val_f1.reset()

        # Update the main progress bar with the results of the completed epoch.
        epoch_loop.set_postfix(
            train_loss=f"{train_loss_epoch:.4f}",
            val_loss=f"{val_loss_epoch:.4f}",
            val_acc=f"{epoch_acc:.4f}",
        )
        # Use tqdm.write to log metrics without interfering with the progress bars.
        tqdm.write(
            f"Epoch {epoch+1} Metrics -> Val Acc: {epoch_acc:.4f}, Val F1: {epoch_f1:.4f}"
        )

    # Indicate that the entire training process is complete.
    print("\n--- Training complete ---")

    # Store the final metrics from the last epoch in a dictionary.
    final_results = {
        "val_accuracy": epoch_acc.item(),
        "val_precision": epoch_prec.item(),
        "val_recall": epoch_recall.item(),
        "val_f1": epoch_f1.item(),
    }
    
    # Return the trained model and the final results.
    return model, final_results

```

In [None]:
# Set the total number of epochs.
num_epochs = 3

# Call the training loop to start the full fine-tuning process.
full_finetuned_bert, full_results = helper_utils.training_loop(
    bert_model, 
    train_loader, 
    val_loader, 
    loss_function, 
    num_epochs, 
    device
)


* Print the validation metrics from the `full_results` dictionary to review the performance of your fine-tuned model on the validation set.

In [None]:
# Display the results 
helper_utils.print_final_results(full_results)

Final Validation Metrics

Accuracy:   0.9580
Precision:  0.9351
Recall:     0.9360
F1:         0.9356
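
The precision, recall, and F1 values reported by the training loop are macro-averaged: each metric is computed per class and then averaged with equal weight, so the smaller fruit class counts as much as the vegetable class. The sketch below shows that calculation in plain Python on a handful of made-up predictions; `macro_metrics` is a hypothetical helper, not part of `helper_utils` or `torchmetrics`.

```python
def macro_metrics(preds, labels, num_classes=2):
    """Compute accuracy plus macro-averaged precision, recall, and F1."""
    precisions, recalls, f1_scores = [], [], []
    for cls in range(num_classes):
        tp = sum(p == cls and y == cls for p, y in zip(preds, labels))
        fp = sum(p == cls and y != cls for p, y in zip(preds, labels))
        fn = sum(p != cls and y == cls for p, y in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    return {
        "accuracy": sum(p == y for p, y in zip(preds, labels)) / len(labels),
        "precision": sum(precisions) / num_classes,
        "recall": sum(recalls) / num_classes,
        "f1": sum(f1_scores) / num_classes,
    }


# Made-up predictions for six titles (0 = fruit, 1 = vegetable).
toy_preds = [0, 0, 1, 1, 1, 1]
toy_labels = [0, 1, 1, 1, 1, 0]
toy_scores = macro_metrics(toy_preds, toy_labels)
print(toy_scores["precision"], toy_scores["recall"], toy_scores["f1"])  # 0.625 0.625 0.625
```

On an imbalanced dataset like this one, macro-averaged scores can sit noticeably below accuracy, which is consistent with the gap between the accuracy and F1 values above.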



### An Efficient Alternative: Partial Fine-Tuning

While fine-tuning the entire model is effective, it can be computationally expensive. Now, you will explore a more efficient strategy known as **partial fine-tuning**. Instead of training the entire model, you will strategically freeze the majority of the model's layers and train only those most effective for adapting to the new task.

First, take a look at the architecture of the DistilBERT model you are using.

In [None]:
print(bert_model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


The decision of which layers to freeze is based on how transformers learn hierarchically:

* **Earlier Layers**: The layers closer to the input learn general language features, such as grammar and basic word relationships. Since these features are useful for almost any task, they are often kept frozen. In your DistilBERT model, these are the `embeddings` and the **first four** `TransformerBlock` layers:

In [None]:
# embeddings
print("\nEmbeddings: \n")
print(bert_model.distilbert.embeddings)

# first four TransformerBlock layers
print("\nFirst four TransformerBlock layers: \n")
print(bert_model.distilbert.transformer.layer[:4])


Embeddings: 

Embeddings(
  (word_embeddings): Embedding(30522, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

First four TransformerBlock layers: 

ModuleList(
  (0-3): 4 x TransformerBlock(
    (attention): DistilBertSdpaAttention(
      (dropout): Dropout(p=0.1, inplace=False)
      (q_lin): Linear(in_features=768, out_features=768, bias=True)
      (k_lin): Linear(in_features=768, out_features=768, bias=True)
      (v_lin): Linear(in_features=768, out_features=768, bias=True)
      (out_lin): Linear(in_features=768, out_features=768, bias=True)
    )
    (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (ffn): FFN(
      (dropout): Dropout(p=0.1, inplace=False)
      (lin1): Linear(in_features=768, out_features=3072, bias=True)
      (lin2): Linear(in_features=3072, out_features=768, bias=True)
      (activation): GELUActivat

* **Later Layers**: The layers closer to the output learn more complex and abstract features that become more specialized to the data they are trained on. These are the layers you typically want to unfreeze to adapt the model to the nuances of your new task. In your model, these are the **last two** `TransformerBlock` layers and the final classification layers:

In [None]:
# last two TransformerBlock layers
print("\nLast two TransformerBlock layers: \n")
print(bert_model.distilbert.transformer.layer[4:6])

# final classification layers
print("\nFinal Classifier Layer: \n")
print(bert_model.pre_classifier)
print(bert_model.classifier)


Last two TransformerBlock layers: 

ModuleList(
  (0-1): 2 x TransformerBlock(
    (attention): DistilBertSdpaAttention(
      (dropout): Dropout(p=0.1, inplace=False)
      (q_lin): Linear(in_features=768, out_features=768, bias=True)
      (k_lin): Linear(in_features=768, out_features=768, bias=True)
      (v_lin): Linear(in_features=768, out_features=768, bias=True)
      (out_lin): Linear(in_features=768, out_features=768, bias=True)
    )
    (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (ffn): FFN(
      (dropout): Dropout(p=0.1, inplace=False)
      (lin1): Linear(in_features=768, out_features=3072, bias=True)
      (lin2): Linear(in_features=3072, out_features=768, bias=True)
      (activation): GELUActivation()
    )
    (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  )
)

Final Classifier Layer: 

Linear(in_features=768, out_features=768, bias=True)
Linear(in_features=768, out_features=2, bias=True)


For the task at hand, you will unfreeze and train the **final classifier head** and the **last two transformer layers** (the **later layers**). This allows the model to adjust its high-level feature extraction to the nuances of recipe classification, while still leveraging the robust, general language understanding from its frozen layers.

This approach tests a key hypothesis: can you achieve comparable performance to the baseline while saving significant computational resources?

* Your first step is to freeze all parameters in the model by setting their `requires_grad` attribute to `False`. This prevents their weights from being updated during the training process.

In [None]:
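# Reload a fresh pre-trained model so partial fine-tuning does not start from
# the weights that the full fine-tuning run above already updated in place.
bert_model, bert_tokenizer = helper_utils.load_bert(model_path, num_classes=2)
# Freeze ALL model parameters first
for param in bert_model.parameters():
    param.requires_grad = False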

* Next, you will unfreeze the **last two transformer layers** to make them trainable by setting their `requires_grad` attribute back to `True`.

In [None]:
# Unfreeze the last 2 transformer layers
# Set the number of final transformer layers to unfreeze and train.
layers_to_train = 2 

# Access the list of all transformer layers in the DistilBERT model.
transformer_layers = bert_model.distilbert.transformer.layer

# Loop backwards from the end of the layer list for the number of layers you want to train.
for i in range(layers_to_train):
    # Select a layer using negative indexing (e.g., -1 for the last, -2 for the second to last).
    layer_to_unfreeze = transformer_layers[-(i+1)]
    
    # Iterate through all parameters of the selected layer.
    for param in layer_to_unfreeze.parameters():
        # Set requires_grad to True to make the parameter trainable.
        param.requires_grad = True

* The final step is to unfreeze the model's classification head, which consists of the `pre_classifier` and `classifier` layers, to ensure it can be trained on your new task.

In [None]:
# Unfreeze the classifier head
# The final layers of the model must be made trainable to adapt to the new task.

# For DistilBERT, this head consists of two linear layers.
# Unfreeze the pre_classifier layer.
for param in bert_model.pre_classifier.parameters():
    param.requires_grad = True

# Unfreeze the final classifier layer.
for param in bert_model.classifier.parameters():
    param.requires_grad = True
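
To confirm that the freezing pattern took effect, it can help to count trainable versus total parameters. The sketch below is illustrative only: it uses a small stand-in model rather than DistilBERT so that it runs without any downloads, and `count_parameters` and `ToyEncoder` are hypothetical names that are not part of `helper_utils`.

```python
import torch.nn as nn


def count_parameters(model: nn.Module) -> tuple:
    """Return (trainable, total) parameter counts for a model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total


class ToyEncoder(nn.Module):
    """Hypothetical stand-in: a stack of layers followed by a classifier head."""

    def __init__(self, hidden: int = 8, num_layers: int = 6, num_classes: int = 2):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_layers)])
        self.classifier = nn.Linear(hidden, num_classes)


toy_model = ToyEncoder()

# Same pattern as in the lab: freeze everything first...
for param in toy_model.parameters():
    param.requires_grad = False

# ...then unfreeze the last two layers and the classifier head.
for layer in toy_model.layers[-2:]:
    for param in layer.parameters():
        param.requires_grad = True
for param in toy_model.classifier.parameters():
    param.requires_grad = True

toy_trainable, toy_total = count_parameters(toy_model)
print(f"Trainable: {toy_trainable:,} / {toy_total:,} ({100 * toy_trainable / toy_total:.1f}%)")
```

Calling the same helper on `bert_model` after the three steps above should show that only a fraction of the weights will be updated during training.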

With your partial fine-tuning strategy configured, you will now call the `training_loop` function. It handles the entire training process and returns the trained model along with a dictionary of the final validation metrics.

* For each batch, it explicitly unpacks the `input_ids`, `attention_mask`, and `labels` required by the model.

In [None]:
## Uncomment if you want to see the training loop function

# helper_utils.display_function(helper_utils.training_loop)
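
If you prefer not to open the helper, the following is a rough sketch of what a single training step on one batch could look like. It is not the actual `helper_utils.training_loop` implementation: `TinyClassifier`, `train_step`, and all `demo_*` names are hypothetical, and the tiny model is a stand-in so the example runs on CPU without downloading DistilBERT.

```python
import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in: embeds tokens, mean-pools over real tokens, classifies."""

    def __init__(self, vocab_size: int = 100, hidden: int = 16, num_classes: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        embedded = self.embedding(input_ids)             # (batch, seq, hidden)
        mask = attention_mask.unsqueeze(-1).float()      # (batch, seq, 1)
        # Average only over non-padding positions.
        pooled = (embedded * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.classifier(pooled)                   # (batch, num_classes)


def train_step(model, batch, loss_function, optimizer, device):
    """Run one optimization step on a single batch and return the loss value."""
    model.train()
    # Explicitly unpack the three tensors the model needs and move them to the device.
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    labels = batch["labels"].to(device)

    optimizer.zero_grad()
    logits = model(input_ids=input_ids, attention_mask=attention_mask)
    loss = loss_function(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
demo_device = torch.device("cpu")
demo_model = TinyClassifier().to(demo_device)
# Hand only the trainable parameters to the optimizer.
demo_optimizer = torch.optim.AdamW(
    [p for p in demo_model.parameters() if p.requires_grad], lr=1e-3
)
demo_loss_function = nn.CrossEntropyLoss()

# A padded batch of two sequences; 0 in the mask marks padding.
demo_batch = {
    "input_ids": torch.tensor([[5, 6, 7, 0], [8, 9, 0, 0]]),
    "attention_mask": torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]]),
    "labels": torch.tensor([0, 1]),
}

demo_loss = train_step(demo_model, demo_batch, demo_loss_function, demo_optimizer, demo_device)
print(f"Loss after one step: {demo_loss:.4f}")
```

The stand-in returns logits directly. A Hugging Face sequence-classification model instead returns an output object, so the scores would be read from its `.logits` attribute before computing the loss.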

In [None]:
# Set the total number of epochs.
num_epochs = 3

# Call the training loop to start the partial fine-tuning process.
partial_finetuned_bert, partial_results = helper_utils.training_loop(
    bert_model, 
    train_loader, 
    val_loader, 
    loss_function, 
    num_epochs, 
    device
)

Training Progress:   0%|          | 0/3 [00:00<?, ?it/s]

Epoch 1/3 Training:   0%|          | 0/3573 [00:00<?, ?it/s]

Epoch 1/3 Validation:   0%|          | 0/894 [00:00<?, ?it/s]

Epoch 1 Metrics -> Val Acc: 0.9523, Val F1: 0.9282


Epoch 2/3 Training:   0%|          | 0/3573 [00:00<?, ?it/s]

* Print the validation metrics from the `partial_results` dictionary to review the performance of your partially fine-tuned model on the validation set.

In [None]:
# Display the results 
helper_utils.print_final_results(partial_results)

### Comparing Fine-Tuning Approaches

* Directly compare the performance of the two approaches: the full fine-tuning baseline and the efficient partial fine-tuning method.
    * `full_results`: Contains the metrics from the full fine-tuning of the entire model.
    * `partial_results`: Contains the metrics from the partial fine-tuning, where only the last two transformer layers and the classifier were trained.

In [None]:
# Compare your results
helper_utils.display_results(full_results, partial_results)

Based on these results, both models perform almost identically after just `3` epochs. As you may have noticed, the partial approach also took less time to train because it updated far fewer parameters. This illustrates the core benefit of partial fine-tuning: comparable, if not better, performance at a lower cost in training time and compute.

## Testing the Fine-tuned BERT Model on New Examples

Now for the final test. It's time to see how your fine-tuned model performs on completely new, unseen data. This is the best way to get a qualitative feel for how well your model has learned to generalize.

* Define a `test_products` list containing a mix of new recipe titles. This list includes straightforward examples as well as more challenging ones to see where the model excels and where it might struggle.
    * Feel free to add your own recipe titles to this list to test the model even further!
 
**Note**: Remember, the model's predictions are based *only* on the words in the recipe's `name`. It was never shown the ingredients list, so it has no knowledge of whether fruits or vegetables are the dominant ingredient. A recipe's name can sometimes be misleading, and the model's classification will reflect only what it has learned from the title's text.

In [None]:
test_products = [
    "Blueberry Muffins",                  # Expected: Fruit
    "Spinach and Feta Stuffed Chicken",   # Expected: Vegetable
    "Classic Carrot Cake with Frosting",  # Expected: Vegetable
    "Tomato and Basil Bruschetta",        # Expected: Vegetable
    "Avocado Toast",                      # Expected: Fruit
    "Zucchini Bread with Walnuts",        # Expected: Vegetable
    "Lemon and Herb Roasted Chicken",     # Expected: Fruit
    "Strawberry Rhubarb Pie",             # Expected: Fruit
]


* Finally, loop through the `test_products` list to run the prediction for each recipe and see the model's final output.

In [None]:
## Uncomment if you want to see the predict category function

# helper_utils.display_function(helper_utils.predict_category)

In [None]:
# Loop through each test product
for product in test_products:
    # Call the prediction function with the required arguments
    category = helper_utils.predict_category(
        partial_finetuned_bert, # Try it with `full_finetuned_bert` as well.
        bert_tokenizer,
        product,
        device
    )
    # Print the results
    print(f"Product: '{product}'\nPredicted: {category}.\n")
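
A prediction helper of this kind typically ends by converting the model's raw logits into a label. The sketch below shows only that last step, under stated assumptions: `logits_to_prediction` is a hypothetical function (not the actual `predict_category`), and the `example_id2label` mapping is assumed for illustration, so check the mapping used when the dataset was built.

```python
import torch


def logits_to_prediction(logits: torch.Tensor, id2label: dict) -> tuple:
    """Map a (1, num_classes) logits tensor to a (label, confidence) pair."""
    probabilities = torch.softmax(logits, dim=-1)
    confidence, class_id = probabilities.max(dim=-1)
    return id2label[class_id.item()], confidence.item()


# Assumed label mapping, for illustration only.
example_id2label = {0: "Fruit", 1: "Vegetable"}
# Made-up logits standing in for a model's output on one recipe title.
example_logits = torch.tensor([[2.0, 0.0]])

example_label, example_confidence = logits_to_prediction(example_logits, example_id2label)
print(f"Predicted: {example_label} (confidence {example_confidence:.2f})")
```

Reporting the softmax confidence alongside the label can make ambiguous titles easier to spot, since a prediction near 0.5 suggests the model is unsure.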

## Conclusion

Congratulations on completing this lab! You have successfully moved beyond building models from scratch and have now fine-tuned a state-of-the-art transformer model for a custom text classification task.

You began by loading a pre-trained DistilBERT model and saw firsthand how it simplifies the entire text-to-tensor pipeline. The main takeaway from your experiments is the effectiveness of **partial fine-tuning**. You demonstrated that by strategically freezing most of the model's layers and only updating the final, task-specific ones, you can achieve performance comparable to or even slightly better than fully fine-tuning the entire model. This insight is immensely valuable for practical applications, as it allows for significant savings in training time and computational resources without sacrificing quality.

The skills you've developed here (loading and adapting pre-trained models, managing data with modern tools, and strategically choosing which parts of a model to train) are the building blocks for tackling a wide range of complex NLP challenges, from sentiment analysis to machine translation and beyond.