<a href="https://colab.research.google.com/github/abdulsamadkhan/Courses-LLM-Lectures/blob/main/FineTunningwithTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Objective:
This tutorial uses PyTorch to fine-tune the `bert-base-uncased` model from Hugging Face on the GLUE MRPC dataset. It consists of the following steps:
##1. Installing Libraries. 📚
##2. Preprocessing the data using tokenizer with dynamic padding   🔍🧹📊
##3. DataLoader 🚚  🤗  📊
##4. Loading the model  🤗 🧠
##5. Setting optimizer and Learning rate scheduler 🔧
##6. Training Model 🏋️‍♂️
##7. The Evaluation Loop  🔄


#1. Installing Libraries. 📚
The following lines will download the necessary libraries: `Transformers`, `Datasets`, and `Accelerate`.


In [None]:
!pip install transformers
! pip install datasets
!pip install accelerate


Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.18.0 dill-0.3.8 multiprocess-0.70.16
Collecting accelerate
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Installing collected pack

#2. Preprocessing the data using tokenizer with dynamic padding   🔍🧹📊

The dynamic padding is done with `DataCollatorWithPadding`

**DataCollatorWithPadding:**

This is a class provided by the Hugging Face Transformers library. It is used for collating and padding input data (usually tokenized sequences) during language model training.

**Purpose:**

The resulting `data_collator` instance is used during training to assemble batches. It pads the input sequences in each batch to the length of the longest sequence in that batch (using the tokenizer's padding token), rather than padding the whole dataset to one global length, which avoids wasted computation on padding.
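
To make the idea of dynamic padding concrete, here is a minimal pure-Python sketch of what a padding collator does conceptually. The helper name `pad_batch` and the toy token ids are hypothetical, chosen only for illustration. This is not the `DataCollatorWithPadding` implementation, just the same idea: each batch is padded to the length of its own longest sequence.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for two batches with different longest sequences.
short_batch = pad_batch([[101, 7592, 102], [101, 102]])
long_batch = pad_batch([[101, 7592, 2088, 999, 102], [101, 102]])

print(short_batch["input_ids"])          # [[101, 7592, 102], [101, 102, 0]]
print(short_batch["attention_mask"])     # [[1, 1, 1], [1, 1, 0]]
print(len(long_batch["input_ids"][1]))   # 5
```

The same two-token sequence is padded to length 3 in one batch and length 5 in the other, which is why the sequence dimension can vary from batch to batch.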


In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
tokenized_datasets


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

**Removing undesired columns:**

Before writing the training loop, we need to define a few objects. The first are the dataloaders we will use to iterate over batches. Before defining them, we need to apply some postprocessing to `tokenized_datasets`, to take care of steps that the Hugging Face `Trainer` API would otherwise handle automatically. Specifically, we need to:

1. Remove the columns corresponding to values the model does not expect (like the `sentence1` and `sentence2` columns).
2. Rename the column `label` to `labels` (because the model expects the argument to be named `labels`).
3. Set the format of the datasets so they return PyTorch tensors instead of lists.

Our `tokenized_datasets` has one method for each of those steps:


In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

#3. DataLoader 🚚  🤗  📊


**DataLoader Class:**
The DataLoader class is part of the PyTorch library and is used for creating data loaders. It efficiently loads and batches data during training or evaluation of machine learning models.

**Purpose and Usage:**
The primary purpose of DataLoader is to create an iterable over a dataset. It provides an efficient way to load data in batches, shuffle the data, and apply transformations.

**Parameters:**
The DataLoader class takes several important parameters:
- `dataset`: The dataset object (usually an instance of a custom dataset class).
- `batch_size`: The number of samples in each batch.
- `shuffle`: Determines whether to shuffle the data before creating batches.
- `collate_fn`: An optional function that collates individual samples into batches.

**Benefits of Using DataLoader:**

- **Efficient loading:** DataLoader can load data in parallel using multiple worker processes when `num_workers` is set above 0 (the default, 0, loads data in the main process).
- **Batching:** It automatically groups samples into batches.
- **Shuffling:** If `shuffle=True`, DataLoader reshuffles the data at every epoch before creating batches.
- **Custom collation:** `collate_fn` controls how individual samples are merged into a batch. Here it is the `data_collator`, which pads each batch dynamically.
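
As an illustration of how `collate_fn` shapes each batch, the sketch below uses a toy list of variable-length samples instead of the MRPC data. The `samples` list and the `collate` function are hypothetical stand-ins for `tokenized_datasets["train"]` and `data_collator`.

```python
import torch
from torch.utils.data import DataLoader

# Toy dataset: a list of variable-length samples (hypothetical data).
samples = [
    {"input_ids": [1, 2, 3], "labels": 0},
    {"input_ids": [4, 5], "labels": 1},
    {"input_ids": [6], "labels": 0},
    {"input_ids": [7, 8, 9, 10], "labels": 1},
]


def collate(batch):
    """Pad each batch to its own longest sequence and stack into tensors."""
    max_len = max(len(item["input_ids"]) for item in batch)
    input_ids = torch.zeros(len(batch), max_len, dtype=torch.long)
    for row, item in enumerate(batch):
        ids = torch.tensor(item["input_ids"], dtype=torch.long)
        input_ids[row, : len(ids)] = ids
    labels = torch.tensor([item["labels"] for item in batch])
    return {"input_ids": input_ids, "labels": labels}


# shuffle=False keeps the batch order fixed so the shapes are predictable.
loader = DataLoader(samples, batch_size=2, shuffle=False, collate_fn=collate)
shapes = [tuple(batch["input_ids"].shape) for batch in loader]
print(shapes)  # [(2, 3), (2, 4)]
```

Each batch has its own sequence length, matching the behaviour expected from dynamic padding.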


In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

To quickly check there is no mistake in the data processing, we can inspect a batch like this:



In [None]:
for batch in train_dataloader:
    break
#print(batch)
{k: v.shape for k, v in batch.items()}

{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 62]),
 'token_type_ids': torch.Size([8, 62]),
 'attention_mask': torch.Size([8, 62])}

#4. Loading the model  🤗 🧠





In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


To make sure that everything will go smoothly during training, we pass our batch to this model:



In [None]:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(0.7251, grad_fn=<NllLossBackward0>) torch.Size([8, 2])
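
As a rough sanity check on the loss printed above: a two-class classifier whose classification head is freshly initialized should predict close to uniform probabilities, which gives a cross-entropy of about ln 2. The snippet below only does that arithmetic; the variable names are illustrative.

```python
import math

num_labels = 2
# Cross-entropy of a uniform prediction over num_labels classes: -log(1 / num_labels)
uniform_loss = -math.log(1.0 / num_labels)
print(round(uniform_loss, 4))  # 0.6931
```

The observed value of 0.7251 is in the same range, which is what one would expect before any fine-tuning.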


#5. Setting optimizer and Learning rate scheduler 🔧


**Initializing the Optimizer:**
`optimizer = AdamW(model.parameters(), lr=5e-5)` initializes an instance of the AdamW optimizer. Here’s what each parameter does:

- `model.parameters()`: This provides the trainable parameters (weights and biases) of the model loaded in step 4.
- `lr=5e-5`: This sets the learning rate for the optimizer to 5e-5 (which is equivalent to 0.00005).


The resulting optimizer instance will be used during training to update the model’s parameters (weights and biases) based on gradients computed during backpropagation. The learning rate determines how large the steps are during optimization. Smaller learning rates lead to slower convergence but more stable training.
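
To see how the learning rate scales a single update, here is a small sketch on a one-parameter toy problem rather than on BERT. The helper `one_step` and the toy loss are assumptions made for illustration; only `torch.optim.AdamW` itself is a real API.

```python
import torch
from torch.optim import AdamW


def one_step(lr):
    """Run a single AdamW update on a one-parameter toy problem and return the new value."""
    w = torch.nn.Parameter(torch.tensor([1.0]))
    optimizer = AdamW([w], lr=lr)
    loss = (w ** 2).sum()   # toy loss with its minimum at w = 0
    loss.backward()         # gradient is 2 * w = 2.0
    optimizer.step()        # move w against the gradient
    optimizer.zero_grad()
    return w.item()


small = one_step(lr=5e-5)
large = one_step(lr=5e-2)
print(small, large)  # both end below 1.0; the larger rate moves w further
```

Both runs start from the same value, so the difference in the result comes only from the learning rate.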




In [None]:
# transformers.AdamW is deprecated; use the PyTorch implementation instead
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)



**Importing the Necessary Function:**
The line `from transformers import get_scheduler` imports a function called `get_scheduler` from the transformers library. This function is used to create a learning rate scheduler for training neural network models.

**Setting Up Variables:**
- `num_epochs = 3`: This variable represents the total number of training epochs. An epoch is a complete pass through the entire training dataset.
- `num_training_steps = num_epochs * len(train_dataloader)`: Here, we calculate the total number of training steps based on the number of epochs and the length of the training data loader (`train_dataloader`). Each training step corresponds to one batch of data processed during training.

**Creating the Learning Rate Scheduler:**

`lr_scheduler = get_scheduler(...)`: This line initializes a learning rate scheduler using the `get_scheduler` function. The function takes several arguments:

- `"linear"`: The type of scheduler. In this case, it’s a linear scheduler.
- `optimizer`: The optimizer whose learning rate the scheduler controls, here the `AdamW` instance created above.
- `num_warmup_steps=0`: The number of warm-up steps. Warm-up steps gradually increase the learning rate from zero to its initial value. Setting it to zero means no warm-up.
- `num_training_steps=num_training_steps`: The total number of training steps (calculated earlier based on epochs and data loader length).

**Linear Learning Rate Schedule:**

The "linear" scheduler decreases the learning rate linearly from its initial value to zero over the course of training. It’s a simple and commonly used schedule. During the warm-up phase (if specified), the learning rate gradually increases from zero to its initial value. After the warm-up, the learning rate decreases linearly as training progresses.
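
The schedule described above can be sketched in plain Python. The function `linear_lr` is a hypothetical helper written for this illustration, not the code behind `get_scheduler`. The step count uses the numbers from this tutorial: 3668 training rows, batch size 8, and 3 epochs.

```python
import math


def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given step for linear warm-up followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# The last batch of each epoch is smaller, so the batch count is rounded up.
steps_per_epoch = math.ceil(3668 / 8)
num_training_steps = 3 * steps_per_epoch
print(steps_per_epoch, num_training_steps)  # 459 1377

# Learning rate at the start, the midpoint, and the end of training (no warm-up).
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, 5e-5, 0, num_training_steps))
```

With `num_warmup_steps=0`, the rate starts at 5e-5 and reaches zero at the final step.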


In [None]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)


1377


#6. Training Model 🏋️‍♂️


**Accelerator Initialization:**
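`accelerator = Accelerator():` This line creates an `Accelerator` object. Accelerate is a Hugging Face library that simplifies running the same training code on a CPU, a single GPU, or several devices. The `accelerator.prepare(...)` call wraps the dataloaders, model, and optimizer so that the model and each batch are placed on the right device automatically.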


In [None]:
from accelerate import Accelerator
accelerator = Accelerator()
train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

**Training Loop:**
- `model.train():` Puts the model in training mode. This is essential because some layers (like dropout) behave differently during training and evaluation.
- The outer loop iterates over `num_epochs`, the number of times the entire training dataset is processed.
- The inner loop iterates over batches from `train_dataloader`.

For each batch:
- `outputs = model(**batch):` The model processes the input batch and produces predictions. The `**batch` syntax unpacks the batch dictionary into keyword arguments (`input_ids`, `token_type_ids`, `attention_mask`, `labels`).
- `loss = outputs.loss:` Because the batch contains `labels`, the model computes the loss between its predictions and the ground-truth labels and returns it with the outputs.
- `accelerator.backward(loss):` Computes gradients of the loss with respect to the model parameters by backpropagation. It takes the place of the usual `loss.backward()` call when training with Accelerate.
- `optimizer.step():` Updates the model’s parameters using the computed gradients.
- `lr_scheduler.step():` Updates the learning rate according to the schedule.
- `optimizer.zero_grad():` Clears the gradients for the next batch.
- `progress_bar.update(1):` Advances the progress bar by one step.
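
The same sequence of steps can be tried on a toy problem that runs in a fraction of a second on a CPU. The sketch below is an assumption-laden stand-in: it uses random data, a single linear layer instead of BERT, plain `loss.backward()` instead of `accelerator.backward(loss)`, and a `LambdaLR` linear decay instead of `get_scheduler`.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical toy data: 64 samples, 4 features, binary labels.
features = torch.randn(64, 4)
labels = (features[:, 0] > 0).long()
loader = DataLoader(TensorDataset(features, labels), batch_size=8, shuffle=True)

model = nn.Linear(4, 2)
optimizer = AdamW(model.parameters(), lr=1e-2)
num_epochs = 3
num_training_steps = num_epochs * len(loader)
# Linear decay to zero, written with LambdaLR for this sketch.
scheduler = LambdaLR(optimizer, lambda step: max(0.0, 1 - step / num_training_steps))
loss_fn = nn.CrossEntropyLoss()

steps_taken = 0
model.train()
for epoch in range(num_epochs):
    for batch_features, batch_labels in loader:
        logits = model(batch_features)        # forward pass
        loss = loss_fn(logits, batch_labels)  # compare predictions to labels
        loss.backward()                       # compute gradients
        optimizer.step()                      # update parameters
        scheduler.step()                      # adjust the learning rate
        optimizer.zero_grad()                 # clear gradients for the next batch
        steps_taken += 1

print(steps_taken, scheduler.get_last_lr())
```

The order of the calls inside the inner loop mirrors the list above: forward pass, loss, backward pass, optimizer step, scheduler step, gradient reset.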

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)



#7. The Evaluation Loop  🔄

- `metric = evaluate.load("glue", "mrpc"):` This line loads the metric associated with the MRPC dataset from the GLUE benchmark, which reports accuracy and F1.

**Evaluation Loop:**
- `model.eval():` Puts the model in evaluation mode. During evaluation, the model behaves differently (e.g., disables dropout layers) compared to training mode.

The subsequent loop iterates over batches from `eval_dataloader`.
For each batch:
- `with torch.no_grad():` Temporarily disables gradient computation, which saves memory and time during evaluation.
- `outputs = model(**batch):` The model processes the input batch and produces predictions.
- `logits = outputs.logits:` The raw output scores (logits) from the model.
- `predictions = torch.argmax(logits, dim=-1):` Computes the predicted class labels by taking the index of the maximum logit value along the last dimension (the class dimension).
- `metric.add_batch(predictions=predictions, references=batch["labels"]):` Adds the batch predictions and ground truth labels to the metric for later computation.

Finally, `metric.compute()` computes the final metric value based on the accumulated predictions and references.
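
To show what the two reported numbers mean, here is a pure-Python sketch of accuracy and binary F1 computed over predictions accumulated batch by batch. The function `accuracy_and_f1` and the toy labels are hypothetical; the `evaluate` implementation may differ in its details.

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Compute accuracy and binary F1 from two equal-length lists of labels."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Hypothetical predictions and labels, accumulated batch by batch.
all_predictions, all_references = [], []
for preds, refs in [([1, 0, 1, 1], [1, 0, 0, 1]), ([0, 1, 1, 0], [0, 1, 0, 1])]:
    all_predictions.extend(preds)
    all_references.extend(refs)

result = accuracy_and_f1(all_predictions, all_references)
print(result)
```

In this toy case 5 of 8 predictions are correct, so accuracy is 0.625, and with 3 true positives, 2 false positives, and 1 false negative the F1 score is 2/3.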

In [None]:
!pip install evaluate
import torch
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()



{'accuracy': 0.8480392156862745, 'f1': 0.8934707903780068}