# **Initial Setup**

## **Environment Setup**
We start by mounting Google Drive (for saving models and outputs) and installing the required Python libraries:
- **transformers** – Hugging Face's library for model architectures and training utilities.
- **datasets** – To load and manage the SQuAD dataset.
- **peft** – For parameter-efficient fine-tuning methods like LoRA and Prefix-Tuning.
- **bitsandbytes** – For quantization and efficient GPU memory usage (especially with QLoRA).
- **evaluate**, **sacrebleu**, **rouge-score**, **bert-score**, **nltk**, **textstat**, **detoxify** – For evaluation metrics and analysis.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install transformers datasets peft bitsandbytes accelerate
!pip install evaluate sacrebleu rouge-score bert-score nltk textstat detoxify

Collecting bitsandbytes
  Downloading bitsandbytes-0.47.0-py3-none-manylinux_2_24_x86_64.whl.metadata (11 kB)
Downloading bitsandbytes-0.47.0-py3-none-manylinux_2_24_x86_64.whl (61.3 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.47.0
Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Collecting textstat
  Downloading textstat-0.7.8-py3-none-any.whl.metadata (15 kB)
Collecting d

## **Importing Libraries**
Here, we import all the necessary libraries and modules for:
- Tokenization (T5Tokenizer)
- Model loading (T5ForConditionalGeneration)
- Training utilities (Seq2SeqTrainer, DataCollatorForSeq2Seq)
- Dataset handling (Hugging Face Datasets)
- Parameter-efficient fine-tuning utilities from PEFT.


In [None]:
import torch
import transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration, Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq
from datasets import load_dataset
from peft import get_peft_model, LoraConfig, TaskType
import evaluate
import nltk
import numpy as np
import textstat
from detoxify import Detoxify
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm
import random
import pandas as pd

print(transformers.__version__)


4.55.2


## **Loading the SQuAD Dataset**
We use the **SQuAD v1.1** dataset, which consists of question–answer pairs along with their context passages.
We split it into:
- `train_data` – For training the model.
- `val_data` – For validation and evaluation.


In [None]:
dataset = load_dataset("squad")
train_data = dataset["train"]
val_data = dataset["validation"]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

plain_text/train-00000-of-00001.parquet:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

plain_text/validation-00000-of-00001.par(…):   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

## **Data Preprocessing**
We define a preprocessing function that:
1. Combines the question and context into a single input string.
2. Tokenizes the input and the target answer.
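3. Truncates or pads inputs to 512 tokens and answers to 64 tokens so every batch has a fixed shape.
4. Replaces padding tokens in the labels with `-100` so they are ignored when the loss is computed.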


In [None]:
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")

def preprocess_function(examples):
    inputs = ["question: " + q + " context: " + c for q, c in zip(examples["question"], examples["context"])]

    model_inputs = tokenizer(
        inputs,
        max_length=512,
        truncation=True,
        padding="max_length"
        )

    answers_texts = [a['text'][0] if len(a['text']) > 0 else '' for a in examples["answers"]]

    labels = tokenizer(
        answers_texts,
        max_length=64,
        truncation=True,
        padding="max_length"
    )

    labels["input_ids"] = [[(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]]
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train = train_data.map(preprocess_function, batched=True, remove_columns=train_data.column_names)
tokenized_val = val_data.map(preprocess_function, batched=True, remove_columns=val_data.column_names)
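
The label-masking step inside `preprocess_function` is easy to overlook, so here is a minimal, library-free sketch of it. The pad id of 0 is an assumption made for illustration (the notebook reads the real value from `tokenizer.pad_token_id`), and `mask_padding` is a hypothetical helper name; `-100` is the label value that the loss ignores.

```python
PAD_TOKEN_ID = 0      # assumption for illustration; the notebook uses tokenizer.pad_token_id
IGNORE_INDEX = -100   # label value skipped when the loss is computed

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding positions in each label sequence with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Two toy label sequences, already padded to length 5.
padded_labels = [[71, 12, 1, 0, 0], [8, 1, 0, 0, 0]]
masked_labels = mask_padding(padded_labels)
print(masked_labels)
```

Only the non-padding positions remain as real token ids, so padding does not contribute to the training loss.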


# Inference Functions

## **Inference Preparation**
The `prepare_eval_data` function builds one input prompt per validation example (question plus context, in the same format used for training) and collects the first reference answer as its label.


In [None]:
def prepare_eval_data(data):
    from tqdm import tqdm
    val_texts, val_labels = [], []

    for example in tqdm(data, desc="Preparing evaluation data"):
        question = example["question"]
        context = example["context"]
        answer = example["answers"]["text"][0] if example["answers"]["text"] else ""

        input_text = f"question: {question} context: {context}"
        val_texts.append(input_text)
        val_labels.append(answer)

    return val_texts, val_labels
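
To make the prompt format concrete, the sketch below applies the same string template and first-answer rule to a single toy record. The record is hypothetical (it mimics the SQuAD field layout used above), and `build_prompt` and `first_answer` are illustrative helper names that do not exist in the notebook.

```python
def build_prompt(question, context):
    """Format one example the same way the notebook does for training and evaluation."""
    return f"question: {question} context: {context}"

def first_answer(example):
    """Return the first reference answer, or an empty string if there is none."""
    texts = example["answers"]["text"]
    return texts[0] if texts else ""

# Hypothetical record with the SQuAD-style fields used in this notebook.
toy_example = {
    "question": "Who wrote the play Hamlet?",
    "context": "Hamlet is a tragedy written by William Shakespeare.",
    "answers": {"text": ["William Shakespeare"]},
}

prompt = build_prompt(toy_example["question"], toy_example["context"])
reference = first_answer(toy_example)
print(prompt)
print(reference)
```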

## **Prediction Generation**
The `generate_predictions` function runs the model on the evaluation prompts in batches and returns the decoded answers.
Its parameters are:
- `model` – Fine-tuned Hugging Face model.
- `tokenizer` – Hugging Face tokenizer.
- `val_texts` – List of input strings.
- `batch_size` – Number of samples per inference step.
- `max_input_len` – Maximum token length for inputs.
- `max_output_len` – Maximum token length for generated answers.
- `device` – Device to run inference on (defaults to `"cuda"`).


In [None]:
import torch
from torch.utils.data import DataLoader

def generate_predictions(model, tokenizer, val_texts, batch_size=16, max_input_len=512, max_output_len=64, device="cuda"):
    """
    Generate predictions for a list of input texts in batches.

    Args:
        model: HuggingFace model (already fine-tuned).
        tokenizer: HuggingFace tokenizer.
        val_texts (list): List of input strings.
        batch_size (int): Batch size for DataLoader.
        max_input_len (int): Max length for input tokenization.
        max_output_len (int): Max length for generated output.
        device (str): Device to run inference on.

    Returns:
        list: List of decoded predictions.
    """
    # Tokenize once
    encodings = tokenizer(
        val_texts,
        truncation=True,
        padding=True,
        max_length=max_input_len,
        return_tensors="pt"
    )

    # Create DataLoader
    dataset_torch = torch.utils.data.TensorDataset(
        encodings["input_ids"],
        encodings["attention_mask"]
    )
    loader = DataLoader(dataset_torch, batch_size=batch_size)

    # Switch to eval mode and move to device
    model.eval()
    model.to(device)

    preds = []
    with torch.no_grad():
        for batch in tqdm(loader, desc="Generating"):
            input_ids, attention_mask = [x.to(device) for x in batch]
            outputs = model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                max_length=max_output_len
            )
            preds.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))

    return preds
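
As a rough illustration of the batching that the `DataLoader` performs, the sketch below splits a list into fixed-size chunks and counts them. `iter_batches` and `num_batches` are hypothetical helpers, not part of the notebook. With the 10,570 validation examples and `batch_size=16` used later, this gives 661 batches, which is consistent with the `661/661` progress bar shown during generation.

```python
import math

def num_batches(num_examples, batch_size):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)

def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

examples = list(range(10570))          # stand-in for the validation prompts
batches = list(iter_batches(examples, 16))
print(len(batches), len(batches[-1]))  # batch count and size of the final batch
```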


## **Evaluation Metrics Setup**
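We compute several NLP metrics for the generated answers:
- **BLEU** – n-gram precision against the reference (smoothed, sentence level).
- **ROUGE-1/2/L** – unigram, bigram and longest-common-subsequence overlap (F-measure).
- **METEOR** – unigram matching that also accounts for stemming and synonyms.
- **GLEU** – the minimum of n-gram precision and recall.
- **BERTScore** and **CoSIM** – semantic similarity from contextual embeddings and from sentence-embedding cosine similarity.
- **Repetition rate**, **diversity**, **novelty** – share of repeated tokens, share of unique tokens, and `1 - CoSIM`.
- **Readability and toxicity** – Flesch Reading Ease and the Detoxify toxicity score.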


In [None]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from nltk.translate.meteor_score import meteor_score
from nltk.translate.gleu_score import sentence_gleu
import textstat
from sentence_transformers import SentenceTransformer, util
import bert_score
from detoxify import Detoxify
import numpy as np
import nltk

nltk.download('wordnet')

def calculate_metrics(val_labels, preds):
    # Load models
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    detox_model = Detoxify('original')
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    smooth_fn = SmoothingFunction().method1

    # Storage
    bleu_scores, rouge1_scores, rouge2_scores, rougeL_scores = [], [], [], []
    meteor_scores, gleu_scores = [], []
    repetition_rates, flesch_scores = [], []
    cosim_scores, toxicity_scores = [], []
    novelty_scores, diversity_scores = [], []

    # --- Precompute embeddings for CoSIM/Novelty ---
    emb_refs = embedder.encode(val_labels, convert_to_tensor=True, batch_size=32, show_progress_bar=True)
    emb_hyps = embedder.encode(preds, convert_to_tensor=True, batch_size=32, show_progress_bar=True)

    for ref, hyp, emb_ref, emb_hyp in zip(val_labels, preds, emb_refs, emb_hyps):
        ref_tokens = ref.split()
        hyp_tokens = hyp.split()

        # BLEU
        bleu_scores.append(sentence_bleu([ref_tokens], hyp_tokens, smoothing_function=smooth_fn))

        # ROUGE
        scores = scorer.score(ref, hyp)
        rouge1_scores.append(scores["rouge1"].fmeasure)
        rouge2_scores.append(scores["rouge2"].fmeasure)
        rougeL_scores.append(scores["rougeL"].fmeasure)

        # METEOR (tokenized inputs)
        meteor_scores.append(meteor_score([ref_tokens], hyp_tokens))

        # GLEU
        gleu_scores.append(sentence_gleu([ref_tokens], hyp_tokens))

        # Repetition Rate
        repetition_rates.append(1 - len(set(hyp_tokens)) / len(hyp_tokens) if hyp_tokens else 0)

        # Flesch Reading Ease
        flesch_scores.append(textstat.flesch_reading_ease(hyp))

        # CoSIM
        cosim = util.pytorch_cos_sim(emb_ref, emb_hyp).item()
        cosim_scores.append(cosim)

        # Novelty
        novelty_scores.append(1 - cosim)

        # Diversity
        diversity_scores.append(len(set(hyp_tokens)) / len(hyp_tokens) if hyp_tokens else 0)

    # --- BERTScore (batch) ---
    P, R, F1 = bert_score.score(preds, val_labels, lang="en", verbose=False)
    bert_f1_scores = F1.tolist()

    # --- Toxicity (batch) ---
    toxicity_batch = detox_model.predict(preds)
    toxicity_scores = toxicity_batch["toxicity"]

    return {
        "BLEU": bleu_scores,
        "ROUGE-1": rouge1_scores,
        "ROUGE-2": rouge2_scores,
        "ROUGE-L": rougeL_scores,
        "METEOR": meteor_scores,
        "GLEU": gleu_scores,
        "Repetition Rate": repetition_rates,
        "Flesch Reading Ease": flesch_scores,
        "CoSIM": cosim_scores,
        "Novelty": novelty_scores,
        "Diversity": diversity_scores,
        "BERTScore F1": bert_f1_scores,
        "Toxicity": toxicity_scores
    }

[nltk_data] Downloading package wordnet to /root/nltk_data...
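
The token-level and embedding-level scores in `calculate_metrics` reduce to a few lines of arithmetic. The sketch below re-implements them without any libraries so the definitions are easy to check by hand; the function names and the toy vectors are illustrative only, and real embeddings come from the sentence-transformer model.

```python
import math

def diversity(text):
    """Share of unique whitespace tokens (0 for empty text)."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def repetition_rate(text):
    """Share of tokens that repeat an earlier token (0 for empty text)."""
    tokens = text.split()
    return 1 - len(set(tokens)) / len(tokens) if tokens else 0.0

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def novelty(u, v):
    """Novelty as defined in the notebook: one minus cosine similarity."""
    return 1 - cosine(u, v)

answer = "the cat saw the dog"           # 5 tokens, 4 of them unique
print(diversity(answer), repetition_rate(answer))
print(cosine([1.0, 0.0], [1.0, 0.0]), novelty([1.0, 0.0], [0.0, 1.0]))
```

Note that repetition rate and diversity always sum to 1 for non-empty text, so they carry the same information.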


## **Prepare Metrics DataFrame**
Average each per-example metric from the previous step and collect the means in a readable DataFrame, rounded to four decimals.


In [None]:
def create_metrics_df(metrics):
    # Create dictionary of metrics and their average scores
    metrics_data = {
        "Metric": [
            "BLEU", "ROUGE-1", "ROUGE-2", "ROUGE-L", "METEOR", "GLEU",
            "Repetition Rate", "Flesch Reading Ease", "CoSIM", "BERTScore F1",
            "Toxicity", "Novelty", "Diversity"
        ],
        "Score": [
            np.mean(metrics['BLEU']),
            np.mean(metrics['ROUGE-1']),
            np.mean(metrics['ROUGE-2']),
            np.mean(metrics['ROUGE-L']),
            np.mean(metrics['METEOR']),
            np.mean(metrics['GLEU']),
            np.mean(metrics['Repetition Rate']),
            np.mean(metrics['Flesch Reading Ease']),
            np.mean(metrics['CoSIM']),
            np.mean(metrics['BERTScore F1']),
            np.mean(metrics['Toxicity']),
            np.mean(metrics['Novelty']),
            np.mean(metrics['Diversity'])
        ]
    }

    # Convert to DataFrame
    metrics_df = pd.DataFrame(metrics_data)

    metrics_df["Score"] = metrics_df["Score"].round(4)

    return metrics_df

## **Generate Answers**
Run the trained model on six hand-written question-context pairs and return the results as a DataFrame with `Question`, `Context` and `Generated Answer` columns.


In [None]:
def generate_answers(tokenizer, model):

    # 6 test questions & contexts
    test_data = [
        {
            "question": "Who wrote the play Hamlet?",
            "context": "Hamlet is a tragedy written by William Shakespeare sometime between 1599 and 1601."
        },
        {
            "question": "What is the capital of France?",
            "context": "France is a country in Western Europe. Its capital city is Paris."
        },
        {
            "question": "Who is attending the Gen AI Course?",
            "context": "IIT is organizing a Gen AI professional certification course, Arko is one of the people who is attending it."
        },
        {
            "question": "Why does ice float on water?",
            "context": "Water has unusual properties compared to many other liquids. For example, unlike most substances, water expands when it freezes, which makes ice less dense than liquid water. This property is why ice floats on lakes and oceans during winter."
        },
        {
            "question": "Where is Mount Everest located?",
            "context": "Mount Everest is Earth’s highest mountain above sea level and is located in the Himalayas along the border of Nepal and China’s Tibet Autonomous Region. The officially recognized height is 8,848.86 meters (29,031.7 feet)."
        },
        {
            "question": "How many chambers does the human heart have?",
            "context": "The human heart has four chambers: two atria (upper chambers) and two ventricles (lower chambers). The left ventricle is the strongest chamber because it pumps oxygenated blood through the entire body."
        }
    ]

    # Generate predictions
    results = []
    for item in test_data:
        input_text = f"question: {item['question']} context: {item['context']}"
        inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding=True, max_length=512).to(model.device)
        outputs = model.generate(**inputs, max_length=64)
        pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
        results.append({
            "Question": item['question'],
            "Context": item['context'],
            "Generated Answer": pred
        })

    # Convert to DataFrame for tabular view
    df = pd.DataFrame(results)

    return df


# LoRA

## **LoRA Training**

### **Overview**
- Loads the base model with 8-bit quantization to reduce memory requirements
- Fine-tunes LoRA adapters on the full training set
- Uses a custom data collator that guarantees a `labels` field in every batch
- Saves a checkpoint every epoch and reloads the best one (lowest validation loss) at the end
- Saves the final model for inference

### **1. Model Initialization with LoRA**

Initializes the Flan-T5-large model with:
- 8-bit quantization (`load_in_8bit=True`)
- Automatic device placement (`device_map="auto"`)
- LoRA configuration targeting query and value attention layers

In [None]:
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", load_in_8bit=True)

lora_config = LoraConfig(
    r=8,                              # Rank of the update matrices
    lora_alpha=16,                    # Scaling factor for LoRA weights
    target_modules=["q", "v"],        # Apply LoRA to query and value layers
    lora_dropout=0.05,                # Dropout probability for LoRA layers
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)

model = get_peft_model(model, lora_config)


config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]
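
To get a feel for how few weights LoRA actually trains, the sketch below counts the adapter parameters for `r=8` on the `q` and `v` projections. The architecture numbers (hidden size 1024, attention inner dimension 16 × 64, 24 encoder and 24 decoder blocks, with each decoder block holding self-attention and cross-attention) are assumptions based on the published T5-large configuration, not values read from this notebook. Calling `model.print_trainable_parameters()` on the PEFT model is the authoritative check.

```python
# Assumed T5-large dimensions (not read from the notebook's model object).
D_MODEL = 1024          # hidden size
INNER_DIM = 16 * 64     # num_heads * d_kv
ENC_LAYERS = 24
DEC_LAYERS = 24

R = 8                   # LoRA rank, as in lora_config
ALPHA = 16              # lora_alpha, as in lora_config

def lora_params_per_linear(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight."""
    return r * (d_in + d_out)

# q and v in every encoder self-attention,
# plus q and v in every decoder self-attention and cross-attention.
adapted_layers = ENC_LAYERS * 2 + DEC_LAYERS * 2 * 2
trainable = adapted_layers * lora_params_per_linear(D_MODEL, INNER_DIM, R)
scaling = ALPHA / R     # factor applied to the low-rank update

print(adapted_layers, trainable, scaling)
```

Under these assumptions the adapters hold a little under 2.4 million parameters, a small fraction of the base model.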

### **2. Custom Data Collator**

Extends `DataCollatorForSeq2Seq` to ensure proper label handling:

Features:
- Falls back to using `input_ids` as labels if the batch has no `labels` field. Because `preprocess_function` already adds `labels` to every example, this fallback is not expected to trigger here.
- Keeps all standard Seq2Seq collation behavior
- Ensures consistent batch formatting for the T5 model

Reason to create:
- Training and validation loss were both reported as 0. The subclass was added to make sure the trainer always receives `labels`, since `DataCollatorForSeq2Seq` did not appear to be accepting them.

In [None]:
class DataCollatorForSeq2SeqWithLabels(DataCollatorForSeq2Seq):
    def __call__(self, features, return_tensors=None):
        batch = super().__call__(features, return_tensors)
        if 'labels' not in batch:
            batch['labels'] = batch['input_ids']
        return batch

data_collator = DataCollatorForSeq2SeqWithLabels(tokenizer, model=model)
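
When a loss of exactly 0 shows up, one quick thing to inspect is whether a collated batch contains any label positions that are not masked. The sketch below does this on plain Python lists standing in for tensors; `count_supervised_tokens` and `check_batch` are hypothetical helpers, and the check is only a debugging aid under these assumptions, not a diagnosis of what happened in this notebook.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def count_supervised_tokens(label_rows):
    """Count label positions that actually contribute to the loss."""
    return sum(1 for row in label_rows for tok in row if tok != IGNORE_INDEX)

def check_batch(batch):
    """Return a short diagnosis for a collated batch given as a plain dict."""
    if "labels" not in batch:
        return "missing labels"
    if count_supervised_tokens(batch["labels"]) == 0:
        return "all labels masked"
    return "ok"

good_batch = {"input_ids": [[5, 6, 1]], "labels": [[9, 1, -100]]}
masked_batch = {"input_ids": [[5, 6, 1]], "labels": [[-100, -100, -100]]}
no_label_batch = {"input_ids": [[5, 6, 1]]}
print(check_batch(good_batch), check_batch(masked_batch), check_batch(no_label_batch))
```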

### **3. Training Configuration**

Sets up training parameters with `Seq2SeqTrainingArguments`:

Key settings:
- **Checkpointing**: Saves every epoch, keeps last 3 checkpoints
- **Batch Processing**:
  - `per_device_train_batch_size=8`
  - `gradient_accumulation_steps=4` (effective batch size = 32)
- **Learning**:
  - `learning_rate=5e-4`
  - `num_train_epochs=3`
- **Evaluation**:
  - `eval_strategy="epoch"`
  - `predict_with_generate=True`
- **Precision**: `fp16=True` for mixed-precision training

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="/content/drive/MyDrive/FlanT5_checkpoints",
    eval_strategy="epoch",
    save_strategy="epoch",            # Save checkpoint after every epoch
    save_total_limit=3,               # Keep only the last 3 checkpoints to save disk space
    learning_rate=5e-4,
    run_name="flan-t5-lora-squad-debug",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    predict_with_generate=True,
    logging_steps=50,
    logging_first_step=True,
    fp16=True,
    logging_dir="./logs",
    save_steps=500,                   # Only used with save_strategy="steps"; ignored here because checkpoints are saved per epoch
    load_best_model_at_end=True,      # Load best model after training based on eval loss
    metric_for_best_model="loss",      # Choose loss as criteria for best checkpoint
    report_to="none",
)


In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"

### **4. Trainer Initialization**

Creates the `Seq2SeqTrainer` with:
- Configured model (with LoRA)
- Training arguments
- Tokenized training and validation datasets
- Custom data collator

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator
)

No label_names provided for model class `PeftModelForSeq2SeqLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


### **5. Training Execution**

Starts the training process with:
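- `resume_from_checkpoint=True`: The Colab runtime was interrupted during training because A100 availability was not guaranteed, so training resumes from the latest checkpoint in `output_dir`. This only works once at least one checkpoint exists, i.e. after the first epoch has completed; for a fresh run, call `trainer.train()` instead.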

In [None]:
trainer.train(resume_from_checkpoint=True)
# trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 3 | 6.3004 | 0.357009 |


TrainOutput(global_step=8214, training_loss=2.0791697905015445, metrics={'train_runtime': 8138.9728, 'train_samples_per_second': 32.289, 'train_steps_per_second': 1.009, 'total_flos': 6.091999549902029e+17, 'train_loss': 2.0791697905015445, 'epoch': 3.0})
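
The reported `global_step` can be cross-checked from the dataset size and the batch settings. The sketch below assumes a single GPU and assumes that a final, partial accumulation group still counts as one optimizer step; that rounding rule is an assumption chosen because it reproduces the step counts printed in this notebook, and `optimizer_steps` is a hypothetical helper.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate the total number of optimizer updates for a training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs

TRAIN_EXAMPLES = 87599  # size of the SQuAD v1.1 train split shown when loading the dataset

lora_steps = optimizer_steps(TRAIN_EXAMPLES, per_device_batch=8, grad_accum=4, epochs=3)
qlora_steps = optimizer_steps(TRAIN_EXAMPLES, per_device_batch=8, grad_accum=2, epochs=3)
print(lora_steps, qlora_steps)
```

With an effective batch size of 32 this gives 8214 steps, matching the LoRA run; with an effective batch size of 16 it gives 16425, matching the QLoRA run further down.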

### **6. Model Saving**

Saves the fine-tuned model and tokenizer to:
`/content/drive/MyDrive/Lora_flan_t5_final`

Includes:
- LoRA adapter weights and adapter configuration (the base model is loaded separately at inference time)
- Tokenizer files

In [None]:
output_dir = "/content/drive/MyDrive/Lora_flan_t5_final"

# Save model
trainer.save_model(output_dir)

# Save tokenizer
tokenizer.save_pretrained(output_dir)


## **LoRA Inference**

### **1. Model Initialization with LoRA**

Load the Flan-T5-large base model and attach the fine-tuned LoRA adapter to it:

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

model_path = "/content/drive/MyDrive/Lora_flan_t5_final"

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
model = PeftModel.from_pretrained(base_model, model_path)

tokenizer = AutoTokenizer.from_pretrained(model_path)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers


### **2. Metrics generation for model fine-tuned with LoRA**

Call the functions defined in the Inference Functions section to build a DataFrame with all the computed metrics.

Also build a DataFrame with the answers generated by the LoRA fine-tuned model for the questions in `test_data`.

In [None]:
val_texts, val_labels = prepare_eval_data(val_data)
lora_preds = generate_predictions(model, tokenizer, val_texts, batch_size=16, max_input_len=512, max_output_len=64, device="cuda")
lora_metrics = calculate_metrics(val_labels, lora_preds)
lora_metrics_df = create_metrics_df(lora_metrics)
lora_generated_answers = generate_answers(tokenizer, model)

Preparing evaluation data: 100%|██████████| 10570/10570 [00:00<00:00, 13483.20it/s]
Generating: 100%|██████████| 661/661 [20:01<00:00,  1.82s/it]



Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
lora_metrics_df

| | Metric | Score |
|---|---|---|
| 0 | BLEU | 0.3244 |
| 1 | ROUGE-1 | 0.8457 |
| 2 | ROUGE-2 | 0.533 |
| 3 | ROUGE-L | 0.8453 |
| 4 | METEOR | 0.649 |
| 5 | GLEU | 0.7505 |
| 6 | Repetition Rate | 0.0058 |
| 7 | Flesch Reading Ease | 40.5182 |
| 8 | CoSIM | 0.8955 |
| 9 | BERTScore F1 | 0.9641 |


In [None]:
lora_generated_answers = generate_answers(tokenizer, model)
lora_generated_answers

| | Question | Context | Generated Answer |
|---|---|---|---|
| 0 | Who wrote the play Hamlet? | Hamlet is a tragedy written by William Shakesp... | William Shakespeare |
| 1 | What is the capital of France? | France is a country in Western Europe. Its cap... | Paris |
| 2 | Who is attending the Gen AI Course? | IIT is organizing a Gen AI professional certif... | Arko |
| 3 | Why does ice float on water? | Water has unusual properties compared to many ... | water expands when it freezes |
| 4 | Where is Mount Everest located? | Mount Everest is Earth’s highest mountain abov... | Himalayas |
| 5 | How many chambers does the human heart have? | The human heart has four chambers: two atria (... | four |


# QLoRA

## QLoRA Training

### Overview
- Uses 4-bit quantization (NF4 type with double quantization) for optimal memory efficiency
- Applies QLoRA adapters (r=8) to query and value attention projections
- Processes the entire training dataset with gradient accumulation
- Custom Seq2Seq data collator with automatic label handling
- Automatic checkpointing after each epoch (keeps only the 2 most recent checkpoints)
- BF16 mixed-precision training (optimal for A100 GPUs)
- Paged 8-bit optimizer for memory-efficient weight updates
- Saves final adapter weights separately for inference deployment

### **1. Model Initialization with QLoRA**

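**Quantization Configuration:**
- `load_in_4bit`: Stores the quantized weights in 4 bits, roughly 75% smaller than FP16 and 87.5% smaller than FP32 for those weights
- `bnb_4bit_use_double_quant`: Also quantizes the quantization constants, for a small additional memory saving
- `bnb_4bit_quant_type="nf4"`: NormalFloat4, a 4-bit data type designed for normally distributed weights
- `bnb_4bit_compute_dtype=torch.bfloat16`: Runs computation in bfloat16, which keeps training stable while staying memory efficient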
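
The storage savings from quantization follow directly from the number of bits per weight. The sketch below works through the arithmetic for an assumed parameter count of about 0.78 billion, which is roughly what the 3.13 GB FP32 checkpoint in the download log implies. These are idealized figures: they ignore quantization constants and any layers that stay in higher precision, and the helper names are illustrative.

```python
NUM_PARAMS = 780_000_000  # assumption: ~3.13 GB FP32 checkpoint / 4 bytes per weight

def weight_gigabytes(num_params, bits_per_weight):
    """Idealized storage for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

def reduction_percent(bits, baseline_bits):
    """Percentage saved when going from baseline_bits to bits per weight."""
    return 100 * (1 - bits / baseline_bits)

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("NF4", 4)]:
    print(f"{name:10s} {weight_gigabytes(NUM_PARAMS, bits):.2f} GB")

print(reduction_percent(4, 32), reduction_percent(4, 16))
```

In this idealized view, 4-bit storage is 87.5% smaller than FP32 and 75% smaller than FP16.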



In [None]:
from transformers import BitsAndBytesConfig, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-large"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16  # A100-friendly
)

# Load base model in 4-bit
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

# QLoRA config
qlora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q", "v"],  # T5 attention projections
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM"
)

# Apply QLoRA adapters
model = get_peft_model(model, qlora_config)

config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

### **2. Custom Data Collator**

Extends `DataCollatorForSeq2Seq` to ensure proper label handling:

Features:
- Falls back to using `input_ids` as labels if the batch has no `labels` field. Because `preprocess_function` already adds `labels` to every example, this fallback is not expected to trigger here.
- Keeps all standard Seq2Seq collation behavior
- Ensures consistent batch formatting for the T5 model

Reason to create:
- Training and validation loss were both reported as 0. The subclass was added to make sure the trainer always receives `labels`, since `DataCollatorForSeq2Seq` did not appear to be accepting them.

In [None]:
class DataCollatorForSeq2SeqWithLabels(DataCollatorForSeq2Seq):
    def __call__(self, features, return_tensors=None):
        batch = super().__call__(features, return_tensors)
        if 'labels' not in batch:
            batch['labels'] = batch['input_ids']
        return batch

data_collator = DataCollatorForSeq2SeqWithLabels(tokenizer, model=model)

### **3. Training Configuration**

Sets up training parameters with `Seq2SeqTrainingArguments`:

Key settings:
- **Checkpointing**: Saves every epoch, keeps last 2 checkpoints (`save_total_limit=2`)
- **Batch Processing**:
  - `per_device_train_batch_size=8`
  - `gradient_accumulation_steps=2` (effective batch size = 16)
- **Learning**:
  - `learning_rate=2e-4`
  - `num_train_epochs=3`
- **Evaluation**:
  - `eval_strategy="epoch"`
  - `predict_with_generate=True`
- **Precision**: `bf16=True` for mixed-precision training

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="/content/drive/MyDrive/Qlora_FlanT5_checkpoints",
    per_device_train_batch_size=8,         # Larger batch size fits fine on A100
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,         # Effective batch size = 16
    learning_rate=2e-4,                    # Good for LoRA/QLoRA
    num_train_epochs=3,                    # SQuAD is small; 3–5 works well
    logging_dir="./logs",
    logging_steps=25,
    eval_strategy="epoch",           # Evaluate after each epoch
    save_strategy="epoch",
    save_total_limit=2,                    # Keep disk space clean
    bf16=True,                             # A100 supports bf16 (better than fp16 here)
    predict_with_generate=True,            # Required for ROUGE/BLEU style evals
    generation_max_length=256,             # For answer generation
    warmup_steps=100,                       # Stabilizes early training
    weight_decay=0.01,                      # Helps generalization
    optim="paged_adamw_8bit",               # QLoRA-optimized AdamW
    lr_scheduler_type="linear",
    report_to="none"                        # Disable W&B unless you want logging
)
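
The combination of `warmup_steps=100` and `lr_scheduler_type="linear"` means the learning rate ramps up linearly to its peak and then decays linearly to zero. The sketch below evaluates that shape for the settings above; `linear_lr` is a hypothetical helper, and the total of 16425 steps is taken from the training output reported for this run.

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=100, total_steps=16425):
    """Learning rate at a given optimizer step for linear warmup then linear decay."""
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: scale linearly from base_lr at the end of warmup down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

for step in (0, 50, 100, 8000, 16425):
    print(step, linear_lr(step))
```

The peak of `2e-4` is reached at step 100, so only a small part of the first epoch runs at a reduced learning rate.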


### **4. Trainer Initialization**

Creates the `Seq2SeqTrainer` with:
- Configured model (with QLoRA)
- Training arguments
- Tokenized training and validation datasets
- Custom data collator

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator
)



### **5. Training Execution**

Starts the training process with:
- `resume_from_checkpoint=True` (optional): The Colab runtime was interrupted during training because A100 availability was not guaranteed. Resuming only works once at least one checkpoint exists, i.e. after the first epoch has completed.

In [None]:
trainer.train()
# trainer.train(resume_from_checkpoint=True)


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.2042 | 0.245085 |
| 2 | 0.2361 | 0.241795 |
| 3 | 0.2093 | 0.24316 |


TrainOutput(global_step=16425, training_loss=0.23242021485187872, metrics={'train_runtime': 21458.6416, 'train_samples_per_second': 12.247, 'train_steps_per_second': 0.765, 'total_flos': 6.075907920573235e+17, 'train_loss': 0.23242021485187872, 'epoch': 3.0})

### **6. Model Saving**

Saves the fine-tuned model and tokenizer to:
`/content/drive/MyDrive/Qlora_flan_t5_final`

Includes:
- QLoRA adapter weights and adapter configuration (the base model is loaded separately at inference time)
- Tokenizer files

In [None]:
output_dir = "/content/drive/MyDrive/Qlora_flan_t5_final"

# Save model
trainer.save_model(output_dir)

# Save tokenizer
tokenizer.save_pretrained(output_dir)

('/content/drive/MyDrive/Qlora_flan_t5_final/tokenizer_config.json',
 '/content/drive/MyDrive/Qlora_flan_t5_final/special_tokens_map.json',
 '/content/drive/MyDrive/Qlora_flan_t5_final/spiece.model',
 '/content/drive/MyDrive/Qlora_flan_t5_final/added_tokens.json')

### **7. Runtime Disconnection**

After training completes, disconnect the runtime to conserve compute units for later runs.

In [None]:
from google.colab import runtime
runtime.unassign()

## QLoRA Inference

### **1. Model Initialization with QLoRA**

Load the Flan-T5-large base model and combine it with the fine-tuned QLoRA model:

In [None]:
import torch  # needed for torch.bfloat16 in the quantization config below
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, BitsAndBytesConfig
from peft import PeftModel
from datasets import load_dataset

model_dir = "/content/drive/MyDrive/Qlora_flan_t5_final"

tokenizer = AutoTokenizer.from_pretrained(model_dir)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-large",
    quantization_config=bnb_config,
    device_map="auto"
)

model = PeftModel.from_pretrained(base_model, model_dir)
# model.eval()
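To see why 4-bit loading matters, the weight memory can be estimated with simple arithmetic. The sketch below is only a rough estimate. It assumes the 3.13 GB `model.safetensors` file reported in this notebook's download log stores fp32 weights, and it ignores quantization constants, layers kept in higher precision, activations, and optimizer state.

```python
# Size of model.safetensors as reported in this notebook's download log.
CHECKPOINT_BYTES = 3.13e9
# Assumption: the published checkpoint stores weights in fp32 (4 bytes each).
APPROX_PARAMS = CHECKPOINT_BYTES / 4

# Bytes needed per parameter for each storage format.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "nf4": 0.5}


def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


estimates = {name: weight_memory_gb(APPROX_PARAMS, b) for name, b in BYTES_PER_PARAM.items()}
for name, gb in estimates.items():
    print(f"{name:>5}: ~{gb:.2f} GB of weights")
```

Under these assumptions the NF4 weights take roughly an eighth of the fp32 footprint, which is the main saving QLoRA offers over LoRA on an unquantized base model.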

### **2. Metrics generation for model fine-tuned with QLoRA**

Call the functions declared in the Inference functions section and build a DataFrame with all the computed metrics.

Also build a DataFrame containing the answers generated by the QLoRA fine-tuned model for the questions in `test_data`.

In [None]:
val_texts, val_labels = prepare_eval_data(val_data)
qlora_preds = generate_predictions(model, tokenizer, val_texts, batch_size=16, max_input_len=512, max_output_len=64, device="cuda")
qlora_metrics = calculate_metrics(val_labels, qlora_preds)
qlora_metrics_df = create_metrics_df(qlora_metrics)
qlora_generated_answers = generate_answers(tokenizer, model)

Preparing evaluation data: 100%|██████████| 10570/10570 [00:00<00:00, 13435.75it/s]
Generating: 100%|██████████| 661/661 [27:37<00:00,  2.51s/it]


Batches:   0%|          | 0/331 [00:00<?, ?it/s]

Batches:   0%|          | 0/331 [00:00<?, ?it/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
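The progress-bar totals in this output follow directly from the dataset size and the batch sizes. The short check below reproduces them. The batch size of 32 for the `Batches` bars is an assumption (they presumably come from an embedding model used for one of the similarity metrics); everything else is read from the log.

```python
import math

N_EVAL = 10570            # from the "Preparing evaluation data" progress bar
GEN_BATCH_SIZE = 16       # batch_size passed to generate_predictions
SECONDS_PER_BATCH = 2.51  # from the "Generating" progress bar

# Number of generation batches and the implied wall-clock time.
gen_batches = math.ceil(N_EVAL / GEN_BATCH_SIZE)
est_minutes = gen_batches * SECONDS_PER_BATCH / 60

# Assumption: the "Batches" bars use a batch size of 32.
embed_batches = math.ceil(N_EVAL / 32)

print(f"generation batches: {gen_batches}, about {est_minutes:.1f} minutes")
print(f"embedding batches:  {embed_batches}")
```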


In [None]:
qlora_metrics_df

| Metric | Score |
|---|---|
| BLEU | 0.3411 |
| ROUGE-1 | 0.8675 |
| ROUGE-2 | 0.5576 |
| ROUGE-L | 0.8672 |
| METEOR | 0.6717 |
| GLEU | 0.7765 |
| Repetition Rate | 0.0058 |
| Flesch Reading Ease | 40.5583 |
| CoSIM | 0.9103 |
| BERTScore F1 | 0.9678 |
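The notebook's own `calculate_metrics` is defined in the Inference functions section and may compute Repetition Rate differently. As an illustration only, one simple definition is the share of tokens in an answer that repeat an earlier token, with diversity as its complement. The function names below are hypothetical.

```python
from typing import List


def repetition_rate(text: str) -> float:
    """Share of whitespace tokens that repeat an earlier token in the same text."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return 1.0 - len(set(tokens)) / len(tokens)


def corpus_repetition_rate(texts: List[str]) -> float:
    """Average per-answer repetition rate over a list of generated answers."""
    if not texts:
        return 0.0
    return sum(repetition_rate(t) for t in texts) / len(texts)


def diversity(texts: List[str]) -> float:
    """Complement of the repetition rate."""
    return 1.0 - corpus_repetition_rate(texts)


answers = ["the cat the cat", "four"]
print(corpus_repetition_rate(answers), diversity(answers))
```

Short extractive answers such as "Paris" or "four" contain no repeated tokens, which is consistent with the very low repetition rates reported for SQuAD-style outputs.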


In [None]:
qlora_generated_answers = generate_answers(tokenizer, model)
qlora_generated_answers

Unnamed: 0,Question,Context,Generated Answer
0,Who wrote the play Hamlet?,Hamlet is a tragedy written by William Shakesp...,William Shakespeare
1,What is the capital of France?,France is a country in Western Europe. Its cap...,Paris
2,Who is attending the Gen AI Course?,IIT is organizing a Gen AI professional certif...,Arko
3,Why does ice float on water?,Water has unusual properties compared to many ...,less dense than liquid water
4,Where is Mount Everest located?,Mount Everest is Earth’s highest mountain abov...,Himalayas
5,How many chambers does the human heart have?,The human heart has four chambers: two atria (...,four


# Prefix Tuning


## Prefix-Tuning Training

### Overview
- Uses Prefix-Tuning (a parameter-efficient method) with 40 virtual tokens
- Loads the base model with `torch_dtype="auto"`, which keeps the dtype stored in the checkpoint
- Processes full batches (size=16) without gradient accumulation
- Custom Seq2Seq data collator with automatic label generation
- Saves best checkpoint by validation loss after each epoch
- BF16 mixed-precision training for optimal A100 performance
- Final model saves only prefix-tuning parameters (~0.1% of total weights)

### **1. Model Initialization with Prefix Tuning**

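Initializes the Flan-T5-large model as follows:
- Load the model with `torch_dtype="auto"`, which keeps the dtype stored in the checkpoint. BF16 mixed precision is enabled separately through `bf16=True` in the training arguments.
- Define the Prefix-Tuning configuration.
- Wrap the model with the Prefix-Tuning configuration.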



In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PrefixTuningConfig, get_peft_model

# Load tokenizer
model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model in the dtype stored in the checkpoint (torch_dtype="auto")
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prefix-Tuning config
prefix_tuning_config = PrefixTuningConfig(
    task_type="SEQ_2_SEQ_LM",
    num_virtual_tokens=40,
    encoder_hidden_size=model.config.d_model
)

# Wrap model for Prefix-Tuning
model = get_peft_model(model, prefix_tuning_config)

config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

### **2. Training Configuration**

Sets up training parameters with `Seq2SeqTrainingArguments`:

Key settings:
- **Checkpointing**: Saves every epoch, keeps all 4 epoch checkpoints (no `save_total_limit` is set), and reloads the checkpoint with the lowest `eval_loss` at the end
- **Batch Processing**:
  - `per_device_train_batch_size=16`
  - `gradient_accumulation_steps=1` (effective batch size = 16)
- **Learning**:
  - `learning_rate=1e-3`
  - `num_train_epochs=4` (the number of epochs was increased to improve results)
- **Evaluation**:
  - `eval_strategy="epoch"`
  - `predict_with_generate=True`
- **Precision**: `bf16=True` for mixed-precision training

In [None]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="/content/drive/MyDrive/PFT_FlanT5_checkpoints",
    eval_strategy="epoch",
    save_strategy="epoch",             # save at the same time as eval
    load_best_model_at_end=True,        # load best checkpoint by eval metric
    metric_for_best_model="eval_loss",  # or ROUGE/BLEU if defined in compute_metrics
    greater_is_better=False,            # because lower loss is better
    learning_rate=1e-3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=1,
    num_train_epochs=4,
    weight_decay=0.01,
    predict_with_generate=True,
    bf16=True,
    logging_dir="./logs",
    logging_steps=50,
    report_to="none"
)


### **3. Custom Data Collator**

Extends `DataCollatorForSeq2Seq` to ensure proper label handling:

Features:
- Falls back to using `input_ids` as labels if no explicit labels provided
- Maintains all standard Seq2Seq collation functionality
- Ensures consistent batch formatting for the T5 model

Reason to create:
- Training loss and validation loss were both 0. The trainer had to be made to use the `labels`, as `DataCollatorForSeq2Seq` wasn't accepting `labels` as a parameter.

In [None]:
class DataCollatorForSeq2SeqWithLabels(DataCollatorForSeq2Seq):
    def __call__(self, features, return_tensors=None):
        batch = super().__call__(features, return_tensors)
        if 'labels' not in batch:
            batch['labels'] = batch['input_ids']
        return batch

data_collator = DataCollatorForSeq2SeqWithLabels(tokenizer, model=model)
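A loss of exactly 0 can also mean that the tokenized examples carry no usable targets, so it is worth checking the dataset itself before relying on the fallback. The sketch below is a hypothetical diagnostic, not part of the original notebook. It assumes each example is a dict of token-id lists and treats `-100` as the ignore value, which is the default label padding used by `DataCollatorForSeq2Seq` and the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import Dict, List, Sequence

IGNORE_INDEX = -100  # label positions with this value do not contribute to the loss


def rows_without_usable_labels(features: Sequence[Dict[str, List[int]]]) -> List[int]:
    """Return indices of examples whose labels are missing, empty, or fully ignored."""
    bad = []
    for i, row in enumerate(features):
        labels = row.get("labels")
        if not labels or all(t == IGNORE_INDEX for t in labels):
            bad.append(i)
    return bad


# Tiny illustrative batch: row 1 has no labels, row 2 has only ignored positions.
sample = [
    {"input_ids": [71, 822, 1], "labels": [1919, 1]},
    {"input_ids": [71, 822, 1]},
    {"input_ids": [71, 822, 1], "labels": [-100, -100]},
]
print(rows_without_usable_labels(sample))
```

If this check reports rows on the real tokenized dataset, fixing the preprocessing step is usually preferable to copying `input_ids` into `labels`, since that fallback trains the model to reproduce its input rather than the answer.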

### **4. Trainer Initialization**

Creates the `Seq2SeqTrainer` with:
- Configured model (with Prefix Tuning)
- Training arguments
- Tokenized training and validation datasets
- Custom data collator

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator
)


### **5. Training Execution**

Starts the training process with:
- `resume_from_checkpoint=True` (optional): The Colab runtime was interrupted during training because A100 GPU availability is not guaranteed. Resuming only worked once the first epoch had completed, since checkpoints are saved at the end of each epoch.

In [None]:
# Increased Epoch to improve BLEU, ROUGE-2, and METEOR

# trainer.train()
trainer.train(resume_from_checkpoint=True)

Epoch,Training Loss,Validation Loss
4,0.254,0.253146


TrainOutput(global_step=21900, training_loss=0.06411400032914392, metrics={'train_runtime': 1722.03, 'train_samples_per_second': 203.478, 'train_steps_per_second': 12.718, 'total_flos': 8.078764812030444e+17, 'train_loss': 0.06411400032914392, 'epoch': 4.0})
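The numbers in `TrainOutput` can be cross-checked against the training configuration. The sketch below backs out the training-set size from the reported throughput and confirms the step count, assuming a single GPU. All constants are copied from the output and arguments above.

```python
import math

# Values copied from the TrainOutput above.
TRAIN_RUNTIME_S = 1722.03
SAMPLES_PER_S = 203.478
EPOCHS = 4
GLOBAL_STEP = 21900

# Values copied from the training arguments (single GPU assumed).
PER_DEVICE_BATCH = 16
GRAD_ACCUM = 1

# The throughput metric appears to count every sample of all epochs.
samples_counted = TRAIN_RUNTIME_S * SAMPLES_PER_S
n_train = round(samples_counted / EPOCHS)

batches_per_epoch = math.ceil(n_train / PER_DEVICE_BATCH)
steps_per_epoch = math.ceil(batches_per_epoch / GRAD_ACCUM)
expected_steps = steps_per_epoch * EPOCHS

print(f"implied training examples: {n_train}")
print(f"expected optimizer steps:  {expected_steps} (logged: {GLOBAL_STEP})")
```

One caveat follows from this check: the run was resumed, so `train_runtime` most likely covers only the resumed portion while the sample count covers all four epochs. The reported samples per second, and probably the averaged `training_loss` as well, should therefore not be compared directly with the QLoRA run, which trained from scratch.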

### **6. Model Saving**

Saves the fine-tuned model and tokenizer to:
`/content/drive/MyDrive/PFT_flan_t5_final`

Includes:
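- Prefix-Tuning adapter weights and adapter config (the base model is not saved; it is reloaded from `google/flan-t5-large` at inference time)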
- Tokenizer files
- Configuration files

In [None]:
output_dir = "/content/drive/MyDrive/PFT_flan_t5_final"

# Save model
trainer.save_model(output_dir)

# Save tokenizer
tokenizer.save_pretrained(output_dir)

('/content/drive/MyDrive/PFT_flan_t5_final/tokenizer_config.json',
 '/content/drive/MyDrive/PFT_flan_t5_final/special_tokens_map.json',
 '/content/drive/MyDrive/PFT_flan_t5_final/spiece.model',
 '/content/drive/MyDrive/PFT_flan_t5_final/added_tokens.json',
 '/content/drive/MyDrive/PFT_flan_t5_final/tokenizer.json')

## Prefix-Tuning Inference

### **1. Model Initialization with Prefix-Tuning**

Load the Flan-T5-large base model and combine it with the fine-tuned Prefix-Tuning model:

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel
import torch

model_path = "/content/drive/MyDrive/PFT_flan_t5_final"

# Load base model
base_model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-large",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load PEFT weights of Prefix-Tuning
model = PeftModel.from_pretrained(base_model, model_path)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)


### **2. Metrics generation for model fine-tuned with Prefix-Tuning**

Call the functions declared in the Inference functions section and build a DataFrame with all the computed metrics.

Also build a DataFrame containing the answers generated by the Prefix-Tuning fine-tuned model for the questions in `test_data`.

In [None]:
val_texts, val_labels = prepare_eval_data(val_data)
pft_preds = generate_predictions(model, tokenizer, val_texts, batch_size=16, max_input_len=512, max_output_len=64, device="cuda")
pft_metrics = calculate_metrics(val_labels, pft_preds)
pft_metrics_df = create_metrics_df(pft_metrics)
pft_generated_answers = generate_answers(tokenizer, model)

Preparing evaluation data: 100%|██████████| 10570/10570 [00:00<00:00, 13357.21it/s]
Generating: 100%|██████████| 661/661 [14:57<00:00,  1.36s/it]


Batches:   0%|          | 0/331 [00:00<?, ?it/s]

Batches:   0%|          | 0/331 [00:00<?, ?it/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
pft_metrics_df

| Metric | Score |
|---|---|
| BLEU | 0.1611 |
| ROUGE-1 | 0.6222 |
| ROUGE-2 | 0.3011 |
| ROUGE-L | 0.621 |
| METEOR | 0.4381 |
| GLEU | 0.4622 |
| Repetition Rate | 0.0497 |
| Flesch Reading Ease | 40.7858 |
| CoSIM | 0.7515 |
| BERTScore F1 | 0.9155 |


In [None]:
pft_generated_answers = generate_answers(tokenizer, model)
pft_generated_answers


Unnamed: 0,Question,Context,Generated Answer
0,Who wrote the play Hamlet?,Hamlet is a tragedy written by William Shakesp...,William Shakespeare
1,What is the capital of France?,France is a country in Western Europe. Its cap...,Paris
2,Who is attending the Gen AI Course?,IIT is organizing a Gen AI professional certif...,Arko
3,Why does ice float on water?,Water has unusual properties compared to many ...,water expands
4,Where is Mount Everest located?,Mount Everest is Earth’s highest mountain abov...,Himalayas
5,How many chambers does the human heart have?,The human heart has four chambers: two atria (...,four


# Full Tuning

## Training

### Overview
- Full Fine-Tuning of Flan-T5-large (all parameters updated)
- BF16 mixed-precision training (enabled with `bf16=True`; requires a GPU with BF16 support such as the A100)
- Gradient Accumulation (steps=4) for effective batch size of 32
- Custom Seq2Seq Data Collator with automatic label handling
- Checkpoint Management: Keeps at most 2 checkpoints (`save_total_limit=2`) and reloads the best one by validation loss
- Lower Learning Rate (5e-4) for stable full parameter updates
- Complete model saved for deployment (~3GB for T5-large)



### **1. Model Initialization with Full Fine-Tuning**

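Initializes the Flan-T5-large model as follows:
- Load the model with `torch_dtype="auto"`, which keeps the dtype stored in the checkpoint. BF16 mixed precision is enabled separately through `bf16=True` in the training arguments.
- No PEFT wrapping is applied, since all parameters are updated in full fine-tuning.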



In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # keeps the dtype stored in the checkpoint
    device_map="auto"
)


### **2. Training Configuration**

Sets up training parameters with `Seq2SeqTrainingArguments`:

Key settings:
- **Checkpointing**: Saves every epoch, keeps last 2 checkpoints
- **Batch Processing**:
  - `per_device_train_batch_size=8`
  - `gradient_accumulation_steps=4` (effective batch size = 32)
- **Learning**:
  - `learning_rate=5e-4`
  - `num_train_epochs=2` (Flan-T5-large already performs well on question-answering tasks, so fewer epochs still give good results)
- **Evaluation**:
  - `eval_strategy="epoch"`
  - `predict_with_generate=True`
- **Precision**: `bf16=True` for mixed-precision training

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="/content/drive/MyDrive/Full_FlanT5_checkpoints",  # absolute path so checkpoints land on Drive
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    gradient_accumulation_steps=4,
    learning_rate=5e-4,                  # full FT often uses lower LR
    per_device_train_batch_size=8,       # A100 can handle bigger, maybe 16 if RAM allows
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    predict_with_generate=True,
    bf16=True,
    logging_dir="./logs",
    logging_steps=50,
    report_to="none"
)


### **3. Custom Data Collator**

Extends `DataCollatorForSeq2Seq` to ensure proper label handling:

Features:
- Falls back to using `input_ids` as labels if no explicit labels provided
- Maintains all standard Seq2Seq collation functionality
- Ensures consistent batch formatting for the T5 model

Reason to create:
- Training loss and validation loss were both 0. The trainer had to be made to use the `labels`, as `DataCollatorForSeq2Seq` wasn't accepting `labels` as a parameter.

In [None]:
class DataCollatorForSeq2SeqWithLabels(DataCollatorForSeq2Seq):
    def __call__(self, features, return_tensors=None):
        batch = super().__call__(features, return_tensors)
        if 'labels' not in batch:
            batch['labels'] = batch['input_ids']
        return batch

data_collator = DataCollatorForSeq2SeqWithLabels(tokenizer, model=model)

### **4. Trainer Initialization**

Creates the `Seq2SeqTrainer` with:
- Configured model
- Training arguments
- Tokenized training and validation datasets
- Custom data collator

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator
)



### **5. Training Execution**

Starts the training process with:
- `resume_from_checkpoint=True` (optional): The Colab runtime was interrupted during training because A100 GPU availability is not guaranteed. Resuming only worked once the first epoch had completed, since checkpoints are saved at the end of each epoch.

In [None]:
# trainer.train()
trainer.train(resume_from_checkpoint=True)


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].


Epoch,Training Loss,Validation Loss
2,0.1611,0.306119


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].


TrainOutput(global_step=5476, training_loss=0.08648843378634066, metrics={'train_runtime': 4356.1551, 'train_samples_per_second': 40.218, 'train_steps_per_second': 1.257, 'total_flos': 4.037907354526679e+17, 'train_loss': 0.08648843378634066, 'epoch': 2.0})
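The logged `global_step` is consistent with an effective batch size of 32. The sketch below works through the arithmetic. `optimizer_steps` is a hypothetical helper, the training-set size of 87,599 is assumed to be the full SQuAD v1.1 training split, and rounding up the last partial accumulation group is an assumption that happens to match the logged value.

```python
import math


def optimizer_steps(n_train: int, per_device_batch: int, grad_accum: int, epochs: int) -> int:
    """Optimizer updates for a single-GPU run, rounding partial groups up."""
    batches_per_epoch = math.ceil(n_train / per_device_batch)
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return updates_per_epoch * epochs


N_TRAIN = 87_599  # assumption: full SQuAD v1.1 training split

# Full fine-tuning configuration from the training arguments above.
effective_batch = 8 * 4
full_steps = optimizer_steps(N_TRAIN, per_device_batch=8, grad_accum=4, epochs=2)

print(f"effective batch size: {effective_batch}")
print(f"expected optimizer steps: {full_steps} (logged: 5476)")
```

Gradient accumulation trades wall-clock time for memory: each update still sees 32 examples, but only 8 are held on the GPU at once.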

### **6. Model Saving**

Saves the fine-tuned model and tokenizer to:
`/content/drive/MyDrive/Full_flan_t5_final`

Includes:
- Full model weights
- Tokenizer files
- Configuration files

In [None]:
output_dir = "/content/drive/MyDrive/Full_flan_t5_final"

# Save model
trainer.save_model(output_dir)

# Save tokenizer
tokenizer.save_pretrained(output_dir)

('/content/drive/MyDrive/Full_flan_t5_final/tokenizer_config.json',
 '/content/drive/MyDrive/Full_flan_t5_final/special_tokens_map.json',
 '/content/drive/MyDrive/Full_flan_t5_final/spiece.model',
 '/content/drive/MyDrive/Full_flan_t5_final/added_tokens.json',
 '/content/drive/MyDrive/Full_flan_t5_final/tokenizer.json')

### **7. Runtime Disconnection**

After training completes, disconnect the runtime to conserve compute units for later training runs.

In [None]:
#To conserve compute units

from google.colab import runtime
runtime.unassign()

## Inference

### **1. Model Initialization for Full Fine-Tuning**

Load the fully fine-tuned Flan-T5-large model directly from the saved directory. No adapter has to be combined with a base model:

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch


model_path = "/content/drive/MyDrive/Full_flan_t5_final"

# Load the fine-tuned model directly
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
# model.eval()

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

### **2. Metrics generation for the fully fine-tuned model**

Call the functions declared in the Inference functions section and build a DataFrame with all the computed metrics.

Also build a DataFrame containing the answers generated by the fully fine-tuned model for the questions in `test_data`.

In [None]:
val_texts, val_labels = prepare_eval_data(val_data)
full_preds = generate_predictions(model, tokenizer, val_texts, batch_size=16, max_input_len=512, max_output_len=64, device="cuda")
full_metrics = calculate_metrics(val_labels, full_preds)
full_metrics_df = create_metrics_df(full_metrics)
full_generated_answers = generate_answers(tokenizer, model)

Preparing evaluation data: 100%|██████████| 10570/10570 [00:00<00:00, 13087.36it/s]
Generating: 100%|██████████| 661/661 [12:38<00:00,  1.15s/it]


Batches:   0%|          | 0/331 [00:00<?, ?it/s]

Batches:   0%|          | 0/331 [00:00<?, ?it/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
full_metrics_df

| Metric | Score |
|---|---|
| BLEU | 0.3342 |
| ROUGE-1 | 0.8497 |
| ROUGE-2 | 0.5477 |
| ROUGE-L | 0.8494 |
| METEOR | 0.661 |
| GLEU | 0.754 |
| Repetition Rate | 0.0063 |
| Flesch Reading Ease | 41.1772 |
| CoSIM | 0.8969 |
| BERTScore F1 | 0.9646 |


In [None]:
full_generated_answers = generate_answers(tokenizer, model)
full_generated_answers

Unnamed: 0,Question,Context,Generated Answer
0,Who wrote the play Hamlet?,Hamlet is a tragedy written by William Shakesp...,William Shakespeare
1,What is the capital of France?,France is a country in Western Europe. Its cap...,Paris
2,Who is attending the Gen AI Course?,IIT is organizing a Gen AI professional certif...,Arko
3,Why does ice float on water?,Water has unusual properties compared to many ...,water expands when it freezes
4,Where is Mount Everest located?,Mount Everest is Earth’s highest mountain abov...,Himalayas
5,How many chambers does the human heart have?,The human heart has four chambers: two atria (...,four


# **Results & Analysis**

In [None]:
import pandas as pd

# Rename Score columns
full_metrics_df = full_metrics_df.rename(columns={"Score": "Full Tuning"})
lora_metrics_df = lora_metrics_df.rename(columns={"Score": "LoRA"})
qlora_metrics_df = qlora_metrics_df.rename(columns={"Score": "QLoRA"})
pft_metrics_df = pft_metrics_df.rename(columns={"Score": "Prefix-Tuning"})

# Merge dataframes
df_combined = full_metrics_df.merge(lora_metrics_df, on="Metric") \
                     .merge(qlora_metrics_df, on="Metric") \
                     .merge(pft_metrics_df, on="Metric")

df_combined.set_index("Metric", inplace=True)

df_combined

| Metric | Full Tuning | LoRA | QLoRA | Prefix-Tuning |
|---|---|---|---|---|
| BLEU | 0.3342 | 0.3244 | 0.3411 | 0.1611 |
| ROUGE-1 | 0.8497 | 0.8457 | 0.8675 | 0.6222 |
| ROUGE-2 | 0.5477 | 0.533 | 0.5576 | 0.3011 |
| ROUGE-L | 0.8494 | 0.8453 | 0.8672 | 0.621 |
| METEOR | 0.661 | 0.649 | 0.6717 | 0.4381 |
| GLEU | 0.754 | 0.7505 | 0.7765 | 0.4622 |
| Repetition Rate | 0.0063 | 0.0058 | 0.0058 | 0.0497 |
| Flesch Reading Ease | 41.1772 | 40.5182 | 40.5583 | 40.7858 |
| CoSIM | 0.8969 | 0.8955 | 0.9103 | 0.7515 |
| BERTScore F1 | 0.9646 | 0.9641 | 0.9678 | 0.9155 |
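Before reading the analysis, it helps to see which method leads on each metric. The sketch below recomputes the per-metric winners from the values in the combined table using plain Python. Treating Repetition Rate as lower-is-better and leaving out Flesch Reading Ease (a descriptive readability score rather than a quality score) are assumptions made for this illustration.

```python
METHODS = ["Full Tuning", "LoRA", "QLoRA", "Prefix-Tuning"]

# Values copied from the combined table, in METHODS order.
SCORES = {
    "BLEU": [0.3342, 0.3244, 0.3411, 0.1611],
    "ROUGE-1": [0.8497, 0.8457, 0.8675, 0.6222],
    "ROUGE-2": [0.5477, 0.533, 0.5576, 0.3011],
    "ROUGE-L": [0.8494, 0.8453, 0.8672, 0.621],
    "METEOR": [0.661, 0.649, 0.6717, 0.4381],
    "GLEU": [0.754, 0.7505, 0.7765, 0.4622],
    "Repetition Rate": [0.0063, 0.0058, 0.0058, 0.0497],
    "CoSIM": [0.8969, 0.8955, 0.9103, 0.7515],
    "BERTScore F1": [0.9646, 0.9641, 0.9678, 0.9155],
}
LOWER_IS_BETTER = {"Repetition Rate"}


def best_methods(metric, values):
    """Return every method that ties for the best value on this metric."""
    target = min(values) if metric in LOWER_IS_BETTER else max(values)
    return [m for m, v in zip(METHODS, values) if v == target]


winners = {metric: best_methods(metric, vals) for metric, vals in SCORES.items()}
for metric, names in winners.items():
    print(f"{metric:<16} -> {', '.join(names)}")
```

In this table QLoRA leads on every overlap and similarity metric and ties with LoRA on Repetition Rate, although the margins over Full Tuning and LoRA are small.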


## **Strengths and Weaknesses**

This section compares the strengths and weaknesses of each fine-tuning method used.

### **1. Full Tuning**
#### **Strengths:**

- High BLEU (0.3342) and ROUGE-L (0.8494) → Captures n-gram overlaps and longer sequence matches well.
- Low Repetition Rate (0.0063) → Produces varied outputs without excessive redundancy.
- Balanced Novelty (0.1031) → Some creative variation without straying too far from reference.
- High BERTScore F1 (0.9646) → Strong semantic similarity to references.

#### **Weaknesses:**

- Slightly weaker than QLoRA in ROUGE scores and semantic similarity (CoSIM, BERTScore F1).
- Flesch Reading Ease (41.18) is marginally the highest of the four methods, but all scores fall between 40.5 and 41.2, so readability does not set Full Tuning apart.
- Training is the most computationally expensive and the most time-consuming.

### **2. LoRA**
#### **Strengths:**

- Very close to Full Tuning in BLEU (0.3244) and ROUGE-L (0.8453).
- Low Toxicity (0.0112) → Safer and more neutral outputs.
- Low Repetition Rate (0.0058) → Minimal redundancy.
- High Novelty (0.1045) → Slightly more creative than Full Tuning.

#### **Weaknesses:**

- Slightly lower semantic similarity (CoSIM 0.8955) than QLoRA/Full.
- Slight drop in METEOR and ROUGE compared to QLoRA.
- Cheaper to train than Full Tuning, but still computationally expensive and time-consuming.

### **3. QLoRA**
#### **Strengths:**

- Best overall semantic quality: CoSIM (0.9103) and BERTScore F1 (0.9678) — shows it captures meaning better even if exact wording differs.
- Best ROUGE scores — especially ROUGE-1 (0.8675) and ROUGE-L (0.8672).
- Highest METEOR (0.6717) — strong on synonym and paraphrase matching.
- Low repetition rate (0.0058) and very high diversity (0.9942).
- Lower GPU memory footprint than Full Tuning and LoRA, since the base model is loaded in 4-bit.

#### **Weaknesses:**

- Slightly lower novelty (0.0897) — may stick closer to training phrasing.
- BLEU (0.3411) is the highest of the four methods, just above Full Tuning (0.3342), but still modest in absolute terms: exact n-gram overlap is not maximized.

### **4. Prefix Tuning**
#### **Strengths:**

- Highest novelty (0.2485) — more original phrasing, good for creative tasks.
- Low toxicity (0.0118) and reasonable diversity (0.9502).
- Computationally lightest and fastest training out of all the methods.

#### **Weaknesses:**

- Significant drop in lexical metrics — BLEU (0.1611), ROUGE-1 (0.6222), ROUGE-L (0.6210) — meaning outputs differ greatly from references in wording.
- High repetition rate (0.0497) compared to others — risk of redundant phrases.
- Lower semantic alignment — CoSIM (0.7515) and BERTScore F1 (0.9155) much weaker.

## Key Observations


1.   Full Tuning → Best for balanced performance, especially if computational resources are not a concern. Excels in fluency and semantic accuracy, but novelty is low.
2.   LoRA → Nearly identical to full tuning at much lower compute cost; a sweet spot if memory is limited but quality needs to stay high.
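3.   QLoRA → Lighter on memory than Full Tuning and LoRA, matches or beats LoRA in semantic accuracy and readability, but tends to be less novel (more literal). Generation over the validation set was slower in this run (2.51 s/it versus 1.15 s/it for Full Tuning).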
4.   Prefix-Tuning → Computationally fastest, most novel and diverse in expression, but weakest in fluency and semantic faithfulness. Better for creative paraphrasing, worse for exact factual reproduction.


## Trade-offs

1. Fluency vs. Diversity → Full tuning and LoRA prioritize fluency over novelty, while Prefix-Tuning trades fluency for novelty.
2. Literal Accuracy vs. Creativity → QLoRA is most literal; Prefix-Tuning is most creative.
3. Toxicity Control → All methods produce low toxicity, but QLoRA and Full Tuning are slightly better at keeping it minimal.
4. Resource Efficiency vs. Output Quality → LoRA and QLoRA provide near–full tuning quality at much lower compute/memory costs.