### What is Fine-Tuning?

Fine-tuning is the process of taking a **pre-trained model** (already trained on a massive corpus)
and adapting it to a **specific task or domain**.

- **Full Fine-Tuning:** Update **all model parameters** (very compute-heavy, large storage).
- **Parameter-Efficient Fine-Tuning (PEFT):** Only update **a few extra parameters** while keeping the base model frozen.

#### Why PEFT?
- Trains only a **small fraction of the parameters** (often around 1–5%, sometimes less) instead of 100%.
- Faster, cheaper, and Colab-friendly.
- Popular method: **LoRA (Low-Rank Adaptation)** – injects small trainable matrices into model layers.

We'll use the **Hugging Face PEFT library** to implement LoRA in just a few lines.

### What is LoRA (Low-Rank Adaptation)?

LoRA is a **parameter-efficient fine-tuning (PEFT)** method that lets us adapt a large language model
to a new task **without modifying most of its original weights**.

---

#### **Analogy – Adding Sticky Notes to a Book**
- Imagine you have a **500-page textbook (pre-trained model)**.
- Instead of rewriting the entire book (full fine-tuning), you just **add sticky notes on key pages**
  to highlight changes or corrections (LoRA fine-tuning).
- The original book remains **unchanged**, but when you read it with the sticky notes,
  you get the **updated knowledge**.

---

#### **How Does LoRA Work?**
- Transformer attention layers contain large **weight matrices (W)**.
- LoRA adds **two small matrices (A & B)** whose product has rank at most `r`, with `r` much smaller than the dimensions of `W`.
- During fine-tuning:
  - The original weights stay **frozen**.
  - Only **A & B are trained**.
  - Effective weight = `W + A×B` (the PEFT library additionally scales `A×B` by `lora_alpha / r`).
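
The update rule above can be sketched with NumPy. This is a minimal illustration, not how the PEFT library implements it: the sizes (`d_model = 768`, `r = 8`) mirror the DistilBERT setup used later, the random values are placeholders, and zero-initializing one factor so that training starts from the unmodified `W` is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, r = 768, 8  # illustrative sizes (assumption: one square attention projection)

W = rng.normal(size=(d_model, d_model))         # pre-trained weight, kept frozen
A = rng.normal(scale=0.01, size=(d_model, r))   # small trainable factor
B = np.zeros((r, d_model))                      # trainable factor, zero at the start

W_eff = W + A @ B  # effective weight; equals W until B is updated by training

full_params = W.size
lora_params = A.size + B.size
print(f"full: {full_params:,}  LoRA: {lora_params:,}  ratio: {lora_params / full_params:.2%}")
# full: 589,824  LoRA: 12,288  ratio: 2.08%
```

For a single 768×768 matrix, the two factors hold about 2% of the parameters of the full matrix, which is where the savings come from.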

---

#### **Benefits of LoRA**
- **Lightweight:** Only a few million parameters trained instead of billions.
- **Composable:** Multiple LoRA adapters can be merged or switched on demand.
- **Fast:** Works even on CPU or free-tier Colab with small models.
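
The "composable" point can be illustrated with a small NumPy sketch. The adapter names (`"sentiment"`, `"topic"`), the tiny sizes, and the `forward` helper are hypothetical and chosen only for this example; real adapters would be trained rather than random.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 16, 2  # tiny illustrative sizes

W = rng.normal(size=(d, d))  # shared frozen base weight

# Two hypothetical task adapters, each a pair of low-rank factors (A, B)
adapters = {
    "sentiment": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "topic": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}

def forward(x, adapter_name=None):
    """Apply the base weight, plus the chosen adapter if one is given."""
    out = x @ W
    if adapter_name is not None:
        A, B = adapters[adapter_name]
        out = out + (x @ A) @ B  # low-rank path; the d x d update is never materialized
    return out

x = rng.normal(size=(3, d))

# Switching: same base weights, different behaviour per adapter
y_sentiment = forward(x, "sentiment")
y_topic = forward(x, "topic")

# Merging: fold one adapter into the weight so inference needs a single matmul
A, B = adapters["sentiment"]
W_merged = W + A @ B
print(np.allclose(x @ W_merged, y_sentiment))  # True
```

Switching keeps one copy of the base model in memory and swaps only the small factors, while merging removes the extra matrix multiplications at inference time.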

### Environment Setup for LoRA Fine-Tuning

We'll need:
- **transformers**: to load pre-trained models and tokenizers.
- **peft**: the Hugging Face library that implements LoRA and other parameter-efficient methods.
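- **datasets**: to build and preprocess the toy dataset.
- **accelerate**: required by `Trainer` for PyTorch training.
- **scikit-learn**: used later for the train/validation split and the metrics (preinstalled on Colab).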

In [None]:
!pip install -q transformers peft datasets accelerate

### Creating a Tiny Dataset for Fine-Tuning

We'll create a **toy sentiment classification dataset** with only two labels:
- **Positive (1)**
- **Negative (0)**

This is intentionally tiny (just a few samples) to:
- Keep training fast (works on free-tier Colab).
- Focus on understanding **LoRA/PEFT workflow**, not achieving high accuracy.

In [None]:
from datasets import Dataset

# Few-shot dataset
texts = [
    "I love this product! It's amazing.",      # Positive
    "Absolutely fantastic service.",           # Positive
    "This is the worst experience ever.",      # Negative
    "I hate how slow this is.",                # Negative
    "Pretty good, I might recommend it.",      # Positive
    "Terrible! Would not buy again."           # Negative
]

labels = [1, 1, 0, 0, 1, 0]

# Create a Hugging Face Dataset object
tiny_dataset = Dataset.from_dict({"text": texts, "label": labels})

# Quick preview
tiny_dataset

### Tokenization & Preprocessing

Transformers cannot directly read raw text — they need:
- **input_ids**: numerical IDs for each token.
- **attention_mask**: tells the model which tokens are padding (0) vs real (1).
- **labels**: our target values (0 or 1 for sentiment).

In [None]:
from transformers import AutoTokenizer

# Load tokenizer for distilbert-base-uncased
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenization function
def tokenize_batch(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=32)

# Apply tokenization
tokenized_dataset = tiny_dataset.map(tokenize_batch, batched=True)

# Keep only the necessary columns
tokenized_dataset = tokenized_dataset.remove_columns(["text"])

# Rename label to labels (expected by Trainer)
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")

tokenized_dataset.set_format("torch")

tokenized_dataset

How the above code works:
- First, we load the vocabulary and tokenization rules for `distilbert-base-uncased`.
- The `tokenize_batch` function takes a batch of text strings (`batch["text"]`) and applies:
  - Padding: `padding="max_length"` pads every sentence to exactly 32 tokens.
  - Truncation: a sentence longer than 32 tokens is cut off at 32.
  - Returns: `input_ids` and `attention_mask`.
  - Example:
    - Input: "I love this product!"
    - Output:
      - input_ids: [101, 1045, 2293, 2023, 4031, 999, 102, 0, 0, ...]
      - attention_mask: [1, 1, 1, 1, 1, 1, 1, 0, 0, ...]
- Applying tokenization to the dataset:
  - `map()` applies our function to every entry in the dataset.
  - `batched=True` means it processes multiple samples at once (faster).
- We remove the raw `text` column since the model won't use it.
- Renaming labels: Hugging Face `Trainer` expects the label column to be named `labels`, not `label`.
- `set_format("torch")` makes the dataset return PyTorch tensors instead of Python lists.
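
To make the `batched=True` behaviour concrete, here is a plain-Python sketch of how a batched map hands data to the function: as a dictionary of columns, each holding a list of values, rather than one example at a time. `toy_batched_map` and `fake_tokenize` are hypothetical stand-ins written for this illustration and are not part of the `datasets` API.

```python
def toy_batched_map(columns, fn, batch_size=2):
    """Apply `fn` to slices of a dict-of-lists and add the columns it returns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of column name -> list of values for these rows
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}

def fake_tokenize(batch):
    # Stand-in for a tokenizer: returns one word count per text in the batch
    return {"n_words": [len(text.split()) for text in batch["text"]]}

data = {
    "text": ["I love this product!", "Terrible! Would not buy again.", "Pretty good"],
    "label": [1, 0, 1],
}
mapped = toy_batched_map(data, fake_tokenize)
print(mapped["n_words"])  # [4, 5, 2]
```

This is why `tokenize_batch` indexes `batch["text"]` and passes the whole list to the tokenizer in one call.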

---

Mini Example Walkthrough:

- Take this one sample:
  - Text: "I love this product!"
  - Label: 1
- After tokenization (max_length=10 for simplicity):
  - input_ids:      [101, 1045, 2293, 2023, 4031, 999, 102, 0, 0, 0]
  - attention_mask: [1,   1,    1,    1,    1,   1,   1,  0, 0, 0]
  - labels:         1

Now the model has a numerical format it can work with.
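
The padding and masking step in this walkthrough can be reproduced in a few lines of plain Python. `pad_and_mask` is a hypothetical helper for illustration only; it starts from the token IDs shown above and simplifies truncation by cutting off the tail, whereas a real tokenizer also keeps the closing `[SEP]` token.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or pad `token_ids` to `max_length` and build the matching attention mask."""
    ids = list(token_ids[:max_length])          # simplified truncation: drop the tail
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    input_ids = ids + [pad_id] * n_pad
    return input_ids, attention_mask

token_ids = [101, 1045, 2293, 2023, 4031, 999, 102]  # "I love this product!"
input_ids, attention_mask = pad_and_mask(token_ids, max_length=10)
print(input_ids)       # [101, 1045, 2293, 2023, 4031, 999, 102, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```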

For a more detailed explanation, watch these videos:
- https://www.youtube.com/watch?v=KEv-F5UkhxU
- https://www.youtube.com/watch?v=t1caDsMzWBk



### Loading the Base Model

We'll use:
- `AutoModelForSequenceClassification`: loads the pre-trained DistilBERT encoder
  and adds a newly initialized classification head suitable for binary classification.
- Since we're doing LoRA, the **main transformer weights will stay frozen**
  and only LoRA adapter layers will be trained.

In [None]:
from transformers import AutoModelForSequenceClassification

# Load pre-trained DistilBERT for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2  # Binary classification: Positive / Negative
)

# Quick check of model architecture
print(model)

### Attaching LoRA Adapters

- **LoRA (Low-Rank Adaptation)** adds small trainable matrices inside the
  attention layers (here, in the `q_lin` and `v_lin` projections of DistilBERT). The three attention projections are:
  - Query (Q): `q_lin` produces the vector each token uses to query the other tokens.
  - Key (K): `k_lin` produces the vector that determines how relevant a token is to a query. It gets no adapter in this setup.
  - Value (V): `v_lin` produces the vector holding the information to be extracted from the sequence.
- The **original weights remain frozen**, while LoRA layers learn the task.
- This makes fine-tuning *fast, cheap, and memory-efficient*.

Key parameters:
- `r`: Rank of LoRA matrices (controls size/complexity).
- `lora_alpha`: Scaling factor. The LoRA update is multiplied by `lora_alpha / r`, so it influences the effective learning rate of the adapters.
- `target_modules`: Layers where LoRA will be applied (`q_lin`, `v_lin` for DistilBERT attention).
- `lora_dropout`: Dropout applied within LoRA.
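
As a rough sanity check on these parameters, the arithmetic below estimates the number of LoRA parameters and the scaling factor for the configuration used in this section. It assumes DistilBERT-base dimensions (hidden size 768, 6 transformer layers); `lora_param_count` is a hypothetical helper, not a PEFT function. `print_trainable_parameters()` will report a larger number, since the classification head is trainable as well.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in one LoRA pair: a (d_in x r) factor plus an (r x d_out) factor."""
    return d_in * r + r * d_out

r, lora_alpha = 8, 32
hidden_size, num_layers = 768, 6        # DistilBERT-base dimensions
target_modules = ["q_lin", "v_lin"]

per_module = lora_param_count(hidden_size, hidden_size, r)
total_lora = per_module * len(target_modules) * num_layers
scaling = lora_alpha / r                # factor applied to the low-rank update

print(per_module)  # 12288
print(total_lora)  # 147456
print(scaling)     # 4.0
```

Doubling `r` doubles the adapter size, and with `lora_alpha` fixed it also halves the scaling factor, so the two are usually tuned together.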

In [None]:
from peft import LoraConfig, get_peft_model

# Define LoRA configuration
lora_config = LoraConfig(
    r=8,                      # Rank of the low-rank adapters
    lora_alpha=32,            # Scaling factor for adaptation
    target_modules=["q_lin", "v_lin"],  # Target the attention projection layers
    lora_dropout=0.1,         # Dropout within LoRA layers
    bias="none",              # Don't modify biases
    task_type="SEQ_CLS"       # Task type: Sequence Classification
)

# Apply LoRA to the DistilBERT model
model = get_peft_model(model, lora_config)

# Print number of trainable parameters
model.print_trainable_parameters()


### Trainer Setup for LoRA Fine-Tuning

We'll use Hugging Face's `Trainer` API:
- Handles training loop, evaluation, and saving the model automatically.
- Uses our tokenized dataset, LoRA-enabled model, and standard training arguments.

Key components:
- **TrainingArguments**: controls epochs, learning rate, logging, saving.
- **Trainer**: handles forward pass, backpropagation, optimizer, evaluation.
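
For intuition about how much work `Trainer` does with the settings used in this section (8 examples, 20% held out, batch size 8, 3 epochs, `logging_steps=10`), the schedule can be worked out by hand. This sketch assumes a single device and no gradient accumulation.

```python
import math

n_samples = 8
test_size = 0.2
batch_size = 8
num_epochs = 3
logging_steps = 10

n_val = math.ceil(test_size * n_samples)   # scikit-learn rounds the test split up
n_train = n_samples - n_val
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs

print(n_train, n_val)                 # 6 2
print(steps_per_epoch, total_steps)   # 1 3
print(total_steps < logging_steps)    # True
```

With only 3 optimizer steps in total, the run never reaches `logging_steps=10`, so the per-epoch evaluation results are the main signal to watch on a dataset this small.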

In [None]:
# A slightly larger toy dataset (8 samples); this overwrites the earlier `texts` and `labels`
texts = [
    "I love this product, it works great!",
    "Absolutely fantastic experience, very satisfied.",
    "Worst purchase ever, totally disappointed.",
    "I will never buy this again.",
    "Amazing quality and fast delivery.",
    "Horrible support, waste of money.",
    "Great value for the price.",
    "Terrible, completely useless."
]

labels = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = Positive, 0 = Negative

# Tokenize the dataset
tokenized_data = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"  # Keep tensors for PyTorch
)

# Add labels to the tokenized data
tokenized_data["labels"] = labels

In [None]:
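# `eval_strategy` (used below) needs a recent transformers release; restart the runtime after upgrading
!pip install --upgrade transformers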

In [None]:
from transformers import TrainingArguments, Trainer
import numpy as np
from datasets import Dataset
from sklearn.model_selection import train_test_split

# `tokenized_data` is a dict-like object containing input_ids, attention_mask, and labels.
# Split it into train and validation sets; stratifying keeps both classes
# represented in the 2-sample validation set.
train_data, val_data = train_test_split(
    list(zip(tokenized_data["input_ids"],
             tokenized_data["attention_mask"],
             tokenized_data["labels"])),
    test_size=0.2,
    random_state=42,
    stratify=tokenized_data["labels"],
)

train_dataset = Dataset.from_dict({
    "input_ids": [x[0] for x in train_data],
    "attention_mask": [x[1] for x in train_data],
    "labels": [x[2] for x in train_data]
})

val_dataset = Dataset.from_dict({
    "input_ids": [x[0] for x in val_data],
    "attention_mask": [x[1] for x in val_data],
    "labels": [x[2] for x in val_data]
})

# Define training arguments
training_args = TrainingArguments(
    output_dir="./lora_finetune_output",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-4,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="none"  # Disable wandb or other integrations for now
)

# Define a compute_metrics function
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds)
    return {"accuracy": acc, "f1": f1}

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

# Start fine-tuning
trainer.train()


### Inference on New Text After Fine-Tuning

After fine-tuning, we want to check:
- How well the model generalizes to unseen data.
- Whether it correctly predicts the sentiment for new sentences.

We'll:
1. Tokenize new unseen sentences.
2. Run them through the fine-tuned model.
3. Interpret the predicted labels.

In [None]:
import torch

# Example unseen sentences
new_sentences = [
    "This product exceeded my expectations!",
    "It was a total waste of money.",
    "Delivery was quick but the quality is average."
]

# Tokenize new data
new_inputs = tokenizer(
    new_sentences,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Move inputs to the same device as the model (it may be on a GPU after training)
new_inputs = {k: v.to(model.device) for k, v in new_inputs.items()}

# Perform inference
model.eval()
with torch.no_grad():
    outputs = model(**new_inputs)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1).cpu().numpy()

# Interpret predictions
label_map = {0: "Negative", 1: "Positive"}
for sentence, pred in zip(new_sentences, predictions):
    print(f"Text: {sentence}")
    print(f"Predicted Sentiment: {label_map[pred]}")
    print("---")

What’s Happening Here?

- Tokenization: Converts each new sentence into token IDs and masks.

- Model Inference: Runs forward pass through DistilBERT + LoRA adapters.

- Prediction: Uses torch.argmax() to pick the most likely label.

- Interpretation: Maps the numeric label to "Positive" or "Negative".
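
The prediction and interpretation steps can be tried in isolation with NumPy. The logits below are hypothetical values made up for this sketch, not outputs of the fine-tuned model; the softmax is included only to show how a confidence score could be derived.

```python
import numpy as np

# Hypothetical logits for three sentences (columns: Negative, Positive)
logits = np.array([[-1.2, 2.3],
                   [1.8, -0.7],
                   [0.1, 0.3]])

# Softmax turns logits into probabilities; subtracting the row max keeps exp() stable
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

predictions = probs.argmax(axis=1)  # same result as argmax over the raw logits
label_map = {0: "Negative", 1: "Positive"}
for p, row in zip(predictions, probs):
    print(label_map[int(p)], f"(confidence {row[p]:.2f})")
```

The third row shows why the confidence is worth printing: the two logits are close, so the "Positive" label there is only a weak preference.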

#### Parameter-Wise Inside DistilBERT

- Attention layers have query (Wq) and value (Wv) projection matrices. Each is a large d_model × d_model weight matrix (768×768 in DistilBERT).
- LoRA adds two small matrices (A & B) per adapted layer:
  
  W' = W + ΔW

  ΔW = A × B   # Low-rank decomposition
  - A: size [d_model, r]
  - B: size [r, d_model]
  - Rank r is small (e.g., 8), so trainable params ≪ original: 2 × 768 × 8 = 12,288 per adapted matrix versus 768 × 768 = 589,824.

- What is trained and what is not:
  - The original DistilBERT weights remain frozen, so embeddings and FFN weights receive no updates.
  - The classification head (ending in Linear(768 → 2)) is also trainable, because it is newly initialized and has to learn the downstream task.
- In short: `peft.get_peft_model()` injected LoRA layers into the DistilBERT attention blocks and froze all original weights, so we trained only the LoRA adapters and the classifier head.




## **Summary**
### **What We Learned**
- Fine-tuning Large Language Models (LLMs) can be **expensive and resource-heavy** if we update all parameters.
- **LoRA (Low-Rank Adaptation)** is a parameter-efficient fine-tuning (PEFT) technique that:
  - Freezes original model weights.
  - Adds small trainable matrices (low-rank adapters) to attention layers.
  - Drastically reduces the number of trainable parameters while maintaining good performance.

### **How We Did It**
1. **Prepared a Toy Sentiment Dataset**
   - Used simple labeled sentences: Positive/Negative sentiment.
   - Tokenized using `AutoTokenizer` with `padding`, `truncation`, and `max_length`.

2. **Loaded Base Model**
   - `DistilBERT (distilbert-base-uncased)` with a sequence classification head.
   - Observed that classification layers were newly initialized.

3. **Attached LoRA Adapters**
   - Applied using `peft.get_peft_model()` on attention layers (`q_lin` and `v_lin`, the query and value projections).
   - Only LoRA parameters and classifier head were made trainable.

4. **Fine-Tuned with Trainer**
   - Split dataset into training and validation sets (`test_size=0.2`, which gives 6 training and 2 validation samples out of 8).
   - Used `TrainingArguments` with:
     - `eval_strategy="epoch"`
     - `learning_rate=5e-4`
     - `num_train_epochs=3`
     - `load_best_model_at_end=True`
   - Logged **accuracy** and **F1-score**.

5. **Inference on New Text**
   - Tokenized unseen sentences.
   - Predicted sentiment (`Positive` or `Negative`) using fine-tuned model.
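
The accuracy and F1 values logged in step 4 come from scikit-learn, but both can be reproduced by hand. The sketch below uses hypothetical labels and predictions; `accuracy` and `f1_binary` are illustrative helpers that return 0.0 for F1 when there are no true positives.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def f1_binary(labels, preds, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(l == positive and p == positive for l, p in zip(labels, preds))
    fp = sum(l != positive and p == positive for l, p in zip(labels, preds))
    fn = sum(l == positive and p != positive for l, p in zip(labels, preds))
    if tp == 0:
        return 0.0  # precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

labels = [1, 0, 1, 0]  # hypothetical validation labels
preds = [1, 0, 0, 0]   # hypothetical model predictions

print(accuracy(labels, preds))             # 0.75
print(round(f1_binary(labels, preds), 3))  # 0.667
```

On a validation set of two samples, each metric can only take a handful of values, so these numbers are a workflow check rather than a measure of model quality.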

### **Key Concepts & Parameters**
- **LoRA Rank (`r`)**: Controls the size of trainable matrices (small r = fewer parameters).
- **Freezing Base Weights**: Helps limit catastrophic forgetting and reduces memory usage during training.
- **Classification Head**: Newly initialized, so it has to be trained for the downstream task.

### **Modules Used**
- `transformers`: AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
- `peft`: LoRA integration
- `datasets`: Building the `Dataset` objects used for training and validation
- `torch`: Model inference and device management
- `sklearn`: Train/validation split, plus accuracy and F1 evaluation metrics

### **Why LoRA?**
- **Full fine-tuning**: Updates all parameters (~66M for DistilBERT).
- **LoRA fine-tuning**: Updates only a small fraction (~hundreds of thousands).
- Often achieves comparable accuracy while being faster and more lightweight.

### **Outcome**
- Successfully fine-tuned a DistilBERT model using LoRA adapters.
- Reduced compute requirements for training.
- Ran the model on unseen sentences to sanity-check its predictions (the toy dataset is too small to measure real accuracy).

---