# Fine-tuning of LLMs
(notebook for CPU)

--------------------

Prepared by Lukáš Bátrla, AI Researcher at Cisco Systems.

If you find this topic interesting and want to collaborate (e.g., on a thesis), feel free to email me at lbatrla@cisco.com or message me via [my LinkedIn profile](https://www.linkedin.com/in/lukas-batrla/).

--------------------

## The structure of today's lecture

1. Problem introduction
2. Data preparation and splitting
3. In-context learning
4. Full fine-tuning
5. Parameter efficient fine-tuning

# Phishing Email Classification using LLM
This notebook demonstrates how to use an LLM for phishing email classification. We start by introducing the dataset and then apply the model to it.
We then introduce several fine-tuning techniques that incrementally improve the LLM's ability to classify emails as phishing or benign.

<img src="images/email-phishing-example.jpg" width="700">

**Disclaimer:** The notes about the general fine-tuning methods in this notebook are inspired by concepts I learned in the [Generative AI with Large Language Models](https://www.deeplearning.ai/courses/generative-ai-with-llms/) course by DeepLearning.AI. However, the notes, the dataset, implementation, and model choices have been adapted for phishing detection.

## What You'll Learn:
1. Load and prepare the phishing email dataset with stratified splitting into train/validation/test
2. Compare **in-context learning** methods: zero-shot (the baseline), one-shot, and few-shot
3. Configure and run **full fine-tuning**
4. Configure and run **LoRA fine-tuning**

## Training Configuration (CPU-Optimized):
- **Model**: FLAN-T5 small - 77M parameters (full parameter training)
- **Device Options**: NVIDIA GPU (CUDA), Apple GPU (MPS), or CPU
- **Training Time**: ~1 hour on GPU, ~4-8 hours on CPU
- **Memory Requirements**: ~4-8GB RAM
- **CPU Optimizations**: Reduced dataset (20%), smaller batches, optimized evaluation subsets

## 0. Device Configuration

Training on GPU is significantly faster than CPU. The following code automatically detects and uses the best available device.

**Priority**: CUDA (NVIDIA) > MPS (Apple Silicon) > CPU

- **CUDA**: NVIDIA GPU acceleration - fastest option (~1-2 hours)
- **MPS**: Metal Performance Shaders for M1/M2/M3/M4 Macs (~2-4 hours)
- **CPU**: Fallback option (~4-8 hours with optimized settings for 20% dataset)

**Note**: This notebook uses CPU-optimized parameters including reduced dataset (20%), smaller batch sizes, and limited evaluation samples to make training practical on CPU.

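`USE_MPS` defaults to `False` in this CPU notebook, so Apple GPUs are skipped; set it to `True` to allow MPS. A CUDA GPU, if one is available, is used regardless of this flag.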

In [None]:
import torch

USE_MPS = False  # Set to True to allow Apple MPS; False skips MPS (CUDA, if available, is still preferred)

# Device detection
if torch.cuda.is_available():
    device_type = "cuda"
    print(f"Using device: CUDA (NVIDIA GPU) - {torch.cuda.get_device_name(0)}")
elif USE_MPS and torch.backends.mps.is_available():
    device_type = "mps"
    print("Using device: MPS (Apple Silicon GPU)")
else:
    device_type = "cpu"
    print("Using device: CPU (⚠️ full fine-tuning on the full dataset can take 100+ hours; this notebook trains on a 20% subset)")

device = torch.device(device_type)

## 1. Load Phishing Email Dataset

We will use the [phishing-email-dataset](https://huggingface.co/datasets/zefang-liu/phishing-email-dataset) from HuggingFace, which contains over 18,000 emails labeled as either "Safe Email" or "Phishing Email".

**Expected**: ~18,650 examples with roughly 60% safe emails and 40% phishing emails.

### 1.1. Normalize columns and labels
The dataset will be:
- Normalized with consistent column names (`text` and `label`)
- Augmented with a numeric `label_id` for stratified splitting (0=Safe, 1=Phishing)


In [None]:
from datasets import load_dataset, DatasetDict
import numpy as np
from collections import Counter

# Load dataset
dataset = load_dataset("zefang-liu/phishing-email-dataset")
ds = dataset["train"]

# Normalize column names
ds = ds.rename_columns({"Email Text": "text", "Email Type": "label"})

# Add numeric label for stratified splitting
label_names = ["Safe Email", "Phishing Email"]
label_to_id = {name: i for i, name in enumerate(label_names)}
ds = ds.map(lambda x: {"label_id": label_to_id[x["label"]]})

ds

### 1.2. Create Stratified Train/Validation/Test Splits

To ensure fair evaluation, we perform **stratified splitting** which maintains the same class distribution (Safe vs Phishing ratio) across train, validation, and test sets.

**Split ratios:**
- Training: 80% (~14,900 examples)
- Validation: 10% (~1,860 examples)  
- Test: 10% (~1,860 examples)

This prevents training on an imbalanced split that could bias the model.
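
To make the effect of stratification concrete, here is a minimal, self-contained sketch on toy labels (not the real dataset). The helper names `toy_stratified_split` and `phishing_share` are hypothetical and exist only for this illustration; the notebook's own `stratified_split` operates on a Hugging Face `Dataset` instead of a plain list.

```python
import random

def toy_stratified_split(labels, train_size=0.8, val_size=0.1, seed=42):
    """Split indices class by class so every split keeps the class ratio."""
    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for label_value in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == label_value]
        rng.shuffle(idx)
        n_train = int(train_size * len(idx))
        n_val = int(val_size * len(idx))
        splits["train"] += idx[:n_train]
        splits["validation"] += idx[n_train:n_train + n_val]
        splits["test"] += idx[n_train + n_val:]
    return splits

def phishing_share(labels, indices):
    """Fraction of the selected examples whose label is 1 (phishing)."""
    return sum(labels[i] for i in indices) / len(indices)

# Toy data: 600 safe (0) and 400 phishing (1) labels, i.e. a 60/40 mix.
toy_labels = [0] * 600 + [1] * 400
toy_splits = toy_stratified_split(toy_labels)

for name, indices in toy_splits.items():
    share = phishing_share(toy_labels, indices)
    print(f"{name}: {len(indices)} examples, phishing share = {share:.2f}")
```

Because each class is split separately, every split ends up with the same 60/40 mix as the toy data as a whole.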

In [None]:
def stratified_split(dataset, label_column, train_size=0.8, val_size=0.1, seed=42):
    """Stratified split maintaining class distribution across all splits."""
    labels = np.array(dataset[label_column])
    rng = np.random.default_rng(seed)
    idx_train, idx_val, idx_test = [], [], []

    for label_value in np.unique(labels):
        idx = np.where(labels == label_value)[0]
        rng.shuffle(idx)
        n = len(idx)
        n_train = int(train_size * n)
        n_val = int(val_size * n)
        idx_train.extend(idx[:n_train])
        idx_val.extend(idx[n_train:n_train+n_val])
        idx_test.extend(idx[n_train+n_val:])

    rng.shuffle(idx_train)
    rng.shuffle(idx_val)
    rng.shuffle(idx_test)

    return DatasetDict({
        "train": dataset.select(idx_train),
        "validation": dataset.select(idx_val),
        "test": dataset.select(idx_test)
    })

ds_splits = stratified_split(ds, "label_id", train_size=0.8, val_size=0.1)
ds_splits

## 2. In-Context Learning: Zero-Shot, One-Shot, and Few-Shot methods
- In-context learning refers to a model's ability to solve a task by reasoning over examples given directly within the prompt, rather than through weight updates or fine-tuning.
- Before fine-tuning, we assess the model's **in-context learning** performance — seeing how well it can classify emails when shown examples in the prompt alone.

<img src="images/in-context-learning-diagram.jpg">

Why use in-context learning?
- Improves model performance on new tasks without requiring additional model training or parameter updates (saving time and compute)
- Enables flexible, task-specific adaptation even when labeled data is limited or unavailable for fine-tuning
- Supports rapid prototyping and testing of prompts before committing to more resource-intensive fine-tuning
- Useful in scenarios where model weights cannot be modified (e.g., API usage, deployment constraints)

We test three approaches:
- **Zero-Shot**: No examples (pure baseline)
- **One-Shot**: 1 example in the prompt  
- **Few-Shot**: Several examples in the prompt (the code below collects 2 safe and 2 phishing examples, then uses only the first 2 to keep the T5 prompt short)

This comparison helps us understand:
1. The model's baseline capability on our task
2. How much improvement we can get "for free" with just prompting
3. The value of fine-tuning vs. in-context learning

**Note**: Depending on the task, a few-shot (or even one-shot) prompt can easily fill the model's whole context window.

**Note**: FLAN-T5 is instruction-tuned, so it performs reasonably well even in zero-shot mode. This makes it much more suitable for fine-tuning on classification tasks compared to base models like GPT-2.
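
The context-window concern above can be illustrated with a small sketch that adds examples to a prompt only while a size budget allows. This is an assumption-laden illustration: `build_few_shot_prompt`, `max_chars`, and `snippet_chars` are hypothetical names, and characters are used as a crude stand-in for tokens. A real budget should be measured in tokens with the model's tokenizer.

```python
def build_few_shot_prompt(examples, test_text, max_chars=1500, snippet_chars=150):
    """Assemble a few-shot prompt, adding examples only while the budget allows."""
    header = "Classify as 'Safe Email' or 'Phishing Email'.\n\nExamples:\n"
    query = f"Now classify:\nEmail: {test_text[:snippet_chars]}\nAnswer:"
    budget = max_chars - len(header) - len(query)

    blocks = []
    for ex in examples:
        block = f"Email: {ex['text'][:snippet_chars]}\nAnswer: {ex['label']}\n\n"
        if len(block) > budget:
            break  # the next example would overflow the character budget
        blocks.append(block)
        budget -= len(block)
    return header + "".join(blocks) + query, len(blocks)

# Made-up emails, for illustration only
toy_examples = [
    {"text": "Meeting moved to 3pm, see updated invite. " * 10, "label": "Safe Email"},
    {"text": "Your account is locked! Verify your password now. " * 10, "label": "Phishing Email"},
    {"text": "Quarterly report attached for review. " * 10, "label": "Safe Email"},
    {"text": "You won a prize, click here to claim. " * 10, "label": "Phishing Email"},
]

prompt, used = build_few_shot_prompt(
    toy_examples, "Please confirm your bank details.", max_chars=500
)
print(f"{used} of {len(toy_examples)} examples fit; prompt length = {len(prompt)} characters")
```

With a tight budget only some of the examples fit, which is the trade-off between example count and snippet length that the classification functions in the next cell resolve by hard-coding short snippets.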

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Choose model based on available memory
# model_id = "google/flan-t5-base"  # 250M params
model_id = "google/flan-t5-small"  # 77M params (current) - Seq2Seq, instruction-tuned

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="cpu")

# Create output directory name based on model
model_name_clean = model_id.split('/')[-1].lower().replace('_', '-')
output_full_finetuning_dir = f"./{model_name_clean}_finetuned_phishing"


# ============================================================
# ZERO-SHOT: No Examples
# ============================================================

def classify_zero_shot(text):
    """Zero-shot classification with simple prompt."""
    # Truncate email to fit context better
    text_snippet = text[:400]
    
    prompt = f"Classify this email as 'Safe Email' or 'Phishing Email': {text_snippet}"
    
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model.generate(
            **inputs, 
            max_new_tokens=10,
            do_sample=False
        )
    
    ans = tokenizer.decode(out[0], skip_special_tokens=True)
    prediction = ans.strip()
    return prediction

# ============================================================
# ONE-SHOT: 1 Example
# ============================================================

def classify_one_shot(text, example_text, example_label):
    """
    One-shot classification: provide ONE example in the prompt.
    """
    # Truncate texts
    ex_snippet = example_text[:150]
    test_snippet = text[:250]
    
    prompt = "Classify as 'Safe Email' or 'Phishing Email'.\n\nExample:\n"
    prompt += f"Email: {ex_snippet}\nAnswer: {example_label}\n\n"
    prompt += f"Now classify:\nEmail: {test_snippet}\nAnswer:"
    
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model.generate(
            **inputs, 
            max_new_tokens=10,
            do_sample=False
        )
    
    ans = tokenizer.decode(out[0], skip_special_tokens=True)
    prediction = ans.strip()
    return prediction

# ============================================================
# FEW-SHOT: Multiple Examples
# ============================================================

def classify_few_shot(text, examples):
    """
    Few-shot classification: provide multiple examples.
    """
    test_snippet = text[:150]  # slicing already handles texts shorter than 150 characters
    
    prompt = "Classify as 'Safe Email' or 'Phishing Email'.\n\nExamples:\n"
    for ex in examples[:2]:  # Use fewer examples for T5
        ex_snippet = ex["text"][:150]
        prompt += f"Email: {ex_snippet}\nAnswer: {ex['label']}\n\n"
    prompt += f"Now classify:\nEmail: {test_snippet}\nAnswer:"
    
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model.generate(
            **inputs, 
            max_new_tokens=10,
            do_sample=False
        )
    
    ans = tokenizer.decode(out[0], skip_special_tokens=True)
    prediction = ans.strip()
    return prediction

# Prepare examples for one-shot and few-shot
safe_example = None
phishing_example = None
few_shot_examples = []
safe_count = 0
phishing_count = 0

for item in ds_splits["train"]:
    if item["label"] == "Safe Email":
        if safe_example is None:
            safe_example = {"text": item["text"], "label": item["label"]}
        if safe_count < 2:
            few_shot_examples.append({"text": item["text"], "label": item["label"]})
            safe_count += 1
    elif item["label"] == "Phishing Email":
        if phishing_example is None:
            phishing_example = {"text": item["text"], "label": item["label"]}
        if phishing_count < 2:
            few_shot_examples.append({"text": item["text"], "label": item["label"]})
            phishing_count += 1
    if safe_example and phishing_example and safe_count >= 2 and phishing_count >= 2:
        break

In [None]:
# Test all three approaches on the same example
test_sample = ds_splits["test"][3]

# Run predictions
# To normalize the predictions, use the normalize_prediction function
zero_shot_pred = classify_zero_shot(test_sample["text"])
one_shot_pred = classify_one_shot(test_sample["text"], phishing_example["text"], phishing_example["label"])
few_shot_pred = classify_few_shot(test_sample["text"], few_shot_examples)

# Display results
dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'TEST EMAIL: {test_sample["text"][:400]}...')
print(f'ACTUAL LABEL: {test_sample["label"]}')
print(dash_line)
print(f'ZERO-SHOT:  {zero_shot_pred}')
print(f'ONE-SHOT:   {one_shot_pred}')
print(f'FEW-SHOT:   {few_shot_pred}')
print(dash_line)

In [None]:
from tqdm import tqdm
from sklearn.metrics import accuracy_score, classification_report
import gc

# Evaluate on full test set or subsample to save time
test_subsample = ds_splits["test"]#.shuffle(seed=42).select(range(10))

print(f'Evaluating zero-shot, one-shot, and few-shot prompting on {len(test_subsample)} test examples...')
print('This may take a few minutes...\n')

zero_shot_predictions = []
one_shot_predictions = []
few_shot_predictions = []
true_labels = []

# Process with periodic memory cleanup
for i, example in enumerate(tqdm(test_subsample, desc="Testing")):
    zero_shot_predictions.append(classify_zero_shot(example["text"]))
    one_shot_predictions.append(classify_one_shot(example["text"], phishing_example["text"], phishing_example["label"]))
    few_shot_predictions.append(classify_few_shot(example["text"], few_shot_examples))
    true_labels.append(example["label"])
    
    # Clear cache every 50 examples
    if (i + 1) % 50 == 0:
        if device.type == "cuda":
            torch.cuda.empty_cache()
        elif device.type == "mps":
            torch.mps.empty_cache()
        gc.collect()

# Final cleanup
if device.type == "cuda":
    torch.cuda.empty_cache()
elif device.type == "mps":
    torch.mps.empty_cache()
gc.collect()

# Compute metrics
zero_shot_accuracy = accuracy_score(true_labels, zero_shot_predictions)
one_shot_accuracy = accuracy_score(true_labels, one_shot_predictions)
few_shot_accuracy = accuracy_score(true_labels, few_shot_predictions)

# To avoid UndefinedMetricWarning for precision/recall on labels with no predicted/true samples, set zero_division=0
classification_report_kwargs = {"zero_division": 0}

# Display results for zero-shot model
print('\n' + '='*60)
print('ZERO-SHOT MODEL EVALUATION RESULTS')
print('='*60)
print(f'Test set size: {len(test_subsample)} examples')
print(f'\nOverall Accuracy: {zero_shot_accuracy:.3f} ({zero_shot_accuracy*100:.1f}%)')

print(f'\nClassification Report:')
print(classification_report(true_labels, zero_shot_predictions, **classification_report_kwargs))
print('='*60)

# Display results for one-shot model
print('\n' + '='*60)
print('ONE-SHOT MODEL EVALUATION RESULTS')
print('='*60)
print(f'Test set size: {len(test_subsample)} examples')
print(f'\nOverall Accuracy: {one_shot_accuracy:.3f} ({one_shot_accuracy*100:.1f}%)')

print(f'\nClassification Report:')
print(classification_report(true_labels, one_shot_predictions, **classification_report_kwargs))
print('='*60)

# Display results for few-shot model
print('\n' + '='*60)
print('FEW-SHOT MODEL EVALUATION RESULTS')
print('='*60)
print(f'Test set size: {len(test_subsample)} examples')
print(f'\nOverall Accuracy: {few_shot_accuracy:.3f} ({few_shot_accuracy*100:.1f}%)')

print(f'\nClassification Report:')
print(classification_report(true_labels, few_shot_predictions, **classification_report_kwargs))
print('='*60)
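
The evaluation above compares raw generated strings against the label names, so an output such as `phishing` or `Safe email` counts as wrong even when the intent is clear. An earlier code comment mentions a `normalize_prediction` function whose body is not shown in this section; the version below is a hypothetical sketch of what such a helper could look like, not the author's implementation. The fallback to `"Safe Email"` for unrecognized outputs is an assumption and will influence the metrics.

```python
def normalize_prediction(raw_output, default="Safe Email"):
    """Map a raw generated string onto one of the two dataset label names."""
    text = raw_output.strip().lower()
    if "phishing" in text:
        return "Phishing Email"
    if "safe" in text:
        return "Safe Email"
    # Assumption: unrecognized outputs fall back to the default class
    return default

for raw in ["Phishing Email", "phishing", " Safe email ", "yes"]:
    print(repr(raw), "->", normalize_prediction(raw))
```

An alternative design is to map unrecognized outputs to a third value such as `"Unknown"` so that they are always counted as errors rather than silently assigned to a class.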

## 3. Full Fine-Tuning

Full fine-tuning is the process of updating **all model parameters** by training on task-specific data. Unlike in-context learning (which uses examples in prompts), fine-tuning modifies the model's weights to specialize it for your specific task.

<img src="images/finetuning-llm-diagram.jpg">

### What is Full Fine-Tuning?

- **Parameter Updates**: All layers of the model (embeddings, attention, feedforward networks) are trained
- **Task Specialization**: The model learns patterns specific to your task (phishing detection)
- **Persistent Learning**: Knowledge is stored in the model weights, not in the prompt
- **Resource Intensive**: Requires more compute and memory than in-context learning or parameter-efficient methods (like LoRA)

### Why Use Full Fine-Tuning?

**Advantages:**
- **Maximum Performance**: Achieves the highest accuracy by leveraging all model parameters
- **Efficient Inference**: No need for long prompts with examples (saves tokens and latency)
- **Task-Specific Expertise**: Model becomes specialized in your domain (e.g. email security)
- **Better Generalization**: Learns underlying patterns rather than memorizing prompt patterns
- **Smaller Context Footprint**: The prompt is not filled with few-shot examples, leaving the context window for the email itself

**Disadvantages:**
- **Computational Cost**: Requires GPU/TPU and significant training time (often tens or hundreds of hours)
- **Memory Requirements**: Must fit entire model + gradients + optimizer states in memory
  - **Storage**: Each checkpoint holds a full copy of the weights, roughly 4 bytes per parameter in fp32 or 2 bytes in fp16/bf16. That is about 300MB for FLAN-T5-small (77M parameters, fp32), about 14GB for a 7B model in fp16, and about 140GB for a 70B model in fp16.
- **Requires Labeled Data**: Need sufficient training examples (~10,000+ for best results)
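
The memory and storage points above can be turned into back-of-the-envelope arithmetic. The sketch below assumes fp32 training with an Adam-style optimizer that keeps two extra values per parameter; the function names are hypothetical, and activations and framework overhead are deliberately left out, so real usage will be higher.

```python
def checkpoint_size_gb(num_params, bytes_per_param=4):
    """Approximate size of the saved weights (4 bytes = fp32, 2 bytes = fp16/bf16)."""
    return num_params * bytes_per_param / 1e9

def training_memory_gb(num_params):
    """Rough fp32 training footprint with Adam, excluding activations.

    Per parameter: 4 bytes weights + 4 bytes gradients + 8 bytes Adam moments.
    """
    return num_params * (4 + 4 + 8) / 1e9

num_params = 77_000_000  # FLAN-T5-small, as quoted in this notebook
print(f"fp32 checkpoint:          ~{checkpoint_size_gb(num_params):.2f} GB")
print(f"training (no activations): ~{training_memory_gb(num_params):.2f} GB")
```

The same two functions can be called with larger parameter counts to see why full fine-tuning of multi-billion-parameter models quickly outgrows a single consumer GPU.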

### When to Choose Full Fine-Tuning vs In-Context Learning?

| **Scenario** | **Use In-Context Learning** | **Use Full Fine-Tuning** |
|--------------|----------------------------|--------------------------|
| **Data availability** | Few examples (<100) | Many examples (>1,000) |
| **Inference frequency** | Occasional queries | High-volume production |
| **Deployment** | API-based (OpenAI, etc.) | Self-hosted models |
| **Task complexity** | Simple classification | Complex reasoning |
| **Budget/Resources** | Limited compute | GPU access available |

### Full Fine-Tuning Process

In this section, we'll:
1. **Format data** (Section 3.1): Create prompt-target pairs for supervised learning
2. **Tokenize sequences** (Section 3.2): Convert text to token IDs with proper input/label separation
3. (Optional) **Reduce data size** (Section 3.3): When training a larger model or training on CPU, reduce the amount of data. Even with 1,000 samples, training can take several hours.
4. **Train the model** (Section 3.4): Set up batch size, learning rate, and epochs, then update all model parameters via gradient descent
5. **Evaluate results** (Section 3.5): Measure accuracy, precision, and recall on the test set

### Full Fine-Tuning vs LoRA (Parameter-Efficient Fine-Tuning)

| **Aspect** | **Full Fine-Tuning** | **LoRA** |
|-----------|---------------------|---------|
| **Parameters Updated** | 100% (all layers) | ~0.1-1% (low-rank adapters) |
| **Training Time** | 30-60 min (FLAN-T5-small) | 10-20 min |
| **Memory Usage** | High (8-16GB) | Low (2-4GB) |
| **Final Accuracy** | Highest (93-95%) | Slightly lower (91-93%) |
| **Model Size** | Full size (~300MB) | Base + adapters (~5MB extra) |
| **Best For** | Maximum performance | Fast iteration, limited resources |

In this section, we use **full fine-tuning** to aim for the best possible phishing detection accuracy; LoRA is covered in Section 4.


### 3.1. Format Data for Fine-Tuning
We create **prompt-target pairs** where:
- **Prompt**: The question + email text
- **Target**: The correct label

During fine-tuning, the model learns to generate the correct label when given the prompt. This differs from pre-training, where the model learns from general unlabeled text with a self-supervised objective (span corruption in the case of T5) rather than from labeled prompt-target pairs.

In [None]:
def format_example(example):
    """Convert each example to a prompt-target pair for training."""
    prompt = f"Classify this email as 'Safe Email' or 'Phishing Email': {example['text']}"
    target = example["label"]
    return {"prompt": prompt, "target": target}

ds_splits_ft = ds_splits.map(format_example)
ds_splits_ft

### 3.2. Tokenize Data

Tokenization converts text into numerical token IDs that the model can process. The approach for our Seq2Seq model:

1. Tokenize inputs (prompts) and outputs (labels) separately
2. Input max length: 512 tokens, Label max length: 32 tokens
3. Labels are the target sequences to generate
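
Tokenized prompts and labels have different lengths, so they must be padded before they can be stacked into a batch. The following pure-Python sketch shows the idea under stated assumptions: `pad_batch` is a hypothetical helper, the token IDs are made up, `0` is used as the input padding ID (the T5 pad token ID), and `-100` is used for label padding so that the loss ignores those positions. In the notebook itself this job is done by `DataCollatorForSeq2Seq`.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids, attention_mask and labels to the longest item in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        # -100 marks label positions that the loss should ignore
        batch["labels"].append(f["labels"] + [label_pad_id] * (max_lab - n_lab))
    return batch

# Made-up token IDs, for illustration only
features = [
    {"input_ids": [11, 12, 13, 14, 1], "labels": [21, 22, 1]},
    {"input_ids": [15, 16, 1], "labels": [23, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```

Padding per batch rather than to a fixed 512 tokens keeps short emails cheap to process, which matters on CPU.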

In [None]:
def tokenize(batch):
    """Tokenize prompts and targets; padding is left to the data collator."""
    model_inputs = tokenizer(
        batch["prompt"],
        truncation=True,
        max_length=512
    )
    
    # Tokenize labels
    labels = tokenizer(
        text_target=batch["target"],
        truncation=True,
        max_length=32
    )
    
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = ds_splits_ft.map(tokenize, batched=True, remove_columns=ds_splits_ft["train"].column_names)
tokenized

### 3.3. Reduce Dataset Size for CPU Efficiency

For CPU training, we use a reduced subset (20% of data) to make training practical (~4-8 hours instead of 100+ hours).

**CPU-Optimized**: This notebook uses the reduced dataset by default for faster training on CPU. For GPU training or maximum accuracy, you can use the full dataset by modifying Section 3.4.


In [None]:
# Create reduced subsets (20% of data) - CPU-optimized default
train_subset = tokenized["train"].select(range(len(tokenized["train"]) // 5))
val_subset = tokenized["validation"].select(range(len(tokenized["validation"]) // 5))

print(f'Full dataset: {len(tokenized["train"])} train, {len(tokenized["validation"])} val')
print(f'Reduced dataset (CPU-optimized): {len(train_subset)} train, {len(val_subset)} val (20%)')
print(f'\n✓ Using reduced dataset for training to optimize for CPU performance')
print(f'  Estimated training time on CPU: ~4-8 hours')
print(f'  For full dataset training on GPU, replace train_subset/val_subset with tokenized["train"]/tokenized["validation"] in Section 3.4')


### 3.4. Full Fine-Tuning (CPU-Optimized)

Now we train the model on our phishing detection dataset. This will update **all parameters** of the model (unlike LoRA which only updates small adapter matrices - we'll look at that approach later).

**CPU-Optimized Training Configuration:**
- **Dataset**: 20% of training data (~3,000 examples) for practical CPU training
- **Epochs**: 3 full passes through the training data
- **Batch size**: 1 per device with gradient accumulation (effective batch size = 4)
- **Learning rate**: 2e-5 (small to preserve pre-trained knowledge)
- **Gradient checkpointing**: Enabled to reduce memory usage
- **Evaluation**: Disabled during training to save time

**Training Time Estimates:**
- FLAN-T5-small on CPU: ~4-8 hours (20% dataset)

**Note**: This notebook uses reduced dataset by default. For full dataset training, replace `train_subset` with `tokenized["train"]` below.
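
The batch-size and epoch settings above determine how many optimizer updates the run performs. The sketch below works through that arithmetic with a hypothetical helper, `training_schedule`; the example count of 3,000 is the approximate subset size quoted above, and the Trainer's exact rounding of the last partial accumulation step may differ slightly.

```python
import math

def training_schedule(num_examples, per_device_batch_size, grad_accum_steps, num_epochs):
    """Return (effective batch size, approx. optimizer steps per epoch, total steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * num_epochs

# ~3,000 examples is the approximate 20% subset size quoted above
effective_batch, steps_per_epoch, total_steps = training_schedule(
    num_examples=3000, per_device_batch_size=1, grad_accum_steps=4, num_epochs=3
)
print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: ~{steps_per_epoch} per epoch, ~{total_steps} in total")
```

Gradient accumulation keeps peak memory at the level of a batch of 1 while the weights are updated as if the batch size were 4.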

In [None]:
import torch
import gc

# Clear memory. In case you are running fine-tuning again, this will clear the memory from the previous run.
if torch.cuda.is_available():
    torch.cuda.empty_cache()
if torch.backends.mps.is_available():
    torch.mps.empty_cache()
gc.collect()

In [None]:
# Move model to device
device = torch.device(device_type)
model = model.to(device)

In [None]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq

# CPU-Optimized Training Configuration
training_args = Seq2SeqTrainingArguments(
    output_dir=output_full_finetuning_dir,
    per_device_train_batch_size=1,  # Smaller batch size for CPU (1 for memory efficiency)
    per_device_eval_batch_size=1,  # Smaller batch size for evaluation
    gradient_accumulation_steps=4,  # Effective batch size = 1*4 = 4 (CPU-optimized)
    num_train_epochs=3,  # Number of full passes through the training data
    learning_rate=2e-5,  # Small learning rate to preserve pre-trained knowledge
    logging_steps=100,  # Log every 100 steps (less frequent for CPU)
    evaluation_strategy="no",  # Don't evaluate during training (saves significant time on CPU); newer transformers releases name this argument eval_strategy
    save_strategy="epoch",  # Save model checkpoint at end of each epoch
    gradient_checkpointing=True,  # Enable gradient checkpointing for memory efficiency
    use_cpu=(device_type == "cpu"),
    save_total_limit=1,  # Only keep the last checkpoint to save disk space
    save_only_model=True,  # Save only model weights in checkpoints (skip optimizer, scheduler, and RNG state)
    fp16=False,  # Disable fp16 for CPU compatibility
)

# Create data collator for automatic input padding
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Initialize trainer with REDUCED DATASET (CPU-optimized)
# For full dataset on GPU, replace train_subset/val_subset with tokenized["train"]/tokenized["validation"]
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_subset,  # CPU-optimized: using 20% of data
    eval_dataset=val_subset,  # CPU-optimized: using 20% of data
    data_collator=data_collator
)

In [None]:
# Check device configuration
print(f"Model device: {next(model.parameters()).device}")
print(f"Trainer device: {training_args.device}")
print(f"Model type: {model_id}")

In [None]:
# Start training (CPU-optimized with reduced dataset)
print(f'Starting CPU-optimized training on {len(train_subset)} examples (20% of full dataset)...')
print(f'Device: {device_type.upper()}')
print(f'Checkpoints: {output_full_finetuning_dir}/checkpoint-*\n')

trainer.train()

print('\n✓ Training complete!')

#### Save the Fine-Tuned Model

Save the trained model and tokenizer to a dedicated directory for later use. The model can then be loaded for inference or shared with other notebooks.

In [None]:
final_model_path = f"{output_full_finetuning_dir}/final_model_cpu_20_perc"

trainer.save_model(final_model_path)
tokenizer.save_pretrained(final_model_path)

print(f'✓ Model saved to: {final_model_path}')
print(f'\nTo load: AutoModelForSeq2SeqLM.from_pretrained("{final_model_path}")')

### 3.5. Evaluate the Fine-Tuned Model

Load the fine-tuned model and test it on examples from the test set. We'll compare the model's predictions with the actual labels.

In [None]:
final_model_path = f"{output_full_finetuning_dir}/final_model_cpu_20_perc"

#### Test on one sample

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load fine-tuned model
full_finetuned_model = AutoModelForSeq2SeqLM.from_pretrained(final_model_path)
full_finetuned_tokenizer = AutoTokenizer.from_pretrained(final_model_path)

# Move to device
full_finetuned_model = full_finetuned_model.to(device)  # Device was selected in Section 0

def classify_email_finetuned(text):
    """Classify email using fine-tuned model."""
    prompt = f"Classify this email as 'Safe Email' or 'Phishing Email': {text}"
    
    inputs = full_finetuned_tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = full_finetuned_model.generate(**inputs, max_new_tokens=10)
    
    result = full_finetuned_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return result.strip()

# Quick test on one example
test_email = ds_splits["test"][3]
prediction = classify_email_finetuned(test_email["text"])

print(f'Quick test on one example:')
print(f'Email: {test_email["text"][:400]}...')
print(f'Prediction: {prediction}')
print(f'Actual: {test_email["label"]}')
print(f'Correct: {prediction == test_email["label"]}')

#### Evaluate on Test Set (CPU-Optimized)

On one sample, the result looks good. Now let's evaluate the fine-tuned model on a subset of the test set.

The evaluation computes:
- **Accuracy**: Overall percentage of correct predictions
- **Per-class metrics**: Precision, recall, and F1-score for each class (Safe Email, Phishing Email)
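
For reference, the metrics listed above can be computed by hand. The sketch below is a minimal illustration on toy labels (not results from this notebook); `binary_metrics` is a hypothetical helper that treats "Phishing Email" as the positive class, whereas the notebook itself relies on scikit-learn's `accuracy_score` and `classification_report`.

```python
def binary_metrics(true_labels, predictions, positive="Phishing Email"):
    """Accuracy plus precision/recall/F1 for the chosen positive class."""
    pairs = list(zip(true_labels, predictions))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels, not results from the notebook
toy_true = ["Phishing Email", "Phishing Email", "Safe Email", "Safe Email", "Safe Email"]
toy_pred = ["Phishing Email", "Safe Email", "Safe Email", "Phishing Email", "Safe Email"]
print(binary_metrics(toy_true, toy_pred))
```

For phishing detection, recall on the phishing class is often the number to watch, since a missed phishing email is usually costlier than a false alarm.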

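**CPU-Optimized**: Subsample the test set for faster evaluation (~5-10 minutes on CPU) by enabling the `select(range(300))` line in the cell below. If that line stays commented out, the `test_subsample` defined in Section 2 is reused; with the full test set that means 1,867 samples.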

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from tqdm import tqdm
import gc

# Memory-efficient classification function
def classify_with_memory_management(text):
    """Classify with memory cleanup to prevent OOM errors."""
    # Truncate text to save memory
    prompt = f"Classify this email as 'Safe Email' or 'Phishing Email': {text[:400]}"
    
    inputs = full_finetuned_tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = full_finetuned_model.generate(
            **inputs, 
            max_new_tokens=10,
            do_sample=False,
            pad_token_id=full_finetuned_tokenizer.eos_token_id if hasattr(full_finetuned_tokenizer, 'eos_token_id') else None
        )
    
    # Decode only generated tokens
    result = full_finetuned_tokenizer.decode(outputs[0], skip_special_tokens=True)
    prediction = result.strip()
    return prediction


# CPU-Optimized: to evaluate on a 300-sample subset, uncomment the line below.
# While it stays commented out, the test_subsample defined in Section 2 is reused.
# test_subsample = ds_splits["test"].shuffle(seed=42).select(range(300))
print(f'Evaluating fine-tuned model on {len(test_subsample)} test examples (CPU-optimized)...')

predictions = []
true_labels = []

# Process with periodic memory cleanup
for i, example in enumerate(tqdm(test_subsample, desc="Testing")):
    pred = classify_with_memory_management(example["text"])
    predictions.append(pred)
    true_labels.append(example["label"])
    
    # Clear cache every 50 examples
    if (i + 1) % 50 == 0:
        if device.type == "cuda":
            torch.cuda.empty_cache()
        elif device.type == "mps":
            torch.mps.empty_cache()
        gc.collect()

# Final cleanup
if device.type == "cuda":
    torch.cuda.empty_cache()
elif device.type == "mps":
    torch.mps.empty_cache()
gc.collect()

# Compute metrics
accuracy = accuracy_score(true_labels, predictions)

# Display results
print('\n' + '='*60)
print('FINE-TUNED MODEL EVALUATION RESULTS')
print('='*60)
print(f'Test set size: {len(test_subsample)} examples')
print(f'\nOverall Accuracy: {accuracy:.3f} ({accuracy*100:.1f}%)')

print(f'\nClassification Report:')
print(classification_report(true_labels, predictions, zero_division=0))  # zero_division=0 avoids UndefinedMetricWarning, as in Section 2
print('='*60)

### Final notes on full fine-tuning

**Catastrophic forgetting**
- We updated all model parameters, so the model has likely lost some of its previous capabilities. It is now specialized for phishing classification but might perform much worse than the base model on other tasks such as summarization.
- In our case (using the model for a specific task) this is not an issue. We don't need a general model.
- If you need to maintain the base model's capabilities, use PEFT.

## 4. Parameter-Efficient Fine-Tuning (PEFT) with LoRA
<img src="images/peft-lora-diagram.jpg">

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) technique that adapts large language models by adding small trainable low-rank matrices alongside selected weight matrices while keeping the original model parameters frozen. This significantly reduces the number of trainable parameters and the memory needed for training, usually at a small cost in accuracy compared with full fine-tuning.

Full fine-tuning vs LoRA in trainable parameters: 
- **Full fine-tuning**
  - *all* model parameters (often hundreds of millions or even billions) are updated during training,
  - requires significant compute and storage
  - grants the model maximum flexibility to adapt to the new task
  - risks overwriting previously learned information (catastrophic forgetting)
- **LoRA**
  - freezes the original (pre-trained) model weights
  - injects small, trainable "adapter" layers (low-rank matrices) into specific parts of the model
    - often less than 1% of the total parameters
  - saves memory
  - accelerates training
  - uses less data
  - more accessible for users with limited hardware
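
The low-rank adapters described above are usually written as follows. The notation follows the LoRA paper listed in the references below; $d$ and $k$ are the dimensions of a generic weight matrix and are not values read from FLAN-T5.

$$
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

Here $W_0$ is the frozen pre-trained weight matrix and only $A$ and $B$ are trained. The rank $r$ and scaling factor $\alpha$ correspond to the `r` and `lora_alpha` settings of a `peft` `LoraConfig`.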

 | Approach               | Trainable Parameters | Storage Required | Base Model Retained? |
 |------------------------|---------------------|------------------|----------------------|
 | Full Fine-Tuning       | 100%                | High             | No                   |
 | LoRA / PEFT            | ~0.1%-2%            | Low              | Yes                  |
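
To see where the small trainable-parameter share comes from, the sketch below counts the parameters of one LoRA adapter against the weight matrix it adapts. The helper `lora_trainable_params` is hypothetical, and the 512 x 512 dimensions are illustrative rather than read from a real model.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

d_in, d_out, r = 512, 512, 8  # illustrative dimensions, not read from a real model
full = d_in * d_out
lora = lora_trainable_params(d_in, d_out, r)
print(f"full matrix: {full:,} params")
print(f"LoRA (r={r}): {lora:,} params ({100 * lora / full:.2f}% of the full matrix)")
```

The percentage for the whole model is far smaller than for this single matrix, because adapters are attached to only a few projection layers while embeddings, feed-forward layers, and all other weights stay frozen.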
 


In the following section, we'll explore how to apply LoRA to our phishing classification model using the `peft` library.

**References:**
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
- [Hugging Face PEFT documentation](https://huggingface.co/docs/peft/index)

### 4.1. Prepare LoRA Training Data

For LoRA training on FLAN-T5, we'll reuse the tokenized dataset from Section 3.2. The same data format works for both full fine-tuning and LoRA since both use the same base model architecture.
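
The next cell simply takes the first 20% of each split. That is fine when the split is already shuffled. If it is not, a label-balanced subset may be preferable. The helper below is a minimal sketch of such a stratified selection in plain Python; `stratified_indices` and the toy labels are hypothetical names and data introduced only for illustration.

```python
import random
from collections import defaultdict


def stratified_indices(labels, fraction, seed=42):
    """Return sorted indices covering roughly `fraction` of each label group."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    chosen = []
    for idxs in by_label.values():
        rng.shuffle(idxs)                             # random pick within each class
        keep = max(1, round(len(idxs) * fraction))    # keep at least one example per class
        chosen.extend(idxs[:keep])
    return sorted(chosen)


# Toy labels standing in for the real training labels (hypothetical data).
toy_labels = ["Safe Email"] * 60 + ["Phishing Email"] * 40
subset = stratified_indices(toy_labels, fraction=0.2)
print(f"Selected {len(subset)} of {len(toy_labels)} examples")
```

With the Hugging Face `datasets` API, the result could be passed to `tokenized["train"].select(...)`, assuming a list of labels aligned with the training split is available.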

In [None]:
# CPU-Optimized: Use 20% of data for practical LoRA training on CPU
# For full training on GPU, use the full tokenized dataset
lora_train_size = len(tokenized["train"]) // 5  # 20% of training data
lora_val_size = len(tokenized["validation"]) // 5  # 20% of validation data

lora_train_ds = tokenized["train"].select(range(lora_train_size))
lora_val_ds = tokenized["validation"].select(range(lora_val_size))

print(f'LoRA Training dataset: {len(lora_train_ds)} examples (20% - CPU-optimized)')
print(f'LoRA Validation dataset: {len(lora_val_ds)} examples (20% - CPU-optimized)')
print('Note: For full dataset training, modify lora_train_size and lora_val_size above.')

### 4.2. Load Base Model and Apply LoRA Adapters

We'll load a fresh copy of the base model and apply LoRA adapters to it. For FLAN-T5, we'll target the query and value projections (`q` and `v`) of the attention layers; the key and output projections (`k` and `o`) are left untouched.
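
The strings in `target_modules` are matched against the names of the model's submodules, so it helps to look at those names before choosing targets. The sketch below counts the last component of each dotted module name. The example names follow the naming style of T5 attention blocks but are written by hand for illustration, and `count_leaf_names` is a hypothetical helper.

```python
from collections import Counter


def count_leaf_names(module_names):
    """Count the last component of each dotted module name (e.g. 'q' in '...SelfAttention.q')."""
    return Counter(name.rsplit(".", 1)[-1] for name in module_names if name)


# Illustrative names in the style of a T5 attention block (not read from a checkpoint).
example_names = [
    "encoder.block.0.layer.0.SelfAttention.q",
    "encoder.block.0.layer.0.SelfAttention.k",
    "encoder.block.0.layer.0.SelfAttention.v",
    "encoder.block.0.layer.0.SelfAttention.o",
    "decoder.block.0.layer.1.EncDecAttention.q",
    "decoder.block.0.layer.1.EncDecAttention.v",
]

counts = count_leaf_names(example_names)
targets = ["q", "v"]
matched = [n for n in example_names if n.rsplit(".", 1)[-1] in targets]

print(counts)
print(f"{len(matched)} of {len(example_names)} example modules would receive adapters")
```

To inspect the real model, the same helper could be fed with `[name for name, module in lora_base_model.named_modules() if isinstance(module, torch.nn.Linear)]`.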


In [None]:
from peft import LoraConfig, get_peft_model, TaskType

# Load a fresh base model for LoRA training
lora_base_model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Configure LoRA parameters for FLAN-T5
lora_config = LoraConfig(
    r=8,                          # Rank of LoRA adapters (higher = more capacity = more trainable parameters).
    lora_alpha=16,                # Scaling factor: the LoRA update is multiplied by lora_alpha / r (here 2).
    target_modules=['q', 'v'],    # Apply LoRA to the attention query (q) and value (v) projections; the key (k) and output (o) projections stay frozen.
    lora_dropout=0.05,            # Dropout for regularization
    bias='none',                  # Don't adapt bias terms
    task_type=TaskType.SEQ_2_SEQ_LM  # Task type: sequence-to-sequence language modeling
)

# Apply LoRA adapters to the base model
lora_model = get_peft_model(lora_base_model, lora_config)
print('\nTrainable parameters:')
lora_model.print_trainable_parameters()

### 4.3. Configure Training and Train LoRA Adapters (CPU-Optimized)

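LoRA training is faster than full fine-tuning because gradients and optimizer state are kept only for the adapter weights (roughly 1% of the parameters or less). The forward pass still runs through the whole model.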

**Training Time Estimate (20% dataset):**
- CPU: ~20 minutes
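
Training time scales with the number of optimizer updates, which follows from the dataset size, the batch size, gradient accumulation, and the number of epochs. The sketch below estimates it for the settings used in the next cell (batch size 2, accumulation 4, 3 epochs). The dataset size is a hypothetical placeholder, and the exact rounding may differ slightly between `transformers` versions.

```python
import math


def estimate_optimizer_steps(n_examples, per_device_batch_size, grad_accum_steps, epochs):
    """Rough number of optimizer updates on a single device."""
    batches_per_epoch = math.ceil(n_examples / per_device_batch_size)
    updates_per_epoch = max(1, math.ceil(batches_per_epoch / grad_accum_steps))
    return updates_per_epoch * epochs


n_examples = 2000                 # hypothetical size of the 20% training subset
per_device_batch_size = 2
grad_accum_steps = 4
epochs = 3

effective_batch = per_device_batch_size * grad_accum_steps
steps = estimate_optimizer_steps(n_examples, per_device_batch_size, grad_accum_steps, epochs)

print(f"Effective batch size: {effective_batch}")
print(f"Estimated optimizer updates: {steps}")
```

Replacing `n_examples` with `len(lora_train_ds)` gives the estimate for the actual run. With `logging_steps=100`, a loss value is logged roughly every 100 of these updates.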

In [None]:
# Move model to device
lora_model = lora_model.to(device)

# Configure LoRA training arguments (CPU-Optimized)
lora_output_dir = f"{model_name_clean}_lora_cpu_20_perc"

lora_training_args = Seq2SeqTrainingArguments(
    output_dir=lora_output_dir,
    per_device_train_batch_size=2,   # CPU-optimized batch size
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,    # Effective batch size = 8 (CPU-optimized)
    num_train_epochs=3,               # 3 epochs with reduced dataset
    learning_rate=2e-5,               # Conservative learning rate; LoRA is often trained with higher values (e.g. 1e-4 to 1e-3)
    logging_steps=100,                # Less frequent logging for CPU
    evaluation_strategy="no",         # No evaluation during training to save time (newer transformers releases name this eval_strategy)
    save_strategy="epoch",            # Save at end of each epoch
    use_cpu=(device_type == "cpu"),
    save_total_limit=1,               # Keep only last checkpoint
    save_only_model=True,
    predict_with_generate=True,
    fp16=False,                       # Disable fp16 for CPU compatibility
)

# Create a data collator for the LoRA model (same collator type as in full fine-tuning)
lora_data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=lora_model)

# Initialize trainer
lora_trainer = Seq2SeqTrainer(
    model=lora_model,
    args=lora_training_args,
    train_dataset=lora_train_ds,     # Using 20% of data (CPU-optimized)
    eval_dataset=lora_val_ds,        # Using 20% of data (CPU-optimized)
    data_collator=lora_data_collator
)

print('LoRA Trainer initialized (CPU-Optimized)')
print(f'Model device: {next(lora_model.parameters()).device}')
print(f'Training samples: {len(lora_train_ds)} (20% of full dataset)')
print(f'Output directory: {lora_output_dir}')

In [None]:
# Start LoRA training (CPU-Optimized)
print(f'Starting LoRA training on {len(lora_train_ds)} examples...')
print(f'Device: {device_type.upper()}')

lora_trainer.train()

print('\n✓ LoRA training complete!')

# Save the LoRA adapter
lora_adapter_path = f"{lora_output_dir}/lora_adapter"
lora_model.save_pretrained(lora_adapter_path)
print(f'✓ LoRA adapter saved to: {lora_adapter_path}')

### 4.4. Evaluate LoRA Model and Compare with Other Approaches (CPU-Optimized)

Now let's evaluate the LoRA model and compare its performance against:
1. **Zero-shot** (base model with no training)
2. **Full fine-tuning** (all parameters updated)
3. **LoRA** (only adapter parameters updated)

**CPU-Optimized**: We evaluate on a sample of the test set to keep evaluation fast (~10-15 minutes on CPU). For a full evaluation, change `eval_size` in the evaluation cell below.
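
A generative model returns free text, so its output may differ from the label string in casing, whitespace, or extra words, and an exact string comparison would then count a correct answer as wrong. The helper below is a minimal keyword-based sketch for mapping raw outputs to the two labels used in the prompt. `normalize_prediction` is a hypothetical helper, and it assumes the dataset labels are exactly `'Safe Email'` and `'Phishing Email'`.

```python
def normalize_prediction(raw: str) -> str:
    """Map free-form generated text to one of the two label strings, or 'Unknown'."""
    text = raw.strip().lower()
    # Simple keyword matching; negations such as "not phishing" are not handled.
    if "phishing" in text:
        return "Phishing Email"
    if "safe" in text:
        return "Safe Email"
    return "Unknown"


examples = ["Phishing Email", " safe email ", "phishing", "I am not sure"]
normalized = [normalize_prediction(e) for e in examples]
print(normalized)
```

Applied to the predictions collected below, this could look like `accuracy_score(true_labels, [normalize_prediction(p) for p in lora_preds])`. Outputs mapped to `'Unknown'` always count as errors.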

In [None]:
from peft import PeftModel

# Helper function for generating predictions
def generate_prediction(model_to_use, text, max_new_tokens=10):
    """Generate prediction from any model."""
    prompt = f"Classify this email as 'Safe Email' or 'Phishing Email': {text[:400]}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)
    
    with torch.no_grad():
        outputs = model_to_use.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return result.strip()

# Prepare models for evaluation

# 1. Zero-shot model (base model, no training)
zero_shot_model = AutoModelForSeq2SeqLM.from_pretrained(model_id).to(device)
zero_shot_model.eval()

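# 2. LoRA model (just trained, still in memory)
# To reload the saved adapter instead, uncomment the line below. It wraps a separate
# base model because PEFT injects the adapter layers into the model it is given,
# which would otherwise alter zero_shot_model.
# lora_model = PeftModel.from_pretrained(AutoModelForSeq2SeqLM.from_pretrained(model_id), lora_adapter_path).to(device)
lora_model.eval()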

# 3. Full fine-tuned model (we already loaded it in Section 3.5)
# full_finetuned_model = AutoModelForSeq2SeqLM.from_pretrained(final_model_path).to(device)
full_finetuned_model.eval()
print("Models loaded")

In [None]:
# CPU-Optimized: Evaluate on a subset of the test set (300 samples).
# `test_subsample` must already be defined earlier in the notebook; uncomment the two lines below to recreate it.
# For full evaluation on GPU, use: eval_size = len(ds_splits["test"])
# eval_size = 300  # CPU-optimized: 300 samples for faster evaluation
# test_subsample = ds_splits["test"].shuffle(seed=42).select(range(eval_size))

print(f'Evaluating on {len(test_subsample)} test samples (CPU-optimized)...')

# Storage for predictions
zero_shot_preds = []
lora_preds = []
full_preds = []
true_labels = []

# Evaluate each model
for i, example in enumerate(tqdm(test_subsample, desc="Evaluating models")):
    text = example["text"]
    true_label = example["label"]
    true_labels.append(true_label)
    
    # Zero-shot prediction
    zs_pred = generate_prediction(zero_shot_model, text)
    zero_shot_preds.append(zs_pred)
    
    # LoRA prediction
    lora_pred = generate_prediction(lora_model, text)
    lora_preds.append(lora_pred)
    
    # Full fine-tuned prediction
    full_pred = generate_prediction(full_finetuned_model, text)
    full_preds.append(full_pred)
    
    # Periodic memory cleanup
    if (i + 1) % 50 == 0:
        if device.type == "cuda":
            torch.cuda.empty_cache()
        elif device.type == "mps":
            torch.mps.empty_cache()
        gc.collect()

# Final cleanup
if device.type == "cuda":
    torch.cuda.empty_cache()
elif device.type == "mps":
    torch.mps.empty_cache()
gc.collect()

print('\n✓ Evaluation complete!')

In [None]:
# Display comparison results
print('\n' + '='*70)
print('COMPARISON: Zero-Shot vs LoRA vs Full Fine-Tuning')
print('='*70)

# Compute accuracies
zero_shot_acc = accuracy_score(true_labels, zero_shot_preds)
lora_acc = accuracy_score(true_labels, lora_preds)

print(f'\n Test set size: {len(test_subsample)} examples\n')

# Zero-shot results
print('1) ZERO-SHOT (No Training)')
print('-' * 70)
print(f'   Accuracy: {zero_shot_acc:.3f} ({zero_shot_acc*100:.1f}%)')
print(f'\n   Classification Report:')
for line in classification_report(true_labels, zero_shot_preds, zero_division=0).split('\n'):
    if line.strip():
        print(f'   {line}')

# LoRA results
print(f'\n2) LoRA (Parameter-Efficient Fine-Tuning)')
print('-' * 70)
print(f'   Accuracy: {lora_acc:.3f} ({lora_acc*100:.1f}%)')
print(f'   Improvement over zero-shot: {(lora_acc - zero_shot_acc)*100:+.1f}%')
print(f'\n   Classification Report:')
for line in classification_report(true_labels, lora_preds, zero_division=0).split('\n'):
    if line.strip():
        print(f'   {line}')

# Full fine-tuned results
full_acc = accuracy_score(true_labels, full_preds)
print(f'\n3) FULL FINE-TUNING (All Parameters Updated)')
print('-' * 70)
print(f'   Accuracy: {full_acc:.3f} ({full_acc*100:.1f}%)')
print(f'   Improvement over zero-shot: {(full_acc - zero_shot_acc)*100:+.1f}%')
print(f'   Improvement over LoRA: {(full_acc - lora_acc)*100:+.1f}%')
print(f'\n   Classification Report:')
for line in classification_report(true_labels, full_preds, zero_division=0).split('\n'):
    if line.strip():
        print(f'   {line}')

# Summary comparison
print(f'\n' + '='*70)
print('SUMMARY')
print('='*70)
print(f'Zero-shot       → LoRA:          {(lora_acc - zero_shot_acc)*100:+.1f}% improvement')
print(f'LoRA            → Full:          {(full_acc - lora_acc)*100:+.1f}% improvement')
print(f'Zero-shot       → Full:          {(full_acc - zero_shot_acc)*100:+.1f}% total improvement')

print(f'\n Training Efficiency:')
print(f'   LoRA:  ~1% parameters trained, {lora_acc*100:.1f}% accuracy')
print(f'   Full: 100% parameters trained, {full_acc*100:.1f}% accuracy')
print(f'   Trade-off: LoRA achieves {(lora_acc/full_acc)*100:.1f}% of full accuracy with 1% of parameters')

### Key Takeaways

**When to use LoRA:**
- Limited computational resources (CPU or smaller GPU)
- Need to maintain multiple task-specific models
- Want faster iteration and experimentation
- Need to preserve base model capabilities (avoid catastrophic forgetting)
- Working with limited training data

**When to use Full Fine-Tuning:**
- Maximum performance is critical
- Have sufficient computational resources (GPU with 8GB+ memory)
- Task is significantly different from pre-training
- Have large amounts of training data (10,000+ examples)
- Don't need to maintain base model capabilities

**Typical Results (for this task with 20% dataset):**
- Zero-shot: ~30-40% accuracy (no training required)
- LoRA: ~85-90% accuracy (1% parameters, ~20 minutes on CPU)
- Full: ~85-90% accuracy (100% parameters, ~1 hour on CPU)

**CPU-Optimized Training Times:**
- Full Fine-Tuning: ~1 hour (20% dataset)
- LoRA: ~20 minutes (20% dataset)
- Evaluation: ~5 minutes per approach (300-sample test subset)

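**Based on the typical figures above, LoRA reaches accuracy comparable to full fine-tuning while training only ~1% of the parameters in roughly a third of the training time (~20 minutes vs. ~1 hour on CPU). The exact ratio for your run is printed by the comparison cell in Section 4.4.**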
