# Multimodal Large Language Models - Homework Assignment

**Course:** Applied Data Science  
**Institution:** Clemson University  
**Due Date:** [To be specified by instructor]

## Overview

This homework consists of three exercises that will give you hands-on experience fine-tuning multimodal models for different tasks and modalities. Each exercise focuses on a different aspect of multimodal learning:

1. **Exercise 1**: Fine-tune CLIP for custom image classification (Vision-Language Contrastive Learning)
2. **Exercise 2**: Fine-tune BLIP-2 for domain-specific image captioning (Vision-to-Language Generation)
3. **Exercise 3**: Fine-tune a VLM for Visual Question Answering (Multimodal Understanding)

## Learning Objectives

- Prepare datasets for multimodal fine-tuning
- Apply parameter-efficient fine-tuning techniques (LoRA)
- Train models on the Palmetto cluster
- Evaluate multimodal models quantitatively
- Compare different modality fusion strategies

## Submission Requirements

1. Completed Jupyter notebook with all code cells executed
2. Written analysis for each exercise (in markdown cells)
3. Model checkpoints (LoRA weights only)
4. Evaluation results and visualizations
5. Brief report (1-2 pages) summarizing findings

## Grading Rubric

- **Exercise 1**: 30 points
- **Exercise 2**: 30 points
- **Exercise 3**: 30 points
- **Analysis and Report**: 10 points
- **Total**: 100 points

## Setup

Ensure you have access to the Palmetto cluster with GPU resources. Recommended configuration:
- GPU: A100 (40GB) or V100 (32GB)
- Memory: 64GB+ RAM
- Storage: 50GB+ in /scratch

Request a GPU node:
```bash
qsub -I -l select=1:ncpus=16:mem=64gb:ngpus=1:gpu_model=a100,walltime=8:00:00
```

## Initial Setup

Run this section once to set up the environment.

In [None]:
# Install required packages
!pip install -q transformers accelerate peft bitsandbytes datasets pillow matplotlib \
    scikit-learn evaluate sacrebleu rouge-score torch torchvision tqdm

In [None]:
# Import libraries
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import (
    CLIPProcessor, CLIPModel,
    Blip2Processor, Blip2ForConditionalGeneration,
    AutoProcessor, AutoModelForVision2Seq,
    TrainingArguments, Trainer,
    default_data_collator
)
from peft import LoraConfig, get_peft_model, PeftModel, prepare_model_for_kbit_training
from datasets import load_dataset, Dataset as HFDataset
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm
import evaluate
import os
import json
from typing import Dict, List, Tuple
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Check GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

---

# Exercise 1: Fine-tune CLIP for Custom Image Classification (30 points)

## Objective

Fine-tune CLIP on a domain-specific dataset to improve zero-shot classification performance. You will use the **Food-101** dataset to adapt CLIP for food recognition.

## Background

CLIP is trained on general web data. For specialized domains, fine-tuning can significantly improve performance. You'll use contrastive learning with LoRA to adapt CLIP efficiently.

## Tasks

1. Load and prepare the Food-101 dataset (use a subset for faster training)
2. Configure LoRA for CLIP's vision and text encoders
3. Implement contrastive loss training
4. Fine-tune the model
5. Evaluate zero-shot classification performance
6. Compare with baseline (non-fine-tuned) CLIP

## Grading Criteria

- Data preparation (5 points)
- LoRA configuration (5 points)
- Training implementation (10 points)
- Evaluation and comparison (5 points)
- Analysis (5 points)

### 1.1 Load and Prepare Dataset

In [None]:
# Load Food-101 dataset (subset for faster training)
# Full dataset: 101 food categories, 101,000 images
# We'll use a subset: 20 categories, ~20,000 images

print("Loading Food-101 dataset...")
dataset = load_dataset("food101", split="train[:20%]")  # Use 20% of training data
test_dataset = load_dataset("food101", split="validation[:20%]")

print(f"Training samples: {len(dataset)}")
print(f"Test samples: {len(test_dataset)}")

# Explore the dataset
print("\nDataset features:", dataset.features)
print(f"Number of classes: {len(dataset.features['label'].names)}")
print(f"Class names (first 10): {dataset.features['label'].names[:10]}")

# Visualize sample images
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for idx, ax in enumerate(axes.flat):
    sample = dataset[idx * 200]
    ax.imshow(sample['image'])
    ax.set_title(dataset.features['label'].names[sample['label']])
    ax.axis('off')
plt.tight_layout()
plt.show()

# TODO: Split dataset into train/validation
# Hint: Use dataset.train_test_split()
train_val_split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = train_val_split['train']
val_dataset = train_val_split['test']

print(f"\nFinal split:")
print(f"Train: {len(train_dataset)}")
print(f"Validation: {len(val_dataset)}")
print(f"Test: {len(test_dataset)}")

### 1.2 Prepare CLIP Model and Processor

In [None]:
# Load CLIP model and processor
model_name = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(model_name)
model = CLIPModel.from_pretrained(model_name)

print(f"Model loaded: {model_name}")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.2f}M")

# Get class names and create text prompts
class_names = dataset.features['label'].names
text_prompts = [f"a photo of {name.replace('_', ' ')}" for name in class_names]

print(f"\nText prompts (first 5):")
for i in range(5):
    print(f"  {i}: {text_prompts[i]}")

### 1.3 Evaluate Baseline Performance

First, evaluate the pre-trained CLIP model to establish a baseline.
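
Zero-shot classification reduces to a nearest-neighbour search: the image embedding is compared with one text embedding per class prompt, and the most similar prompt wins. The sketch below illustrates that idea without any framework, using hand-written toy vectors in place of real CLIP embeddings. The function names `l2_normalize`, `zero_shot_predict`, and `accuracy` are hypothetical helpers for illustration, not part of the assignment code.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_predict(image_emb: Sequence[float], text_embs: Sequence[Sequence[float]]) -> int:
    """Return the index of the class prompt most similar to the image."""
    img = l2_normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_embs]
    return max(range(len(sims)), key=sims.__getitem__)


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that equal the ground-truth label."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy data (made up): three class prompts and three images in a 3-d embedding space
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.6, 0.5, 0.1]]
labels = [0, 1, 1]

preds = [zero_shot_predict(img, text_embs) for img in image_embs]
print("predictions:", preds)
print(f"accuracy: {accuracy(preds, labels):.3f}")
```

Because the text embeddings depend only on the prompts, they can be computed once and reused for every image, which is considerably cheaper than re-encoding all prompts per sample.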

In [None]:
def evaluate_clip_classification(model, processor, dataset, text_prompts, device, num_samples=None):
    """
    Evaluate CLIP model on classification task.
    
    Returns:
        accuracy: Classification accuracy
    """
    model.to(device)
    model.eval()
    
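    if num_samples:
        # Shuffle with a fixed seed before subsampling: rows in a split may be grouped
        # by class, in which case the first N rows would cover only a few classes
        dataset = dataset.shuffle(seed=42).select(range(min(num_samples, len(dataset))))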
    
    correct = 0
    total = 0
    
    print(f"Evaluating on {len(dataset)} samples...")
    
    # TODO: Implement evaluation loop
    # For each image:
    #   1. Process image and text prompts
    #   2. Get model predictions
    #   3. Compare with ground truth label
    #   4. Update accuracy
    
    for idx in tqdm(range(len(dataset))):
        sample = dataset[idx]
        image = sample['image']
        true_label = sample['label']
        
        # Process inputs
        inputs = processor(
            text=text_prompts,
            images=image,
            return_tensors="pt",
            padding=True
        ).to(device)
        
        # Get predictions
        with torch.no_grad():
            outputs = model(**inputs)
            logits_per_image = outputs.logits_per_image
            predicted_label = logits_per_image.argmax(dim=1).item()
        
        if predicted_label == true_label:
            correct += 1
        total += 1
    
    accuracy = correct / total
    return accuracy

# Evaluate baseline
print("\n=== Baseline CLIP Evaluation ===")
baseline_accuracy = evaluate_clip_classification(
    model, processor, test_dataset, text_prompts, device, num_samples=500
)
print(f"\nBaseline Accuracy: {baseline_accuracy*100:.2f}%")

### 1.4 Configure LoRA for Fine-tuning
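
Before choosing `r` and `target_modules`, it can help to estimate how many trainable parameters a configuration adds. LoRA attaches two low-rank matrices to each targeted linear layer, so one adapter costs `r * (d_in + d_out)` parameters. The sketch below works through that arithmetic. The layer counts and hidden sizes are assumptions about `openai/clip-vit-base-patch32` and should be checked against `model.config`; the helper names are hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def encoder_lora_params(hidden: int, layers: int, modules_per_layer: int, r: int) -> int:
    """Total LoRA parameters when every targeted module is a hidden x hidden projection."""
    return layers * modules_per_layer * lora_param_count(hidden, hidden, r)


# Assumed shapes (verify with model.config before relying on them):
#   vision encoder: 12 layers, hidden size 768
#   text encoder:   12 layers, hidden size 512
R = 16
MODULES = 4  # q_proj, k_proj, v_proj, out_proj

vision = encoder_lora_params(hidden=768, layers=12, modules_per_layer=MODULES, r=R)
text = encoder_lora_params(hidden=512, layers=12, modules_per_layer=MODULES, r=R)
total = vision + text

print(f"vision: {vision:,}  text: {text:,}  total: {total:,}")
```

Comparing this estimate with the output of `model.print_trainable_parameters()` is a quick sanity check that the adapters were attached to the modules you intended.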

In [None]:
# TODO: Configure LoRA for CLIP
# Apply LoRA to both vision and text encoders
# Recommended hyperparameters:
#   - r: 8-16
#   - lora_alpha: 16-32
#   - target_modules: attention projections shared by both encoders;
#     ["q_proj", "v_proj"] is a lighter option than the four modules used below
#   - lora_dropout: 0.1

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "out_proj"],
    lora_dropout=0.1,
    bias="none",
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

print("\nLoRA applied successfully!")

### 1.5 Create Dataset Class for Training

In [None]:
class CLIPContrastiveDataset(Dataset):
    """
    Dataset for CLIP contrastive learning.
    Returns image-text pairs for each sample.
    """
    
    def __init__(self, dataset, text_prompts, processor):
        self.dataset = dataset
        self.text_prompts = text_prompts
        self.processor = processor
    
    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self, idx):
        sample = self.dataset[idx]
        image = sample['image']
        label = sample['label']
        text = self.text_prompts[label]
        
        # Process image and text
        encoding = self.processor(
            images=image,
            text=text,
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=77
        )
        
        # Remove batch dimension
        encoding = {k: v.squeeze(0) for k, v in encoding.items()}
        encoding['labels'] = label
        
        return encoding

# Create datasets
train_clip_dataset = CLIPContrastiveDataset(train_dataset, text_prompts, processor)
val_clip_dataset = CLIPContrastiveDataset(val_dataset, text_prompts, processor)

print(f"Training dataset: {len(train_clip_dataset)} samples")
print(f"Validation dataset: {len(val_clip_dataset)} samples")

### 1.6 Define Training Configuration
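
Step-based settings such as warmup, logging, and checkpoint intervals only make sense relative to the total number of optimizer updates, which depends on dataset size, batch size, and gradient accumulation. The sketch below estimates that total. The dataset size is an assumption (90% of a 15,150-image slice); substitute `len(train_dataset)` in practice. `optimizer_steps` is a hypothetical helper, and the exact rounding used by `Trainer` may differ slightly between versions.

```python
import math


def optimizer_steps(num_samples: int, per_device_batch: int, grad_accum: int,
                    epochs: int, num_devices: int = 1) -> int:
    """Approximate number of optimizer updates for a full training run."""
    batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


# Assumed training-set size (hypothetical; use len(train_dataset) in the notebook)
total_steps = optimizer_steps(num_samples=13_635, per_device_batch=32, grad_accum=2, epochs=3)
warmup_fraction = 500 / total_steps

print(f"total optimizer steps: {total_steps}")
print(f"share of training covered by a 500-step warmup: {warmup_fraction:.0%}")
```

Under these assumptions a 500-step warmup would occupy most of the run, so a warmup expressed as a small fraction of the total steps is usually the safer choice for short fine-tuning jobs.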

In [None]:
# TODO: Define training arguments
# Recommended settings:
#   - learning_rate: 5e-5 to 1e-4
#   - num_train_epochs: 3-5
#   - per_device_train_batch_size: 16-32 (adjust based on GPU memory)
#   - warmup: roughly 5-10% of the total optimizer steps
#   - logging_steps: 50
#   - save_steps: 500
#   - eval_strategy: "steps" (named evaluation_strategy in transformers < 4.41)

training_args = TrainingArguments(
    output_dir="./clip_food101_lora",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-4,
    warmup_ratio=0.1,  # a fixed 500-step warmup would cover most of a run this short
    logging_steps=50,
    save_steps=500,
    eval_strategy="steps",
    eval_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    fp16=torch.cuda.is_available(),
    gradient_accumulation_steps=2,
    dataloader_num_workers=4,
    remove_unused_columns=False,
    label_names=["labels"],     # CLIPModel.forward has no `labels` argument, so name the label column explicitly
    prediction_loss_only=True,  # evaluation only needs the contrastive loss, not the raw model outputs
)

print("Training arguments configured.")
print(f"Total training steps: {len(train_clip_dataset) // (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps) * training_args.num_train_epochs}")

### 1.7 Custom Trainer with Contrastive Loss
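
The symmetric contrastive (InfoNCE) objective treats the image-text similarity matrix as two classification problems: each image should pick its own text (rows), and each text should pick its own image (columns). The framework-free sketch below computes that loss on small hand-written matrices so the behaviour is easy to inspect. The function names are hypothetical, and the matrices are toy values rather than real CLIP logits.

```python
import math
from typing import List


def cross_entropy_rows(logits: List[List[float]], targets: List[int]) -> float:
    """Mean cross-entropy where row i of `logits` should select column targets[i]."""
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[t]
    return total / len(logits)


def symmetric_contrastive_loss(logits_per_image: List[List[float]]) -> float:
    """Average of image-to-text and text-to-image cross-entropy with diagonal targets."""
    n = len(logits_per_image)
    targets = list(range(n))
    logits_per_text = [list(col) for col in zip(*logits_per_image)]  # transpose
    loss_i2t = cross_entropy_rows(logits_per_image, targets)
    loss_t2i = cross_entropy_rows(logits_per_text, targets)
    return (loss_i2t + loss_t2i) / 2


# Well-aligned batch: each image is most similar to its own text
aligned = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
# Uninformative batch: all similarities equal, so the loss is log(batch size)
uniform = [[0.0, 0.0, 0.0] for _ in range(3)]
# Samples 0 and 1 share a class (identical prompts), so their columns are identical
duplicated = [[10.0, 10.0, 0.0], [10.0, 10.0, 0.0], [0.0, 0.0, 10.0]]

for name, mat in [("aligned", aligned), ("uniform", uniform), ("duplicated", duplicated)]:
    print(f"{name:>10}: {symmetric_contrastive_loss(mat):.4f}")
```

The `duplicated` case shows a limitation worth discussing in your analysis: when two samples in a batch share the same class prompt, the diagonal target penalizes a match that is actually correct, so the loss cannot approach zero for that batch.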

In [None]:
class CLIPContrastiveTrainer(Trainer):
    """
    Custom trainer for CLIP with contrastive loss.
    """
    
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        """
        Compute the symmetric contrastive (InfoNCE) loss for CLIP.

        **kwargs absorbs extra arguments (such as num_items_in_batch) that
        newer transformers versions pass to compute_loss.
        """
        # TODO: Implement contrastive loss computation
        # 1. Forward pass through model
        # 2. Get image-text similarity logits
        # 3. Compute contrastive loss (InfoNCE)

        # Class ids are not used by the in-batch loss; drop them so they
        # are not forwarded to the model
        inputs.pop('labels', None)
        outputs = model(**inputs)

        # CLIP returns logits_per_image and logits_per_text
        logits_per_image = outputs.logits_per_image
        logits_per_text = outputs.logits_per_text

        # Targets: the i-th image matches the i-th text (the diagonal).
        # Note: samples of the same class share a prompt, so they act as
        # false negatives for each other within a batch.
        batch_size = logits_per_image.shape[0]
        targets = torch.arange(batch_size, device=logits_per_image.device)

        # Symmetric loss (image-to-text and text-to-image)
        loss_i2t = nn.functional.cross_entropy(logits_per_image, targets)
        loss_t2i = nn.functional.cross_entropy(logits_per_text, targets)
        loss = (loss_i2t + loss_t2i) / 2
        
        return (loss, outputs) if return_outputs else loss

print("Custom trainer defined.")

### 1.8 Train the Model

In [None]:
# Initialize trainer
trainer = CLIPContrastiveTrainer(
    model=model,
    args=training_args,
    train_dataset=train_clip_dataset,
    eval_dataset=val_clip_dataset,
)

# TODO: Train the model
print("\n=== Starting Training ===")
train_result = trainer.train()

print("\n=== Training Completed ===")
print(f"Training loss: {train_result.training_loss:.4f}")
print(f"Training time: {train_result.metrics['train_runtime']:.2f} seconds")

# Save the final model
trainer.save_model("./clip_food101_final")
print("\nModel saved to ./clip_food101_final")

### 1.9 Evaluate Fine-tuned Model

In [None]:
# Evaluate fine-tuned model
print("\n=== Fine-tuned CLIP Evaluation ===")
finetuned_accuracy = evaluate_clip_classification(
    model, processor, test_dataset, text_prompts, device, num_samples=500
)
print(f"\nFine-tuned Accuracy: {finetuned_accuracy*100:.2f}%")

# Compare with baseline
improvement = (finetuned_accuracy - baseline_accuracy) * 100
print(f"\n=== Results Comparison ===")
print(f"Baseline Accuracy: {baseline_accuracy*100:.2f}%")
print(f"Fine-tuned Accuracy: {finetuned_accuracy*100:.2f}%")
print(f"Improvement: {improvement:+.2f}%")

# Visualize results
plt.figure(figsize=(8, 6))
models = ['Baseline', 'Fine-tuned']
accuracies = [baseline_accuracy * 100, finetuned_accuracy * 100]
colors = ['lightcoral', 'lightgreen']
plt.bar(models, accuracies, color=colors)
plt.ylabel('Accuracy (%)')
plt.title('CLIP Classification Performance')
plt.ylim([0, 100])
for i, v in enumerate(accuracies):
    plt.text(i, v + 1, f"{v:.2f}%", ha='center')
plt.show()

### 1.10 Analysis Questions

**TODO: Answer the following questions in the markdown cell below:**

1. What was the improvement in accuracy after fine-tuning? Why do you think this happened?
2. How many parameters did LoRA add compared to the full model? What are the benefits?
3. What challenges did you face during training? How did you address them?
4. How would you improve the results further?
5. In what real-world scenarios would this fine-tuned model be useful?

**Your Analysis:**

[Write your analysis here]

---

# Exercise 2: Fine-tune BLIP-2 for Domain-Specific Image Captioning (30 points)

## Objective

Fine-tune BLIP-2 to generate detailed captions in a specialized domain. The motivating use case is radiology report generation from medical images; the starter code uses a general-purpose captioning dataset that you can replace with a medical one.

## Background

BLIP-2 is a powerful image captioning model, but it's trained on general images. Medical images require specific terminology and structured descriptions. You'll fine-tune BLIP-2 to generate medically accurate captions.

## Tasks

1. Load and prepare a medical imaging dataset (or alternative domain-specific dataset)
2. Configure LoRA for BLIP-2's Q-Former and language model
3. Implement image captioning fine-tuning
4. Train the model
5. Evaluate using BLEU, ROUGE, and qualitative analysis
6. Generate captions for test images
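
To make the automatic metrics in task 5 less of a black box, the sketch below implements two simplified building blocks by hand: ROUGE-L as an F-measure over the longest common subsequence, and the clipped unigram precision that BLEU is built on. It assumes plain lowercase whitespace tokenization, so its numbers will not match the `evaluate` library exactly. All function names are hypothetical.

```python
from collections import Counter
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F-measure from LCS-based precision and recall."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def clipped_unigram_precision(prediction: str, reference: str) -> float:
    """Share of predicted words found in the reference, counting each reference word at most once."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred:
        return 0.0
    ref_counts = Counter(ref)
    overlap = sum(min(count, ref_counts[tok]) for tok, count in Counter(pred).items())
    return overlap / len(pred)


prediction = "a dog runs on the grass"
reference = "a dog is running on the grass"

print(f"ROUGE-L F1:        {rouge_l_f1(prediction, reference):.4f}")
print(f"unigram precision: {clipped_unigram_precision(prediction, reference):.4f}")
```

Both scores reward word overlap only, which is why a caption can score well while stating something clinically wrong. That gap is the motivation for the qualitative analysis in this exercise.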

## Grading Criteria

- Data preparation (5 points)
- LoRA configuration (5 points)
- Training implementation (10 points)
- Evaluation (5 points)
- Analysis (5 points)

### 2.1 Load Dataset

We'll use **Flickr30k** as an easily accessible demonstration dataset; **ROCO (Radiology Objects in COntext)** is a medical imaging alternative.

In [None]:
# Option 1: Use Flickr30k (easy to access; used below)
# Option 2: Use a medical imaging dataset (e.g. ROCO) if available

print("Loading dataset for image captioning...")

# TODO: Load your dataset
# The demonstration uses a 10% slice of Flickr30k.
# Replace it with a medical imaging or other domain-specific dataset.

caption_dataset = load_dataset("nlphuji/flickr30k", split="test[:10%]")

print(f"Dataset loaded: {len(caption_dataset)} samples")
print(f"Features: {caption_dataset.features}")

# Explore samples
sample = caption_dataset[0]
print(f"\nSample image shape: {sample['image'].size}")
print(f"Sample captions: {sample['caption'][:2]}")

# Visualize samples
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for idx, ax in enumerate(axes.flat):
    sample = caption_dataset[idx * 100]
    ax.imshow(sample['image'])
    caption = sample['caption'][0] if isinstance(sample['caption'], list) else sample['caption']
    ax.set_title(caption[:50] + '...', fontsize=9)
    ax.axis('off')
plt.tight_layout()
plt.show()

### 2.2 Prepare BLIP-2 Model

In [None]:
# Load BLIP-2 model
# Using smaller version for faster training
model_name = "Salesforce/blip2-opt-2.7b"

print(f"Loading {model_name}...")
blip_processor = Blip2Processor.from_pretrained(model_name)
blip_model = Blip2ForConditionalGeneration.from_pretrained(
    model_name,
    load_in_8bit=True,  # Use quantization to save memory
    device_map="auto"
)

print("Model loaded successfully!")

# Prepare model for k-bit training
blip_model = prepare_model_for_kbit_training(blip_model)
print("Model prepared for k-bit training.")

### 2.3 Generate Baseline Captions

In [None]:
def generate_captions_batch(images, model, processor, max_length=50):
    """
    Generate captions for a batch of images.
    """
    inputs = processor(images=images, return_tensors="pt").to(device)
    
    generated_ids = model.generate(
        **inputs,
        max_length=max_length,
        num_beams=5,
        do_sample=False
    )
    
    captions = processor.batch_decode(generated_ids, skip_special_tokens=True)
    return captions

# Generate baseline captions for a few samples
print("\n=== Baseline Captions ===")
num_samples = 5
for i in range(num_samples):
    sample = caption_dataset[i * 200]
    image = sample['image']
    reference_caption = sample['caption'][0] if isinstance(sample['caption'], list) else sample['caption']
    
    generated_caption = generate_captions_batch([image], blip_model, blip_processor)[0]
    
    print(f"\nImage {i+1}:")
    print(f"Reference: {reference_caption}")
    print(f"Generated: {generated_caption}")

### 2.4 Configure LoRA

In [None]:
# TODO: Configure LoRA for BLIP-2
# Recommended settings:
#   - r: 8-16
#   - lora_alpha: 32
#   - target_modules: inspect the architecture (e.g. print(blip_model)) and select appropriate modules
#   - For the OPT language model: ["q_proj", "v_proj", "k_proj", "out_proj", "fc1", "fc2"]
#   - The Q-Former layers may use different module names; add them here
#     if you also want to adapt the Q-Former
#   - task_type is left unset: OPT is a decoder-only language model,
#     so "SEQ_2_SEQ_LM" does not describe it

lora_config_blip = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "out_proj", "fc1", "fc2"],
    lora_dropout=0.05,
    bias="none",
)

# Apply LoRA
blip_model = get_peft_model(blip_model, lora_config_blip)
blip_model.print_trainable_parameters()

print("\nLoRA applied to BLIP-2!")

### 2.5 Prepare Dataset for Training
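
When captions are padded to a fixed length, the padding positions should not contribute to the language-modeling loss. A common convention is to set those label positions to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The framework-free sketch below shows the effect on a toy sequence. The token ids and per-token losses are made-up values, and the helper names are hypothetical.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(input_ids: List[int], attention_mask: List[int]) -> List[int]:
    """Copy input ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]


def mean_token_loss(token_losses: List[float], labels: List[int]) -> float:
    """Average per-token loss over the positions that are not ignored."""
    kept = [loss for loss, label in zip(token_losses, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Toy example: four real tokens followed by three padding tokens (id 1 is assumed to be padding)
input_ids = [2, 100, 200, 300, 1, 1, 1]
attention_mask = [1, 1, 1, 1, 0, 0, 0]
token_losses = [2.0, 1.0, 3.0, 2.0, 0.1, 0.1, 0.1]  # padding is trivially easy to predict

labels = mask_padding(input_ids, attention_mask)
print("labels:", labels)
print(f"loss with padding masked:  {mean_token_loss(token_losses, labels):.4f}")
print(f"loss with padding counted: {mean_token_loss(token_losses, input_ids):.4f}")
```

Counting padding makes the average loss look lower than it really is and spends gradient on predicting pad tokens, so the masked version is the more honest training signal.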

In [None]:
class ImageCaptioningDataset(Dataset):
    """
    Dataset for image captioning training.
    """
    
    def __init__(self, dataset, processor, max_length=50):
        self.dataset = dataset
        self.processor = processor
        self.max_length = max_length
    
    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self, idx):
        sample = self.dataset[idx]
        image = sample['image']
        caption = sample['caption']
        
        # Handle multiple captions (take first one)
        if isinstance(caption, list):
            caption = caption[0]
        
        # TODO: Process image and caption
        # Hint: Use processor to encode both image and text
        encoding = self.processor(
            images=image,
            text=caption,
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt"
        )
        
        # Remove batch dimension
        encoding = {k: v.squeeze(0) for k, v in encoding.items()}
        
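        # Labels for language modeling: copy the input ids and set padded
        # positions to -100 so the loss ignores them
        labels = encoding['input_ids'].clone()
        labels[encoding['attention_mask'] == 0] = -100
        encoding['labels'] = labels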
        
        return encoding

# Split dataset
train_val_split = caption_dataset.train_test_split(test_size=0.1, seed=42)
train_caption_data = train_val_split['train']
val_caption_data = train_val_split['test']

# Create datasets
train_captioning_dataset = ImageCaptioningDataset(train_caption_data, blip_processor)
val_captioning_dataset = ImageCaptioningDataset(val_caption_data, blip_processor)

print(f"Training dataset: {len(train_captioning_dataset)} samples")
print(f"Validation dataset: {len(val_captioning_dataset)} samples")

### 2.6 Training Configuration

In [None]:
# TODO: Define training arguments for BLIP-2
training_args_blip = TrainingArguments(
    output_dir="./blip2_captioning_lora",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    warmup_steps=100,
    logging_steps=50,
    save_steps=200,
    eval_strategy="steps",  # named evaluation_strategy in transformers < 4.41
    eval_steps=200,
    save_total_limit=2,
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    gradient_accumulation_steps=4,
    dataloader_num_workers=2,
    remove_unused_columns=False,
)

print("Training configuration set.")

### 2.7 Train the Model

In [None]:
# Initialize trainer
trainer_blip = Trainer(
    model=blip_model,
    args=training_args_blip,
    train_dataset=train_captioning_dataset,
    eval_dataset=val_captioning_dataset,
)

# TODO: Train the model
print("\n=== Starting BLIP-2 Fine-tuning ===")
train_result_blip = trainer_blip.train()

print("\n=== Training Completed ===")
print(f"Training loss: {train_result_blip.training_loss:.4f}")

# Save model
trainer_blip.save_model("./blip2_captioning_final")
print("\nModel saved to ./blip2_captioning_final")

### 2.8 Evaluate Fine-tuned Model

In [None]:
# Load evaluation metrics
bleu_metric = evaluate.load("bleu")
rouge_metric = evaluate.load("rouge")

def evaluate_captioning(model, processor, dataset, num_samples=100):
    """
    Evaluate captioning model using BLEU and ROUGE.
    """
    model.eval()
    
    references = []
    predictions = []
    
    print(f"Evaluating on {num_samples} samples...")
    
    # TODO: Generate captions and compare with references
    for idx in tqdm(range(min(num_samples, len(dataset)))):
        sample = dataset[idx]
        image = sample['image']
        reference = sample['caption']
        if isinstance(reference, list):
            reference = reference[0]
        
        # Generate caption
        generated = generate_captions_batch([image], model, processor)[0]
        
        references.append([reference])
        predictions.append(generated)
    
    # Calculate metrics
    bleu_score = bleu_metric.compute(predictions=predictions, references=references)
    rouge_score = rouge_metric.compute(predictions=predictions, references=[r[0] for r in references])
    
    return {
        'bleu': bleu_score['bleu'],
        'rouge_l': rouge_score['rougeL'],
        'predictions': predictions[:10],  # Save first 10 for analysis
        'references': references[:10]
    }

# Evaluate
print("\n=== Evaluating Fine-tuned Model ===")
results = evaluate_captioning(blip_model, blip_processor, val_caption_data, num_samples=100)

print(f"\nBLEU Score: {results['bleu']:.4f}")
print(f"ROUGE-L Score: {results['rouge_l']:.4f}")

# Show sample predictions
print("\n=== Sample Predictions ===")
for i in range(5):
    print(f"\nSample {i+1}:")
    print(f"Reference: {results['references'][i][0]}")
    print(f"Predicted: {results['predictions'][i]}")

### 2.9 Qualitative Analysis

In [None]:
# Generate and visualize captions for test images
print("\n=== Qualitative Results ===")

fig, axes = plt.subplots(3, 2, figsize=(12, 15))
axes = axes.flatten()

for idx in range(6):
    sample = val_caption_data[idx * 50]
    image = sample['image']
    reference = sample['caption'][0] if isinstance(sample['caption'], list) else sample['caption']
    
    # Generate caption
    generated = generate_captions_batch([image], blip_model, blip_processor)[0]
    
    axes[idx].imshow(image)
    axes[idx].set_title(f"Ref: {reference[:40]}...\nGen: {generated[:40]}...", fontsize=9)
    axes[idx].axis('off')

plt.tight_layout()
plt.show()

### 2.10 Analysis Questions

**TODO: Answer the following questions:**

1. What were the BLEU and ROUGE scores? How do they compare to baseline?
2. Analyze the generated captions: What did the model learn? What mistakes does it still make?
3. How suitable is this approach for medical imaging or your chosen domain?
4. What data augmentation techniques could improve performance?
5. How would you evaluate caption quality beyond automatic metrics?

**Your Analysis:**

[Write your analysis here]

---

# Exercise 3: Fine-tune VLM for Visual Question Answering (30 points)

## Objective

Fine-tune a Vision-Language Model (e.g., BLIP-2 or LLaVA) for visual question answering on a specific domain.

## Background

VQA requires understanding both visual content and natural language questions to generate accurate answers. You'll fine-tune a model to answer questions about images in a specific domain.

## Tasks

1. Load and prepare a VQA dataset
2. Configure LoRA for the VLM
3. Implement VQA fine-tuning
4. Train the model
5. Evaluate accuracy and answer quality
6. Analyze model performance on different question types

## Grading Criteria

- Data preparation (5 points)
- LoRA configuration (5 points)
- Training implementation (10 points)
- Evaluation (5 points)
- Analysis (5 points)

### 3.1 Load VQA Dataset

In [None]:
# Load VQA dataset
# Options: VQAv2, GQA, OK-VQA, TextVQA
# For this exercise, we'll use a subset

print("Loading VQA dataset...")

# TODO: Load VQA dataset
# You can use HuggingFace datasets or download from official sources
# For demonstration, we'll create a simple structure

# Example: Using a VQA-style dataset
try:
    vqa_dataset = load_dataset("Multimodal-Fatima/OK-VQA_train", split="train[:10%]")
    print(f"Dataset loaded: {len(vqa_dataset)} samples")
except Exception as e:
    print(f"Could not load OK-VQA ({e}). Please use an alternative VQA dataset.")
    # Create dummy data for demonstration
    vqa_dataset = None

if vqa_dataset:
    print(f"\nDataset features: {vqa_dataset.features}")
    
    # Explore sample
    sample = vqa_dataset[0]
    print(f"\nSample:")
    for key in sample.keys():
        if key != 'image':
            print(f"{key}: {sample[key]}")

### 3.2 Visualize VQA Samples

In [None]:
if vqa_dataset:
    # Visualize samples
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()
    
    for idx in range(6):
        sample = vqa_dataset[idx * 100]
        image = sample['image']
        question = sample.get('question', 'N/A')
        answer = sample.get('answer', sample.get('answers', 'N/A'))
        
        axes[idx].imshow(image)
        axes[idx].set_title(f"Q: {question[:30]}...\nA: {str(answer)[:30]}...", fontsize=9)
        axes[idx].axis('off')
    
    plt.tight_layout()
    plt.show()

### 3.3 Prepare Model for VQA

In [None]:
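# TODO: Load a VLM suitable for VQA
# Options: BLIP-2, InstructBLIP, LLaVA
# We'll use BLIP-2 again, loaded as a fresh instance so the captioning
# adapters from Exercise 2 are not reused

print("Preparing model for VQA fine-tuning...")

# If blip_model from Exercise 2 is still in memory, consider freeing it first
# (del blip_model; torch.cuda.empty_cache()) to avoid running out of GPU memory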
vqa_model_name = "Salesforce/blip2-opt-2.7b"
vqa_processor = Blip2Processor.from_pretrained(vqa_model_name)
vqa_model = Blip2ForConditionalGeneration.from_pretrained(
    vqa_model_name,
    load_in_8bit=True,
    device_map="auto"
)

vqa_model = prepare_model_for_kbit_training(vqa_model)
print("Model loaded and prepared.")

### 3.4 Test Baseline VQA Performance

In [None]:
if vqa_dataset:
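    def answer_vqa(image, question, model, processor):
        """Generate answer for a VQA question."""
        # Format prompt
        prompt = f"Question: {question} Answer:"
        
        inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
        
        # max_new_tokens bounds only the generated answer, not the prompt
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=20,
            num_beams=5
        )
        
        answer = processor.decode(generated_ids[0], skip_special_tokens=True).strip()
        # Some transformers versions echo the prompt in the output; drop it if present
        if answer.startswith(prompt):
            answer = answer[len(prompt):].strip()
        return answer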
    
    # Test baseline
    print("\n=== Baseline VQA Performance ===")
    for i in range(5):
        sample = vqa_dataset[i * 200]
        image = sample['image']
        question = sample['question']
        true_answer = sample.get('answer', sample.get('answers', 'N/A'))
        
        predicted_answer = answer_vqa(image, question, vqa_model, vqa_processor)
        
        print(f"\nQuestion: {question}")
        print(f"True Answer: {true_answer}")
        print(f"Predicted: {predicted_answer}")

### 3.5 Configure LoRA for VQA

In [None]:
# TODO: Configure LoRA for VQA model
# task_type is left unset: OPT is a decoder-only language model,
# so "SEQ_2_SEQ_LM" does not describe it
lora_config_vqa = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "out_proj", "fc1", "fc2"],
    lora_dropout=0.05,
    bias="none",
)

vqa_model = get_peft_model(vqa_model, lora_config_vqa)
vqa_model.print_trainable_parameters()

print("\nLoRA configured for VQA.")

### 3.6 Prepare VQA Dataset for Training

In [None]:
class VQADataset(Dataset):
    """
    Dataset for VQA training.
    """
    
    def __init__(self, dataset, processor, max_length=77):
        self.dataset = dataset
        self.processor = processor
        self.max_length = max_length
    
    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self, idx):
        sample = self.dataset[idx]
        image = sample['image']
        question = sample['question']
        answer = sample.get('answer', sample.get('answers', ''))
        
        # Handle multiple answers (take first or most common)
        if isinstance(answer, list):
            answer = answer[0]
        
        # Format the full training text as "Question: {question} Answer: {answer}"
        text = f"Question: {question} Answer: {answer}"
        
        # Encode image and text together
        encoding = self.processor(
            images=image,
            text=text,
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt"
        )
        
        # Remove batch dimension
        encoding = {k: v.squeeze(0) for k, v in encoding.items()}
        
        # BLIP-2 OPT is a decoder-only language model, so labels must align
        # token-for-token with input_ids. Padded positions are set to -100 so
        # the loss ignores them. The loss here also covers the question tokens;
        # masking them as well is a possible refinement.
        labels = encoding['input_ids'].clone()
        labels[encoding['attention_mask'] == 0] = -100
        encoding['labels'] = labels
        
        return encoding

if vqa_dataset:
    # Split dataset
    vqa_split = vqa_dataset.train_test_split(test_size=0.1, seed=42)
    train_vqa = vqa_split['train']
    val_vqa = vqa_split['test']
    
    # Create datasets
    train_vqa_dataset = VQADataset(train_vqa, vqa_processor)
    val_vqa_dataset = VQADataset(val_vqa, vqa_processor)
    
    print(f"Training: {len(train_vqa_dataset)} samples")
    print(f"Validation: {len(val_vqa_dataset)} samples")

### 3.7 Training Configuration and Training

In [None]:
if vqa_dataset:
    # TODO: Define training arguments
    training_args_vqa = TrainingArguments(
        output_dir="./vqa_model_lora",
        num_train_epochs=3,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        learning_rate=5e-5,
        warmup_steps=100,
        logging_steps=50,
        save_steps=200,
        eval_strategy="steps",  # named evaluation_strategy in transformers < 4.41
        eval_steps=200,
        save_total_limit=2,
        load_best_model_at_end=True,
        fp16=torch.cuda.is_available(),
        gradient_accumulation_steps=4,
        remove_unused_columns=False,
    )
    
    # Initialize trainer
    trainer_vqa = Trainer(
        model=vqa_model,
        args=training_args_vqa,
        train_dataset=train_vqa_dataset,
        eval_dataset=val_vqa_dataset,
    )
    
    # TODO: Train the model
    print("\n=== Starting VQA Fine-tuning ===")
    train_result_vqa = trainer_vqa.train()
    
    print("\n=== Training Completed ===")
    print(f"Training loss: {train_result_vqa.training_loss:.4f}")
    
    # Save model
    trainer_vqa.save_model("./vqa_model_final")
    print("\nModel saved to ./vqa_model_final")
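
Before launching training, it can help to check how the batch settings above interact with `eval_steps`, `save_steps`, and `warmup_steps`. The sketch below approximates the number of optimizer steps for a single-GPU run; the dataset size of 5,000 is a hypothetical value, and the exact step count reported by `Trainer` may differ slightly depending on the library version.

```python
import math

def training_schedule(num_samples, per_device_batch_size, grad_accum_steps,
                      num_epochs, eval_steps, num_devices=1):
    """Approximate optimizer-step counts for a training run."""
    # Samples consumed per optimizer update
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        # How many times step-based evaluation would trigger
        "num_evaluations": total_steps // eval_steps,
    }

# Values from the training arguments above; num_samples is hypothetical
schedule = training_schedule(
    num_samples=5000,
    per_device_batch_size=8,
    grad_accum_steps=4,
    num_epochs=3,
    eval_steps=200,
)
print(schedule)
```

If `total_steps` comes out smaller than `eval_steps`, no step-based evaluation or checkpoint happens during training, so consider lowering `eval_steps` and `save_steps` for small datasets. In this hypothetical run, 100 warmup steps would cover roughly a fifth of training.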

### 3.8 Evaluate VQA Model

In [None]:
if vqa_dataset:
    def evaluate_vqa_accuracy(model, processor, dataset, num_samples=100):
        """Evaluate VQA model accuracy."""
        model.eval()
        
        correct = 0
        total = 0
        
        predictions = []
        references = []
        
        print(f"Evaluating on {num_samples} samples...")
        
        # TODO: Generate answers and compare with ground truth
        for idx in tqdm(range(min(num_samples, len(dataset)))):
            sample = dataset[idx]
            image = sample['image']
            question = sample['question']
            true_answer = sample.get('answer', sample.get('answers', ''))
            
            if isinstance(true_answer, list):
                # Use the first listed answer; guard against an empty list
                true_answer = true_answer[0] if true_answer else ''
            
            # Generate answer
            predicted_answer = answer_vqa(image, question, model, processor)
            
            predictions.append(predicted_answer)
            references.append(true_answer)
            
            # Simple exact match (you can use more sophisticated matching)
            if predicted_answer.lower().strip() == true_answer.lower().strip():
                correct += 1
            total += 1
        
        # Avoid division by zero when the dataset is empty
        accuracy = correct / total if total else 0.0
        
        return {
            'accuracy': accuracy,
            'predictions': predictions[:10],
            'references': references[:10]
        }
    
    # Evaluate
    print("\n=== VQA Evaluation ===")
    vqa_results = evaluate_vqa_accuracy(vqa_model, vqa_processor, val_vqa, num_samples=100)
    
    print(f"\nAccuracy: {vqa_results['accuracy']*100:.2f}%")
    
    # Show samples
    print("\n=== Sample Predictions ===")
    for i in range(min(5, len(vqa_results['predictions']))):
        print(f"\nSample {i+1}:")
        print(f"Reference: {vqa_results['references'][i]}")
        print(f"Predicted: {vqa_results['predictions'][i]}")
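
Exact string matching penalizes harmless differences such as punctuation or a leading article. As one option for the "more sophisticated matching" mentioned in the evaluation code, the sketch below normalizes answers before comparing them and, when several human answers are available per question, applies the commonly used VQA soft score `min(matches / 3, 1)`. This is a simplified version: the official VQA evaluation applies additional normalization (number words, contractions) that is omitted here, and the helper names are illustrative.

```python
import string

_ARTICLES = {"a", "an", "the"}

def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, and drop English articles."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in text.split() if tok not in _ARTICLES]
    return " ".join(tokens)

def soft_vqa_score(prediction: str, human_answers: list[str]) -> float:
    """Score 1.0 if at least three annotators gave the predicted answer."""
    pred = normalize_answer(prediction)
    matches = sum(normalize_answer(ans) == pred for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Made-up examples
print(normalize_answer("The Dog."))
print(soft_vqa_score("a dog", ["dog", "dog", "puppy", "Dog!"]))
print(soft_vqa_score("dog", ["dog", "cat", "cat"]))
```

Averaging `soft_vqa_score` over the evaluation samples gives a soft accuracy. With a single reference answer per question, comparing `normalize_answer(prediction)` with `normalize_answer(reference)` is the closest equivalent.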

### 3.9 Analyze Performance by Question Type

In [None]:
# TODO: Categorize questions and analyze performance
# Common question types: What, Where, When, Who, How many, Yes/No

if vqa_dataset:
    def categorize_question(question):
        """Categorize question by type."""
        question_lower = question.lower().strip()
        if question_lower.startswith('what'):
            return 'What'
        elif question_lower.startswith('where'):
            return 'Where'
        elif question_lower.startswith('when'):
            return 'When'
        elif question_lower.startswith('who'):
            return 'Who'
        elif question_lower.startswith('how many') or question_lower.startswith('how much'):
            return 'Count'
        # Yes/No questions open with an auxiliary verb. A substring test would
        # also match unrelated words such as "this" or "candle".
        elif question_lower.startswith(('is ', 'are ', 'does ', 'do ', 'can ')):
            return 'Yes/No'
        else:
            return 'Other'
    
    # Analyze by question type
    question_types = {}
    
    for i in range(min(100, len(val_vqa))):
        sample = val_vqa[i]
        question = sample['question']
        q_type = categorize_question(question)
        
        if q_type not in question_types:
            question_types[q_type] = []
        question_types[q_type].append(i)
    
    print("\n=== Question Type Distribution ===")
    for q_type, indices in question_types.items():
        print(f"{q_type}: {len(indices)} questions")
    
    # TODO: Evaluate performance per question type
    print("\n=== Performance by Question Type ===")
    print("(Implement detailed analysis here)")
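
As a starting point for the per-type analysis, the sketch below computes exact-match accuracy per category from three parallel lists. The function name and the sample data are illustrative. Note that `evaluate_vqa_accuracy` above returns only the first 10 predictions and references, so it would need to return the full lists (or the loop would need to be repeated) to feed this analysis.

```python
from collections import defaultdict

def accuracy_by_type(question_types, predictions, references):
    """Exact-match accuracy for each question category."""
    stats = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for q_type, pred, ref in zip(question_types, predictions, references):
        stats[q_type][1] += 1
        if pred.lower().strip() == ref.lower().strip():
            stats[q_type][0] += 1
    return {q_type: correct / total for q_type, (correct, total) in stats.items()}

# Made-up data: one category label, prediction, and reference per question
types = ["What", "Yes/No", "Yes/No", "Count"]
preds = ["dog", "yes", "no", "3"]
refs = ["dog", "yes", "yes", "2"]

per_type = accuracy_by_type(types, preds, refs)
for q_type, acc in per_type.items():
    print(f"{q_type}: {acc * 100:.1f}%")
```

Categories with only a handful of questions give noisy estimates, so report the question count alongside each accuracy.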

### 3.10 Analysis Questions

**TODO: Answer the following questions:**

1. What was the overall VQA accuracy? How does it compare to a baseline, such as the model before fine-tuning?
2. Which question types does the model handle well? Which ones are challenging?
3. What patterns do you notice in the model's mistakes?
4. How could you improve the model's performance further?
5. What are the practical applications of this fine-tuned VQA model?
6. How does multimodal understanding differ from single-modality tasks?
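
For the baseline comparison in question 1, one simple reference point (in addition to the model before fine-tuning) is a majority-answer baseline that always predicts the most frequent training answer. The sketch below uses made-up answer lists, and the function name is illustrative.

```python
from collections import Counter

def majority_baseline_accuracy(train_answers, val_answers):
    """Accuracy of always predicting the most frequent training answer."""
    counts = Counter(ans.lower().strip() for ans in train_answers)
    most_common, _ = counts.most_common(1)[0]
    hits = sum(ans.lower().strip() == most_common for ans in val_answers)
    return most_common, hits / len(val_answers)

# Made-up answers for illustration
train_answers = ["yes", "no", "yes", "2", "yes"]
val_answers = ["yes", "no", "yes", "dog"]

answer, baseline_acc = majority_baseline_accuracy(train_answers, val_answers)
print(f"Always answering '{answer}' gives {baseline_acc * 100:.1f}% accuracy")
```

A fine-tuned model that does not clearly beat this number has likely learned the answer prior rather than anything about the images.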

**Your Analysis:**

[Write your analysis here]

---

# Final Report

## Summary of Exercises

Write a 1-2 page summary covering:

1. **Overview**: Brief description of each exercise and objectives
2. **Results**: Key metrics and findings from each exercise
3. **Comparison**: Compare the three approaches (contrastive learning, captioning, VQA)
4. **Challenges**: What difficulties did you encounter? How did you overcome them?
5. **Insights**: What did you learn about multimodal learning?
6. **Future Work**: How would you extend these experiments?

## Deliverables Checklist

- [ ] Completed Exercise 1 with code, results, and analysis
- [ ] Completed Exercise 2 with code, results, and analysis
- [ ] Completed Exercise 3 with code, results, and analysis
- [ ] Final report (1-2 pages)
- [ ] Model checkpoints saved (LoRA weights)
- [ ] All visualizations and evaluation metrics included
- [ ] Code is well-documented and reproducible

## Submission Instructions

1. Export this notebook as both `.ipynb` and `.pdf`
2. Include your final report as a separate PDF
3. Zip all model checkpoints (LoRA weights only)
4. Submit via Canvas/course portal
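
To keep the archive limited to LoRA weights, one approach is to zip only the adapter files. The sketch below assumes the checkpoint directory was written by PEFT, which names its files `adapter_config.json` and `adapter_model.safetensors` (or `adapter_model.bin` in older versions). The helper name and archive name are illustrative, and the demo runs on placeholder files in a temporary directory rather than on a real checkpoint.

```python
import tempfile
import zipfile
from pathlib import Path

def zip_adapter_files(checkpoint_dir, zip_path):
    """Archive only files whose names start with 'adapter_'."""
    checkpoint_dir = Path(checkpoint_dir)
    added = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(checkpoint_dir.rglob("adapter_*")):
            if path.is_file():
                # Keep the checkpoint folder name inside the archive
                arcname = path.relative_to(checkpoint_dir.parent).as_posix()
                zf.write(path, arcname)
                added.append(arcname)
    return added

# Demo with placeholder files; optimizer.pt stands in for non-LoRA state
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "vqa_model_final"
    ckpt.mkdir()
    for name in ("adapter_config.json", "adapter_model.safetensors", "optimizer.pt"):
        (ckpt / name).write_bytes(b"placeholder")
    archived = zip_adapter_files(ckpt, Path(tmp) / "lora_weights.zip")
    print(archived)
```

For the actual submission, point `checkpoint_dir` at each saved model directory (for example `./vqa_model_final`) and confirm the listed files before uploading.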

**Good luck with your homework!**