# Multimodal Large Language Models - Lab Session

**Course:** Applied Data Science  
**Institution:** Clemson University  
**Duration:** 75 minutes

## Learning Objectives

In this lab, you will:
1. Set up the environment for multimodal LLMs
2. Load and use pre-trained multimodal models (CLIP, BLIP-2, LLaVA)
3. Perform zero-shot image classification with CLIP
4. Generate image captions with BLIP-2
5. Perform visual question answering
6. Fine-tune a multimodal model using LoRA
7. Deploy and evaluate models

## Prerequisites

- Understanding of Transformer architecture
- Familiarity with PyTorch
- Access to Palmetto cluster (or GPU with 16GB+ VRAM)

## Part 1: Environment Setup (5 minutes)

First, let's set up our environment with all necessary libraries.

In [None]:
# Install required packages
# Run this cell only once
!pip install -q transformers accelerate peft bitsandbytes pillow matplotlib requests torch torchvision

# For Palmetto cluster, ensure you're using a GPU node
# Request GPU node: qsub -I -l select=1:ncpus=16:mem=64gb:ngpus=1:gpu_model=a100,walltime=4:00:00

In [None]:
# Import libraries
import torch
import requests
from PIL import Image
import matplotlib.pyplot as plt
from io import BytesIO
import numpy as np
from transformers import (
    CLIPProcessor, CLIPModel,
    Blip2Processor, Blip2ForConditionalGeneration,
    AutoProcessor, AutoModelForVision2Seq,
    TrainingArguments, Trainer
)
from peft import LoraConfig, get_peft_model
import warnings
warnings.filterwarnings('ignore')

# Check GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## Part 2: Helper Functions (5 minutes)

Let's create some utility functions for loading and displaying images.

In [None]:
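def load_image_from_url(url, timeout=30):
    """Load an image from a URL and convert it to RGB."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # Fail early on HTTP errors instead of handing an error page to PIL
    img = Image.open(BytesIO(response.content)).convert('RGB')
    return img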

def display_image(image, title="Image"):
    """Display an image with matplotlib."""
    plt.figure(figsize=(8, 6))
    plt.imshow(image)
    plt.title(title)
    plt.axis('off')
    plt.show()

def display_images_grid(images, titles=None, cols=3):
    """Display multiple images in a grid."""
    n = len(images)
    rows = (n + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(4*cols, 4*rows), squeeze=False)
    axes = axes.flatten()  # squeeze=False always returns a 2D array, even for a single image
    
    for idx, img in enumerate(images):
        axes[idx].imshow(img)
        if titles:
            axes[idx].set_title(titles[idx])
        axes[idx].axis('off')
    
    # Hide extra subplots
    for idx in range(n, len(axes)):
        axes[idx].axis('off')
    
    plt.tight_layout()
    plt.show()

# Test with sample images
sample_urls = [
    "https://images.unsplash.com/photo-1574158622682-e40e69881006",  # Cat
    "https://images.unsplash.com/photo-1552053831-71594a27632d",  # Dog
    "https://images.unsplash.com/photo-1542838132-92c53300491e"   # Beach
]

print("Loading sample images...")
sample_images = [load_image_from_url(url) for url in sample_urls]
display_images_grid(sample_images, titles=["Cat", "Dog", "Beach"])

## Part 3: CLIP - Zero-Shot Image Classification (15 minutes)

CLIP (Contrastive Language-Image Pre-training) learns a joint embedding space for images and text.
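
To make the "joint embedding space" idea concrete, here is a minimal sketch of the symmetric contrastive objective described in the CLIP paper, written in plain Python so it runs without a GPU. The similarity matrices and the fixed `logit_scale` are hypothetical toy values; the real model learns its temperature and operates on large batches of embeddings.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_contrastive_loss(sim, logit_scale=1.0):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    sim[i][j] is the cosine similarity between image i and text j;
    the matching pairs sit on the diagonal.
    """
    n = len(sim)
    logits = [[logit_scale * s for s in row] for row in sim]
    # Image-to-text: each row should put its probability mass on the diagonal.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: the same computation over columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy similarity matrices (hypothetical values).
aligned = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
mismatched = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.2], [0.0, 0.2, 0.9]]

print(f"aligned pairs:    {clip_contrastive_loss(aligned, 10.0):.4f}")
print(f"mismatched pairs: {clip_contrastive_loss(mismatched, 10.0):.4f}")
```

The loss is low when each image is most similar to its own caption and rises when the diagonal no longer dominates, which is the signal that pulls matching pairs together during pre-training.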

### 3.1 Load CLIP Model

In [None]:
print("Loading CLIP model...")
clip_model_name = "openai/clip-vit-base-patch32"
clip_processor = CLIPProcessor.from_pretrained(clip_model_name)
clip_model = CLIPModel.from_pretrained(clip_model_name).to(device)
print(f"CLIP model loaded on {device}")

# Model architecture overview
print(f"\nCLIP Model Architecture:")
print(f"Vision encoder: {clip_model.vision_model.__class__.__name__}")
print(f"Text encoder: {clip_model.text_model.__class__.__name__}")
print(f"Total parameters: {sum(p.numel() for p in clip_model.parameters()) / 1e6:.2f}M")

### 3.2 Zero-Shot Image Classification

CLIP can classify images into any text categories without task-specific training!
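
As a rough sketch of what happens under the hood, zero-shot classification scores each label by the cosine similarity between the image embedding and the label's text embedding, scales the scores, and applies a softmax. The 3-dimensional vectors and the `logit_scale` value below are illustrative assumptions, not values taken from the model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_probs(image_vec, label_vecs, logit_scale=100.0):
    """Softmax over scaled cosine similarities, one score per candidate label."""
    logits = [logit_scale * cosine(image_vec, t) for t in label_vecs.values()]
    m = max(logits)  # Subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return {label: e / z for label, e in zip(label_vecs, exps)}

# Hypothetical toy embeddings; real CLIP embeddings have hundreds of dimensions.
image_vec = [0.9, 0.1, 0.2]
label_vecs = {
    "a cat": [1.0, 0.0, 0.1],
    "a dog": [0.2, 1.0, 0.1],
    "a car": [0.0, 0.1, 1.0],
}

for label, p in zero_shot_probs(image_vec, label_vecs).items():
    print(f"{label}: {p:.4f}")
```

Because the softmax is taken only over the candidate labels you supply, the probabilities always sum to 1 across that list, even when none of the labels actually describes the image.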

In [None]:
def classify_image_clip(image, candidate_labels, model, processor, device):
    """
    Perform zero-shot image classification using CLIP.
    
    Args:
        image: PIL Image
        candidate_labels: List of text labels
        model: CLIP model
        processor: CLIP processor
        device: Device to run on
    
    Returns:
        Dictionary with labels and probabilities
    """
    # Process inputs
    inputs = processor(
        text=candidate_labels,
        images=image,
        return_tensors="pt",
        padding=True
    ).to(device)
    
    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Calculate probabilities
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1).cpu().numpy()[0]
    
    # Create results dictionary
    results = {label: float(prob) for label, prob in zip(candidate_labels, probs)}
    results = dict(sorted(results.items(), key=lambda x: x[1], reverse=True))
    
    return results

# Test classification
test_image = sample_images[0]  # Cat image
labels = ["a cat", "a dog", "a bird", "a car", "a beach"]

print("\n=== Zero-Shot Classification ===")
display_image(test_image, "Test Image")
results = classify_image_clip(test_image, labels, clip_model, clip_processor, device)

print("\nPredictions:")
for label, prob in results.items():
    print(f"{label}: {prob*100:.2f}%")

# Visualize results
plt.figure(figsize=(10, 4))
plt.bar(results.keys(), results.values())
plt.xlabel('Labels')
plt.ylabel('Probability')
plt.title('CLIP Zero-Shot Classification Results')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### 3.3 Image-Text Similarity

CLIP can compute similarity between any image-text pairs.

In [None]:
def compute_image_text_similarity(image, texts, model, processor, device):
    """Compute similarity scores between image and multiple texts."""
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True).to(device)
    
    with torch.no_grad():
        outputs = model(**inputs)
        
    # Get similarity scores
    image_embeds = outputs.image_embeds
    text_embeds = outputs.text_embeds
    
    # Normalize embeddings
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    
    # Compute cosine similarity
    similarity = (image_embeds @ text_embeds.T).cpu().numpy()[0]
    
    return {text: float(sim) for text, sim in zip(texts, similarity)}

# Test with descriptive sentences
descriptions = [
    "a photo of a cute cat sitting",
    "a professional photograph of an animal",
    "a car on a highway",
    "a scenic beach view",
    "a domestic pet indoors"
]

similarities = compute_image_text_similarity(
    sample_images[0], descriptions, clip_model, clip_processor, device
)

print("\n=== Image-Text Similarity Scores ===")
for desc, sim in sorted(similarities.items(), key=lambda x: x[1], reverse=True):
    print(f"{sim:.4f} - {desc}")

### 3.4 Image Retrieval

Use CLIP to find the best matching image for a text query.

In [None]:
def retrieve_images(query, images, model, processor, device):
    """Retrieve images that best match the text query."""
    # Process all images at once
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True).to(device)
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Get similarity scores
    logits_per_text = outputs.logits_per_text
    probs = logits_per_text.softmax(dim=1).cpu().numpy()[0]
    
    # Sort by similarity
    sorted_indices = np.argsort(probs)[::-1]
    
    return sorted_indices, probs[sorted_indices]

# Test image retrieval
query = "a fluffy pet animal"
indices, scores = retrieve_images(query, sample_images, clip_model, clip_processor, device)

print(f"\n=== Image Retrieval for query: '{query}' ===")
retrieved_images = [sample_images[i] for i in indices]
titles = [f"Rank {i+1}: {scores[i]:.4f}" for i in range(len(indices))]
display_images_grid(retrieved_images, titles=titles)

## Part 4: BLIP-2 - Image Captioning (15 minutes)

BLIP-2 uses a Q-Former to bridge frozen vision and language models for efficient multimodal generation.

### 4.1 Load BLIP-2 Model

In [None]:
print("Loading BLIP-2 model...")
blip2_model_name = "Salesforce/blip2-opt-2.7b"

# Note: BLIP-2 is larger; 8-bit quantization helps it fit on smaller GPUs
from transformers import BitsAndBytesConfig

blip2_processor = Blip2Processor.from_pretrained(blip2_model_name)
blip2_model = Blip2ForConditionalGeneration.from_pretrained(
    blip2_model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights to save memory
    device_map="auto"
)
print("BLIP-2 model loaded with 8-bit quantization")

print(f"\nBLIP-2 Architecture:")
print(f"Vision encoder: Frozen pre-trained ViT")
print(f"Q-Former: Learnable queries + cross-attention")
print(f"Language model: OPT-2.7B (frozen)")

### 4.2 Generate Image Captions

In [None]:
def generate_caption(image, model, processor, max_length=50, num_beams=5):
    """
    Generate a caption for an image using BLIP-2.
    
    Args:
        image: PIL Image
        model: BLIP-2 model
        processor: BLIP-2 processor
        max_length: Maximum caption length
        num_beams: Number of beams for beam search
    
    Returns:
        Generated caption string
    """
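    # Match the model's device and its half-precision (fp16) non-quantized layers
    inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
    
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=max_length,  # Bound the number of generated tokens
        num_beams=num_beams,
        do_sample=False
    )
    
    caption = processor.decode(generated_ids[0], skip_special_tokens=True).strip()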
    return caption

# Generate captions for all sample images
print("\n=== Image Captioning with BLIP-2 ===")
captions = []
for idx, img in enumerate(sample_images):
    print(f"\nGenerating caption for image {idx+1}...")
    caption = generate_caption(img, blip2_model, blip2_processor)
    captions.append(caption)
    print(f"Caption: {caption}")

# Display images with captions
display_images_grid(sample_images, titles=captions)

### 4.3 Visual Question Answering with BLIP-2

In [None]:
def answer_question(image, question, model, processor, max_length=50):
    """
    Answer a question about an image using BLIP-2.
    
    Args:
        image: PIL Image
        question: Question string
        model: BLIP-2 model
        processor: BLIP-2 processor
        max_length: Maximum answer length
    
    Returns:
        Answer string
    """
    # Format prompt for VQA
    prompt = f"Question: {question} Answer:"
    
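    # Match the model's device and its half-precision (fp16) non-quantized layers
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
    
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=max_length,  # Bound the answer length, excluding the prompt
        num_beams=5,
        do_sample=False
    )
    
    answer = processor.decode(generated_ids[0], skip_special_tokens=True).strip()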
    return answer

# Test VQA
print("\n=== Visual Question Answering ===")
test_image = sample_images[0]  # Cat image
display_image(test_image, "Test Image for VQA")

questions = [
    "What animal is in the image?",
    "What color is the animal?",
    "Is the animal indoors or outdoors?",
    "What is the animal doing?"
]

print("\nQuestions and Answers:")
for question in questions:
    answer = answer_question(test_image, question, blip2_model, blip2_processor)
    print(f"Q: {question}")
    print(f"A: {answer}\n")

## Part 5: Fine-tuning with LoRA (20 minutes)

Now let's learn how to fine-tune a multimodal model efficiently using LoRA (Low-Rank Adaptation).

### 5.1 Prepare a Custom Dataset

For demonstration, we'll create a small synthetic dataset. In practice, you'd use real data.

In [None]:
from torch.utils.data import Dataset, DataLoader

class ImageCaptionDataset(Dataset):
    """Simple dataset for image captioning."""
    
    def __init__(self, images, captions, processor):
        self.images = images
        self.captions = captions
        self.processor = processor
    
    def __len__(self):
        return len(self.images)
    
    def __getitem__(self, idx):
        image = self.images[idx]
        caption = self.captions[idx]
        
        # Process image and text
        encoding = self.processor(
            images=image,
            text=caption,
            padding="max_length",
            truncation=True,
            max_length=50,
            return_tensors="pt"
        )
        
        # Remove batch dimension
        encoding = {k: v.squeeze(0) for k, v in encoding.items()}
        
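        # Add labels (for language modeling); -100 makes the loss ignore padding positions
        labels = encoding["input_ids"].clone()
        labels[encoding["attention_mask"] == 0] = -100
        encoding["labels"] = labels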
        
        return encoding

# Create a small synthetic dataset
# In practice, load your actual dataset here
train_images = sample_images * 2  # Duplicate for demo
train_captions = [
    "a close-up photo of a cat",
    "a golden retriever dog",
    "a beautiful beach with waves",
    "a cute feline sitting",
    "a friendly dog outdoors",
    "ocean waves at sunset"
]

print(f"Dataset size: {len(train_images)} images")

### 5.2 Set up LoRA Configuration

LoRA reduces the number of trainable parameters by learning low-rank updates to weight matrices.
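
To see where the savings come from: for a frozen weight matrix of shape `d_out x d_in`, LoRA trains two small factors, `A` (`r x d_in`) and `B` (`d_out x r`), and adds `(lora_alpha / r) * B @ A` to the frozen weight. The arithmetic below is a back-of-the-envelope sketch; the hidden size of 768 and the 12 layers are assumptions about the ViT-B/32 vision tower, so compare the result with what the next cell prints.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in the two low-rank factors: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

def full_param_count(d_in, d_out):
    """Parameters in the dense weight matrix that LoRA leaves frozen."""
    return d_in * d_out

hidden = 768            # Assumed hidden size of the vision transformer
layers = 12             # Assumed number of transformer layers
modules_per_layer = 2   # q_proj and v_proj
r, lora_alpha = 8, 16

per_module = lora_param_count(hidden, hidden, r)
total_lora = per_module * layers * modules_per_layer
total_full = full_param_count(hidden, hidden) * layers * modules_per_layer

print(f"LoRA parameters per adapted matrix: {per_module:,}")
print(f"LoRA parameters in total:           {total_lora:,}")
print(f"Parameters in the adapted matrices: {total_full:,}")
print(f"Ratio: {total_lora / total_full:.4f}")
print(f"Update scaling (lora_alpha / r):    {lora_alpha / r}")
```

For square matrices the ratio simplifies to `2r / d`, so halving the rank halves the number of trainable parameters.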

In [None]:
# Create a fresh model for fine-tuning
print("Setting up model for fine-tuning...")

# For this demo, we'll use a smaller CLIP model
# In practice, you might fine-tune BLIP-2 or LLaVA
from transformers import CLIPVisionModel

# Load a vision model
vision_model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

# Configure LoRA
lora_config = LoraConfig(
    r=8,  # Rank of the low-rank matrices
    lora_alpha=16,  # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.1,
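    bias="none",
    # task_type is left unset: the FEATURE_EXTRACTION wrapper passes text-style
    # arguments (input_ids) to the base model, which a vision-only model rejects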
)

print("\nLoRA Configuration:")
print(f"Rank (r): {lora_config.r}")
print(f"Alpha: {lora_config.lora_alpha}")
print(f"Target modules: {lora_config.target_modules}")
print(f"Dropout: {lora_config.lora_dropout}")

# Apply LoRA to model
lora_model = get_peft_model(vision_model, lora_config)
lora_model.to(device)

# Print trainable parameters
trainable_params = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in lora_model.parameters())
print(f"\nTrainable parameters: {trainable_params:,} ({trainable_params/total_params*100:.2f}%)")
print(f"Total parameters: {total_params:,}")
print(f"\nMemory savings: Only {trainable_params/total_params*100:.2f}% of parameters need gradients!")

### 5.3 Fine-tuning Loop

The cell below outlines the fine-tuning workflow and builds example training arguments; it does not run a training loop. For real training, HuggingFace Trainer is the more robust choice.

In [None]:
# Note: This is a simplified demonstration
# For actual fine-tuning, refer to the homework notebook

print("\n=== Fine-tuning Demonstration ===")
print("\nKey steps for fine-tuning:")
print("1. Prepare your dataset (image-caption pairs)")
print("2. Configure LoRA with appropriate hyperparameters")
print("3. Set up training arguments:")
print("   - Learning rate: 1e-4 to 5e-5")
print("   - Batch size: 4-16 depending on GPU memory")
print("   - Epochs: 3-10")
print("   - Warmup steps: 10% of total steps")
print("4. Use HuggingFace Trainer for robust training")
print("5. Save only LoRA weights (very small file!)")
print("6. Evaluate on validation set")

# Example training arguments
training_args = TrainingArguments(
    output_dir="./lora_checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=5e-5,
    warmup_steps=100,
    logging_steps=10,
    save_steps=100,
    eval_strategy="steps",  # Named evaluation_strategy in transformers releases before 4.41
    eval_steps=100,
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),  # Mixed precision training
)

print("\nTraining Arguments:")
print(f"Epochs: {training_args.num_train_epochs}")
print(f"Batch size: {training_args.per_device_train_batch_size}")
print(f"Learning rate: {training_args.learning_rate}")
print(f"FP16: {training_args.fp16}")
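
The warmup guidance above (about 10% of total steps) is easier to reason about with the schedule written out. The sketch below mirrors the linear warmup followed by linear decay that Trainer uses by default; `total_steps = 1000` is a hypothetical value chosen so the numbers are easy to read.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: ramps 0 -> 1 during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 5e-5
total_steps = 1000                 # Hypothetical training length
warmup_steps = total_steps // 10   # 10% of total steps

for step in (0, 50, 100, 550, 1000):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

Note that `warmup_steps=100` in the arguments above only equals 10% of training if the run lasts about 1,000 optimizer steps; on a tiny dataset the whole run could end before warmup finishes.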

### 5.4 Saving and Loading LoRA Weights

In [None]:
# Save LoRA weights
lora_output_dir = "./clip_lora_demo"
lora_model.save_pretrained(lora_output_dir)
print(f"LoRA weights saved to {lora_output_dir}")

# Check file size (recent PEFT versions save .safetensors; older ones save .bin)
import os
for fname in ("adapter_model.safetensors", "adapter_model.bin"):
    lora_file = os.path.join(lora_output_dir, fname)
    if os.path.exists(lora_file):
        size_mb = os.path.getsize(lora_file) / (1024 * 1024)
        print(f"LoRA adapter size: {size_mb:.2f} MB")
        print("\nNotice: LoRA weights are tiny compared to full model!")
        break

# To load LoRA weights later:
print("\nTo load LoRA weights:")
print("from peft import PeftModel")
print("base_model = CLIPVisionModel.from_pretrained('openai/clip-vit-base-patch32')")
print("model = PeftModel.from_pretrained(base_model, './clip_lora_demo')")

## Part 6: Advanced Topics (10 minutes)

### 6.1 Multi-GPU Training

For larger datasets, you'll want to use multiple GPUs.

In [None]:
print("=== Multi-GPU Training on Palmetto ===")
print("\nTo request multiple GPUs:")
print("qsub -I -l select=1:ncpus=32:mem=128gb:ngpus=4:gpu_model=a100,walltime=8:00:00")
print("\nIn your Python script:")
print("""
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

# Training loop
for batch in train_dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
""")
print("\nAccelerate handles device placement and gradient synchronization automatically!")
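
One practical consequence of data-parallel training is that every GPU processes its own batch, so the batch size the optimizer effectively sees grows with the number of GPUs. The helper names and the dataset size of 10,000 below are hypothetical, and the step count is an approximation of what a training framework would report.

```python
import math

def effective_batch_size(per_device_batch, num_gpus, grad_accum_steps=1):
    """Examples contributing to each optimizer step across all GPUs."""
    return per_device_batch * num_gpus * grad_accum_steps

def optimizer_steps_per_epoch(num_examples, per_device_batch, num_gpus, grad_accum_steps=1):
    """Approximate number of optimizer updates in one pass over the data."""
    batch = effective_batch_size(per_device_batch, num_gpus, grad_accum_steps)
    return math.ceil(num_examples / batch)

# per_device_train_batch_size=4 on the 4-GPU request shown above
print(effective_batch_size(4, 4))
print(optimizer_steps_per_epoch(10_000, 4, 4))
```

Since a larger effective batch means fewer optimizer steps per epoch, step-based settings such as warmup, logging, and checkpoint intervals may need adjusting when moving from one GPU to several.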

### 6.2 Model Quantization

Reduce memory usage with quantization.

In [None]:
print("=== Model Quantization ===")
print("\n8-bit quantization (load_in_8bit=True):")
print("- Reduces weight memory by ~50% relative to fp16")
print("- Minimal accuracy loss")
print("- Often slower inference than fp16")
print("\n4-bit quantization (load_in_4bit=True):")
print("- Reduces weight memory by ~75% relative to fp16")
print("- Some accuracy loss")
print("- Inference speed depends on kernel support; often slower than fp16")
print("\nExample:")
print("""
import torch
from transformers import AutoModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModel.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)
""")
print("\nUse quantization when GPU memory is limited!")
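
The percentages above follow from the number of bytes stored per parameter. The sketch below estimates weight memory only, using an assumed parameter count of 2.7 billion (roughly the language model in BLIP-2 OPT-2.7B, excluding the vision encoder and Q-Former); activations, generation caches, and quantization metadata add to the real footprint.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "nf4": 0.5}

def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

num_params = 2.7e9  # Assumed parameter count, for illustration only

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(num_params, dtype):.2f} GB")
```

Comparing the estimate with the GPU memory printed in Part 1 is a quick way to decide which precision is worth trying first.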

### 6.3 Batch Inference for Efficiency

In [None]:
def batch_generate_captions(images, model, processor, batch_size=4):
    """Generate captions for multiple images efficiently."""
    all_captions = []
    
    for i in range(0, len(images), batch_size):
        batch_images = images[i:i+batch_size]
        # Cast to the model's device and fp16 to match the 8-bit BLIP-2 weights
        inputs = processor(images=batch_images, return_tensors="pt").to(model.device, torch.float16)
        
        with torch.no_grad():
            generated_ids = model.generate(**inputs, max_new_tokens=50, num_beams=5)
        
        captions = processor.batch_decode(generated_ids, skip_special_tokens=True)
        all_captions.extend(captions)
    
    return all_captions

print("\n=== Batch Inference ===")
print("Batch processing is usually faster than processing one image at a time.")
print(f"\nProcessing {len(sample_images)} images...")

import time
start = time.time()
batch_captions = batch_generate_captions(sample_images, blip2_model, blip2_processor, batch_size=3)
end = time.time()

print(f"Time taken: {end-start:.2f} seconds")
print("\nGenerated captions:")
for i, caption in enumerate(batch_captions):
    print(f"{i+1}. {caption}")

## Part 7: Evaluation and Benchmarking (5 minutes)

### 7.1 Caption Quality Metrics

In [None]:
# Install evaluation metrics
!pip install -q evaluate sacrebleu rouge-score

import evaluate

# Load metrics
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# Example: Evaluate generated vs. reference captions
references = [
    ["a close-up photo of a cat"],
    ["a golden retriever dog"],
    ["a beautiful beach with waves"]
]

predictions = [
    "a photo of a cat",
    "a dog sitting on grass",
    "ocean waves at sunset"
]

# Calculate BLEU score
bleu_score = bleu.compute(predictions=predictions, references=references)
print(f"\nBLEU Score: {bleu_score['bleu']:.4f}")

# Calculate ROUGE score
rouge_score = rouge.compute(predictions=predictions, references=[r[0] for r in references])
print(f"ROUGE-L: {rouge_score['rougeL']:.4f}")

print("\nNote: For comprehensive evaluation, also consider:")
print("- CIDEr: Consensus-based metric for captioning")
print("- SPICE: Scene-graph-based semantic metric")
print("- METEOR: Accounts for synonyms and paraphrases")
print("- Human evaluation: Always the gold standard!")
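
To interpret the BLEU number, it helps to see its core ingredient: clipped (modified) n-gram precision. The sketch below is a simplified illustration with whitespace tokenization and hypothetical helper names; full BLEU additionally combines several n-gram orders and applies a brevity penalty, so these values will not match `evaluate`'s output.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(prediction, references, n):
    """Fraction of predicted n-grams found in a reference, with clipped counts."""
    pred_counts = Counter(ngrams(prediction.split(), n))
    if not pred_counts:
        return 0.0
    # For each n-gram, the most times it appears in any single reference
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in pred_counts.items())
    return clipped / sum(pred_counts.values())

print(modified_precision("a photo of a cat", ["a close-up photo of a cat"], 1))
print(modified_precision("a photo of a cat", ["a close-up photo of a cat"], 2))
print(modified_precision("a dog sitting on grass", ["a golden retriever dog"], 1))
```

Short captions contain few higher-order n-grams, so BLEU on a handful of examples is noisy; this is one reason the metrics listed above are usually reported together.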

## Part 8: Summary and Next Steps (5 minutes)

### What We've Learned

1. **CLIP**: Zero-shot classification, image-text similarity
2. **BLIP-2**: Image captioning, visual question answering
3. **LoRA**: Parameter-efficient fine-tuning
4. **Practical considerations**: Quantization, batch inference, evaluation

### Key Takeaways

- Pre-trained multimodal models are powerful and accessible
- HuggingFace Transformers makes it easy to use these models
- LoRA enables fine-tuning with minimal resources
- Quantization helps fit larger models in memory
- Batch processing improves efficiency

### Next Steps: Homework

In the homework notebook, you will:
1. Fine-tune CLIP on a custom image classification dataset
2. Fine-tune BLIP for domain-specific image captioning
3. Fine-tune a VLM for visual question answering

### Resources

- [HuggingFace Transformers Documentation](https://huggingface.co/docs/transformers)
- [PEFT Documentation](https://huggingface.co/docs/peft)
- [CLIP Paper](https://arxiv.org/abs/2103.00020)
- [BLIP-2 Paper](https://arxiv.org/abs/2301.12597)
- [LLaVA Paper](https://arxiv.org/abs/2304.08485)
- [Palmetto Documentation](https://docs.rcd.clemson.edu/palmetto/)

## Cleanup

In [None]:
# Free up GPU memory
import gc

# vision_model still references the LoRA base model on the GPU, so delete it too
del clip_model, blip2_model, lora_model, vision_model
gc.collect()
torch.cuda.empty_cache()

print("GPU memory cleared. Ready for homework exercises!")