# Notebook 09: Multimodal - Image-to-Text (Image Captioning)

**Learning Objectives:**
- Generate text descriptions from images
- Use BLIP (Bootstrapping Language-Image Pre-training)
- Understand multimodal models
- Apply to image accessibility and content understanding

## Prerequisites

### Hardware Requirements

| Model Option | Model Name | Size | Min RAM | Recommended Setup | Notes |
|--------------|------------|------|---------|-------------------|-------|
| **small (CPU-friendly)** | Salesforce/blip-image-captioning-base | 990MB | 6GB | 6GB RAM, CPU | Good quality |
| **large (GPU-optimized)** | Salesforce/blip-image-captioning-large | 1.9GB | 8GB | 10GB+ VRAM (e.g., RTX 4080) | Better captions |

### Software Requirements
- Python 3.8+
- Libraries: `transformers`, `torch`, `Pillow` (imported as `PIL`), `requests`

In [None]:
import torch
from transformers import AutoProcessor, BlipForConditionalGeneration, set_seed
from PIL import Image
import requests
from io import BytesIO
import warnings
warnings.filterwarnings('ignore')

# Set seed for reproducibility
set_seed(1103)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

## Expected Behaviors

### First Time Running
- **Model Download**: ~990MB for BLIP-base (roughly 5-8 minutes, depending on connection speed)
- Large model combining vision and language
- Cached for subsequent runs

### Setup Cell Output
```
PyTorch version: 2.x.x
CUDA available: True/False
```

### Model Loading
```
Model loaded on: cpu (or cuda)
```
- **CPU**: 15-20 seconds (large multimodal model)
- **GPU**: 8-12 seconds

### Caption Output
- Returns natural language description
- Example: `"a cat sitting on a couch looking at the camera"`

### Caption Quality
- **Clear, single-subject images**: Very accurate, descriptive
- **Complex scenes**: Captures main elements, may miss details
- **Multiple objects**: Describes most prominent objects
- **Actions**: Often captures what's happening in scene

### Expected Caption Length
- **Unconditional**: 5-15 words typically
- **With prompt**: Longer, more specific descriptions
- Capped by the `max_length` parameter, which counts tokens rather than words

### Performance
- **Single image**:
  - CPU: 5-8 seconds
  - GPU: 1-2 seconds
- **Batch of 5 images**:
  - CPU: 20-30 seconds
  - GPU: 4-6 seconds
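
The timings above will vary with hardware, so it can help to measure them on your own machine. The following is a minimal sketch of a timing helper; `time_call` and `fake_caption` are hypothetical names introduced here for illustration, and `fake_caption` only stands in for `generate_caption` so the snippet runs without a model.

```python
import time
from statistics import mean


def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn(*args, **kwargs) `repeats` times.

    Returns (result_of_last_call, mean_seconds_per_call).
    """
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return result, mean(durations)


def fake_caption(image):
    """Stand-in for generate_caption; sleeps briefly instead of running a model."""
    time.sleep(0.01)
    return "a placeholder caption"


caption, seconds = time_call(fake_caption, None, repeats=3)
print(f"{caption!r} took {seconds:.3f}s on average")
```

In the notebook, the same helper could wrap the real function, for example `time_call(generate_caption, image)`. The first call is often slower than later ones because of one-time warm-up, so averaging over a few repeats gives a steadier number.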

### Caption Style
- **Factual and descriptive**
- Uses common language
- Focuses on visible elements
- Sometimes includes colors, positions, activities

### Conditional Captioning
- Can provide text prompts to guide captions
- Example prompts: "a photograph of", "this image shows"
- Helps steer caption style and content
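
With a prompt, the decoded caption usually starts with the prompt text itself, followed by the model's continuation. If only the continuation is wanted, the prefix can be removed afterwards. This is a small sketch under that assumption; `strip_prompt` is a hypothetical helper, not part of `transformers`.

```python
def strip_prompt(caption: str, prompt: str) -> str:
    """Remove a leading prompt from a caption, ignoring case and outer whitespace.

    Returns the caption unchanged if it does not start with the prompt.
    """
    caption = caption.strip()
    prompt = prompt.strip()
    if prompt and caption.lower().startswith(prompt.lower()):
        return caption[len(prompt):].strip()
    return caption


# Example strings are made up for illustration.
print(strip_prompt("a photograph of a cat sitting on a couch", "A photograph of"))
# → a cat sitting on a couch
```

The comparison is case-insensitive because the decoded text may come back lowercased even when the prompt was not.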

### Sampling for Variety
- `do_sample=True` generates diverse captions
- Same image can produce different valid captions
- Useful for creative applications
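
To build intuition for why sampling produces different captions, here is a toy sketch of temperature-scaled sampling over a made-up four-token vocabulary. The tokens and logits are invented for illustration and do not come from BLIP; the real model applies the same idea at every decoding step over its full vocabulary.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["cat", "kitten", "animal", "dog"]
logits = [3.0, 2.0, 1.0, 0.0]  # hypothetical scores for the next token

# Lower temperature concentrates probability on the top token;
# higher temperature spreads it out.
for temperature in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Sampling draws a token according to those probabilities,
# so repeated draws can differ.
rng = random.Random(1103)
probs = softmax_with_temperature(logits, 0.7)
print(rng.choices(tokens, weights=probs, k=5))
```

Without `do_sample=True`, generation picks tokens deterministically, which is why the default captions for a given image do not change between runs.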

### Common Observations
- Accurate for common objects/scenes (people, animals, vehicles)
- May hallucinate details not actually present
- Sometimes generic for unusual images
- Better on photos than drawings/artwork

### Multimodal Understanding
- Combines vision (what's in image) + language (how to describe it)
- Trained on millions of image-caption pairs
- Can describe relationships ("person holding phone")

In [None]:
# CHOOSE YOUR MODEL:

# Option 1: small model (CPU-friendly)
MODEL_NAME = "Salesforce/blip-image-captioning-base"  # 990MB

# Option 2: large model (GPU-optimized, better quality)
# MODEL_NAME = "Salesforce/blip-image-captioning-large"  # 1.9GB

print(f"Selected model: {MODEL_NAME}")

In [None]:
# Load model and processor
print(f"Loading {MODEL_NAME}...")
processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = BlipForConditionalGeneration.from_pretrained(MODEL_NAME)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

print(f"Model loaded on: {device}")

In [None]:
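def load_image_from_url(url, timeout=30):
    """Download an image from a URL and return it as an RGB PIL image."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # Fail early on HTTP errors instead of handing an error page to PIL
    img = Image.open(BytesIO(response.content)).convert("RGB")
    return img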

def generate_caption(image, max_length=50):
    """
    Generate a caption for an image.
    """
    inputs = processor(images=image, return_tensors="pt").to(device)
    
    generated_ids = model.generate(**inputs, max_length=max_length)
    caption = processor.decode(generated_ids[0], skip_special_tokens=True)
    
    return caption

In [None]:
# Example: Caption a single image
image_url = "https://images.unsplash.com/photo-1518791841217-8f162f1e1131?w=500"
image = load_image_from_url(image_url)

print(f"Image size: {image.size}")

caption = generate_caption(image)

print(f"\n=== GENERATED CAPTION ===")
print(caption)

# Display image (in Jupyter)
image

In [None]:
# Generate conditional captions
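def generate_conditional_caption(image, prompt_text, max_length=50):
    """
    Generate a caption conditioned on a text prompt.

    The prompt is tokenized together with the image, and the model
    continues the text from where the prompt ends.
    """
    inputs = processor(images=image, text=prompt_text, return_tensors="pt").to(device)

    # Use the same length cap as generate_caption so results are comparable
    generated_ids = model.generate(**inputs, max_length=max_length)
    caption = processor.decode(generated_ids[0], skip_special_tokens=True)

    return caption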

# Test with prompts
prompts = [
    "a photograph of",
    "this image shows",
    "the picture depicts"
]

print("\n=== CONDITIONAL CAPTIONS ===")
for prompt in prompts:
    caption = generate_conditional_caption(image, prompt)
    print(f"\nPrompt: '{prompt}'")
    print(f"Caption: {caption}")

In [None]:
# Caption multiple images
test_urls = [
    "https://images.unsplash.com/photo-1506905925346-21bda4d32df4?w=500",  # mountain
    "https://images.unsplash.com/photo-1546527868-ccb7ee7dfa6a?w=500",  # car
    "https://images.unsplash.com/photo-1551782450-a2132b4ba21d?w=500",  # burger
    "https://images.unsplash.com/photo-1552053831-71594a27632d?w=500"   # dog
]

print("\n=== MULTIPLE IMAGE CAPTIONS ===")
for i, url in enumerate(test_urls, 1):
    try:
        img = load_image_from_url(url)
        caption = generate_caption(img)
        print(f"\n{i}. {caption}")
    except Exception as e:
        print(f"\n{i}. Error: {e}")

In [None]:
# Generate multiple captions for same image (with sampling)
def generate_multiple_captions(image, num_captions=3):
    """
    Generate multiple diverse captions for an image.
    """
    inputs = processor(images=image, return_tensors="pt").to(device)
    
    captions = []
    for _ in range(num_captions):
        generated_ids = model.generate(
            **inputs,
            max_length=50,
            num_beams=5,
            do_sample=True,
            temperature=0.7
        )
        caption = processor.decode(generated_ids[0], skip_special_tokens=True)
        captions.append(caption)
    
    return captions

# Test
image = load_image_from_url("https://images.unsplash.com/photo-1506905925346-21bda4d32df4?w=500")
captions = generate_multiple_captions(image, num_captions=3)

print("\n=== MULTIPLE CAPTION VARIATIONS ===")
for i, caption in enumerate(captions, 1):
    print(f"{i}. {caption}")

In [None]:
# Local images
import os

sample_data_path = "../sample_data"

if os.path.exists(sample_data_path):
    image_files = [f for f in os.listdir(sample_data_path) 
                   if f.lower().endswith(('.png', '.jpg', '.jpeg'))]
    
    if image_files:
        print("\n=== CAPTIONING LOCAL IMAGES ===")
        for img_file in image_files[:3]:
            img_path = os.path.join(sample_data_path, img_file)
            img = Image.open(img_path).convert("RGB")
            caption = generate_caption(img)
            print(f"\n{img_file}: {caption}")
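    else:
        print("\nNo images in sample_data/. Add some to test!")
else:
    # Report a missing folder instead of skipping silently
    print(f"\nDirectory not found: {sample_data_path}")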

## State-of-the-Art Open Models (Not Covered)

While BLIP is excellent for image captioning, there are powerful vision-language models that go far beyond simple captioning. These models can answer questions, follow instructions, engage in visual reasoning, and handle complex multimodal tasks.

### Top SOTA Vision-Language Models

#### 1. 👁️ LLaVA (Microsoft/Wisconsin-Madison)
**Large Language and Vision Assistant with instruction tuning**
- **Why it's special**: Can answer questions about images, follow complex instructions, and perform visual reasoning
- **Performance**: 85.1% accuracy on ScienceQA, strong zero-shot capabilities
- **Model Card**: [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf)
- **Paper**: [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485)
- **Size**: 13GB (7B parameters)

#### 2. 🔍 BLIP-2 (Salesforce)
**Bootstrapping Language-Image Pre-training with frozen LLMs**
- **Why it's special**: Trains efficiently by keeping the image encoder and the LLM frozen and learning only a lightweight Q-Former that bridges them
- **Performance**: State-of-the-art on VQA, image captioning, and image-text retrieval
- **Model Card**: [Salesforce/blip2-opt-2.7b](https://huggingface.co/Salesforce/blip2-opt-2.7b)
- **Paper**: [BLIP-2: Bootstrapping Language-Image Pre-training](https://arxiv.org/abs/2301.12597)
- **Size**: 5.5GB (2.7B parameters)

#### 3. 📚 InstructBLIP (Salesforce)
**Instruction-aware vision-language model**
- **Why it's special**: Follows natural language instructions for diverse vision-language tasks
- **Performance**: Excellent on instruction-following benchmarks, flexible task handling
- **Model Card**: [Salesforce/instructblip-vicuna-7b](https://huggingface.co/Salesforce/instructblip-vicuna-7b)
- **Paper**: [InstructBLIP: Towards General-purpose Vision-Language Models](https://arxiv.org/abs/2305.06500)
- **Size**: 13GB (7B parameters)

#### 4. 🌐 Qwen-VL (Alibaba Cloud)
**Multilingual vision-language model**
- **Why it's special**: Strong multilingual support (English + Chinese), grounding, OCR capabilities
- **Performance**: 78.5% on TextVQA, excellent on Chinese benchmarks
- **Model Card**: [Qwen/Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat)
- **Paper**: [Qwen-VL: A Versatile Vision-Language Model](https://arxiv.org/abs/2308.12966)
- **Size**: 20GB (9.6B parameters)

#### 5. 🧠 CogVLM (Zhipu AI)
**Visual expert language model**
- **Why it's special**: Achieves SOTA on many VQA benchmarks, strong visual grounding
- **Performance**: 92.5% on TextVQA, 87.7% on ScienceQA
- **Model Card**: [THUDM/cogvlm-chat-hf](https://huggingface.co/THUDM/cogvlm-chat-hf)
- **Paper**: [CogVLM: Visual Expert for Pretrained Language Models](https://arxiv.org/abs/2311.03079)
- **Size**: 20GB (17B parameters)

### Why Not Covered?

These advanced models require:
- **GPU Memory**: 12-48GB VRAM at full precision, which often means data-center GPUs (A100/H100) or a multi-GPU setup
- **Inference Time**: 5-20 seconds per image-text pair
- **Disk Space**: 5-20GB per model
- **Complex Prompting**: Need careful instruction design for best results
- **Quantization**: 4-bit or 8-bit loading is often needed to fit these models on consumer GPUs
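
The disk-space and quantization points can be approximated with simple arithmetic: the weights take roughly (number of parameters) × (bits per parameter) / 8 bytes. The sketch below applies that rule of thumb to a 7B-parameter model; `weight_memory_gb` is a hypothetical helper, and the result covers weights only, not activations or framework overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (32, 16, 8, 4):
    print(f"7B parameters at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
# 7B parameters at 32-bit: 28.0 GB
# 7B parameters at 16-bit: 14.0 GB
# 7B parameters at 8-bit: 7.0 GB
# 7B parameters at 4-bit: 3.5 GB
```

Actual VRAM use during inference is higher than the weight size, which is one reason recommended VRAM figures sit above the on-disk model size.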

BLIP provides an excellent foundation for learning vision-language concepts!

### Learning Path Recommendation

1. **Start here**: Master BLIP (this notebook)
2. **Next level**: Try BLIP-2 for visual question answering
3. **Instruction following**: Explore InstructBLIP for diverse tasks
4. **Conversational**: Experiment with LLaVA for image chat
5. **Advanced**: Try CogVLM for state-of-the-art performance

### Benchmarks & Leaderboards

- **VQAv2** (Visual Question Answering):
  - BLIP-base: 77.5% accuracy
  - BLIP-2: 82.2% accuracy
  - InstructBLIP: 82.8% accuracy
  - LLaVA-1.5: 80.0% accuracy
  - CogVLM: 83.6% accuracy

- **TextVQA** (Reading text in images):
  - BLIP-base: 67.5%
  - BLIP-2: 71.7%
  - Qwen-VL: 78.5%
  - CogVLM: 92.5%

- **Image Captioning** (CIDEr score on COCO):
  - BLIP-base: 136.7
  - BLIP-2: 144.5
  - InstructBLIP: 142.8

- **Explore rankings**: [Papers With Code - Visual Question Answering](https://paperswithcode.com/task/visual-question-answering)

### Quick Comparison Table

| Model | Size | Speed | VQA Score | Capabilities | Best For |
|-------|------|-------|-----------|--------------|----------|
| **BLIP** ⭐ | 990MB | Fast | 77.5% | Captioning, retrieval | Learning basics |
| **BLIP-2** | 5.5GB | Medium | 82.2% | VQA, captioning | Efficient VL tasks |
| **InstructBLIP** | 13GB | Slow | 82.8% | Instruction following | Flexible task handling |
| **LLaVA** | 13GB | Slow | 80.0% | Visual chat, reasoning | Conversational AI |
| **Qwen-VL** | 20GB | Very Slow | 78.5% (TextVQA) | Multilingual, OCR | Chinese + English |
| **CogVLM** | 20GB | Very Slow | 83.6% | SOTA performance | Research, benchmarks |

### Capabilities Comparison

| Model | Captioning | VQA | Instructions | Reasoning | OCR | Multilingual |
|-------|------------|-----|--------------|-----------|-----|--------------|
| **BLIP** | ✅ | ⚠️ | ❌ | ❌ | ❌ | ❌ |
| **BLIP-2** | ✅ | ✅ | ⚠️ | ⚠️ | ⚠️ | ❌ |
| **InstructBLIP** | ✅ | ✅ | ✅ | ✅ | ⚠️ | ❌ |
| **LLaVA** | ✅ | ✅ | ✅ | ✅ | ⚠️ | ⚠️ |
| **Qwen-VL** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **CogVLM** | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |

### Example Capabilities

**What these models can do beyond BLIP:**

**LLaVA:**
```python
"Describe this image in detail."
"What's unusual about this image?"
"Count the number of people in this image."
```

**InstructBLIP:**
```python
"Question: Is this safe to eat? Answer:"
"Describe the emotion of the person in this photo."
"List all the objects you can see."
```

**Qwen-VL:**
```python
"这张图片里有什么？"  # "What's in this image?" (Chinese)
"Read the text in this image."
"Where is the cat in relation to the sofa?"
```

**CogVLM:**
```python
"Analyze the scientific diagram and explain the process."
"What equations are shown in this math problem?"
```

### Use Case Guide

**Choose based on your application:**

- **Simple captioning**: BLIP (this notebook)
- **Visual Q&A**: BLIP-2 or InstructBLIP
- **Chatbot with images**: LLaVA
- **Chinese language**: Qwen-VL
- **Document understanding**: CogVLM or Qwen-VL
- **Research/benchmarks**: CogVLM (best performance)

### Hardware Requirements

| Model | Min VRAM | Recommended | Quantization Option |
|-------|----------|-------------|---------------------|
| **BLIP** | 4GB | 8GB | Not needed |
| **BLIP-2** | 12GB | 16GB | 8-bit: 8GB |
| **InstructBLIP** | 24GB | 32GB | 4-bit: 12GB |
| **LLaVA** | 24GB | 32GB | 4-bit: 12GB |
| **Qwen-VL** | 32GB | 40GB | 4-bit: 16GB |
| **CogVLM** | 40GB | 48GB | 4-bit: 20GB |
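
As a quick way to read the table against a specific GPU, the following sketch encodes the table's minimum-VRAM and quantized-VRAM columns and filters them by a budget. The numbers are copied from the table above; `VRAM_TABLE` and `models_that_fit` are hypothetical names used only for this illustration.

```python
# model -> (min VRAM in GB, quantized VRAM in GB or None), copied from the table above
VRAM_TABLE = {
    "BLIP": (4, None),
    "BLIP-2": (12, 8),
    "InstructBLIP": (24, 12),
    "LLaVA": (24, 12),
    "Qwen-VL": (32, 16),
    "CogVLM": (40, 20),
}


def models_that_fit(vram_gb, allow_quantized=False):
    """Return the models whose listed VRAM requirement is within the budget."""
    fits = []
    for name, (min_vram, quantized_vram) in VRAM_TABLE.items():
        needed = min_vram
        if allow_quantized and quantized_vram is not None:
            needed = min(min_vram, quantized_vram)
        if needed <= vram_gb:
            fits.append(name)
    return fits


print(models_that_fit(16))
# → ['BLIP', 'BLIP-2']
print(models_that_fit(16, allow_quantized=True))
# → ['BLIP', 'BLIP-2', 'InstructBLIP', 'LLaVA', 'Qwen-VL']
```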

**💡 Tip**: For production applications with complex vision-language needs, LLaVA offers a good balance of capability and accessibility. For research that needs the strongest benchmark results among the models listed here, consider CogVLM. For beginners and simple tasks, BLIP is a solid starting point.

## Exercises

1. **Diverse Images**: Test with various image types (animals, landscapes, objects, people)
2. **Quality Assessment**: Compare base vs large model captions
3. **Custom Images**: Caption your own photos
4. **Caption Length**: Experiment with `max_length` parameter
5. **Batch Processing**: Process multiple images efficiently
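
For the batch-processing exercise, one possible starting point is to pass several images to the processor at once instead of looping one image at a time. The sketch below assumes the `processor` and `model` objects loaded earlier in this notebook and takes them as arguments; `chunked` and `caption_batch` are hypothetical helper names, and the batch size of 4 is an arbitrary choice to tune against available memory.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def caption_batch(images, processor, model, device="cpu", batch_size=4, max_length=50):
    """Caption a list of PIL images in mini-batches.

    `processor` and `model` are expected to behave like the BLIP processor
    and model loaded earlier in this notebook.
    """
    captions = []
    for chunk in chunked(images, batch_size):
        # The processor accepts a list of images and stacks them into one batch
        inputs = processor(images=chunk, return_tensors="pt").to(device)
        generated_ids = model.generate(**inputs, max_length=max_length)
        decoded = processor.batch_decode(generated_ids, skip_special_tokens=True)
        captions.extend(text.strip() for text in decoded)
    return captions
```

A call in the notebook might look like `caption_batch([img1, img2, img3], processor, model, device)`. Smaller batches use less memory, while larger batches usually make better use of a GPU.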

In [None]:
# Your code here for exercises


## Key Takeaways

✅ **BLIP** bridges vision and language for image captioning

✅ **Multimodal models** process both images and text

✅ Can generate **unconditional** or **conditional** captions

✅ Useful for **accessibility** and **content understanding**

✅ Sampling generates diverse captions for the same image

## Next Steps

- Try **Notebook 10**: Ollama Integration
- Explore [vision-language models](https://huggingface.co/models?pipeline_tag=image-to-text)
- Learn about Visual Question Answering (VQA)

## Resources

- [BLIP Paper](https://arxiv.org/abs/2201.12086)
- [Image-to-Text Guide](https://huggingface.co/docs/transformers/tasks/image_captioning)
- [BLIP Model Card](https://huggingface.co/Salesforce/blip-image-captioning-base)