# Notebook 18: VLM Inference -- Image + Text

---

## Inference Engineering Course

Welcome to Notebook 18! Here we explore **Vision-Language Models (VLMs)** -- models that can understand both images and text simultaneously.

### What You Will Learn

| Topic | Description |
|-------|-------------|
| **VLM Architecture** | How images become visual tokens |
| **Running a VLM** | Process image+text queries with a small VLM |
| **Visual Tokens** | Understanding the image token pipeline |
| **Attention Visualization** | See how the model attends to image patches |
| **Resolution Impact** | How image resolution affects tokens and speed |
| **Context Budget** | Image tokens vs text tokens in the context window |

### How VLMs Work

```
Image --> Vision Encoder --> Visual Tokens --> 
                                               --> LLM --> Text Output
Text  --> Tokenizer     --> Text Tokens   -->
```

The key insight: images are converted into a sequence of **visual tokens** that the LLM processes alongside text tokens.

---

## Part 1: Setup & Installations

In [None]:
%%capture
!pip install transformers accelerate torch torchvision Pillow matplotlib numpy requests bitsandbytes

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from PIL import Image
import requests
from io import BytesIO
import time
import warnings
warnings.filterwarnings('ignore')

plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Device: {device}")
if device == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    mem_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"Memory: {mem_gb:.1f} GB")
else:
    print("WARNING: GPU recommended for this notebook.")

## Part 2: Understanding the VLM Pipeline

Before loading a real model, let's understand the conceptual pipeline.

### Image to Visual Tokens

1. **Image input**: e.g., 224x224 RGB image
2. **Patch extraction**: Split into a grid of patches (e.g., a 14x14 grid of 16x16-pixel patches)
3. **Vision encoder**: Each patch becomes a feature vector (visual token)
4. **Projection**: Visual tokens projected to LLM's embedding dimension
5. **Concatenation**: Visual tokens + text tokens form the full input sequence
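
The five steps above can be sketched with plain NumPy. This is a toy illustration, not how any particular VLM is implemented: the random matrices `w_encoder` and `w_project` stand in for a trained vision encoder and projection layer, and the sizes (`vision_dim`, `llm_dim`, `vocab_size`, 12 text tokens) are assumptions chosen only to make the shapes easy to follow.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened patches of shape (N, P*P*C)."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must divide evenly"
    gh, gw = h // patch_size, w // patch_size
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, P, P, C): one block per patch
    return x.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
vision_dim, llm_dim, vocab_size = 64, 128, 1000  # toy sizes (assumed)

image = rng.random((224, 224, 3))                 # step 1: image input
patches = patchify(image, patch_size=16)          # step 2: (196, 768)

# Random weights stand in for trained layers (assumption for illustration).
w_encoder = rng.normal(size=(patches.shape[1], vision_dim)) * 0.02
w_project = rng.normal(size=(vision_dim, llm_dim)) * 0.02
text_embedding_table = rng.normal(size=(vocab_size, llm_dim)) * 0.02

visual_features = patches @ w_encoder             # step 3: (196, 64)
visual_tokens = visual_features @ w_project       # step 4: (196, 128)

text_ids = rng.integers(0, vocab_size, size=12)   # 12 dummy text token ids
text_tokens = text_embedding_table[text_ids]      # (12, 128)

sequence = np.concatenate([visual_tokens, text_tokens], axis=0)  # step 5
print(sequence.shape)  # (208, 128)
```

The LLM only ever sees `sequence`: 196 visual positions followed by 12 text positions, all with the same embedding width.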

### Token Count Math

For an image of size $H \times W$ with patch size $P$:

$$\text{Visual tokens} = \frac{H}{P} \times \frac{W}{P}$$

Example: 224x224 image, 16x16 patches = (224/16) x (224/16) = 14 x 14 = **196 visual tokens**
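
The formula translates directly into a small helper. The sketch below assumes floor division (pixels that do not fill a whole patch are dropped) and adds an optional `add_cls` flag for encoders that prepend a CLS token; both the function name and the flag are illustrative, not part of any library.

```python
def visual_token_count(height: int, width: int, patch_size: int,
                       add_cls: bool = False) -> int:
    """Number of patch tokens for an H x W image with square patches."""
    tokens = (height // patch_size) * (width // patch_size)
    return tokens + 1 if add_cls else tokens

for size, patch in [(224, 16), (336, 14), (448, 14)]:
    count = visual_token_count(size, size, patch)
    print(f"{size}x{size}, patch {patch}: {count} tokens")
# 224x224, patch 16: 196 tokens
# 336x336, patch 14: 576 tokens
# 448x448, patch 14: 1024 tokens
```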

In [None]:
# Visualize the patching process
def visualize_patches(image_size=224, patch_size=16):
    """Show how an image is split into patches for the vision encoder."""
    
    # Create a sample image (gradient for visualization)
    img = np.zeros((image_size, image_size, 3))
    ramp = np.arange(image_size) / image_size
    img[:, :, 0] = ramp[:, None]  # Red gradient (varies by row)
    img[:, :, 2] = ramp[None, :]  # Blue gradient (varies by column)
    
    num_patches_h = image_size // patch_size
    num_patches_w = image_size // patch_size
    total_patches = num_patches_h * num_patches_w
    
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    
    # Left: Original image
    ax = axes[0]
    ax.imshow(img)
    ax.set_title(f'Original Image\n({image_size}x{image_size})', fontsize=13, fontweight='bold')
    ax.axis('off')
    
    # Middle: Image with patch grid
    ax = axes[1]
    ax.imshow(img)
    for i in range(0, image_size, patch_size):
        ax.axhline(y=i, color='white', linewidth=0.5)
        ax.axvline(x=i, color='white', linewidth=0.5)
    ax.set_title(f'Patch Grid\n({num_patches_h}x{num_patches_w} = {total_patches} patches)', 
                 fontsize=13, fontweight='bold')
    ax.axis('off')
    
    # Right: Token sequence visualization
    ax = axes[2]
    token_colors = np.zeros((num_patches_h, num_patches_w, 3))
    for i in range(num_patches_h):
        for j in range(num_patches_w):
            token_colors[i, j, 0] = i / num_patches_h
            token_colors[i, j, 2] = j / num_patches_w
    
    ax.imshow(token_colors, interpolation='nearest')
    for i in range(num_patches_h):
        for j in range(num_patches_w):
            idx = i * num_patches_w + j
            if num_patches_h <= 14:
                ax.text(j, i, str(idx), ha='center', va='center', 
                        fontsize=6, color='white', fontweight='bold')
    ax.set_title(f'Visual Token Indices\n({total_patches} tokens in sequence)', 
                 fontsize=13, fontweight='bold')
    ax.axis('off')
    
    plt.suptitle(f'Image Patching: {image_size}x{image_size} image -> {total_patches} visual tokens',
                 fontsize=15, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()
    
    return total_patches

# Show patching for different configurations
for img_size, patch_size in [(224, 16), (336, 14), (448, 16)]:
    n = visualize_patches(img_size, patch_size)
    print(f"  {img_size}x{img_size} with {patch_size}x{patch_size} patches -> {n} visual tokens\n")

## Part 3: Loading a Vision-Language Model

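We will use **BLIP-2** -- an efficient VLM that bridges a frozen vision encoder with a frozen LLM using a lightweight query transformer (Q-Former). The Q-Former compresses the ViT's patch features into a small, fixed set of query embeddings (32 for this checkpoint), so the LLM receives far fewer visual tokens than the ViT produces.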

BLIP-2 architecture:
```
Image -> ViT (frozen) -> Q-Former (trained) -> LLM (frozen) -> Text
```
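
To see why the Q-Former step yields a fixed number of tokens, here is a minimal sketch of its core idea: a fixed set of learned queries cross-attends to the ViT features. This is a single attention step with random values, whereas the real Q-Former is a multi-layer transformer with trained weights. The sizes (257 ViT tokens, 32 queries, width 64) are assumptions picked to resemble the setup in this notebook.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries: np.ndarray, features: np.ndarray):
    """Single-head cross-attention: each query becomes a weighted mix of features."""
    scores = queries @ features.T / np.sqrt(queries.shape[-1])  # (Q, N)
    weights = softmax(scores, axis=-1)                          # rows sum to 1
    return weights @ features, weights                          # (Q, D), (Q, N)

rng = np.random.default_rng(0)
dim = 64                                       # toy width (assumed)
vit_features = rng.normal(size=(257, dim))     # e.g. 256 patches + 1 CLS token
learned_queries = rng.normal(size=(32, dim))   # fixed number of queries

compressed, weights = cross_attend(learned_queries, vit_features)
print(vit_features.shape, "->", compressed.shape)  # (257, 64) -> (32, 64)
```

The output length depends only on the number of queries, not on how many patch features go in, which is what makes this kind of bridge attractive for small context windows.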

In [None]:
from transformers import Blip2Processor, Blip2ForConditionalGeneration

print("Loading BLIP-2 model (this may take a few minutes)...")

model_name = "Salesforce/blip2-opt-2.7b"

processor = Blip2Processor.from_pretrained(model_name)

vlm_model = Blip2ForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if device == 'cuda' else torch.float32,
    device_map="auto" if device == 'cuda' else None,
)

if device != 'cuda':
    vlm_model = vlm_model.to(device)

print(f"Model loaded!")
total_params = sum(p.numel() for p in vlm_model.parameters())
print(f"Total parameters: {total_params / 1e9:.1f}B")
print(f"Vision encoder: ViT")
print(f"Language model: OPT-2.7B")

In [None]:
# Helper function to load images from URLs
def load_image(url_or_path: str, max_size: int = 512) -> Image.Image:
    """Load an image from URL or local path, resize if needed."""
    if url_or_path.startswith('http'):
        response = requests.get(url_or_path, timeout=10)
        response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
        img = Image.open(BytesIO(response.content)).convert('RGB')
    else:
        img = Image.open(url_or_path).convert('RGB')
    
    # Resize if too large
    if max(img.size) > max_size:
        ratio = max_size / max(img.size)
        new_size = (int(img.size[0] * ratio), int(img.size[1] * ratio))
        img = img.resize(new_size, Image.LANCZOS)
    
    return img

# Load sample images
sample_urls = {
    'cat': 'https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg',
    'city': 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/New_york_times_square-terabass.jpg/1200px-New_york_times_square-terabass.jpg',
    'food': 'https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Good_Food_Display_-_NCI_Visuals_Online.jpg/1200px-Good_Food_Display_-_NCI_Visuals_Online.jpg',
}

images = {}
for name, url in sample_urls.items():
    try:
        images[name] = load_image(url)
        print(f"Loaded '{name}': {images[name].size}")
    except Exception as e:
        print(f"Could not load '{name}': {e}")
        # Create a placeholder
        images[name] = Image.new('RGB', (384, 384), color=(128, 128, 128))

# Display loaded images
fig, axes = plt.subplots(1, len(images), figsize=(5 * len(images), 5))
for ax, (name, img) in zip(axes, images.items()):
    ax.imshow(img)
    ax.set_title(f'{name.title()} ({img.size[0]}x{img.size[1]})', fontsize=12)
    ax.axis('off')
plt.tight_layout()
plt.show()

## Part 4: Processing Image + Text Queries

Now let's ask the VLM questions about our images.
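
Prompt wording can matter for a base LLM that was not instruction-tuned; a `Question: ... Answer:` template often gives cleaner answers than a bare question. The helper below is a hypothetical convenience (the name `format_vqa_prompt` and the `history` argument are not part of any library) that builds such a prompt and can carry earlier turns as plain text.

```python
def format_vqa_prompt(question: str, history=None) -> str:
    """Build a 'Question: ... Answer:' prompt, optionally prefixed with earlier turns.

    history: list of (question, answer) pairs from previous turns.
    """
    turns = [f"Question: {q} Answer: {a}." for q, a in (history or [])]
    turns.append(f"Question: {question.strip()} Answer:")
    return " ".join(turns)

print(format_vqa_prompt("What animal is shown?"))
# Question: What animal is shown? Answer:

print(format_vqa_prompt("What color is it?", history=[("What animal is shown?", "a cat")]))
# Question: What animal is shown? Answer: a cat. Question: What color is it? Answer:
```

The resulting string can be passed as the `question` argument of the query function defined in this part.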

In [None]:
@torch.no_grad()
def ask_vlm(image: Image.Image, question: str, max_new_tokens: int = 50) -> dict:
    """
    Ask the VLM a question about an image.
    
    Returns the answer and timing information.
    """
    # Process inputs
    inputs = processor(images=image, text=question, return_tensors="pt")
    inputs = {k: v.to(device) if torch.is_tensor(v) else v for k, v in inputs.items()}
    
    # Count tokens
    input_token_count = inputs['input_ids'].shape[1]
    pixel_values_shape = inputs['pixel_values'].shape
    
    # Generate
    start = time.time()
    output_ids = vlm_model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    elapsed = time.time() - start
    
    # Decode
    answer = processor.decode(output_ids[0], skip_special_tokens=True).strip()
    output_tokens = output_ids.shape[1]
    
    return {
        'question': question,
        'answer': answer,
        'time_s': round(elapsed, 3),
        'text_input_tokens': input_token_count,
        'output_tokens': output_tokens,
        'pixel_shape': pixel_values_shape,
        'tok_per_s': round(output_tokens / elapsed, 1),
    }

# Test with different questions
if 'cat' in images:
    questions = [
        "What is in this image?",
        "Describe the colors you see.",
        "Question: What animal is shown? Answer:",
    ]
    
    print("Image: Cat")
    print("=" * 70)
    for q in questions:
        result = ask_vlm(images['cat'], q)
        print(f"Q: {result['question']}")
        print(f"A: {result['answer']}")
        print(f"   Time: {result['time_s']}s | Output tokens: {result['output_tokens']}")
        print()

In [None]:
# Query all images with the same question
question = "Describe this image in detail."

print(f"Question: '{question}'")
print("=" * 70)

all_results = []
for name, img in images.items():
    result = ask_vlm(img, question, max_new_tokens=80)
    all_results.append((name, result))
    print(f"\n[{name.upper()}] ({result['time_s']}s)")
    print(f"  Answer: {result['answer'][:150]}...")
    print(f"  Pixel shape: {result['pixel_shape']}")
    print(f"  Text tokens: {result['text_input_tokens']}, Output tokens: {result['output_tokens']}")

## Part 5: How Images Become Visual Tokens

Let's trace through the vision encoder to see exactly how an image is tokenized.

In [None]:
# Analyze the vision encoding process
@torch.no_grad()
def analyze_visual_tokens(image: Image.Image) -> dict:
    """Analyze how an image is converted to visual tokens."""
    
    # Process the image
    inputs = processor(images=image, return_tensors="pt")
    pixel_values = inputs['pixel_values'].to(device)
    
    # Run the vision encoder (BLIP-2 exposes it as `vision_model`).
    # last_hidden_state includes the CLS token, so seq_len = num_patches + 1.
    vision_outputs = vlm_model.vision_model(
        pixel_values=pixel_values,
        return_dict=True,
    )
    visual_features = vision_outputs.last_hidden_state
    
    return {
        'pixel_shape': tuple(pixel_values.shape),
        'feature_shape': tuple(visual_features.shape),
        'num_visual_tokens': visual_features.shape[1],
        'token_dim': visual_features.shape[2],
        'feature_norm': float(visual_features.norm(dim=-1).mean()),
    }

for name, img in images.items():
    analysis = analyze_visual_tokens(img)
    print(f"\n{name.upper()} ({img.size[0]}x{img.size[1]}):")
    print(f"  Pixel values shape: {analysis['pixel_shape']}")
    print(f"  Visual features shape: {analysis['feature_shape']}")
    print(f"  Number of visual tokens: {analysis['num_visual_tokens']}")
    print(f"  Token dimension: {analysis['token_dim']}")
    print(f"  Average feature norm: {analysis['feature_norm']:.2f}")

In [None]:
# Visualize visual token structure
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Token count comparison for different image sizes
ax = axes[0]
sizes = [224, 336, 384, 448, 512, 768, 1024]
patch_sizes = [14, 16]

for ps in patch_sizes:
    token_counts = [(s // ps) ** 2 for s in sizes]
    ax.plot(sizes, token_counts, 'o-', linewidth=2.5, markersize=8, 
            label=f'Patch size {ps}x{ps}')

# Add context window reference lines
ax.axhline(y=2048, color='red', linestyle='--', alpha=0.5, label='2K context')
ax.axhline(y=4096, color='orange', linestyle='--', alpha=0.5, label='4K context')

ax.set_xlabel('Image Size (pixels)', fontsize=12)
ax.set_ylabel('Number of Visual Tokens', fontsize=12)
ax.set_title('Visual Token Count vs Image Size', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)

# Right: Context budget breakdown
ax = axes[1]
context_window = 2048
# Token counts are (size / patch)^2: 196 assumes 16px patches; 576 and 1024 assume 14px patches
image_token_counts = [0, 196, 576, 1024]
image_labels = ['No image', '224x224\n(196 tokens)', '336x336\n(576 tokens)', '448x448\n(1024 tokens)']
text_budgets = [context_window - itc for itc in image_token_counts]

x = np.arange(len(image_labels))
ax.bar(x, image_token_counts, label='Visual tokens', color='#FF9800', alpha=0.8)
ax.bar(x, text_budgets, bottom=image_token_counts, label='Text tokens (remaining)',
       color='#2196F3', alpha=0.8)

for i in range(len(image_labels)):
    ax.text(i, image_token_counts[i] / 2, str(image_token_counts[i]),
           ha='center', va='center', fontsize=10, fontweight='bold', color='white')
    ax.text(i, image_token_counts[i] + text_budgets[i] / 2, str(text_budgets[i]),
           ha='center', va='center', fontsize=10, fontweight='bold', color='white')

ax.set_xticks(x)
ax.set_xticklabels(image_labels, fontsize=9)
ax.set_ylabel('Tokens', fontsize=12)
ax.set_title(f'Context Budget (Window={context_window})', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)

plt.tight_layout()
plt.show()

print("Key insight: Larger images consume more of the context window,")
print("leaving less room for text input and output.")

## Part 6: Visualizing Attention Over Image Patches

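Let's visualize which image patches the vision encoder's CLS token attends to. This map comes from the vision encoder alone, so it does not depend on the question; treat it as a rough indication of which regions the encoder weights most heavily.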
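
One practical detail when plotting attention: each row of an attention matrix sums to 1, so with a few hundred tokens the individual weights are tiny, and they need rescaling before they make a readable heatmap. A minimal sketch of min-max normalization follows; the helper name is hypothetical, and the random row of 257 values (1 CLS + a 16x16 patch grid) stands in for real attention weights.

```python
import numpy as np

def normalize_attention_map(attn_map: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Min-max scale an attention map to [0, 1] so it can be shown as a heatmap."""
    attn_map = attn_map.astype(np.float32)
    lo, hi = attn_map.min(), attn_map.max()
    return (attn_map - lo) / (hi - lo + eps)

rng = np.random.default_rng(0)
row = rng.random(257)
row = row / row.sum()                     # like a softmax row: sums to 1
cls_to_patches = row[1:].reshape(16, 16)  # drop the CLS entry, reshape to the grid

print(f"raw max weight: {cls_to_patches.max():.4f}")  # small, on the order of 1/257
heat = normalize_attention_map(cls_to_patches)
print(f"normalized range: [{heat.min():.2f}, {heat.max():.2f}]")
```

Without this step, converting the raw weights to 8-bit values would round almost everything to 0 or 1 and the overlay would look blank.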

In [None]:
@torch.no_grad()
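def visualize_attention_map(image: Image.Image, question: str):
    """
    Compute an approximate attention map over image patches.
    
    We use the vision encoder's last-layer self-attention (CLS token to
    patches) to show which image regions the encoder weights most heavily.
    `question` is accepted for interface symmetry but is unused: the
    vision encoder never sees the text.
    """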
    # Process image
    inputs = processor(images=image, return_tensors="pt")
    pixel_values = inputs['pixel_values'].to(device)
    
    # Get vision encoder attention weights
    vision_outputs = vlm_model.vision_model(
        pixel_values=pixel_values,
        output_attentions=True,
        return_dict=True,
    )
    
    # Attention weights from the last layer
    # Shape: [batch, heads, seq_len, seq_len]
    last_attention = vision_outputs.attentions[-1]
    
    # Average over heads -> [batch, seq_len, seq_len]
    avg_attention = last_attention.mean(dim=1)
    
    # Attention from the CLS token (index 0) to every patch token
    cls_attention = avg_attention[0, 0, 1:].float()  # Skip the CLS entry; cast from FP16
    
    # Reshape to 2D spatial map
    num_patches = cls_attention.shape[0]
    h = w = int(np.sqrt(num_patches))
    
    if h * w != num_patches:
        # Handle non-square cases
        h = w = int(np.ceil(np.sqrt(num_patches)))
        padded = torch.zeros(h * w, device=cls_attention.device)
        padded[:num_patches] = cls_attention
        attention_map = padded.reshape(h, w).cpu().numpy()
    else:
        attention_map = cls_attention.reshape(h, w).cpu().numpy()
    
    return attention_map

# Visualize attention for our images
fig, axes = plt.subplots(2, len(images), figsize=(5 * len(images), 10))

for idx, (name, img) in enumerate(images.items()):
    try:
        attn_map = visualize_attention_map(img, "What is in this image?")
        
        # Top: original image
        axes[0][idx].imshow(img)
        axes[0][idx].set_title(f'{name.title()} (Original)', fontsize=12, fontweight='bold')
        axes[0][idx].axis('off')
        
        # Bottom: attention heatmap overlaid
        axes[1][idx].imshow(img)
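        # Normalize to [0, 1] first (raw attention weights are tiny), then resize to image size
        attn_norm = attn_map.astype(np.float32)
        attn_norm = (attn_norm - attn_norm.min()) / (attn_norm.max() - attn_norm.min() + 1e-8)
        attn_resized = np.array(Image.fromarray(
            (attn_norm * 255).astype(np.uint8)).resize(img.size, Image.BILINEAR)) / 255.0
        axes[1][idx].imshow(attn_resized, cmap='hot', alpha=0.5)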
        axes[1][idx].set_title(f'{name.title()} (Attention)', fontsize=12, fontweight='bold')
        axes[1][idx].axis('off')
    except Exception as e:
        axes[0][idx].text(0.5, 0.5, f'Error: {str(e)[:30]}', transform=axes[0][idx].transAxes,
                          ha='center')
        axes[1][idx].text(0.5, 0.5, 'Attention not available', transform=axes[1][idx].transAxes,
                          ha='center')

plt.suptitle('Vision Encoder Attention Maps', fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

## Part 7: Impact of Image Resolution on Inference

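How does image resolution affect:
1. Number of visual tokens
2. Inference time
3. Answer quality

Keep in mind that the BLIP-2 processor, by default, resizes every input to the vision encoder's fixed input size (the pixel shape printed in Part 4 shows it). In this experiment the visual token count therefore stays constant; resizing beforehand mainly changes how much detail survives, and timing differences are likely dominated by output length and measurement noise. Models with dynamic resolution behave differently: there, the token count grows with image size.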
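
Single timings are noisy, especially on a GPU, where kernels run asynchronously and the first call often includes one-time setup. A minimal sketch of a more careful timing helper is shown below; `time_call` and its `sync` parameter are hypothetical names introduced here, not part of any library.

```python
import statistics
import time

def time_call(fn, warmup: int = 1, repeats: int = 3, sync=None) -> float:
    """Return the median wall-clock time of fn() in seconds.

    sync: optional zero-argument callable run before and after each timed
    call, e.g. a device synchronization function.
    """
    for _ in range(warmup):          # untimed warm-up runs
        fn()
    samples = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

elapsed = time_call(lambda: sum(i * i for i in range(100_000)))
print(f"median time: {elapsed * 1000:.2f} ms")
```

On a CUDA device, passing `sync=torch.cuda.synchronize` makes sure queued GPU work has finished before the clock is read.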

In [None]:
# Test different resolutions
test_image = images.get('cat', list(images.values())[0])
question = "Describe this image in detail."

resolutions = [128, 224, 336, 448]
resolution_results = []

print(f"Testing different input resolutions...")
print(f"Question: '{question}'\n")

for res in resolutions:
    # Resize image
    resized = test_image.resize((res, res), Image.LANCZOS)
    
    # Process and generate
    result = ask_vlm(resized, question, max_new_tokens=60)
    result['resolution'] = res
    result['image_pixels'] = res * res
    resolution_results.append(result)
    
    print(f"Resolution {res}x{res}:")
    print(f"  Time: {result['time_s']}s | Tokens: {result['output_tokens']}")
    print(f"  Answer: {result['answer'][:100]}...")
    print()

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Time vs Resolution
ax = axes[0]
res_vals = [r['resolution'] for r in resolution_results]
time_vals = [r['time_s'] for r in resolution_results]
ax.plot(res_vals, time_vals, 'o-', color='#F44336', linewidth=2.5, markersize=10)
ax.set_xlabel('Image Resolution (pixels)', fontsize=12)
ax.set_ylabel('Inference Time (seconds)', fontsize=12)
ax.set_title('Inference Time vs Image Resolution', fontsize=14, fontweight='bold')

for r, t in zip(res_vals, time_vals):
    ax.annotate(f'{t:.2f}s', (r, t), textcoords='offset points', xytext=(0, 10),
               ha='center', fontsize=10)

# Right: Output tokens vs Resolution
ax = axes[1]
out_tok_vals = [r['output_tokens'] for r in resolution_results]
ax.bar(range(len(resolutions)), out_tok_vals, color='#2196F3', alpha=0.8, edgecolor='black')
ax.set_xticks(range(len(resolutions)))
ax.set_xticklabels([f'{r}x{r}' for r in resolutions])
ax.set_xlabel('Image Resolution', fontsize=12)
ax.set_ylabel('Output Tokens', fontsize=12)
ax.set_title('Output Length vs Resolution', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

## Part 8: Context Budget Analysis

Visual tokens take up space in the context window. Let's analyze the tradeoff between image detail and text capacity.
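
The budget arithmetic is simple enough to write down directly. The sketch below assumes that image tokens, the text prompt, and generated tokens all share one context window; the function name and the `reserved_output` parameter are illustrative.

```python
def text_budget(context_window: int, image_tokens: int, reserved_output: int = 0) -> int:
    """Tokens left for the text prompt after image tokens and reserved output tokens."""
    return max(0, context_window - image_tokens - reserved_output)

context_window = 2048
for image_tokens in (0, 196, 576, 1024, 2304):
    left = text_budget(context_window, image_tokens, reserved_output=256)
    share = min(1.0, image_tokens / context_window)
    print(f"{image_tokens:>5} image tokens -> {left:>5} prompt tokens left "
          f"({share:.0%} of window used by the image)")
```

Reserving room for the output up front avoids building a prompt that fits the window but leaves no space to generate an answer.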

In [None]:
# Context budget analysis
fig, ax = plt.subplots(figsize=(12, 6))

context_sizes = [2048, 4096, 8192, 16384]
# Token counts assume a 14px patch grid, (size / 14)^2, except 196, which assumes 16px patches
image_configs = [
    ('No image', 0),
    ('Small (224px)', 196),
    ('Medium (336px)', 576),
    ('Large (448px)', 1024),
    ('XL (672px)', 2304),
]

x = np.arange(len(context_sizes))
width = 0.15
colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(image_configs)))

for i, (config_name, img_tokens) in enumerate(image_configs):
    remaining = [max(0, cs - img_tokens) for cs in context_sizes]
    bars = ax.bar(x + i * width, remaining, width, label=f'{config_name} ({img_tokens} tok)',
                  color=colors[i], alpha=0.8)

ax.set_xlabel('Context Window Size', fontsize=12)
ax.set_ylabel('Remaining Text Tokens', fontsize=12)
ax.set_title('Available Text Budget After Image Tokens', fontsize=14, fontweight='bold')
ax.set_xticks(x + width * 2)
ax.set_xticklabels([f'{cs:,}' for cs in context_sizes])
ax.legend(fontsize=9, ncol=2)

plt.tight_layout()
plt.show()

print("For models with small context windows (2K-4K), image tokens")
print("can consume a significant portion of the available budget.")
print("This is why many VLMs use techniques like:")
print("  - Dynamic resolution (resize based on task)")
print("  - Visual token compression (Q-Former, Perceiver)")
print("  - Token merging (reduce redundant visual tokens)")

## Part 9: VLM Inference Performance Summary

Let's compile our findings into a performance analysis.

In [None]:
# Performance summary visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Time breakdown for VLM inference
ax = axes[0][0]
components = ['Vision\nEncoder', 'Q-Former\nBridge', 'LLM\nPrefill', 'LLM\nDecode']
# Illustrative time breakdown (percentages); rough estimates, not measured in this notebook
time_pcts = [15, 5, 30, 50]
colors = ['#FF9800', '#4CAF50', '#2196F3', '#F44336']
ax.pie(time_pcts, labels=components, colors=colors, autopct='%1.0f%%',
       startangle=90, textprops={'fontsize': 11})
ax.set_title('VLM Inference Time Breakdown', fontsize=14, fontweight='bold')

# Plot 2: Token types and counts
ax = axes[0][1]
configs = ['Text Only', 'Small Image', 'Large Image', 'Multi-Image']
text_tokens = [500, 500, 500, 500]
visual_tokens = [0, 196, 1024, 2048]
x = np.arange(len(configs))
ax.bar(x, text_tokens, label='Text Tokens', color='#2196F3', alpha=0.8)
ax.bar(x, visual_tokens, bottom=text_tokens, label='Visual Tokens', color='#FF9800', alpha=0.8)
for i in range(len(configs)):
    total = text_tokens[i] + visual_tokens[i]
    ax.text(i, total + 20, f'{total}', ha='center', fontsize=10, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(configs, fontsize=10)
ax.set_ylabel('Token Count', fontsize=12)
ax.set_title('Total Input Tokens by Configuration', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)

# Plot 3: Relative inference cost
ax = axes[1][0]
configs_cost = ['Text Only\n(1B LLM)', 'VLM\nSmall Img', 'VLM\nLarge Img', 'VLM\nMulti-Image']
# Illustrative relative costs, not measured in this notebook
relative_costs = [1.0, 1.8, 3.2, 5.0]
bars = ax.bar(configs_cost, relative_costs, color=['#4CAF50', '#FF9800', '#F44336', '#9C27B0'],
              alpha=0.8, edgecolor='black')
for bar, cost in zip(bars, relative_costs):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.1,
           f'{cost:.1f}x', ha='center', fontsize=11, fontweight='bold')
ax.set_ylabel('Relative Inference Cost', fontsize=12)
ax.set_title('Inference Cost vs Configuration', fontsize=14, fontweight='bold')

# Plot 4: Optimization potential
ax = axes[1][1]
optimizations = ['Baseline', 'FP16', 'Token\nCompression', 'Flash\nAttention', 'All\nCombined']
# Illustrative speedups, not measured in this notebook; real gains depend on model and hardware
speedups = [1.0, 1.8, 2.5, 2.0, 4.5]
bars = ax.bar(optimizations, speedups, 
              color=['gray', '#FF9800', '#4CAF50', '#2196F3', '#9C27B0'],
              alpha=0.8, edgecolor='black')
for bar, sp in zip(bars, speedups):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.1,
           f'{sp:.1f}x', ha='center', fontsize=11, fontweight='bold')
ax.set_ylabel('Speedup', fontsize=12)
ax.set_title('VLM Optimization Techniques', fontsize=14, fontweight='bold')
ax.axhline(y=1, color='gray', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

## Part 10: Key Takeaways

### Summary

1. **VLMs convert images to visual tokens** using a vision encoder (ViT). These tokens are processed by the LLM alongside text tokens.

2. **Image resolution directly affects token count** in models that do not resize to a fixed input: higher resolution = more tokens = more compute. With 14x14 patches, a 448x448 image yields 1,024 patch tokens.

3. **Context budget matters**: Visual tokens compete with text tokens for the context window. Balance image detail with text capacity.

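4. **Attention visualization** of the vision encoder's CLS token shows which image regions the encoder weights most heavily. These maps are independent of the question, because the question never reaches the vision encoder.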

5. **Optimization strategies**:
   - Dynamic resolution (use smaller images when detail isn't needed)
   - Visual token compression (Q-Former, Perceiver Resampler)
   - FP16 inference or INT8 quantization
   - Flash Attention for long visual token sequences
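
As a concrete picture of what reducing visual tokens means, the sketch below average-pools neighboring tokens on the patch grid, which cuts the count by a factor of four. This is a deliberately simple stand-in for the compression idea, not the Q-Former or Perceiver Resampler mechanism, and the function name and sizes are assumptions.

```python
import numpy as np

def pool_visual_tokens(tokens: np.ndarray, grid: int, factor: int = 2) -> np.ndarray:
    """Average-pool a (grid*grid, D) token grid by `factor` along each spatial axis."""
    n, d = tokens.shape
    assert n == grid * grid, "tokens must form a square grid (no CLS token)"
    assert grid % factor == 0, "grid must be divisible by the pooling factor"
    g = grid // factor
    x = tokens.reshape(g, factor, g, factor, d)   # split rows and columns into blocks
    return x.mean(axis=(1, 3)).reshape(g * g, d)  # average within each block

tokens = np.random.default_rng(0).normal(size=(576, 32))  # 24x24 grid, toy width 32
pooled = pool_visual_tokens(tokens, grid=24)
print(tokens.shape, "->", pooled.shape)  # (576, 32) -> (144, 32)
```

Pooling trades spatial detail for a shorter sequence, so it suits tasks where coarse image content is enough.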

### Connection to Inference Engineering

VLM inference combines the challenges of both vision and language:
- Vision encoder: batch-friendly, parallelizable
- LLM decoding: autoregressive, memory-bound
- The visual token count is a key lever for controlling speed
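
One way to see why the visual token count is such a strong lever: every token kept in the context also occupies KV-cache memory during decoding. The back-of-envelope sketch below assumes standard multi-head attention (one key and one value vector of width `hidden_size` per token per layer) and FP16 storage. The decoder shape used in the loop (32 layers, hidden size 2560) is an assumption, roughly the size of a 2.7B-parameter model.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, hidden_size: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size: one key and one value vector per token per layer."""
    return 2 * num_layers * hidden_size * bytes_per_value * num_tokens

# Assumed decoder shape: 32 layers, hidden size 2560, FP16 (2 bytes per value).
for visual_tokens in (32, 196, 576, 1024):
    mib = kv_cache_bytes(visual_tokens, num_layers=32, hidden_size=2560) / 2**20
    print(f"{visual_tokens:>5} visual tokens -> {mib:7.1f} MiB of KV cache")
```

Because decoding is memory-bound, these bytes are read at every generation step, so fewer visual tokens help both memory footprint and per-token latency.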

---

## Exercises

### Exercise 1: Multi-turn Visual QA
Ask a series of increasingly specific questions about the same image.

In [None]:
# Exercise 1: Multi-turn Visual QA
# For the same image, ask:
# 1. "What is in this image?"
# 2. "What colors are visible?"
# 3. "Count the objects you see."
# 4. "Describe the background."
# How do the answers and timing change?

print("Exercise 1: Implement multi-turn visual QA!")

### Exercise 2: Image Comparison
Compare how the model describes different images of the same category.

In [None]:
# Exercise 2: Image comparison
# Load two images of the same category (e.g., two different cats)
# Ask the same question to both
# Compare the responses

print("Exercise 2: Compare model descriptions of similar images!")

### Exercise 3: Token Efficiency Analysis
Measure the "information density" of visual tokens vs text tokens.

In [None]:
# Exercise 3: Token efficiency
# Compare: asking the model to describe an image (with image input)
# vs. asking the model to describe the same scene from a text description
# Which produces more detailed output? Which is faster?

print("Exercise 3: Analyze visual token information density!")

---

**End of Notebook 18: VLM Inference -- Image + Text**

Next: [Notebook 19 - ASR with Whisper](./19_asr_whisper.ipynb)