# Part 9.3: Multimodal AI

Humans understand the world through multiple senses simultaneously — we see an image, read its caption, and connect the two effortlessly. **Multimodal AI** gives models this same ability: understanding and connecting information across different modalities (text, images, audio, video).

**F1 analogy:** An F1 race engineer doesn't rely on a single data source. They're simultaneously watching the onboard camera (vision — track conditions, rival positions, debris), listening to team radio (text/audio — driver feedback, race director messages), and monitoring telemetry dashboards (structured data — speed traces, tire temperatures, fuel load). The magic happens when these streams *fuse*: seeing a rival's car snap sideways on the onboard camera, hearing the driver shout "He's lost it at Turn 4!" on the radio, and seeing the corresponding spike in the yellow flag telemetry — all connecting into a unified understanding: "Safety car incoming, pit NOW." That fusion of camera + telemetry + radio into unified race understanding is exactly what multimodal AI achieves.

The breakthrough came with **CLIP** (Contrastive Language-Image Pre-training), which showed that training on image-text pairs collected from the internet produces representations that transfer to new vision tasks without task-specific training. Think of CLIP as learning to match race images with their telemetry descriptions. This notebook builds multimodal systems from the ground up.

## Learning Objectives

- [ ] Understand the multimodal alignment problem and why it's hard
- [ ] Implement contrastive learning for aligning modalities
- [ ] Build a CLIP-style model from scratch (image encoder + text encoder)
- [ ] Implement zero-shot classification using aligned embeddings
- [ ] Build cross-modal retrieval (text -> image, image -> text)
- [ ] Understand vision-language models and how they extend LLMs
- [ ] Explore multimodal applications: captioning, VQA, generation

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from collections import defaultdict
import math
import re
import torch
import torch.nn as nn
import torch.nn.functional as F

np.random.seed(42)
torch.manual_seed(42)

print("Part 9.3: Multimodal AI")
print("=" * 50)

---

## 1. The Multimodal Alignment Problem

Different modalities live in completely different spaces:
- **Images**: 3D tensors (H x W x C) of pixel values
- **Text**: Sequences of discrete tokens
- **Audio**: 1D waveforms or spectrograms

**Alignment** means learning a shared embedding space where semantically similar content from different modalities is close together:

```
"a photo of a cat" (text) <-> [cat image] (image)  -> close in embedding space
"a photo of a dog" (text) <-> [cat image] (image)  -> far in embedding space
```

| Approach | How It Aligns | Example | F1 Parallel |
|----------|--------------|--------|-------------|
| **Contrastive** | Pull matching pairs together, push non-matching apart | CLIP, ALIGN | Learning that an onboard image of heavy rain matches the radio message "It's aquaplaning out here" — and doesn't match "Track is bone dry" |
| **Generative** | Predict one modality from another | Image captioning, DALL-E | Generating a race report from telemetry data, or predicting expected telemetry from a track layout image |
| **Fusion** | Combine modalities in a joint model | VisualBERT, Flamingo | The pit wall combining camera feed + telemetry + radio into one unified race model |

**F1 analogy:** The alignment problem in F1 is connecting completely different data types that describe the same event. An onboard camera frame (pixels) showing the car braking hard into Turn 1 needs to be aligned with the telemetry trace (numbers) showing brake pressure at 200 bar and the radio message (text) "Braking point is good, keep pushing." These three representations live in completely different mathematical spaces, but they all describe the same moment. Multimodal alignment learns to map them into a shared space where matching moments are close together.
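
To make "close in embedding space" concrete, here is a minimal sketch that scores a matching and a non-matching text-image pair by cosine similarity. The three-dimensional vectors are hypothetical, hand-picked values for illustration only; a trained model would produce them from its encoders.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings in a shared 3-D space (not produced by any model).
text_cat = np.array([0.9, 0.1, 0.0])    # "a photo of a cat"
text_dog = np.array([0.1, 0.9, 0.0])    # "a photo of a dog"
image_cat = np.array([0.8, 0.2, 0.1])   # embedding of a cat image

sim_match = cosine_similarity(text_cat, image_cat)
sim_mismatch = cosine_similarity(text_dog, image_cat)

print(f"cat text vs cat image: {sim_match:.3f}")
print(f"dog text vs cat image: {sim_mismatch:.3f}")
```

In an aligned space the matching pair should score clearly higher than the mismatched one, which is the property the training objective in the next section optimizes for.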

In [None]:
# Visualize the multimodal alignment concept
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Before alignment: separate spaces
ax = axes[0]
ax.set_xlim(-3, 3)
ax.set_ylim(-3, 3)
ax.set_title('Before Alignment', fontsize=13, fontweight='bold')

# Text embeddings (random cluster)
np.random.seed(42)
text_pts = np.random.randn(5, 2) * 0.5 + np.array([-1.5, 1])
img_pts = np.random.randn(5, 2) * 0.5 + np.array([1.5, -1])

labels = ['cat', 'dog', 'car', 'tree', 'bird']
for i, label in enumerate(labels):
    ax.scatter(text_pts[i, 0], text_pts[i, 1], c='#3498db', s=100, zorder=5,
             edgecolors='black', marker='s')
    ax.scatter(img_pts[i, 0], img_pts[i, 1], c='#e74c3c', s=100, zorder=5,
             edgecolors='black', marker='o')
    ax.annotate(f'"{label}"', (text_pts[i, 0], text_pts[i, 1]),
               textcoords='offset points', xytext=(-5, 10), fontsize=8, color='#3498db')
    ax.annotate(f'img:{label}', (img_pts[i, 0], img_pts[i, 1]),
               textcoords='offset points', xytext=(-5, -15), fontsize=8, color='#e74c3c')

ax.scatter([], [], c='#3498db', s=60, marker='s', label='Text embeddings')
ax.scatter([], [], c='#e74c3c', s=60, marker='o', label='Image embeddings')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# After alignment: shared space
ax = axes[1]
ax.set_xlim(-3, 3)
ax.set_ylim(-3, 3)
ax.set_title('After Alignment (CLIP)', fontsize=13, fontweight='bold')

# Aligned: matching pairs are close
aligned_centers = np.array([[-2, 2], [2, 2], [2, -1], [-1, -2], [0, 0.5]])

for i, label in enumerate(labels):
    offset_text = np.random.randn(2) * 0.15
    offset_img = np.random.randn(2) * 0.15
    
    tp = aligned_centers[i] + offset_text
    ip = aligned_centers[i] + offset_img
    
    ax.scatter(tp[0], tp[1], c='#3498db', s=100, zorder=5, edgecolors='black', marker='s')
    ax.scatter(ip[0], ip[1], c='#e74c3c', s=100, zorder=5, edgecolors='black', marker='o')
    
    # Draw connection line
    ax.plot([tp[0], ip[0]], [tp[1], ip[1]], 'g--', alpha=0.5, linewidth=1.5)
    
    ax.annotate(label, aligned_centers[i], textcoords='offset points',
               xytext=(10, -5), fontsize=9, fontweight='bold')

ax.scatter([], [], c='#3498db', s=60, marker='s', label='Text embeddings')
ax.scatter([], [], c='#e74c3c', s=60, marker='o', label='Image embeddings')
ax.plot([], [], 'g--', label='Aligned pairs')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

plt.suptitle('Multimodal Alignment', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

---

## 2. Contrastive Learning

Contrastive learning is the key technique behind CLIP. Given a batch of N image-text pairs:

1. Encode all images -> image embeddings
2. Encode all texts -> text embeddings
3. Compute similarity matrix (N x N)
4. The diagonal entries (matching pairs) should be high
5. Off-diagonal entries (non-matching) should be low

### InfoNCE Loss

$$\mathcal{L}_i = -\log \frac{\exp(\text{sim}(I_i, T_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(I_i, T_j) / \tau)}$$

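where $\tau$ is a learned temperature parameter. This is the image-to-text direction, where image $I_i$ is scored against every text in the batch. CLIP averages it with the mirrored text-to-image term, in which text $T_i$ is scored against every image.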

**F1 analogy:** Imagine training a system with thousands of paired examples: an onboard camera frame matched with its corresponding radio transcript. The contrastive loss says: "Given this image of a wet track with spray, the matching radio message 'It's really slippery out here' should score high similarity, while all other radio messages in the batch ('Tires feel great', 'Box this lap', 'Gap to car ahead?') should score low." The temperature parameter $\tau$ controls how sharp this distinction is: a low temperature amplifies small similarity differences, so the softmax puts almost all of its probability on the best-scoring match, like a race engineer who demands exact telemetry-radio alignment before acting on the information.
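
The effect of $\tau$ can be checked by hand for a single row of the similarity matrix. The sketch below evaluates the InfoNCE formula for one image against four candidate texts at two temperatures. The similarity values and the helper name `info_nce_row` are illustrative assumptions, not part of CLIP.

```python
import numpy as np

def info_nce_row(similarities, match_index, tau):
    """InfoNCE loss for one image scored against N candidate texts.

    Returns the loss and the softmax probabilities over the candidates.
    """
    logits = np.asarray(similarities, dtype=float) / tau
    logits = logits - logits.max()                     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
    return float(-log_probs[match_index]), np.exp(log_probs)

# Hypothetical cosine similarities; index 0 is the matching text.
sims = [0.8, 0.5, 0.3, 0.1]

for tau in (1.0, 0.07):
    loss, probs = info_nce_row(sims, match_index=0, tau=tau)
    print(f"tau={tau}: loss={loss:.4f}, probs={probs.round(3)}")
```

With the same similarities, the lower temperature concentrates nearly all probability on the matching text and drives the loss toward zero, while $\tau = 1$ leaves the distribution close to uniform.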

In [None]:
class ContrastiveLoss(nn.Module):
    """InfoNCE contrastive loss (CLIP-style)."""
    
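    def __init__(self, temperature=0.07):
        super().__init__()
        # Stored as log(1 / temperature): a learnable log inverse temperature.
        # forward() exponentiates it, which keeps the logit scale positive.
        self.temperature = nn.Parameter(torch.tensor(math.log(1/temperature)))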
    
    def forward(self, image_embeddings, text_embeddings):
        """Compute symmetric contrastive loss.
        
        Args:
            image_embeddings: (batch_size, embed_dim)
            text_embeddings: (batch_size, embed_dim)
        """
        # Normalize embeddings
        image_embeddings = F.normalize(image_embeddings, dim=-1)
        text_embeddings = F.normalize(text_embeddings, dim=-1)
        
        # Similarity matrix of cosine similarities, scaled by 1 / temperature
        temp = self.temperature.exp()
        logits = image_embeddings @ text_embeddings.T * temp
        
        # Labels: diagonal (matching pairs)
        batch_size = logits.shape[0]
        labels = torch.arange(batch_size, device=logits.device)
        
        # Symmetric loss: image-to-text + text-to-image
        loss_i2t = F.cross_entropy(logits, labels)
        loss_t2i = F.cross_entropy(logits.T, labels)
        
        return (loss_i2t + loss_t2i) / 2, logits


# Demonstrate contrastive loss
torch.manual_seed(42)
batch_size = 8
embed_dim = 64

# Simulated embeddings (not yet aligned)
img_emb = torch.randn(batch_size, embed_dim)
txt_emb = torch.randn(batch_size, embed_dim)

criterion = ContrastiveLoss(temperature=0.07)
loss, logits = criterion(img_emb, txt_emb)

print(f"Batch size: {batch_size}, Embed dim: {embed_dim}")
print(f"Contrastive loss (random embeddings): {loss.item():.4f}")
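# log(N) is the loss when every candidate gets equal probability; random
# embeddings scaled by 1/temperature typically score somewhat worse than this.
print(f"Loss for uniform predictions, log(N): {math.log(batch_size):.4f}")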

# Now with perfectly aligned embeddings
aligned_emb = F.normalize(torch.randn(batch_size, embed_dim), dim=-1)
loss_aligned, _ = criterion(aligned_emb, aligned_emb)
print(f"Contrastive loss (perfectly aligned): {loss_aligned.item():.4f}")

In [None]:
# Visualize similarity matrix
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Random (unaligned) similarity
ax = axes[0]
sim_random = F.normalize(img_emb) @ F.normalize(txt_emb).T
im = ax.imshow(sim_random.detach().numpy(), cmap='RdBu_r', vmin=-0.5, vmax=0.5)
ax.set_xlabel('Text', fontsize=11)
ax.set_ylabel('Image', fontsize=11)
ax.set_title('Similarity: Before Training', fontsize=12, fontweight='bold')
plt.colorbar(im, ax=ax)

# Aligned similarity (diagonal should be bright)
ax = axes[1]
# Simulate aligned: add identity-like structure
aligned_img = F.normalize(torch.randn(batch_size, embed_dim), dim=-1)
aligned_txt = aligned_img + torch.randn(batch_size, embed_dim) * 0.1  # Small noise
aligned_txt = F.normalize(aligned_txt, dim=-1)

sim_aligned = aligned_img @ aligned_txt.T
im = ax.imshow(sim_aligned.detach().numpy(), cmap='RdBu_r', vmin=-0.5, vmax=1.0)
ax.set_xlabel('Text', fontsize=11)
ax.set_ylabel('Image', fontsize=11)
ax.set_title('Similarity: After Training (CLIP)', fontsize=12, fontweight='bold')
plt.colorbar(im, ax=ax)

plt.tight_layout()
plt.show()

---

## 3. Building CLIP from Scratch

CLIP has two encoders:
1. **Image encoder**: CNN or Vision Transformer -> image embedding
2. **Text encoder**: Transformer -> text embedding

Both project into the same embedding space and are trained with contrastive loss.

**F1 analogy:** Building a CLIP-style system for F1 means training two separate encoders: one for onboard camera frames (the image encoder, processing the visual stream) and one for radio transcripts and telemetry descriptions (the text encoder). Both project into the same space so you can ask: "Which radio message best describes what's happening in this camera frame?" or "Which camera frame matches this radio transcript?" The image encoder learns to see track conditions, car positions, and weather; the text encoder learns to understand the language of racing. They meet in a shared embedding space where a picture of a safety car and the phrase "safety car deployed" end up close together.

In [None]:
class ImageEncoder(nn.Module):
    """Simple CNN image encoder (in real CLIP: ViT or ResNet)."""
    
    def __init__(self, image_size=32, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.projection = nn.Linear(128, embed_dim)
    
    def forward(self, x):
        features = self.features(x).squeeze(-1).squeeze(-1)
        return self.projection(features)


class TextEncoder(nn.Module):
    """Simple transformer text encoder."""
    
    def __init__(self, vocab_size, embed_dim=128, max_len=32, n_heads=4, n_layers=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_len, embed_dim)
        
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, dim_feedforward=embed_dim * 4,
            dropout=0.1, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.projection = nn.Linear(embed_dim, embed_dim)
    
    def forward(self, token_ids):
        B, T = token_ids.shape
        positions = torch.arange(T, device=token_ids.device).unsqueeze(0).expand(B, T)
        
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        x = self.transformer(x)
        
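        # Mean pooling over all positions, padding included (no [CLS] token here;
        # the original CLIP takes the end-of-text token's features instead)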
        pooled = x.mean(dim=1)
        return self.projection(pooled)


class MiniCLIP(nn.Module):
    """Minimal CLIP model."""
    
    def __init__(self, vocab_size, embed_dim=128, image_size=32):
        super().__init__()
        self.image_encoder = ImageEncoder(image_size, embed_dim)
        self.text_encoder = TextEncoder(vocab_size, embed_dim)
        self.criterion = ContrastiveLoss(temperature=0.07)
    
    def forward(self, images, text_ids):
        image_emb = self.image_encoder(images)
        text_emb = self.text_encoder(text_ids)
        loss, logits = self.criterion(image_emb, text_emb)
        return loss, image_emb, text_emb, logits
    
    def encode_image(self, images):
        return F.normalize(self.image_encoder(images), dim=-1)
    
    def encode_text(self, text_ids):
        return F.normalize(self.text_encoder(text_ids), dim=-1)


# Create synthetic training data
# Simulate 5 classes with paired images and text
n_classes = 5
n_per_class = 20
image_size = 16
vocab_size = 50
max_text_len = 8

# Generate synthetic "images" (colored patches representing different classes)
images = []
texts = []
labels = []

class_colors = [
    [1, 0, 0],   # Red
    [0, 1, 0],   # Green
    [0, 0, 1],   # Blue
    [1, 1, 0],   # Yellow
    [1, 0, 1],   # Magenta
]

# Text tokens per class (simulated captions)
class_text_patterns = [
    [3, 5, 7, 0, 0, 0, 0, 0],   # "red object"
    [4, 8, 12, 0, 0, 0, 0, 0],  # "green thing"
    [6, 9, 15, 0, 0, 0, 0, 0],  # "blue item"
    [10, 11, 20, 0, 0, 0, 0, 0], # "yellow shape"
    [13, 14, 25, 0, 0, 0, 0, 0], # "magenta form"
]

for c in range(n_classes):
    for _ in range(n_per_class):
        # Image: class color + noise
        img = np.zeros((3, image_size, image_size), dtype=np.float32)
        for ch in range(3):
            img[ch] = class_colors[c][ch] + np.random.normal(0, 0.2, (image_size, image_size))
        images.append(img)
        
        # Text: class pattern + noise
        text = class_text_patterns[c].copy()
        # Add some random variation
        for j in range(len(text)):
            if text[j] > 0 and np.random.random() < 0.2:
                text[j] += np.random.randint(-1, 2)
        texts.append(text)
        labels.append(c)

images_tensor = torch.tensor(np.array(images))
texts_tensor = torch.tensor(np.array(texts)).long()
labels_tensor = torch.tensor(labels)

print(f"Dataset: {len(images)} image-text pairs, {n_classes} classes")
print(f"Image shape: {images_tensor.shape}")
print(f"Text shape: {texts_tensor.shape}")

In [None]:
# Train MiniCLIP
model = MiniCLIP(vocab_size, embed_dim=64, image_size=image_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

n_params = sum(p.numel() for p in model.parameters())
print(f"MiniCLIP parameters: {n_params:,}")

losses = []
n_epochs = 100
batch_size = 32
n_samples = len(images_tensor)

model.train()
for epoch in range(n_epochs):
    # One randomly sampled batch per iteration, so each "epoch" here is a single step
    idx = torch.randperm(n_samples)[:batch_size]
    batch_images = images_tensor[idx]
    batch_texts = texts_tensor[idx]
    
    loss, _, _, _ = model(batch_images, batch_texts)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    losses.append(loss.item())
    
    if (epoch + 1) % 20 == 0:
        print(f"  Epoch {epoch+1}/{n_epochs}: loss = {loss.item():.4f}")

print(f"\nFinal loss: {losses[-1]:.4f} (random baseline: {math.log(batch_size):.4f})")

In [None]:
# Visualize training and learned embeddings
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training loss
ax = axes[0]
ax.plot(losses, color='#3498db', linewidth=1, alpha=0.3)
w = 5
smoothed = [np.mean(losses[max(0,i-w):i+1]) for i in range(len(losses))]
ax.plot(smoothed, color='#3498db', linewidth=2)
ax.axhline(y=math.log(batch_size), color='red', linestyle='--', alpha=0.5,
          label=f'Random baseline (log {batch_size})')
ax.set_xlabel('Epoch', fontsize=11)
ax.set_ylabel('Contrastive Loss', fontsize=11)
ax.set_title('CLIP Training Loss', fontsize=13, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)

# Learned embedding space (2D projection via PCA)
ax = axes[1]
model.eval()
with torch.no_grad():
    img_embs = model.encode_image(images_tensor).numpy()
    txt_embs = model.encode_text(texts_tensor).numpy()

# Simple PCA for visualization
all_embs = np.concatenate([img_embs, txt_embs], axis=0)
mean = all_embs.mean(axis=0)
centered = all_embs - mean
cov = centered.T @ centered / len(centered)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
top2 = eigenvectors[:, -2:]

img_2d = (img_embs - mean) @ top2
txt_2d = (txt_embs - mean) @ top2

class_names = ['Red', 'Green', 'Blue', 'Yellow', 'Magenta']
color_map = ['#e74c3c', '#2ecc71', '#3498db', '#f1c40f', '#9b59b6']

for c in range(n_classes):
    mask = labels_tensor.numpy() == c
    ax.scatter(img_2d[mask, 0], img_2d[mask, 1], c=color_map[c], marker='o',
             alpha=0.5, s=30)
    ax.scatter(txt_2d[mask, 0], txt_2d[mask, 1], c=color_map[c], marker='s',
             alpha=0.5, s=30)

# Legend: marker shape encodes modality, color encodes class
ax.scatter([], [], c='gray', marker='o', s=40, label='Image embeddings')
ax.scatter([], [], c='gray', marker='s', s=40, label='Text embeddings')
ax.set_xlabel('PC 1', fontsize=11)
ax.set_ylabel('PC 2', fontsize=11)
ax.set_title('Learned Multimodal Embedding Space', fontsize=13, fontweight='bold')
ax.legend(fontsize=8, loc='upper right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## 4. Zero-Shot Classification

CLIP's most remarkable ability: classify images into **any** set of categories without training on those categories.

### How It Works
1. Encode the image
2. Encode each candidate label as text: "a photo of a {label}"
3. The label with highest cosine similarity to the image wins

This works because the shared embedding space captures semantic meaning — the model understands "cat" refers to the same concept whether it appears as pixels or as text.

**F1 analogy:** Zero-shot classification is like a system that can label onboard camera frames with *any* set of descriptions — without ever being specifically trained on those labels. Feed it an onboard frame and the candidate descriptions ["dry track conditions", "wet track conditions", "safety car period", "pit stop in progress", "overtaking maneuver"], and the system picks the best match based purely on its understanding of the shared vision-language space. No labeled training data needed for the specific categories. This is how a CLIP-style system could automatically tag thousands of hours of race footage with descriptive labels it has never explicitly been trained on.
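
The three steps above can be sketched without any trained model. The example below turns cosine similarities between one image embedding and a set of label embeddings into probabilities with a temperature-scaled softmax. The embeddings, the label names, and the helper `zero_shot_probs` are hypothetical stand-ins for encoder outputs.

```python
import numpy as np

def zero_shot_probs(image_emb, label_embs, tau=0.07):
    """Softmax over cosine similarities between one image and K label texts."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    logits = label_embs @ image_emb / tau
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical label embeddings (in practice: text encoder outputs).
label_names = ["dry track", "wet track", "safety car"]
label_embs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])

# Hypothetical image embedding (in practice: image encoder output).
image_emb = np.array([0.2, 0.9, 0.1])

probs = zero_shot_probs(image_emb, label_embs)
best_label = label_names[int(probs.argmax())]
print(best_label, probs.round(3))
```

Changing the candidate categories only means encoding a different list of label texts; neither encoder is retrained.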

In [None]:
class ZeroShotClassifier:
    """Zero-shot classification using CLIP-style model."""
    
    def __init__(self, model):
        self.model = model
        self.model.eval()
    
    def classify(self, images, label_text_ids):
        """Classify images into label categories.
        
        Args:
            images: (N, C, H, W) tensor
            label_text_ids: (K, T) tensor, one text per label
        
        Returns:
            predictions: (N,) label indices
            similarities: (N, K) similarity scores
        """
        with torch.no_grad():
            image_embs = self.model.encode_image(images)  # (N, D)
            text_embs = self.model.encode_text(label_text_ids)  # (K, D)
            
            # Cosine similarity
            similarities = image_embs @ text_embs.T  # (N, K)
            predictions = similarities.argmax(dim=1)
        
        return predictions, similarities


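# Zero-shot-style classification on our dataset.
# Note: these five classes were also seen during training, so this checks the
# mechanics rather than true generalization to unseen categories.
classifier = ZeroShotClassifier(model)

# Token IDs of the canonical caption for each class (one text per label)
label_texts = torch.tensor(class_text_patterns).long()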

# Classify all images
preds, sims = classifier.classify(images_tensor, label_texts)

# Accuracy
accuracy = (preds == labels_tensor).float().mean().item()

print(f"Zero-Shot Classification Results\n")
print(f"  Overall accuracy: {accuracy:.1%}")

# Per-class accuracy
print("\n  Per-class accuracy:")
for c in range(n_classes):
    mask = labels_tensor == c
    class_acc = (preds[mask] == c).float().mean().item()
    print(f"    {class_names[c]:>10}: {class_acc:.1%}")

# Confusion matrix
confusion = np.zeros((n_classes, n_classes), dtype=int)
for true, pred in zip(labels_tensor.numpy(), preds.numpy()):
    confusion[true][pred] += 1

print("\n  Confusion matrix:")
print(f"{'':>10}", ''.join(f'{n:>8}' for n in class_names))
for i, name in enumerate(class_names):
    print(f"{name:>10}", ''.join(f'{confusion[i][j]:>8}' for j in range(n_classes)))

---

## 5. Cross-Modal Retrieval

Another powerful application: retrieve images using text queries (or vice versa).

- **Text -> Image**: "Find me images of blue objects"
- **Image -> Text**: Given an image, find the most relevant captions

**F1 analogy:** Cross-modal retrieval is incredibly useful for F1 broadcast and analysis. **Text -> Image**: A broadcast producer types "overtaking move into Turn 1" and the system retrieves the most relevant onboard camera frames from the entire race archive. **Image -> Text**: Given an onboard frame showing a car spinning, the system finds the most relevant radio transcripts ("I've lost the rear end!", "Car has spun at Turn 8"). This works because both modalities live in the same aligned embedding space — the CLIP-style system has learned that images of spins and radio messages about spins are the same *concept* expressed in different forms.
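
Retrieval quality is commonly summarized with Recall@k. As a minimal sketch, the function below computes a pair-level version: a query counts as a hit only if its own paired item (same index) appears among the top k candidates. The function name `recall_at_k` and the toy similarity matrix are assumptions for illustration.

```python
import numpy as np

def recall_at_k(similarity, k):
    """Fraction of queries whose paired item (same index) ranks in the top k.

    similarity[i, j] is the score between query i and candidate j.
    """
    # Candidate indices sorted by descending score, truncated to the top k
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    targets = np.arange(similarity.shape[0])[:, None]
    return float((top_k == targets).any(axis=1).mean())

# Toy scores: 3 text queries (rows) against 3 images (columns).
sim = np.array([
    [0.9, 0.2, 0.1],   # query 0: paired image ranked 1st
    [0.6, 0.5, 0.1],   # query 1: paired image ranked 2nd
    [0.7, 0.4, 0.3],   # query 2: paired image ranked 3rd
])

for k in (1, 2, 3):
    print(f"Recall@{k}: {recall_at_k(sim, k):.3f}")
```

On this toy matrix Recall@1 is 1/3, Recall@2 is 2/3, and Recall@3 is 1.0. A class-level variant, where any retrieved item of the right class counts as a hit, is more lenient and reports higher numbers on the same embeddings.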

In [None]:
class CrossModalRetriever:
    """Cross-modal retrieval using aligned embeddings."""
    
    def __init__(self, model):
        self.model = model
        self.model.eval()
        self.image_index = None
        self.text_index = None
    
    def build_image_index(self, images):
        """Pre-compute image embeddings."""
        with torch.no_grad():
            self.image_index = self.model.encode_image(images)
    
    def build_text_index(self, texts):
        """Pre-compute text embeddings."""
        with torch.no_grad():
            self.text_index = self.model.encode_text(texts)
    
    def text_to_image(self, query_text, top_k=5):
        """Find images most similar to a text query."""
        with torch.no_grad():
            query_emb = self.model.encode_text(query_text)
        
        similarities = query_emb @ self.image_index.T
        scores, indices = similarities.squeeze().topk(top_k)
        return indices, scores
    
    def image_to_text(self, query_image, top_k=5):
        """Find texts most similar to an image query."""
        with torch.no_grad():
            query_emb = self.model.encode_image(query_image)
        
        similarities = query_emb @ self.text_index.T
        scores, indices = similarities.squeeze().topk(top_k)
        return indices, scores
    
    def evaluate_retrieval(self, labels, top_k=5):
        """Class-level text-to-image Recall@k: a hit is any top-k image of the query's class."""
        n = len(labels)
        recalls = []
        
        for i in range(n):
            # Text-to-image: use text i to find images
            query = self.text_index[i:i+1]
            sims = query @ self.image_index.T
            _, top_indices = sims.squeeze().topk(top_k)
            
            # Class-level hit: any top-k image sharing the query's class counts
            retrieved_labels = labels[top_indices.numpy()]
            correct = labels[i] in retrieved_labels
            recalls.append(correct)
        
        return np.mean(recalls)


retriever = CrossModalRetriever(model)
retriever.build_image_index(images_tensor)
retriever.build_text_index(texts_tensor)

# Evaluate retrieval
labels_np = labels_tensor.numpy()

print("Cross-Modal Retrieval Results\n")
for k in [1, 3, 5, 10]:
    recall = retriever.evaluate_retrieval(labels_np, top_k=k)
    print(f"  Recall@{k}: {recall:.1%}")

# Example queries
print("\nExample: Text → Image retrieval")
for c in range(n_classes):
    query = texts_tensor[c * n_per_class:c * n_per_class + 1]  # First text of each class
    indices, scores = retriever.text_to_image(query, top_k=3)
    retrieved_classes = [class_names[labels_np[idx]] for idx in indices.numpy()]
    print(f"  Query class '{class_names[c]}' -> Retrieved: {retrieved_classes} "
          f"(scores: {scores.numpy().round(3)})")

In [None]:
# Visualize retrieval
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Recall@k curve
ax = axes[0]
ks = [1, 2, 3, 5, 10, 15, 20]
recalls = [retriever.evaluate_retrieval(labels_np, top_k=k) for k in ks]
ax.plot(ks, recalls, 'bo-', linewidth=2, markersize=8)
ax.set_xlabel('k (top-k retrieved)', fontsize=11)
ax.set_ylabel('Recall@k', fontsize=11)
ax.set_title('Text→Image Retrieval Performance', fontsize=13, fontweight='bold')
ax.set_ylim(0, 1.05)
ax.grid(True, alpha=0.3)

# Cross-modal similarity heatmap (class-level)
ax = axes[1]
class_img_embs = []
class_txt_embs = []
for c in range(n_classes):
    mask = labels_np == c
    with torch.no_grad():
        img_emb = model.encode_image(images_tensor[mask]).mean(dim=0)
        txt_emb = model.encode_text(texts_tensor[mask]).mean(dim=0)
    class_img_embs.append(img_emb)
    class_txt_embs.append(txt_emb)

class_img = torch.stack(class_img_embs)
class_txt = torch.stack(class_txt_embs)
class_sim = (F.normalize(class_img) @ F.normalize(class_txt).T).numpy()

im = ax.imshow(class_sim, cmap='YlOrRd', vmin=0, vmax=1)
ax.set_xticks(range(n_classes))
ax.set_yticks(range(n_classes))
ax.set_xticklabels(class_names, rotation=45, ha='right')
ax.set_yticklabels(class_names)
ax.set_xlabel('Text Class', fontsize=11)
ax.set_ylabel('Image Class', fontsize=11)
ax.set_title('Cross-Modal Class Similarity', fontsize=13, fontweight='bold')

for i in range(n_classes):
    for j in range(n_classes):
        color = 'white' if class_sim[i, j] > 0.5 else 'black'
        ax.text(j, i, f'{class_sim[i, j]:.2f}', ha='center', va='center',
               fontsize=9, color=color, fontweight='bold')

plt.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()

---

## 6. Vision-Language Models

While CLIP aligns modalities for retrieval and classification, **vision-language models** (VLMs) can *generate* text about images. They extend LLMs with visual understanding.

### Architecture Patterns

| Model | Approach | How Vision Connects to LLM | F1 Parallel |
|-------|---------|---------------------------|-------------|
| **LLaVA** | Visual tokens | Image patches -> visual tokens prepended to text | Feeding onboard camera frames as "visual tokens" alongside telemetry data into the strategy model |
| **Flamingo** | Cross-attention | Image features injected via cross-attention layers | The strategy model *attending to* camera feeds when relevant, like a race engineer glancing at screens during key moments |
| **BLIP-2** | Q-Former bridge | Learnable queries extract info from frozen image encoder | A specialized "visual analyst" module that extracts key information from camera feeds before passing it to the strategy team |
| **Claude/GPT-4V** | Proprietary multimodal | Accept images and text in one prompt; how vision connects internally is not publicly documented | A fully integrated pit wall system where camera, telemetry, and radio are processed as one unified input stream |

**F1 analogy:** Vision-language models are the evolution from "matching images to descriptions" (CLIP) to "understanding images and talking about them" (VLMs). In F1 terms, CLIP can match an onboard frame to the correct radio transcript. But a VLM can *look at* an onboard frame and generate: "The car ahead is running wide at Turn 3 exit, there's an opportunity to attack into Turn 4 on the inside. Track surface appears dry but there are dark clouds approaching from the left." That's genuine visual understanding combined with language generation — the AI equivalent of a race engineer narrating what they see on the monitors.
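
Models of this kind are typically trained with next-token prediction on the text that follows the visual tokens. The sketch below shows that objective under stated assumptions: logits at text position t are scored against the token at position t+1, and padding positions are ignored. The helper name `next_token_loss` and the choice of `pad_id=0` are assumptions for illustration.

```python
import numpy as np

def next_token_loss(text_logits, text_ids, pad_id=0):
    """Mean cross-entropy for predicting token t+1 from position t, skipping padding.

    text_logits: (B, T, V) scores at each text position
    text_ids:    (B, T) integer token ids
    Assumes at least one non-padding target exists.
    """
    logits = text_logits[:, :-1, :]      # positions 0..T-2 make predictions
    targets = text_ids[:, 1:]            # tokens 1..T-1 are the targets
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Log-probability assigned to each target token
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1).squeeze(-1)
    mask = targets != pad_id             # ignore padded target positions
    return float(-(picked * mask).sum() / mask.sum())

# One caption of three real tokens followed by padding, vocabulary of 50.
ids = np.array([[3, 5, 7, 0, 0, 0, 0, 0]])
uniform_logits = np.zeros((1, 8, 50))   # an untrained model guessing uniformly

loss = next_token_loss(uniform_logits, ids)
print(f"loss = {loss:.4f}, log(vocab) = {np.log(50):.4f}")
```

Uniform logits give a loss of exactly log(vocabulary size), which is a useful reference point when checking that training has started to reduce the loss.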

In [None]:
class SimpleVLM(nn.Module):
    """Simplified Vision-Language Model (LLaVA-style).
    
    Converts image features to "visual tokens" that are prepended
    to text tokens, then processed by a language model.
    """
    
    def __init__(self, vocab_size, embed_dim=64, n_visual_tokens=4, image_size=16):
        super().__init__()
        self.n_visual_tokens = n_visual_tokens
        
        # Image encoder
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        
        # Project image features to visual tokens
        self.visual_projection = nn.Linear(64, n_visual_tokens * embed_dim)
        
        # Language model components
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(64, embed_dim)  # Max combined length
        
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=embed_dim, nhead=4, dim_feedforward=embed_dim * 4,
            dropout=0.1, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.lm_head = nn.Linear(embed_dim, vocab_size)
    
    def forward(self, images, text_ids):
        """Forward pass: image + text -> next token predictions."""
        B = images.shape[0]
        
        # Encode image -> visual tokens
        img_features = self.image_encoder(images).squeeze(-1).squeeze(-1)  # (B, 64)
        visual_tokens = self.visual_projection(img_features)  # (B, n_vis * embed)
        visual_tokens = visual_tokens.view(B, self.n_visual_tokens, -1)  # (B, n_vis, embed)
        
        # Embed text tokens
        text_emb = self.token_emb(text_ids)  # (B, T, embed)
        
        # Concatenate: [visual_tokens, text_tokens]
        combined = torch.cat([visual_tokens, text_emb], dim=1)  # (B, n_vis + T, embed)
        seq_len = combined.shape[1]
        
        # Add positional embeddings
        positions = torch.arange(seq_len, device=combined.device).unsqueeze(0).expand(B, -1)
        combined = combined + self.pos_emb(positions)
        
        # Causal mask
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, device=combined.device), diagonal=1).bool()
        
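        # Decode. nn.TransformerDecoder requires a memory input, so the same sequence
        # is passed as both target and memory, each with a causal mask. A real
        # LLaVA-style model uses a decoder-only LM with self-attention only.
        output = self.decoder(combined, combined, tgt_mask=causal_mask, memory_mask=causal_mask)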
        logits = self.lm_head(output)
        
        # Only return logits for text positions (skip visual tokens)
        text_logits = logits[:, self.n_visual_tokens:, :]
        return text_logits


# Demonstrate VLM
vlm = SimpleVLM(vocab_size=50, embed_dim=64, n_visual_tokens=4, image_size=16)
vlm_params = sum(p.numel() for p in vlm.parameters())

# Forward pass
sample_images = images_tensor[:4]
sample_texts = texts_tensor[:4]

vlm.eval()
with torch.no_grad():
    text_logits = vlm(sample_images, sample_texts)

print("Vision-Language Model")
print(f"  Parameters: {vlm_params:,}")
print(f"  Visual tokens per image: {vlm.n_visual_tokens}")
print(f"  Input: {sample_images.shape} images + {sample_texts.shape} text")
print(f"  Output logits: {text_logits.shape} (text positions only)")
print(f"  Predicted tokens: {text_logits.argmax(dim=-1).tolist()}")
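
The causal mask built in `forward` is easier to read on a tiny example. The NumPy sketch below rebuilds the same upper-triangular mask for 4 visual tokens followed by a caption of 3 text tokens (the caption length is an arbitrary assumption) and checks what the first text token is allowed to attend to.

```python
import numpy as np

n_visual_tokens = 4   # same value as the SimpleVLM demo
n_text_tokens = 3     # hypothetical short caption
seq_len = n_visual_tokens + n_text_tokens

# True marks a blocked (future) position, mirroring
# torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

# Row i lists the positions that token i may attend to
visible = ~causal_mask
first_text = n_visual_tokens  # index of the first text token

print(visible.astype(int))
print("First text token sees", int(visible[first_text].sum()), "positions")
```

Every text token can see all visual tokens plus the text before it. Note that under this mask the visual tokens are also restricted to earlier visual tokens, which is a property of this simple design rather than a requirement of vision-language models in general.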

---

## 7. Multimodal Applications Landscape

Multimodal AI powers a wide range of applications today. In Formula 1, the potential applications span every aspect of the sport — from real-time race analysis to fan engagement to safety systems.

In [None]:
# Visualize the multimodal landscape
fig, ax = plt.subplots(1, 1, figsize=(14, 8))
ax.set_xlim(0, 14)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title('Multimodal AI Application Landscape', fontsize=15, fontweight='bold')

apps = [
    # (x, y, label, description, color)
    (2, 8, 'Image\nCaptioning', 'Image → Text description', '#3498db'),
    (5.5, 8, 'Visual QA', 'Image + Question → Answer', '#2ecc71'),
    (9, 8, 'OCR / Doc\nUnderstanding', 'Document → Structured data', '#f39c12'),
    (12, 8, 'Image\nGeneration', 'Text → Image (DALL-E)', '#e74c3c'),
    (2, 5, 'Cross-modal\nRetrieval', 'Text ↔ Image search', '#9b59b6'),
    (5.5, 5, 'Zero-shot\nClassification', 'Classify without training', '#1abc9c'),
    (9, 5, 'Video\nUnderstanding', 'Video → Description', '#e67e22'),
    (12, 5, 'Audio-Visual\nSpeech', 'Lip reading, dubbing', '#c0392b'),
    (3.5, 2, 'Multimodal\nAgents', 'See + reason + act', '#2c3e50'),
    (7, 2, 'Medical\nImaging + NLP', 'X-ray → Diagnosis report', '#16a085'),
    (10.5, 2, 'Autonomous\nDriving', 'Camera + LiDAR + Maps', '#8e44ad'),
]

for x, y, label, desc, color in apps:
    box = mpatches.FancyBboxPatch((x - 1.3, y - 0.7), 2.6, 1.4,
                                   boxstyle="round,pad=0.12", facecolor=color,
                                   edgecolor='black', linewidth=1.5, alpha=0.85)
    ax.add_patch(box)
    ax.text(x, y + 0.15, label, ha='center', va='center', fontsize=9,
            fontweight='bold', color='white')
    ax.text(x, y - 0.9, desc, ha='center', va='center', fontsize=7, color='gray')

# Category labels
ax.text(7, 9.3, 'Generation & Understanding', ha='center', fontsize=11,
        fontweight='bold', color='gray', style='italic')
ax.text(7, 6.3, 'Retrieval & Classification', ha='center', fontsize=11,
        fontweight='bold', color='gray', style='italic')
ax.text(7, 3.3, 'Domain Applications', ha='center', fontsize=11,
        fontweight='bold', color='gray', style='italic')

plt.tight_layout()
plt.show()

---

## Exercises

### Exercise 1: Hard Negative Mining

Improve the contrastive training by implementing **hard negative mining**: instead of random negatives, select the most confusing non-matching pairs (highest similarity among negatives). Compare convergence speed and final accuracy against the random-negative baseline.

**F1 scenario:** Instead of random negative pairings, find the *hardest* negatives: the most confusing mismatches. For example, an onboard frame of heavy rain paired with the caption "light drizzle, track drying" is a hard negative (close but wrong, since it's actually a downpour). So is a pit stop image paired with "practice start at pit exit" (visually similar pit lane activity, but a completely different event). These hard negatives force the model to learn finer distinctions, just as F1 drivers improve most by studying their *closest* competitors, not the backmarkers.
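
As a possible starting point (not a reference solution), the sketch below up-weights the hardest in-batch negative inside an image-to-text InfoNCE loss. The function name, the `hard_weight` parameter, and the random embeddings are all assumptions made for illustration, and only one direction of the loss is shown.

```python
import numpy as np

def hard_negative_info_nce(img_emb, txt_emb, temperature=0.1, hard_weight=2.0):
    """Image->text InfoNCE where each row's hardest negative is up-weighted."""
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (B, B), positives on the diagonal
    B = logits.shape[0]

    # Hardest negative per row: highest similarity off the diagonal
    off_diag = np.where(np.eye(B, dtype=bool), -np.inf, logits)
    hard_idx = off_diag.argmax(axis=1)

    # Adding log(w) to a logit multiplies its exp term in the softmax
    # denominator by w, so the hardest negative counts w times
    weighted = logits.copy()
    weighted[np.arange(B), hard_idx] += np.log(hard_weight)

    # Numerically stable log-softmax, then read off the diagonal (positives)
    weighted -= weighted.max(axis=1, keepdims=True)
    log_probs = weighted - np.log(np.exp(weighted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(B), np.arange(B)].mean()
    return loss, hard_idx

# Toy data (hypothetical): each "caption" is a noisy copy of its image embedding
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(8, 16))
txt_emb = img_emb + 0.5 * rng.normal(size=(8, 16))

loss_plain, _ = hard_negative_info_nce(img_emb, txt_emb, hard_weight=1.0)
loss_hard, hard_idx = hard_negative_info_nce(img_emb, txt_emb, hard_weight=2.0)
print(f"plain InfoNCE: {loss_plain:.4f}  hard-weighted: {loss_hard:.4f}")
```

With `hard_weight=1.0` this reduces to ordinary InfoNCE, which makes it a convenient baseline for the comparison the exercise asks for. A full solution would also add the text-to-image direction and plug the loss into the training loop.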

In [None]:
# Exercise 1: Your code here
# Hint: After computing the similarity matrix, for each positive pair,
# find the negative with highest similarity and weight it more in the loss.


### Exercise 2: Prompt Engineering for Zero-Shot

Real CLIP uses prompt templates like "a photo of a {class}", "a {class} in the wild", etc. Implement an ensemble of 5 different prompt templates and show that averaging their text embeddings improves zero-shot accuracy.

**F1 scenario:** Instead of a single text template like "an image of {condition}", use multiple phrasings: "onboard camera showing {condition}", "F1 track in {condition}", "race conditions: {condition}", "a Grand Prix under {condition} conditions", "telemetry consistent with {condition}". Average these embeddings for each category and show that the ensemble is more robust than any single template — just as an F1 team consults multiple data sources before making a strategy call.
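
The averaging step can be sketched as follows. `toy_text_encoder` is a hypothetical stand-in that returns a deterministic random unit vector per string, and the class names are made up for the example; in the exercise you would swap in the trained text encoder and your own categories.

```python
import zlib
import numpy as np

TEMPLATES = [
    "onboard camera showing {}",
    "F1 track in {}",
    "race conditions: {}",
    "a Grand Prix under {} conditions",
    "telemetry consistent with {}",
]

def toy_text_encoder(text, dim=32):
    """Hypothetical encoder: a deterministic unit vector seeded by the string."""
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def ensemble_class_embedding(class_name, encode=toy_text_encoder):
    """Encode every template, average the unit vectors, then renormalize."""
    embs = np.stack([encode(t.format(class_name)) for t in TEMPLATES])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

classes = ["heavy rain", "dry track", "safety car"]  # hypothetical categories
class_matrix = np.stack([ensemble_class_embedding(c) for c in classes])

def zero_shot_predict(image_emb):
    """Pick the class whose ensembled text embedding is most similar."""
    return classes[int(np.argmax(class_matrix @ image_emb))]

# Sanity check with a fake image embedding that coincides with one class
print(zero_shot_predict(class_matrix[1]))
```

Renormalizing after the mean keeps the ensembled embedding on the unit sphere, so its cosine similarities stay comparable to those of single-template embeddings.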

In [None]:
# Exercise 2: Your code here
# Hint: Create multiple text encodings per class using different templates,
# average the embeddings, and compare to single-template accuracy.


### Exercise 3: Multimodal Fusion Model

Implement a simple **fusion model** that combines image and text embeddings for a classification task. Compare three fusion strategies: (1) concatenation, (2) element-wise addition, (3) cross-attention. Which performs best?

**F1 scenario:** Build a model that fuses onboard camera embeddings (image) with radio transcript embeddings (text) to classify race incidents. Try three fusion approaches: (1) concatenation — stack camera and radio features side by side, (2) element-wise addition — directly combine the signals, (3) cross-attention — let the camera features attend to the radio features and vice versa. Which fusion strategy best captures the relationship between what the camera sees and what the driver says? This mirrors the real pit wall challenge of combining visual and verbal information for rapid incident classification.
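
To make the three strategies concrete, here is a minimal NumPy sketch of the fusion operations themselves, without any training. All embeddings and projection weights are random placeholders, the shapes are assumptions, and the cross-attention is single-head in one direction only (camera tokens attending to radio tokens).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
img_vec = rng.normal(size=d)           # pooled camera embedding (placeholder)
txt_vec = rng.normal(size=d)           # pooled radio embedding (placeholder)
img_tokens = rng.normal(size=(4, d))   # 4 visual tokens
txt_tokens = rng.normal(size=(6, d))   # 6 text tokens

# (1) Concatenation: keeps both signals intact, doubles the feature size
fused_concat = np.concatenate([img_vec, txt_vec])       # (2d,)

# (2) Element-wise addition: same size, requires matching dimensions
fused_add = img_vec + txt_vec                            # (d,)

# (3) Cross-attention: visual tokens (queries) attend over text tokens (keys/values)
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
Q, K, V = img_tokens @ W_q, txt_tokens @ W_k, txt_tokens @ W_v
scores = Q @ K.T / np.sqrt(d)                            # (4, 6)
scores -= scores.max(axis=1, keepdims=True)              # stabilize the softmax
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
fused_xattn = (attn @ V).mean(axis=0)                    # pool over queries -> (d,)

print(fused_concat.shape, fused_add.shape, fused_xattn.shape)
```

Each fused vector would then feed a small classifier head. For the "vice versa" direction, repeat step (3) with the text tokens as queries and the visual tokens as keys and values.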

In [None]:
# Exercise 3: Your code here


---

## Summary

### Key Concepts

| Concept | What It Does | F1 Parallel |
|---------|-------------|-------------|
| **Multimodal alignment** | Maps different modalities into a shared embedding space | Connecting onboard camera frames, radio transcripts, and telemetry into a unified race understanding |
| **Contrastive learning** (InfoNCE) | Trains by pulling matching pairs together, pushing non-matching apart | Learning that rain images match "aquaplaning" radio calls and not "track is dry" calls |
| **CLIP** | Trains dual encoders on internet-scale image-text pairs | Learning to match race images with their telemetry descriptions across millions of examples |
| **Zero-shot classification** | Classifies without task-specific training data | Labeling race footage with any set of descriptive categories — no labeled training data needed |
| **Cross-modal retrieval** | Text->image and image->text search | "Find me overtaking moves into Turn 1" retrieves matching onboard frames from the archive |
| **Vision-language models** | Extend LLMs with visual understanding via visual tokens | A pit wall AI that looks at camera feeds and generates natural language race analysis |
| **Temperature** $\tau$ | Scales similarities before the softmax; lower values sharpen the distribution | How decisively the system favors the best-matching radio message over near misses for a given camera frame |
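
The effect of the temperature row above can be checked numerically. This small sketch divides a made-up set of similarity scores by different values of $\tau$ before the softmax; the scores and the chosen temperatures are illustrative assumptions.

```python
import numpy as np

def softmax_with_temperature(similarities, tau):
    """Softmax over similarities / tau, computed in a numerically stable way."""
    z = np.asarray(similarities, dtype=float) / tau
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

sims = [0.9, 0.7, 0.1]  # hypothetical cosine similarities for one image
for tau in (1.0, 0.1, 0.01):
    print(f"tau={tau}:", np.round(softmax_with_temperature(sims, tau), 3))
```

As $\tau$ shrinks, probability mass concentrates on the highest-similarity candidate, so the contrastive loss penalizes near misses more heavily.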

### The Multimodal Future

The trend is clear: the most capable AI systems are multimodal. Claude, GPT-4, and Gemini all process text, images, and more. Understanding how modalities are aligned and how information flows between them is essential for building the next generation of AI systems. In Formula 1, teams that can fuse onboard camera analysis, team radio understanding, and telemetry data into a unified real-time race intelligence system stand to gain a significant competitive advantage. The pit wall of the future doesn't just display data on separate screens; it *understands* the race as a unified multimodal experience, just as a human race engineer does, but at a speed and scale no human can match.

---

## Congratulations!

You've completed the full 28-notebook ML/AI curriculum! Here's everything you've learned, from the starting grid to the checkered flag:

- **Part 1-2**: Math foundations and programming tools — the engineering fundamentals, like learning physics and materials science before designing a race car
- **Part 3**: Neural network fundamentals (perceptrons, backprop, PyTorch, training) — building your first engine
- **Part 4**: Specialized architectures (CNNs, RNNs, attention) — adding aerodynamics, suspension, and power unit specializations
- **Part 5**: Modern NLP (transformers, language models, embeddings, fine-tuning) — the radio communication system connecting driver, car, and pit wall
- **Part 6**: Reinforcement learning (MDPs, Q-learning, policy gradients, PPO, RLHF) — teaching the strategy model to learn from race outcomes
- **Part 7**: Applied AI (RAG, agents, evaluation, production systems) — race day operations, from qualifying strategy to pit stop timing
- **Part 8**: LLM engineering (tokenization, inference optimization, ML systems, multimodal AI) — the complete factory, simulation farm, and pit wall technology stack

From matrix multiplication to multimodal contrastive learning — from the raw physics to the integrated pit wall intelligence system — you now have the foundations to understand, build, and deploy modern AI systems. The checkered flag is waving, but the real race is just beginning. Keep building!