# CLIP — Contrastive Learning and Shared Embedding Spaces

**Module 6.3, Lesson 3** | CourseAI

In the lesson, you learned how CLIP trains two separate encoders to put matching text and images near each other in a shared embedding space. Now you will work with CLIP directly — computing similarity matrices, exploring a pretrained model, performing zero-shot classification, and probing the model's limitations.

**What you will do:**
- Compute a cosine similarity matrix by hand from pre-computed embeddings and connect it to cross-entropy loss
- Load a real pretrained CLIP model and visualize its similarity matrix as a heatmap
- Perform zero-shot classification on CIFAR-10 images using text prompts — no training required
- Probe CLIP's systematic failure modes: spatial relationships, counting, negation, and attribute binding

**For each exercise, PREDICT the output before running the cell.**

Everything connects to concepts you already know. The cosine similarity matrix is the same operation from *Queries and Keys*. The cross-entropy loss is the same loss from *Transfer Learning*. The zero-shot classification uses the same embedding comparison pattern you saw in the lesson. No new theory — just practice.

**Estimated time:** 30–45 minutes.

---

## Setup

Run this cell to install dependencies and import everything.

**Note:** We use the `open_clip` package (OpenCLIP) rather than OpenAI's original `clip` package. OpenCLIP is actively maintained and available on PyPI. The model architecture is the same ViT-B/32, but the OpenCLIP version was trained on LAION-2B (~2 billion text-image pairs) rather than OpenAI's private WIT dataset (~400 million pairs). More data, same idea.

In [None]:
!pip install -q open_clip_torch

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np
import open_clip
from PIL import Image
import requests
from io import BytesIO
import torchvision
import torchvision.transforms as transforms

# Reproducible results
torch.manual_seed(42)
np.random.seed(42)

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

print('\nSetup complete.')

## Shared Helpers

In [None]:
def plot_similarity_matrix(sim_matrix, row_labels, col_labels, title='Similarity Matrix'):
    """Plot a cosine similarity matrix as a heatmap.

    Args:
        sim_matrix: 2D numpy array or torch tensor of similarity values
        row_labels: labels for each row (text descriptions)
        col_labels: labels for each column (image descriptions)
        title: plot title
    """
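    if isinstance(sim_matrix, torch.Tensor):
        # detach() and cpu() make this safe for GPU tensors and tensors that track gradients
        sim_matrix = sim_matrix.detach().cpu().numpy()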

    n_rows, n_cols = sim_matrix.shape
    fig, ax = plt.subplots(figsize=(max(6, n_cols * 1.4), max(5, n_rows * 1.2)))

    im = ax.imshow(sim_matrix, cmap='viridis', vmin=0, vmax=1)

    # Annotate each cell with its value
    for i in range(n_rows):
        for j in range(n_cols):
            val = sim_matrix[i, j]
            color = 'black' if val > 0.5 else 'white'
            ax.text(j, i, f'{val:.2f}', ha='center', va='center',
                    fontsize=10, fontweight='bold', color=color)

    ax.set_xticks(range(n_cols))
    ax.set_xticklabels(col_labels, rotation=30, ha='right', fontsize=9)
    ax.set_yticks(range(n_rows))
    ax.set_yticklabels(row_labels, fontsize=9)
    ax.set_xlabel('Images', fontsize=11)
    ax.set_ylabel('Text Prompts', fontsize=11)
    ax.set_title(title, fontsize=13)
    plt.colorbar(im, ax=ax, shrink=0.8)
    plt.tight_layout()
    plt.show()


print('Helpers loaded.')

---

## Exercise 1: Compute a Similarity Matrix by Hand [Guided]

Before touching a real CLIP model, you will compute the similarity matrix yourself with simple 2D vectors. This makes the math concrete and shows that CLIP's training objective is just cross-entropy on a cosine similarity matrix.

We have 4 text-image pairs. Each embedding is a 2D vector (for simplicity — real CLIP uses 512D, but the math is identical).

| Pair | Text | Image |
|------|------|-------|
| 0 | "a tabby cat" | cat photo |
| 1 | "a red sports car" | car photo |
| 2 | "a mountain landscape" | mountain photo |
| 3 | "a bowl of ramen" | ramen photo |

The image encoder and text encoder have each produced a 2D embedding vector for their respective inputs. After training, matching pairs point in similar directions.

**Before running, predict:**
- Which entries in the 4×4 similarity matrix should be highest? (The diagonal — matching pairs.)
- If row 0 is the text "a tabby cat," which column should have the highest value? (Column 0 — the cat image.)
- What are the cross-entropy labels for all 4 rows? (Answer: [0, 1, 2, 3] — each row's correct answer is the diagonal entry.)
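
As a warm-up for the cells below, the cosine similarity of a single pair can be worked out with nothing but the standard library. This is a minimal sketch rather than part of the exercise: the helper name `cosine_similarity` is introduced here for illustration, and the vectors are the cat and ramen embeddings defined in the next cell.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as plain Python lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

text_cat = [0.85, 0.4]      # text 0: "a tabby cat"
image_cat = [0.9, 0.3]      # image 0: cat photo
image_ramen = [-0.8, -0.5]  # image 3: ramen photo

match = cosine_similarity(text_cat, image_cat)
mismatch = cosine_similarity(text_cat, image_ramen)
print(f'cat text vs cat image:   {match:+.4f}')
print(f'cat text vs ramen image: {mismatch:+.4f}')
```

The matching pair lands close to +1 and the mismatched pair close to -1, because these toy vectors point in nearly opposite directions. Embeddings from a real model are rarely this cleanly separated.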

In [None]:
# Pre-computed 2D embeddings (imagine these came from trained encoders)
# These are designed so that matching pairs point in similar directions.

# Image embeddings: each is a 2D vector
image_embeddings = torch.tensor([
    [0.9, 0.3],    # image 0: cat photo
    [-0.7, 0.6],   # image 1: car photo
    [0.1, -0.95],  # image 2: mountain photo
    [-0.8, -0.5],  # image 3: ramen photo
], dtype=torch.float32)

# Text embeddings: each is a 2D vector
text_embeddings = torch.tensor([
    [0.85, 0.4],   # text 0: "a tabby cat"
    [-0.6, 0.7],   # text 1: "a red sports car"
    [0.2, -0.9],   # text 2: "a mountain landscape"
    [-0.75, -0.6], # text 3: "a bowl of ramen"
], dtype=torch.float32)

pair_labels = ['cat', 'car', 'mountain', 'ramen']

print('Image embeddings (4 images, 2 dimensions each):')
for i, label in enumerate(pair_labels):
    print(f'  Image {i} ({label:>8s}): [{image_embeddings[i, 0]:+.2f}, {image_embeddings[i, 1]:+.2f}]')

print()
print('Text embeddings (4 texts, 2 dimensions each):')
for i, label in enumerate(pair_labels):
    print(f'  Text  {i} ({label:>8s}): [{text_embeddings[i, 0]:+.2f}, {text_embeddings[i, 1]:+.2f}]')

In [None]:
# Step 1: Normalize all embeddings to unit vectors
# Cosine similarity = dot product of unit vectors
# So we normalize first, then the matrix multiply gives us cosine similarities.

image_normed = F.normalize(image_embeddings, dim=1)  # [4, 2]
text_normed = F.normalize(text_embeddings, dim=1)    # [4, 2]

print('Normalized image embeddings (unit vectors):')
for i, label in enumerate(pair_labels):
    v = image_normed[i]
    norm = v.norm().item()
    print(f'  Image {i} ({label:>8s}): [{v[0]:+.4f}, {v[1]:+.4f}]  (norm = {norm:.4f})')

print()
print('Normalized text embeddings (unit vectors):')
for i, label in enumerate(pair_labels):
    v = text_normed[i]
    norm = v.norm().item()
    print(f'  Text  {i} ({label:>8s}): [{v[0]:+.4f}, {v[1]:+.4f}]  (norm = {norm:.4f})')

In [None]:
# Step 2: Compute the 4x4 similarity matrix
# S[i, j] = cosine_similarity(text_i, image_j)
# Since the vectors are normalized: S = text_normed @ image_normed.T

similarity_matrix = text_normed @ image_normed.T  # [4, 4]

print('Similarity matrix (text rows × image columns):')
print()
header = '                    ' + ''.join(f'{label:>12s}' for label in pair_labels)
print(header)
print('                    ' + ''.join(f'     img {i}  ' for i in range(4)))
print('-' * len(header))
for i, label in enumerate(pair_labels):
    row_str = ''.join(f'{similarity_matrix[i, j]:>12.4f}' for j in range(4))
    print(f'  text {i} ({label:>8s}) ' + row_str)

print()
print('Diagonal entries (matching pairs):')
for i, label in enumerate(pair_labels):
    print(f'  S[{i},{i}] = {similarity_matrix[i, i]:.4f}  ({label} text <-> {label} image)')

print()
print('The diagonal entries should be the highest in each row and column.')
print('This is the pattern CLIP is trained to produce: bright diagonal, dark off-diagonal.')

In [None]:
# Visualize the similarity matrix as a heatmap
text_labels = ['"a tabby cat"', '"a red sports car"', '"a mountain landscape"', '"a bowl of ramen"']
img_labels = ['cat img', 'car img', 'mountain img', 'ramen img']

plot_similarity_matrix(
    similarity_matrix,
    row_labels=text_labels,
    col_labels=img_labels,
    title='4×4 Cosine Similarity Matrix (hand-computed)'
)

print('The bright diagonal shows that matching pairs have high cosine similarity.')
print('The darker off-diagonal entries are the non-matching pairs.')
print('This is exactly the pattern from the lesson\'s similarity matrix example.')

In [None]:
# Step 3: Connect to cross-entropy loss
# Each row of the similarity matrix IS a classification problem.
# Row i asks: "which image matches text i?"
# The correct answer is always column i (the diagonal entry).
#
# Labels = [0, 1, 2, 3] -- each text i matches image i.

labels = torch.arange(4)  # [0, 1, 2, 3]

# Walk through row 0 ("a tabby cat")
print('Walking through row 0: "a tabby cat"')
print(f'  Similarities: {similarity_matrix[0].tolist()}')
print(f'  Correct answer: column {labels[0].item()} (cat image)')
print(f'  Highest similarity is at column {similarity_matrix[0].argmax().item()}')
print()

# In CLIP, a learnable temperature scales the logits before softmax.
# We will use temperature = 1.0 for this exercise (no scaling).
temperature = 1.0
logits = similarity_matrix / temperature

# Apply softmax to row 0 to get a probability distribution
probs_row0 = F.softmax(logits[0], dim=0)
print('Softmax probabilities for row 0:')
for j, label in enumerate(pair_labels):
    marker = ' <-- correct' if j == 0 else ''
    print(f'  P(image={label:>8s}) = {probs_row0[j]:.4f}{marker}')

print()

# Cross-entropy loss for row 0: -log(P(correct class))
ce_row0 = -torch.log(probs_row0[0])
print(f'Cross-entropy loss for row 0: -log({probs_row0[0]:.4f}) = {ce_row0:.4f}')
print()
print('This is the same cross-entropy you used for MNIST and language modeling.')
print('The only difference: the "classes" are images in the batch,')
print('and the "logits" are cosine similarities.')

In [None]:
# Step 4: Compute the full CLIP loss (symmetric cross-entropy)
# L = (L_image + L_text) / 2
# L_image: each row is a classification (which image matches this text?)
# L_text:  each column is a classification (which text matches this image?)

# Using PyTorch's cross_entropy (which handles softmax internally)
loss_image = F.cross_entropy(logits, labels)          # rows: which image?
loss_text = F.cross_entropy(logits.T, labels)         # columns: which text?
loss_clip = (loss_image + loss_text) / 2

print('CLIP loss breakdown:')
print(f'  L_image (text->image matching): {loss_image:.4f}')
print(f'  L_text  (image->text matching): {loss_text:.4f}')
print(f'  L_CLIP  (average):              {loss_clip:.4f}')
print()
print('The loss is symmetric: it asks BOTH "which image matches this text?"')
print('AND "which text matches this image?" This ensures neither direction dominates.')
print()
print('If you lower this loss during training, you are pushing matching pairs together')
print('and non-matching pairs apart in the shared embedding space.')
print()
print('Key insight: each row IS a classification problem. The labels are always')
print('[0, 1, 2, ..., N-1] -- the diagonal. No human ever labeled anything as')
print('"not matching." The batch structure provides the negatives for free.')

### What Just Happened

You computed a CLIP-style similarity matrix by hand:

1. **Normalized** the embeddings to unit vectors (so dot product = cosine similarity).
2. **Computed** the 4×4 similarity matrix via matrix multiplication.
3. **Visualized** it as a heatmap — bright diagonal (matching pairs), dark off-diagonal (non-matches).
4. **Connected** each row to a cross-entropy classification problem — "which image matches this text?"
5. **Computed** the full symmetric CLIP loss — cross-entropy in both directions.

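The computation you performed here is the same one CLIP performs at scale. The differences are in size and scaling: real CLIP (ViT-B/32) uses 512-dimensional embeddings, the original model was trained with batches of 32,768 pairs, and the logits are divided by a learned temperature rather than a fixed 1.0. Same math, far more dimensions and data.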
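
Step 3 fixed the temperature at 1.0, whereas CLIP learns it. The sketch below recomputes the symmetric loss on the same toy embeddings in plain Python, once with temperature 1.0 and once with 0.07. The value 0.07 is the initial temperature reported for the original CLIP and is used here only as an illustrative setting; the helper names (`normalize`, `cross_entropy`, `clip_loss`) are introduced for this example and are not part of the notebook.

```python
import math

# Same toy embeddings as Exercise 1 (row i of each list is pair i).
image_embeddings = [[0.9, 0.3], [-0.7, 0.6], [0.1, -0.95], [-0.8, -0.5]]
text_embeddings = [[0.85, 0.4], [-0.6, 0.7], [0.2, -0.9], [-0.75, -0.6]]

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def cross_entropy(logit_rows):
    """Mean of -log softmax(row)[i] over rows, using label i for row i."""
    total = 0.0
    for i, row in enumerate(logit_rows):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logit_rows)

def clip_loss(temperature):
    """Symmetric contrastive loss on the text-by-image similarity matrix."""
    texts = [normalize(t) for t in text_embeddings]
    images = [normalize(m) for m in image_embeddings]
    logits = [[sum(a * b for a, b in zip(t, m)) / temperature for m in images]
              for t in texts]
    transposed = [list(col) for col in zip(*logits)]
    return (cross_entropy(logits) + cross_entropy(transposed)) / 2

loss_unscaled = clip_loss(temperature=1.0)  # the setting used in Step 3
loss_sharp = clip_loss(temperature=0.07)    # illustrative value, see the note above
print(f'temperature 1.00: loss = {loss_unscaled:.4f}')
print(f'temperature 0.07: loss = {loss_sharp:.4f}')
```

Because every diagonal entry is already the largest in its row and column, dividing by a smaller temperature widens the gaps between logits and the loss drops. With temperature 1.0, cosine similarities confined to [-1, 1] cannot produce a very confident softmax, which is one reason a temperature is used at all.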

---

## Exercise 2: Explore a Pretrained CLIP Model [Guided]

Now you will load a real pretrained CLIP model and see the same similarity matrix pattern — but produced by encoders trained on billions of text-image pairs.

We will load the ViT-B/32 model (Vision Transformer as the image encoder, 32×32 patch size) trained on LAION-2B via OpenCLIP, and:
1. Download 5 images from the web
2. Encode them with CLIP's image encoder
3. Encode 5 text prompts with CLIP's text encoder
4. Compute and visualize the 5×5 similarity matrix

**Before running, predict:**
- Will the diagonal entries be close to 1.0 or just "higher than off-diagonal"?
- Among off-diagonal entries, will some be noticeably higher than others? (Think about which images and texts might be semantically similar even if they are not exact matches.)

In [None]:
# Load the pretrained CLIP model (ViT-B/32)
# This downloads the model weights (~350 MB) on first run.

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k'
)
tokenizer = open_clip.get_tokenizer('ViT-B-32')

model = model.to(device)
model.eval()

print(f'Model loaded: ViT-B-32')
print(f'Image encoder: Vision Transformer (patches = 32×32)')
print(f'Text encoder: Transformer')
print(f'Embedding dimension: 512')
print(f'Device: {device}')

In [None]:
# Download 5 images from the web
# These are public domain or CC-licensed images hosted on Wikimedia Commons.

image_urls = {
    'cat': 'https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg',
    'sports car': 'https://upload.wikimedia.org/wikipedia/commons/thumb/1/1f/2019_Ferrari_488_Pista_3.9.jpg/1200px-2019_Ferrari_488_Pista_3.9.jpg',
    'mountain': 'https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/Everest_North_Face_toward_Base_Camp_Tibet_Luca_Galuzzi_2006.jpg/1200px-Everest_North_Face_toward_Base_Camp_Tibet_Luca_Galuzzi_2006.jpg',
    'ramen': 'https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Shoyu_ramen%2C_at_Kasukabe_Station_%282014.05.05%29_1.jpg/1200px-Shoyu_ramen%2C_at_Kasukabe_Station_%282014.05.05%29_1.jpg',
    'dog': 'https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/YellowLabradorLooking_new.jpg/1200px-YellowLabradorLooking_new.jpg',
}

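# Wikimedia's servers may reject requests that lack a descriptive User-Agent.
# The string below is a placeholder; replace it with your own identifier.
headers = {'User-Agent': 'clip-notebook-exercise/0.1 (educational use)'}

images = {}
for name, url in image_urls.items():
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # surface 4xx/5xx errors instead of parsing an error page
        img = Image.open(BytesIO(response.content)).convert('RGB')
        images[name] = img
        print(f'  Downloaded: {name} ({img.size[0]}x{img.size[1]})')
    except Exception as e:
        print(f'  Failed to download {name}: {e}')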

print(f'\nLoaded {len(images)} images.')

In [None]:
# Display the images
fig, axes = plt.subplots(1, len(images), figsize=(15, 3))
for ax, (name, img) in zip(axes, images.items()):
    ax.imshow(img)
    ax.set_title(name, fontsize=11)
    ax.axis('off')
plt.suptitle('Input Images', fontsize=13)
plt.tight_layout()
plt.show()

In [None]:
# Encode images and text with CLIP

# Text prompts: one per image, listed in the same order as the keys of `images`
text_prompts = [
    'a photo of a cat',
    'a photo of a sports car',
    'a photo of a mountain',
    'a photo of a bowl of ramen',
    'a photo of a dog',
]

# Preprocess images for CLIP (resize, crop, normalize)
image_tensors = torch.stack([preprocess(img) for img in images.values()]).to(device)

# Tokenize text
text_tokens = tokenizer(text_prompts).to(device)

# Encode
with torch.no_grad():
    image_features = model.encode_image(image_tensors)   # [5, 512]
    text_features = model.encode_text(text_tokens)       # [5, 512]

print(f'Image features shape: {image_features.shape}')   # [5, 512]
print(f'Text features shape:  {text_features.shape}')    # [5, 512]
print()
print('Two separate encoders produced two sets of 512-dimensional vectors.')
print('The image encoder and text encoder share NO weights.')
print('The only thing that connected them during training was the contrastive loss.')

In [None]:
# Compute the 5x5 similarity matrix — same steps as Exercise 1
# 1. Normalize to unit vectors
# 2. Matrix multiply

image_features_norm = F.normalize(image_features, dim=1)
text_features_norm = F.normalize(text_features, dim=1)

clip_sim_matrix = (text_features_norm @ image_features_norm.T).cpu()

# Plot the heatmap
img_names = list(images.keys())
plot_similarity_matrix(
    clip_sim_matrix,
    row_labels=text_prompts,
    col_labels=img_names,
    title='5×5 CLIP Similarity Matrix (real pretrained model)'
)

print('Observations:')
print('  1. The diagonal entries are the highest in their row — matching pairs.')
print('  2. The off-diagonal entries are lower — non-matching pairs.')
print('  3. This is the same pattern you computed by hand in Exercise 1!')
print()

# Check for interesting off-diagonal entries
print('Interesting off-diagonal entries (semantically related non-matches):')
# Look at dog-cat similarity
dog_idx = img_names.index('dog')
cat_text_idx = text_prompts.index('a photo of a cat')
dog_text_idx = text_prompts.index('a photo of a dog')
cat_idx = img_names.index('cat')

print(f'  "a photo of a cat" <-> dog image:  {clip_sim_matrix[cat_text_idx, dog_idx]:.4f}')
print(f'  "a photo of a dog" <-> cat image:  {clip_sim_matrix[dog_text_idx, cat_idx]:.4f}')
car_idx = img_names.index('sports car')
print(f'  "a photo of a cat" <-> car image:  {clip_sim_matrix[cat_text_idx, car_idx]:.4f}')
print()
print('Compare the numbers above: the cat/dog cross-similarities are typically higher')
print('than the cat/car one. The shared embedding space captures semantic similarity,')
print('so animals tend to cluster together.')

### What Just Happened

You loaded a real CLIP model trained on billions of text-image pairs and saw the same pattern you computed by hand:

1. **Two separate encoders** produced 512-dimensional vectors for images and text.
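2. **The similarity matrix** has its largest values on the diagonal: each text prompt is most similar to its own image. The absolute values are usually well below 1.0; what matters is the ranking within each row and column.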
3. **Off-diagonal entries** reveal semantic structure — a cat and a dog are more similar to each other than either is to a car or a mountain.

The computation from Exercise 1 carries over unchanged. Same math, 512 dimensions instead of 2, and embeddings learned from ~2 billion pairs instead of designed by hand.
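
Raw cosine similarities from a pretrained model tend to sit in a narrow band, so the gap between a match and a non-match can look small in the heatmap. CLIP-style models multiply the similarities by a logit scale before the softmax; the OpenCLIP README's zero-shot example uses a factor of 100.0. The sketch below shows the effect in plain Python. The similarity values are hypothetical, chosen for illustration rather than measured from the model above, and the `softmax` helper is defined only for this example.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities of one image against three prompts.
# Illustrative numbers only, not output from the pretrained model.
prompts = ['a photo of a cat', 'a photo of a dog', 'a photo of a sports car']
similarities = [0.28, 0.22, 0.15]

logit_scale = 100.0  # assumed scale, matching the factor in the OpenCLIP README example

raw_probs = softmax(similarities)
scaled_probs = softmax([logit_scale * s for s in similarities])

for prompt, raw, scaled in zip(prompts, raw_probs, scaled_probs):
    print(f'{prompt:<26s} raw: {raw:.3f}   scaled: {scaled:.3f}')
```

Without scaling, the three probabilities are nearly uniform. With scaling, almost all of the probability mass moves to the best match. In OpenCLIP models the learned scale should be available as `model.logit_scale` (stored in log space), though it is worth checking the attribute on the model you loaded.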

---

## Exercise 3: Zero-Shot Classification [Supported]

The shared embedding space enables a surprising capability: **classify images without any training on the target classes**. In *Transfer Learning*, you replaced the classification head and retrained. CLIP skips both steps — the text encoder IS the classification head.

The recipe:
1. Create a text prompt for each class: `"a photo of a [class name]"`
2. Encode all prompts with the text encoder (once)
3. Encode each test image with the image encoder
4. Find the text prompt with highest cosine similarity — that is the prediction

**Your task:** Use CLIP to classify images from CIFAR-10 (10 classes, 32×32 images). The structure is provided with `# TODO` markers. Each TODO is 1–3 lines.

**Hints:**
- `F.normalize(features, dim=1)` normalizes each row to unit length.
- After normalizing, `image_features @ text_features.T` gives cosine similarities.
- `.argmax(dim=1)` finds the index of the highest value per row.
- Compare `predictions == true_labels` to compute accuracy.
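
The recipe and hints above can be tried on toy data before running them on CIFAR-10. The sketch below is plain Python with made-up 3-D vectors standing in for CLIP embeddings; the function name `zero_shot_predict` and all values are hypothetical, chosen only to make the normalize, compare, argmax pattern visible.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def zero_shot_predict(image_vec, class_vecs):
    """Return the index of the class embedding with the highest cosine similarity."""
    image_unit = normalize(image_vec)
    sims = [sum(a * b for a, b in zip(image_unit, normalize(c))) for c in class_vecs]
    return max(range(len(sims)), key=sims.__getitem__)

# Made-up "text embeddings", one per class prompt (hypothetical values).
class_names = ['cat', 'dog', 'truck']
class_vecs = [[1.0, 0.2, 0.0], [0.8, 0.6, 0.1], [0.0, 0.1, 1.0]]

# Made-up "image embeddings" with their true class indices.
image_vecs = [[0.9, 0.1, 0.1], [0.1, 0.0, 0.7], [0.6, 0.6, 0.0]]
true_labels = [0, 2, 1]

predictions = [zero_shot_predict(v, class_vecs) for v in image_vecs]
accuracy = sum(p == t for p, t in zip(predictions, true_labels)) / len(true_labels)

for pred, true in zip(predictions, true_labels):
    print(f'predicted: {class_names[pred]:<6s} true: {class_names[true]}')
print(f'accuracy: {accuracy:.2f}')
```

The class vectors are computed once and reused for every image, which mirrors step 2 of the recipe. In the exercise the same three operations are done in batch with `F.normalize`, a matrix multiply, and `.argmax(dim=1)`.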

In [None]:
# Load CIFAR-10 test set
cifar10_test = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=True
)

cifar10_classes = [
    'airplane', 'automobile', 'bird', 'cat', 'deer',
    'dog', 'frog', 'horse', 'ship', 'truck'
]

print(f'CIFAR-10 test set: {len(cifar10_test)} images')
print(f'Classes: {cifar10_classes}')
print(f'Image size: 32×32 (CLIP will resize these to 224×224)')

In [None]:
# Step 1: Create and encode text prompts for each class
# The standard CLIP template: "a photo of a [class name]"

# TODO: Create a list of text prompts, one per class.
# Use the template: "a photo of a {class_name}"
# Hint: text_prompts_cifar = [f"a photo of a {c}" for c in cifar10_classes]


# Encode all text prompts (do this ONCE — reuse for every image)
text_tokens_cifar = tokenizer(text_prompts_cifar).to(device)
with torch.no_grad():
    text_features_cifar = model.encode_text(text_tokens_cifar)  # [10, 512]

# TODO: Normalize text features to unit vectors
# Hint: text_features_cifar = F.normalize(text_features_cifar, dim=1)


print(f'Text features shape: {text_features_cifar.shape}')  # [10, 512]
print('Text prompts:')
for i, prompt in enumerate(text_prompts_cifar):
    print(f'  Class {i}: "{prompt}"')

In [None]:
# Step 2: Classify a batch of test images
# We will classify the first 500 images for speed.

n_test = 500
correct = 0
total = 0
per_class_correct = {c: 0 for c in cifar10_classes}
per_class_total = {c: 0 for c in cifar10_classes}

# Process in batches of 50
batch_size = 50
all_predictions = []
all_labels = []

for start_idx in range(0, n_test, batch_size):
    end_idx = min(start_idx + batch_size, n_test)

    # Load and preprocess images
    batch_images = []
    batch_labels = []
    for i in range(start_idx, end_idx):
        img, label = cifar10_test[i]
        batch_images.append(preprocess(img))
        batch_labels.append(label)

    image_batch = torch.stack(batch_images).to(device)
    label_batch = torch.tensor(batch_labels)

    # Encode images
    with torch.no_grad():
        image_feats = model.encode_image(image_batch)  # [batch, 512]

    # TODO: Normalize image features to unit vectors
    # Hint: image_feats = F.normalize(image_feats, dim=1)


    # TODO: Compute similarity between each image and all 10 text prompts
    # Result should be shape [batch, 10]
    # Hint: similarities = image_feats @ text_features_cifar.T


    # TODO: Find the predicted class for each image (index of highest similarity)
    # Hint: predictions = similarities.argmax(dim=1).cpu()


    # Track accuracy
    all_predictions.extend(predictions.tolist())
    all_labels.extend(batch_labels)

    for pred, true_label in zip(predictions.tolist(), batch_labels):
        class_name = cifar10_classes[true_label]
        per_class_total[class_name] += 1
        if pred == true_label:
            correct += 1
            per_class_correct[class_name] += 1
        total += 1

overall_accuracy = correct / total * 100

print(f'Zero-shot CLIP accuracy on CIFAR-10: {correct}/{total} = {overall_accuracy:.1f}%')
print(f'Random chance: 10.0%')
print()
print('Per-class accuracy:')
for class_name in cifar10_classes:
    n_correct = per_class_correct[class_name]
    n_total = per_class_total[class_name]
    acc = n_correct / n_total * 100 if n_total > 0 else 0
    bar = '#' * int(acc / 5)
    print(f'  {class_name:>12s}: {acc:5.1f}%  {bar}')

<details>
<summary>Solution</summary>

The five TODOs implement the zero-shot classification pipeline. The key insight: this is the exact same cosine similarity computation from Exercises 1 and 2, just applied as a classifier.

**Create text prompts:**
```python
text_prompts_cifar = [f"a photo of a {c}" for c in cifar10_classes]
```

**Normalize text features:**
```python
text_features_cifar = F.normalize(text_features_cifar, dim=1)
```

**Normalize image features:**
```python
image_feats = F.normalize(image_feats, dim=1)
```

**Compute similarities:**
```python
similarities = image_feats @ text_features_cifar.T
```

**Find predictions:**
```python
predictions = similarities.argmax(dim=1).cpu()
```

The pattern is always the same: normalize, matrix multiply, argmax. CLIP was never trained on CIFAR-10's labels and has no 10-way output layer. The shared embedding space generalizes, so the text encoder IS the classification head. No retraining, no fine-tuning, no new parameters.

Common mistake: forgetting to normalize features before computing similarity. Without normalization, the dot product is NOT cosine similarity: each score is scaled by the lengths of the two vectors, so a long text embedding can outscore a better-aligned short one.
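
A small plain-Python sketch of that mistake, using hypothetical 2-D vectors chosen so that one text embedding is much longer than the other. The helper names `dot` and `cosine` are defined here for illustration only.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product divided by the product of the two vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical embeddings, not taken from any model.
image = [1.0, 0.1]
text_match = [0.9, 0.1]  # nearly the same direction as the image
text_long = [3.0, 4.0]   # a different direction, but a much larger norm

texts = [text_match, text_long]
by_dot = max(range(len(texts)), key=lambda i: dot(image, texts[i]))
by_cosine = max(range(len(texts)), key=lambda i: cosine(image, texts[i]))

print(f'argmax with raw dot product:   text {by_dot}')
print(f'argmax with cosine similarity: text {by_cosine}')
```

The raw dot product picks the long vector, while cosine similarity picks the well-aligned one. Normalizing both sides with `F.normalize` before the matrix multiply removes this dependence on length.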

</details>

In [None]:
# Visualize some predictions — show images with their predicted and true labels
fig, axes = plt.subplots(2, 10, figsize=(16, 4))

# Show first 10 correct and first 10 incorrect predictions
correct_indices = [i for i in range(n_test) if all_predictions[i] == all_labels[i]]
wrong_indices = [i for i in range(n_test) if all_predictions[i] != all_labels[i]]

# Top row: correct predictions
for col, idx in enumerate(correct_indices[:10]):
    img, _ = cifar10_test[idx]
    axes[0, col].imshow(img)
    axes[0, col].set_title(f'{cifar10_classes[all_predictions[idx]]}', fontsize=8, color='#86efac')

# Bottom row: wrong predictions
for col, idx in enumerate(wrong_indices[:10]):
    img, _ = cifar10_test[idx]
    true_name = cifar10_classes[all_labels[idx]]
    pred_name = cifar10_classes[all_predictions[idx]]
    axes[1, col].imshow(img)
    axes[1, col].set_title(f'{pred_name}\n(true: {true_name})', fontsize=7, color='#f87171')

# Hide ticks on every panel, including unused ones. Unlike axis('off'),
# this keeps the row labels set below visible.
for ax in axes.flat:
    ax.set_xticks([])
    ax.set_yticks([])

axes[0, 0].set_ylabel('Correct', fontsize=10, color='#86efac')
axes[1, 0].set_ylabel('Wrong', fontsize=10, color='#f87171')
plt.suptitle('CLIP Zero-Shot Predictions on CIFAR-10', fontsize=13)
plt.tight_layout()
plt.show()

print('Top row: correct predictions (green).')
print('Bottom row: incorrect predictions (red) with true labels.')
print()
print('CLIP was never trained as a CIFAR-10 classifier: no CIFAR-10 labels, no 10-way output layer.')
print('The shared embedding space generalizes beyond the training objective.')
print(f'\nAccuracy ({overall_accuracy:.1f}%) is far above random chance (10%), though well-tuned')
print('classifiers trained directly on CIFAR-10 exceed 95%. CLIP trades peak accuracy for')
print('generalization to ANY text description, not just a fixed set of 10 classes.')

### What Just Happened

You performed zero-shot classification — classifying images into categories CLIP was never explicitly trained on:

1. **Created text prompts** for each class using the template "a photo of a [class]."
2. **Encoded** both images and text into the shared 512-dimensional space.
3. **Found the best match** using cosine similarity — no training, no fine-tuning.

The accuracy is well above random chance, demonstrating that the shared embedding space generalizes beyond training data. In *Transfer Learning*, you had to retrain the classification head. CLIP does not even need that — the text encoder IS the classification head.
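
Per-class accuracy shows which classes are hard, but not what they are mistaken for. The sketch below counts the most frequent (true, predicted) pairs among the errors. The function name `top_confusions` is introduced here for illustration, and the toy lists are invented stand-ins for `all_labels` and `all_predictions` from the classification loop.

```python
from collections import Counter

def top_confusions(true_labels, predicted_labels, class_names, k=3):
    """Return the k most frequent (true, predicted) class-name pairs among errors."""
    errors = Counter(
        (class_names[t], class_names[p])
        for t, p in zip(true_labels, predicted_labels)
        if t != p
    )
    return errors.most_common(k)

# Toy stand-ins for all_labels / all_predictions (invented values).
toy_classes = ['cat', 'dog', 'deer']
toy_labels = [0, 0, 1, 1, 2, 2, 0, 1]
toy_predictions = [0, 1, 1, 0, 2, 2, 1, 1]

confusions = top_confusions(toy_labels, toy_predictions, toy_classes)
for (true_name, pred_name), count in confusions:
    print(f'{true_name} predicted as {pred_name}: {count}')
```

In the notebook, `top_confusions(all_labels, all_predictions, cifar10_classes)` can be called after the classification loop has run. Which pairs dominate on the real data is left for you to observe.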

---

## Exercise 4: Probing CLIP's Limitations [Independent]

CLIP is impressive, but it learns statistical co-occurrence, not deep understanding. The lesson identified several systematic failure modes:

- **Spatial relationships:** "a cat on a table" vs "a table on a cat"
- **Counting:** "three dogs" vs "five dogs"
- **Compositional reasoning:** unusual combinations of familiar concepts

**Your task:** Design experiments to test at least 3 of CLIP's known limitations. For each test:
1. Create two or more text prompts that differ in a way CLIP should distinguish
2. Find or describe an image that matches one prompt but not the other
3. Compute similarities and check whether CLIP can tell them apart

Document where CLIP succeeds and where it fails. The goal is not to catch CLIP making mistakes (that is easy) — it is to understand **why** those mistakes happen and what they reveal about how the model represents information.

**No skeleton code is provided.** Use the encoding and similarity computation patterns from the previous exercises.

In [None]:
# YOUR CODE HERE
#
# Design experiments to probe CLIP's limitations.
#
# Pattern to reuse from previous exercises:
#   text_tokens = tokenizer(list_of_prompts).to(device)
#   with torch.no_grad():
#       text_feats = model.encode_text(text_tokens)
#   text_feats = F.normalize(text_feats, dim=1)
#
# For text-only comparisons (no images needed):
#   Compare two prompts that differ in spatial order, count, etc.
#   If CLIP embeddings are nearly identical for different meanings,
#   that reveals a limitation.
#
# For image-text comparisons:
#   Use images from the web or CIFAR-10 to test specific claims.
#
# Suggested experiments:
#   1. Spatial: "a cat on a table" vs "a table on a cat"
#   2. Counting: "one dog" vs "three dogs" vs "five dogs"
#   3. Negation: "a photo of a dog" vs "a photo with no dog"
#   4. Attribute binding: "a red car and a blue house" vs "a blue car and a red house"
#   5. Abstract: "happiness" vs "sadness" vs a photo of people smiling


<details>
<summary>Solution</summary>

The experiments reveal CLIP's systematic limitations. The key insight: CLIP's embedding space encodes **what things are** but not **how they relate spatially, how many there are, or how attributes bind to objects**.

```python
def compare_prompts(prompt_list, title="Prompt Comparison"):
    """Compute and display pairwise similarities between text prompts."""
    tokens = tokenizer(prompt_list).to(device)
    with torch.no_grad():
        features = model.encode_text(tokens)
    features = F.normalize(features, dim=1)
    sim = (features @ features.T).cpu()

    print(f'\n{title}')
    print('=' * 60)
    for i, p in enumerate(prompt_list):
        print(f'  [{i}] "{p}"')
    print()
    for i in range(len(prompt_list)):
        for j in range(i + 1, len(prompt_list)):
            print(f'  [{i}] vs [{j}]: {sim[i, j]:.4f}')
    return sim

# Test 1: Spatial relationships
compare_prompts([
    "a cat sitting on a table",
    "a table sitting on a cat",
], title="Spatial: Does CLIP understand 'on'?")
# Expected: very high similarity (often >0.9); CLIP's text embedding is only weakly sensitive to word order

# Test 2: Counting
compare_prompts([
    "one dog",
    "three dogs",
    "five dogs",
    "ten dogs",
], title="Counting: Does CLIP distinguish quantities?")
# Expected: all very similar — CLIP encodes 'dog' strongly but 'number of dogs' weakly

# Test 3: Negation
compare_prompts([
    "a photo of a dog",
    "a photo with no dog",
    "a photo of a cat",
], title="Negation: Does CLIP understand 'no'?")
# Expected: "a photo of a dog" and "a photo with no dog" are more similar
# than "a photo of a dog" and "a photo of a cat"; CLIP largely ignores negation

# Test 4: Attribute binding
compare_prompts([
    "a red car and a blue house",
    "a blue car and a red house",
], title="Attribute Binding: Does CLIP bind colors to objects?")
# Expected: very high similarity — same words, different bindings

# Test 5: What CLIP IS good at — basic categories
compare_prompts([
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a car",
    "a photo of a mountain",
], title="Baseline: Basic category distinction (should work well)")
# Expected: moderate similarities between all pairs, clearly distinguishable
```

**Why these failures happen:** CLIP's training objective is contrastive matching — "does this text go with this image?" at the level of whole captions and whole images. It learns *what things are* (bag-of-concepts) but not the *relationships between things* (spatial structure, quantity, attribute binding). The embedding is a compressed summary of the entire input, not a structured representation of its parts.

**What CLIP is good at:** Distinguishing categories (cat vs car), recognizing styles (painting vs photograph), matching descriptions to images at a semantic level. These are exactly the capabilities that matter for conditioning a diffusion model — the U-Net needs to know WHAT to generate, and CLIP provides that.

**What CLIP is bad at:** Spatial layout, counting, negation, attribute binding — the compositional aspects of language. These limitations carry through to text-to-image models that rely on CLIP embeddings.
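
To make the "bag-of-concepts" description concrete, the sketch below builds a toy encoder that averages word vectors and therefore discards word order completely. This is deliberately NOT how CLIP's text encoder works (CLIP uses a transformer with positional information); it is the limiting case the description points toward. All names (`toy_word_vectors`, `bag_of_words_embedding`) and the random vectors are hypothetical.

```python
import random

def toy_word_vectors(vocabulary, dim=8, seed=0):
    """Assign each word a fixed random vector (a stand-in for learned embeddings)."""
    rng = random.Random(seed)
    return {word: [rng.gauss(0.0, 1.0) for _ in range(dim)] for word in vocabulary}

def bag_of_words_embedding(sentence, vectors):
    """Average the word vectors of a sentence, ignoring word order entirely."""
    words = sentence.split()
    dim = len(next(iter(vectors.values())))
    return [sum(vectors[w][d] for w in words) / len(words) for d in range(dim)]

vocab = ['a', 'cat', 'table', 'on', 'sitting']
vectors = toy_word_vectors(vocab)

emb_1 = bag_of_words_embedding('a cat sitting on a table', vectors)
emb_2 = bag_of_words_embedding('a table sitting on a cat', vectors)

# Largest coordinate-wise difference between the two sentence embeddings.
max_diff = max(abs(x - y) for x, y in zip(emb_1, emb_2))
print(f'max difference between the two embeddings: {max_diff:.2e}')
```

The two sentences get the same embedding up to floating-point rounding, because they contain the same words. A real CLIP text encoder will not produce identical vectors for these prompts, but the high similarities you measure in the experiments above suggest its embeddings lean in this direction.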

</details>

---

## Key Takeaways

1. **The similarity matrix IS the core of CLIP.** Normalize embeddings, matrix multiply, get cosine similarities. Matching pairs on the diagonal, non-matches off-diagonal. Each row is a classification problem with cross-entropy loss.

2. **Two encoders, one shared space — the loss function creates the alignment.** The image encoder and text encoder share no weights. Only the contrastive loss — symmetric cross-entropy on the similarity matrix — forces their embedding spaces into alignment.

3. **The shared space enables zero-shot transfer.** Classify any image by comparing its embedding to text descriptions of candidate classes. No training on the target classes required. The text encoder IS the classification head.

4. **Useful does not mean perfect.** CLIP encodes what things are and what words mean, well enough to match them. It does not reliably encode spatial relationships, counts, negation, or attribute binding. Understanding the limitations is as important as understanding the capabilities.

5. **Same building blocks, different question.** CNN/ViT image encoder, transformer text encoder, cosine similarity, cross-entropy loss — every piece is familiar from earlier in the course. The question is new: "do this text and image match?" The answer gives us text embeddings that encode visual meaning — exactly what the U-Net needs to generate images from text descriptions.