# 🔵 Lesson 5: CLIP and Tokenization

How does the model understand "A cat in space"?

It uses **CLIP** (Contrastive Language-Image Pre-Training), a model trained by OpenAI on 400 million image-text pairs.

### Goals:
1.  Inspect the **Tokenizer**.
2.  See the **77-Token Limit**.
3.  Understand **Embeddings** (Vectors).

In [None]:
# 1. Setup
import notebook_utils
project_root, device, dtype = notebook_utils.setup_notebook()

from transformers import CLIPTokenizer, CLIPTextModel
import torch

## 1. The Tokenizer

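A tokenizer chops text into sub-word pieces called tokens. CLIP's tokenizer has a fixed vocabulary of about 49,000 tokens (49,408 entries).
Each token maps to a unique integer ID. Common words are usually a single token, while rarer words are split into several.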
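
To make the sub-word idea concrete, here is a minimal sketch of byte-pair encoding (BPE), the kind of algorithm CLIP's tokenizer is built on. The merge list `TOY_MERGES` and the function `toy_bpe` are hypothetical and tiny. The real tokenizer ships tens of thousands of learned merges and also lowercases and cleans the text first.

```python
# Toy BPE sketch. TOY_MERGES is a made-up merge list for illustration only;
# earlier entries have higher priority, as in real BPE.
TOY_MERGES = [
    ("c", "a"), ("ca", "t</w>"),
    ("r", "o"), ("ro", "b"), ("o", "t</w>"), ("rob", "ot</w>"),
]

def toy_bpe(word, merges=TOY_MERGES):
    """Split one word into sub-word tokens by repeatedly applying merges."""
    if not word:
        return []
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    # Start from single characters; "</w>" marks the end of the word.
    symbols = list(word[:-1]) + [word[-1] + "</w>"]
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(toy_bpe("cat"))     # ['cat</w>']
print(toy_bpe("robots"))  # ['rob', 'o', 't', 's</w>']
```

With this toy merge list, "cat" collapses into one token while "robots" stays split into four pieces. Each resulting piece would then be looked up in the vocabulary to get its ID.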

In [None]:
model_id = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)

prompt = "A cute robot painting a masterpiece"

tokens = tokenizer(prompt, padding="max_length", max_length=77, truncation=True, return_tensors="pt")
print("Token IDs:", tokens.input_ids[0][:10]) # Show first 10

## 2. Inspecting the Tokens

Let's see what those numbers actually mean.

In [None]:
ids = tokens.input_ids[0][:10]
decoded = [tokenizer.decode([i]) for i in ids]

print(f"{'ID':<8} | {'Token'}")
print("-" * 20)
for i, word in zip(ids, decoded):
    print(f"{i.item():<8} | {word}")

### Why 77 Tokens?
CLIP's text encoder has a fixed context length of 77 tokens, so Stable Diffusion always feeds it a sequence of exactly that length:
- One `<|startoftext|>` token.
- Up to 75 prompt tokens.
- One `<|endoftext|>` token.

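Every position after the end token is filled with **padding**. In this tokenizer the padding token is `<|endoftext|>` itself, so padding positions also carry ID `49407`.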
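
The layout above can be sketched in plain Python. This is only an illustration of what `padding="max_length"` and `truncation=True` produce, not the tokenizer's actual code. The function name `build_input_ids` and the prompt token IDs are hypothetical; `49406` and `49407` are the start and end token IDs in CLIP's vocabulary.

```python
BOS_ID = 49406  # <|startoftext|>
EOS_ID = 49407  # <|endoftext|>, reused as the padding token here
CONTEXT_LENGTH = 77

def build_input_ids(prompt_ids, context_length=CONTEXT_LENGTH):
    """Wrap prompt token IDs in start/end tokens, truncate, and pad."""
    max_prompt = context_length - 2           # room left after start + end
    kept = list(prompt_ids)[:max_prompt]      # truncate overly long prompts
    ids = [BOS_ID] + kept + [EOS_ID]
    ids += [EOS_ID] * (context_length - len(ids))  # pad to the fixed length
    return ids

short_ids = build_input_ids([320, 2368])          # hypothetical 2-token prompt
long_ids = build_input_ids(range(1000, 1100))     # hypothetical 100-token prompt

print(len(short_ids), short_ids[:5])  # 77 [49406, 320, 2368, 49407, 49407]
print(len(long_ids), long_ids[-1])    # 77 49407
```

Note that the long prompt silently loses its last 25 tokens, which is why words at the end of a very long prompt can have no effect on the image.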

## 3. Text Embeddings (The Math)

The raw integer IDs (for example `329` or `83`) are just vocabulary indices and carry no meaning on their own. We pass them through the **Text Encoder**,
which converts each token into a **vector of 768 floating-point numbers**.

Each vector represents the *meaning* of its token, in the context of the rest of the prompt, as a point in a 768-dimensional space.
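
As a rough intuition for how an ID becomes a vector, the sketch below shows only the input stage of such an encoder: a table lookup per token plus a learned vector per position. All sizes and the random tables are toy assumptions, and `embed` is a hypothetical helper. The real text encoder follows this stage with a stack of transformer layers, which is what mixes in context from the other tokens.

```python
import random

VOCAB_SIZE = 10  # toy value; CLIP's vocabulary has about 49k entries
SEQ_LEN = 4      # toy value; the real context length is 77
DIM = 3          # toy value; the real hidden size is 768

rng = random.Random(0)
# In a trained model these tables are learned; here they are random stand-ins.
token_table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]
position_table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(SEQ_LEN)]

def embed(token_ids):
    """Look up each token's vector and add the vector for its position."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

vectors = embed([7, 2, 2, 9])
print(len(vectors), len(vectors[0]))  # 4 3
print(vectors[1] == vectors[2])       # False: same token, different position
```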

In [None]:
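with torch.no_grad():
    # Index 0 of the output is the last hidden state: one vector per token
    embeddings = text_encoder(tokens.input_ids.to(device))[0]

# Read the sizes from the tensor instead of hard-coding them
batch_size, num_tokens, hidden_dim = embeddings.shape
print(f"Embedding Shape: {tuple(embeddings.shape)}")
print(f"  Batch Size: {batch_size}")
print(f"  Tokens: {num_tokens}")
print(f"  Dimensions: {hidden_dim}")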

## 📚 Educational Note

When you use `(emphasis:1.5)` in a UI like A1111, it multiplies the embedding vectors of the emphasized tokens by 1.5, making them "louder" to the U-Net.
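
A simplified sketch of that idea, using plain lists in place of tensors. This is not A1111's actual code: `apply_emphasis` is a hypothetical helper, the numbers are made up, and real UIs may apply further normalization afterwards, which is omitted here.

```python
def apply_emphasis(embeddings, weights):
    """Scale each token's vector by that token's emphasis weight."""
    if len(embeddings) != len(weights):
        raise ValueError("need exactly one weight per token")
    return [[x * w for x in vec] for vec, w in zip(embeddings, weights)]

# Three tokens with 2-dimensional toy vectors; only the middle one is emphasized.
toy_embeddings = [[1.0, -2.0], [0.5, 4.0], [2.0, 2.0]]
toy_weights = [1.0, 1.5, 1.0]

print(apply_emphasis(toy_embeddings, toy_weights))
# [[1.0, -2.0], [0.75, 6.0], [2.0, 2.0]]
```

Tokens with weight `1.0` pass through unchanged, so only the emphasized part of the prompt is amplified.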