# Tokenization: How LLMs See Text

**Inference Engineering Series - Notebook 3**

---

Neural networks work with numbers, not text. Before any matrix multiplication or activation function can do its work, we need to convert text into a sequence of integers called **token IDs**. This conversion process is called **tokenization**, and it has a surprisingly large impact on model quality and inference speed.

In this notebook, we'll explore tokenization from first principles: how it works, how different models tokenize differently, and why it matters for inference.

## What You'll Learn

1. **What tokenization is** and why we need it
2. **Byte Pair Encoding (BPE)** - the dominant tokenization algorithm
3. **Comparing tokenizers** across models (GPT, Llama, Qwen)
4. **How different content types tokenize differently** (prose, code, math, multilingual)
5. **Vocabulary sizes** and their tradeoffs
6. **Impact on inference speed** - why tokenizer efficiency matters

In [None]:
!pip install tiktoken transformers tokenizers -q

In [None]:
import tiktoken
from transformers import AutoTokenizer
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter
import time

print("Libraries loaded successfully.")

## Part 1: Why Tokenization?

The most basic approach would be to work at the **character level** -- each character gets its own ID. But this has problems:

- Sequences become very long ("transformer" = 11 characters vs 1-2 tokens)
- Each token carries very little meaning
- Attention costs scale quadratically with sequence length
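
To make the sequence-length point concrete, here is a small sketch that counts the token pairs full self-attention would score for one sentence, first at character level and then under an assumed subword split. The figure of 4 characters per token is an illustrative assumption, not a measurement.

```python
# Rough illustration: attention compares every token with every other token,
# so the number of scored pairs grows with the square of the sequence length.
text = "The transformer model uses self-attention mechanisms."

char_len = len(text)                      # one token per character
assumed_chars_per_token = 4               # assumption: plausible for English subwords
subword_len = -(-char_len // assumed_chars_per_token)  # ceiling division


def attention_pairs(seq_len: int) -> int:
    """Number of (query, key) pairs in full self-attention."""
    return seq_len * seq_len


print(f"Character-level:   {char_len} tokens, {attention_pairs(char_len):,} pairs")
print(f"Subword (assumed): {subword_len} tokens, {attention_pairs(subword_len):,} pairs")
print(f"Ratio: {attention_pairs(char_len) / attention_pairs(subword_len):.1f}x")
```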

The other extreme is **word-level** tokenization:
- Vocabulary would be enormous (millions of words + variations)
- Can't handle new/rare words ("ChatGPT", misspellings)
- Different languages have different word boundaries

**Subword tokenization** is the sweet spot: common words get their own token, while rare words are split into meaningful pieces.
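
As a rough sketch of why subwords handle rare words gracefully, the snippet below contrasts a word-level lookup, which collapses unknown words to a single `<unk>` symbol, with a greedy longest-match split over a toy subword vocabulary. Both vocabularies and the `greedy_subword_split` helper are hypothetical and chosen for illustration. Real BPE tokenizers apply learned merge rules rather than longest-match.

```python
# Toy vocabularies (hypothetical, for illustration only).
word_vocab = {"the", "model", "is", "fast"}
subword_vocab = {"token", "iz", "ation", "the", "model", "is", "fast"}


def word_level(word: str) -> list[str]:
    """Word-level lookup: anything outside the vocabulary becomes <unk>."""
    return [word] if word in word_vocab else ["<unk>"]


def greedy_subword_split(word: str) -> list[str]:
    """Greedy longest-match split; falls back to single characters."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(word), i, -1):
            if word[i:j] in subword_vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


print(word_level("tokenization"))            # ['<unk>']
print(greedy_subword_split("tokenization"))  # ['token', 'iz', 'ation']
```

The word-level result discards the word entirely, while the subword split keeps every character recoverable.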

In [None]:
# Demonstrate the three approaches
text = "The transformer model uses self-attention mechanisms."

# Character-level
char_tokens = list(text)
print(f"Text: '{text}'")
print(f"\n1. Character-level: {len(char_tokens)} tokens")
print(f"   {char_tokens}")

# Word-level (naive split)
word_tokens = text.split()
print(f"\n2. Word-level: {len(word_tokens)} tokens")
print(f"   {word_tokens}")

# Subword (BPE) using GPT-4's tokenizer
enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer
bpe_token_ids = enc.encode(text)
bpe_tokens = [enc.decode([t]) for t in bpe_token_ids]
print(f"\n3. Subword (BPE): {len(bpe_token_ids)} tokens")
print(f"   {bpe_tokens}")
print(f"   Token IDs: {bpe_token_ids}")

## Part 2: Byte Pair Encoding (BPE) from Scratch

BPE is the most common tokenization algorithm. It works as follows:
1. Start with a vocabulary of individual bytes/characters
2. Find the most frequent pair of adjacent tokens
3. Merge that pair into a new token
4. Repeat until you reach the desired vocabulary size
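
Step 1 mentions bytes as a possible starting alphabet. Here is a minimal sketch of why that choice is attractive: UTF-8 turns any string into values in the range 0-255, so a 256-entry base vocabulary can represent any input without an unknown-token fallback. The sample string is arbitrary.

```python
# Byte-level starting point: every string maps onto the same 256 possible values.
sample = "cat 模型 é"

byte_ids = list(sample.encode("utf-8"))
print(f"{len(sample)} characters -> {len(byte_ids)} bytes")
print(byte_ids)

# Non-ASCII characters expand to several bytes each.
for ch in sample:
    print(repr(ch), list(ch.encode("utf-8")))

# Decoding the full byte sequence recovers the original text exactly.
roundtrip = bytes(byte_ids).decode("utf-8")
print(roundtrip == sample)
```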

Let's implement it from scratch.

In [None]:
def train_bpe(text, num_merges):
    """Train a simple BPE tokenizer from scratch."""
    # Start with character-level tokens
    tokens = list(text)
    merges = []  # Track what we merge
    
    print(f"Starting with {len(set(tokens))} unique characters")
    print(f"Text length: {len(tokens)} tokens")
    print("=" * 50)
    
    for step in range(num_merges):
        # Count all adjacent pairs
        pairs = Counter()
        for i in range(len(tokens) - 1):
            pairs[(tokens[i], tokens[i+1])] += 1
        
        if not pairs:
            break
        
        # Find most frequent pair
        best_pair = pairs.most_common(1)[0]
        pair, count = best_pair
        new_token = pair[0] + pair[1]
        
        # Merge all occurrences
        new_tokens = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i+1] == pair[1]:
                new_tokens.append(new_token)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        
        tokens = new_tokens
        merges.append((pair, new_token, count))
        
        print(f"Step {step+1}: Merge '{pair[0]}' + '{pair[1]}' -> '{new_token}' (count: {count}), "
              f"tokens: {len(tokens)}")
    
    return tokens, merges

# Train on a small example
sample_text = "the cat sat on the mat the cat ate the rat"
tokens, merges = train_bpe(sample_text, num_merges=10)

print(f"\nFinal tokenization:")
print(f"  {tokens}")
print(f"  {len(tokens)} tokens (down from {len(sample_text)} characters)")

In [None]:
# Visualize BPE merge steps
fig, ax = plt.subplots(figsize=(12, 5))

# Track token count at each step
token_counts = [len(sample_text)]  # Start with character count
temp_tokens = list(sample_text)

for pair, new_token, count in merges:
    new_temp = []
    i = 0
    while i < len(temp_tokens):
        if i < len(temp_tokens) - 1 and temp_tokens[i] == pair[0] and temp_tokens[i+1] == pair[1]:
            new_temp.append(new_token)
            i += 2
        else:
            new_temp.append(temp_tokens[i])
            i += 1
    temp_tokens = new_temp
    token_counts.append(len(temp_tokens))

ax.plot(range(len(token_counts)), token_counts, 'o-', color='#4ECDC4', linewidth=2, markersize=8)
ax.set_xlabel('Number of BPE Merges', fontsize=12)
ax.set_ylabel('Token Count', fontsize=12)
ax.set_title('How BPE Compression Reduces Token Count', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3)

# Add merge labels
for i, (pair, new_token, count) in enumerate(merges):
    label = f"'{pair[0]}'+'{pair[1]}'->'{new_token}'"
    ax.annotate(label, (i+1, token_counts[i+1]), 
               textcoords="offset points", xytext=(10, 10),
               fontsize=8, rotation=30)

plt.tight_layout()
plt.show()
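
Training is only half of BPE. To tokenize new text, the learned merges are replayed in the order they were learned. The sketch below assumes a hand-written `example_merges` list (hypothetical, not the output of the training run above) and a helper named `apply_merges`.

```python
def apply_merges(text, merge_pairs):
    """Encode text by replaying BPE merges in the order they were learned."""
    tokens = list(text)
    for left, right in merge_pairs:
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical merges, listed in the order they would have been learned.
example_merges = [("t", "h"), ("th", "e"), ("a", "t")]

print(apply_merges("the hat", example_merges))  # ['the', ' ', 'h', 'at']
```

Order matters here: the `("th", "e")` merge can only fire because `("t", "h")` was applied first.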

## Part 3: Comparing Real Tokenizers

Let's load tokenizers from major LLM families and compare them.

In [None]:
# Load multiple tokenizers
tokenizers = {}

# GPT-2
tokenizers['GPT-2'] = tiktoken.get_encoding("gpt2")

# GPT-4 / GPT-3.5
tokenizers['GPT-4'] = tiktoken.get_encoding("cl100k_base")

# GPT-4o
tokenizers['GPT-4o'] = tiktoken.get_encoding("o200k_base")

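# HuggingFace tokenizers
# Note: meta-llama/Llama-2-7b-hf is a gated repository; loading it requires
# accepting the license and authenticating with Hugging Face.
hf_tokenizers = {
    'Llama-2': 'meta-llama/Llama-2-7b-hf',
    'Qwen2': 'Qwen/Qwen2-0.5B',
}

for name, model_id in hf_tokenizers.items():
    try:
        # Both tokenizer classes ship with transformers, so trust_remote_code is not needed
        tokenizers[name] = AutoTokenizer.from_pretrained(model_id)
    except Exception as e:
        print(f"Could not load {name}: {e}")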

# Print vocabulary sizes
print(f"{'Tokenizer':<15s} {'Vocab Size':>12s}")
print("-" * 30)
for name, tok in tokenizers.items():
    if hasattr(tok, 'n_vocab'):
        vocab_size = tok.n_vocab
    elif hasattr(tok, 'vocab_size'):
        vocab_size = tok.vocab_size
    else:
        vocab_size = len(tok)
    print(f"{name:<15s} {vocab_size:>12,}")

In [None]:
# Helper function to tokenize and display results
def compare_tokenization(text, tokenizers):
    """Compare how different tokenizers handle the same text."""
    print(f"Text: '{text}'")
    print(f"Characters: {len(text)}")
    print("=" * 80)
    
    results = {}
    for name, tok in tokenizers.items():
        if isinstance(tok, tiktoken.Encoding):
            ids = tok.encode(text)
            tokens = [tok.decode([t]) for t in ids]
        else:
            ids = tok.encode(text, add_special_tokens=False)
            tokens = [tok.decode([t]) for t in ids]
        
        results[name] = {'ids': ids, 'tokens': tokens, 'count': len(ids)}
        
        # Display tokens with separators
        token_display = ' | '.join(tokens)
        print(f"\n{name} ({len(ids)} tokens):")
        print(f"  Tokens: [{token_display}]")
        print(f"  IDs: {ids}")
    
    return results

# Compare on a simple sentence
results = compare_tokenization(
    "Hello, how are you doing today?",
    tokenizers
)

## Part 4: How Different Content Types Are Tokenized

Tokenizer efficiency varies dramatically depending on the type of content. Let's compare.
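
One caveat when reading characters-per-token figures: characters are not equally sized. The sketch below, which uses placeholder token counts rather than measured values, shows how the same token count looks under a per-character metric versus a per-UTF-8-byte metric. The `compression_metrics` helper is hypothetical.

```python
def compression_metrics(text: str, n_tokens: int) -> dict:
    """Return characters-per-token and UTF-8-bytes-per-token for a text."""
    return {
        "chars_per_token": len(text) / n_tokens,
        "bytes_per_token": len(text.encode("utf-8")) / n_tokens,
    }


english = "natural language"
chinese = "自然语言"

# Placeholder token counts (assumed, not produced by a real tokenizer).
print(compression_metrics(english, n_tokens=2))
print(compression_metrics(chinese, n_tokens=2))
```

For scripts such as Chinese, a low characters-per-token figure can coexist with a reasonable bytes-per-token figure, so it helps to state which unit a comparison uses.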

In [None]:
# Different content types
test_cases = {
    "English prose": "The quick brown fox jumps over the lazy dog. This is a simple sentence that demonstrates basic English tokenization.",
    
    "Python code": """def fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)""",
    
    "JSON data": '{"name": "John", "age": 30, "scores": [95.5, 87.3, 91.0], "active": true}',
    
    "Math equation": "E = mc^2, F = ma, PV = nRT, e^(i*pi) + 1 = 0",
    
    "URLs": "https://www.example.com/api/v2/users?page=1&limit=100&sort=name",
    
    "Chinese text": "Transformer模型在自然语言处理领域取得了巨大的成功。",
    
    "Repeated text": "aaaaabbbbbcccccaaaaabbbbbcccccaaaaabbbbbccccc",
    
    "Numbers": "3.14159265358979323846264338327950288419716939937510",
}

# Tokenize all cases with all tokenizers
efficiency_data = {}

for content_type, text in test_cases.items():
    efficiency_data[content_type] = {}
    for name, tok in tokenizers.items():
        if isinstance(tok, tiktoken.Encoding):
            ids = tok.encode(text)
        else:
            ids = tok.encode(text, add_special_tokens=False)
        
        # Efficiency: characters per token (higher = more efficient)
        efficiency = len(text) / len(ids)
        efficiency_data[content_type][name] = {
            'token_count': len(ids),
            'chars_per_token': efficiency
        }

# Display results
print(f"{'Content Type':<20s}", end="")
for name in tokenizers:
    print(f"{name:>12s}", end="")
print()
print("-" * (20 + 12 * len(tokenizers)))

for content_type in test_cases:
    print(f"{content_type:<20s}", end="")
    for name in tokenizers:
        count = efficiency_data[content_type][name]['token_count']
        print(f"{count:>12d}", end="")
    print()

In [None]:
# Visualize token efficiency across content types
fig, ax = plt.subplots(figsize=(14, 7))

content_types = list(test_cases.keys())
tokenizer_names = list(tokenizers.keys())
x = np.arange(len(content_types))
width = 0.15
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFD93D', '#DDA0DD']

for i, name in enumerate(tokenizer_names):
    values = [efficiency_data[ct][name]['chars_per_token'] for ct in content_types]
    ax.bar(x + i * width, values, width, label=name, color=colors[i % len(colors)],
           edgecolor='black', linewidth=0.5)

ax.set_xlabel('Content Type', fontsize=12)
ax.set_ylabel('Characters per Token (higher = more efficient)', fontsize=12)
ax.set_title('Tokenizer Efficiency Across Content Types', fontsize=13, fontweight='bold')
# Bars sit at x + i*width, so the center of each group is offset by (n - 1) / 2 bar widths
ax.set_xticks(x + width * (len(tokenizer_names) - 1) / 2)
ax.set_xticklabels(content_types, rotation=30, ha='right')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

In [None]:
# Show token-by-token breakdown for code
code = """def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)"""

print("How different tokenizers see Python code:")
print("=" * 60)

for name, tok in list(tokenizers.items())[:3]:  # First 3 tokenizers
    if isinstance(tok, tiktoken.Encoding):
        ids = tok.encode(code)
        tokens = [tok.decode([t]) for t in ids]
    else:
        ids = tok.encode(code, add_special_tokens=False)
        tokens = [tok.decode([t]) for t in ids]
    
    print(f"\n{name} ({len(ids)} tokens):")
    # Print each token's index, ID, and repr (repr makes whitespace and newlines visible)
    for i, (token, tid) in enumerate(zip(tokens, ids)):
        display = repr(token)
        print(f"  [{i:2d}] ID={tid:>6d}  {display}")

## Part 5: Vocabulary Size Comparison

Vocabulary size is a crucial design choice. Larger vocabularies:
- Produce fewer tokens (more efficient for inference)
- But increase embedding table size (more parameters)
- And increase the output softmax computation
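
The last two bullets can be put into rough numbers. The sketch below assumes an untied embedding table and output projection, each of shape `vocab_size x hidden_dim`, and counts about `2 * vocab_size * hidden_dim` FLOPs to compute the output logits for one generated token. The hidden size of 4096 is an assumed example, and `vocab_costs` is a hypothetical helper.

```python
def vocab_costs(vocab_size: int, hidden_dim: int) -> dict:
    """Rough parameter and per-token FLOP costs that scale with vocabulary size."""
    table_params = vocab_size * hidden_dim
    return {
        "embedding_params": table_params,           # input embedding table
        "lm_head_params": table_params,             # output projection (assumed untied)
        "logit_flops_per_token": 2 * table_params,  # one multiply and one add per weight
    }


example_hidden_dim = 4096  # assumed example
for vocab_size in (32_000, 128_256):
    c = vocab_costs(vocab_size, example_hidden_dim)
    print(f"vocab={vocab_size:>7,}: "
          f"embedding={c['embedding_params'] / 1e6:.0f}M params, "
          f"logits={c['logit_flops_per_token'] / 1e9:.2f} GFLOPs/token")
```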

In [None]:
# Vocabulary size comparison
vocab_data = [
    ('GPT-2', 50257, 2019),
    ('GPT-3', 50257, 2020),
    ('BERT', 30522, 2018),
    ('T5', 32100, 2019),
    ('Llama 1', 32000, 2023),
    ('Llama 2', 32000, 2023),
    ('Llama 3', 128256, 2024),
    ('GPT-4', 100277, 2023),
    ('GPT-4o', 200019, 2024),  # n_vocab of o200k_base, counted the same way as GPT-4's 100277
    ('Mistral', 32000, 2023),
    ('Qwen 2', 151936, 2024),
    ('Gemma', 256000, 2024),
]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Bar chart of vocab sizes
names = [v[0] for v in vocab_data]
sizes = [v[1] for v in vocab_data]
years = [v[2] for v in vocab_data]

colors = ['#FF6B6B' if s > 100000 else '#4ECDC4' if s > 50000 else '#45B7D1' for s in sizes]
ax1.barh(names, sizes, color=colors, edgecolor='black', linewidth=0.5)
for i, s in enumerate(sizes):
    ax1.text(s + 2000, i, f'{s:,}', va='center', fontsize=9)
ax1.set_xlabel('Vocabulary Size', fontsize=12)
ax1.set_title('Vocabulary Sizes Across Models', fontsize=13, fontweight='bold')
ax1.grid(True, alpha=0.3, axis='x')

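# Embedding table memory cost
# Hidden sizes are for one representative model per family (GPT-2 small, GPT-3 175B,
# BERT-base, T5-base, Llama 7B/8B, Mistral 7B, Qwen2-0.5B, Gemma 2B).
# GPT-4 and GPT-4o hidden sizes have not been published, so 4096 is only a
# placeholder for those two entries.
hidden_dims = {'GPT-2': 768, 'GPT-3': 12288, 'BERT': 768, 'T5': 768,
               'Llama 1': 4096, 'Llama 2': 4096, 'Llama 3': 4096,
               'GPT-4': 4096, 'GPT-4o': 4096, 'Mistral': 4096, 
               'Qwen 2': 896, 'Gemma': 2048}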

memory_mb = []
for name, size, _ in vocab_data:
    hdim = hidden_dims.get(name, 4096)
    mem = size * hdim * 2 / 1e6  # FP16 bytes
    memory_mb.append(mem)

ax2.barh(names, memory_mb, color=colors, edgecolor='black', linewidth=0.5)
for i, m in enumerate(memory_mb):
    ax2.text(m + 2, i, f'{m:.0f} MB', va='center', fontsize=9)
ax2.set_xlabel('Embedding Table Size (MB, FP16)', fontsize=12)
ax2.set_title('Embedding Memory Cost', fontsize=13, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\nTrend: Newer models use larger vocabularies (100K-256K) for better efficiency.")
print("Llama 3 quadrupled its vocab size from Llama 2 (32K -> 128K).")

## Part 6: Token Encoding and Decoding

Let's explore the complete encode/decode pipeline and see how tokens map to and from text.
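
One subtlety worth keeping in mind for byte-level tokenizers: a token is a sequence of bytes, not necessarily a whole character. The sketch below uses plain UTF-8 bytes and a hand-picked split point (an assumption standing in for real token boundaries) to show why decoding pieces one at a time can produce replacement characters even though the full sequence round-trips cleanly.

```python
text = "语言"
raw = text.encode("utf-8")  # 6 bytes, 3 per character

# Hypothetical token boundaries that cut through the first character.
pieces = [raw[:2], raw[2:]]

# Decoding each piece on its own cannot reconstruct the split character.
per_piece = [p.decode("utf-8", errors="replace") for p in pieces]
print(per_piece)

# Joining the bytes first and decoding once recovers the text exactly.
joined = b"".join(pieces).decode("utf-8")
print(joined == text)
```

This is why per-token displays built with `decode([t])` can show `�` for non-Latin text while a full `decode(ids)` still passes the roundtrip check.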

In [None]:
# Demonstrate encode -> decode roundtrip
enc = tiktoken.get_encoding("cl100k_base")

texts = [
    "Hello, world!",
    "The quick brown fox",
    "GPT-4 is an LLM",
    "    indented code",  # Leading spaces
    "!!??",  # Punctuation
]

for text in texts:
    ids = enc.encode(text)
    decoded = enc.decode(ids)
    tokens = [enc.decode([t]) for t in ids]
    
    print(f"Original:  '{text}'")
    print(f"Token IDs: {ids}")
    print(f"Tokens:    {tokens}")
    print(f"Decoded:   '{decoded}'")
    print(f"Roundtrip: {'PASS' if text == decoded else 'FAIL'}")
    print()

In [None]:
# Interesting tokenization edge cases
edge_cases = [
    ("Numbers: ", ["1", "10", "100", "1000", "10000", "100000", "1000000"]),
    ("Spaces: ", [" ", "  ", "    ", "        "]),
    ("Repeated: ", ["a", "aa", "aaa", "aaaa", "aaaaa", "aaaaaa"]),
]

enc = tiktoken.get_encoding("cl100k_base")

for prefix, examples in edge_cases:
    print(prefix)
    for text in examples:
        ids = enc.encode(text)
        print(f"  '{text}' -> {len(ids)} token(s): {ids}")
    print()

## Part 7: Visualizing Tokenization

Let's create a visual representation of how text gets split into tokens.

In [None]:
def visualize_tokenization(text, tokenizer, tokenizer_name, ax=None):
    """Create a color-coded visualization of tokenization."""
    if isinstance(tokenizer, tiktoken.Encoding):
        ids = tokenizer.encode(text)
        tokens = [tokenizer.decode([t]) for t in ids]
    else:
        ids = tokenizer.encode(text, add_special_tokens=False)
        tokens = [tokenizer.decode([t]) for t in ids]
    
    if ax is None:
        fig, ax = plt.subplots(figsize=(14, 2))
    
    # Color palette for tokens
    colors = plt.cm.Set3(np.linspace(0, 1, max(12, len(tokens))))
    
    x_pos = 0
    y_pos = 0
    max_x = 60  # Characters per line
    
    for i, (token, tid) in enumerate(zip(tokens, ids)):
        display = token.replace('\n', '\\n')
        width = len(display) * 0.12 + 0.1
        
        if x_pos + width > max_x * 0.12:
            x_pos = 0
            y_pos -= 0.4
        
        rect = plt.Rectangle((x_pos, y_pos), width, 0.3, 
                           facecolor=colors[i % len(colors)], 
                           edgecolor='black', linewidth=1)
        ax.add_patch(rect)
        ax.text(x_pos + width/2, y_pos + 0.15, display, 
               ha='center', va='center', fontsize=8, fontweight='bold')
        
        x_pos += width + 0.02
    
    ax.set_xlim(-0.1, 8)
    ax.set_ylim(y_pos - 0.1, 0.5)
    ax.set_title(f'{tokenizer_name}: {len(ids)} tokens', fontsize=11, fontweight='bold')
    ax.axis('off')
    
    return len(ids)

# Visualize a sentence across multiple tokenizers
text = "The transformer architecture revolutionized natural language processing."

fig, axes = plt.subplots(len(tokenizers), 1, figsize=(14, 2.5 * len(tokenizers)))
if len(tokenizers) == 1:
    axes = [axes]

for ax, (name, tok) in zip(axes, tokenizers.items()):
    visualize_tokenization(text, tok, name, ax)

plt.suptitle(f"Tokenization Comparison: '{text}'", fontsize=13, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

## Part 8: Special Tokens

Tokenizers use special tokens for control signals: beginning/end of text, padding, separators, etc.
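
As a rough sketch of how these control tokens are used, the snippet below wraps sequences with BOS/EOS markers and pads a batch to equal length, returning an attention mask alongside. The IDs (`BOS_ID`, `EOS_ID`, `PAD_ID`) and the `pad_batch` helper are hypothetical. Real tokenizers define their own IDs, and not every model uses all three.

```python
BOS_ID, EOS_ID, PAD_ID = 1, 2, 0  # hypothetical IDs for illustration


def pad_batch(sequences):
    """Add BOS/EOS to each sequence, then right-pad to the longest one."""
    wrapped = [[BOS_ID] + seq + [EOS_ID] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    input_ids, attention_mask = [], []
    for seq in wrapped:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return input_ids, attention_mask


ids, mask = pad_batch([[11, 12, 13], [21]])
print(ids)   # [[1, 11, 12, 13, 2], [1, 21, 2, 0, 0]]
print(mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```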

In [None]:
# Explore special tokens
for name, tok in tokenizers.items():
    print(f"\n{name}:")
    
    if isinstance(tok, tiktoken.Encoding):
        special = tok.special_tokens_set
        print(f"  Special tokens: {special}")
    else:
        print(f"  BOS token: {tok.bos_token} (ID: {tok.bos_token_id})")
        print(f"  EOS token: {tok.eos_token} (ID: {tok.eos_token_id})")
        print(f"  PAD token: {tok.pad_token} (ID: {tok.pad_token_id})")
        if hasattr(tok, 'additional_special_tokens') and tok.additional_special_tokens:
            print(f"  Additional: {tok.additional_special_tokens[:5]}")

## Part 9: Impact on Inference Speed

Tokenizer efficiency directly impacts inference:
- **Fewer tokens = faster prefill** (less computation needed)
- **Fewer tokens = less KV cache** (less memory used)
- **Fewer tokens = fewer decode steps** (faster generation)
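
Before measuring real tokenizers, here is a back-of-the-envelope sketch of these effects under stated assumptions: prefill cost is approximated as `2 * params * n` for the weight matmuls plus `4 * n_layers * hidden_dim * n^2` for attention scores and weighted sums, and the KV cache stores keys and values for every layer in FP16 with full multi-head attention. The 7B-style dimensions, the 40,000-character prompt, and the two chars-per-token ratios are illustrative assumptions.

```python
def inference_costs(n_chars, chars_per_token, params=7e9, n_layers=32, hidden_dim=4096):
    """Rough prefill FLOPs and KV-cache bytes for a prompt of n_chars characters."""
    n = round(n_chars / chars_per_token)
    weight_flops = 2 * params * n                    # linear in token count
    attn_flops = 4 * n_layers * hidden_dim * n ** 2  # quadratic in token count
    kv_bytes = 2 * n_layers * n * hidden_dim * 2     # K and V, 2 bytes each (FP16)
    return {"tokens": n, "prefill_flops": weight_flops + attn_flops, "kv_bytes": kv_bytes}


baseline = inference_costs(40_000, chars_per_token=3.0)
efficient = inference_costs(40_000, chars_per_token=4.5)

for label, c in (("3.0 chars/tok", baseline), ("4.5 chars/tok", efficient)):
    print(f"{label}: {c['tokens']:,} tokens, "
          f"{c['prefill_flops'] / 1e12:.1f} TFLOP prefill, "
          f"{c['kv_bytes'] / 1e9:.2f} GB KV cache")
```

The KV cache shrinks in proportion to the token count, while the attention term shrinks with its square.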

Let's quantify this impact.

In [None]:
# Measure tokenization speed
long_text = """The transformer architecture was introduced in the paper 'Attention Is All You Need' 
by Vaswani et al. in 2017. It has since become the foundation for most state-of-the-art 
natural language processing models, including GPT, BERT, T5, and their many variants. 
The key innovation was replacing recurrent layers with self-attention mechanisms, which 
allow the model to process all positions in a sequence simultaneously rather than 
sequentially. This parallelism makes transformers much more efficient to train on modern 
hardware like GPUs and TPUs. The architecture consists of an encoder and decoder, each 
made up of layers containing multi-head self-attention and feed-forward neural networks.
""" * 10  # Repeat for a decent-sized text

print(f"Text length: {len(long_text)} characters")
print("=" * 60)

for name, tok in tokenizers.items():
    # Benchmark encoding speed
    times = []
    for _ in range(50):
        start = time.perf_counter()  # monotonic, high-resolution timer suited to short intervals
        if isinstance(tok, tiktoken.Encoding):
            ids = tok.encode(long_text)
        else:
            ids = tok.encode(long_text, add_special_tokens=False)
        times.append(time.perf_counter() - start)
    
    median_time = np.median(times) * 1000
    token_count = len(ids)
    chars_per_token = len(long_text) / token_count
    
    print(f"{name:15s}: {token_count:5d} tokens, {chars_per_token:.2f} chars/tok, "
          f"encode time: {median_time:.2f} ms")

In [None]:
# Calculate inference cost implications
print("\nInference Cost Implications")
print("(For a 7B parameter model processing the above text)")
print("=" * 70)

params_7b = 7e9
hidden_dim = 4096
n_layers = 32

for name, tok in tokenizers.items():
    if isinstance(tok, tiktoken.Encoding):
        ids = tok.encode(long_text)
    else:
        ids = tok.encode(long_text, add_special_tokens=False)
    
    n_tokens = len(ids)
    
    # Rough FLOP estimate for prefill: ~2 * params * seq_len
    prefill_flops = 2 * params_7b * n_tokens
    
    # KV cache size: 2 (K+V) * n_layers * n_tokens * hidden_dim * 2 bytes (FP16)
    # Assumes full multi-head attention; grouped-query attention would store less
    kv_cache_bytes = 2 * n_layers * n_tokens * hidden_dim * 2
    kv_cache_mb = kv_cache_bytes / 1e6
    
    print(f"{name:15s}: {n_tokens:5d} tokens -> "
          f"Prefill: {prefill_flops/1e12:.1f} TFLOPs, "
          f"KV Cache: {kv_cache_mb:.1f} MB")

print("\nA more efficient tokenizer directly reduces compute and memory costs!")

## Part 10: The Token-to-Embedding Pipeline

Let's trace how tokens become the vectors that flow through the neural network.
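
Before loading a real model, it may help to see that an embedding lookup is just row selection from a table, equivalent to multiplying a one-hot vector by that table. The tiny 4 x 3 table and both helper functions below are made up for illustration.

```python
# Toy embedding table: 4 vocabulary entries, 3 dimensions each (made-up values).
embedding_table = [
    [0.1, 0.2, 0.3],  # token ID 0
    [0.4, 0.5, 0.6],  # token ID 1
    [0.7, 0.8, 0.9],  # token ID 2
    [1.0, 1.1, 1.2],  # token ID 3
]


def lookup(token_ids):
    """Embedding lookup: pick one row of the table per token ID."""
    return [embedding_table[t] for t in token_ids]


def one_hot_matmul(token_id):
    """Same result via a one-hot vector multiplied by the table."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(embedding_table))]
    dim = len(embedding_table[0])
    return [sum(one_hot[i] * embedding_table[i][d] for i in range(len(one_hot)))
            for d in range(dim)]


print(lookup([2, 0]))     # rows 2 and 0 of the table
print(one_hot_matmul(2))  # matches row 2
```

Frameworks implement the lookup as direct indexing because it avoids materializing a vocabulary-sized one-hot vector per token.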

In [None]:
!pip install transformers torch -q

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load GPT-2 (small, runs easily on Colab)
model_name = "gpt2"
tokenizer_gpt2 = AutoTokenizer.from_pretrained(model_name)
model_gpt2 = AutoModel.from_pretrained(model_name)

text = "Tokenization is the first step."

# Step 1: Tokenize
inputs = tokenizer_gpt2(text, return_tensors='pt')
token_ids = inputs['input_ids'][0]
tokens = [tokenizer_gpt2.decode([t]) for t in token_ids]

print("Step 1: Tokenization")
print(f"  Text: '{text}'")
print(f"  Tokens: {tokens}")
print(f"  Token IDs: {token_ids.tolist()}")

# Step 2: Embedding lookup
embedding_layer = model_gpt2.wte  # Word Token Embedding
print(f"\nStep 2: Embedding Lookup")
print(f"  Embedding table shape: {embedding_layer.weight.shape}")
print(f"  (vocab_size x hidden_dim)")

with torch.no_grad():
    embeddings = embedding_layer(token_ids)

print(f"  Output shape: {embeddings.shape} (seq_len x hidden_dim)")

# Step 3: Position encoding
position_embedding = model_gpt2.wpe  # Word Position Embedding
positions = torch.arange(len(token_ids))

with torch.no_grad():
    pos_embeds = position_embedding(positions)

print(f"\nStep 3: Add Position Embeddings")
print(f"  Position embedding shape: {pos_embeds.shape}")

# Final input to transformer
hidden_states = embeddings + pos_embeds
print(f"\nFinal input to transformer: {hidden_states.shape}")
print(f"  Each token is now a {hidden_states.shape[-1]}-dimensional vector")

In [None]:
# Visualize token embeddings
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Token embeddings heatmap
ax = axes[0, 0]
im = ax.imshow(embeddings.numpy(), aspect='auto', cmap='RdBu_r')
ax.set_xlabel('Hidden Dimension')
ax.set_ylabel('Token')
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels([f"{t} ({tid})" for t, tid in zip(tokens, token_ids.tolist())], fontsize=9)
ax.set_title('Token Embeddings', fontsize=12, fontweight='bold')
plt.colorbar(im, ax=ax, shrink=0.8)

# Position embeddings
ax = axes[0, 1]
im = ax.imshow(pos_embeds.numpy(), aspect='auto', cmap='RdBu_r')
ax.set_xlabel('Hidden Dimension')
ax.set_ylabel('Position')
ax.set_title('Position Embeddings', fontsize=12, fontweight='bold')
plt.colorbar(im, ax=ax, shrink=0.8)

# Combined
ax = axes[1, 0]
im = ax.imshow(hidden_states.numpy(), aspect='auto', cmap='RdBu_r')
ax.set_xlabel('Hidden Dimension')
ax.set_ylabel('Token')
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens, fontsize=9)
ax.set_title('Token + Position Embeddings (input to transformer)', fontsize=12, fontweight='bold')
plt.colorbar(im, ax=ax, shrink=0.8)

# Cosine similarity between token embeddings
ax = axes[1, 1]
emb_norm = embeddings / embeddings.norm(dim=-1, keepdim=True)
sim_matrix = (emb_norm @ emb_norm.T).numpy()
im = ax.imshow(sim_matrix, cmap='YlOrRd', vmin=0, vmax=1)
ax.set_xticks(range(len(tokens)))
ax.set_yticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=45, ha='right', fontsize=9)
ax.set_yticklabels(tokens, fontsize=9)
ax.set_title('Cosine Similarity Between Token Embeddings', fontsize=12, fontweight='bold')
plt.colorbar(im, ax=ax, shrink=0.8)

# Add values to similarity matrix
for i in range(len(tokens)):
    for j in range(len(tokens)):
        ax.text(j, i, f'{sim_matrix[i,j]:.2f}', ha='center', va='center', fontsize=7)

plt.suptitle('From Tokens to Vectors: The Embedding Pipeline', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## Part 11: Token Distribution Analysis

Let's analyze what types of tokens dominate in different types of text.

In [None]:
# Analyze token length distribution
enc = tiktoken.get_encoding("cl100k_base")

sample_texts = {
    'English': """The transformer architecture has revolutionized the field of natural language 
    processing. Models like GPT-4 and Claude can understand and generate human-like text with 
    remarkable accuracy. These models are trained on massive datasets and use billions of parameters.""",
    
    # Adjacent string literals keep the sample free of stray indentation and blank lines
    'Python': (
        "import torch\n"
        "import torch.nn as nn\n\n"
        "class TransformerBlock(nn.Module):\n"
        "    def __init__(self, hidden_dim, num_heads):\n"
        "        super().__init__()\n"
        "        self.attention = nn.MultiheadAttention(hidden_dim, num_heads)\n"
        "        self.ffn = nn.Sequential(nn.Linear(hidden_dim, 4*hidden_dim), nn.GELU())\n"
    ),
    
    'Math': """Let f(x) = integral from 0 to infinity of e^(-x^2) dx = sqrt(pi)/2. 
    The derivative d/dx[sin(x)*cos(x)] = cos(2x). The sum from n=1 to infinity of 1/n^2 = pi^2/6.""",
}

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for ax, (text_type, text) in zip(axes, sample_texts.items()):
    ids = enc.encode(text)
    token_lengths = [len(enc.decode([t])) for t in ids]
    
    ax.hist(token_lengths, bins=range(1, max(token_lengths)+2), 
           color='#4ECDC4', edgecolor='black', linewidth=0.5, alpha=0.7,
           align='left')
    ax.set_xlabel('Token Length (characters)', fontsize=11)
    ax.set_ylabel('Count', fontsize=11)
    ax.set_title(f'{text_type}\n({len(ids)} tokens, avg {np.mean(token_lengths):.1f} chars/token)', 
                fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3)

plt.suptitle('Token Length Distribution by Content Type (GPT-4 tokenizer)', 
            fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

---

## Key Takeaways

1. **Tokenization converts text to numbers** that neural networks can process. It's the very first step in any LLM pipeline.

2. **BPE (Byte Pair Encoding)** is the dominant algorithm: it iteratively merges the most frequent pairs of tokens to build a vocabulary of subword units.

3. **Vocabulary size is a key design choice**: Newer models trend toward larger vocabularies (100K-256K tokens) for better tokenization efficiency, especially for code and multilingual text.

4. **Tokenizer efficiency varies by content type**: English prose is tokenized most efficiently (3-5 chars/token), while code, math, and non-Latin scripts often produce more tokens.

5. **Tokenization directly impacts inference cost**: Fewer tokens means less computation (prefill), less memory (KV cache), and fewer generation steps (decode). A 2x reduction in token count roughly halves the costs that scale linearly with sequence length, and attention cost falls faster still because it scales quadratically.

6. **Special tokens** (BOS, EOS, PAD) serve as control signals that tell the model where sequences begin, end, and how to handle batching.

7. **The token-to-vector pipeline**: Token ID -> Embedding lookup -> Add position information (learned position embeddings in GPT-2) -> Input to transformer. Each token becomes a high-dimensional vector, from 768 dimensions in GPT-2 small up to 12,288 in GPT-3.

---

**Next notebook:** We'll dive into the attention mechanism -- the core operation that allows tokens to "look at" each other and build context-dependent representations.