# Tokenization Playground

This notebook explores different tokenization techniques used in NLP and LLMs.

In [2]:
# Import required libraries
import tiktoken
from transformers import AutoTokenizer
import matplotlib.pyplot as plt

## OpenAI Tokenization with tiktoken

In [3]:
# Initialize OpenAI tokenizer
encoding = tiktoken.get_encoding("cl100k_base")  # GPT-4 encoding

# Sample text
text = "Hello, world! This is a sample text for tokenization."
justHello = "Hello"

# Tokenize
tokens = encoding.encode(text)
print(f"Original text: {text}")
print(f"Tokens: {tokens}")
print(f"justHello tokens: {encoding.encode(justHello)}")
print(f"Number of tokens: {len(tokens)}")

# Decode back to text
decoded = encoding.decode(tokens)
print(f"Decoded text: {decoded}")


Original text: Hello, world! This is a sample text for tokenization.
Tokens: [9906, 11, 1917, 0, 1115, 374, 264, 6205, 1495, 369, 4037, 2065, 13]
justHello tokens: [9906]
Number of tokens: 13
Decoded text: Hello, world! This is a sample text for tokenization.


In [None]:
# Case sensitivity: compare token IDs for different capitalizations
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

# Test case sensitivity
hello_lower = "hello"
hello_upper = "Hello" 
hello_caps = "HELLO"

tokens_lower = encoding.encode(hello_lower)
tokens_upper = encoding.encode(hello_upper)
tokens_caps = encoding.encode(hello_caps)

print("Case sensitivity test:")
print(f"'hello' -> {tokens_lower}")
print(f"'Hello' -> {tokens_upper}")
print(f"'HELLO' -> {tokens_caps}")
print()
# Only pass the first token of tokens_caps to decode
print(f"First token of 'HELLO': {tokens_caps[0]}")
print(f"Decoded: {encoding.decode([tokens_caps[0]])}")
# Test with spaces
space_hello = " hello"
space_Hello = " Hello"

print("With leading space:")
print(f"' hello' -> {encoding.encode(space_hello)}")
print(f"' Hello' -> {encoding.encode(space_Hello)}")

Case sensitivity test:
'hello' -> [15339]
'Hello' -> [9906]
'HELLO' -> [51812, 1623]

First token of 'HELLO': 51812
Decoded: HEL
With leading space:
' hello' -> [24748]
' Hello' -> [22691]


## What is cl100k_base?

`cl100k_base` is a **tokenizer encoding** (vocabulary), not an LLM itself. Think of it as the "dictionary" that tells the model how to convert text into numbers.

**Different OpenAI models use different encodings:**
- `cl100k_base`: Used by GPT-4, GPT-3.5-turbo, text-embedding-ada-002
- `p50k_base`: Used by Codex models, text-davinci-002, text-davinci-003
- `r50k_base`: Used by older GPT-3 models (davinci, curie, babbage, ada)

The encoding determines:
- How text gets split into tokens
- What the vocabulary size is (cl100k_base has ~100k tokens)
- How efficiently different types of text are tokenized
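
One rough way to put a number on "efficiency" is the average count of UTF-8 bytes covered by each token. The sketch below reuses the `cl100k_base` token IDs printed earlier in this notebook instead of calling `tiktoken`, so it runs without the library. `bytes_per_token` is a hypothetical helper name, not part of any API.

```python
# Token IDs copied from the cl100k_base output earlier in this notebook.
text = "Hello, world! This is a sample text for tokenization."
cl100k_ids = [9906, 11, 1917, 0, 1115, 374, 264, 6205, 1495, 369, 4037, 2065, 13]


def bytes_per_token(text, token_ids):
    """Average number of UTF-8 bytes covered by each token (hypothetical helper)."""
    return len(text.encode("utf-8")) / len(token_ids)


ratio = bytes_per_token(text, cl100k_ids)
n_bytes = len(text.encode("utf-8"))
print(f"{n_bytes} bytes / {len(cl100k_ids)} tokens = {ratio:.2f} bytes per token")
```

For this sentence the ratio is 53 bytes over 13 tokens, or about 4.08 bytes per token. A higher ratio means the same text costs fewer tokens.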

In [None]:
# Compare different OpenAI encodings
import tiktoken

text = "Hello, world! This is a sample text for tokenization."

# Different encodings used by different models
encodings = {
    "cl100k_base": "GPT-4, GPT-3.5-turbo",
    "p50k_base": "Codex, text-davinci-002/003", 
    "r50k_base": "GPT-3 davinci, curie, babbage, ada"
}

print("Comparing encodings for the same text:")
print(f"Text: '{text}'")
print()

for encoding_name, models in encodings.items():
    try:
        encoding = tiktoken.get_encoding(encoding_name)
        tokens = encoding.encode(text)
        print(f"{encoding_name} ({models}):")
        print(f"  Tokens: {len(tokens)}")
        print(f"  Token IDs: {tokens[:10]}{'...' if len(tokens) > 10 else ''}")
        print()
    except Exception as e:
        print(f"{encoding_name}: Error - {e}")
        print()

Comparing encodings for the same text:
Text: 'Hello, world! This is a sample text for tokenization.'

cl100k_base (GPT-4, GPT-3.5-turbo):
  Tokens: 13
  Token IDs: [9906, 11, 1917, 0, 1115, 374, 264, 6205, 1495, 369]...

p50k_base (Codex, text-davinci-002/003):
  Tokens: 13
  Token IDs: [15496, 11, 995, 0, 770, 318, 257, 6291, 2420, 329]...

r50k_base (GPT-3 davinci, curie, babbage, ada):
  Tokens: 13
  Token IDs: [15496, 11, 995, 0, 770, 318, 257, 6291, 2420, 329]...



## Is cl100k_base Open Source?

**Yes and No** - it's complicated:

### What's Available:
- ✅ **Tokenizer implementation**: The `tiktoken` library is open source (MIT license)
- ✅ **Encoding rules**: You can encode/decode text freely
- ✅ **Vocabulary mappings**: The token-to-text mappings are accessible
- ✅ **Usage**: The library's MIT license permits use in your own projects, subject to its terms

### What's NOT Available:
- ❌ **Training details**: How exactly it was trained isn't fully documented
- ❌ **Training data**: The specific dataset used to create the vocabulary
- ❌ **Full methodology**: Complete details of the BPE training process

### Practical Impact:
- You can **use** cl100k_base in your own projects through `tiktoken`
- You can **inspect** what tokens exist and how text gets tokenized
- You **cannot** easily recreate or modify the encoding from scratch

It's "open source" in terms of usage, but not in terms of reproducibility.

In [None]:
# Exploring what's accessible about cl100k_base
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

print("What we CAN access about cl100k_base:")
print("=" * 50)

# 1. Vocabulary size
print(f"Vocabulary size: {encoding.n_vocab:,} tokens")

# 2. Some example tokens and their IDs
print(f"\nSample token mappings:")
test_words = ["hello", "world", "Python", "AI", "🤖", " the"]
for word in test_words:
    tokens = encoding.encode(word)
    print(f"  '{word}' -> {tokens} -> '{encoding.decode(tokens)}'")

# 3. Special tokens (if any)
try:
    special_tokens = encoding.special_tokens_set
    print(f"\nSpecial tokens: {special_tokens}")
except AttributeError:
    # Fall back if this attribute is unavailable in the installed version
    print("\nSpecial tokens: Not directly accessible")

# 4. Sample token IDs and the largest token ID in the encoding
sample_tokens = [encoding.encode(word)[0] for word in ["a", "the", "and", "hello", "world"]]
print(f"\nSample token IDs: {sample_tokens}")
print(f"Max token ID in the encoding: {encoding.max_token_value}")

print(f"\nWhat we CANNOT easily access:")
print("- The original training dataset")
print("- The corpus statistics that determined the merge order")
print("- Training hyperparameters")
print("- Step-by-step creation process")

## Is Tokenization Just Dictionary Lookup?

**Mostly yes, but with some important nuances:**

### The Simple View (Encoding/Decoding):
- ✅ **Encoding**: Text → Split into pieces by applying the learned merges → Look up each piece's ID
- ✅ **Decoding**: Token IDs → Look up in dictionary → Text
- Decoding is a pure dictionary lookup; encoding adds a deterministic splitting step before the lookup.

### The Complex Part (How the Dictionary Was Built):
The "dictionary" itself was created through a sophisticated process:

1. **Byte Pair Encoding (BPE)**: Iteratively merges the most frequent adjacent pair of symbols (bytes, in tiktoken's encodings)
2. **Frequency Analysis**: Counts pairs across a large text corpus to decide which merge comes next
3. **Optimization**: Balances vocabulary size vs. text compression efficiency

### Key Insight:
- **Using** the tokenizer = Deterministic rule application plus dictionary lookup ✅
- **Creating** the tokenizer = Data-driven training over a large corpus 🧠

Think of it like using Google Translate vs. building Google Translate!
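
To make the lookup idea concrete, here is a minimal sketch with a hypothetical eight-entry vocabulary. It shows where the lookup happens and what extra step encoding needs: pieces are merged in rank order before their IDs are looked up. The names `TOY_RANKS`, `toy_encode`, and `toy_decode` are invented for illustration, and this is not tiktoken's actual implementation.

```python
# Hypothetical toy vocabulary: byte sequence -> rank (which doubles as the token ID).
# A lower rank means the piece was learned earlier and is merged first.
TOY_RANKS = {
    b"h": 0, b"e": 1, b"l": 2, b"o": 3,
    b"ll": 4, b"he": 5, b"hell": 6, b"hello": 7,
}
ID_TO_BYTES = {rank: piece for piece, rank in TOY_RANKS.items()}


def toy_encode(text):
    """Split into single bytes, then repeatedly merge the adjacent pair whose
    merged form has the lowest rank, until no adjacent pair is in the vocabulary."""
    parts = [bytes([b]) for b in text.encode("utf-8")]
    while True:
        best = None  # (rank, index) of the best mergeable pair found so far
        for i in range(len(parts) - 1):
            rank = TOY_RANKS.get(parts[i] + parts[i + 1])
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            break
        i = best[1]
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    # Final step: a plain dictionary lookup per piece
    return [TOY_RANKS[p] for p in parts]


def toy_decode(token_ids):
    """Decoding is a plain lookup followed by concatenation."""
    return b"".join(ID_TO_BYTES[t] for t in token_ids).decode("utf-8")


print(toy_encode("hello"))              # merges all the way up to one token
print(toy_encode("hole"))               # no merges apply, one token per byte
print(toy_decode(toy_encode("hole")))
```

With this toy vocabulary, `"hello"` collapses to the single ID `[7]`, while `"hole"` stays as four single-byte tokens `[0, 3, 2, 1]`. The toy vocabulary only covers the bytes `h`, `e`, `l`, `o`, so other input raises a `KeyError`.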

In [None]:
# Demonstrating the "dictionary lookup" nature of tokenization
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

print("🔍 ENCODING (Text → Token IDs) - Dictionary Lookup:")
print("=" * 60)

text = "Hello, AI researcher!"
tokens = encoding.encode(text)

print(f"Input: '{text}'")
print(f"Output: {tokens}")
print()

# Show it's just mapping each piece to a number
print("Breaking it down piece by piece:")
for i, token_id in enumerate(tokens):
    # Decode each individual token to see what text it represents
    piece = encoding.decode([token_id])
    print(f"  Token {i+1}: '{piece}' → {token_id}")

print("\n" + "🔍 DECODING (Token IDs → Text) - Reverse Dictionary Lookup:")
print("=" * 60)

print(f"Input: {tokens}")
decoded = encoding.decode(tokens)
print(f"Output: '{decoded}'")
print()

print("Breaking it down piece by piece:")
for i, token_id in enumerate(tokens):
    piece = encoding.decode([token_id])
    print(f"  Token ID {token_id} → '{piece}'")

print("\n" + "🧠 THE COMPLEX PART (Creating the Dictionary):")
print("=" * 60)
print("❓ Why does 'Hello' get token 9906?")
print("❓ Why does ',' get token 11?") 
print("❓ Why does ' AI' (with space) get token 15592?")
print()
print("👆 THESE decisions came from analyzing billions of text examples")
print("   to find the most efficient way to split language into pieces!")

# Show that some words get split in surprising ways
print(f"\n🎯 Example of BPE intelligence:")
weird_word = "antidisestablishmentarianism"
weird_tokens = encoding.encode(weird_word)
print(f"'{weird_word}' →")
for token_id in weird_tokens:
    piece = encoding.decode([token_id])
    print(f"  '{piece}' ({token_id})")
print("👆 No single token covers this word, so BPE assembles it from smaller learned pieces!")

🔍 ENCODING (Text → Token IDs) - Dictionary Lookup:
Input: 'Hello, AI researcher!'
Output: [9906, 11, 15592, 32185, 0]

Breaking it down piece by piece:
  Token 1: 'Hello' → 9906
  Token 2: ',' → 11
  Token 3: ' AI' → 15592
  Token 4: ' researcher' → 32185
  Token 5: '!' → 0

🔍 DECODING (Token IDs → Text) - Reverse Dictionary Lookup:
Input: [9906, 11, 15592, 32185, 0]
Output: 'Hello, AI researcher!'

Breaking it down piece by piece:
  Token ID 9906 → 'Hello'
  Token ID 11 → ','
  Token ID 15592 → ' AI'
  Token ID 32185 → ' researcher'
  Token ID 0 → '!'

🧠 THE COMPLEX PART (Creating the Dictionary):
❓ Why does 'Hello' get token 9906?
❓ Why does ',' get token 11?
❓ Why does ' AI' (with space) get token 15592?

👆 THESE decisions came from analyzing billions of text examples
   to find the most efficient way to split language into pieces!

🎯 Example of BPE intelligence:
'antidisestablishmentarianism' →
  'ant' (519)
  'idis' (85342)
  'establish' (34500)
  'ment' (479)
  'arian' (8997)
  '

In [None]:
# Step-by-step BPE breakdown of "antidisestablishmentarianism"
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

word = "antidisestablishmentarianism"
tokens = encoding.encode(word)

print("🔬 STEP-BY-STEP BPE BREAKDOWN")
print("=" * 50)
print(f"Word: '{word}'")
print(f"Final tokens: {tokens}")
print()

print("🎯 How BPE likely processed this word:")
print("(Note: This is a reconstruction - the actual BPE training was more complex)")
print()

# Start with character level
print("Step 1 - Start with characters:")
chars = list(word)
print(f"  {chars}")
print()

print("Step 2 - BPE iteratively merges frequent pairs:")
print("  (Showing the final result of many merge operations)")
print()

# Show the actual breakdown
print("Step 3 - Final tokenization:")
for i, token_id in enumerate(tokens):
    piece = encoding.decode([token_id])
    print(f"  Token {i+1}: '{piece}' (ID: {token_id})")

print()
print("🧠 Why these specific splits?")
print("Each piece likely appeared frequently in the training data:")
print("  • 'ant' - common prefix (antenna, anticipate, etc.)")
print("  • 'idis' - learned as a unit from words like 'antidiscriminatory'")  
print("  • 'establish' - very common word, gets its own token")
print("  • 'ment' - common suffix (development, movement, etc.)")
print("  • 'arian' - common in words like 'vegetarian', 'librarian'")
print("  • 'ism' - common suffix (capitalism, racism, etc.)")
print()
print("👆 The algorithm learned these patterns from seeing millions of examples!")

# Let's also test some related words to see patterns
print("\n🔍 Testing related words to see BPE patterns:")
related_words = ["establish", "establishment", "vegetarian", "capitalism"]
for test_word in related_words:
    test_tokens = encoding.encode(test_word)
    # Join by position rather than comparing token values, so a repeated
    # token ID cannot end the line early
    pieces = [f"'{encoding.decode([token_id])}'" for token_id in test_tokens]
    print(f"'{test_word}' → " + " + ".join(pieces))

🔬 STEP-BY-STEP BPE BREAKDOWN
Word: 'antidisestablishmentarianism'
Final tokens: [519, 85342, 34500, 479, 8997, 2191]

🎯 How BPE likely processed this word:
(Note: This is a reconstruction - the actual BPE training was more complex)

Step 1 - Start with characters:
  ['a', 'n', 't', 'i', 'd', 'i', 's', 'e', 's', 't', 'a', 'b', 'l', 'i', 's', 'h', 'm', 'e', 'n', 't', 'a', 'r', 'i', 'a', 'n', 'i', 's', 'm']

Step 2 - BPE iteratively merges frequent pairs:
  (Showing the final result of many merge operations)

Step 3 - Final tokenization:
  Token 1: 'ant' (ID: 519)
  Token 2: 'idis' (ID: 85342)
  Token 3: 'establish' (ID: 34500)
  Token 4: 'ment' (ID: 479)
  Token 5: 'arian' (ID: 8997)
  Token 6: 'ism' (ID: 2191)

🧠 Why these specific splits?
Each piece likely appeared frequently in the training data:
  • 'ant' - common prefix (antenna, anticipate, etc.)
  • 'idis' - learned as a unit from words like 'antidiscriminatory'
  • 'establish' - very common word, gets its own token
  • 'ment' - c

## How BPE Training Actually Works (Conceptually)

A BPE training run, like the one that presumably produced this "dictionary", works roughly like this:

### Phase 1: Character Level Start
```
antidisestablishmentarianism → ['a','n','t','i','d','i','s','e','s','t',...]
```

### Phase 2: Iterative Merging (Thousands of Steps)
The algorithm analyzed massive text corpora and repeatedly:

1. **Count all adjacent pairs**: "an", "nt", "ti", "id", etc.
2. **Find the most frequent pair**: Maybe "th" appears 1M times across all text
3. **Merge that pair**: Replace all "t" + "h" with a single "th" token
4. **Repeat**: Now "th" can form new pairs like "the", "thi", etc.
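
A single pass through steps 1 to 3 can be sketched as follows. The five-word corpus is hypothetical, and `count_pairs` and `merge_pair` are illustrative helper names. A real training run would repeat this over a far larger corpus.

```python
from collections import Counter

# Hypothetical miniature corpus; a real tokenizer is trained on far more text.
corpus = ["the", "then", "this", "that", "cat"]

# Each word starts as a list of single-character symbols.
words = [list(w) for w in corpus]


def count_pairs(words):
    """Step 1: count every adjacent symbol pair across the corpus."""
    counts = Counter()
    for symbols in words:
        counts.update(zip(symbols, symbols[1:]))
    return counts


def merge_pair(symbols, pair):
    """Step 3: replace each occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


pair_counts = count_pairs(words)
best_pair = max(pair_counts, key=pair_counts.get)  # Step 2: most frequent pair
words = [merge_pair(symbols, best_pair) for symbols in words]

print(best_pair, pair_counts[best_pair])
print(words)
```

In this corpus the pair `('t', 'h')` occurs four times, so it is merged first. After the merge, `"the"` is represented as `['th', 'e']` while `"cat"` is unchanged.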

### Phase 3: Result
After many iterations (roughly one merge per vocabulary entry beyond the base symbols), common patterns emerge:
- **"establish"** became one token (very frequent word)
- **"ment"** became one token (frequent suffix)  
- **"ism"** became one token (frequent suffix)
- **"ant"** became one token (frequent prefix)

### The Magic
Whether or not the algorithm ever saw "antidisestablishmentarianism" during training, it learned building blocks that can be combined to represent it!
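
The idea of reusing learned building blocks can be illustrated with a toy, character-level trainer. This is a sketch under simplifying assumptions: the corpus is hypothetical and deliberately tiny, `train_bpe`, `apply_merge`, and `segment` are invented names, and training runs until every corpus word is a single symbol, whereas a real run stops at a target vocabulary size.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each left-to-right occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus, max_merges):
    """Learn an ordered list of merges from a list of words (toy, character-level)."""
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(max_merges):
        counts = Counter()
        for symbols in words:
            counts.update(zip(symbols, symbols[1:]))
        if not counts:  # every word is already a single symbol
            break
        best = max(counts, key=counts.get)  # ties go to the pair seen first
        merges.append(best)
        words = [apply_merge(symbols, best) for symbols in words]
    return merges


def segment(word, merges):
    """Tokenize a word by replaying the learned merges in training order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Hypothetical corpus; "bookshelf" and "bookcase" never appear in it.
corpus = ["book", "book", "book", "shelf", "shelf", "case"]
merges = train_bpe(corpus, max_merges=50)

print(segment("bookshelf", merges))
print(segment("bookcase", merges))
```

Neither compound word is in the training corpus, yet they come out as `['book', 'shelf']` and `['book', 'case']`, because the merges learned from the individual words apply to the matching substrings.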