# Tokenization Deep Dive: What LLMs Actually "See"

This notebook explores **tokenization** — the process that converts raw text into the numerical units an LLM actually processes. Understanding tokenization is essential because it directly impacts:

1. **Model Behavior** — Why LLMs fail at seemingly simple tasks (the "Strawberry" problem)
2. **Multilingual Fairness** — Why non-English users pay more and get less
3. **Cost & Limits** — How tokens translate to dollars and context window constraints

---

### Setup

In [1]:
import sys
!{sys.executable} -m pip install -q tiktoken

In [2]:
import tiktoken

# Load the tokenizer used by GPT-4o (the o200k_base encoding; GPT-4 uses cl100k_base)
enc = tiktoken.encoding_for_model("gpt-4o")

def show_tokens(text, encoding=enc):
    """Visualize how a tokenizer splits text."""
    token_ids = encoding.encode(text)
    tokens = [encoding.decode([tid]) for tid in token_ids]
    print(f"Text:       {repr(text)}")
    print(f"Token IDs:  {token_ids}")
    print(f"Tokens:     {tokens}")
    print(f"Count:      {len(token_ids)} tokens")
    print()
    return tokens, token_ids

---

## 1. The "Strawberry" Problem

A famous failure mode: ask an LLM *"How many r's are in strawberry?"* and it often answers **2** instead of **3**.

This is largely a **tokenization** artifact rather than a pure reasoning failure. The model never sees individual letters; it sees *tokens*, which are subword chunks. Let's prove it.

In [3]:
print("=" * 60)
print("HOW THE TOKENIZER SPLITS 'strawberry'")
print("=" * 60)

tokens, ids = show_tokens("strawberry")

print("The model sees these chunks, NOT individual letters.")
print("To count 'r's, it would need to decompose tokens back")
print("into characters — something it was never trained to do.")

HOW THE TOKENIZER SPLITS 'strawberry'
Text:       'strawberry'
Token IDs:  [302, 1618, 19772]
Tokens:     ['st', 'raw', 'berry']
Count:      3 tokens

The model sees these chunks, NOT individual letters.
To count 'r's, it would need to decompose tokens back
into characters — something it was never trained to do.


### Let's see the problem more clearly

Below we'll show that the letter boundaries *inside* a token are invisible to the model.

In [5]:
print("=" * 60)
print("CHARACTER vs TOKEN VIEW")
print("=" * 60)

word = "strawberry"
token_ids = enc.encode(word)
tokens = [enc.decode([tid]) for tid in token_ids]

# Character-level view (what humans see)
print("\nWhat HUMANS see (character-level):")
print("  ", " | ".join(list(word)))
print(f"  r count = {word.count('r')}  (easy — just scan each letter)")

# Token-level view (what the model sees)
print("\nWhat the MODEL sees (token-level):")
print("  ", " | ".join([repr(t) for t in tokens]))
print(f"  The model must figure out how many 'r's hide INSIDE each token.")
print(f"  It has no direct letter-level access — only learned statistical patterns.")

CHARACTER vs TOKEN VIEW

What HUMANS see (character-level):
   s | t | r | a | w | b | e | r | r | y
  r count = 3  (easy — just scan each letter)

What the MODEL sees (token-level):
   'st' | 'raw' | 'berry'
  The model must figure out how many 'r's hide INSIDE each token.
  It has no direct letter-level access — only learned statistical patterns.


### More examples where tokenization causes confusion

In [6]:
tricky_words = [
    ("assessment", "s"),     # How many s's?
    ("mississippi", "s"),    # How many s's?
    ("bookkeeper", "o"),     # How many o's?
    ("committee", "m"),      # How many m's?
    ("occurrence", "c"),     # How many c's?
]

print("=" * 60)
print("WORDS THAT TRIP UP LLMs ON LETTER COUNTING")
print("=" * 60)

for word, letter in tricky_words:
    token_ids = enc.encode(word)
    tokens = [enc.decode([tid]) for tid in token_ids]
    actual_count = word.count(letter)
    print(f"\n'{word}' → tokens: {tokens}")
    print(f"  How many '{letter}'s? Actual = {actual_count}")
    print(f"  The letter '{letter}' is buried inside token boundaries — hard for the model.")

WORDS THAT TRIP UP LLMs ON LETTER COUNTING

'assessment' → tokens: ['assessment']
  How many 's's? Actual = 4
  The letter 's' is buried inside token boundaries — hard for the model.

'mississippi' → tokens: ['miss', 'issippi']
  How many 's's? Actual = 4
  The letter 's' is buried inside token boundaries — hard for the model.

'bookkeeper' → tokens: ['book', 'keeper']
  How many 'o's? Actual = 2
  The letter 'o' is buried inside token boundaries — hard for the model.

'committee' → tokens: ['committee']
  How many 'm's? Actual = 2
  The letter 'm' is buried inside token boundaries — hard for the model.

'occurrence' → tokens: ['occ', 'urrence']
  How many 'c's? Actual = 3
  The letter 'c' is buried inside token boundaries — hard for the model.


### Key Takeaway

> LLMs do not process text character-by-character. They process **tokens** — subword units learned from data via Byte Pair Encoding (BPE). Any task that requires character-level reasoning (counting letters, reversing strings, anagram solving) is fundamentally harder for LLMs because the characters are packed opaquely inside tokens.
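
Because character-level questions are deterministic, one practical response is to answer them in ordinary code instead of asking the model. The sketch below reuses the `['st', 'raw', 'berry']` split printed earlier; the helper name `count_letter` is illustrative and not part of any library.

```python
def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a single letter, case-insensitively."""
    return word.lower().count(letter.lower())

# Token split copied from the notebook output (GPT-4o tokenizer).
tokens = ["st", "raw", "berry"]

# Per-token letter counts: information the model is never given explicitly.
per_token = {tok: count_letter(tok, "r") for tok in tokens}
total = sum(per_token.values())

print(per_token)            # {'st': 0, 'raw': 1, 'berry': 2}
print(total)                # 3
print("strawberry"[::-1])   # reversing a string is a one-liner in code
```

The per-token counts are exactly what a model would have to memorize for every token in its vocabulary to answer reliably, which is why delegating to code is the more robust option.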

---

## 2. Multilingual Efficiency: Why Non-English Users Pay More

Most tokenizers are trained primarily on English text. This means:
- Common English words → **1 token** (efficient)
- Non-English words → often **several tokens** (inefficient)

This has real consequences: more tokens = higher cost, slower inference, and faster context window exhaustion.

Let's compare English against Hindi and other Indian languages.

In [7]:
print("=" * 70)
print("ENGLISH vs HINDI: SAME MEANING, DIFFERENT TOKEN COST")
print("=" * 70)

pairs = [
    ("Hello, how are you?", "नमस्ते, आप कैसे हैं?"),
    ("Artificial Intelligence", "कृत्रिम बुद्धिमत्ता"),
    ("machine learning", "मशीन लर्निंग"),
    ("The weather is nice today", "आज मौसम अच्छा है"),
    ("What is your name?", "आपका नाम क्या है?"),
]

print(f"{'English':<35} {'Tokens':>6}  |  {'Hindi':<35} {'Tokens':>6}  |  Ratio")
print("-" * 110)

for eng, hin in pairs:
    eng_tokens = len(enc.encode(eng))
    hin_tokens = len(enc.encode(hin))
    ratio = hin_tokens / eng_tokens
    print(f"{eng:<35} {eng_tokens:>6}  |  {hin:<35} {hin_tokens:>6}  |  {ratio:.1f}x")

ENGLISH vs HINDI: SAME MEANING, DIFFERENT TOKEN COST
English                             Tokens  |  Hindi                               Tokens  |  Ratio
--------------------------------------------------------------------------------------------------------------
Hello, how are you?                      6  |  नमस्ते, आप कैसे हैं?                     9  |  1.5x
Artificial Intelligence                  2  |  कृत्रिम बुद्धिमत्ता                      7  |  3.5x
machine learning                         2  |  मशीन लर्निंग                             6  |  3.0x
The weather is nice today                5  |  आज मौसम अच्छा है                         4  |  0.8x
What is your name?                       5  |  आपका नाम क्या है?                        6  |  1.2x


### Let's see exactly how Hindi gets fragmented

In [8]:
print("=" * 60)
print("TOKEN-LEVEL BREAKDOWN: ENGLISH vs HINDI")
print("=" * 60)

print("\n--- English ---")
show_tokens("Artificial Intelligence")

print("--- Hindi ---")
show_tokens("कृत्रिम बुद्धिमत्ता")

print("Notice: English words map to 1-2 tokens each.")
print("Hindi words get split into many small subword fragments.")

TOKEN-LEVEL BREAKDOWN: ENGLISH vs HINDI

--- English ---
Text:       'Artificial Intelligence'
Token IDs:  [186671, 42378]
Tokens:     ['Artificial', ' Intelligence']
Count:      2 tokens

--- Hindi ---
Text:       'कृत्रिम बुद्धिमत्ता'
Token IDs:  [1016, 10433, 11553, 15427, 155440, 15427, 75494]
Tokens:     ['क', 'ृ', 'त्र', 'िम', ' बुद्ध', 'िम', 'त्ता']
Count:      7 tokens

Notice: English words map to 1-2 tokens each.
Hindi words get split into many small subword fragments.
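
Part of the gap is visible before any tokenizer runs. In UTF-8, every ASCII letter occupies one byte, while every Devanagari code point occupies three, so a byte-level BPE starts from a longer sequence for Hindi and needs more merges to reach the same compactness. A quick check using only the standard library (no tokenizer involved):

```python
english = "Artificial Intelligence"
hindi = "कृत्रिम बुद्धिमत्ता"

for label, text in [("English", english), ("Hindi", hindi)]:
    n_chars = len(text)                  # Unicode code points
    n_bytes = len(text.encode("utf-8"))  # raw bytes a byte-level BPE starts from
    print(f"{label:<8} {n_chars:>3} code points  {n_bytes:>3} UTF-8 bytes")

# A single Devanagari letter is three bytes in UTF-8.
print(len("क".encode("utf-8")))  # 3
```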


### Comparing across multiple Indian languages

In [9]:
# "Artificial Intelligence" in multiple Indian languages
translations = {
    "English":    "Artificial Intelligence",
    "Hindi":      "कृत्रिम बुद्धिमत्ता",
    "Tamil":      "செயற்கை நுண்ணறிவு",
    "Telugu":     "కృత్రిమ మేధస్సు",
    "Kannada":    "ಕೃತಕ ಬುದ್ಧಿಮತ್ತೆ",
    "Malayalam":  "നിർമിത ബുദ്ധി",
    "Bengali":    "কৃত্রিম বুদ্ধিমত্তা",
    "Marathi":    "कृत्रिम बुद्धिमत्ता",
}

print("=" * 60)
print("TOKEN COUNT: 'Artificial Intelligence' ACROSS LANGUAGES")
print("=" * 60)

eng_count = len(enc.encode(translations["English"]))

for lang, text in translations.items():
    count = len(enc.encode(text))
    bar = "█" * count
    ratio = count / eng_count
    print(f"{lang:<12} {count:>3} tokens  {bar}  ({ratio:.1f}x vs English)")

print(f"\nEnglish baseline: {eng_count} tokens")
print("Every extra token = extra cost + latency + context consumed.")

TOKEN COUNT: 'Artificial Intelligence' ACROSS LANGUAGES
English        2 tokens  ██  (1.0x vs English)
Hindi          7 tokens  ███████  (3.5x vs English)
Tamil          9 tokens  █████████  (4.5x vs English)
Telugu         9 tokens  █████████  (4.5x vs English)
Kannada        8 tokens  ████████  (4.0x vs English)
Malayalam      7 tokens  ███████  (3.5x vs English)
Bengali        9 tokens  █████████  (4.5x vs English)
Marathi        7 tokens  ███████  (3.5x vs English)

English baseline: 2 tokens
Every extra token = extra cost + latency + context consumed.


### Why does this happen?

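Tokenizers use **Byte Pair Encoding (BPE)**: starting from individual bytes or characters, they repeatedly merge the most frequent adjacent pair in a training corpus into a new token. If the corpus is 90%+ English, then:
- English subwords like `tion`, `ing`, `the` get dedicated tokens
- Hindi sequences earn fewer and shorter merges, so a word breaks into many small fragments (as in the 7-token split shown earlier). With smaller or older vocabularies, a single character can even fall back to several raw UTF-8 byte tokens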

This is not a bug — it's a **data distribution** consequence.
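
The merge procedure described above can be sketched in a few lines. This is a character-level toy for intuition only, not how `tiktoken` is implemented (production tokenizers work on bytes and pre-split text with a regex); the function name `train_bpe` and the tiny corpus are made up for the illustration.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

# A corpus dominated by English, with one Hindi word.
words = ["singing"] * 50 + ["ringing"] * 50 + ["नमस्ते"]
merges, corpus = train_bpe(words, num_merges=3)
for symbols, freq in corpus.items():
    print(freq, symbols)
# The English words collapse to ('s', 'inging') and ('r', 'inging');
# the Hindi word is still one symbol per code point.
```

With a fixed merge budget, the frequent English pairs win every round, so the rare script never gets a merge. That is the data-distribution effect in miniature.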

In [10]:
# Demonstrate: common English subwords get single tokens
print("=" * 60)
print("WHY ENGLISH IS EFFICIENT: COMMON SUBWORDS = SINGLE TOKENS")
print("=" * 60)

english_subwords = ["the", "tion", "ing", "ment", "ness", "able"]
for sw in english_subwords:
    count = len(enc.encode(sw))
    print(f"  '{sw}' → {count} token(s)")

print("\nThese are so frequent in training data that BPE merges them early.")
print("Hindi/Tamil/Telugu sequences are far rarer in the training data, so they earn fewer merges.")

WHY ENGLISH IS EFFICIENT: COMMON SUBWORDS = SINGLE TOKENS
  'the' → 1 token(s)
  'tion' → 1 token(s)
  'ing' → 1 token(s)
  'ment' → 1 token(s)
  'ness' → 1 token(s)
  'able' → 1 token(s)

These are so frequent in training data that BPE merges them early.
Hindi/Tamil/Telugu sequences are far rarer in the training data, so they earn fewer merges.


### Key Takeaway

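> An English-centric tokenizer can need several times more tokens for the same meaning in Indian languages. In the examples above, the overhead ranged from near parity on everyday Hindi sentences (0.8-1.5x) to 3.5-4.5x for the term "Artificial Intelligence". Whatever the ratio is for your text, it directly translates to:
> - **Proportionally higher API costs** for the same content
> - **Proportionally faster context window exhaustion**
> - **Slower inference** (more tokens to process)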
>
> This is why multilingual models (like IndicBERT, MuRIL, or models with expanded vocabularies) exist — they add dedicated tokens for non-English scripts, dramatically reducing this overhead.
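
To make the context-window consequence concrete, the arithmetic can be sketched as follows. The ratios are the ones observed in this notebook's outputs; the function name `english_equivalent_capacity` is hypothetical, and the calculation assumes the overhead ratio is uniform across the whole prompt.

```python
def english_equivalent_capacity(context_tokens: int, overhead_ratio: float) -> float:
    """Content capacity of a window, measured in 'English-sized' tokens."""
    return context_tokens / overhead_ratio

# Ratios taken from the outputs above: parity, Hindi sentences, Hindi and Tamil terms.
for ratio in (1.0, 1.5, 3.5, 4.5):
    capacity = english_equivalent_capacity(128_000, ratio)
    print(f"{ratio:.1f}x overhead -> {capacity:,.0f} English-equivalent tokens in a 128K window")
```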

---

## 3. Cost & Context Windows: The Business Side of Tokens

Every API call to an LLM is priced **per token**. Understanding token economics is critical for building production applications.

### Key concepts:
- **Input tokens**: What you send to the model (prompt + context)
- **Output tokens**: What the model generates (completion)
- **Context window**: Maximum total tokens (input + output) the model can handle at once
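
Since input and output share one budget under the definition above, a request can be sanity-checked before it is sent. A minimal sketch; the function names are illustrative, and it treats the window as a single shared limit, ignoring any separate output cap a provider may impose.

```python
def fits_in_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved output budget fits in the window."""
    return input_tokens + max_output_tokens <= context_window

def max_input_budget(max_output_tokens: int, context_window: int) -> int:
    """Largest prompt size that still leaves room for the requested output."""
    return max(0, context_window - max_output_tokens)

print(fits_in_context(120_000, 4_000, 128_000))  # True
print(fits_in_context(126_000, 4_000, 128_000))  # False
print(max_input_budget(4_000, 128_000))          # 124000
```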

In [11]:
# Current pricing (as of early 2025 — verify latest at provider sites)
pricing = {
    "GPT-4o": {
        "input": 2.50,     # per 1M input tokens
        "output": 10.00,   # per 1M output tokens
        "context": 128_000,
    },
    "GPT-4o-mini": {
        "input": 0.15,
        "output": 0.60,
        "context": 128_000,
    },
    "Claude 3.5 Sonnet": {
        "input": 3.00,
        "output": 15.00,
        "context": 200_000,
    },
    "Claude 3.5 Haiku": {
        "input": 0.80,
        "output": 4.00,
        "context": 200_000,
    },
    "Gemini 2.0 Flash": {
        "input": 0.10,
        "output": 0.40,
        "context": 1_000_000,
    },
}

print("=" * 80)
print("LLM PRICING COMPARISON (per 1M tokens, USD)")
print("=" * 80)
print(f"{'Model':<22} {'Input':>8} {'Output':>8} {'Context Window':>16}")
print("-" * 80)
for model, info in pricing.items():
    print(f"{model:<22} ${info['input']:>7.2f} ${info['output']:>7.2f} {info['context']:>14,} tokens")

LLM PRICING COMPARISON (per 1M tokens, USD)
Model                     Input   Output   Context Window
--------------------------------------------------------------------------------
GPT-4o                 $   2.50 $  10.00        128,000 tokens
GPT-4o-mini            $   0.15 $   0.60        128,000 tokens
Claude 3.5 Sonnet      $   3.00 $  15.00        200,000 tokens
Claude 3.5 Haiku       $   0.80 $   4.00        200,000 tokens
Gemini 2.0 Flash       $   0.10 $   0.40      1,000,000 tokens


### Real-world cost calculator

Let's calculate actual costs for common use cases.

In [12]:
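def estimate_cost(input_text, output_tokens_est, model_name="GPT-4o"):
    """Estimate API cost for a given input text and expected output length.

    Token counts come from the GPT-4o tokenizer (`enc`), so figures for
    other providers' models are approximations.
    """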
    input_tokens = len(enc.encode(input_text))
    model = pricing[model_name]
    
    input_cost = (input_tokens / 1_000_000) * model["input"]
    output_cost = (output_tokens_est / 1_000_000) * model["output"]
    total = input_cost + output_cost
    
    return {
        "model": model_name,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens_est,
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": total,
    }


# Scenario: Summarize a 5-page document (~2,500 words; ≈ 3,300 tokens by the 0.75 words/token rule of thumb)
sample_doc = "Lorem ipsum dolor sit amet. " * 500  # 2,500 words; ~3,000 tokens with this tokenizer
output_est = 200  # summary

print("=" * 70)
print("SCENARIO: Summarize a 5-page document (expect ~200 token summary)")
print("=" * 70)

for model_name in pricing:
    result = estimate_cost(sample_doc, output_est, model_name)
    print(f"\n{model_name}:")
    print(f"  Input:  {result['input_tokens']:,} tokens → ${result['input_cost']:.4f}")
    print(f"  Output: {result['output_tokens']:,} tokens → ${result['output_cost']:.4f}")
    print(f"  Total:  ${result['total_cost']:.4f}")

SCENARIO: Summarize a 5-page document (expect ~200 token summary)

GPT-4o:
  Input:  3,001 tokens → $0.0075
  Output: 200 tokens → $0.0020
  Total:  $0.0095

GPT-4o-mini:
  Input:  3,001 tokens → $0.0005
  Output: 200 tokens → $0.0001
  Total:  $0.0006

Claude 3.5 Sonnet:
  Input:  3,001 tokens → $0.0090
  Output: 200 tokens → $0.0030
  Total:  $0.0120

Claude 3.5 Haiku:
  Input:  3,001 tokens → $0.0024
  Output: 200 tokens → $0.0008
  Total:  $0.0032

Gemini 2.0 Flash:
  Input:  3,001 tokens → $0.0003
  Output: 200 tokens → $0.0001
  Total:  $0.0004


### Scaling costs: What happens at 10,000 requests/day?

In [13]:
daily_requests = 10_000
avg_input_tokens = 1_000
avg_output_tokens = 500

print("=" * 70)
print(f"DAILY COST AT {daily_requests:,} REQUESTS/DAY")
print(f"(avg {avg_input_tokens:,} input + {avg_output_tokens:,} output tokens per request)")
print("=" * 70)

print(f"\n{'Model':<22} {'Daily':>10} {'Monthly':>12} {'Yearly':>12}")
print("-" * 60)

for model_name, info in pricing.items():
    daily_input_cost = (daily_requests * avg_input_tokens / 1_000_000) * info["input"]
    daily_output_cost = (daily_requests * avg_output_tokens / 1_000_000) * info["output"]
    daily_total = daily_input_cost + daily_output_cost
    monthly = daily_total * 30
    yearly = daily_total * 365
    print(f"{model_name:<22} ${daily_total:>8.2f} ${monthly:>10.2f} ${yearly:>10.2f}")

print("\nModel choice at scale can mean the difference between about $1,100/yr and $38,000+/yr.")

DAILY COST AT 10,000 REQUESTS/DAY
(avg 1,000 input + 500 output tokens per request)

Model                       Daily      Monthly       Yearly
------------------------------------------------------------
GPT-4o                 $   75.00 $   2250.00 $  27375.00
GPT-4o-mini            $    4.50 $    135.00 $   1642.50
Claude 3.5 Sonnet      $  105.00 $   3150.00 $  38325.00
Claude 3.5 Haiku       $   28.00 $    840.00 $  10220.00
Gemini 2.0 Flash       $    3.00 $     90.00 $   1095.00

Model choice at scale can mean the difference between about $1,100/yr and $38,000+/yr.


### Context Window: What does 128K tokens actually look like?

In [14]:
print("=" * 60)
print("CONTEXT WINDOW SIZE — WHAT FITS INSIDE?")
print("=" * 60)

# Rough estimates: 1 token ≈ 0.75 English words, 1 page ≈ 250 words
estimates = {
    "4K tokens (GPT-3.5 original)": 4_000,
    "8K tokens": 8_000,
    "32K tokens": 32_000,
    "128K tokens (GPT-4o)": 128_000,
    "200K tokens (Claude 3.5)": 200_000,
    "1M tokens (Gemini 2.0)": 1_000_000,
}

print(f"\n{'Context Size':<35} {'≈ Words':>10} {'≈ Pages':>10} {'≈ Books':>10}")
print("-" * 70)
for label, tokens in estimates.items():
    words = int(tokens * 0.75)
    pages = words / 250
    books = pages / 300  # avg novel ≈ 300 pages
    print(f"{label:<35} {words:>10,} {pages:>10.0f} {books:>10.1f}")

print("\nContext window = the model's SHORT-TERM MEMORY.")
print("Everything beyond this limit is invisible to the model.")

CONTEXT WINDOW SIZE — WHAT FITS INSIDE?

Context Size                           ≈ Words    ≈ Pages    ≈ Books
----------------------------------------------------------------------
4K tokens (GPT-3.5 original)             3,000         12        0.0
8K tokens                                6,000         24        0.1
32K tokens                              24,000         96        0.3
128K tokens (GPT-4o)                    96,000        384        1.3
200K tokens (Claude 3.5)               150,000        600        2.0
1M tokens (Gemini 2.0)                 750,000       3000       10.0

Context window = the model's SHORT-TERM MEMORY.
Everything beyond this limit is invisible to the model.


### The multilingual cost penalty in practice

In [15]:
print("=" * 70)
print("MULTILINGUAL COST PENALTY: ENGLISH vs HINDI CUSTOMER SUPPORT BOT")
print("=" * 70)

# Simulating a customer support scenario
english_query = """Hello, I purchased a laptop from your store last week and the 
screen has started flickering. I would like to request a replacement or refund. 
My order number is 12345. Please help me resolve this issue as soon as possible."""

hindi_query = """नमस्ते, मैंने पिछले हफ्ते आपकी दुकान से एक लैपटॉप खरीदा था और 
स्क्रीन टिमटिमाने लगी है। मैं प्रतिस्थापन या धनवापसी का अनुरोध करना चाहता हूं। 
मेरा ऑर्डर नंबर 12345 है। कृपया इस समस्या को जल्द से जल्द हल करने में मेरी मदद करें।"""

eng_tokens = len(enc.encode(english_query))
hin_tokens = len(enc.encode(hindi_query))

print(f"\nEnglish query: {eng_tokens} tokens")
print(f"Hindi query:   {hin_tokens} tokens (same meaning!)")
print(f"Ratio:         {hin_tokens/eng_tokens:.1f}x")

# Cost at 100K queries/month using GPT-4o
monthly_queries = 100_000
avg_output = 300  # tokens; assumed identical for both languages (a Hindi reply would likely use more)
model = pricing["GPT-4o"]

eng_monthly = (monthly_queries * eng_tokens / 1_000_000) * model["input"] + \
              (monthly_queries * avg_output / 1_000_000) * model["output"]
hin_monthly = (monthly_queries * hin_tokens / 1_000_000) * model["input"] + \
              (monthly_queries * avg_output / 1_000_000) * model["output"]

print(f"\nAt {monthly_queries:,} queries/month (GPT-4o):")
print(f"  English: ${eng_monthly:,.2f}/month")
print(f"  Hindi:   ${hin_monthly:,.2f}/month")
print(f"  Extra cost for Hindi: ${hin_monthly - eng_monthly:,.2f}/month")
print(f"\nSame product, same meaning — Hindi users cost more to serve.")

MULTILINGUAL COST PENALTY: ENGLISH vs HINDI CUSTOMER SUPPORT BOT

English query: 50 tokens
Hindi query:   76 tokens (same meaning!)
Ratio:         1.5x

At 100,000 queries/month (GPT-4o):
  English: $312.50/month
  Hindi:   $319.00/month
  Extra cost for Hindi: $6.50/month

Same product, same meaning — Hindi users cost more to serve.
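
The comparison above holds the reply length fixed at 300 tokens for both languages, so only the input side grows. If the Hindi reply inflates by the same ratio as the Hindi query (an assumption, not something measured here), the gap widens considerably because output tokens are the expensive side. A sketch using the token counts and GPT-4o prices from this notebook:

```python
# Figures copied from the notebook output; GPT-4o prices per 1M tokens.
ENG_INPUT, HIN_INPUT = 50, 76
ENG_OUTPUT = 300
INPUT_PRICE, OUTPUT_PRICE = 2.50, 10.00
MONTHLY_QUERIES = 100_000

# Assumption: Hindi replies inflate by the same ratio as Hindi queries.
ratio = HIN_INPUT / ENG_INPUT
HIN_OUTPUT = ENG_OUTPUT * ratio

def monthly_cost(input_tokens: float, output_tokens: float) -> float:
    """Monthly USD cost for MONTHLY_QUERIES requests of the given size."""
    input_cost = MONTHLY_QUERIES * input_tokens / 1_000_000 * INPUT_PRICE
    output_cost = MONTHLY_QUERIES * output_tokens / 1_000_000 * OUTPUT_PRICE
    return input_cost + output_cost

eng = monthly_cost(ENG_INPUT, ENG_OUTPUT)
hin = monthly_cost(HIN_INPUT, HIN_OUTPUT)
print(f"English: ${eng:,.2f}/month")
print(f"Hindi:   ${hin:,.2f}/month")
print(f"Extra:   ${hin - eng:,.2f}/month")
```

Under this assumption the extra cost comes to roughly $162.50 per month instead of $6.50, so the output side dominates the penalty.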


### Key Takeaway

> Tokens are the fundamental currency of LLMs. They determine:
> - **What you pay** — input/output tokens are billed separately, output is more expensive
> - **What the model can see** — the context window is a hard memory limit, not a soft suggestion
> - **Who gets penalized** — non-English users consume more tokens for the same information, paying more and fitting less into context
>
> Smart token management (choosing the right model, compressing prompts, using cheaper models for simple tasks) is a core production engineering skill.

---

## Bonus: Build Your Intuition

Use the interactive cell below to tokenize any text and see the breakdown.

In [16]:
# Try your own text!
your_text = "Replace this with any text you want to tokenize"

print("=" * 60)
print("YOUR CUSTOM TOKENIZATION")
print("=" * 60)
tokens, ids = show_tokens(your_text)

# Cost estimate across models
n = len(ids)
print("Cost to process this as INPUT across models:")
for model_name, info in pricing.items():
    cost = (n / 1_000_000) * info["input"]
    print(f"  {model_name:<22} ${cost:.6f}")

YOUR CUSTOM TOKENIZATION
Text:       'Replace this with any text you want to tokenize'
Token IDs:  [37050, 495, 483, 1062, 2201, 481, 1682, 316, 192720]
Tokens:     ['Replace', ' this', ' with', ' any', ' text', ' you', ' want', ' to', ' tokenize']
Count:      9 tokens

Cost to process this as INPUT across models:
  GPT-4o                 $0.000023
  GPT-4o-mini            $0.000001
  Claude 3.5 Sonnet      $0.000027
  Claude 3.5 Haiku       $0.000007
  Gemini 2.0 Flash       $0.000001


---

## Summary

| Concept | Why It Matters |
|---|---|
| **Strawberry Problem** | LLMs see tokens, not characters — letter-level tasks are inherently difficult |
| **Multilingual Efficiency** | English-centric tokenizers penalize other languages, with up to ~4.5x token overhead in the examples here |
| **Cost & Context** | Tokens = money. Context window = memory. Both are hard limits you must design around |

### Practical implications for engineers:
- **Prompt compression** matters at scale — fewer tokens = lower cost
- **Model selection** should factor in tokenizer efficiency for your language
- **Context management** (chunking, summarization, RAG) is essential for long documents
- **Character-level tasks** should be offloaded to code, not LLMs
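
As one illustration of the context-management point, a long document can be split into chunks that each stay under a token budget. This sketch estimates tokens from whitespace-separated words using the 0.75 words-per-token rule of thumb from earlier in the notebook, which is an English-only approximation; for real use, count with the actual tokenizer. The function name `chunk_by_token_budget` is hypothetical.

```python
def chunk_by_token_budget(text: str, max_tokens: int, words_per_token: float = 0.75) -> list[str]:
    """Split text into word-aligned chunks that stay under an estimated token budget."""
    words = text.split()
    # Convert the token budget into an approximate word budget (at least one word).
    words_per_chunk = max(1, int(max_tokens * words_per_token))
    return [
        " ".join(words[i : i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

doc = "token " * 1000
chunks = chunk_by_token_budget(doc, max_tokens=100)
print(len(chunks))             # 14
print(len(chunks[0].split()))  # 75
```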