# Tokenization Playground
Explore how text is broken into tokens for different LLMs and how context‑window limits affect prompts.

Run the setup cell below in **Google Colab** (or any Jupyter environment that allows `pip`).

In [None]:
!pip -q install tiktoken transformers
import tiktoken
from transformers import AutoTokenizer

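def count_openai_tokens(text, model='gpt-4o-mini'):
    """Return (token_ids, count) for `text` using the tiktoken encoding for `model`."""
    enc = tiktoken.encoding_for_model(model)  # raises KeyError for unknown model names
    tokens = enc.encode(text)
    return tokens, len(tokens)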

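def count_hf_tokens(text, model_name='gpt2'):
    """Return (token_ids, count) for `text` using a Hugging Face tokenizer."""
    tok = AutoTokenizer.from_pretrained(model_name)  # downloads on first use
    tokens = tok.encode(text)  # some tokenizers add special tokens here
    return tokens, len(tokens)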

print('✅ Setup complete')

## 🔢  Count tokens for any text

In [None]:
sample = "The quick brown fox jumps over the lazy dog."
tokens, n = count_openai_tokens(sample)
print(sample, '\n→', n, 'tokens:', tokens)

In [None]:
# ✏️ TRY IT: Replace the text below and re‑run
your_text = "Replace me with any paragraph …"
model = 'gpt-4o-mini'
tokens, n = count_openai_tokens(your_text, model)
print(f"{n} tokens for model {model}\n", tokens)

## 🧮  Visualize vs. context window
Below we compare your text length to a model's maximum context window.

In [None]:
def percent_of_window(n_tokens, window=128_000):
    # Share of the context window used, as a percentage capped at 100.
    pct = (n_tokens / window) * 100
    return min(pct, 100)

# Example for gpt-4o-mini, the default model above (128k context)
_, n = count_openai_tokens(your_text)
print(f'{n} tokens is {percent_of_window(n, 128_000):.2f}% of a 128k window')
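
The percentage above can also be drawn as a simple text bar. The cell below is a minimal sketch using only the standard library; the helper name `usage_bar` and the `width` parameter are illustrative assumptions, not part of `tiktoken` or `transformers`.

```python
def usage_bar(n_tokens, window=128_000, width=40):
    """Return a one-line text bar showing how much of the window is used."""
    if window <= 0:
        raise ValueError("window must be positive")
    # Clamp to [0, 1] so an over-long prompt still draws a full bar.
    fraction = min(max(n_tokens / window, 0.0), 1.0)
    filled = round(fraction * width)
    bar = "#" * filled + "." * (width - filled)
    return f"[{bar}] {fraction * 100:.1f}% of {window:,} tokens"

print(usage_bar(32_000))   # a quarter of a 128k window
print(usage_bar(200_000))  # longer than the window, so the bar is full
```

Because the bar is clamped, it cannot show by how much a prompt overflows the window; compare `n_tokens` with `window` directly if you need that.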

---
### Further Exploration
* Try `AutoTokenizer` with non‑English text.
* Examine how sub‑words split: e.g., `'cats'` vs `'cat' + 's'`.
* Investigate token costs on the OpenAI pricing page.
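
To build intuition for the sub-word bullet before trying a real tokenizer, here is a toy sketch of greedy longest-match splitting. It is not how `tiktoken` or the GPT-2 tokenizer works internally (those use byte-pair-encoding merges learned from data); the function `greedy_subword_split` and the vocabulary `TOY_VOCAB` are hypothetical and exist only for illustration.

```python
def greedy_subword_split(word, vocab):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

# Hypothetical toy vocabulary, not taken from any real model.
TOY_VOCAB = {"cat", "s", "dog", "token", "ization"}

print(greedy_subword_split("cats", TOY_VOCAB))             # 'cats' is missing, so it splits
print(greedy_subword_split("cats", TOY_VOCAB | {"cats"}))  # once present, it stays whole
print(greedy_subword_split("tokenization", TOY_VOCAB))
```

The point to take away is that whether `'cats'` is one token or two depends entirely on the vocabulary, which is why the same text yields different counts across models.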