# Day 2: Advanced Tokenization with tiktoken and tokenizers - Practical Exercises

This notebook contains hands-on exercises and implementations for Day 2 of the LLM learning journey.

## Learning Objectives
- Implement tokenization using tiktoken and Hugging Face tokenizers
- Compare different tokenization libraries and their performance
- Analyze vocabulary size vs sequence length trade-offs
- Create custom domain-specific tokenizers
- Benchmark tokenizer performance

## Setup and Installation

In [None]:
# Install required packages into the environment of the running kernel
%pip install tiktoken tokenizers transformers matplotlib seaborn pandas numpy

In [None]:
import tiktoken
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece
from tokenizers.trainers import BpeTrainer, WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.normalizers import Sequence, NFD, Lowercase, StripAccents
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import time
from collections import Counter

# Set style for plots
# The 'seaborn-v0_8' style name requires matplotlib >= 3.6 (older releases call it 'seaborn')
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

## 1. tiktoken: OpenAI's Fast Tokenizer

Let's start by exploring tiktoken and its different encodings.

In [None]:
# Get different tiktoken encodings
gpt2_enc = tiktoken.get_encoding("gpt2")
gpt4_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 and GPT-3.5-turbo
codex_enc = tiktoken.get_encoding("p50k_base")   # Codex and text-davinci-002/003

# Test text
text = "Hello, world! This is advanced tokenization with tiktoken. Let's see how it handles different types of text: code_variable, émojis 🚀, and numbers 12345."

print(f"Original text: {text}")
print(f"Text length: {len(text)} characters\n")

# Compare different encodings
encodings = {
    'GPT-2': gpt2_enc,
    'GPT-4': gpt4_enc,
    'Codex': codex_enc
}

for name, enc in encodings.items():
    tokens = enc.encode(text)
    decoded = enc.decode(tokens)
    
    print(f"{name}:")
    print(f"  Tokens: {len(tokens)}")
    print(f"  Vocab size: {enc.n_vocab:,}")
    print(f"  First 10 tokens: {tokens[:10]}")
    # Decoding one token at a time can show U+FFFD when a character spans several tokens
    print(f"  Decoded tokens: {[enc.decode([t]) for t in tokens[:10]]}")
    print(f"  Perfect reconstruction: {decoded == text}")
    print()
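
The loop above decodes tokens one at a time with `enc.decode([t])`. These encodings are byte-level BPE, so a single character such as 🚀 can be spread over several tokens, and decoding each token on its own may then print the U+FFFD replacement character. The sketch below shows the effect using only the standard library. The two-way byte split is a hypothetical chosen for illustration, not the split any particular encoding produces. To inspect the real bytes behind a token, tiktoken provides `enc.decode_single_token_bytes(token)`.

```python
# Standard-library illustration only; no tiktoken calls are made here.
rocket = "🚀"
raw = rocket.encode("utf-8")  # one character, four UTF-8 bytes
print(f"{len(rocket)} character, {len(raw)} bytes: {raw.hex(' ')}")

# Hypothetical split: pretend a tokenizer placed the first two bytes
# in one token and the last two bytes in another.
pieces = [raw[:2], raw[2:]]

# Decoding each piece alone cannot produce the emoji; invalid or
# incomplete byte sequences become U+FFFD replacement characters.
lossy = [p.decode("utf-8", errors="replace") for p in pieces]
print(f"Decoded separately: {lossy}")

# Joining the bytes before decoding restores the original character.
restored = b"".join(pieces).decode("utf-8")
print(f"Decoded together: {restored} (matches: {restored == rocket})")
```

This is why the "Perfect reconstruction" check should compare the decoding of the full token list, as the loop above does, rather than a concatenation of per-token strings.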