# Week 2: Tokenization & ONNX Runtime Internals
### BenchRight LLM Evaluation Master Program (18 Weeks)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

1. Understand what tokens are and how tokenization works
2. Analyze how token count affects inference latency
3. Enable ONNX Runtime profiling to inspect operator-level timings
4. Recognize how tokenization choices can mislead evaluation metrics

---

## 🧠 What is Tokenization? (Feynman Explanation)

Imagine you're teaching a robot to read. The robot doesn't understand words; it only understands numbers. So before the robot can read a sentence, we need to break it into smaller pieces and assign each piece a number.

**Tokenization** is this process:
1. Take a sentence: `"Hello world"`
2. Break it into pieces: `["Hello", " world"]`
3. Convert pieces to numbers: `[15496, 995]`

**Key insight:** The number of tokens determines how much work the model does. More tokens means more computation, which in general means higher latency.
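
The three steps above can be sketched with a toy greedy longest-match tokenizer. This is an illustration only: `TOY_VOCAB`, `UNKNOWN_ID`, and `toy_tokenize` are hypothetical names, and GPT-2's real tokenizer applies byte-level BPE merges rather than this simple lookup. The two IDs for `"Hello"` and `" world"` are the ones quoted in the steps above.

```python
# Toy vocabulary: only the two pieces quoted in the text above.
TOY_VOCAB = {"Hello": 15496, " world": 995}
UNKNOWN_ID = -1  # placeholder ID for characters missing from the toy vocabulary


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text greedily into the longest pieces found in vocab."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a piece matches.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                pieces.append(text[i:end])
                i = end
                break
        else:
            # No vocabulary entry matches: emit the single character as-is.
            pieces.append(text[i])
            i += 1
    ids = [vocab.get(piece, UNKNOWN_ID) for piece in pieces]
    return pieces, ids


pieces, ids = toy_tokenize("Hello world")
print(pieces)  # ['Hello', ' world']
print(ids)     # [15496, 995]
```

Note that the space belongs to the second piece: the tokenizer works on the raw character stream, not on whitespace-separated words.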

---

## 🛠️ Step 1: Setup & Install Dependencies

In [None]:
# Install required packages
!pip install onnxruntime transformers pandas numpy

In [None]:
# Import libraries
import onnxruntime as ort
from transformers import AutoTokenizer
import pandas as pd
import numpy as np
import time
import json
import os

print(f"✅ ONNX Runtime version: {ort.__version__}")

---

## 📝 Step 2: Tokenization Analysis Function

### What this function does:
- Takes a list of prompts
- For each prompt, shows the token IDs and how many tokens it creates
- This helps us understand why some prompts are "heavier" than others

In [None]:
def print_tokenized_info(prompts: list, tokenizer):
    """
    Print tokenization details for a list of prompts.
    
    For each prompt, displays:
    - Original text
    - Token IDs
    - Decoded tokens (to see how text was split)
    - Total token count
    """
    results = []
    
    print("=" * 80)
    print("TOKENIZATION ANALYSIS")
    print("=" * 80)
    
    for prompt in prompts:
        # Tokenize the prompt
        encoding = tokenizer(prompt, return_tensors="np")
        token_ids = encoding["input_ids"][0].tolist()
        
        # Decode each token individually to see the pieces
        decoded_tokens = [tokenizer.decode([tid]) for tid in token_ids]
        
        # Print details
        print(f"\n📝 Prompt: \"{prompt}\"")
        print(f"   Char count: {len(prompt)}")
        print(f"   Token count: {len(token_ids)}")
        print(f"   Token IDs: {token_ids}")
        print(f"   Tokens: {decoded_tokens}")
        
        results.append({
            "prompt": prompt,
            "char_count": len(prompt),
            "token_count": len(token_ids),
            "token_ids": token_ids
        })
    
    print("\n" + "=" * 80)
    return results

### Let's analyze some prompts!

We'll look at how different types of text tokenize:
- Simple words
- Technical terms
- Numbers and symbols
- Long uncommon words

In [None]:
# Load the GPT-2 tokenizer (byte-level BPE); use the tokenizer that matches your ONNX model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Define prompts with different characteristics
test_prompts = [
    "Hello world",                                          # Simple, common words
    "Supercalifragilisticexpialidocious",                   # Long uncommon word
    "GPT-4, BERT, and T5 are transformer models.",          # Technical terms
    "The price is $1,234.56 USD.",                          # Numbers and symbols
    "The cat sat on the mat.",                              # Common short words
]

# Analyze tokenization
token_results = print_tokenized_info(test_prompts, tokenizer)

### 💡 What did we learn?

Notice how:
- Common words like "Hello" and " world" are single tokens (GPT-2 folds the leading space into the token)
- Rare words get split into multiple pieces
- Numbers and special characters often become separate tokens
- Token count can vary dramatically for similar-length text
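
One way to put a number on the last point is the characters-per-token ratio. The sketch below assumes records shaped like the dictionaries returned by `print_tokenized_info`. The helper name `chars_per_token` is hypothetical, and the second sample record is made up purely to show a low ratio.

```python
def chars_per_token(records):
    """Return {prompt: characters per token} for tokenization records."""
    ratios = {}
    for rec in records:
        # Guard against empty prompts, which produce zero tokens.
        if rec["token_count"] == 0:
            ratios[rec["prompt"]] = 0.0
        else:
            ratios[rec["prompt"]] = round(rec["char_count"] / rec["token_count"], 2)
    return ratios


sample_records = [
    # Values for "Hello world" follow the example earlier in this notebook.
    {"prompt": "Hello world", "char_count": 11, "token_count": 2,
     "token_ids": [15496, 995]},
    # Made-up record: a rare 12-character string split into 6 pieces.
    {"prompt": "zyxwvutsrqpo", "char_count": 12, "token_count": 6,
     "token_ids": [1, 2, 3, 4, 5, 6]},
]

for prompt, ratio in chars_per_token(sample_records).items():
    print(f"{prompt!r}: {ratio} chars/token")
```

In the notebook, calling `chars_per_token(token_results)` on the output of the analysis cell gives the real ratios for the test prompts.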

---

## ⏱️ Step 3: Measure Latency vs Token Count

### Goal:
Show that, for a given model and hardware, latency tracks **token count** rather than character count.

### Why this matters:
When comparing model performance, you must consider token count. Two prompts with the same number of characters can have very different latencies if they tokenize differently.
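
To see why token count drives compute, here is a rough back-of-the-envelope estimate of forward-pass FLOPs for a decoder-only transformer. It is a sketch under stated assumptions, not a measurement. It counts about 24 * n * d^2 FLOPs per layer for the projection and MLP matrix multiplications (assuming a 4x MLP expansion) plus about 4 * n^2 * d for attention, and it ignores embeddings, layer norms, softmax, and the output head. The defaults `d_model=768` and `n_layers=12` match GPT-2 small; the function name is hypothetical.

```python
def estimate_forward_flops(n_tokens, d_model=768, n_layers=12):
    """Rough forward-pass FLOPs for a prompt of n_tokens (no KV cache)."""
    # Linear layers: ~12 * d_model^2 weights per layer, 2 FLOPs per weight per token.
    linear = 24 * n_tokens * d_model ** 2
    # Attention: QK^T scores plus the weighted sum over values, 2 * n^2 * d FLOPs each.
    attention = 4 * n_tokens ** 2 * d_model
    return n_layers * (linear + attention)


for n in (8, 16, 32):
    print(f"{n:>3} tokens -> ~{estimate_forward_flops(n) / 1e9:.2f} GFLOPs")
```

Measured latency will not scale exactly like this estimate, because every call also pays a fixed overhead that does not depend on prompt length.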

In [None]:
# First, load the ONNX model
# Note: You need to upload your ONNX model to Colab.
# To get a model:
#   1. Use a pre-exported ONNX model from Hugging Face Hub
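#   2. Or export your own using: python -m transformers.onnx --model=gpt2 onnx_model/
#      (deprecated in newer transformers releases; with the optimum package installed,
#       try: optimum-cli export onnx --model gpt2 onnx_model/)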
#   3. Upload the .onnx file to Colab using the file browser on the left
model_path = "/tmp/tinygpt.onnx"

# Create inference session
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
print("✅ Model loaded successfully!")

In [None]:
def measure_latency_vs_tokens(prompts: list, session, tokenizer, num_runs: int = 5):
    """
    Measure inference latency for prompts and correlate with token count.
    
    Args:
        prompts: List of text prompts to test
        session: ONNX Runtime InferenceSession
        tokenizer: Hugging Face tokenizer
        num_runs: Number of runs per prompt for averaging
    
    Returns:
        DataFrame with latency measurements and token counts
    """
    results = []
    
    for prompt in prompts:
        # Tokenize
        inputs = tokenizer(prompt, return_tensors="np")
        token_count = inputs["input_ids"].shape[1]
        char_count = len(prompt)
        
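        # Feed only the inputs the model declares (some exports also need attention_mask)
        input_names = {i.name for i in session.get_inputs()}
        feed = {name: value for name, value in inputs.items() if name in input_names}
        
        # Warm-up run so one-time initialization cost is not counted
        session.run(None, feed)
        
        # Measure latency over multiple runs using perf_counter for accuracy
        latencies = []
        for _ in range(num_runs):
            t0 = time.perf_counter()
            _ = session.run(None, feed)
            t1 = time.perf_counter()
            latencies.append((t1 - t0) * 1000)  # Convert to ms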
        
        mean_latency = np.mean(latencies)
        std_latency = np.std(latencies)
        
        results.append({
            "prompt": prompt[:40] + "..." if len(prompt) > 40 else prompt,
            "char_count": char_count,
            "token_count": token_count,
            "mean_latency_ms": round(mean_latency, 2),
            "std_latency_ms": round(std_latency, 2),
            "latency_per_token_ms": round(mean_latency / token_count, 3)
        })
        
        print(f"✅ Processed: '{prompt[:30]}...' - {token_count} tokens, {mean_latency:.2f}ms")
    
    return pd.DataFrame(results)

In [None]:
# Create test prompts with varying characteristics
latency_test_prompts = [
    # Short prompts
    "Hi",
    "Hello world",
    
    # Medium prompts - common words
    "The quick brown fox jumps over the lazy dog.",
    
    # Medium prompts - technical terms (more tokens)
    "GPT-4 utilizes RLHF and PPO algorithms.",
    
    # Longer prompts
    "Artificial intelligence is transforming how we work, live, and interact with technology.",
    
    # Numbers heavy (often tokenize into many pieces)
    "The values are 123, 456, 789, 101112, and 131415.",
]

# Run the latency measurement
latency_df = measure_latency_vs_tokens(latency_test_prompts, session, tokenizer, num_runs=5)

# Display results
print("\n📊 Latency vs Token Count Results:")
display(latency_df)

### 💡 What did we learn?

Look at the results table:
- **Token count correlates with latency** better than character count
- Technical terms and numbers often create more tokens
- The `latency_per_token_ms` column shows the per-token cost, which is roughly consistent once prompts are long enough that fixed per-call overhead no longer dominates

**Key takeaway:** When benchmarking, always report token counts alongside latency!
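
The first bullet can be checked numerically rather than by eye. The sketch below computes the Pearson correlation of latency with token count and with character count, then fits `latency ≈ overhead + per_token * tokens` by least squares. It uses only the standard library. The helper names are hypothetical, and the sample numbers are made up for illustration; they are not measurements.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fit_line(xs, ys):
    """Least-squares fit ys ~ intercept + slope * xs; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


# Made-up sample values (NOT measurements), shaped like columns of latency_df.
token_counts = [1, 2, 10, 12, 16, 22]
char_counts = [2, 11, 44, 39, 88, 49]
latencies_ms = [4.5, 5.0, 9.0, 10.0, 12.0, 15.0]  # built as 4 ms overhead + 0.5 ms/token

print(f"r(tokens, latency) = {pearson(token_counts, latencies_ms):.3f}")
print(f"r(chars, latency)  = {pearson(char_counts, latencies_ms):.3f}")
overhead_ms, per_token_ms = fit_line(token_counts, latencies_ms)
print(f"fixed overhead ~ {overhead_ms:.2f} ms, per-token cost ~ {per_token_ms:.2f} ms")
```

With real results, pass `latency_df["token_count"].tolist()` and `latency_df["mean_latency_ms"].tolist()` instead of the sample lists. A clearly nonzero intercept means the mean latency divided by token count will look higher for short prompts than for long ones.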

---

## 🔍 Step 4: ONNX Runtime Profiling

### What is profiling?

Profiling shows us **where time is spent** inside the model. Instead of just knowing "inference took 50ms", we can see:
- MatMul (matrix multiplication): 30ms
- Softmax: 10ms
- Add operations: 5ms
- etc.

This helps identify bottlenecks for optimization.
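
ONNX Runtime writes the profile as a JSON list of trace events. As a sketch of how a breakdown like the one above is produced, the snippet below aggregates a tiny hand-written sample. The sample events are made up (their durations mirror the illustrative numbers above), and the field layout, with `cat`, `name`, `dur` in microseconds, and `args.op_name`, is an assumption about the profile format that may differ between ONNX Runtime versions. The names `SAMPLE_PROFILE` and `kernel_time_by_op` are hypothetical.

```python
import json
from collections import defaultdict

# Hand-written sample in the trace-event style (NOT real profiler output).
SAMPLE_PROFILE = json.dumps([
    {"cat": "Session", "name": "model_run", "dur": 45500},
    {"cat": "Node", "name": "MatMul_0_kernel_time", "dur": 18000, "args": {"op_name": "MatMul"}},
    {"cat": "Node", "name": "MatMul_1_kernel_time", "dur": 12000, "args": {"op_name": "MatMul"}},
    {"cat": "Node", "name": "Softmax_0_kernel_time", "dur": 10000, "args": {"op_name": "Softmax"}},
    {"cat": "Node", "name": "Add_0_kernel_time", "dur": 5000, "args": {"op_name": "Add"}},
])


def kernel_time_by_op(profile_json):
    """Sum kernel durations (ms) per operator type, ignoring session events."""
    totals_us = defaultdict(int)
    for event in json.loads(profile_json):
        if event.get("cat") == "Node" and event.get("name", "").endswith("_kernel_time"):
            totals_us[event["args"]["op_name"]] += event["dur"]
    # Convert microseconds to milliseconds, largest first.
    return {op: us / 1000 for op, us in sorted(totals_us.items(), key=lambda kv: -kv[1])}


print(kernel_time_by_op(SAMPLE_PROFILE))
```

The session-level `model_run` event is skipped on purpose: it spans the whole call, so adding it to the per-operator totals would count the same time twice.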

In [None]:
def run_with_profiling(model_path: str, prompt: str, tokenizer):
    """
    Run inference with ONNX profiling enabled and return the profile file path.
    
    Args:
        model_path: Path to ONNX model
        prompt: Text prompt to run
        tokenizer: Hugging Face tokenizer
    
    Returns:
        Path to the profile JSON file written by ONNX Runtime
    """
    # Create session options with profiling enabled
    options = ort.SessionOptions()
    options.enable_profiling = True
    
    # Create session
    session = ort.InferenceSession(
        model_path, 
        options, 
        providers=["CPUExecutionProvider"]
    )
    
    # Tokenize input
    inputs = tokenizer(prompt, return_tensors="np")
    
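    # Run inference (this generates the profile), feeding only the inputs the model declares
    input_names = {i.name for i in session.get_inputs()}
    feed = {name: value for name, value in inputs.items() if name in input_names}
    _ = session.run(None, feed)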
    
    # Get the profile file path and end profiling
    profile_file = session.end_profiling()
    print(f"📄 Profile saved to: {profile_file}")
    
    return profile_file

In [None]:
def parse_profile_summary(profile_file: str):
    """
    Parse ONNX profile JSON and return a summary of operator timings.
    
    Args:
        profile_file: Path to the profile JSON file
    
    Returns:
        DataFrame with aggregated operator timings
    """
    # Read the profile JSON
    with open(profile_file, 'r') as f:
        profile_data = json.load(f)
    
    # Extract operator timings
    op_timings = {}
    
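    for event in profile_data:
        # Keep only per-node kernel events ('dur' is in microseconds). Session-level
        # events such as model_run span the whole call and would double-count time.
        name = event.get('name', '')
        if event.get('cat') == 'Node' and name.endswith('_kernel_time') and 'dur' in event:
            # Prefer the operator type stored in args; fall back to the name prefix
            op_type = event.get('args', {}).get('op_name') or name.split('_')[0]
            duration_us = event['dur']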
            
            if op_type not in op_timings:
                op_timings[op_type] = {'total_us': 0, 'count': 0}
            
            op_timings[op_type]['total_us'] += duration_us
            op_timings[op_type]['count'] += 1
    
    # Convert to DataFrame
    rows = []
    for op_name, stats in op_timings.items():
        rows.append({
            'operator': op_name,
            'total_time_ms': round(stats['total_us'] / 1000, 3),
            'call_count': stats['count'],
            'avg_time_ms': round(stats['total_us'] / stats['count'] / 1000, 3)
        })
    
    df = pd.DataFrame(rows)
    df = df.sort_values('total_time_ms', ascending=False)
    
    return df

In [None]:
# Run profiling on a test prompt
test_prompt = "Explain the concept of machine learning in simple terms."
profile_file = run_with_profiling(model_path, test_prompt, tokenizer)

# Parse and display the profile summary
print("\n📊 Operator Timing Summary (Top 10):")
profile_df = parse_profile_summary(profile_file)
display(profile_df.head(10))

# Calculate total time
total_time = profile_df['total_time_ms'].sum()
print(f"\n⏱️ Total profiled time: {total_time:.2f} ms")

# Show percentage breakdown
print("\n📈 Top 5 operators by time percentage:")
top5 = profile_df.head(5).copy()
top5['percentage'] = (top5['total_time_ms'] / total_time * 100).round(1)
for _, row in top5.iterrows():
    print(f"   {row['operator']}: {row['percentage']}% ({row['total_time_ms']}ms)")

### 💡 What did we learn?

The profile shows us:
- Which operators consume the most time (usually MatMul, Attention)
- How many times each operator is called
- Where to focus optimization efforts

**Key insight:** Most time is typically spent in a few heavy operators. Optimizing these (e.g., with GPU acceleration) usually gives the biggest speedups.
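
How much an optimization helps overall depends on the share of time the targeted operator accounts for (Amdahl's law). Below is a minimal worked sketch that reuses the illustrative 30 ms / 10 ms / 5 ms split from the profiling introduction. The function name and the 3x kernel speedup are assumptions chosen for the example.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-run speedup when `fraction` of the time runs `local_speedup`x faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


op_times_ms = {"MatMul": 30.0, "Softmax": 10.0, "Add": 5.0}  # illustrative values
total_ms = sum(op_times_ms.values())

for op, ms in op_times_ms.items():
    gain = overall_speedup(ms / total_ms, local_speedup=3.0)
    print(f"3x faster {op:<8} -> {gain:.2f}x overall")
```

Making MatMul 3x faster cuts the illustrative run from 45 ms to 25 ms (1.8x overall), while the same 3x gain on a small operator barely moves the total.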

---

## 🔄 Inversion Thinking: How Can Tokenization Mislead Evaluation?

Instead of asking "How does tokenization help?", let's ask:

> **"How can tokenization mislead our evaluation results?"**

### Common pitfalls:

1. **Comparing prompts by character length** - Two 100-character prompts can have very different token counts
2. **Ignoring tokenizer differences** - Different models use different tokenizers
3. **Forgetting tokenization overhead** - Tokenization itself takes time
4. **Multilingual bias** - English text often tokenizes more efficiently than other languages
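
Pitfall 4 has a simple root cause for byte-level BPE tokenizers such as GPT-2's: tokenization starts from UTF-8 bytes, and non-ASCII characters take two to four bytes each. The sketch below only counts bytes. For such tokenizers, and ignoring any special tokens, that is an upper bound on the token count, not the token count itself. The sample strings and the helper name are illustrative.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes per character in text."""
    return len(text.encode("utf-8")) / len(text)


samples = {
    "English": "Hello world",
    "German": "Grüße",
    "Japanese": "こんにちは",
}

for language, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    print(f"{language:<9} chars={len(text):>2} bytes={n_bytes:>2} "
          f"bytes/char={utf8_bytes_per_char(text):.2f}")
```

To measure the real effect, tokenize equivalent sentences in each language with the model's own tokenizer and compare the token counts directly.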

In [None]:
# Demonstration: Same character count, different token counts
prompt_a = "The sun rises in the east and sets in the west every day."  # Common words
prompt_b = "GPT-4's RLHF utilizes PPO with KL-divergence constraints."   # Technical terms

print(f"Prompt A: \"{prompt_a}\"")
print(f"  Characters: {len(prompt_a)}")
print(f"  Tokens: {len(tokenizer(prompt_a)['input_ids'])}")

print(f"\nPrompt B: \"{prompt_b}\"")
print(f"  Characters: {len(prompt_b)}")
print(f"  Tokens: {len(tokenizer(prompt_b)['input_ids'])}")

print("\n⚠️ Notice: Same character count but different token counts!")
print("   This means latency comparisons based only on character count are misleading.")

---

## 📝 Mini-Project: Compare Same-Length Prompts

### Your task:
1. Create two prompts with the same character count (~100 chars)
2. One should use common words, one should use technical/rare terms
3. Measure and compare their latencies
4. Calculate latency per token
5. Document your findings

In [None]:
# YOUR CODE HERE
# Create your two prompts (aim for ~100 characters each)

your_prompt_a = ""  # Common words - should have fewer tokens
your_prompt_b = ""  # Technical terms - should have more tokens

# Measure latency for each (uncomment and modify)
# your_prompts = [your_prompt_a, your_prompt_b]
# results_df = measure_latency_vs_tokens(your_prompts, session, tokenizer, num_runs=10)
# display(results_df)

---

## ✅ Knowledge Mastery Checklist

Before moving to Week 3, ensure you can check all boxes:

- [ ] I can explain what a token is in simple terms
- [ ] I understand why token count matters more than character count
- [ ] I can inspect tokenization using the transformers library
- [ ] I can enable ONNX profiling and interpret operator timings
- [ ] I understand how tokenization can mislead evaluation
- [ ] I completed the mini-project

---

**Week 2 Complete!** 🎉

**Next:** *Week 3: Perplexity & Basic Benchmarks*