# Tutorial 4: Evaluation Metrics for TTT

## 1. Overview

In the previous tutorials, we built the specialized TTT layers. Now, we must measure their performance against standard models.

### Key Metrics
How good is a long-context model?
1.  **Perplexity (PPL)**: The "surprise" factor. Lower is better. A model that understands the context will be less surprised by the next word.
2.  **Latency / Throughput**: TTT adds an "optimization loop" *during* inference. We must measure how many tokens per second the model generates to check that this does not make it too slow to be usable.

### Why GPT-2?
For this educational notebook, we use **GPT-2**, which runs comfortably on a CPU or Apple's MPS backend (the setup code also picks up CUDA if it is available). While its accuracy is far from state of the art, it is a convenient sandbox for learning *how* to calculate these metrics without needing an H100 GPU.

## 2. Setup: Preparing the Benchmark Model

We load a standard pre-trained model to act as our baseline.

In [1]:
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Setup Model
model_id = "gpt2"
# Prefer CUDA, then Apple MPS, then fall back to CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(f"Loading {model_id} on {device}...")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
model.eval()

Loading gpt2 on mps...


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

## 3. Metric 1: Perplexity (PPL)

### The Concept
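Perplexity measures how uncertain the model is about the next token. Roughly, a perplexity of *k* means the model is as unsure as if it were choosing uniformly among *k* tokens at every step.
- **PPL = 1.0**: The model is 100% certain of the next token and always right. (Perfect)
- **PPL = 1000**: The model is as unsure as if it were picking among 1,000 equally likely tokens. Truly random guessing over GPT-2's 50,257-token vocabulary would give PPL = 50,257.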
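
To make these reference points concrete, here is a minimal sketch that computes perplexity directly from the probabilities a model assigned to the correct tokens. The function name `perplexity` and the toy probabilities are illustrative assumptions, not part of the notebook; only the vocabulary size (50,257) comes from the GPT-2 printout above.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the correct tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Toy probabilities, made up for illustration
certain = perplexity([1.0, 1.0, 1.0])        # model is always sure and right
uniform = perplexity([1 / 50257] * 3)        # uniform guess over GPT-2's vocabulary
mixed = perplexity([0.5, 0.25, 0.125])       # somewhere in between

print(f"certain: {certain:.2f}")
print(f"uniform: {uniform:.0f}")
print(f"mixed:   {mixed:.2f}")
```

A fully certain model lands at 1, and uniform guessing lands at the vocabulary size, which is why a perplexity in the low hundreds for GPT-2 is far better than chance.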

For **TTT**, we expect Perplexity to **drop** over time as the model "reads" the document and updates its weights. It should get "smarter" about the specific document the longer it reads.
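
One way to check this expectation is to compute perplexity per chunk of the document rather than one number for the whole text. The sketch below assumes you already have a list of per-token cross-entropy losses (for example from `torch.nn.functional.cross_entropy(..., reduction="none")`); the helper `chunked_perplexity` and the loss values are hypothetical, chosen only to show the shape of the calculation.

```python
import math

def chunked_perplexity(token_losses, chunk_size):
    """Perplexity of each consecutive chunk of per-token cross-entropy losses."""
    ppls = []
    for start in range(0, len(token_losses), chunk_size):
        chunk = token_losses[start:start + chunk_size]
        ppls.append(math.exp(sum(chunk) / len(chunk)))
    return ppls

# Hypothetical per-token losses (in nats), invented for illustration
losses = [5.0, 4.8, 4.6, 4.4, 4.0, 3.8, 3.6, 3.4]
curve = chunked_perplexity(losses, chunk_size=4)
print([round(p, 1) for p in curve])
```

If the model really adapts to the document, later entries of `curve` should be lower than earlier ones; a flat curve would suggest the test-time updates are not helping.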

In [2]:
# A sentence explaining the concept. 
# A model that knows English well should assign high probability (low loss) to this valid sentence.
text = "Test-Time Training (TTT) allows models to adapt to long contexts by updating weights on the fly."
encodings = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
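    # Forward pass: passing labels makes the model return the mean cross-entropy loss
    # (labels are shifted internally, so each token is predicted from the ones before it)
    outputs = model(**encodings, labels=encodings.input_ids)
    loss = outputs.loss
    # Perplexity is exp(mean cross-entropy loss)
    perplexity = torch.exp(loss)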

print(f"Validation Loss: {loss.item():.4f}")
print(f"Perplexity: {perplexity.item():.2f}")
print("Note: Lower is better!")

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Validation Loss: 4.8331
Perplexity: 125.60
Note: Lower is better!
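
As a quick sanity check on the numbers above, the reported perplexity can be reproduced from the reported loss by hand. The sketch also converts the loss from nats to bits per token, which is another common way to read the same quantity; the variable names are illustrative.

```python
import math

loss_nats = 4.8331                      # loss printed by the cell above
ppl = math.exp(loss_nats)               # perplexity = e^loss
bits_per_token = loss_nats / math.log(2)  # convert nats to bits

print(f"Perplexity: {ppl:.2f}")
print(f"Bits per token: {bits_per_token:.2f}")
```

The first line should agree with the printed perplexity of 125.60, confirming that the two reported values are the same measurement on different scales.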


## 4. Metric 2: Throughput (Tokens per Second)

### The Trade-off
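TTT introduces an **Inner Loop** (optimization steps) for every chunk of tokens.
- **Standard Attention**: Fast per-token inference with a KV cache, but the cache's memory grows linearly with context length.
- **TTT**: Slower inference (due to the inner-loop training steps), but a small, fixed-size state regardless of context length.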
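
To get a feel for the memory side of this trade-off, here is a back-of-the-envelope estimate of KV-cache size. The defaults (12 layers, hidden size 768) are read off the GPT-2 printout above; 4 bytes per value assumes float32, and the helper `kv_cache_bytes` is a hypothetical name. GPT-2 itself is limited to 1,024 positions, so the longer contexts are an extrapolation for illustration only.

```python
def kv_cache_bytes(n_tokens, n_layers=12, d_model=768, bytes_per_value=4):
    """Approximate KV-cache size: one key and one value vector per layer per token."""
    return 2 * n_layers * n_tokens * d_model * bytes_per_value

# Cache size grows linearly with the number of tokens kept in context
for n in (1024, 8192, 131072):
    print(f"{n:>7} tokens -> {kv_cache_bytes(n) / 2**20:.0f} MiB")
```

Under these assumptions the cache is about 72 MiB at 1,024 tokens and scales linearly from there, whereas a TTT layer's state stays the same size no matter how long the document is.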

We need to measure how many tokens we can generate per second to ensure the system is usable for real-time chat.

In [3]:
prompt = "Alice was beginning to get very tired of sitting by her sister on the bank, and"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
tokens_to_generate = 50

# Warmup pass (Initialize GPU kernels)
_ = model.generate(**inputs, max_new_tokens=5)

# Benchmark Start (perf_counter is a monotonic, high-resolution clock)
start_time = time.perf_counter()
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=tokens_to_generate, pad_token_id=tokenizer.eos_token_id)
end_time = time.perf_counter()
# Benchmark End

duration = end_time - start_time
# Count the tokens actually produced; generation can stop early at EOS
tokens_generated = output_ids.shape[1] - inputs.input_ids.shape[1]
throughput = tokens_generated / duration

print(f"\n--- Speed Benchmark ---")
print(f"Generated {tokens_generated} tokens in {duration:.4f} seconds")
print(f"Throughput: {throughput:.2f} tokens/sec")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



--- Speed Benchmark ---
Generated 50 tokens in 1.3712 seconds
Throughput: 36.46 tokens/sec
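
A single timed run like the one above can be noisy, since background processes and caching affect wall-clock time. One possible refinement, sketched here under the assumption that the workload can be wrapped in a zero-argument callable, is to repeat the measurement and report the median. The helper `time_call` is hypothetical, and the stand-in workload replaces the `model.generate` call so the snippet runs on its own.

```python
import statistics
import time

def time_call(fn, warmup=1, repeats=5):
    """Return the median wall-clock duration (seconds) of fn() over several runs."""
    for _ in range(warmup):
        fn()  # untimed warmup runs
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

# Stand-in workload; in the notebook this would wrap the model.generate call
median_s = time_call(lambda: sum(range(100_000)))
print(f"Median duration: {median_s:.6f} s")
```

Dividing the number of generated tokens by this median gives a throughput figure that is less sensitive to one-off slowdowns than a single measurement.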
