# 08. Model Evaluation

Evaluation tells us how well a language model predicts text it was not trained on. The most common metric for causal language modeling is **Perplexity**.

## Perplexity

Perplexity (PPL) is defined as the exponentiated average negative log-likelihood of a sequence. Intuitively, it measures how "surprised" the model is by the text. A lower perplexity means the model is less surprised (i.e., predicts the text better).

$$ \text{PPL}(X) = \exp \left( -\frac{1}{t} \sum_{i=1}^t \log P(x_i | x_{<i}) \right) $$

where $X = (x_1, \dots, x_t)$ is the tokenized sequence.
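As a small worked example of the formula, the sketch below computes perplexity directly from per-token probabilities. The probability values and the helper name `perplexity_from_probs` are made up for illustration; they are not produced by any model in this notebook.

```python
import math

def perplexity_from_probs(token_probs):
    """exp of the average negative log-probability (hypothetical helper)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities P(x_i | x_<i) assigned to each observed token
confident = [0.5, 0.5, 0.5, 0.5]
uncertain = [0.1, 0.1, 0.1, 0.1]

ppl_confident = perplexity_from_probs(confident)
ppl_uncertain = perplexity_from_probs(uncertain)
print(ppl_confident, ppl_uncertain)
```

When every token receives the same probability $p$, the perplexity is $1/p$, so these two sequences come out at roughly 2 and 10 (up to floating-point rounding).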

In [1]:
import torch
import torch.nn.functional as F
from tqdm.auto import tqdm
import math

# Dummy model setup for demonstration
class DummyModel(torch.nn.Module):
    def __init__(self): super().__init__()
    def forward(self, x, targets=None):
        logits = torch.randn(x.size(0), x.size(1), 1000)
        loss = F.cross_entropy(logits.view(-1, 1000), targets.view(-1)) if targets is not None else None
        return logits, loss

model = DummyModel()
device = torch.device("cpu")

## 1. Calculating Perplexity

We can calculate perplexity by evaluating the cross-entropy loss on a validation set. Because `F.cross_entropy` with its default `reduction='mean'` returns the average negative log-likelihood per token, perplexity is simply the exponential of that loss.
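To make the link between cross-entropy and perplexity concrete, here is a dependency-free sketch that computes the mean cross-entropy from raw logits by hand (log-softmax, then the negative log-probability of each target). The helper names and the tiny vocabulary are illustrative assumptions. A useful sanity check falls out of it: a model that outputs identical logits for every token has a perplexity equal to the vocabulary size.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def mean_cross_entropy(logits_per_position, targets):
    """Average negative log-likelihood of the target tokens."""
    nll = [-log_softmax(logits)[t]
           for logits, t in zip(logits_per_position, targets)]
    return sum(nll) / len(nll)

vocab_size = 8  # illustrative value, not the notebook's vocabulary
uniform_logits = [[0.0] * vocab_size for _ in range(5)]
targets = [3, 1, 7, 0, 2]

loss = mean_cross_entropy(uniform_logits, targets)
ppl_uniform = math.exp(loss)
print(f"loss={loss:.4f}, perplexity={ppl_uniform:.4f}")
```

By the same reasoning, a dummy model that emits random logits over 1000 tokens should on average score no better than the uniform baseline of about 1000.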

In [2]:
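def calculate_perplexity(model, dataloader, device):
    """Return exp(mean per-token cross-entropy) over the whole dataloader."""
    model.eval()
    total_loss = 0.0
    total_tokens = 0
    
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            input_ids = batch['input_ids'].to(device)
            targets = batch['labels'].to(device)
            
            # Forward pass; the model returns the mean loss over this batch's tokens
            _, loss = model(input_ids, targets)
            
            # Weight the batch mean by its token count so that batches of
            # different sizes (e.g. a smaller final batch) are averaged correctly.
            # Assumes every target position counts toward the loss (no ignored/padding labels).
            n_tokens = targets.numel()
            total_loss += loss.item() * n_tokens
            total_tokens += n_tokens
            
    avg_loss = total_loss / total_tokens
    perplexity = math.exp(avg_loss)
    return perplexity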

# Dummy dataloader
from torch.utils.data import DataLoader, Dataset
class DummyDataset(Dataset):
    def __len__(self): return 10
    def __getitem__(self, idx): return {"input_ids": torch.randint(0, 1000, (32,)), "labels": torch.randint(0, 1000, (32,))}

val_loader = DataLoader(DummyDataset(), batch_size=2)

ppl = calculate_perplexity(model, val_loader, device)
print(f"Validation Perplexity: {ppl:.2f}")

Validation Perplexity: 1528.83
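One subtlety when aggregating loss across batches: averaging per-batch mean losses equals the true per-token average only when every batch contains the same number of tokens. The sketch below uses made-up losses and token counts (not values from this notebook) to show how the two can diverge when the final batch is smaller.

```python
import math

# Hypothetical (mean_loss, n_tokens) pairs; the last batch is smaller
batches = [(2.0, 64), (2.0, 64), (5.0, 8)]

# Unweighted: average of the batch means
mean_of_means = sum(loss for loss, _ in batches) / len(batches)

# Weighted: total negative log-likelihood divided by total token count
total_nll = sum(loss * n for loss, n in batches)
total_tokens = sum(n for _, n in batches)
token_weighted = total_nll / total_tokens

print(f"mean of batch means: loss={mean_of_means:.3f}, ppl={math.exp(mean_of_means):.2f}")
print(f"token-weighted:      loss={token_weighted:.3f}, ppl={math.exp(token_weighted):.2f}")
```

Because perplexity exponentiates the loss, even a modest difference in the average loss turns into a large difference in the reported perplexity.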


## 2. Other Metrics

While perplexity is the standard, other metrics can be useful depending on the downstream task:
- **Accuracy**: Next-token prediction accuracy (not always correlated with generation quality).
- **BLEU/ROUGE**: For specific tasks like translation or summarization (requires generating text and comparing to reference).
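
As an illustration of the first bullet, next-token accuracy is the fraction of positions where the highest-scoring token matches the target. The sketch below uses plain Python lists in place of tensors, with made-up logits over a 3-token vocabulary; `next_token_accuracy` is a hypothetical helper name.

```python
def next_token_accuracy(logits_per_position, targets):
    """Fraction of positions whose argmax prediction equals the target."""
    correct = 0
    for logits, target in zip(logits_per_position, targets):
        predicted = max(range(len(logits)), key=lambda i: logits[i])
        correct += int(predicted == target)
    return correct / len(targets)

# Made-up logits: one row per position, one column per vocabulary entry
logits = [
    [2.0, 0.1, -1.0],  # argmax is token 0
    [0.3, 1.5, 0.2],   # argmax is token 1
    [0.0, 0.4, 0.9],   # argmax is token 2
    [1.2, 0.8, 0.1],   # argmax is token 0
]
targets = [0, 1, 1, 0]

accuracy = next_token_accuracy(logits, targets)
print(accuracy)
```

With PyTorch tensors, the same quantity is commonly written as `(logits.argmax(dim=-1) == targets).float().mean()`.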

For a general-purpose LLM, we primarily focus on Perplexity and qualitative assessment of generated text.