(evals_benchmarks)=
# Evaluations and Benchmarks

At some point, when fine-tuning a model, we want to be able to assess whether it's "doing a good job" in some sense. So far, for demonstration purposes, we have been pursuing fairly straightforward goals: starting with a model that cannot do a given task, can we make it able to do that task? For example, the [OLMo](olmo_1b_ift) and [gpt2](gpt2) instruction-tuning notebooks focused on taking a model that could not respond appropriately to questions or instructions and training it until it could. We measured this simply by comparing the base models to the fine-tuned models on a handful of question and instruction examples and confirming that the fine-tuned model responded how we wanted it to.

This isn't a very rigorous approach, and we often have more ambitious aims than just changing the response style. We might, for example, want to improve a model's reasoning abilities, its knowledge of certain subjects, or its performance at particular tasks. Or we might, you know, just want to show off a model that performs better than others across multiple dimensions. We assess progress toward these aims using a variety of evaluation and benchmarking techniques.

It's worth noting at this point that LLM evaluation is an evolving field, and if anyone agrees on anything, it's that evaluations aren't yet very good at capturing what matters. If you find evaluations confusing and difficult, you're in good company. If you have an idea of a better way to evaluate a model, give it a try and write about it!

## Types of evaluations

There are quite a few different approaches to evaluating models. Some common categories include:
- perplexity-based evaluations
- classical NLP metrics (BLEU, ROUGE)
- LLM-as-judge evaluations
- task-oriented evaluations
- reasoning and knowledge evaluations

Note that these are not industry-standard classifications (and I will likely change and rearrange them over time).

We're going to start by discussing *perplexity* as it is an important building block for some popular evaluations and benchmarks used in the field today.

## Perplexity

[This Hugging Face Conceptual Guide](https://huggingface.co/docs/transformers/en/perplexity) provides a good overview of perplexity. Intuitively, you can think of perplexity as representing an LLM's level of uncertainty about the next token generated given the preceding tokens. The lower the perplexity, the more certain the model is about the next token. If a model has a perplexity of *k*, it is as if the model had to pick between *k* possible next tokens, each with the same probability. Thus, lower perplexity means the model is more "confident" in its choice.
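
To make the "picking between *k* equally likely tokens" intuition concrete, here is a minimal sketch in plain Python. It treats perplexity as the exponential of the average negative log-likelihood of the tokens in a sequence. The `perplexity` helper and the probability values are made up for illustration and do not come from a real model.

```python
import math


def perplexity(token_probs):
    """Perplexity given the probability the model assigned to each actual token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood over the tokens in the sequence
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that spreads probability evenly over k = 4 candidates at every step
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that is fairly sure of each token it sees
confident = perplexity([0.9, 0.8, 0.95])

print(f"uniform over 4 tokens: {uniform:.2f}")
print(f"confident model:       {confident:.2f}")
```

The uniform case comes out at a perplexity of 4, matching the *k*-way choice described above, while the more confident toy model lands much closer to 1, the lowest value perplexity can take.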

In [1]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

model_id = "gpt2"
model = GPT2LMHeadModel.from_pretrained(
    model_id,
    device_map="auto",
)

tokenizer = GPT2Tokenizer.from_pretrained(model_id)

question = "What is the capital of France?"
choices = ["(A) London", "(B) Paris", "(C) Berlin", "(D) Madrid"]

# Formatting the input sequence
input_sequence = f"{question} {' '.join(choices)}"
input_ids = tokenizer.encode(input_sequence, return_tensors="pt").to(model.device)

# Getting the token ids for the correct answer
correct_answer = "(B) Paris"
correct_answer_ids = tokenizer.encode(
    correct_answer, add_prefix_space=True, add_special_tokens=False
)

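# Calculating the perplexity
with torch.no_grad():
    # Passing labels also computes a loss, but only the logits are used below
    outputs = model(input_ids, labels=input_ids)
    log_probs = outputs.logits.log_softmax(dim=-1)

    # Find where the first answer token occurs in the input sequence.
    # Caveat: this matches every occurrence of that token, not only the start
    # of the correct answer (other choices may begin with the same token).
    answer_indices = (input_ids[0] == correct_answer_ids[0]).nonzero(as_tuple=True)[0]

    # Look up log probabilities for the answer token ids. The logits at
    # position i - 1 predict the token at position i, hence the shift by one.
    # Caveat: positions and token ids are paired up elementwise (with
    # broadcasting), so the later answer tokens are not scored at their own
    # positions in the sequence.
    answer_log_probs = log_probs[0, answer_indices - 1, correct_answer_ids]

    # Perplexity is the exponential of the mean negative log probability
    perplexity = torch.exp(-answer_log_probs.mean()).item()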

print(f"The perplexity for the question is: {perplexity:.2f}")

The perplexity for the question is: 3410.03


In [6]:
correct_answer_ids

[357, 33, 8, 6342]
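
The lookup in the first cell pairs matched positions with answer token ids rather than walking through the answer span token by token. One possible alternative, sketched here in plain Python on toy numbers, is to locate the whole answer span and score each answer token using the row of log probabilities that precedes it. The helpers `find_span` and `span_perplexity` are hypothetical names, and the sketch assumes `log_probs[t][v]` holds the log probability of token `v` at position `t + 1`, as in the `log_softmax` output above.

```python
import math


def find_span(sequence, span):
    """Return the start index of the first occurrence of `span`, or -1 if absent."""
    for start in range(len(sequence) - len(span) + 1):
        if sequence[start:start + len(span)] == span:
            return start
    return -1


def span_perplexity(log_probs, input_ids, answer_ids):
    """Perplexity of the answer tokens, each scored at its own position."""
    start = find_span(input_ids, answer_ids)
    if start < 1:
        raise ValueError("answer span not found, or it has no preceding context token")
    # Row t predicts the token at position t + 1, so the answer token at
    # position start + j is scored with row start + j - 1.
    token_log_probs = [
        log_probs[start + j - 1][tok] for j, tok in enumerate(answer_ids)
    ]
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy example: a vocabulary of 4 token ids and a 4-token input sequence
toy_input_ids = [0, 2, 1, 3]
toy_answer_ids = [1, 3]
toy_probs = [
    [0.25, 0.25, 0.25, 0.25],    # row 0 predicts position 1
    [0.25, 0.5, 0.125, 0.125],   # row 1 predicts position 2 (token 1: p = 0.5)
    [0.25, 0.25, 0.375, 0.125],  # row 2 predicts position 3 (token 3: p = 0.125)
    [0.25, 0.25, 0.25, 0.25],    # row 3 predicts a token beyond the input
]
toy_log_probs = [[math.log(p) for p in row] for row in toy_probs]

toy_ppl = span_perplexity(toy_log_probs, toy_input_ids, toy_answer_ids)
print(f"toy answer perplexity: {toy_ppl:.2f}")
```

With the tensors from the first cell, something like `span_perplexity(log_probs[0].tolist(), input_ids[0].tolist(), correct_answer_ids)` should work in the same way, provided the answer tokenizes identically inside the full prompt. The resulting number would differ from the value printed above.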