## Evaluation framework
The goal of this notebook is to create an evaluation toolkit to measure LLM performance.

Key notable metrics:
- Language metrics
  - Perplexity
  - BLUE
  - ROUGE
  - Other metrics (TODO Define)
- Task-specific metrics
  - Different benchmarks (TODO Define what benchmarks)
- Inference metrics
  - Time To First Token (TTFT)
  - Memory footprint

This framework is to be used in a future work implementing prunning-training approach to model compression

## Abstract usage flow
- A dataset is selected
- A model is selected and loaded
- Hyperparameters for a given model are fixed
- Dataset is parsed through the model and the metrics are calculated

---

### A dirty first example
Let's consider 2 LLMs:
- https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct
- https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

Note: these are **"instruct"** models

Let's select the metrics to evaluate these models:
- Language metrics
  - Perplexity
  - BLUE
  - ROUGE
- Task-specific metrics
  - MMLU
  - Winograde

Each model has values for each metrics published on their model card. Let's try to replicate the numbers

#### Language metrics

As noted above, let's first compute Perplexity, BLUE score, and ROUGE for each model

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

In [7]:
model_name = 'HuggingFaceTB/SmolLM2-135M-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to('mps')

> Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized sequence:     
> $$X=(x_0, x_1, \dots, x_t)$$
> then the perplexity of $X$ is
> $$
\exp \left\{ -\dfrac{1}{t}\sum^{t} _{i}\log p_{\theta}(x_{i}|x_{<i})\right\}  
> $$
> where $log p_{\theta}(x_{i}|x_{<i})$ is the log-likelihood of the $i$th token conditioned on the preceding tokens $x_{<i}$ 
>
> Intuitively, it can be thought of as an evaluation of the model’s ability to predict uniformly among the set of specified tokens in a corpus. Importantly, this means that the tokenization procedure has a direct impact on a model’s perplexity which should always be taken into consideration when comparing different models.

[Reference](https://huggingface.co/docs/transformers/perplexity)

For evaluation we will be using the Wiki

In [8]:
from datasets import load_dataset

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

Token indices sequence length is longer than the specified maximum sequence length for this model (304978 > 8192). Running this sequence through the model will result in indexing errors


In [12]:
import torch
from tqdm import tqdm


def ppl(model, encodings, max_length = 1024, stride = 512):
    seq_len = encodings.input_ids.size(1)

    nll_sum = 0.0
    n_tokens = 0
    prev_end_loc = 0
    for begin_loc in tqdm(range(0, seq_len, stride)):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
        input_ids = encodings.input_ids[:, begin_loc:end_loc].to('mps')
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)

            # loss is calculated using CrossEntropyLoss which averages over valid labels
            # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
            # to the left by 1.
            neg_log_likelihood = outputs.loss

        # Accumulate the total negative log-likelihood and the total number of tokens
        num_valid_tokens = (target_ids != -100).sum().item()  # number of valid tokens in target_ids
        batch_size = target_ids.size(0)
        num_loss_tokens = num_valid_tokens - batch_size  # subtract batch_size due to internal label shift
        nll_sum += neg_log_likelihood * num_loss_tokens
        n_tokens += num_loss_tokens

        prev_end_loc = end_loc
        if end_loc == seq_len:
            break

    avg_nll = nll_sum / n_tokens  # average negative log-likelihood per token
    return torch.exp(avg_nll)

In [13]:
smol_135_ppl = ppl(model, encodings)
print(smol_135_ppl)

100%|█████████▉| 594/596 [01:06<00:00,  8.99it/s]

tensor(16.6064, device='mps:0')





In [14]:
model_name2 = 'meta-llama/Llama-3.1-8B-Instruct'
tokenizer2 = AutoTokenizer.from_pretrained(model_name)
model2 = AutoModelForCausalLM.from_pretrained(model_name).to('mps')

In [15]:
llama31_ppl = ppl(model2, encodings)
print(llama31_ppl)

100%|█████████▉| 594/596 [01:06<00:00,  8.95it/s]

tensor(16.6064, device='mps:0')



