# **Perplexity of fixed-length models**

## **Definition:**

Perplexity (PPL) is a widely used metric that estimates how well an autoregressive language model predicts a text in a given context. Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. 
## **Formula:**

If we have a tokenized sequence $X = (x_0, x_1, \dots, x_t)$, then the perplexity of $X$ is:

$$\text{PPL}(X) = \exp \left\{ -\frac{1}{t}\sum_{i=1}^{t} \log p_\theta (x_i|x_{<i}) \right\},$$

where $\log p_\theta (x_i|x_{<i})$ is the log-likelihood of the $i^{th}$ token conditioned on the prior tokens $x_{<i}$ according to our model. 
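
As a quick numeric check of the definition, the sketch below computes perplexity from a list of per-token log-probabilities. The function name `perplexity` and the example probabilities are illustrative assumptions, not part of any library.

```python
import math

def perplexity(token_log_probs):
    """Exponentiated average negative log-likelihood of a sequence.

    token_log_probs: natural-log probabilities log p(x_i | x_<i),
    one per predicted token (hypothetical values in the demo below).
    """
    if not token_log_probs:
        raise ValueError("need at least one predicted token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns probability 1/4 to every token is as uncertain as a
# uniform choice among four options, so its perplexity is (numerically) 4.
uniform_quarter = [math.log(0.25)] * 10
print(perplexity(uniform_quarter))
```

A perplexity of 1 would mean the model assigns probability 1 to every token; larger values mean more uncertainty.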


##### Note:

The tokenization procedure has a direct impact on a model's perplexity, which should always be taken into consideration when comparing different models.
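
To see why tokenization matters, consider a toy case in which the same text receives the same total log-likelihood but is split into different numbers of tokens. The numbers below are made up for illustration, and `per_token_ppl` is a hypothetical helper.

```python
import math

def per_token_ppl(total_log_likelihood, num_tokens):
    """Per-token perplexity from a total (natural-log) log-likelihood."""
    return math.exp(-total_log_likelihood / num_tokens)

# Hypothetical setup: both tokenizations give the text the same total
# log-likelihood, but tokenizer B produces twice as many tokens as tokenizer A.
total_ll = -120.0
ppl_a = per_token_ppl(total_ll, num_tokens=40)  # coarser tokenization
ppl_b = per_token_ppl(total_ll, num_tokens=80)  # finer tokenization

print(f"tokenizer A: {ppl_a:.2f}, tokenizer B: {ppl_b:.2f}")
```

The finer tokenization reports a much lower per-token perplexity even though the text is assigned exactly the same probability, so raw PPL values are only comparable between models that share a tokenization.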


## **Calculating PPL with fixed-length models**

# Unlimited Context Size

If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
factorizing a sequence and conditioning on the **entire processed subsequence** at each step, as shown below.

<img width="600" alt="Full decomposition of a sequence with unlimited context length" src="https://raw.githubusercontent.com/Nagoudi/Perplexity/main/ppl.gif"/>


# Fixed Context Size

When working with an autoregressive language model, we typically have a constraint on the number of tokens the model can process. [GPT-2](https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe), for example, has a fixed context length of 1,024 tokens, so we cannot calculate $p_\theta(x_t|x_{<t})$ directly when $t$ is greater than 1,024.

Instead, the sequence is typically **broken into** subsequences equal in length to the model's maximum input size. If a model's max input size is $k$ (e.g., $k$ = 1,024 for GPT-2 and $k$ = 2,048 for GPT-3), we then approximate the likelihood of a token $x_t$ by conditioning only on the $k-1$ tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
log-likelihoods of each segment independently. The figure below repeats the previous example with a fixed context size of $k=5$.

<img width="600" alt="Suboptimal PPL not taking advantage of full available context" src="https://raw.githubusercontent.com/Nagoudi/Perplexity/main/ppl2.gif"/>


# Sliding-Window Strategy

Breaking the sequence into disjoint chunks is quick to compute, since the perplexity of each segment can be computed in one forward pass, but it serves as a poor
approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
have less context at most of the prediction steps.
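
The loss of context can be made concrete by counting how many preceding tokens each prediction can see. The sketch below assumes a window of $k$ tokens and compares disjoint chunks with a window that slides one token at a time; both helper functions are illustrative and not part of any library.

```python
def context_disjoint(num_tokens, k):
    """Context size per position when the sequence is cut into disjoint chunks of k tokens."""
    # Position i sits at offset i % k inside its chunk, so it sees i % k earlier tokens.
    return [i % k for i in range(num_tokens)]

def context_sliding(num_tokens, k):
    """Context size per position when the window slides one token at a time."""
    # The window holds k tokens in total, so each token sees up to k - 1 predecessors.
    return [min(i, k - 1) for i in range(num_tokens)]

k = 5
num_tokens = 12
disjoint = context_disjoint(num_tokens, k)
sliding = context_sliding(num_tokens, k)

print("disjoint:", disjoint)
print("sliding: ", sliding)
print("mean context:", sum(disjoint) / num_tokens, "vs", sum(sliding) / num_tokens)
```

With disjoint chunks the available context resets to zero at every chunk boundary, whereas the sliding window keeps it at $k-1$ once enough tokens have been seen.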

Instead, the PPL of fixed-length models should be evaluated with a **sliding-window strategy**. This involves repeatedly
sliding the context window so that the model has more context when making each prediction. The example below uses a sliding window of size $k=5$.

<img width="600" alt="Sliding window PPL taking advantage of all available context" src="https://raw.githubusercontent.com/Nagoudi/Perplexity/main/ppl3.gif"/>

This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
practical compromise is to employ a **strided sliding window**, moving the context by larger strides rather than sliding by
1 token at a time. This allows computation to proceed much faster while still giving the model a large context to make
predictions at each step.
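
The trade-off between stride and cost can be sketched without a model. The hypothetical helper `strided_windows` below follows the description above: each window spans at most `max_length` tokens, starts `stride` tokens after the previous one, and only the tokens not yet scored count as targets. The toy sizes are assumptions chosen to keep the output short.

```python
def strided_windows(seq_len, max_length, stride):
    """Yield (begin, end, num_targets) for each forward pass."""
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        # Only tokens after the previous window's end are new targets;
        # everything before them in the window serves as context.
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break

windows = list(strided_windows(seq_len=20, max_length=8, stride=4))
for begin, end, num_targets in windows:
    print(f"tokens [{begin}, {end}) -> score the last {num_targets}")

# Every token is scored exactly once across all windows.
assert sum(n for _, _, n in windows) == 20

# Compare the number of forward passes for two strides.
print(len(windows), "passes with stride 4")
print(len(list(strided_windows(20, 8, 1))), "passes with stride 1")
```

Smaller strides give each target more context but require more forward passes, which is the cost the strided window is meant to reduce.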

# Example: Calculating perplexity with GPT-2 

## Let's first install Transformers from HuggingFace 🤗

In [None]:
# Transformers and datasets installation
! pip install transformers datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting datasets
  Downloading datasets-2.9.0-py3-none-any.whl (462 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
Collecting

# Load the GPT-2 model and tokenizer

In [None]:
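import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Use the GPU when one is available; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "gpt2"
# Download the pretrained GPT-2 weights and the matching tokenizer
model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)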

# Load and tokenize the WikiText-2 dataset

In [None]:
from datasets import load_dataset

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
print(test["text"][3])



 Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed by a starring role in the play Herons written by Simon Stephens , which was performed in 2001 at the Royal Court Theatre . He had a guest role in the television series Judge John Deed in 2002 . In 2004 Boulter landed a role as " Craig " in the episode " Teddy 's Story " of the television series The Long Firm ; he starred alongside actors Mark Strong and Derek Jacobi . He was cast in the 2005 theatre productions of the Philip Ridley play Mercury Fur , which was performed at the Drum Theatre in Plymouth and the Menier Chocolate Factory in London . He was directed by John Tiffany and starred alongside Ben Whishaw , Shane Zaza , Harry Kent , Fraser Ayres , Sophie Stanton and Dominic Hall . 



In [None]:
# Since this dataset is small and we're just doing one forward pass over the set,
# we can just load and encode the entire dataset in memory.
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

# Evaluate the perplexity using the sliding-window strategy

With 🤗 Transformers, we can simply:
 

1.   Pass the `input_ids` as the `labels` to our model.
2.   Read the loss it returns, which is the average negative log-likelihood per token.


#### **Note 1:**

With our sliding window approach, however, there is overlap in the tokens we pass to the model at each iteration. We don't want the log-likelihood for the tokens we're just treating as context to be included in our loss, so we can set these targets to `-100` so that they are ignored. 
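
The effect of the `-100` label can be illustrated without a model. The sketch below assumes per-token negative log-likelihoods are already available and uses plain Python lists; `mask_context` and `masked_mean_nll` are hypothetical helpers that imitate how ignored labels are excluded from an averaged loss, not functions from 🤗 Transformers or PyTorch. It also leaves out the one-position shift between inputs and labels that a causal language model applies internally.

```python
IGNORE_INDEX = -100  # label value treated as "do not score this position"

def mask_context(input_ids, num_targets):
    """Copy input_ids as labels, keeping only the last num_targets positions."""
    num_context = len(input_ids) - num_targets
    return [IGNORE_INDEX] * num_context + list(input_ids[num_context:])

def masked_mean_nll(token_nlls, labels):
    """Average the NLL over positions whose label is not IGNORE_INDEX."""
    kept = [nll for nll, label in zip(token_nlls, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept)

input_ids = [11, 12, 13, 14, 15, 16]         # made-up token ids
token_nlls = [9.0, 9.0, 9.0, 2.0, 3.0, 4.0]  # made-up per-token losses
labels = mask_context(input_ids, num_targets=3)

print(labels)                               # first three positions are masked
print(masked_mean_nll(token_nlls, labels))  # averages only the last three losses
```

The three context positions carry large losses in this toy example, yet they do not affect the result because their labels are masked.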

#### **Calculating perplexity:** 

The following is an example of how we could do this with a stride of `512`. This means that the model will have at least 512 tokens
for context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens
available to condition on).




In [None]:
import torch
from tqdm import tqdm

max_length = model.config.n_positions  # maximum context length; 1024 for GPT-2
stride = 512  # number of tokens the window advances at each step
seq_len = encodings.input_ids.size(1)  # total number of tokens in the WikiText-2 test set

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        # The loss is computed with CrossEntropyLoss, which averages over the
        # target tokens that are not masked with -100.
        # Multiply it by trg_len to get the sum instead of the average.
        # The true average over all tokens is taken in the last step of this example.
        neg_log_likelihood = outputs.loss * trg_len

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / end_loc)





In [None]:
print('The perplexity of GPT-2 on WikiText-2 is:', ppl.item())


The perplexity of GPT-2 on WikiText-2 is: 25.170446395874023


# Source

[1] [GPT-2 Paper](https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe)

[2] https://huggingface.co/docs/transformers

[3] https://huggingface.co/docs/transformers/perplexity