# Language Model Evaluation

## Language Model

A language model predicts the next word in a sequence of words. For example, given the prefix "I like to eat", a language model can assign a high probability to "apples" as the next word.

- A language model assigns a probability to a sequence of words.
- It predicts the next word $w_n$ given the previous words $w_1, \ldots, w_{n-1}$.
- Given the prefix `I like to eat`, the probability of the next word `apples` is higher than the probability of the next word `pencils`. In mathematical notation, we can write this as:

    $$
    P(\text{apples} \mid \text{I like to eat}) > P(\text{pencils} \mid \text{I like to eat})
    $$
    
- A unigram model ignores the preceding words, so the probability of $w_n$ depends only on $w_n$ itself:

    $$
    P(w_n \mid w_1, \ldots, w_{n-1}) = P(w_n)
    $$

- Under a unigram model, the probability of a sequence of words is the product of the probabilities of the individual words:

    $$
    P(\text{I like to eat apples}) = P(\text{I}) \times P(\text{like}) \times P(\text{to}) \times P(\text{eat}) \times P(\text{apples})
    $$

- An n-gram model looks back only at the previous $n-1$ words. Writing $i$ for the position of a word in the sequence:

    $$
    P(w_i \mid w_1, \ldots, w_{i-1}) = P(w_i \mid w_{i-1}, \ldots, w_{i-n+1})
    $$

- The probability of a sequence of words $W = w_1, \ldots, w_N$ in an n-gram model is the product of the conditional probabilities of each word given the $n-1$ words before it:

    $$
    P(W) = \prod_{i=1}^N P(w_i \mid w_{i-1}, \ldots, w_{i-n+1})
    $$
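The n-gram product above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes maximum-likelihood estimates from raw counts with no smoothing, a hypothetical `<s>` padding token for the start of each sentence, and a tiny made-up corpus. The names `train_ngram` and `sequence_probability` are chosen here for illustration only.

```python
from collections import Counter

def train_ngram(sentences, n):
    """Count n-grams and their (n-1)-word contexts over tokenized sentences."""
    ngrams, contexts = Counter(), Counter()
    for tokens in sentences:
        # Assumption: pad the start with n-1 "<s>" tokens so every word has a context.
        padded = ["<s>"] * (n - 1) + tokens
        for i in range(len(tokens)):
            ngrams[tuple(padded[i:i + n])] += 1
            contexts[tuple(padded[i:i + n - 1])] += 1
    return ngrams, contexts

def sequence_probability(tokens, ngrams, contexts, n):
    """Product of count(n-gram) / count(context) over all positions (no smoothing)."""
    padded = ["<s>"] * (n - 1) + tokens
    prob = 1.0
    for i in range(len(tokens)):
        context = tuple(padded[i:i + n - 1])
        if contexts[context] == 0:
            return 0.0  # unseen context: the unsmoothed estimate is undefined, treat as 0
        prob *= ngrams[tuple(padded[i:i + n])] / contexts[context]
    return prob

# Toy corpus, invented for this example.
corpus = [s.split() for s in [
    "I like to eat apples",
    "I like to eat apples",
    "I like to eat pears",
    "I like to buy pencils",
]]
sentence = "I like to eat apples".split()

uni_ngrams, uni_contexts = train_ngram(corpus, 1)
bi_ngrams, bi_contexts = train_ngram(corpus, 2)

p_unigram = sequence_probability(sentence, uni_ngrams, uni_contexts, 1)  # approximately 0.00012
p_bigram = sequence_probability(sentence, bi_ngrams, bi_contexts, 2)     # approximately 0.5
print(p_unigram, p_bigram)
```

With `n = 1` the context is empty, so the function reduces to the unigram product. On this corpus the bigram model gives "I like to eat pencils" a probability of zero, because "eat pencils" never occurs; practical models would typically apply smoothing to avoid this.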

## Evaluation of Language Models

There are two main ways to evaluate language models:

**Extrinsic evaluation**
  - Evaluate the language model on a downstream task such as machine translation.
  - This is the most realistic evaluation, because it measures how much the language model improves the task it is actually used for.
  - This requires a full pipeline from the language model to the downstream task.
  - The GLUE benchmark score is one example of broader, multi-task evaluation for language models.

**Intrinsic evaluation**
  - Intrinsic evaluation uses intrinsic metrics to evaluate the language model itself, independent of any application.
  - This is easier to do as it does not require a downstream task.
  - Intrinsic evaluation is not as realistic as extrinsic evaluation.
  - Metrics such as perplexity and cross-entropy fit into this category.
  - Unlike accuracy, perplexity and cross-entropy are not directly interpretable.
  - While 90% accuracy is superior to 80% accuracy regardless of the model, perplexity values are generally only comparable between models that share the same vocabulary and are evaluated on the same test set.
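As a rough sketch of how the two intrinsic metrics relate, the snippet below uses the common definitions, which are not spelled out in this section: cross-entropy as the average negative log2 probability per word, and perplexity as 2 raised to the cross-entropy. The function names and the probability lists are hypothetical; in practice the per-word probabilities would come from a model scored on held-out text.

```python
import math

def cross_entropy(token_probs):
    """Average negative log2 probability per word, in bits."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity as 2 ** cross-entropy (lower is better)."""
    return 2 ** cross_entropy(token_probs)

# Hypothetical probabilities a model assigned to each word of a 4-word test sentence.
uniform_guess = [0.25, 0.25, 0.25, 0.25]  # as uncertain as choosing among 4 equally likely words
sharper_model = [0.5, 0.25, 0.5, 0.25]    # assigns more probability to the observed words

print(cross_entropy(uniform_guess), perplexity(uniform_guess))
print(cross_entropy(sharper_model), perplexity(sharper_model))
```

A model that assigns each observed word a probability of 1/4 has a cross-entropy of 2 bits and a perplexity of 4, which is one way to read perplexity: the number of equally likely choices the model is effectively deciding between at each step.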

## References

- [N-gram Language Models](https://web.stanford.edu/~jurafsky/slp3/3.pdf)
- [Evaluation Metrics for Language Modeling](https://thegradient.pub/understanding-evaluation-metrics-for-language-models)
- [Perplexity in Language Models](https://towardsdatascience.com/perplexity-in-language-models-87a196019a94)