# Lecture 4

## n-gram Language Models

### Probability of a Sentence
* A Naive Bayes model can compute the probability of a sentence, but it is not a sophisticated language model because it ignores how each word depends on the words around it

### Human Word Prediction
* Requires domain knowledge, syntactic knowledge, lexical knowledge
* Used by text generation, autocomplete, GPT technologies

### Probability of the Next Word
* Instead of relying on domain, syntactic, and lexical knowledge, rely on the notion of **probability of a sequence**
* Probability of the next word $w_n$ given the preceding words $w_1, w_2, ..., w_{n - 1}$:

$$
        P(w_n | w_1, w_2, ..., w_{n - 1})
$$

* The probability of the entire sequence can then be computed from these next-word probabilities using the Chain Rule
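
Written out, the Chain Rule decomposes the joint probability of the sequence into a product of next-word probabilities, each conditioned on the full history:

$$
P(w_1, ..., w_n) = P(w_1) \cdot P(w_2 | w_1) \cdot P(w_3 | w_1, w_2) \cdots P(w_n | w_1, ..., w_{n - 1}) = \prod_{i = 1}^n P(w_i | w_1, ..., w_{i - 1})
$$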

### Markov Assumption
* Probability of a word $w_n$ given a sequence $w_1, ..., w_{n - 1}$ is difficult to estimate
* The longer a sequence becomes, the less likely it will appear in the training data
* Instead, make Markov assumptions regarding independence:
    * The probability of $w_n$ only depends on the previous k-1 words

$$ P(w_n | w_1, w_2, ..., w_{n - 1}) \approx P(w_n | w_{n - k + 1}, ..., w_{n - 1}) $$

### Bi-gram Language Model
* Using the Markov assumption and the Chain Rule:

$$ P(w_1, ..., w_n) \approx P(w_1 | \text{start}) \cdot P(w_2 | w_1) \cdot P(w_3 | w_2) \cdots P(w_n | w_{n - 1})
$$

### Log Probabilities
* The product of many probabilities becomes very small and can underflow floating-point arithmetic, so in practice we often work with log probabilities

$$
\begin{aligned}
p(w_1, ..., w_n) &= \prod_{i = 1}^n p(w_i | w_{i - 1}) \\
\log p(w_1, ..., w_n) &= \sum_{i = 1}^n \log p(w_i | w_{i - 1})
\end{aligned}
$$

where $w_0$ is the start symbol
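
A minimal sketch of the sum-of-logs computation for a bigram model. The probability table, the `<s>` start symbol, and the function name `sentence_log_prob` are made up for illustration; the second half shows why the raw product is avoided.

```python
import math

# Hypothetical bigram probabilities p(w | prev); the numbers are made up.
bigram_prob = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sleeps"): 0.1,
}

def sentence_log_prob(tokens, probs, start="<s>"):
    """Return log2 p(w_1, ..., w_n) as a sum of log2 p(w_i | w_{i-1})."""
    total = 0.0
    prev = start
    for word in tokens:
        total += math.log2(probs[(prev, word)])
        prev = word
    return total

log_p = sentence_log_prob(["the", "cat", "sleeps"], bigram_prob)

# Multiplying 200 factors of 0.01 underflows to 0.0 in floating point,
# while the equivalent sum of logs stays finite.
product = 1.0
for _ in range(200):
    product *= 0.01
log_sum = 200 * math.log2(0.01)
```

Exponentiating `log_p` recovers the ordinary probability (here 0.5 · 0.2 · 0.1), but comparisons between sentences can be made directly on the log values.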

### Estimating n-gram Probabilities
* We can estimate n-gram probabilities using maximum likelihood estimates

$$ p(w | u) = \frac{count(u, w)}{count(u)} $$

* Or for trigrams:

$$ p(w | u, v) = \frac{count(u, v, w)}{count(u, v)} $$ 
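
The bigram estimate can be sketched by counting over a toy corpus. The corpus, the `<s>` / `</s>` boundary symbols, and the helper name `p_mle` are assumptions for illustration; `count(u)` is taken here to be the number of times `u` is followed by some token, so that the estimates for each context sum to 1.

```python
from collections import Counter

# Toy training corpus (made up for illustration).
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

bigram_counts = Counter()   # count(u, w)
context_counts = Counter()  # count(u): times u is followed by some token

for sentence in corpus:
    # Assumed sentence-boundary symbols.
    tokens = ["<s>"] + sentence + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def p_mle(word, prev):
    """Maximum likelihood estimate p(word | prev) = count(prev, word) / count(prev).

    Raises ZeroDivisionError if prev was never seen as a context.
    """
    return bigram_counts[(prev, word)] / context_counts[prev]

p_cat_given_the = p_mle("cat", "the")  # 2 of the 3 occurrences of "the"
p_sat_given_cat = p_mle("sat", "cat")  # 1 of the 2 occurrences of "cat"
```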

### Unseen Tokens
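* If a token never appears in the training data, every n-gram containing it has a count of 0: the estimate is 0 when the token is the word being predicted, and the denominator is 0 when the token is part of the context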
* The problem can be solved using the following approach:
    * Start with a specific lexicon of known tokens
    * Replace all tokens in the training and testing corpus that are not in the lexicon with an *UNK* token
    * Practical approach:
        * Lexicon contains all words that appear more than *k* times in the training corpus
        * Replace all other tokens with UNK
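
The practical approach above can be sketched as follows. The function names, the `<UNK>` spelling, and the toy data are assumptions for illustration.

```python
from collections import Counter

def build_lexicon(corpus, k):
    """Keep every token that appears more than k times in the training corpus."""
    counts = Counter(tok for sentence in corpus for tok in sentence)
    return {tok for tok, c in counts.items() if c > k}

def replace_unknown(sentence, lexicon, unk="<UNK>"):
    """Replace every token that is not in the lexicon with the UNK symbol."""
    return [tok if tok in lexicon else unk for tok in sentence]

train = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

lexicon = build_lexicon(train, k=1)
train_unk = [replace_unknown(s, lexicon) for s in train]
test_unk = replace_unknown(["the", "bird", "sat"], lexicon)
```

With `k=1`, "dog" and "ran" (seen once) are mapped to the UNK symbol in training, so the model learns statistics for UNK that it can reuse for "bird" at test time.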

### Unseen Contexts
* If the context has not been seen, then calculating probability would also involve dividing by 0
* Two basic approaches for dealing with unseen contexts:
    * Smoothing / Discounting: move some probability mass from seen trigrams to unseen trigrams
    * Back-off: use shorter contexts to fill in the gaps. If a trigram is unseen, approximate its probability with bigram or unigram information (in general, fall back from the n-gram to the (n-1)-gram, the (n-2)-gram, and so on)
* Other techniques:
    * Class-based back-off: use the back-off probability for a specific word class / part of speech

### Zipf's Law
* Problem: n-grams (and most other linguistic phenomena) follow a *Zipfian* distribution
* A few words occur very frequently
* Most words occur very rarely, and many are seen only once
* **Zipf's Law:** a word's frequency is approximately inversely proportional to its rank in the word distribution list
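
As a formula, writing $f(r)$ for the frequency of the word at rank $r$ and $C$ for a corpus-dependent constant (both symbols are introduced here for illustration):

$$ f(r) \approx \frac{C}{r} $$

so the second most frequent word occurs roughly half as often as the most frequent one, the third roughly a third as often, and so on.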

### Smoothing
* Smoothing flattens spiky distributions
* Take some probability mass from high-frequency words and redistribute it among low-frequency words to even out the distribution

### Additive Smoothing
* Classic approach: Laplace smoothing, a.k.a. additive (add-one) smoothing

$$ P(w_i) = \frac{count(w_i) + 1}{N + V} $$

where $N$ is the number of tokens and $V$ is the number of types (i.e. the size of the vocabulary)

* In the bigram case, we get the following probability:

$$ P(w | u) = \frac{count(u, w) + 1}{count(u) + V} $$

* Because $V > 0$, the denominator is never zero, even when the context $u$ has not been seen
* However, this assigns every unseen word the same probability, even though some words are inherently more likely than others
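
A minimal sketch of the add-one bigram estimate on a toy token stream. The data and the name `p_add_one` are made up for illustration, and `count(u)` is taken to be the number of times `u` is followed by another token.

```python
from collections import Counter

# Toy token stream (made up for illustration).
tokens = "the cat sat on the mat the cat ran".split()

vocab = sorted(set(tokens))
V = len(vocab)  # number of types

bigram_counts = Counter(zip(tokens, tokens[1:]))  # count(u, w)
context_counts = Counter(tokens[:-1])             # count(u)

def p_add_one(word, prev):
    """Add-one smoothed estimate (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + V)

seen = p_add_one("cat", "the")          # seen bigram: (2 + 1) / (3 + 6)
unseen = p_add_one("ran", "the")        # unseen bigram: (0 + 1) / (3 + 6)
unseen_context = p_add_one("the", "ran")  # "ran" never occurs as a context: 1 / V
```

For an unseen context the estimate falls back to the uniform distribution over the vocabulary, which illustrates the limitation noted above.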

### Linear Interpolation
* Use denser distributions of shorter n-grams to "fill in" sparse n-gram distributions

$$ p(w | u, v) = \lambda_1 \cdot p_{mle}(w | u, v) + \lambda_2 \cdot p_{mle}(w | v) + \lambda_3 \cdot p_{mle}(w) $$

where $\lambda_1, \lambda_2, \lambda_3 > 0$ and $\lambda_1 + \lambda_2 + \lambda_3 = 1$

* Works well in practice
* Parameters can be estimated on development data (for example, using Expectation Maximization)
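
A minimal sketch of the interpolated trigram estimate. The token stream and the $\lambda$ values are arbitrary choices for illustration (in practice the weights would be tuned on development data), and treating an MLE with an unseen context as 0 is an assumption of this sketch.

```python
from collections import Counter

# Toy token stream (made up for illustration).
tokens = "the cat sat on the mat the cat ran".split()
N = len(tokens)
vocab = set(tokens)

uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
# Context counts: how often each history is followed by some token.
bi_ctx = Counter(tokens[:-1])
tri_ctx = Counter(zip(tokens[:-2], tokens[1:-1]))

def ratio(num, den):
    """MLE ratio; assumed to be 0 when the context was never seen."""
    return num / den if den > 0 else 0.0

def p_interp(w, u, v, lambdas=(0.6, 0.3, 0.1)):
    """Weighted mix of trigram, bigram, and unigram MLEs for p(w | u, v)."""
    l1, l2, l3 = lambdas
    return (l1 * ratio(tri[(u, v, w)], tri_ctx[(u, v)])
            + l2 * ratio(bi[(v, w)], bi_ctx[v])
            + l3 * ratio(uni[w], N))

p_seen = p_interp("sat", "the", "cat")    # trigram "the cat sat" was seen
p_unseen = p_interp("mat", "the", "cat")  # only the unigram term is non-zero
```

Even though "the cat mat" and "cat mat" never occur, `p_unseen` is positive because the unigram term contributes.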

### Discounting
* Idea: set aside some probability mass, then fill in the missing mass using back-off (similar idea to smoothing)

$$ count^*(v, w) = count(v, w) - \beta $$

where $0 < \beta < 1$

* Then for all seen bigrams:

$$ p(w | v) = \frac{count^*(v, w)}{count(v)}$$

* For each context $v$ the missing probability mass is:

$$ \alpha (v) = 1 - \sum_{w | count(v, w) > 0} \frac{count^*(v, w)}{count(v)} $$

* We can now divide this held-out mass among the unseen words, either uniformly or in proportion to each word's unigram count (back-off)
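
A minimal sketch of the discounted estimate and the missing mass $\alpha(v)$ on a toy token stream. The data, the choice $\beta = 0.5$, and the function names are assumptions for illustration.

```python
from collections import Counter

# Toy token stream (made up for illustration).
tokens = "the cat sat on the mat the cat ran".split()

bigram_counts = Counter(zip(tokens, tokens[1:]))  # count(v, w)
context_counts = Counter(tokens[:-1])             # count(v)
BETA = 0.5  # discount, chosen arbitrarily with 0 < BETA < 1

def p_discounted(w, v):
    """Discounted estimate for a bigram (v, w) that was seen in training."""
    return (bigram_counts[(v, w)] - BETA) / context_counts[v]

def missing_mass(v):
    """alpha(v): probability mass left over after discounting seen bigrams."""
    seen = [w for (ctx, w) in bigram_counts if ctx == v]
    return 1.0 - sum(p_discounted(w, v) for w in seen)

alpha_the = missing_mass("the")
```

Here "the" is followed by two distinct words in 3 occurrences, so each seen bigram gives up $\beta$ counts and `alpha_the` equals $2 \cdot 0.5 / 3 = 1/3$.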

### Katz' Backoff
* Divide the held-out probability mass proportionally to the unigram probability of the unseen words in context $v$

$$ p(w | v) =
\begin{cases} \frac{count^*(v, w)}{count(v)} & \text{if } count(v, w) > 0 \\
\alpha (v) \times \frac{p_{mle}(w)}{\sum_{u | count(v, u) = 0} p_{mle}(u) } & \text{otherwise.}
\end{cases}
$$

* This allocates the held-out probability mass to each unseen token in proportion to its unigram frequency
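
A minimal sketch of the bigram case, combining absolute discounting with unigram back-off. The token stream, $\beta = 0.5$, and the name `p_katz` are assumptions for illustration; this is not a full Katz implementation with Good-Turing discounts.

```python
from collections import Counter

# Toy token stream (made up for illustration).
tokens = "the cat sat on the mat the cat ran".split()
N = len(tokens)
vocab = set(tokens)

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))  # count(v, w)
context_counts = Counter(tokens[:-1])             # count(v)
BETA = 0.5  # discount, chosen arbitrarily with 0 < BETA < 1

def p_katz(w, v):
    """Back-off estimate of p(w | v) with a fixed discount BETA."""
    seen = {x for (ctx, x) in bigram_counts if ctx == v}
    if w in seen:
        # Seen bigram: use the discounted count.
        return (bigram_counts[(v, w)] - BETA) / context_counts[v]
    # Unseen bigram: share alpha(v) in proportion to unigram probability.
    alpha = 1.0 - sum((bigram_counts[(v, x)] - BETA) / context_counts[v]
                      for x in seen)
    unseen_unigram_mass = sum(unigram_counts[x] / N for x in vocab - seen)
    return alpha * (unigram_counts[w] / N) / unseen_unigram_mass

p_seen = p_katz("cat", "the")    # seen bigram
p_unseen = p_katz("sat", "the")  # unseen bigram, backed off to the unigram
total = sum(p_katz(w, "the") for w in vocab)
```

The seen bigrams keep $2/3$ of the mass for the context "the" and the unseen words share the remaining $1/3$, so `total` comes out as 1 up to rounding.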

### Katz' Backoff for Trigrams
* For trigrams: recursively compute backoff-probability for unseen bigrams. Then distribute the held-out probability mass proportionally to that bigram backoff-probability.

$$ p(w | u, v) =
\begin{cases} \frac{count^*(u, v, w)}{count(u, v)} & \text{if } count(u, v, w) > 0 \\
\alpha (u, v) \times \frac{p_{BO}(w | v)}{\sum_{z | count(u, v, z) = 0} p_{BO}(z | v) } & \text{otherwise.}
\end{cases}
$$

where, analogously to the bigram case, the missing probability mass for the context $(u, v)$ is:

$$ \alpha (u, v) = 1 - \sum_{w | count(u, v, w) > 0} \frac{count^*(u, v, w)}{count(u, v)}
$$

* Backoff methods are often combined with Good-Turing smoothing

### Evaluating n-gram Models
* Extrinsic evaluation: Apply the model in an application (for example, language classification) then evaluate the application
* Intrinsic evaluation: measure how well the model approximates unseen language data
    * Can compute the probability of each sentence according to the model (higher probability = better model)
    * Typically, report the *perplexity* of the model on the test data rather than the raw sentence probabilities

### Perplexity
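* **Perplexity -** per-word measure of how well the n-gram model predicts the sample
* Perplexity is defined as $2^{-l}$ where

$$l = \frac{1}{M} \sum_{i = 1}^{m} \log_2 p(s_i)$$

where $m$ is the number of sentences in the test data, $s_i$ is the $i$-th sentence, and $M$ is the total number of words in the test data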

* Intuitively, perplexity is the average amount of "surprise" that the model experiences for every new word. It can also be thought of as the "effective vocabulary size"
* Lower perplexity = better model
* Lower perplexity means the model assigns higher probability to the test data, i.e. it is less "surprised" by what it sees
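
A minimal sketch of the perplexity computation from per-sentence log probabilities. The function name and the toy setup (a uniform model over 8 word types, scoring two sentences of 3 and 5 words) are assumptions for illustration.

```python
import math

def perplexity(sentence_log2_probs, total_words):
    """Compute 2 ** (-l), where l is the average log2 probability per word."""
    l = sum(sentence_log2_probs) / total_words
    return 2 ** (-l)

# Hypothetical uniform model: every word has probability 1/8, so a sentence
# of n words has log2 probability n * log2(1/8).
log_probs = [3 * math.log2(1 / 8), 5 * math.log2(1 / 8)]
ppl = perplexity(log_probs, total_words=3 + 5)
```

For this uniform model the perplexity equals the vocabulary size of 8, which is the sense in which perplexity acts as an "effective vocabulary size".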