# Probabilistic Language Modeling

**Goal:**
- To compute the probability of a sentence (i.e. sequence of words)
$$
P(W)= P(w_1,w_2,w_3,\ldots,w_n)
$$
- Example:
    - Probability of "High winds tonight" > Probability of "Large winds tonight"
    - P("The post office is 15 minutes from here") > P("The post office is 15 minuets from here")
    - P("I want to eat steak") > P("I want to ride steak")

**Related Task Example:**
- Calculate the probability of the upcoming word:
$$
P(w_4|w_1,w_2,w_3)
$$
- Example:
    - In the context of text in restaurant ads:
    - $P(\mathrm{eat}|\mathrm{I}, \mathrm{want}, \mathrm{to}) > P(\mathrm{sleep}|\mathrm{I}, \mathrm{want}, \mathrm{to})$

**A model that computes either of these probabilities is called a basic language model**
- A grammar-based model is better in principle.
- But sometimes this model is sufficient.
- And it is relatively easy to compute compared with a grammar-based model.

---
## Calculating The Probability

$$
P(\mathrm{I}, \mathrm{want}, \mathrm{to},\mathrm{eat})
$$

Use the chain rule!

$$
P(a,b,c,d) = P(a)\cdot P(b|a) \cdot P(c|a,b) \cdot P(d|a,b,c)
$$

or 

$$
P(x_1,x_2,\ldots,x_n) = P(x_1)\cdot P(x_2|x_1) \cdot \ldots \cdot P(x_n|x_1,\ldots,x_{n-1})
$$
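
Applied to the example sentence, with $a=\mathrm{I}$, $b=\mathrm{want}$, $c=\mathrm{to}$, $d=\mathrm{eat}$, the chain rule expands to:

$$
P(\mathrm{I}, \mathrm{want}, \mathrm{to}, \mathrm{eat}) = P(\mathrm{I})\cdot P(\mathrm{want}|\mathrm{I}) \cdot P(\mathrm{to}|\mathrm{I}, \mathrm{want}) \cdot P(\mathrm{eat}|\mathrm{I}, \mathrm{want}, \mathrm{to})
$$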

**Problem**

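Estimating each conditional probability directly from counts:

$$
P(\mathrm{eat}|\mathrm{I}, \mathrm{want}, \mathrm{to}) = \frac{\mathrm{Count}(\mathrm{I}, \mathrm{want}, \mathrm{to}, \mathrm{eat})}{\mathrm{Count}(\mathrm{I}, \mathrm{want}, \mathrm{to})}
$$

* Too many possible word sequences!
* Hard to estimate: most long sequences never appear in the training data, so the counts are too sparse.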
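
A minimal sketch of this count-based estimate. The corpus and the helper `count_sequence` are hypothetical, made up only for illustration:

```python
# Hypothetical toy corpus; each sentence is tokenized by whitespace.
corpus = [
    "i want to eat",
    "i want to sleep",
    "i want to eat steak",
    "you want to eat",
]

def count_sequence(sentences, words):
    """Count how often `words` occurs as a contiguous run in the corpus."""
    n = len(words)
    total = 0
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == words:
                total += 1
    return total

history = ["i", "want", "to"]
numerator = count_sequence(corpus, history + ["eat"])   # Count(i, want, to, eat)
denominator = count_sequence(corpus, history)           # Count(i, want, to)
p_eat = numerator / denominator

print(f"P(eat | i, want, to) = {numerator}/{denominator} = {p_eat:.3f}")
```

With a longer history, the denominator is often zero because the exact word sequence never occurs in the corpus, which is the estimation problem described above.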

**Simplification**

Using the Markov assumption, condition only on the last one or two words instead of the full history:
- $P(\mathrm{eat}|\mathrm{I}, \mathrm{want}, \mathrm{to}) \approx P(\mathrm{eat}|\mathrm{to})$ (bigram)
- $P(\mathrm{eat}|\mathrm{I}, \mathrm{want}, \mathrm{to}) \approx P(\mathrm{eat}|\mathrm{want}, \mathrm{to})$ (trigram)
---
### Unigram Model

$$
P(w_1,w_2,\ldots,w_n) \approx \prod_{i=1}^{n} P(w_i)
$$

- Looks only at the probability of each individual word.
- Ignores context (the previous words).
- Simplest model.
- But a poor model of language.
- Avoid it unless there is no other way.
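
A small sketch of a unigram model, assuming a made-up token list (`tokens`) and hypothetical helper names. It also shows why ignoring context is a weakness: a scrambled sentence scores the same as the original.

```python
import math
from collections import Counter

# Hypothetical training tokens, for illustration only.
tokens = "i want to eat i want to sleep i want steak".split()
counts = Counter(tokens)
total = sum(counts.values())

def unigram_prob(word):
    """Relative frequency of a single word in the training tokens."""
    return counts[word] / total

def sentence_logprob(sentence):
    """Sum of unigram log-probabilities; word order plays no role."""
    return sum(math.log(unigram_prob(w)) for w in sentence.split())

original = sentence_logprob("i want to eat")
scrambled = sentence_logprob("eat to want i")

print(f"log P(original)  = {original:.4f}")
print(f"log P(scrambled) = {scrambled:.4f}")
```

Words that never occur in `tokens` would give `math.log(0)` and raise an error in this sketch.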

---
### Bigram Model

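$$
P(w_i|w_1,w_2,\ldots,w_{i-1}) \approx P(w_i|w_{i-1})
$$

so the probability of the whole sentence becomes

$$
P(w_1,w_2,\ldots,w_n) \approx \prod_{i=1}^{n} P(w_i|w_{i-1})
$$

where $w_0$ is a sentence-start marker.

- Better than unigram.
- Sees the context up to 1 word back.
- Not very complex.
- Sometimes good enough.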

**Estimating bigram probabilities**

$$
P(w_i|w_{i-1}) = \frac{\mathrm{count}(w_{i-1},w_i)}{\mathrm{count}(w_{i-1})}
$$
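
The count ratio above can be sketched as follows. The sentences are made up, and the boundary markers `<s>` and `</s>` are an assumption added here so that the first word also has a predecessor:

```python
from collections import Counter

# Hypothetical training sentences, for illustration only.
sentences = ["i want to eat", "i want to sleep", "i want steak"]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in sentences:
    # <s> and </s> are assumed sentence-boundary markers.
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """count(prev, word) / count(prev), as in the formula above."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(f"P(to | want)  = {bigram_prob('want', 'to'):.3f}")
print(f"P(eat | to)   = {bigram_prob('to', 'eat'):.3f}")
print(f"P(steak | to) = {bigram_prob('to', 'steak'):.3f}")
```

In this sketch a bigram that never occurs gets probability zero, and a `prev` word that never occurs raises `ZeroDivisionError`.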

---
### Trigrams, 4-grams, 5-grams, etc.
- Condition on a longer history.
- Usually better.
- But the probabilities are harder to estimate.
    - i.e. longer word sequences occur less often in the training data, so their counts are sparser.
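
To get a feel for why longer n-grams are harder to estimate, the sketch below compares the number of distinct n-grams observed in a made-up token list with the number of possible n-grams ($V^n$ for a vocabulary of size $V$). The helper `ngrams` and the data are hypothetical:

```python
def ngrams(tokens, n):
    """All contiguous n-word tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical training tokens, for illustration only.
tokens = "i want to eat and i want to sleep and i want to eat steak".split()
vocab_size = len(set(tokens))

for n in (1, 2, 3, 4):
    observed = len(set(ngrams(tokens, n)))
    possible = vocab_size ** n
    print(f"n={n}: {observed} distinct n-grams observed out of {possible} possible")
```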
---
### Practical Tips
- Do everything in log space
    - Avoid underflow

```python
import numpy as np

# Example word probabilities
p1 = 0.1
p2 = 0.02
p3 = 0.015
p4 = 0.2
p5 = 0.05

# Direct product vs. sum of logs
p = p1*p2*p3*p4*p5
logp = np.log(p1) + np.log(p2) + np.log(p3) + np.log(p4) + np.log(p5)

print(f'probability : {p}')
print(f'log-probability : {logp}')
```

```
probability : 3.0000000000000004e-07
log-probability : -15.01948336229021
```
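
With only five words the direct product is still representable. The sketch below uses a hypothetical 100-word sequence in which every word has probability `1e-5`, where the product underflows to zero but the sum of logs stays usable:

```python
import math

# Hypothetical: 100 words, each with probability 1e-5.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p          # true value is 1e-500, below the float64 range

log_sum = sum(math.log(p) for p in probs)

print(f"product : {product}")
print(f"log-sum : {log_sum}")
```

The product prints as `0.0`, so two such sentences could no longer be compared, while their log-probabilities still can.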


---
## Evaluation

How do we evaluate a language model?

Using **perplexity**:
- A measure of how well a probability model predicts a sample.
- Low perplexity indicates that the model is good at predicting the sample.
- Minimizing perplexity is equivalent to maximizing the probability the model assigns to the test data.

<img src="https://images1.programmersought.com/109/ed/ed7002017e9dcf5bfdaed1a0dc845d55.png" width= 500px;/>
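
A minimal sketch of the computation, using the standard definition $PP(W) = P(w_1,\ldots,w_N)^{-1/N}$ evaluated in log space. The function name `perplexity` and the per-word probabilities are hypothetical:

```python
import math

def perplexity(word_probs):
    """Perplexity from the probability the model assigned to each word."""
    n = len(word_probs)
    avg_log_prob = sum(math.log(p) for p in word_probs) / n
    return math.exp(-avg_log_prob)

# Hypothetical per-word probabilities from two models on the same sentence.
confident_model = [0.5, 0.4, 0.5, 0.25]
uncertain_model = [0.1, 0.1, 0.1, 0.1]

print(f"confident model : {perplexity(confident_model):.3f}")
print(f"uncertain model : {perplexity(uncertain_model):.3f}")
```

A model that assigns every word probability 0.1 has a perplexity of about 10, as if it were choosing uniformly among 10 words at each step.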

---
## Unknown Words
- Words that the model has never seen before are called out-of-vocabulary (OOV) words.
- So far, our vocabulary is closed (it contains only a fixed set of words).
- What if the model encounters a word it has never seen before?
- Use an OOV token:
    - The percentage of OOV words that appear in a test set is called the OOV rate.
    - We usually represent such words with the pseudo-word `<UNK>`.
- Two common ways to use `<UNK>`:
- First:
    - Choose a vocabulary in advance.
    - In the training set, convert any word that is not in the vocabulary to `<UNK>`.
    - Estimate the probability of `<UNK>` from its counts, just like any other regular word.
- Second:
    - Similar to the first approach.
    - But instead, we choose the vocabulary based on frequency.
    - e.g. words that occur fewer than $n$ times are replaced with `<UNK>`.
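
The second, frequency-based approach might look like the sketch below. The data, the threshold `min_count`, and the helper `replace_oov` are hypothetical:

```python
from collections import Counter

# Hypothetical training and test tokens, for illustration only.
train_tokens = "i want to eat steak i want to eat rice i want tea".split()
test_tokens = "i want to eat pizza".split()
min_count = 2   # words seen fewer than min_count times become <UNK>

counts = Counter(train_tokens)
vocab = {word for word, c in counts.items() if c >= min_count}

def replace_oov(tokens, vocab):
    """Map every token outside the vocabulary to the pseudo-word <UNK>."""
    return [w if w in vocab else "<UNK>" for w in tokens]

train_unk = replace_oov(train_tokens, vocab)
test_unk = replace_oov(test_tokens, vocab)
oov_rate = sum(w not in vocab for w in test_tokens) / len(test_tokens)

print(train_unk)
print(test_unk)
print(f"OOV rate on the test tokens: {oov_rate:.0%}")
```

After this step `<UNK>` has counts in `train_unk`, so its probability can be estimated like that of any other word.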

---
## Smoothing

- Some words are in our vocabulary but appear in the test set in a context never seen in training.
- Example:
    - Training set: "I want to learn english language"
    - Test set: "I want to eat english breakfast"
    - The word "english" never comes after the word "eat" in the training set.
    - The estimated probability of that bigram is zero, which makes the probability of the whole sentence zero.
- Use a **smoothing** technique:
    - Add-one smoothing (Laplace)
    - Add-k
    - Stupid Backoff
    - Kneser-Ney (Most recommended)
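
As an illustration of the simplest options in this list, here is a sketch of add-k smoothing for bigrams (add-one, or Laplace, when `k=1`). It adds `k` to every bigram count and `k` times the vocabulary size to the denominator. The corpus and the function name `add_k_prob` are hypothetical:

```python
import math
from collections import Counter

# Hypothetical training sentences, for illustration only.
sentences = ["i want to eat steak", "i want to learn english"]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in sentences:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

vocab_size = len(unigram_counts)

def add_k_prob(prev, word, k=1.0):
    """Add-k estimate: (count(prev, word) + k) / (count(prev) + k * V)."""
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)

# "english" never follows "eat" in training, yet it no longer gets zero.
print(f"P(english | eat), add-one : {add_k_prob('eat', 'english'):.3f}")
print(f"P(steak | eat),   add-one : {add_k_prob('eat', 'steak'):.3f}")
print(f"P(english | eat), k=0.1   : {add_k_prob('eat', 'english', k=0.1):.3f}")
```

Add-one moves a lot of probability mass to unseen bigrams; a smaller `k` moves less. Stupid Backoff and Kneser-Ney are not covered by this sketch.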