# Language Models

## General Concept

Language models are computational models that assign probabilities to sequences of words or predict the next word in a sequence. They play a crucial role in various natural language processing tasks, such as:

- Speech recognition
- Machine translation
- Spelling and grammar correction
- Text generation

Language models capture the structure and patterns within a language, allowing them to estimate how likely a given word or phrase is to appear in a specific context. There are different types of language models, such as:

- N-gram models, which rely on sequences of words
- Neural language models, which utilize deep learning techniques to understand and generate text

Understanding and developing effective language models is essential for improving the performance of natural language processing systems.


## Why do we need language models?

Language models are essential for various reasons in natural language processing tasks:

1. **Disambiguation**: Language models help in resolving ambiguities in speech recognition and text processing, as they can assign probabilities to different interpretations based on the context, selecting the most likely one.

2. **Machine Translation**: In translating text from one language to another, language models can help choose the most fluent and accurate translations by estimating the likelihood of word sequences in the target language.

3. **Text Generation**: Language models can generate coherent and contextually relevant text, which is useful for tasks like summarization, question-answering, and dialogue systems.

4. **Spelling and Grammar Correction**: Language models can identify and correct errors in written text by comparing the probabilities of different word sequences and suggesting more likely alternatives.

5. **Assistive Technologies**: Language models are crucial for augmentative and alternative communication (AAC) systems, as they can predict and suggest likely words or phrases for users with speech or language impairments, making communication more efficient.

Overall, language models play a critical role in improving the performance and accuracy of natural language processing systems by capturing the structure, patterns, and nuances of a language.


## N-gram Language Models

N-gram language models are a simple yet powerful approach to language modeling. They predict the next word in a sequence based on the previous n-1 words, where n is the order of the model. The main types of n-grams are:

- **Unigram**: Considers only a single word, ignoring context (n=1).
- **Bigram**: Considers a sequence of two words (n=2).
- **Trigram**: Considers a sequence of three words (n=3).
- **Higher-order n-grams**: Considers longer sequences of words (n>3).

In n-gram models, the probability of a word depends only on the (n-1) previous words. This simplification is known as the Markov assumption. To estimate the probabilities, n-gram models rely on counting the occurrences of n-grams in a large corpus and normalizing the counts.
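As a rough illustration of the counting step, the sketch below extracts n-grams from a tokenized sentence and counts them. The function name `extract_ngrams` and the whitespace tokenization are assumptions made for this example, not part of any particular library.

```python
from collections import Counter


def extract_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat on the mat".split()
bigram_counts = Counter(extract_ngrams(tokens, 2))
print(extract_ngrams(tokens, 3)[:2])
print(bigram_counts[("the", "cat")])
```

Normalizing these counts by the count of the preceding context gives the probability estimates.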

Advantages of n-gram models:
- Relatively simple and easy to implement.
- Efficient in terms of computation and memory usage for small n, although storage grows quickly with the order of the model.

Limitations of n-gram models:
- Unable to capture long-range dependencies between words.
- Sensitive to data sparsity issues, as some n-grams may not appear in the training corpus.

Despite their limitations, n-gram models serve as a foundational tool for understanding language modeling concepts and are still used in various NLP applications.


## N-Grams and Probability Estimation

N-grams can be used to estimate the probability of a word given a history (P(w|h)). For instance, if the history h is "I like to eat", we might want to estimate the probability of the next word being "pizza" (P(pizza|I like to eat)).

### Estimating Probabilities from Counts
We can estimate this probability by counting the occurrences of the history followed by the target word in a large corpus and dividing by the total count of the history:
$$
P(\text{pizza}|\text{I like to eat}) = \frac{C(\text{I like to eat pizza})}{C(\text{I like to eat})}
$$
$$
P(\text{먹습니다}|\text{저는 김치를}) = \frac{C(\text{저는 김치를 먹습니다})}{C(\text{저는 김치를})}
$$
However, even a very large corpus is often insufficient for good estimates: language is creative, so many perfectly valid word sequences never appear in the corpus and receive a count of zero.
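A minimal sketch of this relative-frequency estimate, using a tiny made-up corpus (the sentences and the helper names `count_sequence` and `relative_frequency` are assumptions for illustration). It also shows the sparsity problem: a continuation that never occurs gets probability zero.

```python
def count_sequence(tokens, seq):
    """Count how many times the token sequence seq occurs in tokens."""
    n = len(seq)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == seq)


def relative_frequency(tokens, history, word):
    """Estimate P(word | history) as C(history + word) / C(history)."""
    history_count = count_sequence(tokens, history)
    if history_count == 0:
        return None  # the history itself was never observed
    return count_sequence(tokens, history + [word]) / history_count


# Toy corpus, invented for this example.
corpus = (
    "i like to eat pizza . "
    "i like to eat pasta . "
    "i like to eat pizza ."
).split()

history = "i like to eat".split()
print(relative_frequency(corpus, history, "pizza"))  # 2 of 3 occurrences
print(relative_frequency(corpus, history, "sushi"))  # unseen continuation
```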

### Bigram Model
To tackle this issue, we can use the bigram model, which approximates the probability of a word given its entire history by considering only the preceding word:
$$
P(\text{pizza}|\text{I like to eat}) \approx P(\text{pizza}|\text{eat})
$$
$$
P(\text{먹습니다}|\text{저는 김치를}) \approx P(\text{먹습니다}|\text{김치를})
$$
This simplification allows us to estimate probabilities more reliably, because short sequences occur far more often in a corpus than long ones, but it cannot capture dependencies on words further back in the history.


## Estimating Joint Probabilities of Word Sequences

To estimate the joint probability of an entire sequence of words, such as "The cat sat on the mat", we can decompose this probability using the chain rule of probability:

$$
P(w_{1:n}) = P(w_1)P(w_2|w_1)P(w_3|w_{1:2}) \ldots P(w_n|w_{1:n-1}) = \prod_{k=1}^n P(w_k|w_{1:k-1})
$$

This decomposition shows the link between computing the joint probability of a sequence and computing the conditional probability of a word given previous words. However, it doesn't really seem to help us! Computing the exact probability of a word given a long sequence of preceding words (e.g., $P(w_n|w_{1:n-1})$) is challenging because language is creative, and any particular context might have never occurred before.

### Example

For the sequence "The cat sat on the mat", we can decompose the joint probability as follows:

$$
P(\text{The, cat, sat, on, the, mat}) = P(\text{The})P(\text{cat}|\text{The})P(\text{sat}|\text{The, cat})...P(\text{mat}|\text{The, cat, sat, on, the})
$$

Estimating each conditional probability using counts from a large corpus is not feasible since many long sequences might never have occurred before.
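To make the growth of the conditioning context concrete, the following sketch lists the chain-rule factors for a token sequence. The helper name `chain_rule_factors` is hypothetical; it only enumerates the factors and does not estimate any probabilities.

```python
def chain_rule_factors(tokens):
    """Return (word, history) pairs, one per factor P(word | history)."""
    return [(tokens[k], tuple(tokens[:k])) for k in range(len(tokens))]


tokens = "The cat sat on the mat".split()
for word, history in chain_rule_factors(tokens):
    context = ", ".join(history)
    print(f"P({word} | {context})" if history else f"P({word})")
```

The history of the last factor contains every preceding word, which is why exact counts for it are so rarely available.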


## N-gram Models and Markov Assumption

N-gram models are used to predict the probability of a word given a fixed number of preceding words. The assumption that the probability of a word depends only on a limited number of previous words is called the Markov assumption.

For a bigram model (N=2), the approximation is:

$$
P(w_n|w_{1:n-1}) \approx P(w_n|w_{n-1})
$$

For an n-gram model with size N, the approximation is:

$$
P(w_n|w_{1:n-1}) \approx P(w_n|w_{n-N+1:n-1})
$$

Given the n-gram assumption for the probability of an individual word, we can compute the probability of a complete word sequence as:

$$
P(w_{1:n}) \approx \prod_{k=1}^n P(w_k|w_{k-N+1:k-1})
$$

### Example

For the trigram model (N=3), the approximation for the sequence "The cat sat on the mat" is:

$$
P(\text{The, cat, sat, on, the, mat}) \approx P(\text{The})P(\text{cat}|\text{The})P(\text{sat}|\text{The, cat})P(\text{on}|\text{cat, sat})P(\text{the}|\text{sat, on})P(\text{mat}|\text{on, the})
$$
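The same factorization with the history truncated to the previous N-1 words can be sketched as follows. The function name `ngram_factors` is an assumption for this example; like the trigram example above, it uses shorter contexts at the start of the sequence instead of padding symbols.

```python
def ngram_factors(tokens, n):
    """Return (word, context) pairs, keeping at most n - 1 previous words."""
    factors = []
    for k, word in enumerate(tokens):
        start = max(0, k - (n - 1))
        factors.append((word, tuple(tokens[start:k])))
    return factors


tokens = "The cat sat on the mat".split()
for word, context in ngram_factors(tokens, 3):
    print(f"P({word} | {', '.join(context)})" if context else f"P({word})")
```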


### Stock Prices and Markov Assumption

The Markov assumption can be applied to model stock prices as well. In finance, the Markov assumption is often used to represent the idea that future stock prices depend only on the current price and a limited number of past prices, rather than the entire price history.

The relationship between stock prices and the Markov assumption can be understood as follows:

1. **Memoryless Property**: The Markov assumption implies that the stock price at a certain time is influenced only by a fixed number of previous time steps, so future price movement depends on the recent past rather than on the entire history. When only the current state matters (a first-order model), this is known as the memoryless property of Markov models.

2. **Simplifying Complex Systems**: Stock prices are affected by a vast number of factors, including market trends, company performance, global events, and investor sentiment. By applying the Markov assumption, we can simplify the modeling of stock prices by focusing only on the most recent price changes, which are assumed to capture relevant information.

3. **Prediction and Analysis**: Using the Markov assumption in financial models allows us to predict and analyze stock price movements. For example, we can create Markov models to estimate the probabilities of stock price changes, which can be useful for trading strategies, risk management, and portfolio optimization.

It's important to note that the Markov assumption is a simplification and may not always accurately represent the complexities of the stock market. However, it serves as a useful tool in modeling and analyzing stock prices.
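As a minimal sketch of the first-order case, the code below estimates transition probabilities between two hypothetical states, "up" and "down", from a short invented sequence of daily moves. The data and the function name `transition_probabilities` are assumptions for illustration only, not real market data or a recommended trading model.

```python
from collections import Counter, defaultdict


def transition_probabilities(states):
    """Estimate P(next | current) by counting adjacent state pairs."""
    pair_counts = Counter(zip(states, states[1:]))
    current_counts = Counter(states[:-1])
    probs = defaultdict(dict)
    for (current, nxt), count in pair_counts.items():
        probs[current][nxt] = count / current_counts[current]
    return dict(probs)


# Invented sequence of daily price moves.
moves = ["up", "up", "down", "up", "down", "down", "up", "up"]
print(transition_probabilities(moves))
```

This is the same count-and-normalize estimate used for bigrams, applied to price moves instead of words.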


## Estimating Bigram or N-gram Probabilities using Maximum Likelihood Estimation (MLE)

To estimate the probabilities for bigrams or n-grams, we use maximum likelihood estimation (MLE) by counting the occurrences in a corpus and normalizing the counts.

MLE is a statistical method for estimating the parameters of a probability distribution or model. It works by finding the parameter values that maximize the likelihood of the observed data under the given model. In other words, MLE selects the parameters that make the observed data most probable.

For example, if we have a dataset of coin tosses (heads and tails) and want to estimate the probability of getting heads, MLE picks the parameter value that makes the observed sequence of tosses most likely. For independent tosses, that value is simply the ratio of the number of heads to the total number of tosses.
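The ratio for the coin example can be derived in a few steps. Assuming independent tosses with heads probability $p$, and writing $h$ for the number of heads in $n$ tosses (symbols introduced here for illustration), the likelihood and its logarithm are:

$$
L(p) = p^{h}(1-p)^{n-h}, \qquad \log L(p) = h \log p + (n-h)\log(1-p)
$$

Setting the derivative to zero gives the estimate:

$$
\frac{d}{dp}\log L(p) = \frac{h}{p} - \frac{n-h}{1-p} = 0 \quad\Rightarrow\quad \hat{p} = \frac{h}{n}
$$

For instance, 7 heads in 10 tosses gives $\hat{p} = 7/10$.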

### Bigram Probability Calculation

Compute the bigram probability of a word $w_n$ given the previous word $w_{n-1}$:

$$ P(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})} $$

where $C(w_{n-1}w_n)$ is the count of the bigram $w_{n-1}w_n$ and $C(w_{n-1})$ is the count of the unigram $w_{n-1}$.

### Example

Consider a mini-corpus with three sentences:

1. `<s> I am Sam </s>`
2. `<s> Sam I am </s>`
3. `<s> I do not like green eggs and ham </s>`

Here are the bigram probabilities for some pairs in this corpus:

| Bigram         | Probability |
|----------------|-------------|
| P(I\|`<s>`)       | 2/3         |
| P(Sam\|`<s>`)     | 1/3         |
| P(am\|I)        | 2/3         |
| P(`</s>`\|Sam)    | 1/2         |
| P(Sam\|am)      | 1/2         |
| P(do\|I)        | 1/3         |

For general MLE n-gram parameter estimation:

$$ P(w_n|w_{n-N+1:n-1}) = \frac{C(w_{n-N+1:n-1} w_n)}{C(w_{n-N+1:n-1})} $$
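
A hedged sketch of the general estimator follows. It assumes each sentence is padded with N-1 start symbols `<s>` and one end symbol `</s>` (a common convention, stated here as an assumption), and the names `train_ngram_counts` and `ngram_probability` are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def train_ngram_counts(sentences, n):
    """Count n-grams and their (n - 1)-word contexts over padded sentences."""
    ngram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            ngram_counts[ngram] += 1
            context_counts[ngram[:-1]] += 1
    return ngram_counts, context_counts


def ngram_probability(word, context, ngram_counts, context_counts):
    """MLE estimate C(context word) / C(context); None for unseen contexts."""
    context = tuple(context)
    if context_counts[context] == 0:
        return None
    return Fraction(ngram_counts[context + (word,)], context_counts[context])


sentences = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
ngrams, contexts = train_ngram_counts(sentences, 3)
print(ngram_probability("I", ["<s>", "<s>"], ngrams, contexts))
print(ngram_probability("am", ["<s>", "I"], ngrams, contexts))
```

Counting each context as the prefix of an observed n-gram keeps the estimates for any seen context summing to one.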

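Even simple bigram probabilities encode useful information about a language: some reflect syntax (which kinds of words tend to follow one another), while others reflect the task or domain the corpus was drawn from, or the cultural preferences of its speakers.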



## Practical Issues in N-gram Models

1. **Higher-order n-grams**: In practice, trigram (conditioning on the previous two words), 4-gram, or even 5-gram models are more common when there is sufficient training data. For larger n-grams, extra context symbols are needed at the sentence boundaries. For trigrams, for example, two start symbols are used so that the first word has a full context, as in P(I|`<s><s>`).

2. **Log probabilities**: Since multiplying probabilities can lead to numerical underflow, it's better to represent and compute language model probabilities in log format. Adding log probabilities is equivalent to multiplying probabilities in linear space. Convert back to probabilities when needed by taking the exponential of the log probability:

$$ p_1 \times p_2 \times p_3 \times p_4 = \exp(\log{p_1} + \log{p_2} + \log{p_3} + \log{p_4}) $$
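
The underflow problem is easy to reproduce. In this sketch, the probability value 0.01 and the sequence length 1000 are arbitrary choices for illustration.

```python
import math

probabilities = [0.01] * 1000  # arbitrary illustrative values

# Multiplying directly underflows to 0.0 in double precision.
product = 1.0
for p in probabilities:
    product *= p

# Summing logs keeps the result representable.
log_probability = sum(math.log(p) for p in probabilities)

print(product)
print(log_probability)
```

Comparing two candidate sequences only requires comparing their log probabilities, so the conversion back with `math.exp` can often be skipped.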


## References

- [N-gram Language Models](https://web.stanford.edu/~jurafsky/slp3/3.pdf)
