# N-Gram Models in NLP

## 📖 What is an N-Gram?

An **N-gram** is a contiguous sequence of $n$ items (usually words or characters) from a given sample of text or speech.

- If $n = 1$: **Unigram** (single word)
- If $n = 2$: **Bigram** (pair of words)
- If $n = 3$: **Trigram** (three-word phrase)
- Larger values of $n$ are usually named by number: 4-gram, 5-gram, and so on.

---

## Why Use N-Grams?

N-grams are used to capture **local context** and **sequential structure** in text. They are foundational in many NLP tasks such as:

- Text classification
- Language modeling
- Text generation
- Spelling correction
- Machine translation

---

## 🧾 Example

### Sentence:

"I love natural language processing"


### Unigrams:
["I", "love", "natural", "language", "processing"]

### Bigrams:
["I love", "love natural", "natural language", "language processing"]

### Trigrams:
["I love natural", "love natural language", "natural language processing"]
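
The lists above can be reproduced with a sliding window over the tokens. The following is a minimal sketch, assuming simple whitespace tokenization; the helper name `ngrams` is illustrative and not part of any library. Because it works on any sequence, the same function also yields character n-grams.

```python
def ngrams(items, n):
    """Return all contiguous length-n windows of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

sentence = "I love natural language processing"
tokens = sentence.split()  # assumption: whitespace tokenization is good enough here

# Word n-grams for n = 1, 2, 3
for n in (1, 2, 3):
    print(n, [" ".join(gram) for gram in ngrams(tokens, n)])

# Character bigrams of a single word
print(["".join(gram) for gram in ngrams("love", 2)])
```

A sequence of length $L$ has $L - n + 1$ n-grams, so the five-word sentence gives five unigrams, four bigrams, and three trigrams.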


---

## Mathematical Representation

Let $W = w_1, w_2, ..., w_m$ be a sequence of $m$ words (the length is written $m$ to keep it distinct from the n-gram order $n$). By the chain rule, the **probability of the entire sequence** is:

$$
P(w_1, w_2, ..., w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, w_2, ..., w_{i-1})
$$

Using an **N-gram approximation** (Markov assumption):

$$
P(w_i \mid w_1, w_2, ..., w_{i-1}) \approx P(w_i \mid w_{i-(n-1)}, ..., w_{i-1})
$$

This simplifies estimation: each word is conditioned on only the previous $n-1$ words instead of the full history, so the probabilities can be estimated from n-gram counts.
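
For the bigram case ($n = 2$), one common way to estimate these conditional probabilities is by relative frequency: $P(w_i \mid w_{i-1}) \approx \text{count}(w_{i-1}, w_i) / \text{count}(w_{i-1})$. The sketch below illustrates this under stated assumptions: the toy corpus, the `<s>` and `</s>` boundary markers, and the function names are all hypothetical choices made for the example, and no smoothing is applied.

```python
from collections import Counter

# Hypothetical toy corpus, chosen only for illustration.
corpus = [
    "i love natural language processing",
    "i love machine learning",
    "natural language processing is fun",
]

context_counts = Counter()  # how often each word appears as a left context
bigram_counts = Counter()   # how often each (previous, current) pair appears

for sentence in corpus:
    # Assumed boundary markers so the first and last words also get a context.
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    context_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """Relative-frequency estimate of P(word | prev); 0.0 for unseen contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def sentence_prob(sentence):
    """Product of bigram probabilities over the sentence (Markov assumption, n = 2)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(bigram_prob("love", "natural"))
print(sentence_prob("i love natural language processing"))
print(sentence_prob("i love fun"))  # contains an unseen bigram
```

Any sentence containing a bigram that never occurred in the corpus receives probability zero, which is why practical n-gram models usually add some form of smoothing.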

---

## Applications of N-Grams

- **Text classification**: Feature engineering with bigrams or trigrams
- **Autocomplete**: Predicting next word based on history
- **Plagiarism detection**: Matching overlapping n-grams
- **Spelling correction**: Detecting unusual n-gram sequences
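
As a rough illustration of the overlap-matching idea listed above, two texts can be compared by the share of word trigrams they have in common. This is a minimal sketch, assuming lowercase whitespace tokenization and Jaccard similarity as the overlap measure; the function names and example sentences are hypothetical, and real plagiarism detectors are considerably more elaborate.

```python
def word_ngrams(text, n):
    """Set of word n-grams (as tuples) from lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(text_a, text_b, n=3):
    """Jaccard similarity between the n-gram sets of two texts (0.0 to 1.0)."""
    grams_a = word_ngrams(text_a, n)
    grams_b = word_ngrams(text_b, n)
    if not grams_a or not grams_b:
        return 0.0  # a text shorter than n words has no n-grams to compare
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# Hypothetical example sentences
a = "the quick brown fox jumps over the lazy dog"
b = "a quick brown fox jumps over a sleeping cat"

print(f"trigram overlap: {ngram_overlap(a, b):.3f}")
print(f"identical texts: {ngram_overlap(a, a):.3f}")
```

Using sets ignores how often each n-gram repeats; counting repeats would need a multiset such as `collections.Counter` instead.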

---

## ⚖️ Pros and Cons

| Aspect                    | Notes                                   |
|---------------------------|-----------------------------------------|
| ✅ Easy to implement      | Yes                                     |
| ✅ Captures local context | Yes                                     |
| ❌ Requires large corpus  | For high $n$, data sparsity increases   |
| ❌ No deep semantics      | Context is only partial and local       |
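
The sparsity row can be made concrete by counting how many distinct n-grams occur only once as $n$ grows. The following is a small sketch on a hypothetical toy sentence; the helper name `singleton_fraction` is illustrative, and the exact numbers depend entirely on the text used.

```python
from collections import Counter

def singleton_fraction(tokens, n):
    """Return (number of distinct n-grams, fraction of them seen exactly once)."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not counts:
        return 0, 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return len(counts), singletons / len(counts)

# Hypothetical toy text, chosen only for illustration.
text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()

for n in (1, 2, 3):
    distinct, fraction = singleton_fraction(tokens, n)
    print(f"n={n}: {distinct} distinct n-grams, {fraction:.0%} seen only once")
```

On this text the share of n-grams seen only once rises with $n$, so the counts behind each probability estimate get thinner as the order increases.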

---

## Summary

- N-grams model sequences by capturing which words (or characters) appear **next to each other**.
- They are **simple but powerful** tools for understanding context.
- Higher-order N-grams can be **more accurate**, but require **more data**.
- Often used as **baseline models** or **features** in larger NLP pipelines.

> 💡 N-gram models were foundational in early NLP, and they still play a role in many practical systems today, especially where fast, explainable models are needed.


```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus
corpus = [
    "Natural Language Processing is a fascinating field of AI.",
    "Tokenization breaks text into smaller units called tokens.",
    "The vocabulary is built from unique words across the corpus."
]

# Loop through 1-gram to 3-gram
for n in range(1, 4):
    print(f"\n🔢 {n}-Gram Features:\n")

    # ngram_range=(n, n) extracts only n-grams of exactly length n.
    # With default settings, CountVectorizer lowercases the text, strips
    # punctuation, and drops single-character tokens such as "a".
    vectorizer = CountVectorizer(ngram_range=(n, n))
    vectorizer.fit(corpus)  # only the learned vocabulary is needed here

    # Extract and print the n-gram features (sorted alphabetically)
    ngrams = vectorizer.get_feature_names_out()
    print(ngrams)
```



🔢 1-Gram Features:

['across' 'ai' 'breaks' 'built' 'called' 'corpus' 'fascinating' 'field'
 'from' 'into' 'is' 'language' 'natural' 'of' 'processing' 'smaller'
 'text' 'the' 'tokenization' 'tokens' 'unique' 'units' 'vocabulary'
 'words']

🔢 2-Gram Features:

['across the' 'breaks text' 'built from' 'called tokens'
 'fascinating field' 'field of' 'from unique' 'into smaller' 'is built'
 'is fascinating' 'language processing' 'natural language' 'of ai'
 'processing is' 'smaller units' 'text into' 'the corpus' 'the vocabulary'
 'tokenization breaks' 'unique words' 'units called' 'vocabulary is'
 'words across']

🔢 3-Gram Features:

['across the corpus' 'breaks text into' 'built from unique'
 'fascinating field of' 'field of ai' 'from unique words'
 'into smaller units' 'is built from' 'is fascinating field'
 'language processing is' 'natural language processing'
 'processing is fascinating' 'smaller units called' 'text into smaller'
 'the vocabulary is' 'tokenization breaks text' 'uniqu