# 🔁 Natural Language Processing: Understanding N-grams

---

## 🎯 Objective

In this section, we explore **N-grams**, an NLP technique that captures local word order by grouping words into contiguous sequences of length *n*. N-grams are especially useful when Bag of Words or TF-IDF cannot distinguish between sentences that share most of their words but are **contextually opposite**.

---

## ❓ Why Do We Need N-grams?

- Standard **Bag of Words** and **TF-IDF** treat each word **independently**.
- Context is lost in sentences like:
  - “**Food is good**”
  - “**Food is not good**”
- These two are **opposites**, yet BoW represents them almost identically because it only counts how often each word occurs.
- **N-grams** address this by preserving short **word sequences**, which adds **contextual depth**.

---

## 🔍 Example: Sentences

```text
Sentence 1: "Food is good"  
Sentence 2: "Food is not good"
```

## Without N-grams (Unigram BoW)

| Word | S1 | S2 |
| ---- | -- | -- |
| food | 1  | 1  |
| good | 1  | 1  |
| not  | 0  | 1  |

Only one word (“not”) differs, so the two count vectors are nearly identical even though the sentences are semantically opposite. Note that the table omits “is”, treating it as a stop word.
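To make "similar numerically" concrete, here is a minimal sketch that builds the unigram count vectors from the table by hand and compares them with cosine similarity. The helper names (`tokenize`, `count_vector`, `cosine`) and the one-word stop list are illustrative assumptions, not part of any library.

```python
import math
from collections import Counter

STOP_WORDS = {"is"}  # assumption: mirrors the table, which omits "is"

def tokenize(sentence):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

def count_vector(tokens, vocabulary):
    """Count how often each vocabulary entry occurs in tokens."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

s1 = tokenize("Food is good")
s2 = tokenize("Food is not good")
vocabulary = sorted(set(s1) | set(s2))  # ['food', 'good', 'not']

v1 = count_vector(s1, vocabulary)  # [1, 1, 0]
v2 = count_vector(s2, vocabulary)  # [1, 1, 1]
unigram_similarity = cosine(v1, v2)
print(vocabulary, v1, v2, round(unigram_similarity, 3))
```

Under these assumptions the similarity comes out at roughly 0.82, which is high for two sentences with opposite meanings.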

## 🔗 With N-grams (Bigram)
We now count consecutive word pairs (the stop word “is” is dropped before pairing):

| Bigram    | S1 | S2 |
| --------- | -- | -- |
| food good | 1  | 0  |
| food not  | 0  | 1  |
| not good  | 0  | 1  |

The two sentences now share no bigram at all, so their vectors differ in every dimension.
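The bigram table can be reproduced with a few lines of plain Python. This is a sketch under the same assumption as the table (the stop word "is" is removed before pairing); the `bigrams` helper is a hypothetical name introduced only for illustration.

```python
from collections import Counter

STOP_WORDS = {"is"}  # assumption: mirrors the table, which drops "is"

def bigrams(sentence):
    """Return consecutive word pairs after lowercasing and stop-word removal."""
    tokens = [w for w in sentence.lower().split() if w not in STOP_WORDS]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

b1 = bigrams("Food is good")      # ['food good']
b2 = bigrams("Food is not good")  # ['food not', 'not good']

bigram_vocabulary = sorted(set(b1) | set(b2))
counts1, counts2 = Counter(b1), Counter(b2)
bv1 = [counts1[term] for term in bigram_vocabulary]  # [1, 0, 0]
bv2 = [counts2[term] for term in bigram_vocabulary]  # [0, 1, 1]

shared = set(b1) & set(b2)  # bigrams that occur in both sentences
print(bigram_vocabulary, bv1, bv2, shared)
```

Because `shared` is empty, the dot product of the two bigram vectors is zero, in contrast to the large overlap of the unigram vectors.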


## 📊 Types of N-grams

| Type    | n-gram Range | Description                   |
| ------- | ------------ | ----------------------------- |
| Unigram | (1,1)        | Individual words only         |
| Bigram  | (2,2)        | Consecutive word pairs        |
| Trigram | (3,3)        | Consecutive 3-word sequences  |
| Uni+Bi  | (1,2)        | Unigrams and Bigrams combined |
| Bi+Tri  | (2,3)        | Bigrams and Trigrams only     |
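As a rough illustration of what each range in the table produces, the sketch below generates every n-gram between a minimum and maximum length. The function `ngrams_in_range` is a hypothetical helper written for this example; it imitates the idea of a `(min_n, max_n)` range but is not how any particular library implements it. No stop words are removed here, so "is" stays in.

```python
def ngrams_in_range(tokens, min_n, max_n):
    """Return every contiguous n-gram with min_n <= n <= max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["food", "is", "not", "good"]

ranges = {
    "Unigram": (1, 1),
    "Bigram": (2, 2),
    "Trigram": (3, 3),
    "Uni+Bi": (1, 2),
    "Bi+Tri": (2, 3),
}

for name, (min_n, max_n) in ranges.items():
    print(f"{name:8} {ngrams_in_range(tokens, min_n, max_n)}")
```

For this four-word sentence the ranges yield 4, 3, 2, 7, and 5 n-grams respectively, which shows how quickly combined ranges enlarge the feature set.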



## 🛠️ How to Use in scikit-learn
```python
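from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Food is good", "Food is not good"]  # the two example sentences

# Example for unigram + bigram
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)  # sparse document-term count matrix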
```

- `ngram_range=(1, 2)` → includes unigrams and bigrams
- `ngram_range=(2, 3)` → includes bigrams and trigrams only




## ✅ Conclusion
- N-grams add local word-order context that unigram BoW and TF-IDF discard.
- They often help in sentiment analysis and text classification, where negation and short phrases carry meaning.
- By choosing the right `ngram_range`, we can preserve phrase-level meaning.
- In the next step, we’ll implement N-gram vectorization and test its impact on a model's performance.

🔁 Try experimenting with `ngram_range=(1, 3)` to include unigrams, bigrams, and trigrams together.