<a href="https://colab.research.google.com/github/debojit11/ml_nlp_dl_transformers/blob/main/ML_week_11_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📚 Week 11.5 – Understanding Text Vectorization (CountVectorizer & TF-IDF)
---


### 🤔 Why Vectorize Text?
Machines work with **numbers**, not words.  
Text vectorization = converting **raw text → numerical features**.  
This is one of the **first steps** in almost every classical NLP pipeline.

---



### 🧰 Common Vectorization Methods
| Method            | What it does                              | Notes |
|-------------------|--------------------------------------------|-------|
| CountVectorizer   | Counts how often each word appears in each document | Simple; output is a sparse matrix (mostly zeros) |
| TF-IDF Vectorizer | Weights each word's frequency by how rare the word is across documents | Reduces the weight of common words |

---


## ✏️ Example: CountVectorizer
```python
from sklearn.feature_extraction.text import CountVectorizer
docs = ["I love NLP", "NLP is cool", "I love coding"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("Count Matrix:\n", X.toarray())
```
Output:
```
Vocabulary: ['coding' 'cool' 'is' 'love' 'nlp']
Count Matrix:
 [[0 0 0 1 1]
 [0 1 1 0 1]
 [1 0 0 1 0]]
```

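Each row = a document (here, a sentence)  
Each column = a vocabulary word; each cell = how often that word appears in that document  
"I" is missing from the vocabulary because the default `token_pattern` only keeps tokens of two or more characters. Passing `stop_words='english'` additionally removes common words such as "is".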



## 📏 Example: TF-IDF Vectorizer
TF = Term Frequency: how often a word appears in a document  
IDF = Inverse Document Frequency: how rare a word is across all documents

```python
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", X.toarray())
```

Output: (approximate values)
```
Vocabulary: ['coding' 'cool' 'is' 'love' 'nlp']
TF-IDF Matrix:
 [[0.     0.     0.     0.7071 0.7071]
 [0.     0.6228 0.6228 0.     0.4736]
 [0.7960 0.     0.     0.6053 0.    ]]
```

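TF-IDF **downweights words that appear in many documents**: "nlp" and "love" each occur in two of the three documents, so they get a lower IDF than "cool", "is", or "coding", which occur in only one. That makes it **better than raw counts at highlighting the words that distinguish one document from another**.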
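
To make the weighting concrete, here is a minimal sketch that rebuilds the TF-IDF matrix by hand and compares it with scikit-learn. It assumes the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`), where the IDF of a word is `ln((1 + n) / (1 + df)) + 1`, with `n` the number of documents and `df` the number of documents containing the word. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["I love NLP", "NLP is cool", "I love coding"]

# Step 1: raw term counts (TF), one row per document
counts = CountVectorizer().fit_transform(docs).toarray()

# Step 2: smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 3: multiply TF by IDF, then scale each row to unit (L2) length
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

# Compare with scikit-learn's own result
tfidf_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
print(np.round(tfidf_manual, 4))
print("Matches TfidfVectorizer:", np.allclose(tfidf_manual, tfidf_sklearn))
```

Because each row is normalized to length 1, a document with only two equally rare words (like the first one) ends up with 0.7071 for both.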



## ⚖️ CountVectorizer vs TF-IDF
| Feature | CountVectorizer | TF-IDF |
|---------|----------------|--------|
| Simplicity | ✅ Very simple | ❌ Slightly more complex |
| Common words | ❌ Counted at face value | ✅ Down-weighted (IDF) |
| Rare words | ❌ No boost | ✅ Boosted by IDF |
| Good for | Naive Bayes, baseline models | SVM, Logistic Regression |



## ✨ Real Use Case in NLP
Earlier, we used `TfidfVectorizer` to turn raw SMS messages into **numerical vectors** before feeding them to:
* Logistic Regression
* Random Forest
* KNN
* SVM

Now you understand **what those numbers represent**! 📊
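
As a rough sketch of how the vectorizer and a classifier fit together, the example below chains `TfidfVectorizer` and `LogisticRegression` with `make_pipeline`. The four training messages and their labels are hypothetical stand-ins, not the actual SMS spam dataset, and a model trained on so little data is only a demonstration of the workflow.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages standing in for the SMS spam dataset
train_texts = [
    "Win a free prize now",
    "Free entry, claim your cash reward",
    "Are we still meeting for lunch",
    "Call me when you get home",
]
train_labels = ["spam", "spam", "ham", "ham"]

# The pipeline fits the vectorizer on the training text only,
# then reuses the same vocabulary and IDF weights for new messages
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

new_texts = ["Claim your free prize", "See you at lunch"]
predictions = model.predict(new_texts)
print(list(zip(new_texts, predictions)))
```

Wrapping both steps in one pipeline means new messages are always transformed with the vocabulary learned during training, which avoids accidentally refitting the vectorizer on test data.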



## 🧪 Bonus: N-grams
```python
vectorizer = CountVectorizer(ngram_range=(1,2))
X = vectorizer.fit_transform(["I love NLP", "NLP is fun"])
print("N-grams:", vectorizer.get_feature_names_out())
print(X.toarray())
```

Output:
```
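N-grams: ['fun' 'is' 'is fun' 'love' 'love nlp' 'nlp' 'nlp is']
[[0 0 0 1 1 1 0]
 [1 1 1 0 0 1 1]]
```

`ngram_range=(1,2)` means unigrams + bigrams. Bigrams are built from the tokens that survive tokenization, so there is no "i love" here: the default tokenizer drops the one-character token "I". N-grams are useful for capturing **phrase-level meaning** (e.g., "not good").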



## ✅ Summary
* Vectorization = turning text into **number vectors**.
* `CountVectorizer`: simple, fast.
* `TfidfVectorizer`: smarter, reduces noise from common words.
* Vectorization happens **before training** an ML model on text.

## 🔜 Up Next: Week 12
Now that your text is numerical, you're ready for **MLPs (Neural Networks)** to learn **deeper representations** 🔥 Let's dive into deep learning next!