## 2. TF-IDF (Term Frequency – Inverse Document Frequency)
- Reweights raw word counts by importance:
  - **TF**: How often a word appears in a document.
  - **IDF**: Penalizes words that appear in many documents.


### How it works
TF-IDF builds on the Count Vectorizer.  
Instead of using raw counts, it assigns each word a **weight** that reflects its **importance**: high when the word is frequent in a document but rare across the corpus.


### Formulae

1. **Term Frequency (TF)**  
   Measures how often a word appears in a document:  
   $$
   TF(t,d) = \frac{f_{t,d}}{\sum_{k} f_{k,d}}
   $$  
   where \( f_{t,d} \) = count of term *t* in document *d*,  
   and denominator = total words in that document.  



2. **Inverse Document Frequency (IDF)**  
   Measures how rare, and therefore how informative, a word is across the corpus:  
   $$
   IDF(t) = \log \frac{N}{df_t}
   $$  
   where \( N \) = total number of documents,  
   \( df_t \) = number of documents containing term *t*.  

   Common words like "the" have a low IDF, while rare words get a high IDF. A term that appears in every document gets \( \log 1 = 0 \).  


3. **TF-IDF**  
   Final weight of a word in a document:  
   $$
   TFIDF(t,d) = TF(t,d) \times IDF(t)
   $$
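
The three formulae above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the toy corpus are hypothetical, documents are assumed to be already tokenized, and `log` is taken to be the natural logarithm (the formula does not fix a base).

```python
import math
from collections import Counter


def term_frequency(tokens):
    """TF(t, d): count of each term divided by the total number of tokens in d."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(corpus):
    """IDF(t) = log(N / df_t), using the natural log (an assumption)."""
    n_docs = len(corpus)
    # df_t: number of documents that contain term t at least once
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(corpus):
    """One {term: weight} dict per document; absent terms are implicitly 0."""
    idf = inverse_document_frequency(corpus)
    return [
        {term: tf * idf[term] for term, tf in term_frequency(tokens).items()}
        for tokens in corpus
    ]


# Hypothetical toy corpus, already tokenized and lowercased.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
]
weights = tf_idf(corpus)
print(weights[0])
```

In this toy corpus "the" occurs in both documents, so its IDF is log(2/2) = 0 and its weight vanishes, while "cat" keeps a weight of (1/3) × log(2).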





### Example

Vocabulary = {"I", "like", "NLP", "Machine", "Learning", "is", "fun"}  

**Step 1 – Term Frequency (TF):**

| Document | I   | like | NLP | Machine | Learning | is  | fun |
|----------|-----|------|-----|---------|----------|-----|-----|
| Doc1     | 1/3 | 1/3  | 1/3 | 0       | 0        | 0   | 0   |
| Doc2     | 1/4 | 1/4  | 0   | 1/4     | 1/4      | 0   | 0   |
| Doc3     | 0   | 0    | 1/3 | 0       | 0        | 1/3 | 1/3 |
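
The table can be reproduced with a short script. Note the assumption: the token lists below are reconstructed from the non-zero cells of the table, so only the set of words per document is certain and the word order is a guess.

```python
from collections import Counter
from fractions import Fraction

vocabulary = ["I", "like", "NLP", "Machine", "Learning", "is", "fun"]

# Assumed token lists: only the bag of words is recoverable from the table,
# so the word order inside each document is a guess.
documents = {
    "Doc1": ["I", "like", "NLP"],
    "Doc2": ["I", "like", "Machine", "Learning"],
    "Doc3": ["NLP", "is", "fun"],
}

tf_table = {}
for name, tokens in documents.items():
    counts = Counter(tokens)
    # Fraction keeps the exact values (1/3, 1/4) instead of rounded floats
    tf_table[name] = [Fraction(counts[term], len(tokens)) for term in vocabulary]

for name, row in tf_table.items():
    print(name, [str(value) for value in row])
```

Each row sums to 1, because every TF entry is a share of the document's total word count.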

**Step 2 – Inverse Document Frequency (IDF):**

$$
IDF(t) = \log \frac{N}{df_t}
$$

- N = 3 (documents)  
- Example: "I" appears in 2 docs → IDF("I") = log(3/2)  

So:  
- IDF(I) = log(3/2)  
- IDF(like) = log(3/2)  
- IDF(NLP) = log(3/2)  
- IDF(Machine) = log(3/1)  
- IDF(Learning) = log(3/1)  
- IDF(is) = log(3/1)  
- IDF(fun) = log(3/1)  
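
For reference, these are the numeric values if `log` is read as the natural logarithm. That reading is an assumption, since the base is not stated; a different base would rescale every IDF by the same constant factor and leave the ranking of terms unchanged.

$$
\log \frac{3}{2} \approx 0.405, \qquad \log \frac{3}{1} \approx 1.099
$$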

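**Step 3 – TF × IDF = TF-IDF Matrix**  
Multiply each TF entry by the IDF of its column.  
- Words unique to one document ("Machine", "Learning", "is", "fun") get **higher weight**, e.g. TF-IDF("fun", Doc3) = 1/3 × log(3/1).  
- Words shared across documents ("I", "like", "NLP") get **lower weight**, e.g. TF-IDF("I", Doc1) = 1/3 × log(3/2).  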
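
As a sketch of Step 3, the script below computes the full matrix. It assumes the natural logarithm and the same reconstructed token lists as the TF table (the word order within each document is a guess and does not affect the result).

```python
import math

vocabulary = ["I", "like", "NLP", "Machine", "Learning", "is", "fun"]

# Term counts implied by the TF table (assumed bag of words per document).
documents = {
    "Doc1": ["I", "like", "NLP"],
    "Doc2": ["I", "like", "Machine", "Learning"],
    "Doc3": ["NLP", "is", "fun"],
}

n_docs = len(documents)
# Document frequency: in how many documents each term occurs
df = {t: sum(t in tokens for tokens in documents.values()) for t in vocabulary}
idf = {t: math.log(n_docs / df[t]) for t in vocabulary}  # natural log assumed

# TF-IDF = (count / document length) * IDF, one row per document
tfidf = {
    name: [tokens.count(t) / len(tokens) * idf[t] for t in vocabulary]
    for name, tokens in documents.items()
}

print("Doc   " + " ".join(f"{t:>8}" for t in vocabulary))
for name, row in tfidf.items():
    print(f"{name:<6}" + " ".join(f"{w:8.3f}" for w in row))
```

Under these assumptions the shared terms come out at roughly 0.135 in Doc1 and Doc3 and 0.101 in Doc2, while "Machine" and "Learning" reach about 0.275 in Doc2 and "is" and "fun" about 0.366 in Doc3.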

---

### Advantages
- Down-weights common words ("the", "is").  
- Emphasizes important, rare words.  
- Works well for search engines and information retrieval.  

---

### Limitations
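1. Still **ignores context and word order**, like any bag-of-words representation.  
2. Synonyms ("car" vs "automobile") are treated as unrelated terms.  
3. Produces high-dimensional, sparse vectors for large vocabularies (one dimension per vocabulary term).  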
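
The first two limitations are easy to see on term counts alone, since TF-IDF is computed from nothing else. The sentences and the `bag_of_words` helper below are hypothetical, and the whitespace tokenizer is deliberately naive.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, whitespace-split term counts; a deliberately naive tokenizer."""
    return Counter(text.lower().split())


# Word order: opposite meanings, identical counts, so identical TF-IDF vectors.
a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
print(a == b)

# Synonyms: "car" and "automobile" are separate vocabulary entries,
# so the two sentences overlap only on the remaining words.
c = bag_of_words("I bought a car")
d = bag_of_words("I bought an automobile")
shared = set(c) & set(d)
print(sorted(shared))
```

The first comparison is true, and the shared vocabulary of the second pair contains neither "car" nor "automobile".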

---
