## TF-IDF in NLP

### Context
Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure used in Natural Language Processing (NLP) to evaluate the importance of a word in a document relative to a collection of documents (corpus). Unlike Bag of Words (BoW), which uses raw frequencies, TF-IDF assigns higher weights to words that are frequent within a document but rare across the corpus.

#### Key Points:
- **Purpose**: Helps distinguish important words in a document while reducing the influence of common words.
- **Usage**:
  - Commonly used in text classification, information retrieval, and search engines.
  - Improves upon BoW by reducing the weight of frequently occurring but unimportant words.
- **Mathematical Formulation**:
  - **Term Frequency (TF)**: Measures how often a word appears in a document.
    
    $TF(w) = \frac{\text{Number of times } w \text{ appears in the document}}{\text{Total number of words in the document}}$


  
  - **Inverse Document Frequency (IDF)**: Measures how rare a word is across the corpus; words found in fewer documents score higher.
    
    $IDF(w) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing } w} \right)$
  
  - **TF-IDF Score**: The final importance score of a word.
    
    $TF\text{-}IDF(w) = TF(w) \times IDF(w)$
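To make the three formulas concrete, here is a minimal sketch that computes them by hand. The three-sentence corpus, the whitespace tokenization, and the choice of the natural logarithm are assumptions made for illustration; the formulas above do not fix a log base.

```python
import math
from collections import Counter

def tf(word, doc_tokens):
    """Term frequency: occurrences of `word` divided by the document length."""
    return Counter(doc_tokens)[word] / len(doc_tokens)

def idf(word, corpus_tokens):
    """Inverse document frequency: log(N / number of documents containing `word`).

    Assumes `word` occurs in at least one document (otherwise this divides by zero).
    """
    n_containing = sum(1 for doc in corpus_tokens if word in doc)
    return math.log(len(corpus_tokens) / n_containing)

def tf_idf(word, doc_tokens, corpus_tokens):
    """TF-IDF score: the product of the two measures above."""
    return tf(word, doc_tokens) * idf(word, corpus_tokens)

# Hypothetical toy corpus, tokenized by whitespace
corpus = [
    "the movie was great".split(),
    "the movie was terrible".split(),
    "the plot was great".split(),
]

doc = corpus[0]
scores = {w: round(tf_idf(w, doc, corpus), 3) for w in doc}
print(scores)
```

Because "the" and "was" occur in all three documents, their IDF is $\log(3/3) = 0$ and their TF-IDF score is zero, while "movie" and "great" (two documents each) receive a small positive weight.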

### Example: Text Classification using TF-IDF

Let's use a toy dataset to demonstrate how TF-IDF works.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Expanded dataset to improve learning
data = {
    "text": [
        "The movie was fantastic and I loved it",
        "Absolutely terrible movie, I hated it",
        "Great film with an amazing story",
        "The story was dull and boring",
        "A fantastic experience with great actors",
        "Worst film ever, do not watch it",
        "Loved the characters and the direction",
        "Terrible plot and bad acting",
        "The acting was great and the film was entertaining",
        "The movie was slow and had a bad script",
        "Best movie ever, highly recommended!",
        "Disappointing film with a predictable storyline",
        "Brilliant performances, a must-watch!",
        "Avoid this movie, it's a waste of time",
        "Spectacular direction and cinematography",
        "Horrible movie with no redeeming qualities",
        "The screenplay and performances were top-notch",
        "The pacing was terrible, made me fall asleep"
    ],
    "label": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1: Positive, 0: Negative
}

df = pd.DataFrame(data)

# Splitting dataset (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(df["text"], df["label"], test_size=0.2, random_state=42)

# Vectorizing text using TF-IDF with tuning
tfidf_vectorizer = TfidfVectorizer(min_df=1, max_df=0.9, ngram_range=(1, 2))  # Unigrams & bigrams
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Training a Naive Bayes classifier with smoothing
classifier = MultinomialNB(alpha=0.5)
classifier.fit(X_train_tfidf, y_train)

# Making predictions
y_pred = classifier.predict(X_test_tfidf)

# Evaluating model
accuracy = accuracy_score(y_test, y_pred)
print("TF-IDF Model Accuracy:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred, zero_division=0))

# Predicting for new sentences
new_sentences = [
    "I loved the film and the story was amazing",
    "Worst experience, the plot was awful",
    "The cinematography and acting were fantastic",
    "Horrible script with bad direction"
]

# Transform new sentences using the trained vectorizer
new_sentences_tfidf = tfidf_vectorizer.transform(new_sentences)
new_predictions = classifier.predict(new_sentences_tfidf)

# Display predictions
for sentence, prediction in zip(new_sentences, new_predictions):
    label = "Positive" if prediction == 1 else "Negative"
    print(f"'{sentence}' -> {label}")
```

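```text
TF-IDF Model Accuracy: 0.25

Classification Report:
               precision    recall  f1-score   support

           0       0.33      0.50      0.40         2
           1       0.00      0.00      0.00         2

    accuracy                           0.25         4
   macro avg       0.17      0.25      0.20         4
weighted avg       0.17      0.25      0.20         4

'I loved the film and the story was amazing' -> Positive
'Worst experience, the plot was awful' -> Negative
'The cinematography and acting were fantastic' -> Positive
'Horrible script with bad direction' -> Negative
```

With only four held-out sentences, each prediction shifts accuracy by 25 percentage points, so the 0.25 score reflects the tiny test set more than TF-IDF itself. The example is best read as a walkthrough of the workflow rather than a benchmark.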


## 📚 Explaining TF-IDF with Logarithm  

### 🏛 Imagine a Library  
- You have a big **library** full of books.  
- Some words, like **"the"**, **"is"**, and **"a"**, appear in **almost every book**.  
- Other words, like **"volcano"**, **"tornado"**, or **"dinosaur"**, appear in **only a few books**.  

Now, you want to find **which words are important in each book**.

---

### 🔢 Step 1: Count the Words (TF - Term Frequency)  
You look at a book and **count** how many times a word appears.  

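- If **"dinosaur"** appears **10 times**, its **raw count is 10**.  
- If **"the"** appears **100 times**, its **raw count is 100**.  

Dividing each count by the total number of words in the book gives the **TF** from the formula above.  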

But is **"the"** more important than **"dinosaur"**? 🤔 **No!**  

---

### 📊 Step 2: Find Out How Common the Word Is (IDF - Inverse Document Frequency)  
- If **"dinosaur"** appears in **only 2 books out of 1,000**, it's **rare** and **important**! 🦖  
- If **"the"** appears in **every single book**, it's **too common** and **not important**.  

So, we give **rare words a higher score** and **common words a lower score**.

---

### 🤯 Step 3: Why Use Logarithm?  
If we just divide, rare words get **huge** scores that drown out every other word.  

#### ❌ Without Log  
- "dinosaur" appears in **2 books** →  
  **IDF = 1000 ÷ 2 = 500**  
- "volcano" appears in **10 books** →  
  **IDF = 1000 ÷ 10 = 100**  

Whoa! 😲 **That’s a HUGE difference!**  

#### ✅ With Log  
- "dinosaur" IDF (using base-10 log) =  
  $$ \log_{10} \left(\frac{1000}{2}\right) = \log_{10}(500) \approx 2.7 $$  
- "volcano" IDF (using base-10 log) =  
  $$ \log_{10} \left(\frac{1000}{10}\right) = \log_{10}(100) = 2 $$  

Now, the scores are **closer and more balanced**! 👍  
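The comparison above can be reproduced in a few lines. This is a small sketch using the document frequencies from the library example; the base-10 logarithm and the extra entry for "the" (assumed to appear in all 1,000 books) are choices made here for illustration.

```python
import math

total_books = 1000

# Number of books containing each word; "the" is assumed to be in every book
books_containing = {"dinosaur": 2, "volcano": 10, "the": 1000}

# Plain ratio versus log-scaled IDF for each word
raw_ratios = {w: total_books / n for w, n in books_containing.items()}
idf_scores = {w: math.log10(ratio) for w, ratio in raw_ratios.items()}

for word in books_containing:
    print(f"{word:>8}: ratio = {raw_ratios[word]:6.1f}, log10 = {idf_scores[word]:.2f}")
```

The ratio for "dinosaur" is five times that of "volcano" (500 versus 100), while the log-scaled scores differ by only about 0.7. The word found in every book ends up with a score of exactly 0.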

---

### 🎯 Final Answer  
📌 We use a **logarithm** because it **compresses the gap** between **rare and common words**, so a very rare word cannot dominate the score, while a word found in every book gets $\log(1) = 0$. ⚖️  

🚀 So, **TF-IDF with log** helps computers **find the most important words in a book or article** without getting confused by common words!  

---



### Advantages
- Reduces the influence of commonly occurring words.
- Helps in extracting meaningful words from text data.
- Works well in text classification and information retrieval.

### Disadvantages
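- Does not capture semantic meaning: synonyms such as "film" and "movie" are treated as unrelated terms.
- Produces high-dimensional, sparse vectors for large corpora, with one dimension per vocabulary term.
- May not work well for very short texts, where most terms appear only once and term frequency carries little signal.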
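The first limitation is easy to see in practice. The following sketch, built on three hypothetical review snippets (an assumption for illustration), measures cosine similarity between TF-IDF vectors: two sentences with the same meaning but no shared words come out as completely dissimilar.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical snippets: the first two mean roughly the same thing
# but share no vocabulary; the third shares one word with each
docs = [
    "a wonderful film",
    "an excellent movie",
    "a wonderful movie",
]

tfidf = TfidfVectorizer().fit_transform(docs)
sims = cosine_similarity(tfidf)  # dense 3x3 matrix of pairwise similarities

print("film vs. movie paraphrase:", sims[0, 1])
print("shared word 'wonderful':  ", sims[0, 2])
print("shared word 'movie':      ", sims[1, 2])
```

The first pair scores zero because TF-IDF only sees term overlap, whereas the pairs that share a literal word score above zero regardless of how close their meanings are.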

### Conclusion
TF-IDF is a powerful technique for text representation that improves upon BoW by weighting words based on their importance. It is widely used in search engines, text classification, and recommendation systems. However, for more advanced NLP tasks, embedding-based models such as Word2Vec and BERT provide deeper semantic understanding.