"""
Natural Language Processing - TF-IDF (Term Frequency–Inverse Document Frequency)
-------------------------------------------------------------------------------

We're going to explore how TF-IDF works as a method to convert text into numerical vectors.

Overview:
---------
TF-IDF combines two concepts:
1. Term Frequency (TF): Measures how often a word appears in a document, relative to the document's length.
2. Inverse Document Frequency (IDF): Down-weights words that occur in many documents, since such words do little to tell documents apart.

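In this walkthrough, each sentence plays the role of a document:

TF = (Number of times term appears in sentence) / (Total number of words in sentence)

IDF = log_10(Total number of sentences / Number of sentences containing the word)

(Base 10 matches the IDF values computed in Step 3. A different base, such as e, only rescales every score by a constant.)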

TF-IDF = TF * IDF

We’ll work through an example with these three sentences (already lowercased and stopwords removed):

S1 = "good boy"
S2 = "good girl"
S3 = "boy girl good"

Step 1: Vocabulary
------------------
From the three sentences, our vocabulary (unique words) is:
['boy', 'girl', 'good']
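
As a minimal sketch of this step, assuming plain whitespace tokenization (the variable names are illustrative, not part of the original walkthrough), the vocabulary can be built like this:

```python
# The three example sentences, already lowercased with stopwords removed.
sentences = ["good boy", "good girl", "boy girl good"]

# Split on whitespace; each sentence is treated as one document.
tokenized = [sentence.split() for sentence in sentences]

# Collect the unique words and sort them so the column order is stable.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

print(vocabulary)  # ['boy', 'girl', 'good']
```

Sorting is a design choice here: it fixes which vector position belongs to which word.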

Step 2: Term Frequency (TF)
---------------------------
Calculate TF for each word in each sentence:

Sentence 1: "good boy"
- good: 1/2
- boy: 1/2
- girl: 0

Sentence 2: "good girl"
- good: 1/2
- girl: 1/2
- boy: 0

Sentence 3: "boy girl good"
- good: 1/3
- boy: 1/3
- girl: 1/3
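
The TF values above can be reproduced with a short sketch. The helper name term_frequency is hypothetical, and whitespace tokenization is assumed:

```python
from collections import Counter

sentences = ["good boy", "good girl", "boy girl good"]
vocabulary = ["boy", "girl", "good"]

def term_frequency(sentence, vocabulary):
    # TF = count of the word in the sentence / total words in the sentence.
    tokens = sentence.split()
    counts = Counter(tokens)  # missing words count as 0
    return {word: counts[word] / len(tokens) for word in vocabulary}

tf_table = [term_frequency(s, vocabulary) for s in sentences]
for name, tf in zip(["S1", "S2", "S3"], tf_table):
    print(name, tf)
```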

Step 3: Inverse Document Frequency (IDF)
----------------------------------------
Total number of sentences = 3

Calculate how many sentences contain each word:
- good → 3 sentences → IDF = log(3/3) = 0
- boy → 2 sentences → IDF = log(3/2)
- girl → 2 sentences → IDF = log(3/2)

So:
- IDF(good) = 0
- IDF(boy) ≈ 0.176
- IDF(girl) ≈ 0.176
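
A small sketch of the IDF step follows. It assumes a base-10 logarithm, because that is the base that reproduces the 0.176 shown above; the function name is hypothetical:

```python
import math

sentences = ["good boy", "good girl", "boy girl good"]
vocabulary = ["boy", "girl", "good"]
tokenized = [set(s.split()) for s in sentences]

def inverse_document_frequency(word, tokenized_docs):
    # Number of sentences that contain the word at least once.
    containing = sum(1 for tokens in tokenized_docs if word in tokens)
    # Base-10 log (an assumption that matches the worked values).
    return math.log10(len(tokenized_docs) / containing)

idf = {w: inverse_document_frequency(w, tokenized) for w in vocabulary}
print({w: round(v, 3) for w, v in idf.items()})
# {'boy': 0.176, 'girl': 0.176, 'good': 0.0}
```

With the natural logarithm, IDF(boy) and IDF(girl) would be about 0.405 instead. The relative ordering of words is the same either way. The function assumes every word appears in at least one sentence, which holds when the vocabulary is built from the same sentences.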

Step 4: TF-IDF Calculation
--------------------------
TF-IDF = TF * IDF

Sentence 1:
- good: (1/2) * 0     = 0
- boy:  (1/2) * 0.176 ≈ 0.088
- girl: 0 * 0.176     = 0

Sentence 2:
- good: (1/2) * 0     = 0
- boy:  0 * 0.176     = 0
- girl: (1/2) * 0.176 ≈ 0.088

Sentence 3:
- good: (1/3) * 0     = 0
- boy:  (1/3) * 0.176 ≈ 0.059
- girl: (1/3) * 0.176 ≈ 0.059

Final TF-IDF Vectors:
---------------------
S1: [boy=0.088, girl=0,     good=0]
S2: [boy=0,     girl=0.088, good=0]
S3: [boy=0.059, girl=0.059, good=0]
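
Putting the steps together, the vectors above can be reproduced with the sketch below. It assumes whitespace tokenization and a base-10 log, with no smoothing or normalization, so library implementations may give different numbers:

```python
import math
from collections import Counter

sentences = ["good boy", "good girl", "boy girl good"]
tokenized = [s.split() for s in sentences]
vocabulary = sorted({w for tokens in tokenized for w in tokens})
n_docs = len(tokenized)

# IDF per word (base-10 log, matching the worked numbers above).
idf = {
    w: math.log10(n_docs / sum(1 for tokens in tokenized if w in tokens))
    for w in vocabulary
}

def tfidf_vector(tokens):
    # One entry per vocabulary word: TF * IDF.
    counts = Counter(tokens)
    return [counts[w] / len(tokens) * idf[w] for w in vocabulary]

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for name, vec in zip(["S1", "S2", "S3"], vectors):
    print(name, [round(x, 3) for x in vec])
# S1 [0.088, 0.0, 0.0]
# S2 [0.0, 0.088, 0.0]
# S3 [0.059, 0.059, 0.0]
```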

Key Insight:
------------
- Unlike Bag of Words (BoW), TF-IDF gives **lower weight** to common words (like 'good' here, IDF=0).
- TF-IDF improves upon BoW by reducing the impact of common but less meaningful terms.

Next Steps:
-----------
- In the following section, we’ll implement TF-IDF using sklearn.
"""


----------------------------------------------------------------------------------------------------------------------------

# Understanding TF-IDF in NLP

In this notebook, we delve deeper into **TF-IDF (Term Frequency - Inverse Document Frequency)**, one of the most widely used techniques in Natural Language Processing (NLP) for text representation. We’ve already discussed the basic formula and seen introductory examples. Now, we will expand on that foundation by understanding **why TF-IDF is often preferred over Bag of Words (BoW)**, along with its **advantages, disadvantages**, and **impact on model performance**.

---

## 📚 What You Will Learn

- A recap of the TF-IDF formula (Term Frequency and Inverse Document Frequency).
- How TF-IDF differs from and improves upon the Bag of Words (BoW) model.
- Key advantages of using TF-IDF:
  - Intuitive and straightforward to implement.
  - Fixed input size, just like BoW (based on vocabulary size).
  - **Captures word importance**, a major differentiator compared to BoW.
- Real-world interpretation of how TF-IDF values reflect the importance of words in a paragraph or document.
- Understanding how TF-IDF can help model performance by emphasizing words that distinguish one document from another.
- Disadvantages of TF-IDF:
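  - **Sparsity**: Most entries in each feature vector are zero, because any one document uses only a small part of the vocabulary.
  - **Out of Vocabulary (OOV)**: Words that appear in test data but not in training data have no vector position, so they are ignored.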
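
Both disadvantages can be seen in a small sketch. The data and the helper `count_vector` are hypothetical, and raw counts are used for brevity; TF-IDF vectors share the same columns, so the same zeros and the same dropped words apply.

```python
# Hypothetical toy data: the vocabulary is learned from training text only.
train_docs = ["good boy", "good girl", "boy girl good"]
vocabulary = sorted({w for doc in train_docs for w in doc.split()})

def count_vector(text, vocabulary):
    # Words outside the training vocabulary have no column, so they vanish.
    tokens = text.split()
    return [tokens.count(w) for w in vocabulary]

test_doc = "clever girl"  # 'clever' was never seen in training
vector = count_vector(test_doc, vocabulary)
oov_words = [w for w in test_doc.split() if w not in vocabulary]

# Share of zero entries: a rough measure of sparsity for this one vector.
zero_share = vector.count(0) / len(vector)

print(vocabulary)  # ['boy', 'girl', 'good']
print(vector)      # [0, 1, 0]
print(oov_words)   # ['clever']
print(zero_share)
```

With a realistic vocabulary of tens of thousands of words, the share of zeros per document is typically far higher than in this three-word example.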

---

## ❓ Why Do We Need TF-IDF?

While Bag of Words is simple and effective for basic text representation, it weights every word purely by its raw count and ignores how informative that word is across documents. TF-IDF addresses this by:

- Penalizing words that appear too frequently across documents (like "the", "is", "and").
- Highlighting words that are **unique** and **contextually significant** in a specific document.
- Providing a more meaningful numerical representation of textual data for machine learning algorithms.
  
This allows models to **focus on discriminative features**, which can improve performance in tasks like classification, clustering, and information retrieval.

---

## ✅ Conclusion

TF-IDF is a common way to transform raw text into a numerical format that machine learning models can use. Unlike Bag of Words, it weights each word by its **frequency within a document and its distribution across documents**. It shares BoW's limitations of sparsity and out-of-vocabulary (OOV) words, but its weighting often makes it the better choice when word importance matters.

In the next section, we’ll move on to practical implementation using **Python with NLTK and sklearn**, and explore how this representation can be applied to real-world datasets.

> 🧠 *Tip: Practice on varied datasets to solidify your understanding. Assignments will be provided to help with hands-on learning.*

---
