## TF-IDF

TF-IDF converts text into numeric vectors whose values reflect both how often a word appears in a document (TF) and how informative it is, measured by how rare it is across documents (IDF).

## 1. Term Frequency (TF)

Measures how often a term appears in a document.

$$
\text{TF}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total terms in } d}
$$
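
As a minimal sketch of the formula above, the function below computes TF for one term in one document. The whitespace tokenizer and the toy sentence are assumptions made for illustration; a real tokenizer would also handle punctuation.

```python
from collections import Counter


def term_frequency(term, document_tokens):
    """Relative frequency of `term` among the tokens of one document."""
    if not document_tokens:
        return 0.0
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)


# Hypothetical toy document, tokenized by lowercasing and splitting on whitespace.
tokens = "the cat sat on the mat".lower().split()

tf_the = term_frequency("the", tokens)  # 2 occurrences out of 6 tokens
tf_cat = term_frequency("cat", tokens)  # 1 occurrence out of 6 tokens
print(tf_the, tf_cat)
```

Dividing by the document length keeps long documents from scoring higher simply because they contain more words.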

---

## 2. Inverse Document Frequency (IDF)

Measures how rare a word is across all documents.

$$
\text{IDF}(t) = \log\left(\frac{N}{1 + \text{DF}(t)}\right)
$$

Where:

- **N** = total number of documents  
- **DF(t)** = number of documents where term *t* appears  
- **+1** is added to the denominator to avoid division by zero when a term appears in no document
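
The definition above can be sketched directly in Python. The corpus below is a made-up example, and the natural logarithm is assumed since the formula does not fix a base.

```python
import math


def inverse_document_frequency(term, tokenized_docs):
    """IDF(t) = log(N / (1 + DF(t))), using the natural logarithm."""
    n_docs = len(tokenized_docs)
    doc_freq = sum(1 for doc in tokenized_docs if term in doc)
    return math.log(n_docs / (1 + doc_freq))


# Hypothetical toy corpus of four tokenized documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "bird", "sang"],
    ["a", "cat", "purred"],
]

idf_the = inverse_document_frequency("the", docs)  # DF = 3 -> log(4 / 4)
idf_cat = inverse_document_frequency("cat", docs)  # DF = 2 -> log(4 / 3)
idf_dog = inverse_document_frequency("dog", docs)  # DF = 1 -> log(4 / 2)
print(idf_the, idf_cat, idf_dog)
```

One consequence of this particular smoothing is that a term present in every document gets `log(N / (N + 1))`, which is slightly negative rather than zero.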

---

## 3. TF-IDF Score

$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
$$

This value increases when a term appears frequently in a document but not in many other documents.
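
A small worked sketch may help here. It multiplies the two quantities exactly as in the formula, on an invented four-document corpus with a whitespace tokenizer (both are assumptions for illustration).

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, tokenized_docs):
    """TF-IDF(t, d) = TF(t, d) * IDF(t), with IDF(t) = log(N / (1 + DF(t)))."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for doc in tokenized_docs if term in doc)
    idf = math.log(len(tokenized_docs) / (1 + df))
    return tf * idf


# Hypothetical toy corpus, already lowercased.
docs = [
    "nlp nlp is fun".split(),
    "cooking is fun".split(),
    "hiking is fun".split(),
    "reading is relaxing".split(),
]

score_nlp = tf_idf("nlp", docs[0], docs)  # frequent in doc 0, absent elsewhere
score_fun = tf_idf("fun", docs[0], docs)  # appears in 3 of 4 documents
score_is = tf_idf("is", docs[0], docs)    # appears in every document
print(score_nlp, score_fun, score_is)
```

The term that is frequent in one document and absent from the rest gets the highest score, while the term found in every document ends up at or below zero under this smoothing.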


## Why Use Log in IDF?

### 1. Problem with raw inverse frequency

If we define IDF as:

$$
\text{IDF}(t) = \frac{N}{\text{DF}(t)}
$$

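Then rare words get **huge weights** while common words get weights **close to 1**. With N = 1,000,000, a word found in a single document is weighted a million times more than a word found in all of them.  
The difference becomes **too extreme** to model effectively.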

---

### 2. Log smooths the scale

Applying a logarithm:

$$
\text{IDF}(t) = \log\left(\frac{N}{1 + \text{DF}(t)}\right)
$$

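- **Large values are compressed**  
- **Differences between small values stay visible**  
- **Ratios become differences**: a term that is 10× rarer gains a fixed `log(10)` ≈ 2.3 in weight instead of a 10× multiplier  
  (helpful for dot products, cosine similarity, linear models, etc., which a handful of very large weights would otherwise dominate)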

---

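### 3. Example

Take a corpus of N = 100 documents and, to keep the arithmetic simple, ignore the +1 in the denominator.

Raw inverse frequency:

- Word appears in 1 doc → raw IDF = 100/1 = 100  
- Word appears in 100 docs → raw IDF = 100/100 = 1  
  → That's a **100× gap**

Using the natural log:

- Word appears in 1 doc → IDF = `log(100)` ≈ **4.6**  
- Word appears in 100 docs → IDF = `log(1)` = **0**

The weights now span 0 to 4.6 instead of 1 to 100: a much **smoother** scale that is **better behaved numerically** and **easier to model**.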
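
The comparison can be tabulated for a few more document frequencies. This sketch assumes a corpus of N = 100 documents, which is what the raw values in the example imply, and it omits the +1 smoothing to match those numbers.

```python
import math

n_docs = 100  # corpus size implied by the example (assumption)

# (document frequency, raw N / DF, natural log of N / DF)
rows = [(df, n_docs / df, math.log(n_docs / df)) for df in (1, 10, 50, 100)]

for df, raw, logged in rows:
    print(f"DF={df:>3}  raw={raw:6.1f}  log={logged:5.2f}")
```

Each tenfold change in document frequency shifts the log weight by the same amount, about 2.3, whereas the raw weight changes by a factor of ten.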


### Simple TF-IDF

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

In [3]:
# Sample corpus
documents = [
    "I love NLP",
    "NLP is fun",
    "I love machine learning"
]

In [4]:
# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")  # includes single-letter words like "I"


In [5]:
vectorizer

In [6]:
# Fit and transform the documents
X = vectorizer.fit_transform(documents)

In [8]:
# Get feature (word) names
features = vectorizer.get_feature_names_out()

In [9]:
features

array(['fun', 'i', 'is', 'learning', 'love', 'machine', 'nlp'],
      dtype=object)

In [10]:
# Convert to a readable pandas DataFrame
tfidf_df = pd.DataFrame(X.toarray(), columns=features)

In [11]:
# Show the DataFrame
print(tfidf_df)


        fun         i        is  learning      love   machine      nlp
0  0.000000  0.577350  0.000000  0.000000  0.577350  0.000000  0.57735
1  0.622766  0.000000  0.622766  0.000000  0.000000  0.000000  0.47363
2  0.000000  0.428046  0.000000  0.562829  0.428046  0.562829  0.00000
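
These scores do not follow the textbook formula exactly. With its default settings, `TfidfVectorizer` uses the raw term count as TF, a smoothed IDF of `ln((1 + N) / (1 + DF)) + 1`, and then scales each row to unit L2 length. The sketch below recomputes the table by hand under those defaults. The whitespace tokenizer is a simplification that happens to give the same tokens for this corpus.

```python
import math
from collections import Counter

documents = ["I love NLP", "NLP is fun", "I love machine learning"]
tokenized = [doc.lower().split() for doc in documents]  # simplified tokenizer
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

# Smoothed IDF, as used when smooth_idf=True (the scikit-learn default).
idf = {
    term: math.log((1 + n_docs) / (1 + sum(term in doc for doc in tokenized))) + 1
    for term in vocab
}


def tfidf_row(tokens):
    """Raw count times IDF for every vocabulary term, then L2-normalized."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


rows = [tfidf_row(doc) for doc in tokenized]
print(vocab)
for row in rows:
    print([round(v, 6) for v in row])
```

The L2 normalization explains why all three words in "I love NLP" score 0.57735, which is 1/√3: they share the same IDF, so the row is split evenly.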


### TF-IDF on 20 Newsgroups Dataset 

In [12]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [13]:
# 1. Load dataset (subset for clarity and speed)
categories = ['sci.space', 'rec.sport.hockey', 'comp.graphics']
data = fetch_20newsgroups(subset='train', categories=categories)

**This loads a subset of the 20 Newsgroups training set, limited to three categories:**

- sci.space

- rec.sport.hockey

- comp.graphics

`data.data`: list of news articles (strings)

`data.target`: numeric category labels (0, 1, 2), indexing into `data.target_names`

In [18]:
# 2. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)


In [20]:
X_train[0]

'From: d88-jwa@hemul.nada.kth.se (Jon Wtte)\nSubject: Re: Please Recommend 3D Graphics Library For Mac.\nOrganization: Royal Institute of Technology, Stockholm, Sweden\nLines: 21\nNntp-Posting-Host: hemul.nada.kth.se\n\nIn <Z2442B4w164w@cellar.org> tsa@cellar.org (The Silent Assassin) writes:\n\n>> I\'m building a CAD package and need a 3D graphics library that can handle\n>> some rudimentry tasks, such as hidden line removal, shading, animation, etc.\n>> \n>> Can you please offer some recommendations?\n\nI think APDA has something called MacWireFrame which is a full\nwire-frame (and supposedly hidden-line removal) library.\nI think it weighs in at $99 (but I\'ve been wrong on an order\nof magnitude before)\n\n>Libertarian, atheist, semi-anarchal Techno-Rat.\n\nI can relate to that\n\n\t\t\t\t\t/h+\n-- \n -- Jon W{tte, h+@nada.kth.se, Mac Hacker Deluxe --\n\n  "On a clear disc, you can seek forever."\n'

In [22]:
y_train[0]

0

In [23]:
# 3. Initialize and fit the TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)  # fit on train data
X_test_tfidf = vectorizer.transform(X_test)        # transform test data

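**`fit_transform` and `transform` each return a numerical matrix where:**

- Rows = documents

- Columns = vocabulary (words)

- Values = TF-IDF scores

**.fit_transform():**

Learns the vocabulary and IDF weights from the training set, then computes its TF-IDF matrix.

**.transform():**

Applies the same vocabulary and IDF weights to the test set, so both matrices share one feature space. Words that never appeared in the training set are ignored.

**The number of columns equals the number of unique tokens in the training set vocabulary, after English stop words are removed.**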

In [25]:
# 4. Initialize and train the classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_tfidf, y_train)

In [26]:
# 5. Predict on test set
y_pred = clf.predict(X_test_tfidf)

In [27]:
# 6. Evaluate
print(classification_report(y_test, y_pred, target_names=data.target_names))

                  precision    recall  f1-score   support

   comp.graphics       0.93      0.98      0.95       128
rec.sport.hockey       0.99      0.96      0.97       114
       sci.space       0.98      0.95      0.96       114

        accuracy                           0.96       356
       macro avg       0.97      0.96      0.96       356
    weighted avg       0.96      0.96      0.96       356
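
As an optional variation, not what the cells above do, the vectorizer and classifier can be wrapped in a single scikit-learn pipeline so the fit-on-train, transform-on-test discipline is handled automatically. The corpus and labels below are invented (0 = space, 1 = hockey) and far too small to say anything about accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; labels are 0 = space, 1 = hockey.
toy_texts = [
    "the rocket reached orbit after launch",
    "nasa scheduled a new shuttle mission",
    "the satellite entered a stable orbit",
    "the goalie stopped the puck in overtime",
    "the team scored twice on the power play",
    "a late goal decided the hockey game",
]
toy_labels = [0, 0, 0, 1, 1, 1]

# fit() fits the vectorizer and the classifier on the same training texts.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(toy_texts, toy_labels)

# predict() only transforms the new texts with the already-fitted vectorizer.
preds = model.predict(["the shuttle returned from orbit", "the puck hit the goalie"])
print(preds)
```

Because the pipeline owns the vectorizer, it can be passed whole to tools such as cross-validation without the test folds leaking into the vocabulary.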



| Component | What it captures                      | Why it matters                         |
| --------- | ------------------------------------- | -------------------------------------- |
| **TF**    | How frequent a word is in a document  | Captures the document’s focus          |
| **IDF**   | How rare the word is in the corpus    | Filters out common/uninformative words |
