# TF-IDF (Term Frequency - Inverse Document Frequency)

## Formula

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

### Term Frequency (TF)

$$\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$$

### Inverse Document Frequency (IDF)

$$\text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)$$
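
Libraries often use a smoothed variant of this textbook form. For reference, scikit-learn's `TfidfVectorizer` with its default settings (`smooth_idf=True`) uses the raw term count as TF and the IDF below, where $n$ is the total number of documents and $\text{df}(t)$ is the number of documents containing term $t$ (both symbols are introduced here for brevity):

$$\text{IDF}(t) = \ln\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1$$

The trailing $+1$ keeps terms that occur in every document from being zeroed out, and each document vector is then L2-normalized, so scikit-learn's numbers will not match a hand calculation based on the textbook formula.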

## What It Does

TF-IDF measures **how important a word is to a document** in a collection of documents:

- **High TF-IDF**: Word is frequent in this document but rare in others → Important/Distinctive
- **Low TF-IDF**: Word is either rare in this document or common across all documents → Less important

Common words like "the", "is", "and" get low scores because they appear everywhere.
Specific/unique words get high scores because they're distinctive.
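
A minimal sketch of this behaviour using the textbook formulas above. The toy corpus and the helper names `tf` and `idf` are made up for illustration:

```python
import math

# Toy corpus, already tokenized (hypothetical example data)
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Textbook IDF: log(N / number of documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

# Score three terms against the second document
for term in ["the", "cat", "barked"]:
    score = tf(term, corpus[1]) * idf(term, corpus)
    print(f"{term!r}: {score:.4f}")
```

"the" scores zero because it appears in every document (IDF = 0), "cat" scores zero because it does not occur in this document (TF = 0), and only "barked" gets a positive score.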

## Important Parameters

```python
TfidfVectorizer(
    max_features=None,      # Keep only the top-N terms by corpus frequency (None = no limit)
    stop_words='english',   # Remove built-in English stop words (default: None)
    ngram_range=(1, 1),     # (min_n, max_n) for n-grams
    min_df=1,               # Ignore terms in fewer documents than this (count or proportion)
    max_df=1.0,             # Ignore terms in more documents than this (count or proportion)
    lowercase=True,         # Convert to lowercase before tokenizing
    norm='l2',              # Row normalization ('l2', 'l1', or None)
    use_idf=True,           # Enable IDF weighting
    smooth_idf=True,        # Add 1 to document frequencies to avoid zero division
    sublinear_tf=False      # Replace tf with 1 + log(tf)
)
```

### Parameter Examples

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "I love machine learning and deep learning",
    "Machine learning is amazing",
    "Deep learning uses neural networks"
]

# Unigrams and bigrams (single words plus 2-word phrases)
vectorizer_bigram = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer_bigram.fit_transform(documents)
print("With bigrams:", vectorizer_bigram.get_feature_names_out())

# Limit vocabulary to top 5 features
vectorizer_limited = TfidfVectorizer(max_features=5)
tfidf = vectorizer_limited.fit_transform(documents)
print("\nTop 5 features:", vectorizer_limited.get_feature_names_out())

# Ignore words that appear in > 50% of documents
vectorizer_max_df = TfidfVectorizer(max_df=0.5)
tfidf = vectorizer_max_df.fit_transform(documents)
print("\nMax df=0.5:", vectorizer_max_df.get_feature_names_out())
```

**With the default `norm='l2'`, each document vector is normalized to unit length (L2 norm = 1).**

## When to Use TF-IDF

**Use when:**
- Text classification and clustering
- Information retrieval and search
- Document similarity comparison
- Keyword extraction
- Feature engineering for NLP tasks
- Working with traditional ML models

**Avoid when:**
- Need semantic understanding (use word embeddings instead)
- Word order matters (TF-IDF is bag-of-words)
- Working with very short texts
- Need context-aware representations (use transformers like BERT)
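
The bag-of-words limitation is easy to see in a small sketch. The two sentences are made-up examples with opposite meanings:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["dog bites man", "man bites dog"]

# Unigram TF-IDF ignores word order, so both sentences get the same vector
vectors = TfidfVectorizer().fit_transform(sentences).toarray()
same = np.allclose(vectors[0], vectors[1])
print("Identical unigram vectors:", same)

# Adding bigrams recovers some local order information
bigram_vectors = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(sentences).toarray()
different = not np.allclose(bigram_vectors[0], bigram_vectors[1])
print("Bigram vectors differ:", different)
```

N-grams only capture short-range order; they are a partial workaround, not a substitute for sequence-aware models.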

## TF-IDF vs Other Methods

| Method | Captures | Order | Semantics | Use Case |
|--------|----------|-------|-----------|----------|
| TF-IDF | Word importance | No | No | Traditional ML, search |
| Word2Vec | Word context | No | Yes | Semantic similarity |
| BERT | Contextual meaning | Yes | Yes | Modern NLP tasks |
| Count Vectorizer | Word frequency | No | No | Simple baseline |
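
The relationship between the first and last rows can be sketched in code: applying `TfidfTransformer` to `CountVectorizer` output should give the same matrix as `TfidfVectorizer` when both use default settings. The toy corpus is made up for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)

docs = ["the cat sat", "the dog sat", "the cat ran"]  # hypothetical toy corpus

# Step 1: raw term counts (what Count Vectorizer captures)
counts = CountVectorizer().fit_transform(docs)

# Step 2: reweight the counts by IDF and normalize each row
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both
one_step = TfidfVectorizer().fit_transform(docs)

equivalent = np.allclose(two_step.toarray(), one_step.toarray())
print("Raw counts:\n", counts.toarray())
print("Same result either way:", equivalent)
```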

## Key sklearn Methods

```python
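from sklearn.feature_extraction.text import TfidfVectorizer

# Example corpora (placeholders; substitute your own lists of strings)
documents = ["I love machine learning", "Machine learning is awesome"]
new_documents = ["Deep learning is awesome"]

vectorizer = TfidfVectorizer()

# Fit and transform: learns vocabulary and IDF, returns a sparse matrix
tfidf = vectorizer.fit_transform(documents)

# Get feature names (vocabulary)
features = vectorizer.get_feature_names_out()

# Get IDF values (one per feature)
idf_values = vectorizer.idf_

# Transform new documents using the fitted vocabulary and IDF
new_tfidf = vectorizer.transform(new_documents)

# Get vocabulary dictionary (term -> column index)
vocab = vectorizer.vocabulary_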
```
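
One detail of `transform` worth knowing: words that were not seen during `fit` are silently dropped. A small sketch with made-up training sentences:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["machine learning is fun", "deep learning is powerful"]
vectorizer = TfidfVectorizer().fit(train_docs)

# All three words are in the fitted vocabulary
known = vectorizer.transform(["learning is fun"])

# Neither word was seen during fit, so the row is all zeros
unknown = vectorizer.transform(["quantum chromodynamics"])

print("known:  ", known.shape, "non-zero entries:", known.nnz)
print("unknown:", unknown.shape, "non-zero entries:", unknown.nnz)
```

The output width always equals the size of the fitted vocabulary (6 here), so an all-zero row is a sign that a new document shares no vocabulary with the training corpus.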

## Complete Example: Text Classification Pipeline

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Sample dataset
texts = [
    "Great product, highly recommend",
    "Terrible quality, waste of money",
    "Amazing service and fast delivery",
    "Poor customer support, disappointed",
    "Excellent value for money",
    "Do not buy, completely broken"
]
labels = [1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42
)

# Vectorize with TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train model
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Predict
y_pred = model.predict(X_test_tfidf)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Test on new text
new_text = ["This product is fantastic"]
new_tfidf = vectorizer.transform(new_text)
prediction = model.predict(new_tfidf)
print(f"\nNew text prediction: {prediction[0]} (1=positive, 0=negative)")
```
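
The vectorizer and the model can also be bundled with scikit-learn's `make_pipeline`, so raw strings go in and predictions come out. This is a sketch of that alternative on the same sample data (without the train/test split, to keep it short), not a replacement for the evaluation above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Great product, highly recommend",
    "Terrible quality, waste of money",
    "Amazing service and fast delivery",
    "Poor customer support, disappointed",
    "Excellent value for money",
    "Do not buy, completely broken"
]
labels = [1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

# The pipeline fits the vectorizer and the classifier together
clf = make_pipeline(TfidfVectorizer(stop_words='english'), LogisticRegression())
clf.fit(texts, labels)

# predict() vectorizes the raw text with the fitted vocabulary first
pred = clf.predict(["Excellent product, fast delivery"])
print("Prediction:", pred[0])
```

Bundling the steps makes it harder to accidentally call `fit_transform` on test data, and the whole pipeline can be passed to cross-validation helpers as a single estimator.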

# Manual TF-IDF Calculation

```python
from collections import Counter
import math

def calculate_tfidf_manual(documents):
    """
    Manual TF-IDF calculation for understanding
    (textbook formula: no smoothing, no normalization)
    """
    # Tokenize once so TF and IDF use the same whole-word tokens
    tokenized = [doc.lower().split() for doc in documents]

    # Step 1: Calculate TF for each document
    tf_scores = []
    for words in tokenized:
        word_count = Counter(words)
        total_words = len(words)
        tf = {word: count / total_words for word, count in word_count.items()}
        tf_scores.append(tf)

    # Step 2: Calculate IDF
    all_words = set(word for words in tokenized for word in words)
    num_docs = len(documents)
    idf_scores = {}

    for word in all_words:
        # Match whole tokens; a substring test would count "cat" inside "cats"
        docs_with_word = sum(1 for words in tokenized if word in words)
        idf_scores[word] = math.log(num_docs / docs_with_word)

    # Step 3: Calculate TF-IDF
    tfidf_scores = []
    for tf in tf_scores:
        tfidf = {word: tf_val * idf_scores[word] for word, tf_val in tf.items()}
        tfidf_scores.append(tfidf)

    return tfidf_scores, idf_scores

# Test
docs = ["cat dog", "dog dog", "cat"]
tfidf, idf = calculate_tfidf_manual(docs)

print("IDF Scores:", idf)
print("\nTF-IDF Scores:")
for i, doc_tfidf in enumerate(tfidf):
    print(f"Document {i}: {doc_tfidf}")
```

```
IDF Scores: {'dog': 0.4054651081081644, 'cat': 0.4054651081081644}

TF-IDF Scores:
Document 0: {'cat': 0.2027325540540822, 'dog': 0.2027325540540822}
Document 1: {'dog': 0.4054651081081644}
Document 2: {'cat': 0.4054651081081644}
```
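
These numbers will not match `TfidfVectorizer`, because the manual function follows the textbook formula. The sketch below reproduces scikit-learn's default output with NumPy instead, assuming the defaults: raw counts as TF, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization. The variable names are illustrative only:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["cat dog", "dog dog", "cat"]

# Raw term counts per document (TF without length normalization)
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                # documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed IDF

weighted = counts * idf
# L2-normalize each row (every document here has at least one term)
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_result = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(manual, sklearn_result)
print("Manual result matches TfidfVectorizer:", matches)
```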


# Quick Implementation with scikit-learn

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "I love machine learning",
    "Machine learning is awesome",
    "I enjoy deep learning"
]

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform documents
tfidf_matrix = vectorizer.fit_transform(documents)

# View feature names (words)
print("Features:", vectorizer.get_feature_names_out())
print()
print("TF-IDF Matrix:")
print(tfidf_matrix.toarray())
```

```
Features: ['awesome' 'deep' 'enjoy' 'is' 'learning' 'love' 'machine']

TF-IDF Matrix:
[[0.         0.         0.         0.         0.42544054 0.72033345
  0.54783215]
 [0.5844829  0.         0.         0.5844829  0.34520502 0.
  0.44451431]
 [0.         0.65249088 0.65249088 0.         0.38537163 0.
  0.        ]]
```



## Understanding the Example

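**Document 1**: "I love machine learning"
- "learning" appears in all docs → lowest TF-IDF in the row (0.4254)
- "machine" appears in two docs → moderate TF-IDF (0.5478)
- "love" only appears here → highest TF-IDF (0.7203)

**Document 2**: "Machine learning is awesome"
- "awesome" and "is" only appear here → high TF-IDF (0.5845)
- "machine" appears in two docs → moderate TF-IDF (0.4445)
- "learning" appears everywhere → lower TF-IDF (0.3452)

**Document 3**: "I enjoy deep learning"
- "deep" and "enjoy" only appear here → high TF-IDF (0.6525)
- "learning" appears everywhere → lower TF-IDF (0.3854)

"I" is missing from the features because the default tokenizer only keeps tokens of two or more characters. The same word gets a different score in each row because every row is L2-normalized separately.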
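
The first row of the matrix can be checked by hand. This is a small sketch assuming scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization:

```python
import math

n_docs = 3
# Document frequencies of the three terms kept from "I love machine learning"
df = {"learning": 3, "machine": 2, "love": 1}

# Smoothed IDF (assumed default formula)
idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}

# Each term occurs once in this document, so the raw weight equals the IDF
norm = math.sqrt(sum(w ** 2 for w in idf.values()))
scores = {term: round(w / norm, 4) for term, w in idf.items()}
print(scores)
```

The results agree with the first matrix row: 0.4254 for "learning", 0.5478 for "machine" and 0.7203 for "love".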


# Basic Text Vectorization

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Documents
docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are enemies"
]

# Create and fit vectorizer
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# View results
df = pd.DataFrame(
    tfidf.toarray(),
    columns=vectorizer.get_feature_names_out()
)
print(df)
```

```
        and       are       cat      cats       dog      dogs   enemies  \
0  0.000000  0.000000  0.427554  0.000000  0.000000  0.000000  0.000000
1  0.000000  0.000000  0.000000  0.000000  0.427554  0.000000  0.000000
2  0.447214  0.447214  0.000000  0.447214  0.000000  0.447214  0.447214

        log       mat        on       sat       the
0  0.000000  0.427554  0.325166  0.325166  0.650331
1  0.427554  0.000000  0.325166  0.325166  0.650331
2  0.000000  0.000000  0.000000  0.000000  0.000000
```


# Text Classification

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Sample data
texts = [
    "I love this movie, it's amazing",
    "Terrible film, waste of time",
    "Great acting and story",
    "Boring and poorly made"
]
labels = [1, 0, 1, 0]  # 1=positive, 0=negative

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

# Vectorize text (fit on training data only)
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train classifier
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)

# Predict
predictions = classifier.predict(X_test_tfidf)
print("Predictions:", predictions)
```

```
Predictions: [1]
```


# Find Similar Documents

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Documents
documents = [
    "Python is a programming language",
    "Java is also a programming language",
    "Machine learning uses Python"
    ,
    "I love eating pizza"
]

# Vectorize
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Calculate pairwise cosine similarity between all documents
similarity = cosine_similarity(tfidf_matrix)

# Find most similar documents to document 0
doc_idx = 0
similarities = similarity[doc_idx]
similar_docs = sorted(
    # Exclude the document itself by index rather than by position
    ((idx, score) for idx, score in enumerate(similarities) if idx != doc_idx),
    key=lambda x: x[1],
    reverse=True
)

print(f"Documents similar to '{documents[doc_idx]}':")
for idx, score in similar_docs:
    print(f"  Doc {idx}: {score:.4f} - '{documents[idx]}'")
```

```
Documents similar to 'Python is a programming language':
  Doc 1: 0.6016 - 'Java is also a programming language'
  Doc 2: 0.2071 - 'Machine learning uses Python'
  Doc 3: 0.0000 - 'I love eating pizza'
```
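
The same idea extends to search: transform a free-text query with the fitted vectorizer and rank the documents against it. A minimal sketch using the same four documents (the query string is a made-up example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Python is a programming language",
    "Java is also a programming language",
    "Machine learning uses Python",
    "I love eating pizza"
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Use transform (not fit_transform) so the query shares the document vocabulary
query_vec = vectorizer.transform(["python programming"])

# One row of scores: similarity of the query to every document
scores = cosine_similarity(query_vec, tfidf_matrix)[0]
best = int(scores.argmax())
print(f"Best match: Doc {best} - '{documents[best]}'")
```

Document 0 ranks first because it contains both query terms, while the pizza sentence shares none and scores zero.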


# Keyword Extraction

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# A single long document
document = [
    "Machine learning is a field of artificial intelligence. "
    "Machine learning algorithms build models based on sample data. "
    "Deep learning is part of machine learning methods."
]

# Vectorize
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(document)

# Get feature names and scores
feature_names = vectorizer.get_feature_names_out()
scores = tfidf.toarray()[0]

# Get top keywords
top_n = 5
top_indices = np.argsort(scores)[::-1][:top_n]

print("Top Keywords:")
for idx in top_indices:
    print(f"  {feature_names[idx]}: {scores[idx]:.4f}")
```

```
Top Keywords:
  learning: 0.5898
  machine: 0.4423
  is: 0.2949
  of: 0.2949
  field: 0.1474
```


# Remove Stop Words

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This is a sample document",
    "This document is another sample"
]

# Without stop word removal
vectorizer_default = TfidfVectorizer()
tfidf_default = vectorizer_default.fit_transform(documents)
print("Without stop word removal:")
print(vectorizer_default.get_feature_names_out())

# With stop word removal
vectorizer_no_stop = TfidfVectorizer(stop_words='english')
tfidf_no_stop = vectorizer_no_stop.fit_transform(documents)
print("\nWith stop words removed:")
print(vectorizer_no_stop.get_feature_names_out())
```

```
Without stop word removal:
['another' 'document' 'is' 'sample' 'this']

With stop words removed:
['document' 'sample']
```


# Verifying TF-IDF Results

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

documents = [
    "cat dog cat",
    "dog dog dog",
    "cat cat dog"
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print("Features:", vectorizer.get_feature_names_out())
print("\nTF-IDF Matrix:")
print(tfidf_matrix.toarray())

# Check: Each document vector should have L2 norm = 1 (by default)
for i, doc_vector in enumerate(tfidf_matrix.toarray()):
    norm = np.linalg.norm(doc_vector)
    print(f"\nDocument {i} L2 norm: {norm:.4f}")
```

```
Features: ['cat' 'dog']

TF-IDF Matrix:
[[0.93219169 0.361965  ]
 [0.         1.        ]
 [0.93219169 0.361965  ]]

Document 0 L2 norm: 1.0000

Document 1 L2 norm: 1.0000

Document 2 L2 norm: 1.0000
```
