## TF-IDF 

TF-IDF (term frequency-inverse document frequency) scores how important a word is to one document relative to a collection of documents.

## TF-IDF = Two Simple Ideas

### 1. Term Frequency (TF)

**Term Frequency (TF)** measures how many times a word appears in **a document**.

#### Example

"I love AI and I love NLP"

- **love** → appears **2 times** → high TF  
- **AI** → appears **1 time** → lower TF  
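
The counts above can be reproduced with a minimal sketch built on `collections.Counter`. Lowercasing and splitting on whitespace are simplifying assumptions here, not a full tokenizer.

```python
from collections import Counter

sentence = "I love AI and I love NLP"

# Lowercase and split on whitespace (a simplification, not a real tokenizer)
words = sentence.lower().split()
tf = Counter(words)

print(tf["love"])  # 2
print(tf["ai"])    # 1
```

Note that **I** also appears twice, so raw counts alone cannot tell a content word from a filler word. That gap is what IDF is meant to close.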

### 2. Inverse Document Frequency (IDF)

**Inverse Document Frequency (IDF)** measures how rare a word is across all documents in the collection. Rarer words are treated as more informative.

- Common words like **is**, **the**, **and**  
  → appear in almost every document  
  → **low IDF (not important)**

- Specific words like **NLP**, **AI**  
  → appear in fewer documents  
  → **high IDF (important)**
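
The code cells later in this section compute IDF as the natural logarithm of the total number of documents divided by the number of documents containing the word. Written as a formula, where $N$ and $\mathrm{df}(t)$ are notation introduced here for illustration:

```latex
\mathrm{idf}(t) = \ln\!\left(\frac{N}{\mathrm{df}(t)}\right)
\qquad
\begin{aligned}
N &= \text{total number of documents} \\
\mathrm{df}(t) &= \text{number of documents containing term } t
\end{aligned}
```

For a corpus of $N = 3$ documents, a word found in two of them gets $\ln(3/2) \approx 0.405$, a word found in only one gets $\ln(3) \approx 1.099$, and a word found in all three gets $\ln(1) = 0$. Other variants of the formula exist (for example, with smoothing terms); this is only the unsmoothed form used in this notebook.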

In [22]:
documents = [
    "I love NLP",
    "I love AI",
    "AI and NLP are useful"
]

## Term Frequency (TF)

In [23]:
from collections import Counter

doc = documents[0]  # "I love NLP"

words = doc.lower().split()
tf = Counter(words)

print("Document:", doc)
print("Term Frequency (TF):")
print(tf)

Document: I love NLP
Term Frequency (TF):
Counter({'i': 1, 'love': 1, 'nlp': 1})


In [24]:
for i, doc in enumerate(documents):
    words = doc.lower().split()
    tf = Counter(words)
    print(f"\nDocument {i+1} TF:")
    print(tf)



Document 1 TF:
Counter({'i': 1, 'love': 1, 'nlp': 1})

Document 2 TF:
Counter({'i': 1, 'love': 1, 'ai': 1})

Document 3 TF:
Counter({'ai': 1, 'and': 1, 'nlp': 1, 'are': 1, 'useful': 1})


## Inverse Document Frequency (IDF)

In [25]:
import math

total_documents = len(documents)
print("Total documents:", total_documents)

Total documents: 3


In [26]:
document_frequency = {}

for doc in documents:
    words = set(doc.lower().split())  # unique words, so each document counts a word at most once
    for word in words:
        document_frequency[word] = document_frequency.get(word, 0) + 1

print("Document Frequency:")
for word, count in document_frequency.items():
    print(word, "→", count)


Document Frequency:
nlp → 2
i → 2
love → 2
ai → 2
useful → 1
are → 1
and → 1


In [28]:
idf = {}

for word, doc_count in document_frequency.items():
    idf[word] = math.log(total_documents / doc_count)

print("\nIDF Scores:")
for word, value in idf.items():
    print(f"{word}: {value:.3f}")



IDF Scores:
nlp: 0.405
i: 0.405
love: 0.405
ai: 0.405
useful: 1.099
are: 1.099
and: 1.099


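### Key Takeaway

- **IDF measures rarity**, not frequency
- **Rare words get higher importance**
- **IDF alone does NOT consider** how often a word appears in a document
- In this three-document corpus, **and** and **are** score as high as **useful** because each appears in only one document. IDF separates common words from specific ones only when the corpus is large enough for common words to appear in most documents.
- **TF × IDF gives TF-IDF**: the two scores are multiplied, not added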
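
To make the third point concrete, here is a small sketch on a hypothetical two-document corpus, chosen only for illustration. The helper `idf` is not part of the notebook above. A word repeated three times in one document gets the same IDF as a word that appears once, because only the number of documents containing the word matters.

```python
import math

# Hypothetical two-document corpus, used only to illustrate the point
corpus = ["nlp nlp nlp", "ai"]

def idf(word, docs):
    # Count documents that contain the word at least once
    df = sum(1 for d in docs if word in d.lower().split())
    return math.log(len(docs) / df)

print(round(idf("nlp", corpus), 3))  # 0.693
print(round(idf("ai", corpus), 3))   # 0.693
```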

## Combine TF and IDF → TF-IDF

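In [29]:
# After the loop in In [24], `tf` holds the counts of the last document only.
# Recompute it here so it is explicit which document is being scored.
doc = documents[2]  # "AI and NLP are useful"
tf = Counter(doc.lower().split())

tfidf = {}

for word, tf_value in tf.items():
    tfidf[word] = tf_value * idf[word]  # term count times rarity

print("TF-IDF for:", doc)
for word, value in tfidf.items():
    print(f"{word}: {value:.3f}")


TF-IDF for: AI and NLP are useful
ai: 0.405
and: 1.099
nlp: 0.405
are: 1.099
useful: 1.099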
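
The cell above scores the words of a single document. As a sketch of how the same steps extend to the whole corpus, the block below wraps them in functions. The names `tokenize`, `compute_idf`, and `compute_tfidf` are hypothetical helpers introduced here, and the raw-count TF and unsmoothed IDF follow the notebook's choices rather than any particular library.

```python
import math
from collections import Counter

documents = [
    "I love NLP",
    "I love AI",
    "AI and NLP are useful",
]

def tokenize(text):
    # Same simplification as the cells above: lowercase, split on whitespace
    return text.lower().split()

def compute_idf(docs):
    # Document frequency: how many documents contain each word
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {word: math.log(len(docs) / count) for word, count in df.items()}

def compute_tfidf(docs):
    idf = compute_idf(docs)
    # One dict per document: raw term count multiplied by IDF
    return [
        {word: count * idf[word] for word, count in Counter(tokenize(doc)).items()}
        for doc in docs
    ]

for i, scores in enumerate(compute_tfidf(documents), start=1):
    print(f"Document {i}:", {w: round(s, 3) for w, s in scores.items()})
```

Every word in this corpus appears at most once per document, so each TF-IDF score here equals the word's IDF. The TF factor only starts to matter when a document repeats a word.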


### Final Key Takeaway

- **TF** → frequency  
- **IDF** → rarity  
- **TF-IDF** → importance  

**TF-IDF is good for:**
- Keyword-based search  
- Simple text matching  

**TF-IDF is bad for:**
- Meaning-based (semantic) search, because it matches exact words and treats synonyms as unrelated
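
Both lists can be illustrated with a minimal keyword-search sketch over the three-document corpus from this section. The functions `score` and `rank` are hypothetical helpers, and summing the TF-IDF of matching query words is one simple scoring choice among several, assumed here for illustration.

```python
import math
from collections import Counter

documents = ["I love NLP", "I love AI", "AI and NLP are useful"]

def tokenize(text):
    return text.lower().split()

# IDF over the corpus, computed as in the cells above
df = Counter()
for doc in documents:
    df.update(set(tokenize(doc)))
idf = {word: math.log(len(documents) / count) for word, count in df.items()}

def score(query, doc):
    # Sum TF-IDF over query words; words unknown to the corpus contribute 0
    tf = Counter(tokenize(doc))
    return sum(tf[word] * idf.get(word, 0.0) for word in tokenize(query))

def rank(query):
    # Document indices sorted by score, best match first
    return sorted(
        range(len(documents)),
        key=lambda i: score(query, documents[i]),
        reverse=True,
    )

print(documents[rank("love nlp")[0]])        # I love NLP
print(score("adore", documents[0]) == 0)     # True
```

The query "love nlp" ranks the first document highest because it contains both words. The query "adore" scores zero everywhere, even though it means roughly the same as "love": exact word matching cannot see that relationship.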

Modern semantic search and LLM-based systems typically use **embeddings instead of TF-IDF** to capture meaning and context.