# TF-IDF (Term Frequency-Inverse Document Frequency)
This notebook explains the TF-IDF method in Natural Language Processing (NLP), covering both the mathematical foundations and a practical implementation in Python.

## 1. History and Background

TF-IDF is one of the most popular weighting schemes in information retrieval and text mining.

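- **1950s-1960s**: Term-frequency weighting grows out of early information retrieval research, notably Hans Peter Luhn's work on automatic indexing
- **1972**: Karen Spärck Jones formalizes the idea now known as inverse document frequency (she called it "term specificity") in her paper "A Statistical Interpretation of Term Specificity and Its Application in Retrieval"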
- **1980s-1990s**: Became standard in search engines and document classification
- **Present**: Still widely used as a baseline in text processing and NLP tasks

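TF-IDF reflects how important a word is to a document in a collection or corpus. The weight rises with the number of times the word appears in the document and falls with the number of documents in the corpus that contain it.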

## 2. Mathematical Foundations

TF-IDF is composed of two components:

### Term Frequency (TF)
Measures how frequently a term occurs in a document:

$$ tf(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$
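
As a small illustration of the formula above, the sketch below computes relative term frequencies for one sentence. The helper name `term_frequency` and the whitespace tokenization are assumptions made for this example only.

```python
from collections import Counter

def term_frequency(tokens):
    """Relative frequency of each token within one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# "the" occurs twice among the 7 tokens, so its TF is 2/7
tf_example = term_frequency("the sun in the sky is bright".split())
print(tf_example["the"])
```

Because each count is divided by the document length, the TF values of a document always sum to 1.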

### Inverse Document Frequency (IDF)
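Measures how rare, and therefore how informative, a term is across the entire corpus. A term that appears in every document gets an IDF of 0. The base of the logarithm only rescales the scores; the code in this notebook uses the natural logarithm: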

$$ idf(t,D) = \log\left(\frac{\text{Total number of documents in corpus } D}{\text{Number of documents containing term } t}\right) $$
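
The following sketch applies the IDF formula to a tiny three-sentence corpus. The function name `inverse_document_frequency`, the whitespace tokenization, and the use of the natural logarithm are assumptions for illustration.

```python
import math

corpus = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
]

def inverse_document_frequency(docs):
    """Natural-log IDF for every term that occurs in the corpus."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # set() ensures a term is counted at most once per document
        for term in set(doc.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

idf_example = inverse_document_frequency(corpus)
for term in ("the", "sun", "blue"):
    print(term, round(idf_example[term], 4))
```

Here "the" occurs in all three documents and scores 0, while "blue" occurs in only one and receives the highest IDF.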

### TF-IDF
The product of TF and IDF:

$$ tfidf(t,d,D) = tf(t,d) \times idf(t,D) $$
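
As a worked example, take the four sample sentences used in the implementation section of this notebook, assuming whitespace tokenization and the natural logarithm. The term "sky" occurs once among the 7 tokens of "the sun in the sky is bright" and appears in 2 of the 4 documents:

$$ tf = \frac{1}{7} \approx 0.1429, \qquad idf = \ln\left(\frac{4}{2}\right) \approx 0.6931, \qquad tfidf \approx 0.1429 \times 0.6931 \approx 0.0990 $$

By contrast, "the" appears in all 4 documents, so its IDF is $\ln(4/4) = 0$ and its TF-IDF is 0 no matter how often it occurs in a document.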

Variations exist for both TF and IDF calculations (logarithmic scaling, augmented frequency, smoothing, etc.).
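
A few of these variants can be sketched as follows. The function names are hypothetical, and the constant 0.5 in the augmented form is one common convention rather than a fixed rule. The smoothed IDF shown is the form scikit-learn documents for `smooth_idf=True`.

```python
import math

def log_scaled_tf(count):
    """Logarithmic TF: 1 + ln(count) for count > 0, otherwise 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0

def augmented_tf(count, max_count):
    """Augmented TF: 0.5 + 0.5 * count / (count of the most frequent term)."""
    return 0.5 + 0.5 * count / max_count

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1, which is never zero."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

print(log_scaled_tf(10))   # grows slowly as the raw count increases
print(augmented_tf(2, 4))  # damps the effect of document length
print(smoothed_idf(4, 4))  # a term found in every document still gets weight 1
```

Log scaling and augmentation both reduce the advantage of long documents that repeat a term many times, while smoothing avoids a zero weight for terms that occur in every document.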

## 3. Python Implementation

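Let's implement TF-IDF from scratch and then compute it with scikit-learn. The two sets of scores are not expected to match exactly, because scikit-learn's defaults use a smoothed IDF and normalize each document vector.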

In [None]:
import numpy as np
import pandas as pd
from collections import defaultdict
import math

# Sample documents
documents = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
    "we can see the shining sun, the bright sun"
]

In [None]:
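import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation"""
    # Without this step, "sun," and "sun" would be counted as different terms
    return re.findall(r"\w+", text.lower())

def compute_tf(text):
    """Compute Term Frequency for a single document"""
    words = tokenize(text)
    word_count = len(words)
    
    # Occurrences of each word divided by the document length
    return {word: count / word_count for word, count in Counter(words).items()}

def compute_idf(docs):
    """Compute Inverse Document Frequency for every term in the corpus"""
    doc_freq = defaultdict(int)
    total_docs = len(docs)
    
    # Count each word once per document in which it appears
    for doc in docs:
        for word in set(tokenize(doc)):
            doc_freq[word] += 1
    
    # Natural log of (total documents / documents containing the word)
    return {word: math.log(total_docs / df) for word, df in doc_freq.items()}

def compute_tfidf(docs):
    """Compute TF-IDF for all documents"""
    tfidf = []
    idf = compute_idf(docs)
    
    for doc in docs:
        tf = compute_tf(doc)
        # Weight each term frequency by the term's corpus-level IDF
        doc_tfidf = {word: tf[word] * idf[word] for word in tf}
        tfidf.append(doc_tfidf)
    
    return tfidf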

In [None]:
# Compute TF-IDF using our implementation
tfidf_results = compute_tfidf(documents)

# Display results
for i, doc in enumerate(tfidf_results):
    print(f"Document {i+1}:")
    for word, score in doc.items():
        print(f"  {word}: {score:.4f}")
    print()

### Using scikit-learn's TfidfVectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_sklearn = vectorizer.fit_transform(documents)

# Convert to DataFrame for better visualization
df_tfidf = pd.DataFrame(
    tfidf_sklearn.toarray(),
    columns=vectorizer.get_feature_names_out()
)

print("TF-IDF Matrix from scikit-learn:")
df_tfidf
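
The scikit-learn scores differ from the from-scratch ones because of `TfidfVectorizer`'s documented defaults: text is lowercased, tokens need at least two word characters, TF is the raw count, IDF is smoothed as $\ln\frac{1+N}{1+df(t)} + 1$, and each document vector is L2-normalized. The sketch below reimplements those defaults in plain Python. The function name `sklearn_style_tfidf` is hypothetical, and the code is an approximation for illustration, not scikit-learn's actual implementation.

```python
import math
import re
from collections import Counter

docs = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
    "we can see the shining sun, the bright sun",
]

def sklearn_style_tfidf(docs):
    """TF-IDF with raw-count TF, smoothed IDF, and L2-normalized rows."""
    # Lowercase and keep tokens made of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocab]
        # Fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = sklearn_style_tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{term}: {weight:.4f}")
```

The printed weights for the first document can be checked against the first row of `df_tfidf`.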

## 4. Applications and References

### Applications:
- Search engine ranking
- Document classification
- Keyword extraction
- Text summarization
- Information retrieval systems
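
As a minimal sketch of the search-ranking use case, the example below ranks the sample sentences against a query by cosine similarity of their TF-IDF vectors. The query string and the variable names are made up for illustration. A production search system would typically add an inverted index and a more elaborate scoring function.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
    "we can see the shining sun, the bright sun",
]
query = "bright sun"  # hypothetical user query

# Learn the vocabulary and IDF weights from the corpus only
search_vectorizer = TfidfVectorizer()
doc_vectors = search_vectorizer.fit_transform(corpus)

# Project the query into the same vector space
query_vector = search_vectorizer.transform([query])

# Cosine similarity between the query and every document
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]  # document indices, best match first

for idx in ranking:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```

Documents that share no terms with the query score exactly 0, which is why "the sky is blue" lands at the bottom of the ranking.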

### References:
1. Spärck Jones, K. (1972). "A statistical interpretation of term specificity and its application in retrieval". Journal of Documentation, 28(1), 11-21.
2. Manning, C. D., Raghavan, P., & Schütze, H. (2008). "Introduction to Information Retrieval". Cambridge University Press.
3. Jurafsky, D., & Martin, J. H. (2020). "Speech and Language Processing". Pearson.
4. Scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

### Limitations:
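- Doesn't capture semantic meaning (synonyms such as "car" and "automobile" are treated as unrelated terms)
- Treats words as independent features and ignores word order
- Can be memory- and compute-intensive for large vocabularies, since every distinct term adds a dimension to the document vectors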

Modern alternatives include word embeddings (Word2Vec, GloVe) and transformer-based models (BERT).