#💡TF-IDF:
TF-IDF = Term Frequency - Inverse Document Frequency

Term Frequency (TF):
How often a word appears in a document.

Inverse Document Frequency (IDF):
How rare a word is across all documents (gives more importance to words that appear in only a few documents).

🔍 TF-IDF gives higher weight to words that appear frequently in a specific document, but not across all documents.

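🧪 TF-IDF Formula Breakdown (textbook form):
💥 TF-IDF(w, d) = TF(w, d) × IDF(w)
Where:

TF(w, d) = Term frequency of word w in document d

IDF(w) = log(N / df(w))

N = total number of documents

df(w) = number of documents containing word w

Note: scikit-learn's TfidfVectorizer with smooth_idf=False computes IDF(w) = ln(N / df(w)) + 1, and by default (norm='l2') rescales each document's row to unit length, so its scores differ from the raw textbook product.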
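
As a rough illustration of the formula, here is a minimal pure-Python sketch of the textbook version. The toy corpus and the helper names `tf`, `idf`, and `tf_idf` are made up for this example; it assumes raw counts for TF, the natural log for IDF, and words that occur in at least one document (otherwise `df` would be 0).

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized (not the notebook's sample data).
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "cat"],
]

def tf(word, doc):
    """Raw count of `word` in the tokenized document `doc`."""
    return Counter(doc)[word]

def idf(word, docs):
    """Textbook IDF: ln(N / df(w)), no smoothing. Assumes df(w) >= 1."""
    n_docs = len(docs)
    df = sum(1 for doc in docs if word in doc)
    return math.log(n_docs / df)

def tf_idf(word, doc, docs):
    """Textbook TF-IDF: TF(w, d) * IDF(w)."""
    return tf(word, doc) * idf(word, docs)

last_doc = corpus[2]
print(tf_idf("the", last_doc, corpus))  # "the" is in every document: 1 * ln(3/3)
print(tf_idf("cat", last_doc, corpus))  # 2 * ln(3/2)
print(tf_idf("ran", last_doc, corpus))  # 1 * ln(3/1)
```

"ran" scores highest here even though "cat" occurs twice, because "ran" appears in only one document.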



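##📌Key Points:

If a word appears in every document, its textbook IDF = log(1) = 0 ➝ TF-IDF = 0 ✅ (scikit-learn adds 1 to the IDF, so there the score stays above 0)

If a word appears in only one document, its IDF is high ➝ TF-IDF = high ✅

Stop words appear in almost every document, so their TF-IDF score is low or 0 ➝ they’re often removed ❌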
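
These points can be checked by hand on the four sample sentences used in this notebook. The sketch below is only an approximation of real tokenization: it assumes lowercasing plus whitespace splitting, keeps stop words in, and uses the textbook IDF = ln(N / df(w)). The helper name `idf` is made up for the example.

```python
import math

documents = [
    "I love Python and machine learning",
    "Python is great and powerful",
    "I enjoy learning about AI and Python",
    "Machine learning and AI are the future",
]

# Simplified tokenization (assumption): lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in documents]

def idf(word):
    """Textbook IDF: ln(N / df(w)). Assumes the word occurs somewhere."""
    df = sum(1 for tokens in tokenized if word in tokens)
    return math.log(len(tokenized) / df)

# "and" is in all 4 documents, "python" in 3, "love" in only 1.
for word in ["and", "python", "love"]:
    print(f"{word}: idf = {idf(word):.4f}")
```

The stop word "and" gets an IDF of exactly 0, "python" gets about 0.2877, and "love" gets about 1.3863, which matches the ordering described above.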

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Sample documents (corpus)
documents = [
    "I love Python and machine learning",
    "Python is great and powerful",
    "I enjoy learning about AI and Python",
    "Machine learning and AI are the future"
]

# Create TfidfVectorizer with specific parameters
vectorizer = TfidfVectorizer(
    stop_words='english',        # Removes common stop words like 'I', 'and', 'the'
    use_idf=True,                # Use Inverse Document Frequency in calculation
    smooth_idf=False,            # No smoothing: IDF = ln(N / df) + 1 (smoothing would add 1 to both N and df)
    ngram_range=(1, 1)           # Only single words (unigrams), no bigrams/trigrams
)

# Learn vocabulary and compute TF-IDF matrix. fit_transform:
#   1. tokenizes each document
#   2. builds a vocabulary
#   3. computes TF-IDF scores for each word in each document
tfidf_matrix = vectorizer.fit_transform(documents)


# Convert to DataFrame for better readability
feature_names = vectorizer.get_feature_names_out()
df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

# Print the final TF-IDF matrix
print("TF-IDF Matrix (use_idf=True, smooth_idf=False, ngram_range=(1,1)):\n")
print(df)


TF-IDF Matrix (use_idf=True, smooth_idf=False, ngram_range=(1,1)):

         ai    enjoy   future     great  learning     love   machine  \
0  0.000000  0.00000  0.00000  0.000000  0.373635  0.69241  0.491286   
1  0.000000  0.00000  0.00000  0.660648  0.000000  0.00000  0.000000   
2  0.491286  0.69241  0.00000  0.000000  0.373635  0.00000  0.000000   
3  0.468049  0.00000  0.65966  0.000000  0.355963  0.00000  0.468049   

   powerful    python  
0  0.000000  0.373635  
1  0.660648  0.356496  
2  0.000000  0.373635  
3  0.000000  0.000000  
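
The numbers in this matrix can be reproduced by hand, which helps show what the vectorizer is doing. The sketch below assumes scikit-learn's documented behaviour for these settings: raw counts for TF, IDF = ln(N / df) + 1 when smooth_idf=False, and L2 normalization of each row (the default norm='l2'). The token lists are written out manually as the words left after stop-word removal, and the helper names `idf` and `tfidf_row` are made up for the example.

```python
import math

# Tokens left in each sample document after English stop-word removal
# (typed in by hand here, not produced by scikit-learn).
docs = [
    ["love", "python", "machine", "learning"],
    ["python", "great", "powerful"],
    ["enjoy", "learning", "ai", "python"],
    ["machine", "learning", "ai", "future"],
]
n_docs = len(docs)

def idf(word):
    """Unsmoothed scikit-learn style IDF: ln(N / df) + 1."""
    df = sum(1 for doc in docs if word in doc)
    return math.log(n_docs / df) + 1.0

def tfidf_row(doc):
    """TF * IDF for each word in `doc`, then scaled to unit L2 length."""
    raw = {word: doc.count(word) * idf(word) for word in set(doc)}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

row0 = tfidf_row(docs[0])
for word in sorted(row0):
    print(f"{word:<10}{row0[word]:.6f}")
```

For document 0 this gives roughly 0.373635 for "learning" and "python", 0.692410 for "love", and 0.491286 for "machine", in line with row 0 of the printed matrix.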


##📌Use Cases
Document Classification

Search Engines (ranking pages)

Keyword Extraction

Spam Detection

Sentiment Analysis

