# 5.3 TF-IDF

1️⃣ First, Remember Bag of Words (BoW)
From before, our two sentences are:

"I love cricket"

"I love playing cricket"

Bag of Words counted how many times each word appeared:

| Word | Sentence 1 | Sentence 2 |
|---|---|---|
| I | 1 | 1 |
| love | 1 | 1 |
| cricket | 1 | 1 |
| playing | 0 | 1 |
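The counts in this table can be reproduced with a few lines of plain Python. This is a minimal sketch that assumes simple whitespace splitting; the names `tokenized`, `vocab`, and `bow` are illustrative only.

```python
from collections import Counter

sentences = ["I love cricket", "I love playing cricket"]

# Split on whitespace; real tokenizers also handle case and punctuation.
tokenized = [s.split() for s in sentences]

# Build the vocabulary in order of first appearance.
vocab = []
for tokens in tokenized:
    for word in tokens:
        if word not in vocab:
            vocab.append(word)

# One row of counts per sentence, one column per vocabulary word.
bow = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]

# Print one line per word, with its count in each sentence.
for word, counts in zip(vocab, zip(*bow)):
    print(word, *counts, sep="\t")
```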

2️⃣ Problem with just counting 🤔
Some words, like “I” or “love”, appear in almost every sentence, so they tell us little about what a sentence is really about.
Words that show up in only a few sentences, like “playing” in our example, are more special.
We want the computer to give more weight to special words and less weight to common ones.

This is where TF-IDF comes in. ✅

3️⃣ TF-IDF in Simple Words
TF = Term Frequency → How often the word appears in the sentence. It starts from the BoW count and, as in Step 1 below, is divided by the sentence length.

IDF = Inverse Document Frequency → If a word appears in many sentences, it’s less important. If it appears in few sentences, it’s more important.

So:

TF-IDF = TF × IDF

4️⃣ Let’s Calculate TF-IDF for our example
We have 2 sentences and 4 distinct words (I, love, cricket, playing).

Step 1: Term Frequency (TF)
TF is the count of the word ÷ the total number of words in the sentence.

Sentence 1 ("I love cricket"):

I → 1/3 = 0.33

love → 1/3 = 0.33

cricket → 1/3 = 0.33

playing → 0/3 = 0

Sentence 2 ("I love playing cricket"):

I → 1/4 = 0.25

love → 1/4 = 0.25

cricket → 1/4 = 0.25

playing → 1/4 = 0.25
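The Step 1 arithmetic can be checked with a short sketch. It assumes whitespace tokenization and a hand-written vocabulary; `term_frequency` is a hypothetical helper name, not a library function.

```python
sentences = ["I love cricket", "I love playing cricket"]
vocab = ["I", "love", "cricket", "playing"]

def term_frequency(sentence, vocab):
    """Return {word: count / sentence length} for every vocabulary word."""
    tokens = sentence.split()
    return {word: tokens.count(word) / len(tokens) for word in vocab}

tf = [term_frequency(s, vocab) for s in sentences]

# Show the values rounded to two decimals, as in the worked example.
for i, scores in enumerate(tf, start=1):
    print(f"Sentence {i}:", {w: round(v, 2) for w, v in scores.items()})
```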

Step 2: Inverse Document Frequency (IDF)
Formula:

IDF = log(Total Sentences ÷ Sentences Containing the Word)

Here log is the natural logarithm, so log(2) ≈ 0.693.

I → log(2/2) = 0 (appears in every sentence → not special)

love → log(2/2) = 0 (appears in every sentence → not special)

cricket → log(2/2) = 0 (appears in both sentences → not special)

playing → log(2/1) = log(2) ≈ 0.693 (appears in only 1 sentence → special)
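The IDF values from Step 2 can be computed the same way. The sketch below assumes the natural logarithm and whitespace tokenization; `inverse_document_frequency` is an illustrative name.

```python
import math

sentences = ["I love cricket", "I love playing cricket"]
vocab = ["I", "love", "cricket", "playing"]

def inverse_document_frequency(word, sentences):
    """ln(total sentences / sentences containing the word).

    Assumes the word occurs in at least one sentence; otherwise the
    division would fail.
    """
    containing = sum(1 for s in sentences if word in s.split())
    return math.log(len(sentences) / containing)

idf = {word: inverse_document_frequency(word, sentences) for word in vocab}

for word, value in idf.items():
    print(word, round(value, 3))
```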

Step 3: TF × IDF
Sentence 1:

I = 0.33 × 0 = 0

love = 0.33 × 0 = 0

cricket = 0.33 × 0 = 0

playing = 0 × 0.693 = 0

Sentence 2:

I = 0.25 × 0 = 0

love = 0.25 × 0 = 0

cricket = 0.25 × 0 = 0

playing = 0.25 × 0.693 = 0.173
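Putting the steps together, the hand calculation can be sketched as one small function. This follows the plain TF × IDF definition used in this walkthrough (natural log, no smoothing, no normalisation); `tf_idf` is a hypothetical helper name.

```python
import math

sentences = ["I love cricket", "I love playing cricket"]
tokenized = [s.split() for s in sentences]
vocab = ["I", "love", "cricket", "playing"]

def tf_idf(word, tokens, tokenized):
    """Plain TF x IDF for one word in one tokenized sentence."""
    tf = tokens.count(word) / len(tokens)
    containing = sum(1 for doc in tokenized if word in doc)
    idf = math.log(len(tokenized) / containing)
    return tf * idf

# One row per sentence, one column per vocabulary word.
scores = [[tf_idf(w, tokens, tokenized) for w in vocab] for tokens in tokenized]

for i, row in enumerate(scores, start=1):
    print(f"Sentence {i}:", dict(zip(vocab, (round(x, 3) for x in row))))
```

Only "playing" in Sentence 2 ends up with a non-zero score, which matches the values worked out by hand.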

✅ Meaning:
TF-IDF gives more importance to "playing" because it appears only in Sentence 2.

"I", "love", and "cricket" appear in both sentences, so in this tiny example their weight drops all the way to 0. In a larger collection, words that appear in most but not all sentences get a small weight rather than exactly 0.

This tells the computer which words best distinguish one sentence from another.

💡 Think of TF-IDF like a cricket commentary:

If every player is hitting "singles", it's common.

But if one player hits a "six" (special word), the crowd gives more attention! 🎉

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
data = ['Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

In [4]:
tfidfvec = TfidfVectorizer()

In [5]:
tfidfvec_fit = tfidfvec.fit_transform(data)

In [6]:
tfidf_bag = pd.DataFrame(tfidfvec_fit.toarray(), columns = tfidfvec.get_feature_names_out())

In [7]:
print(tfidf_bag)

         10     about  admirable     ahead       are        as   attacks  \
0  0.257061  0.257061   0.000000  0.000000  0.210794  0.000000  0.257061   
1  0.000000  0.000000   0.293641  0.000000  0.000000  0.000000  0.000000   
2  0.000000  0.000000   0.000000  0.000000  0.000000  0.292313  0.000000   
3  0.000000  0.000000   0.000000  0.000000  0.222257  0.000000  0.000000   
4  0.000000  0.000000   0.000000  0.290766  0.000000  0.000000  0.000000   
5  0.000000  0.000000   0.000000  0.000000  0.000000  0.178615  0.000000   

      back     bait     beach  ...      were     west     when     where  \
0  0.00000  0.00000  0.257061  ...  0.000000  0.00000  0.00000  0.257061   
1  0.00000  0.00000  0.000000  ...  0.000000  0.00000  0.00000  0.000000   
2  0.00000  0.00000  0.000000  ...  0.356474  0.00000  0.00000  0.000000   
3  0.00000  0.00000  0.000000  ...  0.000000  0.27104  0.27104  0.000000   
4  0.00000  0.00000  0.000000  ...  0.000000  0.00000  0.00000  0.000000   
5  0.21782 

In [3]:

sentences = ["I love cricket", "I love playing cricket"]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences)

print("🔹 Words:", vectorizer.get_feature_names_out())
print("🔹 TF-IDF Scores:\n", tfidf_matrix.toarray())

🔹 Words: ['cricket' 'love' 'playing']
🔹 TF-IDF Scores:
 [[0.70710678 0.70710678 0.        ]
 [0.50154891 0.50154891 0.70490949]]
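
These numbers differ from the hand calculation because `TfidfVectorizer` uses different defaults: it lowercases the text and ignores single-character tokens (so "I" is dropped), uses the raw count as TF, applies a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and rescales each row to unit length (L2 norm). The sketch below re-creates that recipe in plain Python for this example only. The 2+ character filter is a simplification that stands in for the library's tokenizer, and all variable names are illustrative.

```python
import math

sentences = ["I love cricket", "I love playing cricket"]

# Lowercase and keep tokens of 2+ characters, mimicking the default
# tokenizer's effect on this example (it drops "I").
tokenized = [[t for t in s.lower().split() if len(t) >= 2] for s in sentences]
vocab = sorted({t for tokens in tokenized for t in tokens})

n = len(tokenized)
rows = []
for tokens in tokenized:
    weights = []
    for word in vocab:
        df = sum(1 for doc in tokenized if word in doc)
        idf = math.log((1 + n) / (1 + df)) + 1      # smoothed IDF
        weights.append(tokens.count(word) * idf)    # raw count as TF
    norm = math.sqrt(sum(w * w for w in weights))   # L2 normalisation
    rows.append([w / norm for w in weights])

print(vocab)
for row in rows:
    print([round(x, 8) for x in row])
```

Because of the "+ 1" in the smoothed IDF, "cricket" and "love" keep a non-zero weight even though they appear in both sentences, while "playing" still receives the largest weight in Sentence 2.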
