Yesterday, we used the "Bag of Words" (CountVectorizer). It simply counted words. But there is a huge flaw in that method.

Sentence A: "The man walked the dog."

Sentence B: "The girl walked the cat."

The word "the" appears twice in each sentence. To a simple counter, "the" looks like the most important word in the universe because it has the highest count. But "the" is useless noise. The important words are "dog" and "cat," even though each appears only once.
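To see the flaw in numbers, here is a minimal sketch that counts words in the two sentences above. It uses a hand-rolled `bag_of_words` helper (a hypothetical stand-in built on `collections.Counter`, not `CountVectorizer` itself), so treat it as an illustration of the idea rather than yesterday's exact code.

```python
from collections import Counter

sentence_a = "The man walked the dog."
sentence_b = "The girl walked the cat."

def bag_of_words(sentence):
    """Hypothetical helper: lowercase, drop the period, count each word."""
    tokens = sentence.lower().replace(".", "").split()
    return Counter(tokens)

counts_a = bag_of_words(sentence_a)
counts_b = bag_of_words(sentence_b)

# most_common() lists words from highest count to lowest
print(counts_a.most_common())
print(counts_b.most_common())
```

In both sentences "the" comes out on top with a count of 2, while "dog" and "cat" sit at 1.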

Today, we fix this using TF-IDF. It stands for Term Frequency - Inverse Document Frequency. It is a math formula that says: "If a word appears everywhere (like 'the'), punish it. If a word appears rarely (like 'aliens'), boost it."

Picture a party where every guest is a document and you are listening for the words that matter. TF-IDF is your brain at that party.

TF (Term Frequency): "How often did this person say the word?" (Local Importance).

IDF (Inverse Document Frequency): "How rare is this word across the whole party?" (Global Importance).

High Count + Rare Word = High Score (e.g., "Python", "Recipe").

High Count + Common Word = Low Score (e.g., "is", "the").
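The two rules above can be sketched by hand. The snippet below is an illustration under assumptions: the three tiny `docs` are made up for this example, and the IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1`, which is scikit-learn's default (`smooth_idf=True`). It skips the length normalization that `TfidfVectorizer` applies afterwards.

```python
import math

# Hypothetical, already-tokenized documents (not the data used later on)
docs = [
    ["the", "python", "code", "the", "python"],
    ["the", "pasta", "recipe"],
    ["the", "soup", "is", "hot"],
]

def tf(term, doc):
    # Local importance: how often the term occurs in this one document
    return doc.count(term)

def idf(term, docs):
    # Global importance: terms found in fewer documents get a larger weight
    n_docs = len(docs)
    doc_freq = sum(1 for d in docs if term in d)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Both words occur twice in the first document...
score_python = tf_idf("python", docs[0], docs)  # rare: in 1 of 3 documents
score_the = tf_idf("the", docs[0], docs)        # common: in all 3 documents

print(f"python: {score_python:.3f}")
print(f"the:    {score_the:.3f}")
```

Same count, different score: "python" beats "the" purely because it is rarer across the documents.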

# 1: Setup

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

In [3]:
# 1. Fake data
# Labels: 0 = tech, 1 = food

sentences = [
    'python code is crashing', 'I need to debug my server',
    'The spicy pasta recipe is great', 'Cook the veggies for 20 minutes',
    'The code compilation Failed', 'Add salt and pepper to the soup',
]
labels = [0, 0, 1, 1, 0, 1]


# 2: The TF-IDF Transformation

In [5]:
# Watch how the numbers look different from yesterday's. They won't be integers (1, 2, 3); they will be decimals (0.45, 0.12).

In [7]:
# Create a vectorizer.
# stop_words='english' automatically removes "the", "is", "and" completely!

tfidf = TfidfVectorizer(stop_words='english')

# Fit and transform

X = tfidf.fit_transform(sentences)

In [9]:
print("Vocabulary :", tfidf.get_feature_names_out())

Vocabulary : ['20' 'add' 'code' 'compilation' 'cook' 'crashing' 'debug' 'failed'
 'great' 'minutes' 'need' 'pasta' 'pepper' 'python' 'recipe' 'salt'
 'server' 'soup' 'spicy' 'veggies']


In [11]:
# Let's look at the scores

df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())

In [14]:
print(df)

    20  add      code  compilation  cook  crashing    debug    failed  great  \
0  0.0  0.0  0.501613     0.000000   0.0  0.611713  0.00000  0.000000    0.0   
1  0.0  0.0  0.000000     0.000000   0.0  0.000000  0.57735  0.000000    0.0   
2  0.0  0.0  0.000000     0.000000   0.0  0.000000  0.00000  0.000000    0.5   
3  0.5  0.0  0.000000     0.000000   0.5  0.000000  0.00000  0.000000    0.0   
4  0.0  0.0  0.501613     0.611713   0.0  0.000000  0.00000  0.611713    0.0   
5  0.0  0.5  0.000000     0.000000   0.0  0.000000  0.00000  0.000000    0.0   

   minutes     need  pasta  pepper    python  recipe  salt   server  soup  \
0      0.0  0.00000    0.0     0.0  0.611713     0.0   0.0  0.00000   0.0   
1      0.0  0.57735    0.0     0.0  0.000000     0.0   0.0  0.57735   0.0   
2      0.0  0.00000    0.5     0.0  0.000000     0.5   0.0  0.00000   0.0   
3      0.5  0.00000    0.0     0.0  0.000000     0.0   0.0  0.00000   0.0   
4      0.0  0.00000    0.0     0.0  0.000000     0.0  

# 3: Analyze

Look at the output table.

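Find the word "code". In the first sentence (row 0), it has a score of about 0.50, while "python" and "crashing" score about 0.61. "code" is lower because it also appears in the fifth sentence (row 4), so it is less rare.

Find the word "the". It does not exist in the table at all, because stop_words='english' removed it. Had we kept it, its appearance in several sentences would have earned it a lower weight than the rarer words next to it.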
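If you want to check where the row 0 numbers in the table (0.501613 and 0.611713) come from, here is a by-hand sketch. It assumes scikit-learn's default settings: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The document frequencies are read off our six sentences after stop-word removal.

```python
import math

n_docs = 6  # six sentences in our fake data

# How many sentences contain each word that survives in sentence 0
doc_freq = {"python": 1, "code": 2, "crashing": 1}

# Smoothed IDF (assumed scikit-learn default)
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

# Each word occurs once in sentence 0, so TF = 1 and the raw score equals the IDF
raw = {term: 1 * weight for term, weight in idf.items()}

# L2 normalization: divide by the length of the row vector
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
scores = {term: value / norm for term, value in raw.items()}

for term, score in scores.items():
    print(f"{term}: {score:.6f}")
```

The normalization step also explains the 0.57735 values in row 1: three equally rare words each get 1/sqrt(3).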

# 4: Train & Predict

In [16]:
# Train

model = MultinomialNB()
model.fit(X, labels)

In [22]:
# Predict on new text

new_text = ['My server is running hot']
new_X = tfidf.transform(new_text)

pred = model.predict(new_X)

In [21]:
print(f"Prediction: {pred[0]} (0=Tech, 1=Food)")

Prediction: 0 (0=Tech, 1=Food)
