### TF-IDF Vectorizer

🧠 What is TF-IDF?
TF-IDF stands for:

TF (Term Frequency): how often a word appears in a document.

IDF (Inverse Document Frequency): how rare a word is across all documents.

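The idea is that a word's weight should reflect how informative it is:

- Common words (like "the", "is", "a") appear in almost every document, so they get low weights.
- Rare but meaningful words (like "diabetes", "engineer", "offer") appear in few documents, so they get high weights.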
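To make the idea concrete, here is a minimal sketch of the textbook formulation, where a term's score is its relative frequency in a document multiplied by `log(N / df)`. The three sentences and the helper names (`tf`, `idf`, `tfidf`) are invented for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant, so its numbers will differ from these.

```python
import math

# Toy corpus, invented for illustration only
docs = [
    "the patient has diabetes",
    "the engineer accepted the offer",
    "the offer is generous",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf(term, tokens):
    # Term frequency: share of the document's tokens that are `term`
    return tokens.count(term) / len(tokens)

def idf(term):
    # Inverse document frequency: log of (number of docs / docs containing `term`)
    df = sum(1 for toks in tokenized if term in toks)
    return math.log(n_docs / df)

def tfidf(term, tokens):
    return tf(term, tokens) * idf(term)

the_score = tfidf("the", tokenized[0])            # "the" is in every document
diabetes_score = tfidf("diabetes", tokenized[0])  # "diabetes" is in one document

print(f"the: {the_score:.4f}, diabetes: {diabetes_score:.4f}")
```

Because "the" occurs in all three documents, its IDF is `log(3/3) = 0` and its score vanishes, while "diabetes" keeps a positive weight.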

In [2]:
import pandas as pd
series = pd.read_pickle('../data/text_clean.pkl')
series

0                    life lemon lemonade
1                     lemon maven market
2            dozen lemon gallon lemonade
3    lemon lemon lemon lemon lemon lemon
4              s market lemon sale today
5        maven market eureka lemon lemon
6           palmer lemonade half ice tea
7                       ice tea favorite
Name: sentence, dtype: object

In [5]:
# basic tfidf vectorizer code
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer()
tfidf = tv.fit_transform(series)
tfidf_df = pd.DataFrame(tfidf.toarray(), columns=tv.get_feature_names_out())
tfidf_df

,dozen,eureka,favorite,gallon,half,ice,lemon,lemonade,life,market,maven,palmer,sale,tea,today
0,0.0,0.0,0.0,0.0,0.0,0.0,0.375318,0.543168,0.75107,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.411442,0.0,0.0,0.595449,0.690041,0.0,0.0,0.0,0.0
2,0.600547,0.0,0.0,0.600547,0.0,0.0,0.3001,0.434311,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.3001,0.0,0.0,0.434311,0.0,0.0,0.600547,0.0,0.600547
5,0.0,0.556913,0.0,0.0,0.0,0.0,0.556591,0.0,0.0,0.402755,0.466736,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.504577,0.422875,0.0,0.364907,0.0,0.0,0.0,0.504577,0.0,0.422875,0.0
7,0.0,0.0,0.644859,0.0,0.0,0.540443,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.540443,0.0
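The values in this table can be reproduced by hand, which also explains why row 3 ("lemon" repeated six times) is exactly 1.0. The sketch below assumes the vectorizer's default settings: raw counts as term frequency, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names (`tokenize`, `tfidf_row`) are hypothetical, and the corpus is retyped from the series shown above.

```python
import math
from collections import Counter

# Corpus retyped from the cleaned series shown above
docs = [
    "life lemon lemonade",
    "lemon maven market",
    "dozen lemon gallon lemonade",
    "lemon lemon lemon lemon lemon lemon",
    "s market lemon sale today",
    "maven market eureka lemon lemon",
    "palmer lemonade half ice tea",
    "ice tea favorite",
]

def tokenize(text):
    # Mimic the default token pattern, which drops single-character tokens like "s"
    return [t for t in text.split() if len(t) >= 2]

tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)
# Document frequency: number of documents that contain each term
doc_freq = Counter(term for toks in tokenized for term in set(toks))

def tfidf_row(tokens):
    counts = Counter(tokens)
    # Raw weight = term count * smoothed IDF
    raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
           for t, c in counts.items()}
    # L2-normalize so each document vector has unit length
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

row0 = tfidf_row(tokenized[0])
row3 = tfidf_row(tokenized[3])
print({t: round(w, 6) for t, w in row0.items()})
print({t: round(w, 6) for t, w in row3.items()})
```

A document with a single distinct term always normalizes to 1.0, no matter how many times the term repeats, because the whole vector length sits in that one dimension.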


In [7]:
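# TF-IDF vectorizer with some parameter tweaks:
# drop English stop words, use unigrams and bigrams, and keep only terms
# that appear in at least 20% and at most 80% of the documents
tv2 = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), min_df=.2, max_df=.8)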
tfidf2 = tv2.fit_transform(series)
tfidf_df2 = pd.DataFrame(tfidf2.toarray(), columns=tv2.get_feature_names_out())
tfidf_df2

,ice,ice tea,lemon,lemon lemon,lemonade,market,maven,maven market,tea
0,0.0,0.0,0.568471,0.0,0.822704,0.0,0.0,0.0,0.0
1,0.0,0.0,0.338644,0.0,0.0,0.490093,0.567948,0.567948,0.0
2,0.0,0.0,0.568471,0.0,0.822704,0.0,0.0,0.0,0.0
3,0.0,0.0,0.581897,0.813262,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.568471,0.0,0.0,0.822704,0.0,0.0,0.0
5,0.0,0.0,0.524634,0.439939,0.0,0.379631,0.439939,0.439939,0.0
6,0.516768,0.516768,0.0,0.0,0.445928,0.0,0.0,0.0,0.516768
7,0.57735,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.57735
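To see why only nine columns survive, the following sketch applies the same `ngram_range`, `min_df` and `max_df` filtering by hand. It is an approximation under stated assumptions: it skips stop-word removal (which should not change the surviving terms for this cleaned corpus), and the helper name `terms` is hypothetical. With 8 documents, `min_df=.2` means a term must appear in at least 1.6 documents (so 2 or more) and `max_df=.8` means at most 6.4 documents.

```python
from collections import Counter

# Corpus retyped from the cleaned series shown earlier
docs = [
    "life lemon lemonade",
    "lemon maven market",
    "dozen lemon gallon lemonade",
    "lemon lemon lemon lemon lemon lemon",
    "s market lemon sale today",
    "maven market eureka lemon lemon",
    "palmer lemonade half ice tea",
    "ice tea favorite",
]

def terms(text, ngram_range=(1, 2)):
    # Drop single-character tokens, then build every n-gram in the requested range
    tokens = [t for t in text.split() if len(t) >= 2]
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

n_docs = len(docs)
# Count each term once per document it appears in
doc_freq = Counter(term for d in docs for term in set(terms(d)))

min_df, max_df = 0.2, 0.8
kept = sorted(t for t, df in doc_freq.items()
              if min_df * n_docs <= df <= max_df * n_docs)
print(kept)
```

The surviving terms match the column names of `tfidf_df2`. "lemon" appears in 6 of 8 documents (75%), so it stays just under the `max_df` cutoff.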


In [8]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes"
]

In [10]:
v = TfidfVectorizer()
v.fit(corpus)
transform_output = v.transform(corpus)

In [11]:
print(v.vocabulary_)

{'thor': 25, 'eating': 10, 'pizza': 22, 'loki': 17, 'is': 16, 'ironman': 15, 'ate': 7, 'already': 0, 'apple': 5, 'announcing': 4, 'new': 20, 'iphone': 14, 'tomorrow': 26, 'tesla': 24, 'model': 19, 'google': 12, 'pixel': 21, 'microsoft': 18, 'surface': 23, 'amazon': 2, 'eco': 11, 'dot': 9, 'am': 1, 'biryani': 8, 'and': 3, 'you': 27, 'are': 6, 'grapes': 13}
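Note that tokens such as "I", "3" and "6" are missing from the vocabulary, and "eco-dot" became two entries. The sketch below reproduces the mapping, assuming the vectorizer's default behaviour: lowercase the text, extract tokens with the pattern `(?u)\b\w\w+\b` (two or more word characters), and assign indices in alphabetical order. The variable names are hypothetical.

```python
import re

corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes",
]

# Default token pattern: runs of 2+ word characters, so "I" and "3" are dropped
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

tokens = {tok for doc in corpus for tok in TOKEN_PATTERN.findall(doc.lower())}

# Column indices follow the alphabetical order of the terms
vocabulary = {term: idx for idx, term in enumerate(sorted(tokens))}
print(vocabulary["thor"], vocabulary["eating"], len(vocabulary))
```

If single-character tokens or hyphenated product names matter for a task, the `token_pattern` parameter of `TfidfVectorizer` can be changed accordingly.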


In [12]:
i = v.vocabulary_.get('thor')
v.idf_[i]

np.float64(2.386294361119891)

In [13]:
# Print the IDF of each word. get_feature_names_out() is ordered by column
# index, so it lines up with v.idf_
for word, idf_score in zip(v.get_feature_names_out(), v.idf_):
    print(f"{word}: {idf_score}")

already: 2.386294361119891
am: 2.386294361119891
amazon: 2.386294361119891
and: 2.386294361119891
announcing: 1.2876820724517808
apple: 2.386294361119891
are: 2.386294361119891
ate: 2.386294361119891
biryani: 2.386294361119891
dot: 2.386294361119891
eating: 1.9808292530117262
eco: 2.386294361119891
google: 2.386294361119891
grapes: 2.386294361119891
iphone: 2.386294361119891
ironman: 2.386294361119891
is: 1.1335313926245225
loki: 2.386294361119891
microsoft: 2.386294361119891
model: 2.386294361119891
new: 1.2876820724517808
pixel: 2.386294361119891
pizza: 2.386294361119891
surface: 2.386294361119891
tesla: 2.386294361119891
thor: 2.386294361119891
tomorrow: 1.2876820724517808
you: 2.386294361119891
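These scores follow from the smoothed IDF formula that scikit-learn applies by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing the term. A minimal sketch, with document frequencies counted by hand from the 7-sentence corpus (the names `smooth_idf` and `doc_freq` are hypothetical):

```python
import math

n_docs = 7

# Number of documents containing each term, counted by hand from the corpus
doc_freq = {"thor": 1, "eating": 2, "announcing": 5, "is": 6}

def smooth_idf(df, n):
    # Adding 1 to numerator and denominator acts like one extra document that
    # contains every term; the trailing +1 keeps very common terms above zero
    return math.log((1 + n) / (1 + df)) + 1

for term, df in doc_freq.items():
    print(f"{term}: {smooth_idf(df, n_docs)}")
```

"is" appears in 6 of 7 sentences and gets the lowest score, while a word such as "thor" that appears in only one sentence gets the highest.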


In [14]:
# Print the transformed output from tf-idf
print(transform_output.toarray())

[[0.24266547 0.         0.         0.         0.         0.
  0.         0.24266547 0.         0.         0.40286636 0.
  0.         0.         0.         0.24266547 0.11527033 0.24266547
  0.         0.         0.         0.         0.72799642 0.
  0.         0.24266547 0.         0.        ]
 [0.         0.         0.         0.         0.30652086 0.5680354
  0.         0.         0.         0.         0.         0.
  0.         0.         0.5680354  0.         0.26982671 0.
  0.         0.         0.30652086 0.         0.         0.
  0.         0.         0.30652086 0.        ]
 [0.         0.         0.         0.         0.30652086 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.26982671 0.
  0.         0.5680354  0.30652086 0.         0.         0.
  0.5680354  0.         0.30652086 0.        ]
 [0.         0.         0.         0.         0.30652086 0.
  0.         0.         0.         0.         0.         0.
  0.

#### **Custom Use case**

- E-commerce data
- 4 labels: Household, Electronics, Clothing & Accessories, Books
- The task is to build a classification model that takes a product description and predicts which of the four labels it belongs to, using TF-IDF vectorization

In [17]:
df = pd.read_csv('../data/Ecommerce_data.csv')

In [18]:
df.head(5)

,Text,label
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household
1,"Contrast living Wooden Decorative Box,Painted ...",Household
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories


In [19]:
df.label.value_counts()

label
Household                 6000
Electronics               6000
Clothing & Accessories    6000
Books                     6000
Name: count, dtype: int64

In [20]:
df.shape

(24000, 2)

In [21]:
df['label_num'] = df['label'].map({
    'Household': 0,
    'Electronics': 1,
    'Clothing & Accessories': 2,
    'Books': 3
})

In [22]:
df.head(5)

,Text,label,label_num
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household,0
1,"Contrast living Wooden Decorative Box,Painted ...",Household,0
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics,1
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories,2
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories,2


#### **Train Test Split**

In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.Text, df.label_num, test_size=0.2)

In [24]:
len(X_train)

19200

In [25]:
len(X_test)

4800

#### **Tfidf Vectorizer**

In [26]:
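tf = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training set only,
# then reuse them unchanged to vectorize the test set
X_train_tf = tf.fit_transform(X_train)
X_test_tf = tf.transform(X_test)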
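Calling `fit_transform` on the training data and only `transform` on the test data means the vocabulary is frozen after training, so words that appear only in the test set are ignored. The simplified sketch below illustrates that behaviour with plain word counts rather than TF-IDF weights. The tiny documents and the helper `to_counts` are invented for illustration.

```python
# Invented toy data: one "training" set and one unseen "test" document
train_docs = ["lemon lemonade", "ice tea"]
test_docs = ["lemon tea mango"]

# "fit": build the vocabulary from the training documents only
vocab = {term: i for i, term in enumerate(
    sorted({t for d in train_docs for t in d.split()}))}

def to_counts(doc):
    # "transform": count known terms and silently skip anything unseen
    vec = [0] * len(vocab)
    for t in doc.split():
        if t in vocab:
            vec[vocab[t]] += 1
    return vec

test_vec = to_counts(test_docs[0])
print(vocab)
print(test_vec)
```

"mango" never occurred in training, so it has no column and contributes nothing to the test vector. Fitting the vectorizer on the test set instead would leak information about it into the features.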

#### **Classification Model**

In [27]:
clf = DecisionTreeClassifier()
clf.fit(X_train_tf,y_train)

y_pred = clf.predict(X_test_tf)

In [28]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.91      0.92      0.92      1231
           1       0.95      0.94      0.95      1159
           2       0.96      0.95      0.96      1175
           3       0.95      0.96      0.96      1235

    accuracy                           0.94      4800
   macro avg       0.94      0.94      0.94      4800
weighted avg       0.94      0.94      0.94      4800



#### **Testing on new data**

In [29]:
msg = ["Satyajit's designer women art saree silk blouse piece, saree with pipili chandua work"]
msg_tf = tf.transform(msg)

clf.predict(msg_tf)

array([2])
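The classifier returns the numeric code, so the prediction still has to be mapped back to a label name. A small sketch, reusing the label-to-number mapping defined earlier; the names `label_to_num`, `num_to_label` and `predicted` are hypothetical, and `predicted` stands in for the output of `clf.predict(msg_tf)`.

```python
# Same mapping that was used to create the label_num column
label_to_num = {
    "Household": 0,
    "Electronics": 1,
    "Clothing & Accessories": 2,
    "Books": 3,
}

# Invert it to turn predicted codes back into readable labels
num_to_label = {num: label for label, num in label_to_num.items()}

predicted = [2]  # stand-in for the array returned by clf.predict(msg_tf)
predicted_labels = [num_to_label[p] for p in predicted]
print(predicted_labels)
```

Here the saree description is classified as "Clothing & Accessories", which is the expected category.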