### Term Frequency - Inverse Document Frequency (TF-IDF)

TF-IDF is a statistical measure of how important a word is to a document within a corpus. It weighs a word's term frequency (TF) against its inverse document frequency (IDF), which reduces the impact of words that occur in most documents (such as stop words).

#### 1. Term Frequency (TF)

$$
TF(t, d) = \frac{\text{count of term } t \text{ in document } d}{\text{total number of terms in document } d}
$$
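A minimal sketch of the TF computation, assuming whitespace tokenization and taking the denominator to be the total number of tokens in the document (the usual convention). The function name `term_frequency` and the sample sentence are illustrative only.

```python
from collections import Counter


def term_frequency(term, tokens):
    """Share of the document's tokens that are equal to `term`."""
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)


# Hypothetical document, split on whitespace
doc_tokens = "the cat sat on the mat".split()

tf_the = term_frequency("the", doc_tokens)  # "the" is 2 of 6 tokens
tf_cat = term_frequency("cat", doc_tokens)  # "cat" is 1 of 6 tokens
print(tf_the, tf_cat)
```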

#### 2. Inverse Document Frequency (IDF)

$$
IDF(t) = \log \left(\frac{N}{1 + DF(t)}\right)
$$

Where:

- \( N \) = Total number of documents in the corpus.
- \( DF(t) \) = Number of documents that contain the term \( t \).
- The \( 1 \) added to \( DF(t) \) in the denominator prevents division by zero when a term appears in no document.
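The IDF definition above can be sketched as follows, assuming the natural logarithm (the base is not specified in the formula) and a small made-up corpus. The function name `inverse_document_frequency` is hypothetical.

```python
import math


def inverse_document_frequency(term, tokenized_docs):
    """log(N / (1 + DF(t))), using the natural logarithm."""
    n_docs = len(tokenized_docs)
    doc_freq = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log(n_docs / (1 + doc_freq))


# Hypothetical four-document corpus
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
    "a fish swam".split(),
]

idf_the = inverse_document_frequency("the", corpus)        # DF = 3
idf_cat = inverse_document_frequency("cat", corpus)        # DF = 2
idf_fish = inverse_document_frequency("fish", corpus)      # DF = 1
idf_unseen = inverse_document_frequency("zebra", corpus)   # DF = 0
print(idf_the, idf_cat, idf_fish, idf_unseen)
```

Rarer terms receive larger values. One consequence of this particular smoothing is that a term found in every document gets \( \log(N / (1 + N)) \), which is slightly negative.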

#### 3. TF-IDF Calculation

$$
TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)
$$
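As a worked example with made-up numbers (natural logarithm assumed): suppose a term occurs 3 times in a 100-term document and appears in 9 of the 1,000 documents in the corpus.

$$
TF(t, d) = \frac{3}{100} = 0.03, \qquad IDF(t) = \log \left(\frac{1000}{1 + 9}\right) = \log(100) \approx 4.605
$$

$$
TF\text{-}IDF(t, d) = 0.03 \times 4.605 \approx 0.138
$$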


---


In [11]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Cat is a very cute animal", "Cats and dogs are very cute animals"]

tfidf_vectorizer = TfidfVectorizer()

X = tfidf_vectorizer.fit_transform(documents)
feature_names = tfidf_vectorizer.get_feature_names_out()

df_tfidf_example = pd.DataFrame(X.toarray(), columns=feature_names)
print(df_tfidf_example)

# Mean TF-IDF score of "cute" across the two documents
cute_tfidf = df_tfidf_example["cute"]
cute_mean_tfidf = np.mean(cute_tfidf)
print(cute_mean_tfidf)

        and    animal   animals       are       cat      cats     cute  \
0  0.000000  0.499221  0.000000  0.000000  0.499221  0.000000  0.35520   
1  0.407824  0.000000  0.407824  0.407824  0.000000  0.407824  0.29017   

       dogs        is     very  
0  0.000000  0.499221  0.35520  
1  0.407824  0.000000  0.29017  
0.3226851472299316
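The scores printed above do not follow the textbook formulas exactly. With its default settings, `TfidfVectorizer` uses raw term counts for TF, a smoothed IDF of \( \ln\left(\frac{1 + N}{1 + DF(t)}\right) + 1 \), and then scales each document vector to unit L2 norm. The sketch below reproduces the table by hand in plain Python under those assumptions. The regular expression approximates the default tokenizer, which lowercases text and drops single-character tokens such as "a".

```python
import math
import re
from collections import Counter

documents = ["Cat is a very cute animal", "Cats and dogs are very cute animals"]

# Approximate the default tokenizer: lowercase, keep tokens of 2+ word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
n_docs = len(tokenized)
vocabulary = sorted({token for doc in tokenized for token in doc})

# Smoothed IDF: ln((1 + N) / (1 + DF(t))) + 1
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocabulary}
smooth_idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

rows = []
for doc in tokenized:
    counts = Counter(doc)
    raw = [counts[t] * smooth_idf[t] for t in vocabulary]  # raw count times IDF
    norm = math.sqrt(sum(v * v for v in raw))               # L2 norm of the row
    rows.append({t: v / norm for t, v in zip(vocabulary, raw)})

for row in rows:
    print({t: round(v, 6) for t, v in row.items() if v > 0})
```

Words that appear in only one document ("cat", "dogs") score higher than words shared by both ("very", "cute"), and the values agree with the table above to the printed precision.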


---


#### Real-Life Application of TF-IDF Using the Spam Dataset

- Dataset: [Spam_Dataset.csv](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)


In [3]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the SMS spam dataset; the message text is in column "v2"
df = pd.read_csv("../data/Spam_Dataset.csv")

vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(df["v2"])
feature_names = vectorizer.get_feature_names_out()

tfidf_score = X.mean(axis=0).A1  # Mean TF-IDF score of each word across all messages
df_tfidf = pd.DataFrame({"word": feature_names, "tfidf_score": tfidf_score})
df_tfidf_score = df_tfidf.sort_values(by="tfidf_score", ascending=False)
print(df_tfidf_score)

            word  tfidf_score
8583         you     0.044174
7733          to     0.037040
7604         the     0.026417
4074          in     0.021997
4920          me     0.021242
...          ...          ...
6095      proove     0.000013
5982     praises     0.000013
4832     makiing     0.000013
1280  attraction     0.000013
7054     sorrows     0.000013

[8625 rows x 2 columns]
