# 5.3 TF-IDF (Term Frequency-Inverse Document Frequency)
This notebook demonstrates how to use TF-IDF to convert text data into numerical features. TF-IDF is widely used in text mining and information retrieval to weight how important a word is to a document relative to a collection of documents (the corpus): a term scores high when it occurs often in one document but rarely in the others.
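
As a reference for the scores computed later, the weighting that scikit-learn's `TfidfVectorizer` applies under its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`) can be written roughly as below. The symbols are notation introduced here, not taken from the notebook: $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ the raw count of $t$ in document $d$.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
w(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t), \qquad
\mathrm{tfidf}(t, d) = \frac{w(t, d)}{\sqrt{\sum_{t'} w(t', d)^2}}
```

The final division is the L2 normalization, which gives every document row unit length so that long and short documents are comparable.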

In [None]:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Sample data
# Six unrelated sentences on diverse topics, used to demonstrate the TF-IDF transformation.
data = ['Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

In [None]:
# Initialize the TF-IDF Vectorizer
# This object will be used to transform the text data into numerical features.
tfidfvec = TfidfVectorizer()
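
With no arguments, the vectorizer uses its documented defaults: text is lowercased and split with the token pattern `(?u)\b\w\w+\b`, so only tokens of two or more word characters are kept. The sketch below imitates that behavior with the standard library to show which words can become columns. The helper name `default_like_tokens` is hypothetical, and this is an approximation of the defaults, not the library's own code.

```python
import re

# Assumption: mirrors TfidfVectorizer's documented defaults
# (lowercase=True, token_pattern=r"(?u)\b\w\w+\b"); not the library's code.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_like_tokens(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = default_like_tokens("carol drank the blood as if she were a vampire")
print(tokens)  # the single-character word "a" is dropped

# Digits count as word characters, so "10" is kept as a token.
print(default_like_tokens("about 10 feet from the beach"))
```

This is why the output table has a column for `10` but none for the word `a`.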

In [None]:
# Fit the TF-IDF vectorizer to the data and transform it
# `fit_transform` learns the vocabulary and IDF weights, then returns the TF-IDF scores as a sparse matrix with one row per document.
tfidfvec_fit = tfidfvec.fit_transform(data)
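
To make the computation behind `fit_transform` less opaque, here is a minimal from-scratch sketch using only the standard library. It assumes the vectorizer's default settings (lowercasing, tokens of two or more word characters, smoothed IDF, L2 row normalization). The names `tokenize`, `tfidf_rows`, and `docs` are hypothetical helpers for illustration, and the sentences are repeated so the block runs on its own.

```python
import math
import re
from collections import Counter

# Same six sentences as the `data` list, repeated so the sketch is self-contained.
docs = [
    "Most shark attacks occur about 10 feet from the beach since that is where the people are",
    "the efficiency with which he paired the socks in the drawer was quite admirable",
    "carol drank the blood as if she were a vampire",
    "giving directions that the mountains are to the west only works when you can see them",
    "the sign said there was road work ahead so he decided to speed up",
    "the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms",
]


def tokenize(text):
    # Assumed default behavior: lowercase, keep tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_rows(corpus):
    token_lists = [tokenize(doc) for doc in corpus]
    n_docs = len(token_lists)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        # L2-normalize so each document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


rows = tfidf_rows(docs)
print(round(rows[2]["were"], 3), round(rows[2]["as"], 3))
```

For the third sentence this gives roughly 0.356 for `were` and 0.292 for `as`, in line with the scores printed for row 2 in the DataFrame output. `were` appears in only one document, so it outweighs `as`, which appears in two, while `the` appears in all six and receives the minimum IDF of 1.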

In [None]:
# Convert the TF-IDF matrix to a DataFrame
# This step makes it easier to visualize the numerical representation of the text data.
tfidf_bag = pd.DataFrame(tfidfvec_fit.toarray(), columns=tfidfvec.get_feature_names_out())

In [None]:
# Display the TF-IDF DataFrame
# Each row corresponds to a document and each column to a vocabulary term; each cell holds that term's TF-IDF score in that document.
print(tfidf_bag)

         10     about  admirable     ahead       are        as   attacks  \
0  0.257061  0.257061   0.000000  0.000000  0.210794  0.000000  0.257061   
1  0.000000  0.000000   0.293641  0.000000  0.000000  0.000000  0.000000   
2  0.000000  0.000000   0.000000  0.000000  0.000000  0.292313  0.000000   
3  0.000000  0.000000   0.000000  0.000000  0.222257  0.000000  0.000000   
4  0.000000  0.000000   0.000000  0.290766  0.000000  0.000000  0.000000   
5  0.000000  0.000000   0.000000  0.000000  0.000000  0.178615  0.000000   

      back     bait     beach  ...      were     west     when     where  \
0  0.00000  0.00000  0.257061  ...  0.000000  0.00000  0.00000  0.257061   
1  0.00000  0.00000  0.000000  ...  0.000000  0.00000  0.00000  0.000000   
2  0.00000  0.00000  0.000000  ...  0.356474  0.00000  0.00000  0.000000   
3  0.00000  0.00000  0.000000  ...  0.000000  0.27104  0.27104  0.000000   
4  0.00000  0.00000  0.000000  ...  0.000000  0.00000  0.00000  0.000000   
5  0.21782 