***Text Vectorization with scikit-learn***

This script demonstrates how to convert raw text into numerical features using two common vectorization techniques: Bag of Words (BoW) and Term Frequency–Inverse Document Frequency (TF-IDF), both implemented with scikit-learn.

A small collection of example sentences is used to build document-term matrices (one row per document, one column per vocabulary term), which can later be used as input to machine learning models for tasks like classification or clustering.

Main steps:

1. Prepare a list of example text documents.
2. Convert the text into a numerical format using CountVectorizer (BoW).
3. Display the resulting Bag of Words matrix as a pandas DataFrame.
4. Convert the text using TfidfVectorizer (TF-IDF).
5. Display the TF-IDF matrix as a pandas DataFrame.

This example is useful for understanding the basics of text preprocessing and feature extraction for Natural Language Processing (NLP) tasks.

In [29]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [30]:
# Sample text data (short example sentences)

data = ['Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

In [31]:
# Bag of Words (BoW)

countvec = CountVectorizer()

In [32]:
countvec_fit = countvec.fit_transform(data)

In [33]:
bag_of_words = pd.DataFrame(countvec_fit.toarray(), columns=countvec.get_feature_names_out())

In [34]:
print(bag_of_words)

   10  about  admirable  ahead  are  as  attacks  back  bait  beach  ...  \
0   1      1          0      0    1   0        1     0     0      1  ...   
1   0      0          1      0    0   0        0     0     0      0  ...   
2   0      0          0      0    0   1        0     0     0      0  ...   
3   0      0          0      0    1   0        0     0     0      0  ...   
4   0      0          0      1    0   0        0     0     0      0  ...   
5   0      0          0      0    0   1        0     1     1      0  ...   

   were  west  when  where  which  with  work  works  worms  you  
0     0     0     0      1      0     0     0      0      0    0  
1     0     0     0      0      1     1     0      0      0    0  
2     1     0     0      0      0     0     0      0      0    0  
3     0     1     1      0      0     0     0      1      0    1  
4     0     0     0      0      0     0     1      0      0    0  
5     0     0     0      0      0     0     0      0      1    0 

In [35]:
# TF-IDF

tfidfvec = TfidfVectorizer()

In [36]:
tfidfvec_fit = tfidfvec.fit_transform(data)

In [37]:
tfidf_bag = pd.DataFrame(tfidfvec_fit.toarray(), columns=tfidfvec.get_feature_names_out())
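
To make the TF-IDF weights less of a black box, the sketch below reproduces them from raw counts, assuming TfidfVectorizer's default settings (smoothed IDF, L2 row normalization, no sublinear term-frequency scaling). Under those defaults, the IDF of a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The three-sentence corpus and all variable names here are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Illustrative toy corpus
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "a bird sang",
]

# Step 1: raw term counts (same tokenization as TfidfVectorizer's default)
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Step 2: document frequency and smoothed inverse document frequency
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Step 3: weight the counts by IDF, then scale each row to unit L2 norm
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against scikit-learn's result
reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

The row normalization explains why the same term can receive different weights in different documents: longer sentences spread their unit length over more terms.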

In [38]:
print(tfidf_bag)

         10     about  admirable     ahead       are        as   attacks  \
0  0.257061  0.257061   0.000000  0.000000  0.210794  0.000000  0.257061   
1  0.000000  0.000000   0.293641  0.000000  0.000000  0.000000  0.000000   
2  0.000000  0.000000   0.000000  0.000000  0.000000  0.292313  0.000000   
3  0.000000  0.000000   0.000000  0.000000  0.222257  0.000000  0.000000   
4  0.000000  0.000000   0.000000  0.290766  0.000000  0.000000  0.000000   
5  0.000000  0.000000   0.000000  0.000000  0.000000  0.178615  0.000000   

      back     bait     beach  ...      were     west     when     where  \
0  0.00000  0.00000  0.257061  ...  0.000000  0.00000  0.00000  0.257061   
1  0.00000  0.00000  0.000000  ...  0.000000  0.00000  0.00000  0.000000   
2  0.00000  0.00000  0.000000  ...  0.356474  0.00000  0.00000  0.000000   
3  0.00000  0.00000  0.000000  ...  0.000000  0.27104  0.27104  0.000000   
4  0.00000  0.00000  0.000000  ...  0.000000  0.00000  0.00000  0.000000   
5  0.21782 