# 5.2 Bag of Words
This notebook demonstrates the Bag of Words (BoW) model, a simple method for text vectorization that keeps word frequencies and discards word order and grammar. BoW represents a corpus as a matrix of word counts, usually stored in sparse form because most entries are zero. Each row corresponds to a document and each column corresponds to a unique word in the corpus.
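
To make the counting concrete, here is a minimal hand-rolled sketch that uses only the standard library. The two sentences in `docs` and the helper name `bag_of_words_matrix` are illustrative assumptions, and plain whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter


def bag_of_words_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in docs[i]."""
    # Lowercase and split on whitespace: a deliberately naive tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of every token seen in any document.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical two-document corpus, not the notebook's data.
docs = ["the cat sat on the mat", "the dog sat"]
vocabulary, rows = bag_of_words_matrix(docs)
print(vocabulary)
print(rows)
```

Each inner list in `rows` is one document, and its positions line up with the sorted `vocabulary`, which is the same layout the notebook builds with scikit-learn.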

In [None]:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Sample data
# This dataset contains sentences with diverse topics to demonstrate the Bag of Words transformation.
data = ['Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

In [None]:
# Initialize the CountVectorizer
# This object will be used to transform the text data into numerical features.
countvec = CountVectorizer()
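
With no arguments, `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters, so single-letter words such as "a" never become columns. The sketch below approximates that behavior with the standard `re` module. The pattern is assumed to match the documented default `token_pattern`, and the helper name `default_like_tokens` is hypothetical.

```python
import re

# Assumption: mirrors CountVectorizer's documented default token_pattern,
# which matches runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_like_tokens(text):
    """Lowercase the text and extract tokens roughly as the default settings would."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = default_like_tokens("carol drank the blood as if she were a vampire")
print(tokens)
```

The one-letter word "a" is missing from the result, which is why the printed matrix in this notebook has no `a` column.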

In [None]:
# Fit the CountVectorizer to the text data
# The `fit_transform` method learns the vocabulary and computes the Bag of Words matrix.
countvec_fit = countvec.fit_transform(data)
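
The result of `fit_transform` is a SciPy sparse matrix, and the learned vocabulary is stored on the vectorizer. The sketch below, on a hypothetical two-sentence corpus rather than the notebook's data, inspects both and then reuses the fitted vocabulary on unseen text with `transform`. Words that were not seen during fitting are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus for illustration only.
docs = ["the cat sat on the mat", "the dog sat"]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)

# Shape is (number of documents, vocabulary size).
print(matrix.shape)
# Mapping from each term to its column index.
print(vectorizer.vocabulary_)

# transform() reuses the fitted vocabulary; "bird" was never seen, so it is dropped.
new_counts = vectorizer.transform(["the bird sat on the cat"])
print(new_counts.toarray())
```

Calling `transform` rather than `fit_transform` on new text keeps the column layout identical to the one learned from the original corpus.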

In [None]:
# Convert the Bag of Words matrix to a DataFrame
# This step makes it easier to visualize the numerical representation of the text data.
bag_of_words = pd.DataFrame(countvec_fit.toarray(), columns=countvec.get_feature_names_out())

In [None]:
# Display the Bag of Words DataFrame
# Each row corresponds to a document, and each column holds the count of one term.
# pandas elides the middle columns with `...` when the frame is too wide to print.
print(bag_of_words)

   10  about  admirable  ahead  are  as  attacks  back  bait  beach  ...  \
0   1      1          0      0    1   0        1     0     0      1  ...   
1   0      0          1      0    0   0        0     0     0      0  ...   
2   0      0          0      0    0   1        0     0     0      0  ...   
3   0      0          0      0    1   0        0     0     0      0  ...   
4   0      0          0      1    0   0        0     0     0      0  ...   
5   0      0          0      0    0   1        0     1     1      0  ...   

   were  west  when  where  which  with  work  works  worms  you  
0     0     0     0      1      0     0     0      0      0    0  
1     0     0     0      0      1     1     0      0      0    0  
2     1     0     0      0      0     0     0      0      0    0  
3     0     1     1      0      0     0     0      1      0    1  
4     0     0     0      0      0     0     1      0      0    0  
5     0     0     0      0      0     0     0      0      1    0 