# 5.2 Bag of Words

📖 Story: Bag of Words
Let's say you and your friends wrote these two sentences:

1️⃣ "I love cricket"
2️⃣ "I love playing cricket"

Now, imagine you take all the words from both sentences and put them into a bag.

Your bag will look like this:

["I", "love", "cricket", "I", "love", "playing", "cricket"]

But a bag doesn’t care about order, right? Once the candies are mixed together, you can’t tell which one went in first.
So you just count how many of each word you have, like counting candies. 🍭

🏏 Example Table
| Word | Count |
|---|---|
| I | 2 |
| love | 2 |
| cricket | 2 |
| playing | 1 |

This is called a Bag of Words (BoW) because:

- We ignore the order of the words. (It doesn’t matter whether you said "love cricket" or "cricket love".)
- We only count how many times each word appears.
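The counting step can be sketched in a few lines of Python. This is a minimal illustration using the standard library's `collections.Counter`; the variable names `bag` and `same_bag` are made up for this example.

```python
from collections import Counter

sentences = ["I love cricket", "I love playing cricket"]

# Put every word from both sentences into one "bag" and count them.
bag = Counter(word for sentence in sentences for word in sentence.split())
print(bag)

# Order is ignored: the same words in a different order give the same bag.
same_bag = Counter("love cricket".split()) == Counter("cricket love".split())
print(same_bag)
```

The counts should match the example table, and the second check shows that swapping the word order does not change the bag.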

🧠 Why is this useful?
Computers don’t understand language like we do.

BoW gives them numbers to work with.

Example, with the words in the order ["I", "love", "cricket", "playing"]:

Sentence 1 = [1, 1, 1, 0] (no "playing")

Sentence 2 = [1, 1, 1, 1] (has "playing")

Now the computer can do math to compare sentences! ✅
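One way to "do math" on these vectors is cosine similarity, which scores two vectors between 0 (no words in common) and 1 (same word proportions). The text does not name a specific measure, so treat this as one common choice rather than the only one. The sketch assumes both vectors use the word order ["I", "love", "cricket", "playing"], and `cosine_similarity` is a helper written just for this example.

```python
import math

# Assumed word order for both vectors: ["I", "love", "cricket", "playing"]
sentence_1 = [1, 1, 1, 0]
sentence_2 = [1, 1, 1, 1]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


similarity = cosine_similarity(sentence_1, sentence_2)
print(round(similarity, 3))  # 0.866
```

A score close to 1 means the two sentences use almost the same words, which fits here because they differ only by "playing".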

🎯 Real-life Example
Suppose you are analyzing your friends' chat messages:

Chat 1: "I like samosa"

Chat 2: "I like dosa"

Bag of words helps you count and compare:

Word list: ["I", "like", "samosa", "dosa"]

Counts:

Chat 1 → [1, 1, 1, 0]

Chat 2 → [1, 1, 0, 1]

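So you know:

Both messages share the words "I" and "like" (both vectors have a 1 in the first two positions) 🍽️

They differ only in the last two positions: one friend likes samosa, the other dosa 😄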
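The same comparison can be done in code by walking through the word list and the two count vectors together. This is a small sketch; the names `shared`, `only_chat_1`, and `only_chat_2` are made up for illustration.

```python
vocabulary = ["I", "like", "samosa", "dosa"]
chat_1 = [1, 1, 1, 0]  # "I like samosa"
chat_2 = [1, 1, 0, 1]  # "I like dosa"

# Pair each word with its count in both chats, then sort the words into groups.
rows = list(zip(vocabulary, chat_1, chat_2))
shared = [word for word, a, b in rows if a > 0 and b > 0]
only_chat_1 = [word for word, a, b in rows if a > 0 and b == 0]
only_chat_2 = [word for word, a, b in rows if a == 0 and b > 0]

print("Shared words:", shared)
print("Only in chat 1:", only_chat_1)
print("Only in chat 2:", only_chat_2)
```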

💡 Think of Bag of Words = A bag filled with words like candies, where only the count matters, not the order.



In [7]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
data = ['Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

In [9]:
countvec = CountVectorizer()

In [14]:
# Learn the vocabulary from data and count each word per sentence (returns a sparse matrix)
countvec_fit = countvec.fit_transform(data)

In [15]:
# Convert the sparse counts to a dense table with one column per vocabulary word
bag_of_words = pd.DataFrame(countvec_fit.toarray(), columns = countvec.get_feature_names_out())

In [16]:
print(bag_of_words)

   10  about  admirable  ahead  are  as  attacks  back  bait  beach  ...  \
0   1      1          0      0    1   0        1     0     0      1  ...   
1   0      0          1      0    0   0        0     0     0      0  ...   
2   0      0          0      0    0   1        0     0     0      0  ...   
3   0      0          0      0    1   0        0     0     0      0  ...   
4   0      0          0      1    0   0        0     0     0      0  ...   
5   0      0          0      0    0   1        0     1     1      0  ...   

   were  west  when  where  which  with  work  works  worms  you  
0     0     0     0      1      0     0     0      0      0    0  
1     0     0     0      0      1     1     0      0      0    0  
2     1     0     0      0      0     0     0      0      0    0  
3     0     1     1      0      0     0     0      1      0    1  
4     0     0     0      0      0     0     1      0      0    0  
5     0     0     0      0      0     0     0      0      1    0 
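Notice that the table has no column for "a", even though row 2 ("carol drank the blood as if she were a vampire") contains it. With default settings, `CountVectorizer` lowercases the text and keeps only tokens of two or more characters. The sketch below imitates that step with plain `re`, so it runs without scikit-learn. The regex is assumed to match the library's default `token_pattern`, and `default_style_tokens` is a hypothetical helper, not part of scikit-learn.

```python
import re

# Assumption: this regex mirrors CountVectorizer's default token_pattern,
# which keeps only tokens made of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_style_tokens(text):
    """Lowercase the text, then keep tokens of 2+ word characters."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())


tokens = default_style_tokens("carol drank the blood as if she were a vampire")
print(tokens)

# One-letter words such as "I" are dropped as well.
short_tokens = default_style_tokens("I love cricket")
print(short_tokens)
```

This is why a hand-built bag of words for "I love cricket" includes "i", while a default `CountVectorizer` would not.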

In [17]:
# Example sentences
sentences = [
    "I love cricket",
    "I love playing cricket"
]

# Step 1️⃣: Split sentences into words
words = []
for sentence in sentences:
    words.extend(sentence.lower().split())

print("🔹 All words collected:", words)

# Step 2️⃣: Create a list of unique words (like unique candies)
# dict.fromkeys keeps first-seen order, so the result is the same on every run
# (a plain set() has no guaranteed order)
unique_words = list(dict.fromkeys(words))
print("🔹 Unique words:", unique_words)

# Step 3️⃣: Count occurrences of each word for each sentence
bag_of_words = []

for sentence in sentences:
    tokens = sentence.lower().split()  # split once per sentence
    word_count = []
    for word in unique_words:
        word_count.append(tokens.count(word))
    bag_of_words.append(word_count)

# Step 4️⃣: Show the Bag of Words matrix
print("\n🔹 Bag of Words Matrix:")
for row in bag_of_words:
    print(row)


🔹 All words collected: ['i', 'love', 'cricket', 'i', 'love', 'playing', 'cricket']
🔹 Unique words: ['i', 'love', 'cricket', 'playing']

🔹 Bag of Words Matrix:
[1, 1, 1, 0]
[1, 1, 1, 1]
