<h1 style="color:DodgerBlue; text-align:center; font-weight:bold; font-size:50px; background-color:lightblue; padding:20px 20px">Bag of Words (BoW)</h1>


<h2>Bag of Words (BoW) is a simple and widely used method for converting text into a numerical representation: each document becomes a vector that counts how often each vocabulary word occurs in it. The method ignores the order of the words and keeps only their frequency.</h2>

In [1]:
# import requirements
import numpy as np
import pandas as pd

import string
from nltk.tokenize import word_tokenize

In [2]:
# sample sentences (each sentence is treated as one document)
sentences = [
    "The cat sat on the mat.",
    "The dog barked at the cat.",
    "The cat and the dog are friends.",
    "Birds can fly high in the sky."
]

sentences

['The cat sat on the mat.',
 'The dog barked at the cat.',
 'The cat and the dog are friends.',
 'Birds can fly high in the sky.']

<h2 style="color:SlateBlue ; font-size:35px">Manually Create a Bag of Words Representation</h2>

In [3]:
# lowercase and tokenize each sentence, dropping tokens that are punctuation marks

sentence_tokens = [
    [word for word in word_tokenize(sentence.lower()) if word not in string.punctuation]
    for sentence in sentences
]

sentence_tokens

[['the', 'cat', 'sat', 'on', 'the', 'mat'],
 ['the', 'dog', 'barked', 'at', 'the', 'cat'],
 ['the', 'cat', 'and', 'the', 'dog', 'are', 'friends'],
 ['birds', 'can', 'fly', 'high', 'in', 'the', 'sky']]
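`word_tokenize` needs NLTK's tokenizer data to be installed. If that data is not available, a rough stand-in is a regular expression that extracts runs of word characters. This is only a sketch under that assumption: it gives the same tokens for the four sentences here, but it is not equivalent to NLTK in general (contractions such as "don't" are split differently). The name `regex_tokens` is made up for this example.

```python
import re

sentences = [
    "The cat sat on the mat.",
    "The dog barked at the cat.",
    "The cat and the dog are friends.",
    "Birds can fly high in the sky.",
]

# Assumption: a "word" is any run of letters, digits, or underscores.
# Punctuation never matches \w, so it is dropped without a separate filter.
regex_tokens = [re.findall(r"\w+", sentence.lower()) for sentence in sentences]

for tokens in regex_tokens:
    print(tokens)
```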

In [4]:
# create a vocabulary of unique words
## vocabulary = set([word for token in sentence_tokens for word in token])  # ==> list comprehension

# use a for loop (a set ignores duplicates, so no membership check is needed)
vocabulary = set()
for sent in sentence_tokens:
    for word in sent:
        vocabulary.add(word)

# sort the vocabulary so that each word gets a fixed column position
vocabulary = sorted(vocabulary)
print(vocabulary)

['and', 'are', 'at', 'barked', 'birds', 'can', 'cat', 'dog', 'fly', 'friends', 'high', 'in', 'mat', 'on', 'sat', 'sky', 'the']


In [5]:
# create a zero-filled BoW matrix of shape (number of sentences, vocabulary size)

bow_mattrix = np.zeros(shape=(len(sentence_tokens), len(vocabulary)), dtype='int')
bow_mattrix

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [6]:
# for every word occurrence, increment the count in that sentence's row

for i, sent in enumerate(sentence_tokens):
    for word in sent:
        # always true here; the check only matters for words outside the vocabulary
        if word in vocabulary:
            bow_mattrix[i, vocabulary.index(word)] += 1

# show the BoW matrix
bow_mattrix

array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 2],
       [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2],
       [1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 2],
       [0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1]])
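The counting loop above calls `list.index` for every token, which scans the vocabulary each time. That is fine for 17 words, but for a larger vocabulary a word-to-column dictionary is a common alternative. The following is a minimal sketch of that variant, with the tokens copied in so it runs on its own; `word_to_col` and `bow` are illustrative names, not part of the notebook.

```python
import numpy as np

# Tokens as produced by the tokenization cell (copied so this block is self-contained).
sentence_tokens = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'barked', 'at', 'the', 'cat'],
    ['the', 'cat', 'and', 'the', 'dog', 'are', 'friends'],
    ['birds', 'can', 'fly', 'high', 'in', 'the', 'sky'],
]
vocabulary = sorted({word for sent in sentence_tokens for word in sent})

# Build the word -> column lookup once; each later lookup is a dictionary access.
word_to_col = {word: col for col, word in enumerate(vocabulary)}

bow = np.zeros((len(sentence_tokens), len(vocabulary)), dtype=int)
for row, sent in enumerate(sentence_tokens):
    for word in sent:
        bow[row, word_to_col[word]] += 1

print(bow)
```

Each row sums to the number of tokens in its sentence, which is a quick sanity check on the counts.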

In [7]:
# for readability, create a dataframe with the sentences as index and the vocabulary as columns

pd.DataFrame(data=bow_mattrix, index=sentences, columns=vocabulary)

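| | and | are | at | barked | birds | can | cat | dog | fly | friends | high | in | mat | on | sat | sky | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| The cat sat on the mat. | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 2 |
| The dog barked at the cat. | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| The cat and the dog are friends. | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| Birds can fly high in the sky. | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |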


<h2 style="color:SlateBlue ; font-size:35px">Create BoW using scikit-learn CountVectorizer</h2>

In [8]:
sentences

['The cat sat on the mat.',
 'The dog barked at the cat.',
 'The cat and the dog are friends.',
 'Birds can fly high in the sky.']

In [9]:
# import CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

BOW = cv.fit_transform(sentences)

In [10]:
# inspect the vocabulary CountVectorizer learned from our input (word -> column index)

print(cv.vocabulary_)

{'the': 16, 'cat': 6, 'sat': 14, 'on': 13, 'mat': 12, 'dog': 7, 'barked': 3, 'at': 2, 'and': 0, 'are': 1, 'friends': 9, 'birds': 4, 'can': 5, 'fly': 8, 'high': 10, 'in': 11, 'sky': 15}


In [11]:
# get the vocabulary as an alphabetically sorted list (this matches the column order)

cv_vocabulary = sorted(cv.vocabulary_)
cv_vocabulary

['and',
 'are',
 'at',
 'barked',
 'birds',
 'can',
 'cat',
 'dog',
 'fly',
 'friends',
 'high',
 'in',
 'mat',
 'on',
 'sat',
 'sky',
 'the']

In [12]:
# convert the sparse result to a dense array to see the Bag of Words matrix

cv_bow_matrix = BOW.toarray()
cv_bow_matrix

array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 2],
       [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2],
       [1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 2],
       [0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1]])

In [13]:
# view the CountVectorizer BoW matrix as a dataframe

pd.DataFrame(data=cv_bow_matrix , index=sentences , columns=cv_vocabulary)

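| | and | are | at | barked | birds | can | cat | dog | fly | friends | high | in | mat | on | sat | sky | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| The cat sat on the mat. | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 2 |
| The dog barked at the cat. | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| The cat and the dog are friends. | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| Birds can fly high in the sky. | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |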


### Advantages of Bag of Words:
- Simple and easy to understand.
- Captures the frequency of words in the document.


### Disadvantages of Bag of Words:
- Ignores the order of words.
- Produces large, sparse vectors when the vocabulary is large, because each document contains only a small fraction of the vocabulary.
- Does not capture the semantic meaning or context of words.
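The first disadvantage is easy to demonstrate. In the sketch below, two made-up sentences use the same words in a different order and mean different things, yet they receive identical count vectors:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same words, different order, different meaning.
pair = ["the dog chased the cat", "the cat chased the dog"]

vectors = CountVectorizer().fit_transform(pair).toarray()

# Columns are in alphabetical order: cat, chased, dog, the.
print(vectors)

# True when both sentences map to exactly the same vector.
same_vector = bool((vectors[0] == vectors[1]).all())
print(same_vector)
```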
