# Bag of Words

Bag of Words is a technique in which the count (frequency) of each word in a document (here, a sentence) is used as a feature. The feature matrix is built from the vocabulary.
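As a quick illustration of the counting idea, here is a minimal sketch that uses `collections.Counter` from the standard library on one sentence from the dataset used below. The variable names are chosen for illustration only.

```python
from collections import Counter

# One document (sentence) from the example dataset
sentence = "campusx watch campusx"

# Split on whitespace and count how often each word occurs
word_counts = Counter(sentence.split())

print(word_counts["campusx"])  # 2
print(word_counts["watch"])    # 1
print(word_counts["people"])   # 0 (Counter returns 0 for missing words)
```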

In [None]:
import pandas as pd

df = pd.DataFrame({
    "text": ["people watch campusx", "campusx watch campusx", "people write comment", "campusx write comment"]
})

In [None]:
df

Unnamed: 0,text
0,people watch campusx
1,campusx watch campusx
2,people write comment
3,campusx write comment


Now let's build the corpus and the vocabulary for this data.

In [None]:
corpus = " ".join(df.text)
corpus

'people watch campusx campusx watch campusx people write comment campusx write comment'

In [None]:
vocabulary = list(set(corpus.split()))
vocabulary

['watch', 'people', 'campusx', 'comment', 'write']

Bag of Words creates a feature matrix based on this vocabulary. Each document is mapped to a vector with one entry per vocabulary word, holding that word's frequency in the document. This vector is the document's vector representation.

In [None]:
feature_matrix: list[list[int]] = []
for sentence in df.text:
    tokens = sentence.split()  # count whole words, not substrings
    vector = []
    for word in vocabulary:
        vector.append(tokens.count(word))
    print(sentence, " -> ", vector)
    feature_matrix.append(vector)

people watch campusx  ->  [1, 1, 1, 0, 0]
campusx watch campusx  ->  [1, 0, 2, 0, 0]
people write comment  ->  [0, 1, 0, 1, 1]
campusx write comment  ->  [0, 0, 1, 1, 1]


In [None]:
feature_matrix

[[1, 1, 1, 0, 0], [1, 0, 2, 0, 0], [0, 1, 0, 1, 1], [0, 0, 1, 1, 1]]

# Scikit-learn Based BoW (CountVectorizer)

Since every text vectorization technique converts textual data into numerical features, these techniques are a form of feature extraction. Hence scikit-learn provides them in the **sklearn.feature_extraction.text** module.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary and returns the document-term count matrix in one step
feature_matrix = vectorizer.fit_transform(df.text)
feature_matrix

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 11 stored elements and shape (4, 5)>

In [None]:
# It is a sparse matrix; convert it to a dense NumPy array with .toarray()
feature_matrix = feature_matrix.toarray()
feature_matrix

array([[1, 0, 1, 1, 0],
       [2, 0, 0, 1, 0],
       [0, 1, 1, 0, 1],
       [1, 1, 0, 0, 1]])
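
The sparse representation reported 11 stored elements: these are the non-zero counts. As a rough sketch of the idea (not SciPy's actual internal CSR layout), the snippet below lists the non-zero cells of the dense matrix as `(row, column, value)` triples. The values are copied from the `.toarray()` output above, and the name `stored` is only illustrative.

```python
# Dense document-term matrix, copied from the .toarray() output above
dense = [
    [1, 0, 1, 1, 0],
    [2, 0, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
]

# Keep only the non-zero cells as (row, column, value) triples
stored = [
    (r, c, value)
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
]

print(len(stored))  # 11 non-zero entries out of 4 * 5 = 20 cells
```

With a realistic vocabulary of thousands of words, most cells are zero, which is why a sparse format saves a lot of memory.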

At first glance it looks different from our manual matrix. Let's examine the vocabulary (remember the trailing underscore: `vocabulary_`).

In [None]:
vectorizer.vocabulary_

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}
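
`vocabulary_` maps each word to its column index in the feature matrix. A minimal sketch of turning that mapping into an ordered list of column names follows; the dictionary is copied from the output above so the snippet runs on its own, and `vocab_index` and `columns` are illustrative names. Recent scikit-learn versions also offer `vectorizer.get_feature_names_out()` for the same purpose.

```python
# Word -> column index mapping, copied from vectorizer.vocabulary_ above
vocab_index = {'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}

# Sort the words by their column index to recover the column order
columns = sorted(vocab_index, key=vocab_index.get)

print(columns)  # ['campusx', 'comment', 'people', 'watch', 'write']
```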

In [None]:
# Compare with our previous vocabulary
vocabulary

['watch', 'people', 'campusx', 'comment', 'write']

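Yes, the column order is different: CountVectorizer assigns column indices in alphabetical order, while our list came from a set, whose order is arbitrary. That is fine, because BoW does not depend on the **order of words in the vocabulary**, as long as the same order is used for every document.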
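
To check that the two matrices hold the same information, the sketch below rebuilds the manual feature matrix with the vocabulary sorted alphabetically, which matches the column indices shown in `vocabulary_`. The comparison array is copied from the `.toarray()` output above, and `build_matrix` is a helper name introduced here for illustration.

```python
docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

def build_matrix(docs, vocabulary):
    """Count each vocabulary word per document, one column per word."""
    return [[doc.split().count(word) for word in vocabulary] for doc in docs]

# Sorting makes the column order reproducible and alphabetical
sorted_vocabulary = sorted({word for doc in docs for word in doc.split()})
manual_matrix = build_matrix(docs, sorted_vocabulary)

# Values copied from the CountVectorizer .toarray() output above
sklearn_matrix = [
    [1, 0, 1, 1, 0],
    [2, 0, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
]

print(sorted_vocabulary)                # ['campusx', 'comment', 'people', 'watch', 'write']
print(manual_matrix == sklearn_matrix)  # True
```

Only the column order differed; once both sides use the same vocabulary order, the counts agree cell by cell.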