# Bag of Words

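Bag of Words (BoW) is the simplest form of word embedding. A Bag of Words model encodes every word in the vocabulary as a one-hot-encoded vector and represents a document as the sum of the one-hot vectors of its words, which yields one count per vocabulary word. The process goes through the following steps: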

1. Construct a vocabulary of words.
2. For each document, construct a vector of dimension *d* (*d* being the vocabulary size). Each index/dimension of the vector corresponds to a unique word in the vocabulary. The value in each cell of the vector is the number of times the word with that index occurs in the document.
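
The two steps above can be sketched in plain Python. This is a minimal illustration, not how any particular library implements it: the helper names (`tokenize`, `build_vocabulary`, `vectorize`), the letters-only tokenizer, and the toy corpus are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase the text and keep runs of letters.
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    # Step 1: collect the unique words and sort them so that each word
    # gets a stable index (column) in the vector.
    words = {token for doc in documents for token in tokenize(doc)}
    return {word: index for index, word in enumerate(sorted(words))}


def vectorize(document, vocabulary):
    # Step 2: build a vector of length d (the vocabulary size) holding
    # the count of each vocabulary word in this document.
    counts = Counter(tokenize(document))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


# Toy corpus, made up for this sketch.
toy_corpus = ["the cat sat", "the cat saw the dog"]
vocabulary = build_vocabulary(toy_corpus)
vectors = [vectorize(doc, vocabulary) for doc in toy_corpus]

print(vocabulary)  # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Sorting the vocabulary is a design choice that keeps column order reproducible between runs; any fixed word-to-index mapping would work equally well.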
    
**Drawbacks of Bag-of-Words:**

- The vector length equals the vocabulary size, so it becomes very large for a large corpus.
- BoW results in a sparse matrix (mostly zeros), which is what we would like to avoid.
- It retains no information about grammar or the ordering of words in a document.
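
The loss of word order is easy to see with a small sketch. The `bag` helper and the two sentences below are made up for illustration; the helper simply counts whitespace-separated words.

```python
from collections import Counter


def bag(sentence):
    # Hypothetical helper: count lowercase, whitespace-separated words.
    return Counter(sentence.lower().split())


a = bag("dog bites man")
b = bag("man bites dog")

# The two sentences mean different things, yet their word counts match,
# so a Bag of Words representation cannot tell them apart.
order_lost = a == b
print(order_lost)  # True
```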

![](./../assets/embedding/bow.jpg)

The `CountVectorizer` class from scikit-learn works well for generating the document-term matrix. As always, let's start by installing the required library.

`pip3 install scikit-learn`

```python
from sklearn.feature_extraction.text import CountVectorizer
```

```python
corpus = [
    'About the bird, the bird, bird bird bird',
    'You heard about the bird',
    'The bird is the word'
]
```

```python
vectorizer = CountVectorizer()
output = vectorizer.fit_transform(corpus)

print(output.todense())
```

```
[[1 5 0 0 2 0 0]
 [1 1 1 0 1 0 1]
 [0 1 0 1 2 1 0]]
```


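Each row in the output matrix is the count vector for the sentence at the corresponding index of the corpus list. The columns follow the vocabulary in alphabetical order (about, bird, heard, is, the, word, you). `fit_transform` returns a SciPy sparse matrix, and `todense()` converts it to a dense one for printing.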