# Bag of Words (BoW)

The Bag of Words (BoW) model is a simple and widely used method for text representation in natural language processing (NLP). It represents a text as an unordered collection (a "bag") of its words, which serve as features, and ignores grammar, word order, and context. Despite this simplification, the model is useful for text classification, information retrieval, and other text-related tasks.

<img src="images/bow.png">

### Layman Terms
Imagine you have a bunch of documents, and you want to represent each document by the words it contains. The Bag of Words model creates a list of all unique words in your documents and then represents each document as a list of word counts. It doesn't care about the order of the words; it just counts how many times each word appears.
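
To make the "order doesn't matter" point concrete, here is a minimal sketch using only Python's standard library. The helper name `bag_of_words` and the two sample sentences are illustrative assumptions, not part of the scikit-learn example later in this document.

```python
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping for a lowercased, whitespace-split text."""
    return Counter(text.lower().split())

bag_a = bag_of_words("dog bites man")
bag_b = bag_of_words("man bites dog")

# Both sentences contain the same words the same number of times,
# so their bags are identical even though their meanings differ.
print(bag_a == bag_b)
```

This is also the main limitation of the model: any information carried by word order is lost.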

### Example
Let's say you have two sentences:
1. "The cat sits on the mat."
2. "The dog plays with the cat."

After lowercasing the text and removing punctuation, the Bag of Words model would create a list of unique words: ["the", "cat", "sits", "on", "mat", "dog", "plays", "with"]

Each document is then represented as a vector of word counts:
- Sentence 1: [2, 1, 1, 1, 1, 0, 0, 0]
- Sentence 2: [2, 1, 0, 0, 0, 1, 1, 1]
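
The two vectors above can be reproduced by hand with a short standard-library sketch. The `tokenize` helper and its letters-only regular expression are assumptions made for this illustration; the vocabulary is kept in order of first appearance so that it matches the list shown above.

```python
import re
from collections import Counter

sentences = ["The cat sits on the mat.", "The dog plays with the cat."]

def tokenize(text):
    # Lowercase, then keep only runs of letters (this drops the trailing period).
    return re.findall(r"[a-z]+", text.lower())

# Build the vocabulary in order of first appearance.
vocabulary = []
for sentence in sentences:
    for token in tokenize(sentence):
        if token not in vocabulary:
            vocabulary.append(token)

# Count each vocabulary word in each sentence; missing words count as 0.
vectors = []
for sentence in sentences:
    counts = Counter(tokenize(sentence))
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```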

### Math Behind It
1. **Vocabulary**: Create a list of all unique words in the corpus.
2. **Vector Representation**: For each document, create a vector where each element represents the count of a word in the document.
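
The two steps above can be written compactly. The notation here is introduced only for this illustration: $d_1, \dots, d_N$ are the documents, $V$ is the vocabulary, and $c(w, d)$ is the number of times word $w$ occurs in document $d$.

```latex
V = \{w_1, w_2, \dots, w_{|V|}\}, \qquad
\mathbf{x}_i = \bigl(c(w_1, d_i),\; c(w_2, d_i),\; \dots,\; c(w_{|V|}, d_i)\bigr) \in \mathbb{N}^{|V|}
```

Stacking the vectors $\mathbf{x}_1, \dots, \mathbf{x}_N$ as rows gives an $N \times |V|$ document-term matrix, which is the Bag of Words matrix.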

### Implementation in Python
The rest of this section implements the Bag of Words model using scikit-learn (`sklearn`) in Python.

### Installation
First, install the `scikit-learn`, `nltk`, and `pandas` libraries if you haven't already (the example imports all three):
```bash
pip install scikit-learn nltk pandas
```

### Example Code

```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
import pandas as pd
import nltk
```

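```python
# Download NLTK tokenizer data (only needs to run once).
# Newer NLTK releases may ask for 'punkt_tab' instead; if word_tokenize
# raises a LookupError, download the resource named in the error message.
nltk.download('punkt')
```

```text
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/gauravkandel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
```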

```python
# Sample corpus
documents = [
    "The cat sits on the mat",
    "The dog plays with the cat",
    "Dogs and cats are great pets",
    "The mat is under the table"
]
```

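```python
# Preprocess the documents: lowercase, tokenize with NLTK, then join the
# tokens back into one string per document (CountVectorizer expects strings)
tokenized_documents = [" ".join(word_tokenize(doc.lower())) for doc in documents]
```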

```python
# Initialize the Count Vectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
bow_matrix = vectorizer.fit_transform(tokenized_documents)
```

```python
# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()
```

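```python
# Convert the sparse BoW matrix to a dense NumPy array and wrap it in a DataFrame
# (toarray() returns a plain ndarray, whereas todense() returns the older np.matrix type)
dense_bow = bow_matrix.toarray()
bow_df = pd.DataFrame(dense_bow, columns=feature_names)
```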

```python
print("Bag of Words Matrix:")
print(bow_df)
```

```text
Bag of Words Matrix:
   and  are  cat  cats  dog  dogs  great  is  mat  on  pets  plays  sits  \
0    0    0    1     0    0     0      0   0    1   1     0      0     1   
1    0    0    1     0    1     0      0   0    0   0     0      1     0   
2    1    1    0     1    0     1      1   0    0   0     1      0     0   
3    0    0    0     0    0     0      0   1    1   0     0      0     0   

   table  the  under  with  
0      0    2      0     0  
1      0    2      0     1  
2      0    0      0     0  
3      1    2      1     0  
```


```python
# Example: Get the word count for a specific word in a specific document
doc_index = 0  # First document
word = 'cat'
word_index = feature_names.tolist().index(word)
word_count = bow_matrix[doc_index, word_index]
```

```python
print(f"Word count for '{word}' in document {doc_index}: {word_count}")
```

```text
Word count for 'cat' in document 0: 1
```


### Explanation
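1. **Tokenize Documents**: We convert each document to lowercase, tokenize it with NLTK's `word_tokenize`, and join the tokens back into a single string. This step is optional for this corpus, because `CountVectorizer` lowercases and tokenizes its input by default.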
2. **Count Vectorizer**: We initialize a `CountVectorizer` from `sklearn`.
3. **Fit and Transform**: We fit the vectorizer to the tokenized documents and transform them into a Bag of Words matrix.
4. **Feature Names**: We retrieve the feature names (words) from the vectorizer.
5. **Dense Format**: We convert the sparse Bag of Words matrix to a dense format for easier display and create a DataFrame for better readability.
6. **Word Count**: We get the word count for a specific word in a specific document.
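
Step 5 mentions that the matrix starts out sparse. The sketch below illustrates why that matters, using row 0 of the matrix printed above. It is a simplified, dictionary-based illustration with standard-library Python only; it is not how SciPy actually lays out the sparse matrix that scikit-learn returns.

```python
# Row 0 of the matrix above ("The cat sits on the mat"), written out densely.
feature_names = ["and", "are", "cat", "cats", "dog", "dogs", "great", "is", "mat",
                 "on", "pets", "plays", "sits", "table", "the", "under", "with"]
dense_row = [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 2, 0, 0]

# Sparse form: keep only (column index -> count) pairs for non-zero entries.
sparse_row = {j: count for j, count in enumerate(dense_row) if count != 0}

# Map the stored column indices back to words for readability.
nonzero_words = {feature_names[j]: count for j, count in sparse_row.items()}

print(f"Stored {len(sparse_row)} of {len(dense_row)} entries: {nonzero_words}")
```

With a realistic vocabulary of tens of thousands of words, almost every entry in a row is zero, so storing only the non-zero counts saves a large amount of memory.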

### Note
In a real-world scenario, you might need to preprocess the text further (e.g., removing stop words, stemming/lemmatization) and use a larger corpus of documents. The example above is simplified for clarity.
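
As a rough illustration of one such preprocessing step, the sketch below removes stop words before counting. The `STOP_WORDS` set is a tiny, hand-picked list assumed for this example, and `bag_without_stop_words` is a hypothetical helper; in practice you would more likely rely on a curated list, for example through the `stop_words` parameter of `CountVectorizer`.

```python
from collections import Counter

# Hypothetical, hand-picked stop word list for illustration only.
STOP_WORDS = {"the", "is", "on", "with", "and", "are"}

documents = [
    "The cat sits on the mat",
    "The dog plays with the cat",
    "Dogs and cats are great pets",
    "The mat is under the table",
]

def bag_without_stop_words(text):
    """Count the words of a lowercased text, skipping stop words."""
    tokens = text.lower().split()
    return Counter(token for token in tokens if token not in STOP_WORDS)

bags = [bag_without_stop_words(doc) for doc in documents]

for doc, bag in zip(documents, bags):
    print(f"{doc!r}: {dict(bag)}")
```

Dropping very frequent words such as "the" shrinks the vocabulary and keeps the counts focused on the words that distinguish one document from another.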