# 1954 Bag of Words (BoW)

[BoW](https://en.wikipedia.org/wiki/Bag-of-words_model)

Imagine we have three sentences, and we want to count how often each word appears in these sentences. We’ll ignore the order of the words and just focus on how many times each word shows up.

### Sentences:
1. "The quick brown fox jumps over the lazy dog."
2. "Never jump over the lazy dog quickly."
3. "The quick brown fox is quick and fast."

### Step-by-Step Explanation:

1. **Collect Words**: First, we look at all the sentences and make a list of every unique word, ignoring capitalization and punctuation (so "The" and "the" count as the same word). This gives us a "vocabulary" of all the words that appear.
2. **Count Words**: Next, we count how many times each word in the vocabulary appears in each sentence.
3. **Create a Table**: Finally, we put this information into a table. Each row represents a sentence, and each column represents a word from our vocabulary. The numbers in the table tell us how many times each word appears in each sentence.
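
The three steps above can be sketched in plain Python before reaching for a library. This is a minimal illustration, not a reference implementation: the `tokenize` helper and its letters-only regular expression are assumptions made for this example, chosen so that case and punctuation are ignored.

```python
import re
from collections import Counter

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
    "The quick brown fox is quick and fast.",
]

def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens (hypothetical helper)."""
    return re.findall(r"[a-z]+", text.lower())

# Step 1: collect every unique word into a sorted vocabulary.
tokenized = [tokenize(s) for s in sentences]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Step 2: count how often each word occurs in each sentence.
counts = [Counter(tokens) for tokens in tokenized]

# Step 3: build the table, one row per sentence and one column per vocabulary word.
# A Counter returns 0 for words that do not occur in a sentence.
table = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in table:
    print(row)
```

Sorting the vocabulary is a design choice that keeps the column order stable and easy to read; any fixed order would work as long as every row uses the same one.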

### Vocabulary
Here's the list of unique words we found:  
**Vocabulary**: ['and', 'brown', 'dog', 'fast', 'fox', 'is', 'jump', 'jumps', 'lazy', 'never', 'over', 'quick', 'quickly', 'the']

### Bag of Words Table
Now let's put these counts into a table:

| Sentence # | and | brown | dog | fast | fox | is | jump | jumps | lazy | never | over | quick | quickly | the |
|------------|-----|-------|-----|------|-----|----|------|-------|------|-------|------|-------|---------|-----|
| Sentence 1 |  0  |   1   |  1  |  0   |  1  |  0 |   0  |   1   |  1   |   0   |  1   |   1   |    0    |  2  |
| Sentence 2 |  0  |   0   |  1  |  0   |  0  |  0 |   1  |   0   |  1   |   1   |  1   |   0   |    1    |  1  |
| Sentence 3 |  1  |   1   |  0  |  1   |  1  |  1 |   0  |   0   |  0   |   0   |  0   |   2   |    0    |  1  |

### Explanation of the Table
- **Rows**: Each row corresponds to one of the sentences.
- **Columns**: Each column represents a unique word from the vocabulary.
- **Numbers**: The numbers in the table show how many times each word appears in each sentence.

### Example:
- In **Sentence 1** ("The quick brown fox jumps over the lazy dog."):
  - The word "the" appears 2 times.
  - The word "quick" appears 1 time.
  - The word "fox" appears 1 time.
- In **Sentence 2** ("Never jump over the lazy dog quickly."):
  - The word "never" appears 1 time.
  - The word "over" appears 1 time.
  - The word "quickly" appears 1 time.

### How It Works
- This table is called a "Bag of Words" because it records which words are in each sentence and how often they appear, just like counting different objects in a bag.
- It turns each sentence into a fixed-length list of numbers that a computer can work with. Word order is discarded; only the frequency of each word is kept.

This is the basic idea behind how the Bag of Words model works. It's a simple way to represent text data that helps computers process and analyze sentences.
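
To make the "order does not matter" point concrete, here is a small sketch. The two sentences and the `bag_of_words` helper are made up for this illustration and are not part of the example above; they contain the same words in a different order, so their bags are identical even though their meanings differ.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping, ignoring case, punctuation, and order (hypothetical helper)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

a = bag_of_words("The dog chased the fox.")
b = bag_of_words("The fox chased the dog.")

# Both sentences contain "the" twice and "dog", "chased", "fox" once each.
print(a == b)  # True
```

This is also the main limitation of the representation: any information carried by word order is lost.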

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
    "The quick brown fox is quick and fast."
]

# Initialize the CountVectorizer
# CountVectorizer will tokenize the documents and count the occurrences of each word
vectorizer = CountVectorizer()

# Fit and transform the documents
# This step learns the vocabulary from the documents and transforms them into a BoW matrix
bow_matrix = vectorizer.fit_transform(documents)

# Get the vocabulary (unique words in the corpus)
# get_feature_names_out() provides the list of all unique words in the documents
vocab = vectorizer.get_feature_names_out()

# Convert the BoW matrix to an array
# bow_matrix.toarray() converts the sparse matrix to a dense numpy array for easier viewing
bow_array = bow_matrix.toarray()

# Display the results
# Print the vocabulary which lists all unique words in the documents
print("Vocabulary:\n", vocab)
# Print the BoW matrix which shows the count of each word in each document
print("\nBag of Words Matrix:\n", bow_array)


Vocabulary:
 ['and' 'brown' 'dog' 'fast' 'fox' 'is' 'jump' 'jumps' 'lazy' 'never'
 'over' 'quick' 'quickly' 'the']

Bag of Words Matrix:
 [[0 1 1 0 1 0 0 1 1 0 1 1 0 2]
 [0 0 1 0 0 0 1 0 1 1 1 0 1 1]
 [1 1 0 1 1 1 0 0 0 0 0 2 0 1]]


In [7]:
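# Printing the sparse matrix directly lists only the nonzero entries,
# one per line, in the form: (document index, word index)  count
print(bow_matrix)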


  (0, 13)	2
  (0, 11)	1
  (0, 1)	1
  (0, 4)	1
  (0, 7)	1
  (0, 10)	1
  (0, 8)	1
  (0, 2)	1
  (1, 13)	1
  (1, 10)	1
  (1, 8)	1
  (1, 2)	1
  (1, 9)	1
  (1, 6)	1
  (1, 12)	1
  (2, 13)	1
  (2, 11)	2
  (2, 1)	1
  (2, 4)	1
  (2, 5)	1
  (2, 0)	1
  (2, 3)	1
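
Each line of the sparse printout pairs a (document index, word index) position with a count. As a rough sketch of how to read it, the snippet below copies the entries for document 0 from the output above and maps each word index back to its vocabulary word. The `entries` list and the `decoded` name are introduced here only for illustration.

```python
vocab = ["and", "brown", "dog", "fast", "fox", "is", "jump", "jumps",
         "lazy", "never", "over", "quick", "quickly", "the"]

# Nonzero entries for document 0 as (row, column, count), copied from the printout.
entries = [(0, 13, 2), (0, 11, 1), (0, 1, 1), (0, 4, 1),
           (0, 7, 1), (0, 10, 1), (0, 8, 1), (0, 2, 1)]

# Translate each column index back into the word it stands for.
decoded = {vocab[col]: count for _, col, count in entries}
print(decoded)
```

Words that do not occur in the document, such as "and" or "never", simply have no entry, which is why the sparse form is more compact than the dense array.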


In [10]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
    "The quick brown fox is quick and fast."
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
bow_matrix = vectorizer.fit_transform(documents)

# Get the word-to-index mapping: a dict from each word to its column in the BoW matrix
word_to_index = vectorizer.vocabulary_

# Convert the sparse BoW matrix to a dense numpy array
bow_array = bow_matrix.toarray()

# Get the vocabulary as an array of words, ordered by column index
vocab = vectorizer.get_feature_names_out()

# Display the word to index mapping
print("Word to Index Mapping:\n", word_to_index)

# Display the vocabulary and BoW matrix
print("\nVocabulary:\n", vocab)
print("\nBag of Words Matrix:\n", bow_array)


Word to Index Mapping:
 {'the': 13, 'quick': 11, 'brown': 1, 'fox': 4, 'jumps': 7, 'over': 10, 'lazy': 8, 'dog': 2, 'never': 9, 'jump': 6, 'quickly': 12, 'is': 5, 'and': 0, 'fast': 3}

Vocabulary:
 ['and' 'brown' 'dog' 'fast' 'fox' 'is' 'jump' 'jumps' 'lazy' 'never'
 'over' 'quick' 'quickly' 'the']

Bag of Words Matrix:
 [[0 1 1 0 1 0 0 1 1 0 1 1 0 2]
 [0 0 1 0 0 0 1 0 1 1 1 0 1 1]
 [1 1 0 1 1 1 0 0 0 0 0 2 0 1]]
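
The word-to-index mapping is what connects a word to its column in the matrix. The sketch below hardcodes the mapping and matrix printed above so that it runs without scikit-learn; the `count_of` helper is a hypothetical name introduced for this example.

```python
word_to_index = {'the': 13, 'quick': 11, 'brown': 1, 'fox': 4, 'jumps': 7,
                 'over': 10, 'lazy': 8, 'dog': 2, 'never': 9, 'jump': 6,
                 'quickly': 12, 'is': 5, 'and': 0, 'fast': 3}

bow_array = [
    [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 2],
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 2, 0, 1],
]

def count_of(word, doc_index):
    """Return how often `word` occurs in the given document (0 if the word is not in the vocabulary)."""
    col = word_to_index.get(word)
    return 0 if col is None else bow_array[doc_index][col]

print(count_of("quick", 2))  # 2
print(count_of("the", 0))    # 2
print(count_of("cat", 0))    # 0
```

Returning 0 for an unknown word is one possible convention; it mirrors the fact that a word outside the learned vocabulary has no column at all.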


### Bag of Words (BoW) Representation  
Given a set of documents and a vocabulary extracted from these documents, the Bag of Words (BoW) vector for each document $ D_j $ is represented as:

$$
\text{BoW}(D_j) = \left[ f(w_1, D_j), f(w_2, D_j), \ldots, f(w_n, D_j) \right]
$$

Where:

- $ f(w_i, D_j) $ is the frequency of word $ w_i $ in document $ D_j $.
- $ n $ is the total number of unique words in the vocabulary across all documents.
- Each document $ D_j $ is therefore transformed into a vector of length $ n $, regardless of how many words the document itself contains.

### Example for the Provided Code  
Given the vocabulary $ V $ extracted from the documents, listed in the alphabetical order that fixes each word's position in the vector:

$$
V = \{\text{"and"}, \text{"brown"}, \text{"dog"}, \text{"fast"}, \text{"fox"}, \text{"is"}, \text{"jump"}, \text{"jumps"}, \text{"lazy"}, \text{"never"}, \text{"over"}, \text{"quick"}, \text{"quickly"}, \text{"the"}\}
$$

For each document $ D_j $ in the set of documents $ D $, the BoW vector $ \text{BoW}(D_j) $ is constructed as:

$$
\text{BoW}(D_1) = \left[ f(\text{"and"}, D_1), f(\text{"brown"}, D_1), \ldots, f(\text{"the"}, D_1) \right]
$$

$$
\text{BoW}(D_2) = \left[ f(\text{"and"}, D_2), f(\text{"brown"}, D_2), \ldots, f(\text{"the"}, D_2) \right]
$$

$$
\text{BoW}(D_3) = \left[ f(\text{"and"}, D_3), f(\text{"brown"}, D_3), \ldots, f(\text{"the"}, D_3) \right]
$$

Where each $ f(w_i, D_j) $ is the count of the word $ w_i $ in the document $ D_j $.
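
Filling in the counts from the matrix printed earlier, the three vectors work out to:

$$
\text{BoW}(D_1) = \left[ 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 2 \right]
$$

$$
\text{BoW}(D_2) = \left[ 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1 \right]
$$

$$
\text{BoW}(D_3) = \left[ 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 2, 0, 1 \right]
$$

For instance, the last entry of $ \text{BoW}(D_1) $ is $ f(\text{"the"}, D_1) = 2 $, and the twelfth entry of $ \text{BoW}(D_3) $ is $ f(\text{"quick"}, D_3) = 2 $.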

### Final BoW Matrix  
The resulting BoW matrix for all documents is an $ m \times n $ matrix $ M $ where $ m $ is the number of documents and $ n $ is the size of the vocabulary:

$$
M = 
\begin{bmatrix}
f(w_1, D_1) & f(w_2, D_1) & \cdots & f(w_n, D_1) \\
f(w_1, D_2) & f(w_2, D_2) & \cdots & f(w_n, D_2) \\
\vdots & \vdots & \ddots & \vdots \\
f(w_1, D_m) & f(w_2, D_m) & \cdots & f(w_n, D_m) \\
\end{bmatrix}
$$

Row $ j $ of the matrix $ M $ is the vector $ \text{BoW}(D_j) $: each element is the frequency count of the corresponding vocabulary word in document $ D_j $.
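
As a quick sanity check on this matrix, the sketch below confirms that $ m = 3 $ and $ n = 14 $ for the example, and that each row sums to the number of words in its document. The letters-only tokenization is an assumption made for this illustration. The sums match here because every word has at least two characters; `CountVectorizer` with its default settings ignores single-character tokens, so a document containing words like "a" or "I" would have a smaller row sum.

```python
import re

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
    "The quick brown fox is quick and fast.",
]

# The BoW matrix M, copied from the output shown earlier.
M = [
    [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 2],
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 2, 0, 1],
]

m, n = len(M), len(M[0])
print(m, n)  # 3 14

# Compare each row sum with the number of word tokens in the document.
row_sums = [sum(row) for row in M]
token_counts = [len(re.findall(r"[a-z]+", doc.lower())) for doc in documents]
print(row_sums)      # [9, 7, 8]
print(token_counts)  # [9, 7, 8]
```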
