# Bag of Words (BoW)

Reference: [Bag-of-words model (Wikipedia)](https://en.wikipedia.org/wiki/Bag-of-words_model)

Imagine we have three sentences, and we want to count how often each word appears in these sentences. We’ll ignore the order of the words and just focus on how many times each word shows up.

### Sentences:
1. "The quick brown fox jumps over the lazy dog."
2. "Never jump over the lazy dog quickly."
3. "The quick brown fox is quick and fast."

### Step-by-Step Explanation:

1. **Collect Words**: First, we look at all the sentences and make a list of every unique word, lowercasing each word and dropping punctuation so that "The" and "the" count as the same word. This gives us a "vocabulary" of all the words that appear.
2. **Count Words**: Next, we count how many times each word in the vocabulary appears in each sentence.
3. **Create a Table**: Finally, we put this information into a table. Each row represents a sentence, and each column represents a word from our vocabulary. The numbers in the table tell us how many times each word appears in each sentence.
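
The three steps above can be sketched in plain Python using only the standard library. This is a minimal illustration, not how any particular library implements it: the `tokenize` helper is hypothetical, and it assumes that a "word" is a run of lowercase letters after lowercasing the text.

```python
import re
from collections import Counter

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
    "The quick brown fox is quick and fast.",
]

def tokenize(text):
    # Assumption: lowercase the text and keep only runs of letters,
    # which drops punctuation such as the trailing period.
    return re.findall(r"[a-z]+", text.lower())

# Step 1: collect every unique word into a sorted vocabulary.
vocabulary = sorted({word for s in sentences for word in tokenize(s)})

# Step 2: count how often each word appears in each sentence.
counts = [Counter(tokenize(s)) for s in sentences]

# Step 3: build the table, one row per sentence and one column per word.
# A Counter returns 0 for words that do not occur in a sentence.
table = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in table:
    print(row)
```

Sorting the vocabulary is a design choice that keeps the column order stable from run to run.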

### Vocabulary
Here's the list of unique words we found:  
**Vocabulary**: ['and', 'brown', 'dog', 'fast', 'fox', 'is', 'jump', 'jumps', 'lazy', 'never', 'over', 'quick', 'quickly', 'the']

### Bag of Words Table
Now let's put these counts into a table:

| Sentence # | and | brown | dog | fast | fox | is | jump | jumps | lazy | never | over | quick | quickly | the |
|------------|-----|-------|-----|------|-----|----|------|-------|------|-------|------|-------|---------|-----|
| Sentence 1 |  0  |   1   |  1  |  0   |  1  |  0 |   0  |   1   |  1   |   0   |  1   |   1   |    0    |  2  |
| Sentence 2 |  0  |   0   |  1  |  0   |  0  |  0 |   1  |   0   |  1   |   1   |  1   |   0   |    1    |  1  |
| Sentence 3 |  1  |   1   |  0  |  1   |  1  |  1 |   0  |   0   |  0   |   0   |  0   |   2   |    0    |  1  |

### Explanation of the Table
- **Rows**: Each row corresponds to one of the sentences.
- **Columns**: Each column represents a unique word from the vocabulary.
- **Numbers**: The numbers in the table show how many times each word appears in each sentence.

### Example:
- In **Sentence 1** ("The quick brown fox jumps over the lazy dog."):
  - The word "the" appears 2 times.
  - The word "quick" appears 1 time.
  - The word "fox" appears 1 time.
- In **Sentence 2** ("Never jump over the lazy dog quickly."):
  - The word "never" appears 1 time.
  - The word "over" appears 1 time.
  - The word "quickly" appears 1 time.
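
Reading a count from the table amounts to picking a row (the sentence) and a column (the word). The sketch below hardcodes the vocabulary and table from this section; the `column` mapping and the `lookup` helper are hypothetical names introduced only for illustration.

```python
vocabulary = ['and', 'brown', 'dog', 'fast', 'fox', 'is', 'jump', 'jumps',
              'lazy', 'never', 'over', 'quick', 'quickly', 'the']

# Rows copied from the Bag of Words table above (Sentence 1 to Sentence 3).
table = [
    [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 2],
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 2, 0, 1],
]

# Map each word to its column position in the table.
column = {word: i for i, word in enumerate(vocabulary)}

def lookup(sentence_number, word):
    """Return the count of `word` in the given sentence (numbered from 1)."""
    return table[sentence_number - 1][column[word]]

print(lookup(1, "the"))      # 2
print(lookup(2, "quickly"))  # 1
```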

### How It Works
- This representation is called a "Bag of Words" because it records which words are in a sentence and how often they appear, just like counting different objects in a bag.
- It’s a way to turn sentences into numbers so that computers can work with them. It doesn't consider the order of the words, just the frequency of each word.
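
To see what ignoring word order means in practice, here is a small sketch with two made-up sentences (they are not part of the example above) that contain the same words in a different order. The `bag` helper is hypothetical and assumes the same simple lowercase-letters tokenization as before.

```python
import re
from collections import Counter

def bag(text):
    # Assumption: lowercase the text and count runs of letters as words.
    return Counter(re.findall(r"[a-z]+", text.lower()))

a = bag("The dog chased the fox.")
b = bag("The fox chased the dog.")

# Both sentences contain the same words the same number of times,
# so their bags are identical even though the meanings differ.
print(a == b)  # True
```

This also shows the main limitation of the approach: sentences with different meanings can end up with the same representation.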

This is the basic idea behind how the Bag of Words model works. It's a simple way to represent text data that helps computers process and analyze sentences.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
    "The quick brown fox is quick and fast."
]

# Initialize the CountVectorizer, which tokenizes the documents
# and counts the occurrences of each word
vectorizer = CountVectorizer()

# Learn the vocabulary from the documents and transform them into a BoW matrix
bow_matrix = vectorizer.fit_transform(documents)

# Get the vocabulary (all unique words in the corpus)
vocab = vectorizer.get_feature_names_out()

# Convert the sparse BoW matrix to a dense NumPy array for easier viewing
bow_array = bow_matrix.toarray()

# Print the vocabulary, then the count of each word in each document
print("Vocabulary:\n", vocab)
print("\nBag of Words Matrix:\n", bow_array)
```


Output:

```
Vocabulary:
 ['and' 'brown' 'dog' 'fast' 'fox' 'is' 'jump' 'jumps' 'lazy' 'never'
 'over' 'quick' 'quickly' 'the']

Bag of Words Matrix:
 [[0 1 1 0 1 0 0 1 1 0 1 1 0 2]
 [0 0 1 0 0 0 1 0 1 1 1 0 1 1]
 [1 1 0 1 1 1 0 0 0 0 0 2 0 1]]
```
