### Introducing Bag of Words (BoW)

The Bag of Words (BoW) model is a simple and commonly used approach in natural language processing (NLP) for text representation. It transforms text into a numerical format that machine learning algorithms can understand. The main idea behind BoW is to represent a document as an unordered collection of words, disregarding grammar and word order but keeping track of word frequency.

#### How it Works
The process of creating a BoW model generally involves these steps:

1. Corpus and preprocessing: Start with a set of documents (the corpus). Before the bag of words is built, the text is typically preprocessed. This can include:
   - Tokenization: Breaking down sentences into individual words (or tokens).
   - Lowercasing: Converting all words to lowercase to treat "The" and "the" as the same word.
   - Removing stop words: Eliminating common, low-information words like "a," "is," "the," etc.
   - Stemming or lemmatization: Reducing words to their root form (e.g., "running" becomes "run").

2. Creating a vocabulary: A list of all unique words from the preprocessed corpus is created. This list forms the vocabulary, and its size determines the dimensionality of the vectors.

3. Generating document vectors: For each document, a vector is created with the same length as the vocabulary. Each entry in the vector corresponds to one word from the vocabulary, and its value is the number of times that word appears in the document. A value of zero means the word is not present.
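The preprocessing stage can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the `preprocess` helper and the tiny `STOP_WORDS` set are assumptions made for the example, and stemming or lemmatization is left out because it usually relies on a dedicated NLP library.

```python
import re

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "is", "the", "to", "on"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, tokenize on runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]

print(preprocess("The cat is on the mat."))  # ['cat', 'mat']
```

Tokenizing on runs of letters also discards punctuation and digits, which is convenient here but may be too aggressive for some corpora.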

For example, consider these two sentences:

- Document 1: "John likes to watch movies. Mary likes movies too."
- Document 2: "Mary also likes to watch football games."

The combined vocabulary (after lowercasing and removing punctuation) would be:
`["john", "likes", "to", "watch", "movies", "mary", "too", "also", "football", "games"]`

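Counting how often each vocabulary word occurs in each document gives the BoW vectors:

- Vector for Document 1: `[1, 2, 1, 1, 2, 1, 1, 0, 0, 0]` ("likes" and "movies" each appear twice)
- Vector for Document 2: `[0, 1, 1, 1, 0, 1, 0, 1, 1, 1]`

The same procedure applies to any corpus. As a second example, take Document A, "The cat sat on the mat.", and Document B, "The dog chased the cat.", with the vocabulary `["the", "cat", "sat", "on", "mat", "dog", "chased"]`:

| Word | Document A | Document B |
|---|---|---|
| "the" | 2 | 2 |
| "cat" | 1 | 1 |
| "sat" | 1 | 0 |
| "on" | 1 | 0 |
| "mat" | 1 | 0 |
| "dog" | 0 | 1 |
| "chased" | 0 | 1 |

- Vector for Document A: `[2, 1, 1, 1, 1, 0, 0]`
- Vector for Document B: `[2, 1, 0, 0, 0, 1, 1]`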
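The count vectors for the two John/Mary sentences can be checked with a short script. This is a sketch under simple assumptions: the `tokenize` helper is hypothetical, it keeps only runs of letters, and the vocabulary is built in order of first appearance so that it matches the list given earlier.

```python
import re

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "Mary also likes to watch football games.",
]

def tokenize(text):
    """Lowercase the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

# Build the vocabulary in order of first appearance across the corpus.
vocabulary = []
for doc in docs:
    for token in tokenize(doc):
        if token not in vocabulary:
            vocabulary.append(token)

# One count per vocabulary word, per document.
vectors = [[tokenize(doc).count(word) for word in vocabulary] for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```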




**Advantages**
- Simplicity: It's a straightforward and easy-to-implement method.
- Computational efficiency: Building the vectors only requires counting tokens, so it is cheap to compute and works well for small to medium-sized datasets.
- Effective for certain tasks: It is sufficient for tasks like spam detection and document classification, where word frequency is a strong indicator.

**Disadvantages**
- Loss of word order: It completely ignores the sequence of words, losing valuable syntactic and semantic information. "Dog bites man" and "Man bites dog" would have the same vector.
- High dimensionality and sparsity: The vocabulary can become very large, leading to high-dimensional vectors that consist mostly of zeros.
- No semantic meaning: It doesn't capture the meaning or context of words. Each word is an independent dimension, so synonyms such as "good" and "great" are treated as unrelated.
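The word-order limitation is easy to confirm. In the sketch below, the hypothetical `bow` helper lowercases a sentence and counts its whitespace-separated tokens; the two sentences end up with identical counts even though they mean different things.

```python
from collections import Counter

def bow(text):
    """Return word counts for a sentence, ignoring case and word order."""
    return Counter(text.lower().split())

a = bow("Dog bites man")
b = bow("Man bites dog")

print(a == b)  # True: both sentences map to the same bag of words
```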

### Implementing Bag of Words from scratch

```python
documents = ['Hello, how are you!',
             'I am good, how are you?',
             'What is your favorite food?',
             'My favorite food is Chinese food',
             'Chinese food is delicious']

# Lowercase every document so that "Hello" and "hello" count as the same word.
lower_case_documents = [doc.lower() for doc in documents]
print(lower_case_documents)
```

Output:

```
['hello, how are you!', 'i am good, how are you?', 'what is your favorite food?', 'my favorite food is chinese food', 'chinese food is delicious']
```


```python
import string

# Remove punctuation from each string in the lower_case_documents list.
sans_punctuation_documents = [
    ''.join(c for c in doc if c not in string.punctuation)
    for doc in lower_case_documents
]
print(sans_punctuation_documents)
```

Output:

```
['hello how are you', 'i am good how are you', 'what is your favorite food', 'my favorite food is chinese food', 'chinese food is delicious']
```


```python
# Tokenize: split each document into a list of words on whitespace.
preprocessed_documents = [doc.split() for doc in sans_punctuation_documents]
print(preprocessed_documents)
```

Output:

```
[['hello', 'how', 'are', 'you'], ['i', 'am', 'good', 'how', 'are', 'you'], ['what', 'is', 'your', 'favorite', 'food'], ['my', 'favorite', 'food', 'is', 'chinese', 'food'], ['chinese', 'food', 'is', 'delicious']]
```


```python
import pprint
from collections import Counter

# Count how often each word occurs in each document.
frequency_list = [Counter(doc) for doc in preprocessed_documents]
pprint.pprint(frequency_list)
```

Output:

```
[Counter({'hello': 1, 'how': 1, 'are': 1, 'you': 1}),
 Counter({'i': 1, 'am': 1, 'good': 1, 'how': 1, 'are': 1, 'you': 1}),
 Counter({'what': 1, 'is': 1, 'your': 1, 'favorite': 1, 'food': 1}),
 Counter({'food': 2, 'my': 1, 'favorite': 1, 'is': 1, 'chinese': 1}),
 Counter({'chinese': 1, 'food': 1, 'is': 1, 'delicious': 1})]
```
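The per-document counts are not yet fixed-length vectors. One possible way to finish the from-scratch version is to collect a shared vocabulary and look up each word's count in every document. The sketch below is self-contained and repeats the preprocessing in a hypothetical `tokenize` helper; sorting the vocabulary alphabetically is an assumption made so that the column order is deterministic.

```python
import string
from collections import Counter

documents = ['Hello, how are you!',
             'I am good, how are you?',
             'What is your favorite food?',
             'My favorite food is Chinese food',
             'Chinese food is delicious']

def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    cleaned = ''.join(c for c in text.lower() if c not in string.punctuation)
    return cleaned.split()

frequency_list = [Counter(tokenize(doc)) for doc in documents]

# Shared vocabulary: every unique word in the corpus, sorted alphabetically.
vocabulary = sorted(set().union(*frequency_list))

# A Counter returns 0 for missing keys, so absent words become zeros.
bow_matrix = [[counts[word] for word in vocabulary] for counts in frequency_list]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row of `bow_matrix` has one entry per vocabulary word, so all five documents now share the same vector length.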


### Implementing Bag of Words in scikit-learn

#### Corpus of documents

```python
corpus = ['I love data science',
          'Data science is great',
          'I love coding in Python',
          'Python is great for data analysis']
```

#### Applying Bag of Words

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary and build the sparse document-term count matrix.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Convert to a dense DataFrame with one column per vocabulary word.
df_bow = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df_bow
```


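| | analysis | coding | data | for | great | in | is | love | python | science |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 3 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |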
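Two details of the output above are worth noting: the columns are in alphabetical order, and there is no column for "i". With its default settings, `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters, so the single-letter "I" is dropped. The sketch below imitates that behavior with the standard library alone. It is an approximation for illustration, not scikit-learn's actual implementation, and the `tokenize` helper and `TOKEN_PATTERN` name are assumptions.

```python
import re

corpus = ['I love data science',
          'Data science is great',
          'I love coding in Python',
          'Python is great for data analysis']

# Keep only tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and extract tokens of length >= 2."""
    return TOKEN_PATTERN.findall(text.lower())

# Alphabetically sorted vocabulary, one column per word.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})
matrix = [[tokenize(doc).count(word) for word in vocabulary] for doc in corpus]

print(vocabulary)
for row in matrix:
    print(row)
```

To keep single-character tokens in scikit-learn itself, the `token_pattern` parameter of `CountVectorizer` can be changed.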
