# Bag of Words (BoW)
- The Bag of Words (BoW) model is a fundamental method used in Natural Language Processing (NLP) to convert text data into numerical representations.
- This approach discards grammar and word order and keeps only which words occur in a document and, optionally, how often each one occurs.
- It is commonly used as a simple feature representation for text classification, sentiment analysis, and information retrieval tasks.
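
To illustrate what discarding word order means in practice, here is a minimal sketch using only the standard library. The helper name `bag_of_words` and the example sentences are assumptions made for illustration; they are not part of the notebook below.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


first = bag_of_words("dog bites man")
second = bag_of_words("man bites dog")

# The two sentences mean different things, but their bags are identical
# because only the counts are kept, not the positions.
print(first == second)  # True
```

This is the main trade-off of the model: two texts with the same words in a different order become indistinguishable.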

## Key Concepts of Bag of Words
- **Vocabulary:** The set of all unique words in the corpus.
- **Vector Representation:** Each document is represented as a vector of word frequencies or binary indicators of word presence.
- **Simplicity:** The model is easy to understand and implement, making it a good starting point for text analysis.
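
The difference between a count vector and a binary presence vector can be sketched as follows. The three-word vocabulary and the sample text are hypothetical and chosen only to keep the vectors short.

```python
from collections import Counter

# Hypothetical fixed vocabulary; each position in a vector maps to one word.
vocabulary = ["cats", "chase", "dogs"]
tokens = "dogs chase dogs".split()

counts = Counter(tokens)

# Count vector: how many times each vocabulary word occurs.
count_vector = [counts[word] for word in vocabulary]

# Binary vector: 1 if the word occurs at all, otherwise 0.
binary_vector = [int(counts[word] > 0) for word in vocabulary]

print(count_vector)   # [0, 1, 2]
print(binary_vector)  # [0, 1, 1]
```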

## Steps to Create a Bag of Words Model
- **Tokenization:** Splitting the text into words or tokens.
- **Building the Vocabulary:** Creating a list of all unique tokens in the corpus.
- **Vectorization:** Representing each document as a vector of word counts or binary values.
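
The three steps above can be sketched from scratch with the standard library. This is one possible implementation under simple assumptions (lowercasing, tokens made of ASCII letters only); the function names and sample sentences are illustrative and do not come from the notebook.

```python
import re
from collections import Counter


def tokenize(text):
    """Step 1: split a text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(tokenized_docs):
    """Step 2: collect the sorted set of unique tokens in the corpus."""
    return sorted({token for doc in tokenized_docs for token in doc})


def vectorize(tokens, vocabulary):
    """Step 3: count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


docs = ["the cat sat", "the dog sat on the cat"]
tokenized = [tokenize(doc) for doc in docs]
vocab = build_vocabulary(tokenized)
vectors = [vectorize(tokens, vocab) for tokens in tokenized]

print(vocab)    # ['cat', 'dog', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```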

# Implementation

In [2]:
#Step 1: Tokenization

In [1]:
import nltk
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "Cats are beautiful animals.",
    "Dogs are loyal and friendly animals.",
    "Cats and dogs are popular pets."
]

# Tokenize the documents (further preprocessing such as stemming or stop-word removal could be added here).
# word_tokenize requires the 'punkt' tokenizer data.
nltk.download('punkt')
tokenized_docs = [nltk.word_tokenize(doc.lower()) for doc in documents]

# Flatten the token lists and remove duplicates to create a vocabulary.
# Punctuation is not filtered out, so '.' is kept as a token.
vocabulary = sorted({word for doc in tokenized_docs for word in doc})

print("Vocabulary:", vocabulary)


Vocabulary: ['.', 'and', 'animals', 'are', 'beautiful', 'cats', 'dogs', 'friendly', 'loyal', 'pets', 'popular']




In [3]:
#Step 2: Building the Vocabulary and Vectorization

from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents to create the bag of words model
X = vectorizer.fit_transform(documents)

# Convert the sparse matrix to a dense array for better readability
bow_array = X.toarray()

print("Bag of Words Model:\n", bow_array)
print("Feature Names:\n", vectorizer.get_feature_names_out())


Bag of Words Model:
 [[0 1 1 1 1 0 0 0 0 0]
 [1 1 1 0 0 1 1 1 0 0]
 [1 0 1 0 1 1 0 0 1 1]]
Feature Names:
 ['and' 'animals' 'are' 'beautiful' 'cats' 'dogs' 'friendly' 'loyal' 'pets'
 'popular']
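
The feature names contain 10 entries, while the NLTK-based vocabulary from Step 1 has 11 because it includes `'.'`. The likely reason is that `CountVectorizer`, with default settings, lowercases the text and keeps only tokens of two or more word characters, which drops punctuation. The sketch below approximates that behavior with the standard library and rebuilds the same matrix. The regular expression is an approximation of the default token pattern, not a copy of scikit-learn's code.

```python
import re
from collections import Counter

documents = [
    "Cats are beautiful animals.",
    "Dogs are loyal and friendly animals.",
    "Cats and dogs are popular pets.",
]

# Assumed approximation of the default tokenization:
# runs of at least two word characters, after lowercasing.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


tokenized = [tokenize(doc) for doc in documents]
features = sorted({token for doc in tokenized for token in doc})

# One row per document, one column per feature, values are counts.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[feature] for feature in features])

print(features)
for row in matrix:
    print(row)
```

Each column of the matrix lines up with the feature name at the same index, so the first row reads as one occurrence each of `animals`, `are`, `beautiful`, and `cats`.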
