
The Bag of Words (BoW) model is a fundamental technique in natural language processing (NLP) for representing text data. It describes each document by the counts of the words it contains, ignoring grammar and word order, which converts text into numerical features that machine learning algorithms can use.

**How Bag of Words Works**

**Text Preprocessing:**

*   Tokenization: Split the text into individual words or tokens.
*   Normalization: Convert all words to lowercase to ensure uniformity.
*   Stopword Removal: Optionally remove common words (e.g., "and", "the") that may not carry significant meaning.
*   Stemming/Lemmatization: Optionally reduce words to their base or root form.
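
The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not what `CountVectorizer` does internally: the `preprocess` helper and the three-word `STOPWORDS` set are hypothetical names chosen for this example, and stemming/lemmatization is left out because it normally relies on an external library.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"and", "the", "is"}

def preprocess(text, remove_stopwords=False):
    """Lowercase the text, split it into word tokens, and optionally drop stopwords."""
    # Normalization: lowercase first so "This" and "this" become the same token.
    lowered = text.lower()
    # Tokenization: keep runs of word characters; punctuation is discarded.
    tokens = re.findall(r"\w+", lowered)
    # Stopword removal (optional): filter out tokens found in STOPWORDS.
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

tokens_all = preprocess("This is the first document.")
tokens_filtered = preprocess("This is the first document.", remove_stopwords=True)
print(tokens_all)       # ['this', 'is', 'the', 'first', 'document']
print(tokens_filtered)  # ['this', 'first', 'document']
```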

**Vocabulary Creation:**
Create a vocabulary, which is a list of all unique words present in the corpus (the collection of documents).
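
As a rough sketch of vocabulary creation, the snippet below collects the unique tokens from a two-document corpus and sorts them alphabetically. The `tokenize` helper and the `word_to_index` mapping are illustrative names, and the alphabetical ordering is an assumption made here so that every word gets a stable position.

```python
import re

corpus = [
    "This is the first document.",
    "This document is the second document.",
]

def tokenize(text):
    """Lowercase the text and return its word tokens (hypothetical helper)."""
    return re.findall(r"\w+", text.lower())

# A set keeps each word once; sorting gives every word a fixed position.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

# Map each word to its index in the vocabulary.
word_to_index = {word: i for i, word in enumerate(vocabulary)}

print(vocabulary)  # ['document', 'first', 'is', 'second', 'the', 'this']
```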

**Vectorization:**

*   For each document, create a vector of the same length as the vocabulary.
*   Each position in the vector corresponds to a word in the vocabulary.
*   The value at each position is the count (frequency) of that word in the document.
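
To make the vectorization step concrete, here is a minimal sketch that counts tokens with `collections.Counter` and reads the counts off in vocabulary order. The `vectorize` function, the small vocabulary, and the token list are assumptions made for this example only.

```python
from collections import Counter

# Example vocabulary and one tokenized document (illustrative values).
vocabulary = ["document", "first", "is", "second", "the", "this"]
tokens = ["this", "document", "is", "the", "second", "document"]

def vectorize(tokens, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    # Counter returns 0 for words that do not occur in the document.
    return [counts[word] for word in vocabulary]

vector = vectorize(tokens, vocabulary)
print(vector)  # [2, 0, 1, 1, 1, 1]
```

The vector has one entry per vocabulary word, so every document maps to a vector of the same length regardless of how long the document is.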

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

In [4]:
# Create the CountVectorizer instance
vectorizer = CountVectorizer()

In [5]:
# Fit the vectorizer on the documents and transform them into a sparse count matrix
bow_matrix = vectorizer.fit_transform(documents)

In [6]:
# Convert the sparse matrix to a dense array for easier viewing
bow_array = bow_matrix.toarray()

In [7]:
# Get the feature names (the vocabulary, in column order)
feature_names = vectorizer.get_feature_names_out()

In [8]:
print('Vocabulary:', feature_names)
print('Bag of Words Matrix:')
print(bow_array)

Vocabulary: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
Bag of Words Matrix:
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
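
One way to read a row of this matrix is to pair each count with its vocabulary word. The sketch below hardcodes the vocabulary and the second row from the output above, so it runs without scikit-learn; the variable names are illustrative.

```python
# Vocabulary and second row copied from the printed output.
vocabulary = ["and", "document", "first", "is", "one", "second", "the", "third", "this"]
row = [0, 2, 0, 1, 0, 1, 1, 0, 1]  # "This document is the second document."

# Keep only the words that actually occur in the document.
nonzero_counts = {word: count for word, count in zip(vocabulary, row) if count > 0}
print(nonzero_counts)
# {'document': 2, 'is': 1, 'second': 1, 'the': 1, 'this': 1}
```

Note that the first and fourth rows of the matrix are identical even though the sentences differ ("This is the first document." versus "Is this the first document?"), because Bag of Words discards word order and, with this tokenization, punctuation.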
