# 1. Introduction
## 1.1 Definition
**Bag of Words (BoW)** is a simple and widely used text representation technique in **natural language processing (NLP)** and information retrieval. It transforms text into numerical features, which can then be used in machine learning models.

**BoW** is foundational in text processing and serves as a stepping stone to more advanced techniques like
* TF-IDF,
* word embeddings, and
* transformers.
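
To make the idea concrete, the following is a minimal sketch of a Bag of Words built by hand with the standard library only. The two toy documents and the names `docs`, `vocabulary`, and `bow_matrix` are illustrative assumptions and are not part of the dataset used later in this notebook.

```python
from collections import Counter

# Two toy documents (hypothetical, not from the dataset used later).
docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: every unique word across all documents, sorted for a stable column order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, each cell a raw count.
bow_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    bow_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)   # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(bow_matrix)   # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Word order is discarded: each document is described only by how often each vocabulary word occurs in it.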

# 2. Import libraries

In [1]:
# NLTK provides tools for working with human language data (text), such as tokenization, part-of-speech tagging, and more.
import nltk

# A specific stemming algorithm from 'nltk' that reduces words to their base or root form.
from nltk.stem import PorterStemmer

# WordNetLemmatizer class from the nltk.stem module, which is responsible for lemmatizing words.
from nltk.stem import WordNetLemmatizer

# Regular expression operations from the 're' module, to check if a string contains the specified search pattern.
import re

# nltk.download('all')

# 3. Load the necessary 'NLTK' resources

In [2]:
# Punkt tokenizer models required for word tokenization.
nltk.download('punkt')

# WordNet lexical database, provides the data needed for lemmatization.
nltk.download('wordnet')

# POS tagging model, used to tag words with their parts of speech.
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## 3.1 What Are Stopwords?
* **Stopwords** are common words in a language that are often filtered out during text processing because they carry little meaningful information. Examples include **"the," "is," "in," "and," etc.**
* These words are usually removed in tasks like ***text analysis, information retrieval, and machine learning*** to improve the performance of algorithms by focusing on more relevant words.
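
As a rough illustration of stopword filtering, the sketch below uses a tiny hand-picked stopword set. The set `sample_stopwords` is an assumption made for this example; it is only a stand-in for the much larger English list that NLTK provides.

```python
# A tiny hand-picked stopword set (assumption: a stand-in for the NLTK English list).
sample_stopwords = {"the", "is", "in", "and", "of", "we", "that"}

sentence = "we respect the freedom of others"

# Keep only the words that are not in the stopword set.
filtered = [word for word in sentence.split() if word not in sample_stopwords]

print(filtered)  # ['respect', 'freedom', 'others']
```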

In [3]:
# import re
# The 'stopwords' module from 'nltk' provides a list of common stopwords for various languages.
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# 4. Load the dataset
Load the dataset, either from a URL or by pasting the entire paragraph directly into a string as done here, for further analysis.

In [4]:
paragraph="""I have three visions for India. In 3000 years of our history, people from all over the world have come and invaded us, captured our lands, conquered our minds.
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British, the French, the Dutch, all of them came and looted us, took over what was ours.
               Yet we have not done this to any other nation. We have not conquered anyone.
               We have not grabbed their land, their culture, their history and tried to enforce our way of life on them.
               Why? Because we respect the freedom of others.That is why my first vision is that of freedom. I believe that India got its first vision of this in 1857, when we started the War of Independence. It is this freedom that we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India stands up to the world, no one will respect us. Only strength respects strength. We must be strong not only as a military power but also as an economic power. Both must go hand-in-hand.
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life.
               I see four milestones in my career."""

# 5. Data Pre-processing
## 5.1 Tokenize the entire **paragraph**
* In data pre-processing, the first step is to clean and prepare raw text data for analysis.
* **Tokenization**, a fundamental technique in **Natural Language Processing (NLP)**, involves breaking down a paragraph into
  * individual words,
  * phrases, or
  * sentences,

  making the data easier to analyze and process.

This section will guide you through tokenizing a given paragraph to facilitate further text analysis.
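
Before turning to NLTK, here is a deliberately naive sketch of sentence and word tokenization using only regular expressions. The splitting rules below are simplifying assumptions and are not how `nltk.sent_tokenize` works; a rule this simple would, for example, also split after abbreviations such as "Dr.".

```python
import re

text = "We have not conquered anyone. Why? Because we respect the freedom of others."

# Naive sentence split: break after '.', '?' or '!' when followed by whitespace.
naive_sentences = re.split(r"(?<=[.?!])\s+", text)

# Naive word split: keep runs of letters only, which drops punctuation.
naive_words = re.findall(r"[A-Za-z]+", naive_sentences[0])

print(naive_sentences)
# ['We have not conquered anyone.', 'Why?', 'Because we respect the freedom of others.']
print(naive_words)
# ['We', 'have', 'not', 'conquered', 'anyone']
```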



In [5]:
# Tokenize (split) the given paragraph into a list of sentences.
sent_list = nltk.sent_tokenize(paragraph)
corpus = []
sent_list

['I have three visions for India.',
 'In 3000 years of our history, people from all over the world have come and invaded us, captured our lands, conquered our minds.',
 'From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British, the French, the Dutch, all of them came and looted us, took over what was ours.',
 'Yet we have not done this to any other nation.',
 'We have not conquered anyone.',
 'We have not grabbed their land, their culture, their history and tried to enforce our way of life on them.',
 'Why?',
 'Because we respect the freedom of others.That is why my first vision is that of freedom.',
 'I believe that India got its first vision of this in 1857, when we started the War of Independence.',
 'It is this freedom that we must protect and nurture and build on.',
 'If we are not free, no one will respect us.',
 'My second vision for India’s development.',
 'For fifty years we have been a developing nation.',
 'It is time we see ourselves as a devel

## 5.2 Initialize the 'Stemmer' and 'Lemmatizer' classes
#### Create **stemmer** and **lemmatizer** objects

In [6]:
# ps: an instance of the PorterStemmer class
ps = PorterStemmer()

# Initialize the WordNet Lemmatizer from the NLTK library to reduce words to their base form
wnl = WordNetLemmatizer()
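
The short sketch below shows what the Porter stemmer does to a few words from the paragraph. It uses a separate, illustrative variable name (`stemmer`) and covers only stemming, since `PorterStemmer` needs no downloaded data. Stemming strips suffixes by rule, so the result is not always a dictionary word, whereas a lemmatizer looks words up in WordNet and returns dictionary forms.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the paragraph used in this notebook.
words = ["history", "people", "visions", "invaded"]

# Apply the rule-based Porter stemmer to each word.
stems = [stemmer.stem(word) for word in words]

print(stems)  # ['histori', 'peopl', 'vision', 'invad']
```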

# 6. Processing of dataset
Each sentence is cleaned and added to the corpus in the following steps:
* removing unwanted characters,
* converting to lowercase,
* splitting into words,
* removing stopwords,
* stemming, and then
* reconstructing the cleaned sentence and appending it to the corpus.

In [7]:
# Build the stopword set once; membership checks on a set are faster than on a list.
stop_words = set(stopwords.words('english'))

# Iterate over each sentence in 'sent_list', one at a time.
for sentence in sent_list:

  # Replace everything except upper- and lower-case letters (commas, dots, digits, etc.) with a space.
  review = re.sub('[^a-zA-Z]', ' ', sentence)

  # Lowercase the text so that words compare equal regardless of case.
  review = review.lower()

  # Split on whitespace to get a list of words for the sentence.
  review = review.split()

  # Drop stopwords and stem each remaining word.
  review = [ps.stem(word) for word in review if word not in stop_words]

  # Join the stemmed words back into a single space-separated string.
  review = ' '.join(review)

  # Append the processed sentence (cleaned, lowercased, stemmed, without stopwords) to the corpus.
  corpus.append(review)
  print(review)

three vision india
year histori peopl world come invad us captur land conquer mind
alexand onward greek turk mogul portugues british french dutch came loot us took
yet done nation
conquer anyon
grab land cultur histori tri enforc way life

respect freedom other first vision freedom
believ india got first vision start war independ
freedom must protect nurtur build
free one respect us
second vision india develop
fifti year develop nation
time see develop nation
among top nation world term gdp
percent growth rate area
poverti level fall
achiev global recognis today
yet lack self confid see develop nation self reliant self assur
incorrect
third vision
india must stand world
believ unless india stand world one respect us
strength respect strength
must strong militari power also econom power
must go hand hand
good fortun work three great mind
dr vikram sarabhai dept
space professor satish dhawan succeed dr brahm prakash father nuclear materi
lucki work three close consid great opportun life


In [8]:
corpus

['three vision india',
 'year histori peopl world come invad us captur land conquer mind',
 'alexand onward greek turk mogul portugues british french dutch came loot us took',
 'yet done nation',
 'conquer anyon',
 'grab land cultur histori tri enforc way life',
 '',
 'respect freedom other first vision freedom',
 'believ india got first vision start war independ',
 'freedom must protect nurtur build',
 'free one respect us',
 'second vision india develop',
 'fifti year develop nation',
 'time see develop nation',
 'among top nation world term gdp',
 'percent growth rate area',
 'poverti level fall',
 'achiev global recognis today',
 'yet lack self confid see develop nation self reliant self assur',
 'incorrect',
 'third vision',
 'india must stand world',
 'believ unless india stand world one respect us',
 'strength respect strength',
 'must strong militari power also econom power',
 'must go hand hand',
 'good fortun work three great mind',
 'dr vikram sarabhai dept',
 'space profess

# 7. Implementation of the Bag of Words (BoW) model
* Import the `CountVectorizer` class from the `sklearn.feature_extraction.text` module; it transforms text data into a matrix of token counts.
* Create an instance of `CountVectorizer`. Setting `max_features` keeps only the most frequent tokens across the corpus, which caps the number of columns and so reduces dimensionality.
* `cv.fit_transform(corpus)` first learns the vocabulary from the corpus (*i.e., it identifies all unique words or tokens*) and then transforms the corpus into a numerical matrix where
  * each row corresponds to a document, and
  * each column corresponds to a token from the vocabulary.
* `.toarray()` converts the sparse matrix (*which efficiently stores the counts of words*) into a dense NumPy array.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer instance; max_features=1500 caps the vocabulary (columns, not rows) at the 1500 most frequent tokens
cv = CountVectorizer(max_features=1500)

# Fit the CountVectorizer to the corpus and transform it into a numerical array (Bag of Words)
X = cv.fit_transform(corpus).toarray()

# Display the resulting matrix
print(X)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 1 1 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [10]:
print("\nDisplay shape of matrix in row and columns : ", X.shape)
print("\nDisplay the size of matrix : ", X.size)

# Display the feature names (tokens)
print("\nDisplay the feature names (tokens) of matrix : \n", cv.get_feature_names_out())


Display shape of matrix in row and columns :  (31, 113)

Display the size of matrix :  3503

Display the feature names (tokens) of matrix : 
 ['achiev' 'alexand' 'also' 'among' 'anyon' 'area' 'assur' 'believ' 'brahm'
 'british' 'build' 'came' 'captur' 'career' 'close' 'come' 'confid'
 'conquer' 'consid' 'cultur' 'dept' 'develop' 'dhawan' 'done' 'dr' 'dutch'
 'econom' 'enforc' 'fall' 'father' 'fifti' 'first' 'fortun' 'four' 'free'
 'freedom' 'french' 'gdp' 'global' 'go' 'good' 'got' 'grab' 'great'
 'greek' 'growth' 'hand' 'histori' 'incorrect' 'independ' 'india' 'invad'
 'lack' 'land' 'level' 'life' 'loot' 'lucki' 'materi' 'mileston'
 'militari' 'mind' 'mogul' 'must' 'nation' 'nuclear' 'nurtur' 'one'
 'onward' 'opportun' 'other' 'peopl' 'percent' 'portugues' 'poverti'
 'power' 'prakash' 'professor' 'protect' 'rate' 'recognis' 'reliant'
 'respect' 'sarabhai' 'satish' 'second' 'see' 'self' 'space' 'stand'
 'start' 'strength' 'strong' 'succeed' 'term' 'third' 'three' 'time'
 'today' 'took

In [11]:
sentences = nltk.sent_tokenize(paragraph)
corpus = []

### **Lemmatizing** the entire paragraph

In [12]:
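# Build the stopword set once for fast membership checks.
stop_words = set(stopwords.words('english'))

for sentence in sentences:
  # Keep letters only, lowercase the text, and split it into words.
  review = re.sub('[^a-zA-Z]', ' ', sentence)
  review = review.lower()
  review = review.split()
  # Drop stopwords and lemmatize the rest (lemmatize() treats each word as a noun by default).
  review = [wnl.lemmatize(word) for word in review if word not in stop_words]
  review = ' '.join(review)
  corpus.append(review)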

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 1, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [14]:
print(X.shape) # rows x columns
print(X.size)

(31, 114)
3534
