In [1]:
! pip install nltk



# Bag of Words (BoW)

Bag-of-words (BoW) is a basic technique for representing text data in a numerical format that machine learning algorithms can work with. We
typically apply it after cleaning the text, when the data is ready for machine-learning model training. BoW treats a document as
an unordered collection of words and disregards grammar, word order, and context. It is therefore best suited to scenarios where the frequency of
individual words matters more than their context or sequence.

Calculating BoW:

Let’s consider a simple BoW calculation for a single document. Suppose we have the following document A: “I love to eat cakes. Cakes are delicious.”
To perform a BoW calculation:

We first tokenize the document, which means splitting it into individual words and dropping punctuation: [“I”, “love”, “to”, “eat”, “cakes”, “Cakes”, “are”, “delicious”].

Next, we lowercase the tokens so that “cakes” and “Cakes” count as the same word. This leaves seven unique words: [“i”, “love”, “to”, “eat”, “cakes”,
“are”, “delicious”]. We then create a vector in which each element is the number of times the corresponding unique word appears in the document.
BoW vector: [1, 1, 1, 1, 2, 1, 1]. In this case, the BoW vector shows that the word “i” appears once, “love” appears once, “to” appears once, “eat”
appears once, “cakes” appears twice, “are” appears once, and “delicious” appears once in the document. This representation captures the word
frequencies in the document while disregarding the order and structure of the text.
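
The calculation above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming that tokens are lowercased so that “cakes” and “Cakes” count as the same word and that punctuation is dropped. The variable names are ours and are not part of any library.

```python
import re
from collections import Counter

document_a = "I love to eat cakes. Cakes are delicious."

# Tokenize: lowercase the text and keep runs of letters (this drops punctuation).
tokens = re.findall(r"[a-z]+", document_a.lower())

# Vocabulary: the unique words, in order of first appearance.
vocabulary = list(dict.fromkeys(tokens))

# BoW vector: how often each vocabulary word occurs in the document.
counts = Counter(tokens)
bow_vector = [counts[word] for word in vocabulary]

print(vocabulary)  # ['i', 'love', 'to', 'eat', 'cakes', 'are', 'delicious']
print(bow_vector)  # [1, 1, 1, 1, 2, 1, 1]
```

The vector has one entry per unique word, so its length equals the vocabulary size rather than the number of tokens.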

Steps for applying BoW in text preprocessing:

![image.png](attachment:da6cdf18-fe1d-48b7-a889-7e5924028795.png)



In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Read the necessary dataset

df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv")
df.head()

Unnamed: 0,review_id,text
0,txt145,The software had a steep learning curve at fir...
1,txt327,I'm really impressed with the user interface o...
2,txt209,The latest update to the software fixed severa...
3,txt825,I encountered a few glitches while using the s...
4,txt878,I was skeptical about trying the software init...


In [5]:
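# Fit the vectorizer on the review text and build the document-term matrix
reviews = df['text']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)  # SciPy sparse matrix of word counts

# Label the columns with the learned vocabulary and show the counts as a table
feature_names = vectorizer.get_feature_names_out()
X_array = X.toarray()  # dense copy; acceptable for a small dataset
bow_df = pd.DataFrame(X_array, columns=feature_names)
print(bow_df)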

    about  active  address  advanced  after  analytics  and  appreciate  are  \
0       0       0        0         0      1          0    0           1    0   
1       0       0        0         0      0          0    1           0    0   
2       0       0        0         0      0          0    1           0    0   
3       0       0        0         0      0          0    0           0    0   
4       1       0        0         0      0          0    0           0    0   
5       0       0        0         0      0          1    0           0    0   
6       0       0        0         0      0          0    1           1    0   
7       0       0        0         1      0          0    1           0    0   
8       0       0        0         0      0          0    0           0    1   
9       0       0        0         0      0          0    0           0    0   
10      0       0        0         0      0          0    1           0    0   
11      0       0        1         0    

In [10]:
len(feature_names)

150

# Advantages and limitations

Advantages of BoW include:

Simplicity and efficiency: BoW is straightforward and computationally efficient, making it suitable for large text datasets.

Language agnostic: BoW needs no grammar or other linguistic knowledge beyond a way to split text into tokens, so we can apply it to many languages and to multilingual tasks.

Versatility in applications: We use this technique for various NLP tasks like text classification, sentiment analysis, and information retrieval.

On the other hand, limitations of BoW include:

Loss of word order: BoW disregards word order and sentence structure, which discards any meaning that depends on them. For example, consider the
phrases “hot coffee” vs. “coffee hot.” BoW gives both phrases the same vector, even though only “hot coffee” is a natural adjective-noun phrase
naming a beverage. Likewise, “the dog bit the man” and “the man bit the dog” receive identical vectors despite having opposite meanings.
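
A short check makes the loss of word order concrete. The helper `bow` below is a hypothetical function written for this illustration; it assumes simple whitespace tokenization and lowercasing.

```python
from collections import Counter

def bow(text):
    """Return word counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

# The two phrases contain the same words, so their BoW counts are equal.
same_phrase_counts = bow("hot coffee") == bow("coffee hot")

# An illustrative sentence pair with opposite meanings but the same words.
same_sentence_counts = bow("the dog bit the man") == bow("the man bit the dog")

print(same_phrase_counts)    # True
print(same_sentence_counts)  # True
```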

Semantic meaning: BoW can’t capture semantic relationships between words, which restricts its ability to represent context and meaning. For instance,
BoW treats “big” and “large” as separate and unrelated features, so a model trained on BoW vectors has no way of knowing that the two words are
near-synonyms that can often be used interchangeably.
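
One way to see this is to compare BoW vectors with cosine similarity. The sketch below uses a tiny hand-picked vocabulary and two hypothetical helpers, `bow_vector` and `cosine`, both written only for this illustration.

```python
import math

# A tiny, hand-picked vocabulary for the illustration.
vocabulary = ["big", "large", "house"]

def bow_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

def cosine(u, v):
    """Cosine similarity between two non-zero count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# "big" and "large" occupy different dimensions, so they share nothing.
synonym_similarity = cosine(bow_vector("big"), bow_vector("large"))

# Any similarity between the two phrases comes only from the shared word "house".
phrase_similarity = cosine(bow_vector("big house"), bow_vector("large house"))

print(synonym_similarity)
print(phrase_similarity)
```

The similarity between “big” and “large” is zero, exactly as it would be for two unrelated words.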

Equal weighting: BoW uses raw counts, so every occurrence of every word carries the same weight regardless of how informative or rare the word is.
For example, in a medical document, specialized terms like “diagnosis,” “treatment,” or “symptoms” might hold pivotal information. However, in
BoW, these terms receive no special treatment, and they can be outweighed by the much higher counts of common words like “the” or “and.”
One way to address this issue is the TF-IDF method for text representation, which down-weights words that appear in many documents.
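
The effect of TF-IDF weighting can be sketched by hand. The example below assumes the textbook formula tf × log(N / df), where N is the number of documents and df is the number of documents containing the word. The three-sentence corpus is made up for illustration. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant with normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Made-up corpus for illustration only.
corpus = [
    "the diagnosis and the treatment",
    "the symptoms and the diagnosis",
    "the weather and the news",
]
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each word.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Textbook TF-IDF: raw term count times log(N / document frequency)."""
    counts = Counter(tokens)
    return {
        word: count * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

weights = tf_idf(tokenized[0])
for word, weight in weights.items():
    print(f"{word}: {weight:.3f}")
```

Words that occur in every document, such as “the” and “and,” get a weight of zero, while “treatment,” which occurs in only one document, gets the largest weight.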

Generation of a large and sparse matrix: Each unique word in the corpus typically becomes its own feature/column in the matrix, so the representation
is high-dimensional, and because any single document uses only a small part of the vocabulary, most entries are zero. We can use
dimensionality reduction, a capped vocabulary, or sparse matrix representations to mitigate this challenge.
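
The sketch below measures sparsity on a small made-up corpus and shows one simple way to limit the number of columns. It relies on `CountVectorizer.fit_transform` returning a SciPy sparse matrix, whose `nnz` attribute gives the number of stored non-zero entries, and on the `max_features` parameter, which keeps only the most frequent terms. The corpus and the cap of 5 are arbitrary choices for illustration.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus for illustration only.
corpus = [
    "the software is easy to use",
    "the update fixed several bugs",
    "customer support was slow to respond",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # stays sparse unless we call toarray()

n_rows, n_cols = X.shape
density = X.nnz / (n_rows * n_cols)  # fraction of non-zero entries
print(issparse(X), X.shape, X.nnz, round(density, 3))

# Cap the vocabulary at the 5 most frequent terms to reduce dimensionality.
capped_vectorizer = CountVectorizer(max_features=5)
X_capped = capped_vectorizer.fit_transform(corpus)
print(X_capped.shape)
```

Even with three short sentences, well over half of the entries are zero. On a real corpus the share of zeros is far higher, which is why it usually pays to keep the matrix in sparse form instead of converting it with `toarray()`.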

