### Bag of Words (BoW)

Bag of Words (BoW) is a simple and widely used technique for turning text into a numerical representation. It treats each document as a "bag" of words or tokens and ignores word order and grammatical structure. In other words, a document is represented solely by how often each word occurs in it.

!["bow"](../images/2/2-bow.png)
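To make the idea concrete, here is a minimal sketch in plain Python, without scikit-learn. The toy sentences and the helper name `bow_vector` are illustrative assumptions, not part of the notebook, and tokenization is a naive whitespace split.

```python
from collections import Counter


def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in `text` (naive whitespace split)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Two sentences with the same words in a different order (hypothetical examples)
sentence_a = "dog bites man"
sentence_b = "man bites dog"

# The vocabulary is the sorted set of all words seen in either sentence
vocabulary = sorted(set(sentence_a.split()) | set(sentence_b.split()))

vector_a = bow_vector(sentence_a, vocabulary)
vector_b = bow_vector(sentence_b, vocabulary)

print(vocabulary)
print(vector_a, vector_b)
print(vector_a == vector_b)
```

Both sentences map to the same vector even though they mean different things, which is exactly the word-order information that BoW discards.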


---


In [19]:
from sklearn.feature_extraction.text import CountVectorizer

documents_example = [
    "Cats are very cute and dogs are very playful.",
    "Cats are not only cute but also clever.",
]

vectorizer = CountVectorizer()

# Learn the vocabulary and convert each document into a vector of word counts
X = vectorizer.fit_transform(documents_example)

# Display the BoW representation and the corresponding words
print("Feature Names (Words): \n", vectorizer.get_feature_names_out())
print("BoW Array: \n", X.toarray())

Feature Names (Words): 
 ['also' 'and' 'are' 'but' 'cats' 'clever' 'cute' 'dogs' 'not' 'only'
 'playful' 'very']
BoW Array: 
 [[0 1 2 0 1 0 1 1 0 0 1 2]
 [1 0 1 1 1 1 1 0 1 1 0 0]]
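The output can be read column by column: each column corresponds to one word of the alphabetically sorted vocabulary, and each row holds the counts for one document. As a rough sketch of what happens under the hood, the snippet below rebuilds the same matrix with the standard library only. It assumes `CountVectorizer`'s default settings (lowercasing and a token pattern that keeps words of two or more characters); the helper `tokenize` is a hypothetical name and is not how scikit-learn is actually implemented.

```python
import re
from collections import Counter

# Assumption: mirrors CountVectorizer's default token pattern (words of 2+ characters)
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

documents_example = [
    "Cats are very cute and dogs are very playful.",
    "Cats are not only cute but also clever.",
]


def tokenize(text):
    """Lowercase the text and extract word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


tokenized = [tokenize(doc) for doc in documents_example]

# Sorted vocabulary: one column per distinct word
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one count per vocabulary word
bow = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in bow:
    print(row)
```

For example, "are" and "very" each occur twice in the first sentence, which is why the first row contains two 2s.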


---


#### Real-Life Application of BoW Using the IMDB Dataset

- The dataset link &rarr; [IMDB_Dataset.csv](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)


In [20]:
import re
import string
from collections import Counter

import nltk
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# Stopwords
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))


# Getting the data
df = pd.read_csv("../data/IMDB_Dataset.csv")
documents = df["review"]
labels = df["sentiment"]


# Text Cleaning
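def clean_text(text):
    """Lowercase a review and strip punctuation, digits, stopwords and short words."""
    text = text.lower()  # Lowercase

    # Delete ASCII punctuation. Characters are removed, not replaced by a space,
    # so "zombie<br />" becomes "zombiebr" and "above-average" becomes "aboveaverage".
    text = text.translate(str.maketrans("", "", string.punctuation))

    text = re.sub(r"\d+", "", text)  # Numbers
    text = re.sub(r"[^\w\s]", "", text)  # Remaining non-word characters (e.g. non-ASCII punctuation)

    # Drop stopwords and words of one or two characters
    text = " ".join(
        [word for word in text.split() if word not in stop_words and len(word) > 2]
    )

    return text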

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\iscie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
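Because `clean_text` deletes punctuation instead of replacing it with a space, HTML line breaks and hyphenated words in the reviews get fused into neighbouring words. The sketch below shows one possible variant that removes HTML tags first and turns punctuation and digits into spaces. The function names, the sample sentence and the tiny stopword set are assumptions made for illustration (the notebook uses NLTK's English stopword list), and the outputs further down were produced with the original `clean_text`, not with this variant.

```python
import re
import string

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's English list
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "it"}

# Maps every ASCII punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text_original(text, stop_words=STOP_WORDS):
    """Same steps as the notebook's clean_text: punctuation and digits are deleted."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", "", text)
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(w for w in text.split() if w not in stop_words and len(w) > 2)


def clean_text_spaced(text, stop_words=STOP_WORDS):
    """Variant: strip HTML tags, then replace punctuation and digits with spaces."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)  # HTML tags such as <br />
    text = text.translate(PUNCT_TO_SPACE)  # Punctuation -> space
    text = re.sub(r"\d+", " ", text)  # Numbers -> space
    text = re.sub(r"[^\w\s]", " ", text)  # Remaining non-word characters -> space
    return " ".join(w for w in text.split() if w not in stop_words and len(w) > 2)


sample = "The zombie<br /><br />movie was above-average, a 10/10!"
print(clean_text_original(sample))
print(clean_text_spaced(sample))
```

On the sample sentence the original steps yield the fused tokens `zombiebr` and `aboveaverage`, while the variant keeps `zombie`, `movie`, `above` and `average` as separate words.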


In [21]:
# Clean texts
cleaned_documents = [clean_text(doc) for doc in documents]

In [22]:
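# BoW
vectorizer = CountVectorizer()

# Text -> Vector (only the first 100 cleaned reviews, to keep the dense array small)
X = vectorizer.fit_transform(cleaned_documents[:100])

# Vocabulary learned from those 100 reviews (one entry per column of X)
feature_names = vectorizer.get_feature_names_out()

# Vector Representation (one row per review)
print("Vector Representation:")
print(X.toarray())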

Vector Representation:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [23]:
# DataFrame of Vector Representation
df_bow = pd.DataFrame(X.toarray(), columns=feature_names)
print(df_bow.head())

   abbot  abbreviated  abetted  abiding  ability  able  aboveaverage  abraham  \
0      0            0        0        0        0     0             0        0   
1      0            0        0        0        0     0             0        0   
2      0            0        0        0        0     0             0        0   
3      0            0        0        0        0     0             0        0   
4      0            0        0        0        0     0             0        0   

   abrahams  absolute  ...  zany  zellweger  zerog  zeus  zombie  zombiebr  \
0         0         0  ...     0          0      0     0       0         0   
1         0         0  ...     0          0      0     0       0         0   
2         0         0  ...     0          0      0     0       0         0   
3         0         0  ...     0          0      0     0       1         1   
4         0         0  ...     0          0      0     0       0         0   

   zone  zoo  zooms  zwick  
0     0    0   

In [24]:
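# Frequency of Words: sum each column of X to get a word's total count
# across the 100 reviews; .A1 flattens the 1 x n result into a 1-D array
word_counts = X.sum(axis=0).A1
word_frequency = dict(zip(feature_names, word_counts))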

In [25]:
# The 5 most common words
most_common = Counter(word_frequency).most_common(5)
print(most_common)

[('movie', np.int64(169)), ('film', np.int64(127)), ('one', np.int64(100)), ('like', np.int64(80)), ('even', np.int64(58))]
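The frequency step is just a column sum over the BoW matrix followed by a sort. The following standard-library sketch walks through the same steps on a toy matrix; the three short documents are made up for illustration and have nothing to do with the IMDB results above.

```python
from collections import Counter

# Hypothetical, already-cleaned toy documents
toy_documents = ["movie film movie", "film movie one", "movie like"]
tokenized = [doc.split() for doc in toy_documents]

# Vocabulary and BoW matrix (one row per document, one column per word)
vocabulary = sorted({token for tokens in tokenized for token in tokens})
bow = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

# Column sums: total count of each word across all documents
column_sums = [sum(column) for column in zip(*bow)]
word_frequency = dict(zip(vocabulary, column_sums))

# Most frequent words first
most_common = Counter(word_frequency).most_common(2)
print(word_frequency)
print(most_common)
```

The counts here are plain Python integers; in the notebook they are printed as `np.int64(...)` because the sums come from a NumPy array.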
