### Bag of Words (BoW)

Bag of Words (BoW) is a simple, widely used technique for turning text into a numerical representation. It treats each document as a "bag" of words or tokens and ignores both word order and grammatical structure. In other words, a document is represented solely by how often each vocabulary word occurs in it.
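
To illustrate what ignoring word order means, here is a minimal sketch that uses only the standard library. The helper name `bag_of_words` and the two sentences are made up for illustration, and splitting on whitespace is a simplifying assumption rather than how any particular library tokenizes.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for a text, ignoring word order."""
    return Counter(text.lower().split())


# Same words, different order, different meaning
bag_a = bag_of_words("the dog chased the cat")
bag_b = bag_of_words("the cat chased the dog")

print(bag_a)           # per-word counts; 'the' is counted twice
print(bag_a == bag_b)  # True: both sentences produce the same bag
```

Both sentences map to the same counts even though they describe opposite events, which is exactly the information a BoW representation discards.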


---


In [8]:
from sklearn.feature_extraction.text import CountVectorizer

documents_example = [
    "Cats are very cute and dogs are very playful.",
    "Cats are not only cute but also clever.",
]

vectorizer = CountVectorizer()

# Text -> vector: learn the vocabulary and count each word per document
X = vectorizer.fit_transform(documents_example)

# Display the BoW representation and the corresponding words
print("Feature Names (Words): \n", vectorizer.get_feature_names_out())
print("BoW Array: \n", X.toarray())

Feature Names (Words): 
 ['also' 'and' 'are' 'but' 'cats' 'clever' 'cute' 'dogs' 'not' 'only'
 'playful' 'very']
BoW Array: 
 [[0 1 2 0 1 0 1 1 0 0 1 2]
 [1 0 1 1 1 1 1 0 1 1 0 0]]
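
To make the counting explicit, the sketch below rebuilds the same vocabulary and count matrix by hand. It assumes the default `CountVectorizer` behavior of lowercasing the text and keeping tokens of two or more word characters; the names `tokenize`, `counts_per_doc`, `vocabulary`, and `bow_rows` are illustrative only.

```python
import re
from collections import Counter

documents_example = [
    "Cats are very cute and dogs are very playful.",
    "Cats are not only cute but also clever.",
]


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters,
    # which approximates CountVectorizer's default tokenization.
    return re.findall(r"\b\w\w+\b", text.lower())


# One Counter of token frequencies per document
counts_per_doc = [Counter(tokenize(doc)) for doc in documents_example]

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted(set().union(*counts_per_doc))

# One row per document, one column per vocabulary word
bow_rows = [[counts[word] for word in vocabulary] for counts in counts_per_doc]

print(vocabulary)
for row in bow_rows:
    print(row)
```

Each column corresponds to one vocabulary word in alphabetical order, so the `2` in the third column of the first row is the count of "are" in the first sentence.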


---


#### Real-Life Application of BoW Using the IMDB Dataset

- Dataset link → [IMDB_Dataset.csv](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)


In [None]:
import re
import string
from collections import Counter

import nltk
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# Stopwords
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))


# Getting the data
df = pd.read_csv("../data/IMDB_Dataset.csv")
documents = df["review"]
labels = df["sentiment"]


# Text cleaning
def clean_text(text):
    text = text.lower()  # Lowercase

    text = text.translate(str.maketrans("", "", string.punctuation))  # Remove ASCII punctuation

    text = re.sub(r"\d+", "", text)  # Remove digits
    text = re.sub(r"[^\w\s]", "", text)  # Remove any remaining non-word symbols

    text = " ".join(
        [word for word in text.split() if word not in stop_words and len(word) > 2]
    )  # Drop stopwords and words of two characters or fewer

    return text

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\iscie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
# Clean texts
cleaned_documents = [clean_text(doc) for doc in documents]
print("\n".join(cleaned_documents[:2]))

one reviewers mentioned watching episode youll hooked right exactly happened mebr first thing struck brutality unflinching scenes violence set right word trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use wordbr called nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda city home manyaryans muslims gangstas latinos christians italians irish moreso scuffles death stares dodgy dealings shady agreements never far awaybr would say main appeal show due fact goes shows wouldnt dare forget pretty pictures painted mainstream audiences forget charm forget romanceoz doesnt mess around first episode ever saw struck nasty surreal couldnt say ready watched developed taste got accustomed high levels graphic violence violence injustice crooked guards wholl sold nickel inmates wholl kill order get away well mannered middle class inmates turned 
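
Tokens such as `mebr`, `wordbr`, and `awaybr` in the output above most likely come from `<br />` line-break tags in the raw reviews: stripping punctuation removes `<`, `/`, and `>` but leaves `br` glued to the preceding word. A possible variant of the cleaning step, sketched under that assumption, replaces HTML tags with a space first. The function name `clean_text_html` and the small `DEMO_STOP_WORDS` set are hypothetical stand-ins so the example runs without NLTK.

```python
import re
import string

# Hypothetical stand-in for the NLTK stopword set used in the notebook
DEMO_STOP_WORDS = {"the", "was", "and", "it"}


def clean_text_html(text, stop_words=DEMO_STOP_WORDS):
    text = text.lower()
    # Replace HTML tags such as <br /> with a space before stripping punctuation
    text = re.sub(r"<[^>]+>", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", "", text)      # remove digits
    text = re.sub(r"[^\w\s]", "", text)  # remove remaining non-word symbols
    return " ".join(
        word for word in text.split() if word not in stop_words and len(word) > 2
    )


sample = "It happened with me.<br /><br />The first thing that struck me was the brutality."
print(clean_text_html(sample))
```

With the tag replaced by a space, "me." and "The" stay separate words, so no `mebr` token is produced. Passing the real NLTK set as `stop_words` would filter more words than this toy set does.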

In [11]:
# BoW
vectorizer = CountVectorizer()

# Text -> vector (first 100 reviews only)
X = vectorizer.fit_transform(cleaned_documents[:100])

# Vocabulary: one feature name per column of X
feature_names = vectorizer.get_feature_names_out()

# Vector Representation
print("Vector Representation:")
print(X.toarray()[:100])

Vector Representation:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
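
The printed corners are all zeros because each review uses only a small share of the vocabulary, so most entries of a BoW matrix are zero. As a rough sketch of how to quantify this, the hypothetical helper `density` below computes the fraction of non-zero entries for a toy count matrix stored as plain lists.

```python
def density(matrix):
    """Fraction of non-zero entries in a rectangular list-of-lists matrix."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total if total else 0.0


# Toy count matrix: 3 documents, 6 vocabulary words
toy_counts = [
    [1, 0, 0, 2, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
]

print(density(toy_counts))  # 5 non-zero entries out of 18
```

For the SciPy sparse matrix returned by `fit_transform`, the same ratio should be available without densifying as `X.nnz / (X.shape[0] * X.shape[1])`.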


In [12]:
# DataFrame of Vector Representation
df_bow = pd.DataFrame(X.toarray(), columns=feature_names)
print(df_bow[:10])

   abbot  abbreviated  abetted  abiding  ability  able  aboveaverage  abraham  \
0      0            0        0        0        0     0             0        0   
1      0            0        0        0        0     0             0        0   
2      0            0        0        0        0     0             0        0   
3      0            0        0        0        0     0             0        0   
4      0            0        0        0        0     0             0        0   
5      0            0        0        0        0     0             0        0   
6      0            0        0        0        0     0             0        0   
7      0            0        0        0        0     0             0        0   
8      0            0        0        0        0     0             0        0   
9      0            0        0        0        0     0             0        0   

   abrahams  absolute  ...  zany  zellweger  zerog  zeus  zombie  zombiebr  \
0         0         0  ...    

In [18]:
# Total count of each word across all vectorized documents
word_counts = X.sum(axis=0).A1
word_frequency = dict(zip(feature_names, word_counts))
print(word_frequency)

{'abbot': 1, 'abbreviated': 1, 'abetted': 2, 'abiding': 2, 'ability': 1, 'able': 2, 'aboveaverage': 1, 'abraham': 1, 'abrahams': 1, 'absolute': 1, 'absolutely': 7, 'absorb': 1, 'abstracted': 1, 'absurd': 2, 'abusive': 1, 'academy': 1, 'accent': 2, 'accented': 1, 'accents': 1, 'accentuate': 1, 'accepted': 3, 'accepting': 1, 'accepts': 1, 'accident': 2, 'accidentally': 1, 'accomplish': 1, 'accomplished': 1, 'accomplishes': 1, 'according': 2, 'account': 1, 'accountant': 1, 'accounts': 1, 'accurate': 3, 'accuses': 1, 'accustomed': 1, 'achieve': 1, 'achieved': 1, 'achievement': 1, 'achingly': 1, 'acolyte': 1, 'acquired': 1, 'across': 2, 'acroyd': 1, 'act': 7, 'acting': 27, 'action': 6, 'actions': 1, 'activate': 1, 'actor': 9, 'actors': 19, 'actress': 6, 'actresses': 2, 'acts': 1, 'actual': 4, 'actually': 16, 'ada': 4, 'adabr': 1, 'adams': 1, 'adapted': 1, 'adas': 2, 'add': 1, 'added': 2, 'addict': 1, 'addiction': 1, 'adding': 3, 'addition': 2, 'address': 1, 'adequate': 1, 'adjusterbr': 1, '

In [14]:
# Five most common words
most_common = Counter(word_frequency).most_common(5)
print(most_common)

[('movie', 169), ('film', 127), ('one', 100), ('like', 80), ('even', 58)]
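
When only the overall word frequencies are needed, a similar ranking can be obtained without building the document-term matrix. The sketch below assumes the documents are already cleaned, whitespace-separated strings; `toy_cleaned` is a made-up stand-in for `cleaned_documents`.

```python
from collections import Counter

# Hypothetical cleaned documents standing in for `cleaned_documents`
toy_cleaned = [
    "great movie",
    "boring movie",
    "movie great ending movie",
]

token_counts = Counter()
for doc in toy_cleaned:
    token_counts.update(doc.split())  # add this document's word counts

top_two = token_counts.most_common(2)
print(top_two)  # [('movie', 4), ('great', 2)]
```

Because the cleaning step already removes punctuation and short words, whitespace splitting should agree with the vectorizer's tokens here, but that equivalence is an assumption worth checking on real data.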
