# Bag of Words

Bag of Words (BoW) is a text representation technique in Natural Language Processing (NLP). It transforms text into numerical vectors that machine learning models can consume, ignoring grammar and word order.

A BoW representation is based on how often words occur in a document. The process starts by building a vocabulary from the text and then measuring the occurrence of each vocabulary word. It is called a "bag" because the order and structure of the words are discarded; only their occurrence is kept.



## Theory

Note: In a typical NLP pipeline, the first step is text preprocessing. The feature extraction technique of choice is applied only after the text has been preprocessed.

The sentences we are going to consider are:

"I love NLP."
"NLP is amazing!"

After preprocessing (lowercasing and removing stopwords and punctuation), the two sentences become:

"love nlp"
"nlp amazing"

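The preprocessing step that turns the raw sentences into the cleaned ones can be sketched in plain Python. This is only a minimal illustration: the `preprocess` helper and the two-word `STOPWORDS` set are hypothetical and chosen to fit these two sentences, whereas a real pipeline would use a full stopword list such as the one bundled with spaCy.

```python
import re

# Hypothetical, deliberately tiny stopword list for this illustration only
STOPWORDS = {"i", "is"}

def preprocess(sentence: str) -> str:
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

sentences = ["I love NLP.", "NLP is amazing!"]
cleaned = [preprocess(s) for s in sentences]
print(cleaned)  # ['love nlp', 'nlp amazing']
```
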
1. Building the Vocabulary

    Combine all documents or sentences into a corpus.

    Identify all unique words in the corpus to create a vocabulary.
    Vocabulary: ["love", "nlp", "amazing"]

2. Representing Text as Vectors

    Each document or sentence is converted into a vector based on the vocabulary. There are two variants of the bag-of-words model:

    1. Word Presence: Binary encoding (1 if the word is present, 0 otherwise).
        Example:
       * "love nlp" → [1, 1, 0] 
       * "nlp amazing" → [0, 1, 1]

    2. Word Frequency: Each entry counts how many times the word occurs.
        Example, using variants of the sentences in which "nlp" is repeated:
       * "love nlp nlp" → [1, 2, 0]
       * "nlp amazing nlp" → [0, 2, 1]
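
The two steps above can be sketched from scratch in a few lines of Python. The helper names `build_vocabulary` and `vectorize` are hypothetical and exist only for this illustration. The vocabulary keeps words in order of first appearance so that the vectors match the examples above.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return the unique words of the corpus in order of first appearance."""
    vocabulary = []
    seen = set()
    for document in documents:
        for word in document.split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary

def vectorize(document, vocabulary, binary=False):
    """Map a document to a count vector, or to a 0/1 presence vector."""
    counts = Counter(document.split())
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]

corpus = ["love nlp", "nlp amazing"]
vocabulary = build_vocabulary(corpus)
print(vocabulary)  # ['love', 'nlp', 'amazing']

presence = [vectorize(d, vocabulary, binary=True) for d in corpus]
print(presence)  # [[1, 1, 0], [0, 1, 1]]

frequency = [vectorize(d, vocabulary) for d in ["love nlp nlp", "nlp amazing nlp"]]
print(frequency)  # [[1, 2, 0], [0, 2, 1]]
```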

## Using bag-of-words representation to build a spam email classifier

Prerequisites
1. Python 3.10 or above
2. pandas
3. spaCy, with the `en_core_web_sm` model (`python -m spacy download en_core_web_sm`)
4. scikit-learn

In [2]:
# Importing libraries
import pandas as pd
import spacy
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")

In [3]:
# Load and inspect the first few samples of the dataset
mail_data = pd.read_csv("./mail_data.csv")
mail_data.head()

|   | Category | Message |
|---|----------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [4]:
# Print the class distribution of the data
mail_data["Category"].value_counts()

Category
ham     4825
spam     747
Name: count, dtype: int64

The Category column contains two labels: "spam" if the message is spam and "ham" if it is not. The classes are imbalanced: only 747 of the 5,572 messages (about 13%) are spam.

We convert the Category column into binary labels, where 0 represents ham and 1 represents spam.

In [5]:
mail_data["Category"] = mail_data["Category"].apply(lambda x: 1 if x == "spam" else 0)
mail_data

|   | Category | Message |
|---|----------|---------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | 1 | This is the 2nd time we have tried 2 contact u... |
| 5568 | 0 | Will ü b going to esplanade fr home? |
| 5569 | 0 | Pity, * was in mood for that. So...any other s... |
| 5570 | 0 | The guy did some bitching but I acted like i'd... |


Next, we preprocess the Message column before feature extraction.

In [6]:
def clean_text(text):
    """
    Cleans the input text using spaCy.
    - Removes stopwords
    - Removes punctuation
    - Keeps only alphabetic tokens
    - Converts words to lowercase
    """

    # Process the text
    doc = nlp(text)

    cleaned_tokens = []

    for token in doc:
        if not token.is_stop and not token.is_punct and token.is_alpha:
            # Tokens are lowercased but not lemmatized, which matches the output below.
            # To lemmatize as well, append token.lemma_.lower() instead (results will differ).
            cleaned_tokens.append(token.lower_)

    # Join the tokens again to form a sentence (a single string)
    cleaned_text = " ".join(cleaned_tokens)
    return cleaned_text

# Apply it to the message column
mail_data["cleaned_message"] = mail_data["Message"].apply(clean_text)

mail_data

|   | Category | Message | cleaned_message |
|---|----------|---------|-----------------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | jurong point crazy available bugis n great wor... |
| 1 | 0 | Ok lar... Joking wif u oni... | ok lar joking wif u oni |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | free entry wkly comp win fa cup final tkts tex... |
| 3 | 0 | U dun say so early hor... U c already then say... | u dun early hor u c |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | nah think goes usf lives |
| ... | ... | ... | ... |
| 5567 | 1 | This is the 2nd time we have tried 2 contact u... | time tried contact u won pound prize claim eas... |
| 5568 | 0 | Will ü b going to esplanade fr home? | ü b going esplanade fr home |
| 5569 | 0 | Pity, * was in mood for that. So...any other s... | pity mood suggestions |
| 5570 | 0 | The guy did some bitching but I acted like i'd... | guy bitching acted like interested buying week... |


In [7]:
# Drop columns we do not need
cleaned_data = mail_data[["cleaned_message", "Category"]]
cleaned_data

|   | cleaned_message | Category |
|---|-----------------|----------|
| 0 | jurong point crazy available bugis n great wor... | 0 |
| 1 | ok lar joking wif u oni | 0 |
| 2 | free entry wkly comp win fa cup final tkts tex... | 1 |
| 3 | u dun early hor u c | 0 |
| 4 | nah think goes usf lives | 0 |
| ... | ... | ... |
| 5567 | time tried contact u won pound prize claim eas... | 1 |
| 5568 | ü b going esplanade fr home | 0 |
| 5569 | pity mood suggestions | 0 |
| 5570 | guy bitching acted like interested buying week... | 0 |


In [8]:
# Split the cleaned data into train and test data
X_train, X_test, y_train, y_test = train_test_split(cleaned_data["cleaned_message"], cleaned_data["Category"], test_size=0.2, random_state=42)

# Print the shapes
print("X_train_shape: ", X_train.shape)
print("X_test_shape: ", X_test.shape)
print("y_train_shape: ", y_train.shape)
print("y_test_shape: ", y_test.shape)

X_train_shape:  (4457,)
X_test_shape:  (1115,)
y_train_shape:  (4457,)
y_test_shape:  (1115,)


In [11]:
# Data types
print(type(X_train))
print(type(X_test))
print(type(y_train))
print(type(y_test))

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


In [124]:
# Get bag-of-words representation of the text
vect = CountVectorizer()

# X_train and X_test are pandas Series objects,
# the .values property returns the values in an array without the indexes
X_train_cv = vect.fit_transform(X_train.values)
X_test_cv = vect.transform(X_test.values)

In [125]:
# vocabulary length
len(vect.vocabulary_)

6252

In [126]:
# Print the shape of the first training document's vector; it will be (1, vocabulary length)
print(X_train_cv[0].shape)

(1, 6252)


In [127]:
# Instantiate the model
mnb = MultinomialNB()

# Fit the training data to the model
mnb.fit(X_train_cv, y_train)

# Get predictions
y_pred = mnb.predict(X_test_cv)

# Generate classification report (true labels first, then predictions)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       966
           1       0.93      0.93      0.93       149

    accuracy                           0.98      1115
   macro avg       0.96      0.96      0.96      1115
weighted avg       0.98      0.98      0.98      1115



In [None]:
# A compact way of doing vectorization and modelling is to use a Pipeline
clf = Pipeline([
    ("Vectorizer", CountVectorizer()),
    ("Naive Bayes", MultinomialNB())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       966
           1       0.93      0.93      0.93       149

    accuracy                           0.98      1115
   macro avg       0.96      0.96      0.96      1115
weighted avg       0.98      0.98      0.98      1115
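
Because a pipeline bundles the fitted vocabulary with the model, classifying new messages takes a single `predict` call on raw strings. The sketch below illustrates this on a handful of hypothetical toy messages that are not taken from the dataset. The name `toy_clf` and the step names are assumptions made for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy messages (1 = spam, 0 = ham), not taken from the dataset
train_messages = [
    "win free prize now", "free entry win cash", "claim prize now",
    "see you at lunch", "call me when home", "are we meeting today",
]
train_labels = [1, 1, 1, 0, 0, 0]

toy_clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("naive_bayes", MultinomialNB()),
])
toy_clf.fit(train_messages, train_labels)

# New text is vectorized with the fitted vocabulary before prediction
new_messages = ["free prize claim", "see you at home"]
predictions = toy_clf.predict(new_messages).tolist()
print(predictions)  # [1, 0]
```

When the pipeline is trained on `cleaned_message`, as in this tutorial, new messages should be passed through `clean_text` first so that they are preprocessed the same way as the training data.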



## Advantages

1. **Simple to implement and interpret**: The Bag of Words model is one of the most straightforward text representation techniques, making it ideal for beginners. Its simplicity allows for fast implementation without the need for complex preprocessing or specialized models.
   
2. **Easy to use for text classification tasks**: Bag of Words is well-suited for basic tasks like text classification, sentiment analysis, and spam detection. These tasks often don’t require sophisticated language models, so a BoW representation is sufficient and efficient.

## Disadvantages

1. **Produces high-dimensional, sparse matrices**: Each document is represented by the frequency of every word in a potentially large vocabulary, so the resulting matrices are mostly zeros. In the example above, every message becomes a vector of 6,252 entries. Stored densely, such matrices consume significant memory; in practice they require sparse formats (such as the SciPy sparse matrices that `CountVectorizer` returns) and models that handle them efficiently, especially with large datasets.

2. **Loses meaning and context**: BoW disregards word order and sentence structure, which results in the loss of grammatical relationships and meaning. This limitation makes it less suitable for tasks where context, nuance, and word order matter, such as translation or sentiment detection in complex sentences.

3. **Out-of-Vocabulary (OOV) issues**: Words that do not appear in the training corpus are not part of the vocabulary, so they are ignored when new text is vectorized and contribute nothing to the prediction.

## Sources
1. YouTube: Krish Naik, Codebasics
2. DataCamp: Derrick Mwiti
3. Spam Email Dataset: [Kaggle](<https://www.kaggle.com/datasets/abdmental01/email-spam-dedection>)