# Bag of Words

The Bag of Words (BoW) model is a simple, widely used way of turning text into features for machine learning algorithms. A text (such as a sentence or a document) is represented as the bag (multiset) of its words: grammar and word order are discarded, but multiplicity is kept. BoW is commonly used in document classification, where the occurrence (or frequency) of each word serves as a feature for training a classifier.
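To make the "multiset" idea concrete, here is a minimal sketch using only the standard library. The helper name `bag_of_words` and the whitespace tokenization are assumptions made for illustration; they are not part of scikit-learn.

```python
from collections import Counter

def bag_of_words(text):
    """Return the multiset of lowercase, whitespace-separated tokens in `text`."""
    return Counter(text.lower().split())

a = bag_of_words("the cat chased the dog")
b = bag_of_words("the dog chased the cat")

print(a["the"])  # 2: multiplicity is kept
print(a == b)    # True: word order is discarded
```

The two sentences mean different things but produce the same bag, which is exactly the information the model gives up.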

In this notebook, we will see how to implement the Bag of Words model in Python using the scikit-learn library.


In [1]:
import pandas as pd 

In [2]:
messages=pd.read_csv('spam_ham_dataset.csv')

In [3]:
messages.head()

,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0


In [6]:
messages.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  5171 non-null   int64 
 1   label       5171 non-null   object
 2   text        5171 non-null   object
 3   label_num   5171 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 161.7+ KB


In [7]:
messages = messages.drop(['label_num', 'Unnamed: 0'], axis=1)

In [8]:
messages.head()

,label,text
0,ham,Subject: enron methanol ; meter # : 988291\r\n...
1,ham,"Subject: hpl nom for january 9 , 2001\r\n( see..."
2,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar..."
3,spam,"Subject: photoshop , windows , office . cheap ..."
4,ham,Subject: re : indian springs\r\nthis deal is t...


In [10]:
import re 
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\AL KHAIR
[nltk_data]     COMPUTER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# WordNetLemmatizer needs the WordNet corpus; if it is missing, run nltk.download('wordnet') once
wn = WordNetLemmatizer()

In [13]:
stop_words = set(stopwords.words('english'))

In [14]:
# Clean a single message: keep letters only, lowercase, drop stopwords, lemmatize
def preprocess_text(text):
    if isinstance(text, str):
        text = re.sub('[^a-zA-Z]', ' ', text)  # Replace non-alphabetic characters with spaces
        text = text.lower()  # Lowercase
        words = text.split()  # Tokenize on whitespace
        words = [wn.lemmatize(word) for word in words if word not in stop_words]  # Remove stopwords, lemmatize the rest
        return ' '.join(words)  # Join back into a string
    return ""  # Non-string entries (e.g. NaN) become empty strings

# Apply to the 'text' column of messages
messages['cleaned_text'] = messages['text'].apply(preprocess_text)

In [18]:
messages['cleaned_text']

0       subject enron methanol meter follow note gave ...
1       subject hpl nom january see attached file hpln...
2       subject neon retreat ho ho ho around wonderful...
3       subject photoshop window office cheap main tre...
4       subject indian spring deal book teco pvr reven...
                              ...                        
5166    subject put ft transport volume decreased cont...
5167    subject following noms hpl take extra mmcf wee...
5168    subject calpine daily gas nomination julie men...
5169    subject industrial worksheet august activity a...
5170    subject important online banking alert dear va...
Name: cleaned_text, Length: 5171, dtype: object

## Creating the Bag of Words model

In [26]:
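from sklearn.feature_extraction.text import CountVectorizer
# max_features=100 keeps only the 100 most frequent terms.
# binary=True stores presence (1) or absence (0); drop it to keep word counts.
cv = CountVectorizer(max_features=100, binary=True)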

In [27]:
X = cv.fit_transform(messages['cleaned_text']).toarray()

In [28]:
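import numpy as np
# Show 30 items at each edge of every axis and avoid line wrapping.
# The float formatter has no effect on X, which holds integer 0/1 values.
np.set_printoptions(edgeitems=30, linewidth=100000, 
    formatter=dict(float=lambda x: "%.3g" % x))
X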

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ..., 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0

In [29]:
cv.vocabulary_

{'subject': 80,
 'enron': 26,
 'meter': 54,
 'flow': 30,
 'daren': 18,
 'please': 67,
 'daily': 17,
 'volume': 92,
 'gas': 35,
 'change': 9,
 'hpl': 40,
 'nom': 60,
 'see': 75,
 'attached': 2,
 'file': 28,
 'xl': 99,
 'time': 88,
 'know': 43,
 'go': 37,
 'week': 94,
 'like': 46,
 'following': 31,
 'need': 57,
 'get': 36,
 'let': 45,
 'first': 29,
 'would': 97,
 'could': 16,
 'www': 98,
 'com': 11,
 'also': 0,
 'one': 64,
 'email': 24,
 'back': 4,
 'available': 3,
 'make': 50,
 'deal': 20,
 'price': 69,
 'message': 53,
 'http': 41,
 'take': 82,
 'use': 91,
 'purchase': 71,
 'day': 19,
 'sale': 74,
 'today': 89,
 'mail': 49,
 'click': 10,
 'free': 34,
 'new': 58,
 'best': 5,
 'delivery': 21,
 'want': 93,
 'forwarded': 33,
 'texas': 83,
 'cc': 8,
 'mmbtu': 55,
 'nomination': 61,
 'month': 56,
 'april': 1,
 'next': 59,
 'service': 78,
 'company': 12,
 'system': 81,
 'call': 7,
 'information': 42,
 'north': 62,
 'number': 63,
 'help': 38,
 'forward': 32,
 'may': 52,
 'look': 48,
 'send': 76
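The vocabulary above maps each kept term to a column index of `X`, and the indices follow alphabetical order (`'also': 0`, `'april': 1`, ...). The sketch below imitates that behaviour in plain Python on a toy corpus. It is a simplified stand-in, not scikit-learn's implementation: the function name `fit_binary_bow`, the whitespace tokenization, and ranking terms by the number of documents that contain them are all assumptions of this sketch, and scikit-learn's exact ranking and tie-breaking may differ.

```python
from collections import Counter

def fit_binary_bow(docs, max_features):
    """Simplified stand-in for a binary bag of words with a capped vocabulary."""
    # Assumption: rank terms by how many documents contain them.
    doc_freq = Counter(token for doc in docs for token in set(doc.split()))
    # Keep the top terms, then assign column indices in alphabetical order.
    kept = sorted(term for term, _ in doc_freq.most_common(max_features))
    vocabulary = {term: idx for idx, term in enumerate(kept)}
    # One row per document: 1 if the term occurs at all, else 0.
    rows = []
    for doc in docs:
        present = set(doc.split())
        rows.append([1 if term in present else 0 for term in kept])
    return vocabulary, rows

docs = [
    "gas deal gas price",
    "free price click",
    "gas deal meter",
]
vocabulary, X_toy = fit_binary_bow(docs, max_features=3)

print(vocabulary)  # {'deal': 0, 'gas': 1, 'price': 2}
print(X_toy)       # [[1, 1, 1], [0, 0, 1], [1, 1, 0]]
```

Note that `gas` occurs twice in the first toy document but its entry is still 1, which is the effect of a binary representation.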

## N-grams

N-grams are the contiguous sequences of n adjacent items (words or letters) found in a source text. An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". Larger sizes are sometimes referred to by the number and "gram", for example, "four-gram", "five-gram", and so on.
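As a small illustration of the definition, the following sketch slides a window of length n over a sequence. The helper name `ngrams` is hypothetical and chosen for this example; it works on any list, so the same function covers word n-grams and letter n-grams.

```python
def ngrams(items, n):
    """Return every run of n adjacent items from `items`, in order."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

tokens = "see attached file".split()

print(ngrams(tokens, 1))        # [('see',), ('attached',), ('file',)]
print(ngrams(tokens, 2))        # [('see', 'attached'), ('attached', 'file')]
print(ngrams(tokens, 3))        # [('see', 'attached', 'file')]
print(ngrams(list("file"), 2))  # [('f', 'i'), ('i', 'l'), ('l', 'e')]
```

A sequence shorter than n simply yields an empty list.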


In [30]:
# Create the Bag of Words model with n-grams
from sklearn.feature_extraction.text import CountVectorizer
# binary=True gives a binary BoW; ngram_range=(2, 3) extracts bigrams and trigrams only
cv = CountVectorizer(max_features=100, binary=True, ngram_range=(2, 3))
X = cv.fit_transform(messages['cleaned_text']).toarray()

In [31]:
cv.vocabulary_

{'subject enron': 87,
 'subject hpl': 90,
 'hpl nom': 58,
 'see attached': 83,
 'attached file': 3,
 'subject hpl nom': 91,
 'see attached file': 84,
 'let know': 62,
 'enron com': 35,
 'com cc': 9,
 'cc subject': 4,
 'enron com cc': 36,
 'com cc subject': 10,
 'enron hpl': 40,
 'hpl actuals': 57,
 'teco tap': 95,
 'tap enron': 93,
 'gas daily': 51,
 'subject enron hpl': 88,
 'enron hpl actuals': 41,
 'teco tap enron': 96,
 'tap enron hpl': 94,
 'tenaska iv': 97,
 'vance taylor': 98,
 'robert cotten': 81,
 'cotten hou': 13,
 'hou ect': 54,
 'ect ect': 22,
 'ect cc': 18,
 'julie meyers': 60,
 'meyers hou': 66,
 'smith hou': 85,
 'melissa graf': 64,
 'graf hou': 52,
 'ect subject': 32,
 'ect pm': 30,
 'ect robert': 31,
 'enron enron': 37,
 'gary hank': 50,
 'ect pat': 28,
 'pat clynes': 72,
 'clynes corp': 7,
 'corp enron': 11,
 'na enron': 67,
 'ect daren': 20,
 'daren farmer': 15,
 'farmer hou': 46,
 'robert cotten hou': 82,
 'cotten hou ect': 14,
 'hou ect ect': 55,
 'ect ect cc': 23,
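With `ngram_range=(2,3)` the vocabulary contains only two-word and three-word phrases, which is why no single words appear above. The sketch below shows how such features could be generated from one cleaned message. The function name `ngram_features` and the whitespace tokenization are assumptions for illustration; this is not how scikit-learn is implemented internally.

```python
def ngram_features(text, ngram_range):
    """List space-joined word n-grams for every n in the inclusive range."""
    tokens = text.split()
    low, high = ngram_range
    features = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features

features = ngram_features("subject hpl nom", (2, 3))
print(features)  # ['subject hpl', 'hpl nom', 'subject hpl nom']
```

All three strings appear as keys in the vocabulary printed above.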

In [32]:
X

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0