## Bag of Words Model from Scratch

This is a quick tutorial on writing a "Bag of Words" model in Python from scratch. The Bag of Words model represents each text by the counts of the words it contains and is widely used in Natural Language Processing.

For this tutorial, we'll use the [SMS Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) from the UCI Repository.

In [1]:
# Import libraries
from collections import Counter
from string import punctuation
import pandas as pd

Let's see how the dataset is organized.

In [2]:
# Load Dataset
data = pd.read_table('./dataset/SMSSpamCollection', sep='\t', header=None, names=['label', 'sms_message'])

data.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


As we can see above, this dataset has two columns, "label" and "sms_message". The label column takes one of two values, "spam" or "ham", and the sms_message column holds the text of the corresponding message.

Next, let's check whether the dataset is unbalanced, in which case we might need data augmentation or to drop some data points.

In [3]:
# Check if data is balanced or not
data.describe()

| | label | sms_message |
|---|---|---|
| count | 5572 | 5572 |
| unique | 2 | 5169 |
| top | ham | Sorry, I'll call later |
| freq | 4825 | 30 |
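To put a number on the balance, each label's share can be computed from the counts reported above. This is a minimal sketch: the `labels` list is a stand-in for the real column, and it assumes that every row that is not "ham" is "spam" (5572 - 4825 rows, given only two unique labels).

```python
from collections import Counter

# Stand-in for data['label'], rebuilt from the describe() counts above.
labels = ['ham'] * 4825 + ['spam'] * (5572 - 4825)

label_counts = Counter(labels)
total = sum(label_counts.values())

# Share of each label, rounded to three decimals.
label_share = {label: round(count / total, 3) for label, count in label_counts.items()}
print(label_share)  # {'ham': 0.866, 'spam': 0.134}
```

With these counts, roughly 87% of the messages are "ham" and 13% are "spam".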


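The labels are not balanced: 4825 of the 5572 messages (about 87%) are "ham". Since this tutorial only builds word counts, we keep the data as it is. The next step is to convert the text and labels to numbers for easier manipulation.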

### Converting Labels to Numbers

In [4]:
data['label'] = data.label.map({'ham':0, 'spam':1})

data.head()

| | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
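The same mapping can be sketched in plain Python. One caveat with the pandas version: `Series.map` with a dict turns any value that is missing from the dict into NaN, so it can be worth checking for unexpected labels first. The names `label_map` and `sample_labels` below are illustrative, and the sample values mirror the five rows shown above.

```python
label_map = {'ham': 0, 'spam': 1}
sample_labels = ['ham', 'ham', 'spam', 'ham', 'ham']

# Fail loudly on labels the mapping does not cover instead of silently
# producing missing values.
unexpected = sorted(set(sample_labels) - set(label_map))
if unexpected:
    raise ValueError(f'Unmapped labels: {unexpected}')

numeric_labels = [label_map[label] for label in sample_labels]
print(numeric_labels)  # [0, 0, 1, 0, 0]
```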


Above, we replaced every "ham" label with 0 and every "spam" label with 1. Next, let's get all the text from the dataset and put each message into an array (a Python list).

In [5]:
len(data['sms_message'])

5572

### Get All Lines from Text into an Array

In [6]:
# Collect every message into a plain Python list
word_lines = data['sms_message'].tolist()

print(word_lines)



The next step is to convert all the text to lowercase so that the same word is not counted separately because of capitalization.

### Convert Text to Lowercase

In [7]:
lowercase_text = []

for text in word_lines:
    lowercase_text.append(text.lower())
    
print(lowercase_text)
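A small illustration of why lowercasing matters for counting, using a made-up word list rather than the dataset: without it, each capitalization of a word is treated as a different word.

```python
from collections import Counter

words = ['Free', 'free', 'FREE', 'entry']

# Case-sensitive counting keeps three separate entries for the same word.
case_sensitive = Counter(words)
print(case_sensitive)  # Counter({'Free': 1, 'free': 1, 'FREE': 1, 'entry': 1})

# After lowercasing, the three variants collapse into one entry.
case_insensitive = Counter(word.lower() for word in words)
print(case_insensitive)  # Counter({'free': 3, 'entry': 1})
```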



### Remove Punctuation

In [8]:
filtered_text = []

for words in lowercase_text:
    filtered_text.append(words.translate(str.maketrans("","",punctuation)))

print(filtered_text)
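The cell above strips punctuation only; stop words (very common words such as "the" or "to") are not removed. A minimal sketch of that extra step follows. It assumes a tiny hand-picked stop-word set: `STOP_WORDS` and the helper name are illustrative, not a standard list or part of the notebook.

```python
from string import punctuation

# Hypothetical, hand-picked stop-word set; real lists are much longer.
STOP_WORDS = {'a', 'an', 'the', 'to', 'in', 'is', 'so'}

def remove_punctuation_and_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, strip ASCII punctuation, and drop stop words."""
    cleaned = text.lower().translate(str.maketrans('', '', punctuation))
    # Stop words are whole tokens, so the filter runs on the split text.
    return [word for word in cleaned.split() if word not in stop_words]

result = remove_punctuation_and_stop_words('Free entry in 2 a wkly comp to win FA Cup')
print(result)  # ['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup']
```

Unlike the cell above, this sketch returns a list of tokens rather than a string, because stop words can only be matched once the text has been split into words.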



### Tokenization

In [9]:
tokenized_text = []

for words in filtered_text:
    tokenized_text.append(words.split(' '))

print(tokenized_text)
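One detail worth knowing at this step: `split(' ')` cuts at every single space, so consecutive spaces produce empty-string tokens that are later counted as words. A small illustration on a made-up string; calling `split()` with no argument is one way to avoid this.

```python
text = 'ok lar  joking wif u oni'  # note the double space

# Splitting on a single space keeps an empty token between the two spaces.
tokens_single_space = text.split(' ')
print(tokens_single_space)  # ['ok', 'lar', '', 'joking', 'wif', 'u', 'oni']

# Splitting on any run of whitespace drops it.
tokens_whitespace = text.split()
print(tokens_whitespace)  # ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```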



### Count Word Frequency

In [10]:
frequency_list = []

for words in tokenized_text:
    frequency_list.append(Counter(words))

print(frequency_list)
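For reference, here is what a single `Counter` looks like on a short, made-up token list. A `Counter` behaves like a dictionary from word to count, returns 0 for words it has never seen, and can list the most frequent words.

```python
from collections import Counter

tokens = ['free', 'entry', 'free', 'prize', 'free', 'entry']
frequency = Counter(tokens)

print(frequency['free'])         # 3
print(frequency['ham'])          # 0, since missing words count as zero
print(frequency.most_common(2))  # [('free', 3), ('entry', 2)]
```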



In [11]:
count_list = []

for words in tokenized_text:
    count_list.append(Counter(words).values())

print(count_list)

[dict_values([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), dict_values([1, 1, 1, 1, 1, 1]), dict_values([1, 1, 2, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), ...]

There you go. We now have, for every message, the count of each word it contains.
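The per-message counts can be turned into fixed-length vectors by fixing a shared vocabulary, which is the usual final form of a Bag of Words representation. A minimal sketch on three short, already-tokenized messages follows; `build_vocabulary` and `vectorize` are hypothetical helper names, not part of the notebook above.

```python
from collections import Counter

def build_vocabulary(tokenized_messages):
    """Map every distinct word to a column index, in sorted order."""
    words = sorted({word for tokens in tokenized_messages for word in tokens})
    return {word: index for index, word in enumerate(words)}

def vectorize(tokens, vocabulary):
    """Return the count of each vocabulary word in one message."""
    counts = Counter(tokens)
    # Words outside the vocabulary are ignored; absent words count as 0.
    return [counts[word] for word in vocabulary]

tokenized_messages = [
    ['ok', 'lar', 'joking'],
    ['free', 'entry', 'free'],
    ['ok', 'free'],
]

vocabulary = build_vocabulary(tokenized_messages)
matrix = [vectorize(tokens, vocabulary) for tokens in tokenized_messages]

print(list(vocabulary))  # ['entry', 'free', 'joking', 'lar', 'ok']
print(matrix)            # [[0, 0, 1, 1, 1], [1, 2, 0, 0, 0], [0, 1, 0, 0, 1]]
```

Each row is one message and each column is one vocabulary word, so every message gets a vector of the same length regardless of how many words it contains.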