Building a Spam Filter with Naive Bayes
===

In this project, we are going to "teach" the computer to classify SMS messages as spam or ham (non-spam) using the multinomial Naive Bayes algorithm. The dataset consists of 5,572 SMS messages that have already been classified by humans.

In [1]:
import pandas as pd
sms = pd.read_csv("SMSSpamCollection", sep='\t', header=None, names=['Label','SMS'])

In [2]:
sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [3]:
sms.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Let's see what percentage of the messages is spam and what percentage is ham.

In [4]:
(sms["Label"].value_counts(normalize=True)*100).round(2)

ham     86.59
spam    13.41
Name: Label, dtype: float64

In this dataset, about 13.4% of the messages are spam and about 86.6% are ham.

Training and Test Set
---

First, we split the dataset into two parts:
* A training set, which we'll use to "train" the computer how to classify messages. It will have 4,458 messages (about 80% of the dataset).
* A test set, which we'll use to test how good the spam filter is at classifying new messages. It will have 1,114 messages (about 20% of the dataset).

For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

In [5]:
#shuffling the entire dataset so that spam and ham messages are spread evenly before the split
random_sms = sms.sample(frac=1, random_state=1)

In [6]:
training_test_index = round(len(random_sms) * 0.8)

In [7]:
training = random_sms[:training_test_index].reset_index(drop=True)
test = random_sms[training_test_index:].reset_index(drop=True)

In [8]:
training.shape

(4458, 2)

In [9]:
test.shape

(1114, 2)

In [10]:
training["Label"].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [11]:
test["Label"].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

The training and test sets have nearly the same percentages of ham and spam messages as the full dataset (about 87% ham and 13% spam).

Letter Case and Punctuation
---

First, we need to do a bit of data cleaning to bring the data into a format that lets us easily extract the information we need. We are going to remove punctuation and lowercase every message, so that the SMS column can later be replaced by one word-count column per unique word in the vocabulary.

In [12]:
#removing punctuation (regex=True is required in recent pandas versions)
training["SMS"] = training["SMS"].str.replace(r"\W", " ", regex=True)

In [13]:
#transforming every word to lowercase
training["SMS"] = training["SMS"].str.lower()
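
To see what these two cleaning steps do to a single message, here is a minimal sketch that uses only the standard library. The helper name `clean_message` is hypothetical and not part of the notebook; the input string is the second message from the `sms.head()` output above.

```python
import re

def clean_message(text):
    """Replace every non-word character with a space, then lowercase."""
    # \W matches any character that is not a letter, a digit, or an underscore
    text = re.sub(r"\W", " ", text)
    return text.lower()

example = "Ok lar... Joking wif u oni..."
cleaned = clean_message(example)
print(cleaned.split())  # ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```

Each punctuation character becomes a space rather than being dropped, so words separated only by punctuation stay separate when the message is split later.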

In [14]:
training.head(10)

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
5,ham,ok i thk i got it then u wan me 2 come now or...
6,ham,i want kfc its tuesday only buy 2 meals only ...
7,ham,no dear i was sleeping p
8,ham,ok pa nothing problem
9,ham,ill be there on lt gt ok


Creating the Vocabulary
---

First we'll create a list with all of the unique words that occur in the messages of our training set to create the vocabulary.

In [15]:
training["SMS"] = training["SMS"].str.split()

In [16]:
vocabulary = []

In [17]:
#using `message` as the loop variable so the `sms` DataFrame is not overwritten
for message in training["SMS"]:
    for word in message:
        vocabulary.append(word)

In [18]:
vocabulary = list(set(vocabulary))

In [19]:
len(vocabulary)

7783

We have 7,783 unique words in the vocabulary.
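
The same deduplication can be sketched more compactly with a set comprehension. The three tokenized messages below are toy data standing in for `training["SMS"]` after `str.split()`, and the names `toy_messages` and `toy_vocabulary` are assumptions made for this illustration.

```python
# Toy tokenized messages standing in for training["SMS"] after str.split()
toy_messages = [
    ["yep", "by", "the", "pretty", "sculpture"],
    ["welp", "apparently", "he", "retired"],
    ["he", "retired", "by", "the", "sea"],
]

# A set keeps one copy of each word, so repeats across messages collapse;
# sorting makes the order reproducible from run to run
toy_vocabulary = sorted({word for message in toy_messages for word in message})

print(len(toy_vocabulary))  # 10
```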

The Final Training Set
---

First, we will create a dictionary to get the word counts for each SMS.

In [20]:
word_counts_per_sms = {unique_word: [0] * len(training['SMS']) for unique_word in vocabulary}

#using `message` as the loop variable so the `sms` DataFrame is not overwritten
for index, message in enumerate(training['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1

In [21]:
word_counts = pd.DataFrame(word_counts_per_sms)

In [22]:
training_clean = pd.concat([training, word_counts], axis=1)

In [23]:
training_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
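
To make the shape of this dictionary concrete, here is the same counting logic on two made-up messages. All names prefixed with `toy_` are hypothetical. Each word maps to a list with one slot per message, which is why the dictionary converts directly into a DataFrame with one column per word.

```python
# Two made-up tokenized messages
toy_messages = [
    ["free", "entry", "free", "prize"],
    ["ok", "see", "u", "there"],
]
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One list per word, one slot per message, all starting at zero
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}

for index, message in enumerate(toy_messages):
    for word in message:
        toy_counts[word][index] += 1

print(toy_counts["free"])  # [2, 0]
print(toy_counts["ok"])    # [0, 1]
```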


Calculating Constants First
---

Now that we're done with data cleaning and have a training set to work with, we can begin creating the spam filter.

In [24]:
spams = training_clean[training_clean["Label"] == "spam"]
hams = training_clean[training_clean["Label"] == "ham"]

In [25]:
#calculating the prior probabilities P(Spam) and P(Ham) in the clean training set
p_spam = len(spams)/len(training_clean)
p_ham = len(hams)/len(training_clean)

In [26]:
#calculating the total number of words in all spam messages
n_spam = spams["SMS"].apply(len).sum()

In [27]:
#calculating the total number of words in all ham messages
n_ham = hams["SMS"].apply(len).sum()

In [28]:
#the number of unique words in the vocabulary
n_vocab = len(vocabulary)

In [29]:
#Laplace smoothing parameter
alpha = 1

Calculating Parameters
---

First, we'll initialize two dictionaries, where each key is a unique word (from our vocabulary) represented as a string, and each value is 0. We'll need one dictionary to store the parameters for P(wi|Spam), and the other for P(wi|Ham).
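
Written out, the parameters computed in this section follow the Laplace-smoothed estimates below. The symbol names are notation chosen for this sketch: the numerator counts are how often the word occurs in spam or ham messages, and the remaining terms correspond to `n_spam`, `n_ham`, `n_vocab`, and `alpha` from the cells above.

```latex
P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
\qquad
P(w_i \mid \text{Ham}) = \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
```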

In [30]:
#initializing parameters
param_spam = {unique_word:0 for unique_word in vocabulary}
param_ham = {unique_word:0 for unique_word in vocabulary}

In [31]:
#calculating parameters
for word in vocabulary:
    n_word_given_spam = spams[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha)/(n_spam+(alpha*n_vocab))
    param_spam[word] = p_word_given_spam
    
    n_word_given_ham = hams[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha)/(n_ham+(alpha*n_vocab))
    param_ham[word] = p_word_given_ham
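
A small worked example may help show why the smoothing term matters. The counts below are made up, and the `toy_` names are hypothetical. The word "hello" never occurs in the toy spam messages, yet it still receives a non-zero probability, so a single unseen word cannot force the whole product to zero.

```python
alpha = 1  # Laplace smoothing, as in the notebook

# Made-up counts of each word across all toy spam messages
toy_spam_counts = {"free": 3, "prize": 1, "hello": 0}

n_spam_toy = sum(toy_spam_counts.values())  # 4 words in total
n_vocab_toy = len(toy_spam_counts)          # 3 unique words

toy_param_spam = {
    word: (count + alpha) / (n_spam_toy + alpha * n_vocab_toy)
    for word, count in toy_spam_counts.items()
}

# free -> 4/7, prize -> 2/7, hello -> 1/7
for word, probability in toy_param_spam.items():
    print(word, round(probability, 4))
```

The smoothed probabilities still sum to 1 over the vocabulary, because `alpha` is added once per word in the numerator and `alpha * n_vocab_toy` times in the denominator.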

Classifying A New Message
---

Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. Below, we have the function for filtering messages.
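
For a message made of the words w_1, ..., w_n, the function compares two quantities of the following form. Note that they are proportional to the posterior probabilities rather than equal to them, since the shared normalizing factor is left out; this does not change which of the two is larger.

```latex
P(\text{Spam} \mid w_1, \ldots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})
\qquad
P(\text{Ham} \mid w_1, \ldots, w_n) \propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Ham})
```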

In [32]:
import re

def classify(message):
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in param_spam:
            p_spam_given_message *= param_spam[word]
            
        if word in param_ham:
            p_ham_given_message *= param_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

Let's test our function with the examples below.

In [33]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [34]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
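
The values printed above are already as small as 1e-27 for short messages, and a product of many such factors can underflow to zero for very long messages. A common alternative, which is not what this notebook does, is to compare sums of logarithms instead. The sketch below assumes made-up parameter values, and the function name `log_score` is hypothetical.

```python
import math

def log_score(words, prior, params):
    """Return log(prior) plus the sum of log P(word | class)."""
    score = math.log(prior)
    for word in words:
        if word in params:  # words outside the vocabulary are skipped
            score += math.log(params[word])
    return score

# Made-up probabilities, for illustration only
toy_param_spam = {"winner": 0.02, "money": 0.01, "see": 0.001}
toy_param_ham = {"winner": 0.0005, "money": 0.002, "see": 0.01}

words = ["winner", "money"]
spam_score = log_score(words, 0.13, toy_param_spam)
ham_score = log_score(words, 0.87, toy_param_ham)

print("spam" if spam_score > ham_score else "ham")  # spam
```

Because the logarithm is strictly increasing, comparing the two log scores gives the same label as comparing the two products.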


Measuring the Spam Filter's Accuracy
---

We'll now try to determine how well the spam filter does on our test set of 1,114 messages.

In [35]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in param_spam:
            p_spam_given_message *= param_spam[word]

        if word in param_ham:
            p_ham_given_message *= param_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

We made a small modification to our function so that it returns the label instead of printing it. We will now apply it to our test set and measure its accuracy.

In [36]:
test["predicted"] = test["SMS"].apply(classify_test_set)
test.head(10)

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham
5,ham,But my family not responding for anything. Now...,ham
6,ham,U too...,ham
7,ham,Boo what time u get out? U were supposed to ta...,ham
8,ham,Genius what's up. How your brother. Pls send h...,ham
9,ham,I liked the new mobile,ham


In [37]:
correct = 0
total = test.shape[0]

for row in test.iterrows():
    row = row[1]
    if row["Label"] == row["predicted"]:
        correct += 1

In [38]:
accuracy = correct/total

In [39]:
print("Correct:", correct)
print("Incorrect:", total-correct)
print("Accuracy:", round(accuracy,4))

Correct: 1100
Incorrect: 14
Accuracy: 0.9874


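The accuracy of our filter is 98.74%, which is very high and well above our 80% goal. In other words, the filter classifies almost 99% of the test messages correctly. For comparison, labeling every message as ham would be right about 86.8% of the time on this test set, since that is its share of ham messages.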

Let's see the incorrectly classified messages.

In [42]:
incorrect= pd.DataFrame(columns=["Label","SMS","predicted"])

for index,row in test.iterrows():
    if row["Label"] != row["predicted"]:
        incorrect.loc[index] = row

In [50]:
pd.set_option('display.max_colwidth', 200)

In [51]:
incorrect

Unnamed: 0,Label,SMS,predicted
114,spam,Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net,ham
135,spam,More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB,ham
152,ham,Unlimited texts. Limited minutes.,spam
159,ham,26th OF JULY,spam
284,ham,Nokia phone is lovly..,spam
293,ham,"A Boy loved a gal. He propsd bt she didnt mind. He gv lv lttrs, Bt her frnds threw thm. Again d boy decided 2 aproach d gal , dt time a truck was speeding towards d gal. Wn it was about 2 hit d gi...",needs human classification
302,ham,No calls..messages..missed calls,spam
319,ham,"We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us",spam
504,spam,"Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50",ham
546,spam,"Hi babe its Chloe, how r u? I was smashed on saturday night, it was great! How was your weekend? U been missing me? SP visionsms.com Text stop to stop 150p/text",ham


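Among the ten rows listed above, the errors go in both directions: four spam messages were classified as ham, five ham messages were classified as spam, and one ham message ended in a tie. Maybe we should improve our algorithm for even better filtering.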
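
Tallying the ten rows listed above by actual and predicted label shows the direction of each error. This is a minimal sketch using only the standard library; the pairs are copied by hand from the table, and the names `pairs` and `error_counts` are assumptions made for this illustration.

```python
from collections import Counter

# (actual, predicted) pairs copied from the rows listed above
pairs = [
    ("spam", "ham"), ("spam", "ham"),
    ("ham", "spam"), ("ham", "spam"), ("ham", "spam"),
    ("ham", "needs human classification"),
    ("ham", "spam"), ("ham", "spam"),
    ("spam", "ham"), ("spam", "ham"),
]

error_counts = Counter(pairs)

for (actual, predicted), count in error_counts.most_common():
    print(f"{actual} -> {predicted}: {count}")
```

Separating the two error directions matters for a spam filter, since a legitimate message sent to the spam folder is usually more costly than a spam message that reaches the inbox.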