## Introduction

This project, as the name suggests, builds a Naive Bayes spam classifier.

The data comes from the UCI Machine Learning Repository. The spam filter works on SMS messages (which are shorter than e-mails and thus take fewer resources to classify), but the principles are the same.

In [25]:
#Import packages
import pandas as pd
import numpy as np
import re 
import matplotlib.pyplot as plt
%matplotlib inline

In [26]:
df=pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["Label", "SMS"])
df

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [27]:
df.Label.value_counts()

ham     4825
spam     747
Name: Label, dtype: int64

This is a very low number of spam messages (747 of 5572, roughly 13%), but if they are short and highly repetitive, they might suffice.

As a first step, I shall split the data into a training set and a test set.
I prefer to use sklearn's split, but here I don't think there is any harm in using pandas's sampler.

In [28]:

# Shuffle once, then slice: the first 4458 rows (about 80%) train, the remaining 1114 test
shuffled=df.sample(frac=1, random_state=1).reset_index(drop=True)
train=shuffled.loc[:4457].copy()
test=shuffled.loc[4458:].copy()

In [29]:
train.Label.value_counts()

ham     3858
spam     600
Name: Label, dtype: int64

In [30]:
test.Label.value_counts()

ham     967
spam    147
Name: Label, dtype: int64

The ratios are in the same ballpark, so I will accept this division.
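To put a number on "the same ballpark", the spam share of each split can be computed from the counts printed above. This is a minimal sketch: the `counts` dictionary simply copies those printed counts, and the helper `spam_share` is a hypothetical name, not part of the notebook.

```python
# Counts copied from the value_counts() outputs above
counts = {
    "full": {"ham": 4825, "spam": 747},
    "train": {"ham": 3858, "spam": 600},
    "test": {"ham": 967, "spam": 147},
}

def spam_share(c):
    """Fraction of messages in a split that are labelled spam."""
    return c["spam"] / (c["ham"] + c["spam"])

shares = {name: spam_share(c) for name, c in counts.items()}
for name, share in shares.items():
    print(f"{name:>5}: {share:.1%} spam")
```

All three shares come out between 13% and 14%, so the random split preserved the class balance without explicit stratification.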

In [31]:
test.reset_index(inplace=True, drop=True)

In [32]:
test.tail()

Unnamed: 0,Label,SMS
1109,ham,"We're all getting worried over here, derek and..."
1110,ham,Oh oh... Den muz change plan liao... Go back h...
1111,ham,CERI U REBEL! SWEET DREAMZ ME LITTLE BUDDY!! C...
1112,spam,Text & meet someone sexy today. U can find a d...
1113,ham,K k:) sms chat with me.


## Method

As this project is supposed to be in the "probability and statistics" segment, no ML libraries are used for now.

As a next step, we should remove punctuation and capitalization.
I should mention that with "real" e-mails I would not remove these, as in my experience they are quite telling for spam messages.

Instead, I would represent them with a few numeric features and "stack" those with the standard prediction.
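A rough idea of what such numeric features could look like, computed on the raw message before cleaning. This is only a sketch: the function `style_features` and its three features are my own illustrative choices and are not used anywhere in this notebook.

```python
import string

def style_features(message):
    """Return simple numeric features describing punctuation and capitalization."""
    letters = [ch for ch in message if ch.isalpha()]
    n_upper = sum(ch.isupper() for ch in letters)
    return {
        # Share of letters that are uppercase (0.0 if the message has no letters)
        "upper_ratio": n_upper / len(letters) if letters else 0.0,
        # Number of exclamation marks
        "n_exclaim": message.count("!"),
        # Number of ASCII punctuation characters of any kind
        "n_punct": sum(ch in string.punctuation for ch in message),
    }

features = style_features("WINNER!! Call now.")
print(features)
```

Features like these could then be fed to a second model together with the Naive Bayes output, which is the "stacking" idea mentioned above.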

In [33]:
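# regex=True is required on pandas >= 2.0, where str.replace defaults to literal matching
train.SMS=train.SMS.str.replace(r"\W", " ", regex=True).str.lower()
test.SMS=test.SMS.str.replace(r"\W", " ", regex=True).str.lower()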

Now let's build a vocabulary (for now, a dictionary that maps each word to a list of per-message counts).

In [34]:
voca={}
message_count=len(train.SMS)
for msg in train.SMS.str.split():
    for word in msg:
        if word not in voca:
            voca[word]=[0]*message_count
# Here I have deviated from the guided project a bit: it wanted to build a list,
# transform it to a set and back (to get rid of duplicates), and then use that as a key list.
# That was not necessary, since dictionary keys are unique anyway.

In [35]:
# Series.iteritems() was removed in pandas 2.0; items() is the replacement
for idx, txt in train.SMS.items():
    for word in txt.split():
        voca[word][idx]+=1

In [36]:
vocabulary=pd.DataFrame(voca)

OK, this dataframe is gigantic, but hey...
I will just merge it into the original so that I also have the label and SMS columns...

OK, so the kernel dies if I try to concat these, so I shall just add the label column.

Anyway, from here I shall calculate basic values such as P(Spam), P(Ham), and the number of words in the spam messages, in the ham messages, and in the vocabulary.

In [37]:
vocabulary=vocabulary.assign(label=train.Label)
vocabulary.head()

Unnamed: 0,yep,by,the,pretty,sculpture,yes,princess,are,you,going,...,hides,secrets,n8,jewelry,related,trade,arul,bx526,wherre,label
0,1,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ham
1,0,0,0,0,0,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,ham
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ham
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ham
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ham
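The table above stores one count per word per message, which is why it gets so large. A lighter alternative, sketched here under the assumption that only per-class totals are needed for the formulas that follow, keeps one `Counter` per label. The names `toy_train`, `word_counts`, `n_spam_words`, and `n_ham_words` are illustrative and not part of the notebook.

```python
from collections import Counter

# Toy stand-in for the cleaned training data: (label, message) pairs
toy_train = [
    ("ham", "see you there"),
    ("spam", "win money now"),
    ("spam", "win a prize now"),
]

# One word counter per class instead of one column per word
word_counts = {"spam": Counter(), "ham": Counter()}
for label, message in toy_train:
    word_counts[label].update(message.split())

# The vocabulary is the union of the words seen in either class
vocabulary_words = set(word_counts["spam"]) | set(word_counts["ham"])
n_spam_words = sum(word_counts["spam"].values())
n_ham_words = sum(word_counts["ham"].values())

print(len(vocabulary_words), n_spam_words, n_ham_words)
```

Memory then grows with the number of distinct words per class rather than with messages times words.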


In [59]:

# Isolating spam and ham messages first
spam = vocabulary[vocabulary.label == 'spam']
ham = vocabulary[vocabulary.label == 'ham']
# probabilities of spam and ham
p_spam = len(spam) / len(vocabulary)
p_ham = len(ham) / len(vocabulary)

# Number of words in spam/ham
spam_word_count = spam.sum(axis=1, numeric_only=True)  # numeric_only skips the text 'label' column
n_spam = spam_word_count.sum()
ham_word_count = ham.sum(axis=1, numeric_only=True)
n_ham = ham_word_count.sum()
n_vocabulary=len(vocabulary.columns)-1  # number of unique words (all columns except 'label')

# Laplace smoothing: needed so that a word never seen in one class does not get zero probability
alpha = 1

## Pre-calculated values

The probability values of all the words are held in two dictionaries, one for spam and one for ham.

In [60]:
p_spam_dict=dict.fromkeys(spam.columns[:-1])
p_ham_dict=dict.fromkeys(ham.columns[:-1])# I know that the columns are the same, 
#but this is "more intuitive"

Now comes the calculation of each "probability" value:

$$P(w_i|\textrm{Spam})=\frac{N_{w_i | \textrm{Spam}}+ \alpha}{N_\textrm{Spam}+ \alpha \cdot N_\textrm{Vocabulary}} $$

$$P(w_i|\textrm{Ham})=\frac{N_{w_i | \textrm{Ham}}+ \alpha}{N_\textrm{Ham}+ \alpha \cdot N_\textrm{Vocabulary}} $$
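A small worked example may make the effect of $\alpha$ clearer. The sketch below implements the formula above as a standalone function; `smoothed_probability` and the toy numbers (7 words in the class, 8 words in the vocabulary) are hypothetical and chosen only for illustration.

```python
def smoothed_probability(n_word_in_class, n_class, n_vocabulary, alpha=1):
    """P(word | class) with additive (Laplace) smoothing."""
    return (n_word_in_class + alpha) / (n_class + alpha * n_vocabulary)

# Hypothetical numbers: the class contains 7 words in total,
# and the vocabulary holds 8 distinct words.
p_seen = smoothed_probability(2, 7, 8)    # word seen twice in the class
p_unseen = smoothed_probability(0, 7, 8)  # word never seen in the class

print(p_seen, p_unseen)
```

With $\alpha = 1$ the unseen word gets $1/15$ instead of $0$, so a single unknown-to-this-class word can no longer zero out the whole product.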



In [61]:
for word in vocabulary.columns[:-1]:
    Nw_inspam=spam[word].sum()
    p_spam_dict[word]=(Nw_inspam+alpha)/(n_spam + alpha*n_vocabulary)
    #same for ham:
    Nw_inham=ham[word].sum()
    p_ham_dict[word]=(Nw_inham+alpha)/(n_ham + alpha*n_vocabulary)
    

That is done: I now have values associated with all the words in the training set.

Now to start building the spam filter.
This filter is essentially a function that:

- Takes an input message (and cleans and splits it).
- Calculates P(Spam) and P(Ham) for the message based on the words in it.
- Decides, based on these values, whether the message is spam, ham, or needs human help.

In [72]:
def classify_sph(message, quiet=True):
    # quiet=False prints the values for manual testing; keep quiet=True when classifying in bulk
    # Clean the message
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()

    # Start from the prior spam & ham probabilities
    p_spam_on_message=p_spam
    p_ham_on_message=p_ham
    # Evaluate the words
    for word in message:
        if word in p_spam_dict:  # skip words that were not seen in training
            p_spam_on_message *=p_spam_dict[word]
        if word in p_ham_dict:  # the two dictionaries share their keys, so this
            # second 'if' could be merged into the first
            p_ham_on_message *=p_ham_dict[word]
    # Finally, print the p values and decide
    if not quiet:
        print("P(Spam): "+ str(p_spam_on_message))
        print("P(Ham): "+ str(p_ham_on_message))
        if p_spam_on_message>p_ham_on_message:
            print("Label: Spam")
        elif p_spam_on_message<p_ham_on_message:
            print("Label: Ham")
        else:
            print("I have no clue.")  # this is VERY unlikely given that the priors already differ
            # I would probably employ a "confidence interval" here.
    # Return the label so that the function can be used in bulk
    if p_spam_on_message>p_ham_on_message:
        return "spam"
    elif p_spam_on_message<p_ham_on_message:
        return "ham"
    else:
        return np.nan
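Multiplying many small probabilities can underflow to 0.0 for long messages, which is one plausible way for both values to end up equal. A common remedy is to add logarithms instead. The sketch below is a variant under that assumption, not the function used in this notebook; `log_scores` and the toy dictionaries are hypothetical.

```python
import math
import re

def log_scores(message, p_spam, p_ham, p_spam_dict, p_ham_dict):
    """Return the log of the spam score and of the ham score for a message."""
    words = re.sub(r"\W", " ", message).lower().split()
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Skip words that were not seen in training
        if word in p_spam_dict and word in p_ham_dict:
            log_spam += math.log(p_spam_dict[word])
            log_ham += math.log(p_ham_dict[word])
    return log_spam, log_ham

# Hypothetical toy probabilities, only to exercise the function
toy_spam = {"win": 0.2, "now": 0.2, "see": 0.05}
toy_ham = {"win": 0.05, "now": 0.1, "see": 0.3}

log_spam, log_ham = log_scores("Win now!!", 0.13, 0.87, toy_spam, toy_ham)
print("spam" if log_spam > log_ham else "ham")

# A very long message stays finite in log space
long_spam, long_ham = log_scores("win " * 1000, 0.13, 0.87, toy_spam, toy_ham)
print(math.isfinite(long_spam), math.isfinite(long_ham))
```

Because the logarithm is monotonic, comparing the two log scores gives the same decision as comparing the products whenever the products do not underflow.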

## Testing

Try it on two messages!

In [66]:
classify_sph("WINNER!! This is the secret code to unlock the money: C3421.", quiet=False)

P(Spam): 1.3481290211300841e-25
P(Ham): 1.9368049028589875e-27
Label: Spam


'Spam'

In [67]:
classify_sph("Sounds good, Tom, then see u there", quiet=False)

P(Spam): 2.4372375665888117e-25
P(Ham): 3.687530435009238e-21
Label: Ham


'Ham'

This seems to be working so far, so let's try it on the test set!


In [73]:
test["predicted"]=test.SMS.apply(classify_sph)  # apply() calls the function once per message
test.head()

Unnamed: 0,Label,SMS,predicted
0,ham,later i guess i needa do mcat study too,ham
1,ham,but i haf enuff space got like 4 mb,ham
2,spam,had your mobile 10 mths update to latest oran...,spam
3,ham,all sounds good fingers makes it difficult ...,ham
4,ham,all done all handed in don t know if mega sh...,ham


In [74]:
test["correct"]=(test.predicted==test.Label)
test.head()

Unnamed: 0,Label,SMS,predicted,correct
0,ham,later i guess i needa do mcat study too,ham,True
1,ham,but i haf enuff space got like 4 mb,ham,True
2,spam,had your mobile 10 mths update to latest oran...,spam,True
3,ham,all sounds good fingers makes it difficult ...,ham,True
4,ham,all done all handed in don t know if mega sh...,ham,True


In [75]:
test.correct.value_counts()

True     1100
False      14
Name: correct, dtype: int64

With 1100 of 1114 test messages classified correctly (about 98.7%), this is not a terrible accuracy for something this simple.
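A single accuracy figure hides which kind of error was made, and that matters for a spam filter. One way to split the errors by type is to count (actual, predicted) pairs. This is a sketch on made-up labels: `confusion_counts` and the toy lists are hypothetical and are not the notebook's test set.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

# Toy labels for illustration only
actual = ["ham", "ham", "spam", "spam", "ham"]
predicted = ["ham", "spam", "spam", "ham", "ham"]

pairs = confusion_counts(actual, predicted)
accuracy = sum(n for (a, p), n in pairs.items() if a == p) / len(actual)

print(pairs[("ham", "spam")], "ham flagged as spam")
print(pairs[("spam", "ham")], "spam let through")
print(accuracy)
```

On the real data, the same counts could be taken from the `Label` and `predicted` columns of `test`.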

Let's look at the messages that got classified incorrectly:

In [79]:
test[~test.correct].sort_values("Label")#Sorting to see ham & spam together

Unnamed: 0,Label,SMS,predicted,correct
152,ham,unlimited texts limited minutes,spam,False
159,ham,26th of july,spam,False
284,ham,nokia phone is lovly,spam,False
293,ham,a boy loved a gal he propsd bt she didnt mind...,,False
302,ham,no calls messages missed calls,spam,False
319,ham,we have sent jd for customer service cum accou...,spam,False
114,spam,not heard from u4 a while call me now am here...,ham,False
135,spam,more people are dogging in your area now call...,ham,False
504,spam,oh my god i ve found your number again i m s...,ham,False
546,spam,hi babe its chloe how r u i was smashed on s...,ham,False


## To sum up: 

We did indeed get a single NaN value, for a message that was supposed to be ham, so that is not a misclassification but rather a no-classification.

- We also got 5 Ham messages classified as spam, which might be a problem should this be used as a 'real' classifier.
- It also let 8 spam messages through, which is less of a problem, as it does not mean that important messages can get lost. Looking at these texts, I would hardly classify them as "messages", so I don't fault this simple classifier for not knowing the words in them.



## Further improvement:

The accuracy of the project can be improved further by:

- Increasing the word pool, so that the classifier knows more words with better confidence.
- Introducing a "gray zone" in the middle for values that are too close together for the algorithm to decide reliably. These might need to be classified as 'ham', since those are the messages one must not miss, and then added to the training sample after 'human classification'.
- Introducing some metric that takes punctuation and capitalization into consideration.
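The "gray zone" idea could look roughly like this. It is only a sketch: `classify_with_gray_zone` is a hypothetical name, it works on the logarithms of the two values, and the margin (spam must be at least 10 times more likely than ham, or vice versa) is an arbitrary assumption that would need tuning.

```python
import math

def classify_with_gray_zone(log_spam, log_ham, margin=math.log(10)):
    """Label a message, or flag it for review when the two scores are too close."""
    log_odds = log_spam - log_ham
    if log_odds > margin:
        return "spam"
    if log_odds < -margin:
        return "ham"
    return "needs review"

# The two manual test messages from the Testing section, using their printed values
first = classify_with_gray_zone(math.log(1.3481290211300841e-25),
                                math.log(1.9368049028589875e-27))
second = classify_with_gray_zone(math.log(2.4372375665888117e-25),
                                 math.log(3.687530435009238e-21))
# A made-up close call: spam only twice as likely as ham
close_call = classify_with_gray_zone(math.log(2e-20), math.log(1e-20))

print(first, second, close_call)
```

Messages that land in "needs review" could be delivered as ham by default and added to the training sample once a human has labelled them.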
    
    
Overall, I've had fun coding this exercise, but now I shall move on to the next projects, as the "free week" is limited to... well, a week.
    