# Building a Spam Filter with Naive Bayes

##### In this project we will use Naive Bayes statistics to assign spam / not spam probabilities to SMS messages.

###### The basis of Naive Bayes is conditional probability and we will be applying this to a pre-labelled collection of SMS messages from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection)

### Load In The Data

In [3]:
import pandas as pd
#sep='\t' because the fields are tab separated; header=None because the file has no header row, so we supply the column names#
sms = pd.read_csv('SMSSpamCollection', sep ='\t', header=None, names=['Label', 'SMS'])
sms.sample(5)

Unnamed: 0,Label,SMS
453,ham,K:)k:)what are detail you want to transfer?acc...
2994,ham,So i'm doing a list of buyers.
2050,ham,How much is blackberry bold2 in nigeria.
2771,ham,No problem. Talk to you later
4572,ham,"CHA QUITEAMUZING THATSCOOL BABE,PROBPOP IN & ..."


In [4]:
sms.describe()

Unnamed: 0,Label,SMS
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30


In [5]:
sms.isnull().sum()

Label    0
SMS      0
dtype: int64

###### The dataset is 'clean'. There are no missing values.

###### Column 0 denotes the spam/ not spam label. Column 1 denotes the actual message.

###### Ham in this context is the opposite of spam (i.e. a genuine SMS).

###### We have 5,572 rows here, which should be enough to make an effective model.

In [6]:
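#Derive the percentages from the counts rather than typing the numbers in by hand#
label_counts = sms['Label'].value_counts()
print(label_counts)
print("Ham =", 100 * label_counts['ham'] / len(sms), "%")
print("Spam =", 100 * label_counts['spam'] / len(sms), "%")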

ham     4825
spam     747
Name: Label, dtype: int64
Ham = 86.59368269921033 %
Spam = 13.406317300789663 %


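###### Roughly 87% of the messages are ham and 13% are spam, so a filter that labelled every message as ham would already be ~87% accurate.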

### Considerations
###### There are some key considerations before we begin.

###### We will need:
###### A way of creating the model

> 
- Splitting each message into its words.
- Creating a list of all unique words (the vocabulary).
- Counting how often each word occurs in each message, with one column per word.
- Assigning each word a probability of appearing in spam and in ham.
- Combining these to find the probabilities of spam / not spam for a message.

###### A way of testing the model
> 
- We can randomly sample 80% of the data to build the model and use it to predict the labels of the remaining 20%.
- Then calculate the % of predictions that were correct.

### Creating The Training and Test Datasets

###### The next task is to split the dataset into training and test datasets.

###### Here we want a balance between a training set large enough to give a good model and a test set large enough to give a representative measure of the model's accuracy.

###### We will use an 80:20 (training:test) split for this purpose.


In [7]:
#Shuffle the whole dataset before splitting: frac=1 samples every row, random_state makes the shuffle reproducible#
model_data = sms.sample(frac=1, random_state=1)
#Find the value of 80% of the dataset#
eighty_percent_length  = round(len(model_data) * 0.8)
#Set training as 0 to 80% of the dataset#
training  = model_data[:eighty_percent_length].reset_index(drop=True)
#Set test as 80% - 100% of the dataset#
test = model_data[eighty_percent_length:].reset_index(drop=True)
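The arithmetic of the split can be checked without loading the data. The sketch below is a plain-Python stand-in, not the notebook's pandas code: the `split_80_20` helper and the use of `random.Random` are assumptions made for illustration. It shuffles row indices reproducibly and cuts them at the 80% mark, using the same rounding as the cell above.

```python
import random

def split_80_20(n_rows, seed=1):
    """Shuffle row indices reproducibly, then cut at the 80% mark."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded, so the split is repeatable
    cut = round(n_rows * 0.8)              # same rounding as the notebook cell
    return indices[:cut], indices[cut:]

# 5572 is the row count reported by sms.describe()
train_idx, test_idx = split_80_20(5572)
print(len(train_idx), len(test_idx))       # 4458 1114
```

No row index appears in both halves, which is the property that makes the later accuracy figure meaningful.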

In [11]:
print("In the training set we have proportions of:","\n",training['Label'].value_counts(normalize = True))
print('\n')
print("In the test set we have proportions of:","\n",test['Label'].value_counts(normalize = True))

In the training set we have proportions of: 
 ham     0.86541
spam    0.13459
Name: Label, dtype: float64


In the test set we have proportions of: 
 ham     0.868043
spam    0.131957
Name: Label, dtype: float64


###### Our training and test sets are very representative of the original data (and of each other): all three contain roughly 87% ham and 13% spam. This should reduce sampling bias when we evaluate the model.

### Formatting the Training Data To Permit Modelling
###### We want our model to assign probabilities of spam/not spam based on the words present in the messages.

###### To do this we should first isolate each of the words in each SMS.

In [12]:
training.head(5)

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [13]:
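#r'\W' matches any character that is not a letter, a digit or an underscore#
#regex=True is needed: since pandas 2.0 str.replace treats the pattern as a literal string by default#
training['SMS'] = training['SMS'].str.replace(r'\W', ' ', regex=True)
training['SMS'] = training['SMS'].str.lower()
training['SMS'] = training['SMS'].str.split()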
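To see what these three steps do to a single message, here is a minimal sketch using the standard `re` module instead of the pandas string methods. The `tokenize` helper name is an assumption for illustration. Note that in Python 3 `\W` is Unicode-aware, so characters such as `ü` count as word characters and are kept, while an apostrophe splits a word in two.

```python
import re

def tokenize(message):
    """Replace non-word characters with spaces, lower-case, split on whitespace."""
    return re.sub(r'\W', ' ', message).lower().split()

print(tokenize("Yep, by the pretty sculpture"))
# ['yep', 'by', 'the', 'pretty', 'sculpture']
print(tokenize("There's a card"))
# ['there', 's', 'a', 'card']
print(tokenize("ask ü all_smth.."))
# ['ask', 'ü', 'all_smth']
```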

In [14]:
training.head(5)

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


### Building The Vocabulary

###### Next we will need a list of the unique words in order to create our parameters.

###### Later we will calculate for each word a probability of spam or not spam.

In [16]:
vocabulary = []
#Use 'message' as the loop variable so that the sms DataFrame is not overwritten#
for message in training['SMS']:
    for word in message:
        vocabulary.append(word)

#Set function removes all duplicates#
#List function returns this to a list allowing iteration#
vocabulary = list(set(vocabulary))

In [17]:
print(len(vocabulary))
print(vocabulary[:15])

7783
['essential', 'ecstasy', 'recorder', 'xxuk', 'surya', 'exactly', 'pax', 'gymnastics', 'split', 'shows', 'srt', 'frontierville', '2', 'pick', 'dream']


###### We have 7783 unique words (or word-like tokens, since the messages also contain numbers and abbreviated forms, the so-called 'text speak').

In [20]:
#Create a dictionary with one key per unique word; each value is a list of zeros with one slot per training message#
word_counts_per_sms = {unique_word: [0] * len(training['SMS']) for unique_word in vocabulary}

In [21]:
#For each message, count the occurrences of each of its words#
#('message' is used as the loop variable so that the sms DataFrame is not overwritten)#
for index, message in enumerate(training['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1

In [22]:
#Convert to a dataframe#
words_database = pd.DataFrame(word_counts_per_sms)
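The dense table built above holds one cell for every (message, word) pair, which for 4458 training messages and 7783 words is roughly 35 million cells, nearly all of them zero. As an alternative sketch (not what this notebook uses), `collections.Counter` can store only the words that actually occur in each message. The toy `messages` list below is made up for illustration and stands in for `training['SMS']`.

```python
from collections import Counter

# Toy tokenised messages standing in for training['SMS'] (illustrative only)
messages = [
    ['free', 'prize', 'call', 'now'],
    ['call', 'me', 'later'],
    ['free', 'free', 'entry'],
]

# One Counter per message stores only the words that actually occur in it
counts_per_message = [Counter(message) for message in messages]

# The vocabulary is the set of every word seen in any message
vocabulary = sorted({word for message in messages for word in message})

print(vocabulary)
# ['call', 'entry', 'free', 'later', 'me', 'now', 'prize']
print(counts_per_message[2]['free'])   # 2
print(counts_per_message[1]['free'])   # 0, missing words count as zero
```

The trade-off is that the dense DataFrame makes column sums a one-liner, whereas the sparse form needs an explicit loop to total a word across messages.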

In [23]:
words_database.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


###### Most of the entries are zero. Let's preview a common word and see how frequent it is.

In [29]:
words_database['the'].value_counts()

0     3622
1      661
2      133
3       29
4       10
10       1
5        1
8        1
Name: the, dtype: int64

###### So 'the' appears in 836 messages and is absent from 3622 messages in the training set.

###### This could well be true, as text speak is often made up of disjointed or ungrammatical sentences.

In [30]:
#Next step is to join this database with our original training database#
#This will attach the spam / not spam labels we need#
training_database = pd.concat([training, words_database], axis=1)

In [31]:
training_database.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


### Steps To Create The Spam Filter
###### The next step is to create the function that acts as a spam filter.

###### To do this we shall utilise the Naive Bayes algorithm.

###### This states:
> Probability of spam given a message =
- Is proportional to the Probability of Spam x (Probability of Word 1 given it's Spam) x (Probability of Word 2 given it's Spam) etc. The same calculation is made for ham, and the larger of the two values decides the label.

>Probability of Word 1 given it's spam =
- (Number of times Word 1 appears in spam messages + α) / (number of spam words including duplicates + (α x number of words in the vocabulary))

> The 'Naive' part of the algorithm =
- The assumption that the words in a message are independent of each other given the label, which is what allows the word probabilities to be multiplied together.

> α is the smoothing factor, which =
- Ensures that a word with zero instances in spam (or ham) does not generate a zero probability in the final result. We shall set this to 1, known as Laplace smoothing.

###### More information on Laplace smoothing can be found [here](https://towardsdatascience.com/introduction-to-na%C3%AFve-bayes-classifier-fa59e3e24aaf)
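As a worked example of the smoothed word probability, the sketch below applies the formula to a made-up vocabulary of four words and three spam words. All values here are toy assumptions, not figures from the dataset. `fractions.Fraction` is used so that the results are exact and easy to check by hand.

```python
from collections import Counter
from fractions import Fraction

alpha = 1
vocabulary = ['free', 'prize', 'hello', 'later']   # toy vocabulary
spam_words = ['free', 'prize', 'free']             # every word in the toy spam messages

word_counts = Counter(spam_words)
n_spam = len(spam_words)                            # 3, duplicates included
n_vocabulary = len(vocabulary)                      # 4

# (count of word in spam + alpha) / (n_spam + alpha * n_vocabulary)
p_word_given_spam = {
    word: Fraction(word_counts[word] + alpha, n_spam + alpha * n_vocabulary)
    for word in vocabulary
}

for word, p in p_word_given_spam.items():
    print(word, p)
# free 3/7
# prize 2/7
# hello 1/7
# later 1/7
print('total:', sum(p_word_given_spam.values()))    # total: 1
```

'hello' and 'later' never occur in the toy spam, yet both receive a probability of 1/7 rather than 0, and the probabilities over the whole vocabulary still sum to 1.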

In [32]:
#Separate spam and non spam messages#
spam_messages = training_database[training_database['Label'] == 'spam']
ham_messages = training_database[training_database['Label'] == 'ham']

#Probabilities of the message being spam or not spam#
p_spam = len(spam_messages) / (len(training_database))
p_ham = len(ham_messages) / (len(training_database))

#Number of words per each spam or not spam message, duplicates are not ignored#
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_words_per_ham_message = ham_messages['SMS'].apply(len)

#Total number of spam or not spam words, duplicates are not ignored#
n_spam = n_words_per_spam_message.sum()
n_ham = n_words_per_ham_message.sum()

#Total number of words, duplicates ignored#
n_vocabulary = len(vocabulary)

#Laplace smoothing factor#
alpha = 1

### Calculating The Parameters
###### This is the model part of the spam filter:

- We will need to create dictionaries for spam & ham
- Assign each word in the vocabulary a probability of appearing in spam and in ham
- We shall use alpha for the Laplace smoothing to ensure we get no unwanted zero probabilities

In [33]:
#Create two dictionaries to hold the probability of each word given spam and given ham (initialised to 0)#
parameters_spam = {unique_word: 0 for unique_word in vocabulary}
parameters_ham = {unique_word: 0 for unique_word in vocabulary}

In [35]:
for word in vocabulary:
    #total occurrences of the word in spam#
    n_word_given_spam = spam_messages[word].sum()
    #smoothed probability of the word given spam#
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + (alpha * n_vocabulary))
    parameters_spam[word] = p_word_given_spam
    
    #total occurrences of the word in ham#
    n_word_given_ham = ham_messages[word].sum()
    #smoothed probability of the word given ham#
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + (alpha * n_vocabulary))
    parameters_ham[word] = p_word_given_ham

### Creating A Classification Function
The next step is to create a function that can:
- Split up a message into words
- Ignore any word that was not seen in the training data
- Multiply the Probability of Spam by the probability of each word given spam (1)
- Multiply the Probability of Ham by the probability of each word given ham (2)
- If (1) > (2) then the message is likely to be spam
- If (2) > (1) then the message is likely to not be spam (i.e. ham)
- If (1) == (2) then we should flag that this needs human attention

In [37]:
#import Regex for message editing
import re

def classifier(message):
    #Split the message into its words#
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    #Start from the prior probabilities calculated earlier#
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    #For each word multiply the running value for spam (or not spam)#
    #by that word's probability; words outside the vocabulary are skipped#
    
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]    
    
    #The results are proportional to the probability of the message
    # as a whole being spam or not spam (they are not normalised to sum to 1).
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    if p_ham_given_message > p_spam_given_message:
        return 'Ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'Spam'
    else:
        return 'Equal probabilites calculated. Suggest human classify this!'    

Now let's test the function with 'obvious' spam and not spam messages.

In [38]:
classifier('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27


'Spam'

In [39]:
classifier("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21


'Ham'

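###### This has worked as expected. It is interesting to note how small the values are. Each word contributes a factor well below 1, because our ~8000 vocabulary words share the probability between them, so the product shrinks rapidly with every word in the message.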
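The two printed values are only proportional to the true probabilities, so they do not sum to 1. A minimal sketch of how they could be rescaled is shown below. The `normalise` helper is an assumption for illustration and is not part of the classifier above. The inputs are the scores printed for the two test messages.

```python
def normalise(score_spam, score_ham):
    """Turn two unnormalised scores into probabilities that sum to 1."""
    total = score_spam + score_ham
    return score_spam / total, score_ham / total

# Scores printed above for the 'WINNER!!' message
winner_spam, winner_ham = normalise(1.3481290211300841e-25, 1.9368049028589875e-27)
print(round(winner_spam, 3), round(winner_ham, 3))   # 0.986 0.014

# Scores printed above for the 'Sounds good, Tom' message
tom_spam, tom_ham = normalise(2.4372375665888117e-25, 3.687530435009238e-21)
print(tom_spam < 0.001)                               # True
```

Rescaling does not change which label wins; it only makes the confidence easier to read.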

###### A real world spam filter will have a far larger vocabulary than this, and its values will be far smaller.
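Very small products are a practical problem: multiply enough of them together and a floating-point number underflows to exactly 0.0. A common workaround is to add logarithms instead of multiplying probabilities. The sketch below shows the idea. It is not the classifier used in this notebook: the `classify_log` function, its `'tie'` return value and the toy parameter dictionaries are all assumptions for illustration.

```python
import math
import re

def classify_log(message, p_spam, p_ham, parameters_spam, parameters_ham):
    """Compare sums of log-probabilities instead of raw products."""
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in the original function
        if word in parameters_spam:
            log_spam += math.log(parameters_spam[word])
        if word in parameters_ham:
            log_ham += math.log(parameters_ham[word])
    if log_spam > log_ham:
        return 'spam'
    if log_ham > log_spam:
        return 'ham'
    return 'tie'

# Hypothetical toy parameters, not values from the trained model
toy_spam = {'free': 0.05, 'prize': 0.04, 'later': 0.001}
toy_ham = {'free': 0.002, 'prize': 0.001, 'later': 0.03}

long_message = 'free prize ' * 200                 # 400 words
raw_spam = 0.13 * (0.05 * 0.04) ** 200             # underflows
raw_ham = 0.87 * (0.002 * 0.001) ** 200            # underflows
print(raw_spam, raw_ham)                           # 0.0 0.0
print(classify_log(long_message, 0.13, 0.87, toy_spam, toy_ham))   # spam
```

With raw products both scores collapse to 0.0 and look equal, whereas the log version still separates them.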

### Classification Accuracy
###### Now we have built a function we can use it to predict the test data.

###### As the test data has labels we can also use this to show how many correct predictions there were and determine the % accuracy.

###### (We will update the function slightly to remove the print statements, just to save scrolling through a mass of output, and to return lower-case labels so that they match the Label column.)

In [40]:
def classifier(message):
    message = re.sub(r'\W',' ',message)
    message = message.lower().split()      
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham      
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'Equal probabilites calculated. Suggest human classify this!'    

In [41]:
test['predicted'] = test['SMS'].apply(classifier)
test.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [42]:
total_rows = len(test)
correct_filter = test['Label'] == test['predicted']
correct = test[correct_filter]
correct_rows = len(correct)
print('Total correct rows in test dataset = {} out of {}'.format(correct_rows, total_rows))
print('Percent accuracy = {}%'.format(round((100* correct_rows/total_rows),2)))

Total correct rows in test dataset = 1100 out of 1114
Percent accuracy = 98.74%


###### ~99% accuracy is well above the ~87% we would get by labelling every message as ham, so this model has strong predictive capability.
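Because the classes are imbalanced, accuracy alone can hide how the errors are distributed. A minimal sketch of the usual complementary measures, precision and recall for the spam class, is given below. The `confusion_counts` helper and the toy label lists are assumptions for illustration and are not the notebook's test results.

```python
def confusion_counts(actual, predicted, positive='spam'):
    """Count true/false positives and negatives for the positive label."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn

# Toy labels for illustration only, not the notebook's test set
actual    = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham',  'ham', 'ham']
predicted = ['spam', 'spam', 'ham',  'ham', 'ham', 'spam', 'ham', 'ham']

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)     # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)        # of the real spam, how much was caught
print(tp, fp, fn, tn)          # 2 1 1 4
print(accuracy)                # 0.75
```

For a spam filter a false positive (genuine message hidden) is usually the more costly error, which is why precision is worth tracking alongside accuracy.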

###### Let's preview the messages that our model got wrong.

In [43]:
incorrect_filter = test['Label'] != test['predicted']
incorrect = test[incorrect_filter]
incorrect[['Label','predicted']]

Unnamed: 0,Label,predicted
114,spam,ham
135,spam,ham
152,ham,spam
159,ham,spam
284,ham,spam
293,ham,Equal probabilites calculated. Suggest human c...
302,ham,spam
319,ham,spam
504,spam,ham
546,spam,ham


In [44]:
incorrect['predicted'].value_counts()

ham                                                            8
spam                                                           5
Equal probabilites calculated. Suggest human classify this!    1
Name: predicted, dtype: int64

###### The majority of the incorrectly classified messages (8 of 14) were predicted as not spam when they were in fact spam.

###### Let's look at the actual spam messages and see if there is a reason the model failed.

In [46]:
spam_filter = incorrect['Label'] == 'spam'
incorrect_spam = incorrect[spam_filter]
for index, values in enumerate(incorrect_spam['SMS']):
    print(index, values)
    print('\n')

0 Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net


1 More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB


2 Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50


3 Hi babe its Chloe, how r u? I was smashed on saturday night, it was great! How was your weekend? U been missing me? SP visionsms.com Text stop to stop 150p/text


4 0A$NETWORKS allow companies to bill for SMS, so they are responsible for their "suppliers", just as a shop has to give a guarantee on what they sell. B. G.


5 RCT' THNQ Adrian for U text. Rgds Vatian


6 2/2 146tf150p


7 Hello. We need some posh birds and chaps to user trial prods for champneys. Can i put you down? I need your address and dob asap. Ta r




###### There are lots of things here that would give an indication that they are spam messages:
- References to money/bills
- Phone numbers

###### It could be that most of these messages are quite long and the other words contribute a greater probability under the ham model.

###### Or it could be that the telling words/numbers do not exist in the training set, in which case the classifier simply skips them.
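Both hypotheses can be examined one word at a time. The sketch below reports, for each known word, the log of the ratio between its spam and ham probabilities, and lists the words the model has never seen. The `word_evidence` helper and the toy parameters are assumptions for illustration. With the real model, `parameters_spam` and `parameters_ham` would be passed in instead.

```python
import math
import re

def word_evidence(message, parameters_spam, parameters_ham):
    """Return (log-ratio per known word, list of unknown words)."""
    words = re.sub(r'\W', ' ', message).lower().split()
    known, unknown = {}, []
    for word in words:
        if word in parameters_spam and word in parameters_ham:
            # > 0 pushes towards spam, < 0 pushes towards ham
            known[word] = math.log(parameters_spam[word] / parameters_ham[word])
        else:
            unknown.append(word)
    return known, unknown

# Hypothetical parameters, not values from the trained model
toy_spam = {'call': 0.02, 'now': 0.01, 'me': 0.004}
toy_ham = {'call': 0.005, 'now': 0.005, 'me': 0.02}

known, unknown = word_evidence('Call me now 09090204448', toy_spam, toy_ham)
print(unknown)                  # ['09090204448']
print(known['call'] > 0)        # True, 'call' leans towards spam here
print(known['me'] < 0)          # True, 'me' leans towards ham here
```

A phone number that never appeared in training lands in `unknown` and contributes nothing, even though a human reader would treat it as a strong spam signal.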

###### We will now look at the one instance where the model assigned equal probabilities when the data was pre-labelled as not spam

In [47]:
unknown_spam_filter = incorrect['predicted'] == 'Equal probabilites calculated. Suggest human classify this!'
unknown_incorrect = incorrect[unknown_spam_filter]
unknown_incorrect['SMS'].values

array(['A Boy loved a gal. He propsd bt she didnt mind. He gv lv lttrs, Bt her frnds threw thm. Again d boy decided 2 aproach d gal , dt time a truck was speeding towards d gal. Wn it was about 2 hit d girl,d boy ran like hell n saved her. She asked \'hw cn u run so fast?\' D boy replied "Boost is d secret of my energy" n instantly d girl shouted "our energy" n Thy lived happily 2gthr drinking boost evrydy Moral of d story:- I hv free msgs:D;): gud ni8'],
      dtype=object)

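###### This was labelled as ham in the dataset. Reading through it gives the impression that it is an advert for a 'Boost' energy drink.

###### I think I would classify this as spam. Perhaps the labelling is not 100% accurate?

###### The tie itself may be a numerical artefact rather than a genuinely balanced message: the message is long, so both products of small probabilities could have underflowed to exactly 0.0.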

### Conclusion

###### Limitations of the model:
> As with all supervised learning the model is only as good as the data input:
- Do we have enough data to be meaningful?
- Are the labels 100% correct?
- Should the messages be case sensitive (our model was case insensitive)?
- Should symbols be included?

###### A successful spam filter can utilise Naive Bayes to form a highly predictive model.

###### Modern spam filters allow humans to mark messages as spam / ham in real time, updating and improving the predictive capability.

###### However there are limitations: scammers are well aware of the widespread application of spam filters and will often try to evade detection. This frequently involves misspellings or irregular characters (e.g. 'c@$h 4 uGetNow').

###### This is in effect an arms race, and Bayesian statistics is at the heart of it.
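To illustrate why such obfuscation works against a word-based model, the sketch below runs the example string through the same cleaning steps used in this project. The `tokenize` helper name and the plain-English rendering of the message are assumptions for illustration.

```python
import re

def tokenize(message):
    """Same cleaning as the classifier: strip non-word characters, lower-case, split."""
    return re.sub(r'\W', ' ', message).lower().split()

plain = tokenize('cash for you, get now')      # assumed plain rendering
obfuscated = tokenize('c@$h 4 uGetNow')

print(plain)                          # ['cash', 'for', 'you', 'get', 'now']
print(obfuscated)                     # ['c', 'h', '4', 'ugetnow']
print(set(plain) & set(obfuscated))   # set()
```

The two versions share no tokens at all, so nothing the model has learned about 'cash' carries over to the obfuscated message.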