# Building a Spam Filter with Naive Bayes

This project builds a spam filter based on the Naive Bayes algorithm and uses it to classify SMS messages as spam or non-spam (ham). The dataset contains 5,572 SMS messages that humans have already labeled; 80% of them are used to train the filter and the rest to test it. The data comes from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).
For background information, see the Wikipedia article on the [Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier).

## Import dataset and analyze the data

In [1]:
import pandas as pd

data = pd.read_csv('SMSSpamCollection.csv', sep='\t', header=None, names=['Label', 'SMS']) 
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.2+ KB


In [2]:
# show the first five rows
data.head(5)

|   | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
# how many spam vs non-spam messages are in the dataset?
data['Label'].value_counts()

ham     4825
spam     747
Name: Label, dtype: int64

In [4]:
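# share of spam, computed from the labels instead of hard-coded counts
percentage_spam = round(data['Label'].value_counts(normalize=True)['spam'] * 100, 1)
"In the dataset {} % of the messages are spam".format(percentage_spam)

'In the dataset 13.4 % of the messages are spam'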

## Create a train and test set

In [5]:
# Randomize the entire dataset
random = data.sample(frac=1, random_state=1)

In [6]:
# Split the randomized dataset into a train and a test set
# 80% (4458 records) goes to training, the remaining 1114 records to testing
train = random[:4458].reset_index(drop=True)
test = random[4458:].reset_index(drop=True)
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4458 entries, 0 to 4457
Data columns (total 2 columns):
Label    4458 non-null object
SMS      4458 non-null object
dtypes: object(2)
memory usage: 69.8+ KB
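
The split index 4458 is hard-coded in the cell above. A minimal sketch, assuming a hypothetical helper `shuffle_and_split` that is not part of the original notebook, derives the index from the training fraction instead. The toy DataFrame only stands in for the SMS data to show the resulting sizes.

```python
import pandas as pd

def shuffle_and_split(df, train_frac=0.8, seed=1):
    """Shuffle df, then cut it into a train and a test part (hypothetical helper)."""
    shuffled = df.sample(frac=1, random_state=seed)
    split_at = round(len(shuffled) * train_frac)  # 5572 rows * 0.8 rounds to 4458
    train_part = shuffled.iloc[:split_at].reset_index(drop=True)
    test_part = shuffled.iloc[split_at:].reset_index(drop=True)
    return train_part, test_part

# Toy stand-in for the SMS data, invented for illustration
toy = pd.DataFrame({'Label': ['ham'] * 8 + ['spam'] * 2,
                    'SMS': ['message {}'.format(i) for i in range(10)]})
toy_train, toy_test = shuffle_and_split(toy)
print(len(toy_train), len(toy_test))
```

With the real data, `shuffle_and_split(data)` should reproduce the 4458/1114 split, since it uses the same `sample(frac=1, random_state=1)` call.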


In [7]:
# number of ham and spam messages in the train set
train['Label'].value_counts()

ham     3858
spam     600
Name: Label, dtype: int64

In [8]:
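# share of spam in the training set, computed from the labels
percentage_spam_train = round(train['Label'].value_counts(normalize=True)['spam'] * 100, 1)
"In the training dataset {} % of the messages are spam".format(percentage_spam_train)

'In the training dataset 13.5 % of the messages are spam'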

In [9]:
# number of ham and spam messages in the test set
test['Label'].value_counts()

ham     967
spam    147
Name: Label, dtype: int64

In [10]:
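# share of spam in the test set, computed from the labels
percentage_spam_test = round(test['Label'].value_counts(normalize=True)['spam'] * 100, 1)
"In the test dataset {} % of the messages are spam".format(percentage_spam_test)

'In the test dataset 13.2 % of the messages are spam'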

The percentages of spam messages in the train and test sets (13.5% and 13.2%) are close to the 13.4% of the original dataset, so the random split kept the class balance.
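
The three percentage cells repeat the same calculation. As a sketch, a small helper (the name `spam_share` is an assumption, not from the notebook) could compute it for any label column. The label Series below are rebuilt from the counts printed above rather than read from the file.

```python
import pandas as pd

def spam_share(labels):
    """Return the percentage of 'spam' labels, rounded to one decimal."""
    shares = labels.value_counts(normalize=True)
    return round(shares.get('spam', 0.0) * 100, 1)

# Label columns rebuilt from the value_counts() outputs shown above
full_labels = pd.Series(['ham'] * 4825 + ['spam'] * 747)
train_labels = pd.Series(['ham'] * 3858 + ['spam'] * 600)
test_labels = pd.Series(['ham'] * 967 + ['spam'] * 147)

for name, labels in [('full', full_labels), ('train', train_labels), ('test', test_labels)]:
    print(name, spam_share(labels))
```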

In [11]:
train.head(10)

|   | Label | SMS |
|---|---|---|
| 0 | ham | Yep, by the pretty sculpture |
| 1 | ham | Yes, princess. Are you going to make me moan? |
| 2 | ham | Welp apparently he retired |
| 3 | ham | Havent. |
| 4 | ham | I forgot 2 ask ü all smth.. There's a card on ... |
| 5 | ham | Ok i thk i got it. Then u wan me 2 come now or... |
| 6 | ham | I want kfc its Tuesday. Only buy 2 meals ONLY ... |
| 7 | ham | No dear i was sleeping :-P |
| 8 | ham | Ok pa. Nothing problem:-) |
| 9 | ham | Ill be there on &lt;#&gt; ok. |


## Clean the dataset

In [12]:
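# remove punctuation: replace every non-word character with a space
# (regex=True is required in recent pandas versions, where str.replace defaults to literal matching)
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True)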
train.head(10)

|   | Label | SMS |
|---|---|---|
| 0 | ham | Yep by the pretty sculpture |
| 1 | ham | Yes princess Are you going to make me moan |
| 2 | ham | Welp apparently he retired |
| 3 | ham | Havent |
| 4 | ham | I forgot 2 ask ü all smth There s a card on ... |
| 5 | ham | Ok i thk i got it Then u wan me 2 come now or... |
| 6 | ham | I want kfc its Tuesday Only buy 2 meals ONLY ... |
| 7 | ham | No dear i was sleeping P |
| 8 | ham | Ok pa Nothing problem |
| 9 | ham | Ill be there on lt gt ok |
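
To see what the `\W` pattern does to a single message, here is a small sketch using Python's `re` module directly. The helper name `tokenize` is an assumption; it chains the three cleaning steps this section applies (replace non-word characters, lowercase, split). Note that `\W` keeps letters, digits and underscores, and that HTML entities such as `&lt;` leave the fragments `lt` and `gt` behind.

```python
import re

def tokenize(message):
    """Replace non-word characters with spaces, lowercase, and split on whitespace."""
    cleaned = re.sub(r'\W', ' ', message)
    return cleaned.lower().split()

print(tokenize('No dear i was sleeping :-P'))
# ['no', 'dear', 'i', 'was', 'sleeping', 'p']
print(tokenize('Ill be there on &lt;#&gt; ok.'))
# ['ill', 'be', 'there', 'on', 'lt', 'gt', 'ok']
print(tokenize('free_entry 2 win!'))
# ['free_entry', '2', 'win']
```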


In [13]:
# Transform every letter in every word to lowercase
train['SMS'] = train['SMS'].str.lower()
train.head(5)

|   | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |


In [14]:
# Create a vocabulary for the messages in the training set.
# The vocabulary should be a Python list containing all the unique words across all messages, 
# where each word is represented as a string.

# Create a list of words
train['SMS'] = train['SMS'].str.split()
vocabulary = []

for item in train['SMS']:
    for word in item:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))
vocabulary

['loads',
 'lies',
 '41782',
 'want',
 'werethe',
 '37819',
 '24',
 'min',
 'arabian',
 'ugadi',
 'came',
 'tke',
 'chit',
 'able',
 'lotsof',
 'invitation',
 'deserve',
 'active',
 'thirunelvali',
 'spoon',
 'behalf',
 'pretend',
 'rob',
 'unconvinced',
 'sextextuk',
 '〨ud',
 'raping',
 'opt',
 'file',
 'man',
 'pan',
 'suply',
 'invest',
 'student',
 'pick',
 'tor',
 'processed',
 'is',
 'lennon',
 'astoundingly',
 'ummifying',
 'body',
 '2morow',
 'laptop',
 'zyada',
 'church',
 'hype',
 'goodnight',
 'intro',
 'netcollex',
 'limping',
 'pleasure',
 'shipping',
 'permission',
 'general',
 'mileage',
 'aldrine',
 'enough',
 'plane',
 'ppm',
 'frosty',
 'snow',
 'outside',
 'beerage',
 'jones',
 'thursday',
 'science',
 '3uz',
 'lower',
 'crowd',
 'defeat',
 'staring',
 '087104711148',
 'here',
 'bothering',
 'patrick',
 'watching',
 'woken',
 '2004',
 'nothin',
 'prey',
 'crab',
 'p',
 'ru',
 'inconsiderate',
 'excuses',
 'slo',
 '10',
 'possible',
 'karo',
 'auto',
 'othrwise',
 'di
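
The nested loop that builds the vocabulary can also be written as a set comprehension. A minimal sketch on three toy messages (invented for illustration); the sorting is an extra step added here only to make the order reproducible, since a plain `set` has no guaranteed order.

```python
# Toy tokenized messages, standing in for train['SMS'] after .str.split()
toy_messages = [
    ['yep', 'by', 'the', 'pretty', 'sculpture'],
    ['welp', 'apparently', 'he', 'retired'],
    ['he', 'retired', 'by', 'the', 'sea'],
]

# Collect every word once, then sort for a stable order
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})
print(len(toy_vocabulary))
print(toy_vocabulary)
```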

In [15]:
# Create a dictionary that maps each vocabulary word to a list of counts,
# one entry per SMS: how often the word occurs in that message
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [16]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

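|   | loads | lies | 41782 | want | werethe | 37819 | 24 | min | arabian | ugadi | ... | velusamy | desires | crucial | nok | youre | guai | warming | orig | verified | situations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |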
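
Because the real count table has one column per vocabulary word, its structure is easier to see on a tiny example. The sketch below repeats the same counting logic on two toy messages (invented for illustration).

```python
import pandas as pd

# Two toy tokenized messages; 'free' occurs twice in the first one
toy_messages = [['free', 'entry', 'free'], ['ok', 'lar']]
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One list of zeros per word, one position per message
counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, sms in enumerate(toy_messages):
    for word in sms:
        counts[word][index] += 1

toy_counts = pd.DataFrame(counts)
print(toy_counts)
```

Each row is a message and each column a word, so `toy_counts.loc[0, 'free']` holds the number of times `free` appears in the first message.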


In [17]:
training_set_clean = pd.concat([train, word_counts], axis=1)
training_set_clean.head()

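|   | Label | SMS | loads | lies | 41782 | want | werethe | 37819 | 24 | min | ... | velusamy | desires | crucial | nok | youre | guai | warming | orig | verified | situations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |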


In [18]:
# Calculate the probabilities P(Spam) and P(Ham)

# P(Spam) = Label(Spam)/Total 

p_spam = len(training_set_clean[training_set_clean['Label']=='spam'])/len(training_set_clean)
p_spam

0.13458950201884254

In [19]:
# P(Ham) = Label(Ham)/Total

p_ham = len(training_set_clean[training_set_clean['Label']=='ham'])/len(training_set_clean)
p_ham

0.8654104979811574

In [20]:
# Calculate N_Spam; total number of words in all spam messages

n_spam = 0

for item in training_set_clean[training_set_clean['Label']=='spam']['SMS']:
    n_spam += len(item)
n_spam

15190

In [21]:
# Calculate N_Ham; total number of words in all ham messages

n_ham = 0

for item in training_set_clean[training_set_clean['Label']=='ham']['SMS']:
    n_ham += len(item)
n_ham

57237

In [22]:
# Calculate N_Vocabulary; Total number of unique words in all messages

n_vocabulary = len(vocabulary)
n_vocabulary

7783

In [23]:
# In the calculations Laplace smoothing with a value of 1 is used

alpha = 1
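
Written out, the smoothed parameters that the notebook estimates from `n_spam`, `n_ham`, `n_vocabulary` and `alpha` take the following form. The symbol $N_{w_i \mid Spam}$ is notation assumed here for the number of times word $w_i$ occurs in the spam messages of the training set, and likewise for ham.

$$
P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\qquad
P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
$$

With $\alpha = 1$, a word that never occurs in spam still gets a small non-zero probability, so a single unseen word cannot force the whole product for a class to zero.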

In [24]:
# Calculate the parameters P(wi|Spam) and P(wi|Ham)

p_wi_spam = {}
p_wi_ham = {}

for item in vocabulary:
    p_wi_spam[item] = 0
    p_wi_ham[item] = 0
    
spam = training_set_clean[training_set_clean['Label']=='spam']
ham = training_set_clean[training_set_clean['Label']=='ham']

for word in vocabulary:
    n_word_given_spam = spam[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha)/(n_spam + alpha * n_vocabulary)
    p_wi_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha)/(n_ham + alpha * n_vocabulary)
    p_wi_ham[word] = p_word_given_ham
    
p_wi_spam    


{'loads': 8.705872110738693e-05,
 'lies': 4.3529360553693465e-05,
 '41782': 8.705872110738693e-05,
 'want': 0.00113176337439603,
 'werethe': 4.3529360553693465e-05,
 '37819': 8.705872110738693e-05,
 '24': 0.00021764680276846734,
 'min': 0.0016105863404866582,
 'arabian': 4.3529360553693465e-05,
 'ugadi': 4.3529360553693465e-05,
 'came': 8.705872110738693e-05,
 'tke': 4.3529360553693465e-05,
 'chit': 8.705872110738693e-05,
 'able': 4.3529360553693465e-05,
 'lotsof': 4.3529360553693465e-05,
 'invitation': 4.3529360553693465e-05,
 'deserve': 4.3529360553693465e-05,
 'active': 0.0001305880816610804,
 'thirunelvali': 4.3529360553693465e-05,
 'spoon': 4.3529360553693465e-05,
 'behalf': 4.3529360553693465e-05,
 'pretend': 4.3529360553693465e-05,
 'rob': 4.3529360553693465e-05,
 'unconvinced': 4.3529360553693465e-05,
 'sextextuk': 8.705872110738693e-05,
 '〨ud': 4.3529360553693465e-05,
 'raping': 4.3529360553693465e-05,
 'opt': 0.0006094110477517085,
 'file': 4.3529360553693465e-05,
 'man': 8.7
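
The values in this output can be checked by hand against the smoothing formula. A short sketch using the totals printed earlier (`n_spam = 15190`, `n_vocabulary = 7783`); the helper name `smoothed` is an assumption. The word counts 0, 1 and 25 are inferred from the printed probabilities, not read from the data.

```python
alpha = 1
n_spam = 15190        # total number of words in spam messages, from In [20]
n_vocabulary = 7783   # number of unique words, from In [22]
denominator = n_spam + alpha * n_vocabulary   # 22973

def smoothed(count):
    """Laplace-smoothed P(word|Spam) for a word seen `count` times in spam."""
    return (count + alpha) / denominator

print(smoothed(0))    # matches 'lies': the word never occurs in spam
print(smoothed(1))    # matches 'loads': one occurrence in spam
print(smoothed(25))   # matches 'want': 25 occurrences in spam
```

The smallest value in the dictionary, about 4.35e-05, is therefore the probability assigned to every word that appears only in ham messages.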

In [25]:
# Function to be used as spam filter

import re

def classify(message):

    # raw string avoids an invalid-escape warning for '\W'
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in p_wi_spam:
            p_spam_given_message *= p_wi_spam[word]
        if word in p_wi_ham:
            p_ham_given_message *= p_wi_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [26]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam
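
The two numbers printed as `P(Spam|message)` and `P(Ham|message)` are the numerators of Bayes' rule, that is, the prior multiplied by the word probabilities. They are proportional to the posterior probabilities but do not sum to 1. That is enough to pick a label, because both share the same denominator. As a sketch, dividing each score by their sum gives actual probabilities (the helper name `normalize` is an assumption):

```python
def normalize(score_spam, score_ham):
    """Turn two unnormalized class scores into probabilities that sum to 1."""
    total = score_spam + score_ham
    return score_spam / total, score_ham / total

# Scores printed by classify() for the 'WINNER!!' message above
post_spam, post_ham = normalize(1.3481290211300841e-25, 1.9368049028589875e-27)
print(round(post_spam, 3), round(post_ham, 3))
```

Under the model's assumptions, this message comes out at roughly 98.6% spam.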


In [27]:
classify('Sounds good, Tom, then see u there')

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
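
Scores around 1e-25 for messages of only a few words suggest a practical limit: for a long message the product of many small probabilities can underflow to `0.0` for both classes, producing a false tie. A common remedy, sketched below, is to add logarithms instead of multiplying. This is a variation on the notebook's approach, not the notebook's own code. The function `log_score` and all toy parameters are invented for illustration.

```python
import math

def log_score(words, prior, word_probs):
    """Log prior plus the log probability of each known word; unknown words are skipped."""
    score = math.log(prior)
    for word in words:
        if word in word_probs:
            score += math.log(word_probs[word])
    return score

# Toy parameters, invented for illustration only
toy_p_spam, toy_p_ham = 0.2, 0.8
toy_wi_spam = {'win': 0.05, 'money': 0.04, 'see': 0.001}
toy_wi_ham = {'win': 0.002, 'money': 0.003, 'see': 0.02}

words = ['win', 'money', 'now'] * 200   # an artificially long message, 600 tokens

# Plain multiplication, as in classify(): underflows to 0.0
naive_spam = toy_p_spam
for word in words:
    if word in toy_wi_spam:
        naive_spam *= toy_wi_spam[word]

log_spam = log_score(words, toy_p_spam, toy_wi_spam)
log_ham = log_score(words, toy_p_ham, toy_wi_ham)
print('naive spam score:', naive_spam)
print('label:', 'spam' if log_spam > log_ham else 'ham')
```

Because the logarithm is monotonic, comparing the log scores yields the same label as comparing the products whenever the products are still representable.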


## Test accuracy

In [28]:
# Test the accuracy of the algorithm
# Same logic as classify(), but the label is returned instead of printed

def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in p_wi_spam:
            p_spam_given_message *= p_wi_spam[word]

        if word in p_wi_ham:
            p_ham_given_message *= p_wi_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [29]:
test['predicted'] = test['SMS'].apply(classify_test_set)
test.head()

|   | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [30]:
# Accuracy of the spam filter
correct = 0
total = len(test)

for row in test.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833
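
Accuracy alone hides which kind of error the filter makes. Since only about 13% of the messages are spam, it helps to look at spam and ham separately. The sketch below computes accuracy without an explicit loop and builds a confusion table with `pd.crosstab`. It runs on a toy result table (invented for illustration), not on the real `test` DataFrame.

```python
import pandas as pd

# Toy stand-in for the test DataFrame with its 'Label' and 'predicted' columns
toy_test = pd.DataFrame({
    'Label':     ['ham', 'ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'ham', 'spam', 'spam', 'ham', 'ham'],
})

# Share of rows where the prediction equals the true label
accuracy = (toy_test['Label'] == toy_test['predicted']).mean()

# Rows: true label, columns: predicted label
confusion = pd.crosstab(toy_test['Label'], toy_test['predicted'])
true_spam = confusion.loc['spam', 'spam']
precision = true_spam / confusion['spam'].sum()     # of predicted spam, how much is spam
recall = true_spam / confusion.loc['spam'].sum()    # of actual spam, how much was caught

print(confusion)
print('accuracy:', accuracy, 'precision:', precision, 'recall:', recall)
```

On the real data, the same two lines applied to `test` would show whether the 14 errors are mostly missed spam or mostly ham flagged as spam.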


## Conclusion


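The accuracy is about 98.74%: the spam filter classified 1,100 of the 1,114 test messages correctly, none of which it had seen during training. For comparison, labeling every test message as ham would be right for 967 of the 1,114 messages, an accuracy of 86.8%.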

## Next steps

This project built a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The initial goal was an accuracy of over 80%, and the 98.74% reached on the test set clearly exceeds it.

Next steps include:

- Analyze the 14 messages that were classified incorrectly and try to figure out why the algorithm classified them incorrectly
- Make the filtering process more complex by making the algorithm sensitive to letter case
- Get the project portfolio-ready by using a few tips from our style guide for data science projects.
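
For the first of these steps, the misclassified rows can be selected with a boolean mask. A minimal sketch on a toy result table (messages and labels invented for illustration); with the real data the same filter would be applied to `test`.

```python
import pandas as pd

# Toy stand-in for the test DataFrame after the 'predicted' column was added
toy_results = pd.DataFrame({
    'Label':     ['ham', 'spam', 'spam'],
    'SMS':       ['see you at 5', 'call this number now', 'win a free prize'],
    'predicted': ['ham', 'ham', 'spam'],
})

# Keep only the rows where the prediction differs from the human label
misclassified = toy_results[toy_results['Label'] != toy_results['predicted']]
print(misclassified)
```

Rows predicted as `'needs human classification'` would also show up here, since that value never equals `'ham'` or `'spam'`.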