# Goal: 

In this project, I want to build a spam filter with Naive Bayes.

# Objectives: 

First, the computer learns how humans classify messages.

Second, it uses that human knowledge to estimate probabilities for new messages: the probability of spam and the probability of non-spam.

Third, it classifies a new message based on these probability values: if the probability for spam is greater, it classifies the message as spam. Otherwise, it classifies it as non-spam.

# Dataset:

To train the algorithm, I will use a dataset of 5,572 SMS messages that have already been classified by humans. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

text_message_spam = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

# Find how many rows and columns the dataset has.

print(text_message_spam.shape)
text_message_spam.head()

(5572, 2)


   Label  SMS
0    ham  Go until jurong point, crazy.. Available only ...
1    ham  Ok lar... Joking wif u oni...
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...
3    ham  U dun say so early hor... U c already then say...
4    ham  Nah I don't think he goes to usf, he lives aro...


In [5]:
# Find what percentage of the messages is spam and what percentage is ham ("ham" means non-spam).

round(text_message_spam['Label'].value_counts()/len(text_message_spam)*100)

ham     87.0
spam    13.0
Name: Label, dtype: float64

Now that we have explored the dataset, it is time to build the spam filter.

Agile project management principle:
    First design the tests for evaluating the software, then start developing the software.

## Training and Test Set


To test the spam filter, I divide the dataset into two parts: a training set (80% of the messages) and a test set (the remaining 20%).
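As a minimal sketch of the idea, the snippet below shuffles a small made-up DataFrame and splits it 80/20. The rows and the names `toy`, `shuffled`, `train`, and `test` are hypothetical and only stand in for the real SMS data. It assumes a plain shuffle (sampling without replacement), so each message lands in exactly one of the two sets.

```python
import pandas as pd

# Toy stand-in for the SMS dataset (hypothetical rows, not taken from the real file).
toy = pd.DataFrame({
    'Label': ['ham', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'spam', 'ham'],
    'SMS': [f'message {i}' for i in range(10)],
})

# Shuffle without replacement: every row appears exactly once.
shuffled = toy.sample(frac=1, random_state=1)

# 80/20 split by position.
split_at = round(len(shuffled) * 0.8)
train = shuffled.iloc[:split_at].reset_index(drop=True)
test = shuffled.iloc[split_at:].reset_index(drop=True)

# No message should appear in both sets.
overlap = set(train['SMS']) & set(test['SMS'])
print(len(train), len(test), len(overlap))  # 8 2 0
```

Checking the overlap is a cheap way to confirm that the test set really contains messages the filter has not seen during training.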


In [6]:
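# Randomize the order of the entire dataset.
# Caution: replace=True samples with replacement, so some messages are drawn more than
# once and others not at all; duplicates can end up in both the training and test sets.
# For a plain shuffle, omit replace=True (the outputs below were produced with it).
x = text_message_spam.sample(frac=1, replace=True, random_state=1)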

print(x)

     Label                                                SMS
5157   ham                            K k:) sms chat with me.
235   spam  Text & meet someone sexy today. U can find a d...
3980   ham  CERI U REBEL! SWEET DREAMZ ME LITTLE BUDDY!! C...
5192   ham  Oh oh... Den muz change plan liao... Go back h...
905    ham  We're all getting worried over here, derek and...
2763   ham  ARR birthday today:) i wish him to get more os...
2895   ham                   K...k...yesterday i was in cbe .
5056   ham  Am on a train back from northampton so i'm afr...
144    ham         I know you are. Can you pls open the back?
4225   ham  Double eviction this week - Spiral and Michael...
2797   ham  Tell your friends what you plan to do on Valen...
3462   ham  K.. I yan jiu liao... Sat we can go 4 bugis vi...
1202   ham                               I know she called me
5396   ham           As in i want custom officer discount oh.
5374   ham  Do u konw waht is rael FRIENDSHIP Im gving yuo...
...

Split the randomized dataset into a training and a test set

In [7]:
training_size = round(text_message_spam.shape[0]*0.8)
print(training_size)

4458


In [8]:
test_size = text_message_spam.shape[0] - training_size
print(test_size)

1114


In [9]:
training_dataset = x.iloc[:training_size].reset_index(drop=True)
print(training_dataset)

     Label                                                SMS
0      ham                            K k:) sms chat with me.
1     spam  Text & meet someone sexy today. U can find a d...
2      ham  CERI U REBEL! SWEET DREAMZ ME LITTLE BUDDY!! C...
3      ham  Oh oh... Den muz change plan liao... Go back h...
4      ham  We're all getting worried over here, derek and...
5      ham  ARR birthday today:) i wish him to get more os...
6      ham                   K...k...yesterday i was in cbe .
7      ham  Am on a train back from northampton so i'm afr...
8      ham         I know you are. Can you pls open the back?
9      ham  Double eviction this week - Spiral and Michael...
10     ham  Tell your friends what you plan to do on Valen...
11     ham  K.. I yan jiu liao... Sat we can go 4 bugis vi...
12     ham                               I know she called me
13     ham           As in i want custom officer discount oh.
14     ham  Do u konw waht is rael FRIENDSHIP Im gving yuo...
...

In [10]:
test_dataset = x.iloc[training_size:].reset_index(drop=True)
print(test_dataset)

     Label                                                SMS
0      ham  Lol I know! They're so dramatic. Schools alrea...
1      ham                              Ü called dad oredi...
2      ham                  Oh you got many responsibilities.
3      ham                   I'll probably be around mu a lot
4      ham  U studying in sch or going home? Anyway i'll b...
5      ham  Where are you ? What do you do ? How can you s...
6     spam  our mobile number has won £5000, to claim call...
7      ham  Hurry home u big butt. Hang up on your last ca...
8      ham  Or better still can you catch her and let ask ...
9      ham  Hi Shanil,Rakhesh here.thanks,i have exchanged...
10     ham  Hey, I missed you tm of last night as my phone...
11     ham  Oh god i am happy to see your message after 3 ...
12    spam  You are a winner you have been specially selec...
13     ham  Sorry about that this is my mates phone and i ...
14     ham                          I'll see, but prolly yeah
...

In [11]:
training_dataset['Label'].value_counts(normalize=True)*100

ham     86.473755
spam    13.526245
Name: Label, dtype: float64

In [12]:
test_dataset['Label'].value_counts(normalize=True)*100

ham     85.278276
spam    14.721724
Name: Label, dtype: float64

In [13]:
training_dataset.sort_index()

   Label  SMS
0    ham  K k:) sms chat with me.
1   spam  Text & meet someone sexy today. U can find a d...
2    ham  CERI U REBEL! SWEET DREAMZ ME LITTLE BUDDY!! C...
3    ham  Oh oh... Den muz change plan liao... Go back h...
4    ham  We're all getting worried over here, derek and...
5    ham  ARR birthday today:) i wish him to get more os...
6    ham  K...k...yesterday i was in cbe .
7    ham  Am on a train back from northampton so i'm afr...
8    ham  I know you are. Can you pls open the back?
9    ham  Double eviction this week - Spiral and Michael...


In [14]:
test_dataset.sort_index()

   Label  SMS
0    ham  Lol I know! They're so dramatic. Schools alrea...
1    ham  Ü called dad oredi...
2    ham  Oh you got many responsibilities.
3    ham  I'll probably be around mu a lot
4    ham  U studying in sch or going home? Anyway i'll b...
5    ham  Where are you ? What do you do ? How can you s...
6   spam  our mobile number has won £5000, to claim call...
7    ham  Hurry home u big butt. Hang up on your last ca...
8    ham  Or better still can you catch her and let ask ...
9    ham  Hi Shanil,Rakhesh here.thanks,i have exchanged...


## Data Cleaning



### Removing the punctuation and bringing all the words to lower case

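The \W metacharacter matches any non-word character. A word character is a letter (a-z, A-Z), a digit (0-9), or the underscore character; in Python 3, Unicode letters such as ü also count as word characters. This means that \W on its own leaves underscores in place; to remove them as well, use [\W_] instead of [\W].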
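The difference between the two patterns is easy to see on a short string. The example text below is made up for illustration; only Python's standard `re` module is used.

```python
import re

# Hypothetical message containing punctuation and an underscore.
text = 'Free_entry!! Call 0800-123 now'

# \W replaces punctuation and whitespace but keeps the underscore.
keep_underscore = re.sub(r'\W', ' ', text).split()

# [\W_] replaces the underscore as well.
drop_underscore = re.sub(r'[\W_]', ' ', text).split()

print(keep_underscore)  # ['Free_entry', 'Call', '0800', '123', 'now']
print(drop_underscore)  # ['Free', 'entry', 'Call', '0800', '123', 'now']
```

With \W alone, a token such as Free_entry stays a single vocabulary word, which may or may not be what you want.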

In [15]:
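# Replace every non-word character with a space, then lower-case the text.
# pandas 2.0 and later treat the pattern as a literal string unless regex=True.
training_dataset['SMS'] = training_dataset['SMS'].str.replace(r'\W', ' ', regex=True)
training_dataset['SMS'] = training_dataset['SMS'].str.lower()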
training_dataset.head()

   Label  SMS
0    ham  k k sms chat with me
1   spam  text meet someone sexy today u can find a d...
2    ham  ceri u rebel sweet dreamz me little buddy c...
3    ham  oh oh den muz change plan liao go back h...
4    ham  we re all getting worried over here derek and...


In [16]:
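# Apply the same cleaning to the test set (regex=True is needed in pandas 2.0 and later).
test_dataset['SMS'] = test_dataset['SMS'].str.replace(r'\W', ' ', regex=True)
test_dataset['SMS'] = test_dataset['SMS'].str.lower()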
test_dataset.head()

   Label  SMS
0    ham  lol i know they re so dramatic schools alrea...
1    ham  ü called dad oredi
2    ham  oh you got many responsibilities
3    ham  i ll probably be around mu a lot
4    ham  u studying in sch or going home anyway i ll b...


### Creating a vocabulary

In [17]:
training_dataset['SMS'] = training_dataset['SMS'].str.split()
vocabulary = []
for sms in training_dataset['SMS']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

In [18]:
len(vocabulary)

6466

In [19]:
word_counts_per_sms = {unique_word: [0] * len(training_dataset['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_dataset['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [20]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0121,01223585236,01223585334,0125698789,02,...,zhong,zindgi,zoe,zoom,zyada,èn,é,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
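Because the real table has thousands of columns and is almost entirely zeros, the construction is easier to follow on a tiny example. The sketch below repeats the same steps (build a vocabulary, then count each word per message) on three made-up tokenized messages; `toy_sms`, `toy_vocabulary`, `toy_counts`, and `toy_word_counts` are hypothetical names used only here.

```python
import pandas as pd

# Hypothetical messages, already cleaned and split into words.
toy_sms = [['free', 'prize', 'free'], ['see', 'you', 'soon'], ['free', 'you']]

# Vocabulary: every unique word across all messages (sorted for a stable column order).
toy_vocabulary = sorted({word for sms in toy_sms for word in sms})

# One list of zeros per word, one slot per message.
toy_counts = {word: [0] * len(toy_sms) for word in toy_vocabulary}
for index, sms in enumerate(toy_sms):
    for word in sms:
        toy_counts[word][index] += 1

# Rows are messages, columns are vocabulary words.
toy_word_counts = pd.DataFrame(toy_counts)
print(toy_word_counts)
```

Here the column for free holds 2, 0, 1: the word appears twice in the first message, never in the second, and once in the third.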


## Concatenate the DataFrame we just built above with the DataFrame containing the training set

In [21]:
training_set_clean = pd.concat([training_dataset, word_counts], axis=1)
training_set_clean.tail()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0121,01223585236,01223585334,...,zhong,zindgi,zoe,zoom,zyada,èn,é,ü,〨ud,鈥
4453,ham,"[where, do, you, need, to, go, to, get, it]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4454,ham,"[sorry, my, roommates, took, forever, it, ok, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4455,spam,"[hi, this, is, amy, we, will, be, sending, you...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4456,spam,"[hi, 07734396839, ibh, customer, loyalty, offe...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4457,ham,"[what, is, your, account, number]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
# Isolating spam and ham messages first
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(training_set_clean)
p_ham = len(ham_messages) / len(training_set_clean)

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

In [23]:
# Initiate parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

# Calculate parameters
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()   # spam_messages already defined in a cell above
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()   # ham_messages already defined in a cell above
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham
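
Written out, the quantities computed in the two cells above are the class priors and the Laplace-smoothed word probabilities. The symbols follow the code comments (N_Spam, N_Ham, N_Vocabulary); $N_{w \mid \text{Spam}}$ is shorthand introduced here for `n_word_given_spam`, the number of times word $w$ occurs in spam messages.

```latex
\begin{aligned}
P(\text{Spam}) &= \frac{\text{number of spam messages}}{\text{number of training messages}}, &
P(\text{Ham}) &= \frac{\text{number of ham messages}}{\text{number of training messages}}, \\[4pt]
P(w \mid \text{Spam}) &= \frac{N_{w \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \, N_{\text{Vocabulary}}}, &
P(w \mid \text{Ham}) &= \frac{N_{w \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \, N_{\text{Vocabulary}}}.
\end{aligned}
```

With $\alpha = 1$, a word that never occurs in spam still gets a small nonzero probability, so a single such word cannot force the product of probabilities for a message to zero.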

In [24]:
import re

def classify(message):
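    '''
    Print the spam and ham scores for a message and the resulting label.
    The scores are proportional to the posterior probabilities, not normalized.

    message: a string
    '''
    
    message = re.sub(r'\W', ' ', message)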
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [37]:
classify('WINNER!! Lionel is the secret code to unlock the money: C3421.')

P(Spam|message): 7.415247154971831e-23
P(Ham|message): 4.957329987099494e-25
Label: Spam


In [38]:
classify("Sounds good, Tom, ok see u there")

P(Spam|message): 3.3867428577189158e-25
P(Ham|message): 4.212028168008679e-21
Label: Ham
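
The scores printed above are already on the order of 1e-21 to 1e-25 for short messages, and multiplying many small probabilities can eventually underflow to zero for long ones. A common variant, not used in this notebook, compares sums of logarithms instead of products. The sketch below uses hypothetical toy parameters (`toy_p_spam`, `toy_parameters_spam`, and so on) rather than the values learned from the training set.

```python
import math

# Hypothetical parameters for illustration only; not the values learned above.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_parameters_spam = {'winner': 0.02, 'money': 0.03, 'ok': 0.001}
toy_parameters_ham = {'winner': 0.0005, 'money': 0.002, 'ok': 0.02}

def classify_log(words):
    """Return 'spam' or 'ham' by comparing summed log-probabilities."""
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify().
        if word in toy_parameters_spam:
            log_spam += math.log(toy_parameters_spam[word])
            log_ham += math.log(toy_parameters_ham[word])
    # In this sketch a tie falls back to 'ham'.
    return 'spam' if log_spam > log_ham else 'ham'

print(classify_log(['winner', 'money']))  # spam
print(classify_log(['ok', 'see', 'you']))  # ham
```

Because the logarithm is monotonic, comparing log scores gives the same label as comparing the products whenever the products do not underflow.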


In [39]:
def classify_test_set(message):
    '''
    Same as classify(), but returns the label instead of printing it.

    message: a string
    '''

    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [40]:
test_dataset['predicted'] = test_dataset['SMS'].apply(classify_test_set)
test_dataset.head()

   Label  SMS                                               predicted
0    ham  lol i know they re so dramatic schools alrea...   ham
1    ham  ü called dad oredi                                ham
2    ham  oh you got many responsibilities                  ham
3    ham  i ll probably be around mu a lot                  ham
4    ham  u studying in sch or going home anyway i ll b...  ham


In [41]:
correct = 0
total = test_dataset.shape[0]
    
for row in test_dataset.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1104
Incorrect: 10
Accuracy: 0.9910233393177738
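
The same accuracy can be computed without an explicit loop, and a cross-tabulation shows which kind of mistake the filter makes. The sketch below runs on a small made-up results table (`toy_results` is hypothetical), so its numbers are unrelated to the test-set result above.

```python
import pandas as pd

# Hypothetical labels and predictions, for illustration only.
toy_results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'ham', 'spam', 'ham',  'ham'],
})

# Fraction of rows where the prediction matches the label.
accuracy = (toy_results['Label'] == toy_results['predicted']).mean()
print(accuracy)  # 0.8

# Rows: true label, columns: predicted label.
confusion = pd.crosstab(toy_results['Label'], toy_results['predicted'])
print(confusion)
```

On a dataset where most of the messages are ham, accuracy alone can look high even if some spam slips through, so the breakdown by true label is worth checking.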
