## Building a Spam Filter with Naive Bayes
*The objective of this project is to create a spam filter.* <br><br>
To classify messages as spam or non-spam, the computer must:
- 1) Learn how humans classify messages.
- 2) Use that human knowledge to estimate probabilities for new messages: the probability of spam and the probability of non-spam.
- 3) Classify a new message based on these probability values. If the probability of spam is greater, it classifies the message as spam; otherwise, it classifies it as non-spam. If the two probability values are equal, a human may need to classify the message.
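
Steps 2 and 3 can be written compactly. As a sketch, assuming the usual Naive Bayes independence assumption and writing $w_1, \dots, w_n$ for the words of a new message (notation introduced here for illustration):

$$
\begin{aligned}
P(\text{Spam} \mid w_1, \dots, w_n) &\propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam}) \\
P(\text{Ham} \mid w_1, \dots, w_n) &\propto P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham})
\end{aligned}
$$

The message is labelled spam when the first quantity is larger and ham when the second is larger. Both sides share the same normalizing constant, so it can be ignored for the comparison.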

## Part I - Exploring the Dataset

In [1]:
import pandas as pd
sms=pd.read_csv('SMSSpamCollection',sep='\t',header=None,
                names=['Label','SMS'])
r,c = sms.shape
print('number of rows {}, number of columns {}'.format(r,c))

n_spam=sum(sms['Label']=='spam')
print('percentage of messages spam: {:.2f}%'.format(n_spam/r*100))

n_ham=sum(sms['Label']=='ham') #ham = non-spam
print('percentage of messages ham: {:.2f}%'.format(n_ham/r*100))

sms.head()

number of rows 5572, number of columns 2
percentage of messages spam: 13.41%
percentage of messages ham: 86.59%


| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Part II - Training and Test Set
- Before building software such as a spam filter, it is wise to design the test first so that the evaluation is less biased.
- To test the filter, we split the dataset into two sets:
    - **training set** - used to "train" the computer to classify messages (80% of the data)
    - **test set** - used to test how well the spam filter classifies new messages (20% of the data)
- The goal is to classify new messages with an accuracy greater than 80%.

In [2]:
# Shuffle all rows (frac=1) with a fixed seed so the split is reproducible
sms_rand=sms.sample(frac=1,random_state=1)
n_split=round(len(sms_rand)*0.8)

# The first n_split rows (positions 0 to n_split-1) form the training set;
# reset_index(drop=True) discards the shuffled index and renumbers from 0
training=sms_rand[:n_split].reset_index(drop=True)

# The remaining rows (position n_split to the last row) form the test set
test=sms_rand[n_split:].reset_index(drop=True)

len(training)+len(test)-len(sms_rand) #count check

0

In [3]:
r_tr=training.shape[0]

n_spam_training=sum(training['Label']=='spam')
print('percentage of messages spam in training: {:.2f}%'
      .format(n_spam_training/r_tr*100))

n_ham_training=sum(training['Label']=='ham')
print('percentage of messages ham in training: {:.2f}%'
      .format(n_ham_training/r_tr*100))

percentage of messages spam in training: 13.46%
percentage of messages ham in training: 86.54%


In [4]:
r_te=test.shape[0]

n_spam_test=sum(test['Label']=='spam')
print('percentage of messages spam in test: {:.2f}%'
      .format(n_spam_test/r_te*100))

n_ham_test=sum(test['Label']=='ham')
print('percentage of messages ham in test: {:.2f}%'
      .format(n_ham_test/r_te*100))

percentage of messages spam in test: 13.20%
percentage of messages ham in test: 86.80%


The percentages of spam and ham in the training and test sets match those of the parent dataset (`sms`) to the nearest whole percent.

## Part III - Letter Case and Punctuation

In [5]:
training.head(3)

| | Label | SMS |
|---|---|---|
| 0 | ham | Yep, by the pretty sculpture |
| 1 | ham | Yes, princess. Are you going to make me moan? |
| 2 | ham | Welp apparently he retired |


In [6]:
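# Replace every non-word character (punctuation, symbols) with a space using the regex \W.
# regex=True must be passed explicitly in recent pandas versions, where it is no longer the default.
training['SMS']=training['SMS'].str.replace(r'\W',' ',regex=True)
training['SMS']=training['SMS'].str.lower()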
training.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |


## Part IV - Creating the Vocabulary

In [7]:
training['SMS']=training['SMS'].str.split()
vocabulary=[]
for each_list in training['SMS']:
    for each_word in each_list:
        vocabulary.append(each_word)
vocabulary=set(vocabulary)
vocabulary=list(vocabulary)
print(vocabulary[:5])
len(vocabulary)

['m', 'season', 'greatly', 'offc', 'come']


7783

## Part V - The Final Training Set

In [8]:
word_counts_per_sms = {unique_word: [0]*len(training) 
                       for unique_word in vocabulary}

#check the keys of the dictionary
for key in list(word_counts_per_sms)[:5]:
    print(key,len(word_counts_per_sms[key]))

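# Count each word per message. The loop variable is named `message`
# so that it does not overwrite the `sms` DataFrame loaded earlier.
for index, message in enumerate(training['SMS']):
    for word in message:
        word_counts_per_sms[word][index]+=1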
        
word_counts_pd=pd.DataFrame(word_counts_per_sms)

df_training_and_word_counts=pd.concat([training,word_counts_pd], axis=1)

df_training_and_word_counts.head()

signin 4458
responsibility 4458
growing 4458
m 4458
pig 4458


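| | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |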


## Part VI - Calculating Constants First

In [9]:
p_h_s = training['Label'].value_counts(normalize=True)
# Look up by label rather than by position, so the result does not depend on sort order
p_ham = p_h_s['ham']
p_spam = p_h_s['spam']
print(p_h_s,p_ham,p_spam)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64 0.8654104979811574 0.13458950201884254


In [10]:
t_spam = df_training_and_word_counts[df_training_and_word_counts['Label']=='spam']
t_ham = df_training_and_word_counts[df_training_and_word_counts['Label']=='ham']
n_s = t_spam['SMS'].apply(len).sum()
n_h = t_ham['SMS'].apply(len).sum()
n_v=len(vocabulary)
print(n_s,n_h,n_v)
alpha=1

15190 57237 7783


In [11]:
# Alternative attempt: sum the word-count columns directly.
# The kernel kept dying on this, so it is left commented out.
# t_spam = df_training_and_word_counts[df_training_and_word_counts['Label']=='spam']
# t_ham = df_training_and_word_counts[df_training_and_word_counts['Label']=='ham']
# n_s = t_spam.iloc[:,2:].sum().sum()
# n_h = t_ham.iloc[:,2:].sum().sum()
# n_v=len(vocabulary)
# print(n_s,n_h,n_v)
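
A likely (but unconfirmed) reason the attempt above killed the kernel is memory: the word-count table has one column per vocabulary word, so roughly 4,458 × 7,783 ≈ 34.7 million cells. The sketch below shows one lighter way to get per-class word totals, counting directly from the tokenized messages with `collections.Counter`. The `toy` DataFrame and the `count_words` helper are hypothetical stand-ins, not part of the original notebook.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy data standing in for the tokenized `training` DataFrame.
toy = pd.DataFrame({
    'Label': ['ham', 'spam', 'ham', 'spam'],
    'SMS': [['see', 'you', 'soon'],
            ['win', 'money', 'now'],
            ['see', 'you'],
            ['win', 'win', 'prize']],
})

def count_words(df, label):
    """Count every word occurrence in messages carrying the given label."""
    counts = Counter()
    for words in df.loc[df['Label'] == label, 'SMS']:
        counts.update(words)
    return counts

spam_counts = count_words(toy, 'spam')
ham_counts = count_words(toy, 'ham')

# Total words per class: the same quantities the notebook calls n_s and n_h.
toy_n_s = sum(spam_counts.values())
toy_n_h = sum(ham_counts.values())
print(toy_n_s, toy_n_h, spam_counts['win'])
```

A `Counter` returns 0 for words it has never seen, so the per-word counts can be fed into the smoothing formula without building the dense table.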

## Part VII - Calculating Parameters
**parameters** - the values that P($w_{i}$|Spam) and P($w_{i}$|Ham) take
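
Written out, the code in this part computes each parameter with additive (Laplace) smoothing. Here $N_{w_i \mid \text{Spam}}$ denotes the number of times $w_i$ occurs in spam messages (notation introduced here for illustration), while $N_{\text{Spam}}$, $N_{\text{Ham}}$, $N_{\text{Vocabulary}}$ and $\alpha$ correspond to `n_s`, `n_h`, `n_v` and `alpha` from Part VI:

$$
P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
\qquad
P(w_i \mid \text{Ham}) = \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
$$

For example, with $\alpha = 1$, a word that never occurs in spam gets $1 / (15190 + 7783) = 1/22973 \approx 4.35 \times 10^{-5}$ rather than zero, so a single unseen word cannot wipe out the whole product.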

In [24]:
s = {v_word: 0 for v_word in vocabulary}
h = {v_word: 0 for v_word in vocabulary}

In [31]:
for v_word in vocabulary:
    s[v_word]=(t_spam[v_word].sum(axis=0)+alpha)/(n_s+alpha*n_v)
    h[v_word]=(t_ham[v_word].sum(axis=0)+alpha)/(n_h+alpha*n_v)
    
for key in list(s)[:3]:
    print(s[key],h[key])

4.3529360553693465e-05 3.075976622577668e-05
4.3529360553693465e-05 4.6139649338665025e-05
0.0001305880816610804 1.537988311288834e-05


## Part VIII - Classifying A New Message

In [41]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start each product from the prior probability of spam and of ham
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for each_word in message:
        if each_word in s:
            p_spam_given_message*=s[each_word]
        if each_word in h:
            p_ham_given_message*=h[each_word]
     
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [42]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')
classify("Sounds good, Tom, then see u there")

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam
P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
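
The values printed above are already as small as 1e-27 for short messages. For much longer messages, a product of many such factors can underflow to 0.0 in floating point, which would make both sides tie. A common workaround, which the original notebook does not use, is to compare sums of logarithms instead of products. The sketch below illustrates the idea with hypothetical toy parameters (`toy_s`, `toy_h`, `toy_p_spam`, `toy_p_ham`) in place of the notebook's `s`, `h`, `p_spam` and `p_ham`.

```python
import math
import re

# Hypothetical toy parameters standing in for the notebook's s, h, p_spam, p_ham.
toy_p_spam, toy_p_ham = 0.2, 0.8
toy_s = {'win': 0.30, 'money': 0.25, 'see': 0.05, 'you': 0.10}
toy_h = {'win': 0.02, 'money': 0.03, 'see': 0.30, 'you': 0.35}

def classify_log(message):
    """Classify by comparing sums of log-probabilities instead of products."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in the original classify().
        if word in toy_s:
            log_spam += math.log(toy_s[word])
        if word in toy_h:
            log_ham += math.log(toy_h[word])

    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(classify_log('WIN money!!'))
print(classify_log('See you'))
```

Because the logarithm is monotonic, the comparison gives the same label as the product form whenever the products do not underflow.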


## Part IX - Measuring the Spam Filter's Accuracy
**accuracy** - the number of correctly classified messages divided by the total number of classified messages

In [44]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for each_word in message:
        if each_word in s:
            p_spam_given_message*=s[each_word]
        if each_word in h:
            p_ham_given_message*=h[each_word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [45]:
test['predicted']=test['SMS'].apply(classify_test_set)
test.head()

| | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [50]:
correct=0
total=len(test)
for index, row in test.iterrows():
    if row['Label']==row['predicted']:
        correct+=1
accuracy=correct/total
print('accuracy: '+str(round(accuracy*100,2))+'%')

accuracy: 98.74%
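
The same accuracy figure can also be computed without an explicit loop. As a minimal sketch, using a hypothetical `toy_test` DataFrame in place of the real `test` set: comparing the two columns element-wise gives a boolean Series, and its mean is the fraction of correct predictions.

```python
import pandas as pd

# Hypothetical toy results standing in for the notebook's `test` DataFrame.
toy_test = pd.DataFrame({
    'Label':     ['ham', 'spam', 'ham', 'spam', 'ham'],
    'predicted': ['ham', 'spam', 'ham', 'ham',  'ham'],
})

# True counts as 1 and False as 0, so the mean is correct / total.
toy_accuracy = (toy_test['Label'] == toy_test['predicted']).mean()
print('accuracy: {:.2f}%'.format(toy_accuracy * 100))
```

Rows predicted as 'needs human classification' never equal the true label, so they count as incorrect here, just as in the loop version.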


## Part X - Next Steps
- Exceeded the 80% accuracy goal by nearly 19 percentage points (98.74% on the test set)
- To improve the classification process, one could examine the messages that were classified incorrectly and look for reasons why they were misclassified
- Add a level of complexity by making the algorithm sensitive to letter case
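
As a starting point for the second item, the misclassified rows can be pulled out with a boolean filter, and a cross-tabulation shows which kind of error is more common. This is only a sketch on a hypothetical `toy_test` DataFrame; on the real data the same two lines would be applied to `test`.

```python
import pandas as pd

# Hypothetical toy results standing in for the notebook's `test` DataFrame.
toy_test = pd.DataFrame({
    'Label': ['ham', 'spam', 'spam', 'ham'],
    'SMS': ['See you at 5', 'Win cash now',
            'Call this number today', 'ok'],
    'predicted': ['ham', 'spam', 'ham', 'needs human classification'],
})

# Keep only the rows where the prediction differs from the true label.
misclassified = toy_test[toy_test['Label'] != toy_test['predicted']]

# Rows are true labels, columns are predicted labels.
confusion = pd.crosstab(toy_test['Label'], toy_test['predicted'])

print(misclassified)
print(confusion)
```

Reading the misclassified messages alongside the table makes it easier to see, for instance, whether spam is slipping through as ham or the other way around.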