## Building a Spam Filter

Reference solution: https://github.com/dataquestio/solutions/blob/master/Mission433Solutions.ipynb

* To get the formulas
* To learn more about Naive Bayes

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split

## Data Exploration

In [5]:
sms_spam = pd.read_csv('SMSSpamCollection', sep = '\t', header = None, 
                      names = ['Label', 'SMS'])

In [6]:
print(sms_spam.shape)
sms_spam.head()

(5572, 2)


| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [7]:
sms_spam.Label.value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

About 13% of the messages are spam.

## Training and Test Set

We need to split the data into a training and a test set (80/20 split).

In [8]:
X = sms_spam.loc[:, 'SMS']
y = sms_spam.loc[:, 'Label']

In [9]:
# Randomize
training_set, test_set, training_y, test_y = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [10]:
training_set = pd.concat([training_set, training_y], axis = 1).reset_index().drop(columns='index')
test_set = pd.concat([test_set, test_y], axis = 1).reset_index().drop(columns='index')

In [11]:
training_set.head()

| | SMS | Label |
|---|---|---|
| 0 | Hi , where are you? We're at and they're not ... | ham |
| 1 | If you r @ home then come down within 5 min | ham |
| 2 | When're you guys getting back? G said you were... | ham |
| 3 | Tell my bad character which u Dnt lik in me. ... | ham |
| 4 | I'm leaving my house now... | ham |


In [12]:
sms_spam.loc[sms_spam.SMS == "I'm leaving my house now..."]

| | Label | SMS |
|---|---|---|
| 157 | ham | I'm leaving my house now... |


In [13]:
training_set.shape

(4457, 2)

In [14]:
test_set.shape

(1115, 2)

In [15]:
training_set.Label.value_counts(normalize = True)

ham     0.86538
spam    0.13462
Name: Label, dtype: float64

In [16]:
test_set.Label.value_counts(normalize = True)

ham     0.868161
spam    0.131839
Name: Label, dtype: float64

## Data Cleaning

We need to split each message into individual words, which will later become individual columns for analysis.

In [17]:
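def clean_sms_col(df):
    # regex=True is needed in pandas >= 2.0, where str.replace matches literally by default
    df['sms_split'] = df['SMS'].str.replace(r'\W', ' ', regex=True)
    df['sms_split'] = df['sms_split'].str.lower()
    df['sms_split'] = df['sms_split'].str.split()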

In [18]:
clean_sms_col(training_set)
clean_sms_col(test_set)

In [19]:
training_set.head()

| | SMS | Label | sms_split |
|---|---|---|---|
| 0 | Hi , where are you? We're at and they're not ... | ham | [hi, where, are, you, we, re, at, and, they, r... |
| 1 | If you r @ home then come down within 5 min | ham | [if, you, r, home, then, come, down, within, 5... |
| 2 | When're you guys getting back? G said you were... | ham | [when, re, you, guys, getting, back, g, said, ... |
| 3 | Tell my bad character which u Dnt lik in me. ... | ham | [tell, my, bad, character, which, u, dnt, lik,... |
| 4 | I'm leaving my house now... | ham | [i, m, leaving, my, house, now] |
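
The three cleaning steps can also be checked on a single string. This is a minimal sketch using only the standard library; the helper name `tokenize` is hypothetical and not part of the notebook.

```python
import re

def tokenize(message):
    # Replace every non-word character with a space, lower-case, then split on whitespace
    return re.sub(r'\W', ' ', message).lower().split()

print(tokenize("I'm leaving my house now..."))
# ['i', 'm', 'leaving', 'my', 'house', 'now']
```

Note that `\W` keeps digits and underscores, so tokens such as `5` or `2` survive the cleaning.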


### Create the Vocabulary

In [20]:
vocabulary = []
for sms in training_set['sms_split']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

In [21]:
len(vocabulary)

7753


It looks like there are 7,753 unique words in all the messages of our training set.

### Finalize Dataset

Each message will get 7,753 word-count columns, one per vocabulary word. Most of them will be filled with 0 (meaning the word did not show up in the SMS).
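
To make the target layout concrete, here is a small sketch on made-up data. The three tokenized messages and the `toy_` names are hypothetical; only the counting pattern mirrors the notebook.

```python
# Hypothetical, already tokenized messages (not taken from the SMS dataset)
toy_messages = [
    ['secret', 'prize', 'claim', 'now'],
    ['coming', 'to', 'my', 'secret', 'party'],
    ['winner', 'claim', 'your', 'secret', 'prize', 'secret'],
]

# One entry per unique word, each holding one counter per message
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}

for index, sms in enumerate(toy_messages):
    for word in sms:
        toy_counts[word][index] += 1

print(toy_counts['secret'])  # [1, 1, 2]
print(toy_counts['party'])   # [0, 1, 0]
```

Each dictionary key becomes one column and each list position one row, which is why most cells end up as 0.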

In [22]:
def transform_sms_dataset(df):
    # Count words in the messages of df itself, not always in training_set
    word_counts_per_sms = {unique_word: [0] * len(df['sms_split']) for unique_word in vocabulary}

    for index, sms in enumerate(df['sms_split']):
        for word in sms:
            # Test messages may contain words that are not in the training vocabulary
            if word in word_counts_per_sms:
                word_counts_per_sms[word][index] += 1

    word_counts = pd.DataFrame(word_counts_per_sms)
    return pd.concat([df, word_counts], axis = 1)

In [23]:
training_set_clean = transform_sms_dataset(training_set)

In [24]:
training_set_clean.head()

| | SMS | Label | sms_split | 0 | 00 | 000 | 008704050406 | 0121 | 01223585236 | 01223585334 | ... | zindgi | zoe | zoom | zouk | zyada | èn | é | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Hi , where are you? We're at and they're not ... | ham | [hi, where, are, you, we, re, at, and, they, r... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | If you r @ home then come down within 5 min | ham | [if, you, r, home, then, come, down, within, 5... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | When're you guys getting back? G said you were... | ham | [when, re, you, guys, getting, back, g, said, ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Tell my bad character which u Dnt lik in me. ... | ham | [tell, my, bad, character, which, u, dnt, lik,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | I'm leaving my house now... | ham | [i, m, leaving, my, house, now] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [25]:
test_set_clean = transform_sms_dataset(test_set)

In [26]:
test_set_clean.head()

| | SMS | Label | sms_split | 0 | 00 | 000 | 008704050406 | 0121 | 01223585236 | 01223585334 | ... | zindgi | zoe | zoom | zouk | zyada | èn | é | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Yep, by the pretty sculpture | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Yes, princess. Are you going to make me moan? | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Welp apparently he retired | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Havent. | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | I forgot 2 ask ü all smth.. There's a card on ... | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


## Calculate Constants

We're now done with cleaning the training set, and we can begin creating the spam filter. The Naive Bayes algorithm will need the following constants: P(Spam), P(Ham), N_Spam, N_Ham, and N_Vocabulary.


In [27]:
# Isolating spam and ham messages first
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(training_set_clean)
p_ham = len(ham_messages) / len(training_set_clean)

# N_Spam
n_words_per_spam_message = spam_messages['sms_split'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['sms_split'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

## Parameters

Now that we have the constant terms calculated above, we can move on with calculating the parameters `$P(w_i|Spam)$` and `$P(w_i|Ham)$`. Each parameter will thus be a conditional probability value associated with each word in the vocabulary.
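
Written out, the smoothed estimates computed in the next cell appear to take the following form. The symbol `N_{w_i|Spam}` (the number of times `w_i` occurs in spam messages) is introduced here for illustration; the other symbols correspond to `n_spam`, `n_ham`, `n_vocabulary`, and `alpha` in the code.

```latex
\begin{aligned}
P(w_i \mid \text{Spam}) &= \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}} \\[4pt]
P(w_i \mid \text{Ham}) &= \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
\end{aligned}
```

With `alpha = 1`, a word that never appears in spam still gets a small non-zero probability instead of zeroing out the whole product.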



In [28]:
# Initiate parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

# Calculate parameters
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()   # spam_messages already defined in a cell above
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()   # ham_messages already defined in a cell above
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham

## Classifying New Messages

Now that we have all our parameters calculated, we can start creating the spam filter. The spam filter can be understood as a function that:

* Takes in as input a new message (w1, w2, ..., wn).
* Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn).
* Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
    * If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
    * If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
    * If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.
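
Under the naive independence assumption, the two quantities being compared can be written as below. This is a sketch of what the `classify` function multiplies together. The results are proportional to the posteriors rather than normalized probabilities, which is enough for a comparison.

```latex
\begin{aligned}
P(\text{Spam} \mid w_1, w_2, \ldots, w_n) &\propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam}) \\
P(\text{Ham} \mid w_1, w_2, \ldots, w_n) &\propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Ham})
\end{aligned}
```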

In [29]:
import re

def classify(message):
    '''
    message: a string
    '''
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [30]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')


P(Spam|message): 1.2787826096383283e-25
P(Ham|message): 2.5863210303544332e-27
Label: Spam
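
The products above are already around 1e-25 for a short message, so very long messages could underflow to 0.0. A common workaround, not used in this notebook, is to sum logarithms instead of multiplying probabilities. The sketch below uses hypothetical toy parameters (all `toy_` names and values are made up for illustration, not learned from the SMS data).

```python
import math

# Hypothetical toy parameters, not the values learned from the SMS data
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_parameters_spam = {'win': 0.02, 'money': 0.015, 'hello': 0.001}
toy_parameters_ham = {'win': 0.001, 'money': 0.002, 'hello': 0.01}

def classify_log(words):
    '''
    words: a list of already cleaned, lower-cased tokens
    '''
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)

    for word in words:
        # Words outside the vocabulary are skipped, as in classify()
        if word in toy_parameters_spam:
            log_spam += math.log(toy_parameters_spam[word])
        if word in toy_parameters_ham:
            log_ham += math.log(toy_parameters_ham[word])

    if log_ham > log_spam:
        return 'ham'
    elif log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(classify_log(['win', 'money']))   # spam
print(classify_log(['hello'] * 500))    # ham
```

With plain multiplication, the 500-word example would give 0.0 for both classes and end in a tie; the log version still separates them.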


## Measuring the Spam Filter's Accuracy

Test how well the filter does on our test set, which has 1,115 messages.

In [31]:
def classify_test_set(message):    
    '''
    message: a string
    '''
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [32]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

| | SMS | Label | sms_split | predicted |
|---|---|---|---|---|
| 0 | Yep, by the pretty sculpture | ham | [yep, by, the, pretty, sculpture] | ham |
| 1 | Yes, princess. Are you going to make me moan? | ham | [yes, princess, are, you, going, to, make, me,... | ham |
| 2 | Welp apparently he retired | ham | [welp, apparently, he, retired] | ham |
| 3 | Havent. | ham | [havent] | ham |
| 4 | I forgot 2 ask ü all smth.. There's a card on ... | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | ham |


In [33]:
correct = 0
total = test_set.shape[0]
    
for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1102
Incorrect: 13
Accuracy: 0.9883408071748879


The accuracy is about 98.83%, which is really good. Our spam filter looked at 1,115 messages that it hadn't seen in training, and classified 1,102 correctly.
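
For context, the result can be set against a trivial baseline. The sketch below uses only figures printed earlier in the notebook and assumes a hypothetical filter that labels every message as ham; the variable names are illustrative.

```python
# Figures copied from the notebook output above
total = 1115
correct = 1102
ham_share_test = 0.868161  # from test_set.Label.value_counts(normalize = True)

accuracy = correct / total

# A hypothetical "always ham" filter is right exactly on the ham messages
baseline_accuracy = ham_share_test

print('Filter accuracy:', round(accuracy, 4))
print('Always-ham baseline:', round(baseline_accuracy, 4))
```

Because ham makes up about 87% of the data, accuracy alone starts from a high floor, so the gain over the baseline is the more informative number.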