## Digging Deeper


In [22]:
from IPython.display import HTML
import random

def hide_toggle(for_next=False):
    this_cell = """$('div.cell.code_cell.rendered.selected')"""
    next_cell = this_cell + '.next()'

    toggle_text = 'Toggle show/hide'  # text shown on toggle link
    target_cell = this_cell  # target cell to control with toggle
    js_hide_current = ''  # bit of JS to permanently hide code in current cell (only when toggling next cell)

    if for_next:
        target_cell = next_cell
        toggle_text += ' next cell'
        js_hide_current = this_cell + '.find("div.input").hide();'

    js_f_name = 'code_toggle_{}'.format(str(random.randint(1,2**64)))

    html = """
        <script>
            function {f_name}() {{
                {cell_selector}.find('div.input').toggle();
            }}

            {js_hide_current}
        </script>

        <a href="javascript:{f_name}()">{toggle_text}</a>
    """.format(
        f_name=js_f_name,
        cell_selector=target_cell,
        js_hide_current=js_hide_current, 
        toggle_text=toggle_text
    )

    return HTML(html)
hide_toggle()

### Exploring the Dataset

Let's start by opening the SMSSpamCollection file with the read_csv() function from the pandas package. 

We're going to use:

* sep='\t' because the data points are tab separated
* header=None because the dataset doesn't have a header row
* names=['Label', 'SMS'] to name the columns

In [23]:
import pandas as pd

sms_spam = pd.read_csv('SMSSpamCollection', sep='\t',
                       header=None, names=['Label', 'SMS'])

print(sms_spam.shape)
sms_spam.head()
hide_toggle()

(5572, 2)
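To see what the three read_csv() arguments do without needing the real file, here is a minimal sketch that reads two made-up, tab-separated lines from memory. The `toy_file` contents and the `toy_sms` name are assumptions for illustration only and are not part of the dataset.

```python
import io

import pandas as pd

# Two invented lines in the same "label<TAB>message" layout, with no header row.
toy_file = io.StringIO("ham\tSee you at 5\nspam\tWIN a prize now\n")

# sep='\t' splits on tabs, header=None treats the first line as data,
# and names=[...] supplies the column names.
toy_sms = pd.read_csv(toy_file, sep='\t', header=None, names=['Label', 'SMS'])

print(toy_sms.shape)
print(toy_sms)
```

Without header=None, pandas would treat the first message as the header row and silently drop it from the data.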


In [24]:
sms_spam['Label'].value_counts(normalize=True)
hide_toggle()

### Training and Test Set



We're now going to split our dataset into a training set and a test set. 

We'll use 80% of the data for training and the remaining 20% for testing.

We'll randomize the entire dataset before splitting so that spam and ham messages end up in both sets in roughly the same proportions.

In [30]:
# Randomize the dataset
data_randomized = sms_spam.sample(frac=1, random_state=1)

# Calculate index for split
training_test_index = round(len(data_randomized) * 0.8)

# Split into training and test sets
training_set = data_randomized[:training_test_index].reset_index(drop=True)
test_set = data_randomized[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)
hide_toggle()

(4458, 2)
(1114, 2)


In [32]:
print(training_set['Label'].value_counts(normalize=True))
print(test_set['Label'].value_counts(normalize=True))

hide_toggle()
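One way to check that the shuffle kept the class balance is to put the label proportions of each set side by side. The sketch below does this on a small made-up DataFrame. The `toy` data and all `toy_*` names are illustrative assumptions, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for sms_spam: 8 ham and 2 spam messages.
toy = pd.DataFrame({
    'Label': ['ham'] * 8 + ['spam'] * 2,
    'SMS': ['message {}'.format(i) for i in range(10)],
})

# Same shuffle-then-slice approach as in the cell above.
shuffled = toy.sample(frac=1, random_state=1)
split_index = round(len(shuffled) * 0.8)
toy_train = shuffled[:split_index].reset_index(drop=True)
toy_test = shuffled[split_index:].reset_index(drop=True)

# One column of label proportions per set; a label missing from a set shows as 0.
proportions = pd.DataFrame({
    'full': toy['Label'].value_counts(normalize=True),
    'training': toy_train['Label'].value_counts(normalize=True),
    'test': toy_test['Label'].value_counts(normalize=True),
}).fillna(0)

print(proportions)
```

With only ten rows the proportions can differ noticeably between sets. On a dataset of several thousand messages, a random shuffle usually keeps them close.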

### Data Cleaning, Letter Case and Punctuation

Let's begin the data cleaning process by removing the punctuation and making all the words lowercase.

In [33]:
# Before cleaning
training_set.head(3)

hide_toggle()

In [34]:
# After cleaning
# Replace every non-word character (punctuation) with a space
training_set['SMS'] = training_set['SMS'].str.replace(
    r'\W', ' ', regex=True)

# Make all letters lowercase
training_set['SMS'] = training_set['SMS'].str.lower()
training_set.head(3)

hide_toggle()
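To make the effect of the two cleaning steps concrete, here is a small sketch on a single made-up string, using the standard re module instead of the pandas string methods. The `sample` message is invented for illustration.

```python
import re

# An invented message with mixed case and punctuation.
sample = 'WINNER!! Claim your FREE prize now: call 0800-123.'

# \W matches any character that is not a letter, digit, or underscore,
# so every punctuation mark becomes a space.
no_punctuation = re.sub(r'\W', ' ', sample)

# Lowercase so that 'WINNER' and 'winner' count as the same word.
cleaned = no_punctuation.lower()

print(cleaned.split())
# ['winner', 'claim', 'your', 'free', 'prize', 'now', 'call', '0800', '123']
```

Note that the hyphen in the phone number is also replaced, so '0800-123' becomes two separate tokens.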

### Creating the Vocabulary

Let's now create the vocabulary, which in this context means a list with all the unique words in our training set. In the code below:

* We transform each message in the SMS column into a list by splitting the string on whitespace, using the Series.str.split() method.
* We initiate an empty list named vocabulary.
* We iterate over the transformed SMS column.
* Using a nested loop, we iterate over each message in the SMS column and append each string (word) to the vocabulary list. 
* We transform the vocabulary list into a set using the set() function. This will remove the duplicates from the vocabulary list.
* We transform the vocabulary set back into a list using the list() function. 

In [35]:
training_set['SMS'] = training_set['SMS'].str.split()

vocabulary = []
for sms in training_set['SMS']:
   for word in sms:
      vocabulary.append(word)

vocabulary = list(set(vocabulary))

hide_toggle()

In [37]:
len(vocabulary)
hide_toggle()
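The same vocabulary can also be built in one step with a set comprehension. This is a compact alternative sketch rather than the approach used in the cell above. The `toy_messages` list is made up, and the call to sorted() is an addition here so that the word order is reproducible.

```python
# Two invented messages, already cleaned and split into words.
toy_messages = [
    ['secret', 'prize', 'claim', 'now'],
    ['coming', 'to', 'my', 'secret', 'party'],
]

# The set comprehension collects each word once; sorted() returns a list
# in alphabetical order.
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

print(toy_vocabulary)
# ['claim', 'coming', 'my', 'now', 'party', 'prize', 'secret', 'to']
print(len(toy_vocabulary))
# 8
```

The word 'secret' appears in both messages but only once in the result.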

### The Final Training Set

We're now going to use the vocabulary we just created to make the data transformation we want.

Eventually, we're going to create a new DataFrame. We'll first build a dictionary that we'll then convert to the DataFrame we need.

In [38]:
word_counts_per_sms = {'secret': [2,1,1],
                       'prize': [2,0,1],
                       'claim': [1,0,1],
                       'now': [1,0,1],
                       'coming': [0,1,0],
                       'to': [0,1,0],
                       'my': [0,1,0],
                       'party': [0,1,0],
                       'winner': [0,0,1]
                      }

word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

hide_toggle()

To create the dictionary we need for our training set, we can use the code below:

* We start by initializing a dictionary named word_counts_per_sms, where each key is a unique word (a string) from the vocabulary, and each value is a list of the length of the training set, where each element in that list is a 0.

    * The code [0] * 5 outputs [0, 0, 0, 0, 0]. So the code [0] * len(training_set['SMS']) outputs a list of the length of training_set['SMS']. 
    
    
* We loop over training_set['SMS'] using the enumerate() function to get both the index and the SMS message (index and sms).
    * Using a nested loop, we loop over sms (where sms is a list of strings, where each string represents a word in a message).
        * We increment word_counts_per_sms[word][index] by 1. 


In [39]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
   for word in sms:
      word_counts_per_sms[word][index] += 1
        
hide_toggle()
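To see the counting loop at work on something small, the sketch below runs the same logic on three made-up messages. The messages are invented so that the counts match the small example table shown earlier. All `toy_*` names are assumptions for illustration.

```python
import pandas as pd

# Three invented messages, already cleaned and split into words.
toy_sms = [
    'secret prize claim now secret prize'.split(),
    'coming to my secret party'.split(),
    'winner claim secret prize now'.split(),
]

toy_vocab = sorted({word for sms in toy_sms for word in sms})

# One list of zeros per word, with one slot per message.
toy_counts = {word: [0] * len(toy_sms) for word in toy_vocab}

# Increment the slot for (word, message index) each time the word occurs.
for index, sms in enumerate(toy_sms):
    for word in sms:
        toy_counts[word][index] += 1

toy_word_counts = pd.DataFrame(toy_counts)
print(toy_word_counts)
```

Each row of the resulting DataFrame is one message and each column is one word, which is the layout the final training set uses.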

In [47]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

hide_toggle()

In [42]:
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

hide_toggle()

### Calculating Constants First

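Now that we're done cleaning the training set, we can begin coding the spam filter. We'll first calculate the values that stay the same for every new message: P(Spam), P(Ham), N_Spam, N_Ham and N_Vocabulary.

We'll also use Laplace smoothing and set alpha=1.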

In [43]:
# Isolating spam and ham messages first
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(training_set_clean)
p_ham = len(ham_messages) / len(training_set_clean)

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

hide_toggle()

In [44]:
# Initiate parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

# Calculate parameters
for word in vocabulary:
   n_word_given_spam = spam_messages[word].sum() # spam_messages already defined
   p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
   parameters_spam[word] = p_word_given_spam

   n_word_given_ham = ham_messages[word].sum() # ham_messages already defined
   p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
   parameters_ham[word] = p_word_given_ham
    
hide_toggle()
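The formula inside the loop above is (count + alpha) / (N_class + alpha * N_vocabulary). Here is a short worked sketch of what the smoothing does, using made-up counts. The helper name `smoothed_probability` and the numbers are assumptions for illustration.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Laplace-smoothed estimate of P(word | class)."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)


# Hypothetical counts: 7 words in all spam messages, 9 words in the vocabulary.
n_class_toy = 7
n_vocabulary_toy = 9

# A word seen twice in spam: (2 + 1) / (7 + 9) = 3/16
p_seen = smoothed_probability(2, n_class_toy, n_vocabulary_toy)

# A word never seen in spam still gets a small non-zero probability: 1/16
p_unseen = smoothed_probability(0, n_class_toy, n_vocabulary_toy)

# Without smoothing (alpha=0) the same word would get probability 0,
# which would zero out the whole product for any message containing it.
p_unsmoothed = smoothed_probability(0, n_class_toy, n_vocabulary_toy, alpha=0)

print(p_seen, p_unseen, p_unsmoothed)
# 0.1875 0.0625 0.0
```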

### Classifying A New Message


Now that we have all our parameters calculated, we can start creating the spam filter. 

Let's start by writing a first version of this function. For the classify() function below, notice that:

* The input variable message needs to be a string.
* We perform a bit of data cleaning on the string message:
    * We remove the punctuation using the re.sub() function.
    * We bring all letters to lower case using the str.lower() method.
    * We split the string on whitespace and transform it into a Python list using the str.split() method.
* We calculate p_spam_given_message and p_ham_given_message.
* We compare p_spam_given_message with p_ham_given_message and then print a classification label. 

In [45]:
import re

def classify(message):
   '''
   message: a string
   '''

   message = re.sub(r'\W', ' ', message)
   message = message.lower().split()

   p_spam_given_message = p_spam
   p_ham_given_message = p_ham

   for word in message:
      if word in parameters_spam:
         p_spam_given_message *= parameters_spam[word]

      if word in parameters_ham: 
         p_ham_given_message *= parameters_ham[word]

   print('P(Spam|message):', p_spam_given_message)
   print('P(Ham|message):', p_ham_given_message)

   if p_ham_given_message > p_spam_given_message:
      print('Label: Ham')
   elif p_ham_given_message < p_spam_given_message:
      print('Label: Spam')
   else:
      print('Equal probabilities, have a human classify this!')
        
hide_toggle()

In [49]:


classify('WINNER!! This is the secret code to unlock the money: C3421.')
classify("Sounds good, Tom, then see u there")

hide_toggle()

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam
P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
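The values printed above are already around 1e-25 for short messages. For much longer messages, a product of many small probabilities can underflow to 0.0 in floating point. A common workaround, which is not what this notebook does, is to add logarithms instead of multiplying probabilities. The sketch below shows that variant with made-up parameters. The function `classify_log` and all `toy_*` values are assumptions for illustration.

```python
import math
import re

# Hypothetical priors and per-word probabilities, not the trained values.
toy_p_spam, toy_p_ham = 0.5, 0.5
toy_parameters_spam = {'secret': 0.3, 'prize': 0.3, 'party': 0.05}
toy_parameters_ham = {'secret': 0.1, 'prize': 0.05, 'party': 0.4}


def classify_log(message, p_spam, p_ham, parameters_spam, parameters_ham):
    """Same decision rule as classify(), computed with sums of logs."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        if word in parameters_spam:
            log_spam += math.log(parameters_spam[word])
        if word in parameters_ham:
            log_ham += math.log(parameters_ham[word])

    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'


print(classify_log('Secret prize!!', toy_p_spam, toy_p_ham,
                   toy_parameters_spam, toy_parameters_ham))  # spam
print(classify_log('party', toy_p_spam, toy_p_ham,
                   toy_parameters_spam, toy_parameters_ham))  # ham

# Why logs help: 400 factors of 0.001 underflow to 0.0 when multiplied,
# while the equivalent sum of logs stays a finite negative number.
product = 1.0
for _ in range(400):
    product *= 1e-3
log_sum = 400 * math.log(1e-3)
print(product, log_sum)
```

Because the logarithm is strictly increasing, comparing the two log scores gives the same label as comparing the two products whenever the products do not underflow.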


### Measuring the Spam Filter's Accuracy

We'll start by writing a function that returns classification labels instead of printing them.

In [46]:
def classify_test_set(message):
   '''
   message: a string
   '''

   message = re.sub(r'\W', ' ', message)
   message = message.lower().split()

   p_spam_given_message = p_spam
   p_ham_given_message = p_ham

   for word in message:
      if word in parameters_spam:
         p_spam_given_message *= parameters_spam[word]

      if word in parameters_ham:
         p_ham_given_message *= parameters_ham[word]

   if p_ham_given_message > p_spam_given_message:
      return 'ham'
   elif p_spam_given_message > p_ham_given_message:
      return 'spam'
   else:
      return 'needs human classification'
    
hide_toggle()

Now that we have a function that returns labels instead of printing them, we can use it to create a new column in our test set.

In [28]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

hide_toggle()

We can compare the predicted values with the actual values to measure how well our spam filter classifies new messages.

In [29]:
correct = 0
total = test_set.shape[0]

for row in test_set.iterrows():
   row = row[1]
   if row['Label'] == row['predicted']:
      correct += 1

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

hide_toggle()

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833
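The same accuracy can be computed without an explicit loop by comparing the two columns directly, and a cross-tabulation shows which kind of mistake was made. The sketch below uses a small made-up results table. The `toy_results` data is an assumption for illustration and does not reproduce the numbers above.

```python
import pandas as pd

# Invented actual and predicted labels for five messages.
toy_results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'spam', 'spam', 'ham', 'ham'],
})

# Element-wise comparison gives a boolean Series; True counts as 1.
is_correct = toy_results['Label'] == toy_results['predicted']
toy_correct = int(is_correct.sum())
toy_accuracy = is_correct.mean()

print('Correct:', toy_correct)
print('Incorrect:', len(toy_results) - toy_correct)
print('Accuracy:', toy_accuracy)

# Rows are actual labels, columns are predicted labels.
confusion = pd.crosstab(toy_results['Label'], toy_results['predicted'])
print(confusion)
```

For a spam filter the two error types are not equally costly, so looking at the table rather than at accuracy alone can be useful: a ham message marked as spam may never be read.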
