# Building a Spam Filter
---
In this project, we apply Bayes' theorem to a real-world problem: building a spam filter. Our goal is a filter that classifies new messages with an accuracy greater than 80%.

In [1]:
import pandas as pd

## Explore the dataset

In [2]:
sms = pd.read_csv('SMSSpamCollection', sep = '\t', header = None, names = ['Label', 'SMS'])

In [3]:
sms.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [5]:
sms.describe()

Unnamed: 0,Label,SMS
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30


In [6]:
sms['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

## Building the spam filter

### Split the dataset

In [7]:
# Randomizing the dataset
rand_sms = sms.sample(frac = 1, random_state = 1) # `frac` can be between 0-1 as fraction

In [8]:
# Split the data set to 80% training - 20% testing
import math
sms_train = rand_sms.iloc[:math.ceil(sms.shape[0]*.8)]
sms_test = rand_sms.iloc[math.ceil(sms.shape[0]*.8):]

In [9]:
sms_train.shape[0], sms_test.shape[0]

(4458, 1114)

In [10]:
sms_train.reset_index(inplace = True, drop = True)
sms_test.reset_index(inplace = True, drop = True)

In [11]:
sms_train['Label'].value_counts(normalize = True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [12]:
sms_test['Label'].value_counts(normalize = True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

Both the training set and the test set preserve the ham/spam ratio of the original dataset: roughly 87% ham and 13% spam in each.
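
The same check can be done numerically rather than by eye. The following is a minimal sketch on toy data; the helper `label_ratio_gap` and the `toy` frame are hypothetical names introduced only for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy data standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    'Label': ['ham'] * 8 + ['spam'] * 2,
    'SMS': [f'message {i}' for i in range(10)],
})

def label_ratio_gap(full, subset, column='Label'):
    """Largest absolute difference in class proportions between two frames."""
    full_ratio = full[column].value_counts(normalize=True)
    subset_ratio = subset[column].value_counts(normalize=True)
    # Align on the labels of the full set; a label missing from the subset counts as 0
    aligned = subset_ratio.reindex(full_ratio.index, fill_value=0)
    return (full_ratio - aligned).abs().max()

# Shuffle and split 80/20, mirroring the approach used above
shuffled = toy.sample(frac=1, random_state=1)
train, test = shuffled.iloc[:8], shuffled.iloc[8:]
print(label_ratio_gap(toy, train), label_ratio_gap(toy, test))
```

A gap close to 0 means the subset has nearly the same class balance as the full dataset. What counts as "close enough" is a judgment call that this sketch does not settle.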

### Prepare the dataset for training

In [13]:
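# Convert every SMS to lowercase and replace non-word characters with spaces
# (regex=True is needed on pandas >= 2.0, where str.replace matches literally by default)
sms_train.SMS = sms_train.SMS.str.lower().str.replace(r'\W', ' ', regex=True)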

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


In [14]:
# Split each sms to a list of words 
sms_train.SMS = sms_train.SMS.str.split()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


In [15]:
# Collect every word that appears in the training set (duplicates are removed in the next cell)
vocabulary = []
for row in sms_train.SMS:
    vocabulary += row

In [16]:
# Transform vocabulary into a set to get rid of duplicates then transform back to list
vocabulary = list(set(vocabulary))

In [17]:
len(vocabulary)

7783

In [18]:
# Create a dictionary with unique vocabulary as keys and counts of appearance in each sms as values in lists
word_counts_per_sms = {unique_word : [0] * len(sms_train) for unique_word in vocabulary}

# Count how many times each word appears in each message
# (the loop variable is named `message` so it does not overwrite the `sms` DataFrame)
for index, message in enumerate(sms_train['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1

In [19]:
# Create a dataframe for unique word appearance in each sms
sms_train_word_count = pd.DataFrame(word_counts_per_sms)

In [None]:
sms_train_word_count.head()

Unnamed: 0,apparently,cold,gaytextbuddy,mesages,matric,tnc,concentrate,maintain,different,chgs,...,filthyguys,fresh,tampa,sickness,cross,upto,txtstop,bus,korean,surfing
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Concatenate sms_train and sms_train_word_count side by side (column-wise)
word_count = pd.concat([sms_train, sms_train_word_count], axis=1)

### Training algorithm

The Naive Bayes algorithm will need to know the probability values of the two equations below to be able to classify new messages:
$$
P(Spam | w_1, w_2, \ldots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i|Spam)
$$

$$
P(Ham | w_1, w_2, \ldots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i|Ham)
$$
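
To make the comparison concrete, here is a minimal sketch of how the two proportional scores could be computed and compared once the priors and per-word conditional probabilities are available. The function name `classify`, the decision to skip words missing from the tables, and all numeric values are assumptions made for illustration.

```python
def classify(words, p_spam, p_ham, p_word_given_spam, p_word_given_ham):
    """Return 'spam', 'ham', or 'undecided' by comparing the two proportional scores."""
    score_spam, score_ham = p_spam, p_ham
    for word in words:
        # Words absent from the tables are skipped (an assumption of this sketch)
        if word in p_word_given_spam:
            score_spam *= p_word_given_spam[word]
        if word in p_word_given_ham:
            score_ham *= p_word_given_ham[word]
    if score_spam > score_ham:
        return 'spam'
    if score_ham > score_spam:
        return 'ham'
    return 'undecided'

# Hypothetical probabilities, chosen only for illustration
p_w_spam = {'free': 0.20, 'meeting': 0.01}
p_w_ham = {'free': 0.02, 'meeting': 0.10}

print(classify(['free', 'free'], 0.13, 0.87, p_w_spam, p_w_ham))
print(classify(['meeting'], 0.13, 0.87, p_w_spam, p_w_ham))
```

Because the scores are only proportional to the true posteriors, they are compared against each other rather than read as probabilities.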

To calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ in the formulas above, we use the following equations, where $\alpha$ is a smoothing parameter:

$$
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
$$

$$
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
$$
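
The smoothed estimate translates directly into a small function. This is a sketch, not the notebook's implementation: the name `smoothed_probability` is hypothetical, and the default `alpha=1` (Laplace smoothing) is an assumption, since the text above does not fix a value for $\alpha$.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Additive-smoothing estimate of P(word | class).

    n_word_given_class: times the word occurs in messages of the class
    n_class: total number of words in messages of the class
    n_vocabulary: number of unique words in the vocabulary
    alpha: smoothing parameter (alpha=1 is an assumed default)
    """
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Toy numbers: a class with 10 words in total and a 5-word vocabulary
print(smoothed_probability(0, 10, 5))  # a word never seen in the class still gets a non-zero value
print(smoothed_probability(3, 10, 5))
```

The point of the smoothing term is visible in the first call: a word that never occurs in a class does not force the whole product to zero.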

Some of the terms in the four equations above will have the same value for every new message. As a start, let's first calculate:

1. $P(Spam)$ and $P(Ham)$
2. $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$

> * $N_{Spam}$ is the total number of words in all the spam messages. It is not the number of spam messages, and it is not the number of unique words in spam messages.
> * $N_{Ham}$ is the total number of words in all the non-spam messages. It is not the number of non-spam messages, and it is not the number of unique words in non-spam messages.
> * $N_{Vocabulary}$ is the number of unique words in the vocabulary.
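
As an illustration of these definitions, the sketch below computes the constants on a four-message toy frame whose `SMS` column already holds lists of words, as in the prepared training set. The frame `toy_train` and its contents are hypothetical. Computing `p_ham` as `1 - p_spam` assumes there are exactly two labels.

```python
import pandas as pd

# Toy training set: the SMS column holds lists of words (hypothetical data)
toy_train = pd.DataFrame({
    'Label': ['spam', 'ham', 'ham', 'spam'],
    'SMS': [['win', 'cash', 'now'], ['see', 'you', 'now'], ['ok'], ['win', 'win']],
})

is_spam = toy_train['Label'] == 'spam'

# Priors: the share of messages with each label
p_spam = is_spam.mean()
p_ham = 1 - p_spam  # valid only because there are exactly two labels

# Total word counts per class (repeated words are counted every time)
n_spam = toy_train.loc[is_spam, 'SMS'].apply(len).sum()
n_ham = toy_train.loc[~is_spam, 'SMS'].apply(len).sum()

# Vocabulary size: unique words across all messages
n_vocabulary = len({word for message in toy_train['SMS'] for word in message})

print(p_spam, p_ham, n_spam, n_ham, n_vocabulary)
```

Note that `n_spam` counts the repeated word `win` each time it appears, which is exactly the distinction the definitions above draw between total words and unique words.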

In [None]:
word_count.head()