# Guided Project #13
## Building a Spam Filter with Naive Bayes

In this guided project, we're going to study the practical side of the Naive Bayes algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, we saw in the previous mission that the computer:

1. Learns how humans classify messages.
2. Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
3. Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

It can also be downloaded directly from [this link](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection). The data collection process is described in more detail on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the authors' papers.

For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

### Let's start by reading in the dataset.

In [436]:
import pandas as pd
ds = pd.read_csv('SMSSpamCollection.csv', sep='\t', header=None)
ds.columns = ['label','text']
ds.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [437]:
ds.shape

(5572, 2)

In [438]:
# Spam percentage:
sum(ds.label=='spam')/ds.shape[0] * 100

13.406317300789663

About 13% are spam, and the remaining 87% of the messages are ham ("ham" means non-spam). 

### Splitting the data into training and test sets
We're going to keep about 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

- The training set will have 4,458 messages (about 80% of the dataset).
- The test set will have 1,114 messages (about 20% of the dataset).

In [439]:
ds = ds.sample(frac=1, random_state=1)
ds.head()

Unnamed: 0,label,text
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"
958,ham,Welp apparently he retired
4642,ham,Havent.
4674,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [440]:
train_number = round(0.8 * ds.shape[0])
train_number

4458

In [441]:
test_number = ds.shape[0]-train_number
test_number

1114

In [442]:
train = ds.sample(n=train_number, random_state=1)
test = ds[~ds.index.isin(train.index)]

train = train.reset_index().drop(columns='index')
test = test.reset_index().drop(columns='index')
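The same shuffle-then-split idea can be sketched without pandas. This is only a minimal illustration on stand-in data: the integers are placeholders for the messages, and `split_dataset` is a hypothetical helper that is not part of the notebook.

```python
import random

def split_dataset(items, train_fraction=0.8, seed=1):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder data: 5,572 integers standing in for the 5,572 messages
train_items, test_items = split_dataset(range(5572))
print(len(train_items), len(test_items))  # 4458 1114
```

Because the two slices come from one shuffled list, no item can end up in both sets.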

#### Let's check the percentage of spam and ham in both the training and the test set

In [443]:
# train spam percentage
sum(train.label=='spam')/len(train) * 100

13.324360699865412

In [444]:
# test spam percentage
sum(test.label=='spam')/len(test) * 100

13.734290843806104

#### ^ Looks good. The percentages are similar to what we have in the full dataset.

### Data cleaning 
This is how our training data set looks now:

In [445]:
train.head(3)

Unnamed: 0,label,text
0,ham,Good night my dear.. Sleepwell&amp;Take care
1,ham,Sen told that he is going to join his uncle fi...
2,ham,Thank you baby! I cant wait to taste the real ...


To be able to easily count the words in the text messages, we want to transform the data into something like this:
<center><img src="GP13_Data_Cleaning_Goal.png" width="686"/></center>

Let's begin the data cleaning process by removing the punctuation and bringing all the words to lower case.

In [446]:
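# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
train['text'] = train['text'].str.replace(r'\W', ' ', regex=True).str.lower()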
train.head(3)

Unnamed: 0,label,text
0,ham,good night my dear sleepwell amp take care
1,ham,sen told that he is going to join his uncle fi...
2,ham,thank you baby i cant wait to taste the real ...
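To see what the substitution does to a single message, here is a small standalone sketch using Python's `re` module on the first training message shown earlier. The variable names are illustrative only.

```python
import re

raw = 'Good night my dear.. Sleepwell&amp;Take care'

# Replace every non-word character with a space, then lowercase
cleaned = re.sub(r'\W', ' ', raw).lower()

# split() with no argument collapses runs of whitespace, whereas
# split(' ') keeps an empty string for each extra space
tokens = cleaned.split()
tokens_with_empties = cleaned.split(' ')

print(tokens)
print(tokens_with_empties.count(''))
```

Each punctuation mark becomes its own space, so a message can contain several spaces in a row. Splitting on a single space then yields empty strings, which is why the empty string appears as a "word" in the vocabulary built later in this notebook.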


Let's create a list with all of the unique words that occur in the messages of our training set.

In [447]:
vocab = []
train['split'] = train['text'].str.split(' ')
for index, row in train.iterrows():
    vocab.extend(row['split'])
vocab = list(set(vocab))

In [448]:
len(vocab)

7713

In [449]:
vocab[:5]

['', 'language', 'surfing', 'coccooning', 'prestige']
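As a toy illustration of the vocabulary step, the sketch below collects the unique words of three made-up messages with a `set`. The messages are invented for this example and do not come from the dataset.

```python
# Made-up messages, already cleaned and lowercased
messages = [
    'secret prize claim now',
    'coming to my secret party',
    'winner claim your prize',
]

vocabulary = set()
for message in messages:
    vocabulary.update(message.split())

# Sets are unordered, so sort to get a reproducible listing
vocabulary = sorted(vocabulary)
print(len(vocabulary))
print(vocabulary)
```

The three messages contain 13 words in total but only 10 distinct ones, since "secret", "prize" and "claim" each occur twice.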

Let's create the *word_counts_per_sms* dictionary and convert it to a DataFrame.

In [450]:
word_counts_per_sms = {unique_word: [0] * len(train['split']) for unique_word in vocab}

for index, sms in enumerate(train['split']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

# just peeking at what we've got
keys = list(word_counts_per_sms.keys())
for key in keys[:5]:
    print(key, ':', word_counts_per_sms[key][:5])

 : [2, 0, 4, 1, 2]
language : [0, 0, 0, 0, 0]
surfing : [0, 0, 0, 0, 0]
coccooning : [0, 0, 0, 0, 0]
prestige : [0, 0, 0, 0, 0]
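The structure of this dictionary is easier to see on a tiny example. The sketch below applies the same counting loop to three made-up, already tokenized messages; the data is invented for illustration.

```python
# Made-up tokenized messages
messages = [
    ['secret', 'secret', 'prize'],
    ['secret', 'party'],
    ['prize', 'now'],
]
vocabulary = sorted({word for message in messages for word in message})

# One list per word, with one counter per message
word_counts = {word: [0] * len(messages) for word in vocabulary}
for index, message in enumerate(messages):
    for word in message:
        word_counts[word][index] += 1

for word in vocabulary:
    print(word, ':', word_counts[word])
```

Each key maps to a list as long as the number of messages, so position `i` of every list describes message `i`. That layout is what allows the dictionary to be turned directly into a DataFrame with one column per word.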


In [451]:
word_counts_df = pd.DataFrame(word_counts_per_sms)
word_counts_df.head()

Unnamed: 0,Unnamed: 1,language,surfing,coccooning,prestige,taylor,refund,negative,sehwag,glands,...,pain,2stoptxt,3ss,molested,specialisation,087016248,popping,rose,swt,myspace
0,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now let's concatenate *word_counts_df* with *train* so that we also have the label and text columns.

In [452]:
train.head()

Unnamed: 0,label,text,split
0,ham,good night my dear sleepwell amp take care,"[good, night, my, dear, , , sleepwell, amp, ta..."
1,ham,sen told that he is going to join his uncle fi...,"[sen, told, that, he, is, going, to, join, his..."
2,ham,thank you baby i cant wait to taste the real ...,"[thank, you, baby, , i, cant, wait, to, taste,..."
3,ham,when can ü come out,"[when, can, ü, come, out, ]"
4,ham,no thank you you ve been wonderful,"[no, , thank, you, , you, ve, been, wonderful]"


In [453]:
train_full = pd.concat([train, word_counts_df], axis=1)
train_full[['label','text','message','good','you']].head()

Unnamed: 0,label,text,text.1,message,good,you
0,ham,good night my dear sleepwell amp take care,0,0,1,0
1,ham,sen told that he is going to join his uncle fi...,0,0,0,0
2,ham,thank you baby i cant wait to taste the real ...,0,0,0,1
3,ham,when can ü come out,0,0,0,0
4,ham,no thank you you ve been wonderful,0,0,0,2


In [454]:
train_full.shape 

(4458, 7716)

In [455]:
train.shape

(4458, 3)

Now we can see that we need to rename the original 'text' and 'split' columns, because 'text' and 'split' also occur as words in the vocabulary. We can't use the DataFrame.rename() function, as it would rename both occurrences of 'text' and 'split'. So let's replace the column names by position instead.

In [456]:
cols = list(train_full.columns)
cols[:5]

['label', 'text', 'split', '', 'language']

In [457]:
print(cols[1], cols[2])

text split


In [458]:
cols[1] = 'original_text'
cols[2] = 'split_text'
cols[:5]

['label', 'original_text', 'split_text', '', 'language']

In [459]:
train_full.columns = cols

In [460]:
train_full[['label','original_text','split_text','text','good','you']].head()

Unnamed: 0,label,original_text,split_text,text,good,you
0,ham,good night my dear sleepwell amp take care,"[good, night, my, dear, , , sleepwell, amp, ta...",0,1,0
1,ham,sen told that he is going to join his uncle fi...,"[sen, told, that, he, is, going, to, join, his...",0,0,0
2,ham,thank you baby i cant wait to taste the real ...,"[thank, you, baby, , i, cant, wait, to, taste,...",0,0,1
3,ham,when can ü come out,"[when, can, ü, come, out, ]",0,0,0
4,ham,no thank you you ve been wonderful,"[no, , thank, you, , you, ve, been, wonderful]",0,0,2


#### ^ Looks good now

### Calculating the probabilities
As per the Naive Bayes algorithm, we will need to know the probability values of the two equations below to be able to classify new messages:

<img align="center" width="565" src="GP13_Formulas_1.png"/>

To calculate P(wi|Spam) and P(wi|Ham) inside the formulas above, we will use these equations:

<img align="center" width="410" src="GP13_Formulas_2.png"/>
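For readers who cannot view the images, the four equations can be written out as follows. This transcription is inferred from the code later in this notebook rather than copied from the images, so the exact notation is an assumption.

$$P(Spam \mid w_1, w_2, \ldots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)$$

$$P(Ham \mid w_1, w_2, \ldots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)$$

$$P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$$

$$P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$$

Here $N_{w_i \mid Spam}$ is the number of times the word $w_i$ occurs in spam messages, $N_{Spam}$ is the total number of words in spam messages, $N_{Vocabulary}$ is the number of unique words in the training set, and $\alpha$ is the smoothing parameter. The ham quantities are defined in the same way.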

Some of the terms in the four equations above will have the same value for every new message. As a start, let's first calculate:

- P(Spam) and P(Ham)
- Nspam, Nham, Nvocabulary

In [461]:
Pspam = len(train[train['label']=='spam'])/len(train)
Pspam

0.13324360699865412

In [462]:
Pham = len(train[train['label']=='ham'])/len(train)
Pham

0.8667563930013459

In [463]:
Nspam = train[train['label']=='spam']['split'].apply(len).sum()
Nspam

17879

In [464]:
Nham = train[train['label']=='ham']['split'].apply(len).sum()
Nham

70873

In [465]:
Nvoc = len(vocab)
Nvoc

7713
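To make these five constants concrete, the sketch below computes them for four made-up labeled messages in plain Python. The data and variable names are invented for illustration and are unrelated to the values above.

```python
# Made-up labeled, tokenized messages
data = [
    ('spam', ['secret', 'prize', 'claim', 'now']),
    ('ham', ['coming', 'to', 'my', 'secret', 'party']),
    ('spam', ['winner', 'claim', 'your', 'prize']),
    ('ham', ['see', 'you', 'at', 'the', 'party']),
]

# Class priors: share of messages carrying each label
p_spam = sum(label == 'spam' for label, _ in data) / len(data)
p_ham = sum(label == 'ham' for label, _ in data) / len(data)

# Total number of words (not unique words) in each class
n_spam = sum(len(words) for label, words in data if label == 'spam')
n_ham = sum(len(words) for label, words in data if label == 'ham')

# Number of unique words across all messages
n_vocabulary = len({word for _, words in data for word in words})

print(p_spam, p_ham, n_spam, n_ham, n_vocabulary)  # 0.5 0.5 8 10 14
```

Note that `n_spam` and `n_ham` count every occurrence of a word, while `n_vocabulary` counts each distinct word once.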

We'll also use Laplace smoothing and set α = 1

In [466]:
alpha = 1

Next, we will calculate the probability values P(wi|Spam) and P(wi|Ham) for every word wi in the training dataset. We will save them in the *P_spam_dict* and *P_ham_dict* dictionaries, where the words are the keys and the corresponding probabilities are the values.

In [482]:
P_spam_dict = {word: 0 for word in vocab}
P_ham_dict = {word: 0 for word in vocab}

spam_set = train_full[train_full['label']=='spam']
ham_set = train_full[train_full['label']=='ham']

for wi in vocab:
    Nwi_spam = sum(spam_set[wi])
    Nwi_ham = sum(ham_set[wi])
    P_spam_dict[wi] = (Nwi_spam + alpha)/(Nspam + alpha*Nvoc)
    P_ham_dict[wi] = (Nwi_ham + alpha)/(Nham + alpha*Nvoc)
    
#peeking into the result:
for word in vocab[:5]:
    print(word, '\t', P_spam_dict[word], '\t', P_ham_dict[word])

 	 0.10698655829946858 	 0.17476395286692287
language 	 3.907471084713973e-05 	 3.8174738503041256e-05
surfing 	 3.907471084713973e-05 	 7.634947700608251e-05
coccooning 	 3.907471084713973e-05 	 2.5449825668694168e-05
prestige 	 3.907471084713973e-05 	 2.5449825668694168e-05
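The effect of the smoothing term can be checked by hand. The sketch below wraps the formula in a hypothetical helper, `smoothed_probability`, which is not part of the notebook, and plugs in the Nspam, Nham and Nvoc values reported earlier.

```python
def smoothed_probability(n_word_in_class, n_class, n_vocabulary, alpha=1):
    """Laplace-smoothed estimate of P(word | class)."""
    return (n_word_in_class + alpha) / (n_class + alpha * n_vocabulary)

# Values reported earlier in this notebook
n_spam, n_ham, n_vocabulary = 17879, 70873, 7713

# A word with zero occurrences in spam still gets a non-zero probability
p_unseen_spam = smoothed_probability(0, n_spam, n_vocabulary)

# A word that occurs exactly once in ham
p_once_ham = smoothed_probability(1, n_ham, n_vocabulary)

print(p_unseen_spam)
print(p_once_ham)
```

The first value equals 1 / (17879 + 7713), which is consistent with the spam probability printed above for words such as "language". This suggests those words never occur in the spam training messages. Without smoothing, their probability would be zero and would cancel the whole product for any message containing them.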


#### ^ Looks good. The first output line corresponds to the empty-string "word".

### classify() function
Now that we have all the required parameters, let's write a test version of the classify() function.

In [468]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    #initial values
    p_spam_given_message = Pspam 
    p_ham_given_message = Pham
    
    for wi in message:
        # use the dictionaries built above; words not seen in training are skipped
        if wi in P_spam_dict:
            p_spam_given_message *= P_spam_dict[wi]
        if wi in P_ham_dict:
            p_ham_given_message *= P_ham_dict[wi]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

#### Testing the classify() function:

In [469]:
test1 = 'WINNER!! This is the secret code to unlock the money: C3421.'
classify(test1)

P(Spam|message): 4.6003314767982617e-26
P(Ham|message): 4.700196465926859e-28
Label: Spam


In [470]:
test2 = "Sounds good, Tom, then see u there"
classify(test2)

P(Spam|message): 4.883490791469427e-26
P(Ham|message): 1.1074088503460264e-21
Label: Ham


#### ^ Looks good!

### classify_test_set() function
Let's update the function to return the result instead of printing it, and apply it to the test dataset.

In [471]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = Pspam
    p_ham_given_message = Pham
 
    for wi in message:
        # use the dictionaries built above; words not seen in training are skipped
        if wi in P_spam_dict:
            p_spam_given_message *= P_spam_dict[wi]
        if wi in P_ham_dict:
            p_ham_given_message *= P_ham_dict[wi]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [472]:
test.head()

Unnamed: 0,label,text
0,ham,Welp apparently he retired
1,ham,Dai what this da.. Can i send my resume to thi...
2,ham,I am late. I will be there at
3,spam,Congrats! 2 mobile 3G Videophones R yours. cal...
4,ham,Ooooooh I forgot to tell u I can get on yovill...


In [473]:
test['predicted'] = test['text'].apply(classify_test_set)
test.head()

Unnamed: 0,label,text,predicted
0,ham,Welp apparently he retired,ham
1,ham,Dai what this da.. Can i send my resume to thi...,ham
2,ham,I am late. I will be there at,ham
3,spam,Congrats! 2 mobile 3G Videophones R yours. cal...,spam
4,ham,Ooooooh I forgot to tell u I can get on yovill...,ham


### Measuring accuracy
Now we can compare the predicted values with the actual values to measure how good our spam filter is at classifying new messages. To make the measurement, we'll use accuracy as a metric:

<img src="GP13_Accuracy.png" align="center" width="485">

In [474]:
correct = 0
total = len(test)

for index, row in test.iterrows():
    if row['label'] == row['predicted']:
        correct += 1
        
accuracy = correct / total
print('Accuracy: ', accuracy)

Accuracy:  0.9865350089766607
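The accuracy metric is simply the share of predictions that match the true labels. A minimal sketch on made-up labels, with names chosen for illustration only:

```python
# Made-up labels for four messages
actual = ['ham', 'spam', 'ham', 'ham']
predicted = ['ham', 'spam', 'spam', 'ham']

# Count the positions where prediction and true label agree
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

print(accuracy)  # 0.75
```

With the notebook's DataFrame, the same quantity could be obtained without an explicit loop as `(test['label'] == test['predicted']).mean()`.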


#### ^ Looks great!

### Summary and To-Do

In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 98.65% on the test set, which is an excellent result. We initially aimed for an accuracy of over 80%, but we managed to do way better than that.

If you want to keep working on this project, here are a few next steps you can take:

- Isolate the 15 messages that were classified incorrectly (1,099 of the 1,114 test messages were classified correctly) and try to figure out why the algorithm reached the wrong conclusions.
- Make the filtering process more complex by making the algorithm sensitive to letter case.
- Get the project portfolio-ready by using a few tips from our [style guide for data science projects](https://www.dataquest.io/blog/data-science-project-style-guide/).
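As a starting point for the first suggestion, the sketch below shows one way to isolate misclassified rows. It uses a made-up list of dictionaries in place of the `test` DataFrame; the rows and texts are placeholders, not real results.

```python
# Placeholder rows standing in for the test set with its predictions
rows = [
    {'label': 'ham', 'text': 'placeholder message one', 'predicted': 'ham'},
    {'label': 'spam', 'text': 'placeholder message two', 'predicted': 'ham'},
    {'label': 'ham', 'text': 'placeholder message three', 'predicted': 'spam'},
    {'label': 'spam', 'text': 'placeholder message four', 'predicted': 'spam'},
]

# Keep only the rows where the prediction disagrees with the true label
misclassified = [row for row in rows if row['label'] != row['predicted']]

for row in misclassified:
    print(row['label'], '->', row['predicted'], ':', row['text'])
```

On the actual DataFrame, the equivalent filter would be `test[test['label'] != test['predicted']]`.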
