# Building a Spam Filter with Naive Bayes

Nobody likes receiving spam. Spam filters are standard in e-mail, and they are very useful, but unwanted text messages can also reach our mobile phones. In this guided project, we will analyze a dataset of 5,572 SMS messages that have already been classified by humans. Our task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

## Exploring the Dataset

Let's start by reading in the dataset.


In [1]:
import pandas as pd

df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
# The data points are tab-separated, so we need the sep='\t' parameter.
# The dataset doesn't have a header row, so we pass header=None and supply the column names: Label and SMS.

In [2]:
df.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
print('Number of rows:')
print(df.shape[0])
print('\n')
print('Number of columns:')
print(df.shape[1])

Number of rows:
5572


Number of columns:
2


Our dataset contains only 2 columns: `Label` describes the type of text message, and `SMS` holds the message text. There are only 2 types of message here:

* spam
* ham ("ham" means non-spam)


In [4]:
df['Label'].value_counts(normalize=True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

As we see above, non-spam messages make up almost 87% of all messages.

## Training and Test Set

We're first going to split our dataset into two categories:

* A **training set**, which we'll use to "train" the computer how to classify messages.
* A **test set**, which we'll use to test how good the spam filter is at classifying new messages.

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

* The training set will have 4,458 messages (about 80% of the dataset).
* The test set will have 1,114 messages (about 20% of the dataset).
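
As a quick sanity check on those sizes, the arithmetic can be sketched as follows. The variable names are illustrative only and are not used elsewhere in this notebook.

```python
# Illustrative names; only the 5,572 total and the 80/20 split come from the text.
n_messages = 5572
train_fraction = 0.8

n_train = round(n_messages * train_fraction)  # 4457.6 rounds up to 4458
n_test = n_messages - n_train                 # the remainder goes to the test set

print(n_train, n_test)
```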

In [5]:
df_sample = df.sample(frac=1, random_state=1)
# Use the frac=1 parameter to randomize the entire dataset.
# Use the random_state=1 parameter to make sure your results are reproducible.

df_sample = df_sample.reset_index(drop=True)  # reset the index; drop=True avoids inserting the old index as a column
df_sample.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


Our dataset has been randomized, so now we are ready to split it into 2 datasets, training and test, as mentioned above.

In [6]:
df_training = df_sample.iloc[:4458, :].copy()
df_test = df_sample.iloc[4458:, :].copy()  # use copy to avoid SettingWithCopyWarning

print('df_training rows:')
print(df_training.shape[0])
print('\n')
print('df_test rows:')
print(df_test.shape[0])

df_training rows:
4458


df_test rows:
1114


In [7]:
print('df_training:')
print(df_training['Label'].value_counts(normalize=True) * 100)
print('\n')
print('df_test:')
print(df_test['Label'].value_counts(normalize=True) * 100)

df_training:
ham     86.54105
spam    13.45895
Name: Label, dtype: float64


df_test:
ham     86.804309
spam    13.195691
Name: Label, dtype: float64


After randomizing the dataset and splitting it into training and test sets in an 80:20 proportion, we can see that the percentages of spam and ham in both the training and the test set are similar to what we have in the full dataset.


## Letter Case and Punctuation

Let's begin the data cleaning process by removing the punctuation and bringing all the words to lower case.

In [8]:
df_training['SMS'] = df_training['SMS'].str.replace(r'\W', ' ', regex=True)  # regex=True is needed in pandas 2.0+, where the default is a literal match
df_training['SMS'] = df_training['SMS'].str.lower()
df_training.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
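
To see what the `\W` pattern does on a single message, here is a minimal sketch using Python's `re` module. The helper name `clean_message` is an assumption made for this illustration and is not part of the notebook.

```python
import re

def clean_message(message):
    """Replace every non-word character with a space, then lowercase."""
    # \W matches anything that is not a letter, digit, or underscore.
    return re.sub(r'\W', ' ', message).lower()

cleaned = clean_message("Yep, by the pretty sculpture")
print(cleaned.split())
```

Each punctuation mark becomes a space rather than being deleted, so the cleaned string can contain runs of spaces. Calling `split()` with no arguments collapses them.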


## Creating the Vocabulary

First, we want to know all the words in each SMS of our dataset. To do that, we'll split each message into a list of words and then iterate over the `SMS` column. Finally, we add each word to the list `vocabulary` and convert it to a set (and back to a list), because we only want unique words.

In [9]:
df_training['SMS'] = df_training['SMS'].str.split()

vocabulary = []

for sms in df_training['SMS']:
    for word in sms:
        vocabulary.append(word)  
        
vocabulary = list(set(vocabulary))

In [10]:
len(vocabulary)

7783

## The Final Training Set

In [11]:
word_counts_per_sms = {unique_word: [0] * len(df_training['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(df_training['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [12]:
word_counts_per_sms_df = pd.DataFrame(word_counts_per_sms)

In [13]:
df_training = pd.concat([df_training, word_counts_per_sms_df], axis=1)
df_training.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
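
The same counting logic is easier to follow on a tiny example. The sketch below uses three made-up tokenized messages (all names prefixed with `toy_` are hypothetical) and builds the word-count table in the same way as the cells above.

```python
import pandas as pd

# Hypothetical, already tokenized messages.
toy_sms = [['free', 'entry', 'free'], ['see', 'you'], ['free', 'you']]
toy_vocabulary = sorted({word for sms in toy_sms for word in sms})

# One list of zeros per word, one slot per message.
toy_counts = {word: [0] * len(toy_sms) for word in toy_vocabulary}
for index, sms in enumerate(toy_sms):
    for word in sms:
        toy_counts[word][index] += 1

toy_counts_df = pd.DataFrame(toy_counts)
print(toy_counts_df)
```

Each row corresponds to one message and each column to one vocabulary word. Note that `pd.concat(..., axis=1)` aligns rows by index label, so the concatenation above relies on `df_training` having the default index 0 to 4457, which it has because the index was reset before the split.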


## Calculating Constants First

In [14]:
#calculating  P(Spam) and P(Ham)
p_spam = (df_training['Label'] == 'spam').sum() / df_training.shape[0]
p_ham = (df_training['Label'] == 'ham').sum() / df_training.shape[0]

print('p_spam: ')
print(p_spam)
print('\n')
print('p_ham: ')
print(p_ham)

p_spam: 
0.13458950201884254


p_ham: 
0.8654104979811574


In [15]:
df_training.shape

(4458, 7785)

In [16]:
n_spam = df_training[df_training['Label'] == 'spam']['SMS'].apply(len)
n_spam = n_spam.sum()

n_ham = df_training[df_training['Label'] == 'ham']['SMS'].apply(len)
n_ham = n_ham.sum()

n_vocabulary = len(vocabulary)

alpha = 1

## Calculating Parameters

In [17]:
spam_dict = {unique_word : 0 for unique_word in vocabulary}
ham_dict = {unique_word : 0 for unique_word in vocabulary}

spam_df = df_training[df_training['Label'] == 'spam'].copy()
ham_df = df_training[df_training['Label'] == 'ham'].copy()

for word in vocabulary:
    n_word_spam = spam_df[word].sum()
    n_word_spam = (n_word_spam + alpha) / (n_spam + (alpha * n_vocabulary))
    spam_dict[word] = n_word_spam
    
    n_word_ham = ham_df[word].sum()
    n_word_ham = (n_word_ham + alpha) / (n_ham + (alpha * n_vocabulary))
    ham_dict[word] = n_word_ham
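
The loop above applies additive (Laplace) smoothing. Written out with the quantities used in the code (`alpha`, `n_spam`, `n_ham`, `n_vocabulary`), the parameter for each word $w_i$ in the vocabulary is:

$$
P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\qquad
P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
$$

Here $N_{w_i \mid Spam}$ is the number of times $w_i$ occurs in spam messages (`spam_df[word].sum()`), $N_{Spam}$ is the total number of words in all spam messages, and $N_{Vocabulary}$ is the number of unique words. With $\alpha = 1$, a word that never appears in one class still gets a small non-zero probability instead of zeroing out the whole product.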

## Classifying A New Message

In [18]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()      
   

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]
        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]
       

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [19]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [20]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


Judging by the 2 example messages above, our algorithm looks promising. We are ready to check it on our test dataset!
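
The probabilities printed above are already very small (around 1e-25) because many values below 1 are multiplied together. For long messages such products can underflow to 0.0 for both classes, which would produce a tie. A common alternative, which is not the approach used in this notebook, is to add logarithms instead of multiplying. The sketch below assumes a hypothetical function `classify_log` and made-up toy parameters. It only illustrates the idea.

```python
import math
import re

def classify_log(message, p_spam, p_ham, spam_dict, ham_dict):
    """Compare log-probabilities; words missing from the dictionaries are ignored."""
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        if word in spam_dict:
            log_spam += math.log(spam_dict[word])
        if word in ham_dict:
            log_ham += math.log(ham_dict[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

# Hypothetical toy parameters, not values computed from the dataset.
toy_spam = {'win': 0.3, 'money': 0.3, 'see': 0.05, 'you': 0.1}
toy_ham = {'win': 0.02, 'money': 0.03, 'see': 0.3, 'you': 0.3}

print(classify_log('Win money!!', 0.13, 0.87, toy_spam, toy_ham))
print(classify_log('See you', 0.13, 0.87, toy_spam, toy_ham))
```

Because the logarithm is monotonic, comparing sums of logs gives the same decision as comparing the products whenever the products do not underflow.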

## Measuring the Spam Filter's Accuracy

First off, we'll write a variant of the `classify()` function, `classify_test_set()`, that returns the labels instead of printing them. Below, note that we now have `return` statements instead of `print()` calls.

We are going to create a new column, `predicted`, and then compare the predicted values with the actual values to measure how good our spam filter is at classifying new messages. To make the measurement, we'll use accuracy as a metric: the number of correctly classified messages divided by the total number of classified messages.


In [21]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]
        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [22]:
df_test['predicted'] = df_test['SMS'].apply(classify_test_set)
df_test.head()

Unnamed: 0,Label,SMS,predicted
4458,ham,Later i guess. I needa do mcat study too.,ham
4459,ham,But i haf enuff space got like 4 mb...,ham
4460,spam,Had your mobile 10 mths? Update to latest Oran...,spam
4461,ham,All sounds good. Fingers . Makes it difficult ...,ham
4462,ham,"All done, all handed in. Don't know if mega sh...",ham


In [23]:
correct = 0
total = len(df_test)

for row in df_test.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:        
        correct += 1
        
print(correct)       
print(total)

1100
1114


In [24]:
accuracy = correct/total * 100
print('Accuracy:', round(accuracy, 2), '%')

Accuracy: 98.74 %
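
The loop above counts matches row by row. An equivalent vectorized sketch compares the two columns directly and takes the mean of the resulting booleans. The small DataFrame `toy_test` is hypothetical and stands in for `df_test`.

```python
import pandas as pd

# Hypothetical stand-in for df_test with the same two column names.
toy_test = pd.DataFrame({
    'Label': ['ham', 'spam', 'ham', 'spam'],
    'predicted': ['ham', 'spam', 'spam', 'spam'],
})

# True counts as 1 and False as 0, so the mean is the share of correct predictions.
toy_accuracy = (toy_test['Label'] == toy_test['predicted']).mean() * 100
print('Accuracy:', round(toy_accuracy, 2), '%')
```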


In [25]:
incorrect_df = df_test[df_test['Label'] != df_test['predicted']]
incorrect_df

Unnamed: 0,Label,SMS,predicted
4572,spam,Not heard from U4 a while. Call me now am here...,ham
4593,spam,More people are dogging in your area now. Call...,ham
4610,ham,Unlimited texts. Limited minutes.,spam
4617,ham,26th OF JULY,spam
4742,ham,Nokia phone is lovly..,spam
4751,ham,A Boy loved a gal. He propsd bt she didnt mind...,needs human classification
4760,ham,No calls..messages..missed calls,spam
4777,ham,We have sent JD for Customer Service cum Accou...,spam
4962,spam,Oh my god! I've found your number again! I'm s...,ham
5004,spam,"Hi babe its Chloe, how r u? I was smashed on s...",ham
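
A single accuracy figure does not show which kind of mistake was made: ham flagged as spam, or spam let through as ham. One way to break the errors down is to cross-tabulate actual against predicted labels with `pd.crosstab`. The sketch below uses a small hypothetical DataFrame in place of `df_test`.

```python
import pandas as pd

# Hypothetical labels, not taken from the dataset.
toy_results = pd.DataFrame({
    'Label': ['ham', 'ham', 'ham', 'spam', 'spam'],
    'predicted': ['ham', 'spam', 'ham', 'spam', 'ham'],
})

# Rows are the actual labels, columns are the predicted labels.
toy_table = pd.crosstab(toy_results['Label'], toy_results['predicted'])
print(toy_table)
```

The off-diagonal cells are the errors. For a spam filter, ham predicted as spam is usually the more costly mistake, since a legitimate message gets hidden.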


## Conclusions

In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 98.74% on the test set, which is an excellent result. We initially aimed for an accuracy of over 80%, but we managed to do way better than that.

We also noticed that 13 messages in our test dataset were classified incorrectly, and one more message needs human classification.