# Filtering spam messages with a Naive Bayes classifier

#### The goal of this project is to train a model based on Bayes' theorem to detect whether a message is spam or not.
- Preparing the data
- Extracting the features

## 1 - Load the messages and the labels from the txt file

In [2]:
import pandas as pd
import csv

# Function which returns a dataframe from our text file.
# Each line holds a label ("ham" or "spam"), a tab, then the message text.
def create_dataframe_from_file(filename):
    # QUOTE_NONE keeps quote characters inside the messages as ordinary text
    df = pd.read_csv(filename, sep='\t', names=['Status','Message'],
                     encoding='utf-8', quoting=csv.QUOTE_NONE)
    return df

messages_df = create_dataframe_from_file("messages.txt")
messages_df.head(10)

|   | Status | Message |
|---|--------|---------|
| 0 | ham | Yup i've finished c ü there... |
| 1 | ham | Remember to ask alex about his pizza |
| 2 | ham | No da..today also i forgot.. |
| 3 | ham | Ola would get back to you maybe not today but ... |
| 4 | ham | Fwiw the reason I'm only around when it's time... |
| 5 | ham | Hello, my boytoy! I made it home and my consta... |
| 6 | ham | Congrats kano..whr s the treat maga? |
| 7 | ham | Who u talking about? |
| 8 | ham | Yup... |
| 9 | ham | Ok... |
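
As a small illustration of the loading step, the sketch below parses two made-up lines in the same `label<TAB>message` layout from an in-memory string instead of `messages.txt`. The sample text and the name `sample_df` are assumptions for the example only. With `quoting=csv.QUOTE_NONE`, a message that starts with a double quote is kept as plain text rather than being read as a quoted field.

```python
import csv
import io

import pandas as pd

# Two made-up lines in the "label<TAB>message" layout.
# The second message starts with a double quote that is never closed.
sample = 'ham\tSee you at 5\nspam\t"Win a free prize now\n'

sample_df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    names=["Status", "Message"],
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
)
print(sample_df)
```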


## 2 - Clean the dataset and get info on it 

In [3]:
# We replace our "Ham" and "Spam" labels by 0s and 1s
cleaned_messages_df = messages_df.replace({'Status': {'ham': 0, 'spam': 1}})

# We get the number of text messages and the spam/ham count in our df
df_size = cleaned_messages_df.shape[0]
num_spams = (cleaned_messages_df.Status == 1).sum()
num_hams = (cleaned_messages_df.Status == 0).sum()
print("Total # of messages :",df_size)
print("# of spams :",num_spams)
print("# of non spams :",num_hams)
print(cleaned_messages_df.head())

Total # of messages : 5000
# of spams : 672
# of non spams : 4328
   Status                                            Message
0       0                     Yup i've finished c ü there...
1       0               Remember to ask alex about his pizza
2       0                       No da..today also i forgot..
3       0  Ola would get back to you maybe not today but ...
4       0  Fwiw the reason I'm only around when it's time...
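
The counts above mean that 672 of 5000 messages (13.44%) are spam, so the classes are imbalanced. The sketch below shows one possible way to do the same label encoding with `Series.map` and to read the spam share directly from the encoded column. The toy rows and the names `toy_df` and `spam_share` are made up for the example.

```python
import pandas as pd

# Made-up rows in the same shape as messages_df
toy_df = pd.DataFrame({
    "Status": ["ham", "spam", "ham"],
    "Message": ["see you soon", "win a prize", "ok"],
})

# Map the text labels to integers: ham -> 0, spam -> 1
toy_df["Status"] = toy_df["Status"].map({"ham": 0, "spam": 1})

# With 0/1 labels, the mean of the column is the share of spam messages
spam_share = toy_df["Status"].mean()
print(toy_df["Status"].tolist(), spam_share)
```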


## 3 - Split the data in training and testing examples
- The spam labels were replaced by 1 and the ham labels by 0 in the previous step
- We then arbitrarily split the data 70/30: the first 70% of the rows are used for training and the remaining 30% for testing

In [4]:
# We use 70% of the messages for training
train_size = round(df_size*0.70)
test_size = df_size - train_size

# We create the train and test datasets
df_train = cleaned_messages_df[:train_size]
df_test = cleaned_messages_df[train_size:]

print("There are {} training examples.".format(train_size))
print("There are {} testing examples.".format(test_size))

There are 3500 training examples.
There are 1500 testing examples.
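
The split above takes the rows in file order. If the file were sorted in some way, one possible variation would be to shuffle the rows first. The sketch below is only an assumption about how that could look, not what the notebook does; the toy data, the seed `42` and the names `toy_df`, `toy_train` and `toy_test` are made up.

```python
import pandas as pd

# Toy labelled data standing in for cleaned_messages_df (values are made up)
toy_df = pd.DataFrame({
    "Status": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
    "Message": ["message {}".format(i) for i in range(10)],
})

# Shuffle the rows with a fixed seed so the split is reproducible,
# then renumber the index from 0 so positional and label indexing agree
shuffled = toy_df.sample(frac=1, random_state=42).reset_index(drop=True)

train_size = round(len(shuffled) * 0.70)
toy_train = shuffled[:train_size]
toy_test = shuffled[train_size:]

print(len(toy_train), len(toy_test))
```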


## 4 - Create a dictionary from the words in the training dataset

In [5]:
from collections import Counter

# Create the dictionary (vocabulary) from the messages of a dataset
def make_Dictionary(dataset):
    # List of all the words
    all_words = []
    # Loop through the whole dataset
    for index, row in dataset.iterrows():
        # Get each message
        full_line = row['Message']
        # Split it on whitespace to obtain all the words
        words = full_line.split()
        # Add them to our list containing all the words
        all_words += words
    # Counter maps each distinct word to the number of times it occurs
    dictionary = Counter(all_words)

    # Drop the tokens that are not purely alphabetic or that are a single character
    for item in list(dictionary):
        if not item.isalpha() or len(item) == 1:
            del dictionary[item]
    # Keep the 3000 most common words as a list of (word, count) tuples
    dictionary = dictionary.most_common(3000)
    return dictionary

training_dict = make_Dictionary(df_train)
print("# of words : ", len(training_dict))

# of words :  3000
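
To make the filtering rule concrete, the sketch below applies the same two conditions (purely alphabetic, longer than one character) to three made-up messages. The names `toy_messages` and `vocabulary` are assumptions for the example. Note that words are compared exactly as written, so `Free` and `free` are counted as two different words.

```python
from collections import Counter

# Three made-up messages
toy_messages = ["Free entry now", "call me now", "I am free at 5 ok?"]

# Count every whitespace-separated token
counts = Counter(word for message in toy_messages for word in message.split())

# Keep purely alphabetic tokens longer than one character
vocabulary = Counter({w: c for w, c in counts.items() if w.isalpha() and len(w) > 1})

print(vocabulary.most_common(3))
```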


## 5 - Extract features from both the training data and test data.

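For each message, this means checking every word to see whether it appears in the dictionary. Each message becomes a vector of length 3000 whose entries hold the number of times the corresponding dictionary word occurs in the message.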

In [6]:
import numpy as np

def extract_features(dataset,dictionary):
    # One row per message, one column per dictionary word
    features_matrix = np.zeros((len(dataset), len(dictionary)))
    # Map each dictionary word to its column index for direct lookup
    word_to_index = {d[0]: i for i, d in enumerate(dictionary)}
    docID = 0
    for index, row in dataset.iterrows():
        # Get each message
        full_line = row['Message']
        # Split it on whitespace to obtain all the words
        words = full_line.split()
        # Go through all the words in the message
        for word in words:
            # If the word is in the dictionary, store its number of occurrences
            if word in word_to_index:
                wordID = word_to_index[word]
                features_matrix[docID, wordID] = words.count(word)
        # Now work on the next message
        docID = docID + 1
    return features_matrix

extract_feats_train = extract_features(df_train,training_dict)
extract_feats_test = extract_features(df_test,training_dict)
print("The shape of the feature dataframe for training is",extract_feats_train.shape)
print("The shape of the feature dataframe for testing is",extract_feats_test.shape)

The shape of the feature dataframe for training is (3500, 3000)
The shape of the feature dataframe for testing is (1500, 3000)
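
As a small worked example of such a count vector, the sketch below uses a made-up three-word dictionary in the same `(word, count)` layout returned by `most_common`. The names `toy_dictionary` and `message_to_vector` are hypothetical and only serve the illustration.

```python
import numpy as np

# Made-up dictionary: (word, count) pairs, most common first
toy_dictionary = [("free", 3), ("now", 2), ("call", 1)]
word_to_index = {word: i for i, (word, _) in enumerate(toy_dictionary)}

def message_to_vector(message):
    # One entry per dictionary word, counting its occurrences in the message
    vector = np.zeros(len(toy_dictionary))
    for word in message.split():
        if word in word_to_index:
            vector[word_to_index[word]] += 1
    return vector

# "now" appears twice, "call" once, "free" not at all; "please" is ignored
print(message_to_vector("call now now please"))
```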


## 6 - Use the Naive Bayes theorem to build and fit a model from the training data

We want to estimate the probability that a message is spam or ham given the words it contains:
![title](formulas.png)
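
As a hedged restatement in text form, and assuming the Bernoulli (word present / word absent) model with add-one smoothing that the `build_model` code appears to implement, let $x_i \in \{0, 1\}$ indicate whether dictionary word $w_i$ occurs in the message. Bayes' theorem with the naive independence assumption then gives:

$$
P(\text{spam} \mid x) = \frac{P(\text{spam}) \prod_i P(w_i \mid \text{spam})^{x_i} \, \bigl(1 - P(w_i \mid \text{spam})\bigr)^{1 - x_i}}{P(x)}
$$

The per-word probabilities are estimated from the number of spam (or ham) training messages containing the word, matching the `N_S/N_tot_S` and `N_H/N_tot_H` columns:

$$
P(w_i \mid \text{spam}) = \frac{N_S(w_i) + 1}{N_{tot\_S} + 2}, \qquad P(w_i \mid \text{ham}) = \frac{N_H(w_i) + 1}{N_{tot\_H} + 2}
$$

Since $P(x)$ is the same for both classes, the predicted class can be chosen in log space as:

$$
\hat{c} = \arg\max_{c \in \{\text{ham}, \text{spam}\}} \Bigl[ \log P(c) + \sum_i x_i \log P(w_i \mid c) + (1 - x_i) \log\bigl(1 - P(w_i \mid c)\bigr) \Bigr]
$$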

In [7]:
import numpy as np
import pandas as pd

def build_model(train_features,targets,dictionnary):
    df_model = pd.DataFrame(columns=['Index','Word','N_S','N_H','N_H/N_tot_H','N_S/N_tot_S','Total_ham','Total_spam'])
    df_model['Word'] = [row[0] for row in dictionnary]
    df_model['N_S'] = 0
    df_model['N_H'] = 0
    df_model['N_H/N_tot_H'] = 0
    df_model['N_S/N_tot_S'] = 0
    df_model['Total_ham'] = 0
    df_model['Total_spam'] = 0
    # Fill the Index column with the position of each word in the dictionary
    for i in range(0,len(dictionnary),1):
        df_model.iloc[i,0] = i    
    # Go through the features and see for each word if they correspond to a spam or ham message    
    for vector_index, vector in enumerate(train_features): # Go through 3500 vectors
        # Now go through the feature vector to check if this word is in the message
        for word_index, zero_or_one in enumerate(vector): # Go through the 3000 ones and zeros
            # If it is a spam and the word is in the message
            if zero_or_one >= 1 and targets[vector_index] >= 1:
                df_model.iloc[word_index,2] = df_model.iloc[word_index,2] + 1
            # If it is a ham and the word is in the message
            if zero_or_one >= 1 and targets[vector_index] == 0:
                df_model.iloc[word_index,3] = df_model.iloc[word_index,3] + 1
                       
    total_spam = np.count_nonzero(targets)
    total_ham = len(targets) - total_spam

    # Class priors P(spam) and P(ham)
    proba_spam = total_spam/(total_ham+total_spam)
    proba_ham = total_ham/(total_ham+total_spam)
    
    for i in range(0,len(dictionnary),1):
        # Columns 4 and 5 contain respectively P(word|ham) and P(word|spam), with add-one smoothing
        df_model.iloc[i,4] = (df_model.iloc[i,3]+1) / (total_ham+2)
        df_model.iloc[i,5] = (df_model.iloc[i,2]+1) / (total_spam+2)
        # Columns 6 and 7 contain respectively P(word|ham)*P(ham) and P(word|spam)*P(spam)
        df_model.iloc[i,6] = (df_model.iloc[i,4])*(proba_ham)
        df_model.iloc[i,7] = (df_model.iloc[i,5])*(proba_spam)
    return df_model
                            
model = build_model(extract_feats_train,df_train['Status'],training_dict)
print(model)

      Index            Word  N_S  N_H  N_H/N_tot_H  N_S/N_tot_S  Total_ham  \
0         0              to  271  742     0.244730     0.581197   0.212146   
1         1             you  103  686     0.226285     0.222222   0.196156   
2         2             the   96  513     0.169302     0.207265   0.146760   
3         3             and   60  361     0.119236     0.130342   0.103360   
4         4              in   33  410     0.135375     0.072650   0.117351   
...     ...             ...  ...  ...          ...          ...        ...   
2995   2995             Lil    0    1     0.000659     0.002137   0.000571   
2996   2996         Thinkin    0    1     0.000659     0.002137   0.000571   
2997   2997         showers    0    1     0.000659     0.002137   0.000571   
2998   2998  possessiveness    0    1     0.000659     0.002137   0.000571   
2999   2999          poured    0    1     0.000659     0.002137   0.000571   

      Total_spam  
0       0.077382  
1       0.029587  
2     
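
The per-word counts and smoothed probabilities in this table can also be computed without explicit loops. The sketch below is a minimal NumPy version on made-up data, not the notebook's implementation. It differs from the notebook in two labelled ways: it treats every feature as present/absent (counts above 1 are reduced to 1), and it adds the class log-priors when scoring. The function names `fit_bernoulli_nb` and `predict_bernoulli_nb` are hypothetical.

```python
import numpy as np

# Toy features: rows are messages, columns are dictionary words (made up)
X_train = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
])
y_train = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

def fit_bernoulli_nb(X, y):
    present = X >= 1
    n_spam = int((y == 1).sum())
    n_ham = int((y == 0).sum())
    # Number of spam / ham messages that contain each word
    word_in_spam = present[y == 1].sum(axis=0)
    word_in_ham = present[y == 0].sum(axis=0)
    # Add-one smoothing with two outcomes per word (present / absent)
    p_word_ham = (word_in_ham + 1) / (n_ham + 2)
    p_word_spam = (word_in_spam + 1) / (n_spam + 2)
    # Row 0 = ham, row 1 = spam, so argmax returns the 0/1 label directly
    p_word = np.vstack([p_word_ham, p_word_spam])
    log_prior = np.log(np.array([n_ham, n_spam]) / len(y))
    return p_word, log_prior

def predict_bernoulli_nb(X, p_word, log_prior):
    present = (X >= 1).astype(float)
    # Log-likelihood of every message under each class, shape (n_messages, 2)
    log_like = present @ np.log(p_word).T + (1 - present) @ np.log(1 - p_word).T
    return np.argmax(log_like + log_prior, axis=1)

p_word, log_prior = fit_bernoulli_nb(X_train, y_train)
X_new = np.array([[1, 0, 0], [0, 0, 1]])
print(predict_bernoulli_nb(X_new, p_word, log_prior))
```

Working in log space avoids multiplying thousands of small probabilities together, which would underflow to zero for a 3000-word dictionary.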

## 7 - Use the model to make predictions for the test data

In [8]:
def predict_log_proba(extract_feats,model):
    # P(word|ham) and P(word|spam) as row vectors, one entry per dictionary word
    feature_prob_1 = model.loc[ : ,'N_H/N_tot_H'].values.reshape(1, -1)
    feature_prob_2 = model.loc[ : ,'N_S/N_tot_S'].values.reshape(1, -1)

    # Row 0 = ham, row 1 = spam
    tot = np.concatenate((feature_prob_1, feature_prob_2), axis=0)
    # Value of the 'Total_ham' column for the first word. It is added to the
    # scores of both classes, so it does not change which class scores highest.
    class_offset = model.iloc[0,6]

    matrix = []
    for x in extract_feats:
        # x holds the word counts of one message: log P(word|class) is weighted
        # by x and log(1 - P(word|class)) by |x - 1|, then summed over the words
        predict_vector = (np.log(tot) * x + np.log(1 - tot) * np.abs(x - 1)).sum(axis=1) + class_offset
        matrix.append(predict_vector)
    return matrix

def predict(result):
    # Index of the highest score for each message: 0 = ham, 1 = spam
    return np.argmax(result, axis=1)

def get_accuracy(predictions,labels):
    # Percentage of predictions that match the labels
    total = 0
    for i in range(len(predictions)):
        if predictions[i] == labels[i]:
            total = total + 1
    return (total/len(predictions))*100

result = predict_log_proba(extract_feats_test, model)
final_predictions = predict(result)
df_labels_ToList = df_test['Status'].tolist()
print("Total accuracy = ", get_accuracy(final_predictions,df_labels_ToList))

Total accuracy =  97.13333333333334


## 8 - Compute metrics to evaluate the performance of the model on the test data

In [15]:
cpt_true_negative = 0
cpt_false_negative = 0
cpt_true_positive = 0
cpt_false_positive = 0

for prediction, label in zip(final_predictions, df_labels_ToList):
    if prediction == label and prediction == 0:
        cpt_true_negative = cpt_true_negative + 1
    elif prediction != label and prediction == 0:
        cpt_false_negative = cpt_false_negative + 1
    elif prediction == label and prediction == 1:
        cpt_true_positive = cpt_true_positive + 1
    elif prediction != label and prediction == 1:
        cpt_false_positive = cpt_false_positive + 1

def compute_accuracy(tp, tn, fn, fp):
    return ((tp + tn) * 100)/ float( tp + tn + fn + fp)

def compute_precision(tp, fp):
    return (tp  * 100)/ float( tp + fp)

def compute_recall(tp, fn):
    return (tp  * 100)/ float( tp + fn)

def compute_f1_score(tp, tn, fn, fp):
    # Calculates the F1 score as a fraction between 0 and 1
    # (accuracy, precision and recall above are percentages)
    precision = compute_precision(tp, fp)/100
    recall = compute_recall(tp, fn)/100
    f1_score = (2*precision*recall)/ (precision + recall)
    return f1_score
        
accuracy = compute_accuracy(cpt_true_positive, cpt_true_negative, cpt_false_negative, cpt_false_positive)
precision = compute_precision(cpt_true_positive, cpt_false_positive)
recall = compute_recall(cpt_true_positive, cpt_false_negative)
f1 = compute_f1_score(cpt_true_positive, cpt_true_negative, cpt_false_negative, cpt_false_positive)
    
print("Accuracy :",accuracy)
print("Precision :",precision)
print("Recall :",recall)
print("F1-Score :",f1)

Accuracy : 97.13333333333334
Precision : 93.58288770053476
Recall : 84.95145631067962
F1-Score : 0.89058524173028
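
As a cross-check of the arithmetic, the sketch below recomputes the four metrics from a confusion matrix. The counts are not printed by the notebook: they are back-calculated here from the reported precision, recall and accuracy on 1500 test messages, and should be read as an inference rather than as notebook output.

```python
# Confusion-matrix counts inferred from the reported metrics (an assumption,
# not values printed by the notebook): 1500 test messages in total
tp, fp, fn, tn = 175, 12, 31, 1282

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was detected
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy :", accuracy * 100)
print("Precision :", precision * 100)
print("Recall :", recall * 100)
print("F1-Score :", f1)
```

Read this way, 12 ham messages were flagged as spam and 31 spam messages were missed, which is why recall is noticeably lower than precision.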
