# Bayes Theorem for Predicting the Probability of an Email Being Spam

S = Spam
w = Word

$P(Spam|w_{1}, w_{2},..., w_{n}) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_{i}|Spam)$

The probability that an email consisting of the words $w_{1}, w_{2},... w_{n}$ is proportional to the probability that any given email is spam multiplied by the product of each word's probability to appear in a spam email.



In [27]:
import pandas as pd
import math
import time

# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

pd.options.mode.chained_assignment = None  # default='warn'

PREDICTION = 'Prediction'
CLASSIFICATION = 'Classiciation'

## Functions

In [2]:
def count_vocab(emails):
    total_words = 0
    
    for index, row in emails.iterrows():
        total_words += sum(row.values[1:-2])
            
    return total_words

In [3]:
def calculate_word_spamicity(w_spam_count, vocab, spam_vocab):
    alpha = 1
    
    spamicity = (w_spam_count + alpha) / (spam_vocab + alpha * vocab)
    return spamicity

In [4]:
def build_word_spamicity_dict(spam_emails, vocab, spam_vocab):
    spam_word_appearances = {}
    
    for (column_name, column_data) in spam_emails.iteritems():
        if column_name != 'Email No.' and column_name != PREDICTION and column_name != CLASSIFICATION:
            spam_word_appearances[column_name] = sum(column_data.values)
            
    for word in spam_word_appearances:
        spam_word_appearances[word] = calculate_word_spamicity(spam_word_appearances[word], vocab, spam_vocab)
            
    return spam_word_appearances

In [14]:
def calculate_email(email, word_spamicities, word_hamicities, spam_proportion, ham_proportion, testing_data):
    email_spamicity = math.log(spam_proportion)
    email_hamicity = math.log(ham_proportion)
    
    for column in testing_data.columns[1:-2]:
        if email[column] > 0:
            email_spamicity += math.log(word_spamicities[column])*email[column]
            email_hamicity += math.log(word_hamicities[column])*email[column]
            
    return 1 if email_spamicity >= email_hamicity else 0
            

In [19]:
def calculate_accuracy(testing_emails):
    number_correct = 0
    for index, email in testing_emails.iterrows():
        if email[PREDICTION] == email[CLASSIFICATION]:
            number_correct += 1
        
    return number_correct / testing_emails.shape[0] * 100

In [29]:
def run_model(data, include_stop_words=True):
    if include_stop_words:
        # Take out all stopwords
        for col in df.columns:
            if col in stop_words:
                data.drop(col, axis=1, inplace=True)

    total_num_emails = data.shape[0]
    print(f'Total # Emails:{total_num_emails}')

    # Subtract 2 for "Email No." and "Prediction" columns
    total_vocab = len(data.columns) - 2
    print(f'Total Vocab: {total_vocab}')

    partition_size = total_num_emails//5

    end = 0
    begin = 0
    score_total = 0
    time_total = 0


    for i in range(1,6):
        start_time = time.time()
        end += partition_size

        if i == 5:
            testing_data = data.iloc[begin:].copy()
        else:
            testing_data = data.iloc[begin:end].copy()

        # This is where the model's prediction will be stored
        testing_data[CLASSIFICATION] = ""

        if i == 1:
            training_data = df.iloc[end:]
        elif i == 5:
            training_data = df.iloc[:begin]
        else:
            training_data_sections = []
            training_data_sections.append(data.iloc[:begin])
            training_data_sections.append(data.iloc[end:])
            training_data = pd.concat(training_data_sections)

        begin += partition_size
        print(f'\nBegin: {testing_data.at[testing_data.index[0],"Email No."]}')
        print(f'End: {testing_data.at[testing_data.index[-1],"Email No."]}')

        spam_proportion = training_data['Prediction'].value_counts()[1] / training_data.shape[0]
        print(f'% of spam emails: {spam_proportion}')

        ham_proportion = training_data['Prediction'].value_counts()[0] / training_data.shape[0]
        print(f'% of ham emails: {ham_proportion}')

        spam_training_emails = training_data.loc[training_data[PREDICTION] == 1]

        total_spam_words = count_vocab(spam_training_emails)
        print(f'total spam words: {total_spam_words}')

        ham_training_emails = training_data.loc[training_data[PREDICTION] == 0]

        total_ham_words = count_vocab(ham_training_emails)
        print(f'total ham words: {total_ham_words}')

        word_spamicities = build_word_spamicity_dict(spam_training_emails, total_vocab, total_spam_words)
        word_hamicities = build_word_spamicity_dict(ham_training_emails, total_vocab, total_ham_words)

        i = 0
        for index, email in testing_data.iterrows():
            testing_data[CLASSIFICATION].loc[testing_data.index[i]] = calculate_email(email, word_spamicities, word_hamicities, spam_proportion, ham_proportion, testing_data)
            i += 1

        score = calculate_accuracy(testing_data)
        print(f'Accuracy: {score}%')
        
        end_time = time.time()
        elapsed_time = end_time - start_time
        print(f'Elapsed Time: {elapsed_time}')
        
        score_total += score
        time_total += elapsed_time

    print(f'\nAverage Accuracy: {score_total/5}%')
    print(f'Average Time: {time_total/5}%')

# Model

## Step 1: Partition the data into training and test segments

20% of the data for testing, and the remaining 80% is training (i.e. the 80% training data will confirm whether the 20% testing data labels are correct).

In [30]:
df = pd.read_csv('emails.csv')

print('STOP WORDS INCLUDED\n')
run_model(df, False)

STOP WORDS INCLUDED

Total # Emails:5172
Total Vocab: 3000

Begin: Email 1
End: Email 1034
% of spam emails: 0.291445142580957
% of ham emails: 0.708554857419043
total spam words: 1708339
total ham words: 2972320
Accuracy: 93.90715667311412%
Elapsed Time: 11.337883949279785

Begin: Email 1035
End: Email 2068
% of spam emails: 0.29337844369260513
% of ham emails: 0.7066215563073949
total spam words: 1757177
total ham words: 3079137
Accuracy: 96.5183752417795%
Elapsed Time: 11.211353063583374

Begin: Email 2069
End: Email 3102
% of spam emails: 0.291686805219913
% of ham emails: 0.708313194780087
total spam words: 1893092
total ham words: 3181821
Accuracy: 95.45454545454545%
Elapsed Time: 11.110827684402466

Begin: Email 3103
End: Email 4136
% of spam emails: 0.28709521507974867
% of ham emails: 0.7129047849202513
total spam words: 1695443
total ham words: 3219402
Accuracy: 93.81044487427465%
Elapsed Time: 11.62602710723877

Begin: Email 4137
End: Email 5172
% of spam emails: 0.286508704

In [None]:
print('STOP WORDS NOT INCLUDED\n')
run_model(df, True)

STOP WORDS NOT INCLUDED

Total # Emails:5172
Total Vocab: 2866

Begin: Email 1
End: Email 1034
% of spam emails: 0.291445142580957
% of ham emails: 0.708554857419043
total spam words: 987979
total ham words: 1754023
Accuracy: 93.23017408123792%
Elapsed Time: 11.051778078079224

Begin: Email 1035
End: Email 2068
% of spam emails: 0.29337844369260513
% of ham emails: 0.7066215563073949
total spam words: 1018299
total ham words: 1808522
Accuracy: 96.5183752417795%
Elapsed Time: 11.348573207855225

Begin: Email 2069
End: Email 3102
% of spam emails: 0.291686805219913
% of ham emails: 0.708313194780087
total spam words: 1099899
total ham words: 1871004
Accuracy: 95.16441005802709%
Elapsed Time: 13.281291007995605

Begin: Email 3103
End: Email 4136
% of spam emails: 0.28709521507974867
% of ham emails: 0.7129047849202513
total spam words: 990528
total ham words: 1894500


## Step 2: Get probabilities that any one email in the training data is either spam or ham

In the labelled dataset, count the number of spam and ham emails.

$P(Spam) = \frac{Spam\,Emails}{Total\,Emails}$

$P(Ham) = \frac{Ham\,Emails}{Total\,Emails}$

## Step 3: Get the "spamicity" and "hamicity" probability of each word in the testing data email

**w** = word
<br>**vocab** = total words in dataset
<br>**spam_vocab**
<br>**wi_spam_count**

Count all unique words in the labelled dataset to get **vocab**.

Count the total number of words in labelled spam emails (ignoring uniqueness) to get **spam_vocab**.

For each word **w**, count all instances of the word in the spam emails to get **wi_spam_count**.

Calculate spamicity of each word and store the word and its spamicity in a dictionary

$P(w_{i}|Spam) = \frac{wi\_spam\_count\,+\,\alpha}{spam\_vocab\,+\,\alpha \cdot vocab}$

$\alpha$ is a coefficient that prevents a probability from being 0.


## Step 4: Calculate the "spamicity" and "hamicity" of each email

Multiply spamicities of each word together to get $\prod_{i=1}^{n}P(w_{i}|Spam)$.

Multiply that product by the probability that any email is spam.

## Step 5: Compare hamicity and spamicity scores to classify emails

## Step 6: Check accuracy of the model