# Bayes Theorem for Predicting the Probability of an Email Being Spam

S = Spam
w = Word

$P(Spam|w_{1}, w_{2},..., w_{n}) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_{i}|Spam)$

The probability that an email consisting of the words $w_{1}, w_{2},... w_{n}$ is proportional to the probability that any given email is spam multiplied by the product of each word's probability to appear in a spam email.



In [1]:
import pandas as pd
import math

PREDICTION = 'Prediction'
CLASSIFICATION = 'Classiciation'

## Functions

In [2]:
def count_vocab(emails):
    total_words = 0
    
    for index, row in emails.iterrows():
        total_words += sum(row.values[1:-2])
            
    return total_words

In [3]:
def calculate_word_spamicity(w_spam_count, vocab, spam_vocab):
    alpha = 1
    
    spamicity = (w_spam_count + alpha) / (spam_vocab + alpha * vocab)
    return spamicity

In [4]:
def build_word_spamicity_dict(spam_emails, vocab, spam_vocab):
    spam_word_appearances = {}
    
    for (column_name, column_data) in spam_emails.iteritems():
        if column_name != 'Email No.' and column_name != PREDICTION and column_name != CLASSIFICATION:
            spam_word_appearances[column_name] = sum(column_data.values)
            
    for word in spam_word_appearances:
        spam_word_appearances[word] = calculate_word_spamicity(spam_word_appearances[word], vocab, spam_vocab)
            
    return spam_word_appearances

In [5]:
def calculate_emails(testing_emails, word_spamicities, spam_proportion):
    test_data_map = {}
    for index, email in testing_emails.iterrows():
        email_spamicity = math.log(spam_proportion)
        for column in testing_data.columns[1:-2]:
            if email[column] > 0:
                # log P(spam) + sum log P(w|spam)
                email_spamicity += math.log(word_spamicities[column])*email[column]

        test_data_map[index] = email_spamicity
        
    return test_data_map

In [6]:
def calculate_accuracy(testing_emails):
    number_correct = 0
    for index, email in testing_emails.iterrows():
        if email[PREDICTION] == email[CLASSIFICATION]:
            number_correct += 1
        
    return number_correct / testing_data.shape[0] * 100

# Model

## Step 1: Partition the data into training and test segments

20% of the data for testing, and the remaining 80% is training (i.e. the 80% training data will confirm whether the 20% testing data labels are correct).

In [7]:
df = pd.read_csv('emails.csv')
total_num_emails = df.shape[0]

# This will be used to store the model's prediction
df[CLASSIFICATION] = ""

end = total_num_emails//5

testing_data = df.iloc[:end]
training_data = df.iloc[end:]

testing_data


Unnamed: 0,Email No.,the,to,ect,and,for,of,a,you,hou,...,jay,valued,lay,infrastructure,military,allowing,ff,dry,Prediction,Classiciation
0,Email 1,0,0,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,
1,Email 2,8,13,24,6,6,2,102,1,27,...,0,0,0,0,0,0,1,0,0,
2,Email 3,0,0,1,0,0,0,8,0,0,...,0,0,0,0,0,0,0,0,0,
3,Email 4,0,5,22,0,5,1,51,2,10,...,0,0,0,0,0,0,0,0,0,
4,Email 5,7,6,17,1,5,2,57,0,9,...,0,0,0,0,0,0,1,0,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1029,Email 1030,10,6,4,3,5,2,45,10,0,...,0,0,0,0,0,0,0,0,0,
1030,Email 1031,8,19,18,7,3,3,150,8,11,...,0,0,2,0,0,0,2,0,0,
1031,Email 1032,0,0,1,0,0,2,7,0,0,...,0,0,0,0,0,0,0,0,0,
1032,Email 1033,1,2,1,4,1,2,108,5,0,...,0,0,0,0,0,0,2,0,1,


In [8]:
training_data

Unnamed: 0,Email No.,the,to,ect,and,for,of,a,you,hou,...,jay,valued,lay,infrastructure,military,allowing,ff,dry,Prediction,Classiciation
1034,Email 1035,18,19,18,18,3,6,143,3,0,...,0,0,0,0,0,0,3,0,0,
1035,Email 1036,4,5,3,0,2,2,19,0,0,...,0,0,0,0,0,0,0,0,0,
1036,Email 1037,0,1,1,1,0,0,6,1,0,...,0,0,0,0,0,0,0,0,1,
1037,Email 1038,3,2,1,1,0,1,11,0,0,...,0,0,0,0,0,0,1,0,1,
1038,Email 1039,11,7,11,11,16,0,98,3,4,...,0,0,0,0,0,0,0,0,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5167,Email 5168,2,2,2,3,0,0,32,0,0,...,0,0,0,0,0,0,0,0,0,
5168,Email 5169,35,27,11,2,6,5,151,4,3,...,0,0,0,0,0,0,1,0,0,
5169,Email 5170,0,0,1,1,0,0,11,0,0,...,0,0,0,0,0,0,0,0,1,
5170,Email 5171,2,7,1,0,2,1,28,2,0,...,0,0,0,0,0,0,1,0,1,


## Step 2: Get probabilities that any one email in the training data is either spam or ham

In the labelled dataset, count the number of spam and ham emails.

$P(Spam) = \frac{Spam\,Emails}{Total\,Emails}$

In [9]:
spam_proportion = training_data['Prediction'].value_counts()[1] / training_data.shape[0]
spam_proportion

0.291445142580957

$P(Spam) = \frac{Ham\,Emails}{Total\,Emails}$

In [10]:
ham_proportion = training_data['Prediction'].value_counts()[0] / training_data.shape[0]
ham_proportion

0.708554857419043

## Step 3: Get the "spamicity" probability of each word in the testing data email

**w** = word
<br>**vocab** = total words in dataset
<br>**spam_vocab**
<br>**wi_spam_count**

Count all unique words in the labelled dataset to get **vocab**.

Count the total number of words in labelled spam emails (ignoring uniqueness) to get **spam_vocab**.

For each word **w**, count all instances of the word in the spam emails to get **wi_spam_count**.

Calculate spamicity of each word and store the word and its spamicity in a dictionary

$P(w_{i}|Spam) = \frac{wi\_spam\_count\,+\,\alpha}{spam\_vocab\,+\,\alpha \cdot vocab}$

$\alpha$ is a coefficient that prevents a probability from being 0.


In [11]:
# Subtract 3 for "Email No.", "Prediction", and "Classification" columns
total_vocab = len(training_data.columns) - 3
total_vocab

3000

In [12]:
spam_training_emails = training_data.loc[training_data[PREDICTION] == 1]

total_spam_words = count_vocab(spam_training_emails)
total_spam_words

1708345

In [13]:
ham_training_emails = training_data.loc[training_data[PREDICTION] == 0]

total_ham_words = count_vocab(ham_training_emails)
total_ham_words

2972342

In [14]:
word_spamicities = build_word_spamicity_dict(spam_training_emails, total_vocab, total_spam_words)
word_spamicities

{'the': 0.004709745843181825,
 'to': 0.004884462221235344,
 'ect': 0.0017904046232641577,
 'and': 0.002946804998407685,
 'for': 0.002195349272063786,
 'of': 0.0032769546760004556,
 'a': 0.04756726434471133,
 'you': 0.002370649985829859,
 'hou': 0.00043766744870262864,
 'in': 0.010783331239463697,
 'on': 0.008480464196290052,
 'is': 0.004917185021138345,
 'this': 0.001148804010880331,
 'enron': 5.843357125535763e-07,
 'i': 0.04614966590605635,
 'be': 0.0025786734994989323,
 'that': 0.000670233062298952,
 'will': 0.0003278123347425563,
 'have': 0.0004662998986177539,
 'with': 0.000983437004227669,
 'your': 0.0009711659542640438,
 'at': 0.006023332525002264,
 'we': 0.0013743575959260113,
 's': 0.038823849077772164,
 'are': 0.0011079005110015807,
 'it': 0.004733703607396522,
 'by': 0.0004213060487511285,
 'com': 0.0020264762511358024,
 'as': 0.003302665447352813,
 'from': 0.00048324563428180756,
 'gas': 0.00013381287817476896,
 'or': 0.006563843059114323,
 'not': 0.0007380160049551669,
 'm

In [15]:
word_hamicities = build_word_spamicity_dict(ham_training_emails, total_vocab, total_ham_words)
word_hamicities

{'the': 0.006328348136113428,
 'to': 0.005621202537388979,
 'ect': 0.005653131639992983,
 'and': 0.0025418926630955364,
 'for': 0.002983858662298317,
 'of': 0.0017594616013890169,
 'a': 0.04759452862897778,
 'you': 0.001954061079364994,
 'hou': 0.0023429239395000643,
 'in': 0.008279384353126465,
 'on': 0.009871470237707127,
 'is': 0.004440834028491515,
 'this': 0.0012143141864027732,
 'enron': 0.0018928916406920615,
 'i': 0.035427523961951264,
 'be': 0.003014107285817899,
 'that': 0.0008580526205054747,
 'will': 0.0009430848621771883,
 'have': 0.0008247791346339345,
 'with': 0.0007226060063011243,
 'your': 0.0005414503610005169,
 'at': 0.0058947845323327535,
 'we': 0.001937592384337666,
 's': 0.03448443909977408,
 'are': 0.0013427027884525544,
 'it': 0.0033999452836010113,
 'by': 0.0006271547943059991,
 'com': 0.0013652212081837988,
 'as': 0.004627703302679154,
 'from': 0.0008523389916184425,
 'gas': 0.0008109992061416805,
 'or': 0.006628481700591058,
 'not': 0.0007279835393712722,
 'm

## Step 4: Calculate the "spamicity" of the email

Multiply spamicities of each word together to get $\prod_{i=1}^{n}P(w_{i}|Spam)$.

Multiply that product by the probability that any email is spam.

In [16]:
test_data_spam_map = calculate_emails(testing_data, word_spamicities, spam_proportion)
test_data_spam_map
        

{0: -266.9077389830095,
 1: -11065.714584660675,
 2: -612.2987351830429,
 3: -5162.393017137474,
 4: -5402.583128453085,
 5: -6200.152262425477,
 6: -3315.1667729718974,
 7: -2675.8507948757942,
 8: -2153.7295646889042,
 9: -5977.989932412375,
 10: -10448.289467206474,
 11: -18800.796785025348,
 12: -11507.061565999189,
 13: -5252.92907457097,
 14: -3055.1152069355157,
 15: -3641.9251725110894,
 16: -999.0996578373654,
 17: -20578.697601065414,
 18: -1920.5748159142024,
 19: -4287.192774446812,
 20: -1241.315348318113,
 21: -3725.801793709815,
 22: -4225.506755169192,
 23: -1619.841959289526,
 24: -878.4812292732026,
 25: -22867.876671370876,
 26: -6453.808649243151,
 27: -1980.1956290618205,
 28: -9741.495147286592,
 29: -10470.846919808795,
 30: -3128.3604330301664,
 31: -824.4452002329542,
 32: -22082.557180695327,
 33: -6166.60030920623,
 34: -4621.7019319481915,
 35: -2686.876496651591,
 36: -6836.310076091956,
 37: -2322.4900359573376,
 38: -2759.009128247102,
 39: -5470.88072158

In [17]:
test_data_ham_map = calculate_emails(testing_data, word_hamicities, ham_proportion)
test_data_ham_map

{0: -263.7525279038351,
 1: -10742.222607376621,
 2: -571.4310020252854,
 3: -4908.093997893962,
 4: -5109.264680218924,
 5: -6318.843401588338,
 6: -3208.1188882266447,
 7: -2770.3762438620856,
 8: -2093.2467833736823,
 9: -5746.358791028823,
 10: -10175.705165393876,
 11: -18411.815202645754,
 12: -11346.573245304633,
 13: -5062.00463281683,
 14: -2890.2852872364965,
 15: -3568.9343916999696,
 16: -993.168016628906,
 17: -20833.82866407406,
 18: -1847.7394529370974,
 19: -4063.4045094168405,
 20: -1190.6063740595212,
 21: -3613.9275574636713,
 22: -3905.794638110254,
 23: -1576.7899798971928,
 24: -822.2981833631085,
 25: -23440.508719784106,
 26: -6268.905173611588,
 27: -1905.6032332397199,
 28: -9549.323522437924,
 29: -10187.171788294741,
 30: -3095.664038646154,
 31: -844.8038231134311,
 32: -21320.011694063178,
 33: -5918.081865998214,
 34: -4480.592440730411,
 35: -2593.183114379682,
 36: -6650.247661089747,
 37: -2210.4473938125975,
 38: -2707.4559497590626,
 39: -5526.080891

In [20]:
for key in test_data_spam_map:
    if test_data_spam_map[key] >= test_data_ham_map[key]:
        testing_data.iloc[key][CLASSIFICATION] = 1
    else:
        testing_data.iloc[key][CLASSIFICATION] = 0
        
testing_data

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  testing_data.iloc[key][CLASSIFICATION] = 0
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  testing_data.iloc[key][CLASSIFICATION] = 1


Unnamed: 0,Email No.,the,to,ect,and,for,of,a,you,hou,...,jay,valued,lay,infrastructure,military,allowing,ff,dry,Prediction,Classiciation
0,Email 1,0,0,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,
1,Email 2,8,13,24,6,6,2,102,1,27,...,0,0,0,0,0,0,1,0,0,
2,Email 3,0,0,1,0,0,0,8,0,0,...,0,0,0,0,0,0,0,0,0,
3,Email 4,0,5,22,0,5,1,51,2,10,...,0,0,0,0,0,0,0,0,0,
4,Email 5,7,6,17,1,5,2,57,0,9,...,0,0,0,0,0,0,1,0,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1029,Email 1030,10,6,4,3,5,2,45,10,0,...,0,0,0,0,0,0,0,0,0,
1030,Email 1031,8,19,18,7,3,3,150,8,11,...,0,0,2,0,0,0,2,0,0,
1031,Email 1032,0,0,1,0,0,2,7,0,0,...,0,0,0,0,0,0,0,0,0,
1032,Email 1033,1,2,1,4,1,2,108,5,0,...,0,0,0,0,0,0,2,0,1,


In [None]:
score = calculate_accuracy(testing_data)
f'Accuracy: {score}%'

## Step 5: Repeat steps 2-4 calculating the "hamicity" of each email

## Step 6: Label emails in test dataset and compare

A probability greater then 0.5 will indicate whether an email is ham or spam.