# Naive Bayes Classification Concept

- Naive Bayes classification is based on Bayes' theorem:

$$  P(\textbf{Class}|\textbf{Features}) = \frac{P(\textbf{Features} | \textbf{Class}) P(\textbf{Class})}{P(\textbf{Features})} $$

- When predicting, we choose the class with the highest posterior probability:

$$  \widehat{\textbf{Class}} = \operatorname*{arg\,max}_{\textbf{Class}} P(\textbf{Class}|\textbf{Features}) = \operatorname*{arg\,max}_{\textbf{Class}} \frac{P(\textbf{Features} | \textbf{Class}) P(\textbf{Class})}{P(\textbf{Features})} $$

- We can ignore the denominator $ P(\textbf{Features}) $ because it is the same for all classes, so it does not change which class attains the maximum. We are left with:

$$  \widehat{\textbf{Class}} = \operatorname*{arg\,max}_{\textbf{Class}} P(\textbf{Features} | \textbf{Class}) P(\textbf{Class}) $$
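
A small numeric check can make this concrete. The sketch below uses made-up likelihoods and priors (hypothetical values, not taken from Spambase) to show that dividing by $ P(\textbf{Features}) $ rescales every class score by the same constant and leaves the arg max unchanged.

```python
# Hypothetical likelihoods P(Features | Class) and priors P(Class) for two classes
likelihood = {"spam": 0.02, "not_spam": 0.005}
prior = {"spam": 0.4, "not_spam": 0.6}

# Unnormalized scores: P(Features | Class) * P(Class)
scores = {c: likelihood[c] * prior[c] for c in prior}

# The evidence P(Features) is the sum of the unnormalized scores over all classes
evidence = sum(scores.values())

# Normalized posteriors: P(Class | Features)
posteriors = {c: scores[c] / evidence for c in scores}

best_unnormalized = max(scores, key=scores.get)
best_normalized = max(posteriors, key=posteriors.get)
print(best_unnormalized, best_normalized)
```

Both selections agree, which is why the prediction code only needs the numerator.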

## Gaussian NB

- To calculate $ P(\textbf{Features}|\textbf{Class})$, we assume that each feature follows a Normal (Gaussian) distribution within each class. Thus, we calculate the likelihood of each feature value using the Gaussian probability density formula:
<h3>
$$
P(\textbf{f1 = x}|\textbf{Class = C1}) = \frac{1}{\sqrt{2\pi \cdot \sigma(f1,C1)^{2}}} e^{-\frac{(x-\mu(f1,C1))^{2}}{2 \cdot \sigma(f1,C1)^{2}}}
$$
</h3>

- We generally use Gaussian NB when the features are real numbers
- For training, we need to calculate the mean ($ \mu $) and standard deviation ($ \sigma $) of each feature for each class
- We also need to calculate the class priors $ P(\textbf{Class}) $
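
The "naive" part of the method is the assumption that the features are conditionally independent given the class, which is what allows the per-feature Gaussian densities to be multiplied together. Written out for $n$ features (the names $f_1, \dots, f_n$ are shorthand introduced for this note):

$$ P(\textbf{Features}|\textbf{Class}) = \prod_{i=1}^{n} P(f_i | \textbf{Class}) $$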

# Imports

In [1]:
import pandas as pd
import numpy as np
from math import sqrt, exp, pi

# Dataset "Spambase"

The dataset used is "Spambase" from the following link: https://archive.ics.uci.edu/ml/datasets/spambase

Here is a short description from the link above:
> Attribute Information:

> The last column of 'spambase.data' denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character was frequently occuring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:

> 48 continuous real [0,100] attributes of type word_freq_WORD
= percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

> 6 continuous real [0,100] attributes of type char_freq_CHAR]
= percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurences) / total characters in e-mail

> 1 continuous real [1,...] attribute of type capital_run_length_average
= average length of uninterrupted sequences of capital letters

> 1 continuous integer [1,...] attribute of type capital_run_length_longest
= length of longest uninterrupted sequence of capital letters

> 1 continuous integer [1,...] attribute of type capital_run_length_total
= sum of length of uninterrupted sequences of capital letters
= total number of capital letters in the e-mail

> 1 nominal {0,1} class attribute of type spam
= denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. 

## Load Data

In [2]:
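# Load Data

df = pd.read_csv('./data/spambase.csv', header=None)
# Keep only the 48 word_freq_* features; .copy() avoids pandas' SettingWithCopyWarning
X = df.iloc[:, :48].copy()
Y = df.iloc[:, 57]  # last column: 1 = spam, 0 = not spam
X.columns = list(range(1, 49))  # name the features 1..48
df = X
df['class'] = Y
df.head()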

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,40,41,42,43,44,45,46,47,48,class
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.06,0.0,0.0,0.12,0.0,0.06,0.06,0.0,0.0,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


## Split Train and Test

In [3]:
# Split Train and Test
nbr_test_rows = 100

# Hold out the last nbr_test_rows rows (in file order) as the test set
train_df = df.iloc[:-nbr_test_rows]
test_df = df.iloc[-nbr_test_rows:]

nbr_train_rows = len(train_df)
print(nbr_train_rows, nbr_test_rows)
test_df.loc[4501]['class']

4501 100


0.0
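
One caveat about this split: it takes the last 100 rows in file order. The spam prior reported in the next section (about 0.4028 of the 4501 training rows, roughly 1813 rows) suggests that the file lists spam e-mails first, in which case the test set would contain only non-spam rows. A minimal sketch of a shuffled split follows, assuming a DataFrame shaped like `df` above; the helper name `shuffled_split` and its `seed` parameter are illustrative and not part of the original notebook.

```python
import pandas as pd

def shuffled_split(df, nbr_test_rows, seed=0):
    """Shuffle the rows, then hold out the last nbr_test_rows as the test set."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    train = shuffled.iloc[:-nbr_test_rows]
    test = shuffled.iloc[-nbr_test_rows:]
    return train, test

# Toy data sorted by class, mimicking a file where one class comes first
toy = pd.DataFrame({"f": range(10), "class": [1] * 5 + [0] * 5})
train_toy, test_toy = shuffled_split(toy, nbr_test_rows=4)
print(len(train_toy), len(test_toy))
```

Fixing the seed keeps the split reproducible between runs.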

## Calculate Class Priors, Means and Standard Deviations

In [4]:
# Calculate Priors

class_priors = {}
# Number of training rows per class
nbr_elements_in_classes = train_df['class'].value_counts()
all_classes = list(nbr_elements_in_classes.keys())

# P(Class) = rows in class / total training rows
for c in all_classes:
    class_priors[c] = nbr_elements_in_classes[c] / nbr_train_rows

print(class_priors)

{0: 0.59720062208398139, 1: 0.40279937791601866}
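
The same relative frequencies can also be obtained in one call. The sketch below uses a toy label Series standing in for `train_df['class']` (the values are made up for illustration).

```python
import pandas as pd

# Toy labels standing in for train_df['class'] (hypothetical values)
labels = pd.Series([0, 0, 0, 1, 1])

# normalize=True returns relative frequencies instead of raw counts
priors = labels.value_counts(normalize=True).to_dict()
print(priors)
```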


In [5]:
# Calculate Means, Standard Deviations

mean_features = {}  # dict { f1 => { c1: mean, c2: mean}, f2 => { c1: mean, c2: mean}, ... }
sd_features = {}    # same layout as mean_features, holding standard deviations

for f in range(1, 49):
    mean_features[f] = {}
    sd_features[f] = {}
    for c in all_classes:
        # Values of feature f restricted to the training rows of class c
        feature_values = train_df[train_df['class'] == c].loc[:, f]
        mean_f_c = feature_values.sum() / len(feature_values)
        mean_features[f][c] = mean_f_c
        # Population standard deviation (divides by N, not N - 1)
        sd_f_c = np.sqrt(((feature_values - mean_f_c)**2).sum() / len(feature_values))
        sd_features[f][c] = sd_f_c
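
As an alternative to the explicit loops, the per-class statistics can be sketched with a pandas `groupby`. The toy frame below uses made-up values and only two features; it is an illustration, not the notebook's data.

```python
import pandas as pd

# Toy training frame with two features and a class column (hypothetical values)
toy_train = pd.DataFrame({
    1: [0.0, 2.0, 4.0, 1.0, 3.0],
    2: [1.0, 1.0, 1.0, 0.0, 2.0],
    "class": [0, 0, 0, 1, 1],
})

grouped = toy_train.groupby("class")
means = grouped.mean()      # rows: classes, columns: features
sds = grouped.std(ddof=0)   # ddof=0 matches the population formula used in the loop

# Look up the statistics for feature 1 in class 0
print(means.loc[0, 1], sds.loc[0, 1])
```

Note that feature 2 is constant within class 0 in this toy frame, so its standard deviation is zero, a case the Gaussian formula cannot handle as written.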


## Calculate Likelihood (Gaussian Probability)

In [6]:
# Gaussian Probability Calculation

def gaussian(f_index, f_value, c):
    """Gaussian density of f_value for feature f_index under class c."""
    mean = mean_features[f_index][c]
    sd = sd_features[f_index][c]
    variance = sd**2
    ret = (1 / np.sqrt(2 * np.pi * variance)) * np.exp(-((f_value - mean)**2) / (2 * variance))
    return ret
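
If a feature is constant within a class in the training data, its standard deviation is zero and the density above is not well defined. One common workaround is to add a small constant to the variance. The sketch below is a standalone variant that takes the mean and standard deviation as arguments; the name `gaussian_smoothed` and the default `epsilon` are illustrative assumptions, not values from the original notebook.

```python
from math import sqrt, exp, pi

def gaussian_smoothed(f_value, mean, sd, epsilon=1e-9):
    """Gaussian density with a small constant added to the variance.

    epsilon is an illustrative choice, not a value from the original notebook.
    """
    variance = sd ** 2 + epsilon
    return (1 / sqrt(2 * pi * variance)) * exp(-((f_value - mean) ** 2) / (2 * variance))

# With sd == 0 the unsmoothed formula would divide by zero; this stays finite
print(gaussian_smoothed(0.0, mean=0.0, sd=0.0))
# For a standard normal evaluated at its mean, the value is close to 1 / sqrt(2 * pi)
print(gaussian_smoothed(0.0, mean=0.0, sd=1.0))
```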


## Predict

In [7]:
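# Predict

def predict(f_vector):
    """Return the unnormalized posterior P(Features | Class) * P(Class) for each class."""
    ret = {}  # dict { c1 : Prob, c2: Prob, ... }
    for c in all_classes:
        pr = 1
        # Features are numbered from 1, matching the keys of mean_features
        for f_index, f_value in enumerate(f_vector, start=1):
            pr *= gaussian(f_index, f_value, c)
        pr *= class_priors[c]
        ret[c] = pr
    return ret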
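
Multiplying many densities can underflow to zero when a value lies far from a class mean, at which point the classes can no longer be compared. A common remedy is to sum log-densities instead, which preserves the arg max because the logarithm is monotonic. The sketch below is standalone: the function names and the list-based layout of `means` and `sds` are assumptions made for this example and differ from the notebook's nested dicts.

```python
from math import log, pi, isfinite

def log_gaussian(f_value, mean, sd):
    """Log of the Gaussian density, computed without calling exp()."""
    variance = sd ** 2
    return -0.5 * log(2 * pi * variance) - ((f_value - mean) ** 2) / (2 * variance)

def predict_log(f_vector, means, sds, priors):
    """Return {class: log P(Class) + sum_i log P(f_i | Class)}.

    means and sds map class -> list of per-feature statistics
    (hypothetical layout chosen for this sketch).
    """
    scores = {}
    for c in priors:
        score = log(priors[c])
        for f_value, mean, sd in zip(f_vector, means[c], sds[c]):
            score += log_gaussian(f_value, mean, sd)
        scores[c] = score
    return scores

# Toy model: 2 classes, 3 features (made-up statistics)
means = {0: [0.0, 0.0, 0.0], 1: [5.0, 5.0, 5.0]}
sds = {0: [0.1, 0.1, 0.1], 1: [0.1, 0.1, 0.1]}
priors = {0: 0.6, 1: 0.4}

# This point is far from class 0, where a plain product of densities underflows
scores = predict_log([5.0, 5.0, 4.9], means, sds, priors)
print(max(scores, key=scores.get))
```

The log scores stay finite for both classes, so the comparison remains meaningful even for extreme feature values.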


## Score

In [8]:
# Score

nbr_errors = 0

for i, row in test_df.iterrows():
    f_vector = row[:-1]  # all columns except the trailing 'class'
    correct_class = row['class']

    # Pick the class with the highest (unnormalized) posterior
    predictions = predict(f_vector)
    predicted_class = max(predictions, key=predictions.get)

    if correct_class != predicted_class:
        nbr_errors += 1

# Accuracy = fraction of test rows classified correctly
score = 1 - (nbr_errors / len(test_df))
print("Score: ", score)

Score:  0.61
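
Accuracy alone can hide how the errors are distributed across classes, especially when the test set is dominated by one class. As a sketch, the (true, predicted) pairs can be tallied per class. The labels below are made up for illustration and are not results from this notebook; `confusion_counts` is a hypothetical helper name.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))

# Made-up labels for illustration only (not results from the notebook)
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
# Accuracy is the share of pairs where the true and predicted labels agree
accuracy = sum(n for (t, p), n in counts.items() if t == p) / len(y_true)
print(counts[(0, 0)], counts[(0, 1)], counts[(1, 0)], counts[(1, 1)], accuracy)
```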
