# Naive Bayes

Naive Bayes is a classification technique based on Bayes' theorem. It assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. The Naive Bayes classifier is a supervised machine learning algorithm used for classification tasks such as text classification, spam filtering, and sentiment analysis. It belongs to the family of generative learning algorithms, which means that it models the distribution of inputs for a given class or category.

Bayes' theorem gives the probability of a hypothesis given the data and some prior knowledge. The naive Bayes classifier assumes that all features in the input data are independent of each other given the class, which is often not true in real-world scenarios. Despite this simplifying assumption, the naive Bayes classifier is widely used because of its efficiency and good performance in many real-world applications.

Bayes' theorem is stated as:

    P(class|data) = (P(data|class) * P(class)) / P(data)
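
As a small worked example of the formula above, the sketch below plugs in made-up numbers for a two-class spam filter. The priors, the likelihoods, and the class names are assumptions chosen only for illustration; they do not come from the Iris dataset used later.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' theorem.
prior = {"spam": 0.3, "ham": 0.7}        # P(class)
likelihood = {"spam": 0.8, "ham": 0.1}   # P(data|class), e.g. the message contains "offer"

# P(data), obtained by summing P(data|class) * P(class) over all classes.
evidence = sum(likelihood[c] * prior[c] for c in prior)

# P(class|data) for every class.
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)
```

With these numbers the posterior for `spam` is 0.24 / 0.31, roughly 0.77, even though the prior for `spam` was only 0.3.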

In [1]:
import pandas as pd
import numpy as np
from random import seed
from random import randrange

In [2]:
data = pd.read_csv('..\\Datasets\\iris.csv')
data = data.drop(['Id'], axis=1)

In [3]:
data.loc[data["Species"] == "Iris-setosa", "Species"] = 0
data.loc[data["Species"] == "Iris-versicolor", "Species"] = 1
data.loc[data["Species"] == "Iris-virginica", "Species"] = 2

In [4]:
def sep_class(x):
    # Split the rows of x by class label, using x's own Species column
    setosa = x.loc[x['Species'] == 0]
    versicolor = x.loc[x['Species'] == 1]
    virginica = x.loc[x['Species'] == 2]
    return [setosa, versicolor, virginica]

In [5]:
def summarie(x):
    # For each class, store (mean, std, count) for every feature column
    summaries = dict()
    x = sep_class(x)
    for i in range(len(x)):
        temp = list()
        for target, valuez in x[i].items():
            temp.append((valuez.mean(), valuez.std(), len(valuez)))
        temp.pop(-1)  # drop the summary of the Species (label) column
        summaries[i] = temp
    return summaries

# Gaussian Probability Density Function

Calculating the probability of observing a given real value can be difficult. One way to circumvent this is to assume, naïvely, that values are drawn from a known distribution, such as a bell curve or Gaussian distribution.

A Gaussian distribution can be summarized using only the mean and the standard deviation, making it possible to estimate the likelihood of a given value using the *Gaussian Probability Density Function* (or Gaussian PDF). The Gaussian PDF can be calculated as:

    f(x) = (1 / (sqrt(2 * PI) * sigma)) * exp(-((x - mean)^2 / (2 * sigma^2)))

Where sigma is the standard deviation for x, mean is the mean for x and PI is the value of pi.

In [6]:
def PDF(x, mean, std):
    e = np.exp(-((x-mean)**2 / (2 * std**2 )))
    return (1 / (np.sqrt(2 * np.pi) * std)) * e
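
A quick sanity check of the formula can be sketched as follows. The function `gaussian_pdf` is a self-contained copy of the density written for this example, and the mean and standard deviation are illustrative values rather than statistics from the dataset.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    # 1 / (sqrt(2 * pi) * sigma) * exp(-(x - mean)^2 / (2 * sigma^2))
    exponent = np.exp(-((x - mean) ** 2) / (2 * std ** 2))
    return exponent / (np.sqrt(2 * np.pi) * std)

mean, std = 5.0, 0.2  # illustrative values only

# The density is highest at the mean.
peak = gaussian_pdf(mean, mean, std)

# A simple Riemann sum over +/- 8 standard deviations approximates the area under the curve.
xs = np.linspace(mean - 8 * std, mean + 8 * std, 20001)
area = np.sum(gaussian_pdf(xs, mean, std)) * (xs[1] - xs[0])
print(peak, area)
```

The area comes out very close to 1, while the peak is larger than 1. The PDF returns a density, not a probability, so individual values above 1 are expected when the standard deviation is small.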

# Class Probabilities

Probabilities are calculated separately for each class: first the probability that a new piece of data belongs to the first class, then the probability that it belongs to the second class, and so on for all classes. The value computed for each class is:

    P(class|data) = P(data|class) * P(class)

Here P(data|class) is the product of the Gaussian PDF values of the individual features. The division by P(data) has been dropped to simplify the calculation, so the result is no longer strictly the probability of the data belonging to a class. Because P(data) is the same for every class, the class with the largest value is still the most probable one and is taken as the prediction.
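
To illustrate why removing the division does not change the prediction, the sketch below starts from made-up unnormalized scores for three classes and then divides them by their sum. The score values are hypothetical and only serve the example.

```python
# Hypothetical values of P(data|class) * P(class) for classes 0, 1 and 2.
scores = {0: 1.2e-4, 1: 3.6e-3, 2: 4.8e-4}

# Prediction taken directly from the unnormalized scores.
prediction = max(scores, key=scores.get)

# Dividing by the sum of the scores (which equals P(data)) gives proper probabilities.
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}

print(prediction, posteriors)
```

The normalized values sum to 1, and the class with the largest score is also the class with the largest posterior, so the division can be skipped when only the predicted label is needed.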

In [7]:
def calculate_class_probabilities(summaries, row):
    # Total number of training rows across all classes
    total_rows = sum([summaries[label][0][2] for label in summaries])
    probabilities = dict()
    for class_value, class_summaries in summaries.items():
        # Start from the prior P(class)
        probabilities[class_value] = summaries[class_value][0][2] / float(total_rows)
        # Multiply by the Gaussian likelihood of each feature
        for i in range(len(class_summaries)):
            mean, stdev, count = class_summaries[i]
            # row is a pandas Series, so use positional access
            probabilities[class_value] *= PDF(row.iloc[i], mean, stdev)
    return probabilities

# K-Fold Cross-Validation

The algorithm will be evaluated using k-fold cross-validation. Since the dataset has 150 rows, the number of folds is set to 5, giving each fold 30 rows. k-fold cross-validation helps estimate how the model is expected to perform in general when making predictions on data not used during training.
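
The splitting idea can be sketched on row indices alone. The helper `kfold_indices` below is hypothetical and uses NumPy's random generator instead of `randrange`; it shuffles the row positions once and cuts them into equally sized folds.

```python
import numpy as np

def kfold_indices(n_rows, n_folds, seed=0):
    # Shuffle the row positions once, then cut them into n_folds pieces.
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    return np.array_split(shuffled, n_folds)

folds = kfold_indices(150, 5)
for i, test_idx in enumerate(folds):
    # Every fold except the i-th is used for training.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(i, len(train_idx), len(test_idx))
```

For 150 rows and 5 folds, each iteration uses 120 rows for training and 30 rows for testing, and every row appears in exactly one test fold.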

In [8]:
def kfold(dataset, n_folds):
    dataset_split = list()
    copy = dataset.copy()
    fold_size = int(len(dataset) / n_folds)
    for _ in range(n_folds):
        temp = list()
        while len(temp) < fold_size:
            ind = randrange(len(copy))
            temp.append(copy.iloc[ind])
            # Reassign so the row is actually removed and cannot be drawn again
            copy = copy.drop(copy.index[ind])
        temp = pd.DataFrame(temp)
        dataset_split.append(temp)
    return dataset_split

In [9]:
def accuracy(y, yhat):
    correct = 0
    for i in range(len(yhat)):
        if y[i] == yhat[i]:
            correct += 1
    return correct / float(len(y)) * 100.0

In [10]:
def calculate_class_probabilities(summaries, row):
    total_rows = sum([summaries[label][0][2] for label in summaries])
    probabilities = dict()
    for class_value, class_summaries in summaries.items():
        # Prior P(class), then multiply by the Gaussian likelihood of each feature
        probabilities[class_value] = summaries[class_value][0][2]/float(total_rows)
        for i in range(len(class_summaries)):
            mean, std, count = class_summaries[i]
            # row is a pandas Series, so use positional access
            probabilities[class_value] *= PDF(row.iloc[i], mean, std)
    return probabilities

In [11]:
def evaluate_algorithm(dataset, n_folds, *args):
    folds = kfold(dataset, n_folds)
    scores = list()
    for i in range(len(folds)):
        # Train on every fold except the i-th, which is held out for testing
        train_set = list(folds)
        train_set.pop(i)
        train_set = pd.concat(train_set)
        test_set = list()
        for j in range(len(folds[i])):
            copy = folds[i].iloc[j].copy()
            copy.iloc[-1] = None  # hide the label from the classifier
            test_set.append(copy)
        # Predict once per fold, after the whole test set has been built
        predicted = naive_bayes(train_set, test_set, *args)
        actual = [folds[i]['Species'].iloc[k] for k in range(len(folds[i]))]
        acc = accuracy(actual, predicted)
        scores.append(acc)
    return scores

In [12]:
def predict(summaries, row):
    probabilities = calculate_class_probabilities(summaries, row)
    best_label, best_prob = None, -1.0
    for class_value, probability in probabilities.items():
        if best_label is None or probability > best_prob:
            best_prob = probability
            best_label = class_value
    return best_label

In [13]:
def naive_bayes(train, test):
    summarize = summarie(train)
    predictions = list()
    for row in test:
        output = predict(summarize, row)
        predictions.append(output)
    return predictions

In [14]:
scores = evaluate_algorithm(data, 5)

In [15]:
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

Scores: [96.66666666666667, 100.0, 86.66666666666667, 90.0, 96.66666666666667]
Mean Accuracy: 94.000%
