# Naive Bayes Classifier example

In this notebook we implement and use a Naive Bayes Classifier (NBC). This is a simple but often very helpful ML algorithm based on conditional probabilities and Bayes' theorem.
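
For reference, the standard Naive Bayes decision rule can be written as below. The symbols are notation introduced here for illustration, not taken from the notebook: $d$ is the number of features, $N$ the number of training rows, $N_y$ the number of training rows with class $y$, and $N_{x_i, y}$ the number of those rows whose $i$-th feature equals $x_i$.

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{d} P(X_i = x_i \mid y),
\qquad
P(y) \approx \frac{N_y}{N},
\qquad
P(X_i = x_i \mid y) \approx \frac{N_{x_i, y}}{N_y}
$$

The "naive" part is the assumption that the features are conditionally independent given the class, which is what allows the joint likelihood to be written as a product of per-feature terms.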

## Algorithm implementation

We'll implement a class that contains the methods and fields needed to learn and classify. 

In [99]:
class NaiveBayesClassifier:
    # X and y denote the features and the target labels, respectively
    def __init__(self, X, y):
        self.X, self.y = X, y

        # self.N stores the total number of rows in the dataset.
        self.N = len(self.X)  # Length of the training set

        # Example row: self.X[0] is [3 'male' 22.0 1 0]
        # self.dim stores the number of columns (features) in a row.
        self.dim = len(self.X[0])


        # Here we'll store the distinct values seen in each column of the training set
        # [] for _ in range():-  https://stackoverflow.com/questions/66425508/what-is-the-meaning-of-for-in-range
        # List Comprehension topic :- https://www.w3schools.com/python/python_lists_comprehension.asp
        self.attrs = [[] for _ in range(self.dim)]
        # print(self.attrs)   # [[], [], [], [], []]


        # Output classes with the number of occurrences in the training set. In this case we have only 2 classes.
        # It is a dictionary: {class label: count}
        self.output_dom = {}

        # To store every row as [Xi, yi]
        self.data = []

        for i in range(self.N):   # i is the row index
            for j in range(self.dim):  # j is the column index
                # if we have never seen this value for this attribute before, add it to attrs at the corresponding position
                if self.X[i][j] not in self.attrs[j]:
                    self.attrs[j].append(self.X[i][j])

            # if we have never seen this output class before, add it to output_dom with a count of 1;
            # otherwise, increment its count by 1 (output_dom works as a class -> count map)
            if self.y[i] not in self.output_dom:
                self.output_dom[self.y[i]] = 1
            else:
                self.output_dom[self.y[i]] += 1
            # store the row
            self.data.append([self.X[i], self.y[i]])

    def classify(self, entry):
        # Best class found so far (None until the first class has been scored)
        solve = None
        max_arg = -1  # partial maximum
        for y in self.output_dom.keys():
            prob = self.output_dom[y]/self.N  # P(y)
            for i in range(self.dim):
                cases = [x for x in self.data if x[0][i] == entry[i] and x[1] == y]  # all rows of class y with Xi = xi
                n = len(cases)
                prob *= n/self.output_dom[y]  # P *= P(Xi = xi | y), estimated within class y
            # if we have a greater prob for this output than the partial maximum...
            if prob > max_arg:
                max_arg = prob
                solve = y
        return solve
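
One thing worth noting about a count-based score like the one above: if a feature value never occurs together with a class in the training set, its count is zero and the whole product for that class becomes zero. A common remedy is add-one (Laplace) smoothing. The sketch below is not part of the original class; `fit_counts`, `predict_smoothed`, `n_values` and `alpha` are hypothetical names used only for illustration, and `n_values[i]` plays the role of `len(self.attrs[i])`.

```python
from collections import Counter

def fit_counts(X, y):
    """Count class occurrences and (feature index, value, class) co-occurrences."""
    class_counts = Counter(y)
    feature_counts = Counter()
    for row, label in zip(X, y):
        for i, value in enumerate(row):
            feature_counts[(i, value, label)] += 1
    return class_counts, feature_counts

def predict_smoothed(entry, class_counts, feature_counts, n_values, alpha=1.0):
    """Pick the class maximising P(y) * prod_i P(x_i | y) with additive smoothing.

    n_values[i] is the number of distinct values of feature i in the training set.
    alpha=1.0 is add-one smoothing (an assumption of this sketch).
    """
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # P(y)
        for i, value in enumerate(entry):
            # Smoothed estimate of P(X_i = value | y); never exactly zero when alpha > 0
            score *= (feature_counts[(i, value, label)] + alpha) / (count + alpha * n_values[i])
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up toy data: [sex, class] -> survived
X_toy = [['male', 3], ['male', 3], ['female', 1], ['female', 3]]
y_toy = ['No', 'No', 'Yes', 'Yes']
class_counts, feature_counts = fit_counts(X_toy, y_toy)
n_values = [2, 2]  # two distinct values in each toy column
print(predict_smoothed(['female', 1], class_counts, feature_counts, n_values))
print(predict_smoothed(['male', 3], class_counts, feature_counts, n_values))
```

With `alpha=0` the estimate reduces to the plain count ratio, so the smoothing strength is easy to compare against the unsmoothed behaviour.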

## Simple example

We are going to use a dataset that contains information about the passengers on the Titanic. Our NBC will try to predict whether a given passenger survived the tragedy.

In [100]:
import pandas as pd
data = pd.read_csv('titanic_dataSet.csv')
# print(data.head())
data['Survived'] = data['Survived'].map({
    0: 'No',
    1: 'Yes'
})
y = list(data['Survived'])
# 'Survived' is our label for the Naive Bayes algorithm

# We use neither the 'Name' nor the 'Fare' field
X = data[['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']].values
# Without .values, type(X) is pandas.core.frame.DataFrame; with it, X is a numpy.ndarray.
# The classifier indexes rows as X[i][j], which needs the ndarray form, so .values is important.

Now let's split the data into a training set and a validation set.

In [101]:
print(len(y))  # this line is just for checking the total examples available to us

# We'll take the first 600 examples to train the model and the rest for validation
X_train = X[:600]
y_train = y[:600]

X_val = X[600:]
y_val = y[600:]

887
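
The split above simply takes the first 600 rows. If the file happened to be ordered in some meaningful way, a shuffled split might be more representative. The following is a minimal sketch under that assumption; `shuffled_split` and its `seed` parameter are hypothetical names, and the toy data merely stands in for the real `X` and `y`.

```python
import random

def shuffled_split(X, y, n_train, seed=0):
    """Return (X_train, y_train, X_val, y_val) after shuffling the row indices."""
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    train_idx, val_idx = indices[:n_train], indices[n_train:]
    return ([X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in val_idx], [y[i] for i in val_idx])

# Toy data standing in for the real feature rows and labels
X_toy = [[i] for i in range(10)]
y_toy = ['Yes' if i % 2 else 'No' for i in range(10)]
X_tr, y_tr, X_va, y_va = shuffled_split(X_toy, y_toy, n_train=6)
print(len(X_tr), len(X_va))
```

Rows and labels are shuffled through the same index list, so each row stays paired with its own label.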


In [102]:
# Create the Naive Bayes Classifier instance with the training data
# print(X_train)
nbc = NaiveBayesClassifier(X_train, y_train)
total_cases = len(y_val)  # size of validation set
# Counts of correctly and incorrectly classified examples
good = 0
bad = 0
for i in range(total_cases):
    predict = nbc.classify(X_val[i])
    # print(y_val[i] + ' --------------- ' + predict)
    if y_val[i] == predict:
        good += 1
    else:
        bad += 1

print('TOTAL EXAMPLES:', total_cases)
print('RIGHT:', good)
print('WRONG:', bad)
print('ACCURACY:', good/total_cases)

prob:  0.605
prob:  0.395

## Conclusions

NBCs are pretty easy to implement, and they make a good baseline against which to compare more complex models. In this case, the accuracy is not good: for example, simply classifying all women as survivors gives a better accuracy.
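
The "all women survive" rule mentioned above can be sketched as a tiny baseline. This is an illustration only: `sex_baseline` and `accuracy` are hypothetical helper names, the column index assumes the feature order used earlier (`Sex` in position 1), and the toy rows and labels are made up rather than taken from the dataset.

```python
def sex_baseline(rows, sex_index=1):
    """Predict 'Yes' for every female passenger and 'No' for everyone else."""
    return ['Yes' if row[sex_index] == 'female' else 'No' for row in rows]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)

# Made-up rows in the same column order as X: [Pclass, Sex, Age, Sib/Sp, Par/Ch]
rows_toy = [
    [3, 'male', 22.0, 1, 0],
    [1, 'female', 38.0, 1, 0],
    [3, 'female', 26.0, 0, 0],
    [1, 'male', 54.0, 0, 0],
]
labels_toy = ['No', 'Yes', 'Yes', 'Yes']

baseline_pred = sex_baseline(rows_toy)
print(accuracy(labels_toy, baseline_pred))
```

On the real data, the same two functions could be called with `X_val` and `y_val` to compare the baseline against the classifier on equal terms.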

But you can improve the accuracy with some feature engineering. A simple approach is to remove some features (e.g., keep only the sex and the class). The accuracy will improve by about 10%.
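
Dropping features amounts to keeping only some columns of each row before training. A minimal sketch, assuming the column order used earlier (`Pclass` at index 0, `Sex` at index 1); `select_columns` is a hypothetical helper and the toy rows are made up.

```python
def select_columns(X, keep):
    """Return a copy of X that contains only the columns whose indices are in keep."""
    return [[row[j] for j in keep] for row in X]

# Made-up rows in the order [Pclass, Sex, Age, Sib/Sp, Par/Ch]
rows_toy = [
    [3, 'male', 22.0, 1, 0],
    [1, 'female', 38.0, 1, 0],
]
reduced = select_columns(rows_toy, keep=[0, 1])
print(reduced)
```

The same selection has to be applied to both the training rows and the validation rows, so that the classifier sees entries with the same number of columns it was trained on.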

You can also check that the number of false negatives is much greater than the number of false positives. That is because many more passengers died than survived, so the dataset is unbalanced.
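
One way to check the false negative and false positive counts is to tally the four outcomes separately instead of only right and wrong. The sketch below treats `'Yes'` (survived) as the positive class; `confusion_counts` is a hypothetical helper and the toy labels are made up.

```python
def confusion_counts(y_true, y_pred, positive='Yes'):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {'tp': 0, 'fp': 0, 'fn': 0, 'tn': 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts['tp' if truth == positive else 'fp'] += 1
        else:
            counts['fn' if truth == positive else 'tn'] += 1
    return counts

# Made-up labels: the predictions miss two of the three survivors
y_true_toy = ['Yes', 'Yes', 'No', 'No', 'Yes']
y_pred_toy = ['No', 'Yes', 'No', 'No', 'No']
toy_counts = confusion_counts(y_true_toy, y_pred_toy)
print(toy_counts)
```

With the notebook's variables, the predictions would be collected in a list inside the validation loop and passed in together with `y_val`.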