In this tutorial, we implement the Naive Bayes algorithm from scratch. We then run Scikit-Learn's predefined Naive Bayes classifiers and compare the results.
First, we import the libraries used during the Naive Bayes implementation: NumPy and Scikit-Learn. If you haven't used them before, you first need to install them from the terminal using the pip command (for example, pip install numpy scikit-learn).

In [1]:
import numpy as np
import sklearn

We start with the from-scratch implementation of Naive Bayes. The Scikit-Learn website has a detailed explanation of the method: https://scikit-learn.org/stable/modules/naive_bayes.html. We use the logarithmic version of the equation so that we work with additions instead of multiplications, which avoids numerical underflow when many small probabilities are combined. If you want a deeper explanation of the from-scratch implementation, you can watch this tutorial: https://www.youtube.com/watch?v=TLInuAorxqE. The likelihood of each feature is calculated using the Gaussian distribution.
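
To see why the logarithmic form helps, here is a minimal sketch using only the standard library. The likelihood values are hypothetical and chosen purely for illustration; they do not come from the Iris data.

```python
import math

# Hypothetical per-feature likelihoods: 400 small values, for illustration only.
likelihoods = [1e-3] * 400

# Multiplying them directly underflows to exactly 0.0 in double precision.
product = 1.0
for p in likelihoods:
    product *= p

# Summing the logarithms keeps the same information in a representable range.
log_sum = sum(math.log(p) for p in likelihoods)

print(product)
print(log_sum)
```

Because the logarithm is monotonic, the class with the largest sum of log-probabilities is also the class with the largest product, so the predicted label does not change.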

In [2]:
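class NaiveBayesScratch:
    """Gaussian Naive Bayes classifier implemented with NumPy."""

    def initialize(self, X, y):
        """Train the model: estimate per-class feature means, variances, and priors."""
        n_samples, n_features = X.shape
        self.classes = np.unique(y)
        n_classes = len(self.classes)

        self.mean = np.zeros((n_classes, n_features))
        self.var = np.zeros((n_classes, n_features))
        self.priors = np.zeros(n_classes)

        for idx, c in enumerate(self.classes):
            X_c = X[y == c]  # training samples that belong to class c
            self.mean[idx, :] = X_c.mean(axis=0)
            self.var[idx, :] = X_c.var(axis=0)
            # Prior P(c): fraction of training samples in class c
            self.priors[idx] = X_c.shape[0] / float(n_samples)

    def predict(self, X):
        """Predict a class label for every row of X."""
        y_pred = [self.fit(x) for x in X]
        return np.array(y_pred)

    def fit(self, x):
        """Classify a single sample x (despite its name, this does not train the model)."""
        posteriors = []

        for idx, c in enumerate(self.classes):
            # Log-posterior up to a constant: log P(c) + sum_i log P(x_i | c)
            prior = np.log(self.priors[idx])
            posterior = np.sum(np.log(self.probabilityDensity(idx, x)))
            posterior = posterior + prior
            posteriors.append(posterior)

        # Return the class with the highest log-posterior
        return self.classes[np.argmax(posteriors)]

    def probabilityDensity(self, idx, x):
        """Gaussian density of each feature of x under the class at index idx."""
        mean = self.mean[idx]
        var = self.var[idx]
        probDen = (np.exp(-((x - mean) ** 2) / (2 * var))) / (np.sqrt(2 * np.pi * var))
        return probDen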
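
The class above evaluates the Gaussian density and then takes its logarithm. A possible variation, shown here only as a sketch, is to compute the log-density directly. The helper name gaussian_log_density and the sample values are assumptions made for this illustration and are not part of the implementation above.

```python
import numpy as np

def gaussian_log_density(x, mean, var):
    """Element-wise log of the Gaussian density, computed without calling exp()."""
    return -0.5 * np.log(2 * np.pi * var) - ((x - mean) ** 2) / (2 * var)

# Illustrative values (not taken from the Iris data).
mean = np.array([5.0, 3.4])
var = np.array([0.12, 0.14])

x_near = np.array([5.1, 3.5])
x_far = np.array([60.0, 3.5])  # very far from the mean in the first feature

# For a typical sample, both routes give the same numbers.
density = np.exp(-((x_near - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
print(np.allclose(np.log(density), gaussian_log_density(x_near, mean, var)))

# For an extreme sample the density itself underflows to 0 (so its log would be -inf),
# while the log-space formula still returns finite values.
print(gaussian_log_density(x_far, mean, var))
```

On a dataset like Iris the two approaches should give the same predictions; the log-space form mainly matters when a sample lies very far from a class mean.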

We are ready! We can test our algorithm's performance using the well-known Iris dataset. This dataset consists of the petal and sepal lengths and widths of 3 different types of irises (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray (src: https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html). We can easily import the dataset using the predefined Scikit-Learn function. In order to test the algorithm, we need to create separate train and test sets. We are going to use 80% of our dataset to train the model and the remaining 20% for testing.

In [7]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
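
For intuition, a hold-out split like the one above can be sketched by hand with NumPy. The helper holdout_split and the stand-in arrays are hypothetical names used only for this illustration. The sketch does not reproduce the exact partition that train_test_split returns, only the idea of shuffling once and slicing.

```python
import numpy as np

def holdout_split(X, y, test_size=0.2, seed=0):
    """Shuffle the row indices once, then slice them into test and train parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Stand-in data with the same shape as Iris (150 samples, 4 features).
X_demo = np.arange(600, dtype=float).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

X_tr, X_te, y_tr, y_te = holdout_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

Fixing the seed (random_state in the Scikit-Learn call) makes the split reproducible, so the mislabeled counts reported below can be compared across runs.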

In [9]:
nb = NaiveBayesScratch()
nb.initialize(X_train, y_train)
predictions = nb.predict(X_test)
print("Number of mislabeled points out of a total %d points using our Naive Bayes Classifier: %d"
       % (X_test.shape[0], (y_test != predictions).sum()))

Number of mislabeled points out of a total 30 points using our Naive Bayes Classifier: 1


Our model works really well: it mislabels only 1 of the 30 test points. Let's compare the result with the predefined Scikit-Learn classifiers. There are 5 different Naive Bayes classifiers: Gaussian NB, Multinomial NB, Complement NB, Bernoulli NB, and Categorical NB. Each of them has different advantages and disadvantages, and we need to understand our data to select the most suitable one for a given scenario. You can find the details on the Scikit-Learn website: https://scikit-learn.org/stable/modules/naive_bayes.html#. We select two of them for our application: Gaussian NB and Categorical NB. Since our implementation also uses the Gaussian approach, the results of our classifier and Gaussian NB should be similar. Categorical NB assumes that each feature, which is described by the index i, has its own categorical distribution. It is designed for discrete, integer-encoded features, so it is a less natural fit for the continuous Iris measurements and is included here only for comparison.

In [12]:
from sklearn.naive_bayes import GaussianNB, CategoricalNB
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print("Number of mislabeled points out of a total %d points using Gaussian Naive Bayes Classifier: %d"
       % (X_test.shape[0], (y_test != y_pred).sum()))

# Use a separate name so the Gaussian model above is not overwritten
cnb = CategoricalNB()
y_pred = cnb.fit(X_train, y_train).predict(X_test)
print("Number of mislabeled points out of a total %d points using Categorical Naive Bayes Classifier: %d"
       % (X_test.shape[0], (y_test != y_pred).sum()))

Number of mislabeled points out of a total 30 points using Gaussian Naive Bayes Classifier: 1
Number of mislabeled points out of a total 30 points using Categorical Naive Bayes Classifier: 2
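
One way to check that the from-scratch model and GaussianNB really learn the same thing is to compare their fitted parameters. The sketch below recomputes the per-class means and priors with plain NumPy, in the same way as the from-scratch class, and compares them with the theta_ and class_prior_ attributes of a fitted GaussianNB. The variable names are chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

gnb = GaussianNB().fit(X_train, y_train)

# Per-class means and priors, computed as in the from-scratch class.
classes = np.unique(y_train)
means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
priors = np.array([(y_train == c).mean() for c in classes])

print(np.allclose(gnb.theta_, means))
print(np.allclose(gnb.class_prior_, priors))
```

The variances are not compared here because GaussianNB adds a small smoothing term (controlled by its var_smoothing parameter) to them, so they can differ very slightly from the raw per-class variances.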
