## NAIVE BAYES ALGORITHM

### Implementing the Naive Bayes Algorithm Without Using External Libraries

First, let's import the NumPy and pandas libraries.

In [46]:
import numpy as np
import pandas as pd

Next, let's load the Iris dataset with pandas, drop the "Id" column (it is only a row identifier and carries no predictive information), and shuffle the rows of the resulting DataFrame.

In [47]:
df = pd.read_csv("Iris.csv")
df.drop("Id",inplace=True,axis=1)
df = df.sample(frac=1)
df

| | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|
| 117 | 7.7 | 3.8 | 6.7 | 2.2 | Iris-virginica |
| 85 | 6.0 | 3.4 | 4.5 | 1.6 | Iris-versicolor |
| 124 | 6.7 | 3.3 | 5.7 | 2.1 | Iris-virginica |
| 36 | 5.5 | 3.5 | 1.3 | 0.2 | Iris-setosa |
| 29 | 4.7 | 3.2 | 1.6 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 106 | 4.9 | 2.5 | 4.5 | 1.7 | Iris-virginica |
| 15 | 5.7 | 4.4 | 1.5 | 0.4 | Iris-setosa |
| 37 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
| 108 | 6.7 | 2.5 | 5.8 | 1.8 | Iris-virginica |


Let's create a function called 'split_data' to separate the data into 'train' and 'test' sets. It takes three parameters: 'X', which holds the features of the flowers; 'y', which holds their class labels; and 'train_size', the fraction of rows (between 0 and 1) assigned to the 'train' set. The function simply slices the rows in order, so it relies on the data having been shuffled beforehand, as we did when loading the dataset.

In [48]:
def split_data(X, y, train_size):

    start = int(len(X)*train_size)

    X_train = X[:start]
    X_test = X[start:]
    y_train = y[:start]
    y_test = y[start:]

    return X_train, X_test, y_train, y_test
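
To make the slicing behaviour concrete, here is a minimal sketch that repeats the same logic on a small toy array. The names `X_demo` and `y_demo` and their values are hypothetical and are not part of the Iris workflow.

```python
import numpy as np

def split_data(X, y, train_size):
    # Same slicing logic as the function above: the first `train_size`
    # fraction of rows goes to the training set, the rest to the test set.
    start = int(len(X) * train_size)
    return X[:start], X[start:], y[:start], y[start:]

# Hypothetical toy data: 8 samples with 2 features each.
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1])

X_tr, X_te, y_tr, y_te = split_data(X_demo, y_demo, train_size=0.75)
print(X_tr.shape, X_te.shape)  # (6, 2) (2, 2)
```

Because the rows are taken in order, no sample ends up in both sets, but the split is only random if the input was shuffled first.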

Now it's time to create the Naive Bayes algorithm. Let's take a look at the functions we have created in the NaiveBayesClassifier class one by one:

<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*5OROQqYWuC6to-5T9OMtXw.jpeg" width="600" height="400">
<img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*WAvNX2uhP9eg6P5CkkiL6g.png" width="600" height="400">

The calc_prior function takes the features and class labels of the training data as arguments and calculates the prior probability of each class, that is, the fraction of training rows that belong to that class.
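
As a rough illustration of the prior, the sketch below counts how often each label occurs and divides by the total number of samples. It uses plain NumPy rather than the class's groupby call, and the `labels` array is a hypothetical example.

```python
import numpy as np

# Hypothetical labels for six training samples.
labels = np.array(["setosa", "setosa", "setosa",
                   "versicolor", "versicolor", "virginica"])

# np.unique returns the sorted class names and how often each one occurs.
classes, counts = np.unique(labels, return_counts=True)

# Prior of a class = number of samples in that class / total number of samples.
priors = counts / len(labels)

for name, p in zip(classes, priors):
    print(name, round(p, 3))
```

The priors always sum to 1, since every sample belongs to exactly one class.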

The calc_statistics function takes the same arguments. It calculates the mean and variance of each feature column separately for each class and stores the results as NumPy arrays with one row per class. These statistics are required by the gaussian_density function.
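
The per-class statistics can be sketched with boolean masks in NumPy. This is only an illustration on hypothetical data (`X_demo`, `y_demo`), not the class's own implementation; it uses the population variance (NumPy's default, `ddof=0`).

```python
import numpy as np

# Hypothetical data: 4 samples, 2 features, 2 classes.
X_demo = np.array([[1.0, 10.0],
                   [3.0, 14.0],
                   [2.0, 20.0],
                   [6.0, 24.0]])
y_demo = np.array(["a", "a", "b", "b"])

classes = np.unique(y_demo)

# One row per class, one column per feature.
means = np.array([X_demo[y_demo == c].mean(axis=0) for c in classes])
variances = np.array([X_demo[y_demo == c].var(axis=0) for c in classes])

print(means)      # class "a": [2, 12], class "b": [4, 22]
print(variances)  # class "a": [1, 4],  class "b": [4, 4]
```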

The gaussian_density function evaluates the Gaussian (normal) probability density. We assume that, within a given class, the values of each feature are normally distributed, so the function computes the likelihood of a sample's feature values from that class's mean and variance.
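
For reference, the density of a normal distribution with mean μ and variance σ² at a point x is exp(-(x - μ)² / (2σ²)) / sqrt(2πσ²). A minimal standalone sketch follows; the function name `gaussian_pdf` is hypothetical and chosen only for this example.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    # Normal density: exp(-(x - mean)^2 / (2 * var)) / sqrt(2 * pi * var).
    # Works element-wise, so x, mean and var may be arrays of equal shape.
    return np.exp(-((x - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# At the mean of a standard normal the density is 1 / sqrt(2 * pi).
peak = gaussian_pdf(0.0, 0.0, 1.0)
print(peak)  # approximately 0.3989

# Element-wise evaluation for a sample with two features.
densities = gaussian_pdf(np.array([5.0, 3.0]),
                         np.array([5.0, 3.5]),
                         np.array([0.1, 0.2]))
```

Note that the factor 1/2 in the exponent appears only once, as part of the 2σ² term.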

The "calc_posterior" function calculates the log-posterior of every class for a single sample by adding the log of the class prior to the sum of the log-densities of the sample's features, and returns the class with the highest value. The "predict" function calls it for each row of the test data and returns the predictions as a list.
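
The decision rule can be sketched on its own as follows. All values here (two classes, two features, equal priors, unit variances) are hypothetical. Working with logarithms turns the product of per-feature densities into a sum, which avoids numerical underflow when many small probabilities are multiplied.

```python
import numpy as np

# Hypothetical fitted parameters: one row per class, one column per feature.
classes = np.array(["a", "b"])
priors = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0],
                  [5.0, 5.0]])
variances = np.array([[1.0, 1.0],
                      [1.0, 1.0]])

def log_posteriors(x):
    # Log of the Gaussian density for every (class, feature) pair.
    log_density = -0.5 * np.log(2.0 * np.pi * variances) \
                  - (x - means) ** 2 / (2.0 * variances)
    # Naive Bayes assumption: features are independent given the class,
    # so the per-feature log-densities are summed.
    return np.log(priors) + log_density.sum(axis=1)

x_new = np.array([4.0, 6.0])
scores = log_posteriors(x_new)
predicted = classes[np.argmax(scores)]
print(predicted)  # b
```

The scores are unnormalised log-posteriors; since the normalising constant is the same for every class, it does not affect which class wins.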

In [49]:
class NaiveBayesClassifier():

    def calc_prior(self, features, target):

        self.prior = (features.groupby(target).apply(lambda x: len(x)) / self.rows).to_numpy()

        return self.prior
    
    def calc_statistics(self, features, target):

        # per-class mean and population variance (ddof=0) of every feature
        self.mean = features.groupby(target).mean().to_numpy()
        self.var = features.groupby(target).var(ddof=0).to_numpy()
              
        return self.mean, self.var
    
    def gaussian_density(self, class_idx, x):     

        mean = self.mean[class_idx]
        var = self.var[class_idx]
        # exponent of the normal density is -(x - mean)^2 / (2 * var)
        numerator = np.exp(-((x - mean) ** 2) / (2 * var))
        denominator = np.sqrt(2 * np.pi * var)
        prob = numerator / denominator
        return prob
    
    def calc_posterior(self, x):
        posteriors = []

        for i in range(self.count):
            prior = np.log(self.prior[i])
            conditional = np.sum(np.log(self.gaussian_density(i, x)))
            posterior = prior + conditional
            posteriors.append(posterior)
        return self.classes[np.argmax(posteriors)]
     

    def fit(self, features, target):
        self.classes = np.unique(target)
        self.count = len(self.classes)
        self.feature_nums = features.shape[1]
        self.rows = features.shape[0]
        
        self.calc_statistics(features, target)
        self.calc_prior(features, target)
        
    def predict(self, features):
        preds = [self.calc_posterior(f) for f in features.to_numpy()]
        return preds

    def accuracy(self, y_test, y_pred):
        accuracy = np.sum(y_test == y_pred) / len(y_test)
        print("Accuracy of Naive Bayes Model: ",accuracy)

Let's assign the DataFrame columns containing the characteristics of the flowers to a variable named "X", and the column containing the classes of the flowers to a variable named "y". Both stay as pandas objects, because the classifier uses "groupby" when fitting and converts rows to NumPy arrays itself when predicting. Then we pass them to the "split_data" function we created earlier, keeping 70% of the rows for training.

In [50]:
X = df.iloc[:,0:-1]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = split_data(X, y, train_size=0.7)

In [51]:
# train the model
naiveBayes = NaiveBayesClassifier()
naiveBayes.fit(X_train, y_train)
y_pred = naiveBayes.predict(X_test)
naiveBayes.accuracy(y_test, y_pred)

Accuracy of Naive Bayes Model:  0.9555555555555556


### Implementing the Naive Bayes Algorithm Using the Scikit-Learn Library

In [58]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

In [59]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [63]:
NB = GaussianNB()
NB.fit(X_train,y_train)
y_predNB = NB.predict(X_test)
print("Accuracy of Naive Bayes Model: ",accuracy_score(y_test,y_predNB))

Accuracy of Naive Bayes Model:  0.9555555555555556
