In classification, the goal of a learning algorithm is to construct a classifier from a set of training examples with class labels. Naive Bayes is a simple and effective probabilistic classifier based on applying Bayes' theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.

Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. Because the class-conditional feature distributions are decoupled, each one can be estimated independently as a one-dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.

Bayes' theorem provides a principled way of calculating a conditional probability:

P(class|data) = (P(data|class) * P(class)) / P(data)

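Here P(class|data) is the probability of the class given the provided data (the posterior), P(data|class) is the likelihood of the data under that class, P(class) is the prior probability of the class, and P(data) is the overall probability of the data.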
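
As a small worked example of the formula above, the sketch below plugs in made-up numbers for a two-class problem. All three input values are hypothetical and chosen only for illustration; P(data) is obtained from the law of total probability.

```python
# Hypothetical numbers, chosen only to illustrate the formula.
p_class = 0.35             # P(class): prior probability of the positive class
p_data_given_class = 0.60  # P(data|class): likelihood of the data in the positive class
p_data_given_other = 0.20  # likelihood of the same data in the negative class

# P(data) via the law of total probability over both classes.
p_data = p_data_given_class * p_class + p_data_given_other * (1 - p_class)

# Bayes' theorem: P(class|data) = P(data|class) * P(class) / P(data)
p_class_given_data = p_data_given_class * p_class / p_data
print(round(p_class_given_data, 4))
```

With these inputs the posterior is 0.21 / 0.34, so observing the data raises the probability of the positive class above its prior of 0.35.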

In this notebook, a Naive Bayes classifier is developed in the following steps:

1- Divide the data into classes.

2- Get the mean and standard deviation.

3- Obtain the Gaussian probability density function.

4- Make predictions.

5- Evaluate model performance.

Finally, the model is encapsulated within a class and its accuracy is compared to that obtained using the 'sklearn' library.




The data set used is "Pima Indians Diabetes Database" from the National Institute of Diabetes and Digestive and Kidney Diseases.

The variables are:

Pregnancies: Number of pregnancies the patient has had.

Plasma Glucose (mg/dl): Plasma glucose concentration.

Blood Pressure (mm Hg): Diastolic blood pressure in mm Hg.

Triceps Skin Fold Thickness (mm): Triceps skin fold thickness in mm.

Insulin (mu U/ml): 2-hour serum insulin.

Body Mass Index (BMI): Weight in kg divided by the square of height in m.

Diabetes Pedigree Function: A score of diabetes likelihood based on family history.

Age (years): Age of the patient in years.

Outcome: Class label, 1 if the patient has diabetes and 0 otherwise.




References:

https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/

https://scikit-learn.org/stable/modules/naive_bayes.html


Data set:

https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database


Import libraries.

In [54]:
import pandas as pd
import numpy as np
import math

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB




Mounting Google Drive is only necessary when the notebook runs on Google Colab and the data is stored in Drive. Skip this cell on other platforms.

In [55]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Import data

In [56]:
# Replace with the path to your CSV file
csv_path = '/content/drive/MyDrive/Colab Notebooks/primerose codes/data sets/diabetes.csv'

# Read the CSV file and convert the resulting DataFrame to a NumPy array
data = pd.read_csv(csv_path).values

Check data head

In [57]:
data[:10]

array([[6.000e+00, 1.480e+02, 7.200e+01, 3.500e+01, 0.000e+00, 3.360e+01,
        6.270e-01, 5.000e+01, 1.000e+00],
       [1.000e+00, 8.500e+01, 6.600e+01, 2.900e+01, 0.000e+00, 2.660e+01,
        3.510e-01, 3.100e+01, 0.000e+00],
       [8.000e+00, 1.830e+02, 6.400e+01, 0.000e+00, 0.000e+00, 2.330e+01,
        6.720e-01, 3.200e+01, 1.000e+00],
       [1.000e+00, 8.900e+01, 6.600e+01, 2.300e+01, 9.400e+01, 2.810e+01,
        1.670e-01, 2.100e+01, 0.000e+00],
       [0.000e+00, 1.370e+02, 4.000e+01, 3.500e+01, 1.680e+02, 4.310e+01,
        2.288e+00, 3.300e+01, 1.000e+00],
       [5.000e+00, 1.160e+02, 7.400e+01, 0.000e+00, 0.000e+00, 2.560e+01,
        2.010e-01, 3.000e+01, 0.000e+00],
       [3.000e+00, 7.800e+01, 5.000e+01, 3.200e+01, 8.800e+01, 3.100e+01,
        2.480e-01, 2.600e+01, 1.000e+00],
       [1.000e+01, 1.150e+02, 0.000e+00, 0.000e+00, 0.000e+00, 3.530e+01,
        1.340e-01, 2.900e+01, 0.000e+00],
       [2.000e+00, 1.970e+02, 7.000e+01, 4.500e+01, 5.430e+02, 3.050e+01

Split arrays into random train and test subsets.

In [58]:
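# Features are all columns except the last; the label is the last column
x = data[:,:-1]
y = data[:,-1]
# No random_state is set, so the split (and the accuracy below) changes on every run
x_train,x_test,y_train,y_test = train_test_split(x,y)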
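
The split above relies on the defaults of `train_test_split`. A minimal sketch of a repeatable, stratified split is shown below on synthetic data; the arrays `x_demo` and `y_demo` and the seed values are assumptions made for the example and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's arrays: 100 rows, 8 features, imbalanced labels.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 8))
y_demo = np.array([0] * 65 + [1] * 35)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo,
    test_size=0.25,    # the default when test_size and train_size are both omitted
    random_state=0,    # fixes the shuffle so the split is repeatable
    stratify=y_demo,   # keeps the class ratio similar in both subsets
)
print(x_tr.shape, x_te.shape, y_te.mean())
```

Fixing `random_state` makes accuracy figures comparable between runs, which matters when two implementations are compared on the same test set.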


In [59]:
print(x.shape)
print()
print(y.shape)


(768, 8)

(768,)


1- Separate by class:

Separating by class means grouping the rows by their target label. In this data set there are two target labels: 0 and 1.
The function below selects the rows of one label and computes the mean and standard deviation of each feature (column) for that label.



In [60]:
def mean_and_std_per_label(x,y,wanted_label):
    # Select the rows that belong to the wanted class
    x_with_specific_class = x[y==wanted_label]
    # Compute the mean and standard deviation of each feature (column)
    mean = np.mean(x_with_specific_class,axis=0)
    std = np.std(x_with_specific_class,axis=0)
    return mean,std
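
To see what the boolean mask `x[y==wanted_label]` selects, here is a sketch on a tiny hand-made array. The helper `stats_by_class` and the toy data are hypothetical; the helper loops over whatever labels appear in `y` instead of hard-coding 0 and 1.

```python
import numpy as np

# Toy data (hypothetical): 5 rows, 2 features, labels 0 and 1.
x_toy = np.array([[1.0, 10.0],
                  [2.0, 12.0],
                  [3.0, 14.0],
                  [6.0, 20.0],
                  [8.0, 24.0]])
y_toy = np.array([0, 0, 0, 1, 1])

def stats_by_class(x, y):
    """Return {label: (mean, std)} with one entry per label found in y."""
    return {label: (x[y == label].mean(axis=0), x[y == label].std(axis=0))
            for label in np.unique(y)}

stats = stats_by_class(x_toy, y_toy)
for label, (mean, std) in stats.items():
    print(label, mean, std)
```

Note that `np.std` computes the population standard deviation by default (`ddof=0`), which is also what the notebook's function uses.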

2- Mean and standard deviation:

Obtain the mean and standard deviation for each feature (column) within each class.

In [61]:
def calc_mean_std(x,y):
    # Compute the mean and std for each label (0 and 1 in this exercise)
    mean_label_0,std_label_0 = mean_and_std_per_label(x,y,0)
    mean_label_1,std_label_1 = mean_and_std_per_label(x,y,1)
    return mean_label_0,std_label_0,mean_label_1,std_label_1

3- Gaussian Probability Density Function:

A Gaussian distribution is characterized by two parameters: the mean and the standard deviation. Using these two statistics, we can estimate how likely it is to observe a particular value. This calculation is the Gaussian probability density function (Gaussian PDF), which can be expressed as:
f(x) = (1 / (sqrt(2 * PI) * sigma)) * exp(-((x - mean)^2 / (2 * sigma^2)))


The function evaluates the probability density of a value x under a Gaussian (normal) distribution.
Applied to one row, it yields 8 densities (one per feature), which are then multiplied together under the naive independence assumption.
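
As a sanity check of the formula, the sketch below implements it for a single value and compares the result with `statistics.NormalDist` from the Python standard library. The function name `gaussian_pdf` and the input numbers are made up for the example.

```python
import math
from statistics import NormalDist

def gaussian_pdf(x, mean, sigma):
    """Density of a normal distribution with the given mean and sigma at x."""
    coefficient = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    exponent = -((x - mean) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Hypothetical values: a reading of 148 under a class with mean 110 and sigma 26.
manual = gaussian_pdf(148.0, 110.0, 26.0)
reference = NormalDist(mu=110.0, sigma=26.0).pdf(148.0)
print(manual, reference)
```

The two values should agree up to floating-point rounding. A density is not a probability and can exceed 1 when sigma is small.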

In [62]:
def predict_gussian(x,mean,std):
    # Gaussian PDF of each feature value in x, evaluated element-wise with that
    # feature's mean and std. An equivalent call is scipy.stats.norm.pdf(x, mean, std).
    prob_array = (1/(np.sqrt(2*math.pi)*std))*np.exp(-(x-mean)**2/(2*std**2))

    # Reshape to a column vector with one row per feature.
    prob_array = np.reshape(prob_array, (prob_array.shape[0],1))

    # Multiply the per-feature densities together (axis 0 runs over the features).
    # This is the naive assumption: features are independent given the class.
    prob = np.prod(prob_array,axis=0)
    return prob
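
Multiplying many small densities can underflow to zero when the number of features grows. A common alternative, sketched below under that assumption, is to sum log densities instead. The function `gaussian_log_likelihood` and the parameter values are hypothetical and are not used elsewhere in the notebook.

```python
import numpy as np

def gaussian_log_likelihood(x, mean, std):
    """Sum of per-feature Gaussian log densities for one row x."""
    log_pdf = -0.5 * np.log(2.0 * np.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)
    return np.sum(log_pdf)

# Hypothetical parameters for a single class with three features.
row = np.array([1.0, 2.0, 3.0])
mean = np.array([0.5, 2.5, 2.0])
std = np.array([1.0, 0.5, 2.0])

# The exponential of the log-likelihood matches the plain product of densities.
densities = (1 / (np.sqrt(2 * np.pi) * std)) * np.exp(-(row - mean) ** 2 / (2 * std ** 2))
print(np.exp(gaussian_log_likelihood(row, mean, std)), np.prod(densities))
```

Because the logarithm is monotonic, comparing log-likelihoods between classes gives the same prediction as comparing the products.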

4- Predictions:

Now it's time to utilize the statistical parameters derived from our training dataset to compute probabilities for new data.

Probabilities are computed independently for each class. This implies that initially, we compute the probability of a new data point belonging to the first class, followed by calculating probabilities for it belonging to the second class.

The score of a class for a piece of data is calculated using the formula:

P(class|data) ∝ P(data|class) * P(class)

Note that the denominator P(data) has been dropped to simplify the calculation, so the resulting value is no longer strictly a probability. Because P(data) is the same for every class, the class with the largest value is still the most probable one, and that class is the prediction. The code below also leaves out the prior P(class) and compares only the likelihoods P(data|class), which amounts to assuming that both classes are equally likely a priori.
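
To illustrate why dropping the denominator does not change the prediction, the sketch below normalizes two unnormalized scores back into probabilities. The score values are invented for the example.

```python
# Hypothetical unnormalized scores P(data|class) * P(class) for classes 0 and 1.
scores = [2.4e-12, 0.8e-12]

# Their sum equals P(data); dividing by it recovers proper posterior probabilities.
total = sum(scores)
posteriors = [s / total for s in scores]

# The index of the largest value is the same before and after normalizing.
predicted_from_scores = max(range(len(scores)), key=scores.__getitem__)
predicted_from_posteriors = max(range(len(posteriors)), key=posteriors.__getitem__)
print(posteriors, predicted_from_scores, predicted_from_posteriors)
```

Normalizing is only needed when calibrated probabilities are wanted; for picking a class, the raw scores are enough.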



In [63]:
def predict_on_dataset(x_values,mean_label_0,std_label_0,mean_label_1,std_label_1):
    prob0 = []
    prob1 = []
    for row in x_values:
        # Gaussian likelihood of the row under each class
        prob0_row = predict_gussian(row,mean_label_0,std_label_0)
        prob1_row = predict_gussian(row,mean_label_1,std_label_1)
        # Collect the likelihoods of every row
        prob0.append(prob0_row)
        prob1.append(prob1_row)
    # Stack into two columns: column 0 for class 0, column 1 for class 1
    probabilities = np.column_stack( (prob0,prob1))

    # Predict the class with the larger value in each row
    y_calculated = np.argmax(probabilities,axis=1)
    return y_calculated

5- Accuracy:

To evaluate the model's predictive performance, this function compares the predictions against the true labels. It returns the calculated accuracy score, representing the proportion of correctly predicted outcomes in the test dataset.

In [64]:
def evaluate_predictions(x_test,y_test,mean0,std0,mean1,std1):
    y_calculated = predict_on_dataset(x_test,mean0,std0,mean1,std1)
    diff = y_calculated - y_test # 0 if correct, +1 or -1 if incorrect
    wrong_points = np.sum(np.sqrt(diff**2) ) # square then square root gives the absolute value, so each error counts as 1
    length = y_calculated.shape[0]
    accuracy = 1-wrong_points/length
    return accuracy
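
The same quantity can be computed more directly as the fraction of positions where prediction and label agree. The helper below is a hypothetical alternative to the function above, shown on made-up labels.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical labels: 4 of 5 predictions are correct.
print(accuracy([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))
```

Unlike the difference-based version, this form also works when labels are not 0 and 1, since it only tests equality.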

Once the Naive Bayes Classifier is built, let's encapsulate its functionality within a class using object-oriented programming principles.

In [65]:
class NaiveBayesClassifier():

    def __init__(self):
        self.alg_name = 'NaiveBayes'
        # Parameters are unknown until fit() is called
        self.mean0 = None
        self.std0 = None
        self.mean1 = None
        self.std1 = None

    def fit(self,x_train,y_train):
        mean0,std0,mean1,std1 = calc_mean_std(x_train,y_train)
        self.mean0 = mean0
        self.mean1 = mean1
        self.std0 = std0
        self.std1 = std1

    def predict(self,x_test):
        prob0 = []
        prob1 = []
        for row in x_test:
            prob0_row = predict_gussian(row,self.mean0,self.std0)
            prob1_row = predict_gussian(row,self.mean1,self.std1)
            prob0.append(prob0_row)
            prob1.append(prob1_row)
        probabilities = np.column_stack( (prob0,prob1))
        # axis=1 picks the most likely class for each row
        y_calculated = np.argmax(probabilities,axis=1)
        return y_calculated

    def evaluate(self,x_test,y_test):
        accuracy = evaluate_predictions(x_test,y_test,self.mean0,self.std0,self.mean1,self.std1)
        return accuracy
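
The class above is tied to two labels. As a sketch of how the same idea could extend to any number of classes, the version below stores one row of parameters per class, works in log space, and includes the class prior. The class name `GaussianNaiveBayes` and its attribute names are hypothetical, and no variance smoothing is applied, so a feature with zero standard deviation within a class would break it.

```python
import numpy as np

class GaussianNaiveBayes:
    """Gaussian Naive Bayes for any number of classes (illustrative sketch)."""

    def fit(self, x, y):
        self.classes_ = np.unique(y)
        # One row of means, standard deviations and log-priors per class.
        self.mean_ = np.array([x[y == c].mean(axis=0) for c in self.classes_])
        self.std_ = np.array([x[y == c].std(axis=0) for c in self.classes_])
        self.log_prior_ = np.log(np.array([np.mean(y == c) for c in self.classes_]))
        return self

    def predict(self, x):
        # Broadcast to shape (n_rows, n_classes, n_features).
        diff = x[:, None, :] - self.mean_[None, :, :]
        log_pdf = -0.5 * np.log(2 * np.pi * self.std_ ** 2) - diff ** 2 / (2 * self.std_ ** 2)
        # Sum over features, then add the log-prior of each class.
        scores = log_pdf.sum(axis=2) + self.log_prior_
        return self.classes_[np.argmax(scores, axis=1)]

# Two well-separated synthetic clusters (hypothetical data).
rng = np.random.default_rng(0)
x_demo = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                    rng.normal(6.0, 1.0, size=(50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)
model = GaussianNaiveBayes().fit(x_demo, y_demo)
print(np.mean(model.predict(x_demo) == y_demo))
```

Vectorizing over rows and classes removes the Python loop over test rows, which helps on larger data sets.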



In [66]:
if __name__ == '__main__':
    my_classifier = NaiveBayesClassifier()
    my_classifier.fit(x_train,y_train)
    accuracy = my_classifier.evaluate(x_test,y_test)

In [67]:
print(accuracy)

0.7395833333333333


Now, let's compare this accuracy with the one obtained from the Gaussian Naive Bayes implementation available in the sklearn library.
GaussianNB was already imported at the top of the notebook with from sklearn.naive_bayes import GaussianNB.

In [68]:
# Let's try the sklearn implementation (import repeated from the first cell)
from sklearn.naive_bayes import GaussianNB

In [69]:
gnb = GaussianNB()
y_pred = gnb.fit(x_train, y_train).predict(x_test)

In [70]:
accuracy_score(y_test, y_pred)

0.7395833333333334

Great! We obtained the same result on this split, up to floating-point rounding. In just three lines, sklearn reaches the same accuracy as the classifier built from scratch.