# Gaussian Naive Bayes Classifier for Diabetes Prediction

This notebook implements a Gaussian Naive Bayes classifier from scratch to predict the likelihood of diabetes based on two continuous features: **Glucose** level and **BMI**.

Unlike classifiers that count discrete feature occurrences, this model assumes that, within each class (diabetic or not), each feature follows a Gaussian (normal) distribution. The "naive" part is the additional assumption that the features are conditionally independent given the class.

In [1]:
import numpy as np
import pandas as pd
import math

## 1. Helper Function: Gaussian Probability Density

This function evaluates the Gaussian probability density at a data point `x`, given the mean (`mu`) and standard deviation (`std`) of a feature's distribution for a specific class. The result is a density rather than a probability: it measures how plausible a value is under that distribution relative to other values, and it can exceed 1 when `std` is small.
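For reference, the density evaluated by the helper can be written as follows, with $\mu$ standing for `mu` and $\sigma$ for `std` (the symbol $f$ is just a label chosen here):

$$
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$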

In [2]:
def gaussian_prob(x, mu, std):
    """Calculates the probability density of x for a Gaussian distribution."""
    # This formula calculates the 'height' of the bell curve at point 'x'.
    return (1 / (math.sqrt(2 * math.pi) * std)) * math.exp(-((x - mu) ** 2) / (2 * (std ** 2)))
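
A quick way to gain confidence in the helper is to compare it against the standard library. The sketch below repeats the formula so it runs on its own and checks it against `statistics.NormalDist` (available in Python 3.8+). The test values are illustrative and are not taken from the dataset.

```python
import math
from statistics import NormalDist

def gaussian_prob(x, mu, std):
    """Same density formula as the helper, repeated so this cell is self-contained."""
    return (1 / (math.sqrt(2 * math.pi) * std)) * math.exp(-((x - mu) ** 2) / (2 * (std ** 2)))

# Illustrative values only, not taken from diabetes.csv
x, mu, std = 150.0, 120.0, 30.0

ours = gaussian_prob(x, mu, std)
reference = NormalDist(mu=mu, sigma=std).pdf(x)

# The two values should agree up to floating-point rounding
print(ours, reference, math.isclose(ours, reference, rel_tol=1e-9))
```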

## 2. Main Classifier Function

This function brings everything together. It computes the class priors, the likelihoods (using the Gaussian helper function), and the evidence, then combines them with Bayes' theorem into the posterior probability of each class.
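
Written out, the quantity the function returns for each class $C_k$ (with $k = 0$ for No Diabetes and $k = 1$ for Diabetes) is the expression below. Here $x_1$ is the Glucose value, $x_2$ is the BMI value, $f$ is the Gaussian density computed by `gaussian_prob`, and $\mu_{ik}$, $\sigma_{ik}$ are the mean and standard deviation of feature $i$ within class $k$. This notation is introduced here for illustration; the product of the two densities reflects the naive independence assumption.

$$
P(C_k \mid x_1, x_2) = \frac{P(C_k)\, f(x_1 \mid \mu_{1k}, \sigma_{1k})\, f(x_2 \mid \mu_{2k}, \sigma_{2k})}{\sum_{j \in \{0,1\}} P(C_j)\, f(x_1 \mid \mu_{1j}, \sigma_{1j})\, f(x_2 \mid \mu_{2j}, \sigma_{2j})}
$$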

In [3]:
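def bayesian_classifier(outcome, collected_data, given_data):
    """Implements the Gaussian Naive Bayes algorithm for two classes.

    outcome: list of class labels (0 = No Diabetes, 1 = Diabetes).
    collected_data: nested list indexed as collected_data[feature][class].
    given_data: feature values of the point to classify, in the same feature order.
    """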
    # Calculate the Prior probability of each class based on its frequency
    prior = [outcome.count(0) / len(outcome), outcome.count(1) / len(outcome)]

    # Calculate the Likelihoods for the given_data under each class
    likelihoods = [
        # Likelihood for class 0 (No Diabetes): P(Glucose | class 0) * P(BMI | class 0)
        gaussian_prob(given_data[0], np.mean(collected_data[0][0]), np.std(collected_data[0][0])) *
        gaussian_prob(given_data[1], np.mean(collected_data[1][0]), np.std(collected_data[1][0])),

        # Likelihood for class 1 (Diabetes): P(Glucose | class 1) * P(BMI | class 1)
        gaussian_prob(given_data[0], np.mean(collected_data[0][1]), np.std(collected_data[0][1])) *
        gaussian_prob(given_data[1], np.mean(collected_data[1][1]), np.std(collected_data[1][1]))
    ]

    # Calculate the Evidence (overall probability of observing the data)
    evidence = prior[0] * likelihoods[0] + prior[1] * likelihoods[1]

    # Calculate the final Posterior probabilities for each class using Bayes' Theorem
    result = [(likelihoods[0] * prior[0]) / evidence, (likelihoods[1] * prior[1]) / evidence]

    # Print the calculation breakdown
    print("--- Bayes Calculation Breakdown ---")
    print(f"Prior P(No Diabetes): {prior[0]:.3f} | P(Diabetes): {prior[1]:.3f}")
    print(f"Likelihood P(Data | No Diabetes): {likelihoods[0]:.5e}")
    print(f"Likelihood P(Data | Diabetes):    {likelihoods[1]:.5e}")
    print(f"Evidence P(Data): {evidence:.5e}")
    print(f"Posterior P(No Diabetes | Data): {result[0]:.3f}")
    print(f"Posterior P(Diabetes | Data):    {result[1]:.3f}")
    print("-----------------------------------")
    
    return result
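
With only two features the product of densities stays comfortably within floating-point range, but with many features it can underflow to zero. One common remedy is to work with log-probabilities. The following is a minimal sketch of that idea, not part of the original notebook: `log_gaussian` and `log_space_posteriors` are hypothetical names, and the training values are made up for illustration.

```python
import math
import numpy as np

def log_gaussian(x, mu, std):
    """Log of the Gaussian density at x, computed without calling exp()."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - ((x - mu) ** 2) / (2 * std ** 2)

def log_space_posteriors(outcome, collected_data, given_data):
    """Hypothetical log-space variant of the classifier for classes 0 and 1.

    collected_data[f][c] holds the training values of feature f for class c.
    """
    log_joint = []
    for c in (0, 1):
        # log prior + sum of per-feature log likelihoods
        total = math.log(outcome.count(c) / len(outcome))
        for f, x in enumerate(given_data):
            values = collected_data[f][c]
            total += log_gaussian(x, np.mean(values), np.std(values))
        log_joint.append(total)

    # Log-sum-exp: subtract the maximum before exponentiating to avoid underflow
    m = max(log_joint)
    log_evidence = m + math.log(sum(math.exp(v - m) for v in log_joint))
    return [math.exp(v - log_evidence) for v in log_joint]

# Tiny synthetic example (made-up values, not from diabetes.csv)
outcome = [0, 0, 0, 1, 1, 1]
glucose = [[90, 100, 110], [140, 150, 165]]
bmi = [[22.0, 25.0, 27.0], [31.0, 35.0, 38.0]]
posteriors = log_space_posteriors(outcome, [glucose, bmi], [150, 40])
print(posteriors)
```

The returned list has the same meaning as the posteriors computed above: one probability per class, summing to 1.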

## 3. Data Loading and Preprocessing

Here we load the `diabetes.csv` dataset and segregate the `Glucose` and `BMI` data into lists based on the outcome (0 for No Diabetes, 1 for Diabetes).

In [4]:
# !!! IMPORTANT: Update this path to the location of your diabetes.csv file !!!
path = r"diabetes.csv"

try:
    file = pd.read_csv(path)
    print("File loaded successfully!")
    
    # Extract columns into lists
    outcome = list(file["Outcome"])
    glucose_data = list(file["Glucose"])
    bmi_data = list(file["BMI"])
    
    # Initialize lists to hold data segregated by class
    bmi = [[], []]      # bmi[0] for No Diabetes, bmi[1] for Diabetes
    glucose = [[], []]  # glucose[0] for No Diabetes, glucose[1] for Diabetes
    
    # Segregate the data by outcome
    for i in range(len(outcome)):
        if outcome[i] == 0:
            glucose[0].append(glucose_data[i])
            bmi[0].append(bmi_data[i])
        else:
            glucose[1].append(glucose_data[i])
            bmi[1].append(bmi_data[i])

    # Combine into the final structure for the classifier
    collected_data = [glucose, bmi]
    print(f"Data processed: {len(glucose[0])} non-diabetic samples and {len(glucose[1])} diabetic samples.")

except FileNotFoundError:
    print(f"Error: The file was not found at {path}")
    print("Please update the 'path' variable to the correct location of your 'diabetes.csv' file.")
    collected_data = None # Prevent errors in the next cell

File loaded successfully!
Data processed: 500 non-diabetic samples and 268 diabetic samples.
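
The same `collected_data[feature][class]` structure can also be built with pandas boolean masks instead of an explicit loop. The sketch below uses a small synthetic frame with the same column names so it runs without `diabetes.csv`; the values are made up. It also shows the per-class means and population standard deviations, using `ddof=0` to match the default of `np.std`.

```python
import pandas as pd

# Small synthetic frame with the same column names (values are made up)
df = pd.DataFrame({
    "Glucose": [90, 100, 110, 140, 150, 165],
    "BMI": [22.0, 25.0, 27.0, 31.0, 35.0, 38.0],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Build collected_data[feature][class] with boolean masks instead of a loop
features = ["Glucose", "BMI"]
collected_data = [
    [df.loc[df["Outcome"] == c, name].tolist() for c in (0, 1)]
    for name in features
]
outcome = df["Outcome"].tolist()

# Per-class summary statistics; ddof=0 gives the population standard deviation
means = df.groupby("Outcome")[features].mean()
stds = df.groupby("Outcome")[features].std(ddof=0)
print(means)
print(stds)
```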


## 4. Making a Prediction

Finally, we define a new data point (a specific Glucose and BMI value) and use our classifier to predict the outcome.

In [6]:
if collected_data is not None:
    # Define the new data point we want to classify
    given_glucose = 150
    given_bmi = 40
    given_data = [given_glucose, given_bmi]
    print(f"Predicting for: Glucose = {given_glucose}, BMI = {given_bmi}")

    # Call the classifier to get the posterior probabilities
    result = bayesian_classifier(outcome, collected_data, given_data)

    # Compare the final probabilities to make a classification decision
    if result[0] > result[1]:
        print("\nPrediction: No Diabetes")
    else:
        print("\nPrediction: Diabetes")

Predicting for: Glucose = 150, BMI = 40
--- Bayes Calculation Breakdown ---
Prior P(No Diabetes): 0.651 | P(Diabetes): 0.349
Likelihood P(Data | No Diabetes): 1.10559e-04
Likelihood P(Data | Diabetes):    5.29880e-04
Evidence P(Data): 2.56884e-04
Posterior P(No Diabetes | Data): 0.280
Posterior P(Diabetes | Data):    0.720
-----------------------------------

Prediction: Diabetes
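
The printed breakdown can be checked by hand. The sketch below recomputes the evidence and posteriors from the class counts and likelihoods shown in the output; because the likelihoods are copied at the printed precision, the last digit of the evidence may differ slightly. Although the prior favors No Diabetes, the likelihood under the Diabetes class is roughly 4.8 times larger, which is enough to tip the posterior.

```python
# Class counts and likelihoods copied from the printed output above
n_no, n_yes = 500, 268
likelihood_no = 1.10559e-04
likelihood_yes = 5.29880e-04

# Priors are the class frequencies
prior_no = n_no / (n_no + n_yes)
prior_yes = n_yes / (n_no + n_yes)

# Bayes' theorem: posterior = prior * likelihood / evidence
evidence = prior_no * likelihood_no + prior_yes * likelihood_yes
posterior_no = prior_no * likelihood_no / evidence
posterior_yes = prior_yes * likelihood_yes / evidence

print(f"Evidence P(Data): {evidence:.5e}")
print(f"Posterior P(No Diabetes | Data): {posterior_no:.3f}")
print(f"Posterior P(Diabetes | Data):    {posterior_yes:.3f}")
```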
