# Naive Bayes for Diabetes Prediction

In this project, we build a Naive Bayes classification model to predict whether a patient is likely to have diabetes based on several medical features such as glucose level, BMI, age, and blood pressure.

The core problem we are solving is:

> **Given a set of basic medical measurements, can we predict whether a patient is likely to have diabetes?**

This is a **binary classification problem**, where:

*   **0** → Patient does _not_ have diabetes
    
*   **1** → Patient _has_ diabetes

## Dataset

We will use the **Pima Indians Diabetes Dataset**, which is a well-known healthcare dataset originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases.

Features include:

*   Number of pregnancies
    
*   Glucose concentration
    
*   Blood pressure
    
*   Skin thickness
    
*   Insulin level
    
*   Body Mass Index (BMI)
    
*   Diabetes pedigree function
    
*   Age

## 1. Importing Libraries

In [1]:
# =========================
# Core Data Libraries
# =========================
import numpy as np
import pandas as pd

# =========================
# Data Visualization
# =========================
import matplotlib.pyplot as plt
import seaborn as sns

# =========================
# Preprocessing & Validation
# =========================
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    GridSearchCV
)

# =========================
# Machine Learning Models
# =========================
from sklearn.naive_bayes import GaussianNB

# =========================
# Evaluation Metrics
# =========================
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report
)

# =========================
# Display & Plot Settings
# =========================
pd.set_option("display.max_columns", None)
sns.set_theme(style="whitegrid")

## 2. Exploratory Data Analysis

Before training any machine learning model, we must understand our data. Exploratory Data Analysis (EDA) helps us answer questions like:

*   How big is the dataset?
    
*   What kind of data do we have?
    
*   Is the dataset balanced?
    
*   Are there missing or suspicious values?

### 2.1 Loading and Previewing Dataset

In [2]:
# Load dataset
df = pd.read_csv("diabetes.csv")

# Preview dataset
df.head()

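|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |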


In [3]:
# Data information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [4]:
# Data statistical summary
df.drop(columns=['Outcome']).describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 768.0 | 120.894531 | 31.972618 | 0.0 | 99.0 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 768.0 | 69.105469 | 19.355807 | 0.0 | 62.0 | 72.0 | 80.0 | 122.0 |
| SkinThickness | 768.0 | 20.536458 | 15.952218 | 0.0 | 0.0 | 23.0 | 32.0 | 99.0 |
| Insulin | 768.0 | 79.799479 | 115.244002 | 0.0 | 0.0 | 30.5 | 127.25 | 846.0 |
| BMI | 768.0 | 31.992578 | 7.88416 | 0.0 | 27.3 | 32.0 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
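
Two of the EDA questions raised at the start of this section, class balance and suspicious values, can be checked directly. The snippet below is a minimal sketch on a small made-up frame (`toy_df` and its values are illustrative, not rows from the real dataset); in the notebook the same two expressions would be applied to `df`.

```python
import pandas as pd

# Toy frame with made-up values; in the notebook this would be `df`.
toy_df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137, 116],
    "Insulin": [0, 0, 0, 94, 168, 0],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# Share of each class; values far from 0.5 indicate an imbalanced target.
class_share = toy_df["Outcome"].value_counts(normalize=True).sort_index()

# Number of zeros per feature; a zero glucose or insulin reading is suspicious.
zero_counts = (toy_df.drop(columns=["Outcome"]) == 0).sum()

print(class_share)
print(zero_counts)
```

On the real data, the minimum of 0 reported in the summary for Glucose, BloodPressure, SkinThickness, Insulin, and BMI is exactly the kind of value this zero count would surface.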


## 3. Data Cleaning & Preprocessing

Before training a machine learning model, we must clean and prepare the data to ensure that the model learns from valid and meaningful medical information.

### 3.1 Handling Missing Values

Although the dataset shows no explicit missing values (`NaN`), the statistical summary above reports a minimum of 0 for several features where zero is **medically invalid**, which suggests these zeros stand in for missing measurements.

Features where zero is not medically valid:

*   Glucose
    
*   BloodPressure
    
*   SkinThickness
    
*   Insulin
    
*   BMI
    

We will treat zeros in these columns as missing values.

In [5]:
columns_with_invalid_zeros = [
    "Glucose",
    "BloodPressure",
    "SkinThickness",
    "Insulin",
    "BMI"
]

df[columns_with_invalid_zeros] = df[columns_with_invalid_zeros].replace(0, np.nan)

In [6]:
# Check missing values
df.drop(columns=['Outcome']).isnull().sum().to_frame(name="Missing")

|   | Missing |
|---|---|
| Pregnancies | 0 |
| Glucose | 5 |
| BloodPressure | 35 |
| SkinThickness | 227 |
| Insulin | 374 |
| BMI | 11 |
| DiabetesPedigreeFunction | 0 |
| Age | 0 |


#### Handling Strategy

We will impute missing values using the median of each feature.

Why median?

*   Robust to outliers (common in medical data)
    
*   Preserves realistic central tendencies
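
To make the robustness argument concrete, here is a small sketch comparing the mean and the median on five hypothetical insulin readings. The values are made up for illustration; only the extreme reading of 846 echoes the maximum in the summary table.

```python
import numpy as np

# Hypothetical insulin readings with one extreme value.
insulin = np.array([94.0, 168.0, 88.0, 120.0, 846.0])

mean_value = insulin.mean()        # pulled upward by the single outlier
median_value = np.median(insulin)  # middle value of the sorted readings

print(f"mean:   {mean_value:.1f}")    # mean:   263.2
print(f"median: {median_value:.1f}")  # median: 120.0
```

A single extreme reading more than doubles the mean, while the median stays at a value typical of the remaining readings.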

In [7]:
for column in columns_with_invalid_zeros:
    df[column] = df[column].fillna(df[column].median())

### 3.2 Train/Test Split

To evaluate how well the model performs on unseen data, we split the dataset into:

*   **Training set** → used to train the model
    
*   **Test set** → used to evaluate performance

In [8]:
# Separate features from the target variable
X = df.drop("Outcome", axis=1)
y = df["Outcome"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

At this point:

*   Data is clean.
    
*   Medical values are realistic.
    
*   Features and labels are separated.
    
*   The dataset is ready for modeling.
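
One caveat worth noting: the medians in section 3.1 are computed on the full dataset before the split, so the test rows contribute to the fill values. A common variation is to learn the medians from the training rows only and reuse them for the test rows. The sketch below shows that idea on made-up frames (`train`, `test`, and their values are hypothetical); it is an alternative to consider, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up training and test rows; NaN marks values that were invalid zeros.
train = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 137.0],
    "BMI": [33.6, np.nan, 23.3, 28.1, 43.1],
})
test = pd.DataFrame({
    "Glucose": [np.nan, 116.0],
    "BMI": [25.6, np.nan],
})

# Learn the fill values from the training rows only (NaN is skipped by default).
train_medians = train.median()

# Apply the same training medians to both sets.
train_filled = train.fillna(train_medians)
test_filled = test.fillna(train_medians)

print(train_medians)
print(test_filled)
```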

## 5. Implementing Gaussian Naive Bayes (scikit-learn)

In this step, we will:

*   Initialize a Gaussian Naive Bayes model.
    
*   Train it on our cleaned healthcare data.
    
*   Use it to make predictions on unseen patients.

In [9]:
# Initialize the Gaussian Naive Bayes model
nb_model = GaussianNB()

# Train the model using the training data
nb_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = nb_model.predict(X_test)

# Compare predictions with actual outcomes
comparison_df = pd.DataFrame({
    "Actual": y_test.values,
    "Predicted": y_pred
})

comparison_df.head()

|   | Actual | Predicted |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 1 | 0 |
| 4 | 0 | 0 |
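
To show roughly what happens inside the model, the sketch below implements the Gaussian Naive Bayes decision rule in plain NumPy: estimate a prior, a mean, and a variance per class and feature, then pick the class with the highest log-posterior. The helper names (`fit_gaussian_nb`, `predict_gaussian_nb`) and the toy glucose/BMI values are illustrative assumptions. scikit-learn's `GaussianNB` follows the same idea but also adds a small variance-smoothing term, which is omitted here.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class maximizing log P(c) + sum_j log N(x_j | mean_cj, var_cj)."""
    diff = X[:, None, :] - means[None, :, :]            # shape (n, classes, features)
    log_likelihood = -0.5 * np.sum(
        np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None],
        axis=2,
    )
    log_posterior = np.log(priors)[None, :] + log_likelihood
    return classes[np.argmax(log_posterior, axis=1)]

# Made-up [glucose, BMI] rows: three non-diabetic (0) and three diabetic (1).
X_toy = np.array([[85.0, 24.0], [90.0, 26.0], [95.0, 25.0],
                  [150.0, 34.0], [160.0, 36.0], [170.0, 35.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_nb(X_toy, y_toy)
new_patients = np.array([[88.0, 25.0], [165.0, 35.0]])
predictions = predict_gaussian_nb(new_patients, *params)
print(predictions)  # [0 1]
```

The "naive" part is the sum over features: each feature contributes its own Gaussian log-likelihood independently of the others.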


## 6. Model Evaluation (Healthcare-Aware)

After training a machine learning model, we must evaluate **how well it performs on unseen data**. In healthcare, evaluation is not just about numbers; it’s about **patient safety**.

### 6.1 Accuracy

In [11]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy score:", accuracy)

Accuracy score: 0.7012987012987013


### 6.2 Confusion Matrix

In [12]:
# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[74, 26],
       [20, 34]])
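
The matrix is easier to read once its four cells are named. scikit-learn orders rows by true label and columns by predicted label, so for labels 0 and 1 the flattened order is TN, FP, FN, TP. The sketch below re-enters the matrix shown above by hand and recovers the accuracy from it.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[74, 26],
               [20, 34]])

# Rows are true labels, columns are predicted labels.
tn, fp, fn, tp = cm.ravel()

print(f"True negatives  (healthy, predicted healthy):   {tn}")
print(f"False positives (healthy, predicted diabetic):  {fp}")
print(f"False negatives (diabetic, predicted healthy):  {fn}")
print(f"True positives  (diabetic, predicted diabetic): {tp}")

# Accuracy is the share of correct predictions among all test patients.
accuracy = (tp + tn) / cm.sum()
print(f"Accuracy: {accuracy:.4f}")  # Accuracy: 0.7013
```

The 20 false negatives are the cell to watch in a screening setting: these are diabetic patients the model would have cleared.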

### 6.3 Precision, Recall, and F1-Score

**Precision**

*   Of all patients predicted to have diabetes, how many actually do?
    
*   Important when false alarms are costly.
    

**Recall (Sensitivity)**

*   Of all patients who actually have diabetes, how many did we correctly identify?
    
*   Measures how well we catch the disease.
    

**F1-score**

*   Balance between precision and recall.
    
*   Useful when both false positives and false negatives matter.
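
These three definitions reduce to simple ratios of confusion-matrix counts. As a small worked example, the sketch below plugs in the class 1 counts from section 6.2 (34 true positives, 26 false positives, 20 false negatives); the helper name `precision_recall_f1` is only illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp)   # of predicted positives, how many are real
    recall = tp / (tp + fn)      # of real positives, how many were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Counts for class 1 (has diabetes) taken from the confusion matrix in 6.2.
precision, recall, f1 = precision_recall_f1(tp=34, fp=26, fn=20)

print(f"precision: {precision:.2f}")  # precision: 0.57
print(f"recall:    {recall:.2f}")     # recall:    0.63
print(f"f1-score:  {f1:.2f}")         # f1-score:  0.60
```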

In [13]:
# Detailed classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.74      0.76       100
           1       0.57      0.63      0.60        54

    accuracy                           0.70       154
   macro avg       0.68      0.68      0.68       154
weighted avg       0.71      0.70      0.70       154



Key observations:

*   The model achieves an overall **accuracy of 70%**, indicating moderate performance.
    
*   Class **0** is predicted more reliably, with higher **precision (0.79)** and **F1-score (0.76)**.
    
*   Performance on class **1** is weaker: **precision (0.57)** means that over 40% of positive predictions are false alarms, and **recall (0.63)** means that 20 of the 54 diabetic patients in the test set were missed.
    
*   The **macro average (0.68)** weights both classes equally and reflects modest performance overall, while the **weighted average (about 0.70)** is pulled toward class 0, which accounts for 100 of the 154 test samples.

## Conclusion

In this notebook, we walked through an end-to-end healthcare machine learning workflow using **Naive Bayes**, starting from data understanding and preprocessing, through model training, evaluation, and interpretation. The goal was not just to build a model, but to **think critically about model behavior in a healthcare context**, where mistakes—especially false negatives—can have serious implications.

This notebook is designed to be **educational and transparent**, making each step easy to follow for beginners while still reflecting real-world ML best practices. The results show that Naive Bayes provides a solid baseline, but also highlight areas where more advanced models and improved preprocessing could lead to better clinical performance.

**For a detailed explanation, insights, and visual walkthrough of this project, read the full blog article here:**
[Naive Bayes For Diabetes Prediction In Healthcare](https://erickhangati.com/naive-bayes-for-diabetes-prediction-in-healthcare/)