## Naive Bayes Classifier for Heart Disease Prediction

This notebook demonstrates a simple implementation of a Gaussian Naive Bayes classifier to predict the likelihood of heart disease based on three continuous features: Cholesterol, Maximum Heart Rate (MaxHR), and Resting Blood Pressure (RestingBP).

### 1. Import Necessary Libraries
We'll start by importing the libraries required for data manipulation and mathematical calculations.

In [1]:
import pandas as pd
import numpy as np
import math

### 2. Helper Function: Gaussian Probability Density

Since our features are continuous, we assume they follow a Gaussian distribution. We need a function to calculate the probability density of a given data point `x` for a class with a specific mean (`mu`) and standard deviation (`std`).

The formula for the Gaussian Probability Density Function (PDF) is:
$$ P(x | \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

In [2]:
def gaussian_prob(x, mu, std):
    """
    Returns the Gaussian probability density at x for a given mean (mu)
    and standard deviation (std). This is a density, not a probability,
    so values greater than 1 are possible when std is small.
    """
    # The normalizing constant is 1 / (sqrt(2 * pi) * std). The sqrt(2 * pi) factor
    # is identical for every class and cancels when class scores are compared, but
    # the std factor differs per class and must be kept.
    exponent = math.exp(-((x - mu) ** 2) / (2 * (std ** 2)))
    return (1 / (math.sqrt(2 * math.pi) * std)) * exponent
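
Because the classifier ultimately works with log-probabilities, the same density can also be evaluated directly in log space, which avoids calling `math.exp` on a large negative exponent. The following is a minimal sketch of that idea; `gaussian_log_prob` is a hypothetical helper that does not appear in the original notebook, and it assumes `std > 0`.

```python
import math

def gaussian_log_prob(x, mu, std):
    """Log of the Gaussian density at x (hypothetical helper, assumes std > 0)."""
    # log(1 / (sqrt(2*pi) * std)) = -0.5 * log(2*pi) - log(std)
    log_norm = -0.5 * math.log(2 * math.pi) - math.log(std)
    return log_norm - ((x - mu) ** 2) / (2 * std ** 2)

# Sanity check: exponentiating the log density at the mean of a standard
# normal should reproduce the direct formula 1 / sqrt(2*pi).
direct = 1 / (math.sqrt(2 * math.pi) * 1.0)
print(math.isclose(math.exp(gaussian_log_prob(0.0, 0.0, 1.0)), direct))  # True
```

For a point far from the mean, the direct density can round to zero while the log density stays finite, which is the main reason one might prefer this form.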

### 3. The Bayesian Classifier Function

This is the core of our classifier. It uses Bayes' Theorem to find the probability of a class (Heart Disease or No Heart Disease) given the input data.

**Bayes' Theorem:**
$$ P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)} $$

Where:
- $P(C|X)$ is the **posterior probability**: the probability of class `C` given the data `X`.
- $P(X|C)$ is the **likelihood**: the probability of data `X` given the class `C`.
- $P(C)$ is the **prior probability**: the overall probability of class `C`.
- $P(X)$ is the **evidence**: the overall probability of the data `X`.

Since $P(X)$ is the same for all classes, we can ignore it and compare $P(X|C) \cdot P(C)$ for each class.
Multiplying many small probabilities can underflow to zero, so we work with logarithms, which turn the product into a sum:
$$ \log P(C|X) = \log P(X|C) + \log P(C) - \log P(X) $$
The term $\log P(X)$ is the same constant for every class, so comparing $\log P(X|C) + \log P(C)$ across classes is enough.
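
The underflow problem is easy to reproduce. The sketch below uses made-up numbers (400 densities of 0.01 each, far more factors than the three features used in this notebook) purely to show the effect: the running product collapses to zero, while the sum of logs remains an ordinary finite number.

```python
import math

# 400 hypothetical per-feature densities, each equal to 0.01.
probs = [0.01] * 400

product = 1.0
for p in probs:
    product *= p  # the true value, 1e-800, is below the float64 range

log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # a finite negative number, roughly -1842
```

With only three features underflow is unlikely in practice, but the log form costs nothing and scales to larger feature sets.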

In [3]:
def bayesian_classifier(heart_disease, collected_data, given_data):
    """
    Scores a data point against each class using Gaussian Naive Bayes.

    Args:
        heart_disease (list): A list of all target labels (0s and 1s).
        collected_data (list): A nested list of feature data separated by class,
            indexed as collected_data[feature_index][class_index].
        given_data (list): The new data point to classify [cholesterol, max_hr, resting_bp].

    Returns:
        list: The unnormalized log posterior score for each class, [class 0, class 1].
    """
    
    # Calculate the prior probability for each class (0 and 1).
    # P(class) = (Number of samples in class) / (Total number of samples)
    prior = [heart_disease.count(0) / len(heart_disease),  # Prior for Class 0 (No Disease)
             heart_disease.count(1) / len(heart_disease)]  # Prior for Class 1 (Disease)
    
    # Take the log of the priors to simplify future calculations (addition instead of multiplication).
    log_prior = np.log(prior)

    # The "naive" assumption: features are independent of each other given the class,
    # so P(X|C) = P(feature1|C) * P(feature2|C) * P(feature3|C).
    # Summing the per-feature logs, instead of multiplying the densities first and
    # taking the log afterwards, is what actually guards against underflow.
    # collected_data[f][c] holds the values of feature f (0: Cholesterol, 1: MaxHR,
    # 2: RestingBP) observed for class c (0: No Disease, 1: Disease).
    log_likelihoods = [
        sum(
            np.log(gaussian_prob(given_data[f],
                                 np.mean(collected_data[f][c]),
                                 np.std(collected_data[f][c])))
            for f in range(len(given_data))
        )
        for c in (0, 1)
    ]

    # Calculate the unnormalized log posterior for each class:
    # log(Posterior) = log(Likelihood) + log(Prior) - log(Evidence), and the evidence
    # term is dropped because it is the same for both classes.
    result = [(log_likelihoods[0] + log_prior[0]), 
              (log_likelihoods[1] + log_prior[1])]
    
    return result
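
One edge case the function above does not handle: if every value of a feature within one class is identical, its standard deviation is 0 and the Gaussian density is undefined (the formula divides by zero). A common workaround is to put a small floor under the standard deviation. The sketch below illustrates the idea with a hypothetical helper, `safe_std`; the floor value `1e-9` is an arbitrary assumption, not something taken from the notebook.

```python
import statistics

def safe_std(values, floor=1e-9):
    """Population standard deviation with a lower bound (hypothetical helper)."""
    # statistics.pstdev divides by N, the same convention as np.std's default.
    return max(statistics.pstdev(values), floor)

print(safe_std([120, 120, 120]))  # constant feature: the floor is returned
print(safe_std([1, 3]))           # ordinary case: the floor has no effect
```

Whether a floor is appropriate, and how large it should be, depends on the scale of the features, so treat this as a starting point rather than a fixed rule.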

### 4. Data Loading and Preparation

Now, we load the `heart.csv` dataset. Then, we process and structure it so that our classifier can use it. We need to separate the features based on the target class (`HeartDisease` = 0 or 1).

In [4]:
# IMPORTANT: Replace this path with the actual path to your 'heart.csv' file.
path = r"heart.csv"
file = pd.read_csv(path)

# Display the first few rows of the dataframe to understand its structure.
print("Dataset Preview:")
display(file.head())

# Extract the target variable (HeartDisease) into a list.
heart_disease = list(file["HeartDisease"])

# Extract the feature variables into separate lists.
cholesterol_data = list(file["Cholesterol"])
max_hr_data = list(file["MaxHR"])
resting_bp_data = list(file["RestingBP"])

# Prepare nested lists to hold feature data separated by class.
# Format: feature[class_index]
cholesterol = [[], []]  # [[class 0 values], [class 1 values]]
max_hr = [[], []]       # [[class 0 values], [class 1 values]]
resting_bp = [[], []]  # [[class 0 values], [class 1 values]]

# Iterate through the entire dataset to populate the class-separated lists.
for i in range(len(heart_disease)):
    if heart_disease[i] == 0:  # If the class is 'No Heart Disease'
        cholesterol[0].append(cholesterol_data[i])
        max_hr[0].append(max_hr_data[i])
        resting_bp[0].append(resting_bp_data[i])
    else:  # If the class is 'Heart Disease'
        cholesterol[1].append(cholesterol_data[i])
        max_hr[1].append(max_hr_data[i])
        resting_bp[1].append(resting_bp_data[i])

# Combine all the structured feature data into a single list.
# Format: collected_data[feature_index][class_index]
collected_data = [cholesterol, max_hr, resting_bp]
print("\nData preparation complete.")

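Dataset Preview:

| Unnamed: 0 | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |

Data preparation complete.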
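
To make the per-class statistics concrete, the sketch below computes the mean and standard deviation of each feature for each class, using only the five preview rows shown above rather than the full dataset. The names `rows` and `stats` are illustrative, and the result is indexed as `stats[class][feature]`, which is the reverse of the `collected_data[feature][class]` layout used in the notebook.

```python
import statistics

# The five preview rows, reduced to (Cholesterol, MaxHR, RestingBP, HeartDisease).
rows = [
    (289, 172, 140, 0),
    (180, 156, 160, 1),
    (283, 98, 130, 0),
    (214, 108, 138, 1),
    (195, 122, 150, 0),
]

# stats[class_label] -> list of (mean, std) pairs, one pair per feature.
stats = {}
for label in (0, 1):
    class_rows = [r[:3] for r in rows if r[3] == label]
    columns = list(zip(*class_rows))  # transpose: one tuple per feature
    # pstdev is the population standard deviation (divides by N),
    # matching the default behaviour of np.std.
    stats[label] = [(statistics.fmean(c), statistics.pstdev(c)) for c in columns]

for label, feature_stats in stats.items():
    print(label, feature_stats)
```

Five rows are far too few to estimate anything reliable; the point is only to show which numbers the classifier derives from each class before it evaluates the Gaussian density.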


### 5. Making a Prediction

With the data prepared and the classifier function defined, we can now provide a new data point and predict whether it corresponds to the presence of heart disease.


In [5]:
# Define the new data point for which we want to make a prediction.
given_cholesterol = 248
given_max_hr = 125
given_resting_bp = 100

# Structure the given data into a list.
given_data = [given_cholesterol, given_max_hr, given_resting_bp]

print(f"Given Data:\nCholesterol: {given_data[0]}\nMaxHR: {given_data[1]}\nRestingBP: {given_data[2]}\n")

# Call the Bayesian classifier to get the log posterior scores for each class.
result = bayesian_classifier(heart_disease, collected_data, given_data)

# The class with the higher (unnormalized) log posterior score is our prediction.
# result[0] -> Score for Class 0 (No Heart Disease)
# result[1] -> Score for Class 1 (Heart Disease)
if result[0] > result[1]:
    print("Prediction -> Heart Disease: No")
else:
    print("Prediction -> Heart Disease: Yes")

Given Data:
Cholesterol: 248
MaxHR: 125
RestingBP: 100

Prediction -> Heart Disease: Yes
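
The two values in `result` are unnormalized log scores, so they say which class wins but not by how much. If a probability is wanted, the scores can be normalized so that they sum to 1. The sketch below shows one way to do this; `normalize_log_scores` is a hypothetical helper, and the example scores are made up rather than taken from the run above.

```python
import math

def normalize_log_scores(log_scores):
    """Turn unnormalized log posteriors into probabilities that sum to 1."""
    # Subtract the maximum first (the log-sum-exp trick) so that exp() cannot
    # overflow and the largest term is exactly exp(0) = 1.
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical scores for [class 0, class 1], not the notebook's actual output.
example_scores = [-14.2, -13.1]
probs = normalize_log_scores(example_scores)
print(probs)
```

These probabilities inherit the independence and Gaussian assumptions of the model, so they are best read as relative confidence rather than calibrated risk.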
