### Gaussian Naive Bayes Classification Implementation

This notebook contains an implementation of the Gaussian Naive Bayes (GNB) classifier for continuous data - the goal is to classify benign and malignant breast tumors. In this lab, we apply GNB to classify data points based on their feature distributions, modeled as Gaussians conditioned on the class labels. Here’s an overview of the approach:

1. **Parameter Estimation**: We start by estimating the mean and variance for each feature conditioned on the class label using a labeled training dataset.
2. **MAP Class Prediction**: The notebook then uses these parameters to compute the posterior probability of each class for a new input. We apply the Maximum a Posteriori (MAP) decision rule, choosing the class with the highest posterior probability.
3. **Log-Probability for Numerical Stability**: To avoid numerical underflow, we calculate the log of the probabilities rather than the raw probabilities, improving stability when working with small probability values.
4. **Performance Evaluation**: The notebook evaluates model performance using precision, recall, and accuracy metrics. Precision measures the proportion of true positive predictions among all positive predictions, recall measures the proportion of true positives identified among actual positives, and accuracy provides the overall correctness of predictions. These metrics offer insights into the model's effectiveness in distinguishing between classes.


This implementation demonstrates the use of Gaussian distributions within the Naive Bayes framework, making it suitable for continuous predictor variables. It showcases essential ML concepts, such as parameter estimation, probabilistic modeling, and handling numerical stability through log-probability calculations.



In [1]:
#import libraries
import numpy as np
import pandas as pd
import scipy.stats as stats

In [2]:
#import dataset and make necessary changes
names = ['id','thick','size_unif','shape_unif','marg','cell_size','bare',
         'chrom','normal','mit','class']
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/' +
                 'breast-cancer-wisconsin/breast-cancer-wisconsin.data',
                names=names,na_values='?',header=None)
df = df.dropna()

# Get the response.  Convert to a zero-one indicator
yraw = np.array(df['class'])
BEN_VAL = 2   # value in the 'class' label for benign samples
MAL_VAL = 4   # value in the 'class' label for malignant samples
y = (yraw == MAL_VAL).astype(int) # now y has values of 0,1
Iben = (y==0)
Imal = (y==1)

In [3]:
#convert the features to a numpy array and get the number of instances and features
xnames = names[1:-1]
Xfull = np.array(df[xnames])
n = Xfull.shape[0]
d = Xfull.shape[1]

Calculate the mean and variance of each feature for benign and malignant samples

In [4]:
p = np.average(y)

sumBen = sum(Iben)
sumMal = sum(Imal)
mu0 = []
mu1 = []
sig0 = []
sig1 = []
for i in range(d):
    mu0Curr = sum(Xfull[:,i] * Iben)/sumBen
    mu1Curr = sum(Xfull[:,i] * Imal)/sumMal
    
    sig0Curr = sum((Xfull[:,i] * Iben - mu0Curr)**2)/sumBen
    sig1Curr = sum((Xfull[:,i] * Imal - mu1Curr)**2)/sumMal
    
    mu0.append(mu0Curr)
    mu1.append(mu1Curr)
    sig0.append(sig0Curr)
    sig1.append(sig1Curr)

Calculate the MLE, MAP, and log-probabilities for each class

In [5]:
#Calculate MLE with Bayes rule 
P0_MLE = []
P1_MLE = []
for i in range(n):
    pCurr0 = 1
    pCurr1 = 1
    
    for j in range(d):
        pCurr0 *= np.log((1/np.sqrt(2*np.pi*sig0[j])))*np.log(np.exp(-(Xfull[i,j]-mu0[j])**2/(2*sig0[j])))
        pCurr1 *= np.log((1/np.sqrt(2*np.pi*sig1[j])))*np.log(np.exp(-(Xfull[i,j]-mu1[j])**2/(2*sig1[j])))
    
    P0_MLE.append(pCurr0)   
    P1_MLE.append(pCurr1)

#Use MLE to calculate MAP and predict the class for each instance
yhat = []
for i in range(n):
    P0_MAP = P0_MLE[i] * np.log(1-p)
    P1_MAP = P1_MLE[i] * np.log(p)
    if P1_MAP >= P0_MAP:
        yhat.append(1)
    else:
        yhat.append(0)

yhat = np.array(yhat) 

**NOT MY CODE BELOW**: provided code for checking the accuracy, precision, and recall of my model. 

Note the recall is higher than accuracy and precision, which is desirable in the context of the model's goal to detect breast cancer, as a false negative is significantly more costly than a false positive.

In [6]:
acc = np.mean(yhat == y)
print("Accuracy on training data = %f" % acc)
recall = np.sum((yhat == 1)*(y == 1))/np.sum(y == 1)
precision = np.sum((yhat == 1)*(y == 1))/np.sum(yhat == 1)
print("Recall: " + str(recall))
print("Precision: " + str(precision))

Accuracy on training data = 0.975110
Recall: 0.9874476987447699
Precision: 0.944


**NOTE**: Ideally, we would test the recall, precision, and accuracy using test data. As this was only one part of a larger homework problem, we were not instructed to use test data as the main focus of the question was the implementation of the Gaussian Naive Bayes classifier.