# Homework 5 - Naive Bayes

Make sure you have downloaded "heart\_processed.csv" and included it in your folder/directory.

This homework asks you to implement naive Bayes using a custom likelihood and then compare it against sklearn's Gaussian naive Bayes.

The execution is slightly different from lecture and section. It is more streamlined to take advantage of vector multiplications and numpy functions, which helps if we want to scale up our naive Bayes prediction to higher dimensions. So, make sure you understand what naive Bayes is doing in the lecture and section notebooks.

## 0 Data
We load `heart_processed.csv`, which contains log-transformed predictors from the [Heart Failure Clinical Records Dataset](https://archive.ics.uci.edu/ml/datasets/Heart%2Bfailure%2Bclinical%2Brecords) for predicting `DEATH_EVENT`.

In [1]:
# ======= DO NOT CHANGE CODE =======
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dataset = pd.read_csv("heart_processed.csv", index_col=0)
X = dataset.drop("DEATH_EVENT", axis=1).values
# X = np.log(X)
y = dataset["DEATH_EVENT"].values

# split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# print the shapes of the training and testing sets
print('train shapes:')
print('\t X_train ->', X_train.shape)
print('\t y_train ->', y_train.shape)

print('test shapes:')
print('\t X_test ->', X_test.shape)
print('\t y_test ->', y_test.shape)

display(dataset)
# ==================================

train shapes:
	 X_train -> (239, 6)
	 y_train -> (239,)
test shapes:
	 X_test -> (60, 6)
	 y_test -> (60,)


| | age | creatinine_phosphokinase | ejection_fraction | platelets | serum_creatinine | serum_sodium | DEATH_EVENT |
|---|---|---|---|---|---|---|---|
| 0 | 4.317488 | 6.366470 | 2.995732 | 12.487485 | 0.641854 | 4.867534 | 1 |
| 1 | 4.007333 | 8.969669 | 3.637586 | 12.481270 | 0.095310 | 4.912655 | 1 |
| 2 | 4.174387 | 4.983607 | 2.995732 | 11.995352 | 0.262364 | 4.859812 | 1 |
| 3 | 3.912023 | 4.709530 | 2.995732 | 12.254863 | 0.641854 | 4.919981 | 1 |
| 4 | 4.174387 | 5.075174 | 2.995732 | 12.697715 | 0.993252 | 4.753590 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 294 | 4.127134 | 4.110874 | 3.637586 | 11.951180 | 0.095310 | 4.962845 | 0 |
| 295 | 4.007333 | 7.506592 | 3.637586 | 12.506177 | 0.182322 | 4.934474 | 0 |
| 296 | 3.806662 | 7.630461 | 4.094345 | 13.517105 | -0.223144 | 4.927254 | 0 |
| 297 | 3.806662 | 7.788626 | 3.637586 | 11.849398 | 0.336472 | 4.941642 | 0 |


## 1 Custom Naive Bayes Classifier with KDE
You will write a naive Bayes classifier, using KDE to approximate the likelihood. 

**Use only the training data ```X_train, y_train``` to fit the classification model.**

Recall: Bayesian analysis involves a prior, likelihood, and posterior. Your task is to complete the following code and functions in order to execute naive Bayes classification.

### 1.1 Prior
1. [2 pt] Compute ```prior```, a two element array. 
    - prior[0] is the probability of death event 0, 
    - prior[1] is the probability of death event 1. 
    - Infer the prior probabilities from the training data. You do not need to complete the (Tip: Use np.unique() with return_counts)
2. [1 pt] Print ```prior```.

In [2]:
classes = np.unique(y_train)
prior = np.array([0.0, 0.0])
for k in classes:
    members = (y_train == k)
    num = members.sum()
    prior[k] = num / y_train.size
print('The prior probabilities are:', prior)

The prior probabilities are: [0.70292887 0.29707113]
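
The tip about `np.unique()` with `return_counts` can be sketched as follows. This is a minimal illustration on hypothetical toy labels (`y_toy` is made up for the example and is not the homework data); the prior is just each class count divided by the number of labels.

```python
import numpy as np

# Hypothetical toy labels, not the homework data
y_toy = np.array([0, 0, 1, 0, 1, 0, 0, 1])

# np.unique returns the sorted class labels and how often each one occurs
classes_toy, counts_toy = np.unique(y_toy, return_counts=True)

# Empirical prior: class frequency in the sample
prior_toy = counts_toy / y_toy.size

print(classes_toy, prior_toy)
```

Because the labels come back sorted, `prior_toy[k]` lines up with class `k` when the classes are `0` and `1`, which matches the indexing asked for above.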


### 1.2 Likelihood (KDE)
1. [2 pt] Using `scipy.stats.gaussian_kde` with default bandwidth, define objects ```kde0``` and ```kde1```, which correspond to likelihoods when the death event is 0 and 1 respectively. 
    - Remember to use only training data.
    - When using ```gaussian_kde``` to define ```kde0``` and ```kde1```, make sure you index the correct rows of ```X_train```. 


In [3]:
from scipy.stats import gaussian_kde
kde0 = gaussian_kde(X_train[y_train == 0].T)
kde1 = gaussian_kde(X_train[y_train == 1].T)

2. [2 pt] Complete the code for ```compute_likelihood``` function.
    - The objects kde0 and kde1 have a method kde.pdf() that you will use when computing the likelihood.
    - Make sure that you read the documentation for kde.pdf() for what inputs should go in.

In [4]:
def compute_likelihood(x, kde0, kde1):
    # input:    x, a (# data) by (# features) array
    #           kde0 and kde1, kde objects that will be used to compute the likelihood
    # output:   likelihood, a (# data) by 2 array
    # gaussian_kde evaluates points given as (# features) by (# data), so transpose x
    likelihood0 = kde0.pdf(x.T)
    likelihood1 = kde1.pdf(x.T)
    likelihood = np.column_stack((likelihood0, likelihood1))
    return likelihood
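
The input convention of `kde.pdf()` is the part that is easy to get wrong: `gaussian_kde` is fit on, and evaluated at, arrays of shape (# features) by (# points), which is the transpose of the (# data) by (# features) layout used for `X_train`. A small sketch on synthetic data (`X_demo` and `kde_demo` are hypothetical names for this example) checks that evaluating row by row and evaluating all rows in one call give the same densities.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))    # synthetic: 50 samples, 3 features

kde_demo = gaussian_kde(X_demo.T)    # fit expects (# features, # samples)

# One call on the transposed array returns one density per sample
batched = kde_demo.pdf(X_demo.T)

# Equivalent, but slower: evaluate one sample at a time
looped = np.array([kde_demo.pdf(row)[0] for row in X_demo])

print(batched.shape, np.allclose(batched, looped))
```

The batched call is the one that scales better as the number of samples grows.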

### 1.3 Posterior
1. [2 pt] Complete the code for ```compute_posterior``` function. 
    - It should include calling the function ```compute_likelihood```.

In [5]:
def compute_posterior(x, prior, kde0, kde1):
    # input:    x, a (# data) by (# features) array
    #           prior, a 1 by 2 array
    #           kde0 and kde1, kde objects that will be used to compute the likelihood
    # output:   posterior, a (# data) by 2 array
    likelihood = compute_likelihood(x, kde0, kde1)
    # unnormalized posterior: the evidence term is omitted because it does not change the argmax
    posterior = likelihood * prior
    return posterior
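
Multiplying the likelihood by the prior gives a posterior only up to a constant per row (the evidence). A minimal sketch with made-up numbers (`prior_demo` and `likelihood_demo` are hypothetical) shows how the rows could be normalized into probabilities, and that the predicted class is the same either way.

```python
import numpy as np

prior_demo = np.array([0.7, 0.3])              # made-up class priors
likelihood_demo = np.array([[0.20, 0.10],      # made-up likelihoods:
                            [0.05, 0.40]])     # one row per sample, one column per class

# Broadcasting multiplies every row by the prior
unnormalized = likelihood_demo * prior_demo

# Dividing each row by its sum (the evidence) gives class probabilities
normalized = unnormalized / unnormalized.sum(axis=1, keepdims=True)

print(normalized)
print(np.argmax(unnormalized, axis=1), np.argmax(normalized, axis=1))
```

Since each row is divided by a positive constant, the index of the largest entry does not change, so normalization can be skipped when only the predicted label is needed.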

### 1.4 Combine prior, likelihood, posterior
Now we are ready to piece together the code we prepared above for the prior, likelihood, and posterior.
1. [2 pt] Complete the code for ```naive_bayes_predict```.
    - Your code should include calling the ```compute_posterior``` function.
    - Computing y_pred should take a single line of code. Consider using a numpy function that finds the index of the largest entry in every row.
2. [2 pt] Complete the code for ```print_success_rates```.

In [6]:
def naive_bayes_predict(x, prior, kde0, kde1):
    # input:    x, a (# data) by (# features) array
    #           prior, a 1 by 2 array
    #           kde0 and kde1, kde objects that will be used to compute the likelihood
    # output:   y_pred, an array of length (# data) with the predicted class labels
    posterior = compute_posterior(x, prior, kde0, kde1)
    y_pred = np.argmax(posterior, axis=1)
    return y_pred

def print_success_rates(y_true,y_pred):
    n_success = np.sum(y_true == y_pred)
    n = len(y_true)
    print("Number of correctly labeled points: %d out of %d. Accuracy: %.2f" 
        % (n_success, n, n_success/n))

### 1.5 Predict
1. Use your custom naive Bayes to:
    - [1 pt] predict *TRAINING* data with ```naive_bayes_predict```.
    - [1 pt] print the results with ```print_success_rates```.

In [7]:
# TODO predict training data and print
y_train_pred = naive_bayes_predict(X_train, prior, kde0, kde1)
print_success_rates(y_train, y_train_pred)

Number of correctly labeled points: 215 out of 239. Accuracy: 0.90


2. Use your custom naive Bayes to:
    - [1 pt] predict *TEST* data with ```naive_bayes_predict```.
    - [1 pt] print the results with ```print_success_rates```.

In [8]:
# TODO predict test data and print
y_test_pred = naive_bayes_predict(X_test, prior, kde0, kde1)
print_success_rates(y_test, y_test_pred)

Number of correctly labeled points: 37 out of 60. Accuracy: 0.62


## 2 sklearn Gaussian naive Bayes
Let's compare our custom naive Bayes with KDE to the sklearn Gaussian naive Bayes.

### 2.1 Train
1. [1 pt] Fit ```gnb``` using training data.

In [9]:
# run sklearn's version - read up on differences if interested
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
model = gnb.fit(X_train, y_train)
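
For comparison with the KDE likelihood, here is a rough numpy sketch of the idea behind Gaussian naive Bayes: one mean and one variance per class and feature, with the per-feature log-densities summed. It is a simplified illustration on synthetic data, not sklearn's implementation (which, among other details, adds a small variance-smoothing term); the function names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    classes = np.unique(y)
    priors = np.array([(y == k).mean() for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    variances = np.array([X[y == k].var(axis=0) for k in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    # Difference of every sample to every class mean: shape (n, k, d)
    diff = X[:, None, :] - means[None, :, :]
    # Sum of per-feature Gaussian log-densities (the "naive" independence step)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None]).sum(axis=2)
    # Add the log prior and pick the most probable class for each sample
    return classes[np.argmax(log_lik + np.log(priors), axis=1)]

# Synthetic two-class data, well separated, for illustration only
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                    rng.normal(3.0, 1.0, size=(100, 2))])
y_demo = np.repeat([0, 1], 100)

params = fit_gaussian_nb(X_demo, y_demo)
y_hat = predict_gaussian_nb(X_demo, *params)
print("training accuracy:", (y_hat == y_demo).mean())
```

Working with log-densities avoids multiplying many small numbers together, which matters more as the number of features grows.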

### 2.2 Predict
1. Use sklearn naive Bayes to:
    - [1 pt] predict *TRAINING* data.
    - [1 pt] print the results with ```print_success_rates```.

In [10]:
# TODO predict training data and print
y_train_pred = model.predict(X_train)
print_success_rates(y_train, y_train_pred)

Number of correctly labeled points: 191 out of 239. Accuracy: 0.80


2. Use sklearn naive Bayes to:
    - [1 pt] predict *TEST* data.
    - [1 pt] print the results with ```print_success_rates```.

In [11]:
# TODO predict test data and print
y_test_pred = model.predict(X_test)
print_success_rates(y_test, y_test_pred)

Number of correctly labeled points: 40 out of 60. Accuracy: 0.67


## 3 Discussion of results
1. [2 pt] How does the accuracy of predicting *TRAINING* data differ between your custom naive Bayes with KDE and sklearn's Gaussian naive Bayes? Give an explanation for why it might be so.
    
    **Ans:** The custom naive Bayes with KDE achieves higher training accuracy (0.90) than sklearn's Gaussian naive Bayes (0.80). A likely reason is that the training data are not necessarily Gaussian within each class. KDE is a more flexible, nonparametric estimate: it places a kernel on every training point, so it can follow the shape of the training distribution closely, whereas the Gaussian model is restricted to a mean and a variance per feature and class. The KDE-based model therefore fits the training data better.

2. [2 pt] How does the accuracy of predicting *TEST* data differ between your custom naive Bayes with KDE and sklearn's Gaussian naive Bayes? Give an explanation for why it might be so.

    **Ans:** The custom naive Bayes with KDE has lower test accuracy (0.62) than sklearn's Gaussian naive Bayes (0.67). This is likely caused by overfitting: the KDE model follows the training set, including its noise, so closely that it does not handle unseen data with the same accuracy, as the drop from 0.90 on training data to 0.62 on test data suggests. sklearn's Gaussian model has far fewer parameters, so it is less prone to overfitting, and its accuracy drops less (0.80 to 0.67). In addition, the likelihoods come solely from the KDE estimate; if that estimate is poor away from the training points, a lower test accuracy is not surprising.
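
The overfitting argument can be illustrated on synthetic data, separately from the heart dataset. The sketch below assumes standard normal samples and compares the default bandwidth with a much narrower one (the value `0.05` is chosen only for illustration); `fit_gap` is a hypothetical helper that measures how much better the KDE scores its own training points than fresh points from the same distribution.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
train = rng.normal(size=(2, 100))       # synthetic, shape (# features, # points)
held_out = rng.normal(size=(2, 100))    # fresh draws from the same distribution

def fit_gap(bw_method):
    """Mean log-density on training points minus mean log-density on held-out points."""
    kde = gaussian_kde(train, bw_method=bw_method)
    return kde.logpdf(train).mean() - kde.logpdf(held_out).mean()

gap_default = fit_gap(None)    # default bandwidth (Scott's rule)
gap_narrow = fit_gap(0.05)     # much narrower kernels, for illustration only

print(gap_default, gap_narrow)
```

A larger gap means the density estimate favors the points it was fit on over new points, which is the same pattern as a high training accuracy paired with a lower test accuracy.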