# Project 1 Kaggle Competition - Pima Indians Diabetes Dataset

Dhara Patel <br>
300146860 <br>
CSI 4106 - Fall 2022 <br>
Group: 54

### Understanding the dataset:
This dataset is taken from the Kaggle Competition Pima Indians Diabetes Dataset published by UCI MACHINE LEARNING. It is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. There were several constraints on the selection of these instances from a larger database. In particular, all patients in this dataset are females who are at least 21 years old and of Pima Indian heritage.

Seeing the rise in Diabetes diagonsis among many of my family members, I chose this dataset as I am really interested in knowing what is the probability of someone getting a diabetes. Although, this dataset is only of the Pima Indians, I am hoping to use these models to predict if someone has a diabetes with more data whenever it is published.

This dataset is fairly small compared to other datasets, with less than 1000 datapoints. This dataset is fairly clean with not many values missing, making it a good choice for learning. It is a binary classification since the output column of the dataset is either 0 or 1, indicating 'No diabetes' or 'Diabetes', respectively. This is because the dataset predicts the outcome of the patient either having diabetes or not. 
When analyzing the data, one can see that there is no null value present in the dataset. However, there are some missing datapoints, which are set to 0. This could skew the results and training of the model. For instance, point at index 7 in the dataset is missing BloodPressure, SkinThickness and Insulin attributes, and hence are set to 0. There are 768 points. 


In [99]:
from enum import unique
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, classification_report
from sklearn.model_selection import KFold, cross_val_score, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt


dataFrame = pd.read_csv("/Users/dhara/Documents/uOttawa Fall sem 10/CSI 4106/Project1/diabetes.csv")
dataFrame.head(10)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


In [100]:
! pip install matplotlib



In [101]:
!pip install tabulate



In [102]:
dataFrame.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


#### Brainstorming the attributes:
It seems that all the necessary features are present for diagnosis of diabetes. However, there could be other symptoms added to the list of attributes in the future such as "sensation", since most patients with diabetes also feel there is a "tingling sensation" on some part of their body, which is related to diabetes. This could help with the prediction for a more accurate prediction. All the attributes in the dataset are useful, so none of them should be discarded when doing the training.

#### Encoding the features:
Since I used Gaussian Naive Bayes, it assumes that the data is continuous, the data was left as it is since it is continuous data. As for Logistic Regression, since it works best with continuous data, there was not a need to encode the data to discrete attributes. In addition, Multi-Layer Perceptron works with continuous data as well.

### Preparing the data for 4-fold cross-validation
Here, variable x contains the data of all the attributes and y contains the data for the output, which is labelled "Outcome" in the dataset.

In [103]:
x = dataFrame.drop("Outcome", axis=1)
# predictions
y = dataFrame["Outcome"]

kf = KFold(n_splits=4, shuffle=True, random_state=4)
report_ListGB = [] # 2D array to store accuracy, precision, recall for Naive Bayes
report_ListLR = [] # For logistic regression
report_ListMLP = [] # For MLP

#### Step 5-7: 4-fold cross-validation using cross_validate function for GB 

In [104]:
scoring = ['accuracy', 'precision', 'recall']
gasNB = GaussianNB()
scores = cross_validate(gasNB, x, y, scoring=scoring, cv=4, return_estimator=True, return_train_score=True)
print("Test accuracy of each cross-validation section:", scores['test_accuracy'])
print("Precision", scores['test_precision'])
print("Recall", scores['test_recall'])
gsNB = gasNB.fit(x, y)
y_pred = gsNB.predict(x)

Test accuracy of each cross-validation section: [0.77083333 0.71354167 0.75520833 0.765625  ]
Precision [0.69491525 0.6        0.66666667 0.67741935]
Recall [0.6119403  0.53731343 0.59701493 0.62686567]


### Step 5: Training 3 models using some default parameters. 
a. Naïve Bayes - Gaussian Naive Bayes <br>
b. Logistic Regression<br>
c. Multi-Layer Perceptron<br>
 <br>
Here I am using Gaussian Naive Bayes since it is relatively fast to train and use and it is highly scalable. In addition, I am using a single hidden layer for MLP classification since it is better to start with a simpler model than overfitting or complicating the model unnecessarily. <br>

Doing the cross-validation manually. In this case, using kfold and 4 splits, 4-fold cross validation is used. However, instead of keeping a test case separately. 4 models were trained on all data. Within each validation, the 75% of the data was used to train, and 25% of the data was used to test the model. Then, the average of all these 4 models was taken to evaluate the accuracy, precision and recall. <br>
 <br>
Using the for loop to iterate through each section for cross validation and storing the data in the report_List variables for keeping track of accuracy, precision and recall for each validation.

In [112]:
count = 0 # Used to keep count of the cross-validation section
for train_index, test_index in kf.split(x):

    # Training and testing data set for each fold
    x_training, x_testing = x.iloc[train_index], x.iloc[test_index]
    y_training, y_testing = y.iloc[train_index], y.iloc[test_index]
    

    # Step 6-7 Training and Testing Gaussian Naive Bayes
    print("Gaussisan Naive Bayes")
    gaussNB = GaussianNB().fit(x_training, y_training)

    # Predicting the outcome
    y_prediction_GB = gaussNB.predict(x_testing)

    # confusion matrix
    conmGB = confusion_matrix(y_testing, y_prediction_GB)
    print("Confusion matrix: TP =", conmGB[0,0], ", TN =", conmGB[1,1], ", FP =", conmGB[0,1], ", FN =", conmGB[1,0])
    print("The accuracy score test: ", accuracy_score(y_testing, y_prediction_GB))
    print("The precision score is: ", precision_score(y_testing, y_prediction_GB))
    print("The recall score is: ", recall_score(y_testing, y_prediction_GB))
    cl_report_GB =classification_report(y_testing, y_prediction_GB, target_names=["No diabetes", "Diabetes"])
    print("Classification report below:", count)
    print(cl_report_GB)
    class_report_GB = [accuracy_score(y_testing, y_prediction_GB), precision_score(y_testing, y_prediction_GB), recall_score(y_testing, y_prediction_GB)]
    report_ListGB.append(class_report_GB)


    # Step 6-7 Training and testing Logistic Regression
    print("Logistic Regression analysis\n")
    logistic_reg = LogisticRegression(penalty='l2', fit_intercept = True, max_iter=1000).fit(x_training, y_training)
    y_prediction_LR = logistic_reg.predict(x_testing)

    # Confusion matrix
    conmLR = confusion_matrix(y_testing, y_prediction_LR)
    print("Confusion matrix: TP =", conmLR[0,0], ", TN =", conmLR[1,1], ", FP =", conmLR[0,1], ", FN =", conmLR[1,0])
    print("The accuracy score is: ", accuracy_score(y_testing, y_prediction_LR))
    print("The precision score is: ", precision_score(y_testing, y_prediction_LR))
    print("The recall score is: ", recall_score(y_testing, y_prediction_LR))

    cl_report_LR = classification_report(y_testing, y_prediction_LR, target_names=["No diabetes", "Diabetes"])
    print("Classification report below:", count)
    print(cl_report_LR)
    class_report_LR = [accuracy_score(y_testing, y_prediction_LR), precision_score(y_testing, y_prediction_LR), recall_score(y_testing, y_prediction_LR)]
    report_ListLR.append(class_report_LR)


    # MLP classification with 1 hidden layer 
    # Step 6-7: Testing and training
    print("Multi-layer Perceptron analysis\n")

    mlp = MLPClassifier(solver="adam", max_iter=1000, activation="relu", hidden_layer_sizes=(8), alpha=0.1, batch_size='auto', learning_rate_init=0.001, random_state=2)
    mlp.fit(x_training, y_training)

    y_prediction_MLP = mlp.predict(x_testing)

    # Confusion matrix
    conmMLP = confusion_matrix(y_testing, y_prediction_MLP)
    print("Confusion matrix of MLP: TP =", conmMLP[0,0], ", TN =", conmMLP[1,1], ", FP =", conmMLP[0,1], ", FN =", conmMLP[1,0])
    print("The accuracy score is: ", accuracy_score(y_testing, y_prediction_MLP))
    print("The precision score is: ", precision_score(y_testing, y_prediction_MLP))
    print("The recall score is: ", recall_score(y_testing, y_prediction_MLP))

    cl_report_MLP = classification_report(y_testing, y_prediction_MLP, target_names=["No diabetes", "Diabetes"])
    print("Classification report below", count)
    print(cl_report_MLP)
    class_report_MLP = [accuracy_score(y_testing, y_prediction_MLP), precision_score(y_testing, y_prediction_MLP), recall_score(y_testing, y_prediction_MLP)]
    report_ListMLP.append(class_report_MLP)

    count+=1


Gaussisan Naive Bayes
Confusion matrix: TP = 103 , TN = 42 , FP = 23 , FN = 24
The accuracy score test:  0.7552083333333334
The precision score is:  0.6461538461538462
The recall score is:  0.6363636363636364
Classification report below: 0
              precision    recall  f1-score   support

 No diabetes       0.81      0.82      0.81       126
    Diabetes       0.65      0.64      0.64        66

    accuracy                           0.76       192
   macro avg       0.73      0.73      0.73       192
weighted avg       0.75      0.76      0.75       192

Logistic Regression analysis

Confusion matrix: TP = 110 , TN = 44 , FP = 16 , FN = 22
The accuracy score is:  0.8020833333333334
The precision score is:  0.7333333333333333
The recall score is:  0.6666666666666666
Classification report below: 0
              precision    recall  f1-score   support

 No diabetes       0.83      0.87      0.85       126
    Diabetes       0.73      0.67      0.70        66

    accuracy           

### Step 7-8: Evaluation
Getting the average of accuracy, precision, recall of cross validation of the first 4 models of each type created in Step 6-7 for cross validation. This gives the overall score and gets the average by having each model train and test on different part of the dataset to avoid bias of the position of the data points.

In [106]:
averaged_reportGB = np.array(report_ListGB).mean(axis=0)
print("Accuracy, Precision, Recall of normal Gaussian Naive-Bayes", averaged_reportGB)

averaged_reportLR = np.array(report_ListLR).mean(axis=0)
print("Accuracy, Precision, Recall of normal Logistic Regression", averaged_reportLR)

averaged_reportMLP = np.array(report_ListMLP).mean(axis=0)
print("Accuracy, Precision, Recall of normal Multi-layer Perceptron", averaged_reportMLP)

Accuracy, Precision, Recall of normal Gaussian Naive-Bayes [0.74869792 0.65919242 0.58328446]
Accuracy, Precision, Recall of normal Logistic Regression [0.7734375  0.72834268 0.56850998]
Accuracy, Precision, Recall of normal Multi-layer Perceptron [0.71744792 0.63854337 0.44076945]


### Step 9: Modifying some parameters, and performing a train/test/evaluate again:
The cross-validation performed here is the same as the cross-validation performed earlier. Using the 4-fold cross validation to split the dataset into 4, and iterating through each set and training and testing on the remaining set.

#### Gaussian Naive Bayes Model 2:
The parameter changed for Gaussian Naive Bayes was Prior by calculating the prior probability. This allows for a better prediction, knowing the prior probability and using it when training the model.

In [107]:
# Parameters changed for GB: Prior
print("Parameter prior added to normal GB")
count = 0
unique_y = y.unique()
prior_probability = np.zeros(len(y.unique()))
for i in range(0,len(unique_y)):
    prior_probability[i]=sum(dataFrame["Outcome"]==unique_y[i])/len(dataFrame["Outcome"])

Parameter prior added to normal GB


#### Logistic Regression Model 2:
The parameter changed for Logistic Regression was the class_weight. It is now set to be "balanced", which adjusts the weights inversely proportional to class frequencies.

#### Multi-Layer Perceptron Model 2:
The parameter changed for MLP was the solver type, which I set to 'lbfgs'. This is recommended for a smaller dataset and my data is relatively small.

##### Cross-validation again:
Using the same cross-validation as earlier, using the same naming convention for variables.

In [108]:
report_ListGB2 = [] # 2D array to store accuracy, precision, recall
report_ListLR2 = []
report_ListMLP2 = []

# Classification report after parameter changes - Redoing the cross-validation after adding prior probability
for train_index, test_index in kf.split(x):
    x_training, x_testing = x.iloc[train_index], x.iloc[test_index]
    y_training, y_testing = y.iloc[train_index], y.iloc[test_index]

    # Cross validation for GB after adding prior parameter
    gaussNB2 = GaussianNB(priors=prior_probability).fit(x_training, y_training)

    # Predicting the outcome
    print(gaussNB2.predict(x_testing)[:10])

    y_prediction_GB2 = gaussNB2.predict(x_testing)

    # confusion matrix
    conmGB2 = confusion_matrix(y_testing, y_prediction_GB2)
    print("Confusion matrix of GB2: TP =", conmGB2[0,0], ", TN =", conmGB2[1,1], ", FP =", conmGB2[0,1], ", FN =", conmGB2[1,0])
    print("The accuracy score test: ", accuracy_score(y_testing, y_prediction_GB2))
    print("The precision score is: ", precision_score(y_testing, y_prediction_GB2))
    print("The recall score is: ", recall_score(y_testing, y_prediction_GB2))
    cl_report_GB2 =classification_report(y_testing, y_prediction_GB2)
    #print("Classification report below:", count)
    #print(cl_report_GB2)
    class_report_GB2 = [accuracy_score(y_testing, y_prediction_GB2), precision_score(y_testing, y_prediction_GB2), recall_score(y_testing, y_prediction_GB2)]
    report_ListGB2.append(class_report_GB2)
    
    # Model 2 for LR
    # Training and testing LogisticRegression with modified class_weight parameter to be balanced,
    # adjust the weights inversely proportional to class frequencies
    print("Logistic Regression analysis with modified parameter class_weight: 'balanced'\n")
    logistic_reg2 = LogisticRegression(penalty='l2', fit_intercept = True, max_iter=1000, class_weight='balanced').fit(x_training, y_training)
    y_prediction_LR2 = logistic_reg2.predict(x_testing)

    # Confusion matrix
    conmLR2 = confusion_matrix(y_testing, y_prediction_LR2)
    print("Confusion matrix of LR2: TP =", conmLR2[0,0], ", TN =", conmLR2[1,1], ", FP =", conmLR2[0,1], ", FN =", conmLR2[1,0])
    print("The accuracy score is: ", accuracy_score(y_testing, y_prediction_LR2))
    print("The precision score is: ", precision_score(y_testing, y_prediction_LR2))
    print("The recall score is: ", recall_score(y_testing, y_prediction_LR2))

    cl_report_LR2 = classification_report(y_testing, y_prediction_LR2, target_names=["No diabetes", "Diabetes"])
    #print("Classification report below:", count)
    #print(cl_report_LR2)
    class_report_LR2 = [accuracy_score(y_testing, y_prediction_LR2), precision_score(y_testing, y_prediction_LR2), recall_score(y_testing, y_prediction_LR2)]
    report_ListLR2.append(class_report_LR2)


    # Model 2 for MLP classification with 1 hidden layer, Changing parameter type solver to lbggs since I have a small dataset, a little less than 1k
    print("Multi-layer Perceptron analysis - modified solver to 'lbfgs'\n")
    mlp2 = MLPClassifier(solver="lbfgs", max_iter=3000, activation="relu", hidden_layer_sizes=(8), alpha=0.0001, learning_rate_init=0.001, random_state=2)
    mlp2.fit(x_training, y_training)

    y_prediction_MLP2 = mlp.predict(x_testing)

    # Confusion matrix
    conmMLP2 = confusion_matrix(y_testing, y_prediction_MLP2)
    print("Confusion matrix of MLP2: TP =", conmMLP2[0,0], ", TN =", conmMLP2[1,1], ", FP =", conmMLP2[0,1], ", FN =", conmMLP2[1,0])
    print("The accuracy score is: ", accuracy_score(y_testing, y_prediction_MLP2))
    print("The precision score is: ", precision_score(y_testing, y_prediction_MLP2))
    print("The recall score is: ", recall_score(y_testing, y_prediction_MLP2))

    cl_report_MLP2 = classification_report(y_testing, y_prediction_MLP2, target_names=["No diabetes", "Diabetes"])
    
    #print("Classification report below", count)
    #print(cl_report_MLP2)
    class_report_MLP2 = [accuracy_score(y_testing, y_prediction_MLP2), precision_score(y_testing, y_prediction_MLP2), recall_score(y_testing, y_prediction_MLP2)]
    report_ListMLP2.append(class_report_MLP2)


    count+=1

[1 1 0 0 1 0 0 1 0 1]
Confusion matrix of GB2: TP = 86 , TN = 52 , FP = 40 , FN = 14
The accuracy score test:  0.71875
The precision score is:  0.5652173913043478
The recall score is:  0.7878787878787878
Logistic Regression analysis with modified parameter class_weight: 'balanced'

Confusion matrix of LR2: TP = 95 , TN = 51 , FP = 31 , FN = 15
The accuracy score is:  0.7604166666666666
The precision score is:  0.6219512195121951
The recall score is:  0.7727272727272727
Multi-layer Perceptron analysis - modified solver to 'lbfgs'

Confusion matrix of MLP2: TP = 109 , TN = 37 , FP = 17 , FN = 29
The accuracy score is:  0.7604166666666666
The precision score is:  0.6851851851851852
The recall score is:  0.5606060606060606
[0 0 1 0 1 1 0 1 1 1]
Confusion matrix of GB2: TP = 95 , TN = 47 , FP = 35 , FN = 15
The accuracy score test:  0.7395833333333334
The precision score is:  0.573170731707317
The recall score is:  0.7580645161290323
Logistic Regression analysis with modified parameter clas

##### Average report of each model after the modification and cross validation

In [109]:
averaged_reportGB2 = np.array(report_ListGB2).mean(axis=0)
print("Average accuracy, precision, recall after cross validation of 2nd GB model:", averaged_reportGB2)

averaged_reportLR2 = np.array(report_ListLR2).mean(axis=0)
print("Average accuracy, precision, recall after cross validation of 2nd LR model:", averaged_reportLR2)

averaged_reportMLP2 = np.array(report_ListMLP2).mean(axis=0)
print("Average accuray, precision, recall after cross validation of 2nd MLP model:", averaged_reportMLP2)

Average accuracy, precision, recall after cross validation of 2nd GB model: [0.71875    0.57601479 0.74720011]
Average accuracy, precision, recall after cross validation of 2nd LR model: [0.75260417 0.62691638 0.72820486]
Average accuray, precision, recall after cross validation of 2nd MLP model: [0.72395833 0.63265993 0.49556626]


### Repeating Step 9 for Modified parameters:

#### Gaussian Naive Bayes Model 3:
The parameter changed here was smoothing variance. This is because for Gaussian Naive Bayes only prior could be changed by calculating the prior probability. However, for variance smoothing, I am estimating a constant to apply the smoothing.

#### Logistic Regression Model 3:
The parameter changed for Logistic Regression was the C, which is the inverse of regularization. It is now set to 1e12, which will allow for better prediction on test set by reducing overfitting on the training set.

#### Multi-Layer Perceptron Model 3:
The parameter changed for MLP was increasing the hidden layer nodes to 13 and setting the activation parameter to "identity". This is to implement a linear bottleneck and increasing nodes is helpful in this case to provide better prediction.

##### Cross-validation again:
Using the same cross-validation as earlier, using the same naming convention for variables.

In [110]:
report_ListGB3 = [] # 2D array to store accuracy, precision, recall
report_ListLR3 = []
report_ListMLP3 = []
count = 0

# Redoing the cross-validation after chanigng parameters
for train_index, test_index in kf.split(x):
    x_training, x_testing = x.iloc[train_index], x.iloc[test_index]
    y_training, y_testing = y.iloc[train_index], y.iloc[test_index]
    
    # Cross validation for GB after adding prior parameter and var_smoothing
    print("\nGaussian NB analysis with modified parameter var_smoothing=0.01")
    gaussNB3 = GaussianNB(var_smoothing=0.01).fit(x_training, y_training)

    # Predicting the outcome
    y_prediction_GB3 = gaussNB3.predict(x_testing)

    # confusion matrix
    conmGB3 = confusion_matrix(y_testing, y_prediction_GB3)
    print("Confusion matrix of GB3: TP =", conmGB3[0,0], ", TN =", conmGB3[1,1], ", FP =", conmGB3[0,1], ", FN =", conmGB3[1,0])
    print("The accuracy score test: ", accuracy_score(y_testing, y_prediction_GB3))
    print("The precision score is: ", precision_score(y_testing, y_prediction_GB3))
    print("The recall score is: ", recall_score(y_testing, y_prediction_GB3))
    cl_report_GB3 =classification_report(y_testing, y_prediction_GB3)
    #print("Classification report below:", count)
    #print(cl_report_GB2)
    class_report_GB3 = [accuracy_score(y_testing, y_prediction_GB3), precision_score(y_testing, y_prediction_GB3), recall_score(y_testing, y_prediction_GB2)]
    report_ListGB3.append(class_report_GB3)

    # Model 3 for Logistic Regression: Parameter changed : C : inverse of regularization
    # Training and testing LogisticRegression with modifief parameter: C = 1e12 to predict on test set better and reduce overfitting
    print("\nLogistic Regression analysis with modified parameter C: 1e12\n")
    logistic_reg3 = LogisticRegression(penalty='l2', fit_intercept = True, C=1e12, max_iter=1000).fit(x_training, y_training)
    y_prediction_LR3 = logistic_reg3.predict(x_testing)

    # Confusion matrix
    conmLR3 = confusion_matrix(y_testing, y_prediction_LR3)
    print("Confusion matrix: TP =", conmLR3[0,0], ", TN =", conmLR3[1,1], ", FP =", conmLR3[0,1], ", FN =", conmLR3[1,0])
    print("The accuracy score is: ", accuracy_score(y_testing, y_prediction_LR3))
    print("The precision score is: ", precision_score(y_testing, y_prediction_LR3))
    print("The recall score is: ", recall_score(y_testing, y_prediction_LR3))

    cl_report_LR3 = classification_report(y_testing, y_prediction_LR3, target_names=["No diabetes", "Diabetes"])
    #print("Classification report below:", count)
    #print(cl_report_LR3)
    class_report_LR3 = [accuracy_score(y_testing, y_prediction_LR3), precision_score(y_testing, y_prediction_LR3), recall_score(y_testing, y_prediction_LR3)]
    report_ListLR3.append(class_report_LR3)
   
   
    # Model 3 for MLP classification with 1 hidden layer, Changing parameter type solver to sgd

    print("\nMulti-layer Perceptron analysis - modified activation to identity (implementing linear bottleneck) - increased # of nodes\n")
    mlp3 = MLPClassifier(solver="lbfgs", max_iter=3000, batch_size='auto', activation="identity", hidden_layer_sizes=(13), alpha=0.0001, learning_rate_init=0.001, random_state=2)
    mlp3.fit(x_training, y_training)

    y_prediction_MLP3 = mlp3.predict(x_testing)

    # Confusion matrix
    conmMLP3 = confusion_matrix(y_testing, y_prediction_MLP3)
    print("Confusion matrix of MLP3: TP =", conmMLP3[0,0], ", TN =", conmMLP3[1,1], ", FP =", conmMLP3[0,1], ", FN =", conmMLP3[1,0])
    print("The accuracy score is: ", accuracy_score(y_testing, y_prediction_MLP3))
    print("The precision score is: ", precision_score(y_testing, y_prediction_MLP3))
    print("The recall score is: ", recall_score(y_testing, y_prediction_MLP3))

    cl_report_MLP3 = classification_report(y_testing, y_prediction_MLP3, target_names=["No diabetes", "Diabetes"])
    #print("Classification report below", count)
    #print(cl_report_MLP3)
    class_report_MLP3 = [accuracy_score(y_testing, y_prediction_MLP3), precision_score(y_testing, y_prediction_MLP3), recall_score(y_testing, y_prediction_MLP3)]
    report_ListMLP3.append(class_report_MLP3)

    count+=1
    if(count==1):
        headers = dataFrame.columns
        print("The last two columns are: Actual Diagnosis, Predicted Diagnosis")
        print("FP/FN for GB\n", bn1, "\n")
        an1 = np.column_stack((x_testing, y_testing, y_prediction_GB3))
        bn1 = an1[:30,:]
        print("FP/FN for LR\n", bn2, "\n")
        an2 = np.column_stack((x_testing, y_testing, y_prediction_LR3))
        bn2 = an2[:30,:]
        print("FP/FN for MLP\n", bn, "\n")
        an = np.column_stack((x_testing, y_testing, y_prediction_MLP3))
        bn = an[:30,:]
        


Gaussian NB analysis with modified parameter var_smoothing=0.01
Confusion matrix of GB3: TP = 116 , TN = 38 , FP = 10 , FN = 28
The accuracy score test:  0.8020833333333334
The precision score is:  0.7916666666666666
The recall score is:  0.5757575757575758

Logistic Regression analysis with modified parameter C: 1e12

Confusion matrix: TP = 110 , TN = 44 , FP = 16 , FN = 22
The accuracy score is:  0.8020833333333334
The precision score is:  0.7333333333333333
The recall score is:  0.6666666666666666

Multi-layer Perceptron analysis - modified activation to identity (implementing linear bottleneck) - increased # of nodes

Confusion matrix of MLP3: TP = 111 , TN = 44 , FP = 15 , FN = 22
The accuracy score is:  0.8072916666666666
The precision score is:  0.7457627118644068
The recall score is:  0.6666666666666666
The last two columns are: Actual Diagnosis, Predicted Diagnosis
FP/FN for GB
 [[2.000e+00 1.970e+02 7.000e+01 4.500e+01 5.430e+02 3.050e+01 1.580e-01
  5.300e+01 1.000e+00 1.00

#### Calculating Average again:
Now doing an average of all the cross-validation 4 models. 

In [111]:
# Now doing an average of the two models, since there is only priors that could change and there is no need to do variance 
# since there is no zero probability.

averaged_reportGB3 = np.array(report_ListGB3).mean(axis=0)
print("Average accuracy, precision, recall after cross validation of 3rd GB model:", averaged_reportGB3)

averaged_reportLR3 = np.array(report_ListLR3).mean(axis=0)
print("Average accuracy, precision, recall after cross validation of 3rd LR model:", averaged_reportLR3)
#average_allGB = np.array(averaged_reportGB, averaged_reportGB2)
averaged_reportMLP3 = np.array(report_ListMLP3).mean(axis=0)
print("Average accuracy, precision, recall after cross validation of 3rd MLP model:", averaged_reportMLP3)

Average accuracy, precision, recall after cross validation of 3rd GB model: [0.75260417 0.71738581 0.48167854]
Average accuracy, precision, recall after cross validation of 3rd LR model: [0.77473958 0.73243007 0.56539939]
Average accuracy, precision, recall after cross validation of 3rd MLP model: [0.77604167 0.73553741 0.56539939]


### Step 10 - Analyzing the results

### Comparing the pecision/recall measures of the 9 results quantitatively.
 <br>

The confusion matrix of each cross-validation section and all types of models was printed. The in-built sklearn functions of accuracy, precision and recall were used, and their averages were taken for comparison with each other.

#### Gaussian Naive Bayes:
As can be seen from the data, the best accuracy and precision of Gaussian Naive Bayes models was provided by the third Gaussian Naive Bayes model with variance smoothing set to 0.01. The average accuracy for that is 0.7526, precision is 0.7174, but the recall is low at 0.4817. While the precision and accuracy increased, the recall decreased. This means that the algorithm detects the true positives and returns the right predictions compared to the other two Naive Bayes models with lower precision. The variance of each attribute was taken into consideration, which increased the accuracy and precision of the model. This is because there are certain 0 values that skews the result. Hence, taking that into consideration allowed for a better prediction. 

#### Logistic Regression:
As can be seen from the data, the best accuracy and precision of Logistic Regression models was provided by the third Logistic Regression model with modified C(inverse of regularization strength). The average accuracy for that is 0.7747, precision is 0.7324 and lowest precision at 0.5654. While the precision and accuracy increased, the recall decreased. This means that the algorithm fits the model better since the modified C float defines how much the algorithm should be regularized. The smaller the number, the more it is regularized. Therefore, decreasing regularization prevents the training model from overfitting. Therefore, in this case, the model gave us better accuracy, precision and lower recall. The higher precision means that the model returns more right predictions compared to the other two regression models with lower precision.

#### Multi-Layer Perceptron:
The best accuracy and precision of Multi-Layer Perceptron models was provided by the third Multi-Layer Perceptron classification model with modified identity activation function for the hidden layer. This model has 'identity' instead of the 'relu' default function, and the hidden layer has 13 nodes. The average accuracy for this model is 0.7760, precision is 0.7355, and the highest recall at 0.5654. In this case, unlike the other two types of models, MLP model's recall increased along with precision. The higher recall shows that with the third MLP model, greater proportion of positive diagnosis was correct compared to other two MLP models. When playing with changing parameters, it was noticed that changing the activation function had a higher impact on the confusion matrix than increasing the nodes in the hidden layers. The activation identity function yields a linear transformation and therefore it fits better to the model. 

Overall, out of all types of models, the best accuracy and precision was provided by the third MLP classification model using identity activation function with 1 hidden layer containing 13 nodes. The lowest recall was given by the second Gaussian Naive Bayes model with prior probabilities, meaning it predicted a greater proportion of actual positives accurately, and lower false negatives.

Overall, in this case when predicting the diagnosis of Diabetes, it is better to have higher recall since that would mean that at least there will be lower number of false negatives. It is better that people get a positive prediction even if they do not have diabetes than the people with diabetes getting incorrect prediction since they are at risk of not getting proper treatment. The patients who receive an incorrect prediction can do an actual test to find out if they have the diabetes or not.

#### Examples for the above analysis:
For third model of Gaussian Naive Bayes, there are much more false negatives than false positives, for instance at index 17 has a false negative when doing a cross validation in second iteration. This is again because of high precision.
In addition, there is a false negative for for index 17 in Logistic regression and Naive Bayes on the same section. This shows that it is becuase of the missing values for skin thickness and insulin in point 17.  
For example, the third Naive Bayes gives the incorrect prediction, which is a false negative for data point at index 17 (starting from 0). It is supposed to diagnose that the patient has diabetes, but it predicts no diabetes. This is because the precision is high. While index 67 gives a false positive. The result has been printed above in the second iteration of the third models for all. I did not keep the other printed runs to avoid making this report long.


### References:

1. https://www.geeksforgeeks.org/how-to-get-column-names-in-pandas-dataframe/
2. https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
3. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold.get_n_splits
4. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict
5. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.PrecisionRecallDisplay.html#sklearn.metrics.PrecisionRecallDisplay.from_predictions
6. https://holypython.com/nbc/naive-bayes-classifier-optimization-parameters/
7. https://scikit-learn.org/stable/modules/model_evaluation.html
8. https://scikit-learn.org/stable/modules/cross_validation.html

Choosing continuous or discrete variables:<br>
9. https://medium.com/@christopherfielding/na%C3%AFve-bayes-classification-for-discrete-and-continuous-variables-cb1103155488 <br>
For layers: <br>
10. https://medium.com/geekculture/introduction-to-neural-network-2f8b8221fbd3 <br>
Dataset: <br>
11: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database <br>
12. https://www.kaggle.com/code/logeshk/pima-indians-diabetes-logistic-regression

The references mentioned in the Project description.