<a href="https://www.kaggle.com/code/alliegross/diabetes-project?scriptVersionId=146219149" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Diabetes Project
### *This project includes the EDA, Hypothesis Testing, Feature Engineering, Model Pre-Processing and Model Construction for the Diabetes Health Indicators Dataset on Kaggle*

Diabetes is a chronic metabolic disorder characterized by elevated levels of blood glucose, resulting from insufficient production or inefficient utilization of insulin. Insulin, a hormone produced by the pancreas, facilitates the absorption of glucose into cells for energy. In individuals with diabetes, this regulatory mechanism is impaired, leading to persistent hyperglycemia. 

There are two main types of diabetes: 
- Type 1, often diagnosed in childhood, involves the immune system attacking and destroying insulin-producing cells 
- Type 2, more common in adults, is linked to lifestyle factors and insulin resistance. 

If left unmanaged, diabetes can lead to serious complications, including cardiovascular diseases, kidney dysfunction, and nerve damage. Regular monitoring, lifestyle modifications, and, in some cases, medication or insulin therapy are crucial components of diabetes management.

This project tests 5 different machine learning algorithms for predicting diabetes from 21 variables.

* See full data dictionary at https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

# Table of Contents

1. [Load Packages and Clean Dataset](#load-packages-and-clean-dataset)
2. [Perform Exploratory Data Analysis (EDA)](#perform-exploratory-data-analysis-eda)
   - A. [HighBP](#highbp)
   - B. [HighChol](#highchol)
   - C. [BMI](#bmi)
   - D. [GenHealth](#genhealth)
   - E. [PhysHlth](#physhlth)
   - F. [DiffWalk](#diffwalk)
3. [Hypothesis Testing](#hypothesis-testing)
   - A. [HighBP vs Diabetes_binary](#highbp-vs-diabetes_binary)
   - B. [HighChol vs Diabetes_binary](#highchol-vs-diabetes_binary)
   - C. [BMI vs Diabetes_binary](#bmi-vs-diabetes_binary)
   - D. [GenHealth vs Diabetes_binary](#genhealth-vs-diabetes_binary)
   - E. [PhysHlth vs Diabetes_binary](#physhlth-vs-diabetes_binary)
   - F. [DiffWalk vs Diabetes_binary](#diffwalk-vs-diabetes_binary)
4. [Feature Engineering](#feature-engineering)
5. [Pre-Processing and Hyperparameter Tuning](#pre-processing-and-hyperparameter-tuning)
6. [Model Construction](#model-construction)
   - A. [K Nearest Neighbors](#k-nearest-neighbors)
   - B. [Decision Tree](#decision-tree)
   - C. [Random Forests](#random-forests)
   - D. [Logistic Regression](#logistic-regression)
   - E. [XGBoost](#xgboost)

# 1.) Load Packages and Clean Dataset

In [None]:
# Load Packages
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import math

from scipy import stats
from scipy.stats import chi2_contingency
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.datasets import make_classification
from statsmodels.stats.outliers_influence import variance_inflation_factor
from imblearn.under_sampling import NearMiss
from sklearn.model_selection import train_test_split
from mlxtend.plotting import plot_confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score, roc_curve, auc
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
f1_score, confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay, PrecisionRecallDisplay, classification_report

from xgboost import plot_importance
import pickle

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
# Load dataset
df = pd.read_csv("/kaggle/input/diabetes-health-indicators-dataset/diabetes_binary_health_indicators_BRFSS2015.csv")

# Confirm the dataset is loaded into the dataframe
df.head()

In [None]:
# Obtain summary information, check for null values and data types that may need to be changed 
df.info()

In [None]:
df.isnull().sum()

In [None]:
# Get descriptive statistics 
df.describe().T

In [None]:
# Count fully duplicated rows (they are only inspected here, not dropped)
df.duplicated().sum()

In [None]:
# View the distribution of duplicates to see if the duplication is accidental or normal
duplicate_rows = df[df.duplicated()]

duplicate_rows.hist(figsize=(25,20));

In [None]:
value_counts = duplicate_rows[duplicate_rows['Diabetes_binary'] == 1.0]['Diabetes_binary'].value_counts()
print(value_counts)

In [None]:
value_counts = duplicate_rows[duplicate_rows['CholCheck'] == 0.0]['Diabetes_binary'].value_counts()
print(value_counts)

### Data Cleaning Plan:
1.) Dummy encode variables: no
* The data is already dummy encoded for yes/no binary variables

2.) Replace categorical variables: no
* There are no categorical variables to replace

3.) Change time data format: no
* There is no time data to convert to date/time formatting

4.) Change other data types: yes
* Change all floats to ints for computational speed

5.) Remove null values: no
* There appear to be no null values in the dataset.

6.) Remove or restructure outliers: no
* Since most variables are categorical or yes/no, the dataset is formatted to prevent outliers. This should be kept in mind when picking models

7.) Remove duplicates: no
* I chose not to remove the duplicates because these duplicates could show patterns in the data and distribution. It does not appear that the cause of duplication is an error. If removed, the accuracy of the models could decrease

### Other Notes
* It may be beneficial to change these variable names to keep the capitalization pattern consistent: NoDocbcCost, HeartDiseaseorAttack
* If it were up to me, I would have liked to see data on whether diabetes runs in the family of the patient and sugar intake throughout lifetime

In [None]:
# Change all floats to ints (every column of the dataframe is converted)
df = df.astype(int)

In [None]:
# Confirm all data types have been changed
df.info()

***

***

# 2.) Perform EDA

In [None]:
# Visual representation of value counts
df.hist(figsize=(25,20));

In [None]:
# Check correlation using a heatmap
plt.figure(figsize = (30,20))
sns.set(font_scale=1.5)
sns.heatmap(df.corr(),annot=True, cmap='GnBu')
plt.title("Diabetes Variable Correlations",fontsize=30)

In [None]:
df.drop('Diabetes_binary', axis=1).corrwith(df.Diabetes_binary).plot(kind='bar', grid=True, figsize=(20, 8), title="Correlation with Diabetes_binary",color="Blue")

Notable Correlations:
* Lowest: fruits, veggies, anyhealthcare, nodocbccost, sex(important!)
* Highest: highBP, highchol, highBMI, genhealth, physhealth, diffwalk

In [None]:
# Obtain value counts of patients with diabetes
labels=["Non-Diabetic","Diabetic"]
plt.pie(df["Diabetes_binary"].value_counts(), labels =labels ,autopct='%.02f');

In [None]:
df['Diabetes_binary_str']= df["Diabetes_binary"].replace({0:"Non-Diabetic",1:"Diabetic"})

df['Diabetes_binary_str'].value_counts()

* 86.07% (218,334) of the patients do not have diabetes
* 13.93% (35,346) of the patients have diabetes

### Upcoming Variable Evaluations
A. HighBP

B. HighChol

C. BMI

D. GenHealth

E. PhysHlth

F. DiffWalk

### A.) HighBP vs Diabetes_binary

In [None]:
# Create a variable that turns highBP into a string with labels for bar chart readability later
df["HighBP_str"]= df["HighBP"].replace({0:"No",1:"Yes"})

In [None]:
# Checking the relation with HighBP and Diabetes_binary
pd.crosstab(df.HighBP_str,df.Diabetes_binary_str).plot(kind="bar",figsize=(8,8))

plt.title('High Blood Pressure vs Diabetes Frequency')
plt.xlabel('HighBP')
plt.ylabel('Frequency')
plt.show()

In the above bar chart, we see a drastic jump in diabetes frequency among patients who report high blood pressure. Let's investigate this further...

In [None]:
# Obtain percentages for the relationship between high BP and diabetes
(df.groupby("Diabetes_binary_str")["HighBP_str"].value_counts()/df.groupby("Diabetes_binary_str")["HighBP_str"].count()).round(4)*100
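The same kind of row percentages can also be produced in one step with `pd.crosstab(..., normalize="index")`. The sketch below is only an illustration: the `toy` frame and its values are made up (they are not drawn from the dataset), and only the column names mirror the notebook.

```python
import pandas as pd

# Toy data for illustration only; not drawn from the BRFSS dataset
toy = pd.DataFrame({
    "Diabetes_binary_str": ["Diabetic"] * 4 + ["Non-Diabetic"] * 4,
    "HighBP_str": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})

# normalize="index" divides every row by its row total,
# so each Diabetes_binary_str group sums to 100%
row_pct = (
    pd.crosstab(toy["Diabetes_binary_str"], toy["HighBP_str"], normalize="index") * 100
).round(2)
print(row_pct)
```

Each row of `row_pct` answers "within this diabetes group, what share reports high blood pressure?", which is the same question the groupby expression answers.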

### B.) HighChol vs Diabetes_binary
Repeat the previous steps for cholesterol.

In [None]:
df['HighChol_str']=df['HighChol'].replace({0:"No",1:"Yes"})

In [None]:
pd.crosstab(df.HighChol_str,df.Diabetes_binary_str).plot(kind='bar', figsize=(8,8))

plt.title('High Cholesterol vs Diabetes Frequency')
plt.xlabel('HighChol')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Obtain percentages
(df.groupby('Diabetes_binary_str')['HighChol_str'].value_counts()/df.groupby('Diabetes_binary_str')['HighChol_str'].count()).round(4)*100

In [None]:
# Checking the correlation of HighBP and HighChol together against Diabetes_binary
(df.groupby(["HighBP_str", "HighChol_str"])["Diabetes_binary_str"].value_counts()/df.groupby(["HighBP_str" , "HighChol_str"])["Diabetes_binary"].count()).round(4)*100

### Correlation with Diabetes (Diabetes_binary_str = Diabetic)
HighBP(yes) + HighChol(no) > HighBP(no) + HighChol(yes)

The correlation between high blood pressure and diabetes is stronger than the correlation between high cholesterol and diabetes
* HighBP(no) + HighChol(yes) = Diabetic 10.42%
* HighBP(yes) + HighChol(no) = Diabetic 16.73%

HighBP and HighChol together
* HighBP(yes) + HighChol(yes) = Diabetic 29.71%
* Having both HighBP and HighChol is associated with the highest diabetes rate of these combinations

### C.) BMI vs Diabetes_binary

In [None]:
# Check the distribution and outlier positioning in BMI
ax = sns.boxplot(data=df, x='Diabetes_binary', y='BMI', palette='Paired')
ax.set(title = 'BMI Distribution Comparison for Non-Diabetics and Diabetics')
ax.set_xticklabels(['Not Diabetic', 'Diabetic'])
plt.ylim(15, 60)

Non-diabetics have:
* a lower BMI range
* a lower median BMI (the center line of the box)
* more outliers
* a slightly more right-skewed distribution

In [None]:
# Obtain descriptive statistics for BMI for diabetic and non-diabetic
df.groupby('Diabetes_binary_str')['BMI'].describe().round()

Observe that the mean BMI of diabetics is higher than the mean BMI of non-diabetics.

### D.) GenHealth vs Diabetes_binary

In [None]:
#Divide dataset into two - diabetes and non_diabetes
df_no = df[df['Diabetes_binary'] == 0]
df_yes = df[df['Diabetes_binary'] == 1]
df_no_genhlth = df_no['GenHlth']
df_yes_genhlth = df_yes['GenHlth']

In [None]:
sns.kdeplot(df_no_genhlth,color='green')
sns.kdeplot(df_yes_genhlth,color='blue')
plt.grid()
plt.title('General Health vs Diabetes_binary Distribution')
# Legend order follows plotting order: non-diabetics (green) first, then diabetics (blue)
plt.legend(['Not Diabetic', 'Diabetic'])

### E.) PhysHlth vs Diabetes_binary

In [None]:
df_no_physhlth = df_no['PhysHlth']
df_yes_physhlth = df_yes['PhysHlth']

In [None]:
sns.kdeplot(df_no_physhlth,color='green')
sns.kdeplot(df_yes_physhlth,color='blue')
plt.grid()
plt.title('Physical Health vs Diabetes_binary Distribution')
# Legend order follows plotting order: non-diabetics (green) first, then diabetics (blue)
plt.legend(['Not Diabetic', 'Diabetic'])

The distributions of the physical health of non-diabetics and diabetics closely match.

### F.) DiffWalk vs Diabetes_binary

In [None]:
df["DiffWalk_str"]= df["DiffWalk"].replace({0:"No",1:"Yes"})

In [None]:
(df.groupby("Diabetes_binary_str")["DiffWalk_str"].value_counts()/df.groupby("Diabetes_binary_str")["DiffWalk_str"].count()).round(4)*100

In [None]:
pd.crosstab(df.DiffWalk_str,df.Diabetes_binary_str).plot(kind='bar', figsize=(8,8))

plt.title('Difficulty Walking vs Diabetes Frequency')
plt.xlabel('DiffWalk')
plt.ylabel('Frequency')
plt.show()

***

***

# 3.) Hypothesis Testing
A Chi-Squared test will be conducted to determine the association between Diabetes_binary (diabetics vs non-diabetics) and the binary variables HighBP, HighChol, and DiffWalk.

A two-sample t-test will be conducted to compare the means of BMI, GenHlth, and PhysHlth between diabetics and non-diabetics.

Chi-Squared test for independence:
- t-tests cannot compare categorical data
- the Chi-Squared test determines whether two categorical variables are associated with each other
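To show what the Chi-Squared test actually measures, here is a minimal sketch that computes the statistic by hand for a 2x2 table. The counts are hypothetical and chosen only to keep the arithmetic easy. Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so its statistic would be somewhat smaller than the uncorrected value computed here.

```python
# Hypothetical 2x2 contingency table (counts are illustrative, not from the dataset)
observed = [[30, 10],   # e.g. Diabetic:     [HighBP = Yes, HighBP = No]
            [20, 40]]   # e.g. Non-Diabetic: [HighBP = Yes, HighBP = No]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-squared statistic (no continuity correction)
chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)
print(expected, round(chi2_stat, 4), dof)
```

The larger the gap between the observed and expected counts, the larger the statistic and the smaller the p-value.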

### A.) HighBP vs Diabetes_binary

H0: Diabetes_binary and HighBP are independent, and are not associated with each other

H1: Diabetes_binary and HighBP variables are not independent, and are associated with each other

In [None]:
# Prepare table
contingency= pd.crosstab(df.Diabetes_binary_str, df.HighBP_str)
contingency

In [None]:
# Conduct chi-squared test (the statistic is named chi2_stat so it does not overwrite sklearn's chi2 import)
chi2_stat, p_value_3, dof, exp_freq = chi2_contingency(contingency)
if (p_value_3 < 0.05):
    print('Reject Null Hypothesis')
else:
    print('Failed to reject Null Hypothesis')

### B.) HighChol vs Diabetes_binary

H0: Diabetes_binary and HighChol are independent, and are not associated with each other

H1: Diabetes_binary and HighChol variables are not independent, and are associated with each other

In [None]:
# Prepare table
contingency= pd.crosstab(df.Diabetes_binary_str, df.HighChol_str)
contingency

In [None]:
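# Conduct chi-squared test (the statistic is named chi2_stat so it does not overwrite sklearn's chi2 import)
chi2_stat, p_value_3, dof, exp_freq = chi2_contingency(contingency)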
if (p_value_3 < 0.05):
    print('Reject Null Hypothesis')
else:
    print('Failed to reject Null Hypothesis')

### C.) BMI vs Diabetes_binary

Conduct a two-sample t-test to compare whether the mean BMI of the two groups is equal
* H0: There is no significant difference in mean BMI between diabetics and non-diabetics
* H1: There is a significant difference in mean BMI between diabetics and non-diabetics

Further reasoning:
* the sample size is large
* the population standard deviation is unknown and must be estimated from the sample, so a t-test is used rather than a z-test
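As a rough illustration of what a two-sample t-test computes, the sketch below works out Welch's t-statistic by hand. The two samples are made up, and Welch's form (which does not assume equal variances) is my choice for the illustration; `stats.ttest_ind` pools the variances by default and only uses Welch's form when called with `equal_var=False`.

```python
import math
import statistics

# Illustrative BMI-like samples (made up, not from the dataset)
group_a = [31.0, 29.0, 35.0, 33.0, 32.0]
group_b = [27.0, 26.0, 30.0, 25.0, 27.0]

mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
# statistics.variance is the sample variance (divides by n - 1)
var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)

# Welch's t-statistic: difference in means divided by its estimated standard error
standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
t_stat = (mean_a - mean_b) / standard_error
print(round(t_stat, 4))
```

A large absolute t-statistic means the gap between the group means is large relative to the noise in the samples, which is what leads to rejecting H0.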

In [None]:
# Create variables before comparing BMI averages 
df_no_bmi = df_no['BMI']
df_yes_bmi = df_yes['BMI']

In [None]:
# Compare BMI averages across diabetics and non-diabetics
print('Average BMI for diabetics is {} and not diabetic is {} '.format(df_yes_bmi.mean().round(2),df_no_bmi.mean().round(2)))

In [None]:
# Conduct a two sample ttest 
ttest,p_value_1  = stats.ttest_ind(df_yes_bmi, df_no_bmi)
if p_value_1 < 0.05:   
    print('Reject Null Hypothesis')
else:
    print('Failed to reject Null Hypothesis')

### D.) GenHlth vs Diabetes_binary

Conduct a two-sample t-test to compare whether the mean GenHlth of the two groups is equal

* H0: There is no significant difference in mean GenHlth between diabetics and non-diabetics
* H1: There is a significant difference in mean GenHlth between diabetics and non-diabetics

In [None]:
# Create variables before comparing GenHlth averages 
df_no_genhlth = df_no['GenHlth']
df_yes_genhlth = df_yes['GenHlth']

In [None]:
# Compare GenHlth averages across diabetics and non-diabetics
print('Average GenHlth self-rating for diabetics is {} and not diabetic is {} '.format(df_yes_genhlth.mean().round(2),df_no_genhlth.mean().round(2)))

In [None]:
# Conduct a two sample ttest 
ttest,p_value_1  = stats.ttest_ind(df_yes_genhlth, df_no_genhlth)
if p_value_1 < 0.05:   
    print('Reject Null Hypothesis')
else:
    print('Failed to reject Null Hypothesis')

### E.) PhysHlth vs Diabetes_binary

Conduct a two-sample t-test to compare whether the mean PhysHlth of the two groups is equal

* H0: There is no significant difference in mean PhysHlth between diabetics and non-diabetics
* H1: There is a significant difference in mean PhysHlth between diabetics and non-diabetics

In [None]:
# Create variables before comparing PhysHlth averages 
df_no_physhlth = df_no['PhysHlth']
df_yes_physhlth = df_yes['PhysHlth']

In [None]:
# Compare PhysHlth averages across diabetics and non-diabetics
print('Average PhysHlth self-rating for diabetics is {} and not diabetic is {} '.format(df_yes_physhlth.mean().round(2),df_no_physhlth.mean().round(2)))

In [None]:
# Conduct a two sample ttest 
ttest,p_value_1  = stats.ttest_ind(df_yes_physhlth, df_no_physhlth)
if p_value_1 < 0.05:   
    print('Reject Null Hypothesis')
else:
    print('Failed to reject Null Hypothesis')

### F.) DiffWalk vs Diabetes_binary

H0: Diabetes_binary and DiffWalk are independent, and are not associated with each other

H1: Diabetes_binary and DiffWalk are not independent, and are associated with each other

In [None]:
# Prepare table
contingency= pd.crosstab(df.Diabetes_binary_str, df.DiffWalk_str)
contingency

In [None]:
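# Conduct chi-squared test (the statistic is named chi2_stat so it does not overwrite sklearn's chi2 import)
chi2_stat, p_value_3, dof, exp_freq = chi2_contingency(contingency)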
if (p_value_3 < 0.05):
    print('Reject Null Hypothesis')
else:
    print('Failed to reject Null Hypothesis')

***

***

# 4.) Feature Engineering
Based on the previous EDA and statistical analysis, we have a good understanding of the important features for stakeholders (and my own practice). 

However, feature selection can also be automated using ANOVA.
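For intuition, `f_classif` scores each feature with a one-way ANOVA F-statistic: the variation between the class means divided by the variation within the classes. Below is a minimal hand calculation for a single feature; the values and the `feature_by_class` name are hypothetical.

```python
# One feature split by class label (values are illustrative only)
feature_by_class = {
    0: [1.0, 2.0, 3.0],   # class 0, e.g. non-diabetic
    1: [4.0, 5.0, 6.0],   # class 1, e.g. diabetic
}

all_values = [v for values in feature_by_class.values() for v in values]
grand_mean = sum(all_values) / len(all_values)

# Between-group sum of squares: how far each class mean sits from the grand mean
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in feature_by_class.values()
)
# Within-group sum of squares: spread of the values around their own class mean
ss_within = sum(
    (v - sum(values) / len(values)) ** 2
    for values in feature_by_class.values()
    for v in values
)

df_between = len(feature_by_class) - 1
df_within = len(all_values) - len(feature_by_class)
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f_stat)
```

Features whose class means are far apart relative to their within-class spread get a high F-statistic and are the ones `SelectKBest` keeps.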

In [None]:
# Inspect the columns; the *_str helper columns must be dropped before running ANOVA
df.info()

In [None]:
# drop str variables (categorical variables) to prevent errors in ANOVA
columns_to_drop = ['Diabetes_binary_str','HighBP_str','HighChol_str','DiffWalk_str']
df.drop(columns=columns_to_drop, axis=1, inplace=True)

In [None]:
df.info()

In [None]:
# Split the columns and designate Diabetes_binary as Y
X = df.iloc[:,1:]
Y = df.iloc[:,0]

In [None]:
# define feature selection formula
fs = SelectKBest(score_func=f_classif,k=13)

In [None]:
# apply feature selection
X_selected = fs.fit_transform(X,Y)
print(X_selected.shape)

In [None]:
pd.DataFrame(X_selected).head()
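Because `fit_transform` returns a plain array, the selected columns lose their names. One way to recover them is the selector's `get_support()` mask. The sketch below uses a small synthetic frame (the `X_demo` and `y_demo` names and values are assumptions for illustration, not the notebook's data).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y_demo = np.array([0] * 50 + [1] * 50)
X_demo = pd.DataFrame({
    "informative": y_demo * 3.0 + rng.normal(size=100),  # shifts with the class label
    "noise_a": rng.normal(size=100),                     # unrelated to the label
    "noise_b": rng.normal(size=100),                     # unrelated to the label
})

selector = SelectKBest(score_func=f_classif, k=1).fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns
selected_names = X_demo.columns[selector.get_support()].tolist()
print(selected_names)
```

Applied to this notebook, the same pattern would be `X.columns[fs.get_support()]` after `fs` has been fitted.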

In [None]:
# Score features with a chi-squared test and use SelectKBest to list the top 13
from sklearn.feature_selection import chi2  # sklearn's scorer, re-imported in case the name chi2 was reassigned above
BestFeatures = SelectKBest(score_func=chi2, k=13)
fit = BestFeatures.fit(X,Y)

df_scores = pd.DataFrame(fit.scores_)
df_columns = pd.DataFrame(X.columns)

# Concatenate the two dataframes for better visualization
f_Scores = pd.concat([df_columns,df_scores],axis=1)
f_Scores.columns = ['Feature','Score']

n=f_Scores.shape[0]

print(n)
# With 21 features, this slice shows the 13 highest scores, highest first
f_Scores.sort_values(by=['Score']).iloc[n:7:-1]

***

***

# 5.) Pre-Processing and Hyperparameter Tuning

### Since we have a few features with different ranges, we will perform normalization. 

#### However, this will be done later in the model construction, just before Logistic Regression, so that KNN, Decision Tree, and Random Forests are fit on the original feature values.

* KNN: very sensitive to feature scale, since the distance between points is calculated; here it is fit on the unscaled features
* Random Forests & Decision Trees: unnecessary, since rescaling does not change the structure of the decisions
* Logistic Regression: standardization improves stability and may speed up the training process
* XGBoost: tree-based, so it is resilient to feature scaling

Moving Forward: Begin with KNN, Decision Tree, and Random Forests before normalization
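To see why distance-based models such as KNN react to feature scale, here is a small sketch with three hypothetical patients described by (BMI, HighBP). The values and the assumed spreads `bmi_std` and `bp_std` are invented for the example.

```python
import math

# Three hypothetical patients: (BMI, HighBP); values are illustrative only
a = (25.0, 0.0)
b = (27.0, 1.0)   # close to a in BMI, but differs in HighBP
c = (31.0, 0.0)   # same HighBP as a, but further away in BMI

# On the raw scale, BMI dominates the Euclidean distance
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

# Divide each feature by an assumed spread (hypothetical standard deviations)
bmi_std, bp_std = 6.0, 0.5

def scale(point):
    return (point[0] / bmi_std, point[1] / bp_std)

scaled_ab = math.dist(scale(a), scale(b))
scaled_ac = math.dist(scale(a), scale(c))

print(raw_ab < raw_ac)        # b is the nearer neighbour on the raw scale
print(scaled_ab > scaled_ac)  # c becomes the nearer neighbour after scaling
```

The nearest neighbour of `a` flips once the features are put on comparable scales, so KNN predictions can differ between scaled and unscaled inputs. Tree splits, by contrast, depend only on the ordering of values within each feature.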

In [None]:
# Choose columns for model based on correlation matrix 
df_model = df[['Diabetes_binary', 'HighBP','HighChol', 'BMI', 'GenHlth', 
               'DiffWalk', 'Age', 'HeartDiseaseorAttack', 'PhysHlth','MentHlth','Stroke','PhysActivity','HvyAlcoholConsump']]

In [None]:
# Train test split
x = df_model.drop('Diabetes_binary', axis=1)
y = df_model['Diabetes_binary']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=5)

In [None]:
# Drop unnecessary columns
unused_columns = ['Fruits', 'Veggies', 'Sex', 'CholCheck', 'AnyHealthcare','Education','Smoker','NoDocbcCost']
df.drop(columns=unused_columns, axis=1,inplace=True)

In [None]:
# Confirm column drop
df.info()

In [None]:
# split data
X=df.drop('Diabetes_binary',axis=1)
Y=df['Diabetes_binary']

In [None]:
Y.value_counts()

 ! There is a large class imbalance in the dataset

#### Reasons for using NearMiss:
* Prevent increased CPU usage
* Enough samples are present
* Low risk of overfitting
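For intuition about what NearMiss does, the sketch below implements the idea behind version 1 with NumPy: keep the majority-class samples whose average distance to their k closest minority-class samples is smallest. This is a simplified illustration under my own assumptions (the helper name `nearmiss1_indices` and the toy points are hypothetical), not the imbalanced-learn implementation.

```python
import numpy as np

def nearmiss1_indices(X_major, X_minor, n_keep, k=3):
    """Return indices of the n_keep majority rows with the smallest mean
    distance to their k nearest minority rows (the NearMiss-1 idea)."""
    # Pairwise Euclidean distances, shape (n_major, n_minor)
    dists = np.linalg.norm(X_major[:, None, :] - X_minor[None, :, :], axis=2)
    # Mean distance to the k closest minority samples for each majority sample
    mean_k = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    return np.argsort(mean_k)[:n_keep]

# Toy points for illustration only
X_minor = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_major = np.array([[0.5, 0.5], [5.0, 5.0], [1.0, 1.0], [9.0, 9.0], [8.0, 0.0]])

# Keep as many majority samples as there are minority samples
kept = nearmiss1_indices(X_major, X_minor, n_keep=len(X_minor), k=2)
print(kept)
```

The selection is deterministic rather than random: the majority samples that survive are the ones sitting closest to the minority class.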

In [None]:
# Reduce class imbalance by undersampling the majority class with NearMiss (keeps the majority samples closest to minority samples)
nm = NearMiss(version = 1,n_neighbors=13)
x_sm,y_sm=nm.fit_resample(X,Y)

In [None]:
# Confirm nearmiss success
y_sm.shape , x_sm.shape

In [None]:
# Confirm nearmiss success
y_sm.value_counts()

In [None]:
# Split training and testing data
X_train , X_test , Y_train , Y_test = train_test_split(x_sm,y_sm, test_size=0.3 , random_state=42)

In [None]:
# Verify the number of samples in the partitioned data
for part in [X_train, X_test, Y_train, Y_test]:
    print(len(part))

#### Begin hyperparameter tuning with GridSearchCV

In [None]:
grid_models = [(LogisticRegression(),[{'C':[0.25,0.5,0.75]}]), 
               (DecisionTreeClassifier(),[{'criterion':['gini','entropy','log_loss'],'min_samples_leaf':[3,5,8],'max_depth':[5,8,10]}]), 
               (RandomForestClassifier(),[{'n_estimators':[50,100,200],'max_depth':[5,8,10],'criterion':['gini','entropy']}]),
               (KNeighborsClassifier(),[{'n_neighbors':[3,5,10],'algorithm':['auto','ball_tree','kd_tree','brute']}]),
               (XGBClassifier(), [{'learning_rate': [0.01, 0.05, 0.10, 0.20,0.50], 'eval_metric': ['error']}])]

In [None]:
for i,j in grid_models:
    grid = GridSearchCV(estimator=i,param_grid = j, scoring = 'accuracy',cv=2)
    grid.fit(X_train, Y_train)
    best_accuracy = grid.best_score_
    best_param = grid.best_params_
    print('{}:\nBest Accuracy : {:.2f}%'.format(i,best_accuracy*100))
    print('Best Parameters : ',best_param)
    print('')
    print('----------------')
    print('')
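To get a feel for how much work the grid search does, the sketch below enumerates the decision tree grid from the cell above with `itertools.product`. The variable names are mine; the grid values and `cv=2` are copied from the notebook.

```python
from itertools import product

# Decision tree grid copied from the search above
dt_grid = {
    "criterion": ["gini", "entropy", "log_loss"],
    "min_samples_leaf": [3, 5, 8],
    "max_depth": [5, 8, 10],
}
cv_folds = 2

# Every combination of values is one candidate model
candidates = [dict(zip(dt_grid, values)) for values in product(*dt_grid.values())]

# One fit per candidate per fold (the final refit on all training data is extra)
n_fits = len(candidates) * cv_folds
print(len(candidates), n_fits)
```

Adding another value to any list multiplies the number of candidates, which is worth keeping in mind before widening the grid or raising `cv`.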

# 6.) Model Construction
In this section, I will continue my previous work by building on the EDA and statistical testing to build 5 machine learning models:
* K Nearest Neighbors
* Decision Tree
* Random Forests
* Logistic Regression
* XGBoost

As stated previously, the models that are fit before normalization will be built first.
* KNN, Decision Tree, Random Forests

## A.) K Nearest Neighbors

In [None]:
# Fit the model on the training data
knn = KNeighborsClassifier(algorithm='auto',n_neighbors=5)
knn.fit(X_train , Y_train)

In [None]:
# make predictions on test set
y_pred=knn.predict(X_test)

print('Training set score: {:.4f}'.format(knn.score(X_train, Y_train)))

print('Test set score: {:.4f}'.format(knn.score(X_test, Y_test)))

In [None]:
# Check MSE & RMSE 
mse =mean_squared_error(Y_test, y_pred)
print('Mean Squared Error : '+ str(mse))
rmse = math.sqrt(mean_squared_error(Y_test, y_pred))
print('Root Mean Squared Error : '+ str(rmse))

In [None]:
matrix = classification_report(Y_test,y_pred )
print(matrix)

Summary:
* precision: what proportion of "has diabetes" diagnoses were correct (true positives/all true and false positives)
* recall: proportion of actual diabetes positive cases that were identified correctly (TP/TP+FN)
* f1: combination of precision and recall

However, since we rebalanced our classes, these metrics will not be weighted more heavily than accuracy.
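As a quick worked example of the three metrics, the sketch below computes them from confusion-matrix counts. The counts are hypothetical and are not the notebook's results.

```python
# Hypothetical confusion-matrix counts (illustrative, not the notebook's results)
tp, fp, fn, tn = 80, 10, 20, 90

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Because false negatives appear only in the recall formula, recall is the metric that tracks missed diabetes cases most directly.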

In [None]:
# Calculate and plot the confusion matrix
cm1 = confusion_matrix(Y_test,y_pred)
plot_confusion_matrix(conf_mat=cm1,show_absolute=True,show_normed=True,colorbar=True)
plt.show()

KNeighbors did not provide suitable outcomes: its type 2 error rate (21%) is the highest of the five models.

Results Summary:
* 94% of the predicted positive values were correct
* 79% of the predicted negative values were correct
* there is a 6% chance of making a type 1 error (false positive)
* there is a 21% chance of making a type 2 error (false negative)

## B.) Decision Tree

In [None]:
# Fit the model on the training data
dt = DecisionTreeClassifier(criterion= 'gini',max_depth=10,min_samples_leaf=8)
dt.fit(X_train,Y_train)

In [None]:
# Make predictions on test data
y_pred=dt.predict(X_test)
print('Training set score: {:.4f}'.format(dt.score(X_train,Y_train)))

print('Test set score: {:.4f}'.format(dt.score(X_test,Y_test)))

In [None]:
# Check MSE and RMSE
mse=mean_squared_error(Y_test,y_pred)
print('Mean Squared Error : '+str(mse))

rmse=math.sqrt(mse)
print('Root Mean Squared Error : '+str(rmse))

In [None]:
# Create Decision Tree Classification Report
matrix = classification_report(Y_test,y_pred )
print(matrix)

In [None]:
# Calculate and plot the confusion matrix
cm1 = confusion_matrix(Y_test,y_pred)
plot_confusion_matrix(conf_mat=cm1,show_absolute=True,show_normed=True,colorbar=True)
plt.show()

Confusion matrix=how many in each class are correct vs incorrect
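To make the type 1 and type 2 percentages concrete, here is a minimal sketch of how they can be read from a confusion matrix. It assumes the scikit-learn layout (rows are actual classes, columns are predicted classes) and uses made-up counts. Normalizing each row by its total is my assumption about what the normalized values in the plot represent.

```python
import numpy as np

# Layout assumed: rows = actual class, columns = predicted class
# Counts are made up for illustration
cm = np.array([[90, 10],    # actual 0: [TN, FP]
               [20, 80]])   # actual 1: [FN, TP]

# Divide each row by its total so every row sums to 1
row_normed = cm / cm.sum(axis=1, keepdims=True)

false_positive_rate = row_normed[0, 1]   # type 1 error rate: FP / (FP + TN)
false_negative_rate = row_normed[1, 0]   # type 2 error rate: FN / (FN + TP)
print(false_positive_rate, false_negative_rate)
```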

Results Summary:
* 96% of the predicted positive values were correct
* 80% of the predicted negative values were correct
* there is a 4% chance of making a type 1 error (false positive)
* there is a 20% chance of making a type 2 error (false negative)

As I changed max_depth, I noticed that the chance of making a type 2 error (FN) changes with it. Of the values tried, a max depth of 10 had the lowest probability of a type 2 error.

## C.) Random Forests

In [None]:
# Fit the model on the training data
rf = RandomForestClassifier(max_depth=10, criterion='gini', n_estimators =100, min_samples_split=10, random_state=42)
rf.fit(X_train, Y_train)

In [None]:
# Make predictions on test set
y_pred=rf.predict(X_test)

print('Training set score: {:.4f}'.format(rf.score(X_train, Y_train)))

print('Test set score: {:.4f}'.format(rf.score(X_test, Y_test)))

In [None]:
# Check MSE & RMSE 
mse =mean_squared_error(Y_test, y_pred)
print('Mean Squared Error : '+ str(mse))
rmse = math.sqrt(mean_squared_error(Y_test, y_pred))
print('Root Mean Squared Error : '+ str(rmse))

In [None]:
matrix = classification_report(Y_test,y_pred )
print(matrix)

In [None]:
# Calculate and plot the confusion matrix
cm1 = confusion_matrix(Y_test,y_pred)
plot_confusion_matrix(conf_mat=cm1,show_absolute=True, show_normed=True,colorbar=True)
plt.show()

Lowest type 2 error probability so far

Results Summary:
* 96% of the predicted positive values were correct
* 81% of the predicted negative values were correct
* there is a 4% chance of making a type 1 error (false positive)
* there is a 19% chance of making a type 2 error (false negative)

## D.) Logistic Regression

In [None]:
# First, perform normalization
means= np.mean(X_train, axis=0)
stds= np.std(X_train, axis=0)

X_train = (X_train - means)/stds
X_test = (X_test - means)/stds

In [None]:
# Fit the model on the training data
lg=LogisticRegression(C=0.75, random_state=42)
lg.fit(X_train,Y_train)

In [None]:
# make initial predictions on test data and make the result 4 decimal places
y_pred=lg.predict(X_test)
print('Training set score: {:.4f}'.format(lg.score(X_train, Y_train)))

print('Test set score: {:.4f}'.format(lg.score(X_test, Y_test)))

In [None]:
# check MSE & RMSE
mse=mean_squared_error(Y_test,y_pred)
print('Mean Squared Error : '+str(mse))

rmse=math.sqrt(mse)
print('Root Mean Squared Error : '+str(rmse))

Mean squared error = the average squared difference between the predicted and the actual values
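A note on reading MSE for a classifier: with 0/1 class labels, each squared error is either 0 or 1, so the MSE is simply the share of misclassified samples (1 minus accuracy). The labels below are made up to show this.

```python
# Illustrative 0/1 labels (not the notebook's predictions)
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_hat = [0, 1, 0, 0, 1, 1, 1, 1]

# Mean squared error over the 0/1 labels
mse_demo = sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / len(y_true)

# Share of predictions that do not match the true label
error_rate = sum(t != p for t, p in zip(y_true, y_hat)) / len(y_true)
accuracy_demo = 1 - error_rate

print(mse_demo, error_rate, accuracy_demo)
```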

In [None]:
matrix = classification_report(Y_test,y_pred)
print(matrix)

In [None]:
# calculating the confusion matrix
cm1 = confusion_matrix(Y_test,y_pred)
plot_confusion_matrix(conf_mat=cm1,show_absolute=True,show_normed=True, colorbar=True)
plt.show()

Confusion matrix=how many in each class are correct vs incorrect

Results Summary:
* 93% of the predicted positive values were correct
* 81% of the predicted negative values were correct
* there is a 7% chance of making a type 1 error (false positive)
* there is a 19% chance of making a type 2 error (false negative)

Both types of errors are alarming. The type 2 error risk is high, which means patients may not be diagnosed with diabetes when they do have it. These errors need to be reduced.

## E.) XGBoost

In [None]:
# Fit training data to model
xg = XGBClassifier(eval_metric= 'error', learning_rate= 0.2, min_child_weight=1)
xg.fit(X_train , Y_train)

In [None]:
# Run predictions
y_pred=xg.predict(X_test)

print('Training set score: {:.4f}'.format(xg.score(X_train, Y_train)))

print('Test set score: {:.4f}'.format(xg.score(X_test, Y_test)))

In [None]:
# Check Error: MSE & RMSE 
mse =mean_squared_error(Y_test, y_pred)
print('Mean Squared Error : '+ str(mse))
rmse = math.sqrt(mean_squared_error(Y_test, y_pred))
print('Root Mean Squared Error : '+ str(rmse))

In [None]:
# Print classification Report
matrix = classification_report(Y_test,y_pred )
print(matrix)

In [None]:
# Calculate and plot the confusion matrix
cm1 = confusion_matrix(Y_test,y_pred)
plot_confusion_matrix(conf_mat=cm1,show_absolute=True,show_normed=True,colorbar=True)
plt.show()

Results Summary:
* 96% of the predicted positive values were correct
* 83% of the predicted negative values were correct
* there is a 4% chance of making a type 1 error (false positive)
* there is a 17% chance of making a type 2 error (false negative)

## Comparing Model Accuracy
In all of these models, the risk of a type 2 error is high. This is especially concerning for medical diagnosis, where a result that comes back negative when it is actually positive can be life-changing. 

Instead of choosing the model with the best accuracy alone, we should consider both accuracy and type 2 error.
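One possible way to act on this is to rank the models by their type 2 error first and use type 1 error as a tie-breaker. The sketch below does that with the percentages reported in each Results Summary above; the ranking rule itself is only a suggestion, not a definitive selection method.

```python
# Error rates (percent) as reported in the Results Summary of each model above
results = {
    "K Nearest Neighbors": {"type1": 6, "type2": 21},
    "Decision Tree": {"type1": 4, "type2": 20},
    "Random Forests": {"type1": 4, "type2": 19},
    "Logistic Regression": {"type1": 7, "type2": 19},
    "XGBoost": {"type1": 4, "type2": 17},
}

# Sort by type 2 error first, then by type 1 error as a tie-breaker
ranking = sorted(results, key=lambda name: (results[name]["type2"], results[name]["type1"]))

for name in ranking:
    print(name, results[name])
```

Under this rule XGBoost comes first, since it has the lowest type 2 error of the five models.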

#### If you have any comments/revisions please comment below! I'm consistently working to teach myself machine learning concepts, so any feedback helps.