# CSCE 585: Machine Learning Systems Final Project

## Project Title: Diabetes Prediction Using Machine Learning

*Diabetes mellitus* is a serious chronic disease that affects a large and growing number of people. According to recent studies, risk factors include age, obesity, lack of exercise, family history of diabetes, lifestyle, poor diet, and high blood pressure. People with diabetes are at high risk of complications such as heart disease, kidney disease, stroke, eye problems, and nerve damage. With recent advances in machine learning (ML), several researchers have applied ML models to predict diabetes in patients from such factors. However, to our knowledge, there is no rigorous and comprehensive evaluation of different ML models that establishes best practices for this specific problem.

In this project, we aim to implement and evaluate different classification methods (e.g., decision tree, random forest, support vector machine, and neural network) on this dataset and determine which methods perform better and under which conditions. We use the Pima Indians onset of diabetes dataset, a standard machine learning dataset from the UCI Machine Learning Repository. It contains patient medical records for Pima Indians and indicates whether each patient had an onset of diabetes within five years.


The dataset has the following properties:
* Number of instances: 768

* Number of attributes: 8 plus the class variable

* Each row in the dataset (all values are numeric) has the following columns:
   * Number of times pregnant
   * Plasma glucose concentration at 2 hours in an oral glucose tolerance test
   * Diastolic blood pressure (mm Hg)
   * Triceps skin fold thickness (mm)
   * 2-Hour serum insulin (mu U/ml)
   * Body mass index (weight in kg/(height in m)^2)
   * Diabetes pedigree function
   * Age (years)
   * Class variable (0 or 1)

In [18]:
# import the necessary modules here!
import pandas as pd
import numpy as np
import matplotlib
import os

In [19]:
# Build the path to the dataset: ../Dataset/diabetes.csv relative to the
# current working directory. os.path keeps this portable across platforms.
parent_dir = os.path.dirname(os.getcwd())
path = os.path.join(parent_dir, "Dataset", "diabetes.csv")

# load the dataset to pandas dataframe
df = pd.read_csv(path)

In [20]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
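
Several rows in the preview contain 0 in `Insulin` or `SkinThickness`, which is not a plausible physiological measurement. A minimal sketch of how such zeros could be counted and marked as missing before modeling, assuming that 0 encodes a missing value in the columns listed in `zero_as_missing` (that list is an assumption, not something the notebook states). The frame below reproduces only the five previewed rows so the sketch is self-contained:

```python
import numpy as np
import pandas as pd

# The five rows shown by df.head(), restricted to the measurement columns.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns means "not recorded".
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column to see how much data is affected.
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)

# Replace zeros with NaN so that an imputer can handle them explicitly.
cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
print(cleaned.isna().sum())
```

On the full dataset the same check would run on `df` instead of `sample`. Whether to impute or drop the affected rows is a separate modeling decision.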


In [21]:
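# Import only the PyCaret functions used in this notebook
from pycaret.classification import (
    setup,
    compare_models,
    predict_model,
    save_model,
    load_model,
)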

In [22]:
df.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

In [23]:

experiment = setup(df, target="Outcome")

Unnamed: 0,Description,Value
0,Session id,8253
1,Target,Outcome
2,Target type,Binary
3,Original data shape,"(768, 9)"
4,Transformed data shape,"(768, 9)"
5,Transformed train set shape,"(537, 9)"
6,Transformed test set shape,"(231, 9)"
7,Numeric features,8
8,Preprocess,True
9,Imputation type,simple
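
The train and test shapes reported above are consistent with a 70/30 split of the 768 rows. A minimal sketch of that arithmetic, assuming a 70% training fraction (inferred from the reported shapes, not stated in the notebook). The seeded shuffle is only a stand-in for the library's own splitting logic:

```python
import math
import random

n_rows = 768          # rows in the original data, from the setup summary
train_fraction = 0.7  # assumption: inferred from the 537/231 shapes above

n_train = math.floor(n_rows * train_fraction)
n_test = n_rows - n_train

# Seeded shuffle so the same rows land in each split on every run.
# The seed reuses the session id from the summary purely for illustration.
rng = random.Random(8253)
indices = list(range(n_rows))
rng.shuffle(indices)
train_idx, test_idx = indices[:n_train], indices[n_train:]

print(n_train, n_test)  # 537 231
```

Because the session id was chosen automatically here, rerunning the notebook may give a different split and slightly different metrics. Passing a fixed `session_id` to `setup` makes the run repeatable.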


In [24]:
#setup??

In [36]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.7654,0.8199,0.5503,0.7308,0.6208,0.4557,0.4704,0.007
lr,Logistic Regression,0.7653,0.822,0.5608,0.7248,0.6254,0.4586,0.4717,0.242
et,Extra Trees Classifier,0.7617,0.8149,0.5608,0.7062,0.619,0.4498,0.46,0.033
ridge,Ridge Classifier,0.7598,0.0,0.5345,0.7253,0.6076,0.4403,0.4564,0.006
nb,Naive Bayes,0.7581,0.8052,0.5942,0.6838,0.6284,0.4519,0.4589,0.107
rf,Random Forest Classifier,0.7506,0.819,0.5561,0.6947,0.6077,0.4293,0.4418,0.037
gbc,Gradient Boosting Classifier,0.7488,0.8087,0.5892,0.6601,0.6166,0.4323,0.4378,0.02
ada,Ada Boost Classifier,0.7431,0.7866,0.5716,0.6683,0.6083,0.4194,0.4279,0.017
qda,Quadratic Discriminant Analysis,0.7413,0.8078,0.5082,0.6843,0.5768,0.3969,0.4102,0.007
knn,K Neighbors Classifier,0.7263,0.752,0.5339,0.6439,0.574,0.376,0.3859,0.123
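
The table is sorted by accuracy, so `best_model` is the accuracy winner. Which model counts as "best" depends on the metric, though. The sketch below re-ranks the rows above by each metric, using values copied from the table. The helper name `top_model` is illustrative only:

```python
# (model id, accuracy, AUC, recall, precision, F1), copied from the table above
results = [
    ("lda",   0.7654, 0.8199, 0.5503, 0.7308, 0.6208),
    ("lr",    0.7653, 0.8220, 0.5608, 0.7248, 0.6254),
    ("et",    0.7617, 0.8149, 0.5608, 0.7062, 0.6190),
    ("ridge", 0.7598, 0.0000, 0.5345, 0.7253, 0.6076),
    ("nb",    0.7581, 0.8052, 0.5942, 0.6838, 0.6284),
    ("rf",    0.7506, 0.8190, 0.5561, 0.6947, 0.6077),
    ("gbc",   0.7488, 0.8087, 0.5892, 0.6601, 0.6166),
    ("ada",   0.7431, 0.7866, 0.5716, 0.6683, 0.6083),
    ("qda",   0.7413, 0.8078, 0.5082, 0.6843, 0.5768),
    ("knn",   0.7263, 0.7520, 0.5339, 0.6439, 0.5740),
]

METRIC_INDEX = {"accuracy": 1, "auc": 2, "recall": 3, "precision": 4, "f1": 5}


def top_model(metric):
    """Return the id of the model with the highest value for the given metric."""
    col = METRIC_INDEX[metric]
    return max(results, key=lambda row: row[col])[0]


for metric in METRIC_INDEX:
    print(f"{metric:9s} -> {top_model(metric)}")
```

Linear Discriminant Analysis leads on accuracy and precision, Logistic Regression on AUC, and Naive Bayes on recall and F1. For a screening task where missed cases are costly, ranking by recall may be more appropriate. In PyCaret this can be requested with the `sort` argument of `compare_models`, for example `compare_models(sort="Recall")`.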


In [26]:
predict_model(best_model)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Linear Discriminant Analysis,0.8009,0.8399,0.6173,0.7692,0.6849,0.5419,0.5489


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,prediction_label,prediction_score
537,5.0,78.0,48.0,0.0,0.0,33.700001,0.654,25.0,0,0,0.8745
538,2.0,98.0,60.0,17.0,120.0,34.700001,0.198,22.0,0,0,0.8921
539,6.0,114.0,88.0,0.0,0.0,27.799999,0.247,66.0,0,0,0.7672
540,1.0,143.0,74.0,22.0,61.0,26.200001,0.256,21.0,0,0,0.7841
541,10.0,122.0,78.0,31.0,0.0,27.600000,0.512,45.0,0,0,0.5039
...,...,...,...,...,...,...,...,...,...,...,...
763,2.0,157.0,74.0,35.0,440.0,39.400002,0.134,30.0,0,1,0.5329
764,2.0,92.0,52.0,0.0,0.0,30.100000,0.141,22.0,0,0,0.9321
765,12.0,100.0,84.0,33.0,105.0,30.000000,0.488,46.0,0,0,0.6177
766,2.0,114.0,68.0,22.0,0.0,28.700001,0.092,25.0,0,0,0.8779


In [27]:
predict_model(best_model, df.tail())

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Linear Discriminant Analysis,0.8,0.5,0.0,0.0,0.0,0.0,0.0


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,prediction_label,prediction_score
0,10.0,101.0,76.0,48.0,180.0,32.900002,0.171,63.0,0,0,0.6019
1,2.0,122.0,70.0,27.0,0.0,36.799999,0.34,27.0,0,0,0.7147
2,5.0,121.0,72.0,23.0,112.0,26.200001,0.245,30.0,0,0,0.7957
3,1.0,126.0,60.0,0.0,0.0,30.1,0.349,47.0,1,0,0.7631
4,1.0,93.0,70.0,31.0,0.0,30.4,0.315,23.0,0,0,0.9303
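
The metrics reported for these five rows can be checked by hand. The sketch below recomputes them from the `Outcome` and `prediction_label` columns above. It uses the standard definitions and returns 0 when a denominator is zero, a common convention that is assumed here rather than taken from the notebook:

```python
import math

y_true = [0, 0, 0, 1, 0]  # Outcome column of the five rows above
y_pred = [0, 0, 0, 0, 0]  # prediction_label column of the same rows

# Confusion-matrix counts for the positive class (1).
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))


def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


n = len(y_true)
accuracy = (tp + tn) / n
recall = safe_div(tp, tp + fn)
precision = safe_div(tp, tp + fp)
f1 = safe_div(2 * precision * recall, precision + recall)

# Cohen's kappa: observed agreement corrected for chance agreement.
p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
kappa = safe_div(accuracy - p_expected, 1 - p_expected)

# Matthews correlation coefficient.
mcc = safe_div(
    tp * tn - fp * fn,
    math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
)

print(accuracy, recall, precision, f1, kappa, mcc)
```

The model predicts class 0 for every row, so accuracy is 4/5 = 0.8 while recall, precision, F1, kappa, and MCC are all 0. With only five rows and a single positive case, these numbers say little about the model. The hold-out evaluation above is the more meaningful estimate.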


In [28]:
predict_model(best_model, df.drop("Outcome", axis = 1).tail())

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,prediction_label,prediction_score
0,10.0,101.0,76.0,48.0,180.0,32.900002,0.171,63.0,0,0.6019
1,2.0,122.0,70.0,27.0,0.0,36.799999,0.34,27.0,0,0.7147
2,5.0,121.0,72.0,23.0,112.0,26.200001,0.245,30.0,0,0.7957
3,1.0,126.0,60.0,0.0,0.0,30.1,0.349,47.0,0,0.7631
4,1.0,93.0,70.0,31.0,0.0,30.4,0.315,23.0,0,0.9303
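
The `prediction_score` values can be turned into a probability for the positive class, which makes it possible to try other decision thresholds. This sketch assumes that `prediction_score` is the probability of the predicted label rather than of class 1. That is an assumption about the output convention, although it fits the scores above, which are all at least 0.5. The helper names are illustrative:

```python
# (prediction_label, prediction_score) pairs copied from the table above
predictions = [(0, 0.6019), (0, 0.7147), (0, 0.7957), (0, 0.7631), (0, 0.9303)]


def positive_probability(label, score):
    """Probability of class 1, assuming score belongs to the predicted label."""
    return score if label == 1 else 1.0 - score


def relabel(label, score, threshold):
    """Predict class 1 when the positive-class probability reaches the threshold."""
    return int(positive_probability(label, score) >= threshold)


p_pos = [round(positive_probability(l, s), 4) for l, s in predictions]
print(p_pos)

# Lowering the threshold trades false positives for fewer missed cases.
for threshold in (0.5, 0.2):
    labels = [relabel(l, s, threshold) for l, s in predictions]
    print(threshold, labels)
```

At the default threshold of 0.5 every row is labeled 0. At an illustrative threshold of 0.2, the fourth row (whose true `Outcome` is 1) would be flagged, along with three rows whose true outcome is 0.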


In [29]:
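# Note: best_model is the Linear Discriminant Analysis model selected above,
# not a logistic regression; the file name is kept as originally chosen.
save_model(best_model, model_name = "logistic-regression-model")
# Saves the preprocessing pipeline and the model together as a pickle (.pkl) file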

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=/var/folders/36/2yr7cfr96b983psr0fv9b4w00000gn/T/joblib),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['Pregnancies', 'Glucose',
                                              'BloodPressure', 'SkinThickness',
                                              'Insulin', 'BMI',
                                              'DiabetesPedigreeFunction',
                                              'Age'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               missing_values...
                                                               fill_value='constant',
                                                               missing

In [37]:
trained_model_diabetes = load_model("logistic-regression-model")

Transformation Pipeline and Model Successfully Loaded


In [40]:
# Predict with the reloaded pipeline; the results should match In [27]
predict_model(trained_model_diabetes, df.tail())

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Linear Discriminant Analysis,0.8,0.5,0.0,0.0,0.0,0.0,0.0


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,prediction_label,prediction_score
0,10.0,101.0,76.0,48.0,180.0,32.900002,0.171,63.0,0,0,0.6019
1,2.0,122.0,70.0,27.0,0.0,36.799999,0.34,27.0,0,0,0.7147
2,5.0,121.0,72.0,23.0,112.0,26.200001,0.245,30.0,0,0,0.7957
3,1.0,126.0,60.0,0.0,0.0,30.1,0.349,47.0,1,0,0.7631
4,1.0,93.0,70.0,31.0,0.0,30.4,0.315,23.0,0,0,0.9303
