# Diabetes Onset Detection -- Modeling

## Goal
1. Try different algorithms and build the prediction model
    * Naive Bayes
    * K-Nearest Neighbors
    * Logistic Regression
    * Random Forest
    * Support Vector Machine
    * Gradient Boosting
    * Neural Network
2. Compare the performance of different imputation and normalization methods
    * impute with mean
    * impute with median
    * z-score normalization
    * min-max scaling
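
The four imputation and normalization combinations listed above were prepared in the feature engineering step, which is not shown in this notebook. As a rough sketch only, assuming scikit-learn's `SimpleImputer`, `StandardScaler`, and `MinMaxScaler` and a small made-up array in place of the real features, the variants could be built like this:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# toy data with missing values (illustrative only, not the diabetes data)
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 80.0]])
X_test = np.array([[2.0, np.nan], [4.0, 20.0]])

# one pipeline per imputation/normalization combination
variants = {
    "mean-z-score": make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()),
    "median-z-score": make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
    "mean-min-max": make_pipeline(SimpleImputer(strategy="mean"), MinMaxScaler()),
    "median-min-max": make_pipeline(SimpleImputer(strategy="median"), MinMaxScaler()),
}

transformed = {}
for name, pipe in variants.items():
    # fit on the training set only, then reuse the fitted statistics on the test set
    X_train_t = pipe.fit_transform(X_train)
    X_test_t = pipe.transform(X_test)
    transformed[name] = (X_train_t, X_test_t)
```

Fitting the imputer and scaler on the training set only keeps test-set statistics out of the preprocessing.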

### Importing useful packages

In [41]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# suppress warnings
import warnings
warnings.filterwarnings("ignore")

sns.set()
sns.set_style("whitegrid")

# import model package
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier

# import customized package
from MLuseful import get_best_model_accuracy
from MLuseful import roc_curve_plot

### Loading data
We load the training and test sets produced in the feature engineering step, after which the data is ready for model fitting.

In [31]:
diabetes = pd.read_csv('../Data/diabetes_outliers_clean.csv')

diabetes_mean_X_train_z = pd.read_csv('../Data/diabetes_mean_X_train_z.csv')
diabetes_mean_X_test_z = pd.read_csv('../Data/diabetes_mean_X_test_z.csv')
diabetes_median_X_train_z = pd.read_csv('../Data/diabetes_median_X_train_z.csv')
diabetes_median_X_test_z = pd.read_csv('../Data/diabetes_median_X_test_z.csv')

diabetes_mean_X_train_min_max = pd.read_csv('../Data/diabetes_mean_X_train_min_max.csv')
diabetes_mean_X_test_min_max = pd.read_csv('../Data/diabetes_mean_X_test_min_max.csv')
diabetes_median_X_train_min_max = pd.read_csv('../Data/diabetes_median_X_train_min_max.csv')
diabetes_median_X_test_min_max = pd.read_csv('../Data/diabetes_median_X_test_min_max.csv')

diabetes_y_train = pd.read_csv('../Data/diabetes_y_train.csv', header=None)
diabetes_y_test = pd.read_csv('../Data/diabetes_y_test.csv', header=None)
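
One detail worth noting: reading a single-column label file with `header=None` returns a one-column DataFrame, while scikit-learn estimators expect a 1-D target. The small sketch below uses an in-memory CSV (made-up labels, not the real file) to show the shape and one way to flatten it:

```python
import io
import pandas as pd

# stand-in for a label file with one value per row and no header
csv_text = "0\n1\n1\n0\n"
y_frame = pd.read_csv(io.StringIO(csv_text), header=None)

# the result is a DataFrame with a single column
frame_shape = y_frame.shape

# flatten to a 1-D array, the shape that fit() expects for the target
y_1d = y_frame.values.ravel()
```

Passing the DataFrame directly usually still works, but scikit-learn may emit a conversion warning, which is hidden here because warnings are suppressed.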

Before fitting any model, we need a baseline score to beat. The baseline is the accuracy obtained by always predicting the majority class of the outcome variable.

In [32]:
diabetes['Outcome'].value_counts(normalize=True)

0    0.651042
1    0.348958
Name: Outcome, dtype: float64
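
The same majority-class baseline can also be expressed as a model. The sketch below assumes scikit-learn's `DummyClassifier` and uses synthetic labels with roughly the class balance shown above, not the actual dataset:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# synthetic labels with about 65% zeros and 35% ones (illustrative only)
y = np.array([0] * 651 + [1] * 349)
X = np.zeros((len(y), 1))  # features are ignored by this strategy

# always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)
```

Here `baseline_accuracy` equals the share of the majority class, which is the number any real model has to exceed.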

From the class proportions above, the baseline score to beat is 0.651: predicting that no patient has diabetes yields an accuracy of 0.651. We will fit different models and compare them on fitting time, prediction accuracy, and so on. For each algorithm we run a grid search over its hyperparameters and keep the combination that gives the highest accuracy. Accuracy alone may not be a sufficient error metric, which we discuss later; for now we use the accuracy score to select the best parameters.
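
The grid search is run through `get_best_model_accuracy` from the custom `MLuseful` module, whose source is not included here. The following is a hypothetical sketch of what such a helper might look like, assuming it wraps scikit-learn's `GridSearchCV` with accuracy scoring; the name `get_best_model_accuracy_sketch`, the `cv` argument, and the demo data are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB


def get_best_model_accuracy_sketch(model, params, X, y, cv=5):
    """Hypothetical stand-in for MLuseful.get_best_model_accuracy."""
    grid = GridSearchCV(model, params, scoring="accuracy", cv=cv)
    grid.fit(X, np.ravel(y))  # flatten y in case it is a one-column DataFrame
    print(f"Best Accuracy: {grid.best_score_}")
    print(f"Best Parameters: {grid.best_params_}")
    # average the per-candidate mean timings reported by the search
    print(f"Average Time to Fit (s): {round(grid.cv_results_['mean_fit_time'].mean(), 3)}")
    print(f"Average Time to Score (s): {round(grid.cv_results_['mean_score_time'].mean(), 3)}")
    return grid.best_estimator_


# small synthetic demo (not the diabetes data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_params = {"var_smoothing": [1e-11, 1e-9, 1e-7]}
best_demo = get_best_model_accuracy_sketch(GaussianNB(), demo_params, X_demo, y_demo)
```

The reported accuracy in this sketch is the mean cross-validated accuracy of the best parameter combination, so it is computed on the training data only and the test set stays untouched.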

#### Naive Bayes
We first try the simplest algorithm, Naive Bayes. Its basic assumption is that the variables are independent of each other given the class. This is an idealized assumption that almost never holds in the real world, but the algorithm is simple and fast, so it is still worth seeing how it performs.
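
To make the independence assumption concrete, the sketch below recomputes the class probabilities of a fitted `GaussianNB` by hand: the log prior plus a sum of per-feature Gaussian log densities, with no interaction terms between features. The data is synthetic and only serves as an illustration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# two synthetic classes with three features each (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 3)),
               rng.normal(1.5, 1.0, size=(50, 3))])
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)

# per-class feature variances: `var_` in recent scikit-learn, `sigma_` in older releases
var = getattr(clf, "var_", None)
if var is None:
    var = clf.sigma_

x_new = X[:5]
log_joint = []
for k in range(len(clf.classes_)):
    log_prior = np.log(clf.class_prior_[k])
    # independence: the joint log likelihood is a plain sum over features
    log_lik = -0.5 * np.sum(
        np.log(2.0 * np.pi * var[k]) + (x_new - clf.theta_[k]) ** 2 / var[k],
        axis=1,
    )
    log_joint.append(log_prior + log_lik)
log_joint = np.array(log_joint).T

# normalize across classes to obtain posterior probabilities
manual_proba = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
manual_proba /= manual_proba.sum(axis=1, keepdims=True)
```

The hand-computed `manual_proba` should agree with `clf.predict_proba(x_new)`, since the fitted variances already include the `var_smoothing` term that is tuned in the grid search.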

In [34]:
data_all = [(diabetes_mean_X_train_z, 'mean-z-score'), (diabetes_median_X_train_z, 'median-z-score'), 
        (diabetes_mean_X_train_min_max, 'mean-min-max'), (diabetes_median_X_train_min_max, 'median-min-max')]


In [47]:
naive = GaussianNB()

# specify the hyperparameters we are going to do gridsearch on
naive_params = {'var_smoothing': [1e-11, 1e-10, 1e-9, 1e-8, 1e-7]}
for data in data_all:
    print(data[1])
    print('-'*90)
    best = get_best_model_accuracy(naive, naive_params, data[0], diabetes_y_train)
    print('\n')

mean-z-score
------------------------------------------------------------------------------------------
Best Accuracy: 0.7486033519553073
Best Parameters: {'var_smoothing': 1e-11}
Average Time to Fit (s): 0.003
Average Time to Score (s): 0.001


median-z-score
------------------------------------------------------------------------------------------
Best Accuracy: 0.74487895716946
Best Parameters: {'var_smoothing': 1e-11}
Average Time to Fit (s): 0.003
Average Time to Score (s): 0.001


mean-min-max
------------------------------------------------------------------------------------------
Best Accuracy: 0.7486033519553073
Best Parameters: {'var_smoothing': 1e-11}
Average Time to Fit (s): 0.002
Average Time to Score (s): 0.001


median-min-max
------------------------------------------------------------------------------------------
Best Accuracy: 0.74487895716946
Best Parameters: {'var_smoothing': 1e-11}
Average Time to Fit (s): 0.003
Average Time to Score (s): 0.001




#### Logistic Regression

In [37]:
# liblinear supports both the 'l1' and 'l2' penalties searched below
lgr = LogisticRegression(solver='liblinear')
lgr_params = {'penalty': ['l1', 'l2'],
              'C': [0.01, 0.1, 1, 10]}
for data in data_all:
    print(data[1])
    print('-'*90)
    best = get_best_model_accuracy(lgr, lgr_params, data[0], diabetes_y_train)
    print('\n')

mean-z-score
------------------------------------------------------------------------------------------
Best Accuracy: 0.770949720670391
Best Parameters: {'C': 0.1, 'penalty': 'l1'}
Average Time to Fit (s): 0.004
Average Time to Score (s): 0.001


median-z-score
------------------------------------------------------------------------------------------
Best Accuracy: 0.770949720670391
Best Parameters: {'C': 0.1, 'penalty': 'l1'}
Average Time to Fit (s): 0.003
Average Time to Score (s): 0.001


mean-min-max
------------------------------------------------------------------------------------------
Best Accuracy: 0.7728119180633147
Best Parameters: {'C': 1, 'penalty': 'l2'}
Average Time to Fit (s): 0.006
Average Time to Score (s): 0.001


median-min-max
------------------------------------------------------------------------------------------
Best Accuracy: 0.7728119180633147
Best Parameters: {'C': 1, 'penalty': 'l2'}
Average Time to Fit (s): 0.006
Average Time to Score (s): 0.001


