# Diabetes Onset Detection -- Modeling

## Goal
1. Try different algorithms and build the prediction model
    * Naive Bayes
    * K-Nearest Neighbors
    * Logistic Regression
    * Decision Tree
    * Random Forest
    * Support Vector Machine
    * Gradient Boosting
    * Neural Network
2. Compare the performance of different imputation and normalization methods
    * impute with mean
    * impute with median
    * z-score normalization
    * min-max scaling

### Importing useful packages

In [29]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# suppress warnings
import warnings
warnings.filterwarnings("ignore")

sns.set()
sns.set_style("whitegrid")

# import model package
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix

# import customized package
from MLuseful import get_best_model_accuracy
from MLuseful import roc_curve_plot
from MLuseful import print_confusion_matrix

### Loading data
We load the training and test sets produced in the feature engineering step so that they are ready for model fitting.

In [30]:
diabetes = pd.read_csv('../Data/diabetes_outliers_clean.csv')

# z-score normalization
diabetes_mean_X_train_z = pd.read_csv('../Data/diabetes_mean_X_train_z.csv')
diabetes_mean_X_test_z = pd.read_csv('../Data/diabetes_mean_X_test_z.csv')
diabetes_median_X_train_z = pd.read_csv('../Data/diabetes_median_X_train_z.csv')
diabetes_median_X_test_z = pd.read_csv('../Data/diabetes_median_X_test_z.csv')
diabetes_mean_X_train_z_PCA = pd.read_csv('../Data/diabetes_mean_X_train_z_PCA.csv')
diabetes_median_X_train_z_PCA = pd.read_csv('../Data/diabetes_median_X_train_z_PCA.csv')
diabetes_mean_X_test_z_PCA = pd.read_csv('../Data/diabetes_mean_X_test_z_PCA.csv')
diabetes_median_X_test_z_PCA = pd.read_csv('../Data/diabetes_median_X_test_z_PCA.csv')

# min-max scaling
diabetes_mean_X_train_min_max = pd.read_csv('../Data/diabetes_mean_X_train_min_max.csv')
diabetes_mean_X_test_min_max = pd.read_csv('../Data/diabetes_mean_X_test_min_max.csv')
diabetes_median_X_train_min_max = pd.read_csv('../Data/diabetes_median_X_train_min_max.csv')
diabetes_median_X_test_min_max = pd.read_csv('../Data/diabetes_median_X_test_min_max.csv')
diabetes_mean_X_train_min_max_PCA = pd.read_csv('../Data/diabetes_mean_X_train_min_max_PCA.csv')
diabetes_median_X_train_min_max_PCA = pd.read_csv('../Data/diabetes_median_X_train_min_max_PCA.csv')
diabetes_mean_X_test_min_max_PCA = pd.read_csv('../Data/diabetes_mean_X_test_min_max_PCA.csv')
diabetes_median_X_test_min_max_PCA = pd.read_csv('../Data/diabetes_median_X_test_min_max_PCA.csv')

diabetes_y_train = pd.read_csv('../Data/diabetes_y_train.csv', header=None)
diabetes_y_test = pd.read_csv('../Data/diabetes_y_test.csv', header=None)

Before fitting any model, we need a baseline score to beat. The baseline is the accuracy obtained by always predicting the majority class of the outcome variable.

In [31]:
diabetes['Outcome'].value_counts(normalize=True)

0    0.651042
1    0.348958
Name: Outcome, dtype: float64

From the output above, the baseline accuracy we need to beat is 0.651: if we predict that no patient has diabetes, we are correct 65.1% of the time. We will fit several models and compare them on fitting time, predictive accuracy, and so on. For each algorithm we run a grid search over its hyperparameters and keep the combination that gives the highest accuracy. Accuracy alone may not be a sufficient metric, which we will discuss later; for now we use the accuracy score to determine the best parameters.
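
The baseline can be reproduced without fitting anything: always predict the most frequent class and measure how often that is right. Below is a minimal sketch. The label counts (500 negatives, 268 positives) are illustrative values chosen so that the proportions match the output above; they are not read from the data file, and `majority_baseline_accuracy` is a hypothetical helper name.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Illustrative labels (assumed counts, chosen to match the proportions above).
labels = [0] * 500 + [1] * 268
baseline = majority_baseline_accuracy(labels)
print(round(baseline, 6))
```

Any model whose cross-validated accuracy is not clearly above this value is doing no better than ignoring the features.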

#### Naive Bayes
We start with the simplest algorithm, Naive Bayes. Its basic assumption is that the features are independent of each other given the class. This is an idealized assumption that almost never holds in the real world, but the algorithm is simple and fast, so it is still worth seeing how it performs.
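
To make the independence assumption concrete, here is a small sketch of how a Gaussian Naive Bayes score can be computed by hand: the log prior of a class plus the sum of per-feature Gaussian log-densities. The class statistics and the query point are made-up values for illustration only, and the function names are hypothetical.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log-density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def naive_bayes_log_score(x, prior, means, variances):
    """log P(class) + sum_i log P(x_i | class), treating features as independent."""
    score = math.log(prior)
    for xi, m, v in zip(x, means, variances):
        score += gaussian_log_pdf(xi, m, v)
    return score


# Hypothetical per-class statistics for two standardized features.
classes = {
    0: {"prior": 0.65, "means": [-0.3, -0.2], "variances": [1.0, 1.0]},
    1: {"prior": 0.35, "means": [0.6, 0.4], "variances": [1.0, 1.0]},
}
x = [1.2, 0.8]
scores = {c: naive_bayes_log_score(x, **p) for c, p in classes.items()}
predicted = max(scores, key=scores.get)  # class with the highest log score
print(scores, predicted)
```

Because the per-feature terms are simply added, no interaction between features is ever modeled, which is exactly what the independence assumption means in practice.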

We use all the datasets produced in the data cleaning and feature engineering steps. There is no need to split the data into training and test sets again, since we already did that in the previous steps.
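
The cells in this notebook rely on `get_best_model_accuracy` from the custom `MLuseful` package, whose source is not shown here. The sketch below is a guess at what such a helper might look like, based only on the output it prints; it is not the actual `MLuseful` code. It assumes scikit-learn's `GridSearchCV` with 5-fold cross-validation, and it runs on synthetic data from `make_classification` so that it is self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB


def grid_search_accuracy_sketch(model, params, X, y, score="accuracy", cv=5):
    """Hypothetical stand-in for get_best_model_accuracy (not the MLuseful code)."""
    grid = GridSearchCV(model, params, scoring=score, cv=cv)
    grid.fit(X, np.ravel(y))  # flatten a single-column label frame to 1-D
    results = grid.cv_results_
    print("Best accuracy :", grid.best_score_)
    print("Best Parameters:", grid.best_params_)
    print("Average Time to Fit (s):", round(results["mean_fit_time"].mean(), 3))
    print("Average Time to Score (s):", round(results["mean_score_time"].mean(), 3))
    return grid.best_params_, grid.best_score_


# Synthetic stand-in data; the notebook itself passes the preprocessed training sets.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_params = {"var_smoothing": [1e-11, 1e-9, 1e-7]}
best, score = grid_search_accuracy_sketch(GaussianNB(), demo_params, X_demo, y_demo)
```

Under this assumption, the "best accuracy" reported in each cell is a cross-validated score on the training set, so the test sets remain untouched during hyperparameter selection.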

In [32]:
# data collection
data_all = [(diabetes_mean_X_train_z, 'mean-z-score'), (diabetes_median_X_train_z, 'median-z-score'),
            (diabetes_mean_X_train_min_max, 'mean-min-max'), (diabetes_median_X_train_min_max, 'median-min-max'),
            (diabetes_mean_X_train_z_PCA, 'mean-z-score-PCA'), (diabetes_median_X_train_z_PCA, 'median-z-score-PCA'),
            (diabetes_mean_X_train_min_max_PCA, 'mean-min-max-PCA'), (diabetes_median_X_train_min_max_PCA, 'median-min-max-PCA')]

In [33]:
# create dictionaries to collect the scores and the best parameters of every model
all_score = {}
all_params = {}

In [34]:
naive = GaussianNB()

# create an empty list to store the scores of naive bayes algorithms
all_score['Naive_Bayes'] = []
all_params['Naive_Bayes'] = []
# specify the hyperparameters we are going to do gridsearch on
naive_params = {'var_smoothing': [1e-11, 1e-10, 1e-9, 1e-8, 1e-7]}
for data in data_all:
    print(data[1])
    print('-'*90)
    best, score = get_best_model_accuracy(naive, naive_params, data[0], diabetes_y_train, score='accuracy')
    all_score['Naive_Bayes'].append(score)
    all_params['Naive_Bayes'].append(best)  # fixed: the loop variable is `best`, not `best_nv`
    print('\n')

mean-z-score
------------------------------------------------------------------------------------------
Best accuracy : 0.7486033519553073
Best Parameters: {'var_smoothing': 1e-11}
Average Time to Fit (s): 0.003
Average Time to Score (s): 0.001


median-z-score
------------------------------------------------------------------------------------------
Best accuracy : 0.74487895716946
Best Parameters: {'var_smoothing': 1e-11}
Average Time to Fit (s): 0.002
Average Time to Score (s): 0.001


mean-min-max
------------------------------------------------------------------------------------------
Best accuracy : 0.7486033519553073
Best Parameters: {'var_smoothing': 1e-11}
Average Time to Fit (s): 0.003
Average Time to Score (s): 0.001


median-min-max
------------------------------------------------------------------------------------------
Best accuracy : 0.74487895716946
Best Parameters: {'var_smoothing': 1e-11}
Average Time to Fit (s): 0.003
Average Time to Score (s): 0.001


mean-z-score

For Naive Bayes, z-score normalization and min-max scaling make essentially no difference. This is expected: Gaussian Naive Bayes estimates a separate mean and variance for each feature, so linearly rescaling a feature barely changes its predictions. The PCA data does not help with the prediction; its accuracy is lower than without PCA. We tuned only one hyperparameter here, and `var_smoothing` = 1e-11 is the best value for every dataset. The best accuracy score we got is 0.7486.

#### K-nearest neighbor
Next we try another simple algorithm, k-nearest neighbors (KNN). It finds the training examples closest to the one we are predicting and gives each neighbor one vote; the class with the majority of the votes becomes the predicted value.
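
The voting procedure can be sketched in a few lines of plain Python. This is only an illustration of the idea on made-up two-feature points, using Euclidean distance; it is not how `KNeighborsClassifier` is implemented internally, and `knn_predict` is a hypothetical name.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy two-feature data (made up for illustration).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = [0, 0, 0, 1, 1, 1]
prediction = knn_predict(points, labels, query=(0.8, 0.9), k=3)
print(prediction)
```

With two classes, an odd `k` guarantees that the vote cannot end in a tie, which is one reason to search only odd values of `n_neighbors`.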

In [35]:
knn = KNeighborsClassifier()

# create an empty list to store the scores of knn algorithms
all_score['KNN'] = []
all_params['KNN'] = []
knn_params = {'n_neighbors': [1,3,5,7,9,11]}

for data in data_all:
    print(data[1])
    print('-'*90)
    best, score = get_best_model_accuracy(knn, knn_params, data[0], diabetes_y_train, score='accuracy')
    all_score['KNN'].append(score)
    all_params['KNN'].append(best)
    print('\n')

mean-z-score
------------------------------------------------------------------------------------------
Best accuracy : 0.770949720670391
Best Parameters: {'n_neighbors': 9}
Average Time to Fit (s): 0.003
Average Time to Score (s): 0.003


median-z-score
------------------------------------------------------------------------------------------
Best accuracy : 0.776536312849162
Best Parameters: {'n_neighbors': 11}
Average Time to Fit (s): 0.003
Average Time to Score (s): 0.003


mean-min-max
------------------------------------------------------------------------------------------
Best accuracy : 0.776536312849162
Best Parameters: {'n_neighbors': 9}
Average Time to Fit (s): 0.002
Average Time to Score (s): 0.002


median-min-max
------------------------------------------------------------------------------------------
Best accuracy : 0.7783985102420856
Best Parameters: {'n_neighbors': 9}
Average Time to Fit (s): 0.002
Average Time to Score (s): 0.002


mean-z-score-PCA
-----------------

As with Naive Bayes, the PCA data performs worse than the feature-engineered data. Note that each dataset has its own best hyperparameter. The highest accuracy score we got using the KNN algorithm is 0.7784.

#### Logistic Regression
Now we try logistic regression, a more complex algorithm than the previous two. It uses the sigmoid function to compute the probability that an example belongs to the positive class and uses that probability to predict the outcome. The default decision threshold is 0.5: when the predicted probability is greater than or equal to 0.5 the model predicts 1, and 0 otherwise.
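
The prediction rule can be written out directly. In the sketch below the weights and bias are hypothetical numbers rather than coefficients fitted on the diabetes data, and the function names are ours.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_label(features, weights, bias, threshold=0.5):
    """Predict 1 when the probability reaches the threshold, otherwise 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical coefficients for two standardized features.
weights = [1.2, 0.4]
bias = -0.5
print(predict_proba([1.0, 0.5], weights, bias), predict_label([1.0, 0.5], weights, bias))
print(predict_proba([-1.0, 0.0], weights, bias), predict_label([-1.0, 0.0], weights, bias))
```

Lowering `threshold` would trade more false positives for fewer missed cases, which matters when accuracy is not the only metric of interest.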

In [36]:
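# liblinear supports both the 'l1' and 'l2' penalties searched below
# (it was the default solver in older scikit-learn releases)
lgr = LogisticRegression(solver='liblinear')

# create an empty list to store the scores of logistic regression algorithms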
all_score['Logistic_regression'] = []
all_params['Logistic_regression'] = []
lgr_params = {'penalty': ['l1', 'l2'],
              'C': [0.01, 0.1, 1, 10]}

for data in data_all:
    print(data[1])
    print('-'*90)
    best, score = get_best_model_accuracy(lgr, lgr_params, data[0], diabetes_y_train, score='accuracy')
    all_score['Logistic_regression'].append(score)
    all_params['Logistic_regression'].append(best)
    print('\n')

mean-z-score
------------------------------------------------------------------------------------------
Best accuracy : 0.770949720670391
Best Parameters: {'C': 0.1, 'penalty': 'l1'}
Average Time to Fit (s): 0.004
Average Time to Score (s): 0.001


median-z-score
------------------------------------------------------------------------------------------
Best accuracy : 0.770949720670391
Best Parameters: {'C': 0.1, 'penalty': 'l1'}
Average Time to Fit (s): 0.003
Average Time to Score (s): 0.001


mean-min-max
------------------------------------------------------------------------------------------
Best accuracy : 0.7728119180633147
Best Parameters: {'C': 1, 'penalty': 'l2'}
Average Time to Fit (s): 0.006
Average Time to Score (s): 0.001


median-min-max
------------------------------------------------------------------------------------------
Best accuracy : 0.7728119180633147
Best Parameters: {'C': 1, 'penalty': 'l2'}
Average Time to Fit (s): 0.006
Average Time to Score (s): 0.001


me

The same trend holds for logistic regression: the PCA data performs worse than the feature-engineered data. The best score we got here is 0.7728. One thing to note is that the average time to fit the model is slightly longer, which suggests that logistic regression is more expensive to train than Naive Bayes and KNN.

#### Decision tree
We now test a tree model. This algorithm differs from the previous models: at each node it splits the data into two groups, choosing the split that gives the maximum information gain, that is, the largest reduction in impurity (measured here by Gini impurity or entropy).
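
As an illustration of the splitting criterion, the sketch below computes the entropy and Gini impurity of a node and the gain of a candidate split, on a small made-up set of labels. The function names are hypothetical and the code only evaluates one given split; a real tree searches over all features and thresholds.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def impurity_decrease(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted


# Made-up labels: a balanced parent node and one imperfect candidate split.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]
gain = impurity_decrease(parent, left, right)
gini_gain = impurity_decrease(parent, left, right, impurity=gini)
print(round(gain, 4), round(gini_gain, 4))
```

A split that separated the two classes perfectly would have a gain equal to the full parent impurity, so the tree prefers it over the imperfect split shown here.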

In [37]:
dt = DecisionTreeClassifier()

# create an empty list to store the scores of tree algorithms
all_score['Decision_tree'] = []
all_params['Decision_tree'] = []

dt_params = {'criterion': ['gini', 'entropy'],
             'max_depth': [1,3,5,7,9,11],}

for data in data_all:
    print(data[1])
    print('-'*90)
    best, score = get_best_model_accuracy(dt, dt_params, data[0], diabetes_y_train, score='accuracy')
    all_score['Decision_tree'].append(score)
    all_params['Decision_tree'].append(best)
    print('\n')

mean-z-score
------------------------------------------------------------------------------------------
Best accuracy : 0.7374301675977654
Best Parameters: {'criterion': 'gini', 'max_depth': 5}
Average Time to Fit (s): 0.004
Average Time to Score (s): 0.001


median-z-score
------------------------------------------------------------------------------------------
Best accuracy : 0.7467411545623837
Best Parameters: {'criterion': 'gini', 'max_depth': 5}
Average Time to Fit (s): 0.005
Average Time to Score (s): 0.001


mean-min-max
------------------------------------------------------------------------------------------
Best accuracy : 0.7411545623836127
Best Parameters: {'criterion': 'gini', 'max_depth': 5}
Average Time to Fit (s): 0.004
Average Time to Score (s): 0.001


median-min-max
------------------------------------------------------------------------------------------
Best accuracy : 0.7430167597765364
Best Parameters: {'criterion': 'gini', 'max_depth': 5}
Average Time to Fit (s

The decision tree model has a highest score of 0.7467, which is comparable to the Naive Bayes model. Since this is a single-tree model, we will next combine multiple trees to improve performance.

#### Random forest
The random forest algorithm builds on the decision tree. It is an ensemble model: each tree is trained on a random sample of the data and considers a random subset of the features at each split. The forest grows many such trees and takes the majority vote as the final prediction.
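
The idea can be sketched by hand with bootstrap samples and a hard majority vote. This is a simplified illustration on synthetic data, not a replacement for `RandomForestClassifier`; note that scikit-learn's implementation averages the trees' predicted probabilities rather than counting hard votes. The helper names `fit_bagged_trees` and `predict_majority` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_bagged_trees(X, y, n_trees=25, max_depth=3, seed=0):
    """Train each tree on a bootstrap sample, with random feature subsets per split."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        rows = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        tree = DecisionTreeClassifier(
            max_depth=max_depth,
            max_features="sqrt",
            random_state=int(rng.integers(1_000_000)),
        )
        trees.append(tree.fit(X[rows], y[rows]))
    return trees


def predict_majority(trees, X):
    """Hard majority vote over the trees for 0/1 labels."""
    votes = np.stack([tree.predict(X) for tree in trees])  # (n_trees, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)


# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
trees = fit_bagged_trees(X_demo, y_demo)
predictions = predict_majority(trees, X_demo)
print("training accuracy:", (predictions == y_demo).mean())
```

Using an odd number of trees avoids tied votes in the binary case. Each extra tree adds fitting time, which is the cost side of the ensemble.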

In [38]:
rf = RandomForestClassifier()

# create an empty list to store the scores of random forest algorithms
all_score['Random_forest'] = []
all_params['Random_forest'] = []

rf_params = {'n_estimators': [10, 50, 100, 500],
             'max_depth': [1,3,5,7,9]}

for data in data_all:
    print(data[1])
    print('-'*90)
    best, score = get_best_model_accuracy(rf, rf_params, data[0], diabetes_y_train, score='accuracy')
    all_score['Random_forest'].append(score)
    all_params['Random_forest'].append(best)
    print('\n')

mean-z-score
------------------------------------------------------------------------------------------
Best accuracy : 0.7690875232774674
Best Parameters: {'max_depth': 5, 'n_estimators': 50}
Average Time to Fit (s): 0.177
Average Time to Score (s): 0.018


median-z-score
------------------------------------------------------------------------------------------
Best accuracy : 0.7746741154562383
Best Parameters: {'max_depth': 7, 'n_estimators': 500}
Average Time to Fit (s): 0.183
Average Time to Score (s): 0.018


mean-min-max
------------------------------------------------------------------------------------------
Best accuracy : 0.776536312849162
Best Parameters: {'max_depth': 9, 'n_estimators': 50}
Average Time to Fit (s): 0.18
Average Time to Score (s): 0.018


median-min-max
------------------------------------------------------------------------------------------
Best accuracy : 0.7783985102420856
Best Parameters: {'max_depth': 9, 'n_estimators': 100}
Average Time to Fit (s): 0

The average score of random forest is better than that of the decision tree, and the best score we get here is 0.7784 (on the median-min-max data). However, the average time to fit the model is a lot longer than for the decision tree. This is a clear trade-off between accuracy and model fitting time.

#### Support vector machine
Next we try the support vector machine (SVM). Its cost function is a modified version of the logistic regression cost function, and minimizing it yields a hyperplane in n-dimensional space that separates the classes with the maximum margin. We first try the linear support vector machine.
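
A common way to write this cost function is the soft-margin objective: half the squared norm of the weights plus `C` times the sum of hinge losses. The sketch below evaluates it for hypothetical weights on three made-up points with labels coded as -1 and +1. It uses the plain hinge loss for readability; `LinearSVC` uses the squared hinge loss by default, so this is an illustration of the idea and not that class's exact objective.

```python
def decision_value(x, w, b):
    """Signed score of a point; its sign gives the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def hinge_loss(x, y, w, b):
    """Hinge loss for a label y in {-1, +1}; zero once the point is beyond the margin."""
    return max(0.0, 1.0 - y * decision_value(x, w, b))


def soft_margin_objective(X, Y, w, b, C=1.0):
    """0.5 * ||w||^2 + C * sum of hinge losses."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    return regularizer + C * sum(hinge_loss(x, y, w, b) for x, y in zip(X, Y))


# Hypothetical hyperplane and three made-up points.
w, b = [1.0, 1.0], -1.0
X_demo = [(2.0, 1.0), (0.5, 0.8), (0.0, 0.0)]
Y_demo = [1, 1, -1]
objective = soft_margin_objective(X_demo, Y_demo, w, b, C=1.0)
print(objective)
```

Only the second point, which is on the correct side but inside the margin, contributes a loss. A larger `C` penalizes such points more heavily, while a smaller `C` favors a wider margin.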

In [39]:
# linear support vector machine
lsvc = LinearSVC()

# create an empty list to store the scores of linear support vector machine algorithms
all_score['LSVM'] = []
all_params['LSVM'] = []
lsvc_params = {'C': [0.01, 0.1, 1, 10, 100],
               'penalty': ['l1', 'l2']}
for data in data_all:
    print(data[1])
    print('-'*90)
    best, score = get_best_model_accuracy(lsvc, lsvc_params, data[0], diabetes_y_train, score='accuracy')
    all_score['LSVM'].append(score)
    all_params['LSVM'].append(best)
    print('\n')

mean-z-score
------------------------------------------------------------------------------------------
Best accuracy : 0.7653631284916201
Best Parameters: {'C': 0.01, 'penalty': 'l2'}
Average Time to Fit (s): 0.01
Average Time to Score (s): 0.0


median-z-score
------------------------------------------------------------------------------------------
Best accuracy : 0.7672253258845437
Best Parameters: {'C': 0.1, 'penalty': 'l2'}
Average Time to Fit (s): 0.009
Average Time to Score (s): 0.0


mean-min-max
------------------------------------------------------------------------------------------
Best accuracy : 0.776536312849162
Best Parameters: {'C': 0.1, 'penalty': 'l2'}
Average Time to Fit (s): 0.007
Average Time to Score (s): 0.0


median-min-max
------------------------------------------------------------------------------------------
Best accuracy : 0.7728119180633147
Best Parameters: {'C': 0.1, 'penalty': 'l2'}
Average Time to Fit (s): 0.007
Average Time to Score (s): 0.0


mean-

The best score we got from the linear support vector machine is 0.7765, and the average fitting time is quite short. Next we try the non-linear kernel models.

In [40]:
# svc with non-linear kernel
svc = SVC()

# create an empty list to store the scores of non-linear support vector machine algorithms
all_score['SVM'] = []
all_params['SVM'] = []
svc_params = {'C': [0.01, 0.1, 1, 10, 100],
              'kernel': ['rbf', 'poly']}
for data in data_all:
    print(data[1])
    print('-'*90)
    best, score = get_best_model_accuracy(svc, svc_params, data[0], diabetes_y_train, score='accuracy')
    all_score['SVM'].append(score)
    all_params['SVM'].append(best)
    print('\n')

mean-z-score
------------------------------------------------------------------------------------------
Best accuracy : 0.7635009310986964
Best Parameters: {'C': 1, 'kernel': 'rbf'}
Average Time to Fit (s): 0.016
Average Time to Score (s): 0.002


median-z-score
------------------------------------------------------------------------------------------
Best accuracy : 0.7616387337057728
Best Parameters: {'C': 1, 'kernel': 'rbf'}
Average Time to Fit (s): 0.015
Average Time to Score (s): 0.002


mean-min-max
------------------------------------------------------------------------------------------
Best accuracy : 0.7746741154562383
Best Parameters: {'C': 100, 'kernel': 'poly'}
Average Time to Fit (s): 0.006
Average Time to Score (s): 0.002


median-min-max
------------------------------------------------------------------------------------------
Best accuracy : 0.7728119180633147
Best Parameters: {'C': 100, 'kernel': 'poly'}
Average Time to Fit (s): 0.006
Average Time to Score (s): 0.002


In this case the non-linear kernel support vector machine actually scores slightly lower than the linear SVM; the best score we got is 0.7747.

#### Gradient boosting
Next we try another ensemble model, gradient boosting. It combines many weak learners: the first one performs poorly on its own, so we fit another weak learner to the residuals left unexplained by the previous one, and repeat this in sequence until we get the final output.
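
The residual-fitting loop can be sketched with one-split "stumps" on a single feature. This is a deliberately simplified illustration: it boosts with squared error on made-up 0/1 targets, whereas `GradientBoostingClassifier` fits regression trees to the gradient of a classification loss. The names `fit_stump` and `boost` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split of a 1-D feature under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit stumps in sequence, each one to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    learners = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in learners)


# Made-up data: the target switches from 0 to 1 between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
print(model(2), model(5))
```

Because every round depends on the predictions of the previous one, the learners cannot be fit in parallel. A smaller `learning_rate` shrinks each correction and usually needs more rounds, which is why the two are tuned together.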

In [41]:
gbr = GradientBoostingClassifier()

# create an empty list to store the scores of gradient boosting algorithms
all_score['Gradient_boosting'] = []
all_params['Gradient_boosting'] = []
gbr_params = {'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
              'n_estimators': [50, 100, 500]}

for data in data_all:
    print(data[1])
    print('-'*90)
    best, score = get_best_model_accuracy(gbr, gbr_params, data[0], diabetes_y_train, score='accuracy')
    all_score['Gradient_boosting'].append(score)
    all_params['Gradient_boosting'].append(best)
    print('\n')

mean-z-score
------------------------------------------------------------------------------------------
Best accuracy : 0.7616387337057728
Best Parameters: {'learning_rate': 0.05, 'n_estimators': 100}
Average Time to Fit (s): 0.2
Average Time to Score (s): 0.001


median-z-score
------------------------------------------------------------------------------------------
Best accuracy : 0.7635009310986964
Best Parameters: {'learning_rate': 0.01, 'n_estimators': 500}
Average Time to Fit (s): 0.203
Average Time to Score (s): 0.002


mean-min-max
------------------------------------------------------------------------------------------
Best accuracy : 0.7653631284916201
Best Parameters: {'learning_rate': 0.05, 'n_estimators': 100}
Average Time to Fit (s): 0.209
Average Time to Score (s): 0.002


median-min-max
------------------------------------------------------------------------------------------
Best accuracy : 0.7635009310986964
Best Parameters: {'learning_rate': 0.05, 'n_estimators': 5

Gradient boosting takes a while to fit, mainly because it fits the weak learners in sequence. The best score we got is 0.7654.

Now that we have gathered all the scores and parameters from the models we fit, we summarize them in the tables below.

In [44]:
# creating a dataframe for the score
all_score_df = pd.DataFrame(all_score, index=['mean-z-score', 'median-z-score', 'mean-min-max',
                                              'median-min-max', 'mean-z-score-PCA', 'median-z-score-PCA',
                                              'mean-min-max-PCA', 'median-min-max-PCA'])
all_score_df

| Dataset | Naive_Bayes | KNN | Logistic_regression | Decision_tree | Random_forest | LSVM | SVM | Gradient_boosting |
|---|---|---|---|---|---|---|---|---|
| mean-z-score | 0.748603 | 0.77095 | 0.77095 | 0.73743 | 0.769088 | 0.765363 | 0.763501 | 0.761639 |
| median-z-score | 0.744879 | 0.776536 | 0.77095 | 0.746741 | 0.774674 | 0.767225 | 0.761639 | 0.763501 |
| mean-min-max | 0.748603 | 0.776536 | 0.772812 | 0.741155 | 0.776536 | 0.776536 | 0.774674 | 0.765363 |
| median-min-max | 0.744879 | 0.778399 | 0.772812 | 0.743017 | 0.778399 | 0.772812 | 0.772812 | 0.763501 |
| mean-z-score-PCA | 0.700186 | 0.728119 | 0.72067 | 0.705773 | 0.743017 | 0.726257 | 0.729981 | 0.743017 |
| median-z-score-PCA | 0.6946 | 0.72067 | 0.726257 | 0.728119 | 0.746741 | 0.718808 | 0.718808 | 0.743017 |
| mean-min-max-PCA | 0.700186 | 0.731844 | 0.735568 | 0.722533 | 0.746741 | 0.73743 | 0.73743 | 0.73743 |
| median-min-max-PCA | 0.703911 | 0.724395 | 0.73743 | 0.728119 | 0.761639 | 0.739292 | 0.739292 | 0.744879 |


In [45]:
all_params_df = pd.DataFrame(all_params, index=['mean-z-score', 'median-z-score', 'mean-min-max',
                                              'median-min-max', 'mean-z-score-PCA', 'median-z-score-PCA',
                                              'mean-min-max-PCA', 'median-min-max-PCA'])
all_params_df

| Dataset | Naive_Bayes | KNN | Logistic_regression | Decision_tree | Random_forest | LSVM | SVM | Gradient_boosting |
|---|---|---|---|---|---|---|---|---|
| mean-z-score | `{'var_smoothing': 1e-11}` | `{'n_neighbors': 9}` | `{'C': 0.1, 'penalty': 'l1'}` | `{'criterion': 'gini', 'max_depth': 5}` | `{'max_depth': 5, 'n_estimators': 50}` | `{'C': 0.01, 'penalty': 'l2'}` | `{'C': 1, 'kernel': 'rbf'}` | `{'learning_rate': 0.05, 'n_estimators': 100}` |
| median-z-score | `{'var_smoothing': 1e-11}` | `{'n_neighbors': 11}` | `{'C': 0.1, 'penalty': 'l1'}` | `{'criterion': 'gini', 'max_depth': 5}` | `{'max_depth': 7, 'n_estimators': 500}` | `{'C': 0.1, 'penalty': 'l2'}` | `{'C': 1, 'kernel': 'rbf'}` | `{'learning_rate': 0.01, 'n_estimators': 500}` |
| mean-min-max | `{'var_smoothing': 1e-11}` | `{'n_neighbors': 9}` | `{'C': 1, 'penalty': 'l2'}` | `{'criterion': 'gini', 'max_depth': 5}` | `{'max_depth': 9, 'n_estimators': 50}` | `{'C': 0.1, 'penalty': 'l2'}` | `{'C': 100, 'kernel': 'poly'}` | `{'learning_rate': 0.05, 'n_estimators': 100}` |
| median-min-max | `{'var_smoothing': 1e-11}` | `{'n_neighbors': 9}` | `{'C': 1, 'penalty': 'l2'}` | `{'criterion': 'gini', 'max_depth': 5}` | `{'max_depth': 9, 'n_estimators': 100}` | `{'C': 0.1, 'penalty': 'l2'}` | `{'C': 100, 'kernel': 'poly'}` | `{'learning_rate': 0.05, 'n_estimators': 50}` |
| mean-z-score-PCA | `{'var_smoothing': 1e-11}` | `{'n_neighbors': 7}` | `{'C': 0.1, 'penalty': 'l1'}` | `{'criterion': 'gini', 'max_depth': 3}` | `{'max_depth': 3, 'n_estimators': 500}` | `{'C': 10, 'penalty': 'l2'}` | `{'C': 10, 'kernel': 'rbf'}` | `{'learning_rate': 0.1, 'n_estimators': 50}` |
| median-z-score-PCA | `{'var_smoothing': 1e-11}` | `{'n_neighbors': 7}` | `{'C': 0.1, 'penalty': 'l1'}` | `{'criterion': 'entropy', 'max_depth': 5}` | `{'max_depth': 3, 'n_estimators': 100}` | `{'C': 0.01, 'penalty': 'l2'}` | `{'C': 1, 'kernel': 'rbf'}` | `{'learning_rate': 0.05, 'n_estimators': 50}` |
| mean-min-max-PCA | `{'var_smoothing': 1e-11}` | `{'n_neighbors': 5}` | `{'C': 10, 'penalty': 'l2'}` | `{'criterion': 'gini', 'max_depth': 9}` | `{'max_depth': 7, 'n_estimators': 10}` | `{'C': 10, 'penalty': 'l2'}` | `{'C': 1, 'kernel': 'rbf'}` | `{'learning_rate': 0.01, 'n_estimators': 100}` |
| median-min-max-PCA | `{'var_smoothing': 1e-11}` | `{'n_neighbors': 7}` | `{'C': 10, 'penalty': 'l2'}` | `{'criterion': 'gini', 'max_depth': 3}` | `{'max_depth': 5, 'n_estimators': 10}` | `{'C': 100, 'penalty': 'l2'}` | `{'C': 1, 'kernel': 'rbf'}` | `{'learning_rate': 0.05, 'n_estimators': 50}` |


In [47]:
# sort the average score of all algorithms
all_score_df.mean().sort_values(ascending=False)

Random_forest          0.762104
Gradient_boosting      0.752793
Logistic_regression    0.750931
KNN                    0.750931
LSVM                   0.750466
SVM                    0.749767
Decision_tree          0.731611
Naive_Bayes            0.723231
dtype: float64

If we take the average score over all the datasets, we have a winner: the random forest algorithm outperforms the other algorithms, and the simple Naive Bayes algorithm has, as expected, the lowest average score.
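
Averaging over all eight datasets mixes the feature-engineered and PCA variants together. As a quick check, the sketch below recomputes the means separately for the two groups, using the `Random_forest` and `KNN` columns copied from the score table above. It uses plain Python lists instead of `all_score_df` only so that it runs on its own.

```python
from statistics import mean

# Scores copied from the summary table, in row order:
# the first four entries are the non-PCA datasets, the last four the PCA datasets.
random_forest = [0.769088, 0.774674, 0.776536, 0.778399,
                 0.743017, 0.746741, 0.746741, 0.761639]
knn = [0.770950, 0.776536, 0.776536, 0.778399,
       0.728119, 0.720670, 0.731844, 0.724395]

for name, scores in [("Random_forest", random_forest), ("KNN", knn)]:
    overall = mean(scores)
    no_pca = mean(scores[:4])
    pca = mean(scores[4:])
    print(f"{name}: overall={overall:.6f} non-PCA={no_pca:.6f} PCA={pca:.6f}")
```

On these numbers the two algorithms are nearly tied on the non-PCA datasets, with KNN marginally ahead, so the lead of random forest in the overall average appears to come mainly from its smaller drop on the PCA data.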