In [1]:
import pandas as pd

In [133]:
df_log = pd.read_csv('../data/scores_LogisticRegression.csv')
df_xgboost = pd.read_csv('../data/scores_XGBoost.csv')
df_nb = pd.read_csv('../data/scores_MulinomialNaiveBayes.csv')
df_rf = pd.read_csv('../data/scores_RandomForest.csv')

### Model Evaluation
When evaluating our models' performance, we specified balanced accuracy and recall as the two target metrics. Recall is the primary target metric, while balanced accuracy serves as a check that the model performs well overall. It is important that the model performs well on both classes (balanced accuracy), but we are more interested in ensuring that the model picks up as many positive cases as possible (recall). Precision and f1_score are listed but were not used for determining the best model.
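To make the two target metrics concrete, the sketch below computes them from confusion-matrix counts. The counts are made up for illustration and are not taken from our data, and the function names are our own.

```python
def recall(tp, fn):
    """Fraction of actual positives that the model flags (true positive rate)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that the model clears (true negative rate)."""
    return tn / (tn + fp)


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of the per-class recalls, so both classes count equally."""
    return (recall(tp, fn) + specificity(tn, fp)) / 2


# Hypothetical counts for an imbalanced test set (illustrative only).
tp, fn, tn, fp = 80, 20, 700, 300

print(recall(tp, fn))                     # 0.8
print(balanced_accuracy(tp, fn, tn, fp))  # 0.75
```

A model that predicts the negative class for every case scores a recall of 0 and a balanced accuracy of 0.5, which is the pattern visible in the weakest rows of the tables in this notebook.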

In [134]:
# Use the first CSV column (read in as 'Unnamed: 0') as the index and name it.
# Note: this modifies the dataframe in place and returns None.
def fix_index(df):
    df.set_index('Unnamed: 0', inplace=True)
    df.index = df.index.rename('Type of Model')

### Naive Bayes Evaluation

In [135]:
fix_index(df_nb)
df_nb

Type of Model,balanced_accuracy,recall,precision,f1_score
impute with median,0.557551,0.828635,0.094165,0.169112
gridsearch over params,0.562204,0.812309,0.095597,0.171063
impute with median and group missing into one category,0.556652,0.828635,0.09395,0.168766
MinMax Scaler,0.686953,0.483322,0.283363,0.357267
TruncatedSVD,0.5,0.0,0.0,0.0
RandomOverSampler,0.555571,0.830867,0.093658,0.16834
ROS with MinMaxScaler,0.747664,0.736669,0.214596,0.332371
ROS w/ Polynomial feature w/ MMS,0.747568,0.726216,0.219551,0.337169
SMOTE,0.554508,0.830162,0.093419,0.16794
SMOTE w/ MinMax,0.751111,0.754287,0.211271,0.330087


Various methods of Multinomial Naive Bayes classification were attempted. The SMOTE with MinMax method is the preferred model, as it had the best average of our two target metrics.
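The selection rule described above can be sketched as follows. The rows are copied from the Naive Bayes table; the `scores`, `target_mean`, and `ranking` names are ours, and a plain unweighted mean of the two target metrics is assumed.

```python
import pandas as pd

# A few rows copied from the Naive Bayes table above.
scores = pd.DataFrame(
    {
        "balanced_accuracy": [0.557551, 0.686953, 0.747664, 0.751111],
        "recall": [0.828635, 0.483322, 0.736669, 0.754287],
    },
    index=[
        "impute with median",
        "MinMax Scaler",
        "ROS with MinMaxScaler",
        "SMOTE w/ MinMax",
    ],
)

# Average the two target metrics per model and sort from best to worst.
target_mean = scores[["balanced_accuracy", "recall"]].mean(axis=1)
ranking = target_mean.sort_values(ascending=False)

print(ranking.index[0])  # SMOTE w/ MinMax
```

Ranking on recall alone would favor the unscaled models, which reach a recall near 0.83 but a balanced accuracy barely above 0.55, so averaging the two metrics guards against that trade-off.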

In [136]:
# Save the chosen row to a dataframe of the best models
df_best_models = df_nb.loc[['SMOTE w/ MinMax']]
df_best_models = df_best_models.rename(index={'SMOTE w/ MinMax': 'Best NB Model'})


### Logistic Regression Evaluation

In [137]:
fix_index(df_log)
df_log

Type of Model,balanced_accuracy,recall,precision,f1_score
lgr,0.559662,0.129669,0.528736,0.208263
lgr_oversample,0.671484,0.402044,0.378567,0.389952
lgr_oversample_bac,0.776381,0.747592,0.255664,0.381024
lgr_smote_bac,0.778651,0.797393,0.229165,0.356014
lgr_adasyn_bac,0.778694,0.813249,0.221497,0.348167
lgr_weight_bac,0.784727,0.811017,0.231083,0.359682
lgr_weight_tune_bac,0.784668,0.8109,0.231058,0.35964


In terms of recall, the ADASYN method had the highest score (0.813). However, the weighted model (lgr_weight_bac) runs much more quickly and has a higher balanced accuracy (0.785 vs. 0.779). Since the difference in recall between the two models is marginal (0.811 vs. 0.813), the weighted model is the top choice among the logistic regression models.

In [138]:
# Add the chosen logistic regression row to the dataframe of best models
new_row = df_log.loc[['lgr_weight_bac']]
df_best_models = pd.concat([df_best_models, new_row])
df_best_models = df_best_models.rename(index={'lgr_weight_bac': 'Best Logistic Model'})
df_best_models

Type of Model,balanced_accuracy,recall,precision,f1_score
Best NB Model,0.751111,0.754287,0.211271,0.330087
Best Logistic Model,0.784727,0.811017,0.231083,0.359682


The weighted Logistic Regression model performed better than the Naive Bayes model on both recall and balanced_accuracy.

### XGBoost Evaluation

In [139]:
fix_index(df_xgboost)
df_xgboost

Type of Model,balanced_accuracy,recall,precision,f1_score
xgboost,0.786722,0.819591,0.229615,0.358729


The XGBoost model had a higher balanced accuracy and recall than any of the logistic regression models.

In [140]:
# Add the XGBoost row to the dataframe of best models
df_best_models = pd.concat([df_best_models, df_xgboost])
df_best_models = df_best_models.rename(index={'xgboost': 'XGBoost Model'})
df_best_models

Type of Model,balanced_accuracy,recall,precision,f1_score
Best NB Model,0.751111,0.754287,0.211271,0.330087
Best Logistic Model,0.784727,0.811017,0.231083,0.359682
XGBoost Model,0.786722,0.819591,0.229615,0.358729


The XGBoost model outperformed the best logistic regression model on both recall and balanced accuracy.

### Random Forest Evaluation

In [141]:
fix_index(df_rf)
df_rf

Type of Model,balanced_accuracy,recall,precision,f1_score
imbalanced data,0.500526,0.001062,0.9,0.002122
oversampling,0.77202,0.826056,0.206886,0.330899
SMOTE,0.701753,0.517111,0.288441,0.37032
ADASYN,0.705567,0.525844,0.289897,0.373747
Hypertuned Oversampling,0.776585,0.76717,0.241997,0.367933


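The oversampling method yielded the best results, with a high balanced accuracy and a higher recall (0.826) than any of the best models selected so far. Hypertuned oversampling had a slightly higher balanced accuracy (0.777 vs. 0.772) but a noticeably lower recall (0.767).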

In [142]:
# Add the chosen random forest row to the dataframe of best models
new_row2 = df_rf.loc[['oversampling']]
df_best_models = pd.concat([df_best_models, new_row2])
df_best_models = df_best_models.rename(index={'oversampling': 'Best Random Forest Model'})
df_best_models

Type of Model,balanced_accuracy,recall,precision,f1_score
Best NB Model,0.751111,0.754287,0.211271,0.330087
Best Logistic Model,0.784727,0.811017,0.231083,0.359682
XGBoost Model,0.786722,0.819591,0.229615,0.358729
Best Random Forest Model,0.77202,0.826056,0.206886,0.330899


We believe that the best model overall is the Random Forest model with oversampling. Although its balanced accuracy (0.772) is lower than that of the logistic regression and XGBoost models, it has the highest recall (0.826) of the four models compared, which means it had the lowest number of false negatives. We chose this model because it had the highest true positive rate (recall) and only a marginally worse balanced accuracy.

### Conclusion and Recommendations

The Random Forest model developed here can be used by healthcare professionals to identify individuals at risk of having heart disease. This model had the highest recall score of the best models compared and a balanced accuracy only slightly lower than the others.
We recommend using this model as an initial screening tool to identify patients who may have heart disease.


Currently, the model predicts that an individual has heart disease if the predicted probability is greater than 50%. The model can be adjusted to instead output the probability itself, which may be more useful to healthcare professionals. An example of this can be seen in the random forest model notebook.
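As a rough illustration of the idea, the sketch below applies an adjustable cutoff to predicted probabilities. It assumes the probabilities come from something like scikit-learn's `predict_proba` on the fitted random forest. The probability values, the `classify` helper, and the 0.3 cutoff are hypothetical and are not taken from the random forest notebook.

```python
def classify(probabilities, threshold=0.5):
    """Label a case positive (1) when its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


# Hypothetical positive-class probabilities for five patients.
probabilities = [0.10, 0.35, 0.50, 0.62, 0.90]

default_labels = classify(probabilities)         # cutoff of 50%
screening_labels = classify(probabilities, 0.3)  # lower cutoff flags more patients

print(default_labels)    # [0, 0, 0, 1, 1]
print(screening_labels)  # [0, 1, 1, 1, 1]
```

Lowering the cutoff can only keep or raise recall, usually at the cost of precision, so the choice of cutoff is a judgment for the healthcare professionals using the screening.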