# Tree Models with Imbalanced Data
- Models tested
    - Decision Tree Pipeline with Stratified Cross Validation
    - Decision Tree Pipeline with SMOTE and Stratified Cross Validation
    - RandomForest Pipeline with Stratified Cross Validation
    - RandomForest Pipeline with SMOTE and Stratified Cross Validation
    - BalancedRandomForest
    - AdaBoost Pipeline with Stratified Cross Validation
    - AdaBoost Pipeline with SMOTE and Stratified Cross Validation
    
- Tree models are not supposed to be very good with imbalanced data. On a highly imbalanced dataset, the number of minority samples in each bootstrapped dataset can vary significantly.
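
To make the bootstrap point concrete, here is a minimal sketch that draws bootstrap samples from a label vector with the same class counts as the training split reported later in this notebook (290 majority, 44 minority) and counts how many minority rows land in each sample. The helper name `minority_counts_in_bootstraps` and the choice of 150 draws are assumptions made for illustration; this is not how scikit-learn builds its trees internally.

```python
import random
from statistics import mean


def minority_counts_in_bootstraps(n_majority, n_minority, n_bootstraps, seed=0):
    """Draw bootstrap samples (with replacement) and count minority rows in each."""
    rng = random.Random(seed)
    labels = [0] * n_majority + [1] * n_minority
    counts = []
    for _ in range(n_bootstraps):
        # A bootstrap sample has the same size as the original data.
        sample = rng.choices(labels, k=len(labels))
        counts.append(sum(sample))
    return counts


counts = minority_counts_in_bootstraps(290, 44, n_bootstraps=150)
print(f"minority rows per bootstrap: min={min(counts)}, "
      f"max={max(counts)}, mean={mean(counts):.1f}")
```

The spread between the smallest and largest count gives a feel for how differently individual trees may see the minority class.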

In [48]:
# Dependencies

from pathlib import Path

import pandas as pd
from statistics import mean, mode
import matplotlib.pyplot as plt

import joblib
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score, cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from imblearn.ensemble import BalancedRandomForestClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import roc_curve, precision_recall_curve, auc, roc_auc_score
from scikitplot.metrics import plot_roc, plot_precision_recall, plot_cumulative_gain, plot_lift_curve
from collections import Counter

# Note:  May need to use class weight - see _-model-weights.ipynb


In [49]:
# Loading data
# file_path = Path("../data/myopia.csv")
file_path = Path("../eda/reduced_filtered_df.csv")
df = pd.read_csv(file_path)
df.head(1)

Unnamed: 0,ACD,LT,VCD,SPORTHR,DADMY,delta_spheq,total_positive_screen,MYOPIC
0,3.702,3.392,15.29,4,1,1.358,8,0


In [50]:
# Check dataset balance
df["MYOPIC"].value_counts()

0    323
1     49
Name: MYOPIC, dtype: int64

In [51]:
# Define X,y
label = df["MYOPIC"]
X = df.iloc[:,:-1].copy()
X.head()

Unnamed: 0,ACD,LT,VCD,SPORTHR,DADMY,delta_spheq,total_positive_screen
0,3.702,3.392,15.29,4,1,1.358,8
1,3.462,3.514,15.52,14,0,1.929,10
2,3.224,3.556,15.36,10,1,2.494,26
3,3.186,3.654,15.49,12,1,1.433,16
4,3.732,3.584,15.08,12,0,2.022,8


In [52]:
# Note the use of stratify, since the dataset is imbalanced.
# I am isolating X_test and y_test from the preprocessing.
X_train, X_test, y_train, y_test = train_test_split(X, label, random_state=42, test_size=0.1, stratify=label)

In [53]:
y_train.value_counts()

0    290
1     44
Name: MYOPIC, dtype: int64

In [54]:
y_test.value_counts()

0    33
1     5
Name: MYOPIC, dtype: int64
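
A quick check, using only the class counts printed above, that the stratified split kept the minority share nearly identical in the full data, the training set, and the test set. The helper `minority_share` is a name introduced here for illustration.

```python
def minority_share(n_majority, n_minority):
    """Fraction of rows that belong to the minority (MYOPIC == 1) class."""
    return n_minority / (n_majority + n_minority)


# (majority, minority) counts copied from the value_counts() outputs above.
splits = {"full": (323, 49), "train": (290, 44), "test": (33, 5)}

for name, (n0, n1) in splits.items():
    print(f"{name:>5}: minority share = {minority_share(n0, n1):.3f}")
```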

## Decision Trees

### Decision Tree Pipeline with Stratified Cross Validation

In [55]:
model = DecisionTreeClassifier(random_state=1)
scoring = ('f1', 'recall', 'precision', 'roc_auc')
steps = [('model', model)]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=2)

In [56]:
scores = cross_validate(pipeline, X_train, y_train, scoring=scoring, cv=cv, n_jobs=-1)

In [57]:
print(f'Mode of Scores: {mode(scores["test_roc_auc"])}')
print("--------"*10)
print('Mean f1: %.3f' % mean(scores['test_f1']))
print('Mean recall: %.3f' % mean(scores['test_recall']))
print('Mean precision: %.3f' % mean(scores['test_precision']))
print('Mean ROC AUC: %.3f' % mean(scores['test_roc_auc']))

Mode of Scores: 0.5560344827586207
--------------------------------------------------------------------------------
Mean f1: 0.314
Mean recall: 0.358
Mean precision: 0.320
Mean ROC AUC: 0.621


In [58]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

In [59]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.91      0.92        33
           1       0.50      0.60      0.55         5

    accuracy                           0.87        38
   macro avg       0.72      0.75      0.73        38
weighted avg       0.88      0.87      0.87        38
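
With only 5 minority samples in the test set, each one moves recall by 20 points, so this report should be read with care. The sketch below recomputes the class 1 metrics from confusion-matrix counts that I inferred from the rounded report (3 true positives, 3 false positives, 2 false negatives, 30 true negatives); those counts are an assumption, not something the notebook printed.

```python
def report_from_counts(tp, fp, fn, tn):
    """Recompute minority-class (label 1) metrics and accuracy from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts inferred from the rounded report above (an assumption, not notebook output):
# 3 of the 5 myopic cases found, 3 false alarms among the 33 non-myopic cases.
inferred = report_from_counts(tp=3, fp=3, fn=2, tn=30)
for name, value in inferred.items():
    print(f"{name}: {value:.2f}")
```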



### Decision Tree Pipeline with SMOTE and Stratified Cross Validation
- Good explanation of stratified sampling - https://medium.com/sfu-cspmp/surviving-in-a-random-forest-with-imbalanced-datasets-b98b963d52eb
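
Because `SMOTE` is a step inside the imbalanced-learn `Pipeline`, oversampling is applied only when the pipeline is fitted, so the validation folds stay untouched. As a rough illustration of what the oversampling step does, the sketch below creates one synthetic minority point by interpolating between a minority row and one of its nearest minority neighbours. It is a simplified, standard-library version of the idea, not imbalanced-learn's implementation; `smote_like_sample` and the toy points are made up for this example.

```python
import math
import random


def smote_like_sample(minority, k=5, seed=0):
    """Interpolate between a random minority row and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # Rank the other minority rows by Euclidean distance to the chosen row.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    synthetic = tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
    return base, neighbour, synthetic


# Hypothetical 2-D minority rows (not taken from the myopia data).
toy_minority = [(3.2, 15.1), (3.4, 15.4), (3.5, 15.0), (3.7, 15.6), (3.3, 15.3), (3.6, 15.2)]
base, neighbour, synthetic = smote_like_sample(toy_minority, k=3, seed=1)
print("base:", base, "neighbour:", neighbour, "synthetic:", synthetic)
```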

In [60]:
scoring = ('f1', 'recall', 'precision', 'roc_auc')
steps = [('over', SMOTE(random_state=3)), ('model', DecisionTreeClassifier(random_state=1))]
pipeline = Pipeline(steps=steps)

# evaluate pipeline
# Note: for imbalanced classification, use stratified k-fold rather than plain k-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=2)
scores = cross_validate(pipeline, X_train, y_train, scoring=scoring, cv=cv, n_jobs=-1)

In [61]:
print('Mean f1: %.3f' % mean(scores['test_f1']))
print('Mean recall: %.3f' % mean(scores['test_recall']))
print('Mean precision: %.3f' % mean(scores['test_precision']))
print('Mean ROC AUC: %.3f' % mean(scores['test_roc_auc']))

Mean f1: 0.384
Mean recall: 0.473
Mean precision: 0.337
Mean ROC AUC: 0.660


In [62]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

In [63]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.88      0.88      0.88        33
           1       0.20      0.20      0.20         5

    accuracy                           0.79        38
   macro avg       0.54      0.54      0.54        38
weighted avg       0.79      0.79      0.79        38



## Random Forest

### RandomForest Pipeline with Stratified Cross Validation (imbalanced data)

In [64]:
irfc = RandomForestClassifier(n_estimators=150, random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=2)

In [65]:
# Printout of the folds
X_train_np = X_train.to_numpy()
y_train_np = y_train.to_numpy()
for train_index, test_index in cv.split(X_train_np, y_train_np):
    
    # select rows
    train_X, test_X = X_train_np[train_index], X_train_np[test_index]
    train_y, test_y = y_train_np[train_index], y_train_np[test_index]
    # summarize train and test composition
    train_0, train_1 = len(train_y[train_y==0]), len(train_y[train_y==1])
    test_0, test_1 = len(test_y[test_y==0]), len(test_y[test_y==1])
    print('>Train: 0=%d, 1=%d, Test: 0=%d, 1=%d' % (train_0, train_1, test_0, test_1))

>Train: 0=261, 1=39, Test: 0=29, 1=5
>Train: 0=261, 1=39, Test: 0=29, 1=5
>Train: 0=261, 1=39, Test: 0=29, 1=5
>Train: 0=261, 1=39, Test: 0=29, 1=5
>Train: 0=261, 1=40, Test: 0=29, 1=4
>Train: 0=261, 1=40, Test: 0=29, 1=4
>Train: 0=261, 1=40, Test: 0=29, 1=4
>Train: 0=261, 1=40, Test: 0=29, 1=4
>Train: 0=261, 1=40, Test: 0=29, 1=4
>Train: 0=261, 1=40, Test: 0=29, 1=4
>Train: 0=261, 1=39, Test: 0=29, 1=5
>Train: 0=261, 1=39, Test: 0=29, 1=5
>Train: 0=261, 1=39, Test: 0=29, 1=5
>Train: 0=261, 1=39, Test: 0=29, 1=5
>Train: 0=261, 1=40, Test: 0=29, 1=4
>Train: 0=261, 1=40, Test: 0=29, 1=4
>Train: 0=261, 1=40, Test: 0=29, 1=4
>Train: 0=261, 1=40, Test: 0=29, 1=4
>Train: 0=261, 1=40, Test: 0=29, 1=4
>Train: 0=261, 1=40, Test: 0=29, 1=4
>Train: 0=261, 1=39, Test: 0=29, 1=5
>Train: 0=261, 1=39, Test: 0=29, 1=5
>Train: 0=261, 1=39, Test: 0=29, 1=5
>Train: 0=261, 1=39, Test: 0=29, 1=5
>Train: 0=261, 1=40, Test: 0=29, 1=4
>Train: 0=261, 1=40, Test: 0=29, 1=4
>Train: 0=261, 1=40, Test: 0=29, 1=4
>
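
The fold sizes printed above follow from spreading each class as evenly as possible over the 10 test folds. The sketch below reproduces those sizes with simple arithmetic; `stratified_fold_sizes` is a hypothetical helper, and scikit-learn's actual allocation logic is more involved than this.

```python
def stratified_fold_sizes(n_class, n_splits):
    """Spread n_class rows as evenly as possible across n_splits test folds."""
    base, extra = divmod(n_class, n_splits)
    # The first `extra` folds receive one additional row.
    return [base + 1] * extra + [base] * (n_splits - extra)


minority_folds = stratified_fold_sizes(44, 10)
majority_folds = stratified_fold_sizes(290, 10)

for n0, n1 in zip(majority_folds, minority_folds):
    print(f"Train: 0={290 - n0}, 1={44 - n1}, Test: 0={n0}, 1={n1}")
```

Each test fold therefore holds only 4 or 5 minority samples, which is why the per-fold precision and recall are so coarse.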

In [66]:
scoring = ('f1', 'recall', 'precision', 'roc_auc')
steps = [('model', irfc)]
pipeline = Pipeline(steps=steps)

In [67]:
#Evaluate irfc model
'''
When using X_train, y_train, the following warning occurs:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to 
no predicted samples. Use `zero_division` parameter to control this behavior

This is because, in some folds, the model does not predict any minority-class samples.
'''
scores = cross_validate(pipeline, X_train, y_train, scoring=scoring, cv=cv)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
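
The warning is easy to reproduce by hand: precision is TP / (TP + FP), and when a fold gets no positive predictions the denominator is zero. A minimal sketch, with a `zero_division` argument that mimics the fallback value named in the warning (the function itself is written for illustration and is not scikit-learn's code):

```python
def precision(y_true, y_pred, zero_division=0.0):
    """Precision for the positive class (label 1), with a fallback when nothing is predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fp == 0:
        # No positive predictions at all: precision is undefined.
        return zero_division
    return tp / (tp + fp)


# A fold shaped like the ones above: 29 majority rows, 5 minority rows,
# and a model that predicts the majority class for every row.
y_true = [0] * 29 + [1] * 5
y_pred = [0] * 34
print(precision(y_true, y_pred))
```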


In [68]:
print('Mean f1: %.3f' % mean(scores['test_f1']))
print('Mean recall: %.3f' % mean(scores['test_recall']))
print('Mean precision: %.3f' % mean(scores['test_precision']))
print('AUC: %.3f' % mean(scores['test_roc_auc']))
print('-----'*20)
print('Precision Results for each fold')
scores['test_precision']

Mean f1: 0.220
Mean recall: 0.163
Mean precision: 0.378
AUC: 0.802
----------------------------------------------------------------------------------------------------
Precision Results for each fold


array([0.        , 0.        , 0.33333333, 0.        , 0.        ,
       0.        , 0.        , 1.        , 0.        , 0.33333333,
       0.5       , 0.        , 0.66666667, 0.        , 0.5       ,
       0.66666667, 0.        , 1.        , 1.        , 0.5       ,
       0.5       , 1.        , 1.        , 0.        , 0.5       ,
       0.        , 1.        , 0.5       , 0.33333333, 0.        ])
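
The mean alone hides how unstable these folds are. The sketch below summarises the 30 per-fold precision values printed above (entered as fractions where the output shows 0.333 or 0.667) by their mean, standard deviation, and the number of folds that scored zero.

```python
from statistics import mean, stdev

# Per-fold precision copied from the array above.
fold_precision = [
    0, 0, 1/3, 0, 0,
    0, 0, 1, 0, 1/3,
    0.5, 0, 2/3, 0, 0.5,
    2/3, 0, 1, 1, 0.5,
    0.5, 1, 1, 0, 0.5,
    0, 1, 0.5, 1/3, 0,
]

zero_folds = sum(1 for p in fold_precision if p == 0)
print(f"mean={mean(fold_precision):.3f}, stdev={stdev(fold_precision):.3f}, "
      f"zero-precision folds={zero_folds} of {len(fold_precision)}")
```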

In [69]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

In [70]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      1.00      0.96        33
           1       1.00      0.40      0.57         5

    accuracy                           0.92        38
   macro avg       0.96      0.70      0.76        38
weighted avg       0.93      0.92      0.91        38



### RandomForest Pipeline with SMOTE and Stratified Cross Validation

In [71]:
irfc = RandomForestClassifier(n_estimators=150, random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=2)
scoring = ('f1', 'recall', 'precision', 'roc_auc')
steps = [('over', SMOTE(random_state=3)), ('model', irfc)]
pipeline = Pipeline(steps=steps)
scores = cross_validate(pipeline, X_train, y_train, scoring=scoring, cv=cv)
print('Mean f1: %.3f' % mean(scores['test_f1']))
print('Mean recall: %.3f' % mean(scores['test_recall']))
print('Mean precision: %.3f' % mean(scores['test_precision']))
print('AUC: %.3f' % mean(scores['test_roc_auc']))
print('-----'*20)
print('Precision Results for each fold')
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

Mean f1: 0.414
Mean recall: 0.490
Mean precision: 0.371
AUC: 0.811
----------------------------------------------------------------------------------------------------
Precision Results for each fold
              precision    recall  f1-score   support

           0       0.94      0.88      0.91        33
           1       0.43      0.60      0.50         5

    accuracy                           0.84        38
   macro avg       0.68      0.74      0.70        38
weighted avg       0.87      0.84      0.85        38



### BalancedRandomForest Pipeline (imbalanced data)

In [72]:
brfc = BalancedRandomForestClassifier(n_estimators=150, random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=2)
scoring1 = ('f1', 'recall', 'precision', 'roc_auc')

steps1 = [('model1', brfc)]
pipeline = Pipeline(steps=steps1)

In [73]:
X_train.values

array([[ 3.23799992,  3.62599993, 14.34000015, ...,  1.        ,
         1.50700003, 11.        ],
       [ 3.68799996,  3.7980001 , 14.38000011, ...,  1.        ,
         0.833     , 16.        ],
       [ 3.49000001,  3.49799991, 14.88000011, ...,  1.        ,
         1.62099999, 18.        ],
       ...,
       [ 3.4519999 ,  3.73799992, 16.20999908, ...,  0.        ,
         1.91999996, 10.        ],
       [ 3.62199998,  3.75200009, 14.32999992, ...,  1.        ,
         1.16499999,  8.        ],
       [ 3.78600001,  3.51200008, 15.03999996, ...,  1.        ,
         1.10699999, 14.        ]])

In [74]:
y_train.shape

(334,)

In [75]:
#Evaluate brfc model
# Note: needed to upgrade imbalanced-learn to at least 0.9.1 and scikit-learn to at least 1.1.1
scores = cross_validate(pipeline, X_train.values, y_train.values, cv=cv, scoring=scoring1)
#brfc.fit(X_train.values.reshape(-1, 1), y_train)

In [76]:
print('Mean f1: %.3f' % mean(scores['test_f1']))
print('Mean recall: %.3f' % mean(scores['test_recall']))
print('Mean precision: %.3f' % mean(scores['test_precision']))
print('Mean AUC: %.3f' % mean(scores['test_roc_auc']))

Mean f1: 0.424
Mean recall: 0.772
Mean precision: 0.296
Mean AUC: 0.830


In [77]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

In [78]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.85      0.90        33
           1       0.44      0.80      0.57         5

    accuracy                           0.84        38
   macro avg       0.70      0.82      0.74        38
weighted avg       0.90      0.84      0.86        38



## AdaBoost

### AdaBoost Pipeline with Stratified Cross Validation

In [79]:
iabc = AdaBoostClassifier(n_estimators=150, random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=2)
scoring = ('f1', 'recall', 'precision', 'roc_auc')
steps = [('model', iabc)]
pipeline = Pipeline(steps=steps)

In [80]:
scores = cross_validate(pipeline, X_train, y_train, scoring=scoring, cv=cv)
print('Mean f1: %.3f' % mean(scores['test_f1']))
print('Mean recall: %.3f' % mean(scores['test_recall']))
print('Mean precision: %.3f' % mean(scores['test_precision']))
print('AUC: %.3f' % mean(scores['test_roc_auc']))
print('-----'*20)
print('Precision Results for each fold')
scores['test_precision']

Mean f1: 0.300
Mean recall: 0.287
Mean precision: 0.377
AUC: 0.701
----------------------------------------------------------------------------------------------------
Precision Results for each fold


array([0.33333333, 0.33333333, 0.25      , 0.25      , 0.        ,
       0.5       , 0.        , 0.75      , 0.        , 0.33333333,
       0.5       , 0.5       , 0.        , 0.        , 0.66666667,
       0.5       , 1.        , 0.4       , 0.33333333, 0.25      ,
       0.5       , 0.25      , 1.        , 0.33333333, 0.66666667,
       0.        , 1.        , 0.        , 0.33333333, 0.33333333])

In [81]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

In [82]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.91      0.88      0.89        33
           1       0.33      0.40      0.36         5

    accuracy                           0.82        38
   macro avg       0.62      0.64      0.63        38
weighted avg       0.83      0.82      0.82        38



### AdaBoost Pipeline with SMOTE and Stratified Cross Validation

In [83]:
abc = AdaBoostClassifier(n_estimators=150, random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=2)
scoring = ('f1', 'recall', 'precision', 'roc_auc')
steps = [('over', SMOTE(random_state=3)), ('model', abc)]
pipeline = Pipeline(steps=steps)

In [84]:
scores = cross_validate(pipeline, X_train, y_train, scoring=scoring, cv=cv)
print('Mean f1: %.3f' % mean(scores['test_f1']))
print('Mean recall: %.3f' % mean(scores['test_recall']))
print('Mean precision: %.3f' % mean(scores['test_precision']))
print('AUC: %.3f' % mean(scores['test_roc_auc']))
print('-----'*20)
print('Precision Results for each fold')
scores['test_precision']

Mean f1: 0.433
Mean recall: 0.557
Mean precision: 0.363
AUC: 0.730
----------------------------------------------------------------------------------------------------
Precision Results for each fold


array([0.5       , 0.33333333, 0.375     , 0.2       , 0.33333333,
       0.8       , 0.33333333, 0.66666667, 0.15384615, 0.3       ,
       0.5       , 0.375     , 0.25      , 0.28571429, 0.33333333,
       0.375     , 0.5       , 0.33333333, 0.375     , 0.25      ,
       0.375     , 0.33333333, 0.8       , 0.16666667, 0.375     ,
       0.2       , 0.16666667, 0.42857143, 0.2       , 0.28571429])

In [85]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

In [86]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.85      0.89        33
           1       0.38      0.60      0.46         5

    accuracy                           0.82        38
   macro avg       0.65      0.72      0.68        38
weighted avg       0.86      0.82      0.83        38



## Summary

The best tree model was the BalancedRandomForest when looking at ROC AUC and recall, and its f1 (0.424) was close to the highest f1 (0.433, AdaBoost with SMOTE). Its recall was actually very high, with a mean score of 0.772, and its mean ROC AUC was 0.830. The only metric on which it did not perform as well is precision (0.296). The models that performed best on precision were RandomForest (0.378), AdaBoost (0.377), and RandomForest with SMOTE (0.371).

Overall I would say that the BalancedRandomForest outperformed the other models. As seen in these examples, using SMOTE to balance the data increased overall performance based on the ROC AUC scores. The Decision Trees and SVC performed similarly (ROC AUC of about 0.66 with SMOTE), while the Random Forests were significantly better (above 0.80).

Caveat1:  The one thing I don't like about the BalancedRandomForest is the lack of transparency about how many samples it uses in its fitting.  I believe that this model undersamples, and I already had a fairly small sample.
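
To get a rough feel for Caveat1: imbalanced-learn describes the balanced random forest as randomly under-sampling each bootstrap sample to balance it. The sketch below is a simplification under the assumption that, for each tree, the majority class is cut down to the size of the minority class; it is not the library's implementation, and `balanced_subsample_indices` is a hypothetical helper. With the training counts from this notebook (290 and 44), that would leave about 88 rows per tree.

```python
import random


def balanced_subsample_indices(labels, seed=0):
    """Keep every minority row and draw an equal number of majority rows without replacement."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == 1]
    majority_idx = [i for i, y in enumerate(labels) if y == 0]
    # Assumption: the majority class is reduced to the minority class size.
    kept_majority = rng.sample(majority_idx, k=len(minority_idx))
    return sorted(minority_idx + kept_majority)


labels = [0] * 290 + [1] * 44  # same class counts as y_train
idx = balanced_subsample_indices(labels)
print(f"rows used for one tree: {len(idx)} of {len(labels)}")
```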

Caveat2:  Precision in these models is still low.  

- BalancedRandomForest
    Mean f1: 0.424
    Mean recall: 0.772
    Mean precision: 0.296
    Mean AUC: 0.830

The results below come from the balance-svc-thresholds.ipynb file:

- ADASYN
    Mean f1: 0.279
    Mean recall: 0.597
    Mean precision: 0.185
    Mean ROC AUC: 0.659
- SMOTE
    Mean f1: 0.292
    Mean recall: 0.578
    Mean precision: 0.200
    Mean ROC AUC: 0.662

In [None]:
- Decision Tree Pipeline with Stratified Cross Validation
    Mean f1: 0.314
    Mean recall: 0.358
    Mean precision: 0.320
    Mean ROC AUC: 0.621
- Decision Tree Pipeline with SMOTE and Stratified Cross Validation
    Mean f1: 0.384
    Mean recall: 0.473
    Mean precision: 0.337
    Mean ROC AUC: 0.660
- RandomForest Pipeline with Stratified Cross Validation
    Mean f1: 0.220
    Mean recall: 0.163
    Mean precision: 0.378
    AUC: 0.802
- RandomForest Pipeline with SMOTE and Stratified Cross Validation
    Mean f1: 0.414
    Mean recall: 0.490
    Mean precision: 0.371
    AUC: 0.811
- BalancedRandomForest
    Mean f1: 0.424
    Mean recall: 0.772
    Mean precision: 0.296
    Mean AUC: 0.830
- AdaBoost Pipeline with Stratified Cross Validation
    Mean f1: 0.300
    Mean recall: 0.287
    Mean precision: 0.377
    AUC: 0.701
- AdaBoost Pipeline with SMOTE and Stratified Cross Validation
    Mean f1: 0.433
    Mean recall: 0.557
    Mean precision: 0.363
    AUC: 0.730
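
As a cross-check of the comparison, here is a small sketch that picks the best model per metric from the cross-validation means listed above. The dictionary keys are shortened labels chosen for this sketch, and `best_by` is a helper introduced here.

```python
# Mean cross-validation scores copied from the list above.
results = {
    "Decision Tree": {"f1": 0.314, "recall": 0.358, "precision": 0.320, "roc_auc": 0.621},
    "Decision Tree + SMOTE": {"f1": 0.384, "recall": 0.473, "precision": 0.337, "roc_auc": 0.660},
    "RandomForest": {"f1": 0.220, "recall": 0.163, "precision": 0.378, "roc_auc": 0.802},
    "RandomForest + SMOTE": {"f1": 0.414, "recall": 0.490, "precision": 0.371, "roc_auc": 0.811},
    "BalancedRandomForest": {"f1": 0.424, "recall": 0.772, "precision": 0.296, "roc_auc": 0.830},
    "AdaBoost": {"f1": 0.300, "recall": 0.287, "precision": 0.377, "roc_auc": 0.701},
    "AdaBoost + SMOTE": {"f1": 0.433, "recall": 0.557, "precision": 0.363, "roc_auc": 0.730},
}


def best_by(metric):
    """Return the model name with the highest mean score for the given metric."""
    return max(results, key=lambda name: results[name][metric])


for metric in ("f1", "recall", "precision", "roc_auc"):
    winner = best_by(metric)
    print(f"{metric:>9}: {winner} ({results[winner][metric]:.3f})")
```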