# AdaBoost

In [19]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn import preprocessing
from sklearn import datasets ## Get dataset from sklearn
import sklearn.model_selection as ms
import sklearn.metrics as sklm
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import numpy.random as nr

%matplotlib inline

In this notebook, the AdaBoost method is used to predict heart disease.

In [20]:
Features = np.array(pd.read_csv('Features.csv'))
Labels = np.array(pd.read_csv('Labels.csv'))
Labels = Labels.reshape(Labels.shape[0],)
print(Features.shape)
print(Labels.shape)

(297, 29)
(297,)


In [21]:
nr.seed(123)
inside = ms.KFold(n_splits=10, shuffle = True)
nr.seed(321)
outside = ms.KFold(n_splits=10, shuffle = True)

Boosting combines many weak classifiers and weights them appropriately. A classifier that makes only a few errors receives a large weight; otherwise its weight is correspondingly smaller. The code below searches for the best value of the learning rate (as in the previous notebooks). The learning rate scales the contribution of each weak classifier in the ensemble.
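
To make the weighting idea concrete, here is a minimal sketch of the classifier weight in the two-class discrete (SAMME-style) formulation, alpha = learning_rate * ln((1 - err) / err), where err is the weighted error of one weak classifier. The function name classifier_weight is hypothetical, and the algorithm scikit-learn uses by default may compute its internal weights differently, so this is an illustration rather than the library's implementation.

```python
import math

def classifier_weight(error, learning_rate=1.0):
    """Weight of one weak classifier given its weighted error (0 < error < 1)."""
    if not 0.0 < error < 1.0:
        raise ValueError("error must be strictly between 0 and 1")
    # Few errors -> large weight; error of 0.5 (random guessing) -> weight 0
    return learning_rate * math.log((1.0 - error) / error)

# Compare a few error rates with and without a reduced learning rate
for err in (0.1, 0.3, 0.5):
    full = classifier_weight(err)
    shrunk = classifier_weight(err, learning_rate=0.1)
    print('error %.1f   weight %6.3f   weight at learning_rate 0.1 %6.3f' % (err, full, shrunk))
```

A smaller learning rate shrinks every weight by the same factor, so each weak classifier contributes less and more of them are typically needed.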

In [22]:
## Define the dictionary for the grid search and the model object to search on
param_grid = {"learning_rate": [0.01, 0.1, 1, 10]}
## Define the AdaBoosted tree model
nr.seed(3456)
ab_clf = AdaBoostClassifier()  

## Perform the grid search over the parameters
nr.seed(4455)
ab_clf = ms.GridSearchCV(estimator = ab_clf, param_grid = param_grid, 
                      cv = inside, # Use the inside folds
                      scoring = 'roc_auc',
                      return_train_score = True)
ab_clf.fit(Features, Labels)
print(ab_clf.best_estimator_.learning_rate)

0.1




In [23]:
nr.seed(498)
cv_estimate = ms.cross_val_score(ab_clf, Features, Labels, 
                                 cv = outside) # Use the outside folds

print('Mean performance metric = %4.3f' % np.mean(cv_estimate))
print('STD of the metric       = %4.3f' % np.std(cv_estimate))
print('Outcomes by cv fold')
for i, x in enumerate(cv_estimate):
    print('Fold %2d    %4.3f' % (i+1, x))



Mean performance metric = 0.905
STD of the metric       = 0.057
Outcomes by cv fold
Fold  1    0.920
Fold  2    0.864
Fold  3    0.814
Fold  4    0.884
Fold  5    0.898
Fold  6    0.818
Fold  7    0.964
Fold  8    0.943
Fold  9    0.980
Fold 10    0.964




In [24]:
## Randomly sample cases to create independent training and test data
nr.seed(1115)
indx = range(Features.shape[0])
indx = ms.train_test_split(indx, test_size = 80)
X_train = Features[indx[0],:]
y_train = np.ravel(Labels[indx[0]])
X_test = Features[indx[1],:]
y_test = np.ravel(Labels[indx[1]])

In [25]:
nr.seed(1115)
ab_mod = AdaBoostClassifier(learning_rate = ab_clf.best_estimator_.learning_rate) 
ab_mod.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.1, n_estimators=50, random_state=None)

In [36]:
def score_model(probs, threshold):
    ## Score 1 when the class 1 probability exceeds the threshold, else 0
    return (probs[:,1] > threshold).astype(int)

def print_metrics(labels, probs, threshold):
    scores = score_model(probs, threshold)
    metrics = sklm.precision_recall_fscore_support(labels, scores)
    conf = sklm.confusion_matrix(labels, scores)
    print('                 Confusion matrix')
    print('                 Score positive    Score negative')
    print('Actual positive    %6d' % conf[0,0] + '             %5d' % conf[0,1])
    print('Actual negative    %6d' % conf[1,0] + '             %5d' % conf[1,1])
    print('')
    print('Accuracy        %0.2f' % sklm.accuracy_score(labels, scores))
    print('AUC             %0.2f' % sklm.roc_auc_score(labels, probs[:,1]))
    print('Macro precision %0.2f' % float((float(metrics[0][0]) + float(metrics[0][1]))/2.0))
    print('Macro recall    %0.2f' % float((float(metrics[1][0]) + float(metrics[1][1]))/2.0))
    print(' ')
    print('           Positive      Negative')
    print('Num case   %6d' % metrics[3][0] + '        %6d' % metrics[3][1])
    print('Precision  %6.2f' % metrics[0][0] + '        %6.2f' % metrics[0][1])
    print('Recall     %6.2f' % metrics[1][0] + '        %6.2f' % metrics[1][1])
    print('F1         %6.2f' % metrics[2][0] + '        %6.2f' % metrics[2][1])
    
probabilities = ab_mod.predict_proba(X_test)
print_metrics(y_test, probabilities, 0.55)    

                 Confusion matrix
                 Score positive    Score negative
Actual positive        31                 2
Actual negative        18                29

Accuracy        0.75
AUC             0.91
Macro precision 0.78
Macro recall    0.78
 
           Positive      Negative
Num case       33            47
Precision    0.63          0.94
Recall       0.94          0.62
F1           0.76          0.74


The results are not great. Note that the threshold has already been adjusted to 0.55; without this change the result would be worse. Even so, two cases with disease were misclassified.
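
The effect of the threshold can be illustrated with a small sketch. The probabilities below are hypothetical (they are not produced by the model above), and the helper mirrors score_model on a plain list. Assuming, as the confusion matrix labels suggest, that class 0 is the disease class, raising the threshold on the class 1 probability moves borderline cases into class 0, so fewer disease cases are missed at the price of more false alarms.

```python
def score_from_probs(probs_class1, threshold):
    """Return 1 when the class 1 probability exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probs_class1]

# Hypothetical class 1 probabilities, for illustration only
probs = [0.20, 0.45, 0.52, 0.54, 0.58, 0.70, 0.90]

for threshold in (0.45, 0.50, 0.55):
    scores = score_from_probs(probs, threshold)
    print('threshold %.2f   scored 0: %d   scored 1: %d'
          % (threshold, scores.count(0), scores.count(1)))
```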

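A major disadvantage of the AdaBoost method as used here is that the classes cannot be weighted directly when the model is defined (AdaBoostClassifier has no class_weight parameter). As noted before, the dataset is imbalanced. The approach taken here is to undersample the healthy cases.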
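
As a rough illustration of undersampling, the sketch below keeps every minority case and draws an equally sized random subset of the majority class without replacement, so no case is repeated. The function name undersample, its arguments and the toy data are hypothetical, and it assumes the majority class really is the larger one.

```python
import random

def undersample(features, labels, majority_label, seed=0):
    """Balance two classes by keeping a random subset of the majority class."""
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    # Draw as many majority cases as there are minority cases, without replacement
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [features[i] for i in keep], [labels[i] for i in keep]

# Toy data: five cases of class 1 (majority) and three of class 0
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0]
toy_features = [[i] for i in range(len(toy_labels))]

X_bal, y_bal = undersample(toy_features, toy_labels, majority_label=1)
print(y_bal.count(0), y_bal.count(1))
```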

In [27]:
temp_Labels_1 = Labels[Labels == 0]  # Save these
temp_Features_1 = Features[Labels == 0,:] # Save these
temp_Labels_0 = Labels[Labels == 1]  # Undersample these
temp_Features_0 = Features[Labels == 1,:] # Undersample these

indx = nr.choice(temp_Features_0.shape[0], temp_Features_1.shape[0], replace=True)

temp_Features = np.concatenate((temp_Features_1, temp_Features_0[indx,:]), axis = 0)
temp_Labels = np.concatenate((temp_Labels_1, temp_Labels_0[indx,]), axis = 0) 

print(np.bincount(temp_Labels))
print(temp_Features.shape)
print(temp_Labels.shape)

[136 136]
(272, 29)
(272,)


In [28]:
nr.seed(1234)
inside = ms.KFold(n_splits=10, shuffle = True)
nr.seed(3214)
outside = ms.KFold(n_splits=10, shuffle = True)

## Define the AdaBoosted tree model
nr.seed(3456)
ab_clf = AdaBoostClassifier()  

## Perform the grid search over the parameters
nr.seed(4455)
ab_clf = ms.GridSearchCV(estimator = ab_clf, param_grid = param_grid, 
                      cv = inside, # Use the inside folds
                      scoring = 'roc_auc',
                      return_train_score = True)
ab_clf.fit(temp_Features, temp_Labels)
print(ab_clf.best_estimator_.learning_rate)

0.1




In [29]:
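nr.seed(498)
## Note: the original Features and Labels are passed here, not
## temp_Features and temp_Labels, with the same seed as the earlier run,
## which would explain why the output below is identical to it.
cv_estimate = ms.cross_val_score(ab_clf, Features, Labels, 
                                 cv = outside) # Use the outside folds

print('Mean performance metric = %4.3f' % np.mean(cv_estimate))
print('STD of the metric       = %4.3f' % np.std(cv_estimate))
print('Outcomes by cv fold')
for i, x in enumerate(cv_estimate):
    print('Fold %2d    %4.3f' % (i+1, x))



Mean performance metric = 0.905
STD of the metric       = 0.057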
Outcomes by cv fold
Fold  1    0.920
Fold  2    0.864
Fold  3    0.814
Fold  4    0.884
Fold  5    0.898
Fold  6    0.818
Fold  7    0.964
Fold  8    0.943
Fold  9    0.980
Fold 10    0.964




In [30]:
## Randomly sample cases to create independent training and test data
nr.seed(1115)
indx = range(Features.shape[0])
indx = ms.train_test_split(indx, test_size = 80)
X_train = Features[indx[0],:]
y_train = np.ravel(Labels[indx[0]])
X_test = Features[indx[1],:]
y_test = np.ravel(Labels[indx[1]])

## Undersample the majority case for the training data
temp_Labels_1 = y_train[y_train == 0]  # Save these
temp_Features_1 = X_train[y_train == 0,:] # Save these
temp_Labels_0 = y_train[y_train == 1]  # Undersample these
temp_Features_0 = X_train[y_train == 1,:] # Undersample these

indx = nr.choice(temp_Features_0.shape[0], temp_Features_1.shape[0], replace=True)

X_train = np.concatenate((temp_Features_1, temp_Features_0[indx,:]), axis = 0)
y_train = np.concatenate((temp_Labels_1, temp_Labels_0[indx,]), axis = 0) 

print(np.bincount(y_train))
print(X_train.shape)
print(y_train.shape)

[103 103]
(206, 29)
(206,)


In [31]:
## Define and fit the model
nr.seed(1115)
ab_mod = AdaBoostClassifier(learning_rate = ab_clf.best_estimator_.learning_rate) 
ab_mod.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.1, n_estimators=50, random_state=None)

In [37]:
probabilities = ab_mod.predict_proba(X_test)
print_metrics(y_test, probabilities, 0.55)    

                 Confusion matrix
                 Score positive    Score negative
Actual positive        31                 2
Actual negative        18                29

Accuracy        0.75
AUC             0.91
Macro precision 0.78
Macro recall    0.78
 
           Positive      Negative
Num case       33            47
Precision    0.63          0.94
Recall       0.94          0.62
F1           0.76          0.74


The evaluation in the cells above shows no real improvement. Furthermore, for this particular dataset and this particular goal (to classify every case with disease correctly while flagging as few healthy cases as possible as diseased), this is not the best method. What is needed is a method that lets one specify that cases with disease are far more important than cases without disease. As noted before, healthy cases flagged as diseased only mean extra work for a doctor, who has to determine whether the patient is actually ill. The important thing is not to miss a single case with disease.
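
One way to express this preference is to score a confusion matrix with unequal costs. The sketch below uses the counts printed above (2 missed disease cases, 18 healthy cases flagged as diseased); the 10-to-1 cost ratio, the function name weighted_cost and the alternative outcome are assumptions chosen only for illustration.

```python
def weighted_cost(missed_disease, false_alarms, cost_missed=10.0, cost_alarm=1.0):
    """Total cost when a missed disease case is costlier than a false alarm."""
    return cost_missed * missed_disease + cost_alarm * false_alarms

# Counts taken from the confusion matrix printed above
observed = weighted_cost(missed_disease=2, false_alarms=18)

# Hypothetical alternative: no missed cases, but more false alarms
alternative = weighted_cost(missed_disease=0, false_alarms=25)

print('Observed cost     %5.1f' % observed)
print('Alternative cost  %5.1f' % alternative)
```

Under these assumed costs, a model that misses no disease cases would be preferred even if it raised more false alarms.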

Of the three methods shown in the 3rd, 4th and 5th notebooks, regression (3rd notebook) performed best.