# AdaBoost

- Boosting: Combination of several weak learners into a strong learner
- Train predictors sequentially
- AdaBoost learns from past mistakes: each new learner focuses on the training examples that the earlier learners got wrong (it pays more attention to data that was previously underfitted)
    - Fit a sequence of weak learners (small decision trees) on repeatedly modified versions of the data
    - All predictions are then combined through a weighted majority vote to produce the final prediction
    - Data modifications at each iteration apply weights to the training samples
    - All weights are initialized to $1/N$, where $N$ is the number of training samples, so that the first step trains a weak learner on the original data
    - At each step, the training examples that were incorrectly predicted by the boosted model from the previous step have their weights increased, whereas the weights of correctly predicted examples are decreased
    - As this proceeds, examples that are difficult to predict receive stronger influence, and each subsequent weak learner is then forced to concentrate on the examples that are missed by the previous ones in the sequence
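
The steps above can be sketched from scratch in a few lines. This is a minimal illustration of discrete AdaBoost, not scikit-learn's implementation: it assumes labels in {-1, +1}, a single feature, and threshold stumps as weak learners. The names `fit_stump`, `adaboost`, and `predict` and the toy data are made up for this example.

```python
import math

def fit_stump(xs, ys, w):
    """Return (weighted_error, threshold, polarity) of the best 1-D stump.

    The stump predicts `polarity` when x < threshold and `-polarity` otherwise.
    """
    points = sorted(set(xs))
    # Candidate thresholds: below all points, between neighbours, above all points.
    thresholds = [points[0] - 1.0]
    thresholds += [(a + b) / 2 for a, b in zip(points, points[1:])]
    thresholds += [points[-1] + 1.0]
    best = None
    for t in thresholds:
        for polarity in (1, -1):
            # Weighted error: total weight of the misclassified samples.
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if (polarity if x < t else -polarity) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best

def adaboost(xs, ys, n_rounds):
    n = len(xs)
    w = [1.0 / n] * n                      # all weights start at 1 / N
    ensemble = []                          # (alpha, threshold, polarity) per round
    for _ in range(n_rounds):
        err, t, polarity = fit_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this learner
        ensemble.append((alpha, t, polarity))
        # Misclassified samples (y * pred == -1) gain weight, the rest lose weight.
        preds = [polarity if x < t else -polarity for x in xs]
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        total = sum(w)
        w = [wi / total for wi in w]       # renormalize so the weights sum to 1
    return ensemble, w

def predict(ensemble, x):
    # Weighted majority vote over all weak learners.
    score = sum(alpha * (pol if x < t else -pol) for alpha, t, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble, final_w = adaboost(xs, ys, n_rounds=3)
train_acc = sum(predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)
print('thresholds:', [t for _, t, _ in ensemble])
print('training accuracy:', train_acc)
```

On this toy set the best single stump misclassifies three of the ten points, while the weighted vote of three stumps classifies all ten training points correctly.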


In [6]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier

In [7]:
df = sns.load_dataset('titanic')

In [8]:
df.head()

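|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |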


In [9]:
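# Drop every row with a missing value in any column (including the sparse
# 'deck' column); this leaves 182 rows (163 train + 19 test in the results below).
df.dropna(inplace=True)
X = df[['pclass', 'sex', 'age']].copy()
# Encode 'sex' as integers (female -> 0, male -> 1).
le = preprocessing.LabelEncoder()
X['sex'] = le.fit_transform(X['sex'])
y = df['survived'].copy()
# Hold out 10% for testing; no random_state is set, so the split changes between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)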

In [14]:
# The default base learner is a depth-1 decision tree (a decision stump).
ada_clf = AdaBoostClassifier(n_estimators=100)
ada_clf.fit(X_train, y_train);
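
To see how the ensemble improves as weak learners are added, `staged_score` reports accuracy after each boosting round. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, so that it runs without the Titanic download); `X_demo`, `demo_clf`, and the other names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Titanic features used above).
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

demo_clf = AdaBoostClassifier(n_estimators=50, random_state=0)
demo_clf.fit(X_tr, y_tr)

# staged_score yields the ensemble's accuracy after each boosting iteration.
test_curve = list(demo_clf.staged_score(X_te, y_te))
best_round = max(range(len(test_curve)), key=test_curve.__getitem__) + 1
print('rounds fitted:', len(test_curve))
print('best test accuracy %.3f after %d rounds' % (max(test_curve), best_round))
```

If the best round comes well before the last one, a smaller `n_estimators` may generalize just as well.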

In [17]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix, roc_auc_score

def printScore(clf, X_train, X_test, y_train, y_test, train=True):
    """Print accuracy, classification report, confusion matrix and ROC AUC for one split."""
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_train)
    # Evaluate on the training split or the test split, depending on `train`.
    X, y, name = (X_train, y_train, 'Train') if train else (X_test, y_test, 'Test')
    res = clf.predict(X)
    print('{} Results:\n'.format(name))
    print('Accuracy: %.2f\n' % accuracy_score(y, res))
    print('Classification Report: \n {} \n'.format(classification_report(y, res)))
    print('Confusion Matrix: \n {} \n'.format(confusion_matrix(y, res)))
    # This AUC is computed from hard class predictions, not predicted probabilities.
    print('ROC AUC: {0:.4f}\n'.format(roc_auc_score(lb.transform(y), lb.transform(res))))

In [16]:
printScore(ada_clf, X_train, X_test, y_train, y_test)
printScore(ada_clf, X_train, X_test, y_train, y_test, train=False)

Train Results:

Accuracy: 0.86

Classification Report: 
               precision    recall  f1-score   support

           0       0.86      0.68      0.76        53
           1       0.86      0.95      0.90       110

    accuracy                           0.86       163
   macro avg       0.86      0.81      0.83       163
weighted avg       0.86      0.86      0.85       163
 

Confusion Matrix: 
 [[ 36  17]
 [  6 104]] 

ROC AUC: 0.8123

Test Results:

Accuracy: 0.74

Classification Report: 
               precision    recall  f1-score   support

           0       0.67      0.33      0.44         6
           1       0.75      0.92      0.83        13

    accuracy                           0.74        19
   macro avg       0.71      0.63      0.64        19
weighted avg       0.72      0.74      0.71        19
 

Confusion Matrix: 
 [[ 2  4]
 [ 1 12]] 

ROC AUC: 0.6282
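
`printScore` passes hard class predictions to `roc_auc_score`, so the reported AUC reflects a single operating point rather than the full ROC curve. A possible alternative, sketched here on synthetic stand-in data (an assumption, as are the variable names), is to score with the positive-class probability from `predict_proba`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Titanic features used above).
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

demo_clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# AUC from hard 0/1 labels: only one threshold is evaluated.
auc_from_labels = roc_auc_score(y_te, demo_clf.predict(X_te))
# AUC from the probability of class 1: uses the full ranking of the samples.
auc_from_scores = roc_auc_score(y_te, demo_clf.predict_proba(X_te)[:, 1])

print('AUC from labels: %.4f' % auc_from_labels)
print('AUC from scores: %.4f' % auc_from_scores)
```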



## With a Random Forest base estimator

In [18]:
from sklearn.ensemble import RandomForestClassifier

In [20]:
# Each of the (up to) 100 boosting rounds now fits a full 100-tree random forest.
ada_clf = AdaBoostClassifier(RandomForestClassifier(n_estimators=100), n_estimators=100)

In [22]:
ada_clf.fit(X_train, y_train);

In [23]:
printScore(ada_clf, X_train, X_test, y_train, y_test)
printScore(ada_clf, X_train, X_test, y_train, y_test, train=False)

Train Results:

Accuracy: 0.94

Classification Report: 
               precision    recall  f1-score   support

           0       0.92      0.89      0.90        53
           1       0.95      0.96      0.95       110

    accuracy                           0.94       163
   macro avg       0.93      0.93      0.93       163
weighted avg       0.94      0.94      0.94       163
 

Confusion Matrix: 
 [[ 47   6]
 [  4 106]] 

ROC AUC: 0.9252

Test Results:

Accuracy: 0.79

Classification Report: 
               precision    recall  f1-score   support

           0       0.62      0.83      0.71         6
           1       0.91      0.77      0.83        13

    accuracy                           0.79        19
   macro avg       0.77      0.80      0.77        19
weighted avg       0.82      0.79      0.80        19
 

Confusion Matrix: 
 [[ 5  1]
 [ 3 10]] 

ROC AUC: 0.8013
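
With only 19 test rows, the gap between 0.74 and 0.79 test accuracy is a single sample (14 versus 15 correct in the confusion matrices), so the comparison is noisy. Cross-validation gives a steadier estimate. The sketch below compares the two setups with `cross_val_score` on synthetic stand-in data; the data, the reduced estimator counts (chosen so it runs quickly), and the variable names are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the Titanic features used above).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

models = {
    'AdaBoost (stumps)': AdaBoostClassifier(n_estimators=50, random_state=0),
    # Smaller than the 100 x 100 setup above, purely to keep the sketch fast.
    'AdaBoost (forest base)': AdaBoostClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0),
        n_estimators=5, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy for each model.
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_scores[name] = scores
    print('%s: %.3f +/- %.3f' % (name, scores.mean(), scores.std()))
```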

