# Boosting with AdaBoost
<br>
Boosting is another ensemble method and an adaptive basis function model: the basis functions $\phi(x,\gamma_m)$ are learned from the data and do not have to be linear in their parameters $\gamma_m$. In short, the estimate $\hat{f}(x)$ is defined by:
$$\hat{f}(x) = w_0 + \sum_{m=1}^M w_m\phi(x,\gamma_m)$$
- where each $\phi(x,\gamma_m)$
  - is a simple classifier that assigns a class to every point in the feature space
  - is a weak learner, i.e. it is only required to do better than chance
    - here each weak learner is a 'stump', i.e. a one-level CART with one decision node and two leaves
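
As a minimal sketch of what such a stump looks like in code, the snippet below fits a depth-one scikit-learn tree to a tiny made-up dataset. The data values and variable names are assumptions chosen purely for illustration; one point is deliberately noisy so that the stump cannot be perfect.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy 1-D data (made up for illustration): mostly -1 on the left and +1 on
# the right, with one noisy point so a single split cannot be perfect.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([-1, -1, 1, -1, 1, 1])

# max_depth=1 allows a single split: one decision node and two leaves.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X, y)

n_leaves = stump.get_n_leaves()
accuracy = stump.score(X, y)  # fraction of training points classified correctly
print(f"leaves: {n_leaves}, training accuracy: {accuracy:.3f}")
```

The stump is better than chance on this data but not perfect, which is all a weak learner is asked to be.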

For a two-class problem where $\tilde{y} \in \{-1, 1\}$:
$$\hat{f}(x) = f_0 + \sum_{m=1}^M \beta_m\phi(x,\gamma_m)$$
where 
- $\beta_m$ is the "importance" of the $m^{th}$ classifier
- $\phi(x,\gamma_m)$ is the $m^{th}$ base classifier

and the final classifier is defined by:
<br>
$$\hat{y}(x) = \mathrm{sign}\{\hat{f}(x)\}$$
<br>
The magnitude $|\hat{f}(x)|$ provides a measure of confidence in the class assignment of $x$. It is worth stressing that at each iteration Boosting tries to improve on the current estimate by adding a weak learner that concentrates on the samples misclassified so far; $\beta_m$ is the weight that learner receives in the final sum.

The image below (credit: analyticsvidhya.com) illustrates how boosting works. In the first iteration ('box 1'), the split misclassifies the three '+' samples toward the top, so these samples gain importance and appear larger in 'box 2'. In the second iteration, the three '-' samples to the left of the split are misclassified, so in 'box 3' they appear larger (and the others appear smaller). Finally, the box at the bottom of the graphic shows the aggregated result.

![Boosting](boosting.png)
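
The reweighting in the picture can be sketched in a few lines. The code below is a minimal, from-scratch version of the standard discrete AdaBoost update with stumps, not the scikit-learn implementation. The formula $\beta_m = \frac{1}{2}\log\frac{1-err_m}{err_m}$, the helper names `adaboost_fit` and `adaboost_decision`, and the synthetic circular dataset are assumptions introduced here for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_rounds=30):
    """Discrete AdaBoost with stumps; y must take values in {-1, +1}."""
    n_samples = len(y)
    weights = np.full(n_samples, 1.0 / n_samples)  # all samples start equally important
    stumps, betas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        err = np.sum(weights[pred != y])  # weighted error (weights sum to 1)
        if err >= 0.5:
            break  # no better than chance: stop adding learners
        err = max(err, 1e-10)  # guard against division by zero for a perfect stump
        beta = 0.5 * np.log((1.0 - err) / err)
        stumps.append(stump)
        betas.append(beta)
        # Misclassified samples (y * pred == -1) are scaled up, correct ones down.
        weights = weights * np.exp(-beta * y * pred)
        weights = weights / weights.sum()
    return stumps, np.array(betas)


def adaboost_decision(X, stumps, betas):
    """Weighted vote f_hat(x) = sum_m beta_m * phi_m(x)."""
    return sum(beta * stump.predict(X) for stump, beta in zip(stumps, betas))


# Synthetic data (an assumption for this sketch): +1 inside the unit circle.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)

stumps, betas = adaboost_fit(X, y, n_rounds=30)
f_hat = adaboost_decision(X, stumps, betas)
y_hat = np.sign(f_hat)  # final classifier; |f_hat| acts as a confidence score

single_stump_acc = np.mean(stumps[0].predict(X) == y)
boosted_acc = np.mean(y_hat == y)
print(f"first stump: {single_stump_acc:.3f}, boosted ({len(stumps)} stumps): {boosted_acc:.3f}")
```

No single axis-aligned split can separate a circle, so the first stump on its own is weak. The weighted vote of many reweighted stumps fits the training data much more closely.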
<br>
An important aspect of the Boosting procedure is how to choose each base classifier $\phi(x,\gamma_m)$ and its weight $\beta_m$. For that we use the following objective function:
<br>

$$ f_{obj}\left(\{\beta_m, m=1,...,M\}, \{\gamma_m, m=1,...,M\}, \phi\right) = \frac{1}{N} \sum_{i=1}^{N}L(\tilde{y}_i, \hat{f}(x_i))$$
<br>
where $L$ is a loss function such as:
- the 0-1 loss function, $L_{0-1}(\tilde{y}_i,\hat{f}(x_i)) = \mathbb{I}(\tilde{y}_i \ne \mathrm{sign}\{\hat{f}(x_i)\})$
- the exponential loss function, $L_{exp}(\tilde{y}_i,\hat{f}(x_i)) = \exp(-\tilde{y}_i\hat{f}(x_i))$
<br><br>
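
To make the two losses concrete, here is a small NumPy sketch that evaluates both on a handful of made-up labels and scores (the arrays `y` and `f` are illustrative assumptions). It also checks the well-known fact that the exponential loss is an upper bound on the 0-1 loss, which is one reason it is used as a smooth surrogate.

```python
import numpy as np


def zero_one_loss(y, f):
    """1 where the sign of the score disagrees with the label, else 0."""
    return (y != np.sign(f)).astype(float)


def exp_loss(y, f):
    """Exponential loss exp(-y * f); small for confident correct scores."""
    return np.exp(-y * f)


y = np.array([1, 1, -1, -1])          # labels in {-1, +1}
f = np.array([2.0, -0.5, -3.0, 0.1])  # made-up real-valued scores f_hat(x_i)

margins = y * f                       # positive margin means a correct prediction
l01 = zero_one_loss(y, f)
lexp = exp_loss(y, f)

for m, a, b in zip(margins, l01, lexp):
    print(f"margin {m:+.1f}: 0-1 loss {a:.0f}, exp loss {b:.3f}")
```

Unlike the 0-1 loss, the exponential loss keeps growing as a misclassified sample's margin becomes more negative, so confidently wrong predictions are penalized heavily.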

So you want to find:<br>
$$\hat{f} = \underset{f}{\mathrm{argmin}}\quad \sum_{i=1}^N L_{exp}(\tilde{y}_i, f(x_i))$$
which, for the additive model, means solving:
$$\underset{f_0, \{\beta_m\}, \{\gamma_m\}}{\mathrm{min}}\quad \sum_{i=1}^N L_{exp}\left(\tilde{y}_i, f_0 + \sum_{m=1}^M \beta_m\phi(x_i,\gamma_m)\right)$$
<br><br>

AdaBoost is a Boosting method where the loss function is $L_{exp}(\tilde{y_i},\hat{f}(x_i))$ 
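
The following is a sketch of the standard forward stagewise argument that connects this loss to $\beta_m$ and to the sample reweighting. It is not spelled out in the text above; the symbols $f_{m-1}$ (the estimate after $m-1$ rounds), $w_{i,m}$ (the weight of sample $i$ in round $m$) and $err_m$ (the weighted error rate) are introduced here for illustration.

With $f_{m-1}$ held fixed, round $m$ solves

$$(\beta_m, \gamma_m) = \underset{\beta, \gamma}{\mathrm{argmin}}\quad \sum_{i=1}^N \exp\left(-\tilde{y}_i\left[f_{m-1}(x_i) + \beta\phi(x_i,\gamma)\right]\right) = \underset{\beta, \gamma}{\mathrm{argmin}}\quad \sum_{i=1}^N w_{i,m}\exp\left(-\beta\tilde{y}_i\phi(x_i,\gamma)\right)$$

where $w_{i,m} = \exp(-\tilde{y}_i f_{m-1}(x_i))$ does not depend on $\beta$ or $\gamma$. Because $\tilde{y}_i\phi(x_i,\gamma) \in \{-1, 1\}$, the sum splits into correctly and incorrectly classified samples:

$$e^{-\beta}\sum_{\tilde{y}_i = \phi(x_i,\gamma)} w_{i,m} + e^{\beta}\sum_{\tilde{y}_i \ne \phi(x_i,\gamma)} w_{i,m}$$

Setting the derivative with respect to $\beta$ to zero gives

$$\beta_m = \frac{1}{2}\log\frac{1 - err_m}{err_m}, \qquad err_m = \frac{\sum_{i=1}^N w_{i,m}\,\mathbb{I}(\tilde{y}_i \ne \phi(x_i,\gamma_m))}{\sum_{i=1}^N w_{i,m}}$$

and the weights for the next round follow from $f_m = f_{m-1} + \beta_m\phi(x,\gamma_m)$:

$$w_{i,m+1} = w_{i,m}\exp\left(-\beta_m\tilde{y}_i\phi(x_i,\gamma_m)\right)$$

so misclassified samples are scaled up by $e^{\beta_m}$ and correctly classified ones are scaled down by $e^{-\beta_m}$.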

In [1]:
'''
In this example we read in a dataset of house descriptions and sales. The classification task is to
estimate whether a house will sell (and with what probability) within 90 days of being put on the market.
'''
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# this data has already been cleaned up, standardized, one hot encoded and vetted
df = pd.read_csv("classification_house_sale_px_data.csv", parse_dates=True, sep=',', header=0)
df_labels = pd.read_csv("classification_house_sale_px_labels.csv", parse_dates=True, sep=',', header=0)

# split data into training and test sets
train, test, y_train, y_test = train_test_split(df, df_labels, train_size=.6, test_size=.4, shuffle=True)

# fit the classifier on the training data using 5 weak learners
clf = AdaBoostClassifier(n_estimators=5)
clf.fit(train, list(y_train.label.values))
# score() predicts on the test data and returns the fraction classified correctly
print("AdaBoost: Test set accuracy (% correct) when n_estimators = 5: {0:.3f}".format(clf.score(test, y_test.label.values)))
# refit with 20 weak learners and score on the same test set
clf = AdaBoostClassifier(n_estimators=20)
clf.fit(train, list(y_train.label.values))
print("AdaBoost: Test set accuracy (% correct) when n_estimators = 20: {0:.3f}".format(clf.score(test, y_test.label.values)))

AdaBoost: Test set accuracy (% correct) when n_estimators = 5: 0.615
AdaBoost: Test set accuracy (% correct) when n_estimators = 20: 0.630


<br>
Note how the AdaBoost estimate is as good as the best decision tree estimate. However, the worst AdaBoost estimate does not degrade as much as the worst decision tree estimate.
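
To see how test accuracy evolves as weak learners are added, without refitting for every value of `n_estimators`, one option is scikit-learn's `staged_score`. The sketch below uses synthetic data from `make_classification` as a stand-in, since the house sale files are not assumed to be available here. The dataset shape and the `random_state` values are arbitrary choices for illustration, so the numbers will not match those above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the house sale dataset).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0)

clf = AdaBoostClassifier(n_estimators=20, random_state=0)
clf.fit(X_train, y_train)

# staged_score yields the test accuracy after each boosting iteration.
staged_acc = list(clf.staged_score(X_test, y_test))
final_acc = clf.score(X_test, y_test)

for m in sorted({1, min(5, len(staged_acc)), len(staged_acc)}):
    print(f"after {m:2d} weak learners: accuracy = {staged_acc[m - 1]:.3f}")
```

The last staged value matches `clf.score`, so a single fit gives the whole accuracy curve.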
<br>
# Take away
- Boosting is an ensemble method that makes use of weak learners and aggregates their results into a final estimate
- Boosting takes the previous iteration's results and attempts to improve on them by assigning a higher importance to previously misclassified samples
- AdaBoost uses a CART weak learner and the $L_{exp}$ loss function