## Chapter 7 - Ensemble Learning and Random Forests

### Boosting

Boosting is an ensemble method that combines several weak learners into a strong learner. The general idea is to train predictors sequentially, each one trying to correct the errors of its predecessor.

The most popular boosting methods are AdaBoost and Gradient Boosting.

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier

from sklearn.datasets import make_moons

One way for a new predictor to correct its predecessor is to pay more attention to the training instances that the predecessor misclassified. As a result, successive predictors focus more and more on the hard cases. This is the technique used by AdaBoost.

To build an Adaptive Boosting (AdaBoost) classifier, a base classifier is trained and used to make predictions on the training set. The relative weight of the misclassified training instances is then increased. A second classifier is trained using the updated weights and makes predictions on the training set, and so on.

Once all predictors are trained, the ensemble makes predictions much like bagging or pasting, except that the predictors have different weights depending on their overall accuracy on the weighted training set.

In [2]:
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [3]:
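# Train an AdaBoost classifier with decision stumps (max_depth=1) as base estimators.
# `algorithm` is left at its default: 'SAMME.R' was deprecated in scikit-learn 1.4
# and removed in 1.6, so passing it explicitly fails on recent versions.
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200,
                             learning_rate=0.5)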
ada_clf.fit(X_train, y_train)
y_pred = ada_clf.predict(X_test)

Every instance in the training data carries a weight $w^{(i)}$. Initially all weights are equal, so $w^{(i)}=\frac{1}{m}$, where $m$ is the number of training instances. After the first predictor is trained, its weighted error rate $r$ is computed on the training set:

$$r = \frac{\sum_{\substack{i=1 \\ \hat{y}^{(i)} \neq y^{(i)}}}^{m} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}$$

The numerator is the sum of the weights of all incorrectly classified instances, while the denominator is the sum of the weights of all instances.

The predictor's weight $\alpha$ is then computed as $\alpha = \eta \log \frac{1-r}{r}$, where $\eta$ is the learning rate. The more accurate the predictor, the larger $\alpha$; a predictor that guesses at random ($r = 0.5$) gets $\alpha = 0$.
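
As a small worked example of the two formulas above, the sketch below computes $r$ and $\alpha$ with NumPy. The labels, predictions and learning rate are made-up toy values chosen for illustration, not taken from the moons dataset.

```python
import numpy as np

# Hypothetical toy data: 5 instances, the predictor gets one of them wrong.
y_true = np.array([1, 0, 1, 1, 0])
y_hat = np.array([1, 0, 1, 0, 0])

m = len(y_true)
weights = np.full(m, 1 / m)   # initial weights w_i = 1/m
learning_rate = 0.5           # eta (assumed value)

# Weighted error rate: weight of the mistakes over the total weight.
misclassified = y_hat != y_true
r = weights[misclassified].sum() / weights.sum()

# Predictor weight: alpha = eta * log((1 - r) / r).
alpha = learning_rate * np.log((1 - r) / r)

print(f"r = {r:.2f}, alpha = {alpha:.3f}")   # r = 0.20, alpha = 0.693
```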

Next, a second predictor is trained on the same training set, but with updated instance weights: instances that were classified correctly get a smaller weight and those that were classified incorrectly get a larger one. Each weight is scaled and then normalized:

$$w^{(i)}_\text{new} = \begin{cases}\dfrac{w^{(i)}_\text{old} \exp(-\alpha)}{Z} & \text{if classified correctly, i.e. } \hat{y}^{(i)} = y^{(i)}\\[1ex] \dfrac{w^{(i)}_\text{old} \exp(\alpha)}{Z} & \text{if classified incorrectly, i.e. } \hat{y}^{(i)} \neq y^{(i)}\end{cases}$$

where $Z$ is the sum of the scaled weights over all instances, so that the new weights sum to 1.
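
The reweighting step can be sketched in a few lines of NumPy: shrink the weights of correctly classified instances, grow those of misclassified ones, then renormalize. The helper name `update_weights` and the toy values are assumptions made for illustration.

```python
import numpy as np

def update_weights(weights, misclassified, alpha):
    """Scale weights by exp(+alpha) for mistakes and exp(-alpha) otherwise,
    then normalize so that they sum to 1."""
    scaled = weights * np.exp(np.where(misclassified, alpha, -alpha))
    return scaled / scaled.sum()

# Hypothetical toy setup: 5 equally weighted instances, the fourth is misclassified.
weights = np.full(5, 0.2)
misclassified = np.array([False, False, False, True, False])
alpha = 0.5 * np.log((1 - 0.2) / 0.2)   # eta = 0.5, r = 0.2

new_weights = update_weights(weights, misclassified, alpha)
print(new_weights)
```

In this toy case the single misclassified instance ends up with half of the total weight, so the next predictor is pushed hard to get it right.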


At prediction time, every predictor votes for a class with its weight $\alpha$, and the ensemble predicts the class that receives the largest total of weighted votes.
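
The weighted vote can be illustrated with a short NumPy sketch. The function `weighted_vote`, the predictions and the predictor weights below are hypothetical and only meant to show how the scores are accumulated.

```python
import numpy as np

def weighted_vote(predictions, alphas, n_classes):
    """predictions: (n_predictors, n_samples) array of integer class labels.
    Returns, per sample, the class with the largest sum of predictor weights."""
    n_samples = predictions.shape[1]
    scores = np.zeros((n_classes, n_samples))
    for pred, alpha in zip(predictions, alphas):
        # Each predictor adds its weight to the class it voted for.
        scores[pred, np.arange(n_samples)] += alpha
    return scores.argmax(axis=0)

# Three hypothetical predictors voting on three samples.
predictions = np.array([[0, 1, 1],
                        [1, 1, 0],
                        [1, 0, 0]])
alphas = np.array([1.2, 0.4, 0.5])

print(weighted_vote(predictions, alphas, n_classes=2))   # [0 1 1]
```

Note that for the first and last sample the single strong predictor (weight 1.2) outvotes the two weaker ones (0.4 + 0.5), which a plain majority vote would not allow.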

Gradient boosted trees also add predictors sequentially to the ensemble. However, instead of tweaking the instance weights at every iteration, the method fits each new predictor to the residual errors made by the previous predictor.
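
The residual-fitting idea is easiest to see in a regression setting with squared error, where the residual is exactly what the next tree has to learn. The sketch below builds a three-tree ensemble by hand with `DecisionTreeRegressor`; the noisy quadratic dataset is made up for illustration, and this is a simplified picture rather than what `GradientBoostingClassifier` does internally for classification.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy quadratic data.
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

trees = []
residual = y.copy()
for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residual)                      # fit what is still unexplained
    residual = residual - tree.predict(X)      # update the residual errors
    trees.append(tree)

def ensemble_predict(X_new):
    # The ensemble prediction is the sum of all the trees' predictions.
    return sum(tree.predict(X_new) for tree in trees)

mse_first = np.mean((y - trees[0].predict(X)) ** 2)
mse_all = np.mean((y - ensemble_predict(X)) ** 2)
print(f"training MSE, 1 tree: {mse_first:.5f}; 3 trees: {mse_all:.5f}")
```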

In [5]:
# Train a gradient boosting classifier with three shallow trees
gbt_clf = GradientBoostingClassifier(max_depth=2, n_estimators=3, learning_rate=1.0)
gbt_clf.fit(X_train, y_train)
y_pred = gbt_clf.predict(X_test)
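
To see how each added tree changes the result, one option is to score the ensemble after every stage with `staged_predict`. The sketch below is self-contained and rebuilds the same moons data; the `random_state` passed to the classifier is an added assumption to make the run reproducible.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

gbt = GradientBoostingClassifier(max_depth=2, n_estimators=3,
                                 learning_rate=1.0, random_state=42)
gbt.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after 1, 2, ... trees.
stage_accuracy = [accuracy_score(y_test, y_stage)
                  for y_stage in gbt.staged_predict(X_test)]

for n_trees, acc in enumerate(stage_accuracy, start=1):
    print(f"{n_trees} tree(s): test accuracy = {acc:.3f}")
```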