# Programming for Data Science and Artificial Intelligence

## 9. Ensemble

An ensemble aggregates the answers of multiple classifiers into a single prediction.  Random Forests, for instance, are ensembles of Decision Trees.

### Voting Classifiers

#### Hard voting
Suppose you have trained a few classifiers, each one achieving about 80% accuracy.  A simple way to create a better classifier is to aggregate their predictions and predict the class that gets the most votes.  This majority-vote classifier is called a **hard voting** classifier.

Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble.  In fact, even if each classifier is a *weak learner* (i.e., it does only slightly better than random guessing), the ensemble can still be a *strong learner*, provided there are enough weak learners and they are sufficiently diverse.

One strong tip is that ensemble methods work best when **classifiers are as independent from one another as possible**.  Hence, the best practice is to use diverse classifiers, which make different types of errors and thereby improve the overall accuracy of the ensemble.
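To see why independence matters, here is a minimal sketch (not part of the original notebook) that computes the accuracy of a majority vote under the idealized assumption that all classifiers make their errors completely independently.  Real classifiers trained on the same data never fully satisfy this, so treat the numbers as an upper bound on what to expect.  The helper name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_classifiers: int, p: float) -> float:
    """Probability that more than half of n independent classifiers,
    each correct with probability p, give the right answer."""
    k_min = n_classifiers // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_classifiers, k) * p**k * (1 - p) ** (n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )


# Three independent classifiers, each 80% accurate
acc_three = majority_vote_accuracy(3, 0.8)

# Many weak learners, each barely better than a coin flip
acc_many_weak = majority_vote_accuracy(1001, 0.51)

print(acc_three, acc_many_weak)
```

With three independent classifiers at 80% accuracy, the majority vote is correct 89.6% of the time, and the gain grows as more independent voters are added.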

The voting classifier can be simply implemented using sklearn **VotingClassifier** API:

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(solver="lbfgs", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(random_state=42)

#hard voting
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

In [3]:
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression(random_state=42)),
                             ('rf', RandomForestClassifier(random_state=42)),
                             ('svc', SVC(random_state=42))])

In [4]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912


#### Soft Voting

If all classifiers have a <code>predict_proba</code> method, then you can tell sklearn to predict the class with the highest class probability, averaged over all individual classifiers.  This is called **soft voting**.  It usually gives better results than hard voting because more weight is given to highly confident votes.  

All we need to do is to replace <code>voting='hard'</code> with <code>voting='soft'</code>.  Also, since <code>SVC</code> does not provide <code>predict_proba</code> by default, we need to set its <code>probability=True</code>, which makes it use cross-validation internally to estimate class probabilities (slowing down training) and gives it the needed <code>predict_proba</code> method.
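The difference between the two voting schemes can be illustrated with a small NumPy sketch.  The probabilities below are made up for illustration (they do not come from the classifiers in this notebook): two classifiers lean weakly towards class 0, while one is very confident about class 1.

```python
import numpy as np

# Hypothetical predict_proba outputs of three classifiers for ONE instance.
# Columns: P(class 0), P(class 1)
probas = np.array([
    [0.55, 0.45],  # classifier 1: weakly prefers class 0
    [0.55, 0.45],  # classifier 2: weakly prefers class 0
    [0.05, 0.95],  # classifier 3: strongly prefers class 1
])

# Hard voting: each classifier casts one vote for its most likely class
hard_votes = probas.argmax(axis=1)
hard_prediction = np.bincount(hard_votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the most likely class
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print("hard:", hard_prediction, "soft:", soft_prediction)
```

Hard voting picks class 0 (two votes against one), whereas soft voting picks class 1 because the single confident vote outweighs the two hesitant ones.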

In [5]:
#soft voting

log_clf = LogisticRegression(solver="lbfgs", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(gamma="scale", probability=True, random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')
voting_clf.fit(X_train, y_train)

from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.92


### Bagging and Pasting

One way to get a diverse set of classifiers is to use very different training algorithms, as just discussed.  Another way is to use the same training algorithm for every predictor but train each one on a different random subset of the training set (similar to Random Forests).  When sampling is performed **with** replacement, this is called **bagging** (short for *bootstrap aggregating*).  Otherwise, it is called **pasting**.  In other words, only bagging allows a training instance to be sampled several times for the same predictor.  

Because the predictors can all be trained in parallel (e.g., using <code>n_jobs</code>), bagging and pasting scale well, which is one reason they are so popular.

To perform bagging in sklearn, we can use the <code>BaggingClassifier</code> API.  Pasting can be done with the same class by setting <code>bootstrap=False</code>.

In [6]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

In [7]:
from sklearn.metrics import accuracy_score
print("Bagging: ", accuracy_score(y_test, y_pred))

Bagging:  0.904


In [8]:
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print("Decision Tree: ", accuracy_score(y_test, y_pred_tree))

Decision Tree:  0.856
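The cells above use bagging.  For comparison, here is a minimal sketch of pasting, assuming the same moons dataset but a smaller number of trees to keep it quick.  It inspects the fitted attribute <code>estimators_samples_</code> to confirm that, without replacement, no training instance is drawn twice for the same predictor.  The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# Pasting: same API as bagging, but sampling WITHOUT replacement
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=50,
    max_samples=100, bootstrap=False, random_state=42)
paste_clf.fit(X, y)

# Indices of the training instances drawn for the first predictor
first_sample = paste_clf.estimators_samples_[0]
n_drawn = len(first_sample)
n_unique = len(np.unique(first_sample))

print("drawn:", n_drawn, "unique:", n_unique)
```

With <code>bootstrap=True</code> the same check would typically report fewer unique indices than drawn ones, since instances can repeat.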


#### Out of Bag Evaluation

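With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all.  By default, <code>BaggingClassifier</code> samples $m$ training instances with replacement, where $m$ is the size of the training set.  This means that, on average, only about 63% of the training instances are sampled for each predictor, while the remaining 37% are never seen by it.  These unseen instances are called **Out of Bag** (oob) instances.  Note that they are not the same for all predictors.

Interestingly, since a predictor never sees its oob instances during training, they can serve as a kind of test set for it, without setting aside any extra data.  In <code>BaggingClassifier</code>, we can set <code>oob_score=True</code>, which automatically evaluates the ensemble after training: each training instance is predicted using only the predictors that did not see it, and the resulting accuracy is stored in <code>oob_score_</code>.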
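The size of the oob portion follows from a short calculation: when drawing $m$ times with replacement from $m$ instances, the probability that a given instance is never picked is $(1 - 1/m)^m$, which approaches $1/e \approx 0.37$ as $m$ grows.  The following sketch checks this with a quick simulation using only the standard library; the sample size and seed are arbitrary choices.

```python
import math
import random

m = 10_000                 # size of the (imaginary) training set
rng = random.Random(42)    # fixed seed for reproducibility

# Draw m indices with replacement and keep the distinct ones
sampled = {rng.randrange(m) for _ in range(m)}

sampled_fraction = len(sampled) / m      # fraction seen by the predictor
oob_fraction = 1 - sampled_fraction      # fraction left out of bag
expected_fraction = 1 - math.exp(-1)     # theoretical limit of the sampled part

print(sampled_fraction, oob_fraction, expected_fraction)
```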



In [9]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    bootstrap=True, oob_score=True, random_state=42)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.896

In [10]:
#Compare the oob estimate with the accuracy on the test set

from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.92

#### Sampling features

<code>BaggingClassifier</code> supports sampling the features as well.  This is controlled by two hyperparameters: <code>max_features</code> and <code>bootstrap_features</code>.  Each predictor is then trained on a random subset of the input features (be careful: this is about sampling features, NOT instances).

Sampling both training instances and features is called **Random Patches**.  Keeping all training instances (<code>bootstrap=False</code> and <code>max_samples=1.0</code>) but sampling features (<code>bootstrap_features=True</code> and/or <code>max_features</code> set to something less than 1.0) is called **Random Subspaces**.
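As a rough sketch of how the two schemes could be configured, the following uses a synthetic 10-feature dataset from <code>make_classification</code> (an assumption made here because the moons data has only two features).  The hyperparameter values are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=42)

# Random Patches: sample both instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=20,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=False, random_state=42)
patches_clf.fit(X, y)

# Random Subspaces: keep all instances, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=20,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=True, random_state=42)
subspaces_clf.fit(X, y)

# Number of features drawn for each predictor (half of 10 here)
n_feats_patches = [len(f) for f in patches_clf.estimators_features_]
n_feats_subspaces = [len(f) for f in subspaces_clf.estimators_features_]
# Number of instances used by the first Random Subspaces predictor
n_instances_subspaces = len(subspaces_clf.estimators_samples_[0])

print(n_feats_patches[:3], n_feats_subspaces[:3], n_instances_subspaces)
```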

### Boosting

**Boosting** refers to any ensemble method that can combine several weak learners into a strong learner.  The general idea of most boosting methods is to train predictors **sequentially**, each trying to correct its predecessor.  The most popular ones are **AdaBoost** (short for Adaptive Boosting) and **Gradient Boosting**.

#### AdaBoost

AdaBoost makes each new predictor pay more attention to the training instances that its predecessor misclassified (or underfitted), so new predictors focus more and more on the hard cases.  Finally, when we combine all the sequential predictors, we get a predictor that takes care of all cases.

For example, in the picture below, Weak Learner #1 makes a classification, but one red dot is misclassified.  This red dot gets an increased weight, which is passed on to the second learner.  The second learner can now correctly classify the bigger red dot, so this dot becomes smaller again (i.e., its weight decreases).  At the same time, anything that is misclassified by the second learner gets an increased weight, and these weights are passed on to the next predictor.  And so on.  

Once all predictors are trained, the ensemble makes predictions very much like bagging, except that predictors have different weights ($a_j$) depending on their overall accuracy on the weighted training set.

Initially, every data point $i$ has the same weight $w_i$:

$$ w_i = 1/N $$

where $N$ is the total number of data points.  The sample weights always sum to 1, so the weight of each point always lies between 0 and 1.  

We can also compute a weight $a_j$ for each predictor $j$ using

$$ a_j = \eta\ln\frac{1 - r_j}{r_j} $$

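where $\eta$ is the learning rate (defaults to 1) and $r_j$ is the weighted error rate of predictor $j$, i.e., the sum of the weights of the misclassified samples divided by the sum of all weights:

$$ r_j = \frac{\sum_{i:\ \hat{y}_j^{(i)} \neq y_i} w_i}{\sum_{i=1}^{N} w_i} $$

For the first predictor, where all weights are equal, this is simply the number of misclassifications divided by the training set size.  This $a_j$ is used both (1) in the final predictions and (2) to update the weight of each data sample.  To update the weight of each data sample $i$ for the next predictor, we do:

$$ 
  w_i \leftarrow
  \begin{cases}
    w_i, & \text{if } \hat{y}_j^{(i)} = y_i \\
    w_i \exp(a_j), & \text{otherwise}
  \end{cases}
$$

In other words, the weight of a correctly classified sample is left unchanged, while the weight of a misclassified sample is its old weight multiplied by Euler's number raised to the $a_j$ computed above.  All weights are then normalized (divided by $\sum_{i=1}^{N} w_i$) so that they again sum to 1.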

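Last, once all predictors are trained, AdaBoost makes a prediction by computing the predictions of all predictors, weighting them by their $a_j$, and picking the class that receives the majority of weighted votes.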
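The update rules above can be sketched from scratch in a few lines.  This is a simplified illustration, not sklearn's actual implementation: it uses decision stumps as weak learners, and the function names `adaboost_fit` and `adaboost_predict`, the number of rounds, and the stopping rule (stop when a stump is perfect or no better than chance) are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_estimators=50, eta=1.0):
    """Train stumps sequentially, reweighting misclassified samples."""
    n = len(X)
    w = np.full(n, 1.0 / n)                 # initial weights: w_i = 1/N
    stumps, alphas = [], []
    for _ in range(n_estimators):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        r = w[miss].sum() / w.sum()         # weighted error rate r_j
        if r <= 0 or r >= 0.5:              # perfect or useless stump: stop
            break
        a = eta * np.log((1 - r) / r)       # predictor weight a_j
        w[miss] *= np.exp(a)                # boost misclassified samples
        w /= w.sum()                        # renormalize so weights sum to 1
        stumps.append(stump)
        alphas.append(a)
    return stumps, np.array(alphas)


def adaboost_predict(stumps, alphas, X, classes):
    """Pick the class with the largest sum of predictor weights."""
    votes = np.zeros((len(X), len(classes)))
    for stump, a in zip(stumps, alphas):
        pred = stump.predict(X)
        for k, c in enumerate(classes):
            votes[pred == c, k] += a
    return classes[votes.argmax(axis=1)]


X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
stumps, alphas = adaboost_fit(X, y)
y_hat = adaboost_predict(stumps, alphas, X, np.unique(y))
train_accuracy = np.mean(y_hat == y)
print(len(stumps), train_accuracy)
```

The sketch reports accuracy on the training data only, so it says nothing about generalization; it is meant to make the roles of $w_i$, $r_j$ and $a_j$ concrete.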



![](figures/ada.png)
Source: https://www.sciencedirect.com/topics/engineering/adaboost

sklearn implements AdaBoost using SAMME which stands for Stagewise Additive Modeling using a Multiclass Exponential Loss Function.

The following code trains an AdaBoost classifier based on 200 Decision stumps.  A Decision stump is basically a Decision Tree with max_depth=1.  This is the default base estimator of AdaBoostClassifier class:

In [11]:
from sklearn.ensemble import AdaBoostClassifier

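#SAMME.R - a variant of SAMME which relies on class probabilities 
#rather than predictions and generally performs better
#NOTE: "SAMME.R" was deprecated in scikit-learn 1.4 and removed in later releases;
#on recent versions, omit the algorithm argument (the score may then differ from the one shown)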
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)
y_pred = ada_clf.predict(X_test)
print("Ada score: ", accuracy_score(y_test, y_pred))

Ada score:  0.896


#### Gradient Boosting

Another popular boosting algorithm is Gradient Boosting.  Similar to AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor.  However, instead of tweaking the sample **weights** at every iteration, this method fits each new predictor to the **residual errors** made by the previous predictor.
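The idea of fitting residuals can be sketched by hand with three small regression trees.  The noisy quadratic dataset, the tree depth and the number of trees are arbitrary choices made for this illustration; in practice one would more likely use sklearn's <code>GradientBoostingRegressor</code> or <code>GradientBoostingClassifier</code>.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a noisy quadratic (assumed for illustration)
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

trees = []
residual = y.copy()
for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residual)                       # fit what is still unexplained
    residual = residual - tree.predict(X)       # errors left for the next tree
    trees.append(tree)


def ensemble_predict(X_new):
    """The ensemble prediction is the sum of all trees' predictions."""
    return sum(tree.predict(X_new) for tree in trees)


mse_first = np.mean((y - trees[0].predict(X)) ** 2)
mse_all = np.mean((y - ensemble_predict(X)) ** 2)
print(mse_first, mse_all)
```

Each added tree reduces the training error a little further, which is why the number of trees and the learning rate need to be controlled to avoid overfitting.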

