# Ensemble Learning

Suppose you pose a complex question to thousands of random people, then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert's answer. This is called the _wisdom of the crowd_. Similarly, if you aggregate the predictions of a group of predictors, you will often get better predictions than with the best individual predictor.

A group of predictors is called an __ensemble__; thus, this technique is called _Ensemble Learning_, and an Ensemble Learning algorithm is called an _Ensemble Method_.

## Random Forest
For example, you can train a group of Decision Tree classifiers, each on a different random subset of the training set. To make predictions, you obtain the predictions of all individual trees, then predict the class that gets the most votes. Such an ensemble of Decision Trees is called a __Random Forest__.

## Voting Classifiers
Suppose you have trained a few classifiers, such as a Logistic Regression classifier, an SVM classifier, a Random Forest classifier, a K-Nearest Neighbors classifier, and perhaps a few more. A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the _most votes_. This majority-vote classifier is called a __hard voting__ classifier.

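Even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner, provided there are a sufficient number of weak learners and they are sufficiently diverse, that is, their errors are as uncorrelated as possible.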
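
To see why many weak learners can add up to a strong one, here is a minimal sketch that computes the probability that a strict majority of independent binary classifiers is correct. The helper name `majority_vote_accuracy` and the 51% per-classifier accuracy are illustrative assumptions. Real classifiers trained on the same data are never fully independent, so actual gains are smaller than this idealized calculation suggests.

```python
import math

def majority_vote_accuracy(n_voters, p_correct):
    """Probability that a strict majority of n independent voters is right.

    Hypothetical helper: assumes a binary task and fully independent voters,
    each correct with probability p_correct (0 < p_correct < 1).
    """
    log_p = math.log(p_correct)
    log_q = math.log(1.0 - p_correct)
    total = 0.0
    # Sum the binomial probabilities of every outcome with more than half correct.
    # Working in log space avoids overflow for large n_voters.
    for k in range(n_voters // 2 + 1, n_voters + 1):
        log_comb = (math.lgamma(n_voters + 1)
                    - math.lgamma(k + 1)
                    - math.lgamma(n_voters - k + 1))
        total += math.exp(log_comb + k * log_p + (n_voters - k) * log_q)
    return total

# Odd ensemble sizes avoid ties in the vote.
for n in (1, 11, 101, 1001):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

Under these assumptions the majority-vote accuracy grows with the number of voters, even though each voter is barely better than a coin flip.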

In [1]:
from __future__ import division, print_function, unicode_literals
import warnings
warnings.filterwarnings("ignore") 

import numpy as np
import os
import pandas as pd

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import rc
plt.rcParams['axes.labelsize'] = 16
plt.rcParams['xtick.labelsize'] = 14
plt.rcParams['ytick.labelsize'] = 14
rc('text', usetex = True)
rc('font', family='serif')

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise = 0.3, random_state=42) 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard'
)

voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)), ('rf', RandomF...f', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [2]:
from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.888
SVC 0.888
VotingClassifier 0.888


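__Hard voting__ predicts the class that receives the most votes from the individual classifiers. If all classifiers can estimate class probabilities (i.e., they have a __predict_proba()__ method), you can instead average the predicted probabilities over all classifiers and predict the class with the highest average probability. This is called __soft voting__. It often performs better than hard voting because it gives more weight to highly confident votes. Note that __SVC__ does not estimate class probabilities by default, so you need to set its __probability__ hyperparameter to __True__.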
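
The difference between the two voting schemes can be illustrated with a small, library-free sketch. The probabilities below are made up for a single instance, and `hard_vote` and `soft_vote` are hypothetical helpers, not scikit-learn functions.

```python
from collections import Counter

# Hypothetical predicted probabilities [P(class 0), P(class 1)] for one instance.
probas = [
    [0.55, 0.45],  # classifier A: weakly prefers class 0
    [0.60, 0.40],  # classifier B: weakly prefers class 0
    [0.05, 0.95],  # classifier C: strongly prefers class 1
]

def hard_vote(probas):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas):
    """Average the probabilities per class, then pick the highest average."""
    n_classes = len(probas[0])
    averages = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=averages.__getitem__)

print("hard:", hard_vote(probas))
print("soft:", soft_vote(probas))
```

Here hard voting picks class 0 (two votes to one), while soft voting picks class 1, because the single confident classifier pulls the average probability of class 1 to about 0.6.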

In [3]:
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)
voting_clf_soft = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft'
)
for clf in (log_clf, rnd_clf, svm_clf, voting_clf_soft):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.904
SVC 0.888
VotingClassifier 0.928


## Bagging and Pasting

One approach to ensemble learning is to use the same training algorithm for every predictor but to train each predictor on a different random subset of the training set. When sampling is performed with replacement, this method is called _bagging_ (short for bootstrap aggregating). When sampling is performed without replacement, it is called _pasting_. For a __bagging__ classifier, set the hyperparameter __bootstrap__ to __True__; for pasting, set it to __False__.
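
As a rough illustration of the two sampling schemes, the sketch below draws one subset each way using only the standard library. The 10-instance "training set", the subset size of 6, and the seed are arbitrary assumptions; scikit-learn performs its own sampling internally.

```python
import random

rng = random.Random(42)          # fixed seed, chosen arbitrarily
indices = list(range(10))        # stand-in for a 10-instance training set

# Bagging: sample with replacement, so an index may appear several times.
bagging_sample = rng.choices(indices, k=6)

# Pasting: sample without replacement, so every index appears at most once.
pasting_sample = rng.sample(indices, k=6)

print("bagging:", sorted(bagging_sample))
print("pasting:", sorted(pasting_sample))
```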

In [35]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 500 Decision Trees, each trained on 100 instances sampled with replacement
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1
)

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

The __BaggingClassifier__ automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., it has a __predict_proba()__ method).

In [36]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.92

## Out-of-Bag Evaluation

By default, a __BaggingClassifier__ samples $m$ training instances with replacement (__bootstrap=True__), where $m$ is the size of the training set. This means that only about $63\%$ of the distinct training instances are sampled on average for each predictor. This is because the probability that a given instance is picked at least once in $m$ draws is $1 - (1 - \frac{1}{m})^m$, which approaches $1 - e^{-1} \approx 0.632$ as $m$ grows.
The remaining $37\%$ of the training instances that are not sampled are called _out-of-bag_ (oob) instances. Note that they are not the same $37\%$ for all predictors. Since a predictor never sees its oob instances during training, it can be evaluated on them without the need for a separate validation set or cross-validation. To request this evaluation automatically after training, set __oob_score=True__ when creating a __BaggingClassifier__.
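
The 63% figure can be checked numerically. The following sketch compares the closed-form expression with a simple bootstrap simulation. The helper names, trial count, and seed are assumptions made for illustration.

```python
import math
import random

def expected_sampled_fraction(m):
    """Chance that a given instance appears at least once in m draws with replacement."""
    return 1.0 - (1.0 - 1.0 / m) ** m

def simulated_sampled_fraction(m, n_trials=200, seed=0):
    """Average share of distinct instances in a bootstrap sample of size m."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        sample = rng.choices(range(m), k=m)   # m draws with replacement
        total += len(set(sample)) / m
    return total / n_trials

limit = 1.0 - math.exp(-1.0)   # value the formula approaches as m grows
for m in (10, 100, 375):
    print(m, expected_sampled_fraction(m), simulated_sampled_fraction(m))
print("limit:", limit)
```

For small $m$ the sampled fraction is slightly above the limit, and it settles close to $0.632$ once $m$ reaches a few hundred.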

In [37]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True
)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.9013333333333333

In [32]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.904

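The oob decision function for each training instance is also available through the __oob_decision_function___ attribute. Since the base estimator has a __predict_proba()__ method, it returns the class probabilities for each training instance.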

In [33]:
bag_clf.oob_decision_function_[0:10]

array([[0.32972973, 0.67027027],
       [0.31707317, 0.68292683],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.00552486, 0.99447514],
       [0.09      , 0.91      ],
       [0.31034483, 0.68965517],
       [0.01298701, 0.98701299],
       [0.98795181, 0.01204819],
       [0.98979592, 0.01020408]])

In [38]:
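# Note: oob_decision_function_ has one row per training instance, so this is
# len(X_train) / 2 / len(y) = 375 / 2 / 500; it does not measure the oob fraction.
len(bag_clf.oob_decision_function_) / 2 / len(y)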

0.375

## Random Patches and Random Subspaces