In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

# Ensemble Learning and Random Forests

The main idea is the following: if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an *ensemble*; thus, this technique is called *Ensemble Learning*, and an Ensemble Learning algorithm is called an *Ensemble method*.

## Voting classifiers

A simple way to create a better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a ***Hard Voting Classifier***. It often achieves a higher accuracy than the best classifier in the ensemble. In fact, even if each classifier is a *weak learner* (meaning it does only slightly better than random guessing), the ensemble can still be a *strong learner*, provided there are a sufficient number of weak learners and they are sufficiently diverse.

Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble's accuracy.
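The weak-learner claim above can be checked with a small simulation. This is a minimal sketch, not part of the original notebook: the number of classifiers, the number of instances, and the 51% accuracy are all hypothetical values, and the classifiers are assumed to be perfectly independent, which real classifiers trained on the same data are not.

```python
import numpy as np

rng = np.random.default_rng(42)

n_classifiers = 1000   # hypothetical number of weak learners
n_instances = 10_000   # hypothetical number of test instances
p_correct = 0.51       # each weak learner is right 51% of the time

# For every instance, draw how many of the independent classifiers are right.
# Independence is the key (and optimistic) assumption of this sketch.
n_correct = rng.binomial(n_classifiers, p_correct, size=n_instances)

# Average accuracy of a single classifier.
individual_acc = n_correct.sum() / (n_classifiers * n_instances)
# The majority vote is right when more than half of the classifiers are right.
majority_acc = (n_correct > n_classifiers / 2).mean()

print(f"average individual accuracy: {individual_acc:.3f}")
print(f"majority-vote accuracy:      {majority_acc:.3f}")
```

Under these assumptions the majority vote is noticeably more accurate than any single learner. In practice the gain is smaller, because the classifiers' errors are correlated.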

In [2]:
#Import algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

#Import dataset
from sklearn.datasets import make_moons

#Import train test split
from sklearn.model_selection import train_test_split

In [3]:
X,y = make_moons(n_samples=100000,noise=0.4)

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [5]:
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

In [6]:
voting_clf = VotingClassifier(
            estimators=[("lr",log_clf),("rf",rnd_clf),("svm",svm_clf)],
            voting="hard",
            n_jobs=-1)

In [7]:
voting_clf.fit(X_train,y_train)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('rf',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                             

In [8]:
#import accuracy
from sklearn.metrics import accuracy_score

In [9]:
for clf in (log_clf,rnd_clf,svm_clf,voting_clf):
    clf.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__,accuracy_score(y_test,y_pred))

LogisticRegression 0.8309090909090909
RandomForestClassifier 0.8455757575757575
SVC 0.8613333333333333
VotingClassifier 0.8572727272727273


If all the classifiers are able to estimate class probabilities, then it is possible to predict the class with the highest class probability, averaged over all the individual classifiers. This is called ***Soft voting***. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. To use it, replace voting="hard" with voting="soft" and ensure that every classifier can estimate class probabilities. SVC does not do so by default, so it must be created with probability=True.
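As an illustration, here is a minimal soft-voting sketch. It is not the notebook's own experiment: it uses a much smaller dataset (an assumption made so that `SVC(probability=True)`, which runs an internal cross-validation, stays fast), and the `_s` variable names are hypothetical, chosen so they do not overwrite the variables defined above.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Smaller dataset than in the cells above (assumption, for speed).
X_s, y_s = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_s_train, X_s_test, y_s_train, y_s_test = train_test_split(
    X_s, y_s, test_size=0.33, random_state=42)

soft_voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier(random_state=42)),
        # probability=True is required so that SVC exposes predict_proba
        ("svm", SVC(probability=True, random_state=42)),
    ],
    voting="soft")
soft_voting_clf.fit(X_s_train, y_s_train)

soft_acc = accuracy_score(y_s_test, soft_voting_clf.predict(X_s_test))
print("Soft voting accuracy:", soft_acc)
```

With `voting="soft"`, the fitted ensemble also exposes `predict_proba`, which returns the averaged class probabilities.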

## Bagging and Pasting

Another approach is to use the same training algorithm for every predictor and train them on different random subsets of the training set. When sampling is performed with replacement, this method is called *bagging*; when it is performed without replacement, it is called *pasting*.
Both allow a training instance to be sampled several times across multiple predictors, but only bagging allows a training instance to be sampled several times for the same predictor.

Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the statistical mode for classification and the average for regression. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.
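To make the two steps concrete (sampling, then aggregating), here is a hand-rolled sketch of bagging with decision trees. It only illustrates the idea and is not how `BaggingClassifier` is implemented; the dataset size, the number of trees, the sample size, and the `_b` variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X_b, y_b = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_b_train, X_b_test, y_b_train, y_b_test = train_test_split(
    X_b, y_b, test_size=0.33, random_state=42)

n_trees = 51            # odd, so a majority vote on two classes cannot tie
sample_size = 100       # instances drawn for each tree
use_replacement = True  # True -> bagging, False -> pasting

trees = []
for _ in range(n_trees):
    # Draw the indices of the training subset for this tree.
    idx = rng.choice(len(X_b_train), size=sample_size, replace=use_replacement)
    tree = DecisionTreeClassifier(random_state=0)
    trees.append(tree.fit(X_b_train[idx], y_b_train[idx]))

# Aggregate with the statistical mode. Labels are 0/1, so the mode is
# simply "did more than half of the trees vote for class 1?".
votes = np.array([tree.predict(X_b_test) for tree in trees])
y_pred_manual = (votes.mean(axis=0) > 0.5).astype(int)

single_tree_acc = accuracy_score(y_b_test, trees[0].predict(X_b_test))
ensemble_acc = accuracy_score(y_b_test, y_pred_manual)
print("One tree:", single_tree_acc)
print("Ensemble:", ensemble_acc)
```

Setting `use_replacement = False` turns the same loop into pasting.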

In [10]:
#Import algorithms
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [11]:
bag_clf = BaggingClassifier(
        DecisionTreeClassifier(),n_estimators=500,
        max_samples=100,bootstrap=True,n_jobs=-1)

In [12]:
bag_clf.fit(X_train,y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                        class_weight=None,
                                                        criterion='gini',
                                                        max_depth=None,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort='deprecated',
                                                        random_state=None,


In [13]:
y_pred = bag_clf.predict(X_test)

In [14]:
accuracy_score(y_test,y_pred)

0.8618181818181818

The ensemble has a comparable bias but a smaller variance than a single Decision Tree: it makes roughly the same number of errors on the training set, but its decision boundary is less irregular.

Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting. The extra diversity also means that the predictors end up being less correlated, so the ensemble's variance is reduced.

## Out-of-Bag evaluation

With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. When each predictor draws as many instances (with replacement) as the training set contains, only about 63% of the training instances are sampled on average for each predictor. The remaining 37% of the training instances that are not sampled are called *out-of-bag* (oob) instances. They are not the same for all the predictors. Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need for a separate validation set. The ensemble can be evaluated by averaging the oob evaluations of each predictor. In Scikit-Learn, set oob_score=True to request an automatic oob evaluation after training.
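The 63% figure follows from a short calculation: when $m$ instances are drawn with replacement from a training set of size $m$, a given instance is drawn at least once with probability $1 - (1 - 1/m)^m$, which tends to $1 - e^{-1} \approx 0.632$ as $m$ grows. The sketch below checks this numerically; the training-set size `m` is a hypothetical value.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100_000  # hypothetical training-set size

# One bootstrap sample: m draws with replacement from the indices 0..m-1.
sample = rng.integers(0, m, size=m)
in_bag_fraction = np.unique(sample).size / m

# Probability that a given instance is drawn at least once.
expected = 1 - (1 - 1 / m) ** m

print(f"simulated in-bag fraction: {in_bag_fraction:.3f}")
print(f"1 - (1 - 1/m)^m:           {expected:.3f}")
print(f"1 - 1/e:                   {1 - np.exp(-1):.3f}")
```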

In [15]:
bag_clf = BaggingClassifier(
        DecisionTreeClassifier(),
        bootstrap=True,
        n_estimators=500,
        n_jobs=-1,
        oob_score=True)

In [16]:
bag_clf.fit(X_train,y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                        class_weight=None,
                                                        criterion='gini',
                                                        max_depth=None,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort='deprecated',
                                                        random_state=None,


In [17]:
bag_clf.oob_score_

0.841044776119403

In [18]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test,y_pred)

0.8423030303030303

## Random Patches and Random Subspaces

Features can be sampled as well, so that each predictor is trained on a random subset of the input features. This technique is particularly useful when dealing with high-dimensional inputs. Sampling both training instances and features is called the *Random Patches method*. Keeping all training instances but sampling features is called the *Random Subspaces method*. Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
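In Scikit-Learn, `BaggingClassifier` controls feature sampling through `max_features` and `bootstrap_features`, which mirror `max_samples` and `bootstrap` for instances. The sketch below shows one possible configuration of each method. The synthetic 20-feature dataset, the sampling ratios, and the `_hd` variable names are assumptions chosen for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset with more features than the moons data.
X_hd, y_hd = make_classification(n_samples=1000, n_features=20,
                                 n_informative=5, random_state=42)
X_hd_train, X_hd_test, y_hd_train, y_hd_test = train_test_split(
    X_hd, y_hd, test_size=0.33, random_state=42)

# Random Patches: sample both training instances and features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    random_state=42)

# Random Subspaces: keep all training instances, sample only features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=True,
    random_state=42)

patches_acc = patches_clf.fit(X_hd_train, y_hd_train).score(X_hd_test, y_hd_test)
subspaces_acc = subspaces_clf.fit(X_hd_train, y_hd_train).score(X_hd_test, y_hd_test)
print("Random Patches:  ", patches_acc)
print("Random Subspaces:", subspaces_acc)
```

After fitting, `estimators_features_` lists the feature indices each predictor was trained on.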

## Random Forest

A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method, typically with max_samples set to the size of the training set.

In [19]:
from sklearn.ensemble import RandomForestClassifier

In [20]:
rnd_clf = RandomForestClassifier(n_estimators=500,max_leaf_nodes=16,n_jobs=-1)
rnd_clf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=16, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [21]:
y_pred_rf = rnd_clf.predict(X_test)

In [22]:
accuracy_score(y_test,y_pred_rf)

0.8575757575757575

The Random Forest algorithm introduces extra randomness when growing trees. Instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in greater tree diversity, which trades a higher bias for a lower variance, generally yielding an overall better model.

In [23]:
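#Roughly equivalent to the RandomForestClassifier above:
#max_features="sqrt" makes each split consider only a random subset of features
bag_clf = BaggingClassifier(
        DecisionTreeClassifier(max_features="sqrt",max_leaf_nodes=16),
        n_estimators=500,max_samples=1.0,bootstrap=True,n_jobs=-1)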

## Extra-Trees

It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds (as regular Decision Trees do).

A forest of such extremely random trees is called an Extremely Randomized Trees ensemble (Extra-Trees). This technique trades more bias for a lower variance. It also makes Extra-Trees much faster to train than regular Random Forests, because finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.
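Scikit-Learn provides this ensemble as `ExtraTreesClassifier`, whose constructor takes largely the same hyperparameters as `RandomForestClassifier`. The following is a minimal sketch on a small moons dataset; the dataset size, the hyperparameter values, and the `_e` variable names are assumptions made for this example.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_e, y_e = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_e_train, X_e_test, y_e_train, y_e_test = train_test_split(
    X_e, y_e, test_size=0.33, random_state=42)

# Same style of settings as the RandomForestClassifier cell above,
# but split thresholds are drawn at random instead of being optimized.
ext_clf = ExtraTreesClassifier(n_estimators=100, max_leaf_nodes=16,
                               n_jobs=-1, random_state=42)
ext_clf.fit(X_e_train, y_e_train)

ext_acc = accuracy_score(y_e_test, ext_clf.predict(X_e_test))
print("Extra-Trees accuracy:", ext_acc)
```

It is hard to tell in advance whether a Random Forest or an Extra-Trees ensemble will perform better on a given dataset, so comparing both (for example with cross-validation) is a reasonable approach.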

## Feature Importance

Another quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average. More precisely, it is a weighted average, where each node's weight is equal to the number of training samples that are associated with it.

In [24]:
from sklearn.datasets import load_iris

In [25]:
iris = load_iris()

In [27]:
rnd_clf = RandomForestClassifier(n_estimators=500,n_jobs=-1)
rnd_clf.fit(iris["data"],iris["target"])

for name, score in zip(iris["feature_names"],rnd_clf.feature_importances_):
    print(name,score)

sepal length (cm) 0.09836622532432085
sepal width (cm) 0.024746491815825236
petal length (cm) 0.42054391243883515
petal width (cm) 0.45634337042101875


Random Forests are very handy to get a quick understanding of what features actually matter, in particular if you need to perform feature selection.

## Boosting

Boosting (originally called hypothesis boosting) refers to any ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.

## AdaBoost