*Ensemble Learning* involves training a group of predictors (an *ensemble*) and aggregating their predictions to achieve better accuracy than any of the individual models.

Ensemble methods are typically applied near the end of a project after some effective predictors have been identified.

Topics:
- Voting Classifiers
- Bagging & Pasting
- Random Forests
- Boosting
- Stacking

# Voting Classifiers

*Voting classifiers* aggregate the predictions of different classifiers.

- *Hard voting*: each classifier makes a prediction and the class with the most votes is selected
- *Soft voting*: each classifier determines the probability of each class and the class with the highest average probability is selected
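
The two rules can disagree when one confident classifier is outvoted by two lukewarm ones. The sketch below uses made-up probabilities (the `probas` array is hypothetical, not output from the models in this notebook) and plain NumPy rather than `VotingClassifier`.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for ONE instance.
# Rows are classifiers, columns are classes 0 and 1.
probas = np.array([
    [0.55, 0.45],  # classifier A leans slightly toward class 0
    [0.55, 0.45],  # classifier B leans slightly toward class 0
    [0.10, 0.90],  # classifier C is confident in class 1
])

# Hard voting: each classifier casts one vote for its most probable class.
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most probable class.
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print("hard voting picks class", hard_prediction)
print("soft voting picks class", soft_prediction)
```

Soft voting gives more weight to confident predictions, which is why it requires every classifier to estimate class probabilities.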

An ensemble's accuracy tends to improve when its classifiers are more independent of one another. Different algorithms make different types of errors, so a diverse group is less likely to repeat the same mistakes, which makes the ensemble more robust.
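
To get a feel for why independence matters, consider an idealized case in which every classifier is correct with the same probability and all errors are fully independent. The helper `majority_vote_accuracy` below is a hypothetical illustration, not part of Scikit-Learn; it computes the chance that a strict majority is right from the binomial distribution.

```python
from math import comb

def majority_vote_accuracy(n_classifiers: int, p_correct: float) -> float:
    """Probability that a strict majority of classifiers is correct.

    Assumes each classifier is right with probability p_correct and that
    their errors are independent (an idealization).
    """
    majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k)
        * p_correct ** k
        * (1 - p_correct) ** (n_classifiers - k)
        for k in range(majority, n_classifiers + 1)
    )

# Accuracy of the majority vote as more 60%-accurate classifiers are added
for n in (1, 3, 11, 101):
    print(f"{n:>3} classifiers: {majority_vote_accuracy(n, 0.6):.3f}")
```

Real classifiers trained on the same data make correlated errors, so actual gains are smaller than this idealized calculation suggests.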

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

# Demonstration using the moons dataset
X, y = make_moons(n_samples=500, noise=0.30)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)

hard_voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard'
)
hard_voting_clf.fit(X_train, y_train)


In [3]:
from sklearn.metrics import accuracy_score

def accuracies(clf_tuple):
    # Check the accuracy of each model
    for clf in clf_tuple:
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print(f"{clf.__class__.__name__}: {accuracy_score(y_test, y_pred)}")

accuracies((log_clf, rnd_clf, svm_clf, hard_voting_clf))

LogisticRegression: 0.816
RandomForestClassifier: 0.896
SVC: 0.904
VotingClassifier: 0.896


In [4]:
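# Soft voting needs class probabilities from every estimator.
# SVC(probability=True) estimates them with internal cross-validation, which slows training.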
soft_voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft'
)
soft_voting_clf.fit(X_train, y_train)


In [5]:
accuracies((log_clf, rnd_clf, svm_clf, soft_voting_clf))

LogisticRegression: 0.816
RandomForestClassifier: 0.896
SVC: 0.904
VotingClassifier: 0.904


# Bagging & Pasting

Another ensemble learning approach is to use the same training algorithm for every predictor but train each one on a different random subset of the training data. *Bagging* (short for bootstrap aggregating) and *pasting* are two sampling methods for this approach.

- *Bagging*: sampling with replacement
- *Pasting*: sampling without replacement
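
The two schemes differ only in whether an instance can be drawn more than once for the same predictor. A minimal NumPy sketch, assuming a training set of 375 instances and subsets of 100 (the sizes used by the cells in this notebook; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n_train = 375       # assumed training-set size
subset_size = 100   # instances drawn for one predictor

# Bagging: with replacement, so an index may appear several times
bagging_idx = rng.choice(n_train, size=subset_size, replace=True)

# Pasting: without replacement, so every index is distinct
pasting_idx = rng.choice(n_train, size=subset_size, replace=False)

print("distinct instances (bagging):", np.unique(bagging_idx).size)
print("distinct instances (pasting):", np.unique(pasting_idx).size)
```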

Predictions are made via aggregation: typically the statistical mode (the most frequent prediction) for classification, as in a hard voting classifier, or the average for regression. Each individual predictor has a higher bias than it would if trained on the full training set, but aggregation reduces both bias and variance. The ensemble generally ends up with a similar bias but a lower variance than a single predictor trained on the full training set.
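
The procedure can be written out by hand to make the aggregation step explicit. This is a rough sketch under stated assumptions, not how `BaggingClassifier` is implemented: the fixed seeds, the 51 trees, and the variable names are all choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(42)
n_estimators = 51   # odd, so a binary majority vote cannot tie
subset_size = 100

trees = []
for _ in range(n_estimators):
    # Bagging: draw row indices with replacement (replace=False would be pasting)
    idx = rng.choice(len(X_train), size=subset_size, replace=True)
    tree = DecisionTreeClassifier(random_state=0)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# One row of test-set predictions per tree
all_preds = np.array([tree.predict(X_test) for tree in trees])

# Mode over trees: labels are 0/1, so the mode is 1 when more than half vote 1
ensemble_pred = (all_preds.sum(axis=0) > n_estimators / 2).astype(int)

manual_acc = accuracy_score(y_test, ensemble_pred)
print(f"Manual bagging accuracy: {manual_acc:.3f}")
```

Note that Scikit-Learn's `BaggingClassifier` averages class probabilities instead of counting votes when the base estimator provides `predict_proba`, so its predictions can differ slightly from this hand-rolled majority vote.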

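Because the predictors are independent of one another, they can be trained in parallel on different CPU cores or servers, and predictions can be made in parallel as well. This makes bagging and pasting attractive methods for scaling.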

In [6]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 500 decision trees with 100 samples each, using maximum number of CPU cores
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), 
    n_estimators=500, 
    max_samples=100,
    bootstrap=True,
    n_jobs=-1
)
bag_clf.fit(X_train, y_train)



In [7]:
y_pred = bag_clf.predict(X_test)
print(f"BaggingClassifier: {accuracy_score(y_test, y_pred)}")

BaggingClassifier: 0.904


In [8]:
# Disable bootstrap for pasting
past_clf = BaggingClassifier(
    DecisionTreeClassifier(), 
    n_estimators=500, 
    max_samples=100,
    bootstrap=False,
    n_jobs=-1
)
past_clf.fit(X_train, y_train)



In [9]:
y_pred = past_clf.predict(X_test)
print(f"PastingClassifier: {accuracy_score(y_test, y_pred)}")

PastingClassifier: 0.904


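With bagging, some instances are drawn several times for a given predictor while others are never drawn at all. When each predictor draws as many instances as the training set contains (m draws with replacement from m instances), it sees only about 63% of the distinct training instances on average. The remaining ~37%, which differ from one predictor to the next, are called *out-of-bag* (oob) instances. In the cells below max_samples=100, so each predictor's out-of-bag share is larger still. Because a predictor never sees its oob instances during training, they can be used to estimate the ensemble's accuracy without setting aside a separate validation set.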
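
The ~63% figure follows from the chance that a given instance is missed by all m draws, (1 - 1/m)^m, which approaches 1/e ≈ 0.368 as m grows. A quick numerical check with NumPy (the value of `m` and the seed are arbitrary choices for this sketch):

```python
import numpy as np

m = 10_000  # assumed training-set size; also the number of draws

# Analytical fraction of distinct instances seen by one predictor
expected_fraction = 1 - (1 - 1 / m) ** m

# Empirical check: draw m indices with replacement and count the distinct ones
rng = np.random.default_rng(0)
sample = rng.integers(0, m, size=m)
observed_fraction = np.unique(sample).size / m

print(f"expected: {expected_fraction:.4f}")
print(f"observed: {observed_fraction:.4f}")
print(f"limit 1 - 1/e: {1 - np.exp(-1):.4f}")
```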

In [10]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), 
    n_estimators=500, 
    max_samples=100,
    bootstrap=True,
    n_jobs=-1,
    oob_score=True
)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.9173333333333333

In [11]:
# Compare with accuracy using the test set
accuracy_score(y_test, bag_clf.predict(X_test))

0.904

In [12]:
# Class probabilities for the first five training instances, each averaged over
# the predictors for which that instance was out-of-bag
bag_clf.oob_decision_function_[:5]

array([[0.79842932, 0.20157068],
       [0.89159892, 0.10840108],
       [0.97738693, 0.02261307],
       [0.84776903, 0.15223097],
       [0.00520833, 0.99479167]])
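
The oob score and the oob decision function are two views of the same estimate: taking the most probable class in each row and comparing it with the training labels should reproduce `oob_score_`. The sketch below refits a smaller ensemble with fixed seeds (an assumption made for reproducibility, so its numbers will not match the cells above).

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=100,
    bootstrap=True,
    oob_score=True,
    random_state=42,
)
bag_clf.fit(X_train, y_train)

# One row per training instance, one column per class
oob_proba = bag_clf.oob_decision_function_

# Most probable class per training instance, mapped back to the class labels
oob_pred = bag_clf.classes_[oob_proba.argmax(axis=1)]
manual_oob_score = accuracy_score(y_train, oob_pred)

print(f"oob_score_:       {bag_clf.oob_score_:.4f}")
print(f"manual oob score: {manual_oob_score:.4f}")
```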

Predictors can also be trained on a random subset of the input features.

- *Random Patches*: sampling both instances and features
- *Random Subspaces*: keeping all instances and sampling only features

In [13]:
# Random patches
bag_clf_patch = BaggingClassifier(
    DecisionTreeClassifier(), 
    n_estimators=500, 
    max_samples=100,
    bootstrap=True,
    n_jobs=-1,
    oob_score=True,
    bootstrap_features=True,
    max_features=0.5
)
bag_clf_patch.fit(X_train, y_train)
bag_clf_patch.oob_score_

0.872

In [14]:
accuracy_score(y_test, bag_clf_patch.predict(X_test))

0.832

In [15]:
# Random subspaces
bag_clf_sub = BaggingClassifier(
    DecisionTreeClassifier(), 
    n_estimators=500, 
    max_samples=1.0,
    bootstrap=False,
    n_jobs=-1,
    bootstrap_features=True,
    max_features=0.5
)
bag_clf_sub.fit(X_train, y_train)
accuracy_score(y_test, bag_clf_sub.predict(X_test))

0.704

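Random patches and subspaces are mostly useful when handling high-dimensional inputs such as images. The moons dataset has only two features, so max_features=0.5 leaves each tree a single feature to split on, which explains the drop in accuracy above.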

# Random Forests

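A *Random Forest* is an ensemble of decision trees, typically trained via bagging with the max_samples hyperparameter set to the size of the training set. It also adds randomness when growing each tree: at every split it searches for the best feature among a random subset of the features rather than among all of them.

The RandomForestClassifier is an optimized alternative to passing a DecisionTreeClassifier to a BaggingClassifier.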
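
As a rough illustration of that equivalence, the sketch below builds both versions with matching settings. The seeds and the reduced `n_estimators=100` are assumptions made to keep the example quick and reproducible, and the two ensembles are comparable rather than identical because their random draws differ.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging of trees that consider a random subset of features at each split
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=100,
    random_state=42,
)

# The dedicated class with the same tree settings
rnd_clf = RandomForestClassifier(
    n_estimators=100, max_leaf_nodes=16, random_state=42
)

bag_acc = accuracy_score(y_test, bag_clf.fit(X_train, y_train).predict(X_test))
rnd_acc = accuracy_score(y_test, rnd_clf.fit(X_train, y_train).predict(X_test))

print(f"BaggingClassifier of trees: {bag_acc:.3f}")
print(f"RandomForestClassifier:     {rnd_acc:.3f}")
```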

In [16]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=16, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [20]:
y_pred_rf = rnd_clf.predict(X_test)
accuracy_score(y_test, y_pred_rf)

0.896

Random Forests can be made even more random by using random feature thresholds when growing trees instead of searching for the best possible thresholds. These trees are called *Extremely Randomized Trees*, or *Extra-Trees*.

In [19]:
from sklearn.ensemble import ExtraTreesClassifier

et_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
et_clf.fit(X_train, y_train)

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=16, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=-1,
                     oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

In [21]:
y_pred_et = et_clf.predict(X_test)
accuracy_score(y_test, y_pred_et)

0.864

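Random Forests also make it easy to measure feature importance. Scikit-Learn scores a feature by how much the tree nodes that use it reduce impurity on average, weighted by the number of training samples each node covers, across all trees in the forest. The scores are scaled so that they sum to 1.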

In [22]:
from sklearn.datasets import load_iris

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris['data'], iris['target'])

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [23]:
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(f'{name}: {score}')

sepal length (cm): 0.09936617140725601
sepal width (cm): 0.023920764344037643
petal length (cm): 0.43623714209033443
petal width (cm): 0.4404759221583719
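
For datasets with more features it is usually easier to read the scores ranked from most to least important. A small sketch, assuming a fixed seed and 100 trees (so the exact values will differ from the output above):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rnd_clf.fit(iris["data"], iris["target"])

importances = rnd_clf.feature_importances_

# Pair each feature name with its score and sort by score, highest first
ranking = sorted(
    zip(iris["feature_names"], importances),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name:<20} {score:.3f}")

# The scores are normalized, so they add up to 1
print(f"total: {importances.sum():.3f}")
```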
