*Ensemble Learning* involves training a group of predictors (an *ensemble*) and aggregating their predictions, which often yields better accuracy than the best individual model.

A *Random Forest* is an ensemble method that trains a group of decision trees on different random subsets of the training data. Predictions are made by each tree and the class with the most 'votes' is predicted.

Ensemble methods are typically applied near the end of a project after some effective predictors have been identified.

Topics:
- Voting Classifiers
- Bagging & Pasting
- Random Forests
- Boosting
- Stacking

# Voting Classifiers

*Voting classifiers* aggregate the predictions of different classifiers.

- *Hard voting*: each classifier makes a prediction and the class with the most votes is selected
- *Soft voting*: each classifier determines the probability of each class and the class with the highest average probability is selected
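The two voting schemes can disagree on the same instance. The sketch below uses made-up probabilities for three hypothetical classifiers (the values and variable names are illustrative assumptions, not output from the models trained later) to show how a single confident classifier can swing a soft vote while losing a hard vote.

```python
import numpy as np

# Hypothetical predicted probabilities from three classifiers for one instance
# (rows: classifiers, columns: classes 0 and 1). The values are made up.
probas = np.array([
    [0.55, 0.45],  # classifier A leans slightly towards class 0
    [0.60, 0.40],  # classifier B leans slightly towards class 0
    [0.05, 0.95],  # classifier C is very confident in class 1
])

# Hard voting: each classifier votes for its most likely class; the majority wins
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities per class, then pick the highest average
mean_probas = probas.mean(axis=0)
soft_prediction = mean_probas.argmax()

print("hard vote:", hard_prediction)
print("soft vote:", soft_prediction, "average probabilities:", mean_probas)
```

Here two of the three classifiers vote for class 0, so hard voting picks class 0, while the averaged probabilities favour class 1, so soft voting picks class 1.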

An ensemble's accuracy tends to improve when the classifiers are as independent of one another as possible. Different algorithms tend to make different types of errors, so a diverse group is less likely to repeat the same mistakes, which makes the ensemble more robust.
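A small simulation can illustrate why independence matters. It is a sketch under idealized assumptions: the classifiers are simulated as coin flips that are correct 55% of the time, and the names `p_correct`, `acc_independent`, and `acc_correlated` are chosen here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_instances, n_classifiers, p_correct = 2000, 501, 0.55

# Independent case: each classifier is right with probability p_correct,
# independently of all the others (an idealized assumption).
independent = rng.random((n_instances, n_classifiers)) < p_correct
acc_independent = (independent.sum(axis=1) > n_classifiers / 2).mean()

# Fully correlated case: every classifier makes exactly the same mistakes,
# so the majority vote is no better than a single classifier.
shared = rng.random((n_instances, 1)) < p_correct
correlated = np.repeat(shared, n_classifiers, axis=1)
acc_correlated = (correlated.sum(axis=1) > n_classifiers / 2).mean()

print(f"majority vote, independent errors: {acc_independent:.3f}")
print(f"majority vote, identical errors:   {acc_correlated:.3f}")
```

With independent errors the majority vote is right far more often than any single simulated classifier, while with identical errors it stays near 55%. Real classifiers trained on the same data fall somewhere between these two extremes.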

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

# Demonstration using the moons dataset
X, y = make_moons(n_samples=500, noise=0.30)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [16]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)

hard_voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard'
)
hard_voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('rf',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                             

In [20]:
from sklearn.metrics import accuracy_score

def accuracies(clf_tuple):
    # Check the accuracy of each model
    for clf in clf_tuple:
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print(f"{clf.__class__.__name__}: {accuracy_score(y_test, y_pred)}")

accuracies((log_clf, rnd_clf, svm_clf, hard_voting_clf))

LogisticRegression: 0.864
RandomForestClassifier: 0.896
SVC: 0.896
VotingClassifier: 0.896


In [18]:
# Soft voting needs predict_proba from every estimator; SVC only provides it
# with probability=True, which increases training time
soft_voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft'
)
soft_voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('rf',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                             

In [21]:
accuracies((log_clf, rnd_clf, svm_clf, soft_voting_clf))

LogisticRegression: 0.864
RandomForestClassifier: 0.888
SVC: 0.896
VotingClassifier: 0.904


# Bagging & Pasting

Another ensemble learning approach is to use the same training algorithm for every predictor but train each one on a different random subset of the training data. *Bagging* (short for bootstrap aggregating) and *pasting* are two sampling methods for this approach.

- *Bagging*: sampling with replacement
- *Pasting*: sampling without replacement
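The difference between the two sampling schemes can be seen directly on a set of indices. This is a minimal sketch using NumPy; the sizes `n_train` and `n_sample` are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, n_sample = 10, 6
indices = np.arange(n_train)

# Bagging: sample with replacement, so the same index may be drawn more than once
bagging_sample = rng.choice(indices, size=n_sample, replace=True)

# Pasting: sample without replacement, so every drawn index is distinct
pasting_sample = rng.choice(indices, size=n_sample, replace=False)

print("bagging:", np.sort(bagging_sample))
print("pasting:", np.sort(pasting_sample))
```

Either way, each predictor in the ensemble would get its own sample; only bagging allows a training instance to appear several times within one predictor's sample.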

Predictions are made by aggregation: typically the statistical mode (the most frequent prediction) for classification, much like a hard voting classifier, or the average for regression. Each predictor has a higher bias than it would if trained on the full training set, but aggregation reduces both bias and variance. The net result is usually an ensemble with similar bias and lower variance than a single predictor trained on the full training set.
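To make the aggregation step concrete, here is a hand-rolled sketch of bagging with decision trees on the moons dataset. It is a simplified illustration, not a description of how scikit-learn's `BaggingClassifier` works internally; the tree count, sample size, and seeds are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(42)
n_trees, n_sample = 101, 100  # an odd number of trees avoids tied votes

trees = []
for _ in range(n_trees):
    # Bagging: draw a bootstrap sample (with replacement) for each tree
    idx = rng.choice(len(X_train), size=n_sample, replace=True)
    tree = DecisionTreeClassifier(random_state=0)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# One row of predictions per tree, one column per test instance
all_preds = np.array([tree.predict(X_test) for tree in trees])

# Mode of binary labels: class 1 wins when more than half the trees vote for it
ensemble_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

ensemble_acc = (ensemble_pred == y_test).mean()
mean_tree_acc = (all_preds == y_test).mean()
print(f"average single-tree accuracy: {mean_tree_acc:.3f}")
print(f"ensemble accuracy:            {ensemble_acc:.3f}")
```

Switching `replace=True` to `replace=False` in the sampling line would turn this sketch into pasting.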

The predictors in a bagging or pasting ensemble can be trained in parallel, and their predictions can also be made in parallel, so both methods scale well across CPU cores.

In [22]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 500 decision trees with 100 samples each, using maximum number of CPU cores
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), 
    n_estimators=500, 
    max_samples=100,
    bootstrap=True,
    n_jobs=-1
)
bag_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                        class_weight=None,
                                                        criterion='gini',
                                                        max_depth=None,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort='deprecated',
                                                        random_state=None,


In [27]:
y_pred = bag_clf.predict(X_test)
print(f"BaggingClassifier: {accuracy_score(y_test, y_pred)}")

BaggingClassifier: 0.856


In [28]:
# Disable bootstrap for pasting
past_clf = BaggingClassifier(
    DecisionTreeClassifier(), 
    n_estimators=500, 
    max_samples=100,
    bootstrap=False,
    n_jobs=-1
)
past_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                        class_weight=None,
                                                        criterion='gini',
                                                        max_depth=None,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort='deprecated',
                                                        random_state=None,


In [29]:
y_pred = past_clf.predict(X_test)
print(f"PastingClassifier: {accuracy_score(y_test, y_pred)}")

PastingClassifier: 0.912


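When bagging draws as many instances as the training set contains, each predictor sees only about 63% of the distinct training instances on average. The remaining ~37% are *out-of-bag* (oob) for that predictor, and they differ from one predictor to the next. Because a predictor never sees its oob instances during training, they can be used to evaluate the ensemble without a separate validation set. With a smaller `max_samples`, as in the cell that follows, the oob fraction is larger.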
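The ~63% figure comes from drawing m instances with replacement from a training set of size m: a given instance is missed by every draw with probability (1 - 1/m)^m, which approaches 1/e ≈ 0.37 as m grows. The following sketch checks this numerically; the value of `m` and the seed are arbitrary choices for illustration.

```python
import numpy as np

m = 1000  # size of the training set and of the bootstrap sample

# Probability that a given instance appears at least once in the sample
expected_fraction = 1 - (1 - 1 / m) ** m

# Limit of that probability as m grows: 1 - 1/e
limit = 1 - np.exp(-1)

# Empirical check: draw m indices with replacement and count the distinct ones
rng = np.random.default_rng(0)
sample = rng.integers(0, m, size=m)
observed_fraction = np.unique(sample).size / m

print(f"expected: {expected_fraction:.4f}")
print(f"limit:    {limit:.4f}")
print(f"observed: {observed_fraction:.4f}")
```

All three values should land close to 0.63, leaving roughly 37% of the instances out-of-bag for that predictor.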

In [30]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), 
    n_estimators=500, 
    max_samples=100,
    bootstrap=True,
    n_jobs=-1,
    oob_score=True
)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.928

In [32]:
# Compare with accuracy using the test set
accuracy_score(y_test, bag_clf.predict(X_test))

0.912

In [36]:
# Class probabilities for each training instance, averaged over the
# predictors for which that instance was out-of-bag
bag_clf.oob_decision_function_[:5]

array([[0.375     , 0.625     ],
       [0.38802083, 0.61197917],
       [0.9972752 , 0.0027248 ],
       [0.01023018, 0.98976982],
       [0.03266332, 0.96733668]])

Predictors can also be trained on a random subset of the input features.

- *Random Patches*: sampling both instances and features
- *Random Subspaces*: keeping all instances and sampling only features

In [42]:
# Random patches
bag_clf_patch = BaggingClassifier(
    DecisionTreeClassifier(), 
    n_estimators=500, 
    max_samples=100,
    bootstrap=True,
    n_jobs=-1,
    oob_score=True,
    bootstrap_features=True,
    max_features=0.5
)
bag_clf_patch.fit(X_train, y_train)
bag_clf_patch.oob_score_

0.848

In [43]:
accuracy_score(y_test, bag_clf_patch.predict(X_test))

0.84

In [47]:
# Random subspaces
bag_clf_sub = BaggingClassifier(
    DecisionTreeClassifier(), 
    n_estimators=500, 
    max_samples=1.0,
    bootstrap=False,
    n_jobs=-1,
    bootstrap_features=True,
    max_features=0.5
)
bag_clf_sub.fit(X_train, y_train)
accuracy_score(y_test, bag_clf_sub.predict(X_test))

0.648

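Random patches and random subspaces are mostly useful for high-dimensional inputs such as images. The moons dataset has only two features, so `max_features=0.5` leaves each tree with a single feature, which likely explains the low scores above.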