##### MA755 Machine Learning - Ensemble Learning - 28 Mar 2017 

These notes are based on, and include images from, [_Hands-On Machine Learning with Scikit-Learn and TensorFlow_](http://shop.oreilly.com/product/0636920052289.do)
- by Aurélien Géron
- Published by O'Reilly Media, Inc., 2017

### Load libraries

In [1]:
import numpy             as np
import pandas            as pd

In [2]:
%matplotlib inline
import matplotlib        as mpl
import matplotlib.pyplot as plt
import seaborn           as sea

In [1]:
import sklearn.metrics         as sk_me
import sklearn.model_selection as sk_ms
import sklearn.datasets        as sk_ds

### Iris dataset

In [2]:
iris = sk_ds.load_iris()
(iris.data.shape, 
 iris.target.shape
)

((150, 4), (150,))

Store the feature/variable names for later:

In [5]:
iris["feature_names"]

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [6]:
iris_feature_names = iris["feature_names"]

Create the train and test datasets:

In [7]:
(train_data,   test_data,
 train_target, test_target
 ) = sk_ms.train_test_split(iris.data, 
                            iris.target, 
                            test_size=0.5, 
                            random_state=42)
(train_data.shape, train_target.shape, 
 test_data.shape,  test_target.shape
)

((75, 4), (75,), (75, 4), (75,))


### Chapter 7. Ensemble Learning and Random Forests

Below we describe several types of ensemble classifiers. 

__Hard voting__. Train multiple classifiers on the entire training dataset. For each instance choose the most common class prediction returned by the multiple classifiers. 

__Soft voting__. Train multiple classifiers on the entire training dataset. For each instance, average the class probabilities predicted by the classifiers and choose the class with the highest average probability. This requires every classifier to estimate class probabilities.

__Bagging__. Train the same type of classifier on multiple training subsets, created by sampling instances __with__ replacement. Choose the most common class prediction.

__Pasting__. Train the same type of classifier on multiple training subsets, created by sampling instances __without__ replacement. Choose the most common class prediction. 

__Random Subspaces__. Train the same type of classifier on multiple subsets, created by sampling the features used but keeping all instances. Choose the most common class prediction.

__Random Patches__. Train the same type of classifier on multiple subsets, created by sampling both features and instances. Choose the most common class prediction.

__AdaBoost__. Train a sequence of classifiers of the same type (e.g. decision trees), each on data that is resampled or reweighted so that instances the previous classifiers misclassified count for more.

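__Gradient Boosting__. Train a sequence of predictors, each fit to the residual errors of the ensemble built so far. It is usually introduced with regression, but scikit-learn provides both `GradientBoostingRegressor` and `GradientBoostingClassifier`. 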

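__Stacking__. Train several base classifiers, then train a final model (a blender) that takes their predictions as input and produces the ensemble prediction.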
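
Gradient boosting is not exercised later in these notes. As a minimal sketch (not part of the original notebook), scikit-learn's `GradientBoostingClassifier` can be fit on the same iris split used below. The hyperparameter values are illustrative assumptions, not tuned choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same split as the notes: half the data for training, random_state=42.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

# Shallow trees added one at a time, each correcting the current ensemble.
# n_estimators, learning_rate and max_depth are assumed, illustrative values.
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                    max_depth=2, random_state=0)
gb_clf.fit(X_train, y_train)

gb_accuracy = accuracy_score(y_test, gb_clf.predict(X_test))
print("Accuracy score for GradientBoostingClassifier:", gb_accuracy)
```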

### Hard voting classifier 

Train multiple classifiers on the entire training dataset. 

For each instance choose the most common class prediction returned by the multiple classifiers. 
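
The majority vote described above can be sketched with NumPy alone. The prediction matrix and the helper name `hard_vote` are made up for illustration. Ties go to the lowest class index here, because `argmax` returns the first maximum.

```python
import numpy as np

# Hypothetical predictions: one row per classifier, one column per instance.
predictions = np.array([[0, 1, 2, 1],
                        [0, 1, 1, 1],
                        [0, 2, 2, 1]])

def hard_vote(predictions, n_classes):
    """Return the most common predicted class for each instance (column)."""
    # counts[c, i] = number of classifiers that predicted class c for instance i
    counts = np.apply_along_axis(np.bincount, 0, predictions,
                                 minlength=n_classes)
    return counts.argmax(axis=0)

print(hard_vote(predictions, n_classes=3))
```

For the third instance, two of the three classifiers predict class 2, so class 2 wins even though one classifier disagrees.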

### Hard voting - example

Create four base classifiers:

In [8]:
from sklearn.tree         import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm          import SVC, LinearSVC

lr_clf = LogisticRegression()
dt_clf = DecisionTreeClassifier()
sv_clf = SVC(probability=True)
ls_clf = LinearSVC()

These classifiers are used at several points below as parts of the ensemble classifiers.

Create an ensemble, __hard voting__, classifier:

In [9]:
from sklearn.ensemble     import VotingClassifier

vo_clf = VotingClassifier(
        estimators=[('lr', lr_clf), 
                    ('dt', dt_clf), 
                    ('ls', ls_clf)],
        voting='hard'
    )
vo_clf

VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)), ('dt', Decisio...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))],
         n_jobs=1, voting='hard', weights=None)

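Compare the prediction accuracy of this ensemble classifier with that of individual classifiers. Note that the loop evaluates `sv_clf`, which is not a member of this ensemble, rather than `ls_clf`, which is: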

In [10]:
from sklearn.metrics import accuracy_score

for clf in (lr_clf, dt_clf, sv_clf, vo_clf):
    clf.fit(train_data, train_target)
    test_predict = clf.predict(test_data)
    print(round(accuracy_score(test_target, 
                               test_predict),
                3),
         clf.__class__.__name__)

0.973 LogisticRegression
0.933 DecisionTreeClassifier
0.987 SVC
0.987 VotingClassifier


### Soft voting classifier 

Create an ensemble, __soft voting__, classifier:

In [11]:
vo_clf = VotingClassifier(
        estimators=[('lr', lr_clf), 
                    ('dt', dt_clf), 
                    ('sv', sv_clf)],
        voting='soft'
    )
vo_clf

VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)), ('dt', Decisio...',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))],
         n_jobs=1, voting='soft', weights=None)

Compare the prediction accuracy of this ensemble classifier and of its base classifiers:

In [12]:
from sklearn.metrics import accuracy_score

for clf in (lr_clf, dt_clf, sv_clf, vo_clf):
    clf.fit(train_data, 
            train_target)
    test_predict = clf.predict(test_data)
    print(round(accuracy_score(test_target, 
                               test_predict),
                3),
          clf.__class__.__name__)

0.973 LogisticRegression
0.947 DecisionTreeClassifier
0.987 SVC
0.96 VotingClassifier
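
Soft voting averages the predicted class probabilities and picks the class with the highest mean, so it can disagree with hard voting when one classifier is very confident. The sketch below uses made-up probabilities for three hypothetical classifiers and two instances to show such a case.

```python
import numpy as np

# Hypothetical class probabilities, shape (classifiers, instances, classes).
probas = np.array([
    [[0.90, 0.10], [0.40, 0.60]],   # classifier A
    [[0.80, 0.20], [0.45, 0.55]],   # classifier B
    [[0.60, 0.40], [0.90, 0.10]],   # classifier C (very sure about instance 2)
])

# Soft voting: average the probabilities over classifiers, then take argmax.
mean_probas = probas.mean(axis=0)
soft_vote = mean_probas.argmax(axis=1)

# Hard voting on the same classifiers: each one votes for its own argmax.
votes = probas.argmax(axis=2)                      # shape (classifiers, instances)
hard_vote = np.array([np.bincount(col, minlength=2).argmax()
                      for col in votes.T])

print("soft:", soft_vote)
print("hard:", hard_vote)
```

On the second instance two classifiers lean weakly toward class 1, but the third is strongly in favour of class 0, so the averaged probability favours class 0 while the majority vote picks class 1.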


### Bagging and pasting - `BaggingClassifier`

The `BaggingClassifier` implements both the bagging and pasting techniques. 

It automatically performs __soft voting__ if the base classifiers can estimate class probabilities. 

Otherwise it performs __hard voting__.

### Bagging

- Trains a single classifier on multiple training subsets
- Creates these subsets by sampling instances __with__ replacement
- Chooses the most common class prediction

See http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html

Parameters:

- `bootstrap=True` indicates that __bagging__ is to be performed
- `n_estimators=5` indicates that 5 decision trees (base classifiers) are created
- `max_samples=10` sets the size of each training subset to 10 instances (one subset is drawn per base classifier)
- `n_jobs=-1` indicates that all available CPU cores are used to train the base classifiers

Create the classifier and train it on the training datasets:

In [13]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree     import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), bootstrap=True, 
                            n_estimators=5, max_samples=10, n_jobs=-1)
bag_clf.fit(train_data, 
            train_target)

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=10, n_estimators=5, n_jobs=-1, oob_score=False,
         random_state=None, verbose=0, warm_start=False)

Create predictions from `test_data` and check their accuracy:

In [14]:
test_predict = bag_clf.predict(test_data)
print("Bagging (with replacement sampling)")
print("Accuracy score: ", accuracy_score(test_target, 
                                         test_predict))

Bagging (with replacement sampling)
Accuracy score:  0.96


### Pasting

- Train a single classifier on multiple training subsets
- Create these subsets by sampling instances __without__ replacement
- Choose the most common class prediction

See http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html

Parameters:

- `bootstrap=False` indicates that __pasting__ is to be performed
- `n_estimators=500` indicates that 500 decision trees (base classifiers) are created
- `max_samples=10` sets the size of each training subset to 10 instances (one subset is drawn per base classifier)
- `n_jobs=-1` indicates that all available CPU cores are used to train the base classifiers

Create the classifier and train it on the training datasets:

In [15]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree     import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), bootstrap=False, 
                            n_estimators=500, max_samples=10, n_jobs=-1
    )
bag_clf.fit(train_data, 
            train_target)

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
         bootstrap=False, bootstrap_features=False, max_features=1.0,
         max_samples=10, n_estimators=500, n_jobs=-1, oob_score=False,
         random_state=None, verbose=0, warm_start=False)

Create predictions from `test_data` and check their accuracy:

In [16]:
test_predict = bag_clf.predict(test_data)
print("Pasting (without replacement)")
print("Accuracy score: ", accuracy_score(test_target, 
                                         test_predict))

Pasting (without replacement)
Accuracy score:  0.986666666667
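
`BaggingClassifier` also covers the random subspaces and random patches methods described at the top of the chapter, through its `max_features` and `bootstrap_features` parameters. The following is a minimal sketch. The values chosen for `n_estimators`, `max_samples` and `max_features` are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

# Random subspaces: every tree sees all instances but only 2 of the 4 features.
subspaces_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                  bootstrap=False, max_samples=1.0,
                                  bootstrap_features=False, max_features=2,
                                  random_state=0)

# Random patches: every tree sees a sample of instances AND a sample of features.
patches_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                bootstrap=True, max_samples=0.5,
                                bootstrap_features=False, max_features=2,
                                random_state=0)

for name, clf in (("Random subspaces", subspaces_clf),
                  ("Random patches", patches_clf)):
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))

# Feature indices used by the first tree of the subspaces ensemble.
print(subspaces_clf.estimators_features_[0])
```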


### Out-of-bag error

When using bagging, each base classifier is trained on a sample drawn __with replacement__, so some instances are never drawn for that classifier. These out-of-bag instances can be used to evaluate it, since it never saw them during training. 

__Out-of-bag error__ is the error rate on these unused instances. 

See https://en.wikipedia.org/wiki/Out-of-bag_error
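
To see how many instances a bootstrap sample leaves out, note that when $m$ instances are drawn with replacement from $m$, each instance is missed with probability $(1 - 1/m)^m$, which is about 37% for large $m$. The short simulation below is a sketch that checks this for $m = 75$, the size of the training set here. The number of trials is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 75            # training set size used in these notes
n_trials = 2000   # arbitrary number of simulated bootstrap samples

oob_fractions = []
for _ in range(n_trials):
    sample = rng.integers(0, m, size=m)          # draw m indices with replacement
    n_out_of_bag = m - np.unique(sample).size    # indices never drawn
    oob_fractions.append(n_out_of_bag / m)

simulated = np.mean(oob_fractions)
expected = (1 - 1 / m) ** m
print("simulated out-of-bag fraction:", simulated)
print("expected  out-of-bag fraction:", expected)
```

With `max_samples=10`, as in the cells below, each base classifier sees far fewer instances, so its out-of-bag set is much larger than this.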

Call `BaggingClassifier` with parameter `oob_score=True` and then fit to the training dataset:

In [17]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree     import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), bootstrap=True, 
                            n_estimators=500, max_samples=10, n_jobs=-1, 
                            oob_score=True
)
bag_clf.fit(train_data, 
            train_target)

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=10, n_estimators=500, n_jobs=-1, oob_score=True,
         random_state=None, verbose=0, warm_start=False)

The out-of-bag score, available in `oob_score_`, is the accuracy on the out-of-bag instances, so the out-of-bag error is one minus this value. 

In [18]:
print("Out-of-bag score: ", bag_clf.oob_score_)

Out-of-bag score:  0.906666666667
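
As a self-contained sketch of the same idea, the snippet below converts the score into an error rate and inspects `oob_decision_function_`, which holds the averaged out-of-bag class probabilities for each training instance. It uses the default `max_samples` and 100 trees, which are assumed values and differ from the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

oob_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=True, oob_score=True, random_state=0)
oob_clf.fit(X_train, y_train)

# oob_score_ is an accuracy, so the out-of-bag error is its complement.
oob_error = 1.0 - oob_clf.oob_score_
print("Out-of-bag error:", oob_error)

# One row per training instance, one column per class.
print(oob_clf.oob_decision_function_.shape)
```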


Compare this to the model accuracy on the test dataset:

In [19]:
from sklearn.metrics import accuracy_score

test_predict = bag_clf.predict(test_data)

print("Accuracy score: ", 
      accuracy_score(test_target, 
                     test_predict))

Accuracy score:  0.986666666667


### `RandomForestClassifier`

The Random Forest model is a bagging classifier that uses decision trees, with each tree considering only a random subset of the features at each split. 

Below two classifiers are created and evaluated. 

The first is `RandomForestClassifier` and the second uses `BaggingClassifier` and `DecisionTreeClassifier` to implement the random forest model.

See http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Parameters:

- `n_estimators=500` indicates that 500 decision trees are created
- `max_features='auto'` limits the number of features considered at each split; for classifiers in the scikit-learn version used here it means the square root of the number of features (spelled `'sqrt'` in recent scikit-learn releases)
- `n_jobs=-1` indicates that all available CPU cores are used to train the trees

In [20]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_features='auto', n_jobs=-1
                                )
rnd_clf.fit(train_data, 
            train_target)
print("Accuracy score for RandomForestClassifier: ", 
      accuracy_score(test_target,
                     rnd_clf.predict(test_data)))

Accuracy score for RandomForestClassifier:  0.973333333333


In [21]:
bag_clf = BaggingClassifier(
        DecisionTreeClassifier(splitter="random", max_features='auto'),
        n_estimators=500, bootstrap=True, n_jobs=-1
    )
bag_clf.fit(train_data, train_target)
print("Accuracy score for BaggingClassifier: ", 
      accuracy_score(test_target,
                     bag_clf.predict(test_data)))

Accuracy score for BaggingClassifier:  0.986666666667


The two sets of predictions are nearly identical:

In [23]:
sk_me.confusion_matrix(rnd_clf.predict(test_data),
                       bag_clf.predict(test_data))

array([[29,  0,  0],
       [ 0, 24,  1],
       [ 0,  0, 21]])

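Decision trees tend to split on the most effective variables (at predicting the target variable) near the root of the tree. 

As random forests create many decision trees, we can measure the importance of a variable by how much the nodes that split on it reduce impurity, weighted by the number of training instances reaching each node and averaged over all trees. This is what `feature_importances_` reports; the values sum to 1.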

In [24]:
rnd_clf.feature_importances_

array([ 0.0948241 ,  0.04750261,  0.44927531,  0.40839798])

In [25]:
for name, score in zip(iris["feature_names"], 
                       rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.0948241032171
sepal width (cm) 0.0475026060019
petal length (cm) 0.449275310455
petal width (cm) 0.408397980326
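
The forest-level importances can be reproduced by averaging the per-tree importances, which is a useful sanity check on what the attribute means. The snippet below is a sketch on the same iris split. It uses `max_features='sqrt'` and 100 trees as assumed settings, so the numbers will not match the output above exactly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=42)

forest = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                random_state=0)
forest.fit(X_train, y_train)

# Each fitted tree exposes its own normalized impurity-based importances.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
manual_importances = per_tree.mean(axis=0)

for name, manual, reported in zip(iris.feature_names,
                                  manual_importances,
                                  forest.feature_importances_):
    print(name, round(manual, 4), round(reported, 4))
```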


### AdaBoost

1. Set every row weight to $1/m$, where $m$ is the number of rows.
1. Create a subset of the rows using the row weights as sampling probabilities (implementations whose base classifier accepts sample weights, such as scikit-learn's, pass the weights to the classifier directly instead).
1. Train the classifier on that subset.
1. Make predictions with that classifier.
1. Weight the classifier; higher weights correspond to lower weighted error.
1. Add this weighted classifier to the overall prediction function.
1. Update the row weights: incorrectly predicted rows are weighted higher.
1. Go to step 2.
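
The loop above can be sketched by hand with decision stumps. This sketch makes two substitutions that should be kept in mind: it passes the row weights to the stump through `sample_weight` instead of resampling rows, and it uses the discrete multiclass variant known as SAMME rather than the `SAMME.R` algorithm selected in the cell below. The number of rounds and the helper name `boosted_predict` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

n_classes = len(np.unique(y_train))
m = len(y_train)
weights = np.full(m, 1.0 / m)            # step 1: uniform row weights
stumps, alphas = [], []

for _ in range(50):                      # assumed number of boosting rounds
    # Steps 2-3: train a stump, using the weights directly (no resampling).
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_train, y_train, sample_weight=weights)
    # Step 4: predict and compute the weighted error (weights sum to 1).
    wrong = stump.predict(X_train) != y_train
    err = weights[wrong].sum()
    if err <= 0 or err >= 1 - 1 / n_classes:
        break                            # perfect, or no better than chance
    # Step 5: classifier weight (SAMME); lower error gives a larger alpha.
    alpha = np.log((1 - err) / err) + np.log(n_classes - 1)
    # Step 6: add the weighted classifier to the ensemble.
    stumps.append(stump)
    alphas.append(alpha)
    # Step 7: increase the weights of misclassified rows, then renormalize.
    weights = weights * np.exp(alpha * wrong)
    weights = weights / weights.sum()

def boosted_predict(X):
    """Weighted vote of all stumps (labels 0..n_classes-1 double as indices)."""
    votes = np.zeros((len(X), n_classes))
    for stump, alpha in zip(stumps, alphas):
        votes[np.arange(len(X)), stump.predict(X)] += alpha
    return votes.argmax(axis=1)

manual_accuracy = accuracy_score(y_test, boosted_predict(X_test))
print("Accuracy score for hand-rolled AdaBoost:", manual_accuracy)
```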

In [26]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), 
                             n_estimators=200, algorithm="SAMME.R", learning_rate=0.5
    )
ada_clf.fit(train_data, 
            train_target)
print("Accuracy score for AdaBoostClassifier: ", 
      accuracy_score(test_target,
                     ada_clf.predict(test_data)))

Accuracy score for AdaBoostClassifier:  0.946666666667


### The end