___

# Machine Learning in Geosciences
Department of Applied Geoinformatics and Cartography, Charles University

Lukas Brodsky lukas.brodsky@natur.cuni.cz

## Ensemble Learning

This notebook covers the following ensemble learning topics:

* Comparison of hard and soft voting classifiers

* Bagging ensembles

* Feature importance example 

* Stacking ensemble 


# Setup

In [None]:
# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."

def image_path(fig_id):
    return os.path.join(PROJECT_ROOT_DIR, "images", fig_id)

def save_fig(fig_id, tight_layout=True):
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(image_path(fig_id) + ".png", format='png', dpi=300)

# Voting classifiers

In [None]:
# Use a generated data set.

# make_moons generates a 2D binary classification data set (two interleaving half circles)
# with optional Gaussian noise; it is challenging for some algorithms.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
print(X.shape)
print(y.shape)

In [None]:
np.mean(X[:,0])

In [None]:
np.mean(X[:,1])

In [None]:
np.unique(y)

## Compare hard and soft voting classifiers

### Hard voting

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from sklearn.ensemble import VotingClassifier

# hyperparameters (solver, n_estimators, gamma) 
log_clf = LogisticRegression(solver="liblinear", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
svm_clf = SVC(gamma="auto", random_state=42)


voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

In [None]:
voting_clf.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score


for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"{clf.__class__.__name__}: {accuracy_score(y_test, y_pred)}")
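
Hard voting simply picks, for every sample, the class that most classifiers predicted. The block below is a minimal NumPy sketch of that rule, not scikit-learn's internal code; the `predictions` array and the helper `hard_vote` are made up for illustration.

```python
import numpy as np

# Hypothetical class predictions of three classifiers for five samples
# (rows = classifiers, columns = samples); the values are made up.
predictions = np.array([
    [0, 1, 1, 0, 1],   # e.g. logistic regression
    [0, 1, 0, 0, 1],   # e.g. random forest
    [1, 1, 1, 0, 0],   # e.g. SVM
])

def hard_vote(preds):
    """Return the most frequent class label for each column (sample)."""
    n_classes = preds.max() + 1
    # Count the votes for each class, sample by sample
    votes = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    return votes.argmax(axis=0)

majority = hard_vote(predictions)
print(majority)
```

With an odd number of voters and two classes there are no ties; in the general case `argmax` resolves a tie in favour of the lower class label.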

### Soft voting

In [None]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
svm_clf = SVC(gamma="auto", probability=True, random_state=42)

# voting=soft
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')
voting_clf.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"{clf.__class__.__name__}: {accuracy_score(y_test, y_pred)}")
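
Soft voting averages the predicted class probabilities instead of counting labels, so one very confident classifier can outweigh two hesitant ones. The following sketch uses made-up probabilities for a single sample (the `probas` array is an assumption, not output of the models above) to show a case where the two rules disagree.

```python
import numpy as np

# Hypothetical class probabilities for ONE sample from three classifiers
# (rows = classifiers, columns = P(class 0), P(class 1)); values are made up.
probas = np.array([
    [0.45, 0.55],
    [0.40, 0.60],
    [0.90, 0.10],
])

# Hard voting: each classifier casts one vote for its most likely class
hard_vote = np.bincount(probas.argmax(axis=1)).argmax()

# Soft voting: average the probabilities, then pick the most likely class
soft_vote = probas.mean(axis=0).argmax()

print("hard vote:", hard_vote)
print("soft vote:", soft_vote)
```

Two of the three classifiers lean slightly towards class 1, but the mean probability of class 0 is higher, so the soft vote goes to class 0.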

# Bagging ensembles

In [None]:
# Bagging (bootstrap aggregating) of decision tree classifiers
# n_estimators=500: number of trees in the ensemble
# max_samples=100: number of training instances drawn for each tree

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42)

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

In [None]:
# Single Tree classifier

tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))
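
To see what `BaggingClassifier` does conceptually, the sketch below builds a small bagging ensemble by hand: each tree is trained on a bootstrap sample, and the per-tree class probabilities are averaged. It is a simplified illustration under stated assumptions (only 50 trees, a hand-rolled sampling loop, names such as `trees` and `manual_accuracy` are made up), not the library's implementation.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(42)
n_estimators, max_samples = 50, 100   # fewer trees than above, to keep it quick

trees = []
for _ in range(n_estimators):
    # Bootstrap: draw max_samples training indices WITH replacement
    idx = rng.integers(0, len(X_train), size=max_samples)
    tree = DecisionTreeClassifier(random_state=42)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Aggregate: average the per-tree class probabilities, then take the argmax
mean_proba = np.mean([tree.predict_proba(X_test) for tree in trees], axis=0)
y_pred_manual = mean_proba.argmax(axis=1)

manual_accuracy = np.mean(y_pred_manual == y_test)
print(manual_accuracy)
```

Averaging many trees trained on different subsamples mainly reduces variance, which is why the ensemble usually generalizes better than a single fully grown tree.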

### Random Forests from DecisionTree classifier

In [None]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16, random_state=42),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1, random_state=42)

In [None]:
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

In [None]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

In [None]:
# fraction of test instances on which the bagging ensemble and the Random Forest agree
np.mean(y_pred == y_pred_rf)

## Out-of-Bag evaluation

In [None]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True, random_state=40)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

In [None]:
# bag_clf.oob_decision_function_
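
When a bootstrap sample of size m is drawn from m training instances, a given instance is left out with probability (1 - 1/m)^m, roughly 37% for large m. These out-of-bag instances are what the OOB score is computed on. The simulation below is a small sketch that checks this fraction empirically; `m` and `n_trials` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 375           # size of X_train above (75 % of 500 samples)
n_trials = 1000   # number of simulated bootstrap samples

oob_fractions = []
for _ in range(n_trials):
    idx = rng.integers(0, m, size=m)         # bootstrap sample of size m
    n_unique = np.unique(idx).size           # instances drawn at least once
    oob_fractions.append(1 - n_unique / m)   # instances never drawn

mean_oob = np.mean(oob_fractions)
expected = (1 - 1 / m) ** m
print(mean_oob, expected)
```

Because every predictor has its own out-of-bag instances, the ensemble can be evaluated without setting aside a separate validation set.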

## Feature importance with MNIST

In [None]:
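# Import the MNIST data set (downloaded from OpenML on first use)

from sklearn.datasets import fetch_openml
# as_frame=False returns NumPy arrays instead of a pandas DataFrame
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.target = mnist.target.astype(np.int64)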


In [None]:
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
rnd_clf.fit(mnist["data"], mnist["target"])

In [None]:
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = mpl.cm.gray, interpolation="nearest")
    plt.axis("off")

In [None]:
plot_digit(rnd_clf.feature_importances_)

cbar = plt.colorbar(ticks=[rnd_clf.feature_importances_.min(), rnd_clf.feature_importances_.max()])
cbar.ax.set_yticklabels(['Not important', 'Important'])

# save_fig("mnist_feature_importance_plot")
plt.show()
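
The same idea can be tried without downloading MNIST. The sketch below uses the small 8x8 digits data set bundled with scikit-learn (a substitute chosen here for speed, not the data used above) and lists the most important pixels instead of plotting them.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()   # 8x8 pixel digit images shipped with scikit-learn
forest = RandomForestClassifier(n_estimators=10, random_state=42)
forest.fit(digits.data, digits.target)

# Impurity-based importances: one value per pixel, normalised to sum to 1
importances = forest.feature_importances_
print(importances.sum())

# The five most important pixels as (row, column) positions in the 8x8 grid
top = np.argsort(importances)[::-1][:5]
print([divmod(int(i), 8) for i in top])
```

Pixels near the image border are blank in almost every image, so they tend to receive importance close to zero.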

## Voting Classifier Exercise

Split MNIST data into a training set, a validation set, and a test set.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train_val, X_test, y_train_val, y_test = train_test_split(
    mnist.data, mnist.target, test_size=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=10000, random_state=42)

Train various classifiers: a Random Forest classifier, an Extra-Trees classifier, an SVM, and a neural network.

In [None]:
... RandomForestClassifier
... ExtraTreesClassifier
... LinearSVC
... MLPClassifier

In [None]:
# use n_estimators=10 
# random_state=42

random_forest_clf = RandomForestClassifier(...)
extra_trees_clf = ExtraTreesClassifier(...)
svm_clf = LinearSVC(...)                       # a non-linear SVM may be used as well
mlp_clf = MLPClassifier(...)                   # default parameters 

In [None]:
# fit the models: this takes a while!

estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]

... # train the estimators

In [None]:
# print the score of each estimator
...

Combine them into an ensemble using a soft and hard voting classifier.

In [None]:
... VotingClassifier

In [None]:
named_estimators = [
    ...
    ...
    ...
    ...
]

In [None]:
# n_jobs=-1

voting_clf = VotingClassifier(named_estimators)

In [None]:
# fit the voting classifier on X_train, y_train
# this takes a while

...

In [None]:
# print the voting score using the X_val, y_val
... .score(X_val, y_val)

In [None]:
# What are the scores of the individual estimators?
# voting_clf.estimators_ is the list of fitted sub-estimators
# (voting_clf.named_estimators_ gives access to them by name).

...

Remove the SVM. Does this improve performance?

In [None]:
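# Drop the SVM from the ensemble. Recent scikit-learn versions expect "drop";
# None is no longer accepted. This assumes the SVM was named "svm_clf" in named_estimators.
voting_clf.set_params(svm_clf="drop")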

Update the list of estimators. 

In [None]:
# check 
voting_clf.estimators

In [None]:
voting_clf.estimators_

In [None]:
# delete the fitted SVM (index 2 in the list of fitted estimators)
del voting_clf.estimators_[2]

Now let's evaluate the `VotingClassifier` again:

In [None]:
# Evaluate VotingClassifier again
voting_clf.score(X_val, y_val)

In [None]:
# A bit better? 

Try using a soft voting classifier. Do not retrain the classifier; just set `voting` to `"soft"`:

In [None]:
voting_clf.voting = ...
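
Soft voting averages class probabilities, so every estimator left in the ensemble has to implement `predict_proba`. The check below is a small sketch with default-constructed models (the `candidates` list is illustrative and not the exercise solution); it shows why the `LinearSVC` had to be removed first.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

# Unfitted models are enough to inspect which methods they expose
candidates = [RandomForestClassifier(), ExtraTreesClassifier(), LinearSVC(), MLPClassifier()]

supports_proba = {
    type(clf).__name__: hasattr(clf, "predict_proba") for clf in candidates
}
print(supports_proba)
```

`LinearSVC` only offers `decision_function`, which returns signed distances to the decision boundary rather than probabilities.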

In [None]:
# evaluate the score of the soft voting classifier!
...

In [None]:
# Improvement? 

In [None]:
# Try it on the test set. How much better does it perform compared to the individual classifiers?

In [None]:
voting_clf.score(X_test, y_test) 

In [None]:
[estimator.score(X_test, y_test) for estimator in voting_clf.estimators_]

In [None]:
# Did the voting classifier reduce the error rate?
# Compare it to the best model.  

## Stacking Ensemble

Exercise: _Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class. Train a classifier on this new training set._

In [None]:
X_val_predictions = np.empty((len(X_val), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_val_predictions[:, index] = estimator.predict(X_val)

In [None]:
X_val_predictions

In [None]:
rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rnd_forest_blender.fit(X_val_predictions, y_val)

In [None]:
rnd_forest_blender.oob_score_

One can fine-tune this blender or try other types of blenders, then select the best one using cross-validation.

Exercise: the blender, together with the classifiers, forms a stacking ensemble!

Let's evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions. 

How does it compare to the voting classifier you trained earlier?

In [None]:
X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_test_predictions[:, index] = estimator.predict(X_test)

In [None]:
y_pred = rnd_forest_blender.predict(X_test_predictions)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
# How good is the stacking ensemble compared to the soft voting classifier?
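
scikit-learn also provides `StackingClassifier`, which automates the manual steps above. The sketch below applies it to the small `make_moons` data set from the beginning of the notebook rather than to MNIST, to keep it fast; the choice of base estimators and of `LogisticRegression` as the blender is an assumption made for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stacking_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
        ("svc", SVC(gamma="auto", random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # the blender
    cv=5,  # the blender is trained on out-of-fold predictions
)
stacking_clf.fit(X_train, y_train)

stacking_accuracy = stacking_clf.score(X_test, y_test)
print(stacking_accuracy)
```

Unlike the hold-out approach used in the exercise, `StackingClassifier` trains the blender on cross-validated predictions, so no separate validation set has to be reserved for it.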