# Exercises 8 and 9 (Chapter 7)

8. Load the MNIST data (introduced in Chapter 3), and:
- Split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing).
- Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM classifier.
- Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting.
- Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Is this notebook running on Colab or Kaggle?
IS_COLAB = "google.colab" in sys.modules
IS_KAGGLE = "kaggle_secrets" in sys.modules

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import os
import pandas as pd

In [2]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

In [3]:
X, y = mnist["data"], mnist["target"]

In [4]:
from sklearn.model_selection import train_test_split
X_train_full, X_test, y_train_full, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full,
                                                     test_size=10000,
                                                     random_state=1989)

In [5]:
# Train RandomForest, ExtraTrees and SVM
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rnd_clf = RandomForestClassifier(n_estimators=500, # number of trees
                                 n_jobs=-1) # use all the cores

max_leaf_nodes_params = list(range(2, 500, 20))
max_leaf_nodes_params.append(None)

grid_rf = {
    'max_leaf_nodes': max_leaf_nodes_params
}

grid_rf = GridSearchCV(rnd_clf, grid_rf, cv = 3, scoring = 'accuracy')

grid_rf.fit(X_train, y_train)

GridSearchCV(cv=3,
             estimator=RandomForestClassifier(n_estimators=500, n_jobs=-1),
             param_grid={'max_leaf_nodes': [2, 22, 42, 62, 82, 102, 122, 142,
                                            162, 182, 202, 222, 242, 262, 282,
                                            302, 322, 342, 362, 382, 402, 422,
                                            442, 462, 482, None]},
             scoring='accuracy')

In [None]:
grid_rf.best_score_

In [None]:
# Now get the performance (score) on the validation set
from sklearn.metrics import accuracy_score
y_pred_rf = grid_rf.predict(X_val)
accuracy_score(y_val, y_pred_rf)

In [None]:
# Now it's the turn of an ExtraTrees classifier
from sklearn.ensemble import ExtraTreesClassifier

xt_clf = ExtraTreesClassifier(n_jobs=-1,
                              random_state=1989)
# max_leaf_nodes and n_estimators
max_leaf_nodes_params = list(range(2, 500, 20))
max_leaf_nodes_params.append(None)

grid_xt = {
    'max_leaf_nodes': max_leaf_nodes_params,
    'n_estimators': list(range(100, 1000, 10))
}

# Named grid_search_xt so that it matches the name used in the later cells
grid_search_xt = GridSearchCV(xt_clf,
                              grid_xt,
                              cv = 3,
                              scoring = 'accuracy')

grid_search_xt.fit(X_train, y_train)

In [None]:
grid_search_xt.best_score_

In [None]:
y_pred_xt = grid_search_xt.predict(X_val)
accuracy_score(y_val, y_pred_xt)

In [None]:
# AND FINALLY! THE SVM PREDICTOR
# Look at the SVM chapter exercises
# For this I have to scale and center the data
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="rbf", probability=True))
    ])

svm_clf.fit(X_train, y_train)

In [None]:
y_pred_svm = svm_clf.predict(X_val)
accuracy_score(y_val, y_pred_svm)

Now I'll combine them in a voting classifier.
Note that scikit-learn's `VotingClassifier` refits every base model when its `fit` method is called. Here I want to compare each model's standalone performance with the performance of the combined ensemble, so I need a voting classifier that reuses the already-trained models.
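To make the idea concrete, here is a minimal sketch of hard and soft voting computed directly from the outputs of models that are already fitted. The helper names `hard_vote` and `soft_vote` and the toy predictions are hypothetical, introduced only for illustration; this is not how mlxtend implements voting internally.

```python
import numpy as np

def hard_vote(label_predictions):
    """Majority vote over per-classifier label predictions.

    label_predictions: list of 1-D arrays (one per classifier, same length).
    On a tie, the label that sorts first wins (a choice made for this sketch).
    """
    stacked = np.asarray(label_predictions).T  # shape (n_samples, n_classifiers)
    voted = []
    for row in stacked:
        labels, counts = np.unique(row, return_counts=True)
        voted.append(labels[np.argmax(counts)])
    return np.asarray(voted)

def soft_vote(probabilities, classes):
    """Average class probabilities across classifiers and pick the top class.

    probabilities: list of (n_samples, n_classes) arrays whose columns
    follow the order given in `classes`.
    """
    mean_proba = np.mean(probabilities, axis=0)  # shape (n_samples, n_classes)
    return np.asarray(classes)[np.argmax(mean_proba, axis=1)]

# Toy label predictions from three hypothetical, already-fitted classifiers
preds_a = np.array(["3", "5", "8"])
preds_b = np.array(["3", "6", "8"])
preds_c = np.array(["2", "5", "1"])
hard_result = hard_vote([preds_a, preds_b, preds_c])

# Toy class probabilities for a two-class problem
classes = ["0", "1"]
proba_a = np.array([[0.9, 0.1], [0.4, 0.6]])
proba_b = np.array([[0.6, 0.4], [0.3, 0.7]])
proba_c = np.array([[0.2, 0.8], [0.45, 0.55]])
soft_result = soft_vote([proba_a, proba_b, proba_c], classes)

print(hard_result, soft_result)
```

With the models from this notebook, the inputs would presumably be `grid_rf.predict(X_val)`, `grid_search_xt.predict(X_val)` and `svm_clf.predict(X_val)` for hard voting, and the corresponding `predict_proba` outputs for soft voting. No model is refitted at any point.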

In [None]:
!pip install mlxtend

In [None]:
# Solution using mlxtend
from mlxtend.classifier import EnsembleVoteClassifier
eclf_hard = EnsembleVoteClassifier(clfs=[grid_rf, grid_search_xt, svm_clf],
                                   voting='hard',
                                   fit_base_estimators=False)

eclf_hard.fit(X_train, y_train)

y_pred_vot_hard = eclf_hard.predict(X_val)
accuracy_score(y_val, y_pred_vot_hard)

In [None]:
eclf_soft = EnsembleVoteClassifier(clfs=[grid_rf, grid_search_xt, svm_clf],
                                   voting='soft',
                                   fit_base_estimators=False)

# With fit_base_estimators=False the base models are not retrained here
eclf_soft.fit(X_train, y_train)

y_pred_vot_soft = eclf_soft.predict(X_val)
accuracy_score(y_val, y_pred_vot_soft)

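Now it's time to compare the accuracy of the base models with that of the best voting classifier *on the test data*. The cell below uses the soft-voting ensemble; substitute `eclf_hard` if it scored higher on the validation set.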

In [None]:
labels = ['Random Forest', 'ExtraTrees', 'SVM', 'Voting Classifier']

for clf, label in zip([grid_rf, grid_search_xt, svm_clf, eclf_soft], labels):
    y_pred = clf.predict(X_test)
    score = accuracy_score(y_test, y_pred)
    # Four decimals, so that small gaps between the classifiers stay visible
    print("Accuracy: %0.4f [%s]"
          % (score, label))


9. 

- Run the individual classifiers from the previous exercise to make predictions on the validation set.
- Create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image’s class.
- Train a classifier on this new training set. Congratulations, you have just trained a blender, and together with the classifiers it forms a stacking ensemble!
- Now evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble’s predictions. How does it compare to the voting classifier you trained earlier?
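
The steps above can be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn's small bundled digits dataset instead of MNIST so that it runs in seconds, the base classifiers are untuned, and the blender is a Random Forest chosen arbitrarily. The helper `stack_predictions` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: a small stand-in dataset (8x8 digits), not the MNIST data
X, y = load_digits(return_X_y=True)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=42)

# Base classifiers, trained on the training set only
base_clfs = [
    RandomForestClassifier(n_estimators=50, random_state=42),
    ExtraTreesClassifier(n_estimators=50, random_state=42),
    SVC(kernel="rbf"),
]
for clf in base_clfs:
    clf.fit(X_train, y_train)

def stack_predictions(classifiers, X):
    """Return one column of predicted labels per classifier."""
    return np.column_stack([clf.predict(X) for clf in classifiers])

# New training set for the blender: the base predictions on the validation set
X_val_preds = stack_predictions(base_clfs, X_val)
blender = RandomForestClassifier(n_estimators=100, random_state=42)
blender.fit(X_val_preds, y_val)

# Evaluate the whole stacking ensemble on the test set
X_test_preds = stack_predictions(base_clfs, X_test)
blender_accuracy = accuracy_score(y_test, blender.predict(X_test_preds))
print("Stacking ensemble accuracy: %0.4f" % blender_accuracy)
```

The blender is trained on validation-set predictions rather than training-set predictions because the base classifiers have already seen the training set, so their predictions there would look better than they really are. In this notebook the MNIST labels loaded by `fetch_openml` are strings, so the stacked predictions would likely need to be cast to integers (for example with `.astype(np.uint8)`) before fitting the blender.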