### 8.
Load the MNIST data, and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM classifier. Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

In [3]:
from sklearn.datasets import fetch_openml
import numpy as np

mnist = fetch_openml('mnist_784', version=1)

X, y = mnist["data"], mnist["target"].astype(np.uint8)

In [4]:
# separate into training, validation and test sets
# simpler way:
# X_train, X_valid, X_test, y_train, y_valid, y_test = X[:50000], X[50000:60000], X[60000:], y[:50000], y[50000:60000], y[60000:]

from sklearn.model_selection import train_test_split

# Split X and y (defined above) so that the targets are integers, not strings
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=10000, random_state=42)

In [8]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

rnd_clf = RandomForestClassifier()

ext_clf = ExtraTreesClassifier()

svm_clf = LinearSVC()

mlp_clf = MLPClassifier()

In [9]:
estimators = [rnd_clf, ext_clf, svm_clf, mlp_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

Training the RandomForestClassifier()
Training the ExtraTreesClassifier()
Training the LinearSVC()




Training the MLPClassifier()


In [10]:
for estimator in estimators:
    print(estimator.__class__.__name__, estimator.score(X_val, y_val))

RandomForestClassifier 0.9704
ExtraTreesClassifier 0.9722
LinearSVC 0.8477
MLPClassifier 0.9629


The linear SVM is far outperformed by the other classifiers. However, let's keep it for now, since it may still improve the voting classifier's performance.

In [38]:
# Combine them in a voting classifier
from sklearn.ensemble import VotingClassifier

named_estimators = [
    ("random_forest_clf", rnd_clf),
    ("extra_trees_clf", ext_clf),
    ("svm_clf", svm_clf),
    ("mlp_clf", mlp_clf)
]

# By default, VotingClassifier uses hard voting. LinearSVC does not have a
# predict_proba() method, so soft voting is not possible while it is included.
voting_clf = VotingClassifier(named_estimators)

In [39]:
voting_clf.fit(X_train, y_train)



VotingClassifier(estimators=[('random_forest_clf', RandomForestClassifier()),
                             ('extra_trees_clf', ExtraTreesClassifier()),
                             ('svm_clf', LinearSVC()),
                             ('mlp_clf', MLPClassifier())])

In [40]:
voting_clf.score(X_val, y_val)

0.9702

In [60]:
[estimator.score(X_val, y_val) for estimator in estimators]

[0.9704, 0.9722, 0.8477, 0.9629]

Let's remove the SVM to see if performance improves. 

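It is possible to remove an estimator by setting it to None using set_params(), as done here. (This works in the scikit-learn version used for this notebook; newer releases expect the string 'drop' instead of None.)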

In [61]:
voting_clf.set_params(svm_clf=None)

VotingClassifier(estimators=[('random_forest_clf', RandomForestClassifier()),
                             ('extra_trees_clf', ExtraTreesClassifier()),
                             ('svm_clf', None), ('mlp_clf', MLPClassifier())])

The list of estimators now looks like this:

In [62]:
voting_clf.estimators

[('random_forest_clf', RandomForestClassifier()),
 ('extra_trees_clf', ExtraTreesClassifier()),
 ('svm_clf', None),
 ('mlp_clf', MLPClassifier())]

However, the list of trained estimators was not updated:

In [63]:
voting_clf.estimators_

[RandomForestClassifier(),
 ExtraTreesClassifier(),
 LinearSVC(),
 MLPClassifier()]

We could fit the voting classifier again, but that costs time and CPU power. Depending on the dataset, it could take hours!
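As a rough illustration of the refit route, the sketch below uses scikit-learn's small `load_digits` dataset as a stand-in for MNIST (an assumption made only to keep the run short) and drops one estimator with the string `'drop'`. The names `demo_clf`, `X_small`, and `y_small` are hypothetical and not part of the notebook above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.svm import LinearSVC

# Small stand-in dataset (8x8 digits), NOT the MNIST data used in this notebook.
X_small, y_small = load_digits(return_X_y=True)

demo_clf = VotingClassifier([
    ("rf", RandomForestClassifier(n_estimators=20, random_state=42)),
    ("et", ExtraTreesClassifier(n_estimators=20, random_state=42)),
    ("svm", LinearSVC()),
])
demo_clf.fit(X_small, y_small)
n_before = len(demo_clf.estimators_)  # all three estimators were trained

# Mark the SVM as dropped, then refit so that estimators_ is rebuilt.
demo_clf.set_params(svm="drop")
demo_clf.fit(X_small, y_small)
n_after = len(demo_clf.estimators_)   # the dropped estimator is no longer trained

print(n_before, n_after)
```

On a dataset this small the refit is cheap; on the full MNIST data it means retraining every remaining estimator.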

An alternative is to simply delete the trained SVM from the list of trained estimators:

In [64]:
del voting_clf.estimators_[2]  # index 2 is LinearSVC (see the output of the cell above)

Evaluating the VotingClassifier again:

In [65]:
voting_clf.score(X_val, y_val)

0.9736

It improved a little bit, which suggests that the SVM was hurting performance.

Let's try soft voting (there is no need to retrain the classifiers):

In [66]:
voting_clf.voting = 'soft'

In [67]:
voting_clf.score(X_val, y_val)

0.97

The score decreased slightly (from 0.9736 to 0.97), so hard voting seems to be the better choice here.
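To see why the two schemes can disagree, here is a minimal sketch with made-up probabilities for a single sample. The numbers are hypothetical and are not taken from the MNIST run above.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for ONE sample
# over three classes (illustrative values only).
probas = np.array([
    [0.55, 0.45, 0.00],  # classifier A: weakly prefers class 0
    [0.51, 0.49, 0.00],  # classifier B: weakly prefers class 0
    [0.05, 0.95, 0.00],  # classifier C: strongly prefers class 1
])

# Hard voting: each classifier casts one vote for its most probable class.
votes = probas.argmax(axis=1)
hard_winner = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most probable class.
soft_winner = probas.mean(axis=0).argmax()

print("hard:", hard_winner, "soft:", soft_winner)
```

Hard voting follows the majority (class 0), while soft voting lets the one confident classifier outweigh two hesitant ones (class 1). Which behaviour helps depends on how well calibrated the probabilities are.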

Let's now revert to hard voting and evaluate on the test set:

In [68]:
voting_clf.voting = 'hard'
voting_clf.score(X_test, y_test)

0.9704

In [71]:
[estimator.score(X_test, y_test) for estimator in estimators]

[0.9647, 0.9678, 0.8503, 0.9614]

The voting classifier reduced the error rate by 0.0026 (from 3.22% to 2.96%) compared to the best individual model, the ExtraTreesClassifier.
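The arithmetic behind that comparison can be checked with a few lines, using only the test accuracies reported above. The variable names are illustrative.

```python
# Test-set accuracies reported in the cells above.
voting_acc = 0.9704
best_single_acc = 0.9678  # ExtraTreesClassifier

voting_err = 1 - voting_acc
best_single_err = 1 - best_single_acc

# Absolute drop in error rate, and the same drop relative to the
# error rate of the best individual classifier.
absolute_reduction = best_single_err - voting_err
relative_reduction = absolute_reduction / best_single_err

print(f"absolute: {absolute_reduction:.4f}")
print(f"relative: {relative_reduction:.1%}")
```

In relative terms, the ensemble removes roughly 8% of the errors made by the best individual classifier.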

This is not a huge improvement, but it is an improvement nonetheless.

The individual models already reached almost 97% accuracy, which is impressive, and the ensemble technique made the predictions slightly better still.

### 9.
Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image’s class. Train a classifier on this new training set. Congratulations, you have just trained a blender, and together with the classifiers it forms a stacking ensemble! Now evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble’s predictions. How does it compare to the voting classifier you trained earlier?

In [72]:
X_val_predictions = np.empty((len(X_val), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_val_predictions[:, index] = estimator.predict(X_val)

In [73]:
X_val_predictions

array([[5., 5., 8., 5.],
       [8., 8., 8., 8.],
       [2., 2., 2., 2.],
       ...,
       [7., 7., 7., 7.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]], dtype=float32)

In [74]:
rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rnd_forest_blender.fit(X_val_predictions, y_val)

RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)

In [75]:
rnd_forest_blender.oob_score_

0.9696

We could fine-tune this blender or try other types of blenders (e.g., an MLPClassifier), then select the best one using cross-validation, as always.

Now let's evaluate the ensemble on the test set. For each image in the test set, we make predictions with all the classifiers, then feed those predictions to the blender to get the ensemble's predictions.

In [76]:
X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_test_predictions[:, index] = estimator.predict(X_test)

In [77]:
y_pred = rnd_forest_blender.predict(X_test_predictions)

In [78]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.9676

This stacking ensemble does not perform as well as the voting classifier we trained earlier (0.9676 vs. 0.9704 test accuracy), and it is not quite as good as the best individual classifier (0.9678).
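For comparison, scikit-learn also provides a `StackingClassifier` that trains the blender on cross-validated (out-of-fold) predictions rather than on a separate hold-out set. The sketch below is only an illustration under assumptions: it uses the small `load_digits` dataset instead of MNIST, two tree ensembles as base models, and a logistic regression blender. None of these choices come from the exercise above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small stand-in dataset (8x8 digits), NOT the MNIST data used in this notebook.
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.25, random_state=42)

stack_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=42)),
    ],
    # The blender is fit on out-of-fold predictions of the base models.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack_clf.fit(X_tr, y_tr)

stack_accuracy = stack_clf.score(X_te, y_te)
print(stack_accuracy)
```

Because the blender here sees class probabilities instead of hard labels, it has more information to work with than the blender trained above. Whether that helps on MNIST would need to be measured.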