#### 8) Load the MNIST data, and split it into a training set, a validation set, and a test set (50,000 for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM. Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

In [1]:
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

In [2]:
# Loading the data (fetch_openml returns the MNIST labels as strings)
mnist = fetch_openml("mnist_784", version=1)
X, y = mnist["data"], mnist["target"]
y = y.astype(np.uint8)  # cast the string labels to integers

# Splitting off 10,000 test instances, then 10,000 validation instances,
# which leaves 50,000 instances for training
X_tr_val, X_test, y_tr_val, y_test = train_test_split(X, y, test_size=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tr_val, y_tr_val, test_size=10000, random_state=42)
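
The two-stage split above can be checked with a small standard-library sketch. It assumes the full dataset holds 70,000 instances (50,000 + 10,000 + 10,000, as the exercise states); the helper name `two_stage_split` is hypothetical and only mirrors the idea of the two `train_test_split` calls, not their exact shuffling.

```python
import random

def two_stage_split(n_samples, n_test, n_val, seed=42):
    """Shuffle the indices once, then carve off the test and validation sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]  # everything that is left
    return train, val, test

# Assumed dataset size: 70,000 instances.
train_idx, val_idx, test_idx = two_stage_split(70_000, 10_000, 10_000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The three index lists are disjoint by construction, which is the property that keeps the validation and test scores honest.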

In [3]:
# Training classifiers
# Random Forest
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rnd_clf.fit(X_train, y_train)

# Extra-Trees
extra_tr_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
extra_tr_clf.fit(X_train, y_train)

# SVM (linear only; LinearSVC is trained here on unscaled pixel values)
svm_clf = LinearSVC(random_state=42)
svm_clf.fit(X_train, y_train)



LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=42, tol=0.0001,
          verbose=0)

In [4]:
# Scores on validation sets
[estimator.score(X_val, y_val) for estimator in (rnd_clf, extra_tr_clf, svm_clf)]

[0.9692, 0.9715, 0.8626]

In [5]:
# Combining into an ensemble with hard voting
estimators = [("random_forest_clf", rnd_clf),
             ("extra_trees_clf", extra_tr_clf),
             ("svm_clf", svm_clf)]

voting_clf = VotingClassifier(estimators)
voting_clf.fit(X_train, y_train)

voting_clf.score(X_val, y_val)



0.9696

In [6]:
# Validation scores of the fitted clones stored inside the voting classifier
[estimator.score(X_val, y_val) for estimator in voting_clf.estimators_]

[0.9692, 0.9715, 0.8626]

In [7]:
# Removing the SVM to see if performance improves
# (scikit-learn 0.22+ expects svm_clf="drop"; support for None was removed in 0.24)
voting_clf.set_params(svm_clf=None)

VotingClassifier(estimators=[('random_forest_clf',
                              RandomForestClassifier(bootstrap=True,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
                                                     n_estimators=100,
                                                     n_jobs=N

In [8]:
# set_params only updated the list of estimators; also remove the already fitted SVM clone
del voting_clf.estimators_[2]

In [9]:
# Validation score without the SVM
voting_clf.score(X_val, y_val)

0.9713

In [10]:
# Trying soft voting (both remaining classifiers provide predict_proba, which LinearSVC does not)
voting_clf.voting = "soft"
voting_clf.score(X_val, y_val)

0.9719
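
To make the difference between the two voting modes concrete, here is a minimal pure-Python sketch. The probabilities are made-up values for three hypothetical classifiers on a single two-class instance; this is not how `VotingClassifier` is implemented, only the decision rule it applies in each mode (without weights).

```python
from collections import Counter

def hard_vote(predicted_labels):
    """Return the most frequent label; ties go to the smallest label."""
    counts = Counter(predicted_labels)
    return max(sorted(counts), key=counts.get)

def soft_vote(probabilities):
    """Average the class probabilities and return the most probable class."""
    n_classes = len(probabilities[0])
    averages = [
        sum(p[c] for p in probabilities) / len(probabilities)
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=averages.__getitem__)

# Hypothetical probabilities: two classifiers lean weakly towards class 0,
# one is very confident about class 1.
probas = [[0.55, 0.45], [0.60, 0.40], [0.05, 0.95]]
labels = [max(range(2), key=p.__getitem__) for p in probas]  # each classifier's own prediction

print("hard vote:", hard_vote(labels))
print("soft vote:", soft_vote(probas))
```

Hard voting follows the two weak votes for class 0, while soft voting lets the single confident classifier outweigh them and picks class 1.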

In [11]:
# Trying with the test set
voting_clf.score(X_test, y_test)

0.9681

In [12]:
# Individual classifiers on the test set
[estimator.score(X_test, y_test) for estimator in voting_clf.estimators_]

[0.9645, 0.9691]
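
To answer the exercise's question in concrete terms, the accuracies reported above can be converted into misclassification counts on the 10,000 test images. The sketch below only does arithmetic on those reported numbers; the dictionary keys are labels chosen for illustration.

```python
n_test = 10_000  # size of the test set used above

# Test accuracies copied from the outputs above
test_accuracy = {
    "random_forest": 0.9645,
    "extra_trees": 0.9691,
    "soft_voting": 0.9681,
}

# Number of misclassified test images for each model
errors = {name: round((1 - acc) * n_test) for name, acc in test_accuracy.items()}

for name, count in errors.items():
    print(f"{name}: {count} misclassified images")
```

In this run the soft-voting ensemble makes 36 fewer errors than the Random Forest but 10 more than Extra-Trees, so it does not beat the best individual classifier on the test set even though it did on the validation set.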

#### 9) Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image’s class. Train a classifier on this new training set. Congratulations, you have just trained a blender, and together with the classifiers they form a stacking ensemble! Now let’s evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble’s predictions. How does it compare to the voting classifier you trained earlier?
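
The idea of a blender can be illustrated with a toy, standard-library sketch before using real classifiers. Everything here is hypothetical: the prediction rows are invented, and the lookup-table "blender" stands in for the classifier the exercise asks for. It simply learns which true label most often goes with each combination of base predictions.

```python
from collections import Counter, defaultdict

def fit_lookup_blender(pred_rows, targets):
    """Map each tuple of base predictions to the most frequent true label."""
    votes = defaultdict(Counter)
    for row, target in zip(pred_rows, targets):
        votes[tuple(row)][target] += 1
    return {row: counts.most_common(1)[0][0] for row, counts in votes.items()}

def blend(table, row):
    """Use the learned mapping; fall back to a majority vote for unseen rows."""
    row = tuple(row)
    if row in table:
        return table[row]
    return Counter(row).most_common(1)[0][0]

# Hypothetical hold-out predictions from three base classifiers (one row per image)
val_preds = [(3, 3, 8), (3, 3, 8), (3, 5, 8), (7, 7, 7), (3, 3, 8)]
val_targets = [8, 8, 8, 7, 3]

table = fit_lookup_blender(val_preds, val_targets)
print(blend(table, (3, 3, 8)))  # learned from the hold-out set
print(blend(table, (1, 1, 4)))  # unseen combination: majority vote
```

For the row `(3, 3, 8)` a plain majority vote would answer 3, but the blender has seen that the third classifier is usually right in this situation and answers 8. That ability to learn when to trust which classifier is what separates stacking from voting.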

In [13]:
estimators = [rnd_clf, extra_tr_clf, svm_clf]
X_val_preds = np.empty((len(X_val), len(estimators)), dtype=np.float32)

# Run the individual classifiers from the previous exercise to make predictions on the validation set
for index, estimator in enumerate(estimators):
    X_val_preds[:, index] = estimator.predict(X_val)
    
X_val_preds

array([[5., 5., 5.],
       [8., 8., 8.],
       [2., 2., 2.],
       ...,
       [7., 7., 7.],
       [6., 6., 6.],
       [7., 7., 7.]], dtype=float32)

In [14]:
# Train a blender: a classifier fitted on this new training set
rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rnd_forest_blender.fit(X_val_preds, y_val)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=True, random_state=42, verbose=0,
                       warm_start=False)

In [15]:
rnd_forest_blender.oob_score_

0.9698

In [16]:
X_test_preds = np.empty((len(X_test), len(estimators)), dtype=np.float32) 

# For each image in the test set, make predictions with all your classifiers
for index, estimator in enumerate(estimators):
    X_test_preds[:, index] = estimator.predict(X_test)

# Feed the predictions to the blender to get the ensemble's predictions
y_pred = rnd_forest_blender.predict(X_test_preds)
accuracy_score(y_test, y_pred)

0.9661
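
To compare the blender with the voting classifier, the two test accuracies reported above (0.9661 and 0.9681) can be expressed as a relative change in error rate. The helper `relative_error_change` is a hypothetical name introduced for this sketch.

```python
def relative_error_change(acc_reference, acc_other):
    """Relative change in error rate when moving from the reference model to the other one."""
    err_reference = 1 - acc_reference
    err_other = 1 - acc_other
    return (err_other - err_reference) / err_reference

voting_acc = 0.9681   # soft-voting test accuracy reported above
blender_acc = 0.9661  # blender test accuracy reported above

change = relative_error_change(voting_acc, blender_acc)
print(f"Blender error rate relative to voting: {change:+.1%}")
```

In this run the blender is 0.2 percentage points less accurate than the soft-voting classifier, which corresponds to roughly 6% more errors. One possible reason is that the blender only sees three predicted labels per image, and it was trained on the 10,000 validation instances alone.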