# Setup

In [4]:
# handle math and data
import numpy as np
import pandas as pd

# to plot nice figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# handle files
import os
import joblib

SEED = 69420

# Get MNIST Dataset

In [5]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, as_frame=False)

X, y = mnist.data, mnist.target
X.shape, y.shape

((70000, 784), (70000,))

In [49]:
type(y[0])

str

In [50]:
y = y.astype(np.int8)

In [51]:
type(y[0])

numpy.int8

In [52]:
y

array([5, 0, 4, ..., 4, 5, 6], dtype=int8)

In [54]:
type(X[0, 0])

numpy.float64

The labels are now integers and the pixel values are floats, so the types are set.

## Split Train, Val, and Test Sets

In [55]:
train_size = 50000
val_size = 10000
test_size = 10000

X_train, y_train = X[:train_size], y[:train_size]
X_val, y_val = X[train_size:(train_size + val_size)], y[train_size:(train_size + val_size)]
X_test, y_test = X[-test_size:], y[-test_size:]

X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape, y_test.shape

((50000, 784), (50000,), (10000, 784), (10000,), (10000, 784), (10000,))

We now have a training set for fitting the individual classifiers, a validation set for comparing models and for training the blender of the stacking ensemble, and a test set for final evaluation.
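The slicing pattern above can be checked on a small stand-in array. This is a minimal sketch, assuming a hypothetical 70-row array in place of MNIST (the `*_demo`, `*_tr`, `*_va`, and `*_te` names are made up for illustration). It confirms that the three subsets do not overlap and together cover every row. Note that plain slicing keeps the original row order, so it only gives a fair split if the data is already shuffled, which this notebook does not check.

```python
import numpy as np

# Hypothetical stand-in for MNIST: 70 rows instead of 70,000.
X_demo = np.arange(70 * 3).reshape(70, 3)
y_demo = np.arange(70)  # unique ids, so overlaps are easy to detect

train_size, val_size, test_size = 50, 10, 10

# Same slicing pattern as above: first block, middle block, last block.
X_tr, y_tr = X_demo[:train_size], y_demo[:train_size]
X_va, y_va = (X_demo[train_size:train_size + val_size],
              y_demo[train_size:train_size + val_size])
X_te, y_te = X_demo[-test_size:], y_demo[-test_size:]

# The three subsets should not share rows and should cover the whole array.
all_ids = np.concatenate([y_tr, y_va, y_te])
no_overlap = len(np.unique(all_ids)) == len(all_ids)
full_cover = len(all_ids) == len(y_demo)
print(no_overlap, full_cover)
```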

# Train Models

In [56]:
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

lin_svm_clf = LinearSVC(max_iter=100, tol=20, random_state=SEED)
knn_clf = KNeighborsClassifier()
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=SEED)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=SEED)

Now let's fit all our models.

In [57]:
estimators = [lin_svm_clf, knn_clf, random_forest_clf, extra_trees_clf]

for estimator in estimators:
    print("Training ", estimator)
    estimator.fit(X_train, y_train)

Training  LinearSVC(max_iter=100, random_state=69420, tol=20)
Training  KNeighborsClassifier()
Training  RandomForestClassifier(random_state=69420)
Training  ExtraTreesClassifier(random_state=69420)


## Single Model Evaluations

In [58]:
[estimator.score(X_val, y_val) for estimator in estimators]

[0.866, 0.9718, 0.9712, 0.9739]

So `LinearSVC` is noticeably weaker (about 87% accuracy), while k-nearest neighbors, random forest, and extra trees all score above 97%. Let's train a voting classifier next.

## Voting Classifier

Now we need to do the following:
- train a hard voting classifier using our 4 individual models
    - evaluate it on the validation set
- fine tune our voting classifier
    - try without linearSVC
    - try soft voting (without linearSVC)
- evaluate all models on test set

In [59]:
from sklearn.ensemble import VotingClassifier

named_estimators = [
    ("lin_svm_clf", lin_svm_clf),
    ("knn_clf", knn_clf),
    ("random_forest_clf", random_forest_clf),
    ("extra_trees_clf", extra_trees_clf),
]

voting_clf = VotingClassifier(named_estimators, voting="hard")

In [60]:
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lin_svm_clf',
                              LinearSVC(max_iter=100, random_state=69420,
                                        tol=20)),
                             ('knn_clf', KNeighborsClassifier()),
                             ('random_forest_clf',
                              RandomForestClassifier(random_state=69420)),
                             ('extra_trees_clf',
                              ExtraTreesClassifier(random_state=69420))])

In [61]:
voting_clf.score(X_val, y_val)

0.9733

### Remove LinearSVC from Estimators

The hard voting ensemble scores 0.9733, which is no better than the best individual model (extra trees at 0.9739), so let's test removing `LinearSVC`.
- We first need to remove it from `voting_clf`'s list of estimators, which is used when `fit()` is called
- Then we need to remove the already trained `LinearSVC` from `voting_clf`'s `estimators_`, which is used for prediction
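The two steps above can also be replaced by disabling the estimator and re-fitting. The sketch below is only an illustration on a small synthetic dataset, with stand-in estimators and made-up names (`demo_clf`, `log_reg`, `knn`, `tree`). It uses the string `"drop"` rather than `None`, because newer scikit-learn releases expect `"drop"` (to my knowledge, support for `None` was removed in version 0.24).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset, used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

demo_clf = VotingClassifier(
    estimators=[
        ("log_reg", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
demo_clf.fit(X_demo, y_demo)
n_before = len(demo_clf.estimators_)

# Mark one estimator as dropped, then re-fit so that estimators_ is rebuilt
# from the remaining estimators only.
demo_clf.set_params(log_reg="drop")
demo_clf.fit(X_demo, y_demo)
n_after = len(demo_clf.estimators_)

print(n_before, n_after)
```

Re-fitting costs more time than deleting from `estimators_`, but it avoids editing a fitted attribute by hand.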

In [62]:
voting_clf.set_params(lin_svm_clf=None)

VotingClassifier(estimators=[('lin_svm_clf', None),
                             ('knn_clf', KNeighborsClassifier()),
                             ('random_forest_clf',
                              RandomForestClassifier(random_state=69420)),
                             ('extra_trees_clf',
                              ExtraTreesClassifier(random_state=69420))])

In [63]:
voting_clf.estimators

[('lin_svm_clf', None),
 ('knn_clf', KNeighborsClassifier()),
 ('random_forest_clf', RandomForestClassifier(random_state=69420)),
 ('extra_trees_clf', ExtraTreesClassifier(random_state=69420))]

In [64]:
del voting_clf.estimators_[0]

Alternatively, you can skip the `del` and re-train `voting_clf` using `fit()`, which rebuilds `estimators_` without the removed estimator.

In [65]:
voting_clf.estimators_

[KNeighborsClassifier(),
 RandomForestClassifier(random_state=69420),
 ExtraTreesClassifier(random_state=69420)]

In [66]:
voting_clf.score(X_val, y_val)

0.9751

Accuracy rose from 0.9733 to 0.9751, so `LinearSVC` was indeed dragging down the ensemble.

### Try Soft Voting

In [67]:
voting_clf.voting = "soft"

In [68]:
voting_clf.score(X_val, y_val)

0.9768

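This is even better (0.9768 versus 0.9751 with hard voting)! Soft voting needs `predict_proba`, which the three remaining estimators provide and `LinearSVC` does not. So our final voting classifier is composed of 3 estimators and is a soft voting classifier.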
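To see why soft voting can disagree with hard voting, here is a minimal NumPy sketch. The probabilities are hypothetical numbers chosen for illustration, not outputs of the models above. Two classifiers weakly prefer class 1 and one classifier strongly prefers class 0, so the majority vote and the averaged probabilities pick different classes.

```python
import numpy as np

# Hypothetical class probabilities from 3 classifiers for one instance
# (rows: classifiers, columns: classes 0, 1, 2).
probas = np.array([
    [0.45, 0.55, 0.00],  # weakly prefers class 1
    [0.40, 0.60, 0.00],  # weakly prefers class 1
    [0.95, 0.05, 0.00],  # strongly prefers class 0
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_winner = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
soft_winner = probas.mean(axis=0).argmax()

print(hard_winner, soft_winner)  # prints: 1 0
```

Soft voting gives more weight to confident classifiers, which is one plausible reason it scored slightly higher here.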

## Evaluation on Test Set

In [69]:
[estimator.score(X_test, y_test) for estimator in estimators]

[0.8665, 0.9664, 0.9694, 0.9704]

In [70]:
voting_clf.score(X_test, y_test)

0.9741

The ensemble classifier is indeed better: 0.9741 on the test set, versus 0.9704 for the best individual model (extra trees).

# Stacking Ensemble

Now let's create a stacking ensemble (a model trained on the predictions of other models):
- Create a 10,000 by 4 predictions dataset for the 4 models using our validation set
    - This matrix of predictions will be our training set for our blender model
- Create a random forest model that uses the predictions matrix as X and our y_val as the labels
- Create a function that will automate prediction
    - Takes an instance and feeds it to all 4 layer 0 models
    - Creates a layer 1 predictions array
    - Feeds layer 1 predictions array to our trained blender
    - Returns blender's prediction

## Layer 1 Predictions

In [71]:
layer_1_predictions = np.array([estimator.predict(X_val) for estimator in estimators]).T
layer_1_predictions

array([[3, 3, 3, 3],
       [8, 8, 8, 8],
       [6, 6, 6, 6],
       ...,
       [5, 5, 5, 5],
       [6, 6, 6, 6],
       [8, 8, 8, 8]], dtype=int8)

In [72]:
layer_1_predictions.shape

(10000, 4)

Perfect! This `layer_1_predictions` matrix is what we will use to train our blender.

## Train Blender

In [102]:
blender_1 = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=SEED)

blender_1.fit(layer_1_predictions, y_val)

RandomForestClassifier(oob_score=True, random_state=69420)

In [103]:
blender_1.oob_score_

0.9739

The out-of-bag (OOB) score lets us estimate the blender's performance without a separate validation set.
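The out-of-bag estimate works because each tree in a random forest is trained on a bootstrap sample (rows drawn with replacement), so every tree leaves some rows unseen, and those rows can be scored by the trees that never saw them. The sketch below is a NumPy-only illustration of a single bootstrap draw, not scikit-learn's internal code. It measures what fraction of rows is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000  # same size as the validation set used to train the blender

# One bootstrap sample: draw n_samples row indices with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows that were never drawn are "out of bag" for this tree.
in_bag = np.zeros(n_samples, dtype=bool)
in_bag[bootstrap_idx] = True
oob_fraction = 1.0 - in_bag.mean()

# In theory this fraction approaches exp(-1), roughly 0.368, for large n_samples.
print(round(oob_fraction, 3))
```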

In [100]:
y_stack_pred = blender_1.predict(layer_1_predictions)
y_stack_pred

array([3, 8, 6, ..., 5, 6, 8], dtype=int8)

In [101]:
from sklearn.metrics import accuracy_score

# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the result is unchanged
accuracy_score(y_val, y_stack_pred)

0.9858

This is just a sanity check for the next section on automating prediction. The score is obviously biased, since `blender_1` was trained on `layer_1_predictions`.
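How large this bias can be is easy to see on data with no signal at all. The sketch below is only an illustration: it uses random features and random labels, and `X_noise`, `y_noise`, and `forest` are made-up names. The forest scores far higher on the rows it was trained on than its out-of-bag estimate suggests.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Pure-noise data: the labels are independent of the features,
# so no model can truly do much better than 50% on unseen rows.
X_noise = rng.normal(size=(500, 5))
y_noise = rng.integers(0, 2, size=500)

forest = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0)
forest.fit(X_noise, y_noise)

train_acc = forest.score(X_noise, y_noise)  # scored on the rows it was fit on
oob_acc = forest.oob_score_                 # each row scored only by trees that did not see it

print(train_acc, oob_acc)
```

The gap between the two numbers is memorization, which is why `oob_score_` or a held-out set is the better guide.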

## Automate Prediction

For each instance in input X
- predict its y using all 4 estimators
- feed the predictions into the blender to predict a layer_2 y
- return layer_2 y as final prediction

In [79]:
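# a one-layer stacking ensemble model
def stacking_predict(X, estimators, blender):
    # one column of predictions per layer 0 estimator, shape (n_samples, n_estimators)
    y_1_pred = np.array([estimator.predict(X) for estimator in estimators]).T

    # the blender maps those predictions to the final prediction
    return blender.predict(y_1_pred)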

In [85]:
y_stack_pred = stacking_predict(X_val, estimators, blender_1)
y_stack_pred

array([3, 8, 6, ..., 5, 6, 8], dtype=int8)

In [86]:
y_stack_pred.shape

(10000,)

In [87]:
accuracy_score(y_val, y_stack_pred)

0.9858

OK, the prediction function works. Now let's evaluate all models on the test set!

## Evaluate on Test Set

In [98]:
for estimator in estimators:
    print(estimator)
    print('{:>40} {:>6}  {:>6}'.format(' ', "Accuracy Score: ", estimator.score(X_test, y_test)))
    
y_stack_pred = stacking_predict(X_test, estimators, blender_1)
stack_acc_score = accuracy_score(y_test, y_stack_pred)

print("Stacking Ensemble")
print('{:>40} {:>6}  {:>6}'.format(' ', "Accuracy Score: ", stack_acc_score))

LinearSVC(max_iter=100, random_state=69420, tol=20)
                                         Accuracy Score:   0.8665
KNeighborsClassifier()
                                         Accuracy Score:   0.9664
RandomForestClassifier(random_state=69420)
                                         Accuracy Score:   0.9694
ExtraTreesClassifier(random_state=69420)
                                         Accuracy Score:   0.9704
Stacking Ensemble
                                         Accuracy Score:    0.971


The final result (0.971) wasn't much better than extra trees (0.9704), but it wasn't worse! It did, however, fall short of the soft voting classifier (0.9741).

# Comparison to Hands-On ML

- Géron used `np.empty` and then assigned each model's predictions to one column of the (10000, 4) 2D array
    - My method of converting the list to an array and then taking the transpose also works well, and is perhaps more concise
- Géron used `oob_score=True` to evaluate the blender's performance without a separate validation set
    - This is something I should definitely use
- Géron used `del` to delete elements from lists, which is something I should use (perhaps also to delete unnecessary variables from memory?)
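The two ways of building the prediction matrix from the first bullet give the same result. The sketch below checks this on hypothetical predictions for 5 instances and 4 models. The numbers and the `int8` dtype are my own choices for illustration.

```python
import numpy as np

# Hypothetical per-model predictions for 5 instances (stand-ins for
# estimator.predict(X_val) from each of the 4 models).
preds_per_model = [
    np.array([3, 8, 6, 5, 6], dtype=np.int8),
    np.array([3, 8, 6, 5, 6], dtype=np.int8),
    np.array([3, 8, 5, 5, 6], dtype=np.int8),
    np.array([3, 8, 6, 5, 8], dtype=np.int8),
]

# Approach 1: preallocate with np.empty and fill one column per model.
filled = np.empty((5, len(preds_per_model)), dtype=np.int8)
for col, preds in enumerate(preds_per_model):
    filled[:, col] = preds

# Approach 2: stack into a (4, 5) array and transpose to (5, 4).
transposed = np.array(preds_per_model).T

same = np.array_equal(filled, transposed)
print(filled.shape, same)  # prints: (5, 4) True
```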

<u>Changes I Should Make:</u>
- Read more about the parameters and attributes of the libraries and packages I'm using
    - This is so I don't miss out on useful functionality like `estimators_` in `VotingClassifier` and `oob_score_` in `RandomForestClassifier`