# Ensemble learning exercise

Train a Random Forest, an Extra Trees and an SVM classifier on the MNIST data set and use the validation set to evaluate their performance. Then combine the three classifiers into an ensemble using hard and soft voting and compare its performance on the validation set with that obtained previously. Does the ensemble improve on each individual classifier?

 - [Splitting the data into training, validation and testing sets](#Splitting-the-data-into-training,-validation-and-testing-sets)
 - [Model training](#Model-training)
   - [Decision tree](#Decision-tree)
   - [Random forest](#Random-forest)
   - [Support vector machine](#Support-vector-machine)
     - [SVC](#SVC)
     - [Linear SVC](#Linear-SVC)
   - [XGBoost](#XGBoost)
 - [Ensemble model](#Ensemble-model)
   - [Hard voting](#Hard-voting)
 - [Model testing](#Model-testing)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# load the MNIST data set
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
mnist.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details', 'categories', 'url'])

In [3]:
print(mnist.DESCR)

**Author**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges  
**Source**: [MNIST Website](http://yann.lecun.com/exdb/mnist/) - Date unknown  
**Please cite**:  

The MNIST database of handwritten digits with 784 features, raw data available at: http://yann.lecun.com/exdb/mnist/. It can be split in a training set of the first 60,000 examples, and a test set of 10,000 examples  

It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image b

## Splitting the data into training, validation and testing sets

In [4]:
print(mnist['data'].shape)
print(type(mnist['data']))
print(mnist['target'].shape)
print(type(mnist['target']))

(70000, 784)
<class 'numpy.ndarray'>
(70000,)
<class 'numpy.ndarray'>


In [5]:
training_ratio = 0.6
validation_ratio = 0.2
testing_ratio = 0.2
training_size = int(training_ratio*len(mnist['data']))
test_size = int(testing_ratio*len(mnist['data']))
validation_size = len(mnist['data']) - training_size - test_size
shuffled_index = np.random.permutation(len(mnist['data']))

In [6]:
X_training_set = mnist['data'][shuffled_index[:training_size]]
X_validation_set = mnist['data'][shuffled_index[training_size:-test_size]]
X_testing_set = mnist['data'][shuffled_index[-test_size:]]

In [7]:
print(X_training_set.shape)
print(X_validation_set.shape)
print(X_testing_set.shape)

(42000, 784)
(14000, 784)
(14000, 784)


In [8]:
y_training_set = mnist['target'][shuffled_index[:training_size]]
y_validation_set = mnist['target'][shuffled_index[training_size:-test_size]]
y_testing_set = mnist['target'][shuffled_index[-test_size:]]

In [9]:
print(y_training_set.shape)
print(y_validation_set.shape)
print(y_testing_set.shape)

(42000,)
(14000,)
(14000,)
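
The permutation above is unseeded, so the split changes on every run, while the fitted models further down are cached on disk. A minimal sketch of the same 60/20/20 split with a fixed seed follows; `split_indices` and the seed value `42` are illustrative choices, not part of the original notebook.

```python
import numpy as np

def split_indices(n_samples, training_ratio=0.6, testing_ratio=0.2, seed=42):
    """Return shuffled (training, validation, testing) index arrays."""
    rng = np.random.default_rng(seed)  # fixed seed -> same split on every run
    shuffled_index = rng.permutation(n_samples)
    training_size = int(training_ratio * n_samples)
    test_size = int(testing_ratio * n_samples)
    train_idx = shuffled_index[:training_size]
    valid_idx = shuffled_index[training_size:n_samples - test_size]
    test_idx = shuffled_index[n_samples - test_size:]
    return train_idx, valid_idx, test_idx

# 70000 is the number of MNIST samples reported above
train_idx, valid_idx, test_idx = split_indices(70000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The index arrays can then be used exactly like the slices of `shuffled_index` above, for example `mnist['data'][train_idx]`.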


## Model training

Let's train decision tree, random forest and support vector machine classifiers on the training set, and check their performance on the validation set.

In [10]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib
from pathlib import Path
from time import time

In [11]:
tree_clf = DecisionTreeClassifier()
rf_clf = RandomForestClassifier(n_estimators=100)

In [12]:
svc_clf = Pipeline([('scaler', StandardScaler()), ('svc', SVC(gamma='scale', probability=True, kernel='linear'))])
linear_svc_clf = Pipeline([('scaler', StandardScaler()), ('linear_svc', LinearSVC(dual=False, tol=1e-3))])

### Decision tree

In [13]:
t0 = time()
joblib_file = Path('./mnist_models/tree_clf.sav')
if joblib_file.is_file():
    tree_clf = joblib.load(joblib_file)
else:
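    tree_clf.fit(X_training_set, y_training_set)
    # save to the same path that is checked above, so the next run finds the cached model
    joblib_file.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(tree_clf, joblib_file)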
print(f'Time elapsed: {time()-t0:.2f} sec')

Time elapsed: 0.01 sec
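
The load-or-fit pattern in the cell above is repeated for every model in this notebook. It could be factored into one helper, sketched below under the assumption that every model exposes `fit`; `load_or_fit` and the demo data are hypothetical names introduced only for illustration.

```python
import tempfile
from pathlib import Path
from time import time

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def load_or_fit(model, X, y, joblib_file):
    """Load a fitted model from joblib_file if it exists, else fit and save it."""
    joblib_file = Path(joblib_file)
    t0 = time()
    if joblib_file.is_file():
        model = joblib.load(joblib_file)
    else:
        model.fit(X, y)
        # create the target directory and save to the path that is checked above
        joblib_file.parent.mkdir(parents=True, exist_ok=True)
        joblib.dump(model, joblib_file)
    print(f'Time elapsed: {time()-t0:.2f} sec')
    return model

# tiny made-up data set, only to exercise the helper
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array(['0', '0', '1', '1'])
demo_file = Path(tempfile.mkdtemp()) / 'mnist_models' / 'demo_tree_clf.sav'

demo_clf = load_or_fit(DecisionTreeClassifier(), X_demo, y_demo, demo_file)  # fits and saves
demo_clf = load_or_fit(DecisionTreeClassifier(), X_demo, y_demo, demo_file)  # loads the saved model
```

Note that a cached model is loaded whether or not the training data has changed since it was saved, so the cache should be cleared whenever the split changes.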


In [14]:
t0 = time()
y_prediction_tree_clf_set = tree_clf.predict(X_validation_set)
accuracy_score_tree_clf = accuracy_score(y_prediction_tree_clf_set, y_validation_set)
print(f'Time elapsed: {time()-t0:.2f} sec')
print(f'Accuracy score: {accuracy_score_tree_clf:.6f}')

Time elapsed: 0.07 sec
Accuracy score: 0.944857


### Random forest

In [15]:
t0 = time()
joblib_file = Path('./mnist_models/rf_clf.sav')
if joblib_file.is_file():
    rf_clf = joblib.load(joblib_file)
else:
    rf_clf.fit(X_training_set, y_training_set)
    # save to the same path that is checked above
    joblib_file.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(rf_clf, joblib_file)
print(f'Time elapsed: {time()-t0:.2f} sec')

Time elapsed: 0.23 sec


In [16]:
t0 = time()
y_prediction_rf_clf_set = rf_clf.predict(X_validation_set)
accuracy_score_rf_clf = accuracy_score(y_prediction_rf_clf_set, y_validation_set)
print(f'Time elapsed: {time()-t0:.2f} sec')
print(f'Accuracy score: {accuracy_score_rf_clf:.6f}')

Time elapsed: 0.49 sec
Accuracy score: 0.987071


### Support vector machine

#### SVC

In [17]:
t0 = time()
joblib_file = Path('./mnist_models/svc_clf.sav')
if joblib_file.is_file():
    svc_clf = joblib.load(joblib_file)
else:
    svc_clf.fit(X_training_set, y_training_set)
    # save to the same path that is checked above
    joblib_file.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(svc_clf, joblib_file)
print(f'Time elapsed: {time()-t0:.2f} sec')

Time elapsed: 0.07 sec


In [18]:
t0 = time()
y_prediction_svc_clf_set = svc_clf.predict(X_validation_set)
accuracy_score_svc_clf = accuracy_score(y_prediction_svc_clf_set, y_validation_set)
print(f'Time elapsed: {time()-t0:.2f} sec')
print(f'Accuracy score: {accuracy_score_svc_clf:.6f}')

Time elapsed: 77.30 sec
Accuracy score: 0.964286


#### Linear SVC

In [19]:
t0 = time()
joblib_file = Path('./mnist_models/linear_svc_clf.sav')
if joblib_file.is_file():
    linear_svc_clf = joblib.load(joblib_file)
else:
    linear_svc_clf.fit(X_training_set, y_training_set)
    # save to the same path that is checked above
    joblib_file.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(linear_svc_clf, joblib_file)
print(f'Time elapsed: {time()-t0:.2f} sec')

Time elapsed: 0.00 sec


In [20]:
t0 = time()
y_prediction_linear_svc_clf_set = linear_svc_clf.predict(X_validation_set)
accuracy_score_linear_svc_clf = accuracy_score(y_prediction_linear_svc_clf_set, y_validation_set)
print(f'Time elapsed: {time()-t0:.2f} sec')
print(f'Accuracy score: {accuracy_score_linear_svc_clf:.6f}')

Time elapsed: 0.12 sec
Accuracy score: 0.922857


### XGBoost

In [30]:
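import xgboost as xgb
# XGBoost needs numeric labels; fetch_openml returns the MNIST targets as strings
dtrain = xgb.DMatrix(X_training_set, label=y_training_set.astype(int))
dvalid = xgb.DMatrix(X_validation_set, label=y_validation_set.astype(int))

In [31]:
t0 = time()
joblib_file = Path('./mnist_models/xgb_clf.sav')
if joblib_file.is_file():
    xgb_clf = joblib.load(joblib_file)
else:
    # MNIST has 10 classes, so use a multiclass objective and a multiclass error metric
    params = {'max_depth': 2, 'eta': 1, 'objective': 'multi:softmax', 'num_class': 10, 'nthread': 4, 'eval_metric': 'merror'}
    eval_list = [(dtrain, 'train'), (dvalid, 'eval')]
    num_round = 10
    xgb_clf = xgb.train(params, dtrain, num_boost_round=num_round, evals=eval_list)
    # save to the same path that is checked above
    joblib_file.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(xgb_clf, joblib_file)
print(f'Time elapsed: {time()-t0:.2f} sec')

In [None]:
t0 = time()
# a Booster predicts on a DMatrix; with multi:softmax it returns class indices as floats
y_prediction_xgb_clf_set = xgb_clf.predict(dvalid).astype(int)
accuracy_score_xgb_clf = accuracy_score(y_prediction_xgb_clf_set, y_validation_set.astype(int))
print(f'Time elapsed: {time()-t0:.2f} sec')
print(f'Accuracy score for {xgb_clf.__class__.__name__}: {accuracy_score_xgb_clf:.6f}')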

## Ensemble model

### Hard voting

In [None]:
from sklearn.ensemble import VotingClassifier
voting_clf = VotingClassifier(estimators=[('tr', tree_clf), ('rf', rf_clf), ('svc', svc_clf), ('linear_svc', linear_svc_clf)], voting='hard', n_jobs=-1)
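
Hard voting predicts, for each sample, the label that most of the individual classifiers voted for. A minimal NumPy sketch of that rule on made-up predictions follows; `majority_vote` is a hypothetical helper, not the scikit-learn implementation, and in this sketch ties go to the label that sorts first.

```python
import numpy as np

def majority_vote(predictions):
    """predictions: array of shape (n_classifiers, n_samples) holding labels."""
    predictions = np.asarray(predictions)
    voted = []
    for column in predictions.T:  # one column = all votes for one sample
        labels, counts = np.unique(column, return_counts=True)
        # argmax returns the first maximum, so ties go to the lowest-sorting label
        voted.append(labels[np.argmax(counts)])
    return np.array(voted)

# made-up votes of four classifiers on three samples
votes = np.array([
    ['3', '7', '1'],
    ['3', '1', '1'],
    ['5', '7', '1'],
    ['3', '7', '9'],
])
print(majority_vote(votes))  # ['3' '7' '1']
```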

In [None]:
t0 = time()
joblib_file = Path('./mnist_models/voting_clf.sav')
if joblib_file.is_file():
    voting_clf = joblib.load(joblib_file)
else:
    voting_clf.fit(X_training_set, y_training_set)
    # save to the same path that is checked above
    joblib_file.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(voting_clf, joblib_file)
print(f'Time elapsed: {time()-t0:.2f} sec')

In [None]:
t0 = time()
y_prediction_voting_clf_set = voting_clf.predict(X_validation_set)
accuracy_score_voting_clf = accuracy_score(y_prediction_voting_clf_set, y_validation_set)
print(f'Time elapsed: {time()-t0:.2f} sec')

In [None]:
names = ('Decision tree classifier', 'Random forest classifier', 'SVC classifier', 'Linear SVC classifier', 'Voting classifier')
validation_scores = (accuracy_score_tree_clf, accuracy_score_rf_clf, accuracy_score_svc_clf,
                     accuracy_score_linear_svc_clf, accuracy_score_voting_clf)
for name, score in zip(names, validation_scores):
    print(f'Accuracy score for {name}: {score:.6f}')
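
The exercise statement also asks for soft voting, which averages the class probabilities of the estimators instead of counting their votes. A minimal sketch follows, assuming scikit-learn's small built-in digits data set (`load_digits`) as a stand-in for MNIST so that it runs quickly. `LinearSVC` is left out because it has no `predict_proba`, and the `random_state` values are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# stand-in data: 8x8 digit images instead of the 28x28 MNIST images
X, y = load_digits(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# every estimator must implement predict_proba for soft voting
soft_voting_clf = VotingClassifier(
    estimators=[
        ('tr', DecisionTreeClassifier(random_state=42)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('svc', Pipeline([('scaler', StandardScaler()),
                          ('svc', SVC(kernel='linear', probability=True, random_state=42))])),
    ],
    voting='soft',
)
soft_voting_clf.fit(X_train, y_train)
soft_accuracy = accuracy_score(y_valid, soft_voting_clf.predict(X_valid))
print(f'Accuracy score for soft voting: {soft_accuracy:.6f}')
```

On MNIST the same construction would use `X_training_set` and `X_validation_set` instead. Expect the SVC with `probability=True` to dominate the fitting time, since it runs an internal cross-validation to calibrate its probabilities.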

## Model testing

In [None]:
t0 = time()
models = (tree_clf, rf_clf, svc_clf, linear_svc_clf, voting_clf)
testing_scores = []
for name, model in zip(names, models):
    y_prediction_set = model.predict(X_testing_set)
    score = accuracy_score(y_prediction_set, y_testing_set)
    testing_scores.append(score)
    print(f'Done with {name}')
print(f'Time elapsed: {time()-t0:.2f} sec')

In [None]:
for name, score in zip(names, testing_scores):
    print(f'Accuracy score for {name}: {score:.6f}')