# Ensemble learning exercise

Train a Random Forest, an Extra Trees, and an SVM classifier on the MNIST data set and evaluate each of them on the validation set. Then combine the three classifiers into an ensemble using hard and soft voting, and compare its performance on the validation set with that obtained previously. Does the ensemble improve on each of the individual classifiers?

 - [Splitting the data into training, validation and testing sets](#Spliting-the-data-into-training,-validation-and-testing-sets)
 - [Model selection](#Model-selection)
   - [Decision Tree](#Decision-Tree)
   - [Support Vector Machine](#Support-Vector-Machine)
   - [Random Forest](#Random-Forest)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# load the MNIST data set
# (on scikit-learn >= 0.24, pass as_frame=False to get the NumPy arrays used below)
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
mnist.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details', 'categories', 'url'])

In [3]:
print(mnist.DESCR)

**Author**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges  
**Source**: [MNIST Website](http://yann.lecun.com/exdb/mnist/) - Date unknown  
**Please cite**:  

The MNIST database of handwritten digits with 784 features, raw data available at: http://yann.lecun.com/exdb/mnist/. It can be split in a training set of the first 60,000 examples, and a test set of 10,000 examples  

It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image b

## Splitting the data into training, validation and testing sets

In [4]:
print(mnist['data'].shape)
print(type(mnist['data']))
print(mnist['target'].shape)
print(type(mnist['target']))

(70000, 784)
<class 'numpy.ndarray'>
(70000,)
<class 'numpy.ndarray'>


In [5]:
training_ratio = 0.6
validation_ratio = 0.2
testing_ratio = 0.2
training_size = int(training_ratio*len(mnist['data']))
test_size = int(testing_ratio*len(mnist['data']))
validation_size = len(mnist['data']) - training_size - test_size
shuffled_index = np.random.permutation(len(mnist['data']))

In [6]:
X_training_set = mnist['data'][shuffled_index[:training_size]]
X_validation_set = mnist['data'][shuffled_index[training_size:-test_size]]
X_testing_set = mnist['data'][shuffled_index[-test_size:]]

In [7]:
print(X_training_set.shape)
print(X_validation_set.shape)
print(X_testing_set.shape)

(42000, 784)
(14000, 784)
(14000, 784)


In [8]:
y_training_set = mnist['target'][shuffled_index[:training_size]]
y_validation_set = mnist['target'][shuffled_index[training_size:-test_size]]
y_testing_set = mnist['target'][shuffled_index[-test_size:]]

In [9]:
print(y_training_set.shape)
print(y_validation_set.shape)
print(y_testing_set.shape)

(42000,)
(14000,)
(14000,)
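
The manual index slicing above could also be expressed with `train_test_split`. The following is a minimal sketch, not the notebook's method: it assumes the small built-in `load_digits` data set as a stand-in so it runs without downloading MNIST, and the `stratify` and `random_state` arguments are additions that keep class proportions equal across the splits and make the shuffle reproducible.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): the built-in digits set instead of MNIST.
# Swap in mnist['data'] and mnist['target'] to apply this to the notebook.
X, y = load_digits(return_X_y=True)

# Carve off 20% for testing, then 25% of the remaining 80% for validation,
# which gives roughly the same 60/20/20 proportions used above.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

print(X_train.shape, X_val.shape, X_test.shape)
```

Fixing `random_state` means the same split is produced on every run, which makes accuracy figures comparable between sessions.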


## Model selection

Let's train decision tree, random forest, and support vector machine classifiers on the training set and check each one's performance on the validation set.

In [10]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from time import time

In [17]:
tree_clf = DecisionTreeClassifier()
svm_clf = SVC(gamma='auto')
rf_clf = RandomForestClassifier(n_estimators=100)
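
The fit, time, and score steps repeated in the subsections below can be wrapped in one helper. This is only a sketch under assumptions: `fit_and_score` is a hypothetical name, and the built-in `load_digits` data set stands in for MNIST so the example runs quickly.

```python
from time import perf_counter

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_and_score(clf, X_train, y_train, X_val, y_val):
    """Fit clf and return (validation accuracy, fit time in seconds)."""
    t0 = perf_counter()
    clf.fit(X_train, y_train)
    elapsed = perf_counter() - t0
    return accuracy_score(y_val, clf.predict(X_val)), elapsed


# Stand-in data (assumption): built-in digits instead of MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

classifiers = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
results = {
    name: fit_and_score(clf, X_train, y_train, X_val, y_val)
    for name, clf in classifiers.items()
}
for name, (acc, secs) in results.items():
    print(f"{name}: accuracy={acc:.4f}, fit time={secs:.2f} sec")
```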

### Decision Tree

In [12]:
t0 = time()
tree_clf.fit(X_training_set, y_training_set)
print(f'Time elapsed: {time()-t0:.2f} sec')

Time elapsed: 13.39 sec


In [13]:
y_prediction_set = tree_clf.predict(X_validation_set)
print(f'Accuracy score: {accuracy_score(y_validation_set, y_prediction_set):.6f}')

Accuracy score: 0.862857


### Support Vector Machine

In [14]:
t0 = time()
svm_clf.fit(X_training_set, y_training_set)
print(f'Time elapsed: {time()-t0:.2f} sec')

Time elapsed: 3795.72 sec


In [15]:
y_prediction_set = svm_clf.predict(X_validation_set)
print(f'Accuracy score: {accuracy_score(y_validation_set, y_prediction_set):.6f}')

Accuracy score: 0.111357
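
An accuracy of about 0.11 on a ten-class problem is close to chance. A likely cause, which this notebook does not verify, is that the RBF kernel with `gamma='auto'` (that is, `1 / n_features`) receives raw pixel values in the range 0 to 255: squared distances between images are then so large that the kernel value between almost any two distinct images is near zero. The sketch below shows the usual remedy, rescaling the features before the SVM. It assumes the built-in `load_digits` data set as a stand-in for MNIST, and `MinMaxScaler` is one possible scaler, not a choice made in the notebook.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in data (assumption): built-in digits instead of MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The scaler is fitted on the training data only, then applied to any data
# passed to predict(), so no validation information leaks into the fit.
scaled_svm = make_pipeline(MinMaxScaler(), SVC(gamma='auto'))
scaled_svm.fit(X_train, y_train)

scaled_accuracy = accuracy_score(y_val, scaled_svm.predict(X_val))
print(f'Accuracy score with scaled features: {scaled_accuracy:.6f}')
```

On the full MNIST training set the SVM fit would still be slow, so trying the pipeline on a subset first is a reasonable precaution.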


### Random Forest

In [18]:
t0 = time()
rf_clf.fit(X_training_set, y_training_set)
print(f'Time elapsed: {time()-t0:.2f} sec')

Time elapsed: 28.16 sec


In [19]:
y_prediction_set = rf_clf.predict(X_validation_set)
print(f'Accuracy score: {accuracy_score(y_validation_set, y_prediction_set):.6f}')

Accuracy score: 0.966786
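
The exercise statement asks for an ensemble built with hard and soft voting. The following is a minimal sketch of how that step might look with `VotingClassifier`, under stated assumptions: it uses the built-in `load_digits` data set as a stand-in for MNIST, the three base models named in the exercise (Random Forest, Extra Trees, SVM), and `gamma='scale'` for the SVM rather than the `gamma='auto'` setting used above. Soft voting averages class probabilities, so the SVM needs `probability=True`.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data (assumption): built-in digits instead of MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

estimators = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("et", ExtraTreesClassifier(n_estimators=100, random_state=0)),
    # probability=True enables predict_proba, which soft voting requires.
    ("svm", SVC(gamma="scale", probability=True, random_state=0)),
]

voting_scores = {}
for voting in ("hard", "soft"):
    # VotingClassifier fits clones of the estimators, so the list is reusable.
    ensemble = VotingClassifier(estimators=estimators, voting=voting)
    ensemble.fit(X_train, y_train)
    voting_scores[voting] = accuracy_score(y_val, ensemble.predict(X_val))
    print(f'{voting} voting accuracy: {voting_scores[voting]:.6f}')
```

Hard voting takes the majority class label across the three models, while soft voting picks the class with the highest average predicted probability.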
