# EECS 461 Machine Learning Assignment 3

### 25 Jan. 2018

### Amer Nour Eddin | 213171245

Scikit-Learn provides helper functions for downloading popular datasets, and MNIST is one of them. We will use the MNIST dataset: the first 60000 instances form the training set, and the remaining 10000 instances are used to create the validation and test sets.

In [1]:
from sklearn.datasets import fetch_mldata
from sklearn.utils import shuffle

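# fetch_mldata is only available in scikit-learn versions before 0.22;
# later versions provide fetch_openml instead.
mnist = fetch_mldata('MNIST original')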
X = mnist['data']
y = mnist['target']

X_train, X_validation_test, y_train, y_validation_test = X[:60000], X[60000:], y[:60000], y[60000:]

# Training set
X_train, y_train = shuffle(X_train, y_train, random_state=0)

X_validation_test, y_validation_test = shuffle(X_validation_test, y_validation_test, random_state=0)

# Validation set
X_validation, y_validation = X_validation_test[:5000], y_validation_test[:5000]

# Test set
X_test, y_test = X_validation_test[5000:], y_validation_test[5000:]


In [2]:
from sklearn.externals import joblib

## Voting Classifiers

### a. 
Create a Random Forest classifier with random_state=0. Train the classifier on the training set. Save your classifier in pickle format as *RFClassifier.pkl*.

In [3]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(random_state=0)
rnd_clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

In [4]:
joblib.dump(rnd_clf, 'RFClassifier.pkl')

['RFClassifier.pkl']

### b.
Create an Extra-Trees classifier with random_state=0. Train the classifier on the training set. Save your classifier in pickle format as *ETClassifier.pkl*.

In [5]:
from sklearn.ensemble import ExtraTreesClassifier

et_clf = ExtraTreesClassifier(random_state=0)
et_clf.fit(X_train, y_train)

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
           verbose=0, warm_start=False)

In [6]:
joblib.dump(et_clf, 'ETClassifier.pkl')

['ETClassifier.pkl']

### c.
Combine Random Forest and Extra-Trees classifiers into an ensemble classifier using a soft Voting classifier. Save your trained classifier in pickle format as *SoftEnsembleClassifier.pkl*.

You may use Scikit-Learn’s __VotingClassifier__.

+ Note: In the ensemble classifier, the Random Forest and Extra-Trees classifiers are new classifiers, not the classifiers from parts a and b.

In [7]:
from sklearn.ensemble import VotingClassifier

rnd_clf_ens = RandomForestClassifier(random_state=0)
et_clf_ens = ExtraTreesClassifier(random_state=0)

voting_clf = VotingClassifier(estimators=[('rf', rnd_clf_ens), ('et', et_clf_ens)], voting='soft')
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('rf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_..._estimators=10, n_jobs=1, oob_score=False, random_state=0,
           verbose=0, warm_start=False))],
         n_jobs=1, voting='soft', weights=None)

In [None]:
joblib.dump(voting_clf, 'SoftEnsembleClassifier.pkl')

['SoftEnsembleClassifier.pkl']
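
Soft voting averages the class-probability vectors of the individual classifiers and predicts the class with the highest mean probability. The following is a minimal NumPy sketch of that rule, not the VotingClassifier implementation itself. It assumes equal weights and reuses the illustrative probability vectors that part e gives for an image of the digit 9; the variable names are chosen here for illustration only.

```python
import numpy as np

# Illustrative probability vectors taken from the part e example (image of a 9).
rf_proba = np.array([0., 0., 0.1, 0., 0.2, 0., 0., 0., 0.1, 0.6])
et_proba = np.array([0., 0., 0.1, 0.1, 0.2, 0., 0.2, 0., 0., 0.4])

# Soft voting with equal weights: average the probabilities class by class.
mean_proba = np.mean([rf_proba, et_proba], axis=0)

# The predicted class is the one with the largest averaged probability.
soft_vote = int(np.argmax(mean_proba))
print(soft_vote)
```

Both classifiers already favour class 9 here, so the averaged vector does too; soft voting changes the outcome mainly when the classifiers disagree and one of them is much more confident than the other.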

### d.
How much better does the ensemble classifier perform compared to the individual classifiers? Use the __test set__ to measure the __accuracy score__ of each classifier and return the scores in a single list. Save your result in pickle format as *part_d.pkl*. The order of the classifiers is Random Forest, Extra-Trees and the ensemble classifier. Return your results in that order.

For example, if the Random Forest, Extra-Trees and ensemble classifiers have accuracy scores of 0.6, 0.65 and 0.8 respectively, the result will be [0.6, 0.65, 0.8].
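
To answer the "how much better" part of the question, the ensemble's score can be compared with each individual score. The sketch below does this for the hypothetical scores from the example above; the names `scores`, `names` and `gains` are illustrative, and the numbers are not measured results.

```python
# Hypothetical scores from the example above, in the required order:
# Random Forest, Extra-Trees, ensemble.
scores = [0.6, 0.65, 0.8]
names = ["Random Forest", "Extra-Trees", "Ensemble"]

ensemble_score = scores[-1]

# Accuracy gained by the ensemble over each individual classifier,
# rounded to hide floating-point noise.
gains = {
    name: round(ensemble_score - score, 4)
    for name, score in zip(names[:-1], scores[:-1])
}
print(gains)
```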

In [None]:
from sklearn.metrics import accuracy_score

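# All three classifiers were already fitted in the cells above,
# so they are only evaluated here.
# Order: Random Forest, Extra-Trees, soft-voting ensemble.
result = []
for clf in (rnd_clf, et_clf, voting_clf):
    y_pred = clf.predict(X_test)
    result.append(accuracy_score(y_test, y_pred))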


In [None]:
result

In [None]:
joblib.dump(result, 'part_d.pkl')
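
A saved file can be checked by loading it back and comparing it with the original object. The sketch below shows such a round trip on a hypothetical score list written to a temporary directory. It assumes the standalone `joblib` package is installed (it provides the same `dump` and `load` functions as `sklearn.externals.joblib`).

```python
import os
import tempfile

import joblib  # assumption: the standalone joblib package is installed

# Hypothetical scores used only to illustrate the round trip, not measured results.
result_example = [0.6, 0.65, 0.8]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "part_d.pkl")
    joblib.dump(result_example, path)   # write the list to disk
    restored = joblib.load(path)        # read it back

print(restored == result_example)
```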

## Stacking

### e.
Run the individual classifiers (the Random Forest and Extra-Trees classifiers from parts a and b) to make __probabilistic predictions__ on the __validation set__, and create a new training set from the resulting predictions: each training instance is a vector containing the probabilistic predictions from all your classifiers for an image, and the target is the image’s class. Save the new training set in a pickle file as *part_e.pkl*.

For instance, to predict an image of the number 9, first compute the image’s probability vector with each individual classifier’s *predict_proba()* function.

Let’s say the Random Forest classifier’s probabilistic prediction output is:
[ 0. 0. 0.1 0. 0.2 0. 0. 0. 0.1 0.6 ]

and Extra-Trees classifier’s probabilistic prediction output is:
[ 0. 0. 0.1 0.1 0.2 0. 0.2 0. 0. 0.4 ]

The new representation of the image will be:
[ 0. 0. 0.1 0. 0.2 0. 0. 0. 0.1 0.6 0. 0. 0.1 0.1 0.2 0. 0.2 0. 0. 0.4 ].

The new training set will consist of the new representations of the validation set instances (keep the same order as in the validation set).
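
The row-wise concatenation described above can be sketched with NumPy. In this illustration the first row of each array is the example vector from the text, while the second row is made up so that the result has more than one instance; the array names are assumptions.

```python
import numpy as np

# Illustrative predict_proba outputs for two images (one row per image).
# Row 0 is the example from the text; row 1 is invented for illustration.
rf_proba = np.array([[0., 0., 0.1, 0., 0.2, 0., 0., 0., 0.1, 0.6],
                     [0.9, 0., 0., 0., 0., 0., 0.1, 0., 0., 0.]])
et_proba = np.array([[0., 0., 0.1, 0.1, 0.2, 0., 0.2, 0., 0., 0.4],
                     [0.8, 0., 0.1, 0., 0., 0., 0.1, 0., 0., 0.]])

# Place the Extra-Trees probabilities to the right of the Random Forest ones,
# so every image is described by 10 + 10 = 20 features.
stacked = np.hstack((rf_proba, et_proba))
print(stacked.shape)
```

Because the rows are joined position by position, the order of the instances is preserved, which keeps each stacked row aligned with its original target label.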

In [None]:
rnd_proba = rnd_clf.predict_proba(X_validation)
rnd_proba[0]

In [None]:
et_proba = et_clf.predict_proba(X_validation)
et_proba[0]

In [None]:
import numpy as np

# Each row joins the 10 Random Forest probabilities with the 10 Extra-Trees
# probabilities for one validation image, in validation-set order.
new_training = np.hstack((rnd_proba, et_proba))
new_training

In [14]:
joblib.dump(new_training, 'part_e.pkl')

['part_e.pkl']

### f.
Train a new Random Forest Classifier (set random_state to 0) with the new training set you created in part e. Save your classifier in pickle format as *Blender.pkl* .

Congratulations, you have just trained a blender, and together with the classifiers they form a Stacking ensemble!

In [15]:
rnd_clf_blender = RandomForestClassifier(random_state=0)
rnd_clf_blender.fit(new_training, y_validation)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

In [16]:
joblib.dump(rnd_clf_blender, 'Blender.pkl')

['Blender.pkl']

### g.
Evaluate the ensemble on the test set: for each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble’s predictions. Report the __test accuracy__ of the stacking ensemble in pickle format as *part_g.pkl* .

In [17]:
rnd_pred = rnd_clf.predict_proba(X_test)
rnd_pred[0]
# accuracy_score(y_test, rnd_pred)

array([ 0.8,  0. ,  0.1,  0.1,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ])

In [18]:
et_pred = et_clf.predict_proba(X_test)
et_pred[0]
# accuracy_score(y_test, et_pred)

array([ 0.7,  0. ,  0.1,  0. ,  0.1,  0. ,  0. ,  0.1,  0. ,  0. ])

In [19]:
# Join both classifiers' class probabilities for every test image,
# giving one 20-feature row per image for the blender.
preds = np.hstack((rnd_pred, et_pred))
preds[0]

array([ 0.8,  0. ,  0.1,  0.1,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0.7,
        0. ,  0.1,  0. ,  0.1,  0. ,  0. ,  0.1,  0. ,  0. ])

In [20]:
blender_preds = rnd_clf_blender.predict(preds)
test_accuracy = accuracy_score(y_test, blender_preds)
test_accuracy

0.95740000000000003

In [21]:
joblib.dump(test_accuracy, 'part_g.pkl')

['part_g.pkl']
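
The full stacking procedure from parts e through g can be condensed into one short script. The sketch below is an illustration under stated assumptions rather than a reproduction of the results above: it uses Scikit-Learn's small 8x8 digits dataset in place of MNIST so that it runs quickly, splits it with `train_test_split`, and introduces a hypothetical helper `stack_features`.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: the small digits dataset stands in for MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Base classifiers are trained on the training set only.
base_classifiers = [RandomForestClassifier(random_state=0),
                    ExtraTreesClassifier(random_state=0)]
for clf in base_classifiers:
    clf.fit(X_train, y_train)


def stack_features(classifiers, X):
    """Hypothetical helper: join each classifier's predict_proba output column-wise."""
    return np.hstack([clf.predict_proba(X) for clf in classifiers])


# The blender is trained on validation-set predictions, which the base
# classifiers never saw during their own training.
blender = RandomForestClassifier(random_state=0)
blender.fit(stack_features(base_classifiers, X_val), y_val)

# Test images pass through the base classifiers first, then the blender.
stacked_test = stack_features(base_classifiers, X_test)
stacking_accuracy = accuracy_score(y_test, blender.predict(stacked_test))
print(stacking_accuracy)
```

Training the blender on held-out validation predictions, rather than on training-set predictions, keeps it from learning to trust base classifiers that have simply memorized their training data.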