# CSE4502 Programming Assignment #3 - Eric Wang

This programming assignment has two parts.

## Part 1: Fashion MNIST

Load the fashion MNIST data set, and split it into a training set, a validation set, and a test set
(e.g., use 40,000 instances for training, 10,000 for validation, and 10,000 for testing). Then
train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an
MLP classifier (use code mlp_clf = MLPClassifier(random_state=42)).

Next, try to combine them into an ensemble that outperforms them all on the validation set,
using a soft or hard voting classifier. Once you have found one, try it on the test set. How much
better does it perform compared to the individual classifiers?
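The voting step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual run: it assumes scikit-learn's small 8x8 digits data set as a fast stand-in for the full image data, and the names `X_demo`, `named_estimators`, `voting_scores`, and `member_scores`, as well as the estimator settings, are chosen only for this example.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Assumption: the small digits data set stands in for the full image data.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo)
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)

named_estimators = [
    ("random_forest", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("extra_trees", ExtraTreesClassifier(n_estimators=50, random_state=42)),
    ("mlp", MLPClassifier(max_iter=500, random_state=42)),
]

# voting="hard" takes a majority vote over predicted labels;
# voting="soft" averages the predicted class probabilities.
voting_scores = {}
for mode in ("hard", "soft"):
    voting_clf = VotingClassifier(named_estimators, voting=mode)
    voting_clf.fit(X_tr, y_tr)
    voting_scores[mode] = voting_clf.score(X_va, y_va)

# Validation accuracy of each member, fitted on the same training set.
member_scores = {name: est.fit(X_tr, y_tr).score(X_va, y_va)
                 for name, est in named_estimators}

print(voting_scores)
print(member_scores)
```

Comparing `voting_scores` with `member_scores` on the validation set shows whether the ensemble beats its members; only the chosen ensemble should then be scored on `X_te`.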

Run the individual classifiers to make predictions on the validation set, and create a new training
set with the resulting predictions: each training instance is a vector containing the set of
predictions from all your classifiers for an image, and the target is the image’s class. Train a
classifier on this new training set. Congratulations, you have just trained a blender, and together
with the classifiers it forms a stacking ensemble! Now let’s evaluate the ensemble on the test set.
For each image in the test
set, make predictions with all your classifiers, then feed the predictions to the blender to get the
ensemble’s predictions. How does it compare to the voting classifier you trained earlier?
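A minimal sketch of the blender procedure described above, under the same kind of assumptions: the small digits data set is a stand-in for the full image data, the blender is a random forest (any classifier could be used), and `stack_predictions` is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Assumption: the small digits data set stands in for the full image data.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo)
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)

base_estimators = [
    RandomForestClassifier(n_estimators=50, random_state=42),
    ExtraTreesClassifier(n_estimators=50, random_state=42),
    MLPClassifier(max_iter=500, random_state=42),
]
for estimator in base_estimators:
    estimator.fit(X_tr, y_tr)


def stack_predictions(estimators, X):
    """Return one column of predicted class labels per base classifier."""
    return np.column_stack([est.predict(X) for est in estimators])


# The blender is trained on validation-set predictions, not on raw pixels.
X_va_stacked = stack_predictions(base_estimators, X_va)
blender = RandomForestClassifier(n_estimators=100, random_state=42)
blender.fit(X_va_stacked, y_va)

# Test time: base classifiers predict first, then the blender decides.
X_te_stacked = stack_predictions(base_estimators, X_te)
stacking_accuracy = blender.score(X_te_stacked, y_te)
print("Stacking accuracy:", stacking_accuracy)
```

Training the blender on the validation set rather than the training set keeps it from learning on predictions the base classifiers have already memorized.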

In [1]:
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

In [2]:
#import mnist_reader
#X_train, y_train = mnist_reader.load_mnist('', kind='train')
#X_test, y_test = mnist_reader.load_mnist('', kind='t10k')

In [3]:
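from sklearn.datasets import fetch_openml
# 'mnist_784' is the handwritten-digit MNIST data set, not Fashion MNIST
# (OpenML lists Fashion MNIST under the name 'Fashion-MNIST').
mnist = fetch_openml('mnist_784', version=1)
mnist.keys()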

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details', 'categories', 'url'])

In [4]:
X, y = mnist["data"], mnist["target"]

In [5]:
X.shape

(70000, 784)

In [6]:
y.shape

(70000,)

***Splitting the data set into a training set (50,000 instances), a validation set (10,000), and a test set (10,000)***

In [7]:
X_train, X_val, X_test = X[:50000], X[50000:60000], X[60000:]
y_train, y_val, y_test = y[:50000], y[50000:60000], y[60000:]

In [8]:
# set random seed 42
np.random.seed(42)

In [9]:
# Check that the set sizes are correct
X_train.shape

(50000, 784)

In [10]:
y_val.shape

(10000,)

In [11]:
X_test.shape

(10000, 784)

***Training 1st Classifier - Random Forest***

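The hyperparameters of this Random Forest (n_estimators and max_features) are chosen by a quick randomized search over 10 candidates with 5-fold cross-validation. Note that the search is scored with neg_mean_squared_error, a regression metric that treats the class labels as numbers, so the values printed below are root-mean-squared errors (lower is better), not accuracies.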

In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

In [13]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_best_clf = RandomForestClassifier(random_state=42)
rnd_search = RandomizedSearchCV(forest_best_clf, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(X_train, y_train)

RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    n_estimators='warn',
                                                    n_jobs=None

In [14]:
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

0.8104443225786704 {'max_features': 7, 'n_estimators': 180}
1.0694578065543305 {'max_features': 5, 'n_estimators': 15}
0.8974185199782764 {'max_features': 3, 'n_estimators': 72}
0.9905654950582521 {'max_features': 5, 'n_estimators': 21}
0.82385678367056 {'max_features': 7, 'n_estimators': 122}
0.8886394094344455 {'max_features': 3, 'n_estimators': 75}
0.8901685233707155 {'max_features': 3, 'n_estimators': 88}
0.8384509526501833 {'max_features': 5, 'n_estimators': 100}
0.8658752797025677 {'max_features': 3, 'n_estimators': 150}
2.079721135152499 {'max_features': 5, 'n_estimators': 2}


In [15]:
rnd_search.best_params_

{'max_features': 7, 'n_estimators': 180}
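The search above ranks candidates by a regression metric. As a hedged alternative, the sketch below runs the same kind of randomized search with `scoring="accuracy"`, which is the usual choice for a classifier. It assumes the small digits data set as a fast stand-in, and the names `demo_param_distribs` and `demo_search` and the reduced search ranges are illustrative only.

```python
from scipy.stats import randint
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Assumption: the small digits data set stands in for the full image data.
X_demo, y_demo = load_digits(return_X_y=True)

demo_param_distribs = {
    "n_estimators": randint(low=10, high=60),
    "max_features": randint(low=1, high=8),
}

demo_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=demo_param_distribs,
    n_iter=5, cv=3, scoring="accuracy", random_state=42)
demo_search.fit(X_demo, y_demo)

# mean_test_score is now a mean accuracy in [0, 1]; higher is better,
# so no sign flip or square root is needed when printing.
results = demo_search.cv_results_
for mean_score, params in zip(results["mean_test_score"], results["params"]):
    print(round(mean_score, 4), params)

print("Best:", demo_search.best_params_)
```

With accuracy as the score, `best_params_` is the candidate that classifies the most held-out images correctly, which matches how the classifiers are compared later in the notebook.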

In [16]:
forest_best_clf = rnd_search.best_estimator_

In [17]:
forest_best_clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features=7, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=180,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [18]:
cross_val_score(forest_best_clf, X_train, y_train, cv=3, scoring="accuracy")

array([0.96184534, 0.9598608 , 0.95979357])

In [19]:
y_pred_forest = forest_best_clf.predict(X_test)

from sklearn import metrics
best_forest_accuracy = metrics.accuracy_score(y_test, y_pred_forest)
print("Random Forest Accuracy:", best_forest_accuracy)
print("Random Forest Accuracy Percentage", best_forest_accuracy * 100, "%")

Random Forest Accuracy: 0.9669
Random Forest Accuracy Percentage 96.69 %


***Training 2nd Classifier - SVM Classifier*** 

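Setting gamma="auto" makes the RBF kernel coefficient equal to 1 / n_features (here 1/784); it does not select the type of SVM, which is determined by the kernel parameter (default 'rbf'). The accuracy of about 11% obtained below is close to chance level for ten classes, most likely because the pixel values (0 to 255) are not scaled before fitting.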

In [20]:
from sklearn.svm import SVC

svm_clf = SVC(gamma="auto", random_state=42)
svm_clf.fit(X_train, y_train) 

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=42, shrinking=True, tol=0.001,
    verbose=False)

In [21]:
cross_val_score(svm_clf, X_train, y_train, cv=3, scoring="accuracy")

array([0.1135641 , 0.11357773, 0.11353817])

In [22]:
y_pred_svm = svm_clf.predict(X_test)

from sklearn import metrics
best_svm_accuracy = metrics.accuracy_score(y_test, y_pred_svm)
print("SVM Accuracy:", best_svm_accuracy)
print("SVM Accuracy in Percentage", best_svm_accuracy * 100, "%")

SVM Accuracy: 0.1135
SVM Accuracy in Percentage 11.35 %
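The SVM above is fitted on raw pixel values. A minimal sketch of one common remedy, standardizing the features in a pipeline before the SVM, is shown below. It assumes the small digits data set as a stand-in, so the numbers it prints say nothing definite about the full data set; the names `unscaled_svm` and `scaled_svm` are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the small digits data set stands in for the full image data.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Same settings as the cell above, fitted on unscaled features.
unscaled_svm = SVC(gamma="auto", random_state=42).fit(X_tr, y_tr)

# Identical SVM, but each feature is standardized first. The scaler is
# fitted on the training data only, inside the pipeline.
scaled_svm = make_pipeline(
    StandardScaler(), SVC(gamma="auto", random_state=42)).fit(X_tr, y_tr)

unscaled_accuracy = unscaled_svm.score(X_te, y_te)
scaled_accuracy = scaled_svm.score(X_te, y_te)
print("Unscaled SVM accuracy:", unscaled_accuracy)
print("Scaled SVM accuracy:  ", scaled_accuracy)
```

Wrapping the scaler and the SVM in one pipeline also means `cross_val_score` refits the scaler on each training fold, which avoids leaking validation statistics.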


***Training 3rd Classifier - Extra-Trees Classifier*** 

Random Search is used again to find the best parameters for the Extra Trees Classifier

In [23]:
from sklearn.ensemble import ExtraTreesClassifier

In [24]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

extratree_best_clf = ExtraTreesClassifier(random_state=42)
extratree_rnd_search = RandomizedSearchCV(extratree_best_clf, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)

extratree_rnd_search.fit(X_train, y_train)

RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=ExtraTreesClassifier(bootstrap=False,
                                                  class_weight=None,
                                                  criterion='gini',
                                                  max_depth=None,
                                                  max_features='auto',
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators='warn',
                                                  n_jobs=None, oob_sco...
            

In [25]:
extratree_cvres = extratree_rnd_search.cv_results_
for mean_score, params in zip(extratree_cvres["mean_test_score"], extratree_cvres["params"]):
    print(np.sqrt(-mean_score), params)

0.802807573457052 {'max_features': 7, 'n_estimators': 180}
1.066217613810614 {'max_features': 5, 'n_estimators': 15}
0.8927485648266258 {'max_features': 3, 'n_estimators': 72}
0.9971960689854328 {'max_features': 5, 'n_estimators': 21}
0.8157327993896041 {'max_features': 7, 'n_estimators': 122}
0.8882342033495445 {'max_features': 3, 'n_estimators': 75}
0.8793975210335767 {'max_features': 3, 'n_estimators': 88}
0.8470537173048708 {'max_features': 5, 'n_estimators': 100}
0.8658290824406396 {'max_features': 3, 'n_estimators': 150}
2.1229743286248186 {'max_features': 5, 'n_estimators': 2}


In [26]:
extratree_rnd_search.best_params_

{'max_features': 7, 'n_estimators': 180}

In [27]:
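# Take the best estimator from the Extra-Trees search. The recorded run used
# rnd_search (the Random Forest search) here, so the outputs shown below are
# those of the Random Forest.
extratree_best_clf = extratree_rnd_search.best_estimator_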

In [28]:
extratree_best_clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features=7, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=180,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [29]:
cross_val_score(extratree_best_clf, X_train, y_train, cv=3, scoring="accuracy")

array([0.96184534, 0.9598608 , 0.95979357])

In [30]:
y_pred_extratree = extratree_best_clf.predict(X_test)

from sklearn import metrics
best_extratree_accuracy = metrics.accuracy_score(y_test, y_pred_extratree)
print("Extra Tree Accuracy:", best_extratree_accuracy)
print("Extra Tree Accuracy Percentage", best_extratree_accuracy * 100, "%")

Extra Tree Accuracy: 0.9669
Extra Tree Accuracy Percentage 96.69 %


***Training 4th Classifier - MLP***

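The MLP (multi-layer perceptron) classifier minimizes the log-loss function using either L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno) or stochastic gradient descent. With the default settings used here, the solver is 'adam', a stochastic gradient-based optimizer.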

In [31]:
from sklearn.neural_network import MLPClassifier

mlp_clf = MLPClassifier(random_state=42)

In [32]:
cross_val_score(mlp_clf, X_train, y_train, cv=3, scoring="accuracy")

array([0.95476633, 0.95218096, 0.95583293])

In [33]:
mlp_clf.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_iter=200, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=42, shuffle=True, solver='adam', tol=0.0001,
              validation_fraction=0.1, verbose=False, warm_start=False)

In [34]:
y_pred_mlp = mlp_clf.predict(X_test)

from sklearn import metrics
best_mlp_accuracy = metrics.accuracy_score(y_test, y_pred_mlp)
print("MLP Accuracy:", best_mlp_accuracy)
print("MLP Accuracy Percentage", best_mlp_accuracy * 100, "%")

MLP Accuracy: 0.9644
MLP Accuracy Percentage 96.44 %


## Part 2: Letter Recognition 

Load the letter-recognition.data.csv file, and do the letter classifications. You are free to choose
all the machine learning algorithms we have covered so far. Moreover, apply the ensemble
learning covered in Chapter 7 to improve your classification results.

### Data Set Information:

The objective is to identify each of a large number of black-and-white rectangular pixel displays
as one of the 26 capital letters in the English alphabet. The character images were based on 20
different fonts and each letter within these 20 fonts was randomly distorted to produce a file of
20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes
(statistical moments and edge counts) which were then scaled to fit into a range of integer
values from 0 through 15. We train on the first 16,000 items and then use the resulting model to
predict the letter category for the remaining 4,000.
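The split described above (first 16,000 rows for training, the rest for testing) could be sketched as follows. This is only an outline under stated assumptions: the CSV is assumed to have no header row, the column names are taken from the attribute list in this assignment, `load_letters` and `split_letters` are hypothetical helpers, and the small random table is a synthetic stand-in so that the example runs without the file.

```python
import numpy as np
import pandas as pd

# Column names as given in the assignment's attribute list.
COLUMNS = ["lettr", "x-box", "y-box", "width", "high", "onpix", "x-bar",
           "y-bar", "x2bar", "y2bar", "xybar", "x2ybr", "xy2br", "x-ege",
           "xegvy", "y-ege", "yegvx"]


def load_letters(path):
    """Read the CSV. Assumption: the file has no header row."""
    return pd.read_csv(path, header=None, names=COLUMNS)


def split_letters(letters, n_train=16000):
    """Use the first n_train rows for training and the rest for testing."""
    X = letters.drop(columns="lettr").to_numpy()
    y = letters["lettr"].to_numpy()
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]


# Synthetic stand-in (random values, not the real data) to exercise the split.
rng = np.random.default_rng(42)
fake_letters = pd.DataFrame(
    rng.integers(0, 16, size=(200, 16)), columns=COLUMNS[1:])
fake_letters.insert(
    0, "lettr", rng.choice(list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"), size=200))

X_tr, X_te, y_tr, y_te = split_letters(fake_letters, n_train=160)
print(X_tr.shape, X_te.shape)
```

With the real file, the call would be `split_letters(load_letters("letter-recognition.data.csv"))`, after which any of the classifiers or ensembles from Part 1 can be fitted on the training portion.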

### Attribute Information:

1. lettr: capital letter (26 values from A to Z)
2. x-box: horizontal position of box (integer)
3. y-box: vertical position of box (integer)
4. width: width of box (integer)
5. high: height of box (integer)
6. onpix: total # on pixels (integer)
7. x-bar: mean x of on pixels in box (integer)
8. y-bar: mean y of on pixels in box (integer)
9. x2bar: mean x variance (integer)
10. y2bar: mean y variance (integer)
11. xybar: mean x y correlation (integer)
12. x2ybr: mean of x \* x \* y (integer)
13. xy2br: mean of x \* y \* y (integer)
14. x-ege: mean edge count left to right (integer)
15. xegvy: correlation of x-ege with y (integer)
16. y-ege: mean edge count bottom to top (integer)
17. yegvx: correlation of y-ege with x (integer)

Write your code using file name ensemble.ipynb, and submit this file to HuskyCT. Note you
need to use Markdown to explain your approaches. Include discussions on the importance of
the above 17 attributes. Also include charts and graphs whenever applicable. This can help the
TA better grade your work.
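One possible starting point for the attribute-importance discussion is the impurity-based `feature_importances_` of a random forest. The sketch below is an assumption-laden illustration: `rank_attributes` is a hypothetical helper, and the data is synthetic (the label is made to depend only on the x-ege column) so that the expected ranking is known in advance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# The 16 numeric attributes (everything except the lettr target).
FEATURES = ["x-box", "y-box", "width", "high", "onpix", "x-bar", "y-bar",
            "x2bar", "y2bar", "xybar", "x2ybr", "xy2br", "x-ege", "xegvy",
            "y-ege", "yegvx"]


def rank_attributes(X, y, feature_names, n_estimators=100, random_state=42):
    """Fit a random forest and return (name, importance) pairs, highest first."""
    forest = RandomForestClassifier(
        n_estimators=n_estimators, random_state=random_state)
    forest.fit(X, y)
    importances = forest.feature_importances_
    order = np.argsort(importances)[::-1]
    return [(feature_names[i], float(importances[i])) for i in order]


# Synthetic stand-in (not the real data): the label depends only on x-ege.
rng = np.random.default_rng(0)
X_fake = rng.integers(0, 16, size=(500, 16))
y_fake = np.where(X_fake[:, FEATURES.index("x-ege")] > 7, "A", "B")

ranking = rank_attributes(X_fake, y_fake, FEATURES)
for name, importance in ranking[:5]:
    print(f"{name:6s} {importance:.3f}")
```

The returned pairs can be passed to `plt.barh` for a chart. Impurity-based importances are only one view of attribute relevance, so they are best treated as a prompt for discussion rather than a final answer.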