### 2. One versus all MNIST

##### 1.1  Finding optimal hyperparameters for SVC with rbf kernel

First, the MNIST set is downloaded, the targets are converted to float, and the data is split into a training set and a test set.

The training and test sets are standardized (zero mean and unit variance per feature) in order to speed up training, since SVCs are sensitive to unscaled data.
After that, a grid search over the hyperparameters C and gamma is performed on a subset of 1000 training samples
in order to find the best pair.
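
As a minimal, dependency-free sketch of the scaling step: the statistics are computed on the training rows only and then reused for the test rows, which is what `StandardScaler().fit(X_train)` followed by `transform` does. The helper names `fit_scaler` and `transform` and the toy data are assumptions made for illustration.

```python
from statistics import fmean, pstdev

def fit_scaler(rows):
    """Return per-column means and standard deviations of the training rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant column has a standard deviation of 0; use 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, means, stds):
    """Standardize rows with statistics that were fitted elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

train = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]   # toy stand-in for X_train
test = [[2.0, 10.0], [6.0, 12.0]]                 # toy stand-in for X_test

means, stds = fit_scaler(train)                   # fitted on the training set only
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)        # test set reuses the training statistics
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step.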

In [6]:
from sklearn.datasets import fetch_openml
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing as pp
from sklearn.metrics import classification_report, confusion_matrix
import joblib  # sklearn.externals.joblib was removed from scikit-learn; use the standalone package

# Fetch MNIST
mnist = fetch_openml('mnist_784', version=1, cache=True)
X, y = mnist['data'], mnist['target']

y = y.astype('float64')  # the labels are delivered as strings, so convert them to numbers

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.2, random_state=10)

# Normalize data to speed up training
scaler = pp.StandardScaler().fit(X_train)
Xn_train = scaler.transform(X_train)
Xn_test = scaler.transform(X_test)

# Instantiate SVC
rbf = SVC(kernel='rbf', gamma=.001, C=2)

##### 1.2 Fit the model on the training set and calculate the score on the test set

After training, the model reaches an accuracy of 88.2% (0.8820714285714286) on the test set.
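
The grid search in the following cell scores every (C, gamma) pair and keeps the best one. A dependency-free sketch of that exhaustive loop is given here. `toy_score` is a hypothetical stand-in for the cross-validated accuracy that `GridSearchCV` computes, and its peak is chosen purely for illustration.

```python
from itertools import product

def toy_score(C, gamma):
    """Hypothetical score with a single peak at C=2, gamma=0.001."""
    return 1.0 - 0.01 * abs(C - 2) - 10.0 * abs(gamma - 0.001)

C_values = list(range(1, 11))                                # 1, 2, ..., 10
gamma_values = [round(0.001 * k, 3) for k in range(1, 10)]   # 0.001, ..., 0.009

# Exhaustive search: evaluate every pair and keep the one with the highest score.
candidates = list(product(C_values, gamma_values))
best_C, best_gamma = max(candidates, key=lambda pair: toy_score(*pair))
```

The cost grows with the product of the grid sizes, which is one reason to run the search on a small subset first.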

In [7]:
# Find good values for C and gamma.
C = np.arange(1, 11, 1)
gamma = np.arange(0.001, 0.01, 0.001)
param_grid = {'C': C, 'gamma': gamma}
grid_search = GridSearchCV(rbf, param_grid, scoring='accuracy', n_jobs=10)
grid_search.fit(Xn_train[:1000, :], y_train[:1000])

# Print training score and best params
print("Grid search score: ", grid_search.score(Xn_train, y_train))
print(grid_search.best_params_)

rbf = grid_search.best_estimator_
print("Test Accuracy: ", rbf.score(Xn_test, y_test))

pred_test = rbf.predict(Xn_test)

print(confusion_matrix(y_test, pred_test))
print(classification_report(y_test, pred_test))



Grid search score:  0.8828571428571429
{'C': 2, 'gamma': 0.001}
Test Accuracy:  0.8820714285714286
[[1347    1   23    6    3   14    8    3   16    0]
 [   0 1469   14    3    0    2    3    2   22    0]
 [  16   20 1279   12   36    4   13   22   35    1]
 [   4   13  109 1213    2   34    6   16   35    4]
 [   3   10   39    1 1263    0    3    7    7   58]
 [  14   12   37   58    8 1034   31   10   28   16]
 [   7   18   66    0   18   17 1159    0   18    0]
 [   5   24   76    5   15    0    0 1314    4   30]
 [  24   40   58   30   13   65    6    3 1114   19]
 [  10    9   39   19   67    3    0   87   12 1157]]
              precision    recall  f1-score   support

         0.0       0.94      0.95      0.94      1421
         1.0       0.91      0.97      0.94      1515
         2.0       0.74      0.89      0.80      1438
         3.0       0.90      0.84      0.87      1436
         4.0       0.89      0.91      0.90      1391
         5.0       0.88      0.83      0.85  

##### 1.2 One versus All

This next part compares One-Vs-One and One-Vs-All SVCs.

**General Approach**: One-Vs-All works by training one binary classifier for each class.
After training, a sample is passed to every classifier and the class whose classifier reports the highest probability is chosen.
SVCs are not probabilistic by nature, so they rely on Platt scaling to produce probability estimates.
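
For reference, Platt scaling maps the signed decision value f of the SVC to a probability with a fitted sigmoid, P(y=1 | f) = 1 / (1 + exp(A*f + B)). The sketch below uses hypothetical values for A and B. In practice they are fitted to decision values, which scikit-learn does internally with cross-validation when `probability=True`.

```python
import math

def platt_probability(decision_value, A, B):
    """Platt's sigmoid: map a signed SVM decision value to P(y = 1)."""
    return 1.0 / (1.0 + math.exp(A * decision_value + B))

# A and B are hypothetical here. A is negative so that larger
# decision values give higher probabilities.
A, B = -2.0, 0.0

on_boundary = platt_probability(0.0, A, B)    # 0.5 on the decision boundary when B = 0
far_positive = platt_probability(3.0, A, B)   # close to 1
far_negative = platt_probability(-3.0, A, B)  # close to 0
```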

The MNIST data set contains target values ranging from 0 to 9. By reducing the target values to two labels for each classifier
("this digit" versus "every other digit"), I can train each classifier to recognize a single digit.
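
A minimal sketch of this relabeling is shown here, under one assumption that differs from the cell below: every classifier uses the same coding, 1 for its own digit and 0 for everything else. With that coding no special case is needed for the digit 0, and the positive class always ends up in the second column of `predict_proba`. The helper name `one_vs_all_targets` and the toy labels are made up for illustration.

```python
def one_vs_all_targets(labels, digit):
    """Return 1 where the label equals `digit` and 0 everywhere else."""
    return [1 if label == digit else 0 for label in labels]

labels = [0.0, 3.0, 9.0, 3.0, 1.0]   # toy stand-in for y_train

# One binary target vector per digit, keyed by the digit it recognizes
targets = {digit: one_vs_all_targets(labels, digit) for digit in range(10)}
```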

In [8]:
# Make hard copies for later binarization
y_0 = np.copy(y_train)
y_1 = np.copy(y_train)
y_2 = np.copy(y_train)
y_3 = np.copy(y_train)
y_4 = np.copy(y_train)
y_5 = np.copy(y_train)
y_6 = np.copy(y_train)
y_7 = np.copy(y_train)
y_8 = np.copy(y_train)
y_9 = np.copy(y_train)

# Make classifications binary
y_0[y_train != 0] = 1  # special case: 0 stays the digit's own label and all other digits become 1, so its probability is column 0 of predict_proba
y_1[y_train != 1] = 0  # all digits that are not 1 are set to 0
y_2[y_train != 2] = 0  # same for the remaining digits
y_3[y_train != 3] = 0
y_4[y_train != 4] = 0
y_5[y_train != 5] = 0
y_6[y_train != 6] = 0
y_7[y_train != 7] = 0
y_8[y_train != 8] = 0
y_9[y_train != 9] = 0

All models have been trained on Xn_train in advance using the previous results of **gamma=0.001** and **C=2**.
Training takes quite a while due to the large training set.

Examples:

zero_ = SVC(kernel='rbf', gamma=0.001, C=2, probability=True).fit(Xn_train, y_0)
one_ = SVC(kernel='rbf', gamma=0.001, C=2, probability=True).fit(Xn_train, y_1)

In [9]:
# Load SVCs
zero_ = joblib.load('models/0.model')
one_ = joblib.load('models/1.model')
two_ = joblib.load('models/2.model')
three_ = joblib.load('models/3.model')
four_ = joblib.load('models/4.model')
five_ = joblib.load('models/5.model')
six_ = joblib.load('models/6.model')
seven_ = joblib.load('models/7.model')
eight_ = joblib.load('models/8.model')
nine_ = joblib.load('models/9.model')

##### 1.3 Predictions

This next section predicts every row of the test set with all ten classifiers and collects the predictions in a list.
The number of errors is then counted, which gives an accuracy of **96.89%**.
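
The decision rule itself is an argmax over the ten per-digit probabilities. Below is a small sketch with made-up probabilities for a single sample. Because the ten classifiers are trained independently, the values do not have to sum to 1.

```python
def predict_digit(probabilities):
    """Return the index (digit) with the largest probability; ties go to the lowest digit."""
    return max(range(len(probabilities)), key=lambda digit: probabilities[digit])

# Hypothetical outputs of the ten binary classifiers for one sample, in digit order
sample_probabilities = [0.02, 0.01, 0.10, 0.05, 0.03, 0.01, 0.00, 0.81, 0.04, 0.30]

predicted = predict_digit(sample_probabilities)   # 7
```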

#### Conclusions

Accuracy-wise, the One-Vs-All approach predicts considerably better than One-Vs-One (96.89% versus 88.2% on the test set),
but it has a large computational disadvantage: ten separate models have to be trained, compared to a single SVC for One-Vs-One.

One-Vs-All also appears to be better at predicting individual digits when looking at the confusion matrices, especially the digits 2, 5 and 8.
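
These claims can be cross-checked directly from the two confusion matrices printed in this section. The sketch below copies both matrices and recomputes the overall accuracy (diagonal sum over total) and the per-digit recall for 2, 5 and 8. The helper names are made up for illustration.

```python
# Confusion matrices copied from the outputs in this section
# (rows: true digit, columns: predicted digit)
one_vs_one = [
    [1347, 1, 23, 6, 3, 14, 8, 3, 16, 0],
    [0, 1469, 14, 3, 0, 2, 3, 2, 22, 0],
    [16, 20, 1279, 12, 36, 4, 13, 22, 35, 1],
    [4, 13, 109, 1213, 2, 34, 6, 16, 35, 4],
    [3, 10, 39, 1, 1263, 0, 3, 7, 7, 58],
    [14, 12, 37, 58, 8, 1034, 31, 10, 28, 16],
    [7, 18, 66, 0, 18, 17, 1159, 0, 18, 0],
    [5, 24, 76, 5, 15, 0, 0, 1314, 4, 30],
    [24, 40, 58, 30, 13, 65, 6, 3, 1114, 19],
    [10, 9, 39, 19, 67, 3, 0, 87, 12, 1157],
]
one_vs_all = [
    [1402, 0, 4, 0, 0, 2, 6, 3, 4, 0],
    [0, 1495, 8, 3, 1, 1, 0, 3, 2, 2],
    [2, 4, 1379, 9, 5, 2, 4, 15, 15, 3],
    [2, 0, 11, 1383, 1, 9, 0, 12, 13, 5],
    [1, 1, 7, 0, 1340, 1, 7, 4, 5, 25],
    [3, 3, 0, 10, 0, 1210, 9, 5, 8, 0],
    [0, 1, 2, 0, 3, 10, 1280, 2, 5, 0],
    [2, 8, 13, 3, 8, 1, 1, 1418, 1, 18],
    [4, 9, 3, 4, 6, 7, 6, 3, 1325, 5],
    [3, 3, 3, 11, 14, 5, 0, 23, 8, 1333],
]

def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

def recall(matrix, digit):
    """Share of the true `digit` samples that were predicted as `digit`."""
    return matrix[digit][digit] / sum(matrix[digit])

acc_ovo = accuracy(one_vs_one)   # 0.8821 when rounded to four decimals
acc_ova = accuracy(one_vs_all)   # 0.9689 when rounded to four decimals

for digit in (2, 5, 8):
    print(digit, round(recall(one_vs_one, digit), 3), round(recall(one_vs_all, digit), 3))
```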



In [10]:
# Fitted classifiers in digit order
models = [zero_, one_, two_, three_, four_, five_, six_, seven_, eight_, nine_]

# Predict all rows at once: one column of probabilities per digit.
# Special case: zero_ was trained with the digit as label 0, so its probability is in column 0.
probs = np.column_stack([
    model.predict_proba(Xn_test)[:, 0 if digit == 0 else 1]
    for digit, model in enumerate(models)
])

# The predicted digit is the one whose classifier reports the highest probability
y_pred = probs.argmax(axis=1).astype('float64')


errors = np.sum(y_pred != y_test)
print("Accuracy: ", 1- (errors/len(Xn_test)))
print("Errors:" , errors)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy:  0.9689285714285715
Errors: 435
[[1402    0    4    0    0    2    6    3    4    0]
 [   0 1495    8    3    1    1    0    3    2    2]
 [   2    4 1379    9    5    2    4   15   15    3]
 [   2    0   11 1383    1    9    0   12   13    5]
 [   1    1    7    0 1340    1    7    4    5   25]
 [   3    3    0   10    0 1210    9    5    8    0]
 [   0    1    2    0    3   10 1280    2    5    0]
 [   2    8   13    3    8    1    1 1418    1   18]
 [   4    9    3    4    6    7    6    3 1325    5]
 [   3    3    3   11   14    5    0   23    8 1333]]
              precision    recall  f1-score   support

         0.0       0.99      0.99      0.99      1421
         1.0       0.98      0.99      0.98      1515
         2.0       0.96      0.96      0.96      1438
         3.0       0.97      0.96      0.97      1436
         4.0       0.97      0.96      0.97      1391
         5.0       0.97      0.97      0.97      1248
         6.0       0.97      0.98      0.98     