# Exercise 6: One Versus All MNIST
The second part of this exercise compares the built-in binarization scheme of the `SVC` class, one-vs-one, against the one-vs-all scheme discussed in Lecture 5. You should implement your own version of a one-vs-all SVM and compare your results against the built-in version. To keep the comparison simple, reuse the hyperparameters you found in the first part of this exercise. Which classifier performed best? When studying the confusion matrices, is there any apparent difference between the two methods in terms of misclassifications? Include your findings either as comments in your code, in your Jupyter notebook, or as a separate text document.

## Import Data

In [1]:
import os
import gzip
import numpy as np
from sklearn.preprocessing import StandardScaler

def load_mnist(path, kind="train"):
    labels_path = os.path.join(path, "%s-labels-idx1-ubyte.gz" % kind)
    images_path = os.path.join(path, "%s-images-idx3-ubyte.gz" % kind)

    with gzip.open(labels_path, "rb") as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8, offset=8)

    with gzip.open(images_path, "rb") as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8, offset=16).reshape(
            len(labels), 784
        )

    return images, labels


path = "./../resources/datasets/MNIST/"

X_train, y_train = load_mnist(path, kind="train")
X_test, y_test = load_mnist(path, kind="t10k")

# Standardize each pixel to zero mean and unit variance (statistics are fitted on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
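As a rough illustration of what the scaling step does, the sketch below standardizes a toy array with plain NumPy. The helper name `standardize` and the toy data are assumptions made for illustration; the notebook itself relies on `StandardScaler`, which likewise avoids dividing by zero for constant columns (such as always-black border pixels in MNIST).

```python
import numpy as np

def standardize(train, test):
    """Scale both arrays with the mean and std estimated on `train` only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Constant columns have std == 0; use 1.0 so they map to 0 instead of NaN.
    std_safe = np.where(std == 0, 1.0, std)
    return (train - mean) / std_safe, (test - mean) / std_safe

# Toy data: the second column is constant, like an always-black pixel.
toy_train = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
toy_test = np.array([[2.0, 0.0]])

toy_train_scaled, toy_test_scaled = standardize(toy_train, toy_test)
print(toy_train_scaled.mean(axis=0))  # column means after scaling
print(toy_test_scaled)                # test row scaled with training statistics
```

Reusing the training statistics for the test set, as `scaler.transform(X_test)` does above, keeps information about the test data out of the preprocessing step.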


# Part 1: Hyperparameter Tuning

## Grid Search

Prepare Data

In [2]:
from sklearn.model_selection import train_test_split

# Take a smaller chunk of the training dataset for grid search
X_train_small, _, y_train_small, _ = train_test_split(X_train_scaled, y_train, test_size=0.9, random_state=1945)

# Note: On Windows, verbose output is only functional with n_jobs=1
# Use n_jobs=1 for debugging (single core) and n_jobs=-1 for full performance (all cores)
n_jobs = -1 # Number of cores to use for parallelization


### Iteration 1
Start with a broad range of values spaced in powers of 10.

In [15]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'gamma': [0.0001, 0.001, 0.01, 0.1, 1]
}

grid_search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, n_jobs=n_jobs, verbose=2)
grid_search.fit(X_train_small, y_train_small)

best_params = grid_search.best_params_
best_score = grid_search.best_score_ * 100
print(f"Best parameters found: {best_params}")
print(f"Best cross-validation accuracy: {best_score:.2f}%")

Fitting 5 folds for each of 35 candidates, totalling 175 fits
[CV] END ..............................C=0.001, gamma=0.0001; total time=  44.1s
[CV] END ..............................C=0.001, gamma=0.0001; total time=  44.4s
[CV] END ..............................C=0.001, gamma=0.0001; total time=  44.4s
[CV] END ..............................C=0.001, gamma=0.0001; total time=  45.4s
[CV] END ...............................C=0.001, gamma=0.001; total time=  45.7s
[CV] END ..............................C=0.001, gamma=0.0001; total time=  45.9s
[CV] END ...............................C=0.001, gamma=0.001; total time=  46.3s
[CV] END ...............................C=0.001, gamma=0.001; total time=  46.6s
[CV] END ................................C=0.001, gamma=0.01; total time=  46.3s
[CV] END ...............................C=0.001, gamma=0.001; total time=  46.6s
[CV] END ...............................C=0.001, gamma=0.001; total time=  47.0s
[CV] END ................................C=0.00

The best parameters found are C = 10 and gamma = 0.001, yielding a cross-validation accuracy of 93.27%.
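Beyond the single best combination, it can help to look at the whole grid of scores, for instance to check whether the optimum sits on the edge of the searched range. The sketch below shows one way to tabulate `cv_results_`. It is a small stand-in, not the original run: the bundled `load_digits` data, the reduced grid, and all `demo_*` names are assumptions chosen so the example finishes quickly.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): the small 8x8 digits set bundled with scikit-learn.
X_demo, y_demo = load_digits(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo[:500])
y_demo = y_demo[:500]

demo_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}
demo_search = GridSearchCV(SVC(kernel="rbf"), demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

# Build a (C x gamma) table of mean cross-validation accuracy.
results = demo_search.cv_results_
score_table = np.full((len(demo_grid["C"]), len(demo_grid["gamma"])), np.nan)
for params, score in zip(results["params"], results["mean_test_score"]):
    row = demo_grid["C"].index(params["C"])
    col = demo_grid["gamma"].index(params["gamma"])
    score_table[row, col] = score

print("rows: C =", demo_grid["C"])
print("cols: gamma =", demo_grid["gamma"])
print(np.round(score_table * 100, 2))
```

If the highest score lies in a corner or on an edge of the table, extending the grid in that direction is usually worth trying before refining around the optimum.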

### Iteration 2

Now, let's explore values close to this combination to try to improve the accuracy further.

In [16]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [5, 8, 9, 10, 11, 12, 15],
    'gamma': [0.0005, 0.0008, 0.0009, 0.001, 0.0011, 0.0012, 0.0015]
}

grid_search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, n_jobs=n_jobs, verbose=2)
grid_search.fit(X_train_small, y_train_small)

best_params = grid_search.best_params_
best_score = grid_search.best_score_ * 100
print(f"Best parameters found: {best_params}")
print(f"Best cross-validation accuracy: {best_score:.2f}%")

Fitting 5 folds for each of 49 candidates, totalling 245 fits
[CV] END ..................................C=5, gamma=0.0005; total time=   9.2s
[CV] END ..................................C=5, gamma=0.0005; total time=   9.6s
[CV] END ..................................C=5, gamma=0.0005; total time=   9.8s
[CV] END ..................................C=5, gamma=0.0005; total time=   9.9s
[CV] END ..................................C=5, gamma=0.0005; total time=  10.4s
[CV] END ..................................C=5, gamma=0.0008; total time=  10.6s
[CV] END ..................................C=5, gamma=0.0008; total time=  11.2s
[CV] END ..................................C=5, gamma=0.0008; total time=  11.3s
[CV] END ..................................C=5, gamma=0.0008; total time=  10.4s
[CV] END ..................................C=5, gamma=0.0008; total time=  10.9s
[CV] END ..................................C=5, gamma=0.0009; total time=  11.3s
[CV] END ..................................C=5,

The best parameters found are C = 11 and gamma = 0.0009, yielding a cross-validation accuracy of 93.43%.
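The Iteration 2 grid corresponds to the Iteration 1 optimum (C = 10, gamma = 0.001) multiplied by a fixed set of factors. A minimal sketch of generating such a grid follows; the helper name `refine_around` and the `factors` tuple are illustrative assumptions, not part of the original notebook.

```python
def refine_around(best_value, factors=(0.5, 0.8, 0.9, 1.0, 1.1, 1.2, 1.5)):
    """Return candidate values spaced multiplicatively around `best_value`."""
    # round() trims floating-point noise from the multiplication.
    return [round(best_value * f, 10) for f in factors]

refined_grid = {
    "C": refine_around(10),
    "gamma": refine_around(0.001),
}
print(refined_grid)
```

Multiplicative spacing keeps the same relative resolution for both parameters even though they differ by four orders of magnitude.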

### Validation

Now, let's train on the full training set with this combination of hyperparameters and evaluate it on the test dataset.

In [3]:
from sklearn.svm import SVC

svm_classifier = SVC(kernel='rbf', C=11, gamma=0.0009)
svm_classifier.fit(X_train_scaled, y_train)

test_accuracy = svm_classifier.score(X_test_scaled, y_test)
print(f"Test accuracy with C=11 and gamma=0.0009: {(test_accuracy * 100):.2f}%")

Test accuracy with C=11 and gamma=0.0009: 97.33%


# Part 2: One Versus All
- Compare the built-in binarization scheme of the SVC class (one-vs-one) against one-vs-all
- Implement your own version of a one-vs-all SVM and compare its results against the built-in version
- Keep the comparison simple by reusing the hyperparameters found in the first part
- Which was the best classifier?
- Study the confusion matrices and tell whether there is any apparent difference between the two methods in terms of misclassifications
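To make the difference between the two schemes concrete, the short sketch below enumerates the binary problems each one creates for the ten MNIST digits. The variable names are illustrative only.

```python
from itertools import combinations

digit_classes = list(range(10))  # MNIST digits 0-9

# One-vs-all: one binary problem per class (class k vs. the rest).
ova_problems = [(k, "rest") for k in digit_classes]

# One-vs-one: one binary problem per unordered pair of classes.
ovo_problems = list(combinations(digit_classes, 2))

print(len(ova_problems))  # 10
print(len(ovo_problems))  # 45
```

One-vs-all therefore trains fewer classifiers, but each one sees the whole training set, whereas each one-vs-one classifier only sees the samples of its two classes.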

Train both OvA and OvO SVMs

In [8]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.svm import SVC
import numpy as np

class OneVsAllSVM(BaseEstimator, ClassifierMixin):
    """One-vs-all multiclass SVM: one binary RBF SVC per class."""

    def __init__(self, C=1.0, gamma='scale'):
        self.C = C
        self.gamma = gamma

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.classifiers_ = []
        for cls in self.classes_:
            # Relabel: current class -> 1, every other class -> 0
            binary_y = np.where(y == cls, 1, 0)
            clf = SVC(kernel='rbf', C=self.C, gamma=self.gamma)
            clf.fit(X, binary_y)
            self.classifiers_.append(clf)
        return self

    def predict(self, X):
        # One column of signed decision scores per class; pick the highest score
        scores = np.array([clf.decision_function(X) for clf in self.classifiers_]).T
        return self.classes_[np.argmax(scores, axis=1)]

ova_svm = OneVsAllSVM(C=11, gamma=0.0009)
ova_svm.fit(X_train_scaled, y_train)

# SVC always trains one-vs-one internally; decision_function_shape only
# controls the shape of decision_function's output
ovo_svm = SVC(kernel='rbf', C=11, gamma=0.0009, decision_function_shape='ovo')
ovo_svm.fit(X_train_scaled, y_train)
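One way to sanity-check a hand-written one-vs-all scheme is to compare it with scikit-learn's `OneVsRestClassifier`, which wraps a binary estimator in the same manner. The sketch below does this on stand-in data. The bundled `load_digits` set, the hyperparameters `C=10` and `gamma=0.01`, and all variable names are assumptions chosen to keep the example fast; they are not the values used for MNIST above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn's bundled 8x8 digits, not MNIST.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)
scaler_demo = StandardScaler().fit(X_tr)
X_tr, X_te = scaler_demo.transform(X_tr), scaler_demo.transform(X_te)

# Hand-rolled one-vs-all: one binary SVC per class, argmax over decision scores.
classes = np.unique(y_tr)
binary_svms = [
    SVC(kernel="rbf", C=10, gamma=0.01).fit(X_tr, (y_tr == cls).astype(int))
    for cls in classes
]
scores = np.column_stack([clf.decision_function(X_te) for clf in binary_svms])
manual_pred = classes[np.argmax(scores, axis=1)]

# Library wrapper applying the same scheme.
ovr = OneVsRestClassifier(SVC(kernel="rbf", C=10, gamma=0.01)).fit(X_tr, y_tr)
library_pred = ovr.predict(X_te)

agreement = np.mean(manual_pred == library_pred)
print(f"Agreement between manual OvA and OneVsRestClassifier: {agreement:.3f}")
```

High agreement suggests the relabeling and argmax logic are wired correctly; it says nothing about which scheme is more accurate.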


In [9]:
from sklearn.metrics import confusion_matrix

ova_predictions = ova_svm.predict(X_test_scaled)
ova_test_accuracy = np.mean(ova_predictions == y_test) * 100
print(f"One-vs-All SVM Test accuracy: {ova_test_accuracy:.2f}%")
ova_conf_matrix = confusion_matrix(y_test, ova_predictions)
print("One-vs-All SVM Confusion Matrix:")
print(ova_conf_matrix)


ovo_predictions = ovo_svm.predict(X_test_scaled)
ovo_test_accuracy = np.mean(ovo_predictions == y_test) * 100
print(f"One-vs-One SVM Test accuracy: {ovo_test_accuracy:.2f}%")
ovo_conf_matrix = confusion_matrix(y_test, ovo_predictions)
print("One-vs-One SVM Confusion Matrix:")
print(ovo_conf_matrix)

One-vs-All SVM Test accuracy: 97.66%
One-vs-All SVM Confusion Matrix:
[[ 970    0    1    0    1    1    3    2    2    0]
 [   0 1126    3    0    0    1    3    1    1    0]
 [   3    1  999    3    2    0    2   13    8    1]
 [   1    0    1  991    0    5    0    4    4    4]
 [   1    0    4    1  954    1    4    5    3    9]
 [   2    0    0    6    0  872    2    4    5    1]
 [   2    2    1    1    3    5  941    2    1    0]
 [   0    4    9    1    2    1    0  993    4   14]
 [   1    0    2    4    3    5    1    7  949    2]
 [   3    3    3    5    8    2    0   10    4  971]]
One-vs-One SVM Test accuracy: 97.33%
One-vs-One SVM Confusion Matrix:
[[ 968    0    3    2    1    2    1    1    2    0]
 [   0 1128    3    0    0    1    2    1    0    0]
 [   6    2 1001    0    2    0    1   12    7    1]
 [   0    0    3  987    1    6    0    5    7    1]
 [   0    0    5    0  956    1    3    6    2    9]
 [   3    0    0    9    2  864    3    5    5    1]
 [   4    2

### Discussion
The One-vs-All (OvA) and One-vs-One (OvO) SVMs are both highly effective on MNIST, reaching test accuracies of 97.66% and 97.33%, respectively. The 0.33 percentage point gap corresponds to 33 of the 10,000 test images, so OvA has a slight edge, but the difference is small. Both models show similar misclassification patterns, concentrated among visually similar digits: in the OvA matrix the largest off-diagonal entries are 7 predicted as 9 (14 images) and 2 predicted as 7 (13 images), and the OvO matrix shows a comparable 2-as-7 count (12 images). In general, the choice between OvA and OvO also depends on computational resources. OvA needs only 10 binary classifiers for MNIST instead of 45, although each one is trained on the full training set rather than on two classes at a time, and training time was not measured here. In this case, OvA is preferred for its higher accuracy and smaller number of models.
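The misclassification patterns can also be read off a confusion matrix programmatically. The sketch below uses the One-vs-All matrix printed above; the One-vs-One matrix is truncated in the output, so it is not reproduced here. The helper name `top_confusions` is an illustrative assumption.

```python
import numpy as np

# One-vs-All confusion matrix copied from the output above
# (rows: true digit, columns: predicted digit).
ova_cm = np.array([
    [970,    0,   1,   0,   1,   1,   3,   2,   2,   0],
    [  0, 1126,   3,   0,   0,   1,   3,   1,   1,   0],
    [  3,    1, 999,   3,   2,   0,   2,  13,   8,   1],
    [  1,    0,   1, 991,   0,   5,   0,   4,   4,   4],
    [  1,    0,   4,   1, 954,   1,   4,   5,   3,   9],
    [  2,    0,   0,   6,   0, 872,   2,   4,   5,   1],
    [  2,    2,   1,   1,   3,   5, 941,   2,   1,   0],
    [  0,    4,   9,   1,   2,   1,   0, 993,   4,  14],
    [  1,    0,   2,   4,   3,   5,   1,   7, 949,   2],
    [  3,    3,   3,   5,   8,   2,   0,  10,   4, 971],
])

def top_confusions(cm, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    off_diag = cm.copy()
    np.fill_diagonal(off_diag, 0)  # ignore correct predictions
    order = np.argsort(off_diag, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(order, off_diag.shape)
    return [(int(r), int(c), int(off_diag[r, c])) for r, c in zip(rows, cols)]

accuracy = np.trace(ova_cm) / ova_cm.sum()
recall_per_digit = np.diag(ova_cm) / ova_cm.sum(axis=1)

print(f"accuracy: {accuracy:.4f}")   # 0.9766
print(top_confusions(ova_cm))        # [(7, 9, 14), (2, 7, 13), (9, 7, 10)]
print(np.round(recall_per_digit, 3))
```

Applying the same helper to the full One-vs-One matrix would allow a cell-by-cell comparison of where the two schemes disagree.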