<a href="https://colab.research.google.com/github/dragoa/MachineLearning/blob/main/ML/04-Classification/02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-Class classification of MNIST digits

We now use the whole MNIST dataset and classify all ten digits. <br>
This is a **multi-class classification problem**.

### We will perform the following steps:
1. Load the MNIST dataset from sklearn.
2. Preprocess the data and randomly sample a subset of data points for training and testing.
3. Train a linear classifier (logistic regression, then a linear SVM).
4. Evaluate the model's performance using the classification error and confusion matrices.
5. Hyperparameter tuning for linear SVM: Perform a grid search to find the best regularization parameter 'C' for the linear SVM and retrain the model with the optimal 'C' value.
6. Evaluate the tuned SVM performance using classification error and confusion matrices.
7. Hyperparameter tuning for SVM with RBF kernel: Perform a grid search to find the best 'C', 'kernel', and 'gamma' values for an SVM with an RBF kernel and retrain the model with the optimal parameters.
8. Evaluate the tuned SVM with RBF kernel performance using classification error and confusion matrices.

In [16]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_openml

The MNIST dataset is a collection of 70,000 grayscale images of handwritten digits. Each image has a resolution of 28x28 pixels, giving 784 features (pixels) per sample.
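As a small illustration of the 784-feature layout, the sketch below flattens a 28x28 array and restores it with NumPy. The array is synthetic (an assumption made for this example, not real MNIST data); a row of the feature matrix can be reshaped the same way before plotting it.

```python
import numpy as np

# Synthetic stand-in for one grayscale image (values 0..255), not real MNIST data
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28))

# Flattening row by row gives the 784-dimensional feature vector
features = image.reshape(-1)

# Reshaping back recovers the original 28x28 grid
restored = features.reshape(28, 28)

print(features.shape)
print(np.array_equal(image, restored))
```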

In [17]:
# MNIST dataset: images of 28*28 px = 784 pixels each
mnist = fetch_openml('mnist_784')



In [18]:
print(type(mnist))

# Our Data as a pandas DF
X = mnist.data
Y = mnist.target

print(X.shape, type(X))
print(Y.shape, type(Y))

<class 'sklearn.utils._bunch.Bunch'>
(70000, 784) <class 'pandas.core.frame.DataFrame'>
(70000,) <class 'pandas.core.series.Series'>


In [19]:
# Convert X and Y to NumPy arrays so we can use them
X = X.to_numpy()
Y = Y.to_numpy()

print(X.shape, type(X))
print(Y.shape, type(Y))

(70000, 784) <class 'numpy.ndarray'>
(70000,) <class 'numpy.ndarray'>


In [20]:
# The targets are stored as strings, so we convert them to integers
print(type(Y[0]))

# astype parses the digit strings directly, without calling eval on the data
Y = Y.astype(np.int64)

print(type(Y[0]))

<class 'str'>
<class 'numpy.int64'>


In [21]:
# Normalize the data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
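
To make the normalization step concrete, here is a minimal sketch of what min-max scaling computes for each feature: `(x - min) / (max - min)`. The small matrix `A` is a hypothetical example, not MNIST data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Tiny hypothetical feature matrix: 3 samples, 2 features
A = np.array([[0.0, 10.0],
              [128.0, 20.0],
              [255.0, 40.0]])

# Manual min-max scaling, column by column
col_min = A.min(axis=0)
col_max = A.max(axis=0)
A_manual = (A - col_min) / (col_max - col_min)

# Same computation with scikit-learn
A_scaled = MinMaxScaler().fit_transform(A)

print(A_scaled)
print(np.allclose(A_manual, A_scaled))
```

After scaling, every column lies in the range [0, 1]. The manual formula would divide by zero for a constant column, which is one reason to prefer the library scaler on real image data, where some pixels never change.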

Since the dataset is large and the training cost of kernel methods such as SVMs grows superlinearly with the number of samples n, we subsample the data. <br>

We use more data for testing than for training because prediction is cheaper than learning.

In [22]:
from random import sample

# Data used for training (roughly 30 samples per class)
n = 300
# Data used for testing (roughly 1000 samples per class)
nt = 10000

# Randomly sample n+nt distinct indices from the dataset
mask = np.asarray(sample(range(X.shape[0]), n + nt))

# Keep only the sampled data points
X = X[mask, :]
Y = Y[mask]

# Create the testing set (Xt and Yt) using samples from n to n+nt
Xt = X[n:n + nt]
Yt = Y[n:n + nt]

# Create the training set (X and Y) using samples from 0 to n
X = X[:n]
Y = Y[:n]

print(X.shape)
print(Y.shape)
print(Xt.shape)
print(Yt.shape)

(300, 784)
(300,)
(10000, 784)
(10000,)
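
The random sampling above gives only approximately balanced classes. As a possible alternative (not what this notebook uses), scikit-learn's `train_test_split` can draw a stratified subsample with fixed training and test sizes. The sketch below runs on synthetic stand-in data; the names `X_all` and `y_all` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 2000 samples, 10 balanced classes (not real MNIST)
rng = np.random.default_rng(0)
X_all = rng.random((2000, 16))
y_all = np.repeat(np.arange(10), 200)

# Draw 300 training and 1000 test points while keeping the class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all,
    train_size=300, test_size=1000,
    stratify=y_all, random_state=0)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr))  # samples per class in the training set
```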


### Logistic Regression for multiclass problems

Ridge regression does not handle multiclass classification problems directly. <br>
Instead, we can use logistic regression with the one-vs-all (one-vs-rest) strategy, which trains one binary classifier per class.
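
To show what the one-vs-all strategy does, here is a minimal hand-written sketch: one binary logistic regression per class, with the final prediction taken from the classifier that gives the highest score. It uses scikit-learn's small built-in `load_digits` dataset (8x8 images) as a stand-in so that it runs without downloading MNIST; that choice and all variable names are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small built-in digits dataset (8x8 images, pixel values 0..16), used as a stand-in
X_d, y_d = load_digits(return_X_y=True)
X_d = X_d / 16.0
X_tr, X_te, y_tr, y_te = train_test_split(
    X_d, y_d, train_size=300, stratify=y_d, random_state=0)

classes = np.unique(y_tr)
scores = np.zeros((X_te.shape[0], classes.size))

# One binary problem per class: "is this digit c or not?"
for k, c in enumerate(classes):
    clf = LogisticRegression(C=1.0, max_iter=1000)
    clf.fit(X_tr, (y_tr == c).astype(int))
    scores[:, k] = clf.decision_function(X_te)

# Predict the class whose binary classifier is most confident
y_pred = classes[np.argmax(scores, axis=1)]
ovr_error = np.mean(y_pred != y_te)
print(ovr_error)
```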

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Create a Logistic Regression model for multiclass classification
model = LogisticRegression(multi_class='ovr', penalty='l2', C=1)

# Train the Logistic Regression model on the training data (X and Y)
model.fit(X, Y)

# Make predictions on the test data (Xt)
Yp = model.predict(Xt)

print(np.mean(Yp != Yt))
print(confusion_matrix(Yt, Yp))

0.1934
[[ 910    0    4    7    1   19   31   11    6    2]
 [   0 1113    9    7    3    2    1    2   42    0]
 [  15   49  716   34   22    5   58   37   29   19]
 [  13   18   30  844    4   25   13   30   16   19]
 [   3    7    5    7  841   12   34    8   18   60]
 [  43   14   11  104   14  572   33   32   42   27]
 [  22   10   14    4   15   15  889    4    4   10]
 [   5   29    7   18   25    2    0  921    3   16]
 [  17   36   32  110    9   57   18    8  633   59]
 [   4    8    5   40  102   16    5  108   40  627]]
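
The confusion matrix contains more information than the single error value. As a short sketch, the code below copies the matrix printed above (rows are true digits, columns are predicted digits) and derives the overall error, the per-class recall, and the most frequent confusion from it.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([
    [910,    0,   4,   7,   1,  19,  31,  11,   6,   2],
    [  0, 1113,   9,   7,   3,   2,   1,   2,  42,   0],
    [ 15,   49, 716,  34,  22,   5,  58,  37,  29,  19],
    [ 13,   18,  30, 844,   4,  25,  13,  30,  16,  19],
    [  3,    7,   5,   7, 841,  12,  34,   8,  18,  60],
    [ 43,   14,  11, 104,  14, 572,  33,  32,  42,  27],
    [ 22,   10,  14,   4,  15,  15, 889,   4,   4,  10],
    [  5,   29,   7,  18,  25,   2,   0, 921,   3,  16],
    [ 17,   36,  32, 110,   9,  57,  18,   8, 633,  59],
    [  4,    8,   5,  40, 102,  16,   5, 108,  40, 627],
])

# Overall error: everything off the diagonal, divided by the total count
error = 1.0 - np.trace(cm) / cm.sum()

# Per-class recall: correct predictions divided by the true count of each digit
recall = np.diag(cm) / cm.sum(axis=1)
worst = int(np.argmin(recall))

# Most frequent confusion: largest off-diagonal entry
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_digit, predicted_digit = np.unravel_index(off_diag.argmax(), off_diag.shape)

print(round(error, 4))              # 0.1934
print(worst)                        # 5
print(true_digit, predicted_digit)  # 8 3
```

The digit 5 has the lowest recall, and the most common single mistake is an 8 classified as a 3.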


In [39]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix

grid = {'penalty': ['l2'],
        'C'      : np.logspace(-4,3,30)}

MS = GridSearchCV(estimator   = LogisticRegression(),
                  param_grid  = grid,
                  scoring     = 'accuracy',
                  cv          = 10)

H = MS.fit(X,Y)

M = LogisticRegression(penalty=H.best_params_['penalty'],
                       C=H.best_params_['C'],
                       max_iter=10000)

# Refit the model using the best params
M.fit(X, Y)

# Make predictions with the tuned model M (not the earlier `model`)
Yp = M.predict(Xt)

print(np.mean(Yp != Yt))
print(confusion_matrix(Yt, Yp))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

The recorded run of this cell predicted with `model` (the untuned classifier from the previous cell) instead of `M`, so it printed the same error, 0.1934, and the same confusion matrix as above. Rerun the cell to obtain the results of the tuned model.


### Support Vector Machine (SVC) for Binary Classification
An SVM is a machine learning algorithm originally designed for binary classification; scikit-learn's `SVC` extends it to multi-class problems by combining binary classifiers in a one-vs-one scheme.<br>
It works by finding the decision boundary (a hyperplane) that separates two classes while maximizing the margin between them. The `C` parameter controls the trade-off between a wider margin (lower C) and fewer misclassified training examples (higher C). <br>The `kernel` parameter specifies the kernel function used to transform the data; `'linear'` selects a linear kernel. With a non-linear kernel, an SVC can also handle data that cannot be separated by a hyperplane in the original feature space, thanks to the kernel trick.
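
The role of `C` can be checked empirically. The sketch below fits a linear `SVC` with a very small and a very large `C` and compares training accuracy and the number of support vectors. It uses the first 300 samples of scikit-learn's built-in `load_digits` dataset as a small stand-in for MNIST; the dataset choice and the two `C` values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC

# Small built-in digits dataset (8x8 images), scaled to [0, 1]
X_d, y_d = load_digits(return_X_y=True)
X_d = X_d[:300] / 16.0
y_d = y_d[:300]

results = {}
for C in (0.001, 100.0):
    clf = SVC(C=C, kernel='linear').fit(X_d, y_d)
    # Training accuracy and total number of support vectors for this C
    results[C] = (clf.score(X_d, y_d), int(clf.n_support_.sum()))

for C, (train_acc, n_sv) in results.items():
    print(f"C={C}: training accuracy={train_acc:.3f}, support vectors={n_sv}")
```

A low `C` tolerates more training mistakes in exchange for a wider margin, while a high `C` fits the training set more tightly.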

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

M = SVC(C=1, kernel='linear') # We use linear kernel

# Train the model
M.fit(X, Y)

# Make predictions
Yp = M.predict(Xt)

print(np.mean(Yp != Yt))
print(confusion_matrix(Yt, Yp))

In [40]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix

grid = {'C'      : np.logspace(-4,3,10),
        'kernel' : ['linear']}

MS = GridSearchCV(estimator   = SVC(),
                  param_grid  = grid,
                  scoring     = 'accuracy',
                  cv          = 10,
                  verbose     = 2)

H = MS.fit(X,Y)

M = SVC(C      = H.best_params_['C'],
        kernel = H.best_params_['kernel'])

# Refit the model using the best params
M.fit(X, Y)

# Make predictions with the tuned model M (not the earlier `model`)
Yp = M.predict(Xt)

print(np.mean(Yp != Yt))
print(confusion_matrix(Yt, Yp))

Fitting 10 folds for each of 10 candidates, totalling 100 fits
[CV] END ............................C=0.0001, kernel=linear; total time=   0.0s
[CV] END ............................C=0.0001, kernel=linear; total time=   0.0s
[CV] END ............................C=0.0001, kernel=linear; total time=   0.0s
[CV] END ............................C=0.0001, kernel=linear; total time=   0.0s
[CV] END ............................C=0.0001, kernel=linear; total time=   0.0s
[CV] END ............................C=0.0001, kernel=linear; total time=   0.0s
[CV] END ............................C=0.0001, kernel=linear; total time=   0.0s
[CV] END ............................C=0.0001, kernel=linear; total time=   0.0s
[CV] END ............................C=0.0001, kernel=linear; total time=   0.0s
[CV] END ............................C=0.0001, kernel=linear; total time=   0.0s
[CV] END .............C=0.0005994842503189409, kernel=linear; total time=   0.0s
[CV] END .............C=0.0005994842503189409,

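In the recorded runs, the linear models always give the same error rate, which is what happens whenever predictions come from the same fitted `model` object rather than from the retrained `M`. To go beyond linear decision boundaries, we now use a non-linear model by means of kernels.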
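
For reference, the RBF kernel measures the similarity of two points as `K(x, x') = exp(-gamma * ||x - x'||^2)`, so `gamma` controls how quickly the similarity decays with distance. The sketch below computes this by hand on a few random vectors (synthetic data, assumed only for the example) and compares it with scikit-learn's `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Five synthetic points with 784 features each (not real MNIST data)
rng = np.random.default_rng(0)
P = rng.random((5, 784))
gamma = 0.01

# Squared Euclidean distances between all pairs of rows
sq_dists = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=2)

# RBF kernel computed by hand
K_manual = np.exp(-gamma * sq_dists)

# Same kernel from scikit-learn
K_sklearn = rbf_kernel(P, gamma=gamma)

print(np.allclose(K_manual, K_sklearn))
print(np.diag(K_manual))  # each point has similarity 1 with itself
```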

In [41]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix

# Constructing the grid
grid = {'C'      : np.logspace(-4,3,10),
        'kernel' : ['rbf'],
        'gamma'  : np.logspace(-4,3,10)}

MS = GridSearchCV(estimator   = SVC(),
                  param_grid  = grid,
                  scoring     = 'accuracy',
                  cv          = 10,
                  verbose     = 2)

H = MS.fit(X,Y)

M = SVC(C      = H.best_params_['C'],
        kernel = H.best_params_['kernel'],
        gamma  = H.best_params_['gamma'])

M.fit(X,Y)

Yp = M.predict(Xt)

# Misclassification rate on the test set (labels are the digits 0-9)
print(np.mean(Yp != Yt))
print(confusion_matrix(Yt, Yp))

Fitting 10 folds for each of 100 candidates, totalling 1000 fits
[CV] END .................C=0.0001, gamma=0.0001, kernel=rbf; total time=   0.0s
[CV] END .................C=0.0001, gamma=0.0001, kernel=rbf; total time=   0.0s
[CV] END .................C=0.0001, gamma=0.0001, kernel=rbf; total time=   0.1s
[CV] END .................C=0.0001, gamma=0.0001, kernel=rbf; total time=   0.1s
[CV] END .................C=0.0001, gamma=0.0001, kernel=rbf; total time=   0.1s
[CV] END .................C=0.0001, gamma=0.0001, kernel=rbf; total time=   0.0s
[CV] END .................C=0.0001, gamma=0.0001, kernel=rbf; total time=   0.0s
[CV] END .................C=0.0001, gamma=0.0001, kernel=rbf; total time=   0.0s
[CV] END .................C=0.0001, gamma=0.0001, kernel=rbf; total time=   0.1s
[CV] END .................C=0.0001, gamma=0.0001, kernel=rbf; total time=   0.0s
[CV] END ..C=0.0001, gamma=0.0005994842503189409, kernel=rbf; total time=   0.0s
[CV] END ..C=0.0001, gamma=0.000599484250318
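
Once a grid search has finished, its results can be inspected directly. By default `GridSearchCV` refits the best configuration on the full training data (`refit=True`), so `best_estimator_` is ready to use. The following is a minimal sketch on the first 300 samples of scikit-learn's built-in `load_digits` dataset, with a coarser grid and fewer folds than above to keep the run short; these choices are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Logarithmic grid as in the cells above: 10 values from 1e-4 to 1e3
full_grid = np.logspace(-4, 3, 10)
print(full_grid[:2])

# Small stand-in dataset (8x8 digit images), scaled to [0, 1]
X_d, y_d = load_digits(return_X_y=True)
X_d = X_d[:300] / 16.0
y_d = y_d[:300]

# Coarser grid to keep the example fast: 4 x 4 = 16 candidates
grid = {'C': np.logspace(-1, 2, 4),
        'kernel': ['rbf'],
        'gamma': np.logspace(-3, 0, 4)}

search = GridSearchCV(SVC(), param_grid=grid, scoring='accuracy', cv=3)
search.fit(X_d, y_d)

print(search.best_params_)   # best combination found
print(search.best_score_)    # mean cross-validated accuracy of that combination

# The best model, already refitted on all of X_d, y_d
tuned = search.best_estimator_
n_candidates = len(search.cv_results_['mean_test_score'])
print(n_candidates)
```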