# Machine Learning

## Session #3: Classifying digits with SVM

The aim of this session is to solve a real data problem using the SVM implementation of scikit-learn. The problem is based on the <a href = https://en.wikipedia.org/wiki/MNIST_database>  MNIST database</a> of handwritten digits. The assignment names <a href = http://mldata.org> mldata.org</a> as the data source; the code below fetches the data from OpenML with `fetch_openml` instead.



## Required packages:

    * numpy
    * matplotlib.pyplot
    * sklearn (svm, preprocessing, model_selection, datasets, metrics)
        . np.fmod
        . preprocessing.scale
        . model_selection.train_test_split
    
    

The IPython Notebook should be sent using the assignment activity module (see Aula Global). The deadline for submitting your reports is **March 8**. **The IPython Notebook should indicate your names and your email addresses**.

### 1. Loading the MNIST Database and preparing the data

* Divide the dataset into train and test datasets
* Visualize some individual images (Remember that each row of the dataset corresponds to an image)

## Import required packages

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, preprocessing, metrics
from sklearn.datasets import fetch_openml


from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import explained_variance_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error

## Download MNIST database

In [2]:
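# Fetch the digits data from OpenML (data_id=41082); labels are returned as strings
X, y = fetch_openml(data_id=41082, as_frame=False, return_X_y=True)
# Each row of X is one flattened image (256 features = 16 x 16 pixels)
print(X.shape)
print(y.shape)
print(y)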

(9298, 256)
(9298,)
['7' '6' '5' ... '5' '1' '2']


## Divide the dataset into train and test datasets

In [3]:
# Create train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=3/7, random_state=0)
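As a quick check of the split sizes, the sketch below applies the same `test_size=3/7` to placeholder arrays with the 9298 rows reported above. The zero-filled arrays `X_demo` and `y_demo` are stand-ins introduced here for illustration, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset
# (assumption: only the row count matters for the split sizes)
X_demo = np.zeros((9298, 1))
y_demo = np.zeros(9298)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=3/7, random_state=0
)

# Number of training rows and number of test rows
print(len(X_tr), len(X_te))
```

The 3985 test rows match the support reported in the classification reports further down. The remaining 5313 rows form the training set, which is fewer than the 10000 indices drawn in step c) of Section 2.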

## Visualize some individual images (Remember that each row of the dataset corresponds to an image)
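No code is given for this step, so here is a minimal sketch of one way to do it. It assumes each 256-feature row is a 16 x 16 image stored row by row; the helper names `rows_to_images` and `plot_digits` and the random stand-in data are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def rows_to_images(X, side):
    """Reshape each flattened row of X into a side x side image."""
    X = np.asarray(X)
    return X.reshape(-1, side, side)

def plot_digits(X, y, n_images=5, side=16):
    """Show the first n_images rows of X as grayscale images titled by label."""
    images = rows_to_images(X[:n_images], side)
    fig, axes = plt.subplots(1, n_images, figsize=(2 * n_images, 2))
    for ax, img, label in zip(np.atleast_1d(axes), images, y[:n_images]):
        ax.imshow(img, cmap='gray')
        ax.set_title(str(label))
        ax.axis('off')
    return fig

# Stand-in data (hypothetical): random pixels with the same row length as X
rng = np.random.default_rng(0)
X_fake = rng.random((5, 256))
y_fake = np.array(['7', '6', '5', '5', '1'])
fig = plot_digits(X_fake, y_fake)
```

With the notebook's data the call would be `plot_digits(X_train, y_train)`; outside a notebook, follow it with `plt.show()`.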

### 2. Training a Gaussian RBF SVC

On a randomly selected subset (2000 samples) of the training set, 

a) Compute the cross-validated metrics of an RBF SVC fitted with the default parameters. The simplest way to use cross-validation is to call the <a href = http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score > cross_val_score</a> helper function on the estimator and the dataset. Adopt a 5-fold cross-validation.

In [4]:
# Size of the training set
n = len(X_train)

# Draw 2000 distinct row indices (sampling without replacement)
i = np.random.choice(n, size=2000, replace=False)

# Select the random subset of X_train
X_train_sub = X_train[i, :]
y_train_sub = y_train[i]

# Fit the non-linear SVC (cross_val_score below refits clones of it on each fold)
rbf_svc = svm.SVC(kernel='rbf')
rbf_svc.fit(X_train_sub, y_train_sub)

# Compute the cross-validated accuracy with 5 folds
scores = cross_val_score(rbf_svc, X_train_sub, y_train_sub, cv=5)
# The reported spread is half the standard deviation of the fold scores
print("Accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() / 2))

Accuracy: 0.9630 (+/- 0.0039)
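To make explicit what `cross_val_score` computes, the sketch below reproduces it with a manual loop. For a classifier and an integer `cv`, scikit-learn uses stratified folds and the estimator's own `score` method (accuracy for `SVC`). The data come from `load_digits`, the small 8 x 8 digits set bundled with scikit-learn, used here only as a stand-in for the notebook's data.

```python
import numpy as np
from sklearn import svm
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data (assumption): bundled 8x8 digits, not the notebook's dataset
X_demo, y_demo = load_digits(return_X_y=True)
clf = svm.SVC(kernel='rbf')

# Scores returned by the helper
helper_scores = cross_val_score(clf, X_demo, y_demo, cv=5)

# The same computation written out fold by fold
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_clf = clone(clf).fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_clf.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

print(helper_scores)
print(manual_scores)
```

Each fold fits a fresh clone of the estimator, so fitting the classifier before calling the helper has no effect on the reported scores.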


b) Based on this preprocessed subset, use the <a href = http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV > GridSearchCV</a> function to perform an exhaustive search over specified parameters for the RBF SVC. The parameter space is given by values of $\gamma$ in the range $[10^{-3}, 10^{3}]$ and values of C in [1, 10, 100, 1000]. Adopt a 5-fold cross-validation.

In [5]:
# Use gridsearch to investigate different C and gamma parameters

# Parameter space for C
C_range = [1, 10, 100, 1000]

# Parameter space for gamma: 4 log-spaced values from 10^-3 to 10^1
# (the task statement asks for values up to 10^3)
gamma_range = np.logspace(-3, 1, 4)

# GridSearch with 5-fold CV
param_grid = dict(gamma=gamma_range, C=C_range)
grid = GridSearchCV(svm.SVC(kernel='rbf'), param_grid=param_grid, cv=5)
grid.fit(X_train_sub, y_train_sub)

In [6]:
# Obtain the best parameters
print('The best parameters are %s with a score of %0.2f'% (grid.best_params_, grid.best_score_))

The best parameters are {'C': 10, 'gamma': 0.021544346900318832} with a score of 0.97
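The grid passed to `GridSearchCV` can be inspected directly. The sketch below prints the four values produced by `np.logspace(-3, 1, 4)` and, for comparison, one possible grid that spans the full range from $10^{-3}$ to $10^{3}$. The choice of one value per decade in `gamma_full` is an assumption made here, not part of the assignment.

```python
import numpy as np

# Grid used in the search above: 4 points between 10^-3 and 10^1
gamma_used = np.logspace(-3, 1, 4)

# One possible grid covering 10^-3 to 10^3 (assumption: one value per decade)
gamma_full = np.logspace(-3, 3, 7)

print(gamma_used)
print(gamma_full)
```

The second entry of `gamma_used` is the value 0.0215... reported as the best gamma, so the search never evaluated any gamma above 10.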


c) On the whole train data set:
* Train a single SVC with the best C and $\gamma$ obtained in the previous step

In [7]:
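# Draw 10000 row indices; np.random.randint samples with replacement, so rows
# are repeated whenever the training set (n rows) is smaller than 10000
i = np.random.randint(n, size=10000)

# Select the resampled rows of X_train
X_train_partial = X_train[i, :]
y_train_partial = y_train[i]

# Standardize the data (zero mean and unit variance per feature)
X_train_partial = preprocessing.scale(X_train_partial)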

In [8]:
# Standardize the train and test sets; each call uses that set's own mean and standard deviation
X_train = preprocessing.scale(X_train)
X_test = preprocessing.scale(X_test)

In [9]:
# Fit the non-linear SVC with the best parameters
rbf_svc = svm.SVC(kernel='rbf', C=grid.best_params_['C'], gamma=grid.best_params_['gamma'])
rbf_svc.fit(X_train_partial, y_train_partial)
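An alternative to calling `preprocessing.scale` on each set separately is to learn the scaling on the training data only and reuse it on the test data. The sketch below does this with `StandardScaler` inside a pipeline. It is not the notebook's method: the data come from the bundled `load_digits` set as a stand-in, and `C=10` is an illustrative value.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): bundled 8x8 digits, not the notebook's dataset
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=3/7, random_state=0
)

# The scaler is fitted on the training rows only; the same means and
# standard deviations are then applied to the test rows at predict time
model = make_pipeline(StandardScaler(), svm.SVC(kernel='rbf', C=10))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print("Test accuracy: %.4f" % test_accuracy)
```

A pipeline like this can also be passed to `GridSearchCV`, with parameter names such as `svc__C` and `svc__gamma`, so that the search and the final model see features scaled in the same way.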

d) On the test data set:
* Evaluate the misclassification rate with the trained classifier. Using the scikit-learn `metrics` functions, give the classification report and the confusion matrix.

In [10]:
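y_pred = rbf_svc.predict(X_test)

# Regression-style scores: these treat the digit labels as numbers and need
# numeric arrays (with string labels they can raise the ValueError seen in Section 4)
print('Error measures:')
print("EVS: %.4f" % explained_variance_score(y_test, y_pred))
print("MAE: %.4f" % mean_absolute_error(y_test, y_pred))
print("MSE: %.4f" % mean_squared_error(y_test, y_pred))
print("R2: %.4f" % r2_score(y_test, y_pred))

# Per-class precision, recall and F1 score
print('Detailed classification report:')
print(classification_report(y_test, y_pred))

# Rows are true labels, columns are predicted labels
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred))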

Error measures:
EVS: 0.7930
MAE: 0.4404
MSE: 2.0028
R2: 0.7781
Detailed classification report:
              precision    recall  f1-score   support

           1       0.99      0.93      0.96       641
          10       0.99      0.88      0.93       356
           2       0.99      0.99      0.99       540
           3       0.47      0.99      0.63       388
           4       0.97      0.81      0.88       343
           5       0.97      0.80      0.88       384
           6       0.93      0.70      0.79       309
           7       1.00      0.90      0.95       359
           8       0.99      0.85      0.91       332
           9       0.98      0.75      0.85       333

    accuracy                           0.88      3985
   macro avg       0.93      0.86      0.88      3985
weighted avg       0.93      0.88      0.89      3985

Confusion matrix:
[[599   0   0  41   0   0   0   0   0   1]
 [  0 315   0  31   2   5   0   0   3   0]
 [  0   0 534   4   0   2   0   0   0   0]
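The misclassification rate asked for in this step is one minus the accuracy. The sketch below computes it in two equivalent ways: from `accuracy_score` and from the confusion matrix, whose diagonal counts the correct predictions. The small label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy string labels (hypothetical), in the same format as y_test and y_pred
y_true_demo = np.array(['1', '2', '3', '10', '2', '1'])
y_pred_demo = np.array(['1', '2', '2', '10', '2', '3'])

# Misclassification rate from the accuracy
error_rate = 1.0 - accuracy_score(y_true_demo, y_pred_demo)

# Same quantity from the confusion matrix: off-diagonal mass over the total
cm = confusion_matrix(y_true_demo, y_pred_demo)
error_rate_cm = 1.0 - np.trace(cm) / cm.sum()

print("Misclassification rate: %.4f" % error_rate)
```

In the notebook the same call would be `1.0 - accuracy_score(y_test, y_pred)`, which works with string labels.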

### 3. Training a Polynomial SVC

From Section 2, the dataset has been standardized over the feature axis. On a randomly selected subset (2000 samples) of the training set,

a) Set $\gamma=r=1$ in the polynomial kernel and compute the cross-validated metrics of a polynomial SVC fitted with the default values of the remaining parameters. The simplest way to use cross-validation is to call the <a href = http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score > cross_val_score</a> helper function on the estimator and the dataset.

In [11]:
# Fit the polynomial SVC with gamma = r = 1 (r is called coef0 in scikit-learn)
poly_svc = svm.SVC(kernel='poly', coef0=1, gamma=1)
poly_svc.fit(X_train_sub, y_train_sub)

# Compute the cross-validated accuracy with 5 folds
scores = cross_val_score(poly_svc, X_train_sub, y_train_sub, cv=5)
print("Accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() / 2))

Accuracy: 0.9720 (+/- 0.0031)
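For reference, the polynomial kernel used by `SVC(kernel='poly')` is $K(x, x') = (\gamma \langle x, x' \rangle + r)^M$, where `gamma` is $\gamma$, `coef0` is $r$ and `degree` is $M$. The sketch below checks this formula against `polynomial_kernel` from scikit-learn on random vectors; the arrays `A` and `B` are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# Random stand-in vectors (hypothetical): 4 and 2 points in 3 dimensions
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
B = rng.normal(size=(2, 3))

# gamma = r = 1 as in the task; M = 3 is the default degree of SVC
gamma, r, M = 1.0, 1.0, 3

# Kernel written out from the formula
K_manual = (gamma * A @ B.T + r) ** M

# Same kernel computed by scikit-learn
K_sklearn = polynomial_kernel(A, B, degree=M, gamma=gamma, coef0=r)

print(np.allclose(K_manual, K_sklearn))
```

Because `degree` is left at its default in step a), the cross-validated score there corresponds to $M = 3$.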


b) Use the <a href = http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV > GridSearchCV</a> function to perform an exhaustive search over specified parameters for the polynomial SVC. The parameter space is given by values of C in [1, 10, 100, 1000] and values of the degree M from 1 to 6. Adopt a 5-fold cross-validation.

In [12]:
# Parameter space for C
C_range = [1, 10, 100, 1000]

# Parameter space for the degree M
M_range = [1,2,3,4,5,6]

# GridSearch with 5-fold CV
param_grid_poly = dict(degree=M_range, C=C_range)
grid_poly = GridSearchCV(svm.SVC(kernel='poly', coef0=1, gamma=1), param_grid=param_grid_poly, cv=5)
grid_poly.fit(X_train_sub, y_train_sub)

In [13]:
# Obtain the best parameters
print('The best parameters are %s with a score of %0.2f'% (grid_poly.best_params_, grid_poly.best_score_))

The best parameters are {'C': 1, 'degree': 4} with a score of 0.97


c) On the whole train data set:
* Train a single SVC with the best C and M obtained in the previous step

In [14]:
# Fit the polynomial SVC with the best parameters
poly_svc = svm.SVC(kernel='poly', coef0=1, gamma=1, degree=grid_poly.best_params_['degree'], C=grid_poly.best_params_['C'])
poly_svc.fit(X_train_partial, y_train_partial)

d) On the test data set:
* Evaluate the misclassification rate with the trained classifier. Using the scikit-learn `metrics` functions, give the classification report and the confusion matrix.

In [15]:
y_pred = poly_svc.predict(X_test)

# Same regression-style scores as in Section 2 (labels treated as numbers)
print('Error measures:')
print("EVS: %.4f" % explained_variance_score(y_test, y_pred))
print("MAE: %.4f" % mean_absolute_error(y_test, y_pred))
print("MSE: %.4f" % mean_squared_error(y_test, y_pred))
print("R2: %.4f" % r2_score(y_test, y_pred))

# Per-class precision, recall and F1 score
print('Detailed classification report:')
print(classification_report(y_test, y_pred))

# Rows are true labels, columns are predicted labels
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred))

Error measures:
EVS: 0.9110
MAE: 0.1752
MSE: 0.8040
R2: 0.9109
Detailed classification report:
              precision    recall  f1-score   support

           1       0.98      0.95      0.97       641
          10       0.94      0.95      0.94       356
           2       0.95      0.99      0.97       540
           3       0.97      0.93      0.95       388
           4       0.91      0.93      0.92       343
           5       0.93      0.93      0.93       384
           6       0.94      0.92      0.93       309
           7       0.97      0.96      0.96       359
           8       0.94      0.96      0.95       332
           9       0.89      0.91      0.90       333

    accuracy                           0.95      3985
   macro avg       0.94      0.94      0.94      3985
weighted avg       0.95      0.95      0.95      3985

Confusion matrix:
[[610   2   8   2   6   6   0   0   2   5]
 [  1 339   1   1   3   3   0   1   2   5]
 [  0   0 536   2   0   2   0   0   0   0]

### 4. Training other classifiers

Following the same procedure as above, fit a Naive Bayes classifier and compare the results with those obtained using the SVCs.

In [16]:
# Fit a Naive Bayes Classifier
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train_partial, y_train_partial)

In [23]:
y_pred = gnb.predict(X_test)
print(y_pred.shape)
print(y_test.shape)

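# Regression metrics need numeric labels; with the string labels returned by
# fetch_openml this call raises the ValueError shown in the output below
print('Error measures:')
print("MAE: %.4f" % mean_absolute_error(y_test, y_pred))
print("MSE: %.4f" % mean_squared_error(y_test, y_pred))
print("R2: %.4f" % r2_score(y_test, y_pred))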

print('Detailed classification report:')
print(classification_report(y_test, y_pred))

print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred))

(3985,)
(3985,)
Error measures:


ValueError: dtype='numeric' is not compatible with arrays of bytes/strings.Convert your data to numeric values explicitly instead.
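The error above comes from passing string labels to a regression metric. A minimal sketch of the difference, using made-up labels: classification metrics such as `accuracy_score` accept strings directly, while regression metrics such as `mean_absolute_error` need an explicit conversion to numbers.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Toy string labels (hypothetical), in the same format as y_test and y_pred
y_true_str = np.array(['7', '10', '2'])
y_pred_str = np.array(['7', '1', '2'])

# Classification metrics accept string labels directly
acc = accuracy_score(y_true_str, y_pred_str)

# Regression metrics need numbers, so convert the labels explicitly
y_true_int = y_true_str.astype(int)
y_pred_int = y_pred_str.astype(int)
mae = mean_absolute_error(y_true_int, y_pred_int)

print("Accuracy: %.4f" % acc)
print("MAE: %.4f" % mae)
```

In this toy case a single confusion of '10' with '1' contributes an absolute error of 9, which illustrates why distance-based scores on class labels are hard to interpret for a classification task.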

The RBF and polynomial SVCs reach a similar cross-validated accuracy on the 2000-sample subset (about 97%), but on the test set the polynomial SVC is more accurate (0.95 versus 0.88 for the RBF SVC). The Naive Bayes classifier fails to correctly predict the even numbers and shows an overall accuracy of 75%.