# MNIST data set

What kind of accuracy can I achieve on MNIST using the support vector classifier (`SVC`) from Scikit-Learn?

 - [Loading data](#Loading-data)
 - [Model training](#Model-training)
 - [Grid search](#Grid-search)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib notebook

## Loading data

In [2]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
mnist.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details', 'categories', 'url'])

In [3]:
import os
os.listdir('/Users/angelo/scikit_learn_data/openml/openml.org/data/v1/download')

['52667.gz']

In [4]:
print(mnist['DESCR'])

**Author**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges  
**Source**: [MNIST Website](http://yann.lecun.com/exdb/mnist/) - Date unknown  
**Please cite**:  

The MNIST database of handwritten digits with 784 features, raw data available at: http://yann.lecun.com/exdb/mnist/. It can be split in a training set of the first 60,000 examples, and a test set of 10,000 examples  

It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image b

In [5]:
X = mnist['data']
y = mnist['target']
print(f'X.shape: {X.shape}')
print(f'y.shape: {y.shape}')
print(f'X.ndim: {X.ndim}')
print(f'y.ndim: {y.ndim}')
print(f'X.size: {X.size}')
print(f'y.size: {y.size}')

X.shape: (70000, 784)
y.shape: (70000,)
X.ndim: 2
y.ndim: 1
X.size: 54880000
y.size: 70000


## Model training

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

In [11]:
# Work on a 10% subsample of the data.
# Note: the slice below takes the first 10% of rows in dataset order; the rows are not shuffled.
sample_ratio = 0.1
sample_size = int(len(X) * sample_ratio)
X_sample = X[:sample_size]
y_sample = y[:sample_size]
print(X_sample.shape)
print(y_sample.shape)

(7000, 784)
(7000,)
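
The slice `X[:sample_size]` above takes the first 10% of rows in dataset order. If a randomly drawn subsample is wanted instead, a permutation of the row indices can be applied to both arrays. The following is a minimal sketch on made-up arrays, assuming `X` and `y` are NumPy arrays; the names `X_demo`, `y_demo`, and `sample_index` are hypothetical and not part of the notebook.

```python
import numpy as np

# Hypothetical stand-ins for the notebook's X and y (assumed to be NumPy arrays).
rng = np.random.RandomState(42)
X_demo = np.arange(1000 * 4).reshape(1000, 4)
y_demo = np.arange(1000)

sample_ratio = 0.1
sample_size = int(len(X_demo) * sample_ratio)

# Shuffle the row indices once, then index X and y with the same
# indices so that features and labels stay aligned.
shuffled_index = rng.permutation(len(X_demo))
sample_index = shuffled_index[:sample_size]
X_sample_demo = X_demo[sample_index]
y_sample_demo = y_demo[sample_index]

print(X_sample_demo.shape)
print(y_sample_demo.shape)
```

Indexing with an integer array like this works for NumPy arrays; if `X` were a pandas DataFrame, `X.iloc[sample_index]` would be the equivalent.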


In [12]:
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.3, random_state=42)

In [13]:
print(f'X_train: {X_train.shape}\nX_test: {X_test.shape}\ny_train: {y_train.shape}\ny_test: {y_test.shape}')

X_train: (4900, 784)
X_test: (2100, 784)
y_train: (4900,)
y_test: (2100,)
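
`train_test_split` shuffles the rows before splitting but does not, by default, preserve class proportions. One option, shown here only as a sketch on made-up labels, is to pass the `stratify` argument. The arrays `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% class 0, 20% class 1.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo asks for the class ratio to be kept in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Class counts in the train and test parts.
print(np.bincount(y_tr), np.bincount(y_te))
```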


In [38]:
from time import time
t0 = time()
svc_clf = SVC(gamma='scale')
svc_clf.fit(X_train, y_train)
print(f'{svc_clf}\n')
print(f'Time elapsed: {time() - t0:.4f} sec.')

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

Time elapsed: 6.0791 sec.


In [39]:
print(f'Mean accuracy: {svc_clf.score(X_test, y_test):.6f}')

Mean accuracy: 0.959048
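
Mean accuracy is a single number and hides which digits get confused with each other. A confusion matrix gives a per-class view. The sketch below uses the small 8x8 `load_digits` data set bundled with Scikit-Learn as a stand-in, not MNIST, so its numbers are not comparable to the result above. The variable names are hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data: 1,797 8x8 digit images bundled with scikit-learn (not MNIST).
digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=42
)

clf = SVC(gamma='scale').fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)

# Fraction of each true class that was predicted correctly (per-class recall).
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(per_class_recall.round(3))
```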


## Grid search

In [40]:
from sklearn.model_selection import GridSearchCV

In [56]:
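# Note: `degree` only affects the 'poly' kernel, and `coef0` only affects 'poly' and 'sigmoid',
# so 'rbf' candidates that differ only in those values are equivalent fits.
param_grid = [
    {'C': [0.1, 1.0], 'kernel': ['rbf', 'poly', 'sigmoid'], 'degree': [2, 3, 4], 'coef0': [0.0, 1.0]},
    {'gamma': ['scale'], 'tol': [1e-3, 1e-2]},
]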

In [59]:
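# Note: 'neg_mean_squared_error' is a regression metric; for a classifier, scoring='accuracy' is the usual choice.
grid_search = GridSearchCV(svc_clf, param_grid, cv=3, scoring='neg_mean_squared_error', return_train_score=True)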

In [62]:
from time import time
t0 = time()
grid_search.fit(X_train, y_train)
print(f'{grid_search}\n')
print(f'Time elapsed: {time() - t0:.4f} sec.')

GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='warn', n_jobs=None,
             param_grid=[{'C': [0.1, 1.0], 'coef0': [0.0, 1.0],
                          'degree': [2, 3, 4],
                          'kernel': ['rbf', 'poly', 'sigmoid']},
                         {'gamma': ['scale'], 'tol': [0.001, 0.01]}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='neg_mean_squared_error', verbose=0)

Time elapsed: 1453.5753 sec.


In [66]:
print(f'Best grid search parameters: {grid_search.best_params_}\n')
print(f'Best estimator: {grid_search.best_estimator_}')

Best grid search parameters: {'C': 1.0, 'coef0': 0.0, 'degree': 2, 'kernel': 'rbf'}

Best estimator: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=2, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)


In [80]:
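cvres = grid_search.cv_results_
# mean_test_score holds the negated MSE; negate it and take the square root to get an RMSE per candidate.
cvres_results = [
    (np.sqrt(-mean_score), params)
    for mean_score, params in zip(cvres['mean_test_score'], cvres['params'])
]
# Show the eight candidates with the lowest RMSE.
display(sorted(cvres_results, key=lambda x: x[0])[:8])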

[(0.9195162881002341, {'C': 1.0, 'coef0': 0.0, 'degree': 2, 'kernel': 'rbf'}),
 (0.9195162881002341, {'C': 1.0, 'coef0': 0.0, 'degree': 3, 'kernel': 'rbf'}),
 (0.9195162881002341, {'C': 1.0, 'coef0': 0.0, 'degree': 4, 'kernel': 'rbf'}),
 (0.9195162881002341, {'C': 1.0, 'coef0': 1.0, 'degree': 2, 'kernel': 'rbf'}),
 (0.9195162881002341, {'C': 1.0, 'coef0': 1.0, 'degree': 3, 'kernel': 'rbf'}),
 (0.9195162881002341, {'C': 1.0, 'coef0': 1.0, 'degree': 4, 'kernel': 'rbf'}),
 (0.9195162881002341, {'gamma': 'scale', 'tol': 0.001}),
 (0.9205144967764006, {'gamma': 'scale', 'tol': 0.01})]
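
The six `rbf` rows above share one score because `degree` and `coef0` have no effect on the RBF kernel, so those candidates repeat the same fit. A grid with one dictionary per kernel avoids the repeated work, and `scoring='accuracy'` ranks candidates by a classification metric rather than an error on the label values. The sketch below swaps in both changes and runs on the small bundled `load_digits` data set as a stand-in for MNIST. The grid values and variable names are assumptions for illustration, not a tuned search.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in data: 8x8 digit images bundled with scikit-learn (not MNIST).
digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=42
)

# One dictionary per kernel, listing only the parameters that kernel uses.
param_grid = [
    {'kernel': ['rbf'], 'C': [0.1, 1.0]},
    {'kernel': ['poly'], 'C': [0.1, 1.0], 'degree': [2, 3], 'coef0': [0.0, 1.0]},
    {'kernel': ['sigmoid'], 'C': [0.1, 1.0], 'coef0': [0.0, 1.0]},
]

# Rank candidates by cross-validated accuracy.
search = GridSearchCV(SVC(gamma='scale'), param_grid, cv=3, scoring='accuracy')
search.fit(X_tr, y_tr)

print(f'Candidates evaluated: {len(search.cv_results_["params"])}')
print(f'Best parameters: {search.best_params_}')
print(f'Best CV accuracy: {search.best_score_:.4f}')
print(f'Test accuracy: {search.score(X_te, y_te):.4f}')
```

With this layout the search evaluates 14 candidates (2 + 8 + 4), and `best_score_` can be read directly as a cross-validated accuracy.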

In [67]:
best_clf = grid_search.best_estimator_
print(f'Mean accuracy: {best_clf.score(X_test, y_test):.6f}')

Mean accuracy: 0.959048
