# Gaussian generative models for handwritten digit classification

Recall that the 1-NN classifier yielded a 3.09% test error rate on the MNIST data set of handwritten digits. We will now see that a Gaussian generative model does almost as well, while being significantly faster and more compact.

### Import

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt 
import numpy as np

from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import accuracy_score

### MNIST data

In [None]:
!find ../../_data | grep -i train-images-idx3-ubyte.gz

In [None]:
## Load the training set
train_data = np.load('../../_data/MNIST/train_data.npy')
train_labels = np.load('../../_data/MNIST/train_labels.npy')

## Load the testing set
test_data = np.load('../../_data/MNIST/test_data.npy')
test_labels = np.load('../../_data/MNIST/test_labels.npy')

The function **display_char** shows a single MNIST digit. To do this, it first reshapes the 784-dimensional vector into a 28x28 image.

In [None]:
def display_char(image):
    plt.imshow(np.reshape(image, (28,28)), cmap=plt.cm.gray)
    plt.axis('off')
    plt.show()

In [None]:
display_char(train_data[58])
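The reshape step can be checked without the MNIST files. The following is a minimal sketch that uses a hypothetical stand-in vector (the integers 0 to 783) instead of a real digit. It relies on NumPy's default row-major order, under which pixel `(row, col)` of the 28x28 image comes from position `28 * row + col` of the flat vector.

```python
import numpy as np

# Hypothetical stand-in for one MNIST row: 784 values numbered 0..783.
vector = np.arange(784)

# Same reshape that display_char applies before plotting.
image = np.reshape(vector, (28, 28))

# With the default row-major order, image[row, col] is vector[28 * row + col].
row, col = 3, 5
print(image.shape)
print(image[row, col], vector[28 * row + col])
```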

The training set consists of 60,000 images. Thus `train_data` should be a 60000x784 array, while `train_labels` should hold 60,000 labels, one per image. Let's check.

In [None]:
train_data.shape, train_labels.shape
# test_data.shape, test_labels.shape

### Polynomial SVC

In [None]:
# Training on all 60,000 images is slow: be patient for 10 minutes or so!
for C in [1.0]:
    # Degree-2 polynomial kernel SVM
    clp = SVC(C=C, kernel='poly', degree=2).fit(train_data, train_labels)
    train_pred = clp.predict(train_data)
    test_pred = clp.predict(test_data)

    # accuracy_score expects (y_true, y_pred); error rate = 1 - accuracy
    train_error = 1 - accuracy_score(train_labels, train_pred)
    test_error = 1 - accuracy_score(test_labels, test_pred)

    print('Poly: C: {}, train error: {}, test error: {}'.format(C, train_error, test_error))

In [None]:
train_pred[2]
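For reference, scikit-learn's polynomial kernel between two vectors is `(gamma * <x, z> + coef0) ** degree`. The sketch below is an illustration on small random matrices, not on MNIST. It compares a manual computation of that formula with `sklearn.metrics.pairwise.polynomial_kernel`. The values of `gamma` and `coef0` here are chosen for the example. `SVC` itself defaults to `coef0=0.0` and `gamma='scale'`, so the `gamma` it would use on MNIST depends on the data.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# Small random matrices standing in for two batches of feature vectors.
rng = np.random.default_rng(0)
X = rng.random((5, 4))
Z = rng.random((3, 4))

# Illustrative kernel parameters (assumed values, not the ones SVC picks on MNIST).
gamma, coef0, degree = 0.5, 0.0, 2

# Manual evaluation of (gamma * <x, z> + coef0) ** degree for every pair.
K_manual = (gamma * (X @ Z.T) + coef0) ** degree

# The same kernel matrix computed by scikit-learn.
K_sklearn = polynomial_kernel(X, Z, degree=degree, gamma=gamma, coef0=coef0)

print(K_manual.shape)
print(np.allclose(K_manual, K_sklearn))
```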

### Linear SVC

Hyperparameter __`C`__ is the cost of misclassification. Reducing `C`:
 - lowers the penalty on misclassified training points, so expect more of them
 - widens the boundary margin
 - increases bias
 - lowers variance, and as a result reduces overfitting

The default value of `C` is 1.0.
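The link between `C` and the margin can be illustrated on toy data. The sketch below is not part of the MNIST experiment and makes several assumptions: it uses synthetic, overlapping 2-D blobs from `make_blobs`, and it fits `SVC(kernel='linear')` rather than `LinearSVC`. It reports the margin width `2 / ||w||` for each value of `C`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping synthetic classes (assumed toy data, not MNIST).
X, y = make_blobs(n_samples=200, centers=[[-1.0, -1.0], [1.0, 1.0]],
                  cluster_std=1.5, random_state=0)

margins = {}
for C in [0.01, 1.0, 100.0]:
    model = SVC(C=C, kernel='linear').fit(X, y)

    # For a linear SVM the margin width is 2 / ||w||.
    w = model.coef_[0]
    margins[C] = 2.0 / np.linalg.norm(w)

    # Fraction of training points that are misclassified.
    train_error = 1 - model.score(X, y)
    print('C: {}, margin width: {:.3f}, train error: {:.3f}'.format(
        C, margins[C], train_error))
```

On data like this, the smallest `C` should give the widest margin, in line with the list above.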

In [None]:
# Compare three values of C; smaller C means stronger regularization
for C in [0.1, 1.0, 10.0]:
    clf = LinearSVC(C=C, loss='hinge').fit(train_data, train_labels)
    train_pred = clf.predict(train_data)
    test_pred = clf.predict(test_data)

    # accuracy_score expects (y_true, y_pred); error rate = 1 - accuracy
    train_error = 1 - accuracy_score(train_labels, train_pred)
    test_error = 1 - accuracy_score(test_labels, test_pred)

    print('Linear: C: {}, train error: {}, test error: {}'.format(C, train_error, test_error))
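The error rates printed in this notebook are one minus the accuracy, which is the fraction of misclassified examples. The sketch below checks that equivalence on a small set of hypothetical labels made up for the illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 8 examples, 2 of them predicted incorrectly.
true_labels = np.array([3, 1, 4, 1, 5, 9, 2, 6])
predicted = np.array([3, 1, 4, 7, 5, 9, 2, 0])

# Error rate as used in the notebook: 1 - accuracy.
error_rate = 1 - accuracy_score(true_labels, predicted)

# Equivalent direct count of mismatches.
fraction_wrong = np.mean(predicted != true_labels)

print(error_rate, fraction_wrong)
```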