<a href="https://colab.research.google.com/github/dpcarry/introml/blob/master/unit08_svm/lab_emnist_partial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab: SVMs on Extended MNIST

In the [MNIST demo](demo06_mnist_svm.ipynb), we saw how SVMs can be used for the classic MNIST problem of digit recognition. In this lab, we are going to extend the MNIST dataset by adding non-digit letters and see if the classifier can distinguish the digits from the non-digits. All non-digits will be lumped into a single 11th class. This is a highly simplified version of full character classification, but it will illustrate some concepts.

In addition to the concepts in the demo, you will learn to:
* Select the SVM parameters (`C` and `gamma`) via cross-validation.
* Use the `GridSearchCV` method to search for parameters with cross-validation.


As usual, we import the standard packages.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, preprocessing

## Downloading the EMNIST data

The EMNIST data can be downloaded through the `torchvision` package (which accompanies `pytorch`) as follows.  The initial download is 562 MB, so it may take a little while.  On Google Colab, it should take less than a minute.

Note:  In an earlier version of the lab, you downloaded the data directly from the [EMNIST](https://www.nist.gov/itl/products-and-services/emnist-dataset) site, but the dataset appears to have been removed from there.

In [None]:
import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Define the transform to convert the images to tensors
transform = transforms.Compose([transforms.ToTensor()])

# Download the EMNIST dataset
emnist_dataset = datasets.EMNIST(root='./data', split='balanced', train=True, transform=transform, download=True)

Downloading https://biometrics.nist.gov/cs_links/EMNIST/gzip.zip to ./data/EMNIST/raw/gzip.zip


100%|██████████| 562M/562M [00:07<00:00, 75.7MB/s]


Extracting ./data/EMNIST/raw/gzip.zip to ./data/EMNIST/raw


You can then convert the data to numpy arrays as follows.  
* `Xtr` is the training data array of shape `(ntr,28,28)`, representing `ntr` samples of `28x28` images.
* `ytr[i]` is the integer label of sample `i`, from 0 to `nclasses-1`.  When `ytr[i] = k`, the sample corresponds to the character string `classes[k]`.
* `Xts` and `yts` are the corresponding test arrays.

In [None]:
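Xtr = emnist_dataset.data.numpy()
ytr = emnist_dataset.targets.numpy()

# Load the test split separately (train=False) so the test data differs from the training data
emnist_test = datasets.EMNIST(root='./data', split='balanced', train=False, transform=transform, download=True)
Xts = emnist_test.data.numpy()
yts = emnist_test.targets.numpy()
classes = emnist_dataset.classes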



In [None]:
# TODO:  Print the number of training and test samples
#   ntr = ...
#   nts = ...


Print the classes.  You should see there are classes for:
* digits 0 to 9
* capital letters `A` to `Z`
* some lowercase letters: `a`, `b`, `d`, `e`, `f`, `g`, `h`, `n`, `q`, `r`, `t`

Not all lowercase letters are included since they may be too easy to confuse with the uppercase.

In [None]:
# TODO:  Print the classes and the total number of classes
#  print(classes)
#  nclasses = ...

## Displaying the Characters

We will use the function from the demo to plot the characters.  Note that the `label` argument takes either `None` or a string, not a number.

In [None]:
def plt_digit(x,label=None):
    nrow = 28
    ncol = 28
    xsq = x.reshape((nrow,ncol))
    plt.imshow(xsq.T,  cmap='Greys_r')
    plt.xticks([])
    plt.yticks([])
    if label:
        plt.title(label)

Plot 8 random samples from the training data.  You can use the `plt_digit` function above with `subplot` to create a nice display.  You may want to size your plot with the `plt.figure(figsize=(10,20))` command.

In [None]:
# TODO:  Plot 8 random samples from the training data of the digits
# Select random digits
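
The subplot layout described above can be sketched roughly as follows.  This is only an illustration on random noise images, so that it runs without the EMNIST download; the names `X_demo`, `y_demo` and `plot_random_samples` are hypothetical and not part of the lab.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_random_samples(X, labels, nplt=8, seed=0):
    """Plot nplt randomly chosen 28x28 images in a single row."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=nplt, replace=False)
    fig = plt.figure(figsize=(10, 20))
    for i, ind in enumerate(idx):
        plt.subplot(1, nplt, i + 1)
        # Same display convention as plt_digit: transpose, grey colormap
        plt.imshow(X[ind].reshape((28, 28)).T, cmap='Greys_r')
        plt.xticks([])
        plt.yticks([])
        plt.title(str(labels[ind]))
    return fig

# Stand-in data (assumption): 100 noise images with made-up string labels
X_demo = np.random.default_rng(1).integers(0, 256, size=(100, 28, 28))
y_demo = np.array([str(k % 10) for k in range(100)])
fig = plot_random_samples(X_demo, y_demo)
```

With the real data, you would pass the training images and the strings `classes[k]` for the titles.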

## Creating a Non-Digit Class


Before we begin, we first need to remove all the letters corresponding to `I`, `L` and `O` since they are too similar to the other characters.  Loop over the elements in `remove_list` and delete the rows of `Xtr`, `Xts`, `ytr`, and `yts` corresponding to these characters.

In [None]:
remove_list = ['I', 'L', 'O']

# TODO
#  Delete rows in data where character corresponds to r
#  for r in remove_list:
#     ...
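
One possible way to delete rows is with a boolean mask.  The sketch below uses tiny made-up arrays (`classes_demo`, `X_demo`, `y_demo` are hypothetical stand-ins for `classes`, `Xtr`, `ytr`), so treat it as an illustration of the idea rather than the lab solution.

```python
import numpy as np

# Toy stand-ins for the lab's arrays (hypothetical values)
classes_demo = ['0', '1', 'A', 'I', 'L', 'O', 'Z']
X_demo = np.arange(10 * 4).reshape(10, 2, 2)       # 10 tiny "images"
y_demo = np.array([0, 3, 2, 5, 1, 4, 6, 3, 0, 2])  # indices into classes_demo
remove_list = ['I', 'L', 'O']

for r in remove_list:
    k = classes_demo.index(r)   # integer label of character r
    keep = (y_demo != k)        # True for rows that are not character r
    X_demo = X_demo[keep]
    y_demo = y_demo[keep]

print(X_demo.shape)
print(y_demo)
```

Note that the remaining labels are not renumbered after the deletion.  In this lab that is harmless, since all letter labels are later collapsed into one class.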

SVM classifiers are VERY SLOW to train.  The training is particularly slow when there are a large number of classes, since one classifier must be trained for each pair of labels.  To make the problem easier, we are going to lump all of the letters into one class and add that class to the digits.  

We can do this by creating a label for the non-digit class, `non_digit_class = 10`.  Then, whenever `ytr[i] >= non_digit_class`, we set `ytr[i] = non_digit_class`.  We do the same with `yts`.


In [None]:
non_digit_class = 10

# TODO
#  ytr = ...
#  yts = ...
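
As a small illustration of the relabeling rule, the following sketch clips a made-up label vector (`y_demo` is hypothetical) at `non_digit_class`.  Using `np.minimum` is just one option; a boolean mask assignment would work equally well.

```python
import numpy as np

non_digit_class = 10

# Hypothetical labels: values >= 10 stand for letters
y_demo = np.array([3, 12, 9, 10, 45, 0])

# Every label at or above non_digit_class becomes non_digit_class
y_collapsed = np.minimum(y_demo, non_digit_class)
print(y_collapsed.tolist())   # [3, 10, 9, 10, 10, 0]
```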

Next, to reduce the training and test time, we will use only a small subset of the training and test data.  Of course, you will not get great results with this small dataset.  But, we can at least illustrate the basic concepts.  Create new training and test data `Xtr1`, `ytr1`, `Xts1` and `yts1` by randomly selecting the numbers of digits and non-digits shown below.

In [None]:
# Number of training and test digits and letters
ntr_dig = 5000
ntr_let = 1000
nts_dig = 5000
nts_let = 1000

# TODO:
#   Xtr1, ytr1 = ...
#   Xts1, yts1 = ...
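
A rough sketch of one way to draw such a subset is given below.  The helper `subsample` and the arrays `X_demo`, `y_demo` are hypothetical names introduced for illustration, and the data is synthetic; it assumes the labels have already been collapsed so that letters carry the label `non_digit_class`.

```python
import numpy as np

def subsample(X, y, n_dig, n_let, non_digit_class=10, seed=0):
    """Randomly pick n_dig digit samples and n_let letter samples."""
    rng = np.random.default_rng(seed)
    I_dig = np.where(y < non_digit_class)[0]    # indices of digits
    I_let = np.where(y == non_digit_class)[0]   # indices of letters
    I = np.hstack((rng.choice(I_dig, n_dig, replace=False),
                   rng.choice(I_let, n_let, replace=False)))
    rng.shuffle(I)   # mix digits and letters
    return X[I], y[I]

# Synthetic stand-in data (assumption): labels cycle through 0..10
X_demo = np.random.default_rng(1).integers(0, 256, size=(200, 28, 28))
y_demo = np.arange(200) % 11
X1, y1 = subsample(X_demo, y_demo, n_dig=50, n_let=5)
print(X1.shape, np.sum(y1 < 10), np.sum(y1 == 10))
```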

To confirm that the selection worked, plot 8 random samples from the new training data.

In [None]:
# TODO:  Plot 8 random samples from the new training data Xtr1
# Select random samples

The training data above takes values from 0 to 255.  Rescale the data to the range from -1 to 1.  This will give slightly better performance with the SVM.  Save the scaled data into arrays `Xtr1` and `Xts1`.

In [None]:

# TODO:  Rescale the data from -1 to 1
# Xtr1 = ...
# Xts1 = ...


# TODO:  Reshape the arrays to (n, 28*28)
# Xtr1 = Xtr1.reshape(...)
# Xts1 = Xts1.reshape(...)
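
The two steps can be illustrated on a small synthetic array.  Here `X_demo` is a hypothetical stand-in for `Xtr1`; the affine map below sends 0 to -1 and 255 to 1.

```python
import numpy as np

# Stand-in for a batch of uint8 images with values in 0..255 (assumption)
X_demo = np.random.default_rng(0).integers(0, 256, size=(5, 28, 28)).astype(np.uint8)

# Dividing by a float converts to floating point before the shift
X_scaled = 2 * (X_demo / 255.0) - 1

# Flatten each 28x28 image into a row of 784 features
X_flat = X_scaled.reshape(X_scaled.shape[0], -1)
print(X_flat.shape, X_flat.min(), X_flat.max())
```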


## Run the SVM classifier

First, create the SVM classifier. Use an `rbf` kernel with `C=2.8` and `gamma=.0073`. We will look at how to select these parameters later.

In [None]:
from sklearn import svm

# TODO:  Create a classifier: a support vector classifier
# svc = svm.SVC(...)

Fit the classifier using the scaled training data.  SVMs are insanely slow to train.  But, in this lab, we have kept the training size very small. So, the fitting should take about a minute or two.

In [None]:
# TODO:  Fit the classifier on the training data.
#   svc.fit(...)

Measure the accuracy on the test data.  Prediction is also slow, so this will take a while as well.  Print the accuracy.  If you did everything right, you should get an accuracy of around 89%.

In [None]:
# TODO:  Measure error on the test data
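
To see the create, fit and score pattern end to end without a long wait, here is a sketch on the small 8x8 digits dataset bundled with `sklearn` (a swapped-in dataset, not EMNIST).  The variable names with the `_d` suffix are hypothetical, and the accuracy it prints says nothing about what you should get in the lab.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Small stand-in dataset: 8x8 digit images with pixel values 0..16
X, y = load_digits(return_X_y=True)
X = 2 * (X / 16.0) - 1   # rescale to the range -1 to 1
Xtr_d, Xts_d, ytr_d, yts_d = train_test_split(X, y, test_size=0.5, random_state=0)

# RBF-kernel support vector classifier with the parameters from the text
svc = svm.SVC(kernel='rbf', C=2.8, gamma=0.0073)
svc.fit(Xtr_d, ytr_d)

# Accuracy = fraction of test samples predicted correctly
yhat = svc.predict(Xts_d)
acc = np.mean(yhat == yts_d)
print('Accuracy = {0:.3f}'.format(acc))
```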

The error rate is quite a bit higher than what we got in the digits-only case.  Actually, had we built a classifier using all 36 labels instead of collapsing the letters to a single class, the SVM classifier would have done much better.  The reason is that the "letters" class is now extremely complex.  

Print a confusion matrix.  You should see that the error rate on the "letters" class is much higher.

In [None]:
from sklearn.metrics import confusion_matrix

# TODO:  Print a confusion matrix
#  C = confusion_matrix(...)

# TODO:  Normalize the confusion matrix so that each row sums to one


# TODO:  Print the confusion matrix
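
Row normalization can be illustrated on a tiny made-up example (the vectors `y_true` and `y_pred` below are hypothetical).  Row `i` of the normalized matrix then gives the fraction of class-`i` samples assigned to each predicted class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for three classes
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0, 1])

# C[i, j] = number of samples with true class i predicted as class j
C = confusion_matrix(y_true, y_pred)

# Divide each row by its sum so that every row sums to one
C_norm = C / C.sum(axis=1, keepdims=True)
print(np.array_str(C_norm, precision=2, suppress_small=True))
```

Alternatively, `confusion_matrix` accepts `normalize='true'`, which performs the same row normalization.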


Print:
* What fraction of digits are mislabeled as non-digits?  
* What fraction of letters are mislabeled as digits?

In [None]:
# TODO:  Print above two error rates
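
These rates can also be computed directly from the label vectors.  The sketch below uses made-up predictions (`y_true`, `y_pred` are hypothetical) and computes the fraction of digits predicted as the non-digit class, and the fraction of letters predicted as some digit.

```python
import numpy as np

non_digit_class = 10

# Made-up labels: 10 marks the non-digit (letter) class
y_true = np.array([1, 4, 10, 10, 7, 10, 3, 10])
y_pred = np.array([1, 10, 10, 2, 7, 10, 3, 10])

is_digit = (y_true < non_digit_class)

# Fraction of true digits that were predicted as non-digits
frac_dig_as_let = np.mean(y_pred[is_digit] == non_digit_class)

# Fraction of true letters that were predicted as some digit
frac_let_as_dig = np.mean(y_pred[~is_digit] < non_digit_class)

print(frac_dig_as_let, frac_let_as_dig)
```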

## Selecting gamma and C via Cross-Validation (Using For-Loops)

In the above example, and in the demo, we used given values of `gamma` and `C`.  The best choice of these parameters depends on the problem, and decent performance of the SVM requires that you select them carefully.  The best way to select the parameters is via cross-validation.  Specifically, one tries different values of `gamma` and `C` and selects the pair of values with the lowest test error rate.

In the code below, we will try to use 3 values for `C` and `gamma` as specified in the arrays `C_test` and `gam_test`.  For each `C` and `gamma` in these arrays, fit a model on the training data and measure the accuracy on the test data.  Then, print the `C` and `gamma` that result in the best accuracy.   

Normally, you would try a large number of values for each of the parameters, but an SVM is very slow to train -- even with this small data set.  So, we will just do 3 values of each.  Even then, this could take 30 minutes or so to complete.

In this lab, you may do the parameter search over `C` and `gamma` in one of two ways:
* This section:  Use for loops and manually search over the parameters.  This is more direct and you will see and control exactly what is happening.
* Next section:  Use the `GridSearchCV` method in the `sklearn` package.  This takes a little reading, but once you learn this method, you can more easily use this for complex parameter searches.

**You only need to submit the solutions to one of the two sections.**  Pick whichever one you want.

In [None]:
C_test = [0.1,1,10]
gam_test = [0.001,0.01,0.1]

nC = len(C_test)
ngam = len(gam_test)
acc = np.zeros((nC,ngam))

# TODO:  Measure and print the accuracy for each C and gamma value.
#  Store the results in acc
#
#  for i, C in enumerate(C_test):
#       for j, gam in enumerate(gam_test):
#             ...
#             acc[i,j] = ...



In [None]:
# TODO:  Print the accuracy matrix

In [None]:
# TODO:  Print the maximum accuracy and the corresponding best C and gamma
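
The loop structure described in this section might look roughly like the sketch below.  It runs on the small 8x8 digits dataset bundled with `sklearn` (a swapped-in dataset, so it finishes in seconds), and the `_d` variable names are hypothetical.  The best parameters it finds need not match those for EMNIST.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Small stand-in dataset, rescaled to the range -1 to 1
X, y = load_digits(return_X_y=True)
X = 2 * (X / 16.0) - 1
Xtr_d, Xts_d, ytr_d, yts_d = train_test_split(X, y, test_size=0.5, random_state=0)

C_test = [0.1, 1, 10]
gam_test = [0.001, 0.01, 0.1]
acc = np.zeros((len(C_test), len(gam_test)))

# Fit one model per (C, gamma) pair and record its test accuracy
for i, C in enumerate(C_test):
    for j, gam in enumerate(gam_test):
        svc = svm.SVC(kernel='rbf', C=C, gamma=gam)
        svc.fit(Xtr_d, ytr_d)
        acc[i, j] = np.mean(svc.predict(Xts_d) == yts_d)

# Convert the flat argmax index back to (row, column) indices
i_best, j_best = np.unravel_index(np.argmax(acc), acc.shape)
print(acc)
print('Best: C={0}, gamma={1}, accuracy={2:.3f}'.format(
    C_test[i_best], gam_test[j_best], acc[i_best, j_best]))
```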

## Using `GridSearchCV` (Optional Section)


In the previous section, you would likely have used for-loops to search over the different `C` and `gamma` values.  Since this type of parameter search is so commonly used, `sklearn` has an excellent method `GridSearchCV` that can perform all the operations for you.  In this lab, `GridSearchCV` is not that useful.  But, once you get to more complex parameter searches, the `GridSearchCV` method can save you writing a lot of code.  Importantly, `GridSearchCV` supports parallelization so that models with different parameters can be fit at the same time.  In this optional section, we will show how to use this method.  

**You do not have to do this section, if you did the previous section**.  

The `GridSearchCV` method does the train-test split in addition to the parameter search.  In this case, you have already a fixed train-test split.  So, you first need to combine the train and test data back into a single dataset.

Create arrays `X` and `y` from `Xtr1`, `Xts1`, `ytr1` and `yts1`.  Use `np.vstack` and `np.hstack`.

In [None]:
# TODO:  Create combined training and test data X and y.
# X = ...
# y = ...

Normally, `GridSearchCV` will do $K$-fold validation and automatically split the data into training and test in each fold.  But, in this case, we want it to perform only one fold with a specific train-test split.  To do this, we need to do the following:
* Create a vector `test_fold` where `test_fold[i] = -1` for the samples `i` in the training data (this indicates that they should not be used as test data in any fold) and `test_fold[i] = 0` for the samples `i` in the test data (this indicates that they should be used as test data in fold 0).
* Call the method  `ps = sklearn.model_selection.PredefinedSplit(test_fold)` to create a predefined test split object.

In [None]:
# TODO:  Create a pre-defined test split object
# import sklearn.model_selection
# test_fold = ...
# ps = sklearn.model_selection.PredefinedSplit(test_fold)
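
To make the role of `test_fold` concrete, here is a sketch on toy arrays with 6 "training" rows and 3 "test" rows (all `_d` names and sizes are hypothetical).  Printing the split shows which row indices end up in each part.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Toy stand-ins: 6 training rows and 3 test rows
Xtr_d = np.zeros((6, 4))
ytr_d = np.zeros(6, dtype=int)
Xts_d = np.ones((3, 4))
yts_d = np.ones(3, dtype=int)

# Stack training rows first, then test rows
X = np.vstack((Xtr_d, Xts_d))
y = np.hstack((ytr_d, yts_d))

# -1: never used as test data;  0: test data in fold 0
test_fold = np.hstack((-np.ones(6, dtype=int), np.zeros(3, dtype=int)))
ps = PredefinedSplit(test_fold)

for train_idx, test_idx in ps.split():
    print('train:', train_idx, 'test:', test_idx)
```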

Next, read about the `GridSearchCV` method to set up a classifier that includes searching over the parameter grid.  
* For the `param_grid` parameter, you will want to create a dictionary to search over `C` and `gamma`.  You will also need to select the `kernel` parameter.
* Set `cv = ps` to use the fixed train-test split.
* Set `verbose=10` to monitor the progress

In [None]:
# TODO:  Create a GridSearchCV classifier
# clf = ...

Fit the classifier using the `fit` method.  The fit method will now search over all the parameters. This will take about 30 minutes.

In [None]:
# TODO: Fit the classifier

Print the `best_score_` and `best_params_` attributes of the classifier to find the best score and parameters

In [None]:
# TODO:  Print the best parameter and score of the classifier


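Finally, you can print the test and train scores from `cv_results_['mean_test_score']` and `cv_results_['mean_train_score']`.  Note that the train scores are only recorded if you passed `return_train_score=True` when creating the `GridSearchCV` object.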

In [None]:
# TODO:  Print the mean test score for each parameter value.
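
Putting the pieces of this section together, a `GridSearchCV` search with a fixed split might look like the sketch below.  It again uses the small 8x8 digits dataset bundled with `sklearn` as a stand-in for EMNIST, with the first half of the rows treated as training data (an arbitrary choice for illustration), so the scores are not comparable to the lab's.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Small stand-in dataset, rescaled to the range -1 to 1
X, y = load_digits(return_X_y=True)
X = 2 * (X / 16.0) - 1

# Fixed split: first half is training (-1), second half is test fold 0
ntr = X.shape[0] // 2
test_fold = np.hstack((-np.ones(ntr, dtype=int),
                       np.zeros(X.shape[0] - ntr, dtype=int)))
ps = PredefinedSplit(test_fold)

# Grid over C and gamma; the kernel is fixed to rbf
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1], 'kernel': ['rbf']}
clf = GridSearchCV(svm.SVC(), param_grid, cv=ps, verbose=10,
                   return_train_score=True)
clf.fit(X, y)

print(clf.best_score_, clf.best_params_)
print(clf.cv_results_['mean_test_score'])
print(clf.cv_results_['mean_train_score'])
```

Since there is only one fold, each "mean" score is simply the score on that single split.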