# Exercise 3: Digit Classification with Keras
The [MNIST dataset](https://en.wikipedia.org/wiki/MNIST_database) is a database of handwritten digits, each labeled with its true value, that has been used extensively by the machine learning community. It is large and easy to describe, which makes it a great example for learning to use convolutional neural networks.

This exercise draws extensively from [Keras tutorials](https://github.com/keras-team/keras/blob/master/examples).

This notebook contains many sections that are filled out for you and many that you will need to fill out to complete the exercise (marked in <font color='red'>RED</font>). You are finished when "Restart and Run All Cells" executes the entire notebook without producing any errors. Do not remove assert statements.


In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.metrics import accuracy_score, log_loss
from keras import optimizers as opt
from keras.datasets import mnist
from keras import Sequential
from keras import layers
from time import perf_counter
from math import isclose
import tensorflow as tf
import keras.backend as K
import pandas as pd
import numpy as np
import warnings
import keras
np.random.seed(1)

Some housekeeping: configure TensorFlow to allocate GPU memory as needed instead of claiming all of it at startup.

In [None]:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
K.set_session(session)

## Load Data
The MNIST data is easily accessible from Keras and needs a little bit of preprocessing before it is useful.

In [None]:
# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(f'Input data shape: {x_train.shape}')
print(f'Output data shape: {y_train.shape}')

The data is returned as 60000 28x28 images.
Depending on the backend for Keras, we need to turn these into either 28x28x1 or 1x28x28 images.

In [None]:
img_rows, img_cols = 28, 28

In [None]:
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

In [None]:
print(f'New data shape: {input_shape}')

The pixel values are integers between 0 and 255. We convert them to single-precision floating point numbers and scale them to the range [0, 1].

In [None]:
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

The classes are integers between 0 and 9. 
As we are treating digits as simple categories and not considering the ordering between them, we need to one-hot encode the data.
(Recall doing this in the last exercise for the categorical input variables.)
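
As a minimal sketch of what one-hot encoding does, the snippet below builds the encoding by hand with NumPy. The `labels` array is made up for illustration and is not taken from MNIST; the result should have the same layout as what `keras.utils.to_categorical` produces in the cells that follow.

```python
import numpy as np

# Hypothetical labels for illustration (not taken from MNIST)
labels = np.array([5, 0, 4])
n_classes = 10

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity matrix with the labels encodes all of them at once
one_hot = np.eye(n_classes, dtype='float32')[labels]

print(one_hot[0])  # a single 1 in position 5, zeros elsewhere
```

Each row contains exactly one 1, so no digit is treated as "closer" to another than any other.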

In [None]:
num_classes = 10

In [None]:
print(f'Output for entry 0: {y_train[0]}')

In [None]:
y_train = keras.utils.to_categorical(y_train, num_classes)

In [None]:
print(f'New output for entry 0: {y_train[0]}')

In [None]:
print(f'New output data shape: {y_train.shape}')

Alright, we are now ready to go.

## Quick Tutorial: Classification Models and Keras
All of our previous examples have used regression models, so here is a brief introduction to classification.

#### Scoring Classification Models
There are [many ways to rate the quality of classification](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics), each with its own benefits.

For example, the [False Positive Rate (FPR)](https://en.wikipedia.org/wiki/False_positive_rate) measures how often your model labels an entry as positive when it is actually negative: the number of false positives divided by the total number of actual negatives.
FPR is a good metric when the cost of reacting to an incorrect positive is high, but a poor choice when missing a detection is costly (e.g., fast screening for disease).
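
To make the definition concrete, here is a small sketch that computes the FPR by hand for a binary problem. The label arrays are invented for illustration; the point is that FPR only looks at the actual negatives and says nothing about positives that were missed.

```python
import numpy as np

# Hypothetical binary labels: 1 = positive, 0 = negative
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 0])

false_pos = np.sum((y_true == 0) & (y_pred == 1))  # negatives flagged as positive
true_neg = np.sum((y_true == 0) & (y_pred == 0))   # negatives correctly rejected
false_neg = np.sum((y_true == 1) & (y_pred == 0))  # positives that were missed

# FPR = false positives / all actual negatives
fpr = false_pos / (false_pos + true_neg)
print(f'FPR: {fpr:.2f}, missed positives: {false_neg}')
```

Note that the missed positive does not affect the FPR at all, which is why it is a poor metric when missed detections matter.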

Our digit classification challenge is simple. We just want to get as many digits correct as possible.
Getting more "0"s correct is just as important as getting any other digit.
So, for that reason, we will use accuracy as a metric.

In [None]:
accuracy_score([0, 1, 0], [0, 1, 1])

Accuracy is great for humans to understand the quality of a model but has a big issue if it were used as a loss function: discontinuous derivatives.

Classification models produce probabilities of an entry (e.g., an image) being in a certain category (e.g., a certain digit), and accuracy scores do not use them well.
Small changes in the predicted probabilities either leave the accuracy unchanged or cause it to jump by a step.
The gradient is therefore zero almost everywhere and undefined at the jumps, which causes problems for gradient descent optimization.

So, instead, we use "log-loss," also known as "categorical cross-entropy."
Log-loss has smooth derivatives for all changes in probability, which is good for neural network optimization.
It also has the nice trait that predictions that are not just correct but more confidently correct are given better (lower) scores.
There are other classification quality metrics that have these properties, but log-loss is what we will use today.
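
The contrast between the two metrics can be sketched with a few lines of NumPy. The functions `accuracy` and `manual_log_loss` and the probability arrays below are written for this illustration only; `manual_log_loss` computes the mean negative log of the probability assigned to the true class, which is the standard definition of log-loss.

```python
import numpy as np

def accuracy(y_true, probs):
    # Fraction of entries where the most probable class is the true class
    return np.mean(np.argmax(probs, axis=1) == y_true)

def manual_log_loss(y_true, probs):
    # Mean negative log of the probability assigned to the true class
    true_class_probs = probs[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(true_class_probs))

# Hypothetical predictions: both are correct, one is more confident
y_true = np.array([0, 1])
hesitant = np.array([[0.6, 0.4], [0.4, 0.6]])
confident = np.array([[0.9, 0.1], [0.1, 0.9]])

print('Accuracy:', accuracy(y_true, hesitant), accuracy(y_true, confident))
print('Log-loss:', manual_log_loss(y_true, hesitant), manual_log_loss(y_true, confident))
```

Accuracy cannot tell the two sets of predictions apart, while log-loss rewards the more confident one with a lower score.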

In [None]:
assert log_loss([0], [[0.6, 0.4]], labels=[0, 1]) > log_loss([0], [[0.7, 0.3]], labels=[0, 1])

#### Classification Layers and Keras
As you will see, special activation functions are needed for performing classification in Keras.

- `softmax` is a good choice for multi-class (i.e., more than 2 classes) classification problems. 
  It takes a vector of real numbers in and returns them in the range [0, 1] with a sum of 1, which looks like a probability distribution.
- `sigmoid` is a good choice for binary classification (only 2 classes) as, like `softmax`, it produces a number on [0, 1]. However, it takes only a single number as input, which makes it simpler than `softmax` to evaluate.
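
For reference, both functions are easy to write down in NumPy. This is a sketch of the standard mathematical definitions, not Keras's internal implementation, and the input vector `z` is arbitrary.

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum improves numerical stability and does not change the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-1.0, 2.0, 0.5])  # arbitrary real-valued inputs
p = softmax(z)
print('Softmax output:', p, 'sum:', p.sum())

# Sigmoid is the two-class special case of softmax with the second input fixed at 0
print(sigmoid(2.0), softmax(np.array([2.0, 0.0]))[0])
```

The last line shows why `sigmoid` suffices for binary problems: it gives the same probability as a two-class `softmax` while needing only one input.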

Illustrating `softmax`

In [None]:
model = Sequential([layers.Dense(10, activation='softmax', input_shape=(2,))])

In [None]:
output = model.predict(np.array([[-1, 2]]))  # Not trained, so the outputs are meaningless

In [None]:
print('Note that all numbers are between 0 and 1:', output)

In [None]:
print('And they have a sum of 1 (or close to it): ', output.sum())

## <font color='red'>Part 1: Train a Fully Connected Neural Network</font>
Use what you learned from Exercise 2 to train a regular, fully-connected neural network with two hidden layers of 512 units.

**HINT**: You will need to use the `softmax` activation in the last layer only.

**HINT**: You will need a [Flatten](https://keras.io/layers/core/#flatten) layer to shape the data from a 28x28x1 array to a 1D vector (see below)

In [None]:
model = Sequential()
model.add(layers.Flatten())

In [None]:
assert model.predict(x_train[:1]).shape == (1, 784)

<font color='red'>At present the model just flattens the images. You need to add the rest</font>

In [None]:
assert model.count_params() == 669706 

<font color='red'>Compile and train the model using the RMSProp optimizer</font>

**HINT**: Use an appropriate [loss function](https://keras.io/losses/)

<font color='red'>Now, fit the model with enough epochs for it to converge and a reasonable batch size</font>

**HINT**: How can you prevent overfitting?

In [None]:
dnn_accuracy = accuracy_score(y_test, model.predict_classes(x_test))  # (y_true, y_pred) order
print(f'Accuracy on hold-out set: {dnn_accuracy * 100 : .2f}%')
assert dnn_accuracy > 0.975

## <font color='red'>Part 2: Make a CNN</font>
Our next step is to use a Convolutional Neural Network. 

The simple "convolution" plus "pooling" example in lecture is indeed simpler than the common types of CNNs seen in practice.
We did not have a last layer that performs the actual classification, and used a "max pool" that reduced the image down to a single value.
Typically, we want multiple layers of convolutions to learn very complex filters and do not want to reduce an image down to a single pixel between each stage.
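
To illustrate the two building blocks without any deep learning library, here is a minimal NumPy sketch of a single "valid" (no padding) convolution followed by a 2x2 max pool. The 6x6 image and the 3x3 filter are made up for illustration, and this is not the network you are asked to build below; it only shows how each operation changes the shape of the data.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image with no padding ("valid" mode)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool_2x2(feature_map):
    # Keep the largest value in each non-overlapping 2x2 block
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)  # hypothetical 6x6 "image"
kernel = np.array([[1.0, 0.0, -1.0]] * 3)         # hypothetical horizontal-gradient filter

features = conv2d_valid(image, kernel)  # 6x6 image, 3x3 filter -> 4x4 feature map
pooled = max_pool_2x2(features)         # 4x4 feature map -> 2x2 after pooling
print(features.shape, pooled.shape)
```

Pooling over small blocks shrinks the feature map gradually, leaving spatial structure for later convolutional layers to use.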

Your task is to train the network from [Muhammad Rizwan's tutorial](https://engmrk.com/convolutional-neural-network-3/) with ReLU activation functions: 

<img width=50% src="https://engmrk.com/wp-content/uploads/2018/09/Image-Architecture-of-Convolutional-Neural-Network.png"/>

**HINT**: Read the Keras documentation for [Convolutional](https://keras.io/layers/convolutional/) and [Pooling](https://keras.io/layers/pooling/) layers.

**HINT**: Your input shape is the shape of the image

<font color='red'>Make the model</font>

In [None]:
assert model.count_params() == 1111946

<font color='red'>Train it</font>

Compute the score

In [None]:
cnn_accuracy = accuracy_score(y_test, model.predict_classes(x_test))  # (y_true, y_pred) order
print(f'Accuracy on hold-out set: {cnn_accuracy * 100 : .2f}%')
assert cnn_accuracy > 0.985

The accuracy should be higher than your fully-connected neural network!