# MNIST

Let's dive right in and train our first model!

The problem we are solving here is to classify grayscale images of handwritten digits (28x28 pixels) into 10 different categories (0 through 9). 

We’ll use the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset. 

It’s a set of 60,000 training images, plus 10,000 test images.

Here are some example images from MNIST:

<img src="images/mnist-1.jpg" height="450" width="600"/>

Note the different train and test datasets. 

We'll train the model on the train dataset and evaluate it on the test dataset. 

This separation is very important for a correct measurement of the model performance. 

We want to measure how well the model performs on data it has never seen before. This property is called **generalization**. 

At some point in the training process the model will overfit on the training data and as a result the performance on the test data will get worse.

Some ML terminology:

 * data points are called **examples**, usually denoted as $x$ for a single example or $X$ for multiple examples
 * a category in a classification problem is called a **class**
 * the class associated with a specific example is called a **label**, usually denoted as $y$



### Load the MNIST dataset

In [None]:
from tensorflow import keras
from tensorflow.keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
print('Train images:', train_images.shape)
print('Train labels:', train_labels.shape)
print('Test images:', test_images.shape)
print('Test labels:', test_labels.shape)

### Plot some training examples
What does the data look like?

In [None]:
print(train_images[0])

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

for i in range(3):
    plt.imshow(train_images[i], cmap='gray')
    print('Label:', train_labels[i])
    plt.show()


`train_images` and `train_labels` form the training set, the data that the model will learn from. 

The model will then be tested on the test set, `test_images` and `test_labels`.

The images are encoded as NumPy arrays, and the labels are an array of digits, ranging from 0 to 9. 

The images and labels have a one-to-one correspondence.
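The layout described above can be sketched with small synthetic arrays standing in for the real data. The names `images` and `labels` and the random pixel values are illustrative assumptions, not part of the MNIST dataset:

```python
import numpy as np

# Synthetic stand-in for MNIST: 5 grayscale "images" of 28x28 pixels,
# stored as unsigned 8-bit integers in the range 0..255.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# One integer label (0..9) per image.
labels = np.array([5, 0, 4, 1, 9], dtype=np.uint8)

# One-to-one correspondence: images[i] belongs to labels[i],
# so both arrays must have the same length along the first axis.
assert images.shape[0] == labels.shape[0]

for i in range(len(labels)):
    print('image', i, 'shape', images[i].shape, 'label', labels[i])
```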

The workflow will be as follows: 

 1. We’ll feed the neural network the training data, `train_images` and `train_labels`. 
 2. The network will then learn to associate images and labels. 
 3. We’ll ask the network to produce predictions for `test_images`, and we’ll verify whether these predictions match the labels from `test_labels`.

## Build the model

The core building block of neural networks is the **layer**. 

A layer is a function with some input and some output. 

In the following example, our network consists of a sequence of two **dense layers**, also sometimes called fully connected layers.

A dense layer implements the following function: 

$$ 
output = activation(W \cdot input + b) 
$$

$input$ and $output$ are vectors.

$\cdot$ is the dot product.

$W$ is a matrix and $b$ is a vector. They are called the **parameters** of the layer. 

If you think of a layer as a stateful function, then $W$ and $b$ are its state.
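As a rough illustration, the dense layer formula can be written in a few lines of NumPy. This is a minimal sketch, not how Keras implements it: the helper name `dense` and the toy sizes are assumptions, and $W$ is stored with one row per output unit to match the formula above (Keras stores its weight matrix in the transposed layout).

```python
import numpy as np

def dense(inputs, W, b, activation):
    """Compute activation(W . inputs + b) for a single input vector."""
    return activation(W @ inputs + b)

# Toy sizes: 4 input features, 3 output units.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # one row of weights per output unit
b = np.zeros(3)               # one bias per output unit
x = rng.normal(size=4)

# With the identity as activation, the layer is a plain affine function.
out = dense(x, W, b, activation=lambda z: z)
print(out.shape)  # (3,)
```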

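The **activation function** is applied to $W \cdot input + b$ and produces the output of the layer. 

In most cases the activation function is non-linear. Without a non-linearity, a stack of dense layers would collapse into a single linear function. 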

This model uses **Rectified Linear Unit (ReLU)** and **Softmax** activation functions.
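Both activation functions are short enough to sketch in NumPy. The function names `relu` and `softmax` below are local helpers for illustration, and subtracting the maximum inside `softmax` is a common numerical-stability trick rather than part of the definition:

```python
import numpy as np

def relu(z):
    # Negative entries become 0, positive entries pass through unchanged.
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtracting the maximum does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))     # the negative entry is clipped to 0
print(softmax(z))  # non-negative values that sum to 1
```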


In [None]:
from tensorflow.keras import models
from tensorflow.keras import layers

# define the model input
inputs = layers.Input((28 * 28,))

# define the first dense layer (hidden layer)
x = layers.Dense(units=256, activation='relu')(inputs)

# define the output layer
output = layers.Dense(units=10, activation='softmax')(x)

# define a model that holds everything together
model = models.Model(inputs=inputs, outputs=output)

The last layer uses a softmax activation function that returns a vector of 10 probability scores (summing to 1). 

Each score will be the probability that the input image belongs to one of our 10 digit classes.

The actual prediction will be the class with the highest probability.

In summary the model calculates the following function:

$$ 
probabilities = softmax(W_2 \cdot relu(W_1 \cdot input + b_1) + b_2)
$$

$W_1$, $b_1$, $W_2$ and $b_2$ are the parameters of the model.
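To make the formula concrete, here is a minimal NumPy sketch of the same computation for a single input. The parameters are random and untrained, the input is a random stand-in for a flattened image, and the helper names are assumptions for illustration, so the predicted class is meaningless; only the shapes and the flow of data matter.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)

# Randomly initialised (untrained) parameters with the shapes used above:
# 784 inputs -> 256 hidden units -> 10 classes.
W1, b1 = rng.normal(scale=0.05, size=(256, 784)), np.zeros(256)
W2, b2 = rng.normal(scale=0.05, size=(10, 256)), np.zeros(10)

x = rng.random(784)  # stand-in for one flattened, scaled image

probabilities = softmax(W2 @ relu(W1 @ x + b1) + b2)
prediction = int(np.argmax(probabilities))  # class with the highest probability

print(probabilities.shape)  # (10,)
print('predicted class:', prediction)
```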

Let's look at some details:

In [None]:
# print some details about each layer in the model
model.summary()

The model has about 200K parameters. In general this depends on the input size and the type and configuration of the individual layers. 

In our specific model, given that the input size and the number of classes are fixed, the number of parameters depends only on the `units` argument in the first dense layer.
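The parameter count can be checked by hand: a dense layer has one weight per (input, unit) pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this arithmetic:

```python
def dense_params(n_inputs, units):
    # One weight per (input, unit) pair plus one bias per unit.
    return n_inputs * units + units

hidden = dense_params(28 * 28, 256)   # 784 * 256 + 256 = 200960
output = dense_params(256, 10)        # 256 * 10 + 10 = 2570
total = hidden + output

print(hidden, output, total)  # 200960 2570 203530
```

Changing `256` to another value in both calls shows how the total scales with the size of the hidden layer.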

A Keras model has the following primary methods:

 * `fit()`: trains the model by 'fitting' the model parameters to the training data
 * `evaluate()`: measures the performance of the model by calculating evaluation metrics
 * `predict()`: computes the model output (here, the class probabilities) for a set of examples


To be able to train the model we need three more things:

 * **loss function -** this measures the model's prediction error, the difference between the prediction and the ground truth.
 * **optimizer -** an algorithm through which the model will update its parameters based on the training examples it sees and the loss calculated by the loss function
 * **other metrics -** in this case we only care about **accuracy**, the fraction of the images that were correctly classified
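The loss and the accuracy metric can be sketched in NumPy on a handful of toy predictions. This is a simplified illustration of what a sparse categorical cross-entropy computes, assuming integer labels and rows of probabilities; it leaves out details such as the clipping a library implementation applies, and the function names are local helpers:

```python
import numpy as np

def sparse_categorical_crossentropy(probabilities, labels):
    # Mean negative log-probability that the model assigned to the true class.
    true_class_probs = probabilities[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(true_class_probs)))

def accuracy(probabilities, labels):
    # Fraction of examples whose most probable class equals the label.
    return float(np.mean(np.argmax(probabilities, axis=1) == labels))

# Three toy examples with 3 classes each (rows sum to 1).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3]])
labels = np.array([0, 1, 2])

print('loss:', sparse_categorical_crossentropy(probs, labels))
print('accuracy:', accuracy(probs, labels))  # 2 of 3 predictions are correct
```

The loss is small when the model puts high probability on the true class, while accuracy only looks at which class is ranked first.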
 
The model must be compiled to be ready for training.

In [None]:
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

## Preprocessing

Before we can start the training we need to preprocess the images. 

The model expects each train/test example as a vector, therefore we flatten the images. 

We also transform the image values to the range [0, 1].

In [None]:
print('Before preprocessing:')
print('Train images:', train_images.shape, train_images.dtype)
print('Test images:', test_images.shape, test_images.dtype)

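# flatten each 28x28 image into a vector of 784 values;
# -1 lets NumPy infer the number of images
train_images = train_images.reshape((-1, 28 * 28))
# convert to float and scale the pixel values from [0, 255] to [0, 1]
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((-1, 28 * 28))
test_images = test_images.astype('float32') / 255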

print('\nAfter preprocessing:')
print('Train images:', train_images.shape, train_images.dtype)
print('Test images:', test_images.shape, test_images.dtype)

We’re now ready to train the network. This is done by calling the model's `fit()` method, which fits the model to its training data. 

`fit()` takes the following arguments:

 * `train_images` - the examples to train on
 * `train_labels` - the labels to calculate the loss
 * `epochs` - number of iterations over the training data
 * `batch_size` - number of examples to process in one step
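The relationship between `epochs` and `batch_size` determines how many parameter updates happen during training. A small sketch of the arithmetic, assuming the 60,000 training images and the values `epochs=5` and `batch_size=128` used in this notebook:

```python
import math

n_examples = 60000
batch_size = 128
epochs = 5

# The last batch of an epoch may be smaller than batch_size.
steps_per_epoch = math.ceil(n_examples / batch_size)
last_batch_size = n_examples - (steps_per_epoch - 1) * batch_size
total_steps = epochs * steps_per_epoch

print(steps_per_epoch, last_batch_size, total_steps)  # 469 96 2345
```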

## Train the model

In [None]:
h = model.fit(train_images, train_labels, epochs=5, batch_size=128)

Two quantities are displayed during training: 
 * `loss`: the loss of the network over the training data
 * `accuracy` (shown as `acc` in older Keras versions): the accuracy of the network over the training data
 
Now let's check how the model performs on the test data:

## Evaluate the model

In [None]:
_, test_acc = model.evaluate(test_images, test_labels, verbose=0)
print('Test accuracy:', test_acc)

The accuracy is the fraction of correct predictions, a value between 0 and 1. 

The test accuracy might be lower than the train accuracy. 

This is expected and is called **overfitting**: during training the model learns aspects of the examples that are specific to the training data and do not carry over to new data.

On a real-world problem it is usually not possible to train a model with perfect test accuracy. 

Having a test accuracy close to 1.0 usually means you have a bug in your preprocessing or training code and the model does not generalize.

## Plot a confusion matrix
The accuracy metric is a single number computed over the examples of all classes together. 

The **confusion matrix** is a tool to better understand the predictive performance of a model on individual classes. 

It breaks the evaluation result down and shows the correct and incorrect predictions for individual classes.

In [None]:
import numpy as np
import pandas as pd
probabilities = model.predict(test_images)
predictions = np.argmax(probabilities, axis=1)
confusion_matrix = pd.crosstab(test_labels, predictions, rownames=['True'], colnames=['Predicted'], margins=False)
print(confusion_matrix)


The vertical axis represents the true labels and the horizontal axis represents the predicted labels. 

The values in the matrix count how many times each true label has been predicted as each of the 10 classes. 

The main diagonal holds the correct predictions. This is where a good model will have most of the counts.
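How such a matrix is built can be sketched with NumPy on a few toy labels. This is an illustrative alternative to `pd.crosstab`, with made-up labels and three classes instead of ten; the function name `confusion_matrix_counts` is hypothetical:

```python
import numpy as np

def confusion_matrix_counts(true_labels, predicted_labels, n_classes):
    # counts[t, p] = number of examples with true class t predicted as p
    counts = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(counts, (true_labels, predicted_labels), 1)
    return counts

true_labels = np.array([0, 0, 1, 1, 2, 2, 2])
predicted = np.array([0, 0, 1, 2, 2, 2, 1])

counts = confusion_matrix_counts(true_labels, predicted, n_classes=3)
print(counts)

# Diagonal = correct predictions; dividing by the row sums gives the
# fraction of each true class that was classified correctly.
per_class = counts.diagonal() / counts.sum(axis=1)
print(per_class)
```

Here class `0` is always classified correctly, while classes `1` and `2` are sometimes confused with each other, which shows up as the off-diagonal counts.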

In a typical run you can see that `4` and `9` are often confused with each other. 

This makes sense because the two digits have a similar shape. 

On the other hand `0` and `1` are rarely, if ever, confused with each other.

In [None]:
def plot_confusion_matrix(df_confusion, cmap=plt.cm.YlOrRd):
    plt.matshow(df_confusion, cmap=cmap)
    plt.colorbar()
    tick_marks = np.arange(len(df_confusion.columns))
    plt.xticks(tick_marks, df_confusion.columns, rotation=45)
    plt.yticks(tick_marks, df_confusion.index)
    plt.ylabel(df_confusion.index.name)
    plt.xlabel(df_confusion.columns.name)

plot_confusion_matrix(confusion_matrix)

## Use the model for prediction

Generate predictions for the first 5 images of the test dataset.

In [None]:
import numpy as np

probabilities = model.predict(test_images[0:5])
predicted_labels = np.argmax(probabilities, axis=1)

for i in range(predicted_labels.shape[0]):
    print('label:', test_labels[i], 'prediction:', predicted_labels[i])
    print(probabilities[i])
    plt.imshow(test_images[i].reshape(28,28), cmap='gray')
    plt.show()


## Plot some mispredictions

In [None]:
import numpy as np

# predict the first 400 test images in one batch instead of one call per image
probabilities = model.predict(test_images[:400])
predicted_labels = np.argmax(probabilities, axis=1)

# loop only over the indices where the prediction differs from the true label
for i in np.flatnonzero(predicted_labels != test_labels[:400]):
    print('label:', test_labels[i], 'prediction:', predicted_labels[i])
    print(probabilities[i])
    plt.imshow(test_images[i].reshape(28, 28), cmap='gray')
    plt.show()



**Note:** There is a dataset from Zalando called [fashion-mnist](https://github.com/zalandoresearch/fashion-mnist) that is a drop-in replacement for the original MNIST dataset but more challenging.