## Recognizing Handwritten Digits

For this task, we'll use MNIST (see http://yann.lecun.com/exdb/mnist/), a database of handwritten digits made up of a training set of 60,000 examples and a test set of 10,000 examples. Each MNIST image is greyscale and consists of 28x28 pixels.

Keras provides suitable libraries to load the dataset and split it into a training set and a test set, the latter used for assessing performance. The data is converted to `float32` to support GPU computation and normalized to `[0, 1]`. In addition, we load the true labels into `Y_train` and `Y_test` and one-hot encode them.
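The preprocessing described above can be sketched with NumPy alone. This is a minimal illustration on random stand-in data rather than the real MNIST arrays; the names `images`, `labels`, `x`, and `one_hot` are hypothetical and do not appear in the notebook code.

```python
import numpy as np

# Hypothetical stand-in for MNIST: 4 greyscale images of 28x28 pixels, values 0..255
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)
labels = np.array([3, 0, 9, 3])

# Flatten each image to 784 values, cast to float32, and scale to [0, 1]
x = images.reshape(len(images), 28 * 28).astype("float32") / 255

# One-hot encode: row i has a 1 in column labels[i] and 0 everywhere else
num_classes = 10
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(x.shape, x.dtype)
print(one_hot[0])
```

Indexing an identity matrix with the label vector is one compact way to build the one-hot matrix; the notebook itself relies on the Keras utility for this step.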

* The input layer has one neuron for each pixel in the MNIST images, for a total of 28 x 28 = 784 neurons;
* Typically, the values associated with each pixel are normalized in the range [0, 1] (which means that the intensity of each pixel is divided by 255, the maximum intensity value);
* The final layer has 10 neurons, one for each digit class, with activation function `softmax`, which is a generalization of the `sigmoid` function.
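To illustrate the statement that `softmax` generalizes `sigmoid`, here is a small NumPy sketch. The helper functions `softmax` and `sigmoid` are written for this example and are not the Keras implementations.

```python
import numpy as np

def softmax(z):
    # Subtracting the max improves numerical stability without changing the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Softmax turns arbitrary scores into probabilities that sum to 1
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())

# With two classes and scores [z, 0], the first softmax output equals sigmoid(z)
z = 1.5
two_class = softmax(np.array([z, 0.0]))
print(two_class[0], sigmoid(z))
```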

Once we have defined the model, we have to compile it so that it can be executed by the Keras backend (either Theano or TensorFlow). There are a few choices to be made during compilation:

* We need to select the `optimizer`, which is the algorithm used to update the weights while we train our model;
* We need to select the `objective function` that the optimizer uses to navigate the space of weights (objective functions are frequently called `loss functions`, and the process of optimization is defined as a process of loss minimization);
* We need to select the `metrics` used to evaluate the trained model.

Some common choices for metrics (a complete list of Keras metrics is at https://keras.io/metrics/) are as follows:

* **Accuracy**: This is the proportion of correct predictions with respect to the targets;
* **Precision**: This denotes how many of the selected items are relevant in a multilabel classification;
* **Recall**: This denotes how many of the relevant items are selected in a multilabel classification.

Metrics are similar to objective functions, with the only difference that they are not used for training a model but only for evaluating a model.
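As a rough illustration of the three metrics listed above, the following sketch computes them by hand for a binary problem. The arrays `y_true` and `y_pred` are made-up values chosen for this example, not output from the model.

```python
import numpy as np

# Hypothetical ground truth and predictions (1 = positive class, 0 = negative class)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

accuracy = np.mean(y_pred == y_true)  # correct predictions / all predictions
precision = tp / (tp + fp)            # selected items that are relevant
recall = tp / (tp + fn)               # relevant items that are selected

print(accuracy, precision, recall)
```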

Once the model is compiled, it can then be trained with the `fit()` function, which takes a few parameters:

* **epochs**: This is the number of times the model is exposed to the training set. During each epoch, the optimizer adjusts the weights so that the objective function is minimized;
* **batch_size**: This is the number of training instances observed before the optimizer performs a weight update.
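To make the relationship between these two parameters concrete, the sketch below counts the weight updates implied by the settings used in the first cell of this notebook (60,000 training images, a 0.2 validation split, batches of 128, 200 epochs). The variable names are illustrative only.

```python
import math

n_samples = 60000        # size of the MNIST training set
validation_split = 0.2   # fraction held out for validation
batch_size = 128
epochs = 200

n_val = int(n_samples * validation_split)   # samples reserved for validation
n_train = n_samples - n_val                 # samples actually used for training

# One weight update per batch; the last batch of an epoch may be smaller
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs

print(n_train, n_val, updates_per_epoch, total_updates)
```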

In [None]:
from __future__ import print_function
import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD
from keras.utils import np_utils
np.random.seed(1671) # for reproducibility

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# network and training
NB_EPOCH = 200
BATCH_SIZE = 128
VERBOSE = 1
NB_CLASSES = 10 # number of outputs
OPTIMIZER = SGD()
N_HIDDEN = 128
VALIDATION_SPLIT = 0.2 # how much training data is reserved for validation

(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
RESHAPED = 784

# X_train is 60000 rows of 28x28 values --> reshaped in 60000 x 784
X_train = X_train.reshape(60000, RESHAPED)
X_test = X_test.reshape(10000, RESHAPED)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Normalize
X_train /= 255
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(Y_train, NB_CLASSES)
Y_test = np_utils.to_categorical(Y_test, NB_CLASSES)

# Creates the model
model = Sequential()
model.add(Dense(NB_CLASSES, input_shape=(RESHAPED,)))
model.add(Activation('softmax'))
model.summary()

# Selects the optimizer and the evaluation metrics.
model.compile(loss='categorical_crossentropy', optimizer=OPTIMIZER, metrics=['accuracy'])

# Trains the model
history = model.fit(X_train, Y_train,
                    batch_size=BATCH_SIZE,
                    epochs=NB_EPOCH,
                    verbose=VERBOSE,
                    validation_split=VALIDATION_SPLIT)

# Evaluates the model
score = model.evaluate(X_test, Y_test, verbose=VERBOSE)
print("Test score:", score[0])
print('Test accuracy:', score[1])

### Insights
* The network is trained on 48,000 samples, and 12,000 are reserved for validation;
* Once the neural model is built, it is then tested on 10,000 samples;
* The program runs for 200 epochs, and the accuracy improves with each one.

With this simple network, a bit less than one handwritten digit out of ten is not correctly recognized. We can certainly do better than that.

## Improving our neural network

* A first improvement is to add more layers to our network;
* After the input layer, we add a first dense layer with `N_HIDDEN` neurons and a `relu` activation function;
* This layer is called _hidden_ because it is not directly connected to either the input or the output;
* After the first hidden layer, we add a second hidden layer, again with `N_HIDDEN` neurons, followed by an output layer with 10 neurons.
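As a quick check on the size of the architecture described above (784 inputs, two hidden layers of 128 neurons, 10 outputs), the sketch below counts the trainable parameters of each dense layer. The helper `dense_params` is hypothetical; the totals can be compared against the output of `model.summary()`.

```python
def dense_params(n_inputs, n_units):
    # Each unit has one weight per input plus one bias
    return n_inputs * n_units + n_units

# Input size, two hidden layers with N_HIDDEN = 128, and 10 output classes
layer_sizes = [784, 128, 128, 10]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)
```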

In [None]:
from __future__ import print_function
import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD
from keras.utils import np_utils
np.random.seed(1671) # for reproducibility

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# Network and training
NB_EPOCH = 20
BATCH_SIZE = 128
VERBOSE = 1
NB_CLASSES = 10 # number of outputs
OPTIMIZER = SGD()
N_HIDDEN = 128
VALIDATION_SPLIT = 0.2 # how much training data is reserved for validation

(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
RESHAPED = 784

# X_train is 60000 rows of 28x28 values --> reshaped in 60000 x 784
X_train = X_train.reshape(60000, RESHAPED)
X_test = X_test.reshape(10000, RESHAPED)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Normalize
X_train /= 255
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(Y_train, NB_CLASSES)
Y_test = np_utils.to_categorical(Y_test, NB_CLASSES)

model = Sequential()
model.add(Dense(N_HIDDEN, input_shape=(RESHAPED,)))
model.add(Activation('relu'))
model.add(Dense(N_HIDDEN))
model.add(Activation('relu'))
model.add(Dense(NB_CLASSES))
model.add(Activation('softmax'))
model.summary()

# Selects the optimizer and the evaluation metrics.
model.compile(loss='categorical_crossentropy', optimizer=OPTIMIZER, metrics=['accuracy'])

# Trains the model
history = model.fit(X_train, Y_train,
                    batch_size=BATCH_SIZE, epochs=NB_EPOCH,
                    verbose=VERBOSE, validation_split=VALIDATION_SPLIT)

# Evaluates the model
score = model.evaluate(X_test, Y_test, verbose=VERBOSE)
print("Test score:", score[0])
print('Test accuracy:', score[1])

## Further improving our neural network

* The second improvement is to randomly drop, with the dropout probability, some of the values propagated inside our internal dense network of hidden layers;
* In machine learning, this is a well-known form of regularization;
* Surprisingly enough, this idea can improve our performance.
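The idea can be sketched in NumPy as follows. This is a minimal version of so-called inverted dropout, where the surviving values are scaled up so that the expected activation stays the same; it is an illustration under that assumption, not the Keras `Dropout` layer itself, and the function name `dropout` is hypothetical.

```python
import numpy as np

def dropout(activations, rate, rng):
    # Keep each value with probability (1 - rate), zero it otherwise
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Scale the survivors so the expected value of the output is unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(1671)
hidden = np.ones((1000, 128), dtype="float32")  # stand-in for hidden-layer outputs

dropped = dropout(hidden, rate=0.3, rng=rng)
zero_fraction = np.mean(dropped == 0)

print(zero_fraction, dropped.mean())
```

Roughly 30% of the values are zeroed, while the mean of the output stays close to the mean of the input. Dropout is applied only during training; at evaluation time all values are propagated.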

**Note:** first try training the network with `NB_EPOCH` set to 20. Training accuracy should be above test accuracy; otherwise, we are not training long enough. After testing with 20, set `NB_EPOCH` to 250 and compare the results.

In [None]:
from __future__ import print_function
import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD
from keras.utils import np_utils
np.random.seed(1671) # for reproducibility

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# Network and training
NB_EPOCH = 250
BATCH_SIZE = 128
VERBOSE = 1
NB_CLASSES = 10 # number of outputs
OPTIMIZER = SGD()
N_HIDDEN = 128
VALIDATION_SPLIT = 0.2 # how much training data is reserved for validation
DROPOUT = 0.3

(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
RESHAPED = 784

# X_train is 60000 rows of 28x28 values --> reshaped in 60000 x 784
X_train = X_train.reshape(60000, RESHAPED)
X_test = X_test.reshape(10000, RESHAPED)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Normalize
X_train /= 255
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(Y_train, NB_CLASSES)
Y_test = np_utils.to_categorical(Y_test, NB_CLASSES)

model = Sequential()
model.add(Dense(N_HIDDEN, input_shape=(RESHAPED,)))
model.add(Activation('relu'))
model.add(Dropout(DROPOUT))
model.add(Dense(N_HIDDEN))
model.add(Activation('relu'))
model.add(Dropout(DROPOUT))
model.add(Dense(NB_CLASSES))
model.add(Activation('softmax'))
model.summary()

# Selects the optimizer and the evaluation metrics.
model.compile(loss='categorical_crossentropy', optimizer=OPTIMIZER, metrics=['accuracy'])

# Trains the model
history = model.fit(X_train, Y_train,
                    batch_size=BATCH_SIZE, epochs=NB_EPOCH,
                    verbose=VERBOSE, validation_split=VALIDATION_SPLIT)

# Evaluates the model
score = model.evaluate(X_test, Y_test, verbose=VERBOSE)
print("Test score:", score[0])
print('Test accuracy:', score[1])

## Getting Started with Keras

### What is a tensor?

* A tensor is a multidimensional array (a matrix is the two-dimensional case);
* Keras uses either Theano or TensorFlow to perform very efficient computations on tensors;
* Both backends are capable of efficient symbolic computations on tensors, which are the fundamental building blocks for creating neural networks.
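A few NumPy arrays of different ranks give a feel for what this means in practice. NumPy is used here only as an illustration; the Keras backends have their own tensor types, and the variable names are hypothetical.

```python
import numpy as np

scalar = np.array(5.0)            # rank 0: a single number
vector = np.zeros(784)            # rank 1: one flattened MNIST image
matrix = np.zeros((128, 784))     # rank 2: a batch of 128 flattened images
batch = np.zeros((128, 28, 28))   # rank 3: a batch of 128 images of 28x28 pixels

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("batch", batch)]:
    print(name, t.ndim, t.shape)

# Reshaping changes the rank but keeps the same underlying values
flat = batch.reshape(128, -1)
print(flat.shape)
```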

### Predefined Neural Network Layers

* **Regular dense**: A dense layer is a fully connected neural network layer;
* **Recurrent neural networks -- simple, LSTM, and GRU**: Recurrent neural networks are a class of neural networks that exploit the sequential nature of their input. Such inputs could be text, speech, time series, or anything else where the occurrence of an element in the sequence depends on the elements that appeared before it;
* **Convolutional and pooling layers**: ConvNets are a class of neural networks that use convolutional and pooling operations to progressively learn rather sophisticated models based on progressive levels of abstraction. They resemble the vision models that have evolved over millions of years inside the human brain. A few years ago, networks with 3-5 layers were called deep; now the count has gone up to 100-200;
* **Regularization**: Regularization is a way to prevent overfitting. Multiple layers have parameters for regularization. One example is `Dropout`, but there are others;
* **Batch normalization**: This is a way to accelerate learning and generally achieve better accuracy.

### Loss functions

Loss functions (also called objective functions or optimization score functions) can be classified into four categories:

* **Accuracy**, which is used for classification problems;
* **Error loss**, which measures the difference between the values predicted and the values actually observed. There are multiple choices: `mse` (mean squared error), `rmse` (root mean squared error), `mae` (mean absolute error), `mape` (mean absolute percentage error), and `msle` (mean squared logarithmic error);
* **Hinge loss**, which is generally used for training classifiers;
* **Class loss**, which is used to calculate the cross-entropy for classification problems (see https://en.wikipedia.org/wiki/Cross_entropy).
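The following sketch writes out three of these losses in NumPy to show what they compute. The functions are hand-written for illustration and are not the Keras implementations; the clipping constant `eps` and the sample arrays are assumptions made for this example.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average of squared differences
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean absolute error: average of absolute differences
    return np.mean(np.abs(y_true - y_pred))

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip predictions to avoid log(0), then average the per-sample cross-entropy
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return np.mean(-np.sum(y_true * np.log(y_pred), axis=1))

# Two samples, three classes: one-hot targets and predicted probabilities
y_true = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
y_pred = np.array([[0.1, 0.8, 0.1], [0.6, 0.3, 0.1]])

print(mse(y_true, y_pred), mae(y_true, y_pred))
print(categorical_crossentropy(y_true, y_pred))
```

With one-hot targets, the cross-entropy reduces to the negative log of the probability assigned to the correct class, averaged over the samples.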

### Metrics

A metric function is similar to an objective function. The only difference is that the results from evaluating a metric are not used when training the model.

### Optimizers

Optimizers include `SGD`, `RMSprop`, and `Adam`.
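To give a sense of what an optimizer does, here is a minimal sketch of the plain gradient descent update that underlies `SGD`, applied to a toy one-parameter loss. It omits momentum and mini-batching, and the function `sgd_step`, the loss `(w - 3)^2`, and the learning rate of 0.1 are all assumptions chosen for this example.

```python
import numpy as np

def sgd_step(weights, grads, lr):
    # Move the weights a small step against the gradient of the loss
    return weights - lr * grads

# Toy loss: (w - 3)^2, which is minimized at w = 3
w = np.array([0.0])
for _ in range(100):
    grad = 2 * (w - 3.0)   # derivative of the toy loss with respect to w
    w = sgd_step(w, grad, lr=0.1)

print(w)
```

`RMSprop` and `Adam` follow the same overall loop but adapt the step size per weight using running statistics of past gradients.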