## Recurrence, Depth and High-dimensional data
# Keras ANN notebook

In this notebook we introduce the MNIST dataset, and present basic methods to set up and train an *artificial neural network (ANN)* with the Keras package.

The following elements will be presented:

* MNIST dataset
* data pre-processing: size reduction, scaling
* shallow ANN: setup and training
* loss and accuracy graphs
* optimizer options: learning rate
* object-oriented interface
* deep ANN: setup and training
* generating network architecture graphs

**References:**
* [The MNIST database](http://yann.lecun.com/exdb/mnist/) of handwritten digits
* [Are we there yet?](http://rodrigob.github.io/are_we_there_yet/build/)
* [Keras](https://keras.io/): The Python Deep Learning library
* Getting started with the Keras [functional API](https://keras.io/getting-started/functional-api-guide/)
* Getting started with the Keras Sequential model ([object oriented](https://keras.io/getting-started/sequential-model-guide/))
* Keras [model visualization](https://keras.io/visualization/)
* An [overview](http://sebastianruder.com/optimizing-gradient-descent/) of gradient descent optimization algorithms
* [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/)
* colah's [blog](http://colah.github.io/)

*Please execute the cell below in order to initialize the notebook environment*

In [None]:
%autosave 0
# %matplotlib inline
%matplotlib notebook

from __future__ import division, print_function
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import mod3

plt.rcParams.update({'figure.figsize': (5.0, 4.0), 'lines.linewidth': 2.0})

## MNIST dataset

The MNIST dataset has the following properties:
* fixed-size images of handwritten digits
* images are size-normalized and centered
* training set of 60,000 samples, test set of 10,000 samples
* test samples are from different writers

The MNIST dataset has enough complexity to apply key machine learning concepts while not being too computationally intensive to train.

**EXERCISE 1**

The Keras framework provides access to several popular datasets, with the module `keras.datasets`. The MNIST dataset is loaded as follows:
```
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
```

Load the MNIST dataset and review its basic properties.


**INSTRUCTIONS**
* load the MNIST dataset with `mnist.load_data()`
* print the shapes of `x_train`, `y_train`, `x_test` and `y_test`
* plot a few random samples from the training set using `plt.imshow(img, cmap=plt.cm.gray)`
* plot the distribution of pixel values in the train set

In [None]:
# import MNIST dataset
import keras
from keras.datasets import mnist

(x_train_orig, y_train), (x_test_orig, y_test) = mnist.load_data()

# print dataset properties
print('[Dataset properties]')
print('train set data shape:', x_train_orig.shape)
print('train set label shape:', y_train.shape)
print('test set data shape:', x_test_orig.shape)
print('test set label shape:', y_test.shape)
print()

# print sample properties
print('[Sample properties]')
print('label:', y_train[0])
print('shape:', x_train_orig[0].shape)
print('min:', x_train_orig[0].min())
print('max:', x_train_orig[0].max())

# select n_show samples randomly
n_show = 9
selected = np.random.randint(0, high=len(x_train_orig), size=n_show)

# plot samples
plt.figure(figsize=(9, 1))
for idx, img in enumerate(x_train_orig[selected]):
    plt.subplot(1, n_show, idx+1)
    plt.imshow(img, cmap=plt.cm.gray)
    plt.gca().get_xaxis().set_visible(False)
    plt.gca().get_yaxis().set_visible(False)
plt.tight_layout()
plt.show()

# plot the distribution of pixel values in the train set
plt.figure()
plt.title('Distribution of pixel values')    
plt.hist(x_train_orig.flatten(), bins=16)
plt.tight_layout()
plt.show()

**EXPECTED OUTPUT**
```
[Dataset properties]
train set data shape: (60000, 28, 28)
train set label shape: (60000,)
test set data shape: (10000, 28, 28)
test set label shape: (10000,)

[Sample properties]
label: 5
shape: (28, 28)
min: 0
max: 255
```
<img src="fig/keras_ann_mnist.png" style="width:90%;height:90%;display:inline;margin:1px">
<img src="fig/keras_ann_mnist_hist.png" style="width:50%;height:50%;display:inline;margin:1px">

### Data pre-processing

**EXERCISE 2**

Several pre-processing steps should take place before training the MNIST dataset.

* reduce pixel count by factor of 4, in order to train the dataset on CPU
* scale pixel intensities between 0 and 1
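
As a small illustration of these two steps, the sketch below applies strided slicing and scaling to a made-up batch of two 4x4 "images". The array `toy` is hypothetical and not part of MNIST; it only shows that keeping every second row and every second column leaves a quarter of the pixels.

```python
import numpy as np

# hypothetical toy batch: 2 images of 4x4 pixels with values in 0..255
toy = np.arange(32, dtype='uint8').reshape(2, 4, 4) * 8

# keep every second row and every second column -> 4x fewer pixels
toy_small = toy[:, ::2, ::2]

# scale intensities to the range [0, 1]
toy_scaled = toy_small.astype('float32') / 255.

print(toy.shape, '->', toy_small.shape)
print(toy_scaled.min(), toy_scaled.max())
```

Note that this kind of subsampling simply discards pixels; it does not average neighbouring values.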

**INSTRUCTIONS**
* drop one in every two pixels with smart indexing kung-fu 
* scale pixel intensities between 0 and 1
* print the shapes of `x_train`, `y_train`, `x_test` and `y_test`
* plot a few samples
* plot the distribution of the pixel values in the train set

In [None]:
# reduce pixel count by factor of 4
x_train = x_train_orig[:, ::2, ::2].copy()
x_test = x_test_orig[:, ::2, ::2].copy()

# scale intensities between 0 and 1
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.

print('train set shape:', x_train.shape)
print('test set shape:', x_test.shape)

# plot samples
plt.figure(figsize=(9, 0.75))
for idx, img in enumerate(x_train[selected]):
    plt.subplot(1, n_show, idx+1)
    plt.imshow(img, cmap=plt.cm.gray)
    plt.gca().get_xaxis().set_visible(False)
    plt.gca().get_yaxis().set_visible(False)
plt.tight_layout()
plt.show()

# plot the distribution of pixel values in the train set
plt.figure()
plt.title('Distribution of pixel values')
plt.hist(x_train.flatten(), bins=16)
plt.tight_layout()
plt.show()

**EXPECTED OUTPUT**
```
train set shape: (60000, 14, 14)
test set shape: (10000, 14, 14)
```
<img src="fig/keras_ann_mnist_small.png" style="width:90%;height:90%;display:inline;margin:1px">
<img src="fig/keras_ann_mnist_hist_scaled.png" style="width:50%;height:50%;display:inline;margin:1px">

## Shallow ANN

<img src="fig/keras_ann_shallow_schema.png" style="width:50%;height:50%;display:inline;margin:1px">

### Adapting dataset format for ANN encoding

As shown in the graph above, the inputs to the ANN are arranged as a vector of pixels, and the outputs are arranged as a vector of labels. The outputs are encoded as *1-out-of-n*, where all units are $0$ except for the unit corresponding to the class, e.g. label $2$ is encoded as $(0, 0, 1, 0, 0, 0, 0, 0, 0, 0)$.
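
To make the encoding concrete, here is a minimal NumPy sketch. The helper `one_hot` is a hypothetical name introduced for illustration; it is meant to mimic what a 1-out-of-n encoder does for integer labels, not to reproduce the Keras implementation.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return a (len(labels), n_classes) array with a single 1 per row."""
    encoded = np.zeros((len(labels), n_classes), dtype='float32')
    # for each row, set the column given by the label to 1
    encoded[np.arange(len(labels)), labels] = 1.
    return encoded

# label 2 activates the third unit only
print(one_hot(np.array([2]), 10))
```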

**INSTRUCTIONS**
* use function `keras.utils.to_categorical()` to transform the labels to *1-out-of-n* encoding
* transform the samples to the required shape with smart indexing kung-fu (totally unrelated to python package `kungfu`)

In [None]:
# convert numeric class to 1-of-n binary vector
n_out = 10
labels_train = keras.utils.to_categorical(y_train, n_out)
labels_test = keras.utils.to_categorical(y_test, n_out)

# transform image to vector
input_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
input_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))
input_train_shape = input_train.shape[1:]

print('train set data shape:', input_train.shape)
print('train set label shape:', labels_train.shape)
print('test set data shape:', input_test.shape)
print('test set label shape:', labels_test.shape)

**EXPECTED OUTPUT**
```
train set data shape: (60000, 196)
train set label shape: (60000, 10)
test set data shape: (10000, 196)
test set label shape: (10000, 10)
```

### Network setup

Setting up a shallow ANN uses the following elements:

* `keras.layers.Input()` sets the input
* `keras.layers.Dense()` adds a fully connected layer
* `keras.models.Model()` defines the ANN model
* `compile()` method of `keras.models.Model` implements the ANN in TensorFlow
* `summary()` method of `keras.models.Model` prints the ANN structure

In [None]:
from keras.layers import Input, Dense
from keras.models import Model

n_out = 10
n_fc1 = 256

input_layer = Input(shape=(input_train_shape), name='input')
x = Dense(n_fc1, activation='sigmoid', name='fc1')(input_layer)
output_layer = Dense(n_out, activation='sigmoid', name='output')(x)

model = Model(input_layer, output_layer)
model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])
model.summary()

**EXPECTED OUTPUT**
```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input (InputLayer)           (None, 196)               0         
_________________________________________________________________
fc1 (Dense)                  (None, 256)               50432     
_________________________________________________________________
output (Dense)               (None, 10)                2570      
=================================================================
Total params: 53,002.0
Trainable params: 53,002.0
Non-trainable params: 0.0
_________________________________________________________________
```
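
The parameter counts in a summary like this can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_in, n_units):
    # one weight per (input, unit) pair, plus one bias per unit
    return n_in * n_units + n_units

n_in, n_fc1, n_out = 196, 256, 10
fc1 = dense_params(n_in, n_fc1)      # 50432
output = dense_params(n_fc1, n_out)  # 2570
print(fc1, output, fc1 + output)     # 50432 2570 53002
```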

### Network train

The ANN is trained by calling the method `fit()` and specifying the train set. The trained model is evaluated by calling the method `evaluate()` and specifying the test set.

The default named parameters of the `fit()` method are the following:
* `batch_size=32`
* `epochs=1`
* `verbose=1`
* `shuffle=True`
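
Assuming one weight update per mini-batch, `batch_size` determines how many updates the network receives in one epoch. The sketch below computes that number for the 60,000 MNIST train samples; `updates_per_epoch` is a hypothetical helper, not a Keras function.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    # the last batch may be smaller than batch_size, hence the ceiling
    return int(math.ceil(n_samples / float(batch_size)))

print(updates_per_epoch(60000, 32))   # 1875
print(updates_per_epoch(60000, 128))  # 469
```

A larger batch size therefore means fewer, less noisy updates per epoch, which is worth keeping in mind when comparing runs with different batch sizes.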

In [None]:
history = model.fit(input_train, labels_train)

evaluation = model.evaluate(input_test, labels_test, verbose=0)

print('\n[Train parameters]')
for item in history.params:
    print(item+':', history.params[item])
    
print('\n[Model evaluation]')
print('test', history.params['metrics'][0], format(np.mean(evaluation[0]), '.4f'))
print('test', history.params['metrics'][1], format(np.mean(evaluation[1]), '.4f'))

**EXPECTED OUTPUT**
```
Epoch 1/1
60000/60000 [==============================] - 4s - loss: 0.0981 - acc: 0.1559     

[Train parameters]
metrics: ['loss', 'acc']
samples: 60000
batch_size: 32
epochs: 1
do_validation: False
verbose: 1

[Model evaluation]
test loss 0.0895
test acc 0.1804
```
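
To help interpret these two numbers, here is a NumPy sketch of both metrics. It assumes that accuracy is computed by comparing the index of the largest output with the labelled class, which is the usual definition for 1-out-of-n labels; the function names and the three-sample example are hypothetical.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def categorical_accuracy(y_true, y_pred):
    # a sample counts as correct when the largest output is the labelled class
    return np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1))

# hypothetical mini-example: 3 samples with classes 2, 0 and 7
y_true = np.eye(10)[[2, 0, 7]]

# a network that outputs 0.1 for every unit, whatever the input
y_const = np.full((3, 10), 0.1)

print(round(mse(y_true, y_const), 4))        # 0.09
print(categorical_accuracy(y_true, y_true))  # 1.0
```

A loss close to 0.09 is what a network produces when every output sits near 0.1, which suggests that after a single epoch with the default optimizer settings the network has learned little beyond the average label.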

### Changing default train parameters

Retrain the ANN under the following conditions:

* 5 training epochs
* batch size of 128

**INSTRUCTIONS**
* Call the `fit()` method with relevant named parameters.

In [None]:
n_epochs = 5
n_batch_size = 128

history = model.fit(input_train, labels_train,
                    epochs=n_epochs,
                    batch_size=n_batch_size)

evaluation = model.evaluate(input_test, labels_test, verbose=0)

print('\n[Train parameters]')
for item in history.params:
    print(item+':', history.params[item])
    
print('\n[Model evaluation]')
print('test', history.params['metrics'][0], format(np.mean(evaluation[0]), '.4f'))
print('test', history.params['metrics'][1], format(np.mean(evaluation[1]), '.4f'))

**EXPECTED OUTPUT**
```
Epoch 1/5
60000/60000 [==============================] - 1s - loss: 0.0898 - acc: 0.1506     
Epoch 2/5
60000/60000 [==============================] - 1s - loss: 0.0897 - acc: 0.1620     
Epoch 3/5
60000/60000 [==============================] - 1s - loss: 0.0896 - acc: 0.1770     
Epoch 4/5
60000/60000 [==============================] - 1s - loss: 0.0895 - acc: 0.1924     
Epoch 5/5
60000/60000 [==============================] - 1s - loss: 0.0894 - acc: 0.2083     

[Train parameters]
metrics: ['loss', 'acc']
samples: 60000
batch_size: 128
epochs: 5
do_validation: False
verbose: 1

[Model evaluation]
test loss 0.0894
test acc 0.2213
```

### Resetting the ANN weights

You might have noticed that the ANN keeps updating its weights with each call to the `fit()` method.
Resetting the network requires redefining its structure and compiling it again.

**INSTRUCTIONS**
* Redefine the network, compile and train
* compare the performance of 1 epoch vs 5 epochs training

In [None]:
n_out = 10
n_fc1 = 256
n_epochs = 5
n_batch_size = 128

input_layer = Input(shape=(input_train_shape), name='input')
x = Dense(n_fc1, activation='sigmoid', name='fc1')(input_layer)
output_layer = Dense(n_out, activation='sigmoid', name='output')(x)

model = Model(input_layer, output_layer)
model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])

history = model.fit(input_train, labels_train,
                    epochs=n_epochs,
                    batch_size=n_batch_size)

evaluation = model.evaluate(input_test, labels_test, verbose=0)

print('\n[Model evaluation]')
print('test', history.params['metrics'][0], format(np.mean(evaluation[0]), '.4f'))
print('test', history.params['metrics'][1], format(np.mean(evaluation[1]), '.4f'))

**EXPECTED OUTPUT**
```
Epoch 1/5
60000/60000 [==============================] - 2s - loss: 0.1332 - acc: 0.0825     
Epoch 2/5
60000/60000 [==============================] - 2s - loss: 0.0917 - acc: 0.1470     
Epoch 3/5
60000/60000 [==============================] - 2s - loss: 0.0904 - acc: 0.1804     
Epoch 4/5
60000/60000 [==============================] - 2s - loss: 0.0901 - acc: 0.1964     
Epoch 5/5
60000/60000 [==============================] - 2s - loss: 0.0899 - acc: 0.2005     

[Model evaluation]
test loss 0.0899
test acc 0.2097
```

### Changing the learning rate $\eta$

Changing the learning rate $\eta$ requires creating an optimizer instance,
and passing it to the relevant optimizer option:
```
from keras import optimizers
sgd = optimizers.SGD(lr=2)
```
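
As a reminder of what $\eta$ does, plain gradient descent moves each weight against its gradient by a step proportional to $\eta$. The sketch below applies this rule to a hypothetical one-dimensional toy loss, $f(w) = w^2/2$, which is unrelated to the network and chosen only because its gradient is simply $w$.

```python
def gradient_descent(eta, w=1.0, n_steps=5):
    # minimise the toy loss f(w) = w**2 / 2, whose gradient is simply w
    for _ in range(n_steps):
        grad = w
        w = w - eta * grad
    return w

print(gradient_descent(eta=0.01))  # barely moves away from the start
print(gradient_descent(eta=0.5))   # 0.03125
```

The suitable range of $\eta$ depends on the loss surface, so this only illustrates the direction of the effect: a very small learning rate makes slow progress, while a too large one can overshoot the minimum.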

**INSTRUCTIONS**
* create an optimizer instance, set $\eta=2$, and pass it to the `compile()` method
* retrain network for 5 epochs

In [None]:
from keras import optimizers

n_out = 10
n_fc1 = 256
n_epochs = 5
n_batch_size = 128
eta = 2

input_layer = Input(shape=(input_train_shape), name='input')
x = Dense(n_fc1, activation='sigmoid', name='fc1')(input_layer)
output_layer = Dense(n_out, activation='sigmoid', name='output')(x)

sgd = optimizers.SGD(lr=eta)

model = Model(input_layer, output_layer)
model.compile(loss='mean_squared_error', optimizer=sgd, metrics=['accuracy'])

history = model.fit(input_train, labels_train,
                    epochs=n_epochs,
                    batch_size=n_batch_size,
                    shuffle=True)

evaluation = model.evaluate(input_test, labels_test, verbose=0)

print('\n[Model evaluation]')
print('test', history.params['metrics'][0], format(np.mean(evaluation[0]), '.4f'))
print('test', history.params['metrics'][1], format(np.mean(evaluation[1]), '.4f'))

**EXPECTED OUTPUT**
```
Epoch 1/5
60000/60000 [==============================] - 1s - loss: 0.0750 - acc: 0.5020     
Epoch 2/5
60000/60000 [==============================] - 2s - loss: 0.0444 - acc: 0.7879     
Epoch 3/5
60000/60000 [==============================] - 1s - loss: 0.0340 - acc: 0.8398     
Epoch 4/5
60000/60000 [==============================] - 1s - loss: 0.0294 - acc: 0.8578     
Epoch 5/5
60000/60000 [==============================] - 2s - loss: 0.0268 - acc: 0.8678     

[Model evaluation]
test loss 0.0248
test acc 0.8780
```

### Visualizing loss and accuracy

### Loss and accuracy visualisation (TensorBoard)

Open up a terminal and start a TensorBoard server that will read logs stored at `/tmp/ann`.

`tensorboard --logdir=/tmp/ann`

This allows us to monitor training in the TensorBoard web interface at http://127.0.0.1:6006

In [None]:
from keras.callbacks import TensorBoard

from keras import optimizers

n_out = 10
n_fc1 = 256
n_epochs = 5
n_batch_size = 128
eta = 2

input_layer = Input(shape=(input_train_shape), name='input')
x = Dense(n_fc1, activation='sigmoid', name='fc1')(input_layer)
output_layer = Dense(n_out, activation='sigmoid', name='output')(x)

sgd = optimizers.SGD(lr=eta)

model = Model(input_layer, output_layer)
model.compile(loss='mean_squared_error', optimizer=sgd, metrics=['accuracy'])

history = model.fit(input_train, labels_train,
                    epochs=n_epochs,
                    batch_size=n_batch_size,
                    validation_data=(input_test, labels_test),
                    verbose=0,
                    callbacks=[TensorBoard(log_dir='/tmp/ann')])

evaluation = model.evaluate(input_test, labels_test, verbose=0)

print('\n[Model evaluation]')
print('test', history.params['metrics'][0], format(np.mean(evaluation[0]), '.4f'))
print('test', history.params['metrics'][1], format(np.mean(evaluation[1]), '.4f'))

**EXPECTED OUTPUT**

<img src="fig/tensorboard_ann_1.png" style="display:inline;margin:1px"><img src="fig/tensorboard_ann_2.png" style="display:inline;margin:1px">
<img src="fig/tensorboard_ann_3.png" style="display:inline;margin:1px"><img src="fig/tensorboard_ann_4.png" style="display:inline;margin:1px">

```
[Model evaluation]
test loss 0.0249
test acc 0.8789
```

### Loss and accuracy visualisation (Animated plots)

The evolution of loss and accuracy can also be monitored by querying the method `evaluate()` multiple times per epoch. However, the `fit()` method only trains entire epochs. One solution is to split the train set into multiple chunks, and instruct `fit()` to train each chunk as if it was the entire train set.

First, let's create a looping structure that splits the train set into multiple chunks.

**INSTRUCTIONS**
* insert a loop in the code below that splits the list `indexes` into `n_eval` chunks of length `n_chunk`

In [None]:
n = 50
n_eval = 5
n_chunk = int(n/n_eval)

indexes = np.arange(n)

# insert looping kung-fu here
# (range works in both Python 2 and 3, unlike xrange)
for chunk in [indexes[i:i+n_chunk] for i in range(0, n, n_chunk)]:
    print(chunk)

**EXPECTED OUTPUT**
```
[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]
```
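
An alternative sketch uses `np.array_split`, which takes the number of chunks directly. This is offered as one possible approach rather than the intended solution of the exercise.

```python
import numpy as np

indexes = np.arange(50)
n_eval = 5

# split into n_eval chunks; sizes may differ by one if n is not divisible
for chunk in np.array_split(indexes, n_eval):
    print(chunk)
```

When `n` is not a multiple of `n_eval`, slicing in steps of `n_chunk` yields an extra, shorter chunk, whereas `np.array_split` always returns exactly `n_eval` chunks.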

In [None]:
# setup network
n_out = 10
n_fc1 = 256
n_epochs = 5
n_batch_size = 128
eta = 2
n = input_train.shape[0]
n_eval = 5
n_chunk = int(n/n_eval)

input_layer = Input(shape=(input_train_shape), name='input')
x = Dense(n_fc1, activation='sigmoid', name='fc1')(input_layer)
output_layer = Dense(n_out, activation='sigmoid', name='output')(x)

sgd = optimizers.SGD(lr=eta)

model = Model(input_layer, output_layer)
model.compile(loss='mean_squared_error', optimizer=sgd, metrics=['accuracy'])

evaluate_train = model.evaluate(input_train, labels_train, verbose=0)
evaluate_test = model.evaluate(input_test, labels_test, verbose=0)

loss_train = [evaluate_train[0]]
loss_test = [evaluate_test[0]]
accuracy_train = [evaluate_train[1]]
accuracy_test = [evaluate_test[1]]

x_range = [0]

fig=plt.figure(figsize=(9, 3))

gs = gridspec.GridSpec(1, 2)

fig1 = plt.subplot(gs[0])
plt.plot(x_range, loss_train, 'C0', alpha=0.8, label='loss train')
plt.plot(x_range, loss_test, 'C1', alpha=0.8, label='loss test')

plt.ylim([0, max(loss_train[0], loss_test[0])*1.1])
plt.xlim([0, n_epochs*1.1])
plt.title('Loss')
plt.xlabel('Epoch (n)')
plt.ylabel('Loss')
plt.legend()

fig2 = plt.subplot(gs[1])
plt.plot(x_range, accuracy_train, 'C0', alpha=0.8, label='accuracy train')
plt.plot(x_range, accuracy_test, 'C1', alpha=0.8, label='accuracy test')

plt.ylim([0, 1])
plt.xlim([0, n_epochs*1.1])
plt.title('Accuracy')
plt.xlabel('Epoch (n)')
plt.ylabel('Accuracy')
plt.legend()

plt.tight_layout()
fig.show()
fig.canvas.draw()

indexes = np.arange(n)
for epoch in range(n_epochs):
    
    np.random.shuffle(indexes)

    # insert looping kung-fu here
    
    for chunk in [indexes[i:i+n_chunk] for i in range(0, n, n_chunk)]:
    
        history = model.fit(input_train[chunk], labels_train[chunk],
                            epochs=1,
                            batch_size=n_batch_size,
                            verbose=0)
        
        evaluation = model.evaluate(input_test, labels_test, verbose=0)
        
        loss_train += history.history['loss']
        accuracy_train += history.history['acc']
        loss_test += [evaluation[0]]
        accuracy_test += [evaluation[1]]
        
        x_range += [x_range[-1]+len(chunk)/n_chunk/n_eval]

        fig1.plot(x_range, loss_train, 'C0', alpha=0.8, label='loss train')
        fig1.plot(x_range, loss_test, 'C1', alpha=0.8, label='loss test')
        fig2.plot(x_range, accuracy_train, 'C0', alpha=0.8, label='accuracy train')
        fig2.plot(x_range, accuracy_test, 'C1', alpha=0.8, label='accuracy test')
        fig.canvas.draw()        

print('\n[Model evaluation]')
print('test', history.params['metrics'][0], format(np.mean(evaluation[0]), '.4f'))
print('test', history.params['metrics'][1], format(np.mean(evaluation[1]), '.4f'))

**EXPECTED OUTPUT**
<img src="fig/ann_keras_plot_custom.png">

```
[Model evaluation]
test loss 0.0248
test acc 0.8787
```

### Object-oriented interface

There is also an object-oriented (OO) interface, as you can see in the example below.

In [None]:
from keras import optimizers
from keras.layers import Dense, Activation
from keras.models import Sequential

# setup network
n_out = 10
n_fc1 = 256
n_epochs = 5
n_batch_size = 128
eta = 2

model = Sequential()

model.add(Dense(units=n_fc1, input_dim=input_train_shape[0]))
model.add(Activation('sigmoid'))
model.add(Dense(units=n_out))
model.add(Activation('sigmoid'))

sgd = optimizers.SGD(lr=eta)
model.compile(loss='mean_squared_error', optimizer=sgd, metrics=['accuracy'])
model.summary()

**EXPECTED OUTPUT**
```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 256)               50432     
_________________________________________________________________
activation_1 (Activation)    (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                2570      
_________________________________________________________________
activation_2 (Activation)    (None, 10)                0         
=================================================================
Total params: 53,002.0
Trainable params: 53,002.0
Non-trainable params: 0.0
_________________________________________________________________

```

## Deep ANN

<img src="fig/keras_ann_deep_schema.png" style="width:50%;height:50%;display:inline;margin:1px">

### Network setup and train

Adding more layers to the network is done by stacking additional hidden layers.
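
Conceptually, stacking layers means that the output of one fully connected layer becomes the input of the next. The NumPy sketch below shows this forward pass for the layer sizes used in this section; the random weights and the names `sigmoid`, `forward` and `layers` are hypothetical and serve only to check that the shapes chain up.

```python
import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

def forward(x, layers):
    # each layer is a (weights, bias) pair; the output of one feeds the next
    for weights, bias in layers:
        x = sigmoid(np.dot(x, weights) + bias)
    return x

rng = np.random.RandomState(0)
sizes = [196, 64, 64, 64, 10]  # input, three hidden layers, output

# hypothetical random weights, not trained values
layers = [(rng.normal(scale=0.1, size=(n_in, n_units)), np.zeros(n_units))
          for n_in, n_units in zip(sizes[:-1], sizes[1:])]

batch = rng.rand(5, 196)
print(forward(batch, layers).shape)  # (5, 10)
```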

**INSTRUCTIONS**
* Add two additional fully connected layers to the ANN
* set layer size to 64
* set batch size to 32 and $\eta$ to 3
* retrain with batch size of 128 and observe the difference in performance

In [None]:
# setup network
n_out = 10
n_fc1 = 64
n_fc2 = n_fc1
n_fc3 = n_fc1
n_epochs = 5
n_batch_size = 32
eta = 3

input_layer = Input(shape=(input_train_shape), name='input')
x = Dense(n_fc1, activation='sigmoid', name='fc1')(input_layer)
x = Dense(n_fc2, activation='sigmoid', name='fc2')(x)
x = Dense(n_fc3, activation='sigmoid', name='fc3')(x)
output_layer = Dense(n_out, activation='sigmoid', name='output')(x)

sgd = optimizers.SGD(lr=eta)
model = Model(input_layer, output_layer)
model.compile(loss='mean_squared_error', optimizer=sgd, metrics=['accuracy'])
model.summary()

history = model.fit(input_train, labels_train,
                    epochs=n_epochs,
                    batch_size=n_batch_size,
                    shuffle=True)

evaluation = model.evaluate(input_test, labels_test, verbose=0)

print('\n[Model evaluation]')
print('test', history.params['metrics'][0], format(np.mean(evaluation[0]), '.4f'))
print('test', history.params['metrics'][1], format(np.mean(evaluation[1]), '.4f'))

**EXPECTED OUTPUT (with batch size = 32)**
```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input (InputLayer)           (None, 196)               0         
_________________________________________________________________
fc1 (Dense)                  (None, 64)                12608     
_________________________________________________________________
fc2 (Dense)                  (None, 64)                4160      
_________________________________________________________________
fc3 (Dense)                  (None, 64)                4160      
_________________________________________________________________
output (Dense)               (None, 10)                650       
=================================================================
Total params: 21,578.0
Trainable params: 21,578.0
Non-trainable params: 0.0
_________________________________________________________________
Epoch 1/5
60000/60000 [==============================] - 3s - loss: 0.0899 - acc: 0.1251     
Epoch 2/5
60000/60000 [==============================] - 3s - loss: 0.0700 - acc: 0.4602     
Epoch 3/5
60000/60000 [==============================] - 3s - loss: 0.0377 - acc: 0.7696     
Epoch 4/5
60000/60000 [==============================] - 3s - loss: 0.0246 - acc: 0.8489     
Epoch 5/5
60000/60000 [==============================] - 3s - loss: 0.0196 - acc: 0.8782     

[Model evaluation]
test loss 0.0178
test acc 0.8901
```

## Will you MNIST me?

Let's do a mini-competition for training MNIST, where any ANN model is accepted under the following rules: 

**CCNSS MNIST competition manifesto**

We abide by the rules of training MNIST under the following conditions:
* optimizer is restricted to SGD with momentum
* train set is reduced to the first 20,000 samples of MNIST
* results are reported by averaging over 5 training sessions
* communicate the changes made to the plain vanilla ANN provided below

**INSTRUCTIONS**
* improve on the plain vanilla ANN provided below

```
sgd = optimizers.SGD(lr=eta, decay=1e-6, momentum=0.9)

input_train_competition = input_train[:20000]
labels_train_competition = labels_train[:20000]
```
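
For reference, the sketch below shows the classical momentum update on a hypothetical toy loss, $f(w) = w^2/2$. It is a simplified illustration under that assumption: the function name `sgd_momentum_step` is made up, the learning-rate `decay` is left out, and the library's own implementation may differ in detail.

```python
def sgd_momentum_step(w, velocity, grad, lr, momentum=0.9):
    # the velocity accumulates past gradients; the weight moves along it
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# toy loss f(w) = w**2 / 2 (gradient = w), unrelated to MNIST
w, velocity = 1.0, 0.0
for step in range(2):
    w, velocity = sgd_momentum_step(w, velocity, grad=w, lr=0.1)
    print(step, w, velocity)
```

In the second step the weight moves twice as far as in the first, because the previous velocity is added to the new gradient step.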

In [None]:
n_out = 10
n_fc1 = 256
n_epochs = 5
n_batch_size = 128
eta = 2

input_train_competition = input_train[:20000]
labels_train_competition = labels_train[:20000]

input_layer = Input(shape=(input_train_shape), name='input')
x = Dense(n_fc1, activation='sigmoid', name='fc1')(input_layer)
output_layer = Dense(n_out, activation='sigmoid', name='output')(x)

# assign the optimizer, otherwise compile() silently reuses `sgd` from an earlier cell
sgd = optimizers.SGD(lr=eta, decay=1e-6, momentum=0.9)

model = Model(input_layer, output_layer)
model.compile(loss='mean_squared_error', optimizer=sgd, metrics=['accuracy'])

history = model.fit(input_train_competition, labels_train_competition,
                    epochs=n_epochs,
                    batch_size=n_batch_size)

evaluation = model.evaluate(input_test, labels_test, verbose=0)
  
print('\n[Model evaluation]')
print('test', history.params['metrics'][0], '\t', format(np.mean(evaluation[0]), '.4f'))
print('test', history.params['metrics'][1], '\t', format(np.mean(evaluation[1]), '.4f'))

**EXPECTED OUTPUT**
```
Epoch 1/5
20000/20000 [==============================] - 0s - loss: 0.0860 - acc: 0.3224     
Epoch 2/5
20000/20000 [==============================] - 0s - loss: 0.0655 - acc: 0.6290     
Epoch 3/5
20000/20000 [==============================] - 0s - loss: 0.0495 - acc: 0.7404     
Epoch 4/5
20000/20000 [==============================] - 0s - loss: 0.0412 - acc: 0.7948     
Epoch 5/5
20000/20000 [==============================] - 0s - loss: 0.0363 - acc: 0.8226     

[Model evaluation]
test loss 	 0.0343
test acc 	 0.8261
```

## Generating network architecture graphs

In [None]:
# from IPython.display import display, SVG
# from keras.utils import plot_model
# from keras.utils.vis_utils import model_to_dot

# # setup network
# n_out = 10
# n_fc1 = 256

# input_layer = Input(shape=(input_train_shape), name='input')
# x = Dense(n_fc1, activation='sigmoid', name='fc1')(input_layer)
# output_layer = Dense(n_out, activation='sigmoid', name='output')(x)

# model = Model(input_layer, output_layer)
# model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])

# # save model graph
# plot_model(model, to_file='fig/ann_graph.png')

# # plot model graph
# display(SVG(model_to_dot(model).create(prog='dot', format='svg')))

**EXPECTED OUTPUT**
<img src="fig/ann_graph.png">

## Extended exercises

**EXTENDED EXERCISE 1**

Synthetic image expansion is a way to reduce overfitting by performing random transformations of samples during training, effectively never presenting the same image twice during training.

Investigate the benefits of this technique, and compare to other techniques such as elastic distortions.
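
As a starting point, one of the simplest random transformations is a small translation. The NumPy sketch below shifts an image by a random offset and fills the uncovered border with zeros; `random_shift` and the random stand-in image are hypothetical, and a real experiment would likely combine several transformations.

```python
import numpy as np

def random_shift(img, max_shift, rng):
    """Translate a 2-D image by a random offset, filling the border with zeros."""
    h, w = img.shape
    dy, dx = rng.randint(-max_shift, max_shift + 1, size=2)
    # pad on all sides, then crop a window displaced by (dy, dx)
    padded = np.pad(img, max_shift, mode='constant')
    top, left = max_shift - dy, max_shift - dx
    return padded[top:top + h, left:left + w]

rng = np.random.RandomState(0)
img = rng.rand(14, 14)  # hypothetical stand-in for one pre-processed digit
shifted = random_shift(img, max_shift=2, rng=rng)
print(shifted.shape)  # (14, 14)
```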

**References:**
* Simard, Patrice Y., David Steinkraus, and John C. Platt. [Best practices for convolutional neural networks applied to visual document analysis.](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.5032&rep=rep1&type=pdf), in ICDAR, vol. 3, pp. 958-962. 2003.
* The function `keras.preprocessing.image.ImageDataGenerator()` provides random affine transformations.

**EXTENDED EXERCISE 2**

The number of parameters in deep ANNs rapidly increases with layer depth, and easily becomes much larger than the number of available train samples. This introduces an important risk of overfitting, since deep ANNs have the capacity to fully memorize train sets of random data.

Investigate the effectiveness of modern techniques to overcome this issue as the number of layers increases. Consider methods such as careful weight initialization, weight regularization, layer normalization, specialized activation functions, and variants of the stochastic gradient descent optimizer.

**References:**
* Zhang, Chiyuan, et al. [Understanding deep learning requires rethinking generalization](https://arxiv.org/pdf/1611.03530.pdf), arXiv preprint arXiv:1611.03530 (2016).

**EXTENDED EXERCISE 3**

*Dropout* is a popular regularization technique, in which random subsets of units are not available during each mini-batch. This is equivalent to training a large number of smaller networks in parallel, and pooling their average predictions at test time. In practice it reduces co-adaptation of units during training, and regularizes weights.

Investigate the effectiveness of dropout, and compare it to other alternatives, such as batch norm, ELUs, etc.
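
The mechanism can be sketched in a few lines of NumPy. The version below is the "inverted" variant, which rescales the surviving units during training so that nothing needs to change at test time; this choice, and the names `dropout` and `activations`, are assumptions made for illustration rather than a description of any library's internals.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units while training."""
    if not training:
        return x
    keep = rng.rand(*x.shape) >= rate
    # rescale the kept units so the expected activation is unchanged
    return x * keep / (1. - rate)

rng = np.random.RandomState(0)
activations = np.ones((4, 8))  # hypothetical hidden-layer activations
print(dropout(activations, rate=0.5, rng=rng))
```

With `rate=0.5`, each unit is either dropped (0) or doubled (2), so the average activation stays close to the original value.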

**References:**
* N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. [Dropout: A simple way to prevent neural networks from overfitting](http://www.jmlr.org/papers/volume15/srivastava14a.old/source/srivastava14a.pdf). The Journal of Machine Learning Research (2014).
* Sergey Ioffe, Christian Szegedy. [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167). arXiv preprint arXiv:1502.03167 (2015).
* Clevert, Djork-Arné, Thomas Unterthiner, and Sepp Hochreiter. [Fast and accurate deep network learning by exponential linear units (elus)](http://arxiv.org/abs/1511.07289). arXiv preprint arXiv:1511.07289 (2015).