# Abstract

**Objective:** To build and run a convolutional neural network.

**Method:** We recreate the simple model from the first notebook in this series, and then build an analogous one, on the same dataset (CIFAR10). We then create a second model using convolutional layers.

**Observations & Results:** For convolution you need to be aware of strides (how many steps the kernel moves right and then down) and padding (how far outside the image the kernel starts and ends). If `padding='same'` and `strides=1` then the output has the same height and width as the input (the input is padded with zeros to accomplish this). If `padding='same'` and `strides=2` then the output has half the height and width (rounded up). Without padding (`padding='valid'`) and with `strides=1`, the output width is `input_width - kernel_size + 1`, and similarly for height.
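The output-size arithmetic can be sketched in a few lines. This is a minimal illustration, assuming the usual Keras conventions for `'same'` and `'valid'` padding; the helper name `conv_output_size` is made up for this note and is not part of Keras.

```python
import math


def conv_output_size(input_size, kernel_size, stride=1, padding="same"):
    """Output length along one spatial axis of a convolution (hypothetical helper)."""
    if padding == "same":
        # Zero-padding is chosen so that only the stride shrinks the output.
        return math.ceil(input_size / stride)
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input.
        return (input_size - kernel_size) // stride + 1
    raise ValueError(f"unknown padding: {padding!r}")


print(conv_output_size(32, 3, stride=1, padding="same"))   # 32
print(conv_output_size(32, 4, stride=2, padding="same"))   # 16
print(conv_output_size(32, 3, stride=1, padding="valid"))  # 30
```

The same function applies independently to height and width, so a 32 x 32 input with a 4 x 4 kernel, `strides=2` and `padding='same'` gives a 16 x 16 output.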

Note that unlike a dense layer, which only consumes the last dimension, a convolutional layer consumes all but the first (batch) dimension: each filter spans the kernel's height and width and every input channel.

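Convolution of an image patch $X$ with a kernel $A$ of the same shape is just the sum of their elementwise product, $\sum_{i,j} A_{ij} X_{ij}$, i.e. `np.sum(A * X)`.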
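To see how the single-position sum `np.sum(A * X)` turns into a full feature map, here is a deliberately naive sketch for one channel with no padding. The function name `conv2d_single_channel` is an assumption for illustration; deep-learning libraries compute the same quantity (strictly a cross-correlation, since the kernel is not flipped) far more efficiently.

```python
import numpy as np


def conv2d_single_channel(image, kernel, stride=1):
    """Naive 'valid' convolution of a 2-D image with a 2-D kernel."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            # Cut out the patch under the kernel at this position...
            top, left = i * stride, j * stride
            patch = image[top:top + kh, left:left + kw]
            # ...and reduce it to one number: the sum of the elementwise product.
            out[i, j] = np.sum(kernel * patch)
    return out


image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))
result = conv2d_single_channel(image, kernel)
print(result.shape)  # (2, 2)
```

With an all-ones kernel each output cell is simply the sum of the 3 x 3 patch beneath it, which makes the result easy to check by hand.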

If your loss function starts returning `NaN` you may have an exploding gradient problem, which batch normalization helps to mitigate. Why this works needs some more explaining.

With BatchNormalization and LeakyReLU, Dropout seems not to be as necessary as it once was to avoid overfitting.

Conv models are massively slower to train than dense models despite the drastic reduction in parameter count. It seems max-pooling is no longer in vogue.

**Conclusions:** Convolutional models are a powerful, but slow-to-train, method of learning feature representations. With LeakyReLUs and BatchNormalization, Dropout & pre-training are no longer as important. Furthermore, MaxPooling is less and less important.

# Load and Prepare Data

In [None]:
import numpy as np

import keras.utils as kutils
from keras.datasets import cifar10

In [None]:
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

x_train = x_train.astype(np.float32) / 255
x_test  = x_test.astype(np.float32) / 255

y_train = kutils.to_categorical(y_train)
y_test  = kutils.to_categorical(y_test)

# Simple Non-Convolutional Model

In [None]:
from keras.layers import Input, Dense, Flatten
from keras.models import Model 
from keras.losses import CategoricalCrossentropy
from keras.optimizers import Adam

input_layer = Input(shape=(32, 32, 3))

x = Flatten()(input_layer)
x = Dense(units=200, activation='relu')(x)
x = Dense(units=150, activation='relu')(x)

output_layer = Dense(units=10, activation='softmax')(x)

nn = Model(input_layer, output_layer)

In [None]:
opt = Adam(learning_rate=0.005)

nn.compile(optimizer=opt,
           loss='categorical_crossentropy',
           metrics=['accuracy'])

In [None]:
nn.summary()

Model: "model_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         [(None, 32, 32, 3)]       0         
_________________________________________________________________
flatten_5 (Flatten)          (None, 3072)              0         
_________________________________________________________________
dense_15 (Dense)             (None, 200)               614600    
_________________________________________________________________
dense_16 (Dense)             (None, 150)               30150     
_________________________________________________________________
dense_17 (Dense)             (None, 10)                1510      
Total params: 646,260
Trainable params: 646,260
Non-trainable params: 0
_________________________________________________________________


In [None]:
nn.fit(x_train, y_train,
          validation_data=(x_test, y_test),
          batch_size=32,
          epochs=10,
          shuffle=True)

nn.fit(x_train, y_train,
          validation_data=(x_test, y_test),
          batch_size=256,
          epochs=5,
          shuffle=True)

nn.fit(x_train, y_train,
          validation_data=(x_test, y_test),
          batch_size=2048,
          epochs=5,
          shuffle=True)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f772b9d4080>

# Convolution Model Take #1

In [None]:
from keras.layers import Input, Flatten, Dense, Conv2D
from keras.optimizers import Adam
from keras.models import Model

In [None]:
input_layer = Input(shape=(32, 32, 3))

conv_1 = Conv2D(
    filters=10,  # 10 filters, each spanning all 3 input channels -> 10 output channels
    kernel_size=(4, 4),
    strides=2,
    padding='same',
)(input_layer)

conv_2 = Conv2D(
    filters=20,  # 20 filters, each spanning all 10 input channels -> 20 output channels
    kernel_size=(3, 3),
    strides=2,
    padding='same',
)(conv_1)

flatten_3 = Flatten()(conv_2)

output_layer = Dense(units=10, activation='softmax')(flatten_3)

dnn = Model(input_layer, output_layer)

In [None]:
dnn.summary()

Model: "model_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_8 (InputLayer)         [(None, 32, 32, 3)]       0         
_________________________________________________________________
conv2d (Conv2D)              (None, 16, 16, 10)        490       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 8, 8, 20)          1820      
_________________________________________________________________
flatten_6 (Flatten)          (None, 1280)              0         
_________________________________________________________________
dense_18 (Dense)             (None, 10)                12810     
Total params: 15,120
Trainable params: 15,120
Non-trainable params: 0
_________________________________________________________________


First thing that leaps out is that there are radically fewer parameters in this model. 

The next step is explaining the evolution of the data as it passes through the network. We have

```
N x 32 x 32 x 3

      |
     \|/
      '
N x 16 x 16 x 10  (strides=2, padding=same)

      |
     \|/
      '
N x 8 x 8 x 20 (strides=2, padding=same)

      |
     \|/
      '
N x 1280   (64 x 20 = 1280)

      |
     \|/
      '
N x 10   (Dense 1280 x 10 + 10)
```

So the first convolution takes us from a 32 x 32 image with 3 channels to a 16 x 16 image with 10 channels.

There are 490 parameters. Each filter has 49 parameters (4 x 4 x 3 + 1), i.e. each filter consumes all three channels in one go, and also has an intercept (bias) term. We then have 10 filters, making 490 parameters in total.
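The same counting rule reproduces every row of the summary above. The sketch below is plain arithmetic; `conv2d_params` and `dense_params` are hypothetical helper names, not Keras functions.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # Each filter has one weight per kernel cell per input channel, plus one bias.
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(in_units, out_units):
    # One weight per input-output pair, plus one bias per output unit.
    return (in_units + 1) * out_units


conv_1_count = conv2d_params(4, 4, in_channels=3, filters=10)
conv_2_count = conv2d_params(3, 3, in_channels=10, filters=20)
dense_count = dense_params(8 * 8 * 20, 10)

print(conv_1_count, conv_2_count, dense_count)    # 490 1820 12810
print(conv_1_count + conv_2_count + dense_count)  # 15120
```

Note that the parameter count of a convolutional layer does not depend on the image size at all, only on the kernel size and the channel counts, which is where most of the saving over the dense model comes from.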

So note that unlike a dense layer, which only consumes the last dimension, a convolutional layer will consume all-but-the-first dimension.

FIXME Firm up the maths on this one.



# Convolution Model Take #2

This will introduce BAD: BatchNormalization, Activation, Dropout

If weights near the start of the net never change, you have a _vanishing gradient problem_.

If hidden layer values never change, i.e. the unit has "died", its gradient is always 0. Replacing ReLU with LeakyReLU, which has a small non-zero slope for negative inputs, means the gradient is never exactly zero.
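A small NumPy sketch makes the difference concrete. The slope `alpha=0.3` is an illustrative assumption, and the function names are made up for this note rather than taken from Keras.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def relu_grad(x):
    # Zero gradient for every negative input: a unit stuck here cannot recover.
    return np.where(x > 0, 1.0, 0.0)


def leaky_relu(x, alpha=0.3):
    # Negative inputs are scaled down rather than zeroed out.
    return np.where(x > 0, x, alpha * x)


def leaky_relu_grad(x, alpha=0.3):
    # The gradient is alpha (not 0) for negative inputs, so learning can continue.
    return np.where(x > 0, 1.0, alpha)


x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_grad(x))
print(leaky_relu(x))
print(leaky_relu_grad(x))
```

For the two negative inputs the ReLU gradient is zero while the LeakyReLU gradient is `alpha`, so a weight update can still move the unit back into its active region.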

If weights explode then you have the _exploding gradient problem_. This is usually indicated by your loss function returning `NaN`.

Since weights are initialized to small random values, we rescale the input to a small range (0..1 in this notebook; -1..+1 is another common choice) in order to avoid massive gradients early on.

However, as the network is trained you may experience _covariate shift_. Backpropagation assumes (informally) that the distribution of the input to a layer doesn't change. In practice it may change significantly as earlier layers are updated, leading to cascading overcorrections that explode the gradient.

Batch normalization centres and scales the output of one hidden layer before presenting it to the next. During training it uses the mean and variance of the current batch. It also keeps a moving average of both, updated with a sense of inertia (aka momentum), for use at prediction time, when batch statistics are unavailable or unreliable.

Data are first scaled and shifted locally: i.e. we centre and scale as normal using the batch mean and variance. Then we rescale and relocate the result using learned parameters $\gamma$ (the scale) and $\beta$ (the location), which are trained by gradient descent like any other weight. The moving mean and variance are not trained; they are the quantities updated with the momentum parameter.
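As a rough sketch of the training-time computation, assuming a 2-D input of shape `(batch, features)` and using `momentum=0.99` and `eps=1e-3` as illustrative values: the function `batch_norm_train` is hypothetical and only mirrors the idea, not the Keras implementation. For convolutional feature maps the statistics are computed per channel, which gives four values per channel (scale, location, moving mean, moving variance) and matches the 128 parameters reported for a 32-channel layer.

```python
import numpy as np


def batch_norm_train(x, gamma, beta, moving_mean, moving_var,
                     momentum=0.99, eps=1e-3):
    """Training-mode batch normalization over the batch axis (axis 0)."""
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)
    # Step 1: centre and scale using the statistics of this batch.
    x_hat = (x - batch_mean) / np.sqrt(batch_var + eps)
    # Step 2: rescale and relocate with the learned gamma and beta.
    out = gamma * x_hat + beta
    # Step 3: nudge the moving statistics (used at prediction time) towards the batch.
    new_mean = momentum * moving_mean + (1.0 - momentum) * batch_mean
    new_var = momentum * moving_var + (1.0 - momentum) * batch_var
    return out, new_mean, new_var


rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(256, 4))
gamma, beta = np.ones(4), np.zeros(4)

out, moving_mean, moving_var = batch_norm_train(
    x, gamma, beta, moving_mean=np.zeros(4), moving_var=np.ones(4))

print(out.mean(axis=0).round(3))  # approximately 0 for every feature
print(out.std(axis=0).round(3))   # approximately 1 for every feature
```

With `gamma = 1` and `beta = 0` the output is simply the standardized batch; training then adjusts $\gamma$ and $\beta$ if some other scale or location suits the next layer better.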

FIXME clarify maths on this

BatchNormalization & ReLU are usually enough on their own to prevent overfitting, so the older solution, dropout, is nowadays sometimes skipped entirely.



In [None]:
from keras.layers import Input, Flatten, Dense, Conv2D, BatchNormalization, \
  Dropout, LeakyReLU, Activation

# This architecture has so many odds and ends it seems hard to believe it's
# not been exhaustively tuned...
input_layer = Input(shape=(32, 32, 3))

x = Conv2D(
    filters=32,
    kernel_size=(3,3),
    strides=1,
    padding='same'
)(input_layer)
x = BatchNormalization()(x)
x = LeakyReLU()(x)  # Note how we need separate activations to insert BatchNorm

x = Conv2D(
    filters=32,
    kernel_size=(3,3),
    strides=2,
    padding='same'
)(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)

x = Conv2D(
    filters=64,
    kernel_size=(3,3),
    strides=1,
    padding='same'
)(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)


x = Conv2D(
    filters=64,
    kernel_size=(3,3),
    strides=2,
    padding='same'
)(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)

x = Flatten()(x)

x = Dense(128)(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)
x = Dropout(rate=0.5)(x)

x = Dense(10)(x)
output_layer = Activation('softmax')(x)

dnn = Model(input_layer, output_layer)

In [None]:
dnn.summary()

Model: "model_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_13 (InputLayer)        [(None, 32, 32, 3)]       0         
_________________________________________________________________
conv2d_14 (Conv2D)           (None, 32, 32, 32)        896       
_________________________________________________________________
batch_normalization_15 (Batc (None, 32, 32, 32)        128       
_________________________________________________________________
leaky_re_lu_14 (LeakyReLU)   (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_15 (Conv2D)           (None, 16, 16, 32)        9248      
_________________________________________________________________
batch_normalization_16 (Batc (None, 16, 16, 32)        128       
_________________________________________________________________
leaky_re_lu_15 (LeakyReLU)   (None, 16, 16, 32)        0   

In [None]:
opt = Adam(learning_rate=0.0005)
dnn.compile(optimizer=opt,
            loss='categorical_crossentropy',
            metrics=['accuracy'])

In [None]:
dnn.fit(x_train, y_train,
        validation_data=(x_test, y_test),
        batch_size=32,
        epochs=10,
        shuffle=True)

dnn.fit(x_train, y_train,
        validation_data=(x_test, y_test),
        batch_size=256,
        epochs=5,
        shuffle=True)

dnn.fit(x_train, y_train,
        validation_data=(x_test, y_test),
        batch_size=1024,
        epochs=5,
        shuffle=True)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f772b05d240>

It must be said, given the reduction in parameters, that this comparison of methods seems to be a little bit of a cheat.