# 3.1 Convolutional Neural Networks

Basic deep neural network layers can be represented by a linear transformation followed by a non-linear activation function:

> *y = f(w * x + b)*

where x is the input vector, w is a matrix of trainable parameters, b is a vector of trainable parameters, and f is the activation function.  This type of neural network layer is often called a dense, or fully-connected, layer.
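
To make the formula concrete, here is a minimal sketch of a single dense layer written in plain Python rather than TensorFlow.  The helper names (`dense_layer`, `relu`) and the weight values are made up for illustration; ReLU is just one possible choice for f.

```python
def relu(z):
    # ReLU activation: negative values are clipped to zero
    return max(0.0, z)

def dense_layer(x, w, b, activation):
    """Compute y = f(w * x + b) for one input vector x."""
    y = []
    for row, bias in zip(w, b):
        # Weighted sum of the inputs for this neuron, plus its bias
        z = sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + bias
        y.append(activation(z))
    return y

# Hypothetical example: 3 inputs feeding 2 neurons
x = [1.0, 2.0, 3.0]
w = [[0.5, -1.0, 1.0],
     [-1.0, 0.0, 0.25]]
b = [0.5, 0.0]

y = dense_layer(x, w, b, relu)
print(y)  # [2.0, 0.0]
```

Each row of w holds the weights of one neuron, so a layer with 2 neurons and 3 inputs has 2 x 3 weights plus 2 biases.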

An issue arises when applying these networks to data with important spatial relationships, such as images or volumetric data.  As we saw in the [Introduction to Tensorflow with MNIST](https://colab.research.google.com/drive/1xi8_50VAG_awaMuRRkxJZzQs8qGnaCtv), these spatial relationships are lost when the input image is flattened.
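
One way to see what flattening does is to look at where neighboring pixels end up in the flattened vector.  The sketch below assumes row-major flattening of a 28 x 28 image; `flat_index` is a hypothetical helper, not a TensorFlow function.

```python
WIDTH = 28  # MNIST images are 28 x 28 pixels

def flat_index(row, col, width=WIDTH):
    # Position of pixel (row, col) after row-major flattening
    return row * width + col

center = flat_index(10, 10)
right_neighbor = flat_index(10, 11)
lower_neighbor = flat_index(11, 10)

print(right_neighbor - center)  # 1
print(lower_neighbor - center)  # 28
```

The pixel directly below ends up 28 positions away, and a dense layer has no built-in knowledge that positions 28 apart are adjacent in the image.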

Convolutional neural networks extend the basic deep network idea by implementing what is known as a convolutional layer.  A convolutional layer is implemented as a set of M x M kernel matrices, which are convolved with the input data.  Training the convolutional layer then consists of finding the best values for each kernel matrix.

Some extremely simple examples of kernel matrices are [-1, 0, 1] and its transpose, which are used for vertical and horizontal edge detection, respectively.  In CNNs, the training process learns the kernels that best allow the model to minimize its loss function.
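
The edge-detection kernels can be tried out on a tiny image.  The sketch below is plain Python with a hypothetical helper, `correlate2d_valid`, and a made-up 4 x 6 image.  It slides the kernel without flipping it (cross-correlation), which is also what Keras convolution layers compute, and it uses no padding.

```python
def correlate2d_valid(image, kernel):
    """Slide kernel over image (no padding) and sum the elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        out.append(row)
    return out

# Hypothetical 4x6 image: dark (0) on the left, bright (1) on the right
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

row_kernel = [[-1, 0, 1]]        # 1x3 kernel: responds to vertical edges
col_kernel = [[-1], [0], [1]]    # 3x1 transpose: responds to horizontal edges

vertical_edges = correlate2d_valid(image, row_kernel)
horizontal_edges = correlate2d_valid(image, col_kernel)

print(vertical_edges[0])    # [0, 1, 1, 0]
print(horizontal_edges[0])  # [0, 0, 0, 0, 0, 0]
```

The row kernel responds only where the brightness changes from left to right, while the column kernel stays at zero because this image has no horizontal edge.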

After convolving the image with the kernels in one or more layers, a CNN typically applies a pooling layer.  A pooling layer's job is to collect the most relevant information from the output of each kernel.  It is computed by keeping the most relevant cell (e.g. the max value) of an N x N window, moving (striding) K pixels, and repeating.
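
As a rough illustration of the window-and-stride procedure, the following plain-Python sketch applies max pooling to a small made-up feature map.  The function name `max_pool2d` and its parameters are hypothetical, with `window` playing the role of N and `stride` the role of K.

```python
def max_pool2d(feature_map, window, stride):
    """Keep the max of each window x window block, stepping by stride."""
    out = []
    for i in range(0, len(feature_map) - window + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - window + 1, stride):
            # Gather every cell in the current window
            block = [feature_map[i + di][j + dj]
                     for di in range(window)
                     for dj in range(window)]
            row.append(max(block))
        out.append(row)
    return out

# Hypothetical 4x4 kernel output
feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 7, 2],
               [6, 2, 3, 4]]

pooled = max_pool2d(feature_map, window=2, stride=2)
print(pooled)  # [[4, 5], [6, 7]]
```

With N = K = 2 the windows do not overlap, so each dimension of the output is half the size of the input.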

### 3.1.1 CNN Basic Architecture

The basic architecture of a CNN is to apply multiple layers of the convolution/pooling operation and to then pass the output to a set of dense layers for interpretation.

CNNs are a popular tool for analyzing image data.  In this notebook, a simple CNN will be demonstrated on the MNIST dataset.  The code to load the dataset and set up the processing pipeline is equivalent to [notebook 1](https://colab.research.google.com/drive/1xi8_50VAG_awaMuRRkxJZzQs8qGnaCtv#scrollTo=qBT-P-4q4HtT).

In [None]:
# Import the needed Tensorflow components
import tensorflow as tf
import tensorflow_datasets as tfds

# Load the MNIST dataset.  Load checks whether the dataset is locally available and downloads it from 
# its official repository at http://yann.lecun.com/exdb/mnist if it cannot be found.
(train, test), info = tfds.load('mnist',                  # Pick the MNIST dataset
                                 split=['train', 'test'], # Load both the training and testing parts of the dataset
                                 with_info=True,          # Generate summary information about the dataset
                                 as_supervised=True)      # return both the inputs and labels as a tuple

print(info.description)
print(info.splits)

# Define the data preprocessing pipeline.  For MNIST, the only needed preprocessing is to convert from uint8 to
# float.  Preprocessing for other datasets is likely to be more extensive.
def preprocess_data(image, label):
  # Convert uint8 to float32 on [0, 1]
  image = tf.cast(image, tf.float32) / 255.0
  return image, label

# Assign the preprocessing pipeline to each dataset: train and test
train = train.map(preprocess_data)
test = test.map(preprocess_data)

# Tell each dataset how many images it will load at once for processing
BATCH_SIZE=128
train = train.batch(BATCH_SIZE)
test = test.batch(BATCH_SIZE)

The MNIST database of handwritten digits.
{'test': <tfds.core.SplitInfo num_examples=10000>, 'train': <tfds.core.SplitInfo num_examples=60000>}


Now that the data pipeline is set up, it is time to design the CNN.  In this simple network, we have two convolution / pooling layers.  Each convolution layer has eight 3 x 3 kernels.  Setting the padding parameter to 'same' zero-pads the input so that the convolution result is still 28 x 28 pixels.

The max pooling layer divides the 28 x 28 output from each of the 8 kernels into 2 x 2 blocks and keeps just the maximum pixel value from each block.  The data size is now 14 x 14 x 8.

The next convolution layer reads in the 14 x 14 x 8 array.  The 8 kernel outputs are simply treated as different channels of the data.  Each of the next layer's kernels has a separate 3 x 3 slice for each of the 8 channels; each slice is applied to its own 14 x 14 array, and the results are summed across channels.  2 x 2 max pooling is applied just as before, resulting in a 7 x 7 x 8 block of data for each image.
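
The summation across channels can be sketched for a single output value.  The plain-Python example below uses a hypothetical helper, `multichannel_response`, and only 2 channels instead of 8 to keep the numbers readable; the patch and kernel values are made up.

```python
def multichannel_response(patch, kernel, bias=0.0):
    """One output value of one kernel at one position.

    patch and kernel are both indexed [row][col][channel].
    """
    total = bias
    for r in range(len(kernel)):
        for c in range(len(kernel[0])):
            for ch in range(len(kernel[0][0])):
                # Products from every channel go into the same running sum
                total += patch[r][c][ch] * kernel[r][c][ch]
    return total

# Hypothetical 3x3 patch with 2 channels: channel 0 is all 1.0, channel 1 is all 0.5
patch = [[[1.0, 0.5] for _ in range(3)] for _ in range(3)]

# Hypothetical 3x3x2 kernel: weight 1.0 on channel 0, weight 2.0 on channel 1
kernel = [[[1.0, 2.0] for _ in range(3)] for _ in range(3)]

value = multichannel_response(patch, kernel)
print(value)  # 18.0
```

Each channel contributes 9.0 here, and the two contributions are added into one number, so a kernel produces a single output channel no matter how many input channels it reads.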

The idea behind using multiple iterations of convolution / pooling is that the first iteration will capture basic features such as edges and lines, while later layers capture higher level features such as curves or digit fragments.  This concept follows the hierarchical neural system of the human visual cortex.

After the final convolution / pooling iteration, the data block is flattened and passed to a dense layer with ReLU activation and then to the output layer.
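
The data sizes quoted above can be traced with a few lines of arithmetic.  This is a sketch in plain Python, not TensorFlow; the helper names are hypothetical, and it assumes 'same' padding with stride 1 for the convolutions and non-overlapping 2 x 2 windows for the pooling.

```python
def conv_same_shape(shape, filters):
    # 'same' padding with stride 1 keeps height and width; channels become the kernel count
    height, width, _ = shape
    return (height, width, filters)

def max_pool_shape(shape, pool=2):
    # Non-overlapping pool x pool windows divide height and width by pool
    height, width, channels = shape
    return (height // pool, width // pool, channels)

shapes = [(28, 28, 1)]                             # one grayscale MNIST image
shapes.append(conv_same_shape(shapes[-1], 8))      # first convolution
shapes.append(max_pool_shape(shapes[-1]))          # first pooling
shapes.append(conv_same_shape(shapes[-1], 8))      # second convolution
shapes.append(max_pool_shape(shapes[-1]))          # second pooling

height, width, channels = shapes[-1]
flattened_size = height * width * channels

print(shapes)          # [(28, 28, 1), (28, 28, 8), (14, 14, 8), (14, 14, 8), (7, 7, 8)]
print(flattened_size)  # 392
```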

In [None]:
# Specify a basic sequential neural network
model = tf.keras.models.Sequential([
                                   tf.keras.layers.Conv2D(8, 3, input_shape=(28, 28, 1), padding='same'), # Convolution with 8 3x3 kernels
                                   tf.keras.layers.MaxPool2D(pool_size=(2, 2)),  # Keep max value of each 2x2 block
                                   tf.keras.layers.Conv2D(8, 3, padding='same'), # Another convolution with 8 3x3 kernels
                                   tf.keras.layers.MaxPool2D(pool_size=(2, 2)),  # Keep max value of each 2x2 block (data is 7x7x8)
                                   tf.keras.layers.Flatten(),                    # Flatten 7x7x8 array into a 392-element vector
                                   tf.keras.layers.Dense(20, activation='relu'), # Hidden Layer: Define a layer with 20 neurons
                                   tf.keras.layers.Dense(10)                     # Output Layer: Define a layer with a slot for each digit.
                                   ])

# Define the optimizing algorithm, loss function, and evaluation metric for the model
model.compile(
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),              # Use the Adam optimizer to train the model
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), # Minimize the categorical crossentropy function in training
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]                # Measure our overall accuracy to see how well we've trained
              )

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_2 (Conv2D)            (None, 28, 28, 8)         80        
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 14, 14, 8)         0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 14, 14, 8)         584       
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 7, 7, 8)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 392)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 20)                7860      
_________________________________________________________________
dense_3 (Dense)              (None, 10)                210       
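
The parameter counts in the summary above follow directly from the layer shapes.  As a rough check, the plain-Python sketch below recomputes them; the helper names are hypothetical, and the count for the final dense layer is calculated here from its 20 inputs and 10 outputs rather than read from the summary.

```python
def conv2d_params(kernel_size, in_channels, filters):
    # Each kernel has kernel_size * kernel_size * in_channels weights plus one bias
    return (kernel_size * kernel_size * in_channels + 1) * filters

def dense_params(inputs, units):
    # Each neuron has one weight per input plus one bias
    return (inputs + 1) * units

counts = {
    "first Conv2D": conv2d_params(3, in_channels=1, filters=8),
    "second Conv2D": conv2d_params(3, in_channels=8, filters=8),
    "hidden Dense": dense_params(7 * 7 * 8, 20),
    "output Dense": dense_params(20, 10),
}
total = sum(counts.values())

for name, count in counts.items():
    print(name, count)
print("total", total)  # total 8734
```

Most of the parameters sit in the hidden dense layer (7,860), while the two convolution layers together need only 664.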

Although the final model is more complicated than that of [Notebook 1](https://colab.research.google.com/drive/1xi8_50VAG_awaMuRRkxJZzQs8qGnaCtv#scrollTo=m-SrDXeLD5jq), the total number of trainable parameters is reduced from nearly 16,000 to under 9,000.  At the same time, by capturing the spatial correlation of pixels, the model should perform better than a purely dense model like the one in Notebook 1.  Just as in Notebook 1, we train using the fit function.

In [None]:
model.fit(train, epochs=2, validation_data=test)

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7fe6aa76e550>

After 2 epochs, the model accuracy is almost 96%, compared with 92% for the dense network in Notebook 1.  This suggests that, for image data such as MNIST, the CNN architecture makes more effective use of its parameters than a purely dense network.