# Lab 4 - Convolutional Neural Network with MNIST


# Model Overview

In this lab we will train a Convolutional Neural Network (CNN) on MNIST data. 


In [None]:
from IPython.display import display, Image
Image(url= "http://3.bp.blogspot.com/_UpN7DfJA0j4/TJtUBWPk0SI/AAAAAAAAABY/oWPMtmqJn3k/s1600/mnist_originals.png", width=200, height=200)


## Overview of convolutional neural networks
A [convolutional neural network](https://en.wikipedia.org/wiki/Convolutional_neural_network) (CNN, or ConvNet) is a type of [feed-forward](https://en.wikipedia.org/wiki/Feedforward_neural_network) artificial neural network made up of neurons that have learnable weights and biases. CNNs take advantage of the spatial nature of the data. In nature, we perceive different objects by their shapes, sizes, and colors. For example, objects in a natural scene are typically made up of edges, corners/vertices (defined by two or more edges), color patches, etc. These primitives are often identified using different detectors (e.g., edge detectors, color detectors) or combinations of detectors that interact to facilitate image interpretation (object classification, region-of-interest detection, scene description, etc.) in real-world vision tasks. These detectors are also known as filters. Convolution is a mathematical operator that takes an image and a filter as input and produces a filtered output (representing, say, edges, corners, or colors in the input image).  Historically, these filters were sets of weights that were often hand-crafted or modeled with mathematical functions (e.g., [Gaussian](https://en.wikipedia.org/wiki/Gaussian_filter) / [Laplacian](http://homepages.inf.ed.ac.uk/rbf/HIPR2/log.htm) / [Canny](https://en.wikipedia.org/wiki/Canny_edge_detector) filter).  The filter outputs are mapped through non-linear activation functions, mimicking human brain cells called [neurons](https://en.wikipedia.org/wiki/Neuron).

Convolutional networks provide the machinery to learn these filters directly from the data instead of relying on explicit mathematical models, and they have been found to be superior (in real-world tasks) to hand-crafted filters.  With convolutional networks, the focus is on learning the filter weights instead of learning an individual weight for every fully connected input-output pair. In this way, the number of weights to learn is reduced compared with the traditional MLP networks from the previous tutorials.  In a convolutional network, one learns several filters, ranging from a handful to a few thousand depending on the network complexity.

Many of the CNN primitives have been shown to have conceptually parallel components in the brain's [visual cortex](https://en.wikipedia.org/wiki/Visual_cortex). A group of neurons in the visual cortex emits a response when a particular region of the visual field is stimulated. This region is known as the receptive field (RF). Equivalently, in convolution the input region corresponding to the filter dimensions can be considered the receptive field. Popular deep CNNs or ConvNets (such as [AlexNet](https://en.wikipedia.org/wiki/AlexNet), [VGG](https://arxiv.org/abs/1409.1556), [Inception](http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Szegedy_Going_Deeper_With_2015_CVPR_paper.pdf), [ResNet](https://arxiv.org/pdf/1512.03385v1.pdf)) that are used for various [computer vision](https://en.wikipedia.org/wiki/Computer_vision) tasks have many of these architectural primitives (inspired by biology).


<a id='#Model Creation'></a>
## CNN Model Creation

A CNN is a feed-forward network made up of a stack of layers arranged so that the output of one layer becomes the input to the next layer (similar to an MLP). In an MLP, every input pixel is connected to every output node, with each connection having its own weight, which leads to a combinatorial explosion of parameters to be learned and increases the possibility of overfitting ([details](http://cs231n.github.io/neural-networks-1/)). Convolution layers take advantage of the spatial arrangement of the pixels and learn multiple filters that significantly reduce the number of parameters in the network ([details](http://cs231n.github.io/convolutional-networks/)). The size of the filter is a parameter of the convolution layer.

In this section, we introduce the basics of convolution operations. We show the illustrations in the context of RGB images (3 channels), even though the MNIST data we are using in this tutorial consists of grayscale images (single channel).

![input-rgb](https://www.cntk.ai/jup/cntk103d_rgb.png)

### Convolution Layer

A convolution layer is a set of filters. Each filter is defined by a weight matrix (**W**) and a bias ($b$).

![input-filter](https://www.cntk.ai/jup/cntk103d_filterset.png)

These filters are scanned across the image, computing the dot product between the weights and the corresponding input values ($\vec{x}^T$). The bias value is added to the output of the dot product, and the resulting sum is optionally mapped through an activation function. This process is illustrated in the following animation.

In [None]:
Image(url="https://www.cntk.ai/jup/cntk103d_conv2d_final.gif", width= 300)
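
The scan-multiply-add process described above can be sketched in a few lines of plain Python. This is a minimal illustration and not how CNTK implements convolution: it handles a single channel with no padding, and the names `convolve2d`, `relu`, and `vertical_edge` are hypothetical. For a multi-channel input, the dot product would additionally sum over the channels. The filter used here is hand-crafted, in the spirit of the historical edge detectors mentioned earlier; a CNN would learn such weights from data.

```python
def relu(value):
    """Rectified linear unit: negative responses are clipped to zero."""
    return max(0.0, value)


def convolve2d(image, weights, bias=0.0, stride=1, activation=relu):
    """Slide `weights` over `image` (no padding) and return the filtered output."""
    filter_h, filter_w = len(weights), len(weights[0])
    out_h = (len(image) - filter_h) // stride + 1
    out_w = (len(image[0]) - filter_w) // stride + 1
    output = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            top, left = row * stride, col * stride
            # Dot product between the filter weights and the receptive field
            total = sum(weights[i][j] * image[top + i][left + j]
                        for i in range(filter_h) for j in range(filter_w))
            # Add the bias, then map the sum through the activation function
            out_row.append(activation(total + bias))
        output.append(out_row)
    return output


# Toy 5 x 5 image: dark on the left, bright on the right
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-crafted filter that responds to a dark-to-bright vertical edge
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

feature_map = convolve2d(image, vertical_edge, bias=0.0, stride=1)
for row in feature_map:
    print(row)   # each row prints [3.0, 3.0, 0.0]
```

The output is strongest where the receptive field covers the edge and zero where the input is uniform, which is the sense in which a filter "detects" a feature.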

Convolution layers incorporate the following key features:

   - Instead of being fully connected to all pairs of input and output nodes, each convolution node is **locally connected** to a subset of input nodes localized to a smaller input region, also referred to as the receptive field (RF). The figure above illustrates a small 3 x 3 region in the image as the RF region. In the case of an RGB image, there would be three such 3 x 3 regions, one for each of the 3 color channels.
   
   
   - Instead of having a single set of weights (as in a Dense layer), convolutional layers have multiple sets (shown in figure with multiple colors), called **filters**. Each filter detects features within each possible RF in the input image.  The output of the convolution is a set of `n` sub-layers (shown in the animation below) where `n` is the number of filters (refer to the above figure).  
   
     
   - Within a sublayer, instead of each node having its own set of weights, a single set of **shared weights** is used by all nodes in that sublayer. This reduces the number of parameters to be learned and thus the risk of overfitting. It also opens the door to several aspects of deep learning that have enabled very practical solutions to be built:
     - Handling larger images (say 512 x 512)
     - Trying larger filter sizes (corresponding to a larger RF), say 11 x 11
     - Learning more filters (say 128)
     - Exploring deeper architectures (100+ layers)
     - Achieving translation invariance (the ability to recognize a feature independent of where it appears in the image).
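
To get a feel for how much local connectivity and weight sharing save, the sketch below compares the number of parameters needed to map one MNIST-sized image to 8 feature maps with a fully connected layer versus a convolution layer. The output shape of 8 x 14 x 14 and the 5 x 5 filter are assumptions chosen for illustration, and `count` is a hypothetical helper.

```python
input_shape = (1, 28, 28)       # channels, height, width of one MNIST-sized image
output_shape = (8, 14, 14)      # 8 feature maps of 14 x 14 nodes (assumed)
filter_shape = (5, 5)           # assumed filter size


def count(shape):
    """Number of elements in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


# Fully connected: one weight per input/output pair plus one bias per output node
dense_params = count(input_shape) * count(output_shape) + count(output_shape)

# Convolution: one shared filter (plus one bias) per feature map
num_filters, in_channels = output_shape[0], input_shape[0]
conv_params = num_filters * in_channels * count(filter_shape) + num_filters

print(dense_params, conv_params)   # 1230880 208
```

Under these assumptions the convolution layer needs several thousand times fewer parameters for the same output size.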

### Strides and Pad parameters

**How are filters positioned?** In general, the filters are arranged in overlapping tiles, from left to right and top to bottom.  Each convolution layer has a `filter_shape` parameter, which specifies the width and height of the filter for 2D inputs such as most natural scene images.  The `strides` parameter controls how far to step to the right when moving the filter through multiple RFs in a row, and how far to step down when moving to the next row.  The boolean parameter `pad` controls whether the input should be padded around the edges to allow a complete tiling of the RFs near the borders.

The animation above shows the results with a `filter_shape` = (3, 3), `strides` = (2, 2) and `pad` = False. The two animations below show the results when `pad` is set to True. First, with a stride of 2 and second having a stride of 1.
Note: the shape of the output (the teal layer) differs between the two stride settings. The decision whether to pad and which stride values to choose is often based on the shape of the output layer needed.
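
The effect of `strides` and `pad` on the output shape can be estimated with a small helper. This is a sketch under the common conventions, not taken from the CNTK source: without padding the filter must fit entirely inside the input, and with padding each stride position yields one output. The helper name `conv_output_size` and the 5 x 5 input size are assumptions for illustration.

```python
import math


def conv_output_size(input_size, filter_size, stride, pad):
    """Output length along one spatial axis of a convolution layer."""
    if pad:
        # Padded: every stride position produces an output
        return math.ceil(input_size / stride)
    # Unpadded: the filter must fit entirely inside the input
    return (input_size - filter_size) // stride + 1


for stride, pad in [(2, False), (2, True), (1, True)]:
    size = conv_output_size(input_size=5, filter_size=3, stride=stride, pad=pad)
    print("stride={}, pad={} -> {} x {}".format(stride, pad, size, size))
```

Working backwards from a required output shape with a helper like this is one way to choose between padding and stride settings.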

In [None]:
# Plot images with strides of 2 and 1 with padding turned on
images = [("https://www.cntk.ai/jup/cntk103d_padding_strides.gif" , 'With stride = 2'),
          ("https://www.cntk.ai/jup/cntk103d_same_padding_no_strides.gif", 'With stride = 1')]

for im in images:
    print(im[1])
    display(Image(url=im[0], width=200, height=200))

# Code Walkthrough
## Initialize environment

In [None]:
import numpy as np
import os
import sys
import time

import cntk as C
from cntk.logging.progress_print import ProgressPrinter

# Select the right target device 
# C.device.try_set_default_device(C.device.cpu())
# C.device.try_set_default_device(C.device.gpu(0))



## Data reading


In previous tutorials, as shown below, we have always flattened the input image into a vector.  With convolutional networks, we do not flatten the image in this way.

![MNIST-flat](https://www.cntk.ai/jup/cntk103a_MNIST_input.png)

**Input Dimensions**:  

In convolutional networks for images, the input data is often shaped as a 3D matrix (number of channels, image width, image height), which preserves the spatial relationship between the pixels. In the figure above, the MNIST image is single-channel (grayscale) data, so the input dimension is specified as a (1, image width, image height) tuple.

![input-rgb](https://www.cntk.ai/jup/cntk103d_rgb.png)

Natural scene color images are often presented as Red-Green-Blue (RGB) color channels. The input dimension of such images is specified as a (3, image width, image height) tuple. If one has RGB input data as a volumetric scan with volume width, volume height, and volume depth representing the 3 axes, the input data format would be specified by a tuple of 4 values (3, volume width, volume height, volume depth). In this way CNTK enables specification of input images in arbitrary higher-dimensional spaces.

Our training data is stored on our local machine in the CNTK CTF format:

    |labels 0 0 0 1 0 0 0 0 0 0 |features 0 255 0 123 ... 
                                                  (784 integers each representing a pixel gray level)

We will therefore need to reshape our data into a 3D matrix when defining the input variable.
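
To make the layout of a CTF line concrete, here is a simplified sketch that parses one line and regroups the flat pixel list into a (channels, height, width) structure. It is only an illustration: the actual reading is done by the CNTK reader defined in the next cell, and the helpers `parse_ctf_line` and `reshape_to_image` are hypothetical. The toy sample uses a 2 x 2 image instead of the 784 pixels of a real digit.

```python
def parse_ctf_line(line):
    """Split one dense CTF text line into a {field name: list of floats} dict."""
    fields = {}
    for chunk in line.strip().split("|"):
        tokens = chunk.split()
        if tokens:
            fields[tokens[0]] = [float(token) for token in tokens[1:]]
    return fields


def reshape_to_image(values, channels, height, width):
    """Regroup a flat pixel list into nested lists of shape (channels, height, width)."""
    assert len(values) == channels * height * width
    return [[values[(c * height + r) * width:(c * height + r + 1) * width]
             for r in range(height)]
            for c in range(channels)]


# Toy sample: a 2 x 2 "image" labeled as the digit 3
sample = "|labels 0 0 0 1 0 0 0 0 0 0 |features 0 255 123 7"
record = parse_ctf_line(sample)
digit = record["labels"].index(1.0)               # position of the one-hot entry
image = reshape_to_image(record["features"], 1, 2, 2)
print(digit, image)   # 3 [[[0.0, 255.0], [123.0, 7.0]]]
```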


In [None]:
# Ensure we always get the same amount of randomness
np.random.seed(0)

# Define a reader for the CTF formatted MNIST files 
def create_reader(path, is_training, input_dim, label_dim):
    return C.io.MinibatchSource(C.io.CTFDeserializer(path, C.io.StreamDefs(
        features  = C.io.StreamDef(field='features', shape=input_dim, is_sparse=False),
        labels    = C.io.StreamDef(field='labels',   shape=label_dim, is_sparse=False)
    )), randomize=is_training, max_sweeps = C.io.INFINITELY_REPEAT if is_training else 1)



# Model Training
## Setup a computational network


The first model we build is a simple convolution-only network with two convolutional layers. Since our task is to detect the 10 digits in the MNIST database, the output of the network should be a vector of length 10, with one element corresponding to each digit. This is achieved by projecting the output of the last convolutional layer using a dense layer with `num_output_classes` outputs. We have seen this before with Logistic Regression and MLP, where features were mapped to the number of classes in the final layer. Also, note that since we will be using the `softmax` operation combined with the `cross entropy` loss function during training (see a few cells below), the final dense layer has no activation function associated with it.
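
The reason the final dense layer can stay without an activation is that the loss itself applies softmax to the raw outputs. The sketch below shows the idea in plain Python. It is an illustration of the math and not CNTK's implementation; the function names and the example scores are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def softmax_cross_entropy(logits, one_hot_label):
    """Negative log-probability that softmax assigns to the true class."""
    probabilities = softmax(logits)
    return -sum(label * math.log(p) for label, p in zip(one_hot_label, probabilities))


# Hypothetical raw outputs of the final dense layer for one image of the digit 3
logits = [0.1, -1.2, 0.3, 4.0, 0.0, -0.5, 0.2, 1.1, -2.0, 0.4]
label = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

probabilities = softmax(logits)
predicted_digit = probabilities.index(max(probabilities))
loss = softmax_cross_entropy(logits, label)
print(predicted_digit, round(loss, 4))
```

The loss is small when the largest raw output belongs to the true class and grows as the model puts probability on other digits.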

The following figure illustrates the model we are going to build. The parameters shown in the model are network hyperparameters and should be experimented with during model building. Increasing the filter shape increases the number of model parameters and the compute time, and helps the model fit the data better. However, one runs the risk of [overfitting](https://en.wikipedia.org/wiki/Overfitting). Typically, the number of filters in the deeper layers is larger than the number of filters in the layers before them. We have chosen 8 and 16 for the first and second layers, respectively.

![conv-only](https://www.cntk.ai/jup/cntk103d_convonly2.png)

In [None]:
# Define a convolutional neural network with two convolution layers and a dense output layer
def create_cnn_model(features, num_output_classes):
    with C.layers.default_options(init = C.layers.glorot_uniform(), activation = C.ops.relu):
        h = features
        h = C.layers.Convolution2D(filter_shape=(5,5),
                                   num_filters = 8,
                                   strides = (2,2),
                                   pad = True,
                                   name = 'first_conv')(h)
        h = C.layers.Convolution2D(filter_shape = (5,5),
                                  num_filters = 16,
                                  strides = (2, 2),
                                  pad = True,
                                  name = 'second_conv')(h)
        r = C.layers.Dense(num_output_classes, activation=None, name = 'classify')(h)
        return r

# Define MNIST data dimensions
input_dim = 784
input_dim_model = (1, 28, 28)
num_output_classes = 10

# Create inputs for features and labels
features = C.input(input_dim_model)
labels = C.input(num_output_classes)

# Create the CNN model while scaling the input to 0-1 range by dividing each pixel by 255.
z = create_cnn_model(features/255.0, num_output_classes)



## Understanding model parameters

In [None]:
# Print the output shapes / parameters of different components
print("Output Shape of the first convolution layer:", z.first_conv.shape)
print("Bias value of the last dense layer:", z.classify.b.value)

Understanding the number of model parameters to be estimated is key to deep learning, since it directly affects the amount of data one needs. A model with a larger number of parameters needs more data to prevent overfitting. In other words, with a fixed amount of data, one has to constrain the number of parameters. There is no golden rule relating the amount of data needed to the number of parameters in a model. However, [data augmentation](https://deeplearningmania.quora.com/The-Power-of-Data-Augmentation-2) is one way to boost the performance of model training.

In [None]:
# Number of parameters in the network
C.logging.log_number_of_parameters(z)

Our model has 2 convolution layers, each having a weight and a bias. This adds up to 4 parameter tensors. Additionally, the dense layer has weight and bias tensors. This gives 6 parameter tensors in total.

Let us now count the number of parameters:
- *First convolution layer*: There are 8 filters each of size (1 x 5 x 5) where 1 is the number of channels in the input image. This adds up to 200 values in the weight matrix and 8 bias values.


- *Second convolution layer*: There are 16 filters each of size (8 x 5 x 5) where 8 is the number of channels in the input to the second layer (= output of the first layer). This adds up to 3200 values in the weight matrix and 16 bias values.


- *Last dense layer*: There are 16 x 7 x 7 input values and it produces 10 output values corresponding to the 10 digits in the MNIST dataset. This corresponds to (16 x 7 x 7) x 10 weight values and 10 bias values.

Adding these up gives the 11274 parameters in the model.
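
The count above can be reproduced with a few lines of arithmetic. The helper names `conv_parameters` and `dense_parameters` are hypothetical; the layer sizes are the ones listed above.

```python
def conv_parameters(num_filters, input_channels, filter_shape):
    """Filter weights plus one bias per filter."""
    filter_h, filter_w = filter_shape
    return num_filters * input_channels * filter_h * filter_w + num_filters


def dense_parameters(num_inputs, num_outputs):
    """One weight per input/output pair plus one bias per output."""
    return num_inputs * num_outputs + num_outputs


first_conv = conv_parameters(num_filters=8, input_channels=1, filter_shape=(5, 5))
second_conv = conv_parameters(num_filters=16, input_channels=8, filter_shape=(5, 5))
classify = dense_parameters(num_inputs=16 * 7 * 7, num_outputs=10)

total = first_conv + second_conv + classify
print(first_conv, second_conv, classify, total)   # 208 3216 7850 11274
```

Note that the dense layer accounts for most of the parameters even in this small network.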

## Define a trainer using the SGD learner

In [None]:
# Define a trainer using a given reader and the SGD learner 
def train_model_with_SGD(model, features, labels, reader, num_samples_per_sweep, num_sweeps):
 
    # Define loss and error functions
    loss = C.cross_entropy_with_softmax(model, labels)
    error = C.classification_error(model, labels)

    # Instantiate the trainer object to drive the model training
    learning_rate = 0.2
    lr_schedule = C.learning_rate_schedule(learning_rate, C.UnitType.minibatch)
    learner = C.sgd(model.parameters, lr_schedule)
    progress_printer = ProgressPrinter(500)
    trainer = C.Trainer(model, (loss, error), [learner], [progress_printer])

    # Initialize the parameters for the trainer
    minibatch_size = 64
    num_minibatches_to_train = (num_samples_per_sweep * num_sweeps) / minibatch_size

    # Map the data streams to the input and labels
    input_map = {
        labels  : reader.streams.labels,
        features  : reader.streams.features
    } 

    # Run the trainer and perform model training
    start_time = time.time()
    for i in range(0, int(num_minibatches_to_train)):
        data = reader.next_minibatch(minibatch_size, input_map = input_map)
        trainer.train_minibatch(data)

    print(time.time() - start_time)



### Run the trainer


In [None]:
# Create the reader to the training data set
train_file = "../../Data/MNIST_train.txt"
reader = create_reader(train_file, True, input_dim, num_output_classes)
num_samples_per_sweep = 50000
num_sweeps = 10
train_model_with_SGD(z, features, labels, reader, num_samples_per_sweep, num_sweeps)
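
As a quick sanity check on the training length, the arithmetic inside `train_model_with_SGD` can be worked out by hand with the values used above. This is a plain-Python sketch; `samples_seen` and `samples_skipped` are names introduced here for illustration.

```python
num_samples_per_sweep = 50000
num_sweeps = 10
minibatch_size = 64

# Integer division mirrors the int(...) truncation in the training loop
num_minibatches_to_train = (num_samples_per_sweep * num_sweeps) // minibatch_size
samples_seen = num_minibatches_to_train * minibatch_size
samples_skipped = num_samples_per_sweep * num_sweeps - samples_seen

print(num_minibatches_to_train, samples_seen, samples_skipped)   # 7812 499968 32
```

Because the division is not exact, the last partial minibatch is dropped, which is negligible at this scale.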


# Model evaluation
## Define the helper test function

In [None]:
# Define the evaluator function
def test_model(model, features, labels, reader):
    evaluator = C.Evaluator(C.classification_error(model, labels))
    input_map = {
       features : reader.streams.features,
       labels: reader.streams.labels
    }
    
    minibatch_size = 2000
    test_result = 0.0
    num_minibatches = 0
    data = reader.next_minibatch(minibatch_size, input_map = input_map)
    while bool(data):
        test_result = test_result + evaluator.test_minibatch(data)
        num_minibatches += 1
        data = reader.next_minibatch(minibatch_size, input_map = input_map)
    return None if num_minibatches == 0 else test_result*100 / num_minibatches
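
The helper above averages the per-minibatch error rates with equal weight, which is exact when every minibatch has the same number of samples. If the last minibatch were smaller, a sample-weighted average would be more accurate. The sketch below illustrates the difference in plain Python, assuming each minibatch result is an average error per sample; the function `weighted_error_rate` and the numbers are hypothetical.

```python
def weighted_error_rate(minibatch_errors, minibatch_sizes):
    """Average per-sample error (in percent) over minibatches of unequal size."""
    total_samples = sum(minibatch_sizes)
    if total_samples == 0:
        return None
    misclassified = sum(error * size
                        for error, size in zip(minibatch_errors, minibatch_sizes))
    return 100.0 * misclassified / total_samples


# Hypothetical results: two full minibatches and a smaller final one
errors = [0.02, 0.04, 0.10]
sizes = [2000, 2000, 500]

unweighted = 100.0 * sum(errors) / len(errors)
weighted = weighted_error_rate(errors, sizes)
print(round(unweighted, 2), round(weighted, 2))
```

In this made-up case the unweighted mean overstates the error because the small final minibatch counts as much as the full ones.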



## Run the test

In [None]:
validation_file = "../../Data/MNIST_validate.txt"
reader = create_reader(validation_file, False, input_dim, num_output_classes)
error_rate = test_model(z, features, labels, reader)
print("Average validation error: {0:.2f}%".format(error_rate))



# Evolving the model
## Pooling Layer

Often one needs to control the number of parameters, especially in deep networks. Every output layer of a convolution layer (each of which corresponds to the output of one filter) can be followed by a pooling layer. Pooling layers are typically introduced to:
- Reduce the dimensionality of the previous layer (speeding up the network),
- Make the model more tolerant to changes in object location in the image. For example, even when a digit is shifted to one side of the image instead of being in the middle, the classifier would still perform the classification task well.

The calculation on a pooling node is much simpler than on a normal feedforward node.  It has no weight, bias, or activation function.  It uses a simple aggregation function (like max or average) to compute its output.  The most commonly used function is "max": a max pooling node simply outputs the maximum of the input values covered by the pooling window at its current position. The figure below shows the input values in a 4 x 4 region. The max pooling window size is 2 x 2 and starts from the top left corner. The maximum value within the window becomes the output of the region. The window is then shifted by the amount specified by the stride parameter (as shown in the figure below) and the maximum pooling operation is repeated.
![maxppool](https://cntk.ai/jup/201/MaxPooling.png)

An alternative is average pooling, which emits the average value instead of the maximum value. The two pooling operations are summarized in the animation below.

In [None]:
# Plot the max pooling and average pooling animations
images = [("https://www.cntk.ai/jup/c103d_max_pooling.gif" , 'Max pooling'),
          ("https://www.cntk.ai/jup/c103d_average_pooling.gif", 'Average pooling')]

for im in images:
    print(im[1])
    display(Image(url=im[0], width=200, height=200))
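
Both pooling operations can be sketched with one small function that slides a window over the input and aggregates the values inside it. This is a plain-Python illustration and not the CNTK implementation; `pool2d`, `average`, and the 4 x 4 input values are hypothetical.

```python
def pool2d(values, window=2, stride=2, reduce_fn=max):
    """Apply `reduce_fn` to each window of a 2D input (no padding)."""
    out_h = (len(values) - window) // stride + 1
    out_w = (len(values[0]) - window) // stride + 1
    output = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            top, left = row * stride, col * stride
            # Collect the values covered by the window at this position
            region = [values[top + i][left + j]
                      for i in range(window) for j in range(window)]
            out_row.append(reduce_fn(region))
        output.append(out_row)
    return output


def average(region):
    """Mean of the values in one pooling window."""
    return sum(region) / len(region)


# Hypothetical 4 x 4 input region
values = [[1, 3, 2, 1],
          [4, 6, 5, 0],
          [7, 2, 9, 8],
          [1, 0, 3, 4]]

print(pool2d(values, reduce_fn=max))       # [[6, 5], [7, 9]]
print(pool2d(values, reduce_fn=average))   # [[3.5, 2.0], [2.5, 6.0]]
```

With a 2 x 2 window and a stride of 2, the 4 x 4 input shrinks to 2 x 2, and no weights are involved in either case.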

# Typical convolution network

![mnist-conv-mp](http://www.cntk.ai/jup/conv103d_mnist-conv-mp.png)

A typical CNN contains a set of alternating convolution and pooling layers followed by a dense output layer for classification. You will find variants of this structure in many classical deep networks (VGG, AlexNet etc).    

The illustrations are presented in the context of 2-dimensional (2D) images, but the concept and the CNTK components can operate on data of any dimensionality. The above schematic shows 2 convolution layers and 2 max-pooling layers. A typical strategy is to increase the number of filters in the deeper layers while reducing the spatial size of each intermediate layer.

Typical convolutional networks interleave convolution and max-pooling layers. The previous model had only convolution layers. In this section, you will create a model with the following architecture.

![conv-only](https://www.cntk.ai/jup/cntk103d_conv_max2.png)

You will use the CNTK [MaxPooling](https://cntk.ai/pythondocs/cntk.layers.layers.html#cntk.layers.layers.MaxPooling) function to achieve this task. Starting from the `create_cnn_model` function defined earlier, you will add the MaxPooling operation.

Hint: We provide the solution a few cells below. Refrain from looking ahead and try to add the layer yourself first.

## Define and train the updated model

In [None]:
# Define a convolutional neural network - v2 
def create_cnn2_model(features, num_output_classes):
    with C.layers.default_options(init = C.layers.glorot_uniform(), activation = C.ops.relu):
        h = features
        h = C.layers.Convolution2D(filter_shape=(5,5),
                                   num_filters = 8,
                                   strides = (2,2),
                                   pad = True,
                                   name = 'first_conv')(h)
        h = C.layers.MaxPooling(filter_shape=(2,2),
                               strides=(2,2),
                               name = "first_max")(h)
        h = C.layers.Convolution2D(filter_shape = (5,5),
                                  num_filters = 16,
                                  strides = (2, 2),
                                  pad = True,
                                  name = 'second_conv')(h)
        h = C.layers.MaxPooling(filter_shape=(3,3),
                                strides=(3,3),
                                name = 'second_max')(h)
        r = C.layers.Dense(num_output_classes, activation=None, name = 'classify')(h)
        return r
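
It can help to trace how the spatial size shrinks through this model. The sketch below does so under two assumptions that are not stated in this lab: a padded convolution produces one output per stride position, and the pooling layers use no padding. The helper `output_size` is hypothetical; printing the `.shape` of the named layers in CNTK would be the authoritative check.

```python
import math


def output_size(input_size, filter_size, stride, pad):
    """Spatial output length of a convolution or pooling layer along one axis."""
    if pad:
        return math.ceil(input_size / stride)
    return (input_size - filter_size) // stride + 1


# (layer name, filter size, stride, pad) in the order used by create_cnn2_model;
# pad=False for the pooling layers is an assumption
layers = [("first_conv", 5, 2, True),
          ("first_max", 2, 2, False),
          ("second_conv", 5, 2, True),
          ("second_max", 3, 3, False)]

size = 28   # MNIST images are 28 x 28
sizes = {}
for name, filter_size, stride, pad in layers:
    size = output_size(size, filter_size, stride, pad)
    sizes[name] = size
    print("{}: {} x {}".format(name, size, size))

dense_inputs = 16 * size * size   # 16 filters in the second convolution layer
print("Inputs to the dense layer:", dense_inputs)
```

Under these assumptions the image is reduced 28 -> 14 -> 7 -> 4 -> 1, so the dense layer sees only 16 values. Using a stride of 1 in the convolution layers is one hyperparameter choice worth trying if that turns out to be too aggressive.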


In [None]:
# Create the v2 model (with max pooling) and train it on the training set
zv2 = create_cnn2_model(features/255.0, num_output_classes)
train_file = "../../Data/MNIST_train.txt"
reader = create_reader(train_file, True, input_dim, num_output_classes)
train_model_with_SGD(zv2, features, labels, reader, num_samples_per_sweep, num_sweeps)


In [None]:
# Evaluate the v2 model on the validation set
validation_file = "../../Data/MNIST_validate.txt"
reader = create_reader(validation_file, False, input_dim, num_output_classes)
error_rate = test_model(zv2, features, labels, reader)
print("Average validation error for v2 model: {0:.2f}%".format(error_rate))


# Hackathon

Try to improve the performance of the model. 

Hints:
- Try a different number of feature maps
- Try a different number of convolutional layers
- Try two dense layers in the final classification part of the network

## Final testing


DON'T CHEAT. DON'T USE MNIST_test.txt FOR MODEL TRAINING AND SELECTION



In [None]:
test_file = '../../Data/MNIST_test.txt'
reader = create_reader(test_file, False, input_dim, num_output_classes)
# z is the first (convolution-only) model; pass your final model here instead
error_rate = test_model(z, features, labels, reader)
print("Average test error: {0:.2f}%".format(error_rate))