In [None]:
from IPython.display import Image

To serve the slides: 
- Open a terminal window 
- `jupyter nbconvert /full/path/to/notebook.ipynb --to slides --post serve`

## Week 3 – Convolutional Neural Networks


# Outline

- Introduction
- The convolutional layer
    - Filter, stacking, and implementation
- Pooling layer
- CNN Architectures
    - LeNet-5, AlexNet, GoogLeNet, ResNet
    

# Learning Objectives

- Understand:
    - Where CNNs came from
    - What CNN building blocks look like
    - How to implement them using TensorFlow and Keras
- Gain exposure to the theory and code needed to train and evaluate CNNs
- Appreciate the diverse range of applications of CNNs
- Grasp practical considerations: memory, training time, parameters
- Review and study the most important CNN architectures:
    - LeNet-5, AlexNet, GoogLeNet, ResNet
- Use MNIST as a sandbox to understand the different levels of abstraction in CNNs


![Conv. Net.](imgs/convolutional1.png "Conv Net")
    

# Architecture of the Visual Cortex

- David H. Hubel and Torsten Wiesel performed a series of experiments on cats in 1958 and 1959 that gave crucial insights into the structure of the visual cortex (they received the Nobel Prize in Physiology or Medicine in 1981).
- They showed that many neurons in the visual cortex have a small local receptive field:
    - The receptive fields of different neurons may overlap, and together they tile the whole visual field. 
    - They showed that some neurons react only to images of horizontal lines, while others react only to lines with different orientations. 
    - They also noticed that some neurons have larger receptive fields, and they react to more complex patterns that are combinations of the lower-level patterns. 
- Their observations led to the idea that the higher-level neurons are based on the outputs of neighboring lower-level neurons

![Receptive Fields](imgs/receptivefields.png "Conv Net")



# Why Not a Fully Connected Network?

- A fully connected network works well for small images (e.g., MNIST), but it breaks down for larger images because of the large number of parameters required.
    - For example, a 100 × 100 image has 10,000 pixels.
    - If the first layer has just 1,000 neurons, this means a total of 10 million connections.
    - And that’s just the first layer!
- CNNs solve this problem using partially connected layers.
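
The arithmetic above can be checked in a few lines. The comparison with a partially connected layer is only a sketch: the 3 × 3 receptive field size is an assumption chosen for illustration, not a figure from the slides.

```python
# Fully connected: every neuron is connected to every pixel.
height, width = 100, 100
n_pixels = height * width              # 10,000 pixels
n_neurons = 1_000
fc_connections = n_pixels * n_neurons

# Partially connected (hypothetical 3 x 3 receptive field per neuron).
receptive_field = 3 * 3
local_connections = receptive_field * n_neurons

print(f"Fully connected:     {fc_connections:,} connections")
print(f"Partially connected: {local_connections:,} connections")
```

Under these assumptions, the partially connected layer needs roughly a thousand times fewer connections.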

# Introduction

- Convolutional neural networks (CNNs) emerged from the study of the brain’s visual cortex, and they have been used in image recognition since the 1980s. 

- In the last few years, thanks to the increase in computational power, the amount of available training data, and better training techniques for deep nets, CNNs have managed to achieve superhuman performance on some complex visual tasks.

- They power image search services, self-driving cars, automatic video classification systems, and more. They are also successful at other tasks, such as voice recognition or natural language processing.

- Today we will present where CNNs came from, what their building blocks look like, and how to implement them using TensorFlow.

# Setup

First, let's make sure this notebook has all the required libraries, import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures:

In [None]:
#File containing all definitions and utility functions.
from setups import *
from plotting import *
%matplotlib inline
# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "cnn"

# Convolutional Layer

- Neurons in the first convolutional layer are not connected to every single pixel in the input image. The neurons are only connected to pixels in their receptive fields.


![Conv Fields](imgs/convFirstLayer.png "First Layer Net")


- In turn, each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer. 


- This architecture allows the network to concentrate on low-level features in the first hidden layer, then assemble them into higher-level features in the next hidden layer, and so on. 

# Convolutional Layer

- A neuron located in row $i$, column $j$ of a given layer is connected to the outputs of the neurons in the previous layer located in rows $i$ to $i + f_h - 1$, columns $j$ to $j + f_w - 1$, where $f_h$ and $f_w$ are the height and width of the receptive field.


### Connections between layers and zero padding


- In order for a layer to have the same height and width as the previous layer, it is common to add zeros around the inputs.

![ZeroPadding](imgs/convZeroPadding.png "Zero Padding")




# Convolutional Layer

## Stride

- It is also possible to connect a large input layer to a much smaller layer by spacing out the receptive fields. The distance between two consecutive receptive fields is called the stride. 



##### A 5 × 7 input layer (plus zero padding) is connected to a 3 × 4 layer, using 3 × 3 receptive fields and a stride of two. 

![TwoPadding](imgs/convTwoPadding.png "Two Padding")


- By using a stride greater than one, the dimensionality of the layer can be reduced.
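
As a rough check of the figure caption, the output size along one dimension can be computed from the input size and the stride. This sketch assumes the rule TensorFlow applies with SAME padding (output size = ceil(input size / stride)); the function name is hypothetical.

```python
import math

def same_padding_output_size(input_size: int, stride: int) -> int:
    """Output size along one dimension when zeros are added as needed."""
    return math.ceil(input_size / stride)

# 5 x 7 input layer with a stride of two, as in the figure caption.
out_h = same_padding_output_size(5, stride=2)
out_w = same_padding_output_size(7, stride=2)
print(out_h, out_w)
```

This reproduces the 3 × 4 layer described in the caption.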

## Filters

- A neuron’s weights can be represented as a small image the size of the receptive field. 
    - Example: Two possible sets of weights, called filters (or convolution kernels). 
        - The first one is represented as a black square with a vertical white line in the middle. Neurons using these weights will ignore everything in their receptive field except for the central **vertical** line 
        - The second filter is a black square with a horizontal white line in the middle. Neurons using these weights will ignore everything in their receptive field except for the central **horizontal** line.

![ConvFIlters](imgs/convFilters.png "convFilters")

- A layer full of neurons using the same filter gives you a feature map, which highlights the areas in an image that are most similar to the filter. 
    - If all neurons in a layer use the same vertical line filter, the layer output will enhance the white vertical lines, while the rest gets blurred.
    - Similarly, for the horizontal line filter. 
- During training, a CNN finds the most useful filters for its task, and it learns to combine them into more complex patterns 

## Stacking Multiple Feature Maps

- A convolutional layer is composed of several feature maps of equal sizes, so it is more accurately represented in 3D.
    - Within one feature map, all neurons share the same parameters.
    - A convolutional layer simultaneously applies multiple filters to its inputs, making it capable of detecting multiple features anywhere in its inputs.
    - Because all neurons in a feature map share the same parameters, the number of parameters in the model is dramatically reduced. Most importantly, once the CNN has learned to recognize a pattern in one location, it can recognize it in any other location.
    - Input images are also composed of multiple sublayers: one per color channel.
    
![convStackLayers](imgs/convStackLayers.png "convStackLayers")   

## Stacking Multiple Feature Maps

- A neuron located in row $i$, column $j$ of the feature map $k$ in a given convolutional layer $l$ is connected to the outputs of the neurons in the previous layer $l - 1$, located in rows $i \times s_h$ to $i \times s_h + f_h - 1$ and columns $j \times s_w$ to $j \times s_w + f_w - 1$, across all feature maps (in layer $l - 1$).

![image.png](imgs/convEquation.png)

- $z_{i,j,k}$ is the output of the neuron located in row $i$, column $j$ in feature map $k$ of the convolutional layer (layer $l$).

- $s_h$ and $s_w$ are the vertical and horizontal strides, $f_h$ and $f_w$ are the height and width of the receptive field, and $f_{n'}$ is the number of feature maps in the previous layer (layer $l - 1$).

- $x_{i', j', k'}$ is the output of the neuron located in layer $l - 1$, row $i'$, column $j'$, feature map $k'$.

- $b_k$ is the bias term for feature map $k$ (in layer $l$). 

- $w_{u, v, k', k}$ is the connection weight between any neuron in feature map $k$ of the layer $l$ and its input located at row $u$, column $v$ (relative to the neuron’s receptive field), and feature map $k'$.
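
The definitions above can be turned into a direct (and deliberately slow) NumPy sketch. It assumes the equation in the figure is the usual weighted sum $z_{i,j,k} = b_k + \sum_{u=0}^{f_h-1} \sum_{v=0}^{f_w-1} \sum_{k'=0}^{f_{n'}-1} x_{i',j',k'} \cdot w_{u,v,k',k}$ with $i' = i \times s_h + u$ and $j' = j \times s_w + v$. The function name `conv_layer_naive` and the toy input are hypothetical, and no zero padding is applied.

```python
import numpy as np

def conv_layer_naive(x, w, b, s_h=1, s_w=1):
    """Evaluate the weighted sum for one input image, without zero padding.

    x: outputs of layer l-1, shape (height, width, f_n_prev)
    w: connection weights, shape (f_h, f_w, f_n_prev, f_n)
    b: bias terms, shape (f_n,)
    """
    f_h, f_w, _, f_n = w.shape
    out_h = (x.shape[0] - f_h) // s_h + 1
    out_w = (x.shape[1] - f_w) // s_w + 1
    z = np.zeros((out_h, out_w, f_n), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Receptive field: rows i*s_h .. i*s_h + f_h - 1, and likewise for columns.
            patch = x[i * s_h:i * s_h + f_h, j * s_w:j * s_w + f_w, :]
            for k in range(f_n):
                z[i, j, k] = b[k] + np.sum(patch * w[:, :, :, k])
    return z

# Toy input: a 5 x 5 single-channel image with a vertical line in column 2.
x = np.zeros((5, 5, 1), dtype=np.float32)
x[:, 2, 0] = 1.0

# One 3 x 3 vertical-line filter (ones in the middle column).
w = np.zeros((3, 3, 1, 1), dtype=np.float32)
w[:, 1, 0, 0] = 1.0
b = np.zeros(1, dtype=np.float32)

z = conv_layer_naive(x, w, b)
print(z[:, :, 0])
```

The feature map responds only where the filter’s middle column lines up with the vertical line in the input.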

# TensorFlow Implementation

- In TensorFlow, each input image is typically represented as a 3D tensor of shape [height, width, channels]. 
- A mini-batch is represented as a 4D tensor of shape [mini-batch size, height, width, channels]. 
- The weights of a convolutional layer are represented as a 4D tensor of shape $[f_h, f_w, f_{n'}, f_n]$. The bias terms of a convolutional layer are simply represented as a 1D tensor of shape $[f_n]$.

The following code loads two sample images. Then it creates two 7 × 7 filters and applies them to both images using a convolutional layer built with TensorFlow’s `tf.nn.conv2d()` function.

In [None]:
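import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf  # this notebook uses the TensorFlow 1.x graph API
from sklearn.datasets import load_sample_image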

# Load sample images
china = load_sample_image("china.jpg")
flower = load_sample_image("flower.jpg")
dataset = np.array([china, flower], dtype=np.float32)
batch_size, height, width, channels = dataset.shape

# Create 2 filters
filters = np.zeros(shape=(7, 7, channels, 2), dtype=np.float32)
filters[:, 3, :, 0] = 1  # vertical line
filters[3, :, :, 1] = 1  # horizontal line

# Create a graph with input X plus a convolutional layer applying the 2 filters
X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
convolution = tf.nn.conv2d(X, filters, strides=[1,2,2,1], padding="SAME")

with tf.Session() as sess:
    output = sess.run(convolution, feed_dict={X: dataset})

plt.imshow(output[0, :, :, 1], cmap="gray") # plot 1st image's 2nd feature map
plt.show()

In [None]:
for image_index in (0, 1):
    for feature_map_index in (0, 1):
        plot_image(output[image_index, :, :, feature_map_index])
        plt.show()

### TF Padding

- Padding must be either "VALID" or "SAME":

    - If set to "VALID", the convolutional layer does not use zero padding, and may ignore some rows and columns at the bottom and right of the input image.
    - If set to "SAME", the convolutional layer uses zero padding if necessary.
    
    
    
    
![tfPadding.png](imgs/tfPadding.png)
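
The difference between the two padding modes can be sketched for a single dimension. The helper name and the sizes used (input width 13, kernel width 6, stride 5) are illustrative assumptions; the formulas follow the standard TensorFlow definitions of VALID and SAME.

```python
import math

def conv_output_and_padding(input_size, kernel_size, stride, padding):
    """Return (output_size, total_zero_padding, ignored_inputs) for one dimension."""
    if padding == "VALID":
        output_size = (input_size - kernel_size) // stride + 1
        covered = (output_size - 1) * stride + kernel_size
        return output_size, 0, input_size - covered
    if padding == "SAME":
        output_size = math.ceil(input_size / stride)
        needed = (output_size - 1) * stride + kernel_size
        return output_size, max(needed - input_size, 0), 0
    raise ValueError('padding must be "VALID" or "SAME"')

valid = conv_output_and_padding(13, 6, 5, "VALID")
same = conv_output_and_padding(13, 6, 5, "SAME")
print("VALID:", valid)
print("SAME: ", same)
```

With these sizes, VALID produces 2 outputs and ignores the last 2 input columns, while SAME produces 3 outputs by adding 3 zeros in total.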
    
    

## Training

- While training CNNs the algorithm will discover the best filters automatically. 
- `tf.layers.conv2d()` creates the filters variable (named `kernel`) and initializes it randomly. It also creates the bias variable.

In [None]:
# Creates an input placeholder followed by a convolutional layer 
# with two 7 × 7 filters (producing two feature maps), using 2 × 2 strides 

X = tf.placeholder(shape=(None, height, width, channels), dtype=tf.float32)
conv = tf.layers.conv2d(X, filters=2, kernel_size=7, strides=[2,2],
                        padding="SAME")

- Convolutional layers have several hyperparameters:
    - the number of filters,
    - the filter height and width,
    - the strides,
    - and the padding type.
- One can use cross-validation to find the right hyperparameter values, but this is very time-consuming.

## Memory requirements

- Convolutional layers require a huge amount of RAM, especially during training, because the reverse pass of backpropagation requires all the intermediate values computed during the forward pass.

Example: A convolutional layer with 5 × 5 filters, outputting 200 feature maps of size 150 × 100, with stride 1 and SAME padding. 
- If the input is a 150 × 100 RGB image (three channels), then the number of parameters is (5 × 5 × 3 + 1) × 200 = 15,200 
- However, each of the 200 feature maps contains 150 × 100 neurons, and each of these neurons needs to compute a weighted sum of its 5 × 5 × 3 = 75 inputs: that’s a total of 225 million float multiplications. 
- If the feature maps are represented using 32-bit floats, then the convolutional layer’s output will occupy 200 × 150 × 100 × 32 = 96 million bits (12 million bytes, or about 11.4 MiB) of RAM.
- If a training batch contains 100 instances, then this layer will use up over 1 GB of RAM!
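
The numbers in this example can be reproduced with a short calculation. The variable names are hypothetical; the values are the ones given above.

```python
f_h, f_w, in_channels = 5, 5, 3          # 5 x 5 filters on an RGB input
n_maps, map_h, map_w = 200, 150, 100     # 200 feature maps of size 150 x 100

n_params = (f_h * f_w * in_channels + 1) * n_maps            # +1 for the bias
n_multiplications = n_maps * map_h * map_w * (f_h * f_w * in_channels)
output_bytes = n_maps * map_h * map_w * 32 // 8              # 32-bit floats
batch_bytes = 100 * output_bytes                             # batch of 100 instances

print(f"Parameters:      {n_params:,}")
print(f"Multiplications: {n_multiplications:,}")
print(f"Output size:     {output_bytes / 2**20:.1f} MiB per instance")
print(f"Batch of 100:    {batch_bytes / 2**30:.2f} GiB")
```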

If training crashes because of an out-of-memory error, try:
- reducing the mini-batch size,
- reducing dimensionality using a stride,
- removing a few layers,
- using 16-bit floats instead of 32-bit floats,
- distributing the CNN across multiple devices.

# Pooling layer

- The goal of a pooling layer is to subsample the input image in order to reduce the computational load, memory usage, and number of parameters.
- Each neuron in a pooling layer is connected to the outputs of a limited number of neurons in the previous layer, located within a small rectangular receptive field.
- Parameters: size, stride, and padding type.
- Note that a pooling neuron has no weights; it aggregates its inputs using a function such as the max or mean.

![poolingLayer.png](imgs/poolingLayer.png)

In this example, we use a 2 × 2 pooling kernel, a stride of 2, no padding, and max aggregation.


- A small 2 × 2 kernel with a stride of 2 makes the output two times smaller in both directions (so its area is four times smaller).

- A pooling layer works on every input channel independently, so the output depth is the same as the input depth. 
- You may alternatively pool over the depth dimension.
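
To make the aggregation concrete, here is a minimal NumPy sketch of 2 × 2 max pooling with a stride of 2 and no padding. The function name `max_pool_2x2` and the toy input are hypothetical; each channel is pooled independently.

```python
import numpy as np

def max_pool_2x2(x):
    """2 x 2 max pooling, stride 2, no padding, on an (H, W, C) array."""
    h, w, c = x.shape
    h2, w2 = h // 2, w // 2
    # Drop a trailing odd row/column (no padding), then group pixels into 2 x 2 blocks.
    blocks = x[:h2 * 2, :w2 * 2, :].reshape(h2, 2, w2, 2, c)
    return blocks.max(axis=(1, 3))

x = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
pooled = max_pool_2x2(x)
print(x[:, :, 0])
print(pooled[:, :, 0])
```

The 4 × 4 input becomes a 2 × 2 output, and the depth (one channel here) is unchanged.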

The following code creates a max pooling layer using a 2 × 2 kernel, stride 2, and no padding, then applies it to all the images in the dataset:

In [None]:
batch_size, height, width, channels = dataset.shape

X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
max_pool = tf.nn.max_pool(X, ksize=[1,2,2,1], strides=[1,2,2,1],padding="VALID")

with tf.Session() as sess:
    output = sess.run(max_pool, feed_dict={X: dataset})

plt.imshow(output[0].astype(np.uint8))  # plot the output for the 1st image
plt.show()

- The ksize argument contains the kernel shape along all four dimensions of the input tensor:

`              [batch size, height, width, channels]`.

- To create an average pooling layer, just use the `avg_pool()` function instead of `max_pool()`.

# CNN Architectures

- CNN architectures stack a few convolutional layers, then a pooling layer, then another few convolutional layers, then another pooling layer, and so on. 
- The image gets smaller and smaller as it progresses through the network, but it also typically gets deeper and deeper. 
- At the top of the stack, a regular feedforward neural network is added, and the final layer outputs the prediction.


![typicalConvolutional.png](imgs/typicalConvolutional.png)


- Variants of this fundamental architecture have been developed, leading to amazing advances in the field. A good measure of this progress is the error rate in competitions such as the ILSVRC ImageNet challenge. 

- We will first look at the classical LeNet-5 architecture (1998), then three of the winners of the ILSVRC challenge: AlexNet (2012), GoogLeNet (2014), and ResNet (2015).

# LeNet-5

The LeNet-5 architecture was created by Yann LeCun in 1998 and has been widely used for handwritten digit recognition (MNIST).

![lenet.jpg](imgs/lenet.jpg)


- MNIST images are 28 × 28 pixels, but they are zero-padded to 32 × 32 pixels and normalized before being fed to the network. 

- Each neuron in a pooling layer computes the mean of its inputs, then multiplies the result by a learnable coefficient (one per map), adds a learnable bias term (again, one per map), and finally applies the activation function.

- Most neurons in C3 maps are connected to neurons in only three or four S2 maps.

- Output layer: each neuron outputs the square of the Euclidean distance between its input vector and its weight vector. Each output measures how much the image belongs to a particular digit class. 

[Yann LeCun’s website](http://yann.lecun.com/exdb/lenet/index.html) features great demos of LeNet-5 classifying digits.




# AlexNet

- The AlexNet CNN architecture won the 2012 ImageNet ILSVRC challenge:
    - It achieved a 17% top-5 error rate, while the second best achieved only 26%!
- It was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. 
- It is quite similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer. 

![alexnet.jpg](imgs/alexnet.png)


- To reduce overfitting, the authors used two regularization techniques:
    - They applied dropout (with a 50% dropout rate) during training to the outputs of layers F8 and F9.
    - They performed data augmentation by randomly shifting the training images by various offsets, flipping them horizontally, and changing the lighting conditions.

- AlexNet also uses local response normalization immediately after the ReLU step of layers C1 and C3. This form of normalization makes the neurons that most strongly activate inhibit neurons at the same location but in neighboring feature maps.
    - This normalization encourages different feature maps to specialize, pushing them apart and forcing them to explore a wider range of features, ultimately improving generalization.

![eqAlexnet.jpg](imgs/eqAlexnet.png)


- $b_i$ is the normalized output of the neuron located in feature map $i$, at some row $u$ and column $v$.

- $a_i$ is the activation of that neuron after the ReLU step, but before normalization.

- $k, \alpha, \beta$ and $r$ are hyperparameters. $k$ is the bias, and $r$ is  the depth radius.

- $f_n$ is the number of feature maps.

- For example, if $r = 2$ and a neuron has a strong activation, it will inhibit the activation of the neurons located in the feature maps immediately above and below its own.

- In AlexNet, the hyperparameters are set as follows: $r = 2$, $\alpha = 0.00002$, $\beta = 0.75$, and $k = 1$. 
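
The normalization can be sketched in NumPy. This assumes the equation in the figure has the form $b_i = a_i \left(k + \alpha \sum_{j=j_\text{low}}^{j_\text{high}} a_j^2\right)^{-\beta}$ with $j_\text{high} = \min(i + r/2,\, f_n - 1)$ and $j_\text{low} = \max(0,\, i - r/2)$, which matches the symbol definitions and the $r = 2$ example above. The function name and the toy activations are hypothetical.

```python
import numpy as np

def local_response_normalization(a, k=1.0, alpha=0.00002, beta=0.75, r=2):
    """Normalize activations a of shape (height, width, f_n) across feature maps."""
    f_n = a.shape[-1]
    b = np.empty_like(a)
    for i in range(f_n):
        j_low = max(0, i - r // 2)
        j_high = min(i + r // 2, f_n - 1)
        # Sum of squared activations at the same location in neighboring maps.
        sum_sq = np.sum(a[..., j_low:j_high + 1] ** 2, axis=-1)
        b[..., i] = a[..., i] * (k + alpha * sum_sq) ** (-beta)
    return b

# One location, five feature maps; map 2 is strongly activated.
a = np.array([[[1.0, 1.0, 100.0, 1.0, 0.0]]])
b = local_response_normalization(a)
print(b[0, 0])
```

In this toy example the neurons in maps 1 and 3, which sit next to the strongly activated map 2, are scaled down more than the neuron in map 0, which is outside its neighborhood.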

# GoogLeNet

- The GoogLeNet architecture was developed by Christian Szegedy et al. from Google Research.
- The increase in performance mainly comes from the network being much deeper than previous CNNs.
- GoogLeNet uses sub-networks called inception modules (think of each one as producing output feature maps that capture complex patterns at various scales), which allow GoogLeNet to use parameters much more efficiently than previous architectures:
    - GoogLeNet actually has 10 times fewer parameters than AlexNet.

Inception module. “3 × 3 + 2(S)” means that the layer uses a 3 × 3 kernel, stride 2, and SAME padding.
![inceptionmodule.png](imgs/inceptionmodule.png)


- The input signal to the inception module is copied and fed to four different layers. 
- All convolutional layers use the ReLU activation function.
- The second set of convolutional layers uses different kernel sizes (1 × 1, 3 × 3, and 5 × 5), allowing them to capture patterns at different scales. 
- Every single layer uses a stride of 1 and SAME padding, so their outputs all have the same height and width as their inputs. This makes it possible to concatenate all the outputs along the depth dimension in the final depth concat layer.

- The 1 × 1 kernels serve two purposes:

    - Dimensionality reduction: they are configured to output many fewer feature maps than their inputs, so they act as bottleneck layers.
    - Added expressiveness: each pair of convolutional layers (`[1 × 1, 3 × 3]` and `[1 × 1, 5 × 5]`) acts like a single, powerful convolutional layer, capable of capturing more complex patterns. 
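
A 1 × 1 convolution with stride 1 is just a per-pixel linear map across feature maps, which the following NumPy sketch illustrates. All sizes here (28 × 28 maps, 192 input maps, 16 bottleneck maps, 32 output maps) are assumptions chosen for illustration, not values read from the figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 28 x 28 feature maps, 192 input maps reduced to 16.
x = rng.standard_normal((28, 28, 192))
w = rng.standard_normal((192, 16))    # 1 x 1 kernel: one weight per (input map, output map)
b = np.zeros(16)

# At each pixel, mix the 192 input maps into 16 output maps, then apply ReLU.
bottleneck = np.maximum(x @ w + b, 0.0)

# Parameter count of a 5 x 5 layer with 32 maps, with and without the bottleneck.
params_5x5_direct = (5 * 5 * 192 + 1) * 32
params_with_bottleneck = (1 * 1 * 192 + 1) * 16 + (5 * 5 * 16 + 1) * 32

print(bottleneck.shape)
print(f"{params_5x5_direct:,} vs {params_with_bottleneck:,} parameters")
```

Under these assumed sizes, placing the 1 × 1 layer first cuts the parameter count by roughly a factor of ten while keeping the height and width unchanged.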

- The GoogLeNet CNN includes nine inception modules that actually contain three layers each. 
- The six numbers in the inception modules represent the number of feature maps output by each convolutional layer in the module. All the convolutional layers use the ReLU activation function.


![googlelenet.png](imgs/googlelenet.png)


- The first two layers divide the image’s height and width by 4.
- Then the local response normalization layer ensures that the previous layers learn a wide variety of features.
- Two convolutional layers follow, where the first acts like a bottleneck layer.
- Next, a max pooling layer reduces the image height and width by 2.
- Then comes the tall stack of nine inception modules, interleaved with a couple of max pooling layers to reduce dimensionality.
- Next, the average pooling layer uses a kernel the size of the feature maps with VALID padding, outputting 1 × 1 feature maps. This makes it unnecessary to have several fully connected layers at the top of the CNN, considerably reducing the number of parameters in the network and limiting the risk of overfitting.
- The last layers are: dropout for regularization, then a fully connected layer with a softmax activation function to output estimated class probabilities.

# ResNet

- Developed by Kaiming He et al., ResNet won the ILSVRC 2015 challenge using an extremely deep CNN composed of 152 layers.
- Some of the connections are skipped (also called shortcut connections): the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack. 

- What is the residual part?
    - When training a neural network, the goal is to make it model a target function $h(x)$.
    - If we add the input $x$ to the output of the network (a skip connection), then the network is forced to model $f(x) = h(x) - x$ rather than $h(x)$. This is called residual learning.


![reslearning.png](imgs/reslearning.png)



- When a regular neural network is initialized, its weights are close to zero, so it outputs values close to zero. If we add a skip connection, the resulting network initially just outputs a copy of its inputs; in other words, it starts out modeling the identity function.
- If the target function is close to the identity function, this will speed up training.

- With the skip connections, the network can start making progress even if some layers have not started learning yet 
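
The effect of a skip connection at initialization can be illustrated with a toy example. This is only a sketch: it uses small dense layers instead of convolutional ones, and the function name, shapes, and weight scale are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def residual_unit(x, w1, w2):
    """Toy dense residual unit: output = f(x) + x, with f(x) = relu(x @ w1) @ w2."""
    f_x = np.maximum(x @ w1, 0.0) @ w2
    return f_x + x    # the skip connection adds the input back

x = rng.standard_normal((4, 8))
# Weights close to zero, as right after initialization.
w1 = rng.standard_normal((8, 8)) * 1e-3
w2 = rng.standard_normal((8, 8)) * 1e-3

with_skip = residual_unit(x, w1, w2)
without_skip = np.maximum(x @ w1, 0.0) @ w2

print("max |output - input| with skip:", np.abs(with_skip - x).max())
print("max |output| without skip:     ", np.abs(without_skip).max())
```

With near-zero weights, the unit with the skip connection passes its input through almost unchanged, while the same layers without it output values close to zero.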

Deep network vs ResNet
![resvsdeep.png](imgs/resvsdeep.png)



- The network starts and ends exactly like GoogLeNet, with a very deep stack of residual units in between.
- Each residual unit is composed of two convolutional layers, with Batch Normalization (BN) and ReLU activation.



![googleres.png](imgs/googleres.png)



- The number of feature maps is doubled every few residual units, at the same time as their height and width are halved. 


![skipconnection.png](imgs/skipconnection.png)


- ResNet-34 is the ResNet with 34 layers,
    - It contains: three residual units that output 64 feature maps, 4 RUs with 128 maps, 6 RUs with 256 maps, and 3 RUs with 512 maps.

- ResNet-152 uses different residual units, which have three convolutional layers:
    - first a 1 × 1 convolutional layer with just 64 feature maps,
    - then a 3 × 3 layer with 64 feature maps,
    - and finally another 1 × 1 convolutional layer with 256 feature maps.
    - ResNet-152 contains three such RUs that output 256 maps, then 8 RUs with 512 maps, 36 RUs with 1,024 maps, and finally 3 RUs with 2,048 maps.
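
The layer counts in the names can be recovered from the residual unit counts listed above. This sketch assumes the usual counting convention (the first convolutional layer, the convolutional layers inside the residual units, and the final fully connected layer); the function name is hypothetical.

```python
def resnet_depth(units_per_stage, convs_per_unit):
    """First conv layer + conv layers inside residual units + final fully connected layer."""
    return 1 + sum(units_per_stage) * convs_per_unit + 1

resnet34 = resnet_depth([3, 4, 6, 3], convs_per_unit=2)
resnet152 = resnet_depth([3, 8, 36, 3], convs_per_unit=3)
print(resnet34, resnet152)
```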

- Other architectures to consider: 
    - VGGNet (runner-up of the ILSVRC 2014 challenge)
    - Inception-v4 (which merges the ideas of GoogLeNet and ResNet and achieves close to 3% top-5 error rate on ImageNet classification)

# MNIST Example

Note: instead of using the `fully_connected()`, `conv2d()` and `dropout()` functions from the `tensorflow.contrib.layers` module (as in the book), we now use the `dense()`, `conv2d()` and `dropout()` functions (respectively) from the `tf.layers` module, which did not exist when this chapter was written. This is preferable because anything in contrib may change or be deleted without notice, while `tf.layers` is part of the official API. As you will see, the code is mostly the same.



For all these functions:
* the `scope` parameter was renamed to `name`, and the `_fn` suffix was removed in all the parameters that had it (for example the `activation_fn` parameter was renamed to `activation`).

The other main differences in `tf.layers.dense()` are:
* the `weights` parameter was renamed to `kernel` (and the weights variable is now named `"kernel"` rather than `"weights"`),
* the default activation is `None` instead of `tf.nn.relu`

The other main differences in `tf.layers.conv2d()` are:
* the `num_outputs` parameter was renamed to `filters`,
* the `stride` parameter was renamed to `strides`,
* the default `activation` is now `None` instead of `tf.nn.relu`.

The other main differences in `tf.layers.dropout()` are:
* it takes the dropout rate (`rate`) rather than the keep probability (`keep_prob`). Of course, `rate == 1 - keep_prob`,
* the `is_training` parameter was renamed to `training`.

In [None]:
height = 28
width = 28
channels = 1
n_inputs = height * width

conv1_fmaps = 32
conv1_ksize = 3
conv1_stride = 1
conv1_pad = "SAME"

conv2_fmaps = 64
conv2_ksize = 3
conv2_stride = 2
conv2_pad = "SAME"

pool3_fmaps = conv2_fmaps

n_fc1 = 64
n_outputs = 10

reset_graph()

with tf.name_scope("inputs"):
    X = tf.placeholder(tf.float32, shape=[None, n_inputs], name="X")
    X_reshaped = tf.reshape(X, shape=[-1, height, width, channels])
    y = tf.placeholder(tf.int32, shape=[None], name="y")

conv1 = tf.layers.conv2d(X_reshaped, filters=conv1_fmaps, kernel_size=conv1_ksize,
                         strides=conv1_stride, padding=conv1_pad,
                         activation=tf.nn.relu, name="conv1")
conv2 = tf.layers.conv2d(conv1, filters=conv2_fmaps, kernel_size=conv2_ksize,
                         strides=conv2_stride, padding=conv2_pad,
                         activation=tf.nn.relu, name="conv2")

with tf.name_scope("pool3"):
    pool3 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")
    pool3_flat = tf.reshape(pool3, shape=[-1, pool3_fmaps * 7 * 7])

with tf.name_scope("fc1"):
    fc1 = tf.layers.dense(pool3_flat, n_fc1, activation=tf.nn.relu, name="fc1")

with tf.name_scope("output"):
    logits = tf.layers.dense(fc1, n_outputs, name="output")
    Y_proba = tf.nn.softmax(logits, name="Y_proba")

with tf.name_scope("train"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
    loss = tf.reduce_mean(xentropy)
    optimizer = tf.train.AdamOptimizer()
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

with tf.name_scope("init_and_save"):
    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/")

In [None]:
n_epochs = 10
batch_size = 100

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)

        save_path = saver.save(sess, "./my_mnist_model")

# CNN on Keras


Follow this tutorial:


https://towardsdatascience.com/build-your-own-convolution-neural-network-in-5-mins-4217c2cf964f

In [None]:
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D

In [None]:
batch_size = 128
num_classes = 10
epochs = 12

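# input image dimensions
img_rows, img_cols = 28, 28

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# add the channel dimension and scale pixel values from [0, 255] to [0, 1]
x_train = x_train.reshape(-1, img_rows, img_cols, 1).astype('float32') / 255
x_test = x_test.reshape(-1, img_rows, img_cols, 1).astype('float32') / 255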

print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

In [None]:
#Build the network:

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=(28,28,1)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))



In [None]:
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

# Additional Resources

- Convolutional Neural Networks w/ TF:
    - https://www.tensorflow.org/tutorials/images/deep_cnn

- Build a Convolutional Neural Network using Estimators
    - https://www.tensorflow.org/tutorials/estimators/cnn
    
- Keras conv networks:
    - https://keras.io/layers/convolutional/
    - https://keras.io/applications/
    
- Deep Learning book
    - Ian Goodfellow and Yoshua Bengio and Aaron Courville
    - https://www.deeplearningbook.org/contents/convnets.html
    