# Chapter 4
# Convolutional Neural Networks

In [0]:
import tensorflow as tf
import numpy as np

## Introduction to CNNs

- The fundamental difference between fully connected and convolutional neural networks
is the pattern of connections between consecutive layers. In the fully connected
case, as the name might suggest, each unit is connected to all of the units in the previous
layer.
- In a convolutional layer of a neural network, on the other hand, each unit is connected
to a (typically small) number of nearby units in the previous layer. Furthermore,
all units are connected to the previous layer in the same way, with the exact same
weights and structure.
- In practice, a convolution amounts to sliding a small “window” of weights (also known as a filter) across an
image.
- The first angle is biologically inspired: each convolutional layer
looks at an increasingly larger part of the image as we go deeper into the network.
Most commonly, this will be followed by fully connected layers that, in this
analogy, act as the higher levels of visual processing dealing with global
information.
- The second angle, more engineering-oriented, stems from the nature of
images and their contents. When looking for an object in an image, say the face of a
cat, we would typically want to be able to detect it regardless of its position in the
image. This reflects the property of natural images that the same content may be
found in different locations of an image. This property is known as an invariance; invariances of this sort can also be expected with respect to (small) rotations, changing
lighting conditions, etc. Correspondingly, an object-recognition system should be invariant
to translation (and, depending on the scenario, probably also rotation and deformations
of many sorts, but that is another matter). Put simply, it therefore makes sense
to perform the exact same computation on different parts of the image. In this view, a
convolutional neural network layer computes the same features of an image across
all spatial areas.
- The convolutional structure can also be seen as a form of regularization.
Regularization is most often applied by adding implicit information
regarding the desired results (this could take the form of saying
we would rather have a smoother function, when searching a
function space). In the convolutional neural network case, we
explicitly state that we are looking for weights in a relatively low-dimensional
subspace corresponding to fixed-size convolutions.
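
The weight sharing and translation behavior described above can be illustrated with a small NumPy sketch. This is a naive illustration, not how TensorFlow implements convolution; the helper `conv2d_valid` and the toy 8×8 image are assumptions made for this example only.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over a 2-D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same weights are reused at every position.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))

# A toy image containing one small "object", and a copy with the object moved
# 2 rows down and 3 columns to the right.
image = np.zeros((8, 8))
image[1:4, 1:4] = rng.normal(size=(3, 3))
shifted = np.roll(image, shift=(2, 3), axis=(0, 1))

response = conv2d_valid(image, kernel)
response_shifted = conv2d_valid(shifted, kernel)

# The filter responses move together with the object.
print(np.allclose(response_shifted[2:6, 3:6], response[0:4, 0:3]))

# Weight count: a fully connected map from 28x28 inputs to 28x28 units
# versus a single shared 5x5 filter.
fc_weights = (28 * 28) * (28 * 28)   # 614656
conv_weights = 5 * 5                 # 25
```

Because every position uses the same nine weights, shifting the object only shifts where the responses appear; the layer does not need to relearn the feature for each location.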

---

## MNIST: Take II

### Convolution

In [0]:
# tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

- Here, x is the data—the input image, or a downstream feature map obtained further
along in the network, after applying previous convolution layers. 
- As discussed previously,
in typical CNN models we stack convolutional layers hierarchically, and feature
map is simply a commonly used term referring to the output of each such layer.
Another way to view the output of these layers is as processed images, the result of
applying a filter and perhaps some other operations. Here, this filter is parameterized
by W, the learned weights of our network representing the convolution filter.
- The output of this operation depends on the shapes of x and W, and in our case is
four-dimensional. The image data x has shape [None, 28, 28, 1]: an unknown number of images, each 28×28 pixels and with
one color channel (since these are grayscale images).
- The weights W have shape [5, 5, 1, 32]. The initial 5×5×1 is the size of the small “window” in the image to be
convolved, in our case a 5×5 region of a single channel. In images that have multiple color channels, we regard each image as a three-dimensional
tensor of RGB values, but in this one-channel data the images are just two-dimensional,
and convolutional filters are applied to two-dimensional regions. Later,
when we tackle the CIFAR10 data, we’ll see examples of multiple-channel images and
how to set the size of the weights W accordingly.
- The final 32 is the number of feature maps. In other words, we have multiple sets of
weights for the convolutional layer, in this case 32 of them. Recall that the idea of a
convolutional layer is to compute the same feature along the image; we would simply
like to compute many such features and thus use multiple sets of convolutional filters.
- The strides argument controls the spatial movement of the filter W across the image
(or feature map) x. The value [1, 1, 1, 1] means that the filter is applied to the input in one-pixel
intervals in each dimension, corresponding to a “full” convolution. Other settings of
this argument allow us to introduce skips in the application of the filter—a common
practice that we apply later—thus making the resulting feature map smaller. 
- Finally, setting padding to 'SAME' means that the borders of x are padded such that
the size of the result of the operation is the same as the size of x.
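
The effect of the strides and padding arguments on the output size can be sketched with plain arithmetic. The helper `conv_output_size` below is a hypothetical name used for illustration; the formulas are the ones TensorFlow documents for 'SAME' and 'VALID' padding, applied per spatial dimension.

```python
import math

def conv_output_size(in_size, filter_size, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    if padding == 'SAME':
        # Borders are padded so that only the stride shrinks the output.
        return math.ceil(in_size / stride)
    if padding == 'VALID':
        # No padding: the filter must fit entirely inside the input.
        return math.ceil((in_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")

# A 28x28 MNIST image with a 5x5 filter:
same_stride1 = conv_output_size(28, 5, stride=1, padding='SAME')    # 28
valid_stride1 = conv_output_size(28, 5, stride=1, padding='VALID')  # 24
same_stride2 = conv_output_size(28, 5, stride=2, padding='SAME')    # 14

# With W of shape [5, 5, 1, 32], the layer has 5*5*1*32 weights and the
# output for x of shape [None, 28, 28, 1] has shape [None, 28, 28, 32].
num_weights = 5 * 5 * 1 * 32  # 800
```

With strides of 1 and 'SAME' padding, as used in this chapter, only the number of channels changes between the input and the output of the layer.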

### Pooling

In [0]:
# tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

- The reasoning behind pooling is both technical and more theoretical. The technical
aspect is that pooling reduces the size of the data to be processed downstream. This
can drastically reduce the number of overall parameters in the model, especially if we
use fully connected layers after the convolutional ones.
- The more theoretical reason for applying pooling is that we would like our computed
features not to care about small changes in position in an image. For instance, a feature
looking for eyes in the top-right part of an image should not change too much if
we move the camera a bit to the right when taking the picture, moving the eyes
slightly to the center of the image. Aggregating the “eye-detector feature” spatially
allows the model to overcome such spatial variability between images, capturing
some form of invariance as discussed at the beginning of this chapter.
- Max pooling outputs the maximum of the input in each region of a predefined size
(here 2×2). The ksize argument controls the size of the pooling window (2×2), and the
strides argument controls by how much we “slide” the pooling grid across x, just as in
the case of the convolution layer. Setting the stride to 2 in each spatial dimension means that the output of
the pooling will be exactly one-half of the height and width of the original, and in
total one-quarter of the size.
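
As a rough illustration of 2×2 max pooling with a stride of 2, here is a NumPy sketch for a single two-dimensional feature map. The function name `max_pool_2x2_numpy` and the toy input are assumptions for this example; the sketch also assumes even height and width, so no padding is needed.

```python
import numpy as np

def max_pool_2x2_numpy(feature_map):
    """2x2 max pooling with stride 2 on a 2-D array with even height and width."""
    h, w = feature_map.shape
    assert h % 2 == 0 and w % 2 == 0, "this sketch assumes even dimensions"
    # Group the array into non-overlapping 2x2 blocks, then take each block's max.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 2, 0, 0],
                        [3, 4, 0, 5],
                        [0, 0, 7, 1],
                        [0, 9, 1, 1]], dtype=float)

pooled = max_pool_2x2_numpy(feature_map)
print(pooled)
# [[4. 5.]
#  [9. 7.]]
```

The 4×4 input becomes 2×2: half the height, half the width, and a quarter of the total size. A strong activation keeps its value wherever it falls within its 2×2 block, which is the small-shift tolerance described above.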

### Dropout

In [0]:
# tf.nn.dropout(layer, keep_prob=keep_prob)

The final element we will need for our model is dropout. This is a regularization trick
used in order to force the network to distribute the learned representation across all
the neurons. Dropout “turns off” a preset fraction of the units in a layer, chosen at random, by
setting their values to zero during training. The dropped-out neurons are chosen anew
for each computation, forcing the network to learn a representation that
will work even after the dropout. This process is often thought of as training an
“ensemble” of multiple networks, thereby increasing generalization. When using the
network as a classifier at test time (“inference”), there is no dropout and the full network
is used as is.

In order to be able to change this value (which we must do, since for testing we would
like this to be 1.0, meaning no dropout at all), we will use a tf.placeholder and pass
one value for train (.5) and another for test (1.0).
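
The behavior of dropout at train and test time can be sketched in NumPy. The function `dropout_numpy` is a hypothetical stand-in, not the TensorFlow implementation; it follows the documented behavior of tf.nn.dropout, which scales the kept units by 1 / keep_prob so that the expected activation stays the same.

```python
import numpy as np

def dropout_numpy(activations, keep_prob, rng):
    """Zero each unit with probability 1 - keep_prob and rescale the rest."""
    if keep_prob >= 1.0:
        # Test time: the full network is used as is.
        return activations
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
layer = np.ones(10000)

train_out = dropout_numpy(layer, keep_prob=0.5, rng=rng)  # values are 0.0 or 2.0
test_out = dropout_numpy(layer, keep_prob=1.0, rng=rng)   # unchanged

print(train_out.mean())  # close to 1.0, the mean of the original activations
```

Each call draws a fresh mask, so a different subset of units is dropped at every training step.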

### The Model

In [0]:
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], 
                          strides=[1, 2, 2, 1], padding='SAME')
    
def conv_layer(input, shape):
    W = weight_variable(shape)
    b = bias_variable([shape[3]])
    return tf.nn.relu(conv2d(input, W) + b)
    
def full_layer(input, size):
    in_size = int(input.get_shape()[1])
    W = weight_variable([in_size, size])
    b = bias_variable([size])
    return tf.matmul(input, W) + b

weight_variable()

    This specifies the weights for either fully connected or convolutional layers of the
    network. They are initialized randomly using a truncated normal distribution
    with a standard deviation of .1. This sort of initialization with a random normal
    distribution that is truncated at the tails is pretty common and generally produces
    good results (see the upcoming note on random initialization).

bias_variable()

    This defines the bias elements in either a fully connected or a convolutional layer.
    These are all initialized with the constant value of .1.

conv2d()

    This specifies the convolution we will typically use: a full convolution (no skips)
    with an output the same size as the input.

max_pool_2x2()

    This sets the max pool to half the size across the height/width dimensions, and in
    total a quarter the size of the feature map.

conv_layer()

    This is the actual layer we will use. Linear convolution as defined in conv2d, with
    a bias, followed by the ReLU nonlinearity.

full_layer()

    A standard full layer with a bias. Notice that here we didn’t add the ReLU. This
    allows us to use the same layer for the final output, where we don’t need the nonlinear
    part.
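
The truncated normal initialization used by weight_variable() can be approximated in NumPy as follows. This is a sketch under the assumption, taken from the TensorFlow documentation, that values more than two standard deviations from the mean are dropped and re-drawn; `truncated_normal_numpy` is a hypothetical name.

```python
import numpy as np

def truncated_normal_numpy(shape, stddev, rng):
    """Sample from N(0, stddev^2), re-drawing values beyond 2 standard deviations."""
    values = rng.normal(0.0, stddev, size=shape)
    out_of_range = np.abs(values) > 2 * stddev
    while out_of_range.any():
        # Re-draw only the entries that fell in the tails.
        values[out_of_range] = rng.normal(0.0, stddev, size=out_of_range.sum())
        out_of_range = np.abs(values) > 2 * stddev
    return values

rng = np.random.default_rng(0)
weights = truncated_normal_numpy((5, 5, 1, 32), stddev=0.1, rng=rng)

print(weights.shape)         # (5, 5, 1, 32)
print(np.abs(weights).max()) # never larger than 0.2
```

Cutting off the tails avoids the occasional very large initial weight while keeping the values small and random.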

In [0]:
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])

x_image = tf.reshape(x, [-1, 28, 28, 1])
conv1 = conv_layer(x_image, shape=[5, 5, 1, 32])
conv1_pool = max_pool_2x2(conv1)

conv2 = conv_layer(conv1_pool, shape=[5, 5, 32, 64])
conv2_pool = max_pool_2x2(conv2)

conv2_flat = tf.reshape(conv2_pool, [-1, 7*7*64])
full_1 = tf.nn.relu(full_layer(conv2_flat, 1024))

keep_prob = tf.placeholder(tf.float32)
full1_drop = tf.nn.dropout(full_1, keep_prob=keep_prob)

y_conv = full_layer(full1_drop, 10)


We start by defining the placeholders for the images and correct labels, x and y_,
respectively. Next, we reshape the image data into the 2D image format with size
28×28×1. Recall we did not need this spatial aspect of the data for our previous
MNIST model, since all pixels were treated independently, but a major source of
power in the convolutional neural network framework is the utilization of this spatial
meaning when considering images.

Next we have two consecutive layers of convolution and pooling, each with 5×5 convolutions,
with 32 and 64 feature maps respectively, followed by a single fully connected layer with 1,024
units. Before applying the fully connected layer we flatten the image back to a single
vector form, since the fully connected layer no longer needs the spatial aspect.

Notice that the size of the image following the two convolution and pooling layers is
7×7×64. The original 28×28 pixel image is reduced first to 14×14, and then to 7×7 in
the two pooling operations. The 64 is the number of feature maps we created in the
second convolutional layer. When considering the total number of learned parameters
in the model, a large proportion will be in the fully connected layer (going from
7×7×64 to 1,024 gives us 3.2 million parameters). This number would have been 16
times as large (i.e., 28×28×64×1,024, which is roughly 51 million) if we hadn’t used
max-pooling.
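
The parameter counts quoted above can be checked with a few lines of arithmetic. The helper names below are assumptions for this sketch; the layer shapes are those used in the model, and bias terms are counted as well.

```python
def conv_params(height, width, in_channels, out_channels):
    """Weights plus one bias per output feature map."""
    return height * width * in_channels * out_channels + out_channels

def fc_params(in_size, out_size):
    """Weights plus one bias per output unit."""
    return in_size * out_size + out_size

conv1 = conv_params(5, 5, 1, 32)     # 832
conv2 = conv_params(5, 5, 32, 64)    # 51264
full1 = fc_params(7 * 7 * 64, 1024)  # 3212288
output = fc_params(1024, 10)         # 10250

total = conv1 + conv2 + full1 + output  # 3274634

# Fully connected weights with and without the two pooling steps:
with_pooling = 7 * 7 * 64 * 1024        # 3211264, roughly 3.2 million
without_pooling = 28 * 28 * 64 * 1024   # 51380224, roughly 51 million
ratio = without_pooling // with_pooling # 16
```

Almost all of the roughly 3.3 million parameters sit in the first fully connected layer, which is why shrinking its input through pooling matters so much.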

Finally, the output is a fully connected layer with 10 units, corresponding to the number
of labels in the dataset (recall that MNIST is a handwritten digit dataset, so the
number of possible labels is 10).
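
The 10 output units produce unnormalized scores (logits), one per label. As a rough sketch of how such scores can be turned into a softmax cross-entropy loss and an accuracy measure, here is a NumPy version; the function names are hypothetical, and the toy example uses 3 classes instead of 10 for brevity.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, one_hot_labels):
    """Mean over the batch of -log(probability assigned to the true class)."""
    probs = softmax(logits)
    return -np.mean(np.sum(one_hot_labels * np.log(probs), axis=1))

def accuracy(logits, one_hot_labels):
    """Fraction of examples whose highest-scoring class is the true class."""
    predicted = np.argmax(logits, axis=1)
    actual = np.argmax(one_hot_labels, axis=1)
    return np.mean(predicted == actual)

# Toy batch: 3 examples, 3 classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0],
                   [1.0, 1.5, 0.0]])
labels = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [1, 0, 0]], dtype=float)

loss = cross_entropy(logits, labels)
acc = accuracy(logits, labels)  # 2 of 3 predictions match the labels
```

Taking the argmax of the logits gives the same prediction as taking the argmax of the softmax probabilities, so accuracy can be computed directly on the raw scores.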

In [0]:
from tensorflow.examples.tutorials.mnist import input_data

DATA_DIR = '/tmp/data'
NUM_STEPS = 1000
MINIBATCH_SIZE = 50  # number of training examples per step

mnist = input_data.read_data_sets(DATA_DIR, one_hot=True)

cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_conv, labels=y_))

train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for i in range(NUM_STEPS):
        batch = mnist.train.next_batch(MINIBATCH_SIZE)

        if i % 100 == 0:
            train_accuracy = sess.run(accuracy, feed_dict={x: batch[0],
                                                           y_: batch[1],
                                                           keep_prob: 1.0})
            print("Step {}, training accuracy {}".format(i, train_accuracy))
        
        sess.run(train_step, feed_dict={x: batch[0], y_ : batch[1],
                                        keep_prob: 0.5})
        
    X = mnist.test.images.reshape(10, 1000, 784)
    Y = mnist.test.labels.reshape(10, 1000, 10)
    test_accuracy = np.mean([sess.run(accuracy, feed_dict={x:X[i], y_:Y[i],
                                                            keep_prob:1.0}) for i in range(10)])
print("Test accuracy: {}".format(test_accuracy))
    


---

## CIFAR 10

In [0]:
import tensorflow as tf
import keras
import numpy as np
import matplotlib.pyplot as plt

(X_train, y_train), (X_test, y_test) = keras.datasets.cifar10.load_data()


In [0]:
print("X_train.shape: ", X_train.shape)
print("X_test.shape: ", X_test.shape)
print("y_train.shape: ", y_train.shape)
print("y_test.shape: ", y_test.shape)

X_train.shape:  (50000, 32, 32, 3)
X_test.shape:  (10000, 32, 32, 3)
y_train.shape:  (50000, 1)
y_test.shape:  (10000, 1)
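
The shapes above show 32×32 images with three color channels and labels stored as a column of integer class indices. Below is a sketch of how such labels could be converted to the one-hot format used for y_ in the MNIST model. The helper `one_hot` and the toy labels are assumptions for illustration, and the filter shape shown is only an assumed analogue of the MNIST layer, not a prescribed architecture.

```python
import numpy as np

NUM_CLASSES = 10  # CIFAR10 has 10 classes

def one_hot(labels, num_classes):
    """Convert integer labels of shape (N, 1) to one-hot rows of shape (N, num_classes)."""
    return np.eye(num_classes, dtype=np.float32)[labels.reshape(-1)]

# Toy labels with the same layout as y_train, i.e. shape (N, 1).
y_example = np.array([[3], [0], [9]])
y_one_hot = one_hot(y_example, NUM_CLASSES)

print(y_one_hot.shape)  # (3, 10)

# With three input channels, a first-layer 5x5 filter bank with 32 feature maps
# would have shape [5, 5, 3, 32] instead of the [5, 5, 1, 32] used for MNIST.
first_conv_shape = [5, 5, 3, 32]
```

The third entry of the filter shape must match the number of channels in the input, which is 3 for these RGB images.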
