# Setup

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

In [None]:
sess = tf.InteractiveSession()

# MNIST

MNIST is a database of handwritten digits. The images are 28x28 greyscale images (encoded as 784-dimensional vectors in row-major order). There are 60,000 images in the training set, and there are 10,000 images in the test set.
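
To make "row-major order" concrete, here is a small NumPy sketch. It uses a stand-in vector of 784 consecutive integers rather than a real MNIST image, so the array contents are purely illustrative.

```python
import numpy as np

flat = np.arange(28 * 28)       # stand-in for one 784-dimensional image vector
image = flat.reshape(28, 28)    # NumPy reshapes in row-major (C) order by default

row, col = 3, 5
# In row-major order, pixel (row, col) sits at index 28 * row + col of the flat vector.
same_pixel = image[row, col] == flat[28 * row + col]
print(image.shape, same_pixel)
```

This is the same layout that the `reshape(28, 28)` call relies on when an image is displayed.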

Why is there a train-test split? We care about how well our function generalizes, so we want to benchmark its performance on data that it hasn't seen before. If we evaluated on the training data instead, an algorithm that simply memorized every training point would look "perfect" even though it wouldn't generalize well.

Let's see what one of the MNIST images looks like.

In [None]:
plt.imshow(mnist.test.images[0].reshape(28, 28), cmap='gray')

Let's check the label of this image. The MNIST labels are encoded in one-hot format. There are 10 possible labels, and the vector for label $i$ is the 10-dimensional vector that has the entry $1$ in the $i$th position (counting from $0$) and $0$s elsewhere.

In [None]:
mnist.test.labels[0]

In [None]:
np.argmax(mnist.test.labels[0])
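
As a rough illustration of the encoding, the sketch below builds a one-hot vector by hand and decodes it again with `np.argmax`. The helper name `one_hot` is made up for this example and is not part of the MNIST loader.

```python
import numpy as np

def one_hot(label, num_classes=10):
    """Return a vector of zeros with a single 1 at index `label` (hypothetical helper)."""
    vec = np.zeros(num_classes)
    vec[label] = 1.0
    return vec

encoded = one_hot(7)
decoded = int(np.argmax(encoded))  # argmax recovers the integer label
print(encoded, decoded)
```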

# Fully-connected neural network

Let's design a fully-connected neural network to classify MNIST digits. It should take a 784-dimensional input and give a 10-dimensional output that we can turn into a probability distribution over the 10 digits, matching the one-hot encoding of the labels.

In [None]:
x = tf.placeholder(tf.float32, (None, 28*28)) # batch of inputs
y_ = tf.placeholder(tf.float32, (None, 10)) # batch of corresponding labels

In [None]:
# this corresponds to the model of a neuron in [ 02-01-notes ],
# except that it describes an entire layer rather than a single neuron,
# and the activation function is not included here

def fully_connected(x, input_dimension, output_dimension):
    raise NotImplementedError

Let's start with a really simple neural network with only one fully-connected layer (the output layer) with 10 neurons.

See [ 02-04-notes ] for an architecture diagram.
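
As a point of reference for what a single fully-connected layer computes, here is a NumPy sketch of the affine map on a batch of inputs. It is not the TensorFlow solution to the exercise: the name `dense_forward`, the random stand-in data, and the small weight scale are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_forward(inputs, weights, biases):
    """Affine map for a whole batch: one row of `inputs` per example."""
    return inputs @ weights + biases

batch = rng.random((5, 784))                 # 5 fake flattened images
weights = rng.normal(size=(784, 10)) * 0.01  # one column per output neuron
biases = np.zeros(10)                        # one bias per output neuron

logits = dense_forward(batch, weights, biases)
print(logits.shape)  # one 10-dimensional output per example
```

Each output neuron is a weighted sum of all 784 inputs plus a bias, which is why the layer needs a 784x10 weight matrix and a 10-dimensional bias vector.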

In [None]:
# TODO
# y = ???

Above, $y$ is a 10-dimensional vector, but it's not a probability distribution. We can fix that by applying the softmax function to the logits $y$:

$$\sigma(y)_i = \frac{e^{y_i}}{\sum_j e^{y_j}}$$

And we can define loss as the cross entropy between the true probability distribution (the labels) $p$ and the predicted probability distribution $q$:

$$H(p, q) = - \sum_i p_i \log q_i$$

In TensorFlow, we can do both of these in a single step, which is also more numerically stable than applying softmax and then taking the logarithm separately:

In [None]:
loss = tf.nn.softmax_cross_entropy_with_logits(logits=y, labels=y_)
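
To see why combining the two steps matters numerically, here is a NumPy sketch that compares a naive softmax-then-log computation with a version that works in log space. The function names are made up for this example, and the stable version is one common way to do this (subtracting the maximum logit), not a claim about how TensorFlow implements its op internally.

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits)
    return exps / np.sum(exps)

def naive_cross_entropy(logits, labels):
    # softmax first, then log: breaks down when exp() overflows
    return -np.sum(labels * np.log(softmax(logits)))

def stable_cross_entropy(logits, labels):
    # subtracting the max leaves the softmax unchanged but keeps exp() bounded
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -np.sum(labels * log_probs)

labels = np.array([0.0, 1.0, 0.0])
small = np.array([1.0, 2.0, 3.0])
large = np.array([1000.0, 2000.0, 3000.0])

with np.errstate(all='ignore'):  # silence the overflow warnings from the naive version
    naive_small = naive_cross_entropy(small, labels)
    naive_large = naive_cross_entropy(large, labels)
stable_small = stable_cross_entropy(small, labels)
stable_large = stable_cross_entropy(large, labels)

print(naive_small, stable_small)  # the two agree for moderate logits
print(naive_large, stable_large)  # the naive version is nan, the stable one is finite
```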

## Training

In [None]:
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(loss)

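We have tens of thousands of training points (the loader used here holds out 5,000 of the 60,000 training images for validation, leaving 55,000 in `mnist.train`), so we'll use minibatch stochastic gradient descent to train our network instead of computing gradients over the entire training set at every step.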
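
The idea of minibatching can be sketched in a few lines of NumPy. The helper `iterate_minibatches` below is hypothetical and only illustrates the concept of visiting a shuffled dataset in small chunks; it is not how `mnist.train.next_batch` is implemented.

```python
import numpy as np

def iterate_minibatches(num_examples, batch_size, rng):
    """Yield index arrays that cover a shuffled dataset once (one epoch)."""
    order = rng.permutation(num_examples)
    for start in range(0, num_examples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(iterate_minibatches(num_examples=1000, batch_size=100, rng=rng))
print(len(batches), batches[0].shape)
```

Each gradient step then uses only the examples selected by one index array, so a step costs a small fraction of a full pass over the data.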

In [None]:
BATCH_SIZE = 100
ITERATIONS = 1000

sess.run(tf.global_variables_initializer())

for i in range(ITERATIONS):
    x_batch, y_batch = mnist.train.next_batch(BATCH_SIZE)
    sess.run(optimizer, {x: x_batch, y_: y_batch})

## Evaluation

Let's evaluate the accuracy of our network over the test set.

In [None]:
def accuracy(predictions, labels):
    return np.mean(np.argmax(predictions, 1) == np.argmax(labels, 1))
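
Here is a toy check of the `accuracy` function on made-up data (3 examples, 3 classes). The arrays are invented for illustration. The predictions can be raw logits, since softmax doesn't change which entry is largest.

```python
import numpy as np

def accuracy(predictions, labels):
    return np.mean(np.argmax(predictions, 1) == np.argmax(labels, 1))

toy_predictions = np.array([[2.0, 0.1, 0.3],   # predicts class 0
                            [0.2, 0.1, 1.5],   # predicts class 2
                            [0.9, 0.8, 0.1]])  # predicts class 0
toy_labels = np.array([[1, 0, 0],              # true class 0 (correct)
                       [0, 0, 1],              # true class 2 (correct)
                       [0, 1, 0]])             # true class 1 (wrong)

toy_accuracy = accuracy(toy_predictions, toy_labels)
print(toy_accuracy)  # 2 of 3 correct
```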

In [None]:
predictions = y.eval({x: mnist.test.images})
accuracy(predictions, mnist.test.labels)

# Deep fully-connected network

Will adding a ton of parameters help us find a better solution? Let's build a deep fully-connected network with 2000, 1000, and 100 neurons in the hidden layers and 10 neurons in the output layer, using ReLU activation for all the hidden layers. See [ 02-05-notes ] for an architecture diagram.
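
To show how the shapes flow through such a stack, here is a NumPy sketch of a forward pass with ReLU on the hidden layers only. It is not the TensorFlow solution to the exercise: the random inputs, the weight scale, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

def relu(z):
    """Elementwise max(z, 0)."""
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
layer_sizes = [784, 2000, 1000, 100, 10]
activations = rng.random((2, 784))  # 2 fake flattened images

for i, (n_in, n_out) in enumerate(zip(layer_sizes, layer_sizes[1:])):
    w = rng.normal(size=(n_in, n_out)) * np.sqrt(2.0 / n_in)  # assumed init scale
    b = np.zeros(n_out)
    activations = activations @ w + b
    is_output_layer = (i == len(layer_sizes) - 2)
    if not is_output_layer:
        activations = relu(activations)  # no activation on the output logits

logits = activations
print(logits.shape)
```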

In [None]:
# TODO
# y = ???

Our initial network had ~8,000 parameters (784x10 weights plus 10 biases). The above network has ~3.7 million parameters, which is over 400x as many as the first one.
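
The parameter counts can be checked with a little arithmetic. The sketch below assumes each fully-connected layer has one weight per input-output pair plus one bias per output neuron; `dense_param_count` is a name invented for this example.

```python
def dense_param_count(layer_sizes):
    """Total weights and biases for a stack of fully-connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

shallow = dense_param_count([784, 10])                   # 7850
deep = dense_param_count([784, 2000, 1000, 100, 10])     # 3672110
print(shallow, deep, deep / shallow)
```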

In [None]:
loss = tf.nn.softmax_cross_entropy_with_logits(logits=y, labels=y_)

## Training

Let's use a fancier optimizer this time.

In [None]:
optimizer = tf.train.AdamOptimizer().minimize(loss)

In [None]:
BATCH_SIZE = 50
ITERATIONS = 5000 # this takes ~ 2 minutes on my laptop

sess.run(tf.global_variables_initializer())

for i in range(ITERATIONS):
    x_batch, y_batch = mnist.train.next_batch(BATCH_SIZE)
    l, _ = sess.run([loss, optimizer], {x: x_batch, y_: y_batch})
    if (i+1) % 100 == 0:
        print('iteration %d, batch loss %f' % (i+1, np.mean(l)))

## Evaluation

In [None]:
predictions = y.eval({x: mnist.test.images})
accuracy(predictions, mnist.test.labels)

# Convolutional neural network

Let's design a convolutional neural network to classify MNIST digits.

In [None]:
x_image = tf.reshape(x, (-1, 28, 28, 1)) # turn our 784-dimensional vector into a 28x28 image

In [None]:
# a convolutional layer

def convolve(x, kernel_height, kernel_width, input_channels, output_channels):
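    # filter weights: (height, width, input channels, output channels)
    w = tf.Variable(tf.random_normal((kernel_height, kernel_width, input_channels, output_channels)))
    # one bias per output channel
    b = tf.Variable(tf.random_normal((output_channels,)))

    # stride 1 with 'SAME' padding keeps the spatial size; the bias is added at every position
    return tf.nn.conv2d(x, w, strides=(1, 1, 1, 1), padding='SAME') + b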
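
To illustrate what a stride-1 convolution with 'SAME' padding does, here is a deliberately naive NumPy sketch for a single-channel image and a single odd-sized filter. The function `conv2d_same` is hypothetical and far simpler than `tf.nn.conv2d` (no batch, no channels, no bias). Like `tf.nn.conv2d`, it slides the kernel without flipping it.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Slide an odd-sized kernel over a zero-padded image; output matches input size."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode='constant')
    out = np.zeros_like(image, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = np.sum(padded[r:r + kh, c:c + kw] * kernel)
    return out

image = np.ones((28, 28))
blur = np.full((3, 3), 1.0 / 9.0)   # 3x3 averaging filter
blurred = conv2d_same(image, blur)
print(blurred.shape, blurred[14, 14], blurred[0, 0])
```

The corner value is smaller than the center value because the zero padding contributes to the average there.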

In [None]:
# a 2x2 max pooling layer

def pool(x):
    return tf.nn.max_pool(x, (1, 2, 2, 1), (1, 2, 2, 1), padding='SAME')
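
For intuition, 2x2 max pooling with stride 2 can be sketched in NumPy with a reshape. The helper `max_pool_2x2` is made up for this example and assumes the height and width are even, so padding never comes into play.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a (batch, height, width, channels) array."""
    n, h, w, c = x.shape
    # split each spatial axis into (block index, position within block)
    blocks = x.reshape(n, h // 2, 2, w // 2, 2, c)
    return blocks.max(axis=(2, 4))

x = np.arange(16, dtype=float).reshape(1, 4, 4, 1)
pooled = max_pool_2x2(x)
print(pooled[0, :, :, 0])  # the largest value in each 2x2 block
```

Each pooling layer halves the height and width, so a 28x28 input becomes 14x14 after one pool and 7x7 after two.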

## Network architecture

Let's design an architecture with the following layers:

* convolution layer with 25 3x3 filters, relu activation
* 2x2 max pooling layer
* convolution layer with 50 3x3 filters, relu activation
* 2x2 max pooling layer
* fully-connected layer with 1000 neurons, relu activation
* fully-connected output layer (10 neurons)

See [ 02-06-notes ] for an architecture diagram.

In [None]:
# TODO
# y = ???

This network has approximately 2.5 million parameters, almost all of them in the first fully-connected layer. Note that this is over 1 million parameters _fewer_ than the deep fully-connected network.
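
The count can be reproduced with some arithmetic. This sketch assumes one bias per output channel or neuron, and that 'SAME' padding plus two 2x2 pooling layers shrink the 28x28 image to 7x7 before the first fully-connected layer. The helper names are invented for this example.

```python
def conv_param_count(kernel_h, kernel_w, in_channels, out_channels):
    """Filter weights plus one bias per output channel."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels

def dense_param_count(n_in, n_out):
    """Weights plus one bias per output neuron."""
    return n_in * n_out + n_out

conv1_params = conv_param_count(3, 3, 1, 25)      # 250
conv2_params = conv_param_count(3, 3, 25, 50)     # 11300
flattened = 7 * 7 * 50                            # 28 -> 14 -> 7 after two pools
fc_params = dense_param_count(flattened, 1000)    # 2451000
out_params = dense_param_count(1000, 10)          # 10010

total = conv1_params + conv2_params + fc_params + out_params
print(total)  # 2472560
```

The two convolution layers together account for well under 1% of the total.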

In [None]:
loss = tf.nn.softmax_cross_entropy_with_logits(logits=y, labels=y_)

## Training

In [None]:
optimizer = tf.train.AdamOptimizer().minimize(loss)

In [None]:
BATCH_SIZE = 50
ITERATIONS = 5000 # this takes ~ 3.5 minutes on my laptop

sess.run(tf.global_variables_initializer())

for i in range(ITERATIONS):
    x_batch, y_batch = mnist.train.next_batch(BATCH_SIZE)
    l, _ = sess.run([loss, optimizer], {x: x_batch, y_: y_batch})
    if (i+1) % 100 == 0:
        print('iteration %d, batch loss %f' % (i+1, np.mean(l)))

## Evaluation

In [None]:
predictions = y.eval({x: mnist.test.images})
accuracy(predictions, mnist.test.labels)

## Visualization

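Let's see what some intermediate activations look like. The cell below assumes that the network definition above stored the outputs of its two convolution layers as `conv1` and `conv2`.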

In [None]:
conv1_, conv2_ = sess.run([conv1, conv2], {x: mnist.test.images[0:1]})

In [None]:
fig, ax = plt.subplots(2, 2)

ax[0, 0].imshow(conv1_[0,:,:,0], cmap='gray')
ax[0, 1].imshow(conv1_[0,:,:,1], cmap='gray')
ax[1, 0].imshow(conv1_[0,:,:,2], cmap='gray')
ax[1, 1].imshow(conv1_[0,:,:,3], cmap='gray')

In [None]:
fig, ax = plt.subplots(2, 2)

ax[0, 0].imshow(conv2_[0,:,:,0], cmap='gray')
ax[0, 1].imshow(conv2_[0,:,:,1], cmap='gray')
ax[1, 0].imshow(conv2_[0,:,:,2], cmap='gray')
ax[1, 1].imshow(conv2_[0,:,:,3], cmap='gray')