In [2]:
import numpy as np
import tensorflow as tf

Getting the data
----------------
* `mnist.train` - 55k data points of training data 
* `mnist.test` - 10k data points of test data
* `mnist.validation` - 5k data points of validation data

In [5]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot = True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


Representing images as vectors
------------------------------
* `mnist.train.images` - 55000 x 784 matrix; each row is a vector representation of an image. Each element of the vector is the intensity of one pixel in the image (28 x 28 = 784).
* `mnist.train.labels` - 55000 x 10 matrix; each row is a one-hot encoding of the digit (0-9) shown in the corresponding image.
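
To make the two representations concrete, here is a minimal NumPy sketch. It does not use the MNIST loader; the toy `image` array and the `digit` value are made up for illustration. It flattens a 28 x 28 image into a 784-element vector and one-hot encodes a label.

```python
import numpy as np

# Hypothetical 28 x 28 grayscale image with intensities in [0, 1].
image = np.zeros((28, 28), dtype=np.float32)
image[10:18, 12:16] = 1.0  # a bright rectangle standing in for a pen stroke

# Flatten row by row into a single vector, one entry per pixel.
image_vector = image.reshape(-1)
print(image_vector.shape)  # (784,)

# One-hot encode the digit 3: a length-10 vector with a 1 at index 3.
digit = 3
one_hot = np.eye(10, dtype=np.float32)[digit]
print(one_hot)

# The original digit is recovered as the index of the largest entry.
print(int(np.argmax(one_hot)))  # 3
```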

<img src="mnist_100_digits.png">

In [96]:
print mnist.train.images[0]

[ 0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.

The Multilayer Perceptron Algorithm
-----------------------------------
Our goal is to learn a function $f(x) = y$, where $x$ is an unseen (i.e. not in our training set) image and $y$ is the digit it shows, between 0 and 9. To accomplish this we'll use a type of feedforward artificial neural network called a multilayer perceptron (MLP).

An MLP consists of the following components:

* Initial weights and biases (random)
* Hidden layers
   * Activation function (e.g. sigmoid, ReLU)
* Output layer, with one unit per class to predict
* A cost function and an optimizer
   * e.g. softmax with cross-entropy, L1/L2
   
<img src="tikz11.png">

Constructing initial weights and biases
---------------------------------------

Each layer contains:
* $W$ = initial random weights (drawn from a standard normal distribution by `tf.random_normal`)
* $b$ = initial random biases (drawn the same way)

The input layer is always the size of our input vector. The output layer has one unit per class we want to predict. The size of the hidden layers is a free choice!

In [63]:
n_hidden_1 = 256 
n_hidden_2 = 256 
n_input = 784 # MNIST data input (img shape: 28*28)
n_classes = 10 # MNIST total classes (0-9 digits)

W = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_classes]))
}
b = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}
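
As a quick sanity check on these shapes, the following sketch counts how many trainable numbers the network holds. It only reuses the four layer sizes from the cell above; the `layer_sizes` list and the loop variables are names introduced here for illustration.

```python
# Layer sizes copied from the cell above.
n_input, n_hidden_1, n_hidden_2, n_classes = 784, 256, 256, 10

layer_sizes = [n_input, n_hidden_1, n_hidden_2, n_classes]

total = 0
for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    n_weights = fan_in * fan_out  # one weight per (input, output) pair
    n_biases = fan_out            # one bias per output unit
    print("%d x %d weights + %d biases" % (fan_in, fan_out, n_biases))
    total += n_weights + n_biases

print("total trainable parameters: %d" % total)  # 269322
```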

Constructing the layers and input
---------------------------------

In each pass of the network, an input example "propagates" through each layer. For layer $n$ we compute $z_n$ from the previous layer's output and apply the activation function $\theta$. We treat $a_n$ as the input value for the next layer.

$$z_1 = W_1 x + b_1$$

$$a_1 = \theta(z_1)$$

$$z_2 = W_2 a_1 + b_2$$

$$...$$

where $x$ is the input example and $W_n$, $b_n$ are the weights and biases of layer $n$.

Each layer depends on the last! This is the forward propagation algorithm.
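
The equations above can be sketched in plain NumPy. This is an illustrative toy, not the TensorFlow graph used in this notebook: the layer sizes, the `forward` helper and the random inputs are all assumptions made for the example. It follows the column-vector convention of the equations ($Wx + b$), whereas the TensorFlow code multiplies in the other order (`tf.matmul(_X, _weights['h1'])`) because each example is stored as a row.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights, biases):
    """Return the output-layer values z and the list of layer inputs."""
    activations = [x]
    a = x
    # Hidden layers: affine transform followed by the activation function.
    for W, b in zip(weights[:-1], biases[:-1]):
        z = np.dot(W, a) + b
        a = relu(z)
        activations.append(a)
    # Output layer: affine transform only, no activation applied here.
    z_out = np.dot(weights[-1], a) + biases[-1]
    return z_out, activations

rng = np.random.RandomState(0)
sizes = [4, 3, 3, 2]  # toy sizes: 4 inputs, two hidden layers of 3, 2 classes
weights = [rng.randn(n_out, n_in) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.randn(n_out) for n_out in sizes[1:]]

x = rng.randn(4)
z_out, activations = forward(x, weights, biases)
print(z_out.shape)                     # (2,)
print([a.shape for a in activations])  # [(4,), (3,), (3,)]
```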


In [73]:
x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

def multilayer_perceptron(_X, _weights, _biases):
    # First hidden layer with ReLU activation
    layer_1 = tf.nn.relu(tf.add(tf.matmul(_X, _weights['h1']), _biases['b1']))
    # Second hidden layer with ReLU activation
    layer_2 = tf.nn.relu(tf.add(tf.matmul(layer_1, _weights['h2']), _biases['b2']))
    # Output layer: raw scores (logits); softmax is applied inside the cost function
    return tf.matmul(layer_2, _weights['out']) + _biases['out']

pred = multilayer_perceptron(x, W, b)

Define our cost and optimizer
-----------------------------

Recall that our goal is to learn $f(x) = y$. How do we know how good $f(x)$ is? When we train a network, we want some function $Cost(y, y') = \epsilon$ that tells us how far off our predictions are, where $y$ is the true label and $y'$ is the predicted value.

A few examples:

* L2 loss with MSE:
$$Cost(y, y') = \frac{1}{n}\sum_{i = 1}^{n}(y'_i - y_i)^2$$
* Softmax with cross entropy:
$$Cost(y, y') = \sum_{i = 1}^{n} y_i \ln{(\frac{1}{y'_i})}$$

A few notes on softmax with cross entropy:
* Cross entropy is the "average length of communicating an event from one distribution with the optimal code for another distribution"
* With one-hot labels, only the term for the one non-zero entry of $y$ (the true class) contributes to the sum!
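
The two cost functions above can be tried on a single example. The sketch below is plain NumPy rather than the TensorFlow op used in this notebook, and the `logits` and `y` values are made up. It shows that, for a one-hot label, the cross entropy reduces to the negative log of the probability assigned to the true class.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability; the result is unchanged.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def cross_entropy(y, y_pred):
    # Sum of y_i * ln(1 / y'_i); a tiny epsilon guards against log(0).
    return -np.sum(y * np.log(y_pred + 1e-12))

def mse(y, y_pred):
    # Mean of the squared differences between prediction and label.
    return np.mean((y_pred - y) ** 2)

logits = np.array([2.0, 1.0, 0.1])  # hypothetical network outputs
y = np.array([1.0, 0.0, 0.0])       # one-hot label: class 0 is correct

y_pred = softmax(logits)
print(y_pred, y_pred.sum())         # probabilities that sum to 1
print(cross_entropy(y, y_pred))     # same as -ln(y_pred[0])
print(mse(y, y_pred))
```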





In [94]:
learning_rate = 0.01
batch_size = 100

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

Backpropagation
---------------
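General plan of attack:

1. Compute the error at our output layer (derivative of the cost function w.r.t. the output layer's pre-activation $z_L$):

    $$\delta_L = \frac{\partial Cost(y, y')}{\partial z_L}$$

2. Compute the gradients for the output layer, where $a_{L-1}$ is the output of the previous layer:

    $$\frac{\partial Cost}{\partial b_L} = \delta_L$$
    $$\frac{\partial Cost}{\partial W_L} = \delta_L a_{L-1}^T$$

3. Propagate the error back through layers $\{L-1, L-2, \dots, 1\}$, recalling that $z_L = W_L a_{L-1} + b_L$ and $a_0 = x$ (here $\odot$ is the element-wise product):

    $$\delta_{L-1} = W_L^T \delta_L \odot \theta'(z_{L-1})$$
    $$\frac{\partial Cost}{\partial b_{L-1}} = \delta_{L-1}$$
    $$\frac{\partial Cost}{\partial W_{L-1}} = \delta_{L-1} a_{L-2}^T$$

    $$\delta_{L-2} = W_{L-1}^T \delta_{L-1} \odot \theta'(z_{L-2})$$
    $$\frac{\partial Cost}{\partial b_{L-2}} = \delta_{L-2}$$
    $$\frac{\partial Cost}{\partial W_{L-2}} = \delta_{L-2} a_{L-3}^T$$

    $$...$$

4. Update every weight and bias by stepping against its gradient, scaled by the learning rate $\eta$:

    $$W_l \leftarrow W_l - \eta \frac{\partial Cost}{\partial W_l}, \qquad b_l \leftarrow b_l - \eta \frac{\partial Cost}{\partial b_l}$$

Backpropagation computes these gradient values, which are derived from our cost function.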

Intuition: "Penalize weights that caused error by the amount of error they caused"
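
The plan above can be checked numerically on a tiny network. The sketch below is an assumption-laden toy, not what TensorFlow does internally: it uses one hidden layer, a sigmoid activation (chosen because its derivative is smooth) and softmax with cross entropy, for which the output error is simply $y' - y$. It then compares one backpropagated gradient entry against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def loss_and_grads(x, y, W1, b1, W2, b2):
    # Forward pass, keeping the intermediate values backprop needs.
    z1 = np.dot(W1, x) + b1
    a1 = sigmoid(z1)
    z2 = np.dot(W2, a1) + b2
    y_pred = softmax(z2)
    loss = -np.sum(y * np.log(y_pred))

    # Output layer: for softmax with cross entropy, dCost/dz2 = y' - y.
    delta2 = y_pred - y
    grad_W2 = np.outer(delta2, a1)
    grad_b2 = delta2

    # Hidden layer: push the error back through W2, then through sigmoid'.
    delta1 = np.dot(W2.T, delta2) * a1 * (1.0 - a1)
    grad_W1 = np.outer(delta1, x)
    grad_b1 = delta1
    return loss, (grad_W1, grad_b1, grad_W2, grad_b2)

rng = np.random.RandomState(0)
x = rng.randn(4)
y = np.array([0.0, 1.0, 0.0])
W1, b1 = rng.randn(5, 4), rng.randn(5)
W2, b2 = rng.randn(3, 5), rng.randn(3)

loss, grads = loss_and_grads(x, y, W1, b1, W2, b2)

# Finite-difference check on a single weight, W1[0, 0].
eps = 1e-6
W1_plus = W1.copy()
W1_plus[0, 0] += eps
W1_minus = W1.copy()
W1_minus[0, 0] -= eps
loss_plus = loss_and_grads(x, y, W1_plus, b1, W2, b2)[0]
loss_minus = loss_and_grads(x, y, W1_minus, b1, W2, b2)[0]
numeric = (loss_plus - loss_minus) / (2 * eps)
print(grads[0][0, 0], numeric)  # the two values should agree closely

# One gradient-descent step on W1 with learning rate 0.01.
learning_rate = 0.01
W1_updated = W1 - learning_rate * grads[0]
```

Agreement between the two printed numbers is a common way to catch sign or indexing mistakes in a hand-written backward pass.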

Train the network and calculate accuracy
----------------------------------------
Each step of the loop, we get a "batch" of `batch_size` random data points from the training set. Each training step feeds in a batch of data to replace the `placeholders` and updates the network (backpropagation) once for every batch. This is mini-batch stochastic gradient descent!

* `correct_prediction` - evaluates to true if the prediction and a test label match up (uses `tf.argmax` to find the index of the highest value across a dimension)
* `accuracy` - fraction of correct predictions (converts booleans to floats with `tf.cast`, then averages them)
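
The same two steps can be traced by hand in NumPy. The `pred` and `labels` arrays below are invented for illustration (4 examples, 3 classes); the NumPy calls are stand-ins that mirror the TensorFlow ops named above.

```python
import numpy as np

# Hypothetical network outputs for 4 examples and 3 classes.
pred = np.array([[2.0, 0.5, 0.1],
                 [0.2, 0.1, 3.0],
                 [1.5, 1.7, 0.3],
                 [0.9, 0.2, 0.4]])
# One-hot labels for the same 4 examples.
labels = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [1, 0, 0],
                   [0, 1, 0]])

# Index of the largest value in each row, like tf.argmax(..., 1).
predicted_class = np.argmax(pred, axis=1)
true_class = np.argmax(labels, axis=1)

# Boolean vector, like tf.equal.
correct_prediction = predicted_class == true_class

# Cast booleans to floats and average, like tf.reduce_mean(tf.cast(...)).
accuracy = np.mean(correct_prediction.astype(np.float32))
print(correct_prediction)
print(accuracy)  # 0.5
```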

In [101]:
training_epochs = 10
display_step = 1

init = tf.initialize_all_variables()

with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples / batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            # Fit training using batch data
            sess.run(optimizer, feed_dict={x: batch_xs, y: batch_ys})
            # Compute average loss
            avg_cost += sess.run(cost, feed_dict={x: batch_xs, y: batch_ys})/total_batch
        # Display logs per epoch step
        if epoch % display_step == 0:
            print "Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(avg_cost)

    print "Optimization Finished!"
    
    # Test model
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    print "Accuracy:", accuracy.eval({x: mnist.test.images, y: mnist.test.labels})

Epoch: 0001 cost= 35.008997646
Epoch: 0002 cost= 6.062125091
Epoch: 0003 cost= 3.722721570
Epoch: 0004 cost= 2.592354636
Epoch: 0005 cost= 1.893606476
Epoch: 0006 cost= 1.499699955
Epoch: 0007 cost= 1.195059822
Epoch: 0008 cost= 0.938370555
Epoch: 0009 cost= 0.806863983
Epoch: 0010 cost= 0.685367677
Optimization Finished!
Accuracy: 0.9294


In [100]:
mnist.train.next_batch(100)[1].shape

(100, 10)