In [1]:
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf

mnist = input_data.read_data_sets("/tmp/data", one_hot=True)

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


The MNIST dataset is prepared as a three-way split for training, testing, and validation.

In [None]:
mnist.train.num_examples, mnist.test.num_examples, mnist.validation.num_examples

Each of these n-dimensional arrays represents a tensor for TensorFlow. For MNIST, each image is a flattened 784-element vector representing a 28x28 image. Each image has a corresponding label for the digit (0-9) drawn in it. These labels use one-hot encoding: a 10-element vector that is 1 at the index of the digit and 0 everywhere else.

In [None]:
mnist.train.images  # a tensor: an n-dimensional array of shape (55000, 784)
mnist.train.labels  # one-hot encoded labels
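
To make the one-hot encoding concrete, here is a minimal plain-Python sketch. The helper names `one_hot` and `decode` are hypothetical and are not part of TensorFlow or the MNIST loader; they only illustrate the encoding described above.

```python
def one_hot(label, num_classes=10):
    """Return a list with 1.0 at index `label` and 0.0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def decode(vector):
    """Recover the digit as the index of the largest entry."""
    return max(range(len(vector)), key=vector.__getitem__)


encoded = one_hot(3)
print(encoded)          # a 10-element vector with a single 1.0 at index 3
print(decode(encoded))  # the original digit
```

Decoding with "index of the largest entry" is the same idea that tf.argmax is used for later in the notebook.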

## Softmax regression
The basic tutorial uses softmax regression (multinomial logistic regression) to compute the probability that an image shows each of the digits in our dataset.

Softmax regression has two steps:
-  Processing the inputs. They provide the evidence for an image being in a class. Each input is multiplied by its associated weight, the products are summed, and a bias is added. The weights and biases are learnt during training.
-  Computing the probabilities. The summed values are then converted to probabilities.

### Equations:
To process the inputs:
$$\hat x_i = \sum_j \mathbf W_{i,j} x_j + b_i$$
To compute the softmax:
$$\operatorname{softmax}(\hat x)_i = \frac{\exp(\hat x_i)}{\sum_j \exp(\hat x_j)}$$
It exponentiates the processed inputs and normalizes them. An alternate representation, for conceptualization purposes:
$$\operatorname{softmax}(\hat x) = \operatorname{normalize}(\exp(\hat x))$$
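
The two equations above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the names `evidence` and `softmax` and the tiny 3-class, 2-input example are made up for this sketch, and subtracting the maximum score before exponentiating is a common stability trick that is not part of the equations themselves (it leaves the result unchanged).

```python
import math


def evidence(W, x, b):
    """Compute x_hat_i = sum_j W[i][j] * x[j] + b[i] for every class i."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def softmax(scores):
    """Exponentiate the scores and normalize them so they sum to 1."""
    shift = max(scores)  # assumption: shift for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy problem: 3 classes, 2 input features.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
b = [0.0, 0.0, 0.1]
x = [0.5, 2.0]

probabilities = softmax(evidence(W, x, b))
print(probabilities)  # three positive numbers that sum to 1
```

The class with the most evidence (here the third one) receives the highest probability.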

## Implementation

### Inputs
First we need a placeholder to receive any of the values we will input to our model.

In [2]:
# None here means this dimension (the batch size) can be of any length
x = tf.placeholder(tf.float32, [None, 784])

### Variables
For the weights and biases, TensorFlow has another structure, the Variable. Variables are modifiable tensors that live in TensorFlow's graph of interacting operations, so they can be used and modified by the computation.

In [3]:
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

### The model

In [4]:
# y = tf.nn.softmax(tf.matmul(x, W) + b)

# placeholder for the true (one-hot) labels that the predictions are compared against:
y_ = tf.placeholder(tf.float32, [None, 10])

## Training
We need to define the cost function used to learn the parameters of the model. For this example we use the cross-entropy, which measures how poorly the predicted distribution $y$ fits the true distribution $y'$ (lower is better).
$$\mathbf H_{y'}(y) = -\sum_i y'_i \log(y_i)$$

It can be implemented with the following code:

In [5]:
#cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

This raw computation is numerically unstable, so the preferred implementation applies tf.nn.softmax_cross_entropy_with_logits to the unnormalized logits. We change our model accordingly:

In [6]:
y = tf.matmul(x, W) + b
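# keyword arguments make the argument order explicit (newer TensorFlow releases require them)
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y, labels=y_))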
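
To see why working on the unnormalized logits helps, here is a plain-Python sketch of cross-entropy computed through a log-sum-exp form. It is an assumption-laden illustration, not TensorFlow's actual implementation: the names `log_softmax` and `cross_entropy_from_logits` are hypothetical.

```python
import math


def log_softmax(logits):
    """log(softmax(z)) computed as z - logsumexp(z), without overflow."""
    shift = max(logits)
    log_total = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return [z - log_total for z in logits]


def cross_entropy_from_logits(labels, logits):
    """H = -sum_i labels_i * log(softmax(logits)_i) for one example."""
    return -sum(t * lp for t, lp in zip(labels, log_softmax(logits)))


# An ordinary example: the true class is the first one.
print(cross_entropy_from_logits([1.0, 0.0, 0.0], [2.0, 1.0, 0.1]))

# An extreme example: calling math.exp(1000.0) directly would overflow,
# and log(0) would be undefined, but the shifted form stays finite.
print(cross_entropy_from_logits([0.0, 1.0], [1000.0, 0.0]))
```

The naive route (softmax first, then the logarithm) can produce a probability of exactly 0 for a very unlikely class, and the logarithm of 0 is undefined. Staying in log space avoids that.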

The aim is to find the set of parameters that minimizes the cost of modelling the data with our model. In doing so, the model approximates the actual distribution that generated the data.

At this point we have defined the model and the cost function. We know what we want the model to do. But how do we instruct it to do so?

A basic approach for minimization problems is gradient descent, used here with a learning rate of 0.5.

In [7]:
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
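
As a rough picture of what a gradient descent step does, here is a plain-Python sketch on a one-parameter toy cost. Everything in it is an assumption for illustration: the function name `gradient_descent`, the toy cost $(w - 3)^2$, and the learning rate of 0.1 are not taken from the notebook.

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly move the parameter against the gradient of the cost."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


# Toy cost (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0),
                         start=0.0, learning_rate=0.1, steps=100)
print(w_min)  # very close to 3
```

In the notebook, the optimizer applies the same kind of update to every entry of W and b, using gradients of the cross-entropy that TensorFlow derives from the graph.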

Now we have defined the complete scope of the computations and how to perform them. We need to initialize the variables and execute the model.

In [8]:
sess = tf.InteractiveSession()

sess.run(tf.initialize_all_variables())

In [9]:
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

Now the model has been trained. To measure its performance we compare its predictions against the actual labels. We use tf.argmax, which gives the index of the highest entry in a tensor along some axis. For us, tf.argmax(y,1) is the predicted label for each sample and tf.argmax(y_,1) is the correct label. We compare the two with tf.equal to get the matches.

In [10]:
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))

Since the results are boolean, we cast them to float in order to compute the mean.

In [11]:
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
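
The argmax, compare, cast, and mean steps can be traced by hand on a tiny batch. This plain-Python sketch uses hypothetical names (`argmax`, `batch_accuracy`) and made-up scores with three classes; it mirrors the logic above without using TensorFlow.

```python
def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=values.__getitem__)


def batch_accuracy(logits_batch, labels_batch):
    """Fraction of rows where the predicted index equals the true index."""
    matches = [argmax(logits) == argmax(labels)
               for logits, labels in zip(logits_batch, labels_batch)]
    # Cast booleans to floats, then take the mean.
    return sum(1.0 if m else 0.0 for m in matches) / len(matches)


# Four made-up samples; the last prediction is wrong.
logits_batch = [[2.0, 0.1, 0.3],
                [0.2, 1.5, 0.1],
                [0.0, 0.4, 3.0],
                [1.2, 0.9, 0.1]]
labels_batch = [[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]]

print(batch_accuracy(logits_batch, labels_batch))  # 0.75
```

Because softmax preserves the ordering of its inputs, taking the argmax of the raw logits gives the same predicted class as taking it on the probabilities.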

Now we are ready to evaluate the generalization performance of our model. For this we use the test dataset as input.

In [12]:
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

0.9182


In [13]:
sess.close()

To improve on the roughly 92% accuracy we implement a multilayer convolutional network.

## Multilayer Convolutional Network

We need to create a large number of weights and biases. It is good practice to initialize the weights with a small amount of noise to break symmetry and to prevent zero gradients. Our model uses ReLU neurons (https://en.wikipedia.org/wiki/Rectifier_(neural_networks)), which should be initialized with a slightly positive bias to avoid "dead neurons".

We create a couple of functions for this.

In [14]:
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
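
For intuition about the noisy initial values: tf.truncated_normal draws from a normal distribution and re-draws any value that lies more than two standard deviations from the mean. The plain-Python sketch below imitates that behaviour with simple rejection sampling; the function name `truncated_normal_sample` is hypothetical and this is not how TensorFlow implements it internally.

```python
import random


def truncated_normal_sample(n, stddev=0.1, mean=0.0, rng=random):
    """Draw n normal values, rejecting any further than 2 * stddev from mean."""
    values = []
    while len(values) < n:
        v = rng.gauss(mean, stddev)
        if abs(v - mean) <= 2.0 * stddev:
            values.append(v)
    return values


weights = truncated_normal_sample(5, stddev=0.1, rng=random.Random(0))
print(weights)  # five small values, all within [-0.2, 0.2]
```

With stddev=0.1, every initial weight is small and non-zero with overwhelming probability, which is what breaks the symmetry between neurons.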

### Convolution and pooling

This kind of neural network has characteristic steps, each with its own details. Convolution, pooling, boundary handling (padding), and stride size are the key terms here.
For this basic tutorial the convolutions have a stride of one and are zero-padded, so the output size matches the input size. The pooling is plain max pooling over 2x2 blocks.

We also create a pair of functions to perform these steps.

In [15]:
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                         strides=[1, 2, 2, 1], padding='SAME')
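
To show what 2x2 max pooling with a stride of 2 does to a single channel, here is a plain-Python sketch. The name `max_pool_2x2_plain` and the 4x4 example image are made up for illustration, and the sketch assumes an even height and width so that no padding is needed.

```python
def max_pool_2x2_plain(image):
    """Max over non-overlapping 2x2 blocks of a 2-D list (even height and width)."""
    rows, cols = len(image), len(image[0])
    return [[max(image[r][c], image[r][c + 1],
                 image[r + 1][c], image[r + 1][c + 1])
             for c in range(0, cols, 2)]
            for r in range(0, rows, 2)]


image = [[1, 2, 0, 1],
         [3, 4, 1, 0],
         [0, 0, 5, 6],
         [1, 0, 7, 8]]

pooled = max_pool_2x2_plain(image)
print(pooled)  # [[4, 1], [1, 8]]
```

Each output value is the largest value in its 2x2 block, so the height and width are halved.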

### First Convolutional Layer

The first layer consists of a convolution followed by max pooling. The convolution will compute 32 features for each 5x5 patch. Its weight tensor will have a shape of [5, 5, 1, 32]. The first two dimensions are the patch size, the next is the number of input channels, and the last is the number of output channels. We will also have a bias vector with a component for each output channel.

In [17]:
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])

To apply the layer, we first reshape x to a 4d tensor, with the second and third dimensions corresponding to image width and height, and the final dimension corresponding to the number of color channels.

In [19]:
x_image = tf.reshape(x, [-1,28,28,1])

We then convolve x_image with the weight tensor, add the bias, apply the ReLU function, and finally max pool.

In [20]:
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

### Second Convolutional Layer

The second layer has the same structure as the first one. It will compute 64 features for each 5x5 patch.

In [21]:
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

Now that the image size has been reduced to 7x7, we add a fully-connected layer with 1024 neurons to allow processing on the entire image. We reshape the tensor from the pooling layer into a batch of vectors, multiply by a weight matrix, add a bias, and apply a ReLU.

In [22]:
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
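
The 7x7 size and the 7 * 7 * 64 flattened length can be checked with a little arithmetic. With 'SAME' padding the output size along each spatial dimension is the input size divided by the stride, rounded up. The sketch below applies that rule layer by layer; the helper name `same_output_size` is hypothetical.

```python
import math


def same_output_size(size, stride):
    """Spatial output size of a 'SAME'-padded convolution or pooling op."""
    return math.ceil(size / stride)


size = 28
size = same_output_size(size, 1)  # first convolution, stride 1
size = same_output_size(size, 2)  # first max pool, stride 2
size = same_output_size(size, 1)  # second convolution, stride 1
size = same_output_size(size, 2)  # second max pool, stride 2

flat_length = size * size * 64    # 64 feature maps after the second layer
print(size, flat_length)          # 7 3136
```

So each image reaches the fully-connected layer as a vector of 3136 values, which matches the first dimension of W_fc1.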

### Dropout

To reduce overfitting, we will apply dropout before the readout layer. We create a placeholder for the probability that a neuron's output is kept during dropout. This allows us to turn dropout on during training, and turn it off during testing. TensorFlow's tf.nn.dropout op automatically handles scaling neuron outputs in addition to masking them, so dropout just works without any additional scaling.

In [23]:
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
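
The "masking plus scaling" behaviour can be sketched in plain Python. Kept values are divided by keep_prob so that the expected sum of the outputs stays the same. The function name `dropout_plain` is hypothetical and this is only an illustration of the idea, not TensorFlow's implementation.

```python
import random


def dropout_plain(values, keep_prob, rng=random):
    """Keep each value with probability keep_prob and scale it by 1 / keep_prob."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in values]


activations = [1.0, 2.0, 3.0, 4.0]

# Training: roughly half of the values are zeroed, the rest are doubled.
print(dropout_plain(activations, 0.5, rng=random.Random(0)))

# Testing: keep_prob of 1.0 leaves the values unchanged.
print(dropout_plain(activations, 1.0))
```

This is why the notebook feeds keep_prob: 0.5 during training and keep_prob: 1.0 when measuring accuracy.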

### Readout Layer

Finally, we add a readout layer, just like the one in the single-layer softmax regression above.

In [24]:
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2

### Train and Evaluate the Model

How well does this model do? To train and evaluate it we will use code that is nearly identical to that for the simple one-layer softmax network above.

The differences are that:

-  We will replace the steepest gradient descent optimizer with the more sophisticated ADAM optimizer.
-  We will include the additional parameter keep_prob in feed_dict to control the dropout rate.
-  We will add logging to every 100th iteration in the training process.

Feel free to go ahead and run this code, but it does 20,000 training iterations and may take a while (possibly up to half an hour), depending on your processor.

In [None]:
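# keyword arguments make the argument order explicit (newer TensorFlow releases require them)
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_conv, labels=y_))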
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

sess = tf.InteractiveSession()
sess.run(tf.initialize_all_variables())

for i in range(20000):
    batch = mnist.train.next_batch(50)
    if i%100 == 0:
        train_accuracy = accuracy.eval(feed_dict={x:batch[0], y_: batch[1], keep_prob: 1.0})
        print("step %d, training accuracy %g"%(i, train_accuracy))
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

print("test accuracy %g"%accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

step 0, training accuracy 0.12
