# Deep convolutional MNIST classifier

#### Load MNIST Data hosted on http://yann.lecun.com/exdb/mnist/

The data is split into three parts: 55,000 data points of training data (mnist.train), 10,000 points of test data (mnist.test), and 5,000 points of validation data (mnist.validation). 

In [2]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

Extracting MNIST_data\train-images-idx3-ubyte.gz
Extracting MNIST_data\train-labels-idx1-ubyte.gz
Extracting MNIST_data\t10k-images-idx3-ubyte.gz
Extracting MNIST_data\t10k-labels-idx1-ubyte.gz


#### Start TensorFlow Interactive Session

In [3]:
import tensorflow as tf
sess = tf.InteractiveSession()

#### Building the computation graph by creating nodes for the input images and target output classes
x and y_ are each a placeholder: a value that we'll input when we ask TensorFlow to run a computation.

In [4]:
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])

x and y_ are both 2-D tensors.

Here we assign x a shape of [None, 784], where 784 is the dimensionality of a single flattened 28 by 28 pixel MNIST image, and None indicates that the first dimension, corresponding to the batch size, can be of any size.

y_ will also consist of a 2d tensor, where each row is a one-hot 10-dimensional vector indicating which digit class (zero through nine) the corresponding MNIST image belongs to.
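As a small illustration of the one-hot rows described above, here is a minimal pure-Python sketch. The helper name `one_hot` and the sample labels are hypothetical and not part of the notebook; `read_data_sets(..., one_hot=True)` already produces this encoding for you.

```python
def one_hot(label, num_classes=10):
    """Return a list of length num_classes with 1.0 at index `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

# A hypothetical batch of three digit labels.
sample_labels = [3, 0, 9]
y_batch = [one_hot(d) for d in sample_labels]

for row in y_batch:
    print(row)
```

Each row sums to 1 and has length 10, which matches the [None, 10] shape given to y_.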

#### Define the variables W and b (model parameters) as tensors full of zeros

In [5]:
W = tf.Variable(tf.zeros([784,10]))
b = tf.Variable(tf.zeros([10]))

W is a 784x10 matrix (because we have 784 input features and 10 outputs) and b is a 10-dimensional vector (because we have 10 classes).

#### Initialize variables using that session

In [6]:
sess.run(tf.global_variables_initializer())

#### Regression model

In [7]:
y = tf.matmul(x,W) + b
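To make the shapes concrete, the sketch below computes the same x·W + b expression in plain Python on a toy problem with 3 input features and 2 classes instead of 784 and 10. The function `linear_logits` and the toy values are assumptions made for illustration only.

```python
def linear_logits(x, W, b):
    """Compute x @ W + b.

    x: [batch, features], W: [features, classes], b: [classes].
    Returns a [batch, classes] list of lists.
    """
    return [
        [sum(xi * W[k][j] for k, xi in enumerate(row)) + b[j]
         for j in range(len(b))]
        for row in x
    ]

x_small = [[1.0, 2.0, 3.0],
           [0.0, 1.0, 0.0]]        # 2 examples, 3 features
W_small = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]             # 3 features, 2 classes
b_small = [0.5, -0.5]              # one bias per class

logits = linear_logits(x_small, W_small, b_small)
print(logits)  # [[4.5, 4.5], [0.5, 0.5]]
```

The bias vector is added to every row, which is the same broadcasting TensorFlow performs in `tf.matmul(x,W) + b`.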

#### Predicted Class and Loss Function
Loss indicates how bad the model's prediction was on a single example; we try to minimize it while training across all the examples. The loss function is the cross-entropy between the target and the softmax activation function applied to the model's prediction.

tf.nn.softmax_cross_entropy_with_logits internally applies the softmax to the model's unnormalized prediction and sums across all classes, and tf.reduce_mean takes the average over these sums.

In [8]:
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
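The computation described above can be sketched in plain Python. This is only an illustration of the math, not TensorFlow's implementation; the function names are hypothetical, and the max-subtraction inside `softmax` is a common numerical-stability trick that this sketch assumes.

```python
import math

def softmax(logits):
    """Turn unnormalized scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_row(labels, logits):
    """Cross-entropy for one example: -sum(label * log(softmax(logit)))."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(labels, probs) if t > 0)

def mean_cross_entropy(label_rows, logit_rows):
    """Average the per-example losses, as tf.reduce_mean does over the batch."""
    losses = [cross_entropy_row(t, z) for t, z in zip(label_rows, logit_rows)]
    return sum(losses) / len(losses)

# Two classes with equal scores: the model assigns probability 0.5
# to the true class, so the loss is log(2).
loss = mean_cross_entropy([[1.0, 0.0]], [[0.0, 0.0]])
print(loss)
```

A confident correct prediction drives the loss toward 0, while a confident wrong prediction makes it large.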

#### Training the model
TensorFlow has a variety of built-in optimization algorithms. For this example, we will use steepest gradient descent, with a step length of 0.5, to descend the cross entropy.

In [9]:
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

Training the model can therefore be accomplished by repeatedly running train_step. We run 1000 training iterations and load 100 training examples in each one.

In each iteration we run the train_step operation, using feed_dict to replace the placeholder tensors x and y_ with the training examples.

In [10]:
for _ in range(1000):
  batch = mnist.train.next_batch(100)
  train_step.run(feed_dict={x: batch[0], y_: batch[1]})
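For intuition about what each call to train_step does, here is a minimal gradient descent sketch on a one-parameter toy loss. The loss f(w) = 0.5 * (w - 3)^2 and the helper `gradient_descent` are assumptions chosen for illustration; TensorFlow derives the gradients of the cross-entropy automatically.

```python
def gradient_descent(grad, w0, step_length, num_steps):
    """Repeatedly move w against the gradient: w <- w - step_length * grad(w)."""
    w = w0
    for _ in range(num_steps):
        w = w - step_length * grad(w)
    return w

# Toy loss f(w) = 0.5 * (w - 3)^2 has gradient (w - 3) and its minimum at w = 3.
w_final = gradient_descent(lambda w: w - 3.0, w0=0.0, step_length=0.5, num_steps=20)
print(w_final)
```

With a step length of 0.5 the distance to the minimum halves on every step of this toy problem, so 20 steps land very close to 3.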

#### Evaluate the Model

How well did our model do?

First we'll figure out where we predicted the correct label. tf.argmax is an extremely useful function which gives you the index of the highest entry in a tensor along some axis. For example, tf.argmax(y,1) is the label our model thinks is most likely for each input, while tf.argmax(y_,1) is the true label. We can use tf.equal to check if our prediction matches the truth.

In [11]:
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))

That gives us a list of booleans. To determine what fraction are correct, we cast to floating point numbers and then take the mean. For example, [True, False, True, True] would become [1,0,1,1], whose mean is 0.75.

In [12]:
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print("Accuracy: ", accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

Accuracy:  0.9173
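The argmax, equal, and mean steps used for the accuracy can be reproduced on a tiny hand-made batch. The sketch below is plain Python with hypothetical names and made-up scores; it reproduces the [True, False, True, True] example from the text.

```python
def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=row.__getitem__)

def batch_accuracy(logit_rows, label_rows):
    """Fraction of rows where the predicted class equals the true class."""
    correct = [argmax(z) == argmax(t) for z, t in zip(logit_rows, label_rows)]
    return sum(correct) / len(correct)

toy_logits = [[2.0, 1.0, 0.1],
              [0.3, 0.2, 0.9],   # predicts class 2, truth is class 1
              [0.1, 3.0, 0.2],
              [1.5, 0.2, 0.1]]
toy_labels = [[1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [1, 0, 0]]

print(batch_accuracy(toy_logits, toy_labels))  # 0.75
```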


# Building a Multilayer Convolutional Network

#### Weight initialization

Initialize weights with a small amount of noise for symmetry breaking, and to prevent 0 gradients. Since we're using ReLU neurons, it is also good practice to initialize them with a slightly positive initial bias to avoid "dead neurons". Instead of doing this repeatedly while we build the model, let's create two handy functions to do it for us.

In [13]:
def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

tf.truncated_normal(shape, mean=0.0, stddev=1.0, dtype=tf.float32, seed=None, name=None)
Outputs random values from a truncated normal distribution.
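A truncated normal draws from a normal distribution but discards and redraws any value that falls more than two standard deviations from the mean. The following standard-library sketch mimics that behavior by rejection sampling; the function name mirrors the TensorFlow op only for readability, and the implementation is an assumption, not TensorFlow's own.

```python
import random

def truncated_normal(n, mean=0.0, stddev=1.0, rng=None):
    """Draw n samples, redrawing any sample beyond 2 stddev from the mean."""
    rng = rng or random.Random()
    samples = []
    while len(samples) < n:
        v = rng.gauss(mean, stddev)
        if abs(v - mean) <= 2.0 * stddev:
            samples.append(v)
    return samples

# With stddev=0.1, as in weight_variable, every weight lies in [-0.2, 0.2].
weights = truncated_normal(1000, stddev=0.1, rng=random.Random(0))
print(min(weights), max(weights))
```

Bounding the initial weights this way avoids the occasional very large value that a plain normal distribution would produce.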

#### Convolution and Pooling
- Pooling: downsampling an image. Max pooling takes the largest value in each pixel neighborhood (patch).
- Convolution = feature matching. Pooling = best-match detection.

Convolutions use a stride of 1 and are zero-padded so that the output is the same size as the input. Pooling is max pooling over 2x2 blocks.

In [14]:
def conv2d(x, W):
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
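To see what 2x2 max pooling with a stride of 2 does to a single channel, consider this plain-Python sketch on a 4x4 grid. The helper `max_pool_2x2_list` is hypothetical and assumes even height and width, so no padding is needed.

```python
def max_pool_2x2_list(image):
    """Max-pool a 2-D list over non-overlapping 2x2 blocks (even sizes assumed)."""
    rows, cols = len(image), len(image[0])
    return [
        [max(image[r][c], image[r][c + 1],
             image[r + 1][c], image[r + 1][c + 1])
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]

image = [[1, 3, 2, 0],
         [4, 2, 1, 1],
         [0, 0, 5, 6],
         [1, 2, 7, 8]]

pooled = max_pool_2x2_list(image)
print(pooled)  # [[4, 2], [2, 8]]
```

Each output value keeps only the strongest response in its block, halving the width and height.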

#### First Convolutional Layer
Pipeline: Convolution >> ReLU >> max pooling

The first layer consists of a convolution followed by max pooling.

The convolution will compute 32 features for each 5x5 patch. Its weight tensor will have a shape of [5, 5, 1, 32]. The first two dimensions are the patch size, the next is the number of input channels, and the last is the number of output channels. We will also have a bias vector with a component for each output channel.

In [15]:
depth_filter1 = 32
W_conv1 = weight_variable([5, 5, 1, depth_filter1])
b_conv1 = bias_variable([depth_filter1])

To apply the layer, we first reshape x to a 4d tensor, with the second and third dimensions corresponding to image width and height, and the final dimension corresponding to the number of color channels.



In [16]:
x_image = tf.reshape(x, [-1,28,28,1])

We then convolve x_image with the weight tensor, add the bias, apply the ReLU function, and finally max pool. The max_pool_2x2 function will reduce the image size to 14x14.


In [17]:
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

#### Second Convolutional Layer
In order to build a deep network, we stack several layers of this type. The second layer will have 64 features for each 5x5 patch.

In [18]:
depth_filter2 = 64
W_conv2 = weight_variable([5, 5, depth_filter1, depth_filter2])
b_conv2 = bias_variable([depth_filter2])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)
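With 'SAME' padding, the spatial output size of a convolution or pooling op is the input size divided by the stride, rounded up. The short sketch below traces the image width (and height) through the two conv/pool pairs; `same_output_size` is a hypothetical helper, not a TensorFlow function.

```python
import math

def same_output_size(size, stride):
    """Spatial output size under 'SAME' padding: ceil(size / stride)."""
    return math.ceil(size / stride)

size = 28
trace = [size]
# Strides of conv1, pool1, conv2, pool2 as defined in conv2d and max_pool_2x2.
for stride in (1, 2, 1, 2):
    size = same_output_size(size, stride)
    trace.append(size)

print(trace)  # [28, 28, 14, 14, 7]
```

The convolutions leave the size unchanged and each pooling step halves it, so the second pooling layer outputs 7x7 feature maps.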

#### Densely Connected Layer
Now that the image size has been reduced to 7x7, we add a fully-connected layer with 1024 neurons to allow processing on the entire image. We reshape the tensor from the pooling layer into a batch of vectors, multiply by a weight matrix, add a bias, and apply a ReLU.



In [19]:
W_fc1 = weight_variable([7 * 7 * depth_filter2, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*depth_filter2])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

#### Dropout
To reduce overfitting, we will apply dropout before the readout layer. We create a placeholder for the probability that a neuron's output is kept during dropout. This allows us to turn dropout on during training, and turn it off during testing. TensorFlow's tf.nn.dropout op automatically handles scaling neuron outputs in addition to masking them, so dropout just works without any additional scaling.



In [20]:
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
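The masking and scaling can be sketched in a few lines of plain Python. This is an illustration under the assumption of element-wise independent masking, with a hypothetical function name; kept values are divided by keep_prob so the expected output equals the input.

```python
import random

def dropout_sketch(values, keep_prob, rng=None):
    """Keep each value with probability keep_prob and scale it by 1/keep_prob."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    rng = rng or random
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

activations = [1.0] * 1000

# Training: roughly half the activations are zeroed, the rest are doubled.
train_out = dropout_sketch(activations, keep_prob=0.5, rng=random.Random(0))
print(sum(train_out) / len(train_out))

# Testing: keep_prob=1.0 leaves the activations unchanged.
test_out = dropout_sketch(activations, keep_prob=1.0)
```

Because of the 1/keep_prob scaling, the mean activation stays close to 1.0 during training, which is why no extra rescaling is needed at test time.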

#### Readout Layer
Finally, we add a readout layer, just like the one in the single-layer softmax regression above.

In [21]:
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
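As a sanity check on the layer shapes used above, the following sketch counts the trainable parameters (weights plus biases) of each layer. The helper names are hypothetical; the shapes are taken from the weight_variable and bias_variable calls in the cells above.

```python
def conv_params(patch_h, patch_w, in_channels, out_channels):
    """Weights of shape [patch_h, patch_w, in, out] plus one bias per output channel."""
    return patch_h * patch_w * in_channels * out_channels + out_channels

def dense_params(n_in, n_out):
    """Weight matrix [n_in, n_out] plus one bias per output unit."""
    return n_in * n_out + n_out

layer_params = {
    "conv1": conv_params(5, 5, 1, 32),
    "conv2": conv_params(5, 5, 32, 64),
    "fc1": dense_params(7 * 7 * 64, 1024),
    "fc2": dense_params(1024, 10),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(name, count)
print("total", total_params)  # total 3274634
```

Almost all of the parameters sit in the first fully-connected layer, which is also the layer that dropout is applied to.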

#### Train and Evaluate the Model
- Optimizer: Adam
- Log the training accuracy on every 100th iteration of the training process.

In [23]:
import timeit
startTrainingTime = timeit.default_timer()

# Loss, optimizer, and accuracy for the convolutional model
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
sess.run(tf.global_variables_initializer())
num_steps = 20000  # mini-batch training steps, not full passes over the data
for i in range(num_steps):
    batch = mnist.train.next_batch(50)
    if i%100 == 0:
        # Accuracy on the current batch, with dropout disabled (keep_prob = 1.0)
        train_accuracy = accuracy.eval(feed_dict={x:batch[0], y_: batch[1], keep_prob: 1.0})
        print("step %d, training accuracy %g"%(i, train_accuracy))
    if i%2500 == 0:
        # Elapsed wall-clock time in seconds since training started
        endTrainingTime = timeit.default_timer()
        print("Total training time: ", endTrainingTime - startTrainingTime)
    # One optimizer step, with dropout enabled (keep_prob = 0.5)
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

# Final evaluation on the test set, with dropout disabled
print("test accuracy %g"%accuracy.eval(feed_dict={
    x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

step 0, training accuracy 0.08
Total training time:  0.5800943269464117
step 100, training accuracy 0.8
step 200, training accuracy 0.78
step 300, training accuracy 0.96
step 400, training accuracy 0.96
step 500, training accuracy 0.94
step 600, training accuracy 0.92
step 700, training accuracy 0.92
step 800, training accuracy 0.88
step 900, training accuracy 0.98
step 1000, training accuracy 0.98
step 1100, training accuracy 0.98
step 1200, training accuracy 0.96
step 1300, training accuracy 0.98
step 1400, training accuracy 1
step 1500, training accuracy 0.94
step 1600, training accuracy 0.98
step 1700, training accuracy 0.98
step 1800, training accuracy 1
step 1900, training accuracy 1
step 2000, training accuracy 1
step 2100, training accuracy 0.94
step 2200, training accuracy 0.96
step 2300, training accuracy 0.98
step 2400, training accuracy 1
step 2500, training accuracy 0.98
Total training time:  204.58306096239716
step 2600, training accuracy 0.98
step 2700, training accuracy