# TensorFlow Tutorial on MNIST dataset

We would like to train a deep network to classify the MNIST dataset into two classes, deciding whether the digit in an image is greater than or equal to three, or less than three, i.e. digit >= 3 or digit < 3.

### Additional Tasks
- Downsample the original images to 14 x 14 pixels
- Blur the images

### Questions to be answered
- The network architecture must have 1-3 batch-normalized CNN layers (with 3 x 3 x N kernels, where N is an adjustable parameter) and 2x2 strides, followed by a single fully connected layer.

- Any other layers or components can be added if required. We would like to infer how many CNN layers (1, 2 or 3) give the best performance of the network on the dataset we have.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import matplotlib.pyplot as plt
import math


%matplotlib inline


In [2]:
mnist = input_data.read_data_sets("../MNIST_data/")
trX, trY, teX, teY = mnist.train.images, mnist.train.labels, mnist.test.images, mnist.test.labels

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


In [3]:
trY[0:10]

array([7, 3, 4, 6, 1, 8, 1, 0, 9, 8], dtype=uint8)

In [4]:
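# Two-column label array: column 0 is set to 1 when the digit is >= 3.
# Note: column 1 is never set, so digits < 3 get the all-zero row [0, 0]
# rather than the one-hot row [0, 1], as the output below shows.
trY2=np.zeros((trX.shape[0],2))
trY2[trY>=3,0]=1
trY2[0:10,:]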

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 0.],
       [1., 0.],
       [0., 0.],
       [0., 0.],
       [1., 0.],
       [1., 0.]])
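
The encoding above sets only column 0, so digits below 3 map to the all-zero row. A minimal sketch of a full one-hot encoding follows, assuming column 0 means "digit >= 3" and column 1 means "digit < 3". The column order and the helper name `to_two_class_one_hot` are assumptions and do not come from the notebook; the results reported later were produced with the original labels.

```python
import numpy as np

def to_two_class_one_hot(labels, threshold=3):
    """Hypothetical helper: one-hot encode 'digit >= threshold' vs 'digit < threshold'."""
    labels = np.asarray(labels)
    one_hot = np.zeros((labels.shape[0], 2))
    one_hot[labels >= threshold, 0] = 1  # column 0: digit >= threshold
    one_hot[labels < threshold, 1] = 1   # column 1: digit < threshold
    return one_hot

# The first ten training labels shown earlier in the notebook.
sample = np.array([7, 3, 4, 6, 1, 8, 1, 0, 9, 8], dtype=np.uint8)
encoded = to_two_class_one_hot(sample)
```

With this encoding every row sums to 1, so `argmax` over a label row distinguishes the two classes.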

In [5]:
trX = trX.reshape(-1, 28, 28, 1)  # 28x28x1 input img
teX = teX.reshape(-1, 28, 28, 1)  # 28x28x1 input img

In [6]:
# Same encoding for the test labels (column 1 is again left at zero).
teY2=np.zeros((teX.shape[0],2))
teY2[teY>=3,0]=1
teY2[0:10,:]

array([[1., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [1., 0.],
       [0., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.]])

# Blurring the image using TensorFlow
We take the outer product of a 1D Gaussian with itself to obtain a 2D Gaussian filter for blurring the images. Since the images are small, the blur filter is 3x3. We define a TensorFlow convolution with a predefined filter (the Gaussian kernel) of dimension 3x3x1x1, so that no weights are learned for it. The kernel is fed into the computation graph as an input.

In [7]:
s, k = 1, 1 #  generate a (2k+1)x(2k+1) gaussian kernel with mean=0 and sigma = s
probs = [np.exp(-z*z/(2*s*s))/np.sqrt(2*np.pi*s*s) for z in range(-k,k+1)] 
gaussian_kernel = np.outer(probs, probs)
gaussian_kernel

array([[0.05854983, 0.09653235, 0.05854983],
       [0.09653235, 0.15915494, 0.09653235],
       [0.05854983, 0.09653235, 0.05854983]])

In [8]:
gaussian_kernel = np.expand_dims(np.expand_dims(gaussian_kernel, 2), 3)
gaussian_kernel.shape

(3, 3, 1, 1)
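
The kernel printed above sums to about 0.78 rather than 1, so the blur also scales pixel intensities down. If preserving overall brightness matters, a normalized variant could be used instead. This is a sketch only: the normalization step and the name `normalized_gaussian_kernel` are additions and are not part of the original notebook.

```python
import numpy as np

def normalized_gaussian_kernel(sigma=1.0, k=1):
    """Hypothetical helper: (2k+1)x(2k+1) Gaussian kernel whose entries sum to 1."""
    z = np.arange(-k, k + 1)
    probs = np.exp(-z * z / (2.0 * sigma * sigma))  # unnormalized 1D Gaussian samples
    kernel = np.outer(probs, probs)                 # separable 2D kernel
    kernel /= kernel.sum()                          # normalize to unit sum
    return kernel[:, :, np.newaxis, np.newaxis]     # shape (2k+1, 2k+1, 1, 1)

kernel = normalized_gaussian_kernel()
```

The result has the same `(3, 3, 1, 1)` shape, so it could be fed to the same placeholder.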

# TensorFlow Graph
We begin by creating placeholders for the input data that will be fed into the model when running the session. We have four inputs: X (images), Y (labels), the Gaussian kernel, and phase, a boolean flag that tells the graph whether we are training or testing. Batch normalization behaves differently during the training and testing phases of our model. We also create the weights to be learned in each layer. Note that we only initialize the weights/filters for the conv2d functions explicitly; TensorFlow initializes the weights of the fully connected part automatically.

In [9]:
tf.reset_default_graph()
X=tf.placeholder(tf.float32,[None,28,28,1])
g_filter = tf.placeholder(dtype=tf.float32, shape=(3, 3, 1, 1))
Y=tf.placeholder(tf.float32,[None,2])

phase = tf.placeholder(tf.bool, name='phase')

layers=2

W1 = tf.get_variable("W1",[3,3,1,16],initializer=tf.contrib.layers.xavier_initializer(seed=0))
W2 = tf.get_variable("W2",[3,3,16,32],initializer=tf.contrib.layers.xavier_initializer(seed=0))
W3 = tf.get_variable("W3",[3,3,32,64],initializer=tf.contrib.layers.xavier_initializer(seed=0))

W=[W1,W2,W3]

### Batch Norm
TensorFlow has its own batch normalization function, tf.nn.batch_normalization(), but it only works with explicitly supplied mean, variance, offset (beta) and scale (gamma). We compute these quantities as described in http://arxiv.org/abs/1502.03167 and pass them as arguments to tf.nn.batch_normalization().

In [10]:
def batch_norm(x, n_out, phase_train, scope='bn'):
    """
    Batch normalization on convolutional maps.
    Args:
        x:           Tensor, 4D input maps (batch, height, width, depth)
        n_out:       integer, depth of input maps
        phase_train: boolean tf.Tensor, true indicates training phase
        scope:       string, variable scope
    Return:
        normed:      batch-normalized maps
    """
    with tf.variable_scope(scope):
        beta = tf.Variable(tf.constant(0.0, shape=[n_out]),
                                     name='beta', trainable=True)
        gamma = tf.Variable(tf.constant(1.0, shape=[n_out]),
                                      name='gamma', trainable=True)
        batch_mean, batch_var = tf.nn.moments(x, [0,1,2], name='moments')
        ema = tf.train.ExponentialMovingAverage(decay=0.5)

        def mean_var_with_update():
            ema_apply_op = ema.apply([batch_mean, batch_var])
            with tf.control_dependencies([ema_apply_op]):
                return tf.identity(batch_mean), tf.identity(batch_var)

        mean, var = tf.cond(phase_train,
                            mean_var_with_update,
                            lambda: (ema.average(batch_mean), ema.average(batch_var)))
        normed = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-3)
    return normed
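
To make the two phases concrete, here is a NumPy sketch of the same computation outside TensorFlow: per-channel moments over the batch, height and width axes during training, and stored moving averages at test time. The function name `batch_norm_np` and the explicit `running` dictionary are illustrative assumptions. The update follows the rule documented for `tf.train.ExponentialMovingAverage`, `shadow = decay * shadow + (1 - decay) * value`.

```python
import numpy as np

def batch_norm_np(x, gamma, beta, running, training, decay=0.5, eps=1e-3):
    """Illustrative batch norm over a (batch, height, width, channels) array."""
    if training:
        # Per-channel statistics of the current batch.
        mean = x.mean(axis=(0, 1, 2))
        var = x.var(axis=(0, 1, 2))
        # Exponential moving average of the batch statistics.
        running["mean"] = decay * running["mean"] + (1 - decay) * mean
        running["var"] = decay * running["var"] + (1 - decay) * var
    else:
        # At test time, reuse the stored averages instead of batch statistics.
        mean, var = running["mean"], running["var"]
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(8, 7, 7, 4))
running = {"mean": np.zeros(4), "var": np.ones(4)}
train_out = batch_norm_np(x, np.ones(4), np.zeros(4), running, training=True)
test_out = batch_norm_np(x, np.ones(4), np.zeros(4), running, training=False)
```

With a decay as low as 0.5, the stored averages track only the last few batches, so the training-phase and test-phase outputs for the same input can differ noticeably.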

### Generating random mini-batches

In [11]:

def random_mini_batches(X, Y, mini_batch_size, seed=0):
    """
    Creates a list of random mini-batches from (X, Y).

    Arguments:
    X -- input data, of shape (number of examples, ...)
    Y -- labels, of shape (number of examples, number of classes)
    mini_batch_size -- size of the mini-batches, integer
    seed -- seed for NumPy's random generator, so that shuffles are reproducible

    Returns:
    mini_batches -- list of (mini_batch_X, mini_batch_Y) tuples
    """
    np.random.seed(seed)            # make the shuffle reproducible for a given seed
    m = X.shape[0]                  # number of training examples
    mini_batches = []

    # Step 1: Shuffle X and Y with the same permutation
    permutation = np.random.permutation(m)
    shuffled_X = X[permutation]
    shuffled_Y = Y[permutation]

    # Step 2: Partition (shuffled_X, shuffled_Y) into complete mini-batches
    num_complete_minibatches = m // mini_batch_size
    for k in range(num_complete_minibatches):
        start, end = k * mini_batch_size, (k + 1) * mini_batch_size
        mini_batches.append((shuffled_X[start:end], shuffled_Y[start:end]))

    # Handle the end case (last mini-batch smaller than mini_batch_size).
    # Indexing from num_complete_minibatches also works when there are no complete batches.
    if m % mini_batch_size != 0:
        start = num_complete_minibatches * mini_batch_size
        mini_batches.append((shuffled_X[start:], shuffled_Y[start:]))

    return mini_batches
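
As a quick check of the partitioning arithmetic, the sketch below computes how many mini-batches the function returns and how large the last one is. The training-set size of 55,000 is an assumption about the `mnist.train` split and is not stated in the notebook; `batch_layout` is a hypothetical helper.

```python
def batch_layout(num_examples, mini_batch_size):
    """Hypothetical helper: (number of full batches, size of the trailing partial batch)."""
    full, remainder = divmod(num_examples, mini_batch_size)
    return full, remainder

# Assumed training-set size; the batch size matches the notebook's default of 128.
full_batches, last_batch_size = batch_layout(55000, 128)
total_batches = full_batches + (1 if last_batch_size else 0)
```

Under that assumption, each epoch iterates over one more batch than `int(m / batch_size)`, which is the divisor used when averaging the per-batch cost during training.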

# Layers
We define one layer of our model to consist of:
- A convolution layer
- A batch normalization layer
- A ReLU activation
- A max pooling layer

In [12]:
def layer(X,w,s, phase):
    """
    One layer of our network
    Args: 
    X:     Tensor, 4D inputs
    w:     Weights (filters) for the convolution in this layer
    s:     Stride of the convolution; also the window size and stride of the max pooling
    phase: Indicates if we are training or testing
    Returns:
    l4:    Output from max pooling layer
    """
    l1=tf.nn.conv2d(X,w,strides=[1,s,s,1],padding="SAME")
    # l2=tf.contrib.layers.batch_norm(l1,center=True, scale=True, is_training=phase)
    l2 = batch_norm(l1, w.shape[3], phase)
    l3=tf.nn.relu(l2)
    l4=tf.nn.max_pool(l3,ksize=[1,s,s,1],strides=[1,s,s,1],padding='SAME')
    return l4
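
With `SAME` padding, TensorFlow's output size along each spatial axis is `ceil(input / stride)`. The sketch below applies that rule to the stride-2 convolution and the stride-2 max pool in `layer`, starting from the 14 x 14 downsampled image. `layer_output_size` is a hypothetical helper name.

```python
import math

def layer_output_size(size, stride=2):
    """Spatial size after one conv (stride s, SAME) followed by one max pool (stride s, SAME)."""
    after_conv = math.ceil(size / stride)
    after_pool = math.ceil(after_conv / stride)
    return after_pool

# Track the feature-map side length through up to three stacked layers.
sizes = [14]
for _ in range(3):
    sizes.append(layer_output_size(sizes[-1]))
```

The side length goes 14, 4, 1, 1, so the feature map is already 1 x 1 after the second layer and a third layer only changes the number of channels.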
    

### Note: 
Fully connected (FC) layer: we use a fully connected layer without a non-linear activation function, and we do not call the softmax here. This gives 2 neurons in the output layer, whose values are passed to a softmax later. In TensorFlow, the softmax and the cost function are lumped together into a single function, which we call separately when computing the cost. 
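
For reference, here is a NumPy sketch of what a fused softmax-plus-cross-entropy computes from raw logits, using the usual log-sum-exp shift for numerical stability. It illustrates the math only and is not TensorFlow's implementation; `softmax_cross_entropy` is a hypothetical name.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy between one-hot `labels` and softmax(`logits`)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # avoid overflow in exp
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

logits = np.array([[2.0, 0.0], [0.0, 0.0], [1000.0, 0.0]])
labels = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
losses = softmax_cross_entropy(logits, labels)
```

Note that a label row of all zeros contributes zero loss under this formula, which is one reason each label row should be a valid one-hot vector.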

In [13]:
def model(X,W,layers, phase,g_filter):
    """
    Model
    Args:
    X:         input images as batches
    W:         All the convolution weights of the network as a list
    layers:    The number of layers in our network (1, 2 or 3)
    phase:     Indicates if we are training or testing
    g_filter:  Gaussian kernel used for blurring
    Returns:
    out:       Output (logits) from the fully connected layer
    """
    # Downsample to 14x14, then blur with the fixed Gaussian kernel
    X_resized=tf.image.resize_images(X,[14,14])
    X_blurred=tf.nn.conv2d(X_resized, filter=g_filter, strides=[1, 1, 1, 1], padding='SAME')
    
    # Stack the requested number of conv/batch-norm/ReLU/max-pool layers
    h=X_blurred
    for l in range(layers):
        print(W[l])
        h=layer(h,W[l],2, phase)
    FC1=tf.contrib.layers.flatten(h)
    out=tf.layers.dense(FC1, units=2, activation=None) # two logits, no activation
    
    return out

# The Network
We train three models on our data. The first model has one convolution layer, followed by batch normalization, max pooling and finally a fully connected layer. The second model repeats this layer twice before the fully connected layer, and the third model repeats it three times. In each model, the final output is passed through a softmax that gives us two classes. Since the data is downsampled to 14x14, the images are small. We also have a skewed data set (more 1s than 0s). We need to avoid overfitting, because the images are small and there are relatively few data points, so ideally we want a small to medium sized network.
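
Because the classes are imbalanced, a useful reference point is the accuracy of always predicting the majority class. The sketch below assumes digits are uniformly distributed, which is only an approximation of the real MNIST class frequencies, and uses a hypothetical helper `majority_baseline`.

```python
from collections import Counter

def majority_baseline(labels, threshold=3):
    """Accuracy obtained by always predicting the more frequent of the two classes."""
    counts = Counter(label >= threshold for label in labels)
    return max(counts.values()) / len(labels)

uniform_digits = list(range(10)) * 100  # idealized: every digit equally likely
baseline = majority_baseline(uniform_digits)
```

Under this idealization the baseline is 0.7, so a model should be judged against roughly 70% rather than 50%.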

In [16]:
def final_model(trX,trY,teX,teY,layers,learning_rate=0.001/2,num_epochs=100,batch_size=128):
    tf.reset_default_graph()
    X=tf.placeholder(tf.float32,[None,28,28,1])
    g_filter = tf.placeholder(dtype=tf.float32, shape=(3, 3, 1, 1))
    Y=tf.placeholder(tf.float32,[None,2])

    phase = tf.placeholder(tf.bool, name='phase')


    W1 = tf.get_variable("W1",[3,3,1,8],initializer=tf.contrib.layers.xavier_initializer())
    W2 = tf.get_variable("W2",[3,3,8,16],initializer=tf.contrib.layers.xavier_initializer())
    W3 = tf.get_variable("W3",[3,3,16,32],initializer=tf.contrib.layers.xavier_initializer())

    W=[W1,W2,W3]

    num_minibatches=int(trX.shape[0]/batch_size)

    #Output from the model
    Y_hat=model(X,W,layers, phase,g_filter)

    # Define loss and optimizer
    cost=tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=Y_hat,labels=Y))
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

    # Evaluate Model
    predict_op = tf.argmax(Y_hat, 1)
    correct_prediction = tf.equal(predict_op, tf.argmax(Y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

    #Initializer
    init=tf.global_variables_initializer()
    

    with tf.Session() as sess:
        seed=3
        sess.run(init)
        for epoch in range(0,num_epochs):
            minibatch_cost = 0.
            seed=seed+1  # new seed, so each epoch gets a different shuffle
            # Note: labels are taken from the globals trY2/teY2, not from the trY/teY arguments
            minibatches=random_mini_batches(trX,trY2,batch_size,seed)
            for minibatch in minibatches:
                minibatch_x,minibatch_y=minibatch
                _,temp_cost=sess.run([optimizer,cost],feed_dict={X:minibatch_x,Y:minibatch_y,phase:True,g_filter:gaussian_kernel})
                minibatch_cost += temp_cost / num_minibatches
            if epoch % 10 == 0:
                acc=accuracy.eval({X: trX, Y: trY2,phase:False,g_filter:gaussian_kernel})
                print ("Cost after epoch %i: %f and training accuracy: %f" % (epoch, minibatch_cost,acc))
        
        # Calculate accuracy on the training and test sets
        train_accuracy = accuracy.eval({X: trX, Y: trY2,phase:False,g_filter:gaussian_kernel})
        test_accuracy = accuracy.eval({X: teX, Y: teY2,phase:False,g_filter:gaussian_kernel})
        print("Train Accuracy:", train_accuracy)
        print("Test Accuracy:", test_accuracy)
    

# With One Layer
We see that after epoch 0 the accuracy is high (the data set is skewed and the network might be predicting the same class for every image), but the accuracy then decreases over the epochs, settling at around $\textbf{88\%}$. Moreover, the cost does not decrease steadily (it rises after epoch 10), which could mean that with the given hyperparameters (learning rate, batch size, etc.) the optimization is $\textbf{getting stuck}$ in a local minimum. The resulting performance is acceptable but not good. There is little difference between the training and test accuracy, suggesting that the network is not overfitting but has high bias, leading to an accuracy of $\textbf{88\%}$.

In [17]:
final_model(trX,trY,teX,teY,1)

<tf.Variable 'W1:0' shape=(3, 3, 1, 8) dtype=float32_ref>
Cost after epoch 0: 0.162500 and training accuracy: 0.962127
Cost after epoch 10: 0.121251 and training accuracy: 0.865745
Cost after epoch 20: 0.160026 and training accuracy: 0.849600
Cost after epoch 30: 0.189570 and training accuracy: 0.891000
Cost after epoch 40: 0.210857 and training accuracy: 0.869891
Cost after epoch 50: 0.226409 and training accuracy: 0.868600
Cost after epoch 60: 0.239815 and training accuracy: 0.875400
Cost after epoch 70: 0.245049 and training accuracy: 0.858818
Cost after epoch 80: 0.245565 and training accuracy: 0.881145
Cost after epoch 90: 0.242607 and training accuracy: 0.880982
Train Accuracy: 0.87929094
Test Accuracy: 0.8801


# With Two Layers
We see that after epoch 0 the accuracy is high (the data set is skewed and the network might be predicting the same class for every image), but the accuracy then drops over the epochs and only partly recovers, reaching around $\textbf{85\%}$. Moreover, the cost does not decrease steadily (it rises until about epoch 40 before falling again), which could mean that with the given hyperparameters (learning rate, batch size, etc.) the optimization is $\textbf{getting stuck}$ in a local minimum, leading to poor performance. There is little difference between the training and test accuracy, suggesting that the network is not overfitting but has high bias, leading to an accuracy of $\textbf{85\%}$.

In [18]:
final_model(trX,trY,teX,teY,2)


<tf.Variable 'W1:0' shape=(3, 3, 1, 8) dtype=float32_ref>
<tf.Variable 'W2:0' shape=(3, 3, 8, 16) dtype=float32_ref>
Cost after epoch 0: 0.098090 and training accuracy: 0.874218
Cost after epoch 10: 0.215620 and training accuracy: 0.851527
Cost after epoch 20: 0.619365 and training accuracy: 0.767764
Cost after epoch 30: 0.734867 and training accuracy: 0.736564
Cost after epoch 40: 0.739156 and training accuracy: 0.759127
Cost after epoch 50: 0.709606 and training accuracy: 0.733673
Cost after epoch 60: 0.686162 and training accuracy: 0.765818
Cost after epoch 70: 0.652271 and training accuracy: 0.812873
Cost after epoch 80: 0.614144 and training accuracy: 0.808764
Cost after epoch 90: 0.574833 and training accuracy: 0.835691
Train Accuracy: 0.84434545
Test Accuracy: 0.8545


# With Three Layers
We see that after epoch 0 the accuracy is low, but it then increases over the epochs, reaching around $\textbf{99\%}$. The cost does not decrease monotonically (it jumps around epoch 30), but it falls steadily afterwards, suggesting that with the given hyperparameters (learning rate, batch size, etc.) the optimization is $\textbf{not getting stuck}$ in a local minimum. There is little difference between the training and test accuracy, suggesting that the network is not overfitting. This network trains properly, gives the best result, and reaches a high accuracy of $\textbf{99\%}$. 

In [19]:
final_model(trX,trY,teX,teY,3)

<tf.Variable 'W1:0' shape=(3, 3, 1, 8) dtype=float32_ref>
<tf.Variable 'W2:0' shape=(3, 3, 8, 16) dtype=float32_ref>
<tf.Variable 'W3:0' shape=(3, 3, 16, 32) dtype=float32_ref>
Cost after epoch 0: 0.146631 and training accuracy: 0.784164
Cost after epoch 10: 0.010162 and training accuracy: 0.999709
Cost after epoch 20: 0.008077 and training accuracy: 0.992055
Cost after epoch 30: 0.122044 and training accuracy: 0.998600
Cost after epoch 40: 0.097594 and training accuracy: 0.993000
Cost after epoch 50: 0.078344 and training accuracy: 0.998255
Cost after epoch 60: 0.063438 and training accuracy: 0.992964
Cost after epoch 70: 0.054084 and training accuracy: 0.992400
Cost after epoch 80: 0.046595 and training accuracy: 0.995345
Cost after epoch 90: 0.040858 and training accuracy: 0.994945
Train Accuracy: 0.9914182
Test Accuracy: 0.9897


# The Best Result
The best result is from Network 3, and the results are summarised below.

$$\text{Train Accuracy} = 99.1\%$$

$$\text{Test Accuracy} = 98.9\%$$