In [1]:
from __future__ import division, print_function, unicode_literals
import numpy as np
import tensorflow as tf

# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

# Tensorflow and Deep Learning

In this lab assignment, first you will learn how to build and train a neural network that recognises handwritten digits, and then you will build LeNet-5 CNN architecture, which is widely used for handwritten digit recognition. At the end of this lab assignment, you will make AlexNet CNN architecture, which won the 2012 ImageNet ILSVRC challenge.

---
# 1. Dataset
In the first part of the assignment, we use the MNIST dataset, which is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents. There are 70,000 images, and each image has 784 features. This is because each image is 28×28=784 pixels, and each feature simply represents one pixel's intensity, from 0 (white) to 255 (black). The following figure shows a few images from the MNIST dataset to give you a feel for the complexity of the classification task.

<img src="figs/1-mnist.png" style="width: 300px;"/>

To begin the assignment, first, use `mnist_data.read_data_sets` and download images and labels. It return two lists, called `mnist.test` with 10K images+labels, and `mnist.train` with 60K images+labels.

In [3]:
# TODO: Replace <FILL IN> with appropriate code

from tensorflow.examples.tutorials.mnist import input_data as mnist_data

mnist = mnist_data.read_data_sets("MNIST_DATA/", one_hot=True)

Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
Instructions for updating:
Please write your own downloading logic.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST_DATA/train-images-idx3-ubyte.gz
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST_DATA/train-labels-idx1-ubyte.gz
Instructions for updating:
Please use tf.one_hot on tensors.
Extracting MNIST_DATA/t10k-images-idx3-ubyte.gz
Extracting MNIST_DATA/t10k-labels-idx1-ubyte.gz
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.


---
# 2. A One-Layer Neural Network
<img src="figs/2-comic1.png" style="width: 500px;"/>

Let's start by building a one-layer neural network. Handwritten digits in the MNIST dataset are 28x28 pixel greyscale images. The simplest approach for classifying them is to use the 28x28=784 pixels as inputs for a **one-layer neural network**. Each neuron in the network does a weighted sum of all of its inputs, adds a bias and then feeds the result through some non-linear activation function. Here we design a one-layer neural network with 10 output neurons since we want to classify digits into 10 classes (0 to 9).
<img src="figs/3-one_layer.png" style="width: 400px;"/>


For a classification problem, an *activation function* that works well is **softmax**. Applying softmax on a vector is done by taking the exponential of each element and then normalising the vector.
<img src="figs/4-softmax.png" style="width: 300px;"/>

We can summarise the behaviour of this single layer of neurons into a simple formula using a *matrix multiply*. If we give input data into the network in *mini-batch* of 100 images, it produces 100 predictions as the output. We define the **weights matrix $W$** with 10 columns, in which each column indicates the weight of a one class (a single digit), from 0 to 9. Using the first column of $W$, we can compute the weighted sum of all the pixels of the first image. This sum corresponds to the first neuron that points to the number 0. Using the second column of $W$, we do the same for the second neuron (number 1) and so on until the 10th neuron. We can then repeat the operation for the remaining 99 images in the mini-batch. If we call $X$ the matrix containing our 100 images (each row corresponds to one digit), all the weighted sums for our 10 neurons, computed on 100 images are simply $X.W$. Each neuron must now add its bias. Since we have 10 neurons, we have 10 bias constants. We finally apply the **softmax activation function** and obtain the formula describing a one-layer neural network, applied to 100 images.
<img src="figs/5-xw.png" style="width: 600px;"/>
<img src="figs/6-softmax2.png" style="width: 500px;"/>

Then, we need to use the **cross-entropy** to measure how good the predictions are, i.e., the distance between what the network tells us and what we know to be the truth. The cross-entropy is a function of weights, biases, pixels of the training image and its known label. If we compute the partial derivatives of the cross-entropy relatively to all the weights and all the biases, we obtain a **gradient**, computed for a given image, label and present value of weights and biases. We can update weights and biases by a fraction of the gradient and do the same thing again using the next batch of training images.
<img src="figs/7-cross_entropy.png" style="width: 600px;"/>

### Define Variables and Placeholders
First we define TensorFlow **variables** and **placeholders**. *Variables* are all the parameters that you want the training algorithm to determine for you (e.g., weights and biases). *Placeholders* are parameters that will be filled with actual data during training (e.g., training images). The shape of the tensor holding the training images is [None, 28, 28, 1] which stands for:
  - 28, 28, 1: our images are 28x28 (784) pixels x 1 value per pixel (grayscale). The last number would be 3 for color images and is not really necessary here.
  - None: this dimension will be the number of images in the mini-batch. It will be known at training time.

We also need an additional placeholder for the training labels that will be provided alongside training images.

In [4]:
# TODO: Replace <FILL IN> with appropriate code

# neural network with 1 layer of 10 softmax neurons
#
# · · · · · · · · · ·       (input data, flattened pixels)       X [batch, 784] 
# \x/x\x/x\x/x\x/x\x/    -- fully connected layer (softmax)      W [784, 10]     b[10]
#   · · · · · · · ·                                              Y_hat [batch, 10]


# input X, place holder for image data: 28x28 grayscale images, 
# the first dimension (None) will index the images in the mini-batch, 
# the last dimension (1) is the number of channels
X = tf.placeholder(tf.float32, [None, 28, 28, 1])

# correct labels will go here
# the last dimension (10) is the number of neurons 
Y = tf.placeholder(tf.float32, shape=(None, 10))

# Weights W[784, 10], 784 = 28 * 28
# Selects random numbers from a normal distribution whose mean or std is close to 0 and values are close to normal distribution with specified mean and standard deviation
# In ML, it is desired to have weights close to 0
W = tf.Variable(tf.truncated_normal([784, 10],stddev=0.1))

# biases b[10]
b = tf.Variable(tf.constant(0.1,shape=[10]))

### Build The Model
Now, we can make a **model** for a one-layer neural network. The formula is the one we explained before, i.e., $\hat{Y} = softmax(X . W + b)$. You can use the `tf.nn.softmax` and `tf.matmul` to build the model. Here, we need to use the `tf.reshape` to transform our 28x28 images into single vectors of 784 pixels.

In [7]:
# flatten the 28*28 images into a single line/vector of 784 pixels
# -1 means no learning rate, images are 28x28 (784) pixels x 1 value per pixel (grayscale) or channels. 
# The last number would be 3 for color images and is not really necessary here.
# What happens if images are of varied dimensions e.g 27 by 30, how to convert them into 1 vector?
flattened_X = tf.reshape(X, [-1,28*28*1])
# The model
Y_hat = tf.nn.softmax(tf.matmul(flattened_X, W) + b)

### Define The Cost Function
Now, we have model predictions $\hat{Y}$ and correct labels $Y$, so for each instance $i$ (image) we can compute the cross-entropy as the **cost function**: $cross\_entropy = -\sum(Y_i * log(\hat{Y}i))$. You can use `reduce_mean` to add all the components in a tensor.

In [13]:
#cross_entropy = tf.reduce_mean(-tf.reduce_sum(Y * tf.log(Y_hat)))
cross_entropy = tf.reduce_mean(tf.losses.softmax_cross_entropy(onehot_labels= Y, logits=Y_hat))
print(cross_entropy)

#Add components to tensor
cross_entropy = tf.reduce_mean(cross_entropy)
print(cross_entropy)

Tensor("Mean_5:0", shape=(), dtype=float32)
Tensor("Mean_6:0", shape=(), dtype=float32)


### Traine the Model
Now, select the gradient descent optimiser `GradientDescentOptimizer` and ask it to minimise the cross-entropy cost. In this step, TensorFlow computes the partial derivatives of the cost function relatively to all the weights and all the biases (the gradient). The gradient is then used to update the weights and biases. Set the learning rate is $0.005$.

In [14]:
# TODO: Replace <FILL IN> with appropriate code

optimizer = tf.train.GradientDescentOptimizer(0.005)
train_step = optimizer.minimize(cross_entropy)

### Execute the Model
It is time to run the training loop. All the TensorFlow instructions up to this point have been preparing a computation graph in memory but nothing has been computed yet. The computation requires actual data to be fed into the placeholders. This is supplied in the form of a Python dictionary, where the keys are the names of the placeholders. During the trainig print out the cost every 200 steps. Moreove, after training the model, print out the accurray of the model by testing it on the test data.

In [17]:
# init
init = tf.global_variables_initializer()
save_accuracy = []
n_epochs = 5000
with tf.Session() as sess:
    sess.run(init)
    #For every iteration i
    for epoch in range(n_epochs):
        epoch_loss = 0
        epoch_imageX_batch, epoch_labelY_batch = mnist.train.next_batch(100)
        #Error: Cannot feed value of shape (200, 784) for Tensor u'Placeholder_18:0', which has shape '(?, 28, 28, 1)
        epoch_imageX_reshapedbatch = np.reshape(epoch_imageX_batch, [-1,28,28,1])
        _, loss = sess.run([train_step,cross_entropy], feed_dict={X: epoch_imageX_reshapedbatch, Y: epoch_labelY_batch})
        epoch_loss = epoch_loss + loss
        
        #print loss after every 
        if epoch % 200 == 0:
            print('Epoch ',epoch, 'has a loss of ', loss)
            
    predictions = tf.equal(tf.argmax(Y, 1), tf.argmax(Y_hat,1))
    accuracy = tf.reduce_mean(tf.cast(predictions, tf.float32))
    epoch_imageX_reshapedTestbatch = np.reshape(mnist.test.images, [-1,28,28,1])
    acc = accuracy.eval(session=sess,feed_dict={X:epoch_imageX_reshapedTestbatch, Y: mnist.test.labels}) * 100
    print(" ")
    print("The Accuracy for the model in One-Layer Neural Network model is: {} % ".format(acc) )


           
    

Epoch  0 has a loss of  2.290168
Epoch  200 has a loss of  2.2961574
Epoch  400 has a loss of  2.2897348
Epoch  600 has a loss of  2.2727816
Epoch  800 has a loss of  2.2614317
Epoch  1000 has a loss of  2.2513862
Epoch  1200 has a loss of  2.2192545
Epoch  1400 has a loss of  2.1954172
Epoch  1600 has a loss of  2.1516714
Epoch  1800 has a loss of  2.1395924
Epoch  2000 has a loss of  2.0917978
Epoch  2200 has a loss of  2.097378
Epoch  2400 has a loss of  2.1936624
Epoch  2600 has a loss of  2.0439634
Epoch  2800 has a loss of  2.076366
Epoch  3000 has a loss of  2.05493
Epoch  3200 has a loss of  2.071522
Epoch  3400 has a loss of  2.0187588
Epoch  3600 has a loss of  2.006071
Epoch  3800 has a loss of  1.9985417
Epoch  4000 has a loss of  2.0448534
Epoch  4200 has a loss of  2.07711
Epoch  4400 has a loss of  2.025236
Epoch  4600 has a loss of  2.0304399
Epoch  4800 has a loss of  2.013119
 
The Accuracy for the model in One-Layer Neural Network model is: 54.04000282287598 % 


<hr>

---
# 2. Add More Layers

<img src="figs/8-comic2.png" style="width: 500px;"/>

Now, let's improve the recognition accuracy by adding more layers to the neural network. The neurons in the second layer, instead of computing weighted sums of pixels will compute weighted sums of neuron outputs from the previous layer. We keep the softmax function as the activation function on the last layer, but on intermediate layers we will use the the **sigmoid** activation function. So, let's build a five-layer fully connected neural network with the following structure, and train the model with the trainging data and print out its accuracy on the test data.
<img src="figs/9-five_layer.png" style="width: 500px;"/>

In [18]:
from __future__ import division, print_function, unicode_literals
from tensorflow.examples.tutorials.mnist import input_data as mnist_data
import numpy as np
import tensorflow as tf
# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)



In [19]:
mnist_ml = mnist_data.read_data_sets("MNIST_DATA/", one_hot=True)

Extracting MNIST_DATA/train-images-idx3-ubyte.gz
Extracting MNIST_DATA/train-labels-idx1-ubyte.gz
Extracting MNIST_DATA/t10k-images-idx3-ubyte.gz
Extracting MNIST_DATA/t10k-labels-idx1-ubyte.gz


Five layers and their number of neurons, i.e., 200, 100, 60, 30, and 10
The Size of the filter/kernel is 5x5; Input channels is 1 (grayscale), 3 for coloured pics 
200 different feature maps meaning 32 different filters are applied on each image.
he output volume size of Conv1 would be 28x28x200). 
Filter / kernel tensor is of shape [filter_height, filter_width, in_channels, out_channels]

In [20]:
# TODO: Replace <FILL IN> with appropriate code

# neural network with five layers
#
# · · · · · · · · · ·          (input data, flattened pixels)       X [batch, 784]   
# \x/x\x/x\x/x\x/x\x/       -- fully connected layer (sigmoid)      W1 [784, 200] B1 [200]

#  · · · · · · · · ·                                                Y1_hat [batch, 200]
#   \x/x\x/x\x/x\x/         -- fully connected layer (sigmoid)      W2 [200, 100]      B2 [100]

#    · · · · · · ·                                                  Y2_hat [batch, 100]
#     \x/x\x/x\x/           -- fully connected layer (sigmoid)      W3 [100, 60]       B3 [60]

#      · · · · ·                                                    Y3_hat [batch, 60]
#       \x/x\x/             -- fully connected layer (sigmoid)      W4 [60, 30]        B4 [30]

#        · · ·                                                      Y4_hat [batch, 30]
#         \x/               -- fully connected layer (softmax)      W5 [30, 10]        B5 [10]
#          ·                                                        Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################

# ML(Multi layer)
X_ML = tf.placeholder(tf.float32, [None, 28, 28, 1])
Y_ML = tf.placeholder(tf.float32, shape=(None, 10))

# First Conv Layer with 28*28 input features and 200 ouput features
# W_Conv1 = tf.Variable(tf.truncated_normal([5, 5, 1, 200],stddev=0.1)) why doesn't work??
# Dimension 1 in both shapes must be equal, but are 28 and 5. 
# Shapes are [?,28] and [5,5]. for 'MatMul_5' (op: 'BatchMatMul') with input shapes: [?,28,28,1], [5,5,1,200].
W_Conv1 = tf.Variable(tf.truncated_normal([784, 200],stddev=0.1))
B_Conv1 = tf.Variable(tf.constant(0.1, shape=[200]))

# Second Conv Layer with 200 input features and 100 ouput features
W_Conv2 = tf.Variable(tf.truncated_normal([200, 100], stddev=0.1))
B_Conv2 = tf.Variable(tf.constant(0.1, shape=[100]))

# Third Conv Layer with 100 input features and 100 ouput features
W_Conv3 = tf.Variable(tf.truncated_normal([100, 60], stddev=0.1))
B_Conv3 = tf.Variable(tf.constant(0.1, shape=[60]))

# Fourth Conv Layer with 60 input features and 30 ouput features
W_Conv4 = tf.Variable(tf.truncated_normal([60, 30], stddev=0.1))
B_Conv4 = tf.Variable(tf.constant(0.1, shape=[30]))

#Fourth Conv Layer with 30 input features and 10 ouput features
W_Conv5 = tf.Variable(tf.truncated_normal([30, 10], stddev=0.1))
B_Conv5 = tf.Variable(tf.constant(0.1, shape=[10])) 




In [21]:
########################################
# build the model
########################################

#Flatten 28*28 images into a single line/vector of 784 pixels, 1 represents grayscale
flattened_X_ML = tf.reshape(X_ML, [-1, 28*28*1])
print ("W_Conv1", W_Conv1)
print ("flattened_X_ML", W_Conv1)
Y_hat_Conv1 = tf.nn.sigmoid(tf.matmul(flattened_X_ML, W_Conv1) + B_Conv1)
Y_hat_Conv2 = tf.nn.sigmoid(tf.matmul(Y_hat_Conv1, W_Conv2) + B_Conv2)
Y_hat_Conv3 = tf.nn.sigmoid(tf.matmul(Y_hat_Conv2, W_Conv3) + B_Conv3)
Y_hat_Conv4 = tf.nn.sigmoid(tf.matmul(Y_hat_Conv3, W_Conv4) + B_Conv4)
Y_hat_ML = tf.nn.softmax(tf.matmul(Y_hat_Conv4, W_Conv5) + B_Conv5)


W_Conv1 <tf.Variable 'Variable:0' shape=(784, 200) dtype=float32_ref>
flattened_X_ML <tf.Variable 'Variable:0' shape=(784, 200) dtype=float32_ref>


In [23]:
########################################
# define the cost function
########################################
#cross_entropy = tf.reduce_mean(-tf.reduce_sum(Y * tf.log(Y_hat)))

cross_entropy_ml = tf.reduce_mean(tf.losses.softmax_cross_entropy(onehot_labels= Y_ML, logits=Y_hat_ML))
print(cross_entropy_ml)

#Add components to tensor
cross_entropy_ml = tf.reduce_mean(cross_entropy_ml)
print(cross_entropy_ml)

########################################
# define the optimizer
########################################
optimizer = tf.train.GradientDescentOptimizer(0.005)
train_step_ml = optimizer.minimize(cross_entropy_ml)

########################################
# execute the model
########################################
init = tf.global_variables_initializer()

n_epochs = 5000
with tf.Session() as sess:
    sess.run(init)
    #For every iteration i
    for epoch in range(n_epochs):
        epoch_loss = 0
        epoch_imageX_batch, epoch_labelY_batch = mnist_ml.train.next_batch(100)
        #Error: Cannot feed value of shape (200, 784) for Tensor u'Placeholder_18:0', which has shape '(?, 28, 28, 1)
        epoch_imageX_reshapedbatch = np.reshape(epoch_imageX_batch, [-1,28,28,1])
        _, loss = sess.run([train_step_ml,cross_entropy_ml], feed_dict={X_ML: epoch_imageX_reshapedbatch, Y_ML: epoch_labelY_batch})
        epoch_loss = epoch_loss + loss
        
        #print loss after every 
        if epoch % 200 == 0:
            print ('Epoch ',epoch, 'has a loss of ', epoch_loss)
            
    predictions = tf.equal(tf.argmax(Y_ML, 1), tf.argmax(Y_hat_ML,1))
    accuracy = tf.reduce_mean(tf.cast(predictions, tf.float32))
    epoch_imageX_reshapedTestbatch = np.reshape(mnist_ml.test.images, [-1,28,28,1])
    acc = accuracy.eval(session=sess,feed_dict={X_ML:epoch_imageX_reshapedTestbatch, Y_ML: mnist_ml.test.labels}) * 100
    print(" ")
    print("The Accuracy for the model in Five-Layer Neural Network model is: {} % ".format(acc) )

Tensor("Mean_1:0", shape=(), dtype=float32)
Tensor("Mean_2:0", shape=(), dtype=float32)
Epoch  0 has a loss of  2.3039727210998535
Epoch  200 has a loss of  2.307284355163574
Epoch  400 has a loss of  2.3060741424560547
Epoch  600 has a loss of  2.306739568710327
Epoch  800 has a loss of  2.298088312149048
Epoch  1000 has a loss of  2.298781156539917
Epoch  1200 has a loss of  2.301783561706543
Epoch  1400 has a loss of  2.3015966415405273
Epoch  1600 has a loss of  2.297822952270508
Epoch  1800 has a loss of  2.302208185195923
Epoch  2000 has a loss of  2.3043437004089355
Epoch  2200 has a loss of  2.3037538528442383
Epoch  2400 has a loss of  2.3015990257263184
Epoch  2600 has a loss of  2.304655075073242
Epoch  2800 has a loss of  2.3003900051116943
Epoch  3000 has a loss of  2.299161911010742
Epoch  3200 has a loss of  2.3031258583068848
Epoch  3400 has a loss of  2.3017220497131348
Epoch  3600 has a loss of  2.3023767471313477
Epoch  3800 has a loss of  2.303291082382202
Epoch  40

---
# 3. Special Care for Deep Networks
As layers were added, neural networks tended to converge with more difficulties. For example, the accuracy could stuck at 0.1. Here, we want to apply some updates to the network we built in the previous part to improve its performance. 

### ReLU Activation Function
<img src="figs/10-comic3.png" style="width: 500px;"/>
The sigmoid activation function is actually quite problematic in deep networks. It squashes all values between 0 and 1 and when you do so repeatedly, neuron outputs and their gradients can vanish entirely. An alternative activation function is **ReLU** that shows better performance compare to sigmoid. It looks like as below:
<img src="figs/11-relu.png" style="width: 300px;"/>

### A Better Optimizer
In very high dimensional spaces like here, **saddle points** are frequent. These are points that are not local minima, but where the gradient is nevertheless zero and the gradient descent optimizer stays stuck there. One possible solution to tackle this probelm is to use better optimizers, such as Adam optimizer `tf.train.AdamOptimizer`.

### Random Initialisations
When working with ReLUs, the best practice is to initialise bias values to small positive values, so that neurons operate in the non-zero range of the ReLU initially.

### Learning Rate
<img src="figs/12-comic4.png" style="width: 500px;"/>
With two, three or four intermediate layers, you can now get close to 98% accuracy, if you push the iterations to 5000 or beyond. But, the results are not very consistent, and the curves jump up and down by a whole percent. A good solution is to start fast and decay the learning rate exponentially from $0.005$ to $0.0001$ for example. In order to pass a different learning rate to the `AdamOptimizer` at each iteration, you will need to define a new placeholder and feed it a new value at each iteration through `feed_dict`. Here is the formula for exponential decay: $learning\_rate = lr\_min + (lr\_max - lr\_min) * e^{\frac{-i}{2000}}$, where $i$ is the iteration number.

### NaN?
In the network you built in the last section, you might see accuracy curve crashes and the console outputs NaN for the cross-entropy. It may happen, because you are attempting to compute a $log(0)$, which is indeed Not A Number (NaN). Remember that the cross-entropy involves a log, computed on the output of the softmax layer. Since softmax is essentially an exponential, which is never zero, we should be fine, but with 32 bit precision floating-point operations, exp(-100) is already a genuine zero. TensorFlow has a handy function that computes the softmax and the cross-entropy in a single step, implemented in a numerically stable way. To use it, you will need to separate the weighted sum plus bias on the last layer, before softmax is applied and then give it with the true values to the function `tf.nn.softmax_cross_entropy_with_logits`.

In the code below, apply the following changes and show their impact on the accuracy of the model on training data, as well as the test data:
* Replace the sigmoid activation function with ReLU
* Use the Adam optimizer
* Initialize weights with small random values between -0.2 and +0.2, and make sure biases are initialised with small positive values, for example 0.1
* Update the learning rate in different iterations. Start fast and decay the learning rate exponentially from $0.005$ to $0.0001$, i.e., 
```
max_learning_rate = 0.005
min_learning_rate = 0.0001
decay_speed = 2000.0
```
* Use `tf.nn.softmax_cross_entropy_with_logits` to prevent getting NaN in output.

In [44]:
# TODO: Replace <FILL IN> with appropriate code

# neural network with 5 layers
#
# · · · · · · · · · ·          (input data, flattened pixels)       X [batch, 784]   
# \x/x\x/x\x/x\x/x\x/       -- fully connected layer (sigmoid)      W1 [784, 200]      B1[200]
#  · · · · · · · · ·                                                Y1_hat [batch, 200]
#   \x/x\x/x\x/x\x/         -- fully connected layer (sigmoid)      W2 [200, 100]      B2[100]
#    · · · · · · ·                                                  Y2_hat [batch, 100]
#     \x/x\x/x\x/           -- fully connected layer (sigmoid)      W3 [100, 60]       B3[60]
#      · · · · ·                                                    Y3_hat [batch, 60]
#       \x/x\x/             -- fully connected layer (sigmoid)      W4 [60, 30]        B4[30]
#        · · ·                                                      Y4_hat [batch, 30]
#         \x/               -- fully connected layer (softmax)      W5 [30, 10]        B5[10]
#          ·                                                        Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################
X_SC = tf.placeholder(tf.float32, [None, 28, 28, 1])
Y_SC = tf.placeholder(tf.float32, shape=(None, 10))



# five layers and their number of neurons, i.e., 200, 100, 60, 30, and 10
# when using RELUs, make sure biases are initialised with small positive values, for example 0.1
W_Conv1 = tf.Variable(tf.truncated_normal([784, 200],stddev=0.1))
B_Conv1 = tf.Variable(tf.constant(0.1, shape=[200]))

# Second Conv Layer with 200 input features and 100 ouput features
W_Conv2 = tf.Variable(tf.truncated_normal([200, 100], stddev=0.1))
B_Conv2 = tf.Variable(tf.constant(0.1, shape=[100]))

# Third Conv Layer with 100 input features and 100 ouput features
W_Conv3 = tf.Variable(tf.truncated_normal([100, 60], stddev=0.1))
B_Conv3 = tf.Variable(tf.constant(0.1, shape=[60]))

# Fourth Conv Layer with 60 input features and 30 ouput features
W_Conv4 = tf.Variable(tf.truncated_normal([60, 30], stddev=0.1))
B_Conv4 = tf.Variable(tf.constant(0.1, shape=[30]))

#Fourth Conv Layer with 30 input features and 10 ouput features
W_Conv5 = tf.Variable(tf.truncated_normal([30, 10], stddev=0.1))
B_Conv5 = tf.Variable(tf.constant(0.1, shape=[10])) 

########################################
# build the model
########################################

flattened_X_SC = tf.reshape(X_SC, [-1, 28*28*1])
## ReLu goes through all outputs in Conv layer, wherever a negative number occurs, we swap it out for a 0
Y_hat_Conv1 = tf.nn.relu(tf.matmul(flattened_X_SC, W_Conv1) + B_Conv1)
Y_hat_Conv2 = tf.nn.relu(tf.matmul(Y_hat_Conv1, W_Conv2) + B_Conv2)
Y_hat_Conv3 = tf.nn.relu(tf.matmul(Y_hat_Conv2, W_Conv3) + B_Conv3)
Y_hat_Conv4 = tf.nn.relu(tf.matmul(Y_hat_Conv3, W_Conv4) + B_Conv4)
Y_hat_SC = tf.nn.softmax(tf.matmul(Y_hat_Conv4, W_Conv5) + B_Conv5)

########################################
# defining the cost function
########################################
cross_entropy_sc = tf.nn.softmax_cross_entropy_with_logits(logits=Y_hat_SC, labels=Y_SC)
cross_entropy_sc = tf.reduce_mean(cross_entropy_sc) * 100

########################################
# define the optimizer
########################################
decay_rate = 2000.0 #0.96
#global_steps = 1000
min_learning_rate = 0.0001
max_learning_rate = 0.005 #decays exponentially at every training step
decay_steps = 100
current_global_step = tf.Variable(tf.constant(0))
# Variable learning rate
learning_rate = tf.placeholder(tf.float32)
optimizer = tf.train.AdamOptimizer(learning_rate)
train_step_sc = optimizer.minimize(cross_entropy_sc)

########################################
# execute the model
########################################
init = tf.global_variables_initializer()
not_minimum_learning_rate = True
applied_learning_rate= max_learning_rate
n_epochs = 5000
import math
with tf.Session() as sess:
    sess.run(init)
    #For every iteration i
    for epoch in range(n_epochs):
        epoch_loss = 0
        epoch_imageX_batch, epoch_labelY_batch = mnist_ml.train.next_batch(100)
        epoch_imageX_reshapedbatch = np.reshape(epoch_imageX_batch, [-1,28,28,1])
        _, loss = sess.run([train_step_sc,cross_entropy_sc], feed_dict={X_SC: epoch_imageX_reshapedbatch, 
                                                                             Y_SC: epoch_labelY_batch, 
                                                                             learning_rate: applied_learning_rate}) 
        epoch_loss = epoch_loss + loss
        if not_minimum_learning_rate:
            applied_learning_rate = min_learning_rate + (max_learning_rate - min_learning_rate) * math.exp(-(epoch / 2000)) #Pattern Recognition and Computer Vision: First Chinese Conference pg 401
            if applied_learning_rate == min_learning_rate:
                not_minimum_learning_rate = False         
        #print loss after every 200 epochs
        if epoch % 200 == 0:
            print ('Epoch ',epoch, 'has a loss of ', epoch_loss)
            
    predictions = tf.equal(tf.argmax(Y_SC, 1), tf.argmax(Y_hat_SC,1))
    accuracy = tf.reduce_mean(tf.cast(predictions, tf.float32))
    epoch_imageX_reshapedTestbatch = np.reshape(mnist_ml.test.images, [-1,28,28,1])
    acc = accuracy.eval(session=sess,feed_dict={X_SC:epoch_imageX_reshapedTestbatch, Y_SC: mnist_ml.test.labels}) * 100
    print(" ")
    print("The Accuracy for an fine tuned model in Five-Layer Neural Network model is: {} % ".format(acc))

Epoch  0 has a loss of  229.76202392578125
Epoch  200 has a loss of  163.44586181640625
Epoch  400 has a loss of  160.47909545898438
Epoch  600 has a loss of  160.53448486328125
Epoch  800 has a loss of  157.03350830078125
Epoch  1000 has a loss of  159.6430206298828
Epoch  1200 has a loss of  164.748046875
Epoch  1400 has a loss of  159.29676818847656
Epoch  1600 has a loss of  160.05653381347656
Epoch  1800 has a loss of  159.0186004638672
Epoch  2000 has a loss of  165.7208709716797
Epoch  2200 has a loss of  157.09127807617188
Epoch  2400 has a loss of  155.005126953125
Epoch  2600 has a loss of  163.22683715820312
Epoch  2800 has a loss of  151.09738159179688
Epoch  3000 has a loss of  148.61399841308594
Epoch  3200 has a loss of  150.1233673095703
Epoch  3400 has a loss of  149.11695861816406
Epoch  3600 has a loss of  146.13360595703125
Epoch  3800 has a loss of  147.90016174316406
Epoch  4000 has a loss of  147.92959594726562
Epoch  4200 has a loss of  148.68624877929688
Epoch 

<hr>
                                           
#### Tensorflow Inbult function gives bad performance when used with AdamOptimizer
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,100000, 0.96, staircase=True)
                                           
 global_step = tf.Variable(0, trainable=False)
 starter_learning_rate = 0.1
 learning_rate = tf.train.exponential_decay(starter_learning_rate,
                                             global_step,
                                             100, 0.9, staircase=True)
  optimizer = tf.train.AdamOptimizer(learning_rate)

Epoch  0 has a loss of  233.9381866455078
Epoch  200 has a loss of  232.11508178710938
Epoch  400 has a loss of  238.1150665283203
Epoch  600 has a loss of  242.11509704589844
Epoch  800 has a loss of  240.1150665283203
Epoch  1000 has a loss of  236.11508178710938
Epoch  1200 has a loss of  240.1150665283203
Epoch  1400 has a loss of  nan
Epoch  1600 has a loss of  nan
Epoch  1800 has a loss of  nan
Epoch  2000 has a loss of  nan
Epoch  2200 has a loss of  nan
Epoch  2400 has a loss of  nan
Epoch  2600 has a loss of  nan
Epoch  2800 has a loss of  nan
Epoch  3000 has a loss of  nan
Epoch  3200 has a loss of  nan
Epoch  3400 has a loss of  nan
Epoch  3600 has a loss of  nan
Epoch  3800 has a loss of  nan
Epoch  4000 has a loss of  nan
Epoch  4200 has a loss of  nan
Epoch  4400 has a loss of  nan
Epoch  4600 has a loss of  nan
Epoch  4800 has a loss of  nan
 
The Accuracy for an fine tuned model in Five-Layer Neural Network model is: 9.799999743700027 % 
<hr>

---
# 4. Overfitting and Dropout
<img src="figs/13-comic5.png" style="width: 500px;"/>
You will have noticed that cross-entropy curves for test and training data start disconnecting after a couple thousand iterations. The learning algorithm works on training data only and optimises the training cross-entropy accordingly. It never sees test data so it is not surprising that after a while its work no longer has an effect on the test cross-entropy which stops dropping and sometimes even bounces back up. 
<img src="figs/14-overfit.png" style="width: 500px;"/>
This disconnect is usually labeled **overfitting** and when you see it, you can try to apply a regularisation technique called **dropout**. In dropout, at each training iteration, you drop random neurons from the network. You choose a probability `pkeep` for a neuron to be kept, usually between 50% and 75%, and then at each iteration of the training loop, you randomly remove neurons with all their weights and biases. Different neurons will be dropped at each iteration. When testing the performance of your network of course you put all the neurons back (`pkeep = 1`).
<img src="figs/15-dropout.png" style="width: 500px;"/>
TensorFlow offers a dropout function to be used on the outputs of a layer of neurons. It randomly zeroes-out some of the outputs and boosts the remaining ones by `1 / pkeep`. You can add dropout after each intermediate layer in the network now. 

In the following code, use the dropout between each layer during the training, and set the probability `pkeep` once to $50%$ and another time to $75%$ and compare their results.

In [2]:
from tensorflow.examples.tutorials.mnist import input_data as mnist_data
mnist_do = mnist_data.read_data_sets("MNIST_DATA/", one_hot=True)

Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
Instructions for updating:
Please write your own downloading logic.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST_DATA/train-images-idx3-ubyte.gz
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST_DATA/train-labels-idx1-ubyte.gz
Instructions for updating:
Please use tf.one_hot on tensors.
Extracting MNIST_DATA/t10k-images-idx3-ubyte.gz
Extracting MNIST_DATA/t10k-labels-idx1-ubyte.gz
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.


In [4]:
# TODO: Replace <FILL IN> with appropriate code

# neural network with 5 layers
#
# · · · · · · · · · ·          (input data, flattened pixels)       X [batch, 784]   
# \x/x\x/x\x/x\x/x\x/       -- fully connected layer (sigmoid)      W1 [784, 200]      B1[200]
#  · · · · · · · · ·                                                Y1_hat [batch, 200]
#   \x/x\x/x\x/x\x/         -- fully connected layer (sigmoid)      W2 [200, 100]      B2[100]
#    · · · · · · ·                                                  Y2_hat [batch, 100]
#     \x/x\x/x\x/           -- fully connected layer (sigmoid)      W3 [100, 60]       B3[60]
#      · · · · ·                                                    Y3_hat [batch, 60]
#       \x/x\x/             -- fully connected layer (sigmoid)      W4 [60, 30]        B4[30]
#        · · ·                                                      Y4_hat [batch, 30]
#         \x/               -- fully connected layer (softmax)      W5 [30, 10]        B5[10]
#          ·                                                        Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################
X_DO = tf.placeholder(tf.float32, [None, 28, 28, 1])
Y_DO = tf.placeholder(tf.float32, shape=(None, 10))

# variable learning rate
learning_rate_do = tf.placeholder(tf.float32)

# probability of keeping a node during dropout = 1.0 at test time (no dropout) and 0.75 at training time
pkeep = tf.placeholder(tf.float32, shape=[])

# five layers and their number of neurons, i.e., 200, 100, 60, 30, and 10
# when using RELUs, make sure biases are initialised with small positive values, for example 0.1
W_Conv1 = tf.Variable(tf.truncated_normal([784, 200],stddev=0.1))
B_Conv1 = tf.Variable(tf.constant(0.1, shape=[200]))

# Second Conv Layer with 200 input features and 100 ouput features
W_Conv2 = tf.Variable(tf.truncated_normal([200, 100], stddev=0.1))
B_Conv2 = tf.Variable(tf.constant(0.1, shape=[100]))

# Third Conv Layer with 100 input features and 100 ouput features
W_Conv3 = tf.Variable(tf.truncated_normal([100, 60], stddev=0.1))
B_Conv3 = tf.Variable(tf.constant(0.1, shape=[60]))

# Fourth Conv Layer with 60 input features and 30 ouput features
W_Conv4 = tf.Variable(tf.truncated_normal([60, 30], stddev=0.1))
B_Conv4 = tf.Variable(tf.constant(0.1, shape=[30]))

#Fourth Conv Layer with 30 input features and 10 ouput features
W_Conv5 = tf.Variable(tf.truncated_normal([30, 10], stddev=0.1))
B_Conv5 = tf.Variable(tf.constant(0.1, shape=[10])) 

########################################
# build the model
########################################
flattened_X_DO = tf.reshape(X_DO, [-1,28*28*1])

Y_hat_Conv1 = tf.nn.relu(tf.matmul(flattened_X_DO, W_Conv1) + B_Conv1)
Y1_hat_Conv1_dropout = tf.nn.dropout(Y_hat_Conv1, keep_prob=pkeep)

Y_hat_Conv2 = tf.nn.relu(tf.matmul(Y1_hat_Conv1_dropout, W_Conv2) + B_Conv2)
Y_hat_Conv2_dropout = tf.nn.dropout(Y_hat_Conv2, keep_prob=pkeep)

Y_hat_Conv3 = tf.nn.relu(tf.matmul(Y_hat_Conv2_dropout, W_Conv3) + B_Conv3)
Y_hat_Conv3_dropout = tf.nn.dropout(Y_hat_Conv3, keep_prob=pkeep)

Y_hat_Conv4 = tf.nn.relu(tf.matmul(Y_hat_Conv3_dropout, W_Conv4) + B_Conv4)
Y_hat_Conv4_dropout = tf.nn.dropout(Y_hat_Conv4, keep_prob=pkeep)

Y_hat_DO = tf.nn.softmax(tf.matmul(Y_hat_Conv4_dropout, W_Conv5) + B_Conv5)


########################################
# define the cost function
########################################
cross_entropy_do = tf.nn.softmax_cross_entropy_with_logits(logits=Y_hat_DO, labels=Y_DO)
cross_entropy_do = tf.reduce_mean(cross_entropy_do) * 100

########################################
# define the optimizer
########################################
decay_speed = 2000
min_learning_rate = 0.0001
max_learning_rate = 0.005 #decays exponentially at every training step
current_global_step = tf.Variable(tf.constant(0))

optimizer = tf.train.AdamOptimizer(learning_rate_do)
train_step_do = optimizer.minimize(cross_entropy_do)

########################################
# execute the model
########################################
init = tf.global_variables_initializer()
dropout_rates = [0.5,0.7]
def execute_model(dropout_rate):  
    not_minimum_learning_rate = True
    applied_learning_rate = max_learning_rate
    print(" DropOut Rate ",dropout_rate)
    n_epochs = 5000
    import math
    with tf.Session() as sess:
        sess.run(init)
        #For every iteration i
        for epoch in range(n_epochs):
            epoch_loss = 0
            epoch_imageX_batch, epoch_labelY_batch = mnist_do.train.next_batch(100)
            epoch_imageX_reshapedbatch = np.reshape(epoch_imageX_batch, [-1,28,28,1])

            _, loss = sess.run([train_step_do,cross_entropy_do], feed_dict={X_DO: epoch_imageX_reshapedbatch, 
                                                                            Y_DO: epoch_labelY_batch, 
                                                                            learning_rate_do: applied_learning_rate,
                                                                            pkeep: dropout_rate}) 

            epoch_loss = epoch_loss + loss
            if not_minimum_learning_rate:
                applied_learning_rate = min_learning_rate + (max_learning_rate - min_learning_rate) * math.exp(-(epoch / decay_speed)) #Pattern Recognition and Computer Vision: First Chinese Conference pg 401
                if applied_learning_rate == min_learning_rate:
                    not_minimum_learning_rate = False         
            #print loss after every 200 epochs
            if epoch % 200 == 0:
                print ('Epoch ',epoch, 'has a loss of ', epoch_loss)

        predictions = tf.equal(tf.argmax(Y_DO, 1), tf.argmax(Y_hat_DO,1))
        accuracy = tf.reduce_mean(tf.cast(predictions, tf.float32))
        epoch_imageX_reshapedTestbatch = np.reshape(mnist_do.test.images, [-1,28,28,1])
        acc = accuracy.eval(session=sess,feed_dict={X_DO:epoch_imageX_reshapedTestbatch, 
                                                    Y_DO: mnist_do.test.labels, 
                                                    learning_rate_do: applied_learning_rate,
                                                    pkeep: dropout_rate }) * 100

        return acc
perfomance_results = [execute_model(dropout_rate) for dropout_rate in dropout_rates]
print(" ")
print("The Accuracy for 5-Layer Neural Network Model with Dropout is: {} % ".format(perfomance_results))   

 DropOut Rate  0.5
Epoch  0 has a loss of  230.99819946289062
Epoch  200 has a loss of  177.94859313964844
Epoch  400 has a loss of  177.68589782714844
Epoch  600 has a loss of  171.83956909179688
Epoch  800 has a loss of  171.4212646484375
Epoch  1000 has a loss of  171.4695281982422
Epoch  1200 has a loss of  166.45745849609375
Epoch  1400 has a loss of  165.58737182617188
Epoch  1600 has a loss of  160.1772918701172
Epoch  1800 has a loss of  162.64515686035156
Epoch  2000 has a loss of  161.71612548828125
Epoch  2200 has a loss of  163.24974060058594
Epoch  2400 has a loss of  158.32669067382812
Epoch  2600 has a loss of  162.86239624023438
Epoch  2800 has a loss of  159.2679901123047
Epoch  3000 has a loss of  154.70919799804688
Epoch  3200 has a loss of  161.9462432861328
Epoch  3400 has a loss of  160.61312866210938
Epoch  3600 has a loss of  160.8137969970703
Epoch  3800 has a loss of  161.47116088867188
Epoch  4000 has a loss of  156.34494018554688
Epoch  4200 has a loss of  1

---
# 6. Convolutional Network
<img src="figs/16-comic6.png" style="width: 500px;"/>
In the previous sections, all pixels of images flattened into a single vector, which was a really bad idea. Handwritten digits are made of shapes and we discarded the shape information when we flattened the pixels. However, we can use **convolutional neural networks (CNN)** to take advantage of shape information. CNNs apply *a series of filters* to the raw pixel data of an image to extract and learn higher-level features, which the model can then use for classification. CNNs contains three components:
  - **Convolutional layers**: apply a specified number of convolution filters to the image. For each subregion, the layer performs a set of mathematical operations to produce a single value in the output feature map. Convolutional layers then typically apply a ReLU activation function to the output to introduce nonlinearities into the model.
  - **Pooling layers**: downsample the image data extracted by the convolutional layers to reduce the dimensionality of the feature map in order to decrease processing time. A commonly used pooling algorithm is max pooling, which extracts subregions of the feature map (e.g., 2x2-pixel tiles), keeps their maximum value, and discards all other values.
  - **Dense (fully connected) layers**: perform classification on the features extracted by the convolutional layers and downsampled by the pooling layers. In a dense layer, every node in the layer is connected to every node in the preceding layer.
  
Typically, a CNN is composed of a *stack of **convolutional modules*** that perform feature extraction. Each *module* consists of a *convolutional layer* followed by a *pooling layer*. The last convolutional module is followed by one or more dense layers that perform classification. The final dense layer in a CNN contains a single neuron for each target class in the model, with a softmax activation function to generate a value between 0-1 for each neuron. We can interpret the softmax values for a given image as relative measurements of how likely it is that the image falls into each target class.

Now, let us build a convolutional network for handwritten digit recognition. In this assignment, we will use the architecture shown in the following figure that has three convolutional layers, one fully-connected layer, and one softmax layer. Notice that the second and third convolutional layers have a stride of two that explains why they bring the number of output values down from 28x28 to 14x14 and then 7x7. A convolutional layer requires a weights tensor like `[4, 4, 3, 2]`, in which the first two numbers define the size of a filter (map), the third number shows the *depth* of the filter that is the number of *input channel*, and the last number shows the number of *output channel*. The output channel defines the number of times that we repeat the same thing with a different set of weights in one layer. In our implementation, we assume the output depth of first three convolutional layers, are 4, 8, 12, and the size of fully connected layer is 200.
<img src="figs/17-arch1.png" style="width: 600px;"/>

Convolutional layers can be implemented in TensorFlow using the `tf.nn.conv2d` function, which performs the scanning of the input image in both directions using the supplied weights. This is only the weighted sum part of the neuron. You still need to add a bias and feed the result through an activation function. The padding strategy that works here is to copy pixels from the sides of the image. All digits are on a uniform background so this just extends the background and should not add any unwanted shapes.

In [3]:
from __future__ import division, print_function, unicode_literals
from tensorflow.examples.tutorials.mnist import input_data as mnist_data
import numpy as np
import tensorflow as tf
# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)
mnist_cnn = mnist_data.read_data_sets("MNIST_DATA/", one_hot=True)


Extracting MNIST_DATA/train-images-idx3-ubyte.gz
Extracting MNIST_DATA/train-labels-idx1-ubyte.gz
Extracting MNIST_DATA/t10k-images-idx3-ubyte.gz
Extracting MNIST_DATA/t10k-labels-idx1-ubyte.gz


In [10]:
# TODO: Replace <FILL IN> with appropriate code

# · · · · · · · · · ·      (input data, 1-deep)               X [batch, 28, 28, 1]
# @ @ @ @ @ @ @ @ @ @   -- conv. layer 5x5x1=>4 stride 1      W1 [5, 5, 1, 4]        B1 [4]
# ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                         Y1_hat [batch, 28, 28, 4]
#   @ @ @ @ @ @ @ @     -- conv. layer 5x5x4=>8 stride 2      W2 [5, 5, 4, 8]        B2 [8]
#   ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                           Y2_hat [batch, 14, 14, 8]
#     @ @ @ @ @ @       -- conv. layer 4x4x8=>12 stride 2     W3 [4, 4, 8, 12]       B3 [12]
#     ∶∶∶∶∶∶∶∶∶∶∶                                             Y3_hat [batch, 7, 7, 12] => reshaped to YY [batch, 7*7*12]
#      \x/x\x\x/        -- fully connected layer (relu)       W4 [7*7*12, 200]       B4 [200]
#       · · · ·                                               Y4_hat [batch, 200]
#       \x/x\x/         -- fully connected layer (softmax)    W5 [200, 10]           B5 [10]
#        · · ·                                                Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################
# ML(Multi layer)
X_CNN = tf.placeholder(tf.float32, [None, 28, 28, 1])
Y_CNN = tf.placeholder(tf.float32, shape=(None, 10))
learning_rate_cnn = tf.placeholder(tf.float32)

# three convolutional layers with their channel counts, and a fully connected layer 
# (the last layer has 10 softmax neurons)
# the output depth of first three convolutional layers, are 4, 8, 12, and the size of fully connected
# layer is 200
W_Conv1 = tf.Variable(tf.truncated_normal([5, 5, 1, 4],stddev=0.1))
B_Conv1 = tf.Variable(tf.constant(0.1, shape=[4]))

W_Conv2 = tf.Variable(tf.truncated_normal([5, 5, 4, 8],stddev=0.1))
B_Conv2 = tf.Variable(tf.constant(0.1, shape=[8]))

W_Conv3 = tf.Variable(tf.truncated_normal([4, 4, 8, 12],stddev=0.1))
B_Conv3 = tf.Variable(tf.constant(0.1, shape=[12]))

W_Conv4 = tf.Variable(tf.truncated_normal([7*7*12, 200],stddev=0.1))
B_Conv4 = tf.Variable(tf.constant(0.1, shape=[200]))

W_Conv5 = tf.Variable(tf.truncated_normal([200, 10],stddev=0.1))
B_Conv5 = tf.Variable(tf.constant(0.1, shape=[10]))


########################################
# build the model
########################################
# Shape of input X, [batch, in_height, in_width, in_channels] ==> X = [batch_size,28 ,28, 1]
# Shape of filter / kernel, [filter_height, filter_width, in_channels, out_channels] ==> W = [5, 5, 1, 32]
# Shape of stride,  [batch, height, width, channels] ==> stride for 4D tensor = [1, 1, 1, 1], stride for 2D tensor [height, width] 
stride = 1  # output is 28x28

Y_hat_Conv1 = tf.nn.relu(tf.nn.conv2d(X_CNN, W_Conv1, strides=[1, stride, stride, 1], padding='SAME') + B_Conv1 ) # [batch, height, width, channels]

stride_conv2 = 2 # output is 14x14
Y_hat_Conv2 = tf.nn.relu(tf.nn.conv2d(Y_hat_Conv1, W_Conv2, strides=[1, stride_conv2, stride_conv2, 1], padding='SAME') + B_Conv2 )

stride_conv3 = 2  # output is 7x7
Y_hat_Conv3 = tf.nn.relu(tf.nn.conv2d(Y_hat_Conv2, W_Conv3, strides=[1, stride_conv3, stride_conv3, 1], padding='SAME') + B_Conv3 )

# reshape the output from the third convolution for the fully connected layer
flattened_matrix_Conv3= tf.reshape(Y_hat_Conv3, [-1, 7*7*12])
Y_hat_Conv4 = tf.nn.relu(tf.matmul(flattened_matrix_Conv3, W_Conv4) + B_Conv4)
Y_hat_CNN = tf.nn.softmax(tf.matmul(Y_hat_Conv4, W_Conv5) + B_Conv5)


########################################
# define the cost function
########################################

cross_entropy_cnn = tf.nn.softmax_cross_entropy_with_logits(logits=Y_hat_CNN, labels=Y_CNN)
cross_entropy_cnn = tf.reduce_mean(cross_entropy_cnn) * 100

########################################
# define the optmizer
########################################
optimizer = tf.train.AdamOptimizer(learning_rate_cnn)
train_step_cnn = optimizer.minimize(cross_entropy_cnn)

########################################
# execute the model
########################################
init = tf.global_variables_initializer()
not_minimum_learning_rate = True
max_learning_rate = 0.005
min_learning_rate = 0.0001
applied_learning_rate = max_learning_rate
import math
n_epochs = 5000
with tf.Session() as sess:
    sess.run(init)
    #For every iteration i
    for epoch in range(n_epochs):
        epoch_loss = 0
        epoch_imageX_batch, epoch_labelY_batch = mnist_cnn.train.next_batch(100)
        epoch_imageX_reshapedbatch = np.reshape(epoch_imageX_batch, [-1,28,28,1])
        _, loss = sess.run([train_step_cnn,cross_entropy_cnn], feed_dict={X_CNN: epoch_imageX_reshapedbatch, 
                                                                             Y_CNN: epoch_labelY_batch, 
                                                                             learning_rate_cnn: applied_learning_rate}) 
        epoch_loss = epoch_loss + loss
        if not_minimum_learning_rate:
            applied_learning_rate = min_learning_rate + (max_learning_rate - min_learning_rate) * math.exp(-(epoch / 2000)) #Pattern Recognition and Computer Vision: First Chinese Conference pg 401
            if applied_learning_rate == min_learning_rate:
                not_minimum_learning_rate = False         
        #print loss after every 200 epochs
        if epoch % 200 == 0:
            print ('Epoch ',epoch, 'has a loss of ', epoch_loss)
            
    predictions = tf.equal(tf.argmax(Y_CNN, 1), tf.argmax(Y_hat_CNN,1))
    accuracy = tf.reduce_mean(tf.cast(predictions, tf.float32))
    epoch_imageX_reshapedTestbatch = np.reshape(mnist_cnn.test.images, [-1,28,28,1])
    acc = accuracy.eval(session=sess,feed_dict={X_CNN:epoch_imageX_reshapedTestbatch, Y_CNN: mnist_cnn.test.labels}) * 100
    print(" end")
    print("The Accuracy for an fine tuned model in Five-Layer CNN Neural Network model is: {} % ".format(acc)) 

Epoch  0 has a loss of  230.14256286621094
Epoch  200 has a loss of  172.47821044921875
Epoch  400 has a loss of  169.77420043945312
Epoch  600 has a loss of  160.7787322998047
Epoch  800 has a loss of  157.1842803955078
Epoch  1000 has a loss of  151.1025390625
Epoch  1200 has a loss of  148.56292724609375
Epoch  1400 has a loss of  150.09075927734375
Epoch  1600 has a loss of  155.11459350585938
Epoch  1800 has a loss of  150.1013641357422
Epoch  2000 has a loss of  149.15672302246094
Epoch  2200 has a loss of  147.86929321289062
Epoch  2400 has a loss of  153.7568817138672
Epoch  2600 has a loss of  146.11526489257812
Epoch  2800 has a loss of  149.02096557617188
Epoch  3000 has a loss of  147.13235473632812
Epoch  3200 has a loss of  147.11509704589844
Epoch  3400 has a loss of  149.1150665283203
Epoch  3600 has a loss of  146.22799682617188
Epoch  3800 has a loss of  147.1150665283203
Epoch  4000 has a loss of  146.1150665283203
Epoch  4200 has a loss of  149.5115966796875
Epoch  

# 7. Improve The Performance
A good approach to sizing your neural networks is to implement a network that is a little too constrained, then give it a bit more degrees of freedom and add dropout to make sure it is not overfitting. This ends up with a fairly optimal network for your problem. In the above model, we set the output channel to 4 in the first convolutional layer, which means that we repeat the same filter shape (but with different weights) four times. If we assume that those filters evolve during training into shape recognisers, you can intuitively see that this might not be enough for our problem. Handwritten digits are made from more than 4 elemental shapes. So let us bump up the filter sizes a little, and also increase the number of filters in our convolutional layers from 4, 8, 12 to 6, 12, 24 and then add dropout on the fully-connected layer. The following figure shows the new architecture you should build. Please complete the following code based on the given architecture and dropout technique.
<img src="figs/18-arch2.png" style="width: 600px;"/>

In [3]:
from __future__ import division, print_function, unicode_literals
from tensorflow.examples.tutorials.mnist import input_data as mnist_data
import numpy as np
import tensorflow as tf
# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)
mnist_cnn = mnist_data.read_data_sets("MNIST_DATA/", one_hot=True)

Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
Instructions for updating:
Please write your own downloading logic.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST_DATA/train-images-idx3-ubyte.gz
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST_DATA/train-labels-idx1-ubyte.gz
Instructions for updating:
Please use tf.one_hot on tensors.
Extracting MNIST_DATA/t10k-images-idx3-ubyte.gz
Extracting MNIST_DATA/t10k-labels-idx1-ubyte.gz
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.


In [15]:
# TODO: Replace <FILL IN> with appropriate code

# · · · · · · · · · ·    (input data, 1-deep)                 X [batch, 28, 28, 1]
# @ @ @ @ @ @ @ @ @ @ -- conv. layer 6x6x1=>6 stride 1        W1 [5, 5, 1, 6]        B1 [6]
# ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                         Y1_hat [batch, 28, 28, 6]
#   @ @ @ @ @ @ @ @   -- conv. layer 5x5x6=>12 stride 2       W2 [5, 5, 6, 12]        B2 [12]
#   ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                           Y2_hat [batch, 14, 14, 12]
#     @ @ @ @ @ @     -- conv. layer 4x4x12=>24 stride 2      W3 [4, 4, 12, 24]       B3 [24]
#     ∶∶∶∶∶∶∶∶∶∶∶                                             Y3_hat [batch, 7, 7, 24] => reshaped to YY [batch, 7*7*24]
#      \x/x\x\x/ ✞    -- fully connected layer (relu+dropout) W4 [7*7*24, 200]       B4 [200]
#       · · · ·                                               Y4_hat [batch, 200]
#       \x/x\x/       -- fully connected layer (softmax)      W5 [200, 10]           B5 [10]
#        · · ·                                                Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################
X = tf.placeholder(tf.float32, [None, 28, 28, 1])
Y = tf.placeholder(tf.float32, shape=(None, 10))
learning_rate = tf.placeholder(tf.float32)
pkeep = tf.placeholder(tf.float32, shape=[])

# three convolutional layers with their channel counts, and a fully connected layer 
# (the last layer has 10 softmax neurons)
# the output depth of first three convolutional layers, are 6, 12, 24, and the size of fully connected
# layer is 200
W_Conv1 = tf.Variable(tf.truncated_normal([5, 5, 1, 6],stddev=0.1))
B_Conv1 = tf.Variable(tf.constant(0.1, shape=[6]))

W_Conv2 = tf.Variable(tf.truncated_normal([5, 5, 6, 12],stddev=0.1))
B_Conv2 = tf.Variable(tf.constant(0.1, shape=[12]))

W_Conv3 = tf.Variable(tf.truncated_normal([4, 4, 12, 24],stddev=0.1))
B_Conv3 = tf.Variable(tf.constant(0.1, shape=[24]))

W_Conv4 = tf.Variable(tf.truncated_normal([7*7*24, 200],stddev=0.1))
B_Conv4 = tf.Variable(tf.constant(0.1, shape=[200]))

W_Conv5 = tf.Variable(tf.truncated_normal([200, 10],stddev=0.1))
B_Conv5 = tf.Variable(tf.constant(0.1, shape=[10]))

########################################
# build the model
########################################
stride = 1  # output is 28x28

Y_hat_Conv1 = tf.nn.relu(tf.nn.conv2d(X, W_Conv1, strides=[1, stride, stride, 1], padding='SAME') + B_Conv1 ) # [batch, height, width, channels]

stride_conv2 = 2 # output is 14x14
Y_hat_Conv2 = tf.nn.relu(tf.nn.conv2d(Y_hat_Conv1, W_Conv2, strides=[1, stride_conv2, stride_conv2, 1], padding='SAME') + B_Conv2 )

stride_conv3 = 2  # output is 7x7
Y_hat_Conv3 = tf.nn.relu(tf.nn.conv2d(Y_hat_Conv2, W_Conv3, strides=[1, stride_conv3, stride_conv3, 1], padding='SAME') + B_Conv3 )



# reshape the output from the third convolution for the fully connected layer
flattened_matrix_Conv3 = tf.reshape(Y_hat_Conv3, [-1, 7*7*24])
Y_hat_Conv4 = tf.nn.relu(tf.matmul(flattened_matrix_Conv3, W_Conv4) + B_Conv4)
Y_hat_Conv4_dropout = tf.nn.dropout(Y_hat_Conv4, keep_prob=pkeep)

Y_hat_CNN2 = tf.nn.softmax(tf.matmul(Y_hat_Conv4_dropout, W_Conv5) + B_Conv5)

########################################
# define the Loss function
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=Y_hat_CNN2, labels=Y )
cross_entropy = tf.reduce_mean(cross_entropy) * 100

########################################
# traini the model
########################################
optimizer = tf.train.AdamOptimizer(learning_rate)
train_step = optimizer.minimize(cross_entropy)

########################################
# execute the model
########################################
init = tf.global_variables_initializer()
not_minimum_learning_rate = True
max_learning_rate = 0.005
min_learning_rate = 0.0001
dropout_rate = 0.7
applied_learning_rate = max_learning_rate
import math
n_epochs = 5000
with tf.Session() as sess:
    sess.run(init)
    #For every iteration i
    for epoch in range(n_epochs):
        epoch_loss = 0
        epoch_imageX_batch, epoch_labelY_batch = mnist_cnn.train.next_batch(100)
        epoch_imageX_reshapedbatch = np.reshape(epoch_imageX_batch, [-1,28,28,1])
        _, loss = sess.run([train_step,cross_entropy], feed_dict={X: epoch_imageX_reshapedbatch, 
                                                                          Y: epoch_labelY_batch, 
                                                                          learning_rate: applied_learning_rate,
                                                                          pkeep:dropout_rate}) 
        epoch_loss = epoch_loss + loss
        if not_minimum_learning_rate:
            applied_learning_rate = min_learning_rate + (max_learning_rate - min_learning_rate) * math.exp(-(epoch / 2000)) #Pattern Recognition and Computer Vision: First Chinese Conference pg 401
            if applied_learning_rate == min_learning_rate:
                not_minimum_learning_rate = False         
        #print loss after every 200 epochs
        if epoch % 200 == 0:
            print ('Epoch ',epoch, 'has a loss of ', epoch_loss)
            
    predictions = tf.equal(tf.argmax(Y, 1), tf.argmax(Y_hat_CNN2,1))
    accuracy = tf.reduce_mean(tf.cast(predictions, tf.float32))
    epoch_imageX_reshapedTestbatch = np.reshape(mnist_cnn.test.images, [-1,28,28,1])
    acc = accuracy.eval(session=sess,feed_dict={X:epoch_imageX_reshapedTestbatch, Y: mnist_cnn.test.labels, pkeep:dropout_rate}) * 100
    print(" ")
    print("The Accuracy for an fine tuned model in Five-Layer CNN Neural Network model is: {} % ".format(acc)) 

Epoch  0 has a loss of  230.3008270263672
Epoch  200 has a loss of  159.20826721191406
Epoch  400 has a loss of  148.1158905029297
Epoch  600 has a loss of  151.12091064453125
Epoch  800 has a loss of  146.20285034179688
Epoch  1000 has a loss of  150.10943603515625
Epoch  1200 has a loss of  151.36172485351562
Epoch  1400 has a loss of  151.14804077148438
Epoch  1600 has a loss of  147.1650390625
Epoch  1800 has a loss of  148.1150665283203
Epoch  2000 has a loss of  151.1748809814453
Epoch  2200 has a loss of  148.1139373779297
Epoch  2400 has a loss of  150.51304626464844
Epoch  2600 has a loss of  150.11508178710938
Epoch  2800 has a loss of  147.56494140625
Epoch  3000 has a loss of  147.1151885986328
Epoch  3200 has a loss of  148.1151123046875
Epoch  3400 has a loss of  148.11331176757812
Epoch  3600 has a loss of  149.1130828857422
Epoch  3800 has a loss of  148.11505126953125
Epoch  4000 has a loss of  150.0897979736328
Epoch  4200 has a loss of  147.12501525878906
Epoch  4400

---
# 8. Tensorflow Layers Module
The TensorFlow **layers** `tf.layers` module provides a high-level API that makes it easy to construct a neural network. It provides methods that facilitate: (i) the creation of dense (fully connected) layers and convolutional layers, (ii) adding activation functions, and (iii) applying dropout regularization. In this section use the module `tf.layers` to build the network you made in section 7.

In [1]:
from __future__ import division, print_function, unicode_literals
from tensorflow.examples.tutorials.mnist import input_data as mnist_data
import numpy as np
import tensorflow as tf
# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)
mnist = mnist_data.read_data_sets("MNIST_DATA/", one_hot=True)

Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
Instructions for updating:
Please write your own downloading logic.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST_DATA/train-images-idx3-ubyte.gz
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST_DATA/train-labels-idx1-ubyte.gz
Instructions for updating:
Please use tf.one_hot on tensors.
Extracting MNIST_DATA/t10k-images-idx3-ubyte.gz
Extracting MNIST_DATA/t10k-labels-idx1-ubyte.gz
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.


In [3]:
# TODO: Replace <FILL IN> with appropriate code

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################
X_LM = tf.placeholder(tf.float32, [None, 28, 28, 1])
Y_LM= tf.placeholder(tf.float32, shape=(None, 10))
learning_rate = tf.placeholder(tf.float32)
pkeep = tf.placeholder(tf.float32, shape=[])


########################################
# Create the layers
########################################
# Computes features using a 5x5 filter.
# Padding is added to preserve width and height.
Y_hat_conv1 = tf.layers.conv2d( inputs=X_LM, filters=6, kernel_size=[5, 5], strides=1,padding="same",
                               bias_initializer= tf.constant_initializer(0.1), activation=tf.nn.relu)

Y_hat_conv2 = tf.layers.conv2d( inputs=Y_hat_conv1, filters=12, kernel_size=[5, 5], strides=2, padding="same",
                               bias_initializer= tf.constant_initializer(0.1), activation=tf.nn.relu)

Y_hat_conv3 = tf.layers.conv2d( inputs=Y_hat_conv2, filters=24, kernel_size=[4, 4], strides=2, padding="same",
                               bias_initializer= tf.constant_initializer(0.1), activation=tf.nn.relu)

Y_hat_conv4 = tf.layers.dense(inputs=tf.reshape(Y_hat_conv3, [-1, 7 * 7 * 24]), units=200, activation=tf.nn.relu,
                              bias_initializer=tf.constant_initializer(0.1))
Y_hat_conv4_dropout = tf.layers.dropout(inputs=Y_hat_conv4, rate=0.75)

Y_hat_conv5 = tf.layers.dense(inputs=Y_hat_conv4_dropout, units=10,
                         bias_initializer=tf.constant_initializer(0.1))

########################################
# define the Loss function
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=Y_hat_conv5, labels=Y_LM)
cross_entropy = tf.reduce_mean(cross_entropy) * 100

########################################
# train the model
########################################
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_step = optimizer.minimize(cross_entropy)

########################################
# execute the model
########################################  
init = tf.global_variables_initializer()
not_minimum_learning_rate = True
max_learning_rate = 0.005
min_learning_rate = 0.0001
applied_learning_rate = max_learning_rate
import math
n_epochs = 5000
with tf.Session() as sess:
    sess.run(init)
    #For every iteration i
    for epoch in range(n_epochs):
        epoch_loss = 0
        epoch_imageX_batch, epoch_labelY_batch = mnist.train.next_batch(100)
        epoch_imageX_reshapedbatch = np.reshape(epoch_imageX_batch, [-1,28,28,1])
        _, loss = sess.run([train_step,cross_entropy], feed_dict={X_LM: epoch_imageX_reshapedbatch, 
                                                                          Y_LM: epoch_labelY_batch, 
                                                                          learning_rate: applied_learning_rate}) 
        epoch_loss = epoch_loss + loss
        if not_minimum_learning_rate:
            applied_learning_rate = min_learning_rate + (max_learning_rate - min_learning_rate) * math.exp(-(epoch / 2000)) #Pattern Recognition and Computer Vision: First Chinese Conference pg 401
            if applied_learning_rate == min_learning_rate:
                not_minimum_learning_rate = False         
        #print loss after every 200 epochs
        if epoch % 200 == 0:
            print ('Epoch ',epoch, 'has a loss of ', epoch_loss)
            
    predictions = tf.equal(tf.argmax(Y_LM, 1), tf.argmax(Y_hat_conv5,1))
    accuracy = tf.reduce_mean(tf.cast(predictions, tf.float32))
    epoch_imageX_reshapedTestbatch = np.reshape(mnist.test.images, [-1,28,28,1])
    acc = accuracy.eval(session=sess,feed_dict={X_LM:epoch_imageX_reshapedTestbatch, Y_LM: mnist.test.labels}) * 100
    print(" ")
    print("The Accuracy for an fine tuned model in Five-Layer CNN Neural Network model is: {} % ".format(acc)) 

Epoch  0 has a loss of  232.1500244140625
Epoch  200 has a loss of  13.473519325256348
Epoch  400 has a loss of  9.194537162780762
Epoch  600 has a loss of  1.3236289024353027
Epoch  800 has a loss of  0.9247559309005737
Epoch  1000 has a loss of  1.562678575515747
Epoch  1200 has a loss of  3.673323154449463
Epoch  1400 has a loss of  0.5194148421287537
Epoch  1600 has a loss of  5.202116966247559
Epoch  1800 has a loss of  6.246393203735352
Epoch  2000 has a loss of  0.39227771759033203
Epoch  2200 has a loss of  0.8949481248855591
Epoch  2400 has a loss of  0.7956770062446594
Epoch  2600 has a loss of  11.737939834594727
Epoch  2800 has a loss of  1.3762552738189697
Epoch  3000 has a loss of  1.116493821144104
Epoch  3200 has a loss of  2.229557514190674
Epoch  3400 has a loss of  2.3982293605804443
Epoch  3600 has a loss of  0.43825939297676086
Epoch  3800 has a loss of  0.0859948992729187
Epoch  4000 has a loss of  0.01970566250383854
Epoch  4200 has a loss of  0.6534491777420044


---
# 9. Keras
Keras is a high-level API to build and train deep learning models. It's used for fast prototyping, advanced research, and production. `tf.keras` is TensorFlow's implementation of the Keras API specification. To work with Keras, you need to import `tf.keras` as part of your TensorFlow program setup.
```
import tensorflow as tf
from tensorflow.keras import layers
```
#### Build a model
In Keras, you assemble **layers** to build a model, i.e., a graph of layers. The most common type of model is a stack of layers: the `tf.keras.Sequential` model. For example, the following code builds a simple, fully-connected network (i.e., multi-layer perceptron):
```
model = tf.keras.Sequential()
# adds a densely-connected layer with 64 units to the model:
model.add(layers.Dense(64, activation='relu'))
# add another
model.add(layers.Dense(64, activation='relu'))
# add a softmax layer with 10 output units:
model.add(layers.Dense(10, activation='softmax'))
```
There are many `tf.keras.layers` available with some common constructor parameters:
* `activation`: set the activation function for the layer, which is specified by the name of a built-in function or as a callable object.
* `kernel_initializer` and `bias_initializer`: the initialization schemes that create the layer's weights (weight and bias).
* `kernel_regularizer` and `bias_regularizer`: the regularization schemes that apply the layer's weights (weight and bias), such as L1 or L2 regularization.

#### Train and evaluate
After you construct a model, you can configure its learning process by calling the `compile` method:
```
model.compile(optimizer=tf.train.AdamOptimizer(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```
The method `tf.keras.Model.compile` takes three important arguments:
* `optimizer`: it specifies the training procedure, e.g., `tf.train.AdamOptimizer` and `tf.train.GradientDescentOptimizer`.
* `loss`: the cost function to minimize during optimization, e.g., mean square error (mse), categorical_crossentropy, and binary_crossentropy.
* `metrics`: used to monitor training, e.g., `accuracy`.

The next step after confiuring the model is to train it by calling the `model.fit` method and giving it training data as its input. After training the model you can call `tf.keras.Model.evaluate` and `tf.keras.Model.predict` methods to evaluate the inference-mode loss and metrics for the data provided or predict the output of the last layer in inference for the data provided, respectively.

You can read more about Keras [here](https://www.tensorflow.org/guide/keras).

In this task, please use Keras to rebuild the network you made in section 7.

### I'm unable to run this section of the code to get an output. I have conducted an online research  and as it turns outthe problem is cause by different functions in different version keras. I experienced memory issues and hardware problems in HopWorks platform, I have been using a cognitiveclass.ai plaform to run my code therefore I do not have control over which version of keras to install and use.  

In [26]:
from tensorflow.python.keras.layers import Dropout, Dense, Flatten
from tensorflow.python.keras import layers
from keras.objectives import categorical_crossentropy
from keras.callbacks import LearningRateScheduler
from keras.datasets import mnist
from keras.utils import to_categorical
import tensorflow as tf
import keras
import numpy as np
import math
# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

# to reset the Tensorflow default graph
reset_graph()
########################################
# define variables and placeholders
########################################
# Split mnist data into train and test sets
(X_training_data, Y_training_labels), (X_testing_data, Y_testing_labels) = mnist.load_data()

# reshape data to fit model
X_training_data = X_training_data.reshape(-1, 28, 28, 1)
X_testing_data = X_testing_data.reshape(-1, 28, 28, 1)

# one-hot encoding for the target column e.g a column with digit 5 will be replaced with binary representation of 1 then the rest of colums with 0s
Y_one_hot_training_labels = to_categorical(Y_training_labels)
Y_one_hot_testing_labels = to_categorical(Y_testing_labels)

########################################
# Build the network model |Functional model
########################################

model = tf.keras.Sequential()
model.add(layers.Conv2D(filters=6, kernel_size=5, strides=1, 
                        kernel_initializer=keras.initializers.RandomNormal(mean=0.0, stddev=0.1, seed=None), 
                        bias_initializer = keras.initializers.Constant(value=0.1),
                        padding='same', activation="relu", input_shape=(28, 28, 1)))
model.add(layers.Conv2D(filters=12, kernel_size=5, strides=2, 
                        kernel_initializer=keras.initializers.RandomNormal(mean=0.0, stddev=0.1, seed=None), 
                        bias_initializer = keras.initializers.Constant(value=0.1),
                        padding='same', activation="relu", input_shape=(28, 28, 6)))
model.add(layers.Conv2D(filters=24, kernel_size=4, strides=2, 
                        kernel_initializer=keras.initializers.RandomNormal(mean=0.0, stddev=0.1, seed=None), 
                        bias_initializer = keras.initializers.Constant(value=0.1),
                        padding='same', activation="relu", input_shape=(14, 14, 12)))
model.add(Flatten())

# fully-connected layer with 200 units and ReLU activation
model.add(layers.Dense(200, activation="relu", kernel_initializer=keras.initializers.RandomNormal(mean=0.0, stddev=0.1, seed=None)
                       ,bias_initializer= tf.keras.initializers.Constant(value=0.1)))
model.add(Dropout(0.75))

# Output layer with 10 units and a softmax activation
model.add(tf.keras.layers.Dense(10, kernel_initializer=keras.initializers.RandomNormal(mean=0.0, stddev=0.1, seed=None),
                                activation="softmax", bias_initializer = tf.keras.initializers.Constant(value=0.1)))

########################################
# Compile the model
########################################

# learning rate schedule
def lr_exponential_decay(epoch, lr):
    # step_decay:lrate = initial_lrate * math.pow(drop, math.floor((1+epoch)/epochs_drop))
    # exponential decay: 𝑙𝑒𝑎𝑟𝑛𝑖𝑛𝑔_𝑟𝑎𝑡𝑒=𝑙𝑟_𝑚𝑖𝑛+(𝑙𝑟_𝑚𝑎𝑥−𝑙𝑟_𝑚𝑖𝑛)∗𝑒−𝑖2000, where 𝑖 is the iteration number.
    max_learning_rate = 0.005
    min_learning_rate = 0.0001
    epochs_drop = 2
    lr = min_learning_rate + (max_learning_rate - min_learning_rate) * math.exp(-(epoch / 2000))
    return lr

# learning schedule callback
callbacks = [tf.keras.callbacks.LearningRateScheduler(lr_exponential_decay, verbose=0)]
    
model.compile(optimizer=tf.train.AdamOptimizer(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

########################################
# train and execute the model
########################################
# model.fit(X_training_data, Y_training_labels, validation_data=(X_testing_data , Y_testing_labels), callbacks=callbacks, epochs=5, batch_size=100,verbose=1)
model.fit(X_training_data, Y_one_hot_training_labels, validation_data=(X_testing_data , Y_one_hot_testing_labels),epochs=5, batch_size=100,verbose=1)
########################################
# execute the model
########################################  
model.predict(X_testing_data)
########################################
# execute the model
########################################  
model.predict(Y_testing_labels)

#Y_one_hot_training_labels = to_categorical(Y_training_labels)
#Y_one_hot_testing_labels = to_categorical(Y_testing_labels)
#print(" ")
#print("The Accuracy for an fine tuned model in Five-Layer CNN Neural Network model is: {} % ".format(acc)) 

TypeError: __call__() got an unexpected keyword argument 'partition_info'

---
# 10. Implement LeNet-5
In this section, you should implement **LeNet-5** either using Tensorflow or Keras. Please take a look at its [paper](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf) before starting to implement it.
The LeNet-5 architecture is perhaps the most widely known CNN architecture. It was created by Yann LeCun in 1998 and widely used for handwritten digit recognition (MNIST). It is composed of the layers shown in the following table.
<img src="figs/19-letnet5.png" style="width: 600px;"/>
There are a few extra details to be noted:
* MNIST images are 28×28 pixels, but they are zero-padded to 32×32 pixels and normalized before being fed to the network. The rest of the network does not use any padding, which is why the size keeps shrinking as the image progresses through the network.
* The average pooling layers are slightly more complex than usual: each neuron computes the mean of its inputs, then multiplies the result by a learnable coefficient and adds a learnable bias term, then finally applies the activation function.
* Most neurons in layer C3 maps are connected to neurons in only three or four S2 maps (instead of all six S2 maps). See table 1 in the [paper](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf) for details.
* The output layer is a bit special: instead of computing the dot product of the inputs and the weight vector, each neuron outputs the square of the Euclidian distance between its input vector and its weight vector. Each output measures how much the image belongs to a particular digit class. The cross-entropy cost function is now preferred, as it penalizes bad predictions much more, producing larger gradients and thus converging faster.

In [7]:
# TODO: Build the LetNet-5 model, and test it on MNIST
from __future__ import division, print_function, unicode_literals
from tensorflow.examples.tutorials.mnist import input_data as mnist_data
import numpy as np
import tensorflow as tf
# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)
mnist = mnist_data.read_data_sets("MNIST_DATA/", one_hot=True)
# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################
X_LN5 = tf.placeholder(tf.float32, [None, 28, 28, 1])
Y_LN5 = tf.placeholder(tf.float32, shape=(None, 10))
learning_rate_ln5 = tf.placeholder(tf.float32)
pkeep_ln5 = tf.placeholder(tf.float32, shape=[])


########################################
# Create the layers
########################################
# Computes 64 features using a 5x5 filter.
# Padding is added to preserve width and height.
Y_hat_conv1 = tf.layers.conv2d(inputs=X_LN5, filters=6, kernel_size=[5, 5], strides=1,padding="same",
                               bias_initializer= tf.constant_initializer(0.1), activation=tf.nn.tanh)
S2 = tf.nn.avg_pool(Y_hat_conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME') # avg_pool_2x2

Y_hat_conv3 = tf.layers.conv2d(inputs=S2, filters=16, kernel_size=[5, 5], strides=2, padding="same",
                               bias_initializer= tf.constant_initializer(0.1), activation=tf.nn.tanh)

S4 = tf.nn.avg_pool(Y_hat_conv3, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME') # avg_pool_2x2

Y_hat_conv5 = tf.layers.conv2d(inputs=S4, filters=120, kernel_size=[5, 5], strides=2, padding="same",
                               bias_initializer= tf.constant_initializer(0.1), activation=tf.nn.tanh)

F6 = tf.layers.dense(inputs=tf.reshape(Y_hat_conv5, [-1, 20*20*120]), units=84, activation=tf.nn.tanh,
                              bias_initializer=tf.constant_initializer(0.1))

F6_dropout = tf.layers.dropout(inputs=F6, rate=0.75)

out  = tf.layers.dense(inputs=F6_dropout, units=10,
                         bias_initializer=tf.constant_initializer(0.1))

########################################
# define the Loss function
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=out, labels=Y_LN5)
cross_entropy = tf.reduce_mean(cross_entropy) * 100

########################################
# train the model
########################################
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_ln5)
train_step = optimizer.minimize(cross_entropy)

########################################
# execute the model
########################################  
init = tf.global_variables_initializer()
not_minimum_learning_rate = True
max_learning_rate = 0.005
min_learning_rate = 0.0001
applied_learning_rate = max_learning_rate
import math
n_epochs = 5000
with tf.Session() as sess:
    sess.run(init)
    #For every iteration i
    for epoch in range(n_epochs):
        epoch_loss = 0
        epoch_imageX_batch, epoch_labelY_batch = mnist.train.next_batch(100)
        epoch_imageX_reshapedbatch = np.reshape(epoch_imageX_batch, [-1,28,28,1])
        _, loss = sess.run([train_step,cross_entropy], feed_dict={X_LN5: epoch_imageX_reshapedbatch, 
                                                                          Y_LN5: epoch_labelY_batch, 
                                                                          learning_rate_ln5: applied_learning_rate}) 
        epoch_loss = epoch_loss + loss
        if not_minimum_learning_rate:
            applied_learning_rate = min_learning_rate + (max_learning_rate - min_learning_rate) * math.exp(-(epoch / 2000)) #Pattern Recognition and Computer Vision: First Chinese Conference pg 401
            if applied_learning_rate == min_learning_rate:
                not_minimum_learning_rate = False         
        #print loss after every 200 epochs
        if epoch % 200 == 0:
            print ('Epoch ',epoch, 'has a loss of ', epoch_loss)
            
    predictions = tf.equal(tf.argmax(Y_LN5, 1), tf.argmax(out,1))
    accuracy = tf.reduce_mean(tf.cast(predictions, tf.float32))
    epoch_imageX_reshapedTestbatch = np.reshape(mnist.test.images, [-1,28,28,1])
    acc = accuracy.eval(session=sess,feed_dict={X_LN5:epoch_imageX_reshapedTestbatch, Y_LN5: mnist.test.labels}) * 100
    print(" ")
    print("The Accuracy for an fine tuned model in Five-Layer CNN Neural Network model is: {} % ".format(acc)) 

Extracting MNIST_DATA/train-images-idx3-ubyte.gz
Extracting MNIST_DATA/train-labels-idx1-ubyte.gz
Extracting MNIST_DATA/t10k-images-idx3-ubyte.gz
Extracting MNIST_DATA/t10k-labels-idx1-ubyte.gz


InvalidArgumentError: Input to reshape is a tensor with 1000 values, but the requested shape has 10
	 [[Node: gradients/softmax_cross_entropy_with_logits_sg/Reshape_grad/Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](gradients/softmax_cross_entropy_with_logits_sg_grad/mul, gradients/softmax_cross_entropy_with_logits_sg/Reshape_grad/Shape)]]

Caused by op 'gradients/softmax_cross_entropy_with_logits_sg/Reshape_grad/Reshape', defined at:
  File "/home/jupyterlab/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/jupyterlab/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 505, in start
    self.io_loop.start()
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 132, in start
    self.asyncio_loop.run_forever()
  File "/home/jupyterlab/conda/lib/python3.6/asyncio/base_events.py", line 422, in run_forever
    self._run_once()
  File "/home/jupyterlab/conda/lib/python3.6/asyncio/base_events.py", line 1434, in _run_once
    handle._run()
  File "/home/jupyterlab/conda/lib/python3.6/asyncio/events.py", line 145, in _run
    self._callback(*self._args)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tornado/gen.py", line 1233, in inner
    self.run()
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tornado/gen.py", line 1147, in run
    yielded = self.gen.send(value)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 357, in process_one
    yield gen.maybe_future(dispatch(*args))
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 267, in dispatch_shell
    yield gen.maybe_future(handler(stream, idents, msg))
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 534, in execute_request
    user_expressions, allow_stdin,
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/ipykernel/ipkernel.py", line 294, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/ipykernel/zmqshell.py", line 536, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2819, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2845, in _run_cell
    return runner(coro)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/IPython/core/async_helpers.py", line 67, in _pseudo_sync_runner
    coro.send(None)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3020, in run_cell_async
    interactivity=interactivity, compiler=compiler, result=result)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3185, in run_ast_nodes
    if (yield from self.run_code(code, result)):
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-7-6c9b442bbe2b>", line 59, in <module>
    train_step = optimizer.minimize(cross_entropy)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 414, in minimize
    grad_loss=grad_loss)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 526, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 494, in gradients
    gate_gradients, aggregation_method, stop_gradients)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 636, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 385, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 636, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/ops/array_grad.py", line 521, in _ReshapeGrad
    return [array_ops.reshape(grad, array_ops.shape(op.inputs[0])), None]
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6113, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

...which was originally created as op 'softmax_cross_entropy_with_logits_sg/Reshape', defined at:
  File "/home/jupyterlab/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
[elided 25 identical lines from previous traceback]
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-7-6c9b442bbe2b>", line 52, in <module>
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=out, labels=Y_LN5)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 1959, in softmax_cross_entropy_with_logits
    labels=labels, logits=logits, dim=dim, name=name)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 1866, in softmax_cross_entropy_with_logits_v2
    precise_logits = _flatten_outer_dims(precise_logits)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 1612, in _flatten_outer_dims
    output = array_ops.reshape(logits, array_ops.concat([[-1], last_dim_size], 0))
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6113, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/home/jupyterlab/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 1000 values, but the requested shape has 10
	 [[Node: gradients/softmax_cross_entropy_with_logits_sg/Reshape_grad/Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](gradients/softmax_cross_entropy_with_logits_sg_grad/mul, gradients/softmax_cross_entropy_with_logits_sg/Reshape_grad/Shape)]]


---
# 11. Implement AlexNet
In the last section, you should implement **AlexNet** either using Tensorflow or Keras. Again, please take a look at its [paper](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) before start to implement it.
The AlexNet CNN architecture won the [ImageNet ILSVRC challenge](http://www.image-net.org/challenges/LSVRC/2012/) in 2012 by a large margin. It was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. It is quite similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer. The following table presents this architecture.
<img src="figs/20-alexnet.png" style="width: 600px;"/>
To train the model, we need a big dataset, however, in this assignment you are going to to assign the pretrained weights to your model, using `tf.Variable.assign`. You can download the pretrained weights from [bvlc_alexnet.npy](https://www.cs.toronto.edu/~guerzhoy/tf_alexnet/bvlc_alexnet.npy). This file is a NumPy array file created by the python. After you read this file, you will receive a python dictionary with a <key, value> pair for each layer. Each key is one of the layers names, e.g., `conv1`, and each value is a list of two values: (1) weights, and (2) biases of that layer. Part of the function to load the weights and biases to your model is given, and you need to complete it.

Here is what you see if you read and print the shape of each layer from the file:
```
weight_dic = np.load("bvlc_alexnet.npy", encoding="bytes").item()
for layer in weights_dic:
    print("-" * 20)
    print(layer)
    for wb in weights_dic[layer]:
        print(wb.shape)

#--------------------
# fc8
# (4096, 1000) # weights
# (1000,) # bias
#--------------------
# fc7
# (4096, 4096) # weights
# (4096,) # bias
#--------------------
# fc6
# (9216, 4096) # weights
# (4096,) # bias
#--------------------
# conv5
# (3, 3, 192, 256) # weights
# (256,) # bias
#--------------------
# conv4
# (3, 3, 192, 384) # weights
# (384,) # bias
#--------------------
# conv3
# (3, 3, 256, 384) # weights
# (384,) # bias
#--------------------
# conv2
# (5, 5, 48, 256) # weights
# (256,) # bias
#--------------------
# conv1
# (11, 11, 3, 96) # weights
# (96,) # bias
```


In [14]:
# TODO: Replace <FILL IN> with appropriate code

# to reset the Tensorflow default graph
reset_graph()

# build the AlexNet model
<FILL IN> :)

# load inital weights and biases to the model
def load_initial_weights(self, session):
    # load the weights into memory
    weights_dic = np.load('bvlc_alexnet.npy', encoding='bytes').item()

    # loop over all layer names stored in the weights dict
    for layer in weights_dict:
        with tf.variable_scope(layer, reuse=True):
            # loop over list of weights/biases and assign them to their corresponding tf variable
            for wb in weights_dict[layer]:
                # biases
                if len(wb.shape) == 1:
                    bias = tf.get_variable(<FILL IN>)
                    session.run(bias.assign(wb))
                # weights
                else:
                    weight = tf.get_variable(<FILL IN>)
                    session.run(weight.assign(wb))
                

An error was encountered:
Session 1716 unexpectedly reached final status 'error'. See logs:
stdout: 
2018-12-24 04:50:57,091 WARN  NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-12-24 04:50:57,835 INFO  RMProxy: Connecting to ResourceManager at /10.0.104.193:8032
2018-12-24 04:50:58,270 INFO  Client: Requesting a new application from cluster with 30 NodeManagers
2018-12-24 04:50:58,403 INFO  Client: Verifying our application has not requested more than the maximum memory capability of the cluster (216000 MB per container)
2018-12-24 04:50:58,414 INFO  Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
2018-12-24 04:50:58,415 INFO  Client: Setting up container launch context for our AM
2018-12-24 04:50:58,432 INFO  Client: Setting up the launch environment for our AM container
2018-12-24 04:50:58,445 INFO  Client: Preparing resources for our AM container
2018-12-24 04:51:00,145 W

#### Test the model
After building the AlexNet model, you can test it on different images and present the accuracy of the model. To do so, first you need to use **OpenCV** library to make the images ready to give as input to the model. OpenCV is a library used for image processing. Below you can see how to read an image file and pre-process it using OpenCV to give it to the model. However, you need to complete the code and test the accuracy of your model. The teset images (shown below) are available in the `test_images` folder.
<table width="100%">
<tr>
<td><img src="test_images/test_image1.jpg" style="width:200px;"></td>
<td><p align="center"><img src="test_images/test_image2.jpg" style="width:200px;"></td>
<td align="right"><img src="test_images/test_image3.jpg" style="width:200px;"></td>
</tr>

In [15]:
# TODO: Replace <FILL IN> with appropriate code
# test the AlexNet model on the given images

import cv2

#get list of all images
current_dir = os.getcwd()
image_path = os.path.join(current_dir, 'test_images')
img_files = [os.path.join(image_path, f) for f in os.listdir(image_path) if f.endswith('.jpg')]

#load all images
imgs = []
for f in img_files:
    imgs.append(cv2.imread(f))

with tf.Session() as sess:
    <FILL IN>
    
    # loop over all images
    for i, image in enumerate(imgs):
        # convert image to float32 and resize to (227x227)
        img = cv2.resize(image.astype(np.float32), (227, 227))
        
        # subtract the ImageNet mean
        # Mean subtraction per channel was used to center the data around zero mean for each channel (R, G, B).
        # This typically helps the network to learn faster since gradients act uniformly for each channel.
        imagenet_mean = np.array([104., 117., 124.], dtype=np.float32)
        img -= imagenet_mean
        
        # reshape as needed to feed into model
        img = img.reshape((1, 227, 227, 3))
        
        <FILL IN>

An error was encountered:
Session 1716 unexpectedly reached final status 'error'. See logs:
stdout: 
2018-12-24 04:50:57,091 WARN  NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-12-24 04:50:57,835 INFO  RMProxy: Connecting to ResourceManager at /10.0.104.193:8032
2018-12-24 04:50:58,270 INFO  Client: Requesting a new application from cluster with 30 NodeManagers
2018-12-24 04:50:58,403 INFO  Client: Verifying our application has not requested more than the maximum memory capability of the cluster (216000 MB per container)
2018-12-24 04:50:58,414 INFO  Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
2018-12-24 04:50:58,415 INFO  Client: Setting up container launch context for our AM
2018-12-24 04:50:58,432 INFO  Client: Setting up the launch environment for our AM container
2018-12-24 04:50:58,445 INFO  Client: Preparing resources for our AM container
2018-12-24 04:51:00,145 W