# Assignment 4: Benchmarking Fashion-MNIST with Deep Neural Nets

### CS 4501 Machine Learning - Department of Computer Science - University of Virginia
"The original MNIST dataset contains a lot of handwritten digits. Members of the AI/ML/Data Science community love this dataset and use it as a benchmark to validate their algorithms. In fact, MNIST is often the first dataset researchers try. "If it doesn't work on MNIST, it won't work at all", they said. "Well, if it does work on MNIST, it may still fail on others." - **Zalando Research, Github Repo.**"

Fashion-MNIST is a dataset from the Zalando's article. Each example is a 28x28 grayscale image, associated with a label from 10 classes. They intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms.

![Here's an example how the data looks (each class takes three-rows):](https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png)

In this assignment, you will attempt to benchmark the Fashion-MNIST using Neural Networks. You must use it to train some neural networks on TensorFlow and predict the final output of 10 classes. For deliverables, you must write code in Python and submit this Jupyter Notebook file (.ipynb) to earn a total of 100 pts. You will gain points depending on how you perform in the following sections.


In [1]:
# You might want to use the following packages
import numpy as np
import os
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) #reduce annoying warning messages
from functools import partial

# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)


---
## 1. PRE-PROCESSING THE DATA (10 pts)

You can load the Fashion MNIST directly from Tensorflow. **Partition of the dataset** so that you will have 50,000 examples for training, 10,000 examples for validation, and 10,000 examples for testing. Also, make sure that you platten out each of examples so that it contains only a 1-D feature vector.

Write some code to output the dimensionalities of each partition (train, validation, and test sets).



In [2]:
# Loading in data 
fmnist = tf.keras.datasets.fashion_mnist.load_data();

# Creating train and test data sets 
(X_train, y_train), (X_test, y_test) = fmnist

# Creating specific training, validation, and test sets 
x_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
x_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.float32)
y_test = y_test.astype(np.float32)
x_valid, x_train = x_train[:10000], x_train[10000:]
y_valid, y_train = y_train[:10000], y_train[10000:]


In [3]:
import matplotlib.pyplot as plt

plt.figure()
plt.imshow(X_train[0])
plt.colorbar()
plt.grid(False)
plt.show()

<Figure size 640x480 with 2 Axes>

- - -
## 2. CONSTRUCTION PHASE (30 pts)

In this section, define at least three neural networks with different structures. Make sure that the input layer has the right number of inputs. The best structure often is found through a process of trial and error experimentation:
- You may start with a fully connected network structure with two hidden layers.
- You may try a few settings of the number of nodes in each layer.
- You may try a few activation functions to see if they affect the performance.

**Important Implementation Note:** For the purpose of learning Tensorflow, you must use low-level TensorFlow API to construct the network. Usage of high-level tools (ie. Keras) is not permited. 

In [4]:
# Your code goes here
reset_graph()

# Set some configuration here
n_inputs = 28*28  # Fashion-MNIST
n_outputs = 10
learning_rate = 0.01

# Construct placeholder for the input layer
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

In [5]:
def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)

In [6]:
# n_hidden1 = 784
# n_hidden2 = 196

# with tf.name_scope("dnn1"):
#     hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
#     hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")
#     logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

In [7]:
# nhidden1 = 784
# nhidden2 = 196
# nhidden3 = 49

# with tf.name_scope("dnn2"):
#     hidden_1 = tf.layers.dense(X, nhidden1, activation=leaky_relu, name="hidden_1")
#     hidden_2 = tf.layers.dense(hidden_1, nhidden2, activation=leaky_relu, name="hidden_2")
#     hidden_3 = tf.layers.dense(hidden_2, nhidden3, activation=leaky_relu, name="hidden_3")
#     logits_2 = tf.layers.dense(hidden_3, n_outputs, name="outputs_2")

In [8]:
# BEST NETWORK 


# nhidden1 = 784
# nhidden2 = 392
# nhidden3 = 196
# nhidden4 = 98


# with tf.name_scope("dnn3"):
#     hidden_one = tf.layers.dense(X, nhidden1, activation=leaky_relu, name="hidden_one")
#     hidden_two = tf.layers.dense(hidden_one, nhidden2, activation=leaky_relu, name="hidden_two")
#     hidden_three = tf.layers.dense(hidden_two, nhidden3, activation=leaky_relu, name="hidden_three")
#     hidden_four = tf.layers.dense(hidden_three, nhidden4, activation=leaky_relu, name="hidden_four")
#     logits_3 = tf.layers.dense(hidden_three, n_outputs, name="outputs_3")

In [9]:
# he_init = tf.variance_scaling_initializer()
# training = tf.placeholder_with_default(False, shape=(), name='training')
# dropout_rate = 0.5  # == 1 - keep_prob
# X_drop = tf.layers.dropout(X, dropout_rate, training=training)
# nhidden1 = 784
# nhidden2 = 392
# nhidden3 = 196
# nhidden4 = 98

# with tf.name_scope("dnnBenchmark"):
#     my_batch_norm_layer = partial(
#             tf.layers.batch_normalization,
#             training=training,
#             momentum=0.9)

#     my_dense_layer = partial(
#             tf.layers.dense,
#             kernel_initializer=he_init)

#     hidden1 = my_dense_layer(X_drop, nhidden1, name="hidden1")
#     bn1 = tf.nn.elu(my_batch_norm_layer(hidden1))
#     hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)
#     hidden2 = my_dense_layer(hidden1_drop, nhidden2, name="hidden2")
#     bn2 = tf.nn.elu(my_batch_norm_layer(hidden2))
#     hidden2_drop = tf.layers.dropout(hidden2, dropout_rate, training=training)
#     hidden3 = my_dense_layer(hidden2_drop, nhidden3, name="hidden3")
#     bn3 = tf.nn.elu(my_batch_norm_layer(hidden3))
#     hidden3_drop = tf.layers.dropout(hidden3, dropout_rate, training=training)
#     hidden4 = my_dense_layer(hidden3_drop, nhidden4, name="hidden4")
#     bn4 = tf.nn.elu(my_batch_norm_layer(hidden4))
#     hidden4_drop = tf.layers.dropout(hidden4, dropout_rate, training=training)
#     logits_before_bn = my_dense_layer(hidden4_drop, n_outputs, name="outputs")
#     logits = my_batch_norm_layer(logits_before_bn)


In [10]:
he_init = tf.variance_scaling_initializer()
training = tf.placeholder_with_default(False, shape=(), name='training')
# dropout_rate = 0.5  # == 1 - keep_prob
# X_drop = tf.layers.dropout(X, dropout_rate, training=training)
nhidden1 = 784
nhidden2 = 392
nhidden3 = 196
nhidden4 = 98

with tf.name_scope("dnnBenchmark"):
    my_batch_norm_layer = partial(
            tf.layers.batch_normalization,
            training=training,
            momentum=0.9)

    my_dense_layer = partial(
            tf.layers.dense,
            kernel_initializer=he_init)

    hidden1 = my_dense_layer(X, nhidden1, name="hidden1")
    bn1 = tf.nn.elu(my_batch_norm_layer(hidden1))
    hidden2 = my_dense_layer(bn1, nhidden2, name="hidden2")
    bn2 = tf.nn.elu(my_batch_norm_layer(hidden2))
    hidden3 = my_dense_layer(bn2, nhidden3, name="hidden3")
    bn3 = tf.nn.elu(my_batch_norm_layer(hidden3))
    hidden4 = my_dense_layer(bn3, nhidden4, name="hidden4")
    bn4 = tf.nn.elu(my_batch_norm_layer(hidden4))
    logits_before_bn = my_dense_layer(bn4, n_outputs, name="outputs")
    logits = my_batch_norm_layer(logits_before_bn)


In [11]:
with tf.name_scope("loss"):
    xentropy1 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy1, name="loss")

In [12]:
# with tf.name_scope("loss2"):
#     xentropy2 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits_2)
#     loss2 = tf.reduce_mean(xentropy2, name="loss2")

In [13]:
# with tf.name_scope("loss3"):
#     xentropy3 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits_3)
#     loss3 = tf.reduce_mean(xentropy3, name="loss3")

In [14]:
with tf.name_scope("train"):
    optimizer = tf.train.AdamOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

In [15]:
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    print(accuracy)

Tensor("eval/Mean:0", shape=(), dtype=float32)


- - -
## 3. EXECUTION PHASE (30 pts)

After you construct the three models of neural networks, you can compute the performance measure as the class accuracy. You will need to define the number of epochs and size of the training batch. You also might need to reset the graph each time your try a different model. To save time and avoid retraining, you should save the trained model and load it from disk to evaluate a test set. Pick the best model and answer the following:
- Which model yields the best performance measure for your dataset? Provide a reason why it yields the best performance.
- Why did you pick this many hidden layers?
- Provide some justifiable reasons for selecting the number of neurons per hidden layers. 
- Which activation functions did you use?

In the next session you will get a chance to finetune it further .



In [16]:
# Your code goes here
init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 40
batch_size = 50


# shuffle_batch() shuffle the examples in a batch before training
def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch


In [17]:
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(x_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if epoch % 5 == 0:
            acc_batch = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
            acc_valid = accuracy.eval(feed_dict={X: x_valid, y: y_valid})
            print(epoch, "Batch accuracy:", acc_batch, "Validation accuracy:", acc_valid)
    saver.save(sess, "./my_dnn_model.ckpt")
    acc_test = accuracy.eval(feed_dict={X: x_test, y: y_test})
    print("Final test accuracy: {:.2f}%".format(acc_test * 100))


0 Batch accuracy: 0.94 Validation accuracy: 0.8259
5 Batch accuracy: 0.94 Validation accuracy: 0.8657
10 Batch accuracy: 0.88 Validation accuracy: 0.875
15 Batch accuracy: 0.96 Validation accuracy: 0.889
20 Batch accuracy: 0.96 Validation accuracy: 0.8837
25 Batch accuracy: 0.86 Validation accuracy: 0.8846
30 Batch accuracy: 0.88 Validation accuracy: 0.8354
35 Batch accuracy: 0.86 Validation accuracy: 0.8956
Final test accuracy: 81.34%


## Analysis

Several different combinations of hidden layers and activation types were tested. There was some discrepencies in the number of neurons per layer, but I kept that fairly constant throughout all three deep neural networks. 

DNN #1: Tested 2 layers with RELU, ELU, and Leaky ELU activations with 784 in the first hidden layer and 196 in the second hidden layer. I picked these numbers arbitrarily with 784 being the number of inputs (28 * 28). I thought that was a good place to start with the total number of neurons. RELU activation yielded the highest accuracy of 88.08%.

DNN #2: Tested 3 layers with RELU, ELU, and Leaky ELU activations with 784 in the first hidden layer, 196 in the second hidden layer, and 49 in the third hidden layer. These numbers are just factors of 784, but once again just chosen arbitrarily. Once again RELU activation yielded the highest accuracy of 88.70%. 

DNN #3: Tested 4 layers with RELU, ELU, and Leaky ELU activations with 784 in the first hidden layer, 392 in the second hidden layer, 196 in the third hidden layer, and 98 in the fourth hidden layer. Leaky RELU activation yielded the highest accuracy of 88.85%. This is the best model thus far, and will be tweaked in part four. 

In [18]:
# DNN1: 88.08%
# DNN2: 88.70%
# DNN3: 88.85%

# Best final accuracy is for DNN 2: 88.85%

- - -
## 4. FINETUNING THE NETWORK (25 pts)

The best performance on the Fashion MNIST of a non-neural-net classifier is the Support Vector Classifier {"C":10,"kernel":"poly"} with 0.897 accuracy. In this section, you will see how close you can get to that accuracy, or (better yet) beat it! You will be able to see the performance of other ML methods below:
http://fashion-mnist.s3-website.eu-central-1.amazonaws.com

Use the best model from the previous section and see if you can improve it further. To improve the performance of your model, You must make some modifications based upon the practical guidelines discuss in class. Here are a few decisions about the recommended network configurations you have to make:
1. Initialization: Use He Initialization for your model
2. Activation: Add ELU as the activation function throughout your hidden layers
3. Normalization: Incorporate the batch normalization at every layer
4. Regularization: Configure the dropout policy at 50% rate
5. Optimization: Change Gradient Descent into Adam Optimization
6. Your choice: make any other changes in 1-5 you deem necessary

Keep in mind that the execution phase is essentially the same, so you can just run it from the above. See how much you gain in classification accuracy. Provide some justifications for the gain in performance. 






## Analysis 

The way I executed finetuning the network was through trial and error. I tested each new network configuration by adding it to the previous configurations. If the final accuracy went down, then other combinations of the network configurations were tested. 

Adding ELU activation brought down the accuracy a little, so I tested all following network configurations with all three types of activations: ELU, RELU, and Leaky RELU. 

Batch normalization really helped the accuracy, while regularization brought the overall accuracy down by a little. Adding Adam Optimization too brought the accuracy down by a lot. I took out regularization to just test batch normalization and Adam Optimization, and was met with a little higher accuracy than previous runs. 

I found that the best deep neural network had the configurations of Initialization + RELU Activation + Batch Normalization, because it yielded an accuracy of 88.91%. 

In [19]:
# 1. Initialization: 88.38%
# 2. Initialization + ELU Activation: 87.09%

# 3a. Initialization + ELU Activation + Batch Normalization: 87.83%
# 3b. Initialization + RELU Activation + Batch Normalization: 88.91%
# 3c. Initialization + Leaky Activation + Batch Normalization: 88.57%

# 4a. Initialization + ELU Activation + Batch Normalization + Regularization: 83.80%
# 4b. Initialization + RELU Activation + Batch Normalization + Regularization: 83.80%
# 4c. Initialization + Leaky Activation + Batch Normalization + Regularization: 83.66%

# 5a. Initialization + ELU Activation + Batch Normalization + Regularization + Adam Optimization: 79.40%
# 5b. Initialization + RELU Activation + Batch Normalization + Regularization + Adam Optimization: 79.40%
# 5c. Initialization + Leaky Activation + Batch Normalization + Regularization + Adam Optimization: 73.60%

# 5c. Initialization + ELU Activation + Batch Normalization + Adam Optimization: 81.34%
# 5d. Initialization + RELU Activation + Batch Normalization + Adam Optimization: 86.62%
# 5e. Initialization + Leaky Activation + Batch Normalization + Adam Optimization: 87.62%

- - -
## 5. OUTLOOK (5 pts)

Plan for the outlook of your system: This may lead to the direction of your future project:
- Did your neural network outperform other "traditional ML technique? Why/why not?
    - My neural network did not outperform other "traditional ML techniques" as the Support Vector Classifier had an accuracy of 89.2%. My best neural network model had an overall accuracy of 88.91%, which is fairly close to the best accuracy but still falls short. I feel as though neural networks are so complicated, and can be changed in so many different ways to yield different models. Obviously, for the scope of this assignment and with the time given, I was not able to change and test variations of all the different combinations of variables and parameters. 

- Does your model work well? If not, which model should be further investigated?
    - My model works very well. My best tweaked model had an overall accuracy of 88.91%. I would love to further investigate this model by using different combinations of variables from the number of neurons per hidden layer to changing the batch momentum factor. Since training and testing each model takes so much time, it would have been too tedious to implement each individual change with all the different combinations of other parameters. If I had more time, I would definitely try to investigate my third neural network with RELU activation, He intialization, and Batch normalization. 
    
- Do you satisfy with your system? What do you think needed to improve?
    - I am not satisfied with my system. I think the notion of creating 3 arbitrary neural networks, and changing all the different parameters is a tedious job. It takes forever to test and train each changing neural network, and then with each added change the process starts over. I think the way to truly train the best neural network is to create a deep neural network through Grid Search, which will yield the best values for all the parameters in a neural network. That would allow us to test so many different combinations of variables, without having to manually change values and press run every single time. Grid Search would give us the best parameter values, and then we would just create our deep neural network off those values. This would be ideal. 

- - - 
### NEED HELP?

In case you get stuck in any step in the process, you may find some useful information from:

 * Consult my lectures and/or the textbook
 * Talk to the TA, they are available and there to help you during OH
 * Come talk to me or email me <nn4pj@virginia.edu> with subject starting "CS4501 Assignment 4:...".
 * More on the Fashion-MNIST to be found here: https://hanxiao.github.io/2018/09/28/Fashion-MNIST-Year-In-Review/

Best of luck and have fun!