# IEOR4742 Deep Learning Problem Set

# Problem 6 & 7

## Author: Hao Li

## Problem 6

**Problem 6 (Convolutional Neural Networks)**: In the sample code example `CNN MNIST.ipynb`:

**(a) add one more convolutional layer with max pooling and assess the impact of extra convolutional layer on accuracy**

### Import

In [1]:
import os

#import tensorflow as tf
import tensorflow.compat.v1 as tf
tf.compat.v1.disable_eager_execution()
print(tf.__version__)


import numpy as np
import scipy.misc

import matplotlib
import matplotlib.pyplot as plt 

import time

2.15.0


In [2]:
# !pip install --force-reinstall -v "tensorflow==2.15.0"

In [3]:
#load data. labels are in one-hot-encoding format
#generate original training and test data
img_size = 28
n_classes = 10

#global_step = 
input_size = 784
output_size = 10

print('\nLoading MNIST')

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = np.reshape(x_train, [-1, img_size*img_size])
x_train = x_train.astype(np.float32)/255

print(x_train.shape)

x_test = np.reshape(x_test, [-1, img_size*img_size])
x_test = x_test.astype(np.float32)/255

print(x_test.shape)

to_categorical = tf.keras.utils.to_categorical 
y_train = to_categorical(y_train)
y_test  = to_categorical(y_test)

print('\nSpliting data')

ind = np.random.permutation(x_train.shape[0])
x_train, y_train = x_train[ind], y_train[ind]

# 10% for validation 
validatationPct = 0.1
n = int(x_train.shape[0] * (1-validatationPct))
x_valid = x_train[n:]
x_train = x_train[:n]
#
y_valid = y_train[n:]
y_train = y_train[:n]

train_num_examples = x_train.shape[0]
valid_num_examples = x_valid.shape[0]
test_num_examples  = x_test.shape[0]

print(train_num_examples, valid_num_examples, test_num_examples)


Loading MNIST
(60000, 784)
(10000, 784)

Spliting data
54000 6000 10000
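
As a rough illustration of what `to_categorical` and the 90/10 split above produce, the following NumPy-only sketch one-hot encodes a few made-up labels and repeats the split arithmetic. The helper name `one_hot` and the toy labels are assumptions for illustration; they are not part of the original notebook.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return a (len(labels), n_classes) float32 one-hot matrix."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.shape[0], n_classes), dtype=np.float32)
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

# Hypothetical labels standing in for MNIST digits.
toy_labels = [3, 0, 9]
toy_one_hot = one_hot(toy_labels, n_classes=10)

# Same split arithmetic as the cell above: the last 10% is held out.
n_total = 60000
n_train = int(n_total * (1 - 0.1))
n_valid = n_total - n_train
print(toy_one_hot.shape, n_train, n_valid)
```

Each row has a single 1 at the index of its label, which is the format the loss and evaluation functions below expect.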


### Parameters

In [4]:
# The length of window in the pooling layer
k = 2

# Parameters
learning_rate = 0.005
training_epochs = 50
batch_size = 200
display_step = 1

### Define 2-d Convolution Function

In [5]:
def module_conv2d(x, weight_shape, bias_shape):
    """
    https://www.tensorflow.org/api_docs/python/tf/nn/conv2d
    Computes a 2-dimensional convolution given the 4-D input and filter
    input:
        x: [batch, in_height, in_width, in_channels]
        weight: [filter_height, filter_width, in_channels, out_channels]
        bias: [out_channels]
    output:
        The relu activation of convolution
    """
    print([weight_shape[0], weight_shape[1], weight_shape[2], weight_shape[3]])
    sizeIn = weight_shape[0] * weight_shape[1] * weight_shape[2]
    
    # initialize weights with data generated from a normal distribution.
    # Sometimes, a smaller stddev can improve the accuracy significantly. Take some trials by yourself.
    weight_init = tf.random_normal_initializer(stddev=(2.0/sizeIn)**0.5)
    W = tf.get_variable("W", weight_shape, initializer=weight_init)
    
    # initialize bias with zeros
    bias_init = tf.constant_initializer(value=0)
    b = tf.get_variable("b", bias_shape, initializer=bias_init)
    
    # Specify the stride length to be one in all directions.
    # padding='SAME': pad enough so the output has the same dimensions as the input tensor.
    return tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME'), b))
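
The initializer in `module_conv2d` uses a standard deviation of sqrt(2 / fan_in), which matches what is commonly called He initialization. The sketch below only evaluates that formula for the three filter shapes used in this notebook; the helper name `he_stddev` is an assumption for illustration.

```python
import math

def he_stddev(weight_shape):
    """Standard deviation sqrt(2 / fan_in) for a conv filter shape
    [height, width, in_channels, out_channels]."""
    fan_in = weight_shape[0] * weight_shape[1] * weight_shape[2]
    return math.sqrt(2.0 / fan_in)

for shape in ([5, 5, 1, 32], [5, 5, 32, 64], [5, 5, 64, 128]):
    print(shape, round(he_stddev(shape), 4))
```

The standard deviation shrinks as the number of inputs per filter grows, so the deeper layers start with smaller weights.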

### Define Layer Function

In [6]:
def layer(x, weight_shape, bias_shape):
    """
    Defines the network layers
    input:
        - x: input vector of the layer
        - weight_shape: shape of the weight matrix
        - bias_shape: shape of the bias vector
    output:
        - output vector of the layer after the matrix multiplication and transformation
    """
    
    weight_init = tf.random_normal_initializer(stddev=(2.0/weight_shape[0])**0.5)
    W = tf.get_variable("W", weight_shape, initializer=weight_init)
    
    bias_init = tf.constant_initializer(value=0)
    b = tf.get_variable("b", bias_shape, initializer=bias_init)
    
    return tf.nn.relu(tf.matmul(x, W) + b)

### Define Pooling Function

In [7]:
def pooling(x, k):
    """
    Extracts the main information of the conv layer by performing max pooling on the input x.
    input:
        x: A 4-D Tensor. [batch, in_height, in_width, in_channels]
        k: The length of window
    """
    
    #value: A 4-D Tensor of the format specified by data_format. That is x in this case.
    #ksize: A 1-D int Tensor of 4 elements. The size of the window for each dimension of input
    #strides: A 1-D int Tensor of 4 elements. The stride of the sliding window for each dimension of input
    #padding: A string, either 'VALID' or 'SAME'. Difference of 'VALID' and 'SAME' in tf.nn.max_pool:
    #https://stackoverflow.com/questions/37674306/what-is-the-difference-between-same-and-valid-padding-in-tf-nn-max-pool-of-t
    return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')
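
To make the 'SAME' versus 'VALID' difference concrete, here is a small sketch of the output size along one spatial dimension when the window and stride are both `k`, as in `pooling`. The function name `pool_output_size` is an assumption for illustration; the formulas are the usual TensorFlow padding rules.

```python
import math

def pool_output_size(n, k, padding):
    """Output length along one spatial dimension for a k-wide window with stride k."""
    if padding == "SAME":
        return math.ceil(n / k)
    if padding == "VALID":
        return (n - k) // k + 1
    raise ValueError("padding must be 'SAME' or 'VALID'")

size = 28
same_sizes = []
for _ in range(3):
    size = pool_output_size(size, 2, "SAME")
    same_sizes.append(size)
print(same_sizes)  # [14, 7, 4]
print(pool_output_size(7, 2, "VALID"))  # 3
```

With 'SAME' padding a 28x28 MNIST image shrinks to 14x14, 7x7 and then 4x4 over three pooling steps, whereas 'VALID' would turn the 7x7 map into 3x3.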

### Define Inference (Modified)

In [8]:
def inference(x, keep_prob):
    """
    define the structure of the whole network
    input:
        - x: a batch of pictures 
        (input shape = (batch_size*image_size))
        - keep_prob: the probability that each element is kept. For example, keep_prob=0.9 drops 10% of the input elements (dropout rate = 1 - keep_prob).
    output:
        - a batch vector corresponding to the logits predicted by the network
        (output shape = (batch_size*output_size)) 
    """

    # Reshape the flattened input into Nx28x28x1 (N examples, 1 channel because the images are grayscale)
    x = tf.reshape(x, shape=[-1, 28, 28, 1])
    
    with tf.variable_scope("convolutional_layer_1"):

        # convolutional layer with 32 filters and spatial extent e = 5
        # it takes an input volume of depth 1 and produces an output tensor with 32 channels.
        convolutional_1 = module_conv2d(x, [5, 5, 1, 32], [32])
        
        # output is passed to max-pooling to be compressed (k=2 non-overlapping).
        pooling_1 = pooling(convolutional_1, k)

    with tf.variable_scope("convolutional_layer_2"):
        
        # convolutional layer with 64 filters with spatial extent e = 5
        # taking an input tensor with depth of 32 and 
        # producing an output tensor with depth 64
        convolutional_2 = module_conv2d(pooling_1, [5, 5, 32, 64], [64])
        
        # output is passed to max-pooling to be compressed (k=2 non-overlapping).
        pooling_2 = pooling(convolutional_2, k)
    
    # new convolutional layer 3
    with tf.variable_scope("convolutional_layer_3"):
    
        # convolutional layer with a certain number of filters and kernel size
        # since the previous layer outputs 64 channels, this layer should take 64 as input channels
        # you can choose the number of filters (e.g., 128) and kernel size (e.g., 3x3 or 5x5)
        convolutional_3 = module_conv2d(pooling_2, [5, 5, 64, 128], [128])

        # max pooling for the third convolutional layer
        pooling_3 = pooling(convolutional_3, k)
#         print("DEBUG", pooling_3.get_shape())

    with tf.variable_scope("fully_connected"):
        
        # pass the output of max-pooling into a Fully_Connected layer
        # use reshape to flatten the tensor
        # We have 128 filters
        # To find the height & width after the three max-pooling steps:
        # the convolutions use padding='SAME' with stride 1, so they keep the spatial size;
        # each pooling step uses padding='SAME' with stride 2, so it maps n to ceil(n/2):
        # 28 -> 14 -> 7 -> 4
        # So the 7x7 maps after the second pooling become 4x4 after the third pooling.
        pool_3_flat = tf.reshape(pooling_3, [-1, 4*4*128]) # change here
        
        # after reshaping, use a fully-connected layer to compress
        # the flattened representation (4*4*128 = 2048 values) into a hidden layer of size 784 (28*28)
        # each feature map has a height & width of 4
        fc_1 = layer(pool_3_flat, [4*4*128, 784], [784]) # change here
        
        # apply dropout. You may try to add drop out after every pooling layer.
        # outputs the input element scaled up by 1/keep_prob
        # The scaling is so that the expected sum is unchanged
        # fc_1_drop = tf.nn.dropout(fc_1, keep_prob)
        fc_1_drop = tf.nn.dropout(fc_1, rate=1 - keep_prob)

    with tf.variable_scope("output"):
        output = layer(fc_1_drop, [784, 10], [10])

    return output
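
The dropout comments inside `inference` note that surviving activations are scaled by 1/keep_prob so that the expected sum is unchanged. A minimal NumPy sketch of that idea follows, assuming an all-ones input and an arbitrary seed; `inverted_dropout` is a hypothetical helper, not the TensorFlow implementation.

```python
import numpy as np

def inverted_dropout(activations, keep_prob, rng):
    """Zero each element with probability 1 - keep_prob and scale survivors by 1 / keep_prob."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)  # seed chosen arbitrarily for the illustration
activations = np.ones((1000, 784), dtype=np.float32)
dropped = inverted_dropout(activations, keep_prob=0.25, rng=rng)

# Survivors are scaled to 1 / 0.25 = 4, so the mean stays close to 1.
print(np.unique(dropped), dropped.mean())
```

With `keep_prob=0.25`, as fed during training in this notebook, about three quarters of the hidden units are zeroed on every step, which is a fairly aggressive setting.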

### Define Loss Function

In [9]:
def loss(output, y):
    """
    Computes softmax cross entropy between logits and labels and then the loss 
    
    input:
        - output: the output of the inference function 
        - y: true value of the sample batch
        
        the two have the same shape (batch_size * num_of_classes)
    output:
        - loss: loss of the corresponding batch (scalar tensor)
    
    """
    xentropy = tf.nn.softmax_cross_entropy_with_logits(logits=output, labels=y)  
    #xentropy = tf.compat.v1.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=output)
    loss = tf.reduce_mean(xentropy)
    return loss
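
For intuition about what `loss` computes, here is a NumPy sketch of the mean softmax cross-entropy on one-hot labels. It is a simplified stand-in written for illustration, not the TensorFlow op itself, and the function name `softmax_cross_entropy` is an assumption.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy between one-hot labels and softmax(logits)."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_example = -(labels * log_softmax).sum(axis=1)
    return per_example.mean()

# Uniform logits over 10 classes give a loss of ln(10), about 2.3026.
uniform_loss = softmax_cross_entropy(np.zeros((4, 10)), np.eye(10)[[0, 1, 2, 3]])
print(uniform_loss)
```

The value ln(10) is a useful reference point: a network that assigns equal probability to all ten digits sits at roughly 2.30.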

### Define the Optimizer and Training Target

In [10]:
def training(cost, global_step):
    """
    defines the necessary elements to train the network
    
    input:
        - cost: the loss of the corresponding batch
        - global_step: number of batches seen so far; it is incremented by one each time the train_op built by .minimize() is run
    """
    tf.summary.scalar("cost", cost)
    
    # using Adam Optimizer 
    optimizer = tf.train.AdamOptimizer(learning_rate)
    train_op = optimizer.minimize(cost, global_step=global_step)
    return train_op

### Define evaluation method

In [11]:
def evaluate(output, y):
    """
    evaluates the accuracy on the validation set 
    input:
        -output: prediction vector of the network for the validation set
        -y: true value for the validation set
    output:
        - accuracy: accuracy on the validation set (scalar between 0 and 1)
    """
    #correct prediction is a binary vector which equals one when the output and y match
    #otherwise the vector equals 0
    #tf.cast: change the type of a tensor into another one
    #then, by taking the mean of the tensor, we directly have the average score, so the accuracy
    
    correct_prediction = tf.equal(tf.argmax(output, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    tf.summary.scalar("validation_error", (1.0 - accuracy))
    return accuracy
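
The argmax comparison in `evaluate` can be reproduced on a toy batch with NumPy. The logits and labels below are made up for illustration, and `accuracy` is a hypothetical helper that mirrors the TensorFlow ops above.

```python
import numpy as np

def accuracy(logits, one_hot_labels):
    """Fraction of rows whose argmax matches the argmax of the label row."""
    predicted = np.argmax(logits, axis=1)
    actual = np.argmax(one_hot_labels, axis=1)
    return float(np.mean(predicted == actual))

# Hypothetical 3-class logits for four examples; the last one is misclassified.
toy_logits = np.array([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.0, 0.2, 3.0],
                       [1.0, 0.9, 0.8]])
toy_labels = np.eye(3)[[0, 1, 2, 2]]
toy_accuracy = accuracy(toy_logits, toy_labels)
print(toy_accuracy, 1.0 - toy_accuracy)  # accuracy 0.75, error 0.25
```

The "Validation Error" printed during training is one minus this quantity, evaluated on the validation set.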

### Main Function

In [12]:
#x = tf.placeholder("float", [None, 784])
global_step = tf.Variable(0, name='global_step', trainable=False)

In [13]:
if __name__ == '__main__':
    
    start_time = time.time()
    
    if not os.path.isdir('./logs/'):
        os.makedirs('./logs/')
    log_files_path = './logs/'
    
    with tf.Graph().as_default():

        with tf.variable_scope("MNIST_convoultional_model"):
            
            #neural network definition
            
            #the input variables are first defined as placeholders
            # a placeholder is a variable/data which will be assigned later 
            # MNIST data image of shape 28*28=784
            x = tf.placeholder("float", [None, 784]) 
            # 0-9 digits recognition
            y = tf.placeholder("float", [None, 10])  
            
            # keep probability for dropout (probability that a unit is kept)
            keep_prob = tf.placeholder(tf.float32) 
            
            #the network is defined using the inference function defined above in the code
            output = inference(x, keep_prob)
            cost = loss(output, y)
            
            #initialize the value of the global_step variable 
            # recall: it is incremented by one each time train_op (built by .minimize()) is run
            global_step = tf.Variable(0, name='global_step', trainable=False)
            
            
            train_op = training(cost, global_step)
            
            
            #evaluate the accuracy of the network (done on a validation set)
            eval_op = evaluate(output, y)
            summary_op = tf.summary.merge_all()
            saver = tf.train.Saver()
            sess = tf.Session()
            
            summary_writer = tf.summary.FileWriter(log_files_path, sess.graph)
            init_op = tf.global_variables_initializer()
            sess.run(init_op)
            
            # Training cycle
            for epoch in range(training_epochs):

                avg_cost = 0.0
                
                total_batch = int((train_num_examples+batch_size-1) / batch_size)
                
                # Loop over all batches
                for i in range(total_batch):
                    
                    start = i * batch_size
                    end = min(train_num_examples, start + batch_size)
                    minibatch_x = x_train[start:end]
                    minibatch_y = y_train[start:end]
                    
                    # Fit training using batch data
                    sess.run(train_op, feed_dict={x: minibatch_x, y: minibatch_y, keep_prob: 0.25})
                    
                    # Compute average loss
                    avg_cost += sess.run(cost, feed_dict={x: minibatch_x, y: minibatch_y, keep_prob: 0.25})/total_batch
                
                
                # Display logs per epoch step
                if epoch % display_step == 0:
                    
                    print("Epoch:", '%04d' % (epoch+1), "cost =", "{:0.9f}".format(avg_cost))
                    
                    #keep probability of 1 (no dropout) during validation
                    accuracy = sess.run(eval_op, feed_dict={x: x_valid, y:y_valid, keep_prob: 1})
                    print("Validation Error:", (1 - accuracy))
                    
                    # keep probability of 0.25 (dropout rate 0.75) during training
                    summary_str = sess.run(summary_op, feed_dict={x: minibatch_x, y: minibatch_y, keep_prob: 0.25})
                    summary_writer.add_summary(summary_str, sess.run(global_step))
                    
                    saver.save(sess, log_files_path + 'model-checkpoint', global_step=global_step)
                    
            print("Optimization Done")
                    
            accuracy = sess.run(eval_op, feed_dict={x: x_test, y: y_test, keep_prob: 1})
            print("Test Accuracy:", accuracy)
                    
        elapsed_time = time.time() - start_time
        print('Execution time was %0.3f' % elapsed_time)

[5, 5, 1, 32]
[5, 5, 32, 64]
[5, 5, 64, 128]

Epoch: 0001 cost = 1.575854381
Validation Error: 0.24150002002716064
Epoch: 0002 cost = 0.520235928
Validation Error: 0.06983333826065063
Epoch: 0003 cost = 0.308748039
Validation Error: 0.030166685581207275
Epoch: 0004 cost = 0.116773811
Validation Error: 0.02033334970474243
Epoch: 0005 cost = 0.050859410
Validation Error: 0.015999972820281982
Epoch: 0006 cost = 0.046319137
Validation Error: 0.016499996185302734
Epoch: 0007 cost = 0.045119742
Validation Error: 0.014833331108093262
Epoch: 0008 cost = 0.041093259
Validation Error: 0.012333333492279053
Epoch: 0009 cost = 0.037754454
Validation Error: 0.014500021934509277
Epoch: 0010 cost = 0.037692089
Validation Error: 0.015999972820281982
Epoch: 0011 cost = 0.038551400
Validation Error: 0.013166666030883789
Epoch: 0012 cost = 0.033574530
Validation Error: 0.01466667652130127
Epoch: 0013 cost = 0.041678301
Validation Error: 0.01583331823348999
Epoch: 0014
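
The mini-batch bookkeeping in the training loop above can be checked in isolation. This sketch repeats the ceiling division and the `start`/`end` slicing for the 54,000 training examples and batch size 200 used here; the generator name `batch_bounds` is an assumption for illustration.

```python
def batch_bounds(n_examples, batch_size):
    """Yield (start, end) index pairs covering n_examples in order."""
    total_batch = (n_examples + batch_size - 1) // batch_size  # ceiling division
    for i in range(total_batch):
        start = i * batch_size
        yield start, min(n_examples, start + batch_size)

bounds = list(batch_bounds(54000, 200))
steps_per_epoch = len(bounds)
total_steps = steps_per_epoch * 50  # training_epochs = 50
print(steps_per_epoch, total_steps, bounds[-1])

# A size that does not divide evenly leaves a shorter final batch.
print(list(batch_bounds(450, 200)))
```

So each epoch performs 270 optimizer steps, and a full 50-epoch run performs 13,500.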

#### Evaluation

1. **Accuracy Comparison**:

* **Two Convolutional Layers**: The model achieved a high test accuracy of approximately 98.82%. The validation error decreased steadily over epochs, indicating that the model was learning effectively.

* **Three Convolutional Layers**: With the addition of the third layer, the model also achieved a high test accuracy of approximately 98.76%. The validation error similarly decreased over time, but the overall training process showed higher initial error rates compared to the two-layer model.

2. **Training Dynamics**:

* **Learning Curve**: The three-layer model initially had higher validation errors and required more epochs to reach a low error rate. This suggests that the added complexity of the model might have made the initial learning process slightly more challenging.

* **Convergence**: Both models eventually converged to a low validation error, but the three-layer model showed a bit more fluctuation in validation error, which might indicate sensitivity to training parameters or a higher propensity for overfitting.

3. **Overfitting and Model Stability**:

* **Potential Overfitting**: Deeper models such as the three-layer CNN are often more prone to overfitting, which occurs when the model learns the noise in the training data instead of generalizing from the patterns. Here, however, the three-layer model actually has fewer parameters than the two-layer one (see part (b)), and overfitting would show up as a gap between training and validation performance, so it is unlikely to explain the occasional runs with extremely low accuracy.

* **Result Stability**: The variability in results across different runs suggests that the three-layer model's performance is less stable than the two-layer model's. In some runs, I got an accuracy of only single digits (in percent), which is roughly chance level for ten classes and points to a failure to train rather than to poor generalization. One possible cause is that `layer` also applies a ReLU to the output logits: if an unlucky initialization or a large update drives all logits to zero or below, the gradient through that ReLU vanishes and the network can get stuck.

4. **Impact on Accuracy**:

* **Performance Gain**: The addition of the third layer did not result in a significant improvement in accuracy. Both models achieved similar high accuracies, indicating that for the MNIST dataset, a two-layer model is quite sufficient.

* **Computational Efficiency**: The three-layer model is more computationally intensive due to the additional layer. Without a corresponding increase in accuracy, the extra computational cost may not be justified.

5. **Conclusion**:

* **Model Complexity vs. Dataset Complexity**: The MNIST dataset, being relatively simple, may not require very deep networks. The additional layer did not contribute significantly to accuracy improvement, suggesting that the two-layer model is already quite capable of capturing the relevant features in the data.

* **Randomness in Training**: Deep learning models are subject to randomness in weight initialization, mini-batch selection during training, etc. This randomness can lead to variability in training results, especially in more complex models.

* **Recommendation**: For similar tasks with comparable dataset complexity, starting with a simpler model (like the two-layer CNN) might be more efficient. Deeper models can be considered if the task complexity increases or if the simpler models plateau in performance.

------------------

**(b) what are the number of parameters we are trying to learn in the original code and the new one with an extra layer?**

## Parameters in a Convolutional Layer:
   - For a convolutional layer, the number of parameters is determined by the size of the filters (or kernels), the number of filters, and the number of input channels. The formula is:
   - $\text{Parameters} = (\text{Filter Height} \times \text{Filter Width} \times \text{Input Channels} + 1) \times \text{Number of Filters}$
   - The "+1" accounts for the bias term for each filter.
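
The formula above, together with the analogous count for a fully connected layer (weights plus one bias per output unit), can be written as two small helpers. The function names are assumptions for illustration.

```python
def conv_params(filter_height, filter_width, in_channels, n_filters):
    """Weights plus one bias per filter."""
    return (filter_height * filter_width * in_channels + 1) * n_filters

def dense_params(n_inputs, n_outputs):
    """Weight matrix plus one bias per output unit."""
    return n_inputs * n_outputs + n_outputs

print(conv_params(5, 5, 1, 32))   # 832
print(dense_params(784, 10))      # 7850
```

Note that the convolutional count does not depend on the image size, while the fully connected count depends directly on the size of the flattened input.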

## Original CNN (Two Convolutional Layers):

1. **First Convolutional Layer**: 
   - Filters: 32, Filter Size: 5x5, Input Channels: 1 (grayscale image)
   - Parameters: $(5 \times 5 \times 1 + 1) \times 32 = 832$

2. **Second Convolutional Layer**: 
   - Filters: 64, Filter Size: 5x5, Input Channels: 32 (output from first layer)
   - Parameters: $(5 \times 5 \times 32 + 1) \times 64 = 51,264$

3. **First Fully Connected Layer**: 
   - Assuming the image size reduces to 7x7 after pooling, and there are 64 output channels from the last convolutional layer
   - Input Units: $7 \times 7 \times 64$, Output Units: 784 (as in the model above)
   - Parameters: $(7 \times 7 \times 64 \times 784) + 784 = 2,459,408$

4. **Output Layer**: 
   - Input Units: 784, Output Units: 10 (number of classes for MNIST)
   - Parameters: $784 \times 10 + 10 = 7,850$
   
5. **Total Parameters in the Original CNN**:
   - $\text{Total} = 832 + 51,264 + 2,459,408 + 7,850 = 2,519,354$

## Modified CNN (Three Convolutional Layers):

1. **First and Second Convolutional Layers**: 
   - The calculations remain the same as the original model.

2. **Third Convolutional Layer**: 
   - Filters: 128, Filter Size: 5x5, Input Channels: 64
   - Parameters: $(5 \times 5 \times 64 + 1) \times 128 = 204,928$

3. **First Fully Connected Layer** (adjusted for the output of the third convolutional layer):
   - Assuming the image size reduces to 4x4 after the third pooling layer
   - Input Units: $4 \times 4 \times 128$, Output Units: 784
   - Parameters: $(4 \times 4 \times 128 \times 784) + 784 = 1,606,416$

4. **Output Layer**: 
   - The calculation remains the same as the original model.
   - Input Units: 784, Output Units: 10 (number of classes for MNIST)
   - Parameters: $784 \times 10 + 10 = 7,850$

5. **Total Parameters in the New CNN**:

   - The total number of parameters for each model can be calculated by summing up the parameters from all the convolutional layers and fully connected layers.
   - $\text{Total} = 832 + 51,264 + 204,928 + 1,606,416 + 7,850 = 1,871,290$
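
It may look surprising that adding a layer lowers the total. The sketch below tabulates both models per layer and shows where the difference comes from; the dictionary names are assumptions for illustration, and the numbers are the ones derived above.

```python
layers_original = {
    "conv1": (5 * 5 * 1 + 1) * 32,
    "conv2": (5 * 5 * 32 + 1) * 64,
    "fc1": 7 * 7 * 64 * 784 + 784,
    "output": 784 * 10 + 10,
}
layers_modified = {
    "conv1": layers_original["conv1"],
    "conv2": layers_original["conv2"],
    "conv3": (5 * 5 * 64 + 1) * 128,
    "fc1": 4 * 4 * 128 * 784 + 784,
    "output": layers_original["output"],
}

total_original = sum(layers_original.values())
total_modified = sum(layers_modified.values())
fc1_saving = layers_original["fc1"] - layers_modified["fc1"]
net_change = total_modified - total_original
print(total_original, total_modified, fc1_saving, net_change)
```

The third convolutional layer adds 204,928 parameters, but the extra pooling step shrinks the flattened input of the first fully connected layer from 3,136 to 2,048 values, which saves 852,992 parameters, for a net decrease of 648,064.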


-------------

In [17]:
(5*5*1+1)*32

832

In [18]:
(5*5*32+1)*64

51264

In [19]:
(7*7*64*784)+784

2459408

In [20]:
784*10+10

7850

In [21]:
832+51264+2459408+7850

2519354

----------------------------

In [22]:
(5*5*64+1)*128

204928

In [23]:
(4*4*128*784)+784

1606416

In [24]:
784*10+10

7850

In [26]:
832+51264+204928+1606416+7850

1871290

--------------

## Problem 7 (Batch Normalization)

**Problem 7: For Problem 6, assess the impact of batch normalization on learning (speed & accuracy)?**