
## Yingfei Fan (yf2549) COMS E6998 section 012 HW3

In [7]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Flatten
import time 

from keras.datasets import cifar10

## Problem 1 - Adaptive Learning Rate Methods, CIFAR-10

We will consider five methods (AdaGrad, RMSProp, RMSProp+Nesterov, AdaDelta, and Adam) and study their convergence on the CIFAR-10 dataset. 
We will use a multi-layer neural network with two fully connected hidden layers of 1000 hidden units each, ReLU activations, and a minibatch size of 128.

1. Write the weight update equations for the five adaptive learning rate methods. Explain each term
clearly. What are the hyperparameters in each policy? Explain how AdaDelta and Adam are different
from RMSProp. 


**Solution:**

1.AdaGrad: 
$$\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{diag(G_{t}) + \epsilon I}}  g_{t}$$

* $\theta_t$: the parameters being trained, at time step $t$.
* $g_t$: the gradient of the loss with respect to $\theta$ at time step $t$.
* diag($G_t$): a diagonal matrix whose $(i, i)$ entry is the sum of the squares of the gradients with respect to $\theta_i$ up to time step $t$. The square root and the division are applied element-wise.
* Hyperparameters:
  -  $\eta$: the learning rate.
  -  $\epsilon$: a smoothing term that prevents division by $0$.
* As a result, AdaGrad uses a different effective learning rate for every parameter $\theta_i$ at every time step $t$.
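The AdaGrad rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Keras implementation: the function name `adagrad_step`, the accumulator name `accum`, and the toy objective $f(\theta) = \sum_i \theta_i^2$ are assumptions made for the example.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-7):
    """One AdaGrad update; `accum` holds the running sum of squared gradients."""
    accum = accum + grad ** 2                         # diagonal of G_t
    theta = theta - lr / np.sqrt(accum + eps) * grad  # per-parameter step size
    return theta, accum

# Toy usage: minimize f(theta) = sum(theta^2), whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
accum = np.zeros_like(theta)
for _ in range(3):
    theta, accum = adagrad_step(theta, 2.0 * theta, accum)
print(theta, accum)
```

Because `accum` only grows, the factor `lr / np.sqrt(accum + eps)` can only shrink over time.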

2.RMSProp: 
$$\begin{align} 
\begin{split} 
E[g^2]_t &= 0.9 E[g^2]_{t-1} + 0.1 g^2_t \\ 
\theta_{t+1} &= \theta_{t} - \dfrac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_{t} 
\end{split} 
\end{align}$$

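* $\theta_t$, $g_t$, $\eta$ and $\epsilon$ have the same meaning as above.
* $E[g^2]_t$: an exponentially decaying average of the squared gradients, which replaces AdaGrad's ever-growing sum.
* Hyperparameters: the learning rate $\eta$, the decay rate of the running average (fixed to $0.9$ in the equation above), and $\epsilon$.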
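A minimal NumPy sketch of the RMSProp update follows. The names `rmsprop_step`, `avg_sq` and `rho` are chosen for illustration; `rho=0.9` mirrors the constant in the equation.

```python
import numpy as np

def rmsprop_step(theta, grad, avg_sq, lr=0.01, rho=0.9, eps=1e-7):
    """One RMSProp update; `avg_sq` is the running average E[g^2]."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2    # decaying average of g^2
    theta = theta - lr / np.sqrt(avg_sq + eps) * grad  # scale step by RMS of gradients
    return theta, avg_sq

# Toy usage on f(theta) = sum(theta^2), gradient 2 * theta.
theta = np.array([1.0, -2.0])
avg_sq = np.zeros_like(theta)
for _ in range(3):
    theta, avg_sq = rmsprop_step(theta, 2.0 * theta, avg_sq)
print(theta, avg_sq)
```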

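3.RMSProp+Nesterov: 
$$\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left(\beta_1 \hat{m}_t + \dfrac{(1 - \beta_1) g_t}{1 - \beta_1^t}\right)$$

* $\hat{m}_t$ and $\hat{v}_t$: the bias-corrected estimates of the first and second moments of the gradients (the same quantities used by Adam in item 5).
* $\beta_1$: the decay rate of the first-moment (momentum) estimate; $\beta_1^t$ is $\beta_1$ raised to the power $t$.
* This is the Nadam form of the update, i.e. Adam with Nesterov momentum: the term in parentheses looks ahead by combining the momentum estimate with the current gradient.
* Hyperparameters: $\eta$, $\beta_1$, $\beta_2$ (the decay rate of the second-moment estimate) and $\epsilon$.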
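The update in item 3 can be written out directly in NumPy. This sketch follows the equation as stated here and is not meant to reproduce the Keras `Nadam` optimizer exactly; `nadam_step` and its argument names are illustrative.

```python
import numpy as np

def nadam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Nadam-style update; `t` is the 1-based step counter."""
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)              # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    # Nesterov look-ahead: blend the momentum estimate with the current gradient.
    lookahead = beta1 * m_hat + (1.0 - beta1) * grad / (1.0 - beta1 ** t)
    theta = theta - lr / (np.sqrt(v_hat) + eps) * lookahead
    return theta, m, v

theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 4):
    theta, m, v = nadam_step(theta, 2.0 * theta, m, v, t)
print(theta)
```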

4.AdaDelta: 
$$\begin{align} 
\begin{split} 
\Delta \theta_t &= - \dfrac{RMS[\Delta \theta]_{t-1}}{RMS[g]_{t}} g_{t} \\ 
\theta_{t+1} &= \theta_t + \Delta \theta_t 
\end{split} 
\end{align}$$

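* $RMS[x]_t = \sqrt{E[x^2]_t + \epsilon}$ is the root mean square of $x$, where $E[x^2]_t$ is an exponentially decaying average of $x^2$.
* $RMS[g]_t$ is computed from the gradients, and $RMS[\Delta \theta]_{t-1}$ from the previous parameter updates.
* Instead of inefficiently storing the $w$ previous squared gradients, the running average is defined recursively as a decaying average of all past squared gradients.
* Hyperparameters: the decay rate $\rho$ of the running averages and $\epsilon$. The update rule itself contains no learning rate $\eta$.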
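The two running averages in AdaDelta are easy to mix up, so here is a minimal NumPy sketch of one step. The names `adadelta_step`, `avg_sq_grad` and `avg_sq_delta` are illustrative, and the defaults `rho=0.95`, `eps=1e-6` are assumptions for the example.

```python
import numpy as np

def adadelta_step(theta, grad, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """One AdaDelta update; note that no learning rate appears."""
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad ** 2      # E[g^2]_t
    rms_delta = np.sqrt(avg_sq_delta + eps)                        # RMS[delta theta]_{t-1}
    rms_grad = np.sqrt(avg_sq_grad + eps)                          # RMS[g]_t
    delta = -rms_delta / rms_grad * grad
    avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta ** 2   # E[delta theta^2]_t
    return theta + delta, avg_sq_grad, avg_sq_delta

theta = np.array([1.0])
avg_sq_grad, avg_sq_delta = np.zeros(1), np.zeros(1)
for _ in range(3):
    theta, avg_sq_grad, avg_sq_delta = adadelta_step(
        theta, 2.0 * theta, avg_sq_grad, avg_sq_delta)
print(theta)
```

The first steps are tiny because `avg_sq_delta` starts at zero, so the numerator is initially about `sqrt(eps)`.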

5.Adam: 
$$\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

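* $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$: the estimate of the first moment (the mean) of the gradients.
* $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$: the estimate of the second moment (the uncentered variance) of the gradients.
* $\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$: bias-corrected estimates, which compensate for $m_t$ and $v_t$ being initialized at zero.
* Hyperparameters: $\eta$, the decay rates $\beta_1$ and $\beta_2$, and $\epsilon$.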
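As a sketch of how the moment estimates and bias correction fit together, here is one Adam step in NumPy. The function name `adam_step` and the toy objective are assumptions for illustration.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update; `t` is the 1-based step counter."""
    m = beta1 * m + (1.0 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second moment (mean of squared gradients)
    m_hat = m / (1.0 - beta1 ** t)              # bias correction for zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr / (np.sqrt(v_hat) + eps) * m_hat
    return theta, m, v

theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 4):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)
print(theta)
```

On the very first step the bias-corrected ratio is `grad / |grad|`, so the parameter moves by almost exactly `lr` regardless of the gradient's scale.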



6. How AdaDelta and Adam are different from RMSProp?

* AdaGrad is the starting point for all three methods:  
  * It adapts the learning rate to the parameters, performing smaller updates
  (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. 
  * For this reason, it is well suited to sparse data.
  * AdaGrad's main weakness is its accumulation of the squared gradients in the denominator: since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm can no longer learn. The following algorithms aim to resolve this flaw.

* RMSProp and AdaDelta were developed independently around the same time, both to resolve AdaGrad's radically diminishing learning rates.

* RMSProp divides the learning rate by an exponentially decaying average of squared gradients, so old gradients are gradually forgotten and the effective learning rate does not decay to zero.

* AdaDelta is an extension of AdaGrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, AdaDelta restricts the window of accumulated past gradients to some fixed size $w$, implemented as a decaying average just as in RMSProp. Unlike RMSProp, it also replaces the global learning rate $\eta$ with $RMS[\Delta \theta]_{t-1}$, so its update rule has no learning rate to set.

* Adam:
In addition to storing an exponentially decaying average of past squared gradients $v_t$ like AdaDelta and RMSProp, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum, and applies bias correction to both estimates. RMSProp uses the raw gradient $g_t$ in its update and has no bias correction.
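AdaGrad's shrinking step size, and the way a decaying average avoids it, can be checked numerically. The sketch below assumes a constant gradient of 1 for 1000 steps, an artificial setting chosen only to make the effect visible; all variable names are illustrative.

```python
import numpy as np

lr, rho, eps = 0.1, 0.9, 1e-8
accum = 0.0    # AdaGrad: running sum of squared gradients
avg_sq = 0.0   # RMSProp: decaying average of squared gradients
adagrad_lr, rmsprop_lr = [], []

for t in range(1, 1001):
    g = 1.0  # constant gradient (assumption for the demonstration)
    accum += g ** 2
    avg_sq = rho * avg_sq + (1.0 - rho) * g ** 2
    adagrad_lr.append(lr / np.sqrt(accum + eps))
    rmsprop_lr.append(lr / np.sqrt(avg_sq + eps))

print("AdaGrad effective lr at t=1000:", adagrad_lr[-1])
print("RMSProp effective lr at t=1000:", rmsprop_lr[-1])
```

With a constant unit gradient, AdaGrad's effective learning rate falls as $\eta / \sqrt{t}$, while RMSProp's settles near $\eta$.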

Reference: https://ruder.io/optimizing-gradient-descent/index.html#adagrad


2. Train the neural network using all the five methods with L2-regularization for 200 epochs each and plot
the training loss vs number of epochs. Which method performs best (lowest training loss) ?

In [3]:

(X_train, y_train), (X_test, y_test) = cifar10.load_data()

y_train = tf.keras.utils.to_categorical(y_train)
y_test = tf.keras.utils.to_categorical(y_test)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255.0
X_test /= 255.0

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz


In [4]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((50000, 32, 32, 3), (50000, 10), (10000, 32, 32, 3), (10000, 10))

In [8]:
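# Model. Re-run this cell before each optimizer: compile() does not reset the weights,
# so reusing `model` would continue training from the previous optimizer's result.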
model = Sequential()
model.add(Flatten())
model.add(Dense(1000, activation='relu', kernel_regularizer='l2', kernel_initializer='HeNormal'))
model.add(Dense(1000, activation='relu', kernel_regularizer='l2', kernel_initializer='HeNormal'))
model.add(Dense(10, activation='softmax'))
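Because `compile()` keeps the current weights, a fair comparison needs a freshly initialized network for every optimizer. One way to arrange that is a small factory function. The sketch below is an assumption about how this could be organized, not the code used to produce the logs in this notebook; `build_mlp` is a hypothetical helper, and the dropout rates are the ones given in part 3 of the problem.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_mlp(dropout=False):
    """Return a new, uncompiled MLP with freshly initialized weights."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(32, 32, 3)))   # CIFAR-10 image shape
    model.add(layers.Flatten())
    if dropout:
        model.add(layers.Dropout(0.2))          # input-layer dropout
    for _ in range(2):
        model.add(layers.Dense(1000, activation='relu',
                               kernel_regularizer='l2',
                               kernel_initializer='he_normal'))
        if dropout:
            model.add(layers.Dropout(0.5))      # hidden-layer dropout
    model.add(layers.Dense(10, activation='softmax'))
    return model

# Each call gives an independent model, so optimizers do not share weights.
model_a = build_mlp()
model_b = build_mlp(dropout=True)
print(model_a.count_params())
```

Each training cell could then start with `model = build_mlp()` before calling `compile` and `fit`.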

In [None]:
#Adagrad
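# Pass the configured optimizer object to compile(); the string 'Adagrad' would
# create a new optimizer with default settings and ignore the one configured here.
opt_adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01, initial_accumulator_value=0.1, epsilon=1e-07, name='Adagrad')
model.compile(loss='categorical_crossentropy', optimizer=opt_adagrad, metrics=['accuracy'])
s_time = time.time()
# NOTE: batch_size=64 gives 782 steps per epoch; the problem statement asks for a minibatch size of 128.
history_adagrad = model.fit(X_train, y_train, batch_size=64, epochs=200, validation_data = (X_test, y_test), shuffle=True, verbose=2)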
e_time = time.time()
print('Training Time:', e_time - s_time)

Epoch 1/200
782/782 - 49s - loss: 6.2790 - accuracy: 0.4649 - val_loss: 6.0938 - val_accuracy: 0.4439
Epoch 2/200
782/782 - 48s - loss: 5.8510 - accuracy: 0.4699 - val_loss: 5.7413 - val_accuracy: 0.4189
Epoch 3/200
782/782 - 48s - loss: 5.4702 - accuracy: 0.4710 - val_loss: 5.3224 - val_accuracy: 0.4550
Epoch 4/200
782/782 - 49s - loss: 5.1280 - accuracy: 0.4728 - val_loss: 5.0068 - val_accuracy: 0.4447
Epoch 5/200
782/782 - 49s - loss: 4.8225 - accuracy: 0.4755 - val_loss: 4.7096 - val_accuracy: 0.4535
Epoch 6/200
782/782 - 49s - loss: 4.5456 - accuracy: 0.4755 - val_loss: 4.4419 - val_accuracy: 0.4572
Epoch 7/200
782/782 - 49s - loss: 4.2958 - accuracy: 0.4784 - val_loss: 4.2108 - val_accuracy: 0.4586
Epoch 8/200
782/782 - 48s - loss: 4.0705 - accuracy: 0.4779 - val_loss: 4.0272 - val_accuracy: 0.4447
Epoch 9/200
782/782 - 49s - loss: 3.8670 - accuracy: 0.4805 - val_loss: 3.8010 - val_accuracy: 0.4622
Epoch 10/200
782/782 - 48s - loss: 3.6819 - accuracy: 0.4816 - val_loss: 3.6211 - 

In [None]:
#RMSProp
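# Pass the configured optimizer object; the string 'RMSProp' would use default settings.
opt_rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.01, rho=0.9, momentum=0.0, epsilon=1e-07, centered=False, name="RMSprop")
model.compile(loss='categorical_crossentropy', optimizer=opt_rmsprop, metrics=['accuracy'])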
s_time = time.time()
history_RMSProp = model.fit(X_train, y_train, batch_size=64, epochs=200, validation_data = (X_test, y_test), shuffle=True, verbose=2)
e_time = time.time()
print('Training Time:', e_time - s_time)

In [None]:
#RMSProp + Nesterov
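# Keras RMSprop has no Nesterov option, so Nadam (Adam with Nesterov momentum) is used here.
# Pass the configured optimizer object; the string 'Nadam' would use default settings.
opt_nadam = tf.keras.optimizers.Nadam(learning_rate=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name="Nadam")
model.compile(loss='categorical_crossentropy', optimizer=opt_nadam, metrics=['accuracy'])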
s_time = time.time()
history_RMSProp_Nesterov = model.fit(X_train, y_train, batch_size=64, epochs=200, validation_data = (X_test, y_test), shuffle=True, verbose=2)
e_time = time.time()
print('Training Time:', e_time - s_time)

In [None]:
#AdaDelta
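# Pass the configured optimizer object; the string 'Adadelta' would use default settings.
opt_adadelta = tf.keras.optimizers.Adadelta(learning_rate=0.01, rho=0.95, epsilon=1e-07, name="Adadelta")
model.compile(loss='categorical_crossentropy', optimizer=opt_adadelta, metrics=['accuracy'])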
s_time = time.time()
history_AdaDelta = model.fit(X_train, y_train, batch_size=64, epochs=200, validation_data = (X_test, y_test), shuffle=True, verbose=2)
e_time = time.time()
print('Training Time:', e_time - s_time)

In [None]:
#Adam
# Pass the configured optimizer object; the string 'Adam' would use default settings.
opt_adam = tf.keras.optimizers.Adam(learning_rate=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name="Adam")
model.compile(loss='categorical_crossentropy', optimizer=opt_adam, metrics=['accuracy'])
s_time = time.time()
history_Adam = model.fit(X_train, y_train, batch_size=64, epochs=200, validation_data = (X_test, y_test), shuffle=True, verbose=2)
e_time = time.time()
print('Training Time:', e_time - s_time)

In [None]:
#Plot
fig = plt.figure(figsize=(10,8))
plt.plot(history_adagrad.history['loss'], label='AdaGrad')
plt.plot(history_RMSProp.history['loss'], label='RMSProp')
plt.plot(history_RMSProp_Nesterov.history['loss'], label='RMSProp+Nesterov')
plt.plot(history_AdaDelta.history['loss'], label='AdaDelta')
plt.plot(history_Adam.history['loss'], label='Adam')

plt.title('Epoch vs training loss no dropout')
plt.ylabel('loss')
plt.xlabel('epoch')

plt.legend(loc=0)
plt.show()

3. Add dropout (probability 0.2 for input layer and 0.5 for hidden layers) and train the neural network again using all the five methods for 200 epochs. Compare the training loss with that in part 2. Which method performs the best ? For the five methods, compare their training time (to finish 200 epochs with dropout) to the training time in part 2 (to finish 200 epochs without dropout).

In [None]:
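# Model with dropout. Re-run this cell before each optimizer: compile() does not reset
# the weights, so reusing `model_dropout` would continue from the previous run.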
model_dropout = Sequential()
model_dropout.add(Flatten())
model_dropout.add(Dropout(0.2))
model_dropout.add(Dense(1000, activation='relu', kernel_regularizer='l2', kernel_initializer='HeNormal'))
model_dropout.add(Dropout(0.5))
model_dropout.add(Dense(1000, activation='relu', kernel_regularizer='l2', kernel_initializer='HeNormal'))
model_dropout.add(Dropout(0.5))
model_dropout.add(Dense(10, activation='softmax'))

In [None]:
#Adagrad
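# Pass the configured optimizer object; the string 'Adagrad' would use default settings.
opt_adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01, initial_accumulator_value=0.1, epsilon=1e-07, name='Adagrad')
model_dropout.compile(loss='categorical_crossentropy', optimizer=opt_adagrad, metrics=['accuracy'])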
s_time = time.time()
history_adagrad_dropout = model_dropout.fit(X_train, y_train, batch_size=64, epochs=200, validation_data = (X_test, y_test), shuffle=True, verbose=2)
e_time = time.time()
print('Training Time:', e_time - s_time)

In [None]:
#RMSProp
# Pass the configured optimizer object; the string 'RMSProp' would use default settings.
opt_rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.01, rho=0.9, momentum=0.0, epsilon=1e-07, centered=False, name="RMSprop")
model_dropout.compile(loss='categorical_crossentropy', optimizer=opt_rmsprop, metrics=['accuracy'])
s_time = time.time()
history_RMSProp_dropout = model_dropout.fit(X_train, y_train, batch_size=64, epochs=200, validation_data = (X_test, y_test), shuffle=True, verbose=2)
e_time = time.time()
print('Training Time:', e_time - s_time)

In [None]:
#RMSProp + Nesterov
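# Keras RMSprop has no Nesterov option, so Nadam (Adam with Nesterov momentum) is used here.
# Pass the configured optimizer object; the string 'Nadam' would use default settings.
opt_nadam = tf.keras.optimizers.Nadam(learning_rate=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name="Nadam")
model_dropout.compile(loss='categorical_crossentropy', optimizer=opt_nadam, metrics=['accuracy'])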
s_time = time.time()
history_RMSProp_Nesterov_dropout = model_dropout.fit(X_train, y_train, batch_size=64, epochs=200, validation_data = (X_test, y_test), shuffle=True, verbose=2)
e_time = time.time()
print('Training Time:', e_time - s_time)

In [None]:
#AdaDelta
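# Pass the configured optimizer object; the string 'Adadelta' would use default settings.
opt_adadelta = tf.keras.optimizers.Adadelta(learning_rate=0.01, rho=0.95, epsilon=1e-07, name="Adadelta")
model_dropout.compile(loss='categorical_crossentropy', optimizer=opt_adadelta, metrics=['accuracy'])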
s_time = time.time()
history_AdaDelta_dropout = model_dropout.fit(X_train, y_train, batch_size=64, epochs=200, validation_data = (X_test, y_test), shuffle=True, verbose=2)
e_time = time.time()
print('Training Time:', e_time - s_time)

In [None]:
#Adam
# Pass the configured optimizer object; the string 'Adam' would use default settings.
opt_adam = tf.keras.optimizers.Adam(learning_rate=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name="Adam")
model_dropout.compile(loss='categorical_crossentropy', optimizer=opt_adam, metrics=['accuracy'])
s_time = time.time()
history_Adam_dropout = model_dropout.fit(X_train, y_train, batch_size=64, epochs=200, validation_data = (X_test, y_test), shuffle=True, verbose=2)
e_time = time.time()
print('Training Time:', e_time - s_time)

In [None]:
#Plot
fig = plt.figure(figsize=(10,8))
plt.plot(history_adagrad_dropout.history['loss'], label='AdaGrad')
plt.plot(history_RMSProp_dropout.history['loss'], label='RMSProp')
plt.plot(history_RMSProp_Nesterov_dropout.history['loss'], label='RMSProp+Nesterov')
plt.plot(history_AdaDelta_dropout.history['loss'], label='AdaDelta')
plt.plot(history_Adam_dropout.history['loss'], label='Adam')

plt.title('Epoch vs training loss with dropout')
plt.ylabel('loss')
plt.xlabel('epoch')

plt.legend(loc=0)
plt.show()

4. Compare test accuracy of trained model for all the five methods from part 2 and part 3. Note that to
calculate test accuracy of model trained using dropout you need to appropriately scale the weights (by
the dropout probability). 

**Solution:**

Test accuracy for model without dropout: 
* Adagrad: 
* RMSProp: 
* RMSProp+N: 
* AdaDelta: 
* Adam: 

Test accuracy for model with dropout: 
* Adagrad:
* RMSProp:
* RMSProp+N: 
* AdaDelta: 
* Adam: 

* The test accuracy for non-dropout is better.
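On the weight-scaling note in the question: the Keras `Dropout` layer uses inverted dropout, meaning the kept activations are scaled up by `1 / (1 - rate)` during training, so no extra scaling is needed when evaluating with `model.evaluate`. The NumPy sketch below contrasts this with the classic formulation, where activations (or outgoing weights) are scaled by the keep probability `1 - rate` at test time. Function names and the all-ones input are illustrative assumptions.

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Training-time dropout that rescales kept units by 1 / (1 - rate)."""
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep
    return x * mask / keep

def classic_dropout(x, rate, rng):
    """Training-time dropout without rescaling; needs test-time scaling instead."""
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep
    return x * mask

rng = np.random.default_rng(0)
x = np.ones(200_000)
rate = 0.5

train_inverted = inverted_dropout(x, rate, rng).mean()  # close to x.mean()
train_classic = classic_dropout(x, rate, rng).mean()    # close to (1 - rate) * x.mean()
test_classic = (x * (1.0 - rate)).mean()                # classic test-time scaling

print(train_inverted, train_classic, test_classic)
```

In both formulations the expected activation seen during training matches the one used at test time; they differ only in where the scaling is applied.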

## Problem 2 - Learning Rate, Batch Size, FashionMNIST

Recall the cyclical learning rate policy discussed in Lecture 4. The learning rate changes in a cyclical manner between $lr_{min}$ and $lr_{max}$, which are hyperparameters that need to be specified. For this problem you first need to read carefully the article referenced below, as you will be making use of the code there (in Keras) and modifying it as needed. For those who want to work in PyTorch, there are open source implementations of this policy available which you can easily search for and build on. You will work with the FashionMNIST dataset and MiniGoogLeNet (described in the reference). If you cannot get the MiniGoogLeNet code from the reference, you can do this question using LeNet.
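For orientation, the widely used triangular form of the cyclical policy can be sketched as a plain function of the iteration count. This is an illustration under assumptions, not the code from the referenced article: `triangular_lr` and `step_size` (the number of iterations in half a cycle) are names chosen here, and the example bounds are arbitrary.

```python
import math

def triangular_lr(iteration, lr_min, lr_max, step_size):
    """Learning rate that rises linearly from lr_min to lr_max and back."""
    cycle = math.floor(1 + iteration / (2 * step_size))  # 1-based cycle index
    x = abs(iteration / step_size - 2 * cycle + 1)       # distance from the cycle's peak
    return lr_min + (lr_max - lr_min) * max(0.0, 1.0 - x)

# One full cycle with step_size=4: up for 4 iterations, down for 4.
schedule = [triangular_lr(i, 1e-3, 1e-1, 4) for i in range(9)]
print(schedule)
```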

1. Summarize FashionMNIST dataset, total dataset size, training set size, validation set size, number of
classes, number of images per class. Show any 3 representative images from any 3 classes in the dataset.

In [None]:
(X_train, y_train), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
print(np.unique(y_train,return_counts=True))
print(np.unique(y_test,return_counts=True))

In [None]:
fig = plt.figure(figsize=(10, 8))
plt.subplot(1,3,1)
plt.imshow(X_train[0])
plt.colorbar()
plt.subplot(1,3,2)
plt.imshow(X_train[10])
plt.colorbar()
plt.subplot(1,3,3)
plt.imshow(X_train[20])
plt.colorbar()

plt.show()

2. Fix batch size to 64 and start with 10 candidate learning rates between $10^{-9}$ and $10^1$ and train your
model for 5 epochs. Plot the training loss as a function of learning rate. You should see a curve like
Figure 3 in reference below. From that figure identify the values of $lr_{min}$ and $lr_{max}$.
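Since the range spans ten orders of magnitude, the candidate learning rates are most naturally spaced evenly on a log scale. A minimal sketch with NumPy follows; the variable name `candidate_lrs` is illustrative.

```python
import numpy as np

# 10 learning rates spaced evenly in log10 between 1e-9 and 1e1 (inclusive).
candidate_lrs = np.logspace(-9, 1, num=10)

for lr in candidate_lrs:
    print(f"{lr:.3e}")
```

Each candidate would then be used to train a freshly initialized model for 5 epochs with batch size 64, recording the final training loss; plotting loss against learning rate on a logarithmic x-axis (e.g. `plt.xscale('log')`) makes the usable range easier to read off.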