# Optimizers

## Momentum

One disadvantage of SGD is that its update direction depends entirely on the current batch, so the updates are noisy and unstable. A simple way to mitigate this is to introduce momentum.

**Momentum** simulates the inertia of a moving object: each update retains part of the previous update direction, and the current gradient only fine-tunes the final direction. This stabilizes the updates, which usually speeds up learning, and it can also help the optimizer escape local optima.

 ![alt](imgo/sgd1.png)
 
 **<center>Figure: SGD without momentum vs. SGD with momentum</center>**
 
 
 ### Algorithm
 
 $1.\ \textbf{m} \leftarrow \beta \textbf{m} + \eta \triangledown_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$
 
 $2.\ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \textbf{m}$

In [None]:
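from tensorflow import keras  # the optimizer snippets in this section assume this import
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)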
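
The two update steps of the momentum algorithm can be sketched in plain Python to make the mechanics concrete. This is only an illustrative sketch, not how Keras implements it: the helper `momentum_step`, the toy loss $J(\boldsymbol{\theta}) = 0.5(\theta_0^2 + 10\,\theta_1^2)$, and the hyperparameter values are assumptions chosen for the example.

```python
def momentum_step(theta, m, grad, lr=0.05, beta=0.9):
    """One momentum update (hypothetical helper, not a Keras API)."""
    # Step 1: m <- beta * m + lr * grad
    m = [beta * m_i + lr * g_i for m_i, g_i in zip(m, grad)]
    # Step 2: theta <- theta - m
    theta = [th_i - m_i for th_i, m_i in zip(theta, m)]
    return theta, m


def toy_grad(theta):
    # Gradient of the assumed toy loss J = 0.5 * (theta_0**2 + 10 * theta_1**2)
    return [theta[0], 10.0 * theta[1]]


theta, m = [5.0, 2.0], [0.0, 0.0]
for _ in range(300):
    theta, m = momentum_step(theta, m, toy_grad(theta))

print(theta)  # both coordinates end up close to the minimum at [0, 0]
```

Because `m` carries over a fraction `beta` of the previous update, consecutive steps that point in the same direction reinforce each other, while steps that flip sign partially cancel.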

## Nesterov Accelerated Gradient (NAG)

### Algorithm

 
 $1.\ \textbf{m} \leftarrow \beta \textbf{m} + \eta \triangledown_{\boldsymbol{\theta}} J(\boldsymbol{\theta} - \beta \textbf{m})$
 
 $2.\ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \textbf{m}$

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
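
The only difference from plain momentum is where the gradient is evaluated: at the look-ahead point $\boldsymbol{\theta} - \beta \textbf{m}$ rather than at $\boldsymbol{\theta}$. A minimal plain-Python sketch of the two steps follows. The helper `nag_step`, the toy gradient, and the hyperparameter values are assumptions made for illustration, not Keras internals.

```python
def nag_step(theta, m, grad_fn, lr=0.05, beta=0.9):
    """One Nesterov update (hypothetical helper, not a Keras API)."""
    # Evaluate the gradient at the look-ahead point theta - beta * m
    lookahead = [th_i - beta * m_i for th_i, m_i in zip(theta, m)]
    grad = grad_fn(lookahead)
    # Step 1: m <- beta * m + lr * grad(theta - beta * m)
    m = [beta * m_i + lr * g_i for m_i, g_i in zip(m, grad)]
    # Step 2: theta <- theta - m
    theta = [th_i - m_i for th_i, m_i in zip(theta, m)]
    return theta, m


def toy_grad(theta):
    # Gradient of the assumed toy loss J = 0.5 * (theta_0**2 + 10 * theta_1**2)
    return [theta[0], 10.0 * theta[1]]


theta, m = [5.0, 2.0], [0.0, 0.0]
for _ in range(300):
    theta, m = nag_step(theta, m, toy_grad)

print(theta)  # both coordinates end up close to the minimum at [0, 0]
```

Note that the step function takes the gradient function itself rather than a precomputed gradient, since the gradient has to be computed at the look-ahead point.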

## Adagrad

Adagrad is an algorithm for gradient-based optimization which adapts the learning rate to the parameters, using low learning rates for parameters associated with frequently occurring features, and using high learning rates for parameters associated with infrequent features. 

So, it is well-suited for dealing with sparse data.

Using the same learning rate for every parameter is often unsuitable. For example, some parameters may already be at the stage where only fine-tuning is needed, while others still need large adjustments because few samples correspond to them.

Adagrad addresses this problem by adaptively assigning a different learning rate to each parameter. For each parameter, the effective learning rate shrinks as its accumulated squared gradients grow.

>**GloVe word embeddings are trained with Adagrad: infrequent words require larger updates and frequent words require smaller updates.**

>**Adagrad largely removes the need to manually tune the learning rate.**


### Algorithm 

1. $ \textbf{s} \leftarrow  \textbf{s} +  \triangledown_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \otimes  \triangledown_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$

2. $ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta}- \eta \triangledown_{\boldsymbol{\theta}} J(\boldsymbol{\theta})   \oslash \sqrt{\textbf{s} + \epsilon}$

In [None]:
optimizer = keras.optimizers.Adagrad(learning_rate=0.001)
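
To see how the accumulator $\textbf{s}$ gives each parameter its own learning rate, here is a small plain-Python sketch of the two steps. The helper `adagrad_step` and the synthetic gradients are assumptions made for illustration: parameter 0 receives a gradient on every step (a frequent feature), parameter 1 only on every tenth step (an infrequent feature).

```python
import math


def adagrad_step(theta, s, grad, lr=0.1, eps=1e-7):
    """One Adagrad update (hypothetical helper, not a Keras API)."""
    # Step 1: s <- s + grad * grad (element-wise)
    s = [s_i + g_i * g_i for s_i, g_i in zip(s, grad)]
    # Step 2: theta <- theta - lr * grad / sqrt(s + eps) (element-wise)
    theta = [th_i - lr * g_i / math.sqrt(s_i + eps)
             for th_i, g_i, s_i in zip(theta, grad, s)]
    return theta, s


lr = 0.1
theta, s = [0.0, 0.0], [0.0, 0.0]
for step in range(100):
    # Synthetic gradients: parameter 1 is only "active" on every 10th step
    grad = [1.0, 1.0 if step % 10 == 0 else 0.0]
    theta, s = adagrad_step(theta, s, grad, lr=lr)

# Effective per-parameter learning rate lr / sqrt(s + eps)
effective_lr = [lr / math.sqrt(s_i + 1e-7) for s_i in s]
print(s, effective_lr)
```

The infrequently updated parameter ends up with the smaller accumulator and therefore keeps the larger effective learning rate. Since `s` never decreases, every effective learning rate can only shrink over time.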

## RMSProp

RMSProp, short for **Root Mean Square Propagation**, is an adaptive learning rate optimization algorithm proposed by Geoff Hinton.


>RMSProp tries to resolve Adagrad’s rapidly diminishing learning rates by using a moving average of the squared gradients. It uses the magnitude of recent gradients to normalize the current gradient.


Adagrad accumulates all previous squared gradients, whereas RMSprop only keeps a decaying average of them, so it alleviates the problem of Adagrad's learning rate dropping too quickly.

The difference is that RMSProp computes an **exponentially weighted moving average of the squared gradients**. This damps the updates along directions with large oscillations, so the swing amplitude in each dimension becomes smaller, and it also helps the network converge faster.


>In RMSProp the learning rate is adjusted automatically, and a different learning rate is used for each parameter.

>RMSProp divides the learning rate by the square root of an exponentially decaying average of squared gradients.


### Algorithm 
1. $  \textbf{s} \leftarrow  \beta \textbf{s} +  (1 - \beta)\triangledown_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \otimes  \triangledown_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$

2. $ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta}- \eta \triangledown_{\boldsymbol{\theta}} J(\boldsymbol{\theta})   \oslash \sqrt{\textbf{s} + \epsilon}$

In [None]:
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
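
The effect of replacing Adagrad's running sum with a decaying average can be sketched in plain Python. The helper `rmsprop_step` and the constant synthetic gradient are assumptions made for illustration. The decay factor `beta` in the sketch plays the role of `rho` in the Keras call.

```python
import math


def rmsprop_step(theta, s, grad, lr=0.001, beta=0.9, eps=1e-7):
    """One RMSProp update (hypothetical helper, not a Keras API)."""
    # Step 1: s <- beta * s + (1 - beta) * grad * grad (element-wise)
    s = [beta * s_i + (1.0 - beta) * g_i * g_i for s_i, g_i in zip(s, grad)]
    # Step 2: theta <- theta - lr * grad / sqrt(s + eps) (element-wise)
    theta = [th_i - lr * g_i / math.sqrt(s_i + eps)
             for th_i, g_i, s_i in zip(theta, grad, s)]
    return theta, s


# With a constant gradient of 1.0, a running sum of squared gradients would
# reach 100 after 100 steps; the decaying average levels off near 1.0 instead.
theta, s = [0.0], [0.0]
for _ in range(100):
    theta, s = rmsprop_step(theta, s, [1.0])

print(s)
```

Because `s` stays bounded, the effective learning rate `lr / sqrt(s + eps)` does not decay toward zero the way it does with Adagrad.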

## Adam

**Adaptive Moment Estimation (Adam)** is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients, like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients, similar to momentum.

>Adam can be viewed as combining the advantages of Adagrad, which works well with sparse gradients, and RMSProp, which works well in online and non-stationary settings.

>Adam uses an **exponential moving average of the squared gradients** to scale the learning rate, instead of the cumulative sum used in Adagrad, and an exponential moving average of the gradients themselves as the update direction.

>Adam is computationally efficient and has low memory requirements.

>Adam is one of the most popular gradient descent optimization algorithms.


### Algorithm

1. $\textbf{m} \leftarrow \beta_1 \textbf{m} + (1 - \beta_1)\triangledown_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$

2. $\textbf{s} \leftarrow  \beta_2 \textbf{s} +  (1 - \beta_2)\triangledown_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \otimes  \triangledown_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$

3. $\hat{\textbf{m}} \leftarrow \frac{\textbf{m}}{1 - \beta_1^t}$

4. $\hat{\textbf{s}} \leftarrow \frac{\textbf{s}}{1 - \beta_2^t}$

5. $ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta}- \eta \hat{\textbf{m}}  \oslash \sqrt{\hat{\textbf{s}} + \epsilon}$

In [None]:
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
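
The five steps of the Adam algorithm translate almost line by line into plain Python. The sketch below is illustrative only: the helper `adam_step` and the example gradient are assumptions, and the placement of `eps` inside the square root follows the formula in step 5 rather than any particular library implementation.

```python
import math


def adam_step(theta, m, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update at step t, with t starting at 1 (hypothetical helper)."""
    # Step 1: decaying average of gradients (momentum-like term)
    m = [beta1 * m_i + (1.0 - beta1) * g_i for m_i, g_i in zip(m, grad)]
    # Step 2: decaying average of squared gradients (RMSProp-like term)
    s = [beta2 * s_i + (1.0 - beta2) * g_i * g_i for s_i, g_i in zip(s, grad)]
    # Steps 3 and 4: bias correction, since m and s are initialised at zero
    m_hat = [m_i / (1.0 - beta1 ** t) for m_i in m]
    s_hat = [s_i / (1.0 - beta2 ** t) for s_i in s]
    # Step 5: theta <- theta - lr * m_hat / sqrt(s_hat + eps)
    theta = [th_i - lr * mh_i / math.sqrt(sh_i + eps)
             for th_i, mh_i, sh_i in zip(theta, m_hat, s_hat)]
    return theta, m, s


# First step with a gradient of 5.0
theta, m, s = adam_step([1.0], [0.0], [0.0], grad=[5.0], t=1)
print(m, s, theta)
```

On the first step the raw averages `m` and `s` are only a small fraction of the gradient and its square, because both start at zero. The bias correction rescales them, so the parameter moves by roughly `lr` regardless of the gradient's magnitude.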

<!-- ## Comparisions -->

<!-- ![alt](https://ml-cheatsheet.readthedocs.io/en/latest/_images/optimizers.gif)

**<center>Figure :- SGD optimization on loss surface contours</center>**

![alt](https://miro.medium.com/max/1628/1*SjtKOauOXFVjWRR7iCtHiA.gif)

**<center>Figure :- SGD optimization on saddle point</center>** -->



# How to choose optimizers?

- If the data is sparse, use one of the adaptive methods, namely Adagrad, Adadelta, RMSprop, or Adam.

- RMSprop, Adadelta, and Adam have similar effects in many cases.

- Adam essentially adds bias correction and momentum on top of RMSprop.

- As the gradients become sparser, Adam tends to perform better than RMSprop.

**Overall, Adam is usually a good default choice.**

>Many papers use plain SGD without momentum. Although SGD can reach a minimum, it takes longer than the other algorithms and may get stuck at saddle points.

- If faster convergence is needed, or deeper and more complex neural networks are trained, an adaptive algorithm is needed.

Optimizer | Convergence Speed | Convergence quality
:-:|:-:|:-:
SGD | * | ***
momentum | ** | ***
NAG | ** | ***
Adagrad | *** | * (stops too early)
RMSprop | *** | ** to ***
Adam | *** | ** to ***
Nadam | *** | ** to ***
AdaMax | *** | ** to ***

In [None]:
import os
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("fivethirtyeight")
%load_ext tensorboard

Let's train a neural network on Fashion MNIST using the Leaky ReLU activation:

In [None]:
(X_train_full, y_train_full), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

In [None]:
tf.random.set_seed(42)
np.random.seed(42)

LAYERS = [
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal"),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal"),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(10, activation="softmax"),
]

model = tf.keras.models.Sequential(LAYERS)

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 300)               235500    
_________________________________________________________________
leaky_re_lu (LeakyReLU)      (None, 300)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 100)               30100     
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 100)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                1010      
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
__________________________________________________

In [None]:
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid), verbose=2)

Epoch 1/10
1719/1719 - 4s - loss: 0.4943 - accuracy: 0.8208 - val_loss: 0.3800 - val_accuracy: 0.8670
Epoch 2/10
1719/1719 - 4s - loss: 0.3834 - accuracy: 0.8595 - val_loss: 0.4006 - val_accuracy: 0.8586
Epoch 3/10
1719/1719 - 4s - loss: 0.3547 - accuracy: 0.8682 - val_loss: 0.3400 - val_accuracy: 0.8798
Epoch 4/10
1719/1719 - 5s - loss: 0.3328 - accuracy: 0.8777 - val_loss: 0.3331 - val_accuracy: 0.8806
Epoch 5/10
1719/1719 - 5s - loss: 0.3167 - accuracy: 0.8824 - val_loss: 0.3222 - val_accuracy: 0.8808
Epoch 6/10
1719/1719 - 5s - loss: 0.3037 - accuracy: 0.8886 - val_loss: 0.3530 - val_accuracy: 0.8778
Epoch 7/10
1719/1719 - 5s - loss: 0.2954 - accuracy: 0.8911 - val_loss: 0.3342 - val_accuracy: 0.8808
Epoch 8/10
1719/1719 - 5s - loss: 0.2829 - accuracy: 0.8939 - val_loss: 0.3497 - val_accuracy: 0.8812
Epoch 9/10
1719/1719 - 5s - loss: 0.2751 - accuracy: 0.8977 - val_loss: 0.3367 - val_accuracy: 0.8860
Epoch 10/10
1719/1719 - 5s - loss: 0.2638 - accuracy: 0.9011 - val_loss: 0.3186 - 