# SGD
syntax: tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.0, nesterov=False, name="SGD", **kwargs)

Stochastic gradient descent (SGD) is an optimizer based on the gradient descent algorithm. It updates the weights of a network at each training step, using the gradient of the loss computed on a batch of data.

The update rule is given by:

1) when momentum is 0

w=w-learning_rate*g

where,

w= weight 

learning_rate = the step size (often written as alpha)

g = the gradient of the loss with respect to w (dL/dw)

2) When momentum is larger than 0

velocity = momentum*velocity - learning_rate * g

w = w + velocity

3) When nesterov=True

velocity = momentum*velocity - learning_rate * g

w = w + momentum*velocity - learning_rate * g
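
The three update rules above can be sketched in plain Python for a single scalar weight. This is only an illustration of the formulas, not the TensorFlow implementation; the helper name `sgd_step` and the toy loss `w**2 / 2` are assumptions made for the example.

```python
def sgd_step(w, g, velocity, learning_rate=0.01, momentum=0.0, nesterov=False):
    """Apply one SGD update to a scalar weight; return (new_w, new_velocity)."""
    if momentum == 0.0:
        # Case 1: plain gradient descent step
        return w - learning_rate * g, velocity
    # Cases 2 and 3 share the same velocity update
    velocity = momentum * velocity - learning_rate * g
    if nesterov:
        # Case 3: look ahead along the updated velocity
        w = w + momentum * velocity - learning_rate * g
    else:
        # Case 2: classical momentum
        w = w + velocity
    return w, velocity


# Toy loss w**2 / 2 (an illustrative choice), whose gradient is simply w.
for kwargs in ({}, {"momentum": 0.9}, {"momentum": 0.9, "nesterov": True}):
    w, v = 1.0, 0.0
    for _ in range(2):
        w, v = sgd_step(w, g=w, velocity=v, learning_rate=0.1, **kwargs)
    print(kwargs, round(w, 4))
```

With momentum the second step is larger than the first, because the velocity carries over part of the previous update.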


# Arguments

1. learning_rate - [float] the step size: how far the weights move along the negative gradient at each update. Typical values lie between 0.0001 and 0.1. Defaults to 0.01.



2. momentum - [float] hyperparameter >= 0 that accelerates gradient descent in the relevant direction and dampens oscillations. Defaults to 0.



3. nesterov - [boolean] whether to apply Nesterov momentum. Defaults to False.



4. name - optional name prefix for operations created when applying gradients.



5. **kwargs - keyword arguments.

In [4]:
#Example Usage

import tensorflow as tf


opt = tf.keras.optimizers.SGD(learning_rate=0.1)
var = tf.Variable(1.0)
loss = lambda: (var ** 2)/2.0         # d(loss)/d(var) = var
step_count = opt.minimize(loss, [var]).numpy()
# Step is `- learning_rate * grad`  
var.numpy()

0.9

# RMSprop 

syntax: 

tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, momentum=0.0, epsilon=1e-07, centered=False, name="RMSprop", **kwargs)

1. Optimizer that implements the RMSprop algorithm.


2. The gist of RMSprop is to:

    Maintain a moving (discounted) average of the square of gradients
    Divide the gradient by the root of this average


3. This implementation of RMSprop uses plain momentum, not Nesterov momentum.


4. The centered version additionally maintains a moving average of the gradients, and uses that average to estimate the variance.
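
The points above can be sketched for a single scalar weight as follows. This is a minimal illustration rather than the library code: the helper name `rmsprop_step`, the `state` dictionary, and the placement of epsilon inside the square root are assumptions made for the example.

```python
import math


def rmsprop_step(w, g, state, learning_rate=0.001, rho=0.9, momentum=0.0,
                 epsilon=1e-7, centered=False):
    """Apply one RMSprop update to a scalar weight and return the new weight.

    `state` holds the running averages: "ms" (mean square), "mg" (mean
    gradient, used only when centered) and "mom" (momentum buffer).
    """
    # Moving (discounted) average of the squared gradient
    state["ms"] = rho * state["ms"] + (1.0 - rho) * g * g
    denom = state["ms"]
    if centered:
        # Centered variant: subtract the squared mean to estimate the variance
        state["mg"] = rho * state["mg"] + (1.0 - rho) * g
        denom -= state["mg"] ** 2
    # Assumption: epsilon is added inside the square root.
    scaled = learning_rate * g / math.sqrt(denom + epsilon)
    # Plain (not Nesterov) momentum on the scaled gradient
    state["mom"] = momentum * state["mom"] + scaled
    return w - state["mom"]


state = {"ms": 0.0, "mg": 0.0, "mom": 0.0}
w = rmsprop_step(10.0, g=10.0, state=state, learning_rate=0.1)
print(round(w, 6))
```

For a weight of 10 with gradient 10 and a learning rate of 0.1, the first step under these assumptions moves the weight to roughly 9.68.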

# Arguments 

1. learning_rate - the step size: how far the weights move at each update. Typical values lie between 0.0001 and 0.1. Defaults to 0.001.


2. rho - discounting factor for the moving average of squared gradients. Defaults to 0.9.


3. momentum - A scalar or a scalar tensor. Defaults to 0.0.


4. epsilon - small constant for numerical stability. Defaults to 1e-7.


5. centered - [boolean] If True, gradients are normalized by the estimated variance of the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False.


In [3]:
#examples
import tensorflow as tf
opt = tf.keras.optimizers.RMSprop(learning_rate=0.1)
var1 = tf.Variable(10.0)
loss = lambda: (var1 ** 2) / 2.0    # d(loss) / d(var1) = var1
step_count = opt.minimize(loss, [var1]).numpy()
var1.numpy()

9.683772


# Adam

syntax: tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name="Adam", **kwargs)

1. Optimizer that implements the Adam algorithm.

2. Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments.    
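
To make the "adaptive estimation of first-order and second-order moments" concrete, here is a minimal sketch of one Adam step for a scalar weight. It is an illustration under assumptions, not the TensorFlow code: the helper name `adam_step` is hypothetical, the AMSGrad variant is left out, and epsilon is added outside the square root.

```python
import math


def adam_step(w, g, m, v, t, learning_rate=0.001, beta_1=0.9, beta_2=0.999,
              epsilon=1e-7):
    """Apply one Adam update to a scalar weight; return (w, m, v, t)."""
    t += 1
    # First moment: moving average of the gradient
    m = beta_1 * m + (1.0 - beta_1) * g
    # Second moment: moving average of the squared gradient
    v = beta_2 * v + (1.0 - beta_2) * g * g
    # Bias correction folded into the step size
    lr_t = learning_rate * math.sqrt(1.0 - beta_2 ** t) / (1.0 - beta_1 ** t)
    w = w - lr_t * m / (math.sqrt(v) + epsilon)
    return w, m, v, t


w, m, v, t = adam_step(10.0, g=10.0, m=0.0, v=0.0, t=0, learning_rate=0.1)
print(round(w, 6))
```

On the first step the bias correction makes the update size almost exactly equal to the learning rate, regardless of the gradient's magnitude.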

# Arguments


1. learning_rate: A Tensor, floating point value, or a schedule that is a tf.keras.optimizers.schedules.LearningRateSchedule, or a callable that takes no arguments and returns the actual value to use. The learning rate. Defaults to 0.001.
    
    
2. beta_1: A float value or a constant float tensor, or a callable that takes no arguments and returns the actual value to use. The exponential decay rate for the 1st moment estimates. Defaults to 0.9.
    
    
3. beta_2: A float value or a constant float tensor, or a callable that takes no arguments and returns the actual value to use. The exponential decay rate for the 2nd moment estimates. Defaults to 0.999.
    
    
4. epsilon: A small constant for numerical stability. This epsilon is "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper. Defaults to 1e-7.
    
    
5. amsgrad: Boolean. Whether to apply the AMSGrad variant of this algorithm from the paper "On the Convergence of Adam and Beyond". Defaults to False.
    
    
6. name: Optional name for the operations created when applying gradients. Defaults to "Adam".
    
    
7. **kwargs: Keyword arguments. Allowed to be one of "clipnorm" or "clipvalue". "clipnorm" (float) clips gradients by norm; "clipvalue" (float) clips gradients by value.
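
The difference between the two clipping options can be sketched on a small gradient vector. This is a plain-Python illustration, not the Keras code path; the helper names `clip_by_value` and `clip_by_norm` are assumptions made for the example, and the norm is taken over one gradient tensor at a time.

```python
import math


def clip_by_value(grad, clipvalue):
    """Clamp every gradient entry into [-clipvalue, clipvalue]."""
    return [max(-clipvalue, min(clipvalue, g)) for g in grad]


def clip_by_norm(grad, clipnorm):
    """Rescale the whole gradient so its L2 norm is at most clipnorm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clipnorm:
        return list(grad)
    return [g * clipnorm / norm for g in grad]


grad = [3.0, -4.0]  # L2 norm is 5
print(clip_by_value(grad, 1.0))  # entries clamped independently
print(clip_by_norm(grad, 1.0))   # direction preserved, length reduced
```

Clipping by value can change the direction of the gradient, while clipping by norm only shortens it.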



In [8]:
import tensorflow as tf

opt= tf.keras.optimizers.Adam(learning_rate=0.1)
var1=tf.Variable(10.0)
loss= lambda: (var1**2)/2.0

step_count = opt.minimize(loss, [var1]).numpy()
var1.numpy()

9.9

# Adadelta

syntax: tf.keras.optimizers.Adadelta(learning_rate=0.001, rho=0.95, epsilon=1e-07, name="Adadelta", **kwargs)
   
1. Adadelta is a stochastic gradient descent method with an adaptive learning rate per dimension.



2. It addresses two drawbacks of Adagrad:
    
- The continual decay of learning rates throughout training

- The need for a manually selected global learning rate


3. Adadelta's adaptive approach allows it to keep learning even after many updates have been made. Compared to Adagrad, it is more robust, and this implementation allows the initial learning rate to be set. It adapts learning rates based on a moving window of gradient updates rather than accumulating all past gradients.


4. Late in training, when gradients and parameter updates become very small, the epsilon constants in the numerator and denominator dominate the accumulated gradients and parameter updates, so the effective per-dimension learning rate converges to 1. Because this happens only when the gradients are already small, the resulting step size stays small.


5. Generally, ADADELTA converges faster than ADAGRAD.
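
A minimal sketch of one Adadelta step for a scalar weight is given below. It is an illustration under assumptions rather than the library code: the helper name `adadelta_step` and the accumulator names are hypothetical, and epsilon is added inside both square roots.

```python
import math


def adadelta_step(w, g, accum_grad, accum_var, learning_rate=0.001, rho=0.95,
                  epsilon=1e-7):
    """Apply one Adadelta update; return (w, accum_grad, accum_var)."""
    # Decaying average of squared gradients (the "moving window")
    accum_grad = rho * accum_grad + (1.0 - rho) * g * g
    # Ratio of past update size to gradient size sets the per-dimension rate
    update = math.sqrt(accum_var + epsilon) / math.sqrt(accum_grad + epsilon) * g
    # Decaying average of squared updates
    accum_var = rho * accum_var + (1.0 - rho) * update * update
    return w - learning_rate * update, accum_grad, accum_var


w, ag, av = adadelta_step(10.0, g=10.0, accum_grad=0.0, accum_var=0.0,
                          learning_rate=0.1)
print(round(w, 6))

# With a tiny gradient, epsilon dominates both accumulators and the ratio
# sqrt(accum_var + eps) / sqrt(accum_grad + eps) approaches 1.
tiny = 1e-12
w_tiny, _, _ = adadelta_step(0.0, g=tiny, accum_grad=0.0, accum_var=0.0,
                             learning_rate=1.0)
print(-w_tiny / tiny)
```

The first step is very small because the update accumulator starts at zero, so only epsilon appears in the numerator.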


# Arguments



1. learning_rate: A Tensor, floating point value, or a schedule that is a tf.keras.optimizers.schedules.LearningRateSchedule. The learning rate. To match the exact form in the original paper use 1.0.

    
2. rho: A Tensor or a floating point value. The decay rate.

    
3. epsilon: A Tensor or a floating point value. A constant epsilon used to better condition the gradient update.

    
4. name: Optional name prefix for the operations created when applying gradients. Defaults to "Adadelta".
    
    
5. **kwargs: Keyword arguments. Allowed to be one of "clipnorm" or "clipvalue". "clipnorm" (float) clips gradients by norm; "clipvalue" (float) clips gradients by value.
  
    

In [9]:
import tensorflow as tf

opt= tf.keras.optimizers.Adadelta(learning_rate=0.1)
var1=tf.Variable(10.0)
loss= lambda: (var1**2)/2.0

step_count = opt.minimize(loss, [var1]).numpy()
var1.numpy()

9.999859

# Adagrad

syntax: tf.keras.optimizers.Adagrad(learning_rate=0.001,initial_accumulator_value=0.1,epsilon=1e-07,name="Adagrad",**kwargs)

1. Adagrad is an optimizer with parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training. The more updates a parameter receives, the smaller its updates become.
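
The shrinking updates can be seen in a minimal sketch of the Adagrad rule for a scalar weight. The helper name `adagrad_step`, the constant gradient, and the placement of epsilon outside the square root are assumptions made for the illustration; this is not the library implementation.

```python
import math


def adagrad_step(w, g, accumulator, learning_rate=0.001, epsilon=1e-7):
    """Apply one Adagrad update; return (new_w, new_accumulator)."""
    # The accumulator only grows: it sums all squared gradients seen so far
    accumulator = accumulator + g * g
    w = w - learning_rate * g / (math.sqrt(accumulator) + epsilon)
    return w, accumulator


w, acc = 10.0, 0.1  # 0.1 mirrors initial_accumulator_value
steps = []
for _ in range(3):
    # A constant gradient is used purely to isolate the effect of the accumulator
    new_w, acc = adagrad_step(w, g=10.0, accumulator=acc, learning_rate=0.1)
    steps.append(w - new_w)
    w = new_w
print(steps)
```

Even though the gradient stays the same, each successive step is smaller than the one before it.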



# Arguments

1. learning_rate: A Tensor, floating point value, or a schedule that is a tf.keras.optimizers.schedules.LearningRateSchedule. The learning rate.

    
2. initial_accumulator_value: A floating point value. Starting value for the accumulators, must be non-negative.

    
3. epsilon: A small floating point value to avoid zero denominator.

    
4. name: Optional name prefix for the operations created when applying gradients. Defaults to "Adagrad".

    
5. **kwargs: Keyword arguments. Allowed to be one of "clipnorm" or "clipvalue". "clipnorm" (float) clips gradients by norm; "clipvalue" (float) clips gradients by value.

In [15]:
import tensorflow as tf

opt= tf.keras.optimizers.Adagrad(learning_rate=0.1, epsilon=1e-7)
var1=tf.Variable(10.0)
loss= lambda: (var1**2)/2.0

step_count = opt.minimize(loss, [var1]).numpy()
var1.numpy()

9.90005

# Adamax 

syntax: tf.keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name="Adamax", **kwargs)
    
1. It is a variant of Adam based on the infinity norm. Adamax is sometimes superior to Adam, especially in models with embeddings.

# Algorithm 

# Initialization  

m = 0  # Initialize initial 1st moment vector

v = 0  # Initialize the exponentially weighted infinity norm

t = 0  # Initialize timestep

# Update rule

t += 1

m = beta1 * m + (1 - beta1) * g

v = max(beta2 * v, abs(g))

current_lr = learning_rate / (1 - beta1 ** t)

w = w - current_lr * m / (v + epsilon)
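
The update rule above can be wrapped in a small function for a scalar weight, taking the decay factor in the first-moment update to be beta_1. The helper name `adamax_step` is an assumption made for this sketch; it mirrors the listed steps and is not the TensorFlow implementation.

```python
def adamax_step(w, g, m, v, t, learning_rate=0.001, beta_1=0.9, beta_2=0.999,
                epsilon=1e-7):
    """Apply one Adamax update to a scalar weight; return (w, m, v, t)."""
    t += 1
    # First moment: moving average of the gradient
    m = beta_1 * m + (1.0 - beta_1) * g
    # Exponentially weighted infinity norm
    v = max(beta_2 * v, abs(g))
    # Bias correction applies to the first moment only
    current_lr = learning_rate / (1.0 - beta_1 ** t)
    w = w - current_lr * m / (v + epsilon)
    return w, m, v, t


w, m, v, t = adamax_step(10.0, g=10.0, m=0.0, v=0.0, t=0, learning_rate=0.1)
print(round(w, 6))
```

Because v tracks the largest recent gradient magnitude rather than an average of squares, no square root is needed in the denominator.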

# Key Points

1. Similarly to Adam, the epsilon is added for numerical stability (especially to get rid of division by zero when v_t == 0).


2. In contrast to Adam, the sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of tf.gather or an embedding lookup in the forward pass) only updates variable slices and the corresponding m_t, v_t terms when that part of the variable was used in the forward pass. This means that the sparse behavior differs from the dense behavior (similar to some momentum implementations, which ignore momentum unless a variable slice was actually used).

# Arguments


1. learning_rate: A Tensor, floating point value, or a schedule that is a tf.keras.optimizers.schedules.LearningRateSchedule. The learning rate.

    
2. beta_1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.

    
3. beta_2: A float value or a constant float tensor. The exponential decay rate for the exponentially weighted infinity norm.

    
4. epsilon: A small constant for numerical stability.

    
5. name: Optional name for the operations created when applying gradients. Defaults to "Adamax".
    
    
6. **kwargs: Keyword arguments. Allowed to be one of "clipnorm" or "clipvalue". "clipnorm" (float) clips gradients by norm; "clipvalue" (float) clips gradients by value.


In [20]:
#Example

import tensorflow as tf

opt= tf.keras.optimizers.Adamax(learning_rate=0.1)
var1=tf.Variable(10.0)
loss= lambda: (var1**2)/2.0

step_count = opt.minimize(loss, [var1]).numpy()
var1.numpy()

9.9

# Nadam 

syntax: tf.keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name="Nadam", **kwargs)

1. Nadam is Adam with Nesterov momentum.


# Arguments


1. learning_rate: A Tensor or a floating point value. The learning rate.
    
    
2. beta_1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
    
    
3. beta_2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.

    
4. epsilon: A small constant for numerical stability.

    
5. name: Optional name for the operations created when applying gradients. Defaults to "Nadam".

    
6. **kwargs: Keyword arguments. Allowed to be one of "clipnorm" or "clipvalue". "clipnorm" (float) clips gradients by norm; "clipvalue" (float) clips gradients by value.

#Example

import tensorflow as tf

opt= tf.keras.optimizers.Nadam(learning_rate=0.1)
var1=tf.Variable(10.0)
loss= lambda: (var1**2)/2.0

step_count = opt.minimize(loss, [var1]).numpy()
var1.numpy()

In [25]:
#Example

import tensorflow as tf

opt= tf.keras.optimizers.Nadam(learning_rate=0.1, beta_1=0.9)
var1=tf.Variable(10.0)
loss= lambda: (var1**2)/2.0

step_count = opt.minimize(loss, [var1]).numpy()
var1.numpy()

9.894355

# FTRL

syntax: 

**tf.keras.optimizers.Ftrl(
    learning_rate=0.001,
    learning_rate_power=-0.5,
    initial_accumulator_value=0.1,
    l1_regularization_strength=0.0,
    l2_regularization_strength=0.0,
    name="Ftrl",
    l2_shrinkage_regularization_strength=0.0,
    beta=0.0,
    **kwargs)**

    
# Arguments


1. learning_rate: A Tensor, floating point value, or a schedule that is a tf.keras.optimizers.schedules.LearningRateSchedule. The learning rate.


2. learning_rate_power: A float value, must be less than or equal to zero. Controls how the learning rate decreases during training. Use zero for a fixed learning rate.
    
    
3. initial_accumulator_value: The starting value for accumulators. Only zero or positive values are allowed.
    
    
4. l1_regularization_strength: A float value, must be greater than or equal to zero.
    
    
5. l2_regularization_strength: A float value, must be greater than or equal to zero.
    
    
6. name: Optional name prefix for the operations created when applying gradients. Defaults to "Ftrl".
    
    
7. l2_shrinkage_regularization_strength: A float value, must be greater than or equal to zero. This differs from the L2 above in that the L2 above is a stabilization penalty, whereas this L2 shrinkage is a magnitude penalty. When the input is sparse, shrinkage will only happen on the active weights.
    
    
8. beta: A float value, representing the beta value from the paper.
    
    
9. **kwargs: Keyword arguments. Allowed to be one of "clipnorm" or "clipvalue". "clipnorm" (float) clips gradients by norm; "clipvalue" (float) clips gradients by value.
