**Optimizers** are algorithms that reduce the loss (the model's error) by iteratively tuning the model's parameters, such as its weights. A well-chosen optimizer minimizes the loss function in fewer steps, so the model reaches good accuracy faster.

In [None]:
import tensorflow as tf

In [None]:
# tf.train.Optimizer: base optimizer class in TensorFlow 1.x

In [None]:
# tf.compat.v1.train.Optimizer: the same legacy class, reachable through compat.v1 in TensorFlow 2.x

This is how optimizers work: they iteratively adjust the parameters of a model (like $x$) to reduce the error (loss function) by stepping against the direction of the gradient.

In [None]:
var = tf.Variable(2.0)

# Create a stochastic gradient descent (SGD) optimizer with a fixed learning rate
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

SGD updates the model's parameters by moving them in the direction opposite to the gradient of the loss function. The step size is the gradient multiplied by a constant learning rate, which you set manually.
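The update rule can be written out without TensorFlow. The following is a minimal sketch for the loss `x**2` used in this notebook; the helper names `sgd_step` and `grad_of_square` are hypothetical and exist only for illustration.

```python
def sgd_step(param, grad, learning_rate=0.01):
    """Plain SGD: subtract the gradient scaled by a constant learning rate."""
    return param - learning_rate * grad


def grad_of_square(x):
    """Analytic gradient of loss = x**2."""
    return 2.0 * x


x = 2.0
sgd_first = sgd_step(x, grad_of_square(x))  # 2.0 - 0.01 * 4.0
print(sgd_first)

# Repeating the step keeps shrinking x toward the minimum at 0.
for _ in range(500):
    x = sgd_step(x, grad_of_square(x))
print(x)
```

The first step lands at about 1.96, the value the SGD cell prints further down. Because the gradient shrinks as `x` approaches 0, the steps get smaller even though the learning rate never changes.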

In [None]:
with tf.GradientTape() as tape:
    loss = var**2  # simple loss function

In [None]:
gradients = tape.gradient(loss, [var])

In [None]:
optimizer.apply_gradients(zip(gradients, [var]))

<KerasVariable shape=(), dtype=int64, path=SGD/iteration>

In [None]:
var.numpy()

1.96

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

**Adam (Adaptive Moment Estimation):**

 Adam computes individual adaptive learning rates for different parameters based on estimates of first and second moments of the gradients.

 Adam is often the default choice for many neural networks due to its adaptability and generally good performance across a variety of tasks.
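To make the moment estimates concrete, here is a sketch of one Adam step in plain Python. It follows the textbook form of the algorithm. The hyperparameter values (`beta_1=0.9`, `beta_2=0.999`, `epsilon=1e-7`) are assumed to match the Keras defaults, and the function name `adam_step` is hypothetical.

```python
import math


def adam_step(param, grad, m, v, t, learning_rate=0.01,
              beta_1=0.9, beta_2=0.999, epsilon=1e-7):
    """One textbook Adam update; t is the 1-based step count."""
    m = beta_1 * m + (1.0 - beta_1) * grad         # first moment (mean of gradients)
    v = beta_2 * v + (1.0 - beta_2) * grad * grad  # second moment (mean of squared gradients)
    m_hat = m / (1.0 - beta_1 ** t)                # bias correction for the zero initialisation
    v_hat = v / (1.0 - beta_2 ** t)
    param = param - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return param, m, v


# Start from var = 1.96 with the gradient value 4.0 used in this notebook.
adam_param, adam_m, adam_v = adam_step(1.96, 4.0, m=0.0, v=0.0, t=1)
print(adam_param)
```

On the very first step the bias-corrected ratio `m_hat / sqrt(v_hat)` is close to 1 in magnitude, so the parameter moves by roughly the learning rate (0.01 here) regardless of how large the gradient is.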

In [None]:
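# Note: this reuses the gradient computed earlier at var = 2.0 (value 4.0);
# it is not recomputed for the current value of var.
optimizer.apply_gradients(zip(gradients, [var]))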

<KerasVariable shape=(), dtype=int64, path=adam/iteration>

In [None]:
var.numpy()

1.9500002

In [None]:
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.01)

**RMSprop (Root Mean Square Propagation):**

RMSprop works by dividing the learning rate by the square root of an exponentially decaying average of squared gradients.


RMSprop is especially useful for RNNs (Recurrent Neural Networks) and other deep networks where you might encounter vanishing or exploding gradients.
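A single RMSprop step can be sketched as follows. This uses the common textbook form with `epsilon` added outside the square root, which may differ slightly from what a given library does. The values `rho=0.9` and `epsilon=1e-7` are assumptions, and `rmsprop_step` is a hypothetical name.

```python
import math


def rmsprop_step(param, grad, avg_sq_grad, learning_rate=0.01,
                 rho=0.9, epsilon=1e-7):
    """One RMSprop update using a decaying average of squared gradients."""
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad * grad
    param = param - learning_rate * grad / (math.sqrt(avg_sq_grad) + epsilon)
    return param, avg_sq_grad


# Start from var = 1.95 with the gradient value 4.0 used in this notebook.
rms_param, rms_avg = rmsprop_step(1.95, 4.0, avg_sq_grad=0.0)
print(rms_param)
```

With a freshly created optimizer the running average starts at zero, so the first denominator is small and the first step (about 0.0316) is larger than the learning rate itself.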

In [None]:
optimizer.apply_gradients(zip(gradients, [var]))

<KerasVariable shape=(), dtype=int64, path=rmsprop/iteration>

In [None]:
var.numpy()

1.9183774

In [None]:
optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.01)

**Adagrad (Adaptive Gradient Algorithm):**

Adagrad adapts the learning rate for each parameter based on the sum of the squares of the past gradients.

Adagrad is useful for sparse data and natural language processing tasks where certain features appear infrequently.
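The effect of the growing sum is easy to see in a small sketch. The code below applies the same gradient three times and records each step size. The starting accumulator of 0.1 and `epsilon=1e-7` are assumed to match the Keras defaults, and `adagrad_step` is a hypothetical name.

```python
import math


def adagrad_step(param, grad, accumulator, learning_rate=0.01, epsilon=1e-7):
    """One Adagrad update; the accumulator only ever grows."""
    accumulator = accumulator + grad * grad
    param = param - learning_rate * grad / (math.sqrt(accumulator) + epsilon)
    return param, accumulator


adagrad_param = 1.9183774
adagrad_acc = 0.1  # assumed initial accumulator value
adagrad_steps = []
for _ in range(3):
    new_param, adagrad_acc = adagrad_step(adagrad_param, 4.0, adagrad_acc)
    adagrad_steps.append(adagrad_param - new_param)  # size of this step
    adagrad_param = new_param
print(adagrad_steps)
```

Each step is smaller than the one before it even though the gradient is identical. A rarely updated parameter keeps a small accumulator and therefore a larger effective learning rate, which is why the method suits sparse features.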

In [None]:
optimizer.apply_gradients(zip(gradients, [var]))

<KerasVariable shape=(), dtype=int64, path=adagrad/iteration>

In [None]:
var.numpy()

1.9084085

In [None]:
optimizer = tf.keras.optimizers.Adadelta(learning_rate=0.01)

**Adadelta:**

Adadelta is an extension of Adagrad designed to reduce its aggressive, monotonically decreasing learning rate.

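Adadelta is effective when you need to avoid the diminishing learning rate problem that Adagrad faces.
In its original formulation it needs no manually chosen learning rate, which is useful in deep learning when you do not want to tune one. The Keras implementation still accepts a `learning_rate` argument that scales each update.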
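The sketch below shows one Adadelta step with the update scaled by a learning rate, as the `learning_rate=0.01` argument in this notebook suggests. The values `rho=0.95` and `epsilon=1e-7` are assumptions, and `adadelta_step` is a hypothetical name.

```python
import math


def adadelta_step(param, grad, avg_sq_grad, avg_sq_delta,
                  learning_rate=0.01, rho=0.95, epsilon=1e-7):
    """One Adadelta update with decaying averages instead of a growing sum."""
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad * grad
    # Ratio of recent update size to recent gradient size sets the step.
    delta = math.sqrt(avg_sq_delta + epsilon) / math.sqrt(avg_sq_grad + epsilon) * grad
    avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta * delta
    param = param - learning_rate * delta
    return param, avg_sq_grad, avg_sq_delta


adadelta_start = 1.9084085
adadelta_param, _, _ = adadelta_step(adadelta_start, 4.0,
                                     avg_sq_grad=0.0, avg_sq_delta=0.0)
print(adadelta_start - adadelta_param)  # size of the first step
```

Because the average of past updates starts at zero, the first step is tiny (on the order of 1e-5 here), which is why `var` barely changes after a single Adadelta update.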

In [None]:
optimizer.apply_gradients(zip(gradients, [var]))

<KerasVariable shape=(), dtype=int64, path=adadelta/iteration>

In [None]:
var.numpy()

1.9083943

In [None]:
optimizer = tf.keras.optimizers.Adamax(learning_rate=0.01)

**Adamax:**

Adamax is a variant of Adam based on the infinity norm. In place of Adam's average of squared gradients, it tracks an exponentially weighted maximum of the absolute gradient values, which can make it more stable than Adam in certain cases.


Adamax can be useful in situations where the standard Adam optimizer is not performing well, particularly when dealing with very large datasets or models.
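The difference from Adam is visible in a one-step sketch: the second-moment average is replaced by a running maximum. The hyperparameter values are assumed to match those used for Adam in Keras, and `adamax_step` is a hypothetical name.

```python
def adamax_step(param, grad, m, u, t, learning_rate=0.01,
                beta_1=0.9, beta_2=0.999, epsilon=1e-7):
    """One Adamax update; t is the 1-based step count."""
    m = beta_1 * m + (1.0 - beta_1) * grad  # first moment, as in Adam
    u = max(beta_2 * u, abs(grad))          # exponentially weighted infinity norm
    param = param - (learning_rate / (1.0 - beta_1 ** t)) * m / (u + epsilon)
    return param, m, u


# Start from var = 1.9083943 with the gradient value 4.0 used in this notebook.
adamax_param, adamax_m, adamax_u = adamax_step(1.9083943, 4.0, m=0.0, u=0.0, t=1)
print(adamax_param)
```

Since `u` is a maximum rather than an average, a single large gradient sets the scale immediately and then decays slowly, so later small gradients produce proportionally small steps.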

In [None]:
optimizer.apply_gradients(zip(gradients, [var]))

<KerasVariable shape=(), dtype=int64, path=adamax/iteration>

In [None]:
var.numpy()

1.8983943

In [None]:
optimizer = tf.keras.optimizers.Nadam(learning_rate=0.01)

**Nadam (Nesterov-accelerated Adaptive Moment Estimation):**

Nadam combines Adam with Nesterov momentum, which looks ahead at the next anticipated position before making the parameter update. This can lead to more stable and faster convergence.



Nadam is beneficial when you need the stability and performance of Adam with the added look-ahead of Nesterov momentum, especially in cases where models are sensitive to parameter updates.
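A single Nadam step might look like the sketch below. It assumes a warming momentum schedule with a decay factor of 0.96, which appears consistent with the value this notebook prints but should be checked against the library source. The function name `nadam_step` and the `u_product` state variable are hypothetical.

```python
import math


def nadam_step(param, grad, m, v, u_product, t, learning_rate=0.01,
               beta_1=0.9, beta_2=0.999, epsilon=1e-7, decay=0.96):
    """One Nadam update; t is the 1-based step count."""
    u_t = beta_1 * (1.0 - 0.5 * decay ** t)           # momentum for this step
    u_next = beta_1 * (1.0 - 0.5 * decay ** (t + 1))  # momentum for the next step
    u_product = u_product * u_t
    u_product_next = u_product * u_next
    m = beta_1 * m + (1.0 - beta_1) * grad
    v = beta_2 * v + (1.0 - beta_2) * grad * grad
    # Look-ahead: blend the next step's momentum term with the current gradient.
    m_hat = (u_next * m / (1.0 - u_product_next)
             + (1.0 - u_t) * grad / (1.0 - u_product))
    v_hat = v / (1.0 - beta_2 ** t)
    param = param - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return param, m, v, u_product


nadam_start = 1.8983943
nadam_param, nadam_m, nadam_v, nadam_u = nadam_step(
    nadam_start, 4.0, m=0.0, v=0.0, u_product=1.0, t=1)
print(nadam_start - nadam_param)  # size of the first step
```

The first step comes out slightly larger than Adam's (about 0.0106 rather than 0.01) because the look-ahead term adds a share of the anticipated momentum to the current gradient.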

In [None]:
optimizer.apply_gradients(zip(gradients, [var]))

<KerasVariable shape=(), dtype=int64, path=nadam/iteration>

In [None]:
var.numpy()

1.8877665