# Faster Optimizers

## Momentum Optimization
Momentum optimization cares a great deal about what previous gradients were: at each iteration, it subtracts the local gradient (multiplied by the learning rate $\eta$) from the momentum vector $\mathbf{m}$, and it updates the weights by simply adding this momentum vector. To prevent the momentum from growing too large, the algorithm introduces a new hyperparameter $\beta$, simply called the momentum, which is set between 0 (high friction) and 1 (no friction). A typical value is 0.9.

$\mathbf{m} \gets \beta \mathbf{m} - \eta \nabla_\mathbf{\theta}J(\mathbf{\theta})$

$\mathbf{\theta} \gets \mathbf{\theta} + \mathbf{m}$

You can easily verify that if the gradient remains constant, the terminal velocity (i.e., the maximum size of the weight updates) is equal to that gradient multiplied by the learning rate $\eta$ multiplied by $\frac{1}{1 - \beta}$ (ignoring the sign). For example, with $\beta = 0.9$ the terminal velocity is 10 times the gradient times the learning rate.
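
The terminal velocity claim can be checked numerically. The following is a minimal pure-Python sketch, separate from the TensorFlow code in this notebook: it applies the momentum update to a single weight with a constant gradient. The function name `terminal_velocity`, the gradient value, and the step count are illustrative assumptions.

```python
def terminal_velocity(gradient, eta=0.1, beta=0.9, n_steps=500):
    """Iterate m <- beta * m - eta * gradient with a constant gradient.

    Returns the momentum value after n_steps iterations.
    """
    m = 0.0
    for _ in range(n_steps):
        m = beta * m - eta * gradient
    return m


gradient = 2.0  # hypothetical constant gradient
m_final = terminal_velocity(gradient)
predicted = -0.1 * gradient / (1 - 0.9)  # -eta * gradient * 1 / (1 - beta)

print("simulated:", m_final)
print("predicted:", predicted)
```

With $\beta = 0.9$, both values should have a magnitude of roughly 10 times `eta * gradient`; the sign is negative because the update moves against the gradient.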

In [7]:
import tensorflow as tf

optimizer = tf.train.MomentumOptimizer(learning_rate = 0.1, 
                                       momentum = 0.9)

## Nesterov Accelerated Gradient
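Nesterov Accelerated Gradient (NAG) is a small variant of momentum optimization. The idea is to measure the gradient of the cost function not at the local position $\mathbf{\theta}$ but slightly ahead in the direction of the momentum, at $\mathbf{\theta} + \beta \mathbf{m}$. The only difference from plain momentum optimization is the point at which the gradient is evaluated.

$\mathbf{m} \gets \beta \mathbf{m} - \eta \nabla_\mathbf{\theta}J(\mathbf{\theta} + \beta \mathbf{m})$

$\mathbf{\theta} \gets \mathbf{\theta} + \mathbf{m}$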
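
To make the difference between the two updates concrete, here is a small pure-Python sketch on a one-dimensional toy cost $J(\theta) = \frac{1}{2}\theta^2$, whose gradient is simply $\theta$. The toy cost, the starting point, and the helper names `momentum_step` and `nesterov_step` are assumptions chosen for illustration; they are not part of the TensorFlow API.

```python
def grad(theta):
    """Gradient of the toy cost J(theta) = 0.5 * theta ** 2."""
    return theta


def momentum_step(theta, m, eta=0.1, beta=0.9):
    # Gradient measured at the current position theta.
    m = beta * m - eta * grad(theta)
    return theta + m, m


def nesterov_step(theta, m, eta=0.1, beta=0.9):
    # Gradient measured slightly ahead, at theta + beta * m.
    m = beta * m - eta * grad(theta + beta * m)
    return theta + m, m


theta_mom, m_mom = 5.0, 0.0  # hypothetical starting point
theta_nag, m_nag = 5.0, 0.0
for _ in range(500):
    theta_mom, m_mom = momentum_step(theta_mom, m_mom)
    theta_nag, m_nag = nesterov_step(theta_nag, m_nag)

print("momentum:", theta_mom)
print("nesterov:", theta_nag)
```

Both variants should end up very close to the minimum at 0 on this toy problem; the only line that differs between the two functions is where `grad` is evaluated.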

In [6]:
optimizer = tf.train.MomentumOptimizer(learning_rate = 0.1, 
                                       momentum = 0.9, 
                                       use_nesterov = True)

## AdaGrad
AdaGrad corrects the gradient descent direction to point a bit more toward the global optimum. It achieves this by scaling down the gradient vector along the steepest dimensions.

$\mathbf{s} \gets \mathbf{s} + \nabla_\mathbf{\theta}J(\mathbf{\theta}) \otimes \nabla_\mathbf{\theta}J(\mathbf{\theta})$

$\mathbf{\theta} \gets \mathbf{\theta} - \eta \, \nabla_\mathbf{\theta}J(\mathbf{\theta}) \oslash {\sqrt{\mathbf{s} + \epsilon}}$

The first step accumulates the squares of the gradients into the vector $\mathbf{s}$ ($\otimes$ denotes element-wise multiplication). In other words, each $s_i$ accumulates the squares of the partial derivative of the cost function with regard to the parameter $\theta_i$. If the cost function is steep along the $i$-th dimension, $s_i$ grows larger at each iteration.

The second step is almost identical to gradient descent, but with one big difference: the gradient vector is scaled down by a factor of $\sqrt{\mathbf{s} + \epsilon}$ ($\oslash$ denotes element-wise division, and $\epsilon$ is a small smoothing term that avoids division by zero).

The algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate.
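
The two AdaGrad steps can be sketched in pure Python as follows. This is an illustrative example only: the helper name `adagrad_step` and the gradient values are assumptions, and plain lists stand in for the parameter vectors.

```python
import math


def adagrad_step(theta, s, grad, eta=0.1, eps=1e-10):
    """Apply one AdaGrad update to the lists theta and s (element-wise)."""
    # Step 1: accumulate the squared gradients.
    new_s = [s_i + g_i * g_i for s_i, g_i in zip(s, grad)]
    # Step 2: gradient descent step, scaled down by sqrt(s + eps).
    new_theta = [
        t_i - eta * g_i / math.sqrt(s_i + eps)
        for t_i, g_i, s_i in zip(theta, grad, new_s)
    ]
    return new_theta, new_s


theta = [0.0, 0.0]
s = [0.0, 0.0]
grad = [10.0, 0.1]  # hypothetical gradient: steep first dimension, gentle second

theta, s = adagrad_step(theta, s, grad)
print("theta:", theta)
print("s:", s)
```

After this first step, both weights move by about `eta`, even though the raw gradients differ by a factor of 100: the steep dimension has been scaled down much more strongly than the gentle one.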

In [8]:
optimizer = tf.train.AdagradOptimizer(learning_rate = 0.1)

## RMSProp
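AdaGrad slows down too fast and may end up never converging to the optimal solution. The RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations, rather than all the gradients since the beginning of training. It does so by using exponential decay in the first step.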

$\mathbf{s} \gets \beta \mathbf{s} + (1 - \beta ) \nabla_\mathbf{\theta}J(\mathbf{\theta}) \otimes \nabla_\mathbf{\theta}J(\mathbf{\theta})$

$\mathbf{\theta} \gets \mathbf{\theta} - \eta \, \nabla_\mathbf{\theta}J(\mathbf{\theta}) \oslash {\sqrt{\mathbf{s} + \epsilon}}$
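
The effect of the exponential decay is easiest to see by comparing the two accumulators under a constant gradient. The sketch below is pure Python with scalar values; the helper names and the gradient value are illustrative assumptions.

```python
def adagrad_accumulate(s, g):
    # AdaGrad: s <- s + g * g (grows without bound)
    return s + g * g


def rmsprop_accumulate(s, g, beta=0.9):
    # RMSProp: s <- beta * s + (1 - beta) * g * g (exponentially decaying average)
    return beta * s + (1 - beta) * g * g


g = 2.0  # hypothetical constant gradient
s_adagrad, s_rmsprop = 0.0, 0.0
for _ in range(1000):
    s_adagrad = adagrad_accumulate(s_adagrad, g)
    s_rmsprop = rmsprop_accumulate(s_rmsprop, g)

print("AdaGrad s:", s_adagrad)
print("RMSProp s:", s_rmsprop)
```

The AdaGrad accumulator keeps growing with the number of steps, so the effective step size keeps shrinking, whereas the RMSProp accumulator settles near `g * g` and the effective step size stays roughly constant.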

In [10]:
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.1,
                                      momentum=0.9, decay=0.9, epsilon=1e-10)

## Adam Optimization
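Adam, which stands for adaptive moment estimation, combines the ideas of momentum optimization and RMSProp: like momentum optimization, it keeps track of an exponentially decaying average of past gradients, and like RMSProp, it keeps track of an exponentially decaying average of past squared gradients. In the steps below, $T$ is the iteration number (starting at 1); steps 3 and 4 compensate for the fact that $\mathbf{m}$ and $\mathbf{s}$ are initialized at 0 and are therefore biased toward 0 at the beginning of training.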

1. $\mathbf{m} \gets \beta_1 \mathbf{m} - (1 - \beta_1) \nabla_\mathbf{\theta}J(\mathbf{\theta})$

2. $\mathbf{s} \gets \beta_2 \mathbf{s} + (1 - \beta_2) \nabla_\mathbf{\theta}J(\mathbf{\theta}) \otimes \nabla_\mathbf{\theta}J(\mathbf{\theta})$

3. $\widehat{\mathbf{m}} \gets \dfrac{\mathbf{m}}{1 - {\beta_1}^T}$

4. $\widehat{\mathbf{s}} \gets \dfrac{\mathbf{s}}{1 - {\beta_2}^T}$

5. $\mathbf{\theta} \gets \mathbf{\theta} + \eta \, \widehat{\mathbf{m}} \oslash {\sqrt{\widehat{\mathbf{s}} + \epsilon}}$
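
The five steps can be sketched for a single scalar weight as shown below. This is a pure-Python illustration under stated assumptions, not the TensorFlow implementation: the function name `adam_step` is hypothetical, the default values for `beta1`, `beta2`, and `eps` are commonly used settings rather than values taken from this notebook, and the bias-corrected values are kept in separate variables so that the running averages are not overwritten.

```python
import math


def adam_step(theta, m, s, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight; t is the iteration number (from 1)."""
    m = beta1 * m - (1 - beta1) * grad         # step 1: average of gradients
    s = beta2 * s + (1 - beta2) * grad * grad  # step 2: average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # step 3: bias correction for m
    s_hat = s / (1 - beta2 ** t)               # step 4: bias correction for s
    theta = theta + eta * m_hat / math.sqrt(s_hat + eps)  # step 5: weight update
    return theta, m, s


theta, m, s = 1.0, 0.0, 0.0
theta, m, s = adam_step(theta, m, s, grad=250.0, t=1)  # hypothetical large gradient
print("theta after one step:", theta)
```

On the very first iteration the bias correction cancels the `(1 - beta)` factors, so the weight moves by approximately `eta` in the direction opposite to the gradient, whatever the gradient's magnitude.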

In [12]:
optimizer = tf.train.AdamOptimizer(learning_rate=0.1)

## Learning Rate Scheduling

In [13]:
tf.reset_default_graph()

n_inputs = 28 * 28  # MNIST
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None,), name="y")  # 1-D vector of class labels

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

In [14]:
with tf.name_scope("train"):
    initial_learning_rate = 0.1
    decay_steps = 10000
    decay_rate = 0.1  # the learning rate is multiplied by 0.1 every decay_steps steps
    global_step = tf.Variable(0, trainable=False, name="global_step")
    learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step,
                                               decay_steps, decay_rate)
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    training_op = optimizer.minimize(loss, global_step=global_step)
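
To see what schedule these settings produce, the decayed learning rate can be computed by hand. The sketch below is pure Python and assumes the formula documented for `tf.train.exponential_decay` in its default (non-staircase) mode, `learning_rate * decay_rate ** (global_step / decay_steps)`; the standalone function here is an illustration and does not call TensorFlow.

```python
def exponential_decay(initial_learning_rate, global_step, decay_steps, decay_rate):
    """Learning rate after global_step training steps (continuous decay)."""
    return initial_learning_rate * decay_rate ** (global_step / decay_steps)


# Same hyperparameters as in the training cell above.
for step in (0, 5000, 10000, 20000):
    lr = exponential_decay(0.1, step, decay_steps=10000, decay_rate=0.1)
    print(step, lr)
```

With these values the learning rate starts at 0.1 and is divided by 10 every 10,000 training steps, decreasing smoothly in between.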

In [19]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [21]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/")

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


In [22]:
n_epochs = 5
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_model_final.ckpt")

0 Test accuracy: 0.9561
1 Test accuracy: 0.972
2 Test accuracy: 0.9742
3 Test accuracy: 0.9778
4 Test accuracy: 0.9824
