In [8]:
import tensorflow as tf
import inspect

## Stochastic Gradient Descent (SGD)

We have to implement the "stochastic" part ourselves, because TensorFlow's optimizer only applies the update rule to whatever gradient it is given; how samples are drawn for each step is up to us. "Stochastic" can be either one-sample-based or minibatch-based. Namely, $dW$ in the following is the gradient computed on one sample, or on one batch of samples.

$$W = W - \alpha \cdot dW$$
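
As a minimal sketch of the minibatch variant (plain Python rather than TensorFlow; the function name `minibatch_sgd`, the toy data, and the one-parameter model $y = w x$ are assumptions made only for illustration), the update above can be applied to one shuffled batch at a time:

```python
import random

def minibatch_sgd(xs, ys, alpha=0.05, batch_size=2, epochs=200, seed=0):
    """Fit y = w * x by minimizing mean squared error with minibatch SGD."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # the "stochastic" part: random batch composition
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # dW: gradient of the mean squared error over this batch only
            dw = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w = w - alpha * dw  # W = W - alpha * dW
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]  # toy data generated with w = 3
w_hat = minibatch_sgd(xs, ys)
```

Setting `batch_size=1` gives the one-sample-based variant; the update line itself does not change.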

In [15]:
tf.train.GradientDescentOptimizer
inspect.signature(tf.train.GradientDescentOptimizer)

<Signature (learning_rate, use_locking=False, name='GradientDescent')>

## Momentum

$$
\begin{cases}
v = \beta v + (1 - \beta)dW \qquad \text{(velocity, averaging over previous gradients)} \\
W = W - \alpha \cdot v
\end{cases}
$$
Hyperparameters: $\alpha, \beta$. 

$\beta = 0.9$ is recommended by the literature, and practitioners normally don't tune this parameter, so we only need to tune the learning rate $\alpha$. By the way, as a rule of thumb $\beta = 0.9$ means we average over roughly the previous $1/(1-\beta) = 10$ gradients: earlier gradients receive exponentially smaller weights, so the effect resembles a **sliding window** of size 10.
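
To make the window intuition concrete, here is a small sketch (the variable names are assumptions for illustration). Unrolling the velocity formula shows that the gradient from $k$ steps ago enters $v$ with weight $(1-\beta)\beta^k$:

```python
beta = 0.9

# Weight that the gradient from k steps ago receives in v
weights = [(1 - beta) * beta ** k for k in range(100)]

# Rule-of-thumb window size
effective_window = 1 / (1 - beta)

# Share of the total weight carried by the 10 most recent gradients
recent_share = sum(weights[:10])
```

Here `recent_share` equals $1 - \beta^{10} \approx 0.65$, so the "window of 10" is a soft one: the last 10 gradients dominate, but older ones still contribute.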

Two more side notes. First, sometimes we see 
$$v = \beta v + dW$$
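which is equivalent to the above up to a rescaling of the learning rate: this $v$ is larger by a factor of $1/(1-\beta)$, so with the same $\beta = 0.9$ the best learning rate $\alpha$ would be different (smaller by a factor of $1-\beta$).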
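
A quick numerical check of this equivalence, as a sketch on a fixed, made-up gradient sequence (the function names and numbers are assumptions for illustration): running the second form with $\alpha$ multiplied by $(1-\beta)$ reproduces the first form.

```python
def momentum_averaged(grads, alpha, beta, w=0.0):
    """v = beta * v + (1 - beta) * dW"""
    v = 0.0
    for dw in grads:
        v = beta * v + (1 - beta) * dw
        w = w - alpha * v
    return w

def momentum_summed(grads, alpha, beta, w=0.0):
    """v = beta * v + dW"""
    v = 0.0
    for dw in grads:
        v = beta * v + dw
        w = w - alpha * v
    return w

grads = [1.0, -0.5, 2.0, 0.25]
beta = 0.9
w_a = momentum_averaged(grads, alpha=0.1, beta=beta)
w_b = momentum_summed(grads, alpha=0.1 * (1 - beta), beta=beta)
```

Both calls end at the same `w` up to floating-point rounding.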

Second, sometimes after computing $v$, people also apply bias correction
$$v = \frac{v}{1-\beta^t}$$
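But in practice we usually skip this: the denominator approaches 1 after the first few dozen iterations (with $\beta = 0.9$ it is about 0.65 at $t = 10$ and above 0.99 by $t = 50$), so the correction only matters early in training.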
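
How quickly the denominator approaches 1 can be checked directly. A small sketch (the chosen iteration counts are arbitrary):

```python
beta = 0.9

# Bias-correction denominator 1 - beta**t at a few iteration counts
denominators = {t: 1 - beta ** t for t in (1, 10, 50, 100)}
```

For $\beta = 0.9$ the denominator is 0.1 at $t = 1$, so the first velocity is scaled up tenfold, and it exceeds 0.99 by $t = 50$.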

In [16]:
tf.train.MomentumOptimizer
inspect.signature(tf.train.MomentumOptimizer)

<Signature (learning_rate, momentum, use_locking=False, name='Momentum', use_nesterov=False)>

## AdaGrad

$$
\begin{cases}
s = s + (dW)^2 \\
W = W - \frac{\alpha \cdot dW}{\sqrt{s} + \epsilon}
\end{cases}
$$
where $\epsilon = 10^{-8}$ is normally used to avoid division by zero when $s$ is tiny.

Because $s$ only ever grows, the effective learning rate ($\frac{\alpha}{\sqrt{s}}$) keeps decreasing, which I don't like because after a certain point learning may effectively stop.
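
The shrinking step size is easy to see in a sketch with a constant gradient (plain Python; `adagrad_step` and the constant gradient of 1 are assumptions for illustration, not TensorFlow's implementation):

```python
import math

def adagrad_step(w, dw, s, alpha=0.1, eps=1e-8):
    s = s + dw ** 2                            # accumulate squared gradients
    w = w - alpha * dw / (math.sqrt(s) + eps)  # scaled update
    return w, s

w, s = 0.0, 0.0
step_sizes = []
for _ in range(5):
    w_new, s = adagrad_step(w, dw=1.0, s=s)
    step_sizes.append(abs(w_new - w))
    w = w_new
```

With a constant gradient the $t$-th step has size of about $\alpha / \sqrt{t}$, so the steps shrink even though the gradient itself never changes.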

In [17]:
tf.train.AdagradOptimizer
inspect.signature(tf.train.AdagradOptimizer)

<Signature (learning_rate, initial_accumulator_value=0.1, use_locking=False, name='Adagrad')>

## RMSProp (Root Mean Square Propagation)

Borrowing AdaGrad's idea that we should scale the learning rate, we compute $s$ as an exponentially decaying average of the squared gradients instead of a running sum.

$$
\begin{cases}
s = \beta s + (1-\beta)(dW)^2 \\
W = W - \frac{\alpha \cdot dW}{\sqrt{s} + \epsilon}
\end{cases}
$$
Hyperparameters: $\alpha$, $\beta$.

NOTE that $dW$ is a vector with one gradient entry per param, and the above is element-wise computation, hence the scaled learning rate ($\frac{\alpha}{\sqrt{s}}$) is different for each param direction. The motivation of RMSProp is to damp the oscillation of param updates when gradients are big but point in almost opposite directions between two consecutive iterations.
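
The per-parameter scaling can be illustrated with a sketch on a two-element parameter vector (plain Python lists; `rmsprop_step` and the gradient values are assumptions for illustration, not TensorFlow's implementation):

```python
import math

def rmsprop_step(w, dw, s, alpha=0.01, beta=0.9, eps=1e-8):
    # Element-wise decaying average of squared gradients
    s = [beta * s_i + (1 - beta) * g ** 2 for s_i, g in zip(s, dw)]
    # Element-wise scaled update
    w = [w_i - alpha * g / (math.sqrt(s_i) + eps) for w_i, g, s_i in zip(w, dw, s)]
    return w, s

w, s = [0.0, 0.0], [0.0, 0.0]
# One direction has a huge gradient, the other a tiny one
w, s = rmsprop_step(w, dw=[100.0, 0.01], s=s)
```

After this step both entries of `w` have moved by almost the same amount, although the raw gradients differ by four orders of magnitude.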

In [18]:
tf.train.RMSPropOptimizer
inspect.signature(tf.train.RMSPropOptimizer)

<Signature (learning_rate, decay=0.9, momentum=0.0, epsilon=1e-10, use_locking=False, centered=False, name='RMSProp')>

We see that TensorFlow's RMSProp also has a 'momentum' argument (although it defaults to 0). Some of the literature also adds momentum to RMSProp, which is then very similar to Adam.

## Adam (Adaptive Moment Estimation)

We combine the ideas of momentum and RMSProp, and we also apply bias correction to both running averages.

$$
\begin{cases}
v = \beta_1 v + (1 - \beta_1)dW \\
s = \beta_2 s + (1 - \beta_2)(dW)^2 \\
\hat{v} = \frac{v}{1 - \beta_1^t} \\
\hat{s} = \frac{s}{1 - \beta_2^t} \\
W = W - \frac{\alpha \cdot \hat{v}}{\sqrt{\hat{s}} + \epsilon}
\end{cases}
$$
Hyperparameters: $\alpha, \beta_1, \beta_2$.
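
A minimal scalar sketch of one Adam step (plain Python, not TensorFlow's implementation; `adam_step`, `v_hat`, `s_hat`, and the toy loss $w^2$ are assumptions for illustration). Note that the bias-corrected values are kept as separate copies, so the raw averages $v$ and $s$ are what carry over to the next iteration:

```python
import math

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dw       # momentum-like average
    s = beta2 * s + (1 - beta2) * dw ** 2  # RMSProp-like average
    v_hat = v / (1 - beta1 ** t)           # bias-corrected copies,
    s_hat = s / (1 - beta2 ** t)           # t counts from 1
    w = w - alpha * v_hat / (math.sqrt(s_hat) + eps)
    return w, v, s

w, v, s = 1.0, 0.0, 0.0
for t in range(1, 4):
    dw = 2.0 * w  # gradient of the toy loss w**2
    w, v, s = adam_step(w, dw, v, s, t)
```

On the very first step the corrections give $\hat{v} = dW$ and $\hat{s} = (dW)^2$, so the step size is about $\alpha$ regardless of the gradient's magnitude.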

In [19]:
tf.train.AdamOptimizer
inspect.signature(tf.train.AdamOptimizer)

<Signature (learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')>

## AdaDelta

AdaDelta is similar to RMSProp, but it also introduces an additional way to adapt the learning rate.

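$$
\begin{cases}
s = \beta s + (1-\beta)(dW)^2 \qquad \text{(Accumulate squared gradients, RMSProp-like step)}\\
v = -\frac{\sqrt{x+\epsilon}\cdot dW}{\sqrt{s + \epsilon}} \qquad \text{(uses x from the previous iteration)}\\
x = \beta x + (1-\beta)v^2 \qquad \text{(Accumulate squared updates)}\\
W = W + v
\end{cases}
$$

The running average $x$ of squared updates takes over the role of the learning rate: $\sqrt{x+\epsilon}$ estimates the typical size of recent updates, so the ratio $\sqrt{x+\epsilon}/\sqrt{s+\epsilon}$ gives the step the same units as $W$ without a hand-tuned $\alpha$.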
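
A scalar sketch of these four steps (plain Python; `adadelta_step`, the value `eps=1e-6`, and the toy loss $w^2$ are assumptions for illustration, and the denominator is assumed to use the accumulated squared gradients $s$ from the first line). There is no learning rate in this variant:

```python
import math

def adadelta_step(w, dw, s, x, beta=0.95, eps=1e-6):
    s = beta * s + (1 - beta) * dw ** 2                # accumulate squared gradients
    v = -math.sqrt(x + eps) / math.sqrt(s + eps) * dw  # update, using the previous x
    x = beta * x + (1 - beta) * v ** 2                 # accumulate squared updates
    w = w + v
    return w, s, x

w, s, x = 1.0, 0.0, 0.0
history = [w]
for _ in range(3):
    dw = 2.0 * w  # gradient of the toy loss w**2
    w, s, x = adadelta_step(w, dw, s, x)
    history.append(w)
```

Because $x$ starts at zero, the size of the first update is set almost entirely by $\epsilon$, which is one reason implementations differ in their default for it.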

In [20]:
tf.train.AdadeltaOptimizer
inspect.signature(tf.train.AdadeltaOptimizer)

<Signature (learning_rate=0.001, rho=0.95, epsilon=1e-08, use_locking=False, name='Adadelta')>

In principle, AdaDelta doesn't need a learning rate param, but different sources give different formulas for AdaDelta, and some of them multiply the update $v$ in the second formula above by $\alpha$. The learning_rate argument in the signature suggests that TensorFlow adopts that variant.

## Which one to use?

According to Andrew Ng, RMSProp and Adam work well in most scenarios!