### Adam: A Method for Stochastic Optimization

**Authors**: Diederik Kingma, Jimmy Ba  
**Link**: https://arxiv.org/pdf/1412.6980.pdf

---

- Introduced a new stochastic optimization method `ADAM` (Adaptive Moment Estimation) designed to combine the advantages of `AdaGrad` and `RMSProp`
    - *Stochastic Optimization Methods* - Find parameters that minimize or maximize a stochastic function
    - *Stochastic Function* - A non-deterministic function that generates different results for the same set of input parameters. For example, the loss differs between mini-batches even when the weights remain the same, and for the same mini-batch the loss can differ between evaluations if dropout is used.
    
- `ADAM` advantages
    - Ability to deal with sparse gradients (similar to `AdaGrad`)
    - Ability to deal with non-stationary objectives (similar to `RMSProp`)
    - Faster convergence than the other methods compared in the paper's experiments
    - Straightforward to implement and requires little memory
    
---

**Principle**
- SGD updates the parameters as per $\theta \leftarrow \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$, where $\theta$ are the parameters, $\eta$ is the learning rate, and $\nabla_\theta J(\theta; x^{(i)}, y^{(i)})$ is the gradient of the cost function $J$ with respect to the parameters $\theta$ for training example $x^{(i)}$ and label $y^{(i)}$. For simplicity, $\nabla_\theta J(\theta; x^{(i)}, y^{(i)})$ is abbreviated as $\nabla_\theta J(\theta)$.

- `ADAM` updates the parameter similarly with some clever modifications
    - Stochasticity comes from using mini-batches
    - It updates exponential moving averages of the gradient $m_t$ and the squared gradient $v_t$. The moving averages are the estimates of the mean (*first moment*) and the uncentered variance (*second (raw) moment*) of the gradient
    - Parameters are updated as per $\theta_t \leftarrow \theta_{t-1} - \eta \cdot \frac{\widehat{m_t}}{\sqrt{\widehat{v_t}}+\epsilon}$, where $\widehat{m_t}$ and $\widehat{v_t}$ are bias-corrected estimates of $m_t$ and $v_t$
    - Assuming $\epsilon = 0$, the effective step taken in parameter space at time $t$ is $\Delta_t = \eta \cdot \frac{\widehat{m_t}}{\sqrt{\widehat{v_t}}}$
    - Signal to Noise Ratio $\rightarrow \frac{\widehat{m_t}}{\sqrt{\widehat{v_t}}}$
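
To make the signal-to-noise ratio concrete, here is a minimal sketch in plain Python. The function name `signal_to_noise` and the two toy gradient sequences are assumptions made for illustration, not something taken from the paper. It feeds scalar gradients through the two moving averages and returns $\widehat{m_t} / \sqrt{\widehat{v_t}}$.

```python
import math


def signal_to_noise(gradients, beta1=0.9, beta2=0.999):
    """Return m_hat / sqrt(v_hat) after feeding a non-empty list of scalar gradients.

    Hypothetical helper for illustration only.
    """
    m = 0.0  # moving average of the gradient
    v = 0.0  # moving average of the squared gradient
    t = 0
    for g in gradients:
        t += 1
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second raw moment
    return m_hat / math.sqrt(v_hat)


consistent = [1.0] * 100   # gradient always points the same way
noisy = [1.0, -1.0] * 50   # gradient flips sign at every step

ratio_consistent = signal_to_noise(consistent)  # close to 1
ratio_noisy = signal_to_noise(noisy)            # magnitude well below 1
```

Both sequences have the same gradient magnitude, so $\widehat{v_t}$ is about the same. Only the agreement between successive gradients differs, and that is what scales the step.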

---

**Bias Correction**

- Moving averages $m_0$ and $v_0$ are initialized as (vectors of) zeros - This makes moving averages biased toward zero during the initial time steps

* Bias correction
    * Initializing the `mean` and `variance` vectors to zero is an easy and logical step, but has the disadvantage that bias is introduced.
    * E.g. at the first timestep, the mean of the gradient would be `mean = beta1 * 0 + (1 - beta1) * g`, with `beta1=0.9` then: `mean = 0.1 * g`. So `0.1g`, not `g`. Both the mean and the variance are biased (towards 0).
    * This seems pretty harmless, but it can be shown that it lowers the convergence speed of the algorithm by quite a bit.
    * To fix this, they apply bias corrections to the mean and the variance:
      * `correctedMean = mean / (1-beta1^t)` (where `t` is the timestep).
      * `correctedVariance = variance / (1-beta2^t)`.
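      * Both formulas are applied at every timestep after the exponential moving averages have been updated. The corrected values are used only in the parameter update and are not fed back into the moving averages, so they do not influence the next timestep.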
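
The effect of the correction can be checked numerically. The sketch below is plain Python and the helper name `ema_with_bias_correction` is an assumption for illustration. It tracks a constant gradient `g = 2.0` with `beta = 0.9` and records both the raw and the bias-corrected moving average at every timestep.

```python
def ema_with_bias_correction(values, beta):
    """Return (raw, corrected) moving-average sequences for a list of numbers.

    Hypothetical helper: `raw` starts from zero and is biased toward zero,
    `corrected` divides by (1 - beta^t) as in the formulas above.
    """
    avg = 0.0
    raw, corrected = [], []
    for t, x in enumerate(values, start=1):
        avg = beta * avg + (1 - beta) * x
        raw.append(avg)
        # The correction is computed from avg but never written back into it.
        corrected.append(avg / (1 - beta ** t))
    return raw, corrected


raw, corrected = ema_with_bias_correction([2.0] * 5, beta=0.9)
# raw[0] is about 0.2 (that is 0.1 * g), while corrected[0] is about 2.0 (that is g)
```

For a constant input the corrected estimate matches the true value at every step, while the raw estimate only approaches it as `beta^t` shrinks.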

![Algorithm](images/Adam__algorithm.png?raw=true "Algorithm")
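
The update described in this section can be sketched in a few lines of plain Python. This is an illustrative reading of the notes, not the authors' reference code: the names `adam_step` and `minimize_quadratic` and the toy objective are assumptions, and the defaults are the values suggested in the paper (`lr=0.001`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`).

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a list of parameters; returns (theta, m, v).

    `t` is the 1-based timestep. All lists must have the same length.
    """
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g       # moving average of the gradient
        v_i = beta2 * v_i + (1 - beta2) * g * g   # moving average of the squared gradient
        m_hat = m_i / (1 - beta1 ** t)            # bias-corrected mean
        v_hat = v_i / (1 - beta2 ** t)            # bias-corrected uncentered variance
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)                         # uncorrected values carry over
        new_v.append(v_i)
    return new_theta, new_m, new_v


def minimize_quadratic(steps=2000, lr=0.01):
    """Toy usage: minimize J(theta) = sum(theta_i^2), whose gradient is 2 * theta."""
    theta = [1.0, -2.0]
    m = [0.0, 0.0]
    v = [0.0, 0.0]
    for t in range(1, steps + 1):
        grad = [2.0 * p for p in theta]
        theta, m, v = adam_step(theta, grad, m, v, t, lr=lr)
    return theta
```

At `t = 1` the bias-corrected ratio equals `g / |g|`, so the first step has magnitude close to `lr` regardless of the gradient's scale.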


# Summary



* How
  * Basic principle
    * Standard SGD just updates the parameters based on `parameters = parameters - learningRate * gradient`.
    * Adam operates similarly, but adds more "cleverness" to the rule.
    * It assumes that the gradient values have means and variances and tries to estimate these values.
      * Recall here that the function to optimize is stochastic, so there is some randomness in the gradients.
      * The mean is also called "the first moment".
      * The variance used here is the uncentered variance (the mean of the squared gradient), also called "the second (raw) moment".
    * Then an update rule very similar to SGD would be `parameters = parameters - learningRate * means`.
    * They instead use the update rule `parameters = parameters - learningRate * means/sqrt(variances)`.
      * They call `means/sqrt(variances)` a 'Signal to Noise Ratio'.
      * Basically, if the variance of a specific parameter's gradient is high, it is pretty unclear how the parameter should be changed. So we choose a small step size in the update rule via `learningRate * mean/sqrt(highValue)`.
      * If the variance is low, it is easier to predict how far to "move", so we choose a larger step size via `learningRate * mean/sqrt(lowValue)`.
  * Exponential moving averages
    * In order to approximate the mean and variance values you could simply save the last `T` gradients and then average the values.
    * That however is a pretty bad idea, because it can lead to high memory demands (e.g. for millions of parameters in CNNs).
    * A simple average also has the disadvantage that it would completely ignore all gradients before `T` and weight all of the last `T` gradients identically. In reality, you might want to give more weight to the last couple of gradients.
    * Instead, they use an exponential moving average, which fixes both problems and simply updates the average at every timestep via the formula `avg = alpha * avg + (1 - alpha) * newValue`.
    * Let the gradient at timestep (batch) `t` be `g`, then we can approximate the mean and variance values using:
      * `mean = beta1 * mean + (1 - beta1) * g`
      * `variance = beta2 * variance + (1 - beta2) * g^2`.
      * `beta1` and `beta2` are hyperparameters of the algorithm. Good values for them seem to be `beta1=0.9` and `beta2=0.999`.
      * At the start of the algorithm, `mean` and `variance` are initialized to zero-vectors.
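
The trade-off between the two averaging schemes can be sketched as follows. The class names `WindowAverage` and `ExponentialAverage` are assumptions made for this illustration. The windowed version has to store the last `T` values per tracked quantity, while the exponential version stores a single number.

```python
from collections import deque


class WindowAverage:
    """Plain average over the last T values; memory grows with T."""

    def __init__(self, T):
        self.buf = deque(maxlen=T)  # oldest value is dropped automatically

    def update(self, x):
        self.buf.append(x)
        return sum(self.buf) / len(self.buf)


class ExponentialAverage:
    """Exponential moving average; stores one number, recent values weigh more."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.avg = 0.0  # zero initialization, as in the notes above

    def update(self, x):
        self.avg = self.alpha * self.avg + (1 - self.alpha) * x
        return self.avg


window = WindowAverage(T=3)
window_results = [window.update(x) for x in [1.0, 2.0, 3.0, 4.0]]

ema = ExponentialAverage(alpha=0.5)
ema_results = [ema.update(x) for x in [4.0, 4.0]]
```

In the exponential version a value seen `k` steps ago has weight `(1 - alpha) * alpha^k`, so old gradients fade out gradually instead of being cut off after `T` steps.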
  