https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/

- `The Adam optimization algorithm is an extension to stochastic gradient descent (SGD) that has recently seen broader adoption for deep learning applications in computer vision and natural language processing.`

<h2>How Does Adam Work?</h2>

        Adam differs from classical stochastic gradient descent.

`Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates and the learning rate does not change during training.`

        By contrast, Adam maintains a learning rate for each network weight (parameter) and adapts it separately as learning unfolds.

***`The Adam method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.`***
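
The update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `adam_step`, the list-based parameter layout, and the toy quadratic objective are all assumptions made for the example. The moment estimates `m` and `v` are assumed to start at zero, and the step counter `t` is assumed to start at 1.

```python
import math

def adam_step(params, grads, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Return updated (params, m, v) after one Adam step; t starts at 1."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g       # first moment: running mean of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g   # second moment: running mean of squared gradients
        m_hat = m_i / (1 - beta1 ** t)            # bias-corrected first moment
        v_hat = v_i / (1 - beta2 ** t)            # bias-corrected second moment
        # Each parameter gets its own effective step size alpha / (sqrt(v_hat) + epsilon).
        new_params.append(p - alpha * m_hat / (math.sqrt(v_hat) + epsilon))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Toy objective (hypothetical): f(w) = w0^2 + 10 * w1^2
params = [1.0, 1.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
for t in range(1, 2001):
    grads = [2 * params[0], 20 * params[1]]
    params, m, v = adam_step(params, grads, m, v, t, alpha=0.01)

final_loss = params[0] ** 2 + 10 * params[1] ** 2
```

Although the two coordinates have gradients that differ by a factor of ten, the first update moves each of them by roughly `alpha`, because the step is normalized by the square root of the second-moment estimate.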

<h2>Adam is Effective </h2>

    Adam is a popular algorithm in the field of deep learning because it achieves good results fast.

    Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods.

`When introducing the algorithm, the authors list the attractive benefits of using Adam on non-convex optimization problems, as follows:`

        Straightforward to implement.
        
        Computationally efficient.
        
        Little memory requirements.
        
        Invariant to diagonal rescale of the gradients.
        
        Well suited for problems that are large in terms of data and/or parameters.
        
        Appropriate for non-stationary objectives.
        
        Appropriate for problems with very noisy and/or sparse gradients.
        
        Hyper-parameters have intuitive interpretation and typically require little tuning.
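
The "invariant to diagonal rescale of the gradients" property from the list above can be checked numerically for a single coordinate. The sketch below is an assumption-laden illustration: `adam_updates` is a hypothetical helper, the gradient values are made up, and the scaling factor of 1000 is arbitrary. The invariance is only approximate here because `epsilon` is not rescaled along with the gradients.

```python
import math

def adam_updates(grads, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Return the Adam step taken at each iteration for a scalar gradient sequence."""
    m = v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        steps.append(alpha * m_hat / (math.sqrt(v_hat) + epsilon))
    return steps

grads = [0.5, -1.2, 0.8, 2.0, -0.3]    # made-up gradient sequence
scaled = [1000.0 * g for g in grads]   # same gradients, rescaled by a constant

base_steps = adam_updates(grads)
scaled_steps = adam_updates(scaled)

# Largest difference between the two step sequences; expected to be negligible.
max_diff = max(abs(a - b) for a, b in zip(base_steps, scaled_steps))
```

Because the scale factor appears in both the numerator (first moment) and the denominator (square root of the second moment), it cancels, so the two step sequences agree up to the small effect of `epsilon`.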


<h2>Adam Configuration Parameters</h2>

- `alpha` :

            Also referred to as the learning rate or step size. The proportion by which weights are updated (e.g. 0.001). Larger values (e.g. 0.3) result in faster initial learning before the rate is updated. Smaller values (e.g. 1.0E-5) slow learning right down during training.
            
- `beta1` :
        
            The exponential decay rate for the first moment estimates (e.g. 0.9).

- `beta2` :

            The exponential decay rate for the second-moment estimates (e.g. 0.999). This value should be set close to 1.0 on problems with a sparse gradient (e.g. NLP and computer vision problems).

- `epsilon` :

            A very small number that prevents division by zero in the implementation (e.g. 1e-8).

`Further, learning rate decay can also be used with Adam. The paper uses a decay rate alpha = alpha/sqrt(t) updated each epoch (t) for the logistic regression demonstration.`
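
As a rough illustration of that decay rule, the sketch below computes the step size for a few epochs. It assumes the rule is read as the base rate divided by `sqrt(t)`, with `t` counted from 1, rather than as a compounding update of the previous value. The function name `decayed_alpha` and the chosen epochs are hypothetical.

```python
import math

def decayed_alpha(alpha0, t):
    """Step size for epoch t (1-indexed), assuming alpha_t = alpha0 / sqrt(t)."""
    if t < 1:
        raise ValueError("t must be >= 1")
    return alpha0 / math.sqrt(t)

# Step sizes at epochs 1, 4, 25 and 100 for a base rate of 0.001
schedule = [decayed_alpha(0.001, t) for t in (1, 4, 25, 100)]
```

Under this reading the rate halves by epoch 4 and falls to a tenth of the base value by epoch 100, so most of the reduction happens early in training.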


<h3>The Adam paper suggests:</h3>

***`Good default settings for the tested machine learning problems are alpha=0.001, beta1=0.9, beta2=0.999 and epsilon=10^-8`***

`The TensorFlow documentation suggests some tuning of epsilon:`

    The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.

*We can see that the popular deep learning libraries generally use the default parameters recommended by the paper.*

    TensorFlow: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08.

    Keras: lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0.

    Blocks: learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-08, decay_factor=1.

    Lasagne: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08

    Caffe: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08

    MxNet: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8

    Torch: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8

<h2>Developers' Views:</h2>

    Insofar, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. Its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice.
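
The bias correction mentioned in this quote can be illustrated with a small sketch. It only shows the basic mechanism, namely that a moving average initialized at zero underestimates the true mean until it is divided by `1 - beta^t`. It does not reproduce the late-training, sparse-gradient comparison with RMSprop. The helper name `moment_estimates` and the constant gradient of 1.0 are assumptions for the example.

```python
def moment_estimates(grads, beta=0.9):
    """Return (raw, corrected) exponential moving averages of a gradient sequence."""
    m = 0.0
    raw, corrected = [], []
    for t, g in enumerate(grads, start=1):
        m = beta * m + (1 - beta) * g
        raw.append(m)                          # biased toward the zero initialization
        corrected.append(m / (1 - beta ** t))  # bias-corrected estimate
    return raw, corrected

# Constant gradient of 1.0 for five steps (hypothetical input)
raw, corrected = moment_estimates([1.0] * 5)
```

For this input the raw average starts near 0.1 and only slowly approaches 1.0, while the corrected estimate stays at about 1.0 from the first step. The same correction applied to the second moment, where the decay rate is typically 0.999, removes a much larger initial bias.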

    In practice Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSProp. However, it is often also worth trying SGD+Nesterov Momentum as an alternative.

    The two recommended updates to use are either SGD+Nesterov Momentum or Adam.