Breaking ADAM

In case you haven't heard, one of the top papers at ICLR 2018 (pronounced: eye-clear, who knew?) was On the Convergence of Adam and Beyond. In the paper, the authors identify a flaw in the convergence proof of the ubiquitous ADAM optimizer and give a simple example function on which ADAM fails to converge to the correct solution. We've seen how torchbearer can be used for simple function optimization before, and we can do something similar to reproduce the results from the paper.

Online Optimization

Online learning simply means learning from one example at a time, in sequence. The function given in the paper is defined as follows:

$f_t(x) = \begin{cases}1010x, & \text{for } t \; \texttt{mod} \; 101 = 1 \\ -10x, & \text{otherwise}\end{cases}$

We can then write this as a PyTorch model whose forward pass is a function of its single parameter x:

/_static/examples/amsgrad.py
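The bundled script isn't reproduced here, but a minimal sketch of such a model might look like the following (the class name Online and the internal step counter t are our own illustrative choices, and the ignored input argument assumes the trainer passes a placeholder batch when training without data):

    import torch
    import torch.nn as nn


    class Online(nn.Module):
        """Evaluates f_t(x) for the online setting on each forward pass."""

        def __init__(self):
            super().__init__()
            self.x = nn.Parameter(torch.zeros(1))  # the single parameter we optimise
            self.t = 0  # step counter used to select the branch of f_t

        def forward(self, _):
            # The data batch is ignored; the "output" is just the function value,
            # which we will later treat as the loss.
            self.t += 1
            if self.t % 101 == 1:
                return 1010 * self.x  # rare step with a large positive gradient
            return -10 * self.x       # common step with a small negative gradient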

We now define a loss (which simply returns the model output) and a metric which reports the current value of our parameter x:

/_static/examples/amsgrad.py
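As a hedged sketch (not the bundled code), these might look like the snippet below; the metric subclasses torchbearer's Metric base class and uses the to_dict decorator, and the exact metric interface and state keys are assumptions about torchbearer's API, while the name Estimate is our own:

    import torchbearer
    from torchbearer import metrics


    def loss(y_pred, y_true):
        # The model output is f_t(x) itself, so minimising it as a loss
        # optimises the online problem. y_true is unused here.
        return y_pred


    @metrics.to_dict
    class Estimate(metrics.Metric):
        """Reports the current value of the parameter x at each step."""

        def __init__(self):
            super().__init__('est')

        def process(self, state):
            # state[torchbearer.MODEL] is the module being trained
            return state[torchbearer.MODEL].x.data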

In the paper, x is constrained to the interval [-1, 1]. We don't strictly need to enforce this, but we can write a callback that greedily projects x back into that range whenever it steps outside, as follows:

/_static/examples/amsgrad.py
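One way to write this, sketched with torchbearer's on_step_training callback decorator and the MODEL state key (the function name greedy_update is our own):

    import torchbearer
    from torchbearer import callbacks


    @callbacks.on_step_training
    def greedy_update(state):
        # After each training step, project x back into [-1, 1].
        model = state[torchbearer.MODEL]
        if model.x > 1:
            model.x.data.fill_(1)
        elif model.x < -1:
            model.x.data.fill_(-1)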

Finally, we can train this model twice: once with ADAM and once with AMSGrad (included in PyTorch as an option to the Adam optimizer), with just a few lines:

/_static/examples/amsgrad.py
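This is not a copy of the bundled script, but a hedged sketch of those runs using the Online model, Estimate metric and greedy_update callback sketched above; the step count, learning rate and betas are illustrative choices, and AMSGrad is switched on via the amsgrad flag of torch.optim.Adam:

    import torch
    from torchbearer import Trial
    from torchbearer.callbacks import TensorBoard

    TRAINING_STEPS = 6000000  # illustrative; the paper uses a very long run

    for amsgrad in (False, True):  # False -> plain ADAM, True -> AMSGrad
        model = Online()
        optimiser = torch.optim.Adam(model.parameters(), lr=0.001,
                                     betas=(0.9, 0.99), amsgrad=amsgrad)
        trial = Trial(model, optimiser, loss, metrics=[Estimate()],
                      callbacks=[greedy_update,
                                 TensorBoard(write_graph=False,
                                             write_batch_metrics=True,
                                             write_epoch_metrics=False)])
        trial.for_train_steps(TRAINING_STEPS).run()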

Note that we have logged to TensorBoard here. After completion, running tensorboard --logdir logs and navigating to localhost:6006 shows a graph like the one in Figure 1 from the paper, where the top line is with ADAM and the bottom with AMSGrad.

Stochastic Optimization

To simulate a stochastic setting, the authors use a slight variant of the function which takes the large gradient with a small probability:

$f_t(x) = \begin{cases}1010x, & \text{with probability } 0.01 \\ -10x, & \text{otherwise}\end{cases}$

We can again formulate this as a PyTorch model:

/_static/examples/amsgrad.py
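Again as a sketch rather than the bundled code (the class name Stochastic is our own), the model simply replaces the step counter with a random draw:

    import random

    import torch
    import torch.nn as nn


    class Stochastic(nn.Module):
        """Evaluates the stochastic variant of f_t(x) on each forward pass."""

        def __init__(self):
            super().__init__()
            self.x = nn.Parameter(torch.zeros(1))

        def forward(self, _):
            if random.random() <= 0.01:
                return 1010 * self.x  # large gradient with probability 0.01
            return -10 * self.x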

Using the loss, callback and metric from our previous example, we can train with the following:

/_static/examples/amsgrad.py
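A hedged sketch of this, reusing the loss, Estimate metric, greedy_update callback and TRAINING_STEPS constant assumed in the earlier snippets, would be:

    for amsgrad in (False, True):  # False -> plain ADAM, True -> AMSGrad
        model = Stochastic()
        optimiser = torch.optim.Adam(model.parameters(), lr=0.001,
                                     betas=(0.9, 0.99), amsgrad=amsgrad)
        trial = Trial(model, optimiser, loss, metrics=[Estimate()],
                      callbacks=[greedy_update,
                                 TensorBoard(write_graph=False,
                                             write_batch_metrics=True,
                                             write_epoch_metrics=False)])
        trial.for_train_steps(TRAINING_STEPS).run()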

After execution has finished, again running tensorboard --logdir logs and navigating to localhost:6006, we see another graph similar to the stochastic setting in Figure 1 of the paper, where the top line is with ADAM and the bottom with AMSGrad.

Conclusions

So, whatever your thoughts on the AMSGrad optimizer in practice, it's probably the sign of a good paper that you can re-implement the example and get very similar results without having to try too hard and (thanks to torchbearer) only writing a small amount of code. The paper includes some more complex, 'real-world' examples; can you re-implement those too?

Source Code

The source code for this example can be downloaded below:

Download Python source code: amsgrad.py (/_static/examples/amsgrad.py)