Skip to content

Peer Review Report 1 -- Matt Hoffman #29

@colah

Description

@colah

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service them offer to the community.

Distill is grateful to the reviewer, Matt Hoffman, for taking the time to write such a thorough review.


A note on the relevance of convergence rates.

This article does not include the word “stochastic” anywhere. This is a problem, because

  1. In machine learning (and probably beyond), momentum is almost always applied to noisy gradients. (Quasi-Newton methods like LBFGS are usually more effective than SGD with momentum in the noise-free setting.)

  2. The theoretical results about momentum that the article presents do not apply in the presence of gradient noise.

That is, to the extent that momentum is useful in modern machine learning, it’s not because it yields better convergence rates. It’s because it makes it possible to speed up SGD’s lengthy initial search process. (This is noted in [1], the Sutskever et al. article.)

That’s not to say that the intuitions developed in this article aren’t relevant to the stochastic setting. They are, especially early in optimization when the magnitude of the gradient noise may be small relative to the magnitude of the oscillations due to using a large step size.

But it’s important to be clear that the convergence rate results are at best suggestive—they probably don’t apply in most practical applications.

Suggestions/observations on figures:

Teaser figure:

  • It might be interesting to reparameterize this so that alpha/(1-beta) is constant.

  • The most interesting stuff happens when beta is between 0.9 and 0.99, but it’s hard to get fine control in that region.

  • It would be nice to be able to set the step size high enough that the dynamics actually explode.

“Decomposing the error”:

  • If the largest eigenvalue is 100, shouldn’t the maximum stable step size be 0.02 instead of 2?

“Example - Polynomial Regression”:

  • Scrubbing up and down feels more natural to me than left to right, especially for the first .

Comments on the text:

Content:

  • “regions of f which aren’t scaled properly”, “In these unfortunate regions…”: I think “directions” is probably a better word than “regions” here. “Regions” implies that there are places where it’s not a problem.

  • “The iterates either jump between valleys”: What’s that mean? Are there multiple valleys? (I assume this is meant to describe wild oscillatory behavior.)

  • “The change is innocent”: Innocent of what?

  • 'Optimizers call this minor miracle “acceleration”.’: At first I thought “optimizers” meant algorithms, not people.

  • “But the truth, if anything, is the other way round. It is gradient descent which is the hack.” In what sense is GD “hackier" than GD+momentum? Working less well than an alternative does not make a method a hack.

  • “this is the speedup you get from the Fast Fourier Transform”: To be pedantic, the FFT is O(n log n), not O(n).

  • “think of A as your favorite model of curvature - the Hessian, Fisher Information Matrix [5], etc”: I know what you mean here, but not every reader will.

  • “captures all the key features of pathological curvature”: To the extent that you define curvature in terms of the Hessian, this is tautological. But “all the key features” sounds like studying quadratics is sufficient, even though it can’t teach you anything about other important issues (e.g., vanishing/exploding gradients, saddle points, etc.).

  • “the eigenvalues of A”: This is imprecise—the eigenvalues aren’t a space.

  • In “Choosing A Step-size”, sometimes the smallest eigenvalue is lambda_0 instead of lambda_1.

  • “It is a measure of how singular a matrix is.”: “how singular” should be “how close to singular”—a matrix is either singular or it isn’t.

  • “The above analysis reveals an interesting insight”, “Surprisingly, in many applications”, etc.: These sentences tell the reader how to feel, which bugs me. IMO, it’s best to let the reader decide whether something is surprising, interesting, remarkable, etc. (I know many people who agree. Of course, people still do it: https://nsaunders.wordpress.com/2013/07/16/interestingly-the-sentence-adverbs-of-pubmed-central/)

  • Polynomial regression example: Do you ever say where Q came from? I assume it’s from the eigendecomposition of the Hessian of the linear regression problem, but I don’t see that made explicit anywhere.

  • “starting at a simple initial point like 0 (call this a prior, if you like)”: I do not. A prior is a distribution, not an initialization. The point that shrinking towards 0 (either using Tikhonov regularization or early stopping) mimics maximum a-posteriori estimation with a normal prior is valid, but the initial point 0 is not a “prior”.

  • The dynamics of momentum: Again, it’d be nice to be specific about where Q came from. When using notation for the first time in a little while, it’s always good to remind the reader what it means.

  • “Momentum allows us to use a crank up the step-size up by a factor of 2 before diverging.”: It took me a moment to figure out how the figure demonstrated this—it might be good to point the reader to the upper right corner of the figure.

  • “Plug this into the convergence rate, and you get”: I think the labels for the convergence rates are backwards.

  • “Being at the knife’s edge of divergence, like in gradient descent, is a good place to be.” This is true for quadratics, but for non-convex problems it’s not always good advice. It’s hard to say why in general, but my (very hand-wavy) intuition is that using a very large step size and momentum quickly breaks near-symmetries in your random initialization, which makes SGD behave more greedily than it might otherwise.

Typos etc.:

  • “…descent as many virtues…” as => has.

  • “simple - when” should use an em dash, not a hyphen.

  • “optimizers old nemesis” needs an apostrophe.

  • “along certain directions grind to a halt” grind => grinds.

  • “short term memory” short term => short-term.

  • “Fortunately, momentum comes speeds things up significantly.”: delete “comes”?

  • “the Convex Rosenbrok”: Rosenbrok => Rosenbrock. Also, is this terminology novel?

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions