Peer Review Report 1 -- Matt Hoffman #29
The following peer review was solicited as part of the Distill review process.
The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
Distill is grateful to the reviewer, Matt Hoffman, for taking the time to write such a thorough review.
A note on the relevance of convergence rates.
This article does not include the word “stochastic” anywhere. This is a problem, because in most modern machine-learning applications gradients are estimated stochastically from minibatches, and the accelerated convergence rates analyzed here do not carry over to that stochastic setting.
That is, to the extent that momentum is useful in modern machine learning, it’s not because it yields better convergence rates. It’s because it makes it possible to speed up SGD’s lengthy initial search process. (This is noted in the Sutskever et al. article.)
That’s not to say that the intuitions developed in this article aren’t relevant to the stochastic setting. They are, especially early in optimization when the magnitude of the gradient noise may be small relative to the magnitude of the oscillations due to using a large step size.
But it’s important to be clear that the convergence rate results are at best suggestive—they probably don’t apply in most practical applications.
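To make the contrast concrete, here is a minimal toy sketch (mine, not part of the article or the analysis it cites) comparing gradient descent and heavy-ball momentum on an ill-conditioned quadratic, with and without additive gradient noise. The eigenvalues, step sizes, and noise scale are arbitrary illustrative choices.

```python
import numpy as np

# Minimal sketch: GD vs. heavy-ball momentum on f(w) = 0.5 * w^T diag(lams) w,
# with optional additive gradient noise. All constants are illustrative.
rng = np.random.default_rng(0)
lams = np.array([1.0, 100.0])                      # eigenvalues; condition number 100

sqrt_min, sqrt_max = np.sqrt(lams.min()), np.sqrt(lams.max())
alpha_mom = 4.0 / (sqrt_min + sqrt_max) ** 2       # textbook-optimal momentum step
beta_mom = ((sqrt_max - sqrt_min) / (sqrt_max + sqrt_min)) ** 2
alpha_gd = 2.0 / (lams.min() + lams.max())         # textbook-optimal GD step

def final_suboptimality(alpha, beta, noise=0.0, steps=200):
    w, z = np.ones(2), np.zeros(2)
    for _ in range(steps):
        g = lams * w + noise * rng.standard_normal(2)   # exact or noisy gradient
        z = beta * z + g                                 # beta = 0 recovers plain GD
        w = w - alpha * z
    return 0.5 * np.sum(lams * w ** 2)                   # f(w) - f(w*)

for noise in (0.0, 1.0):
    print(f"noise={noise}: GD {final_suboptimality(alpha_gd, 0.0, noise):.1e}, "
          f"momentum {final_suboptimality(alpha_mom, beta_mom, noise):.1e}")
```

Without noise, momentum ends up many orders of magnitude closer to the optimum after the same number of steps; with noise, both methods stall at a noise floor and the dramatic gap disappears.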
Suggestions/observations on figures:
“Decomposing the error”:
“Example - Polynomial Regression”:
Comments on the text:
Thank you very much, Matt, for the useful comments!
I will formulate a more detailed response soon. The stylistic tips are appreciated, and I will consider them carefully.
A quick response to the main point on stochastic approximation. As you point out, these results are still useful early in optimization. Where I disagree is with the claim that these rates do not hold at all in the stochastic setting. They still hold in the early parts of optimization, just not asymptotically. There is a detailed analysis of this in Section 4 of Flammarion and Bach, https://arxiv.org/abs/1504.01577. The gist is that these rates still hold in expectation up to the point where the iterates stall; see Eq. 13 for the non-strongly-convex case. A rough illustration of that behavior is sketched below.
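The sketch below is only a toy simulation of mine, not Flammarion and Bach's analysis: averaging momentum runs with noisy gradients on a 1-D quadratic shows the expected suboptimality falling at roughly the deterministic rate before stalling at a noise floor. All constants are arbitrary.

```python
import numpy as np

# Toy illustration: expected suboptimality of momentum on f(w) = 0.5 * lam * w^2
# with noisy gradients decays roughly at the deterministic rate, then stalls.
rng = np.random.default_rng(1)
lam, alpha, beta, sigma = 1.0, 0.1, 0.9, 0.1       # arbitrary illustrative constants
trials, steps = 500, 400
f_avg = np.zeros(steps)

for _ in range(trials):
    w, z = 10.0, 0.0
    for t in range(steps):
        g = lam * w + sigma * rng.standard_normal()  # noisy gradient
        z = beta * z + g
        w -= alpha * z
        f_avg[t] += 0.5 * lam * w ** 2 / trials      # average over trials

for t in (0, 50, 100, 200, steps - 1):
    print(f"step {t:3d}: E[f(w)] ~ {f_avg[t]:.4f}")
```

The averaged curve drops steadily for roughly the first hundred steps and then flattens out, which matches the "rates hold in expectation up to the stall" picture.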
It might be worthwhile talking about this in the article. What are your thoughts, Chris/Shan?
It seems like you could address this issue with a paragraph or two. For example, you could acknowledge that you're exploring a limited case -- quadratics without gradient noise -- where it's possible to get theoretical traction, and then explain that these results should be taken with a grain of salt, but that there are reasons to think a lot of the intuition transfers over.
I disagree with the premise of this objection. As I have noted before, non-asymptotic expected convergence rates can be derived at all points in the optimization. I have added a new section on SGD which hopefully addresses some of these concerns.
I agree with the sentiment, though I have not found a good solution to this problem.
This has been fixed.
Thank you! This has been fixed.
I have found the current scrubbing system to be more natural personally.
I did intend to use the word "Regions". Pathological curvature is not a global phenomenon, but a local one. It is easy to construct functions where the condition number of the Hessian depends strongly on where the function is evaluated; one such example is sketched below.
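One concrete instance (my example, not one from the article): the Hessian of the Rosenbrock function is well conditioned near the origin but very badly conditioned along its curved valley.

```python
import numpy as np

# Hessian of the Rosenbrock function f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2,
# evaluated at a few points to show how strongly its conditioning depends on location.
def rosenbrock_hessian(x, y):
    return np.array([[2 - 400 * (y - x ** 2) + 800 * x ** 2, -400 * x],
                     [-400 * x, 200.0]])

for point in [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)]:
    print(point, f"condition number ~ {np.linalg.cond(rosenbrock_hessian(*point)):.0f}")
```

The condition number goes from roughly 100 at the origin to the tens of thousands farther along the valley y = x^2.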
For an n-dimensional function, there can indeed be up to (n - 1) valleys, one for each pathological direction.
I do not understand this point.
My point here is largely rhetorical. It is worth noting, however, that gradient descent is not an alternative, but a special case of momentum, with beta = 0 (see the sketch below). The hack I refer to is the oversight in GD which constrains it to a one-term recurrence.
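A tiny sketch of that point, using the two-step form of the update that I believe matches the article's: with beta = 0 the momentum recurrence reproduces gradient descent exactly, step for step.

```python
def momentum_step(w, z, grad, alpha, beta):
    z = beta * z + grad(w)           # two-term recurrence
    return w - alpha * z, z

def gd_step(w, grad, alpha):
    return w - alpha * grad(w)       # the usual one-term recurrence

grad = lambda w: 2.0 * w             # gradient of a toy objective w^2
w_mom, z, w_gd = 1.0, 0.0, 1.0
for _ in range(5):
    w_mom, z = momentum_step(w_mom, z, grad, alpha=0.1, beta=0.0)
    w_gd = gd_step(w_gd, grad, alpha=0.1)
print(w_mom == w_gd)                 # True: the iterates coincide when beta = 0
```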
Thank you. I have changed this to "similar to the speedup".
I have decided to grant the reader the benefit of the doubt here.
I have been deliberately vague about what pathological curvature is, so I do not believe this to be tautological. It is indeed true that there are aberrations of gradient descent not explained by quadratics, but I do not think this clause leads the reader to believe that is so.
This has been fixed.
Thank you. This has been fixed.
I have decided to tone down the use of adverbs, though I have not eliminated them entirely. These words do help emphasize important points when used sparingly.
In consideration of your objection, I have changed the phrase to "by a gross abuse of language, let’s think of this as a prior"
The dynamics of momentum:
I have now made it a point to remind the reader of where the Q's come from.
I have added more detail on this subject, and an explicit formula for the range of permissible step-sizes.
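For the record, the condition I had in mind (stated here as I recall it, so it should be checked against the article's final formula) is, for each eigenvalue lambda_i of the quadratic:

```latex
0 < \alpha \lambda_i < 2 + 2\beta, \qquad 0 \le \beta < 1
```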
This has been fixed.
This is an interesting observation, and Chris Olah has raised similar concerns. I believe the counterintuitive nature of these best practices is something worthy of further attention.
Thank you for the detailed feedback. I have fixed all the typos mentioned.
I do not believe so, though I believe this is a verbal tradition.