# Neural Network Loss Landscapes: What do we know?

_Posted: 2/29/2020_

It's hard to overstate the efficacy of deep neural networks in both AI research and practice. Nearly all state-of-the-art models in AI research for vision, language, and speech use deep learning in some form. Outside of research, the accomplishments of large-scale neural network models have garnered significant attention from the general public. For example, you might have heard of [GPT-3](https://arxiv.org/abs/2005.14165), a gigantic neural language model that is capable of generating very human-like text. Or maybe you've heard of [DALL-E](https://openai.com/blog/dall-e/), a neural network capable of generating eerily realistic images from a user-input prompt. You've almost certainly heard of progress on self-driving cars - Tesla's [autopilot features](https://www.tesla.com/autopilotAI), for example, use neural networks for image recognition.

However, even though neural networks (in combination with lots of data) have been the backbone of much of the recent progress in AI, the machine learning research community still has a fairly weak grasp on _just why neural networks work so well_. The crux of the issue is that it is nearly always possible for a neural network to _memorize_ the training data - that is, a neural network can learn a solution which performs _perfectly_ on training data, and yet does no better than _random chance_ on data that it hasn't seen before (that is, the solution does not generalize to unseen data). Despite this, neural networks consistently find solutions that generalize extremely well, which is why they are so pervasive in research and practice. So one of the key questions in machine learning right now is: **why does deep learning find generalizable solutions?**

In this post, I'll try to dig into some of the recent research surrounding loss landscapes and generalization in neural networks, to get a bird's-eye view of what we currently know and what we are still struggling to explain.

_(From this point onwards, I'm going to refer to neural networks as NNs)_

## The Apparent Difficulty of Generalization in NNs

NNs pose an extremely difficult optimization challenge, because their loss landscape is non-convex and _extremely_ high-dimensional. From an optimization perspective, in a perfect world we could simply look at all possible settings of weights in a neural network and select the setting which performs the best on our training data. There are two problems with this ideal setting:

1. This is clearly impossible - searching over all possible weight settings of a neural network is an intractable problem.
2. There is no guarantee that this solution would actually generalize to data outside of the training data!
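
To get a feel for just how intractable issue 1 is, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every weight can only take one of a fixed number of discrete values (real weights are continuous, so the true search space is larger still). The function name and the example network size are made up for this post.

```python
import math

def num_digits_of_search_space(n_params: int, levels_per_weight: int) -> int:
    """Number of decimal digits in levels_per_weight ** n_params,
    computed via logarithms so we never build the huge integer itself."""
    return math.floor(n_params * math.log10(levels_per_weight)) + 1

# Hypothetical small network: one million weights, each restricted to -1 or +1.
digits = num_digits_of_search_space(n_params=1_000_000, levels_per_weight=2)
print(f"The number of weight settings has {digits} decimal digits")
```

Even in this heavily simplified setting, the count of candidate solutions is a number with roughly 300,000 digits, so enumerating them is out of the question.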

Issue 1 leaves us with approximate algorithms, such as [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent), to attempt to find a good solution in an efficient manner. This usually entails starting from a random point in the hypothesis class (a random assignment of weights) and using gradient-based methods to move from one point to another within the hypothesis class, ideally reaching solutions that reduce the training error at each step.
Early on in NN history, this was considered a large problem - a lot of effort was devoted to [addressing saddle-points](https://arxiv.org/pdf/1406.2572.pdf) in NN loss landscapes, and different optimizers such as [Adam](https://arxiv.org/pdf/1412.6980.pdf) or [RMSProp](http://www.cs.toronto.edu/~hinton/coursera/lecture6/lec6.pdf) were proposed, all in an attempt to better tackle the optimization problem of minimizing training error.
However, [Goodfellow et al., 2015](https://arxiv.org/pdf/1412.6544.pdf) first noted that, in fact, the training trajectory of NNs trained with SGD rarely seems to encounter local minima - training is often smooth, with error monotonically decreasing across training steps until negligible training error is achieved.
Additionally, recent advancements in neural networks, such as residual (or skip) connections, coupled with higher and higher overparameterization, have resulted in much smoother, easier-to-optimize loss landscapes:
![](blog_figs/nnlls/visualizing.png)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; _Figure: [Li et al., 2018](https://arxiv.org/pdf/1712.09913.pdf)_

As a result, using gradient descent with modern NNs can obtain near-perfect accuracy on the training data ([Zhang et al., 2016](https://arxiv.org/pdf/1611.03530.pdf), [Du et al., 2019](https://arxiv.org/pdf/1810.02054.pdf), [Huang et al., 2020](https://arxiv.org/abs/1906.03291)). That is to say: in most cases of NN training these days, we can find a global minimizer of the optimization problem. So issue 1 turns out not to be a major problem for us.
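
The "start at random weights, then follow the gradient downhill" recipe described above can be sketched in a few lines. This is a toy illustration under my own assumptions, not anyone's actual training code: the "network" is two scalar weights multiplied together and asked to output 1, which gives a small non-convex loss with a saddle point at the origin. The function names, learning rate, and step count are all arbitrary choices.

```python
import random

def loss(w1, w2):
    # A two-"layer" scalar linear network w2 * (w1 * x), asked to map x = 1 to y = 1.
    return (w1 * w2 - 1.0) ** 2

def grad(w1, w2):
    # Analytic gradient of the loss with respect to (w1, w2).
    residual = w1 * w2 - 1.0
    return 2.0 * residual * w2, 2.0 * residual * w1

def gradient_descent(seed=0, lr=0.05, steps=2000):
    rng = random.Random(seed)
    # Random starting point in weight space.
    w1, w2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    history = [loss(w1, w2)]
    for _ in range(steps):
        g1, g2 = grad(w1, w2)
        # Step against the gradient to reduce the training loss.
        w1, w2 = w1 - lr * g1, w2 - lr * g2
        history.append(loss(w1, w2))
    return (w1, w2), history

(w1, w2), history = gradient_descent()
print("initial loss:", history[0])
print("final loss:  ", history[-1])
```

Despite the non-convexity, plain gradient descent from a random start drives the loss essentially to zero here, which mirrors (in miniature) the observation that optimization is rarely the bottleneck.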

Issue 2, however, is much more difficult to deal with. It is one of the central problems of machine learning, and of learning theory in general - how do we leverage training data to find solutions that can perform well on unseen data?
Statistical learning theory attempts to give us bounds on how well we can expect a model to generalize given the complexity of the hypothesis class we're using to learn and the amount of training data we have.
The key issue with neural networks is that they are very expressive; in fact, they are [universal function approximators](https://www.sciencedirect.com/science/article/abs/pii/0893608089900208) - as such, standard metrics of complexity suggest that NNs are complex enough to completely memorize the entire training set, without learning anything about the underlying data distribution. As a result, statistical learning theory would suggest that NNs should not be learnable - there are no theoretical guarantees that our solutions will generalize at all. Further, this has actually been demonstrated in practice as well...

### Fitting Noise & Bad Global Minima

In their seminal paper [Understanding deep learning requires rethinking generalization](https://arxiv.org/pdf/1611.03530.pdf), Zhang et al. empirically demonstrated the following result:
**Neural networks are capable of perfectly fitting randomly labeled data.**

![](blog_figs/nnlls/rethinking-generalization.png)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; _Figure: [Zhang et al., 2016](https://arxiv.org/pdf/1611.03530.pdf)_

In a randomly labeled data setting, an NN is trained on data in which every example has been assigned a random label. In such a setting, the input carries virtually no information about the output, so the best anyone (or any machine) can do on a new input is to guess a label at random. Thus, the only way to perform well on a training set with random labels is to essentially _memorize_ the entire training set. [Zhang et al., 2016](https://arxiv.org/pdf/1611.03530.pdf) demonstrated that modern NNs are clearly capable of this, and thus they must clearly be capable of memorizing a training set with true labels as well. This has been corroborated by [Huang et al., 2020](https://arxiv.org/abs/1906.03291), who empirically demonstrated that there are several bad minima (solutions which perfectly classify the training data, but achieve near chance accuracy on test data) near the trajectory of a successfully trained model which achieves 98% accuracy on test data (shown below).

![](blog_figs/nnlls/bad-minima.png)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; _Figure: [Huang et al., 2020](https://arxiv.org/abs/1906.03291)_
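
The memorization argument can be made concrete with a toy sketch. To keep it dependency-free, I'm assuming a stand-in for an NN: an overparameterized _linear_ classifier (50 weights, 20 training points) trained with the classic perceptron rule. This is not the setup from Zhang et al.; all names and sizes below are illustrative choices.

```python
import random

def train_perceptron(xs, ys, max_epochs=1000):
    """Apply perceptron updates until every training point is classified correctly."""
    w = [0.0] * len(xs[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # misclassified (or exactly on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

def accuracy(w, xs, ys):
    correct = sum(
        1 for x, y in zip(xs, ys)
        if y * sum(wi * xi for wi, xi in zip(w, x)) > 0
    )
    return correct / len(xs)

def random_dataset(rng, n, dim):
    xs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    ys = [rng.choice([-1, 1]) for _ in range(n)]  # labels are independent of the inputs
    return xs, ys

rng = random.Random(0)
train_x, train_y = random_dataset(rng, n=20, dim=50)   # more weights than examples
test_x, test_y = random_dataset(rng, n=2000, dim=50)

w = train_perceptron(train_x, train_y)
print("train accuracy:", accuracy(w, train_x, train_y))
print("test accuracy: ", accuracy(w, test_x, test_y))
```

With more weights than training points, the model fits the random labels perfectly, while accuracy on fresh randomly labeled data hovers around chance. There was nothing to learn, so all it could do was memorize.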

This leads to one of the key open questions in deep learning: why do neural networks _prefer_ solutions that generalize to unseen data, rather than solutions which simply memorize the training data without actually learning anything? The answer likely lies in the structure of the extremely high-dimensional, difficult-to-conceptualize loss landscapes of our NNs.

## Some Myths of Generalization in Deep Learning

In their wonderfully named paper$^1$ [Truth or Backpropaganda?: An Empirical Investigation of Deep Learning Theory](https://arxiv.org/abs/1910.00359), Goldblum et al. challenged several common assumptions about this question.

$^1$_In the interest of full attribution though, I think the first instance of this joke came from the hilarious [Naomi Saphra](https://twitter.com/nsaphra/status/720614007498006533)._

#### Regularization

Different types of regularization 

#### Variance in SGD

The

#### Rank

Something something rank rank rank.

## Properties of good solutions in NNs

### Wide Basins

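[Keskar et al., 2017](https://arxiv.org/pdf/1609.04836.pdf) perhaps first proposed the notion that, in deep networks, wide (flat) minima generalize better than sharp ones: they observed that large-batch training tends to converge to sharp minimizers, which generalize worse, while small-batch training tends to find flat minimizers.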

Indeed, [Huang et al., 2020](https://arxiv.org/abs/1906.03291) propose that one reason NN training may avoid poor global minima is the size of the good global minima - that is, the wider the basin, the more likely we are to find it.
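
One crude way to make "wide" versus "narrow" concrete is to jitter a solution with small random noise and see how much the loss rises. The sketch below does this on a hand-built 1-D loss with one narrow and one wide minimum. Note the assumptions: the loss function, the noise scale, and this averaged-perturbation proxy are my own illustrative choices, not the sharpness measure defined in any of the papers above.

```python
import random

def loss(w):
    """1-D toy loss with a narrow basin at w = -2 and a wide basin at w = +2.
    Both minima have exactly zero loss."""
    narrow = 25.0 * (w + 2.0) ** 2
    wide = 0.25 * (w - 2.0) ** 2
    return min(narrow, wide)

def expected_loss_under_noise(w_star, sigma=0.1, n_samples=10_000, seed=0):
    """Average loss after perturbing w_star with Gaussian noise.
    A rough flatness proxy: lower values mean a wider basin around w_star."""
    rng = random.Random(seed)
    total = sum(loss(w_star + rng.gauss(0.0, sigma)) for _ in range(n_samples))
    return total / n_samples

narrow_score = expected_loss_under_noise(-2.0)
wide_score = expected_loss_under_noise(+2.0)
print("narrow minimum:", narrow_score)
print("wide minimum:  ", wide_score)
```

Both solutions are equally good at the exact minimum, but the same perturbation hurts the narrow one far more. Simple geometry pushes the same way in high dimensions: a basin that is 10 times wider along each of d directions has 10^d times the volume, so a random search is overwhelmingly more likely to land in it.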

### Mode Connectivity

[Freeman and Bruna](https://openreview.net/pdf?id=Bk0FWVcgx) first conjectured, and [Draxler et al., 2018](https://arxiv.org/pdf/1803.00885.pdf) and [Garipov et al., 2018](https://arxiv.org/pdf/1802.10026.pdf) later corroborated in more challenging settings, that two distinct NN solutions can be connected by a non-linear path through the loss landscape. This connecting path can be traversed without incurring a higher loss than the two original solutions, suggesting that local minima of NNs do not exist in isolation, but rather all lie on a connected manifold.

![](blog_figs/nnlls/mode-connectivity2.png)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; _Figure: [Draxler et al., 2018](https://arxiv.org/pdf/1803.00885.pdf)_
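
Here's a minimal sketch of what "connected by a non-linear path" means, on a toy 2-D loss I made up whose minima form the entire unit circle. It compares the loss barrier along the straight line between two minima with the barrier along a curved path. The loss, the path functions, and this particular definition of `barrier` are assumptions for illustration. In the papers above the curved path is _found_ by optimization, whereas here I simply write it down.

```python
import math

def loss(w):
    """Toy 2-D loss whose minima form the whole unit circle."""
    x, y = w
    return (x * x + y * y - 1.0) ** 2

def linear_path(a, b, t):
    # Straight-line interpolation between two weight vectors.
    return tuple((1 - t) * ai + t * bi for ai, bi in zip(a, b))

def arc_path(theta_a, theta_b, t):
    # Curved path that stays on the circle of minima.
    theta = (1 - t) * theta_a + t * theta_b
    return (math.cos(theta), math.sin(theta))

def barrier(path, n_points=101):
    """Highest loss along the path minus the higher of the two endpoint losses."""
    losses = [loss(path(i / (n_points - 1))) for i in range(n_points)]
    return max(losses) - max(losses[0], losses[-1])

theta_a, theta_b = 0.0, math.pi  # two distinct minima: (1, 0) and (-1, 0)
w_a = (math.cos(theta_a), math.sin(theta_a))
w_b = (math.cos(theta_b), math.sin(theta_b))

linear_barrier = barrier(lambda t: linear_path(w_a, w_b, t))
curved_barrier = barrier(lambda t: arc_path(theta_a, theta_b, t))
print("barrier on straight line:", linear_barrier)
print("barrier on curved path:  ", curved_barrier)
```

The straight line has to climb over the bump at the origin, while the curved path never leaves the valley floor. The two solutions look isolated only if you insist on walking between them in a straight line.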

Much more recently, [Frankle et al., 2020](https://arxiv.org/pdf/1912.05671.pdf) discovered that _linearly_ connected solutions arise when training copies of a model from a shared checkpoint taken after only a few epochs. In other words, the mode that a particular training trajectory will arrive in is determined very early on in training, after which the random order of data does not matter.

[Fort et al., 2020a](https://arxiv.org/pdf/1912.02757.pdf) demonstrated that this disconnect between modes early on could explain why deep subspace-based ensembling methods are less effective than randomly initialized ensembles. They showed that solutions in the same mode, which were linearly connected, were significantly less diverse than solutions in separate, non-linearly connected modes.
Similarly, [Fort et al., 2020b](https://arxiv.org/abs/2010.15110) demonstrated the connection between this phenomenon of linear mode connectivity and linearized training regimes related to the neural tangent kernel.



[Stiffness - Fort et al., 2020](https://arxiv.org/pdf/1901.09491.pdf)

[Break even point - Jastrzebski et al., 2020](https://arxiv.org/abs/2002.09572)

[Emergent properties - Fort et al., 2019](https://arxiv.org/abs/1910.05929)


[Implicit regularization - Neyshabur et al., 2015](https://arxiv.org/pdf/1412.6614.pdf)

[Implicit regularization norms - Razin and Cohen, 2020](https://arxiv.org/pdf/2005.06398.pdf)

[Large scale structure - Fort et al., 2019](https://arxiv.org/pdf/1906.04724.pdf)




## Intrinsic Dimensionality

[Intrinsic Dimension - Li et al., 2018](https://arxiv.org/abs/1804.08838)

[Goldilocks Zone - Fort et al., 2018](https://arxiv.org/pdf/1807.02581.pdf)
