# Working Draft - Gradient Covariance in Neural Networks

Recently, my research has shifted towards studying optimization in deep learning, and in particular I've been interested in this notion of the _temperature_ of SGD, i.e. how much *noise* we incur in our estimation of the true gradient, and its relation to generalization.
Throughout this transition I've learnt quite a bit more about gradient covariance, and covariance matrices in general, and I thought that I would share what I've learned here.

Unfortunately, these things are not necessarily coherent - my fear is that this reads as more of a random information dump than some actual coherent blog post.
Thus, I'm going to try to explain the structure here at the beginning to make it easier to parse this mess of a blog post.

The first half of this blog post discusses gradient covariance matrices, intuitions around these matrices and their eigenvalues, and the gradient covariance matrix's relation to the Fisher Information Matrix and the Hessian.
These sections are _not_ specific to neural networks - they only consider gradients of any parametric function $f$ that is differentiable.
Formally the sections are:
- Gradient Covariance in General
    - Formally, what is gradient covariance
    - Intuitively, what is gradient covariance
    - Eigenvectors and Eigenvalues
- Relation to Other Matrices of Importance
    - The Fisher Information Matrix
    - The Hessian

After this section I introduce stochastic gradient descent (SGD), and the noisiness of the _step_ of optimization. This section is _also_ not specific to neural networks, but it is necessary to discuss the next sections.
This part of the blog post is organized as a single section:

- Stochastic Gradient Descent
    - The step-size
    - The batch-size

Following stochastic gradient descent, I begin (finally) discussing the gradient covariance's relevance to deep learning.

- The Temperature of SGD
    - The Stochastic Differential Equation Approximation
    - Issues with the Approximation
- Approximating the Gradient Covariance in Modern NNs
    - The Trace
    - The Spectral Norm
- Empirical Observations of the Gradient Covariance in Modern NNs

Ultimately this is quite a bit of... stuff. The first 3 sections are more akin to material that you would find in a set of notes rather than a blog post, I feel.
For the most part they are fairly formal, and well understood - however, I still found it very helpful to review these things and so for that reason I felt it made sense to keep them in the blog post.
If you are only reading this to get a feel for current work in deep learning surrounding gradient covariance, then only the last 3 sections should be relevant.

## Gradient Covariance in General

### Formally, what _is_ gradient covariance?

Let's start by defining a function, $f_\theta$, and dataset, $\mathcal{D}$.
The function must be differentiable, and we'll assume that it's parameterized by $\theta \in \mathbb{R}^d$.
The dataset $\mathcal{D}$ is composed of $N$ examples $x_1, x_2, \ldots, x_N$, all of which may be passed into $f_\theta(x)$. Finally, we will assume that we have some differentiable loss function $\mathcal{L}(f_\theta(x), x)$ which measures the error of $f_\theta$ on example $x$. However, most of this will be abstracted away when we consider the gradient, which we'll term $G$ for simplicity.
$$
G(x) = \nabla_\theta \mathcal{L}(f_\theta(x), x)
$$

In short, $G(x)$ is the gradient of example $x$ given parameters $\theta$. In optimization, we are often interested in the average gradient over the entire dataset, which we'll call the **true gradient** $G_T$
$$
G_T = \frac{1}{N}\sum_{x\in\mathcal{D}}G(x) 
$$
We care about the true gradient because we often consider a _uniform distribution_ over our dataset, where each individual example $x_i$ in the dataset carries equal importance with respect to learning. There are exceptions to this, e.g. in curriculum learning where we determine some order under which a model should learn certain examples, but in most practical cases we train models over the uniform distribution.

One key issue with calculating $G_T$, especially in deep learning, is time - if your dataset is large (as they often are) it may take a very long time to compute $G(x)$ for every $x\in\mathcal{D}$.
As a result, it is very common to instead _approximate_ $G_T$ by sampling a random example $x\sim\mathcal{D}$ and using $G(x)$ in place of $G_T$.
In this way, we instead treat our gradient as a random variable $\mathcal{G}$.
This turns out to be a fairly reasonable approximation for a number of reasons - key amongst them is the fact that the **expected value** of $\mathcal{G}$ over the entire dataset is $G_T$
$$
\mathbb{E}_{x\sim\mathcal{D}}[\mathcal{G}] = \frac{1}{N} \sum_{x\in\mathcal{D}} G(x) = G_T
$$
so if we take a large number of steps, on average we will move in the right direction.
However, the _noisiness_ of this estimation, i.e. how close each step will be to the average, is reflected in the **covariance matrix** of $\mathcal{G}$,
$$
\Sigma(\mathcal{G}) = \frac{1}{N}\sum_{x\in\mathcal{D}}\big[(G(x) - G_T)(G(x) - G_T)^T \big]
$$
$\Sigma(\mathcal{G})$ is a $d\times d$ matrix - the diagonal entries $\Sigma(\mathcal{G})_{i,i}$ give the variance of the gradient component for parameter $\theta_i$, while the off-diagonal entries $\Sigma(\mathcal{G})_{i,j}$ give the _covariance_ between the gradient components for $\theta_i$ and $\theta_j$.
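To make the definitions above concrete, here is a minimal NumPy sketch. The model (linear regression with a squared-error loss), the random data, and the helper names `per_example_gradients` and `gradient_covariance` are all my own assumptions for illustration - nothing here is specific to that choice, it just gives us per-example gradients that are easy to write down.

```python
import numpy as np

def per_example_gradients(theta, X, y):
    """Gradient of the squared error 0.5 * (x @ theta - y)**2 for each example."""
    residuals = X @ theta - y            # shape (N,)
    return residuals[:, None] * X        # shape (N, d); row i is G(x_i)

def gradient_covariance(grads):
    """True gradient G_T and covariance Sigma under a uniform draw from the dataset."""
    g_true = grads.mean(axis=0)                      # G_T, shape (d,)
    centered = grads - g_true                        # G(x) - G_T for every x
    sigma = centered.T @ centered / grads.shape[0]   # (1/N) * sum of outer products
    return g_true, sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))       # toy dataset: N = 8 examples, d = 3 parameters
y = rng.normal(size=8)
theta = np.zeros(3)

grads = per_example_gradients(theta, X, y)
g_true, sigma = gradient_covariance(grads)
print(g_true.shape, sigma.shape)  # a length-d vector and a d x d matrix
```

Note that the covariance is normalized by $N$ rather than $N - 1$, matching the formula above, since we are treating the dataset as the whole population.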

### _Intuitively,_ what _is_ gradient covariance?

Let's imagine a parameter space consisting of 2 parameters: $\theta_1$ and $\theta_2$.
Pictured below are 2 separate sets of gradients ($G$ and $G'$) along with their respective covariance matrix.
Both sets have the same $G_T = G'_T = (3, 3)$ but distinct covariances.

![](blog_figs/cov_blog/demo1.png)

If we were to sample all $4$ gradients and compute their average, then we would see the same result regardless of whether we were using $G$ or $G'$, because $G_T = G'_T$.
However, if we instead opted to _approximate_ the true gradient with a single sample, then _how close_ we are to the true gradient would vary.
If we sample from $G$, each coordinate of our sample will be, on average, a distance of $1$ from the true gradient - however, if we sample from $G'$, the average distance is now $2$.
Thus, we would intuitively say that $G'$ is _noisier_ than $G$, since our approximations are farther from the true gradient, on average.

This intuition is, of course, captured by the diagonal of the covariance matrix, i.e. the independent dimension variances.
We can interpret each variance as the average squared distance of a random sample from the mean along that dimension.
For $G$ both parameters have a variance of $1$ -- so given a random sample, each parameter will have a root-mean-square distance of $\sqrt{1}$ from $G_T$. For $G'$, that distance is instead $\sqrt{4}$.
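Here is a small sketch of that calculation. The specific points below are hypothetical - I picked them so that they reproduce the statistics quoted above (mean $(3, 3)$, variances of $1$ and $4$), and they may not be the exact points drawn in the figure.

```python
import numpy as np

# Hypothetical gradient sets chosen to match the statistics in the text;
# the actual points in the figure may differ.
G = np.array([[2.0, 2.0], [2.0, 4.0], [4.0, 2.0], [4.0, 4.0]])
G_prime = np.array([[1.0, 1.0], [1.0, 5.0], [5.0, 1.0], [5.0, 5.0]])

def mean_and_variances(grads):
    g_true = grads.mean(axis=0)
    variances = ((grads - g_true) ** 2).mean(axis=0)   # diagonal of the covariance
    return g_true, variances

g_true, var = mean_and_variances(G)
g_true_p, var_p = mean_and_variances(G_prime)

print(g_true, var, np.sqrt(var))         # mean (3, 3), variances (1, 1), RMS distance (1, 1)
print(g_true_p, var_p, np.sqrt(var_p))   # mean (3, 3), variances (4, 4), RMS distance (2, 2)
```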




### The Eigenvectors and Eigenvalues

There are a couple of important facts to note about the Eigenvectors and Eigenvalues of a covariance matrix.

The first, and most well-known, is that the eigenvectors point in the directions of highest variance while remaining orthogonal to each other.

Another interesting fact is that the sum of the independent variances of each parameter $\theta_i$ is actually equal to the sum of the eigenvalues of the covariance matrix.
So, although one way to determine how "noisy" a set of gradients is would be to look at the covariance eigenvalues, you can obtain the same estimate by just looking at the trace of the covariance matrix.
What does this mean for our gradients?
Eigenvectors and eigenvalues can give us a more efficient breakdown of the noise in our gradients (this is also the key idea behind methods like PCA) - but, if all we care about is the total amount of noise in our gradients we need not look further than the trace.
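Both facts are easy to check numerically. The sketch below uses made-up correlated "gradients" (the `mixing` matrix is just an arbitrary way to introduce correlation) and `np.linalg.eigh`, which is the NumPy routine for symmetric matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-example gradients: 200 samples in d = 3 with correlated coordinates.
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.3]])
grads = rng.normal(size=(200, 3)) @ mixing.T

centered = grads - grads.mean(axis=0)
sigma = centered.T @ centered / len(grads)

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(sigma)

total_variance = np.trace(sigma)                         # sum of per-parameter variances
top_direction = eigenvectors[:, -1]                      # direction of highest variance
variance_along_top = np.var(centered @ top_direction)    # variance of the projection

print(total_variance, eigenvalues.sum())       # these two agree
print(variance_along_top, eigenvalues[-1])     # and so do these
```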

This is interesting, because it essentially means that the covariance between 

## Relation to Other Matrices of Importance

### The Fisher Information Matrix


### The Hessian

## Stochastic Gradient Descent

One key application of the gradient is in the optimization of the loss function, $\mathcal{L}$, via gradient descent, which updates parameters $\theta$ in the following way:
$$
\theta^{t+1} = \theta^t - \eta \; G_T
$$
where $\eta$ is the size of the "step" we take in the direction of $G_T$, and is a hyperparameter that we often have to tune by hand.
As mentioned above, since $G_T$ is expensive to compute, we instead approximate it with a random sample from $\mathcal{G}$... I'll write this as $G\sim\mathcal{G}$, although it's worth remembering that $G\sim\mathcal{G}$ is actually $G(x)$ computed for $x\sim\mathcal{D}$.
This approximation yields the parameter update
$$
\theta^{t+1} = \theta^t - \eta \; G
$$
and is called, of course, _stochastic gradient descent_, or SGD.
The difference between SGD and standard gradient descent is that the step $\eta\;G$ has a bit of _noise_ in it - we're estimating $G_T$ with an approximation $G\sim\mathcal{G}$.
This "noise" is characterized by the gradient covariance, which describes how close on average a random sample $G\sim\mathcal{G}$ will be to $G_T$.
In practice SGD often converges slower than GD due to the noise in the gradient estimation. However, the tradeoff might be worth it if you only have to compute the gradient of a few examples at each step, rather than thousands or hundreds of thousands.
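As a rough sketch of the two update rules side by side, here they are on a toy least-squares problem. The problem, the learning rate of `0.05`, and the function names are assumptions I made up for the example.

```python
import numpy as np

def per_example_gradients(theta, X, y):
    """Per-example gradients of 0.5 * (x @ theta - y)**2; one row per example."""
    return (X @ theta - y)[:, None] * X

def gd_step(theta, X, y, lr):
    """Gradient descent: step along the true gradient G_T (mean over the dataset)."""
    return theta - lr * per_example_gradients(theta, X, y).mean(axis=0)

def sgd_step(theta, X, y, lr, rng):
    """SGD: step along G(x) for a single uniformly sampled example x."""
    i = rng.integers(len(X))
    return theta - lr * per_example_gradients(theta, X[i:i + 1], y[i:i + 1])[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))
true_theta = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=256)

theta_gd = np.zeros(4)
theta_sgd = np.zeros(4)
for _ in range(500):
    theta_gd = gd_step(theta_gd, X, y, lr=0.05)
    theta_sgd = sgd_step(theta_sgd, X, y, lr=0.05, rng=rng)

print(np.linalg.norm(theta_gd - true_theta))    # distance of GD iterate from true_theta
print(np.linalg.norm(theta_sgd - true_theta))   # distance of SGD iterate from true_theta
```

Each GD step here touches all 256 examples, while each SGD step touches only one - that is the tradeoff described above.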

### The Step Size

In optimization with SGD there is a difference between the noisiness of the gradient estimation, characterized by $\Sigma(\mathcal{G})$, and the noisiness of the _step_ that we take at each iteration.
In particular, there are two common hyperparameters that have a significant effect on the noise of the _step_ of SGD, in addition to $\Sigma(\mathcal{G})$.

We have already seen one - the step-size $\eta$.
$\eta$ determines how large of a step we will take in the direction of $G\sim\mathcal{G}$.
It is directly tied into the expectation of our step
$$
\mathbb{E}_{G\sim\mathcal{G}}[\eta \; G] = \eta \; G_T
$$
and the squared norm of the step
$$
\mathbb{E}_{G\sim\mathcal{G}}\big[ ||\eta\;G||^2 \big] = \eta^2\; \mathbb{E}\big[ ||\mathcal{G}||^2 \big]
$$
in very obvious fashion.
However, it _also_ has an effect on the variance of our step - when you scale a random variable by a fixed constant, it has a quadratic effect on the variance of that variable, i.e.
$$
\Sigma(\eta\;\mathcal{G}) = \eta^2 \; \Sigma(\mathcal{G})
$$
In almost every application of SGD that I have seen, $\eta < 1$, and so the variance of the step of SGD is often smaller than the gradient covariance, just because we are scaling the step down.
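These scaling relations can be checked directly on samples. The gradients below are random placeholders and `eta = 0.1` is an arbitrary choice - the identities hold for any set of gradients and any constant.

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(loc=1.0, size=(1000, 5))   # hypothetical per-example gradients
eta = 0.1

def covariance(samples):
    centered = samples - samples.mean(axis=0)
    return centered.T @ centered / len(samples)

steps = eta * grads                                      # the steps eta * G
mean_sq_norm_grads = (grads ** 2).sum(axis=1).mean()     # E[||G||^2]
mean_sq_norm_steps = (steps ** 2).sum(axis=1).mean()     # E[||eta * G||^2]

print(np.allclose(steps.mean(axis=0), eta * grads.mean(axis=0)))       # mean scales with eta
print(np.isclose(mean_sq_norm_steps, eta ** 2 * mean_sq_norm_grads))   # squared norm with eta^2
print(np.allclose(covariance(steps), eta ** 2 * covariance(grads)))    # covariance with eta^2
```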

### The Batch Size

Often when using SGD for neural networks we do not actually sample only a single gradient $G\sim\mathcal{G}$. Instead we will often sample $B$ individual examples, $G_B \sim \{\mathcal{G}\}^B$, and consider the sample average $\bar{G}_B$ as our gradient for SGD.
This new random variable, $\bar{\mathcal{G}_B}$, has the exact same expectation as $\mathcal{G}$ - $G_T$.
However, its _variance_ is tied very closely to the batch-size $B$ - the larger a sample we average over, the lower our variance will be. This variance reduction occurs linearly with respect to the sample size, so we can write the covariance of $\bar{\mathcal{G}_B}$ as
$$
\Sigma(\bar{\mathcal{G}_B}) = \frac{\Sigma(\mathcal{G})}{B}
$$

Note however that this linear scaling comes from assuming that the examples $x_1, \ldots, x_B\sim\mathcal{D}$ are sampled independently - in practice, we rarely actually sample independently because we sample without replacement, i.e. we don't want a batch to contain the same example more than once. Therefore, in theory, we have to correct our variance reduction with the FPC (finite population correction) factor
$$
\Sigma(\bar{\mathcal{G}_B}) = \frac{N - B}{N - 1}\frac{\Sigma(\mathcal{G})}{B}
$$

If our batch size $B$ is significantly smaller than our dataset, i.e. $B \ll N$, then the FPC is close to 1 - so in practice, we can usually assume that the variance of our mini-batch gradient scales inversely with $B$.
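For a dataset small enough to enumerate every possible batch, both formulas can be verified exactly rather than by sampling. The sketch below does that with a made-up dataset of $N = 6$ gradients and $B = 3$; the helper `batch_mean_covariance` is my own name for illustration.

```python
from itertools import combinations, product
import numpy as np

def covariance(samples):
    centered = samples - samples.mean(axis=0)
    return centered.T @ centered / len(samples)

def batch_mean_covariance(grads, batch_size, replace):
    """Exact covariance of the mini-batch mean, by enumerating every possible batch."""
    n = len(grads)
    if replace:
        index_sets = product(range(n), repeat=batch_size)    # independent draws
    else:
        index_sets = combinations(range(n), batch_size)       # no repeated examples
    batch_means = np.array([grads[list(idx)].mean(axis=0) for idx in index_sets])
    return covariance(batch_means)

rng = np.random.default_rng(0)
N, B = 6, 3
grads = rng.normal(size=(N, 2))      # hypothetical per-example gradients
sigma = covariance(grads)

with_replacement = batch_mean_covariance(grads, B, replace=True)       # 6**3 = 216 batches
without_replacement = batch_mean_covariance(grads, B, replace=False)   # C(6, 3) = 20 batches
fpc = (N - B) / (N - 1)

print(np.allclose(with_replacement, sigma / B))             # plain 1/B scaling
print(np.allclose(without_replacement, fpc * sigma / B))    # 1/B scaling times the FPC
```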

The batch-size also has an effect on the expected (squared) _norm_ of our gradient, $\mathbb{E}\big[||\bar{\mathcal{G}_B}||^2\big]$.
By the triangle inequality, the norm of a sum of vectors is never larger than the sum of the individual norms, so averaging gradients that point in different directions tends to shrink the norm. Intuitively, then, we expect that, on average, a gradient computed with a batch-size of 64 examples will have a smaller norm than one computed with a batch-size of 16 examples.
We can characterize the expected squared norm of $\mathcal{G}$, the single-sample random variable, in terms of the squared norm of $G_T$ and the _trace_ of the covariance
$$
\mathbb{E}\big[ ||\mathcal{G}||^2 \big] = ||G_T||^2 + tr\big(\Sigma(\mathcal{G})\big)
$$
i.e. the expected squared norm of $\mathcal{G}$ is the squared norm of the true gradient plus the _trace_ of the covariance matrix - the trace, $tr\big(\Sigma(\mathcal{G})\big)$, is the sum of the individual parameters' variances.
We can see now how the batch-size affects the expected gradient norm - increasing the batch-size reduces the variance, resulting in a smaller second term
$$
\mathbb{E}\big[ ||\bar{\mathcal{G}_B}||^2 \big] = ||G_T||^2 + \frac{1}{B}tr\big(\Sigma(\mathcal{G})\big)
$$
Therefore, the larger our batch-size, the smaller our gradient norm will be on average. Of course, as we scale $B$ to be closer to $N$, the FPC factor comes back into play, which is how we get the second term to go to zero as $B \rightarrow N$.
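The norm identity can be checked exactly on a tiny dataset by enumerating every batch drawn with replacement (so the examples are independent and no FPC is needed). The data and the helper `expected_sq_norm` are assumptions for the sake of the example.

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(0)
N = 5
grads = rng.normal(loc=0.5, size=(N, 3))    # hypothetical per-example gradients
g_true = grads.mean(axis=0)
trace_sigma = ((grads - g_true) ** 2).sum(axis=1).mean()   # tr(Sigma)

def expected_sq_norm(grads, batch_size):
    """Exact E||batch mean||^2 when examples are drawn independently (with replacement)."""
    n = len(grads)
    sq_norms = [np.sum(grads[list(idx)].mean(axis=0) ** 2)
                for idx in product(range(n), repeat=batch_size)]
    return float(np.mean(sq_norms))

for B in (1, 2, 4):
    predicted = float(g_true @ g_true) + trace_sigma / B
    print(B, expected_sq_norm(grads, B), predicted)   # the two columns agree
```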

## The Temperature of SGD

### The Linear Scaling Rule

The above section suggests a simple rule when scaling the batch-size of SGD for a particular model - when we increase the batch-size by a factor of $k$, we should increase the learning rate by a factor of $\sqrt{k}$. This scaling rule would allow us to keep the covariance of the _step_ in SGD constant as we increase the number of samples that we average over.

Interestingly, in deep learning this does not quite work... the _linear scaling rule_ instead posits that when we increase the batch-size by a factor of $k$, we need to scale the learning rate by a factor of $k$ as well - i.e. rather than keeping the covariance of the step constant, we should keep the ratio between the learning rate and batch-size constant.
This ratio is often termed the _temperature_ of SGD
$$
T = \frac{\eta}{B}
$$
and there has been quite a bit of recent work that corroborates this finding in practice.
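The difference between the two rules is easy to see in a few lines. The base learning rate, batch-size, and factor $k$ below are illustrative values only, and the function names are mine.

```python
import math

def sqrt_scaled_lr(base_lr, k):
    """Square-root rule: keeps the step covariance (eta**2 / B) * Sigma fixed."""
    return base_lr * math.sqrt(k)

def linear_scaled_lr(base_lr, k):
    """Linear rule: keeps the temperature eta / B fixed."""
    return base_lr * k

def temperature(lr, batch_size):
    return lr / batch_size

base_lr, base_batch, k = 0.1, 128, 4    # illustrative values only
new_batch = base_batch * k

lr_sqrt = sqrt_scaled_lr(base_lr, k)
lr_linear = linear_scaled_lr(base_lr, k)

# Square-root rule preserves eta**2 / B; linear rule preserves eta / B.
print(base_lr ** 2 / base_batch, lr_sqrt ** 2 / new_batch)
print(temperature(base_lr, base_batch), temperature(lr_linear, new_batch))
```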

Why does the linear scaling rule hold in deep learning, and not the _square-root_ scaling rule? The answer to this is still unknown actually! However, a number of recent works have shown that the linear-scaling rule can emerge from a theoretical treatment of SGD as a stochastic differential equation.


### The SDE Approximation

A stochastic differential equation is an equation of the form
$$
\delta x = f(x) \delta t + g(x) \delta W_t
$$
where $f(x)$ and $g(x)$ are deterministic functions, and $W_t$ is some random process - it is often a Wiener process, otherwise known as Brownian motion, which models a Gaussian random variable whose variance scales with time. However, the noise does not necessarily have to be a Wiener process. For the purposes of this blog, however, we will model the noise as a Wiener process, as that is the most common process used in the analysis of SGD for neural networks.

We can re-write SGD to look something like the above equation by saying that the difference between $\theta_{t+1}$ and $\theta_t$ can be described as a step in the true gradient direction $\eta\;G_T$ plus the difference between the true gradient and our random sample, e.g.
$$
\theta_{t+1} - \theta_t = -\eta \; G_T + \eta\;(G_T - G)
$$
Here the second term on the right-hand side is a random variable with $0$ mean and covariance $\eta^2\,\Sigma(\mathcal{G})$, since $(G_T - G)$ itself has covariance $\Sigma(\mathcal{G})$.
In order to derive an SDE from the above equation, we need $t$ to be a continuous variable - this is not exactly natural in SGD, since we take discrete steps.
What we do to overcome this is describe time in terms of intervals $\Delta t$, where $\Delta t$ is composed of several steps with a _very_ small, but fixed, learning rate $\eta$, i.e. $\Delta t = N\;\eta$ (here $N$ counts the steps in the interval, not the size of the dataset).
We then write our SGD update as
$$
\theta_{t+\Delta t} - \theta_t = - G_T\Delta t + \sum_{n=1}^N \eta\;(G_T - G_n)
$$
where $G_n$ is the gradient sampled at the $n$-th step of the interval.
Under this approximation we require a couple of assumptions - the first is that $\eta$ is _small_ enough so that, from step to step, the statistics of the gradient noise (the second term) don't change that much. The second assumption is that $\Delta t$ is _large_ enough that the central limit theorem applies to the second term - the sum of many independent noise terms can then be replaced with a Gaussian random variable:
$$
\theta_{t+\Delta t} - \theta_t = - G_T\;\Delta t + \eta \epsilon \sqrt{N}
$$
where $\epsilon \sim \mathcal{N}\big(0, \Sigma(\mathcal{G})\big)$.
For simplicity, we can re-write this equation with the covariance of $\epsilon$ moved into the second term instead, so that $\epsilon$ is a standard normal random variable.
Additionally, note that $N = \frac{\Delta t}{\eta}$, so we can replace $N$ and re-arrange some terms to obtain the gradient update
$$
\theta_{t+\Delta t} - \theta_t = - G_T\Delta t + \sqrt{\eta \Sigma(\mathcal{G})} \epsilon \sqrt{\Delta t}
$$
where $\epsilon \sim \mathcal{N}\big(0, I\big)$.

Now we can start to see how we can approximate SGD as an SDE - as we take $\Delta t \rightarrow 0$ we get a stochastic process defined as
$$
\delta \theta = - G_T \delta t +  \sqrt{\eta \Sigma(\mathcal{G})} \epsilon \sqrt{\delta t}
$$
where $\epsilon \sim \mathcal{N}\big(0, I\big)$.
Because $\epsilon$ is a standard normal random variable, $\epsilon \sqrt{\delta t} = \delta W_t$, where $W_t$ is a Wiener process. This yields the following SDE approximation which has been used to characterize SGD in neural networks:
$$
\delta \theta = - G_T \delta t + \sqrt{\eta \Sigma(\mathcal{G})} \delta W_t
$$

The noise level of this process is characterized by the second term and, crucially, its covariance $\eta\,\Sigma(\mathcal{G})$ is _linear_ in the learning rate $\eta$. Thus, when we use a mini-batch of size $B$, so that $\Sigma(\mathcal{G})$ becomes $\Sigma(\mathcal{G})/B$, our SDE becomes:
$$
\delta \theta = - G_T \delta t + \sqrt{\frac{\eta}{B} \Sigma(\mathcal{G})} \delta W_t
$$
and so, at long last, we have derived the theoretical motivation for the linear relationship between batch-size and learning rate, i.e. the _linear scaling rule_.
For a little more intuition on this I strongly recommend watching David McAllester's Deep Foundation videos over [Gradient Flow](https://www.youtube.com/watch?v=oKbLeEL-Xro), [SDEs](https://www.youtube.com/watch?v=l4B_8DqYfmA), and [Temperature](https://www.youtube.com/watch?v=_Q6quuwjoRQ). They are very intuitive yet concise, and I found them really helpful when first approaching this subject.
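To get a feel for what the temperature-scaled SDE predicts, here is a small simulation under some strong simplifying assumptions of my own: a 1-D quadratic loss $\frac{1}{2}a\theta^2$ (so $G_T = a\theta$), and Gaussian gradient noise with a constant variance `noise_var`. In that setting the SDE is an Ornstein-Uhlenbeck process, whose stationary variance is $\frac{\eta}{B}\cdot\frac{\texttt{noise\_var}}{2a}$, so it depends on $\eta$ and $B$ only through $T$. The sketch runs plain discrete SGD at two settings with the same temperature and compares against that prediction.

```python
import numpy as np

def sgd_paths(a, noise_var, lr, batch_size, n_steps, n_paths, rng):
    """Discrete SGD on L(theta) = 0.5 * a * theta**2 with Gaussian mini-batch noise.

    Assumption: the mini-batch gradient is a * theta plus N(0, noise_var / B) noise.
    """
    theta = np.zeros(n_paths)            # many independent runs, simulated in parallel
    for _ in range(n_steps):
        noise = rng.normal(scale=np.sqrt(noise_var / batch_size), size=n_paths)
        theta = theta - lr * (a * theta + noise)
    return theta

rng = np.random.default_rng(0)
a, noise_var = 1.0, 4.0                  # hypothetical curvature and per-example variance

small = sgd_paths(a, noise_var, lr=0.01, batch_size=8, n_steps=1500, n_paths=10000, rng=rng)
large = sgd_paths(a, noise_var, lr=0.04, batch_size=32, n_steps=1500, n_paths=10000, rng=rng)

temperature = 0.01 / 8                                 # equal to 0.04 / 32
predicted_var = temperature * noise_var / (2 * a)      # stationary variance of the SDE

print(small.var(), large.var(), predicted_var)
```

The two empirical variances should land close to the prediction, with the larger learning rate drifting slightly further from it - which is consistent with the SDE only being an approximation for small $\eta$.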

### When does the SDE Approximation fail?

[Li et al., 2021](https://arxiv.org/abs/2102.12470)



## Gradient Covariance in Modern Neural Networks

### Approximating the Covariance Matrix

In deep learning, we can't compute the _whole_ covariance matrix - after all, it's a $d\times d$ matrix where $d$ is often on the order of a million or higher, and the individual entries are floats - for simple reference, if $d = 1,000,000$ and we use 4 bytes to represent each entry, then the covariance matrix requires _4 Terabytes_ to store in memory all at once. Luckily, we rarely want to store the whole covariance matrix all at once. Instead we just want to calculate _properties_ of the covariance matrix, like its trace or its spectral norm.
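The storage figure is just $d^2$ entries times the bytes per entry, which is quick to sanity check (the function name is mine, and I'm using decimal terabytes, i.e. $10^{12}$ bytes):

```python
def covariance_bytes(num_params, bytes_per_entry=4):
    """Memory needed to store a dense d x d covariance matrix."""
    return num_params ** 2 * bytes_per_entry

d = 1_000_000
print(covariance_bytes(d) / 1e12, "TB")   # 4.0 TB
```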

#### Trace

The **trace** is a particularly nice property of the covariance matrix to compute - not only is it fairly simple to compute but it captures an important property of the overall matrix, namely the total variance summed over all parameters.
It's actually not intractable to compute the trace _exactly_ - we just need to compute the gradient of each individual example, as well as the true gradient.
$$
tr\big(\Sigma(\mathcal{G})\big) = \frac{1}{N} \sum_{x \in \mathcal{D}} \big(G(x) - G_T\big)^T \big(G(x) - G_T\big)
$$
However, this can take _a lot of time_ to do depending on how big your dataset and model are. We can instead approximate it in a couple of ways.
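For reference, here is one way the exact trace could be computed without ever forming the $d \times d$ matrix. It relies on the identity $tr\big(\Sigma(\mathcal{G})\big) = \mathbb{E}\big[||\mathcal{G}||^2\big] - ||G_T||^2$ from the batch-size section, so only a running sum of gradients and a running sum of squared norms need to be stored. The chunked interface and the name `trace_of_covariance` are assumptions I made for the sketch.

```python
import numpy as np

def trace_of_covariance(grad_chunks):
    """tr(Sigma) from a stream of per-example gradient chunks, without forming Sigma.

    Uses tr(Sigma) = E||G||^2 - ||G_T||^2, so memory stays O(d) instead of O(d^2).
    """
    count = 0
    grad_sum = None
    sq_norm_sum = 0.0
    for chunk in grad_chunks:                       # each chunk has shape (n_i, d)
        chunk = np.asarray(chunk, dtype=np.float64)
        chunk_sum = chunk.sum(axis=0)
        grad_sum = chunk_sum if grad_sum is None else grad_sum + chunk_sum
        sq_norm_sum += float((chunk ** 2).sum())
        count += chunk.shape[0]
    g_true = grad_sum / count                       # G_T
    return sq_norm_sum / count - float(g_true @ g_true)

rng = np.random.default_rng(0)
grads = rng.normal(size=(100, 6))                   # hypothetical per-example gradients
trace_streamed = trace_of_covariance(np.array_split(grads, 7))

centered = grads - grads.mean(axis=0)
trace_direct = np.trace(centered.T @ centered / len(grads))
print(trace_streamed, trace_direct)                 # the two agree
```

One caveat: subtracting two large, nearly equal numbers can lose precision when the noise is tiny relative to $||G_T||^2$, which is why the sketch accumulates in float64.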

McCandlish et al. propose to estimate it by computing the gradient of a large batch size, $B_{large}$, which is composed of the gradients of several smaller batches, $B_{small}$.
Recall from above that the batch-size has an effect on the expected gradient norm through the 
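As a sketch of how two batch sizes can be used here: the identity $\mathbb{E}\big[||\bar{\mathcal{G}_B}||^2\big] = ||G_T||^2 + \frac{1}{B}tr\big(\Sigma(\mathcal{G})\big)$ from the batch-size section gives two equations (one per batch-size) in two unknowns, which can be solved for both the trace and $||G_T||^2$. I believe this is in the spirit of the McCandlish et al. estimator, but treat the code below as my own derivation from that identity rather than a transcription of their method; the function name and arguments are hypothetical.

```python
def estimate_trace_and_sq_norm(sq_norm_small, sq_norm_big, b_small, b_big):
    """Solve E||G_B||^2 = ||G_T||^2 + tr(Sigma) / B at two batch sizes.

    sq_norm_small / sq_norm_big are (averaged) squared norms of mini-batch
    gradients measured at batch sizes b_small < b_big.
    """
    trace_est = (sq_norm_small - sq_norm_big) / (1.0 / b_small - 1.0 / b_big)
    sq_norm_est = (b_big * sq_norm_big - b_small * sq_norm_small) / (b_big - b_small)
    return trace_est, sq_norm_est

# Sanity check with made-up ground truth: ||G_T||^2 = 5 and tr(Sigma) = 12.
true_sq_norm, true_trace = 5.0, 12.0
b_small, b_big = 4, 16
expected_small = true_sq_norm + true_trace / b_small   # 8.0
expected_big = true_sq_norm + true_trace / b_big       # 5.75

trace_est, sq_norm_est = estimate_trace_and_sq_norm(expected_small, expected_big, b_small, b_big)
print(trace_est, sq_norm_est)   # recovers 12.0 and 5.0
```

In practice the squared norms are themselves noisy measurements, so they would need to be averaged over many batches before plugging them in.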


#### Spectral Norm

### Empirical Observations of Gradient Covariance

Several recent works have studied how gradient covariance evolves during training, and in particular how its relation to the _temperature_ of SGD determines generalization.

### Relation to the Hessian

Finally, I'll note an interesting relationship between the Hessian of a model and the Gradient Covariance Matrix.

https://arxiv.org/pdf/1711.04623.pdf

### Relation to the FIM



https://arxiv.org/pdf/1906.07774.pdf




#### Footnotes

<a name="cite_note-1"></a>1. [^](#blog-post/GradientCovariance#cite_ref-1) Ahah... this is actually not something we should take for granted! In theory (and in practice for the most part) we operate on the IID assumption - that our examples are sampled independently from the same distribution. However, there is a whole subfield of machine learning, called curriculum learning, that challenges this assumption and generally actually yields pretty promising results (in my opinion).

#### References

<a name="reference-1"></a>[1]

[2]:

