# Gradient Covariance in Neural Networks

Recently, my research has shifted towards studying optimization in deep learning, and in particular I've been interested in this notion of the _temperature_ of SGD, i.e. how much *noise* we incur in our estimation of the true gradient, given the gradient of only a single example, or perhaps a small batch.

## The Gradient Covariance Matrix

### Formally, what _is_ gradient covariance?

Let's start by defining a function, $f_\theta$, and dataset, $\mathcal{D}$.
The function must be differentiable, and we'll assume that it's parameterized by $\theta \in \mathbb{R}^d$.
The dataset $\mathcal{D}$ is composed of $N$ examples $x_1, x_2, \ldots, x_N$, all of which may be passed into $f_\theta(x)$. Finally, we will assume that we have some differentiable loss function $\mathcal{L}(f_\theta(x), x)$ which measures the error of $f_\theta$ on example $x$. However, most of this will be abstracted away when we consider the gradient, which we'll term $G$ for simplicity:
$$
G(x) = \nabla_\theta \mathcal{L}(f_\theta(x), x)
$$

In short, $G(x)$ is the gradient of example $x$ given parameters $\theta$. In optimization, we are often interested in the average gradient over the entire dataset, which we'll call the **true gradient** $G_T$:
$$
G_T = \frac{1}{N}\sum_{x\in\mathcal{D}}G(x) 
$$
We care about the true gradient because we often consider a _uniform distribution_ over our dataset, where each individual example $x_i$ in the dataset carries equal importance with respect to learning. There are exceptions to this, e.g. in curriculum learning where we determine some order under which a model should learn certain examples, but in most practical cases we train models over the uniform distribution.

One key issue with calculating $G_T$, especially in deep learning, is time - if your dataset is large (as they often are) it may take a very long time to compute $G(x)$ for every $x\in\mathcal{D}$.
As a result, it is very common to instead _approximate_ $G_T$ by sampling a random example $x\sim\mathcal{D}$ and using $G(x)$ in place of $G_T$.
In this way, we instead treat our gradient as a random variable $\mathcal{G}$.
This turns out to be a fairly reasonable approximation for a number of reasons - key amongst them is the fact that the **expected value** of $\mathcal{G}$ over the entire dataset is $G_T$
$$
\mathbb{E}_{x\sim\mathcal{D}}[\mathcal{G}] = \frac{1}{N} \sum_{x\in\mathcal{D}} G(x) = G_T
$$
so if we take a large number of steps, on average we will move in the right direction.
However, the _noisiness_ of this estimation, i.e. how close each step will be to the average, is reflected in the **covariance matrix** of $\mathcal{G}$,
$$
\Sigma(\mathcal{G}) = \frac{1}{N}\sum_{x\in\mathcal{D}}\big[(G(x) - G_T)(G(x) - G_T)^T \big]
$$
$\Sigma(\mathcal{G})$ is a $d\times d$ matrix - the diagonal entries $\Sigma(\mathcal{G})_{i,i}$ represent the variance of the gradient with respect to the specific parameter $\theta_i$, while the off-diagonal entries $\Sigma(\mathcal{G})_{i,j}$ represent the _covariance_ between the gradients with respect to the two parameters $\theta_i$ and $\theta_j$.
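To make these definitions concrete, here is a minimal NumPy sketch on a toy problem. The linear model, the squared-error loss, the synthetic data and the helper name `per_example_gradients` are all assumptions chosen for illustration; they are not part of the setup above, which only requires a differentiable $f_\theta$ and loss.

```python
import numpy as np

# Hypothetical toy problem: f_theta(x) = x @ theta with a squared-error loss.
rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
theta = np.zeros(d)


def per_example_gradients(theta, X, y):
    """Row i is G(x_i) for the loss 0.5 * (x_i @ theta - y_i) ** 2."""
    residuals = X @ theta - y           # shape (N,)
    return residuals[:, None] * X       # shape (N, d)


G = per_example_gradients(theta, X, y)  # one gradient per example
G_T = G.mean(axis=0)                    # true gradient, shape (d,)
centered = G - G_T
Sigma = centered.T @ centered / N       # gradient covariance, shape (d, d)

print(G_T)
print(Sigma)
```

Note that the sum is divided by $N$, matching the formula above, rather than by $N - 1$ as in the usual unbiased sample covariance.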

### _Intuitively,_ what _is_ gradient covariance?

To give some very simple intuition about what's going on inside of these matrices, pictured above are 2 gradients, $G(x_1)$ and $G(x_2)$.


### The Eigenvectors and Eigenvalues

There are a couple of important facts to note about the eigenvectors and eigenvalues of a covariance matrix.

The first, and most well-known, is that the eigenvectors point in the directions of highest variance while remaining orthogonal to each other.

Another interesting fact is that the sum of the individual variances of each parameter $\theta_i$, i.e. the trace of the covariance matrix, is actually equal to the sum of its eigenvalues.
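Both facts are easy to check numerically. The sketch below uses random vectors as stand-ins for per-example gradients (an assumption purely for illustration) and verifies that the eigenvectors are orthonormal, that the eigenvalues sum to the trace, and that the variance along the top eigenvector equals the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "per-example gradients" with a different spread in each coordinate.
G = rng.normal(size=(500, 4)) * np.array([3.0, 1.0, 0.5, 0.1])
centered = G - G.mean(axis=0)
Sigma = centered.T @ centered / len(G)

# eigh is meant for symmetric matrices; eigenvalues are returned in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

# Sum of per-parameter variances (the trace) versus sum of eigenvalues.
print(np.trace(Sigma), eigenvalues.sum())

# Variance of the gradients projected onto the top eigenvector.
top_direction = eigenvectors[:, -1]
projected_variance = np.var(centered @ top_direction)
print(projected_variance, eigenvalues[-1])
```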

## Batch-Size and Learning Rate

In SGD optimization with neural networks, there are two very important hyperparameters that really tune the "noisiness" of SGD by directly affecting the covariance. They are the batch-size, which is the number of samples that we average over, and the learning rate, which _scales_ the gradient uniformly by a constant.
I want to touch briefly on what statistics has to say about how these two hyperparameters affect the covariance.

### The Effect of Batch-Size

Often when using SGD for neural networks we do not actually sample only a single gradient, $\mathcal{G}$. Instead, we consider a sample average $\bar{\mathcal{G}}_B$ over a sample of size $B$, $x \sim \{\mathcal{D}\}^B$.
The sample average has an effect on the gradient covariance - this is not specific to gradients, this is a fact of statistics. The larger a sample we average over, the lower our variance will be. The variance shrinks in inverse proportion to the sample size, so we can write the covariance of $\bar{\mathcal{G}}_B$ as
$$
\Sigma(\bar{\mathcal{G}}_B) = \frac{\Sigma(\mathcal{G})}{B}
$$

Note however that this scaling comes from assuming that the examples $x_1, \ldots, x_B\sim\mathcal{D}$ are sampled independently - in practice, we rarely actually sample independently because we sample without replacement, i.e. we don't want a batch to contain the same example more than once. Therefore, in theory, we have to correct our variance reduction with the FPC (finite population correction) factor
$$
\Sigma(\bar{\mathcal{G}}_B) = \frac{N - B}{N - 1}\frac{\Sigma(\mathcal{G})}{B}
$$

However, if our batch size $B$ is significantly smaller than our dataset, i.e. $B \ll N$, then the FPC is close to 1 - so in practice, we can usually assume that 
$$
\Sigma(\bar{\mathcal{G}}_B) \approx \frac{\Sigma(\mathcal{G})}{B}
$$
i.e. that the variance of our mini-batch gradient inversely scales with the size of the mini-batch.
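A quick Monte Carlo experiment can illustrate both the $1/B$ scaling and the FPC. This is only a sketch: the "gradients" are random vectors, and $N = 50$, $B = 10$ are deliberately chosen so that $B$ is *not* much smaller than $N$ and the correction is visible. It compares the trace of the mini-batch covariance against the two formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, B = 50, 3, 10
G = rng.normal(size=(N, d))              # stand-in per-example gradients
centered = G - G.mean(axis=0)
trace_single = np.sum(centered ** 2) / N  # tr(Sigma(G)) for a single example


def batch_mean_trace(replace, trials=20000):
    """Monte Carlo estimate of the covariance trace of the mini-batch mean."""
    means = np.stack([
        G[rng.choice(N, size=B, replace=replace)].mean(axis=0)
        for _ in range(trials)
    ])
    return np.sum(means.var(axis=0))


fpc = (N - B) / (N - 1)
with_replacement = batch_mean_trace(replace=True)
without_replacement = batch_mean_trace(replace=False)

print("with replacement:   ", with_replacement, "predicted:", trace_single / B)
print("without replacement:", without_replacement, "predicted:", fpc * trace_single / B)
```

With a realistic dataset size the two predictions become nearly indistinguishable, which is why the correction is usually ignored.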

The batch-size also has an effect on the expected **squared norm of our gradient**, $\mathbb{E}\big[||\mathcal{G}||^2\big]$.
The norm of a sum of vectors is never larger than the sum of the individual vector norms (see: triangle inequality), so the norm of an average of gradients is never larger than the average of their norms. Intuitively, then, we expect that, on average, the gradient with a batch-size of 64 examples will have a smaller norm than a gradient with a batch-size of 16 examples.

We can characterize the expected squared norm of $\mathcal{G}$ in terms of the norm of $G_T$ and the _trace_ of the covariance as
$$
\mathbb{E}\big[ ||\mathcal{G}||^2 \big] = ||G_T||^2 + tr\big(\Sigma(\mathcal{G})\big)
$$
i.e. the expected squared norm of $\mathcal{G}$ is the squared norm of the true gradient plus the _trace_ of the covariance matrix - note that the trace, $tr\big(\Sigma(\mathcal{G})\big)$, is the sum of the individual parameters' variances.
We can see now how the batch-size affects the expected gradient norm - increasing the batch-size reduces the variance, resulting in a smaller second term
$$
\mathbb{E}\big[ ||\bar{\mathcal{G}}_B||^2 \big] = ||G_T||^2 + \frac{1}{B}tr\big(\Sigma(\mathcal{G})\big)
$$
Therefore, the larger our batch-size, the smaller our squared gradient norm will be on average.
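
For completeness, the decomposition of the expected squared norm used above follows from writing $\mathcal{G} = G_T + (\mathcal{G} - G_T)$ and expanding the square. The cross term vanishes because $\mathbb{E}[\mathcal{G} - G_T] = 0$, and the last term is the sum of the diagonal entries of $\Sigma(\mathcal{G})$:
$$
\begin{aligned}
\mathbb{E}\big[ ||\mathcal{G}||^2 \big] &= \mathbb{E}\big[ ||G_T + (\mathcal{G} - G_T)||^2 \big] \\
&= ||G_T||^2 + 2\, G_T^T\, \mathbb{E}\big[ \mathcal{G} - G_T \big] + \mathbb{E}\big[ ||\mathcal{G} - G_T||^2 \big] \\
&= ||G_T||^2 + tr\big(\Sigma(\mathcal{G})\big)
\end{aligned}
$$
The mini-batch version is the same argument applied to $\bar{\mathcal{G}}_B$, whose mean is still $G_T$ and whose covariance is $\Sigma(\mathcal{G})/B$ under independent sampling.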


### The Effect of Learning Rate

The other hyperparameter that has a large effect on gradient covariance is the learning rate, $\eta$. The learning rate plays a key role in SGD, by determining how large a step optimization takes in the direction of the negative gradient
$$\theta^{t+1} = \theta^t - \eta\; \mathcal{G}$$

How does scaling $\mathcal{G}$ affect its covariance? It's fairly straightforward - scaling a random variable by a factor of $k$ scales its variance by a factor of $k^2$, i.e.
$$
\Sigma(\eta\;\mathcal{G}) = \eta^2 \Sigma(\mathcal{G})
$$
The learning rate also affects the expected squared norm of the gradient, and the relationship is just as simple:
$$
\mathbb{E}\big[ ||\eta\; \mathcal{G}||^2 \big] = \eta^2\; \mathbb{E}\big[ ||\mathcal{G}||^2 \big]
$$

### Temperature of SGD

In mini-batch SGD, which is the canonical algorithm used in deep learning, we typically take a step that looks like:
$$
\theta^{t+1} = \theta^t - \eta \; \bar{\mathcal{G}}_B
$$
where we can treat this step $\eta \; \bar{\mathcal{G}}_B$ as a random variable itself with a covariance of
$$
\Sigma(\eta \; \bar{\mathcal{G}}_B) = \frac{\eta^2}{B}\Sigma(\mathcal{G})
$$
This formula suggests that an increase to our batch-size by a factor of $k$ should require a commensurate increase to our learning rate by a factor of $\sqrt{k}$ in order to maintain the same gradient covariance scale.

Interestingly, in deep learning this is not what is observed in practice... the _linear scaling rule_ instead posits that when we increase the batch-size by a factor of $k$, we need to scale the learning rate by a factor of $k$ as well - i.e. rather than maintaining the gradient covariance constant, we should keep the ratio between the learning rate and batch-size constant.
This ratio is often termed the _temperature_ of SGD
$$
T = \frac{\eta}{B}
$$
and there has been quite a bit of recent work that corroborates this finding.
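
The difference between the two rules is just arithmetic, and a small sketch makes it explicit. The helper names and the base learning rate and batch-size below are hypothetical values chosen for illustration.

```python
import math


def step_covariance_scale(lr, batch_size):
    """Scalar multiplying Sigma(G) in the covariance of one SGD step: lr**2 / B."""
    return lr ** 2 / batch_size


def temperature(lr, batch_size):
    """Temperature as defined above: T = lr / B."""
    return lr / batch_size


base_lr, base_batch, k = 0.1, 32, 4   # hypothetical starting point
big_batch = k * base_batch

sqrt_lr = base_lr * math.sqrt(k)      # square-root scaling
linear_lr = base_lr * k               # linear scaling rule

# Square-root scaling keeps the step covariance scale fixed...
print(step_covariance_scale(base_lr, base_batch), step_covariance_scale(sqrt_lr, big_batch))
# ...while linear scaling keeps the temperature fixed.
print(temperature(base_lr, base_batch), temperature(linear_lr, big_batch))
```

Under the linear scaling rule the step covariance scale $\eta^2 / B$ is not preserved: it grows by a factor of $k$.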

Why does the linear scaling rule hold in deep learning, and not the _square-root_ scaling rule?
To answer this, a lot of recent work has analyzed SGD as a stochastic differential equation.

## Gradient Covariance in Modern Neural Networks

### Approximating the Covariance Matrix

In deep learning, we can't compute the _whole_ covariance matrix - after all, it's a $d\times d$ matrix where $d$ is often on the order of a million or higher, and the individual entries are floats - for simple reference, if $d = 1,000,000$ and we use 4 bytes to represent each entry, then the covariance matrix requires _4 Terabytes_ to store in memory all at once. Luckily, we rarely want to store the whole covariance matrix all at once. Instead we just want to calculate _properties_ of the covariance matrix, like its trace or its spectral norm.
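
The memory figure is a one-line calculation. The helper name `covariance_bytes` is just for illustration, and a terabyte is taken to be $10^{12}$ bytes.

```python
def covariance_bytes(d, bytes_per_entry=4):
    """Memory needed to store a dense d x d matrix."""
    return d * d * bytes_per_entry


d = 1_000_000
print(covariance_bytes(d) / 1e12)  # size in terabytes → 4.0
```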

#### Trace

The **trace** is a particularly nice property of the covariance matrix to compute - not only is it fairly simple to compute but it captures important properties of the overall matrix, namely the total variance across all parameters.
It's actually not necessarily intractable to compute the trace _exactly_ - we just need to compute the gradient of each individual example, as well as the true gradient.
$$
tr(\Sigma(\mathcal{G})) = \frac{1}{N} \sum_{x \in \mathcal{D}} (G(x) - G_T)^T (G(x) - G_T)
$$
However, this can take _a lot of time_ to do depending on how big your dataset and model are. We can instead approximate it in a couple of ways.
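
The exact formula never needs the $d \times d$ matrix itself, only the per-example gradients. A minimal sketch follows, with random vectors standing in for real gradients and a small $d$ so the full matrix can be built for comparison; the function name `covariance_trace` is an assumption for illustration.

```python
import numpy as np


def covariance_trace(G):
    """tr(Sigma) from per-example gradients G of shape (N, d), in O(N d) memory."""
    centered = G - G.mean(axis=0)
    return np.sum(centered ** 2) / len(G)


rng = np.random.default_rng(0)
G = rng.normal(size=(1000, 5))          # stand-in per-example gradients

# Reference value: build the full matrix (only feasible because d is tiny here).
centered = G - G.mean(axis=0)
Sigma = centered.T @ centered / len(G)

print(covariance_trace(G), np.trace(Sigma))
```

In a real network the per-example gradients would typically be streamed in two passes (one for $G_T$, one for the squared distances) rather than stored as a single array.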

McCandlish et al. propose to estimate it by computing the gradient of a large batch, of size $B_{large}$, which is composed of the gradients of several smaller batches, of size $B_{small}$.
Recall from above that the batch-size has an effect on the expected gradient norm through the 


#### Spectral Norm

### Empirical Observations of Gradient Covariance

Several recent works have studied how gradient covariance evolves during training, and in particular how its relation to the _temperature_ of SGD determines generalization.


### Relation to the FIM

https://arxiv.org/pdf/1906.07774.pdf

### Relation to the Hessian

Finally, I'll note an interesting relationship between the Hessian of a model and the gradient covariance matrix.

https://arxiv.org/pdf/1711.04623.pdf