# The Basics of Gradient Covariance and Temperature

Recently, my research has shifted towards studying optimization in deep learning. In particular, I've been interested in the notion of the _temperature_ of SGD, i.e. how much *noise* we incur when we estimate the true gradient from the gradient of only a single example, or perhaps a small batch.


### What _is_ gradient covariance?

To begin with, gradient covariance is defined for a specific function and dataset. The function, which we'll call $f_\theta$, must be differentiable and we'll assume that it's parameterized by $\theta \in \mathbb{R}^d$.
The dataset $\mathcal{D}$ is composed of $N$ examples $x_1, x_2, \ldots, x_N$, each of which may be passed into $f_\theta$ as $f_\theta(x_i)$. Finally, we will assume that we have some differentiable loss function $\mathcal{L}(x, \theta)$ which measures the error of $f_\theta$ on example $x$.

We define $G(x) = \nabla_\theta \mathcal{L}(x, \theta)$ to be the gradient of the loss on example $x$ with respect to the parameters $\theta$. In optimization, we are often interested in the average gradient over the entire dataset, which we'll call the **true gradient** $G_T$:
$$
G_T = \frac{1}{N}\sum_{x\in\mathcal{D}}G(x)
$$

However, the true gradient can be expensive to compute, since it requires a gradient for every example in the dataset. So we often _approximate_ $G_T$ instead, using the gradient of a single example or a small batch.
We start by treating the gradient as a discrete random variable, which we'll call $G$. All possible values of $G$ can be found by iterating over every $x\in\mathcal{D}$ and computing $G(x)$.
Because we often only consider uniform distributions over our dataset (i.e. all examples are weighted equally), the **expected value** of $G$ is
$$
\mathbb{E}_{x\sim\mathcal{D}}\big[G\big] = \frac{1}{N} \sum_{x\in\mathcal{D}} G(x) = G_T
$$
The **covariance matrix** of $G$ is
$$
\Sigma(G) = \frac{1}{N}\sum_{x\in\mathcal{D}}\big[(G(x) - G_T)(G(x) - G_T)^T \big]
$$
$\Sigma(G)$ is a $d\times d$ matrix. The diagonal entries $\Sigma_{i,i}$ give the variance of the $i$-th gradient component (the partial derivative with respect to $\theta_i$), while the off-diagonal entries $\Sigma_{i,j}$ give the covariance between the $i$-th and $j$-th components.

### A Simple Demonstration
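
Here is a minimal sketch of the definitions above in NumPy. It assumes a linear model $f_\theta(x) = x^T\theta$ with a squared-error loss, and the synthetic data, the function names `per_example_gradients` and `gradient_mean_and_covariance`, and the choice of $\theta = 0$ are all illustrative assumptions, not part of any particular setup.

```python
import numpy as np

def per_example_gradients(theta, X, y):
    """Gradient of 0.5 * (x @ theta - y)**2 w.r.t. theta, one row per example."""
    residuals = X @ theta - y            # shape (N,)
    return residuals[:, None] * X        # shape (N, d)

def gradient_mean_and_covariance(grads):
    """True gradient G_T and covariance Sigma(G), weighting all examples equally."""
    g_true = grads.mean(axis=0)                      # shape (d,)
    centered = grads - g_true
    sigma = centered.T @ centered / grads.shape[0]   # 1/N normalization, shape (d, d)
    return g_true, sigma

# Hypothetical synthetic regression problem: N = 200 examples, d = 3 parameters.
rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
theta = np.zeros(d)

grads = per_example_gradients(theta, X, y)
g_true, sigma = gradient_mean_and_covariance(grads)
print("G_T:", g_true)
print("Sigma(G):\n", sigma)
```

The $\frac{1}{N}$ normalization matches the formula for $\Sigma(G)$ above, which is the same thing `np.cov(grads.T, bias=True)` computes.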



### The Eigenvectors and Eigenvalues

There are a couple of important facts to note about the eigenvectors and eigenvalues of a covariance matrix.

The first, and most well-known, is that the eigenvectors are mutually orthogonal and point in the directions of highest variance: the eigenvector with the largest eigenvalue is the direction along which the gradients vary the most, and each eigenvalue is the variance of the gradients along its eigenvector.

Another interesting fact is that the sum of the per-parameter variances (the diagonal of $\Sigma(G)$, i.e. its trace) is equal to the sum of the eigenvalues of the covariance matrix.
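
Both facts are easy to check numerically. The sketch below uses randomly generated stand-ins for per-example gradients (an assumption made purely for illustration) and `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-example gradients: N = 500 examples, d = 4 parameters,
# with a different spread along each coordinate.
grads = rng.normal(size=(500, 4)) * np.array([3.0, 1.0, 0.5, 0.1])

centered = grads - grads.mean(axis=0)
sigma = centered.T @ centered / grads.shape[0]

# eigh returns eigenvalues in ascending order and eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(sigma)

# The eigenvectors are orthonormal.
assert np.allclose(eigenvectors.T @ eigenvectors, np.eye(4))

# The sum of per-parameter variances (the trace) equals the sum of eigenvalues.
assert np.isclose(np.trace(sigma), eigenvalues.sum())

# The variance of the gradients projected onto the top eigenvector
# equals the largest eigenvalue.
top = eigenvectors[:, -1]
assert np.isclose((centered @ top).var(), eigenvalues[-1])
```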

### Measuring "Noise"

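Say we wanted to compress the covariance matrix down to a single value that summarizes how noisy the gradient estimate is. Given the facts above, a natural candidate is the trace $\text{Tr}(\Sigma(G))$: it is the sum of the per-parameter variances and, equivalently, the sum of the eigenvalues.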
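
One way to see why the trace is a reasonable scalar summary uses only the definition of $\Sigma(G)$ from earlier and the identity $\text{Tr}(vv^T) = v^Tv$:
$$
\text{Tr}\big(\Sigma(G)\big) = \frac{1}{N}\sum_{x\in\mathcal{D}} \text{Tr}\big[(G(x) - G_T)(G(x) - G_T)^T\big] = \frac{1}{N}\sum_{x\in\mathcal{D}} \|G(x) - G_T\|^2
$$
In other words, the trace is the expected squared distance between a single-example gradient and the true gradient.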

### The Effect of Batch-Size on Covariance

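Increasing the batch-size decreases the noise. If a batch of $B$ examples is sampled uniformly and independently (with replacement) from $\mathcal{D}$, the covariance of the averaged batch gradient is $\Sigma(G)/B$, so every variance, and therefore the trace, shrinks by a factor of $B$.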
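
The sketch below checks this empirically. It assumes batches are drawn uniformly with replacement and uses randomly generated stand-ins for per-example gradients; the helper name `minibatch_gradient_covariance` is hypothetical. Under those assumptions the trace of the mini-batch covariance should come out close to $\text{Tr}(\Sigma(G))/B$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 3
# Hypothetical per-example gradients with different spread per coordinate.
grads = rng.normal(size=(N, d)) * np.array([2.0, 1.0, 0.5])

centered = grads - grads.mean(axis=0)
sigma = centered.T @ centered / N     # single-example covariance Sigma(G)

def minibatch_gradient_covariance(grads, batch_size, num_batches, rng):
    """Empirical covariance of mini-batch mean gradients (sampling with replacement)."""
    idx = rng.integers(0, grads.shape[0], size=(num_batches, batch_size))
    batch_means = grads[idx].mean(axis=1)        # shape (num_batches, d)
    return np.cov(batch_means.T, bias=True)      # shape (d, d)

for B in (1, 4, 16, 64):
    sigma_B = minibatch_gradient_covariance(grads, B, 20000, rng)
    # Second and third columns should be close to each other.
    print(B, np.trace(sigma_B), np.trace(sigma) / B)
```

Sampling without replacement would reduce the variance slightly further, so the $1/B$ scaling here is specific to the with-replacement assumption.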

### The Gradient Norm

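The expected gradient norm $\mathbb{E}_{x\sim\mathcal{D}}\big[\|G\|\big]$ is typically large compared to the norm of the expected gradient $\|G_T\|$: per-example gradients point in different directions, so much of their length cancels when they are averaged.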
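
The gap can be made exact for squared norms. Writing $G = G_T + (G - G_T)$ and using $\mathbb{E}[G - G_T] = 0$, the cross term vanishes:
$$
\mathbb{E}_{x\sim\mathcal{D}}\big[\|G\|^2\big] = \|G_T\|^2 + 2\,G_T^T\,\mathbb{E}_{x\sim\mathcal{D}}\big[G - G_T\big] + \mathbb{E}_{x\sim\mathcal{D}}\big[\|G - G_T\|^2\big] = \|G_T\|^2 + \text{Tr}\big(\Sigma(G)\big)
$$
The last step uses the fact that the trace of the outer product $(G - G_T)(G - G_T)^T$ is $\|G - G_T\|^2$. So the expected squared gradient norm exceeds the squared norm of the true gradient by exactly the total variance.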

### Temperature of SGD

The _temperature_ of SGD is often defined as the ratio between the learning rate $\epsilon$ and the batch-size $B$, i.e.
$$
T = \frac{\epsilon}{B}
$$
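These two hyperparameters together affect the noise in each SGD update. Increasing the batch-size decreases the variance of the gradient estimate, as we have already seen. The learning rate does not change the variance of $G$ itself, but it multiplies the update: with batches sampled independently and uniformly, a single step has covariance $\epsilon^2\,\Sigma(G)/B$. If we think of each step as advancing training time by $\epsilon$ (the usual continuous-time view of SGD), the noise accumulated per unit of time has covariance $\frac{\epsilon}{B}\Sigma(G)$, i.e. a standard deviation proportional to $\sqrt{\epsilon/B}$.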
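
A toy experiment can make the role of $\epsilon / B$ concrete. The sketch below assumes a one-dimensional quadratic loss $\mathcal{L}(x, \theta) = \frac{1}{2}(\theta - x)^2$, batches sampled with replacement, and a hypothetical helper `sgd_stationary_variance`. It is only an illustration on this toy problem, where for small $\epsilon$ the spread of $\theta$ around the minimum is approximately proportional to $\epsilon / B$.

```python
import numpy as np

def sgd_stationary_variance(data, lr, batch_size, steps, rng):
    """Run SGD on L(x, theta) = 0.5 * (theta - x)**2 and return Var(theta) after burn-in."""
    # Pre-sample all mini-batches (uniformly, with replacement).
    idx = rng.integers(0, data.shape[0], size=(steps, batch_size))
    batch_means = data[idx].mean(axis=1)
    theta = data.mean()                    # start at the minimizer of the full loss
    thetas = np.empty(steps)
    for t in range(steps):
        grad = theta - batch_means[t]      # mini-batch gradient of the quadratic loss
        theta -= lr * grad
        thetas[t] = theta
    return thetas[steps // 10:].var()      # discard the first 10% as burn-in

rng = np.random.default_rng(0)
data = rng.normal(size=1000)               # hypothetical 1-D dataset

# Two settings with the same temperature T = lr / B = 0.005 ...
v1 = sgd_stationary_variance(data, lr=0.05, batch_size=10, steps=50000, rng=rng)
v2 = sgd_stationary_variance(data, lr=0.10, batch_size=20, steps=50000, rng=rng)
# ... and one with a four times higher temperature, T = 0.02.
v3 = sgd_stationary_variance(data, lr=0.20, batch_size=10, steps=50000, rng=rng)
print(v1, v2, v3)
```

On this problem the first two variances should be close to each other, while the third should be several times larger.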

### Relation to the FIM

??

### Relation to the Hessian

??