<a href="https://colab.research.google.com/github/aecins/tutorials/blob/main/least_squares/least_squares_covariance_derivation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Covariance of a random variable transformed by a linear transform
Consider a random variable $\mathbf{x}$ with covariance matrix $cov(\mathbf{x}) = \Sigma$. Now consider a new random variable $\mathbf{y} = M\mathbf{x}$ that is obtained by applying a fixed linear transformation $M$ to $\mathbf{x}$. The covariance of $\mathbf{y}$ is:
$$
\begin{aligned}
cov(\mathbf{y}) &= cov(M\mathbf{x}) \\
                &= M cov(\mathbf{x}) M^T \\
                &= M \Sigma M^T
\end{aligned}
$$
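
The identity can also be checked numerically. The snippet below is a minimal sketch, assuming NumPy is available; the particular `Sigma`, `M`, sample count, and seed are illustrative values chosen here, not part of the derivation. It draws samples of $\mathbf{x}$, transforms them, and compares the sample covariance of $\mathbf{y}$ with $M \Sigma M^T$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) values: a 3-D random variable and a 2x3 transform.
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
M = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])

# Draw samples of x (one per row) and transform each one: y = M x.
x = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=200_000)
y = x @ M.T

cov_empirical = np.cov(y, rowvar=False)  # sample covariance of y
cov_predicted = M @ Sigma @ M.T          # M Sigma M^T

print(cov_empirical)
print(cov_predicted)
```

The two matrices should agree up to sampling noise, which shrinks as the number of samples grows.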

This property can be proven by expanding the definition of the covariance matrix, writing $\mu_{\mathbf{x}} = 𝔼[\mathbf{x}]$ for the mean of $\mathbf{x}$:
$$
\begin{aligned}
cov(M\mathbf{x})
&= 𝔼[(M\mathbf{x} - 𝔼[M\mathbf{x}])(M\mathbf{x} - 𝔼[M\mathbf{x}])^T] \\
&= 𝔼[(M\mathbf{x} - M𝔼[\mathbf{x}])(M\mathbf{x} - M𝔼[\mathbf{x}])^T] \\
&= 𝔼[(M(\mathbf{x} - \mu_{\mathbf{x}}))(M(\mathbf{x} - \mu_{\mathbf{x}}))^T] \\
&= \quad ...
\end{aligned}
$$
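
One way to carry the expansion through to the end, recalling that $\mu_{\mathbf{x}}$ stands for $𝔼[\mathbf{x}]$, uses the transpose rule $(AB)^T = B^TA^T$ and the fact that the fixed matrix $M$ can be pulled out of the expectation:
$$
\begin{aligned}
cov(M\mathbf{x})
&= 𝔼[M(\mathbf{x} - \mu_{\mathbf{x}})(\mathbf{x} - \mu_{\mathbf{x}})^T M^T] \\
&= M 𝔼[(\mathbf{x} - \mu_{\mathbf{x}})(\mathbf{x} - \mu_{\mathbf{x}})^T] M^T \\
&= M cov(\mathbf{x}) M^T
\end{aligned}
$$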

# Ordinary least squares covariance proof
In an ordinary least squares problem we solve for the parameters $\mathbf{x}$ that minimize the sum of squared residuals $\|A\mathbf{x} - \mathbf{b}\|^2$, where $A$ is a fixed matrix of independent variables and $\mathbf{b}$ is a vector of noisy dependent variables. The dependent variables are uncorrelated and have the same variance, i.e.:
$$
\begin{aligned}
\mathbf{b} & \sim \mathcal{N}(0, \Sigma) \\
\Sigma & = \sigma^2 I
\end{aligned}
$$
The solution to the ordinary least squares problem is obtained through the normal equations:
$$
\mathbf{x^*} = (A^TA)^{-1}A^T\mathbf{b}
$$
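
As a concrete illustration, the sketch below solves a small line-fitting problem with the normal equations and cross-checks the result against `np.linalg.lstsq`. The design matrix, the assumed ground truth `x_true`, the noise level, and the seed are all made up for this example. It solves the linear system $(A^TA)\mathbf{x} = A^T\mathbf{b}$ instead of forming the inverse explicitly, which is usually the safer choice numerically.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy problem: fit a line b = x0 + x1 * t to noisy samples.
t = np.linspace(0.0, 1.0, 20)
A = np.column_stack([np.ones_like(t), t])  # fixed design matrix
x_true = np.array([1.0, 2.0])              # assumed ground truth
b = A @ x_true + rng.normal(scale=0.1, size=t.size)

# Normal equations: solve (A^T A) x = A^T b.
x_star = np.linalg.solve(A.T @ A, A.T @ b)

# Cross-check against NumPy's least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_star)
print(x_lstsq)
```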
We want to find the covariance $cov(\mathbf{x^*})$ of this solution.

We can think of the solution $\mathbf{x^*}$ as the random variable $\mathbf{b}$ transformed by the linear transform $M = (A^TA)^{-1}A^T$.
Applying the covariance property of a linearly transformed random variable gives:
$$
\begin{align*}
M                  & = (A^TA)^{-1}A^T \\
\mathbf{x^*}       & = M\mathbf{b} \\
cov(\mathbf{b})    & = \Sigma \\
cov(\mathbf{x^*})  & = M \Sigma M^T \\
                   & = (A^TA)^{-1}A^T \Sigma ((A^TA)^{-1}A^T)^T \\
((A^TA)^{-1}A^T)^T & = A (A^TA)^{-T} \\
                   & = A (A^TA)^{-1}
\end{align*}
$$
The last equality holds because $A^TA$ is a symmetric matrix and the inverse of a symmetric matrix is also symmetric [[proof](https://math.stackexchange.com/questions/325082/is-the-inverse-of-a-symmetric-matrix-also-symmetric)]. This gives us:
$$
\begin{align*}
cov(\mathbf{x^*})  & = (A^TA)^{-1}A^T \Sigma A (A^TA)^{-1}
\end{align*}
$$
This expression can be simplified further by using the fact that $\Sigma = \sigma^2 I$. The middle part of the expression simplifies as follows:
$$
\begin{align*}
A^T \Sigma A & = A^T \sigma^2 I A \\
             & = A^T \sigma^2 A \\
             & = \sigma^2 A^T A \\
\end{align*}
$$
Plugging this back in, we get:
$$
\begin{align*}
cov(\mathbf{x^*})  & = (A^TA)^{-1} \sigma^2 (A^T A) (A^TA)^{-1} \\
                   & = \sigma^2 (A^TA)^{-1} (A^T A) (A^TA)^{-1} \\
                   & = \sigma^2 (A^TA)^{-1}
\end{align*}
$$
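
The result can be sanity-checked with a Monte Carlo experiment. The sketch below is an illustration under assumed values (the line-fit design matrix, `sigma`, `x_true`, trial count, and seed are all hypothetical): it repeatedly draws noisy vectors $\mathbf{b}$, solves each problem with $M = (A^TA)^{-1}A^T$, and compares the sample covariance of the solutions with $\sigma^2 (A^TA)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy setup: line-fit design matrix with noise std sigma = 0.5.
t = np.linspace(0.0, 1.0, 20)
A = np.column_stack([np.ones_like(t), t])
sigma = 0.5
x_true = np.array([1.0, 2.0])  # assumed ground truth; it does not affect the covariance

# M = (A^T A)^{-1} A^T, computed via a linear solve; shape (2, 20).
M = np.linalg.solve(A.T @ A, A.T)

# Each row of B is one noisy realisation of b; each row of X is its solution.
n_trials = 100_000
B = A @ x_true + rng.normal(scale=sigma, size=(n_trials, t.size))
X = B @ M.T

cov_empirical = np.cov(X, rowvar=False)
cov_predicted = sigma**2 * np.linalg.inv(A.T @ A)

print(cov_empirical)
print(cov_predicted)
```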


# Weighted least squares covariance proof
For weighted least squares, the covariance of the measurement noise is a diagonal matrix with a separate variance for each measurement:
$$
\begin{aligned}
\mathbf{b} & \sim \mathcal{N}(0, \Sigma) \\
\Sigma & =
\begin{bmatrix}
\sigma_0^2 & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \sigma_n^2
\end{bmatrix}
\end{aligned}
$$
And the solution to the weighted least squares problem is:
$$
\mathbf{x^*} = (A^T W A)^{-1}A^T W\mathbf{b}
$$
where $W = \Sigma^{-1}$.
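
A minimal sketch of this solution in NumPy is shown below; the per-measurement noise levels `sigmas`, the design matrix, `x_true`, and the seed are assumed values for illustration. It also checks the result against an equivalent formulation: dividing each row of $A$ and each entry of $\mathbf{b}$ by $\sigma_i$ turns the weighted problem into an ordinary least squares problem.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy problem: line fit where later measurements are noisier.
t = np.linspace(0.0, 1.0, 20)
A = np.column_stack([np.ones_like(t), t])
x_true = np.array([1.0, 2.0])            # assumed ground truth
sigmas = np.linspace(0.1, 1.0, t.size)   # per-measurement noise std
b = A @ x_true + rng.normal(size=t.size) * sigmas

# Weighted normal equations with W = Sigma^{-1} = diag(1 / sigma_i^2).
W = np.diag(1.0 / sigmas**2)
x_star = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)

# Equivalent formulation: scale each row by 1 / sigma_i, then solve OLS.
A_scaled = A / sigmas[:, None]
b_scaled = b / sigmas
x_scaled, *_ = np.linalg.lstsq(A_scaled, b_scaled, rcond=None)

print(x_star)
print(x_scaled)
```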

As with ordinary least squares, we obtain the covariance of $\mathbf{x^*}$ from the covariance property of a linearly transformed random variable:
$$cov(M\mathbf{x}) = M cov(\mathbf{x}) M^T$$

For weighted least squares we have:
$$
\begin{align*}
M                  & = (A^T W A)^{-1}A^T W \\
\mathbf{x^*}       & = M\mathbf{b} \\
cov(\mathbf{b})    & = \Sigma \\
cov(\mathbf{x^*})  & = M \Sigma M^T \\
\end{align*}
$$
Both $W$ and $A^TWA$ are symmetric ($W$ is diagonal, and $(A^TWA)^T = A^TW^TA = A^TWA$), so we have:
$$
\begin{align*}
M^T                & = ((A^T W A)^{-1}A^T W)^T \\
                   & = W^TA(A^T W A)^{-T} \\
                   & = WA(A^T W A)^{-1} \\
\end{align*}
$$
Plugging this back into the expression for the covariance of $\mathbf{x^*}$, we get:
$$
\begin{align*}
cov(\mathbf{x^*})  & = M \Sigma M^T \\
                   & = (A^T W A)^{-1}A^T W \Sigma WA(A^T W A)^{-1} \\
\end{align*}
$$
Since $W = \Sigma^{-1}$, this simplifies to:
$$
\begin{align*}
cov(\mathbf{x^*}) & = (A^T \Sigma^{-1} A)^{-1}A^T \Sigma^{-1} \Sigma \Sigma^{-1}A(A^T \Sigma^{-1} A)^{-1} \\
                  & = (A^T \Sigma^{-1} A)^{-1}(A^T \Sigma^{-1} A)(A^T \Sigma^{-1} A)^{-1} \\
                  & = (A^T \Sigma^{-1} A)^{-1} \\
\end{align*}
$$
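
The weighted result can be checked the same way. The sketch below uses assumed values throughout (design matrix, `sigmas`, `x_true`, trial count, and seed are hypothetical): it evaluates the unsimplified product $M \Sigma M^T$, the simplified form $(A^T \Sigma^{-1} A)^{-1}$, and a Monte Carlo estimate of the covariance of the solutions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed toy setup: line fit with a different noise std per measurement.
t = np.linspace(0.0, 1.0, 20)
A = np.column_stack([np.ones_like(t), t])
x_true = np.array([1.0, 2.0])            # assumed ground truth
sigmas = np.linspace(0.1, 1.0, t.size)
Sigma = np.diag(sigmas**2)
W = np.linalg.inv(Sigma)

# M = (A^T W A)^{-1} A^T W, computed via a linear solve; shape (2, 20).
M = np.linalg.solve(A.T @ W @ A, A.T @ W)

# Each row of B is one noisy realisation of b; each row of X is its solution.
n_trials = 100_000
B = A @ x_true + rng.normal(size=(n_trials, t.size)) * sigmas
X = B @ M.T

cov_empirical = np.cov(X, rowvar=False)     # Monte Carlo estimate
cov_sandwich = M @ Sigma @ M.T              # unsimplified M Sigma M^T
cov_predicted = np.linalg.inv(A.T @ W @ A)  # simplified closed form

print(cov_empirical)
print(cov_sandwich)
print(cov_predicted)
```

The last two matrices should match to floating-point precision, while the Monte Carlo estimate should match them up to sampling noise.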