Working with Gaussians
=======


### Sums and Transformations of Gaussian Variables

- The Gaussian distribution
$$
\mathcal{N}(x|\mu,\Sigma) = |2 \pi \Sigma |^{-\frac{1}{2}} \,\mathrm{exp}\left\{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right\}
$$

for variable $x$ is completely specified by its mean $\mu$ and covariance matrix $\Sigma$ (the variance in the scalar case).

- $\Lambda = \Sigma^{-1}$ is called the **precision matrix**.

- The sum of two Gaussian _distributions_ is NOT Gaussian. Why not?

- A **linear transformation** $z=Ax+b$ of a Gaussian variable $\mathcal{N}(x|\mu,\Sigma)$ is Gaussian distributed as

$$
p(z) = \mathcal{N} \left(z|A\mu+b, A\Sigma A^T \right) \tag{SRG-4a}
$$

- The **sum of two independent Gaussian variables** is also Gaussian distributed. Specifically, if $x \sim \mathcal{N} \left(x|\mu_x, \Sigma_x \right)$ and $y \sim \mathcal{N} \left(y|\mu_y, \Sigma_y \right)$, then the PDF for $z=x+y$ is given by

\begin{align}
p(z) &= \mathcal{N}(x|\mu_x,\Sigma_x) \ast \mathcal{N}(y|\mu_y,\Sigma_y) \notag\\
  &= \mathcal{N} \left(z|\mu_x+\mu_y, \Sigma_x +\Sigma_y \right) \tag{SRG-8}
\end{align}
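
The two rules (SRG-4a) and (SRG-8) can be checked numerically. The following is a minimal sketch, not part of the original notes: the helper names `linear_transform` and `sum_independent` and all numerical values are hypothetical choices for illustration. It computes the predicted moments of $z=Ax+b$ and compares them with sample estimates.

```python
import numpy as np

def linear_transform(mu, Sigma, A, b):
    """Mean and covariance of z = A x + b for x ~ N(mu, Sigma), cf. (SRG-4a)."""
    return A @ mu + b, A @ Sigma @ A.T

def sum_independent(mu_x, Sigma_x, mu_y, Sigma_y):
    """Mean and covariance of z = x + y for independent Gaussians, cf. (SRG-8)."""
    return mu_x + mu_y, Sigma_x + Sigma_y

# Illustrative (assumed) parameters
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.array([[1.0, 2.0], [0.0, -1.0]])
b = np.array([0.5, 0.0])

mu_z, Sigma_z = linear_transform(mu, Sigma, A, b)

# Monte Carlo check: transform samples of x and estimate the moments of z
rng = np.random.default_rng(0)
x = rng.multivariate_normal(mu, Sigma, size=200_000)
z = x @ A.T + b
emp_mean = z.mean(axis=0)
emp_cov = np.cov(z, rowvar=False)

# Sum of two independent Gaussian variables: means and covariances add
mu_s, Sigma_s = sum_independent(mu, Sigma, mu_z, Sigma_z)
```

The sample moments `emp_mean` and `emp_cov` should approach `mu_z` and `Sigma_z` as the number of samples grows.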


### Example: Gaussian Signals in a Linear System

\begin{figure}[h]\centering
\includegraphics[height=2cm]{./figures/fig-linear-system}
\end{figure}

- [Q.]: Given independent variables
$x \sim \mathcal{N}(\mu_x,\sigma_x)$ and $y \sim \mathcal{N}(\mu_y,\sigma_y)$, what is the PDF for $z = A\cdot(x -y) + b$ ?

- [A.]: $z$ is also Gaussian with 
$$
p_z(z) = \mathcal{N}(z|A(\mu_x-\mu_y)+b, \, A(\sigma_x \mathbf{+} \sigma_y)A^T)
$$

- Think about the role of the Gaussian distribution for stochastic linear systems in relation to what sinusoids mean for deterministic linear system analysis.
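
The plus sign in the variance of $z$ follows from writing $z$ as a linear transformation of the stacked vector $[x, y]^T$. A small sketch for the scalar case, assuming that $\sigma_x$ and $\sigma_y$ denote variances; the variable names and numbers are made up for illustration:

```python
import numpy as np

# Assumed scalar parameters (sigma_x, sigma_y interpreted as variances)
mu_x, var_x = 1.0, 0.5
mu_y, var_y = -1.0, 2.0
A, b = 3.0, 0.25

# Stack x and y; independence makes the joint covariance diagonal
mu = np.array([mu_x, mu_y])
Sigma = np.diag([var_x, var_y])

# z = A*(x - y) + b = M @ [x, y] + b with M = [A, -A]
M = np.array([[A, -A]])

mean_z = (M @ mu + b).item()      # A*(mu_x - mu_y) + b
var_z = (M @ Sigma @ M.T).item()  # A*(var_x + var_y)*A: the signs square away
```

The minus sign in $M$ appears twice in $M\Sigma M^T$, which is why the variances add even though the variables are subtracted.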


### Example: Bayesian Estimation of a Constant

- [Q.] Estimate a constant $\theta$ from one 'noisy' measurement $x$ about that constant. Assume the following model specification:
     
\begin{align}
x &= \theta + \epsilon \\
\theta &\sim \mathcal{N}(\theta|\mu_\theta,\sigma_\theta^2) \\
\epsilon &\sim \mathcal{N}(\epsilon|0,\sigma^2_{\epsilon})
\end{align}

- [A.]
1. **Model specification**
Note that you can rewrite these specifications in probabilistic notation as follows:
\begin{align}
    p(x|\theta) &=\mathcal{N}(x|\theta,\sigma^2_{\epsilon}) \tag{likelihood}\\
    p(\theta) &=\mathcal{N}(\theta|\mu_\theta,\sigma_\theta^2) \tag{prior}
\end{align}
2. **Inference** for the posterior PDF $p(\theta|x)$
\begin{align*}
p(\theta|x)  &= \frac{p(x|\theta)p(\theta)}{p(x)} = \frac{p(x|\theta)p(\theta)} { \int p(x|\theta)p(\theta) \, \mathrm{d}\theta } \notag \\
    &= \frac{1}{C} \,\mathcal{N}(x|\theta,\sigma^2_{\epsilon})\, \mathcal{N}(\theta|\mu_\theta,\sigma_\theta^2) \notag \\
    &= \frac{1}{C_1} \mathrm{exp} \left\{ -\frac{(x-\theta)^2}{2\sigma^2_{\epsilon}} - \frac{(\theta-\mu_\theta)^2}{2\sigma_\theta^2} \right\} \notag \\
    &= \frac{1}{C_1} \mathrm{exp} \left\{ \theta^2\left( -\frac{1}{2\sigma^2_{\epsilon}} - \frac{1}{2\sigma_\theta^2} \right) + \theta \left( \frac{x}{\sigma^2_{\epsilon}} + \frac{\mu_\theta}{\sigma_\theta^2} \right) +  C_2 \right\} \notag \\
    &= \frac{1}{C_1} \mathrm{exp} \left\{ -\frac{\sigma_\theta^2 + \sigma^2_{\epsilon}}{2\sigma_\theta^2 \sigma^2_{\epsilon}} \left( \theta - \frac{x\sigma_\theta^2 + \mu_\theta\sigma^2_{\epsilon}}{\sigma_\theta^2 + \sigma^2_{\epsilon}} \right)^2 + C_3  \right\}
\end{align*}

- This computational 'trick' for multiplying two Gaussians is called **completing the square**. Compare the procedure to $$ax^2+bx+c_1 = a\left(x+\frac{b}{2a}\right)^2+c_2$$

- Hence, it follows that the posterior for $\theta$ is
$$
    p(\theta|x) = \mathcal{N} (\theta |\, \mu_{\theta|x}, \sigma_{\theta|x}^2)
$$
where

\begin{align}
  \sigma_{\theta|x}^2  &= \frac{\sigma^2_{\epsilon}\sigma_\theta^2}{\sigma^2_{\epsilon} + \sigma_\theta^2} = \left( \frac{1}{\sigma_\theta^2} + \frac{1}{\sigma^2_{\epsilon}}\right)^{-1} \\
  \mu_{\theta|x}   &= \sigma_{\theta|x}^2 \, \left( \frac{1}{\sigma^2_{\epsilon}}x + \frac{1}{\sigma_\theta^2} \mu_\theta \right) 
\end{align}

- So, multiplication of two Gaussians yields another (unnormalized) Gaussian.
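
The closed-form posterior can be compared against a brute-force evaluation of likelihood times prior on a grid. This is a sketch under assumed values; the function name `posterior_constant` and the numbers are hypothetical, not from the original notes.

```python
import numpy as np

def posterior_constant(x, mu_theta, var_theta, var_eps):
    """Posterior mean and variance of theta given one measurement x."""
    var_post = 1.0 / (1.0 / var_theta + 1.0 / var_eps)
    mu_post = var_post * (x / var_eps + mu_theta / var_theta)
    return mu_post, var_post

# Assumed example values
x, mu_theta, var_theta, var_eps = 2.0, 0.0, 4.0, 1.0
mu_post, var_post = posterior_constant(x, mu_theta, var_theta, var_eps)

# Brute-force check: evaluate the unnormalized posterior on a grid
grid = np.linspace(-20.0, 20.0, 20001)
log_post = (-(x - grid) ** 2 / (2.0 * var_eps)
            - (grid - mu_theta) ** 2 / (2.0 * var_theta))
w = np.exp(log_post)
w /= w.sum()  # normalize to a discrete distribution over the grid
grid_mean = np.sum(w * grid)
grid_var = np.sum(w * (grid - grid_mean) ** 2)
```

For these values the posterior mean lies between the prior mean and the measurement, closer to the measurement because the measurement noise variance is smaller than the prior variance.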

### Multivariate Gaussian Multiplication
- In general, the multiplication of two multi-variate Gaussians yields an (unnormalized) Gaussian, see [SRG-6]:

$$
\mathcal{N}(x|\mu_a,\Sigma_a) \cdot \mathcal{N}(x|\mu_b,\Sigma_b) = \alpha \cdot \mathcal{N}(x|\mu_c,\Sigma_c)
$$
where
\begin{align*}
\Sigma_c &= \left( \Sigma_a^{-1} + \Sigma_b^{-1} \right)^{-1}\\
\mu_c &= \Sigma_c \left( \Sigma_a^{-1}\mu_a + \Sigma_b^{-1}\mu_b\right)
\end{align*}

and normalization constant $\alpha = \mathcal{N}(\mu_a|\, \mu_b, \Sigma_a + \Sigma_b)$.

- If we define the **precision** as $\Lambda \equiv \Sigma^{-1}$, then we see that **precisions add** and **precision-weighted means add** too.
- As we just saw in the estimation example, this rule applies directly to Bayesian inference with a Gaussian likelihood and a Gaussian prior:

$$
\underbrace{\text{Gaussian}}_{\text{posterior}}
 \propto \underbrace{\text{Gaussian}}_{\text{likelihood}} \times \underbrace{\text{Gaussian}}_{\text{prior}}
$$
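
The multiplication rule, including the normalization constant $\alpha$, can be verified pointwise. The sketch below uses hypothetical helper names (`gauss_pdf`, `gaussian_product`) and arbitrary example parameters; it evaluates both sides of the identity at one test point.

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density N(x | mu, Sigma)."""
    d = x - mu
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * Sigma))
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm

def gaussian_product(mu_a, Sigma_a, mu_b, Sigma_b):
    """Parameters (mu_c, Sigma_c, alpha) of N(x|a) * N(x|b) = alpha * N(x|c)."""
    Lam_a = np.linalg.inv(Sigma_a)  # precision matrices
    Lam_b = np.linalg.inv(Sigma_b)
    Sigma_c = np.linalg.inv(Lam_a + Lam_b)               # precisions add
    mu_c = Sigma_c @ (Lam_a @ mu_a + Lam_b @ mu_b)       # precision-weighted means add
    alpha = gauss_pdf(mu_a, mu_b, Sigma_a + Sigma_b)
    return mu_c, Sigma_c, alpha

# Assumed example parameters
mu_a = np.array([0.0, 0.0])
Sigma_a = np.array([[1.0, 0.2], [0.2, 2.0]])
mu_b = np.array([1.0, -1.0])
Sigma_b = np.array([[0.5, 0.0], [0.0, 0.5]])

mu_c, Sigma_c, alpha = gaussian_product(mu_a, Sigma_a, mu_b, Sigma_b)

# Compare both sides of the identity at an arbitrary test point
x0 = np.array([0.3, -0.4])
lhs = gauss_pdf(x0, mu_a, Sigma_a) * gauss_pdf(x0, mu_b, Sigma_b)
rhs = alpha * gauss_pdf(x0, mu_c, Sigma_c)
```

Explicit matrix inverses keep the sketch close to the formulas; for larger or ill-conditioned covariance matrices a solver-based formulation would usually be preferable.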


### Conditioning and Marginalization of a Gaussian

Let $z = \begin{bmatrix} x \\ y \end{bmatrix}$ be jointly normal distributed as

$$
p(z|\mu,\Sigma) = \mathcal{N} \left( \begin{bmatrix} x \\ y \end{bmatrix} \left| \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, 
  \begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y \end{bmatrix} \right. \right)
$$

- Note that the symmetry $\Sigma=\Sigma^T$ implies that $\Sigma_x$ and $\Sigma_y$ are symmetric and $\Sigma_{xy} = \Sigma_{yx}^T$.

- Now let's factorize $p(x,y)$ into $p(y|x)\, p(x)$ through conditioning and marginalization (for applications to Bayesian inference in jointly Gaussian systems).

- **Marginalization**
$$
p(x) = \int p(x,y)\,\mathrm{d}y = \mathcal{N}\left( x|\mu_x, \Sigma_x \right), \qquad p(y)=\mathcal{N} \left(y|\mu_y, \Sigma_y \right)
$$

- **Conditioning**
\begin{align}
p(y|x) &= p(x,y)/p(x) \\
 &= \mathcal{N}\left(y|\mu_y + \Sigma_{yx}\Sigma_x^{-1}(x-\mu_x),\, \Sigma_y - \Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy} \right)
\end{align}

- See [MJ-ch13] for a clear and detailed derivation.
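
The conditioning rule translates almost line by line into code. A minimal sketch, assuming the block structure of $\Sigma$ given above; the function name `condition_gaussian` and the example numbers are hypothetical. As a cross-check, the conditional covariance should equal the inverse of the $y$-block of the joint precision matrix.

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, S_x, S_xy, S_y, x_obs):
    """Mean and covariance of p(y | x = x_obs) for jointly Gaussian (x, y)."""
    S_yx = S_xy.T                      # symmetry of the joint covariance
    gain = S_yx @ np.linalg.inv(S_x)   # Sigma_yx Sigma_x^{-1}
    mu_cond = mu_y + gain @ (x_obs - mu_x)
    S_cond = S_y - gain @ S_xy
    return mu_cond, S_cond

# Assumed example with scalar blocks stored as 1x1 arrays
mu_x, mu_y = np.array([0.0]), np.array([1.0])
S_x, S_xy, S_y = np.array([[2.0]]), np.array([[1.0]]), np.array([[3.0]])
x_obs = np.array([2.0])

mu_cond, S_cond = condition_gaussian(mu_x, mu_y, S_x, S_xy, S_y, x_obs)

# Cross-check via the joint precision matrix
Sigma = np.block([[S_x, S_xy], [S_xy.T, S_y]])
Lam = np.linalg.inv(Sigma)
S_cond_from_precision = np.linalg.inv(Lam[1:, 1:])
```

Marginalization needs no computation at all: $p(x)$ is read off from the blocks $\mu_x$ and $\Sigma_x$.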



### Example: Conditioning of Gaussian

Consider (again) the system 

\begin{align}
x &= \theta + \epsilon \\
\theta &\sim \mathcal{N}(\theta|\mu_\theta,\sigma_\theta^2) \\
\epsilon &\sim \mathcal{N}(\epsilon|0,\sigma^2_{\epsilon})
\end{align}

- This system is equivalent to (derive this!)
$$
p(\theta,x|\,\mu,\sigma) = \mathcal{N} 
  \left( 
  \begin{bmatrix} \theta\\ x \end{bmatrix} 
  \left| \begin{bmatrix} \mu_\theta\\ \mu_\theta\end{bmatrix}, 
         \begin{bmatrix} \sigma_\theta^2 & \sigma_\theta^2\\ \sigma_\theta^2 & \sigma_\theta^2+\sigma_{\epsilon}^2 
  \end{bmatrix} 
  \right. 
  \right)
$$

- Direct substitution of the rule for Gaussian conditioning:
\begin{align*}
K &= \frac{\sigma_\theta^2}{\sigma_\theta^2+\sigma_{\epsilon}^2} \tag{K is called Kalman gain}\\
p(\theta|x) &= \mathcal{N} \left( \theta|\, \mu_\theta + K\cdot(x-\mu_\theta),\,\sigma_\theta^2\left( 1-K\right) \right)
\end{align*}
    
- Exercises: (1) Actually derive this result; (2) show that the result is equivalent to the previous slide on 'estimation of a constant'; and (3) try to interpret the resulting formulas.
- Moral: For jointly Gaussian systems, we do inference simply in one step by using the formulas for conditioning and marginalization.
Compare this to Eqs.~\ref{eq:recursive-bayes}-\ref{eq:recursive-bayes-kalman-gain}
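
Before deriving the equivalence by hand, it can help to see it numerically. The sketch below (hypothetical function names, assumed example values) computes the posterior for the constant once in Kalman-gain form and once in precision-weighted form; it is a sanity check, not a substitute for the derivation.

```python
def posterior_kalman_form(x, mu_theta, var_theta, var_eps):
    """Posterior of theta via the conditioning rule (Kalman-gain form)."""
    K = var_theta / (var_theta + var_eps)
    return mu_theta + K * (x - mu_theta), var_theta * (1.0 - K)

def posterior_precision_form(x, mu_theta, var_theta, var_eps):
    """Posterior of theta via Gaussian multiplication (precision-weighted form)."""
    var_post = 1.0 / (1.0 / var_theta + 1.0 / var_eps)
    return var_post * (x / var_eps + mu_theta / var_theta), var_post

# Assumed example values
args = dict(x=2.0, mu_theta=0.0, var_theta=4.0, var_eps=1.0)
mean_k, var_k = posterior_kalman_form(**args)
mean_p, var_p = posterior_precision_form(**args)
```

A gain $K$ close to 1 (small measurement noise relative to the prior variance) moves the estimate almost all the way to the measurement; a gain close to 0 leaves the prior mean nearly unchanged.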



### Application: Recursive Bayesian Estimation

Now consider the signal $x(t)=\theta+\epsilon(t)$, where $D_t= \left(x(1),\ldots,x(t)\right)$ is observed _sequentially_ (over time).

- [Q.] We want a recursive algorithm for $p(\theta|D_t)$.
    
- [A.] Again we assume prior $p(\theta) = \mathcal{N}(\theta|\mu,\sigma^2)$ and define $p(\theta|D_t) = \mathcal{N}(\theta|\hat\mu(t),\sigma_{\mu}^2(t))$ 
        
- We will solve this by using the estimate after $t-1$ as the **prior distribution** in conjunction with the **likelihood** for observation $x(t)$,
$$
p(\theta|D_t) \propto p(x(t)|\theta) \times p(\theta|D_{t-1})
$$

- Use the 'batch processing' posteriors for $\mu$ and $\sigma^2$ to get
\begin{align}
\hat \mu(t) &= \sigma_{\mu}^2(t) \, \left( \frac{1}{\sigma^2_{\epsilon}(t)}x(t) + \frac{1}{\sigma_{\mu}^2(t-1)} \hat \mu(t-1) \right) \\
    &= \frac{\sigma_{\mu}^2(t-1)}{\sigma^2_{\epsilon}(t)+\sigma_{\mu}^2(t-1)}x(t) + \frac{\sigma^2_{\epsilon}(t)}{\sigma^2_{\epsilon}(t)+\sigma_{\mu}^2(t-1)} \hat \mu(t-1) \\
    &= \hat \mu(t-1) + K(t) \left[ x(t) - \hat \mu(t-1) \right] \\
\sigma_{\mu}^2(t) &= \sigma_{\mu}^2(t-1) \frac{\sigma^2_{\epsilon}(t)}{\sigma^2_{\epsilon}(t)+\sigma_{\mu}^2(t-1)} \\
    &= \sigma_{\mu}^2(t-1) \left( 1-K(t) \right)
\end{align}
where we defined the **Kalman gain**
$$
    K(t) =  \frac{\sigma_{\mu}^2(t-1)}{\sigma^2_{\epsilon}(t)+\sigma_{\mu}^2(t-1)}
$$
- This linear sequential estimator of mean and variance in Gaussian observations is a **Kalman Filter**.
- The new observation $x(t)$ 'corrects' the old estimate $\hat \mu(t-1)$ by a quantity that is proportional to the _innovation_ (or _residual_)  $\left( x(t) - \hat \mu(t-1) \right)$.
- Note that the uncertainty about $\theta$, i.e., the posterior variance $\sigma_{\mu}^2(t)$, decreases over time, since $0 < K(t) < 1$.
- Recursive Bayesian estimation is the basis for **adaptive signal processing** algorithms such as Least Mean Squares (LMS) and Recursive Least Squares (RLS).
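
The recursion above can be sketched in a few lines. The names `kalman_update` and `recursive_estimate`, the data, and the parameter values below are assumptions for illustration, and the noise variance is taken to be constant over time. The final recursive estimate is compared with the batch posterior obtained by adding all precisions at once.

```python
import numpy as np

def kalman_update(mu_prev, var_prev, x_t, var_eps):
    """One recursive update of the posterior N(mu, var) for the constant theta."""
    K = var_prev / (var_eps + var_prev)      # Kalman gain
    mu_t = mu_prev + K * (x_t - mu_prev)     # correct by the innovation
    var_t = var_prev * (1.0 - K)             # posterior variance shrinks
    return mu_t, var_t

def recursive_estimate(xs, mu0, var0, var_eps):
    """Process the observations sequentially; return the list of (mu, var)."""
    mu, var = mu0, var0
    history = []
    for x_t in xs:
        mu, var = kalman_update(mu, var, x_t, var_eps)
        history.append((mu, var))
    return history

# Assumed observations and parameters
xs = np.array([1.2, 0.8, 1.1, 0.9])
mu0, var0, var_eps = 0.0, 10.0, 0.5

history = recursive_estimate(xs, mu0, var0, var_eps)
mu_rec, var_rec = history[-1]

# Batch posterior: prior precision plus one likelihood precision per sample
var_batch = 1.0 / (1.0 / var0 + len(xs) / var_eps)
mu_batch = var_batch * (mu0 / var0 + xs.sum() / var_eps)
```

Processing the data one sample at a time gives the same posterior as processing it in one batch, but only the current mean and variance need to be stored.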



### Review Gaussians
The success of Gaussian distributions in probabilistic modeling is largely due to the following properties:
- The product of two Gaussian functions is another Gaussian function (use in Bayes rule).
- The convolution of two Gaussian functions is another Gaussian function (use in sum of 2 variables).
- A linear transformation of a Gaussian distributed variable is also Gaussian distributed.
- Conditioning and marginalization of multivariate Gaussian distributions produce Gaussians again (use in working with observations and when doing Bayesian predictions).
- The Gaussian PDF has higher entropy than any other PDF with the same variance. (Not discussed in this course).
- Any smooth function with a single rounded maximum, if raised to higher and higher powers, goes into a Gaussian function. (Not discussed).

#### What's Next?
- We discussed how Bayesian probability theory provides an integrated framework for making predictions based on observed data.
- The process involves model specification (your main task!), inference and actual model-based prediction.
- The latter two tasks are only difficult because of computational issues.
   - Maximum likelihood was introduced as a computationally simpler approximation to the Bayesian approach.
   - In particular under some linear Gaussian assumptions, a few interesting models can be designed.
   - The rest of this course (part-1) is an introduction to these Linear Gaussian models.
