# Bayesian Measurement Error Model

Consider an inverse problem: find $\theta$, an input to a mathematical model, given $y$, an observation of (some components of, or functions of) the solution of the model. The two are related by an equation of the form

$$y = \mathcal{G}(\theta) + \eta$$

Here $\mathcal{G} : \mathbb{R}^{N_{\theta}} \to \mathbb{R}^{N_y}$ denotes the parameter-to-observation map, and the observational noise $\eta$ is assumed to be drawn from the Gaussian distribution $\mathcal{N}(0,\Sigma_{\eta})$.

## Optimization approach
From the optimization viewpoint, the optimal $\theta$ is found by solving the following least-squares problem:

$$\min_{\theta} \Phi(\theta, y) = \frac{1}{2}\lVert\Sigma_{\eta}^{-\frac{1}{2}} (y - \mathcal{G}(\theta)) \rVert^2$$

However, when the problem is not identifiable (for example, when there are too few observations to determine $\theta$ uniquely), regularization is required:

$$\min_{\theta} \Phi_R(\theta, y) = \Phi(\theta, y) + \frac{1}{2}\lVert\Sigma_{0}^{-\frac{1}{2}} (\theta - r_0) \rVert^2$$

where $r_0$ and $\Sigma_0$ generally encode prior mean and covariance information about $\theta$.
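
The two objectives can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `phi` and `phi_r`, the toy map `forward`, and all numerical values are assumptions made for this example. It uses the identity $\lVert\Sigma^{-\frac{1}{2}} v\rVert^2 = v^T \Sigma^{-1} v$ for a symmetric positive definite $\Sigma$.

```python
import numpy as np

def phi(theta, y, forward, Sigma_eta):
    """Data misfit: 1/2 * ||Sigma_eta^{-1/2} (y - G(theta))||^2."""
    residual = y - forward(theta)
    # v^T Sigma^{-1} v, computed with a linear solve instead of an explicit inverse.
    return 0.5 * residual @ np.linalg.solve(Sigma_eta, residual)

def phi_r(theta, y, forward, Sigma_eta, r0, Sigma0):
    """Regularized misfit: phi + 1/2 * ||Sigma0^{-1/2} (theta - r0)||^2."""
    shift = theta - r0
    return phi(theta, y, forward, Sigma_eta) + 0.5 * shift @ np.linalg.solve(Sigma0, shift)

# Hypothetical non-identifiable toy problem: a single observation of theta_1 + theta_2.
def forward(theta):
    return np.array([theta[0] + theta[1]])

y = np.array([1.0])
Sigma_eta = np.array([[0.01]])
r0 = np.array([1.0, 0.0])   # assumed prior mean
Sigma0 = np.eye(2)          # assumed prior covariance

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

# Both candidates fit the data exactly, so phi cannot tell them apart.
print(phi(a, y, forward, Sigma_eta), phi(b, y, forward, Sigma_eta))
# The regularization term prefers the candidate closer to r0.
print(phi_r(a, y, forward, Sigma_eta, r0, Sigma0), phi_r(b, y, forward, Sigma_eta, r0, Sigma0))
```

In this toy setting every $\theta$ with $\theta_1 + \theta_2 = 1$ minimizes $\Phi$, whereas $\Phi_R$ has a unique minimizer.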

## Probabilistic approach
From the Bayesian viewpoint, $\theta$ and $y$ are treated as random variables, and the inverse problem is formulated as the problem of approximating the posterior distribution:

$$ 
\begin{align*}
p(\theta|y) &= \frac{p(y | \theta) p(\theta)}{p(y)}\\
&= \frac{1}{Z(y)} e^{-\Phi(\theta , y) } p_0(\theta) 
\end{align*}
$$

where $p_0(\theta)$ is the prior distribution of $\theta$ and $Z(y)$ is the normalization constant:

$$ Z(y) =  \int e^{-\Phi(\theta , y) } p_0(\theta) d\theta $$

When the prior is uninformative (an improper uniform prior), the posterior distribution becomes

$$ p(\theta|y) = \frac{1}{Z(y)} e^{-\Phi(\theta , y) } $$

When the prior is assumed to be a Gaussian distribution, the posterior distribution becomes

$$ p(\theta|y) = \frac{1}{Z(y)} e^{-\Phi_R(\theta , y) } $$
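
The step from the general posterior to this expression can be spelled out as follows, assuming the Gaussian prior is $\mathcal{N}(r_0, \Sigma_0)$ with the same $r_0$ and $\Sigma_0$ as in the regularized least-squares problem:

$$
\begin{align*}
p_0(\theta) &= \frac{1}{\sqrt{(2\pi)^{N_{\theta}} \det \Sigma_0}} \exp\Big(-\frac{1}{2}\lVert\Sigma_{0}^{-\frac{1}{2}} (\theta - r_0) \rVert^2\Big)\\
e^{-\Phi(\theta , y)} p_0(\theta) &\propto \exp\Big(-\Phi(\theta , y) - \frac{1}{2}\lVert\Sigma_{0}^{-\frac{1}{2}} (\theta - r_0) \rVert^2\Big) = e^{-\Phi_R(\theta , y)}
\end{align*}
$$

The prefactor of the Gaussian density does not depend on $\theta$, so it is absorbed into the normalization constant; in this expression $Z(y) = \int e^{-\Phi_R(\theta , y)} d\theta$.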

## Relations
The optimization viewpoint and the Bayesian viewpoint are linked by the following facts:
* the minimizer of $\Phi(\theta,y)$ coincides with the maximum a posteriori (MAP) estimator under an uninformative prior;
* the minimizer of $\Phi_R(\theta,y)$ coincides with the MAP estimator under the Gaussian prior $\mathcal{N}(r_0, \Sigma_0)$.
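
The second relation can be checked numerically on a grid for a scalar problem. The sketch below is illustrative only: the cubic map `forward`, the noise and prior standard deviations, and the grid are all assumptions chosen for this example, and the normalization constant is approximated with a simple Riemann sum.

```python
import numpy as np

# Hypothetical scalar nonlinear parameter-to-observation map.
def forward(theta):
    return theta**3

y = 1.0             # assumed observation
sigma_eta = 0.5     # assumed noise standard deviation
r0, sigma0 = 0.0, 1.0   # assumed Gaussian prior N(r0, sigma0^2)

# Uniform grid over the parameter.
theta = np.linspace(-3.0, 3.0, 6001)
dtheta = theta[1] - theta[0]

# Scalar versions of Phi and Phi_R.
phi = 0.5 * ((y - forward(theta)) / sigma_eta) ** 2
phi_r = phi + 0.5 * ((theta - r0) / sigma0) ** 2

# Unnormalized posterior and a Riemann-sum approximation of Z(y).
unnormalized = np.exp(-phi_r)
Z = np.sum(unnormalized) * dtheta
posterior = unnormalized / Z

theta_map = theta[np.argmax(posterior)]   # MAP estimate on the grid
theta_opt = theta[np.argmin(phi_r)]       # minimizer of Phi_R on the grid
print(theta_map, theta_opt)
```

Because $e^{-x}$ is strictly decreasing, the grid point that maximizes the posterior density is the one that minimizes $\Phi_R$, whatever the value of $Z(y)$.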


## Linear Case : $\mathcal{G}(\theta) = G \theta$

When the parameter-to-observation map is linear, the posterior distribution has a closed form:

* The posterior distribution with a Gaussian prior $\mathcal{N}(r_0, \Sigma_0)$ is 

    $$\theta|y \sim \mathcal{N}\Big(r_0 + \Sigma_0 G^T(\Sigma_{\eta} + G \Sigma_0 G^T)^{-1} (y - G r_0), \quad \Sigma_0 -\Sigma_0 G^T(\Sigma_{\eta} + G \Sigma_0G^T)^{-1} G\Sigma_0 \Big)$$

    The Sherman–Morrison–Woodbury formula gives

    $$\Big(G^T \Sigma_{\eta}^{-1} G + \Sigma_0^{-1}\Big)^{-1} = \Sigma_0 -\Sigma_0 G^T(\Sigma_{\eta} + G \Sigma_0G^T)^{-1} G\Sigma_0$$

    Thus $G^T \Sigma_{\eta}^{-1} G + \Sigma_0^{-1}$ is the inverse of the posterior covariance, which is called the precision matrix.
    
    The posterior distribution can be rewritten as
    
    $$\theta|y \sim \mathcal{N}\Big(r_0 + \Big(G^T \Sigma_{\eta}^{-1} G + \Sigma_0^{-1}\Big)^{-1} G^T\Sigma_{\eta}^{-1}(y - G r_0), \quad \Big(G^T \Sigma_{\eta}^{-1} G + \Sigma_0^{-1}\Big)^{-1} \Big)$$
    
    Moreover, the mean of this Gaussian distribution is also the MAP estimator:
    
    $$\textrm{argmin}_{\theta} \Phi_R(\theta, y)$$
    
    
* The posterior distribution with an uninformative prior is

    $$
    \theta|y \sim \mathcal{N}\Big((G^T \Sigma_{\eta}^{-1} G)^{-1}  G^T \Sigma_{\eta}^{-1}y, \quad (G^T \Sigma_{\eta}^{-1} G)^{-1}\Big)
    $$

    which exists only when $G$ has a trivial null space (that is, $G^T \Sigma_{\eta}^{-1} G$ is nonsingular).
    Moreover, the mean of this Gaussian distribution is also the MAP estimator:
    
    $$\textrm{argmin}_{\theta} \Phi(\theta, y)$$

In a nutshell, when the parameter-to-observation map $\mathcal{G}$ is linear, the posterior distribution is Gaussian under either a Gaussian prior or an uninformative prior (in the latter case, provided $G^T \Sigma_{\eta}^{-1} G$ is nonsingular).
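
The closed-form expressions for the linear case can be verified numerically. The following is a minimal sketch under assumed values: the dimensions, the random matrix `G`, the covariances, and the synthetic data are all hypothetical. It computes the Gaussian-prior posterior in both the covariance form and the precision form, and then the posterior under an uninformative prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem sizes and data (more observations than parameters).
N_theta, N_y = 3, 5
G = rng.normal(size=(N_y, N_theta))
Sigma_eta = 0.1 * np.eye(N_y)
r0 = np.zeros(N_theta)
Sigma0 = np.eye(N_theta)
theta_true = rng.normal(size=N_theta)
y = G @ theta_true + rng.multivariate_normal(np.zeros(N_y), Sigma_eta)

# Covariance form of the posterior with the Gaussian prior N(r0, Sigma0).
S = Sigma_eta + G @ Sigma0 @ G.T
K = Sigma0 @ G.T @ np.linalg.inv(S)
mean_cov = r0 + K @ (y - G @ r0)
cov_cov = Sigma0 - K @ G @ Sigma0

# Precision form of the same posterior.
precision = G.T @ np.linalg.solve(Sigma_eta, G) + np.linalg.inv(Sigma0)
cov_prec = np.linalg.inv(precision)
mean_prec = r0 + cov_prec @ G.T @ np.linalg.solve(Sigma_eta, y - G @ r0)

# Gradient of Phi_R at the posterior mean; it should vanish at the MAP estimator.
grad = -G.T @ np.linalg.solve(Sigma_eta, y - G @ mean_prec) + np.linalg.solve(Sigma0, mean_prec - r0)

# Posterior with an uninformative prior (requires G^T Sigma_eta^{-1} G nonsingular).
H = G.T @ np.linalg.solve(Sigma_eta, G)
cov_flat = np.linalg.inv(H)
mean_flat = cov_flat @ G.T @ np.linalg.solve(Sigma_eta, y)

print(np.allclose(mean_cov, mean_prec), np.allclose(cov_cov, cov_prec))
print(np.linalg.norm(grad))
```

The two forms agree up to round-off, as the Sherman–Morrison–Woodbury formula predicts. The covariance form inverts an $N_y \times N_y$ matrix and the precision form an $N_{\theta} \times N_{\theta}$ matrix, so the cheaper choice depends on which dimension is smaller.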
