## Implicit Inference vs Explicit Inference

> **Implicit inference:** the model is defined implicitly by samples $\{(d_i, \theta_i)\}_{i=1}^N$.

> **Explicit inference:** the model is defined with explicit pdfs, e.g., Gaussian, exponential, etc. For example, $d=\mu(\theta)+n$ with $n \sim \mathcal{N}(0, C)$.

The $d_i$ can come from the following sources:
- Found: e.g., astronomical observations;
- Generated: (a) physical simulation, (b) generative model $\Rightarrow$ often called "Simulation-based Inference" (SBI).

**Example**<br>
Computing the posterior mean when the model is specified by samples $\{(d_i, \theta_i)\}_{i=1}^N$ with $(d_i, \theta_i) \sim p(d, \theta)$.

We want a function $f(d) \approx \mathbb{E}(\theta)_{p(\theta|d)}$. Recall from Bayesian decision theory:<br>

For the squared-error loss $L = \int (f(d)-\theta)^2 p(d, \theta)\, dd\, d\theta$, setting $\frac{\partial L}{\partial f}\big|_{f=\hat{f}} = 0$ gives $\hat{f}(d) = \mathbb{E}(\theta)_{p(\theta|d)}$.
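
The stationarity step can be spelled out by factoring $p(d, \theta) = p(\theta|d)\, p(d)$ and varying $f$ pointwise in $d$. This is a sketch, assuming $p(d) > 0$ and that the integrals exist:

$$\begin{align*}
L &= \int p(d) \left[\int (f(d)-\theta)^2\, p(\theta|d)\, d\theta\right] dd\\
\frac{\delta L}{\delta f(d)} &= 2\, p(d) \int (f(d)-\theta)\, p(\theta|d)\, d\theta = 2\, p(d) \left[f(d) - \mathbb{E}(\theta)_{p(\theta|d)}\right] = 0\\
\Rightarrow\ \hat{f}(d) &= \int \theta\, p(\theta|d)\, d\theta = \mathbb{E}(\theta)_{p(\theta|d)}
\end{align*}$$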

Given the samples, the integral is approximated by the sample average: $L \approx \frac{1}{N} \sum_{i=1}^{N} (f(d_i)-\theta_i)^2$
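
To see the sample-average loss at work, here is a minimal sketch on a toy model that is not part of the notes: it assumes $\theta \sim \mathcal{N}(0, 1)$ and $d = \theta + n$ with $n \sim \mathcal{N}(0, \sigma^2)$, for which the exact posterior mean is $d/(1+\sigma^2)$. The function $f$ is restricted to the linear family $f(d) = a\,d + b$ (also an assumption), so the minimizer of the empirical loss is an ordinary least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50_000
sigma = 0.5  # assumed noise standard deviation of the toy model

# Training pairs (d_i, theta_i) drawn from the joint p(d, theta).
theta = rng.normal(0.0, 1.0, size=N)          # theta ~ N(0, 1)
d = theta + rng.normal(0.0, sigma, size=N)    # d = theta + n

# Minimize (1/N) * sum_i (a*d_i + b - theta_i)^2 over (a, b) by least squares.
A = np.column_stack([d, np.ones(N)])
(a, b), *_ = np.linalg.lstsq(A, theta, rcond=None)

# Exact posterior mean for this toy model is a_exact * d.
a_exact = 1.0 / (1.0 + sigma**2)
print(f"fitted slope a = {a:.4f}, exact = {a_exact:.4f}, fitted offset b = {b:.4f}")
```

No likelihood is evaluated anywhere: only the pairs $(d_i, \theta_i)$ enter the fit, which is the sense in which the model is implicit.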

In practice, we can represent $f(d)$ as a Neural Network, $f_w(d)$, where $w$ is the set of weights and biases of the neural network.

<img src="https://raw.githubusercontent.com/bd0525/new2GPR/main/assets/nn_demo001.jpg" alt="nn_demo001" width="400"/>

$h = \varphi(w_1 d+ b_1)$, $\hat{\theta} = \varphi(w_2 h + b_2)$, where $\varphi$ is a non-linear activation function (e.g., ReLU, tanh, etc.).

Neural networks are universal function approximators, and it is often straightforward to find $w_i^*$ and $b_i^*$ such that $f_w(d)$ minimizes a differentiable loss function $L[f_w]$.<br>
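
As an illustration, the following is a minimal NumPy sketch of a one-hidden-layer network trained by full-batch gradient descent on the sample-average squared loss. The toy data ($\theta \sim \mathcal{N}(0,1)$, $d = \theta + n$, $\sigma = 0.5$), the hidden width, the learning rate, and the linear output layer (no $\varphi$ on the output, so $\hat{\theta}$ is unbounded) are all assumptions made for this example, not choices taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
N, H = 2000, 8          # number of samples, hidden units (assumed)
sigma = 0.5             # assumed noise level of the toy model

theta = rng.normal(size=(N, 1))
d = theta + sigma * rng.normal(size=(N, 1))

# Weights and biases w = {W1, b1, W2, b2}.
W1 = 0.3 * rng.normal(size=(1, H)); b1 = np.zeros(H)
W2 = 0.1 * rng.normal(size=(H, 1)); b2 = np.zeros(1)

def forward(x):
    """Return hidden activations h and prediction f_w(x)."""
    h = np.tanh(x @ W1 + b1)
    return h, h @ W2 + b2   # linear output layer (assumption)

lr = 0.02
losses = []
for step in range(3000):
    h, pred = forward(d)
    err = pred - theta
    losses.append(float(np.mean(err**2)))
    # Backpropagate the gradient of L = mean(err^2).
    g_pred = 2.0 * err / N
    g_W2 = h.T @ g_pred
    g_b2 = g_pred.sum(axis=0)
    g_z = (g_pred @ W2.T) * (1.0 - h**2)   # derivative of tanh
    g_W1 = d.T @ g_z
    g_b1 = g_z.sum(axis=0)
    W1 -= lr * g_W1; b1 -= lr * g_b1
    W2 -= lr * g_W2; b2 -= lr * g_b2

final_loss = float(np.mean((forward(d)[1] - theta) ** 2))
print(f"initial loss {losses[0]:.3f} -> final loss {final_loss:.3f}")
```

For this toy model the best achievable loss is the posterior variance $\sigma^2/(1+\sigma^2) = 0.2$, so the final training loss should settle near that value. In practice an automatic-differentiation library would replace the hand-written gradients.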


=================================================<br>

To get the posterior variance:

$L = \int \left(g(d) - (\theta - \hat{f}_{w}(d))^2\right)^2 p(d, \theta)\, d\theta\, dd$, whose minimizer is $\hat{g}(d) = \mathbb{E}[(\theta-\hat{\theta})^2]_{p(\theta|d)}$ with $\hat{\theta} = \hat{f}_{w}(d)$.
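
The two-stage recipe can be sketched as follows: first fit the mean, then regress the squared residuals on $d$. The toy model ($\theta \sim \mathcal{N}(0,1)$, $d = \theta + n$, $\sigma = 0.5$) and the linear forms for both $f$ and $g$ are assumptions for illustration; in this model the exact posterior variance is the constant $\sigma^2/(1+\sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma = 50_000, 0.5   # sample size and assumed noise level
theta = rng.normal(0.0, 1.0, size=N)
d = theta + rng.normal(0.0, sigma, size=N)

A = np.column_stack([d, np.ones(N)])

# Stage 1: posterior-mean estimate f(d) = a*d + b from the squared loss.
coef_f, *_ = np.linalg.lstsq(A, theta, rcond=None)

# Stage 2: fit g(d) = c1*d + c0 to the squared residuals (theta - f(d))^2.
sq_resid = (theta - A @ coef_f) ** 2
coef_g, *_ = np.linalg.lstsq(A, sq_resid, rcond=None)
c1, c0 = coef_g

var_exact = sigma**2 / (1.0 + sigma**2)
print(f"g(d) = {c1:.4f} * d + {c0:.4f}, exact posterior variance = {var_exact:.4f}")
```

The fitted slope should be close to zero here because the posterior variance of this toy model does not depend on $d$; a more flexible $g$ would be needed when it does.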

Posterior median:

$L = \int |\theta - f(d)|\, p(\theta, d)\, d\theta\, dd$, whose minimizer $\hat{f}(d)$ is the median of $p(\theta|d)$.
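
A quick numerical check of the difference between the two losses, as a sketch: the dependence on $d$ is dropped (equivalent to looking at a single fixed $d$), and the samples of $\theta$ are drawn from a skewed lognormal distribution. Both simplifications are assumptions for this example. The constant that minimizes the mean absolute loss should match the sample median, while the squared loss picks out the sample mean.

```python
import numpy as np

rng = np.random.default_rng(3)
# Skewed samples standing in for p(theta | d) at one fixed d (assumption).
theta = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

# Evaluate both sample-average losses for constant predictions c on a grid.
grid = np.linspace(0.0, 4.0, 401)                 # grid step 0.01
diff = theta[None, :] - grid[:, None]
abs_loss = np.abs(diff).mean(axis=1)
sq_loss = (diff ** 2).mean(axis=1)

c_abs = grid[np.argmin(abs_loss)]   # minimizer of the absolute loss
c_sq = grid[np.argmin(sq_loss)]     # minimizer of the squared loss

print(f"absolute loss minimizer {c_abs:.2f} vs sample median {np.median(theta):.2f}")
print(f"squared loss minimizer  {c_sq:.2f} vs sample mean   {theta.mean():.2f}")
```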

**Caveats:**<br>
> Need enough "training data", $\{(d_i, \theta_i)\}_{i=1}^N$;

> $f_w(d)$ must be sufficiently expressive (number of hidden layers, architecture, etc.);

> Must be able to find a good minimum of $L[f_w]$.

<ins>**Finding approximations to the posterior pdf**</ins>

Certain NN architectures are designed to represent pdfs, for example:
- Normalizing flows (MAF);

- Mixture density networks, e.g., $p(x) = \sum_{i=1}^{K} w_i \mathcal{N}(x; \mu_i, C_i)$ with $K$ components, where $w_i$, $\mu_i$, $C_i$ are parameters of the neural network. Making them functions of a conditioning variable $y$ gives a conditional density, $p(x|y) = \sum_{i=1}^{K} w_i(y) \mathcal{N}(x; \mu_i(y), C_i(y))$.

To solve for $\hat{p}(\theta|d)$, minimize the KL divergence (the $\lambda$ term enforces the normalization of $\hat{p}$):<br>
$$\begin{align*}
L &= KL\left(p(\theta, d) \mid\mid \hat{p}_{w}(\theta, d)\right) - \lambda \int \hat{p}_{w}(\theta, d)\, d\theta\, dd\\
&=\int p(\theta, d) \ln \frac{p(\theta, d)}{\hat{p}_{w}(\theta, d)}\, d\theta\, dd - \lambda \cdots\\
&=\text{const} - \int p(\theta, d)\ln \hat{p}_{w}(\theta, d)\, d\theta\, dd - \lambda \cdots\\
&\approx \text{const} - \frac{1}{N}\sum_{i=1}^{N}\ln\hat{p}_{w}(\theta_i, d_i) - \lambda \cdots
\end{align*}$$
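
The sample-average form of the loss is a negative log-likelihood, which the sketch below minimizes by gradient descent. Several assumptions are made for illustration: the network is replaced by a single conditional Gaussian $\hat{p}_w(\theta|d) = \mathcal{N}(\theta;\, a\,d + b,\, s^2)$ (a one-component special case of a mixture density), the conditional is modelled directly since the factor $p(d)$ does not depend on $w$, and the $\lambda$ term is dropped because a Gaussian is normalized by construction. The toy data again follow $\theta \sim \mathcal{N}(0,1)$, $d = \theta + n$, $\sigma = 0.5$, so the exact posterior is $\mathcal{N}(d/(1+\sigma^2),\, \sigma^2/(1+\sigma^2))$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, sigma = 20_000, 0.5   # sample size and assumed noise level
theta = rng.normal(0.0, 1.0, size=N)
d = theta + rng.normal(0.0, sigma, size=N)

# Parameters w = (a, b, log_s) of p_hat(theta | d) = N(theta; a*d + b, s^2).
a, b, log_s = 0.0, 0.0, 0.0
lr = 0.05

def nll(a, b, log_s):
    """Sample-average negative log-likelihood, -(1/N) sum_i ln p_hat(theta_i | d_i)."""
    r = theta - (a * d + b)
    return float(np.mean(log_s + 0.5 * np.log(2 * np.pi) + 0.5 * r**2 * np.exp(-2 * log_s)))

nll_start = nll(a, b, log_s)
for step in range(5000):
    r = theta - (a * d + b)
    inv_var = np.exp(-2 * log_s)
    # Analytic gradients of the average negative log-likelihood.
    g_a = -np.mean(r * d) * inv_var
    g_b = -np.mean(r) * inv_var
    g_log_s = 1.0 - np.mean(r**2) * inv_var
    a, b, log_s = a - lr * g_a, b - lr * g_b, log_s - lr * g_log_s

s2 = float(np.exp(2 * log_s))
print(f"slope a = {a:.4f} (exact {1 / (1 + sigma**2):.4f}), "
      f"variance s^2 = {s2:.4f} (exact {sigma**2 / (1 + sigma**2):.4f})")
print(f"NLL: {nll_start:.4f} -> {nll(a, b, log_s):.4f}")
```

A mixture density network or a normalizing flow follows the same pattern, with $a$, $b$, $s$ replaced by network outputs and the gradients supplied by automatic differentiation.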




