# Bayesian Optimization for Likelihood-Free Inference of Simulator-Based Statistical Models

We want to recover the true likelihood function of a simulator-based model, i.e. a model from which we can sample data $y_{\theta} \sim p_{y|\theta}$:

$\mathcal{L}(\theta) = p_{y|\theta}(y_o|\theta)$ -> true likelihood based on observed data

The approximation of the likelihood function is based on some measurement of discrepancy $\Delta_{\theta}$ between the observed data $y_o$ and data $y_{\theta}$ simulated with parameter value $\theta$. $\Delta_{\theta}$ is used to approximate $\mathcal{L}$ by $\hat{L}$.

The approximation is based on reduction of the data to some features, or summary statistics $\Phi$. The purpose of the summary statistics is to reduce the dimensionality and to filter out information which is not deemed relevant for the inference of $\theta$. 
So $\mathcal{L}$ is replaced with $L$: 

$L(\theta)=p_{\Phi|\theta}(\Phi_o|\theta)$ -> true likelihood based on summary statistics

$L(\theta)$ is a valid likelihood function, but for the inference of $\theta$ given $\Phi_o$ and not for the inference of $\theta$ given $y_o$, in contrast to $\mathcal{L}$, unless $\Phi$ happens to be statistically sufficient.

However, $L(\theta)$ is not known, because the pdf $p_{\Phi|\theta}$ is of unknown analytical form, a property inherited from $p_{y|\theta}$. So we again have to approximate $L(\theta)$. We denote a practical approximation, computable with finite resources, by $\hat{L}(\theta)$:

$\hat{L}(\theta)$ -> computable approximation of $L$

## Parametric Approximation of the Likelihood

If the summary statistics $\Phi_{\theta}$ are obtained via averaging (for example, as sample averages over the $n$ data points of a simulated data set), the central limit theorem suggests that their pdf may be well approximated by a Gaussian distribution if the number of samples $n$ is sufficiently large,

$p_{\Phi|\theta}(\phi|\theta) \approx \cfrac{1}{(2\pi)^{(p/2)}|\det \Sigma_{\theta}|^{1/2}} \exp (-\cfrac{1}{2}(\phi - \mu_{\theta})^T\Sigma_{\theta}^{-1}(\phi - \mu_{\theta}))$, with $p$ the dimension of $\Phi_{\theta}$

The corresponding likelihood function is $\tilde{L}_s = \exp(\tilde{l}_s)$

$\tilde{l}_s(\theta) = -\cfrac{p}{2}\log(2\pi) - \cfrac{1}{2}\log|\det \Sigma_{\theta}| -\cfrac{1}{2}(\Phi_o - \mu_{\theta})^T\Sigma_{\theta}^{-1}(\Phi_o - \mu_{\theta})$

$\mu_{\theta}$ and $\Sigma_{\theta}$ are generally not known, but the simulator can be used to estimate them via the sample average $E^N$ over $N$ independently generated summary statistics (for a given $\theta$),

$\hat{\mu}_{\theta} = E^N[\Phi_{\theta}] = \cfrac{1}{N} \sum_i \Phi_{\theta}^{(i)}, \quad \Phi_{\theta}^{(i)} \overset{i.i.d}{\sim } p_{\Phi|\theta}, \quad \hat{\Sigma}_{\theta} = E^N[(\Phi_{\theta} - \hat{\mu}_{\theta})(\Phi_{\theta} - \hat{\mu}_{\theta})^T]$

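Replacing $\mu_{\theta}$ and $\Sigma_{\theta}$ in $\tilde{l}_s$ with these estimates gives $\hat{l}_s^N$, and a computable estimate of the likelihood is then $\hat{L}_s^N = \exp(\hat{l}_s^N)$. This approximation was named "synthetic likelihood" by Wood (2010).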
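The estimate above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy simulator `simulate_summaries` (Gaussian data summarised by sample mean and sample variance), the observed summaries `phi_obs`, and all function names are hypothetical and not part of the original method description. The covariance uses the $1/N$ normalization implied by $E^N$.

```python
import numpy as np

def synthetic_loglik(phi_obs, phi_sim):
    """Estimate the synthetic log-likelihood at one fixed theta.

    phi_obs: shape (p,), observed summary statistics.
    phi_sim: shape (N, p), summaries simulated independently at that theta.
    """
    phi_obs = np.atleast_1d(np.asarray(phi_obs, dtype=float))
    phi_sim = np.asarray(phi_sim, dtype=float).reshape(len(phi_sim), -1)
    n_sim, p = phi_sim.shape
    mu_hat = phi_sim.mean(axis=0)
    centered = phi_sim - mu_hat
    sigma_hat = centered.T @ centered / n_sim  # 1/N normalization, as in E^N
    _, logdet = np.linalg.slogdet(sigma_hat)   # log|det Sigma_hat|
    diff = phi_obs - mu_hat
    quad = diff @ np.linalg.solve(sigma_hat, diff)
    return -0.5 * p * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad

def simulate_summaries(theta, n_sim, rng, n_data=50):
    """Hypothetical toy simulator: n_data draws from Normal(theta, 1),
    summarised by the sample mean and sample variance of each data set."""
    data = rng.normal(loc=theta, scale=1.0, size=(n_sim, n_data))
    return np.column_stack([data.mean(axis=1), data.var(axis=1)])

rng = np.random.default_rng(0)
phi_obs = np.array([0.0, 1.0])  # assumed observed summaries (mean 0, variance 1)
for theta in (0.0, 0.5, 2.0):
    ll = synthetic_loglik(phi_obs, simulate_summaries(theta, n_sim=500, rng=rng))
    print(f"theta={theta:.1f}  synthetic log-likelihood={ll:.2f}")
```

In this toy setting, parameter values whose simulated summaries lie far from `phi_obs` should receive a much lower synthetic log-likelihood than values close to the data-generating one.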

## Nonparametric Approximation of the Likelihood

If we do not know, or do not want to assume, a parametric model for the pdf $p_{\Phi|\theta}$ of the summary statistics, we can approximate it by a kernel density estimate,

$p_{\Phi|\theta}(\phi|\theta) \approx E^N[K(\phi,\Phi_{\theta})], \quad E^N[K(\phi,\Phi_{\theta})] = \cfrac{1}{N} \sum_i K(\phi, \Phi_{\theta}^{(i)})$

An approximation of the likelihood function $L(\theta)$ is given by $\hat{L}_K^N(\theta)$,

$\hat{L}_K^N(\theta) = E^N[K(\Phi_o, \Phi_{\theta})]$

We can rewrite $K$ as $\kappa(\Delta_{\theta})$, where the discrepancy $\Delta_{\theta} \geq 0$ depends on $\Phi_o$ and $\Phi_{\theta}$, and $\kappa$ is a univariate non-negative function not depending on $\theta$. Kernels $K$ are generally such that $\kappa$ has a maximum at zero (because kernels measure similarity, which is greatest when the two values are equal).

Expressing $\hat{L}_K^N(\theta)$ in terms of $\kappa$ gives the approximation $\hat{L}_{\kappa}^N$,

$\hat{L}_{\kappa}^N (\theta) = E^N[\kappa(\Delta_{\theta})]$

Denoting the empirical expectation $E^N[\Delta_{\theta}]$ by $\hat{J}^N(\theta)$, Jensen's inequality shows that for convex functions $\kappa$, $\kappa(\hat{J}^N(\theta))$ is a lower bound for the approximate likelihood,

$\hat{L}_{\kappa}^N(\theta) \geq \kappa(\hat{J}^N (\theta))$,

which means we do not have to calculate $\kappa(\Delta_{\theta})$ for each sample, but instead calculate the empirical expectation over the discrepancies $E^N[\Delta_{\theta}]$ and apply $\kappa$ only once.

Since $\kappa$ attains its maximum at zero, the lower bound is maximized by minimizing the conditional empirical expectation $\hat{J}^N (\theta)$.
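A quick numerical check of the bound may help here. The sketch below assumes an illustrative convex choice $\kappa(u) = \exp(-u/2)$ and draws hypothetical discrepancies from a chi-square distribution; neither the function names nor the discrepancy distribution come from the original text.

```python
import numpy as np

def kappa_exp(u):
    """Illustrative kappa: convex, non-negative, maximal at zero."""
    return np.exp(-0.5 * np.asarray(u, dtype=float))

def kernel_likelihood(discrepancies, kappa):
    """L_hat_kappa^N: empirical mean of kappa over the simulated discrepancies."""
    return float(np.mean(kappa(discrepancies)))

def jensen_lower_bound(discrepancies, kappa):
    """kappa(J_hat^N): kappa applied once to the mean discrepancy."""
    return float(kappa(np.mean(discrepancies)))

rng = np.random.default_rng(0)
# Hypothetical non-negative discrepancies for one fixed theta
delta = rng.chisquare(df=2, size=1000)

lik = kernel_likelihood(delta, kappa_exp)
bound = jensen_lower_bound(delta, kappa_exp)
print(f"L_hat = {lik:.4f}, kappa(J_hat) = {bound:.4f}, bound holds: {bound <= lik}")
```

The bound only needs the average discrepancy, which is why $\hat{J}^N(\theta)$ is a convenient quantity to model and minimize.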

A popular choice for $\kappa$ is a uniform kernel $\kappa = \kappa_u$, which leads to the approximate likelihood $\hat{L}_u^N$,

$\kappa_u(u) = c\chi_{[0,h)}(u), \quad \hat{L}_u^N (\theta) = cP^N(\Delta_{\theta} < h)$,

where the indicator function $\chi_{[0,h)}(u)$ equals one if $u \in [0,h)$ and zero otherwise. $c$ is a scaling parameter that does not depend on $\theta$, and $h$ is the bandwidth of the kernel, which acts as an acceptance/rejection threshold.

By Markov's inequality, a lower bound for $\hat{L}_u^N$ is given by:

$\hat{L}_u^N(\theta) = c[1 - P^N(\Delta_{\theta} \geq h)] \geq c[1 - \cfrac{1}{h}E^N[\Delta_{\theta}]] = c[1 - \cfrac{1}{h} \hat{J}^N(\theta)]$,

so again, minimizing $\hat{J}^N(\theta)$ maximizes a lower bound of the likelihood $\hat{L}_u^N(\theta)$.
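As a small worked example of the uniform-kernel case, the sketch below computes the acceptance fraction and the Markov-type lower bound for a handful of made-up discrepancy values. The function names, the threshold `h`, and the numbers are illustrative assumptions only.

```python
import numpy as np

def uniform_kernel_likelihood(discrepancies, h, c=1.0):
    """L_hat_u^N: c times the fraction of discrepancies below the threshold h."""
    d = np.asarray(discrepancies, dtype=float)
    return c * float(np.mean(d < h))

def markov_lower_bound(discrepancies, h, c=1.0):
    """c * (1 - J_hat^N / h), using only the mean discrepancy J_hat^N."""
    d = np.asarray(discrepancies, dtype=float)
    return c * (1.0 - float(d.mean()) / h)

# Made-up discrepancies for one fixed theta
delta = [0.1, 0.2, 0.5, 3.0]
h = 1.0

lik = uniform_kernel_likelihood(delta, h)  # 3 of 4 values are below h
bound = markov_lower_bound(delta, h)       # mean discrepancy is 0.95
print(lik, bound)
```

Note that the bound becomes negative, and therefore uninformative, whenever the mean discrepancy exceeds $h$.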

## Relation between Nonparametric and Parametric Approximation

A special choice of kernel allows us to embed the synthetic likelihood approach into the nonparametric approach.

For the Gaussian kernel, we have that $K(\Phi_o, \Phi_{\theta}) = K_g(\Phi_o - \Phi_{\theta})$,

$K_g(\Phi_o - \Phi_{\theta}) = \cfrac{1}{(2\pi)^{p/2}} \cfrac{1}{|\det C_{\theta}|^{1/2}} \exp \left( -\cfrac{(\Phi_o - \Phi_{\theta})^T C^{-1}_{\theta} (\Phi_o - \Phi_{\theta})}{2} \right)$,

where $C_{\theta}$ is a positive definite bandwidth matrix possibly depending on $\theta$. The kernel $K_g$ corresponds to $\kappa = \kappa_g$ and $\Delta_{\theta} = \Delta_{\theta}^g$,

$\kappa_g(u) = \cfrac{1}{(2\pi)^{p/2}} \exp(-\cfrac{u}{2}), \quad \Delta_{\theta}^g = \log|\det C_{\theta}| + (\Phi_o - \Phi_{\theta})^T C^{-1}_{\theta} (\Phi_o - \Phi_{\theta})$

The function $\kappa_g$ is convex and thus yields a lower bound for $\hat{L}_{\kappa}^N(\theta)=\hat{L}_g^N(\theta)$,

$\hat{J}_g^N(\theta) = E^N[\Delta_{\theta}^g] \Rightarrow \log \hat{L}_g^N(\theta) \geq \log \kappa_g(\hat{J}_g^N(\theta)) = -\cfrac{p}{2} \log(2\pi) - \cfrac{1}{2}\hat{J}_g^N(\theta)$

With $C_{\theta} = \hat{\Sigma}_{\theta}$, it can be shown that the synthetic log-likelihood $\hat{l}_s^N(\theta)$ can be written in terms of $\hat{J}_g^N(\theta)$ (this is not necessary for BOLFI, but it nicely relates the two approaches):

$\hat{l}_s^N(\theta) = \cfrac{p}{2} - \cfrac{p}{2}\log(2\pi) - \cfrac{1}{2}\hat{J}_g^N(\theta)$,

$\log \hat{L}_g^N(\theta) \geq -\cfrac{p}{2} + \hat{l}_s^N(\theta)$.
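For reference, the identity follows from a short calculation, assuming (as the $E^N$ notation implies) that $\hat{\Sigma}_{\theta}$ uses the $1/N$ normalization. Writing the quadratic form as a trace,

$E^N[(\Phi_o - \Phi_{\theta})^T \hat{\Sigma}_{\theta}^{-1} (\Phi_o - \Phi_{\theta})] = \mathrm{tr}\left(\hat{\Sigma}_{\theta}^{-1} E^N[(\Phi_o - \Phi_{\theta})(\Phi_o - \Phi_{\theta})^T]\right) = p + (\Phi_o - \hat{\mu}_{\theta})^T \hat{\Sigma}_{\theta}^{-1} (\Phi_o - \hat{\mu}_{\theta})$,

because $E^N[(\Phi_o - \Phi_{\theta})(\Phi_o - \Phi_{\theta})^T] = \hat{\Sigma}_{\theta} + (\Phi_o - \hat{\mu}_{\theta})(\Phi_o - \hat{\mu}_{\theta})^T$. Hence

$\hat{J}_g^N(\theta) = \log|\det \hat{\Sigma}_{\theta}| + p + (\Phi_o - \hat{\mu}_{\theta})^T \hat{\Sigma}_{\theta}^{-1} (\Phi_o - \hat{\mu}_{\theta}) = p - p\log(2\pi) - 2\hat{l}_s^N(\theta)$,

and solving for $\hat{l}_s^N(\theta)$ gives the expression for the synthetic log-likelihood, while inserting it into the lower bound $\log \hat{L}_g^N(\theta) \geq -\cfrac{p}{2}\log(2\pi) - \cfrac{1}{2}\hat{J}_g^N(\theta)$ gives the inequality.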

This shows that maximizing the synthetic log-likelihood $\hat{l}_s^N$ corresponds to maximizing a lower bound of the nonparametric approximation $\log \hat{L}_g^N$ of the log-likelihood.
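The relation can also be checked numerically. The sketch below uses made-up Gaussian summaries in place of a real simulator (the mixing matrix, sample size, and variable names are all illustrative assumptions) and verifies both the identity for $\hat{l}_s^N$ and the lower bound on $\log \hat{L}_g^N$ with $C_{\theta} = \hat{\Sigma}_{\theta}$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 3

# Made-up correlated "simulated summaries" and an "observed" summary vector
mix = np.array([[1.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 1.0]])
phi_sim = rng.normal(size=(N, p)) @ mix
phi_obs = rng.normal(size=p)

# Sample mean and covariance with the 1/N normalization of E^N
mu_hat = phi_sim.mean(axis=0)
centered = phi_sim - mu_hat
sigma_hat = centered.T @ centered / N
_, logdet = np.linalg.slogdet(sigma_hat)

# Synthetic log-likelihood estimate
diff = phi_obs - mu_hat
quad = diff @ np.linalg.solve(sigma_hat, diff)
l_s = -0.5 * p * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad

# Gaussian-kernel discrepancies with bandwidth matrix C = sigma_hat
resid = phi_obs - phi_sim                                   # shape (N, p)
maha = np.einsum("ij,ij->i", resid, np.linalg.solve(sigma_hat, resid.T).T)
delta_g = logdet + maha
J_g = delta_g.mean()

identity_rhs = 0.5 * p - 0.5 * p * np.log(2 * np.pi) - 0.5 * J_g
log_L_g = np.log(np.mean((2 * np.pi) ** (-p / 2) * np.exp(-0.5 * delta_g)))

print("identity holds:", np.isclose(l_s, identity_rhs))
print("lower bound holds:", log_L_g >= l_s - p / 2)
```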

## Posterior Inference