## [Feb 23] Causal Inference and Transfer Learning III

Presenter: Yuchen Ge  
Affiliation: University of Oxford  
Contact Email: gycdwwd@gmail.com  
Website: https://yuchenge-am.github.io

---

### 1. Bayesian Method Basics

In a Bayesian setting, some unknown real-world quantity $\Theta$ takes values in a parameter space $\Omega$. Typically $\Omega \subseteq \mathbb{R}^{p}$.

> Let $S \subseteq \Omega$. Is $\Theta$ in $S$?

If our knowledge of the world is coherent (coherence is the topic of Section 3), then there exists a unique **prior distribution** with density $\pi(\theta)$ on $\Omega$ such that

> $$\int_{S} \pi(\theta) d \theta=\operatorname{Pr}(\Theta \in S). $$

Observations $Y=\left(Y_{1}, \ldots, Y_{n}\right)$, $Y \in \mathcal{Y}$, are distributed according to an observation model with probability density $p(y \mid \theta)$ for the realisation $Y=y$ with $y=\left(y_{1}, \ldots, y_{n}\right)$, given $\Theta=\theta$. If the observations are iid given $\Theta=\theta$, then the **observation model** has probability density

$$p(y \mid \theta)=\prod_{i=1}^{n} p\left(y_{i} \mid \theta\right).$$

Having observed $Y=y$, our beliefs about $\Theta \in S$ are updated to the posterior probability $\pi(S \mid y)=\operatorname{Pr}(\Theta \in S \mid Y=y)$. The **posterior density** of the posterior distribution is

$$\pi(\theta \mid y)=\frac{p(y \mid \theta) \pi(\theta)}{p(y)} \quad \text{ where } \quad p(y)=\int_{\Omega} p(y \mid \theta) \pi(\theta) d \theta$$

is the normalising marginal likelihood (also called **the prior predictive distribution**). Answers to questions about $\Theta$ can be given in terms of $\pi(\theta \mid y)$ via $\operatorname{Pr}(\Theta \in S \mid Y=y)=\int_{S} \pi(\theta \mid y) d \theta$.
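
As a minimal sketch of these formulas, the following Python snippet approximates the posterior on a grid. Everything specific here is an assumption made for illustration and is not part of the notes: a Bernoulli observation model, a uniform prior on $(0,1)$, the toy data `y`, and the names `grid_posterior`, `grid` and `prob_S`.

```python
import numpy as np

def grid_posterior(y, grid, prior):
    """Approximate pi(theta | y) on an evenly spaced grid for iid Bernoulli data.

    Returns the normalised posterior density on the grid and the
    approximate marginal likelihood p(y).
    """
    y = np.asarray(y)
    k, n = int(y.sum()), y.size
    # p(y | theta) = prod_i p(y_i | theta) for iid Bernoulli observations
    likelihood = grid**k * (1.0 - grid) ** (n - k)
    unnormalised = likelihood * prior           # p(y | theta) * pi(theta)
    step = grid[1] - grid[0]
    p_y = unnormalised.sum() * step             # midpoint-rule estimate of p(y)
    return unnormalised / p_y, p_y

# Hypothetical toy data: 7 successes in 10 trials.
y = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
grid = np.linspace(0.0005, 0.9995, 1000)        # midpoints of 1000 cells on (0, 1)
prior = np.ones_like(grid)                      # uniform prior density (an assumption)

posterior, p_y = grid_posterior(y, grid, prior)

# Pr(Theta in S | y) for S = (0.5, 1): integrate the posterior density over S.
step = grid[1] - grid[0]
prob_S = posterior[grid > 0.5].sum() * step
print(f"p(y) ~ {p_y:.6f}, Pr(Theta > 0.5 | y) ~ {prob_S:.3f}")
```

Under these assumptions the exact posterior is $\operatorname{Beta}(8,4)$, so the grid values can be checked against the closed form.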


---

Given the observation model $Y \sim p(\cdot \mid \theta)$ and a loss function $L$, an estimator $\delta: \mathcal{Y} \rightarrow \mathbb{R}^{p}$ of $\Theta$ has risk $\mathcal{R}(\theta, \delta)$ at $\Theta=\theta$ given by

$$\begin{aligned}
\mathcal{R}(\theta, \delta) & =E_{Y \mid \Theta=\theta}(L(\theta, \delta(Y))) \\
& =\int_{\mathcal{Y}} L(\theta, \delta(y)) p(y \mid \theta) d y .
\end{aligned}$$

> The Bayes risk, $\rho(\pi, \delta)$, is the risk averaged over the prior,
>
> $$\begin{aligned}
> \rho(\pi, \delta) & =E_{\Theta}(\mathcal{R}(\Theta, \delta)) =E_{\Theta, Y}(L(\Theta, \delta(Y))) \\
> & =\int_{\Omega} \int_{\mathcal{Y}} L(\theta, \delta(y)) p(y \mid \theta) \pi(\theta) d y d \theta .
> \end{aligned}$$
>
> where we have a prior $\pi(\theta)$, posterior $\pi(\theta \mid y)$ and marginal likelihood $p(y)$.
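
To make the risk and the Bayes risk concrete, here is a small sketch that evaluates both numerically. The setting is an assumption chosen for illustration, not taken from the notes: squared error loss, $n$ Bernoulli trials summarised by the success count $k$, a uniform prior, and two hypothetical estimators named `mle` and `posterior_mean`.

```python
import math
import numpy as np

def risk(theta, delta, n):
    """R(theta, delta) under squared error loss for K ~ Binomial(n, theta)."""
    return sum(
        (theta - delta(k, n)) ** 2
        * math.comb(n, k) * theta**k * (1.0 - theta) ** (n - k)
        for k in range(n + 1)
    )

def bayes_risk(delta, n, grid):
    """rho(pi, delta) for a uniform prior, via a midpoint rule over theta."""
    step = grid[1] - grid[0]
    return float(sum(risk(theta, delta, n) for theta in grid) * step)

def mle(k, n):
    # Maximum likelihood estimator k / n.
    return k / n

def posterior_mean(k, n):
    # Posterior mean under the uniform prior, i.e. the mean of Beta(k + 1, n - k + 1).
    return (k + 1) / (n + 2)

n = 10
grid = np.linspace(0.0005, 0.9995, 1000)   # midpoints of 1000 cells on (0, 1)
rho_mle = bayes_risk(mle, n, grid)
rho_post_mean = bayes_risk(posterior_mean, n, grid)
print(rho_mle, rho_post_mean)
```

In this toy setting the closed forms are $1/(6n)$ for `mle` and $1/(6(n+2))$ for `posterior_mean`, so the numerical values can be checked directly, and `posterior_mean` has the smaller Bayes risk.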


A Bayes estimator $\delta^{\pi}$ for $\theta$ minimises the Bayes risk: $\delta^{\pi}=\arg \min _{\delta} \rho(\pi, \delta)$.

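Minimising over $\delta$ directly is not straightforward because $\delta$ is a function. Instead we consider the **expected posterior loss**

$$\rho(\pi, \delta \mid y) = E_{\Theta \mid Y=y}(L(\Theta, \delta(y))) =\int_{\Omega} L(\theta, \delta(y)) \pi(\theta \mid y) d \theta, \quad \text{so that} \quad \rho(\pi, \delta)=\int_{\mathcal{Y}} \rho(\pi, \delta \mid y) p(y) d y,$$

where the second identity follows from $p(y \mid \theta) \pi(\theta)=\pi(\theta \mid y) p(y)$ and exchanging the order of integration. Since $p(y) \geq 0$, minimising the integrand separately for each $y$ minimises the integral.

> The rule $\delta^{\pi}(y)=\arg \min _{\delta(y)} \rho(\pi, \delta \mid y)$, applied at each $y$, minimises the Bayes risk.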

For example, let $L(\Theta, \delta(Y))=(\Theta-\delta(Y))^{2}$. In this case $E_{\Theta \mid y}(L(\Theta, \delta))$ is minimised over actions by the posterior mean $\delta^{*}=E_{\Theta \mid y}(\Theta)$ (just differentiate with respect to $\delta(y)$ at fixed $y$).
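
To spell out the differentiation step, here is a short derivation for scalar $\Theta$. It assumes the posterior has a finite second moment, and it writes $a=\delta(y)$ for the action at fixed $y$ (the symbol $a$ is introduced here only for illustration):

$$\begin{aligned}
E_{\Theta \mid y}\left((\Theta-a)^{2}\right) & =E_{\Theta \mid y}\left(\Theta^{2}\right)-2 a E_{\Theta \mid y}(\Theta)+a^{2}, \\
\frac{\partial}{\partial a} E_{\Theta \mid y}\left((\Theta-a)^{2}\right) & =-2 E_{\Theta \mid y}(\Theta)+2 a=0 \iff a=E_{\Theta \mid y}(\Theta) .
\end{aligned}$$

The second derivative with respect to $a$ is $2>0$, so this stationary point is the minimum.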

---

Suppose we wish to estimate a function $f(\Theta)$ of the parameter. If the loss is the squared error then we estimate $f(\Theta)$ with the posterior mean $E_{\Theta \mid Y=y}(f(\Theta))$. To approximate it by Monte Carlo, simulate $\theta^{(t)} \sim \pi(\cdot \mid y)$, $t=1, \ldots, T$, and compute

$$\hat{f}=\frac{1}{T} \sum_{t=1}^{T} f\left(\theta^{(t)}\right) .$$

> For example, if  $S \in \mathcal{B}_{\Omega}$  and  $f(\theta)=\mathbb{I}_{\theta \in S}$  then  $\hat{f}$  estimates  $\pi(S \mid y)$.
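
The Monte Carlo estimator above can be sketched in a few lines. This is an illustration under stated assumptions rather than a general recipe: the posterior is taken to be $\operatorname{Beta}(8,4)$ (what a uniform prior and 7 successes in 10 Bernoulli trials would give) so that it can be sampled directly, and the names `mc_estimate`, `post_mean` and `prob_S` are hypothetical.

```python
import numpy as np

def mc_estimate(f, samples):
    """Monte Carlo estimate (1/T) * sum_t f(theta^(t)) of E[f(Theta) | y]."""
    return float(np.mean(f(samples)))

rng = np.random.default_rng(0)

# Assumption: posterior draws theta^(t) ~ Beta(8, 4), sampled directly.
theta = rng.beta(8.0, 4.0, size=100_000)

# f(theta) = theta estimates the posterior mean.
post_mean = mc_estimate(lambda t: t, theta)

# f(theta) = indicator(theta in S) with S = (0.5, 1) estimates pi(S | y).
prob_S = mc_estimate(lambda t: (t > 0.5).astype(float), theta)

print(post_mean, prob_S)
```

In practice the draws would usually come from an MCMC sampler instead of a direct sampler; the averaging step stays the same.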

We also commonly report posterior credible sets in order to quantify uncertainty. A level-$\alpha$  Highest Posterior Density (HPD) credible set  $C_{\alpha}$  satisfies

$$\int_{\Omega \cap C_{\alpha}} \pi(\theta \mid y) d \theta=1-\alpha, \text { and } \theta \in C_{\alpha}, \theta^{\prime} \in \Omega \backslash C_{\alpha} \Rightarrow \pi(\theta \mid y) \geq \pi\left(\theta^{\prime} \mid y\right) .$$

An HPD set (or a general credible set with fixed posterior probability mass) is qualitatively different in meaning from a frequentist confidence interval, since it is a statement about the **probability of parameters** given the observed data.


> The HPD set can be estimated from Monte Carlo samples $\theta^{(t)} \sim \pi(\cdot \mid y)$. Moreover, the HPD set is a Bayes estimator if we take the action space to be $\Delta = \left\{A \in \mathcal{B}_{\Omega}: \pi(A \mid y)=1-\alpha\right\}$ and the loss to be $L(\Theta, \delta)=\mathbb{I}_{\Theta \notin \delta}+|\delta|$ for $\delta \in \Delta$, where $|\delta|=\int_{\delta} d \theta$ is the volume of $\delta$.
>
> Every $\delta \in \Delta$ has $E_{\Theta \mid y}\left(\mathbb{I}_{\Theta \notin \delta}\right)=\alpha$, so the expected posterior loss is $\alpha+|\delta|$. It is minimised over the action space by the smallest-volume set $\delta^{*}=C_{\alpha}$, an HPD set.
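
One common way to estimate an HPD set from samples, sketched below, is to take the shortest interval that contains a fraction $1-\alpha$ of the sorted draws. This assumes a scalar parameter with a unimodal posterior, so that the HPD set is a single interval. The function name `hpd_interval` and the $\operatorname{Beta}(8,4)$ stand-in posterior are assumptions made for illustration.

```python
import numpy as np

def hpd_interval(samples, alpha=0.05):
    """Shortest interval containing a fraction 1 - alpha of the samples.

    Assumes a one-dimensional, unimodal posterior.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    T = x.size
    m = int(np.ceil((1.0 - alpha) * T))      # number of samples inside the interval
    # Width of every window of m consecutive sorted samples.
    widths = x[m - 1:] - x[: T - m + 1]
    i = int(np.argmin(widths))               # start of the narrowest window
    return float(x[i]), float(x[i + m - 1])

rng = np.random.default_rng(0)
theta = rng.beta(8.0, 4.0, size=100_000)     # assumed stand-in for posterior draws

lo, hi = hpd_interval(theta, alpha=0.05)
q_lo, q_hi = np.quantile(theta, [0.025, 0.975])   # equal-tailed interval for comparison
print((lo, hi), (q_lo, q_hi))
```

For a skewed posterior such as this one, the HPD interval is shifted towards the mode and is no wider than the equal-tailed interval with the same mass.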

Finally, for new data $y^{\prime}$ that is independent of $y$ given $\Theta$, the **posterior predictive distribution** is $p\left(y^{\prime} \mid y\right)=\int_{\Omega} p\left(y^{\prime} \mid \theta\right) \pi(\theta \mid y) d \theta$.
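
The predictive integral can be approximated by simulation: draw $\theta^{(t)}$ from the posterior, then draw $y^{\prime(t)} \sim p(\cdot \mid \theta^{(t)})$. The sketch below assumes a Bernoulli observation model and a $\operatorname{Beta}(8,4)$ posterior as a stand-in; the helper name `posterior_predictive_draws` is hypothetical.

```python
import numpy as np

def posterior_predictive_draws(theta_samples, rng):
    """One draw y' ~ p(. | theta^(t)) per posterior sample (Bernoulli model)."""
    return rng.binomial(1, theta_samples)

rng = np.random.default_rng(0)
theta = rng.beta(8.0, 4.0, size=100_000)     # assumed posterior draws

y_new = posterior_predictive_draws(theta, rng)

# Fraction of simulated y' equal to 1 estimates Pr(y' = 1 | y).
pred_prob_one = float(y_new.mean())
print(pred_prob_one)
```

For this model $\operatorname{Pr}(y^{\prime}=1 \mid y)$ equals the posterior mean of $\Theta$, which gives a simple check on the simulation.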

---



In Bayesian inference we have, for model  $m$, a parameter prior  $\Theta \sim   \pi(\theta \mid m)$ and an observation model  $Y \sim p(y \mid \theta, m)$. The parameter space may vary from model to model. 

> The "model" is the joint model for the "generative process" for  $\Theta, Y$, with joint density  $\pi(\theta \mid m) p(y \mid \theta, m)$  and state space  $\Omega_{m} \times \mathcal{Y}$. 

Conditioning on  $Y=y$  we get the posterior under model  $M=m$,

$$\pi(\theta \mid y, m)=\frac{p(y \mid \theta, m) \pi(\theta \mid m)}{p(y \mid m)}$$

with $p(y \mid m)=\int_{\Omega_{m}} p(y \mid \theta, m) \pi(\theta \mid m) d \theta$. Similarly, the posterior model probability is $\pi(m \mid y)= p(y \mid m) \pi_{M}(m) /p(y)$ where  $\pi_{M}(m)$  is our prior probability and $p(y)=\sum_{m \in \mathcal{M}} p(y \mid m) \pi_{M}(m)$.


> Under the $0-1$ loss function with truth $M$ and action $\delta \in \mathcal{M}$, the loss is $L(M, \delta)=\mathbb{I}_{M \neq \delta}$. The Bayes estimator is the maximum a posteriori (MAP) model.

To see this, note that the expected posterior loss is $E_{M \mid y}(L(M, \delta))=1-\pi(\delta \mid y)$, which is minimised by the choice $\delta=m^{*}$ with $m^{*}$ the mode, the most probable model a posteriori.

> The model-averaged posterior, which allows for model uncertainty, is $\pi(\theta \mid y)=\sum_{m \in \mathcal{M}} \pi(\theta \mid y, m) \pi(m \mid y)$.
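
As an illustration of posterior model probabilities, the MAP model and model averaging, consider two hypothetical models for $k$ successes in $n$ Bernoulli trials: `m0` fixes $\theta=0.5$, and `m1` puts a uniform prior on $\theta$. The models, the data and all function names below are assumptions chosen so that the marginal likelihoods are available in closed form.

```python
import math

def log_marginal_fixed(k, n, theta0=0.5):
    """log p(y | m0): theta is fixed at theta0, so no integration is needed."""
    return k * math.log(theta0) + (n - k) * math.log(1.0 - theta0)

def log_marginal_uniform(k, n):
    """log p(y | m1) = log B(k + 1, n - k + 1) under a uniform prior on theta."""
    return math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)

def posterior_model_probs(log_marginals, prior_probs):
    """pi(m | y) = p(y | m) pi_M(m) / p(y), computed stably on the log scale."""
    log_joint = {m: log_marginals[m] + math.log(prior_probs[m]) for m in log_marginals}
    shift = max(log_joint.values())
    weights = {m: math.exp(v - shift) for m, v in log_joint.items()}
    total = sum(weights.values())            # proportional to p(y)
    return {m: w / total for m, w in weights.items()}

k, n = 7, 10                                  # hypothetical data
log_marginals = {"m0": log_marginal_fixed(k, n), "m1": log_marginal_uniform(k, n)}
prior_probs = {"m0": 0.5, "m1": 0.5}          # assumed equal prior model probabilities

post = posterior_model_probs(log_marginals, prior_probs)
map_model = max(post, key=post.get)           # Bayes estimator under 0-1 loss

# Model-averaged posterior mean: sum_m E[theta | y, m] * pi(m | y).
mean_given_model = {"m0": 0.5, "m1": (k + 1) / (n + 2)}
averaged_mean = sum(mean_given_model[m] * post[m] for m in post)
print(post, map_model, averaged_mean)
```

With these data the two marginal likelihoods are $1/1024$ and $1/1320$, so the simpler model `m0` is slightly favoured even though the observed success rate is $0.7$.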

### 2. MCMC

MCMC is a family of algorithms for simulating a Markov chain $X_{0}, X_{1}, X_{2}, \ldots$ with $X_{t} \xrightarrow{D} p$ as $t \rightarrow \infty$, for a user-defined target probability distribution $p$.
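
One member of this family is random-walk Metropolis, sketched below. The notes do not single out a particular algorithm, so the choice of method, the target (an unnormalised $\operatorname{Beta}(8,4)$ density), the step size, the burn-in length and the function names are all assumptions made for illustration.

```python
import numpy as np

def random_walk_metropolis(log_target, x0, n_steps, step_size, rng):
    """Simulate X_0, X_1, ... whose distribution converges to the target p.

    log_target returns log p(x) up to an additive constant.
    """
    x = float(x0)
    log_p = log_target(x)
    chain = np.empty(n_steps)
    for t in range(n_steps):
        proposal = x + step_size * rng.normal()      # symmetric Gaussian proposal
        log_p_prop = log_target(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if np.log(rng.uniform()) < log_p_prop - log_p:
            x, log_p = proposal, log_p_prop
        chain[t] = x                                  # on rejection the chain stays put
    return chain

def log_beta_8_4(theta):
    """Unnormalised log density of Beta(8, 4); -inf outside (0, 1)."""
    if theta <= 0.0 or theta >= 1.0:
        return -np.inf
    return 7.0 * np.log(theta) + 3.0 * np.log1p(-theta)

rng = np.random.default_rng(0)
chain = random_walk_metropolis(log_beta_8_4, x0=0.5, n_steps=20_000, step_size=0.3, rng=rng)

# Discard an initial burn-in segment (length chosen arbitrarily here).
posterior_mean = float(chain[5_000:].mean())
print(posterior_mean)
```

Only ratios of the target density enter the acceptance step, which is why the normalising constant $p(y)$ of a posterior is not needed.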

### 3. Coherent

### 4. Model Averaging

---

### Reference

1. Shuxiao Chen. Minimax Rates and Adaptivity in Combining Experimental and Observational Data.
2. Qingyuan Zhao. Lecture Notes on Causal Inference.
3. Joaquin Quiñonero-Candela. Dataset Shift In Machine Learning.
4. Geoff K. Nicholls. Bayes Methods.
5. Patrick J. Laub. Hawkes Processes.
6. Tomas Björk. An Introduction to Point Processes from a Martingale Point of View.