## [Feb 23] Causal Inference and Transfer Learning III

Presenter: Yuchen Ge  
Affiliation: University of Oxford  
Contact Email: gycdwwd@gmail.com  
Website: https://yuchenge-am.github.io

---

### 1.  Basic Aspects of Bayesian Method 

In a Bayesian setting, we have some unknown real-world quantity $\Theta$ that takes values in a parameter space $\Omega$. Typically $\Omega \subset \mathbb{R}^{p}$.

> Let $S \subseteq \Omega$. Is $\Theta$ in $S$?

If knowledge of the world is coherent (as is shown below), then $\exists$ a unique **prior distribution** with density  $\pi(\theta)$  on  $\Omega$ and 

> $$\int_{S} \pi(\theta) d \theta=\operatorname{Pr}(\Theta \in S). $$

Observations  $Y=\left(Y_{1}, \ldots, Y_{n}\right)$, $Y \in \mathcal{Y}$  are distributed according to an observation model with probability density  $p(y \mid \theta)$  for realisation  $Y=y$  with  $y=\left(y_{1}, \ldots, y_{n}\right)$  given  $\Theta=\theta$. If the observations are iid then we have an **observation model** with probability density

$$p(y \mid \theta)=\prod_{i=1}^{n} p\left(y_{i} \mid \theta\right).$$

After observing $Y=y$, our beliefs about $\Theta \in S$ are updated through the **posterior density** $\pi(\theta \mid y)$ of the posterior distribution:

$$\pi(S \mid y)=\operatorname{Pr}(\Theta \in S \mid Y=y) = \int_S \pi(\theta \mid y) d \theta = \int_S \frac{p(y \mid \theta) \pi(\theta)}{p(y)} d \theta$$

where $p(y)=\int_{\Omega} p(y \mid \theta) \pi(\theta) d \theta$ is the normalising marginal likelihood (also called the **prior predictive distribution**). Questions about $\Theta$ can therefore be answered in terms of $\pi(\theta \mid y)$ via $\operatorname{Pr}(\Theta \in S \mid Y=y)=\int_{S} \pi(\theta \mid y) d \theta$.
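
As a concrete illustration of the prior-to-posterior update, here is a minimal sketch. It assumes a Bernoulli observation model with a uniform prior on $\Omega=(0,1)$ and approximates the integrals with a midpoint rule on a grid. The data (7 successes in 10 trials), the grid size and the function names are illustrative assumptions, not part of the notes.

```python
def grid_posterior(k, n, num_points=2000):
    """Approximate pi(theta | y) on a grid for k successes in n Bernoulli trials.

    Assumes a uniform prior pi(theta) = 1 on (0, 1).
    """
    width = 1.0 / num_points
    thetas = [(i + 0.5) * width for i in range(num_points)]  # cell midpoints
    # Unnormalised posterior p(y | theta) * pi(theta) at each grid point.
    unnorm = [t ** k * (1.0 - t) ** (n - k) * 1.0 for t in thetas]
    p_y = sum(unnorm) * width  # marginal likelihood p(y)
    post = [u / p_y for u in unnorm]
    return thetas, post, p_y, width


def posterior_prob(thetas, post, width, lo, hi):
    """Approximate Pr(Theta in [lo, hi] | Y = y) by summing over grid cells."""
    return sum(d for t, d in zip(thetas, post) if lo <= t <= hi) * width


thetas, post, p_y, width = grid_posterior(k=7, n=10)
prob_S = posterior_prob(thetas, post, width, 0.5, 1.0)  # S = [0.5, 1]
print(f"p(y) ~ {p_y:.6f}, Pr(Theta in S | y) ~ {prob_S:.4f}")
```

In this conjugate setting the posterior is Beta(8, 4) and $p(y)=1/1320$ for a given ordered sequence, so the grid values can be checked against closed forms.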


---

Given an observation model $Y \sim p(\cdot \mid \theta)$ and a loss function $L$, an estimator $\delta: \mathcal{Y} \rightarrow \mathbb{R}^{p}$ of $\Theta$ has risk $\mathcal{R}(\theta, \delta)$ at $\Theta=\theta$ given by

$$\begin{aligned}
\mathcal{R}(\theta, \delta) & =E_{Y \mid \Theta=\theta}(L(\theta, \delta(Y))) \\
& =\int_{\mathcal{Y}} L(\theta, \delta(y)) p(y \mid \theta) d y .
\end{aligned}$$

>  The Bayes risk $\rho(\pi, \delta)$ is the risk averaged over the prior,
>
> $$\begin{aligned}
\rho(\pi, \delta) & =E_{\Theta}(\mathcal{R}(\Theta, \delta)) =E_{\Theta, Y}(L(\Theta, \delta(Y))) \\
& =\int_{\Omega} \int_{\mathcal{Y}} L(\theta, \delta(y)) p(y \mid \theta) \pi(\theta) d y d \theta .
\end{aligned}$$
> 
> where we have a prior  $\pi(\theta)$, posterior  $\pi(\theta \mid y)$  and marginal likelihood  $p(y)$.
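
The double integral defining the Bayes risk can be approximated by simulating pairs $(\theta, y)$ from the joint model. The sketch below assumes a toy model that is not in the notes, $\Theta \sim N(0,1)$ and $Y \mid \Theta=\theta \sim N(\theta, 1)$, with squared error loss; the two estimators compared are hypothetical choices for illustration.

```python
import random


def bayes_risk_mc(delta, T=50000, seed=0):
    """Monte Carlo estimate of rho(pi, delta) under squared error loss.

    Assumed toy model: Theta ~ N(0, 1) and Y | Theta = theta ~ N(theta, 1).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(T):
        theta = rng.gauss(0.0, 1.0)   # draw from the prior
        y = rng.gauss(theta, 1.0)     # draw from the observation model
        total += (theta - delta(y)) ** 2
    return total / T


risk_raw = bayes_risk_mc(lambda y: y)         # delta(y) = y
risk_shrunk = bayes_risk_mc(lambda y: y / 2)  # delta(y) = y / 2
print(risk_raw, risk_shrunk)
```

For this model the exact Bayes risks are $1$ and $1/2$ respectively; $\delta(y)=y/2$ is the posterior mean here, which is why it does better.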


A Bayes estimator $\delta^{\pi}$ for $\theta$ minimises the Bayes risk: $\delta^{\pi}=\arg \min _{\delta} \rho(\pi, \delta)$.

Minimising directly is not straightforward because $\delta$ is a function, so we consider the **expected posterior loss**
$$\rho(\pi, \delta \mid y) = E_{\Theta \mid Y=y}(L(\Theta, \delta(y))) =\int_{\Omega} L(\theta, \delta(y)) \pi(\theta \mid y) d \theta \implies \rho(\pi, \delta)=\int_{\mathcal{Y}} \rho(\pi, \delta \mid y) p(y) d y.$$



> $\delta^{\pi}(y)=\arg \min _{\delta} \rho(\pi, \delta \mid y)$ minimizes the Bayes risk.

For example, let $L(\Theta, \delta(Y))=(\Theta-\delta(Y))^{2}$. In this case $E_{\Theta \mid y}(L(\Theta, \delta))$ is minimised over actions by the posterior mean $\delta^{*}=E_{\Theta \mid y}(\Theta)$ (just differentiate w.r.t. $\delta(y)$ at fixed $y$).
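
The differentiation step can be written out as follows. This is a short worked derivation for scalar $\Theta$ with finite posterior variance; $\delta$ denotes the value $\delta(y)$ at the fixed $y$.

$$\begin{aligned}
\frac{\partial}{\partial \delta} E_{\Theta \mid y}\left((\Theta-\delta)^{2}\right) & =-2 E_{\Theta \mid y}(\Theta)+2 \delta=0 \quad \Longrightarrow \quad \delta=E_{\Theta \mid y}(\Theta), \\
E_{\Theta \mid y}\left((\Theta-\delta)^{2}\right) & =\operatorname{Var}(\Theta \mid y)+\left(E_{\Theta \mid y}(\Theta)-\delta\right)^{2} .
\end{aligned}$$

The second line shows that the stationary point is the global minimum, and that the minimal expected posterior loss equals the posterior variance.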

---

Suppose we wish to estimate a function $f(\Theta)$ of the parameter. If the loss is the squared error then we estimate $f(\Theta)$ with the posterior mean $E_{\Theta \mid Y=y}(f(\Theta))$. To approximate it, simulate $\theta^{(t)} \sim \pi(\cdot \mid y)$, $t=1, \ldots, T$, and compute
$$\hat{f}=\frac{1}{T} \sum_{t=1}^{T} f\left(\theta^{(t)}\right) .$$

> For example, if  $S \in \mathcal{B}_{\Omega}$  and  $f(\theta)=\mathbb{I}_{\theta \in S}$  then  $\hat{f}$  estimates  $\pi(S \mid y)$.
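
The estimator $\hat{f}$ can be sketched in a few lines. The example assumes, purely for illustration, that the posterior is Beta(8, 4) so that exact draws are available from the standard library; `monte_carlo_estimate` and `draw` are hypothetical names.

```python
import random


def monte_carlo_estimate(f, draw_posterior, T):
    """Return (1/T) * sum_t f(theta_t) with theta_t drawn by draw_posterior()."""
    return sum(f(draw_posterior()) for _ in range(T)) / T


rng = random.Random(0)
# Assumed posterior for illustration: Beta(8, 4).
draw = lambda: rng.betavariate(8, 4)

post_mean_hat = monte_carlo_estimate(lambda th: th, draw, 20000)
# Indicator of S = (0.5, 1]; its posterior mean is pi(S | y).
prob_S_hat = monte_carlo_estimate(lambda th: 1.0 if th > 0.5 else 0.0, draw, 20000)
print(post_mean_hat, prob_S_hat)
```

When exact posterior draws are not available, the same average is applied to the output of an MCMC sampler.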

We also commonly report posterior credible sets in order to quantify uncertainty. A level-$\alpha$  Highest Posterior Density (HPD) credible set  $C_{\alpha}$  satisfies

$$\int_{\Omega \cap C_{\alpha}} \pi(\theta \mid y) d \theta=1-\alpha, \text { and } \theta \in C_{\alpha}, \theta^{\prime} \in \Omega \backslash C_{\alpha} \Rightarrow \pi(\theta \mid y) \geq \pi\left(\theta^{\prime} \mid y\right) .$$

An HPD set (or a general credible set with fixed posterior probability mass) is qualitatively different in meaning from a frequentist confidence interval, since it is a statement about **the probability of parameters**.


> The HPD set can be estimated from Monte Carlo samples $\theta^{(t)} \sim \pi(\cdot \mid y)$. Moreover, the HPD set is a Bayes estimator if we take the action space to be $\Delta = \left\{A \in \mathcal{B}_{\Omega}: \pi(A \mid y)=1-\alpha\right\}$, with actions $\delta \in \Delta$, and the loss $L(\Theta, \delta)=\mathbb{I}_{\Theta \notin \delta}+|\delta|$ where $|\delta|=\int_{\delta} d \theta$.
>
> The expected posterior loss is minimised over the action space by  $\delta^{*}=C_{\alpha}$,  an HPD set.
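
One common way to estimate an HPD set from samples, sketched here under the assumption that $\theta$ is scalar and the posterior is unimodal (so the HPD set is an interval), is to take the shortest window of sorted draws that contains a fraction $1-\alpha$ of them. The Beta(8, 4) posterior and the function name `hpd_interval` are illustrative assumptions.

```python
import math
import random


def hpd_interval(samples, alpha):
    """Shortest interval containing at least a fraction 1 - alpha of the samples."""
    xs = sorted(samples)
    n = len(xs)
    m = min(n, math.ceil((1.0 - alpha) * n))  # number of samples inside the interval
    # Slide a window of m consecutive order statistics and keep the narrowest.
    start = min(range(n - m + 1), key=lambda i: xs[i + m - 1] - xs[i])
    return xs[start], xs[start + m - 1]


rng = random.Random(1)
samples = [rng.betavariate(8, 4) for _ in range(10000)]  # assumed posterior draws
lo, hi = hpd_interval(samples, alpha=0.05)
print(lo, hi)
```

For a multimodal posterior the HPD set may be a union of intervals, and this shortest-interval sketch would no longer be appropriate.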

Finally, the **posterior predictive distribution** of new data $y^{\prime}$ is $p\left(y^{\prime} \mid y\right)=\int_{\Omega} p\left(y^{\prime} \mid \theta\right) \pi(\theta \mid y) d \theta$.
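
The integral defining the posterior predictive suggests a two-stage simulation: draw $\theta$ from the posterior, then draw new data from the observation model given that $\theta$. A minimal sketch, assuming a Beta(8, 4) posterior and Bernoulli observations (both illustrative choices):

```python
import random


def posterior_predictive_draw(rng, n_new=5, a=8.0, b=4.0):
    """Draw the number of successes in n_new future Bernoulli trials.

    Step 1 draws theta from the (assumed) Beta(a, b) posterior;
    step 2 draws new data from the observation model given that theta.
    """
    theta = rng.betavariate(a, b)
    return sum(1 for _ in range(n_new) if rng.random() < theta)


rng = random.Random(2)
draws = [posterior_predictive_draw(rng) for _ in range(20000)]
pred_mean = sum(draws) / len(draws)
print(pred_mean)
```

The simulated mean should be close to $5 \cdot E(\Theta \mid y)=10/3$ under these assumptions.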

---



For model  $m$, we have a parameter prior  $\Theta \sim   \pi(\theta \mid m)$ and an observation model  $Y \sim p(y \mid \theta, m)$. The parameter space may vary from model to model. 

> The "model" is the joint model for the "generative process" for  $\Theta, Y$, with joint density  $\pi(\theta \mid m) p(y \mid \theta, m)$  and state space  $\Omega_{m} \times \mathcal{Y}$. 

Conditioning on  $Y=y$  we get the posterior under model  $M=m$,

$$\pi(\theta \mid y, m)=\frac{p(y \mid \theta, m) \pi(\theta \mid m)}{p(y \mid m)}$$

with $p(y \mid m)=\int_{\Omega_{m}} p(y \mid \theta, m) \pi(\theta \mid m) d \theta$. Similarly, the posterior model probability is $\pi(m \mid y)= p(y \mid m) \pi_{M}(m) /p(y)$ where  $\pi_{M}(m)$  is our prior probability and $p(y)=\sum_{m \in \mathcal{M}} p(y \mid m) \pi_{M}(m)$.


> Under the  $0-1$  loss function with truth  $M$  and action  $\delta \in \mathcal{M}$  the loss is  $L(M, \delta)=\mathbb{I}_{M \neq \delta}$.  The Bayes estimator is the
maximum a posteriori (MAP) model.

To see this, note that the expected posterior loss is $E_{M \mid y}(L(M, \delta))=1-\pi(\delta \mid y)$, which is minimised by the choice $\delta=m^{*}$, the most probable model a posteriori.

> The model-averaged posterior, which allows for model uncertainty, is $\pi(\theta \mid y)=\sum_{m \in \mathcal{M}} \pi(\theta \mid y, m) \pi(m \mid y)$.
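
The posterior model probabilities and the MAP model can be computed directly when the marginal likelihoods are available in closed form. The sketch below assumes two hypothetical models for a sequence of Bernoulli trials: `"fixed"` sets $\theta=1/2$, and `"uniform"` places a uniform prior on $\theta$, with equal prior model probabilities.

```python
import math


def log_marginal_fixed(k, n, theta0=0.5):
    """log p(y | m) when model m fixes theta = theta0 (no free parameter)."""
    return k * math.log(theta0) + (n - k) * math.log(1.0 - theta0)


def log_marginal_uniform(k, n):
    """log p(y | m) for a uniform prior on theta: log B(k + 1, n - k + 1)."""
    return math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)


def posterior_model_probs(log_marginals, prior_probs):
    """pi(m | y) = p(y | m) pi_M(m) / p(y), computed stably in log space."""
    log_joint = {m: log_marginals[m] + math.log(prior_probs[m]) for m in log_marginals}
    shift = max(log_joint.values())
    weights = {m: math.exp(v - shift) for m, v in log_joint.items()}
    p_y = sum(weights.values())  # p(y) up to the common factor exp(shift)
    return {m: w / p_y for m, w in weights.items()}


k, n = 7, 10  # illustrative data: 7 successes in 10 trials
log_marginals = {"fixed": log_marginal_fixed(k, n), "uniform": log_marginal_uniform(k, n)}
probs = posterior_model_probs(log_marginals, {"fixed": 0.5, "uniform": 0.5})
map_model = max(probs, key=probs.get)  # Bayes estimator under 0-1 loss
print(probs, map_model)
```

Working in log space avoids underflow when $n$ is large and the marginal likelihoods are tiny.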

---

### 2. MCMC

MCMC is a family of algorithms for simulating a Markov chain $X_{t}$ whose distribution converges to a target $p$, i.e. $X_{t} \xrightarrow{D} p$.

Write $P_{i, j}=\mathbb{P}\left(X_{t+1}=j \mid X_{t}=i\right)$ for a homogeneous Markov chain on $\Omega$. 

> A distribution $p=(p_{i})_{i \in \Omega}$ satisfies **detailed balance** w.r.t. $P$ if $p_{i} P_{i, j}=p_{j} P_{j, i}$ holds for all $i, j \in \Omega$.

As a consequence, $p$ is stationary for $P$: summing over $i$ gives $\sum_{i} p_{i} P_{i, j}=p_{j} \sum_{i} P_{j, i}=p_{j}$.
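
Both properties are easy to check numerically on a small chain. The sketch below uses a random walk on a weighted graph, a standard example of a reversible chain; the weight matrix `w` is an arbitrary illustrative input, not taken from the notes.

```python
def random_walk_chain(w):
    """Transition matrix and candidate stationary law for a weighted-graph walk.

    w is a symmetric matrix of non-negative edge weights (an assumed toy input).
    """
    degrees = [sum(row) for row in w]
    total = sum(degrees)
    P = [[w_ij / d for w_ij in row] for row, d in zip(w, degrees)]
    p = [d / total for d in degrees]
    return P, p


w = [[0.0, 2.0, 1.0],
     [2.0, 0.0, 3.0],
     [1.0, 3.0, 1.0]]
P, p = random_walk_chain(w)
n = len(p)

# Detailed balance: p_i P_ij == p_j P_ji for all i, j.
balance_gap = max(abs(p[i] * P[i][j] - p[j] * P[j][i]) for i in range(n) for j in range(n))
# Stationarity: (p P)_j == p_j for all j.
pP = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
stationarity_gap = max(abs(pP[j] - p[j]) for j in range(n))
print(balance_gap, stationarity_gap)
```

Here $p_{i} P_{i, j}=w_{i j} / \sum_{k, l} w_{k l}$, which is symmetric in $(i, j)$, so both gaps vanish up to rounding.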

> Suppose $\left\{X_{t}\right\}_{t=0}^{\infty}$ is irreducible and aperiodic on $\Omega$ satisfying detailed balance w.r.t. $p$. Then 
>
> $$ \hat{f}_{T} = T^{-1} \sum_{t=1}^{T} f\left(X_{t}\right) \xrightarrow{\text { a.s. }} E_{X \sim p}(f(X)) \quad \text { as } T \rightarrow \infty. $$

This motivates the Metropolis-Hastings algorithm, which constructs such a chain for a given target $p$.

> Let $Q$ be a proposal distribution with probabilities $q(j \mid i)$ s.t. $q(j \mid i)>0 \Leftrightarrow q(i \mid j)>0$.
>
> Let  $X_{t}=i$. The next state  $X_{t+1}$  is realised below
> 1. Draw  $j \sim q(\cdot \mid i)$  and  $u \sim U[0,1]$.
> 2. If  $u \leq \alpha(j \mid i)$  where $\alpha(j \mid i)=\min \left\{1, \frac{p_{j} q(i \mid j)}{p_{i} q(j \mid i)}\right\}$, then set  $X_{t+1}=j$ , otherwise set  $X_{t+1}=i$.
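
The two steps above can be sketched for a finite state space as follows. The target `p`, the proposal matrix `q` (with `q[i][j]` standing for $q(j \mid i)$) and the run length are illustrative assumptions; the proposal is deliberately asymmetric so that the Hastings ratio matters.

```python
import random


def metropolis_hastings(p, q, x0, T, rng):
    """Simulate T steps of Metropolis-Hastings on states 0..len(p)-1.

    p[i] is the (possibly unnormalised) target; q[i][j] stands for q(j | i).
    """
    states = list(range(len(p)))
    x, chain = x0, []
    for _ in range(T):
        j = rng.choices(states, weights=q[x])[0]   # step 1: draw j ~ q(. | x)
        u = rng.random()                           #         and u ~ U[0, 1]
        alpha = min(1.0, (p[j] * q[j][x]) / (p[x] * q[x][j]))
        if u <= alpha:                             # step 2: accept, otherwise stay
            x = j
        chain.append(x)
    return chain


p = [2.0, 3.0, 5.0]                 # unnormalised target, i.e. (0.2, 0.3, 0.5)
q = [[0.0, 0.7, 0.3],
     [0.5, 0.0, 0.5],
     [0.2, 0.8, 0.0]]
chain = metropolis_hastings(p, q, x0=0, T=60000, rng=random.Random(3))
freqs = [chain.count(s) / len(chain) for s in range(len(p))]
print(freqs)
```

Only ratios of `p` enter the acceptance probability, so the target needs to be known only up to a normalising constant.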

For $i \neq j$ we have $P_{i, j}=P\left(X_{t+1}=j \mid X_{t}=i\right)=q(j \mid i) \alpha(j \mid i)$. Then

$$\begin{aligned}
p_{i} P_{i, j} & =p_{i} q(j \mid i) \alpha(j \mid i) = p_{i} q(j \mid i) \min \left\{1, \frac{p_{j} q(i \mid j)}{p_{i} q(j \mid i)}\right\} = \min \left\{p_{i} q(j \mid i), p_{j} q(i \mid j)\right\} \\
& =p_{j} q(i \mid j) \min \left\{\frac{p_{i} q(j \mid i)}{p_{j} q(i \mid j)}, 1\right\} =p_{j} q(i \mid j) \alpha(i \mid j) =p_{j} P_{j, i}
\end{aligned}$$

so detailed balance holds. In the continuous case we have the following.

> The transition kernel for Metropolis-Hastings MCMC is
>
> $$K\left(\theta, d \theta^{\prime}\right)=\alpha\left(\theta^{\prime} \mid \theta\right) q\left(d \theta^{\prime} \mid \theta\right)+c(\theta) \delta_{\theta}\left(d \theta^{\prime}\right),$$
>
> where $\alpha\left(\theta^{\prime} \mid \theta\right)=\min \left\{1, \frac{p\left(\theta^{\prime}\right) q\left(\theta \mid \theta^{\prime}\right)}{p\left(\theta\right) q\left(\theta^{\prime} \mid \theta\right)}\right\}$, $c(\theta)=1-\int_{\Omega} \alpha\left(\theta^{\prime} \mid \theta\right) q\left(d \theta^{\prime} \mid \theta\right)$ is the probability of rejecting the proposal, and $\delta_{\theta}$ is the point mass at $\theta$, i.e. $\int_{A} \delta_{\theta}\left(d \theta^{\prime}\right)=\mathbb{I}_{\theta \in A}$.
>
> Detailed balance is satisfied when $p\left(d \theta^{\prime}\right) K\left(\theta^{\prime}, d \theta\right)=p(d \theta) K\left(\theta, d \theta^{\prime}\right)$.
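
A minimal sketch of this kernel for a scalar parameter, assuming a Gaussian random-walk proposal (symmetric, so $q$ cancels in $\alpha$) and a standard normal target given through its unnormalised log density. The step size 2.4 and the function names are illustrative choices.

```python
import math
import random


def random_walk_metropolis(log_p, theta0, step, T, rng):
    """Random-walk Metropolis with proposal theta' = theta + N(0, step^2)."""
    theta, log_p_theta = theta0, log_p(theta0)
    chain = []
    for _ in range(T):
        proposal = theta + rng.gauss(0.0, step)
        log_p_prop = log_p(proposal)
        # alpha = min(1, p(theta') / p(theta)), computed on the log scale.
        alpha = math.exp(min(0.0, log_p_prop - log_p_theta))
        if rng.random() < alpha:
            theta, log_p_theta = proposal, log_p_prop   # accept the move
        chain.append(theta)                             # on rejection, theta repeats
    return chain


rng = random.Random(4)
chain = random_walk_metropolis(lambda th: -0.5 * th * th, 0.0, 2.4, 50000, rng)
mean_hat = sum(chain) / len(chain)
var_hat = sum((x - mean_hat) ** 2 for x in chain) / len(chain)
print(mean_hat, var_hat)
```

Repeating the current state on rejection corresponds to the $c(\theta) \delta_{\theta}\left(d \theta^{\prime}\right)$ term of the kernel.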

For $\theta=\left(\theta_{1}, \ldots, \theta_{p}\right)$ with more than one component, choose a kernel at random from $K_{1}, \ldots, K_{N}$, picking $K_{k}$ with probability $\xi_{k}$ (with $\sum_{k} \xi_{k}=1$), where

$$K_{k}\left(\theta, d \theta^{\prime}\right)=\alpha_{k}\left(\theta^{\prime} \mid \theta\right) q_{k}\left(d \theta^{\prime} \mid \theta\right)+c_{k}(\theta) \delta_{\theta}\left(d \theta^{\prime}\right)$$

The overall kernel is $K\left(\theta, d \theta^{\prime}\right)=\sum_{k=1}^{N} \xi_{k} K_{k}\left(\theta, d \theta^{\prime}\right)$.


> The Gibbs sampler is a special case of the multi-dimensional scheme in which the $i$-th kernel proposes $\theta_{i}^{\prime} \sim \pi\left(\cdot \mid \theta_{-i}\right)$ and keeps $\theta_{-i}^{\prime}=\theta_{-i}$; such proposals are always accepted.
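
A minimal random-scan Gibbs sketch, assuming the target is a standard bivariate normal with correlation $\rho$ so that each full conditional is $N\left(\rho \theta_{-i}, 1-\rho^{2}\right)$. The target and the value $\rho=0.8$ are illustrative assumptions.

```python
import math
import random


def gibbs_bivariate_normal(rho, T, rng):
    """Random-scan Gibbs sampler for a standard bivariate normal with correlation rho."""
    theta = [0.0, 0.0]
    cond_sd = math.sqrt(1.0 - rho * rho)
    chain = []
    for _ in range(T):
        i = rng.randrange(2)                                # pick a kernel at random
        theta[i] = rng.gauss(rho * theta[1 - i], cond_sd)   # theta_i ~ pi(. | theta_{-i})
        chain.append((theta[0], theta[1]))                  # theta_{-i} is left unchanged
    return chain


rng = random.Random(5)
chain = gibbs_bivariate_normal(rho=0.8, T=40000, rng=rng)
m0 = sum(a for a, _ in chain) / len(chain)
m1 = sum(b for _, b in chain) / len(chain)
cov = sum((a - m0) * (b - m1) for a, b in chain) / len(chain)
v0 = sum((a - m0) ** 2 for a, _ in chain) / len(chain)
v1 = sum((b - m1) ** 2 for _, b in chain) / len(chain)
corr_hat = cov / math.sqrt(v0 * v1)
print(corr_hat)
```

There is no accept/reject step because every draw from a full conditional is accepted.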

Important applications of the Gibbs sampler arise for missing data. Suppose the observation process is

$$z \sim p(z \mid \theta), \quad y \sim p(y \mid z, \theta)$$

The posterior $\pi(\theta \mid y)$ is awkward to work with because it involves an integral over the missing data: $\pi(\theta \mid y) \propto \pi(\theta) \int p(y \mid z, \theta) p(z \mid \theta) d z$.

> In data augmentation we work with $p(\theta, z \mid y)$, thinking of missing data as another parameter: the posterior is simply
>
> $$p(\theta, z \mid y) \propto p(y \mid z, \theta) p(z \mid \theta) \pi(\theta) .$$
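
A sketch of data augmentation on a toy model chosen so that both full conditionals are Gaussian: $\theta \sim N\left(0, \tau^{2}\right)$, $z_{i} \mid \theta \sim N(\theta, 1)$ and $y_{i} \mid z_{i} \sim N\left(z_{i}, 1\right)$. The model, the synthetic data and the names are assumptions made for illustration; here the integral over $z$ happens to be tractable, which gives a closed form to compare against.

```python
import math
import random


def data_augmentation_gibbs(y, T, rng, tau=10.0):
    """Gibbs sampler on (theta, z) for an assumed toy model:

    theta ~ N(0, tau^2),  z_i | theta ~ N(theta, 1),  y_i | z_i ~ N(z_i, 1).
    """
    n = len(y)
    theta = 0.0
    prec = n + 1.0 / tau ** 2          # precision of theta | z
    thetas = []
    for _ in range(T):
        # Missing data: z_i | theta, y_i ~ N((theta + y_i) / 2, 1/2).
        z = [rng.gauss(0.5 * (theta + yi), math.sqrt(0.5)) for yi in y]
        # Parameter: theta | z ~ N(sum(z) / prec, 1 / prec).
        theta = rng.gauss(sum(z) / prec, math.sqrt(1.0 / prec))
        thetas.append(theta)
    return thetas


rng = random.Random(6)
y = [rng.gauss(1.5, math.sqrt(2.0)) for _ in range(20)]   # synthetic data
thetas = data_augmentation_gibbs(y, T=20000, rng=rng)
gibbs_mean = sum(thetas) / len(thetas)
# Closed form for comparison: y_i | theta ~ N(theta, 2) after integrating out z.
exact_mean = (sum(y) / 2.0) / (len(y) / 2.0 + 1.0 / 10.0 ** 2)
print(gibbs_mean, exact_mean)
```

Discarding the sampled $z$ and keeping only the $\theta$ draws gives samples from the marginal posterior $\pi(\theta \mid y)$.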

---

### 3. Exchangeability

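> Random variables $X_{1}, \ldots, X_{n}$ are exchangeable if their joint distribution is invariant under permutations of the indices. An infinite exchangeable sequence $\left\{X_{i}\right\}_{i=1}^{\infty}$ is an infinite sequence s.t. $X_{1}, X_{2}, \ldots, X_{n}$ are exchangeable for every $n \geq 1$.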

If the data form an infinite exchangeable sequence, then $\exists$ a generative model

$$\begin{aligned}
\Theta & \sim F \\
X_{i} \mid \Theta=\theta & \sim p(\cdot \mid \theta), \quad \text { iid for } i=1, \ldots, n
\end{aligned}$$

for the data. These distributions all exist by **de Finetti**'s theorem. If $x=\left(x_{1}, \ldots, x_{n}\right)$, the likelihood is

$$p(x \mid \theta)=\prod_{i} p\left(x_{i} \mid \theta\right) .$$


Here $F$ is "nature's prior", which may not coincide with our own prior $\pi$. However, de Finetti gives the form for the prior predictive distribution of the data

$$p_{1: n}\left(x_{1}, \ldots, x_{n}\right)=\int p\left(x_{1}, \ldots, x_{n} \mid \theta\right) d F(\theta) .$$

This is of course the marginal likelihood. Suppose we have some priors  $\pi(\theta \mid m), m \in \mathcal{M}$  and we carry out model selection using the marginal likelihood

$$p(x \mid M=m)=\int p(x \mid \theta) \pi(\theta \mid M=m) d \theta .$$

When we choose a model $M=m$, we are forming an estimate $\pi(\theta \mid M=m)$ of nature's prior $F$.

We also get an expression for the posterior in terms of $F$. Suppose we are given $x_{1: m}=\left(x_{1}, \ldots, x_{m}\right)$ and wish to predict $x_{m+1: n}=\left(x_{m+1}, \ldots, x_{n}\right)$ (here $m$ denotes the number of observed data points, not a model index). The posterior predictive distribution is

$$\begin{aligned}
p\left(x_{m+1: n} \mid x_{1: m}\right) & =p\left(x_{1: n}\right) / p\left(x_{1: m}\right) \\
& =\int p\left(x_{m+1: n} \mid \theta\right) \frac{p\left(x_{1: m} \mid \theta\right) d F(\theta)}{p\left(x_{1: m}\right)} \\
& =\int p\left(x_{m+1: n} \mid \theta\right) d \tilde{F}(\theta) .
\end{aligned}$$


After observing $X_{1}=x_{1}, \ldots, X_{m}=x_{m}$, $\exists$ a generative model

$$\begin{aligned}
\Theta \mid X_{1: m} & \sim \tilde{F}(\theta) \\
X_{i} \mid \Theta=\theta & \sim p(\cdot \mid \theta), \quad \text { iid for } i=m+1, \ldots, n .
\end{aligned}$$


Here $d \tilde{F}(\theta) \propto p\left(x_{1}, \ldots, x_{m} \mid \theta\right) d F(\theta)$ is the updated true generative model for the parameter  $\Theta \mid X_{1: m}$, or in other words, the posterior. **de Finetti** tells us that Bayesian inference is possible in this exchangeable setting.
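
These identities can be checked numerically in a case where $F$ is known. The sketch below assumes binary data with $p(\cdot \mid \theta)$ Bernoulli and $F=\operatorname{Beta}(2,3)$ (illustrative choices), computes the marginal probability of a sequence from the integral, rebuilds it from one-step posterior predictive factors, and confirms that it is the same for every reordering of the data.

```python
import math
from itertools import permutations


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def marginal_prob(x, a=2.0, b=3.0):
    """p_{1:n}(x) = integral of prod_i p(x_i | theta) dF(theta), F = Beta(a, b)."""
    k, n = sum(x), len(x)
    return math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))


def sequential_prob(x, a=2.0, b=3.0):
    """Same probability built from one-step posterior predictive factors."""
    prob = 1.0
    for xi in x:
        p_one = a / (a + b)                 # Pr(next = 1 | past) under the updated F
        prob *= p_one if xi == 1 else 1.0 - p_one
        a, b = a + xi, b + (1 - xi)         # update F to the posterior F-tilde
    return prob


x = (1, 1, 0, 1, 0)
probs = {perm: sequential_prob(perm) for perm in set(permutations(x))}
print(marginal_prob(x), set(round(v, 12) for v in probs.values()))
```

The probability depends on the data only through the number of ones, which is exactly the permutation invariance required by exchangeability.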

---

### 4. Model Averaging

> The model-averaged posterior  is $\pi(\theta \mid y)=\sum_{m \in \mathcal{M}} \pi(\theta, m \mid y), \theta \in \Omega$ where $\Omega=\bigcup_{m \in \mathcal{M}} \Omega_{m}$.


We claim that model averaging is preferable to inference after model selection. Consider estimating $h(\theta)$. We compare the model-averaged posterior mean with the single-model posterior mean:

$$E_{\theta, m \mid y}(h(\theta))=\sum_{m \in \mathcal{M}} \int_{\Omega_{m}} h(\theta) \pi(\theta, m \mid y) d \theta, \quad E_{\theta \mid y, m^{*}}(h(\theta))=\int_{\Omega_{m^{*}}} h(\theta) \pi\left(\theta \mid y, m^{*}\right) d \theta.$$


If the loss for the estimate $\delta$ when the truth is $h$ is $(h-\delta)^{2}$, then the Bayes risk $\rho(\pi, \delta(y))$ allowing for model and parameter uncertainty is minimised by $E_{\theta, m \mid y}(h(\theta))$, and $\rho\left(\pi, E_{\theta \mid y, m}(h)\right) \geq \rho\left(\pi, E_{\theta, m \mid y}(h)\right)$ for every $m \in \mathcal{M}$.

**Proof.** Recall that the Bayes risk is minimised by the estimator minimising the expected posterior loss $\rho(\pi, \delta \mid y)$ at every $y \in \mathcal{Y}$. For squared error loss this is

$$\rho(\pi, \delta \mid y)=\sum_{m \in \mathcal{M}} \int_{\Omega_{m}}(\delta-h(\theta))^{2} \pi(\theta, m \mid y) d \theta$$


Differentiation gives 

$$\frac{\partial \rho}{\partial \delta}=\sum_{m \in \mathcal{M}} \int_{\Omega_{m}}(2 \delta-2 h(\theta)) \pi(\theta, m \mid y) d \theta$$

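and this is $0$ when $\delta(y)=E_{\theta, m \mid y}(h(\theta))$. Since $\partial^{2} \rho / \partial \delta^{2}=2>0$, this is the minimiser, so any other estimator, in particular $E_{\theta \mid y, m}(h)$, has Bayes risk at least as large.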
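
A small numerical illustration of the inequality, using hypothetical posterior summaries for two models (the numbers are made up for illustration). For squared error loss the expected posterior loss of an action $\delta$ decomposes as $\sum_{m} \pi(m \mid y)\left[\operatorname{Var}(h \mid y, m)+\left(E(h \mid y, m)-\delta\right)^{2}\right]$.

```python
def expected_posterior_loss(delta, model_probs, means, variances):
    """E[(delta - h)^2 | y] = sum_m pi(m | y) * (Var_m + (mean_m - delta)^2)."""
    return sum(w * (v + (mu - delta) ** 2)
               for w, mu, v in zip(model_probs, means, variances))


# Hypothetical posterior summaries for two models (illustrative numbers only).
model_probs = [0.6, 0.4]      # pi(m | y)
means = [1.0, 3.0]            # E(h | y, m)
variances = [0.5, 1.0]        # Var(h | y, m)

bma_mean = sum(w * mu for w, mu in zip(model_probs, means))
loss_bma = expected_posterior_loss(bma_mean, model_probs, means, variances)
loss_single = [expected_posterior_loss(mu, model_probs, means, variances) for mu in means]
print(bma_mean, loss_bma, loss_single)
```

Reporting either single-model mean, including that of the MAP model, incurs a larger expected posterior loss than the model-averaged mean whenever the models disagree.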

---
