## [Feb 21] Causal Inference and Transfer Learning

Presenter: Yuchen Ge  
Affiliation: University of Oxford  
Contact Email: gycdwwd@gmail.com  
Website: https://yuchenge-am.github.io

---

### 1. Causal Inference Basics

For randomised experiments, the basic postulates follow.

> The assignment mechanism for $n$ units is $\mathbb{P}\left(\boldsymbol{A}_{[n]}=\boldsymbol{a}_{[n]} \mid \boldsymbol{X}_{[n]}=\boldsymbol{x}_{[n]}\right)=\pi\left(\boldsymbol{a}_{[n]} \mid \boldsymbol{x}_{[n]}\right)$, the distribution of the treatments conditional on the covariates. Two examples follow.
>
> ( **Bernoulli trial with covariate** ) $\pi\left(\boldsymbol{a}_{[n]} \mid \boldsymbol{x}_{[n]}\right)=\prod_{i=1}^{n} \pi\left(\boldsymbol{x}_{i}\right)^{a_{i}}\left\{1-\pi\left(\boldsymbol{x}_{i}\right)\right\}^{1-a_{i}}$
>
> ( **Sample without replacement** ) $\pi\left(\boldsymbol{a}_{[n]} \mid \boldsymbol{x}_{[n]}\right)=\left\{\begin{array}{ll}
\left(\begin{array}{c}
n \\
n_{1}
\end{array}\right)^{-1}, & \text { if } \sum_{i=1}^{n} a_{i}=n_{1} \\
0, & \text { otherwise. }
\end{array}\right.$
>
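> The **potential outcome (PO)** model links the observed factual (outcome) to the **counterfactuals (potential outcomes)** via consistency, $Y_{i}=Y_{i}\left(\boldsymbol{A}_{[n]}\right) = \sum_{\boldsymbol{a}_{[n]} \in \mathcal{A}^{n}} \mathbb{1}\left(\boldsymbol{A}_{[n]}=\boldsymbol{a}_{[n]}\right) Y_i\left(\boldsymbol{a}_{[n]}\right)$, together with the **assumption of no interference**, i.e. $Y_{i}\left(\boldsymbol{a}_{[n]}\right)=Y_{i}\left(a_{i}\right) \text { for all } i \in[n] \text { and } \boldsymbol{a}_{[n]} \in \mathcal{A}^{n}$. ( Note the abuse of notation: $Y_i(\cdot)$ denotes both a function of the full assignment $\boldsymbol{a}_{[n]}$ and a function of $a_i$ alone. Without no interference, $Y_{i} \neq Y_i(A_i)$ in general; with it, $Y_i = Y_i(A_i)$. )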

The potential outcome framework allows us to view causal inference as a missing data
problem. Two estimands are commonly considered, one for each choice of population: $\mathrm{SATE}=\frac{1}{n} \sum_{i=1}^{n} \left\{Y_{i}(1)-Y_{i}(0)\right\}$ for the sample and $\mathrm{PATE}=\mathbb{E}\left[Y_{i}(1)-Y_{i}(0)\right]$ for the population. ( The latter implicitly assumes that the $n$ units are sampled from a superpopulation. )
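
The missing-data view can be made concrete with a small sketch. The potential outcomes below are made-up numbers, and the helper names (`bernoulli_assignment`, `complete_randomisation`) are hypothetical; the code only illustrates the two assignment mechanisms above and the fact that each unit reveals just one potential outcome.

```python
import random
from math import comb

def bernoulli_assignment(propensities, rng):
    """Bernoulli trial: A_i ~ Bernoulli(pi(x_i)), independently across units."""
    return [1 if rng.random() < p else 0 for p in propensities]

def complete_randomisation(n, n1, rng):
    """Sampling without replacement: exactly n1 treated, all subsets equally likely."""
    treated = set(rng.sample(range(n), n1))
    return [1 if i in treated else 0 for i in range(n)]

# Hypothetical potential outcomes for n = 5 units (never jointly observed in practice).
y0 = [1.0, 2.0, 0.5, 3.0, 1.5]
y1 = [2.0, 2.5, 1.5, 3.0, 4.0]
n, n1 = len(y0), 2

# SATE uses both potential outcomes of every unit, so it is not computable from data.
sate = sum(b - a for a, b in zip(y0, y1)) / n

rng = random.Random(0)  # fixed seed so the sketch is reproducible
a = complete_randomisation(n, n1, rng)
a_bern = bernoulli_assignment([0.5] * n, rng)

# Consistency under no interference: Y_i = Y_i(A_i); the other outcome is missing.
y_obs = [y1[i] if a[i] == 1 else y0[i] for i in range(n)]
observed_table = [
    {"A": a[i],
     "Y(0)": y0[i] if a[i] == 0 else None,
     "Y(1)": y1[i] if a[i] == 1 else None}
    for i in range(n)
]

# Probability of any particular assignment with exactly n1 treated units.
prob_assignment = 1 / comb(n, n1)
```

Each row of `observed_table` has exactly one non-missing potential outcome, which is why the SATE has to be estimated rather than computed.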


> **Assumption of randomisation.** $\boldsymbol{A}_{[n]} \perp \boldsymbol{Y}_{[n]}\left(\boldsymbol{a}_{[n]}\right) \mid \boldsymbol{X}_{[n]} \text { for } \boldsymbol{a}_{[n]} \in \mathcal{A}^{n}$.

Note that the **assumption of randomisation** differs from  $\boldsymbol{A}_{[n]} \perp \boldsymbol{Y}_{[n]} \mid \boldsymbol{X}_{[n]}$, because the observed outcome  $Y_{i}=Y_{i}\left(A_{i}\right)$  generally depends on  $A_{i}$.

> We are using  $\boldsymbol{X}$, $A$, $Y$, and $Y(a)$  to refer to a generic  $\boldsymbol{X}_{i}$, $A_{i}$, $Y_{i}$, and $Y_{i}(a)$ when they are iid.
>
> **Thm.** ( **Causal identification in randomised experiments** ) Consider any assignment mechanism where  $\left\{\boldsymbol{X}_{i}, A_{i}, Y_{i}(a), a \in \mathcal{A}\right\}$  are iid. Suppose the assumptions above hold. Then
>
> $$\begin{aligned} \mathbb{P}(A=a \mid \boldsymbol{X}=\boldsymbol{x})>0 & \implies  (Y(a) \mid \boldsymbol{X}=\boldsymbol{x}) \stackrel{d}{=}(Y \mid A=a, \boldsymbol{X}=\boldsymbol{x}) \\
& \implies  \mathrm{ATE}=\mathbb{E}[Y(1)-Y(0)]=\mathbb{E}\{\mathbb{E}[Y \mid A=1, \boldsymbol{X}]-\mathbb{E}[Y \mid A=0, \boldsymbol{X}]\}. \end{aligned} $$

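**Proof.** For the first implication, computation shows that

$$\begin{aligned}
\mathbb{P}(Y(a) \leq y \mid \boldsymbol{X}=\boldsymbol{x}) & =\mathbb{P}\left(Y(a) \leq y \mid \boldsymbol{X}=\boldsymbol{x}, A=a\right) \\
& =\mathbb{P}(Y(A) \leq y \mid \boldsymbol{X}=\boldsymbol{x}, A=a) \\
& =\mathbb{P}(Y \leq y \mid \boldsymbol{X}=\boldsymbol{x}, A=a),
\end{aligned}$$

where the first equality uses the assumption of randomisation (the conditioning event has positive probability because $\mathbb{P}(A=a \mid \boldsymbol{X}=\boldsymbol{x})>0$), the second holds because $A=a$ on the conditioning event, and the third uses consistency, $Y=Y(A)$. The second implication follows by taking conditional means for $a=0,1$ and then averaging over $\boldsymbol{X}$ (law of total expectation).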

> As a special case, if $\mathbb{P}(A=1 \mid \boldsymbol{X})$  does not depend on  $\boldsymbol{X}$  ( i.e. $A \perp \boldsymbol{X}$ ), then $\mathrm{PATE}=\mathbb{E}[Y \mid A=1]-\mathbb{E}[Y \mid A=0]$.
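
The identification formula and its special case can be checked exactly on a toy superpopulation. Everything numeric below is a made-up assumption (a binary covariate, two propensities, four conditional means); the sketch only contrasts the standardised formula $\mathbb{E}\{\mathbb{E}[Y \mid A=1, \boldsymbol{X}]-\mathbb{E}[Y \mid A=0, \boldsymbol{X}]\}$ with the naive contrast $\mathbb{E}[Y \mid A=1]-\mathbb{E}[Y \mid A=0]$.

```python
from fractions import Fraction as F

# Hypothetical discrete superpopulation with a binary covariate X.
p_x = {0: F(1, 2), 1: F(1, 2)}      # P(X = x)
mean_y0 = {0: F(1), 1: F(3)}        # E[Y(0) | X = x]
mean_y1 = {0: F(2), 1: F(5)}        # E[Y(1) | X = x]

# True PATE, computed from the (normally unobservable) potential outcomes.
pate = sum(p_x[x] * (mean_y1[x] - mean_y0[x]) for x in p_x)

def cond_mean_y(a, x):
    """E[Y | A=a, X=x], which equals E[Y(a) | X=x] under randomisation given X."""
    return mean_y1[x] if a == 1 else mean_y0[x]

# Identification formula: average the within-stratum contrasts over P(X = x).
standardised = sum(p_x[x] * (cond_mean_y(1, x) - cond_mean_y(0, x)) for x in p_x)

def naive_contrast(propensity):
    """E[Y | A=1] - E[Y | A=0] when P(A=1 | X=x) = propensity[x]."""
    def marginal_mean(a):
        p_a_given_x = {x: propensity[x] if a == 1 else 1 - propensity[x] for x in p_x}
        p_a = sum(p_x[x] * p_a_given_x[x] for x in p_x)
        # Strata are weighted by P(X = x | A = a), not by P(X = x).
        return sum(p_x[x] * p_a_given_x[x] * cond_mean_y(a, x) for x in p_x) / p_a
    return marginal_mean(1) - marginal_mean(0)

naive_covariate_dependent = naive_contrast({0: F(1, 4), 1: F(3, 4)})
naive_simple_bernoulli = naive_contrast({0: F(1, 2), 1: F(1, 2)})
```

With these numbers the standardised formula recovers the PATE of $3/2$, the naive contrast gives $11/4$ when the propensity depends on $X$, and the two coincide once $A \perp \boldsymbol{X}$.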


---

Neyman considered the following difference-in-means estimator:

$$\hat{\beta}=\bar{Y}_{1}-\bar{Y}_{0}, \text { where } \bar{Y}_{1}=\frac{\sum_{i=1}^{n} A_{i} Y_{i}}{\sum_{i=1}^{n} A_{i}}, \quad \bar{Y}_{0}=\frac{\sum_{i=1}^{n}\left(1-A_{i}\right) Y_{i}}{\sum_{i=1}^{n}\left(1-A_{i}\right)} \text {. }$$

Denote  $\boldsymbol{Y}(a)=\left(Y_{1}(a), Y_{2}(a), \ldots, Y_{n}(a)\right)^{T}$. 

> Neyman studied the conditional distribution of  $\hat{\beta}$  given the potential outcomes  $\boldsymbol{Y}(0), \boldsymbol{Y}(1)$. We may refer to this as the **randomization distribution**, because the only randomness left in  $\hat{\beta}$  comes from the randomization of $\boldsymbol{A}_{[n]}$. 

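Set $\bar{Y}(a)=\sum_{i=1}^{n} Y_{i}(a) / n$ and suppose the treatment assignments $A_i$ are sampled without replacement, so that $\sum_{i} A_{i}=n_{1}$ is fixed and $\mathbb{E}\left[A_{i}\right]=n_{1} / n$. Using consistency, $A_i Y_i = A_i Y_i(1)$ and $(1-A_i) Y_i = (1-A_i) Y_i(0)$, and omitting the conditioning on  $\boldsymbol{Y}(0), \boldsymbol{Y}(1)$ for simplicity of exposition, we obtain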

$$\begin{aligned}
\mathbb{E}[\hat{\beta}] & =\mathbb{E}\left[\frac{1}{n_{1}} \sum_{i=1}^{n} A_{i} Y_{i}-\frac{1}{n_{0}} \sum_{i=1}^{n}\left(1-A_{i}\right) Y_{i}\right] =\mathbb{E}\left[\frac{1}{n_{1}} \sum_{i=1}^{n} A_{i} Y_{i}(1)-\frac{1}{n_{0}} \sum_{i=1}^{n}\left(1-A_{i}\right) Y_{i}(0)\right] \\
& =\frac{1}{n_{1}} \sum_{i=1}^{n} \frac{n_{1}}{n} Y_{i}(1)-\frac{1}{n_{0}} \sum_{i=1}^{n} \frac{n_{0}}{n} Y_{i}(0) =\bar{Y}(1)-\bar{Y}(0) .
\end{aligned}$$


> Suppose the treatment assignments $A_i$ are sampled without replacement. Then $\mathbb{E}[\hat{\beta} \mid \boldsymbol{Y}(0), \boldsymbol{Y}(1)]=\mathrm{SATE}$ and
> $$\operatorname{Var}(\hat{\beta} \mid \boldsymbol{Y}(0), \boldsymbol{Y}(1))=\frac{1}{n_{0}} S_{0}^{2}+\frac{1}{n_{1}} S_{1}^{2}-\frac{S_{01}^{2}}{n},$$
>
> where  $n_{0}=n-n_{1}$, $S_{a}^{2}=\sum_{i=1}^{n}\left(Y_{i}(a)-\bar{Y}(a)\right)^{2} /(n-1)$, and  $S_{01}^{2}=\sum_{i=1}^{n}\left(Y_{i}(1)-Y_{i}(0)-\mathrm{SATE}\right)^{2} /(n-1)$.


For the variance, since $\sum_{i} A_{i}=n_{1}$ and $\sum_{i}\left(1-A_{i}\right)=n_{0}$ are fixed, centring the potential outcomes gives

$$\operatorname{Var}(\hat{\beta} \mid \boldsymbol{Y}(0), \boldsymbol{Y}(1))=\mathbb{E}\left[\left(\sum_{i=1}^{n} \left\{\frac{A_{i}}{n_{1}} Y_{i}^{*}(1)-\frac{1-A_{i}}{n_{0}} Y_{i}^{*}(0)\right\}\right)^{2}\right],$$

where $Y_{i}^{*}(a)=Y_{i}(a)-\bar{Y}(a)$. Expand the square and use

$$\mathbb{E}\left[A_{i} A_{i^{\prime}}\right]=\frac{n_{1}}{n} \frac{n_{1}-1}{n-1}, \mathbb{E}\left[A_{i} (1-A_{i^{\prime}})\right]=\frac{n_{1}}{n} \frac{n_{0}}{n-1}, i \neq i^{\prime} \text { and } \sum_{i=1}^{n} Y_{i}^{*}(a)=0,$$

to arrive at the stated variance. One drawback of Neyman’s randomisation inference is that it is difficult to extend to settings with covariates unless the covariates are discrete. The main obstacle is that the randomisation distribution necessarily depends on unobserved potential outcomes.
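
Neyman's mean and variance formulas can be verified by brute force on a small finite population. The sketch below uses made-up potential outcomes for $n=5$ units with $n_1=2$, enumerates all $\binom{n}{n_1}$ equally likely assignments, and compares the exact randomisation mean and variance of $\hat{\beta}$ with $\mathrm{SATE}$ and $S_{0}^{2}/n_{0}+S_{1}^{2}/n_{1}-S_{01}^{2}/n$. Exact rational arithmetic is used so the comparison is an equality rather than an approximation.

```python
from fractions import Fraction as F
from itertools import combinations

# Hypothetical potential outcomes (illustrative numbers only).
y0 = [F(1), F(2), F(0), F(3), F(1)]
y1 = [F(2), F(2), F(3), F(3), F(5)]
n, n1 = len(y0), 2
n0 = n - n1

def diff_in_means(treated):
    """beta_hat for the assignment whose treated units are `treated`."""
    mean1 = sum(y1[i] for i in treated) / n1
    mean0 = sum(y0[i] for i in range(n) if i not in treated) / n0
    return mean1 - mean0

def sample_var(values):
    """Variance with divisor (len - 1), matching S_a^2 and S_01^2."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

# Randomisation distribution: every subset of size n1 has probability 1 / C(n, n1).
estimates = [diff_in_means(set(t)) for t in combinations(range(n), n1)]
rand_mean = sum(estimates) / len(estimates)
rand_var = sum((b - rand_mean) ** 2 for b in estimates) / len(estimates)

effects = [b - a for a, b in zip(y0, y1)]
sate = sum(effects) / n
neyman_var = sample_var(y0) / n0 + sample_var(y1) / n1 - sample_var(effects) / n
```

Here `rand_mean == sate` and `rand_var == neyman_var` hold exactly, which mirrors the two claims in the statement above.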

> It is common to estimate the variance  by  $\hat{S}_{0}^{2} / n_{0}+\hat{S}_{1}^{2} / n_{1}$, where 
>
> $$\hat{S}_{1}^{2}=\frac{1}{n_{1}-1} \sum_{i=1}^{n} A_{i}\left(Y_{i}-\bar{Y}_{1}\right)^{2}, \hat{S}_{0}^{2}=\frac{1}{n_{0}-1} \sum_{i=1}^{n}\left(1-A_{i}\right)\left(Y_{i}-\bar{Y}_{0}\right)^{2}.$$
>
> This is an unbiased estimator of $S_{0}^{2} / n_{0}+S_{1}^{2} / n_{1}$.
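
A companion check for the variance estimator, again on made-up potential outcomes: averaging $\hat{S}_{0}^{2}/n_{0}+\hat{S}_{1}^{2}/n_{1}$ over all assignments reproduces $S_{0}^{2}/n_{0}+S_{1}^{2}/n_{1}$ exactly. Since the true variance subtracts $S_{01}^{2}/n \geq 0$, the estimator is conservative for $\operatorname{Var}(\hat{\beta} \mid \boldsymbol{Y}(0), \boldsymbol{Y}(1))$ unless the unit-level effects are constant.

```python
from fractions import Fraction as F
from itertools import combinations

# Hypothetical potential outcomes; n1 >= 2 and n0 >= 2 so both sample variances exist.
y0 = [F(1), F(2), F(0), F(3), F(1)]
y1 = [F(2), F(2), F(3), F(3), F(5)]
n, n1 = len(y0), 2
n0 = n - n1

def sample_var(values):
    """Variance with divisor (len - 1)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def variance_estimate(treated):
    """S1_hat^2 / n1 + S0_hat^2 / n0, computed from observed outcomes only."""
    obs1 = [y1[i] for i in treated]
    obs0 = [y0[i] for i in range(n) if i not in treated]
    return sample_var(obs0) / n0 + sample_var(obs1) / n1

assignments = list(combinations(range(n), n1))
expected_estimate = sum(variance_estimate(set(t)) for t in assignments) / len(assignments)

target = sample_var(y0) / n0 + sample_var(y1) / n1
effects = [b - a for a, b in zip(y0, y1)]
true_var = target - sample_var(effects) / n  # Neyman's exact variance
```

In this example `expected_estimate == target` and `expected_estimate > true_var`, so confidence intervals built from the estimator err on the wide side.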


---

Fisher was the first to grasp fully the importance of randomisation. Consider the sharp null hypothesis $H_{0}: Y_{i}(1)-Y_{i}(0)=\beta, \forall i \in[n]$. Using the consistency assumption, the hypothesis allows us to impute the potential outcomes as

$$Y_{i}(a)=\left\{\begin{array}{ll}
Y_{i}, & \text { if } a=A_{i} \\
Y_{i}+\beta, & \text { if } a>A_{i} \\
Y_{i}-\beta, & \text { if } a<A_{i}
\end{array}\right.$$


A more compact form is $\boldsymbol{Y}_{[n]}\left(\boldsymbol{a}_{[n]}\right)=\boldsymbol{Y}_{[n]}+\beta\left(\boldsymbol{a}_{[n]}-\boldsymbol{A}_{[n]}\right)$. The key step is to derive the **randomisation distribution** of a test statistic $T = T\left(\boldsymbol{A}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}\right)$. There are two ways to do this, which are shown below. $\big($**In both cases, the randomness comes from the randomisation of $\boldsymbol{A}_{[n]}$**$\big)$

> Consider the distribution of  $T_{1}\left(\boldsymbol{A}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}(0)\right)$  given  $\boldsymbol{X}_{[n]}$  and  $\boldsymbol{Y}_{[n]}(0)$;
>
> Consider the distribution of  $T_{2}\left(\boldsymbol{A}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}\left(\boldsymbol{A}_{[n]}\right)\right)$  given  $\boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}(0)$, and  $\boldsymbol{Y}_{[n]}(1)$.

Let  $\mathcal{F}=\left(\boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}(0), \boldsymbol{Y}_{[n]}(1)\right)$. The randomisation distributions in the two approaches above are given by

$$F_{1}(t)=\mathbb{P}\left(T_{1}\left(\boldsymbol{A}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}(0)\right) \leq t \mid \mathcal{F}\right) \quad \text{and} \quad F_{2}(t)=\mathbb{P}\left(T_{2}\left(\boldsymbol{A}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}\left(\boldsymbol{A}_{[n]}\right)\right) \leq t \mid \mathcal{F}\right). $$

The observed test statistics are

$$T_{1}=T_{1}\left(\boldsymbol{A}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}-\beta \boldsymbol{A}_{[n]}\right), T_{2}=T_{2}\left(\boldsymbol{A}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}\right) .$$

The one-sided  $p$-value is the probability of observing the same or a more extreme test statistic than the observed statistic $T$, which is denoted by $P_{m}=F_{m}\left(T_{m}\right)$. An equivalent and perhaps more informative representation is

$$P_{1}=\mathbb{P}^{*}\left(T_{1}\left(\boldsymbol{A}_{[n]}^{*}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}(0)\right) \leq T_{1} \mid \mathcal{F}\right),$$

where  $\boldsymbol{A}_{[n]}^{*}$  is an independent copy of  $\boldsymbol{A}_{[n]}$, so  $\boldsymbol{A}_{[n]}^{*} \mid \boldsymbol{X}_{[n]} \sim \pi$  but  $\boldsymbol{A}_{[n]}^{*} \perp \boldsymbol{A}_{[n]}$, and  $\mathbb{P}^{*}$ denotes probability with respect to  $\boldsymbol{A}_{[n]}^{*}$. The other  $p$-value  $P_{2}$  can be defined similarly.

> A level- $\alpha$  randomisation test then rejects  $H_{0}$  if  $P_{m} \leq \alpha$ .

**Proof.** Recall the following fact.

>  If $F(t)$ is the distribution function of a random variable $T$, then $\mathbb{P}(F(T) \leq \alpha)=\mathbb{P}\left(T<F^{-1}(\alpha)\right)=\lim _{t \uparrow F^{-1}(\alpha)} \mathbb{P}(T \leq t) \leq \alpha$. Here $F^{-1}(\alpha)=\sup \{t \mid F(t) \leq \alpha\}$. 

Applied to the randomisation distribution $F_m$, this shows that under the randomisation assumption and $ H_{0}$, 

> $$\mathbb{P}\left(P_{m} \leq \alpha\right) \leq   \alpha, \forall 0<\alpha<1, m=1,2.$$

which justifies the level-$\alpha$  randomisation test.

The randomisation assumption makes the $p$-values computable. To see this, by definition we have

$$\begin{aligned}
F_{1}(t) & =\mathbb{P}\left(T_{1}\left(\boldsymbol{A}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}(0)\right) \leq t \mid \mathcal{F}\right) \\
& =\sum_{\boldsymbol{a}_{[n]} \in \mathcal{A}^{n}} \mathbb{P}\left(\boldsymbol{A}_{[n]}=\boldsymbol{a}_{[n]} \mid \mathcal{F}\right) \cdot I\left(T_{1}\left(\boldsymbol{a}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}(0)\right) \leq t\right).
\end{aligned}$$

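The randomisation assumption allows us to replace the first factor by the known assignment mechanism, while  $H_{0}$  makes the second factor computable through the imputed  $\boldsymbol{Y}_{[n]}(0)=\boldsymbol{Y}_{[n]}-\beta \boldsymbol{A}_{[n]}$: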

$$\mathbb{P}\left(\boldsymbol{A}_{[n]}=\boldsymbol{a}_{[n]} \mid \mathcal{F}\right)=\mathbb{P}\left(\boldsymbol{A}_{[n]}=\boldsymbol{a}_{[n]} \mid \boldsymbol{X}_{[n]}\right)=\pi\left(\boldsymbol{a}_{[n]} \mid \boldsymbol{X}_{[n]}\right).$$
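
A minimal sketch of this computation, under assumptions that are not in the notes: the design is complete randomisation (so $\pi$ is uniform over assignments with $n_1$ treated units), the test statistic $T_1$ is the difference in means of the imputed $\boldsymbol{Y}_{[n]}(0)$, and the data are toy numbers. The function name `fisher_p_value` is hypothetical. It follows the lower-tail convention $P_{1}=F_{1}(T_{1})$ used above.

```python
from fractions import Fraction as F
from itertools import combinations

def diff_in_means(a, y):
    """Mean of y over units with a_i = 1 minus mean over units with a_i = 0."""
    n1 = sum(a)
    n0 = len(a) - n1
    total1 = sum(yi for ai, yi in zip(a, y) if ai == 1)
    total0 = sum(yi for ai, yi in zip(a, y) if ai == 0)
    return F(total1) / n1 - F(total0) / n0

def fisher_p_value(a_obs, y_obs, beta=0):
    """P_1 = P*(T_1(A*, Y(0)) <= T_1 | F) under H0: Y_i(1) - Y_i(0) = beta."""
    n, n1 = len(a_obs), sum(a_obs)
    # Under H0 every Y_i(0) is known: Y_i(0) = Y_i - beta * A_i.
    y0 = [y - beta * a for a, y in zip(a_obs, y_obs)]
    t_obs = diff_in_means(a_obs, y0)
    hits = 0
    total = 0
    # Sum pi(a | X) * I(T_1(a, Y(0)) <= t_obs) over all assignments a.
    for treated in combinations(range(n), n1):
        a_star = [1 if i in treated else 0 for i in range(n)]
        total += 1
        if diff_in_means(a_star, y0) <= t_obs:
            hits += 1
    return F(hits, total)

# Toy data: the three treated units have the three smallest outcomes.
a_obs = [1, 1, 1, 0, 0, 0]
y_obs = [1, 2, 3, 4, 5, 6]
p_no_effect = fisher_p_value(a_obs, y_obs, beta=0)
p_effect_minus_3 = fisher_p_value(a_obs, y_obs, beta=-3)
```

For these toy data the null of no effect gives $P_1 = 1/20$, the smallest value attainable with $\binom{6}{3}=20$ assignments, whereas $\beta=-3$ gives $P_1 = 7/10$. Enumeration is only feasible for small $n$; Monte Carlo draws of $\boldsymbol{A}_{[n]}^{*}$ are the usual substitute.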

---



Next we consider a different inference paradigm where the potential outcomes are drawn
from a “super-population”.

> Asymptotic super-population inference will be discussed by considering the **simple** Bernoulli trial (Example 2.1), so  $A_{i} \perp \boldsymbol{X}_{i}$. 
>
> Further, suppose  $\left(A_{i}, \boldsymbol{X}_{i}, Y_{i}(0), Y_{i}(1)\right)$  are i.i.d. and $\mathbb{E}[\boldsymbol{X}]=\mathbf{0}$.

Denote  $\pi=\mathbb{P}(A=1)$, $\boldsymbol{\Sigma}=\mathbb{E}\left[\boldsymbol{X} \boldsymbol{X}^{T}\right]$, and  $\beta=\mathbb{E}[Y \mid A=1]-\mathbb{E}[Y \mid A=0]$.

> In a randomised experiment, $\beta =$ PATE.

We shall consider three regression estimators of  $\beta$:

$$\begin{aligned}
\left(\hat{\alpha}_{1}, \hat{\beta}_{1}\right) & =\underset{\alpha, \beta}{\arg \min } \frac{1}{n} \sum_{i=1}^{n}\left(Y_{i}-\alpha-\beta A_{i}\right)^{2}, \\
\left(\hat{\alpha}_{2}, \hat{\beta}_{2}, \hat{\boldsymbol{\gamma}}_{2}\right) & =\underset{(\alpha, \beta, \boldsymbol{\gamma})}{\arg \min } \frac{1}{n} \sum_{i=1}^{n}\left(Y_{i}-\alpha-\beta A_{i}-\boldsymbol{\gamma}^{T} \boldsymbol{X}_{i}\right)^{2}, \\
\left(\hat{\alpha}_{3}, \hat{\beta}_{3}, \hat{\boldsymbol{\gamma}}_{3}, \hat{\boldsymbol{\delta}}_{3}\right) & =\underset{(\alpha, \beta, \boldsymbol{\gamma}, \boldsymbol{\delta})}{\arg \min } \frac{1}{n} \sum_{i=1}^{n}\left(Y_{i}-\alpha-\beta A_{i}-\boldsymbol{\gamma}^{T} \boldsymbol{X}_{i}-A_{i}\left(\boldsymbol{\delta}^{T} \boldsymbol{X}_{i}\right)\right)^{2} .
\end{aligned}$$

Then write down the population version of the least squares problems:

$$\begin{aligned}
\left(\alpha_{1}, \beta_{1}\right) & =\underset{\alpha, \beta}{\arg \min } \mathbb{E}\left[(Y-\alpha-\beta A)^{2}\right], \\
\left(\alpha_{2}, \beta_{2}, \boldsymbol{\gamma}_{2}\right) & =\underset{(\alpha, \beta, \boldsymbol{\gamma})}{\arg \min } \mathbb{E}\left[\left(Y-\alpha-\beta A-\boldsymbol{\gamma}^{T} \boldsymbol{X}\right)^{2}\right], \\
\left(\alpha_{3}, \beta_{3}, \boldsymbol{\gamma}_{3}, \boldsymbol{\delta}_{3}\right) & =\underset{(\alpha, \beta, \boldsymbol{\gamma}, \boldsymbol{\delta})}{\arg \min } \mathbb{E}\left[\left(Y-\alpha-\beta A-\boldsymbol{\gamma}^{T} \boldsymbol{X}-A \cdot\left(\boldsymbol{\delta}^{T} \boldsymbol{X}\right)\right)^{2}\right] .
\end{aligned}$$

> **Lemma.** Suppose  $\left(\boldsymbol{X}_{i}, A_{i}, Y_{i}\right)$  are iid,  $A \perp \boldsymbol{X}$, and $\mathbb{E}[\boldsymbol{X}]=\mathbf{0}$. Then  $\alpha_{1}=\alpha_{2}=\alpha_{3}$  and  $\beta_{1}=\beta_{2}=\beta_{3}=\beta$ .

**Proof.** Setting the partial derivatives of the third population objective with respect to $\alpha$ and $\beta$ to zero, we obtain

$$\begin{aligned}
\mathbb{E}\left[Y-\alpha_{3}-\beta_{3} A-\boldsymbol{\gamma}_{3}^{T} \boldsymbol{X}-A\left(\boldsymbol{\delta}_{3}^{T} \boldsymbol{X}\right)\right] & =0, \\
\mathbb{E}\left[A\left(Y-\alpha_{3}-\beta_{3} A-\boldsymbol{\gamma}_{3}^{T} \boldsymbol{X}-A\left(\boldsymbol{\delta}_3^{T} \boldsymbol{X}\right)\right)\right] & =0 .
\end{aligned}$$


Using  $\mathbb{E}[\boldsymbol{X}]=0$  and  $A \perp \boldsymbol{X}$ , they can be simplified to

$$\begin{aligned}
\mathbb{E}\left[Y-\alpha_{3}-\beta_{3} A\right] & =0, \\
\mathbb{E}\left[A\left(Y-\alpha_{3}-\beta_{3} A\right)\right] & =0 .
\end{aligned}$$


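Following the same derivation, these two equations also hold with $(\alpha_1, \beta_1)$ and $(\alpha_2, \beta_2)$ in place of $(\alpha_3, \beta_3)$, so the three solutions coincide. Eliminating  $\alpha_{3}$  gives $\beta_{3}=\operatorname{Cov}(A, Y) / \operatorname{Var}(A)$, which for binary $A$ equals $\mathbb{E}[Y \mid A=1]-\mathbb{E}[Y \mid A=0]=\beta$.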
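
The lemma can be checked exactly on a toy superpopulation. In the sketch below the covariate distribution, the treatment probability and the (deliberately non-linear) outcome function are all made-up assumptions, and `solve` and `population_ols` are hypothetical helpers that solve the population normal equations $\mathbb{E}[\boldsymbol{Z}\boldsymbol{Z}^{T}]\boldsymbol{\theta}=\mathbb{E}[\boldsymbol{Z}Y]$ in exact rational arithmetic.

```python
from fractions import Fraction as F

def solve(m, b):
    """Solve m x = b by Gauss-Jordan elimination (exact with Fractions)."""
    k = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(m)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        piv = aug[col][col]
        aug[col] = [v / piv for v in aug[col]]
        for r in range(k):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]

# Hypothetical superpopulation: E[X] = 0 and A ~ Bernoulli(pi) independent of X.
p_x = {-2: F(1, 4), 0: F(1, 2), 2: F(1, 4)}
pi = F(1, 3)
support = [(x, a, px * (pi if a == 1 else 1 - pi))
           for x, px in p_x.items() for a in (0, 1)]

def outcome(x, a):
    """E[Y | X=x, A=a]; none of the three linear models is correctly specified."""
    return 1 + x * x + a * (2 + x + x * x)

def population_ols(features):
    """Minimiser of E[(Y - theta^T Z)^2] with Z = features(x, a)."""
    k = len(features(0, 0))
    gram = [[F(0)] * k for _ in range(k)]
    rhs = [F(0)] * k
    for x, a, p in support:
        z, y = features(x, a), outcome(x, a)
        for i in range(k):
            rhs[i] += p * z[i] * y
            for j in range(k):
                gram[i][j] += p * z[i] * z[j]
    return solve(gram, rhs)

theta1 = population_ols(lambda x, a: [1, a])
theta2 = population_ols(lambda x, a: [1, a, x])
theta3 = population_ols(lambda x, a: [1, a, x, a * x])

# beta = E[Y | A=1] - E[Y | A=0], using A independent of X.
beta = sum(px * (outcome(x, 1) - outcome(x, 0)) for x, px in p_x.items())
```

All three fits return the same intercept and the same coefficient on $A$, equal to `beta`, even though the outcome is quadratic in $x$. Only $\boldsymbol{\gamma}$ and $\boldsymbol{\delta}$ differ between the models.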

Suppose $\hat{\boldsymbol{\theta}}$ solves the empirical version of the estimating equation

$$\mathbb{E}[\boldsymbol{\psi}(\boldsymbol{\theta} ; \boldsymbol{Z}, Y)]=\mathbf{0},$$

where $\boldsymbol{\psi}(\boldsymbol{\theta} ; \boldsymbol{Z}, Y)=\boldsymbol{Z} \cdot\left(Y-\boldsymbol{Z}^{T} \boldsymbol{\theta}\right)=\boldsymbol{Z} \epsilon$. If $Y$ and $\boldsymbol{Z}$
have bounded fourth moments, $Z$-estimation theory shows that

$$\begin{aligned} & \sqrt{n}(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta}) \xrightarrow{d} \mathrm{~N}\left(\mathbf{0},\left\{\mathbb{E}\left[\frac{\partial \boldsymbol{\psi}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right]\right\}^{-1} \mathbb{E}\left[\boldsymbol{\psi}(\boldsymbol{\theta}) \boldsymbol{\psi}(\boldsymbol{\theta})^{T}\right]\left\{\mathbb{E}\left[\frac{\partial \boldsymbol{\psi}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right]\right\}^{-1}\right) \\
\implies  & \sqrt{n}(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta}) \xrightarrow{d} \mathrm{~N}\left(0,\left\{\mathbb{E}\left[\boldsymbol{Z} \boldsymbol{Z}^{T}\right]\right\}^{-1} \mathbb{E}\left[\boldsymbol{Z} \boldsymbol{Z}^{T} \epsilon^{2}\right]\left\{\mathbb{E}\left[\boldsymbol{Z} \boldsymbol{Z}^{T}\right]\right\}^{-1}\right). \end{aligned}$$
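
As an illustration of the sandwich formula, the sketch below applies it to the simplest regression, $\boldsymbol{Z}=(1, A)^{T}$, using plug-in sample moments and residuals in place of $\epsilon$. The data are toy numbers and `ols_with_sandwich` is a hypothetical helper. Unnormalised sums are used, so the product $(\sum \boldsymbol{Z}\boldsymbol{Z}^{T})^{-1}(\sum \boldsymbol{Z}\boldsymbol{Z}^{T}\hat{\epsilon}^{2})(\sum \boldsymbol{Z}\boldsymbol{Z}^{T})^{-1}$ directly estimates $\operatorname{Var}(\hat{\boldsymbol{\theta}})$ without a further $1/n$.

```python
def ols_with_sandwich(a, y):
    """OLS of Y on Z = (1, A) with the plug-in sandwich covariance estimate."""
    n, n1 = len(y), sum(a)
    # Bread: sum_i Z_i Z_i^T = [[n, n1], [n1, n1]], inverted in closed form.
    det = n * n1 - n1 * n1
    inv = [[n1 / det, -n1 / det], [-n1 / det, n / det]]
    # theta_hat = (sum Z Z^T)^{-1} sum Z Y
    zy = [sum(y), sum(ai * yi for ai, yi in zip(a, y))]
    theta = [inv[0][0] * zy[0] + inv[0][1] * zy[1],
             inv[1][0] * zy[0] + inv[1][1] * zy[1]]
    resid = [yi - theta[0] - theta[1] * ai for ai, yi in zip(a, y)]
    # Meat: sum_i eps_i^2 Z_i Z_i^T (A is binary, so A^2 = A).
    s_all = sum(e * e for e in resid)
    s_trt = sum(e * e for ai, e in zip(a, resid) if ai == 1)
    meat = [[s_all, s_trt], [s_trt, s_trt]]
    # Sandwich: inv @ meat @ inv.
    tmp = [[sum(inv[i][k] * meat[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
    cov = [[sum(tmp[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
    return theta, cov

a = [1, 1, 1, 0, 0, 0, 0]
y = [3.0, 5.0, 4.0, 1.0, 2.0, 0.0, 1.0]
theta_hat, cov_hat = ols_with_sandwich(a, y)
beta_hat, var_beta_hat = theta_hat[1], cov_hat[1][1]
```

For this regression the coefficient on $A$ is the difference in means, and the sandwich variance reduces to $\sum_{A_i=1}\hat{\epsilon}_i^{2}/n_1^{2}+\sum_{A_i=0}\hat{\epsilon}_i^{2}/n_0^{2}$, which is Neyman's variance estimator up to the degrees-of-freedom correction.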


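The asymptotic normality follows from the argument below. Since $\boldsymbol{\psi}$ is linear in $\boldsymbol{\theta}$, the first-order Taylor expansion of the empirical estimating equation around $\boldsymbol{\theta}$ is exact:

$$\mathbf{0}=\frac{1}{n} \sum_{i=1}^{n} \boldsymbol{\psi}\left(\hat{\boldsymbol{\theta}} ; \boldsymbol{Z}_{i}, Y_{i}\right)=\frac{1}{n} \sum_{i=1}^{n} \boldsymbol{\psi}\left(\boldsymbol{\theta} ; \boldsymbol{Z}_{i}, Y_{i}\right)-\left(\frac{1}{n} \sum_{i=1}^{n} \boldsymbol{Z}_{i} \boldsymbol{Z}_{i}^{T}\right)(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta}),$$

so that $\sqrt{n}(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta})=\left(\frac{1}{n} \sum_{i=1}^{n} \boldsymbol{Z}_{i} \boldsymbol{Z}_{i}^{T}\right)^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \boldsymbol{Z}_{i} \epsilon_{i}$. The result then follows from the law of large numbers, the central limit theorem and Slutsky's lemma.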


> (**Decomposition**) Causal estimator - True causal effect  =  Design bias + Modelling bias + Statistical noise.

Concretely, let  $\boldsymbol{O}$  denote all the observed variables, with distribution $\mathcal{O}$, and let $\boldsymbol{F}$  (full data) denote the relevant factuals and counterfactuals, with distribution $\mathcal{F}$.
Then the decomposition reads

$$\beta\left(\boldsymbol{O}_{[n]} ; \hat{\theta}\right)-\beta(\mathcal{F})=\{\beta(\mathcal{O})-\beta(\mathcal{F})\}+\{\beta(\mathcal{O} ; \theta)-\beta(\mathcal{O})\}+\left\{\beta\left(\boldsymbol{O}_{[n]} ; \hat{\theta}\right)-\beta(\mathcal{O} ; \theta)\right\}$$

where  $\beta$  is a generic symbol for causal effect functional (estimator),  $\boldsymbol{O}_{[n]}$  is the observed data of size  $n$, $\theta$  is the parameter in a statistical model and  $\hat{\theta}=\hat{\theta}\left(\boldsymbol{O}_{[n]}\right)$  is an estimator of  $\theta$.

> **Example.** In regression adjustment for randomised experiments,  $\boldsymbol{O}=(\boldsymbol{X}, A, Y)$,  $\boldsymbol{F}=(Y(0), Y(1))$, $\beta(\mathcal{F})=\mathbb{E}[Y(1)-Y(0)]$, $\beta(\mathcal{O})=\mathbb{E}[Y \mid A=1]-\mathbb{E}[Y \mid A=0]$, $\beta(\mathcal{O} ; \theta)$  is any of the population least-squares coefficients $\beta_1$, $\beta_2$, or $\beta_3$ above (referred to as (2.14), (2.15), (2.16)), and $\beta\left(\boldsymbol{O}_{[n]}; \hat{\theta}\right)$  is the corresponding estimator $\hat{\beta}_1$, $\hat{\beta}_2$, or $\hat{\beta}_3$ (referred to as (2.11), (2.12), (2.13)).

---

### 2. Graphical Models

> Given a DAG  $\mathcal{G}=(V=[p], E)$ (**a DAG implies a topological ordering**), the random variables  $\boldsymbol{X}=\boldsymbol{X}_{[p]}$  satisfy an NPSEM (nonparametric structural equation model) if the observed and interventional distributions of  $\boldsymbol{X}_{[p]}$  satisfy
>
> $$X_{i}=f_{i}\left(\boldsymbol{X}_{p a_{\mathcal{G}}(i)}, \epsilon_{i}\right), i=1, \ldots, p,$$
> for some functions  $f_{1}, \ldots, f_{p}$  and random variables  $\boldsymbol{\epsilon}_{[p]}$.

Given the above NPSEM, the counterfactual variables  $X_{i}(\boldsymbol{X}_{J}=\boldsymbol{x}_{J})$  can be obtained via recursive substitution: For any  $i \in[p]$, $J \subseteq[p]$  and  $J \neq p a(i)$ , we recursively set

$$ X_{i}\left(\boldsymbol{x}_{J}\right) = X_{i}\left(\boldsymbol{X}_{J}=\boldsymbol{x}_{J}\right)=X_{i}\left(\boldsymbol{X}_{p a(i) \cap J}=\boldsymbol{x}_{p a(i) \cap J}, \boldsymbol{X}_{p a(i) \backslash J}=\boldsymbol{X}_{p a(i) \backslash J}\left(\boldsymbol{x}_{J}\right)\right).$$


> Consider any disjoint  $J, K \subseteq V$  and any  $i \in V$. If  $K$  blocks all directed paths from  $J$  to  $i$, then $\boldsymbol{X}_{i}\left(\boldsymbol{x}_{J}, \boldsymbol{x}_{K}\right)=\boldsymbol{X}_{i}\left(\boldsymbol{x}_{K}\right)$.


**Proof.** This follows from recursive substitution and the next observation: if  $K$  blocks all directed paths from  $J$  to  $i$, then  $K$  also blocks all directed paths from  $J$  to  $pa(i) \backslash K$.
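
Recursive substitution translates directly into a recursive function. The sketch below assumes a three-node chain $1 \to 2 \to 3$ with made-up linear structural functions and fixed noise values; `counterfactual` is a hypothetical helper that follows the displayed recursion literally (parents in $J$ take the intervened value, all other parents are replaced by their own counterfactuals).

```python
# Hypothetical DAG on V = {1, 2, 3}: a chain 1 -> 2 -> 3.
parents = {1: [], 2: [1], 3: [2]}

# Structural functions f_i(parent values, noise); the linear forms are illustrative.
structural = {
    1: lambda pa, eps: eps,
    2: lambda pa, eps: 2 * pa[1] + eps,
    3: lambda pa, eps: pa[2] - 1 + eps,
}

def counterfactual(i, intervention, noise):
    """X_i(x_J) by recursive substitution; `intervention` maps j in J to x_j."""
    pa_values = {}
    for j in parents[i]:
        if j in intervention:
            pa_values[j] = intervention[j]          # parent in J: plug in x_j
        else:
            pa_values[j] = counterfactual(j, intervention, noise)  # recurse
    return structural[i](pa_values, noise[i])

noise = {1: 0.5, 2: -1.0, 3: 2.0}

x3_factual = counterfactual(3, {}, noise)            # no intervention
x3_do_x1 = counterfactual(3, {1: 10.0}, noise)       # X_3(x_1)
x3_do_x2 = counterfactual(3, {2: 4.0}, noise)        # X_3(x_2)
x3_do_x1_x2 = counterfactual(3, {1: 10.0, 2: 4.0}, noise)  # X_3(x_1, x_2)
```

Because $K=\{2\}$ blocks the only directed path from $J=\{1\}$ to node $3$, `x3_do_x1_x2` equals `x3_do_x2`, in line with the result above, while intervening on node $1$ alone does change $X_3$.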

---

### 3. No Unmeasured Confounders

An observational study is an empirical investigation that uses observational data (without
manipulation or intervention). It involves two stages: **design** and **analysis**. 

> ( **Randomised experiment** ) Randomisation allows us to choose between statistical
error and causality. This reasoning is inductive.
>
> ( **Observational Study** ) Randomisation is replaced by pair matching. As a consequence, apart
from statistical error and causality, a third possible **explanation** is that the treated patients
and the control patients are systematically different in some other way.

> Assume  $\left(\boldsymbol{X}_{i}, A_{i}, Y_{i}(0), Y_{i}(1)\right)$,  $i=1, \ldots, n$, are i.i.d. 
> 
> **Assumption of no unmeasured confounders.** $ A \perp Y(a) \mid \boldsymbol{X}$ for $a=0,1$.

Next we consider the statistical inference after matching.

> Assume treated observation  $i$  is matched to control observation  $i+n_{1}$, $i\in [n_{1}]$ . 

Let $D_{i}=\left(A_{i}-A_{i+n_{1}}\right)\left(Y_{i}-Y_{i+n_{1}}\right)$ be the treated-minus-control difference in pair  $i$. Let $M=\left\{\boldsymbol{a}_{\left[2 n_{1}\right]} \in\{0,1\}^{2 n_{1}} \mid a_{i}+a_{i+n_{1}}=1, \forall i \in\left[n_{1}\right]\right\}$ be the set of treatment assignments in which exactly one unit of each pair is treated. Let  $\boldsymbol{C}_{i}=\left(\boldsymbol{X}_{i}, Y_{i}(0), Y_{i}(1)\right)$.

There are two ways to proceed from here. The first approach is to use the sample average of  $D_{i}$,

$$\bar{D}=\frac{1}{n_{1}} \sum_{i=1}^{n_{1}} D_{i}.$$
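
A short sketch of the paired differences, on made-up data with $n_1=3$ pairs stored so that unit $i$ is matched to unit $i+n_1$ (0-indexed in the code). The helper name `paired_differences` is hypothetical.

```python
def paired_differences(a, y, n1):
    """D_i = (A_i - A_{i+n1}) * (Y_i - Y_{i+n1}) for each matched pair."""
    return [(a[i] - a[i + n1]) * (y[i] - y[i + n1]) for i in range(n1)]

# Toy matched sample: the first n1 units are treated, the last n1 are their controls.
n1 = 3
a = [1, 1, 1, 0, 0, 0]
y = [5.0, 3.0, 4.0, 4.0, 1.0, 4.5]

d = paired_differences(a, y, n1)
d_bar = sum(d) / n1

# The factor (A_i - A_{i+n1}) is +1 or -1 on M, so D_i is always treated minus
# control regardless of which member of the pair is stored first.
a_swapped = [0, 1, 1, 1, 0, 0]
y_swapped = [4.0, 3.0, 4.0, 5.0, 1.0, 4.5]  # same pair 1, units listed in reverse
d_swapped = paired_differences(a_swapped, y_swapped, n1)
```

Here `d` is `[1.0, 2.0, -0.5]`, and reordering the units within the first pair leaves the differences unchanged.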


 semiparametric inference

---

### 4. Unmeasured Confounders

Let $\pi_{i}=\mathbb{P}\left(A_{i}=1 \mid \boldsymbol{C}_{i}\right), i \in\left[2 n_{1}\right]$.

---

### References

1. Shuxiao Chen. Minimax Rates and Adaptivity in Combining Experimental and Observational Data.
2. Qingyuan Zhao. Lecture Notes on Causal Inference.
3. Joaquin Quiñonero-Candela. Dataset Shift In Machine Learning.
4. Geoff K. Nicholls. Bayes Methods.
5. Patrick J. Laub. Hawkes Processes.
6. Tomas Björk. An Introduction to Point Processes from a Martingale Point of View.