## [Feb 21] Causal Inference and Transfer Learning

Presenter: Yuchen Ge  
Affiliation: University of Oxford  
Contact Email: gycdwwd@gmail.com  
Website: https://yuchenge-am.github.io

---

### 1. Causal Inference Basics

For randomised experiments, the basic postulates follow.

> The assignment mechanism for $n$ units is $\mathbb{P}\left(\boldsymbol{A}_{[n]}=\boldsymbol{a}_{[n]} \mid \boldsymbol{X}_{[n]}=\boldsymbol{x}_{[n]}\right)=\pi\left(\boldsymbol{a}_{[n]} \mid \boldsymbol{x}_{[n]}\right)$, which specifies the distribution of the treatments conditional on the covariates. Some examples follow.
>
> ( **Bernoulli trial with covariate** ) $\pi\left(\boldsymbol{a}_{[n]} \mid \boldsymbol{x}_{[n]}\right)=\prod_{i=1}^{n} \pi\left(\boldsymbol{x}_{i}\right)^{a_{i}}\left\{1-\pi\left(\boldsymbol{x}_{i}\right)\right\}^{1-a_{i}}$
>
> ( **Sample without replacement** ) $\pi\left(\boldsymbol{a}_{[n]} \mid \boldsymbol{x}_{[n]}\right)=\left\{\begin{array}{ll}
\left(\begin{array}{c}
n \\
n_{1}
\end{array}\right)^{-1}, & \text { if } \sum_{i=1}^{n} a_{i}=n_{1} \\
0, & \text { otherwise. }
\end{array}\right.$
>
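> The **potential outcomes (PO)** model links the observed factual outcome with the **counterfactuals (potential outcomes)** via consistency, $Y_{i}=Y_{i}\left(\boldsymbol{A}_{[n]}\right)=\sum_{\boldsymbol{a}_{[n]} \in \mathcal{A}^{n}} \mathbb{1}\left(\boldsymbol{A}_{[n]}=\boldsymbol{a}_{[n]}\right) Y_{i}\left(\boldsymbol{a}_{[n]}\right)$, together with the **assumption of no interference**, i.e. $Y_{i}\left(\boldsymbol{a}_{[n]}\right)=Y_{i}\left(a_{i}\right) \text { for all } i \in[n] \text { and } \boldsymbol{a}_{[n]} \in \mathcal{A}^{n}$. (Note the abuse of notation: $Y_{i}(\cdot)$ denotes both a function of the whole assignment vector and a function of unit $i$'s own treatment. Without no interference, $Y_{i} \neq Y_{i}(A_{i})$ in general.)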
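
The two assignment mechanisms above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the step-function propensity are hypothetical choices, not part of the notes.

```python
import random

def bernoulli_assignment(x, propensity, rng):
    """Draw A_i ~ Bernoulli(pi(x_i)) independently for each unit."""
    return [1 if rng.random() < propensity(x_i) else 0 for x_i in x]

def complete_randomisation(n, n1, rng):
    """Sample exactly n1 treated units without replacement."""
    treated = set(rng.sample(range(n), n1))
    return [1 if i in treated else 0 for i in range(n)]

def propensity(v):
    # Hypothetical propensity: units with a positive covariate are treated more often.
    return 0.8 if v > 0 else 0.2

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(10)]
a_bern = bernoulli_assignment(x, propensity, rng)
a_comp = complete_randomisation(n=10, n1=4, rng=rng)
print(a_bern, sum(a_bern))   # the number of treated units is random
print(a_comp, sum(a_comp))   # the number of treated units is always n1
```

Under the Bernoulli design the treated count varies from draw to draw, whereas sampling without replacement fixes it at $n_1$ by construction.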

The potential outcome framework allows us to view causal inference as a missing data problem, since only one of $Y_{i}(0)$ and $Y_{i}(1)$ is observed for each unit. Two estimands are commonly considered, one for the sample and one for the population: $\mathrm{SATE}=\frac{1}{n} \sum_{i=1}^{n}\left\{Y_{i}(1)-Y_{i}(0)\right\}$ and $\mathrm{PATE}=\mathbb{E}\left[Y_{i}(1)-Y_{i}(0)\right]$. (The latter implicitly assumes that the $n$ units are sampled from a superpopulation.)


> **Assumption of randomisation.** $\boldsymbol{A}_{[n]} \perp \boldsymbol{Y}_{[n]}\left(\boldsymbol{a}_{[n]}\right) \mid \boldsymbol{X}_{[n]} \text { for } \boldsymbol{a}_{[n]} \in \mathcal{A}^{n}$.

Note that the **assumption of randomisation** is different from $\boldsymbol{A}_{[n]} \perp \boldsymbol{Y}_{[n]} \mid \boldsymbol{X}_{[n]}$, because the observed outcome $Y_{i}=Y_{i}\left(A_{i}\right)$ generally depends on $A_{i}$.

> We are using  $\boldsymbol{X}$, $A$, $Y$, and $Y(a)$  to refer to a generic  $\boldsymbol{X}_{i}$, $A_{i}$, $Y_{i}$, and $Y_{i}(a)$ when they are iid.
>
> **Thm.** ( **Causal identification in randomised experiments** ) Consider any assignment mechanism where  $\left\{\boldsymbol{X}_{i}, A_{i}, Y_{i}(a), a \in \mathcal{A}\right\}$  are iid. Suppose the above assumptions hold. Then
>
> $$\begin{aligned} \mathbb{P}(A=a \mid \boldsymbol{X}=\boldsymbol{x})>0 & \implies  (Y(a) \mid \boldsymbol{X}=\boldsymbol{x}) \stackrel{d}{=}(Y \mid A=a, \boldsymbol{X}=\boldsymbol{x}) \\
> & \implies  \mathrm{ATE}=\mathbb{E}[Y(1)-Y(0)]=\mathbb{E}\{\mathbb{E}[Y \mid A=1, \boldsymbol{X}]-\mathbb{E}[Y \mid A=0, \boldsymbol{X}]\}. \end{aligned} $$

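**Proof.** For the first implication, a direct computation shows that

$$\begin{aligned}
\mathbb{P}(Y(a) \leq y \mid \boldsymbol{X}=\boldsymbol{x}) & =\mathbb{P}\left(Y(a) \leq y \mid \boldsymbol{X}=\boldsymbol{x}, A=a\right) \\
& =\mathbb{P}(Y(A) \leq y \mid \boldsymbol{X}=\boldsymbol{x}, A=a) \\
& =\mathbb{P}(Y \leq y \mid \boldsymbol{X}=\boldsymbol{x}, A=a),
\end{aligned}$$

where the first equality uses the assumption of randomisation (positivity ensures that the conditioning event has positive probability), the second holds because $A=a$ on the conditioning event, and the third uses consistency, $Y=Y(A)$. The second implication follows by taking expectations and then averaging over $\boldsymbol{X}$.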

> As a special case, if $\mathbb{P}(A=1 \mid \boldsymbol{X})$  does not depend on  $\boldsymbol{X}$  ( i.e. $A \perp \boldsymbol{X}$ ), then $\mathrm{PATE}=\mathbb{E}[Y \mid A=1]-\mathbb{E}[Y \mid A=0]$.
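
To see why the adjustment over $\boldsymbol{X}$ matters, the identification formula can be evaluated exactly on a small discrete population. The sketch below is only an illustration: the probabilities and conditional means are hypothetical numbers, and $\mathbb{E}[Y \mid A=a, X=x]$ is taken equal to $\mathbb{E}[Y(a) \mid X=x]$, as the theorem asserts under randomisation.

```python
def naive_contrast(p_x, pi, mu):
    """E[Y | A=1] - E[Y | A=0], ignoring the covariate."""
    means = {}
    for a in (0, 1):
        # P(A=a, X=x) for each stratum, normalised below to P(X=x | A=a).
        joint = {x: p_x[x] * (pi[x] if a == 1 else 1.0 - pi[x]) for x in p_x}
        p_a = sum(joint.values())
        means[a] = sum(joint[x] / p_a * mu[(a, x)] for x in p_x)
    return means[1] - means[0]

def adjusted_contrast(p_x, mu):
    """E{ E[Y | A=1, X] - E[Y | A=0, X] } with E[Y | A=a, X=x] = mu[(a, x)]."""
    return sum(p_x[x] * (mu[(1, x)] - mu[(0, x)]) for x in p_x)

# Hypothetical population: binary X, conditional means keyed by (a, x).
p_x = {0: 0.5, 1: 0.5}
mu = {(0, 0): 1.0, (1, 0): 2.0, (0, 1): 3.0, (1, 1): 5.0}

pate = adjusted_contrast(p_x, mu)
naive_dependent = naive_contrast(p_x, {0: 0.2, 1: 0.8}, mu)   # pi(x) depends on x
naive_constant = naive_contrast(p_x, {0: 0.5, 1: 0.5}, mu)    # A independent of X
print(pate, naive_dependent, naive_constant)
```

With these numbers the adjusted formula gives 1.5, the unadjusted contrast gives about 3.0 when $\pi(x)$ depends on $x$, and the two agree when $\pi$ is constant, which is the special case stated above.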


---

Neyman considered the following difference-in-means estimator:

$$\hat{\beta}=\bar{Y}_{1}-\bar{Y}_{0}, \text { where } \bar{Y}_{1}=\frac{\sum_{i=1}^{n} A_{i} Y_{i}}{\sum_{i=1}^{n} A_{i}}, \quad \bar{Y}_{0}=\frac{\sum_{i=1}^{n}\left(1-A_{i}\right) Y_{i}}{\sum_{i=1}^{n}\left(1-A_{i}\right)} \text {. }$$

Denote  $\boldsymbol{Y}(a)=\left(Y_{1}(a), Y_{2}(a), \ldots, Y_{n}(a)\right)^{T}$. 

> Neyman studied the conditional distribution of  $\hat{\beta}$  given the potential outcomes  $\boldsymbol{Y}(0), \boldsymbol{Y}(1)$. We may refer to this as the **randomization distribution**, because the only randomness left in  $\hat{\beta}$  comes from the randomization of $\boldsymbol{A}_{[n]}$. 

Set $\bar{Y}(a)=\sum_{i=1}^{n} Y_{i}(a) / n$. Suppose the treatment assignments $A_i$ are sampled without replacement, so that $\mathbb{E}\left[A_{i}\right]=n_{1} / n$. Omitting the conditioning on  $\boldsymbol{Y}(0), \boldsymbol{Y}(1)$ for simplicity of exposition, we have

$$\begin{aligned}
\mathbb{E}[\hat{\beta}] & =\mathbb{E}\left[\frac{1}{n_{1}} \sum_{i=1}^{n} A_{i} Y_{i}-\frac{1}{n_{0}} \sum_{i=1}^{n}\left(1-A_{i}\right) Y_{i}\right] =\mathbb{E}\left[\frac{1}{n_{1}} \sum_{i=1}^{n} A_{i} Y_{i}(1)-\frac{1}{n_{0}} \sum_{i=1}^{n}\left(1-A_{i}\right) Y_{i}(0)\right] \\
& =\frac{1}{n_{1}} \sum_{i=1}^{n} \frac{n_{1}}{n} Y_{i}(1)-\frac{1}{n_{0}} \sum_{i=1}^{n} \frac{n_{0}}{n} Y_{i}(0) =\bar{Y}(1)-\bar{Y}(0) .
\end{aligned}$$


> Suppose the treatment assignments $A_i$ are sampled without replacement. Then $\mathbb{E}[\hat{\beta} \mid \boldsymbol{Y}(0), \boldsymbol{Y}(1)]=\mathrm{SATE}$ and
> $$\operatorname{Var}(\hat{\beta} \mid \boldsymbol{Y}(0), \boldsymbol{Y}(1))=\frac{1}{n_{0}} S_{0}^{2}+\frac{1}{n_{1}} S_{1}^{2}-\frac{S_{01}^{2}}{n},$$
>
> where  $n_{0}=n-n_{1}$, $S_{a}^{2}=\sum_{i=1}^{n}\left(Y_{i}(a)-\bar{Y}(a)\right)^{2} /(n-1)$, and  $S_{01}^{2}=\sum_{i=1}^{n}\left(Y_{i}(1)-Y_{i}(0)-\mathrm{SATE}\right)^{2} /(n-1)$.


For the variance, we can show that

$$\operatorname{Var}(\hat{\beta} \mid \boldsymbol{Y}(0), \boldsymbol{Y}(1))=\mathbb{E}\left[\left\{\sum_{i=1}^{n}\left(\frac{A_{i}}{n_{1}} Y_{i}^{*}(1)-\frac{1-A_{i}}{n_{0}} Y_{i}^{*}(0)\right)\right\}^{2}\right],$$

where $Y_{i}^{*}(a)=Y_{i}(a)-\bar{Y}(a)$. Expand the sum of squares and use

$$\mathbb{E}\left[A_{i} A_{i^{\prime}}\right]=\frac{n_{1}}{n} \frac{n_{1}-1}{n-1}, \mathbb{E}\left[A_{i} (1-A_{i^{\prime}})\right]=\frac{n_{1}}{n} \frac{n_{0}}{n-1}, i \neq i^{\prime} \text { and } \sum_{i=1}^{n} Y_{i}^{*}(a)=0,$$

to arrive at the conclusion. One drawback of Neyman’s randomisation inference is that it is difficult to extend to settings with covariates unless the covariates are discrete. The main obstacle is that the randomisation distribution necessarily depends on unobserved potential outcomes.

> It is common to estimate the variance  by  $\hat{S}_{0}^{2} / n_{0}+\hat{S}_{1}^{2} / n_{1}$, where 
>
> $$\hat{S}_{1}^{2}=\frac{1}{n_{1}-1} \sum_{i=1}^{n} A_{i}\left(Y_{i}-\bar{Y}_{1}\right)^{2}, \hat{S}_{0}^{2}=\frac{1}{n_{0}-1} \sum_{i=1}^{n}\left(1-A_{i}\right)\left(Y_{i}-\bar{Y}_{0}\right)^{2}.$$
>
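> This is an unbiased estimator of $S_{0}^{2} / n_{0}+S_{1}^{2} / n_{1}$, which is an upper bound on the true variance because $S_{01}^{2} \geq 0$. The resulting inference is therefore conservative.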
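
For a small experiment, the three claims above (unbiasedness for the SATE, the variance formula, and unbiasedness of the variance estimator for the upper bound) can be checked by enumerating every assignment. The sketch below uses made-up potential outcomes for $n=6$ units; in a real experiment only one of `y0[i]`, `y1[i]` would be observed.

```python
from itertools import combinations
from statistics import mean, variance

# Hypothetical potential outcomes for n = 6 units.
y0 = [1.0, 2.0, 0.5, 3.0, 2.5, 1.5]
y1 = [2.0, 2.5, 2.0, 3.5, 4.5, 1.0]
n, n1 = len(y0), 3
n0 = n - n1

sate = mean(b - a for a, b in zip(y0, y1))
s0, s1 = variance(y0), variance(y1)               # S_0^2 and S_1^2 (divisor n - 1)
s01 = variance([b - a for a, b in zip(y0, y1)])   # S_01^2
neyman_var = s0 / n0 + s1 / n1 - s01 / n

beta_hats, var_hats = [], []
for treated in combinations(range(n), n1):        # all equally likely assignments
    t = set(treated)
    obs1 = [y1[i] for i in range(n) if i in t]
    obs0 = [y0[i] for i in range(n) if i not in t]
    beta_hats.append(mean(obs1) - mean(obs0))
    var_hats.append(variance(obs0) / n0 + variance(obs1) / n1)

exact_mean = mean(beta_hats)
exact_var = mean((b - exact_mean) ** 2 for b in beta_hats)   # randomisation variance
print(exact_mean, sate)
print(exact_var, neyman_var)
print(mean(var_hats), s0 / n0 + s1 / n1)
```

Each printed pair should agree up to floating-point error, and the last pair is at least as large as the second, which illustrates the conservativeness of the usual variance estimate.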


---

Fisher was the first to grasp fully the importance of randomisation. Consider the sharp null hypothesis $H_{0}: Y_{i}(1)-Y_{i}(0)=\beta, \forall i \in[n]$. Using the consistency assumption, the hypothesis allows us to impute the potential outcomes as

$$Y_{i}(a)=\left\{\begin{array}{ll}
Y_{i}, & \text { if } a=A_{i} \\
Y_{i}+\beta, & \text { if } a>A_{i} \\
Y_{i}-\beta, & \text { if } a<A_{i}
\end{array}\right.$$


A more compact form is $\boldsymbol{Y}_{[n]}\left(\boldsymbol{a}_{[n]}\right)=\boldsymbol{Y}_{[n]}+\beta\left(\boldsymbol{a}_{[n]}-\boldsymbol{A}_{[n]}\right)$. The key step is to derive the **randomisation distribution** of $T = T\left(\boldsymbol{A}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}\right)$. There are two ways to do this, shown below. (**In both cases, the randomness comes from the randomisation of $\boldsymbol{A}_{[n]}$.**)

> Consider the distribution of  $T_{1}\left(\boldsymbol{A}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}(0)\right)$  given  $\boldsymbol{X}_{[n]}$  and  $\boldsymbol{Y}_{[n]}(0)$;
>
> Consider the distribution of  $T_{2}\left(\boldsymbol{A}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}\left(\boldsymbol{A}_{[n]}\right)\right)$  given  $\boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}(0)$, and  $\boldsymbol{Y}_{[n]}(1)$.

Let  $\mathcal{F}=\left(\boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}(0), \boldsymbol{Y}_{[n]}(1)\right)$. The randomisation distributions in the two approaches above are given by

$$F_{1}(t)=\mathbb{P}\left(T_{1}\left(\boldsymbol{A}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}(0)\right) \leq t \mid \mathcal{F}\right) \quad \text{and} \quad F_{2}(t)=\mathbb{P}\left(T_{2}\left(\boldsymbol{A}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}\left(\boldsymbol{A}_{[n]}\right)\right) \leq t \mid \mathcal{F}\right). $$

The observed test statistics are

$$T_{1}=T_{1}\left(\boldsymbol{A}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}-\beta \boldsymbol{A}_{[n]}\right), T_{2}=T_{2}\left(\boldsymbol{A}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}\right) .$$

The one-sided  $p$-value is the probability of observing the same or a more extreme test statistic than the observed statistic $T$, which is denoted by $P_{m}=F_{m}\left(T_{m}\right)$. An equivalent and perhaps more informative representation is

$$P_{1}=\mathbb{P}^{*}\left(T_{1}\left(\boldsymbol{A}_{[n]}^{*}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}(0)\right) \leq T_{1} \mid \mathcal{F}\right),$$

where  $\boldsymbol{A}_{[n]}^{*}$  is an independent copy of  $\boldsymbol{A}_{[n]}$, so  $\boldsymbol{A}_{[n]}^{*} \mid \boldsymbol{X}_{[n]} \sim \pi$  but  $\boldsymbol{A}_{[n]}^{*} \perp \boldsymbol{A}_{[n]}$, and  $\mathbb{P}^{*}$ is the probability with respect to  $\boldsymbol{A}_{[n]}^{*}$. The other  $p$-value  $P_{2}$  can be defined similarly.

> A level- $\alpha$  randomisation test then rejects  $H_{0}$  if  $P_{m} \leq \alpha$ .

**Proof.** We know that 

>  If $F(t)$ is the distribution function of a random variable $T$, then $\mathbb{P}(F(T) \leq \alpha)=\mathbb{P}\left(T<F^{-1}(\alpha)\right)=\lim _{t \uparrow F^{-1}(\alpha)} \mathbb{P}(T \leq t) \leq \alpha$. Here $F^{-1}(\alpha)=\sup \{t \mid F(t) \leq \alpha\}$. 

This shows that, under the assumption of randomisation and $H_{0}$,

> $$\mathbb{P}\left(P_{m} \leq \alpha\right) \leq   \alpha, \forall 0<\alpha<1, m=1,2.$$

which enables us to do the level-$\alpha$  randomisation test.

The randomisation assumption makes the $p$-values computable. To see this, by definition we have

$$\begin{aligned}
F_{1}(t) & =\mathbb{P}\left(T_{1}\left(\boldsymbol{A}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}(0)\right) \leq t \mid \mathcal{F}\right) \\
& =\sum_{\boldsymbol{a}_{[n]} \in \mathcal{A}^{n}} \mathbb{P}\left(\boldsymbol{A}_{[n]}=\boldsymbol{a}_{[n]} \mid \mathcal{F}\right) \cdot I\left(T_{1}\left(\boldsymbol{a}_{[n]}, \boldsymbol{X}_{[n]}, \boldsymbol{Y}_{[n]}(0)\right) \leq t\right).
\end{aligned}$$

The assumption of randomisation and  $H_{0}$  allow us to replace the first factor by

$$\mathbb{P}\left(\boldsymbol{A}_{[n]}=\boldsymbol{a}_{[n]} \mid \mathcal{F}\right)=\mathbb{P}\left(\boldsymbol{A}_{[n]}=\boldsymbol{a}_{[n]} \mid \boldsymbol{X}_{[n]}\right)=\pi\left(\boldsymbol{a}_{[n]} \mid \boldsymbol{X}_{[n]}\right).$$
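
The computation of $P_1$ can be sketched for complete randomisation, where $\pi$ is uniform over assignments with $n_1$ treated units. This is a minimal illustration with hypothetical data; the statistic (control mean minus treated mean of the imputed $Y(0)$) is one possible choice, picked so that small values count as extreme, matching the convention $P_1=F_1(T_1)$ above.

```python
from itertools import combinations
from statistics import mean

def randomisation_p_value(a, y, beta0=0.0):
    """Exact P_1 for H0: Y_i(1) - Y_i(0) = beta0 under complete randomisation."""
    n, n1 = len(a), sum(a)
    y0 = [yi - beta0 * ai for yi, ai in zip(y, a)]    # imputed Y(0) under H0

    def stat(assign):
        treated = [y0[i] for i in range(n) if assign[i] == 1]
        control = [y0[i] for i in range(n) if assign[i] == 0]
        return mean(control) - mean(treated)

    t_obs = stat(a)
    count = total = 0
    for treated in combinations(range(n), n1):        # support of pi(. | x)
        a_star = [1 if i in treated else 0 for i in range(n)]
        count += stat(a_star) <= t_obs + 1e-12        # tolerance for float ties
        total += 1
    return count / total

# Hypothetical data: 4 treated and 4 control units.
a = [1, 1, 1, 1, 0, 0, 0, 0]
y = [5.1, 4.8, 6.0, 5.5, 3.0, 3.2, 2.9, 3.6]
p_null = randomisation_p_value(a, y, beta0=0.0)
p_two = randomisation_p_value(a, y, beta0=2.0)
print(p_null, p_two)
```

Here the observed assignment is the single most extreme of the 70 possible ones under $\beta=0$, so `p_null` equals $1/70$, while the hypothesis $\beta=2$ is not rejected at the usual levels. For larger $n$ the enumeration would typically be replaced by Monte Carlo draws of $\boldsymbol{A}_{[n]}^{*}$.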

---



Next we consider a different inference paradigm where the potential outcomes are drawn
from a “super-population”.

> Asymptotic super-population inference is discussed for the **simple** Bernoulli trial (Example 2.1), so  $A_{i} \perp \boldsymbol{X}_{i}$. 
>
> Further, suppose  $\left(A_{i}, \boldsymbol{X}_{i}, Y_{i}(0), Y_{i}(1)\right)$  are i.i.d. and $\mathbb{E}[\boldsymbol{X}]=\mathbf{0}$.

Denote  $\pi=\mathbb{P}(A=1)$, $\boldsymbol{\Sigma}=\mathbb{E}\left[\boldsymbol{X} \boldsymbol{X}^{T}\right]$, and  $\beta=\mathbb{E}[Y \mid A=1]-\mathbb{E}[Y \mid A=0]$.

> In a randomised experiment, $\beta =$ PATE.

We shall consider three regression estimators of  $\beta$:

$$\begin{aligned}
\left(\hat{\alpha}_{1}, \hat{\beta}_{1}\right) & =\underset{\alpha, \beta}{\arg \min } \frac{1}{n} \sum_{i=1}^{n}\left(Y_{i}-\alpha-\beta A_{i}\right)^{2}, \\
\left(\hat{\alpha}_{2}, \hat{\beta}_{2}, \hat{\boldsymbol{\gamma}}_{2}\right) & =\underset{(\alpha, \beta, \boldsymbol{\gamma})}{\arg \min } \frac{1}{n} \sum_{i=1}^{n}\left(Y_{i}-\alpha-\beta A_{i}-\boldsymbol{\gamma}^{T} \boldsymbol{X}_{i}\right)^{2}, \\
\left(\hat{\alpha}_{3}, \hat{\beta}_{3}, \hat{\boldsymbol{\gamma}}_{3}, \hat{\boldsymbol{\delta}}_{3}\right) & =\underset{(\alpha, \beta, \boldsymbol{\gamma}, \boldsymbol{\delta})}{\arg \min } \frac{1}{n} \sum_{i=1}^{n}\left(Y_{i}-\alpha-\beta A_{i}-\boldsymbol{\gamma}^{T} \boldsymbol{X}_{i}-A_{i}\left(\boldsymbol{\delta}^{T} \boldsymbol{X}_{i}\right)\right)^{2} .
\end{aligned}$$

Then write down the population version of the least squares problems:

$$\begin{aligned}
\left(\alpha_{1}, \beta_{1}\right) & =\underset{\alpha, \beta}{\arg \min } \mathbb{E}\left[(Y-\alpha-\beta A)^{2}\right], \\
\left(\alpha_{2}, \beta_{2}, \boldsymbol{\gamma}_{2}\right) & =\underset{(\alpha, \beta, \boldsymbol{\gamma})}{\arg \min } \mathbb{E}\left[\left(Y-\alpha-\beta A-\boldsymbol{\gamma}^{T} \boldsymbol{X}\right)^{2}\right], \\
\left(\alpha_{3}, \beta_{3}, \boldsymbol{\gamma}_{3}, \boldsymbol{\delta}_{3}\right) & =\underset{(\alpha, \beta, \boldsymbol{\gamma}, \boldsymbol{\delta})}{\arg \min } \mathbb{E}\left[\left(Y-\alpha-\beta A-\boldsymbol{\gamma}^{T} \boldsymbol{X}-A \cdot\left(\boldsymbol{\delta}^{T} \boldsymbol{X}\right)\right)^{2}\right] .
\end{aligned}$$

> **Lemma.** Suppose  $\left(\boldsymbol{X}_{i}, A_{i}, Y_{i}\right)$  are iid,  $A \perp \boldsymbol{X}$, and $\mathbb{E}[\boldsymbol{X}]=\mathbf{0}$. Then  $\alpha_{1}=\alpha_{2}=\alpha_{3}$  and  $\beta_{1}=\beta_{2}=\beta_{3}=\beta$ .

**Proof.** By taking partial derivatives with respect to $\alpha$ and $\beta$, we obtain

$$\begin{aligned}
\mathbb{E}\left[Y-\alpha_{3}-\beta_{3} A-\boldsymbol{\gamma}_{3}^{T} \boldsymbol{X}-A\left(\boldsymbol{\delta}_{3}^{T} \boldsymbol{X}\right)\right] & =0, \\
\mathbb{E}\left[A\left(Y-\alpha_{3}-\beta_{3} A-\boldsymbol{\gamma}_{3}^{T} \boldsymbol{X}-A\left(\boldsymbol{\delta}_{3}^{T} \boldsymbol{X}\right)\right)\right] & =0 .
\end{aligned}$$


Using  $\mathbb{E}[\boldsymbol{X}]=0$  and  $A \perp \boldsymbol{X}$ , they can be simplified to

$$\begin{aligned}
\mathbb{E}\left[Y-\alpha_{3}-\beta_{3} A\right] & =0, \\
\mathbb{E}\left[A\left(Y-\alpha_{3}-\beta_{3} A\right)\right] & =0 .
\end{aligned}$$


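Following the same derivation, these two equations also hold for $(\alpha_{1}, \beta_{1})$ and $(\alpha_{2}, \beta_{2})$, so the three solutions coincide. Cancelling  $\alpha_{3}$  and using $A^{2}=A$ gives $\beta_{3}=\operatorname{Cov}(A, Y) / \operatorname{Var}(A)=\mathbb{E}[Y \mid A=1]-\mathbb{E}[Y \mid A=0]=\beta$.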
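
The lemma can be illustrated numerically: in a simulated simple Bernoulli trial, the three fitted coefficients on $A$ should all be close to the same target. The sketch below is an assumption-laden toy example (a scalar covariate, a hypothetical outcome model with a heterogeneous effect, and a small hand-written least squares solver so that only the standard library is needed).

```python
import random

def ols(rows, y):
    """Least squares via the normal equations and Gauss-Jordan elimination."""
    k = len(rows[0])
    # Augmented matrix [Z^T Z | Z^T y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(m[r][c]))   # partial pivoting
        m[c], m[p] = m[p], m[c]
        pivot = m[c][c]
        m[c] = [v / pivot for v in m[c]]
        for r in range(k):
            if r != c:
                factor = m[r][c]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[c])]
    return [m[i][k] for i in range(k)]

rng = random.Random(1)
n = 20000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
x_bar = sum(x) / n
x = [v - x_bar for v in x]                 # centre X, mimicking E[X] = 0
a = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
# Hypothetical outcome model: unit-level effect 1.5 + X, so the PATE is 1.5.
y = [1.0 + 2.0 * xi + ai * (1.5 + xi) + rng.gauss(0.0, 1.0) for xi, ai in zip(x, a)]

beta_1 = ols([[1.0, ai] for ai in a], y)[1]
beta_2 = ols([[1.0, ai, xi] for ai, xi in zip(a, x)], y)[1]
beta_3 = ols([[1.0, ai, xi, ai * xi] for ai, xi in zip(a, x)], y)[1]
diff_means = (sum(yi for yi, ai in zip(y, a) if ai) / sum(a)
              - sum(yi for yi, ai in zip(y, a) if not ai) / (n - sum(a)))
print(beta_1, beta_2, beta_3, diff_means)
```

The first regression reproduces the difference in means exactly, and the other two land near the same value. They differ in precision rather than in target.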

Suppose $\hat{\boldsymbol{\theta}}$ solves the empirical version of the estimating equation

$$\mathbb{E}[\boldsymbol{\psi}(\boldsymbol{\theta} ; \boldsymbol{Z}, Y)]=\mathbf{0},$$

where $\boldsymbol{\psi}(\boldsymbol{\theta} ; \boldsymbol{Z}, Y)=\boldsymbol{Z} \cdot\left(Y-\boldsymbol{Z}^{T} \boldsymbol{\theta}\right)=\boldsymbol{Z} \epsilon$. If $Y$ and $\boldsymbol{Z}$ have bounded fourth moments, Z-estimation theory shows that

$$\begin{aligned} & \sqrt{n}(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta}) \xrightarrow{d} \mathrm{~N}\left(\mathbf{0},\left\{\mathbb{E}\left[\frac{\partial \boldsymbol{\psi}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right]\right\}^{-1} \mathbb{E}\left[\boldsymbol{\psi}(\boldsymbol{\theta}) \boldsymbol{\psi}(\boldsymbol{\theta})^{T}\right]\left\{\mathbb{E}\left[\frac{\partial \boldsymbol{\psi}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right]\right\}^{-T}\right) \\
\implies  & \sqrt{n}(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta}) \xrightarrow{d} \mathrm{~N}\left(0,\left\{\mathbb{E}\left[\boldsymbol{Z} \boldsymbol{Z}^{T}\right]\right\}^{-1} \mathbb{E}\left[\boldsymbol{Z} \boldsymbol{Z}^{T} \epsilon^{2}\right]\left\{\mathbb{E}\left[\boldsymbol{Z} \boldsymbol{Z}^{T}\right]\right\}^{-1}\right). \end{aligned}$$


The asymptotic normality follows from the argument below. Using a Taylor expansion,
$$ \begin{aligned}
0 & =\frac{1}{n} \sum_{i=1}^{n} \boldsymbol{\psi}\left(\hat{\boldsymbol{\theta}} ; \boldsymbol{Z}_{i}, Y_{i}\right) \\
& =\frac{1}{n} \sum_{i=1}^{n} \boldsymbol{\psi}\left(\boldsymbol{\theta} ; \boldsymbol{Z}_{i}, Y_{i}\right)+\left[\frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \boldsymbol{\theta}} \boldsymbol{\psi}\left(\boldsymbol{\theta} ; \boldsymbol{Z}_{i}, Y_{i}\right)\right](\hat{\boldsymbol{\theta}}-\boldsymbol{\theta})+R_{n} .
\end{aligned} $$

By using $\hat{\boldsymbol{\theta}} \xrightarrow{p} \boldsymbol{\theta}$, it can be shown that $R_{n}$  is asymptotically smaller than the other two terms and can be ignored. Thus

$$\sqrt{n}(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta}) \approx -\left[\frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \boldsymbol{\theta}} \boldsymbol{\psi}\left(\boldsymbol{\theta} ; \boldsymbol{Z}_{i}, Y_{i}\right)\right]^{-1}\left[\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \boldsymbol{\psi}\left(\boldsymbol{\theta} ; \boldsymbol{Z}_{i}, Y_{i}\right)\right].$$

By the law of large numbers, the first factor on the right-hand side converges in probability to  $\{\mathbb{E}[\partial \boldsymbol{\psi}(\boldsymbol{\theta}) / \partial \boldsymbol{\theta}]\}^{-1}$. By the central limit theorem, the second factor converges in distribution to a normal random vector with covariance  $\mathbb{E}\left[\boldsymbol{\psi}(\boldsymbol{\theta}) \boldsymbol{\psi}(\boldsymbol{\theta})^{T}\right]$. Using Slutsky's theorem, we arrive at the conclusion. The minus sign does not affect the limiting covariance.
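
As an illustration of the sandwich formula, the sketch below plugs sample averages into $\{\mathbb{E}[\boldsymbol{Z}\boldsymbol{Z}^{T}]\}^{-1} \mathbb{E}[\boldsymbol{Z}\boldsymbol{Z}^{T}\epsilon^{2}]\{\mathbb{E}[\boldsymbol{Z}\boldsymbol{Z}^{T}]\}^{-1}$ for the simplest regression, $\boldsymbol{Z}=(1, A)^{T}$. The data-generating model and helper names are hypothetical, and the code is restricted to two-dimensional $\boldsymbol{Z}$ to stay short.

```python
import random
from statistics import mean, variance

def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q))) for j in range(len(q[0]))]
            for i in range(len(p))]

def sandwich_cov(z_rows, resid):
    """Plug-in estimate of the sandwich covariance for 2-dimensional Z."""
    n = len(z_rows)
    bread = [[sum(z[i] * z[j] for z in z_rows) / n for j in range(2)] for i in range(2)]
    meat = [[sum(z[i] * z[j] * e * e for z, e in zip(z_rows, resid)) / n
             for j in range(2)] for i in range(2)]
    b_inv = inv2(bread)
    return matmul(matmul(b_inv, meat), b_inv)

rng = random.Random(2)
n = 2000
a = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
# Hypothetical outcomes whose noise level depends on the treatment.
y = [1.0 + 2.0 * ai + rng.gauss(0.0, 1.0 + 2.0 * ai) for ai in a]

y1 = [yi for yi, ai in zip(y, a) if ai == 1]
y0 = [yi for yi, ai in zip(y, a) if ai == 0]
alpha_hat, beta_hat = mean(y0), mean(y1) - mean(y0)    # OLS of Y on (1, A)
resid = [yi - alpha_hat - beta_hat * ai for yi, ai in zip(y, a)]

cov = sandwich_cov([(1.0, float(ai)) for ai in a], resid)
var_sandwich = cov[1][1] / n                           # variance estimate for beta_hat
n1, n0 = len(y1), len(y0)
var_neyman = variance(y1) / n1 + variance(y0) / n0
print(var_sandwich, var_neyman)
```

For this regression the sandwich estimate coincides with $\hat{S}_{1}^{2} / n_{1}+\hat{S}_{0}^{2} / n_{0}$ up to the factors $(n_{a}-1)/n_{a}$, which ties the asymptotic variance back to the randomisation-based estimate discussed earlier.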

> Let  $\epsilon_{i 1}, \epsilon_{i 2}, \epsilon_{i 3}$  be the error terms in the three regression estimators:
>
> $$\epsilon_{i m}=Y_{i}-\alpha_{m}-\beta_{m} A_{i}-\boldsymbol{\gamma}_{m}^{T} \boldsymbol{X}_{i}-A_{i}\left(\boldsymbol{\delta}_{m}^{T} \boldsymbol{X}_{i}\right), m=1,2,3 .$$
>
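> Conventionally  $\boldsymbol{\gamma}_{1}=\mathbf{0}$  and  $\boldsymbol{\delta}_{1}=\boldsymbol{\delta}_{2}=\mathbf{0}$. Then, under the conditions of the lemma,
>
> $$\sqrt{n}\left(\hat{\beta}_{m}-\beta\right) \xrightarrow{d} \mathrm{~N}\left(0, V_{m}\right), \quad V_{m}=\frac{\mathbb{E}\left[(A-\pi)^{2} \epsilon_{m}^{2}\right]}{\pi^{2}(1-\pi)^{2}}, \quad m=1,2,3,$$
>
> where $\epsilon_{m}$ denotes a generic copy of $\epsilon_{i m}$. The expression for $V_{m}$ follows from the sandwich formula, because $\mathbb{E}[\boldsymbol{X}]=\mathbf{0}$ and $A \perp \boldsymbol{X}$ make $\mathbb{E}\left[\boldsymbol{Z} \boldsymbol{Z}^{T}\right]$ block diagonal.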



> (**Decomposition**) Causal estimator - True causal effect  =  Design bias + Modelling bias + Statistical noise.

Concretely, let  $\boldsymbol{O}$  be all the observed variables, with distribution $\mathcal{O}$, and let $\boldsymbol{F}$  (the full data) denote the relevant factuals and counterfactuals, with distribution $\mathcal{F}$.
Then the decomposition is

$$\beta\left(\boldsymbol{O}_{[n]} ; \hat{\theta}\right)-\beta(\mathcal{F})=\{\beta(\mathcal{O})-\beta(\mathcal{F})\}+\{\beta(\mathcal{O} ; \theta)-\beta(\mathcal{O})\}+\left\{\beta\left(\boldsymbol{O}_{[n]} ; \hat{\theta}\right)-\beta(\mathcal{O} ; \theta)\right\}$$

where  $\beta$  is a generic symbol for causal effect functional (estimator),  $\boldsymbol{O}_{[n]}$  is the observed data of size  $n$, $\theta$  is the parameter in a statistical model and  $\hat{\theta}=\hat{\theta}\left(\boldsymbol{O}_{[n]}\right)$  is an estimator of  $\theta$.

> **Example.** In regression adjustment for randomised experiments,  $\boldsymbol{O}=(\boldsymbol{X}, A, Y)$,  $\boldsymbol{F}=(Y(0), Y(1))$, $\beta(\mathcal{F})=\mathbb{E}[Y(1)-Y(0)]$, $\beta(\mathcal{O})=\mathbb{E}[Y \mid A=1]-\mathbb{E}[Y \mid A=0]$, $\beta(\mathcal{O} ; \theta)$  is any of the population-version regression coefficients $\beta_{m}$, and $\beta\left(\boldsymbol{O}_{[n]}; \hat{\theta}\right)$  is the corresponding regression estimator $\hat{\beta}_{m}$.

---

### 2. Graphical Models

> Given a DAG  $\mathcal{G}=(V=[p], E)$ (**a DAG implies a topological ordering**), the random variables  $\boldsymbol{X}=\boldsymbol{X}_{[p]}$  satisfy a non-parametric structural equation model (NPSEM) if the observed and interventional distributions of  $\boldsymbol{X}_{[p]}$  satisfy
>
> $$X_{i}=f_{i}\left(\boldsymbol{X}_{p a_{\mathcal{G}}(i)}, \epsilon_{i}\right), i=1, \ldots, p,$$
> for some functions  $f_{1}, \ldots, f_{p}$  and random variables  $\boldsymbol{\epsilon}_{[p]}$.

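Given the above NPSEM, the counterfactual variables  $X_{i}(\boldsymbol{X}_{J}=\boldsymbol{x}_{J})$  are obtained via recursive substitution. The basic counterfactuals are $X_{i}\left(\boldsymbol{x}_{p a(i)}\right)=f_{i}\left(\boldsymbol{x}_{p a(i)}, \epsilon_{i}\right)$. For any  $i \in[p]$ and $J \subseteq[p]$  with  $J \neq p a(i)$, we then recursively set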

$$ X_{i}\left(\boldsymbol{x}_{J}\right) = X_{i}\left(\boldsymbol{X}_{J}=\boldsymbol{x}_{J}\right)=X_{i}\left(\boldsymbol{X}_{p a(i) \cap J}=\boldsymbol{x}_{p a(i) \cap J}, \boldsymbol{X}_{p a(i) \backslash J}=\boldsymbol{X}_{p a(i) \backslash J}\left(\boldsymbol{x}_{J}\right)\right).$$
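
The recursion can be written down directly. The sketch below assumes a hypothetical three-node DAG ($1 \to 2$, $1 \to 3$, $2 \to 3$) with linear structural functions and one fixed draw of the noise; none of these specifics come from the notes.

```python
def counterfactual(i, intervention, parents, f, eps):
    """X_i(x_J) by recursive substitution.

    intervention: dict {j: x_j} for j in J
    parents:      dict {i: list of parents of i}
    f:            dict {i: function(parent_values, eps_i)}
    eps:          dict {i: one draw of the noise eps_i}
    """
    values = {}
    for j in parents[i]:
        if j in intervention:
            values[j] = intervention[j]       # parent in J: plug in x_j
        else:                                 # parent not in J: use X_j(x_J)
            values[j] = counterfactual(j, intervention, parents, f, eps)
    return f[i](values, eps[i])

# Hypothetical NPSEM on the DAG 1 -> 2, 1 -> 3, 2 -> 3.
parents = {1: [], 2: [1], 3: [1, 2]}
f = {
    1: lambda pa, e: e,
    2: lambda pa, e: 2.0 * pa[1] + e,
    3: lambda pa, e: pa[1] + 3.0 * pa[2] + e,
}
eps = {1: 0.5, 2: -1.0, 3: 0.25}

x2_factual = counterfactual(2, {}, parents, f, eps)        # J empty gives the factual X_2
x3_factual = counterfactual(3, {}, parents, f, eps)
x3_do_x1 = counterfactual(3, {1: 1.0}, parents, f, eps)    # X_3(x_1 = 1)
x3_do_x2 = counterfactual(3, {2: x2_factual}, parents, f, eps)
print(x3_factual, x3_do_x1, x3_do_x2)
```

Setting $X_2$ to the value it takes anyway leaves $X_3$ unchanged, which is an instance of the consistency property of counterfactuals. The intervention on $X_1$ propagates through both the direct edge and the path via $X_2$.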


Recall that

> Given a DAG  $\mathcal{G}$, a path is blocked by  $K \subseteq V$  if $\exists k$  on the path such that either $k$  is not a collider on this path and  $k \in K$; or $k$  is a collider on this path and  $k$  and all its descendants are not in  $K$.
>
> For $\mathbb{P}$ on $\mathcal{G}$,  $f(\boldsymbol{x})=\prod_{i \in V} f_{i \mid p a(i)}\left(x_{i} \mid \boldsymbol{x}_{p a(i)}\right)$ iff $I \perp J \mid K \,[\mathcal{G}] \Longrightarrow \boldsymbol{X}_{I} \perp \boldsymbol{X}_{J} \mid \boldsymbol{X}_{K}$ for all disjoint $I,J,K$, where $I \perp J \mid K \,[\mathcal{G}]$ means that $K$ blocks every path from $I$ to $J$ in $\mathcal{G}$ (d-separation).
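
Checking whether $K$ blocks every path between $I$ and $J$ can be automated. The sketch below does not enumerate paths. Instead it uses the equivalent ancestral moral graph criterion: restrict to the ancestors of $I \cup J \cup K$, marry co-parents, drop directions, delete $K$, and test connectivity. The function names and the toy graphs are hypothetical.

```python
def ancestors(nodes, parents):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return seen

def d_separated(I, J, K, parents):
    """True if K d-separates I and J in the DAG given by the `parents` dict."""
    keep = ancestors(set(I) | set(J) | set(K), parents)
    nbrs = {v: set() for v in keep}
    for v in keep:
        ps = list(parents[v])
        for p in ps:                      # undirected version of each edge
            nbrs[v].add(p)
            nbrs[p].add(v)
        for p in ps:                      # marry parents of a common child
            for q in ps:
                if p != q:
                    nbrs[p].add(q)
    blocked, seen = set(K), set()
    stack = [v for v in I if v not in blocked]
    while stack:                          # search for a path avoiding K
        v = stack.pop()
        if v in J:
            return False
        if v in seen:
            continue
        seen.add(v)
        stack.extend(w for w in nbrs[v] if w not in blocked and w not in seen)
    return True

chain = {1: [], 2: [1], 3: [2]}                 # 1 -> 2 -> 3
collider = {1: [], 2: [], 3: [1, 2], 4: [3]}    # 1 -> 3 <- 2, 3 -> 4
print(d_separated({1}, {3}, {2}, chain))        # non-collider in K blocks the path
print(d_separated({1}, {2}, set(), collider))   # unconditioned collider blocks
print(d_separated({1}, {2}, {4}, collider))     # descendant of collider in K unblocks
```

The three calls reproduce the cases in the definition of blocking: a non-collider in $K$, a collider with no descendant in $K$, and a collider whose descendant is in $K$.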

We provide further that

> Given any disjoint  $J, K \subseteq V$  and any  $i \in V$. 
>
> **Prop I.** If  $K$  blocks all directed paths from  $J$  to  $i$, then $X_{i}\left(\boldsymbol{x}_{J}, \boldsymbol{x}_{K}\right)=X_{i}\left(\boldsymbol{x}_{K}\right)$. 
>
> **Cor I.** $X_{i}\left(\boldsymbol{x}_{J}\right)=X_{i}\left(\boldsymbol{x}_{J \cap \operatorname{an}(i)} \right)$.
> 
> **Prop II.** $\boldsymbol{X}_{J}\left(\boldsymbol{x}_{K}\right)=\boldsymbol{x}_{J} \Longrightarrow X_{i}\left(\boldsymbol{x}_{J}, \boldsymbol{x}_{K}\right)=X_{i}\left(\boldsymbol{x}_{K}\right)$.

**Proof.** The first follows from the observation: if  $K$  blocks all directed paths from  $J$  to  $i$, then  $K$  also blocks directed paths from  $J$  to  $p a(i) \backslash K$. The second follows from the following equations and induction

$$\begin{aligned}
X_{i}\left(\boldsymbol{x}_{J}, \boldsymbol{x}_{K}\right)=X_{i}\left(\boldsymbol{x}_{p a(i) \cap J}, \boldsymbol{x}_{p a(i) \cap K}, \boldsymbol{X}_{p a(i) \backslash J \backslash K}\left(\boldsymbol{x}_{J}, \boldsymbol{x}_{K}\right)\right), \\
X_{i}\left(\boldsymbol{x}_{K}\right)=X_{i}\left(\boldsymbol{X}_{p a(i) \cap J}\left(\boldsymbol{x}_{K}\right), \boldsymbol{x}_{p a(i) \cap K}, \boldsymbol{X}_{p a(i) \backslash J \backslash K}\left(\boldsymbol{x}_{K}\right)\right) .
\end{aligned}$$

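> An NPSEM satisfies the **single-world independence assumptions** if, for every fixed $\boldsymbol{x}$, the basic counterfactuals $X_{i}\left(\boldsymbol{x}_{p a(i)}\right)$, $i \in[p]$, are mutually independent. It satisfies the **multiple-world independence assumptions** if the noise variables $\epsilon_{i}$, $i \in[p]$, are mutually independent.
>
> A **causal model** is an NPSEM together with the single-world independence assumptions.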

Note that the multiple-world independence assumptions imply the single-world ones and, in addition, cross-world independences. For example, in the DAG with edges $1 \to 2$, $1 \to 3$ and $2 \to 3$, they imply
$$X_{2}\left(x_{1}\right) \perp X_{3}\left(\tilde{x}_{1}, x_{2}\right) \text { for any } x_{1} \neq \tilde{x}_{1}, x_{2} .$$

> The single-world intervention graph (SWIG) $\mathcal{G}\left[\boldsymbol{x}_{J}\right]$ for the intervention  $\boldsymbol{X}_{J}=\boldsymbol{x}_{J}$  is constructed from  $\mathcal{G}$  via the following two steps:
>
> (i) Node splitting: For every  $j \in J$, split the vertex  $X_{j}$  into a random and a fixed component, labelled $X_{j}$ and $x_{j}$  respectively. The random half inherits all edges into  $X_{j}$ and the fixed half inherits all edges out of  $X_{j}$.
>
> (ii) Labelling: For every random node  $X_{i}$  in the new graph, label it with  $X_{i}\left(\boldsymbol{x}_{J}\right)=   X_{i}\left(\boldsymbol{x}_{J \cap a n(i)}\right)$.
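
The node-splitting step is easy to express on an edge list. The sketch below is a hypothetical representation (fixed halves are encoded as tuples `("fixed", j)`) and omits the relabelling step (ii), which only changes how the random nodes are named.

```python
def swig(edges, J):
    """Node splitting: edges out of an intervened node leave its fixed half.

    edges: list of (parent, child) pairs with string node names
    J:     set of intervened nodes
    Edges into a node in J are unchanged, so they enter its random half.
    """
    return [((("fixed", u) if u in J else u), v) for u, v in edges]

def random_part(swig_edges):
    """Drop the fixed nodes, keeping only edges between random nodes."""
    return [(u, v) for u, v in swig_edges
            if not (isinstance(u, tuple) and u[0] == "fixed")]

# Hypothetical DAG: X -> A, X -> Y, A -> Y, with an intervention on A.
edges = [("X", "A"), ("X", "Y"), ("A", "Y")]
g_a = swig(edges, {"A"})
g_star = random_part(g_a)
print(g_a)
print(g_star)
```

In the random part, the only remaining connection between `A` and `Y` runs through `X`, which is the graphical reason why conditioning on $X$ can separate $A$ from $Y(a)$ in this example.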

> ( **Factorisation of counterfactual distributions** ). Suppose  $\boldsymbol{X}$  satisfies the causal model defined by a DAG  $\mathcal{G}$. Then, for every $J \subseteq[p]$, $\boldsymbol{X}\left(\boldsymbol{x}_{J}\right)$  factorises according to $\mathcal{G}^{*}\left[\boldsymbol{x}_{J}\right]$, the random part of the SWIG $\mathcal{G}\left[\boldsymbol{x}_{J}\right]$ (obtained by removing the fixed nodes $\boldsymbol{x}_{J}$).

**Proof.** It's clear that 

> For any  $k \notin J$ s.t. $\operatorname{de}(k) \subseteq J$,
>
> $$\mathbb{P}\left(X_{i}\left(\boldsymbol{x}_{J}, \tilde{x}_{k}\right)=\tilde{x}_{i} \mid \boldsymbol{X}_{p a(i) \backslash J \backslash\{k\}}\left(\boldsymbol{x}_{J}, \tilde{x}_{k}\right)=\tilde{\boldsymbol{x}}_{p a(i) \backslash J \backslash\{k\}}\right) = \mathbb{P}\left(X_{i}\left(\boldsymbol{x}_{J}\right)=\tilde{x}_{i} \mid \boldsymbol{X}_{p a(i) \backslash J}\left(\boldsymbol{x}_{J}\right)=\tilde{\boldsymbol{x}}_{p a(i) \backslash J}\right) .$$

The claim then follows by reverse induction on $J$, passing from $J \cup\{k\} \subseteq[p]$ to $J$, where $k \notin J$ and $\operatorname{de}(k) \subseteq J$.

Suppose $\boldsymbol{X}$ satisfies the causal model, then

$$ \mathbb{P}\left(X_{i}\left(\boldsymbol{x}_{J}\right)=\tilde{x}_{i} \mid \boldsymbol{X}_{p a(i) \backslash J}\left(\boldsymbol{x}_{J}\right)=\tilde{\boldsymbol{x}}_{p a(i) \backslash J}\right) = \mathbb{P}\left(X_{i}=\tilde{x}_{i} \mid \boldsymbol{X}_{p a(i) \backslash J}=\tilde{\boldsymbol{x}}_{p a(i) \backslash J}, \boldsymbol{X}_{p a(i) \cap J}=\boldsymbol{x}_{p a(i) \cap J}\right).$$

This shows that 

> $\mathbb{P}\left(\boldsymbol{X}\left(\boldsymbol{x}_{J}\right)=\tilde{\boldsymbol{x}}\right)=\prod_{i=1}^{p} \mathbb{P}\left(X_{i}=\tilde{x}_{i} \mid \boldsymbol{X}_{p a(i) \cap J}=\boldsymbol{x}_{p a(i) \cap J}, \boldsymbol{X}_{p a(i) \backslash J}=\tilde{\boldsymbol{x}}_{p a(i) \backslash J}\right)$.
>
> $\mathbb{P}\left(\boldsymbol{X}_{I}\left(\boldsymbol{x}_{J}\right)=\tilde{\boldsymbol{x}}_{I}\right)=\sum_{\tilde{\boldsymbol{x}}_{K}} \prod_{i \in I \cup K} \mathbb{P}\left(X_{i}=\tilde{x}_{i} \mid \boldsymbol{X}_{p a(i) \cap J}=\boldsymbol{x}_{p a(i) \cap J}, \boldsymbol{X}_{p a(i) \backslash J}=\tilde{\boldsymbol{x}}_{p a(i) \backslash J}\right)$, where $I,J,K$ form a partition of $[p]$.

**Proof.** The second equation is a corollary of the first, by noting that 

$$\mathbb{P}\left(\boldsymbol{X}_{I}\left(\boldsymbol{x}_{J}\right)=\tilde{\boldsymbol{x}}_{I}\right)=\sum_{\tilde{\boldsymbol{x}}_{K}, \tilde{\boldsymbol{x}}_{J}} \prod_{i=1}^{p} \mathbb{P}\left(X_{i}=\tilde{x}_{i} \mid \boldsymbol{X}_{p a(i) \cap J}=\boldsymbol{x}_{p a(i) \cap J}, \boldsymbol{X}_{p a(i) \backslash J}=\tilde{\boldsymbol{x}}_{p a(i) \backslash J}\right) .$$

---

In the graphical framework, we can check  $Y(a) \perp A \mid X$  by d-separation in the SWIG. Because there is no out-going arrow from  $A$  in  $\mathcal{G}^{*}[a]$, this essentially says that every back-door path from  $A$  to  $Y$  (i.e. the path has an edge going into  $A$) must be blocked by  $X$.

> ( **Back-door adjustment** ). Suppose $(\boldsymbol{X}, A, Y)$  are variables in a causal model  $\mathcal{G}$  that may contain other unobserved variables. Suppose  $\boldsymbol{X}$  contains no descendant of  $A$  and blocks every back-door path from  $A$  to  $Y$  in  $\mathcal{G}$. Then  $Y(a) \perp A \mid \boldsymbol{X}$   and
>
> $$\mathbb{P}(Y(a) \leq y)=\sum_{\boldsymbol{x}} \mathbb{P}(\boldsymbol{X}=\boldsymbol{x}) \cdot \mathbb{P}(Y \leq y \mid A=a, \boldsymbol{X}=\boldsymbol{x}), \quad \forall a, y.$$


---

### 3. No Unmeasured Confounders

An observational study is an empirical investigation that uses observational data (without manipulation or intervention). It involves two stages: **design** and **analysis**. 

> ( **Randomised Experiment** ) Randomisation allows us to choose between statistical
error and causality. This reasoning is inductive.
>
> ( **Observational Study** ) Randomisation is replaced by pair matching. As a consequence, apart
from statistical error and causality, a third possible **explanation** is that the treated patients
and the control patients are systematically different in some other way.
>
> Assume  $\left(\boldsymbol{X}_{i}, A_{i}, Y_{i}(0), Y_{i}(1)\right)$,  $i=1, \ldots, n$, are i.i.d. 
> 
> **Assumption of no unmeasured confounders.** $A \perp Y(a) \mid \boldsymbol{X}$ for all $a$.

For matching, we usually apply **propensity score matching**. We say $b(\boldsymbol{X})$  is a **balancing score** if  $A \perp \boldsymbol{X} \mid b(\boldsymbol{X})$. Under the assumption of no unmeasured confounders, it follows that

$$ A \perp Y(a) \mid b(\boldsymbol{X}).$$


Among all the balancing scores, of particular interest is the propensity score $\pi(\boldsymbol{x})=\mathbb{P}(A=1 \mid \boldsymbol{X}=\boldsymbol{x})$, which can be written as  a function of any balancing score $b(\boldsymbol{X})$. A popular distance measure is the squared distance between the estimated propensity scores on the logit scale:
$$d_{\mathrm{PS}}\left(\boldsymbol{X}_{i}, \boldsymbol{X}_{j}\right)=\left[\log \left(\frac{\hat{\pi}\left(\boldsymbol{X}_{i}\right)}{1-\hat{\pi}\left(\boldsymbol{X}_{i}\right)}\right)-\log \left(\frac{\hat{\pi}\left(\boldsymbol{X}_{j}\right)}{1-\hat{\pi}\left(\boldsymbol{X}_{j}\right)}\right)\right]^{2}.$$
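
The distance above can be combined with a simple matching rule. The sketch below assumes the propensity scores have already been estimated and uses greedy nearest-neighbour pair matching, which is only one of several possible matching algorithms and is not claimed to be the one intended in the notes. The scores are made-up numbers.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def d_ps(p_i, p_j):
    """Squared distance between two estimated propensity scores on the logit scale."""
    return (logit(p_i) - logit(p_j)) ** 2

def greedy_pair_match(ps_treated, ps_control):
    """Match each treated unit to its nearest unused control (greedy, not optimal).

    Assumes there are at least as many controls as treated units.
    """
    unused = set(range(len(ps_control)))
    pairs = []
    for i, p in enumerate(ps_treated):
        j = min(unused, key=lambda c: d_ps(p, ps_control[c]))
        pairs.append((i, j))
        unused.remove(j)
    return pairs

# Hypothetical estimated propensity scores.
ps_treated = [0.7, 0.4]
ps_control = [0.35, 0.72, 0.1, 0.5]
pairs = greedy_pair_match(ps_treated, ps_control)
print(pairs)
```

Greedy matching depends on the order in which treated units are processed. Optimal matching, which minimises the total distance over all pairs, avoids this at a higher computational cost.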

Next we consider the statistical inference after matching.

> Assume treated observation  $i$  is matched to control observation  $i+n_{1}$, $i\in [n_{1}]$ . 

Let $D_{i}=\left(A_{i}-A_{i+n_{1}}\right)\left(Y_{i}-Y_{i+n_{1}}\right)$ be the treated-minus-control difference in pair  $i$. Let 

$$M=\left\{\boldsymbol{a}_{\left[2 n_{1}\right]} \in\{0,1\}^{2 n_{1}} \mid a_{i}+a_{i+n_{1}}=1, \forall i \in\left[n_{1}\right]\right\}$$

be all the reasonable treatment assignments. Let  $\boldsymbol{C}_{i}=\left(\boldsymbol{X}_{i}, Y_{i}(0), Y_{i}(1)\right)$.



> $$\pi\left(\boldsymbol{X}_{i}\right)=\pi\left(\boldsymbol{X}_{i+n_{1}}\right) \implies \mathbb{P}\left(\boldsymbol{A}_{\left[2 n_{1}\right]}=\boldsymbol{a} \mid \boldsymbol{C}_{\left[2 n_{1}\right]}, \boldsymbol{A}_{\left[2 n_{1}\right]} \in M\right)=\left\{\begin{array}{ll}
2^{-n_{1}}, & \text { if } \boldsymbol{a} \in M \\
0, & \text { otherwise }
\end{array}\right.$$

This means that $\pi\left(\boldsymbol{X}_{i}\right)=\pi\left(\boldsymbol{X}_{i+n_{1}}\right)$ implies the **assumption that matching reconstructs a pairwise randomised experiment**.

Consider $H_{\beta}: Y_{i}(1)-Y_{i}(0)=\beta, \forall i$. Under  $H_{\beta}$, the counterfactual values of  $\boldsymbol{D}_{\left[n_{1}\right]}$  can be imputed as

$$D_{i}\left(\boldsymbol{a}_{\left[2 n_{1}\right]}\right)=\left(a_{i}-a_{i+n_{1}}\right) \cdot\left(Y_{i}\left(a_{i}\right)-Y_{i+n_{1}}\left(a_{i+n_{1}}\right)\right)=\left\{\begin{array}{ll}
D_{i}, & \text { if } a_{i}=1, a_{i+n_{1}}=0 \\
2 \beta-D_{i}, & \text { if } a_{i}=0, a_{i+n_{1}}=1
\end{array}\right.$$

Consider any test statistic  $T=T\left(\boldsymbol{D}_{\left[n_{1}\right]}\right)$. Next we construct a randomisation test based on the randomisation distribution of  $T\left(\boldsymbol{D}_{\left[n_{1}\right]}\left(\boldsymbol{A}_{\left[2 n_{1}\right]}\right)\right)$. Let  $F(t)$  denote its cumulative distribution function given  $\boldsymbol{C}_{\left[2 n_{1}\right]}$  and  $\boldsymbol{A}_{\left[2 n_{1}\right]} \in M$  under  $H_{\beta}$:

$$\begin{aligned}
F\left(t ; \boldsymbol{D}_{\left[n_{1}\right]}, \beta\right) & =\mathbb{P}\left(T\left(\boldsymbol{D}_{\left[n_{1}\right]}\left(\boldsymbol{A}_{\left[2 n_{1}\right]}\right)\right) \leq t \mid \boldsymbol{C}_{\left[2 n_{1}\right]}, \boldsymbol{A}_{\left[2 n_{1}\right]} \in M, H_{\beta}\right) \\
& =\sum_{\boldsymbol{a}_{\left[2 n_{1}\right]} \in M}\left(\frac{1}{2}\right)^{n_{1}} \cdot I\left(T\left(\boldsymbol{D}_{\left[n_{1}\right]}\left(\boldsymbol{a}_{\left[2 n_{1}\right]}\right)\right) \leq t\right) .
\end{aligned}$$

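> Under the assumption that matching reconstructs a pairwise randomised experiment, the corresponding $p$-value $P_{2}=F\left(T ; \boldsymbol{D}_{\left[n_{1}\right]}, \beta\right)$ satisfies $\mathbb{P}\left(P_{2} \leq \alpha\right) \leq \alpha$  under  $H_{\beta}$.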
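
For a handful of matched pairs, the randomisation distribution can be enumerated exactly: each of the $2^{n_1}$ assignments in $M$ either keeps $D_i$ or replaces it by $2\beta-D_i$. The sketch below uses hypothetical pair differences and the statistic $T(\boldsymbol{D})=-\operatorname{mean}(\boldsymbol{D})$, chosen so that small values of $T$ are evidence that the effect exceeds $\beta$.

```python
from itertools import product
from statistics import mean

def paired_randomisation_p_value(d, beta=0.0):
    """Exact P(T* <= T_obs) under H_beta for T(D) = -mean(D)."""
    t_obs = -mean(d)
    count = 0
    for flips in product((False, True), repeat=len(d)):   # the 2^{n_1} elements of M
        # Swapping the labels within pair i maps D_i to 2 * beta - D_i.
        d_star = [2 * beta - di if flip else di for di, flip in zip(d, flips)]
        count += (-mean(d_star)) <= t_obs + 1e-12         # tolerance for float ties
    return count / 2 ** len(d)

# Hypothetical treated-minus-control differences for n_1 = 5 pairs.
d = [1.2, 0.8, 1.5, 0.3, 0.9]
p_zero = paired_randomisation_p_value(d, beta=0.0)
p_one = paired_randomisation_p_value(d, beta=1.0)
print(p_zero, p_one)
```

With all five differences positive, only the observed assignment is as extreme as itself under $\beta=0$, so `p_zero` equals $1/32$. The hypothesis $\beta=1$ is far from being rejected. Inverting the test over a grid of $\beta$ values would give a confidence set for a constant effect.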

---

Next we apply some semiparametric inference. 

> **Assumption of Positivity.** $\pi_{a}(\boldsymbol{x})=\mathbb{P}(A=a \mid \boldsymbol{X}=\boldsymbol{x})>0$, $\forall a, \boldsymbol{x}$.
>
> This is also called the overlap assumption, because by the Bayes rule, it is equivalent to assuming that the distribution of $\boldsymbol{X}$  has the same support given  $A=a$  for all  $a$.

We have $\mathrm{ATE}=\mathbb{E}[Y(1)-Y(0)]=\mathbb{E}\{\mathbb{E}[Y \mid A=1, \boldsymbol{X}]-\mathbb{E}[Y \mid A=0, \boldsymbol{X}]\}$. Writing $\mu_{a}(\boldsymbol{x})=\mathbb{E}[Y \mid A=a, \boldsymbol{X}=\boldsymbol{x}]$, it then suffices to estimate, for discrete $\boldsymbol{X}$,

$$\beta_{a}=\mathbb{E}\{\mathbb{E}[Y \mid A=a, \boldsymbol{X}]\}=\sum_{\boldsymbol{x}} \mu_{a}(\boldsymbol{x}) \mathbb{P}(\boldsymbol{X}=\boldsymbol{x}).$$

Given an iid sample from the population, we can empirically estimate by

$$\begin{aligned}
\hat{\mu}_{a}(\boldsymbol{x}) & = \frac{\sum_{i=1}^{n} I\left(A_{i}=a, \boldsymbol{X}_{i}=\boldsymbol{x}\right) Y_{i}}{\sum_{i=1}^{n} I\left(A_{i}=a, \boldsymbol{X}_{i}=\boldsymbol{x}\right)}, \\
\hat{\mathbb{P}}(\boldsymbol{X}=\boldsymbol{x}) & = \frac{1}{n} \sum_{i=1}^{n} I\left(\boldsymbol{X}_{i}=\boldsymbol{x}\right)
\end{aligned}$$


Therefore, we obtain the OR (outcome regression) estimator

$$\hat{\beta}_{a, \mathrm{OR}}=\sum_{\boldsymbol{x}} \hat{\mu}_{a}(\boldsymbol{x}) \hat{\mathbb{P}}\left(\boldsymbol{X}=\boldsymbol{x}\right)=\frac{1}{n} \sum_{i=1}^{n} \sum_{\boldsymbol{x}} \hat{\mu}_{a}(\boldsymbol{x}) I\left(\boldsymbol{X}_{i}=\boldsymbol{x}\right)=\frac{1}{n} \sum_{i=1}^{n} \hat{\mu}_{a}\left(\boldsymbol{X}_{i}\right) .$$

> The ATE can be estimated by $\hat{\beta}_{\mathrm{OR}}=\hat{\beta}_{1, \mathrm{OR}}-\hat{\beta}_{0, \mathrm{OR}}=\frac{1}{n} \sum_{i=1}^{n}\left[\hat{\mu}_{1}\left(\boldsymbol{X}_{i}\right)-\hat{\mu}_{0}\left(\boldsymbol{X}_{i}\right)\right]$.
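
The OR estimator can be sketched in a few lines for a single discrete covariate. The function name `or_estimate` and the toy data are hypothetical; the sketch assumes every observed value of the covariate contains at least one unit in each treatment arm (empirical positivity).

```python
import numpy as np

def or_estimate(a, A, X, Y):
    """Outcome-regression estimate of beta_a for a discrete covariate X.

    mu_hat_a(x) is the mean of Y among units with A == a and X == x;
    the estimate averages mu_hat_a(X_i) over all n units.
    Assumes every observed x has at least one unit with A == a.
    """
    A, X, Y = map(np.asarray, (A, X, Y))
    mu_hat = {x: Y[(A == a) & (X == x)].mean() for x in np.unique(X)}
    return float(np.mean([mu_hat[x] for x in X]))

# Hypothetical toy data: one binary covariate.
A = np.array([1, 0, 1, 0, 1, 0])
X = np.array([0, 0, 0, 1, 1, 1])
Y = np.array([3.0, 1.0, 5.0, 2.0, 6.0, 4.0])
ate_or = or_estimate(1, A, X, Y) - or_estimate(0, A, X, Y)
print(ate_or)
```

Here the stratum means are 4 and 6 for the treated and 1 and 3 for the controls, so the two arm estimates are 5 and 2.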


Recall  $\pi_{a}(\boldsymbol{x})=\mathbb{P}(A=a \mid \boldsymbol{X}=\boldsymbol{x})$, which can be estimated by

$$\hat{\pi}_{a}(\boldsymbol{x})=\frac{\sum_{i=1}^{n} I\left(A_{i}=a, \boldsymbol{X}_{i}=\boldsymbol{x}\right)}{\sum_{i=1}^{n} I\left(\boldsymbol{X}_{i}=\boldsymbol{x}\right)} .$$


> The IPW (inverse probability weighting) estimator is 
> $$\hat{\beta}_{a, \mathrm{IPW}}=\frac{1}{n} \sum_{i=1}^{n} \frac{I\left(A_{i}=a\right)}{\hat{\pi}_{a}\left(\boldsymbol{X}_{i}\right)} Y_{i} = \hat{\beta}_{a, \mathrm{OR}},$$ 
> where we suppose $\hat{\pi}_{a}(\boldsymbol{x})>0$; it follows that $\hat{\beta}_{\mathrm{OR}}=\hat{\beta}_{\mathrm{IPW}}$.
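
The equality of the IPW and OR estimators can be checked directly. As shorthand (not used elsewhere in these notes), write $n_{a, \boldsymbol{x}}=\sum_{i} I\left(A_{i}=a, \boldsymbol{X}_{i}=\boldsymbol{x}\right)$ and $n_{\boldsymbol{x}}=\sum_{i} I\left(\boldsymbol{X}_{i}=\boldsymbol{x}\right)$, so that $\hat{\pi}_{a}(\boldsymbol{x})=n_{a, \boldsymbol{x}} / n_{\boldsymbol{x}}$ and $\sum_{i} I\left(A_{i}=a, \boldsymbol{X}_{i}=\boldsymbol{x}\right) Y_{i}=n_{a, \boldsymbol{x}}\, \hat{\mu}_{a}(\boldsymbol{x})$. Grouping the sum by the value of $\boldsymbol{X}_{i}$ gives

$$\begin{aligned}
\hat{\beta}_{a, \mathrm{IPW}} & =\frac{1}{n} \sum_{\boldsymbol{x}} \frac{1}{\hat{\pi}_{a}(\boldsymbol{x})} \sum_{i=1}^{n} I\left(A_{i}=a, \boldsymbol{X}_{i}=\boldsymbol{x}\right) Y_{i} \\
& =\frac{1}{n} \sum_{\boldsymbol{x}} \frac{n_{\boldsymbol{x}}}{n_{a, \boldsymbol{x}}} \cdot n_{a, \boldsymbol{x}}\, \hat{\mu}_{a}(\boldsymbol{x}) =\sum_{\boldsymbol{x}} \hat{\mu}_{a}(\boldsymbol{x}) \frac{n_{\boldsymbol{x}}}{n}=\hat{\beta}_{a, \mathrm{OR}} .
\end{aligned}$$

The identity relies on both estimators using the same saturated (cell-frequency) nuisance estimates; it generally fails once $\hat{\mu}_{a}$ or $\hat{\pi}_{a}$ comes from a parametric model.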


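We have, by adding and subtracting $\mu_{a}(\boldsymbol{X}_{i})$ and using $\sum_{i} I\left(A_{i}=a\right) \mu_{a}\left(\boldsymbol{X}_{i}\right) / \hat{\pi}_{a}\left(\boldsymbol{X}_{i}\right)=\sum_{i} \mu_{a}\left(\boldsymbol{X}_{i}\right)$ (which holds for the empirical $\hat{\pi}_{a}$ defined above),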

$$\begin{aligned}
& \sqrt{n}\left(\hat{\beta}_{a, \mathrm{IPW}}-\beta_{a}\right) \\
= & \frac{1}{\sqrt{n}} \sum_{i=1}^{n}\left\{\frac{I\left(A_{i}=a\right)}{\hat{\pi}_{a}\left(\boldsymbol{X}_{i}\right)}\left[Y_{i}-\mu_{a}\left(\boldsymbol{X}_{i}\right)\right]+\mu_{a}\left(\boldsymbol{X}_{i}\right)-\beta_{a}\right\} \\
= & \frac{1}{\sqrt{n}} \sum_{i=1}^{n}\left\{\frac{I\left(A_{i}=a\right)}{\pi_{a}\left(\boldsymbol{X}_{i}\right)}\left[Y_{i}-\mu_{a}\left(\boldsymbol{X}_{i}\right)\right]+\mu_{a}\left(\boldsymbol{X}_{i}\right)-\beta_{a}\right\}+R_{n} .
\end{aligned}$$

The residual term

$$R_{n}=\frac{1}{\sqrt{n}} \sum_{i=1}^{n} I\left(A_{i}=a\right)\left[\frac{1}{\hat{\pi}_{a}\left(\boldsymbol{X}_{i}\right)}-\frac{1}{\pi_{a}\left(\boldsymbol{X}_{i}\right)}\right]\left[Y_{i}-\mu_{a}\left(\boldsymbol{X}_{i}\right)\right] \xrightarrow{p} 0$$

as  $n \rightarrow \infty$. This is because  $\hat{\pi}_{a}(\boldsymbol{x})$  generally converges to  $\pi_{a}(\boldsymbol{x})$  at the  $1 / \sqrt{n}$  rate and the other factor,  $I\left(A_{i}=a\right)\left[Y_{i}-\mu_{a}\left(\boldsymbol{X}_{i}\right)\right]$, is iid with mean $0$. This shows that  $\hat{\beta}_{a, \mathrm{IPW}}$  admits an asymptotically linear expansion with the influence function

$$\psi_{\beta_{a}}(\boldsymbol{D})=\frac{I(A=a)}{\pi_{a}(\boldsymbol{X})}\left[Y-\mu_{a}(\boldsymbol{X})\right]+\mu_{a}(\boldsymbol{X})-\beta_{a} =: m_{a}\left(\boldsymbol{D} ; \mu_{a}, \pi_{a}\right)-\beta_{a}.$$

> Therefore, since $\hat{\beta}_{a, \mathrm{OR}}=\hat{\beta}_{a, \mathrm{IPW}}$, $\sqrt{n}\left(\hat{\beta}_{a, \mathrm{OR}}-\beta_{a}\right) \xrightarrow{d} \mathrm{~N}\left(0, \operatorname{Var}\left(\psi_{\beta_{a}}\left(\boldsymbol{D}_{i}\right)\right)\right)$.

Combining the OR estimator and the IPW estimator, we obtain the doubly robust (DR) estimator $\hat{\beta}_{a, \mathrm{DR}}=\frac{1}{n} \sum_{i=1}^{n} m_{a}\left(\boldsymbol{D}_{i} ; \hat{\mu}_{a}, \hat{\pi}_{a}\right)$, which is more efficient and more robust than either estimator alone.
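
A minimal sketch of the DR estimator, assuming the nuisance estimates $\hat{\mu}_{a}(\boldsymbol{X}_{i})$ and $\hat{\pi}_{a}(\boldsymbol{X}_{i})$ have already been computed and are passed in as arrays. The function name `dr_estimate` and the toy numbers are hypothetical, and the standard error is a plug-in estimate of $\sqrt{\operatorname{Var}(\psi_{\beta_a}) / n}$ that presumes both nuisance estimates are consistent.

```python
import numpy as np

def dr_estimate(a, A, Y, mu_hat, pi_hat):
    """Doubly robust estimate of beta_a and a plug-in standard error.

    mu_hat[i] is an estimate of mu_a(X_i) and pi_hat[i] of pi_a(X_i);
    how these nuisance estimates are fitted is left to the caller.
    """
    A, Y, mu_hat, pi_hat = (np.asarray(v, dtype=float)
                            for v in (A, Y, mu_hat, pi_hat))
    # m_a(D_i; mu_hat, pi_hat): IPW-weighted residual plus the regression fit.
    m = (A == a) / pi_hat * (Y - mu_hat) + mu_hat
    beta = m.mean()
    se = m.std(ddof=1) / np.sqrt(len(m))
    return float(beta), float(se)

# Hypothetical toy data with saturated (cell-frequency) nuisance estimates.
A = np.array([1, 0, 1, 0, 1, 0])
Y = np.array([3.0, 1.0, 5.0, 2.0, 6.0, 4.0])
mu1_hat = np.array([4.0, 4.0, 4.0, 6.0, 6.0, 6.0])   # mu_hat_1(X_i)
pi1_hat = np.array([2/3, 2/3, 2/3, 1/3, 1/3, 1/3])   # pi_hat_1(X_i)
beta1_dr, se1 = dr_estimate(1, A, Y, mu1_hat, pi1_hat)
print(beta1_dr, se1)
```

With saturated nuisance estimates the weighted residuals average to zero, so the DR estimate coincides with the OR and IPW estimates; the estimators differ once $\hat{\mu}_{a}$ or $\hat{\pi}_{a}$ is modelled.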

---

### 4. Unmeasured Confounders

Let $\pi_{i}=\mathbb{P}\left(A_{i}=1 \mid \boldsymbol{C}_{i}\right), i \in\left[2 n_{1}\right]$ where $\boldsymbol{C}_{i}=\left(\boldsymbol{X}_{i}, Y_{i}(0), Y_{i}(1)\right)$.

> Rosenbaum's sensitivity model assumes that, for a given value  $\Gamma \geq 1$,
> $$\frac{1}{\Gamma} \leq \frac{\pi_{i} \big/\left(1-\pi_{i}\right)}{\pi_{n_{1}+i} \big/\left(1-\pi_{n_{1}+i}\right)} \leq \Gamma, \forall i \in\left[n_{1}\right].$$


Then we have 
$$ \frac{1}{1+\Gamma} \leq \mathbb{P}\left(A_{i}=1, A_{n_{1}+i}=0 \mid \boldsymbol{C}_{\left[2 n_{1}\right]}, A_{i}+A_{n_{1}+i}=1\right)=\frac{\pi_{i}\left(1-\pi_{n_{1}+i}\right)}{\pi_{i}\left(1-\pi_{n_{1}+i}\right)+\pi_{n_{1}+i}\left(1-\pi_{i}\right)} \leq \frac{\Gamma}{1+\Gamma}, $$


so $\Gamma=1$ recovers the case of no unmeasured confounders, in which each within-pair assignment has probability $1/2$.
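
The bound on the within-pair assignment probability follows from a one-line calculation. Write $\rho_{i}$ (a shorthand introduced here) for the odds ratio in the sensitivity model and assume $0<\pi_{i}<1$ for all units. Dividing the numerator and denominator by $\pi_{n_{1}+i}\left(1-\pi_{i}\right)$ gives

$$\frac{\pi_{i}\left(1-\pi_{n_{1}+i}\right)}{\pi_{i}\left(1-\pi_{n_{1}+i}\right)+\pi_{n_{1}+i}\left(1-\pi_{i}\right)}=\frac{\rho_{i}}{1+\rho_{i}}, \qquad \rho_{i}=\frac{\pi_{i} \big/\left(1-\pi_{i}\right)}{\pi_{n_{1}+i} \big/\left(1-\pi_{n_{1}+i}\right)} \in\left[\frac{1}{\Gamma}, \Gamma\right].$$

Since $\rho \mapsto \rho /(1+\rho)$ is increasing, the probability lies between $\frac{1 / \Gamma}{1+1 / \Gamma}=\frac{1}{1+\Gamma}$ and $\frac{\Gamma}{1+\Gamma}$.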

Next we consider the randomisation distribution of the signed score statistic $T_{\psi}\left(\boldsymbol{D}_{\left[n_{1}\right]}\right)=\sum_{i=1}^{n_{1}} \operatorname{sgn}\left(D_{i}\right) \psi\left(\frac{\operatorname{rank}\left(\left|D_{i}\right|\right)}{n_{1}+1}\right)$ under Rosenbaum's sensitivity model. Given  $H_{0}$  and conditioning on  $\boldsymbol{C}_{\left[2 n_{1}\right]} $,

$$T\left(\boldsymbol{A}_{\left[2 n_{1}\right]}, \boldsymbol{Y}_{\left[2 n_{1}\right]}(0)\right) \,\Big|\, \boldsymbol{A}_{\left[2 n_{1}\right]} \in M \;\stackrel{d}{=}\; \sum_{i=1}^{n_{1}} S_{i} \psi\left(\frac{\operatorname{rank}\left(\left|Y_{i}(0)-Y_{n_{1}+i}(0)\right|\right)}{n_{1}+1}\right),$$

where  $S_{i}=\left(A_{i}-A_{n_{1}+i}\right) \cdot \operatorname{sgn}\left(Y_{i}(0)-Y_{n_{1}+i}(0)\right)$. We say that a random variable $X$  stochastically dominates $Y$, written  $X \succeq Y$, if  $\mathbb{P}(X>t) \geq \mathbb{P}(Y>t)$ for all $t$. Notice that, under the sensitivity model, $S_{i}$  stochastically dominates the following random variable

$$S_{i}^{-}=\left\{\begin{array}{ll}
-1, & \text { with probability } \Gamma /(1+\Gamma) \\
1, & \text { with probability } 1 /(1+\Gamma)
\end{array}\right.$$

This can be used to obtain a (sharp) bound on the $p$-value:

> Using the signed score statistic and given  $H_{0}$, $T\left(\boldsymbol{A}_{\left[2 n_{1}\right]}, \boldsymbol{Y}_{\left[2 n_{1}\right]}(0)\right) \succeq \sum_{i=1}^{n_{1}} S_{i}^{-} \psi\left(\frac{\operatorname{rank}\left(\left|D_{i}-\beta\right|\right)}{n_{1}+1}\right)$.

**Proof.** This follows from noticing  $\left|D_{i}-\beta\right|=\left|Y_{i}(0)-Y_{n_{1}+i}(0)\right|$  and the following property of stochastic ordering: If  $X_{i} \succeq Y_{i}$ and  $X_{i} \perp X_{j}, Y_{i} \perp Y_{j}$  for all  $i \neq j$, then  $$\sum_{i=1}^{n} X_{i} \succeq \sum_{i=1}^{n} Y_{i}.$$
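
To make the bound concrete, the tail probability of the bounding variable $\sum_{i} S_{i}^{-} \psi(\cdot)$ can be computed exactly for small $n_{1}$. This is a sketch under stated assumptions: the function name `sign_sum_tail` is hypothetical, the scores $\psi\left(\operatorname{rank}\left(\left|D_{i}-\beta\right|\right) /\left(n_{1}+1\right)\right)$ are passed in as precomputed numbers, and the signs are taken to be independent with $\mathbb{P}(S_i = +1)$ equal to `p_plus`.

```python
import itertools
import numpy as np

def sign_sum_tail(scores, t, p_plus):
    """P(sum_i S_i * scores[i] > t) for independent signs S_i in {-1, +1}.

    Each S_i equals +1 with probability p_plus. Exact enumeration over
    all 2^{n1} sign patterns, so only suitable for small n1.
    """
    scores = np.asarray(scores, dtype=float)
    n1 = len(scores)
    total = 0.0
    for signs in itertools.product([1, -1], repeat=n1):
        signs = np.array(signs)
        k = int(np.sum(signs == 1))                  # number of +1 signs
        prob = p_plus ** k * (1 - p_plus) ** (n1 - k)
        if signs @ scores > t:
            total += prob
    return total

gamma = 2.0
scores = [1.0, 2.0, 3.0]    # hypothetical score values
# S_i^- takes the value +1 with probability 1 / (1 + gamma).
lower = sign_sum_tail(scores, 5.0, 1 / (1 + gamma))
print(lower)
```

By the stochastic dominance result, `lower` bounds $\mathbb{P}(T>t)$ from below under $H_{0}$ for the given $\Gamma$; setting `p_plus` to $1/2$ recovers the randomisation distribution with no unmeasured confounders.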

---

In the rest of this section, we consider structural specificity. First consider unobserved confounders. Given iid observations of treatment  $A_{i}$, outcome  $Y_{i}$, instrumental variables  $\boldsymbol{Z}_{i}$, observed confounders  $\boldsymbol{X}_{i}$, and unobserved confounders  $\boldsymbol{U}_{i}$, we assume the structural equations for treatment  $A$  and outcome  $Y$  are given by

$$\begin{array}{l}
A=\beta_{0 A}+\boldsymbol{\beta}_{Z A}^{T} \boldsymbol{Z}+\boldsymbol{\beta}_{X A}^{T} \boldsymbol{X}+\boldsymbol{\beta}_{U A}^{T} \boldsymbol{U}+\epsilon_{A}, \\
Y=\beta_{0 Y}+\beta_{A Y} A+\boldsymbol{\beta}_{X Y}^{T} \boldsymbol{X}+\boldsymbol{\beta}_{U Y}^{T} \boldsymbol{U}+\epsilon_{Y} .
\end{array}$$

Using  $\boldsymbol{Z} \perp\left(\boldsymbol{U}, \epsilon_{A}, \epsilon_{Y}\right)$, we obtain

$$\begin{array}{l}
\mathbb{E}[A \mid \boldsymbol{Z}, \boldsymbol{X}]=\tilde{\beta}_{0 A}+\boldsymbol{\beta}_{Z A}^{T} \boldsymbol{Z}+\tilde{\boldsymbol{\beta}}_{X A}^{T} \boldsymbol{X}, \\
\mathbb{E}[Y \mid \boldsymbol{Z}, \boldsymbol{X}]=\tilde{\beta}_{0 Y}+\beta_{A Y} \mathbb{E}[A \mid \boldsymbol{Z}, \boldsymbol{X}]+\tilde{\boldsymbol{\beta}}_{X Y}^{T} \boldsymbol{X}.
\end{array}$$

This motivates the **two-stage least squares estimator** of  $\beta_{A Y}$  :

(1) Estimate  $\mathbb{E}[A \mid \boldsymbol{Z}, \boldsymbol{X}]$  by a least squares regression of $A$  on $ \boldsymbol{Z}$  and  $\boldsymbol{X}$ . Let the fitted model be  $\hat{\mathbb{E}}[A \mid \boldsymbol{Z}, \boldsymbol{X}]$.

(2) Fit another regression of  $Y$  on  $\hat{\mathbb{E}}[A \mid \boldsymbol{Z}, \boldsymbol{X}]$  and  $\boldsymbol{X}$  by least squares, and let  $\hat{\beta}_{A Y}$  be the coefficient of  $\hat{\mathbb{E}}[A \mid \boldsymbol{Z}, \boldsymbol{X}]$.
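
The two steps can be sketched with plain least squares in NumPy. The function name `two_stage_least_squares`, the simulated data, and all coefficient values are hypothetical; the simulation follows the linear structural equations above with a true $\beta_{A Y}=2$ and a confounder $U$ that is never given to the estimator.

```python
import numpy as np

def two_stage_least_squares(Y, A, Z, X):
    """2SLS estimate of beta_AY; Z and X are (n, k) arrays, intercept added here."""
    n = len(Y)
    ones = np.ones((n, 1))
    # Stage 1: regress A on (1, Z, X) and keep the fitted values.
    W1 = np.hstack([ones, Z, X])
    A_hat = W1 @ np.linalg.lstsq(W1, A, rcond=None)[0]
    # Stage 2: regress Y on (1, A_hat, X); the coefficient on A_hat is the estimate.
    W2 = np.hstack([ones, A_hat[:, None], X])
    coef = np.linalg.lstsq(W2, Y, rcond=None)[0]
    return float(coef[1])

# Simulated data from linear structural equations (all coefficients hypothetical).
rng = np.random.default_rng(0)
n = 20_000
Z = rng.normal(size=(n, 1))
X = rng.normal(size=(n, 1))
U = rng.normal(size=n)                      # unobserved confounder
A = 0.5 + 1.0 * Z[:, 0] + 0.5 * X[:, 0] + 1.0 * U + rng.normal(size=n)
Y = 1.0 + 2.0 * A + 0.3 * X[:, 0] + 1.5 * U + rng.normal(size=n)

beta_2sls = two_stage_least_squares(Y, A, Z, X)
# Naive regression of Y on (1, A, X) ignores U and is therefore biased.
W = np.hstack([np.ones((n, 1)), A[:, None], X])
beta_ols = float(np.linalg.lstsq(W, Y, rcond=None)[0][1])
print(beta_2sls, beta_ols)
```

In this simulation the naive regression is biased upwards because $U$ raises both $A$ and $Y$, while the 2SLS estimate should land close to the true coefficient. Note that the second-stage residuals do not give valid standard errors without a correction.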

More formally, we make the following assumptions about the instruments $\boldsymbol{Z}$.

> Relevance:  $\boldsymbol{Z} \not \perp A $.
>
> Exogeneity: $ \boldsymbol{Z} \perp\{A(\boldsymbol{z}), Y(\boldsymbol{z}, a)\}$.
>
> Exclusion restriction:  $Y(\boldsymbol{z}, a)=Y(a)$.

Assume the causal effect of  $A$  on $Y$  is a constant  $\beta$, $Y(a)-Y(\tilde{a})=(a-\tilde{a}) \beta$.


Setting  $\tilde{a}=0$ gives $Y(0)=Y-\beta A$. Let  $\alpha=\mathbb{E}[Y(0)]$. The exogeneity and exclusion restriction imply that, for any function  $g(\boldsymbol{z})$,

$$\mathbb{E}[(Y-\alpha-\beta A) g(\boldsymbol{Z})]=\mathbb{E}[(Y(0)-\alpha) g(\boldsymbol{Z})]=0.$$

Let  $\hat{\alpha}=\bar{Y}-\beta \bar{A}$, where  $\bar{A}=\sum_{i=1}^{n} A_{i} / n$  and  $\bar{Y}=\sum_{i=1}^{n} Y_{i} / n$. The method of moments estimator of  $\beta$  is obtained by solving the empirical version of the moment condition above (with $\alpha$  replaced by  $\hat{\alpha}$). After some algebra, we obtain

$$\hat{\beta}_{g}=\frac{\frac{1}{n} \sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right) g\left(\boldsymbol{Z}_{i}\right)}{\frac{1}{n} \sum_{i=1}^{n}\left(A_{i}-\bar{A}\right) g\left(\boldsymbol{Z}_{i}\right)} .$$


This is an empirical estimator of  $\operatorname{Cov}(Y, g(\boldsymbol{Z})) / \operatorname{Cov}(A, g(\boldsymbol{Z}))$, that is, the Wald ratio with  $\boldsymbol{Z}$  replaced by  $g(\boldsymbol{Z})$. In view of this,  $g(\boldsymbol{Z})$  is a one-dimensional summary statistic of all the instrumental variables.
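
The ratio estimator is short enough to write out directly. The function name `iv_ratio` and the toy data are hypothetical; the sketch takes the values $g(\boldsymbol{Z}_{i})$ as a precomputed array and assumes the sample covariance between $A$ and $g(\boldsymbol{Z})$ is non-zero (relevance).

```python
import numpy as np

def iv_ratio(Y, A, gZ):
    """Method-of-moments estimate: sample Cov(Y, g(Z)) / sample Cov(A, g(Z))."""
    Y, A, gZ = (np.asarray(v, dtype=float) for v in (Y, A, gZ))
    num = np.mean((Y - Y.mean()) * gZ)
    den = np.mean((A - A.mean()) * gZ)
    return float(num / den)

# Sanity check on hypothetical data where Y = 1 + 2 * A holds exactly,
# so any g with non-zero sample covariance with A should recover the slope.
A = np.array([0.0, 1.0, 3.0, 2.0])
Z = np.array([0.0, 1.0, 2.0, 1.0])
Y = 1.0 + 2.0 * A
print(iv_ratio(Y, A, Z))        # g(z) = z
print(iv_ratio(Y, A, Z ** 2))   # g(z) = z^2
```

Different choices of $g$ give the same answer here only because the toy outcome is an exact linear function of $A$; with noisy data the choice of $g$ affects the variance of the estimator.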

---

Then consider the  mediation analysis problem. In this case, the linear  $\mathrm{SEM}$  is

$$\begin{aligned}
M & =\beta_{A M} A+\boldsymbol{\beta}_{X M}^{T} \boldsymbol{X}+\epsilon_{M}, \\
Y & =\beta_{A Y} A+\beta_{M Y} M+\boldsymbol{\beta}_{X Y}^{T} \boldsymbol{X}+\epsilon_{Y} .
\end{aligned}$$

This shows that $Y(a, m) \perp A \mid \boldsymbol{X}$  and  $Y(m) \perp M \mid A, \boldsymbol{X}$, from which it follows that

$$\begin{aligned}
\mathbb{E}[Y(a, m)] & =\mathbb{E}\{\mathbb{E}[Y(a, m) \mid \boldsymbol{X}]\} =\mathbb{E}\{\mathbb{E}[Y(a, m) \mid A=a, \boldsymbol{X}]\} \\
& =\mathbb{E}\{\mathbb{E}[Y(m) \mid A=a, \boldsymbol{X}]\} =\mathbb{E}\{\mathbb{E}[Y(m) \mid A=a, M=m, \boldsymbol{X}]\} \\
& =\mathbb{E}\{\mathbb{E}[Y \mid A=a, M=m, \boldsymbol{X}]\}.
\end{aligned}$$

Therefore, we have
> $\mathrm{CDE}(m) = \mathbb{E}[Y(1, m)-Y(0, m)] = \mathbb{E}\{\mathbb{E}[Y \mid A=1, M=m, \boldsymbol{X}]\}-\mathbb{E}\{\mathbb{E}[Y \mid A=0, M=m, \boldsymbol{X}]\}$.
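
As a worked instance, take the linear SEM above and assume additionally that $\mathbb{E}\left[\epsilon_{Y} \mid A, M, \boldsymbol{X}\right]=0$ (an assumption made here for the illustration). Then the inner regression is linear and the controlled direct effect reduces to a single coefficient:

$$\begin{aligned}
\mathbb{E}[Y \mid A=a, M=m, \boldsymbol{X}] & =\beta_{A Y} a+\beta_{M Y} m+\boldsymbol{\beta}_{X Y}^{T} \boldsymbol{X}, \\
\mathrm{CDE}(m) & =\left(\beta_{A Y} \cdot 1+\beta_{M Y} m\right)-\left(\beta_{A Y} \cdot 0+\beta_{M Y} m\right)=\beta_{A Y} .
\end{aligned}$$

In particular, under this linear model the controlled direct effect does not depend on the level $m$ at which the mediator is held.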

Formally, 

> **Assumption of no unmeasured confounders**.
>
> (1) No unmeasured treatment-outcome confounders:  $Y(a, m) \perp A \mid \boldsymbol{X}$. 
>
> (2) No unmeasured mediator-outcome confounders:  $Y(m) \perp M \mid A, \boldsymbol{X}$. 
>
> (3) No unmeasured treatment-mediator confounders:  $M(a) \perp A \mid \boldsymbol{X}$.

Together with the **assumption of no treatment-induced mediator-outcome confounding**, $\boldsymbol{X} \cap \operatorname{de}(A)=\emptyset$ (no covariate is a descendant of $A$), we can show that $Y(a, m) \perp M\left(a^{\prime}\right) \mid \boldsymbol{X}$, from which it follows that 

> $$\mathbb{E}[Y(1, M(0))] =  \sum_{m, \boldsymbol{x}} \mathbb{E}[Y \mid A=1, M=m, \boldsymbol{X}=\boldsymbol{x}] \cdot \mathbb{P}(M=m \mid A=0, \boldsymbol{X}=\boldsymbol{x}) \cdot \mathbb{P}(\boldsymbol{X}=\boldsymbol{x}).$$
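
The identification formula above suggests a plug-in estimator when $M$ and $\boldsymbol{X}$ are discrete. The following is a minimal sketch: the function name `mediation_formula`, the single binary covariate, and the simulated coefficients are all hypothetical, and every cell $(A=a, M=m, \boldsymbol{X}=\boldsymbol{x})$ that enters the sum is assumed to be non-empty.

```python
import numpy as np

def mediation_formula(A, M, X, Y, a=1, a_prime=0):
    """Plug-in estimate of E[Y(a, M(a'))] for discrete M and X.

    Sums E_hat[Y | A=a, M=m, X=x] * P_hat(M=m | A=a', X=x) * P_hat(X=x)
    over the observed (m, x). Assumes every cell used is non-empty.
    """
    A, M, X, Y = map(np.asarray, (A, M, X, Y))
    total = 0.0
    for x in np.unique(X):
        p_x = np.mean(X == x)                       # P_hat(X = x)
        in_ap = (A == a_prime) & (X == x)
        for m in np.unique(M[in_ap]):
            p_m = np.mean(M[in_ap] == m)            # P_hat(M = m | A = a', X = x)
            cell = (A == a) & (M == m) & (X == x)
            total += Y[cell].mean() * p_m * p_x
    return float(total)

# Simulated data (all coefficients hypothetical); A is randomised here.
rng = np.random.default_rng(1)
n = 50_000
X = rng.binomial(1, 0.5, size=n)
A = rng.binomial(1, 0.5, size=n)
M = rng.binomial(1, 0.2 + 0.5 * A + 0.1 * X)
Y = 1.0 * A + 2.0 * M + 0.5 * X + rng.normal(size=n)

est = mediation_formula(A, M, X, Y, a=1, a_prime=0)
print(est)
```

In this simulation the target is $\mathbb{E}[Y(1, M(0))]=1+2 \cdot \mathbb{E}[M(0)]+0.5 \cdot \mathbb{E}[X]=1+2 \cdot 0.25+0.25=1.75$, so the printed estimate should be close to that value up to sampling error.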

---

### Reference

1. Shuxiao Chen. Minimax Rates and Adaptivity in Combining Experimental and Observational Data.
2. Qingyuan Zhao. Lecture Notes on Causal Inference. 
3. Joaquin Quiñonero-Candela. Dataset Shift In Machine Learning.
4. Geoff K. Nicholls. Bayes Methods.
5. Patrick J. Laub. Hawkes Processes.
6. Tomas Björk. An Introduction to Point Processes from a Martingale Point of View.