
<div align="center">

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dapivei/causal-infere/blob/main/sections/9_matching.ipynb)

</div>






$$
\begin{array}{c}
\textbf{INTRODUCTION TO CAUSAL INFERENCE}\\\\\\\\
\textbf{Daniela Pinto Veizaga, Xiang Pan, and Xiang Gao} \\
\textit{Center for Data Science, New York University} \\\\\\
\textit{Lab 9 Control Variables}
\end{array}
$$
---



# Control Variables

## High level idea

Why do we want control variables?
- Conditional on the control variables, treatment assignment is as good as random.

$U$: other factors that affect the outcome.

$Y$: outcome variable.

$S$: treatment variable.



If $U \perp S$, then a simple descriptive comparison of outcomes across values of $S$ is enough to estimate the treatment effect.

## Formal derivation

$$
Y(S, U) = \alpha_0 + \alpha_1 S + U
$$

With no controls, the assumption $U \perp S$ is equivalent to independence conditional on the empty set:
$U \perp S | \emptyset$


If we have a non-empty set of control variables $C$, $U$ and $S$ may become independent only after conditioning on $C$:

$U \perp S | C$

Conditional independence implies conditional mean independence:
$$
\mathbb{E}[U \mid S, C] = \mathbb{E}[U \mid C]
$$



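Define $V$ as the part of $U$ that is not explained by $S$ and $C$. The last step assumes that $\mathbb{E}[U \mid C]$ is linear in $C$:

$$
\begin{aligned}
V & \equiv U-\mathbb{E}[U \mid S, C] \\
& =U-\mathbb{E}[U \mid C] \\
& =U-\left(\gamma_0+\gamma_1 C\right)
\end{aligned}
$$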

$$
U = V + \gamma_0 + \gamma_1 C
$$




$$
\begin{aligned}
Y(S, U) & = \alpha_0 + \alpha_1 S + V + \gamma_0 + \gamma_1 C \\
        & = (\alpha_0 + \gamma_0) + \alpha_1 S + \gamma_1 C + V \\
        & = \tilde{\alpha}_0 + \alpha_1 S + \tilde{\alpha}_2 C + V
\end{aligned}
$$

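We have peeled the effect of $C$ off $U$, with $\tilde{\alpha}_0 = \alpha_0 + \gamma_0$ and $\tilde{\alpha}_2 = \gamma_1$. The remaining error $V$ satisfies $\mathbb{E}[V \mid S, C] = 0$ by construction, so regressing $Y$ on $S$ and $C$ recovers $\alpha_1$.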
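To make the derivation concrete, here is a minimal simulation sketch. All numeric values (the coefficients, the sample size, and the way $S$ depends on $C$) are assumptions chosen for illustration, and `ols` is a hypothetical helper, not part of the lab. The data are generated so that $U \perp S \mid C$ holds but $U \perp S$ does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
alpha0, alpha1 = 1.0, 2.0   # assumed true intercept and treatment effect
gamma0, gamma1 = 0.5, 3.0   # assumed E[U | C] = gamma0 + gamma1 * C

C = rng.normal(size=n)              # control variable
S = 0.8 * C + rng.normal(size=n)    # treatment depends on C only
V = rng.normal(size=n)              # noise, independent of S and C
U = gamma0 + gamma1 * C + V         # U is related to S only through C
Y = alpha0 + alpha1 * S + U


def ols(y, *regressors):
    """Least-squares coefficients; the intercept is the first entry."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


naive = ols(Y, S)           # omits C, so the coefficient on S absorbs part of U
controlled = ols(Y, S, C)   # includes C, so the coefficient on S targets alpha1

print("Y ~ S     :", naive.round(3))
print("Y ~ S + C :", controlled.round(3))
```

Under these assumptions, the controlled regression should return coefficients close to $(\alpha_0 + \gamma_0, \alpha_1, \gamma_1)$, while the coefficient on $S$ in the naive regression is pushed away from $\alpha_1$.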


# Conditional Independence Test

If $U \perp S \mid C$ holds and $\mathbb{E}[U \mid C]$ is linear in $C$, then we can use the control variables to estimate the treatment effect.

How should we test the assumption?

1. Regression: regress $U$ on $S$, holding $C$ fixed.

Within a stratum of $C$, we test whether the coefficient $\delta_1$ on $S$ is statistically different from 0. Under conditional independence, $\delta_1 = 0$.
$$
U = \delta_0 + \delta_1 S+ \epsilon
$$




2. Partial Correlation (continuous)

$$
r_{X Y \mid Z}=\frac{r_{X Y}-r_{X Z} \cdot r_{Y Z}}{\sqrt{\left(1-r_{X Z}^2\right)\left(1-r_{Y Z}^2\right)}}
$$

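Taking $X = S$, $Y = U$, and $Z = C$: if conditional independence holds, then $r_{S U \mid C} = 0$.

Testing conditional independence with high-dimensional continuous variables is non-trivial.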
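The partial correlation formula can be sketched in a few lines of NumPy. The function name `partial_corr` and the simulated data are assumptions made for illustration. In particular, $U$ is simulated here, whereas in a real study it is typically not observed directly.

```python
import numpy as np


def partial_corr(x, y, z):
    """Partial correlation of x and y given a single variable z."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))


rng = np.random.default_rng(1)
n = 20_000
C = rng.normal(size=n)
S = 0.8 * C + rng.normal(size=n)   # S depends on C only
U = 1.5 * C + rng.normal(size=n)   # U depends on C only, so U is independent of S given C

r_marginal = np.corrcoef(S, U)[0, 1]
r_partial = partial_corr(S, U, C)

print(f"corr(S, U)     = {r_marginal:.3f}")
print(f"corr(S, U | C) = {r_partial:.3f}")
```

In this setup the marginal correlation is clearly positive because both variables load on $C$, while the partial correlation should be close to zero up to sampling noise.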

# Matching

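Matching aims to obtain treated and control groups with similar covariate distributions.

The control variables split the sample into subgroups. We compare treated and control outcomes within each subgroup, then average these differences, weighting each subgroup by its probability: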

$$
\begin{aligned}
\operatorname{Matching}
& =  \sum_{c \in C} \left( \mathbb{E}[Y \mid S=1, C = c] - \mathbb{E}[Y \mid S=0, C = c] \right) \cdot \mathbb{P}(C = c)
\end{aligned}
$$
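The matching formula above can be sketched as a loop over subgroups. The data-generating process (three values of $C$, the treatment probabilities, and the subgroup effects) and the function name `matching_estimate` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
C = rng.integers(0, 3, size=n)              # discrete control with values 0, 1, 2
p_treat = np.array([0.2, 0.5, 0.8])[C]      # assumed: treatment probability depends on C
S = rng.binomial(1, p_treat)
effect = np.array([1.0, 2.0, 3.0])[C]       # assumed subgroup treatment effects
Y = 5.0 * C + effect * S + rng.normal(size=n)


def matching_estimate(y, s, c):
    """Sum over subgroups of (treated mean - control mean) * P(C = c)."""
    total = 0.0
    for value in np.unique(c):
        in_group = c == value
        diff = y[in_group & (s == 1)].mean() - y[in_group & (s == 0)].mean()
        total += diff * in_group.mean()     # in_group.mean() estimates P(C = value)
    return total


naive = Y[S == 1].mean() - Y[S == 0].mean()   # ignores C entirely
matched = matching_estimate(Y, S, C)

print(f"naive difference in means: {naive:.3f}")
print(f"matching estimate:         {matched:.3f}")
```

Because the three subgroups are equally likely here, the average effect implied by the assumed subgroup effects is 2, which the matching estimate should approximate. The naive difference in means is much larger because treated units are concentrated in subgroups with high baseline outcomes.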











# Regression (weight is not the probability)


\begin{align}
\alpha_1=\sum_{k=1}^K E\left[Y(S=1, U)-Y(S=0, U) \mid \mathbf{C}=\mathbf{c}_k\right] W\left(\mathbf{C}=\mathbf{c}_k\right)
\end{align}

$W$ is the weight of the subgroup, which might not be the same as the probability of the subgroup.

$$
\begin{aligned}
W(C = c_k) \neq \mathbb{P}(C = c_k)
\end{aligned}
$$



Say we have two groups, 1 and 2.
We estimate $\beta$ from a single regression that pools both groups (using standard linear regression notation here).

$$
Y = X \beta + \epsilon
$$





\begin{align}
X=\left[\begin{array}{l}
X_1 \\
X_2
\end{array}\right] \quad \text { and } \quad Y=\left[\begin{array}{l}
Y_1 \\
Y_2
\end{array}\right]
\end{align}

\begin{align}
X^T X=X_1^T X_1+X_2^T X_2 \quad \text { and } \quad X^T Y=X_1^T Y_1+X_2^T Y_2
\end{align}

The pooled OLS estimator is

\begin{align}
\hat{\beta}_{\text {pooled }}=\left(X_1^T X_1+X_2^T X_2\right)^{-1}\left(X_1^T Y_1+X_2^T Y_2\right)
\end{align}

The group-specific OLS estimators are

\begin{align}
\hat{\beta}_1=\left(X_1^T X_1\right)^{-1} X_1^T Y_1 \quad \text { and } \quad \hat{\beta}_2=\left(X_2^T X_2\right)^{-1} X_2^T Y_2
\end{align}

Since $X_k^T Y_k = X_k^T X_k \hat{\beta}_k$ for $k = 1, 2$, we can substitute into the pooled estimator:

\begin{align}
\hat{\beta}_{\text {pooled }}=\left(X_1^T X_1+X_2^T X_2\right)^{-1}\left(X_1^T X_1 \hat{\beta}_1+X_2^T X_2 \hat{\beta}_2\right)
\end{align}

So the pooled estimator is a matrix-weighted average of the group estimators:

\begin{align}
\hat{\beta}_{\text {pooled }}=W_1 \hat{\beta}_1+W_2 \hat{\beta}_2
\end{align}

The weights sum to the identity matrix, $W_1 + W_2 = I$:

\begin{align}
W_1=\left(X_1^T X_1+X_2^T X_2\right)^{-1} X_1^T X_1
\end{align}

\begin{align}
W_2=\left(X_1^T X_1+X_2^T X_2\right)^{-1} X_2^T X_2
\end{align}
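The decomposition above can be checked numerically. This is a sketch on simulated data: the helper `make_group`, the group sizes, slopes, and regressor scales are all assumptions chosen so that group 2 has a much larger spread in its regressor.

```python
import numpy as np

rng = np.random.default_rng(3)


def make_group(n, slope, x_scale):
    """Simulate one group: design matrix with intercept, and outcome."""
    x = rng.normal(scale=x_scale, size=n)
    X = np.column_stack([np.ones(n), x])
    y = 1.0 + slope * x + rng.normal(size=n)
    return X, y


X1, Y1 = make_group(200, slope=1.0, x_scale=0.5)   # small regressor variance
X2, Y2 = make_group(200, slope=3.0, x_scale=2.0)   # large regressor variance

# Group-specific OLS estimates.
beta1 = np.linalg.solve(X1.T @ X1, X1.T @ Y1)
beta2 = np.linalg.solve(X2.T @ X2, X2.T @ Y2)

# Pooled OLS on the stacked data.
X = np.vstack([X1, X2])
Y = np.concatenate([Y1, Y2])
beta_pooled = np.linalg.solve(X.T @ X, X.T @ Y)

# Matrix weights W_k = (X1'X1 + X2'X2)^{-1} Xk'Xk.
G = X1.T @ X1 + X2.T @ X2
W1 = np.linalg.solve(G, X1.T @ X1)
W2 = np.linalg.solve(G, X2.T @ X2)

print("pooled         :", beta_pooled.round(3))
print("W1 b1 + W2 b2  :", (W1 @ beta1 + W2 @ beta2).round(3))
print("group slopes   :", beta1[1].round(3), beta2[1].round(3))
```

The two printed vectors should agree up to floating-point error, and the pooled slope should sit closer to the slope of group 2, the group with the larger regressor variance.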

Mapping back to our problem, $X$ is the treatment variable and $Y$ is the outcome variable.

Each weight $W_k$ scales with $X_k^T X_k$, so a subgroup with larger treatment variance (or more observations) receives a larger weight.
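To illustrate how regression weights differ from subgroup probabilities, the sketch below assumes a binary treatment and a regression of $Y$ on $S$ plus a full set of dummies for $C$. That specification, the treatment probabilities, and the subgroup effects are assumptions for illustration. In this specific setup the OLS coefficient on $S$ equals the within-subgroup differences in means weighted by $\mathbb{P}(C = c_k)\operatorname{Var}(S \mid C = c_k)$, normalized to sum to one.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30_000
C = rng.integers(0, 3, size=n)
S = rng.binomial(1, np.array([0.05, 0.5, 0.9])[C])   # assumed treatment probabilities
effect = np.array([1.0, 2.0, 3.0])[C]                # assumed subgroup effects
Y = 5.0 * C + effect * S + rng.normal(size=n)

# OLS of Y on S and one dummy per value of C (the dummies replace the intercept).
dummies = (C[:, None] == np.arange(3)).astype(float)
X = np.column_stack([S, dummies])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
alpha1_ols = coef[0]

# Within-subgroup differences in means, subgroup shares, and treatment variances.
groups = [C == k for k in range(3)]
diffs = np.array([Y[g & (S == 1)].mean() - Y[g & (S == 0)].mean() for g in groups])
prob = np.array([g.mean() for g in groups])
var_s = np.array([S[g].var() for g in groups])       # equals p_k * (1 - p_k)

w_matching = prob
w_regression = prob * var_s / np.sum(prob * var_s)

print("matching weights  :", w_matching.round(3))
print("regression weights:", w_regression.round(3))
print("matching estimate :", (w_matching @ diffs).round(3))
print("OLS coefficient   :", alpha1_ols.round(3))
```

The subgroup with treatment probability 0.5 has the largest treatment variance, so it gets more weight in the regression than its population share, and the two estimates differ whenever effects vary across subgroups.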

## Curse of Dimensionality (matching)

If you have a large number of control variables, say $k$, and they are all binary, then the number of subgroups you need to match on is $2^k$.

If you have continuous control variables, you can discretize them into bins, and the number of subgroups you need to match on is the number of bins (with several variables, the product of the per-variable bin counts).
1. What if a bin contains only one observation?
   1. Increase the size of the bin (less accurate, loss of information)
   2. Remove the data point (fewer observations)
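A small simulation can show how quickly exact matching runs out of data as $k$ grows. This is a sketch under assumed conditions: 500 observations, independent fair-coin binary controls, and a fair-coin treatment. The helper `usable_share` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500


def usable_share(k):
    """Share of observations whose cell contains both treated and control units."""
    C = rng.integers(0, 2, size=(n, k))      # k binary control variables
    S = rng.integers(0, 2, size=n)           # binary treatment
    cell = C @ (2 ** np.arange(k))           # encode each row of k bits as one cell id
    n_total = np.bincount(cell, minlength=2**k)
    n_treated = np.bincount(cell, weights=S, minlength=2**k)
    both = (n_treated > 0) & (n_treated < n_total)
    return both[cell].mean()


shares = {k: usable_share(k) for k in (2, 4, 8, 12)}
for k, share in shares.items():
    print(f"k = {k:2d}: {2**k:5d} cells, usable share = {share:.2f}")
```

With few controls every cell holds both treated and control units, but once the number of cells exceeds the sample size most observations sit alone in their cell and cannot be matched exactly.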
