# Lab 5: Causal Inference with Randomized Experiments

As we saw in class, the fundamental problem of causal inference is that we cannot observe, for the same individual, both the outcome under treatment and the outcome without treatment. This means we cannot recover the individual treatment effect, which is the difference between those two outcomes. However, we can approximate the expected difference at the population level, the **average treatment effect** (ATE). We also touched upon the fact that this is not easy with observational data: treatment may be related to unobserved variables that also drive the outcome, so a simple regression of the outcome on the treatment will not give a faithful approximation of the true effect. The following example illustrates how failing to account for this selection bias can lead to both type 1 and type 2 errors.

## Type 1 error

Imagine we are researchers in the first phase of research for a new drug aimed at reducing blood cholesterol levels. Because this is the first phase, we would like to assess the safety of the treatment. Therefore, we want to quantify the effect that the regimen has on lung health. Let $Y(d)$ be the potential outcome, an indicator of overall respiratory health, under treatment status $d$. Let $D$ be the observed treatment variable, and assume first that treatment was not assigned randomly, but that it is correlated with underlying lifestyle and demographic characteristics. Formally, we have the following model:

\begin{gather*}
Y(0) = \theta_0+\epsilon_0\\
Y(1) = \theta_1+\epsilon_1\\
D = \mathbb{1}(\nu>0)\\
Y=Y(D)=(1-D)Y(0)+DY(1)
\end{gather*}

Here, $\theta_0$ and $\theta_1$ are the means of their respective potential outcomes, and $(\epsilon_0, \epsilon_1, \nu)$ is a random vector with mean 0 and covariance $\Sigma$. $\epsilon_0$ and $\epsilon_1$ capture determinants of overall lung health such as lifestyle, exposure to environmental contaminants, diet, etc. For this specific example, the variable that determines treatment, $\nu$, is negatively correlated with $\epsilon_0$ and $\epsilon_1$. One way to interpret this is that people whose lifestyle patterns lead to worse lung health may also tend to have worse blood cholesterol levels, and are therefore more likely to seek the treatment in question.

For simulating the data, we will generate 100 000 observations of the variables in our model. We will assume that the average potential outcomes are $\theta_0 = \theta_1 =1$ and the covariance matrix is

$$
\Sigma = \left(\begin{matrix}
1 & 0 & -0.5\\
0 & 1 & -0.5\\
-0.5 & -0.5 & 1
\end{matrix}\right)
$$

We start by simulating the data.

In [1]:
library(MASS)

In [3]:
N <- 100000
cov_matrix <- matrix(c(1, 0, -.5, 0, 1, -.5, -.5, -.5, 1), nrow = 3)
random_shocks <- mvrnorm(N, c(0, 0, 0), cov_matrix)
epsilon0 <- random_shocks[, 1]
epsilon1 <- random_shocks[, 2]
nu <- random_shocks[, 3]  # latent selection variable; it is thresholded at 0 below to obtain D
theta0 <- 1
theta1 <- 1
Y_0 <- theta0 + epsilon0
Y_1 <- theta1 + epsilon1
D_biased <- nu > 0
Y_biased <- (1 - D_biased) * Y_0 + D_biased * Y_1
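
Before using simulated draws, it can help to confirm that they reproduce the intended covariance structure. The following is a minimal sketch of such a check. To stay dependency-free it draws the shocks with a Cholesky factor from base R instead of `mvrnorm`; the seed and the lowercase variable names are assumptions made for this illustration only.

```r
set.seed(1)  # arbitrary seed, assumed only to make the sketch reproducible
n <- 100000
sigma <- matrix(c(1, 0, -.5, 0, 1, -.5, -.5, -.5, 1), nrow = 3)

# chol() returns an upper-triangular R with t(R) %*% R = sigma,
# so the rows of z %*% R have covariance sigma when z is standard normal.
shocks <- matrix(rnorm(n * 3), ncol = 3) %*% chol(sigma)

emp_cov <- cov(shocks)           # should be close to sigma
share_treated <- mean(shocks[, 3] > 0)  # should be close to 0.5

print(round(emp_cov, 2))
print(share_treated)
```

With this many observations, the empirical covariance should sit very close to $\Sigma$, and roughly half of the sample should end up treated because $\nu$ is symmetric around zero.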

Recall that the average treatment effect is defined as

$$
\delta=E[Y(1) - Y(0)] = E[Y(1)] - E[Y(0)]
$$

In our example

$$
\delta = \theta_1-\theta_0
$$

Given the way we have defined our model, this is exactly 0. In reality, we cannot observe both $Y(0)$ and $Y(1)$ for the same individual. However, since we are the ones simulating the data, we can play by our own rules and compute the sample ATE directly. We should expect a number very close to zero.

In [4]:
mean(Y_1) - mean(Y_0)

We can calculate the standard error of the mean of the individual differences and compare it with the average difference. If the average difference is within a couple of standard errors of zero, its deviation from the true value of zero is consistent with sampling noise coming from the random variables $\epsilon_d$.

In [5]:
sd(Y_1 - Y_0) / sqrt(N)

In practice, we would approximate $\theta_0, \theta_1$ by computing each group's mean. That is:

$$
\hat{\theta}_d = \frac{\sum_{\{i\in\{1,\dots,n\}:D_i=d\}}Y_i}{n_d} = \frac{\mathbb{E}_n[Y\mathbb{1}(D=d)]}{\mathbb{E}_n[\mathbb{1}(D=d)]}
$$

Then we can approximate the ATE with $\hat{\theta}_1-\hat{\theta}_0$. We can also build confidence intervals for these estimators, using the approximate variance of each group mean:

$$
Var(\hat{\theta}_d)=\frac{Var(Y|D=d)}{nP(D = d)}
$$

With this, our $\hat{\delta}$ estimate is distributed as:

$$
\hat{\delta}\sim_a\mathcal{N}(\delta, Var(\hat{\theta}_1) + Var(\hat{\theta}_0))
$$

In [6]:
mean(Y_biased * D_biased) / mean(D_biased) - mean(Y_biased * (1 - D_biased)) / mean(1 - D_biased)

In [7]:
ate_variance <- var(Y_biased[D_biased]) / mean(D_biased) / N + var(Y_biased[!D_biased]) / mean(1 - D_biased) / N
sqrt(ate_variance)
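
The point estimate and standard error from the formulas above can be wrapped into a small helper that also reports a normal-approximation confidence interval. The sketch below is only illustrative: `diff_in_means` and its arguments are hypothetical names, not part of the lab code, and it assumes `d` is a 0/1 or logical treatment indicator. Groups are selected with logical indexing (`y[d]`, `y[!d]`), since indexing with a numeric 0/1 vector does not select a group in R.

```r
# Hypothetical helper: difference in group means with a normal-approximation CI.
diff_in_means <- function(y, d, level = 0.95) {
  d <- as.logical(d)
  n <- length(y)
  est <- mean(y[d]) - mean(y[!d])
  # Var(theta_hat_d) = Var(Y | D = d) / (n * P(D = d)), summed over both groups
  v <- var(y[d]) / (n * mean(d)) + var(y[!d]) / (n * mean(!d))
  se <- sqrt(v)
  z <- qnorm(1 - (1 - level) / 2)
  c(estimate = est, se = se, lower = est - z * se, upper = est + z * se)
}

# Tiny worked example: group means are 2 and 5, so the estimate is 3.
y <- c(1, 2, 3, 4, 5, 6)
d <- c(0, 0, 0, 1, 1, 1)
res <- diff_in_means(y, d)
print(res)
```

Applied to the simulated data, a call such as `diff_in_means(Y_biased, D_biased)` would return the same estimate and standard error as the two cells above, plus the interval bounds.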

This is a very different result from the one above. First, the estimate is large relative to the true value of zero. Second, it is very precise, which would imply that the treatment is highly significant. This would lead us to reject the null hypothesis even though it is true. For more clarity, we can run a regression of the recentered variables, where recentering simply means subtracting the mean.

We see the same results, plus a very large $t$ statistic and a very small $p$ value.

In [8]:
biased_data <- data.frame(Y = Y_biased - mean(Y_biased), D = D_biased - mean(D_biased))
summary(lm(Y ~ 0 + D, biased_data))


Call:
lm(formula = Y ~ 0 + D, data = biased_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0622 -0.6146  0.0026  0.6169  3.6934 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
D -0.80826    0.00578  -139.8   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9139 on 99999 degrees of freedom
Multiple R-squared:  0.1636,	Adjusted R-squared:  0.1636 
F-statistic: 1.956e+04 on 1 and 99999 DF,  p-value: < 2.2e-16
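
The size of this coefficient can be anticipated from the model. The following derivation is a sketch that relies on the shocks being jointly normal with unit variances, which is how they are drawn by `mvrnorm` in the simulation; $\rho$ denotes the common correlation between $\nu$ and each $\epsilon_d$, and $\phi$, $\Phi$ are the standard normal density and distribution function.

$$
\begin{aligned}
E[Y|D=1]-E[Y|D=0] &= \theta_1-\theta_0 + E[\epsilon_1|\nu>0] - E[\epsilon_0|\nu\leq 0]\\
E[\epsilon_1|\nu>0] &= \rho\,\frac{\phi(0)}{1-\Phi(0)} = \rho\sqrt{2/\pi}, \qquad E[\epsilon_0|\nu\leq 0] = -\rho\sqrt{2/\pi}\\
E[Y|D=1]-E[Y|D=0] &= \theta_1-\theta_0 + 2\rho\sqrt{2/\pi} \approx 0 - 0.798 \quad\text{for } \rho=-0.5
\end{aligned}
$$

The naive comparison therefore converges to roughly $-0.8$ rather than to the true ATE of 0, which is in line with the regression estimate.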


This happens because selection bias leads to an apparent negative relationship between the treatment and the outcome. In our research setting, we would interpret this finding as the drug having a negative effect on lung health, even though the true ATE is 0. This relationship only arises because of underlying variables that determine both $Y(d)$ and $D$ simultaneously. A variable that affects both the propensity of treatment and the outcome is called a **confounder**, and not taking confounders into account can lead to the kinds of issues we are seeing in this example.

One way of tackling the issue of confounding is to use randomized treatments. In our research example, instead of having a treatment $D$ that is correlated with some unobservables, we now randomly assign the treatment to the population of interest. We can simulate that through our code.

In [9]:
D_random <- rnorm(N) > 0
Y_random <- (1 - D_random) * Y_0 + D_random * Y_1

We now estimate the ATE and its variance like before.

In [10]:
mean(Y_random * D_random) / mean(D_random) - mean(Y_random * (1 - D_random)) / mean(1 - D_random)

In [11]:
ate_variance <- var(Y_random[D_random]) / mean(D_random) / N + var(Y_random[!D_random]) / mean(1 - D_random) / N
sqrt(ate_variance)

We can now see that we get a much smaller value for the treatment effect, and that it is not even significant at the standard levels. To see this, we can run a regression.

In [12]:
random_data <- data.frame(Y = Y_random - mean(Y_random), D = D_random - mean(D_random))
summary(lm(Y ~ 0 + D, random_data))


Call:
lm(formula = Y ~ 0 + D, data = random_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4277 -0.6736 -0.0007  0.6789  4.3606 

Coefficients:
   Estimate Std. Error t value Pr(>|t|)
D -0.006709   0.006331   -1.06    0.289

Residual standard error: 1.001 on 99999 degrees of freedom
Multiple R-squared:  1.123e-05,	Adjusted R-squared:  1.231e-06 
F-statistic: 1.123 on 1 and 99999 DF,  p-value: 0.2892


The great utility of randomization is that it makes the realized treatment independent of the potential outcomes, so the simple difference in group means is an unbiased estimate of the ATE.
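
One way to see this in a simulation is to compare the average shock $\epsilon_0$ between treated and untreated units under each assignment rule. The sketch below is self-contained and makes a few assumptions for illustration: the seed, the lowercase names, a Cholesky-based draw in place of `mvrnorm`, and a fair coin flip via `rbinom` in place of `rnorm(N) > 0` (the two assignment rules are equivalent in distribution).

```r
set.seed(2)  # arbitrary seed, assumed only for reproducibility
n <- 100000
sigma <- matrix(c(1, 0, -.5, 0, 1, -.5, -.5, -.5, 1), nrow = 3)
shocks <- matrix(rnorm(n * 3), ncol = 3) %*% chol(sigma)
eps0 <- shocks[, 1]

d_biased <- shocks[, 3] > 0          # self-selected treatment: depends on nu
d_random <- rbinom(n, 1, 0.5) == 1   # randomized treatment: fair coin flip

# Difference in the mean of eps0 between treated and untreated units
gap_biased <- mean(eps0[d_biased]) - mean(eps0[!d_biased])
gap_random <- mean(eps0[d_random]) - mean(eps0[!d_random])

print(c(biased = gap_biased, random = gap_random))
```

Under self-selection the treated group has systematically lower $\epsilon_0$, while under randomization the two groups should be balanced up to sampling noise.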

## Type 2 error

Having made sure that our treatment is safe, we can now look into its effectiveness in reducing blood cholesterol. Let the outcome now be inversely related to the blood cholesterol measure, such that a higher value represents lower cholesterol levels. For the treatment to be considered effective, it would have to have a positive and significant ATE. We will set the ATE for this example to 0.8. We will start with the observational setting, where we do not randomize the treatment.

In [13]:
theta0 <- 1
theta1 <- 1.8
Y_0 <- theta0 + epsilon0
Y_1 <- theta1 + epsilon1
D_biased <- nu > 0
Y_biased <- (1 - D_biased) * Y_0 + D_biased * Y_1

If we run a regression on the demeaned variables, we get a small and insignificant estimate of the treatment effect. Again, the correlation between the treatment and the potential outcomes produces results that do not reflect reality. In this case we are led to not reject the null hypothesis, even though it is false and the true ATE is quite different from 0.

In [14]:
biased_data <- data.frame(Y = Y_biased - mean(Y_biased), D = D_biased - mean(D_biased))
summary(lm(Y ~ 0 + D, biased_data))


Call:
lm(formula = Y ~ 0 + D, data = biased_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0622 -0.6146  0.0026  0.6169  3.6934 

Coefficients:
   Estimate Std. Error t value Pr(>|t|)
D -0.008259   0.005780  -1.429    0.153

Residual standard error: 0.9139 on 99999 degrees of freedom
Multiple R-squared:  2.042e-05,	Adjusted R-squared:  1.042e-05 
F-statistic: 2.042 on 1 and 99999 DF,  p-value: 0.153
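
The near-zero estimate is not a coincidence of this particular sample. As a rough check, assuming joint normality of the shocks with unit variances (as drawn by `mvrnorm`), the mean of $\epsilon_d$ among units with $\nu>0$ is $\rho\,\phi(0)/(1-\Phi(0))$, and the selection bias of the naive comparison is twice that quantity. The names below are hypothetical and used only for this sketch.

```r
rho <- -0.5                             # correlation between nu and each epsilon
mills <- dnorm(0) / (1 - pnorm(0))      # equals sqrt(2 / pi)
selection_bias <- 2 * rho * mills       # limit of the naive estimate minus the true ATE

# Hypothetical helper: large-sample value of the naive difference in means
naive_limit <- function(ate) ate + selection_bias

print(selection_bias)
print(naive_limit(0.8))
```

With a true ATE of 0.8, the selection bias of about $-0.8$ almost exactly cancels the effect, so the naive estimate is centered very close to zero.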


However, with random assignment, we get much closer to the true ATE, and we can now (correctly) reject the null hypothesis.

In [15]:
D_random <- rnorm(N) > 0
Y_random <- (1 - D_random) * Y_0 + D_random * Y_1

In [16]:
random_data <- data.frame(Y = Y_random - mean(Y_random), D = D_random - mean(D_random))
summary(lm(Y ~ 0 + D, random_data))


Call:
lm(formula = Y ~ 0 + D, data = random_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4628 -0.6752  0.0016  0.6799  4.3629 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
D 0.794609   0.006326   125.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 99999 degrees of freedom
Multiple R-squared:  0.1363,	Adjusted R-squared:  0.1363 
F-statistic: 1.578e+04 on 1 and 99999 DF,  p-value: < 2.2e-16
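
A side note on the regressions used in this lab: with a binary treatment, the slope from regressing demeaned $Y$ on demeaned $D$ without an intercept is numerically the same as the slope from a regression with an intercept, and both equal the difference in group means. The sketch below checks this on a small simulated sample; the seed, sample size, and variable names are assumptions made for the illustration.

```r
set.seed(4)  # arbitrary seed, assumed only for reproducibility
n <- 1000
d <- rnorm(n) > 0
y <- 1 + 0.8 * d + rnorm(n)

# 1) Difference in group means
diff_means <- mean(y[d]) - mean(y[!d])

# 2) Slope from a regression with an intercept
slope_intercept <- unname(coef(lm(y ~ d))[2])

# 3) Slope from the demeaned regression without an intercept
y_c <- y - mean(y)
d_c <- d - mean(d)
slope_demeaned <- unname(coef(lm(y_c ~ 0 + d_c)))

print(c(diff_means, slope_intercept, slope_demeaned))
```

The three numbers agree up to floating-point error, so the demeaned regressions are simply a convenient way to obtain the difference in means together with a $t$ statistic.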


## Identification through strong ignorability

Sometimes we don't have the means to conduct a full RCT. Most of the time we only have data available from surveys and administrative records. Randomization removes the estimation bias at its root, but the absence of randomization is not itself what causes the bias: the bias comes from the treatment being related to the potential outcomes. We can therefore look for other procedures that lead to an unbiased estimate even when treatment is not randomly assigned. Recall from class that we can identify treatment effects through strong ignorability. We call $D_i$ strongly ignorable conditional on a vector $X_i$ if it satisfies:

- Ignorability: For a given value of $X_i$, $D_i$ is unrelated to the potential outcomes
- Positivity: $P(D_i=1|X_i)\in(0, 1)$. For a given value of $X_i$, there are treated and untreated individuals.
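
In the potential-outcomes notation used in this lab, the two conditions and the identification result they deliver can be written as follows. This is a standard formulation, given here as a sketch rather than as a quotation from the lecture notes.

$$
\begin{gathered}
(Y(0), Y(1)) \perp D \mid X, \qquad 0 < P(D=1\mid X) < 1\\
E[Y(d)\mid X] = E[Y(d)\mid D=d, X] = E[Y\mid D=d, X]\\
\delta = E\big[E[Y\mid D=1, X] - E[Y\mid D=0, X]\big]
\end{gathered}
$$

The first equality on the second line uses ignorability, and positivity guarantees that $E[Y\mid D=d, X]$ is defined for both values of $d$ at every value of $X$.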

The second condition is somewhat trivial but necessary to make sure that at any value of $X_i$ there exists a split, even if it is very unbalanced. The first condition, on the other hand, hearkens back to a topic we have explored before. Recall the partialling out procedure, where, for a variable $V_i$, we would construct $\tilde{V}_i$ such that

$$
\tilde{V}_i=V_i-E[V_i|X_i]
$$

We refer back to partialling out in this situation because, through it, we are able to "remove" the influence of $X$ (in this case, our $\epsilon_0$ and $\epsilon_1$ variables) on $D$ and on $Y$. Therefore, we would expect the regression of $\tilde{Y}$ on $\tilde{D}$ to give us a better estimate even if the base variables are biased. Note that this is only possible because, in the simulation, we can observe $X$.

In [17]:
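D_data <- data.frame(D = D_biased, eps0 = epsilon0, eps1 = epsilon1)
D_resid <- glm(D ~ 0 + eps0 + eps1, family = "binomial", data = D_data)$residuals  # note: for a glm fit, $residuals holds the working residuals
Y_data <- data.frame(Y = Y_biased, eps0 = epsilon0, eps1 = epsilon1)
Y_resid <- lm(Y ~ 0 + eps0 + eps1, Y_data)$residuals  # residuals of Y after projecting on eps0 and eps1
resid_data <- data.frame(D = D_resid, Y = Y_resid)
summary(lm(Y ~ 0 + D, resid_data))  # regression of residualized Y on residualized D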


Call:
lm(formula = Y ~ 0 + D, data = resid_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.2670  0.9131  1.3979  1.8893 18.0051 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
D  0.09169    0.00173      53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.58 on 99999 degrees of freedom
Multiple R-squared:  0.02733,	Adjusted R-squared:  0.02732 
F-statistic:  2809 on 1 and 99999 DF,  p-value: < 2.2e-16
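
As a complementary, self-contained illustration of partialling out, the sketch below uses a hypothetical design in which a single observed confounder `x` enters the outcome linearly and the treatment effect is constant at 0.8. These are assumptions of the sketch, not the lab's model, and both residualizations use `lm` (a linear projection) rather than the logistic fit in the cell above.

```r
set.seed(3)  # arbitrary seed, assumed only for reproducibility
n <- 100000
x <- rnorm(n)                          # observed confounder (hypothetical)
d <- as.numeric(x + rnorm(n) > 0)      # treatment is more likely when x is high
y <- 1 + 0.8 * d - 1.0 * x + rnorm(n)  # true treatment effect is 0.8

# Naive regression ignores x and is confounded
naive <- unname(coef(lm(y ~ d))["d"])

# Partialling out: residualize both y and d on x, then regress the residuals
d_tilde <- resid(lm(d ~ x))
y_tilde <- resid(lm(y ~ x))
partialled <- unname(coef(lm(y_tilde ~ 0 + d_tilde)))

# For comparison: controlling for x directly gives the same coefficient
full <- unname(coef(lm(y ~ d + x))["d"])

print(c(naive = naive, partialled = partialled, full = full))
```

In this design the naive slope is far from 0.8, while the partialled-out slope coincides with the coefficient from the regression that controls for `x` and lands close to the true effect.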
