Workgroup 2

# Potential Outcomes and RCTs

#### Group 3: Dube, V., Garay, E., Guerrero, J., Villalba, M.

## Multicollinearity

What is Multicollinearity?

In [None]:
# Example

## Analyzing RCT data with precision adjustment

## A crash course in good and bad controls (val)

In this section, we explore different scenarios in which we must decide whether including a control variable, denoted by _Z_, helps to improve the estimate of the **average treatment effect** (ATE) of treatment _X_ on outcome _Y_. In the diagrams, the effect of observed variables is represented by a solid line, while that of unobserved variables is represented by a dashed line.

In [None]:
install.packages("librarian", quiet = TRUE)


In [None]:
install.packages("tidyverse")


In [None]:
install.packages("ggdag")


In [None]:
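# Install the helper packages once if needed:
# install.packages("librarian", quiet = TRUE)
# install.packages("ggdag")

# Attach (and install if missing) the packages used in this section
librarian::shelf(
    dagitty, tibble, stargazer, ggplot2, ggdag,
    quiet = TRUE
)
# Use a blank theme for all plots
theme_set(theme_void())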


In [9]:
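# Load packages (they must be installed first, see the cells above)
library(dagitty)    # DAG specification
library(ggdag)      # DAG plotting
library(tibble)     # data frames
library(stargazer)  # regression tables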



In [3]:
# cleans workspace
rm(list = ls()) 

#### Good control (Blocking back-door paths)

**Model 1** 

We assume that _X_ indicates whether a student attends an extra tutoring session, which affects the student's grade (_Y_). Another observable variable, the student's hours of sleep (_Z_), affects both _X_ and _Y_. Theory says that controlling for _Z_ blocks the back-door path from _X_ to _Y_. Accordingly, in the second regression the coefficient of _X_ is closer to the true value (2.9898 ≈ 3).

In [None]:
# cleans workspace
rm(list = ls())

# DAG

## specify edges
model <- dagitty("dag{x->y; z->x; z->y}")

## coordinates for plotting
coordinates(model) <-  list(
  x = c(x=1, y=3, z=2),
  y = c(x=1, y=1, z=2))

## ggplot
ggdag(model) + theme_dag()


In [7]:
# Generate data
set.seed(24)
n <- 1000
Z <- rnorm(n)
X <- 5 * Z + rnorm(n)
Y <- 3 * X + 1.5 * Z + rnorm(n)

In [8]:
d <- tibble(X=X, Y=Y, Z=Z)

# Regressions and summary results
lm_1 <- lm(Y ~ X, d)
lm_2 <- lm(Y ~ X + Z, d)
stargazer(lm_1, lm_2, 
          type = "text", 
          column.labels = c("NoControl", "UsingControl")
)
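
The gap between the two coefficients of _X_ can be read through the omitted-variable-bias identity: the coefficient from the short regression equals the coefficient from the long regression plus the coefficient of _Z_ times the slope from regressing _Z_ on _X_. The sketch below checks this on the Model 1 data. It is only an illustration: it regenerates the data with the same seed and coefficients, uses base R instead of `tibble` and `stargazer`, and the names `short`, `long`, `aux`, and `ovb` are ours.

```r
# Regenerate the Model 1 data (same seed and coefficients as above)
set.seed(24)
n <- 1000
Z <- rnorm(n)
X <- 5 * Z + rnorm(n)
Y <- 3 * X + 1.5 * Z + rnorm(n)

short <- lm(Y ~ X)      # omits the confounder Z
long  <- lm(Y ~ X + Z)  # adjusts for Z
aux   <- lm(Z ~ X)      # auxiliary regression of Z on X

b_short <- unname(coef(short)["X"])
b_long  <- unname(coef(long)["X"])
g_long  <- unname(coef(long)["Z"])
delta   <- unname(coef(aux)["X"])

# Omitted-variable bias implied by the long and auxiliary regressions
ovb <- g_long * delta

# b_short and b_long + ovb coincide up to floating-point error
print(c(short = b_short, long = b_long, bias = ovb, long_plus_bias = b_long + ovb))
```

The identity holds exactly in any sample, so the bias of the uncontrolled regression is entirely accounted for by the omitted _Z_.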



**Model 2** 

We assume that _X_ stands for police salaries, which affect the crime rate (_Y_). Another observable variable, the supply of police officers (_Z_), affects _X_ but not _Y_. In addition, an unobservable variable (_U_), the preference for maintaining civil order, affects both _Z_ and _Y_. Theory says that controlling for _Z_ blocks the back-door path from _X_ to _Y_ that runs through the unobserved _U_. Accordingly, in the second regression, which controls for _Z_, the coefficient of _X_ matches the true value (0.5).

In [None]:
g <- dagitty("dag {
    X <- Z <- U -> Y
    X -> Y
    }")
coordinates(g) <- list(
    x = c(X=1, Z=2, U=3, Y=4),
    y = c(X=2, Z=1, U=0, Y=2))

ggdag(g)

In [None]:
set.seed(24)
n <- 1000   
U <- rnorm(n)
Z <- 7 * U + rnorm(n)
X <- 2 * Z + rnorm(n)
Y <- 0.5 * X + 0.2 * U + rnorm(n)

# Create a dataframe
d <- tibble(X=X, Y=Y, Z=Z, U=U)
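
The two regressions for this model are not shown above. A minimal sketch of how they could be run follows, assuming the same seed and coefficients as the data-generating cell. To stay self-contained it uses base R's `data.frame` and `summary` in place of `tibble` and `stargazer`, and the table name `est` is ours.

```r
# Regenerate the Model 2 data (same seed and coefficients as above)
set.seed(24)
n <- 1000
U <- rnorm(n)                        # unobserved: not used in any regression
Z <- 7 * U + rnorm(n)
X <- 2 * Z + rnorm(n)
Y <- 0.5 * X + 0.2 * U + rnorm(n)
d <- data.frame(X = X, Y = Y, Z = Z)

lm_1 <- lm(Y ~ X, data = d)          # no control
lm_2 <- lm(Y ~ X + Z, data = d)      # controls for Z

# Estimate and standard error of the coefficient of X in each regression
est <- rbind(
  NoControl    = summary(lm_1)$coefficients["X", c("Estimate", "Std. Error")],
  UsingControl = summary(lm_2)$coefficients["X", c("Estimate", "Std. Error")]
)
print(round(est, 4))
```

Reporting the standard errors alongside the estimates helps judge whether the difference between the two columns is larger than sampling noise.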

#### Bad Control (M-bias)

**Model 7** 

Suppose that _X_ stands for a job training program aimed at reducing unemployment. A first unobserved confounder, the planning effort and good design of the job program (_U1_), directly affects both participation in job training programs (_X_) and the proximity of job programs (the bad control _Z_). A second unobserved confounder (_U2_), the soft skills of the unemployed, affects both the employment status of individuals (_Y_) and the likelihood of being in a job training program that is closer (_Z_). Because _Z_ is a collider on the path _X_ ← _U1_ → _Z_ ← _U2_ → _Y_, conditioning on it opens that otherwise blocked path. That is why including _Z_ in the second regression moves the coefficient of _X_ farther from the true value.

In [None]:
g <- dagitty("dag {
    X <- U1 -> Z <- U2 -> Y
    X -> Y
    }")
coordinates(g) <- list(
    x = c(X=1, U1=1, Z=2, U2=3, Y=3),
    y = c(X=1, U1=0, Z=0, U2=0, Y=1))

ggdag(g)

In [None]:
set.seed(1)
n <- 1000  
U1 <- rnorm(n)
U2 <- rnorm(n)
Z <- 0.3 * U1 + 0.9 * U2 + rnorm(n)
X <- 4 * U1 + rnorm(n)
Y <- 3 * X + U2 + rnorm(n)

# Create a dataframe
d <- tibble(X=X, Y=Y, Z=Z, U1=U1, U2=U2)

In [None]:
lm_1 <- lm(Y ~ X, d)
lm_2 <- lm(Y ~ X + Z, d)
stargazer(lm_1, lm_2, 
          type = "text", 
          column.labels = c("NoControl", "UsingControl")
          )
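
A single sample mixes bias with sampling noise, so one way to isolate the M-bias is to repeat the simulation and average the estimates. The sketch below does this under the same data-generating process as above. The helper `simulate_once` and the choice of 200 replications are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: one draw from the Model 7 data-generating process,
# returning the coefficient of X without and with the control Z
simulate_once <- function(n = 1000) {
  U1 <- rnorm(n)
  U2 <- rnorm(n)
  Z <- 0.3 * U1 + 0.9 * U2 + rnorm(n)
  X <- 4 * U1 + rnorm(n)
  Y <- 3 * X + U2 + rnorm(n)
  c(no_control   = unname(coef(lm(Y ~ X))["X"]),
    with_control = unname(coef(lm(Y ~ X + Z))["X"]))
}

set.seed(1)
reps <- 200                                   # assumed number of replications
estimates <- replicate(reps, simulate_once()) # 2 x reps matrix
avg <- rowMeans(estimates)

# Average estimate per specification; the true effect is 3
print(round(avg, 4))
```

Averaged over replications, the regression without _Z_ should center on the true effect, while the regression with _Z_ should settle slightly away from it.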

#### Neutral Control (possibly good for precision)

**Model 8** 

In this scenario, we assume that _X_ represents the implementation of a new government policy that provides subsidies and guidance for small companies. Another variable, _Z_, stands for the inflation rate (in %). Both _X_ and _Z_ affect _Y_, the country's GDP growth rate. Even though _Z_ does not affect _X_, including it improves the precision of the ATE estimator (8.5643 is closer to 8.6).

In [None]:
g <- dagitty("dag {X -> Y <-Z}")
coordinates(g) <- list(
    x = c(X=1, Y=2, Z=3),
    y = c(X=1, Y=1, Z=0))

ggdag(g)

In [None]:
set.seed(24)
n <- 1000  
Z <- rnorm(n)
X <- rnorm(n)
Y <- 8.6 * X + 5 * Z + rnorm(n)

# Create a dataframe
d <- tibble(X=X, Y=Y, Z=Z)

In [None]:
lm_1 <- lm(Y ~ X, d)
lm_2 <- lm(Y ~ X + Z, d)
stargazer(lm_1, lm_2, 
          type = "text", 
          column.labels = c("NoControl", "UsingControl")
          )
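
Precision refers to the standard error of the estimator, so a direct way to see the gain is to compare the standard error of the coefficient of _X_ in both regressions. The following sketch regenerates the Model 8 data with the same seed and coefficients and uses only base R. The names `se_no_control` and `se_with_control` are ours.

```r
# Regenerate the Model 8 data (same seed and coefficients as above)
set.seed(24)
n <- 1000
Z <- rnorm(n)
X <- rnorm(n)
Y <- 8.6 * X + 5 * Z + rnorm(n)

lm_1 <- lm(Y ~ X)      # Z is left in the error term
lm_2 <- lm(Y ~ X + Z)  # Z absorbs part of the residual variance

# Standard error of the coefficient of X in each regression
se_no_control   <- summary(lm_1)$coefficients["X", "Std. Error"]
se_with_control <- summary(lm_2)$coefficients["X", "Std. Error"]
print(c(no_control = se_no_control, with_control = se_with_control))
```

Both regressions are unbiased for the effect of _X_ here, since _Z_ is independent of _X_. The benefit of including _Z_ shows up as a smaller standard error.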

#### Bad Controls (Bias amplification)

**Model 10** 

Let us assume that _X_ measures the implementation of a housing program for young adults buying their first house, which affects average housing prices (_Y_). Another observable variable, _Z_, measures the program's expenditure and affects only _X_. There is also an unobservable variable (_U_), the preference of young adults for moving out of their parents' house, which affects both _X_ and _Y_. Including _Z_ therefore "amplifies the bias" that _U_ induces in the coefficient of _X_, so the ATE estimator gets worse. In the second regression, the estimator (0.8241) is farther from the true value (0.8).

In [None]:
g <- dagitty("dag {
    Z -> X <- U -> Y
    X -> Y
    }")
coordinates(g) <- list(
    x = c(Z=0.5, X=1, U=1.5, Y=2),
    y = c(Z=0, X=1, U=0.5, Y=1))

ggdag(g)

In [None]:
set.seed(24)
n <- 1000  
Z <- rnorm(n)
U <- rnorm(n)
X <- 3 * Z + 6 * U + rnorm(n)
Y <- 0.8 * X + 0.2 * U + rnorm(n)

# Create a dataframe
d <- tibble(X=X, Y=Y, Z=Z, U=U)

In [None]:
# Regressions and summary results (d was created in the previous cell)
lm_1 <- lm(Y ~ X, d)
lm_2 <- lm(Y ~ X + Z, d)
stargazer(lm_1, lm_2, 
          type = "text", 
          column.labels = c("NoControl", "UsingControl")
)
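
The amplification can also be worked out analytically from the simulation's coefficients. The bias of the coefficient of _X_ is the effect of _U_ on _Y_ times Cov(_X_, _U_) divided by the variance of _X_ that is left after adjustment. Controlling for _Z_ removes variation in _X_ that is unrelated to _U_, which shrinks the denominator while the numerator stays the same. The sketch below assumes, as in the simulation, that _Z_, _U_, and the noise in _X_ are independent with unit variance. All variable names are illustrative.

```r
# Coefficients copied from the data-generating code above
a_z <- 3    # effect of Z on X
a_u <- 6    # effect of U on X
b_u <- 0.2  # effect of U on Y

# Variances under the assumption of independent unit-variance Z, U and noise
var_x         <- a_z^2 + a_u^2 + 1  # Var(X)
var_x_given_z <- a_u^2 + 1          # variance of X left after partialling out Z
cov_xu        <- a_u                # Cov(X, U), unchanged by adjusting for Z

# Population bias of the coefficient of X in each regression
bias_no_control   <- b_u * cov_xu / var_x
bias_with_control <- b_u * cov_xu / var_x_given_z
print(round(c(no_control = bias_no_control, with_control = bias_with_control), 4))
```

Under these assumptions the bias grows from roughly 0.026 to roughly 0.032 once _Z_ is included. The gap is small relative to sampling noise at n = 1000, so a single simulated sample may not display it clearly.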

