In [None]:
# Load necessary libraries
library(dplyr)
library(broom)  # For tidying model outputs
library(stats)  # For linear regression
library(lmtest)
library(sandwich)

# Analyzing RCT data with Precision Adjustment

## Data

In this lab, we analyze the Pennsylvania re-employment bonus experiment, which was previously studied in "Sequential testing of duration data: the case of the Pennsylvania ‘reemployment bonus’ experiment" (Bilias, 2000), among others. These experiments were conducted in the 1980s by the U.S. Department of Labor to test the incentive effects of alternative compensation schemes for unemployment insurance (UI). UI claimants were randomly assigned either to a control group or to one of six treatment groups. Here we focus on treatment group 4, but feel free to explore other treatment groups. In the control group, the current rules of the UI system applied. Individuals in the treatment groups were offered a cash bonus if they found a job within a pre-specified period of time (the qualification period), provided that the job was retained for a specified duration. The treatments differed in the level of the bonus, the length of the qualification period, and whether the bonus declined over time within the qualification period; see http://qed.econ.queensu.ca/jae/2000-v15.6/bilias/readme.b.txt for further details on the data.
  

In [1]:
# Load the data
Penn <- as.data.frame(read.table("../../../data/penn_jae.dat", header = TRUE))
n <- dim(Penn)[1]
p_1 <- dim(Penn)[2]
# Keep the control group (tg == 0) and the treatment group under study (tg == 2)
Penn <- subset(Penn, tg == 2 | tg == 0)
attach(Penn)

In [5]:
# Observe the distribution control / treatment
T2 <- (tg==2)
summary(T2)

   Mode   FALSE    TRUE 
logical    3354    1745 

In [6]:
head(Penn)

,abdt,tg,inuidur1,inuidur2,female,black,hispanic,othrace,dep,q1,...,q5,q6,recall,agelt35,agegt54,durable,nondurable,lusd,husd,muld
,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,...,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,10824,0,18,18,0,0,0,0,2,0,...,1,0,0,0,0,0,0,0,1,0
4,10824,0,1,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
5,10747,0,27,27,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
12,10607,4,9,9,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
13,10831,0,27,27,0,0,0,0,1,0,...,1,0,0,0,1,1,0,1,0,0
14,10845,0,27,27,1,0,0,0,0,0,...,1,0,0,0,1,0,0,1,0,0


In [7]:
#summarize variables 
summary(Penn)

      abdt             tg           inuidur1        inuidur2    
 Min.   :10404   Min.   :0.000   Min.   : 1.00   Min.   : 0.00  
 1st Qu.:10600   1st Qu.:0.000   1st Qu.: 3.00   1st Qu.: 2.00  
 Median :10698   Median :0.000   Median :11.00   Median :10.00  
 Mean   :10695   Mean   :1.369   Mean   :13.05   Mean   :12.28  
 3rd Qu.:10796   3rd Qu.:4.000   3rd Qu.:25.00   3rd Qu.:23.00  
 Max.   :10880   Max.   :4.000   Max.   :52.00   Max.   :52.00  
     female          black          hispanic          othrace        
 Min.   :0.000   Min.   :0.000   Min.   :0.00000   Min.   :0.000000  
 1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.000000  
 Median :0.000   Median :0.000   Median :0.00000   Median :0.000000  
 Mean   :0.404   Mean   :0.122   Mean   :0.03256   Mean   :0.007256  
 3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:0.00000   3rd Qu.:0.000000  
 Max.   :1.000   Max.   :1.000   Max.   :1.00000   Max.   :1.000000  
      dep               q1                q2           

### Model 
To evaluate the impact of the treatments on unemployment duration, we consider the linear regression model:

$$
Y =  D \beta_1 + W'\beta_2 + \varepsilon, \quad E \varepsilon (D,W')' = 0,
$$

where $Y$ is the log of the duration of unemployment, $D$ is a treatment indicator, and $W$ is a set of controls including age group dummies, gender, race, number of dependents, quarter of the experiment, location within the state, existence of recall expectations, and type of occupation. Here $\beta_1$ is the ATE, provided the RCT assumptions hold.


We also consider the interactive regression model:

$$
Y =  D \alpha_1 + D W' \alpha_2 + W'\beta_2 + \varepsilon, \quad E \varepsilon (D,W', DW')' = 0,
$$
where $W$'s are demeaned (apart from the intercept), so that $\alpha_1$ is the ATE, if the RCT assumptions hold rigorously.
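
To see why demeaning matters, here is a short derivation. It is a sketch that reads the interactive model as a conditional mean and takes $W$ to be the demeaned, non-constant controls, so that $E[W] = 0$:

$$
\begin{aligned}
E[Y \mid D = 1, W] - E[Y \mid D = 0, W] &= \alpha_1 + W'\alpha_2, \\
E\left[\alpha_1 + W'\alpha_2\right] &= \alpha_1 + E[W]'\alpha_2 = \alpha_1.
\end{aligned}
$$

Without demeaning, the same steps give $\alpha_1 + E[W]'\alpha_2$ for the average effect, so $\alpha_1$ alone would be the effect at $W = 0$ rather than the ATE.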

Under RCT, the projection coefficient $\beta_1$ has
the interpretation of the causal effect of the treatment on
the average outcome. We thus refer to $\beta_1$ as the average
treatment effect (ATE). Note that the covariates here are
independent of the treatment $D$, so we can identify $\beta_1$ by
a simple linear regression of $Y$ on $D$, without adding covariates.
However, we do add covariates in an effort to improve the
precision of our estimates of the average treatment effect.
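
The precision argument can be illustrated on simulated data. The following is a minimal sketch, not the Penn analysis: the sample size, the single covariate `W`, and the coefficients (including the assumed true effect of -0.1) are all made up for illustration.

```r
# Simulated RCT (all numbers are illustrative assumptions, not the Penn data)
set.seed(1)
n <- 5000
W <- rnorm(n)                        # pre-treatment covariate
D <- rbinom(n, 1, 0.5)               # randomized treatment, independent of W
Y <- -0.1 * D + 0.8 * W + rnorm(n)   # assumed true ATE = -0.1

fit_cl  <- lm(Y ~ D)       # two-sample comparison, no adjustment
fit_cra <- lm(Y ~ D + W)   # regression adjustment

# The unadjusted slope equals the difference in group means
diff_means <- mean(Y[D == 1]) - mean(Y[D == 0])

# Classical standard errors of the treatment coefficient in each fit
se_cl  <- summary(fit_cl)$coefficients["D", "Std. Error"]
se_cra <- summary(fit_cra)$coefficients["D", "Std. Error"]

print(c(diff_means = diff_means,
        cl  = unname(coef(fit_cl)["D"]),
        cra = unname(coef(fit_cra)["D"])))
print(c(se_cl = se_cl, se_cra = se_cra))
```

Both regressions target the same effect because `D` is independent of `W`; the adjusted fit has the smaller standard error because `W` absorbs part of the outcome variance.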

### Analysis

We consider 

*  classical 2-sample approach, no adjustment (CL)
*  classical linear regression adjustment (CRA)
*  interactive regression adjustment (IRA)

and carry out robust inference using the *lmtest* and *sandwich* R packages.

# Carry out covariate balance check

This is done by fitting the model with `lm` and passing it to `coeftest` with `vcovHC(type = "HC1")`, which reports Eicker-Huber-White (heteroskedasticity-robust) standard errors instead of the classical non-robust formula based on the homoskedasticity assumption.
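
For reference, the HC1 variance is the usual sandwich formula with a degrees-of-freedom correction, $\frac{n}{n-k}(X'X)^{-1} X'\,\mathrm{diag}(\hat\varepsilon_i^2)\,X (X'X)^{-1}$. The sketch below computes it by hand in base R on simulated heteroskedastic data (the data-generating process is an assumption made for illustration); it should agree with `vcovHC(fit, type = "HC1")` from *sandwich*.

```r
# HC1 (Eicker-Huber-White) standard errors from the sandwich formula.
# Simulated heteroskedastic data; purely illustrative.
set.seed(2)
n <- 2000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + abs(x))   # error variance depends on x

fit <- lm(y ~ x)
X <- model.matrix(fit)
e <- residuals(fit)
k <- ncol(X)

bread <- solve(crossprod(X))   # (X'X)^{-1}
meat  <- crossprod(X * e)      # X' diag(e^2) X: row i of X is scaled by e_i
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

se_robust    <- sqrt(diag(vcov_hc1))
se_classical <- sqrt(diag(vcov(fit)))
print(rbind(classical = se_classical, HC1 = se_robust))
```

Because the error variance grows with `abs(x)` in this design, the robust standard error of the slope comes out larger than the classical one.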

In [10]:
m <- lm(T2~(female+black+othrace+factor(dep)+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd)^2)

coeftest(m, vcov = vcovHC(m, type="HC1"))


t test of coefficients:

                        Estimate  Std. Error t value  Pr(>|t|)    
(Intercept)           0.32145725  0.16125607  1.9935 0.0462656 *  
female                0.10423328  0.13624779  0.7650 0.4442914    
black                 0.07164803  0.08288534  0.8644 0.3873969    
othrace               0.02801517  0.40496815  0.0692 0.9448502    
factor(dep)1         -0.07363340  0.20094637 -0.3664 0.7140574    
factor(dep)2         -0.10854072  0.15754307 -0.6890 0.4908810    
q2                   -0.02667937  0.16255836 -0.1641 0.8696419    
q3                   -0.00567387  0.16218178 -0.0350 0.9720934    
q4                    0.04334425  0.16233956  0.2670 0.7894821    
q5                    0.09386458  0.16157184  0.5809 0.5613028    
q6                   -0.22156423  0.15984049 -1.3862 0.1657604    
agelt35              -0.10923976  0.13323486 -0.8199 0.4123101    
agegt54              -0.43668630  0.13581268 -3.2154 0.0013111 ** 
durable              -0.12500967  0.

In [11]:
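# Recover the design matrix used in the balance regression

X <- as.data.frame(model.matrix(m))

# Names of coefficients that lm() dropped because of collinearity (reported as NA)
no_col <- names(coef(m))[is.na(coef(m))]

# Logical indexing keeps all columns when no coefficient is NA
X1 <- X[, !(names(X) %in% no_col), drop = FALSE]

save(X1, file = "../../../data/m_reg.RData")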

In [12]:
X

,(Intercept),female,black,othrace,factor(dep)1,factor(dep)2,q2,q3,q4,q5,...,agelt35:agegt54,agelt35:durable,agelt35:lusd,agelt35:husd,agegt54:durable,agegt54:lusd,agegt54:husd,durable:lusd,durable:husd,lusd:husd
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,0,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1,0,0,0,1,0,0,0,0,1,...,0,0,0,0,1,1,0,1,0,0
6,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
7,1,1,0,0,1,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
8,1,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,1,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,1,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0


We see that, even though this is a randomized experiment, the balance conditions fail: some covariates, such as `agegt54`, are statistically significant predictors of treatment assignment.

# 1.1 Model Specification

In [13]:
# model specifications


# no adjustment (2-sample approach)
formula_cl <- log(inuidur1)~T2

# adding controls

formula_cra <- log(inuidur1)~T2+ (female+black+othrace+factor(dep)+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd)^2

# Omitted dummies: q1, nondurable, muld


ols.cl_reg <- lm(formula_cl)
ols.cra_reg <- lm(formula_cra)


ols.cl = coeftest(ols.cl_reg, vcov = vcovHC(ols.cl_reg, type="HC1"))
ols.cra = coeftest(ols.cra_reg, vcov = vcovHC(ols.cra_reg, type="HC1"))

print(ols.cl)
print(ols.cra)




t test of coefficients:

             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.056830   0.020955 98.1557  < 2e-16 ***
T4TRUE      -0.085455   0.035856 -2.3833  0.01719 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


t test of coefficients:

                       Estimate Std. Error t value  Pr(>|t|)    
(Intercept)           2.6330806  0.3680995  7.1532 9.707e-13 ***
T4TRUE               -0.0796801  0.0355909 -2.2388 0.0252143 *  
female               -0.1146093  0.3163904 -0.3622 0.7171880    
black                -0.8990265  0.2402463 -3.7421 0.0001845 ***
othrace              -2.6269216  0.4541551 -5.7842 7.729e-09 ***
factor(dep)1         -0.7196978  0.5812182 -1.2383 0.2156788    
factor(dep)2         -0.0405584  0.3598395 -0.1127 0.9102631    
q2                   -0.1596700  0.3704317 -0.4310 0.6664596    
q3                   -0.5398906  0.3700794 -1.4589 0.1446691    
q4                   -0.4333540  0.3711943 -1.1675 0.2430809    
q5 

In [21]:
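# Recover the design matrix used in the CRA regression (ols.cra_reg)

X <- as.data.frame(model.matrix(ols.cra_reg))

# Names of coefficients that lm() dropped because of collinearity (reported as NA)
no_col <- names(coef(ols.cra_reg))[is.na(coef(ols.cra_reg))]

# Logical indexing keeps all columns when no coefficient is NA
X1 <- X[, !(names(X) %in% no_col), drop = FALSE]

# The second column is the treatment indicator
names(X1)[2] <- "T2"
save(X1, file = "../../../data/ols_cra_reg.RData")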


In [12]:
class(X1)

In [13]:
head(X1)

(Intercept),T4,female,black,othrace,factor(dep)1,factor(dep)2,q2,q3,q4,...,q6:agegt54,q6:durable,agelt35:durable,agelt35:lusd,agelt35:husd,agegt54:durable,agegt54:lusd,agegt54:husd,durable:lusd,durable:husd
1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,1,1,0,1,0
1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [31]:
# interactive regression model variables 

X <- model.matrix(~(female+black+othrace+factor(dep)+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd)^2)[,-1] #without intercept



The interactive specification corresponds to the approach introduced in Lin (2013).
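
A small simulation shows what demeaning buys in the interactive specification. This is a hedged sketch on made-up data: the covariate mean of 2 and the effect function `0.5 + W` are assumptions chosen so that the effect at `W = 0` (0.5) differs visibly from the average effect (2.5).

```r
# Simulated RCT with a heterogeneous treatment effect (illustrative assumptions only)
set.seed(3)
n <- 20000
W <- rnorm(n, mean = 2)                       # covariate with non-zero mean
D <- rbinom(n, 1, 0.5)                        # randomized treatment
Y <- 1 + D * (0.5 + 1.0 * W) + W + rnorm(n)   # effect at W is 0.5 + W, so the ATE is 2.5

Wc <- W - mean(W)                             # demeaned covariate

fit_raw <- lm(Y ~ D * W)     # coefficient on D estimates the effect at W = 0
fit_dm  <- lm(Y ~ D * Wc)    # coefficient on D estimates the average effect

print(c(raw = unname(coef(fit_raw)["D"]),
        demeaned = unname(coef(fit_dm)["D"])))
```

The two fits are the same model in different coordinates; only the meaning of the coefficient on `D` changes, which is why the covariates are demeaned before being interacted with the treatment.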

## 1.2 Interactive regression model

In [14]:
#interactive regression model (IRA)

demean<- function(x){ x - mean(x)}
X = apply(X, 2, demean)

ols.ira_reg = lm(log(inuidur1) ~ T2*X) 
ols.ira= coeftest(ols.ira_reg, vcov = vcovHC(ols.ira_reg, type="HC1"))
print(ols.ira)


t test of coefficients:

                               Estimate Std. Error t value  Pr(>|t|)    
(Intercept)                   2.0576131  0.0207724 99.0553 < 2.2e-16 ***
T4TRUE                       -0.0755005  0.0356049 -2.1205 0.0340132 *  
Xfemale                      -0.6662823  0.4090214 -1.6290 0.1033843    
Xblack                       -0.8634862  0.2976698 -2.9008 0.0037385 ** 
Xothrace                     -3.8176881  0.9389101 -4.0661 4.856e-05 ***
Xfactor(dep)1                 0.0359264  0.6492658  0.0553 0.9558747    
Xfactor(dep)2                 0.2117556  0.4523267  0.4681 0.6397000    
Xq2                          -0.2546436  0.4564528 -0.5579 0.5769552    
Xq3                          -0.6212326  0.4560767 -1.3621 0.1732217    
Xq4                          -0.4799269  0.4572363 -1.0496 0.2939421    
Xq5                          -0.3718675  0.4549983 -0.8173 0.4138001    
Xq6                          -0.6770474  0.4532559 -1.4937 0.1353075    
Xagelt35                 

In [30]:
head(X)

female,black,othrace,factor(dep)1,factor(dep)2,q2,q3,q4,q5,q6,...,agelt35:agegt54,agelt35:durable,agelt35:lusd,agelt35:husd,agegt54:durable,agegt54:lusd,agegt54:husd,durable:lusd,durable:husd,lusd:husd
-0.4040008,-0.1219847,-0.007256325,-0.1121789,0.8362424,-0.2037654,-0.2355364,-0.2259267,0.7409296,-0.06295352,...,0,-0.07217101,-0.1321828,-0.1253187,-0.01804275,-0.03196705,-0.02647578,-0.05432438,-0.02804471,0
-0.4040008,-0.1219847,-0.007256325,-0.1121789,-0.1637576,-0.2037654,-0.2355364,-0.2259267,0.7409296,-0.06295352,...,0,-0.07217101,-0.1321828,-0.1253187,-0.01804275,-0.03196705,-0.02647578,-0.05432438,-0.02804471,0
-0.4040008,-0.1219847,-0.007256325,-0.1121789,-0.1637576,-0.2037654,-0.2355364,0.7740733,-0.2590704,-0.06295352,...,0,-0.07217101,-0.1321828,-0.1253187,-0.01804275,-0.03196705,-0.02647578,-0.05432438,-0.02804471,0
-0.4040008,-0.1219847,-0.007256325,-0.1121789,-0.1637576,-0.2037654,0.7644636,-0.2259267,-0.2590704,-0.06295352,...,0,-0.07217101,-0.1321828,-0.1253187,-0.01804275,-0.03196705,-0.02647578,-0.05432438,-0.02804471,0
-0.4040008,-0.1219847,-0.007256325,0.8878211,-0.1637576,-0.2037654,-0.2355364,-0.2259267,0.7409296,-0.06295352,...,0,-0.07217101,-0.1321828,-0.1253187,0.98195725,0.96803295,-0.02647578,0.94567562,-0.02804471,0
0.5959992,-0.1219847,-0.007256325,-0.1121789,-0.1637576,-0.2037654,-0.2355364,-0.2259267,0.7409296,-0.06295352,...,0,-0.07217101,-0.1321828,-0.1253187,-0.01804275,0.96803295,-0.02647578,-0.05432438,-0.02804471,0


In [15]:
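# Recover the design matrix used in the IRA regression (ols.ira_reg)
S <- as.data.frame(model.matrix(ols.ira_reg))

# Names of coefficients that lm() dropped because of collinearity (reported as NA)
no_col <- names(coef(ols.ira_reg))[is.na(coef(ols.ira_reg))]

# Logical indexing keeps all columns when no coefficient is NA
S1 <- S[, !(names(S) %in% no_col), drop = FALSE]

# The second column is the treatment indicator
names(S1)[2] <- "T2"
save(S1, file = "../../../data/ols_ira_reg.RData")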

# 1.3 Next we try out partialling out with lasso

In [16]:
T2 = demean(T2)
DX = model.matrix(~T2*X)[,-1]
head(DX)

,T4,X(Intercept),Xfemale,Xblack,Xothrace,Xfactor(dep)1,Xfactor(dep)2,Xq2,Xq3,Xq4,...,T4:Xagelt35:agegt54,T4:Xagelt35:durable,T4:Xagelt35:lusd,T4:Xagelt35:husd,T4:Xagegt54:durable,T4:Xagegt54:lusd,T4:Xagegt54:husd,T4:Xdurable:lusd,T4:Xdurable:husd,T4:Xlusd:husd
1,-0.342224,0,-0.4040008,-0.1219847,-0.007256325,-0.1121789,0.8362424,-0.2037654,-0.2355364,-0.2259267,...,0,0.02469865,0.04523612,0.04288706,0.006174663,0.01093989,0.009060646,0.0185911,0.009597573,0
2,-0.342224,0,-0.4040008,-0.1219847,-0.007256325,-0.1121789,-0.1637576,-0.2037654,-0.2355364,-0.2259267,...,0,0.02469865,0.04523612,0.04288706,0.006174663,0.01093989,0.009060646,0.0185911,0.009597573,0
3,-0.342224,0,-0.4040008,-0.1219847,-0.007256325,-0.1121789,-0.1637576,-0.2037654,-0.2355364,0.7740733,...,0,0.02469865,0.04523612,0.04288706,0.006174663,0.01093989,0.009060646,0.0185911,0.009597573,0
4,0.657776,0,-0.4040008,-0.1219847,-0.007256325,-0.1121789,-0.1637576,-0.2037654,0.7644636,-0.2259267,...,0,-0.04747236,-0.08694667,-0.08243163,-0.011868091,-0.02102716,-0.017415133,-0.03573327,-0.018447141,0
5,-0.342224,0,-0.4040008,-0.1219847,-0.007256325,0.8878211,-0.1637576,-0.2037654,-0.2355364,-0.2259267,...,0,0.02469865,0.04523612,0.04288706,-0.336049303,-0.33128407,0.009060646,-0.32363286,0.009597573,0
6,-0.342224,0,0.5959992,-0.1219847,-0.007256325,-0.1121789,-0.1637576,-0.2037654,-0.2355364,-0.2259267,...,0,0.02469865,0.04523612,0.04288706,0.006174663,-0.33128407,0.009060646,0.0185911,0.009597573,0


In [17]:
library(hdm)

rlasso.ira = summary(rlassoEffects(DX, log(inuidur1), index = 1))

# rlassoEffects ( Partialling out )
# index = 1 (T2 treatment )
print(rlasso.ira)

[1] "Estimates and significance testing of the effect of target variables"
   Estimate. Std. Error t value Pr(>|t|)  
T4  -0.07889    0.03555  -2.219   0.0265 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
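
The idea behind partialling out can be sketched with ordinary least squares standing in for the lasso. This is an illustration on simulated data, not what `rlassoEffects` does internally: the three controls, their coefficients, and the effect of -0.08 are assumptions, and the residualizing step here uses `lm` rather than a penalized fit.

```r
# Partialling out with OLS in place of lasso (illustrative simulated data)
set.seed(4)
n <- 1000
X <- matrix(rnorm(n * 3), n, 3)                  # controls
D <- rbinom(n, 1, 0.5) - 0.5                     # centered treatment indicator
Y <- drop(-0.08 * D + X %*% c(0.5, -0.3, 0.2) + rnorm(n))

# Step 1: residualize the outcome and the treatment on the controls
Y_res <- residuals(lm(Y ~ X))
D_res <- residuals(lm(D ~ X))

# Step 2: regress the residualized outcome on the residualized treatment
beta_po   <- unname(coef(lm(Y_res ~ D_res - 1)))
beta_full <- unname(coef(lm(Y ~ D + X))["D"])

print(c(partialled_out = beta_po, full_regression = beta_full))
```

With OLS in both steps the two numbers coincide exactly (the Frisch-Waugh-Lovell result). Swapping lasso into step 1 keeps the two-step structure but lets the data choose which controls to keep.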




In [18]:
# getting data
S = as.data.frame(DX)
save( S, file = "../../../data/rlasso_ira_reg.RData")

In [19]:
head(S)

,T4,X(Intercept),Xfemale,Xblack,Xothrace,Xfactor(dep)1,Xfactor(dep)2,Xq2,Xq3,Xq4,...,T4:Xagelt35:agegt54,T4:Xagelt35:durable,T4:Xagelt35:lusd,T4:Xagelt35:husd,T4:Xagegt54:durable,T4:Xagegt54:lusd,T4:Xagegt54:husd,T4:Xdurable:lusd,T4:Xdurable:husd,T4:Xlusd:husd
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,-0.342224,0,-0.4040008,-0.1219847,-0.007256325,-0.1121789,0.8362424,-0.2037654,-0.2355364,-0.2259267,...,0,0.02469865,0.04523612,0.04288706,0.006174663,0.01093989,0.009060646,0.0185911,0.009597573,0
2,-0.342224,0,-0.4040008,-0.1219847,-0.007256325,-0.1121789,-0.1637576,-0.2037654,-0.2355364,-0.2259267,...,0,0.02469865,0.04523612,0.04288706,0.006174663,0.01093989,0.009060646,0.0185911,0.009597573,0
3,-0.342224,0,-0.4040008,-0.1219847,-0.007256325,-0.1121789,-0.1637576,-0.2037654,-0.2355364,0.7740733,...,0,0.02469865,0.04523612,0.04288706,0.006174663,0.01093989,0.009060646,0.0185911,0.009597573,0
4,0.657776,0,-0.4040008,-0.1219847,-0.007256325,-0.1121789,-0.1637576,-0.2037654,0.7644636,-0.2259267,...,0,-0.04747236,-0.08694667,-0.08243163,-0.011868091,-0.02102716,-0.017415133,-0.03573327,-0.018447141,0
5,-0.342224,0,-0.4040008,-0.1219847,-0.007256325,0.8878211,-0.1637576,-0.2037654,-0.2355364,-0.2259267,...,0,0.02469865,0.04523612,0.04288706,-0.336049303,-0.33128407,0.009060646,-0.32363286,0.009597573,0
6,-0.342224,0,0.5959992,-0.1219847,-0.007256325,-0.1121789,-0.1637576,-0.2037654,-0.2355364,-0.2259267,...,0,0.02469865,0.04523612,0.04288706,0.006174663,-0.33128407,0.009060646,0.0185911,0.009597573,0


### Results

In [20]:
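# install.packages("xtable")  # run once if xtable is not installed
library(xtable)

# Row 1: point estimates; row 2: standard errors.
# The object is called `results` so that it does not mask base::table().
results <- matrix(0, 2, 4)
results[1, 1] <- ols.cl[2, 1]
results[1, 2] <- ols.cra[2, 1]
results[1, 3] <- ols.ira[2, 1]
results[1, 4] <- rlasso.ira[[1]][1]

results[2, 1] <- ols.cl[2, 2]
results[2, 2] <- ols.cra[2, 2]
results[2, 3] <- ols.ira[2, 2]
results[2, 4] <- rlasso.ira[[1]][2]

colnames(results) <- c("CL", "CRA", "IRA", "IRA w Lasso")
rownames(results) <- c("estimate", "standard error")
tab <- xtable(results, digits = 5)
tab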



,CL,CRA,IRA,IRA w Lasso
,<dbl>,<dbl>,<dbl>,<dbl>
estimate,-0.08545541,-0.07968012,-0.07550055,-0.07888608
standard error,0.03585569,0.03559092,0.03560489,0.0355513


% latex table generated in R 4.1.2 by xtable 1.8-4 package
% Fri Apr 12 09:22:19 2024
\begin{table}[ht]
\centering
\begin{tabular}{rrrrr}
  \hline
 & CL & CRA & IRA & IRA w Lasso \\ 
  \hline
estimate & -0.08546 & -0.07968 & -0.07550 & -0.07889 \\ 
  standard error & 0.03586 & 0.03559 & 0.03560 & 0.03555 \\ 
   \hline
\end{tabular}
\end{table}


Treatment group 4 experiences an average decrease of about 7.5% to 8.5% in the length of the unemployment spell, depending on the estimator.


Observe that the regression estimators deliver estimates that are slightly more efficient (lower standard errors) than the simple two-sample difference in means, but essentially all methods have very similar standard errors. From the IRA results we also see that there is no statistically detectable heterogeneity. The regression estimators also give estimates that are slightly smaller in magnitude; these differences perhaps arise from the minor imbalance in the treatment allocation, which the regression estimators try to correct.
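
Since the outcome is in logs, the coefficients are log-point effects. A minimal sketch of converting them to percentage changes and attaching normal-approximation 95% confidence intervals follows; the numbers are copied from the results table above, and the choice of a normal critical value is an assumption of this sketch.

```r
# Estimates and standard errors copied from the results table
est <- c(CL = -0.08545541, CRA = -0.07968012, IRA = -0.07550055, `IRA w Lasso` = -0.07888608)
se  <- c(CL = 0.03585569, CRA = 0.03559092, IRA = 0.03560489, `IRA w Lasso` = 0.0355513)

# Exact percentage change implied by a log-point effect
pct_change <- 100 * (exp(est) - 1)

# Normal-approximation 95% confidence intervals on the log scale
ci_lower <- est - qnorm(0.975) * se
ci_upper <- est + qnorm(0.975) * se

print(round(cbind(estimate = est, pct_change, ci_lower, ci_upper), 4))
```

The exact percentage changes are slightly smaller in magnitude than the raw log-point estimates, and all four intervals lie below zero.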




## Plotting the coefficients

# A Crash Course in Good and Bad Controls

In [None]:
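# Packages and libraries

# "stats" ships with base R and cannot be installed from CRAN, so it is not listed here
install.packages(c("dagitty", "igraph", "ggplot2", "broom", "dplyr"))
library(dagitty)
library(stats)
library(igraph)
library(ggplot2)
library(broom)
library(dplyr)  # for bind_rows()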

### Model 1

Application with real data
One example where Model 1 could be applied is in studying the effect 
of education (E) on income (I), controlling for factors such as 
intelligence (L) that influence both education and income. This allows 
researchers to determine the extent to which changes in income can be 
attributed to differences in education levels, after accounting for the 
influence of intelligence. For example, if it is found that higher 
education levels are associated with higher incomes even after controlling 
for intelligence, it suggests that education has a direct effect on income 
beyond what can be explained by differences in intelligence levels. This 
insight can inform policy decisions or interventions aimed at 
improving education and income levels.

In [None]:
model_1 <- dagitty("dag {
    L -> I
    L -> E
    E -> I
}")

plot(model_1)

set.seed(1234)
N <- 1000
L <- rnorm(N, 0, 1)
E <- 1.5 * L + rnorm(N, 0, 1)
I <- E + 3 * L + rnorm(N, 0, 1)

# Create dataframe
data <- data.frame(L, E, I)

# Regressions
no_control <- lm(I ~ E, data=data)
using_control <- lm(I ~ E + L, data=data)

# Summary results
# summary() accepts a single model, so stack the tidied coefficient tables instead
dfoutput <- bind_rows(tidy(no_control), tidy(using_control))
print(dfoutput)
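
The size of the bias from omitting L can be worked out from the simulation coefficients with the omitted-variable-bias formula. The sketch below does the arithmetic and checks it against a larger simulated sample; the sample size of 100000 and the name `Inc` (used instead of `I` to stay clear of base R's `I()`) are choices made for this illustration.

```r
# Data-generating process of Model 1: E = 1.5 L + e1, Inc = E + 3 L + e2
b_E <- 1      # true effect of E on income
b_L <- 3      # effect of L on income
g_L <- 1.5    # effect of L on E

# Slope of income on E when L is omitted: b_E + b_L * Cov(E, L) / Var(E)
cov_EL <- g_L * 1          # Var(L) = 1
var_E  <- g_L^2 * 1 + 1    # Var(E) = 1.5^2 + 1
short_slope <- b_E + b_L * cov_EL / var_E

# Compare with a large simulated sample (sample size is an arbitrary choice)
set.seed(1234)
N <- 100000
L <- rnorm(N)
E <- g_L * L + rnorm(N)
Inc <- b_E * E + b_L * L + rnorm(N)
sim_slope <- unname(coef(lm(Inc ~ E))["E"])

print(c(theory = short_slope, simulated = sim_slope))
```

The regression without the control therefore overstates the effect of education by a wide margin, while adding L brings the coefficient on E back toward its true value of 1.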

### Model 7

Application with real data
One example where Model 7 could be applied is when researchers want to 
analyze the effect of a new medication (X) on patient health outcomes 
(Y), controlling for the severity of the illness (U). However, the dummy 
variable of hospitalization (Z) could be mistakenly considered as a control 
variable, on the assumption that it reflects overall health status. 
In reality, hospitalization might be a collider variable influenced by both 
the severity of illness and the effectiveness of the medication. As a consequence, 
controlling for hospitalization could introduce bias into our estimate of 
the medication's effect on patient outcomes.

In [None]:
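# Edges follow the simulation below; the original graph contained a cycle (Z -> Y and Y -> Z)
model_7 <- dagitty("dag {
  U_1 -> Z
  U_2 -> Z
  U_2 -> X
  U_1 -> Y
  Z -> Y
  X -> Y
}")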

plot(model_7)


set.seed(5678)
n <- 1000
U_1 <- rnorm(n)
U_2 <- rnorm(n)

Z <- 0.4 * U_1 + 0.6 * U_2 + rnorm(n)
X <- 1.5 * U_2 + rnorm(n)
Y <- X + 1.5 * U_1 + 1.5 * Z + rnorm(n)

# Create dataframe
data <- data.frame(U_1, U_2, Z, X, Y)

# Regressions
no_control <- lm(Y ~ X, data)
using_control <- lm(Y ~ X + Z, data)

# Summary results
dfoutput <- bind_rows(tidy(no_control), tidy(using_control))
print(dfoutput)

### Model 8

Application with real data
One example where Model 8 could be applied is in the nutrition field, 
specifically when researchers want to study the effect of physical exercise (E) 
on weight loss (W) and also observe baseline metabolism (Z). In this scenario, 
baseline metabolism affects weight loss but is unrelated to how much a person 
exercises, so it is neither a confounder nor a mediator. Controlling for metabolism 
then reduces the unexplained variation in weight loss and can improve the precision 
of the estimated effect of physical exercise on weight loss.

In [None]:
# Node names match the simulated variables below (W is the outcome)
model_8 <- dagitty("dag {
  Z -> W
  E -> W
}")

plot(model_8)

set.seed(1614)
n <- 1000
Z <- rnorm(n)
E <- rnorm(n)
W <- E + 1.2 * Z + rnorm(n)

# Create dataframe
data <- data.frame(Z, E, W)

# Regressions
no_control <- lm(W ~ E, data)
using_control <- lm(W ~ E + Z, data)

# Summary results
dfoutput <- bind_rows(tidy(no_control), tidy(using_control))
print(dfoutput)

### Model 11

Application with real data
One example where Model 11 could be applied in a real context is when 
researchers want to investigate the effect of government policies (P) 
on economic growth (G), mediated by investment in infrastructure (I). For instance,
government policies directly influence investment in infrastructure, which 
subsequently affects economic growth. In this case, controlling for investment in 
infrastructure, which acts as a mediator, blocks the channel through which 
government policies affect economic growth. The estimate then no longer reflects the total effect, 
leading to inaccurate conclusions about the effectiveness of government policies in 
stimulating economic growth.

In [None]:
model_11 <- dagitty("dag {
  P -> I
  I -> G
}")

plot(model_11)

set.seed(777)
n <- 1000
P <- rnorm(n)
I <- 1.3 * P + rnorm(n)
G <- 3 * I + rnorm(n)

# Create dataframe
data <- data.frame(I, P, G)

# Regressions
no_control <- lm(G ~ P, data)
using_control <- lm(G ~ P + I, data)

# Summary results
dfoutput <- bind_rows(tidy(no_control), tidy(using_control))
print(dfoutput)
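
The gap between the two regressions can be read off the simulation equations. A short worked substitution, using only the coefficients 1.3 and 3 from the code above ($\varepsilon_I$ and $\varepsilon_G$ are labels introduced here for the two noise terms):

$$
G = 3I + \varepsilon_G = 3\,(1.3P + \varepsilon_I) + \varepsilon_G = 3.9\,P + (3\varepsilon_I + \varepsilon_G).
$$

So the regression of G on P alone targets the total effect $1.3 \times 3 = 3.9$, while adding I as a control leaves a coefficient of 0 on P, because in this simulation P affects G only through I.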

### Model

In [None]:
# Define the causal graphical model
# dagitty() takes a single graph string; the edges match the simulation below
sprinkler <- dagitty("dag {
  X -> Z
  X -> Y
}")

# Plot the causal graphical model
plot(sprinkler)

# Set Seed
set.seed(432)

# Sample size
n <- 1000

# Generate data
X <- rnorm(n, mean = 0, sd = 1)
Z <- X + rnorm(n, mean = 0, sd = 1)
Y <- 2*X + rnorm(n, mean = 0, sd = 1)

# Create dataframe
data <- data.frame(Z = Z, X = X, Y = Y)

# Regressions
no_control <- lm(Y ~ X, data = data)
using_control <- lm(Y ~ X + Z, data = data)

# Extract coefficients and standard errors
no_control_summary <- summary(no_control)$coefficients[, c("Estimate", "Std. Error")]
using_control_summary <- summary(using_control)$coefficients[, c("Estimate", "Std. Error")]

# Combine results
summary_output <- cbind(
  No_Control_Estimate = no_control_summary[, "Estimate"],
  No_Control_Std.Error = no_control_summary[, "Std. Error"],
  Using_Control_Estimate = using_control_summary[, "Estimate"],
  Using_Control_Std.Error = using_control_summary[, "Std. Error"]
)

# Print summary
print(summary_output)