# Analyzing RCT data with Precision Adjustment

## Data

In this lab, we analyze the Pennsylvania re-employment bonus experiment, which was previously studied in "Sequential testing of duration data: the case of the Pennsylvania ‘reemployment bonus’ experiment" (Bilias, 2000), among others. These experiments were conducted in the 1980s by the U.S. Department of Labor to test the incentive effects of alternative compensation schemes for unemployment insurance (UI). UI claimants were randomly assigned either to a control group or to one of six treatment groups. Here we focus on treatment group 4, but feel free to explore the other treatment groups. In the control group, the existing UI rules applied. Individuals in the treatment groups were offered a cash bonus if they found a job within a pre-specified period of time (the qualification period), provided that the job was retained for a specified duration. The treatments differed in the level of the bonus, the length of the qualification period, and whether the bonus declined over time within the qualification period; see http://qed.econ.queensu.ca/jae/2000-v15.6/bilias/readme.b.txt for further details on the data.
  

In [2]:
import pandas as pd

In [3]:
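# Load the data; the separator is a regular expression, so pass it as a raw string
Penn = pd.read_csv("../data/penn_jae.dat", sep=r'\s', engine='python')
n = Penn.shape[0]
p_1 = Penn.shape[1]
# Keep the control group (tg == 0) and treatment group 4 (tg == 4);
# .copy() avoids chained-assignment warnings when new columns are added below
Penn = Penn[(Penn['tg'] == 4) | (Penn['tg'] == 0)].copy()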

In [4]:
Penn.shape

(5099, 24)

In [5]:
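# Treatment indicator: 1 for treatment group 4, 0 for control
# (it serves as the dependent variable in the balance check below)
Penn['T4'] = (Penn['tg'] == 4).astype(int)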

In [6]:
Penn['dep'] = Penn['dep'].astype( 'category' )

In [7]:
# Penn['dep'] = Penn['dep'].astype( str ).astype(object)

### Model 
To evaluate the impact of the treatments on unemployment duration, we consider the linear regression model:

$$
Y =  D \beta_1 + W'\beta_2 + \varepsilon, \quad E \varepsilon (D,W')' = 0,
$$

where $Y$ is the log of the duration of unemployment, $D$ is a treatment indicator, and $W$ is a set of controls including age group dummies, gender, race, number of dependents, quarter of the experiment, location within the state, existence of recall expectations, and type of occupation. Here $\beta_1$ is the ATE if the RCT assumptions hold.


We also consider an interactive regression model:

$$
Y =  D \alpha_1 + D W' \alpha_2 + W'\beta_2 + \varepsilon, \quad E \varepsilon (D,W', DW')' = 0,
$$
where the $W$'s are demeaned (apart from the intercept), so that $\alpha_1$ is the ATE if the RCT assumptions hold.
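To see why demeaning matters in the interactive model, here is a minimal sketch on simulated data (not the Penn sample). The data-generating process, the variable names, and the helper `coef_on_D` are all assumptions made for illustration. The treatment effect is made to depend on a covariate with a nonzero mean, so the coefficient on $D$ equals the ATE only when the covariate is centered.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Simulated RCT (hypothetical data): one covariate with mean 2, randomized treatment
W = rng.normal(loc=2.0, scale=1.0, size=n)
D = rng.integers(0, 2, size=n)

# The treatment effect varies with W: effect(W) = 1.0 + 0.5 * W,
# so the true ATE is 1.0 + 0.5 * E[W] = 2.0
Y = 0.3 * W + D * (1.0 + 0.5 * W) + rng.normal(size=n)

def coef_on_D(w):
    """OLS of Y on [1, D, w, D*w]; return the coefficient on D."""
    X = np.column_stack([np.ones(n), D, w, D * w])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta[1]

alpha1_raw = coef_on_D(W)                    # effect at W = 0, close to 1.0
alpha1_demeaned = coef_on_D(W - W.mean())    # effect at the mean of W, close to the ATE of 2.0

print(round(alpha1_raw, 2), round(alpha1_demeaned, 2))
```

Without demeaning, the coefficient on $D$ measures the treatment effect for a unit with $W = 0$, which need not be a meaningful reference point.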

Under RCT, the projection coefficient $\beta_1$ has
the interpretation of the causal effect of the treatment on
the average outcome. We thus refer to $\beta_1$ as the average
treatment effect (ATE). Note that the covariates here are
independent of the treatment $D$, so we can identify $\beta_1$ by
a simple linear regression of $Y$ on $D$, without adding covariates.
However, we do add covariates in an effort to improve the
precision of our estimates of the average treatment effect.
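The precision argument can be illustrated with a small Monte Carlo sketch. Everything here is assumed for illustration: the simulated design, the effect size `tau`, and the sample sizes. Under random assignment both estimators are centered on the true effect, but adjusting for a covariate that predicts the outcome reduces the sampling variability.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, tau = 500, 1000, -0.08   # hypothetical sample size, replications, true effect

est_cl, est_cra = [], []
for _ in range(reps):
    W = rng.normal(size=n)                  # covariate, independent of treatment
    D = rng.integers(0, 2, size=n)          # randomized treatment
    Y = tau * D + 1.0 * W + rng.normal(size=n)

    # CL: simple difference in means
    est_cl.append(Y[D == 1].mean() - Y[D == 0].mean())

    # CRA: OLS of Y on [1, D, W]; keep the coefficient on D
    X = np.column_stack([np.ones(n), D, W])
    est_cra.append(np.linalg.lstsq(X, Y, rcond=None)[0][1])

sd_cl, sd_cra = np.std(est_cl), np.std(est_cra)
print("mean CL :", np.mean(est_cl), " sd:", sd_cl)
print("mean CRA:", np.mean(est_cra), " sd:", sd_cra)
```

The gain depends on how much of the outcome variance the covariates explain; in the Penn data the gain turns out to be small.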

### Analysis

We consider 

*  classical 2-sample approach, no adjustment (CL)
*  classical linear regression adjustment (CRA)
*  interactive regression adjustment (IRA)

and carry out robust inference using heteroskedasticity-robust (HC1) standard errors computed with the *statsmodels* package.

# Carry out covariate balance check

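We check balance by regressing the treatment indicator on the covariates and their pairwise interactions. We use Eicker-Huber-White (heteroskedasticity-robust) standard errors, requested in *statsmodels* via `get_robustcov_results(cov_type = "HC1")`, instead of the classical non-robust formula based on the homoskedasticity assumption. (In R, the "lm_robust" command from *estimatr* plays this role, unlike the base "lm" command.)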
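For readers who want to see what the robust formula computes, the following is a minimal NumPy sketch of the HC1 sandwich estimator next to the classical one. The function `ols_se` and the simulated heteroskedastic data are assumptions for illustration; this is not the statsmodels implementation.

```python
import numpy as np

def ols_se(X, y):
    """OLS coefficients with classical and HC1 (Eicker-Huber-White) standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Classical: s^2 (X'X)^{-1}, valid under homoskedasticity
    se_classical = np.sqrt(np.diag(resid @ resid / (n - k) * XtX_inv))

    # HC1: (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1}, scaled by n / (n - k)
    meat = X.T @ (X * resid[:, None] ** 2)
    se_hc1 = np.sqrt(np.diag(n / (n - k) * XtX_inv @ meat @ XtX_inv))
    return beta, se_classical, se_hc1

# Hypothetical data: the error variance grows with x^2 (heteroskedasticity)
rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + x * rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

beta, se_classical, se_hc1 = ols_se(X, y)
print("slope:", beta[1], "classical SE:", se_classical[1], "HC1 SE:", se_hc1[1])
```

In this design the classical formula understates the uncertainty of the slope, which is the situation the robust standard errors are meant to guard against.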

In [8]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy

In [9]:
formula2 = "T4~(female+black+othrace+dep+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd) ** 2"

In [10]:
y, X = patsy.dmatrices(formula2, Penn, return_type='dataframe')

In [11]:
len( list( X.columns.values ) )

120
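The count of 120 columns can be checked by hand. The `** 2` operator expands the formula into all main effects plus all pairwise interactions, and `dep` is categorical with levels 0, 1 and 2 (as the `dep[T.1]` and `dep[T.2]` labels in the output indicate), so it contributes two dummy columns. The sketch below only counts columns; it does not call patsy, and the dictionary `cols_per_term` is a name introduced for this illustration.

```python
from itertools import combinations

# Number of design-matrix columns contributed by each term in the formula
cols_per_term = {name: 1 for name in [
    "female", "black", "othrace", "q2", "q3", "q4", "q5", "q6",
    "agelt35", "agegt54", "durable", "lusd", "husd",
]}
cols_per_term["dep"] = 2   # three levels, so two dummies after dropping the baseline

# Main effects: 13 numeric columns plus 2 dep dummies
main = sum(cols_per_term.values())

# Pairwise interactions: each pair of terms contributes the product of their column counts
inter = sum(cols_per_term[a] * cols_per_term[b]
            for a, b in combinations(cols_per_term, 2))

total = 1 + main + inter   # the leading 1 is the intercept
print(main, inter, total)
```

Dropping the intercept from this count leaves 119 columns, and dropping four interaction columns leaves 116.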

In [12]:
no_columns = ['lusd:husd','agelt35:agegt54', 'q6:lusd','q6:husd', 'q5:q6','q4:q5','q4:q6', 'q3:q4', 'q3:q5',  'q3:q6', 'q2:q3', 'q2:q4','q2:q5','q2:q6', 'black:othrace' , 'black:q6' , 'othrace:q6']

In [13]:
no_columns = ['agelt35:agegt54',  'black:othrace' , 'black:q6' , 'othrace:q6']

In [14]:
len(no_columns)

4

In [15]:
X_new = X.drop(no_columns, axis = 1 )

In [16]:
sm.OLS( y, X_new ).fit().get_robustcov_results(cov_type = "HC1").summary2().tables[1].round(3)



                  Coef.  Std.Err.       t  P>|t|  [0.025  0.975]
Intercept         0.391     0.034  11.506  0.000   0.324   0.458
dep[T.1]          0.056     0.060   0.942  0.346  -0.061   0.174
dep[T.2]          0.028     0.053   0.537  0.591  -0.075   0.132
female           -0.027     0.037  -0.712  0.476  -0.099   0.046
female:dep[T.1]   0.052     0.046   1.112  0.266  -0.039   0.142
...                 ...       ...     ...    ...     ...     ...
agegt54:lusd      0.043     0.053   0.805  0.421  -0.062   0.147
agegt54:husd      0.035     0.043   0.810  0.418  -0.050   0.120
durable:lusd      0.092     0.058   1.590  0.112  -0.021   0.205
durable:husd      0.073     0.054   1.338  0.181  -0.034   0.179


We see that, even though this is a randomized experiment, the balance conditions fail.

# Model Specification

In [17]:
Penn

        abdt   tg  inuidur1  inuidur2  female  black  hispanic  othrace  dep   q1  ...   q6  recall  agelt35  agegt54  durable  nondurable  lusd  husd  muld   T4
0      10824    0        18        18       0      0         0        0    2    0  ...    0       0        0        0        0           0     1     0   NaN    0
3      10824    0         1         1       0      0         0        0    0    0  ...    0       0        0        0        0           1     0     0   NaN    0
4      10747    0        27        27       0      0         0        0    0    0  ...    0       0        0        0        0           1     0     0   NaN    0
11     10607    4         9         9       0      0         0        0    0    0  ...    0       1        0        0        0           0     0     1   NaN    1
12     10831    0        27        27       0      0         0        0    1    0  ...    0       0        1        1        0           1     0     0   NaN    0
...      ...  ...       ...       ...     ...    ...       ...      ...  ...  ...  ...  ...     ...      ...      ...      ...         ...   ...   ...   ...  ...
13904  10628    4        10        10       0      0         1        0    0    0  ...    0       1        0        0        0           0     0     1   NaN    1
13905  10523    4         4         4       0      0         1        0    2    0  ...    0       0        0        0        0           0     0     1   NaN    1
13906  10558    0         9         9       0      0         0        0    2    0  ...    0       1        0        0        0           1     0     0   NaN    0
13910  10817    4         4         4       0      0         0        0    0    0  ...    0       0        1        0        0           0     0     0   NaN    1


In [18]:
import numpy as np
# model specifications
Penn["log_inuidur1"] = np.log( Penn["inuidur1"] ) 

# no adjustment (2-sample approach)
formula_cl = 'log_inuidur1~T4'

# adding controls
formula_cra = 'log_inuidur1~T4+ (female+black+othrace+dep+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd)**2'
# Omitted dummies: q1, nondurable, muld


ols_cl = smf.ols( formula = formula_cl, data = Penn )
ols_cra = smf.ols( formula = formula_cra, data = Penn )


ols_cl_model = ols_cl.fit().get_robustcov_results(cov_type = "HC1").summary2().tables[1].round(3)
ols_cra_model = ols_cra.fit().get_robustcov_results(cov_type = "HC1").summary2().tables[1].round(3)

print(ols_cl_model)
print(ols_cra_model)

           Coef.  Std.Err.       t  P>|t|  [0.025  0.975]
Intercept  2.057     0.021  98.156  0.000   2.016   2.098
T4        -0.085     0.036  -2.383  0.017  -0.156  -0.015
              Coef.  Std.Err.       t  P>|t|  [0.025  0.975]
Intercept     1.668     0.088  19.026  0.000   1.496   1.840
dep[T.1]      0.042     0.153   0.276  0.782  -0.257   0.342
dep[T.2]      0.125     0.133   0.939  0.348  -0.136   0.387
T4           -0.077     0.035  -2.189  0.029  -0.146  -0.008
female        0.160     0.096   1.661  0.097  -0.029   0.348
...             ...       ...     ...    ...     ...     ...
agegt54:lusd  0.067     0.131   0.511  0.609  -0.190   0.324
agegt54:husd -0.173     0.110  -1.581  0.114  -0.388   0.042
durable:lusd -0.396     0.146  -2.716  0.007  -0.682  -0.110
durable:husd -0.110     0.137  -0.803  0.422  -0.379   0.159
lusd:husd     0.000     0.000     NaN    NaN   0.000   0.000

[121 rows x 6 columns]




The interactive specification corresponds to the approach introduced in Lin (2013).

In [238]:
# No intercept
formula3 = "T4~(female+black+othrace+C(dep)+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd) ** 2"
y, X = patsy.dmatrices(formula3, Penn, return_type='dataframe')
log_inuidur1 = np.log( Penn["inuidur1"] )
X = X.drop( 'Intercept', axis = 1)

def demean(X):
    output = X - np.mean(X)
    return output

X = X.apply( demean , axis = 0 )

In [239]:
columns = X.columns.to_list()

In [240]:
new_columns = []
for column in columns:
    new_string = column.replace(".", "_")
    new_string = new_string.replace("C(dep)", "C_dep")
    new_string = new_string.replace("[", "_")
    new_string = new_string.replace("]", "")
    new_columns.append(new_string)

In [241]:
X.columns = new_columns

In [242]:
def listToString(s):
    # Join the column names into a single " + "-separated string for use in a formula
    return " + ".join(s)

In [244]:
covars = listToString(X.columns.to_list())
len(X.columns.to_list())

119

In [245]:
# Add the treatment indicator to X so that the formula below can build the T4 * X interactions
X['T4'] = y['T4']

In [246]:
X.shape

(5099, 120)

In [247]:
formula4 = f"T4 ~ T4*({covars})"

In [248]:
y, X_T4 = patsy.dmatrices(formula4, X, return_type='dataframe')

In [None]:
# Outcome: log duration; it shares the index of X_T4, so the rows align in the regression
log_inuidur1 = np.log(Penn[ 'inuidur1' ])
ols_ira = sm.OLS( log_inuidur1, X_T4 )
ols_ira_est = ols_ira.fit().get_robustcov_results(cov_type = "HC1").summary2().tables[1].round(4)

print( ols_ira_est )

Next we try out partialling out with lasso

In [252]:
X_T4[ 'T4' ] = demean(X_T4[ 'T4' ])


### Results

Treatment group 4 experiences an average decrease of about $7.8\%$ in the length of the unemployment spell.
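Because the outcome is the log of duration, the coefficients are in log points, and reading them as percentages is an approximation. As a small worked check, the sketch below converts the T4 coefficients printed earlier for CL ($-0.085$) and CRA ($-0.077$) into exact percent changes via $100\,(e^{\beta} - 1)$. The helper name `log_points_to_percent` is introduced here for illustration only.

```python
import math

def log_points_to_percent(beta):
    """Exact percent change in the outcome implied by a coefficient on a log outcome."""
    return 100.0 * (math.exp(beta) - 1.0)

# T4 coefficients taken from the CL and CRA regression outputs above
pct_cl = log_points_to_percent(-0.085)
pct_cra = log_points_to_percent(-0.077)

print(f"CL : {pct_cl:.2f}%")
print(f"CRA: {pct_cra:.2f}%")
```

For effects of this size the exact percent change and the log-point approximation differ by only a few tenths of a percentage point.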


Observe that the regression estimators deliver estimates that are slightly more efficient (lower standard errors) than the simple two-sample difference in means, but essentially all methods have very similar standard errors (for example, $0.035$ for CRA versus $0.036$ for CL). From the IRA results we also see that there is no statistically detectable heterogeneity. The regression estimators also give estimates that are slightly smaller in magnitude ($-0.077$ for CRA versus $-0.085$ for CL); these differences are perhaps due to minor imbalance in the treatment allocation, which the regression estimators try to correct.


