# Bootstrapping - Scripts in R and Python

### Data

We analyze the Pennsylvania re-employment bonus experiment, previously studied in "Sequential testing of duration data: the case of the Pennsylvania ‘reemployment bonus’ experiment" (Bilias, 2000). The experiment was conducted in the 1980s by the U.S. Department of Labor to test the incentive effects of alternative compensation schemes for unemployment insurance (UI). UI claimants were randomly assigned either to a control group or to one of six treatment groups. Here we focus on treatment group 4, but feel free to explore the other treatment groups. In the control group, the current UI rules applied. Individuals in the treatment groups were offered a cash bonus if they found a job within a pre-specified period of time (the qualification period), provided that they kept the job for a specified duration. The treatments differed in the level of the bonus, the length of the qualification period, and whether the bonus declined over time during the qualification period.

In [58]:
import pandas as pd
import numpy as np
import pyreadr

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.metrics import mean_squared_error

import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy

In [59]:
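## loading the data
# Note: sep=r'\s' splits on every single whitespace character; this appears to be what
# produces the spurious "Unnamed: 13" column shown below (r'\s+' would likely avoid it).
Penn = pd.read_csv("d:/Users/Manuela/Documents/GitHub/ECO224/Labs/data/penn_jae.dat", sep=r'\s', engine='python')
# number of observations (before subsetting)
n = Penn.shape[0]
# number of columns (before subsetting)
p_1 = Penn.shape[1]
# subset the data to tg == 4 or tg == 0 to compare treatment group 4 with the control group
# (.copy() avoids SettingWithCopyWarning when new columns are added later)
Penn = Penn[ (Penn['tg'] == 4) | (Penn['tg'] == 0) ].copy()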

In [60]:
# these columns were not dropped: Unnamed: 13, recall
Penn.columns
Penn.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5099 entries, 0 to 13911
Data columns (total 24 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   abdt         5099 non-null   int64  
 1   tg           5099 non-null   int64  
 2   inuidur1     5099 non-null   int64  
 3   inuidur2     5099 non-null   int64  
 4   female       5099 non-null   int64  
 5   black        5099 non-null   int64  
 6   hispanic     5099 non-null   int64  
 7   othrace      5099 non-null   int64  
 8   dep          5099 non-null   int64  
 9   q1           5099 non-null   int64  
 10  q2           5099 non-null   int64  
 11  q3           5099 non-null   int64  
 12  q4           5099 non-null   int64  
 13  Unnamed: 13  5099 non-null   int64  
 14  q5           5099 non-null   int64  
 15  q6           5099 non-null   int64  
 16  recall       5099 non-null   int64  
 17  agelt35      5099 non-null   int64  
 18  agegt54      5099 non-null   int64  
 19  durab

In [61]:
# Treatment indicator: 1 for treatment group 4, 0 for the control group
Penn['T4'] = (Penn['tg'] == 4).astype(int)

# Create category variable
Penn['dep'] = Penn['dep'].astype( 'category' )
Penn.head()

Unnamed: 0,abdt,tg,inuidur1,inuidur2,female,black,hispanic,othrace,dep,q1,...,q6,recall,agelt35,agegt54,durable,nondurable,lusd,husd,muld,T4
0,10824,0,18,18,0,0,0,0,2,0,...,0,0,0,0,0,0,1,0,,0
3,10824,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,,0
4,10747,0,27,27,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,,0
11,10607,4,9,9,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,,1
12,10831,0,27,27,0,0,0,0,1,0,...,0,0,1,1,0,1,0,0,,0


In [62]:
Penn['dep'].unique()

[2, 0, 1]
Categories (3, int64): [2, 0, 1]

### Model

To evaluate the impact of the treatments on unemployment duration, we consider the linear regression model:

$$
Y =  D \beta_1 + W'\beta_2 + \varepsilon, \quad E \varepsilon (D,W')' = 0,
$$
where $Y$ is the log of the duration of unemployment, $D$ is a treatment indicator, and $W$ is a set of controls including age group dummies, gender, race, number of dependents, quarter of the experiment, location within the state, existence of recall expectations, and type of occupation. Here $\beta_1$ is the ATE, if the RCT assumptions hold rigorously.

We also consider the interactive regression model:

$$
Y =  D \alpha_1 + D W' \alpha_2 + W'\beta_2 + \varepsilon, \quad E \varepsilon (D,W', DW')' = 0,
$$
where $W$'s are demeaned (apart from the intercept), so that $\alpha_1$ is the ATE, if the RCT assumptions hold rigorously.

Under RCT, the projection coefficient $\beta_1$ has the interpretation of the causal effect of the treatment on the average outcome. We thus refer to $\beta_1$ as the average treatment effect (ATE). Note that the covariates here are independent of the treatment $D$, so we can identify $\beta_1$ by a simple linear regression of $Y$ on $D$, without adding covariates. However, we do add covariates in an effort to improve the precision of our estimates of the average treatment effect.
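
The precision argument can be illustrated with a small simulation. The sketch below uses synthetic data, not the Penn data: the sample size, the effect size, and the names `D`, `W`, and `Y` are illustrative assumptions. It repeatedly draws a randomized treatment and compares the sampling spread of the treatment coefficient with and without the covariate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, true_effect = 500, 300, -0.1  # hypothetical values chosen for illustration


def ols_coefs(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


est_unadjusted, est_adjusted = [], []
for _ in range(reps):
    D = rng.integers(0, 2, size=n)           # randomized treatment, independent of W
    W = rng.normal(size=n)                   # a covariate that predicts the outcome
    Y = true_effect * D + 1.0 * W + rng.normal(size=n)
    ones = np.ones(n)
    # regression of Y on D only (2-sample comparison)
    est_unadjusted.append(ols_coefs(np.column_stack([ones, D]), Y)[1])
    # regression of Y on D and W (regression adjustment)
    est_adjusted.append(ols_coefs(np.column_stack([ones, D, W]), Y)[1])

mean_unadjusted, mean_adjusted = np.mean(est_unadjusted), np.mean(est_adjusted)
sd_unadjusted = np.std(est_unadjusted, ddof=1)
sd_adjusted = np.std(est_adjusted, ddof=1)
print("mean estimate (no covariates):  ", round(mean_unadjusted, 3))
print("mean estimate (with covariates):", round(mean_adjusted, 3))
print("spread (no covariates):  ", round(sd_unadjusted, 3))
print("spread (with covariates):", round(sd_adjusted, 3))
```

Both estimators are centered near the true effect because the treatment is randomized; the adjusted one varies less across samples because the covariate absorbs part of the outcome noise.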

### Analysis
We consider:

- classical 2-sample approach, no adjustment (CL)
- classical linear regression adjustment (CRA)
$$
Y =  D \beta_1 + W'\beta_2 + \varepsilon, \quad E \varepsilon (D,W')' = 0,
$$
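- interactive regression adjustment (IRA)

and carry out robust inference (the R version uses the estimatr package; the Python code here uses heteroskedasticity-robust HC1 standard errors from statsmodels).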
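
As a rough sketch of how the three approaches compare, the example below fits CL, CRA, and IRA on synthetic data with HC1 standard errors. The data-generating process, the column names `D`, `W`, `W_c`, and `Y`, and the single covariate are assumptions made for illustration; the lab itself uses the Penn covariates.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "D": rng.integers(0, 2, size=n),        # randomized treatment
    "W": rng.normal(loc=2.0, size=n),       # one covariate with a nonzero mean
})
# Heterogeneous effect: -0.1 + 0.2 * (W - 2), so the ATE is -0.1 by construction.
df["Y"] = (-0.1 + 0.2 * (df["W"] - 2.0)) * df["D"] + 1.0 * df["W"] + rng.normal(size=n)
# Demean the covariate so that the coefficient on D in the IRA model is the ATE.
df["W_c"] = df["W"] - df["W"].mean()

fits = {
    "CL": smf.ols("Y ~ D", data=df).fit(cov_type="HC1"),
    "CRA": smf.ols("Y ~ D + W_c", data=df).fit(cov_type="HC1"),
    "IRA": smf.ols("Y ~ D + D:W_c + W_c", data=df).fit(cov_type="HC1"),
}
for name, fit in fits.items():
    print(name, "estimate:", round(fit.params["D"], 3), "robust s.e.:", round(fit.bse["D"], 3))
```

Without demeaning, the coefficient on `D` in the interactive model would measure the effect at `W = 0` rather than the average effect.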

## Carry out covariate balance check

Since the libraries have already been imported above, we specify the regression directly:

In [93]:
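# Note: as written, this formula regresses the log outcome on T4 and the covariates (log_inuidur1 must
# already exist in Penn); a balance check in the strict sense would regress T4 on the covariates.
model = "log_inuidur1~T4 + (female+black+othrace+C(dep)+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd)"
model_results = smf.ols( model , data = Penn ).fit().get_robustcov_results(cov_type = "HC1")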

### Regress treatment on all covariates

In [94]:
print(model_results.summary())
print( "Number of regressors in the basic model:",len(model_results.params), '\n')

                            OLS Regression Results                            
Dep. Variable:           log_inuidur1   R-squared:                       0.038
Model:                            OLS   Adj. R-squared:                  0.035
Method:                 Least Squares   F-statistic:                     15.32
Date:                Fri, 12 Nov 2021   Prob (F-statistic):           6.43e-42
Time:                        16:28:48   Log-Likelihood:                -8128.2
No. Observations:                5099   AIC:                         1.629e+04
Df Residuals:                    5082   BIC:                         1.640e+04
Df Model:                          16                                         
Covariance Type:                  HC1                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       1.7723      0.050     35.154      

In [95]:
y = Penn[['T4']].reset_index( drop = True )

We see that, even though this is a randomized experiment, the balance conditions fail.
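
A balance check of this kind regresses the treatment indicator on the covariates and tests whether they are jointly significant. The sketch below shows the mechanics on synthetic data; the toy data frame and its three covariates are assumptions for illustration, and the numbers say nothing about the Penn data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 3000
toy = pd.DataFrame({
    "T4": rng.integers(0, 2, size=n),          # treatment assigned at random
    "female": rng.integers(0, 2, size=n),
    "black": rng.binomial(1, 0.12, size=n),
    "agelt35": rng.integers(0, 2, size=n),
})

# Regress the treatment indicator on the covariates with HC1 standard errors.
balance = smf.ols("T4 ~ female + black + agelt35", data=toy).fit(cov_type="HC1")

# Joint test that all covariate coefficients are zero.
joint_p = float(np.squeeze(balance.f_pvalue))
print(balance.params.round(3))
print("p-value of the joint test:", round(joint_p, 3))
```

Under successful randomization the covariate coefficients should be close to zero and the joint test should reject only at its nominal rate; a small p-value on real data signals imbalance.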

## Model Specification

In [96]:
# model specifications
# take log of inuidur1
Penn["log_inuidur1"] = np.log( Penn["inuidur1"] ) 
log_inuidur1 = pd.DataFrame(np.log( Penn["inuidur1"] ) ).reset_index( drop = True )

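# regression of log duration on T4 plus controls; despite the name formula_cl, this is not the
# unadjusted 2-sample specification, which would be 'log_inuidur1 ~ T4'
formula_cl = 'log_inuidur1~T4+ (female+black+othrace+C(dep)+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd)'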

# adding controls
# formula_cra = 'log_inuidur1 ~ T4 + (female+black+othrace+dep+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd)**2'
# Omitted dummies: q1, nondurable, muld

ols_cl = smf.ols( formula = formula_cl, data = Penn ).fit().get_robustcov_results(cov_type = "HC1")

We first define a function analogous to the one defined in the lab. It takes two arguments, the data and a set of row indices, and uses those indices to compute the estimates (here, the intercept and slope of a regression) for one bootstrap sample.

In [97]:
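# similar to boot.fn in lab

def get_estimates(data,index):
    """Regress 'black' on 'female' using the rows selected by index; return [intercept, slope]."""
    X = data['female'].loc[index]
    y = data['black'].loc[index]
    
    lr = LinearRegression()
    lr.fit(X.to_frame(),y)
    intercept = lr.intercept_
    # coef_ is a length-1 array; return the slope as a scalar
    coef = lr.coef_[0]
    return [intercept,coef]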

In [98]:
# modified version of the boot function used earlier
# get_indices is assumed to be defined earlier in the lab (it is not shown in this section)
def boot(data,func,R):
    intercept = []
    coeff = []
    for i in range(R):
        # draw one resample (here of size 100) and use it for both estimates,
        # so that the intercept and the slope come from the same bootstrap sample
        estimates = func(data,get_indices(data,100))
        intercept.append(estimates[0])
        coeff.append(estimates[1])
    intercept_statistics = {'estimated_value':np.mean(intercept),'std_error':np.std(intercept)}   
    coeff_statistices = {'estimated_value':np.mean(coeff),'std_error':np.std(coeff)}   
    return {'intercept':intercept_statistics,'coeff_statistices':coeff_statistices}
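
The helper get_indices is not shown in this section. A minimal sketch of what such a helper might look like is given below; the signature, the optional `rng` argument, and the toy data frame are assumptions, not the lab's actual definition. It samples row labels with replacement, which is what `.loc` needs when the index is not contiguous (as in Penn after subsetting).

```python
import numpy as np
import pandas as pd


def get_indices(data, num_samples, rng=None):
    """Hypothetical helper: draw num_samples row labels from data.index with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(data.index.to_numpy(), size=num_samples, replace=True)


# Toy frame with a non-contiguous index, mimicking a filtered data set.
toy = pd.DataFrame({"female": [0, 1, 0, 1, 1], "black": [0, 0, 1, 0, 0]},
                   index=[0, 3, 4, 11, 12])
idx = get_indices(toy, 10, rng=np.random.default_rng(0))
resample = toy.loc[idx]  # rows may repeat because sampling is with replacement
print(resample.shape)
```

Passing `len(data)` as `num_samples` gives the usual bootstrap, in which each resample is as large as the original data.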

In [99]:
results = boot(Penn,get_estimates,1000)

In [100]:
print('Result for intercept ',results['intercept'])
print('Result for coefficient term ',results['coeff_statistices'])

Result for intercept  {'estimated_value': 0.11710340375975938, 'std_error': 0.0410269994497691}
Result for coefficient term  {'estimated_value': 0.011961512570975438, 'std_error': 0.06683200635344153}


In [101]:
# now let's see what the model-based OLS fit gives
import statsmodels.api as sm
X = Penn['female']
y = Penn['black']

X = sm.add_constant(X)
results = sm.OLS(y,X).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  black   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     1.430
Date:                Fri, 12 Nov 2021   Prob (F-statistic):              0.232
Time:                        16:29:08   Log-Likelihood:                -1539.0
No. Observations:                5099   AIC:                             3082.
Df Residuals:                    5097   BIC:                             3095.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.1175      0.006     19.787      0.0

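The model-based standard errors are smaller than the bootstrap ones, but the two are not directly comparable here. Assuming get_indices(data, 100) draws 100 rows, each bootstrap regression uses only 100 observations, whereas the model uses all 5099. Since standard errors shrink roughly like $1/\sqrt{n}$, the model-based value of 0.006 for the intercept corresponds to about $0.006 \times \sqrt{5099/100} \approx 0.043$ at a sample size of 100, which is close to the bootstrap value of 0.041. The appeal of the bootstrap is that it does not rely on the assumptions (such as homoskedastic errors) behind the standard errors reported by sm.OLS, not that it is more precise.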
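
To see how the resample size affects the comparison, the sketch below uses synthetic binary data (an assumption standing in for the female and black columns) and computes the formula-based OLS standard errors alongside bootstrap standard errors for two resample sizes: the full sample size and 100.

```python
import numpy as np

rng = np.random.default_rng(42)
n, R = 2000, 500                                  # hypothetical sample size and replications
x = rng.binomial(1, 0.4, size=n).astype(float)    # stand-in for a binary regressor
y = rng.binomial(1, 0.12, size=n).astype(float)   # stand-in for a binary outcome
X = np.column_stack([np.ones(n), x])


def ols(X, y):
    """Least-squares coefficients (intercept, slope)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Classical (homoskedastic) standard errors: sqrt(diag(sigma^2 (X'X)^{-1})).
beta = ols(X, y)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se_formula = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))


def bootstrap_se(X, y, size, R, rng):
    """Standard deviation of OLS coefficients over R resamples of the given size."""
    draws = np.empty((R, X.shape[1]))
    for r in range(R):
        idx = rng.integers(0, len(y), size=size)  # row positions, with replacement
        draws[r] = ols(X[idx], y[idx])
    return draws.std(axis=0, ddof=1)


se_boot_full = bootstrap_se(X, y, n, R, rng)      # resamples of size n
se_boot_100 = bootstrap_se(X, y, 100, R, rng)     # resamples of size 100
print("formula s.e.:           ", se_formula.round(4))
print("bootstrap s.e. (size n):", se_boot_full.round(4))
print("bootstrap s.e. (size 100):", se_boot_100.round(4))
```

With resamples of size n the bootstrap and formula-based standard errors should roughly agree, while resamples of size 100 inflate the bootstrap standard errors by a factor near the square root of n/100.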