This code is a translation of chapter five of ISL from https://github.com/hardikkamboj

# Workgroup 5: The Bootstrap

In this lab, we analyze the Pennsylvania re-employment bonus experiment, which was previously studied in "Sequential testing of duration data: the case of the Pennsylvania ‘reemployment bonus’ experiment" (Bilias, 2000), among others. These experiments were conducted in the 1980s by the U.S. Department of Labor to test the incentive effects of alternative compensation schemes for unemployment insurance (UI). In these experiments, UI claimants were randomly assigned either to a control group or one of five treatment groups. Actually, there are six treatment groups in the experiments. Here we focus on treatment group 4, but feel free to explore other treatment groups. In the control group the current rules of the UI applied. Individuals in the treatment groups were offered a cash bonus if they found a job within some pre-specified period of time (qualification period), provided that the job was retained for a specified duration. The treatments differed in the level of the bonus, the length of the qualification period, and whether the bonus was declining over time in the qualification period; see http://qed.econ.queensu.ca/jae/2000-v15.6/bilias/readme.b.txt for further details on data.

In [1]:
# import relevant libraries
import pandas as pd
import numpy as np

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.metrics import mean_squared_error

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# uploading the data
Penn = pd.read_csv("../data/penn_jae.dat" , sep='\s', engine='python')
print(Penn.shape)
Penn.head()

(13913, 24)


Unnamed: 0,abdt,tg,inuidur1,inuidur2,female,black,hispanic,othrace,dep,q1,...,q5,q6,recall,agelt35,agegt54,durable,nondurable,lusd,husd,muld
0,10824,0,18,18,0,0,0,0,2,0,...,0,0,0,0,0,0,0,1,0,
1,10635,2,7,3,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,
2,10551,5,18,6,1,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,
3,10824,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,
4,10747,0,27,27,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,


In [3]:
# subsetting the data: treatment group 4 and control group
Penn = Penn[ (Penn['tg'] == 4) | (Penn['tg'] == 0) ]
Penn['T4'] = (Penn[['tg']]==4).astype(int)

# Create category variable
Penn['dep'] = Penn['dep'].astype( 'category' )
#transform the inuidur1 varaible to log to run into the model 
Penn['inuidur1'] = np.log(Penn['inuidur1'])

Penn.head()

Unnamed: 0,abdt,tg,inuidur1,inuidur2,female,black,hispanic,othrace,dep,q1,...,q6,recall,agelt35,agegt54,durable,nondurable,lusd,husd,muld,T4
0,10824,0,2.890372,18,0,0,0,0,2,0,...,0,0,0,0,0,0,1,0,,0
3,10824,0,0.0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,,0
4,10747,0,3.295837,27,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,,0
11,10607,4,2.197225,9,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,,1
12,10831,0,3.295837,27,0,0,0,0,1,0,...,0,0,1,1,0,1,0,0,,0


In [4]:
# creating a function to get a sample with repetition
def get_indices(data,num_samples):
    return  np.random.choice(data.index, num_samples, replace=True)

In [5]:
get_indices(Penn, 10)

array([12508,  1822, 10993,   274,  8707,   166,  4458,    38,  5408,
        9612], dtype=int64)

## Estimates

In [6]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy
import numpy as np

In [7]:
# OLS estimation

model = "inuidur1~T4+ (female+black+othrace+C(dep)+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd)"
model_results = smf.ols( model , data = Penn ).fit().get_robustcov_results(cov_type = "HC1")
print(model_results.summary())

                            OLS Regression Results                            
Dep. Variable:               inuidur1   R-squared:                       0.038
Model:                            OLS   Adj. R-squared:                  0.035
Method:                 Least Squares   F-statistic:                     15.32
Date:                Wed, 08 Dec 2021   Prob (F-statistic):           6.43e-42
Time:                        09:01:49   Log-Likelihood:                -8128.2
No. Observations:                5099   AIC:                         1.629e+04
Df Residuals:                    5082   BIC:                         1.640e+04
Df Model:                          16                                         
Covariance Type:                  HC1                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       1.7723      0.050     35.154      

In [8]:
# a function to get estimates consiring bootstraping
def get_estimates(data,index):
    data_1 = data.loc[index]

    model = "inuidur1~T4+ (female+black+othrace+C(dep)+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd)"

    model_results = smf.ols( model , data = data_1 ).fit().get_robustcov_results(cov_type = "HC1")

    T4_coef = model_results.summary2().tables[1]['Coef.']['T4']
    female_coef = model_results.summary2().tables[1]['Coef.']['female']
    black_coef = model_results.summary2().tables[1]['Coef.']['black']

    return [T4_coef,female_coef, black_coef]

In [9]:
# a function to run the model 1,000 times with differents samples and get a mean of the coefficient of the 
# the relevant variables (T4, FEMALE and BLACK).
def boot(data,func,R):
    T4 = []
    female = []
    black = []
    for i in range(R):
        T4.append(func(data,get_indices(data,5099))[0])
        female.append(func(data,get_indices(data,5099))[1]) 
        black.append(func(data,get_indices(data,5099))[2]) 
    T4_statistics = {'estimated_value':np.mean(T4),'std_error':np.std(T4)}   
    female_statistices = {'estimated_value':np.mean(female),'std_error':np.std(female)}   
    black_statistices = {'estimated_value':np.mean(black),'std_error':np.std(black)} 
    return {'T4_statistics':T4_statistics,'female_statistics':female_statistices,'black_statistics':black_statistices}

In [10]:
# saving results
results = boot(Penn,get_estimates,1000)

In [11]:
# showing results
print('Result for T4 ',results['T4_statistics'])
print('Result for Female ',results['female_statistics'])
print('Result for Black ',results['black_statistics'])

Result for T4  {'estimated_value': -0.07709764832185141, 'std_error': 0.03552098274204299}
Result for Female  {'estimated_value': 0.13777038248282247, 'std_error': 0.034136740471159534}
Result for Black  {'estimated_value': -0.30842590152320065, 'std_error': 0.0629824407883496}


Finally we save the results on a table.

In [12]:
table_r = np.zeros((2, 6))
table_r[0,0] = model_results.summary2().tables[1]['Coef.']['T4']
table_r[0,1] = results['T4_statistics']['estimated_value']

table_r[0,2] = model_results.summary2().tables[1]['Coef.']['female']
table_r[0,3] = results['female_statistics']['estimated_value']

table_r[0,4] = model_results.summary2().tables[1]['Coef.']['black']
table_r[0,5] = results['black_statistics']['estimated_value']


table_r[1,0] = model_results.summary2().tables[1]['Std.Err.']['T4']
table_r[1,1] = results['T4_statistics']['std_error']

table_r[1,2] = model_results.summary2().tables[1]['Std.Err.']['T4']
table_r[1,3] = results['female_statistics']['std_error']

table_r[1,4] = model_results.summary2().tables[1]['Std.Err.']['black']
table_r[1,5] = results['black_statistics']['std_error']


table_r = pd.DataFrame(table_r, columns = ["T4", "T4_boot", "Female", "Female_boot",'Black','Black_boot'], \
                      index = ["estimate","standard error"])
table_r

Unnamed: 0,T4,T4_boot,Female,Female_boot,Black,Black_boot
estimate,-0.076206,-0.077098,0.138128,0.13777,-0.307905,-0.308426
standard error,0.035211,0.035521,0.035211,0.034137,0.059723,0.062982


It shows that the bootstrap estimates are similar to the coefficient of the model with the original dataset.