## Workgroup 5

Members:
* Diego Gómez
* Alexander Pacheco

In this workgroup we will use bootstrapping tools and causal trees. For both analyses, use the Pennsylvania re-employment bonus experiment data (penn_jae.dat in the data folder). Subset the data to tg==4 | tg==0, so that we compare treatment group 4 with the control group.

# Bootstrapping - Python

For the bootstrap section, use the following specification: log(inuidur1) ~ T4 + (female+black+othrace+factor(dep)+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd). No quadratic term is required. Next, compute the standard errors of the T4, female and black coefficients from 1,000 bootstrap estimates. Describe in detail each step you follow and which lines of code you changed. Finally, present your results in a table.
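For reference, the quantity being asked for can be written down explicitly. The notation is ours, not the assignment's: $\hat\beta_j^{*(r)}$ denotes the coefficient on variable $j$ estimated on the $r$-th bootstrap sample, and $R = 1000$ is the number of replications. Under those assumptions, the bootstrap standard error is the standard deviation of the replicated estimates:

```latex
\widehat{\mathrm{se}}_{\mathrm{boot}}(\hat\beta_j)
  = \sqrt{\frac{1}{R-1}\sum_{r=1}^{R}\left(\hat\beta_j^{*(r)} - \bar\beta_j^{*}\right)^{2}},
\qquad
\bar\beta_j^{*} = \frac{1}{R}\sum_{r=1}^{R}\hat\beta_j^{*(r)}
```

With $R = 1000$, dividing by $R$ instead of $R-1$ changes the result only marginally.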

In [1]:
# We import some relevant packages
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.metrics import mean_squared_error
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# We import the data
# A raw string avoids Python's invalid-escape warning for '\s'; the separator itself is unchanged
Penn = pd.read_csv("../data/penn_jae.dat", sep=r'\s', engine='python')
print(Penn.shape)
Penn.head()

(13913, 24)


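| | abdt | tg | inuidur1 | inuidur2 | female | black | hispanic | othrace | dep | q1 | ... | q5 | q6 | recall | agelt35 | agegt54 | durable | nondurable | lusd | husd | muld |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10824 | 0 | 18 | 18 | 0 | 0 | 0 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | |
| 1 | 10635 | 2 | 7 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | |
| 2 | 10551 | 5 | 18 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | |
| 3 | 10824 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | |
| 4 | 10747 | 0 | 27 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | |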


In [3]:
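# We subset the data for tg==4 | tg==0
# .copy() makes the subset an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
data = Penn[ (Penn['tg'] == 4) | (Penn['tg'] == 0) ].copy()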
print(data.shape)

(5099, 24)


In [6]:
# We take the log of inuidur1 and define T4 as an integer dummy equal to 1 for treatment group 4
data['log_inuidur1'] = np.log( data["inuidur1"] )
data['T4'] = (data['tg'] == 4).astype(int)
data.head()



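| | abdt | tg | inuidur1 | inuidur2 | female | black | hispanic | othrace | dep | q1 | ... | recall | agelt35 | agegt54 | durable | nondurable | lusd | husd | muld | log_inuidur1 | T4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10824 | 0 | 18 | 18 | 0 | 0 | 0 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | | 2.890372 | 0 |
| 3 | 10824 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | | 0.0 | 0 |
| 4 | 10747 | 0 | 27 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | | 3.295837 | 0 |
| 11 | 10607 | 4 | 9 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | | 2.197225 | 1 |
| 12 | 10831 | 0 | 27 | 27 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 1 | 0 | 1 | 0 | 0 | | 3.295837 | 0 |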


* We define the function "get_indices". It draws row labels from the data at random with replacement, so each call returns the index of one bootstrap sample.

In [7]:
def get_indices(data,num_samples):
    return  np.random.choice(data.index, num_samples, replace=True)

In [8]:
get_indices(data,5099)

array([13215, 10361,  9114, ..., 11485,  1496,  4157], dtype=int64)
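Because the draws above are random, the bootstrap results change from run to run. One possible way to make them reproducible is to pass an explicitly seeded generator. The sketch below is illustrative only: the helper name `get_indices_seeded`, the toy DataFrame, and the seed value are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with a non-contiguous index, similar in spirit to the subset data (values are made up)
toy = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[0, 3, 4, 11])

def get_indices_seeded(data, num_samples, rng):
    # Draw index labels with replacement using an explicit random generator
    return rng.choice(data.index.to_numpy(), size=num_samples, replace=True)

rng = np.random.default_rng(123)  # the seed value is arbitrary
idx = get_indices_seeded(toy, len(toy), rng)

# .loc selects by label, so a label drawn twice yields the same row twice
resampled = toy.loc[idx]
print(resampled.shape)
```

Re-creating the generator with the same seed reproduces the same sequence of draws, which makes a bootstrap table repeatable across runs.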

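* We define the function "get_estimates", which fits the linear regression on the rows selected by a bootstrap index and returns the estimated coefficients.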

In [9]:
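def get_estimates(data,index):
    # Regressors in the order T4, female, black, ...; boot() relies on this order
    # Note: dep enters as a single numeric column here, not as factor(dep) dummies
    X = data[['T4','female','black','othrace','dep','q2','q3','q4','q5','q6','agelt35','agegt54','durable','lusd','husd']].loc[index]
    y = data['log_inuidur1'].loc[index]
    
    # LinearRegression fits an intercept by default; coef_ holds only the slopes
    lr = LinearRegression()
    lr.fit(X,y)
    coef = lr.coef_
    return [coef]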
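The task formula uses factor(dep), whereas get_estimates passes dep as one numeric column. A minimal sketch of how dep could be expanded into dummies follows. It runs on synthetic data with made-up coefficients, assumes dep takes the values 0, 1 and 2, and is not the notebook's own specification.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the experiment data (all values are made up)
toy = pd.DataFrame({
    "T4": rng.integers(0, 2, size=200),
    "dep": rng.integers(0, 3, size=200),
})
toy["y"] = 0.5 * toy["T4"] + 1.0 * (toy["dep"] == 2) + rng.normal(size=200)

# One dummy per level of dep, dropping the first level as the baseline category
dep_dummies = pd.get_dummies(toy["dep"], prefix="dep", drop_first=True).astype(int)
X = pd.concat([toy[["T4"]], dep_dummies], axis=1)

lr = LinearRegression().fit(X, toy["y"])

# Label the coefficients by column name instead of relying on position
coefs = pd.Series(lr.coef_, index=X.columns)
print(coefs)
```

Labelling the coefficients by column name also removes the need to remember that positions 0, 1 and 2 correspond to T4, female and black.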

In [10]:
# We define n as the number of observations
n = data.shape[0]

# We have to compute the standard errors of 1,000 bootstrap estimates for the T4, female and black variables

def boot(data,function,R):
    # Number of observations drawn in each bootstrap sample
    n_obs = data.shape[0]
    T4 = []
    female = []
    black = []
    
    for i in range(R):
        # Draw one bootstrap sample per replication and read the three
        # coefficients from the same fit (positions follow the column order of X)
        coef = function(data,get_indices(data,n_obs))[0]
        T4.append(coef[0])
        female.append(coef[1])
        black.append(coef[2])
        
    # The bootstrap standard error is the standard deviation of the R estimates
    stats_T4 = {'mean':np.mean(T4),'std_error':np.std(T4)}   
    stats_female = {'mean':np.mean(female),'std_error':np.std(female)}
    stats_black = {'mean':np.mean(black),'std_error':np.std(black)}
    
    return {'statistics_T4':stats_T4,'statistics_female':stats_female,'statistics_black':stats_black}

In [11]:
# We now obtain the standard errors from 1,000 bootstrap estimates
excersice = boot(data,get_estimates,1000)
excersice

{'statistics_T4': {'mean': -0.0774861648570279,
  'std_error': 0.035203046877774674},
 'statistics_female': {'mean': 0.1378877135584742,
  'std_error': 0.03405110541798197},
 'statistics_black': {'mean': -0.30934617040168066,
  'std_error': 0.05905903174379431}}

In [13]:
# We summarize our results in the following table
table = np.zeros((3, 2))
table[0,0] = excersice['statistics_T4']['mean']
table[1,0] = excersice['statistics_female']['mean']
table[2,0] = excersice['statistics_black']['mean']

table[0,1] = excersice['statistics_T4']['std_error']
table[1,1] = excersice['statistics_female']['std_error']
table[2,1] = excersice['statistics_black']['std_error']

table = pd.DataFrame(table, columns = ["Mean","Standard Error"], \
                      index = ["T4", "Female", "Black"])
table

| | Mean | Standard Error |
|---|---|---|
| T4 | -0.077486 | 0.035203 |
| Female | 0.137888 | 0.034051 |
| Black | -0.309346 | 0.059059 |
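One way to sanity-check bootstrap standard errors like these is to compare them with the usual analytic OLS standard errors. The sketch below does this on synthetic data with homoskedastic errors, using NumPy only. The data-generating process, sample size and seed are assumptions chosen for illustration, and none of the numbers relate to the Penn results.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Synthetic regression: y = 1 + 2x + noise (made-up coefficients)
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])  # intercept column plus regressor

def ols(X, y):
    # Least-squares coefficients
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Analytic standard errors: sqrt of the diagonal of sigma^2 * (X'X)^{-1}
beta = ols(X, y)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
analytic_se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# Bootstrap: resample rows with replacement and refit R times
R = 1000
draws = np.empty((R, X.shape[1]))
for r in range(R):
    idx = rng.integers(0, n, size=n)
    draws[r] = ols(X[idx], y[idx])
boot_se = draws.std(axis=0, ddof=1)

print("analytic:", analytic_se)
print("bootstrap:", boot_se)
```

In a setting like this the two sets of standard errors should be close. A large gap on real data can point to heteroskedasticity or to a mistake in the resampling code.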
