In [2]:
library(ISLR2)
set.seed(1)
library(boot)

### Workgroup 5

- Sonia Rosmery Asto
- Daniel Carrillo
- Nicole Linares

## Bootstrapping - Scripts in R

First, we load the data.

In [3]:
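## Load the data and keep only the rows with tg == 0 or tg == 4
Penn <- read.table("../data/penn_jae.dat", header = TRUE)
n <- dim(Penn)[1]    # number of rows in the full data set, before subsetting
p_1 <- dim(Penn)[2]  # number of columns
Penn <- subset(Penn, tg == 4 | tg == 0)
attach(Penn)         # later cells refer to columns such as tg directly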

We create an indicator for treatment group 4.

In [4]:
T4<- (tg==4)
summary(T4)

   Mode   FALSE    TRUE 
logical    3354    1745 

In [4]:
head(Penn)

Unnamed: 0,abdt,tg,inuidur1,inuidur2,female,black,hispanic,othrace,dep,q1,...,q5,q6,recall,agelt35,agegt54,durable,nondurable,lusd,husd,muld
1,10824,0,18,18,0,0,0,0,2,0,...,1,0,0,0,0,0,0,0,1,0
4,10824,0,1,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
5,10747,0,27,27,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
12,10607,4,9,9,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
13,10831,0,27,27,0,0,0,0,1,0,...,1,0,0,0,1,1,0,1,0,0
14,10845,0,27,27,1,0,0,0,0,0,...,1,0,0,0,1,0,0,1,0,0


In [5]:
dim(Penn)

We create the `boot.fn` function, which returns the coefficient estimates of the linear regression fitted on the observations selected by `index`. We use it both to compute the bootstrap estimates and to fit the regression on the full sample, so that we can compare the coefficients and standard errors from the two approaches.

Since the data set has 5,099 observations, each bootstrap sample also draws 5,099 observations, with replacement. For every such sample, the function returns the estimated betas associated with the variables.
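
To illustrate the `(data, index)` convention described above, here is a minimal sketch on simulated data. The data frame `toy` and the function `toy.fn` are hypothetical names used only for this illustration; they are not part of the Penn analysis.

```r
# Toy data (hypothetical; not the Penn data set)
set.seed(1)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- 1 + 2 * toy$x + rnorm(n)

# Statistic in the (data, index) form: refit on the rows picked by `index`
toy.fn <- function(data, index)
    coef(lm(y ~ x, data = data[index, ]))

# Using every row exactly once reproduces the full-sample fit
full_fit <- toy.fn(toy, 1:n)

# Drawing n row numbers with replacement gives one bootstrap replicate
idx <- sample(n, n, replace = TRUE)
one_replicate <- toy.fn(toy, idx)

full_fit
one_replicate
```

Because `idx` is drawn with replacement, some rows appear several times and others not at all, which is why the replicate differs from the full-sample fit.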

In [6]:
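# Returns the OLS coefficients fitted on the rows of `data` selected by `index`.
# T4 is not a column of `data`; lm() finds it in the global environment.
boot.fn <- function(data, index)
    coef(lm(log(inuidur1)~ T4 + (female+black+othrace+factor(dep)+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd), data=data, subset=index))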

In [7]:
lm(log(inuidur1)~ T4 + (female+black+othrace+factor(dep)+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd),data=Penn)


Call:
lm(formula = log(inuidur1) ~ T4 + (female + black + othrace + 
    factor(dep) + q2 + q3 + q4 + q5 + q6 + agelt35 + agegt54 + 
    durable + lusd + husd), data = Penn)

Coefficients:
 (Intercept)        T4TRUE        female         black       othrace  
    2.178462     -0.071692      0.126368     -0.293768     -0.472445  
factor(dep)1  factor(dep)2            q2            q3            q4  
    0.029867      0.096187      0.073678     -0.038507     -0.054949  
          q5            q6       agelt35       agegt54       durable  
   -0.144178      0.003361     -0.162772      0.229667      0.126557  
        lusd          husd  
   -0.175353     -0.105225  


In [8]:
boot.fn(Penn, 1:5099)

In [9]:
sample(5099,5099,replace=T)

In [10]:
set.seed(1)
boot.fn(Penn, sample(5099,5099,replace=T))

With the function `boot()` we compute the standard errors of 1,000 bootstrap estimates for $\beta_0$ (the intercept) and for the coefficients on all the other variables.

In [11]:
alo <- boot(Penn, boot.fn, 1000)

In [12]:
alo


ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Penn, statistic = boot.fn, R = 1000)


Bootstrap Statistics :
         original        bias    std. error
t1*   2.178462326  0.0010812733  0.15469174
t2*  -0.071692484 -0.0003483223  0.03552280
t3*   0.126368328  0.0007853971  0.03518438
t4*  -0.293767980  0.0007320124  0.06028143
t5*  -0.472445058 -0.0049861202  0.24369741
t6*   0.029866899  0.0007690384  0.05555582
t7*   0.096186517 -0.0011770733  0.04531520
t8*   0.073678072 -0.0010033034  0.15235277
t9*  -0.038506537 -0.0003146298  0.15005091
t10* -0.054949195 -0.0008440703  0.15097113
t11* -0.144177912 -0.0015435867  0.15021731
t12*  0.003361318 -0.0038177923  0.16140880
t13* -0.162772168  0.0017786902  0.03816477
t14*  0.229666708  0.0002651347  0.05836239
t15*  0.126557359  0.0001236963  0.04853387
t16* -0.175352572 -0.0006751971  0.04095015
t17* -0.105224727 -0.0007444597  0.04512871
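
The three columns printed by `boot()` can be reproduced by hand: `original` is the statistic on the full sample, `bias` is the mean of the bootstrap replicates minus `original`, and `std. error` is the standard deviation of the replicates. The following is a minimal base-R sketch on simulated data; `toy`, `slope`, and `t_star` are hypothetical names for this illustration, and only 500 replicates are used to keep it quick.

```r
# Toy data (hypothetical; not the Penn data set)
set.seed(1)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- 1 + 2 * toy$x + rnorm(n)

# Statistic: slope of y on x, fitted on the rows selected by `index`
slope <- function(data, index)
    coef(lm(y ~ x, data = data[index, ]))[["x"]]

R <- 500
t0 <- slope(toy, 1:n)                                          # "original"
t_star <- replicate(R, slope(toy, sample(n, n, replace = TRUE)))  # replicates

c(original  = t0,
  bias      = mean(t_star) - t0,   # mean of replicates minus original
  std.error = sd(t_star))          # spread of the replicates
```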

In [13]:
ols <- lm(log(inuidur1)~ T4 + (female+black+othrace+factor(dep)+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd), data = Penn)

In [14]:
ols_1 <- summary(ols)$coef  # summary() of a fitted lm takes no data argument
ols_1

Unnamed: 0,Estimate,Std. Error,t value,Pr(>|t|)
(Intercept),2.178462326,0.15901507,13.69972271,5.63126e-42
T4TRUE,-0.071692484,0.03546326,-2.02159887,0.04327012
female,0.126368328,0.03482493,3.62867401,0.0002876799
black,-0.29376798,0.05297556,-5.54534899,3.081966e-08
othrace,-0.472445058,0.1983975,-2.38130547,0.01728801
factor(dep)1,0.029866899,0.05414025,0.55165799,0.581207
factor(dep)2,0.096186517,0.04686228,2.05253621,0.04016868
q2,0.073678072,0.15682593,0.46980799,0.6385124
q3,-0.038506537,0.15647802,-0.24608272,0.8056281
q4,-0.054949195,0.15656019,-0.35097809,0.7256193


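The bootstrap and the linear regression give slightly different standard errors. For $T4$, $female$ and $black$, the linear regression standard errors are $0.03546326$, $0.03482493$ and $0.05297556$, while the bootstrap values used in the comparison table are $0.03420464$, $0.03540692$ and $0.05738512$. With these values, the bootstrap standard error is a little higher for $female$ and $black$ and a little lower for $T4$. Note that these bootstrap values are not the ones printed by `boot()` above ($0.03552280$, $0.03518438$ and $0.06028143$), presumably because they come from a different run; bootstrap estimates change from run to run unless the seed is fixed right before the call. Even when the bootstrap standard errors are higher, they are arguably more trustworthy, since the bootstrap does not depend on the assumptions behind the usual linear regression standard-error formula.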

In [15]:
# Compare OLS and bootstrap standard errors for the first four coefficients
se_table <- matrix(0, 2, 4)

# Row 1: OLS standard errors (second column of the coefficient matrix)
se_table[1, ] <- ols_1[1:4, 2]

# Row 2: bootstrap standard errors, hard-coded from a bootstrap run
se_table[2, ] <- c(0.15511158, 0.03420464, 0.03540692, 0.05738512)

colnames(se_table) <- c('intercept', 'T4TRUE', 'female', 'black')
rownames(se_table) <- c('linear_std', 'boot_std')

se_table

Unnamed: 0,intercept,T4TRUE,female,black
linear_std,0.1590151,0.03546326,0.03482493,0.05297556
boot_std,0.1551116,0.03420464,0.03540692,0.05738512
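
The bootstrap row above is typed in by hand, so it can drift out of sync with the `boot()` output. As a possible alternative, the standard errors can be read directly from the object returned by `boot()`, whose component `t` holds one row per bootstrap replicate and one column per coefficient. The sketch below assumes the built-in `mtcars` data and a hypothetical statistic `coef.fn` rather than the Penn regression.

```r
library(boot)

set.seed(1)

# Hypothetical statistic: OLS coefficients on the rows selected by `index`
coef.fn <- function(data, index)
    coef(lm(mpg ~ wt + hp, data = data[index, ]))

fit <- lm(mpg ~ wt + hp, data = mtcars)
b <- boot(mtcars, coef.fn, R = 500)

# Stack both sets of standard errors; columns are named after the coefficients
se_table <- rbind(
    linear_std = summary(fit)$coefficients[, "Std. Error"],
    boot_std   = apply(b$t, 2, sd)   # sd of each column of replicates
)
se_table
```

Built this way, the table always reflects the bootstrap run that was actually executed.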


## Part 2
The second part of the R script (Causal Tree - Only script in R) can be found in the file Group5_lab5_R_2. The work was divided into two parts because the code ran into problems when both parts were put together.