# EEP C118 Section 3: R Demonstration

Let's practice running multiple linear regression in R. Suppose we want to know the relationship between hours slept and hours worked. To do this, we will use `sleep75.dta`, which contains the relevant data. Remember that to read in `.dta` files, we need the `haven` package.

In [1]:
library(tidyverse)
library(haven)
sleepdata <- read_dta("sleep75.dta")

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.3     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   2.0.1     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [2]:
head(sleepdata)

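| age | black | case | clerical | construc | educ | earns74 | gdhlth | inlf | leis1 | ⋯ | spwrk75 | totwrk | union | worknrm | workscnd | exper | yngkid | yrsmarr | hrwage | agesq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 32 | 0 | 1 | 0 | 0 | 12 | 0 | 0 | 1 | 3529 | ⋯ | 0 | 3438 | 0 | 3438 | 0 | 14 | 0 | 13 | 7.070004 | 1024 |
| 31 | 0 | 2 | 0 | 0 | 14 | 9500 | 1 | 1 | 2140 | ⋯ | 0 | 5020 | 0 | 5020 | 0 | 11 | 0 | 0 | 1.429999 | 961 |
| 44 | 0 | 3 | 0 | 0 | 17 | 42500 | 1 | 1 | 4595 | ⋯ | 1 | 2815 | 0 | 2815 | 0 | 21 | 0 | 0 | 20.529997 | 1936 |
| 30 | 0 | 4 | 0 | 0 | 12 | 42500 | 1 | 1 | 3211 | ⋯ | 1 | 3786 | 0 | 3786 | 0 | 12 | 0 | 12 | 9.619998 | 900 |
| 64 | 0 | 5 | 0 | 0 | 14 | 2500 | 1 | 1 | 4052 | ⋯ | 1 | 2580 | 0 | 2580 | 0 | 44 | 0 | 33 | 2.75 | 4096 |
| 41 | 0 | 6 | 0 | 0 | 12 | 0 | 1 | 1 | 4812 | ⋯ | 0 | 1205 | 0 | 0 | 1205 | 23 | 0 | 23 | 19.249998 | 1681 |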


Uh oh. The data set has so many columns that Jupyter isn't showing us the middle ones. But we can get a list of all the column names by calling `colnames()`.

In [3]:
colnames(sleepdata)
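
With this many columns, scanning the full list by eye gets tedious. One option is to search the names by pattern. Here is a minimal base R sketch using a small stand-in data frame called `toy` (a made-up name), which holds the first two rows of a few columns copied from the `head()` output above:

```r
# Stand-in data frame: first two rows of a few columns from the head() output.
# `toy` is a hypothetical name used only for this illustration.
toy <- data.frame(age = c(32, 31), educ = c(12, 14),
                  spwrk75 = c(0, 0), totwrk = c(3438, 5020),
                  yngkid = c(0, 0))

# Return every column name that contains "wrk"
wrk_cols <- grep("wrk", names(toy), value = TRUE)
print(wrk_cols)
```

On the real data, the equivalent call would be `grep("wrk", colnames(sleepdata), value = TRUE)`.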

In [3]:
summary(sleepdata$yngkid)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.1289  0.0000  1.0000 

Looks like _sleep_ and _totwrk_ are our main variables of interest. But note that in this data set, _sleep_ is measured in minutes of sleep per week, and _totwrk_ in minutes worked per week.

In [4]:
summary(sleepdata$sleep)
summary(sleepdata$totwrk)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    755    3015    3270    3266    3532    4695 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    1554    2288    2123    2692    6415 

Let's make a new variable called _sleephrs_ that is hours slept per night.

In [5]:
# Convert minutes per week to average hours per night (7 days x 60 minutes)
sleepdata$sleephrs <- sleepdata$sleep / (7 * 60)
summary(sleepdata$sleephrs)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.798   7.179   7.787   7.777   8.410  11.179 

Let's do the same thing for _totwrk_, making a variable called _wrkhrs_ that is hours worked per day.

In [6]:
# Convert minutes per week to average hours worked per day
sleepdata$wrkhrs <- sleepdata$totwrk / (7 * 60)
summary(sleepdata$wrkhrs)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.699   5.448   5.055   6.409  15.274 
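
Both conversions divide by $7 \times 60 = 420$, since an average of one hour per day corresponds to 420 minutes per week. As a quick sanity check, here is a sketch that wraps the conversion in a helper (the name `per_week_min_to_daily_hrs` is made up for illustration) and applies it to the minimum and maximum values reported by `summary()` above:

```r
# Hypothetical helper: minutes per week -> average hours per day
per_week_min_to_daily_hrs <- function(minutes_per_week) {
  minutes_per_week / (7 * 60)   # 7 days per week, 60 minutes per hour
}

# Min and max of sleep and totwrk, copied from the summary() output above
sleep_range  <- c(min = 755, max = 4695)
totwrk_range <- c(min = 0,   max = 6415)

sleephrs_range <- per_week_min_to_daily_hrs(sleep_range)
wrkhrs_range   <- per_week_min_to_daily_hrs(totwrk_range)

print(round(sleephrs_range, 3))
print(round(wrkhrs_range, 3))
```

The rounded results should agree with the minimum and maximum of _sleephrs_ and _wrkhrs_ in the summaries above.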

Great! Now let's run a regression and look at the results.

In [7]:
slr<-lm(sleephrs~wrkhrs, data=sleepdata)
summary(slr)


Call:
lm(formula = sleephrs ~ wrkhrs, data = sleepdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.7856 -0.5720  0.0117  0.5965  3.1898 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.53899    0.09265  92.165   <2e-16 ***
wrkhrs      -0.15075    0.01674  -9.005   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.003 on 704 degrees of freedom
Multiple R-squared:  0.1033,	Adjusted R-squared:  0.102 
F-statistic: 81.09 on 1 and 704 DF,  p-value: < 2.2e-16
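
To make the fitted line concrete, the estimates printed above can be plugged in by hand. The sketch below hard-codes the two coefficients from the output (so it runs without the data) and evaluates the fitted line at a few values of _wrkhrs_. The values 0, 4, and 8 are arbitrary choices for illustration, not taken from the data.

```r
# Coefficients copied from the summary(slr) output above
b0 <- 8.53899    # intercept: predicted sleephrs when wrkhrs = 0
b1 <- -0.15075   # slope: change in sleephrs per extra hour worked per day

# Illustrative values of hours worked per day (arbitrary choices)
wrkhrs_grid <- c(0, 4, 8)

# Fitted values from the estimated line: sleephrs_hat = b0 + b1 * wrkhrs
sleephrs_hat <- b0 + b1 * wrkhrs_grid
print(data.frame(wrkhrs = wrkhrs_grid, sleephrs_hat = sleephrs_hat))

# The slope expressed in minutes of sleep per extra hour worked
slope_minutes <- b1 * 60
print(slope_minutes)
```

With the fitted model object available, `predict(slr, newdata = data.frame(wrkhrs = c(0, 4, 8)))` should give the same fitted values, up to rounding of the printed coefficients.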


So it seems like for every additional hour a person works per day, they sleep about 0.15 hours less per night. But is this causal? Not necessarily. Recall SLR 4, the zero conditional mean assumption $E[u\vert x]=0$. Is this likely to be true here? What are some confounders that we might be picking up in $u$ with this specification? Maybe age, health status, gender, etc. Let's do a multiple linear regression where we control for some more factors.
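
Before doing so, a small simulation sketch can show why an omitted confounder matters. Everything in it is made up for illustration (the variables `x`, `z`, and `y`, the sample size, the seed, and the true coefficients); it is not based on the sleep data. The confounder `z` affects both `x` and `y`, so leaving it out violates $E[u\vert x]=0$ and biases the coefficient on `x`.

```r
set.seed(118)   # arbitrary seed so the sketch is reproducible
n <- 5000

z <- rnorm(n)                          # hypothetical confounder
x <- 0.8 * z + rnorm(n)                # regressor correlated with z
y <- 1 - 0.5 * x + 1.0 * z + rnorm(n)  # true coefficient on x is -0.5

short <- lm(y ~ x)      # omits z, so z ends up in the error term
long  <- lm(y ~ x + z)  # controls for z

b_short <- unname(coef(short)["x"])
b_long  <- unname(coef(long)["x"])

print(c(short = b_short, long = b_long))
```

In this setup the regression that omits `z` should land far from the true value of -0.5, while the one that controls for `z` should land close to it.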

In [12]:
mlr<-lm(sleephrs~wrkhrs+gdhlth+age+male, data=sleepdata)
summary(mlr)


Call:
lm(formula = sleephrs ~ wrkhrs + gdhlth + age + male, data = sleepdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5354 -0.5677  0.0094  0.6252  3.2351 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.452901   0.196528  43.011   <2e-16 ***
wrkhrs      -0.163110   0.018035  -9.044   <2e-16 ***
gdhlth      -0.221831   0.121370  -1.828   0.0680 .  
age          0.005919   0.003329   1.778   0.0758 .  
male         0.205585   0.081706   2.516   0.0121 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.995 on 701 degrees of freedom
Multiple R-squared:  0.1208,	Adjusted R-squared:  0.1157 
F-statistic: 24.07 on 4 and 701 DF,  p-value: < 2.2e-16


When we control for these other factors, the coefficient on _wrkhrs_ stays fairly consistent (about -0.16, compared with -0.15 before). This tells us that even after holding health status, age, and gender constant, working more is still associated with less sleep. However, we can still think of other confounders contained in $u$ even in this specification. To be more confident in our result, let's add some more covariates and see what happens.

In [19]:
mlr2<-lm(sleephrs~wrkhrs+gdhlth+age+male+lhrwage+clerical+marr+black+earns74+union+exper, data=sleepdata)
summary(mlr2)


Call:
lm(formula = sleephrs ~ wrkhrs + gdhlth + age + male + lhrwage + 
    clerical + marr + black + earns74 + union + exper, data = sleepdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3431 -0.5948 -0.0021  0.5899  3.0868 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.643e+00  4.339e-01  19.919  < 2e-16 ***
wrkhrs      -1.531e-01  2.104e-02  -7.279 1.25e-12 ***
gdhlth      -1.756e-01  1.378e-01  -1.274    0.203    
age         -9.461e-03  1.983e-02  -0.477    0.633    
male         1.016e-01  1.077e-01   0.944    0.346    
lhrwage      4.543e-02  8.452e-02   0.537    0.591    
clerical     7.525e-02  1.178e-01   0.639    0.523    
marr         1.464e-01  1.117e-01   1.311    0.190    
black       -1.613e-01  1.944e-01  -0.830    0.407    
earns74     -8.388e-06  5.986e-06  -1.401    0.162    
union        5.240e-02  1.045e-01   0.501    0.616    
exper        1.260e-02  1.821e-02   0.692    0.489    
---
Signif. codes:  0 ‘***’ 0.

Wow, our coefficient estimate on _wrkhrs_ is still quite stable, making us more confident that working 1 more hour leads to about 0.15 fewer hours of sleep a night. However, note that in our `mlr` regression, being male was associated with about 0.21 additional hours of sleep a night, but when we add the other covariates in `mlr2`, this coefficient shrinks to about 0.10 and is no longer statistically significant.
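
One way to see both points is to put approximate 95% confidence intervals around the reported estimates. The sketch below hard-codes the estimates and standard errors from the three outputs above and uses the normal approximation (estimate ± 1.96 standard errors). That is only an approximation; exact intervals would use the t distribution with each model's degrees of freedom.

```r
# Estimates and standard errors copied from the regression outputs above
est <- data.frame(
  model    = c("slr", "mlr", "mlr2", "mlr", "mlr2"),
  term     = c("wrkhrs", "wrkhrs", "wrkhrs", "male", "male"),
  estimate = c(-0.15075, -0.163110, -0.1531, 0.205585, 0.1016),
  se       = c(0.01674, 0.018035, 0.02104, 0.081706, 0.1077)
)

# Approximate 95% intervals using the normal critical value 1.96
est$lower <- est$estimate - 1.96 * est$se
est$upper <- est$estimate + 1.96 * est$se

# An interval that contains 0 means the estimate is not significant at 5%
est$covers_zero <- est$lower < 0 & est$upper > 0

print(est)
```

The three _wrkhrs_ intervals overlap heavily and all exclude zero, while the interval for _male_ excludes zero in `mlr` but contains it in `mlr2`. With the model objects in hand, `confint(mlr)` and `confint(mlr2)` give the exact t-based intervals.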