This notebook contains an example for teaching.

## Introduction

In labor economics an important question is what determines the wage of workers. This is a causal question,
but we could begin to investigate from a predictive perspective.

In the following wage example, $Y$ is the hourly wage of a worker and $X$ is a vector of worker's characteristics, e.g., education, experience, gender. Two main questions here are:


* How to use job-relevant characteristics, such as education and experience, to best predict wages?

* What is the difference in predicted wages between men and women with the same job-relevant characteristics?

In this lab, we focus on the prediction question first.

## Data


The data set we consider is from the March Supplement of the U.S. Current Population Survey, year 2015.  We select white non-hispanic individuals, aged 25 to 64 years, and working more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural or private household sectors;  individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with hourly wage below $3$. 

The variable of interest $Y$ is the hourly wage rate constructed as the ratio of the annual earnings to the total number of hours worked, which is constructed in turn as the product of number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never married) workers. The final sample is of size $n=5150$.

## Data analysis

We start by loading the data set.

In [1]:
load("../data/wage2015_subsample_inference.Rdata")
dim(data)

Let's have a look at the structure of the data.

In [2]:
str(data)

'data.frame':	5150 obs. of  20 variables:
 $ wage : num  9.62 48.08 11.06 13.94 28.85 ...
 $ lwage: num  2.26 3.87 2.4 2.63 3.36 ...
 $ sex  : num  1 0 0 1 1 1 1 0 1 1 ...
 $ shs  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hsg  : num  0 0 1 0 0 0 1 1 1 0 ...
 $ scl  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ clg  : num  1 1 0 0 1 1 0 0 0 1 ...
 $ ad   : num  0 0 0 1 0 0 0 0 0 0 ...
 $ mw   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ so   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ we   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ne   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ exp1 : num  7 31 18 25 22 1 42 37 31 4 ...
 $ exp2 : num  0.49 9.61 3.24 6.25 4.84 ...
 $ exp3 : num  0.343 29.791 5.832 15.625 10.648 ...
 $ exp4 : num  0.24 92.35 10.5 39.06 23.43 ...
 $ occ  : Factor w/ 369 levels "10","20","40",..: 159 136 269 23 99 86 226 232 184 146 ...
 $ occ2 : Factor w/ 22 levels "1","2","3","4",..: 11 10 19 1 6 5 17 17 13 10 ...
 $ ind  : Factor w/ 236 levels "370","380","390",..: 204 117 12 165 231 176 171 135 210 201 ...
 $ ind2 : Factor w/ 

We are constructing the output variable $Y$ and the matrix $Z$ which includes the characteristics of workers that are given in the data.

In [3]:
Y <- log(data$wage)
n <- length(Y)
Z <- data[-which(colnames(data) %in% c("wage","lwage"))]
p <- dim(Z)[2]

cat("Number of observation:", n, '\n')
cat( "Number of raw regressors:", p)

Number of observation: 5150 
Number of raw regressors: 18

For the outcome variable *wage* and a subset of the raw regressors, we calculate the empirical mean to get familiar with the data.

In [21]:
library(xtable)
options(xtable.floating = FALSE)
options(xtable.timestamp = "")

ERROR: Error in library(xtable): there is no package called ‘xtable’


In [4]:
library(xtable)
Z_subset <- data[which(colnames(data) %in% c("lwage","sex","shs","hsg","scl","clg","ad","mw","so","we","ne","exp1"))]
table <- matrix(0, 12, 1)
table[1:12,1]   <- as.numeric(lapply(Z_subset,mean))
rownames(table) <- c("Log Wage","Sex","Some High School","High School Graduate","Some College","College Graduate", "Advanced Degree","Midwest","South","West","Northeast","Experience")
colnames(table) <- c("Sample mean")
tab<- xtable(table, digits = 2)
tab

ERROR: Error in library(xtable): there is no package called ‘xtable’


E.g., the share of female workers in our sample is ~44% ($sex=1$ if female).

Alternatively, we can also print the table as latex.

In [5]:
print(tab, type="latex")

ERROR: Error in print(tab, type = "latex"): object 'tab' not found


## Prediction Question

Now, we will construct a prediction rule for hourly wage $Y$, which depends linearly on job-relevant characteristics $X$:

\begin{equation}\label{decompose}
Y = \beta'X+ \epsilon.
\end{equation}

Our goals are

* Predict wages  using various characteristics of workers.

* Assess the predictive performance using the (adjusted) sample MSE, the (adjusted) sample $R^2$ and the out-of-sample MSE and $R^2$.


We employ two different specifications for prediction:


1. Basic Model:   $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators,  occupation and industry indicators, regional indicators).


2. Flexible Model:  $X$ consists of all raw regressors from the basic model plus occupation and industry indicators, transformations (e.g., ${exp}^2$ and ${exp}^3$) and additional two-way interactions of polynomial in experience with other regressors. An example of a regressor created through a two-way interaction is *experience* times the indicator of having a *college degree*.

Using the **Flexible Model**, enables us to approximate the real relationship by a
 more complex regression model and therefore to reduce the bias. The **Flexible Model** increases the range of potential shapes of the estimated regression function. In general, flexible models often deliver good prediction accuracy but give models which are harder to interpret.

Now, let us fit both models to our data by running ordinary least squares (ols):

In [6]:
# 1. basic model
basic <- lwage~ (sex + exp1 + shs + hsg+ scl + clg + mw + so + we +occ2+ind2)
regbasic <- lm(basic, data=data)
regbasic # estimated coefficients
cat( "Number of regressors in the basic model:",length(regbasic$coef), '\n') # number of regressors in the Basic Model



Call:
lm(formula = basic, data = data)

Coefficients:
(Intercept)          sex         exp1          shs          hsg          scl  
   3.722235    -0.072857     0.008568    -0.592798    -0.504337    -0.411994  
        clg           mw           so           we        occ22        occ23  
  -0.182216    -0.027541    -0.034454     0.017249    -0.076472    -0.034678  
      occ24        occ25        occ26        occ27        occ28        occ29  
  -0.096202    -0.187915    -0.414933    -0.045987    -0.377847    -0.215752  
     occ210       occ211       occ212       occ213       occ214       occ215  
  -0.010623    -0.455834    -0.307589    -0.361440    -0.499495    -0.464482  
     occ216       occ217       occ218       occ219       occ220       occ221  
  -0.233715    -0.412588    -0.340418    -0.241480    -0.212628    -0.288413  
     occ222        ind23        ind24        ind25        ind26        ind27  
  -0.422394    -0.116836    -0.244493    -0.273533    -0.249368    -0.139588

Number of regressors in the basic model: 51 


##### Note that the basic model consists of $51$ regressors.

In [7]:
# 2. flexible model
flex <- lwage ~ sex + shs+hsg+scl+clg+occ2+ind2+mw+so+we + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)
regflex <- lm(flex, data=data)
regflex # estimated coefficients
cat( "Number of regressors in the flexible model:",length(regflex$coef)) # number of regressors in the Flexible Model


Call:
lm(formula = flex, data = data)

Coefficients:
(Intercept)          sex          shs          hsg          scl          clg  
  3.8602606   -0.0695532   -0.1233089   -0.5289024   -0.2920581   -0.0411641  
      occ22        occ23        occ24        occ25        occ26        occ27  
  0.1613397    0.2101514    0.0708570   -0.3960076   -0.2310611    0.3147249  
      occ28        occ29       occ210       occ211       occ212       occ213  
 -0.1875417   -0.3390270    0.0209545   -0.6424177   -0.0674774   -0.2329781  
     occ214       occ215       occ216       occ217       occ218       occ219  
  0.2562009   -0.1938585   -0.0551256   -0.4156093   -0.4822168   -0.2579412  
     occ220       occ221       occ222        ind23        ind24        ind25  
 -0.3010203   -0.4271811   -0.8694527   -1.2473654   -0.0948281   -0.5293860  
      ind26        ind27        ind28        ind29       ind210       ind211  
 -0.6221688   -0.5047497   -0.7295442   -0.8025334   -0.5805840   -0.9852350 

Number of regressors in the flexible model: 246

Note that the flexible model consists of $246$ regressors.

Try Lasso next

In [8]:
library(hdm)
flex <- lwage ~ sex + shs+hsg+scl+clg+occ2+ind2+mw+so+we + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)
lassoreg<- rlasso(flex, data=data)

sumlasso<- summary(lassoreg)


ERROR: Error in library(hdm): there is no package called ‘hdm’


Now, we can evaluate the performance of both models based on the (adjusted) $R^2_{sample}$ and the (adjusted) $MSE_{sample}$:

In [9]:
# Assess the predictive performance

sumbasic <- summary(regbasic)
sumflex <- summary(regflex)

#  R-squared 
R2.1 <- sumbasic$r.squared
cat("R-squared for the basic model: ", R2.1, "\n")
R2.adj1 <- sumbasic$adj.r.squared
cat("adjusted R-squared for the basic model: ", R2.adj1, "\n")

R2.2 <- sumflex$r.squared
cat("R-squared for the flexible model: ", R2.2, "\n")
R2.adj2 <- sumflex$adj.r.squared
cat("adjusted R-squared for the flexible model: ", R2.adj2, "\n")

R2.L <- sumlasso$r.squared
cat("R-squared for the lasso with flexible model: ", R2.L, "\n")
R2.adjL <- sumlasso$adj.r.squared
cat("adjusted R-squared for the flexible model: ", R2.adjL, "\n")

R-squared for the basic model:  0.3100465 
adjusted R-squared for the basic model:  0.3032809 
R-squared for the flexible model:  0.3511099 
adjusted R-squared for the flexible model:  0.3186919 


ERROR: Error in eval(expr, envir, enclos): object 'sumlasso' not found


In [10]:
# calculating the MSE
MSE1 <- mean(sumbasic$res^2)
cat("MSE for the basic model: ", MSE1, "\n")
p1 <- sumbasic$df[1] # number of regressors
MSE.adj1 <- (n/(n-p1))*MSE1
cat("adjusted MSE for the basic model: ", MSE.adj1, "\n")

MSE2 <-mean(sumflex$res^2)
cat("MSE for the flexible model: ", MSE2, "\n")
p2 <- sumflex$df[1]
MSE.adj2 <- (n/(n-p2))*MSE2
cat("adjusted MSE for the flexible model: ", MSE.adj2, "\n")


MSEL <-mean(sumlasso$res^2)
cat("MSE for the lasso flexible model: ", MSEL, "\n")
pL <- length(sumlasso$coef)
MSE.adjL <- (n/(n-pL))*MSEL
cat("adjusted MSE for the lasso flexible model: ", MSE.adjL, "\n")

MSE for the basic model:  0.2244251 
adjusted MSE for the basic model:  0.2266697 
MSE for the flexible model:  0.2110681 
adjusted MSE for the flexible model:  0.221656 


ERROR: Error in mean(sumlasso$res^2): object 'sumlasso' not found


In [11]:
library(xtable)
table <- matrix(0, 3, 5)
table[1,1:5]   <- c(p1,R2.1,MSE1,R2.adj1,MSE.adj1)
table[2,1:5]   <- c(p2,R2.2,MSE2,R2.adj2,MSE.adj2)
table[3,1:5]   <- c(pL,R2.L,MSEL,R2.adjL,MSE.adjL)
colnames(table)<- c("p","$R^2_{sample}$","$MSE_{sample}$","$R^2_{adjusted}$", "$MSE_{adjusted}$")
rownames(table)<- c("basic reg","flexible reg", "lasso flex")
tab<- xtable(table, digits =c(0,0,2,2,2,2))
print(tab,type="latex") # type="latex" for printing table in LaTeX
tab

ERROR: Error in library(xtable): there is no package called ‘xtable’


Considering all measures above, the flexible model performs slightly better than the basic model. 

One procedure to circumvent this issue is to use **data splitting** that is described and applied in the following.

## Data Splitting

Measure the prediction quality of the two models via data splitting:

- Randomly split the data into one training sample and one testing sample. Here we just use a simple method (stratified splitting is a more sophiscticated version of splitting that we can consider).
- Use the training sample for estimating the parameters of the Basic Model and the Flexible Model.
- Use the testing sample for evaluation. Predict the $\mathtt{wage}$  of every observation in the testing sample based on the estimated parameters in the training sample.
- Calculate the Mean Squared Prediction Error $MSE_{test}$ based on the testing sample for both prediction models. 

In [12]:
#splitting the data

set.seed(1) # to make the results replicable (generating random numbers)
random_2 <- sample(1:n, floor(n*4/5))
# draw (4/5)*n random numbers from 1 to n without replacing them
train <- data[random,] # training sample
test <- data[-random,] # testing sample
dim(train)

ERROR: Error in `[.data.frame`(data, random, ): object 'random' not found


In [13]:
# basic model
# estimating the parameters in the training sample
regbasic <- lm(basic, data=train)
regbasic

ERROR: Error in is.data.frame(data): object 'train' not found


In [14]:
# calculating the out-of-sample MSE
trainregbasic <- predict(regbasic, newdata=test)
trainregbasic

ERROR: Error in predict.lm(regbasic, newdata = test): object 'test' not found


In [15]:
y.test <- log(test$wage)
MSE.test1 <- sum((y.test-trainregbasic)^2)/length(y.test)
R2.test1<- 1- MSE.test1/var(y.test)

cat("Test MSE for the basic model: ", MSE.test1, " ")

cat("Test R2 for the basic model: ", R2.test1)

ERROR: Error in eval(expr, envir, enclos): object 'test' not found


In the basic model, the $MSE_{test}$ is quite closed to the $MSE_{sample}$.

In [16]:
# flexible model
# estimating the parameters
#options(warn=-1)
regflex <- lm(flex, data=train)

# calculating the out-of-sample MSE
trainregflex<- predict(regflex, newdata=test)
y.test <- log(test$wage)
MSE.test2 <- sum((y.test-trainregflex)^2)/length(y.test)
R2.test2<- 1- MSE.test2/var(y.test)

cat("Test MSE for the flexible model: ", MSE.test2, " ")

cat("Test R2 for the flexible model: ", R2.test2)

ERROR: Error in is.data.frame(data): object 'train' not found


In [17]:
length(y.test)

ERROR: Error in eval(expr, envir, enclos): object 'y.test' not found


In the flexible model, the discrepancy between the $MSE_{test}$ and the $MSE_{sample}$ is not large.

It is worth to notice that the $MSE_{test}$ vary across different data splits. Hence, it is a good idea average the out-of-sample MSE over different data splits to get valid results.

Nevertheless, we observe that, based on the out-of-sample $MSE$, the basic model using ols regression performs is about as well (or slightly better) than the flexible model. 


Next, let us use lasso regression in the flexible model instead of ols regression. Lasso (*least absolute shrinkage and selection operator*) is a penalized regression method that can be used to reduce the complexity of a regression model when the number of regressors $p$ is relatively large in relation to $n$. 

Note that the out-of-sample $MSE$ on the test sample can be computed for any other black-box prediction method as well. Thus, let us finally compare the performance of lasso regression in the flexible model to ols regression.

In [18]:
# flexible model using lasso

# estimating the parameters
library(hdm)
reglasso <- rlasso(flex, data=train, post=FALSE)

# calculating the out-of-sample MSE
trainreglasso<- predict(reglasso, newdata=test)
MSE.lasso <- sum((y.test-trainreglasso)^2)/length(y.test)
R2.lasso<- 1- MSE.lasso/var(y.test)


cat("Test MSE for the lasso on flexible model: ", MSE.lasso, " ")

cat("Test R2 for the lasso flexible model: ", R2.lasso)

ERROR: Error in library(hdm): there is no package called ‘hdm’


Finally, let us summarize the results:

In [19]:
table2 <- matrix(0, 3,2)
table2[1,1]   <- MSE.test1
table2[2,1]   <- MSE.test2
table2[3,1]   <- MSE.lasso
table2[1,2]   <- R2.test1
table2[2,2]   <- R2.test2
table2[3,2]   <- R2.lasso

rownames(table2)<- c("basic reg","flexible reg","lasso regression")
colnames(table2)<- c("$MSE_{test}$", "$R^2_{test}$")
tab2 <- xtable(table2, digits =3)
tab2

ERROR: Error in eval(expr, envir, enclos): object 'MSE.test1' not found


In [20]:
print(tab2,type="latex") # type="latex" for printing table in LaTeX

ERROR: Error in print(tab2, type = "latex"): object 'tab2' not found
