Members:

- Claudia Vivas
- Diego Gomez
- Alexander Pacheco

# Question 1

The rationale for sample splitting is the need to validate, out of sample, the estimates that a regression on the full sample would produce for a given outcome variable. As its name suggests, the procedure splits the sample into two subgroups (training and test), which should be balanced, that is, similar to each other in their statistical moments. Estimates obtained from the first subgroup are combined with the regressors of the second subgroup to predict the outcome for that second group, which yields a measure of the error between the predicted and the observed values in the test sample. This measure of error can be statistically significant or insignificant, which can be read as the results being externally inconsistent or consistent (i.e., replicable in other contexts), respectively. The algorithm for sample splitting is as follows:

1. Randomly split the total sample into a training sample and a test sample; the proportion assigned to each group is chosen by the researcher.
2. Estimate a vector of coefficients $\beta_{estimated}$ using the training data, then predict $Y_{estimated}$ using $X_{test}$.
3. Compare $Y_{estimated}$ with $Y_{test}$ to determine how well the model predicts the observed outcomes.
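
The three steps above can be sketched in a few lines of base R. This is only an illustration on simulated data: the variable names (`sim`, `train_idx`, `mse_test`, `r2_test`), the 80/20 split and the data-generating process are assumptions made for the example, not part of the assignment.

```r
set.seed(1)                               # reproducible split
n_obs <- 200
x <- rnorm(n_obs)
y <- 1 + 2 * x + rnorm(n_obs)             # simulated outcome, for illustration only
sim <- data.frame(y = y, x = x)

# Step 1: random split; the 80/20 proportion is the researcher's choice
train_idx <- sample(seq_len(n_obs), size = floor(0.8 * n_obs))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Step 2: estimate on the training sample, predict on the test sample
fit    <- lm(y ~ x, data = train)
y_pred <- predict(fit, newdata = test)

# Step 3: compare the predictions with the observed test outcomes
mse_test <- mean((test$y - y_pred)^2)
r2_test  <- 1 - mse_test / mean((test$y - mean(test$y))^2)
cat("Out-of-sample MSE:", mse_test, "\n")
cat("Out-of-sample R^2:", r2_test, "\n")
```

The same pattern would apply to the wage data by replacing `sim` with the data frame of interest and `y ~ x` with the model formula.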

# Question 2

We start by loading the data set.

In [1]:
load("../data/wage2015_subsample_inference.Rdata")
dim(data)

We focus on people who did not go to college, so we are not considering any other observations.

In [2]:
data_1 <- subset(data, shs==1 | hsg ==1);

dim(data_1)

Here we are constructing the output variable $Y$ and the matrix $Z$ which includes the characteristics of workers that are given in the data.

In [3]:
Y <- log(data_1$wage)   # outcome for the no-college subsample (same rows as Z)
n <- dim(data_1)[1]  
Z <- data_1[-which(colnames(data_1) %in% c("wage","lwage"))]   # all covariates (control variables):
          # every column of data_1 except wage and lwage
p <- dim(Z)[2]  # number of columns = number of variables

cat("Number of observation:", n, '\n')
cat( "Number of raw regressors:", p)

Number of observation: 1376 
Number of raw regressors: 18

## Prediction Question

Now, we will construct a prediction rule for hourly wage $Y$, which depends linearly on job-relevant characteristics $X$:

\begin{equation}\label{decompose}
Y = \beta'X+ \epsilon.
\end{equation}

Our goals are

* Predict wages  using various characteristics of workers.

* Assess the predictive performance using the (adjusted) sample MSE, the (adjusted) sample $R^2$ and the out-of-sample MSE and $R^2$.
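
For reference, the in-sample measures named in the goals can be written as follows. The notation is an assumption made here for illustration: $\hat\epsilon_i$ is the OLS residual, $n$ the number of observations and $p$ the number of estimated coefficients. The adjusted versions are the usual degrees-of-freedom corrections; they correspond to the `n/(n-p)` scaling applied to the MSE in the code of this notebook and to the `adj.r.squared` value reported by `summary()` for an `lm` fit.

\begin{equation}
MSE_{sample} = \frac{1}{n}\sum_{i=1}^{n}\hat\epsilon_i^{2}, \qquad
MSE_{adjusted} = \frac{n}{n-p}\,MSE_{sample}
\end{equation}

\begin{equation}
R^2_{sample} = 1-\frac{MSE_{sample}}{\frac{1}{n}\sum_{i=1}^{n}(Y_i-\bar{Y})^{2}}, \qquad
R^2_{adjusted} = 1-\frac{n-1}{n-p}\left(1-R^2_{sample}\right)
\end{equation}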


We employ two different specifications for prediction:


1. Basic Model:   $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators,  occupation and industry indicators, regional indicators). That is,  sex + exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2.


2. Flexible Model:  $X$ consists of all raw regressors from the basic model plus occupation and industry indicators, transformations (e.g., ${exp}^2$ and ${exp}^3$) and additional two-way interactions of polynomial in experience with other regressors. An example of a regressor created through a two-way interaction is *experience* times the indicator of having a *college degree*.

Using the **Flexible Model** enables us to approximate the real relationship with a more complex regression model and therefore to reduce the bias. The **Flexible Model** increases the range of potential shapes of the estimated regression function. In general, flexible models often deliver good prediction accuracy but are harder to interpret.
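
To see how a two-way interaction specification expands into regressors, the following sketch builds the design matrix for a toy data frame. The data frame `toy` and its values are hypothetical; only the formula syntax mirrors the one used in this notebook (in an R formula, `^2` and `**2` are equivalent).

```r
# Toy data, purely illustrative
toy <- data.frame(exp1 = c(1, 2, 3, 4),
                  clg  = c(0, 1, 0, 1),
                  sex  = c(0, 0, 1, 1))

# (a + b + c)^2 expands to all main effects plus all pairwise interactions
X <- model.matrix(~ (exp1 + clg + sex)^2, data = toy)
print(colnames(X))
cat("Number of columns:", ncol(X), "\n")
```

With three raw regressors the matrix has an intercept, three main effects and three pairwise interactions. Factor variables such as `occ2` and `ind2` contribute one column per non-reference level, and each of those columns is interacted as well, which is why the number of regressors grows so quickly in the flexible specification.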


#### Now, let us fit both models to our data by running ordinary least squares (ols):

## Basic model

In [27]:
# 1. basic model
basic <- lwage~ (sex + exp1 + shs + hsg+ scl + clg + mw + so + we +occ2+ind2)
                # occ2 and ind2 are factor variables, so their categories enter the regression as dummies
regbasic <- lm(basic, data=data_1)
regbasic # estimated coefficients
cat( "Number of regressors in the basic model:",length(regbasic$coef), '\n') # number of regressors in the Basic Model


Call:
lm(formula = basic, data = data_1)

Coefficients:
(Intercept)          sex         exp1          shs          hsg          scl  
  2.8330066   -0.0733094    0.0075742   -0.0811342           NA           NA  
        clg           mw           so           we        occ22        occ23  
         NA   -0.0431882   -0.1091620    0.0129620   -0.1961261   -0.0086113  
      occ24        occ25        occ26        occ27        occ28        occ29  
  0.0005078    0.2615289   -0.3510072   -0.1900342   -0.6616521   -0.3013316  
     occ210       occ211       occ212       occ213       occ214       occ215  
 -0.0576220   -0.4176903   -0.4663571   -0.4219896   -0.5527766   -0.4747648  
     occ216       occ217       occ218       occ219       occ220       occ221  
 -0.2381724   -0.3529422   -0.3976108   -0.1181885   -0.1053967   -0.1737437  
     occ222        ind23        ind24        ind25        ind26        ind27  
 -0.3479965    0.1742747    0.0504201    0.0585330    0.0348081    0.23795

Number of regressors in the basic model: 51 


## Flexible model

In [28]:
# 2. flexible model
flex <- lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2
regflex <- lm(flex, data=data_1)
regflex # estimated coefficients
cat( "Number of regressors in the flexible model:",length(regflex$coef)) # number of regressors in the Flexible Model


Call:
lm(formula = flex, data = data_1)

Coefficients:
  (Intercept)           exp1           exp2           exp3           exp4  
    1.601e+01     -3.411e+00      3.874e+01     -2.133e+01      7.337e+00  
          shs            hsg            scl            clg          occ22  
   -5.723e-01             NA             NA             NA     -2.441e+00  
        occ23          occ24          occ25          occ26          occ27  
   -4.590e+01      7.898e+00     -5.081e+01      1.803e+01     -9.997e-01  
        occ28          occ29         occ210         occ211         occ212  
   -1.024e+01     -2.064e+01     -3.857e+00      2.756e-01      1.765e+00  
       occ213         occ214         occ215         occ216         occ217  
   -2.087e+00     -8.440e-01     -1.059e+01     -1.232e-01     -4.549e+00  
       occ218         occ219         occ220         occ221         occ222  
   -3.445e-01     -6.630e+00     -2.868e+00     -1.239e+00     -1.932e+00  
        ind23          ind24    

Number of regressors in the flexible model: 979

Note that the flexible model consists of $979$ regressors.

Now, we can evaluate the performance of both models based on the (adjusted) $R^2_{sample}$ and the (adjusted) $MSE_{sample}$:

In [33]:
# Assess the predictive performance

sumbasic <- summary(regbasic)
sumflex <- summary(regflex)

#  R-squared 
R2.1 <- sumbasic$r.squared
cat("R-squared for the basic model: ", R2.1, "\n")
R2.adj1 <- sumbasic$adj.r.squared
cat("adjusted R-squared for the basic model: ", R2.adj1, "\n")

R2.2 <- sumflex$r.squared
cat("R-squared for the flexible model: ", R2.2, "\n")
R2.adj2 <- sumflex$adj.r.squared
cat("adjusted R-squared for the flexible model: ", R2.adj2, "\n")

R-squared for the basic model:  0.1802381 
adjusted R-squared for the basic model:  0.1512255 
R-squared for the flexible model:  0.507044 
adjusted R-squared for the flexible model:  0.2315028 


Here we can see that the flexible model fits the data more closely than the basic model; this holds for both the R-squared and the adjusted R-squared.

Next, let's see the $MSE_{sample}$:

In [34]:
# calculating the MSE
MSE1 <- mean(sumbasic$res^2)
cat("MSE for the basic model: ", MSE1, "\n")
p1 <- sumbasic$df[1] # p1 = number of regressors
MSE.adj1 <- (n/(n-p1))*MSE1
cat("adjusted MSE for the basic model: ", MSE.adj1, "\n")

MSE2 <-mean(sumflex$res^2)
cat("MSE for the flexible model: ", MSE2, "\n")
p2 <- sumflex$df[1]
MSE.adj2 <- (n/(n-p2))*MSE2
cat("adjusted MSE for the flexible model: ", MSE.adj2, "\n")

MSE for the basic model:  0.2082191 


ERROR: Error in eval(expr, envir, enclos): object 'n' not found


Consistent with the R-squared results above, the in-sample MSE is lower for the flexible model than for the basic model.
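
The link between the two measures is mechanical: for an OLS fit with an intercept, the in-sample MSE equals $(1 - R^2)$ times the average squared deviation of the outcome from its mean, so for a fixed outcome a higher $R^2$ implies a lower MSE. A minimal check of this identity, using R's built-in `cars` data set rather than the wage data (the object names are illustrative):

```r
fit_cars <- lm(dist ~ speed, data = cars)   # any OLS fit with an intercept
s_cars   <- summary(fit_cars)

mse_cars <- mean(s_cars$residuals^2)                   # in-sample MSE
tss_n    <- mean((cars$dist - mean(cars$dist))^2)      # total variation / n

# The two quantities below coincide up to floating-point error
cat("MSE:               ", mse_cars, "\n")
cat("(1 - R^2) * TSS/n: ", (1 - s_cars$r.squared) * tss_n, "\n")
```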

Finally, we will present the results in a table format:

In [35]:
library(xtable)
table <- matrix(0, 2, 5)
table[1,1:5]   <- c(p1,R2.1,MSE1,R2.adj1,MSE.adj1)
table[2,1:5]   <- c(p2,R2.2,MSE2,R2.adj2,MSE.adj2)
colnames(table)<- c("p","$R^2_{sample}$","$MSE_{sample}$","$R^2_{adjusted}$", "$MSE_{adjusted}$")
rownames(table)<- c("basic reg","flexible reg")
tab<- xtable(table, digits =c(0,0,2,2,2,2))
print(tab,type="latex") # type="latex" for printing table in LaTeX
tab

ERROR: Error in eval(expr, envir, enclos): object 'MSE.adj1' not found


Considering all the measures above, the flexible model performs better than the basic model in sample. In both the R and the Python scripts we conclude that the flexible model is better than the basic model; however, the values of the $MSE$ and $R^2$ (sample and adjusted) differ slightly between the two, which we attribute to the value assigned to lambda.


# Question 3

In the first instance, the Frisch-Waugh-Lovell (FWL) theorem can be used to reduce the sample size required to obtain an estimate, consequently reducing the risk of overfitting. However, the theorem can also be checked numerically for its own sake. It states the following: "In the linear least squares regression of vector y on two sets of variables, X1 and X2, the subvector b2 is the set of coefficients obtained when the residuals from a regression of y on X1 alone are regressed on the set of residuals obtained when each column of X2 is regressed on X1" (Greene 2018: 36). The same partialling-out idea can be applied with estimation methods other than OLS, in particular the Lasso, "a method that combines the least-squares loss with an $\ell_1$-constraint, or bound on the sum of the absolute values of the coefficients" (Hastie et al. 2015: 8).
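
Written out, the constrained problem described in the quotation takes the following form. The symbols are assumptions chosen here for illustration: $y_i$ is the outcome, $x_{ij}$ the $j$-th regressor, $n$ the number of observations, $p$ the number of regressors and $t$ the bound on the coefficients. Note that `rlasso` from the `hdm` package works with a penalized version of this problem and picks the penalty level from the data, so $t$ is not something the user sets directly.

\begin{equation}
\min_{\beta_0,\,\beta}\ \frac{1}{2n}\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Big)^{2}
\quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j|\le t
\end{equation}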

The steps that we will follow in each case are:

1. Estimate the full model (basic or flexible) with Lasso.
2. Follow the Partialling-Out algorithm using lasso.

According to the FWL theorem, the coefficient found for the variable $sex$ should be the same in the regression of the complete model as with Partialling-Out. Let's check it out.
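
Before turning to Lasso, it may help to see the theorem hold exactly in the OLS case. The sketch below uses simulated data; the variable names (`w1`, `w2`, `d`, `y`) and the coefficients of the data-generating process are assumptions made for the example and have nothing to do with the wage data.

```r
set.seed(42)
n_obs <- 500
w1 <- rnorm(n_obs)                               # controls W
w2 <- rnorm(n_obs)
d  <- 0.5 * w1 - 0.3 * w2 + rnorm(n_obs)         # regressor of interest D
y  <- 1 - 0.1 * d + 0.8 * w1 + 0.2 * w2 + rnorm(n_obs)

# Coefficient on d in the full regression
b_full <- coef(lm(y ~ d + w1 + w2))["d"]

# Partialling-out: residualize y and d on W, then regress residual on residual
res_y <- resid(lm(y ~ w1 + w2))
res_d <- resid(lm(d ~ w1 + w2))
b_fwl <- coef(lm(res_y ~ res_d))["res_d"]

cat("Full regression: ", b_full, "\n")
cat("Partialling-out: ", b_fwl, "\n")
```

With OLS in all three regressions the two numbers agree up to floating-point error. Once Lasso replaces OLS in any of the steps, this exact equality is no longer guaranteed.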

## Case 1: Partialling-Out using lasso 1

In [4]:
library(sandwich) 

In [23]:
# Lasso 1
library(hdm)
Lasso1 <- lwage ~ sex + (exp1 + shs + hsg+ scl + clg + mw + so + we +occ2+ind2)
lassoreg1<- rlasso(Lasso1, data=data_1)
sumlasso<- summary(lassoreg1)



Call:
rlasso.formula(formula = Lasso1, data = data_1)

Post-Lasso Estimation:  TRUE 

Total number of variables: 50
Number of selected variables: 6 

Residuals: 
     Min       1Q   Median       3Q      Max 
-1.37681 -0.29491 -0.01412  0.27657  3.47488 

            Estimate
(Intercept)    2.667
sex           -0.098
exp1           0.008
shs            0.000
hsg            0.000
scl            0.000
clg            0.000
mw             0.000
so             0.000
we             0.000
occ22          0.000
occ23          0.000
occ24          0.000
occ25          0.000
occ26          0.000
occ27          0.000
occ28          0.000
occ29          0.000
occ210         0.000
occ211         0.000
occ212         0.000
occ213        -0.240
occ214        -0.312
occ215        -0.302
occ216         0.000
occ217         0.000
occ218         0.000
occ219         0.000
occ220         0.000
occ221         0.000
occ222         0.000
ind23          0.000
ind24          0.000
ind25          0.000
ind26    

Unlike in Python, in R we can read the coefficient of the $sex$ variable directly from the summary: it is $-0.098$.

In [21]:
# Partialling-Out using lasso for the basic model

# models
Lasso1.y <- lwage ~  (exp1 + shs + hsg+ scl + clg + mw + so + we + occ2 + ind2) # model for Y
Lasso1.d <- sex ~ (exp1 + shs + hsg+ scl + clg + mw + so + we + occ2 + ind2) # model for D

# partialling-out the linear effect of W from Y         
L1.Y <- rlasso(Lasso1.y, data=data_1, post=FALSE)$res
# partialling-out the linear effect of W from D         
L1.D <- rlasso(Lasso1.d, data=data_1, post=FALSE)$res

# regression of Y on D after partialling-out the effect of W
partial.lasso1.fit <- rlasso(L1.Y~L1.D, post=FALSE)
partial.lasso1.est <- summary(partial.lasso1.fit)$coef[2]

cat("Coefficient for D via partialling-out", partial.lasso1.est)


Call:
rlasso.formula(formula = L1.Y ~ L1.D, post = FALSE)

Post-Lasso Estimation:  FALSE 

Total number of variables: 1
Number of selected variables: 1 

Residuals: 
     Min       1Q   Median       3Q      Max 
-1.32875 -0.29382 -0.01959  0.28458  3.47743 

            Estimate
(Intercept)    0.000
              -0.043

Residual standard error: 0.4752
Multiple R-squared:  0.003896
Adjusted R-squared:  0.003171
Joint significance test:
 the sup score statistic for joint significance test is 0.5125 with a p-value of 0.012
Coefficient for D via partialling-out -0.04331474

In [19]:
# Partialling-Out using lasso, with OLS (lm) in the final step

# models
Lasso1.y <- lwage ~  (exp1 + shs + hsg+ scl + clg + mw + so + we + occ2 + ind2) # model for Y
Lasso1.d <- sex ~ (exp1 + shs + hsg+ scl + clg + mw + so + we + occ2 + ind2) # model for D

# partialling-out the linear effect of W from Y         
L1.Y <- rlasso(Lasso1.y, data=data_1, post=FALSE)$res
# partialling-out the linear effect of W from D         
L1.D <- rlasso(Lasso1.d, data=data_1, post=FALSE)$res

# regression of Y on D after partialling-out the effect of W
partial.lasso1.fit <- lm(L1.Y~L1.D)
partial.lasso1.est <- summary(partial.lasso1.fit)$coef[2,1]

cat("Coefficient for D via partialling-out", partial.lasso1.est)

Coefficient for D via partialling-out -0.08248591

Lasso regression can be applied to all three regressions involved in the partialling-out procedure. Note, however, that the last step, which regresses the residuals of $log(wage)$ on $W = exp1 + shs + hsg+ scl + clg + mw + so + we + occ2 + ind2$ on the residuals of $sex$ on the same $W$, can be completed without lasso: the first two steps have already set several of the coefficients on the many controls to zero, and the final regression has a single regressor, so no further reduction of parameters is necessary.

We tried both methods and got different results. If we use lasso regression for all the steps, the coefficient of the $sex$ variable is $-0.04331474$; if we use OLS regression for the last step, it is $-0.08248591$. Comparing both results with the coefficient from the complete model, $-0.098$, the second coefficient is closer, so running the last step with OLS gives better results.
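
One reason the all-lasso estimate is smaller in magnitude is that a lasso penalty shrinks a coefficient towards zero even when there is only one regressor. The sketch below illustrates this with simulated residuals and a plain lasso objective without intercept. It is not what `rlasso` does internally: the penalty `lambda` is an arbitrary value chosen for illustration, and the names `d_res`, `y_res` and `soft` are hypothetical.

```r
set.seed(7)
n_obs  <- 300
d_res  <- rnorm(n_obs)                       # stand-in for the residualized D
y_res  <- -0.1 * d_res + rnorm(n_obs)        # stand-in for the residualized Y
lambda <- 0.03                               # arbitrary penalty, illustration only

# OLS slope without intercept
b_ols <- sum(d_res * y_res) / sum(d_res^2)

# Lasso slope for objective: mean((y - d*b)^2)/2 + lambda*|b|
# Closed form: soft-threshold the sample covariance, then rescale
soft    <- function(z, l) sign(z) * pmax(abs(z) - l, 0)
b_lasso <- soft(mean(d_res * y_res), lambda) / mean(d_res^2)

# Numerical check of the closed form by direct minimization
obj     <- function(b) mean((y_res - d_res * b)^2) / 2 + lambda * abs(b)
b_check <- optimize(obj, interval = c(-1, 1), tol = 1e-10)$minimum

cat("OLS slope:            ", b_ols, "\n")
cat("Lasso slope:          ", b_lasso, "\n")
cat("Lasso slope (numeric):", b_check, "\n")
```

The lasso slope is never larger in absolute value than the OLS slope, which is in line with the ordering of the two partialling-out estimates reported above.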

## Case 2: Partialling-Out using lasso 2 

In [8]:
# Lasso2
library(hdm)
Lasso2 <- lwage ~ sex + (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2
lassoreg2<- rlasso(Lasso2, data=data_1)
sumlasso<- summary(lassoreg2)


Call:
rlasso.formula(formula = Lasso2, data = data_1)

Post-Lasso Estimation:  TRUE 

Total number of variables: 979
Number of selected variables: 12 

Residuals: 
      Min        1Q    Median        3Q       Max 
-1.585175 -0.304508 -0.006954  0.274110  3.558977 

              Estimate
(Intercept)      2.627
sex              0.000
exp1             0.005
exp2             0.000
exp3             0.000
exp4             0.000
shs              0.000
hsg              0.000
scl              0.000
clg              0.000
occ22            0.000
occ23            0.000
occ24            0.000
occ25            0.000
occ26            0.000
occ27            0.000
occ28            0.000
occ29            0.000
occ210           0.000
occ211           0.000
occ212           0.000
occ213          -0.225
occ214          -0.270
occ215           0.000
occ216           0.000
occ217           0.000
occ218           0.000
occ219           0.000
occ220           0.000
occ221           0.000
occ222           0.

Unlike in Python, in R we can read the coefficient of the $sex$ variable directly from the summary: it is $0$.

In [15]:
# Partialling-Out using lasso for the flexible model

# models
Lasso2.y <- lwage ~  (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2 # model for Y
Lasso2.d <- sex ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2 # model for D

# partialling-out the linear effect of W from Y
# (note: data=data passes the full data set here, not the data_1 subsample used for Lasso2)
L2.Y <- rlasso(Lasso2.y, data=data, post=FALSE)$res
# partialling-out the linear effect of W from D
L2.D <- rlasso(Lasso2.d, data=data, post=FALSE)$res

# regression of Y on D after partialling-out the effect of W
partial.lasso2.fit <- rlasso(L2.Y~L2.D, post=FALSE)
partial.lasso2.est <- summary(partial.lasso2.fit)$coef[2]

cat("Coefficient for D via partialling-out", partial.lasso2.est)


Call:
rlasso.formula(formula = L2.Y ~ L2.D, post = FALSE)

Post-Lasso Estimation:  FALSE 

Total number of variables: 1
Number of selected variables: 1 

Residuals: 
      Min        1Q    Median        3Q       Max 
-1.992287 -0.286221 -0.009855  0.279900  3.442342 

            Estimate
(Intercept)    0.000
              -0.044

Residual standard error: 0.4782
Multiple R-squared:  0.003139
Adjusted R-squared:  0.002946
Joint significance test:
 the sup score statistic for joint significance test is 0.892 with a p-value of     0
Coefficient for D via partialling-out -0.04442814

In [17]:
# Partialling-Out using lasso, with OLS (lm) in the final step

# models
Lasso2.y <- lwage ~  (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2 # model for Y
Lasso2.d <- sex ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2 # model for D

# partialling-out the linear effect of W from Y
# (note: data=data passes the full data set here, not the data_1 subsample used for Lasso2)
L2.Y <- rlasso(Lasso2.y, data=data, post=FALSE)$res
# partialling-out the linear effect of W from D
L2.D <- rlasso(Lasso2.d, data=data, post=FALSE)$res

# regression of Y on D after partialling-out the effect of W
partial.lasso2.fit <- lm(L2.Y~L2.D)
partial.lasso2.est <- summary(partial.lasso2.fit)$coef[2,1]

cat("Coefficient for D via partialling-out", partial.lasso2.est)

Coefficient for D via partialling-out -0.0638049

Lasso regression can be applied to all three regressions involved in the partialling-out procedure. Note, however, that the last step, which regresses the residuals of $log(wage)$ on $W = (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2$ on the residuals of $sex$ on the same $W$, can be completed without lasso: the first two steps have already set several of the coefficients on the many controls to zero, and the final regression has a single regressor, so no further reduction of parameters is necessary.

We tried both methods and got different results. If we use lasso regression for all the steps, the coefficient of the $sex$ variable is $-0.04442814$; if we use OLS regression for the last step, it is $-0.0638049$. Comparing both results with the coefficient from the complete model, $0$, the first coefficient is closer, so running all the steps with Lasso gives better results.

Finally, note that the coefficients obtained by partialling-out with lasso differ between the R and Python scripts, since the two use different packages that assign different values to lambda.