##### Theorem 1: FWL
 Let $\boldsymbol{y} = D\boldsymbol{\beta}_1 + W\boldsymbol{\beta}_2 + \boldsymbol{\mu}$ and let $\boldsymbol{\epsilon^Y}$, $\boldsymbol{\epsilon^D}$ be $\boldsymbol{y}$ and $D$ residualized from $W$ respectively. $\boldsymbol{\hat{\beta}}_1$ can be estimated by running OLS of $\boldsymbol{\epsilon^Y}$ against $\boldsymbol{\epsilon^D}$. 

<u>Proof</u>: 
From the original regression, the objective function is the following.
$$
\begin{align*}
    \text{min  }\boldsymbol{e}'\boldsymbol{e} &= (\boldsymbol{y} - D\boldsymbol{\hat{\beta}}_1 - W\boldsymbol{\hat{\beta}}_2)'(\boldsymbol{y} - D\boldsymbol{\hat{\beta}}_1 - W\boldsymbol{\hat{\beta}}_2) \\
    &= \boldsymbol{y}'\boldsymbol{y} - 2\boldsymbol{y}'D\boldsymbol{\hat{\beta}}_1 - 2\boldsymbol{y}'W\boldsymbol{\hat{\beta}}_2 + 2\boldsymbol{\hat{\beta}}_1'D'W\boldsymbol{\hat{\beta}}_2 + \boldsymbol{\hat{\beta}}_1'D'D\boldsymbol{\hat{\beta}}_1 + \boldsymbol{\hat{\beta}}_2'W'W\boldsymbol{\hat{\beta}}_2
\end{align*}
$$
The first order conditions are given by the following system
$$
\begin{bmatrix}
D'D & D'W \\
W'D & W'W \\
\end{bmatrix}
\begin{bmatrix}
\boldsymbol{\hat{\beta}}_1 \\
\boldsymbol{\hat{\beta}}_2 \\
\end{bmatrix}
=
\begin{bmatrix}
D'\boldsymbol{y} \\
W'\boldsymbol{y}
\end{bmatrix}
$$
Lastly, solving for $\boldsymbol{\hat{\beta}}_2$ in the 2nd equation and replacing in the 1st yields the desired result. 
$$
\begin{align*}
    \boldsymbol{\hat{\beta}}_2 &= (W'W)^{-1}[W'\boldsymbol{y} - W'D\boldsymbol{\hat{\beta}}_1] \\
    &= (W'W)^{-1}W'[\boldsymbol{y} - D\boldsymbol{\hat{\beta}}_1]
\end{align*}
$$
In the 1st equation,
$$D'D\boldsymbol{\hat{\beta}}_1 + D'W\boldsymbol{\hat{\beta}}_2 = D'\boldsymbol{y}$$
$$D'D\boldsymbol{\hat{\beta}}_1 + D'W(W'W)^{-1}W'[\boldsymbol{y} - D\boldsymbol{\hat{\beta}}_1] = D'\boldsymbol{y}$$
$$D'D\boldsymbol{\hat{\beta}}_1 + D'P_W[\boldsymbol{y} - D\boldsymbol{\hat{\beta}}_1] = D'\boldsymbol{y}$$
$$D'D\boldsymbol{\hat{\beta}}_1 - D'P_WD\boldsymbol{\hat{\beta}}_1 = D'\boldsymbol{y} - D'P_W\boldsymbol{y}$$
$$D'(I-P_W)D\boldsymbol{\hat{\beta}}_1 = D'(I-P_W)\boldsymbol{y}$$
$$D'M_WD\boldsymbol{\hat{\beta}}_1 = D'M_W\boldsymbol{y}$$
$$D'M_W'M_WD\boldsymbol{\hat{\beta}}_1 = D'M_W'M_W\boldsymbol{y}$$
$$\boldsymbol{\hat{\beta}}_1 = (D'M_W'M_WD)^{-1}D'M_W'M_W\boldsymbol{y}$$
$$\boldsymbol{\hat{\beta}}_1 = ({\epsilon^{D}}^{'}{\epsilon^{D}})^{-1}\epsilon^D\boldsymbol{\epsilon^Y}$$
where $P_W$ is $W$'s projection matrix and $M_W$ its residual-maker matrix. 


##### Theorem 2: CEF minimizes MSE
Let $Y=m(X)+$ where $m(X)=E[Y|X]$ is the CEF and g(X) any other function. The CEF $m(x)$ minimizes E[(Y-g(X))].

<u>Proof</u>: 
$$
\begin{align*}
    E[(Y-g(X))^2] &= E[(Y-m(X)+m(X)-g(X))^2] \\
    &= E[(Y-m(X))^2] + E[(m(X)-g(X))^2] + 2E[Y-m(X)]E[m(X)-g(X)]
\end{align*}
$$
By the Law of Iterated Expectations, the last term is equal to zero since
$$
\begin{align*}
    E[Y-m(X)]E[m(X)-g(X)] &= E[E[Y-m(X)]E[m(X)-g(X)] |X] \\ 
    &= E[(E[Y|X] - m(X) )( m(X) - g(X) ) |X] \\
    &= E[( 0 )( m(X) - g(X) ) |X] = 0
\end{align*}
$$
So $E[(Y-g(X))^2] = E[(Y-m(X))^2] + E[(m(X)-g(X))^2]$. Since the second term is the expectation of a non-negative variable, 
$$E[(Y-g(X))^2] \geq E[(Y-m(X))^2]$$ for any function $g(X)$. This shows that $g(X)=m(X)$ is where the MSE is minimized.  

In [2]:
install.packages("glmnet")
library(glmnet)

"dependency 'lattice' is not available"also installing the dependencies 'Matrix', 'survival', 'RcppEigen'

"unable to access index for repository https://cran.r-project.org/bin/windows/contrib/3.6:
  no fue posible abrir la URL 'https://cran.r-project.org/bin/windows/contrib/3.6/PACKAGES'"Packages which are only available in source form, and may need
  compilation of C/C++/Fortran: 'Matrix' 'survival' 'RcppEigen'
  'glmnet'


  These will not be installed


ERROR: Error in library(glmnet): there is no package called 'glmnet'


In [8]:
#---(1) Data preparation------#
#-----------------------------#
#Load data; perform 80/20 split; stablish the index (observations) for the training and test subsets
getwd()
q3 = get(load("../../data/wage2015_subsample_inference.Rdata"))
rm(data)

n = dim(q3)[1]
nvar = dim(q3)[2]
q3$id = seq(1, n, 1) #id variable

#--Split----#
set.seed(1234)
random = sample(1:n, floor(n*8/10), replace = F)  #80/20 split

#Training set
train = q3[random, ]
index_train =  q3[random, ]$id   #Original indexes for Training set
ntrain = dim(train)[1]

#Testing set
test = q3[-random, ]
index_test =  q3[-random, ]$id   #Original indexes for Testing set
ntest = dim(test)[1]

print(intersect(index_train, index_test)) #--> 0

numeric(0)


In [9]:
#---(2) Lambda range------#
#-------------------------#
lambdas = seq(0.1, 0.5, 0.1)

In [11]:
#---(3) Partition------#
#----------------------#
#Divide the training set it 5 folds
k=5
obsfold = ntrain/k  #824 obs. per fold
cutoff = c(obsfold, obsfold*2, obsfold*3, obsfold*4, obsfold*5) 

#--Folds----#
index_f1 = index_train[1:cutoff[1]]
index_f2 = index_train[(cutoff[1]+1):cutoff[2]]
index_f3 = index_train[(cutoff[2]+1):cutoff[3]]
index_f4 = index_train[(cutoff[3]+1):cutoff[4]]
index_f5 = index_train[(cutoff[4]+1):cutoff[5]]

In [12]:
#---(4) Lasso function------#    #glmnet()
#---------------------------#
#we'll use glmnet()

#---(5) CV loop------#
#--------------------#
folds = cbind(index_f1, index_f2, index_f3, index_f4, index_f5)  #matrix of fold's indexes; each column is are the indexes of a single fold
folds[,-1]

MSE_mat = matrix(data=NA, nrow=5, ncol=length(lambdas))     #each row is a fold; each col is a lambda numer


for (p in 1:length(lambdas)){
  
for (i in 1:k) {
  index_training = folds[, -i]   #train with each observation of each fold except for current i 
  index_validation = folds[, i]  #validate the model with current i and get MSE
  
  x = q3[index_training, !names(q3) %in% c("wage", "lwage", "id")]
  y = q3$wage[index_training]
  
  model = glmnet(x, y, alpha=1, lambda=lambdas[p])
  coefficients = coef(model)
  
  
  #Validation
  x_validation = q3[index_validation, !names(q3) %in% c("wage", "lwage", "id")]
  y_validation = q3$wage[index_validation]
  
  pred = predict(model, newx=as.matrix(x_validation))   #Take unused fold and predict y_hat
  mse = mean((y_validation - pred)^2)                   #Compare with y, take MSE
  MSE_mat[p,i] = mse
  
}
  
}
colnames(MSE_mat) = c("lambda1", "lambda2", "lambda3", "lambda4", "lambda5")
print(MSE_mat) 


index_f2,index_f3,index_f4,index_f5
2327,2654,1499,903
2761,3222,1468,2940
1348,37,3142,1919
233,1274,2256,1864
4027,4380,4714,3585
3299,4553,2423,3297
132,1846,3331,1557
5061,3119,1080,531
3370,3632,4603,4422
4555,2554,2844,1534


ERROR: Error in glmnet(x, y, alpha = 1, lambda = lambdas[p]): no se pudo encontrar la función "glmnet"


In [14]:
#---(6) Optimal lambda------#
#---------------------------#
means_mse = apply(MSE_mat, 2, mean)
optimal_pos = which.min(means_mse)

#OPTIMAL LAMBDA
optimal_lambda = lambdas[optimal_pos]
cat("The average MSE of each lambda value was: ", means_mse)
cat("The optimal lambda from our list", lambdas, "is ", optimal_lambda)

The optimal lambda from our list 0.1 0.2 0.3 0.4 0.5 is  

In [None]:
#---(7) Model Training------#
#---------------------------#
#Using the optimal_lambda to train a model in the initial training set (80%)
x = q3[index_train, !names(q3) %in% c("wage", "lwage", "id")]
y = q3$wage[index_train]

model = glmnet(x, y, alpha=1, lambda=optimal_lambda)

#Validating the model using the test set (20%)
x_test = q3[index_test, !names(q3) %in% c("wage", "lwage", "id")]
y_test = q3$wage[index_test]

pred = predict(model, newx=as.matrix(x_test))
MSE = mean((y_test - pred)^2)
cat("The MSE of the final model evaluated in the test sample (20% split) is: ", MSE)

In [None]:
#---(8) Results------#
#--------------------#
#From the initial list c(0.1, 0.2, 0.3, 0.4, 0.5), the optimal lambda value was 0.5
#Given that the average MSE for each lambda value was 5167.2899,  8620.5168, 25934.3097,  1377.3341,   824.8698, 
#respectively, this suggest we could get a lower average MSE by using a higher lambda

#After performing cross-validation, its performance on the test set which accounts for 20% of the original data yields a MSE of 454.8124