#### <p style='text-align: right;'>**WORKGROUP 2 - GROUP 5** 

# 1. EXPLAINING THE IDEA OF SAMPLE SPLITTING

<p style='text-align: justify;'> When we run a regression, the coefficients are chosen to minimize the in-sample mean squared error, so they fit well within the sample used for estimation. To learn the predictive power of the estimated coefficients, however, we need to measure their out-of-sample performance. Sample splitting does this by randomly dividing the sample into two groups, a training sample and a test sample; the share assigned to each group is chosen by the researcher. First, the training sample is used to estimate the coefficients of the model, which gives the prediction rule. Then, the test sample is used to evaluate the quality of that rule: we compute predicted values of the dependent variable in the test sample using the coefficients obtained from the training sample. Finally, we calculate the out-of-sample mean squared error ($MSE_{test}$) and $R^{2}_{test}$ in the test sample.

<p style='text-align: justify;'> For example, suppose we have the following wage model: $lwage \sim sex + exp1 + shs + hsg + scl + clg + mw + so + we + occ2 + ind2$ and the data has $n$ observations. First, we randomly divide the data into two groups, for instance $\frac{4}{5}n$ observations as the training sample, $M_{train}$, and $\frac{1}{5}n$ as the test sample, $M_{test}$. Next, we estimate the model on $M_{train}$ and obtain the estimated coefficients $\hat\beta$. With these coefficients, we predict $lwage$ in $M_{test}$ and calculate the out-of-sample mean squared error $MSE_{test}$ and $R^{2}_{test}$ in $M_{test}$. Finally, we can check whether $MSE_{test}$ is close to $MSE_{sample}$ (obtained from estimating the model on the complete sample) or whether the discrepancy is large.
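
The steps above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the wage data: the sample size, the coefficients in `beta_true` and the noise level are assumptions chosen only for the example, and plain least squares from NumPy stands in for the regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 500 observations, an intercept and 3 regressors
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([1.0, 0.5, -0.3, 0.2])   # illustrative coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Random split: 4/5 of the rows for training, 1/5 for testing
perm = rng.permutation(n)
n_train = (4 * n) // 5
train_idx, test_idx = perm[:n_train], perm[n_train:]

# Estimate the prediction rule on the training sample only
beta_hat, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)

# Evaluate the rule on the test sample
y_test = y[test_idx]
y_pred = X[test_idx] @ beta_hat
mse_test = np.mean((y_test - y_pred) ** 2)
r2_test = 1 - mse_test / np.var(y_test)

print("MSE_test:", mse_test)
print("R2_test:", r2_test)
```

Using a permutation of the row indices guarantees that every observation lands in exactly one of the two groups.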

#### Import necessary packages

In [225]:
import pandas as pd
import numpy as np
import pyreadr

# Package for latex table 
import array_to_latex as a2l

# Import packages for OLS regression
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Import relevant packages for lasso 
from sklearn.linear_model import LassoCV
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

#### Import data

In [226]:
rdata_read = pyreadr.read_r("../data/wage2015_subsample_inference.Rdata")
data = rdata_read[ 'data' ]
type(data)
data.shape
data

rownames,wage,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind,ind2
10,9.615385,2.263364,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,7.0,0.49,0.343,0.2401,3600,11,8370,18
12,48.076923,3.872802,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,31.0,9.61,29.791,92.3521,3050,10,5070,9
15,11.057692,2.403126,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,18.0,3.24,5.832,10.4976,6260,19,770,4
18,13.942308,2.634928,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,25.0,6.25,15.625,39.0625,420,1,6990,12
19,28.846154,3.361977,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,22.0,4.84,10.648,23.4256,2015,6,9470,22
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32620,14.769231,2.692546,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,9.0,0.81,0.729,0.6561,4700,16,4970,9
32624,23.076923,3.138833,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,12.0,1.44,1.728,2.0736,4110,13,8680,20
32626,38.461538,3.649659,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,11.0,1.21,1.331,1.4641,1550,4,3680,6
32631,32.967033,3.495508,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,10.0,1.00,1.000,1.0000,2920,9,6570,11


#### Focus on people who did not go to college $(shs=1 \lor hsg=1)$

In [227]:
# Select the variables we will use
m = data[ ["lwage","sex","shs","hsg","scl","clg","ad","ne","mw","so","we","exp1",'exp2',"exp3","exp4",'occ', 'occ2','ind','ind2'] ]

# Keep the observations with "some high school"
some_highs=m[m['shs']==1]
some_highs.shape

# Keep the observations with "high school graduate"
highs_grad=m[m['hsg']==1]
highs_grad.shape

# Concatenate the two filtered groups
high_school=pd.concat([some_highs,highs_grad])
high_school.shape
high_school

rownames,lwage,sex,shs,hsg,scl,clg,ad,ne,mw,so,we,exp1,exp2,exp3,exp4,occ,occ2,ind,ind2
500,2.145581,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,37.0,13.69,50.653,187.4161,6800,19,490,2
540,2.345602,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,26.0,6.76,17.576,45.6976,1e+05,21,1e+05,5
691,2.701619,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,29.0,8.41,24.389,70.7281,4220,14,7070,13
843,2.263364,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,28.0,7.84,21.952,61.4656,4020,13,8680,20
1775,2.245200,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,8.0,0.64,0.512,0.4096,4230,14,7690,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32580,2.563469,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,15.0,2.25,3.375,5.0625,2010,6,9370,22
32590,2.599837,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,8.0,0.64,0.512,0.4096,4720,16,8590,19
32599,3.117780,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,15.0,2.25,3.375,5.0625,9620,22,5390,9
32603,2.822980,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,11.0,1.21,1.331,1.4641,7150,20,8770,21


#### Let's have a look at the structure of  "high_school" data.

In [228]:
high_school.info()
high_school.describe()

<class 'pandas.core.frame.DataFrame'>
Index: 1376 entries, 500 to 32631
Data columns (total 19 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   lwage   1376 non-null   float64 
 1   sex     1376 non-null   float64 
 2   shs     1376 non-null   float64 
 3   hsg     1376 non-null   float64 
 4   scl     1376 non-null   float64 
 5   clg     1376 non-null   float64 
 6   ad      1376 non-null   float64 
 7   ne      1376 non-null   float64 
 8   mw      1376 non-null   float64 
 9   so      1376 non-null   float64 
 10  we      1376 non-null   float64 
 11  exp1    1376 non-null   float64 
 12  exp2    1376 non-null   float64 
 13  exp3    1376 non-null   float64 
 14  exp4    1376 non-null   float64 
 15  occ     1376 non-null   category
 16  occ2    1376 non-null   category
 17  ind     1376 non-null   category
 18  ind2    1376 non-null   category
dtypes: category(4), float64(15)
memory usage: 202.2+ KB


Unnamed: 0,lwage,sex,shs,hsg,scl,clg,ad,ne,mw,so,we,exp1,exp2,exp3,exp4
count,1376.0,1376.0,1376.0,1376.0,1376.0,1376.0,1376.0,1376.0,1376.0,1376.0,1376.0,1376.0,1376.0,1376.0,1376.0
mean,2.718562,0.321948,0.087209,0.912791,0.0,0.0,0.0,0.223837,0.286337,0.291424,0.198401,17.190044,4.029529,11.434386,36.158301
std,0.504167,0.467393,0.282244,0.282244,0.0,0.0,0.0,0.416966,0.452213,0.454584,0.398941,10.369836,4.464939,17.304596,67.243707
min,1.213542,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.25,0.125,0.0625
25%,2.396896,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.81,0.729,0.6561
50%,2.682075,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,14.0,1.96,2.744,3.8416
75%,3.000573,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,25.0,6.25,15.625,39.0625
max,6.270697,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,47.0,22.09,103.823,487.9681


#### We are constructing the output variable Y and the matrix Z which includes the characteristics of workers that are given in the data.

In [229]:
# lwage is already the log of the wage, so it is used directly as the outcome
Y = high_school['lwage']
n = len(Y)
z = high_school.loc[:, ~high_school.columns.isin(['wage', 'lwage','Unnamed: 0'])]
p = z.shape[1]

print("Number of observation:", n, '\n')
print( "Number of raw regressors:", p)

Number of observation: 1376 

Number of raw regressors: 18


#### For the outcome variable wage and a subset of the raw regressors, we calculate the empirical mean to get familiar with the data.

In [230]:
subset_hs = high_school.loc[:, high_school.columns.isin(["lwage","sex","shs","hsg","scl","clg","ad","mw","so","we","ne","exp1"])]
# Column means as a one-column DataFrame
table = subset_hs.mean(axis=0).to_frame(name="Sample mean")
index1 = list(table.index)
# Labels follow the column order of the data: ..., ad, ne, mw, so, we, exp1
index2 = ["Log Wage","Sex","Some High School","High School Graduate",\
          "Some College","College Graduate", "Advanced Degree","Northeast",\
          "Midwest","South","West","Experience"]

In [231]:
table = table.rename(index=dict(zip(index1,index2)))
table

Unnamed: 0,Sample mean
Log Wage,2.718562
Sex,0.321948
Some High School,0.087209
High School Graduate,0.912791
Some College,0.0
College Graduate,0.0
Advanced Degree,0.0
Northeast,0.223837
Midwest,0.286337
South,0.291424
West,0.198401
Experience,17.190044


In [9]:
print(table.to_latex())

\begin{tabular}{lr}
\toprule
{} &  Sample mean \\
\midrule
Log Wage             &     2.718562 \\
Sex                  &     0.321948 \\
Some High School     &     0.087209 \\
High School Graduate &     0.912791 \\
Some College         &     0.000000 \\
College Graduate     &     0.000000 \\
Advanced Degree      &     0.000000 \\
Northeast            &     0.223837 \\
Midwest              &     0.286337 \\
South                &     0.291424 \\
West                 &     0.198401 \\
Experience           &    17.190044 \\
\bottomrule
\end{tabular}



# 2. PREDICTION: BASIC VS FLEXIBLE MODEL

We employ two different specifications for prediction:

1. Basic Model:   $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators,  occupation and industry indicators, regional indicators).


2. Flexible Model:  $X$ consists of all raw regressors from the basic model plus transformations of experience (e.g., ${exp}^2$ and ${exp}^3$) and additional two-way interactions of the polynomial in experience with the other regressors.

## 2.1  Basic Model


In [232]:
# Regression model
basic = 'lwage ~ sex + exp1 + hsg + mw + so + we + occ2+ ind2'

# Regress
basic_results = smf.ols(basic , data=high_school).fit()
basic_coef = basic_results.params['sex']   # coefficient on the gender indicator

# Print estimated coefficients
print(basic_results.summary2()) 

                 Results: Ordinary least squares
Model:              OLS              Adj. R-squared:     0.151    
Dependent Variable: lwage            AIC:                1841.7485
Date:               2021-09-17 16:27 BIC:                2092.6415
No. Observations:   1376             Log-Likelihood:     -872.87  
Df Model:           47               F-statistic:        6.212    
Df Residuals:       1328             Prob (F-statistic): 9.07e-33 
R-squared:          0.180            Scale:              0.21575  
-------------------------------------------------------------------
                Coef.   Std.Err.     t     P>|t|    [0.025   0.975]
-------------------------------------------------------------------
Intercept       2.9918    0.1009  29.6406  0.0000   2.7938   3.1898
occ2[T.10]     -0.0576    0.1208  -0.4770  0.6334  -0.2946   0.1794
occ2[T.11]     -0.4177    0.1110  -3.7617  0.0002  -0.6355  -0.1999
occ2[T.12]     -0.4664    0.1257  -3.7087  0.0002  -0.7130  -0.2197
occ2[T

In [233]:
print("Gender Coefficient in the basic model: ", basic_coef)
#Print number of regressors in the Basic Model
print( "Number of regressors in the basic model:",len(basic_results.params), '\n')

Gender Coefficient in the basic model:  -0.07330944628393145
Number of regressors in the basic model: 48 



## 2.2  Flexible Model

In [246]:
# Regression model
flex = 'lwage ~ (exp1 + exp2 + exp3 + exp4 + hsg + mw + so + we + occ2 + ind2)**2'
#flex = 'lwage ~ sex + (exp1 + exp2 + exp3 + exp4 + hsg + mw + so + we +  occ2 + ind2)**2'

# Regress
flex_results = smf.ols(flex , data=high_school).fit()
flex_results.summary2()
#flex_coef = flex_results.summary2().tables[1]['Coef.']['sex']

# Print estimated coefficients
print(flex_results.summary2()) 


                    Results: Ordinary least squares
Model:                OLS                Adj. R-squared:       0.232    
Dependent Variable:   lwage              AIC:                  2033.9233
Date:                 2021-09-17 17:09   BIC:                  4616.0296
No. Observations:     1376               Log-Likelihood:       -522.96  
Df Model:             493                F-statistic:          1.840    
Df Residuals:         882                Prob (F-statistic):   2.24e-15 
R-squared:            0.507              Scale:                0.19534  
------------------------------------------------------------------------
                       Coef.   Std.Err.    t    P>|t|    [0.025   0.975]
------------------------------------------------------------------------
Intercept               5.5884   2.6826  2.0832 0.0375    0.3234 10.8534
occ2[T.10]             -1.5768   1.0075 -1.5651 0.1179   -3.5542  0.4005
occ2[T.11]              1.5942   1.7284  0.9224 0.3566   -1.7980  4.9864

In [235]:
#print("Gender Coefficient in the flexible model: ", flex_coef)

#Print number of regressors in the Flexible Model
print( "Number of regressors in the flexible model:",len(flex_results.params), '\n')

Number of regressors in the flexible model: 826 



In [236]:
# Assess the predictive performance
R2_b = basic_results.rsquared
print("R-squared for the basic model: ", R2_b, "\n")
R2_adj_b = basic_results.rsquared_adj
print("adjusted R-squared for the basic model: ", R2_adj_b, "\n")


R2_f = flex_results.rsquared
print("R-squared for the flexible model: ", R2_f, "\n")
R2_adj_f = flex_results.rsquared_adj
print("adjusted R-squared for the flexible model: ", R2_adj_f, "\n")

R-squared for the basic model:  0.18023814876721023 

adjusted R-squared for the basic model:  0.15122549288773646 

R-squared for the flexible model:  0.5070440013634931 

adjusted R-squared for the flexible model:  0.23150283659274729 



In [248]:
# Calculating the MSE
MSE_b =  np.mean(basic_results.resid**2)
print("MSE for the basic model: ", MSE_b, "\n")
pb = len(basic_results.params) # number of regressors

MSE_adj_b  = (n/(n-pb))*MSE_b
print("Adjusted MSE for the basic model: ", MSE_adj_b, "\n")

MSE_f =  np.mean(flex_results.resid**2)
print("MSE for the flexible model: ", MSE_f, "\n")
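pf = len(flex_results.params) # number of regressors
n = len(high_school)          # number of observations

# Some interaction columns are collinear, so n - pf overstates the degrees of freedom used;
# flex_results.scale divides the residual sum of squares by the residual degrees of freedom instead
#MSE_adj_f  = (n/(n-pf))*MSE_f
MSE_adj_f  = flex_results.scale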
print("adjusted MSE for the flexible model: ", MSE_adj_f, "\n")


MSE for the basic model:  0.20821908377460796 

Adjusted MSE for the basic model:  0.21574507475441307 

MSE for the flexible model:  0.12521056721892074 

adjusted MSE for the flexible model:  0.19533984182906483 



#### Summary of the results of two models

In [249]:
table_p= np.zeros((2, 5))
table_p[0,0:5] = [pb, R2_b, MSE_b, R2_adj_b, MSE_adj_b]
table_p[1,0:5] = [pf, R2_f, MSE_f, R2_adj_f, MSE_adj_f]

table_p = pd.DataFrame(table_p, columns = ["p","$R^2_{sample}$","$MSE_{sample}$","$R^2_{adjusted}$", "$MSE_{adjusted}$"], \
                      index = ["Basic Reg","Flexible Reg"])
table_p

Unnamed: 0,p,$R^2_{sample}$,$MSE_{sample}$,$R^2_{adjusted}$,$MSE_{adjusted}$
Basic Reg,48.0,0.180238,0.208219,0.151225,0.215745
Flexible Reg,826.0,0.507044,0.125211,0.231503,0.19534


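The table above shows that, by the in-sample measures, the flexible model performs better: it has a higher $R^2_{adjusted}$ and a lower $MSE_{adjusted}$ than the basic model. These measures are computed on the same data used for estimation, so they still need to be checked out of sample.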
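
The adjusted measures in the table can be checked by hand. The formulas below are our reading of the code and of the reported output (with $p$ counting the intercept), not a statement taken from the course material:

$$MSE_{adjusted} = \frac{n}{n-p}\, MSE_{sample}, \qquad R^2_{adjusted} = 1 - \frac{n-1}{n-p}\,\left(1 - R^2_{sample}\right)$$

For the basic model, $n = 1376$ and $p = 48$:

$$MSE_{adjusted} = \frac{1376}{1328} \times 0.208219 \approx 0.215745, \qquad R^2_{adjusted} = 1 - \frac{1375}{1328} \times (1 - 0.180238) \approx 0.151225$$

For the flexible model the code uses `flex_results.scale`, which divides the residual sum of squares by the residual degrees of freedom reported in the summary (882) rather than by $n - 826$: $\frac{1376}{882} \times 0.125211 \approx 0.19534$.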

### Data Splitting

In [214]:
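import math

# Set the seed so that the random draws are reproducible
np.random.seed(0)
# Draw a random integer key for each observation; sorting by this key shuffles the rows.
# The variable is called random_key so that it does not shadow Python's random module.
random_key = np.random.randint(0, n, size=n)
high_school["random"] = random_key
random_key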

array([ 684,  559, 1216, ..., 1294,  573, 1367])

In [215]:
high_school_r = high_school.sort_values(by=['random'])
high_school_r.head()

rownames,lwage,sex,shs,hsg,scl,clg,ad,ne,mw,so,we,exp1,exp2,exp3,exp4,occ,occ2,ind,ind2,random
7297,3.274965,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,14.0,1.96,2.744,3.8416,9130,22,6380,10,0
12727,2.851151,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,12.0,1.44,1.728,2.0736,4030,13,8680,20,0
6841,2.733368,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,10.0,1.0,1.0,1.0,6440,19,770,4,3
9147,3.179655,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,16.0,2.56,4.096,6.5536,2010,6,8370,18,3
19858,2.733368,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,9.0,0.81,0.729,0.6561,5520,17,4870,9,3


In [216]:
# Create training and testing sample 
train = high_school_r[ : math.floor(n*4/5)]    # training sample
test =  high_school_r[ math.floor(n*4/5) : ]   # testing sample
print(train.shape)
print(test.shape)

(1100, 20)
(276, 20)


* #### Basic Model

In [217]:
# Estimating the parameters in the training sample
basic_results_t = smf.ols(basic, data=train).fit()
print(basic_results_t.summary())

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.212
Model:                            OLS   Adj. R-squared:                  0.177
Method:                 Least Squares   F-statistic:                     6.014
Date:                Fri, 17 Sep 2021   Prob (F-statistic):           1.77e-30
Time:                        13:30:22   Log-Likelihood:                -630.16
No. Observations:                1100   AIC:                             1356.
Df Residuals:                    1052   BIC:                             1596.
Df Model:                          47                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.8794      0.103     27.825      0.0

In [218]:
lwage_test = test["lwage"].values

# Predict out of sample (the formula interface adds the intercept itself,
# so no constant column has to be added to the test data)
lwage_pred =  basic_results_t.predict(test) 
print(lwage_pred)

rownames
30118    2.802625
18894    2.475869
25472    2.522232
29372    2.679515
23605    2.655664
           ...   
2674     2.814701
7650     2.724863
14109    2.928647
1496     2.878688
18895    2.549583
Length: 276, dtype: float64


In [219]:
#Computing R2 and MSE of test sample
MSE_test_b = np.sum((lwage_test-lwage_pred)**2)/len(lwage_test)
R2_test_b  = 1 - MSE_test_b/np.var(lwage_test)

print("Test MSE for the basic model: ", MSE_test_b, " ")
print("Test R2 for the basic model: ", R2_test_b)

Test MSE for the basic model:  0.3213686771335853  
Test R2 for the basic model:  0.03619815734367482


* #### Flexible model

In [223]:

# Estimating the parameters in the training sample
flex_results_t = smf.ols(flex , data=train).fit()
print(flex_results_t.summary())
# Predict out of sample
lwage_flex_pred =  flex_results_t.predict(test)
lwage_test = test["lwage"].values

#Computing R2 and MSE of test sample
MSE_test_f = np.sum((lwage_test-lwage_flex_pred)**2)/len(lwage_test)
R2_test_f  = 1 - MSE_test_f/np.var(lwage_test)

print("Test MSE for the flexible model: ", MSE_test_f, " ")
print("Test R2 for the flexible model: ", R2_test_f)

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.562
Model:                            OLS   Adj. R-squared:                  0.244
Method:                 Least Squares   F-statistic:                     1.767
Date:                Fri, 17 Sep 2021   Prob (F-statistic):           1.52e-11
Time:                        13:33:31   Log-Likelihood:                -307.38
No. Observations:                1100   AIC:                             1541.
Df Residuals:                     637   BIC:                             3857.
Df Model:                         462                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                 3.13

In [222]:
table_ds = np.zeros((2, 2))
table_ds[0,0] = MSE_test_b
table_ds[1,0] = MSE_test_f
table_ds[0,1] = R2_test_b
table_ds[1,1] = R2_test_f

table_ds = pd.DataFrame(table_ds, columns = ["$MSE_{test}$", "$R^2_{test}$"], \
                      index = ["basic reg","flexible reg"])
table_ds

Unnamed: 0,$MSE_{test}$,$R^2_{test}$
basic reg,0.321369,0.036198
flexible reg,8.74001,-25.211757


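In the test sample, the flexible model has a negative $R^2_{test}$ (-25.21) and an $MSE_{test}$ of 8.74, far worse than the basic model (0.32). A negative $R^2_{test}$ means that the model predicts worse than the mean of the test sample would. This is probably a sign of overfitting: the flexible model estimates several hundred coefficients from only 1,100 training observations, and its predictions are then evaluated on a small test sample of 276 observations. The better in-sample fit of the flexible model does not carry over out of sample.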
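
The two columns of the table are linked by the formula used in the code, where the variance is the one returned by `np.var` on the test sample:

$$R^2_{test} = 1 - \frac{MSE_{test}}{\widehat{Var}(lwage_{test})}$$

The variance is not printed, but it can be backed out from the basic model row: $\widehat{Var}(lwage_{test}) = \frac{0.321369}{1 - 0.036198} \approx 0.3334$. Plugging this into the flexible model row gives $1 - \frac{8.74001}{0.3334} \approx -25.21$, which matches the table. The out-of-sample squared error of the flexible model is about 26 times the variance of the outcome itself.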

# 3. PARTIALLING-OUT USING LASSO

This section is based on the following reference: [Inference with Lasso](https://prod-edxapp.edx-cdn.org/assets/courseware/v1/d4c26e68eab1fb6ab47d5654d8212091/asset-v1:MITxPRO+DSx+2T2018+type@asset+block/Regression2.3.pdf)

We have the following equation: $Y = D \beta_1 + W \beta_2 + \epsilon$, where $D$ is the target regressor and $W$ are the controls.

We define the variables $Y = \log(wage)$ and $D = sex$.

The procedure has three steps:
* First, we run the Lasso regression of $Y_i$ on $W_i$ and of $D_i$ on $W_i$ and keep the resulting residuals, called $\widetilde{Y_i}$ and $\widetilde{D_i}$
* Second, we run the least squares regression of $\widetilde{Y_i}$ on $\widetilde{D_i}$
* The resulting estimated regression coefficient is our estimator $\hat{\beta_1}$
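
The three steps can be sketched on synthetic data before applying them to the wage data. Everything below is an assumption made for illustration: the data-generating process, the true coefficient `beta_1`, the penalty `alpha` and the helper name `partial_out` are ours, not taken from the reference.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data (assumption): D depends on the first control,
# Y depends on D and on the first two controls
n, p = 1000, 20
W = rng.normal(size=(n, p))
D = 0.8 * W[:, 0] + rng.normal(size=n)
beta_1 = -0.1                      # true coefficient on D, chosen for illustration
Y = beta_1 * D + 0.5 * W[:, 0] + 0.3 * W[:, 1] + rng.normal(size=n)

def partial_out(target, controls, alpha):
    """Return the residuals of a Lasso regression of target on controls."""
    model = Lasso(alpha=alpha).fit(controls, target)
    return target - model.predict(controls)

alpha = 0.05                       # illustrative penalty level
Y_tilde = partial_out(Y, W, alpha) # step 1a: residuals of Y on W
D_tilde = partial_out(D, W, alpha) # step 1b: residuals of D on W

# Step 2: least squares slope of Y_tilde on D_tilde.
# Lasso fits an unpenalized intercept, so both residual vectors have mean zero
# and the slope reduces to a ratio of inner products.
beta_1_hat = (D_tilde @ Y_tilde) / (D_tilde @ D_tilde)
print("Estimated beta_1:", beta_1_hat)
```

In this sketch the estimate should land near the value used to generate the data, up to sampling noise and the shrinkage introduced by the penalty.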

## 3.1 Case 1: Partialling-Out using Lasso
####  Matrix $W = exp1 + hsg + mw + so + we + occ2 + ind2$

* #### With $\alpha = 0.1$

In [196]:
# Set penalty value: 0.1 scaled by log(n), where n is the number of observations
alpha1=0.1/np.log(len(high_school))
alpha1

0.01383712264009341

First, we run the Lasso regression of $Y$ on $W$ and keep the residuals $\widetilde{Y_i}$

In [185]:
# Get exogenous variables 
y_w1 = 'lwage ~ exp1 + hsg + mw + so + we + occ2+ ind2'
W1 = smf.ols(y_w1 , data=high_school).exog
W1.shape

(1376, 47)

In [186]:
# Set endogenous variable
lwage = high_school["lwage"]
lwage.shape

(1376,)

In [187]:
# Lasso regression of Y on W
reg_y_w1 = linear_model.Lasso(alpha = alpha1)
reg_y_w1.fit(W1, lwage)

# Get predicted values (the model is already fitted above)
lwage_lasso_fitted1 = reg_y_w1.predict( W1 )

#Get residuals (Y~)
resid_y_w1 = high_school["lwage"] -  lwage_lasso_fitted1
resid_y_w1

rownames
500     -0.748635
540     -0.463153
691     -0.115274
843     -0.428509
1775    -0.408542
           ...   
32580   -0.160476
32590   -0.069724
32599    0.393835
32603    0.130112
32631    0.810409
Name: lwage, Length: 1376, dtype: float64

Second, we run the Lasso regression of $D$ on $W$ and keep the residuals $\widetilde{D_i}$

In [188]:
# Get exogenous variables
d_w1 = 'sex ~ exp1 + hsg + mw + so + we + occ2+ ind2'
W1 = smf.ols(d_w1 , data=high_school).exog
W1.shape

# Set endogenous variable
sex= high_school['sex']
sex.shape

(1376,)

In [189]:
# Lasso regression of D on W
reg_d_w1 = linear_model.Lasso(alpha = alpha1)
reg_d_w1.fit(W1, sex)

# Get predicted values (the model is already fitted above)
sex_lasso_fitted1 = reg_d_w1.predict( W1 )

#Get residuals (D~)
resid_d_w1 = high_school["sex"] -  sex_lasso_fitted1
resid_d_w1

rownames
500     -0.192310
540     -0.270353
691     -0.347164
843     -0.346544
1775     0.665853
           ...   
32580   -0.338486
32590    0.665853
32599   -0.245463
32603   -0.222070
32631   -0.335387
Name: sex, Length: 1376, dtype: float64

Finally, we run the least squares regression of $\widetilde{Y_i}$ on $\widetilde{D_i}$

In [190]:
#Set a dataframe with resulting residuals 
resid1 = np.zeros((1376, 2))
resid1[:, 0] = resid_y_w1
resid1[:, 1] = resid_d_w1
residuals1 = pd.DataFrame( resid1, columns = ['resid_y_w1', 'resid_d_w1'])
residuals1

Unnamed: 0,resid_y_w1,resid_d_w1
0,-0.748635,-0.192310
1,-0.463153,-0.270353
2,-0.115274,-0.347164
3,-0.428509,-0.346544
4,-0.408542,0.665853
...,...,...
1371,-0.160476,-0.338486
1372,-0.069724,0.665853
1373,0.393835,-0.245463
1374,0.130112,-0.222070


In [191]:
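# OLS regression of (Y~) on (D~)
y_d1 = smf.ols( formula = 'resid_y_w1 ~ resid_d_w1', data = residuals1 )
y_d1_fit = y_d1.fit()   # fit once and reuse the results
y_d_summ1 = y_d1_fit.summary2()

# Coefficient on the residualized gender indicator, selected by name
y_d_coef1 = y_d1_fit.params['resid_d_w1']
# Heteroskedasticity-robust (HC0) standard error of the same coefficient
y_d1_se = np.asarray(y_d1_fit.HC0_se)[1]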

print("Gender Coefficient via partialling-out using lasso (case 1): ", y_d_coef1)
print( "Robust standard error via partialling-out using lasso (case 1) is:", y_d1_se )
y_d_summ1

Gender Coefficient via partialling-out using lasso (case 1):  -0.10314603281349857
Robust standard error via partialling-out using lasso (case 1) is: 0.030675549913684742


0,1,2,3
Model:,OLS,Adj. R-squared:,0.007
Dependent Variable:,resid_y_w1,AIC:,1935.5713
Date:,2021-09-17 13:13,BIC:,1946.0251
No. Observations:,1376,Log-Likelihood:,-965.79
Df Model:,1,F-statistic:,11.31
Df Residuals:,1374,Prob (F-statistic):,0.000794
R-squared:,0.008,Scale:,0.23867

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,0.0000,0.0132,0.0000,1.0000,-0.0258,0.0258
resid_d_w1,-0.1031,0.0307,-3.3624,0.0008,-0.1633,-0.0430

0,1,2,3
Omnibus:,246.863,Durbin-Watson:,1.868
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1201.109
Skew:,0.753,Prob(JB):,0.0
Kurtosis:,7.322,Condition No.:,2.0


We compute $R^2_{sample}$, $MSE_{sample}$, $R^2_{adjusted}$ and $MSE_{adjusted}$.

In [192]:
#Obtaining R2 and R2_adj of the partialling-out using lasso (case 1)
R2_l1 = y_d1.fit().rsquared
print("R-squared for the partialling out using lasso (case 1 - alpha = 0.1): ", R2_l1, "\n")

R2_adj_l1 = y_d1.fit().rsquared_adj
print("Adjusted R-squared for the partialling out using lasso (case 1 - alpha = 0.1)", R2_adj_l1, "\n")

R-squared for the partialling out using lasso (case 1 - alpha = 0.1):  0.008160955622151023 

Adjusted R-squared for the partialling out using lasso (case 1 - alpha = 0.1) 0.007439093144438025 



In [193]:
#Obtaining MSE and MSE_adj of the partialling-out using lasso (case 1)
MSE_l1 =  np.mean(y_d1.fit().resid**2)
print("MSE for the partialling out using lasso (case 1 - alpha = 0.1): ", MSE_l1, "\n")

p1 = len(y_d1.fit().params) # number of regressors
n = len(lwage)

MSE_adj_l1  = (n/(n-p1))*MSE_l1
print("Adjusted MSE for the partialling out using lasso (case 1- alpha = 0.1): ", MSE_adj_l1, "\n")

MSE for the partialling out using lasso (case 1 - alpha = 0.1):  0.23832526411590682 

Adjusted MSE for the partialling out using lasso (case 1- alpha = 0.1):  0.2386721713416942 



### Now we try higher penalty values
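
Before looking at the individual penalty values, it may help to see what a larger $\alpha$ does to a Lasso fit in general. The sketch below uses synthetic data (an assumption, not the wage data) and counts how many coefficients stay nonzero as the penalty grows. Once the penalty is large enough, every coefficient is set to zero and the residuals are just the demeaned variable, which may be one reason why estimates at different large penalties look alike.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data (assumption): only the first two of 20 controls matter
n, p = 1000, 20
W = rng.normal(size=(n, p))
y = 0.5 * W[:, 0] + 0.3 * W[:, 1] + rng.normal(size=n)

# Count the nonzero Lasso coefficients for a grid of penalty values
n_selected = {}
for alpha in [0.01, 0.1, 0.2, 0.3, 0.4, 1.0]:
    model = Lasso(alpha=alpha).fit(W, y)
    n_selected[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha = {alpha}: {n_selected[alpha]} nonzero coefficients")
```

The grid of penalties is illustrative; the values 0.2, 0.3 and 0.4 are included only because they are the ones tried in this section.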

* #### With $\alpha = 0.2$

In [138]:
# Set penalty value
alpha1_2=0.2

# Lasso regression Y on W
reg_y_w1 = linear_model.Lasso(alpha = alpha1_2)
reg_y_w1.fit(W1, lwage)

# Get Predicted values
lwage_lasso_fitted1 = reg_y_w1.fit(W1, lwage).predict( W1 )

#Get residuals
resid_y_w1 = high_school["lwage"] -  lwage_lasso_fitted1

# Lasso regression D on W
reg_d_w1 = linear_model.Lasso(alpha = alpha1_2)
reg_d_w1.fit(W1, sex)

# Get Predicted values
sex_lasso_fitted1 = reg_d_w1.fit(W1, sex).predict( W1 )

#Get residuals
resid_d_w1 = high_school["sex"] -  sex_lasso_fitted1

#Set a dataframe with resulting residuals 
resid1 = np.zeros((1376, 2))
resid1[:, 0] = resid_y_w1
resid1[:, 1] = resid_d_w1
residuals1 = pd.DataFrame( resid1, columns = ['resid_y_w1', 'resid_d_w1'])

# OLS regression of the residuals of Y on the residuals of D
y_d1 = smf.ols( formula = 'resid_y_w1 ~ resid_d_w1', data = residuals1 )
y_d_summ1 = y_d1.fit().summary2()

y_d_coef1_2 = y_d_summ1.tables[1]['Coef.']['resid_d_w1']
HCV_coef1 = y_d1.fit().cov_HC0
y_d1_se = np.power( HCV_coef1.diagonal() , 0.5)[1]

print("Gender Coefficient via partialling-out using lasso (case 1 - alpha = 0.2): ", y_d_coef1_2)
print( "Robust standard error via partialling-out using lasso (case 1 - alpha = 0.2) is:", y_d1_se )

#Obtaining R2 and R2_adj of the partial regression using lasso (case 1)
R2_l1_2 = y_d1.fit().rsquared
print("R-squared for the partialling out using lasso (case 1 - alpha = 0.2): ", R2_l1_2, "\n")

R2_adj_l1_2 = y_d1.fit().rsquared_adj
print("Adjusted R-squared for the partialling out using lasso (case 1- alpha = 0.2)", R2_adj_l1_2, "\n")

MSE_l1_2 =  np.mean(y_d1.fit().resid**2)
print("MSE for the partialling out using lasso (case 1 - alpha = 0.2): ", MSE_l1_2, "\n")

p1_2 = len(y_d1.fit().params) # number of regressors
n = len(lwage)

MSE_adj_l1_2 = (n/(n-p1_2))*MSE_l1_2
print("Adjusted MSE for the partialling out using lasso (case 1- alpha = 0.2): ", MSE_adj_l1_2, "\n")

Gender Coefficient via partialling-out using lasso (case 1 - alpha = 0.2):  -0.12267278814089315
Robust standard error via partialling-out using lasso (case 1 - alpha = 0.2) is: 0.028028894186864443
R-squared for the partialling out using lasso (case 1 - alpha = 0.2):  0.013299193400496345 

Adjusted R-squared for the partialling out using lasso (case 1- alpha = 0.2) 0.012581070542709294 

MSE for the partialling out using lasso (case 1 - alpha = 0.2):  0.24372784918385726 

Adjusted MSE for the partialling out using lasso (case 1- alpha = 0.2):  0.2440826204344888 



* #### With $\alpha = 0.3$

In [139]:
# Set penalty value
alpha1_3=0.3

# Lasso regression Y on W
reg_y_w1 = linear_model.Lasso(alpha = alpha1_3)
reg_y_w1.fit(W1, lwage)

# Get Predicted values
lwage_lasso_fitted1 = reg_y_w1.fit(W1, lwage).predict( W1 )

#Get residuals
resid_y_w1 = high_school["lwage"] -  lwage_lasso_fitted1

# Lasso regression D on W
reg_d_w1 = linear_model.Lasso(alpha = alpha1_3)
reg_d_w1.fit(W1, sex)

# Get Predicted values
sex_lasso_fitted1 = reg_d_w1.fit(W1, sex).predict( W1 )

#Get residuals
resid_d_w1 = high_school["sex"] -  sex_lasso_fitted1

#Set a dataframe with resulting residuals 
resid1 = np.zeros((1376, 2))
resid1[:, 0] = resid_y_w1
resid1[:, 1] = resid_d_w1
residuals1 = pd.DataFrame( resid1, columns = ['resid_y_w1', 'resid_d_w1'])

# OLS regression of the residuals of Y on the residuals of D
y_d1 = smf.ols( formula = 'resid_y_w1 ~ resid_d_w1', data = residuals1 )
y_d_summ1 = y_d1.fit().summary2()

y_d_coef1_3 = y_d_summ1.tables[1]['Coef.']['resid_d_w1']
HCV_coef1 = y_d1.fit().cov_HC0
y_d1_se = np.power( HCV_coef1.diagonal() , 0.5)[1]

print("Gender Coefficient via partialling-out using lasso (case 1 - alpha = 0.3): ", y_d_coef1_3)
print( "Robust standard error via partialling-out using lasso (case 1 - alpha = 0.3) is:", y_d1_se )

#Obtaining R2 and R2_adj of the partial regression using lasso (case 1)
R2_l1_3 = y_d1.fit().rsquared
print("R-squared for the partialling out using lasso (case 1 - alpha = 0.3): ", R2_l1_3, "\n")

R2_adj_l1_3 = y_d1.fit().rsquared_adj
print("Adjusted R-squared for the partialling out using lasso (case 1- alpha = 0.3)", R2_adj_l1_3, "\n")

MSE_l1_3 =  np.mean(y_d1.fit().resid**2)
print("MSE for the partialling out using lasso (case 1 - alpha = 0.3): ", MSE_l1_3, "\n")

p1_3 = len(y_d1.fit().params) # number of regressors
n = len(lwage)

MSE_adj_l1_3 = (n/(n-p1_3))*MSE_l1_3
print("Adjusted MSE for the partialling out using lasso (case 1- alpha = 0.3): ", MSE_adj_l1_3, "\n")

Gender Coefficient via partialling-out using lasso (case 1 - alpha = 0.3):  -0.12227526001471534
Robust standard error via partialling-out using lasso (case 1 - alpha = 0.3) is: 0.02805105698181906
R-squared for the partialling out using lasso (case 1 - alpha = 0.3):  0.013188296079184147 

Adjusted R-squared for the partialling out using lasso (case 1- alpha = 0.3) 0.012470092510100672 

MSE for the partialling out using lasso (case 1 - alpha = 0.3):  0.2442144151848688 

Adjusted MSE for the partialling out using lasso (case 1- alpha = 0.3):  0.2445698946829545 



* #### With $\alpha = 0.4$

In [140]:
# Set penalty value
alpha1_4=0.4

# Lasso regression Y on W
reg_y_w1 = linear_model.Lasso(alpha = alpha1_4)
reg_y_w1.fit(W1, lwage)

# Get Predicted values
lwage_lasso_fitted1 = reg_y_w1.fit(W1, lwage).predict( W1 )

#Get residuals
resid_y_w1 = high_school["lwage"] -  lwage_lasso_fitted1

# Lasso regression D on W
reg_d_w1 = linear_model.Lasso(alpha = alpha1_4)
reg_d_w1.fit(W1, sex)

# Get Predicted values
sex_lasso_fitted1 = reg_d_w1.fit(W1, sex).predict( W1 )

#Get residuals
resid_d_w1 = high_school["sex"] -  sex_lasso_fitted1

#Set a dataframe with resulting residuals 
resid1 = np.zeros((1376, 2))
resid1[:, 0] = resid_y_w1
resid1[:, 1] = resid_d_w1
residuals1 = pd.DataFrame( resid1, columns = ['resid_y_w1', 'resid_d_w1'])

#OLS regression of the residuals of Y on the residuals of D
y_d1 = smf.ols( formula = 'resid_y_w1 ~ resid_d_w1', data = residuals1 )
y_d_summ1 = y_d1.fit().summary2()

y_d_coef1_4 = y_d_summ1.tables[1]['Coef.'][1]
HCV_coef1 = y_d1.fit().cov_HC0
y_d1_se = np.power( HCV_coef1.diagonal() , 0.5)[1]

print("Gender Coefficient via partialling-out using lasso (case 1 - alpha = 0.4): ", y_d_coef1_4)
print( "Robust standard error via partialling-out using lasso (case 1 - alpha = 0.4) is:", y_d1_se )

#Obtaining R2 and R2_adj of the partial regression using lasso (case 1)
R2_l1_4 = y_d1.fit().rsquared
print("R-squared for the partialling out using lasso (case 1 - alpha = 0.4): ", R2_l1_4, "\n")

R2_adj_l1_4 = y_d1.fit().rsquared_adj
print("Adjusted R-squared for the partialling out using lasso (case 1- alpha = 0.4)", R2_adj_l1_4, "\n")

MSE_l1_4 =  np.mean(y_d1.fit().resid**2)
print("MSE for the partialling out using lasso (case 1 - alpha = 0.4): ", MSE_l1_4, "\n")

p1_4 = len(y_d1.fit().params) # number of regressors
n = len(lwage)

MSE_adj_l1_4 = (n/(n-p1_4))*MSE_l1_4
print("Adjusted MSE for the partialling out using lasso (case 1- alpha = 0.4): ", MSE_adj_l1_4, "\n")

Gender Coefficient via partialling-out using lasso (case 1 - alpha = 0.4):  -0.12187773188853754
Robust standard error via partialling-out using lasso (case 1 - alpha = 0.4) is: 0.028084282414635776
R-squared for the partialling out using lasso (case 1 - alpha = 0.4):  0.013068283290403104 

Adjusted R-squared for the partialling out using lasso (case 1- alpha = 0.4) 0.012349992375767394 

MSE for the partialling out using lasso (case 1 - alpha = 0.4):  0.24488703601943335 

Adjusted MSE for the partialling out using lasso (case 1- alpha = 0.4):  0.24524349458714723 



#### The following table shows the results with different $\alpha$

In [213]:
table_l1 = np.zeros((4, 5))
table_l1 [0,0:5] = [y_d_coef1, R2_l1, MSE_l1, R2_adj_l1, MSE_adj_l1 ]
table_l1 [1,0:5] = [y_d_coef1_2, R2_l1_2, MSE_l1_2, R2_adj_l1_2, MSE_adj_l1_2 ]
table_l1 [2,0:5] = [y_d_coef1_3, R2_l1_3, MSE_l1_3, R2_adj_l1_3, MSE_adj_l1_3 ]
table_l1 [3,0:5] = [y_d_coef1_4, R2_l1_4, MSE_l1_4, R2_adj_l1_4, MSE_adj_l1_4 ]
table_l1  = pd.DataFrame(table_l1 , columns = ["Gender Coefficient","$R^2_{sample}$","$MSE_{sample}$","$R^2_{adjusted}$", "$MSE_{adjusted}$"], \
                      index = ["partialling out-lasso alpha = 0.013", "partialling out-lasso alpha = 0.2", "partialling out-lasso alpha = 0.3", "partialling out-lasso alpha = 0.4"])
table_l1 

| | Gender Coefficient | $R^2_{sample}$ | $MSE_{sample}$ | $R^2_{adjusted}$ | $MSE_{adjusted}$ |
|---|---|---|---|---|---|
| partialling out-lasso alpha = 0.013 | -0.103146 | 0.008161 | 0.238325 | 0.007439 | 0.238672 |
| partialling out-lasso alpha = 0.2 | -0.122673 | 0.013299 | 0.243728 | 0.012581 | 0.244083 |
| partialling out-lasso alpha = 0.3 | -0.122275 | 0.013188 | 0.244214 | 0.01247 | 0.24457 |
| partialling out-lasso alpha = 0.4 | -0.121878 | 0.013068 | 0.244887 | 0.01235 | 0.245243 |


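As the penalty increases from $\alpha = 0.2$ to $\alpha = 0.4$, the $R^2_{adjusted}$ decreases slightly and the $MSE_{adjusted}$ increases slightly. Relative to the smallest penalty ($\alpha = 0.013$), however, both the $R^2_{adjusted}$ and the $MSE_{adjusted}$ are higher.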
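The cells above repeat the same three steps for each penalty value. As a minimal sketch, those steps could be wrapped in a helper and looped over a grid of $\alpha$ values. The helper name `partial_out_lasso` and the synthetic data are assumptions made for illustration only; they are not part of this notebook, and the numbers they produce are unrelated to the wage data.

```python
import numpy as np
from sklearn.linear_model import Lasso

def partial_out_lasso(y, d, W, alpha):
    """Hypothetical helper: lasso-residualize y and d on W, then regress residuals."""
    res_y = y - Lasso(alpha=alpha).fit(W, y).predict(W)
    res_d = d - Lasso(alpha=alpha).fit(W, d).predict(W)
    # Lasso fits an intercept by default, so both residual vectors have mean zero
    # and the OLS slope (with intercept) reduces to this ratio
    coef = np.sum(res_d * res_y) / np.sum(res_d ** 2)
    mse = np.mean((res_y - coef * res_d) ** 2)
    return coef, mse

# Synthetic data (assumption): 5 controls, a binary treatment, a continuous outcome
rng = np.random.default_rng(0)
n = 500
W = rng.normal(size=(n, 5))
d = (W[:, 0] + rng.normal(size=n) > 0).astype(float)
y = -0.1 * d + 0.5 * W[:, 0] + rng.normal(scale=0.5, size=n)

# Same penalty grid as in the table above
results = {a: partial_out_lasso(y, d, W, a) for a in (0.013, 0.2, 0.3, 0.4)}
for a, (coef, mse) in results.items():
    print(f"alpha = {a}: coefficient = {coef:.4f}, MSE = {mse:.4f}")
```

Collecting the results in a dictionary keyed by $\alpha$ also avoids reusing a variable from a previous penalty by mistake when the code is copied from cell to cell.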

## 3.2 Case 2: Partialling-out using lasso
#### Matrix $W = (exp1+exp2+exp3+exp4+ hsg+occ2+ind2+mw+so+we)^{2}$

#### Since the lasso fit does not converge for penalty values below $\alpha = 0.4$, we perform the partialling-out using lasso with $\alpha = 0.4$

In [201]:
# Set penalty value
alpha2=0.4
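One way to check convergence for a candidate penalty is to record whether scikit-learn emits a `ConvergenceWarning` during the fit. The sketch below assumes that this warning is what "does not converge" refers to here; the helper `lasso_converges` and the synthetic data are hypothetical and are not taken from the notebook.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import Lasso

def lasso_converges(W, y, alpha, max_iter=1000):
    """Hypothetical helper: True if Lasso fits without a ConvergenceWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure the warning is not suppressed
        Lasso(alpha=alpha, max_iter=max_iter).fit(W, y)
    return not any(issubclass(w.category, ConvergenceWarning) for w in caught)

# Synthetic data (assumption), only to make the sketch runnable
rng = np.random.default_rng(1)
W = rng.normal(size=(200, 30))
y = W[:, 0] + rng.normal(size=200)

for alpha in (0.001, 0.1, 0.4):
    print(alpha, lasso_converges(W, y, alpha))
```

Raising `max_iter` or standardizing the columns of $W$ are common alternatives to increasing the penalty when the warning appears.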

* First, run a lasso regression of $Y$ on $W$ and keep the residuals $\widetilde{Y_i}$

In [202]:
# Get exogenous variables 
y_w2 = 'lwage ~ (exp1 + exp2 + exp3 + exp4 + hsg + mw + so + we +  occ2 + ind2)**2'
W2 = smf.ols(y_w2 , data=high_school).exog
W2.shape

# Set endogenous variable
lwage = high_school["lwage"]
lwage.shape

(1376,)

In [203]:
# Lasso regression
reg_y_w2 = linear_model.Lasso(alpha = alpha2)
reg_y_w2.fit(W2, lwage)

# Get Predicted values
lwage_lasso_fitted2 = reg_y_w2.fit(W2, lwage).predict( W2 )
lwage_lasso_fitted2

#Get residuals (Y~)
resid_y_w2 = high_school["lwage"] -  lwage_lasso_fitted2
resid_y_w2

rownames
500     -0.797724
540     -0.469858
691     -0.066532
843     -0.519673
1775    -0.408546
           ...   
32580   -0.123039
32590   -0.054448
32599    0.431272
32603    0.157571
32631    0.835533
Name: lwage, Length: 1376, dtype: float64

* Second, run a lasso regression of $D$ on $W$ and keep the residuals $\widetilde{D_i}$

In [204]:
# Get exogenous variables
d_w2 = 'sex ~ (exp1 + exp2 + exp3 + exp4 + hsg + mw + so + we +  occ2 + ind2)**2'
W2 = smf.ols(d_w2 , data=high_school).exog
W2.shape

# Set endogenous variable
sex= high_school['sex']
sex.shape

(1376,)

In [205]:
# Lasso regression
reg_d_w2 = linear_model.Lasso(alpha = alpha2)
reg_d_w2.fit(W2, sex)

# Get Predicted values
sex_lasso_fitted2 = reg_d_w2.fit(W2, sex).predict( W2 )
sex_lasso_fitted2

#Get residuals (D~)
resid_d_w2 = high_school["sex"] -  sex_lasso_fitted2
resid_d_w2

rownames
500     -0.063440
540     -0.309617
691     -0.324403
843     -0.336025
1775     0.688802
           ...   
32580   -0.311838
32590    0.688634
32599   -0.306663
32603   -0.310033
32631   -0.311260
Name: sex, Length: 1376, dtype: float64

* Finally, run a least squares regression of $\widetilde{Y_i}$ on $\widetilde{D_i}$

In [206]:
#Set a dataframe with resulting residuals 
resid2 = np.zeros((1376, 2))
resid2[:, 0] = resid_y_w2
resid2[:, 1] = resid_d_w2
residuals2 = pd.DataFrame( resid2, columns = ['resid_y_w2', 'resid_d_w2'])
residuals2

Unnamed: 0,resid_y_w2,resid_d_w2
0,-0.797724,-0.063440
1,-0.469858,-0.309617
2,-0.066532,-0.324403
3,-0.519673,-0.336025
4,-0.408546,0.688802
...,...,...
1371,-0.123039,-0.311838
1372,-0.054448,0.688634
1373,0.431272,-0.306663
1374,0.157571,-0.310033


In [207]:
# OLS regression of (Y~) on (D~)
y_d2 = smf.ols( formula = 'resid_y_w2 ~ resid_d_w2', data = residuals2 )
y_d_summ2 = y_d2.fit().summary2()

y_d_coef2 = y_d_summ2.tables[1]['Coef.'][1]
HCV_coef2 = y_d2.fit().cov_HC0
y_d2_se = np.power( HCV_coef2.diagonal() , 0.5)[1]

print("Gender Coefficient via partialling-out using lasso (case 2 - alpha = 0.4): ", y_d_coef2)

print( "Robust standard error via partialling-out using lasso (case 2 - alpha = 0.4) is:", y_d2_se )
y_d_summ2


Gender Coefficient via partialling-out using lasso (case 2 - alpha = 0.4):  -0.11070244242931247
Robust standard error via partialling-out using lasso (case 2 - alpha = 0.4) is: 0.028474419714432674


0,1,2,3
Model:,OLS,Adj. R-squared:,0.01
Dependent Variable:,resid_y_w2,AIC:,1931.6781
Date:,2021-09-17 13:16,BIC:,1942.132
No. Observations:,1376,Log-Likelihood:,-963.84
Df Model:,1,F-statistic:,14.59
Df Residuals:,1374,Prob (F-statistic):,0.00014
R-squared:,0.011,Scale:,0.238

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,0.0000,0.0132,0.0000,1.0000,-0.0258,0.0258
resid_d_w2,-0.1107,0.0290,-3.8192,0.0001,-0.1676,-0.0538

0,1,2,3
Omnibus:,256.937,Durbin-Watson:,1.864
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1338.148
Skew:,0.767,Prob(JB):,0.0
Kurtosis:,7.581,Condition No.:,2.0
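Note that the `Std.Err.` column of the summary table (0.0290) is the conventional standard error, while the robust value printed before it (0.0285) comes from `cov_HC0`. As a rough sketch of what the HC0 (White) estimator computes, the sandwich formula $(X'X)^{-1} X' \mathrm{diag}(\hat e_i^2) X (X'X)^{-1}$ can be written out with NumPy. The function `ols_hc0` and the synthetic data are illustrative assumptions, not the notebook's code.

```python
import numpy as np

def ols_hc0(y, d):
    """Illustrative OLS of y on [1, d] with HC0 (White) standard errors."""
    X = np.column_stack([np.ones_like(d), d])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # "Meat" of the sandwich: X' diag(resid^2) X
    meat = X.T @ (X * resid[:, None] ** 2)
    cov = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))

# Synthetic heteroskedastic data (assumption)
rng = np.random.default_rng(2)
d = rng.normal(size=300)
y = -0.1 * d + rng.normal(size=300) * (1 + np.abs(d))

beta, se = ols_hc0(y, d)
print("slope:", beta[1], "HC0 se:", se[1])
```

For a single regressor the slope's HC0 variance reduces to $\sum_i (d_i - \bar d)^2 \hat e_i^2 / \left(\sum_i (d_i - \bar d)^2\right)^2$, which gives a quick way to check the matrix version.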


In [208]:
#Obtaining R2 and R2_adj of the partial regression using lasso (case 2)
R2_l2 = y_d2.fit().rsquared
print("R-squared for the partialling out using lasso (case 2 - alpha = 0.4): ", R2_l2, "\n")

R2_adj_l2 = y_d2.fit().rsquared_adj
print("adjusted R-squared for the partialling out using lasso (case 2 - alpha = 0.4)", R2_adj_l2, "\n")

R-squared for the partialling out using lasso (case 2 - alpha = 0.4):  0.010504277990547428 

adjusted R-squared for the partialling out using lasso (case 2 - alpha = 0.4) 0.009784120987629485 



In [209]:
#Obtaining MSE and MSE_adj of the partialling-out using lasso (case 2)
MSE_l2 =  np.mean(y_d2.fit().resid**2)
print("MSE for the partialling out using lasso (case 2 - alpha = 0.4): ", MSE_l2, "\n")

p2 = len(y_d2.fit().params) # number of regressors
n = len(lwage)

MSE_adj_l2  = (n/(n-p2))*MSE_l2
print("adjusted MSE for the partialling out using lasso (case 2 - alpha = 0.4): ", MSE_adj_l2, "\n")

MSE for the partialling out using lasso (case 2 - alpha = 0.4):  0.23765192268691798 

adjusted MSE for the partialling out using lasso (case 2 - alpha = 0.4):  0.23799784979417696 
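The adjusted figures printed above can be reproduced by hand from the unadjusted ones, using $R^2_{adjusted} = 1 - (1 - R^2)\frac{n-1}{n-p}$ and $MSE_{adjusted} = \frac{n}{n-p} MSE$ with $n = 1376$ observations and $p = 2$ estimated parameters (intercept and slope). A short check with the values from the output:

```python
n, p = 1376, 2                  # observations and estimated parameters
r2 = 0.010504277990547428       # R-squared printed above
mse = 0.23765192268691798       # MSE printed above

r2_adj = 1 - (1 - r2) * (n - 1) / (n - p)
mse_adj = n / (n - p) * mse

print("adjusted R-squared:", r2_adj)
print("adjusted MSE:", mse_adj)
```

Because $p$ is only 2 in the final-stage regression, the adjustment factors are very close to 1, which is why the adjusted and unadjusted values differ only in the third or fourth decimal place.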



### Now, we'll try higher penalty values

* #### With $\alpha = 0.5$

In [157]:
# Set penalty value
alpha2_2=0.5

# Get exogenous variables 
y_w2 = 'lwage ~ (exp1 + exp2 + exp3 + exp4 + hsg + mw + so + we +  occ2 + ind2)**2'
W2 = smf.ols(y_w2 , data=high_school).exog
W2.shape

# Set endogenous variable
lwage = high_school["lwage"]
lwage.shape

# Lasso regression
reg_y_w2 = linear_model.Lasso(alpha = alpha2_2)
reg_y_w2.fit(W2, lwage)

# Get Predicted values
lwage_lasso_fitted2 = reg_y_w2.fit(W2, lwage).predict( W2 )

#Get residuals (Y~)
resid_y_w2 = high_school["lwage"] -  lwage_lasso_fitted2

# Get exogenous variables
d_w2 = 'sex ~ (exp1 + exp2 + exp3 + exp4 + hsg + mw + so + we +  occ2 + ind2)**2'
W2 = smf.ols(d_w2 , data=high_school).exog

# Set endogenous variable
sex= high_school['sex']

# Lasso regression
reg_d_w2 = linear_model.Lasso(alpha = alpha2_2)
reg_d_w2.fit(W2, sex)

# Get Predicted values
sex_lasso_fitted2 = reg_d_w2.fit(W2, sex).predict( W2 )

#Get residuals (D~)
resid_d_w2 = high_school["sex"] -  sex_lasso_fitted2

#Set a dataframe with resulting residuals 
resid2 = np.zeros((1376, 2))
resid2[:, 0] = resid_y_w2
resid2[:, 1] = resid_d_w2
residuals2 = pd.DataFrame( resid2, columns = ['resid_y_w2', 'resid_d_w2'])

# OLS regression of (Y~) on (D~)
y_d2 = smf.ols( formula = 'resid_y_w2 ~ resid_d_w2', data = residuals2 )
y_d_summ2 = y_d2.fit().summary2()

y_d_coef2_2 = y_d_summ2.tables[1]['Coef.'][1]
HCV_coef2 = y_d2.fit().cov_HC0
y_d2_se = np.power( HCV_coef2.diagonal() , 0.5)[1]

print("Gender Coefficient via partialling-out using lasso (case 2 - alpha = 0.5): ", y_d_coef2_2)

print( "Robust standard error via partialling-out using lasso (case 2 - alpha = 0.5) is:", y_d2_se )

#Obtaining R2 and R2_adj of the partial regression using lasso (case 2)
R2_l2_2 = y_d2.fit().rsquared
print("R-squared for the partialling out using lasso (case 2 - alpha = 0.5): ", R2_l2_2, "\n")

R2_adj_l2_2 = y_d2.fit().rsquared_adj
print("adjusted R-squared for the partialling out using lasso (case 2 - alpha = 0.5)", R2_adj_l2_2, "\n")

#Obtaining MSE and MSE_adj of the partialling-out using lasso (case 2)
MSE_l2_2 =  np.mean(y_d2.fit().resid**2)
print("MSE for the partialling out using lasso (case 2 - alpha = 0.5): ", MSE_l2_2, "\n")

p2 = len(y_d2.fit().params) # number of regressors
n = len(lwage)

MSE_adj_l2_2  = (n/(n-p2))*MSE_l2_2
print("adjusted MSE for the partialling out using lasso (case 2 - alpha = 0.5): ", MSE_adj_l2_2, "\n")

Gender Coefficient via partialling-out using lasso (case 2 - alpha = 0.5):  -0.11406250176206961
Robust standard error via partialling-out using lasso (case 2 - alpha = 0.5) is: 0.028113573015171478
R-squared for the partialling out using lasso (case 2 - alpha = 0.5):  0.011405222102324641 

adjusted R-squared for the partialling out using lasso (case 2 - alpha = 0.5) 0.01068572080836716 

MSE for the partialling out using lasso (case 2 - alpha = 0.5):  0.23899430166713861 

adjusted MSE for the partialling out using lasso (case 2 - alpha = 0.5):  0.23934218274671235 



* #### With $\alpha = 0.6$

In [160]:
# Set penalty value
alpha2_3=0.6

# Get exogenous variables 
y_w2 = 'lwage ~ (exp1 + exp2 + exp3 + exp4 + hsg + mw + so + we +  occ2 + ind2)**2'
W2 = smf.ols(y_w2 , data=high_school).exog
W2.shape

# Set endogenous variable
lwage = high_school["lwage"]
lwage.shape

# Lasso regression
reg_y_w2 = linear_model.Lasso(alpha = alpha2_3)
reg_y_w2.fit(W2, lwage)

# Get Predicted values
lwage_lasso_fitted2 = reg_y_w2.fit(W2, lwage).predict( W2 )

#Get residuals (Y~)
resid_y_w2 = high_school["lwage"] -  lwage_lasso_fitted2

# Get exogenous variables
d_w2 = 'sex ~ (exp1 + exp2 + exp3 + exp4 + hsg + mw + so + we +  occ2 + ind2)**2'
W2 = smf.ols(d_w2 , data=high_school).exog

# Set endogenous variable
sex= high_school['sex']

# Lasso regression
reg_d_w2 = linear_model.Lasso(alpha = alpha2_3)
reg_d_w2.fit(W2, sex)

# Get Predicted values
sex_lasso_fitted2 = reg_d_w2.fit(W2, sex).predict( W2 )

#Get residuals (D~)
resid_d_w2 = high_school["sex"] -  sex_lasso_fitted2

#Set a dataframe with resulting residuals 
resid2 = np.zeros((1376, 2))
resid2[:, 0] = resid_y_w2
resid2[:, 1] = resid_d_w2
residuals2 = pd.DataFrame( resid2, columns = ['resid_y_w2', 'resid_d_w2'])

# OLS regression of (Y~) on (D~)
y_d2 = smf.ols( formula = 'resid_y_w2 ~ resid_d_w2', data = residuals2 )
y_d_summ2 = y_d2.fit().summary2()

y_d_coef2_3 = y_d_summ2.tables[1]['Coef.'][1]
HCV_coef2 = y_d2.fit().cov_HC0
y_d2_se = np.power( HCV_coef2.diagonal() , 0.5)[1]

print("Gender Coefficient via partialling-out using lasso (case 2 - alpha = 0.6): ", y_d_coef2_3)

print( "Robust standard error via partialling-out using lasso (case 2 - alpha = 0.6) is:", y_d2_se )

#Obtaining R2 and R2_adj of the partial regression using lasso (case 2)
R2_l2_3 = y_d2.fit().rsquared
print("R-squared for the partialling out using lasso (case 2 - alpha = 0.6): ", R2_l2_3, "\n")

R2_adj_l2_3 = y_d2.fit().rsquared_adj
print("adjusted R-squared for the partialling out using lasso (case 2 - alpha = 0.6)", R2_adj_l2_3, "\n")

#Obtaining MSE and MSE_adj of the partialling-out using lasso (case 2)
MSE_l2_3 =  np.mean(y_d2.fit().resid**2)
print("MSE for the partialling out using lasso (case 2 - alpha = 0.6): ", MSE_l2_3, "\n")

p2 = len(y_d2.fit().params) # number of regressors
n = len(lwage)

MSE_adj_l2_3  = (n/(n-p2))*MSE_l2_3
print("adjusted MSE for the partialling out using lasso (case 2 - alpha = 0.6): ", MSE_adj_l2_3, "\n")

Gender Coefficient via partialling-out using lasso (case 2 - alpha = 0.6):  -0.11706251268161472
Robust standard error via partialling-out using lasso (case 2 - alpha = 0.6) is: 0.028059713766126995
R-squared for the partialling out using lasso (case 2 - alpha = 0.6):  0.012053411435687056 

adjusted R-squared for the partialling out using lasso (case 2 - alpha = 0.6) 0.01133438189524738 

MSE for the partialling out using lasso (case 2 - alpha = 0.6):  0.2399144416094907 

adjusted MSE for the partialling out using lasso (case 2 - alpha = 0.6):  0.2402636620485147 



* #### With $\alpha = 0.7$

In [159]:
# Set penalty value
alpha2_4=0.7

# Get exogenous variables 
y_w2 = 'lwage ~ (exp1 + exp2 + exp3 + exp4 + hsg + mw + so + we +  occ2 + ind2)**2'
W2 = smf.ols(y_w2 , data=high_school).exog
W2.shape

# Set endogenous variable
lwage = high_school["lwage"]
lwage.shape

# Lasso regression
reg_y_w2 = linear_model.Lasso(alpha = alpha2_4)
reg_y_w2.fit(W2, lwage)

# Get Predicted values
lwage_lasso_fitted2 = reg_y_w2.fit(W2, lwage).predict( W2 )

#Get residuals (Y~)
resid_y_w2 = high_school["lwage"] -  lwage_lasso_fitted2

# Get exogenous variables
d_w2 = 'sex ~ (exp1 + exp2 + exp3 + exp4 + hsg + mw + so + we +  occ2 + ind2)**2'
W2 = smf.ols(d_w2 , data=high_school).exog

# Set endogenous variable
sex= high_school['sex']

# Lasso regression
reg_d_w2 = linear_model.Lasso(alpha = alpha2_4)
reg_d_w2.fit(W2, sex)

# Get Predicted values
sex_lasso_fitted2 = reg_d_w2.fit(W2, sex).predict( W2 )

#Get residuals (D~)
resid_d_w2 = high_school["sex"] -  sex_lasso_fitted2

#Set a dataframe with resulting residuals 
resid2 = np.zeros((1376, 2))
resid2[:, 0] = resid_y_w2
resid2[:, 1] = resid_d_w2
residuals2 = pd.DataFrame( resid2, columns = ['resid_y_w2', 'resid_d_w2'])

# OLS regression of (Y~) on (D~)
y_d2 = smf.ols( formula = 'resid_y_w2 ~ resid_d_w2', data = residuals2 )
y_d_summ2 = y_d2.fit().summary2()

y_d_coef2_4 = y_d_summ2.tables[1]['Coef.'][1]
HCV_coef2 = y_d2.fit().cov_HC0
y_d2_se = np.power( HCV_coef2.diagonal() , 0.5)[1]

print("Gender Coefficient via partialling-out using lasso (case 2 - alpha = 0.7): ", y_d_coef2_4)

print( "Robust standard error via partialling-out using lasso (case 2 - alpha = 0.7) is:", y_d2_se )

#Obtaining R2 and R2_adj of the partial regression using lasso (case 2)
R2_l2_4 = y_d2.fit().rsquared
print("R-squared for the partialling out using lasso (case 2 - alpha = 0.7): ", R2_l2_4, "\n")

R2_adj_l2_4 = y_d2.fit().rsquared_adj
print("adjusted R-squared for the partialling out using lasso (case 2 - alpha = 0.7)", R2_adj_l2_4, "\n")

#Obtaining MSE and MSE_adj of the partialling-out using lasso (case 2)
MSE_l2_4 =  np.mean(y_d2.fit().resid**2)
print("MSE for the partialling out using lasso (case 2 - alpha = 0.7): ", MSE_l2_4, "\n")

p2 = len(y_d2.fit().params) # number of regressors
n = len(lwage)

MSE_adj_l2_4  = (n/(n-p2))*MSE_l2_4
print("adjusted MSE for the partialling out using lasso (case 2 - alpha = 0.7): ", MSE_adj_l2_4, "\n")

Gender Coefficient via partialling-out using lasso (case 2 - alpha = 0.7):  -0.11936143755038078
Robust standard error via partialling-out using lasso (case 2 - alpha = 0.7) is: 0.028054966910758376
R-squared for the partialling out using lasso (case 2 - alpha = 0.7):  0.012542396759087526 

adjusted R-squared for the partialling out using lasso (case 2 - alpha = 0.7) 0.011823723103162598 

MSE for the partialling out using lasso (case 2 - alpha = 0.7):  0.2410960312225982 

adjusted MSE for the partialling out using lasso (case 2 - alpha = 0.7):  0.24144697158827885 



#### The following table shows the results with different $\alpha$

In [210]:
table_l2 = np.zeros((4, 5))
table_l2[0,0:5] = [y_d_coef2, R2_l2, MSE_l2, R2_adj_l2, MSE_adj_l2 ]
table_l2[1,0:5] = [y_d_coef2_2, R2_l2_2, MSE_l2_2, R2_adj_l2_2, MSE_adj_l2_2 ]
table_l2[2,0:5] = [y_d_coef2_3, R2_l2_3, MSE_l2_3, R2_adj_l2_3, MSE_adj_l2_3 ]
table_l2[3,0:5] = [y_d_coef2_4, R2_l2_4, MSE_l2_4, R2_adj_l2_4, MSE_adj_l2_4 ]
table_l2 = pd.DataFrame(table_l2, columns = ["Gender Coefficient","$R^2_{sample}$","$MSE_{sample}$","$R^2_{adjusted}$", "$MSE_{adjusted}$"], \
                      index = ["partialling out-lasso alpha = 0.4", "partialling out-lasso alpha = 0.5", "partialling out-lasso alpha = 0.6", "partialling out-lasso alpha = 0.7"])
table_l2

| | Gender Coefficient | $R^2_{sample}$ | $MSE_{sample}$ | $R^2_{adjusted}$ | $MSE_{adjusted}$ |
|---|---|---|---|---|---|
| partialling out-lasso alpha = 0.4 | -0.110702 | 0.010504 | 0.237652 | 0.009784 | 0.237998 |
| partialling out-lasso alpha = 0.5 | -0.114063 | 0.011405 | 0.238994 | 0.010686 | 0.239342 |
| partialling out-lasso alpha = 0.6 | -0.117063 | 0.012053 | 0.239914 | 0.011334 | 0.240264 |
| partialling out-lasso alpha = 0.7 | -0.119361 | 0.012542 | 0.241096 | 0.011824 | 0.241447 |
