## GRUPO 8 

#### *Integrantes:* 

1. Gianfranco Soria (20163509)
2. Erick Morales (20163041)
3. Andrea Clavo (20176040)
4. Sandra Martínez (20173026)

_____
## Question 1
_____
Briefly explain the idea of sample splitting to evaluate the performance
of prediction rules to a fellow student and show how to use it on the wage data.

Sample splitting divides the data at random into two representative subsamples: a training subsample (TR) and a test subsample (TS). The model is estimated on TR only and then evaluated on TS, so the evaluation uses observations that played no role in fitting.

A common choice is 80\% of the observations for TR and the remaining 20\% for TS, although the proportions vary with the application. The split guards against two problems. If TR has too few observations, the parameters are estimated poorly and the model has little predictive value (underfitting). If the model has learned the randomness of one particular data set (overfitting), it will not carry over to another data set, and only an evaluation on held-out data reveals this.

An example with the wage data: suppose two models can be used to predict wages from worker characteristics, a basic model and a flexible model. How do we know which model predicts better? We estimate the parameters of both models on TR, use those estimates to predict the wage of each observation in TS, and finally compute the out-of-sample mean squared error $MSE_{test}$ on TS for each model. The model with the lower $MSE_{test}$ predicts better on new data.

Note: In Python, the split can be made by drawing one uniform random number per observation, so that every observation has the same probability of landing in the training subsample.
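
The splitting procedure described above can be sketched as follows. This is a minimal illustration on synthetic data, not the wage sample: the variable names, the seed, the coefficient values and the 0.8 cutoff are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the wage data (hypothetical, not the CPS sample)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta = np.array([3.0, 0.5, -0.2, 0.1])
y = X @ beta + rng.normal(scale=0.5, size=n)

# One uniform draw per observation; draws below 0.8 go to the training subsample
u = rng.uniform(size=n)
train = u < 0.8
test = ~train

# Estimate the parameters on TR only
beta_hat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# Predict on TS and compute the out-of-sample MSE and R^2
resid_test = y[test] - X[test] @ beta_hat
mse_test = np.mean(resid_test ** 2)
r2_test = 1 - mse_test / np.var(y[test])

print("MSE_test:", mse_test)
print("R2_test:", r2_test)
```

To compare two specifications, the same `train` and `test` masks would be reused for both, so that the difference in $MSE_{test}$ reflects the models and not the split.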

_____
## Question 2
_____
Replicate the PM1_Notebook1_Prediction_newdata Jupyter notebook (R and Python), following these instructions:
- Focus on people who did not go to college (use the variables shs, hsg)
- Basic model: 'lwage ~ sex + exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2'
- Flexible model: 'lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'

## Introduction

In labor economics an important question is what determines the wage of workers. This is a causal question,
but we could begin to investigate from a predictive perspective.

In the following wage example, $Y$ is the hourly wage of a worker and $X$ is a vector of worker's characteristics, e.g., education, experience, gender. Two main questions here are:


* How to use job-relevant characteristics, such as education and experience, to best predict wages?

* What is the difference in predicted wages between men and women with the same job-relevant characteristics?

In this lab, we focus on the prediction question first.

## Data

The data set we consider is from the March Supplement of the U.S. Current Population Survey, year 2015.  We select white non-hispanic individuals, aged 25 to 64 years, and working more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural or private household sectors;  individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with hourly wage below $3$. 

The variable of interest $Y$ is the hourly wage rate constructed as the ratio of the annual earnings to the total number of hours worked, which is constructed in turn as the product of number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never married) workers. The final sample is of size $n=5150$.

## Data analysis

We start by loading the data set.

In [1]:
import pandas as pd
import numpy as np
import pyreadr

In [2]:
rdata_read = pyreadr.read_r("../data/wage2015_subsample_inference.Rdata")
data = rdata_read[ 'data' ]
data.shape

(5150, 20)

In [3]:
data.columns

Index(['wage', 'lwage', 'sex', 'shs', 'hsg', 'scl', 'clg', 'ad', 'mw', 'so',
       'we', 'ne', 'exp1', 'exp2', 'exp3', 'exp4', 'occ', 'occ2', 'ind',
       'ind2'],
      dtype='object')

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5150 entries, 10 to 32643
Data columns (total 20 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   wage    5150 non-null   float64 
 1   lwage   5150 non-null   float64 
 2   sex     5150 non-null   float64 
 3   shs     5150 non-null   float64 
 4   hsg     5150 non-null   float64 
 5   scl     5150 non-null   float64 
 6   clg     5150 non-null   float64 
 7   ad      5150 non-null   float64 
 8   mw      5150 non-null   float64 
 9   so      5150 non-null   float64 
 10  we      5150 non-null   float64 
 11  ne      5150 non-null   float64 
 12  exp1    5150 non-null   float64 
 13  exp2    5150 non-null   float64 
 14  exp3    5150 non-null   float64 
 15  exp4    5150 non-null   float64 
 16  occ     5150 non-null   category
 17  occ2    5150 non-null   category
 18  ind     5150 non-null   category
 19  ind2    5150 non-null   category
dtypes: category(4), float64(16)
memory usage: 736.3+ KB


In [6]:
data.describe()

statistic,wage,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4
count,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0
mean,23.41041,2.970787,0.444466,0.023301,0.243883,0.278058,0.31767,0.137087,0.259612,0.296505,0.216117,0.227767,13.760583,3.018925,8.235867,25.118038
std,21.003016,0.570385,0.496955,0.150872,0.429465,0.448086,0.465616,0.343973,0.438464,0.456761,0.411635,0.419432,10.609465,4.000904,14.488962,53.530225
min,3.021978,1.105912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,13.461538,2.599837,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.25,0.125,0.0625
50%,19.230769,2.956512,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,1.0,1.0,1.0
75%,27.777778,3.324236,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,21.0,4.41,9.261,19.4481
max,528.845673,6.270697,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,47.0,22.09,103.823,487.9681


We focus on people who did not go to college (using the variables shs and hsg).

In [7]:
data = data[(data['shs'] == 1) | (data['hsg'] == 1)]
print(data.shape) 
data

(1376, 20)


rownames,wage,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind,ind2
15,11.057692,2.403126,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,18.0,3.24,5.832,10.4976,6260,19,770,4
43,19.230769,2.956512,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,42.0,17.64,74.088,311.1696,5120,17,7280,14
44,19.230769,2.956512,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,37.0,13.69,50.653,187.4161,5240,17,5680,9
47,12.000000,2.484907,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,9.61,29.791,92.3521,4040,13,8590,19
73,17.307692,2.851151,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,7.0,0.49,0.343,0.2401,4020,13,8270,18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32580,12.980769,2.563469,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,15.0,2.25,3.375,5.0625,2010,6,9370,22
32590,13.461538,2.599837,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,8.0,0.64,0.512,0.4096,4720,16,8590,19
32599,22.596154,3.117780,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,15.0,2.25,3.375,5.0625,9620,22,5390,9
32603,16.826923,2.822980,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,11.0,1.21,1.331,1.4641,7150,20,8770,21


In [8]:
(data['shs'].value_counts()), (data['hsg'].value_counts())

(0.0    1256
 1.0     120
 Name: shs, dtype: int64,
 1.0    1256
 0.0     120
 Name: hsg, dtype: int64)

We construct the output variable  **Y**  and the matrix  **Z**, which includes the worker characteristics given in the data.

In [9]:
Y = np.log(data['wage'])  # natural log, consistent with the lwage column
n = len(Y)
z = data.loc[:, ~ data.columns.isin(['wage', 'lwage','Unnamed: 0'])]
p = z.shape[1]

print("Number of observations:", n, '\n')
print( "Number of raw regressors:", p)

Number of observations: 1376 

Number of raw regressors: 18


## Prediction Question

Now, we will construct a prediction rule for hourly wage $Y$, which depends linearly on job-relevant characteristics $X$:

\begin{equation}\label{decompose}
Y = \beta'X+ \epsilon.
\end{equation}

Our goals are

* Predict wages  using various characteristics of workers.

* Assess the predictive performance using the (adjusted) sample MSE, the (adjusted) sample $R^2$ and the out-of-sample MSE and $R^2$.


We employ two different specifications for prediction:


1. Basic Model:   $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators,  occupation and industry indicators, regional indicators).


2. Flexible Model:  $X$ consists of all raw regressors from the basic model plus transformations of experience (e.g., ${exp}^2$ and ${exp}^3$) and additional two-way interactions of the polynomial in experience with other regressors. An example of a regressor created through a two-way interaction is *experience* times the indicator of having a *college degree*.

Using the **Flexible Model** enables us to approximate the real relationship with a more complex regression model and therefore to reduce the bias. The **Flexible Model** increases the range of potential shapes of the estimated regression function. In general, flexible models often deliver good prediction accuracy but are harder to interpret.
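
To make the two-way interactions concrete, the sketch below lists the terms produced when a sum of variables is squared in a model formula such as `(a + b + c)**2`: every main effect plus every pairwise interaction. The helper `expand_two_way` is a hypothetical illustration written for this note, not part of statsmodels or patsy.

```python
from itertools import combinations


def expand_two_way(terms):
    """Return the main effects followed by all pairwise interaction terms."""
    main = list(terms)
    pairs = [f"{a}:{b}" for a, b in combinations(main, 2)]
    return main + pairs


# Small example with three of the variables used in this notebook
small = expand_two_way(["exp1", "shs", "occ2"])
print(small)

# The flexible formula squares a sum of 13 variables
flex_vars = ["exp1", "exp2", "exp3", "exp4", "shs", "hsg", "scl", "clg",
             "occ2", "ind2", "mw", "so", "we"]
n_terms = len(expand_two_way(flex_vars))
print("number of terms:", n_terms)
```

The number of terms is smaller than the number of columns in the design matrix, because a categorical term such as `occ2` or `ind2` expands into one indicator column per level, and each of its interactions expands in the same way.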

#### Basic regression

Now, let us fit both models to our data by running ordinary least squares (ols):

In [10]:
# Import packages for OLS regression

import statsmodels.api as sm
import statsmodels.formula.api as smf

In [11]:
basic = 'lwage ~ sex + exp1 + shs + hsg + scl + clg + mw + so + we + occ2+ ind2'
basic_results = smf.ols(basic , data=data).fit()
print(basic_results.summary()) # estimated coefficients
print( "Number of regressors in the basic model:",len(basic_results.params), '\n')  # number of regressors in the Basic Model

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.180
Model:                            OLS   Adj. R-squared:                  0.151
Method:                 Least Squares   F-statistic:                     6.212
Date:                Fri, 17 Sep 2021   Prob (F-statistic):           9.07e-33
Time:                        15:47:33   Log-Likelihood:                -872.87
No. Observations:                1376   AIC:                             1842.
Df Residuals:                    1328   BIC:                             2093.
Df Model:                          47                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0216      0.062     32.368      0.0

The basic model consists of $51$ regressors.
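
The summary above reports `Df Model: 47`, while `len(basic_results.params)` counts $51$ coefficients. One plausible reading, inferred from the outputs above and not stated in the source, is that some columns are redundant in this subsample: `shs` and `hsg` add up to the intercept, and `scl` and `clg` are zero for everyone who did not go to college. The toy design below (hypothetical values, only a few columns) shows how such columns raise the column count without raising the rank.

```python
import numpy as np

# Toy design matrix: intercept, two indicators that partition the sample,
# one indicator that is zero for every row, and one continuous regressor
shs = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
hsg = 1.0 - shs                      # hsg = 1 - shs, so shs + hsg equals the intercept
scl = np.zeros(6)                    # nobody in the subsample has some college
exp1 = np.array([3.0, 10.0, 7.0, 21.0, 15.0, 1.0])

X_toy = np.column_stack([np.ones(6), shs, hsg, scl, exp1])

n_columns = X_toy.shape[1]
rank = np.linalg.matrix_rank(X_toy)
print("columns:", n_columns, "rank:", rank)
```

Under this reading, three of the $51$ columns are redundant, which leaves rank $48$, that is, $47$ slopes plus the intercept.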

#### Flexible regression

In [12]:
flex = 'lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'
flex_results_0 = smf.ols(flex , data=data)
flex_results = smf.ols(flex , data=data).fit()
print(flex_results.summary()) # estimated coefficients
print( "Number of regressors in the flexible model:", len(flex_results.params), '\n') # number of regressors in the Flexible Model

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.507
Model:                            OLS   Adj. R-squared:                  0.232
Method:                 Least Squares   F-statistic:                     1.840
Date:                Fri, 17 Sep 2021   Prob (F-statistic):           2.24e-15
Time:                        15:47:34   Log-Likelihood:                -522.96
No. Observations:                1376   AIC:                             2034.
Df Residuals:                     882   BIC:                             4616.
Df Model:                         493                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                 3.63

The flexible model consists of $979$ regressors.

#### Lasso 

In [13]:
# Import relevant packages for lasso 

from sklearn.linear_model import LassoCV
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

In [14]:
# Get exogenous variables from flexible model

X = flex_results_0.exog
(X.shape)

(1376, 979)

In [15]:
# Set endogenous variable
lwage = data["lwage"]
(lwage.shape)

(1376,)

###### Which value of alpha should we work with?

Below, we loop over a grid of alpha values to find out which ones run into convergence problems and which do not. After that, we select three values and compare the resulting (adjusted) $R^2_{sample}$ and (adjusted) $MSE_{sample}$ with those of the other two models.

In [16]:
# We want to iterate different values of alphas from 0 to 1 

def seq(start, stop, step=1):
    n = int(round((stop - start)/float(step)))
    if n > 1:
        return([start + step*i for i in range(n+1)])
    elif n == 1:
        return([start])
    else:
        return([])

In [17]:
tope = seq(0,1,0.01)
len(tope)

101
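
The loop output further down prints grid values such as `0.35000000000000003`, because `seq` multiplies a floating-point step by an integer. A possible alternative, shown here only as a sketch and not the approach used in this notebook, builds the same grid from integer ratios so that each value equals its two-decimal literal.

```python
# Multiplying a float step carries its representation error into the grid
drift = 0.01 * 35          # what start + step*i computes for start = 0, i = 35

# Dividing two integers is correctly rounded, so the grid values stay clean
alphas = [i / 100 for i in range(101)]

print(drift == 0.35)       # False
print(alphas[35] == 0.35)  # True
print(len(alphas))         # 101
```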

In [18]:
# Creating the loop:

for i in tope:

    reg = linear_model.Lasso(alpha = i)
    reg.fit(X, lwage)
    lwage_lasso_fitted = reg.fit(X, lwage).predict( X )
    b = reg.score(X, lwage)
    
    alpha = i

    R2_L = reg.score(flex_results_0.exog, lwage)
    R2_adjL =1 - (1-R2_L)*(len(lwage)-1)/(len(lwage)-X.shape[1]-1)

    print('For alpha =', alpha, ', Lasso Regression: R^2 score', b) 
    print('For alpha =', alpha, ", R-squared for LASSO: ", R2_L, "\n")
    print('For alpha =', alpha, ', adjusted R-squared for LASSO: ', R2_adjL, "\n")

  reg.fit(X, lwage)
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  lwage_lasso_fitted = reg.fit(X, lwage).predict( X )
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.0 , Lasso Regression: R^2 score 0.46789607629666685
For alpha = 0.0 , R-squared for LASSO:  0.46789607629666685 

For alpha = 0.0 , adjusted R-squared for LASSO:  -0.8475830684143513 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.01 , Lasso Regression: R^2 score 0.2076568000931922
For alpha = 0.01 , R-squared for LASSO:  0.2076568000931922 

For alpha = 0.01 , adjusted R-squared for LASSO:  -1.7511916663430824 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.02 , Lasso Regression: R^2 score 0.19604287862005554
For alpha = 0.02 , R-squared for LASSO:  0.19604287862005554 

For alpha = 0.02 , adjusted R-squared for LASSO:  -1.7915177825692519 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.03 , Lasso Regression: R^2 score 0.18667179764733421
For alpha = 0.03 , R-squared for LASSO:  0.18667179764733421 

For alpha = 0.03 , adjusted R-squared for LASSO:  -1.8240562581689783 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.04 , Lasso Regression: R^2 score 0.1801125505028911
For alpha = 0.04 , R-squared for LASSO:  0.1801125505028911 

For alpha = 0.04 , adjusted R-squared for LASSO:  -1.8468314218649615 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.05 , Lasso Regression: R^2 score 0.1724485594774804
For alpha = 0.05 , R-squared for LASSO:  0.1724485594774804 

For alpha = 0.05 , adjusted R-squared for LASSO:  -1.8734425018143046 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.06 , Lasso Regression: R^2 score 0.16292663385136996
For alpha = 0.06 , R-squared for LASSO:  0.16292663385136996 

For alpha = 0.06 , adjusted R-squared for LASSO:  -1.9065047435716318 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.07 , Lasso Regression: R^2 score 0.15562760432901723
For alpha = 0.07 , R-squared for LASSO:  0.15562760432901723 

For alpha = 0.07 , adjusted R-squared for LASSO:  -1.931848596079801 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.08 , Lasso Regression: R^2 score 0.14915181284409618
For alpha = 0.08 , R-squared for LASSO:  0.14915181284409618 

For alpha = 0.08 , adjusted R-squared for LASSO:  -1.9543339831802218 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.09 , Lasso Regression: R^2 score 0.1419793499429448
For alpha = 0.09 , R-squared for LASSO:  0.1419793499429448 

For alpha = 0.09 , adjusted R-squared for LASSO:  -1.9792383682536636 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.1 , Lasso Regression: R^2 score 0.13504437453298423
For alpha = 0.1 , R-squared for LASSO:  0.13504437453298423 

For alpha = 0.1 , adjusted R-squared for LASSO:  -2.003318143982694 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.11 , Lasso Regression: R^2 score 0.1285797483844473
For alpha = 0.11 , R-squared for LASSO:  0.1285797483844473 

For alpha = 0.11 , adjusted R-squared for LASSO:  -2.0257647625540023 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.12 , Lasso Regression: R^2 score 0.1229670782156288
For alpha = 0.12 , R-squared for LASSO:  0.1229670782156288 

For alpha = 0.12 , adjusted R-squared for LASSO:  -2.0452532006401776 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.13 , Lasso Regression: R^2 score 0.11747040634697903
For alpha = 0.13 , R-squared for LASSO:  0.11747040634697903 

For alpha = 0.13 , adjusted R-squared for LASSO:  -2.0643388668507674 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.14 , Lasso Regression: R^2 score 0.11152161431529017
For alpha = 0.14 , R-squared for LASSO:  0.11152161431529017 

For alpha = 0.14 , adjusted R-squared for LASSO:  -2.0849943947385756 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.15 , Lasso Regression: R^2 score 0.10691827531230746
For alpha = 0.15 , R-squared for LASSO:  0.10691827531230746 

For alpha = 0.15 , adjusted R-squared for LASSO:  -2.1009782107211548 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.16 , Lasso Regression: R^2 score 0.10278872766547886
For alpha = 0.16 , R-squared for LASSO:  0.10278872766547886 

For alpha = 0.16 , adjusted R-squared for LASSO:  -2.1153169178281988 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.17 , Lasso Regression: R^2 score 0.09934915157031177
For alpha = 0.17 , R-squared for LASSO:  0.09934915157031177 

For alpha = 0.17 , adjusted R-squared for LASSO:  -2.127259890380862 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.18 , Lasso Regression: R^2 score 0.09584561897898758
For alpha = 0.18 , R-squared for LASSO:  0.09584561897898758 

For alpha = 0.18 , adjusted R-squared for LASSO:  -2.1394249341007376 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.19 , Lasso Regression: R^2 score 0.09288966725761594
For alpha = 0.19 , R-squared for LASSO:  0.09288966725761594 

For alpha = 0.19 , adjusted R-squared for LASSO:  -2.1496886553555 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.2 , Lasso Regression: R^2 score 0.08983057313693599
For alpha = 0.2 , R-squared for LASSO:  0.08983057313693599 

For alpha = 0.2 , adjusted R-squared for LASSO:  -2.1603105099411946 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.21 , Lasso Regression: R^2 score 0.08673565365265945
For alpha = 0.21 , R-squared for LASSO:  0.08673565365265945 

For alpha = 0.21 , adjusted R-squared for LASSO:  -2.1710567581504883 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.22 , Lasso Regression: R^2 score 0.08381795363112077
For alpha = 0.22 , R-squared for LASSO:  0.08381795363112077 

For alpha = 0.22 , adjusted R-squared for LASSO:  -2.1811876610030527 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.23 , Lasso Regression: R^2 score 0.0807673408162426
For alpha = 0.23 , R-squared for LASSO:  0.0807673408162426 

For alpha = 0.23 , adjusted R-squared for LASSO:  -2.191780066610269 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.24 , Lasso Regression: R^2 score 0.07784837072432771
For alpha = 0.24 , R-squared for LASSO:  0.07784837072432771 

For alpha = 0.24 , adjusted R-squared for LASSO:  -2.2019153794294177 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.25 , Lasso Regression: R^2 score 0.0750225097230629
For alpha = 0.25 , R-squared for LASSO:  0.0750225097230629 

For alpha = 0.25 , adjusted R-squared for LASSO:  -2.2117273967949207 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.26 , Lasso Regression: R^2 score 0.07231415151795217
For alpha = 0.26 , R-squared for LASSO:  0.07231415151795217 

For alpha = 0.26 , adjusted R-squared for LASSO:  -2.2211314183404443 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.27 , Lasso Regression: R^2 score 0.06950023928441174
For alpha = 0.27 , R-squared for LASSO:  0.06950023928441174 

For alpha = 0.27 , adjusted R-squared for LASSO:  -2.2309019469291256 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.28 , Lasso Regression: R^2 score 0.067247168362226
For alpha = 0.28 , R-squared for LASSO:  0.067247168362226 

For alpha = 0.28 , adjusted R-squared for LASSO:  -2.238725109853382 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.29 , Lasso Regression: R^2 score 0.06587603840888623
For alpha = 0.29 , R-squared for LASSO:  0.06587603840888623 

For alpha = 0.29 , adjusted R-squared for LASSO:  -2.243485977746923 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.3 , Lasso Regression: R^2 score 0.06460044615053606
For alpha = 0.3 , R-squared for LASSO:  0.06460044615053606 

For alpha = 0.3 , adjusted R-squared for LASSO:  -2.2479151175328607 

For alpha = 0.31 , Lasso Regression: R^2 score 0.06336113140808963
For alpha = 0.31 , R-squared for LASSO:  0.06336113140808963 

For alpha = 0.31 , adjusted R-squared for LASSO:  -2.2522182937219113 

For alpha = 0.32 , Lasso Regression: R^2 score 0.06233332053566987
For alpha = 0.32 , R-squared for LASSO:  0.06233332053566987 

For alpha = 0.32 , adjusted R-squared for LASSO:  -2.2557870814733687 

For alpha = 0.33 , Lasso Regression: R^2 score 0.06133018321529671
For alpha = 0.33 , R-squared for LASSO:  0.06133018321529671 

For alpha = 0.33 , adjusted R-squared for LASSO:  -2.2592701971691085 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.34 , Lasso Regression: R^2 score 0.06033122044161132
For alpha = 0.34 , R-squared for LASSO:  0.06033122044161132 

For alpha = 0.34 , adjusted R-squared for LASSO:  -2.262738817911072 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.35000000000000003 , Lasso Regression: R^2 score 0.059358850434744426
For alpha = 0.35000000000000003 , R-squared for LASSO:  0.059358850434744426 

For alpha = 0.35000000000000003 , adjusted R-squared for LASSO:  -2.266115102657137 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.36 , Lasso Regression: R^2 score 0.05837050961419388
For alpha = 0.36 , R-squared for LASSO:  0.05837050961419388 

For alpha = 0.36 , adjusted R-squared for LASSO:  -2.2695468416173825 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.37 , Lasso Regression: R^2 score 0.057355601233210485
For alpha = 0.37 , R-squared for LASSO:  0.057355601233210485 

For alpha = 0.37 , adjusted R-squared for LASSO:  -2.2730708290513526 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.38 , Lasso Regression: R^2 score 0.05632615744776637
For alpha = 0.38 , R-squared for LASSO:  0.05632615744776637 

For alpha = 0.38 , adjusted R-squared for LASSO:  -2.2766452866397 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.39 , Lasso Regression: R^2 score 0.05536083900218425
For alpha = 0.39 , R-squared for LASSO:  0.05536083900218425 

For alpha = 0.39 , adjusted R-squared for LASSO:  -2.2799970867979713 

For alpha = 0.4 , Lasso Regression: R^2 score 0.054428042189907044
For alpha = 0.4 , R-squared for LASSO:  0.054428042189907044 

For alpha = 0.4 , adjusted R-squared for LASSO:  -2.2832359646183784 

For alpha = 0.41000000000000003 , Lasso Regression: R^2 score 0.05347723850456021
For alpha = 0.41000000000000003 , R-squared for LASSO:  0.05347723850456021 

For alpha = 0.41000000000000003 , adjusted R-squared for LASSO:  -2.2865373663036106 

For alpha = 0.42 , Lasso Regression: R^2 score 0.05260256406810848
For alpha = 0.42 , R-squared for LASSO:  0.05260256406810848 

For alpha = 0.42 , adjusted R-squared for LASSO:  -2.289574430319068 

For alpha = 0.43 , Lasso Regression: R^2 score 0.051849203950802836
For alpha = 0.43 , R-squared for LASSO:  0.051849203950802836 

For alpha = 0.43 

  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.65 , Lasso Regression: R^2 score 0.041380389171228216
For alpha = 0.65 , R-squared for LASSO:  0.041380389171228216 

For alpha = 0.65 , adjusted R-squared for LASSO:  -2.32854031537768 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.66 , Lasso Regression: R^2 score 0.04085997132002139
For alpha = 0.66 , R-squared for LASSO:  0.04085997132002139 

For alpha = 0.66 , adjusted R-squared for LASSO:  -2.330347321805481 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.67 , Lasso Regression: R^2 score 0.040330372185230834
For alpha = 0.67 , R-squared for LASSO:  0.040330372185230834 

For alpha = 0.67 , adjusted R-squared for LASSO:  -2.3321862076901705 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.68 , Lasso Regression: R^2 score 0.03979367225845987
For alpha = 0.68 , R-squared for LASSO:  0.03979367225845987 

For alpha = 0.68 , adjusted R-squared for LASSO:  -2.33404974910257 



  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


For alpha = 0.6900000000000001 , Lasso Regression: R^2 score 0.039247965858201894
For alpha = 0.6900000000000001 , R-squared for LASSO:  0.039247965858201894 

For alpha = 0.6900000000000001 , adjusted R-squared for LASSO:  -2.335944562992355 

For alpha = 0.7000000000000001 , Lasso Regression: R^2 score 0.03874463654542415
For alpha = 0.7000000000000001 , R-squared for LASSO:  0.03874463654542415 

For alpha = 0.7000000000000001 , adjusted R-squared for LASSO:  -2.3376922342172777 

For alpha = 0.71 , Lasso Regression: R^2 score 0.038335433212461356
For alpha = 0.71 , R-squared for LASSO:  0.038335433212461356 

For alpha = 0.71 , adjusted R-squared for LASSO:  -2.3391130791233983 

For alpha = 0.72 , Lasso Regression: R^2 score 0.037920449270698486
For alpha = 0.72 , R-squared for LASSO:  0.037920449270698486 

For alpha = 0.72 , adjusted R-squared for LASSO:  -2.3405539955878525 

For alpha = 0.73 , Lasso Regression: R^2 score 0.037499705259567806
For alpha = 0.73 , R-squared for LA

At a glance, for alpha values in $[0,1]$ in steps of $0.01$, the Lasso model fits these data poorly: the in-sample $R^2$ falls as alpha grows, and the (adjusted) $R^2_{sample}$ is below zero for every alpha shown, which suggests that the model or its penalty level is poorly chosen. Note that the adjustment here counts all $979$ columns of the design matrix as regressors.
However, for comparison purposes, now that we know which values of alpha do not have convergence problems, we select three values from that group.
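
`LassoCV` is imported earlier in the notebook but never used. As a hedged sketch of how the penalty could be chosen by cross-validation instead of by inspecting a grid, the example below runs it on synthetic data. The data, seed and dimensions are assumptions for illustration only, and nothing here reproduces the wage results.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# Synthetic sparse problem (hypothetical data, not the wage sample)
n_obs, n_feat = 200, 50
X_sim = rng.normal(size=(n_obs, n_feat))
beta = np.zeros(n_feat)
beta[:5] = [1.0, -1.0, 0.5, -0.5, 0.25]   # only five features matter
y_sim = X_sim @ beta + rng.normal(scale=0.5, size=n_obs)

# 5-fold cross-validation over an automatically generated path of alphas
cv_model = LassoCV(cv=5, random_state=0).fit(X_sim, y_sim)

n_selected = int(np.sum(cv_model.coef_ != 0))
r2_in_sample = cv_model.score(X_sim, y_sim)

print("chosen alpha:", cv_model.alpha_)
print("nonzero coefficients:", n_selected, "of", n_feat)
print("in-sample R^2:", r2_in_sample)
```

Counting the nonzero coefficients, as `n_selected` does, is also one possible way to measure the effective number of regressors of a Lasso fit, instead of counting every column of the design matrix.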

###### alpha = 0.31

In [19]:
alpha = 0.31

In [20]:
# Set penalty value = 0.31

reg = linear_model.Lasso(alpha = alpha)

# LASSO regression for flexible model (fit once, then predict in sample)
reg.fit(X, lwage)
lwage_lasso_fitted = reg.predict( X )

# In-sample fit
print('Lasso Regression: R^2 score', reg.score(X, lwage))

Lasso Regression: R^2 score 0.06336113140808963


In [21]:
# Check predicted values
lwage_lasso_fitted

array([2.71670039, 3.00692513, 2.93912987, ..., 2.68775798, 2.66407835,
       2.65762638])

Now, we can evaluate the performance of both models based on the (adjusted) $R^2_{sample}$ and the (adjusted) $MSE_{sample}$:

##### Basic Model

In [22]:
basic = 'lwage ~ sex + exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2'
basic_results = smf.ols(basic , data=data).fit()

##### Flexible model 

In [23]:
flex = 'lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'
flex_results = smf.ols(flex , data=data).fit()

In [24]:
# Assess the predictive performance

R2_1 = basic_results.rsquared
print("R-squared for the basic model: ", R2_1, "\n")
R2_adj1 = basic_results.rsquared_adj
print("adjusted R-squared for the basic model: ", R2_adj1, "\n")


R2_2 = flex_results.rsquared
print("R-squared for the flexible model: ", R2_2, "\n")
R2_adj2 = flex_results.rsquared_adj
print("adjusted R-squared for the flexible model: ", R2_adj2, "\n")


R2_L = reg.score(flex_results_0.exog, lwage)
print("R-squared for LASSO: ", R2_L, "\n")
R2_adjL = 1 - (1-R2_L)*(len(lwage)-1)/(len(lwage)-X.shape[1]-1)
print("adjusted R-squared for LASSO: ", R2_adjL, "\n")

R-squared for the basic model:  0.18023814876721034 

adjusted R-squared for the basic model:  0.15122549288773657 

R-squared for the flexible model:  0.5070440013634975 

adjusted R-squared for the flexible model:  0.23150283659275406 

R-squared for LASSO:  0.06336113140808963 

adjusted R-squared for LASSO:  -2.2522182937219113 



In [25]:
# Calculating the MSE

MSE1 =  np.mean(basic_results.resid**2)
print("MSE for the basic model: ", MSE1, "\n")
p1 = len(basic_results.params) # number of regressors
n = len(lwage)
MSE_adj1  = (n/(n-p1))*MSE1
print("adjusted MSE for the basic model: ", MSE_adj1, "\n")


MSE2 =  np.mean(flex_results.resid**2)
print("MSE for the flexible model: ", MSE2, "\n")
p2 = len(flex_results.params) # number of regressors
n = len(lwage)
MSE_adj2  = (n/(n-p2))*MSE2
print("adjusted MSE for the flexible model: ", MSE_adj2, "\n")


MSEL = mean_squared_error(lwage, lwage_lasso_fitted)
print("MSE for the LASSO model: ", MSEL, "\n")
pL = reg.coef_.shape[0] # number of regressors
n = len(lwage)
MSE_adjL  = (n/(n-pL))*MSEL
print("adjusted MSE for LASSO model: ", MSE_adjL, "\n")

MSE for the basic model:  0.20821908377460777 

adjusted MSE for the basic model:  0.21623355416895118 

MSE for the flexible model:  0.12521056721891977 

adjusted MSE for the flexible model:  0.43397919519706196 

MSE for the LASSO model:  0.23790578538462734 

adjusted MSE for LASSO model:  0.8245802536253078 
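
The adjusted values printed above can be checked by hand. The following is a minimal sketch that re-applies the two adjustment formulas from the cells above to the printed numbers. It assumes `n = 1376`, which is inferred from the split sizes reported later in this notebook (275 + 1101); the helper names `adjusted_mse` and `adjusted_r2` are illustrative and not part of the notebook.

```python
# Worked check of the adjusted metrics, using the values printed above.
# Assumption: n = 1376, inferred from the train/test sizes (275 + 1101).
n = 1376


def adjusted_mse(mse, n, p):
    """Degrees-of-freedom adjustment used in the notebook: n / (n - p) * MSE."""
    return n / (n - p) * mse


def adjusted_r2(r2, n, k):
    """Adjustment used for the lasso row: 1 - (1 - R^2)(n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


mse_adj_basic = adjusted_mse(0.20821908377460777, n, 51)
mse_adj_flex = adjusted_mse(0.12521056721891977, n, 979)
mse_adj_lasso = adjusted_mse(0.23790578538462734, n, 979)
r2_adj_lasso = adjusted_r2(0.06336113140808963, n, 979)

print(mse_adj_basic, mse_adj_flex, mse_adj_lasso, r2_adj_lasso)
```

With 979 regressors and 1376 observations the factor `n / (n - p)` is about 3.47, which is why the adjusted MSE of the flexible and lasso fits is so much larger than the raw MSE.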



In [26]:
# Package for latex table 

import array_to_latex as a2l

table = np.zeros((3, 5))
table[0,0:5] = [p1, R2_1, MSE1, R2_adj1, MSE_adj1]
table[1,0:5] = [p2, R2_2, MSE2, R2_adj2, MSE_adj2]
table[2,0:5] = [pL, R2_L, MSEL, R2_adjL, MSE_adjL]
table

array([[ 5.10000000e+01,  1.80238149e-01,  2.08219084e-01,
         1.51225493e-01,  2.16233554e-01],
       [ 9.79000000e+02,  5.07044001e-01,  1.25210567e-01,
         2.31502837e-01,  4.33979195e-01],
       [ 9.79000000e+02,  6.33611314e-02,  2.37905785e-01,
        -2.25221829e+00,  8.24580254e-01]])

In [27]:
table = pd.DataFrame(table, columns = ["p","$R^2_{sample}$","$MSE_{sample}$","$R^2_{adjusted}$", "$MSE_{adjusted}$"], \
                      index = ["basic reg","flexible reg", "lasso flex"])
table

Unnamed: 0,p,$R^2_{sample}$,$MSE_{sample}$,$R^2_{adjusted}$,$MSE_{adjusted}$
basic reg,51.0,0.180238,0.208219,0.151225,0.216234
flexible reg,979.0,0.507044,0.125211,0.231503,0.433979
lasso flex,979.0,0.063361,0.237906,-2.252218,0.82458


According to this **table**, the flexible regression has the best in-sample $R^2$, $MSE$ and adjusted $R^2$, whereas the basic regression has the lowest adjusted $MSE$; the lasso regression fits worst on every measure.

###### alpha = 0.51

In [28]:
alpha1 = 0.51

In [29]:
# Set penalty value = 0.51

reg1 = linear_model.Lasso(alpha = alpha1)

# LASSO regression for flexible model
reg1.fit(X, lwage)
lwage_lasso_fitted = reg1.fit(X, lwage).predict( X )
   
# Coefficients 

reg1.coef_
print('Lasso Regression: R^2 score', reg1.score(X, lwage))

Lasso Regression: R^2 score 0.04780796897404316


Now, we can evaluate the performance of both models based on the (adjusted) $R^2_{sample}$ and the (adjusted) $MSE_{sample}$:

In [30]:
# Assess the predictive performance

R2_1_1 = basic_results.rsquared
print("R-squared for the basic model: ", R2_1_1, "\n")
R2_adj1_1 = basic_results.rsquared_adj
print("adjusted R-squared for the basic model: ", R2_adj1_1, "\n")


R2_2_1 = flex_results.rsquared
print("R-squared for the flexible model: ", R2_2_1, "\n")
R2_adj2_1 = flex_results.rsquared_adj
print("adjusted R-squared for the flexible model: ", R2_adj2_1, "\n")


R2_L_1 = reg1.score(flex_results_0.exog, lwage)
print("R-squared for LASSO: ", R2_L_1, "\n")
R2_adjL_1 = 1 - (1-R2_L_1)*(len(lwage)-1)/(len(lwage)-X.shape[1]-1)
print("adjusted R-squared for LASSO: ", R2_adjL_1, "\n")

R-squared for the basic model:  0.18023814876721034 

adjusted R-squared for the basic model:  0.15122549288773657 

R-squared for the flexible model:  0.5070440013634975 

adjusted R-squared for the flexible model:  0.23150283659275406 

R-squared for LASSO:  0.04780796897404316 

adjusted R-squared for LASSO:  -2.3062223299512388 



In [31]:
# Calculating the MSE

MSE1_1 =  np.mean(basic_results.resid**2)
print("MSE for the basic model: ", MSE1_1, "\n")
p1_1 = len(basic_results.params) # number of regressors
n_1 = len(lwage)
MSE_adj1_1  = (n_1/(n_1-p1_1))*MSE1_1
print("adjusted MSE for the basic model: ", MSE_adj1_1, "\n")


MSE2_1 =  np.mean(flex_results.resid**2)
print("MSE for the flexible model: ", MSE2_1, "\n")
p2_1 = len(flex_results.params) # number of regressors
n_1 = len(lwage)
MSE_adj2_1  = (n_1/(n_1-p2_1))*MSE2_1
print("adjusted MSE for the flexible model: ", MSE_adj2_1, "\n")


MSEL_1 = mean_squared_error(lwage, lwage_lasso_fitted)
print("MSE for the LASSO model: ", MSEL_1, "\n")
pL_1 = reg1.coef_.shape[0] # number of regressors
n_1 = len(lwage)
MSE_adjL_1  = (n_1/(n_1-pL_1))*MSEL_1
print("adjusted MSE for LASSO model: ", MSE_adjL_1, "\n")

MSE for the basic model:  0.20821908377460777 

adjusted MSE for the basic model:  0.21623355416895118 

MSE for the flexible model:  0.12521056721891977 

adjusted MSE for the flexible model:  0.43397919519706196 

MSE for the LASSO model:  0.24185628055215028 

adjusted MSE for LASSO model:  0.8382726499742035 



In [32]:
# Package for latex table 

import array_to_latex as a2l

table1 = np.zeros((3, 5))
table1[0,0:5] = [p1_1, R2_1_1, MSE1_1, R2_adj1_1, MSE_adj1_1]
table1[1,0:5] = [p2_1, R2_2_1, MSE2_1, R2_adj2_1, MSE_adj2_1]
table1[2,0:5] = [pL_1, R2_L_1, MSEL_1, R2_adjL_1, MSE_adjL_1]
table1

array([[ 5.10000000e+01,  1.80238149e-01,  2.08219084e-01,
         1.51225493e-01,  2.16233554e-01],
       [ 9.79000000e+02,  5.07044001e-01,  1.25210567e-01,
         2.31502837e-01,  4.33979195e-01],
       [ 9.79000000e+02,  4.78079690e-02,  2.41856281e-01,
        -2.30622233e+00,  8.38272650e-01]])

In [33]:
table1 = pd.DataFrame(table1, columns = ["p","$R^2_{sample}$","$MSE_{sample}$","$R^2_{adjusted}$", "$MSE_{adjusted}$"], \
                      index = ["basic reg","flexible reg", "lasso flex"])
table1

Unnamed: 0,p,$R^2_{sample}$,$MSE_{sample}$,$R^2_{adjusted}$,$MSE_{adjusted}$
basic reg,51.0,0.180238,0.208219,0.151225,0.216234
flexible reg,979.0,0.507044,0.125211,0.231503,0.433979
lasso flex,979.0,0.047808,0.241856,-2.306222,0.838273


###### alpha = 0.87

In [34]:
alpha2 = 0.87

In [35]:
# Set penalty value = 0.87

reg2 = linear_model.Lasso(alpha = alpha2)

# LASSO regression for flexible model
reg2.fit(X, lwage)
lwage_lasso_fitted = reg2.fit(X, lwage).predict( X )
   
# Coefficients 

reg2.coef_
print('Lasso Regression: R^2 score', reg2.score(X, lwage))

Lasso Regression: R^2 score 0.03320876316896926


Now, we can evaluate the performance of both models based on the (adjusted) $R^2_{sample}$ and the (adjusted) $MSE_{sample}$:

In [36]:
# Assess the predictive performance

R2_1_2 = basic_results.rsquared
print("R-squared for the basic model: ", R2_1_2, "\n")
R2_adj1_2 = basic_results.rsquared_adj
print("adjusted R-squared for the basic model: ", R2_adj1_2, "\n")


R2_2_2 = flex_results.rsquared
print("R-squared for the flexible model: ", R2_2_2, "\n")
R2_adj2_2 = flex_results.rsquared_adj
print("adjusted R-squared for the flexible model: ", R2_adj2_2, "\n")


R2_L_2 = reg2.score(flex_results_0.exog, lwage)
print("R-squared for LASSO: ", R2_L_2, "\n")
R2_adjL_2 = 1 - (1-R2_L_2)*(len(lwage)-1)/(len(lwage)-X.shape[1]-1)
print("adjusted R-squared for LASSO: ", R2_adjL_2, "\n")

R-squared for the basic model:  0.18023814876721034 

adjusted R-squared for the basic model:  0.15122549288773657 

R-squared for the flexible model:  0.5070440013634975 

adjusted R-squared for the flexible model:  0.23150283659275406 

R-squared for LASSO:  0.03320876316896926 

adjusted R-squared for LASSO:  -2.3569140167744123 



In [37]:
# Calculating the MSE

MSE1_2 =  np.mean(basic_results.resid**2)
print("MSE for the basic model: ", MSE1_2, "\n")
p1_2 = len(basic_results.params) # number of regressors
n_2 = len(lwage)
MSE_adj1_2  = (n_2/(n_2-p1_2))*MSE1_2
print("adjusted MSE for the basic model: ", MSE_adj1_2, "\n")


MSE2_2 =  np.mean(flex_results.resid**2)
print("MSE for the flexible model: ", MSE2_2, "\n")
p2_2 = len(flex_results.params) # number of regressors
n_2 = len(lwage)
MSE_adj2_2  = (n_2/(n_2-p2_2))*MSE2_2
print("adjusted MSE for the flexible model: ", MSE_adj2_2, "\n")


MSEL_2 = mean_squared_error(lwage, lwage_lasso_fitted)
print("MSE for the LASSO model: ", MSEL_2, "\n")
pL_2 = reg2.coef_.shape[0] # number of regressors
n_2 = len(lwage)
MSE_adjL_2 = (n_2/(n_2-pL_2))*MSEL_2
print("adjusted MSE for LASSO model: ", MSE_adjL_2, "\n")

MSE for the basic model:  0.20821908377460777 

adjusted MSE for the basic model:  0.21623355416895118 

MSE for the flexible model:  0.12521056721891977 

adjusted MSE for the flexible model:  0.43397919519706196 

MSE for the LASSO model:  0.24556447123216057 

adjusted MSE for LASSO model:  0.8511252201900579 



In [38]:
# Package for latex table 

import array_to_latex as a2l

table2 = np.zeros((3, 5))
table2[0,0:5] = [p1_2, R2_1_2, MSE1_2, R2_adj1_2, MSE_adj1_2]
table2[1,0:5] = [p2_2, R2_2_2, MSE2_2, R2_adj2_2, MSE_adj2_2]
table2[2,0:5] = [pL_2, R2_L_2, MSEL_2, R2_adjL_2, MSE_adjL_2]
table2

array([[ 5.10000000e+01,  1.80238149e-01,  2.08219084e-01,
         1.51225493e-01,  2.16233554e-01],
       [ 9.79000000e+02,  5.07044001e-01,  1.25210567e-01,
         2.31502837e-01,  4.33979195e-01],
       [ 9.79000000e+02,  3.32087632e-02,  2.45564471e-01,
        -2.35691402e+00,  8.51125220e-01]])

In [39]:
table2 = pd.DataFrame(table2, columns = ["p","$R^2_{sample}$","$MSE_{sample}$","$R^2_{adjusted}$", "$MSE_{adjusted}$"], \
                      index = ["basic reg","flexible reg", "lasso flex"])
table2

Unnamed: 0,p,$R^2_{sample}$,$MSE_{sample}$,$R^2_{adjusted}$,$MSE_{adjusted}$
basic reg,51.0,0.180238,0.208219,0.151225,0.216234
flexible reg,979.0,0.507044,0.125211,0.231503,0.433979
lasso flex,979.0,0.033209,0.245564,-2.356914,0.851125


Now we proceed to the comparison between models and to its analysis:

In [40]:
table

Unnamed: 0,p,$R^2_{sample}$,$MSE_{sample}$,$R^2_{adjusted}$,$MSE_{adjusted}$
basic reg,51.0,0.180238,0.208219,0.151225,0.216234
flexible reg,979.0,0.507044,0.125211,0.231503,0.433979
lasso flex,979.0,0.063361,0.237906,-2.252218,0.82458


In [41]:
table1

Unnamed: 0,p,$R^2_{sample}$,$MSE_{sample}$,$R^2_{adjusted}$,$MSE_{adjusted}$
basic reg,51.0,0.180238,0.208219,0.151225,0.216234
flexible reg,979.0,0.507044,0.125211,0.231503,0.433979
lasso flex,979.0,0.047808,0.241856,-2.306222,0.838273


In [42]:
table2

Unnamed: 0,p,$R^2_{sample}$,$MSE_{sample}$,$R^2_{adjusted}$,$MSE_{adjusted}$
basic reg,51.0,0.180238,0.208219,0.151225,0.216234
flexible reg,979.0,0.507044,0.125211,0.231503,0.433979
lasso flex,979.0,0.033209,0.245564,-2.356914,0.851125


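- In **table**, the flexible regression has the best in-sample fit: the highest $R^2_{sample}$ (0.507), the lowest $MSE_{sample}$ (0.125) and the highest $R^2_{adjusted}$ (0.232). Its $MSE_{adjusted}$ (0.434), however, is about twice that of the basic regression (0.216), which signals overfitting.

- In **table1** (alpha = 0.51), the ranking is the same. Only the lasso row changes, and it fits slightly worse than in **table** ($R^2_{sample}$ of 0.048 instead of 0.063).

- In **table2** (alpha = 0.87), the ranking is again the same, and the lasso fit deteriorates further as the penalty grows ($R^2_{sample}$ of 0.033).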

## Data Splitting
Measure the prediction quality of the two models via data splitting:

- Randomly split the data into one training sample and one testing sample. Here we just use a simple method (stratified splitting is a more sophisticated version of splitting that we can consider).
- Use the training sample for estimating the parameters of the Basic Model and the Flexible Model.
- Use the testing sample for evaluation. Predict the `wage` of every observation in the testing sample based on the estimated parameters in the training sample.
- Calculate the Mean Squared Prediction Error $MSE_{test}$ based on the testing sample for both prediction models.
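
The steps above can be sketched on synthetic data as follows. This is a minimal illustration, not the notebook's own procedure: the helper `split_train_test`, the toy columns `exp1` and `lwage`, and the data-generating process are assumptions made for the example. A random permutation is used so that every row lands in exactly one of the two samples.

```python
import numpy as np
import pandas as pd


def split_train_test(df, train_frac=0.8, seed=0):
    """Shuffle the rows with a random permutation and split them in two."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))        # each row position appears exactly once
    n_train = int(np.floor(train_frac * len(df)))
    return df.iloc[order[:n_train]], df.iloc[order[n_train:]]


# Synthetic stand-in for the wage data (illustrative values only).
rng = np.random.default_rng(1)
toy = pd.DataFrame({"exp1": rng.uniform(0, 40, size=1000)})
toy["lwage"] = 2.0 + 0.02 * toy["exp1"] + rng.normal(0, 0.5, size=1000)

train, test = split_train_test(toy, train_frac=0.8, seed=0)

# Estimate a linear prediction rule on the training sample only.
slope, intercept = np.polyfit(train["exp1"].to_numpy(), train["lwage"].to_numpy(), deg=1)

# Evaluate the rule on the held-out testing sample.
y_test = test["lwage"].to_numpy()
pred = intercept + slope * test["exp1"].to_numpy()
mse_test = np.mean((y_test - pred) ** 2)
r2_test = 1 - mse_test / np.var(y_test)

print(train.shape, test.shape, mse_test, r2_test)
```

Because the noise variance in the toy data is 0.25, the test MSE should land near that value; the point of the sketch is only that the parameters never see the testing rows.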

In [43]:
import math

# Set seed to make the results replicable (random number generation)
np.random.seed(0)
random = np.random.randint(0, n, size=n)   # random sort key; values may repeat
data["random"] = random
random    # with a fixed seed, this array is the same on every run

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data["random"] = random


array([ 684,  559, 1216, ..., 1294,  573, 1367])

In [44]:
data_2 = data.sort_values(by=['random'])
data_2.head()

Unnamed: 0_level_0,wage,lwage,sex,shs,hsg,scl,clg,ad,mw,so,...,ne,exp1,exp2,exp3,exp4,occ,occ2,ind,ind2,random
rownames,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
9262,9.134615,2.212071,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,5.0,0.25,0.125,0.0625,9360,22,5090,9,0
14134,24.038462,3.179655,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,11.0,1.21,1.331,1.4641,4250,14,7770,16,0
8689,12.980769,2.563469,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,17.0,2.89,4.913,8.3521,5620,17,5380,9,3
11486,12.980769,2.563469,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,7.0,0.49,0.343,0.2401,4760,16,5790,9,3
20866,34.615385,3.544298,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,5.0,0.25,0.125,0.0625,220,1,770,4,3


### CASE 1: 20% - 80%

In [45]:
# Create training and testing sample 
train = data_2[ : math.floor(n*1/5)]    # training sample (first 20% of the shuffled rows)
test =  data_2[ math.floor(n*1/5) : ]   # testing sample (remaining 80%)
print(train.shape)
print(test.shape)

(275, 21)
(1101, 21)


In [46]:
# Basic Model
basic = 'lwage ~ sex + exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2'
basic_results = smf.ols(basic , data=data).fit()


# Flexible model 
flex = 'lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'
flex_results = smf.ols(flex , data=data).fit()

- basic model

In [47]:
# estimating the parameters in the training sample
basic_results = smf.ols(basic , data=train).fit()
print(basic_results.summary())

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.301
Model:                            OLS   Adj. R-squared:                  0.167
Method:                 Least Squares   F-statistic:                     2.250
Date:                Fri, 17 Sep 2021   Prob (F-statistic):           6.08e-05
Time:                        15:49:29   Log-Likelihood:                -153.57
No. Observations:                 275   AIC:                             397.1
Df Residuals:                     230   BIC:                             559.9
Df Model:                          44                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0293      0.140     14.464      0.0

In [48]:
lwage_test = test["lwage"].values
#test = test.drop(columns=['wage', 'lwage', 'random'])
#test

In [49]:
# calculating the out-of-sample MSE
test = sm.add_constant(test)   #add constant 

lwage_pred =  basic_results.predict(test) # predict out of sample
print(lwage_pred)

rownames
6461     2.771974
19084    2.718507
9432     2.733841
14106    2.651899
23945    2.726331
           ...   
5201     2.336775
9639     2.793093
15695    2.575746
3853     2.911139
20067    2.230210
Length: 1101, dtype: float64


In [50]:
MSE_test1 = np.sum((lwage_test-lwage_pred)**2)/len(lwage_test)
R2_test1  = 1 - MSE_test1/np.var(lwage_test)

print("Test MSE for the basic model: ", MSE_test1, " ")
print("Test R2 for the basic model: ", R2_test1)

Test MSE for the basic model:  0.24940994035591493  
Test R2 for the basic model:  0.01620258082380932


In the basic model, the $MSE_{test}$ (0.2494) is not very close to the $MSE_{sample}$ (0.2082).

- Flexible model

In [51]:
# estimating the parameters in the training sample
flex_results = smf.ols(flex , data=train).fit()

# calculating the out-of-sample MSE
lwage_flex_pred =  flex_results.predict(test) # predict out of sample
lwage_test = test["lwage"].values

MSE_test2 = np.sum((lwage_test-lwage_flex_pred)**2)/len(lwage_test)
R2_test2  = 1 - MSE_test2/np.var(lwage_test)

print("Test MSE for the flexible model: ", MSE_test2, " ")
print("Test R2 for the flexible model: ", R2_test2)

Test MSE for the flexible model:  126510.26715651712  
Test R2 for the flexible model:  -499018.7028645425


In the flexible model, the discrepancy between the $MSE_{test}$ (126510.26) and the $MSE_{sample}$ (0.125) is very large.

It is worth noting that the $MSE_{test}$ varies across different data splits. Hence, it is a good idea to average the out-of-sample MSE over several data splits to get reliable results.

Nevertheless, we observe that, based on the out-of-sample $MSE$, the basic model using ols regression apparently performs much better than the flexible model.
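
The averaging over data splits mentioned above can be sketched as follows. This is an illustration on synthetic data, not a rerun of the wage analysis: the data-generating process, the helper name `average_test_mse` and the choice of 20 splits are assumptions. It uses scikit-learn's `train_test_split` with a different `random_state` for each split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 0.0, 0.25, 0.0]) + rng.normal(scale=0.5, size=n)


def average_test_mse(X, y, n_splits=20, test_size=0.2):
    """Average the out-of-sample MSE of OLS over several random splits."""
    mses = []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = LinearRegression().fit(X_tr, y_tr)      # fit on training rows only
        mses.append(mean_squared_error(y_te, model.predict(X_te)))
    return float(np.mean(mses)), float(np.std(mses))


mean_mse, sd_mse = average_test_mse(X, y)
print(mean_mse, sd_mse)
```

The standard deviation across splits gives a rough sense of how much a single split, such as the one used in this notebook, can move the reported $MSE_{test}$.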


Next, let us use lasso regression in the flexible model instead of ols regression. Lasso (*least absolute shrinkage and selection operator*) is a penalized regression method that can be used to reduce the complexity of a regression model when the number of regressors $p$ is relatively large in relation to $n$. 
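
For reference, the penalty can be written out explicitly. As a sketch, the objective that scikit-learn documents for `linear_model.Lasso`, with penalty level $\alpha$ and an unpenalized intercept $b_0$, is

$$ \min_{b_0,\, b} \; \frac{1}{2n}\sum_{i=1}^{n} \left(y_i - b_0 - x_i'b\right)^2 + \alpha \sum_{j=1}^{p} |b_j| $$

The symbols $b_0$, $b$ and $x_i$ are notation chosen here for illustration. A larger $\alpha$ shrinks the coefficients more strongly and sets more of them exactly to zero.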

Note that the out-of-sample $MSE$ on the test sample can be computed for any other black-box prediction method as well. Thus, let us finally compare the performance of lasso regression in the flexible model to ols regression.

- flexible model using lasso

In [53]:
# get exogenous variables from training data used in flex model
flex_results_0 = smf.ols(flex , data=train)
X_train = flex_results_0.exog
print(X_train.shape)

# Get endogenous variable 
lwage_train = train["lwage"]
print(lwage_train.shape)

(275, 979)
(275,)


In [54]:
# get exogenous variables from testing data used in flex model
flex_results_1 = smf.ols(flex , data=test)
X_test = flex_results_1.exog
print(X_test.shape)

# Get endogenous variable 
lwage_test = test["lwage"]
print(lwage_test.shape)

(1101, 979)
(1101,)


Calculating the out-of-sample MSE

In [56]:
#alpha=0.31

reg = linear_model.Lasso(alpha=0.31)
lwage_lasso_fitted = reg.fit(X_train, lwage_train).predict( X_test )

MSE_lasso_1 = np.sum((lwage_test-lwage_lasso_fitted)**2)/len(lwage_test)
R2_lasso_1 = 1 - MSE_lasso_1/np.var(lwage_test)

print("Test MSE for the lasso model: ", MSE_lasso_1, " ")
print("Test R2 for the lasso model: ", R2_lasso_1)

Test MSE for the lasso model:  0.25816421854508254  
Test R2 for the lasso model:  -0.01832866631478991




In [57]:
# Package for latex table 
import array_to_latex as a2l

table11 = np.zeros((3, 2))
table11[0,0] = MSE_test1
table11[1,0] = MSE_test2
table11[2,0] = MSE_lasso_1
table11[0,1] = R2_test1
table11[1,1] = R2_test2
table11[2,1] = R2_lasso_1

table11 = pd.DataFrame(table11, columns = ["$MSE_{test}$", "$R^2_{test}$"], \
                      index = ["basic reg","flexible reg","lasso regression"])
table11

Unnamed: 0,$MSE_{test}$,$R^2_{test}$
basic reg,0.24941,0.016203
flexible reg,126510.267157,-499018.702865
lasso regression,0.258164,-0.018329


In [58]:
#alpha=0.51

reg = linear_model.Lasso(alpha=0.51)
lwage_lasso_fitted = reg.fit(X_train, lwage_train).predict( X_test )

MSE_lasso_2 = np.sum((lwage_test-lwage_lasso_fitted)**2)/len(lwage_test)
R2_lasso_2 = 1 - MSE_lasso_2/np.var(lwage_test)

print("Test MSE for the lasso model: ", MSE_lasso_2, " ")
print("Test R2 for the lasso model: ", R2_lasso_2)

Test MSE for the lasso model:  0.24966393799330208  
Test R2 for the lasso model:  0.015200687235359611




In [59]:
# Package for latex table 
import array_to_latex as a2l

table12 = np.zeros((3, 2))
table12[0,0] = MSE_test1
table12[1,0] = MSE_test2
table12[2,0] = MSE_lasso_2
table12[0,1] = R2_test1
table12[1,1] = R2_test2
table12[2,1] = R2_lasso_2

table12 = pd.DataFrame(table12, columns = ["$MSE_{test}$", "$R^2_{test}$"], \
                      index = ["basic reg","flexible reg","lasso regression"])
table12

Unnamed: 0,$MSE_{test}$,$R^2_{test}$
basic reg,0.24941,0.016203
flexible reg,126510.267157,-499018.702865
lasso regression,0.249664,0.015201


In [60]:
#alpha=0.87

reg = linear_model.Lasso(alpha=0.87)
lwage_lasso_fitted = reg.fit(X_train, lwage_train).predict( X_test )

MSE_lasso_3 = np.sum((lwage_test-lwage_lasso_fitted)**2)/len(lwage_test)
R2_lasso_3 = 1 - MSE_lasso_3/np.var(lwage_test)

print("Test MSE for the lasso model: ", MSE_lasso_3, " ")
print("Test R2 for the lasso model: ", R2_lasso_3)

Test MSE for the lasso model:  0.24672277910294635  
Test R2 for the lasso model:  0.026802087410468745


In [61]:
# Package for latex table 
import array_to_latex as a2l

table13 = np.zeros((3, 2))
table13[0,0] = MSE_test1
table13[1,0] = MSE_test2
table13[2,0] = MSE_lasso_3
table13[0,1] = R2_test1
table13[1,1] = R2_test2
table13[2,1] = R2_lasso_3

table13 = pd.DataFrame(table13, columns = ["$MSE_{test}$", "$R^2_{test}$"], \
                      index = ["basic reg","flexible reg","lasso regression"])
table13

Unnamed: 0,$MSE_{test}$,$R^2_{test}$
basic reg,0.24941,0.016203
flexible reg,126510.267157,-499018.702865
lasso regression,0.246723,0.026802
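
The three penalty values are evaluated above in three near-identical cells. As a sketch of a more compact alternative, the same comparison can be run in one loop. The example below uses synthetic data rather than the wage sample; the data-generating process and the `results` dictionary are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Synthetic sparse regression problem (illustrative only).
rng = np.random.default_rng(0)
n_train, n_test, p = 200, 400, 50
beta = np.zeros(p)
beta[:3] = [1.0, -0.5, 0.25]
X_train = rng.normal(size=(n_train, p))
X_test = rng.normal(size=(n_test, p))
y_train = X_train @ beta + rng.normal(scale=0.5, size=n_train)
y_test = X_test @ beta + rng.normal(scale=0.5, size=n_test)

results = {}
for alpha in (0.31, 0.51, 0.87):
    model = Lasso(alpha=alpha).fit(X_train, y_train)    # fit on training data only
    mse = mean_squared_error(y_test, model.predict(X_test))
    results[alpha] = {
        "mse_test": mse,
        "r2_test": 1 - mse / np.var(y_test),
        "n_nonzero": int(np.sum(model.coef_ != 0)),     # coefficients kept by lasso
    }

for alpha, row in results.items():
    print(alpha, row)
```

Collecting the rows in one dictionary also makes it straightforward to build a single comparison table with `pd.DataFrame(results).T`.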


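Considering the 20% (training) and 80% (test) partition (275 and 1101 observations), the best model, by a small margin, is the lasso fit of the flexible model with alpha = 0.87 ($MSE_{test}$ = 0.2467), slightly ahead of the basic OLS model (0.2494). The flexible OLS model performs far worse than both.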

### CASE 2: 80% - 20%

In [62]:
# Create training and testing sample 
train = data_2[ : math.floor(n*4/5)]    # training sample
test =  data_2[ math.floor(n*4/5) : ]   # testing sample
print(train.shape)
print(test.shape)

(1100, 21)
(276, 21)


In [63]:
# Basic Model
basic = 'lwage ~ sex + exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2'
basic_results = smf.ols(basic , data=data).fit()


# Flexible model 
flex = 'lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'
flex_results = smf.ols(flex , data=data).fit()

- basic model

In [65]:
# estimating the parameters in the training sample
basic_results = smf.ols(basic , data=train).fit()
print(basic_results.summary())

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.190
Model:                            OLS   Adj. R-squared:                  0.154
Method:                 Least Squares   F-statistic:                     5.265
Date:                Fri, 17 Sep 2021   Prob (F-statistic):           2.13e-25
Time:                        15:49:57   Log-Likelihood:                -711.67
No. Observations:                1100   AIC:                             1519.
Df Residuals:                    1052   BIC:                             1759.
Df Model:                          47                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0547      0.071     28.859      0.0

In [66]:
lwage_test = test["lwage"].values
#test = test.drop(columns=['wage', 'lwage', 'random'])
#test

In [67]:
# calculating the out-of-sample MSE
test = sm.add_constant(test)   #add constant 

lwage_pred =  basic_results.predict(test) # predict out of sample
print(lwage_pred)

rownames
30171    2.849302
20062    2.635376
26099    2.518647
29638    2.849302
24679    2.917170
           ...   
5201     2.528900
9639     2.805803
15695    2.242189
3853     2.914903
20067    2.408205
Length: 276, dtype: float64


In [68]:
MSE_test1 = np.sum((lwage_test-lwage_pred)**2)/len(lwage_test)
R2_test1  = 1 - MSE_test1/np.var(lwage_test)

print("Test MSE for the basic model: ", MSE_test1, " ")
print("Test R2 for the basic model: ", R2_test1)

Test MSE for the basic model:  0.1998710251148981  
Test R2 for the basic model:  0.04156410192334259


In the basic model, the $MSE_{test}$ (0.1999) is close to the $MSE_{sample}$ (0.2082).

- Flexible model

In [70]:
# estimating the parameters in the training sample
flex_results = smf.ols(flex , data=train).fit()

# calculating the out-of-sample MSE
lwage_flex_pred =  flex_results.predict(test) # predict out of sample
lwage_test = test["lwage"].values

MSE_test2 = np.sum((lwage_test-lwage_flex_pred)**2)/len(lwage_test)
R2_test2  = 1 - MSE_test2/np.var(lwage_test)

print("Test MSE for the flexible model: ", MSE_test2, " ")
print("Test R2 for the flexible model: ", R2_test2)

Test MSE for the flexible model:  21.54882931955914  
Test R2 for the flexible model:  -102.33249438991751


In the flexible model, the discrepancy between the $MSE_{test}$ (21.55) and the $MSE_{sample}$ (0.125) is very large.

It is worth noting that the $MSE_{test}$ varies across different data splits. Hence, it is a good idea to average the out-of-sample MSE over several data splits to get reliable results.

Nevertheless, we observe that, based on the out-of-sample $MSE$, the basic model using ols regression apparently performs much better than the flexible model.


Next, let us use lasso regression in the flexible model instead of ols regression. Lasso (*least absolute shrinkage and selection operator*) is a penalized regression method that can be used to reduce the complexity of a regression model when the number of regressors $p$ is relatively large in relation to $n$. 

Note that the out-of-sample $MSE$ on the test sample can be computed for any other black-box prediction method as well. Thus, let us finally compare the performance of lasso regression in the flexible model to ols regression.

- flexible model using lasso

In [71]:
# get exogenous variables from training data used in flex model
flex_results_0 = smf.ols(flex , data=train)
X_train = flex_results_0.exog
print(X_train.shape)

# Get endogenous variable 
lwage_train = train["lwage"]
print(lwage_train.shape)

(1100, 979)
(1100,)


In [72]:
# get exogenous variables from testing data used in flex model
flex_results_1 = smf.ols(flex , data=test)
X_test = flex_results_1.exog
print(X_test.shape)

# Get endogenous variable 
lwage_test = test["lwage"]
print(lwage_test.shape)

(276, 979)
(276,)


Calculating the out-of-sample MSE

In [75]:
#alpha=0.31

reg = linear_model.Lasso(alpha=0.31)
lwage_lasso_fitted = reg.fit(X_train, lwage_train).predict( X_test )

MSE_lasso_1 = np.sum((lwage_test-lwage_lasso_fitted)**2)/len(lwage_test)
R2_lasso_1 = 1 - MSE_lasso_1/np.var(lwage_test)

print("Test MSE for the lasso model: ", MSE_lasso_1, " ")
print("Test R2 for the lasso model: ", R2_lasso_1)

Test MSE for the lasso model:  0.20226455264309567  
Test R2 for the lasso model:  0.030086486772569976




In [76]:
# Package for latex table 
import array_to_latex as a2l

table11 = np.zeros((3, 2))
table11[0,0] = MSE_test1
table11[1,0] = MSE_test2
table11[2,0] = MSE_lasso_1
table11[0,1] = R2_test1
table11[1,1] = R2_test2
table11[2,1] = R2_lasso_1

table11 = pd.DataFrame(table11, columns = ["$MSE_{test}$", "$R^2_{test}$"], \
                      index = ["basic reg","flexible reg","lasso regression"])
table11

Unnamed: 0,$MSE_{test}$,$R^2_{test}$
basic reg,0.199871,0.041564
flexible reg,21.548829,-102.332494
lasso regression,0.202265,0.030086


In [77]:
#alpha=0.51

reg = linear_model.Lasso(alpha=0.51)
lwage_lasso_fitted = reg.fit(X_train, lwage_train).predict( X_test )

MSE_lasso_2 = np.sum((lwage_test-lwage_lasso_fitted)**2)/len(lwage_test)
R2_lasso_2 = 1 - MSE_lasso_2/np.var(lwage_test)

print("Test MSE for the lasso model: ", MSE_lasso_2, " ")
print("Test R2 for the lasso model: ", R2_lasso_2)

Test MSE for the lasso model:  0.2034417261560974  
Test R2 for the lasso model:  0.024441619776579437


In [78]:
# Package for latex table 
import array_to_latex as a2l

table12 = np.zeros((3, 2))
table12[0,0] = MSE_test1
table12[1,0] = MSE_test2
table12[2,0] = MSE_lasso_2
table12[0,1] = R2_test1
table12[1,1] = R2_test2
table12[2,1] = R2_lasso_2

table12 = pd.DataFrame(table12, columns = ["$MSE_{test}$", "$R^2_{test}$"], \
                      index = ["basic reg","flexible reg","lasso regression"])
table12

Unnamed: 0,$MSE_{test}$,$R^2_{test}$
basic reg,0.199871,0.041564
flexible reg,21.548829,-102.332494
lasso regression,0.203442,0.024442


In [79]:
#alpha=0.87

reg = linear_model.Lasso(alpha=0.87)
lwage_lasso_fitted = reg.fit(X_train, lwage_train).predict( X_test )

MSE_lasso_3 = np.sum((lwage_test-lwage_lasso_fitted)**2)/len(lwage_test)
R2_lasso_3 = 1 - MSE_lasso_3/np.var(lwage_test)

print("Test MSE for the lasso model: ", MSE_lasso_3, " ")
print("Test R2 for the lasso model: ", R2_lasso_3)

Test MSE for the lasso model:  0.20790372485024064  
Test R2 for the lasso model:  0.003045123094883695


In [80]:
# Package for latex table 
import array_to_latex as a2l

table13 = np.zeros((3, 2))
table13[0,0] = MSE_test1
table13[1,0] = MSE_test2
table13[2,0] = MSE_lasso_3
table13[0,1] = R2_test1
table13[1,1] = R2_test2
table13[2,1] = R2_lasso_3

table13 = pd.DataFrame(table13, columns = ["$MSE_{test}$", "$R^2_{test}$"], \
                      index = ["basic reg","flexible reg","lasso regression"])
table13

Unnamed: 0,$MSE_{test}$,$R^2_{test}$
basic reg,0.199871,0.041564
flexible reg,21.548829,-102.332494
lasso regression,0.207904,0.003045


Considering the 80% (training) and 20% (test) partition, the best model for the wage data is the basic model ($MSE_{test}$ = 0.1999), narrowly ahead of the lasso fits (0.2023 to 0.2079).

## Question 3

In addition, do two cases of Partialling-Out using lasso. Remember that we want to find the beta associated with sex.
Y = log(wage), D = sex
- Case 1: Partialling-Out using lasso 1 : Matrix W = 'exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2'
- Case 2: Partialling-Out using lasso 2 : Matrix W = '(exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'

We are going to run a regression of $Y$ on $(D,W)$ to control for the effect of covariates summarized in $W$:

\begin{align}
Y &=\beta_1 D  + \beta_2' W + \epsilon.
\end{align}
 
Here, we assume that the dimension of $W$ is relatively high, so we need to use variable selection or penalization for regularization purposes. We consider the partialling-out model using Lasso. In Case 1, $W$ controls for experience, education, region, and occupation and industry indicators; Case 2 adds polynomial terms in experience and all two-way interactions.
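
In symbols, the partialling-out procedure can be sketched in three steps. The notation $\hat\ell(W)$ and $\hat m(W)$ for the lasso predictions of $Y$ and $D$ given $W$ is introduced here for illustration and does not appear in the notebook:

\begin{align}
\tilde Y &= Y - \hat\ell(W), \\
\tilde D &= D - \hat m(W), \\
\hat\beta_1 &= \frac{\sum_{i=1}^{n} \tilde D_i \tilde Y_i}{\sum_{i=1}^{n} \tilde D_i^2}.
\end{align}

The last line is the OLS slope from regressing $\tilde Y$ on $\tilde D$. It takes this simple form because the residuals of a lasso fit with an intercept average to zero in the estimation sample.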

In [1194]:
from sklearn import linear_model
import math

- Case 1: 
       
$$ W = exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2 $$

In [1242]:
flex_y = 'lwage ~ (exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2)'
flex_d = 'sex ~ (exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2)'

In [1243]:
# flex_y

lasso_model = linear_model.Lasso( alpha = 0.1 )

flex_y_covariables = smf.ols(formula = flex_y, data = data)
Y_lasso_fitted = lasso_model.fit( flex_y_covariables.exog, data[[ 'lwage' ]] ).predict( flex_y_covariables.exog )
t_Y = data[[ 'lwage' ]] - Y_lasso_fitted.reshape( Y_lasso_fitted.size, 1)

In [1244]:
# flex_d

flex_d_covariables = smf.ols( flex_d, data=data)
D_lasso_fitted = lasso_model.fit( flex_d_covariables.exog, data[[ 'sex' ]] ).predict( flex_d_covariables.exog )
t_D = data[[ 'sex' ]] - D_lasso_fitted.reshape( D_lasso_fitted.size, 1)

data_res = pd.DataFrame( np.hstack(( t_Y , t_D )) , columns = [ 't_Y', 't_D' ] )

In [1245]:
# regression of Y on D after partialling-out the effect of W

partial_lasso_fit = smf.ols( formula = 't_Y ~ t_D' , data = data_res ).fit()
partial_lasso_est = partial_lasso_fit.summary2().tables[1]['Coef.']['t_D']

print( f"Coefficient for D via partialling-out using lasso {partial_lasso_est}" )

# standard error
HCV_coefs = partial_lasso_fit.cov_HC0
partial_lasso_se = np.power( HCV_coefs.diagonal() , 0.5)[1]

Coefficient for D via partialling-out using lasso -0.1230703162670709


The estimated regression coefficient $\beta_1\approx-0.1231$ measures how our prediction of log wage changes if we switch the gender variable $D$ from 0 to 1, holding the controls $W$ fixed.
We can call this the *predictive effect* (PE), as it measures the impact of a variable on the prediction we make. Overall, the wage gap is about $12.31$\% after controlling for worker characteristics. 
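
To see why partialling-out recovers the coefficient on $D$, the procedure can be tried on synthetic data where the true coefficient is known. The sketch below is an illustration under stated assumptions, not the notebook's analysis: the data-generating process, the value `beta_true = -0.12`, the penalty `alpha = 0.05` and the helper `partial_out` are all chosen for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: D is correlated with the first control, so a naive
# regression of Y on D alone is biased.
rng = np.random.default_rng(0)
n, p = 2000, 20
W = rng.normal(size=(n, p))
D = 0.5 * W[:, 0] + rng.normal(size=n)
beta_true = -0.12
Y = beta_true * D + 1.0 * W[:, 0] + 0.5 * W[:, 1] + rng.normal(scale=0.5, size=n)


def partial_out(target, W, alpha):
    """Return the residual of `target` after a lasso regression on W."""
    fitted = Lasso(alpha=alpha).fit(W, target).predict(W)
    return target - fitted


t_Y = partial_out(Y, W, alpha=0.05)
t_D = partial_out(D, W, alpha=0.05)

# OLS slope of the residualized outcome on the residualized treatment.
beta_hat = np.sum(t_D * t_Y) / np.sum(t_D ** 2)

# Naive slope that ignores the controls, for comparison.
naive = np.cov(D, Y)[0, 1] / np.var(D, ddof=1)

print(beta_hat, naive)
```

In this design the naive slope has the wrong sign, while the partialled-out estimate is close to `beta_true`. The lasso penalty introduces a small shrinkage bias, so the two need not match exactly.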

### Summarize the results

In [1246]:
table3 = np.zeros( (1, 2) )

table3[0,0] = partial_lasso_est  
table3[0,1] = partial_lasso_se    

table3_pandas = pd.DataFrame( table3, columns = [ "Estimate","Std. Error" ])
table3_pandas.index = [ "partial reg via lasso" ]
table3_html = table3_pandas.to_html()
table3_pandas

| | Estimate | Std. Error |
|---|---|---|
| partial reg via lasso | -0.12307 | 0.028018 |
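The estimate and standard error in the table can be combined into an approximate confidence interval. The sketch below assumes a normal approximation for the estimator, which is an assumption of this example rather than something the notebook computes:

```python
from statistics import NormalDist

# Values taken from the summary table above
estimate = -0.12307
std_error = 0.028018

# Two-sided 95% critical value under a normal approximation (assumption)
z = NormalDist().inv_cdf(0.975)

lower = estimate - z * std_error
upper = estimate + z * std_error

print(f"95% CI: [{lower:.4f}, {upper:.4f}]")
```

Under this approximation the interval excludes zero, which is consistent with reading the estimated gap as statistically distinguishable from no gap.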


### "Extra" flexible model

In [1247]:
extraflex = 'lwage ~ sex + (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'

control_fit = smf.ols( formula = extraflex, data=data).fit()

#summary( control_fit )
control_est = control_fit.summary2().tables[1]['Coef.']['sex']

print( f"Number of Extra-Flex Controls {control_fit.summary2().tables[1].shape[0]-1} \nCoefficient for OLS with extra flex controls {control_est}" )

# standard error
HCV_coefs = control_fit.cov_HC0

n= len(data[ 'wage' ])

p = len(control_fit.summary2().tables[1]['Coef.'])

control_se = control_fit.summary2().tables[1]['Std.Err.']['sex']*math.sqrt(n/(n-p))
control_se

Number of Extra-Flex Controls 979 
Coefficient for OLS with extra flex controls -0.08832946702423343


0.07144339566478773

In [1248]:
# models
# model for Y
extraflex_y = 'lwage ~  (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'

# model for D
extraflex_d = 'sex ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'

# extraflex_y
lasso_model = linear_model.Lasso( alpha = 0.1  )

extraflex_y_covariables = smf.ols(formula = extraflex_y, data = data)

Y_lasso_fitted = lasso_model.fit( extraflex_y_covariables.exog, data[[ 'lwage' ]] ).predict( extraflex_y_covariables.exog )

t_Y = data[[ 'lwage' ]] - Y_lasso_fitted.reshape( Y_lasso_fitted.size, 1)

# extraflex_d
extraflex_d_covariables = smf.ols( extraflex_d, data=data)

D_lasso_fitted = lasso_model.fit( extraflex_d_covariables.exog, data[[ 'sex' ]] ).predict( extraflex_d_covariables.exog )

t_D = data[[ 'sex' ]] - D_lasso_fitted.reshape( D_lasso_fitted.size, 1)

data_res = pd.DataFrame( np.hstack(( t_Y , t_D )) , columns = [ 't_Y', 't_D' ] )

# regression of Y on D after partialling-out the effect of W
partial_lasso_fit = smf.ols( formula = 't_Y ~ t_D' , data = data_res ).fit()
partial_lasso_est = partial_lasso_fit.summary2().tables[1]['Coef.']['t_D']

print( f"Coefficient for D via partialling-out using lasso {partial_lasso_est}" )

# standard error
HCV_coefs = partial_lasso_fit.cov_HC0
partial_lasso_se = np.power( HCV_coefs.diagonal() , 0.5)[1]

Coefficient for D via partialling-out using lasso -0.07998365514820807


### Summarize the results

In [1249]:
table4 = np.zeros( ( 2, 2 ) )

table4[0,0] = control_est
table4[0,1] = control_se    
table4[1,0] =  partial_lasso_est
table4[1,1] = partial_lasso_se 

table4_pandas = pd.DataFrame( table4, columns = [ "Estimate","Std. Error" ])
table4_pandas.index = [ "full reg","partial reg via lasso" ]
table4_pandas.round(8)

| | Estimate | Std. Error |
|---|---|---|
| full reg | -0.088329 | 0.071443 |
| partial reg via lasso | -0.079984 | 0.030204 |
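The "partial reg via lasso" standard error is read from `cov_HC0`, the heteroskedasticity-robust (White) covariance. As a rough illustration of what that estimator computes, here is a numpy sketch for a regression with an intercept and one regressor. The function `hc0_se` and the synthetic data are hypothetical and are not part of the notebook:

```python
import numpy as np

def hc0_se(x, y):
    """Hypothetical helper: OLS of y on [1, x] with HC0 (White) standard errors."""
    X = np.column_stack([np.ones(len(x)), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # "Meat" of the sandwich: X' diag(resid^2) X, with no small-sample correction
    meat = X.T @ (X * resid[:, None] ** 2)
    cov = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))

# Synthetic heteroskedastic data (assumption: for illustration only)
rng = np.random.default_rng(1)
x = rng.normal(size=400)
y = 0.3 - 0.1 * x + rng.normal(size=400) * (1 + np.abs(x))

beta, se = hc0_se(x, y)
print(beta, se)
```

The second entry of `se` plays the role of `np.power( HCV_coefs.diagonal() , 0.5)[1]` in the cells above, that is, the standard error of the slope.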


- Case 2: 
       
$$ W = (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)^2 $$
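In the formula syntax used below, `(a + b + ...)**2` stands for all main effects plus all two-way interactions. The following sketch only enumerates the formula terms with the standard library; it is an illustration and does not call the formula parser itself:

```python
from itertools import combinations

# Base terms inside the parentheses of the Case 2 formula
base_terms = ["exp1", "exp2", "exp3", "exp4", "shs", "hsg", "scl",
              "clg", "occ2", "ind2", "mw", "so", "we"]

# (...)**2 expands to main effects plus every pairwise interaction a:b
interactions = [f"{a}:{b}" for a, b in combinations(base_terms, 2)]
terms = base_terms + interactions

print(len(base_terms), len(interactions), len(terms))  # 13 78 91
```

The design matrix has many more columns than 91 formula terms (the notebook reports 979 controls), presumably because terms such as `occ2` and `ind2` are categorical and each expands into several indicator columns.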


In [1250]:
flex_y_1 = 'lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'
flex_d_1 = 'sex ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'

In [1251]:
# flex_y

lasso_model_1 = linear_model.Lasso( alpha = 0.5 )

flex_y_covariables_1 = smf.ols(formula = flex_y_1, data = data)
Y_lasso_fitted_1 = lasso_model_1.fit( flex_y_covariables_1.exog, data[[ 'lwage' ]] ).predict( flex_y_covariables_1.exog )
t_Y_1 = data[[ 'lwage' ]] - Y_lasso_fitted_1.reshape( Y_lasso_fitted_1.size, 1)

In [1252]:
# flex_d

flex_d_covariables_1 = smf.ols( flex_d_1, data=data)
D_lasso_fitted_1 = lasso_model_1.fit( flex_d_covariables_1.exog, data[[ 'sex' ]] ).predict( flex_d_covariables_1.exog )
t_D_1 = data[[ 'sex' ]] - D_lasso_fitted_1.reshape( D_lasso_fitted_1.size, 1)

data_res_1 = pd.DataFrame( np.hstack(( t_Y_1 , t_D_1 )) , columns = [ 't_Y', 't_D' ] )

In [1253]:
# regression of Y on D after partialling-out the effect of W

partial_lasso_fit_1 = smf.ols( formula = 't_Y ~ t_D' , data = data_res_1 ).fit()
partial_lasso_est_1 = partial_lasso_fit_1.summary2().tables[1]['Coef.']['t_D']

print( f"Coefficient for D via partialling-out using lasso {partial_lasso_est_1}" )

# standard error
HCV_coefs_1 = partial_lasso_fit_1.cov_HC0
partial_lasso_se_1 = np.power( HCV_coefs_1.diagonal() , 0.5)[1]

Coefficient for D via partialling-out using lasso -0.114063


The estimated regression coefficient $\beta_1\approx-0.1141$ measures how our lasso prediction of the log wage changes when the gender variable $D$ is switched from 0 to 1, holding the controls $W$ fixed.
We can call this the *predictive effect* (PE), as it measures the impact of a variable on the prediction we make. Overall, the wage gap is about $11.41$\% after controlling for worker characteristics.

### Summarize the results

In [1254]:
table5 = np.zeros( (1, 2) )

table5[0,0] = partial_lasso_est_1  
table5[0,1] = partial_lasso_se_1    

table5_pandas = pd.DataFrame( table5, columns = [ "Estimate","Std. Error" ])
table5_pandas.index = [ "partial reg via lasso" ]
table5_html = table5_pandas.to_html()
table5_pandas

| | Estimate | Std. Error |
|---|---|---|
| partial reg via lasso | -0.114063 | 0.028114 |


### "Extra" flexible model

In [1255]:
extraflex_1 = 'lwage ~ sex + (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'

control_fit_1 = smf.ols( formula = extraflex_1, data=data).fit()

#summary( control_fit )
control_est_1 = control_fit_1.summary2().tables[1]['Coef.']['sex']

print( f"Number of Extra-Flex Controls {control_fit_1.summary2().tables[1].shape[0]-1} \nCoefficient for OLS with extra flex controls {control_est_1}" )

# standard error
HCV_coefs_1 = control_fit_1.cov_HC0

n_1 = len(data[ 'wage' ])

p_1 = len(control_fit_1.summary2().tables[1]['Coef.'])

control_se_1 = control_fit_1.summary2().tables[1]['Std.Err.']['sex']*math.sqrt(n_1/(n_1-p_1))
control_se_1

Number of Extra-Flex Controls 979 
Coefficient for OLS with extra flex controls -0.08832946702423343


0.07144339566478773

In [1256]:
# models
# model for Y
extraflex_y_1 = 'lwage ~  (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'

# model for D
extraflex_d_1 = 'sex ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'

# extraflex_y
lasso_model_1 = linear_model.Lasso( alpha = 0.5  )

extraflex_y_covariables_1 = smf.ols(formula = extraflex_y_1, data = data)

Y_lasso_fitted_1 = lasso_model_1.fit( extraflex_y_covariables_1.exog, data[[ 'lwage' ]] ).predict( extraflex_y_covariables_1.exog )

t_Y_1 = data[[ 'lwage' ]] - Y_lasso_fitted_1.reshape( Y_lasso_fitted_1.size, 1)

# extraflex_d
extraflex_d_covariables_1 = smf.ols( extraflex_d_1, data=data)

D_lasso_fitted_1 = lasso_model_1.fit( extraflex_d_covariables_1.exog, data[[ 'sex' ]] ).predict( extraflex_d_covariables_1.exog )

t_D_1 = data[[ 'sex' ]] - D_lasso_fitted_1.reshape( D_lasso_fitted_1.size, 1)

# use the Case 2 residuals (t_Y_1, t_D_1), not the Case 1 ones
data_res_1 = pd.DataFrame( np.hstack(( t_Y_1 , t_D_1 )) , columns = [ 't_Y', 't_D' ] )

# regression of Y on D after partialling-out the effect of W
partial_lasso_fit_1 = smf.ols( formula = 't_Y ~ t_D' , data = data_res_1 ).fit()
partial_lasso_est_1 = partial_lasso_fit_1.summary2().tables[1]['Coef.']['t_D']

print( f"Coefficient for D via partialling-out using lasso {partial_lasso_est_1}" )

# standard error
HCV_coefs_1 = partial_lasso_fit_1.cov_HC0
partial_lasso_se_1 = np.power( HCV_coefs_1.diagonal() , 0.5)[1]

Coefficient for D via partialling-out using lasso -0.114063


### Summarize the results

In [1257]:
table6 = np.zeros( ( 2, 2 ) )

table6[0,0] = control_est_1
table6[0,1] = control_se_1    
table6[1,0] =  partial_lasso_est_1
table6[1,1] = partial_lasso_se_1 

table6_pandas = pd.DataFrame( table6, columns = [ "Estimate","Std. Error" ])
table6_pandas.index = [ "full reg","partial reg via lasso" ]
table6_pandas.round(8)

| | Estimate | Std. Error |
|---|---|---|
| full reg | -0.088329 | 0.071443 |
| partial reg via lasso | -0.114063 | 0.028114 |
