* Created by: Anzony Quispe

This notebook contains an example for teaching.

# Introduction

In labor economics, an important question is what determines workers' wages. This is a causal question, but we can begin by investigating it from a predictive perspective.

In the following wage example, $Y$ is the hourly wage of a worker and $X$ is a vector of the worker's characteristics, e.g., education, experience, gender. Two main questions here are:

* How to use job-relevant characteristics, such as education and experience, to best predict wages?

* What is the difference in predicted wages between men and women with the same job-relevant characteristics?

In this lab, we focus on the prediction question first.

# Data

The data set we consider is from the March Supplement of the U.S. Current Population Survey, year 2015. We select white non-Hispanic individuals, aged 25 to 64 years, and working more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with hourly wage below $3$.

The variable of interest $Y$ is the hourly wage rate, constructed as the ratio of annual earnings to the total number of hours worked, which is in turn constructed as the product of the number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never married) workers. The final sample is of size **n=5150**.
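The construction of the hourly wage described above can be sketched as follows. This is a minimal illustration, not the code used to build the data set: the function name `hourly_wage` and the example figures are hypothetical.

```python
def hourly_wage(annual_earnings, weeks_worked, usual_hours_per_week):
    """Hourly wage = annual earnings / (weeks worked * usual weekly hours)."""
    total_hours = weeks_worked * usual_hours_per_week
    if total_hours <= 0:
        raise ValueError("total hours worked must be positive")
    return annual_earnings / total_hours


# Hypothetical worker: 50,000 dollars per year, 52 weeks, 40 hours per week
example_wage = hourly_wage(50_000, 52, 40)
print(round(example_wage, 2))
```

A worker like this one would pass the sample filters stated above (more than 35 hours per week, at least 50 weeks, hourly wage above 3).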

# Data analysis


We start by loading the data set.

In [1]:
# Import relevant packages
import pandas as pd
import numpy as np
import pyreadr

In [2]:
rdata_read = pyreadr.read_r("../data/wage2015_subsample_inference.Rdata")
data = rdata_read[ 'data' ]
type(data)
data.shape

(5150, 20)

Let's have a look at the structure of the data.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5150 entries, 10 to 32643
Data columns (total 20 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   wage    5150 non-null   float64 
 1   lwage   5150 non-null   float64 
 2   sex     5150 non-null   float64 
 3   shs     5150 non-null   float64 
 4   hsg     5150 non-null   float64 
 5   scl     5150 non-null   float64 
 6   clg     5150 non-null   float64 
 7   ad      5150 non-null   float64 
 8   mw      5150 non-null   float64 
 9   so      5150 non-null   float64 
 10  we      5150 non-null   float64 
 11  ne      5150 non-null   float64 
 12  exp1    5150 non-null   float64 
 13  exp2    5150 non-null   float64 
 14  exp3    5150 non-null   float64 
 15  exp4    5150 non-null   float64 
 16  occ     5150 non-null   category
 17  occ2    5150 non-null   category
 18  ind     5150 non-null   category
 19  ind2    5150 non-null   category
dtypes: category(4), float64(16)
memory usage: 740.3+ KB


In [4]:
data.describe()

| | wage | lwage | sex | shs | hsg | scl | clg | ad | mw | so | we | ne | exp1 | exp2 | exp3 | exp4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 |
| mean | 23.41041 | 2.970787 | 0.444466 | 0.023301 | 0.243883 | 0.278058 | 0.31767 | 0.137087 | 0.259612 | 0.296505 | 0.216117 | 0.227767 | 13.760583 | 3.018925 | 8.235867 | 25.118038 |
| std | 21.003016 | 0.570385 | 0.496955 | 0.150872 | 0.429465 | 0.448086 | 0.465616 | 0.343973 | 0.438464 | 0.456761 | 0.411635 | 0.419432 | 10.609465 | 4.000904 | 14.488962 | 53.530225 |
| min | 3.021978 | 1.105912 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 13.461538 | 2.599837 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.25 | 0.125 | 0.0625 |
| 50% | 19.230769 | 2.956512 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | 1.0 | 1.0 | 1.0 |
| 75% | 27.777778 | 3.324236 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 21.0 | 4.41 | 9.261 | 19.4481 |
| max | 528.845673 | 6.270697 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 47.0 | 22.09 | 103.823 | 487.9681 |


### Cleaning data

We focus on people who did not go to college. The variables `shs` (Some High School) and `hsg` (High School Graduate) identify these workers.

In [5]:
df1 = data[ ( data.shs == 1 ) | ( data.hsg == 1 ) ].copy()

We construct the outcome variable **Y** (the log wage) and the matrix **Z**, which includes the worker characteristics given in the data.

In [6]:
Y = np.log(df1['wage'])  # natural log, consistent with the lwage column
n = len(Y)
z = df1.loc[:, ~df1.columns.isin(['wage', 'lwage','Unnamed: 0'])]
p = z.shape[1]

print("Number of observations:", n, '\n')
print( "Number of raw regressors:", p)

Number of observations: 1376 

Number of raw regressors: 18


For the outcome variable *lwage* and a subset of the raw regressors, we calculate the empirical mean to get familiar with the data.

In [7]:
Z_subset = df1.loc[:, df1.columns.isin(["lwage","sex","shs","hsg","scl","clg","ad","mw","so","we","ne","exp1"])]
table = Z_subset.mean(axis=0)
table

lwage     2.718562
sex       0.321948
shs       0.087209
hsg       0.912791
scl       0.000000
clg       0.000000
ad        0.000000
mw        0.286337
so        0.291424
we        0.198401
ne        0.223837
exp1     17.190044
dtype: float64

In [8]:
# Turn the Series of means into a one-column DataFrame
table = table.to_frame(name="Sample mean")
index1 = list(table.index)
index2 = ["Log Wage","Sex","Some High School","High School Graduate",\
          "Some College","College Graduate", "Advanced Degree","Midwest",\
          "South","West","Northeast","Experience"]

In [9]:
table = table.rename(index=dict(zip(index1,index2)))
table

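| | Sample mean |
|---|---|
| Log Wage | 2.718562 |
| Sex | 0.321948 |
| Some High School | 0.087209 |
| High School Graduate | 0.912791 |
| Some College | 0.0 |
| College Graduate | 0.0 |
| Advanced Degree | 0.0 |
| Midwest | 0.286337 |
| South | 0.291424 |
| West | 0.198401 |
| Northeast | 0.223837 |
| Experience | 17.190044 |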


For example, the share of female workers in our sample is about 32% ($sex=1$ if female).

Alternatively, we can print the table as LaTeX.

In [10]:
print( table.to_latex() )

\begin{tabular}{lr}
\toprule
{} &  Sample mean \\
\midrule
Log Wage             &     2.718562 \\
Sex                  &     0.321948 \\
Some High School     &     0.087209 \\
High School Graduate &     0.912791 \\
Some College         &     0.000000 \\
College Graduate     &     0.000000 \\
Advanced Degree      &     0.000000 \\
Midwest              &     0.286337 \\
South                &     0.291424 \\
West                 &     0.198401 \\
Northeast            &     0.223837 \\
Experience           &    17.190044 \\
\bottomrule
\end{tabular}



## Prediction Question

Now, we will construct a prediction rule for the log hourly wage $Y$, which depends linearly on job-relevant characteristics $X$:

\begin{equation}\label{decompose}
Y = \beta'X+ \epsilon.
\end{equation}

Our goals are

* Predict wages  using various characteristics of workers.

* Assess the predictive performance using the (adjusted) sample MSE, the (adjusted) sample $R^2$ and the out-of-sample MSE and $R^2$.
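The in-sample measures named in the goals above can be sketched as follows. This is a minimal illustration on synthetic data; the helper `in_sample_metrics` is hypothetical, and it assumes the degrees-of-freedom adjustments $MSE_{adj} = \frac{n}{n-p} MSE$ and $R^2_{adj} = 1 - (1-R^2)\frac{n-1}{n-p}$, where $p$ counts all fitted coefficients including the intercept.

```python
import numpy as np


def in_sample_metrics(y, y_hat, p):
    """Return (MSE, adjusted MSE, R2, adjusted R2).

    p is the number of fitted coefficients, intercept included.
    """
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.size
    mse = np.mean((y - y_hat) ** 2)
    mse_adj = n / (n - p) * mse                      # degrees-of-freedom correction
    r2 = 1.0 - mse / np.var(y)
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p)
    return mse, mse_adj, r2, r2_adj


# Synthetic data (not the wage data): intercept plus three regressors
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta = np.array([1.0, 0.5, -0.25, 0.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

# OLS fit via least squares
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
mse, mse_adj, r2, r2_adj = in_sample_metrics(y, X @ beta_hat, p=X.shape[1])
print(mse, mse_adj, r2, r2_adj)
```

The adjustment matters little when $p/n$ is small and becomes large as $p$ approaches $n$.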


We employ two different specifications for prediction:


1. Basic Model:   $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators,  occupation and industry indicators, regional indicators).


2. Flexible Model:  $X$ consists of all raw regressors from the basic model plus transformations of experience (e.g., ${exp}^2$ and ${exp}^3$) and additional two-way interactions of the polynomial in experience with other regressors. An example of a regressor created through a two-way interaction is *experience* times the indicator of having a *college degree*.

Using the **Flexible Model** enables us to approximate the real relationship by a more complex regression model and therefore to reduce the bias. The **Flexible Model** increases the range of potential shapes of the estimated regression function. In general, flexible models often deliver good prediction accuracy but are harder to interpret.
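To make the idea of two-way interactions concrete, here is a small sketch that builds all pairwise products of a few columns by hand. The helper `add_two_way_interactions` and the toy values are hypothetical; in the notebook itself the interactions are generated by the `**2` operator in the model formula.

```python
from itertools import combinations

import numpy as np


def add_two_way_interactions(columns):
    """Given a dict name -> 1-D array, return a new dict that also
    contains the product of every pair of distinct columns."""
    out = dict(columns)
    for a, b in combinations(columns, 2):
        out[f"{a}:{b}"] = columns[a] * columns[b]
    return out


# Hypothetical toy data: experience, a college indicator, a region indicator
raw = {
    "exp1": np.array([2.0, 10.0, 25.0]),
    "clg": np.array([1.0, 0.0, 1.0]),
    "so": np.array([0.0, 1.0, 1.0]),
}
flex = add_two_way_interactions(raw)
print(list(flex))
print(flex["exp1:clg"])  # experience times the college indicator
```

With $k$ raw columns this adds $k(k-1)/2$ interaction columns, which is why the flexible specification grows so quickly.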

Now, let us fit both models to our data by running ordinary least squares (OLS):

In [11]:
# Import packages for OLS regression
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [12]:
# 1. ORIGINAL basic model
basic_ori = 'lwage ~  sex + exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2'
basic_results1_ori = smf.ols(basic_ori , data=df1)
basic_results_ori = basic_results1_ori.fit()
print(basic_results_ori.summary()) # estimated coefficients
print( "Number of regressors in the original basic model:",len(basic_results_ori.params), '\n')  
# number of regressors in the Basic Model
# we have to drop  hsg+ scl + clg

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.180
Model:                            OLS   Adj. R-squared:                  0.151
Method:                 Least Squares   F-statistic:                     6.212
Date:                Sat, 09 Oct 2021   Prob (F-statistic):           9.07e-33
Time:                        18:35:49   Log-Likelihood:                -872.87
No. Observations:                1376   AIC:                             1842.
Df Residuals:                    1328   BIC:                             2093.
Df Model:                          47                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0216      0.062     32.368      0.0

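We have to drop the hsg, scl and clg variables because of multicollinearity: in this subsample every worker has either shs = 1 or hsg = 1, so hsg equals one minus shs and is collinear with the intercept, while scl and clg are zero for everyone.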
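The multicollinearity can be checked numerically by comparing the number of columns of the design matrix with its rank. The following is a minimal sketch on a hypothetical six-worker sample, not on the wage data.

```python
import numpy as np

# Hypothetical subsample without college education: every worker is either
# "some high school" (shs) or "high school graduate" (hsg), never scl.
shs = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
hsg = 1.0 - shs
scl = np.zeros_like(shs)
intercept = np.ones_like(shs)

X_full = np.column_stack([intercept, shs, hsg, scl])
X_reduced = np.column_stack([intercept, shs])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)        # 4 columns, rank 2
print(X_reduced.shape[1], rank_reduced)  # 2 columns, rank 2
```

When the rank is below the number of columns, the OLS coefficients are not uniquely determined, so the redundant columns should be removed.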

In [13]:
# 1. basic model
basic = 'lwage ~ sex + exp1 + shs + mw + so + we + occ2+ ind2'
basic_results1 = smf.ols(basic , data=df1)
basic_results = basic_results1.fit()
print(basic_results.summary()) # estimated coefficients
print( "Number of regressors in the basic model:",len(basic_results.params), '\n')  
# number of regressors in the Basic Model
# hsg, scl and clg have been dropped

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.180
Model:                            OLS   Adj. R-squared:                  0.151
Method:                 Least Squares   F-statistic:                     6.212
Date:                Sat, 09 Oct 2021   Prob (F-statistic):           9.07e-33
Time:                        18:35:49   Log-Likelihood:                -872.87
No. Observations:                1376   AIC:                             1842.
Df Residuals:                    1328   BIC:                             2093.
Df Model:                          47                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0729      0.092     33.528      0.0

##### Note that the basic model consists of $48$ regressors (including the intercept). Three collinear regressors were dropped.

In [14]:
# X_var = pd.DataFrame( basic_results1.exog , columns = basic_results1.exog_names )

# basic_results1.exog_names[-4]
# basic_results1.exog_names[-5]

# Q1 = np.linalg.qr( X_var.values )[0]

# r = np.linalg.qr( X_var.values )[1]

# clean_names = np.array(basic_results1.exog_names)[ (r.sum( axis = 0 ) != 0 ).tolist()].tolist()

# new_xvar = X_var.loc[ : , clean_names ]

# Q1 = np.linalg.qr( new_xvar.values )[0]

# r = np.linalg.qr( new_xvar.values )[1]



# import sparseqr



# abs(r.sum( axis = 0)) < 1e-07

# abs(r.sum( axis = 0)) < 1e-07

# r[ : , [-4, -5 ] ]

# np.linalg.qr( X_var.values )[ 1 ]

# def back_substitution(A: np.ndarray, b: np.ndarray) -> np.ndarray:
#     n = b.size
#     x = np.zeros_like(b)

#     if A[n-1, n-1] == 0:
#         raise ValueError

#     for i in range(n-1, 0, -1):
#         x[i] = A[i, i]/b[i]
#         for j in range (i-1, 0, -1):
#             A[i, i] += A[j, i]*x[i]

#     return x

# !pip install qr

# !pip install redis

# import  qr

# np.linalg.inv( r )

# np.dot( np.linalg.inv( r ) , Q1 )

# c(backsolve(R, t(Q1) %*% y))

In [15]:
# 2. flexible model
flex_ori = 'lwage ~ sex + ( exp1 + exp2 + exp3 + exp4 + shs + hsg + scl + clg + occ2 + ind2 + mw + so + we )**2'
flex_results_0_ori = smf.ols( flex_ori , data = df1 )
flex_results_ori = flex_results_0_ori.fit()  # fit the model object built above instead of rebuilding it
print( flex_results_ori.summary() ) # estimated coefficients
print( "Number of regressors in the flexible model:", len( flex_results_ori.params ), '\n' ) 
# number of regressors in the Flexible Model

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.510
Model:                            OLS   Adj. R-squared:                  0.235
Method:                 Least Squares   F-statistic:                     1.856
Date:                Sat, 09 Oct 2021   Prob (F-statistic):           8.92e-16
Time:                        18:35:52   Log-Likelihood:                -518.83
No. Observations:                1376   AIC:                             2028.
Df Residuals:                     881   BIC:                             4615.
Df Model:                         494                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                 3.55

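Note that the flexible specification generates $979$ candidate regressors (excluding the intercept), but the fitted model has only $494$ model degrees of freedom, so many of these columns are redundant in this subsample.
To reproduce the R results exactly, we use the `clean_data_flex` data set exported from R, which contains $495$ regressor columns.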

#### Import data from R

In [16]:
rdata_read = pyreadr.read_r("../data/wg2_clean_data_flex.RData")
clean_data_flex = rdata_read[ 'clean_data_flex' ]
clean_data_flex['lwage'] = df1[ 'lwage' ].copy()
print( type( clean_data_flex ) )
print( clean_data_flex.shape )

flex_results = sm.OLS( clean_data_flex['lwage'], clean_data_flex.iloc[ : , :-1 ]  ).fit()

# 2. flexible model
# flex = "lwage~" + all_columns
# flex_results = smf.ols(flex , data = clean_data_flex ).fit()
print(flex_results.summary()) # estimated coefficients
print( "Number of regressors in the flexible model:", len( flex_results.params ), '\n' ) 
# number of regressors in the Flexible Model

<class 'pandas.core.frame.DataFrame'>
(1376, 496)
                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.510
Model:                            OLS   Adj. R-squared:                  0.235
Method:                 Least Squares   F-statistic:                     1.856
Date:                Sat, 09 Oct 2021   Prob (F-statistic):           8.92e-16
Time:                        18:35:53   Log-Likelihood:                -518.83
No. Observations:                1376   AIC:                             2028.
Df Residuals:                     881   BIC:                             4615.
Df Model:                         494                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------

## Try Lasso next

In [17]:
import hdmpy

In [18]:
X_vars = pd.DataFrame(flex_results_0_ori.exog[:, 1:], columns = flex_results_0_ori.exog_names[1:] )
X_vars.shape

(1376, 979)

In [19]:
fit_rlasso = hdmpy.rlasso( X_vars, df1['lwage']  , post = True )



In [20]:
y_hat = np.dot( sm.add_constant( X_vars ).values , fit_rlasso.est['coefficients'].values )

Now, we can evaluate the performance of all three models based on the (adjusted) $R^2_{sample}$ and the (adjusted) $MSE_{sample}$:

In [21]:
# Import relevant packages for lasso 
from sklearn.linear_model import LassoCV
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [22]:
# Assess the predictive performance
R2_1 = basic_results.rsquared
print("R-squared for the basic model: ", R2_1, "\n")
R2_adj1 = basic_results.rsquared_adj
print("adjusted R-squared for the basic model: ", R2_adj1, "\n")


R2_2 = flex_results.rsquared
print("R-squared for the flexible model: ", R2_2, "\n")
R2_adj2 = flex_results.rsquared_adj
print("adjusted R-squared for the flexible model: ", R2_adj2, "\n")

R2_L = r2_score( df1['lwage'].values ,  y_hat )
print("R-squared for LASSO: ", R2_L, "\n")
n_obs = X_vars.shape[0]
n_predictors = sum((fit_rlasso.est['coefficients'] != 0).iloc[:, 0]) - 1
R2_adjL = 1 - ( ( 1 - R2_L ) * ( n_obs -1 ) / ( n_obs - n_predictors - 1 ) )
print("adjusted R-squared for LASSO: ", R2_adjL, "\n")

R-squared for the basic model:  0.18023814876721 

adjusted R-squared for the basic model:  0.15122549288773612 

R-squared for the flexible model:  0.5099981483772937 

adjusted R-squared for the flexible model:  0.2352411509861282 

R-squared for LASSO:  0.08894150875845319 

adjusted R-squared for LASSO:  0.0836097838645743 



In [23]:
# calculating the MSE
MSE1 =  np.mean(basic_results.resid**2)
print("MSE for the basic model: ", MSE1, "\n")
p1 = len(basic_results.params) # number of regressors
n = df1.shape[0]
MSE_adj1  = (n/(n-p1))*MSE1
print("adjusted MSE for the basic model: ", MSE_adj1, "\n")

MSE2 =  np.mean(flex_results.resid**2)
print("MSE for the flexible model: ", MSE2, "\n")
p2 = len(flex_results.params) # number of regressors
n = df1.shape[0]
MSE_adj2  = (n/(n-p2))*MSE2
print("adjusted MSE for the flexible model: ", MSE_adj2, "\n")


MSEL = mean_squared_error( df1['lwage'].values ,  y_hat  )
print("MSE for the LASSO model: ", MSEL, "\n")
pL = fit_rlasso.est['coefficients'].shape[0]
n = X_vars.shape[0]
MSE_adjL  = (n/(n-pL))*MSEL
print("adjusted MSE for LASSO model: ", MSE_adjL, "\n")

MSE for the basic model:  0.20821908377460785 

adjusted MSE for the basic model:  0.21574507475441296 

MSE for the flexible model:  0.12446021541415708 

adjusted MSE for the flexible model:  0.19438962135060175 

MSE for the LASSO model:  0.23140838284449752 

adjusted MSE for LASSO model:  0.8040856939243146 



In [24]:
# Package for latex table 
import array_to_latex as a2l

table = np.zeros((3, 5))
table[0,0:5] = [p1, R2_1, MSE1, R2_adj1, MSE_adj1]
table[1,0:5] = [p2, R2_2, MSE2, R2_adj2, MSE_adj2]
table[2,0:5] = [pL, R2_L, MSEL, R2_adjL, MSE_adjL]
table

array([[4.80000000e+01, 1.80238149e-01, 2.08219084e-01, 1.51225493e-01,
        2.15745075e-01],
       [4.95000000e+02, 5.09998148e-01, 1.24460215e-01, 2.35241151e-01,
        1.94389621e-01],
       [9.80000000e+02, 8.89415088e-02, 2.31408383e-01, 8.36097839e-02,
        8.04085694e-01]])

In [25]:
table = pd.DataFrame(table, columns = ["p","$R^2_{sample}$","$MSE_{sample}$","$R^2_{adjusted}$", "$MSE_{adjusted}$"], \
                      index = ["basic reg","flexible reg", "lasso flex"])


In [26]:
table

| | p | $R^2_{sample}$ | $MSE_{sample}$ | $R^2_{adjusted}$ | $MSE_{adjusted}$ |
|---|---|---|---|---|---|
| basic reg | 48.0 | 0.180238 | 0.208219 | 0.151225 | 0.215745 |
| flexible reg | 495.0 | 0.509998 | 0.12446 | 0.235241 | 0.19439 |
| lasso flex | 980.0 | 0.088942 | 0.231408 | 0.08361 | 0.804086 |


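Considering the measures above, the flexible model attains a higher (adjusted) $R^2$ and a lower (adjusted) $MSE$ in sample than the basic model, while lasso has the weakest in-sample fit. Note that the adjusted MSE reported for lasso uses $p = 980$, the number of all candidate coefficients including those set to zero, which inflates it. For the flexible model, however, $p/n = 495/1376 \approx 0.36$ is not small, so the gap between the unadjusted and adjusted measures is large and the in-sample measures may be overly optimistic.

One procedure to circumvent this issue is to use **data splitting**, which is described and applied in the following.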

## Data Splitting

Measure the prediction quality of the three models via data splitting:

- Randomly split the data into one training sample and one testing sample. Here we just use a simple method (stratified splitting is a more sophisticated version of splitting that we can consider).
- Use the training sample for estimating the parameters of the Basic Model and the Flexible Model.
- Use the testing sample for evaluation. Predict the $\mathtt{lwage}$ of every observation in the testing sample based on the parameters estimated in the training sample.
- Calculate the Mean Squared Prediction Error $MSE_{test}$ based on the testing sample for both prediction models. 
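The steps above can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the notebook's own procedure: the helpers `train_test_split_indices` and `evaluate_on_test` are hypothetical, and the split uses a random permutation (sampling without replacement) rather than sorting by random integer keys.

```python
import numpy as np


def train_test_split_indices(n, train_share=0.8, seed=0):
    """Shuffle 0..n-1 and split into training and testing indices."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(np.floor(n * train_share))
    return perm[:n_train], perm[n_train:]


def evaluate_on_test(y_test, y_pred):
    """Out-of-sample MSE and R2 (R2 relative to the test-sample variance)."""
    mse = np.mean((y_test - y_pred) ** 2)
    r2 = 1.0 - mse / np.var(y_test)
    return mse, r2


# Synthetic data (not the wage data): intercept plus two regressors
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([3.0, 0.4, -0.2]) + rng.normal(scale=0.5, size=n)

train_idx, test_idx = train_test_split_indices(n)

# Estimate on the training sample only, then predict on the testing sample
beta_hat, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
mse_test, r2_test = evaluate_on_test(y[test_idx], X[test_idx] @ beta_hat)
print(len(train_idx), len(test_idx), mse_test, r2_test)
```

Unlike the in-sample $R^2$, the test $R^2$ computed this way can be negative when a model overfits the training sample.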

In [27]:
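# Import relevant packages for splitting data
import math

# Set seed to make the results replicable (generating random numbers)
np.random.seed(0)
# Draw a random integer key for each observation; sorting by this key shuffles the rows.
# The name `random` shadows the stdlib module of the same name, which is therefore not imported.
random = np.random.randint(0,n, size=math.floor(n))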

## Generating train and test data

In [28]:
# for basic, flex and lasso model
df1["random"] = random
random    # the array does not change 

array([ 684,  559, 1216, ..., 1294,  573, 1367])

In [30]:
# basic
data_2 = df1.sort_values(by=['random'])
data_2.head()

# Create training and testing sample 
train_basic = data_2[ : math.floor(n*4/5)]    # training sample
test_basic =  data_2[ math.floor(n*4/5) : ]   # testing sample
print(train_basic.shape)
print(test_basic.shape)

(1100, 21)
(276, 21)


##### Basic Model

In [31]:
# Basic Model
basic = 'lwage ~ sex + exp1 + shs + mw + so + we + occ2+ ind2'
basic_results = smf.ols(basic , data = train_basic ).fit()

In [32]:
lwage_test = test_basic["lwage"].values
lwage_pred = basic_results.predict( sm.add_constant( test_basic.iloc[ : , :-1 ] ) ).values

MSE_test1 = sum( ( lwage_test - lwage_pred ) ** 2 ) / lwage_test.size
R2_test1 = 1- ( MSE_test1 / np.var( lwage_test ) )

print("Test MSE for the basic model: ", MSE_test1, "\n")
print("Test R2 for the basic model: ", R2_test1)

Test MSE for the basic model:  0.19987102511489996 

Test R2 for the basic model:  0.04156410192333371


##### Flexible Model

In [33]:
# FLEX MODEL
clean_data_flex[ "random" ] = random
data_flex = clean_data_flex.sort_values( by = [ 'random' ] )
data_flex.head()

# Create training and testing sample
# FLEX MODEL
train_flex = data_flex[ : math.floor(n*4/5)]    # training sample
test_flex =  data_flex[ math.floor(n*4/5) : ]   # testing sample
print(train_flex.shape)
print(test_flex.shape)

(1100, 497)
(276, 497)


In [34]:
regflex = sm.OLS( train_flex['lwage'], train_flex.iloc[ : , :-2 ]  ).fit()

trainregflex = regflex.predict( test_flex.iloc[ : , :-2 ]  ).values
lwage_test = test_flex["lwage"].values


MSE_test2 = sum( ( lwage_test - trainregflex ) ** 2 ) / lwage_test.size
R2_test2 = 1- ( MSE_test2 / np.var( lwage_test ) )

print("Test MSE for the flexible model: ", MSE_test2, "\n")
print("Test R2 for the flexible model: ", R2_test2)

Test MSE for the flexible model:  47.249616940689414 

Test R2 for the flexible model:  -225.57475749821515


##### Lasso Model (hdmpy)

In [35]:
X_vars = pd.DataFrame( flex_results_0_ori.exog[ : , 1: ], columns = flex_results_0_ori.exog_names[ 1: ] )
X_vars[ 'lwage' ] = df1.reset_index( drop = True ).copy().loc[ :, 'lwage']
X_vars[ "random" ] = random
X_vars2 = X_vars.sort_values(by=['random'])
train_lasso = X_vars2[ : math.floor(n*4/5)]    # training sample
test_lasso =  X_vars2[ math.floor(n*4/5) : ]   # testing sample

In [36]:
fit_rlasso = hdmpy.rlasso( train_lasso.iloc[ :  , :-2 ] , train_lasso['lwage']  , post = False )



In [37]:
trainreglasso = np.dot( sm.add_constant( test_lasso.iloc[ : , :-2 ] ).values , fit_rlasso.est['coefficients'].values )

In [38]:
lwage_test_lasso = test_lasso[ 'lwage' ].values  # outcomes in the lasso testing sample
MSE_lasso = sum( ( lwage_test_lasso - trainreglasso.ravel() ) ** 2 ) / lwage_test_lasso.size
R2_lasso = 1- ( MSE_lasso / np.var( lwage_test_lasso ) )

print("Test MSE for the lasso model: ", MSE_lasso , "\n")
print("Test R2 for the lasso model: ", R2_lasso )

Test MSE for the lasso model:  0.19533619446587505 

Test R2 for the lasso model:  0.06330984762722047


Finally, let us summarize the results:

In [39]:
# Package for latex table 
import array_to_latex as a2l

table2 = np.zeros((3, 2))
table2[0,0] = MSE_test1
table2[1,0] = MSE_test2
table2[2,0] = MSE_lasso
table2[0,1] = R2_test1
table2[1,1] = R2_test2
table2[2,1] = R2_lasso

table2 = pd.DataFrame(table2, columns = ["$MSE_{test}$", "$R^2_{test}$"], \
                      index = ["basic reg","flexible reg","lasso regression"])
table2

| | $MSE_{test}$ | $R^2_{test}$ |
|---|---|---|
| basic reg | 0.199871 | 0.041564 |
| flexible reg | 47.249617 | -225.574757 |
| lasso regression | 0.195336 | 0.06331 |


In [40]:
print(table2.to_latex())

\begin{tabular}{lrr}
\toprule
{} &  \$MSE\_\{test\}\$ &  \$R\textasciicircum 2\_\{test\}\$ \\
\midrule
basic reg        &      0.199871 &      0.041564 \\
flexible reg     &     47.249617 &   -225.574757 \\
lasso regression &      0.195336 &      0.063310 \\
\bottomrule
\end{tabular}



# Partialling Out

To address the second question from the introduction, we estimate the difference in predicted log wages between women and men by partialling out: we use lasso to residualize both `lwage` and `sex` on the remaining controls and then regress the wage residuals on the sex residuals. We do this first with the controls of the basic model and then with those of the flexible model.

In [41]:
data2 = pd.DataFrame( basic_results1_ori.exog , columns = basic_results1_ori.exog_names )
data2['lwage'] = df1['lwage'].reset_index( drop = True ).copy()
# fit_rlasso = hdmpy.rlasso( train_lasso.iloc[ :  , :-2 ] , train_lasso['lwage']  , post = False ).est['res']
# fit_rlasso = hdmpy.rlasso( train_lasso.iloc[ :  , :-2 ] , train_lasso['lwage']  , post = False )

rl = hdmpy.rlasso( data2.iloc[ : , 1:-1 ].drop(['sex'] , axis = 1 ) , data2['lwage']  , post = False ).est['residuals']
rs = hdmpy.rlasso( data2.iloc[ : , 1:-1 ].drop(['sex'] , axis = 1 ) , data2['sex']  , post = False ).est['residuals']

basic_partial_est = sm.OLS( rl, rs ).fit().summary2().tables[1].iloc[0, 0]
print("Coefficient for SEX via basic partialling-out",basic_partial_est)



Coefficient for SEX via basic partialling-out -0.08315441617892128




In [42]:
data2 = pd.DataFrame( flex_results_0_ori.exog , columns = flex_results_0_ori.exog_names )
data2['lwage'] = df1['lwage'].reset_index( drop = True ).copy()


rl = hdmpy.rlasso( data2.iloc[ : , 1:-1 ].drop(['sex'] , axis = 1 ) , data2['lwage']  , post = False ).est['residuals']
rs = hdmpy.rlasso( data2.iloc[ : , 1:-1 ].drop(['sex'] , axis = 1 ) , data2['sex']  , post = False ).est['residuals']

flex_partial_est = sm.OLS( rl, rs ).fit().summary2().tables[1].iloc[0, 0]
print("Coefficient for SEX via flex partialling-out",flex_partial_est)


Coefficient for SEX via flex partialling-out -0.082977034026189
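
The residual-on-residual regression used for the two estimates above follows the partialling-out logic of the Frisch-Waugh-Lovell theorem. The sketch below illustrates that logic on synthetic data, with plain OLS swapped in for lasso in the residualizing step, so that the partialled-out coefficient coincides exactly with the coefficient from the full regression. All names and data here are hypothetical.

```python
import numpy as np


def ols_residuals(X, y):
    """Residuals from an OLS regression of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef


# Synthetic data (not the wage data)
rng = np.random.default_rng(0)
n = 1000
W = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # controls, incl. intercept
d = (W[:, 1] + rng.normal(size=n) > 0).astype(float)          # binary regressor of interest
y = -0.08 * d + W @ np.array([3.0, 0.2, -0.1, 0.05]) + rng.normal(scale=0.5, size=n)

# Coefficient on d from the full regression of y on (d, W)
full_coef, *_ = np.linalg.lstsq(np.column_stack([d, W]), y, rcond=None)

# Partialling out: residualize y and d on W, then regress residual on residual
ry = ols_residuals(W, y)
rd = ols_residuals(W, d)
partial_coef = (rd @ ry) / (rd @ rd)

print(full_coef[0], partial_coef)
```

With lasso in place of OLS the equality is no longer exact, but the same two-step structure applies and remains usable when the number of controls is large relative to the sample size.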
