# OLS and lasso for gender wage gap inference

## An inferential problem: The Gender Wage Gap

In the previous lab, we already analyzed data from the March Supplement of the U.S. Current Population Survey (2015) and answered the question how to use job-relevant characteristics, such as education and experience, to best predict wages. Now, we focus on the following inference question:

What is the difference in predicted wages between men and women with the same job-relevant characteristics?

Thus, we analyze if there is a difference in the payment of men and women (*gender wage gap*). The gender wage gap may partly reflect *discrimination* against women in the labor market or may partly reflect a *selection effect*, namely that women are relatively more likely to take on occupations that pay somewhat less (for example, school teaching).

To investigate the gender wage gap, we consider the following log-linear regression model

$$
\begin{align}
\log(Y) &= \beta'X + \epsilon\\
&= \beta_1 D  + \beta_2' W + \epsilon,
\end{align}
$$

where $Y$ is hourly wage, $D$ is the indicator of being female ($1$ if female and $0$ otherwise) and the
$W$'s are a vector of worker characteristics explaining variation in wages. Considering transformed wages by the logarithm, we are analyzing the relative difference in the payment of men and women.

## Data analysis

We consider the same subsample of the U.S. Current Population Survey (2015) as in the previous lab. Let us load the data set.

In [7]:
import pandas as pd
import numpy as np
import pyreadr
import os
from urllib.request import urlopen
import math

In [8]:
link="https://raw.githubusercontent.com/d2cml-ai/14.388_py/main/data/wage2015_subsample_inference.Rdata"
response = urlopen(link)
content = response.read()
fhandle = open( 'wage2015_subsample_inference.Rdata', 'wb')
fhandle.write(content)
fhandle.close()
result = pyreadr.read_r("wage2015_subsample_inference.Rdata")
os.remove("wage2015_subsample_inference.Rdata")

# Extracting the data frame from rdata_read
data = result[ 'data' ]
data.shape

(5150, 20)

To start our (causal) analysis, we compare the sample means given gender:

In [9]:
Z = data[ ["lwage","sex","shs","hsg","scl","clg","ad","ne","mw","so","we","exp1"] ]

data_female = data[data[ 'sex' ] == 1 ]
Z_female = data_female[ ["lwage","sex","shs","hsg","scl","clg","ad","ne","mw","so","we","exp1"] ]

data_male = data[ data[ 'sex' ] == 0 ]
Z_male = data_male[ [ "lwage","sex","shs","hsg","scl","clg","ad","ne","mw","so","we","exp1" ] ]


table = np.zeros( (12, 3) )
table[:, 0] = Z.mean().values
table[:, 1] = Z_male.mean().values
table[:, 2] = Z_female.mean().values
table_pandas = pd.DataFrame( table, columns = [ 'All', 'Men', 'Women'])
table_pandas.index = ["Log Wage","Sex","Less then High School","High School Graduate","Some College","Gollage Graduate","Advanced Degree", "Northeast","Midwest","South","West","Experience"]
table_html = table_pandas.to_html()

table_pandas

Unnamed: 0,All,Men,Women
Log Wage,2.970787,2.98783,2.949485
Sex,0.444466,0.0,1.0
Less then High School,0.023301,0.031807,0.012669
High School Graduate,0.243883,0.294303,0.180865
Some College,0.278058,0.273331,0.283967
Gollage Graduate,0.31767,0.293953,0.347313
Advanced Degree,0.137087,0.106606,0.175186
Northeast,0.227767,0.22195,0.235037
Midwest,0.259612,0.259,0.260376
South,0.296505,0.298148,0.294452


In [10]:
print( table_html )

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>All</th>
      <th>Men</th>
      <th>Women</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Log Wage</th>
      <td>2.970787</td>
      <td>2.987830</td>
      <td>2.949485</td>
    </tr>
    <tr>
      <th>Sex</th>
      <td>0.444466</td>
      <td>0.000000</td>
      <td>1.000000</td>
    </tr>
    <tr>
      <th>Less then High School</th>
      <td>0.023301</td>
      <td>0.031807</td>
      <td>0.012669</td>
    </tr>
    <tr>
      <th>High School Graduate</th>
      <td>0.243883</td>
      <td>0.294303</td>
      <td>0.180865</td>
    </tr>
    <tr>
      <th>Some College</th>
      <td>0.278058</td>
      <td>0.273331</td>
      <td>0.283967</td>
    </tr>
    <tr>
      <th>Gollage Graduate</th>
      <td>0.317670</td>
      <td>0.293953</td>
      <td>0.347313</td>
    </tr>
    <tr>
      <th>Advanced Degree</th>
      <td>0.137087</td>
      <td>0.106606</td>
 

In particular, the table above shows that the difference in average *logwage* between men and women is equal to $0,038$

In [11]:
data_female['lwage'].mean() - data_male['lwage'].mean()

-0.03834473367442026

Thus, the unconditional gender wage gap is about $3,8$\% for the group of never married workers (women get paid less on average in our sample). We also observe that never married working women are relatively more educated than working men and have lower working experience.

This unconditional (predictive) effect of gender equals the coefficient $\beta$ in the univariate ols regression of $Y$ on $D$:

$$
\begin{align}
\log(Y) &=\beta D + \epsilon.
\end{align}
$$

We verify this by running an ols regression in R.

In [12]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [13]:
nocontrol_model = smf.ols( formula = 'lwage ~ sex', data = data )
nocontrol_est = nocontrol_model.fit().summary2().tables[1]['Coef.']['sex']
HCV_coefs = nocontrol_model.fit().cov_HC0
nocontrol_se = np.power( HCV_coefs.diagonal() , 0.5)[1]

# print unconditional effect of gender and the corresponding standard error
print( f'The estimated gender coefficient is {nocontrol_est} and the corresponding robust standard error is {nocontrol_se}' )

The estimated gender coefficient is -0.038344733674415696 and the corresponding robust standard error is 0.015901935079095802


Note that the standard error is computed with the *R* package *sandwich* to be robust to heteroskedasticity. 


Next, we run an ols regression of $Y$ on $(D,W)$ to control for the effect of covariates summarized in $W$:

$$
\begin{align}
\log(Y) &=\beta_1 D  + \beta_2' W + \epsilon.
\end{align}
$$

Here, we are considering the flexible model from the previous lab. Hence, $W$ controls for experience, education, region, and occupation and industry indicators plus transformations and two-way interactions.

Let us run the ols regression with controls.

## Ols regression with controls

In [14]:
flex = 'lwage ~ sex + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)'

# The smf api replicates R script when it transform data
control_model = smf.ols( formula = flex, data = data )
control_est = control_model.fit().summary2().tables[1]['Coef.']['sex']

print(control_model.fit().summary2().tables[1])
print( f"Coefficient for OLS with controls {control_est}" )

HCV_coefs = control_model.fit().cov_HC0
control_se = np.power( HCV_coefs.diagonal() , 0.5)[1]

               Coef.  Std.Err.          t         P>|t|    [0.025    0.975]
Intercept   3.279677  0.284196  11.540202  2.037819e-30  2.722526  3.836828
occ2[T.10]  0.020954  0.156498   0.133896  8.934903e-01 -0.285852  0.327761
occ2[T.11] -0.642418  0.309090  -2.078417  3.772286e-02 -1.248372 -0.036463
occ2[T.12] -0.067477  0.252049  -0.267716  7.889294e-01 -0.561605  0.426651
occ2[T.13] -0.232978  0.231538  -1.006220  3.143593e-01 -0.686896  0.220940
...              ...       ...        ...           ...       ...       ...
exp4:scl    0.021076  0.024529   0.859230  3.902557e-01 -0.027012  0.069164
exp4:clg    0.007869  0.022753   0.345868  7.294565e-01 -0.036736  0.052475
exp4:mw     0.006244  0.015870   0.393446  6.940073e-01 -0.024868  0.037356
exp4:so     0.000314  0.013628   0.023075  9.815913e-01 -0.026402  0.027031
exp4:we     0.001768  0.015960   0.110804  9.117763e-01 -0.029521  0.033058

[246 rows x 6 columns]
Coefficient for OLS with controls -0.06955320329684715


The estimated regression coefficient $\beta_1\approx-0.0696$ measures how our linear prediction of wage changes if we set the gender variable $D$ from 0 to 1, holding the controls $W$ fixed.
We can call this the *predictive effect* (PE), as it measures the impact of a variable on the prediction we make. Overall, we see that the unconditional wage gap of size $4$\% for women increases to about $7$\% after controlling for worker characteristics.  


Next, we are using the Frisch-Waugh-Lovell theorem from the lecture partialling-out the linear effect of the controls via ols.

## Partialling-Out using ols

In [15]:
# models
# model for Y
flex_y = 'lwage ~  (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)'
# model for D
flex_d = 'sex ~ (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)' 

# partialling-out the linear effect of W from Y
t_Y = smf.ols( formula = flex_y , data = data ).fit().resid

# partialling-out the linear effect of W from D
t_D = smf.ols( formula = flex_d , data = data ).fit().resid

data_res = pd.DataFrame( np.vstack(( t_Y.values , t_D.values )).T , columns = [ 't_Y', 't_D' ] )
# regression of Y on D after partialling-out the effect of W
partial_fit =  smf.ols( formula = 't_Y ~ t_D' , data = data_res ).fit()
partial_est = partial_fit.summary2().tables[1]['Coef.']['t_D']

print("Coefficient for D via partialling-out", partial_est)

# standard error
HCV_coefs = partial_fit.cov_HC0
partial_se = np.power( HCV_coefs.diagonal() , 0.5)[1]

# confidence interval
partial_fit.conf_int( alpha=0.05 ).iloc[1, :]

Coefficient for D via partialling-out -0.06955320329684608


0   -0.098671
1   -0.040435
Name: t_D, dtype: float64

Again, the estimated coefficient measures the linear predictive effect (PE) of $D$ on $Y$ after taking out the linear effect of $W$ on both of these variables. This coefficient equals the estimated coefficient from the ols regression with controls.

We know that the partialling-out approach works well when the dimension of $W$ is low
in relation to the sample size $n$. When the dimension of $W$ is relatively high, we need to use variable selection
or penalization for regularization purposes. 

In the following, we illustrate the partialling-out approach using lasso instead of ols. 

## Partialling-Out using lasso

In [16]:
# models
# model for Y
flex_y = 'lwage ~  (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)'

# model for D
flex_d = 'sex ~ (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)'

In [19]:
# With Sklearn

from sklearn import linear_model
# flex_y
lasso_model = linear_model.Lasso( alpha = 0.1 )

flex_y_covariables = smf.ols(formula = flex_y, data = data)
Y_lasso_fitted = lasso_model.fit( flex_y_covariables.exog, data[[ 'lwage' ]] ).predict( flex_y_covariables.exog )
t_Y = data[[ 'lwage' ]] - Y_lasso_fitted.reshape( Y_lasso_fitted.size, 1)

# extraflex_d
flex_d_covariables = smf.ols( flex_d, data=data)
D_lasso_fitted = lasso_model.fit( flex_d_covariables.exog, data[[ 'sex' ]] ).predict( flex_d_covariables.exog )
t_D = data[[ 'sex' ]] - D_lasso_fitted.reshape( D_lasso_fitted.size, 1)

data_res = pd.DataFrame( np.hstack(( t_Y , t_D )) , columns = [ 't_Y', 't_D' ] )

# regression of Y on D after partialling-out the effect of W
partial_lasso_fit = smf.ols( formula = 't_Y ~ t_D' , data = data_res ).fit()
partial_lasso_est = partial_lasso_fit.summary2().tables[1]['Coef.']['t_D']

print( f"Coefficient for D via partialling-out using lasso {partial_lasso_est}" )

# standard error
HCV_coefs = partial_lasso_fit.cov_HC0
partial_lasso_se = np.power( HCV_coefs.diagonal() , 0.5)[1]

Coefficient for D via partialling-out using lasso -0.06487798557263097


Using lasso for partialling-out here provides similar results as using ols.

Next, we summarize the results.

## Summarize the results

In [21]:
table2 = np.zeros( (4, 2) )

table2[0,0] = nocontrol_est  
table2[0,1] = nocontrol_se   
table2[1,0] = control_est
table2[1,1] = control_se    
table2[2,0] = partial_est  
table2[2,1] = partial_se  
table2[3,0] =  partial_lasso_est
table2[3,1] = partial_lasso_se 

table2_pandas = pd.DataFrame( table2, columns = [ "Estimate","Std. Error" ])
table2_pandas.index = [ "Without controls", "full reg", "partial reg", "partial reg via lasso" ]
table2_html = table2_pandas.to_html()
table2_pandas

Unnamed: 0,Estimate,Std. Error
Without controls,-0.038345,0.015902
full reg,-0.069553,0.144608
partial reg,-0.069553,0.015
partial reg via lasso,-0.064878,0.015557


It it worth to notice that controlling for worker characteristics increases the gender wage gap from less that 4\% to 7\%. The controls we used in our analysis include 5 educational attainment indicators (less than high school graduates, high school graduates, some college, college graduate, and advanced degree), 4 region indicators (midwest, south, west, and northeast);  a quartic term (first, second, third, and fourth power) in experience and 22 occupation and 23 industry indicators.

Keep in mind that the predictive effect (PE) does not only measures discrimination (causal effect of being female), it also may reflect
selection effects of unobserved differences in covariates between men and women in our sample.


Next we try "extra" flexible model, where we take interactions of all controls, giving us about 1000 controls.

## "Extra" flexible model

In [109]:
extraflex = 'lwage ~ sex + (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'

control_fit = smf.ols( formula = extraflex, data=data).fit()

#summary( control_fit )
control_est = control_fit.summary2().tables[1]['Coef.']['sex']

print( f"Number of Extra-Flex Controls {control_fit.summary2().tables[1].shape[0]-1} \nCoefficient for OLS with extra flex controls {control_est}" )

# standard error
HCV_coefs = control_fit.cov_HC0

n= len(data[ 'wage' ])

p = len(control_fit.summary2().tables[1]['Coef.'])

control_se = control_fit.summary2().tables[1]['Std.Err.']['sex']*math.sqrt(n/(n-p))
control_se
# crude adjustment for the effect of dimensionality on OLS standard errors, motivated by Cattaneo, Jannson, and Newey (2018)

# for really correct way of doing this, we need to implement Cattaneo, Jannson, and Newey (2018)'s procedure.

Number of Extra-Flex Controls 979 
Coefficient for OLS with extra flex controls -0.06127046379396306


0.017759931217339292

## Laso "Extra" Flexible model

In [110]:
# models
# model for Y
extraflex_y = 'lwage ~  (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'

# model for 
extraflex_d = 'sex ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'

# extraflex_y
lasso_model = linear_model.Lasso( alpha = 0.1  )

extraflex_y_covariables = smf.ols(formula = extraflex_y, data = data)

Y_lasso_fitted = lasso_model.fit( extraflex_y_covariables.exog, data[[ 'lwage' ]] ).predict( extraflex_y_covariables.exog )

t_Y = data[[ 'lwage' ]] - Y_lasso_fitted.reshape( Y_lasso_fitted.size, 1)

# extraflex_d
extraflex_d_covariables = smf.ols( extraflex_d, data=data)

D_lasso_fitted = lasso_model.fit( extraflex_d_covariables.exog, data[[ 'sex' ]] ).predict( extraflex_d_covariables.exog )

t_D = data[[ 'sex' ]] - D_lasso_fitted.reshape( D_lasso_fitted.size, 1)

data_res = pd.DataFrame( np.hstack(( t_Y , t_D )) , columns = [ 't_Y', 't_D' ] )

# regression of Y on D after partialling-out the effect of W
partial_lasso_fit = smf.ols( formula = 't_Y ~ t_D' , data = data_res ).fit()
partial_lasso_est = partial_lasso_fit.summary2().tables[1]['Coef.']['t_D']

print( f"Coefficient for D via partialling-out using lasso {partial_lasso_est}" )

# standard error
HCV_coefs = partial_lasso_fit.cov_HC0
partial_lasso_se = np.power( HCV_coefs.diagonal() , 0.5)[1]

  model = cd_fast.enet_coordinate_descent(


Coefficient for D via partialling-out using lasso -0.06676571026846852


  model = cd_fast.enet_coordinate_descent(


## Summarize the results

In [111]:
table3 = np.zeros( ( 2, 2 ) )

table3[0,0] = control_est
table3[0,1] = control_se    
table3[1,0] =  partial_lasso_est
table3[1,1] = partial_lasso_se 

table3_pandas = pd.DataFrame( table3, columns = [ "Estimate","Std. Error" ])
table3_pandas.index = [ "full reg","partial reg via lasso" ]
table3_pandas.round(8)

Unnamed: 0,Estimate,Std. Error
full reg,-0.06127,0.01776
partial reg via lasso,-0.066766,0.015468


In this case p/n = 20%, that is  p/n  is no longer small and we start seeing the differences between unregularized partialling out and regularized partialling out with lasso (double lasso). The results based on double lasso have rigorous guarantees in this non-small p/n regime under approximate sparsity. The results based on OLS still have guarantees in p/n< 1 regime under assumptions laid out in Cattaneo, Newey, and Jansson (2018), without approximate sparsity, although other regularity conditions are needed.