# Application: Heterogeneous Effect of Sex on Wage Using Double Lasso

We use US census data from 2015 to jointly analyse the effect of gender on wages and the interactions of gender with other variables. The dependent variable is the logarithm of the wage, and the target variable is *female* (on its own and in combination with other variables). All other variables capture socio-economic characteristics, e.g. marital status, education, and experience.



This analysis allows a closer look at how the gender wage gap is related to other socio-economic variables.


In [1]:
# Import relevant packages
import pandas as pd
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LinearRegression
import patsy
import warnings
warnings.simplefilter('ignore')
np.random.seed(1234)

In [2]:
file = "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv"
data = pd.read_csv(file)

In [3]:
data.describe() 

Unnamed: 0,wage,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind,ind2
count,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0
mean,23.41041,2.970787,0.444466,0.023301,0.243883,0.278058,0.31767,0.137087,0.259612,0.296505,0.216117,0.227767,13.760583,3.018925,8.235867,25.118038,5310.737476,11.670874,6629.154951,13.316893
std,21.003016,0.570385,0.496955,0.150872,0.429465,0.448086,0.465616,0.343973,0.438464,0.456761,0.411635,0.419432,10.609465,4.000904,14.488962,53.530225,11874.35608,6.966684,5333.443992,5.701019
min,3.021978,1.105912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,1.0,370.0,2.0
25%,13.461538,2.599837,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.25,0.125,0.0625,1740.0,5.0,4880.0,9.0
50%,19.230769,2.956512,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,1.0,1.0,1.0,4040.0,13.0,7370.0,14.0
75%,27.777778,3.324236,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,21.0,4.41,9.261,19.4481,5610.0,17.0,8190.0,18.0
max,528.845673,6.270697,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,47.0,22.09,103.823,487.9681,100000.0,22.0,100000.0,22.0


Define outcome and regressors

In [4]:
y = np.log(data['wage']).values
Z = data.drop(['wage', 'lwage'], axis=1)
Z.columns

Index(['sex', 'shs', 'hsg', 'scl', 'clg', 'ad', 'mw', 'so', 'we', 'ne', 'exp1',
       'exp2', 'exp3', 'exp4', 'occ', 'occ2', 'ind', 'ind2'],
      dtype='object')

## Feature Engineering

Construct all our control variables

In [5]:
# Ultra-flexible controls: all pairwise interactions (around 1k variables)
Zcontrols = patsy.dmatrix('0 + (shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we+exp1+exp2+exp3+exp4)**2',
                          Z, return_type='dataframe')

Zcontrols = Zcontrols - Zcontrols.mean(axis=0)
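
To make the `(...)**2` formula syntax used above concrete, here is a minimal sketch on a toy data frame. The frame `toy` and its columns `a`, `b`, `c` are hypothetical and only serve to show how patsy expands the expression into main effects plus all pairwise products.

```python
import pandas as pd
import patsy

# Hypothetical toy data with three numeric columns
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [0.0, 1.0, 0.0, 1.0],
                    "c": [2.0, 2.0, 5.0, 7.0]})

# "0 +" drops the intercept; "**2" adds every pairwise interaction
expanded = patsy.dmatrix("0 + (a + b + c)**2", toy, return_type="dataframe")
toy_columns = list(expanded.columns)

# The design matrix holds a, b, c and the pairwise products a:b, a:c, b:c
print(toy_columns)
print(expanded["a:b"].tolist())  # element-wise product of a and b
```

With three numeric terms this gives six columns. Applied to the much longer list of controls above, which also contains categorical terms wrapped in `C(...)`, the same expansion is what produces roughly a thousand columns.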

Construct all the variables that we will use to model heterogeneity of effect in a linear manner

In [6]:
Zhet = patsy.dmatrix('0 + (shs+hsg+scl+clg+mw+so+we+exp1+exp2+exp3+exp4)',
                     Z, return_type='dataframe')
Zhet = Zhet - Zhet.mean(axis=0)
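
One reason to demean the heterogeneity variables is interpretability: once they are centered, the coefficient on the main `sex` term can be read as the effect at the sample mean of those variables. The sketch below illustrates this on synthetic, noise-free data; the variables `x`, `d`, and all numeric values are made up for illustration and are not taken from the wage data.

```python
import numpy as np

# Synthetic, noise-free data (all names and numbers are illustrative)
x = np.linspace(0.0, 4.0, 200)      # covariate with sample mean 2
d = np.arange(200) % 2              # binary indicator, alternating 0/1
effect = -0.2 + 0.05 * x            # effect of d varies linearly with x
y = 1.0 + 0.3 * x + effect * d


def ols(columns, target):
    """Least-squares coefficients for an intercept plus the given columns."""
    design = np.column_stack([np.ones_like(target)] + columns)
    return np.linalg.lstsq(design, target, rcond=None)[0]


xc = x - x.mean()
coef_raw = ols([d, d * x, x], y)          # interaction with the raw covariate
coef_centered = ols([d, d * xc, xc], y)   # interaction with the centered covariate

main_raw = coef_raw[1]            # effect of d at x = 0
main_centered = coef_centered[1]  # effect of d at the sample mean of x
print(main_raw, main_centered, effect.mean())
```

In this toy setup the main-effect coefficient equals the effect at `x = 0` without centering and the average effect with centering, while the interaction coefficient is the same in both regressions.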

Construct all interaction variables between sex and heterogeneity variables

In [7]:
Zhet['sex'] = Z['sex']
Zinteractions = patsy.dmatrix('0 + sex + sex * (shs+hsg+scl+clg+mw+so+we+exp1+exp2+exp3+exp4)',
                              Zhet, return_type='dataframe')
interaction_cols = [c for c in Zinteractions.columns if c.startswith('sex')]

Put all the variables together

In [8]:
X = pd.concat([Zinteractions, Zcontrols], axis=1)
X.shape

(5150, 1002)

## Double Lasso for All Interactive Effects

We use "plug-in" tuning with a theoretically valid choice of penalty $\lambda = 2 \cdot c \hat{\sigma}\sqrt{n} \Phi^{-1}(1-\alpha/(2p))$, where $c>1$, $1-\alpha$ is a confidence level, $n$ is the sample size, $p$ is the number of regressors, $\hat{\sigma}$ is an estimate of the noise standard deviation, and $\Phi^{-1}$ denotes the quantile function of the standard normal distribution. This choice ensures that the Lasso predictor is well behaved under independence as long as appropriate penalty weights are used.

In practice, many people choose to use cross-validation, which is perfectly fine for predictive tasks. However, when conducting inference, to make our analysis valid we will require cross-fitting in addition to cross-validation. As we have not yet discussed cross-fitting, we rely on this theoretically-driven penalty in order to allow for accurate inference in the upcoming notebooks.

Note: In the book, the penalty is multiplied by $\sqrt{n}$ because Lasso there minimizes the sum of squared errors. If you were using, say, sklearn's `Lasso`, whose objective minimizes the average squared error, you would instead divide by $\sqrt{n}$.
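
The two scalings can be compared numerically. The sketch below is only an illustration: the helper names are hypothetical, `c=1.1` and `alpha=0.05` are assumed values rather than the defaults of any package, and `sigma_guess` stands in for an estimate $\hat{\sigma}$ that would normally be computed from the data. It relies on sklearn's `Lasso` minimizing $\frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha_{\text{sk}}\lVert w\rVert_1$, so a penalty $\lambda$ on the sum-of-squares scale corresponds to $\lambda/(2n)$ on sklearn's scale.

```python
import numpy as np
from scipy.stats import norm


def plugin_penalty_sum_scale(n, p, sigma_hat, c=1.1, alpha=0.05):
    """Plug-in penalty for an objective based on the SUM of squared errors."""
    return 2 * c * sigma_hat * np.sqrt(n) * norm.ppf(1 - alpha / (2 * p))


def plugin_penalty_sklearn_scale(n, p, sigma_hat, c=1.1, alpha=0.05):
    """Same penalty expressed for sklearn's Lasso objective (divide by sqrt(n))."""
    return c * sigma_hat / np.sqrt(n) * norm.ppf(1 - alpha / (2 * p))


# Dimensions of the design matrix in this notebook; sigma_guess is a placeholder
n_obs, n_regressors = 5150, 1002
sigma_guess = 0.5

lam_sum = plugin_penalty_sum_scale(n_obs, n_regressors, sigma_guess)
lam_sklearn = plugin_penalty_sklearn_scale(n_obs, n_regressors, sigma_guess)

# Dividing the sum-scale penalty by 2n recovers the sklearn-scale penalty
print(lam_sum, lam_sklearn, lam_sum / (2 * n_obs))
```

This only shows the level of the penalty. It does not reproduce the data-driven penalty weights mentioned above, which is one reason the notebook uses a dedicated implementation instead.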

To estimate Lasso with the theoretically motivated penalty level, we use the hdmpy package. To install it, run:
```
!pip install multiprocess
!git clone https://github.com/maxhuppertz/hdmpy.git
```

The cells below install the dependency, make the cloned package importable, and define a `lasso_model` function that is used in the double Lasso analysis that follows.

In [10]:
!pip install multiprocess
# Un-comment on the first run to clone hdmpy into ./hdmpy (needed for the import below)
#!git clone https://github.com/maxhuppertz/hdmpy.git



In [11]:
import sys
sys.path.insert(1, "./hdmpy")

In [12]:
# We wrap the package so that it has the familiar sklearn API
import hdmpy
from sklearn.base import BaseEstimator


class RLasso(BaseEstimator):

    def __init__(self, *, post=True):
        self.post = post

    def fit(self, X, y):
        self.rlasso_ = hdmpy.rlasso(X, y, post=self.post)
        return self

    def predict(self, X):
        return np.array(X) @ np.array(self.rlasso_.est['beta']).flatten() + np.array(self.rlasso_.est['intercept'])


def lasso_model():
    return RLasso(post=False)

In [13]:
alpha = {}
res_y, res_D, epsilon = {}, {}, {}
for c in interaction_cols:
    print(f"Double Lasso for target variable {c}")
    D = X[c].values
    W = X.drop([c], axis=1)
    res_y[c] = y - lasso_model().fit(W, y).predict(W)
    res_D[c] = D - lasso_model().fit(W, D).predict(W)
    final = LinearRegression(fit_intercept=False).fit(res_D[c].reshape(-1, 1), res_y[c])
    epsilon[c] = res_y[c] - final.predict(res_D[c].reshape(-1, 1))
    alpha[c] = [final.coef_[0]]

# Calculate the covariance matrix of the estimated parameters
V = np.zeros((len(interaction_cols), len(interaction_cols)))
for it, c in enumerate(interaction_cols):
    Jc = np.mean(res_D[c]**2)
    for itp, cp in enumerate(interaction_cols):
        Jcp = np.mean(res_D[cp]**2)
        Sigma = np.mean(res_D[c] * epsilon[c] * epsilon[cp] * res_D[cp])
        V[it, itp] = Sigma / (Jc * Jcp)

# Calculate standard errors for each parameter
n = X.shape[0]
for it, c in enumerate(interaction_cols):
    alpha[c] += [np.sqrt(V[it, it] / n)]

# put all in a dataframe
df = pd.DataFrame.from_dict(alpha, orient='index', columns=['point', 'stderr'])

# Calculate pointwise p-values and 95% confidence intervals
summary = pd.DataFrame()
summary['Estimate'] = df['point']
summary['Std. Error'] = df['stderr']
summary['p-value'] = norm.sf(np.abs(df['point'] / df['stderr']), loc=0, scale=1) * 2
summary['ci_lower'] = df['point'] - 1.96 * df['stderr']
summary['ci_upper'] = df['point'] + 1.96 * df['stderr']
summary

Double Lasso for target variable sex
Double Lasso for target variable sex:shs
Double Lasso for target variable sex:hsg
Double Lasso for target variable sex:scl
Double Lasso for target variable sex:clg
Double Lasso for target variable sex:mw
Double Lasso for target variable sex:so
Double Lasso for target variable sex:we
Double Lasso for target variable sex:exp1
Double Lasso for target variable sex:exp2
Double Lasso for target variable sex:exp3
Double Lasso for target variable sex:exp4


Unnamed: 0,Estimate,Std. Error,p-value,ci_lower,ci_upper
sex,-0.067859,0.015321,9e-06,-0.097889,-0.037829
sex:shs,-0.197714,0.113114,0.080477,-0.419417,0.023989
sex:hsg,0.012411,0.050681,0.806545,-0.086923,0.111745
sex:scl,0.021593,0.0496,0.663311,-0.075622,0.118809
sex:clg,0.06175,0.046871,0.187692,-0.030118,0.153617
sex:mw,-0.108562,0.041151,0.008336,-0.189217,-0.027907
sex:so,-0.072678,0.039383,0.064979,-0.149869,0.004513
sex:we,-0.050941,0.042267,0.228113,-0.133784,0.031901
sex:exp1,0.01811,0.006726,0.007087,0.004928,0.031292
sex:exp2,0.023545,0.048364,0.626381,-0.071248,0.118338
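
The loop above residualizes both the outcome and each target variable on all remaining regressors and then regresses residual on residual. The sketch below repeats those steps for a single target on synthetic data. It is not the notebook's estimator: it swaps in sklearn's `Lasso` with an arbitrary penalty `alpha=0.1` instead of `hdmpy.rlasso`, and the data-generating process and all names are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 500, 50
W = rng.normal(size=(n, p))              # synthetic controls
D = W[:, 0] + rng.normal(size=n)         # target variable depends on one control
true_effect = 0.5
y = true_effect * D + 2.0 * W[:, 0] + rng.normal(size=n)

# Naive regression of y on D alone is confounded by W[:, 0]
naive = LinearRegression().fit(D.reshape(-1, 1), y).coef_[0]

# Step 1: partial out the controls from the outcome and from the target
res_y = y - Lasso(alpha=0.1).fit(W, y).predict(W)
res_D = D - Lasso(alpha=0.1).fit(W, D).predict(W)

# Step 2: regress outcome residuals on target residuals (no intercept)
final = LinearRegression(fit_intercept=False).fit(res_D.reshape(-1, 1), res_y)
estimate = final.coef_[0]

# Step 3: heteroskedasticity-robust standard error, as in the loop above
eps = res_y - final.predict(res_D.reshape(-1, 1))
stderr = np.sqrt(np.mean(res_D**2 * eps**2) / np.mean(res_D**2) ** 2 / n)

print(f"naive: {naive:.3f}, double lasso: {estimate:.3f} (se {stderr:.3f})")
```

In this synthetic setup the naive coefficient is pulled well away from `true_effect`, while the residual-on-residual estimate should land close to it.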


### Joint Confidence Intervals

In [14]:
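# Rescale the covariance matrix V into a correlation matrix
Drootinv = np.diagflat(1 / np.sqrt(np.diag(V)))
scaledCov = Drootinv @ V @ Drootinv
np.random.seed(123)
# Simulate the maximum absolute value of correlated standard normal draws
U = np.random.multivariate_normal(np.zeros(scaledCov.shape[0]), scaledCov, size=10000)
z = np.max(np.abs(U), axis=1)
# Its 95th percentile is the critical value for simultaneous 95% confidence bands
c = np.percentile(z, 95)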

summary = pd.DataFrame()
summary['Estimate'] = df['point']
summary['CI lower'] = df['point'] - c * df['stderr']
summary['CI upper'] = df['point'] + c * df['stderr']
summary

Unnamed: 0,Estimate,CI lower,CI upper
sex,-0.067859,-0.111136,-0.024581
sex:shs,-0.197714,-0.51722,0.121792
sex:hsg,0.012411,-0.130745,0.155567
sex:scl,0.021593,-0.118509,0.161695
sex:clg,0.06175,-0.070645,0.194144
sex:mw,-0.108562,-0.224798,0.007674
sex:so,-0.072678,-0.183922,0.038566
sex:we,-0.050941,-0.17033,0.068447
sex:exp1,0.01811,-0.000887,0.037108
sex:exp2,0.023545,-0.113066,0.160156
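
The joint intervals are wider than the pointwise ones because the simulated critical value exceeds 1.96. The following sketch shows this on a made-up correlation matrix (12 parameters with a common correlation of 0.5, both assumed purely for illustration) and compares the simulated value with the pointwise and Bonferroni critical values.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical correlation matrix: 12 estimates with pairwise correlation 0.5
k, rho = 12, 0.5
corr = np.full((k, k), rho)
np.fill_diagonal(corr, 1.0)

# Simulate the maximum absolute value across the k correlated normals
draws = rng.multivariate_normal(np.zeros(k), corr, size=20000)
max_abs = np.max(np.abs(draws), axis=1)

crit_joint = np.quantile(max_abs, 0.95)          # simultaneous 95% critical value
crit_pointwise = norm.ppf(0.975)                 # usual two-sided 95% value
crit_bonferroni = norm.ppf(1 - 0.05 / (2 * k))   # conservative alternative

print(crit_pointwise, crit_joint, crit_bonferroni)
```

The simulated critical value sits between the pointwise value and the Bonferroni value: it accounts for looking at all parameters at once, but uses the correlation between estimates to avoid being more conservative than needed.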
