# Application: Heterogeneous Effect of Gender on Wage Using Double Lasso

We use US census data from 2012 to jointly analyse the effect of gender on wages and the interaction effects of other variables with gender. The dependent variable is the logarithm of the wage, and the target variable is *female*, both on its own and in combination with other variables. All other variables denote socio-economic characteristics, e.g. marital status, education, and experience. For a detailed description of the variables we refer to the help page.
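As a rough sketch, the specification described above can be written as follows. The notation ($F_i$, $Z_{ij}$, $C_{ik}$, $\alpha$, $\beta$) is introduced here for illustration only and is not taken from the help page.

$$
\log W_i = \alpha_0 F_i + \sum_{j} \alpha_j \, F_i Z_{ij} + \sum_{k} \beta_k \, C_{ik} + \varepsilon_i
$$

Here $F_i$ is the *female* indicator, $Z_{ij}$ are the socio-economic characteristics, and $C_{ik}$ collects the controls (the characteristics and their two-way interactions). The target parameters are $\alpha_0$ and the $\alpha_j$.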



This analysis allows a closer look at how gender discrimination is related to other socio-economic variables.



In [93]:
import hdmpy
import pyreadr
import patsy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

In [2]:
cps2012_env = pyreadr.read_r("../data/cps2012.Rdata")
cps2012 = cps2012_env[ 'data' ]
cps2012.describe()

,year,lnw,female,widowed,divorced,separated,nevermarried,hsd08,hsd911,hsg,cg,ad,mw,so,we,exp1,exp2,exp3,exp4,weight
count,29217.0,29217.0,29217.0,29217.0,29217.0,29217.0,29217.0,29217.0,29217.0,29217.0,29217.0,29217.0,29217.0,29217.0,29217.0,29217.0,29217.0,29217.0,29217.0,29217.0
mean,2012.0,2.797007,0.428757,0.007975,0.113393,0.0166,0.156347,0.004107,0.022179,0.247288,0.283431,0.1558,0.291645,0.282849,0.199644,18.756939,4.286811,10.875998,29.408779,1513.842566
std,0.0,0.662406,0.494907,0.088947,0.317078,0.127769,0.363191,0.063957,0.147267,0.431443,0.450671,0.362672,0.454528,0.450391,0.39974,8.76704,3.321506,11.121864,36.569919,1009.81161
min,2012.0,-7.469874,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,106.79
25%,2012.0,2.408296,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.5,1.3225,1.520875,1.749006,654.24
50%,2012.0,2.77454,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0,3.61,6.859,13.0321,1472.1
75%,2012.0,3.181569,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,26.0,6.76,17.576,45.6976,1966.63
max,2012.0,5.970942,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,43.5,18.9225,82.312875,358.061006,6444.15


In [3]:
# Gender dummy, its interactions with all characteristics, and the
# characteristics with all of their two-way interactions as controls
formula_basic = '''lnw ~ -1 + female + female:(widowed + divorced + separated + nevermarried +
hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3) + (widowed +
divorced + separated + nevermarried + hsd08 + hsd911 + hsg + cg + ad + mw + so +
we + exp1 + exp2 + exp3) ** 2'''

y, X = patsy.dmatrices(formula_basic, cps2012, return_type='dataframe')
X.shape[1]

136

The design matrix has 136 columns, i.e. the same number of covariates as in the corresponding R analysis.
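The column count can be checked by hand from the formula. The short sketch below only counts terms; the variable names are ours and are not part of the original notebook.

```python
from math import comb

# 15 characteristics appear in the formula: widowed, ..., exp3
n_chars = 15

# female itself plus one female:characteristic interaction per characteristic
n_female_terms = 1 + n_chars

# main effects plus all two-way interactions produced by (...) ** 2
n_control_terms = n_chars + comb(n_chars, 2)

n_columns = n_female_terms + n_control_terms
print(n_columns)  # 136
```

There is no intercept column because the formula starts with `-1`.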

In [4]:
# Drop columns with zero variance (for example, interactions that are always zero)
X = X.loc[:, X.var() != 0]

# Demean every remaining column
X = X - X.mean()

# Positions of all columns that involve the gender dummy
index_gender = np.where(X.columns.str.contains('female'))[0]

The estimates of the target parameters, i.e. the coefficient on *female* and on all of its interactions with other variables, are calculated and summarized by the following commands:

In [6]:
effect_female = hdmpy.rlassoEffects( x = X , y = y , index = index_gender )
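For intuition, the estimate for a single target coefficient can be viewed as a partialling-out regression: the outcome and the target variable are each regressed on the remaining covariates with a lasso, and the two sets of residuals are then regressed on each other. The sketch below is not hdmpy's implementation. It uses a plain lasso with a fixed, hand-picked penalty `lam` on simulated data, whereas `rlassoEffects`, to our understanding, chooses its penalty in a data-driven way. All names (`lasso_cd`, `partial_out_effect`, `W`, `d`) are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Plain lasso by coordinate descent: min (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.astype(float).copy()       # current residual y - X @ b
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]      # add back column j's contribution
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]      # subtract the updated contribution
    return b

def partial_out_effect(d, W, y, lam):
    """Residual-on-residual estimate of the coefficient on d, with a robust std. error."""
    r_y = y - W @ lasso_cd(W, y, lam)    # outcome with controls partialled out
    r_d = d - W @ lasso_cd(W, d, lam)    # target variable with controls partialled out
    alpha = (r_d @ r_y) / (r_d @ r_d)
    eps = r_y - alpha * r_d
    se = np.sqrt(np.mean(r_d ** 2 * eps ** 2) / len(y)) / np.mean(r_d ** 2)
    return alpha, se

# Simulated data (an assumption for illustration): the true effect of d on y is 0.5
rng = np.random.default_rng(0)
n, p = 500, 20
W = rng.normal(size=(n, p))
d = 0.8 * W[:, 0] + rng.normal(size=n)
y = 0.5 * d + W[:, 0] + W[:, 1] + rng.normal(size=n)

# Demean everything, mirroring the preprocessing used for the wage data
W = W - W.mean(axis=0)
d = d - d.mean()
y = y - y.mean()

alpha_hat, se_hat = partial_out_effect(d, W, y, lam=0.1)
print(f"estimate = {alpha_hat:.3f}, std. error = {se_hat:.3f}")
```

In the application, this kind of regression is repeated for every column selected by `index_gender`, each time treating the remaining columns of `X` as controls.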

In [None]:
# Collect estimates, standard errors, t statistics and p-values in one table
result_coeff = pd.concat(
    [
        effect_female.res.get('coefficients').rename(columns={0: "Estimate."}),
        effect_female.res.get('se').rename(columns={0: "Std. Error"}),
        effect_female.res.get('t').rename(columns={0: "t value"}),
        effect_female.res.get('pval').rename(columns={0: "Pr(>|t|)"}),
    ],
    axis=1,
)

print(result_coeff)

result_coeff.round(3).to_latex()

Now, we estimate and plot confidence intervals, first the pointwise and then the joint confidence intervals.

In [43]:
# 90% pointwise confidence intervals: estimate ± 1.645 standard errors
pointwise_CI = pd.DataFrame({
    '5%': result_coeff.iloc[:, 0] - result_coeff.iloc[:, 1] * 1.645,
    '95%': result_coeff.iloc[:, 0] + result_coeff.iloc[:, 1] * 1.645,
})
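The constant 1.645 is the 95% quantile of the standard normal distribution, which yields a two-sided 90% interval. A minimal check using only the Python standard library (the names `level` and `z` are ours):

```python
from statistics import NormalDist

level = 0.90                                   # two-sided coverage of the interval
z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # standard normal quantile at 0.95
print(round(z, 3))  # 1.645
```

Changing `level` to 0.95 gives the more familiar critical value of about 1.96.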

In [76]:
result_coeff = result_coeff.sort_values('Estimate.')

x = result_coeff.index

coef = result_coeff.iloc[ : , 0 ].to_numpy()

sd_error = result_coeff.iloc[ : , 1 ].to_numpy()

figure(figsize=(12, 6), dpi=80)

# Plot each estimate with its 90% pointwise confidence interval (estimate ± 1.645 * SE)
plt.errorbar(x=x, y=coef, yerr=1.645 * sd_error, linestyle="None", color="black",
             capsize=3, marker="s", markersize=3, mfc="black", mec="black")
plt.xticks(rotation=90)
plt.show()

Finally, we compare the pointwise confidence intervals to joint confidence intervals.