* Julia code replication of: "https://www.kaggle.com/janniskueck/pm1-notebook-inference"
* Created by: Anzony Quispe & Alexander Quispe

This notebook contains an example for teaching.

# An inferential problem: The Gender Wage Gap

In the previous lab, we analyzed data from the March Supplement of the U.S. Current Population Survey (2015) and answered the question of how to use job-relevant characteristics, such as education and experience, to best predict wages. Now, we focus on the following inference question:

What is the difference in predicted wages between men and women with the same job-relevant characteristics?

Thus, we analyze whether men and women are paid differently (*gender wage gap*). The gender wage gap may partly reflect *discrimination* against women in the labor market and may partly reflect a *selection effect*, namely that women are relatively more likely to take on occupations that pay somewhat less (for example, school teaching).

To investigate the gender wage gap, we consider the following log-linear regression model

\begin{align}
\log(Y) &= \beta'X + \epsilon\\
&= \beta_1 D  + \beta_2' W + \epsilon,
\end{align}

where $D$ is the indicator of being female ($1$ if female and $0$ otherwise) and the
$W$'s are controls explaining variation in wages. Because wages are transformed by the logarithm, we are analyzing the relative difference in the pay of men and women.
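
Because the outcome is in logs, a coefficient on $D$ is a gap in log points; the implied relative gap is $e^{\beta_1}-1$, which is close to $\beta_1$ itself when the gap is small. A minimal sketch of the conversion follows. The helper name `log_gap_to_percent` and the input values are illustrative assumptions, not part of the original notebook.

```julia
# Hypothetical helper: convert a gap in log points into a percentage gap.
log_gap_to_percent(beta) = 100 * (exp(beta) - 1)

# Illustrative values only: for small gaps, log points and percent nearly coincide.
for beta in (-0.038, -0.07)
    println("log gap = ", beta, "  =>  percent gap = ",
            round(log_gap_to_percent(beta), digits = 2))
end
```

For gaps of a few log points the two scales are nearly interchangeable, which is why the coefficients below are read directly as percentage differences.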

## Data Analysis

We consider the same subsample of the U.S. Current Population Survey (2015) as in the previous lab. Let us load the data set.



In [1]:
using Pkg

Pkg.add("DataFrames")
Pkg.add("Dates")
Pkg.add("Plots")
Pkg.add("RData")             # needed below to read .RData files
Pkg.add("CategoricalArrays")

using DataFrames
using Dates
using Plots
using Statistics, RData  # RData loads data stored in R format
using CategoricalArrays  # categorical data


In [2]:
rdata_read = load("../../../data/wage2015_subsample_inference.RData")
data = rdata_read["data"]
names(data)
println("Number of Rows : ", size(data, 1), "\n", "Number of Columns : ", size(data, 2)) # rows and columns


***Variable description***

- occ : occupational classification
- ind : industry classification
- lwage : log hourly wage
- sex : gender (1 female, 0 male)
- shs : some high school
- hsg : high school graduate
- scl : some college
- clg : college graduate
- ad : advanced degree
- ne : Northeast
- mw : Midwest
- so : South
- we : West
- exp1 : experience

In [3]:
describe(data)


To start our (causal) analysis, we compare the sample means given gender:

In [4]:
# variables to summarize
vars = ["lwage","sex","shs","hsg","scl","clg","ad","ne","mw","so","we","exp1"]
Z = select(data, vars)

# subsample of women
data_female = filter(row -> row.sex == 1, data)
Z_female = select(data_female, vars)

# subsample of men
data_male = filter(row -> row.sex == 0, data)
Z_male = select(data_male, vars)

# sample means for all workers, men, and women
means = DataFrame(variables = names(Z), All = describe(Z, :mean)[!, 2], Men = describe(Z_male, :mean)[!, 2], Women = describe(Z_female, :mean)[!, 2])



In particular, the table above shows that the difference in average log wage between men and women is equal to $0.038$.

In [5]:
mean(Z_female[:,:lwage]) - mean(Z_male[:,:lwage])


Thus, the unconditional gender wage gap is about $3.8$\% for the group of never-married workers (women are paid less on average in our sample). We also observe that never-married working women are relatively more educated than working men and have less work experience.

This unconditional (predictive) effect of gender equals the coefficient $\beta$ in the univariate OLS regression of $\log(Y)$ on $D$:

\begin{align}
\log(Y) &=\beta D + \epsilon.
\end{align}
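
When an intercept is included (as `lm` does by default), the OLS slope on a binary regressor is exactly the difference between the two group means. A minimal sketch on simulated data, using only Julia's standard libraries, illustrates this. All names and numbers are illustrative assumptions, not the CPS sample.

```julia
using LinearAlgebra, Random, Statistics

Random.seed!(1)
n = 1_000
D = Float64.(rand(n) .< 0.45)                # simulated binary indicator
y = 3.0 .- 0.04 .* D .+ 0.5 .* randn(n)      # simulated log wages

X = hcat(ones(n), D)                         # design matrix: intercept and D
beta = X \ y                                 # least-squares coefficients

diff_means = mean(y[D .== 1]) - mean(y[D .== 0])
println("OLS slope:           ", beta[2])
println("Difference in means: ", diff_means)
```

The two printed numbers agree up to floating-point error.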

We verify this by running an OLS regression in Julia.

In [None]:
# install the packages we need
Pkg.add("Lathe")
Pkg.add("GLM") # package to fit linear models
Pkg.add("StatsPlots")
Pkg.add("MLBase")
Pkg.add("Tables")
Pkg.add("CSV")
Pkg.add("CovarianceMatrices") # robust standard errors
# load the installed packages
using DataFrames
using CSV
using Tables
using Lathe
using GLM
using CovarianceMatrices

In [None]:
nocontrol_model = lm(@formula(lwage ~ sex), data)
nocontrol_est = GLM.coef(nocontrol_model)[2]
nocontrol_se = GLM.coeftable(nocontrol_model).cols[2][2]

nocontrol_se1 = stderror(HC1(), nocontrol_model)[2]
println("The estimated gender coefficient is ", nocontrol_est ," and the corresponding robust standard error is " ,nocontrol_se1)
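
The robust standard error printed above comes from a heteroskedasticity-consistent ("sandwich") covariance estimator. As a rough sketch, the textbook HC0 and HC1 formulas can be computed by hand on simulated data. This is an illustration only, not the internals of `CovarianceMatrices`; all variable names and numbers are assumptions.

```julia
using LinearAlgebra, Random

Random.seed!(2)
n = 500
D = Float64.(rand(n) .< 0.5)                            # simulated binary regressor
y = 3.0 .- 0.04 .* D .+ (0.3 .+ 0.4 .* D) .* randn(n)   # heteroskedastic noise

X = hcat(ones(n), D)                  # intercept and D
k = size(X, 2)
beta = X \ y
e = y - X * beta                      # OLS residuals

XtX_inv = inv(X' * X)
meat = X' * (X .* e .^ 2)             # X' diag(e^2) X
V_hc0 = XtX_inv * meat * XtX_inv      # HC0 sandwich estimator
V_hc1 = V_hc0 * n / (n - k)           # HC1: degrees-of-freedom correction

se_hc0 = sqrt.(diag(V_hc0))
se_hc1 = sqrt.(diag(V_hc1))
println("HC0 s.e. of slope: ", se_hc0[2])
println("HC1 s.e. of slope: ", se_hc1[2])
```

HC1 only rescales HC0 by $n/(n-k)$, so the two are nearly identical when the number of regressors is small relative to the sample size.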

Next, we run an OLS regression of $\log(Y)$ on $(D,W)$ to control for the effect of covariates summarized in $W$:

\begin{align}
\log(Y) &=\beta_1 D  + \beta_2' W + \epsilon.
\end{align}

Here, we are considering the flexible model from the previous lab. Hence, $W$ controls for experience, education, region, and occupation and industry indicators plus transformations and two-way interactions.

Let us run the OLS regression with controls.

## OLS regression with controls

In [None]:
flex = @formula(lwage ~ sex + (exp1+exp2+exp3+exp4) * (shs+hsg+scl+clg+occ2+ind2+mw+so+we))
control_model = lm(flex , data)
control_est = GLM.coef(control_model)[2]
control_se = GLM.coeftable(control_model).cols[2][2]
control_se1 = stderror( HC0(), control_model)[2]


In [None]:
control_model 

In [None]:
println("Coefficient for OLS with controls ", control_est, " robust standard error: ", control_se1)

The estimated regression coefficient $\beta_1\approx-0.0696$ measures how our linear prediction of the log wage changes if we switch the gender variable $D$ from 0 to 1, holding the controls $W$ fixed.
We can call this the *predictive effect* (PE), as it measures the impact of a variable on the prediction we make. Overall, we see that the unconditional wage gap of about $4$\% for women increases to about $7$\% after controlling for worker characteristics.


Next, we use the Frisch-Waugh-Lovell theorem from the lecture, partialling out the linear effect of the controls via OLS.

## Partialling-Out using OLS

In [None]:
# models
# model for Y
flex_y = @formula(lwage ~ (exp1+exp2+exp3+exp4) * (shs+hsg+scl+clg+occ2+ind2+mw+so+we))
# model for D
flex_d = @formula(sex ~ (exp1+exp2+exp3+exp4) * (shs+hsg+scl+clg+occ2+ind2+mw+so+we))

# partialling-out the linear effect of W from Y
t_Y = residuals(lm(flex_y, data))

# partialling-out the linear effect of W from D
t_D = residuals(lm(flex_d, data))

data_res = DataFrame(t_Y = t_Y, t_D = t_D)

# regression of Y on D after partialling-out the effect of W
partial_fit = lm(@formula(t_Y ~ t_D), data_res)
partial_est = GLM.coef(partial_fit)[2]

# standard error
partial_se = GLM.coeftable(partial_fit).cols[2][2]

# robust standard error
partial_se1 = stderror(HC0(), partial_fit)[2]

# confidence interval
GLM.confint(partial_fit)[2,:]

In [None]:
println("Coefficient for D via partialling-out ", partial_est, " robust standard error: ", partial_se1)

Again, the estimated coefficient measures the linear predictive effect (PE) of $D$ on $Y$ after taking out the linear effect of $W$ on both of these variables. This coefficient equals the estimated coefficient from the OLS regression with controls.
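
The equality of the two coefficients is the Frisch-Waugh-Lovell theorem, and it holds exactly in any sample. A minimal sketch on simulated data, with plain linear algebra instead of `GLM`, makes this concrete. All names and numbers are illustrative assumptions.

```julia
using LinearAlgebra, Random

Random.seed!(3)
n = 2_000
W = hcat(ones(n), randn(n, 3))                 # intercept plus three simulated controls
D = Float64.(W[:, 2] .+ randn(n) .> 0)         # binary regressor correlated with W
y = -0.07 .* D .+ W * [3.0, 0.2, -0.1, 0.05] .+ 0.5 .* randn(n)

# Full regression of y on (D, W): the coefficient on D is the first entry.
beta_full = hcat(D, W) \ y

# Partialling-out: residualize y and D on W, then regress residual on residual.
res_y = y - W * (W \ y)
res_d = D - W * (W \ D)
beta_partial = dot(res_d, res_y) / dot(res_d, res_d)

println("Full regression:  ", beta_full[1])
println("Partialling-out:  ", beta_partial)
```

The point estimates coincide up to floating-point error. The conventional standard errors from the residual regression can differ slightly from those of the full regression, because the residual regression does not account for the degrees of freedom used to estimate the coefficients on $W$.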

We know that the partialling-out approach works well when the dimension of $W$ is low
in relation to the sample size $n$. When the dimension of $W$ is relatively high, we need to use variable selection
or penalization for regularization purposes. 

In the following, we illustrate the partialling-out approach using lasso instead of OLS. 

## Summarize the results

In [None]:
DataFrame(models = [ "Without controls", "full reg", "partial reg" ], 
Estimate = [nocontrol_est, control_est, partial_est], 
StdError = [nocontrol_se1, control_se1, partial_se1])

It is worth noting that controlling for worker characteristics increases the gender wage gap from less than 4\% to 7\%. The controls we used in our analysis include 5 educational attainment indicators (less than high school graduates, high school graduates, some college, college graduate, and advanced degree), 4 region indicators (midwest, south, west, and northeast), a quartic term (first, second, third, and fourth power) in experience, and 22 occupation and 23 industry indicators.

Keep in mind that the predictive effect (PE) does not only measure discrimination (the causal effect of being female); it may also reflect
selection effects of unobserved differences in covariates between men and women in our sample.


Next, we try an "extra" flexible model, where we take all pairwise interactions of the controls, giving us about 1000 controls.

## "Extra" flexible model

In [None]:
import Pkg
Pkg.add("StatsModels")
Pkg.add("Combinatorics")
Pkg.add("IterTools")
# We have to extend StatsModels ourselves (with help from Combinatorics), because
# Julia does not interpret (a formula)^2 as the set of interactions between the
# variables; it treats the expression as a single term.

In [None]:
# this code fixes the problem mentioned above
using StatsModels, Combinatorics, IterTools

combinations_upto(x, n) = Iterators.flatten(combinations(x, i) for i in 1:n)
expand_exp(args, deg::ConstantTerm) =
    tuple(((&)(terms...) for terms in combinations_upto(args, deg.n))...)

StatsModels.apply_schema(t::FunctionTerm{typeof(^)}, sch::StatsModels.Schema, ctx::Type) =
    apply_schema.(expand_exp(t.args_parsed...), Ref(sch), ctx)

StatsModels.apply_schema(t::FunctionTerm{typeof(^)}, sch::StatsModels.FullRank, ctx::Type) =
    apply_schema.(expand_exp(t.args_parsed...), Ref(sch), ctx)
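
To see what the `^2` expansion is meant to produce, the following sketch lists the terms implied by squaring a sum of variables: every main effect plus every pairwise interaction. It uses only base Julia; the helper `pairwise_terms` is a hypothetical name introduced for illustration and is not part of `StatsModels`.

```julia
# Hypothetical helper: list the terms implied by (x1 + ... + xp)^2,
# i.e. every main effect followed by every pairwise interaction.
function pairwise_terms(vars::Vector{String})
    terms = copy(vars)                                   # main effects
    for i in 1:length(vars)-1, j in i+1:length(vars)
        push!(terms, string(vars[i], " & ", vars[j]))    # interaction of variables i and j
    end
    return terms
end

controls = ["exp1", "exp2", "exp3", "exp4", "shs", "hsg", "scl", "clg",
            "occ2", "ind2", "mw", "so", "we"]
terms = pairwise_terms(controls)
println(length(terms), " formula terms, e.g. ", terms[1], " and ", terms[end])
```

The number of columns in the resulting model matrix is much larger than the number of formula terms, because the occupation and industry variables each expand into one indicator per category.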

In [None]:
extra_flex = @formula(lwage ~  sex + (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)^2)

control_fit = lm(extra_flex, data)
control_est = GLM.coef(control_fit)[2]

println("Number of Extra-Flex Controls: ", size(modelmatrix(control_fit))[2] -1) #minus the intercept
println("Coefficient for OLS with extra flex controls ", control_est)

#std error
control_se = GLM.stderror(control_fit)[2];