# An inferential problem: The Gender Wage Gap

In the previous lab, we analyzed data from the March Supplement of the U.S. Current Population Survey (2015) and addressed the question of how to use job-relevant characteristics, such as education and experience, to best predict wages. Now, we focus on the following inference question:

What is the difference in predicted wages between men and women with the same job-relevant characteristics?

Thus, we analyze whether there is a difference in pay between men and women (the *gender wage gap*). The gender wage gap may partly reflect *discrimination* against women in the labor market and may partly reflect a *selection effect*, namely that women are relatively more likely to take on occupations that pay somewhat less (for example, school teaching).

To investigate the gender wage gap, we consider the following log-linear regression model

\begin{align}
\log(Y) &= \beta'X + \epsilon\\
&= \beta_1 D  + \beta_2' W + \epsilon,
\end{align}

where $D$ is an indicator for being female ($1$ if female and $0$ otherwise) and the
$W$'s are controls explaining variation in wages. Because wages enter in logarithms, $\beta_1$ measures the relative (approximately percentage) difference in pay between men and women.
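
To make the "relative difference" reading explicit, here is a short sketch. It reads $\beta_1 D + \beta_2' W$ as the predicted value of $\log(Y)$ and writes $\widehat{\log(Y)}(d, W)$ for that prediction at $D = d$ (notation introduced here, not in the original text):

\begin{align}
\widehat{\log(Y)}(1, W) - \widehat{\log(Y)}(0, W) &= \beta_1,\\
\exp(\beta_1) - 1 &\approx \beta_1 \quad \text{for small } |\beta_1|.
\end{align}

So $\beta_1$ is a gap in log points, $\exp(\beta_1) - 1$ is the implied proportional gap, and the two nearly coincide when $|\beta_1|$ is small.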

## Data analysis

We consider the same subsample of the U.S. Current Population Survey (2015) as in the previous lab. Let us load the data set.

In [2]:
load("../data/wage2015_subsample_inference.Rdata")
attach(data)

dim(data)

To start our (causal) analysis, we restrict the sample to workers whose level of schooling is some college (`scl`) or college graduate (`clg`), and then compare the sample means by gender:

In [3]:
library(xtable)

Z_scl <- data[data$scl==1,]
Z_clg <- data[data$clg==1,]
Z_data <- rbind(Z_scl,Z_clg)

Z <- Z_data[which(colnames(Z_data) %in% c("lwage","sex","scl","clg","ne","mw","so","we","exp1"))]

data_female <- Z_data[Z_data$sex==1,]
Z_female <- data_female[which(colnames(Z_data) %in% c("lwage","sex","scl","clg","ne","mw","so","we","exp1"))]


data_male <- Z_data[Z_data$sex==0,]
Z_male <- data_male[which(colnames(Z_data) %in% c("lwage","sex","scl","clg","ne","mw","so","we","exp1"))]

table <- matrix(0, 9, 3)
table[1:9,1]   <- as.numeric(lapply(Z,mean))
table[1:9,2]   <- as.numeric(lapply(Z_male,mean))
table[1:9,3]   <- as.numeric(lapply(Z_female,mean))
rownames(table) <- c("Log Wage","Sex","Some College","College Graduate","Midwest","South","West","Northeast","Experience")
colnames(table) <- c("All","Men","Women")
tab<- xtable(table, digits = 4)
tab

| Variable | All | Men | Women |
|---|---|---|---|
| Log Wage | 3.0000223 | 3.0384121 | 2.9569035 |
| Sex | 0.4709909 | 0.0 | 1.0 |
| Some College | 0.4667536 | 0.4818238 | 0.449827 |
| College Graduate | 0.5332464 | 0.5181762 | 0.550173 |
| Midwest | 0.2659713 | 0.2612446 | 0.2712803 |
| South | 0.285854 | 0.2908195 | 0.2802768 |
| West | 0.2216428 | 0.228589 | 0.2138408 |
| Northeast | 0.2265319 | 0.2193469 | 0.2346021 |
| Experience | 12.7009452 | 12.4331485 | 13.0017301 |


In particular, the table above shows that the average log wage of women is about $0.0815$ lower than that of men:

In [5]:
mean(data_female$lwage)-mean(data_male$lwage)

Thus, the unconditional gender wage gap is about $8.15$\% for the group of never-married workers with some college or a college degree (women in this group are paid less on average in our sample). We also observe that these working women are somewhat more likely to be college graduates than the men and, according to the table, have slightly more working experience on average (about $13.0$ versus $12.4$).

This unconditional (predictive) effect of gender equals the coefficient $\beta$ in the univariate OLS regression of $\log(Y)$ on $D$ (with an intercept):

\begin{align}
\log(Y) &=\beta D + \epsilon.
\end{align}
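
One way to see why, as a sketch: assume the regression includes an intercept (which `lm` adds by default, although the displayed equation omits it), and write $n_1$ and $n_0$ for the numbers of women and men (notation introduced here). The OLS slope on a binary regressor is then

\begin{align}
\hat\beta &= \frac{\sum_i (D_i - \bar D)\log(Y_i)}{\sum_i (D_i - \bar D)^2}
= \frac{1}{n_1}\sum_{i:\, D_i = 1}\log(Y_i) - \frac{1}{n_0}\sum_{i:\, D_i = 0}\log(Y_i),
\end{align}

which is the average log wage of women minus that of men. With the table's values this is $2.9569 - 3.0384 \approx -0.0815$.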

We verify this by running an OLS regression in R.

In [4]:
library(sandwich)
nocontrol.fit <- lm(lwage ~ sex,data=Z_data)
nocontrol.est <- summary(nocontrol.fit)$coef["sex",1]
HCV.coefs <- vcovHC(nocontrol.fit, type = 'HC');
nocontrol.se <- sqrt(diag(HCV.coefs))[2] # Estimated std errors

# print unconditional effect of gender and the corresponding standard error
cat ("The estimated gender coefficient is",nocontrol.est," and the corresponding robust standard error is",nocontrol.se)


The estimated gender coefficient is -0.08150856  and the corresponding robust standard error is 0.01957965

Note that the standard error is computed with the *R* package *sandwich* to be robust to heteroskedasticity; `type = 'HC'` selects White's estimator (also known as HC0).
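
To illustrate what a heteroskedasticity-robust standard error computes, the following minimal sketch builds the HC0 sandwich formula by hand in base R. It uses simulated data, not the CPS sample, and all variable names (`d`, `y`, `V_hc0`, and so on) are hypothetical choices for this illustration.

```r
# Simulated data (hypothetical; not the CPS sample used in this lab)
set.seed(1)
n <- 500
d <- rbinom(n, 1, 0.5)                            # binary regressor, like the female indicator
y <- 3 - 0.1 * d + rnorm(n, sd = 0.4 + 0.3 * d)   # error variance depends on d

fit <- lm(y ~ d)
X <- model.matrix(fit)   # n x 2 design matrix: intercept and d
e <- residuals(fit)      # OLS residuals

# HC0 sandwich: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)   # row i of X scaled by e_i, so this is X' diag(e^2) X
V_hc0 <- bread %*% meat %*% bread

se_robust    <- sqrt(diag(V_hc0))       # heteroskedasticity-robust standard errors
se_classical <- sqrt(diag(vcov(fit)))   # classical standard errors, for comparison
rbind(robust = se_robust, classical = se_classical)
```

With *sandwich* installed, `sqrt(diag(vcovHC(fit, type = "HC0")))` should give the same robust values; the manual version is shown only to make the formula concrete.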


Next, we run an OLS regression of $\log(Y)$ on $(D,W)$ to control for the effect of the covariates summarized in $W$:

\begin{align}
\log(Y) &=\beta_1 D  + \beta_2' W + \epsilon.
\end{align}

Here, we use the flexible model from the previous lab. Hence, $W$ controls for experience, education, region, and occupation and industry indicators, plus transformations of experience (`exp2`, `exp3`, `exp4`) and two-way interactions between the experience terms and the other controls.

Let us run the OLS regression with controls.

In [8]:
# Ols regression with controls

flex <- lwage ~ sex + (exp1+exp2+exp3+exp4)*(clg+occ2+ind2+mw+so+we)

control.fit <- lm(flex, data=Z_data)
control.est <- summary(control.fit)$coef[2,1]

summary(control.fit)

cat("Coefficient for OLS with controls", control.est)

HCV.coefs <- vcovHC(control.fit, type = 'HC');
control.se <- sqrt(diag(HCV.coefs))[2] # Estimated std errors


Call:
lm(formula = flex, data = Z_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.87897 -0.27894 -0.00777  0.25823  2.85755 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.4357559  0.5208785   6.596 5.02e-11 ***
sex         -0.0530623  0.0193532  -2.742 0.006149 ** 
exp1        -0.1962420  0.1996670  -0.983 0.325767    
exp2         4.3105015  2.3797581   1.811 0.070197 .  
exp3        -2.3067253  1.0176306  -2.267 0.023480 *  
exp4         0.3513392  0.1382879   2.541 0.011118 *  
clg          0.2498673  0.1237850   2.019 0.043627 *  
occ22        0.2151993  0.1582208   1.360 0.173900    
occ23        0.0487642  0.2095297   0.233 0.815986    
occ24        0.0281449  0.2300084   0.122 0.902619    
occ25       -0.2711807  0.3944166  -0.688 0.491793    
occ26       -0.2000530  0.2705614  -0.739 0.459725    
occ27       -0.1203371  0.4188017  -0.287 0.773875    
occ28       -0.1719401  0.272

Coefficient for OLS with controls -0.05306234

The estimated regression coefficient $\hat\beta_1\approx-0.053$ measures how our linear prediction of the log wage changes if we switch the gender variable $D$ from 0 to 1, holding the controls $W$ fixed.
We can call this the *predictive effect* (PE), as it measures the impact of a variable on the prediction we make. Overall, we see that the unconditional wage gap of about $8$\% for women decreases to about $5$\% after controlling for worker characteristics. The coefficient on `clg` (about $0.25$) also suggests that college graduates earn roughly $25$\% more than workers with some college; since `clg` is interacted with the experience terms, this figure applies at zero experience and the premium varies with experience.
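
The percentages above read log-point coefficients directly as percent changes. As a small sketch, the exact conversion $\exp(\hat\beta) - 1$ can be applied to the two estimates printed earlier. The coefficient values are copied from the output; the vector names are hypothetical.

```r
# Estimated gender coefficients in log points, copied from the output above
log_gaps <- c(unconditional = -0.08150856, with_controls = -0.05306234)

# Exact proportional gaps implied by a log-linear model, in percent
pct_gaps <- 100 * (exp(log_gaps) - 1)
round(pct_gaps, 2)
```

This gives roughly $-7.8$\% and $-5.2$\%, close to the log-point readings because both coefficients are small.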


Next, we use the Frisch-Waugh-Lovell theorem from the lecture and partial out the linear effect of the controls via OLS.
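
In symbols, the partialling-out steps can be sketched as follows, where $\hat\gamma_{YW}$ and $\hat\gamma_{DW}$ (names introduced here for illustration) denote the OLS coefficients from regressing $\log(Y)$ and $D$ on $W$:

\begin{align}
\tilde Y &= \log(Y) - \hat\gamma_{YW}' W,\\
\tilde D &= D - \hat\gamma_{DW}' W,\\
\hat\beta_1 &= \frac{\sum_i \tilde D_i \tilde Y_i}{\sum_i \tilde D_i^2}.
\end{align}

The last line is the OLS slope from regressing the residual $\tilde Y$ on the residual $\tilde D$. When $W$ contains a constant, both residuals have mean zero, so including an intercept in that final regression does not change the slope.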

In [None]:
# Partialling-Out using ols

# models
flex.y <- lwage ~  (exp1+exp2+exp3+exp4)*(clg+occ2+ind2+mw+so+we) # model for Y
flex.d <- sex ~ (exp1+exp2+exp3+exp4)*(clg+occ2+ind2+mw+so+we) # model for D

# partialling-out the linear effect of W from Y
t.Y <- residuals(lm(flex.y, data=Z_data))
# partialling-out the linear effect of W from D
t.D <- residuals(lm(flex.d, data=Z_data))

# regression of Y on D after partialling-out the effect of W
partial.fit <- lm(t.Y~t.D, data=Z_data)
partial.est <- summary(partial.fit)$coef[2,1]

cat("Coefficient for D via partialling-out", partial.est)

# standard error
HCV.coefs <- vcovHC(partial.fit, type = 'HC')
partial.se <- sqrt(diag(HCV.coefs))[2]

# confidence interval (confint() uses the classical, non-robust standard errors)
confint(partial.fit)[2,]

Again, the estimated coefficient measures the linear predictive effect (PE) of $D$ on $Y$ after taking out the linear effect of $W$ on both of these variables. This coefficient equals the estimated coefficient from the OLS regression with controls.
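
The equality of the two coefficients can be checked numerically. The following minimal sketch does so on simulated data with two made-up controls `w1` and `w2`; it is a hypothetical setup, not the lab's data or its flexible specification.

```r
# Simulated data (hypothetical): binary d correlated with the control w1
set.seed(42)
n  <- 1000
w1 <- rnorm(n)
w2 <- rnorm(n)
d  <- as.numeric(0.5 * w1 + rnorm(n) > 0)
y  <- 1 - 0.05 * d + 0.3 * w1 - 0.2 * w2 + rnorm(n)

# (a) coefficient on d from the regression with controls
b_full <- unname(coef(lm(y ~ d + w1 + w2))["d"])

# (b) partial out the controls from y and from d, then regress residual on residual
y_res <- residuals(lm(y ~ w1 + w2))
d_res <- residuals(lm(d ~ w1 + w2))
b_partial <- unname(coef(lm(y_res ~ d_res))["d_res"])

# The two estimates agree up to floating-point error
all.equal(b_full, b_partial)
```

The same comparison on the lab's objects would be `all.equal(control.est, partial.est)`.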