# Estimating Gender Discrimination in the Workplace
In this exercise we’ll use data from the 2018 US Current Population Survey (CPS) to try to estimate the effect of being a woman on workplace compensation. Our focus is only on differential compensation in the workplace, so it is important to bear in mind that our estimates do not capture all forms of gender discrimination. For example, these analyses will not account for gender discrimination in hiring.

## Exercise 1
Begin by downloading and importing 2018 CPS data from http://www.github.com/nickeubank/MIDS_Data/Current_Population_Survey. The file is called morg18.dta and is a Stata dataset. Additional information about the dataset can be found by following the links in the README.txt file in the folder, but for the moment it is sufficient to know this is a national survey run in the United States.

The survey includes survey weights that we won’t be using (weights exist because not everyone in the sample was included with the same probability), so the numbers we estimate will not be perfect estimates of the gender wage gap in the United States, but they are pretty close.

In [5]:
import pandas as pd

# Load survey
df = pd.read_stata('./data/morg18.dta')
df.head()

| | hhid | intmonth | hurespli | hrhtype | minsamp | hrlonglk | hrsample | hrhhid2 | serial | hhnum | ... | ym_file | ym | ch02 | ch35 | ch613 | ch1417 | ch05 | ihigrdc | docc00 | dind02 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4795110719 | January | 1.0 | Husband/wife primary fam (neither in Armed For... | MIS 8 | MIS 2-4 Or MIS 6-8 (link To | 601 | 6011 | 1 | 1 | ... | 696 | 681 | 0 | 0 | 0 | 0 | 0 | 14.0 | | |
| 1 | 4795110719 | January | 1.0 | Husband/wife primary fam (neither in Armed For... | MIS 8 | MIS 2-4 Or MIS 6-8 (link To | 601 | 6011 | 1 | 1 | ... | 696 | 681 | 0 | 0 | 0 | 0 | 0 | 13.0 | | |
| 2 | 110339935453 | January | 1.0 | Unmarried civilian female primary fam householder | MIS 4 | MIS 2-4 Or MIS 6-8 (link To | 701 | 7011 | 1 | 1 | ... | 696 | 693 | 0 | 0 | 0 | 1 | 0 | 12.0 | Office and administrative support occupations | Health care services , except hospitals |
| 3 | 110339935453 | January | 1.0 | Unmarried civilian female primary fam householder | MIS 4 | MIS 2-4 Or MIS 6-8 (link To | 701 | 7011 | 1 | 1 | ... | 696 | 693 | 0 | 0 | 0 | 0 | 0 | 12.0 | Office and administrative support occupations | Administrative and support services |
| 4 | 110359424339 | January | 1.0 | Unmarried civilian female primary fam householder | MIS 4 | MIS 2-4 Or MIS 6-8 (link To | 711 | 7111 | 1 | 1 | ... | 696 | 693 | 0 | 0 | 0 | 0 | 0 | | Healthcare practitioner and technical occupations | Hospitals |


## Exercise 2
Because our interest is only in in-the-workplace wage discrimination among full-time workers, we need to start by subsetting our data to people who were employed at the time of the survey (using the lfsr94 variable) and who work full time (meaning that their usual hours per week, uhourse, is 35 or above).

As noted above, this analysis will miss many forms of gender discrimination. For example, in dropping anyone who isn’t working, we immediately lose any women who couldn’t get jobs, or who chose to leave the workforce because the wages they were offered (which were likely lower than those offered to men) were lower than they were willing or able to accept. And in focusing on full-time employees, we miss the fact that women may not be offered full-time jobs at the same rate as men.

In [7]:
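# Keep respondents who were employed and at work when surveyed
df = df[df.lfsr94 == 'Employed-At Work']
# Keep full-time workers; .copy() avoids SettingWithCopyWarning on later assignments
df = df[df.uhourse >= 35].copy()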

df.shape

(133814, 98)

## Exercise 3
Now let’s estimate the basic wage gap for the United States!

Earnings per hour worked can be found in the earnhre variable. Two things are worth noting about this variable:

- It is coded in cents (1/100 of a dollar), not dollars, so make sure to divide by 100 to get dollars.

- Earnings are “top-coded” at 9999 (meaning any value above 99.99 dollars an hour is coded as 99.99 dollars an hour). Thankfully these are rare, so we’ll just leave them in as-is for now. However, note that wage inequality is likely to be especially high among extremely highly paid individuals (e.g. most CEOs are men), so top-coding will bias us towards slightly conservative (low) estimates of the gender wage gap.
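
As a quick sanity check on both notes, here is a minimal sketch on made-up values. The `toy` DataFrame and the constant `TOP_CODE_CENTS` are illustrative names, not part of the CPS file. It converts cents to dollars and measures how common top-coded wages are; on the real data the same steps would be applied to `df['earnhre']`.

```python
import pandas as pd

# Made-up hourly earnings in cents; 9999 is the top-code value described above.
TOP_CODE_CENTS = 9999
toy = pd.DataFrame({"earnhre": [1500, 2250, 9999, 4000, 9999]})

# Convert cents to dollars.
toy["earnhre_dollars"] = toy["earnhre"] / 100

# Flag observations at the top-code and compute their share of the sample.
toy["top_coded"] = toy["earnhre"] >= TOP_CODE_CENTS
share_top_coded = toy["top_coded"].mean()

print(f"Share top-coded: {share_top_coded:.1%}")
```

If the share were large in the real sample, leaving top-coded values in place would deserve more scrutiny than it gets here.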

Using the variable sex (1=Male, 2=Female), estimate the gender wage gap in terms of wages per hour worked!

In [11]:
import statsmodels.formula.api as smf

# Convert hourly earnings from cents to dollars
df['earnhre_dollars'] = df['earnhre'] / 100

# Indicator equal to 1 for women (sex == 2) and 0 for men
df['female'] = (df.sex == 2).astype('int')
smf.ols('earnhre_dollars ~ female', df).fit().summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | earnhre_dollars | R-squared: | 0.015 |
| Model: | OLS | Adj. R-squared: | 0.015 |
| Method: | Least Squares | F-statistic: | 999.7 |
| Date: | Fri, 19 Feb 2021 | Prob (F-statistic): | 8.84e-218 |
| Time: | 15:54:40 | Log-Likelihood: | -244780.0 |
| No. Observations: | 65755 | AIC: | 489600.0 |
| Df Residuals: | 65753 | BIC: | 489600.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 20.5546 | 0.053 | 386.055 | 0.000 | 20.450 | 20.659 |
| female | -2.4759 | 0.078 | -31.619 | 0.000 | -2.629 | -2.322 |

| | | | |
|---|---|---|---|
| Omnibus: | 31043.263 | Durbin-Watson: | 1.852 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 210689.419 |
| Skew: | 2.182 | Prob(JB): | 0.0 |
| Kurtosis: | 10.606 | Cond. No. | 2.54 |


According to the OLS model, there is a gender wage gap: on average, women earn about 2.48 dollars less per hour worked than men, and the difference is statistically significant (p < 0.001).

## Exercise 4
The variable uhourse is the number of hours that the respondent usually works per week. What is the wage gap not per hour, but per year? Is the difference statistically significant?

In [12]:
# Annual earnings = hourly wage x usual weekly hours x 52 weeks
df['annual_earnings'] = df['earnhre_dollars'] * df['uhourse'] * 52

smf.ols('annual_earnings ~ female', df).fit().summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | annual_earnings | R-squared: | 0.024 |
| Model: | OLS | Adj. R-squared: | 0.024 |
| Method: | Least Squares | F-statistic: | 1637.0 |
| Date: | Fri, 19 Feb 2021 | Prob (F-statistic): | 0.0 |
| Time: | 16:10:10 | Log-Likelihood: | -753350.0 |
| No. Observations: | 65755 | AIC: | 1507000.0 |
| Df Residuals: | 65753 | BIC: | 1507000.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 4.511e+04 | 121.688 | 370.664 | 0.000 | 4.49e+04 | 4.53e+04 |
| female | -7240.6673 | 178.971 | -40.457 | 0.000 | -7591.450 | -6889.884 |

| | | | |
|---|---|---|---|
| Omnibus: | 33612.087 | Durbin-Watson: | 1.863 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 291856.174 |
| Skew: | 2.309 | Prob(JB): | 0.0 |
| Kurtosis: | 12.23 | Cond. No. | 2.54 |


The gap persists in annual earnings: women's estimated annual earnings are about 7,241 dollars lower than men's. The p-value on the female coefficient (reported as 0.000) indicates that the difference is statistically significant.

## Exercise 5
We just compared all full-time working men to all full-time working women. For this to be an accurate causal estimate of the effect of being a woman in the workplace, what must be true of these two groups? What is one reason that this may not be true?

## Exercise 6
One answer to the second part of Exercise 5 is that working women are likely to be younger, since a larger share of younger women are entering the workforce than was the case for older generations.

To control for this difference, let’s now regress annual earnings on gender and age. What is the implied average annual wage difference between women and men? Is it different from your raw estimate? Is the difference statistically significant?
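
To see why adding age can move the coefficient on `female`, here is a minimal sketch on simulated data, not the CPS sample. It assumes a column named `age` (the exercise does not name the age variable, so check the dataset documentation), builds in a gender gap of 5,000 dollars, and makes women about five years younger on average purely for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000

# Simulated respondents: women are about 5 years younger on average
# (an assumption made only for this illustration).
female = rng.integers(0, 2, size=n)
age = 45 - 5 * female + rng.normal(0, 10, size=n)

# Earnings rise with age; the built-in gender gap is -5,000 dollars.
annual_earnings = 20_000 + 600 * age - 5_000 * female + rng.normal(0, 8_000, size=n)
toy = pd.DataFrame({"female": female, "age": age, "annual_earnings": annual_earnings})

# Raw comparison versus comparison holding age fixed.
raw = smf.ols("annual_earnings ~ female", data=toy).fit()
adjusted = smf.ols("annual_earnings ~ female + age", data=toy).fit()

raw_gap = raw.params["female"]
adjusted_gap = adjusted.params["female"]
print(f"Raw gap:          {raw_gap:,.0f}")
print(f"Age-adjusted gap: {adjusted_gap:,.0f}")
```

In this simulation the raw gap is larger in magnitude than the built-in gap because part of it reflects the age difference. Whether the same happens in the CPS data, and in which direction, is what the exercise asks you to find out.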

## Exercise 7
In running this regression and interpreting the coefficient on female, what is the implicit comparison you are making? In other words, when we run this regression and interpret the coefficient on female, we’re basically pretending we are comparing two groups and assuming they are counter-factuals for one another. What are these two groups?

## Exercise 8
Now let’s add to our regression an indicator variable for whether the respondent has at least graduated high school, and an indicator for whether the respondent at least has a BA.

In answering this question, use the following table of codes for the variable grade92.

Education is coded as follows:

![cps_educ_codes](./img/cps_educ_codes.png)  
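
Building "at least" indicators from an ordered code usually comes down to a threshold comparison. The sketch below uses made-up values and assumes the standard CPS coding in which 39 marks a high school diploma and 43 a bachelor's degree. Confirm both cutoffs against the table above before using them. It also assumes `grade92` is stored as numeric codes; if the Stata file loads it as labelled categories, the comparison would need adapting. The names `HS_GRAD_CODE`, `BA_CODE`, `hs_or_more` and `ba_or_more` are illustrative.

```python
import pandas as pd

# Assumed cutoffs; verify against the education code table.
HS_GRAD_CODE = 39  # high school graduate (assumption)
BA_CODE = 43       # bachelor's degree (assumption)

# Made-up education codes standing in for df['grade92'].
toy = pd.DataFrame({"grade92": [35, 39, 40, 43, 44]})

# 1 if the respondent reached at least that level, 0 otherwise.
toy["hs_or_more"] = (toy["grade92"] >= HS_GRAD_CODE).astype(int)
toy["ba_or_more"] = (toy["grade92"] >= BA_CODE).astype(int)

print(toy)
```

Because the two indicators are nested (everyone with a BA also has `hs_or_more` equal to 1), the coefficient on `ba_or_more` in a regression is the additional difference associated with a BA beyond finishing high school.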

## Exercise 9
In running this regression and interpreting the coefficient on female, what is the implicit comparison you are making? In other words, when we run this regression and interpret the coefficient on female, we are once more basically pretending we are comparing two groups and assuming they are counter-factuals for one another. What are these two groups?

## Exercise 10
Given how the coefficient on female has changed between Exercise 6 and Exercise 8, what can you infer about the educational attainment of the women in your survey data (as compared to the educational attainment of men)?

## Exercise 11
What does that tell you about the potential outcomes of men and women before you added education as a control?

## Exercise 12
Finally, let’s include fixed effects for the type of job held by each respondent.

Fixed effects are a method used when we have a nested data structure in which respondents belong to groups, and those groups may all be subject to different pressures. In this context, for example, we can add fixed effects for the industry of each respondent – since wages often vary across industries, controlling for industry is likely to improve our estimates.

(Note that fixed effects are very similar in principle to hierarchical models. There are some differences you will read about for our next class, but they are designed to serve the same role, just with slightly different mechanics).

When we add fixed effects for groups like this, our interpretation of the other coefficients changes. Whereas in previous exercises we were trying to explain variation in men and women’s wages across all respondents, we are now effectively comparing men and women’s wages within each employment sector. Our coefficient on female, in other words, now tells us how much less (on average) we would expect a woman to be paid than a man within the same industry, not across all respondents.

(Note that running this regression will result in lots of coefficients popping up you don’t care about. We’ll introduce some more efficient methods for adding fixed effects that aren’t so messy in a later class – for now, you can ignore those coefficients!)
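
One common way to add fixed effects with the statsmodels formula interface is to wrap the grouping column in `C()`, which expands it into one indicator per category. The sketch below runs on simulated data, not the CPS sample. The column name `dind02` is borrowed from the industry column visible in the `df.head()` output; whether it is the right grouping variable for this exercise is an assumption, and the pay levels and shares of women are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 3000

# Invented industries with different pay levels and different shares of women.
base_pay = {"Hospitals": 40_000, "Construction": 50_000, "Retail trade": 35_000}
share_female = {"Hospitals": 0.7, "Construction": 0.2, "Retail trade": 0.5}

toy = pd.DataFrame({"dind02": rng.choice(list(base_pay), size=n)})
toy["female"] = (rng.random(n) < toy["dind02"].map(share_female)).astype(int)

# Built-in within-industry gap of -3,000 dollars.
toy["annual_earnings"] = (
    toy["dind02"].map(base_pay) - 3_000 * toy["female"] + rng.normal(0, 5_000, size=n)
)

# Pooled comparison versus comparison within industry (industry fixed effects).
pooled = smf.ols("annual_earnings ~ female", data=toy).fit()
within = smf.ols("annual_earnings ~ female + C(dind02)", data=toy).fit()

pooled_gap = pooled.params["female"]
within_gap = within.params["female"]
print(f"Pooled gap:          {pooled_gap:,.0f}")
print(f"Within-industry gap: {within_gap:,.0f}")
```

Reading only `params["female"]` sidesteps the long list of industry coefficients in the full summary. In this simulation the pooled gap is larger in magnitude because women are concentrated in a lower-paying industry; the real data may behave differently.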

## Exercise 13
Now that we’ve added industry fixed effects, what groups are we implicitly treating as counter-factuals for one another?

## Exercise 14
What happened to your estimate of the gender wage gap when you added industry fixed effects? What does that tell you about the industries chosen by women as opposed to men?