# Estimating Gender Discrimination in the Workplace

In this exercise we'll use data from the 2018 US Current Population Survey (CPS) to try and estimate the effect of being a woman on workplace compensation. 

Note that our focus will be *only* on differential compensation in the work place, and as a result it is important to bear in mind that our estimates are not estimates of *all* forms of gender discrimination. For example, these analyses will not account for things like gender discrimination in terms of *getting* jobs. We'll discuss this in more detail below.

## Exercise 1: 

Begin by downloading and importing 2018 CPS data from [https://github.com/nickeubank/MIDS_Data/tree/master/Current_Population_Survey](https://github.com/nickeubank/MIDS_Data/tree/master/Current_Population_Survey). The file is called `morg18.dta` and is a Stata dataset. Additional data on the dataset can be found by following the links in the README.txt file in the folder, but for the moment it is sufficient to know this is a national survey run in the United States.

The survey does include some survey weights we won't be using (i.e. not everyone in the sample was included with the same probability), so the numbers we estimate will not be perfect estimates of the gender wage gap in the United States, but they are pretty close.

In [53]:
# https://github.com/nickeubank/MIDS_Data/raw/refs/heads/master/Current_Population_Survey/morg18.dta
import pandas as pd

data = "morg18.dta"
data_cps = pd.read_stata(data)

print(data_cps.head())

              hhid intmonth  hurespli  \
0  000004795110719  January       1.0   
1  000004795110719  January       1.0   
2  000110339935453  January       1.0   
3  000110339935453  January       1.0   
4  000110359424339  January       1.0   

                                             hrhtype minsamp  \
0  Husband/wife primary fam (neither in Armed For...   MIS 8   
1  Husband/wife primary fam (neither in Armed For...   MIS 8   
2  Unmarried civilian female primary fam householder   MIS 4   
3  Unmarried civilian female primary fam householder   MIS 4   
4  Unmarried civilian female primary fam householder   MIS 4   

                      hrlonglk hrsample hrhhid2 serial  hhnum  ... ym_file  \
0  MIS 2-4 Or MIS 6-8 (link To     0601   06011      1      1  ...     696   
1  MIS 2-4 Or MIS 6-8 (link To     0601   06011      1      1  ...     696   
2  MIS 2-4 Or MIS 6-8 (link To     0701   07011      1      1  ...     696   
3  MIS 2-4 Or MIS 6-8 (link To     0701   07011      1  

## Exercise 2:

Because our interest is only in-the-workplace wage discrimination among full-time workers, we need to start by subsetting our data for people currently employed (and "at work", not "absent") at the time of this survey using the `lfsr94` variable, who are employed full time (meaning that their usual hours per week—`uhourse`—is 35 or above).

As noted above, this analysis will miss many forms of gender discrimination. For example, in dropping anyone who isn't working, we immediately lose any women who couldn't get jobs, or who chose to lose the workforce because the wages they were offered (which were likely lower than those offered men) were lower than they were willing / could accept. And in focusing on full time employees, we miss the fact women may not be offered full time jobs at the same rate as men. 

In [58]:
data_cps[["lfsr94", "uhourse"]]

Unnamed: 0,lfsr94,uhourse
0,Disabled-Not In Labor Force,
1,Retired-Not In Labor Force,
2,Employed-At Work,40.0
3,Employed-At Work,40.0
4,Employed-At Work,40.0
...,...,...
302327,Retired-Not In Labor Force,
302328,Employed-At Work,40.0
302329,Employed-At Work,35.0
302330,Retired-Not In Labor Force,


In [None]:
data_cps_ft = data_cps[
    (data_cps["lfsr94"] == "Employed-At Work") & (data_cps["uhourse"] >= 35)
].copy()

data_cps_ft.head()

Unnamed: 0,hhid,intmonth,hurespli,hrhtype,minsamp,hrlonglk,hrsample,hrhhid2,serial,hhnum,...,ym_file,ym,ch02,ch35,ch613,ch1417,ch05,ihigrdc,docc00,dind02
2,110339935453,January,1.0,Unmarried civilian female primary fam householder,MIS 4,MIS 2-4 Or MIS 6-8 (link To,701,7011,1,1,...,696,693,0,0,0,1,0,12.0,Office and administrative support occupations,"Health care services , except hospitals"
3,110339935453,January,1.0,Unmarried civilian female primary fam householder,MIS 4,MIS 2-4 Or MIS 6-8 (link To,701,7011,1,1,...,696,693,0,0,0,0,0,12.0,Office and administrative support occupations,Administrative and support services
4,110359424339,January,1.0,Unmarried civilian female primary fam householder,MIS 4,MIS 2-4 Or MIS 6-8 (link To,711,7111,1,1,...,696,693,0,0,0,0,0,,Healthcare practitioner and technical occupations,Hospitals
6,110651278174,January,1.0,Civilian male primary individual,MIS 8,MIS 2-4 Or MIS 6-8 (link To,601,6011,1,1,...,696,681,0,0,0,0,0,12.0,Transportation and material moving occupations,Transportation and warehousing
17,7680515071194,January,1.0,Civilian male primary individual,MIS 8,MIS 2-4 Or MIS 6-8 (link To,611,6112,2,2,...,696,681,0,0,0,0,0,12.0,Transportation and material moving occupations,Retail trade


## Exercise 3

Now let's estimate the basic wage gap for the United States!

Earnings per week worked can be found in the `earnwke` variable. Using the variable `sex` (1=Male, 2=Female), estimate the gender wage gap in terms of wages per hour worked!

(You may also find it helpful, for context, to estimate the average hourly pay by dividing weekly pay by `uhourse`.)

In [None]:
# Calculate hourly wage
data_cps_ft["hourly_wage"] = data_cps_ft["earnwke"] / data_cps_ft["uhourse"]

# Calculate average hourly wage by gender
avg_hourly_wage = data_cps_ft.groupby("sex")["hourly_wage"].mean()
print(avg_hourly_wage)

sex
1    27.883330
2    23.803158
Name: hourly_wage, dtype: float64


## Exercise 4

Assuming 48 work weeks in a year, calculate annual earnings for men and women. Report the difference in dollars and in percentage terms.

In [None]:
# Calculate annual earnings
data_cps_ft["annual_earnings"] = (
    data_cps_ft["hourly_wage"] * data_cps_ft["uhourse"] * 48
)

# Calculate average annual earnings by gender
avg_annual_earnings = data_cps_ft.groupby("sex")["annual_earnings"].mean()
print(avg_annual_earnings)

# Calculate wage gap in dollars and percentage
wage_gap_dollars = avg_annual_earnings[1] - avg_annual_earnings[2]
wage_gap_percentage = (wage_gap_dollars / avg_annual_earnings[1]) * 100
print(f"Wage gap in dollars: {wage_gap_dollars}")
print(f"Wage gap in percentage: {wage_gap_percentage:.2f}%")

sex
1    57827.223276
2    47312.787581
Name: annual_earnings, dtype: float64
Wage gap in dollars: 10514.435694871754
Wage gap in percentage: 18.18%


## Exercise 5

We just compared all full-time working men to all full-time working women. For this to be an accurate *causal* estimate of the effect of being a woman in the work place, what must be true of these two groups? What is one reason that this may *not* be true?

For this to be a true causal estimate, full-time working men and women would need to be similar in every important way except for gender. That means factors like age, education, experience, occupation, and industry would need to be comparable across the two groups. This may not be true because men and women are not randomly assigned to jobs. If women are more likely to work in different industries or have different levels of experience, then our simple comparison captures those differences too not just the effect of gender.

## Exercise 6

One answer to the second part of Exercise 5 is that working women are likely to be younger, since a larger portion of younger women are entering the workforce as compared to older generations.

To *control* for this difference, let's now regress annual earnings on gender, age, and age-squared (the relationship between age and income is generally non-linear). What is the implied average annual wage difference between women and men? Is it different from your raw estimate? 

In [None]:
import statsmodels.api as sm

reg_data = data_cps_ft[["annual_earnings", "sex", "age"]].dropna().copy()

reg_data["female"] = (reg_data["sex"] == 2).astype(int)
reg_data["age_squared"] = reg_data["age"] ** 2

X = reg_data[["female", "age", "age_squared"]]
X = sm.add_constant(X)
y = reg_data["annual_earnings"]

model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:        annual_earnings   R-squared:                       0.060
Model:                            OLS   Adj. R-squared:                  0.060
Method:                 Least Squares   F-statistic:                     2625.
Date:                Mon, 23 Feb 2026   Prob (F-statistic):               0.00
Time:                        17:04:58   Log-Likelihood:            -1.4441e+06
No. Observations:              122603   AIC:                         2.888e+06
Df Residuals:                  122599   BIC:                         2.888e+06
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const        3.833e+04    315.241    121.587      

## Exercise 7

In running this regression and interpreting the coefficient on `female`, what is the implicit comparison you are making? In other words, when we run this regression and interpreting the coefficient on `female`, we're basically pretending we are comparing two groups and assuming they are counter-factuals for one another. What are these two groups?

When we interpret the coefficient on female, we’re implicitly comparing full-time working women to full-time working men who are the same age (since we control for age and age squared).We’re essentially treating these two groups as counterfactuals for each other and assuming that, conditional on age, the only systematic difference between them is gender.

## Exercise 8

Now let's add to our regression an indicator variable for whether the respondent has at least graduated high school, and an indicator for whether the respondent at least has a BA. 

In answering this question, use the following table of codes for the variable `grade92`. 

Education is coded as follows:
    
![CPS Educ Codes](./images/cps_educ_codes.png)

In [None]:
reg_data["high_school_grad"] = (data_cps_ft["grade92"] >= 4).astype(int)
reg_data["ba_grad"] = (data_cps_ft["grade92"] >= 11).astype(int)

X = reg_data[["female", "age", "age_squared", "high_school_grad", "ba_grad"]]
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:        annual_earnings   R-squared:                       0.060
Model:                            OLS   Adj. R-squared:                  0.060
Method:                 Least Squares   F-statistic:                     2625.
Date:                Mon, 23 Feb 2026   Prob (F-statistic):               0.00
Time:                        17:09:08   Log-Likelihood:            -1.4441e+06
No. Observations:              122603   AIC:                         2.888e+06
Df Residuals:                  122599   BIC:                         2.888e+06
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
female           -1.062e+04    181.123  

## Exercise 9 

In running this regression and interpreting the coefficient on `female`, what is the implicit comparison you are making? In other words, when we run this regression and interpreting the coefficient on `female`, we are once more basically pretending we are comparing two groups and assuming they are counter-factuals for one another. What are these two groups?

Now, the implicit comparison is between full-time working women and full-time working men who are the same age and have the same level of education (since we control for both). We’re treating those two groups as counterfactuals for each other and assuming that, conditional on age and education, the only systematic difference between them is gender.

## Exercise 10

Given how the coefficient on `female` has changed between Exercise 6 and Exercise 8, what can you infer about the educational attainment of the women in your survey data (as compared to the educational attainment of men)?

Since the coefficient on female decreases in magnitude after controlling for education, this suggests that women in the survey tend to have higher educational attainment than men. Part of the raw wage gap was being offset by women having more education on average. However, because the coefficient remains negative, education does not fully explain the wage gap.

## Exercise 11

What does that tell you about the *potential outcomes* of men and women before you added education as a control?

Before adding education as a control, we were comparing the potential earnings of men and women who differed both in gender and in education. That means the gap we saw was combining the effect of gender with the effect of differences in education. Once we add education, we get closer to comparing the potential outcomes of men and women with the same education level, which gives us a cleaner estimate of the gender effect.

## Exercise 12

Finally, let's include *fixed effects* for the type of job held by each respondent. 

Fixed effects are a method used when we have a nested data structure in which respondents belong to groups, and those groups may all be subject to different pressures. In this context, for example, we can add fixed effects for the industry of each respondent—since wages often vary across industries, controlling for industry is likely to improve our estimates. Use `ind02` to control for industry.

(Note that fixed effects are very similar in principle to hierarchical models. There are some differences [you will read about](../fixed_effects_v_hierarchical.ipynb) for our next class, but they are designed to serve the same role, just with slightly different mechanics). 

When we add fixed effects for groups like this, our interpretation of the other coefficients changes. Whereas in previous exercises we were trying to explain variation in men and women's wages *across all respondents*, we are now effectively comparing men and women's wages *within each employment sector*. Our coefficient on `female`, in other words, now tells us how much less (on average) we would expect a woman to be paid than a man *within the same industry*, not across all respondents. 

(Note that running this regression will result in lots of coefficients popping up you don't care about. We'll introduce some more efficient methods for adding fixed effects that aren't so messy in a later class -- for now, you can ignore those coefficients!)

In [None]:
import statsmodels.formula.api as smf

reg_data = data_cps_ft.dropna(subset=["annual_earnings", "age", "sex", "ind02"]).copy()


reg_data["female"] = (reg_data["sex"] == 2).astype(int)
reg_data["age_squared"] = reg_data["age"] ** 2

model_fe = smf.ols(
    "annual_earnings ~ female + age + age_squared + C(ind02)", data=reg_data
).fit()

print(model_fe.summary())

                            OLS Regression Results                            
Dep. Variable:        annual_earnings   R-squared:                       0.184
Model:                            OLS   Adj. R-squared:                  0.182
Method:                 Least Squares   F-statistic:                     105.9
Date:                Mon, 23 Feb 2026   Prob (F-statistic):               0.00
Time:                        17:38:08   Log-Likelihood:            -1.4355e+06
No. Observations:              122603   AIC:                         2.871e+06
Df Residuals:                  122341   BIC:                         2.874e+06
Df Model:                         261                                         
Covariance Type:            nonrobust                                         
                                                                                                                               coef    std err          t      P>|t|      [0.025      0.975]
---------------------

## Exercise 13

Now that we've added industry fixed effects, what groups are we implicitly treated as counter-factuals for one another now? 

After adding industry fixed effects, we are now comparing full-time working women to full-time working men within the same industry (and of the same age). Instead of comparing across the entire labor market, we are treating men and women working in the same sector as counterfactuals for one another, assuming that within an industry, the only systematic difference between them is gender.

## Exercise 14

What happened to your estimate of the gender wage gap when you added industry fixed effects? What does that tell you about the industries chosen by women as opposed to men?

When we added industry fixed effects, the estimated gender wage gap decreased. This suggests that part of the original wage gap was driven by men and women working in different industries. In other words, women are more likely to work in industries that pay less on average, and once we compare men and women within the same industry, the gap becomes smaller.

When you're done, please come read [this discussion](discussion_regressions_incomeineq.ipynb).