<span style="color: #9370DB">*Daniela Jiménez*</span>

<span style="color: #9370DB">*Bárbara Flores*</span>

# Estimating Gender Discrimination in the Workplace

In this exercise we'll use data from the 2018 US Current Population Survey (CPS) to estimate the effect of being a woman on workplace compensation.

Note that our focus will be *only* on differential compensation in the workplace, so it is important to bear in mind that our estimates are not estimates of *all* forms of gender discrimination. For example, these analyses will not account for things like gender discrimination in terms of *getting* jobs. We'll discuss this in more detail below.

## Exercise 1

Begin by downloading and importing 2018 CPS data from [https://github.com/nickeubank/MIDS_Data/tree/master/Current_Population_Survey](https://github.com/nickeubank/MIDS_Data/tree/master/Current_Population_Survey). The file is called `morg18.dta` and is a Stata dataset. Additional information about the dataset can be found by following the links in the README.txt file in the folder, but for the moment it is sufficient to know this is a national survey run in the United States.

The survey includes sampling weights (not everyone in the sample was included with the same probability), which we won't be using. As a result, the numbers we estimate will not be exact estimates of the gender wage gap in the United States, but they are pretty close.
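
To see why ignoring survey weights can shift an estimate, here is a minimal sketch on toy numbers (not CPS values). The column name `weight` is an assumption made for illustration; the dataset documentation linked in the README is the place to confirm the actual weight variable.

```python
import numpy as np
import pandas as pd

# Toy data, not CPS values. The column name "weight" is hypothetical;
# check the dataset documentation for the real weight variable.
toy = pd.DataFrame(
    {
        "earnwke": [1000.0, 800.0, 1200.0, 900.0],
        "weight": [1.0, 3.0, 1.0, 1.0],
    }
)

# Every respondent counts equally.
unweighted_mean = toy["earnwke"].mean()

# Respondents who represent more people count for more.
weighted_mean = np.average(toy["earnwke"], weights=toy["weight"])

print(f"Unweighted mean: {unweighted_mean:.2f}")
print(f"Weighted mean:   {weighted_mean:.2f}")
```

In this toy example the second respondent stands in for three people, so the weighted mean is pulled toward that respondent's lower earnings.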

In [38]:
import pandas as pd
import statsmodels.formula.api as smf

path = "https://github.com/nickeubank/MIDS_Data/raw/master/Current_Population_Survey/morg18.dta"
morg18 = pd.read_stata(path)
morg18.head()

| | hhid | intmonth | hurespli | hrhtype | minsamp | hrlonglk | hrsample | hrhhid2 | serial | hhnum | ... | ym_file | ym | ch02 | ch35 | ch613 | ch1417 | ch05 | ihigrdc | docc00 | dind02 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4795110719 | January | 1.0 | Husband/wife primary fam (neither in Armed For... | MIS 8 | MIS 2-4 Or MIS 6-8 (link To | 601 | 6011 | 1 | 1 | ... | 696 | 681 | 0 | 0 | 0 | 0 | 0 | 14.0 | | |
| 1 | 4795110719 | January | 1.0 | Husband/wife primary fam (neither in Armed For... | MIS 8 | MIS 2-4 Or MIS 6-8 (link To | 601 | 6011 | 1 | 1 | ... | 696 | 681 | 0 | 0 | 0 | 0 | 0 | 13.0 | | |
| 2 | 110339935453 | January | 1.0 | Unmarried civilian female primary fam householder | MIS 4 | MIS 2-4 Or MIS 6-8 (link To | 701 | 7011 | 1 | 1 | ... | 696 | 693 | 0 | 0 | 0 | 1 | 0 | 12.0 | Office and administrative support occupations | Health care services , except hospitals |
| 3 | 110339935453 | January | 1.0 | Unmarried civilian female primary fam householder | MIS 4 | MIS 2-4 Or MIS 6-8 (link To | 701 | 7011 | 1 | 1 | ... | 696 | 693 | 0 | 0 | 0 | 0 | 0 | 12.0 | Office and administrative support occupations | Administrative and support services |
| 4 | 110359424339 | January | 1.0 | Unmarried civilian female primary fam householder | MIS 4 | MIS 2-4 Or MIS 6-8 (link To | 711 | 7111 | 1 | 1 | ... | 696 | 693 | 0 | 0 | 0 | 0 | 0 | | Healthcare practitioner and technical occupations | Hospitals |


## Exercise 2

Because our interest is only in-the-workplace wage discrimination among full-time workers, we need to start by subsetting our data to people who were employed (and "at work", not "absent") at the time of the survey, using the `lfsr94` variable, and who are employed full time (meaning that their usual hours per week, `uhourse`, are 35 or above).

As noted above, this analysis will miss many forms of gender discrimination. For example, in dropping anyone who isn't working, we immediately lose any women who couldn't get jobs, or who chose to leave the workforce because the wages they were offered (which were likely lower than those offered to men) were lower than they were willing or able to accept. And in focusing on full-time employees, we miss the fact that women may not be offered full-time jobs at the same rate as men.

In [39]:
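# Keep full-time workers who are employed and at work; .copy() avoids SettingWithCopyWarning when adding columns later.
morg18 = morg18[morg18["lfsr94"] == "Employed-At Work"]
morg18 = morg18[morg18["uhourse"] >= 35].copy()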

## Exercise 3

Now let's estimate the basic wage gap for the United States!

Earnings per week worked can be found in the `earnwke` variable. Using the variable `sex` (1=Male, 2=Female), estimate the gender wage gap in terms of wages per hour worked!

(You may also find it helpful, for context, to estimate the average hourly pay by dividing weekly pay by `uhourse`.)

In [40]:
morg18.shape

(133814, 98)

In [41]:
average_salary_male_weekly = morg18[morg18["sex"] == 1]["earnwke"].mean()
average_salary_female_weekly = morg18[morg18["sex"] == 2]["earnwke"].mean()

weekly_gender_wage_gap = average_salary_male_weekly - average_salary_female_weekly

print(f"\nThe average weekly salary for men is ${average_salary_male_weekly:,.2f}")
print(f"The average weekly salary for women is ${average_salary_female_weekly:,.2f}")
print(
    f"\nThe gender wage gap in the United States is ${weekly_gender_wage_gap:.2f} per week"
)


The average weekly salary for men is $1,204.73
The average weekly salary for women is $985.68

The gender wage gap in the United States is $219.05 per week


In [42]:
salary_per_hour_male = (
    morg18[morg18["sex"] == 1]["earnwke"] / morg18[morg18["sex"] == 1]["uhourse"]
).mean()
salary_per_hour_female = (
    morg18[morg18["sex"] == 2]["earnwke"] / morg18[morg18["sex"] == 2]["uhourse"]
).mean()

hourly_gender_wage_gap = salary_per_hour_male - salary_per_hour_female

print(f"The average hourly salary for men is ${salary_per_hour_male:,.2f}")
print(f"The average hourly salary for women is ${salary_per_hour_female:,.2f}")
print(
    f"\nThe gender wage gap in the United States is ${hourly_gender_wage_gap:,.2f} per hour"
)

The average hourly salary for men is $27.88
The average hourly salary for women is $23.80

The gender wage gap in the United States is $4.08 per hour
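
The same per-group averages can also be computed in one pass with `groupby`, which avoids repeating the filter for each sex. This is a minimal sketch on toy numbers (not CPS values) that reuses the column names from the cells above.

```python
import pandas as pd

# Toy data, not CPS values; column names mirror the ones used above.
toy = pd.DataFrame(
    {
        "sex": [1, 1, 2, 2],
        "earnwke": [1200.0, 800.0, 900.0, 700.0],
        "uhourse": [40, 40, 36, 35],
    }
)

# Hourly pay per respondent, then the mean within each sex code.
toy["hourly_pay"] = toy["earnwke"] / toy["uhourse"]
hourly_by_sex = toy.groupby("sex")["hourly_pay"].mean()

# 1 = Male, 2 = Female, as in the exercise text.
hourly_gap = hourly_by_sex.loc[1] - hourly_by_sex.loc[2]

print(hourly_by_sex)
print(f"Hourly gap: ${hourly_gap:,.2f}")
```

Note that this averages each person's hourly pay, which is what the cell above does; dividing mean weekly pay by mean weekly hours would give a slightly different number.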


## Exercise 4

Assuming 48 work weeks in a year, calculate annual earnings for men and women. Report the difference in dollars and in percentage terms.

In [43]:
morg18["annual_earnings"] = morg18["earnwke"] * 48

annual_earnings_male = morg18[morg18["sex"] == 1]["annual_earnings"].mean()

annual_earnings_female = morg18[morg18["sex"] == 2]["annual_earnings"].mean()

earnings_difference = annual_earnings_male - annual_earnings_female

percentage_difference = (earnings_difference / annual_earnings_female) * 100

print(f"Annual earnings for men: ${annual_earnings_male:,.0f}")
print(f"Annual earnings for women: ${annual_earnings_female:,.0f}")
print(
    f"\nThe difference in earnings is ${earnings_difference:,.0f} which represents {percentage_difference:.2f}% more earnings for men compared to women."
)

Annual earnings for men: $57,827
Annual earnings for women: $47,313

The difference in earnings is $10,514 which represents 22.22% more earnings for men compared to women.
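
A percentage difference depends on which group is used as the base. The short sketch below takes the rounded means printed above and reports the same dollar gap both ways: relative to women's earnings (as in the cell above) and relative to men's earnings.

```python
# Rounded means reported in the output above.
annual_men = 57_827
annual_women = 47_313

gap = annual_men - annual_women

# Same dollar gap, two different bases.
men_earn_more_pct = gap / annual_women * 100  # relative to women's earnings
women_earn_less_pct = gap / annual_men * 100  # relative to men's earnings

print(f"Gap: ${gap:,}")
print(f"Men earn {men_earn_more_pct:.2f}% more than women")
print(f"Women earn {women_earn_less_pct:.2f}% less than men")
```

Both statements describe the same $10,514 gap, so it is worth stating the base explicitly whenever the gap is reported as a percentage.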


## Exercise 5

We just compared all full-time working men to all full-time working women. For this to be an accurate *causal* estimate of the effect of being a woman in the workplace, what must be true of these two groups? What is one reason that this may *not* be true?

> For this comparison to be an accurate causal estimate, the two groups must be similar in all other characteristics that affect income, such as age, race, and educational level. If those characteristics had similar distributions in both groups, comparing the group means would give an unbiased estimate of the effect of being a woman in the workplace.
>
> One reason this might not hold is selection into full-time work: if more-educated women are more likely to hold full-time jobs, then women with full-time jobs will tend to have a higher average level of education than men with full-time jobs. In that case we would need to control for educational level in our comparison to obtain an unbiased estimate of the effect of being a woman in the workplace.
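
The requirement in the answer above can be written in potential-outcomes notation. The symbols here are assumptions introduced for illustration: $F_i = 1$ if respondent $i$ is a woman, $Y_i(1)$ is the earnings $i$ would receive as a woman, and $Y_i(0)$ the earnings $i$ would receive as a man. The raw difference in means then splits into two parts:

$$
E[Y_i \mid F_i = 1] - E[Y_i \mid F_i = 0]
= \underbrace{E[Y_i(1) - Y_i(0) \mid F_i = 1]}_{\text{effect of being a woman}}
+ \underbrace{E[Y_i(0) \mid F_i = 1] - E[Y_i(0) \mid F_i = 0]}_{\text{baseline difference}}
$$

The raw gap equals the causal effect only when the baseline difference is zero, that is, when the women and men in the sample would have earned the same on average had gender played no role.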

## Exercise 6

One answer to the second part of Exercise 5 is that working women are likely to be younger, since a larger portion of younger women are entering the workforce as compared to older generations.

To *control* for this difference, let's now regress annual earnings on gender, age, and age-squared (the relationship between age and income is generally non-linear). What is the implied average annual wage difference between women and men? Is it different from your raw estimate? 

> The following regression model shows the same results as the previous difference in means calculations

In [44]:
morg18["female"] = (morg18["sex"] == 2).astype(int)
model = smf.ols("annual_earnings ~ female ", morg18).fit()
print("{:,.0f}".format(round(model.params["female"])))

-10,514
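
The match between the regression coefficient and the earlier difference in means is not a coincidence: with a single 0/1 regressor, the OLS slope is exactly the difference between the two group means. A minimal sketch on toy numbers (not CPS values) illustrates this, and also shows that asking for HC3 standard errors via `cov_type="HC3"` changes the standard errors but not the coefficients.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data, not CPS values.
toy = pd.DataFrame(
    {
        "female": [0, 0, 0, 1, 1, 1],
        "annual_earnings": [60000.0, 50000.0, 55000.0, 45000.0, 40000.0, 50000.0],
    }
)

means = toy.groupby("female")["annual_earnings"].mean()
diff_in_means = means.loc[1] - means.loc[0]

# cov_type="HC3" changes the standard errors, not the coefficients.
classic = smf.ols("annual_earnings ~ female", toy).fit()
robust = smf.ols("annual_earnings ~ female", toy).fit(cov_type="HC3")

print(f"Difference in means: {diff_in_means:,.0f}")
print(f"OLS coefficient:     {classic.params['female']:,.0f}")
print(f"Robust coefficient:  {robust.params['female']:,.0f}")
```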


In [45]:
print(model.get_robustcov_results("HC3").summary())

                            OLS Regression Results                            
Dep. Variable:        annual_earnings   R-squared:                       0.026
Model:                            OLS   Adj. R-squared:                  0.026
Method:                 Least Squares   F-statistic:                     3350.
Date:                Tue, 09 Apr 2024   Prob (F-statistic):               0.00
Time:                        12:12:25   Log-Likelihood:            -1.4464e+06
No. Observations:              122603   AIC:                         2.893e+06
Df Residuals:                  122601   BIC:                         2.893e+06
Df Model:                           1                                         
Covariance Type:                  HC3                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   5.783e+04    133.001    434.787      0.0

> Let's now control for age

In [46]:
morg18["age"] = morg18["age"].astype("int64")
morg18["age_2"] = morg18["age"] ** 2
model2 = smf.ols("annual_earnings ~ female + age + age_2", morg18).fit()
print(model2.get_robustcov_results("HC3").summary())

                            OLS Regression Results                            
Dep. Variable:        annual_earnings   R-squared:                       0.083
Model:                            OLS   Adj. R-squared:                  0.083
Method:                 Least Squares   F-statistic:                     4820.
Date:                Tue, 09 Apr 2024   Prob (F-statistic):               0.00
Time:                        12:12:25   Log-Likelihood:            -1.4426e+06
No. Observations:              122603   AIC:                         2.885e+06
Df Residuals:                  122599   BIC:                         2.885e+06
Df Model:                           3                                         
Covariance Type:                  HC3                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -7102.4067    787.776     -9.016      0.0

In [47]:
earnings_difference_controlled_for_age = model2.params["female"]

print(
    f"\nThe difference in earnings between women and men, controlling for age, is -${-earnings_difference_controlled_for_age:,.0f}"
)


The difference in earnings between women and men, controlling for age, is -$10,735


> This difference is slightly larger in magnitude than the raw estimate calculated previously (-$10,514), which implies that, controlling for age, the gender gap in earnings is even larger.
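
The age-squared term is what lets the regression capture earnings that rise with age and then level off or decline. As a rough illustration, the sketch below fits a quadratic to synthetic, noise-free data (the coefficients are invented for the example and are not CPS estimates) and recovers the age at which predicted earnings peak.

```python
import numpy as np

# Synthetic ages and earnings (illustrative only). Earnings follow an
# exact quadratic in age, so least squares recovers the coefficients.
age = np.arange(20, 66, dtype=float)
earnings = -20_000.0 + 3_000.0 * age - 30.0 * age**2

# Design matrix with intercept, age, and age squared.
X = np.column_stack([np.ones_like(age), age, age**2])
intercept, b_age, b_age2 = np.linalg.lstsq(X, earnings, rcond=None)[0]

# With a negative coefficient on age squared, predicted earnings peak
# where the derivative b_age + 2 * b_age2 * age equals zero.
peak_age = -b_age / (2 * b_age2)
print(f"Earnings peak at about age {peak_age:.1f}")
```

The same calculation could be applied to the `age` and `age_2` coefficients of the fitted model to see where the estimated age-earnings profile turns over.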

## Exercise 7

In running this regression and interpreting the coefficient on `female`, what is the implicit comparison you are making? In other words, when we run this regression and interpret the coefficient on `female`, we're basically pretending we are comparing two groups and assuming they are counter-factuals for one another. What are these two groups?

> When interpreting the coefficient on `female` in the last regression, we are comparing women to men of the same age, and treating those two groups as counterfactuals for one another.

## Exercise 8

Now let's add to our regression an indicator variable for whether the respondent has at least graduated high school, and an indicator for whether the respondent at least has a BA. 

In answering this question, use the following table of codes for the variable `grade92`. 

Education is coded as follows:
    
![CPS Educ Codes](../images/cps_educ_codes.png)

In [48]:
# grade92 code 39 = high school graduate (diploma or equivalent); code 43 = bachelor's degree.
morg18["high_school"] = morg18["grade92"] >= 39
morg18["BA"] = morg18["grade92"] >= 43


model3 = smf.ols(
    "annual_earnings ~ female + age + age_2 + high_school + BA ", morg18
).fit()
print(model3.get_robustcov_results("HC3").summary())

                            OLS Regression Results                            
Dep. Variable:        annual_earnings   R-squared:                       0.273
Model:                            OLS   Adj. R-squared:                  0.273
Method:                 Least Squares   F-statistic:                     9051.
Date:                Tue, 09 Apr 2024   Prob (F-statistic):               0.00
Time:                        12:12:25   Log-Likelihood:            -1.4284e+06
No. Observations:              122603   AIC:                         2.857e+06
Df Residuals:                  122597   BIC:                         2.857e+06
Df Model:                           5                                         
Covariance Type:                  HC3                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept            -1.92e+04    

In [49]:
earnings_difference_controlled_for_age_education = model3.params["female"]

print(
    f"\nThe difference in earnings between women and men, controlling for age, high school completion, and having at\n least a BA is -${-earnings_difference_controlled_for_age_education:,.0f}"
)


The difference in earnings between women and men, controlling for age, high school completion, and having at
 least a BA is -$13,039


> This difference is larger in magnitude than both the raw estimate (-$10,514) and the age-adjusted estimate (-$10,735), which implies that, controlling for age, high school completion, and having at least a BA, the gender gap in earnings is even larger.

## Exercise 9 

In running this regression and interpreting the coefficient on `female`, what is the implicit comparison you are making? In other words, when we run this regression and interpret the coefficient on `female`, we are once more basically pretending we are comparing two groups and assuming they are counter-factuals for one another. What are these two groups?

> When interpreting the coefficient on `female` in the last regression, we are comparing women to men of the same age and the same educational level (as measured by the high school and BA indicators), and treating those two groups as counterfactuals for one another.

## Exercise 10

Given how the coefficient on `female` has changed between Exercise 6 and Exercise 8, what can you infer about the educational attainment of the women in your survey data (as compared to the educational attainment of men)?

> The coefficient on `female` became more negative, moving from -$10,735 to -$13,039, once we controlled for educational level. From this we can infer that the educational attainment of women in our survey data is likely higher than that of men.
>
> The reasoning is that education raises earnings, so if women are more educated, leaving education out of the regression lets their educational advantage offset part of the gap. Once education is held constant, that offset disappears and the gap widens: women in the survey data tend to have higher levels of educational attainment than men, yet they are still paid less in the workplace.


In [50]:
# we can also check this hypothesis:
morg18.groupby("female")["high_school"].mean()

female
0    0.926028
1    0.955296
Name: high_school, dtype: float64

In [51]:
morg18.groupby("female")["BA"].mean()

female
0    0.374868
1    0.444782
Name: BA, dtype: float64

> We can see that for both the high school and BA indicators the share is higher for women (95.5% vs. 92.6% for high school, 44.5% vs. 37.5% for a BA), which confirms our hypothesis.
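
The link between the change in the coefficient and the education gap is the omitted-variable-bias identity: the coefficient from the regression without education equals the coefficient from the regression with education plus the education coefficient times the female-male difference in education. The sketch below checks this on synthetic, noise-free data (all numbers are invented for illustration and are not CPS estimates), using a single BA indicator and a hypothetical helper `ols`.

```python
import numpy as np

# Synthetic, noise-free data (illustrative only): 10 men and 10 women.
female = np.array([0.0] * 10 + [1.0] * 10)
# 3 of the 10 men and 5 of the 10 women hold a BA.
ba = np.array([1.0] * 3 + [0.0] * 7 + [1.0] * 5 + [0.0] * 5)
earnings = 40_000.0 - 12_000.0 * female + 25_000.0 * ba


def ols(y, *columns):
    """Return least-squares coefficients, intercept first (hypothetical helper)."""
    X = np.column_stack([np.ones_like(y), *columns])
    return np.linalg.lstsq(X, y, rcond=None)[0]


short_female = ols(earnings, female)[1]               # education omitted
_, long_female, long_ba = ols(earnings, female, ba)   # education included
ba_gap = ols(ba, female)[1]                           # women's BA rate minus men's

# Identity: short = long + (BA coefficient) * (BA gap).
print(f"short regression: {short_female:,.0f}")
print(f"long regression:  {long_female:,.0f}")
print(f"long + bias term: {long_female + long_ba * ba_gap:,.0f}")
```

Because women in this toy sample are more likely to hold a BA and a BA raises earnings, the bias term is positive, so the regression that omits education understates the gap.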

## Exercise 11

What does that tell you about the *potential outcomes* of men and women before you added education as a control?

> Before we added education as a control, men and women did not have the same potential outcomes: the two groups differed at baseline. Women in the sample are more likely to have completed high school and to hold at least a bachelor's degree, so even if gender had no effect on pay we would expect the two groups to earn different amounts (with women earning more). The earlier comparison was therefore biased, because it did not take this baseline difference in educational level into account.

## Exercise 12

Finally, let's include *fixed effects* for the type of job held by each respondent. 

Fixed effects are a method used when we have a nested data structure in which respondents belong to groups, and those groups may all be subject to different pressures. In this context, for example, we can add fixed effects for the industry of each respondent—since wages often vary across industries, controlling for industry is likely to improve our estimates. Use `ind02` to control for industry.

(Note that fixed effects are very similar in principle to hierarchical models. There are some differences [you will read about](../fixed_effects_v_hierarchical.ipynb) for our next class, but they are designed to serve the same role, just with slightly different mechanics). 

When we add fixed effects for groups like this, our interpretation of the other coefficients changes. Whereas in previous exercises we were trying to explain variation in men and women's wages *across all respondents*, we are now effectively comparing men and women's wages *within each employment sector*. Our coefficient on `female`, in other words, now tells us how much less (on average) we would expect a woman to be paid than a man *within the same industry*, not across all respondents. 

(Note that running this regression will result in lots of coefficients popping up you don't care about. We'll introduce some more efficient methods for adding fixed effects that aren't so messy in a later class -- for now, you can ignore those coefficients!)
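
To make the "within each industry" interpretation concrete, the sketch below shows on toy numbers (not CPS values) that the coefficient from a regression with one dummy per industry is the same as the slope obtained after subtracting industry means from both earnings and `female`. The column name `industry` stands in for `ind02`, and the example uses plain least squares rather than the `statsmodels` formula interface.

```python
import numpy as np
import pandas as pd

# Toy data, not CPS values; "industry" stands in for ind02.
toy = pd.DataFrame(
    {
        "industry": ["tech"] * 4 + ["retail"] * 4,
        "female": [0, 0, 0, 1, 0, 1, 1, 1],
        "earnings": [90.0, 94.0, 92.0, 84.0, 50.0, 44.0, 40.0, 42.0],
    }
)

# Approach 1: one dummy column per industry (these replace the intercept).
dummies = pd.get_dummies(toy["industry"], dtype=float)
X = np.column_stack([toy["female"].to_numpy(dtype=float), dummies.to_numpy()])
coef_dummies = np.linalg.lstsq(X, toy["earnings"].to_numpy(), rcond=None)[0][0]

# Approach 2: subtract industry means, then regress demeaned earnings on
# demeaned female. Only within-industry variation is left.
cols = ["female", "earnings"]
demeaned = toy[cols] - toy.groupby("industry")[cols].transform("mean")
coef_within = (demeaned["female"] * demeaned["earnings"]).sum() / (
    demeaned["female"] ** 2
).sum()

# Raw gap across all respondents, ignoring industry.
by_sex = toy.groupby("female")["earnings"].mean()
raw_gap = by_sex.loc[1] - by_sex.loc[0]

print(f"Raw gap:             {raw_gap:.1f}")
print(f"Industry dummies:    {coef_dummies:.1f}")
print(f"Within-industry fit: {coef_within:.1f}")
```

In this toy data most women work in the lower-paying industry, so the raw gap is much larger than the within-industry gap.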

In [52]:
import warnings
warnings.simplefilter('ignore')
model4 = smf.ols(
    "annual_earnings ~ female + age + age_2 + high_school + BA  + ind02", morg18
).fit()
print(model4.get_robustcov_results("HC3").summary())

                            OLS Regression Results                            
Dep. Variable:        annual_earnings   R-squared:                       0.320
Model:                            OLS   Adj. R-squared:                  0.319
Method:                 Least Squares   F-statistic:                     219.7
Date:                Tue, 09 Apr 2024   Prob (F-statistic):               0.00
Time:                        12:12:41   Log-Likelihood:            -1.4243e+06
No. Observations:              122603   AIC:                         2.849e+06
Df Residuals:                  122339   BIC:                         2.852e+06
Df Model:                         263                                         
Covariance Type:                  HC3                                         
                                                                                                                            coef    std err          t      P>|t|      [0.025      0.975]
------------------------

In [53]:
earnings_difference_controlled_for_age_education_industry = model4.params["female"]

print(
    f"\nThe difference in earnings between women and men, controlling for age, high school completion, having at least\n a BA, and industry is -${-earnings_difference_controlled_for_age_education_industry:,.0f}"
)


The difference in earnings between women and men, controlling for age, high school completion, having at least
 a BA, and industry is -$10,980


## Exercise 13

Now that we've added industry fixed effects, what groups are we implicitly treating as counter-factuals for one another now?

> In interpreting the coefficient on `female`, we are comparing women with men of the same age and the same educational level (high school and BA indicators) who work in the same industry, and treating those two groups as counterfactuals for one another.

## Exercise 14

What happened to your estimate of the gender wage gap when you added industry fixed effects? What does that tell you about the industries chosen by women as opposed to men?

> When we added industry fixed effects to our model, the estimated gender wage gap shrank in magnitude, from -$13,039 to -$10,980. Part of the gap in the previous model therefore reflected differences *between* industries rather than differences within them: women are more heavily represented in industries that pay lower wages, while men predominate in higher-paying industries. In other words, there are systematic wage differences between the industries where women work and those where men work, and a gap of about $10,980 remains even among men and women in the same industry.
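
One way to check the claim about industries directly would be to compare the share of women in each industry with that industry's average earnings. The sketch below does this on toy numbers (not CPS values); the column name `industry` stands in for `ind02`.

```python
import pandas as pd

# Toy data, not CPS values; "industry" stands in for ind02.
toy = pd.DataFrame(
    {
        "industry": ["tech"] * 4 + ["retail"] * 4 + ["health"] * 4,
        "female": [0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1],
        "annual_earnings": [
            90.0, 94.0, 92.0, 84.0,
            50.0, 44.0, 40.0, 42.0,
            70.0, 72.0, 64.0, 66.0,
        ],
    }
)

# Share of women and mean earnings within each industry.
by_industry = toy.groupby("industry").agg(
    share_female=("female", "mean"),
    mean_earnings=("annual_earnings", "mean"),
)
print(by_industry.sort_values("share_female"))

# In this toy data the correlation is negative: industries with a
# larger share of women have lower average earnings.
corr = by_industry["share_female"].corr(by_industry["mean_earnings"])
print(f"Correlation: {corr:.3f}")
```

Running the same aggregation on `morg18` with `ind02` would show whether the pattern inferred from the regression coefficients also appears in the raw industry averages.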

When you're done, please come read [this discussion](discussion_regressions_incomeineq.ipynb).