## Exercise Estimating Gender Discrimination in the Workplace

### By Xiaoquan Liu & Emma Wang 

> Exercise 1

In [1]:
import pandas as pd
df = pd.read_stata('https://github.com/nickeubank/MIDS_Data/blob/master/'+
                   'Current_Population_Survey/morg18.dta?raw=true')

In [2]:
df["lfsr94"].value_counts()

Employed-At Work               172378
Retired-Not In Labor Force      61161
Other-Not In Labor Force        37263
Disabled-Not In Labor Force     17052
Employed-Absent                  6443
Unemployed-Looking               5710
Unemployed-On Layoff             1046
Name: lfsr94, dtype: int64

In [3]:
df.head()

Unnamed: 0,hhid,intmonth,hurespli,hrhtype,minsamp,hrlonglk,hrsample,hrhhid2,serial,hhnum,...,ym_file,ym,ch02,ch35,ch613,ch1417,ch05,ihigrdc,docc00,dind02
0,4795110719,January,1.0,Husband/wife primary fam (neither in Armed For...,MIS 8,MIS 2-4 Or MIS 6-8 (link To,601,6011,1,1,...,696,681,0,0,0,0,0,14.0,,
1,4795110719,January,1.0,Husband/wife primary fam (neither in Armed For...,MIS 8,MIS 2-4 Or MIS 6-8 (link To,601,6011,1,1,...,696,681,0,0,0,0,0,13.0,,
2,110339935453,January,1.0,Unmarried civilian female primary fam householder,MIS 4,MIS 2-4 Or MIS 6-8 (link To,701,7011,1,1,...,696,693,0,0,0,1,0,12.0,Office and administrative support occupations,"Health care services , except hospitals"
3,110339935453,January,1.0,Unmarried civilian female primary fam householder,MIS 4,MIS 2-4 Or MIS 6-8 (link To,701,7011,1,1,...,696,693,0,0,0,0,0,12.0,Office and administrative support occupations,Administrative and support services
4,110359424339,January,1.0,Unmarried civilian female primary fam householder,MIS 4,MIS 2-4 Or MIS 6-8 (link To,711,7111,1,1,...,696,693,0,0,0,0,0,,Healthcare practitioner and technical occupations,Hospitals


> Exercise 2

In [4]:
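# Keep full-time workers (35+ usual hours per week) who are employed and at work.
# .copy() avoids a SettingWithCopyWarning when new columns are added later.
df = df[(df["uhourse"] >= 35) & (df["lfsr94"] == "Employed-At Work")].copy()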

> Exercise 3

In [5]:
# Estimate the basic gender wage gap in hourly wages:
# hourly wage = weekly earnings / usual weekly hours
df["hour_wage"] = df["earnwke"] / df["uhourse"]


In [6]:
female_hour_wage = df[df["sex"] == 2].hour_wage.mean()
male_hour_wage = df[df["sex"] == 1].hour_wage.mean()
print("Difference in mean hourly wage (women minus men):",
      f"{female_hour_wage - male_hour_wage}")
print("As a percentage of men's mean hourly wage:",
      f"{100 * (female_hour_wage - male_hour_wage) / male_hour_wage}")

Difference in mean hourly wage (women minus men): -4.080172337386848
As a percentage of men's mean hourly wage: -14.633016712106063


* The gender wage gap in hourly wages is about $4.08: on average, women earn about 14.6% less per hour than men.
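
The same comparison can be written with a single `groupby`. The sketch below is a minimal illustration on a tiny invented frame, not CPS data; `gap_by_sex` is a hypothetical helper, and it assumes the coding used above (`sex == 1` for men, `sex == 2` for women).

```python
import pandas as pd

def gap_by_sex(frame, column):
    """Return (absolute gap, percent gap) of women's mean relative to men's mean."""
    means = frame.groupby("sex")[column].mean()
    diff = means.loc[2] - means.loc[1]
    return diff, 100 * diff / means.loc[1]

# Invented numbers, for illustration only.
toy = pd.DataFrame({
    "sex": [1, 1, 2, 2],
    "earnwke": [1000.0, 1200.0, 800.0, 1000.0],
    "uhourse": [40, 40, 40, 40],
})
toy["hour_wage"] = toy["earnwke"] / toy["uhourse"]

diff, pct = gap_by_sex(toy, "hour_wage")
print(f"Women earn ${abs(diff):.2f} less per hour ({abs(pct):.1f}% less than men).")
```

Passing `"annual_earnings"` instead of `"hour_wage"` would give the annual version of the same gap.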

> Exercise 4

In [7]:
# Assuming 48 work weeks in a year, calculate annual earnings for men and women. Report the difference 

df["annual_earnings"] = df["earnwke"] * 48
female_annual_earning = df[df["sex"] == 2].annual_earnings.mean()
male_annual_earning = df[df["sex"] == 1].annual_earnings.mean()
print("Difference in mean annual earnings (women minus men):",
      f"{female_annual_earning - male_annual_earning}")
print("As a percentage of men's mean annual earnings:",
      f"{100 * (female_annual_earning - male_annual_earning) / male_annual_earning}")

Difference in mean annual earnings (women minus men): -10514.435694871769
As a percentage of men's mean annual earnings: -18.182501422795216


* The difference in mean annual earnings between women and men is about $10,514; on average, women earn about 18.2% less per year than men.

> Exercise 5

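For this comparison to be an accurate causal estimate of the effect of being a woman in the workplace, what must be true of these two groups? What is one reason that this may not be true?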

For the comparison between full-time working men and full-time working women to be an accurate causal estimate of the effect of being a woman in the workplace, the two groups must be identical, on average, in every characteristic that affects earnings other than gender. In other words, each group must be a valid counterfactual for the other.

This is unlikely to hold, because many factors affect earnings, including education, experience, occupation, industry, and location. If women and men differ in these or in other unobserved characteristics that affect earnings, the comparison is biased.
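
One way to probe this assumption is to compare the mean of each observable characteristic across the two groups; large differences suggest the groups are not good counterfactuals for one another. The sketch below is a minimal illustration on invented data. `balance_table` is a hypothetical helper and the toy columns are placeholders, not results from the survey.

```python
import pandas as pd

def balance_table(frame, covariates, group="sex"):
    """Mean of each covariate by group, plus the difference (group 2 minus group 1)."""
    means = frame.groupby(group)[covariates].mean().T
    means["difference"] = means[2] - means[1]
    return means

# Invented data, for illustration only.
toy = pd.DataFrame({
    "sex": [1, 1, 1, 2, 2, 2],
    "age": [30, 40, 50, 35, 45, 55],
    "ba_degree": [0, 0, 1, 1, 1, 0],
})
table = balance_table(toy, ["age", "ba_degree"])
print(table)
```

A balance table only covers observed characteristics, so similar means do not rule out differences in unobserved ones.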

> Exercise 6

To control for this difference, let’s now regress annual earnings on gender, age, and age-squared (the relationship between age and income is generally non-linear). What is the implied average annual wage difference between women and men? Is it different from your raw estimate?

In [8]:
import statsmodels.formula.api as smf
df["age_squared"] = df["age"] * df["age"]
model = smf.ols("annual_earnings ~ C(sex) + age + age_squared", data=df)
results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,annual_earnings,R-squared:,0.06
Model:,OLS,Adj. R-squared:,0.06
Method:,Least Squares,F-statistic:,2625.0
Date:,"Wed, 22 Feb 2023",Prob (F-statistic):,0.0
Time:,15:24:33,Log-Likelihood:,-1444100.0
No. Observations:,122603,AIC:,2888000.0
Df Residuals:,122599,BIC:,2888000.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,3.833e+04,315.241,121.587,0.000,3.77e+04,3.89e+04
C(sex)[T.2],-1.062e+04,181.123,-58.641,0.000,-1.1e+04,-1.03e+04
age,456.6349,6.839,66.766,0.000,443.230,470.040
age_squared,-0.2740,1.241,-0.221,0.825,-2.706,2.158

0,1,2,3
Omnibus:,18297.323,Durbin-Watson:,1.733
Prob(Omnibus):,0.0,Jarque-Bera (JB):,27620.546
Skew:,1.11,Prob(JB):,0.0
Kurtosis:,3.692,Cond. No.,269.0


The implied average annual earnings difference between women and men, controlling for age and age-squared, is approximately $10,620 (women earn less). This is close to our raw estimate in Exercise 4, which was approximately $10,514.

* In the regression of annual earnings on gender, age, and age-squared, the coefficient on `age` is **456.63** with a reported **p-value of 0.000 (well below 0.05)**: each additional year of age is associated with about $456.63 more in annual earnings on average, and this relationship is statistically significant. The coefficient on `age_squared` (-0.27, p = 0.825) is not statistically significant.

* Because age is related to earnings, any difference in the age distribution of the two groups would contribute to the raw gap, so simply comparing the annual earnings of all full-time working men with those of all full-time working women cannot be taken as an accurate causal estimate of the effect of being a woman in the workplace. Here the gap barely changes once age is controlled for, which suggests that age differences account for little of it.
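
One caveat on the `age_squared` term: its coefficient in the table above is tiny and far from significant (p = 0.825), which is unusual for earnings data. A possible explanation, stated here as an assumption that was not checked against the data, is that `pd.read_stata` loaded `age` as an 8-bit integer. In that case `df["age"] * df["age"]` silently wraps around for any age above 11. The sketch below shows the effect on made-up ages, along with a cast that avoids it.

```python
import pandas as pd

# Assumed dtype, for illustration; the real column's dtype was not verified.
ages = pd.Series([25, 40, 50], dtype="int8")

wrapped = ages * ages               # result stays int8, so large squares overflow
safe = ages.astype("int64") ** 2    # widen first, then square

print(wrapped.tolist())  # [113, 64, -60]
print(safe.tolist())     # [625, 1600, 2500]
```

If `df["age"].dtype` turns out to be `int8`, rebuilding `age_squared` from a widened column and refitting would be worth doing before interpreting the age terms.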

> Exercise 7

In running this regression and interpreting the coefficient on female, what is the implicit comparison you are making? In other words, when we run this regression and interpret the coefficient on female, we’re basically pretending we are comparing two groups and assuming they are counterfactuals for one another. What are these two groups?

Answer: The implicit comparison is between men and women of the same age. In other words, we treat men of a given age as the counterfactual for women of that age: their earnings stand in for what those women would have earned had they been men. This assumes that, conditional on age, the two groups do not differ in other characteristics that affect earnings.

> Exercise 8 

Add two indicator variables: whether the respondent has at least graduated high school, and whether the respondent at least has a BA.

In [9]:
import numpy as np
# CPS grade92 codes: 39 = high school graduate, 43 = bachelor's degree
df["high_school"] = np.where(df["grade92"] >= 39, 1, 0)
df["ba_degree"] = np.where(df["grade92"] >= 43, 1, 0)

In [10]:
model = smf.ols("annual_earnings ~ C(sex) + age + age_squared + C(high_school) + C(ba_degree)", data=df)
results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,annual_earnings,R-squared:,0.259
Model:,OLS,Adj. R-squared:,0.259
Method:,Least Squares,F-statistic:,8573.0
Date:,"Wed, 22 Feb 2023",Prob (F-statistic):,0.0
Time:,15:24:34,Log-Likelihood:,-1429600.0
No. Observations:,122603,AIC:,2859000.0
Df Residuals:,122597,BIC:,2859000.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,1.579e+04,420.146,37.577,0.000,1.5e+04,1.66e+04
C(sex)[T.2],-1.3e+04,161.440,-80.496,0.000,-1.33e+04,-1.27e+04
C(high_school)[T.1],1.373e+04,344.678,39.840,0.000,1.31e+04,1.44e+04
C(ba_degree)[T.1],2.756e+04,167.116,164.937,0.000,2.72e+04,2.79e+04
age,445.9737,6.074,73.426,0.000,434.069,457.878
age_squared,-0.3958,1.102,-0.359,0.719,-2.555,1.763

0,1,2,3
Omnibus:,14257.428,Durbin-Watson:,1.849
Prob(Omnibus):,0.0,Jarque-Bera (JB):,20565.673
Skew:,0.89,Prob(JB):,0.0
Kurtosis:,3.926,Cond. No.,481.0


* Controlling for education, age, and age-squared, women's annual earnings are about $13,000 lower than men's.
 

> Exercise 9

In running this regression and interpreting the coefficient on female, what is the implicit comparison you are making? In other words, when we run this regression and interpret the coefficient on female, we are once more basically pretending we are comparing two groups and assuming they are counterfactuals for one another. What are these two groups?


The implicit comparison is between men and women of the same age and with the same high school and college graduation status. In other words, we treat men of a given age and education level as the counterfactual for women of that age and education level, assuming that, conditional on these controls, the two groups do not differ in other characteristics that affect earnings.


> Exercise 10

Given how the coefficient on female has changed between Exercise 6 and Exercise 8, what can you infer about the educational attainment of the women in your survey data (as compared to the educational attainment of men)?

The coefficient on female is about -10,620 in Exercise 6 and about -13,000 in Exercise 8, so it becomes larger in magnitude once education is controlled for. From this we can infer that the women in the survey data have higher educational attainment than the men. Education raises earnings, so when it is left out of the regression, women's educational advantage offsets part of the gap and makes it look smaller. Holding education fixed removes that offset and reveals a larger unexplained gap.
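
The direction of this inference follows from the standard omitted-variable-bias relationship. As a simplified sketch, treat education as a single variable (the regression in Exercise 8 uses two indicators) and leave the age controls implicit. The coefficient on female without education (short) and with education (long) are then related by

$$
\hat{\beta}_{\text{female}}^{\text{short}} = \hat{\beta}_{\text{female}}^{\text{long}} + \hat{\gamma}\,\hat{\delta},
$$

where $\hat{\gamma}$ is the coefficient on education in the long regression and $\hat{\delta}$ is the coefficient on female from regressing education on female and the same controls. Plugging in the rounded estimates from Exercises 6 and 8:

$$
-10{,}620 = -13{,}000 + \hat{\gamma}\,\hat{\delta} \quad\Longrightarrow\quad \hat{\gamma}\,\hat{\delta} \approx +2{,}380 .
$$

Since the estimated education coefficients are positive ($\hat{\gamma} > 0$), $\hat{\delta}$ must be positive as well under this simplification, meaning women in this sample have more education on average than men.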

> Exercise 11

What does that tell you about the potential outcomes of men and women before you added education as a control?

The fact that adding education as a control changed the coefficient on female tells us that men and women in the survey data differ in educational attainment. Before education was added, then, the potential outcomes of the two groups were not comparable: because the women are more educated on average, they would be expected to earn more than the men in the absence of any gender effect. Men were therefore not a valid counterfactual for women, and the estimate in Exercise 6 suffered from omitted variable bias (OVB) that understated the gap. The gap that remains after controlling for education must come from factors other than education level, such as discrimination or differences in work experience.

> Exercise 12



In [11]:
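# Include fixed effects for the industry of each respondent's job (ind02),
# along with all earlier controls.
# Note: in a formula, "age*age" is an interaction term, not a square, so it
# adds nothing beyond "age"; this specification has no age-squared term.
model = smf.ols("annual_earnings ~ C(sex) + age +"
                + "age*age + C(ind02) + C(high_school) + C(ba_degree)", data=df)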
results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,annual_earnings,R-squared:,0.309
Model:,OLS,Adj. R-squared:,0.308
Method:,Least Squares,F-statistic:,208.8
Date:,"Wed, 22 Feb 2023",Prob (F-statistic):,0.0
Time:,15:24:37,Log-Likelihood:,-1425300.0
No. Observations:,122603,AIC:,2851000.0
Df Residuals:,122340,BIC:,2854000.0
Df Model:,262,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,1.099e+04,1138.057,9.661,0.000,8763.886,1.32e+04
C(sex)[T.2],-1.086e+04,175.700,-61.816,0.000,-1.12e+04,-1.05e+04
C(ind02)[T.Animal production (112)],-858.3717,1701.889,-0.504,0.614,-4194.045,2477.302
"C(ind02)[T.Forestry except logging (1131, 1132)]",924.0703,3722.778,0.248,0.804,-6372.512,8220.653
C(ind02)[T.Logging (1133)],5655.4285,2999.255,1.886,0.059,-223.062,1.15e+04
"C(ind02)[T.Fishing, hunting, and trapping (114)]",3510.5395,4528.900,0.775,0.438,-5366.030,1.24e+04
C(ind02)[T.Support activities for agriculture and forestry (115)],6206.0435,2760.441,2.248,0.025,795.625,1.16e+04
C(ind02)[T.Oil and gas extraction (211)],3.392e+04,3072.493,11.039,0.000,2.79e+04,3.99e+04
C(ind02)[T.Coal mining (2121)],2.528e+04,2550.831,9.911,0.000,2.03e+04,3.03e+04

0,1,2,3
Omnibus:,14403.972,Durbin-Watson:,1.863
Prob(Omnibus):,0.0,Jarque-Bera (JB):,21479.46
Skew:,0.874,Prob(JB):,0.0
Kurtosis:,4.07,Cond. No.,1.05e+16


> Exercise 13

Adding fixed effects for industry controls for all differences in average earnings across industries, whether or not we can observe their causes. The coefficient on female is then estimated only from comparisons of men and women within the same industry, rather than from the overall gap across all workers. The implicit comparison is therefore between men and women who work in the same industry and have the same age and education, and the coefficient is an average of these within-industry gaps.
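
To make the "within the same industry" comparison concrete: a regression with a dummy for every industry gives the same coefficient on the female indicator as a regression on data from which each industry's mean has been subtracted. The sketch below is a minimal illustration with invented data and plain NumPy least squares rather than the `statsmodels` call used in this notebook; all names and numbers in it are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "industry": rng.integers(0, 3, size=n),
    "female": rng.integers(0, 2, size=n),
})
# Invented earnings: an industry-specific level, a gap of -5, and noise.
toy["earnings"] = (50 + 10 * toy["industry"] - 5 * toy["female"]
                   + rng.normal(0, 1, size=n))

# (a) Dummy-variable regression: female plus one indicator per industry.
dummies = pd.get_dummies(toy["industry"], dtype=float).to_numpy()
X_dummy = np.column_stack([toy["female"].to_numpy(), dummies])
beta_dummy = np.linalg.lstsq(X_dummy, toy["earnings"].to_numpy(), rcond=None)[0][0]

# (b) Within regression: subtract each industry's mean from both variables.
cols = ["female", "earnings"]
demeaned = toy[cols] - toy.groupby("industry")[cols].transform("mean")
beta_within = (demeaned["female"] @ demeaned["earnings"]) / (
    demeaned["female"] @ demeaned["female"])

print(round(beta_dummy, 4), round(beta_within, 4))
```

The two estimates agree up to floating-point error, which is why industry fixed effects can be read as comparing each worker with the average of their own industry.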

> Exercise 14

With industry fixed effects added, the estimated gender gap shrinks from about -13,000 in Exercise 8 to about -10,860. Because the gap is smaller when men and women are compared within the same industry, we can infer that women are more likely than men to work in lower-paying industries, and that this sorting accounted for part of the gap estimated in Exercise 8. More generally, fixed effects for industry or job type separate the part of the gender wage gap that arises from where men and women work from the part that remains among workers in the same industry.