# Econ 695 Project 
Due: 12/17/2025. Submit this notebook, or any other scripts you use to produce the tables and figures, along with your report. 

Read "ECON695_project.pdf" for details on each expected output. 

In [1]:
import pandas as pd 
import os  # for setting directory 
# os.chdir() # input your personal directory where the dataset is saved
import statsmodels.formula.api as smf # for OLS regressions
import numpy as np  # to work with arrays (vectors/matrices)
import matplotlib.pyplot as plt # for plots

In [3]:
data = pd.read_csv("projectdata.csv")
print(data.shape)
data.head(2)

(16969, 12)


Unnamed: 0,y,age,educ,female,exp,yl1,yl2,yl3,yp1,yp2,owage2,owage1
0,2.570021,48,12,0,30,2.620926,2.616735,2.608906,2.784972,2.791555,1.191211,1.122809
1,2.298948,46,12,1,28,2.296236,2.284491,2.275459,2.309241,2.32077,2.397997,2.48569


In [4]:
print(data['female'].value_counts(dropna=False))
print(data['educ'].value_counts(dropna=False))
data.describe()

female
0    10575
1     6394
Name: count, dtype: int64
educ
12    5019
6     4651
9     3859
16    3440
Name: count, dtype: int64


Unnamed: 0,y,age,educ,female,exp,yl1,yl2,yl3,yp1,yp2,owage2,owage1
count,16969.0,16969.0,16969.0,16969.0,16969.0,16969.0,16969.0,16969.0,16969.0,16969.0,16969.0,16969.0
mean,1.787921,33.558902,10.484118,0.376805,17.074783,1.736934,1.717822,1.696991,1.806854,1.840162,1.692074,1.64842
std,0.645257,5.693736,3.586129,0.4846,6.446165,0.610285,0.600409,0.590419,0.646991,0.650853,0.466264,0.472052
min,0.601944,22.0,6.0,0.0,5.0,0.603566,0.592804,0.590613,0.637028,0.696327,0.716435,0.718321
25%,1.297586,29.0,6.0,0.0,12.0,1.276873,1.264874,1.249482,1.315681,1.346496,1.352856,1.299025
50%,1.617799,33.0,9.0,0.0,17.0,1.576101,1.566374,1.551751,1.635421,1.660561,1.56147,1.550452
75%,2.194695,37.0,12.0,1.0,22.0,2.120593,2.095272,2.058599,2.216467,2.252076,1.957534,1.889397
max,4.343445,52.0,16.0,1.0,30.0,4.237082,4.30092,4.315834,4.322492,4.399618,3.804267,3.808492


# Section 1. Overview: Female vs. Male Workers
## Summary Statistics (Table 1)
A suggested format for **Table 1** is to have 4 columns:
- column 1 = characteristics for all workers
- column 2 = characteristics for female workers
- column 3 = characteristics for male workers
- column 4 = test statistic comparing females and males (e.g., a t-test)

See "ECON695_project.pdf" for the list of characteristics. 
Hint: to convert a dataframe to a LaTeX table, use `table.to_latex(float_format="%.3f")`. You could also copy-paste the estimates into Excel and format them manually. 
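
One possible way to assemble the four columns is sketched below. It runs on a small synthetic dataframe rather than `projectdata.csv`, and the helper name `summary_table`, the simulated values, and the choice of a Welch (unequal-variance) t-statistic are all assumptions made for illustration; the project PDF determines which characteristics and which test to report.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for projectdata.csv (hypothetical values, same column names).
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "female": rng.integers(0, 2, n),
    "educ": rng.choice([6, 9, 12, 16], n),
    "exp": rng.integers(5, 31, n),
})
demo["y"] = 1.0 + 0.06 * demo["educ"] - 0.2 * demo["female"] + rng.normal(0, 0.4, n)


def summary_table(df, cols, group="female"):
    """Means for all / female / male workers plus a Welch t-statistic."""
    f = df.loc[df[group] == 1, cols]
    m = df.loc[df[group] == 0, cols]
    # Welch t-statistic: difference in means over its unequal-variance standard error.
    se = np.sqrt(f.var(ddof=1) / len(f) + m.var(ddof=1) / len(m))
    return pd.DataFrame({
        "All": df[cols].mean(),
        "Female": f.mean(),
        "Male": m.mean(),
        "t-stat (F - M)": (f.mean() - m.mean()) / se,
    })


table1 = summary_table(demo, ["y", "educ", "exp"])
print(table1.round(3))
```

With the real data, `summary_table(data, [...])` would take the list of characteristics from the project PDF in place of the three columns used here.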

## Distribution of Wages (Figure 1)
- Figure 1: histogram of log hourly wages for men and women in one figure. 
- Bonus (Figure 1b): Develop some interesting ways to illustrate the fact that women's wages are lower than men's.
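
A minimal sketch of an overlaid histogram follows. The two wage arrays are simulated (hypothetical means and spreads), and the bin count, transparency, and labels are arbitrary choices rather than requirements. The key detail is that both groups share the same bin edges, so the bars are directly comparable.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
wage_m = rng.normal(1.9, 0.6, 600)  # hypothetical male log wages
wage_f = rng.normal(1.6, 0.6, 400)  # hypothetical female log wages

# Shared bin edges across both groups (40 bins between the overall min and max).
bins = np.linspace(min(wage_m.min(), wage_f.min()),
                   max(wage_m.max(), wage_f.max()), 41)

fig, ax = plt.subplots(figsize=(7, 4))
# density=True normalizes each histogram, since the group sizes differ.
ax.hist(wage_m, bins=bins, density=True, alpha=0.5, label="Men")
ax.hist(wage_f, bins=bins, density=True, alpha=0.5, label="Women")
ax.set_xlabel("Log hourly wage")
ax.set_ylabel("Density")
ax.set_title("Figure 1: Distribution of log hourly wages by gender")
ax.legend()
fig.tight_layout()
```

With the project data, the two arrays would be `data.loc[data["female"] == 0, "y"]` and `data.loc[data["female"] == 1, "y"]`.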

# Section 2. Gender Wage Gaps
## 2.1 Wage Regressions and Oaxaca Decomposition (Table 2)
In Table 2, you will fit a series of standard wage models. The first set of columns is estimated using the pooled data for women and men. The second set is estimated separately by gender; these estimates are the basis for an Oaxaca decomposition. 
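
As a rough illustration of the decomposition step, the sketch below uses the identity $\bar{y}_m - \bar{y}_f = (\bar{X}_m - \bar{X}_f)'\hat\beta_m + \bar{X}_f'(\hat\beta_m - \hat\beta_f)$, which holds exactly for OLS with a constant. Several things here are assumptions: the data are simulated, the specification (education plus a cubic in experience) is only one candidate, and using the male coefficients as the reference is one convention among several.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with the same column names as the project file (hypothetical values).
rng = np.random.default_rng(2)
n = 2000
sim = pd.DataFrame({
    "female": rng.integers(0, 2, n),
    "educ": rng.choice([6, 9, 12, 16], n),
    "exp": rng.integers(5, 31, n),
})
sim["exp2"] = sim["exp"] ** 2
sim["exp3"] = sim["exp"] ** 3
sim["y"] = (0.4 + 0.08 * sim["educ"] + 0.04 * sim["exp"] - 0.0007 * sim["exp2"]
            - 0.15 * sim["female"] + rng.normal(0, 0.3, n))

# Separate wage models by gender (assumed specification).
formula = "y ~ educ + exp + exp2 + exp3"
fit_m = smf.ols(formula, data=sim[sim["female"] == 0]).fit()
fit_f = smf.ols(formula, data=sim[sim["female"] == 1]).fit()


def mean_regressors(fit):
    """Column means of the design matrix, including the constant."""
    return pd.Series(fit.model.exog.mean(axis=0), index=fit.model.exog_names)


xbar_m, xbar_f = mean_regressors(fit_m), mean_regressors(fit_f)
gap = fit_m.model.endog.mean() - fit_f.model.endog.mean()

# Explained part: differences in characteristics, priced at male coefficients.
explained = (xbar_m - xbar_f) @ fit_m.params
# Unexplained part: differences in coefficients, evaluated at female means.
unexplained = xbar_f @ (fit_m.params - fit_f.params)

print(f"gap={gap:.3f}  explained={explained:.3f}  unexplained={unexplained:.3f}")
```

Swapping the reference group (female coefficients, male means) generally changes the split between the two components, so it is worth stating which convention the report uses.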

In [None]:
# sample code for generating a LaTeX table for regressions 
'''
from statsmodels.iolib.summary2 import summary_col
summary = summary_col(
    results=[model0, model1],
    float_format='%0.3f', stars=False,
    model_names=['(1)', '(2)'],
    info_dict={
        'N': lambda x: f"{int(x.nobs)}", 'Adj R2': lambda x: f"{x.rsquared_adj:.3f}"
    }
)
summary.as_latex()
'''

## 2.2 Gender Difference in Experience Profiles
- Figure 2: Plot the relationship between wages and experience for men and women who have one level of education (e.g., education=12), and show the fit of your regression models. 
- Bonus (Figure 2 continued): Consider plotting “Figure 2” for each of the 4 education groups. Within each education group, how much of the gender gap can be explained by differences in experience profile?

# Section 3. Gender Wage Gaps Conditional on Coworker Wages

**Table 3** will report 5 models. 

The first set of models are estimated in the pooled data for men and women: 
- including only a constant, a female dummy, and owage2
- including a constant, education, a cubic in experience, a female dummy, and owage2
- including a constant, education, a cubic in experience, a female dummy, owage2, and the interaction of owage2 with the female dummy.

Then, fit separate models for men and women that include a constant, education, a cubic in experience, and owage2. Use these models to perform a new decomposition that accounts for the effect of higher-wage coworkers.

# Section 4. Event Study - Wage Changes around Moves

- **Figure 3**: Conduct 9 separate event studies, plotting mean wages in periods -3, -2, -1, 0, 1, 2 for people who start in each tercile of `owage1` and go to each tercile of `owage2`.  

- **Table 4**: Model the change in wages from -1 to 0 (y - yl1) as a function of the change in the mean log wage of co-workers (owage2 - owage1).

- **Bonus (Table 4b)**: Apply shrinkage methods to first-differenced models that control for interactions between (owage2 - owage1) and experience dummies. 


## Figure 3
Hint: get the terciles of owage1, owage2 via `qcut`. 
Reshape the data into a panel at the (person, $l$) level, where $l=-3,-2,-1,0,1,2$ indexes years relative to the move to the second job. There are several ways to reshape the data from wide to long. For example, you can set the index of the dataframe first and then `stack`.   
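
The hint above could be implemented roughly as follows. The data are simulated, and the names `wide`, `long`, `t1`, `t2`, and `period` are placeholders. The mapping from wage columns to event time (`yl3` to `yp2` as periods -3 to 2) is an assumption inferred from the column names and from the use of `y - yl1` as the change from -1 to 0; check it against the project PDF.

```python
import numpy as np
import pandas as pd

# Simulated wide data: one row per person (hypothetical values).
rng = np.random.default_rng(3)
n = 900
wide = pd.DataFrame({
    "owage1": rng.normal(1.65, 0.47, n),
    "owage2": rng.normal(1.69, 0.47, n),
})
for col in ["yl3", "yl2", "yl1", "y", "yp1", "yp2"]:
    wide[col] = rng.normal(1.75, 0.6, n)

# Terciles of origin and destination coworker wages.
wide["t1"] = pd.qcut(wide["owage1"], 3, labels=[1, 2, 3])
wide["t2"] = pd.qcut(wide["owage2"], 3, labels=[1, 2, 3])

# Assumed mapping from wage columns to years relative to the move.
period = {"yl3": -3, "yl2": -2, "yl1": -1, "y": 0, "yp1": 1, "yp2": 2}

# Wide to long: index by (person, t1, t2), then stack the six wage columns.
long = (
    wide.rename_axis("person")
        .set_index(["t1", "t2"], append=True)[list(period)]
        .rename(columns=period)
        .rename_axis(columns="l")
        .stack()
        .rename("wage")
        .reset_index()
)

# Mean wage by origin tercile, destination tercile, and event time.
means = (long.groupby(["t1", "t2", "l"], observed=True)["wage"]
             .mean()
             .unstack("l"))
print(means.round(3))
```

Each of the 9 rows of `means` is one event-study line, so plotting a row against its column labels gives one of the nine series for Figure 3.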

## Table 4
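
A minimal sketch of the first-differenced regression, assuming the simplest specification (a constant and the change in coworker wages). The data are simulated with a known slope of 0.3 purely so the output can be sanity-checked; the names `moves`, `dy`, and `dowage` are placeholders, and any additional controls are left to the project instructions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated movers (hypothetical values, project column names).
rng = np.random.default_rng(4)
n = 1500
moves = pd.DataFrame({
    "owage1": rng.normal(1.65, 0.47, n),
    "owage2": rng.normal(1.69, 0.47, n),
    "yl1": rng.normal(1.74, 0.6, n),
})
# Hypothetical data-generating process with a true slope of 0.3.
moves["y"] = (moves["yl1"] + 0.3 * (moves["owage2"] - moves["owage1"])
              + rng.normal(0, 0.1, n))

# First differences: own wage change and coworker wage change.
moves["dy"] = moves["y"] - moves["yl1"]
moves["dowage"] = moves["owage2"] - moves["owage1"]

fd = smf.ols("dy ~ dowage", data=moves).fit()
print(fd.params)
```

The fitted result `fd` can be passed to `summary_col` in the same way as the models in Table 2.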

## Bonus: Shrinkage
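
One way to build the regressor matrix for the shrinkage exercise is sketched here: a dummy for each experience level, each multiplied by the change in coworker wages. The dataframe is simulated, and the names `df`, `dowage`, and the `dowage_x_` prefix are placeholders. Whether to also include un-interacted experience dummies is a modelling choice this sketch leaves open.

```python
import numpy as np
import pandas as pd

# Simulated inputs (hypothetical values): experience and change in coworker wages.
rng = np.random.default_rng(5)
n = 500
df = pd.DataFrame({
    "exp": rng.integers(5, 31, n),
    "dowage": rng.normal(0.04, 0.6, n),
})

# One dummy per experience level, kept as floats for the estimators.
exp_dummies = pd.get_dummies(df["exp"], prefix="exp", dtype=float)

# Interact each dummy with dowage: every row has a single non-zero entry,
# equal to dowage, in the column for that person's experience level.
X = exp_dummies.mul(df["dowage"], axis=0).add_prefix("dowage_x_")
print(X.shape)
```

The coefficient on each column is then an experience-specific response to coworker wage changes, and Ridge or Lasso pulls the noisier ones toward zero.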


In [None]:
# For Ridge/Lasso: 
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV

# Suppose you have defined a dataframe/matrix X with regressors, and y = change in log wage. Try the following: 
'''
fd_Ridge = RidgeCV(fit_intercept=True, cv=5).fit(X, y)
# extract coefficients: 
[fd_Ridge.intercept_.item()] + fd_Ridge.coef_.tolist()
# For Lasso: 
fd_Lasso = LassoCV(cv=5).fit(X, y)
[fd_Lasso.intercept_.item()] + fd_Lasso.coef_.tolist()
'''