Note: Write your code in the code cells, and your responses in markdown. 
Run the entire script and display the outputs of your code. 

Due: **11:59PM Central Time on Friday, 09/26**. Upload both your code (.ipynb) and responses (html or pdf) to Canvas by then. 

<!-- Helpful tips on Markdown: - https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html# -->

In [None]:
# Packages you might need: (pip install ... if you don't have them)
import pandas as pd # for data manipulation
import os  # for setting directory 
print(os.getcwd())
# os.chdir() # input your personal directory where the dataset is saved
import statsmodels.formula.api as smf # for OLS regressions
import numpy as np  # to work with arrays (vectors/matrices)

# Oaxaca Decomposition
The goal of this problem set is to extend the analysis of public-private pay gaps in Lecture 3 to male workers. The file pay.csv contains 45,404 observations on male workers, age 30-50, with education of 12 or more years, in the March 2012 and 2013 CPS files. The variables are:
- govt=1 if government worker
- logwage=log hourly wage based on earnings, weeks worked, and average hours per week last year
- educ = years of education
- hs = 1 if high school grad (12 years of schooling)
- somecoll=1 if education is 13-14 years.
- college = 1 if BA (16 years of education)
- ma = 1 if masters degree (18 years of schooling)
- phd = 1 if phd or LLD or medical degree (20 years of schooling)
- age= age in years (range 30-50)

NOTE: “somecoll” combines 2 levels of education, 13 and 14 years (13 years are people with some college but no degree; 14 years are people with an AA degree from a community college).

Please create 2 new variables: educ13=1[educ=13] and educ14=1[educ=14] so you will have a total of 6 possible categories for education.

In [None]:
# Load dataset (using pandas "pd")
pay = pd.read_csv("pay.csv") # add your own directory if necessary

# create 2 new indicator variables: educ13, educ14
pay['educ13'] = (pay['educ'] == 13).astype(int)
pay['educ14'] = (pay['educ'] == 14).astype(int)
print(pay.head(2))
# 6 educ categories: phd, ma, college, educ14, educ13, hs
# define a variable for the 6 education levels (1 = hs, ..., 6 = phd):
pay['educ_level'] = (
    1 * (pay['hs'] == 1) + 2 * (pay['educ13'] == 1) + 3 * (pay['educ14'] == 1)
    + 4 * (pay['college'] == 1) + 5 * (pay['ma'] == 1) + 6 * (pay['phd'] == 1)
)
print(pay['educ_level'].value_counts(dropna=False))

## 1. Means of All Variables in the private and public sectors
Construct a table like the 1st table (page 7) in Lecture 3. Show the means of all the variables for workers in the private (govt=0) and public (govt=1) sectors, and the differences in these means.

Hint: use pandas groupby https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html, and reshape dataframes to look similar to the tables in lecture 3: (long to wide) https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html 
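As a rough illustration of the groupby-then-reshape idea in the hint, here is a minimal sketch on a made-up five-row dataset (the `toy` frame and its values are hypothetical, not taken from pay.csv). It uses a transpose instead of `pivot`, which is one of several ways to get variables in rows and sectors in columns.

```python
import pandas as pd

# Hypothetical toy data (NOT pay.csv): 3 private (govt=0) and 2 public (govt=1) workers
toy = pd.DataFrame({
    "govt":    [0, 0, 0, 1, 1],
    "logwage": [2.0, 3.0, 4.0, 3.0, 3.5],
    "educ":    [12, 16, 16, 16, 18],
})

# Means by sector; transpose so variables are rows and sectors are columns
means = toy.groupby("govt").mean().T
means.columns = ["private", "public"]  # govt=0 is private, govt=1 is public
means["difference"] = means["public"] - means["private"]
print(means)
```

The same pattern should carry over to the full dataset by replacing `toy` with `pay`, assuming all columns of `pay` are numeric.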

In [None]:
# Write your code here 

## 2. Education and Wage Gaps
Construct a table like the 2nd table (page 8) in Lecture 3. Show the fractions of private and government workers at each education level, the mean wages of the two groups at each level, and the public-private wage gap at each level.
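One possible way to assemble such a table is sketched below on made-up data with only two education levels (the `toy` frame, its values, and the output column names are hypothetical). It combines `pd.crosstab` for the within-sector shares with `pivot_table` for the cell means.

```python
import pandas as pd

# Hypothetical toy data (NOT pay.csv): 4 private and 4 public workers, 2 education levels
toy = pd.DataFrame({
    "govt":       [0, 0, 0, 0, 1, 1, 1, 1],
    "educ_level": [1, 1, 1, 4, 1, 4, 4, 4],
    "logwage":    [2.0, 2.2, 2.4, 3.0, 2.5, 3.1, 3.2, 3.3],
})

# Share of each sector's workers at each education level (each column sums to 1)
shares = pd.crosstab(toy["educ_level"], toy["govt"], normalize="columns")
# Mean log wage in each education-by-sector cell
wages = toy.pivot_table(index="educ_level", columns="govt",
                        values="logwage", aggfunc="mean")

table = pd.DataFrame({
    "share_private": shares[0],
    "share_public":  shares[1],
    "wage_private":  wages[0],
    "wage_public":   wages[1],
})
table["gap"] = table["wage_public"] - table["wage_private"]
print(table)
```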

## 3. Distribution across Education Categories, by Group 
Using the notation of Lecture 3, let private sector workers be “group $a$” (the reference group), let government workers be “group $b$”, and define the 6 category variables $D_{gi}$ for each level of education $g=1,\dots,6$. Let $x_{i}'=(D_{1i},D_{2i},\dots,D_{6i})$. Find $\overline{x}^{a}$ and $\overline{x}^{b}$.

### (a) OLS: Log wage on education categories, by group
Fit an OLS regression of log wages on $x_i$ for each group, and estimate the coefficients $\hat{\beta}^{a}$ and $\hat{\beta}^{b}$. Verify that these coefficients are the mean log wages of groups $a$ and $b$, respectively, for each level of education.

*Hint*: see reg_dummy.py under Files/Lectures/lecture2 for example code. I used: `smf.ols("y ~ 0 + C(x)", data=temp).fit(cov_type='HC1')`, where `0` suppresses the constant and `x` is a categorical variable (so the regression coefficients are the means of each category).
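To see why a regression on a full set of category indicators with no constant returns the category means, the sketch below solves the least-squares problem directly with NumPy on made-up data. This is a stand-in for the `smf.ols` call in the hint, not a replacement for it, and the `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT pay.csv): one group, two education levels
toy = pd.DataFrame({
    "educ_level": [1, 1, 4, 4, 4],
    "logwage":    [2.0, 2.4, 3.0, 3.2, 3.4],
})

# One indicator column per education level, and no constant column
X = pd.get_dummies(toy["educ_level"]).to_numpy(dtype=float)
y = toy["logwage"].to_numpy()

# Least-squares coefficients of y on the indicators
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Mean log wage within each education level, for comparison
cell_means = toy.groupby("educ_level")["logwage"].mean().to_numpy()
print(beta_hat, cell_means)
```

Because the indicator columns are mutually orthogonal, each coefficient depends only on the observations in its own category, which is why it equals that category's mean.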

### (b) Mean log wage as a weighted average 
Using your estimates of $\overline{x}^{a}$, $\overline{x}^{b}$, $\hat{\beta}^{a}$ and $\hat{\beta}^{b}$, construct $\overline{y}^{a}=\sum_{g}\overline{p}_{g}^{a}\, \overline{y}_{g}^{a}=(\overline{x}^{a})'\hat{\beta}^{a}$ and $\overline{y}^{b}=\sum_{g}\overline{p}_{g}^{b}\, \overline{y}_{g}^{b}=(\overline{x}^{b})'\hat{\beta}^{b}$. Verify that these match the mean log wages for groups $a$ and $b$ that you compute directly.
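The identity can be checked numerically along the following lines. This is a sketch for a single made-up group (the `toy` frame is hypothetical); the same steps would be repeated separately for groups $a$ and $b$.

```python
import pandas as pd

# Hypothetical toy data (NOT pay.csv): one group, two education levels
toy = pd.DataFrame({
    "educ_level": [1, 1, 4, 4, 4],
    "logwage":    [2.0, 2.4, 3.0, 3.2, 3.4],
})

# x-bar: share of the group in each education level
p = toy["educ_level"].value_counts(normalize=True).sort_index()
# beta-hat: mean log wage in each education level
beta = toy.groupby("educ_level")["logwage"].mean()

# Share-weighted average of the category means (pandas aligns on educ_level)
y_bar_weighted = (p * beta).sum()
# Mean log wage computed directly
y_bar_direct = toy["logwage"].mean()
print(y_bar_weighted, y_bar_direct)
```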

### (c) Counterfactual 1
Using your estimates of $\overline{x}^{a}$ and $\hat{\beta}^{b}$, construct $$\overline{y}_{counterf}^{b}=\sum_{g}\overline{p}_{g}^{a}\overline{y}_{g}^{b}=(\overline{x}^{a})'\hat{\beta}^{b}$$ Find the adjusted wage gap: $\overline{y}_{counterf}^{b}-\overline{y}^{a}$. How does this compare to the actual gap $\overline{y}^{b}-\overline{y}^{a}$?
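A minimal sketch of the counterfactual calculation on made-up data follows (the `toy` frame and variable names are hypothetical). It assumes every education level appears in both groups, so that the shares and the cell means align.

```python
import pandas as pd

# Hypothetical toy data (NOT pay.csv): group a is govt=0, group b is govt=1
toy = pd.DataFrame({
    "govt":       [0, 0, 0, 0, 1, 1, 1, 1],
    "educ_level": [1, 1, 1, 4, 1, 4, 4, 4],
    "logwage":    [2.0, 2.2, 2.4, 3.0, 2.5, 3.1, 3.2, 3.3],
})
a = toy[toy["govt"] == 0]
b = toy[toy["govt"] == 1]

# Education shares of group a and category mean log wages of group b
p_a = a["educ_level"].value_counts(normalize=True).sort_index()
beta_b = b.groupby("educ_level")["logwage"].mean()

# Group b's mean log wage if it had group a's education distribution
y_counterf_b = (p_a * beta_b).sum()

adjusted_gap = y_counterf_b - a["logwage"].mean()
actual_gap = b["logwage"].mean() - a["logwage"].mean()
print(adjusted_gap, actual_gap)
```

In this toy example group $b$ is more educated than group $a$, so holding education fixed at group $a$'s distribution shrinks the gap.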

### (d) Counterfactual 2 
Construct the weight $w_{g}=N_{g}^{a}/N_{g}^{b}$ (the ratio of the number of group $a$ workers to group $b$ workers in category $g$) for education categories $g=1,\dots,6$, and assign each worker $i$ in group $b$ the weight of his education category. Use this to construct $\overline{y}_{counterf2}^{b} = \frac{\sum_{i\in b}w_{g}y_{i}}{\sum_{i\in b}w_{g}}$ and verify that: $$\overline{y}_{counterf}^{b}=\sum_{g}\overline{p}_{g}^{a}\overline{y}_{g}^{b}=(\overline{x}^{a})'\hat{\beta}^{b}=\frac{\sum_{i\in b}w_{g}y_{i}}{\sum_{i\in b}w_{g}}=\overline{y}_{counterf2}^{b}$$
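The equivalence between reweighting and the share-based counterfactual can be illustrated on made-up data as follows (the `toy` frame and variable names are hypothetical, and every education level is assumed to appear in both groups).

```python
import pandas as pd

# Hypothetical toy data (NOT pay.csv): group a is govt=0, group b is govt=1
toy = pd.DataFrame({
    "govt":       [0, 0, 0, 0, 1, 1, 1, 1],
    "educ_level": [1, 1, 1, 4, 1, 4, 4, 4],
    "logwage":    [2.0, 2.2, 2.4, 3.0, 2.5, 3.1, 3.2, 3.3],
})
a = toy[toy["govt"] == 0]
b = toy[toy["govt"] == 1]

# Category weights w_g = N_g^a / N_g^b (counts align on educ_level)
w_g = a["educ_level"].value_counts() / b["educ_level"].value_counts()
# Give each worker in group b the weight of their education category
w_i = b["educ_level"].map(w_g)

# Weighted mean of group b's log wages
y_counterf2_b = (w_i * b["logwage"]).sum() / w_i.sum()

# Share-based counterfactual for comparison
p_a = a["educ_level"].value_counts(normalize=True).sort_index()
beta_b = b.groupby("educ_level")["logwage"].mean()
y_counterf_b = (p_a * beta_b).sum()
print(y_counterf2_b, y_counterf_b)
```

The weights rescale each category of group $b$ so that its weighted head count matches group $a$'s, which is why the two numbers coincide.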