Write a short report (at most 1 page, excluding graphs and tables) that describes your major decisions,
your estimated models, your interpretations, and a summary. You may include a descriptive data table (optional), a
regression table, and up to two graphs. Push all code to your GitHub repo, with appropriate
commit messages.

**The dataset**

Consider the cps-earnings dataset at https://osf.io/g8p9j/ (cross-section, N = 149,316 individuals).


Pick an occupation and filter data accordingly.
You must choose occupation(s) different from those covered in Ch09 (1005 + 1240). Occupation codes
are here: https://osf.io/57n9q/.

You may merge occupations as you see fit (e.g. all tax/insurance
specialists).
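Merging several occupation codes into one group can be sketched as follows. This is only an illustration on a toy data frame: the `occ_group` column and the grouping itself are assumptions, and the code range 2100-2160 is taken from the notebook cells further down.

```python
import pandas as pd

# Toy stand-in for the CPS extract; only the occ2012 column matters here.
toy = pd.DataFrame({"occ2012": [840, 2100, 2145, 2160, 1005, 5400]})

# Hypothetical grouping: codes 2100-2160 are treated as one merged "legal" group.
legal = toy["occ2012"].between(2100, 2160)  # inclusive on both ends
financial = toy["occ2012"] == 840

# Keep only the chosen occupations and label the merged groups.
merged = toy.loc[legal | financial].copy()
merged["occ_group"] = legal.loc[merged.index].map({True: "legal", False: "financial"})
print(merged)
```

A boolean mask per group keeps the filter readable and makes it easy to add or drop codes later.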


**Tasks**
- Show the unconditional gender gap.
- Show how the gender gap varies with the level of education. Consider several options to model the
relationship.
- Interpret your key coefficients, including statistical inference.
- Summarize your findings.
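One possible way to estimate the unconditional gender gap is a regression of log hourly wage on a female dummy with robust standard errors. The sketch below runs on synthetic data, so the numbers mean nothing. It assumes that `sex == 2` codes women and that hourly wage is `earnwke / uhours`; the names `female` and `lnw` are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the filtered CPS sample (all values are made up).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "sex": rng.integers(1, 3, size=n),       # assumed coding: 1 = male, 2 = female
    "uhours": rng.integers(20, 61, size=n),  # usual weekly hours
})
toy["female"] = (toy["sex"] == 2).astype(int)

# Weekly earnings built so that women earn about 0.15 log points less per hour.
hourly = np.exp(3.2 - 0.15 * toy["female"] + rng.normal(0, 0.4, size=n))
toy["earnwke"] = hourly * toy["uhours"]

# Log hourly wage as the dependent variable.
toy["lnw"] = np.log(toy["earnwke"] / toy["uhours"])

# Unconditional gap: the coefficient on female, with heteroskedasticity-robust SEs.
gap = smf.ols("lnw ~ female", data=toy).fit(cov_type="HC1")
print(gap.params["female"], gap.conf_int().loc["female"].tolist())
```

With a single dummy regressor, the slope equals the difference in mean log wages between women and men, and the confidence interval gives the statistical inference the task asks for.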

**What to submit**
1. A PDF report.
2. A link to your GitHub page with your code, markdown files, etc.

## NOTE 1
In assignment 1, when you have to consider several options to model the conditional gender gap, you can look at, among others:
- models that also include additional covariates (as we did today), possibly with interactions
- models with earnings transformed or not (as in previous TA sessions)
- models with different reference groups when you include dummies (as today)
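The options above can be compared side by side. The following is a minimal sketch on synthetic data, not the course solution: the variables `edu`, `age`, `lnw` and `w` and the education labels are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data; the variable names and education labels are hypothetical.
rng = np.random.default_rng(1)
n = 600
toy = pd.DataFrame({
    "female": rng.integers(0, 2, size=n),
    "edu": rng.choice(["BA", "MA", "PhD"], size=n),
    "age": rng.integers(25, 65, size=n),
})
toy["lnw"] = 3.0 + 0.01 * toy["age"] - 0.10 * toy["female"] + rng.normal(0, 0.4, size=n)
toy["w"] = np.exp(toy["lnw"])

# One formula per modelling option mentioned in the note.
specs = {
    "log, covariate": "lnw ~ female + age",
    "log, interaction": "lnw ~ female * C(edu, Treatment(reference='BA'))",
    "log, other reference": "lnw ~ female * C(edu, Treatment(reference='MA'))",
    "level": "w ~ female + age",
}
fits = {name: smf.ols(f, data=toy).fit(cov_type="HC1") for name, f in specs.items()}

for name, fit in fits.items():
    print(name, round(fit.params["female"], 3), round(fit.bse["female"], 3))
```

In the interaction models, the coefficient on `female` is the gap within the reference education group, so changing the reference group changes that coefficient but not the overall fit.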

In [1]:
# importing all libraries
import os
import sys
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from mizani.formatters import percent_format
from plotnine import *
from datetime import datetime
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import norm
from IPython.core.display import HTML
from stargazer.stargazer import Stargazer
import statsmodels.nonparametric.kernel_regression as loess

from mizani.transforms import log_trans
from mizani.formatters import log_format

warnings.filterwarnings("ignore")

In [2]:
# importing the data

df = pd.read_csv('https://osf.io/download/4ay9x/')

In [3]:
# checking the number of rows and columns
df.shape

(149316, 23)

In [4]:
# check column names
df.columns

Index(['Unnamed: 0', 'hhid', 'intmonth', 'stfips', 'weight', 'earnwke',
       'uhours', 'grade92', 'race', 'ethnic', 'age', 'sex', 'marital',
       'ownchild', 'chldpres', 'prcitshp', 'state', 'ind02', 'occ2012',
       'class', 'unionmme', 'unioncov', 'lfsr94'],
      dtype='object')

In [5]:
pd.set_option('display.max_columns', None)

In [6]:
# count missing values per column
df.isna().sum()

Unnamed: 0         0
hhid               0
intmonth           0
stfips             0
weight             0
earnwke            0
uhours             0
grade92            0
race               0
ethnic        129245
age                0
sex                0
marital            0
ownchild           0
chldpres           0
prcitshp           0
state              0
ind02              0
occ2012            0
class              0
unionmme           0
unioncov       17096
lfsr94             0
dtype: int64

### SELECT OCCUPATION
Keep only two occupation types: Financial Analysts and Legal Occupations

(for this case study we start with Financial Analysts)

(see the CPS occupation codes file)

In [7]:
# flag the first sample (sample == 1): Financial Analysts
df.loc[df["occ2012"] == 840, "sample"] = 1

# flag the second sample (sample == 2): Legal occupations
df.loc[
    ((df["occ2012"] >= 2100) & (df["occ2012"] <= 2160)), "sample"
] = 2

# all remaining rows (sample still NA) get sample == 0
df.loc[df["sample"].isna(), "sample"] = 0
df

Unnamed: 0.1,Unnamed: 0,hhid,intmonth,stfips,weight,earnwke,uhours,grade92,race,ethnic,age,sex,marital,ownchild,chldpres,prcitshp,state,ind02,occ2012,class,unionmme,unioncov,lfsr94,sample
0,3,2600310997690,January,AL,3151.6801,1692.00,40,43,1,,29,2,7,0,0,"Native, Born In US",63,Employment services (5613),630,"Private, For Profit",No,No,Employed-At Work,0.0
1,5,75680310997590,January,AL,3457.1138,450.00,40,41,2,,27,2,1,2,6,"Native, Born In US",63,Outpatient care centers (6214),5400,"Private, For Profit",No,No,Employed-Absent,0.0
2,6,75680310997590,January,AL,3936.9110,1090.00,60,41,2,,30,1,1,2,6,"Native, Born In US",63,Motor vehicles and motor vehicle equipment man...,8140,"Private, For Profit",No,No,Employed-At Work,0.0
3,10,179140131100930,January,AL,3288.3640,769.23,40,40,1,,48,1,1,2,4,"Native, Born In US",63,"**Publishing, except newspapers and software (...",8255,"Private, For Profit",Yes,,Employed-At Work,0.0
4,11,179140131100930,January,AL,3422.8500,826.92,40,43,1,,46,2,1,2,4,"Native, Born In US",63,"Banking and related activities (521, 52211,52219)",5940,"Private, For Profit",No,No,Employed-At Work,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149311,317051,896679860459501,December,WY,346.2296,692.30,40,39,1,,36,1,6,0,0,"Native, Born In US",8,Office supplies and stationery stores (45321),4760,"Private, For Profit",No,No,Employed-At Work,0.0
149312,317052,907086820569600,December,WY,294.9800,1984.61,40,44,1,,45,2,1,1,3,"Native, Born In US",8,Administration of human resource programs (923),430,Government - State,No,No,Employed-At Work,0.0
149313,317053,907086820569600,December,WY,324.1761,2884.61,55,43,1,,44,1,1,1,3,"Native, Born In US",8,Nursing care facilities (6231),10,"Private, For Profit",No,No,Employed-At Work,0.0
149314,317055,950868097156649,December,WY,321.6982,1153.84,40,42,1,,46,1,1,0,0,"Native, Born In US",8,Hospitals (622),5820,"Private, Nonprofit",No,No,Employed-At Work,0.0
