In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option('max_colwidth', None)

plt.style.use('fivethirtyeight')
sns.set()
sns.set_context("talk")
%matplotlib inline
pd.set_option('display.max_columns', None)

# Lab [Number]: Mastering Metrics Recreation

In this lab, we'll be recreating the findings from this [paper](http://assets.press.princeton.edu/chapters/s10363.pdf). The original analysis was done in a statistical software package called STATA. You can view the original STATA code [here](http://www.masteringmetrics.com/wp-content/uploads/2020/04/NHIS2009_hicompare_v2.do). In this notebook, we'll be doing the same analysis using ```python``` and ```pandas``` instead.

## Load Data
Load the ```NHIS_dropped.csv``` file into a pandas data frame.

In [5]:
# Run this cell to load our data
data_string = r"NHIS_dropped.csv"
df = pd.read_csv(data_string)  # the CSV is already filtered to marradult == 1 and perweight != 0
df  # use df.head() instead to show only the first rows

Unnamed: 0,year,serial,hhweight,pernum,perweight,sampweight,age,famsize,fml,nwhite,hi,yedu,empl,hlth,inc,marradult,adltempl,hi_hsb1
0,2009,3,7871,1,8938,22029.0,29,4,1,0,0,14,0,4,19282.932,1,1,
1,2009,3,7871,4,8967,,35,4,0,0,0,11,1,4,19282.932,1,1,0.0
2,2009,5,7871,1,8905,,32,4,0,0,1,12,1,3,167844.530,1,2,1.0
3,2009,5,7871,2,8889,22190.0,34,4,1,0,1,16,1,3,167844.530,1,2,
4,2009,6,7871,1,8378,19284.0,65,2,0,0,1,14,0,3,41679.344,1,1,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29967,2009,41173,2220,2,2496,5732.0,57,2,1,0,1,12,0,4,167844.530,1,1,
29968,2009,41175,2624,1,3135,7200.0,67,2,1,0,1,14,0,3,61102.973,1,0,
29969,2009,41175,2624,2,3022,,68,2,0,0,1,14,0,2,61102.973,1,0,1.0
29970,2009,41176,2200,1,2532,18062.0,62,7,0,0,0,9,1,1,167844.530,1,2,0.0


Let's start by taking a look at the data to see if there are any issues to fix before we begin our analysis.
 * Hint: There is a built-in ```pandas``` tool that will help us do this. Check out how to use this tool [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html).

In [26]:
# Use the tool in the hint to get a description of the data set
#...

df.describe()

Unnamed: 0,year,serial,hhweight,pernum,perweight,sampweight,age,famsize,fml,nwhite,hi,yedu,empl,hlth,inc,marradult,adltempl,hi_hsb1
count,29972.0,29972.0,29972.0,29972.0,29972.0,11207.0,29972.0,29972.0,29972.0,29972.0,29972.0,29972.0,29972.0,29972.0,29972.0,29972.0,29972.0,14898.0
mean,2009.0,20383.014447,3063.161551,1.522688,3604.338783,9973.433836,48.81466,3.268451,0.502936,0.187775,0.854931,13.47094,0.643934,3.762612,83322.680317,1.0,1.290771,0.852866
std,0.0,11881.192432,2009.919715,0.560258,2341.954945,6873.291774,14.666767,1.399725,0.5,0.390539,0.352176,3.076254,0.478843,1.042617,56598.880746,0.0,0.758851,0.354251
min,2009.0,3.0,724.0,1.0,696.0,1214.0,16.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,19282.932,1.0,0.0,0.0
25%,2009.0,10024.0,1797.5,1.0,2183.0,5964.0,37.0,2.0,0.0,0.0,1.0,12.0,0.0,3.0,41679.344,1.0,1.0,1.0
50%,2009.0,20343.0,2606.0,2.0,3073.5,8385.0,48.0,3.0,1.0,0.0,1.0,14.0,1.0,4.0,61102.973,1.0,1.0,1.0
75%,2009.0,30685.0,3655.0,2.0,4267.0,11625.0,59.0,4.0,1.0,0.0,1.0,16.0,1.0,5.0,167844.53,1.0,2.0,1.0
max,2009.0,41176.0,26014.0,9.0,31494.0,102299.0,85.0,18.0,1.0,1.0,1.0,18.0,1.0,5.0,167844.53,1.0,2.0,1.0


If you look at the data frame, do you notice anything that could be wrong? Columns like ```hi_hsb1``` and ```sampweight``` have NaN values: in the ```describe``` output their counts are 14,898 and 11,207, while every other column has 29,972. This means that we need to do some data cleaning before we can build our analysis. The ```describe``` output can point to other potential issues as well. Columns such as ```yedu``` and ```hi_hsb1``` have minimums of zero. While this does not necessarily mean that there are issues with these columns, it is a good thing to keep in mind.
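As a complement to ```describe```, the missing values can also be counted directly. The sketch below is only an illustration: it uses a small toy frame (named ```toy``` here, not part of the lab) whose values are copied from the first four rows shown above, and the same two calls should work on ```df```.

```python
import numpy as np
import pandas as pd

# Toy frame: serial, sampweight and hi_hsb1 from the first four rows shown above
toy = pd.DataFrame({
    "serial": [3, 3, 5, 5],
    "sampweight": [22029.0, np.nan, np.nan, 22190.0],
    "hi_hsb1": [np.nan, 0.0, 1.0, np.nan],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Share of rows that are missing in each column
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```

Running ```df.isna().sum()``` on the full data should flag the same columns whose ```count``` falls short in the ```describe``` output.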

## Cleaning the Data

In this section, we will be cleaning and selecting the data in preparation for the final output table. 

In [29]:
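# Run this cell to begin the cleaning process
# Sum hi_hsb1 within each household (serial); NaN counts as 0 in the sum, which fills some of the missing data
df2 = df.groupby("serial", as_index=False)["hi_hsb1"].sum()
# Attach the household-level value to every row and rename it to hi_hsb
df3 = df.merge(df2, how="left", on="serial").drop("hi_hsb1_x", axis=1).rename(columns={"hi_hsb1_y": "hi_hsb"})
df3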

Unnamed: 0,year,serial,hhweight,pernum,perweight,sampweight,age,famsize,fml,nwhite,hi,yedu,empl,hlth,inc,marradult,adltempl,hi_hsb
0,2009,3,7871,1,8938,22029.0,29,4,1,0,0,14,0,4,19282.932,1,1,0.0
1,2009,3,7871,4,8967,,35,4,0,0,0,11,1,4,19282.932,1,1,0.0
2,2009,5,7871,1,8905,,32,4,0,0,1,12,1,3,167844.530,1,2,1.0
3,2009,5,7871,2,8889,22190.0,34,4,1,0,1,16,1,3,167844.530,1,2,1.0
4,2009,6,7871,1,8378,19284.0,65,2,0,0,1,14,0,3,41679.344,1,1,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29967,2009,41173,2220,2,2496,5732.0,57,2,1,0,1,12,0,4,167844.530,1,1,1.0
29968,2009,41175,2624,1,3135,7200.0,67,2,1,0,1,14,0,3,61102.973,1,0,1.0
29969,2009,41175,2624,2,3022,,68,2,0,0,1,14,0,2,61102.973,1,0,1.0
29970,2009,41176,2200,1,2532,18062.0,62,7,0,0,0,9,1,1,167844.530,1,2,0.0
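The fill step above sums ```hi_hsb1``` within each household, so it helps to see how ```sum``` treats missing values. The sketch below is a toy illustration, not the lab's own code: households 3 and 5 use the ```hi_hsb1``` values shown in the first table, while household 99 is hypothetical and has no recorded value at all. It compares the default sum with ```min_count=1```, an alternative that is named here only for comparison.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "serial": [3, 3, 5, 5, 99],  # 99 is a hypothetical household
    "hi_hsb1": [np.nan, 0.0, 1.0, np.nan, np.nan],
})

# Default behaviour: NaN is skipped, and an all-NaN group sums to 0.0
per_household_default = toy.groupby("serial")["hi_hsb1"].sum()

# min_count=1 requires at least one non-missing value, otherwise the result is NaN
per_household_strict = toy.groupby("serial")["hi_hsb1"].sum(min_count=1)

# Broadcast the household-level values back onto every row
toy["hi_hsb_default"] = toy["serial"].map(per_household_default)
toy["hi_hsb_strict"] = toy["serial"].map(per_household_strict)

print(toy)
```

With the default, a household with no recorded value ends up with 0.0, which looks the same as a recorded 0. Whether that matters depends on how the later steps of the analysis use ```hi_hsb```.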


In [31]:
df3[df3["serial"]==3136]

Unnamed: 0,year,serial,hhweight,pernum,perweight,sampweight,age,famsize,fml,nwhite,hi,yedu,empl,hlth,inc,marradult,adltempl,hi_hsb
2279,2009,3136,5200,2,6520,8336.0,24,2,1,0,1,14,1,5,61102.973,1,2,0.0
