# Lab 6: Mastering Metrics Recreation

In this lab, we'll recreate the findings from this [paper](http://assets.press.princeton.edu/chapters/s10363.pdf). The original analysis was done in a statistical software package called STATA. You can view the original STATA code [here](http://www.masteringmetrics.com/wp-content/uploads/2020/04/NHIS2009_hicompare_v2.do). In this notebook, we'll do the same analysis using ```python``` and ```pandas``` instead. 


We'll be using data from the 2009 [National Health Interview Survey (NHIS)](https://www.cdc.gov/nchs/nhis/about_nhis.htm),  an annual survey of the U.S. population with detailed information on health and health insurance. Among many other things, the NHIS asks: 
*“Would you say your health in general is excellent, very good, good, fair, or poor?”* 

The answers to this question are coded as an index that assigns 5 to excellent health and 1 to poor health. Our sample consists of married 2009 NHIS respondents who may or may not be insured. In this notebook, we'll look at how summary statistics differ between those who have insurance and those who do not. 

Run the next cell to import the libraries we'll be using to do our analysis.

In [194]:
import pandas as pd
import numpy as np
pd.set_option('max_colwidth', None) # remove for actual version
pd.set_option('display.max_columns', None) # remove for actual ver

## Load Data
Load the NHIS_four_drop.csv file into a ```pandas``` dataframe called ```full_NHIS_data```.

Replace the ```...``` with the correct ```pandas``` function to read in the data.
 * If you're stuck, try reading the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)

In [195]:
# Run this cell to load our data
#data_string = "NHIS_four_drop.csv" 
#full_NHIS_data = ...
#full_NHIS_data

#####

data_string = "NHIS_four_drop.csv" 
full_NHIS_data = pd.read_csv(data_string)
full_NHIS_data

Unnamed: 0,year,serial,hhweight,pernum,perweight,sampweight,age,famsize,fml,nwhite,hi,yedu,empl,hlth,inc,marradult,adltempl,hi_hsb1
0,2009,3,7871,1,8938,22029.0,29,4,1,0,0,14,0,4,19282.932,1,1,
1,2009,3,7871,4,8967,,35,4,0,0,0,11,1,4,19282.932,1,1,0.0
2,2009,5,7871,1,8905,,32,4,0,0,1,12,1,3,167844.530,1,2,1.0
3,2009,5,7871,2,8889,22190.0,34,4,1,0,1,16,1,3,167844.530,1,2,
4,2009,6,7871,1,8378,19284.0,65,2,0,0,1,14,0,3,41679.344,1,1,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29729,2009,41173,2220,2,2496,5732.0,57,2,1,0,1,12,0,4,167844.530,1,1,
29730,2009,41175,2624,1,3135,7200.0,67,2,1,0,1,14,0,3,61102.973,1,0,
29731,2009,41175,2624,2,3022,,68,2,0,0,1,14,0,2,61102.973,1,0,1.0
29732,2009,41176,2200,1,2532,18062.0,62,7,0,0,0,9,1,1,167844.530,1,2,0.0


Looking at the data, is there anything that looks wrong? Try running the next cell if you are unsure.

In [196]:
full_NHIS_data[["sampweight","hi_hsb1"]].head() # selects the first 5 rows of the sampweight and hi_hsb1 columns

Unnamed: 0,sampweight,hi_hsb1
0,22029.0,
1,,0.0
2,,1.0
3,22190.0,
4,19284.0,1.0


If we look at the ```hi_hsb1``` and ```sampweight``` columns, we see that some values are NaN. NaN stands for "Not a Number" and is how ```pandas``` marks a missing value. This means that there are missing values that we will have to fill before we can do our analysis.
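
If you'd like to count the missing values rather than spot them by eye, here is a minimal sketch. It uses a small toy dataframe (the name ```toy``` is made up for this example, with values copied from the preview above), not the full NHIS data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two columns inspected above (values copied from the preview).
toy = pd.DataFrame({
    "sampweight": [22029.0, np.nan, np.nan, 22190.0, 19284.0],
    "hi_hsb1": [np.nan, 0.0, 1.0, np.nan, 1.0],
})

# isna() marks each missing cell as True; sum() then counts the True values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

The same ```isna().sum()``` pattern should work on ```full_NHIS_data``` if you want a count for every column.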

## Cleaning the Data

In this section, we will be cleaning and selecting the data in preparation for the final output table. The goal of this cleaning process is to generate missing values to fill in the ```hi_hsb1``` column.

In this next cell, we'll be using ```groupby``` and ```sum``` to generate some missing values. These operations are similar to the ```group``` function in the ```datascience``` package.

The cell below does the following:
* ```groupby``` and ```sum``` create one row per ```serial``` by adding up each column within a household

We're using the ```serial``` column as the value to create groupings. While ```serial``` is not unique to each record, it is unique to each household. By running ```groupby``` on the ```serial``` column, we are generating values for missing rows based on other members of the same household.
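
To see why summing within a household fills in the gaps, here is a small sketch on a toy dataframe (the name ```toy``` and its four rows are assumptions made for illustration). By default, ```sum``` skips NaN values, so a household with one recorded value and one missing value ends up with the recorded value.

```python
import numpy as np
import pandas as pd

# Two toy households (serial 3 and 5), each with one missing hi_hsb1 value.
toy = pd.DataFrame({
    "serial": [3, 3, 5, 5],
    "hi_hsb1": [np.nan, 0.0, 1.0, np.nan],
})

# groupby("serial") collects the rows of each household;
# sum() adds up each column within the household, skipping NaN by default.
toy_grouped = toy.groupby("serial").sum()
print(toy_grouped)
```

Notice that the result has one row per household and that ```serial``` has become the index.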

In [197]:
# Run this cell to begin the cleaning process
grouped_NHIS = full_NHIS_data.groupby("serial").sum()
grouped_NHIS.head()

serial,year,hhweight,pernum,perweight,sampweight,age,famsize,fml,nwhite,hi,yedu,empl,hlth,inc,marradult,adltempl,hi_hsb1
3,4018,15742,5,17905,22029.0,64,8,1,0,0,25,1,8,38565.864,2,2,0.0
5,4018,15742,3,17794,22190.0,66,8,1,0,2,28,2,6,335689.06,2,4,1.0
6,4018,15742,3,17393,19284.0,131,4,1,0,2,28,1,6,83358.688,2,2,1.0
7,4018,15742,3,16299,21587.0,123,4,1,0,2,24,1,3,83358.688,2,2,1.0
8,4018,17316,3,18563,0.0,112,8,1,0,2,30,1,8,335689.06,2,2,1.0


We'll continue below by resetting the index of the dataframe. (The index is the far left column.) If you look at the dataframe above, you will notice that the index is ```serial``` (this is because we used ```serial``` in ```groupby```) and not the numbers 0, 1, 2, 3, 4, $\dots$  

In order to be able to match this dataframe up with our original dataframe in the next step, we need to use ```reset_index``` to have ```serial``` be returned to just being a regular column. This also resets the index to be the default numbers 0, 1, 2, 3, 4, $\dots$

We'll also select just the ```serial``` and ```hi_hsb1``` columns. This is because we need ```serial``` to match records up with the original dataframe in the next step. We need ```hi_hsb1``` because this is the column whose missing values we are generating.

The cell below does the following:
* ```reset_index``` removes ```serial``` as the index of the dataframe 
* ```[["serial","hi_hsb1"]]``` selects the ```serial``` and ```hi_hsb1``` columns
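
The two steps in this list can be sketched on a toy grouped dataframe. The name ```toy_grouped``` and its values are made up for illustration; only ```reset_index``` and the double-bracket column selection come from the lab.

```python
import pandas as pd

# Toy grouped dataframe whose index is named "serial", like the output of groupby.
toy_grouped = pd.DataFrame(
    {"hi_hsb1": [0.0, 1.0], "age": [64, 66]},
    index=pd.Index([3, 5], name="serial"),
)

# reset_index() turns "serial" back into a regular column and restores the 0, 1, 2, ... index.
# The double brackets then keep only the two columns we need.
toy_values = toy_grouped.reset_index()[["serial", "hi_hsb1"]]
print(toy_values)
```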

In [198]:
# Run this cell
generated_values = grouped_NHIS.reset_index()[["serial","hi_hsb1"]]
generated_values.head()

Unnamed: 0,serial,hi_hsb1
0,3,0.0
1,5,1.0
2,6,1.0
3,7,1.0
4,8,1.0


In the next cell, we'll be using ```merge```. This is similar to ```join``` in the ```datascience``` package.

The next cell does the following:
* ```merge``` joins the dataframe we made in the last cell onto the main dataframe using ```serial``` as a key.

The following cell adds the ```hi_hsb1``` column we generated in the last cell to our original dataframe. We are doing what is called a left join, which returns all of the rows in the "left" dataframe (```full_NHIS_data```, our original dataframe) AND the matching rows from our "right" dataframe (```generated_values```, the dataframe we created that contains the values we generated). We're using ```serial``` as a key because this is the common column between both dataframes. The key helps the ```merge``` operation match rows correctly.
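
Here is a minimal sketch of a left join on two toy dataframes (```left``` and ```right``` are hypothetical names chosen for this example). Because both dataframes have a ```hi_hsb1``` column that is not the key, ```pandas``` keeps both and adds the default suffixes ```_x``` (left) and ```_y``` (right).

```python
import numpy as np
import pandas as pd

# "left" plays the role of the original data: one row per person.
left = pd.DataFrame({
    "serial": [3, 3, 5],
    "pernum": [1, 4, 1],
    "hi_hsb1": [np.nan, 0.0, 1.0],
})

# "right" plays the role of the generated values: one row per household.
right = pd.DataFrame({
    "serial": [3, 5],
    "hi_hsb1": [0.0, 1.0],
})

# A left join keeps every row of `left` and attaches the matching row of `right`.
toy_merged = left.merge(right, how="left", on="serial")
print(toy_merged)
```

Every person in the same household receives the same ```hi_hsb1_y``` value, which is how the missing entries get filled.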

In [199]:
# Run this cell to continue the data cleaning process
merged_NHIS = full_NHIS_data.merge(generated_values,how="left",on="serial")
merged_NHIS.head()

Unnamed: 0,year,serial,hhweight,pernum,perweight,sampweight,age,famsize,fml,nwhite,hi,yedu,empl,hlth,inc,marradult,adltempl,hi_hsb1_x,hi_hsb1_y
0,2009,3,7871,1,8938,22029.0,29,4,1,0,0,14,0,4,19282.932,1,1,,0.0
1,2009,3,7871,4,8967,,35,4,0,0,0,11,1,4,19282.932,1,1,0.0,0.0
2,2009,5,7871,1,8905,,32,4,0,0,1,12,1,3,167844.53,1,2,1.0,1.0
3,2009,5,7871,2,8889,22190.0,34,4,1,0,1,16,1,3,167844.53,1,2,,1.0
4,2009,6,7871,1,8378,19284.0,65,2,0,0,1,14,0,3,41679.344,1,1,1.0,1.0


After performing the merge in the previous cell, you can see that we now have two versions of the column: the original ```hi_hsb1``` column, now labeled ```hi_hsb1_x```, and ```hi_hsb1_y```, which has the values we generated earlier to fill in the missing ones. Because we no longer need the old values, we will drop the ```hi_hsb1_x``` column. We will also rename the ```hi_hsb1_y``` column to ```hi_hsb``` for clarity.

The next cell does the following:
* ```drop``` removes an extra column ```hi_hsb1_x```
* ```rename``` renames the column ```hi_hsb1_y``` to ```hi_hsb``` 

In [200]:
# Run this cell
clean_NHIS = merged_NHIS.drop("hi_hsb1_x",axis=1).rename(columns={"hi_hsb1_y":"hi_hsb"})
clean_NHIS.head()

Unnamed: 0,year,serial,hhweight,pernum,perweight,sampweight,age,famsize,fml,nwhite,hi,yedu,empl,hlth,inc,marradult,adltempl,hi_hsb
0,2009,3,7871,1,8938,22029.0,29,4,1,0,0,14,0,4,19282.932,1,1,0.0
1,2009,3,7871,4,8967,,35,4,0,0,0,11,1,4,19282.932,1,1,0.0
2,2009,5,7871,1,8905,,32,4,0,0,1,12,1,3,167844.53,1,2,1.0
3,2009,5,7871,2,8889,22190.0,34,4,1,0,1,16,1,3,167844.53,1,2,1.0
4,2009,6,7871,1,8378,19284.0,65,2,0,0,1,14,0,3,41679.344,1,1,1.0


## Selecting the Data

In this section, we'll be selecting the data to use in the final output.

Here's a quick refresher of selecting using ```pandas```:

### Example 1:
Selects rows with a ```serial``` of 7372

In [201]:
full_NHIS_data[full_NHIS_data["serial"]==7372]

Unnamed: 0,year,serial,hhweight,pernum,perweight,sampweight,age,famsize,fml,nwhite,hi,yedu,empl,hlth,inc,marradult,adltempl,hi_hsb1
5392,2009,7372,1421,1,2330,,51,4,0,0,0,14,1,3,41679.344,1,2,0.0
5393,2009,7372,1421,2,1855,9373.0,47,4,1,0,0,9,1,4,41679.344,1,2,


### Example 2:
Selects rows with a ```yedu``` larger than or equal to 16

In [237]:
full_NHIS_data[full_NHIS_data["yedu"]>=16]

Unnamed: 0,year,serial,hhweight,pernum,perweight,sampweight,age,famsize,fml,nwhite,hi,yedu,empl,hlth,inc,marradult,adltempl,hi_hsb1
3,2009,5,7871,2,8889,22190.0,34,4,1,0,1,16,1,3,167844.53,1,2,
8,2009,8,8658,1,8817,,60,4,0,0,1,16,0,4,167844.53,1,1,1.0
10,2009,9,12833,1,15011,,61,2,1,0,1,16,0,5,167844.53,1,1,
11,2009,9,12833,2,13578,37863.0,62,2,0,0,1,16,1,3,167844.53,1,1,1.0
14,2009,17,7871,1,10226,,49,3,0,0,1,16,1,4,167844.53,1,1,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29700,2009,41115,2220,1,2582,,40,4,0,0,1,16,1,4,167844.53,1,2,1.0
29701,2009,41115,2220,2,2499,,36,4,1,0,1,16,1,4,167844.53,1,2,
29710,2009,41140,829,1,888,2427.0,36,4,1,0,1,18,1,4,167844.53,1,2,
29711,2009,41140,829,2,962,,36,4,0,0,1,16,1,4,167844.53,1,2,1.0


### Example 3:
Selects rows with a ```yedu``` greater than or equal to 15 **AND** ```yedu``` less than or equal to 17

In [203]:
full_NHIS_data[(full_NHIS_data["yedu"]>=15) & (full_NHIS_data["yedu"]<=17)] 

Unnamed: 0,year,serial,hhweight,pernum,perweight,sampweight,age,famsize,fml,nwhite,hi,yedu,empl,hlth,inc,marradult,adltempl,hi_hsb1
3,2009,5,7871,2,8889,22190.0,34,4,1,0,1,16,1,3,167844.530,1,2,
8,2009,8,8658,1,8817,,60,4,0,0,1,16,0,4,167844.530,1,1,1.0
10,2009,9,12833,1,15011,,61,2,1,0,1,16,0,5,167844.530,1,1,
11,2009,9,12833,2,13578,37863.0,62,2,0,0,1,16,1,3,167844.530,1,1,1.0
14,2009,17,7871,1,10226,,49,3,0,0,1,16,1,4,167844.530,1,1,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29694,2009,41109,2775,1,4001,,26,2,1,0,1,16,1,4,61102.973,1,2,
29695,2009,41109,2775,2,3854,9646.0,26,2,0,0,1,16,1,4,61102.973,1,2,1.0
29700,2009,41115,2220,1,2582,,40,4,0,0,1,16,1,4,167844.530,1,2,1.0
29701,2009,41115,2220,2,2499,,36,4,1,0,1,16,1,4,167844.530,1,2,


### Example 4

Selects all the females in the dataframe.

In [204]:
full_NHIS_data[full_NHIS_data["fml"] == 1]

Unnamed: 0,year,serial,hhweight,pernum,perweight,sampweight,age,famsize,fml,nwhite,hi,yedu,empl,hlth,inc,marradult,adltempl,hi_hsb1
0,2009,3,7871,1,8938,22029.0,29,4,1,0,0,14,0,4,19282.932,1,1,
3,2009,5,7871,2,8889,22190.0,34,4,1,0,1,16,1,3,167844.530,1,2,
5,2009,6,7871,2,9015,,66,2,1,0,1,14,1,3,41679.344,1,1,
7,2009,7,7871,2,8558,,59,2,1,0,1,12,1,2,41679.344,1,1,
9,2009,8,8658,2,9746,,52,4,1,0,1,14,1,4,167844.530,1,1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29725,2009,41166,2371,2,2566,,62,3,1,0,1,14,0,3,85985.780,1,1,
29726,2009,41168,953,1,1189,3852.0,19,2,1,1,0,12,0,4,19282.932,1,1,
29729,2009,41173,2220,2,2496,5732.0,57,2,1,0,1,12,0,4,167844.530,1,1,
29730,2009,41175,2624,1,3135,7200.0,67,2,1,0,1,14,0,3,61102.973,1,0,


Select data where ```age``` is greater than or equal to 26 **AND** ```age``` is less than or equal to 59 **AND** ```marradult``` is equal to 1 **AND** ```adltempl``` is greater than or equal to 1. Assign each ```...``` to the corresponding selection from the ```clean_NHIS``` dataframe.
* Hint: Take a look at the cells above for some pointers.


Python Operators:
* Equals: ```==```
* Greater than: ```>```
* Greater than or equal to: ```>=```
* Less than: ```<```
* Less than or equal to: ```<=```
* And: ```&```
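
As a quick illustration of how these operators combine, here is a sketch on a toy dataframe (the name ```toy``` and its values are made up). Each comparison produces a column of True/False values, and ```&``` keeps the rows where both sides are True. When comparisons are written on one line, each needs its own parentheses, because ```&``` is evaluated before ```>=``` and ```<=```.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [25, 30, 59, 60],
    "marradult": [1, 1, 0, 1],
})

# Parentheses around each comparison are required when combining them inline with &.
in_age_range = (toy["age"] >= 26) & (toy["age"] <= 59)

# A comparison stored in its own variable can be combined without extra parentheses.
married = toy["marradult"] == 1

toy_selected = toy[in_age_range & married]
print(toy_selected)
```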

In [205]:
# Replace each ... with the corresponding selection from the clean_NHIS dataframe. The first one is done for you.

#age_greater_than_or_equal_to_26 = clean_NHIS["age"] >= 26
#age_less_than_or_equal_to_59 = ...
#marradult_equal_to_1 = ...
#adltempl_is_greater_than_or_equal_to_1 = ...

#data_selected = clean_NHIS[age_greater_than_or_equal_to_26 & 
#                    age_less_than_or_equal_to_59 & 
#                    marradult_equal_to_1 & 
#                    adltempl_is_greater_than_or_equal_to_1]
#data_selected 

##############

age_greater_than_or_equal_to_26 = clean_NHIS["age"] >= 26
age_less_than_or_equal_to_59 = clean_NHIS["age"]<= 59
marradult_equal_to_1 = clean_NHIS["marradult"] == 1
adltempl_is_greater_than_or_equal_to_1 = clean_NHIS["adltempl"] >= 1

data_selected = clean_NHIS[age_greater_than_or_equal_to_26 & 
                    age_less_than_or_equal_to_59 & 
                    marradult_equal_to_1 & 
                    adltempl_is_greater_than_or_equal_to_1]
data_selected 

Unnamed: 0,year,serial,hhweight,pernum,perweight,sampweight,age,famsize,fml,nwhite,hi,yedu,empl,hlth,inc,marradult,adltempl,hi_hsb
0,2009,3,7871,1,8938,22029.0,29,4,1,0,0,14,0,4,19282.932,1,1,0.0
1,2009,3,7871,4,8967,,35,4,0,0,0,11,1,4,19282.932,1,1,0.0
2,2009,5,7871,1,8905,,32,4,0,0,1,12,1,3,167844.530,1,2,1.0
3,2009,5,7871,2,8889,22190.0,34,4,1,0,1,16,1,3,167844.530,1,2,1.0
7,2009,7,7871,2,8558,,59,2,1,0,1,12,1,2,41679.344,1,1,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29722,2009,41164,928,1,1860,,29,3,0,1,0,14,1,5,19282.932,1,2,0.0
29723,2009,41164,928,2,1499,,28,3,1,1,0,14,1,3,19282.932,1,2,0.0
29728,2009,41173,2220,1,2654,,57,2,0,0,1,12,1,4,167844.530,1,1,1.0
29729,2009,41173,2220,2,2496,5732.0,57,2,1,0,1,12,0,4,167844.530,1,1,1.0


Run the next cell to remove single-person households. ```groupby``` groups the rows by ```serial``` value, and ```filter(lambda x: len(x)>1)``` keeps only the groups that have more than one record per serial number. Don't worry if it takes a couple of seconds for the cell to run.
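
To see what ```filter``` does on a small scale, here is a sketch with a toy dataframe (```toy``` and its serial numbers are assumptions for illustration). The household with serial 7 has only one record, so all of its rows are dropped; the other households are kept whole.

```python
import pandas as pd

toy = pd.DataFrame({
    "serial": [3, 3, 7, 10, 10],
    "pernum": [1, 4, 2, 1, 2],
})

# filter() calls the function once per group and keeps the group's rows
# only when the function returns True.
toy_households = toy.groupby("serial").filter(lambda group: len(group) > 1)
print(toy_households)
```

Unlike ```sum```, ```filter``` returns the original rows with their original index, not one row per group.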

In [206]:
# Run this cell to remove single person households
households_only = data_selected.groupby(["serial"]).filter(lambda x: len(x)>1)
households_only

Unnamed: 0,year,serial,hhweight,pernum,perweight,sampweight,age,famsize,fml,nwhite,hi,yedu,empl,hlth,inc,marradult,adltempl,hi_hsb
0,2009,3,7871,1,8938,22029.0,29,4,1,0,0,14,0,4,19282.932,1,1,0.0
1,2009,3,7871,4,8967,,35,4,0,0,0,11,1,4,19282.932,1,1,0.0
2,2009,5,7871,1,8905,,32,4,0,0,1,12,1,3,167844.530,1,2,1.0
3,2009,5,7871,2,8889,22190.0,34,4,1,0,1,16,1,3,167844.530,1,2,1.0
12,2009,10,7871,1,9587,24220.0,45,2,0,0,1,12,1,4,85985.780,1,2,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29721,2009,41161,1647,2,1776,,51,2,1,0,1,14,1,3,61102.973,1,2,1.0
29722,2009,41164,928,1,1860,,29,3,0,1,0,14,1,5,19282.932,1,2,0.0
29723,2009,41164,928,2,1499,,28,3,1,1,0,14,1,3,19282.932,1,2,0.0
29728,2009,41173,2220,1,2654,,57,2,0,0,1,12,1,4,167844.530,1,1,1.0


## Processing and Formatting the Data

In this section, we'll process the data and format it like it is presented in the original paper.

In order to calculate different statistics from the dataframe, we need a way to select a specific column. We also need different operations that we can perform on specific columns. Here's a quick refresher of some of these operations in ```pandas```.

### Examples

Let's select the ```hlth``` column from our original dataframe. This is similar to ```Table.column``` in the ```datascience``` package.

In [207]:
full_NHIS_data["hlth"]

0        4
1        4
2        3
3        3
4        3
        ..
29729    4
29730    3
29731    2
29732    1
29733    2
Name: hlth, Length: 29734, dtype: int64

As you can see, the column is displayed a little differently than dataframes normally are. A single dataframe column is an object called a Series. You can check out the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) if you want to learn more.
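
The difference between single and double brackets is easy to mix up, so here is a small sketch using a toy dataframe (```toy``` is a made-up name). Single brackets with one column name return a Series; double brackets return a dataframe, even when only one column is listed.

```python
import pandas as pd

toy = pd.DataFrame({"hlth": [4, 4, 3], "age": [29, 35, 32]})

one_column = toy["hlth"]          # single brackets: a Series
one_column_frame = toy[["hlth"]]  # double brackets: a one-column DataFrame

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
```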

Series are useful because they allow us to easily perform operations on whole columns. Let's see this in action by taking the mean of the ```hlth``` column of our original dataframe.

In [208]:
# Takes the mean of the hlth column
full_NHIS_data["hlth"].mean()

3.7598035918477164

Let's take a look at some other operations. All of them follow the same format of selecting a column and using an operator to calculate a statistic.

In [209]:
# Computes the standard deviation
full_NHIS_data["hlth"].std()

1.042508178620819

In [210]:
# Computes the difference in standard errors
full_NHIS_data["hlth"].sem() - full_NHIS_data["famsize"].sem()

-0.0020669293020563447
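
If you're curious what ```sem``` computes, here is a sketch on toy data (all names and numbers below are made up). For a Series with no missing values, ```sem``` matches the standard deviation divided by the square root of the number of rows. The last lines show, for comparison only, a common textbook formula for the standard error of a difference between the means of two independent samples. That formula is not what the cells in this lab compute; the lab subtracts one standard error from the other.

```python
import numpy as np
import pandas as pd

values = pd.Series([4, 4, 3, 3, 3, 5, 2, 4])

# Standard error of the mean by hand: sample standard deviation / sqrt(n).
manual_sem = values.std() / np.sqrt(len(values))
print(np.isclose(manual_sem, values.sem()))

# Textbook standard error of a difference in means for two independent samples:
# square each group's standard error, add them, and take the square root.
group_a = pd.Series([4, 5, 3, 4, 4, 5])
group_b = pd.Series([3, 4, 2, 3, 5, 3])
se_of_difference = np.sqrt(group_a.sem() ** 2 + group_b.sem() ** 2)
print(se_of_difference)
```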

In [211]:
# Counts the number of rows in a dataframe
len(full_NHIS_data)

29734

### Processing the Data

In this section, we'll be calculating the statistics for our table. 



#### Calculating Statistics for the Health Index ```hlth```

Since our main goal is to compare those individuals with health insurance and those without health insurance, we'll need to do two selections. 

In the next cell, select:
* Females with health insurance
* Females without health insurance

from the ```households_only``` dataframe.

Hint: Health insurance is recorded in the ```hi``` column. Individuals *with* health insurance are labeled with the value of 1. Individuals *without* health insurance are labeled with the value of 0.

In [212]:
# Replace the ... with the correct selection using the households_only dataframe

#all_female = ...
#health_insurance = ...
#no_health_insurance = ...

#female_health_insurance = households_only[all_female & health_insurance]
#female_no_health_insurance = households_only[all_female & no_health_insurance]

######################

all_female = households_only["fml"] == 1
health_insurance = households_only["hi"] == 1
no_health_insurance = households_only["hi"] == 0

female_health_insurance = households_only[(all_female) & (health_insurance)]
female_no_health_insurance = households_only[all_female & no_health_insurance]

Using the operators we described earlier, calculate the following statistics. Don't worry about the ```np.round()``` function being used. This function helps round the statistics you calculate so they fit better in the final table. If you want to learn more about ```np.round```, check out the documentation [here](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.round.html#numpy.ndarray.round).
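
For a quick feel of what ```np.round``` does, here is a tiny sketch with a made-up number (```raw_mean``` is not a value from the data).

```python
import numpy as np

raw_mean = 3.98765  # made-up value for illustration

# decimals=2 keeps two digits after the decimal point.
rounded_mean = np.round(raw_mean, decimals=2)
print(rounded_mean)
```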

In [213]:
# Select the hlth column of female_health_insurance and calculate the mean
#female_health_mean_insurance = ...
#female_health_mean_insurance = np.round(female_health_mean_insurance, decimals = 2)
#female_health_mean_insurance

#########

female_health_mean_insurance = female_health_insurance["hlth"].mean()
female_health_mean_insurance = np.round(female_health_mean_insurance, decimals = 2)
female_health_mean_insurance

3.99

In [214]:
# Select the hlth column of female_no_health_insurance and calculate the mean
#female_health_mean_no_insurance = ...
#female_health_mean_no_insurance = np.round(female_health_mean_no_insurance, decimals = 2)
#female_health_mean_no_insurance

########

female_health_mean_no_insurance = female_no_health_insurance["hlth"].mean()
female_health_mean_no_insurance = np.round(female_health_mean_no_insurance, decimals = 2)
female_health_mean_no_insurance

3.61

In [215]:
# Compute the difference between female_health_mean_insurance and female_health_mean_no_insurance
#female_health_difference = ...
#female_health_difference = np.round(female_health_difference, decimals = 2)
#female_health_difference

############

female_health_difference = female_health_mean_insurance - female_health_mean_no_insurance
female_health_difference = np.round(female_health_difference, decimals = 2)
female_health_difference

0.38

In [216]:
# Compute the standard deviation of the hlth column of the female_health_insurance dataframe
#female_health_insurance_sd = ...
#female_health_insurance_sd = np.round(female_health_insurance_sd, decimals = 2)
#female_health_insurance_sd

##########

female_health_insurance_sd = female_health_insurance["hlth"].std()
female_health_insurance_sd = np.round(female_health_insurance_sd, decimals = 2)
female_health_insurance_sd

0.93

In [217]:
# Compute the standard deviation of the hlth column of the female_no_health_insurance dataframe
#female_health_no_insurance_sd = ...
#female_health_no_insurance_sd = np.round(female_health_no_insurance_sd, decimals = 2)
#female_health_no_insurance_sd

##########

female_health_no_insurance_sd = female_no_health_insurance["hlth"].std()
female_health_no_insurance_sd = np.round(female_health_no_insurance_sd, decimals = 2)
female_health_no_insurance_sd

1.02

In [218]:
# Compute the difference in standard error for the hlth column of female_no_health_insurance and the 
# hlth column of female_health_insurance
#female_health_insurance_se = ...
#female_health_insurance_se

female_health_insurance_se = female_no_health_insurance["hlth"].sem() - female_health_insurance["hlth"].sem()
female_health_insurance_se = np.round(female_health_insurance_se, decimals = 2)
female_health_insurance_se

0.02

#### Calculating Statistics for the Remaining Columns

Good job computing the statistics for the ```hlth``` column. To help speed up the rest of the columns, write a function that computes the statistics for the remaining columns (```nwhite```, ```age```, ```yedu```, ```famsize```, ```empl```, ```inc```). The code you fill in should resemble the code you wrote for computing the statistics for the ```hlth``` column above. The first one is done for you.

In [219]:
# Student Version

def calculate_stats(health_insurance, no_health_insurance):
    """
    Calculates mean, difference, and standard error for a list of columns
    
    Parameters:
    health_insurance -- A pandas dataframe
    no_health_insurance -- A pandas dataframe
    """
    data_list = []
    # computes the values for all of the other columns
    for column in ["nwhite", "age", "yedu", "famsize", "empl", "inc"]:
        
        # This one is done for you: it calculates the 
        # mean for the variable column in the health_insurance dataframe
        mean_health_insurance = health_insurance[column].mean()
        
        # Replace the ... with an expression that calculates the 
        # mean for the variable column in the no_health_insurance dataframe
        mean_point_no_health_insurance = ...
        
        # Replace the ... with an expression that calculates the 
        # difference in means between the mean_health_insurance and mean_point_no_health_insurance
        data_difference = ...
        
        # Replace the ... with an expression that calculates the 
        # standard error between the no_health_insurance and health_insurance dataframes for the variable column
        data_difference_se = ...
        
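        # These lines round the values you calculated and add them to a list for later
        data_list.append(np.round(mean_health_insurance, decimals=2))
        data_list.append(np.round(mean_point_no_health_insurance, decimals=2))
        data_list.append(np.round(data_difference, decimals=2))
        data_list.append(np.round(data_difference_se, decimals=2))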
        
    return data_list

In [220]:
def calculate_stats(health_insurance, no_health_insurance):
    """
    Calculates mean, difference, and standard error for a list of columns
    
    Parameters:
    health_insurance -- A pandas dataframe
    no_health_insurance -- A pandas dataframe
    """
    data_list = []
    # computes the values for all of the other columns
    for column in ["nwhite", "age", "yedu", "famsize", "empl", "inc"]:
        
        # Replace the ... with an expression that calculates the 
        # mean for the variable column in the health_insurance dataframe
        mean_health_insurance = health_insurance[column].mean()
        
        # Replace the ... with an expression that calculates the 
        # mean for the variable column in the no_health_insurance dataframe
        mean_point_no_health_insurance = no_health_insurance[column].mean()
        
        # Replace the ... with an expression that calculates the 
        # difference in means between the mean_health_insurance and mean_point_no_health_insurance
        data_difference = mean_health_insurance - mean_point_no_health_insurance
        
        # Replace the ... with an expression that calculates the 
        # standard error between the no_health_insurance and health_insurance dataframes for the variable column
        data_difference_se = no_health_insurance[column].sem() - health_insurance[column].sem()
        
        # These lines add the values you calculated to a list for later
        
        data_list.append(np.round(mean_health_insurance, decimals=2))
        data_list.append(np.round(mean_point_no_health_insurance,decimals=2))
        data_list.append(np.round(data_difference,decimals=2))
        data_list.append(np.round(data_difference_se,decimals=2))
        
    return data_list

Now, use the function you just wrote to calculate the remaining statistics for ```female_health_insurance``` and ```female_no_health_insurance```. 

In [221]:
# Replace the ... with a call to calculate_stats
# using the female_health_insurance and female_no_health_insurance dataframes

#female_stats = ...
#female_stats

#########

female_stats = calculate_stats(female_health_insurance, female_no_health_insurance)
female_stats

[0.2,
 0.18,
 0.02,
 0.01,
 42.15,
 39.52,
 2.63,
 0.12,
 14.27,
 11.36,
 2.91,
 0.06,
 3.55,
 4.07,
 -0.52,
 0.03,
 0.76,
 0.54,
 0.22,
 0.01,
 103363.63,
 43641.39,
 59722.24,
 307.41]
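
The list above is flat: it holds four numbers per column, in the order the function appended them. As an optional sketch, the values can be reshaped into a labeled table so that each number is easier to find. The names ```flat_stats```, ```stats_table```, and the four column labels are made up for this example; the numbers are copied from the output above.

```python
import numpy as np
import pandas as pd

columns = ["nwhite", "age", "yedu", "famsize", "empl", "inc"]

# Values copied from the female_stats output above, four per column.
flat_stats = [0.2, 0.18, 0.02, 0.01,
              42.15, 39.52, 2.63, 0.12,
              14.27, 11.36, 2.91, 0.06,
              3.55, 4.07, -0.52, 0.03,
              0.76, 0.54, 0.22, 0.01,
              103363.63, 43641.39, 59722.24, 307.41]

# reshape(-1, 4) turns the flat list into rows of four values, one row per column name.
stats_table = pd.DataFrame(
    np.array(flat_stats).reshape(-1, 4),
    index=columns,
    columns=["mean_hi", "mean_no_hi", "difference", "se"],
)
print(stats_table)
```

This also explains the indices used later when formatting the table: the entries for the column at position ```i``` start at index ```4 * i``` in the flat list.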

In [222]:
# Replace the ... with an expression that finds the number of rows in female_health_insurance
#female_health_insurance_len = ...
#female_health_insurance_len

#########
female_health_insurance_len = len(female_health_insurance)
female_health_insurance_len

7950

In [223]:
# Replace the ... with an expression that finds the number of rows in female_no_health_insurance
#female_no_health_insurance_len = ...
#female_no_health_insurance_len

#########
female_no_health_insurance_len = len(female_no_health_insurance)
female_no_health_insurance_len

1445

You have now calculated all of the statistics for the ```female_no_health_insurance``` and ```female_health_insurance``` dataframes. Let's format the table in a way that is easier to understand. Run the next cell to format the values you calculated into a readable format.

In [224]:
print("Wives")
data_frame = pd.DataFrame(data={' ': ["Health index", "Nonwhite","Age","Education","Family Size","Employed","Family Income","Sample Size"],
                   'Some HI (1)': ["{0} [{1}]".format(female_health_mean_insurance,female_health_insurance_sd), female_stats[0],female_stats[4],female_stats[8],female_stats[12],female_stats[16],female_stats[20],female_health_insurance_len],
                  'No HI (2)': ["{0} [{1}]".format(female_health_mean_no_insurance,female_health_no_insurance_sd), female_stats[1],female_stats[5],female_stats[9],female_stats[13],female_stats[17],female_stats[21],female_no_health_insurance_len],
                  'Difference (3)': ["{0} ({1})".format(female_health_difference,female_health_insurance_se), "{0} ({1})".format(female_stats[2],female_stats[3]),"{0} ({1})".format(female_stats[6],female_stats[7]),"{0} ({1})".format(female_stats[10],female_stats[11]),"{0} ({1})".format(female_stats[14],female_stats[15]),"{0} ({1})".format(female_stats[18],female_stats[19]),"{0} ({1})".format(female_stats[22],female_stats[23])," "]})
display(data_frame)
print("""Notes: This table reports average characteristics for insured and uninsured married couples in the
          2009 National Health Interview Survey (NHIS). Columns (1), (2), (4), and (5) show average characteristics
          of the group of individuals specified by the column heading. Columns (3) and (6) report the difference
          between the average characteristic for individuals with and without health insurance (HI).
          Standard deviations are in brackets; standard errors are reported in parentheses.""")

Wives


Unnamed: 0,Unnamed: 1,Some HI (1),No HI (2),Difference (3)
0,Health index,3.99 [0.93],3.61 [1.02],0.38 (0.02)
1,Nonwhite,0.2,0.18,0.02 (0.01)
2,Age,42.15,39.52,2.63 (0.12)
3,Education,14.27,11.36,2.91 (0.06)
4,Family Size,3.55,4.07,-0.52 (0.03)
5,Employed,0.76,0.54,0.22 (0.01)
6,Family Income,103364,43641.4,59722.24 (307.41)
7,Sample Size,7950,1445,


Notes: This table reports average characteristics for insured and uninsured married couples in the
          2009 National Health Interview Survey (NHIS). Columns (1), (2), (4), and (5) show average characteristics
          of the group of individuals specified by the column heading. Columns (3) and (6) report the difference
          between the average characteristic for individuals with and without health insurance (HI).
          Standard deviations are in brackets; standard errors are reported in parentheses.


#### Calculating Statistics for Males

Now it is time to do the same thing for males in the dataframe.

Since our main goal is to compare those individuals with health insurance and those without health insurance, we'll need to do two selections. 

In the next cell, select:
* Males with health insurance
* Males without health insurance

from the ```households_only``` dataframe.

Hint: Health insurance is recorded in the ```hi``` column. Individuals *with* health insurance are labeled with the value of 1. Individuals *without* health insurance are labeled with the value of 0.

In [225]:
# Replace the ... with the correct selection using the households_only dataframe

#all_male = ...
#health_insurance = ...
#no_health_insurance = ...

#male_health_insurance = households_only[all_male & health_insurance]
#male_no_health_insurance = households_only[all_male & no_health_insurance]

######################

all_male = households_only["fml"] == 0
health_insurance = households_only["hi"] == 1
no_health_insurance = households_only["hi"] == 0

male_health_insurance = households_only[(all_male) & (health_insurance)]
male_no_health_insurance = households_only[all_male & no_health_insurance]

Now fill in each ```...``` to complete each of the expressions. Your code should be similar to the code you wrote in the previous section.

In [226]:
# Select the hlth column of male_health_insurance and calculate the mean
#male_health_mean_insurance = ...
#male_health_mean_insurance = np.round(male_health_mean_insurance, decimals = 2)
#male_health_mean_insurance

#########

male_health_mean_insurance = male_health_insurance["hlth"].mean()
male_health_mean_insurance = np.round(male_health_mean_insurance, decimals = 2)
male_health_mean_insurance

3.98

In [227]:
# Select the hlth column of male_no_health_insurance and calculate the mean
#male_health_mean_no_insurance = ...
#male_health_mean_no_insurance = np.round(male_health_mean_no_insurance, decimals = 2)
#male_health_mean_no_insurance

########

male_health_mean_no_insurance = male_no_health_insurance["hlth"].mean()
male_health_mean_no_insurance = np.round(male_health_mean_no_insurance, decimals = 2)
male_health_mean_no_insurance

3.7

In [228]:
# Compute the difference between male_health_mean_insurance and male_health_mean_no_insurance
#male_health_difference = ...
#male_health_difference = np.round(male_health_difference, decimals = 2)
#male_health_difference

############

male_health_difference = male_health_mean_insurance - male_health_mean_no_insurance
male_health_difference = np.round(male_health_difference, decimals = 2)
male_health_difference

0.28

In [229]:
# Compute the standard deviation of the hlth column of the male_health_insurance dataframe
#male_health_insurance_sd = ...
#male_health_insurance_sd = np.round(male_health_insurance_sd, decimals = 2)
#male_health_insurance_sd

##########

male_health_insurance_sd = male_health_insurance["hlth"].std()
male_health_insurance_sd = np.round(male_health_insurance_sd, decimals = 2)
male_health_insurance_sd

0.93

In [230]:
# Compute the standard deviation of the hlth column of the male_no_health_insurance dataframe
#male_health_no_insurance_sd = ...
#male_health_no_insurance_sd = np.round(male_health_no_insurance_sd, decimals = 2)
#male_health_no_insurance_sd

##########

male_health_no_insurance_sd = male_no_health_insurance["hlth"].std()
male_health_no_insurance_sd = np.round(male_health_no_insurance_sd, decimals = 2)
male_health_no_insurance_sd

1.01
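
One detail worth knowing when comparing these numbers with other tools: the ```pandas``` `.std()` method computes the sample standard deviation (it divides by n - 1), while `np.std` divides by n unless you pass `ddof=1`. The sketch below shows the difference on a few made-up health index values; `scores` is not part of the lab data.

```python
import numpy as np
import pandas as pd

scores = pd.Series([5, 4, 4, 3, 2])  # made-up health index values

sample_sd = scores.std()                         # pandas default: ddof=1
population_sd = np.std(scores.to_numpy())        # NumPy default: ddof=0
matched_sd = np.std(scores.to_numpy(), ddof=1)   # NumPy, matching pandas

print(round(sample_sd, 4), round(population_sd, 4), round(matched_sd, 4))
```

With thousands of rows, as in the NHIS groups, the two versions are nearly identical, but on small samples the gap is visible.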

In [231]:
# Compute the difference in standard error for the hlth column of male_no_health_insurance and the 
# hlth column of male_health_insurance
#male_health_insurance_se = ...
#male_health_insurance_se

male_health_insurance_se = male_no_health_insurance["hlth"].sem() - male_health_insurance["hlth"].sem()
male_health_insurance_se = np.round(male_health_insurance_se, decimals = 2)
male_health_insurance_se

0.02
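
The cell above subtracts one group's standard error from the other's. For comparison, a commonly used estimate of the standard error of a difference between two independent group means combines the two standard errors as $\sqrt{SE_1^2 + SE_2^2}$. Below is a minimal sketch of that formula on made-up data; `se_of_difference`, `insured`, and `uninsured` are hypothetical names and are not part of the lab.

```python
import numpy as np
import pandas as pd

def se_of_difference(a, b):
    """Standard error of mean(a) - mean(b), assuming independent samples."""
    return np.sqrt(a.sem() ** 2 + b.sem() ** 2)

# Made-up health index values for two small groups
insured = pd.Series([5, 4, 4, 3, 5, 4])
uninsured = pd.Series([4, 3, 2, 4, 3])

se_diff = se_of_difference(insured, uninsured)
print(insured.sem(), uninsured.sem(), se_diff)
```

Under this formula the result is never smaller than either group's own standard error, so it can differ from the subtraction used above.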

Since you already defined ```calculate_stats```, there is no need to rewrite it. Call it with the new dataframes.

In [232]:
# Replace the ... with a call to calculate_stats
# using the male_health_insurance and male_no_health_insurance dataframes

#male_stats = ...
#male_stats

#########

male_stats = calculate_stats(male_health_insurance, male_no_health_insurance)
male_stats

[0.2,
 0.19,
 0.01,
 0.01,
 44.16,
 41.27,
 2.89,
 0.12,
 14.13,
 11.21,
 2.92,
 0.06,
 3.55,
 4.06,
 -0.51,
 0.02,
 0.92,
 0.85,
 0.07,
 0.01,
 104002.44,
 43636.02,
 60366.41,
 294.68]
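
A flat list of 24 numbers is hard to read. Judging from how the table cell further down indexes `male_stats`, the list appears to hold four values per variable (insured mean, uninsured mean, difference, standard error) for six variables. The sketch below reshapes the values copied from the output above into a labeled grid under that assumed layout; `stats_grid` and the row and column labels are illustrative names.

```python
import numpy as np
import pandas as pd

# Values copied from the male_stats output above
male_stats = [0.2, 0.19, 0.01, 0.01, 44.16, 41.27, 2.89, 0.12,
              14.13, 11.21, 2.92, 0.06, 3.55, 4.06, -0.51, 0.02,
              0.92, 0.85, 0.07, 0.01, 104002.44, 43636.02, 60366.41, 294.68]

# Assumed layout: 6 variables x 4 statistics, in this order
variables = ["Nonwhite", "Age", "Education", "Family Size", "Employed", "Family Income"]
statistics = ["Some HI", "No HI", "Difference", "SE"]

stats_grid = pd.DataFrame(np.array(male_stats).reshape(6, 4),
                          index=variables, columns=statistics)
print(stats_grid)
```

Looking up `stats_grid.loc["Age", "Difference"]` is then less error-prone than remembering that the age difference sits at `male_stats[6]`.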

In [233]:
# Replace the ... with an expression that finds the number of rows in male_health_insurance
#male_health_insurance_len = ...
#male_health_insurance_len

#########
male_health_insurance_len = len(male_health_insurance)
male_health_insurance_len

7866

In [234]:
# Replace the ... with an expression that finds the number of rows in male_no_health_insurance
#male_no_health_insurance_len = ...
#male_no_health_insurance_len

#########
male_no_health_insurance_len = len(male_no_health_insurance)
male_no_health_insurance_len

1529

In [235]:
print("Husbands")
data_frame = pd.DataFrame(data={' ': ["Health index", "Nonwhite","Age","Education","Family Size","Employed","Family Income","Sample Size"],
                   'Some HI (1)': ["{0} [{1}]".format(male_health_mean_insurance,male_health_insurance_sd), male_stats[0],male_stats[4],male_stats[8],male_stats[12],male_stats[16],male_stats[20],male_health_insurance_len],
                  'No HI (0)': ["{0} [{1}]".format(male_health_mean_no_insurance,male_health_no_insurance_sd), male_stats[1],male_stats[5],male_stats[9],male_stats[13],male_stats[17],male_stats[21],male_no_health_insurance_len],
                  'Difference (3)': ["{0} ({1})".format(male_health_difference,male_health_insurance_se), "{0} ({1})".format(male_stats[2],male_stats[3]),"{0} ({1})".format(male_stats[6],male_stats[7]),"{0} ({1})".format(male_stats[10],male_stats[11]),"{0} ({1})".format(male_stats[14],male_stats[15]),"{0} ({1})".format(male_stats[18],male_stats[19]),"{0} ({1})".format(male_stats[22],male_stats[23])," "]})
display(data_frame)
print("""Notes: This table reports average characteristics for insured and uninsured married couples in the
          2009 National Health Interview Survey (NHIS). Columns (1), (2), (4), and (5) show average characteristics
          of the group of individuals specified by the column heading. Columns (3) and (6) report the difference
          between the average characteristic for individuals with and without health insurance (HI).
          Standard deviations are in brackets; standard errors are reported in parentheses.""")

Husbands


Unnamed: 0,Unnamed: 1,Some HI (1),No HI (0),Difference (3)
0,Health index,3.98 [0.93],3.7 [1.01],0.28 (0.02)
1,Nonwhite,0.2,0.19,0.01 (0.01)
2,Age,44.16,41.27,2.89 (0.12)
3,Education,14.13,11.21,2.92 (0.06)
4,Family Size,3.55,4.06,-0.51 (0.02)
5,Employed,0.92,0.85,0.07 (0.01)
6,Family Income,104002,43636,60366.41 (294.68)
7,Sample Size,7866,1529,


Notes: This table reports average characteristics for insured and uninsured married couples in the
          2009 National Health Interview Survey (NHIS). Columns (1), (2), (4), and (5) show average characteristics
          of the group of individuals specified by the column heading. Columns (3) and (6) report the difference
          between the average characteristic for individuals with and without health insurance (HI).
          Standard deviations are in brackets; standard errors are reported in parentheses.


So what did we just create? The table above has three columns: Some HI (1), No HI (0), and Difference (3). The first column describes husbands who have some health insurance. The second column describes husbands who have no health insurance. The third column is the difference between the two groups. 

It might be tempting to use these comparisons as evidence of certain causal effects. More often than not, however, such comparisons are misleading. Once again the problem is other things equal, or the lack thereof. Comparisons of people with and without health insurance are not apples to apples; such contrasts are apples to oranges, or worse.

Many of the differences in the table are large (for example, a nearly 3-year schooling gap); most are statistically precise enough to rule out the hypothesis that these discrepancies are merely chance findings. It won’t surprise you to learn that most variables tabulated here are highly correlated with health as well as with health insurance status. More-educated people, for example, tend to be healthier as well as being overrepresented in the insured group. This may be because more-educated people exercise more, smoke less, and are more likely to wear seat belts. It stands to reason that the difference in health between insured and uninsured NHIS respondents at least partly reflects the extra schooling of the insured.
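
To see how schooling alone could produce a health gap, here is a small sketch on made-up data. The `toy` dataframe is constructed so that health depends only on education (`educ` is a hypothetical column, not one from the NHIS file), yet the insured group still looks healthier because it contains more highly educated people.

```python
import pandas as pd

# Made-up data: health is 4 for "high" education and 3 for "low",
# regardless of insurance status
toy = pd.DataFrame({
    "hi":   [1, 1, 1, 1, 0, 0, 0, 0],
    "educ": ["high", "high", "high", "low", "high", "low", "low", "low"],
    "hlth": [4, 4, 4, 3, 4, 3, 3, 3],
})

# Raw comparison: insured vs. uninsured, ignoring education
raw_means = toy.groupby("hi")["hlth"].mean()
raw_gap = raw_means.loc[1] - raw_means.loc[0]

# Same comparison within each education group
# (rows: education group, columns: hi = 0 or 1)
by_educ = toy.groupby(["educ", "hi"])["hlth"].mean().unstack("hi")
within_gap = by_educ[1] - by_educ[0]

print("Raw gap:", raw_gap)
print(within_gap)
```

In this toy example the raw gap is 0.5, but it vanishes once we compare people with the same education. The real NHIS data will not be this clean; the point is only that a raw difference in means can reflect who is in each group rather than what insurance does.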

Congratulations! You have successfully recreated tables from a paper!