# Missing Values Exercises

One of the defining features of `pandas` is the use of indices for data alignment. Like many features in `pandas`, it can make life very easy, but if you aren't careful, it can also lead to problems. This is especially true because indices lead to behavior that is very different from what one sees in other languages and libraries (like `R`, `numpy`, and `julia`). So let's spend a little time practicing interacting with indices (and missing values)!
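
As a small illustration of what index alignment means, here is a sketch using two made-up Series (the names `a` and `b` and their values are hypothetical, not part of the ACS data). Arithmetic matches values on index labels rather than on position, and labels that appear in only one Series produce missing values.

```python
import pandas as pd

# Two toy Series whose indices only partly overlap (hypothetical data)
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Addition aligns on index labels, not on position.
# Labels found in only one Series ("x" and "w") become NaN.
total = a + b
print(total)
```

This is one common way missing values sneak into an analysis without any being present in the inputs.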

### Exercise 1

Today, we will be using the ACS data we used during our first `pandas` exercise to examine the US income distribution, and how it varies by race. Note that because the US income distribution has a very small number of people with *extremely* high incomes, and the ACS is just a sample of Americans, the far right tail of the distribution will not be very well estimated. However, this data should suffice for helping to understand income inequality in the United States.

To begin, load the ACS Data we used in our first pandas exercise. That [data can be found here](https://github.com/nickeubank/MIDS_Data/tree/master/US_AmericanCommunitySurvey). We'll be working with `US_ACS_2017_10pct_sample.dta`. 

In [1]:
import pandas as pd
import numpy as np
# Download the data
acs = pd.read_stata("https://github.com/nickeubank/MIDS_Data/raw/master/US_AmericanCommunitySurvey/US_ACS_2017_10pct_sample.dta")

### Exercise 2

Let's begin by calculating the mean US income from this data (recall that income is stored in the `inctot` variable).

In [2]:
acs.inctot.mean()

1723646.2703978634

### Exercise 3

Hmmm... That doesn't look right. The average American is definitely not earning 1.7 million dollars a year. Let's look at the values of `inctot` using `value_counts()`. Do you see a problem?

Now use `value_counts()` with the argument `normalize=True` to see proportions of the sample that report each value instead of the count of people in each category. What percentage of our sample has an income of 9,999,999? What percentage has an income of 0?

In [3]:
acs.inctot.value_counts(normalize=True)

9999999    0.168967
0          0.105575
30000      0.014978
50000      0.013837
40000      0.013834
20000      0.012749
60000      0.011342
12000      0.011113
25000      0.010552
10000      0.009179
35000      0.009179
15000      0.008859
45000      0.008185
24000      0.007605
70000      0.006915
18000      0.006664
100000     0.006226
80000      0.006222
55000      0.006044
5000       0.005928
65000      0.005918
6000       0.005627
32000      0.005204
75000      0.005000
36000      0.004994
22000      0.004909
8000       0.004871
28000      0.004790
3000       0.004759
9000       0.004759
             ...   
11370      0.000003
40060      0.000003
23004      0.000003
77130      0.000003
32096      0.000003
59850      0.000003
204300     0.000003
142890     0.000003
92410      0.000003
48760      0.000003
61550      0.000003
18910      0.000003
117590     0.000003
72180      0.000003
38780      0.000003
67150      0.000003
34430      0.000003
243480     0.000003
150430     0.000003
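
To make the difference between counts and proportions concrete, here is a minimal sketch on a made-up Series (`s` and its values are hypothetical). With `normalize=True`, the results are shares of the sample that sum to 1.

```python
import pandas as pd

# Hypothetical income values, including the suspicious 9999999
s = pd.Series([0, 0, 9999999, 30000])

counts = s.value_counts()                 # number of rows per value
shares = s.value_counts(normalize=True)   # fraction of rows per value

print(counts)
print(shares)
```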


### Exercise 4

As we discussed before, the ACS uses a value of 9999999 to denote that income information is not available for someone. The problem with using this kind of "sentinel value" is that pandas doesn't understand that this is supposed to denote missing data, and so when it averages the variable, it doesn't know to ignore 9999999. 

To help out `pandas`, use the `replace` command to replace all values of 9999999 with `np.nan`. 

In [4]:
acs['inctot'] = acs['inctot'].replace(9999999, np.nan)
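
To see the effect on a small scale, here is a sketch with a made-up Series (the name `toy_income` and its values are hypothetical). The sentinel dominates the mean until it is replaced with `np.nan`.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes; 9999999 is the "not available" sentinel
toy_income = pd.Series([30000, 50000, 9999999, 40000])

# The sentinel is treated as a real income, inflating the mean
mean_with_sentinel = toy_income.mean()

# After replacement, pandas recognizes the value as missing
toy_clean = toy_income.replace(9999999, np.nan)
mean_clean = toy_clean.mean()

print(mean_with_sentinel, mean_clean)
```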

### Exercise 5

Now that we've properly labeled our missing data as `np.nan`, let's calculate the average US income once more. 

In [5]:
acs.inctot.mean()

40890.177564946454

### Exercise 6

OK, now we've been able to get a reasonable average income number. As we can see, a major advantage of using `np.nan` is that `pandas` knows that `np.nan` observations should just be ignored when we are calculating means. 
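
This skipping behavior is controlled by the `skipna` argument of `mean()`, which defaults to `True`. A minimal sketch on made-up data (the Series `s` is hypothetical):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 20.0, np.nan, 30.0])

default_mean = s.mean()              # NaN is skipped: (10 + 20 + 30) / 3
strict_mean = s.mean(skipna=False)   # NaN propagates, so the result is NaN
n_observed = s.count()               # counts only the non-missing values

print(default_mean, strict_mean, n_observed)
```

Comparing `count()` with `len()` is a quick way to check how many observations a mean is actually based on.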

But it's not enough to just get rid of the people who had `inctot` values of 9999999. We also need to know why those values were missing. Suppose, for example, that the value of 9999999 was used for anyone who made more than 100,000 dollars: if we just dropped those people, then our estimate of average income wouldn't mean much, would it?

So let's make sure we understand *why* data is missing for some people. If you recall from our last exercise, it seemed to be the case that most of the people who had incomes of 9999999 were children. Let's make sure that's true by looking at the distribution of the variable `age` for people for whom `inctot` is missing (i.e. subset the data to people with `inctot` missing, then look at the values of `age` with `value_counts()`).

Then do the opposite: look at the distribution of the `age` variable for people for whom `inctot` is *not* missing.

Can you determine when 9999999 was being used? Is it OK that we're excluding those people from our analysis?

Note: In this data, Python doesn't understand `age` is a number; it thinks it is a string because the original data has categories like "90 (90+ in 1980 and 1990)" and "less than 1 year old". So you can't just use `min()` or `max()`. We'll discuss converting string variables into numbers in a future class.

In [6]:
acs.loc[pd.isnull(acs.inctot), 'age'].value_counts()

10                           3997
9                            3977
14                           3847
12                           3845
13                           3800
11                           3791
8                            3648
7                            3527
6                            3524
5                            3512
2                            3405
1                            3340
4                            3318
3                            3220
less than 1 year old         3150
30                              0
45                              0
44                              0
43                              0
42                              0
41                              0
40                              0
39                              0
38                              0
37                              0
36                              0
35                              0
34                              0
33                              0
32            

In [7]:
acs.loc[pd.notnull(acs.inctot), 'age'].value_counts()

60                           4950
54                           4821
56                           4776
59                           4776
58                           4734
57                           4720
55                           4693
61                           4644
62                           4614
53                           4600
18                           4496
63                           4488
52                           4418
65                           4362
19                           4342
64                           4287
50                           4272
47                           4256
66                           4106
16                           4106
46                           4064
67                           4055
17                           4021
51                           4021
20                           3992
48                           3956
70                           3953
68                           3951
34                           3942
15            
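
The zero counts in the first output above are consistent with `age` being stored as a pandas categorical, since `value_counts()` on a categorical reports every category, including those absent from the subset. That is an inference from the output rather than something confirmed here. A sketch on made-up data (the frame `toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": pd.Categorical(["less than 1 year old", "5", "30", "45"]),
    "inctot": [np.nan, np.nan, 30000.0, 50000.0],
})

# Ages of the rows where income is missing
missing_ages = toy.loc[toy["inctot"].isna(), "age"].value_counts()

# Categories absent from the subset still appear, with a count of 0
print(missing_ages)

# Keep only the ages that actually occur in the subset
observed_only = missing_ages[missing_ages > 0]
print(observed_only)
```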

### Exercise 7

Great, so now we know why those people had missing data, and we're ok with excluding them. 

But as we previously noted, there are also a lot of observations of zero income in our data, and it's not clear that everyone with zero income *should* be included in this average, since those may be people who are retired or in school.

Let's limit our attention to people who are currently working. We can do this using `empstat`. Remember you can use `value_counts()` to see what values of `empstat` are in the data!

In [8]:
acs.empstat.value_counts()

employed              148758
not in labor force    104676
n/a                    57843
unemployed              7727
Name: empstat, dtype: int64

In [9]:
acs.loc[(acs.empstat == "employed"), 'inctot'].mean()

57854.723914007984

### Exercise 8

Now let's estimate the racial income gap in the United States. What is the average salary for employed Black Americans, and what is the average salary for employed White Americans? In percentage terms, how much more does the average White American make than the average Black American?

**Note:** These values are not quite accurate estimates. As we'll discuss in later lessons, to get completely accurate estimates from the ACS we have to take into account how people were selected to be interviewed. But you get pretty good estimates in most cases even without weights -- your estimate of the racial wage gap without weights is within 5% of the corrected value.

**Note:** This is actually an underestimate of the wage gap. The US Census treats Hispanic respondents as a sub-category of "white", so in pooling what we traditionally think of as "white" respondents with Hispanic respondents (who tend to earn less), we get an underestimate of the average white salary in the US.

In [33]:
white = acs.loc[(acs['empstat'] == 'employed') & (acs['race'] == 'white'), 'inctot'].mean()

In [34]:
acs['race'].value_counts()

white                               243751
black/african american/negro         31691
other asian or pacific islander      12508
other race, nec                      12304
two major races                       8826
chinese                               4313
american indian or alaska native      3595
three or more major races             1207
japanese                               809
Name: race, dtype: int64

In [35]:
black = acs.loc[(acs['empstat'] == 'employed') & (acs['race'] == 'black/african american/negro'), 'inctot'].mean()

In [36]:
(white - black) / black

0.44852990062751974
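
The per-group means can also be computed in one pass with `groupby` instead of one subset per group. A sketch on made-up data (the frame `toy`, the column `group` standing in for `race`, and all numbers are hypothetical and are not ACS estimates):

```python
import pandas as pd

toy = pd.DataFrame({
    "empstat": ["employed", "employed", "employed", "employed", "unemployed"],
    "group": ["a", "a", "b", "b", "a"],
    "inctot": [60000.0, 40000.0, 30000.0, 50000.0, 0.0],
})

# Restrict to employed respondents, then average income within each group
employed = toy[toy["empstat"] == "employed"]
means = employed.groupby("group")["inctot"].mean()

# Relative gap: how much more group "a" earns than group "b", as a fraction
gap = (means["a"] - means["b"]) / means["b"]
print(means)
print(gap)
```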

In [37]:
# Now the exact estimates taking into account sampling weights

# Subset for ease of notation
white_employed = acs[(acs['empstat'] == 'employed') & (acs['race'] == 'white')]
white_weighted = (white_employed['inctot'] * white_employed['perwt'] / white_employed['perwt'].sum()).sum()

In [38]:
# Subset for ease of notation
black_employed = acs[(acs['empstat'] == 'employed') & (acs['race'] == 'black/african american/negro')]
black_weighted = (black_employed['inctot'] * black_employed['perwt'] / black_employed['perwt'].sum()).sum()

In [39]:
((white-black) - (white_weighted - black_weighted)) / (white_weighted - black_weighted)

0.04431967593018212
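
The weighted means above take each income times its weight, divided by the sum of the weights. As a sketch, `np.average` with its `weights` argument computes the same quantity; the frame `toy` and its values are made up. Note that `np.average` does not skip `NaN`, so rows with missing income would need to be dropped first.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "inctot": [20000.0, 40000.0, 60000.0],
    "perwt": [1.0, 1.0, 2.0],   # hypothetical person weights
})

# Manual weighted mean, mirroring the cells above
manual = (toy["inctot"] * toy["perwt"] / toy["perwt"].sum()).sum()

# Equivalent call using NumPy
with_numpy = np.average(toy["inctot"], weights=toy["perwt"])

print(manual, with_numpy)
```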