# Week 3 Quiz


**Note**: 

> This exercise has been written out in something called a Jupyter Notebook. We'll discuss Jupyter Notebooks in more detail later in the course (they are a very powerful tool for data science communication!), but for the time being, the notebook is just a convenient way for us to write out the exercise. You don't need to *do* anything with the notebook except read its contents; just write your Python code in a regular `.py` file.


As previously discussed, we frequently use matrices in data science because they are a natural format for representing data generated by collecting the same type of information from numerous entities. For example, below is a toy dataset that you could imagine was created by collecting information about employees at a company: each column is a different type of information being collected (income, age, years of education), and each row is the information about a different employee.

Below, you will be asked to answer a number of questions about this toy dataset. As with other exercises in this class, you will find the directions for this graded exercise here, but please submit your actual answers in the graded quiz.

In [54]:
import numpy as np

incomes = [22_000, 65_000, 19_000, 110_000, 14_000, 12_000, 35_000]
ages = [20, 35, 55, 35, 21, 19, 42]
years_of_education = [12, 16, 11, 22, 12, 8, 12]

survey = np.array([incomes, ages, years_of_education]).T
survey


array([[ 22000,     20,     12],
       [ 65000,     35,     16],
       [ 19000,     55,     11],
       [110000,     35,     22],
       [ 14000,     21,     12],
       [ 12000,     19,      8],
       [ 35000,     42,     12]])
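
As a quick illustration of the row/column layout described above, the sketch below rebuilds the same toy matrix and pulls out one column and one row. The names `ages_column` and `first_employee` are introduced here for illustration only.

```python
import numpy as np

incomes = [22_000, 65_000, 19_000, 110_000, 14_000, 12_000, 35_000]
ages = [20, 35, 55, 35, 21, 19, 42]
years_of_education = [12, 16, 11, 22, 12, 8, 12]
survey = np.array([incomes, ages, years_of_education]).T

# Each column holds one type of information: column 1 is age.
ages_column = survey[:, 1]

# Each row holds one employee: row 0 is the first employee.
first_employee = survey[0, :]

print(survey.shape)      # (7, 3): 7 employees, 3 variables
print(ages_column)
print(first_employee)
```

Slicing with `[:, k]` selects a variable across all employees, while `[k, :]` selects every variable for one employee.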

## Part 1: Summarizing Data

1. What is the average (mean) age of all employees (rounded to 1 decimal place)?
2. What is the average (mean) income of employees over 30 (rounded to 1 decimal place)?
3. What is the average (mean) number of years of education for employees with incomes above the average income of all employees (rounded to 1 decimal place)?

In [55]:
mean_age = np.mean(survey[:, 1])
print(round(mean_age, 1))

32.4


In [56]:
above_30 = survey[survey[:, 1] > 30]
above_30

array([[ 65000,     35,     16],
       [ 19000,     55,     11],
       [110000,     35,     22],
       [ 35000,     42,     12]])

In [57]:
mean_incomes_30 = np.mean(above_30[:, 0])
print(round(mean_incomes_30, 1))

57250.0


In [58]:
mean_incomes = np.mean(survey[:, 0])
above_mean_incomes = survey[survey[:, 0] > mean_incomes]
above_mean_incomes

array([[ 65000,     35,     16],
       [110000,     35,     22]])

In [59]:
mean_years_education_above_mean_incomes = np.mean(above_mean_incomes[:, 2])
print(round(mean_years_education_above_mean_incomes, 1))

19.0



## Part 2: Editing Data

The US government is thinking about offering a \$1,500 tax credit to anyone making less than \$20,000 a year. You can think of this tax credit as effectively an additional \$1,500 of income to each person receiving the credit.

4. Using the data from `survey`, modify income values to calculate a new estimate of the employees' incomes after we take this credit into account. What will the average income be for all employees if those making less than \$20,000 a year were to receive this credit (rounded to 1 decimal place)?
    - Do so by subsetting and editing values programmatically, *not* just typing values by hand. (Yes, writing out a new vector by hand is easy to do in this example, but you couldn't do it with a large, real dataset!)
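
One thing to watch for when editing an array in place in a notebook: running the same cell twice applies the edit twice to any value that still meets the condition. The sketch below uses a made-up vector (`toy_incomes`, not the survey data) and a hypothetical helper `apply_credit` to show the effect and one way to guard against it, namely computing the result from an untouched copy.

```python
import numpy as np

toy_incomes = np.array([22_000, 19_000, 14_000])  # made-up values

def apply_credit(values, threshold=20_000, credit=1_500):
    """Return a new array with the credit added to values below threshold."""
    result = values.copy()              # leave the input untouched
    result[values < threshold] += credit
    return result

once = apply_credit(toy_incomes)
again = apply_credit(toy_incomes)       # same input, same answer

# Feeding the output back in mimics re-running an in-place edit:
twice = apply_credit(once)

print(once)    # [22000 20500 15500]
print(twice)   # [22000 20500 17000]
```

In `twice`, the lowest income has received the credit two times, which is not what the question asks for.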

In [60]:
survey

array([[ 22000,     20,     12],
       [ 65000,     35,     16],
       [ 19000,     55,     11],
       [110000,     35,     22],
       [ 14000,     21,     12],
       [ 12000,     19,      8],
       [ 35000,     42,     12]])

In [72]:
adjusted = survey.copy()  # copy, so re-running this cell cannot add the credit twice
adjusted[adjusted[:, 0] < 20_000, 0] += 1_500
adjusted

array([[ 22000,     20,     12],
       [ 65000,     35,     16],
       [ 20500,     55,     11],
       [110000,     35,     22],
       [ 15500,     21,     12],
       [ 13500,     19,      8],
       [ 35000,     42,     12]])

In [73]:
update_mean_incomes = np.mean(adjusted[:, 0])
print(round(update_mean_incomes, 1))

40214.3


## Part 3: Measuring Income Inequality (with Real Data!)

In this exercise, we'll be working with data from the [US Current Population Survey, provided by the National Bureau of Economic Research (NBER)](https://www.nber.org/research/data/current-population-survey-cps-merged-outgoing-rotation-group-earnings-data). This is a regular survey conducted for the US Bureau of Labor Statistics and used, among other things, to calculate the US unemployment rate.

In this exercise, we'll use this data to study gender and racial wage inequality in the US.

Load data from the 2018 CPS survey with the following command:

```python
cps = np.loadtxt("data/cps.txt")
```

This data is a *subset* of the full CPS survey and contains only data on **employed respondents working at least 35 hours a week (i.e., full-time).**


In [74]:
cps = np.loadtxt("data/cps.txt")
cps

array([[9.03000000e+02, 4.00000000e+01, 2.00000000e+00, 2.00000000e+00,
        7.41241160e+03],
       [4.00000000e+02, 4.00000000e+01, 2.00000000e+00, 2.00000000e+00,
        1.62090217e+04],
       [1.25000000e+03, 4.00000000e+01, 2.00000000e+00, 2.00000000e+00,
        6.95413390e+03],
       ...,
       [6.80000000e+02, 4.00000000e+01, 2.00000000e+00, 1.00000000e+00,
        1.17532490e+03],
       [5.42500000e+02, 4.00000000e+01, 2.00000000e+00, 1.00000000e+00,
        1.15256500e+03],
       [3.15000000e+02, 3.50000000e+01, 1.00000000e+00, 1.00000000e+00,
        1.33595210e+03]])

In [75]:
np.shape(cps)

(122603, 5)

5. How many rows does this matrix have?

6. The five columns of this matrix correspond to:
    - Column 1: Weekly income in dollars.
    - Column 2: Usual hours respondent works per week.
    - Column 3: Gender. 2 is "Female", 1 is "Male".
    - Column 4: Race. This can take on a lot of values for those who identify as mixed race, but for simplicity, in this exercise, we'll just focus on a few values. For those interested, the full set of codes can be found on page 19 of the [CPS codebook](https://data.nber.org/morg/docs/cpsx.pdf).
        - 1: White
        - 2: Black
        - 3: American Indian
        - 4: Asian only
        - 5: Hawaiian/Pacific Islander only
    - Column 5: Survey weights.

Note that race does not break out Hispanic / non-Hispanic identities. In US government surveys, Hispanic / non-Hispanic is usually recorded in a separate `ethnicity` variable, so many people who identify as Hispanic are identified as White or Black in the `race` variable.

For the moment, let's ignore survey weights—they don't impact results here significantly.

**What is the *average hourly wage* for all workers in this data (round it to one decimal place)?**

**Hint:** This will require more than just using `mean` on a single column!
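
To see why a single-column `mean` is not enough, note that an hourly wage has to be computed per person (weekly income divided by weekly hours) before averaging. The sketch below uses three made-up rows, not the CPS data; `toy`, `hourly`, and `ratio_of_means` are illustrative names.

```python
import numpy as np

# Made-up rows: [weekly income, usual weekly hours]
toy = np.array([
    [800.0, 40.0],
    [1800.0, 60.0],
    [350.0, 35.0],
])

# Elementwise division gives one hourly wage per person.
hourly = toy[:, 0] / toy[:, 1]
mean_hourly = hourly.mean()

# Dividing the column means answers a different question
# (total pay per total hour worked) and generally gives a different number.
ratio_of_means = toy[:, 0].mean() / toy[:, 1].mean()

print(hourly)
print(mean_hourly, ratio_of_means)
```

The first quantity treats every worker equally; the second gives more influence to people who work more hours.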

In [81]:
average_hours_wage = np.mean(cps[:, 0] / cps[:, 1])
print(round(average_hours_wage, 1))

26.0


7. What is the average hourly wage of working men (rounded to one decimal place)? 

8. What is the average hourly wage of working women (rounded to one decimal place)?

9. What share (i.e., a value between 0 and 1) of men's average hourly wage is women's average hourly wage (rounded to three decimal places)? In other words, what is women's average hourly wage divided by men's average hourly wage? *Don't round anything until you have your final answer!*

Congratulations! You've just calculated the US gender wage gap, on your own, using real data! I mean... I guess "congratulations" is a weird thing to say after directly measuring one of the more egregious inequities in US society, but one of the reasons many of us study data science is so that we will have the ability to directly measure these types of phenomena in the hopes of being able to better understand and address them.

In [84]:
Male_cps = cps[cps[:, 2] == 1]
Male_cps

array([[6.8000000e+02, 4.0000000e+01, 1.0000000e+00, 1.0000000e+00,
        8.2596860e+03],
       [8.0000000e+02, 4.0000000e+01, 1.0000000e+00, 2.0000000e+00,
        7.3649142e+03],
       [1.3461500e+03, 4.0000000e+01, 1.0000000e+00, 1.0000000e+00,
        8.2113853e+03],
       ...,
       [1.0769200e+03, 4.0000000e+01, 1.0000000e+00, 1.0000000e+00,
        1.3363443e+03],
       [1.4423000e+03, 6.0000000e+01, 1.0000000e+00, 1.0000000e+00,
        1.3460681e+03],
       [3.1500000e+02, 3.5000000e+01, 1.0000000e+00, 1.0000000e+00,
        1.3359521e+03]])

In [85]:
average_hours_wage_Male = np.mean(Male_cps[:, 0] / Male_cps[:, 1])
print(round(average_hours_wage_Male, 1))

27.9


In [86]:
Female_cps = cps[cps[:, 2] == 2]
Female_cps

array([[9.03000000e+02, 4.00000000e+01, 2.00000000e+00, 2.00000000e+00,
        7.41241160e+03],
       [4.00000000e+02, 4.00000000e+01, 2.00000000e+00, 2.00000000e+00,
        1.62090217e+04],
       [1.25000000e+03, 4.00000000e+01, 2.00000000e+00, 2.00000000e+00,
        6.95413390e+03],
       ...,
       [1.05769000e+03, 4.00000000e+01, 2.00000000e+00, 1.00000000e+00,
        1.30522680e+03],
       [6.80000000e+02, 4.00000000e+01, 2.00000000e+00, 1.00000000e+00,
        1.17532490e+03],
       [5.42500000e+02, 4.00000000e+01, 2.00000000e+00, 1.00000000e+00,
        1.15256500e+03]])

In [88]:
average_hours_wage_Female = np.mean(Female_cps[:, 0] / Female_cps[:, 1])
print(round(average_hours_wage_Female, 1))

23.8


In [89]:
gap_wages = average_hours_wage_Female / average_hours_wage_Male
print(round(gap_wages, 3))

0.854
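
The subset-then-average steps above repeat for every group, so they can be folded into a small helper. This is only a sketch: `mean_hourly_wage` is a hypothetical name, and the rows below are made up so the block runs without `data/cps.txt`. Column positions follow the layout described earlier (0: weekly income, 1: hours, 2: gender).

```python
import numpy as np

def mean_hourly_wage(data, column, code):
    """Mean of income/hours over rows where data[:, column] == code."""
    group = data[data[:, column] == code]
    return np.mean(group[:, 0] / group[:, 1])

# Made-up rows: [weekly income, hours, gender, race, weight]
toy_cps = np.array([
    [800.0, 40.0, 1.0, 1.0, 1.0],
    [1000.0, 40.0, 1.0, 2.0, 1.0],
    [720.0, 40.0, 2.0, 1.0, 1.0],
    [900.0, 45.0, 2.0, 2.0, 1.0],
])

men = mean_hourly_wage(toy_cps, column=2, code=1)
women = mean_hourly_wage(toy_cps, column=2, code=2)

# Round only at the very end, as the question asks.
print(round(women / men, 3))
```

The same helper would work for any coded column by changing `column` and `code`.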


10.  Now, speaking of egregious inequities, what is the average hourly wage for respondents who identify as Black (rounded to one decimal place)?

11.  What is the average hourly wage for respondents who identify as White (rounded to one decimal place)?

12. What share (i.e., a value between 0 and 1) of the average hourly wage of respondents who identify as White is the average hourly wage of respondents who identify as Black (rounded to three decimal places)? In other words, what is the average hourly wage of respondents who identify as Black divided by the average hourly wage of respondents who identify as White? *Don't round anything until you have your final answer!*

Note that this will only be an approximation: one would normally also include all mixed-race respondents in non-mutually exclusive categories like "Any Part Black" or "Any Part White", and we would also break out Hispanic and non-Hispanic respondents. But as most respondents only pick one racial category, this will still give us a reasonable approximation.

In [90]:
Black_cps = cps[cps[:, 3] == 2]
Black_cps

array([[9.03000000e+02, 4.00000000e+01, 2.00000000e+00, 2.00000000e+00,
        7.41241160e+03],
       [4.00000000e+02, 4.00000000e+01, 2.00000000e+00, 2.00000000e+00,
        1.62090217e+04],
       [1.25000000e+03, 4.00000000e+01, 2.00000000e+00, 2.00000000e+00,
        6.95413390e+03],
       ...,
       [8.40000000e+02, 4.00000000e+01, 1.00000000e+00, 2.00000000e+00,
        1.45730739e+04],
       [7.33200000e+02, 4.00000000e+01, 1.00000000e+00, 2.00000000e+00,
        1.60719168e+04],
       [8.07690000e+02, 6.00000000e+01, 2.00000000e+00, 2.00000000e+00,
        1.25778864e+04]])

In [91]:
average_hours_wage_Black = np.mean(Black_cps[:, 0] / Black_cps[:, 1])
print(round(average_hours_wage_Black, 1))

21.5


In [92]:
White_cps = cps[cps[:, 3] == 1]
White_cps

array([[6.8000000e+02, 4.0000000e+01, 1.0000000e+00, 1.0000000e+00,
        8.2596860e+03],
       [7.5600000e+02, 3.6000000e+01, 2.0000000e+00, 1.0000000e+00,
        7.9495972e+03],
       [1.3461500e+03, 4.0000000e+01, 1.0000000e+00, 1.0000000e+00,
        8.2113853e+03],
       ...,
       [6.8000000e+02, 4.0000000e+01, 2.0000000e+00, 1.0000000e+00,
        1.1753249e+03],
       [5.4250000e+02, 4.0000000e+01, 2.0000000e+00, 1.0000000e+00,
        1.1525650e+03],
       [3.1500000e+02, 3.5000000e+01, 1.0000000e+00, 1.0000000e+00,
        1.3359521e+03]])

In [93]:
average_hours_wage_White = np.mean(White_cps[:, 0] / White_cps[:, 1])
print(round(average_hours_wage_White, 1))

26.4


In [94]:
gap_race_wages = average_hours_wage_Black / average_hours_wage_White
print(round(gap_race_wages, 3))

0.816


## Bonus Question

As noted above, the fifth column of our data contains something called "sampling weights." That's because when the government conducted this survey, they didn't draw a random sample of respondents from the US population where everyone had the same probability of being interviewed. As a result, when we calculate the average hourly wage of the people in the survey, it isn't *quite* the best estimate we have for the average hourly wage for everyone working in the United States. To calculate this number as accurately as possible, we have to take into account the fact that some respondents in the data were more likely to be included than others, and thus can be thought of as standing in for a smaller group of people in the US.

Why would the government do this? The main reason is that if we want to make statements about a group in a survey like this, the accuracy of those statements will depend on the number of individuals who have actually ended up taking the survey. If we are interested in a big group, like White men, we are almost guaranteed to have enough of them in any reasonably sized survey to be able to make accurate statements about that subpopulation. But if we were interested in the life experiences of, say, people in their twenties who have a high school diploma but never attended college, and who live in rural communities, we may have to make a deliberate effort to ensure that there are more of those people in our survey than we would get if we just took a random sample where everyone in the United States had the same probability of being included.

As mentioned earlier, the sampling weights don't make a very big difference to our answers to the questions above, but the way to get the *most accurate* estimates would be to take them into account. So let's give that a try!

When we calculate the average of a variable, we do so by multiplying all the values of the variable by $1/N$ (where $N$ is the number of observations we have) and then adding up those multiplied values.

For a *weighted* average, we take the value for each observation $i$ and multiply it by

$$\frac{\text{weight}_i}{\sum_{j=1}^N \text{weight}_j}$$

where $\text{weight}_i$ is the observation's weight, and $\sum_{j=1}^N \text{weight}_j$ is the total of all the weights in the population being averaged. Then we just add up those values!
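
The formula above can be sketched on a few made-up numbers (not the CPS data). The manual version follows the definition term by term; `np.average` with its `weights` argument is NumPy's built-in equivalent. The names `values` and `weights` are illustrative.

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0])   # made-up observations
weights = np.array([1.0, 1.0, 2.0])     # made-up sampling weights

# Multiply each value by its weight divided by the sum of all weights, then add up.
manual = np.sum(values * weights / np.sum(weights))

# NumPy's built-in weighted average gives the same result.
builtin = np.average(values, weights=weights)

print(manual, builtin, values.mean())
```

Here the third observation counts double, so the weighted average is pulled above the plain mean.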

Given that, what is the average hourly wage of Americans working full-time jobs (i.e., the group in this survey) taking into account survey weights (rounded to **two** decimal places)?
