# Week 3 Quiz


**Note**: 

> This exercise has been written out in something called a Jupyter Notebook. We'll discuss Jupyter Notebooks in more detail later in the course (they are a very powerful tool for data science communication!), but for the time being, the notebook is just a convenient way for us to write out the exercise. You don't need to *do* anything with the notebook except read its contents; just write your Python code in a regular `.py` file.


As previously discussed, we frequently use matrices in data science because they are a natural format for representing data generated by collecting the same type of information from numerous entities. For example, below is a toy dataset that you could imagine was created by collecting information about employees at a company: each column is a different type of information being collected (income, age, years of education), and each row is the information about a different employee.

The questions below ask you about this toy dataset. As with other exercises in this class, you will find the directions for this graded exercise here, but please submit your actual answers in the graded quiz.

In [1]:
import numpy as np
import pandas as pd

incomes = [22_000, 65_000, 19_000, 110_000, 14_000, 12_000, 35_000]
ages = [20, 35, 55, 35, 21, 19, 42]
years_of_education = [12, 16, 11, 22, 12, 8, 12]

survey = np.array([incomes, ages, years_of_education]).T
survey


array([[ 22000,     20,     12],
       [ 65000,     35,     16],
       [ 19000,     55,     11],
       [110000,     35,     22],
       [ 14000,     21,     12],
       [ 12000,     19,      8],
       [ 35000,     42,     12]])

## Part 1: Summarizing Data

1. What is the average (mean) age of all employees (rounded to 1 decimal place)?
2. What is the average (mean) income of employees over 30 (rounded to 1 decimal place)?
3. What is the average (mean) number of years of education for employees with incomes above the average income of all employees (rounded to 1 decimal place)?

In [2]:
# 1. Average age of all employees
average_age = np.mean(survey[:, 1])

# 2. Average income of employees over 30
over_30_incomes = survey[survey[:, 1] > 30, 0]
average_income_over_30 = np.mean(over_30_incomes)

# 3. Average years of education for employees with incomes above the overall average
average_income = np.mean(survey[:, 0])
above_average_income_education = survey[survey[:, 0] > average_income, 2]
average_years_education_above_avg_income = np.mean(above_average_income_education)

# Print results
average_age, average_income_over_30, average_years_education_above_avg_income


(32.42857142857143, 57250.0, 19.0)
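The solutions above rely on boolean masking, which can be easier to follow when the intermediate mask is printed. The sketch below rebuilds the same toy values under a new, illustrative name (`toy`) and walks through question 2 in two steps; it is one way to lay out the logic, not the only one.

```python
import numpy as np

# Same toy values as the survey above: columns are income, age, years of education.
toy = np.array([
    [22_000, 20, 12],
    [65_000, 35, 16],
    [19_000, 55, 11],
    [110_000, 35, 22],
    [14_000, 21, 12],
    [12_000, 19, 8],
    [35_000, 42, 12],
])

# Step 1: comparing a column to a number gives one True/False value per row.
over_30_mask = toy[:, 1] > 30

# Step 2: indexing with the mask keeps only the rows where it is True;
# the trailing 0 then selects the income column of those rows.
incomes_over_30 = toy[over_30_mask, 0]

print(over_30_mask)
print(incomes_over_30)
print(incomes_over_30.mean())
```

Four of the seven employees are over 30, so the mean is taken over four incomes rather than seven.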


## Part 2: Editing Data

The US government is thinking about offering a $1,500 tax credit to anyone making less than $20,000 a year. You can think of this tax credit as effectively an additional $1,500 of income to each person receiving the credit.

4. Using the data from `survey`, modify income values to calculate a new estimate of the employees' incomes after we take this credit into account. What will the average income be for all employees if those making less than $20,000 a year were to receive this credit (rounded to 1 decimal place)?
    - Do so by subsetting and editing values programmatically, *not* just typing values by hand. (Yes, writing out a new vector by hand is easy to do in this example, but you couldn't do it with a large, real dataset!)

In [3]:
# Let’s compute the new average income with the tax credit! 🚀

# Add $1,500 tax credit to those making less than $20,000
survey[survey[:, 0] < 20_000, 0] += 1_500

# Calculate the new average income
new_average_income = np.mean(survey[:, 0])
new_average_income = round(new_average_income, 1)
new_average_income

40214.3
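Note that the cell above edits `survey` in place, so re-running it would apply the credit a second time to anyone still under the threshold. As a hedged alternative, the sketch below uses `np.where` to build a new array and leave the original untouched. The names `incomes` and `incomes_with_credit` are illustrative, and the values are the toy incomes from the start of the exercise.

```python
import numpy as np

incomes = np.array([22_000, 65_000, 19_000, 110_000, 14_000, 12_000, 35_000])

# np.where returns a new array: income + 1,500 where the condition holds,
# and the unchanged income everywhere else.
incomes_with_credit = np.where(incomes < 20_000, incomes + 1_500, incomes)

print(round(float(incomes_with_credit.mean()), 1))
print(incomes[2])  # the original array still holds the pre-credit value
```

Because nothing is modified in place, this version gives the same result no matter how many times it is run.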

## Part 3: Measuring Income Inequality (with Real Data!)

In this exercise, we'll be working with data from the [US Current Population Survey, provided by the National Bureau of Economic Research (NBER)](https://www.nber.org/research/data/current-population-survey-cps-merged-outgoing-rotation-group-earnings-data). This is a regular survey sponsored by the US Bureau of Labor Statistics and used to calculate figures such as the US unemployment rate.

In this exercise, we'll use this data to study gender and racial wage inequality in the US.

Load data from the 2018 CPS survey with the following command:

```python
cps = np.loadtxt("data/cps.txt")
```

This data is a *subset* of the full CPS survey and contains only data on **employed respondents working at least 35 hours a week (i.e., full-time).**


5. How many rows does this matrix have?

6. The five columns of this matrix correspond to:
    - Column 1: Weekly income in dollars.
    - Column 2: Usual hours respondent works per week.
    - Column 3: Gender. 2 is "Female", 1 is "Male"
    - Column 4: Race. This can take on a lot of values for those who identify as mixed race, but for simplicity, in this exercise, we'll just focus on a few values. For those interested, the full set of codes can be found on page 19 of the [CPS codebook](https://data.nber.org/morg/docs/cpsx.pdf).
        - 1: White
        - 2: Black
        - 3: American Indian
        - 4: Asian only
        - 5: Hawaiian/Pacific Islander only
    - Column 5: Survey weights.

Note that race does not break out Hispanic / non-Hispanic identities. In US government surveys, Hispanic / non-Hispanic is usually recorded in a separate `ethnicity` variable, so many people who identify as Hispanic are identified as White or Black in the `race` variable.

For the moment, let's ignore survey weights—they don't impact results here significantly.

**What is the *average hourly wage* for all workers in this data (round it to one decimal place)?**

**Hint:** This will require more than just using `mean` on a single column!

In [4]:
import numpy as np

# Load the CPS data (here cps.txt sits in the working directory; the prompt above uses data/cps.txt)
cps = np.loadtxt("cps.txt")

# 5. Number of rows in the matrix
num_rows = cps.shape[0]

# 6. Calculate the average hourly wage
weekly_income = cps[:, 0]
hours_per_week = cps[:, 1]

# No zero-division guard is applied: this subset only includes respondents working at least 35 hours a week
hourly_wage = weekly_income / hours_per_week
average_hourly_wage = np.mean(hourly_wage)

# Round the result to 1 decimal place
average_hourly_wage = round(average_hourly_wage, 1)

# Print results
print("Number of rows:", num_rows)
print("Average hourly wage:", average_hourly_wage)

Number of rows: 122603
Average hourly wage: 26.0
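The hint matters because averaging each person's hourly wage is not the same as dividing average income by average hours. The sketch below shows the difference on two hypothetical workers (made-up values, not taken from the CPS file).

```python
import numpy as np

# Hypothetical toy values for two workers.
weekly_income = np.array([400.0, 1200.0])
hours = np.array([40.0, 60.0])

# Per-person hourly wage first, then the average across people.
mean_of_wages = np.mean(weekly_income / hours)        # (10 + 20) / 2

# Average income divided by average hours: a different quantity.
ratio_of_means = weekly_income.mean() / hours.mean()  # 800 / 50

print(mean_of_wages)
print(ratio_of_means)
```

The solution above uses the first approach, which treats every worker equally regardless of how many hours they work.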


In [5]:
cps_df = pd.DataFrame(cps, columns=['income','hours','gender','race','weights'])

In [6]:
cps_df

         income  hours  gender  race     weights
0        903.00   40.0     2.0   2.0   7412.4116
1        400.00   40.0     2.0   2.0  16209.0217
2       1250.00   40.0     2.0   2.0   6954.1339
3        680.00   40.0     1.0   1.0   8259.6860
4        800.00   40.0     1.0   2.0   7364.9142
...         ...    ...     ...   ...         ...
122598  1076.92   40.0     1.0   1.0   1336.3443
122599  1442.30   60.0     1.0   1.0   1346.0681
122600   680.00   40.0     2.0   1.0   1175.3249
122601   542.50   40.0     2.0   1.0   1152.5650


In [7]:
cps_df.gender.unique()

array([2., 1.])

7. What is the average hourly wage of working men (rounded to one decimal place)? 

8. What is the average hourly wage of working women (rounded to one decimal place)?

9.  What share (i.e., a value between 0 and 1) of men's average hourly wage is women's average hourly wage (rounded to three decimal places)? In other words, what is women's average hourly wage divided by men's average hourly wage? *Don't round anything until you have your final answer!*

Congratulations! You've just calculated the US gender wage gap, on your own, using real data! I mean... I guess "congratulations" is a weird thing to say after directly measuring one of the more egregious inequities in US society, but one of the reasons many of us study data science is so that we will have the ability to directly measure these types of phenomena in the hopes of being able to better understand and address them.

In [9]:
# 7. Average hourly wage of working men
men_wages = hourly_wage[cps[:, 2] == 1]
average_men_wage = round(np.mean(men_wages), 1)

# 8. Average hourly wage of working women
women_wages = hourly_wage[cps[:, 2] == 2]
average_women_wage = round(np.mean(women_wages), 1)

# 9. Share of men’s average wage that women earn
wage_ratio = np.mean(women_wages) / np.mean(men_wages)
wage_ratio = round(wage_ratio, 3)

# Print results
print("Number of rows:", num_rows)
print("Average hourly wage (all workers):", average_hourly_wage)
print("Average hourly wage (men):", average_men_wage)
print("Average hourly wage (women):", average_women_wage)
print("Share of men’s wage that women earn:", wage_ratio)

Number of rows: 122603
Average hourly wage (all workers): 26.0
Average hourly wage (men): 27.9
Average hourly wage (women): 23.8
Share of men’s wage that women earn: 0.854
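Since the data has also been loaded into a DataFrame, the same group comparison could be written with `groupby`. The sketch below is a hedged alternative on hypothetical rows that reuse the `income`, `hours`, and `gender` column names; the `hourly_wage` column and the name `toy_df` are introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy rows using the same column names as cps_df.
toy_df = pd.DataFrame({
    "income": [800.0, 1200.0, 600.0, 1000.0],
    "hours": [40.0, 40.0, 40.0, 50.0],
    "gender": [1.0, 1.0, 2.0, 2.0],
})
toy_df["hourly_wage"] = toy_df["income"] / toy_df["hours"]

# One mean per gender code (1 = Male, 2 = Female).
mean_wage_by_gender = toy_df.groupby("gender")["hourly_wage"].mean()

# Women's average divided by men's average, left unrounded until the end.
ratio = mean_wage_by_gender.loc[2.0] / mean_wage_by_gender.loc[1.0]

print(mean_wage_by_gender)
print(round(ratio, 3))
```

The boolean-mask version in the solution and this `groupby` version compute the same group means; which one to use is mostly a matter of taste.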


10.  Now, speaking of egregious inequities, what is the average hourly wage for respondents who identify as Black (rounded to one decimal place)?

11.  What is the average hourly wage for respondents who identify as White (rounded to one decimal place)?

12.  What share (i.e., a value between 0 and 1) of the average hourly wage of respondents who identify as White is the average hourly wage of respondents who identify as Black (rounded to three decimal places)? In other words, what is the average hourly wage of respondents who identify as Black divided by the average hourly wage of respondents who identify as White? *Don't round anything until you have your final answer!*

Note that this will only be an approximation: one would normally also include all mixed-race respondents in non-mutually exclusive categories like "Any Part Black" or "Any Part White", and we would also break out Hispanic and non-Hispanic respondents. But as most respondents only pick one racial category, this will still give us a reasonable approximation.

In [10]:
# 10. Average hourly wage for respondents who identify as Black
black_wages = hourly_wage[cps[:, 3] == 2]
average_black_wage = round(np.mean(black_wages), 1)

# 11. Average hourly wage for respondents who identify as White
white_wages = hourly_wage[cps[:, 3] == 1]
average_white_wage = round(np.mean(white_wages), 1)

# 12. Share of White wage that Black respondents earn
racial_wage_ratio = np.mean(black_wages) / np.mean(white_wages)
racial_wage_ratio = round(racial_wage_ratio, 3)

print("Average hourly wage (Black respondents):", average_black_wage)
print("Average hourly wage (White respondents):", average_white_wage)
print("Share of White wage that Black respondents earn:", racial_wage_ratio)

Average hourly wage (Black respondents): 21.5
Average hourly wage (White respondents): 26.4
Share of White wage that Black respondents earn: 0.816


## Bonus Question

As noted above, the fifth column of our data contains something called "sampling weights." That's because when the government conducted this survey, they didn't draw a random sample of respondents from the US population where everyone had the same probability of being interviewed. As a result, when we calculate the average hourly wage of the people in the survey, it isn't *quite* the best estimate we have for the average hourly wage for everyone working in the United States. To calculate this number as accurately as possible, we have to take into account the fact that some respondents in the data were more likely to be included than others, and thus can be thought of as standing in for a smaller group of people in the US.

Why would the government do this? The main reason is that if we want to make statements about a group in a survey like this, the accuracy of those statements will depend on the number of individuals who have actually ended up taking the survey. If we are interested in a big group, like White men, we are almost guaranteed to have enough of them in any reasonably sized survey to be able to make accurate statements about that subpopulation. But if we were interested in the life experiences of, say, people in their twenties who have a high school diploma but never attended college, and who live in rural communities, we may have to make a deliberate effort to ensure that there are more of those people in our survey than we would get if we just took a random sample where everyone in the United States had the same probability of being included.

As I mentioned earlier, the sampling weights don't make a very big difference to the answers to the questions above, but the way to get the *most accurate* estimates would be to take them into account. So let's give that a try!

When we calculate the average of a variable, we do so by multiplying all the values of the variable by $1/N$ (where $N$ is the number of observations we have) and then adding up those multiplied values.

For a *weighted* average, we take the value for each observation $i$ and multiply it by

$$\frac{weight_i}{\sum_{j=1}^N weight_j}$$

where $weight_i$ is the observation's weight, and $\sum_{j=1}^N weight_j$ is the total of all the weights in the population being averaged. Then we just add up those values!
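To see why the weights matter, here is a small sketch on a hypothetical sample (made-up numbers, not CPS data). It assumes a population of 900 people earning 30 an hour and 100 people earning 10 an hour, so the true average is 28. The smaller group is deliberately oversampled, and each respondent's weight is the number of people they stand in for.

```python
import numpy as np

# Hypothetical sample: 90 respondents from the large group (each stands in for
# 10 people) and 50 from the small, oversampled group (each stands in for 2).
wages = np.concatenate([np.full(90, 30.0), np.full(50, 10.0)])
weights = np.concatenate([np.full(90, 10.0), np.full(50, 2.0)])

# Plain average: pulled down because the small group is overrepresented.
unweighted = wages.mean()

# The formula from the text: scale each value by weight_i / (sum of weights),
# then add up the scaled values.
weighted = np.sum(wages * (weights / weights.sum()))

# np.average with weights= computes the same quantity.
weighted_check = np.average(wages, weights=weights)

print(round(unweighted, 2))
print(round(weighted, 2))
print(round(weighted_check, 2))
```

In this toy case the unweighted mean lands well below 28, while the weighted mean recovers the assumed population average.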

Given that, what is the average hourly wage of Americans working full-time jobs (i.e., the group in this survey) taking into account survey weights (rounded to **two** decimal places)?


In [11]:
# BONUS: Weighted average hourly wage
weights = cps[:, 4]
weighted_average_hourly_wage = np.sum(hourly_wage * weights) / np.sum(weights)
weighted_average_hourly_wage = round(weighted_average_hourly_wage, 2)

print("Weighted average hourly wage (full-time workers):", weighted_average_hourly_wage)

Weighted average hourly wage (full-time workers): 25.93
