In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("proj04.ipynb")

<table style="width: 100%;" id="nb-header">
    <tr style="background-color: transparent;"><td>
        <img src="https://data-88e.github.io/assets/images/blue_text.png" width="250px" style="margin-left: 0;" />
    </td><td>
        <p style="text-align: right; font-size: 10pt;"><strong>Economic Models</strong>, Fall 2021<br>
            Dr. Eric Van Dusen <br>
        Notebook by Vaidehi Bulusu</p>
    </td></tr>
</table>

# Project 4: Econometrics

In this project, we will be using econometric analysis to investigate the relationship between a person's height and labor market outcomes. The project is divided into 3 sections:

1. Simple Linear Regression

2. Multiple Linear Regression

3. Reading Econometrics Tables

You can refer to the [*Econometrics*](https://data-88e.github.io/textbook/content/11-econometrics/index.html) chapter in the textbook and week 12 resources for help on this project.

In [1]:
from datascience import *
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import warnings
warnings.simplefilter(action='ignore')
plt.style.use('seaborn-muted')
%matplotlib inline
plt.rcParams["figure.figsize"] = [10,7]

## Part 1: Simple Linear Regression

Several studies have identified a positive correlation between a person's height and labor market outcomes. On average, taller people have jobs that are of higher status and pay more. In their paper, *[Stature and Status: Height, Ability, and Labor Market Outcomes](https://www.nber.org/system/files/working_papers/w12466/w12466.pdf)* (2008), Anne Case and Christina Paxson analyze data from the 1994 US National Health Interview Survey to explain this association. This is the data we will be using for our analysis.

In the first part of the project, we will be using simple (bivariate) linear regression to look at the association between a person's height and their earnings and consider the problems with limiting our analysis to just 2 variables.

We start by importing the `earnings.csv` dataset which contains information about the characteristics and labor market outcomes for 17,870 workers.

In [2]:
earnings = Table.read_table("earnings.csv")
earnings.show(5)

<div class="alert alert-warning">
    
Before proceeding, please read the <a href="https://wps.pearsoned.com/wps/media/objects/11422/11696965/data3eu/Earnings_and_Height_Description.pdf" target="_blank">data description</a> for this study which gives you information about each variable. **This is very important.**
    
</div>

<!-- BEGIN QUESTION -->

**Question 1.1:** What would you expect the sign of the relationship between height and earnings to be? Explain your answer.

<!--
BEGIN QUESTION
name: q1_1
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 1.2:** Here's a simple linear regression equation to model the association between height and earnings:

$$\text{Earnings} = \beta_1 \times \text{Height} + \alpha$$

Perform a regression of `earnings` on `height`. Don't forget to add a constant term.

<!--
BEGIN QUESTION
name: q1_2
points: 2
-->

In [3]:
y_1_2 = ...
x_1_2 = ...
model_1_2 = ...
results_1_2 = ...
results_1_2.summary()

In [None]:
grader.check("q1_2")

<!-- BEGIN QUESTION -->

**Question 1.3:** Why should we include a constant term in the regression model?

<!--
BEGIN QUESTION
name: q1_3
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 1.4:** What is the estimated slope of the association between `earnings` and `height`?

<!--
BEGIN QUESTION
name: q1_4
points:
    - 0
    - 1
-->

In [7]:
result_1_4 = ...
result_1_4

In [None]:
grader.check("q1_4")

**Question 1.5:** Is the association statistically significant? Answer `True` or `False`.

<!--
BEGIN QUESTION
name: q1_5
points:
    - 0
    - 1
-->

In [10]:
result_1_5 = ...
result_1_5

In [None]:
grader.check("q1_5")

<!-- BEGIN QUESTION -->

**Question 1.6:** Interpret the slope on the `height` variable (including the units). Does this match with what you expected in question 1.1?

<!--
BEGIN QUESTION
name: q1_6
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.7:** Interpret the intercept of the regression (including the units). Does this make practical sense? Explain your answer.

<!--
BEGIN QUESTION
name: q1_7
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.8:** If the slope on the independent variable is statistically significant, can we infer a causal relationship between the two variables? Why or why not? When can you infer a causal relationship between 2 variables based on the results of the study (hint: think about the type of study)?

<!--
BEGIN QUESTION
name: q1_8
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 1.9:** Use the slope and intercept from the regression in question 1.2 to generate an array of predictions for `earnings` based on the `height` column in our dataset.
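
A fitted line turns each input into a prediction via $\hat{y} = \alpha + \beta_1 x$. Here is a minimal sketch with made-up numbers (the intercept, slope, and inputs are hypothetical and unrelated to the `earnings` data):

```python
import numpy as np

# Hypothetical coefficients and inputs, chosen only for illustration
toy_intercept = 10.0
toy_slope = 2.0
toy_x = np.array([1.0, 2.0, 3.0])

# Broadcasting applies the linear formula to every element at once
toy_predictions = toy_intercept + toy_slope * toy_x
print(toy_predictions)  # [12. 14. 16.]
```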

<!--
BEGIN QUESTION
name: q1_9
-->

In [13]:
predictions_1_9 = ...
predictions_1_9

In [None]:
grader.check("q1_9")

**Question 1.10:** Calculate the root mean squared error (RMSE) for your regression predictions for `earnings`.
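
Recall that $\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$. The sketch below computes it for a tiny made-up example; the arrays `actual` and `predicted` are hypothetical.

```python
import numpy as np

# Hypothetical actual and predicted values
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.0, 5.0, 8.0, 11.0])

residuals = actual - predicted               # [1, 0, -1, -2]
rmse_toy = np.sqrt(np.mean(residuals ** 2))  # sqrt((1 + 0 + 1 + 4) / 4) = sqrt(1.5)
print(rmse_toy)
```

Because the errors are squared and then square-rooted, the RMSE is in the same units as the dependent variable.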

<!--
BEGIN QUESTION
name: q1_10
-->

In [16]:
rmse_1_10 = ...
rmse_1_10

In [None]:
grader.check("q1_10")

<!-- BEGIN QUESTION -->

**Question 1.11:** Interpret the RMSE value (including the units). What does it tell you about the accuracy of your predictions from question 1.9?

<!--
BEGIN QUESTION
name: q1_11
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Part 2: Multiple Linear Regression

Now, let's do multiple linear regression to account for potential confounding variables in our model.

<!-- BEGIN QUESTION -->

**Question 2.1:** Take a look at the columns in the `earnings` table. Pick 2 variables you think might be potential confounders in the study and provide a brief explanation for why you think each of the variables may be a confounder. In your answer, also talk about how each variable is related to `height` and `earnings`.

<!--
BEGIN QUESTION
name: q2_1
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 2.2:** Perform another regression with the following new regressors: `age`, `educ`, `sex` and `weight` (remember to also include `height`).

<!--
BEGIN QUESTION
name: q2_2
points: 2
-->

In [19]:
y_2_2 = ...
x_2_2 = ...
model_2_2 = ...
results_2_2 = ...
results_2_2.summary()

In [None]:
grader.check("q2_2")

<!-- BEGIN QUESTION -->

**Question 2.3:** Compare the coefficient on `height` from the regression model in question 2.2 to the coefficient on `height` from the regression model in question 1.2. What does this tell you about the nature of the omitted variable bias in the previous model (hint: is the bias positive or negative)?

<!--
BEGIN QUESTION
name: q2_3
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.4:** Look at the regression coefficients for the variables you chose in question 2.1. Do the coefficients align with your intuition?

<!--
BEGIN QUESTION
name: q2_4
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.5:** Let's consider the effect of gender on earnings. Intuitively, do you think the relationship between height and earnings would be different based on gender? Explain your answer.

<!--
BEGIN QUESTION
name: q2_5
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.6:** Let's say we perform a new regression that takes all the variables in the `earnings` table into account. How would the RMSE you obtain from this regression differ from the RMSE you calculated in question 1.10? Why?  
*Hint:* Would the new RMSE be higher or lower than the one you obtained in 1.10?

<!--
BEGIN QUESTION
name: q2_6
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.7:** Now that we've accounted for confounding variables such as age, occupation, gender, region, etc., do you think it makes sense for us to infer a causal relationship between height and earnings? Explain your answer.

<!--
BEGIN QUESTION
name: q2_7
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 2.8:** Now let’s try to predict a person’s earnings based on their characteristics.

Sofia is a 35-year-old Hispanic female who works as an executive for a private company. She earned a Bachelor's degree (completing 15 years of education) and lives in Pennsylvania (which is in the northeastern part of the US) with her husband. She is 64 inches tall and weighs 124 pounds.

What would her predicted earnings be?

<!--
BEGIN QUESTION
name: q2_8
-->

In [23]:
prediction_2_8 = ...
prediction_2_8

In [None]:
grader.check("q2_8")

<!-- BEGIN QUESTION -->

**Question 2.9:** Say we were to regress `earnings` (which is a continuous variable) on just `sex` (which is a binary variable). Which value would the regression coefficient be equal to (hint: it's a difference between two values)?

<!--
BEGIN QUESTION
name: q2_9
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Part 3: Reading Econometrics Tables

Below is one of the regression tables taken from the paper. It shows the regression results from 2 studies on the relationship between height and earnings: the British Cohort Study (BCS) and the National Child Development Study (NCDS).

<img src = "https://i.imgur.com/a2o9OPA.png">

<div class="alert alert-warning">
    
Make sure to read the table (including the note at the bottom) before proceeding.
    
</div>

<!-- BEGIN QUESTION -->

**Question 3.1:** We can see that in each study, the researchers estimate separate models for men and women. Why do you think this is?

<!--
BEGIN QUESTION
name: q3_1
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.2:** Look at the results of the British Cohort Study for men. There are 3 coefficients reported for height at age 30. Give a brief interpretation of each of these coefficients (including the units). In your explanation, talk about what is causing the differences in these coefficients (hint: look at the *Test scores* and *Extended controls* rows).

Note: Assume height is measured in inches for this study.

<!--
BEGIN QUESTION
name: q3_2
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.3:** Compare the third coefficient for men and women in the BCS (look at the 3rd and 6th columns). What does this tell you about the relationship between height and earnings for men and women? Does this align with your intuition in question 2.5?

<!--
BEGIN QUESTION
name: q3_3
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



**Congratulations on finishing your last project in Data 88E!** We hope you enjoyed working on this project (and this course in general) as much as we enjoyed creating it :)

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()