In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab08.ipynb")

<div id="nb-header" style="display: flex; justify-content: space-between; align-items: center; background-color: transparent;">
    <div>
        <img src="https://data-88e.github.io/assets/images/blue_text.png" width="250px" style="margin-left: 0;" />
    </div>
    <div style="text-align: right; font-size: 10pt;">
        <strong>Economic Models</strong>, edX<br>
        Dr. Eric Van Dusen<br>
        Vaidehi Bulusu<br>
        Akhil Venkatesh
    </div>
</div>

# Lab 8: Econometrics

We gave you a theoretical introduction to econometrics. In this lab, you'll get the chance to apply what you've learned and see how econometrics is actually used by economists!

This lab is based on an influential study on the relationship between a person's height and labor market outcomes, and is divided into 3 sections:

1. Simple Linear Regression

2. Multiple Linear Regression

3. Reading Econometrics Tables

You can refer to the [*Econometrics*](https://data-88e.github.io/textbook/content/11-econometrics/index.html) chapter in the textbook for help!

In [None]:
from datascience import *
import numpy as np
import statsmodels.api as sm

## Part 1: Simple Linear Regression

Several studies have identified a positive correlation between a person's height and labor market outcomes: on average, taller people hold higher-status jobs that pay more. In their paper, *[Stature and Status: Height, Ability, and Labor Market Outcomes](https://www.nber.org/system/files/working_papers/w12466/w12466.pdf)* (2008), economists Anne Case and Christina Paxson analyze data from the 1994 US National Health Interview Survey to explain this association. This is the data we will be using for our analysis.

In the first part of the lab, we will use simple (bivariate) linear regression to look at the association between a person's height and earnings, and consider the problems with limiting our regression model to just 1 regressor.

We start by importing the `earnings.csv` dataset which contains information about the characteristics and labor market outcomes for 17,870 workers.

In [None]:
earnings = Table.read_table("earnings.csv")
earnings.show(5)

<div class="alert alert-warning">
    
Before proceeding, please read the <a href="https://www.princeton.edu/~mwatson/Stock-Watson_3u/Students/EE_Datasets/Earnings_and_Height_Description.pdf" target="_blank">data description</a> for this study which gives you information about each variable. **This is very important.**
    
</div>

Note: We generally take the log of earnings in regression models, but we’re not doing that in Parts 1 and 2 for the sake of simplicity. We’ll use log earnings as the dependent variable in Part 3.

**Question 1.1:** Here's a simple linear regression equation to model the association between height and earnings:

$$\text{Earnings} = \beta_1 \times \text{Height} + \alpha$$

Perform a regression of `earnings` on `height`. Don't forget to add a constant term!
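
As a minimal sketch of what adding a constant term does mechanically, the snippet below fits a line to made-up data (not `earnings.csv`) using NumPy's least-squares solver. The column of ones plays the role that `sm.add_constant` plays in a `statsmodels` workflow; the toy numbers and variable names are assumptions for illustration only.

```python
import numpy as np

# Toy data (hypothetical, unrelated to the lab's dataset): y = 3 + 2x exactly.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 + 2.0 * x

# Prepend a column of ones so the fit can estimate an intercept.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution: coef[0] is the intercept, coef[1] is the slope.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

print(intercept, slope)
```

In the lab itself you would pass the design matrix with its constant column to `sm.OLS` and call `.fit()`; the estimated intercept and slope are the same least-squares quantities.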


In [None]:
y_1_1 = ...
x_1_1 = ...
model_1_1 = ...
results_1_1 = ...
results_1_1.summary()

In [None]:
grader.check("q1_1")

**Question 1.2:**
Why should we include a constant term in this regression model?

<ol type="A" style="list-style-type: lower-alpha;">
    <li>Because we expect the slope of the line of best fit may be zero.  </li>
    <li>Because we expect the slope of the line of best fit may be non-zero.  </li>
    <li>Because we expect the x-intercept of the line of best fit may be non-zero.  </li>
    <li>Because we expect the y-intercept of the line of best fit may be non-zero.  </li>
</ol>

Assign a letter corresponding to your answer to `q1_2` below. For example, `q1_2 = 'a'`.


In [None]:
q1_2 = ...

In [None]:
grader.check("q1_2")

**Question 1.3:** What is the estimated association between `earnings` and `height`?


In [None]:
result_1_3 = ...
result_1_3

In [None]:
grader.check("q1_3")

**Question 1.4:** Is the association statistically significant? Answer `True` or `False`.


In [None]:
result_1_4 = ...
result_1_4

In [None]:
grader.check("q1_4")

**Question 1.5:** Interpret the slope on the `height` variable (including the units). 

<ol type="A" style="list-style-type: lower-alpha;">
    <li>A 1 inch increase in height corresponds to around a $707.7 increase in earnings  </li>
    <li>A 1 inch increase in height corresponds to around a $707.7 decrease in earnings  </li>
    <li>A 1% increase in height corresponds to around a $707.7 increase in earnings  </li>
    <li>A 1% increase in height corresponds to around a $707.7 decrease in earnings  </li>
</ol>

Assign a letter corresponding to your answer to `q1_5` below. For example, `q1_5 = 'a'`.


In [None]:
q1_5 = ...

In [None]:
grader.check("q1_5")

**Question 1.6:** Interpret the intercept of the regression (including the units). Does this make practical sense? 

<ol type="A" style="list-style-type: lower-alpha;">
    <li>When height is 0, earnings are estimated to be around -$512.7.  </li>
    <li>When height is 0, earnings are estimated to be around $512.7.  </li>
    <li>When earnings are 0, height is estimated to be around -512.7 inches.  </li>
    <li>When earnings are 0, height is estimated to be around 512.7 inches.  </li>
</ol>

Assign a letter corresponding to your answer to `q1_6` below. For example, `q1_6 = 'a'`.


In [None]:
q1_6 = ...

In [None]:
grader.check("q1_6")

**Question 1.7:** Use the slope and intercept from the regression in question 1.1 to generate predictions for `earnings`.


In [None]:
predictions_1_7 = ...
predictions_1_7

In [None]:
grader.check("q1_7")

**Question 1.8:** Calculate the RMSE for your regression predictions from question 1.7.
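
For reference, RMSE is the square root of the mean of the squared prediction errors. The sketch below works through that arithmetic on made-up numbers; the slope, intercept, and data are hypothetical and unrelated to the lab's dataset.

```python
import numpy as np

# Hypothetical fitted line and toy observations (not from earnings.csv).
intercept, slope = 1.0, 2.0
x = np.array([1.0, 2.0, 3.0, 4.0])
actual = np.array([4.0, 5.0, 5.0, 11.0])

# Predictions from the fitted line: 3, 5, 7, 9.
predicted = intercept + slope * x

# Errors are 1, 0, -2, 2, so the mean squared error is 9 / 4 = 2.25.
errors = actual - predicted
rmse = np.sqrt(np.mean(errors ** 2))

print(rmse)  # 1.5
```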


In [None]:
rmse_1_8 = ...
rmse_1_8

In [None]:
grader.check("q1_8")

**Question 1.9:** Which one of the following is true about the RMSE? 

<ol type="A" style="list-style-type: lower-alpha;">
    <li>RMSE is the total sum of squared error in the regression.   </li>
    <li>RMSE means that on average the predicted earnings are off by $26,775.7.  </li>
    <li>It is possible for the RMSE to increase if we add unrelated variables to the regression.  </li>
    <li>Higher RMSE means the model more accurately predicts the dependent variable.  </li>
</ol>

Assign a letter corresponding to your answer to `q1_9` below. For example, `q1_9 = 'a'`.


In [None]:
q1_9 = ...

In [None]:
grader.check("q1_9")

## Part 2: Multiple Linear Regression

Now, let's perform multiple linear regression to account for potential confounding variables in our model. For simplicity, we will be using only the following additional regressors: `age`, `educ`, `sex` and `weight`.
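
To see why leaving out a confounder matters, here is a small sketch on made-up, noise-free data. The outcome `y` depends on both `x` and a hypothetical confounder `z` that is correlated with `x`; all names and numbers are assumptions for illustration, not values from the lab. Regressing `y` on `x` alone attributes part of the effect of `z` to `x`.

```python
import numpy as np

# Toy data: z tends to rise with x, and y = 1*x + 2*z exactly.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
z = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 4.0])
y = 1.0 * x + 2.0 * z

ones = np.ones_like(x)

# "Short" regression: y on a constant and x only (z is omitted).
short_coef, *_ = np.linalg.lstsq(np.column_stack([ones, x]), y, rcond=None)
short_slope = short_coef[1]

# "Long" regression: y on a constant, x, and z.
long_coef, *_ = np.linalg.lstsq(np.column_stack([ones, x, z]), y, rcond=None)
long_slope = long_coef[1]

print(short_slope, long_slope)
```

Here the long regression recovers the true coefficient of 1 on `x`, while the short regression reports a larger slope because `x` picks up some of the effect of `z`. The direction of the bias depends on the signs of the omitted variable's effect and of its correlation with the included regressor.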


**Question 2.1:** Perform another regression with the following new regressors: `age`, `educ`, `sex` and `weight` (also include `height`).


In [None]:
y_2_1 = ...
x_2_1 = ...
model_2_1 = ...
results_2_1 = ...
results_2_1.summary()

In [None]:
grader.check("q2_1")

**Question 2.2:** Compare the coefficient on `height` from the regression model in question 2.1 to the coefficient on `height` from the regression model in question 1.1. What does this tell you about the nature of the omitted variable bias in the previous model (is it positive or negative)?

Fill in the blanks: The coefficient in 2.1 is \_\_\_\_\_ which means that the omitted variable bias in 1.1 is overall \_\_\_\_\_. 

<ol type="A" style="list-style-type: lower-alpha;">
    <li>higher; positive</li>
    <li>lower; positive</li>
    <li>higher; negative</li>
    <li>lower; negative</li>
</ol>

Assign a letter corresponding to your answer to `q2_2` below. For example, `q2_2 = 'a'`.


In [None]:
q2_2 = ...

In [None]:
grader.check("q2_2")

**Question 2.3:** If we computed the RMSE for this new regression model, do you think it would be higher or lower than the RMSE we computed in 1.8? 

<ol type="A" style="list-style-type: lower-alpha;">
    <li>Lower</li>
    <li>Higher</li>
    <li>It depends on the specific choice of variables </li>
</ol>

Assign a letter corresponding to your answer to `q2_3` below. For example, `q2_3 = 'a'`.


In [None]:
q2_3 = ...

In [None]:
grader.check("q2_3")

**Question 2.4:** Now that we've accounted for some additional confounding variables, do you think it makes sense for us to infer a causal relationship between height and earnings?

<ol type="A" style="list-style-type: lower-alpha;">
    <li>Yes, because the coefficient for height is now highly significant. </li>
    <li>Yes, because we have eliminated omitted variable bias by adding control variables in 2.1. </li>
    <li>No, because there may be other omitted variables we haven't accounted for.</li>
    <li>No, because the set of control variables added in 2.1 is a poor choice. </li>
</ol>

Assign a letter corresponding to your answer to `q2_4` below. For example, `q2_4 = 'a'`.


In [None]:
q2_4 = ...

In [None]:
grader.check("q2_4")

**Question 2.5:** Using regression results in 2.1, now let’s try to predict a person’s earnings based on their characteristics.

What would be the predicted earnings of a 35-year-old woman with a Bachelor's degree who is 64 inches tall and weighs 124 pounds?

*Hint: A person with a Bachelor's degree has completed 15 years of education.*
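
A prediction from a multiple regression is the intercept plus each coefficient times the corresponding characteristic. The sketch below illustrates this with hypothetical coefficients and a hypothetical person (these are not the lab's regression results), and shows what happens when only a 0/1 dummy variable is changed.

```python
# Hypothetical coefficients (NOT the lab's regression output).
coefs = {"const": 100.0, "age": 2.0, "educ": 10.0, "sex": 50.0, "height": 1.0}


def predict(person):
    """Return the intercept plus each coefficient times the person's value."""
    return coefs["const"] + sum(coefs[name] * value for name, value in person.items())


# Hypothetical person: 100 + 2*30 + 10*12 + 50*1 + 1*70 = 400.
person = {"age": 30, "educ": 12, "sex": 1, "height": 70}
prediction = predict(person)

# Flip only the dummy variable, holding everything else fixed.
same_person_other_group = dict(person, sex=0)
gap = prediction - predict(same_person_other_group)

print(prediction, gap)  # 400.0 50.0
```

With everything else held fixed, the gap between the two predictions equals the coefficient on the dummy variable.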


In [None]:
prediction_2_5 = ...
prediction_2_5

In [None]:
grader.check("q2_5")

**Question 2.6:** Say you wanted to know the relationship between gender (`sex`: a binary variable) and income (`earnings`: a continuous variable). Based on our regression results, how is gender correlated with income?

*Hint*: Think about what it means when the value of `sex` is 0 versus 1. Also, you can try changing your input for `sex` in 2.5 and see what happens.

Assuming all else equal,

<ol type="A" style="list-style-type: lower-alpha;">
    <li>On average, males earn $586.92 more than females. </li>
    <li>On average, females earn $586.92 more than males. </li>
    <li>For every 1 inch increase in height, a male's income increases by $586.92 more, on average, than a female's. </li>
    <li>For every 1 inch increase in height, a female's income increases by $586.92 more, on average, than a male's. </li>
</ol>

Assign a letter corresponding to your answer to `q2_6` below. For example, `q2_6 = 'a'`.


In [None]:
q2_6 = ...

In [None]:
grader.check("q2_6")

## Part 3: Reading Econometrics Tables

Researchers tend to run multiple regression models which they summarize in econometrics tables.

Below is a table taken from the paper. It shows the regression results from 2 studies on the relationship between height and earnings: the British Cohort Study (BCS) and National Child Development Study (NCDS).

<img src = "https://i.imgur.com/a2o9OPA.png">

<div class="alert alert-warning">
    
Make sure to read the table (including the note at the bottom) before proceeding.
    
</div>

Note that for the questions below, **log earnings** is the dependent variable. Recall the [interpretation of the slope](https://data-88e.github.io/textbook/content/01-demand/03-log-log.html) in this case.
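
As a quick reminder of the arithmetic, when the dependent variable is in logs and the regressor is in levels, a coefficient `b` corresponds to roughly a `100 * b` percent change in the dependent variable per one-unit increase in the regressor. The sketch below compares that approximation with the exact percentage change for a hypothetical coefficient; the value `0.05` is an assumption and does not come from the table.

```python
import math

# Hypothetical coefficient on a regressor when the outcome is log(earnings).
b = 0.05

# Common approximation: a one-unit increase changes the outcome by about 100*b percent.
approx_pct = 100 * b

# Exact percentage change implied by the log specification.
exact_pct = 100 * (math.exp(b) - 1)

print(approx_pct, exact_pct)
```

The two figures are close for small coefficients and diverge as `b` grows, so the approximation is best reserved for coefficients near zero.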

**Question 3.1:** According to the table, what did the British Cohort Study (1970) find about the relationship between height at age 30, test scores ages 5 and 10, and earnings for women? Use the results with extended controls added in. Which of the following are correct? There can be between 1 and 4 correct answers. 

<ol type="A" style="list-style-type: lower-alpha;">
    <li>The coefficient on height is statistically significant. </li>
    <li>The coefficient on height means for every 1 inch increase in height at age 30, the annual earnings will on average increase by 0.002 dollars. </li>
    <li>The coefficient on test scores is statistically significant. </li>
    <li>The coefficient on test scores means for every 1 point increase in test scores ages 5 and 10, the annual earnings will on average increase by 19.75 dollars. </li>
</ol>

Assign an array of letters corresponding to your answer to `q3_1` below. For example, `q3_1 = make_array('a', 'b', 'c', 'd')`.


In [None]:
q3_1 = ...

In [None]:
grader.check("q3_1")

## Conclusion

This brings us to the end of Lab 8! You've learned some key econometrics skills such as running regressions and reading econometrics tables. You've also developed an intuition for ordinary least squares (OLS) regression, omitted variable bias and regression with dummy variables.

Also, we didn't cover a large part of Case and Paxson's [fascinating study](https://www.nber.org/system/files/working_papers/w12466/w12466.pdf) so if you're interested in how they explain the positive association between height and earnings, we recommend giving the paper a read!

---