In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("proj04.ipynb")

<table style="width: 100%;" id="nb-header">
        <tr style="background-color: transparent;"><td>
            <img src="https://data-88e.github.io/assets/images/blue_text.png" width="250px" style="margin-left: 0;" />
        </td><td>
            <p style="text-align: right; font-size: 10pt;"><strong>Economic Models</strong>, Spring 2021<br>
                Dr. Eric Van Dusen<br>
            Notebook by Andrei Caprau and Alan Liang</p></td></tr>
    </table>

# Project 4: Econometrics

In this project, we'll look at datasets related to college education and explore how years of education relate to various other factors.

In [2]:
import numpy as np
from datascience import *
import statsmodels.api as sm
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore')
plt.style.use('seaborn-muted')
%matplotlib inline
plt.rcParams["figure.figsize"] = [10,7]

## Part 1: College Distance

We will begin by examining the relationship between years of schooling and a person's distance to the nearest college when in high school. The idea is to see whether proximity to a college affects how much education a person receives.

The data for this section is from the paper *Democratization or Diversion? The Effect of Community Colleges on Educational Attainment* by Cecilia Rouse (1995).

To explore this problem, we will import a dataset called `college_distance.csv`, which contains several relevant features for a random sample of high school seniors interviewed in 1980 and re-interviewed in 1986.
The table contains the following columns:

- `yrsed`: Years of Education Completed$^1$
- `female`: Binary variable (1 = female, 0 otherwise)
- `black`: Binary variable (1 = black, 0 otherwise)
- `hispanic`: Binary variable (1 = Hispanic, 0 otherwise)
- `bytest`: Base-year composite test score. These are achievement tests given to high school seniors.
- `dadcoll`: Binary variable (1 = father is a college graduate, 0 otherwise)
- `momcoll`: Binary variable (1 = mother is a college graduate, 0 otherwise)
- `incomehi`: Binary variable (1 = family income > \$25,000 per year, 0 otherwise)
- `ownhome`: Binary variable (1 = family owns their home, 0 otherwise)
- `urban`: Binary variable (1 = high school in urban area, 0 otherwise)
- `cue80`: County unemployment rate in 1980 (%)
- `stwmfg80`: Average state hourly wage in manufacturing in 1980
- `dist`: Distance from 4-year college (in 10s of miles)
- `tuition`: Average state 4 year college tuition (in 1000s of dollars)

$^1$: Rouse computed years of education by assigning 12 years to all members of the senior class. Each additional year of secondary education counted as one year. Students with vocational degrees were assigned 13 years, those with AA degrees 14 years, those with BA degrees 16 years, those with some graduate education 17 years, and those with a graduate degree 18 years.


In [3]:
distance = Table.read_table('college_distance.csv')
distance

<!-- BEGIN QUESTION -->

**Question 1.1:** What do you expect the sign of the relationship between years of schooling and distance to nearest college to be? Provide a possible and brief explanation for the sign.

<!--
BEGIN QUESTION
name: q1_1
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 1.2:** Consider the following single-variable regression:

$$\text{Years of Education} = \beta_1 \times \text{Distance from College} + \alpha$$ 

Fit the above regression of years of education `yrsed` onto distance to the nearest college `dist`. 

*Hint*: Make sure to always add a column of 1's.

<!--
BEGIN QUESTION
name: q1_2
-->

In [4]:
y_1_2 = ...
X_1_2 = ...
model_1_2 = ...
results_1_2 = ...
results_1_2.summary()

In [None]:
grader.check("q1_2")

**Question 1.3:** What is the estimated relationship between distance and years of schooling? Assign `slope_1_3` to the estimated slope (to at least 4 decimal places). Is this statistically significant? Assign `significant_1_3` to either `True` or `False`, corresponding to whether or not the slope is statistically significant.
<!--
BEGIN QUESTION
name: q1_3
-->

In [8]:
slope_1_3 = ...
significant_1_3 = ...

In [None]:
grader.check("q1_3")

<!-- BEGIN QUESTION -->

**Question 1.4:** Interpret the slope coefficient on `dist` with the appropriate units. Does this value align with our intuition from 1.1? 
Interpret the y-intercept term. 
<!--
BEGIN QUESTION
name: q1_4
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.5:** What could be a potential confounding variable in the above regression? Do you expect this confounding factor to overstate or understate our coefficient on `dist`? Why? Give clear reasoning for how the confounding variable affects both the independent and dependent variables, as discussed in lecture.
<!--
BEGIN QUESTION
name: q1_5
manual: true
-->
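
For reference, the standard omitted variable bias identity can help structure the "overstate or understate" reasoning. The notation is introduced here and is not taken from the project: $Z$ is the omitted confounder, $\tilde{\beta}_1$ is the slope on distance when $Z$ is left out, $\hat{\beta}_1$ and $\hat{\beta}_2$ are the coefficients on distance and $Z$ when both are included, and $\hat{\delta}_1$ is the slope from regressing $Z$ on distance.

$$\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \times \hat{\delta}_1$$

The direction of the bias is therefore the sign of $\hat{\beta}_2 \times \hat{\delta}_1$: how the confounder relates to years of education, multiplied by how it relates to distance.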

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 1.6:** Now consider the following longer regression:

$$\text{Years of Education} = \beta_1 \times \text{Distance from College} + \beta_2 \times \text{High Income Family} + \alpha$$ 

Fit a regression model for years of education `yrsed` onto both distance to the nearest college `dist` and family income `incomehi`.
<!--
BEGIN QUESTION
name: q1_6
-->
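
To see how adding a regressor can move a slope, here is a small simulation on synthetic data. It is only a sketch: the variables `x`, `z`, and `y` and their coefficients are invented, the helper `ols` is hypothetical, and it uses `np.linalg.lstsq` in place of `sm.OLS` because only the point estimates are needed.

```python
import numpy as np

# Synthetic data (illustrative assumption, not the project data).
# z is a binary confounder that lowers x and raises y.
rng = np.random.default_rng(2)
n = 2000
z = rng.binomial(1, 0.4, size=n).astype(float)
x = 3.0 - 1.0 * z + rng.normal(size=n)
y = 13.0 - 0.1 * x + 0.9 * z + rng.normal(size=n)

ones = np.ones(n)


def ols(columns, outcome):
    """Return least-squares coefficients for the given list of columns."""
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef


# Short regression: y on a constant and x only (z omitted).
_, short_slope = ols([ones, x], y)

# Long regression: y on a constant, x, and z.
_, long_slope, z_coef = ols([ones, x, z], y)

# Auxiliary regression: z on a constant and x.
_, delta = ols([ones, x], z)

print("short slope:", short_slope)
print("long slope: ", long_slope)
print("long slope + z_coef * delta:", long_slope + z_coef * delta)
```

In this setup the short regression makes the slope on `x` look more negative than it is, because `z` is negatively related to `x` and positively related to `y`.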

In [15]:
y_1_6 = ...
X_1_6 = ...
model_1_6 = ...
results_1_6 = ...
results_1_6.summary()

In [None]:
grader.check("q1_6")

**Question 1.7:** Now what is the estimated relationship between distance and years of schooling? Assign `slope_1_7` to the estimated slope (to at least 4 decimal places). Is this statistically significant? Assign `significant_1_7` to either `True` or `False`, corresponding to whether or not the slope is statistically significant.
<!--
BEGIN QUESTION
name: q1_7
-->

In [18]:
slope_1_7 = ...
significant_1_7 = ...

In [None]:
grader.check("q1_7")

<!-- BEGIN QUESTION -->

**Question 1.8:** How does this new slope for distance to college compare to the previous one? What does this say about family income?
<!--
BEGIN QUESTION
name: q1_8
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 1.9:** Now fit a linear regression with the additional regressors `bytest`, `female`, `black`, `hispanic`, `incomehi`, `ownhome`, `dadcoll`, `momcoll`, `cue80`, and `stwmfg80` (along with `dist`).

<!--
BEGIN QUESTION
name: q1_9
-->


In [23]:
y_1_9 = ...
X_1_9 = ...
model_1_9 = ...
results_1_9 = ...
results_1_9.summary()

In [None]:
grader.check("q1_9")

<!-- BEGIN QUESTION -->

**Question 1.10:** Further compare the slope on distance to college in our most recent regression with the previous two regressions. What does this suggest about the collective group of variables we included in addition to `dist`, regarding the idea of omitted variable bias?
<!--
BEGIN QUESTION
name: q1_10
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.11:** The value of the coefficient on `dadcoll` should be positive. What does this coefficient measure? Interpret this effect.
<!--
BEGIN QUESTION
name: q1_11
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.12:** Explain why `cue80` and `stwmfg80` appear in the regression. Are the signs of their estimated coefficients what you would have believed? Explain.

<!--
BEGIN QUESTION
name: q1_12
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.13:** A policymaker who wants to increase the average years of schooling of the population sees your results and concludes that more colleges should be built so that people are closer to colleges. Do you agree with this proposal? Why or why not?

<!--
BEGIN QUESTION
name: q1_13
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 1.14:** Let's try and make a prediction. 

Bob is a white non-Hispanic male. His high school was 20 miles from the nearest college. His base-year composite test score `bytest` was 58. 
His family income in 1980 was \$26,000,
and his family owned a home. His mother attended college, but his father did not.
The unemployment rate in his county was 7.5%, and the state average manufacturing
hourly wage was \$9.75. Predict Bob’s years of completed schooling using the regressions
in 1.6 and 1.9. 

Assign `bob_schooling_1_9` to the predicted schooling using the model in 1.9 and assign `bob_schooling_1_6` to the predicted schooling using the model in 1.6.
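
A prediction from a linear regression is the intercept plus each coefficient times the matching feature value. The sketch below shows the mechanics for a hypothetical student, paying attention to the units described in Part 1 (`dist` is in 10s of miles and `incomehi` is an indicator for income above \$25,000). The coefficient values are invented for illustration and are not estimates from the project data.

```python
# Hypothetical coefficients for illustration only; they are NOT estimates
# from the project data. Keys mirror the column names described in Part 1.
coefficients = {
    "const": 13.9,
    "dist": -0.05,      # per 10 miles, because dist is measured in 10s of miles
    "incomehi": 0.4,
}

# A hypothetical student: 35 miles from the nearest college,
# family income of $30,000 per year.
miles_from_college = 35
family_income = 30_000

features = {
    "const": 1.0,                                         # the column of 1's
    "dist": miles_from_college / 10,                      # miles -> 10s of miles
    "incomehi": 1.0 if family_income > 25_000 else 0.0,   # income indicator
}

# The prediction is the sum of coefficient * feature value.
predicted_years = sum(coefficients[name] * features[name] for name in coefficients)
print(predicted_years)
```

With a fitted `statsmodels` model, the same calculation can be done by passing a row with the same column order as the design matrix (including the constant) to `results.predict`.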

<!--
BEGIN QUESTION
name: q1_14
-->


In [28]:
bob_schooling_1_9 = ...
bob_schooling_1_6 = ...

print("Bob's predicted years of schooling in the long model is:", bob_schooling_1_9)
print("Bob's predicted years of schooling in the short model is:", bob_schooling_1_6)

In [None]:
grader.check("q1_14")

## Part 2: Diploma Effect

Previously, we treated years of schooling as a continuous variable and paid little attention to the significance of particular years.
In reality, it seems natural to think that 16 years of schooling, which is how long it takes for most people to obtain a Bachelor's degree, is a more significant jump from 15 than a typical one-year increase in schooling would be.

In this next part we examine the question of whether or not a diploma makes a significant difference in a person's earnings. In other words, is there a significant difference between certain jumps in schooling (11 to 12, 15 to 16) that would indicate a benefit from a diploma in addition to the additional year of schooling?

Below you will see a table from Jaeger and Page's paper entitled *Degrees Matter: New Evidence on Sheepskin Effects in the Returns to Education* (1996). Let's take a minute to understand it.
First, each column corresponds to a different regression that they performed, with the title of the column denoting the demographic group included in that regression. So in the first column, they performed a regression only on the white men in their dataset.
- The outcome variable is log hourly wage.
- The years-of-schooling dummy variables bucket individuals into 10 groups: 0-8, 9, 10, 11, 13, 14, 15, 16, 17, and 18+ years of schooling. There is *no* dummy variable for 12 years of schooling. Together with the omitted 12-year group, these buckets are *mutually exclusive and collectively exhaustive*.
- The diploma variables are dummy variables that represent the individual's highest diploma received. For example, if Alice received 16 years of schooling and received a High School, Bachelor's, and Master's degree, only the dummy variable for 16 years of schooling and the dummy variable for a Master's degree will be 1.

The regression Jaeger and Page conduct roughly looks as follows:

$$\text{Log income} = \sum_i \beta_i \times \text{Dummy variable for having i years of education} + \sum_j \gamma_j \times \text{Dummy variable for having highest degree j} + \alpha$$

![title](jaeger_page.png)
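
The dummy-variable coding described above can be sketched in a few lines. This is an illustration under stated assumptions: the year buckets follow the list given earlier, but the degree labels in `DEGREES` are an assumed set that may not match the paper's table, and the helper functions `year_bucket` and `encode` are hypothetical.

```python
# Year buckets from the description above; 12 years is the omitted baseline.
YEAR_BUCKETS = ["0-8", "9", "10", "11", "13", "14", "15", "16", "17", "18+"]

# Assumed degree labels for illustration; the paper's table may use a
# different set, and its omitted degree category is not specified here.
DEGREES = ["High School", "Associate's", "Bachelor's", "Master's", "Doctoral"]


def year_bucket(years):
    """Map years of schooling to its bucket label, or None for the 12-year baseline."""
    if years <= 8:
        return "0-8"
    if years == 12:
        return None
    if years >= 18:
        return "18+"
    return str(years)


def encode(years, highest_degree):
    """Return the 0/1 dummy variables for one individual."""
    bucket = year_bucket(years)
    row = {f"years_{b}": int(b == bucket) for b in YEAR_BUCKETS}
    row.update({f"degree_{d}": int(d == highest_degree) for d in DEGREES})
    return row


# Alice from the example above: 16 years of schooling, highest degree Master's.
alice = encode(16, "Master's")
print([name for name, value in alice.items() if value == 1])
```

Someone with exactly 12 years of schooling has every year dummy equal to 0, which is what makes that group the reference point for the year coefficients.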

<!-- BEGIN QUESTION -->

**Question 2.1:** 
Why do you think Jaeger and Page estimate 4 different models separately for white men, white women, black men, and black women? 

<!--
BEGIN QUESTION
name: q2_1
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



From now on, to keep the rest of the project short, we will only examine the regression results of model 1, i.e. the regression conducted on white men.

<!-- BEGIN QUESTION -->

**Question 2.2:** Why might the effect on earnings of the 14th year of education be larger than that of the 15th?

<!--
BEGIN QUESTION
name: q2_2
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.3:** Notice that there is no dummy variable for individuals with 12 years of schooling. Why might this be? As a result of this exclusion, how do you interpret the coefficient on 14 years of schooling?

<!--
BEGIN QUESTION
name: q2_3
manual: true
-->
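
Because the outcome variable is log hourly wage, a coefficient on a dummy variable translates into a percentage difference relative to the omitted group. The sketch below shows the conversion for a hypothetical coefficient of 0.10, which is not a value from the Jaeger and Page table.

```python
import math

# Hypothetical coefficient on a dummy variable in a log-wage regression;
# this is NOT a value from the Jaeger and Page table.
coef = 0.10

# Relative to the omitted group, wages differ by a factor of exp(coef).
exact_pct = (math.exp(coef) - 1) * 100

# For small coefficients, 100 * coef is a close approximation.
approx_pct = coef * 100

print(f"exact: {exact_pct:.2f}%, approximate: {approx_pct:.2f}%")
```

The approximation gets worse as the coefficient grows, so the exact form is the safer choice for larger coefficients.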

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.4:** Why is the coefficient on doctoral degrees less than that on high school degrees? 
Does this mean that high school graduates make more than PhD graduates? Why or why not?

<!--
BEGIN QUESTION
name: q2_4
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Conclusion
Very nice, you've finished Project 4! You've conducted a soup-to-nuts analysis that involved performing, comparing, and interpreting several regressions to examine sources of omitted variable bias. 
You've also interpreted a jam-packed table from a noteworthy economics paper that fortified your intuition on ordinary least squares linear regression. We hope you enjoyed the project just as much as we enjoyed writing it :')

Congratulations on finishing your last project in Data 88! 

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()