In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("ps3.ipynb")

# Econ 140 – Problem Set 3

We'll walk through a demonstration of multivariable regression. We also went through a brief demonstration in PS0, so it may be worth reviewing that one as well. The procedure is very similar to single-variable regression; the only difference is that we need to select multiple columns for our independent `X` variable. Suppose we have a dataset called `df` with three columns of observations, `wage`, `educ`, and `parents_wealth`, and suppose we want to regress `wage` on `educ`, `parents_wealth`, and a constant. To do so, we first identify the endogenous (dependent) variable and the exogenous (independent) variables.

```python
y = df['wage']
X = sm.add_constant(df[['educ', 'parents_wealth']])
```

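Notice the double square brackets when we select multiple columns. `df['educ', 'parents_wealth']` will not work, because pandas reads it as a request for a single column whose name is the tuple `('educ', 'parents_wealth')`.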

Next, we will pass in our endogenous and exogenous variables (in that order) to `sm.OLS`, just like before.

```python
my_ols_model = sm.OLS(y, X)
results = my_ols_model.fit(cov_type = 'HC1')
results.summary()
```

And that's it!

Before getting started on the assignment, run the cell at the very top that imports `otter` and the cell below which will import the packages we need.

**Important:** As mentioned in Problem Set 0, if you leave this notebook alone for a while and come back, DataHub will "forget" which code cells you have run in order to save memory, and you may need to restart your kernel and run all of the cells from the top. That includes the code cell below that imports packages. If you get `<something> not defined` errors, it is because you didn't run an earlier code cell that you needed to run. It might be the cell below or the `otter` cell above.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

## Problem 1. Multivariate Linear Regression

This problem creates a dataset from generated variables, in the same way as we did in Problem Set 2. A main advantage of such an exercise is that we control the true data generating process (“DGP”), which is not possible in practical econometric analysis.

<!-- BEGIN QUESTION -->

**Question 1.a.**
Set the sample size at 1,000 and generate an error term, $u_i$, by randomly selecting from a normal distribution with mean 0 and standard deviation 5. Draw an explanatory variable, $X_{1i}$, from a standard normal distribution, $\mathcal{N}(0,1)$, and then define a second explanatory variable, $X_{2i}$, to be equal to $e^{X_{1i}}$ for all $i$. Finally, set the dependent variable to be linearly related to the two regressors plus an additive error term: $y_i = 2 + 4X_{1i} - 6X_{2i} + u_i$. Note that, by construction, the error term of this multivariate linear regression is homoskedastic.

*Hint*: You may want to refer to how you did this in Problem Set 2. Also, the function `np.exp()` takes a list/array of numbers and applies the exponential function to each element. It is the inverse function of `np.log()`.
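As a quick illustration of how `np.exp()` behaves, here is a small sketch on a made-up array `x` (not one of the variables in this question):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])   # hypothetical input values
exp_x = np.exp(x)               # elementwise: e**0, e**1, e**2
back = np.log(exp_x)            # np.log undoes np.exp, up to floating-point rounding

print(exp_x)
print(np.allclose(back, x))
```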

<!--
BEGIN QUESTION
name: q1_a
manual: true
-->

In [4]:
u = ...
X1 = ...
X2 = ...
y = ...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.b.**
Regress $y$ on $X_1$ with homoskedasticity-only standard errors. `statsmodels` does this by default: just don't specify a `cov_type` as we usually do to get robust errors. Do the same analysis for $y$ and $X_2$. Compare the results with the true data generating process. Explain why differences arise between the population slopes and the estimated slopes, if there are any.

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q1_b
manual: true
-->

In [5]:
X1_const = ...
model_1b_X1 = ...
results_1b_X1 = ...
results_1b_X1.summary()

<!-- END QUESTION -->

In [6]:
X2_const = ...
model_1b_X2 = ...
results_1b_X2 = ...
results_1b_X2.summary()

<!-- BEGIN QUESTION -->

**Question 1.c.**
Explain.

<!--
BEGIN QUESTION
name: q1_c
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.d.**
Next, regress $y$ on both $X_1$ and $X_2$. Compare the estimation results with those from part (b/c), especially the model with only the regressor $X_1$. Examine differences across the three regressions in terms of the coefficient estimates, their standard errors, the $R^2$, and the adjusted $R^2$.

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q1_d
manual: true
-->

In [8]:
X_const = sm.add_constant(np.stack([X1, X2], axis=1)) # This just puts our two variables together with a const
model_1d = ...
results_1d = ...
results_1d.summary()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.e.**
Explain.

<!--
BEGIN QUESTION
name: q1_e
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.f.**
Generate a third regressor: $X_{3i} = 1 + X_{1i} - X_{2i} + v_i$, where $v_i$ is drawn from a normal distribution with mean 0 and standard deviation 0.5. Estimate the model $y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \beta_3X_{3i} + w_i$. Compare the result with part (d/e). Do the changes in the OLS estimates, standard errors, the $R^2$, and the adjusted $R^2$ make sense to you? Explain why or why not.

*Hint: Think about the concept of “imperfect multicollinearity”.*
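To build intuition for the hint, here is a rough sketch using made-up regressors `a`, `b`, and `c` (not the variables from this question). It relies on the variance inflation factor (VIF), a standard diagnostic that this problem set does not otherwise use: $1/(1 - R^2)$ from regressing one regressor on the others.

```python
import numpy as np

def vif_of_last(a, b, c):
    """Variance inflation factor of c, given a constant, a, and b."""
    Z = np.column_stack([np.ones_like(a), a, b])
    coef, *_ = np.linalg.lstsq(Z, c, rcond=None)   # OLS of c on the other regressors
    resid = c - Z @ coef
    r2 = 1 - resid.var() / c.var()
    return 1 / (1 - r2)

rng = np.random.default_rng(2)
a = rng.normal(size=1000)   # hypothetical regressor
b = rng.normal(size=1000)   # hypothetical regressor

vifs = []
for noise_sd in (2.0, 0.5, 0.1):
    # c is close to a linear combination of a and b when noise_sd is small
    c = a + b + rng.normal(0, noise_sd, size=1000)
    vifs.append(vif_of_last(a, b, c))
    print(noise_sd, vifs[-1])
```

The smaller the independent noise in `c`, the closer it is to an exact linear combination of the other regressors, and the larger the VIF becomes.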

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q1_f
manual: true
-->

In [9]:
v = ...
X3 = ...

X_const_f = sm.add_constant(np.stack([X1, X2, X3], axis=1))
model_1f = ...
results_1f = ...
results_1f.summary()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.g.**
Explain.

<!--
BEGIN QUESTION
name: q1_g
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

## Problem 2. Teaching Ratings

We will use `teaching_ratings.csv` which contains data on course evaluations, course characteristics, and professor characteristics for 463 courses at the University of Texas at Austin. One of the characteristics is an index of the professor’s “beauty” as rated by a panel of six judges. The variable `course_eval` is an overall teaching evaluation score, on a scale of 1 (very unsatisfactory) to 5 (excellent). In this exercise, you will investigate how course evaluations are related to the professor’s beauty.

In [10]:
ratings = pd.read_csv("teaching_ratings.csv")
ratings.head()

<!-- BEGIN QUESTION -->

**Question 2.a.**
Run a regression of `course_eval` on `beauty` using robust standard errors. What is the estimated slope? Is it statistically significant?

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q2_a
manual: true
-->

In [12]:
y_2a = ...
X_2a = ...
model_2a = ...
results_2a = ...
results_2a.summary()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.b.**
Explain.

<!--
BEGIN QUESTION
name: q2_b
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.c.**
Run a regression of `course_eval` on `beauty`, including some additional variables to control for the type of course and professor characteristics. In particular, include as additional regressors `intro`, `onecredit`, `female`, `minority`, and `nnenglish`. What is the estimated effect of `beauty` on `course_eval`? Does the regression in (a) suffer from important omitted variable bias (OVB)? What happens with the $R^2$? Based on the confidence interval from the regression, can you reject the null hypothesis that the effect of beauty is the same as in part (a)? What can you say about the effect of the new variables included?

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q2_c
manual: true
-->

In [13]:
y_2c = ...
X_2c = ...
model_2c = ...
results_2c = ...
results_2c.summary()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.d.**
Explain.

<!--
BEGIN QUESTION
name: q2_d
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.e.**
Estimate the coefficient on beauty for the multiple regression model in (c) using the three-step process in Appendix 6.3 (the Frisch-Waugh theorem). Verify that the three-step process yields the same estimated coefficient for beauty as that obtained in (c). Comment.

*Hint: Recall that if your regression results are called `results`, you could get the residuals using `results.resid`.*
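The general idea of the three-step process can be sketched on simulated data. Everything here is hypothetical: the variables `w`, `x`, and `y_demo`, the coefficients used to simulate them, and the helper `ols`, which uses `np.linalg.lstsq` in place of `sm.OLS` so that the example stays short.

```python
import numpy as np

def ols(Z, t):
    """Return the OLS coefficients from regressing t on the columns of Z."""
    coef, *_ = np.linalg.lstsq(Z, t, rcond=None)
    return coef

rng = np.random.default_rng(3)
n = 400
w = rng.normal(size=n)                            # hypothetical control variable
x = 0.5 * w + rng.normal(size=n)                  # regressor of interest, correlated with w
y_demo = 1 + 2 * x - 3 * w + rng.normal(size=n)   # hypothetical outcome

const = np.ones(n)
controls = np.column_stack([const, w])

# Full regression: y_demo on a constant, x, and w; keep the coefficient on x
beta_full = ols(np.column_stack([const, x, w]), y_demo)[1]

# Step 1: residuals from regressing the outcome on the controls
y_resid = y_demo - controls @ ols(controls, y_demo)
# Step 2: residuals from regressing x on the controls
x_resid = x - controls @ ols(controls, x)
# Step 3: regress residuals on residuals (both have mean zero, so no constant is needed)
beta_fw = ols(x_resid[:, None], y_resid)[0]

print(beta_full, beta_fw)
```

The two printed coefficients agree up to floating-point rounding. With `statsmodels`, the residuals in steps 1 and 2 would come from `results.resid`, as the hint says.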

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q2_e
manual: true
-->

In [14]:
# Do the first step here (regress the outcome variable on covariates)
course_eval = ...
covariates = ...
model_eval_on_covariates = ...
results_eval = ...
eval_residuals = ...

# Do the second step here (regress the explanatory variable on covariates)
beauty = ...
model_beauty_on_covariates = ...
results_beauty = ...
beauty_residuals = ...

# Do the last step here (regress the outcome variable's residuals on the explanatory variable's residuals)
model_fw = ...
results_fw = ...
results_fw.summary()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.f.**
Explain.

<!--
BEGIN QUESTION
name: q2_f
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.g.**
Professor Smith is a black male with average beauty and is a native English speaker. He teaches a three-credit upper-division course. Predict Professor Smith’s course evaluation.
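A prediction from a fitted linear regression is the sum of each coefficient times the corresponding regressor value, with the constant multiplied by 1. Here is a minimal sketch of that arithmetic; the coefficients and the variable names `x1` and `x2` are made up and are not estimates from the teaching ratings data.

```python
import pandas as pd

# Made-up coefficients standing in for `results.params`; not estimates from any dataset.
params = pd.Series({'const': 1.0, 'x1': 0.5, 'x2': -2.0})

# One hypothetical observation. The constant always takes the value 1.
profile = pd.Series({'const': 1.0, 'x1': 3.0, 'x2': 1.0})

prediction = (params * profile).sum()   # 1.0*1 + 0.5*3 + (-2.0)*1
print(prediction)
```

Multiplying two Series aligns them by label, so the order of the entries in `profile` does not matter as long as the labels match those in `params`.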

<!--
BEGIN QUESTION
name: q2_g
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

## Problem 3. Education and Distance to College

The file `college_distance.csv` contains data from a random sample of high school seniors interviewed in 1980 and re-interviewed in 1986. A detailed description is given in `college_distance_description.pdf`, which will be shared on Piazza and bCourses. In this exercise, you will use these data to investigate the relationship between the number of completed years of education for young adults and the distance from each student’s high school to the nearest four-year college.

In [15]:
dist = pd.read_csv("college_distance.csv")
dist.head()

<!-- BEGIN QUESTION -->

**Question 3.a.**
What do you expect for the sign of the relationship and what mechanism can you think about to explain it?

<!--
BEGIN QUESTION
name: q3_a
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.b.**
Run a regression of years of completed education (`yrsed`) on distance to the nearest college (`dist`), measured in tens of miles (for example, `dist` = 2 means that the distance is 20 miles). What is the estimated slope? Is it statistically significant? Does distance to college explain a large fraction of the variance in educational attainment across individuals? Explain.

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q3_b
manual: true
-->

In [16]:
y_3b = ...
X_3b = ...
model_3b = ...
results_3b = ...
results_3b.summary()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.c.**
Explain.

<!--
BEGIN QUESTION
name: q3_c
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.d.**
Now run a regression of `yrsed` on `dist`, but include some additional regressors to control for characteristics of the student, the student’s family, and the local labor market. In particular, include as additional regressors: `bytest`, `female`, `black`, `hispanic`, `incomehi`, `ownhome`, `dadcoll`, `cue80`, and `stwmfg80`.  What is the estimated effect of `dist` on `yrsed`?  Is it substantively different from the regression in (b)? Based on this, does the regression in (b) seem to suffer from important omitted variable bias?

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q3_d
manual: true
-->

In [17]:
y_3d = ...
X_3d = ...
model_3d = ...
results_3d = ...
results_3d.summary()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.e.**
Explain.

<!--
BEGIN QUESTION
name: q3_e
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.f.**
The value of the coefficient on `dadcoll` is positive. What does this coefficient measure? Interpret this effect.

<!--
BEGIN QUESTION
name: q3_f
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.g.** Explain why `cue80` and `stwmfg80` appear in the regression. Are the signs of their estimated coefficients what you would have believed? Explain.

<!--
BEGIN QUESTION
name: q3_g
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.h.**
Bob is a black male. His high school was 20 miles from the nearest college. His base-year composite test score (`bytest`) was 58. His family income in 1980 was \$26,000, and his family owned a home. His mother attended college, but his father did not. The unemployment rate in his county was 7.5%, and the state average manufacturing hourly wage was \$9.75. Predict Bob’s years of completed schooling using the regression in (d).

<!--
BEGIN QUESTION
name: q3_h
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a PDF file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.to_pdf(pagebreaks=False, display_link=True)