In [2]:
# Initialize Otter
import otter
grader = otter.Notebook("ps3.ipynb")

# Econ 140 – Problem Set 3

This section demonstrates how to run a multivariable regression. We also went through a brief demonstration in PS0, so it might be worth reviewing that one as well. Multivariable regression is very similar to the single-variable case; the only difference is that we need to select multiple columns for our independent `X` variable. Suppose we have a dataset called `df` that has three columns of observations, one called `wage`, another `educ`, and another `parents_wealth`, and suppose we want to regress `wage` on `educ`, `parents_wealth`, and a constant. To do so, we would first identify the endogenous (dependent) variable and the exogenous (independent) variables.

```python
y = df['wage']
X = sm.add_constant(df[['educ', 'parents_wealth']])
```

Notice the double square brackets when we select multiple columns. `df['educ', 'parents_wealth']` will not work.

Next, we will pass in our endogenous and exogenous variables (in that order) to `sm.OLS`, just like before.

```python
my_ols_model = sm.OLS(y, X)
results = my_ols_model.fit(cov_type = 'HC1')
results.summary()
```

And that's it!

Before getting started on the assignment, run the cell at the very top that imports `otter` and the cell below which will import the packages we need.

**Important:** As mentioned in problem set 0, if you leave this notebook alone for a while and come back, to save memory datahub will "forget" which code cells you have run, and you may need to restart your kernel and run all of the cells from the top. That includes this code cell that imports packages. If you get `<something> not defined` errors, this is because you didn't run an earlier code cell that you needed to run. It might be this cell or the `otter` cell above.

In [3]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

## Problem 1. Multivariate Linear Regression

This problem creates a dataset by generating variables in the same way as we did in Problem Set 2. A main advantage of such an exercise is that we control the true data generating process (“DGP”), which is not possible in practical econometric analysis.

<!-- BEGIN QUESTION -->

**Question 1.a.**
Set the sample size at 1,000 and generate an error term, $u_i$, by randomly selecting from a normal distribution with mean 0, and standard deviation 5. Draw an explanatory variable, $X_{1i}$, from a standard normal distribution, $\mathcal{N}(0,1)$, and then define a second explanatory variable, $X_{2i}$, to be equal to $e^{X_{1i}}$ for all $i$. Finally, set the dependent variable to be linearly related to the two regressors plus an additive error term: $y_i = 2 + 4X_{1i} − 6X_{2i} + u_i$. Note that, by construction, the error term of this multivariate linear regression is homoskedastic.

*Hint*: You may want to refer to how you did this in Problem Set 2. Also, the function `np.exp()` takes a list/array of numbers and applies the exponential function to each element. It is the inverse of `np.log()`.
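A minimal sketch of the element-wise behavior described in the hint, using a short array rather than the full sample. The seeded generator is an assumption added here so that the draws repeat on every run; `np.random.seed` would play the same role for `np.random.normal`.

```python
import numpy as np

# Hypothetical seed: fixing it makes the random draws repeat on every run.
rng = np.random.default_rng(140)
x = rng.normal(0, 1, 5)      # five draws from N(0, 1)

exp_x = np.exp(x)            # e**x for each element; same shape as x
round_trip = np.log(exp_x)   # taking logs recovers x up to floating-point error

print(exp_x.shape)
print(np.allclose(round_trip, x))
```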

<!--
BEGIN QUESTION
name: q1_a
manual: true
-->

In [3]:
u = np.random.normal(0,5,1000)
X1 = np.random.normal(0,1,1000)
X2 = np.exp(X1)
y = 2+4*X1-6*X2+u

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.b.**
Regress $y$ on $X_1$ with homoskedasticity-only standard errors (`statsmodels` does this by default, just don't specify a `cov_type` like we usually do to get robust errors). Do the same analysis for $y$ and $X_2$. Compare the results with the true data generating process. Explain why differences arise between the population slopes and the estimated slopes, if there are any.

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q1_b
manual: true
-->

In [4]:
X1_const = sm.add_constant(X1)
model_1b_X1 = sm.OLS(y,X1_const)
results_1b_X1 = model_1b_X1.fit()
results_1b_X1.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.283
Model:,OLS,Adj. R-squared:,0.283
Method:,Least Squares,F-statistic:,394.7
Date:,"Sat, 06 Mar 2021",Prob (F-statistic):,2.88e-74
Time:,17:10:14,Log-Likelihood:,-3541.6
No. Observations:,1000,AIC:,7087.0
Df Residuals:,998,BIC:,7097.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-7.7675,0.264,-29.370,0.000,-8.287,-7.249
x1,-5.2744,0.265,-19.867,0.000,-5.795,-4.753

0,1,2,3
Omnibus:,493.785,Durbin-Watson:,1.95
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4573.134
Skew:,-2.056,Prob(JB):,0.0
Kurtosis:,12.636,Cond. No.,1.02


<!-- END QUESTION -->

In [5]:
X2_const = sm.add_constant(X2)
model_1b_X2 = sm.OLS(y,X2_const)
results_1b_X2 = model_1b_X2.fit()
results_1b_X2.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.677
Model:,OLS,Adj. R-squared:,0.677
Method:,Least Squares,F-statistic:,2093.0
Date:,"Sat, 06 Mar 2021",Prob (F-statistic):,2.92e-247
Time:,17:11:53,Log-Likelihood:,-3142.9
No. Observations:,1000,AIC:,6290.0
Df Residuals:,998,BIC:,6300.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-0.7883,0.233,-3.388,0.001,-1.245,-0.332
x1,-4.3046,0.094,-45.754,0.000,-4.489,-4.120

0,1,2,3
Omnibus:,6.945,Durbin-Watson:,2.003
Prob(Omnibus):,0.031,Jarque-Bera (JB):,6.838
Skew:,-0.194,Prob(JB):,0.0327
Kurtosis:,3.115,Cond. No.,3.49


<!-- BEGIN QUESTION -->

**Question 1.c.**
Explain.

<!--
BEGIN QUESTION
name: q1_c
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.d.**
Next, regress $y$ on both $X_1$ and $X_2$. Compare the estimation results with those you did in part (b/c), especially the model with only the regressor $X_1$. Examine differences across the three regressions in terms of the coefficient estimates, their standard errors, the $R^2$, and the adjusted $R^2$.

This question is for your code, the next is for your explanation.
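When the regressors are NumPy arrays rather than DataFrame columns, they first need to be combined into one two-dimensional array with one column per variable. The sketch below uses two tiny made-up arrays to show what `np.stack(..., axis=1)` produces and how a leading column of ones stands in for the constant.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

# axis=1 places each input array in its own column: shape (3, 2).
# axis=0 would instead make each array a row: shape (2, 3).
both = np.stack([a, b], axis=1)

# np.column_stack builds the same matrix for 1-D inputs.
same = np.column_stack([a, b])

# A column of ones in front plays the role of the constant term.
with_const = np.column_stack([np.ones(len(a)), both])
print(with_const)
```

`sm.add_constant` adds such a column of ones for you and, by default, puts it first, which is why `const` is the first row of the coefficient table.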

<!--
BEGIN QUESTION
name: q1_d
manual: true
-->

In [6]:
X_const = sm.add_constant(np.stack([X1, X2], axis=1)) # This just puts our two variables together with a const
model_1d = sm.OLS(y,X_const)
results_1d = model_1d.fit()
results_1d.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.735
Model:,OLS,Adj. R-squared:,0.735
Method:,Least Squares,F-statistic:,1385.0
Date:,"Sat, 06 Mar 2021",Prob (F-statistic):,1.58e-288
Time:,17:12:56,Log-Likelihood:,-3043.6
No. Observations:,1000,AIC:,6093.0
Df Residuals:,997,BIC:,6108.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.1377,0.289,7.399,0.000,1.571,2.705
x1,4.1370,0.279,14.805,0.000,3.589,4.685
x2,-6.0874,0.148,-41.263,0.000,-6.377,-5.798

0,1,2,3
Omnibus:,4.23,Durbin-Watson:,2.01
Prob(Omnibus):,0.121,Jarque-Bera (JB):,4.088
Skew:,-0.147,Prob(JB):,0.13
Kurtosis:,3.105,Cond. No.,6.43


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.e.**
Explain.

<!--
BEGIN QUESTION
name: q1_e
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.f.**
Generate a third regressor: $X_{3i} = 1 + X_{1i} − X_{2i} + v_i$ where $v_i$ is drawn from a normal distribution with mean 0 and standard deviation 0.5. Estimate the model $y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \beta_3X_{3i} + w_i$. Compare the result with part (d/e). Do changes in OLS estimates, standard errors, the $R^2$, and the adjusted $R^2$ make sense to you? Explain why or why not.

*Hint: Think about the concept of “imperfect multicollinearity”.*
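One common way to quantify imperfect multicollinearity is the variance inflation factor (VIF), $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing one regressor on the others. The VIF is not mentioned in this problem set; it is offered here only as one possible diagnostic. The sketch regenerates regressors with the same recipe as this problem under a hypothetical seed, so the numbers will not match your own draws.

```python
import numpy as np

rng = np.random.default_rng(3)   # hypothetical seed
n = 1000
X1 = rng.normal(0, 1, n)
X2 = np.exp(X1)
v = rng.normal(0, 0.5, n)
X3 = 1 + X1 - X2 + v

# Auxiliary regression: X3 on a constant, X1, and X2.
Z = np.column_stack([np.ones(n), X1, X2])
coef, *_ = np.linalg.lstsq(Z, X3, rcond=None)
resid = X3 - Z @ coef

# R^2 of the auxiliary regression and the implied variance inflation factor.
r2_aux = 1 - resid.var() / X3.var()
vif_x3 = 1 / (1 - r2_aux)
print(r2_aux, vif_x3)
```

A VIF well above 1 indicates that much of the variation in that regressor is already explained by the others, which is the situation the hint points to.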

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q1_f
manual: true
-->

In [8]:
v = np.random.normal(0,0.5,1000)
X3 = 1+X1-X2+v

X_const_f = sm.add_constant(np.stack([X1, X2, X3], axis=1))
model_1f = sm.OLS(y,X_const_f)
results_1f = model_1f.fit()
results_1f.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.736
Model:,OLS,Adj. R-squared:,0.735
Method:,Least Squares,F-statistic:,924.7
Date:,"Sat, 06 Mar 2021",Prob (F-statistic):,2.83e-287
Time:,17:15:43,Log-Likelihood:,-3042.7
No. Observations:,1000,AIC:,6093.0
Df Residuals:,996,BIC:,6113.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.5354,0.420,6.043,0.000,1.712,3.359
x1,4.5511,0.422,10.775,0.000,3.722,5.380
x2,-6.4950,0.345,-18.825,0.000,-7.172,-5.818
x3,-0.4047,0.310,-1.307,0.192,-1.012,0.203

0,1,2,3
Omnibus:,3.82,Durbin-Watson:,2.013
Prob(Omnibus):,0.148,Jarque-Bera (JB):,3.674
Skew:,-0.136,Prob(JB):,0.159
Kurtosis:,3.118,Cond. No.,12.9


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.g.**
Explain.

<!--
BEGIN QUESTION
name: q1_g
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

## Problem 2. Teaching Ratings

We will use `teaching_ratings.csv` which contains data on course evaluations, course characteristics, and professor characteristics for 463 courses at the University of Texas at Austin. One of the characteristics is an index of the professor’s “beauty” as rated by a panel of six judges. The variable `course_eval` is an overall teaching evaluation score, on a scale of 1 (very unsatisfactory) to 5 (excellent). In this exercise, you will investigate how course evaluations are related to the professor’s beauty.

In [9]:
ratings = pd.read_csv("teaching_ratings.csv")
ratings.head()

Unnamed: 0,minority,age,female,onecredit,beauty,course_eval,intro,nnenglish
0,1.0,36.0,1.0,0,0.289916,4.3,0.0,0.0
1,0.0,59.0,0.0,0,-0.737732,4.5,0.0,0.0
2,0.0,51.0,0.0,0,-0.571984,3.7,0.0,0.0
3,0.0,40.0,1.0,0,-0.677963,4.3,0.0,0.0
4,0.0,31.0,1.0,0,1.509794,4.4,0.0,0.0


<!-- BEGIN QUESTION -->

**Question 2.a.**
Run a regression of `course_eval` on `beauty` using robust standard errors. What is the estimated slope? Is it statistically significant?

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q2_a
manual: true
-->

In [11]:
y_2a = ratings['course_eval']
X_2a = sm.add_constant(ratings['beauty'])
model_2a = sm.OLS(y_2a,X_2a)
results_2a = model_2a.fit(cov_type='HC1')
results_2a.summary()

0,1,2,3
Dep. Variable:,course_eval,R-squared:,0.036
Model:,OLS,Adj. R-squared:,0.034
Method:,Least Squares,F-statistic:,16.94
Date:,"Sat, 06 Mar 2021",Prob (F-statistic):,4.58e-05
Time:,17:26:17,Log-Likelihood:,-375.32
No. Observations:,463,AIC:,754.6
Df Residuals:,461,BIC:,762.9
Df Model:,1,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,3.9983,0.025,157.727,0.000,3.949,4.048
beauty,0.1330,0.032,4.115,0.000,0.070,0.196

0,1,2,3
Omnibus:,15.399,Durbin-Watson:,1.41
Prob(Omnibus):,0.0,Jarque-Bera (JB):,16.405
Skew:,-0.453,Prob(JB):,0.000274
Kurtosis:,2.831,Cond. No.,1.27


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.b.**
Explain.

<!--
BEGIN QUESTION
name: q2_b
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.c.**
Run a regression of `course_eval` on `beauty`, including some additional variables to control for the type of course and professor characteristics. In particular, include as additional regressors `intro`, `onecredit`, `female`, `minority`, and `nnenglish`. What is the estimated effect of `beauty` on `course_eval`? Does the regression in (a) suffer from important omitted variable bias (OVB)? What happens with the $R^2$? Based on the confidence interval from the regression, can you reject the null hypothesis that the effect of beauty is the same as in part (a)? What can you say about the effect of the new variables included?

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q2_c
manual: true
-->

In [12]:
y_2c = ratings['course_eval']
X_2c = sm.add_constant(ratings[['beauty','intro','onecredit','female','minority','nnenglish']])
model_2c = sm.OLS(y_2c,X_2c)
results_2c = model_2c.fit(cov_type='HC1')
results_2c.summary()

0,1,2,3
Dep. Variable:,course_eval,R-squared:,0.155
Model:,OLS,Adj. R-squared:,0.144
Method:,Least Squares,F-statistic:,17.03
Date:,"Sat, 06 Mar 2021",Prob (F-statistic):,8.67e-18
Time:,17:28:49,Log-Likelihood:,-344.85
No. Observations:,463,AIC:,703.7
Df Residuals:,456,BIC:,732.7
Df Model:,6,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,4.0683,0.037,109.926,0.000,3.996,4.141
beauty,0.1656,0.032,5.246,0.000,0.104,0.227
intro,0.0113,0.056,0.202,0.840,-0.099,0.121
onecredit,0.6345,0.108,5.871,0.000,0.423,0.846
female,-0.1735,0.049,-3.505,0.000,-0.270,-0.076
minority,-0.1666,0.067,-2.472,0.013,-0.299,-0.034
nnenglish,-0.2442,0.094,-2.608,0.009,-0.428,-0.061

0,1,2,3
Omnibus:,22.413,Durbin-Watson:,1.516
Prob(Omnibus):,0.0,Jarque-Bera (JB):,24.406
Skew:,-0.555,Prob(JB):,5.02e-06
Kurtosis:,3.179,Cond. No.,5.81


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.d.**
Explain.

<!--
BEGIN QUESTION
name: q2_d
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.e.**
Estimate the coefficient on beauty for the multiple regression model in (c) using the three-step process in Appendix 6.3 (the Frisch-Waugh theorem). Verify that the three-step process yields the same estimated coefficient for beauty as that obtained in (c). Comment.

*Hint: Recall that if your regression results are called `results`, you could get the residuals using `results.resid`.*

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q2_e
manual: true
-->

In [14]:
# Do the first step here (regress the outcome variable on covariates)
course_eval = ...
covariates = ...
model_eval_on_covariates = ...
results_eval = ...
eval_residuals = ...

# Do the second step here (regress the explanatory variable on covariates)
beauty = ...
model_beauty_on_covariates = ...
results_beauty = ...
beauty_residuals = ...

# Do the last step here (regress the outcome variable's residuals on the explanatory variable's residuals)
model_fw = ...
results_fw = ...
results_fw.summary()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.f.**
Explain.

<!--
BEGIN QUESTION
name: q2_f
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.g.**
Professor Smith is a black male with average beauty and is a native English speaker. He teaches a three-credit upper-division course. Predict Professor Smith’s course evaluation.
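A prediction from a linear regression is the sum of each coefficient times the matching regressor value, with a 1 for the constant. The sketch below shows the arithmetic with hypothetical coefficients and regressor names; none of the numbers come from the regressions in this notebook.

```python
import numpy as np

# Hypothetical coefficients, in the order: const, x_a, x_b (not estimates from this notebook).
coefs = np.array([1.0, 0.5, -0.25])

# One row of regressor values in the same order; the leading 1 multiplies the constant.
x_new = np.array([1.0, 2.0, 1.0])

prediction = coefs @ x_new   # 1.0*1 + 0.5*2 + (-0.25)*1
print(prediction)
```

With a fitted `statsmodels` model, `results.params` holds the estimated coefficients in the same order as the columns of the design matrix, so the same dot product applies.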

<!--
BEGIN QUESTION
name: q2_g
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

## Problem 3. Education and Distance to College

The file `college_distance.csv` contains data from a random sample of high school seniors interviewed in 1980 and re-interviewed in 1986. A detailed description is given in `college_distance_description.pdf`, which will be shared on Piazza and bCourses. In this exercise, you will use these data to investigate the relationship between the number of completed years of education for young adults and the distance from each student’s high school to the nearest four-year college.

In [13]:
dist = pd.read_csv("college_distance.csv")
dist.head()

Unnamed: 0,female,black,hispanic,bytest,dadcoll,momcoll,ownhome,urban,cue80,stwmfg80,dist,tuition,yrsed,incomehi
0,0.0,0.0,0.0,39.15,1.0,0.0,1.0,1.0,6.2,8.09,0.2,0.88915,12.0,1.0
1,1.0,0.0,0.0,48.87,0.0,0.0,1.0,1.0,6.2,8.09,0.2,0.88915,12.0,0.0
2,0.0,0.0,0.0,48.74,0.0,0.0,1.0,1.0,6.2,8.09,0.2,0.88915,12.0,0.0
3,0.0,1.0,0.0,40.4,0.0,0.0,1.0,1.0,6.2,8.09,0.2,0.88915,12.0,0.0
4,1.0,0.0,0.0,40.48,0.0,0.0,0.0,1.0,5.6,8.09,0.4,0.88915,13.0,0.0


<!-- BEGIN QUESTION -->

**Question 3.a.**
What do you expect for the sign of the relationship and what mechanism can you think about to explain it?

<!--
BEGIN QUESTION
name: q3_a
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.b.**
Run a regression of years of completed education (`yrsed`) on distance to the nearest college (`dist`), measured in tens of miles (for example, `dist` = 2 means that the distance is 20 miles). What is the estimated slope? Is it statistically significant? Does distance to college explain a large fraction of the variance in educational attainment across individuals? Explain.

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q3_b
manual: true
-->

In [16]:
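# Note: Question 3.b asks for `dist` as the only regressor; this cell also includes the controls listed below.
regressors_3b = ['dist','bytest','female','black','hispanic','incomehi','ownhome','dadcoll','momcoll','cue80','stwmfg80']
model_3b = sm.OLS(dist['yrsed'], sm.add_constant(dist[regressors_3b]))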
results_3b = model_3b.fit(cov_type='HC1')
results_3b.summary()

0,1,2,3
Dep. Variable:,yrsed,R-squared:,0.283
Model:,OLS,Adj. R-squared:,0.281
Method:,Least Squares,F-statistic:,183.5
Date:,"Sat, 06 Mar 2021",Prob (F-statistic):,0.0
Time:,17:38:58,Log-Likelihood:,-7015.1
No. Observations:,3796,AIC:,14050.0
Df Residuals:,3784,BIC:,14130.0
Df Model:,11,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,8.8614,0.241,36.757,0.000,8.389,9.334
dist,-0.0308,0.012,-2.651,0.008,-0.054,-0.008
bytest,0.0924,0.003,30.806,0.000,0.087,0.098
female,0.1434,0.050,2.851,0.004,0.045,0.242
black,0.3538,0.067,5.242,0.000,0.222,0.486
hispanic,0.4024,0.074,5.457,0.000,0.258,0.547
incomehi,0.3666,0.062,5.890,0.000,0.245,0.489
ownhome,0.1456,0.065,2.247,0.025,0.019,0.273
dadcoll,0.5699,0.076,7.474,0.000,0.420,0.719

0,1,2,3
Omnibus:,116.663,Durbin-Watson:,1.928
Prob(Omnibus):,0.0,Jarque-Bera (JB):,98.499
Skew:,0.326,Prob(JB):,4.08e-22
Kurtosis:,2.554,Cond. No.,539.0


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.c.**
Explain.

<!--
BEGIN QUESTION
name: q3_c
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.d.**
Now run a regression of `yrsed` on `dist`, but include some additional regressors to control for characteristics of the student, the student’s family, and the local labor market. In particular, include as additional regressors: `bytest`, `female`, `black`, `hispanic`, `incomehi`, `ownhome`, `dadcoll`, `cue80`, and `stwmfg80`. What is the estimated effect of `dist` on `yrsed`? Is it substantively different from the regression in (b)? Based on this, does the regression in (b) seem to suffer from important omitted variable bias?

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q3_d
manual: true
-->

In [17]:
y_3d = ...
X_3d = ...
model_3d = ...
results_3d = ...
results_3d.summary()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.e.**
Explain.

<!--
BEGIN QUESTION
name: q3_e
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.f.**
The value of the coefficient on `dadcoll` is positive. What does this coefficient measure?
Interpret this effect.

<!--
BEGIN QUESTION
name: q3_f
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.g.** Explain why `cue80` and `stwmfg80` appear in the regression. Are the signs of their estimated coefficients what you would have believed? Explain.

<!--
BEGIN QUESTION
name: q3_g
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.h.**
Bob is a black male. His high school was 20 miles from the nearest college. His base-year composite test score (`bytest`) was 58. His family income in 1980 was \\$26,000, and his family owned a home. His mother attended college, but his father did not. The unemployment rate in his county was 7.5%, and the state average manufacturing hourly wage was \\$9.75. Predict Bob’s years of completed schooling using the regression in (d).

<!--
BEGIN QUESTION
name: q3_h
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

## Problem 4. The Sheepskin Effect

The table of results that will be shared on Piazza and bCourses is copied from a paper by Jaeger and Page (1996) entitled “Degrees Matter: New Evidence on Sheepskin Effects in the Returns to Education,” The Review of Economics and Statistics. The question is whether employers pay in relation to years of education or whether there is an additional premium for obtaining a degree. Such a premium might be called the “sheepskin effect” (because diplomas at one time were printed on a sheet of sheepskin) or the “diploma effect.” The Jaeger and Page paper estimates the magnitude of this effect. Note: an empty cell in the table means that variable is not included in the regression.

<!-- BEGIN QUESTION -->

**Question 4.a.**
Why do you think Jaeger and Page estimate their model using only people of a single race and gender (in this particular case the sample consists of white males)?

<!--
BEGIN QUESTION
name: q4_a
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.b.**
Look at column (3) of the table. In words, interpret the coefficient on the dummy variable “9”.

*Hint: Note that “12” is the omitted category.*

<!--
BEGIN QUESTION
name: q4_b
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.c.**
Why do you think the effect of the 14th year of education is larger than that of the 15th?

<!--
BEGIN QUESTION
name: q4_c
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.d.**
Now look at column (4). Think about a student who is currently a senior. What is the average difference between the student's wage now and the wage the student could get at the end of the year, following graduation?

<!--
BEGIN QUESTION
name: q4_d
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.e.**
Based on the results presented in column (4), would you rather choose to complete a PhD or a professional degree? Explain.

<!--
BEGIN QUESTION
name: q4_e
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.f.**
Using the results from columns (3) and (4), how would you test the presence of a “diploma effect”? Carry out the test at a 5% significance level.

*Hint: You may find some of the information you need in the footnote of the table.*
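One possible approach, assuming the table footnote reports enough to obtain the $R^2$ of both specifications and that a homoskedasticity-only test is acceptable, is the F-statistic that compares a restricted model with an unrestricted one. The function name and all inputs below are hypothetical; they are not values from the Jaeger and Page table.

```python
def f_stat_from_r2(r2_unrestricted, r2_restricted, q, n, k_unrestricted):
    """Homoskedasticity-only F-statistic for q restrictions.

    n is the sample size and k_unrestricted the number of regressors
    (excluding the constant) in the unrestricted model.
    """
    numerator = (r2_unrestricted - r2_restricted) / q
    denominator = (1 - r2_unrestricted) / (n - k_unrestricted - 1)
    return numerator / denominator

# Hypothetical inputs, not values from the Jaeger and Page table.
f_demo = f_stat_from_r2(r2_unrestricted=0.5, r2_restricted=0.4, q=2, n=103, k_unrestricted=2)
print(f_demo)
```

The resulting statistic would be compared with the 5% critical value of the F distribution with `q` numerator degrees of freedom. This comparison is only valid when both regressions use the same dependent variable and the same sample.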

<!--
BEGIN QUESTION
name: q4_f
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a PDF file for you to submit. **Please save before exporting!**

In [4]:
# Save your notebook first, then run this cell to export your submission.
grader.to_pdf(pagebreaks=False, display_link=True)
