In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("ps2.ipynb")

# Econ 140 – Problem Set 2

In this problem set we will be conducting a variety of single-variable linear regressions. 
There are many ways to do linear regressions in Python; we will be using a package called `statsmodels` (which we import as `sm`) and its `OLS` class.
[Here](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html) is the documentation; you may find the examples especially helpful if you are stuck.
We'll go through an example of how to use `sm.OLS` below.

Suppose we have a dataset called `df` that has two columns of observations, one called `wage` and the other `educ`, and suppose we want to regress `wage` onto `educ` and a constant.
To do so, we would first identify the endogenous (dependent) variable and the exogenous (independent) variables.

```python
y = df['wage']
X = df['educ']
```

To add an intercept term to our model, we must add a column of 1s to our independent variable using `sm.add_constant`. This returns a two-column table in which one column contains only the value 1.

```python
X = sm.add_constant(df['educ'])
```

Next, we will pass in our endogenous and exogenous variables (in that order) to `sm.OLS`, which will create an OLS model. Make sure to store the model!


```python
my_ols_model = sm.OLS(y, X)
```

So far, we have initialized our model but have not actually fitted it. To do so, we call the model's `fit` method and store the result.
To use robust standard errors, we also have to pass in the argument `cov_type = 'HC1'`.

```python
results = my_ols_model.fit(cov_type = 'HC1')
```

Lastly, to display the fitted results, run `results.summary()` as the last line of your cell.
In the middle table of the output, you should see something like this.
![](statsmodels_example.jpeg)

Before getting started on the assignment, run the cell at the very top that imports `otter` and the cell below which will import the packages we need.

**Important:** As mentioned in problem set 0, if you leave this notebook alone for a while and come back, to save memory datahub will "forget" which code cells you have run, and you may need to restart your kernel and run all of the cells from the top. That includes this code cell that imports packages. If you get `<something> not defined` errors, this is because you didn't run an earlier code cell that you needed to run. It might be this cell or the `otter` cell above.

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats

## Problem 1. Bivariate Linear Regression

In this question we create a synthetic dataset using random number generation commands. This time we create two random variables that are related to one another, and we fit that relationship using a bivariate linear regression. The beauty of this approach is that we know the population parameters because we pick them when generating the data. We can then check to see how well least squares estimation performs.

<!-- BEGIN QUESTION -->

**Question 1.a.**
Begin by specifying that there are 100 observations and generate the regressor to be $x = 10 + 20v$, where $v$ is a uniform random variable on the unit interval. 
As a result, $x$ is a random variable uniformly distributed on the interval $[10, 30]$. 
Next specify the dependent variable to be linearly related to this regressor according to $y = 30 + 5x + u$, where $u$ is a random draw from a normal distribution with population mean 0 and population standard deviation 100. 
Then, generate a scatter plot of $x$ and $y$.

*Hint*: You may want to check out [`np.random.random_sample`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.random_sample.html) to generate $v$. 
You also may want to check out [`np.random.normal`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html) to generate $u$.
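
As a rough illustration of how these two functions are called, consider the sketch below. The sample size of 5, the interval $[2, 6)$, and the normal parameters are arbitrary choices for demonstration, not the values this question asks for.

```python
import numpy as np

np.random.seed(0)  # optional: makes the draws reproducible

# Five draws from a uniform distribution on [0, 1).
v_demo = np.random.random_sample(5)

# Shifting and scaling gives a uniform variable on [2, 6).
x_demo = 2 + 4 * v_demo

# Five draws from a normal distribution; the arguments are
# (mean, standard deviation, number of draws).
u_demo = np.random.normal(0, 3, 5)

print(x_demo)
print(u_demo)
```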

<!--
BEGIN QUESTION
name: q1_a
manual: true
-->

In [None]:
v = np.random.random_sample(...)
x = ...
u = np.random.normal(..., ..., ...)
y = ...

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y");

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.b.**
Next, regress $y$ on $x$ (calling for robust standard errors). Is each of the three OLS assumptions satisfied in this case? Explain why for each one. Give your assessment of how well least squares regression performs in estimating the true intercept and slope.

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q1_b
manual: true
-->

In [None]:
X_1b = sm.add_constant(...)
model_1b = sm.OLS(..., ...)
results_1b = model_1b.fit(...)
results_1b.summary()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.c.**
Explain.

<!--
BEGIN QUESTION
name: q1_c
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



Below we have printed out for you the square root of the mean squared error of the residuals. This is another term for the standard error of the regression.
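
To make the definition concrete: with $n$ observations and $k$ estimated coefficients (including the intercept), the standard error of the regression is $\sqrt{SSR/(n-k)}$, where $SSR$ is the sum of squared residuals. The sketch below computes it by hand with NumPy on made-up data (the variable names and numbers are illustrative assumptions). In `statsmodels`, `mse_resid` is the same $SSR/(n-k)$ quantity.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x_demo = rng.uniform(0, 10, size=n)
y_demo = 2 + 3 * x_demo + rng.normal(0, 4, size=n)   # true error s.d. is 4

# Design matrix with a column of ones for the intercept.
X_demo = np.column_stack([np.ones(n), x_demo])
beta_hat, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

resid = y_demo - X_demo @ beta_hat
k = X_demo.shape[1]                        # number of estimated coefficients (2 here)
ser = np.sqrt(np.sum(resid**2) / (n - k))  # standard error of the regression

print(ser)  # should be in the neighborhood of the true error s.d. of 4
```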

In [None]:
results_1b.mse_resid ** 0.5

<!-- BEGIN QUESTION -->

**Question 1.d.**
Looking at the results of this regression, including the number shown above, assess how close the least squares estimate of the error variance is to the true variance of the error term.

<!--
BEGIN QUESTION
name: q1_d
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.e.**
Generate the regression residuals and confirm they add up to zero. Also, confirm that the residuals are uncorrelated with the regressor.

*Hint: The command `results_1b.resid` will give you an array of the residuals of the regression. The function `np.sum()` takes an array as an argument inside the parentheses and sums all of its elements. Remember that `results_1b.resid` is an array. Also, the function `np.corrcoef()` takes in two arrays of equal length, separated by a comma, and computes the correlation matrix of the two arrays. For example, usage might look like `np.corrcoef(array1, array2)`.*
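
For instance, on two small made-up arrays (not the regression output), the two functions behave as follows. Note that `np.corrcoef` returns a 2×2 matrix; the correlation between the two arrays is the off-diagonal entry.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])   # b is exactly 2 * a

total = np.sum(a)                    # 1 + 2 + 3 + 4
corr_matrix = np.corrcoef(a, b)      # 2x2 correlation matrix
corr_ab = corr_matrix[0, 1]          # off-diagonal entry: correlation of a with b

print(total)     # 10.0
print(corr_ab)   # 1.0 up to floating-point rounding, since b is a linear function of a
```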

<!--
BEGIN QUESTION
name: q1_e
manual: true
-->

In [None]:
sum_of_residuals = np.sum(...)
print("Sum of residuals: ", sum_of_residuals)
np.corrcoef(..., ...)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.f.**
Now generate the variables $x$ and $y$ as you did above but do it for $n = 1000$ observations. Run the regression of $y$ on $x$ and compare the results with the earlier case of $n = 100$. Explain the differences.

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q1_f
manual: true
-->

In [None]:
v_1000 = np.random.random_sample(...)
x_1000 = ...
u_1000 = np.random.normal(..., ..., ...)
y_1000 = ...

X_1f = ...
model_1f = sm.OLS(..., ...)
results_1f = model_1f.fit(...)
results_1f.summary()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.h.**
Explain.

<!--
BEGIN QUESTION
name: q1_h
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

## Problem 2. Wages and Education

For this problem we will use the dataset `wages.csv`. This dataset contains information on about 300 American workers. It includes their average monthly wage (`wage`), gender (`male`) and completed years of formal education (`educ`). You suspect (hope?) that people with higher educational attainment earn more on average.

In [None]:
wages = pd.read_csv("wages.csv")
wages.head()

<!-- BEGIN QUESTION -->

**Question 2.a.**
Plot a scatter diagram of the average monthly wage against education level. Does it confirm your intuition? What differences do you see between individuals who did not complete high school and those who did?

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q2_a
manual: true
-->

In [None]:
plt.scatter(..., ...)
plt.xlabel("educ")
plt.ylabel("wage")
plt.title("Wages vs. Education Level");

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.b.**
Explain.

<!--
BEGIN QUESTION
name: q2_b
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.c.**
Perform an OLS regression of wages on education. Be sure to include the robust option. Give a precise interpretation of the least squares estimate of the intercept and evaluate its sign, size and statistical significance. Does its value make economic sense? Do the same for the least squares estimate of the slope. Does this slope estimate confirm the scatter plot above?

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q2_c
manual: true
-->

In [None]:
y_2c = ...
X_2c = sm.add_constant(...)
model_2c = sm.OLS(..., ...)
results_2c = ...
results_2c.summary()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.d.**
Explain.

<!--
BEGIN QUESTION
name: q2_d
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.e.**
List the three OLS assumptions and give a concrete example of when each of those would hold in this context. Are these assumptions plausible in this context?

<!--
BEGIN QUESTION
name: q2_e
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.f.**
You are rightfully concerned whether education will, in fact, be rewarded in the labor market. You wonder if another year of education will yield an expected \\$100 more per month (which if discounted over a typical working lifetime at say, 5\%, amounts to roughly a year at Berkeley). Test the following null hypothesis:
$H_0: \beta_1 = 100$ vs $H_1: \beta_1 \neq 100$.
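
As a sketch of the mechanics only: the test statistic is $t = (\hat{\beta}_1 - 100)/SE(\hat{\beta}_1)$, which in a large sample is compared against the standard normal distribution. The slope and standard error below are hypothetical placeholders, not results from `wages.csv`; substitute the values from your own regression.

```python
import math

beta1_hat = 80.0     # hypothetical slope estimate (placeholder)
se_beta1 = 12.5      # hypothetical robust standard error (placeholder)
beta1_null = 100.0   # value of the slope under H0

t_stat = (beta1_hat - beta1_null) / se_beta1

# Two-sided p-value under the standard normal approximation:
# P(|Z| > |t|) = erfc(|t| / sqrt(2)).
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

print(t_stat)
print(round(p_value, 4))
```

With these placeholder numbers, $|t| = 1.6 < 1.96$, so the null hypothesis would not be rejected at the 5\% level.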

<!--
BEGIN QUESTION
name: q2_f
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.g.**
Let’s now return to a familiar empirical question: do men and women earn the same amount? As in part (a) above, generate a scatterplot of `wage` against the dummy variable `male`. Don't forget to label your axes! What is your answer to the question based on this graph?

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q2_g
manual: true
-->

In [None]:
plt.scatter(..., ...)
plt.xlabel(...)
plt.ylabel(...)
plt.title(...);

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.h.**
Explain.

<!--
BEGIN QUESTION
name: q2_h
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.i.**
Run an OLS regression of `wage` on `male`. Provide a precise interpretation of the slope. Do you believe you have found evidence of wage discrimination in this data, or do you believe there is another explanation for the differences? Explain.

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q2_i
manual: true
-->

In [None]:
y_2i = ...
X_2i = ...
model_2i = ...
results_2i = ...
results_2i.summary()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.j.**
Explain.

<!--
BEGIN QUESTION
name: q2_j
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.k.**
As we did in problem set 1, perform a t-test of a difference in wages between men and women and report the t-stat and p-value. Compare the output of that test with the regression results you got using the male dummy. To make the two results (in terms of t-stat and p-value) correspond, do you assume equal or unequal variance of men’s and women’s wages?
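
As a reminder of the API, `stats.ttest_ind` takes two samples plus an `equal_var` flag: `equal_var=True` (the default) runs the pooled-variance Student's t-test, while `equal_var=False` runs Welch's unequal-variance test. The two groups below are made-up numbers, used purely to show the call.

```python
import numpy as np
from scipy import stats

# Hypothetical samples with different sizes and different spreads.
group_a = np.array([10.0, 12.0, 11.0, 13.0, 9.0])
group_b = np.array([14.0, 18.0, 10.0, 20.0, 13.0, 15.0])

pooled = stats.ttest_ind(group_a, group_b, equal_var=True)    # pooled variance
welch = stats.ttest_ind(group_a, group_b, equal_var=False)    # Welch's t-test

print("pooled:", pooled.statistic, pooled.pvalue)
print("welch: ", welch.statistic, welch.pvalue)
```

The two versions generally give different t-statistics and p-values when the group sizes and sample variances differ, which is why the choice matters for the comparison this question asks about.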

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q2_k
manual: true
-->

In [None]:
wages_men = ...
wages_women = ...

ttest_2k = stats.ttest_ind(..., ..., ...)

tstat_2k = ttest_2k.statistic
pval_2k = ttest_2k.pvalue

print("t-stat: {}".format(tstat_2k))
print("p-value: {}".format(pval_2k))

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.l.**
Explain.

<!--
BEGIN QUESTION
name: q2_l
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

## Problem 3. Wine Prices and Vintage

Suppose you are interested in the relationship between the price of wine bottles and their vintage (a vintage wine is one made from grapes harvested in a specific year), and you write the following model: $price_i = \beta_0 + \beta_1 vintage_i + u_i$, where $price_i$ is expressed in dollars, $vintage_i$ in years (i.e., 1 if the grapes were harvested one year ago, 2 if the grapes were harvested two years ago, etc.), $u_i$ are the error terms, and $i$ indexes the bottles. Assume a very large sample size (on the order of tens of thousands of bottles).

<!-- BEGIN QUESTION -->

**Question 3.a.**
What is contained in the error term? Provide a couple of examples. Do you think that the first OLS assumption is plausible in this context?

<!--
BEGIN QUESTION
name: q3_a
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.b.**
Suppose you estimate your model via OLS and you obtain the following estimated coefficients (standard errors are reported in parentheses), with $R^2 = 0.77$:
$$price_i = \underset{(2.57)}{1.75} + \underset{(1.02)}{5.5} vintage_i + \hat{u}_i$$

Interpret the regression coefficients.

<!--
BEGIN QUESTION
name: q3_b
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.c.**
Comment on the $R^2$. Given this statistic what can you infer about causality in the relationship of prices and vintage?

<!--
BEGIN QUESTION
name: q3_c
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.d.**
Compute the predicted price of a bottle whose grapes were harvested ten years ago and of a bottle whose grapes were harvested nine years ago; then compute the difference between the two values.

<!--
BEGIN QUESTION
name: q3_d
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.e.**
Derive the marginal effect of a one-year increase in vintage on price. Do you get the same result as in part (d)? Why? Explain.

<!--
BEGIN QUESTION
name: q3_e
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.f.**
Using the results above, give a 95\% confidence interval for the difference in average price between a ten-year-old bottle and a five-year-old bottle. Can you reject the null hypothesis that this difference is \\$40?

<!--
BEGIN QUESTION
name: q3_f
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

## Problem 4. Family Size and Consumption

The 2015 Nobel Prize winner, Prof. Angus Deaton of Princeton, spent a lifetime studying the consumption behavior of individuals and households, in contrast to the earlier tradition of modeling aggregate consumption. You will follow in his footsteps in this problem by examining the role of family size in consumption patterns. In particular, you will examine how food expenditures are related to the size of the household. It is hypothesized that as family size increases (e.g., people move in together), economies of scale are realized. We might therefore expect per capita food consumption to increase with household size, especially in poor households, where food expenditures are at a bare minimum. To do your research, you will work with a selection from the 2014 U.S. Consumer Expenditure Survey covering 1,000 U.S. households (`ces.csv`). A few of the key variables from the data file are described in the table below.

| Variable     | Description                                                   |
|--------------|---------------------------------------------------------------|
| age_ref      | age of reference person                                       |
| fam_size     | number of members in household                                |
| no_earnr     | number of earners                                             |
| totexppq     | total expenditures during previous quarter                    |
| foodpq       | total food expenditures during previous quarter               |
| fractearners | fraction of adults in household who work                      |
| ratioover64  | ratio of family members older than 64 to total family size    |
| ratioless18  | ratio of family members younger than 18 to total family size  |
| rationless2  | ratio of family members younger than 2 to total family size   |

In [None]:
ces = pd.read_csv("ces.csv")
ces.head()

<!-- BEGIN QUESTION -->

**Question 4.a.**
Since we want to see what happens to the share of expenditures spent on food, create the variable `foodshare` = `foodpq`/`totexppq`. Run a regression of food share on family size. What is the interpretation of the estimated coefficient on family size? Is it statistically and economically significant? Do your findings support the theory that large families can enjoy economies of scale (e.g., house, TV, etc.) and allocate more of their expenses to food?

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q4_a
manual: true
-->

In [None]:
ces['foodshare'] = ...
y_4a = ...
X_4a = ...
model_4a = ...
results_4a = ...
results_4a.summary()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.b.**
Explain.

<!--
BEGIN QUESTION
name: q4_b
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.c.**
What is the predicted share of expenditures spent on food for a single mother with two kids?

<!--
BEGIN QUESTION
name: q4_c
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.d.**
Now regress food share on the logarithm of family size. Do the regression results differ? How does the interpretation of the coefficient on log family size differ from the prior regression?

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q4_d
manual: true
-->

In [None]:
ces['log_fam_size'] = ...
y_4d = ...
X_4d = ...
model_4d = ...
results_4d = ...
results_4d.summary()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.e.**
Explain.

<!--
BEGIN QUESTION
name: q4_e
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.f.**
The $R^2$ is pretty small for both of the above regressions. Does this cast doubt on whether there is a relationship between family size and food share? Explain.

<!--
BEGIN QUESTION
name: q4_f
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.g.**
The theory applies in particular to poor households whose food expenses are at a bare minimum. Rerun the same regression for families whose expenditure per capita is less than \\$3,000. Does that change your answer to the previous question?

*Hint: First you may need to create a new per capita expenditure variable.*
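
A minimal sketch of the two pandas operations involved: creating a column from two existing ones, and keeping only the rows that satisfy a condition. The toy DataFrame, its column names, and the cutoff of 85 are hypothetical and unrelated to `ces.csv`.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "total": [100.0, 240.0, 90.0, 400.0],
    "members": [2, 3, 1, 4],
})

# New column: element-wise ratio of two existing columns.
toy["per_member"] = toy["total"] / toy["members"]

# Boolean mask: keep only the rows where the new column is below the cutoff.
toy_small = toy[toy["per_member"] < 85]

print(toy_small)
```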

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q4_g
manual: true
-->

In [None]:
ces['exp_pc'] = ...
ces_3000 = ...
y_4g = ...
X_4g = ...
model_4g = ...
results_4g = ...
results_4g.summary()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.h.**
Explain.

<!--
BEGIN QUESTION
name: q4_h
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.i.**
Now regress expenditure per capita on family size and interpret the coefficient. What does this tell you about the validity of your former results?

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q4_i
manual: true
-->

In [None]:
y_4i = ...
X_4i = ...
model_4i = ...
results_4i = ...
results_4i.summary()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.j.**
Explain.

<!--
BEGIN QUESTION
name: q4_j
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a PDF of your notebook for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.to_pdf(pagebreaks=False, display_link=True)