---
title: "Topics in Econometrics and Data Science: Tutorial 7"

---

#### General Note

You will very likely find solutions to these exercises online. We, however, strongly encourage you to work on these exercises without consulting them. Understanding someone else’s solution is very different from coming up with your own. Use the lecture notes and try to solve the exercises independently.

# Section 2: Linear Regression

## Exercise 1: Linear Regression: Prediction

We consider a regression problem where we want to predict our dependent variable $Y$ in terms of the explanatory variable $X$. In this exercise we try to create an optimal prediction of $Y$ given $X$.

### A)

1. Load the [`prediction.csv`](https://alexandragibbon.github.io/StatProg-HHU/data/prediction.csv) data set and assess its structure. \
\
**Hint**: You can use [`pandas.DataFrame.shape`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) to assess the structure of the data set.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

In [None]:
os.chdir("[INSERT YOUR PATH HERE!]")

data = pd.read_csv('data/prediction.csv', sep=',', na_values=".")
data.head()

Unnamed: 0,x,y
0,1.047198,9.242999
1,1.186824,6.86105
2,1.32645,-2.258883
3,1.466077,3.645338
4,1.605703,5.755031


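2. Generate a scatter plot of the $Y$ values against the $X$ values using [`plt.figure`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html) to create the figure and [`plt.scatter`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) to draw the points.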
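
As a rough sketch of the two steps in part A, the following cell inspects the structure of a data frame and draws a scatter plot. It uses simulated data, not `prediction.csv`; the name `toy`, the sine-shaped relationship, and the noise level are assumptions made purely for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulated stand-in for the exercise data (assumption: NOT the contents of prediction.csv)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": np.linspace(-2, 2, 50)})
toy["y"] = 1 + 2 * np.sin(2 * toy["x"]) + rng.normal(size=len(toy))

# shape is an attribute (no parentheses) and returns (number of rows, number of columns)
print(toy.shape)
print(toy.dtypes)

# Create a figure, then draw y against x
fig = plt.figure(figsize=(6, 4))
plt.scatter(toy["x"], toy["y"], label="observations")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
# In a notebook the figure is displayed at the end of the cell; in a script, call plt.show()
```
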

### B)

Now, we want to run a linear regression using the [`sklearn`](https://scikit-learn.org/stable/) package and add the regression line to the scatter plot. Then, we compute the in-sample Mean Squared Error (MSE) in prediction. \
\
**Hint:** *in-sample* means that we estimate the model and evaluate its performance using the same data.

1. To solve this task, first, save `x` and `y` in two separate data frames using [`pd.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

2. Run a linear regression using [`sklearn.linear_model.LinearRegression`](https://scikit-learn.org/dev/modules/generated/sklearn.linear_model.LinearRegression.html). \
\
**Hint**: First, fit the linear model (`sklearn.linear_model.LinearRegression.fit`) using the entire data set ($X$ and $Y$). Then, use the estimated/fitted linear model to predict $Y$ (`sklearn.linear_model.LinearRegression.predict`) based on the same $X$ data that was used to estimate/fit the model. 

In [1]:
from sklearn.linear_model import LinearRegression
linmod = LinearRegression()

3. Add the regression line to the scatter plot created in part A.

4. Compute the in-sample MSE in prediction.
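
The fit, predict, and MSE steps of part B might look roughly as follows. This is a sketch on simulated data; `x_toy`, `y_toy`, and `linmod_toy` are hypothetical names, and the data-generating process is an assumption. The in-sample MSE is the average squared difference between the observed and fitted values, $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Simulated data (assumption: NOT the contents of prediction.csv)
rng = np.random.default_rng(0)
grid = np.linspace(-2, 2, 50)
x_toy = pd.DataFrame({"x": grid})
y_toy = pd.DataFrame({"y": 1 + 2 * np.sin(2 * grid) + rng.normal(size=50)})

# Fit the linear model and predict on the same data (in-sample)
linmod_toy = LinearRegression()
linmod_toy.fit(x_toy, y_toy)
y_hat = linmod_toy.predict(x_toy)  # array of shape (50, 1) because y_toy has one column

# In-sample MSE, computed by hand and with scikit-learn's helper
mse_manual = float(np.mean((y_toy.to_numpy() - y_hat) ** 2))
mse_sklearn = mean_squared_error(y_toy, y_hat)
print(mse_manual, mse_sklearn)
```

To draw the regression line, the fitted values `y_hat` can be passed to `plt.plot` together with the `x` values on the same figure as the scatter plot.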

### C)

In part C, we evaluate how well the model estimated in part B predicts the new observations provided in the data set `predictiontest.csv`. Compute the out-of-sample MSE in prediction and illustrate your results in the scatter plot. \
\
**Hint:** *out-of-sample* means that the model is estimated using one data set and its performance is evaluated on a different data set.

Load the [`predictiontest.csv`](https://alexandragibbon.github.io/StatProg-HHU/data/predictiontest.csv) data set:

In [None]:
os.chdir("[INSERT YOUR PATH HERE!]")

datatest = pd.read_csv('data/predictiontest.csv', sep=',', na_values=".")
datatest.head()

Unnamed: 0,xtest,ytest
0,1.047198,5.808596
1,1.186824,1.27423
2,1.32645,4.892476
3,1.466077,-3.124069
4,1.605703,6.282259


In [None]:
datatest.shape

(49, 2)

1. First, rename the columns of `datatest` to `x` and `y` using [`pandas.DataFrame.rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html#pandas-dataframe-rename).

2. Save the columns `x` and `y` in two separate data frames using [`pd.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and name them, e.g., `xtest` and `ytest`.

3. Then, generate out-of-sample predictions. For this, use the estimated model from part B and predict the new observations provided in `predictiontest.csv`. \
\
**Hint:** Use `sklearn.linear_model.LinearRegression.predict`.

4. Create a scatter plot showing the test data and the predicted values.

5. Compute the out-of-sample MSE in prediction.
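
One way the out-of-sample steps could fit together is sketched below, again on simulated data. The frames `train` and `test_raw` and the model name `model` are hypothetical. Renaming the test columns to `x` and `y` keeps the feature name identical to the one used when fitting, which recent scikit-learn versions check when a model fitted on a data frame is asked to predict.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Simulated training and test samples (assumption: NOT the exercise data)
rng = np.random.default_rng(1)
grid = np.linspace(-2, 2, 50)
train = pd.DataFrame({"x": grid, "y": 1 + 2 * np.sin(2 * grid) + rng.normal(size=50)})
test_raw = pd.DataFrame({"xtest": grid, "ytest": 1 + 2 * np.sin(2 * grid) + rng.normal(size=50)})

# Rename the test columns so they match the names used for fitting
test = test_raw.rename(columns={"xtest": "x", "ytest": "y"})

# Estimate on the training sample only
model = LinearRegression().fit(train[["x"]], train[["y"]])

# Predict the new observations and evaluate on them
y_pred_oos = model.predict(test[["x"]])
mse_oos = mean_squared_error(test[["y"]], y_pred_oos)
mse_ins = mean_squared_error(train[["y"]], model.predict(train[["x"]]))
print(mse_ins, mse_oos)
```
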

### D) 

In part D, we try to improve the predictive performance by including high-order polynomials of the variable $X$ in our regression. Then, we add the predictions from the polynomial model to the scatter plot. \
Do this for both the in-sample predictions and the out-of-sample predictions.

1. Transform `x` into polynomial features of, e.g., degree $q=5$ using [`sklearn.preprocessing.PolynomialFeatures`](https://scikit-learn.org/0.18/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) and its method `fit_transform`.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

2. Convert your transformed `x` into a data frame using `pandas.DataFrame` and display the first few rows of the transformed data.

3. Then, estimate/fit the model based on your transformed `x` data and `y`.

4. Generate the in-sample and out-of-sample predictions. \
\
**Hint:** Before you can generate the out-of-sample predictions, you need to generate transformations (i.e. high-order polynomials) for the x-values (`xtest`) of the test sample.

5. Compute the in-sample and out-of-sample MSE.

6. Add the predictions from the polynomial model to the scatter plot. For this, create two scatter plots: one for the in-sample predictions and one for the out-of-sample predictions. \
\
**Hint:** The first scatter plot shows the training data (based on the data set `prediction.csv`), the in-sample predicted values of the linear model of task B and the in-sample predicted values of the polynomial model. The second scatter plot shows the test data (based on the data set `predictiontest.csv`), the out-of-sample predicted values of the linear model of task C and the out-of-sample predicted values of the polynomial model.
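
The polynomial workflow of part D can be sketched as below. The data are simulated and all variable names (`x_train`, `x_test`, `polymod`, and so on) are hypothetical. The key point is that the transformer is fitted on the training `x` with `fit_transform` and then applied to the test `x` with `transform`, so both samples get the same columns $1, x, x^2, \dots, x^5$.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Simulated training and test samples (assumption: NOT the exercise data)
rng = np.random.default_rng(2)
grid = np.linspace(-2, 2, 50)
x_train = pd.DataFrame({"x": grid})
y_train = pd.DataFrame({"y": 1 + 2 * np.sin(2 * grid) + rng.normal(size=50)})
x_test = pd.DataFrame({"x": grid})
y_test = pd.DataFrame({"y": 1 + 2 * np.sin(2 * grid) + rng.normal(size=50)})

# Polynomial features of degree 5 (includes a constant column by default)
poly = PolynomialFeatures(degree=5)
X_train_poly = poly.fit_transform(x_train)
X_test_poly = poly.transform(x_test)
print(pd.DataFrame(X_train_poly).head())

# Fit on the transformed training data, then predict in-sample and out-of-sample
polymod = LinearRegression().fit(X_train_poly, y_train)
mse_ins_poly = mean_squared_error(y_train, polymod.predict(X_train_poly))
mse_oos_poly = mean_squared_error(y_test, polymod.predict(X_test_poly))

# Linear benchmark for comparison
linmod_toy = LinearRegression().fit(x_train, y_train)
mse_ins_lin = mean_squared_error(y_train, linmod_toy.predict(x_train))
print(mse_ins_lin, mse_ins_poly, mse_oos_poly)
```

Because the linear model is nested in the polynomial model, the in-sample MSE of the polynomial fit cannot exceed that of the linear fit; the out-of-sample MSE carries no such guarantee.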

### E)

Increase the polynomial order in your approximation of the regression curve and see how the in-sample and out-of-sample MSE behave.

### F)

In task F, generate a plot of the in-sample and out-of-sample MSE depending on the order of the polynomial, $q$, in the regression function.

1. First, you need to write a for-loop that iterates over different values of degree $q$ to generate the in-sample and out-of-sample predictions and calculate the in-sample MSE and out-of-sample MSE. Include the following steps in your for-loop: \
\
i. For each value of $q$, generate polynomial features of degree $q$ for both the training data (`x`) and test data (`xtest`). \
ii. Estimate/fit a polynomial model using the polynomial features of degree $q$ on the training data (transformed `x` and `y`). \
iii. Make in-sample and out-of-sample predictions. \
iv. For each value of $q$, compute the in-sample MSE and the out-of-sample MSE.

In [6]:
q = np.arange(1,12,1)
MSE_ins = np.zeros(len(q))
MSE_oos = np.zeros(len(q))

2. Finally, generate the plot.
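
The loop described in steps i. to iv. and the final plot could be organized roughly as follows. This is a sketch on simulated data with hypothetical variable names; only `q`, `MSE_ins`, and `MSE_oos` follow the cell above.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Simulated training and test samples (assumption: NOT the exercise data)
rng = np.random.default_rng(3)
grid = np.linspace(-2, 2, 50)
x_train = pd.DataFrame({"x": grid})
y_train = pd.DataFrame({"y": 1 + 2 * np.sin(2 * grid) + rng.normal(size=50)})
x_test = pd.DataFrame({"x": grid})
y_test = pd.DataFrame({"y": 1 + 2 * np.sin(2 * grid) + rng.normal(size=50)})

q = np.arange(1, 12, 1)
MSE_ins = np.zeros(len(q))
MSE_oos = np.zeros(len(q))

for i, degree in enumerate(q):
    # i. polynomial features of the current degree for both samples
    poly = PolynomialFeatures(degree=int(degree))
    X_train_poly = poly.fit_transform(x_train)
    X_test_poly = poly.transform(x_test)
    # ii. fit on the transformed training data
    model_q = LinearRegression().fit(X_train_poly, y_train)
    # iii. and iv. predict and store both MSEs
    MSE_ins[i] = mean_squared_error(y_train, model_q.predict(X_train_poly))
    MSE_oos[i] = mean_squared_error(y_test, model_q.predict(X_test_poly))

# Plot both error curves against the polynomial order
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(q, MSE_ins, marker="o", label="in-sample MSE")
ax.plot(q, MSE_oos, marker="o", label="out-of-sample MSE")
ax.set_xlabel("polynomial order q")
ax.set_ylabel("MSE")
ax.legend()
```
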

## Exercise 2: Linear Regression: Inference I - Case Study Oregon Health Experiment

In this problem, we analyze a random subset from the Oregon Health Experiment Data ([`Oregon.csv`](https://alexandragibbon.github.io/StatProg-HHU/data/Oregon.csv)). In this large-scale experiment, access to Medicaid (U.S. public health insurance) was provided randomly in a lottery. More information can be found [here](http://www.nber.org/oregon/1.home.html). The problem set is based on the published study by Finkelstein et al. (2012, Quarterly Journal of Economics). 

Load the data and create a pandas data frame.

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

In [None]:
os.chdir("[INSERT YOUR PATH HERE!]")

Oregdata = pd.read_csv('data/Oregon.csv', sep=',', na_values=".")
Oregdata.head()

Unnamed: 0,household_id,treatment,english_list,female_list,zip_msa,weight_12m,docvis,hhinc_pctfpl_12m,race_hisp_12m,race_white_12m,...,ddddraXnum_3_3,ddddraXnum_4_2,ddddraXnum_5_2,ddddraXnum_6_2,ddddraXnum_7_2,edu_12m_2,edu_12m_3,edu_12m_4,age2008,chronicdis
0,100002,1,1,1,1,1.0,0,60.054909,0,1,...,0,0,0,0,0,1,0,0,24,0
1,100005,1,0,1,1,1.0,0,129.71054,1,0,...,0,0,0,0,0,1,0,0,39,1
2,100006,1,1,0,1,1.0,1,34.134354,0,1,...,0,0,0,0,0,0,0,1,62,3
3,100009,0,1,1,1,1.0,1,47.788094,0,1,...,0,0,0,0,0,1,0,0,31,0
4,100013,1,1,0,0,1.0,10,11.542013,0,1,...,0,0,0,0,0,1,0,0,45,2


### 1)

Our dependent variable is the number of doctor visits in the last 6 months (`docvis`). Inspect the data set and generate a frequency table for the outcome variable `docvis` as well as a barplot.\
\
**Hint:** For the frequency table use [`pd.crosstab`](https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html) and for the barplot [`plt.hist`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html).
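
To illustrate the two functions named in the hint, here is a small sketch on a made-up count variable. The series `toy_visits` and its values are invented for illustration and are not taken from `Oregon.csv`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up count data (assumption: NOT the docvis column from Oregon.csv)
toy_visits = pd.Series([0, 0, 1, 3, 0, 2, 1, 0, 5, 1], name="docvis")

# Frequency table: one row per distinct value, one column holding the counts
freq = pd.crosstab(index=toy_visits, columns="count")
print(freq)

# Histogram with one bin per integer value, centered on the integers
bins = np.arange(toy_visits.min() - 0.5, toy_visits.max() + 1.5, 1)
fig = plt.figure(figsize=(6, 4))
counts, edges, _ = plt.hist(toy_visits, bins=bins, rwidth=0.8)
plt.xlabel("number of doctor visits")
plt.ylabel("frequency")
```
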

### 2) 

We are interested in estimating an intent-to-treat effect, i.e., how *access* to Medicaid affects the number of doctor visits. Thus, our regression model is 

$$ DOCVIS_i = \beta_0 + \beta_1 * TREATMENT_i + X_i'\beta + \varepsilon_i,$$

where $DOCVIS_i$ is the number of doctor visits and $TREATMENT_i$ indicates access to health insurance. The controls $X_i$ contain the variables: 

`ddddraw_sur_2 ddddraw_sur_3 ddddraw_sur_4 ddddraw_sur_5 ddddraw_sur_6 ddddraw_sur_7 dddnumhh_li_2 dddnumhh_li_3 ddddraXnum_2_2 ddddraXnum_2_3 ddddraXnum_3_2 ddddraXnum_3_3 ddddraXnum_4_2 ddddraXnum_5_2 ddddraXnum_6_2 ddddraXnum_7_2`

The list with column names that we will need to build the data frame is the following:

In [36]:
columnnames = ['treatment','ddddraw_sur_2', 'ddddraw_sur_3', 'ddddraw_sur_4',
            'ddddraw_sur_5', 'ddddraw_sur_6', 'ddddraw_sur_7', 
            'dddnumhh_li_2', 'dddnumhh_li_3', 'ddddraXnum_2_2', 'ddddraXnum_2_3',
            'ddddraXnum_3_2', 'ddddraXnum_3_3', 'ddddraXnum_4_2',
            'ddddraXnum_5_2', 'ddddraXnum_6_2', 'ddddraXnum_7_2']

In task 2.2, we build the data frame with the relevant variables and then estimate the regression model above using the package [`statsmodels`](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html#statsmodels.regression.linear_model.OLS). Is there a significant effect of $TREATMENT_i$, i.e., access to Medicaid?

### 3)

Now, we would like to extend the model and include variables on gender, education, and race/ethnicity, among others, in addition to the regressors from part 2.2:

`female_list, hhinc_pctfpl_12m, age2008, edu_12m_2, edu_12m_3, edu_12m_4, english_list, zip_msa, race_white_12m, race_black_12m, race_hisp_12m`

Does your conclusion from part 2.2 change?


### 4)

How would you summarize your results? Are the OLS conditions met in the Oregon Health Example? 