<center><h1> EC485: In-Class Case Study</h1></center>

**Author(s):**
1. Belicia Rodriguez (belicia.rodriguez@emory.edu)

**Objectives**: This <ins>case study</ins> aims to
 1. Familiarize you with *real* requests in any entry-level data analyst job;
 2. Use *GitHub* to retrieve and submit computer code for *reference*, *version control*, and *future collaboration*.

**Instructions**:
 1. Please write down your Python code and <ins>execute</ins> it in the cell below each question.
 
**Data Source**: [Introductory Econometrics: A Modern Approach](https://cran.r-project.org/web/packages/wooldridge/index.html) by Jeffrey Wooldridge

**Data Description**: 

```
Contains data from hprice1.dta
  obs:            88                          
 vars:            10                          17 Mar 2002 12:21
 size:         3,168 (99.5% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
price           float  %9.0g                  house price, $1000s
assess          float  %9.0g                  assessed value, $1000s
bdrms           byte   %9.0g                  number of bdrms
lotsize         float  %9.0g                  size of lot in square feet
sqrft           int    %9.0g                  size of house in square feet
colonial        byte   %9.0g                  =1 if home is colonial style
lprice          float  %9.0g                  log(price)
lassess         float  %9.0g                  log(assess)
llotsize        float  %9.0g                  log(lotsize)
lsqrft          float  %9.0g                  log(sqrft)
-------------------------------------------------------------------------------
Sorted by:  
 ```

<center><h2> Questions</h2></center>

1. [10 points] Using the ```read_stata``` function from the ```pandas``` library in Python, download the ```hprice1``` data set from the address ```http://fmwww.bc.edu/ec-p/data/wooldridge/hprice1.dta```. **Note:** You need a working internet connection.

In [4]:
```python
import pandas as pd

# load the Stata file directly from the URL (requires internet access)
hprice1 = pd.read_stata('http://fmwww.bc.edu/ec-p/data/wooldridge/hprice1.dta')
```
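
Before moving on, it can help to confirm that a downloaded file matches the data description above (88 observations, 10 variables). The following is a minimal sketch of such a check. The helper name `check_codebook` and the tiny stand-in frame are hypothetical and only illustrate the idea, so the sketch runs without any download.

```python
import pandas as pd

# Variable names copied from the data description above.
EXPECTED_COLUMNS = [
    'price', 'assess', 'bdrms', 'lotsize', 'sqrft',
    'colonial', 'lprice', 'lassess', 'llotsize', 'lsqrft',
]

def check_codebook(df, expected_columns=EXPECTED_COLUMNS, expected_rows=88):
    """Hypothetical helper: list mismatches between a frame and the codebook."""
    problems = []
    missing = [c for c in expected_columns if c not in df.columns]
    if missing:
        problems.append('missing columns: ' + ', '.join(missing))
    if len(df) != expected_rows:
        problems.append(f'expected {expected_rows} rows, found {len(df)}')
    return problems

# Stand-in frame with made-up values, used only to exercise the helper.
demo = pd.DataFrame({'price': [300.0, 370.0], 'bdrms': [4, 3]})
demo_problems = check_codebook(demo)
print(demo_problems)
```

With the real data, `check_codebook(hprice1)` should return an empty list if the file matches the description.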

2. [40 points] Use the ```patsy``` library in Python to create the corresponding vectors of features and design matrices for the following _nested_ models

$$
\begin{aligned}
\texttt{lprice} &= \beta_{0} + \beta_{1}\texttt{llotsize} +  \beta_{2}\texttt{lsqrft} +\beta_{3}\texttt{colonial}+\beta_{4}\texttt{bdrms} + e_1,\\
\texttt{lprice} &= \beta_{0} + \beta_{1}\texttt{llotsize} +  \beta_{2}\texttt{lsqrft} +\beta_{3}\texttt{colonial}+\beta_{4}\texttt{bdrms}\\
&\quad + \beta_{5}\texttt{colonial}\times\texttt{llotsize}+ \beta_{6}\texttt{colonial}\times\texttt{lsqrft}+ \beta_{7}\texttt{colonial}\times\texttt{bdrms}+ e_2
\end{aligned}
$$

In [6]:
```python
import patsy

# formulas for the restricted model and the model with colonial interactions
eq1 = 'lprice ~ llotsize + lsqrft + colonial + bdrms'
eq2 = 'lprice ~ llotsize + lsqrft + colonial + bdrms + colonial:llotsize + colonial:lsqrft + colonial:bdrms'

# response vectors and design matrices for each specification
y1, X1 = patsy.dmatrices(eq1, data=hprice1, return_type='dataframe')
y2, X2 = patsy.dmatrices(eq2, data=hprice1, return_type='dataframe')
```

Comment: Notice that the code you wrote calls ```ceosal2``` instead of the ```hprice1``` pandas data frame you defined above. I made the correction so that I could run it.
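
To make the interaction terms in the second specification concrete: a term such as $\texttt{colonial}\times\texttt{llotsize}$ is the element-wise product of the two columns, so it equals `llotsize` for colonial homes and zero otherwise. The minimal NumPy sketch below builds such columns by hand from made-up values (not taken from `hprice1`). It illustrates what a formula term like `colonial:llotsize` represents, not how `patsy` is implemented, and the column order or naming in `patsy`'s output may differ.

```python
import numpy as np

# Made-up values for four hypothetical homes (not from hprice1).
colonial = np.array([1.0, 0.0, 1.0, 0.0])
llotsize = np.array([8.7, 9.1, 8.9, 9.4])
lsqrft = np.array([7.6, 7.5, 7.8, 7.4])
bdrms = np.array([4.0, 3.0, 3.0, 2.0])

intercept = np.ones_like(colonial)

# Columns of the first specification.
X1_demo = np.column_stack([intercept, llotsize, lsqrft, colonial, bdrms])

# The second specification adds element-wise products with the dummy.
X2_demo = np.column_stack([
    X1_demo,
    colonial * llotsize,
    colonial * lsqrft,
    colonial * bdrms,
])

print(X1_demo.shape, X2_demo.shape)
print(X2_demo[:, 5])  # llotsize for colonial homes, zero otherwise
```

Because the first five columns of `X2_demo` are exactly `X1_demo`, the first model is the second with $\beta_5=\beta_6=\beta_7=0$, which is what makes the two specifications nested.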

3. [50 points] On the internet you found out that there is an _alternative_ measure of fit, $\widetilde{R}^2$, defined as

$$
\widetilde{R}^{2}=1-\frac{\sum_{i=1}^{n} \widetilde{e}_{i}^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}},
$$

where $\{\widetilde{e}_i;i=1,\dots,n\}$ are the _prediction errors_ previously discussed, and $\{y_i;i=1,\dots,n\}$ are the observations $\{\texttt{lprice} _i;i=1,\dots,n\}$ in the data set. $\widetilde{R}^2$ estimates the share of the forecast variance that is explained by the regression forecast. Calculate this quantity for both specifications above and use it to select a specification.
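
As a rough illustration, assume the prediction errors are leave-one-out errors, that is, $\widetilde{e}_i$ is the error made when $y_i$ is predicted from a regression fitted without observation $i$. Under that assumption they can be obtained from a single full-sample fit as $\widetilde{e}_i = \hat{e}_i/(1-h_{ii})$, where $\hat{e}_i$ is the OLS residual and $h_{ii}$ is the $i$-th diagonal element of the hat matrix. The sketch below checks this shortcut against brute-force refitting on simulated data (made up, not `hprice1`) and then computes $\widetilde{R}^2$. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (made up for illustration): intercept plus two regressors.
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.5, size=n)

# Full-sample OLS residuals and leverage values h_ii = x_i' (X'X)^{-1} x_i.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
leverage = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)

# Shortcut: leave-one-out prediction errors from the full-sample fit.
loo_shortcut = resid / (1 - leverage)

# Brute force: refit without observation i, then predict y_i.
loo_brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    loo_brute[i] = y[i] - X[i] @ beta_i

# Alternative R2 (prediction errors) next to the usual R2 (residuals).
tss = ((y - y.mean()) ** 2).sum()
r2_tilde = 1 - (loo_shortcut ** 2).sum() / tss
r2_usual = 1 - (resid ** 2).sum() / tss

print(np.allclose(loo_shortcut, loo_brute))
print(r2_tilde <= r2_usual)
```

Since each $h_{ii}$ lies between 0 and 1, every prediction error is at least as large in absolute value as the corresponding residual, so $\widetilde{R}^2$ cannot exceed the usual $R^2$. A larger $\widetilde{R}^2$ indicates smaller out-of-sample prediction errors.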

In [7]:
```python
from statsmodels.regression.linear_model import OLS

# first specification
model1 = OLS(y1, X1).fit()

# sum of squared prediction errors: each residual is scaled by 1/(1 - h_ii)
predict_errors = ((model1.resid / (1 - model1.get_influence().hat_matrix_diag))**2).sum()

# total sum of squares of lprice around its mean (already a scalar)
sum_of_squares = model1.centered_tss

R2_1 = 1 - (predict_errors / sum_of_squares)

print('R2 for first model specification: ', round(R2_1, 6))

# second specification
model2 = OLS(y2, X2).fit()

predict_errors = ((model2.resid / (1 - model2.get_influence().hat_matrix_diag))**2).sum()

sum_of_squares = model2.centered_tss

R2_2 = 1 - (predict_errors / sum_of_squares)

print('R2 for second model specification:', round(R2_2, 6))

# select preferred specification: a higher value means smaller prediction errors
print('I would select the first model specification.')
print('Model one has a higher R2 value, meaning its prediction errors are smaller relative to the total variation in lprice.')
print('The model with the higher R2 is preferred.')
```

```
R2 for first model specification:  0.588301
R2 for second model specification: 0.452847
I would select the first model specification.
Model one has a higher R2 value, meaning its prediction errors are smaller relative to the total variation in lprice.
The model with the higher R2 is preferred.
```


Comment: Your answer is now correct after I changed the data set. Please read the Jupyter Notebook user manual. To include text like this, press ```esc``` followed by ```m```, which turns the cell into a ```markdown``` cell.