# Multiple Linear Regression Model
I will review the following topics:
* The Algebra of the OLS Estimator
* Asymptotic Properties of the OLS Estimator
* Regression Intervals
* Forecast Intervals

In [26]:
# import packages
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.linear_model import OLS
import patsy

# download dataset to use throughout
hprice2 = pd.read_stata('http://fmwww.bc.edu/ec-p/data/wooldridge/hprice2.dta')

In [53]:
# view first five rows of the dataset (the default for head())
hprice2.head()

# view data information such as count, mean, max, etc
hprice2.describe()

# specify outcome variable (y) and regressors/predictors (x) using string
f = 'lprice ~ lnox + lproptax + crime + rooms + dist + radial + stratio + lowstat'

# select a column of the dataframe as an attribute (or using brackets) - returns a pandas Series
hprice2.crime
hprice2['crime']

# use double brackets to select columns as a dataframe
hprice2[['crime', 'lnox']]

# create a design matrix using patsy package
y, X = patsy.dmatrices(f, data=hprice2, return_type='dataframe')

# set up the OLS model, fit it, then view the regression summary
model = OLS(y,X)
reg = model.fit()
reg.summary()
reg.summary2()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.763 |
| Dependent Variable: | lprice | AIC: | -188.7488 |
| Date: | 2020-03-08 19:20 | BIC: | -150.71 |
| No. Observations: | 506 | Log-Likelihood: | 103.37 |
| Df Model: | 8 | F-statistic: | 204.8 |
| Df Residuals: | 497 | Prob (F-statistic): | 5.77e-152 |
| R-squared: | 0.767 | Scale: | 0.039616 |

| | Coef. | Std.Err. | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 12.6516 | 0.3473 | 36.4288 | 0.0000 | 11.9693 | 13.3340 |
| lnox | -0.4503 | 0.0920 | -4.8937 | 0.0000 | -0.6311 | -0.2695 |
| lproptax | -0.2274 | 0.0477 | -4.7634 | 0.0000 | -0.3212 | -0.1336 |
| crime | -0.0113 | 0.0014 | -8.2754 | 0.0000 | -0.0139 | -0.0086 |
| rooms | 0.0990 | 0.0168 | 5.9008 | 0.0000 | 0.0660 | 0.1320 |
| dist | -0.0488 | 0.0073 | -6.6939 | 0.0000 | -0.0631 | -0.0345 |
| radial | 0.0115 | 0.0023 | 5.0277 | 0.0000 | 0.0070 | 0.0160 |
| stratio | -0.0404 | 0.0050 | -8.1325 | 0.0000 | -0.0502 | -0.0307 |
| lowstat | -0.0283 | 0.0019 | -14.7579 | 0.0000 | -0.0320 | -0.0245 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 60.676 | Durbin-Watson: | 1.047 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 204.257 |
| Skew: | 0.517 | Prob(JB): | 0.0 |
| Kurtosis: | 5.936 | Condition No.: | 1090.0 |


In [43]:
# create a dictionary
ols_var = {'y': hprice2.lprice, 'y_hat': reg.fittedvalues, 'e_hat': reg.resid}

# print first five outcomes, fitted values, and residuals
ols_info = pd.DataFrame(ols_var)
ols_info.head()

| | y | y_hat | e_hat |
|---|---|---|---|
| 0 | 10.08581 | 10.303026 | -0.217217 |
| 1 | 9.980402 | 10.145426 | -0.165024 |
| 2 | 10.4545 | 10.365117 | 0.089383 |
| 3 | 10.41631 | 10.33025 | 0.08606 |
| 4 | 10.49679 | 10.277121 | 0.219669 |


## analysis of variance
total sum of squares = explained sum of squares + residual sum of squares

$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n \hat{e}_i^2$

In [48]:
# view the OLS residual sum of squares (reg.centered_tss and reg.ess give the total and explained sums of squares)
reg.ssr

19.689097978248622
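
The decomposition above can be checked numerically. The following is a minimal sketch using NumPy only and synthetic data (the data, seed, and coefficient values are assumptions for illustration, not the hprice2 dataset); it should hold for any OLS fit that includes an intercept.

```python
import numpy as np

# Synthetic data (assumed for illustration; not the hprice2 dataset)
rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + k regressors
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(scale=0.5, size=n)

# OLS coefficients, fitted values, and residuals
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
e_hat = y - y_hat

tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
rss = np.sum(e_hat ** 2)               # residual sum of squares

# the two printed numbers should agree up to floating-point error
print(tss, ess + rss)
```

On the fitted statsmodels results object, the same three quantities should be available as `reg.centered_tss`, `reg.ess`, and `reg.ssr`.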

## coefficient of determination ($R^2$)

$R^2 = \frac{\sum_{i=1}^n (\hat{y_i} - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} = 1 - \frac{\sum_{i=1}^n \hat{e_i}^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$

* equal to the squared sample correlation coefficient between the observed and fitted values (when the model includes an intercept)

## adjusted $R^2$

$\bar{R}^2 = 1 - \frac{(n-1)\sum_{i=1}^n \hat{e}_i^2}{(n-k-1)\sum_{i=1}^n (y_i - \bar{y})^2}$

* unlike $R^2$, which cannot decrease as $k$ increases, $\bar{R}^2$ can either increase or decrease with $k$

In [62]:
# view R2
print(reg.rsquared)

# view adjusted R2
print(reg.rsquared_adj) 

0.767219563142833
0.7634725943403031
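
As a quick arithmetic check, the adjusted $R^2$ can be recomputed from the reported $R^2$ with $n = 506$ observations and $k = 8$ regressors (both taken from the summary output). The helper name `adjusted_r2` is hypothetical and used only for this sketch.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k regressors plus an intercept."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

# values reported in the regression output
r2 = 0.767219563142833
n, k = 506, 8

r2_adj = adjusted_r2(r2, n, k)
print(r2_adj)
```

The printed value should agree with `reg.rsquared_adj` up to floating-point rounding.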


## leverage values ($h_{ii}$)
* a measure of how far the regressor values of an observation are from those of the other observations
* $h_{ii}$ is the $i$-th diagonal element of the hat matrix
* hat matrix: the projection matrix $H = X(X'X)^{-1}X'$ that expresses the fitted values $\hat{y}$ as linear combinations of the observed values $y$, i.e. $\hat{y} = Hy$
* $h_{ii}$ measures how strongly $y_i$ influences its own fitted value $\hat{y}_i$; a high-leverage observation largely determines its own prediction


In [64]:
# create instance of influence
reg_influence = reg.get_influence()

# get leverage values
hii = reg_influence.hat_matrix_diag
print(hii[1:6])

[0.00465546 0.00745928 0.01159636 0.01147143 0.01015461]
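
To make the definition concrete, here is a small sketch that builds the hat matrix by hand and reads the leverage values off its diagonal. It uses NumPy and synthetic data (an assumption for illustration); forming the full $n \times n$ matrix is fine for small $n$ but wasteful for large samples.

```python
import numpy as np

# Synthetic design matrix and outcome (assumed for illustration)
rng = np.random.default_rng(2)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T          # hat matrix, n x n
hii = np.diag(H)               # leverage values

beta_hat = XtX_inv @ X.T @ y
y_hat = X @ beta_hat

# H maps observed y to fitted values, and the leverages sum to k + 1
print(np.allclose(H @ y, y_hat))
print(hii.sum())
```

With an intercept and $k$ regressors, the leverage values lie between 0 and 1 and sum to $k + 1$, so their average is $(k+1)/n$.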


## prediction error (leave one out residual or prediction residual)

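$\tilde{e}_i = y_i - \tilde{y}_i$, where $\tilde{y}_i$ is the leave-one-out predicted value, i.e. the prediction for observation $i$ from the regression estimated without observation $i$

* there is one leave-one-out prediction error for each observation
* $\tilde{e}_i$ can be computed without re-estimating the model:

$\tilde{e}_i = (1-h_{ii})^{-1} \hat{e}_i$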

In [66]:
# original OLS residuals
e_hat = reg.resid

# calculate prediction errors (leave-one-out residuals)
e_tilde = e_hat / (1 - hii)
e_tilde

# TODO: re-estimate OLS without observation 156 (later)

0     -0.219350
1     -0.165796
2      0.090055
3      0.087070
4      0.222218
         ...   
501    0.006959
502   -0.057587
503   -0.098375
504   -0.127720
505   -0.631556
Length: 506, dtype: float64
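
The shortcut formula can be verified by actually dropping one observation and refitting. This is a minimal sketch on synthetic data (the data and the chosen index `i = 7` are assumptions for illustration).

```python
import numpy as np

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(size=n)

# full-sample OLS residuals and leverage values
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e_hat = y - X @ beta_hat
hii = np.einsum('ij,jk,ik->i', X, XtX_inv, X)

# shortcut: prediction errors from the full-sample fit
e_tilde = e_hat / (1 - hii)

# brute force: refit without observation i, then predict y_i
i = 7
mask = np.arange(n) != i
beta_loo = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
e_loo = y[i] - X[i] @ beta_loo

# the two numbers should agree up to floating-point error
print(e_tilde[i], e_loo)
```

The shortcut needs only one regression, whereas the brute-force route needs $n$ of them.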


## Estimation of Error Variance
* the unconditional error variance $\sigma^2 = E[e_i^2]$ can be estimated as follows:

(1) $s^2 = \frac{1}{n-k-1} \sum_{i=1}^n \hat{e_i}^2$

(2) $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \hat{e_i}^2$
 
(3) $\bar{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \tilde{e_i}^2 = \frac{1}{n} \sum_{i=1}^n (1-h_{ii})^{-2} \hat{e_i}^2$

* when $k/n$ is small, any of the three will do

* when $k/n$ is large, use (1) or (3)
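
The three estimators can be compared side by side. The sketch below uses synthetic data with a deliberately large $k/n$ (all values are assumptions for illustration), where the downward bias of estimator (2) becomes visible.

```python
import numpy as np

# Synthetic data with a large k/n ratio (assumed for illustration)
rng = np.random.default_rng(3)
n, k = 40, 10
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ rng.normal(size=k + 1) + rng.normal(size=n)  # true error variance is 1

XtX_inv = np.linalg.inv(X.T @ X)
e_hat = y - X @ (XtX_inv @ X.T @ y)
hii = np.einsum('ij,jk,ik->i', X, XtX_inv, X)

s2 = e_hat @ e_hat / (n - k - 1)                # (1) degrees-of-freedom corrected
sigma2_hat = e_hat @ e_hat / n                  # (2) simple average of squared residuals
sigma2_bar = np.mean((e_hat / (1 - hii)) ** 2)  # (3) based on prediction errors

print(s2, sigma2_hat, sigma2_bar)
```

Estimator (2) is always the smallest of the three, since (1) rescales it by $n/(n-k-1)$ and (3) inflates each squared residual by $(1-h_{ii})^{-2}$.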

## Asymptotic Properties of the OLS Estimator

(will investigate more in depth at a later time)

OLS Variance Estimation ($V_{\hat{\beta}}$)

(1) HC0

(2) HC1 - most common

(3) HC2

(4) HC3
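
Since the four estimators are only listed here, the following sketch spells out one common set of definitions: each is a sandwich $(X'X)^{-1} \left(\sum_i w_i x_i x_i'\right) (X'X)^{-1}$ with weights $w_i = \hat{e}_i^2$ (HC0), $\hat{e}_i^2/(1-h_{ii})$ (HC2), or $\hat{e}_i^2/(1-h_{ii})^2$ (HC3), and HC1 rescales HC0 by $n/(n-k-1)$. The data and the helper name `sandwich` are assumptions for illustration.

```python
import numpy as np

# Synthetic heteroskedastic data (assumed for illustration)
rng = np.random.default_rng(4)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
e = rng.normal(size=n) * (1 + np.abs(X[:, 1]))  # error scale depends on a regressor
y = X @ np.array([1.0, 0.5, -0.5]) + e

p = X.shape[1]  # number of coefficients, k + 1
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e_hat = y - X @ beta_hat
hii = np.einsum('ij,jk,ik->i', X, XtX_inv, X)

def sandwich(weights):
    """(X'X)^{-1} X' diag(weights) X (X'X)^{-1}; hypothetical helper."""
    meat = X.T @ (X * weights[:, None])
    return XtX_inv @ meat @ XtX_inv

V_hc0 = sandwich(e_hat ** 2)
V_hc1 = V_hc0 * n / (n - p)
V_hc2 = sandwich(e_hat ** 2 / (1 - hii))
V_hc3 = sandwich(e_hat ** 2 / (1 - hii) ** 2)

# robust standard errors are the square roots of the diagonals
for name, V in [('HC0', V_hc0), ('HC1', V_hc1), ('HC2', V_hc2), ('HC3', V_hc3)]:
    print(name, np.sqrt(np.diag(V)))
```

In statsmodels, the same estimators can be requested with `model.fit(cov_type='HC1')` (or `'HC0'`, `'HC2'`, `'HC3'`), which is likely the simpler route in practice.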

Confidence Interval

In [69]:
# calculating vcov matrices

# calculate 95% confidence intervals (conf_int defaults to alpha=0.05)
reg.conf_int()

| | 0 | 1 |
|---|---|---|
| Intercept | 11.969265 | 13.33397 |
| lnox | -0.631135 | -0.269532 |
| lproptax | -0.321159 | -0.133591 |
| crime | -0.01394 | -0.008591 |
| rooms | 0.066035 | 0.131961 |
| dist | -0.06313 | -0.03448 |
| radial | 0.006987 | 0.015951 |
| stratio | -0.050183 | -0.030654 |
| lowstat | -0.032032 | -0.024505 |


## Regression Interval

In [70]:
# calculate regression interval - figure out manually
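
One way to do this by hand: a regression interval is a confidence interval for the conditional mean $x'\beta$ at a chosen regressor value $x$, commonly computed as $x'\hat{\beta} \pm 1.96 \sqrt{x' \hat{V}_{\hat{\beta}} x}$ for an asymptotic 95% interval. The sketch below assumes synthetic data, a homoskedastic covariance estimate $s^2 (X'X)^{-1}$, and a hypothetical evaluation point `x0`; a robust covariance matrix could be swapped in for `V`.

```python
import numpy as np

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(5)
n, k = 150, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e_hat = y - X @ beta_hat
s2 = e_hat @ e_hat / (n - k - 1)
V = s2 * XtX_inv                   # homoskedastic covariance estimate (assumption)

x0 = np.array([1.0, 0.5, -0.5])    # hypothetical point, leading 1 for the intercept
m_hat = x0 @ beta_hat              # estimated conditional mean at x0
se_m = np.sqrt(x0 @ V @ x0)        # standard error of the conditional mean

z = 1.96                           # normal critical value for a 95% interval
reg_interval = (m_hat - z * se_m, m_hat + z * se_m)
print(reg_interval)
```

On the fitted statsmodels results, `reg.get_prediction(...).summary_frame()` should report a comparable interval in its `mean_ci_lower` and `mean_ci_upper` columns (based on the t distribution rather than 1.96).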

## Forecast Interval
* suppose we are given the regressor vector $x_{n+1}$ for an individual outside the sample and want to forecast $y_{n+1}$

In [None]:
# calculate forecast interval - figure out manually
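
A possible manual calculation: the point forecast is $x_{n+1}'\hat{\beta}$, and the forecast error variance adds the error variance to the estimation variance, giving the approximate 95% interval $x_{n+1}'\hat{\beta} \pm 1.96 \sqrt{s^2 + x_{n+1}' \hat{V}_{\hat{\beta}} x_{n+1}}$. The sketch assumes synthetic data, homoskedastic and roughly normal errors, and a hypothetical out-of-sample point `x_new`.

```python
import numpy as np

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(6)
n, k = 150, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e_hat = y - X @ beta_hat
s2 = e_hat @ e_hat / (n - k - 1)
V = s2 * XtX_inv                     # homoskedastic covariance estimate (assumption)

x_new = np.array([1.0, 0.3, 1.2])    # hypothetical out-of-sample regressor vector
y_fore = x_new @ beta_hat            # point forecast

se_mean = np.sqrt(x_new @ V @ x_new)       # uncertainty about the conditional mean
se_fore = np.sqrt(s2 + x_new @ V @ x_new)  # adds the variance of the new error term

z = 1.96                             # normal critical value for a 95% interval
forecast_interval = (y_fore - z * se_fore, y_fore + z * se_fore)
print(forecast_interval)
```

The forecast interval is always wider than the regression interval at the same point, because it must also cover the unobserved error $e_{n+1}$; in statsmodels the analogous bounds should appear as `obs_ci_lower` and `obs_ci_upper` in `reg.get_prediction(...).summary_frame()`.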