## Selection Criteria: MLR

* The goal is to list selection criteria for the linear regression model and use them to compare two candidate specifications

In [2]:
import pandas as pd
import numpy as np
import patsy

# import dataset hprice3
hprice3 = pd.read_stata('http://fmwww.bc.edu/ec-p/data/wooldridge/hprice3.dta')

# create lcbd as the log of cbd
hprice3['lcbd'] = np.log(hprice3.cbd)

# set up two linear regression models; f1 additionally includes linstsq and agesq
f1 = 'lprice ~ lland + larea + lcbd + nbh + rooms + y81 + linst + linstsq + ldist + baths + age + agesq'
f2 = 'lprice ~ lland + larea + lcbd + nbh + rooms + y81 + linst           + ldist + baths + age        '
y1, X1 = patsy.dmatrices(f1, data=hprice3, return_type='dataframe')
y2, X2 = patsy.dmatrices(f2, data=hprice3, return_type='dataframe')

We are interested in the following selection criteria:

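Adjusted $\bar{R}^2$

$\bar{R}^2 = 1 - (1-R^2)\frac{n-1}{n-K-1}$, where $R^2$ is the standard regression coefficient of determination, $n$ is the number of observations, and $K$ is the number of regressors (excluding the intercept)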
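As a quick check of the formula, the sketch below computes $\bar{R}^2$ directly from $R^2$, $n$, and $K$. The function name `adjusted_r2` and the input values are illustrative assumptions and are not taken from the housing data; $K$ counts the regressors without the intercept, matching the $n-K-1$ denominator.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k regressors (intercept not counted)."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# hypothetical values: two extra regressors raise R^2 only slightly
small = adjusted_r2(0.800, n=100, k=10)
large = adjusted_r2(0.801, n=100, k=12)
print(small, large)
```

With these made-up inputs the larger model has the higher $R^2$ but the lower $\bar{R}^2$, because the penalty for the two extra regressors outweighs the small gain in fit.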

Bayesian Information Criterion

$BIC = n + n\log(2\pi \hat{\sigma}^2) + K\log(n)$

Akaike Information Criterion

$AIC = n + n\log(2\pi\hat{\sigma}^2) + 2K$

* as is often done in the statistical learning literature, we will use BIC and AIC defined without the additive constants $n + n\log(2\pi)$, which are the same for every model fitted to the same sample:

$IC = n\log(\hat{\sigma}^2) + c(n,K)$

* one has AIC when $c = 2K$ and BIC when $c = K\log(n)$
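To illustrate why the additive constants can be dropped, here is a minimal sketch that evaluates both versions of AIC and BIC from the formulas above. The helper `info_criteria` and the variance estimates are hypothetical, chosen only for illustration.

```python
import math

def info_criteria(sigma2_hat, n, k):
    """Return AIC and BIC with and without the additive constants."""
    fit_full = n + n * math.log(2 * math.pi * sigma2_hat)
    fit_short = n * math.log(sigma2_hat)
    return {
        "AIC": fit_full + 2 * k,
        "BIC": fit_full + k * math.log(n),
        "AIC_short": fit_short + 2 * k,
        "BIC_short": fit_short + k * math.log(n),
    }

# hypothetical variance estimates for two candidate models, same sample size
a = info_criteria(sigma2_hat=0.040, n=300, k=12)
b = info_criteria(sigma2_hat=0.042, n=300, k=10)

# differences between models do not depend on the dropped constants
print(a["AIC"] - b["AIC"], a["AIC_short"] - b["AIC_short"])
print(a["BIC"] - b["BIC"], a["BIC_short"] - b["BIC_short"])
```

Because the constants cancel when two models fitted to the same sample are compared, both versions rank the models identically.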

Now we will compare the overall quality of the two models using these criteria.

In [7]:
from statsmodels.regression.linear_model import OLS

# create OLS models
model1 = OLS(y1, X1).fit()
model2 = OLS(y2, X2).fit()

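# model 1: get selection criteria information
# (statsmodels computes aic and bic from the full log-likelihood, so the additive constants are included)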
model1_r2adj = model1.rsquared_adj
model1_BIC = model1.bic
model1_AIC = model1.aic

# model 2: get selection criteria information
model2_r2adj = model2.rsquared_adj
model2_BIC = model2.bic
model2_AIC = model2.aic

# create a dictionary of information and create dataframe
dict1 = {'Adj R2': [model1_r2adj, model2_r2adj], 'BIC': [model1_BIC, model2_BIC], 'AIC': [model1_AIC, model2_AIC]}
pd.DataFrame(dict1)

| | Adj R2 | BIC | AIC |
|---|---|---|---|
| 0 | 0.785773 | -51.593272 | -100.622007 |
| 1 | 0.774587 | -44.719943 | -86.205796 |


The first model (row 0) has a higher $\bar{R}^2$ and lower BIC and AIC.

Therefore, we would choose model1 (which includes linstsq and agesq) over model2.

We also have another criterion:

Mallows' Cp:

$C_p = n \hat{\sigma}^2 + 2K \tilde{\sigma}^2$

where $\tilde{\sigma}^2$ is a preliminary estimator of $\sigma^2$ (typically based on fitting a larger model, i.e. one with all the predictors)

In [8]:
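# residual variance estimates from each model
# (mse_resid is SSR divided by the residual degrees of freedom, not by n)
m1_sig_sq = model1.mse_resid
m2_sig_sq = model2.mse_resid

# number of regressors K in each model (df_model excludes the intercept)
m1_k = model1.df_model
m2_k = model2.df_model

# calculate Mallows' Cp for each model
# (each model's own variance estimate is used here in place of the preliminary estimator)
model1_cp = model1.nobs * m1_sig_sq + 2 * m1_k * m1_sig_sq
model2_cp = model2.nobs * m2_sig_sq + 2 * m2_k * m2_sig_sq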

# create dictionary and dataframe
dict2 = {'Cp': [model1_cp, model2_cp]}
pd.DataFrame(dict2)

| | Cp |
|---|---|
| 0 | 14.190144 |
| 1 | 14.757989 |


model1 has a smaller Cp than model2.

Therefore, model1 is preferred using this criterion.
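The calculation above plugs each model's own variance estimate into the penalty term. A minimal sketch of the variant described in the definition, where $\tilde{\sigma}^2$ comes from the larger model and is shared by all candidates, might look like this. The function name `mallows_cp` and all numbers are hypothetical; they are not results from the housing data.

```python
def mallows_cp(sigma2_hat, n, k, sigma2_tilde):
    """C_p = n * sigma2_hat + 2 * k * sigma2_tilde."""
    return n * sigma2_hat + 2 * k * sigma2_tilde

n = 300
# hypothetical variance estimates; the larger model (k = 12) supplies sigma2_tilde
sigma2_large, k_large = 0.040, 12
sigma2_small, k_small = 0.042, 10
sigma2_tilde = sigma2_large

cp_large = mallows_cp(sigma2_large, n, k_large, sigma2_tilde)
cp_small = mallows_cp(sigma2_small, n, k_small, sigma2_tilde)
print(cp_large, cp_small)
```

Using one common $\tilde{\sigma}^2$ means the penalty term differs across models only through $K$, so the comparison is driven by the fit term and the number of regressors.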

### Shibata, Final Prediction Error, Generalized Cross-Validation

Shibata = $\hat{\sigma}^2 (1 + \frac{2K}{n})$

FPE = $\hat{\sigma}^2 (\frac{1 + K/n}{1 - K/n})$

GCV = $\frac{n\hat{\sigma}^2}{(n-K)^2}$
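The three formulas can be wrapped in small helper functions so the same code serves every model. This is only a sketch: the function names `shibata`, `fpe`, and `gcv` and the input values are assumptions made for illustration.

```python
def shibata(sigma2_hat, n, k):
    """Shibata criterion: sigma2_hat * (1 + 2K/n)."""
    return sigma2_hat * (1 + 2 * k / n)

def fpe(sigma2_hat, n, k):
    """Final prediction error: sigma2_hat * (1 + K/n) / (1 - K/n)."""
    return sigma2_hat * (1 + k / n) / (1 - k / n)

def gcv(sigma2_hat, n, k):
    """Generalized cross-validation: n * sigma2_hat / (n - K)^2."""
    return n * sigma2_hat / (n - k) ** 2

# hypothetical inputs, not taken from the housing data
s2, n, k = 0.040, 300, 12
print(shibata(s2, n, k), fpe(s2, n, k), gcv(s2, n, k))
```

When $K/n$ is small, $(1 + K/n)/(1 - K/n) \approx 1 + 2K/n$, so the Shibata and FPE values are close to each other, with FPE slightly larger.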

In [9]:
# manually calculating Shibata, FPE, and GCV for model1
shibata1 = m1_sig_sq * (1 + 2 * m1_k/model1.nobs)

FPE1_fraction = (1 + m1_k / model1.nobs) / (1 - m1_k / model1.nobs)
FPE1 = m1_sig_sq * FPE1_fraction

GCV1 = (model1.nobs * m1_sig_sq) / (model1.nobs - m1_k)**2

# manually calculating Shibata, FPE, and GCV for model2
shibata2 = m2_sig_sq * (1 + 2 * m2_k/model2.nobs)

FPE2_fraction = (1 + m2_k / model2.nobs) / (1 - m2_k / model2.nobs)
FPE2 = m2_sig_sq * FPE2_fraction

GCV2 = (model2.nobs * m2_sig_sq) / (model2.nobs - m2_k)**2

# create dataframe with information
dict3 = {'Shibata': [shibata1, shibata2], 'FPE': [FPE1, FPE2], 'GCV': [GCV1, GCV2]}
pd.DataFrame(dict3)


| | Shibata | FPE | GCV |
|---|---|---|---|
| 0 | 0.044206 | 0.044325 | 0.000138 |
| 1 | 0.045975 | 0.046062 | 0.000144 |


* model1 has smaller Shibata, FPE, and GCV values, so model1 is preferred

### Cross-Validation

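$CV = \frac{1}{n} \sum_{i=1}^n \tilde{e}_i^2$

where $\tilde{e}_i$ are the least squares leave-one-out prediction errors

$\tilde{e}_i = (1-h_{ii})^{-1} \hat{e}_i$

with $\hat{e}_i$ the full-sample residuals and $h_{ii}$ the $i$-th diagonal element of the hat matrix (the leverage of observation $i$).

We define the out-of-sample mean squared error as

$\tilde{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \tilde{e}_i^2$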
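The leverage formula for the leave-one-out errors can be checked numerically. The sketch below uses synthetic data (an assumption for illustration, not the housing data) and compares the shortcut with a brute-force loop that refits the regression $n$ times, each time leaving out one observation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
# synthetic design: intercept plus two regressors (illustrative only)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.2, size=n)

# full-sample least squares residuals and leverage values h_ii
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
h = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))

# leave-one-out prediction errors via the leverage shortcut
loo_shortcut = resid / (1 - h)

# the same errors obtained by refitting without observation i
loo_refit = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    loo_refit[i] = y[i] - X[i] @ b_i

cv = np.mean(loo_shortcut ** 2)
print(cv, np.allclose(loo_shortcut, loo_refit))
```

The shortcut needs only one regression, which is why the criterion is cheap to compute even though it is defined through $n$ separate fits.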

In [13]:
# squared leave-one-out prediction errors: residual / (1 - leverage), squared
CV1 = (model1.resid/(1 - model1.get_influence().hat_matrix_diag))**2

CV2 = (model2.resid/(1 - model2.get_influence().hat_matrix_diag))**2

# the CV criterion is the mean of the squared errors; display as dataframe
dict4 = {'CV': [CV1.mean(),CV2.mean()]}
pd.DataFrame(dict4)

| | CV |
|---|---|
| 0 | 0.044224 |
| 1 | 0.045975 |


Again, model1 has a smaller CV than model2, so model1 is preferred.

The rest of the document is skipped here, since it mostly explains the topics below and contains little code:

* Relationship among Selection Criteria

* Consistent Selection

* Information Criteria

* Asymptotic Selection Optimality