## Resampling Methods
* two most commonly used resampling Methods

(1) cross-validation

(2) bootstrap

In [101]:
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype

# download dataset to use throughout
hprice3 = pd.read_stata('http://fmwww.bc.edu/ec-p/data/wooldridge/hprice3.dta')

# pre-processing dataset
## select interested columns
col = ['lprice','lland','larea','nbh','rooms','cbd','y81','ldist','baths','age','agesq']
df = hprice3[col]

## change y81 and nbh to integers (necessary for patsy later on)
df[['y81','nbh']] = df[['y81','nbh']].astype(int) 

## change y81, nbh, rooms into categorical
df['y81'] = df['y81'].astype('category')
df['nbh'] = df['nbh'].astype(CategoricalDtype(ordered=False))
df['rooms'] = df['rooms'].astype(CategoricalDtype(ordered=True))

# view head of pre-processed dataset
df.head()

Unnamed: 0,lprice,lland,larea,nbh,rooms,cbd,y81,ldist,baths,age,agesq
0,11.0021,8.429017,7.414573,4,7.0,3000.0,0,9.277999,1.0,48.0,2304.0
1,10.59663,9.032409,7.867871,4,6.0,4000.0,0,9.305651,2.0,83.0,6889.0
2,10.43412,8.517193,7.042286,4,6.0,4000.0,0,9.350102,1.0,58.0,3364.0
3,11.06507,9.21034,7.035269,4,5.0,4000.0,0,9.384294,1.0,11.0,121.0
4,10.69195,9.21034,7.532624,4,5.0,4000.0,0,9.400961,1.0,48.0,2304.0


### Cross-Validation

* these methods are used to do two things

(1) Model Assessment: process of evaluating a model's performance

(2) Model Selection: process of selecting the proper level of flexibility for a model

* since models are 'trained' using training data sets, they are suitable to fit data in these training data sets only

* since the validation set was not used to fit the model, these set of observations can be used to assess the performance of the model; therefore they allow us to do model selection

In [102]:
# continue preprocessing: convert categorical variables into dummy variables
df1 = pd.get_dummies(df, columns=['y81', 'nbh'], drop_first = True)

# get list of columns
df1.columns

Index(['lprice', 'lland', 'larea', 'rooms', 'cbd', 'ldist', 'baths', 'age',
       'agesq', 'y81_1', 'nbh_1', 'nbh_2', 'nbh_3', 'nbh_4', 'nbh_5', 'nbh_6'],
      dtype='object')

## Validation Set Approach

![](img\validation_set_approach.png)

* set of n observations are randomly split into a training set (in blue with other obs) and a validation set (in beige containing other obs)

* the stat learning method is fit onto the training set

* performance evaluated on the validation set

* can be doing many (i.e j) times randomly, then all model assessment measures such as RMSE, R2, Cp, BIC, and AIC can be calculated many (i.e j) times

In [107]:
from sklearn.model_selection import train_test_split
import patsy
from statsmodels.regression.linear_model import OLS

# convert dataframe into numpy array
df1_array = df1.to_numpy()

# create specification using dataframe column object
predictors = " + ".join(list(df1.columns[1:15]))
f = " ~ ".join(['lprice',predictors])

# create design matrix
y, X = patsy.dmatrices(f, data=df1, return_type='dataframe')

# do validation set splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# run OLS
model = OLS(y,X)
reg = model.fit()

# predict model
predict = reg.predict(X_test)

# view in dataframe



173    11.534286
132    11.455381
197    11.429371
9      10.841395
104    11.323265
         ...    
229    12.054129
60     10.567336
289    11.670529
260    11.861703
118    11.108526
Length: 65, dtype: float64

In [93]:
df1.columns[0:15]

Index(['lprice', 'lland', 'larea', 'rooms', 'cbd', 'ldist', 'baths', 'age',
       'agesq', 'y81_1.0', 'nbh_1.0', 'nbh_2.0', 'nbh_3.0', 'nbh_4.0',
       'nbh_5.0'],
      dtype='object')