## Resampling Methods
* the two most commonly used resampling methods:

(1) cross-validation

(2) bootstrap

In [1]:
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype

# download dataset to use throughout
hprice3 = pd.read_stata('http://fmwww.bc.edu/ec-p/data/wooldridge/hprice3.dta')

# pre-processing dataset
## select columns of interest (copy so later assignments do not raise SettingWithCopyWarning)
col = ['lprice','lland','larea','nbh','rooms','cbd','y81','ldist','baths','age','agesq']
df = hprice3[col].copy()

## change y81, nbh and rooms to integers
df[['y81','nbh', 'rooms']] = df[['y81','nbh','rooms']].astype(int)

## change y81, nbh, rooms into categorical
df['y81'] = df['y81'].astype('category')
df['nbh'] = df['nbh'].astype(CategoricalDtype(ordered=False))
df['rooms'] = df['rooms'].astype(CategoricalDtype(ordered=True))

# view columns types of pre-processed dataset
df.dtypes

lprice     float32
lland      float32
larea      float32
nbh       category
rooms     category
cbd        float32
y81       category
ldist      float32
baths      float32
age        float32
agesq      float32
dtype: object

### Cross-Validation

* these methods are used to do two things

(1) Model Assessment: process of evaluating a model's performance

(2) Model Selection: process of selecting the proper level of flexibility for a model

* since models are 'trained' on a training data set, their error on that same data tends to understate the error on new observations

* since the validation set was not used to fit the model, this set of observations can be used to assess the model's performance; comparing that performance across candidate models is what allows us to do model selection
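
The two uses above can be sketched on simulated data. This is a minimal illustration, not part of the original notebook: the data, the 25% hold-out share, and the polynomial degrees 1 to 5 are all assumptions chosen for the example. Training MSE can only fall as the model gets more flexible, so the validation MSE is what is used to pick the degree.

```python
import numpy as np

# synthetic data (an assumption for illustration): y is quadratic in x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=200)

# hold out 25% of the observations as a validation set
idx = rng.permutation(len(x))
val_idx, train_idx = idx[:50], idx[50:]

def mse(coefs, x_part, y_part):
    """Mean squared error of a fitted polynomial on the given observations."""
    return float(np.mean((y_part - np.polyval(coefs, x_part)) ** 2))

results = {}
for degree in range(1, 6):
    # fit on the training observations only
    coefs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
    # model assessment: record training and validation MSE
    results[degree] = (mse(coefs, x[train_idx], y[train_idx]),
                       mse(coefs, x[val_idx], y[val_idx]))

for degree, (train_mse, val_mse) in results.items():
    print(f"degree {degree}: train MSE = {train_mse:.3f}, validation MSE = {val_mse:.3f}")

# model selection: pick the flexibility level with the lowest validation MSE
best_degree = min(results, key=lambda d: results[d][1])
print("selected degree:", best_degree)
```

The selected degree depends on the particular random split, which is one motivation for the repeated-split and cross-validation approaches.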

In [2]:
# continue preprocessing: convert categorical variables into dummy variables
df1 = pd.get_dummies(df, columns=['y81', 'nbh','rooms'], drop_first = True)

# get list of columns
df1.columns

Index(['lprice', 'lland', 'larea', 'cbd', 'ldist', 'baths', 'age', 'agesq',
       'y81_1', 'nbh_1', 'nbh_2', 'nbh_3', 'nbh_4', 'nbh_5', 'nbh_6',
       'rooms_5', 'rooms_6', 'rooms_7', 'rooms_8', 'rooms_9', 'rooms_10'],
      dtype='object')

### Validation Set Approach

![](img/validation_set_approach.png)

* the set of n observations is randomly split into a training set (blue) and a validation set (beige) containing the remaining observations

* the statistical learning method is fit on the training set

* its performance is evaluated on the validation set

* the random split can be repeated many (say j) times, so that model assessment measures such as RMSE, R2, Cp, BIC, and AIC are each calculated j times
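
The repeated-split idea can be sketched with NumPy alone. This is a minimal illustration on simulated data: the coefficients, the 20% validation share, and j = 20 are assumptions, not values from the notebook. Each pass reshuffles the observations, fits OLS on the training part, and records the validation RMSE.

```python
import numpy as np

# synthetic data (an assumption for illustration): linear signal plus noise
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

j = 20            # number of random splits
n_val = n // 5    # 20% of the observations go to the validation set
rmses = []
for _ in range(j):
    idx = rng.permutation(n)
    val, train = idx[:n_val], idx[n_val:]
    # fit OLS on the training set only
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    # evaluate on the validation set
    resid = y[val] - X[val] @ beta
    rmses.append(float(np.sqrt(np.mean(resid ** 2))))

print(f"validation RMSE over {j} splits: mean = {np.mean(rmses):.3f}, "
      f"min = {min(rmses):.3f}, max = {max(rmses):.3f}")
```

The spread between the minimum and maximum shows how much a single validation-set estimate depends on which observations happen to land in each set.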

In [6]:
from sklearn.model_selection import train_test_split
import patsy
from sklearn.linear_model import LinearRegression

# separate predictors and response directly from the dataframe (no patsy design matrix needed here)
X = df1.drop(columns='lprice').astype(float)
y = df1['lprice']

# do validation set split: 80% training, 20% validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# lprice is continuous, so fit a linear regression
# (LogisticRegression is a classifier and raises "ValueError: Unknown label type: 'continuous'")
model = LinearRegression()
model.fit(X_train, y_train)

# R2 of the training-set regression on the held-out observations
model.score(X_test, y_test)

# optional: formula string for patsy, built from the first 14 predictor columns
predictors = " + ".join(list(df1.columns[1:15]))
f = " ~ ".join(['lprice',predictors])


### Leave-One-Out Cross Validation

![](img/LOOCV.png)

* the set of n data points is repeatedly split into a training set (blue) containing all but one observation

* the validation set contains only that one observation

* test error is estimated by averaging the n resulting MSEs

* in total, n training data sets containing exactly n-1 observations each are constructed, along with n corresponding validation data sets containing exactly 1 observation each
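
The procedure above can be sketched from scratch with NumPy. This is a minimal illustration on simulated data (the sample size and coefficients are assumptions): each observation is left out once, the model is fit on the other n-1, and the n squared prediction errors are averaged.

```python
import numpy as np

# synthetic data (an assumption for illustration): one predictor plus an intercept
rng = np.random.default_rng(2)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(scale=1.0, size=n)

squared_errors = []
for i in range(n):
    # training set: all observations except i
    train = np.delete(np.arange(n), i)
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    # validation set: observation i only
    pred = X[i] @ beta
    squared_errors.append(float((y[i] - pred) ** 2))

# LOOCV estimate of test error: average of the n squared errors
loocv_mse = float(np.mean(squared_errors))
print(f"LOOCV estimate of test MSE: {loocv_mse:.3f}")
```

Unlike the validation set approach, there is no randomness in the splits, so rerunning the loop on the same data always gives the same estimate.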

In [15]:
# from assignment 3 - bring out betas of the regression
import patsy
import sklearn.linear_model as skm
import sklearn.model_selection as sks

# load ceosal2 
ceosal2 = pd.read_stata('http://fmwww.bc.edu/ec-p/data/wooldridge/ceosal2.dta')

# define a string with regression equation
eq = 'lsalary ~ lsales + lmktval + profmarg + comten + comtensq + ceoten + ceotensq + age + college + grad'

# build a design matrix with the formula (column 0 is the intercept, column 1 is lsales)
y, X = patsy.dmatrices(eq, data=ceosal2, return_type='dataframe')

# create empty list to store leave-one-out betas
betas = []

# run leave-one-out cross validation
loo = sks.LeaveOneOut()
for train, test in loo.split(X):
    # run a regression on all rows but one and obtain the beta for lsales
    fit = skm.LinearRegression().fit(X.iloc[train], y.iloc[train])

    # store beta value as a float
    betas.append(float(fit.coef_[0, 1]))

# print five of the betas (positions 1 through 5; the first one is skipped)
print(betas[1:6])

[0.18585808324152028, 0.1846562050869726, 0.18653729976416675, 0.1859217444741523, 0.19069844354607984]


### k-Fold Cross-Validation

![](img/k_fold_CV.png)

* in the figure, the set of n observations is randomly split into five non-overlapping groups (k = 5)

* each of these fifths in turn acts as a validation set (beige) and the remainder as a training set (blue)

* test error is estimated by averaging the five resulting MSE estimates

* in general, the approach involves randomly dividing the set of observations into k groups (called folds) of approximately equal size

* the first fold is treated as the validation set and the method is fit on the remaining k-1 folds; this is repeated k times, with a different fold held out each time
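
The fold mechanics can be sketched without any library helper. This is a minimal illustration on simulated data; n = 103 is an assumption chosen so that the folds cannot all be the same size. The shuffled indices are cut into k nearly equal folds, each fold is held out once, and the k MSE estimates are averaged.

```python
import numpy as np

# synthetic data (an assumption for illustration)
rng = np.random.default_rng(3)
n, k = 103, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=n)

# randomly divide the observation indices into k folds of roughly equal size
folds = np.array_split(rng.permutation(n), k)

fold_mses = []
for i, val in enumerate(folds):
    # training set: the other k-1 folds
    train = np.concatenate([f for m, f in enumerate(folds) if m != i])
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    # validation MSE on the held-out fold
    fold_mses.append(float(np.mean((y[val] - X[val] @ beta) ** 2)))

# k-fold CV estimate: average of the k fold-level MSEs
cv_mse = float(np.mean(fold_mses))
print("fold sizes:", [len(f) for f in folds])
print(f"{k}-fold CV estimate of test MSE: {cv_mse:.3f}")
```

Setting k = n in this sketch reproduces leave-one-out cross-validation, since every fold then holds a single observation.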

In [22]:
# from assignment 4 - only displaying how to set up kFold
from sklearn.model_selection import KFold

# import hprice1 dataset
hprice1 = pd.read_stata('http://fmwww.bc.edu/ec-p/data/wooldridge/hprice1.dta')

# create kfold object
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# split data into 10 folds
for train_index, test_index in kf.split(hprice1):
    # divide the hprice1 rows into train and test sets
    hprice1_train = hprice1.iloc[train_index]
    hprice1_test = hprice1.iloc[test_index]
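
The loop above only sets up the splits. A minimal sketch of how RSS values could be collected inside such a loop follows, using simulated data rather than hprice1. The predictors, coefficients, and sample size are assumptions made for illustration; `KFold` and `LinearRegression` are used with the same settings as elsewhere in these notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# synthetic data (an assumption for illustration): 3 predictors, linear signal plus noise
rng = np.random.default_rng(4)
n = 90
X = rng.normal(size=(n, 3))
y = 5.0 + X @ np.array([0.7, 0.2, 0.1]) + rng.normal(scale=0.3, size=n)

kf = KFold(n_splits=10, shuffle=True, random_state=42)

rss_values = []
for train_index, test_index in kf.split(X):
    # fit on the nine training folds
    model = LinearRegression().fit(X[train_index], y[train_index])
    # residual sum of squares on the held-out fold
    resid = y[test_index] - model.predict(X[test_index])
    rss_values.append(float(np.sum(resid ** 2)))

# every observation is held out exactly once, so total RSS / n is the CV estimate of test MSE
cv_mse = sum(rss_values) / n
print(f"10-fold CV estimate of test MSE: {cv_mse:.4f}")
```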