# Resampling Methods

The two most commonly used _resampling methods_ are _cross-validation_ and the _bootstrap_.

In [2]:
## removing everything from memory
rm(list=ls())
## turning all warnings off
options(warn=-1)

## installing the 'wooldridge' package if not previously installed
if (!require(wooldridge)) install.packages('wooldridge')

## loading the packages
library(wooldridge)

## see the raw data
head(hprice3)

## pre-processing the data set
hprice3_processed <- subset(hprice3,select=c("lprice","lland","larea","nbh","rooms","cbd","y81","ldist","baths","age","agesq"))
hprice3_processed$cbd <- log(hprice3_processed$cbd)
hprice3_processed$y81 <- as.factor(hprice3_processed$y81)
hprice3_processed$nbh <- as.factor(hprice3_processed$nbh)

head(hprice3_processed)

Loading required package: wooldridge
Installing package into ‘/home/nbuser/R’
(as ‘lib’ is unspecified)


year,age,agesq,nbh,cbd,inst,linst,price,rooms,area,land,baths,dist,ldist,lprice,y81,larea,lland,linstsq
<int>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>
1978,48,2304,4,3000,1000,6.9078,60000,7,1660,4578,1,10700,9.277999,11.0021,0,7.414573,8.429017,47.7177
1978,83,6889,4,4000,1000,6.9078,40000,6,2612,8370,2,11000,9.305651,10.59663,0,7.867871,9.032409,47.7177
1978,58,3364,4,4000,1000,6.9078,34000,6,1144,5000,1,11500,9.350102,10.43412,0,7.042286,8.517193,47.7177
1978,11,121,4,4000,1000,6.9078,63900,5,1136,10000,1,11900,9.384294,11.06507,0,7.035269,9.21034,47.7177
1978,48,2304,4,4000,2000,7.6009,44000,5,1868,10000,1,12100,9.400961,10.69195,0,7.532624,9.21034,57.77368
1978,78,6084,4,3000,2000,7.6009,46000,6,1780,9500,3,10000,9.21034,10.7364,0,7.484369,9.159047,57.77368


lprice,lland,larea,nbh,rooms,cbd,y81,ldist,baths,age,agesq
<dbl>,<dbl>,<dbl>,<fct>,<int>,<dbl>,<fct>,<dbl>,<int>,<int>,<dbl>
11.0021,8.429017,7.414573,4,7,8.006368,0,9.277999,1,48,2304
10.59663,9.032409,7.867871,4,6,8.29405,0,9.305651,2,83,6889
10.43412,8.517193,7.042286,4,6,8.29405,0,9.350102,1,58,3364
11.06507,9.21034,7.035269,4,5,8.29405,0,9.384294,1,11,121
10.69195,9.21034,7.532624,4,5,8.29405,0,9.400961,1,48,2304
10.7364,9.159047,7.484369,4,6,8.006368,0,9.21034,3,78,6084


## Cross-Validation

Cross-validation is used for two purposes:

1. _Model Assessment_ - the process of evaluating a model's performance.
2. _Model Selection_ - the process of selecting the proper level of flexibility for a model.

Since models are 'trained' on a training data set, they are by construction tuned to fit the observations in that set (they are fit by minimizing in-sample errors), so the in-sample error tends to understate the error on new data. Because the validation set was not used to fit the model, its observations can be used to assess the model's performance and, therefore, to do model selection.

In [3]:
## installing the 'caret' package if not previously installed
if (!require(caret)) install.packages('caret')

## loading the packages
library(caret)

## continue pre-processing
## converting every categorical variable to numerical using dummy variables
dmy <- dummyVars(" ~ .", data = hprice3_processed,fullRank = T)
hprice3_processed <- data.frame(predict(dmy, newdata = hprice3_processed))

## looking at the structure of the processed data set
str(hprice3_processed)

Loading required package: caret
Loading required package: lattice
Loading required package: ggplot2


'data.frame':	321 obs. of  16 variables:
 $ lprice: num  11 10.6 10.4 11.1 10.7 ...
 $ lland : num  8.43 9.03 8.52 9.21 9.21 ...
 $ larea : num  7.41 7.87 7.04 7.04 7.53 ...
 $ nbh.1 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ nbh.2 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ nbh.3 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ nbh.4 : num  1 1 1 1 1 1 1 1 1 1 ...
 $ nbh.5 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ nbh.6 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ rooms : num  7 6 6 5 5 6 6 6 8 5 ...
 $ cbd   : num  8.01 8.29 8.29 8.29 8.29 ...
 $ y81.1 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ldist : num  9.28 9.31 9.35 9.38 9.4 ...
 $ baths : num  1 2 1 1 1 3 2 2 2 2 ...
 $ age   : num  48 83 58 11 48 78 22 78 42 41 ...
 $ agesq : num  2304 6889 3364 121 2304 ...


### The Validation Set Approach

![validation_set_approach.png](img/validation_set_approach.png)

A schematic display of the validation set approach. A set of $n$ observations is randomly split into a training set (shown in blue, containing observations 7, 22, and 13, among others) and a validation set (shown in beige, and containing observation 91, among others). The statistical learning method is fit on the training set, and its performance is evaluated on the validation set.

This random split can be repeated many (say $J$) times, so that model assessment measures such as $RMSE$, $\bar{R}^2$, $C_p$, $BIC$, and $AIC$ are each calculated $J$ times.
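The repeated splitting described here can be sketched in base R. This is only an illustration under stated assumptions: the data frame `sim` is simulated (it is not `hprice3`), and the choices `J = 50` and an 80/20 split are arbitrary.

```r
## repeated validation-set approach on simulated data (illustrative only)
set.seed(1)

## simulated data: a hypothetical stand-in for a real data set
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - 0.5 * sim$x2 + rnorm(n, sd = 0.5)

J <- 50                  ## number of random splits (arbitrary choice)
rmse_j <- numeric(J)     ## one validation RMSE per split

for (j in seq_len(J)) {
  ## draw 80% of the rows at random for training
  train_idx <- sample(n, size = floor(0.8 * n))
  ## fit on the training rows only
  fit <- lm(y ~ ., data = sim[train_idx, ])
  ## predict the held-out rows and store the validation RMSE
  pred <- predict(fit, newdata = sim[-train_idx, ])
  rmse_j[j] <- sqrt(mean((sim$y[-train_idx] - pred)^2))
}

## the spread shows how much the estimate depends on the particular split
summary(rmse_j)
```

Looking at the whole vector `rmse_j`, rather than a single value, gives a sense of how variable a single validation-set estimate can be.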

<ins>*Note*</ins>: Recall that if we have a sample $\{y_1,y_2,\ldots,y_m\}$ for which we have predicted $\{\widehat{y}_1,\widehat{y}_2,\ldots,\widehat{y}_m\}$, then the $MSE$ for these samples is defined as

$$
 \begin{align*}
 MSE = \frac{1}{m}\sum_{j=1}^m (y_j-\widehat{y}_j)^2\text{,}
 \end{align*}
$$ 
 and the $RMSE$ is simply defined as $RMSE=\sqrt{MSE}$.
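
As a quick check of these definitions, the two measures can be written as one-line R functions. The names `mse` and `rmse` are hypothetical helpers introduced here for illustration, and the numbers are made up so that the result can be verified by hand.

```r
## helper functions (hypothetical names) implementing the definitions above
mse  <- function(obs, pred) mean((obs - pred)^2)
rmse <- function(obs, pred) sqrt(mse(obs, pred))

## small hand-checkable example with m = 4 (made-up numbers)
y     <- c(3, 5, 2, 7)
y_hat <- c(2, 5, 4, 6)

mse(y, y_hat)    ## (1 + 0 + 4 + 1) / 4 = 1.5
rmse(y, y_hat)   ## sqrt(1.5)
```

The `RMSE` entry reported by `postResample()` in the example below is this same quantity computed on the validation set.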
    

In [4]:
## set the seed for reproducibility
set.seed(42)

## splitting processed data set into two parts based on outcome: 80% and 20%
index <- createDataPartition(hprice3_processed$lprice, p=0.80, list=FALSE)
trainSet <- hprice3_processed[ index,]
validationSet <- hprice3_processed[-index,]

## checking the structure of trainSet
str(trainSet)

## checking the structure of validationSet
str(validationSet)

'data.frame':	259 obs. of  16 variables:
 $ lprice: num  11 10.6 11.1 10.7 10.9 ...
 $ lland : num  8.43 9.03 9.21 9.21 9.29 ...
 $ larea : num  7.41 7.87 7.04 7.53 7.44 ...
 $ nbh.1 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ nbh.2 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ nbh.3 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ nbh.4 : num  1 1 1 1 1 1 1 1 1 0 ...
 $ nbh.5 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ nbh.6 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ rooms : num  7 6 5 5 6 6 8 5 6 5 ...
 $ cbd   : num  8.01 8.29 8.29 8.29 8.29 ...
 $ y81.1 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ldist : num  9.28 9.31 9.38 9.4 9.37 ...
 $ baths : num  1 2 1 1 2 2 2 2 1 1 ...
 $ age   : num  48 83 11 48 22 78 42 41 78 38 ...
 $ agesq : num  2304 6889 121 2304 484 ...
'data.frame':	62 obs. of  16 variables:
 $ lprice: num  10.4 10.7 10.8 10.9 11.2 ...
 $ lland : num  8.52 9.16 9.44 8.97 9.5 ...
 $ larea : num  7.04 7.48 7.16 7.15 7.65 ...
 $ nbh.1 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ nbh.2 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ nbh.3 : num  0 0 0 0 0 0 0 

**__Example__**: Linear Regression

In [5]:
## train a linear regression model
model <- lm(lprice~., data=trainSet)
summary(model)

## making predictions
predictions <- predict(model, validationSet)

## summarize results
postResample(pred = predictions, obs = validationSet$lprice)


Call:
lm(formula = lprice ~ ., data = trainSet)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.68114 -0.10188  0.01122  0.10738  0.72961 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.235e+00  5.587e-01  12.950  < 2e-16 ***
lland        1.040e-01  2.695e-02   3.859 0.000146 ***
larea        3.062e-01  5.451e-02   5.618 5.27e-08 ***
nbh.1       -4.947e-02  5.055e-02  -0.979 0.328732    
nbh.2       -7.857e-02  4.142e-02  -1.897 0.059047 .  
nbh.3       -1.922e-01  9.419e-02  -2.040 0.042430 *  
nbh.4       -1.064e-02  5.474e-02  -0.194 0.846007    
nbh.5       -1.234e-01  5.540e-02  -2.227 0.026838 *  
nbh.6       -9.216e-02  4.809e-02  -1.916 0.056502 .  
rooms        5.508e-02  1.788e-02   3.081 0.002300 ** 
cbd         -5.870e-02  5.723e-02  -1.026 0.306081    
y81.1        3.540e-01  2.548e-02  13.889  < 2e-16 ***
ldist        6.478e-02  6.614e-02   0.979 0.328359    
baths        1.184e-01  2.981e-02   3.972 9.41e-05 ***
age    

### Leave-One-Out Cross-Validation

![LOOCV.png](img/LOOCV.png)

A schematic display of **LOOCV**. A set of $n$ data points is repeatedly split into a training set (shown in blue) containing all but one observation, and a validation set that contains only that observation (shown in beige). The test error is then estimated by averaging the $n$ resulting $MSE$’s. The first training set contains all but observation 1, the second training set contains all but observation 2, and so forth.

Notice that a total of $n$ training data sets containing exactly $n-1$ observations have been constructed along with $n$ corresponding validation data sets containing exactly $1$ observation each.

<ins>*Note*</ins>: Recall that in this case we will be using the training data set $\{(y_1,\mathbf{x}_1^\prime),\ldots,(y_{i-1},\mathbf{x}_{i-1}^\prime),(y_{i+1},\mathbf{x}_{i+1}^\prime),\ldots,(y_n,\mathbf{x}_n^\prime)\}$ to predict $y_i$ for $i=1,\ldots,n$. Therefore the $MSE$ is defined as before for pairs $\{(y_1,\widehat{y}_1),\ldots,(y_n,\widehat{y}_n)\}$, i.e., $MSE_i=(y_i-\widehat{y}_i)^2$, $MSE=n^{-1}\sum_{i=1}^{n}MSE_i$, and $RMSE=\sqrt{MSE}$.
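
To make the procedure concrete, here is a minimal base R sketch of LOOCV written as an explicit loop: each observation is held out once, the model is fit on the remaining $n-1$ observations, and the squared prediction errors are averaged. The data frame `sim` is simulated and purely illustrative; it is an assumption of this sketch, not the housing data used in this notebook.

```r
## manual LOOCV for a linear regression on simulated data (illustrative only)
set.seed(1)

## simulated data: a hypothetical stand-in for a real data set
n <- 60
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 2 * sim$x + rnorm(n, sd = 0.5)

mse_i <- numeric(n)      ## squared prediction error for each held-out point

for (i in seq_len(n)) {
  ## fit on all observations except the i-th
  fit <- lm(y ~ x, data = sim[-i, ])
  ## predict the single held-out observation
  pred_i <- predict(fit, newdata = sim[i, , drop = FALSE])
  mse_i[i] <- (sim$y[i] - pred_i)^2
}

loocv_mse  <- mean(mse_i)      ## average of the n squared errors
loocv_rmse <- sqrt(loocv_mse)
loocv_rmse
```

The loop fits the model $n$ times, which is why LOOCV can be slow for large data sets or expensive models.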

**__Example__**: Linear Regression

In [6]:
## define training control
train_control <- trainControl(method="LOOCV")

## train the model
model <- train(lprice~., data=hprice3_processed, trControl=train_control, method="glm")

## summarize results
print(model)

Generalized Linear Model 

321 samples
 15 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 320, 320, 320, 320, 320, 320, ... 
Resampling results:

  RMSE       Rsquared   MAE      
  0.2104742  0.7687971  0.1481018



### $k$-Fold Cross-Validation

![k_fold_CV.PNG](img/k_fold_CV.PNG)

A schematic display of 5-fold CV. A set of $n$ observations is randomly split into five non-overlapping groups. Each of these fifths acts as a validation set (shown in beige), and the remainder as a training set (shown in blue). The test error is estimated by averaging the five resulting $MSE$ estimates.

Notice that this approach involves randomly dividing the set of observations into $k$ groups (called __folds__) of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining $k-1$ folds. The procedure is then repeated so that each of the $k$ folds serves as the validation set exactly once.

<ins>*Note*</ins>: For each fold $j=1,\ldots,k$, notice that we can calculate the corresponding $MSE_j$ using the validation set for that fold. Therefore, we will have $MSE_1,\ldots,MSE_k$, so $MSE=k^{-1}\sum_{j=1}^{k}MSE_j$, and $RMSE=\sqrt{MSE}$.
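
The fold-by-fold calculation can be sketched in base R as follows. As before, this is an illustration under stated assumptions: `sim` is simulated data, `fold_id` is a hypothetical helper vector, and $k=5$ is chosen only to match the schematic above.

```r
## manual k-fold CV for a linear regression on simulated data (illustrative only)
set.seed(1)

## simulated data: a hypothetical stand-in for a real data set
n <- 100
k <- 5
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 2 * sim$x + rnorm(n, sd = 0.5)

## randomly assign each row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

mse_j <- numeric(k)      ## one validation MSE per fold

for (j in seq_len(k)) {
  held_out <- fold_id == j
  ## fit on the other k - 1 folds
  fit <- lm(y ~ x, data = sim[!held_out, ])
  ## evaluate on the held-out fold
  pred <- predict(fit, newdata = sim[held_out, ])
  mse_j[j] <- mean((sim$y[held_out] - pred)^2)
}

cv_mse  <- mean(mse_j)   ## average of the k fold-level MSEs
cv_rmse <- sqrt(cv_mse)
cv_rmse
```

Note that `caret` summarizes resampling by averaging the per-fold $RMSE$ values, which can differ slightly from the square root of the averaged $MSE$ computed here.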

**__Example__**: Linear Regression

In [7]:
## define training control
train_control <- trainControl(method="cv", number=10)

## set the seed for reproducibility
set.seed(42)

## train the model
model <- train(lprice~., data=hprice3_processed, trControl=train_control, method="glm")

## summarize results
print(model)

Generalized Linear Model 

321 samples
 15 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 291, 288, 288, 288, 289, 289, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  0.209414  0.7779253  0.149471

