# MATH 3375 Examples Notebook #8

# Validation and Cross-Validation



### Examples

We will demonstrate validation and cross-validation with the housing data set. We will build a model that predicts a house's price from three predictors: square footage, number of full baths, and bike score.

In [None]:
# Look at the data set
houses <- read.csv("HousingBrief.csv")
head(houses)

## Partitioning the Data

First we will demonstrate a simple partition for training and testing. We will use a random 70% of the data for training and save the rest for testing. 

To make the process as transparent as possible, we start by determining the number of data points (rows) in the data set.

In [None]:
nrow(houses)

#### Each row has an index, from 1 to 104

In [None]:
houses_idx <- 1:nrow(houses)
houses_idx


### Randomly Sample the Row Numbers

We can use R's **sample** command to randomly select 70% of the rows.  For this example, that is 73 rows.
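
The figure of 73 comes from rounding 70% of 104. Here is a quick check of that arithmetic, with the row count typed in directly rather than read from the data (the names `n_rows` and `n_train` are illustrative and are not used elsewhere in this notebook):

```r
n_rows  <- 104                    # number of rows reported by nrow(houses)
n_train <- round(0.70 * n_rows)   # 72.8 rounds to 73
n_train
n_rows - n_train                  # rows left over for testing
```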

Note that when running a random process (like a sample) it is best practice to start with the **set.seed** command. This ensures that the code will produce the same results every time you run it. We want this to be the case to ensure **_reproducibility_**. In the examples for this class, I will use the course number (3375), but any number could be used as the "seed".

In [None]:
set.seed(3375)

train_rows <- sort(sample(houses_idx, 73))
train_rows
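
To see what the seed buys us, here is a small sketch (the object names are illustrative): resetting the same seed before each draw should reproduce the same sample.

```r
set.seed(3375)
first_draw <- sample(1:104, 5)

set.seed(3375)                       # reset to the same seed
second_draw <- sample(1:104, 5)

identical(first_draw, second_draw)   # TRUE: same seed, same draw
```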


#### The remaining rows (NOT selected as training rows) are the test rows.

In [None]:
test_rows <- houses_idx[!(houses_idx %in% train_rows)]
test_rows
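
Another way to get the leftover row numbers is base R's **setdiff**, which returns the elements of its first argument that do not appear in the second. The sketch below compares the two approaches on a stand-alone index vector (the names are illustrative):

```r
idx <- 1:104
set.seed(3375)
picked <- sort(sample(idx, 73))

rest_a <- idx[!(idx %in% picked)]   # logical-index approach used above
rest_b <- setdiff(idx, picked)      # set-difference approach

identical(rest_a, rest_b)           # both give the same row numbers
length(rest_b)                      # 104 - 73 = 31
```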

### Use the Selected Row Numbers to Partition the Data into Training and Test Data Sets

In [None]:
train_data  <- houses[train_rows, ]

nrow(train_data)
head(train_data)

In [None]:
test_data   <- houses[test_rows, ]

nrow(test_data)
head(test_data)

#### Other Methods

R also has add-on packages (e.g., **caTools**, **dplyr**) with functions that can create train/test splits of a data set.

Below is another popular method that uses base R (no other package needed). It is simpler to code, but slightly less precise: each row is independently assigned to the training set with probability 0.7, so the size of the training set is random rather than a specific _number_ of rows as above.

In [None]:
set.seed(3375)

# Use a name other than "sample" so the vector does not mask the sample() function
in_train <- sample(c(TRUE, FALSE), nrow(houses), replace=TRUE, prob=c(0.7,0.3))
train_data_alt  <- houses[in_train, ]
test_data_alt   <- houses[!in_train, ]

nrow(train_data_alt)
nrow(test_data_alt)
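
To illustrate why the training-set size is not fixed under this approach, the sketch below repeats the probabilistic split with a few arbitrary seeds and records how many rows land in the training set each time. It uses the row count 104 directly instead of the housing data, and the helper name `count_train` is made up for this illustration.

```r
# Hypothetical helper: size of the training set for one probabilistic split
count_train <- function(seed, n_rows = 104) {
  set.seed(seed)
  in_train <- sample(c(TRUE, FALSE), n_rows, replace = TRUE, prob = c(0.7, 0.3))
  sum(in_train)                   # number of rows flagged for training
}

train_sizes <- sapply(1:5, count_train)
train_sizes                       # sizes vary around 0.7 * 104 = 72.8
```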

## Train Model with Training Set

We will proceed with our original training and test set.  Below we create a model using only the training data set.

In [None]:
price_model <- lm(price2014 ~ squarefeet + full_baths + bikescore, data=train_data)
summary(price_model)

### In-Sample Error

We'll compute the Mean Square Error (MSE) for the data points in the training set.  This is the **_in-sample_** error.

Notice that we can take advantage of the fact that R has already calculated the residuals ($y - \widehat{y}$) for every point in the training set.  We can get these values using the **resid** function.  

The MSE is simply the average of the squared residuals.

In [None]:
n <- nrow(train_data)
# Named train_resids so the vector does not mask the residuals() function
train_resids <- resid(price_model)
MSE_train <- sum(train_resids^2)/n

MSE_train
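
Because the MSE is an average, the same value can be obtained with **mean**. The sketch below checks this on R's built-in `mtcars` data, used here only as a stand-in so the example runs without the housing file. The model formula is likewise just for illustration.

```r
# Stand-in model on built-in data (not the housing model)
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
demo_res <- resid(demo_fit)

mse_sum  <- sum(demo_res^2) / nrow(mtcars)   # sum divided by n
mse_mean <- mean(demo_res^2)                 # mean of squared residuals

all.equal(mse_sum, mse_mean)                 # TRUE
```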

### Out-Of-Sample Error 

We should be more interested in the MSE for the test set, since it was NOT used to train the model.  This is the **_out-of-sample_** error. To get the residuals for those data points, we need to do the following:

1. Compute the model predictions
2. Compute the residual for each data point arithmetically ($y - \widehat{y}$)
3. Compute the mean of the squared residuals

In [None]:
test_preds <- predict(price_model, newdata = test_data)
test_preds

In [None]:
test_resids <- test_data$price2014 - test_preds
test_resids

In [None]:
n <- nrow(test_data)
test_MSE <- sum(test_resids^2)/n
test_MSE
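
The three steps above can be collected into a small helper. This is a sketch: the function name `mse` is hypothetical (not from any package), and the built-in `mtcars` data stands in for the housing data so the example runs on its own. The split here is by row position rather than random, purely to keep the example short.

```r
# Hypothetical helper: mean squared error from actual and predicted values
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

# Stand-in data: first 22 rows to fit, remaining 10 rows held out
demo_train <- mtcars[1:22, ]
demo_test  <- mtcars[23:32, ]
demo_model <- lm(mpg ~ wt + hp, data = demo_train)

mse(demo_train$mpg, predict(demo_model, newdata = demo_train))  # in-sample
mse(demo_test$mpg,  predict(demo_model, newdata = demo_test))   # out-of-sample
```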

### Compare In-Sample and Out-of-Sample Error

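As the results show, the in-sample error is much lower than the out-of-sample error. This is to be expected: the model was fit to the training data, so its coefficients were chosen to minimize the squared error on exactly those points. The out-of-sample error gives a better estimate of the error we can expect when the model is applied to new data.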

### Cross-Validation 

The **caret** package is one of the most popular R packages for performing cross-validation. However, it cannot be installed in this JupyterHub environment, so the code below is shown for reference only. It is the code that would be used to conduct 10-fold cross-validation with the **caret** library.

    library(caret)

    set.seed(3375)
    train_control <- trainControl(method = "cv", number = 10)

    model <- train(price2014 ~ squarefeet + full_baths + bikescore, data = train_data, method = "lm", trControl = train_control)

    print(model)

When the model is printed, the reported metrics are the _average_ out-of-sample errors across the 10 folds: each fold is held out once while the model is fit to the other nine. The final model is the linear model fit to all of the training data (none held out for validation).

Shown below is a sample of what the output might look like. The numbers are illustrative only: the sample counts do not match the 73-row training set used above. Notice that the metric RMSE is the **_root_** mean square error, which is $\sqrt{MSE}$.

Also note that the $R^2$ is not the same value that would be shown in the model summary (which is based on the training data). Here the metric is computed on the held-out fold in each round of cross-validation, so it measures how well the model explains data it was **_not_** fit to.

    Linear Regression 

    200 samples
      3 predictor

    No pre-processing
    Resampling: Cross-Validated (10 fold) 
    Summary of sample sizes: 91, 90, 90, 91, 90, 90, ... 
    Resampling results:

      RMSE      Rsquared   MAE     
      78.7401   0.6432     53.1476
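
Since **caret** is not available here, the same idea can be sketched by hand in base R. The code below is a minimal illustration, not a reproduction of how **caret** works internally: it uses the built-in `mtcars` data and an arbitrary formula as stand-ins for the housing data and model, and all object names are made up for this example.

```r
# Stand-in data: mtcars ships with R; the housing training set would go here
cv_data <- mtcars
k <- 10

set.seed(3375)
# Assign each row to one of k folds, in random order, with near-equal fold sizes
fold_id <- sample(rep(1:k, length.out = nrow(cv_data)))

fold_rmse <- numeric(k)
for (i in 1:k) {
  fit_rows  <- cv_data[fold_id != i, ]   # k - 1 folds used to fit the model
  held_out  <- cv_data[fold_id == i, ]   # 1 fold held out for validation
  fold_fit  <- lm(mpg ~ wt + hp, data = fit_rows)
  fold_pred <- predict(fold_fit, newdata = held_out)
  fold_rmse[i] <- sqrt(mean((held_out$mpg - fold_pred)^2))
}

fold_rmse        # one out-of-sample RMSE per fold
mean(fold_rmse)  # average out-of-sample RMSE across the folds
```

Every row is held out exactly once, so each per-fold RMSE is computed only on data the corresponding model never saw.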