# 5.3 Lab: Cross-Validation and the Bootstrap

## 5.3.1 The Validation Set Approach

We use the validation set approach to estimate the test error rates that result from fitting various linear models on the **Auto** data set.

In [1]:
import pandas as pd
import numpy as np

import statsmodels.formula.api as smf

In [2]:
auto = pd.read_csv('../data/Auto.csv',
                  na_values='?')\
         .dropna()\
         .reset_index()
auto.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   index         392 non-null    int64  
 1   mpg           392 non-null    float64
 2   cylinders     392 non-null    int64  
 3   displacement  392 non-null    float64
 4   horsepower    392 non-null    float64
 5   weight        392 non-null    int64  
 6   acceleration  392 non-null    float64
 7   year          392 non-null    int64  
 8   origin        392 non-null    int64  
 9   name          392 non-null    object 
dtypes: float64(4), int64(5), object(1)
memory usage: 30.8+ KB


We begin by using the `DataFrame`'s `sample()` method to split the set of observations into two halves, by selecting a random subset of 196 observations out of the original 392 observations.

To reproduce the result, we can specify the `random_state` option in the `sample()` method.

In [3]:
train_auto = auto.sample(196, random_state=1)
# Rows not drawn for training form the validation set
test_auto = auto.drop(train_auto.index)
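
As a sanity check on this kind of split, the following minimal sketch confirms that the two halves are disjoint and together cover every row. It runs on a synthetic stand-in frame (`demo`, with made-up values) rather than the Auto data, and the names `demo`, `train_demo`, and `test_demo` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Auto data: 392 rows of made-up values.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "horsepower": rng.uniform(50, 200, size=392),
    "mpg": rng.uniform(10, 40, size=392),
})

# Draw 196 rows at random for training; random_state fixes the draw.
train_demo = demo.sample(196, random_state=1)

# Every row that was not drawn goes to the validation set.
test_demo = demo.drop(train_demo.index)

# The halves share no row labels and together cover all 392 rows.
overlap = train_demo.index.intersection(test_demo.index)
print(len(train_demo), len(test_demo), len(overlap))  # prints: 196 196 0
```

Because `sample()` draws without replacement by default, the training index contains no duplicates, so dropping it leaves exactly the other half.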

We then use the `subset` option in `smf.ols()` to fit a linear regression using only the observations corresponding to the training set.

In [4]:
lm_fit_auto = smf.ols('mpg~horsepower',
                     data=auto,
                     subset=train_auto.index).fit()
lm_fit_auto.summary()

Dep. Variable:      mpg                R-squared:            0.62
Model:              OLS                Adj. R-squared:       0.618
Method:             Least Squares      F-statistic:          316.4
Date:               Thu, 07 Jan 2021   Prob (F-statistic):   1.28e-42
Time:               21:36:52           Log-Likelihood:       -592.07
No. Observations:   196                AIC:                  1188.0
Df Residuals:       194                BIC:                  1195.0
Df Model:           1
Covariance Type:    nonrobust

              coef      std err   t         P>|t|   [0.025   0.975]
Intercept     40.3338   1.023     39.416    0.000   38.316   42.352
horsepower    -0.1596   0.009     -17.788   0.000   -0.177   -0.142

Omnibus:          8.393   Durbin-Watson:      1.808
Prob(Omnibus):    0.015   Jarque-Bera (JB):   8.787
Skew:             0.516   Prob(JB):           0.0124
Kurtosis:         2.899   Cond. No.           328.0


We now use the model's `predict()` method to estimate the response and calculate the MSE of the 196 observations in the validation set.

In [5]:
pred_auto = lm_fit_auto.predict(test_auto)
((pred_auto - test_auto['mpg'])**2).mean()

23.36190289258723
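
For reference, the quantity computed in the cell above is the mean squared error over the held-out observations. The notation here is introduced for illustration and is not part of the lab itself: "val" denotes the set of validation rows, $n_{\text{val}}$ its size, and $\hat{y}_i$ the prediction for observation $i$ from the model fit on the training half.

```latex
\mathrm{MSE}_{\text{val}}
  = \frac{1}{n_{\text{val}}} \sum_{i \in \text{val}} \left( y_i - \hat{y}_i \right)^2,
  \qquad n_{\text{val}} = 196
```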

Therefore, the estimated test MSE for the linear regression fit is 23.36. We can use the `np.power()` function inside the model formula to estimate the test error for the quadratic and cubic regressions.

In [6]:
lm_fit2_auto = smf.ols('mpg~horsepower+np.power(horsepower, 2)',
                      data=auto,
                      subset=train_auto.index).fit()
pred2_auto = lm_fit2_auto.predict(test_auto)
((pred2_auto - test_auto['mpg'])**2).mean()

20.252690858350064

In [7]:
lm_fit3_auto = smf.ols('mpg~horsepower+np.power(horsepower, 2)+np.power(horsepower, 3)',
                      data=auto,
                      subset=train_auto.index).fit()
pred3_auto = lm_fit3_auto.predict(test_auto)
((pred3_auto - test_auto['mpg'])**2).mean()

20.32560936589255

These error rates are 20.25 and 20.33, respectively. If we choose a different training set instead, we will obtain somewhat different errors on the validation set.

In [8]:
train_auto = auto.sample(196, random_state=2)
test_auto = auto.drop(train_auto.index)

In [9]:
lm_fit_auto = smf.ols('mpg~horsepower',
                     data=auto,
                     subset=train_auto.index).fit()
pred_auto = lm_fit_auto.predict(test_auto)
((pred_auto - test_auto['mpg'])**2).mean()

25.10853905288965

In [10]:
lm_fit2_auto = smf.ols('mpg~horsepower+np.power(horsepower, 2)',
                      data=auto,
                      subset=train_auto.index).fit()
pred2_auto = lm_fit2_auto.predict(test_auto)
((pred2_auto - test_auto['mpg'])**2).mean()

19.722533470490276

In [11]:
lm_fit3_auto = smf.ols('mpg~horsepower+np.power(horsepower, 2)+np.power(horsepower, 3)',
                      data=auto,
                      subset=train_auto.index).fit()
pred3_auto = lm_fit3_auto.predict(test_auto)
((pred3_auto - test_auto['mpg'])**2).mean()

19.921367860022265

Using this split of the observations into a training set and a validation set, we find that the validation set error rates for the models with linear, quadratic, and cubic terms are 25.11, 19.72, and 19.92, respectively.
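
The dependence of these estimates on the particular split can be made concrete by repeating the validation set approach over several random splits. The sketch below is a minimal illustration under stated assumptions: it uses synthetic data with a curved relationship (not the Auto data), `np.polyfit` in place of `smf.ols()`, and a hypothetical helper named `validation_mse`.

```python
import numpy as np

# Synthetic data with a quadratic relationship (made-up coefficients).
rng = np.random.default_rng(0)
n = 392
x = rng.uniform(-2, 2, size=n)
y = 25 - 6 * x + 2 * x**2 + rng.normal(0, 2, size=n)

def validation_mse(x, y, degree, seed):
    """Fit a polynomial on a random half; return the MSE on the other half."""
    split_rng = np.random.default_rng(seed)
    perm = split_rng.permutation(len(x))
    half = len(x) // 2
    train_idx, val_idx = perm[:half], perm[half:]
    coefs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
    pred = np.polyval(coefs, x[val_idx])
    return float(np.mean((y[val_idx] - pred) ** 2))

# Ten different splits for each polynomial degree.
results = {
    degree: [validation_mse(x, y, degree, seed) for seed in range(10)]
    for degree in (1, 2, 3)
}

for degree, mses in results.items():
    print(f"degree {degree}: mean={np.mean(mses):.2f}, "
          f"min={min(mses):.2f}, max={max(mses):.2f}")
```

The spread between the minimum and maximum for each degree shows how much a single validation set estimate can move from one split to the next.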

## 5.3.2 Leave-One-Out Cross-Validation

The LOOCV estimate can be computed automatically by passing a `LeaveOneOut()` splitter to `cross_val_score()`. Both come from the `model_selection` sub-module of the `scikit-learn` package, which also provides `KFold()` for k-fold cross-validation.
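
For reference, the LOOCV estimate averages the squared prediction error over $n$ fits, each leaving out a single observation. The superscript $(-i)$, marking a prediction from the model fit without observation $i$, is notation assumed here for illustration.

```latex
\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MSE}_i,
  \qquad \mathrm{MSE}_i = \left( y_i - \hat{y}_i^{(-i)} \right)^2
```

With the 392 observations in the Auto data, this means 392 separate fits, one per fold.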

In [17]:
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_score
import sklearn.linear_model as sk_lm

In [20]:
lm = sk_lm.LinearRegression()

X_train = train_auto['horsepower'].values.reshape(-1,1)
y_train = train_auto['mpg']
X_test = test_auto['horsepower'].values.reshape(-1,1)
y_test = test_auto['mpg']

In [31]:
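# Fit on the training half only (not used by the cross-validation below,
# because cross_val_score refits a clone of lm on each fold)
model = lm.fit(X_train, y_train)

loo = LeaveOneOut()
X = auto['horsepower'].values.reshape(-1,1)
y = auto['mpg'].values.reshape(-1,1)

# scikit-learn scorers are maximized, so the MSE is returned negated
scores = cross_val_score(lm, X, y,
                         scoring='neg_mean_squared_error',
                         cv=loo,
                         n_jobs=1)
print(f'Folds: {len(scores)}'
      f'\nMSE: {-scores.mean()}'
      f'\nSTD: {np.std(scores)}')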

Folds: 392
MSE: 24.231513517929226
STD: 36.79731503640535


In [35]:
scores.mean()

-24.231513517929226
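
The negative sign reflects scikit-learn's convention that scores are maximized, so `neg_mean_squared_error` returns the MSE negated. For least squares fits, the LOOCV estimate can also be obtained from a single fit using the leverage values $h_i$, as the mean of $\left((y_i - \hat{y}_i)/(1 - h_i)\right)^2$. The sketch below compares that shortcut against an explicit leave-one-out loop. It is a minimal illustration on synthetic data (not the Auto data) using plain NumPy rather than scikit-learn, and all variable names are hypothetical.

```python
import numpy as np

# Synthetic data loosely shaped like mpg vs. horsepower (made-up values).
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(50, 200, size=n)
y = 40 - 0.15 * x + rng.normal(0, 4, size=n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x])

# Explicit LOOCV: refit n times, each time leaving out one observation.
sq_errors = []
for i in range(n):
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    sq_errors.append((y[i] - X[i] @ beta) ** 2)
loocv_loop = float(np.mean(sq_errors))

# Shortcut: one fit on all data, residuals rescaled by 1 - leverage.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_full
Q, _ = np.linalg.qr(X)              # reduced QR; columns of Q span col(X)
leverage = (Q ** 2).sum(axis=1)     # diagonal of the hat matrix
loocv_shortcut = float(np.mean((resid / (1 - leverage)) ** 2))

print(loocv_loop, loocv_shortcut)
```

The two numbers agree up to floating-point error, which is why LOOCV is inexpensive for linear regression even though it nominally requires one fit per observation.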