# 6.4: Resampling Exercises

## Getting Started

### Import Libraries 

We import our standard libraries and specific objects/libraries at the top level of our notebook.

In [1]:
# Load our previous libraries and objects
import numpy as np
import statsmodels.api as sm
from ISLP import load_data
from sklearn.model_selection import train_test_split

# Import new libraries and objects
from sklearn.model_selection import cross_validate, KFold
from sklearn.base import clone
from ISLP.models import sklearn_sm

In [2]:
Auto = load_data('Auto')
Auto_train, Auto_valid = train_test_split(Auto,
                                         test_size=196, # split in two
                                         random_state=0) # random seed

## k-Fold Cross-Validation

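We run cross-validation twice on polynomial fits of degrees one to five. The first cell passes `cv=Auto.shape[0]` to `cross_validate()`, so the number of folds equals the number of observations (leave-one-out CV). A later cell uses `KFold()` to partition the data into $K=10$ random groups, with `random_state` setting the random seed so that the same splits are used for every degree. In both cases we initialize a vector `cv_error` in which we store the CV errors for the five fits.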
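
To make the partition concrete, the following is a minimal NumPy-only sketch of what a shuffled $K$-fold split does: permute the row indices once, then cut them into $K$ nearly equal groups. It is an illustration rather than scikit-learn's implementation, so the fold membership will not match `KFold(..., random_state=0)`. The helper name `kfold_indices` is hypothetical, and the sample size `n = 392` assumes the `Auto` data has 392 rows.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Hypothetical helper: split range(n) into k shuffled validation folds."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)        # shuffle the row indices once
    return np.array_split(perm, k)   # k groups whose sizes differ by at most 1

n, k = 392, 10                       # assumed: Auto has 392 rows
folds = kfold_indices(n, k)
fold_sizes = [len(f) for f in folds]
print(fold_sizes)

# Each fold serves once as the validation set; the other folds form the training set.
valid_idx = folds[0]
train_idx = np.concatenate(folds[1:])
print(len(train_idx), len(valid_idx))
```

Every row index lands in exactly one fold, so each observation is used for validation once and for training $K-1$ times.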

In [6]:
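# From Hollie
# Leave-one-out CV: cv=Auto.shape[0] asks for as many folds as observations
Y = Auto['mpg']
cv_error = np.zeros(5) # final results array, one entry per degree

H = np.array(Auto['horsepower'])
M = sklearn_sm(sm.OLS) # wrap statsmodels OLS so scikit-learn can cross-validate it

for i, d in enumerate(range(1,6)):
    # loop over d=1,2,3,4,5
    # i = 0,1,2,3,4
    X = np.power.outer(H, np.arange(d+1)) # columns H**0, H**1, ..., H**d
    M_CV = cross_validate(M,X,Y,cv=Auto.shape[0])
    cv_error[i] = np.mean(M_CV['test_score'])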

In [33]:
# Check what np.power.outer builds for a small input and degrees 0..3
for i, d in enumerate(range(4)):
    print(i,d,np.power.outer([2,3], np.arange(d+1)))

0 0 [[1]
 [1]]
1 1 [[1 2]
 [1 3]]
2 2 [[1 2 4]
 [1 3 9]]
3 3 [[ 1  2  4  8]
 [ 1  3  9 27]]
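
The printed matrices show that `np.power.outer(H, np.arange(d+1))` builds a design matrix whose columns are the powers $H^0, H^1, \dots, H^d$, so the first column acts as the intercept. As a cross-check, NumPy's `np.vander` with `increasing=True` should produce the same matrix. The toy input `x` below reuses the values `[2, 3]` from the cell above.

```python
import numpy as np

x = np.array([2, 3])   # toy input, same values as the cell above
d = 3                  # polynomial degree

X_outer = np.power.outer(x, np.arange(d + 1))       # columns x**0, x**1, ..., x**d
X_vander = np.vander(x, N=d + 1, increasing=True)   # same columns as a Vandermonde matrix

print(np.array_equal(X_outer, X_vander))
```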


In [31]:
for d in range(10):
    print(d+1)

1
2
3
4
5
6
7
8
9
10


In [3]:
cv_error = np.zeros(5)
H = np.array(Auto['horsepower'])
M = sklearn_sm(sm.OLS)
Y = Auto['mpg']

cv = KFold(n_splits=10,
           shuffle=True,
           random_state=0) # use same splits for each degree
for i, d in enumerate(range(1,6)):
    X = np.power.outer(H, np.arange(d+1))
    M_CV = cross_validate(M,
                          X,
                          Y,
                          cv=cv)
    cv_error[i] = np.mean(M_CV['test_score'])
cv_error

array([24.20766449, 19.18533142, 19.27626666, 19.47848403, 19.13720581])
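
One way to read this result is to locate the smallest error and to look at how much each extra degree changes it. The sketch below copies the printed values into an array, so it runs without refitting, and uses only `np.argmin` and `np.diff`.

```python
import numpy as np

# 10-fold CV errors for degrees 1..5, copied from the output above
cv_error = np.array([24.20766449, 19.18533142, 19.27626666,
                     19.47848403, 19.13720581])

best_degree = int(np.argmin(cv_error)) + 1   # +1 because index 0 is degree 1
changes = np.diff(cv_error)                  # change when going from degree d to d+1

print(best_degree)
print(np.round(changes, 3))
```

Degree 5 has the smallest error here, but only by about 0.05 relative to degree 2. The large improvement, roughly 5 units, comes from moving from the linear to the quadratic fit.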

In [6]:
M_CV

{'fit_time': array([0.0007019 , 0.00044227, 0.00037026, 0.00043511, 0.00037313,
        0.00034714, 0.00033593, 0.00050378, 0.00041604, 0.00034881]),
 'score_time': array([0.00025439, 0.00017691, 0.00015903, 0.00018311, 0.00016308,
        0.00015593, 0.00014687, 0.00024414, 0.00016618, 0.00015903]),
 'test_score': array([15.93830994, 15.01126019, 19.39037386, 24.3137491 , 18.42505858,
        23.36485902, 22.47391839, 24.9714491 , 10.30948691, 17.17359306])}

*These exercises were adapted from:* James, Gareth, et al. *An Introduction to Statistical Learning: with Applications in Python*. Springer, 2023.