# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll be able to practice your cross-validation skills!


## Objectives

You will be able to:

- Compare the cross-validation results with those from standard holdout validation
- Apply 5-fold cross-validation for regression

## Let's get started

This time, let's only include the variables that were previously selected using recursive feature elimination. The preprocessing code is included below.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_boston

boston = load_boston()

boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
b = boston_features["B"]
logdis = np.log(boston_features["DIS"])
loglstat = np.log(boston_features["LSTAT"])

# minmax scaling
boston_features["B"] = (b-min(b))/(max(b)-min(b))
boston_features["DIS"] = (logdis-min(logdis))/(max(logdis)-min(logdis))

#standardization
boston_features["LSTAT"] = (loglstat-np.mean(loglstat))/np.sqrt(np.var(loglstat))
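
The two transforms in the cell above are min-max scaling and standardization. As a minimal sketch, assuming synthetic log-normal values in place of the Boston columns, the manual formulas can be checked against scikit-learn's `MinMaxScaler` and `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic positive values standing in for a skewed feature (an assumption,
# not the Boston data)
rng = np.random.default_rng(0)
values = rng.lognormal(mean=1.0, sigma=0.5, size=100)
log_values = np.log(values)

# Manual transforms, using the same formulas as the preprocessing cell
manual_minmax = (log_values - log_values.min()) / (log_values.max() - log_values.min())
manual_standard = (log_values - log_values.mean()) / np.sqrt(np.var(log_values))

# scikit-learn transformers expect a 2-D array of shape (n_samples, n_features)
column = log_values.reshape(-1, 1)
sk_minmax = MinMaxScaler().fit_transform(column).ravel()
sk_standard = StandardScaler().fit_transform(column).ravel()

print(np.allclose(manual_minmax, sk_minmax))
print(np.allclose(manual_standard, sk_standard))
```

Both approaches use the population variance, which is why the standardized values agree.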

In [2]:
y = boston_features[['CHAS', 'RM', 'DIS', 'B', 'LSTAT']]
X = pd.DataFrame(boston.target, columns = ['Target'])

In [3]:
y.head()

| | CHAS | RM | DIS | B | LSTAT |
|---|---|---|---|---|---|
| 0 | 0.0 | 6.575 | 0.542096 | 1.0 | -1.27526 |
| 1 | 0.0 | 6.421 | 0.623954 | 1.0 | -0.263711 |
| 2 | 0.0 | 7.185 | 0.623954 | 0.989737 | -1.627858 |
| 3 | 0.0 | 6.998 | 0.707895 | 0.994276 | -2.153192 |
| 4 | 0.0 | 7.147 | 0.707895 | 1.0 | -1.162114 |


In [4]:
X.head()

| | Target |
|---|---|
| 0 | 24.0 |
| 1 | 21.6 |
| 2 | 34.7 |
| 3 | 33.4 |
| 4 | 36.2 |


## Train test split

Perform a train-test split, holding out 20% of the data as the test set.

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

In [7]:
print(len(X_train), len(X_test), len(y_train), len(y_test))

404 102 404 102
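
Because the split is random, the errors computed from it change from run to run. The sketch below, assuming synthetic arrays of the same length as the Boston data and an arbitrary seed of 42, shows how the `random_state` parameter of `train_test_split` makes the split repeatable. The names `feat_train`, `feat_test`, `tgt_train`, and `tgt_test` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with 506 rows, matching the size of the Boston data
rng = np.random.default_rng(0)
features = rng.normal(size=(506, 5))
target = rng.normal(size=506)

# Two calls with the same seed produce identical splits
split_a = train_test_split(features, target, test_size=0.20, random_state=42)
split_b = train_test_split(features, target, test_size=0.20, random_state=42)

feat_train, feat_test, tgt_train, tgt_test = split_a
print(len(feat_train), len(feat_test), len(tgt_train), len(tgt_test))
print(np.array_equal(split_a[1], split_b[1]))
```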


Fit the model and apply it to make test set predictions.

In [8]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()

linreg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [9]:
y_hat_test = linreg.predict(X_test)

Calculate the residuals and the mean squared error

In [10]:
#test residuals
test_residuals = y_hat_test - y_test

In [11]:
#mean squared error
from sklearn.metrics import mean_squared_error

# mean_squared_error averages the squared residuals over all test observations
test_mse = mean_squared_error(y_test, y_hat_test)
test_mse

0.15540699229592286
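
The mean squared error is the average of the squared residuals. A small worked example, using four made-up values rather than the lab data, shows that the manual calculation and `mean_squared_error` agree:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true values and predictions, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residuals are predictions minus true values; squaring removes the sign
residuals = y_pred - y_true
manual_mse = np.mean(residuals ** 2)

sk_mse = mean_squared_error(y_true, y_pred)
print(manual_mse, sk_mse)
```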

In [12]:
y_hat_train = linreg.predict(X_train)

In [13]:
from sklearn.metrics import mean_squared_error
train_mse = mean_squared_error(y_train, y_hat_train)
train_mse

0.1481496506376991

## Cross-Validation: let's build it from scratch!

### Create a cross-validation function

Write a function `kfolds` that splits a dataset into k evenly sized pieces.
If the size of the dataset is not divisible by k, make the first few folds one larger than the later ones.

We want the folds to be a list of subsets of data!
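
The sizing rule can be checked with a little arithmetic. In this sketch, `fold_sizes` is a hypothetical helper (not part of the lab) that computes the expected fold sizes with `divmod` and compares them with what `np.array_split` produces for 506 rows:

```python
import numpy as np

def fold_sizes(n_samples, k):
    """Return k fold sizes; the first n_samples % k folds get one extra row."""
    base, extra = divmod(n_samples, k)
    return [base + 1 if i < extra else base for i in range(k)]

# 506 rows split into 5 folds: 506 = 5 * 101 + 1, so one fold is larger
sizes = fold_sizes(506, 5)
print(sizes)

# np.array_split applies the same rule to the row positions
split_sizes = [len(part) for part in np.array_split(np.arange(506), 5)]
print(split_sizes)
```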

In [14]:
import numpy as np

def kfolds(data, k):
    # np.array_split returns k pieces; when len(data) is not divisible by k,
    # the first len(data) % k pieces each get one extra row
    return np.array_split(data, k)

In [15]:
data = pd.concat([X, y], axis=1)
data.head()

| | Target | CHAS | RM | DIS | B | LSTAT |
|---|---|---|---|---|---|---|
| 0 | 24.0 | 0.0 | 6.575 | 0.542096 | 1.0 | -1.27526 |
| 1 | 21.6 | 0.0 | 6.421 | 0.623954 | 1.0 | -0.263711 |
| 2 | 34.7 | 0.0 | 7.185 | 0.623954 | 0.989737 | -1.627858 |
| 3 | 33.4 | 0.0 | 6.998 | 0.707895 | 0.994276 | -2.153192 |
| 4 | 36.2 | 0.0 | 7.147 | 0.707895 | 1.0 | -1.162114 |


In [16]:
data_folds = kfolds(data, 5)
len(data_folds)

5

In [17]:
for fold in data_folds:
    print(len(fold))

102
101
101
101
101


### Apply it to the Boston Housing Data

In [18]:
# Make sure to concatenate the data again
# fold 0 is my testing data
# folds 1 through 4 are my training data

n = 0
train_list = []
for counts, fold in enumerate(data_folds):
    if counts != n:
        train_list.append(fold)
df_train = pd.concat(train_list)
df_test = data_folds[n]

In [19]:
#validate for loop
df_train.shape, df_test.shape

((404, 6), (102, 6))

In [20]:
df_train.head()

| | Target | CHAS | RM | DIS | B | LSTAT |
|---|---|---|---|---|---|---|
| 102 | 18.6 | 0.0 | 6.405 | 0.369415 | 0.17772 | -0.012136 |
| 103 | 19.3 | 0.0 | 6.137 | 0.369415 | 0.993873 | 0.378596 |
| 104 | 20.1 | 0.0 | 6.167 | 0.321174 | 0.989384 | 0.235 |
| 105 | 19.5 | 0.0 | 5.851 | 0.262627 | 0.992814 | 0.71727 |
| 106 | 19.5 | 0.0 | 5.836 | 0.282946 | 0.996898 | 0.925237 |


### Perform a linear regression for each fold, and calculate the training and test error

For each of the 5 folds, hold that fold out as the test set, fit a linear regression on the remaining four folds, and calculate the training and test error.

In [25]:
test_errs = []
train_errs = []
k=5
"""
for n in range(k):
    # Split in train and test for the fold
    train_list = []  
    for counts, fold in enumerate(data_folds):
        if counts != n:
            train_list.append(fold)
    train = pd.concat(train_list)
    test = data_folds[n]
    # Fit a linear regression model
    linreg.fit(train[['CHAS','RM','DIS','B','LSTAT']], train['Target'])
    #Evaluate Train Errors
    y_hat_train = linreg.predict(train[['CHAS','RM','DIS','B','LSTAT']])
    train_errs.append(mean_squared_error(train['Target'], y_hat_train))
    #Evaluate Test Errors
    y_hat_test = linreg.predict(test[['CHAS','RM','DIS','B','LSTAT']])
    test_errs.append(mean_squared_error(test['Target'], y_hat_test))
"""
for n in range(k):
    # Split in train and test for the fold
    train = pd.concat([fold for i, fold in enumerate(data_folds) if i!=n])
    test = data_folds[n]
    # Fit a linear regression model
    linreg.fit(train[X.columns], train[y.columns])
    #Evaluate Train and Test Errors
    y_hat_train = linreg.predict(train[X.columns])
    y_hat_test = linreg.predict(test[X.columns])
    train_residuals = y_hat_train - train[y.columns]
    test_residuals = y_hat_test - test[y.columns]
    train_errs.append(np.mean(train_residuals.astype(float)**2))
    test_errs.append(np.mean(test_residuals.astype(float)**2))
    
print(train_errs)
print(test_errs)

[CHAS     0.076459
RM       0.298541
DIS      0.045952
B        0.055956
LSTAT    0.337361
dtype: float64, CHAS     0.061679
RM       0.284789
DIS      0.047775
B        0.051807
LSTAT    0.352374
dtype: float64, CHAS     0.034227
RM       0.270183
DIS      0.045638
B        0.056858
LSTAT    0.300048
dtype: float64, CHAS     0.061159
RM       0.165309
DIS      0.039422
B        0.052248
LSTAT    0.311905
dtype: float64, CHAS     0.077343
RM       0.235382
DIS      0.052012
B        0.010213
LSTAT    0.361438
dtype: float64]
[CHAS     0.008306
RM       0.081285
DIS      0.061652
B        0.013365
LSTAT    0.344748
dtype: float64, CHAS     0.065699
RM       0.137891
DIS      0.049313
B        0.028273
LSTAT    0.273731
dtype: float64, CHAS     0.178331
RM       0.244500
DIS      0.055843
B        0.008887
LSTAT    0.491101
dtype: float64, CHAS     0.068169
RM       0.634277
DIS      0.078371
B        0.033133
LSTAT    0.436644
dtype: float64, CHAS     0.003165
RM       0.357259
DIS     
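
For comparison, here is a compact sketch of the same fold-by-fold procedure that records one scalar MSE per fold. It assumes a small synthetic dataset with hypothetical columns `x1`, `x2`, and `target`, so its numbers are unrelated to the output above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: a linear signal plus a little noise (an assumption)
rng = np.random.default_rng(0)
features = ["x1", "x2"]
df = pd.DataFrame(rng.normal(size=(53, 2)), columns=features)
df["target"] = 3.0 * df["x1"] - 2.0 * df["x2"] + rng.normal(scale=0.1, size=53)

# Split the row positions into k folds of near-equal size
k = 5
fold_indices = np.array_split(np.arange(len(df)), k)

train_errs, test_errs = [], []
for test_idx in fold_indices:
    # The current fold is the test set; all other rows form the training set
    test = df.iloc[test_idx]
    train = df.drop(df.index[test_idx])
    model = LinearRegression().fit(train[features], train["target"])
    train_errs.append(mean_squared_error(train["target"], model.predict(train[features])))
    test_errs.append(mean_squared_error(test["target"], model.predict(test[features])))

print(np.mean(train_errs), np.mean(test_errs))
```

Splitting row positions rather than the DataFrame itself keeps each fold an ordinary index array that can be passed to `iloc`.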

## Cross-Validation using Scikit-Learn

This was a bit of work! Now, let's perform 5-fold cross-validation to get the mean squared error through scikit-learn. Let's have a look at the five individual MSEs and explain what's going on.

In [23]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

cv_5_results = cross_val_score(linreg, X, y, cv=5, scoring="neg_mean_squared_error")
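
As for what is going on here: for a regressor, passing an integer `cv=5` makes `cross_val_score` use an unshuffled `KFold` with five splits, and the `"neg_mean_squared_error"` scorer returns the MSE with its sign flipped so that larger scores are better. The sketch below, assuming synthetic data, checks that `cv=5` and an explicit `KFold(n_splits=5)` give the same scores:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
features = rng.normal(size=(60, 3))
target = features @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

model = LinearRegression()

# Integer cv and an explicit unshuffled KFold produce the same folds
scores_int = cross_val_score(model, features, target, cv=5,
                             scoring="neg_mean_squared_error")
scores_kfold = cross_val_score(model, features, target, cv=KFold(n_splits=5),
                               scoring="neg_mean_squared_error")

print(scores_int)
print(np.allclose(scores_int, scores_kfold))
```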

Next, calculate the mean of the MSE over the 5 folds and compare it with the result from the train-test split.

In [24]:
cv_5_results

array([-0.10187142, -0.11098152, -0.19573232, -0.2501188 , -0.1741998 ])
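
The scores are negative because scikit-learn reports the negated MSE. A short sketch of the remaining step, copying the five scores and the holdout test MSE printed earlier in this lab, flips the sign and averages:

```python
import numpy as np

# Scores copied from the cross_val_score output above
cv_scores = np.array([-0.10187142, -0.11098152, -0.19573232, -0.2501188, -0.1741998])

# Flip the sign to recover the MSE of each fold, then average over the folds
cv_mse_per_fold = -cv_scores
cv_mse_mean = cv_mse_per_fold.mean()

# Test MSE from the single train-test split earlier in the lab
holdout_test_mse = 0.15540699229592286

print(cv_mse_mean, holdout_test_mse)
```

The cross-validated mean is roughly 0.167, against 0.155 for the single split, and the individual folds range from about 0.10 to 0.25. That spread shows how much a single holdout estimate can depend on which rows end up in the test set.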

## Summary

Congratulations! You have now practiced k-fold cross-validation!