# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll practice your cross-validation skills!


## Objectives

You will be able to:

- Apply 5-fold cross-validation for regression
- Compare the results with normal holdout validation

## Let's get started

This time, let's only include the variables that were previously selected using recursive feature elimination. The preprocessing code is included below.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston

boston = load_boston()

boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
b = boston_features["B"]
logdis = np.log(boston_features["DIS"])
loglstat = np.log(boston_features["LSTAT"])

# Min-max scaling
boston_features["B"] = (b - min(b)) / (max(b) - min(b))
boston_features["DIS"] = (logdis - min(logdis)) / (max(logdis) - min(logdis))

# Standardization
boston_features["LSTAT"] = (loglstat - np.mean(loglstat)) / np.sqrt(np.var(loglstat))
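The two transformations above can be checked on a toy column. This is a minimal sketch, not part of the lab data: the `values` Series is made up for illustration, and it assumes the population standard deviation (`ddof=0`), which is what `np.sqrt(np.var(...))` computes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the values are made up for illustration.
values = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0])

# Min-max scaling maps the smallest value to 0 and the largest to 1.
minmax_scaled = (values - values.min()) / (values.max() - values.min())

# Standardization (here after a log transform) centers the column on 0
# and rescales it to unit variance.
log_values = np.log(values)
standardized = (log_values - log_values.mean()) / log_values.std(ddof=0)

print(minmax_scaled.min(), minmax_scaled.max())
print(standardized.mean(), standardized.std(ddof=0))
```

The printed mean should be zero up to floating-point error, and the standard deviation should be one.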

In [2]:
boston_features.head()

|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 0.542096 | 1.0 | 296.0 | 15.3 | 1.0 | -1.27526 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 0.623954 | 2.0 | 242.0 | 17.8 | 1.0 | -0.263711 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 0.623954 | 2.0 | 242.0 | 17.8 | 0.989737 | -1.627858 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 0.707895 | 3.0 | 222.0 | 18.7 | 0.994276 | -2.153192 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 0.707895 | 3.0 | 222.0 | 18.7 | 1.0 | -1.162114 |


In [3]:
X = boston_features[['CHAS', 'RM', 'DIS', 'B', 'LSTAT']]
y = boston.target

## Train test split

Perform a train-test split that holds out 20% of the data as the test set (`test_size = 0.2`).

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
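Because the split above is random, the errors you get will change from run to run. The sketch below shows one way to make a split repeatable by fixing `random_state`. It uses synthetic stand-in data (`X_demo`, `y_demo` are assumptions with the same number of rows and columns as `X`), and the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: 506 rows and 5 features, like X above).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(506, 5))
y_demo = rng.normal(size=506)

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# The two calls return identical test sets.
print(np.array_equal(split_a[1], split_b[1]))
```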

Fit the model and apply it to make predictions on both the training and test sets.

In [8]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_hat_train = linreg.predict(X_train)
y_hat_test = linreg.predict(X_test)

Calculate the residuals and the mean squared error

In [9]:
from sklearn.metrics import mean_squared_error

train_residuals = y_hat_train - y_train
test_residuals = y_hat_test - y_test

train_mse = mean_squared_error(y_train, y_hat_train)
test_mse = mean_squared_error(y_test, y_hat_test)
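The residuals and the mean squared error are directly related: the MSE is the mean of the squared residuals. A small sketch with made-up numbers (the `y_pred` values are hypothetical, not model output) illustrates this:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical values chosen for illustration only.
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.6, 33.7, 35.4])

# Residuals are predictions minus observed values.
residuals = y_pred - y_true

# The MSE is the average of the squared residuals.
mse_manual = np.mean(residuals ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```

Both numbers should agree up to floating-point error.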

## Cross-Validation: let's build it from scratch!

### Create a cross-validation function

Write a function `kfolds` that splits a dataset into k evenly sized pieces.
If the size of the full dataset is not divisible by k, make the first few folds one larger than the later ones.

The function should return the folds as a list of subsets of the data.

In [10]:
len(boston_features)

506

In [29]:
def kfolds(data, k):
    # Force data into a pandas DataFrame
    data = pd.DataFrame(data)
    # Split the row positions into k chunks; np.array_split makes the first
    # len(data) % k chunks one row larger to account for leftovers
    row_splits = np.array_split(np.arange(len(data)), k)
    subsets = [data.iloc[rows] for rows in row_splits]
    return subsets
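To see how the leftover rows are distributed, here is a minimal sketch of the fold sizes `np.array_split` produces for 506 rows and 5 folds. It works on row positions only, so no dataset is needed.

```python
import numpy as np

n_rows, k = 506, 5

# Split the row positions 0..505 into k chunks.
folds = np.array_split(np.arange(n_rows), k)
fold_sizes = [len(fold) for fold in folds]
print(fold_sizes)  # [102, 101, 101, 101, 101]

# The same sizes from plain arithmetic: the first n_rows % k folds
# get one extra row.
base, extra = divmod(n_rows, k)
expected_sizes = [base + 1 if i < extra else base for i in range(k)]
print(expected_sizes == fold_sizes)  # True
```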

### Apply it to the Boston Housing Data

In [61]:
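# Split features and target into the same 5 folds, then join them again.
# Wrapping y in a Series that shares X's index keeps the rows aligned in the join.
features_subsets = kfolds(X, 5)
target_subsets = kfolds(pd.Series(y, index=X.index), 5)
full_subsets = []
for i in range(5):
    full_subset_i = features_subsets[i].join(target_subsets[i])
    full_subsets.append(full_subset_i)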

In [62]:
# Inspect the combined folds
full_subsets

[     CHAS     RM       DIS         B     LSTAT     0
 0     0.0  6.575  0.542096  1.000000 -1.275260  24.0
 1     0.0  6.421  0.623954  1.000000 -0.263711  21.6
 2     0.0  7.185  0.623954  0.989737 -1.627858  34.7
 3     0.0  6.998  0.707895  0.994276 -2.153192  33.4
 4     0.0  7.147  0.707895  1.000000 -1.162114  36.2
 5     0.0  6.430  0.707895  0.992990 -1.200048  28.7
 6     0.0  6.012  0.671500  0.996722  0.248456  22.9
 7     0.0  6.172  0.700059  1.000000  0.968416  27.1
 8     0.0  5.631  0.709276  0.974104  1.712312  16.5
 9     0.0  6.004  0.743201  0.974305  0.779802  18.9
 10    0.0  6.377  0.727217  0.988956  1.077829  15.0
 11    0.0  6.009  0.719175  1.000000  0.357391  18.9
 12    0.0  5.889  0.663113  0.983862  0.638571  21.7
 13    0.0  5.949  0.601338  1.000000 -0.432353  20.4
 14    0.0  6.096  0.578763  0.957436 -0.071152  18.2
 15    0.0  5.834  0.582214  0.996772 -0.390531  19.9
 16    0.0  5.935  0.582214  0.974658 -0.811149  23.1
 17    0.0  5.990  0.559046 

### Perform a linear regression for each fold, and calculate the training and test error

For each of the 5 folds, hold that fold out as the test set, fit a linear regression on the remaining 4 folds, and calculate the training and test mean squared error.

In [70]:
test_errs = []
train_errs = []
k = 5

for i in range(k):
    # Hold out fold i as the test set and train on the remaining k-1 folds
    X_train = pd.concat([fold for j, fold in enumerate(features_subsets) if j != i])
    y_train = pd.concat([fold for j, fold in enumerate(target_subsets) if j != i])
    X_test = features_subsets[i]
    y_test = target_subsets[i]
    # Fit a linear regression model
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_hat_train = linreg.predict(X_train)
    y_hat_test = linreg.predict(X_test)
    # Evaluate train and test errors as mean squared errors
    train_errs.append(mean_squared_error(y_train, y_hat_train))
    test_errs.append(mean_squared_error(y_test, y_hat_test))

print(train_errs)
print(test_errs)
print(np.mean(train_errs))
print(np.mean(test_errs))
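For reference, the idea of k-fold cross-validation (hold out one fold, train on the rest, repeat for every fold) can also be sketched with scikit-learn's `KFold` splitter. This is an illustrative sketch on synthetic data, not the lab's solution: `X_demo`, `y_demo`, and the coefficients are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption): a linear signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(506, 5))
coefs = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_demo = X_demo @ coefs + rng.normal(scale=0.5, size=506)

# Without shuffling, KFold yields contiguous folds.
kf = KFold(n_splits=5)
train_mses, test_mses, test_sizes = [], [], []

for train_idx, test_idx in kf.split(X_demo):
    # Train on 4 folds, evaluate on the held-out fold.
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    train_mses.append(mean_squared_error(y_demo[train_idx], model.predict(X_demo[train_idx])))
    test_mses.append(mean_squared_error(y_demo[test_idx], model.predict(X_demo[test_idx])))
    test_sizes.append(len(test_idx))

print(test_sizes)
print(np.mean(train_mses), np.mean(test_mses))
```

Every row ends up in a test fold exactly once, which is what distinguishes k-fold cross-validation from repeating a random holdout split.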

In [77]:
# For comparison: repeat a random 80/20 holdout split 5 times and average the MSEs
num = 5
train_err = []
test_err = []
for i in range(num):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    linreg.fit(X_train, y_train)
    y_hat_train = linreg.predict(X_train)
    y_hat_test = linreg.predict(X_test)
    train_err.append(mean_squared_error(y_train, y_hat_train))
    test_err.append(mean_squared_error(y_test, y_hat_test))
print(np.mean(train_err))
print(np.mean(test_err))

21.55884350861665
22.85860308071082


## Cross-Validation using Scikit-Learn

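This was a bit of work! Now, let's perform 5-fold cross-validation with scikit-learn's `cross_val_score` to get the mean squared error. Have a look at the five individual MSEs and explain what's going on. Note that the `neg_mean_squared_error` scorer returns the negative of the MSE, because scikit-learn scorers follow the convention that higher values are better.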

In [71]:
from sklearn.model_selection import cross_val_score

cv_5_results = cross_val_score(linreg, X, y, cv=5, scoring="neg_mean_squared_error")
print(cv_5_results)

[-13.40514492 -17.4440168  -37.03271139 -58.27954385 -26.09798876]
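The values are negative because of the scorer, not because the errors themselves are negative. A minimal sketch on synthetic data (`X_demo` and `y_demo` are assumptions for illustration) shows how to flip the sign to recover ordinary per-fold MSEs:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): a linear signal plus noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

# The scorer returns negated MSEs, one per fold.
cv_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)

# Negate to get the usual non-negative MSE for each fold.
fold_mses = -cv_scores
print(cv_scores)
print(fold_mses)
```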


Next, calculate the mean of the MSE over the 5 cross-validation folds, and compare it with the result from the train-test split.

In [68]:
cv_5_mean = np.mean(cross_val_score(linreg, X, y, cv=5, scoring="neg_mean_squared_error"))
cv_5_mean

-30.451881143540316
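When comparing this number with the holdout result, keep in mind that `cv=5` splits a regression dataset into contiguous, unshuffled folds, while `train_test_split` shuffles the rows first. One plausible reason for a gap between the two is that the rows are stored in some meaningful order. The sketch below illustrates the effect on synthetic data only; the sorted, curved data set is an assumption constructed to make the difference visible, and it says nothing definitive about the Boston data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (assumption): a curved relationship, with rows sorted by x
# so that contiguous folds cover different ranges of the feature.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 3, size=500))
X_demo = x.reshape(-1, 1)
y_demo = x ** 2 + rng.normal(scale=0.1, size=500)

model = LinearRegression()

# cv=5 uses contiguous, unshuffled folds for a regressor.
contiguous = -cross_val_score(
    model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)

# Shuffling first spreads every range of x across all folds.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled = -cross_val_score(
    model, X_demo, y_demo, cv=shuffled_cv, scoring="neg_mean_squared_error"
)

print("contiguous folds:", contiguous.mean())
print("shuffled folds:  ", shuffled.mean())
```

Passing a `KFold` object with `shuffle=True` as the `cv` argument is one way to make the cross-validation folds more comparable to a shuffled holdout split.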

## Summary

Congratulations! You have now practiced applying k-fold cross-validation!