# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll practice your cross-validation skills!


## Objectives

You will be able to:

- Apply 5-fold cross-validation for regression
- Compare the results with normal holdout validation

## Let's get started

This time, let's only include the variables that were previously selected using recursive feature elimination. The preprocessing code is included below.

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
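# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this import needs an older version
from sklearn.datasets import load_boston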

In [14]:
boston = load_boston()
boston

{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
         4.9800e+00],
        [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
         9.1400e+00],
        [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
         4.0300e+00],
        ...,
        [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         5.6400e+00],
        [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
         6.4800e+00],
        [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         7.8800e+00]]),
 'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
        18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
        15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
        13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
        21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
        35.4, 24.7, 3

In [10]:
boston_features = pd.DataFrame(boston.data, columns=boston.feature_names)

In [11]:
b = boston_features["B"]
logdis = np.log(boston_features["DIS"])
loglstat = np.log(boston_features["LSTAT"])

# Min-max scaling: rescale B and log(DIS) to the [0, 1] range
boston_features["B"] = (b-min(b))/(max(b)-min(b))
boston_features["DIS"] = (logdis-min(logdis))/(max(logdis)-min(logdis))

# Standardization: center log(LSTAT) at 0 with unit variance
boston_features["LSTAT"] = (loglstat-np.mean(loglstat))/np.sqrt(np.var(loglstat))

In [13]:
X = boston_features.values
y = boston.target
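
As a sanity check on the preprocessing above, here is a minimal sketch comparing the manual min-max scaling and standardization formulas with scikit-learn's `minmax_scale` and `scale` helpers. The array `col` is a made-up stand-in for a positive-valued feature such as `DIS` or `LSTAT`. It is not the Boston data itself.

```python
import numpy as np
from sklearn.preprocessing import minmax_scale, scale

# Hypothetical positive-valued column (an assumption, not the Boston data)
rng = np.random.default_rng(0)
col = rng.uniform(1.0, 10.0, size=50)
log_col = np.log(col)

# Manual formulas, mirroring the preprocessing cell
manual_minmax = (log_col - log_col.min()) / (log_col.max() - log_col.min())
manual_standard = (log_col - log_col.mean()) / np.sqrt(np.var(log_col))

# scikit-learn equivalents; both use the population variance (ddof=0),
# which is also the default of np.var on an array
sk_minmax = minmax_scale(log_col)
sk_standard = scale(log_col)

print(np.allclose(manual_minmax, sk_minmax))
print(np.allclose(manual_standard, sk_standard))
```

If both checks print `True`, the hand-written formulas and the library helpers agree on this toy column.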

## Train-test split

Perform a train-test split with a test set of 0.33 (33% of the data), matching the `test_size=0.33` used in the code below.

In [15]:
from sklearn.model_selection import train_test_split

In [17]:
train_test_split?

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Fit the model on the training set and use it to make predictions on the test set.

In [20]:
from sklearn.linear_model import LinearRegression

In [23]:
lr = LinearRegression()
lr.fit(X_train, y_train)
lr.coef_

array([-1.85667366e-01,  8.24767234e-03,  1.89271612e-03,  2.45224546e+00,
       -1.91625620e+01,  2.59799984e+00, -6.30186326e-03, -1.59106206e+01,
        2.73458477e-01, -9.87193204e-03, -8.38510918e-01,  3.61756172e+00,
       -5.42446777e+00])

Calculate the residuals and the mean squared error on the test set.

In [25]:
y_pred = lr.predict(X_test)

In [26]:
residuals = y_test - y_pred

In [29]:
squared_error = residuals ** 2
squared_error

array([2.55603527e+01, 4.07293906e+01, 4.96932191e-01, 2.05180490e+01,
       5.34471813e-01, 1.99872851e+00, 4.77619199e+00, 3.96034033e-01,
       2.79646357e+00, 7.91314989e+00, 1.42379308e+00, 5.90953143e+00,
       8.47112264e+01, 1.83750266e+00, 1.16036787e-03, 3.67081626e+01,
       8.35988523e-01, 1.39787761e+00, 6.29087804e+01, 2.51831505e+00,
       5.65597199e+00, 2.98707961e-04, 6.14028416e+00, 7.48276971e-01,
       1.04492814e+01, 3.08589925e+00, 5.19218447e+00, 5.35031335e+00,
       2.62639023e+00, 6.36138792e-03, 8.45571737e-01, 2.02725478e+00,
       7.83527558e+01, 3.00563651e+00, 1.10124941e+01, 5.69680595e+00,
       1.78634161e+00, 8.83621744e-02, 9.37622241e+00, 2.08770956e+00,
       3.58289542e+01, 8.22255610e+00, 3.99171820e+01, 3.45171139e-01,
       1.73881868e+01, 2.67542077e+00, 9.18321534e-02, 4.88119270e+00,
       2.87321518e+00, 5.31524944e+01, 1.42337511e-01, 2.87234038e-02,
       1.99295540e+00, 2.52600994e-01, 6.97472072e-01, 1.32552741e-01,
      

In [30]:
mse = squared_error.mean()
mse

16.359027030943192
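
The manual calculation above (square the residuals, then average them) should agree with scikit-learn's `mean_squared_error`. Here is a small sketch of that check on made-up numbers. `y_true_demo` and `y_pred_demo` are hypothetical values, not the lab's test set.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions (assumptions for illustration only)
y_true_demo = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_demo = np.array([2.5, 5.0, 4.0, 8.0])

# Manual MSE: residuals -> squared residuals -> mean
residuals_demo = y_true_demo - y_pred_demo
manual_mse = (residuals_demo ** 2).mean()

# Same quantity computed by scikit-learn
sklearn_mse = mean_squared_error(y_true_demo, y_pred_demo)

print(manual_mse, sklearn_mse)
```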

## Cross-Validation: let's build it from scratch!

### Create a cross-validation function

Write a function `kfolds` that splits a dataset into k evenly sized pieces.
If the size of the dataset is not divisible by k, make the first few folds one larger than the later ones.

The function should return the folds as a list of subsets of the data.

In [None]:
def kfolds(data, k):
    # Force data as pandas dataframe
    # add 1 to fold size to account for leftovers           
    return None
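
One possible way to fill in the function is sketched below. It is only one approach, and it assumes the folds should be consecutive blocks of rows with no shuffling. The toy DataFrame `toy` is a hypothetical example.

```python
import pandas as pd

def kfolds(data, k):
    """Split data into k consecutive folds.

    The first len(data) % k folds get one extra row.
    """
    # Force data into a pandas DataFrame
    data = pd.DataFrame(data)
    base_size, leftovers = divmod(len(data), k)

    folds = []
    start = 0
    for i in range(k):
        # Add 1 to the fold size while there are leftover rows to distribute
        size = base_size + (1 if i < leftovers else 0)
        folds.append(data.iloc[start:start + size])
        start += size
    return folds

# Toy example: 13 rows split into 5 folds
toy = pd.DataFrame({"x": range(13)})
fold_sizes = [len(fold) for fold in kfolds(toy, 5)]
print(fold_sizes)  # [3, 3, 3, 2, 2]
```

Because the rows are taken in order, it may be worth shuffling the data first if the dataset is sorted in some meaningful way.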

### Apply it to the Boston Housing Data

In [None]:
# Make sure to concatenate the data again
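
A minimal sketch of the concatenation step follows, assuming the goal is a single DataFrame that holds the features and the target side by side so that each fold carries both. `X_demo` and `y_demo` are hypothetical stand-ins for the lab's `X` and `y`, and the column names are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the feature matrix and target vector
X_demo = np.arange(12, dtype=float).reshape(6, 2)
y_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Put features and target in one DataFrame, column-wise
data_demo = pd.concat(
    [pd.DataFrame(X_demo, columns=["f0", "f1"]), pd.Series(y_demo, name="target")],
    axis=1,
)

print(data_demo.shape)  # (6, 3)
```

The combined DataFrame can then be passed to a fold-splitting function, and the target column separated out again within each fold.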

### Perform a linear regression for each fold, and calculate the training and test error

For each fold, hold that fold out as the test set, fit a linear regression on the remaining folds, and calculate the training and test mean squared error.

In [None]:
test_errs = []
train_errs = []
k=5

for n in range(k):
    # Split in train and test for the fold
    train = None
    test = None
    # Fit a linear regression model
    
    # Evaluate train and test errors

# print(train_errs)
# print(test_errs)
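
The loop might look roughly like the sketch below. To stay self-contained it uses synthetic data and builds the folds with `np.array_split`, which also makes the first few chunks one element larger when the length is not divisible by k. The names `data_demo`, `folds_demo`, and the `target` column are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic regression data (an assumption, not the Boston data)
rng = np.random.default_rng(42)
features = rng.normal(size=(103, 3))
target = features @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=103)
data_demo = pd.DataFrame(features, columns=["f0", "f1", "f2"])
data_demo["target"] = target

k = 5
# Consecutive index chunks; the first len(data) % k chunks are one larger
fold_indices = np.array_split(np.arange(len(data_demo)), k)
folds_demo = [data_demo.iloc[idx] for idx in fold_indices]

train_errs = []
test_errs = []
for n in range(k):
    # Fold n is the test set; all other folds form the training set
    train = pd.concat([fold for i, fold in enumerate(folds_demo) if i != n])
    test = folds_demo[n]
    X_tr, y_tr = train.drop(columns="target"), train["target"]
    X_te, y_te = test.drop(columns="target"), test["target"]

    # Fit a linear regression model on the training folds
    model = LinearRegression().fit(X_tr, y_tr)

    # Evaluate train and test errors (mean squared error)
    train_errs.append(mean_squared_error(y_tr, model.predict(X_tr)))
    test_errs.append(mean_squared_error(y_te, model.predict(X_te)))

print(train_errs)
print(test_errs)
```

Each row ends up in the test set exactly once, so the five test errors together cover the whole dataset.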

## Cross-Validation using Scikit-Learn

This was a bit of work! Now, let's perform 5-fold cross-validation to get the mean squared error through scikit-learn. Let's have a look at the five individual MSEs and explain what's going on.

Next, calculate the mean of the MSE over the 5 folds, then compare and contrast it with the result from the train-test split.
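
One way to do this with scikit-learn is `cross_val_score` with `scoring="neg_mean_squared_error"`, sketched here on synthetic data from `make_regression` rather than the Boston data. scikit-learn scorers follow a "higher is better" convention, so the function returns negative MSE values and the sign has to be flipped. The names `X_demo` and `y_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (an assumption, not the Boston data)
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# 5-fold cross-validation; scores are negative MSEs, one per fold
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign to get the five individual MSEs, then average them
fold_mse = -neg_mse
mean_mse = fold_mse.mean()

print(fold_mse)
print(mean_mse)
```

With an integer `cv` and a regressor, the folds are consecutive and unshuffled, so the individual MSEs can differ noticeably when the data has some ordering. The mean over the folds is usually a more stable estimate than a single holdout split.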

## Summary

Congratulations! You have now practiced k-fold cross-validation, both from scratch and with scikit-learn.