## Using Validation/Cross-Validation For Model Selection

This notebook demonstrates two typical workflows for using validation data to select models. It also demonstrates some utility methods, such as generating **polynomial features** and **scaling features**.

**Notebook Contents**

> 1. Simple preprocessing
> 2. Basic validation method: Train/validation/test
> 3. Rigorous validation method: Cross-validation/test
> 4. Making CV less manual via scikit-learn

## 1. Preprocessing

In [1]:
#Data loading: cars data set (using car characteristics to predict the price)
import pandas as pd
import numpy as np

df=pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data', header=None)

columns= ['symboling','normalized-losses','make','fuel-type',
          'aspiration','num-of-doors','body-style','drive-wheels',
          'engine-location','wheel-base','length','width','height',
          'curb-weight','engine-type','num-of-cylinders','engine-size',
          'fuel-system','bore','stroke','compression-ratio','horsepower',
          'peak-rpm','city-mpg','highway-mpg','price']
df.columns=columns

In [2]:
df.head(3)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500


We're going to simplify things a bit by focusing on the numeric columns.

In [3]:
#simple cleaning
df = df.replace('?', np.nan).dropna().reset_index(drop=True)  # np.NaN alias was removed in NumPy 2.0
df['price'] = df['price'].astype(float)

cars = df.select_dtypes(exclude=['object']).copy()

# cars['make'] = df['make']
cars.head(3)

Unnamed: 0,symboling,wheel-base,length,width,height,curb-weight,engine-size,compression-ratio,city-mpg,highway-mpg,price
0,2,99.8,176.6,66.2,54.3,2337,109,10.0,24,30,13950.0
1,2,99.4,176.6,66.4,54.3,2824,136,8.0,18,22,17450.0
2,1,105.8,192.7,71.4,55.7,2844,136,8.5,19,25,17710.0


Now we're ready to start modeling! We're going to try out the validation process to choose between 2 models: simple linear regression and linear regression with 2nd-degree polynomial features.

## 2. Simple Validation Method: Train / Validation / Test

Here we will break the data into 3 portions: 60% for training, 20% for validation (used to select the model), 20% for final testing evaluation.

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder

X, y = cars.drop('price',axis=1), cars['price']

# hold out 20% of the data for final testing
X_, X_test, y_, y_test = train_test_split(X, y, test_size=.2, random_state=10)

---
**Exercise**: using `train_test_split` and random state 3, further partition `X_`, `y_` (the 80% not held out for testing) into datasets `X_train`, `y_train` (60% of the original data) and `X_val`, `y_val` (20%).

Hint: you will need to adjust the `test_size` parameter.

---

In [5]:
# YOUR SOLUTION HERE
X_train, X_val, y_train, y_val = train_test_split(X_, y_, test_size=.25, random_state=3)
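
The `test_size=.25` in the second split works because 25% of the remaining 80% is 20% of the original data. Here is a minimal sketch that checks the arithmetic on synthetic stand-in data; the names `X_demo`, `y_demo`, and `split_sizes` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (100 rows), used only to check the split sizes
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# First split: hold out 20% of all rows for final testing
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Second split: 25% of the remaining 80% equals 20% of the original
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

split_sizes = (len(X_tr), len(X_va), len(X_te))
print(split_sizes)  # (60, 20, 20)
```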

### Create Scaled Features

In [6]:
lm = LinearRegression()

#Feature scaling for train, val, and test
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train.values)
X_val_scaled = scaler.transform(X_val.values)
X_test_scaled = scaler.transform(X_test.values)
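
The scaler is fit on the training data only and then reused for validation and test, so no information from the held-out rows leaks into the transform. A small sketch on synthetic data (all `*_demo` names are assumptions made for illustration) shows the effect: the training features end up with mean 0 and standard deviation 1, while the validation features are only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic train/validation features, for illustration only
rng = np.random.default_rng(0)
X_tr_demo = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
X_va_demo = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr_demo)  # learns mean/std from train only
X_va_scaled = scaler_demo.transform(X_va_demo)      # reuses the train mean/std

# Train columns are exactly standardized; validation columns only roughly so
print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
print(X_va_scaled.mean(axis=0).round(3), X_va_scaled.std(axis=0).round(3))
```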

### Create Polynomial Features

In [8]:
#Feature transforms for train, val, and test so that we can run our poly model on each
poly = PolynomialFeatures(degree=2)

X_train_poly = poly.fit_transform(X_train.values)
X_val_poly = poly.transform(X_val.values)
X_test_poly = poly.transform(X_test.values)

lm_poly = LinearRegression()
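
To see what `PolynomialFeatures(degree=2)` produces, here is a small sketch on a single made-up row with two features `a` and `b`. It also counts the output columns for 10 input features, which is how many numeric predictors `cars` has once `price` is dropped. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One observation with features a=2, b=3
row = np.array([[2.0, 3.0]])
poly_demo = PolynomialFeatures(degree=2)
expanded = poly_demo.fit_transform(row)
print(expanded)  # columns are 1, a, b, a^2, a*b, b^2 -> [[1. 2. 3. 4. 6. 9.]]

# Number of output columns when there are 10 input features
n_cols_10 = PolynomialFeatures(degree=2).fit(np.zeros((1, 10))).n_output_features_
print(n_cols_10)  # 66
```

With 10 inputs, the expansion yields 66 columns, which is a lot of parameters to fit on a small training set.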

Now we can train, validate, and test.

In [9]:
#validate

lm.fit(X_train, y_train)
print(f'Linear Regression val R^2: {lm.score(X_val, y_val):.3f}')

lm_poly.fit(X_train_poly, y_train)
print(f'Degree 2 polynomial regression val R^2: {lm_poly.score(X_val_poly, y_val):.3f}')

Linear Regression val R^2: 0.352
Degree 2 polynomial regression val R^2: -2.099


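Check out that negative R^2: a sign of severe overfitting! On the validation data, the degree 2 model predicts worse than simply guessing the mean validation price.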
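
A negative score is possible because R^2 is computed as 1 - SS_res / SS_tot, which drops below zero whenever the model's squared error exceeds that of always predicting the mean. A tiny sketch with made-up numbers:

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]

# Always predicting the mean gives SS_res == SS_tot, so R^2 is 0
r2_mean = r2_score(y_true, [2.0, 2.0, 2.0])

# Predictions worse than the mean: SS_res = 8, SS_tot = 2, so R^2 = 1 - 4 = -3
r2_bad = r2_score(y_true, [3.0, 2.0, 1.0])

print(r2_mean, r2_bad)  # 0.0 -3.0
```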

Having run this validation step, we see that the evidence points to simple linear regression being the better model, so our validation process lets us **select** it. As our final step, we retrain it on the entire chunk of train/val data and see how it does on the test data:

In [10]:
lm.fit(X_,y_)
print(f'Linear Regression test R^2: {lm.score(X_test, y_test):.3f}')

Linear Regression test R^2: 0.858


Not terrible!

This is a pretty solid selection method, but we can make it even more rigorous using **cross-validation**.

---
**Exercise**: Return to the beginning of this workflow (train-test split), and try changing the random state (e.g. to 11) before stepping through all the code blocks.

What happened to the evaluation results, and how would you explain it? Is the evidence about which model we should select the same, or different? 
       
---

## 3. Rigorous Validation Method: Cross-Validation / Test

Here we will break the data into 2 portions: 80% for a cross-validated training process, and 20% for final testing evaluation. 

Remember that the idea of CV is to make efficient use of the data available to us (using 80% instead of 60% above), while also performing multiple validation checks. For k-fold CV, we come up with k train/validation splits of the whole chunk of data, in such a way that **each observation is in the validation set exactly 1 time**. Here's a helpful diagram:

![](images/cross_validation_diagram.png)
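
The "exactly 1 time" property can be checked directly. This sketch runs `KFold` on 10 synthetic rows and counts how often each row lands in a validation fold; `X_demo`, `val_counts`, and `fold_sizes` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 synthetic observations with 2 features each
X_demo = np.arange(20).reshape(10, 2)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

val_counts = np.zeros(len(X_demo), dtype=int)
fold_sizes = []
for train_idx, val_idx in kf_demo.split(X_demo):
    val_counts[val_idx] += 1                      # tally validation appearances
    fold_sizes.append((len(train_idx), len(val_idx)))

print(val_counts)  # every entry is 1
print(fold_sizes)  # five folds of (8 train, 2 validation)
```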

As we loop through our CV folds, we will train and validate each model and collect the results to compare at the end. Note that any feature transform (here, the polynomial expansion) is fit on the training rows of each fold inside the CV loop, never on the validation rows.

In [11]:
from sklearn.model_selection import KFold

X, y = cars.drop('price',axis=1), cars['price']

X, X_test, y, y_test = train_test_split(X, y, test_size=.2, random_state=10) #hold out 20% of the data for final testing

#this helps with the way kf will generate indices below
X, y = np.array(X), np.array(y)

In [12]:
#run the CV

kf = KFold(n_splits=5, shuffle=True, random_state = 71)
cv_lm_r2s = [] #collect the validation results

for train_ind, val_ind in kf.split(X,y):
    
    X_train, y_train = X[train_ind], y[train_ind]
    X_val, y_val = X[val_ind], y[val_ind] 
    
    #simple linear regression
    lm = LinearRegression()

    lm.fit(X_train, y_train)
    cv_lm_r2s.append(round(lm.score(X_val, y_val), 3))

print('Simple regression scores: ', cv_lm_r2s, '\n')

print(f'Simple mean cv r^2: {np.mean(cv_lm_r2s):.3f} +- {np.std(cv_lm_r2s):.3f}')

Simple regression scores:  [0.8, 0.669, 0.868, 0.735, 0.781] 

Simple mean cv r^2: 0.771 +- 0.066


In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state = 71)
cv_lm_poly_r2s = []

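for train_ind, val_ind in kf.split(X,y):

    #select this fold's rows; without this, stale values from the previous cell are reused
    X_train, y_train = X[train_ind], y[train_ind]
    X_val, y_val = X[val_ind], y[val_ind]

    #poly with degree 2
    poly = PolynomialFeatures(degree=2)

    X_train_poly = poly.fit_transform(X_train)
    X_val_poly = poly.transform(X_val)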

    lm_poly = LinearRegression()
    
    lm_poly.fit(X_train_poly, y_train)
    cv_lm_poly_r2s.append(round(lm_poly.score(X_val_poly, y_val), 3))
    
print('Poly scores: ', cv_lm_poly_r2s, '\n')

print(f'Poly mean cv r^2: {np.mean(cv_lm_poly_r2s):.3f} +- {np.std(cv_lm_poly_r2s):.3f}')

---
**Exercise (time permitting)**: Modify the cross-validation loop above so that we also fit degree 3 and degree 4 polynomial regression models, and track their validation mean squared error instead of R^2.

---

## 4. K-fold, in a Less Manual Way with Scikit-learn

The k-fold loop we created above required a chunk of code, but we usually expect sklearn to make things much simpler. It can here too, at the expense of fine-grained control over exactly what happens within the k-fold loop. That control is nice when we want to scale each training set within the loop or apply other transformations, but often we can keep things simple and use the code below.

In [None]:
from sklearn.model_selection import cross_val_score
lm = LinearRegression()

cross_val_score(lm, X, y, # estimator, features, target
                cv=5, # number of folds 
                scoring='r2') # scoring metric
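
One possible middle ground, not used in this notebook, is to wrap the transforms and the model in a scikit-learn `Pipeline`. `cross_val_score` then refits every step on each fold's training rows. The sketch below uses synthetic data, and `pipe`, `X_demo`, `y_demo`, and `pipe_scores` are assumed names for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic term, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=100)

# Scaling and polynomial expansion are refit inside every fold
pipe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), LinearRegression())

kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=kf_demo, scoring='r2')
print(pipe_scores.round(3))
```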

We could also recreate the exact partitioning we used for the manual version by passing a KFold object to `cross_val_score` and using the same random state. The scores below should match the manual simple regression scores above, apart from the rounding to 3 decimals applied there.

In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state = 71)
cross_val_score(lm, X, y, cv=kf, scoring='r2')
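
The equivalence between the manual loop and `cross_val_score` can be verified on a small example. This is a sketch on synthetic data rather than the cars data, and the `*_demo` and `*_scores` names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, for illustration only
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 2))
y_demo = 3.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=60)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=71)

# Manual loop: fit on each training fold, score on its validation fold
manual_scores = []
for train_idx, val_idx in kf_demo.split(X_demo, y_demo):
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))

# Same splitter passed to cross_val_score yields the same folds and scores
auto_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                              cv=kf_demo, scoring='r2')

print(np.allclose(manual_scores, auto_scores))  # True
```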