# CROSS-VALIDATION FOR PARAMETER TUNING, MODEL SELECTION, AND FEATURE SELECTION

## OVERVIEW
- What is the drawback of using the **train/test split** procedure for model evaluation?
- How does **K-fold cross-validation** overcome this limitation?
- How can cross-validation be used for selecting **tuning parameters**, choosing between **models**, and selecting **features**?
- What are some possible **improvements** to cross-validation?


## REVIEW OF MODEL EVALUATION PROCEDURES 

**Motivation:** Need a way to choose between machine learning models
- Goal is to estimate likely performance of a model on **out-of-sample data**

**Initial idea:** Train and test on the same data
- But, maximizing **training accuracy** rewards overly complex models which **overfit** the training data

**Alternative idea:** Train/test split
- Split the dataset into two pieces, so that the model can be trained and tested on **different data**

**Testing accuracy** is a better estimate than training accuracy of out-of-sample performance
- But, it provides a **high variance** estimate since changing which observations happen to be in the testing set can significantly change testing accuracy


In [None]:
%matplotlib inline

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import seaborn as sn

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
# read in the iris data
iris = load_iris()

In [None]:
# create X (features) and y (response)
X = iris.data
y = iris.target

In [None]:
# use train/test split (changing random_state changes which rows land in each set)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

In [None]:
# check classification accuracy of KNN with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print (metrics.accuracy_score(y_test, y_pred))
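
To make the high-variance point concrete, here is a minimal sketch that repeats the same split-fit-score procedure with several `random_state` values. The ten seeds and the name `split_scores` are illustrative choices, not part of the original walkthrough.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X, y = load_iris(return_X_y=True)

# repeat the same train/test procedure with several random_state values
split_scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    split_scores.append(metrics.accuracy_score(y_test, knn.predict(X_test)))

# the gap between the lowest and highest score shows how much the estimate
# depends on which observations land in the testing set
print(min(split_scores), max(split_scores))
```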

- **Question:** What if we created a bunch of train/test splits, calculated the testing accuracy for each, and averaged the results together?
- **Answer:** That's the essence of cross-validation!

### STEPS FOR K-FOLD CROSS-VALIDATION

1. Split the dataset into K **equal** partitions (or "folds").
2. Use fold 1 as the **testing set** and the union of the other folds as the **training set**.
3. Calculate **testing accuracy**.
4. Repeat steps 2 and 3 K times, using a **different fold** as the testing set each time.
5. Use the **average testing accuracy** as the estimate of out-of-sample accuracy.
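
The steps above can be sketched by hand with `KFold` and a KNN classifier on the iris data. This is only an illustration: the choice of 5 folds, the shuffling, and the names `demo_kf` and `fold_scores` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X, y = load_iris(return_X_y=True)

# split the data into K folds (shuffled, because iris is sorted by class)
demo_kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in demo_kf.split(X):
    # one fold is the testing set, the remaining folds form the training set
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[train_idx], y[train_idx])
    # calculate testing accuracy for this fold
    fold_scores.append(metrics.accuracy_score(y[test_idx], knn.predict(X[test_idx])))

# average the K testing accuracies
print(np.mean(fold_scores))
```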

In [None]:
# simulate splitting a dataset of 25 observations into 5 folds
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=False)

In [None]:
kf

In [None]:
import numpy as np

# Create data
data = np.array(range(1,26))

In [None]:
data

In [None]:
# print the indices of each training and testing set
print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
# use separate loop variables so the `data` array is not overwritten
for iteration, (train_idx, test_idx) in enumerate(kf.split(data)):
    print('{0:^9} {1} {2}'.format(iteration, train_idx, test_idx))

- Dataset contains **25 observations**; the printed values are their row indices (0 through 24)
- 5-fold cross-validation, thus it runs for **5 iterations**
- For each iteration, every observation is either in the training set or the testing set, **but not both**
- Every observation is in the testing set **exactly once**
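
These properties can be checked directly rather than read off the printed output. The following is a small sketch; the names `observations`, `kf_check`, and `test_indices` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

observations = np.arange(25)
kf_check = KFold(n_splits=5, shuffle=False)

test_indices = []
for train_idx, test_idx in kf_check.split(observations):
    # no observation appears in both sets within one iteration
    assert len(np.intersect1d(train_idx, test_idx)) == 0
    # together, the two sets cover the whole dataset
    assert len(train_idx) + len(test_idx) == len(observations)
    test_indices.extend(test_idx)

# across all iterations, each observation is tested exactly once
assert sorted(test_indices) == list(range(25))
```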

### 10-FOLD CROSS-VALIDATION

<img src='./images/10_fold_cv.png'/>

## COMPARING CROSS-VALIDATION TO TRAIN/TEST SPLIT

### Advantages of **cross-validation:**
- More accurate estimate of out-of-sample accuracy
- More "efficient" use of data (every observation is used for both training and testing)

### Advantages of **train/test split:**
- Runs roughly K times faster than K-fold cross-validation, since the model is fit once instead of K times
- Simpler to examine the detailed results of the testing process

## CROSS-VALIDATION EXAMPLE
### PART 1: PARAMETER TUNING

**Goal:** Select the best tuning parameters (aka "hyperparameters") for KNN on the iris dataset

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
# 10-fold cross-validation with K=5 for KNN (the n_neighbors parameter)
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print (scores)

In [None]:
# use average accuracy as an estimate of out-of-sample accuracy
print (scores.mean())

In [None]:
X.shape

In [None]:
# search for an optimal value of K for KNN
k_range = range(1, 31)
k_scores = []
k_score_items = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_score_items.append(scores)
    k_scores.append(scores.mean())
print (k_scores)

In [None]:
# pair each K with its mean accuracy (named to avoid overwriting X and y)
for k, score in zip(k_range, k_scores):
    print(k, score)

In [None]:
len(k_scores)

In [None]:
KNeighborsClassifier?

In [None]:
# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
import matplotlib.pyplot as plt
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
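
Besides reading the best K off the plot, it can be picked programmatically. The sketch below recomputes the scores so that it runs on its own. The tie-breaking rule (prefer the largest K among the top scorers, on the grounds that a larger K gives a smoother model) is an assumption for illustration, and the names `mean_scores`, `best_ks`, and `chosen_k` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

k_values = list(range(1, 31))
mean_scores = np.array([
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_iris, y_iris,
                    cv=10, scoring='accuracy').mean()
    for k in k_values
])

# all values of K whose mean accuracy ties for the maximum
best_ks = [k for k, s in zip(k_values, mean_scores) if np.isclose(s, mean_scores.max())]

# among ties, take the largest K (an assumed rule: smoother, less complex model)
chosen_k = max(best_ks)
print(best_ks, chosen_k)
```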

### PART 2: MODEL SELECTION
**Goal:** Compare the best KNN model with logistic regression on the iris dataset

In [None]:
# create X (features) and y (response)
X = iris.data
y = iris.target

In [None]:
# 10-fold cross-validation with the best KNN model
knn = KNeighborsClassifier(n_neighbors=20)
cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean()

In [None]:
# 10-fold cross-validation with logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
print (cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())
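
When two models score close to each other, the spread of the fold scores helps judge whether the difference in means is meaningful. A possible sketch follows. The explicit `StratifiedKFold` splitter, the `max_iter=1000` setting, and the names `shared_cv`, `candidates`, and `model_means` are assumptions added for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# one explicit splitter shared by both models, so each is scored on identical folds
shared_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

candidates = {
    'knn (K=20)': KNeighborsClassifier(n_neighbors=20),
    # max_iter raised above the default to give the solver room to converge
    'logistic regression': LogisticRegression(max_iter=1000),
}

model_means = {}
for name, model in candidates.items():
    model_scores = cross_val_score(model, X_iris, y_iris, cv=shared_cv, scoring='accuracy')
    model_means[name] = model_scores.mean()
    # report the mean together with the standard deviation across folds
    print(name, model_scores.mean(), model_scores.std())
```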

### PART 3: FEATURE SELECTION 
**Goal**: Decide whether the Newspaper feature should be included in the linear regression model on the advertising dataset


In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

In [None]:
# read in the advertising dataset
data = pd.read_csv('../data/advertising.csv', index_col=0)

In [None]:
# create a Python list of three feature names
feature_cols = ['TV', 'Radio', 'Newspaper']

In [None]:
# use the list to select a subset of the DataFrame (X)
X = data[feature_cols]

In [None]:
# select the Sales column as the response (y)
y = data.Sales

In [None]:
# 10-fold cross-validation with all three features
lm = LinearRegression()
scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')
print (scores)

In [None]:
# fix the sign of MSE scores
mse_scores = -scores
print (mse_scores)

In [None]:
# convert from MSE to RMSE
rmse_scores = np.sqrt(mse_scores)
print (rmse_scores)

In [None]:
# calculate the average RMSE
print (rmse_scores.mean())

In [None]:
# 10-fold cross-validation with two features (excluding Newspaper)
feature_cols = ['TV', 'Radio']
X = data[feature_cols]
print (np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')).mean())
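
The same comparison can be wrapped in a small helper and tried without the CSV file. The sketch below uses synthetic data as a stand-in: the generated values, the coefficients, and the helper name `cv_rmse` are all hypothetical and have nothing to do with the real advertising numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the advertising data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
n = 200
informative = rng.normal(size=(n, 2))    # two features that drive the response
noise_feature = rng.normal(size=(n, 1))  # one feature unrelated to the response
target = 3.0 * informative[:, 0] + 2.0 * informative[:, 1] + rng.normal(scale=1.0, size=n)

def cv_rmse(features, response):
    """Mean RMSE over 10 folds, converting sklearn's negated MSE back to RMSE."""
    neg_mse = cross_val_score(LinearRegression(), features, response,
                              cv=10, scoring='neg_mean_squared_error')
    return np.sqrt(-neg_mse).mean()

rmse_all = cv_rmse(np.hstack([informative, noise_feature]), target)
rmse_subset = cv_rmse(informative, target)
print(rmse_all, rmse_subset)
```

Because RMSE is an error measure, the feature set with the lower mean RMSE is the one to prefer.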

In [None]:
lm.fit(X, y)

In [None]:
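# predict Sales for TV=100 and Radio=200, keeping the column names the model was fit with
lm.predict(pd.DataFrame([[100, 200]], columns=feature_cols))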

## IMPROVEMENTS TO CROSS-VALIDATION

**Repeated cross-validation**
- Repeat cross-validation multiple times (with **different random splits** of the data) and average the results
- More reliable estimate of out-of-sample performance by **reducing the variance** associated with a single trial of cross-validation
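
One way to sketch repeated cross-validation is with scikit-learn's `RepeatedStratifiedKFold` splitter. The 5 repeats, the seed, and the variable names here are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# 10-fold cross-validation repeated 5 times, with a different random split each repeat
repeated_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
repeated_scores = cross_val_score(KNeighborsClassifier(n_neighbors=20), X_iris, y_iris,
                                  cv=repeated_cv, scoring='accuracy')

# one score per fold per repeat, averaged into a single estimate
print(len(repeated_scores), repeated_scores.mean())
```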

**Creating a hold-out set**
- "Hold out" a portion of the data **before** beginning the model building process
- Locate the best model using cross-validation on the remaining data, and test it **using the hold-out set**
- More reliable estimate of out-of-sample performance since hold-out set is **truly out-of-sample**
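
A minimal sketch of the hold-out workflow, assuming a 20% hold-out portion and KNN as the only candidate model. The names `X_build`, `X_holdout`, `cv_means`, and `selected_k` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# set the hold-out data aside before any model building
X_build, X_holdout, y_build, y_holdout = train_test_split(
    X_iris, y_iris, test_size=0.2, stratify=y_iris, random_state=0)

# pick K using cross-validation on the remaining data only
cv_means = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X_build, y_build,
                       cv=10, scoring='accuracy').mean()
    for k in range(1, 31)
}
selected_k = max(cv_means, key=cv_means.get)

# refit on all of the model-building data and score once on the hold-out set
final_model = KNeighborsClassifier(n_neighbors=selected_k).fit(X_build, y_build)
holdout_accuracy = final_model.score(X_holdout, y_holdout)
print(selected_k, holdout_accuracy)
```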

**Feature engineering and selection within cross-validation iterations**
- Normally, feature engineering and selection occur **before** cross-validation
- Instead, perform all feature engineering and selection **within each cross-validation iteration**
- More reliable estimate of out-of-sample performance since it **better mimics** the application of the model to out-of-sample data
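
In scikit-learn, one common way to do this is to wrap the preprocessing steps and the model in a pipeline, so that `cross_val_score` refits every step on the training folds of each iteration. The particular steps chosen here (`StandardScaler`, then `SelectKBest` keeping 2 features) are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# scaling and feature selection are fit on the training folds only,
# then applied to the testing fold, inside every cross-validation iteration
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=2),
    KNeighborsClassifier(n_neighbors=20),
)
pipeline_scores = cross_val_score(pipeline, X_iris, y_iris, cv=10, scoring='accuracy')
print(pipeline_scores.mean())
```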
