# Introduction to Cross Validation

In this lecture series we take a much deeper dive into various methods of cross-validation and discuss the general philosophy behind it. The official scikit-learn guide is a good reference: https://scikit-learn.org/stable/modules/cross_validation.html

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data Example

In [2]:
df = pd.read_csv("../DATA/Advertising.csv")

In [3]:
df.head()

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


----
----
----
## Train | Test Split Procedure 

0. Clean and adjust data as necessary for X and y
1. Split Data into Train/Test for both X and y
2. Fit/Train Scaler on Training X Data only
3. Scale X Train and X Test Data with that fitted scaler
4. Create Model
5. Fit/Train Model on X Train Data
6. Evaluate Model on X Test Data (by creating predictions and comparing to y_test)
7. Adjust Parameters as Necessary and repeat steps 5 and 6

In [4]:
## CREATE X and y
X = df.drop('sales',axis=1)
y = df['sales']

# TRAIN TEST SPLIT
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# SCALE DATA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
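To illustrate why the scaler is fit on the training data only, here is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration and are not the Advertising data). After scaling, the training features have mean 0 and standard deviation 1 by construction, while the test features are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the Advertising dataset)
rng = np.random.default_rng(101)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101)

# Fit on the training split only, then reuse those statistics for the test split
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

train_means = X_tr_scaled.mean(axis=0)
train_stds = X_tr_scaled.std(axis=0)
test_means = X_te_scaled.mean(axis=0)

print("train means:", train_means.round(6))
print("train stds: ", train_stds.round(6))
print("test means: ", test_means.round(3))
```

The test means will usually be close to, but not exactly, zero. That is expected: the test set never influenced the scaler.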

**Create Model**

In [5]:
from sklearn.linear_model import Ridge

In [6]:
# Poor Alpha Choice on purpose!
model = Ridge(alpha=100)

In [7]:
model.fit(X_train,y_train)

In [8]:
y_pred = model.predict(X_test)

**Evaluation**

In [9]:
from sklearn.metrics import mean_squared_error

In [10]:
mean_squared_error(y_test,y_pred)

7.341775789034128

**Adjust Parameters and Re-evaluate**

In [11]:
model = Ridge(alpha=1)

In [12]:
model.fit(X_train,y_train)

In [13]:
y_pred = model.predict(X_test)

**Another Evaluation**

In [14]:
mean_squared_error(y_test,y_pred)

2.3190215794287514

Much better! We could repeat this until we are satisfied with the performance metrics. (We previously showed that RidgeCV can do this for us, but the purpose here is to generalize the CV process to any model.)

----
----
----
## Train | Validation | Test Split Procedure 

In this procedure the final test set is often also called a "hold-out" set: you should not adjust parameters based on it, but instead use it *only* for reporting final expected performance.

0. Clean and adjust data as necessary for X and y
1. Split Data into Train/Validation/Test for both X and y
2. Fit/Train Scaler on Training X Data only
3. Scale X Train, X Eval, and X Test Data with that fitted scaler
4. Create Model
5. Fit/Train Model on X Train Data
6. Evaluate Model on X Evaluation Data (by creating predictions and comparing to y_eval)
7. Adjust Parameters as Necessary and repeat steps 5 and 6
8. Get final metrics on Test set (not allowed to go back and adjust after this!)

In [15]:
## CREATE X and y
X = df.drop('sales',axis=1)
y = df['sales']

In [16]:
######################################################################
#### SPLIT TWICE! Here we create TRAIN | VALIDATION | TEST  #########
####################################################################
from sklearn.model_selection import train_test_split

# 70% of data is training data, set aside other 30%
X_train, X_OTHER, y_train, y_OTHER = train_test_split(X, y, test_size=0.3, random_state=101)

# Remaining 30% is split into evaluation and test sets
# Each is 15% of the original data size
X_eval, X_test, y_eval, y_test = train_test_split(X_OTHER, y_OTHER, test_size=0.5, random_state=101)
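The arithmetic behind the double split can be checked directly: splitting off 30% and then halving it leaves 0.3 × 0.5 = 15% of the original rows in each of the evaluation and test sets. The sketch below uses a made-up 200-row array (an assumption for illustration, not the Advertising data) to confirm the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical 200-row dataset used only to count rows
X_demo = np.arange(400).reshape(200, 2)
y_demo = np.arange(200)

# First split: 70% train, 30% set aside
X_tr, X_other, y_tr, y_other = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101)

# Second split: the 30% is halved into evaluation and test
X_val, X_te, y_val, y_te = train_test_split(
    X_other, y_other, test_size=0.5, random_state=101)

split_sizes = {"train": len(X_tr), "eval": len(X_val), "test": len(X_te)}
for name, size in split_sizes.items():
    print(f"{name}: {size} rows ({size / len(X_demo):.0%} of the data)")
```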

In [17]:
# SCALE DATA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_eval = scaler.transform(X_eval)
X_test = scaler.transform(X_test)

**Create Model**

In [18]:
from sklearn.linear_model import Ridge

In [19]:
# Poor Alpha Choice on purpose!
model = Ridge(alpha=100)

In [20]:
model.fit(X_train,y_train)

In [21]:
y_eval_pred = model.predict(X_eval)

**Evaluation**

In [22]:
from sklearn.metrics import mean_squared_error

In [23]:
mean_squared_error(y_eval,y_eval_pred)

7.320101458823869

**Adjust Parameters and Re-evaluate**

In [24]:
model = Ridge(alpha=1)

In [25]:
model.fit(X_train,y_train)

In [30]:
y_eval_pred = model.predict(X_eval)

**Another Evaluation**

In [31]:
mean_squared_error(y_eval,y_eval_pred)

2.3837830750569853

**Final Evaluation (Can no longer edit parameters after this!)**

In [32]:
y_final_test_pred = model.predict(X_test)

In [29]:
mean_squared_error(y_test,y_final_test_pred)

2.254260083800517
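The manual "adjust and re-evaluate" loop above can be written as a loop over candidate values. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_regression` rather than the Advertising data, and the list `candidate_alphas` is an arbitrary illustrative choice. The key point is that every candidate is compared on the validation set, and the test set is touched exactly once at the end.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the Advertising dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=3, noise=10.0, random_state=101)

# Train | Validation | Test split, same proportions as above
X_tr, X_other, y_tr, y_other = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101)
X_val, X_te, y_val, y_te = train_test_split(
    X_other, y_other, test_size=0.5, random_state=101)

# Scaler is fit on the training split only
demo_scaler = StandardScaler().fit(X_tr)
X_tr_s, X_val_s, X_te_s = (
    demo_scaler.transform(part) for part in (X_tr, X_val, X_te))

# Compare each candidate alpha on the VALIDATION set only
candidate_alphas = [0.1, 1, 10, 100]  # illustrative values
val_mse = {}
for alpha in candidate_alphas:
    candidate = Ridge(alpha=alpha).fit(X_tr_s, y_tr)
    val_mse[alpha] = mean_squared_error(y_val, candidate.predict(X_val_s))

best_alpha = min(val_mse, key=val_mse.get)

# Final, one-time evaluation on the held-out test set
final_model = Ridge(alpha=best_alpha).fit(X_tr_s, y_tr)
test_mse = mean_squared_error(y_te, final_model.predict(X_te_s))

print("validation MSE per alpha:", val_mse)
print("chosen alpha:", best_alpha, "| test MSE:", test_mse)
```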

----
----
----
## Cross Validation with cross_val_score

----

<img src="grid_search_cross_validation.png">

----

In [33]:
## CREATE X and y
X = df.drop('sales',axis=1)
y = df['sales']

# TRAIN TEST SPLIT
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# SCALE DATA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [34]:
model = Ridge(alpha=100)

In [35]:
from sklearn.model_selection import cross_val_score

In [36]:
# SCORING OPTIONS:
# https://scikit-learn.org/stable/modules/model_evaluation.html
scores = cross_val_score(model,X_train,y_train,
                         scoring='neg_mean_squared_error',cv=5)

In [37]:
scores

array([ -9.32552967,  -4.9449624 , -11.39665242,  -7.0242106 ,
        -8.38562723])

In [38]:
# Average of the MSE scores (we set back to positive)
abs(scores.mean())

8.215396464543607
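The scores are negative because scikit-learn scorers follow a "higher is better" convention, so the MSE is negated. To make clear what `cross_val_score` is doing, here is a sketch that reproduces it by hand with `KFold` on synthetic data (an assumption for illustration; it is not the Advertising data). For a regressor, an integer `cv=5` corresponds to an unshuffled 5-fold split, so the two sets of scores should match.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: NOT the Advertising dataset)
X_demo, y_demo = make_regression(
    n_samples=150, n_features=3, noise=10.0, random_state=101)
demo_model = Ridge(alpha=100)

# Library version: one negated MSE per fold
library_scores = cross_val_score(
    demo_model, X_demo, y_demo,
    scoring='neg_mean_squared_error', cv=5)

# Manual version: fit a fresh copy on 4 folds, score on the remaining fold
manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    fold_model = clone(demo_model).fit(X_demo[train_idx], y_demo[train_idx])
    fold_mse = mean_squared_error(
        y_demo[val_idx], fold_model.predict(X_demo[val_idx]))
    manual_scores.append(-fold_mse)  # negate so that higher is better
manual_scores = np.array(manual_scores)

# Flip the sign back to report an ordinary (positive) mean MSE
mean_mse = -library_scores.mean()

print(library_scores)
print(manual_scores)
print("mean MSE:", mean_mse)
```

Note that the original `demo_model` object is never fitted here; both versions fit copies of it, one per fold.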

**Adjust model based on metrics**

In [39]:
model = Ridge(alpha=1)

In [40]:
# SCORING OPTIONS:
# https://scikit-learn.org/stable/modules/model_evaluation.html
scores = cross_val_score(model,X_train,y_train,
                         scoring='neg_mean_squared_error',cv=5)

In [41]:
# Average of the MSE scores (we set back to positive)
abs(scores.mean())

3.344839296530695

**Final Evaluation (Can no longer edit parameters after this!)**

In [42]:
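# cross_val_score fits copies of the model internally, so `model` itself
# is still unfitted. Need to fit it on the full training set first!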
model.fit(X_train,y_train)

In [43]:
y_final_test_pred = model.predict(X_test)

In [44]:
mean_squared_error(y_test,y_final_test_pred)

2.3190215794287514
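One detail worth noting about the workflow above: the scaler was fit on all of `X_train` before cross validation, so each validation fold has already contributed to the scaling statistics. A commonly used variation, shown here only as a hedged sketch and not as part of the original notebook, is to wrap the scaler and model in a pipeline so the scaler is refit inside every fold. The data below is synthetic (`make_regression`), not the Advertising data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the Advertising dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=3, noise=10.0, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101)

# Scaler + model as one estimator; pass the UNSCALED training data
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1))

# The scaler is refit on the training portion of each fold
cv_scores = cross_val_score(
    pipe, X_tr, y_tr, scoring='neg_mean_squared_error', cv=5)
cv_mse = -cv_scores.mean()

# Final fit on the full training set, then a one-time test evaluation
pipe.fit(X_tr, y_tr)
test_mse = mean_squared_error(y_te, pipe.predict(X_te))

print("CV mean MSE:", cv_mse)
print("Test MSE:   ", test_mse)
```

On a small, well-behaved dataset the difference from scaling up front is usually minor, but the pipeline keeps each fold's evaluation independent of its own validation rows.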