# Introduction to Cross Validation

In this lecture series we take a much deeper dive into various methods of cross-validation and discuss the general philosophy behind it. The official scikit-learn guide is a good companion reference: https://scikit-learn.org/stable/modules/cross_validation.html

## Imports

In [1]:
%reset -f

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate

from sklearn.linear_model import Ridge

from sklearn.metrics import mean_squared_error, r2_score

## Data Example

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/SimeonHristov99/ML_23-24/main/DATA/Advertising.csv')
df

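| | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |
| ... | ... | ... | ... | ... |
| 195 | 38.2 | 3.7 | 13.8 | 7.6 |
| 196 | 94.2 | 4.9 | 8.1 | 9.7 |
| 197 | 177.0 | 9.3 | 6.4 | 12.8 |
| 198 | 283.6 | 42.0 | 66.2 | 25.5 |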


## Train | Test Split Procedure 

0. Clean and adjust data as necessary for X and y
1. Split Data in Train/Test for both X and y
2. Fit/Train Scaler on Training X Data
3. Scale X Train and X Test Data with the fitted scaler
4. Create Model
5. Fit/Train Model on X Train Data
6. Evaluate Model on X Test Data (by creating predictions and comparing to y_test)
7. Adjust Parameters as Necessary and repeat steps 5 and 6

In [4]:
# Create X and y
X = df.drop('sales',axis=1)
y = df['sales']

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# Scale data
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

**Create Model**

In [5]:
# Poor Alpha Choice on purpose!
model = Ridge(alpha=100)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
y_pred

array([15.34908128, 17.05755308, 12.73784965, 16.18231062, 10.85075815,
        9.87999576, 17.6105132 , 15.80786278, 11.32616781, 17.30158479,
       12.8883864 , 13.64670913, 13.71636726, 18.83377117, 17.38617584,
       11.59912699, 14.88899736, 10.07145317, 10.14692243, 17.90771073,
       10.25837266, 16.71492563, 20.57087744, 19.66643199, 10.14020781,
       13.40084066, 18.09910709, 10.80433113, 13.00876939, 13.79206361,
       12.73015096, 17.42108555, 11.50183684, 10.10362749, 16.18778637,
       10.45161746, 11.25953403, 10.42658319, 12.30681396, 11.82281519,
       14.75707677, 11.58372535, 12.01609545, 10.90016204, 12.55896716,
       11.62961585, 10.8495293 , 15.74187916, 14.09264772, 18.45114683,
       13.43419788, 14.05075373, 16.0980788 , 12.07046074, 13.15048011,
        8.75095421, 19.21013193, 12.92686996, 16.49277745, 14.83525505])

**Evaluation**

In [6]:
mean_squared_error(y_test,y_pred)

7.34177578903413

In [7]:
r2_score(y_test,y_pred)

0.7399499443992823

**Adjust Parameters and Re-evaluate**

In [8]:
model = Ridge(alpha=1)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
y_pred

array([15.73544249, 19.56177685, 11.47282584, 16.99614361,  9.19583919,
        7.06034338, 20.24078477, 17.27047482,  9.7997058 , 19.18969381,
       12.40827613, 13.88321006, 13.72330625, 21.24960621, 18.41451801,
       10.00739858, 15.54023734,  7.72694272,  7.59886443, 20.3595504 ,
        7.831815  , 18.21607253, 24.61611392, 22.77116018,  8.0117733 ,
       12.667102  , 21.40567156,  8.10250725, 12.43158049, 12.53481984,
       10.81678067, 19.21537816, 10.09192883,  6.76998079, 17.29636618,
        7.81497124,  9.28808588,  8.31202002, 10.6122371 , 10.6533735 ,
       13.05491413,  9.80364168, 10.24764859,  8.09836046, 11.58209801,
       10.10783927,  9.025001  , 16.24936342, 13.26025422, 20.77690029,
       12.51477346, 13.96784546, 17.53696507, 11.15686875, 12.57233878,
        5.56009018, 23.21824128, 12.62301353, 18.72931877, 15.18197827])

**Another Evaluation**

In [9]:
mean_squared_error(y_test,y_pred)

2.319021579428752

In [10]:
r2_score(y_test,y_pred)

0.9178588793775941

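Much better! We could repeat this until we are satisfied with the performance metrics. Note, however, that every adjustment made this way is guided by the test set, so the final test score is no longer a fair estimate of performance on unseen data.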

## Train | Validation | Test Split Procedure 

The final test set in this procedure is often called a "hold-out" set: you should not adjust parameters based on it, but instead use it *only* for reporting final expected performance. Parameters are tuned against the separate validation (evaluation) set.

0. Clean and adjust data as necessary for X and y
1. Split Data in Train/Validation/Test for both X and y
2. Fit/Train Scaler on Training X Data
3. Scale X Train, X Eval, and X Test Data with the fitted scaler
4. Create Model
5. Fit/Train Model on X Train Data
6. Evaluate Model on X Evaluation Data (by creating predictions and comparing to y_eval)
7. Adjust Parameters as Necessary and repeat steps 5 and 6
8. Get final metrics on Test set (not allowed to go back and adjust after this!)

In [11]:
X = df.drop('sales',axis=1)
y = df['sales']

In [12]:
# SPLIT TWICE! Here we create TRAIN | VALIDATION | TEST

# 70% of data is training data, set aside other 30%
X_train, X_OTHER, y_train, y_OTHER = train_test_split(X, y, test_size=0.3, random_state=101)

# Remaining 30% is split into evaluation and test sets
# Each is 15% of the original data size
X_eval, X_test, y_eval, y_test = train_test_split(X_OTHER, y_OTHER, test_size=0.5, random_state=101)

print(f'{X_train.shape=}')
print(f'{y_train.shape=}')
print(f'{X_OTHER.shape=}')
print(f'{y_OTHER.shape=}')
print(f'{X_eval.shape=}')
print(f'{y_eval.shape=}')
print(f'{X_test.shape=}')
print(f'{y_test.shape=}')

X_train.shape=(140, 3)
y_train.shape=(140,)
X_OTHER.shape=(60, 3)
y_OTHER.shape=(60,)
X_eval.shape=(30, 3)
y_eval.shape=(30,)
X_test.shape=(30, 3)
y_test.shape=(30,)


In [13]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_eval = scaler.transform(X_eval)
X_test = scaler.transform(X_test)

**Create Model**

In [14]:
# Poor Alpha Choice on purpose!
model = Ridge(alpha=100)
model.fit(X_train,y_train)
y_eval_pred = model.predict(X_eval)
y_eval_pred

array([16.0980788 , 10.8495293 ,  8.75095421, 14.83525505, 12.55896716,
       12.8883864 , 11.58372535, 12.01609545, 16.18778637, 10.90016204,
       11.32616781, 17.90771073, 14.09264772, 13.79206361, 13.71636726,
        9.87999576, 15.34908128, 13.00876939, 13.43419788, 10.85075815,
       14.75707677, 18.83377117, 17.30158479, 15.74187916, 16.49277745,
       19.66643199, 17.6105132 , 10.07145317, 13.64670913, 17.42108555])

**Evaluation**

In [15]:
mean_squared_error(y_eval,y_eval_pred)

7.320101458823871

In [16]:
r2_score(y_eval,y_eval_pred)

0.7276645528191792

**Adjust Parameters and Re-evaluate**

In [17]:
model = Ridge(alpha=1)
model.fit(X_train,y_train)
y_eval_pred = model.predict(X_eval)
y_eval_pred

array([17.53696507,  9.025001  ,  5.56009018, 15.18197827, 11.58209801,
       12.40827613,  9.80364168, 10.24764859, 17.29636618,  8.09836046,
        9.7997058 , 20.3595504 , 13.26025422, 12.53481984, 13.72330625,
        7.06034338, 15.73544249, 12.43158049, 12.51477346,  9.19583919,
       13.05491413, 21.24960621, 19.18969381, 16.24936342, 18.72931877,
       22.77116018, 20.24078477,  7.72694272, 13.88321006, 19.21537816])

**Another Evaluation**

In [18]:
mean_squared_error(y_eval,y_eval_pred)

2.383783075056986

In [19]:
r2_score(y_eval,y_eval_pred)

0.9113142579540117

**Final Evaluation (Can no longer edit parameters after this!)**

In [20]:
y_final_test_pred = model.predict(X_test)
y_final_test_pred

array([10.10783927, 12.62301353, 12.57233878, 15.54023734, 21.40567156,
        7.59886443, 10.6533735 , 18.21607253, 19.56177685,  8.10250725,
        6.76998079, 10.6122371 , 13.96784546,  8.0117733 , 11.47282584,
       12.667102  , 11.15686875, 18.41451801, 23.21824128, 24.61611392,
        7.831815  ,  7.81497124, 17.27047482, 16.99614361,  9.28808588,
       20.77690029,  8.31202002, 10.00739858, 10.81678067, 10.09192883])

In [21]:
mean_squared_error(y_test,y_final_test_pred)

2.2542600838005176

In [22]:
r2_score(y_test,y_final_test_pred)

0.9237892400906466
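The "adjust and repeat" step above can also be written as a loop over candidate values. The following is a minimal sketch, not the notebook's own code: it uses synthetic data from `make_regression` as a stand-in for the advertising data, and `candidate_alphas` is a hypothetical grid chosen only for illustration. The test set is touched exactly once, after the validation set has picked the winner.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 3 features
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=101)

# Two-step split: 70% train, 15% validation, 15% test
X_train, X_other, y_train, y_other = train_test_split(
    X, y, test_size=0.3, random_state=101)
X_eval, X_test, y_eval, y_test = train_test_split(
    X_other, y_other, test_size=0.5, random_state=101)

# Fit the scaler on the training data only, then transform all three sets
scaler = StandardScaler().fit(X_train)
X_train, X_eval, X_test = (scaler.transform(a) for a in (X_train, X_eval, X_test))

# Hypothetical grid of alpha values to try
candidate_alphas = [0.01, 0.1, 1, 10, 100]

# Steps 5-7: fit on train, score on validation, keep track of each result
val_mse = {}
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_mse[alpha] = mean_squared_error(y_eval, model.predict(X_eval))

# Pick the alpha with the lowest validation error
best_alpha = min(val_mse, key=val_mse.get)

# Step 8: a single final evaluation on the held-out test set
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print(best_alpha, test_mse)
```

Because the choice of `best_alpha` depends only on the validation scores, the test MSE printed at the end remains an honest estimate of performance on new data.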

## Cross Validation with cross_val_score

![cross_val](https://raw.githubusercontent.com/SimeonHristov99/ML_23-24/main/assets/grid_search_cross_validation.png)

In [23]:
X = df.drop('sales',axis=1)
y = df['sales']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [24]:
model = Ridge(alpha=100)
scores = cross_val_score(model,X_train,y_train,
                         scoring='neg_mean_squared_error',cv=5)
scores

array([ -9.32552967,  -4.9449624 , -11.39665242,  -7.0242106 ,
        -8.38562723])

In [25]:
# Average of the MSE scores (we set back to positive)
abs(scores.mean())

8.215396464543607
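To make clearer what `cross_val_score` is doing here, the sketch below reproduces it by hand with an explicit `KFold` loop. It assumes synthetic data from `make_regression` rather than the advertising data, and relies on the documented behaviour that an integer `cv` with a regressor uses unshuffled `KFold`. Each fold's MSE is negated so that, as with the `neg_mean_squared_error` scorer, higher is better.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training data (assumption): 140 rows, 3 features
X, y = make_regression(n_samples=140, n_features=3, noise=10.0, random_state=101)

# Manual 5-fold loop: train on 4 folds, score on the remaining fold
manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X):
    fold_model = Ridge(alpha=100).fit(X[train_idx], y[train_idx])
    fold_mse = mean_squared_error(y[val_idx], fold_model.predict(X[val_idx]))
    manual_scores.append(-fold_mse)  # negate: scorers follow "higher is better"
manual_scores = np.array(manual_scores)

# The same computation through the library helper
library_scores = cross_val_score(
    Ridge(alpha=100), X, y, scoring='neg_mean_squared_error', cv=5)

print(manual_scores)
print(library_scores)
```

The two arrays should agree fold by fold, which is why taking the absolute value of the mean recovers the average MSE across folds.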

**Adjust model based on metrics**

In [26]:
model = Ridge(alpha=1)
scores = cross_val_score(model,X_train,y_train,
                         scoring='neg_mean_squared_error',cv=5)
scores

array([-3.15513238, -1.58086982, -5.40455562, -2.21654481, -4.36709384])

In [27]:
# Average of the MSE scores (we set back to positive)
abs(scores.mean())

3.344839296530695

**Final Evaluation (Can no longer edit parameters after this!)**

In [28]:
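# cross_val_score fits clones of the model on each fold, so the model
# itself still needs to be fit on the full training set first!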
model.fit(X_train,y_train)

In [29]:
y_final_test_pred = model.predict(X_test)
y_final_test_pred

array([15.73544249, 19.56177685, 11.47282584, 16.99614361,  9.19583919,
        7.06034338, 20.24078477, 17.27047482,  9.7997058 , 19.18969381,
       12.40827613, 13.88321006, 13.72330625, 21.24960621, 18.41451801,
       10.00739858, 15.54023734,  7.72694272,  7.59886443, 20.3595504 ,
        7.831815  , 18.21607253, 24.61611392, 22.77116018,  8.0117733 ,
       12.667102  , 21.40567156,  8.10250725, 12.43158049, 12.53481984,
       10.81678067, 19.21537816, 10.09192883,  6.76998079, 17.29636618,
        7.81497124,  9.28808588,  8.31202002, 10.6122371 , 10.6533735 ,
       13.05491413,  9.80364168, 10.24764859,  8.09836046, 11.58209801,
       10.10783927,  9.025001  , 16.24936342, 13.26025422, 20.77690029,
       12.51477346, 13.96784546, 17.53696507, 11.15686875, 12.57233878,
        5.56009018, 23.21824128, 12.62301353, 18.72931877, 15.18197827])

In [30]:
mean_squared_error(y_test,y_final_test_pred)

2.319021579428752

In [31]:
r2_score(y_test,y_final_test_pred)

0.9178588793775941
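One caveat about the cells above: the scaler is fit on all of `X_train` before cross-validation, so each validation fold has already contributed to the scaling statistics. A common way to avoid this, not used in this notebook, is to wrap the scaler and the model in a pipeline so that scaling is refit inside every fold. The sketch below assumes synthetic data from `make_regression` in place of the advertising data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 3 features
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

# The scaler is now part of the estimator, so it is refit on the
# training portion of each fold during cross-validation
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1))
scores = cross_val_score(pipe, X_train, y_train,
                         scoring='neg_mean_squared_error', cv=5)
cv_mse = -scores.mean()

# Final evaluation: fit the whole pipeline on the training set,
# then score once on the untouched test set (R^2 for regressors)
pipe.fit(X_train, y_train)
test_r2 = pipe.score(X_test, y_test)
print(cv_mse, test_r2)
```

On a small, well-behaved dataset the difference from scaling up front is usually minor, but the pipeline form keeps the validation folds strictly unseen during fitting.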

## Cross Validation with cross_validate

The cross_validate function differs from cross_val_score in two ways:

1. It allows specifying multiple metrics for evaluation.
2. It returns a dict containing fit-times, score-times (and optionally training scores as well as fitted estimators) in addition to the test score.

For single metric evaluation, where the `scoring` parameter is a string, callable or None, the keys will be:

    ['test_score', 'fit_time', 'score_time']

And for multiple metric evaluation, the return value is a dict with the following keys:

    ['test_<scorer1_name>', 'test_<scorer2_name>', 'test_<scorer...>', 'fit_time', 'score_time']

`return_train_score` is set to False by default to save computation time. To evaluate the scores on the training set as well, set it to True.
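The key naming described above can be checked directly. This is a small sketch on synthetic data from `make_regression` (an assumption, used so the example runs on its own); it calls `cross_validate` once with a single metric and once with a list of metrics, both with `return_train_score=True`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the training data (assumption): 140 rows, 3 features
X, y = make_regression(n_samples=140, n_features=3, noise=10.0, random_state=101)

# Single metric: keys are 'test_score' and, when requested, 'train_score'
single = cross_validate(Ridge(alpha=1), X, y,
                        scoring='neg_mean_squared_error',
                        cv=5, return_train_score=True)
print(sorted(single.keys()))

# Multiple metrics: keys are prefixed with 'test_' / 'train_' plus the scorer name
multi = cross_validate(Ridge(alpha=1), X, y,
                       scoring=['neg_mean_absolute_error', 'neg_mean_squared_error'],
                       cv=5, return_train_score=True)
print(sorted(multi.keys()))
```

Comparing the `train_*` and `test_*` arrays fold by fold is a quick way to spot overfitting: a large gap between them suggests the model fits the training folds much better than the held-out fold.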

In [32]:
X = df.drop('sales',axis=1)
y = df['sales']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [33]:
model = Ridge(alpha=100)
scores = cross_validate(
    model, X_train, y_train,
    scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'max_error'],
    cv=5,
)
scores

{'fit_time': array([0.00295448, 0.00267839, 0.00205731, 0.00205588, 0.00226974]),
 'score_time': array([0.00316048, 0.00380039, 0.0031178 , 0.00272489, 0.00414681]),
 'test_neg_mean_absolute_error': array([-2.31243044, -1.74653361, -2.56211701, -2.01873159, -2.27951906]),
 'test_neg_mean_squared_error': array([ -9.32552967,  -4.9449624 , -11.39665242,  -7.0242106 ,
         -8.38562723]),
 'test_max_error': array([ -6.44988486,  -5.58926073, -10.33914027,  -6.61950405,
         -7.75578515])}

In [34]:
pd.DataFrame(scores)

| | fit_time | score_time | test_neg_mean_absolute_error | test_neg_mean_squared_error | test_max_error |
|---|---|---|---|---|---|
| 0 | 0.002954 | 0.00316 | -2.31243 | -9.32553 | -6.449885 |
| 1 | 0.002678 | 0.0038 | -1.746534 | -4.944962 | -5.589261 |
| 2 | 0.002057 | 0.003118 | -2.562117 | -11.396652 | -10.33914 |
| 3 | 0.002056 | 0.002725 | -2.018732 | -7.024211 | -6.619504 |
| 4 | 0.00227 | 0.004147 | -2.279519 | -8.385627 | -7.755785 |


In [35]:
pd.DataFrame(scores).mean()

fit_time                        0.002403
score_time                      0.003390
test_neg_mean_absolute_error   -2.183866
test_neg_mean_squared_error    -8.215396
test_max_error                 -7.350715
dtype: float64

**Adjust model based on metrics**

In [36]:
model = Ridge(alpha=1)
scores = cross_validate(
    model, X_train, y_train,
    scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'max_error'],
    cv=5,
)
pd.DataFrame(scores).mean()

fit_time                        0.002476
score_time                      0.003231
test_neg_mean_absolute_error   -1.319685
test_neg_mean_squared_error    -3.344839
test_max_error                 -5.161145
dtype: float64
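The error metrics in these summaries are negative because scikit-learn scorers follow a "higher is better" convention. A minimal sketch of turning them back into ordinary positive errors is shown below; it assumes synthetic data from `make_regression` and also derives a root mean squared error from the averaged MSE, which the notebook itself does not compute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the training data (assumption): 140 rows, 3 features
X, y = make_regression(n_samples=140, n_features=3, noise=10.0, random_state=101)

scores = cross_validate(
    Ridge(alpha=1), X, y,
    scoring=['neg_mean_absolute_error', 'neg_mean_squared_error'],
    cv=5,
)

# Flip the sign to recover the usual positive error values
mean_mae = -scores['test_neg_mean_absolute_error'].mean()
mean_mse = -scores['test_neg_mean_squared_error'].mean()

# Square root of the averaged MSE, expressed in the units of y
rmse = np.sqrt(mean_mse)
print(mean_mae, mean_mse, rmse)
```

Reporting the RMSE alongside the MAE is often convenient because both are in the same units as the target.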

**Final Evaluation (Can no longer edit parameters after this!)**

In [37]:
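# cross_validate fits clones of the model on each fold, so the model
# itself still needs to be fit on the full training set first!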
model.fit(X_train,y_train)

In [38]:
y_final_test_pred = model.predict(X_test)
y_final_test_pred

array([15.73544249, 19.56177685, 11.47282584, 16.99614361,  9.19583919,
        7.06034338, 20.24078477, 17.27047482,  9.7997058 , 19.18969381,
       12.40827613, 13.88321006, 13.72330625, 21.24960621, 18.41451801,
       10.00739858, 15.54023734,  7.72694272,  7.59886443, 20.3595504 ,
        7.831815  , 18.21607253, 24.61611392, 22.77116018,  8.0117733 ,
       12.667102  , 21.40567156,  8.10250725, 12.43158049, 12.53481984,
       10.81678067, 19.21537816, 10.09192883,  6.76998079, 17.29636618,
        7.81497124,  9.28808588,  8.31202002, 10.6122371 , 10.6533735 ,
       13.05491413,  9.80364168, 10.24764859,  8.09836046, 11.58209801,
       10.10783927,  9.025001  , 16.24936342, 13.26025422, 20.77690029,
       12.51477346, 13.96784546, 17.53696507, 11.15686875, 12.57233878,
        5.56009018, 23.21824128, 12.62301353, 18.72931877, 15.18197827])

In [39]:
mean_squared_error(y_test,y_final_test_pred)

2.319021579428752

In [40]:
r2_score(y_test,y_final_test_pred)

0.9178588793775941