###  Created by Luis Alejandro (alejand@umich.edu)

### Cross-validation

Hyper-parameters are parameters that are not directly learnt within estimators. It is possible and recommended to search the hyper-parameter space for the best cross validation score. 

If we tweak the hyperparameters using performance metrics on the test set, there is still a risk of overfitting since some knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. For this we can use an extra set called the "validation set".

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.

A solution to this problem is a procedure called cross-validation (CV for short). One of the most common used approaches is the k-fold cross-validation:

<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" style="width: 600px;"/>

[Read More](https://scikit-learn.org/stable/modules/cross_validation.html)

In [1]:
# Libraries
import pandas as pd
import numpy as np

from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate

from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score

from sklearn.pipeline import Pipeline

### Example of k-fold cross-validation on a regression task
Fits a [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model to the bodyfat dataset and performs the cross validation mannually using the [Kfold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) class.


In [2]:
# Read dataset
dataset = pd.read_csv('../../datasets/regression/bodyfat.csv')
predictors = dataset.iloc[:,:-1].values
responses = dataset.iloc[:,-1].values
dataset.columns

Index(['Age (Years)', 'Weight (lbs)', 'Height (inches)',
       'Neck circumference (cm)', 'Chest circumference (cm)',
       'Abdomen 2 circumference (cm)', 'Hip circumference (cm)',
       'Thigh circumference (cm)', 'Knee circunference (cm)',
       'Ankle circunference (cm)', 'Biceps (extended) circunference (cm)',
       'Forearm circunference (cm)', 'Wrist circunference (cm)', 'Bodyfat %'],
      dtype='object')

In [3]:
# Splits into training/test sets
X,X_holdout,y,y_holdout = train_test_split(predictors,responses,test_size = 0.2)
# Train and evaluates model using standarization
sc = StandardScaler()
sc.fit(X)
# Defines model
mdl = LinearRegression()
# Performs K-fold cross-validation
kf = KFold(n_splits = 5)
score = np.zeros(kf.get_n_splits())
best_score = -float("inf")
best_mdl = mdl
best_sc = sc

for i, indexes in enumerate(kf.split(X)):
    # Sets
    X_train = X[indexes[0],:]
    y_train = y[indexes[0]]
    X_test = X[indexes[1],:]
    y_test = y[indexes[1]]
    # Standarizing
    sc.fit(X_train)
    X_train = sc.transform(X_train)
    X_test = sc.transform(X_test)
    # Training
    mdl.fit(X_train,y_train)
    y_pred = mdl.predict(X_test)    
    score[i] = r2_score(y_test,y_pred)
    # Picks best model
    if best_score <= score[i]:
        best_score = score[i]
        best_mdl = mdl
        best_sc = sc
        
print('Validation Scores: ', score)
X_holdout = best_sc.transform(X_holdout)
y_pred = best_mdl.predict(X_holdout)
print('Test Score (Best Model): ', r2_score(y_holdout,y_pred))

Validation Scores:  [0.73511561 0.38206664 0.70970376 0.68139746 0.81126304]
Test Score (Best Model):  0.7011764698336852


#### Example of leaking info
Standarizing before doing the cross-validation process (training/testing) is a way to leak information.

In [4]:
# Splits into training/test sets
X,X_holdout,y,y_holdout = train_test_split(predictors,responses,test_size = 0.2)
# Train and evaluates model using standarization
sc = StandardScaler()
sc.fit(X)
# Defines model
mdl = LinearRegression()
# Performs K-fold cross-validation
kf = KFold(n_splits = 5)
score = np.zeros(kf.get_n_splits())
best_score = -float('inf')
best_mdl = mdl
best_sc = sc
# Standarizing
sc.fit(X)
X = sc.transform(X)
X_holdout = sc.transform(X_holdout)

for i, indexes in enumerate(kf.split(X)):
    # Sets
    X_train = X[indexes[0],:]
    y_train = y[indexes[0]]
    X_test = X[indexes[1],:]
    y_test = y[indexes[1]]
    # Training
    mdl.fit(X_train,y_train)
    y_pred = mdl.predict(X_test)    
    score[i] = r2_score(y_test,y_pred)
    # Picks best model
    if best_score <= score[i]:
        best_score = score[i]
        best_mdl = mdl
        
print('Validation Scores: ', score)
y_pred = best_mdl.predict(X_holdout)
print('Test Score (Best Model): ', r2_score(y_holdout,y_pred))

Validation Scores:  [0.71418155 0.72363143 0.55877667 0.78066185 0.58456479]
Test Score (Best Model):  0.7953924728102229


### Example of k-fold cross-validation on a classification task
Fits a [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model to the [breast cancer dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html). Performs cross-validation using the [cross-validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) function to be able to retrieve the best model and evaluate several metrics at once (see [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score) as well). Since we need to apply standarization we are using a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.

In [5]:
# Load dataset
dataset = datasets.load_breast_cancer()
print(dataset.feature_names, end="\n")
print(dataset.target_names)
predictors = dataset.data
responses = dataset.target

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
['malignant' 'benign']


In [6]:
# Splits into training/test sets
X,X_holdout,y,y_holdout = train_test_split(predictors,responses,test_size = 0.3, stratify=responses)
# Defines model
sc = StandardScaler()
clf = LogisticRegression(penalty='l2', C = 1)
estimators = [('normalizer', sc), ('classifier', clf)]
pipe = Pipeline(estimators)
results = cross_validate(pipe,X,y,cv = 10,scoring = ['accuracy', 'f1','precision','recall'], n_jobs=-1,
                         return_estimator=True, return_train_score=True)

print('\nTime training (Avg): ', results['fit_time'].mean())
print('\nTraining Metrics: ')
print('Accuracy (Avg): ', '%.2f' % results['train_accuracy'].mean())
print('F1 Macro (Avg): ', '%.2f' % results['train_f1'].mean())
print('Recall Macro (Avg): ', '%.2f' % results['train_recall'].mean())
print('Precision Macro (Avg): ', '%.2f' % results['train_precision'].mean())
print('\nValidation Metrics: ')
print('Accuracy (Avg): ', '%.2f' % results['test_accuracy'].mean())
print('F1 Macro (Avg): ', '%.2f' % results['test_f1'].mean())
print('Recall Macro (Avg): ', '%.2f' % results['test_recall'].mean())
print('Precision Macro (Avg): ', '%.2f' % results['test_precision'].mean())

best_pipe = results['estimator'][results['test_accuracy'].argmax()]
y_pred = best_pipe.predict(X_holdout)
print('\nTest Metrics: ')
print('Accuracy: ', '%.2f' % accuracy_score(y_pred,y_holdout))
print('F1 Score: ', '%.2f' % f1_score(y_pred,y_holdout))
print('Recall: ', '%.2f' % recall_score(y_pred,y_holdout))
print('Precision: ', '%.2f' % precision_score(y_pred,y_holdout))


Time training (Avg):  0.011170077323913574

Training Metrics: 
Accuracy (Avg):  0.99
F1 Macro (Avg):  0.99
Recall Macro (Avg):  1.00
Precision Macro (Avg):  0.98

Validation Metrics: 
Accuracy (Avg):  0.97
F1 Macro (Avg):  0.98
Recall Macro (Avg):  0.99
Precision Macro (Avg):  0.97

Test Metrics: 
Accuracy:  0.97
F1 Score:  0.98
Recall:  0.97
Precision:  0.98


In [7]:
print('\nNormalizer params:')
print(best_pipe['normalizer'].mean_)
print(best_pipe['normalizer'].var_)
print('\nClassifier weights:')
print(best_pipe['classifier'].coef_)


Normalizer params:
[1.41691369e+01 1.93214246e+01 9.22562849e+01 6.57615363e+02
 9.60887430e-02 1.04714078e-01 8.80452765e-02 4.90608212e-02
 1.81865922e-01 6.27250559e-02 4.02588827e-01 1.18449860e+00
 2.85269330e+00 3.99662067e+01 6.97755866e-03 2.52388827e-02
 3.04707933e-02 1.16045503e-02 2.04237039e-02 3.73709302e-03
 1.63429581e+01 2.56681285e+01 1.07798966e+02 8.85823743e+02
 1.32443771e-01 2.56749804e-01 2.71934676e-01 1.15178936e-01
 2.91506145e-01 8.39630726e-02]
[1.21953357e+01 2.04434379e+01 5.83070588e+02 1.19627675e+05
 1.94433162e-04 2.84515582e-03 6.25187079e-03 1.49495275e-03
 7.73518112e-04 5.11872367e-05 7.12924462e-02 2.85653498e-01
 3.80842746e+00 1.93651226e+03 9.07798084e-06 2.99659682e-04
 5.31585740e-04 3.47155915e-05 7.55772904e-05 6.34557359e-06
 2.31736402e+01 3.97002208e+01 1.13246130e+03 3.20760171e+05
 5.10650631e-04 2.45449894e-02 4.18269310e-02 4.23354560e-03
 4.02422013e-03 3.16494699e-04]

Classifier weights:
[[-0.28694131 -0.3820251  -0.28393158 -0.