<a href="https://colab.research.google.com/github/chiaramarzi/ML-models-validation/blob/main/models_validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Artificial intelligence (AI) for health - potentials


*   **Data mining**: finding patterns in large datasets
*   **Biomarker discovery**: identifying potential (compound) biomarkers
*   The **predictive nature** of machine learning strategies is well aligned with the aim of clinical diagnosis and prognosis **in the single patient**

# Models validation



In machine learning, model validation is the process of evaluating a trained model on a testing data set. The testing data set is a separate portion of the same data set from which the training set is derived.
Model validation is carried out after model training.

Its aim is to estimate the **unbiased generalization performance** of the model, i.e., how well the model performs on data it has not seen during training.

# Outline

* Holdout validation
* K-fold cross-validation (CV)
* Leave-One-Out CV (LOOCV)
* Hyperparameters tuning
* Training, validation and test set: the holdout validation
* Training, validation and test set: the nested CV
* Sampling bias
* Repetition of holdout validation
* Repetition of CV
* Unbalanced datasets

# Age prediction based on neuroimaging features



*   Data: T1-weighted images of 86 healthy subjects with age ranging from 19 to 85 years (41 males and 45 females, age 44.2 ± 17.1 years, mean ± standard deviation). Data are freely accessible [here](https://fcon_1000.projects.nitrc.org/) and described in (Mazziotta et al., 2001)
*   Features:
  * Cortical thickness (mCT)
  * Gyrification index (Pial_mean_GI)
  * Fractal dimension (FD)
* Task:
  * Regression
  * Classification ("young" vs. "old")

The same data and features have been previously investigated in (Marzi et al., 2020).


**References**

Marzi, C., Giannelli, M., Tessa, C. et al. Toward a more reliable characterization of fractal properties of the cerebral cortex of healthy subjects during the lifespan. Sci Rep 10, 16957 (2020). https://doi.org/10.1038/s41598-020-73961-w

Mazziotta, J. et al. A probabilistic atlas and reference system for the human brain: International Consortium for Brain Mapping (ICBM). Philos. Trans. R. Soc. Lond. B Biol. Sci. 356, 1293–1322. https://doi.org/10.1098/rstb.2001.0915 (2001).

# Libraries and data loading

In [82]:
# My repo cloning
! git clone https://github.com/chiaramarzi/ML-models-validation

%cd /content/ML-models-validation
! git pull



In [83]:
# Libraries loading
from IPython.display import Image
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, cross_validate
from sklearn.svm import SVR, SVC
from sklearn.metrics import mean_absolute_error, accuracy_score

# Run the companion notebook to define the helper functions used below
%run utils.ipynb

# Regression data
reg_data = pd.read_csv('data_regression.csv')

# Balanced classification data
class_data = pd.read_csv('data_classification_balanced.csv')

# Unbalanced classification data
unbal_class_data = pd.read_csv('data_classification_unbalanced.csv')

In [None]:
reg_data

In [None]:
class_data

In [None]:
unbal_class_data

# Holdout validation

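The principle is simple: randomly split the data into a training set (roughly 70% of the samples), used to fit the model, and a test set (the remaining 30%), used to evaluate it. The examples below hold out 25% of the subjects (`test_size = 0.25`).
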
![](https://raw.githubusercontent.com/chiaramarzi/ML-models-validation/main/figures/IMG_4103.png)
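
The split itself can be sketched with scikit-learn's `train_test_split`. This is a minimal illustration on synthetic data, not the `regression_holdout` helper from `utils.ipynb`; the names `X_demo` and `y_demo`, the simulated features and the `SVR` settings are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for the real data: 86 subjects, 3 features (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(86, 3))
y_demo = 50 + 10 * X_demo[:, 0] + rng.normal(scale=5, size=86)

# Hold out 25% of the subjects for testing; the seed fixes the random split
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=43
)

# Fit on the training set only
model = SVR(kernel='rbf', C=1.0).fit(X_train, y_train)

# The training error is optimistic; the test error estimates generalization
mae_train = mean_absolute_error(y_train, model.predict(X_train))
mae_test = mean_absolute_error(y_test, model.predict(X_test))
print(f'MAE train: {mae_train:.2f} years, MAE test: {mae_test:.2f} years')
```

With 86 subjects and `test_size=0.25`, the test set contains 22 subjects and the training set 64.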

In [None]:
Image('figures/IMG_4103.png')

In [None]:
#SEED = 42 #563: good, 0: perfect, 42: worse

### REGRESSION ###
print('***Regression task')

X = reg_data.iloc[:,2:5]
y = reg_data['Age']

print('The whole dataset contains ' + str(np.shape(reg_data)[0]) + ' subjects')
print('The age prediction will be performed using ' + str(np.shape(X)[1]) + ' MRI-derived features')
print() 

regression_holdout(X, y, seed = 43, test_size = 0.25)

In [None]:
### CLASSIFICATION ###
print('***Classification task')

X = class_data.iloc[:,2:5]
y = class_data['Age_class']

print('The whole dataset contains ' + str(np.shape(class_data)[0]) + ' subjects')
print('The age prediction will be performed using ' + str(np.shape(X)[1]) + ' MRI-derived features')
print() 

classification_holdout(X, y, seed = 43, stratify = None, test_size = 0.25)

# K-fold cross-validation (CV)

K-fold CV splits the data into k folds, trains the model on k-1 folds and tests it on the fold that was left out. This is repeated k times, so that each fold serves as the test set exactly once, and the k test scores are averaged.

The advantage is that all observations are used for both training and validation, and each observation is used exactly once for validation.

Typical choices are k=5 or k=10, which offer a good trade-off between computational cost and the reliability of the performance estimate.

The scores of the individual folds are more informative than one may think. They are mostly used to compute the average performance, but their variance or standard deviation also indicates how stable the model is across different subsets of the data.
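
As a rough sketch of the procedure described above, the snippet below runs a 5-fold CV with scikit-learn's `KFold` and `cross_validate` and reports both the mean and the standard deviation of the test error. The data are synthetic and the `SVR` settings are illustrative assumptions; the notebook's own `regression_CV` helper may be implemented differently.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.svm import SVR

# Synthetic stand-in for the real data: 86 subjects, 3 features (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(86, 3))
y_demo = 50 + 10 * X_demo[:, 0] + rng.normal(scale=5, size=86)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Count how often each subject appears in a test fold (expected: exactly once)
test_counts = np.zeros(len(X_demo), dtype=int)
for train_idx, test_idx in cv.split(X_demo):
    test_counts[test_idx] += 1

scores = cross_validate(SVR(kernel='rbf', C=1.0), X_demo, y_demo, cv=cv,
                        scoring='neg_mean_absolute_error',
                        return_train_score=True)

# scikit-learn reports the negated MAE, so flip the sign
mae_test = -scores['test_score']
print(f'Test MAE per fold: {np.round(mae_test, 2)}')
print(f'Mean: {mae_test.mean():.2f}, std: {mae_test.std():.2f}')
```

A large standard deviation relative to the mean suggests that the performance estimate depends strongly on which subjects end up in each fold.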


In [None]:
### REGRESSION ###
n_folds = 5 # for LOOCV insert n_fold = 86

print('***Regression task')

X = reg_data.iloc[:,2:5]
y = reg_data['Age']

print('The whole dataset contains ' + str(np.shape(reg_data)[0]) + ' subjects')
print('The age prediction will be performed using ' + str(np.shape(X)[1]) + ' MRI-derived features')
print() 

MAE_train, MAE_test = regression_CV(X, y, seed = 42, n_folds = n_folds)
print_to_std(MAE_train, MAE_test, "MAE")


In [None]:
### CLASSIFICATION ###
n_folds = 5 # for LOOCV insert n_fold = 50

print('***Classification task')

X = class_data.iloc[:,2:5]
y = class_data['Age_class']

print('The whole dataset contains ' + str(np.shape(class_data)[0]) + ' subjects')
print('The age prediction will be performed using ' + str(np.shape(X)[1]) + ' MRI-derived features')
print() 

ACC_train, ACC_test = classification_CV(X, y, seed = 42, n_folds = n_folds)
print_to_std(ACC_train, ACC_test, "ACC")

# Leave-One-Out CV (LOOCV)

A variant of k-Fold CV is Leave-one-out Cross-Validation (LOOCV). 

LOOCV uses each sample in the data as a separate test set while all remaining samples form the training set. This variant is identical to k-fold CV when k = n (number of observations).

LOOCV is computationally very costly as the model needs to be trained n times. Only do this if the dataset is small or if you can handle that many computations.
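
The equivalence between LOOCV and k-fold CV with k = n can be checked directly. The sketch below uses scikit-learn's `LeaveOneOut` and an unshuffled `KFold` on a small hypothetical array of 10 samples (the array `X_demo` is an assumption made for the example).

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# 10 hypothetical samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)

loo_splits = list(LeaveOneOut().split(X_demo))
kfold_splits = list(KFold(n_splits=len(X_demo)).split(X_demo))

# One iteration (one model fit) per sample
print(f'Number of LOOCV iterations: {len(loo_splits)}')

# Each test set holds a single sample, and both splitters agree on it
same = all(
    np.array_equal(loo_test, kf_test)
    for (_, loo_test), (_, kf_test) in zip(loo_splits, kfold_splits)
)
print(f'LOOCV identical to k-fold with k = n: {same}')
```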



In [None]:
### REGRESSION ###
n_folds = 86

print('***Regression task')

X = reg_data.iloc[:,2:5]
y = reg_data['Age']

print('The whole dataset contains ' + str(np.shape(reg_data)[0]) + ' subjects')
print('The age prediction will be performed using ' + str(np.shape(X)[1]) + ' MRI-derived features')
print() 

MAE_train, MAE_test = regression_CV(X, y, seed = 42, n_folds = n_folds)
print_to_std(MAE_train, MAE_test, "MAE")


In [None]:
### CLASSIFICATION ###
n_folds = 50

print('***Classification task')

X = class_data.iloc[:,2:5]
y = class_data['Age_class']

print('The whole dataset contains ' + str(np.shape(class_data)[0]) + ' subjects')
print('The age prediction will be performed using ' + str(np.shape(X)[1]) + ' MRI-derived features')
print() 

ACC_train, ACC_test = classification_CV(X, y, seed = 42, n_folds = n_folds)
print_to_std(ACC_train, ACC_test, "ACC")

# Hyperparameters tuning

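The models used in this notebook are support vector machines: `SVR()` for regression and `SVC()` for classification.

The hyperparameter tuned here is `C`, the regularization parameter: the strength of the regularization is inversely proportional to `C`.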
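
A minimal sketch of tuning `C` with scikit-learn's `GridSearchCV` follows. The grid `[0.1, 1, 100]` is the one used in the nested CV cell of this notebook; the synthetic data and the remaining settings are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

# Synthetic stand-in for the real data: 86 subjects, 3 features (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(86, 3))
y_demo = 50 + 10 * X_demo[:, 0] + rng.normal(scale=5, size=86)

# Candidate values of the regularization parameter C
param_grid = {'C': [0.1, 1, 100]}

# Each candidate is scored by 5-fold CV; the best one is refit on all data
search = GridSearchCV(
    SVR(kernel='rbf'),
    param_grid=param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring='neg_mean_absolute_error',
)
search.fit(X_demo, y_demo)

print('Best C:', search.best_params_['C'])
print(f'CV MAE of the best C: {-search.best_score_:.2f}')
```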

# Training, validation and test set: the holdout validation

When optimizing the hyperparameters of your model, you might overfit if you tune them using the same train/test split that you use to evaluate performance.
Why? Because the search selects the hyperparameters that best fit that specific test set, so the test score is no longer an unbiased estimate.

To solve this issue, you can set aside an additional holdout set, often about 10% of the data, that is not used in any processing or tuning step. The hyperparameters are then chosen on the validation set, and the final model is evaluated once on the untouched test set.
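
One way to sketch a training/validation/test split is with two successive calls to `train_test_split`. The example is hypothetical: it assumes the validation fraction is taken from the data left after removing the test set, which may differ from how `regression_holdout_val_set` in `utils.ipynb` interprets `val_set_size`. The data are synthetic.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for the real data: 86 subjects, 3 features (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(86, 3))
y_demo = 50 + 10 * X_demo[:, 0] + rng.normal(scale=5, size=86)

# 1) Set the test set aside; it is not touched during tuning
X_trval, X_test, y_trval, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
# 2) Split the remainder into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_trval, y_trval, test_size=0.15, random_state=42
)

# 3) Choose C according to the validation error only
val_mae = {}
for C in [0.1, 1, 100]:
    model = SVR(kernel='rbf', C=C).fit(X_train, y_train)
    val_mae[C] = mean_absolute_error(y_val, model.predict(X_val))
best_C = min(val_mae, key=val_mae.get)

# 4) Refit on training + validation data and evaluate once on the test set
final_model = SVR(kernel='rbf', C=best_C).fit(X_trval, y_trval)
test_mae = mean_absolute_error(y_test, final_model.predict(X_test))
print(f'Best C: {best_C}, test MAE: {test_mae:.2f} years')
```

Refitting on the combined training and validation data in step 4 is a design choice, not a requirement; the model fitted on the training set alone could be evaluated instead.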


In [None]:
SEED = 42 #563: good, 0: perfect, 42: worse

### REGRESSION ###
print('***Regression task')

X = reg_data.iloc[:,2:5]
y = reg_data['Age']

print('The whole dataset contains ' + str(np.shape(reg_data)[0]) + ' subjects')
print('The age prediction will be performed using ' + str(np.shape(X)[1]) + ' MRI-derived features')
print() 

regression_holdout_val_set(X, y, SEED, test_set_size = 0.25, val_set_size = 0.15)

In [None]:
### CLASSIFICATION ###
print('***Classification task')

X = class_data.iloc[:,2:5]
y = class_data['Age_class']

print('The whole dataset contains ' + str(np.shape(class_data)[0]) + ' subjects')
print('The age prediction will be performed using ' + str(np.shape(X)[1]) + ' MRI-derived features')
print() 

classification_holdout_val_set(X, y, SEED, test_set_size = 0.25, val_set_size = 0.15)

# Training, validation and test set: the nested CV

When you optimize the hyperparameters of your model and use the same k-fold CV splits both to tune the model and to evaluate its performance, you run the risk of an optimistically biased estimate. You do not want to estimate the accuracy of your model on the same splits on which you selected the best hyperparameters.


Instead, nested cross-validation separates the hyperparameter tuning step from the error estimation step. To do this, we nest two k-fold cross-validation loops:


*   the inner loop, for hyperparameter tuning, and
*   the outer loop, for estimating accuracy.


In [84]:
SEED = 42

### REGRESSION ###
n_folds = 5

print('***Regression task')

X = reg_data.iloc[:,2:5]
y = reg_data['Age']

print('The whole dataset contains ' + str(np.shape(reg_data)[0]) + ' subjects')
print('The age prediction will be performed using ' + str(np.shape(X)[1]) + ' MRI-derived features')
print() 

MAE_tr_val, MAE_test = regression_nestedCV(X, y, SEED, n_folds)
print_to_std(MAE_tr_val, MAE_test, "MAE")

# Nested CV implemented directly with scikit-learn
outer_cv = KFold(n_splits=n_folds, shuffle=True, random_state=SEED)
inner_cv = KFold(n_splits=n_folds, shuffle=True, random_state=SEED)

# RBF-kernel SVR; C is the only hyperparameter tuned in the inner loop
reg = SVR(kernel='rbf', gamma='scale', C=0.1, epsilon=0.1)
p_grid = [{'C': [0.1, 1, 100]}]

# Inner loop: grid search over C, scored by the negated MAE
reg_gs = GridSearchCV(reg, param_grid=p_grid, cv=inner_cv, refit=True,
                      scoring='neg_mean_absolute_error', n_jobs=1, verbose=1)
# Outer loop: performance of the tuned model on the held-out folds
nested_score = cross_validate(reg_gs, X=X, y=y, cv=outer_cv, return_train_score=True,
                              return_estimator=True, scoring='neg_mean_absolute_error', n_jobs=1)
###########################################################

***Regression task
The whole dataset contains 86 subjects
The age prediction will be performed using 3 MRI-derived features

*** Outer iteration: 1
* Inner iteration: 1




TypeError: ignored

In [None]:
### CLASSIFICATION ###
n_folds = 5

print('***Classification task')

X = class_data.iloc[:,2:5]
y = class_data['Age_class']

print('The whole dataset contains ' + str(np.shape(class_data)[0]) + ' subjects')
print('The age prediction will be performed using ' + str(np.shape(X)[1]) + ' MRI-derived features')
print() 

ACC_train, ACC_test = classification_nestedCV(X, y, SEED, n_folds)
print_to_std(ACC_train, ACC_test, "ACC")

# Sampling bias

However, what if one subset of our data contains only people of a certain age or income level? This is typically referred to as sampling bias.

Sampling bias is a systematic error due to a non-random sample of a population, causing some members of the population to be less likely to be included than others, resulting in a biased sample.
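
The effect can be illustrated with a hypothetical example: if the rows of a dataset are ordered by age and the test set is taken as the last 25% of the rows without shuffling, it contains only the oldest subjects. The ages below are simulated for the sketch and are not the ages of the real subjects.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical ages of 86 subjects, sorted in ascending order
ages = np.linspace(19, 85, 86)

# Non-random split: the test set is simply the last 25% of the rows
_, biased_test = train_test_split(ages, test_size=0.25, shuffle=False)

# Random split: subjects are shuffled before splitting
_, random_test = train_test_split(ages, test_size=0.25, shuffle=True,
                                  random_state=0)

print(f'Mean age, whole sample:        {ages.mean():.1f}')
print(f'Mean age, unshuffled test set: {biased_test.mean():.1f}')
print(f'Mean age, shuffled test set:   {random_test.mean():.1f}')
```

A model evaluated on the unshuffled test set would be tested only on ages it never saw during training, so its error would say little about performance across the whole age range.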

# Repetition of holdout validation

In [None]:
print('#### HOLDOUT REPETITION')

### REGRESSION ###
print('***Regression task')

X = reg_data.iloc[:,2:5]
y = reg_data['Age']

print('The whole dataset contains ' + str(np.shape(reg_data)[0]) + ' subjects')
print('The age prediction will be performed using ' + str(np.shape(X)[1]) + ' MRI-derived features')
print() 

for SEED in range(0,10):
    regression_holdout(X, y, seed = SEED, test_size = 0.25)

In [None]:
### CLASSIFICATION ###
print('***Classification task')

X = class_data.iloc[:,2:5]
y = class_data['Age_class']

print('The whole dataset contains ' + str(np.shape(class_data)[0]) + ' subjects')
print('The age prediction will be performed using ' + str(np.shape(X)[1]) + ' MRI-derived features')
print() 

for SEED in range(0,10):
    classification_holdout(X, y, SEED, test_size = 0.25, stratify = None)

# Repetition of CV

In [None]:
#### SAMPLING BIAS: CV REPETITION ###
print('#### CV REPETITION')

### REGRESSION ###
n_folds = 5

print('***Regression task')

X = reg_data.iloc[:,2:5]
y = reg_data['Age']

print('The whole dataset contains ' + str(np.shape(reg_data)[0]) + ' subjects')
print('The age prediction will be performed using ' + str(np.shape(X)[1]) + ' MRI-derived features')
print() 

for SEED in range(1,10):
    MAE_train, MAE_test = regression_CV(X, y, SEED, n_folds)
    print_to_std(MAE_train, MAE_test, "MAE")


In [None]:
### CLASSIFICATION ###
n_folds = 5 

print('***Classification task')

X = class_data.iloc[:,2:5]
y = class_data['Age_class']

print('The whole dataset contains ' + str(np.shape(class_data)[0]) + ' subjects')
print('The age prediction will be performed using ' + str(np.shape(X)[1]) + ' MRI-derived features')
print() 

for SEED in range(1,10):
    ACC_train, ACC_test = classification_CV(X, y, SEED, n_folds)
    print_to_std(ACC_train, ACC_test, "ACC")

# Unbalanced datasets

In some cases, the response variable may be strongly imbalanced.

For example, in a dataset of house prices, there might be a disproportionately large number of expensive houses. Or, in classification, there might be several times more negative samples than positive samples. For such problems, a slight variation of k-fold cross-validation is used, such that each fold contains approximately the same percentage of samples of each target class as the complete set or, in regression problems, approximately the same mean response value.

This variation is known as stratified k-fold CV.
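
The difference between plain and stratified splitting can be sketched with scikit-learn's `KFold` and `StratifiedKFold`. The labels below are hypothetical (80 samples of class 0 and 20 of class 1) and the helper `minority_fraction_per_fold` is defined only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical unbalanced labels: 80 "young" (0) and 20 "old" (1)
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((100, 1))  # the features do not affect the split itself


def minority_fraction_per_fold(cv):
    """Return the fraction of class-1 samples in each test fold."""
    return [y_demo[test_idx].mean() for _, test_idx in cv.split(X_demo, y_demo)]


plain = minority_fraction_per_fold(
    KFold(n_splits=5, shuffle=True, random_state=88))
stratified = minority_fraction_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=88))

print('Plain k-fold:     ', np.round(plain, 2))
print('Stratified k-fold:', np.round(stratified, 2))
```

With stratification, every test fold keeps the 20% share of the minority class, whereas with plain k-fold the share can vary from fold to fold. For a single holdout split, the same idea is available through the `stratify` argument of `train_test_split`.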

In [None]:
SEED = 88
#### UNBALANCED DATASETS ###
print('#### UNBALANCED DATASETS')
### CLASSIFICATION ###
print('***Classification task')

X = unbal_class_data.iloc[:,2:5]
y = unbal_class_data['Age_class']

print('The whole dataset contains ' + str(np.shape(unbal_class_data)[0]) + ' subjects')
print('The age prediction will be performed using ' + str(np.shape(X)[1]) + ' MRI-derived features')
print() 

'''
for SEED in range(0,100):
    print("SEED:", SEED)
    classification_holdout(X, y, SEED, stratify = None)
    classification_holdout(X, y, SEED, stratify = y)

'''

classification_holdout(X, y, SEED, test_size = 0.25, stratify = None)
classification_holdout(X, y, SEED, test_size = 0.25, stratify = y)

# SEED = 95, 91, 88