<a href="https://colab.research.google.com/github/chiaramarzi/ML-models-validation-2024/blob/main/model_validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Artificial intelligence (AI) for health - potentials

*   **Data mining**: finding patterns in big data
*   **Biomarker discovery**: determining potential (compound) biomarkers
*   The **predictive nature** of machine learning strategies is highly in line with the aim of clinical diagnosis and prognosis **in the single patient**

# Model validation

In machine learning, model validation is the process of evaluating a trained model on a testing data set. The testing data set is a separate portion of the same data set from which the training set is derived.
Model validation is carried out after model training.

Its aim is to estimate the **unbiased generalization performance** of the model.
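A minimal sketch of why a separate testing data set is needed. It assumes hypothetical toy data (`X_noise`, `y_noise`) in which the labels are independent of the features, so no model can genuinely do better than chance; none of these names come from the dataset used later in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical toy data: labels are pure noise, unrelated to the features.
X_noise = rng.normal(size=(300, 20))
y_noise = rng.integers(0, 2, size=300)

# First 200 samples are used for fitting, the last 100 are never seen by the model.
X_fit, y_fit = X_noise[:200], y_noise[:200]
X_new, y_new = X_noise[200:], y_noise[200:]

noise_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)

acc_on_training_data = accuracy_score(y_fit, noise_model.predict(X_fit))
acc_on_unseen_data = accuracy_score(y_new, noise_model.predict(X_new))

print(f"Accuracy on the data used for fitting: {acc_on_training_data:.2f}")
print(f"Accuracy on unseen data: {acc_on_unseen_data:.2f}")
```

The accuracy measured on the data used for fitting reflects memorization, while the accuracy on unseen data is the quantity that model validation tries to estimate.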

# Gene mutation prediction based on neuroimaging features

*   Can we predict the presence/absence of a genetic mutation using neuroimaging features?
*   Data: multicenter dataset containing the radiomics features extracted from brain lesions segmented on T1- and T2-weighted MR images
  * 250 patients, examined in 10 institutions (different MRI scanners and protocols), 25 patients for each institution
  * Each patient presents a brain lesion, visible in 3D T1- and T2-weighted MR images
  * 200 radiomic features (100 from T1-weighted MR image and 100 from T2-weighted MR image) describing the lesion's morphology, contrast, texture, etc.
  * Each patient underwent genetic testing to determine if they had a specific genetic mutation
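To make the layout of such a dataset concrete, here is a sketch that builds a random table with the same dimensions (250 patients, 10 institutions, 200 features). The column names `Institution` and `Gene_mutation` are the ones used in the data loading cell of this notebook; the feature names, the column order and the random values are assumptions made purely for illustration and do not reproduce the real file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

n_institutions, n_per_institution = 10, 25
n_patients = n_institutions * n_per_institution  # 250 patients in total

# Institution label of each patient: 25 consecutive patients per institution.
institution = np.repeat(np.arange(1, n_institutions + 1), n_per_institution)

# Binary target (mutation present/absent), drawn at random for illustration only.
gene_mutation = rng.integers(0, 2, size=n_patients)

# Hypothetical feature names: 100 from T1-weighted and 100 from T2-weighted images.
feature_names = [f"T1_feature_{i}" for i in range(100)] + [f"T2_feature_{i}" for i in range(100)]
features = rng.normal(size=(n_patients, len(feature_names)))

toy = pd.DataFrame(features, columns=feature_names)
toy.insert(0, "Gene_mutation", gene_mutation)
toy.insert(0, "Institution", institution)

print(toy.shape)
print(toy["Institution"].value_counts().sort_index())
```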


  

# Outline

Simple validation schemes:
1.   Holdout validation
2.   K-fold cross-validation (kfoldCV)
3.   Leave-One-Out CV (LOOCV)
4.   Group K-Fold Cross-Validation (gkfoldCV)
5.   Leave-One-Group-Out CV (LOGOCV)

Sampling bias:
6.   Repetition of holdout validation

Unbalanced datasets:
7.   Stratified holdout validation
8.   Stratified kfoldCV

Hyperparameter tuning:
9.   Training, validation and test set: the nested kfoldCV



# Cloning repository, libraries and data loading

In [None]:
# Libraries loading
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import (train_test_split, KFold, StratifiedKFold,
                                     LeaveOneOut, LeaveOneGroupOut, ShuffleSplit,
                                     GroupKFold, cross_validate, GridSearchCV)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [None]:
cd "drive/MyDrive/Colab Notebooks/"

In [None]:
# Data loading
data = pd.read_csv("simulated_data.csv")
X = data.iloc[:, 2:]  # all columns from the third onward are used as features
y = data["Gene_mutation"]
groups = data["Institution"]

In [None]:
groups

In [None]:
# Random Forest classifier initialization
model = RandomForestClassifier(random_state = 42)

# 1. Holdout validation

The principle is simple: you randomly split your data into roughly X% used for training the model and the remaining (100-X)% used for testing the model.

FIGURE


In [None]:
# 1. Train-Test Split
print("1. Train-Test Split")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify = None)
print("Training set:", list(X_train.index))
print("Test set:", list(X_test.index))
model.fit(X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print('#####')
print(f"Training Accuracy: {train_acc}")
print(f"Test Accuracy: {test_acc}\n")

# 2. K-fold cross-validation (kfoldCV)

It splits the data into k folds, then trains the model on k-1 folds and tests it on the one fold that was left out. It does this for all folds and averages the results obtained in the k test folds.  

FIGURE

The advantage is that all observations are used for both training and validation, and each observation is used once for validation.
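The claim that each observation is used exactly once for validation can be checked directly. The following sketch uses a hypothetical toy array of 20 samples (`X_toy` is not part of the notebook's data) and counts how often each index lands in a test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 20
X_toy = np.arange(n_samples).reshape(-1, 1)  # hypothetical toy data

kf_toy = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how many times each sample index appears in a test fold.
test_counts = np.zeros(n_samples, dtype=int)
for train_idx, test_idx in kf_toy.split(X_toy):
    # Within one split, training and test indices never overlap.
    assert len(np.intersect1d(train_idx, test_idx)) == 0
    test_counts[test_idx] += 1

print(test_counts)
```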

We typically choose either k=5 or k=10, as these values strike a good balance between computational cost and the reliability of the performance estimate.

The per-fold scores from CV techniques are more informative than one may think. They are mostly used simply to extract the average performance. However, one can also look at their variance or standard deviation, which gives information about the stability of the model across different data inputs.

In [None]:
# 2. K-Fold Cross-Validation
print("2. K-Fold Cross-Validation")
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for i, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

scores = cross_validate(model, X, y, cv = kf, scoring='accuracy', return_train_score=True)
print('#####')
print(f"Training Accuracies: {scores['train_score']}")
print(f"Test Accuracies: {scores['test_score']}\n")
print(f"Mean Training Accuracy: {np.mean(scores['train_score'])}")
print(f"Std Training Accuracy: {np.std(scores['train_score'])}\n")
print(f"Mean Test Accuracy: {np.mean(scores['test_score'])}")
print(f"Std Test Accuracy: {np.std(scores['test_score'])}\n")

# 3. Leave-One-Out CV (LOOCV)

A variant of k-fold CV is Leave-one-out Cross-Validation (LOOCV).

LOOCV uses each sample in the data as a separate test set while all remaining samples form the training set. This variant is identical to k-fold CV when k = n (number of observations).

LOOCV is computationally very costly as the model needs to be trained n times. Only do this if the dataset is small or if you can handle that many computations.
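The equivalence between LOOCV and k-fold CV with k = n can be verified on a small example. This sketch assumes a hypothetical toy array of 6 samples and an unshuffled `KFold`, and compares the two sets of splits index by index.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_toy = np.arange(6).reshape(-1, 1)  # hypothetical toy data with n = 6 samples

# Collect the (train, test) index lists produced by each splitter.
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X_toy)]
kf_splits = [(tr.tolist(), te.tolist()) for tr, te in KFold(n_splits=len(X_toy)).split(X_toy)]

same_splits = loo_splits == kf_splits
n_model_fits = len(loo_splits)  # one model has to be trained per split

print(f"Identical splits: {same_splits}")
print(f"Number of model fits required: {n_model_fits}")
```

Since the number of fits grows linearly with the number of samples, the cost of LOOCV on the 250-patient dataset described above would be 250 model trainings.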

In [None]:
# 3. Leave-One-Out Cross-Validation (LOO)
print("3. Leave-One-Out Cross-Validation (LOO)")
loo = LeaveOneOut()
scores = cross_validate(model, X, y, cv = loo, scoring='accuracy', return_train_score=True)
print('#####')
print(f"Training Accuracies: {scores['train_score']}")
print(f"Test Accuracies: {scores['test_score']}\n")
print(f"Mean Training Accuracy: {np.mean(scores['train_score'])}")
print(f"Std Training Accuracy: {np.std(scores['train_score'])}\n")
print(f"Mean Test Accuracy: {np.mean(scores['test_score'])}")
print(f"Std Test Accuracy: {np.std(scores['test_score'])}\n")

# 4. Group K-Fold Cross-Validation (gkfoldCV)

It is a variant of conventional kfoldCV that builds into the validation scheme a particular property of the dataset, i.e., the presence of a hierarchical structure.  
Each "parent node", called a "group" in machine learning terminology, will appear exactly once in the test set across all folds, so samples from the same group never end up in both the training and the test set of a fold (the number of distinct groups has to be at least equal to the number of folds).
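A short sketch of the two properties just described, assuming a hypothetical group layout of 10 groups with 25 samples each (similar in shape to the institutions of the dataset, but not the real data): every group is tested exactly once, and no group is shared between the training and test sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy layout: 10 groups of 25 samples each.
groups_toy = np.repeat(np.arange(10), 25)
X_toy = np.zeros((len(groups_toy), 1))  # feature values are irrelevant for splitting

test_appearances = {int(g): 0 for g in np.unique(groups_toy)}
for train_idx, test_idx in GroupKFold(n_splits=5).split(X_toy, groups=groups_toy):
    train_groups = set(groups_toy[train_idx].tolist())
    test_groups = set(groups_toy[test_idx].tolist())
    # No group may contribute samples to both sides of a split.
    assert train_groups.isdisjoint(test_groups)
    for g in test_groups:
        test_appearances[g] += 1

print(test_appearances)
```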

FIGURE (to be made)

In [None]:
# 4. Group K-Fold Cross-Validation
print("4. Group K-Fold Cross-Validation")
gkf = GroupKFold(n_splits=5)
for i, (train_index, test_index) in enumerate(gkf.split(X, y, groups)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}, group={list(groups[train_index])}")
    print(f"  Test:  index={test_index}, group={list(groups[test_index])}")
scores = cross_validate(model, X, y, groups = groups, cv = gkf, scoring='accuracy', return_train_score=True)
print('#####')
print(f"Training Accuracies: {scores['train_score']}")
print(f"Test Accuracies: {scores['test_score']}\n")
print(f"Mean Training Accuracy: {np.mean(scores['train_score'])}")
print(f"Std Training Accuracy: {np.std(scores['train_score'])}\n")
print(f"Mean Test Accuracy: {np.mean(scores['test_score'])}")
print(f"Std Test Accuracy: {np.std(scores['test_score'])}\n")

# 5. Leave-One-Group-Out CV (LOGOCV)

It is a variant of conventional LOOCV that builds into the validation scheme a particular property of the dataset, i.e., the presence of a hierarchical structure.  
Each "parent node", called a "group" in machine learning terminology, is held out exactly once as the test set, so the number of folds equals the number of groups.
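As a quick check of this behaviour, the sketch below assumes a hypothetical layout of 10 groups with 25 samples each and lists which group is held out in every split produced by `LeaveOneGroupOut`.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical toy layout: 10 groups of 25 samples each.
groups_toy = np.repeat(np.arange(10), 25)
X_toy = np.zeros((len(groups_toy), 1))  # feature values are irrelevant for splitting

logo_toy = LeaveOneGroupOut()
n_folds = logo_toy.get_n_splits(groups=groups_toy)

# Record the distinct groups found in each test set.
held_out = []
for train_idx, test_idx in logo_toy.split(X_toy, groups=groups_toy):
    held_out.append(np.unique(groups_toy[test_idx]).tolist())

print(f"Number of folds: {n_folds}")
print(f"Groups held out per fold: {held_out}")
```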

In [None]:
# 5. Leave-One-Group-Out Cross-Validation
print("5. Leave-One-Group-Out Cross-Validation")
logo = LeaveOneGroupOut()
for i, (train_index, test_index) in enumerate(logo.split(X, y, groups)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}, group={list(groups[train_index])}")
    print(f"  Test:  index={test_index}, group={list(groups[test_index])}")
scores = cross_validate(model, X, y, groups = groups, cv = logo, scoring='accuracy', return_train_score=True)
print('#####')
print(f"Training Accuracies: {scores['train_score']}")
print(f"Test Accuracies: {scores['test_score']}\n")
print(f"Mean Training Accuracy: {np.mean(scores['train_score'])}")
print(f"Std Training Accuracy: {np.std(scores['train_score'])}\n")
print(f"Mean Test Accuracy: {np.mean(scores['test_score'])}")
print(f"Std Test Accuracy: {np.std(scores['test_score'])}\n")

# Sampling bias

What if one subset of our data only contains people of a certain age or income level? This is typically referred to as sampling bias.

**Sampling bias** is a systematic error due to a non-random sample of a population, causing some members of the population to be less likely to be included than others, resulting in a biased sample.

In practice, any single run of a validation scheme can be affected by sampling bias, because the result depends on which samples happen to fall in the test set. One possible solution is the n-times repetition of the validation scheme.

## 6. Repetition of holdout validation

One way to mitigate sampling bias is the **n-times repetition** of the validation method, each time changing the seed of the pseudo-random number generator that determines the data splitting.

The performance on the different test sets is then averaged.
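The idea of repeating the holdout with different seeds can be written out explicitly as a loop. This is only a sketch on hypothetical synthetic data generated with `make_classification`; the number of repetitions and the small forest size are arbitrary choices made to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data, used only to illustrate the procedure.
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

test_accuracies = []
for seed in range(10):
    # A different seed gives a different random train/test split.
    X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=seed)
    clf_toy = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_tr, y_tr)
    test_accuracies.append(accuracy_score(y_te, clf_toy.predict(X_te)))

print(f"Test accuracies: {np.round(test_accuracies, 3)}")
print(f"Mean: {np.mean(test_accuracies):.3f}, Std: {np.std(test_accuracies):.3f}")
```

The spread of the individual accuracies shows how much a single holdout result depends on the particular split.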

In [None]:
# 6. ShuffleSplit
print("6. ShuffleSplit")
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
for i, (train_index, test_index) in enumerate(ss.split(X)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")
    # Flag the splits in which sample 0 ends up in the test set
    if 0 in list(test_index):
        print()
        print("0 found in the test set.")
        print()
scores = cross_validate(model, X, y, cv = ss, scoring='accuracy', return_train_score=True)
print('#####')
print(f"Training Accuracies: {scores['train_score']}")
print(f"Test Accuracies: {scores['test_score']}\n")
print(f"Mean Training Accuracy: {np.mean(scores['train_score'])}")
print(f"Std Training Accuracy: {np.std(scores['train_score'])}\n")
print(f"Mean Test Accuracy: {np.mean(scores['test_score'])}")
print(f"Std Test Accuracy: {np.std(scores['test_score'])}\n")

# Unbalanced datasets

In some cases, there may be a large imbalance between the classes of the response variable.

For example, in classification, there might be several times more negative samples than positive samples. For such problems, a slight variation of the simple validation techniques is used (stratification), such that each fold contains approximately the same percentage of samples of each target class as the complete dataset.
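The effect of stratification can be seen by comparing the fraction of positive samples in each test fold with and without it. The sketch assumes a hypothetical label vector with 80 negative and 20 positive samples (an overall positive fraction of 0.2).

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 80 negatives and 20 positives.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # feature values are irrelevant for splitting


def positive_fraction_per_fold(splitter):
    """Return the fraction of positive samples in each test fold."""
    return [float(y_toy[test_idx].mean()) for _, test_idx in splitter.split(X_toy, y_toy)]


plain_fractions = positive_fraction_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
stratified_fractions = positive_fraction_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print(f"Plain k-fold:      {plain_fractions}")
print(f"Stratified k-fold: {stratified_fractions}")
```

With plain k-fold the positive fraction is free to vary from fold to fold, whereas the stratified folds keep it at the overall value.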

## 7. Stratified holdout validation

In [None]:
# 7. Stratified Train-Test Split
print("7. Stratified Train-Test Split")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify = y)
model.fit(X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print('#####')
print(f"Training Accuracy: {train_acc}")
print(f"Test Accuracy: {test_acc}\n")

## 8. Stratified kfoldCV

In [None]:
# 8. Stratified K-Fold Cross-Validation
print("8. Stratified K-Fold Cross-Validation")
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(model, X, y, cv = skf, scoring='accuracy', return_train_score=True)
print('#####')
print(f"Training Accuracies: {scores['train_score']}")
print(f"Test Accuracies: {scores['test_score']}\n")
print(f"Mean Training Accuracy: {np.mean(scores['train_score'])}")
print(f"Std Training Accuracy: {np.std(scores['train_score'])}\n")
print(f"Mean Test Accuracy: {np.mean(scores['test_score'])}")
print(f"Std Test Accuracy: {np.std(scores['test_score'])}\n")

# Hyperparameter tuning

A random forest classifier is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. For classification tasks, the output of the random forest is the class selected by most trees (the modal response).
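The sketch below illustrates how the individual trees are combined, using hypothetical synthetic data. As far as the scikit-learn documentation describes it, `RandomForestClassifier` averages the class probabilities predicted by the trees rather than counting one vote per tree, so the code compares both aggregation rules; the majority vote computed here is written by hand for illustration and is not a library function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic binary problem (labels 0 and 1).
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_toy, y_toy)

# Class probabilities predicted by every single tree: shape (n_trees, n_samples, n_classes).
tree_probas = np.stack([tree.predict_proba(X_toy) for tree in forest.estimators_])

# Averaging the per-tree probabilities ("soft" voting).
mean_proba = tree_probas.mean(axis=0)
matches_forest = np.allclose(mean_proba, forest.predict_proba(X_toy))
soft_vote = mean_proba.argmax(axis=1)

# Hand-written majority vote: each tree votes for its most probable class.
tree_votes = tree_probas.argmax(axis=2)
hard_vote = (tree_votes.mean(axis=0) > 0.5).astype(int)

agreement = float(np.mean(hard_vote == soft_vote))
print(f"Averaged tree probabilities match forest.predict_proba: {matches_forest}")
print(f"Agreement between majority vote and averaged probabilities: {agreement:.3f}")
```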

While random forests often achieve higher accuracy than a single decision tree, they sacrifice the intrinsic interpretability present in decision trees. Decision trees are among a fairly small family of machine learning models that are easily interpretable along with linear models, rule-based models, and attention-based models. This interpretability is one of the most desirable qualities of decision trees. It allows developers to confirm that the model has learned realistic information from the data and allows end-users to have trust and confidence in the decisions made by the model[1]. For example, following the path that a decision tree takes to make its decision is quite trivial, but following the paths of tens or hundreds of trees is much harder. To achieve both performance and interpretability, some model compression techniques allow transforming a random forest into a minimal "born-again" decision tree that faithfully reproduces the same decision function [2, 3].

The random forest classifier has, among others, the following **hyperparameters**:
- *n_estimators*, default=100: the number of trees in the forest
- *max_depth*, default=None: the maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than *min_samples_split* samples
- *min_samples_split*, default=2: the minimum number of samples required to split an internal node
- *min_samples_leaf*, default=1: the minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least *min_samples_leaf* training samples in each of the left and right branches
- *max_features*, default="sqrt" (alternatives: "log2", None, int): the number of features to consider when looking for the best split
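The values listed above can be inspected on an untrained estimator with `get_params()`. Note that defaults may differ between scikit-learn releases, so the printed values reflect the installed version rather than a fixed specification.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters of a classifier created with default settings.
params = RandomForestClassifier().get_params()

for name in ["n_estimators", "max_depth", "min_samples_split", "min_samples_leaf", "max_features"]:
    print(f"{name}: {params[name]!r}")
```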

**References**   
[1] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2008). The Elements of Statistical Learning (2nd ed.). Springer. ISBN 0-387-95284-5.   
[2] Sagi, Omer; Rokach, Lior (2020). "Explainable decision forest: Transforming a decision forest into an interpretable tree". Information Fusion. 61: 124–138. doi:10.1016/j.inffus.2020.03.013. S2CID 216444882.   
[3] Vidal, Thibaut; Schiffer, Maximilian (2020). "Born-Again Tree Ensembles". International Conference on Machine Learning. 119. PMLR: 9743–9753. arXiv:2003.11132.

When you are optimizing the hyperparameters of your model and you use the same k-fold CV strategy to both tune the model and evaluate its performance, you run the risk of overfitting, i.e., of obtaining an optimistically biased performance estimate. You do not want to estimate the accuracy of your model on the same splits that you used to find the best hyperparameters.


Instead, we use a nested k-fold cross-validation strategy, which separates the hyperparameter tuning step from the error estimation step. To do this, we nest two k-fold cross-validation loops:


*   The inner loop for hyperparameter tuning and
*   the outer loop for estimating accuracy
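The difference between the two strategies can be sketched on hypothetical synthetic data with a deliberately tiny grid (the data, the grid and the fold counts are assumptions chosen to keep the example fast). The non-nested score is the best cross-validated score found during tuning, which tends to be optimistic; the nested score re-runs the whole tuning procedure inside each outer training set and evaluates on outer test folds that played no role in the tuning.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Hypothetical synthetic data and a deliberately small grid.
X_toy, y_toy = make_classification(n_samples=120, n_features=20, random_state=0)
toy_grid = {"max_depth": [2, 4]}

inner_cv_toy = KFold(n_splits=2, shuffle=True, random_state=0)
outer_cv_toy = KFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=42),
    param_grid=toy_grid,
    cv=inner_cv_toy,
    scoring="accuracy",
)

# Non-nested: the same folds are used to pick the hyperparameters and to report the score.
search.fit(X_toy, y_toy)
non_nested_score = search.best_score_

# Nested: the grid search is repeated on each outer training set and scored on the outer test fold.
nested_scores = cross_val_score(search, X_toy, y_toy, cv=outer_cv_toy, scoring="accuracy")
nested_score = nested_scores.mean()

print(f"Non-nested CV accuracy: {non_nested_score:.3f}")
print(f"Nested CV accuracy:     {nested_score:.3f}")
```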

FIGURE

## 9. Nested kfoldCV

In [None]:
# 9. Nested kfoldCV
outer_cv = KFold(n_splits=3, shuffle=True, random_state=0)
inner_cv = KFold(n_splits=2, shuffle=True, random_state=0)

clf = RandomForestClassifier(random_state = 42)
param_grid = {
	'n_estimators': [10, 20, 25],
	'max_depth': [2, 3, 4],
	'max_features': [None, "sqrt", "log2"],
}

grid_search = GridSearchCV(clf,
                           param_grid=param_grid,
                           cv=inner_cv,
                           refit='accuracy',
                           scoring='accuracy',
                           n_jobs=1,
                           verbose = 4)
score = cross_validate(grid_search, X=X, y=y, cv=outer_cv, return_train_score=True, return_estimator=True, scoring = 'accuracy', n_jobs=1)

print('#####')
print(f"Training Accuracies: {score['train_score']}")
print(f"Test Accuracies: {score['test_score']}\n")
print(f"Mean Training Accuracy: {np.mean(score['train_score'])}")
print(f"Std Training Accuracy: {np.std(score['train_score'])}\n")
print(f"Mean Test Accuracy: {np.mean(score['test_score'])}")
print(f"Std Test Accuracy: {np.std(score['test_score'])}\n")
# Best estimator selected by the inner loop in each outer fold
for fold_estimator in score['estimator']:
    print(fold_estimator.best_estimator_)