# The importance of appropriate cross-validation

In [1]:
import warnings
warnings.filterwarnings("ignore")

In using machine learning on neuroimaging data, appropriate cross-validation methods are critical for drawing meaningful inferences.
However, a majority of neuroscience researchers are not familiar with how to choose an appropriate method for their data.

```{figure} ../images/poldrack-2020-fig3.jpg
---
height: 250px
name: cv-usage
---
From {cite}`Poldrack_2020`, depicting results from a review of 100 studies (2017–2019) claiming prediction on fMRI data.
Panel A shows prevalence of cross-validation methods used to assess predictive accuracy in this sample.
Panel B shows a histogram of associated sample sizes.
```

We briefly overview what cross-validation aims to achieve, as well as several different strategies for cross-validation that are in use with neuroimaging data.
We then provide examples of appropriate and inappropriate cross-validation within the `development_fmri` dataset. 

## Why cross-validate?

First, let's formalize the problem that cross-validation aims to solve, using notation from {cite}`Little_2017`. 

For $n$ observations, we can choose a variable $y \in \mathbb{R}^n$ that we are trying to predict from data $X \in \mathbb{R}^{n \times p}$ in the presence of confounds $Z \in \mathbb{R}^{n \times k}$.
For example, we may have neuroimaging data for 155 participants, from which we are trying to predict their age group as either a child or an adult.
There are additional confounding measures in this prediction, both measured and unmeasured.
For example, motion is a likely confounding variable, as children often move more in the scanner than adults.

In this notation, we can then consider $y$ as a function of $X$ and $Z$:

$$
  y = Xw + Zu + \epsilon
$$

where $\epsilon$ is observation noise, and we have assumed a strictly linear relationship between the variables.
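
To make the notation concrete, here is a minimal sketch that simulates data from this linear model with NumPy. The sizes `p` and `k`, the weights `w` and `u`, and the noise scale are all illustrative assumptions, not values taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): n observations, p features, k confounds
n, p, k = 155, 10, 2

X = rng.normal(size=(n, p))          # data
Z = rng.normal(size=(n, k))          # confounds
w = rng.normal(size=p)               # hypothetical weights on the data
u = rng.normal(size=k)               # hypothetical weights on the confounds
epsilon = 0.1 * rng.normal(size=n)   # i.i.d. observation noise

# The strictly linear model: y = Xw + Zu + epsilon
y = X @ w + Z @ u + epsilon
print(y.shape)
```

Because `Z` contributes to `y`, a model fit on `X` alone can only recover the part of `y` explained by `X`; the confound term behaves like additional, structured noise.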

In such a model, $\epsilon$ may be independent and identically distributed (i.i.d.) even though the relationship between $y$ and $X$ is not i.i.d.; for example, if it changes with age group membership.

The machine learning problem is to estimate a function $\hat{f}_{\{ train \}}$ that best predicts $y$ from $X$.
In other words, we want to minimize an error $\mathcal{E}(y,\hat{f}(X))$.

The challenge is that we are interested in the error on new, unknown, data.
Thus, we would like to know the expectation of the error for $(y, X)$ drawn from their unknown distribution:

$$
  \mathbb{E}_{(y,X)} [\mathcal{E}(y,\hat{f}(X))].
$$

From this we note two important points.
  1. Evaluation procedures _must_ test predictions of the model on held-out data, independent from the data used to train the model.
  2. Cross-validation procedures that vary the train set by repeating the train-test split many times also allow us to ask a related question: given future data to train a machine learning method on a clinical problem, what is the error that we can expect on new data?
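
Both points can be illustrated with a small simulation. The sketch below uses synthetic data and a ridge regression (both assumptions chosen for illustration, not the model used elsewhere in this book) and compares the error on the training data with the error on held-out data across repeated random train-test splits.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(0)

# Synthetic data (assumption): 155 observations, 50 features, unit-variance noise
n, p = 155, 50
X = rng.normal(size=(n, p))
w = rng.normal(size=p)
y = X @ w + rng.normal(size=n)

# Repeat the train-test split 20 times, holding out 20% of samples each time
cv = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
train_errors, test_errors = [], []
for train_idx, test_idx in cv.split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    train_errors.append(
        mean_squared_error(y[train_idx], model.predict(X[train_idx])))
    test_errors.append(
        mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(f"mean error on training data: {np.mean(train_errors):.2f}")
print(f"mean error on held-out data: {np.mean(test_errors):.2f}")
```

The error measured on the training data is optimistic; only the held-out error approximates the expectation above. Averaging over many splits also reflects how the estimate varies with the particular training set.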


## Forms of cross-validation

Given the importance of cross-validation, many different schemes exist.
The [scikit-learn documentation has a section](https://scikit-learn.org/stable/modules/cross_validation.html) just on this topic, which is worth reviewing in full.
Here, we briefly highlight several of the schemes in use in neuroimaging.


```{figure} ../images/varoquaux-2016-fig6.png
---
height: 400px
name: cv-strategies
---
From {cite}`Varoquaux_2017`, showing the difference in accuracy measured by cross-validation and on the held-out
validation set, in intra and inter-subject settings, for different cross-validation strategies:
(1) leave one sample out, (2) leave one block of samples out (where the block is the natural unit of the experiment: subject or session), and random splits leaving out 20% of the blocks as test data, with (3) 3, (4) 10, or (5) 50 random splits. 
For inter-subject settings, leave one sample out corresponds to leaving a session out.
The box gives the quartiles, while the whiskers give the 5th and 95th percentiles.
```
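
The strategies named in the caption map onto scikit-learn splitters. The following sketch assumes a hypothetical design of 10 subjects with 4 sessions each, treating the subject as the block, and checks whether any split places samples from the same subject in both the train and test sets.

```python
import numpy as np
from sklearn.model_selection import (GroupShuffleSplit, LeaveOneGroupOut,
                                     LeaveOneOut)

# Hypothetical design (assumption): 10 subjects, 4 sessions each
n_subjects, n_sessions = 10, 4
groups = np.repeat(np.arange(n_subjects), n_sessions)  # subject label per sample
X = np.zeros((groups.size, 1))  # placeholder features; only indices matter here

schemes = {
    "leave one sample out": LeaveOneOut(),
    "leave one block out": LeaveOneGroupOut(),
    "10 random splits, 20% of blocks": GroupShuffleSplit(
        n_splits=10, test_size=0.2, random_state=0),
}

summary = {}
for name, cv in schemes.items():
    splits = list(cv.split(X, None, groups))
    # Does any split share a subject between the train and test sets?
    shares_subject = any(
        np.intersect1d(groups[train], groups[test]).size > 0
        for train, test in splits
    )
    summary[name] = (len(splits), shares_subject)
    print(f"{name}: {len(splits)} splits, shared subjects: {shares_subject}")
```

In this toy design, leaving one sample out always keeps other sessions from the test subject in the training set, whereas the two block-based schemes keep each subject entirely on one side of the split.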

In [our classification example](class-example), we used `StratifiedShuffleSplit` for cross-validation.
This method preserves the percentage of samples for each class across train and test splits; that is, the percentages of child and adult participants in our classification example.
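
As a quick check of this behavior, the sketch below applies `StratifiedShuffleSplit` to an imbalanced label vector. The class counts (120 and 35) are illustrative assumptions rather than the actual composition of the dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Illustrative labels (assumption): 120 children and 35 adults
y = np.array(["child"] * 120 + ["adult"] * 35)
X = np.zeros((y.size, 1))  # placeholder features; only the labels matter here

cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

overall_fraction = np.mean(y == "child")
test_fractions = []
for train_idx, test_idx in cv.split(X, y):
    # Fraction of children in this test split
    test_fractions.append(np.mean(y[test_idx] == "child"))

print(f"overall fraction of children: {overall_fraction:.3f}")
print("per-split test fractions:", np.round(test_fractions, 3))
```

Each test split contains close to the same fraction of children as the full sample, so a classifier is never evaluated on a split dominated by one class.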

## Testing cross-validation schemes in our example dataset

We'll keep working with the same `development_fmri` dataset, though this time we'll fetch all 155 subjects.
Again, we'll derive functional connectivity matrices for each participant.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
from nilearn import (datasets, maskers, plotting)
from nilearn.connectome import ConnectivityMeasure

# Fetch every available subject (no n_subjects limit) and the MSDL atlas
development_dataset = datasets.fetch_development_fmri()
msdl_atlas = datasets.fetch_atlas_msdl()

# Extract one time series per atlas map, detrended and band-pass filtered
masker = maskers.NiftiMapsMasker(
    msdl_atlas.maps, resampling_target="data",
    t_r=2, detrend=True,
    low_pass=0.1, high_pass=0.01).fit()
# Functional connectivity is measured as the correlation between regions
correlation_measure = ConnectivityMeasure(kind='correlation')




<!-- 
```{code-call} python3
time_series = masker.transform(func_file, confounds=confound_file)
correlation_matrices = correlation_measure.fit_transform(time_series)
``` -->

```{bibliography} references.bib
:style: unsrt
:filter: docname in docnames
```