In [None]:
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import os.path as op
path_data = op.join(op.expanduser('~'), 'nilearn_data/')
%matplotlib inline

# Cross-validation in neuroimaging

This notebook is a slight extension of the Haxby decoding tutorial. It focuses on a few cross-validation methods you might employ when fitting a model, and covers some best practices.

First, we'll load and display the data we'll work with.

In [None]:
from nilearn import datasets
from nilearn import plotting

# By default the 2nd subject will be fetched
haxby_dataset = datasets.fetch_haxby()
fmri_filename = haxby_dataset.func[0]

# print basic information on the dataset
print('First subject functional nifti images (4D) are at: %s' %
      fmri_filename)  # 4D data

In [None]:
mask_filename = haxby_dataset.mask_vt[0]
plotting.plot_roi(mask_filename, bg_img=haxby_dataset.anat[0],
                  cmap='Paired')

Next we'll mask and vectorize the data...

In [None]:
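# NiftiMasker lives in nilearn.maskers in nilearn >= 0.9
# (older releases expose it as nilearn.input_data.NiftiMasker)
from nilearn.maskers import NiftiMasker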
# Load the mask from disk
masker = NiftiMasker(mask_img=mask_filename, standardize=True)

# Fitting the transformer initializes it to operate on new data
masker.fit(fmri_filename)

# Now we'll transform our fMRI data
fmri_masked = masker.transform(fmri_filename)

The variable "fmri_masked" is a numpy array. It is 2-D.

In [None]:
print(fmri_masked)

Its shape corresponds to the number of time-points x the number of
voxels in the mask. Note that this is far fewer than the total number of voxels in the NIfTI image.

In [None]:
print(fmri_masked.shape)
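
To make the masking step concrete, here is a minimal sketch on synthetic data using plain numpy. It only illustrates the reshaping idea and is not what `NiftiMasker` does internally (which also handles things like standardization); the names `volume_4d` and `roi_mask` are made up for this example.

```python
import numpy as np

# Synthetic stand-in for a 4D fMRI volume: (x, y, z, time)
rng = np.random.default_rng(0)
volume_4d = rng.normal(size=(4, 5, 6, 10))

# Hypothetical boolean 3D mask selecting a small region of interest
roi_mask = np.zeros((4, 5, 6), dtype=bool)
roi_mask[1:3, 2:4, 1:4] = True

# Boolean indexing keeps only the masked voxels; transposing puts
# time-points in rows and voxels in columns
masked_2d = volume_4d[roi_mask].T

print(masked_2d.shape)   # (10, 12): 10 time-points x 12 voxels in the mask
print(roi_mask.size)     # 120 voxels in the full volume
```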

## Load the behavioral labels

Now we'll load the behavioral labels for this dataset

In [None]:
import pandas as pd
import numpy as np

# Load the behavioral labels (one row per time-point) into a DataFrame
labels = pd.read_csv(haxby_dataset.session_target[0], delimiter=' ')
print(labels.head())

It looks like labels has the same length as our fMRI data, meaning that they share the same time-base.

In [None]:
print(labels.shape)
print(fmri_masked.shape)

Next, we'll retrieve the behavioral targets from the labels. These will be the "classes" that we attempt to predict.

Note that these labels are strings rather than integers. That's fine: `sklearn` classifiers accept string labels and encode them internally when we fit the model.

In [None]:
print(labels['labels'].values[:50])
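
As a quick check of that claim, the sketch below fits a classifier on a tiny made-up dataset with string labels. The data (`X_toy`, `y_toy`) are invented for illustration and have nothing to do with the Haxby recordings.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny synthetic dataset: two well-separated clusters with string labels
X_toy = np.array([[0.0, 0.5], [0.5, 0.0], [5.0, 5.5], [5.5, 5.0]])
y_toy = np.array(['cat', 'cat', 'face', 'face'])

clf = SVC(kernel='linear').fit(X_toy, y_toy)

# The fitted model stores the sorted unique labels and
# returns predictions in the same string form
print(clf.classes_)
print(clf.predict([[0.2, 0.2], [5.2, 5.2]]))
```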

## Restrict the analysis to cats and faces

We'll take a subset of samples so that we're only including cats and faces...

In [None]:
# Create a mask w/ Pandas
condition_mask = labels.eval('labels in ["face", "cat"]').values

# Equivalent mask w/ Numpy
# condition_mask = np.isin(labels['labels'].values, ['face', 'cat'])

# We apply this mask in the sample direction to restrict the
# classification to the face vs cat discrimination
fmri_masked = fmri_masked[condition_mask]
targets = labels[condition_mask]['labels'].values

Note that we now have fewer samples.

In [None]:
print(fmri_masked.shape)

## Fit our model

Finally, we'll fit our model!

In [None]:
from sklearn.svm import SVC
svc = SVC(kernel='linear')
print(svc)

As our data is already in the shape for `sklearn`, it's quite easy to fit the model.

In [None]:
svc.fit(fmri_masked, targets)

In [None]:
prediction = svc.predict(fmri_masked)
print(prediction)

## Validating our model

The proper way to measure error rates or prediction accuracy is via
cross-validation: leaving out some data and testing on it. There are many ways to do this.
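
To see why scoring on the training data is misleading, consider this sketch on pure noise (an assumption made for illustration: the features carry no information about the labels). With many more features than samples, a linear SVM can typically fit the training set perfectly, yet it should do no better than chance on held-out samples.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Pure noise: 80 samples, 500 features, labels unrelated to the features
X_noise = rng.normal(size=(80, 500))
y_noise = np.tile(['cat', 'face'], 40)   # alternating, so each half is balanced

clf = SVC(kernel='linear')
clf.fit(X_noise[:40], y_noise[:40])

# Score on the samples used for fitting vs. samples never seen by the model
train_acc = clf.score(X_noise[:40], y_noise[:40])
test_acc = clf.score(X_noise[40:], y_noise[40:])
print('accuracy on training data:', train_acc)
print('accuracy on held-out data:', test_acc)
```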

### ...by manually leaving out data

Let's leave out the 30 last data points during training, and test the
prediction on these 30 last points:

In [None]:
svc.fit(fmri_masked[:-30], targets[:-30])

prediction = svc.predict(fmri_masked[-30:])
print((prediction == targets[-30:]).sum() / float(len(targets[-30:])))

However, this seems unfortunate: we now have 30 fewer samples with which to fit the model. Ideally, we'd like to do two things:

* Validate our model properly (i.e., on held-out data not used in model fitting)
* Use as much of our data as possible.

It is difficult to satisfy both of these conditions properly, but *cross-validation* is one way of getting closer to this goal. 

### Cross-validation: Implementing a KFold loop

We can repeatedly split the data into training and test sets using a `KFold`
strategy. We'll create a cross-validation object with `sklearn`. When one iterates through its splits, it returns different indices for the training / test sets on each iteration.

Let's visualize what this will look like below...

In [None]:
# Create the KFold object
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

n_splits = 15
cv = KFold(n_splits=n_splits)

cv_sample = np.zeros([len(fmri_masked), n_splits])
for ii, (tr,tt) in enumerate(cv.split(fmri_masked)):
    cv_sample[tt, ii] = 1
fig, ax = plt.subplots()
ax.imshow(cv_sample, cmap='coolwarm', aspect='auto', interpolation='nearest')
_ = ax.set(xlabel='Iteration', ylabel='Data Index',
           title='Cross Validation indices\n(red = Test set)')

As you can see, on each iteration we hold out a different subset of samples. Next, we'll loop through this object, fit a model on one subset of data, and then test it on the other subset.

In [None]:
for train, test in cv.split(fmri_masked):
    svc.fit(fmri_masked[train], targets[train])
    prediction = svc.predict(fmri_masked[test])
    print(accuracy_score(targets[test], prediction))
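
If the indices returned by `cv.split` feel abstract, a toy example with 6 samples and 3 folds may help. `X_toy` is a placeholder array; only its length matters here.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(12).reshape(6, 2)   # 6 samples, 2 features

# Without shuffling, each fold holds out a contiguous chunk of samples
splits = list(KFold(n_splits=3).split(X_toy))
for fold, (train_idx, test_idx) in enumerate(splits):
    print('fold', fold, 'train:', train_idx, 'test:', test_idx)
```

Every sample appears in exactly one test set, so across the folds all of the data is used for both fitting and testing, just never at the same time.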

If all we want to do is score this model, note that `sklearn` has tools to perform cross-validation more succinctly:

In [None]:
from sklearn.model_selection import cross_val_score
cv_score = cross_val_score(svc, fmri_masked, targets)
print(cv_score)

> Note that we can speed things up by using all the CPUs of our computer
with the `n_jobs` parameter. However, be careful when doing this in a cluster environment, as you may be asking for resources that are not available to you.

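By default, `cross_val_score` uses a 5-fold split in recent versions of `sklearn` (3-fold in versions before 0.22), stratified by class for classifiers. We can control this by
passing a `cv` object, here the 15-fold `KFold` we created above: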

In [None]:
cv_score = cross_val_score(svc, fmri_masked, targets, cv=cv)
print(cv_score)

It's often useful to visualize these as a histogram, to get an idea for the distribution of cross-validated scores.

In [None]:
def plot_classifier_scores(scores):
    fig, ax = plt.subplots()
    ax.hist(scores)
    ax.axvline(.5, ls='--', c='r')
    ax.set_xlabel('Model Score')
plot_classifier_scores(cv_score)

### A quick note on cross-validating with time
As neuroscientists, our data is often collected across time. This might be across a very short time-scale (milliseconds) or a long one (hours, or days). Either way, it is crucial to consider the relationships between datapoints as a function of time. Consider the following facts:

* All time series data is correlated with itself (autocorrelated) to some degree
* Confounding variables may be the same on one day of acquisition, and different on another day
* The brain may have a different baseline internal state at the beginning of an experiment compared to the end.

As we've mentioned before, you must **always test the model on "new" data**. This means that the training and test sets should share as *little information as possible*. In other words, anything in the training set that could give you information about the test set, but that is not related to the features of interest, will **bias** the model towards the wrong answer, or inflate your model score.
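
The sketch below simulates this kind of leakage. It is an illustration under stated assumptions, not a model of real fMRI data: each block of consecutive samples shares one label and one block-specific offset (a crude stand-in for slow drift), and the features contain no genuine class signal. All variable names are made up for this example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_blocks, block_len, n_features = 20, 10, 50

# Each block of consecutive samples shares one label and one "drift" offset
block_id = np.repeat(np.arange(n_blocks), block_len)
y_sim = np.where(block_id % 2 == 0, 'cat', 'face')
drift = 3 * rng.normal(size=(n_blocks, n_features))
X_sim = drift[block_id] + rng.normal(size=(len(block_id), n_features))

clf = SVC(kernel='linear')

# Shuffled folds: neighbours of every test sample end up in the training set
leaky = cross_val_score(clf, X_sim, y_sim,
                        cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Grouped folds: each block is entirely in training or entirely in test
clean = cross_val_score(clf, X_sim, y_sim, groups=block_id,
                        cv=GroupKFold(n_splits=5))

print('shuffled KFold:', leaky.mean())
print('GroupKFold:    ', clean.mean())
```

The shuffled score should look impressive even though there is nothing to decode, because the model can recognize which block a test sample came from. Keeping whole blocks together removes that shortcut.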


## Leaving out recording sessions
The best way to do cross-validation is to respect the structure of
the experiment, for instance by leaving out full sessions of
acquisition.

The session number is stored in the CSV file with our
behavioral data. We'll apply our condition mask, and then leave out one session at a time. To do this, we'll use a `LeaveOneGroupOut` object (called `LeaveOneLabelOut` in older versions of `sklearn`):

In [None]:
# Session number for each retained sample
session_label = labels['chunks'][condition_mask]

# Iterate through sessions, validating on a held-out session
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
cv = LeaveOneGroupOut()
cv_score = cross_val_score(svc, fmri_masked, targets,
                           groups=session_label, cv=cv)
plot_classifier_scores(cv_score)
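
For intuition about what a leave-one-session-out split yields, here is a toy sketch using `LeaveOneGroupOut` from `sklearn.model_selection`. The session labels are hypothetical and the features are placeholders, since only the grouping matters.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical session labels for 8 samples recorded in 4 sessions
sessions = np.array([0, 0, 1, 1, 2, 2, 3, 3])
X_toy = np.zeros((8, 1))   # placeholder features

logo = LeaveOneGroupOut()
folds = list(logo.split(X_toy, groups=sessions))
for train_idx, test_idx in folds:
    print('held-out session:', np.unique(sessions[test_idx]),
          'test samples:', test_idx)

# One fold per session
n_folds = logo.get_n_splits(groups=sessions)
print(n_folds)
```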