# Lecture 12: Encoding Models and Model Prediction

## Goals
- **Neuroscience / Neuroimaging concepts**
    - Encoding Models
    - Feature Spaces
    - Building a predictive model of brain activity
    - Splitting fMRI data into testing and training sets
- **Data science / Coding concepts**
    - Multiple Comparison Correction
    - Bonferroni Correction
    - Linear vs. non-linear transformations
    - Ecological Validity
    - Predicting held out data
    - Testing model performance (using correlation)
    - Cross-Validation

# Setup

Simply run the cells below, which import all the Python modules we'll need and set up matplotlib for plotting in this Jupyter notebook.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import scipy
import scipy.stats
from scipy.stats import zscore
import nibabel
import cortex
import os
from nistats.hemodynamic_models import glover_hrf as create_hrf
from sklearn.linear_model import LinearRegression

np.random.seed(42)

# Set plotting defaults
%matplotlib inline

## Helper Functions

In [None]:
def load_nifti(filename, zscore=True, mask=None):
    img = nibabel.load(filename)
    # get_fdata() replaces the deprecated get_data() and always returns floats
    data = img.get_fdata().T
    if mask is not None:
        data = data[:, mask]
    if zscore:
        data -= data.mean(0)
        data /= data.std(0) + 1e-8
    return data

In [None]:
def create_fake_linear(n, a, b, noise_sd=None):
    x = np.random.randn(n)
    y = a * x + b
    if noise_sd is not None:
        y += np.random.randn(n) * noise_sd
    return (x,y)

In [None]:
def convolve_designmat(designmat):
    
    num_stim_types = designmat.shape[1]
    n = designmat.shape[0]

    # create the hrf
    hrf = create_hrf(tr=2, oversampling=1, time_length=32)
    
    # build the response design matrix by convolving each stimulus column with the hrf
    cur_designmat = np.zeros(designmat.shape)
    for cur_column in np.arange(num_stim_types):
        cur_stim_vec = designmat[:, cur_column]
        # np.convolve returns n + len(hrf) - 1 samples, so keep only the first n
        cur_designmat[:, cur_column] = np.convolve(cur_stim_vec, hrf)[:n]

    return cur_designmat

In [None]:
def shuffle_blocks(x, block_size=10):
    n = x.shape[0]
    n_blocks = n // block_size     

    # shuffle the indices by blocks
    indices = np.arange(n)
    blocked_indices = indices.reshape([n_blocks,block_size])
    np.random.shuffle(blocked_indices)
    indices_block_shuffled = blocked_indices.reshape([-1])    
    x_block_shuffled = x[indices_block_shuffled,:]

    return x_block_shuffled

In [None]:
def calc_contrast_convolve(designmat, bold_data, contrast_idxs):
    # convolve the stimulus design matrix with the hrf and return it
    designmat_conv = convolve_designmat(designmat)
    
    # fit the model on the convolved design matrix
    model = LinearRegression()
    model.fit(designmat_conv, bold_data)
    contrast = model.coef_[:, contrast_idxs[0]] - model.coef_[:, contrast_idxs[1]]
    return contrast

In [None]:
def z_score(x):
    x = x - x.mean(axis=0)
    x = x / (x.std(axis=0) + 1e-18)
    return x

In [None]:
def correlate(x, y):
    x_z = z_score(x)
    y_z = z_score(y)
    return np.mean(x_z*y_z, axis=0)

# Review

Last week we covered how to calculate contrasts using multiple regression, and briefly touched on how to use permutation tests to compute p-values for hypothesis testing. Let's review and dig deeper.

## Load the visual localizer fMRI data

We'll start by loading the visual category localizer data that we've been using.

In [None]:
mask = cortex.db.get_mask('s01', 'catloc', 'cortical')

data01 = load_nifti("/data/cogneuro/fMRI/categories/s01_categories_01.nii.gz", zscore=True, mask=mask)
data02 = load_nifti("/data/cogneuro/fMRI/categories/s01_categories_02.nii.gz", zscore=True, mask=mask)
data03 = load_nifti("/data/cogneuro/fMRI/categories/s01_categories_03.nii.gz", zscore=True, mask=mask)

# Concatenate the data
data = np.concatenate((data01, data02, data03), axis=0)
data.shape

Now extract the data for one voxel in the FFA and one voxel in the PPA.

In [None]:
FFA_Vox_Idx = 3464
PPA_Vox_Idx = 10433
data_faces = data[:, FFA_Vox_Idx]
data_places = data[:, PPA_Vox_Idx]

And pull out the number of TRs and the number of cortical voxels in the data, and create a vector that represents time for the data as well.

In [None]:
n_TRs = data.shape[0]
time = np.arange(0, (n_TRs * 2), 2)
num_cortical_voxels = data.shape[1]

Load the category data

In [None]:
categories = np.load("/data/cogneuro/fMRI/categories/catloc_experimental_conditions.npy")
unique_categories_no_nothing = np.unique(categories[categories != 'nothing'])

Simply create the stimulus design matrix here, since we'll need it, and not the response design matrix, to do permutation testing later on.

In [None]:
designmat_stimulus = np.zeros((len(categories), len(unique_categories_no_nothing)))
for i in np.arange(len(unique_categories_no_nothing)):
    designmat_stimulus[:,i] = (categories == unique_categories_no_nothing[i])

And we'll use a helper method to convolve the stimulus design matrix with an hrf to create the response design matrix.

In [None]:
designmat_response = convolve_designmat(designmat_stimulus)

### Fit a Multiple Regression Model

In [None]:
model_visual = LinearRegression()
_ = model_visual.fit(designmat_response, data)
weights = model_visual.coef_

### Calculate and plot Contrasts

For this visual category localizer experiment 4 different contrasts are generally defined:

1. **FFA**: faces - places
2. **PPA**: places - objects
3. **EBA**: bodies - objects
4. **LO**: objects - scrambled_objects

Let's create the PPA contrast here:

In [None]:
unique_categories_no_nothing

In [None]:
contrast_idxs_PPA = (3, 2)
contrast_PPA = weights[:, contrast_idxs_PPA[0]] - weights[:, contrast_idxs_PPA[1]]

### Flatmap of the PPA Contrast

In [None]:
volume_contrast_PPA = cortex.Volume(contrast_PPA, 's01', 'catloc', vmin=0, cmap='Reds')
_ = cortex.quickshow(volume_contrast_PPA)

### Calculate Significance of Contrast using Permutation Testing

In [None]:
# Step 1: define the number of resamples to do
NUM_RESAMPLES = 100

# Step 2: create the null distribution vector
null_dist_PPA = np.zeros((NUM_RESAMPLES, num_cortical_voxels))

# Step 3: loop and resample to create the null distribution
for cur_iter in range(NUM_RESAMPLES):
    
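    # Step 3a: resample the stimulus design matrix without replacement by shuffling the blocks
    # (calc_contrast_convolve applies the hrf convolution itself, so pass the unconvolved matrix)
    cur_designmat_shuffle = shuffle_blocks(designmat_stimulus)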
    
    # Step 3b: calculate the contrast value using the current shuffled design matrix
    cur_contrast_PPA = calc_contrast_convolve(cur_designmat_shuffle, data, contrast_idxs_PPA)
    
    # Step 3c: store the contrast in the null distribution vector
    null_dist_PPA[cur_iter, :] = cur_contrast_PPA

# Step 4: Calculate the p-value to determine significance
num_greater_null_PPA = (null_dist_PPA > contrast_PPA).sum(axis=0)
p_value_PPA = (num_greater_null_PPA + 1) / (NUM_RESAMPLES+1)

### Flatmap of Significant Voxels

In [None]:
volume_pvalue_PPA = cortex.Volume(1-p_value_PPA, 's01', 'catloc', vmin=0.95, vmax=1, cmap='Reds')
_ = cortex.quickshow(volume_pvalue_PPA)

#### Breakout Session

1\. Calculate the significance for the EBA contrast using permutation testing.

2\. Plot a flatmap of `1-pvalue` for the EBA contrast. Make sure to set `vmin=0.95` and `vmax=1` so it shows up. 

# Multiple Comparison Correction

### What is a p-value again?

A p-value is the probability of observing a test statistic at least as extreme as the one we measured, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true. Rejecting the null hypothesis at $\alpha$`=.05` means that when the null hypothesis really is true (the real value is 0, or whatever the null hypothesis states), we will still reject it 5% of the time. That's acceptable if we are only calculating one p-value, but if we are calculating 100 p-values and none of the effects are real, we'll still find 5 on average that "look" significant. Here's a comic that uses this idea for its punchline:

<img src='figures/significant.png' style=''>
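The "5 out of 100" intuition can be checked with a quick simulation. This is only a sketch: it assumes the tests are independent and that p-values under the null hypothesis are uniformly distributed between 0 and 1 (which holds for continuous test statistics). The variable names are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tests = 100         # number of hypothesis tests in one "study"
n_experiments = 2000  # how many times we repeat the whole batch of tests
alpha = 0.05

# Assumption: when the null hypothesis is true, p-values are uniform on [0, 1]
null_p_values = rng.uniform(0, 1, size=(n_experiments, n_tests))

# Count how many tests in each batch "look" significant even though nothing is real
num_false_positives = (null_p_values < alpha).sum(axis=1)

print("Average false positives per %d tests: %.2f" % (n_tests, num_false_positives.mean()))
print("Fraction of batches with at least one false positive: %.3f"
      % (num_false_positives > 0).mean())
```

The average should land close to 5, and nearly every batch of 100 tests should contain at least one false positive.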

We need to correct for this, and this correction is called a multiple comparisons correction. There are many ways to do this correction, some of which are more sensitive than others. The most common methods used in fMRI fall into one of two categories: **Family Wise Error** correction and the **False Discovery Rate**. The most basic correction that can be done is called a Bonferroni correction, and is a type of **Family Wise Error** correction. Let's explore that now...

### The dead salmon study: Why we do multiple comparisons correction

An influential study by Bennett et al. (2010) showed that a group of voxels in a dead salmon were implicated in processing social cues from images. Seeing as how the fish was *DEAD*, these findings were clearly some **false positives**, i.e., an effect that is marked as real or significant even though it really was not. This study was designed to show the perils of failing to correct for multiple comparisons.

<img src='figures/dead-salmon-fmri1.png'>

Here are some good blog posts about this study:
http://neuroskeptic.blogspot.com/2009/09/fmri-gets-slap-in-face-with-dead-fish.html
https://blogs.scientificamerican.com/scicurious-brain/ignobel-prize-in-neuroscience-the-dead-salmon-study/

The original study:

Bennett et al. "Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Proper Multiple Comparisons Correction" Journal of Serendipitous and Unexpected Results, 2010

### Bonferroni Correction: Adjusting the alpha level

If multiple hypotheses are tested, the chance of a rare event increases, and therefore, the likelihood of incorrectly rejecting a null hypothesis (i.e., making a Type I error) increases. The Bonferroni correction compensates for that increase by testing each individual hypothesis at a significance level of `alpha/m`, where alpha is the desired overall alpha level and m is the number of hypotheses. For example, if a trial is testing `m=20` hypotheses with a desired `alpha = 0.05`, then the Bonferroni correction would test each individual hypothesis at `alpha = 0.05 / 20 = 0.0025`.

from: https://en.wikipedia.org/wiki/Bonferroni_correction
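To see what the correction buys us, here is a small sketch that computes the probability of at least one false positive across `m` tests, with and without the Bonferroni threshold. It assumes the tests are independent, which is a simplification (neighboring voxels are usually correlated), and the variable names are chosen just for this example.

```python
import numpy as np

alpha = 0.05
num_tests = np.array([1, 20, 100, 1000])

# Probability of at least one false positive across m independent tests at level alpha
fwer_uncorrected = 1 - (1 - alpha) ** num_tests

# Same probability when each test uses the Bonferroni threshold alpha / m
fwer_bonferroni = 1 - (1 - alpha / num_tests) ** num_tests

for m, u, b in zip(num_tests, fwer_uncorrected, fwer_bonferroni):
    print("m = %4d   uncorrected: %.3f   Bonferroni: %.3f" % (m, u, b))
```

Without correction the chance of at least one false positive climbs quickly toward 1, while with the Bonferroni threshold it stays at or below the desired overall alpha.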

Let's use the Bonferroni correction to adjust the alpha level for all cortical voxels.

In [None]:
alpha = 0.05
alpha_adjusted = alpha / num_cortical_voxels
alpha_adjusted

Now we can use this adjusted alpha level to determine significance. Let's recreate a map that uses this value.

In [None]:
contrast_PPA_sig_Bon = p_value_PPA < alpha_adjusted

In [None]:
volume_contrast_PPA_sig_Bon = cortex.Volume(contrast_PPA_sig_Bon, 's01', 'catloc', cmap='Reds', vmin=0, vmax=1)
_ = cortex.quickshow(volume_contrast_PPA_sig_Bon)

Huh, nothing is significant! This is not the truth, but rather a side-effect of the permutation test we did. Let's explore how the number of resamples (or size of the null distribution) affects the p-values we can estimate.

### Effect of Null Distribution Size on p-values

We know that the p-value calculated from doing a permutation test is:
$$ \frac{(numBiggerNull + 1)}{(numResamples+1)}$$

From this, we can see that the smallest p-value that is possible given a set number of resamples is:

$$ \frac{1}{(numResamples+1)}$$

Since we only did 100 resamples in the permutation test of the PPA contrast, the smallest p-value we can get is $\frac{1}{101} \approx 0.0099$, which is much larger than the adjusted alpha we calculated by doing the Bonferroni correction above. That is why we didn't find any significant voxels, not because there aren't actually any.
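The sketch below compares the smallest attainable p-value, `1 / (numResamples + 1)`, against a Bonferroni-adjusted alpha for a few resample counts. The voxel count of 50000 is a hypothetical value chosen for illustration, not the number of cortical voxels in this dataset, and the `example_` names are made up so they don't overwrite the notebook's variables.

```python
example_alpha = 0.05
example_num_voxels = 50000  # hypothetical voxel count, for illustration only
example_alpha_adjusted = example_alpha / example_num_voxels

# For each resample count, check whether the smallest possible p-value
# could ever fall below the Bonferroni-adjusted alpha
reachable = {}
for n_resamples in [100, 10_000, 1_000_000, 10_000_000]:
    smallest_p = 1 / (n_resamples + 1)
    reachable[n_resamples] = smallest_p < example_alpha_adjusted
    print("%10d resamples: smallest p = %.2e, can reach adjusted alpha: %s"
          % (n_resamples, smallest_p, reachable[n_resamples]))
```

Under these assumptions, a permutation test needs on the order of a million resamples before any voxel could survive a Bonferroni correction, which is why permutation tests are often paired with less conservative corrections.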

# Regions of Interest definition

From next lecture onward we will work with a brain surface that has regions of interest defined. Just to get an overview of what that looks like, here is a fully-annotated brain surface:

In [None]:
from IPython.display import SVG, display
svg = SVG("/data/cogneuro/pycortex_store/s03/overlays.svg")
display(svg)

# Encoding Models

Up until this point we've looked at data collected in localizer experiments, and used contrasts derived from a multiple regression model to infer selectivity in different brain regions. By fitting a multiple regression model we accounted for any correlation between independent variables, and by using intelligently designed contrasts we were able to control for unwanted cognitive processes, leaving only the cognitive process of interest. This has been the dominant method for analysis of fMRI data over the last 20 years, and as such we will call it the **conventional approach**. It has several drawbacks, however:

1. The contrasts are generally simple and rely on the experimental design to ensure all unwanted cognitive processes or other confounding variables are controlled for, preventing new confounds from being controlled for post-hoc.
2. The experimental design generally only allows for a single hypothesis to be tested.
3. For many brain regions, much is already known about the types of stimulus or task properties that any given brain region responds to. The conventional approach does not allow for all of this previous information to be controlled for in the analysis. This becomes a problem when there is correlation between the independent variables of a new study and independent variables of previous studies. What may look like new findings could simply be due to the correlation between the new and old independent variables.

A technique called **encoding models** has been applied to fMRI studies for the last decade and overcomes these drawbacks. It is still not the dominant analysis technique, but it is rising in popularity. Encoding models were first developed for use in sensory neuroscience. They describe all of the **features** or properties of the sensory stimulus (usually visual) that a given neuron may respond to (by increasing its firing rate). Let's illustrate this with a fictional example:

A simple conventional experiment finds a region of the brain that is selective to faces by doing a `faces - places` contrast. A second conventional experiment might then find that a given face selective region (say FFA) responds less to smiling than it does to a person's eye color. The study may then conclude that the FFA is involved in determining properties of a person's identity (eye color) and not emotional expression. An encoding model of the same region might define several hundred properties of a face, such as whether the person is smiling or frowning, their eye color, or the presence of facial hair. This type of model would give you a deeper insight into what types of information are **represented** by the neurons in this brain region, as you would have a much larger number of features to investigate to better infer whether the FFA does indeed represent information about a person's identity and not emotional expression.

See [this paper](https://www.sciencedirect.com/science/article/pii/S1053811910010657) for more on encoding models and their use in fMRI studies.

Next we will introduce several concepts necessary for understanding encoding models.

## Feature Spaces

A **feature space** is an N-dimensional abstract space where each dimension represents a single feature, and a feature can represent anything. A great way to think about an abstract space is to think of a 2-D scatter plot. Generally we talk about the x and y axes in a scatter plot, and each point represents the values of that datum on both the x and y axis. To translate that to the language of feature spaces, each axis represents a single feature, and a point in the plot represents the **feature loading** of that datum onto the two features. 

Let's look at an example 2-D feature space that represents two features: the weight and height of humans. We could say that this 2-D feature space is a good description of the size of humans, and that any one feature would be an incomplete description of a person's size.

In [None]:
fake_height, fake_weight = create_fake_linear(1000, 1.1, 60, .3)

In [None]:
fig = plt.figure(figsize=(6,6))
plt.plot(fake_height, fake_weight, '.')
plt.xlabel('Height')
_ = plt.ylabel('Weight')

We call this a space because we actually think about it geometrically, with data points lying at a physical location within the space. That location is defined by the N feature loadings for each data point. This means that distances in this space tell you something about how close the data points are to each other. If two points are close to each other in this plot then the two people those points represent are similar in size.
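As a minimal sketch of what "distance in feature space" means, the code below computes the Euclidean distance between three made-up people in a 2-D height/weight space. The values and the `euclidean_distance` helper are hypothetical and only meant to illustrate the idea.

```python
import numpy as np

# Hypothetical people described by two features: [height, weight] (arbitrary units)
person_a = np.array([0.5, 60.6])
person_b = np.array([0.6, 60.7])
person_c = np.array([-1.5, 58.2])

def euclidean_distance(p, q):
    """Straight-line distance between two points in feature space."""
    return np.sqrt(np.sum((p - q) ** 2))

dist_ab = euclidean_distance(person_a, person_b)
dist_ac = euclidean_distance(person_a, person_c)

print("Distance between a and b: %.2f" % dist_ab)
print("Distance between a and c: %.2f" % dist_ac)
```

Person `a` is much closer to `b` than to `c`, so in this feature space `a` and `b` are the more similar pair. The same function works unchanged for feature vectors with any number of dimensions.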

We could add many more features to this feature space which capture information about a person's size, such as the size of the person's waist, head and arms, length of legs, etc. 

A well designed feature space is one that has the minimum number of features needed to fully describe the phenomenon (human size in this case). There's folklore that the length of a person's forearm is very similar to the length of their feet. If this were found to be true, then including both forearm and foot length in this example feature space would be redundant, and undesirable. 
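One simple way to spot redundant features is to look at the correlation between them. The sketch below uses fake data in which foot length is forearm length plus a little noise; that relationship is invented for the example (it is the folklore, not a measured fact), and all names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_people = 500

# Fake data: foot length tracks forearm length closely (a made-up relationship)
forearm_length = rng.normal(26, 2, n_people)
foot_length = forearm_length + rng.normal(0, 0.3, n_people)
# A third fake feature generated independently of the other two
waist_size = rng.normal(85, 10, n_people)

feature_matrix = np.stack((forearm_length, foot_length, waist_size), axis=1)

# Correlation between every pair of features (columns)
feature_corr = np.corrcoef(feature_matrix, rowvar=False)
print(np.round(feature_corr, 2))
```

A correlation near 1 between two features, as between the fake forearm and foot lengths here, suggests that one of them adds little information to the feature space.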

### Feature Spaces in fMRI
In an fMRI study using encoding models, each feature in the feature space is represented by an independent variable in the design matrix. Thus, every stimulus or task is assigned a value for every feature, resulting in a **feature vector**. That feature vector becomes the row in the design matrix where that stimulus or task was done during the experiment. 

Let's create several feature vectors within a feature space consisting of 5 face properties: 
* Eye color
* Nose size
* Smiling
* Frowning
* Facial hair present

In [None]:
features = ['Eye color', 'Nose size', 'Smiling', 'Frowning', 'Facial hair present']

In [None]:
stim1_feature_vector = [1, 3.2, 0, 1, 1]
stim2_feature_vector = [3, 2.7, 0, 0, 0]
stim3_feature_vector = [1, 3.4, 0, 0, 1]
stim4_feature_vector = [2, 2.1, 1, 0, 0]
stim5_feature_vector = [2, 3.1, 0, 1, 0]
stim6_feature_vector = [3, 2.9, 1, 0, 0]

Now let's assemble them into a design matrix

In [None]:
designmat_fake_face = np.stack((stim1_feature_vector,
                                stim2_feature_vector,
                                stim3_feature_vector,
                                stim4_feature_vector,
                                stim5_feature_vector,
                                stim6_feature_vector), axis=0)
designmat_fake_face.shape

We don't have a way to visualize a 5-dimensional space, but we can look at 2 features at a time to get an idea of what this data looks like. We can also look at the design matrix as an image.

In [None]:
fig = plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
plt.imshow(designmat_fake_face, aspect='auto', cmap='Reds')
plt.xticks(np.arange(len(features)), features, rotation=45)

plt.subplot(1,3,2)
plt.plot(designmat_fake_face[:,0], designmat_fake_face[:,1], '.', markersize=15)
plt.xlabel(features[0])
plt.ylabel(features[1])

plt.subplot(1,3,3)
plt.plot(designmat_fake_face[:,2], designmat_fake_face[:,3], '.', markersize=15)
plt.xlabel(features[2])
plt.ylabel(features[3])

#### Breakout Session

1\. What does each point in the above scatter plots represent?

2\. Why are there only 3 points (and not 6) in the 3rd plot of smiling vs. frowning?

## Naturalistic Stimuli

Most conventional studies use very controlled stimuli or tasks, which allows those studies to incorporate only a few independent variables encoding the cognitive process of interest. In contrast, encoding model studies use so-called **naturalistic stimuli**, meaning they capture the world more as it is. For example, a study of facial expressions could use pictures of actors posing in front of a white background, which would be very controlled stimuli. Naturalistic images of emotional faces would be images taken in the real world of people experiencing an authentic emotion. On the left are several examples of naturalistic face images, and on the right are several examples of controlled, posed images.

<table>
    <tr>
        <td><img src='figures/Natural_Face1.png' width=300/></td>
        <td><img src='figures/Posed_Face1.png' width=300/></td>
    </tr>
    <tr>
        <td><img src='figures/Natural_Face2.png' width=300/></td>
        <td><img src='figures/Posed_Face2.png' width=300/></td>
    </tr>
    <tr>
        <td><img src='figures/Natural_Face2.png' width=300/></td>
        <td><img src='figures/Posed_Face3.png' width=300/></td>
    </tr>
</table>

### Naturalistic vs. Controlled Stimuli

* One advantage of naturalistic images for encoding models is that they are not constrained (controlled) in any way and so they allow for many different feature spaces to be applied to them.

* A second advantage of naturalistic stimuli over controlled stimuli is that the findings from conventional studies are not as **ecologically valid** as those which use naturalistic stimuli. Ecological validity is a term that indicates how well the findings from a controlled study would apply to the uncontrolled "real world". That does not mean that all encoding model studies are better than conventional studies, or that conventional studies are useless. Ecological validity is just one of many considerations when designing an experiment.  

* A disadvantage of using naturalistic stimuli is that you may need more fMRI data than in a conventional study when you are using a large feature space.

## Encoding Models for fMRI Analysis

We've seen that feature spaces for fMRI data describe the features of the stimuli or task in the experiment. One way to think about the feature space in fMRI analyses is that each feature space is a hypothesis of the types of stimulus or task properties (features) that a brain region represents. Thus, to test a hypothesis about a brain region using encoding models, you would define a feature space and then convert each stimulus or task into a feature vector. This conversion, or transformation, from stimulus to feature space is highly **non-linear**, meaning that the feature vector for a given stimulus cannot be calculated using a linear function (y = ax + b) of its values (for images those values are pixel values). Once the stimuli or tasks are transformed into the feature space, the relationship between the feature space and the brain activity (BOLD signal) is assumed to be **linear**, and so we can use multiple regression to determine the **feature weights** for each feature on every voxel in the brain. The feature weights are simply yet another name for beta weights or coefficients. 

Here is a figure describing how encoding models use feature spaces to find mappings between stimuli and brain activity:

<img src='figures/Encoding_Models.png' width=800/>
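The two stages in the figure (a non-linear transformation into feature space, followed by a linear mapping to brain activity) can be sketched on fake data. Everything here is an assumption made for illustration: the stimuli are random "images", `to_feature_space` is a made-up transformation, the voxels are simulated, and the hemodynamic response is ignored. Ordinary least squares via `np.linalg.lstsq` stands in for the multiple regression step.

```python
import numpy as np

rng = np.random.default_rng(2)
n_stimuli, n_pixels, n_voxels = 200, 64, 3

# Fake "images": each row holds the pixel values of one stimulus
stimuli = rng.normal(size=(n_stimuli, n_pixels))

def to_feature_space(images):
    """Hypothetical non-linear transformation from pixels to 2 features."""
    contrast_energy = (images ** 2).mean(axis=1)   # squaring is non-linear
    fraction_bright = (images > 1.0).mean(axis=1)  # thresholding is non-linear
    return np.stack((contrast_energy, fraction_bright), axis=1)

# Stage 1: non-linear transformation from stimulus to feature space
stim_features = to_feature_space(stimuli)
stim_features = (stim_features - stim_features.mean(0)) / stim_features.std(0)

# Simulated voxels whose responses are a linear function of the features plus noise
true_weights = rng.normal(size=(stim_features.shape[1], n_voxels))
fake_bold = stim_features @ true_weights + 0.1 * rng.normal(size=(n_stimuli, n_voxels))

# Stage 2: linear regression from feature space to each voxel's response
design = np.column_stack((stim_features, np.ones(n_stimuli)))  # add an intercept column
solution, *_ = np.linalg.lstsq(design, fake_bold, rcond=None)
feature_weights = solution[:-1]  # drop the intercept row

print("True weights:\n", np.round(true_weights, 2))
print("Estimated feature weights:\n", np.round(feature_weights, 2))
```

Because the simulated responses really are linear in the features, the estimated feature weights come out close to the true ones. With real data, how well a feature space predicts held-out responses is what tells you whether the hypothesis behind it is any good.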


### How Encoding Models Overcome the Drawbacks of the Conventional Approach: Multiple Post-hoc Feature Spaces.

Now that we have the concept of encoding models and feature spaces, let's see how using encoding models overcomes the drawbacks of the conventional approach mentioned above. The main way encoding models handle these constraints is that any feature space the experimenter can conceive of can be applied to the stimuli in a post-hoc manner.

1. The contrasts are generally simple and rely on the experimental design to ensure all unwanted cognitive processes or other confounding variables are controlled for, preventing new confounds from being controlled for post-hoc.

    * Since experiments using encoding models use naturalistic stimuli, the stimuli are not controlled at all. This means the feature space used by the encoding model is responsible for controlling all the confounding variables and unwanted cognitive processes. Since any feature space can be applied to the BOLD data post-hoc, confounding variables can be accounted for using a new, expanded feature space.
    
2. The experimental design generally only allows for a single hypothesis to be tested.

    * Feature spaces represent hypotheses of the types of information represented in a given brain region. Since any number of feature spaces may be applied to the same BOLD data post-hoc, any number of hypotheses may be tested.
    
3. For many brain regions, much is already known about the types of stimulus or task properties that any given brain region responds to. The conventional approach does not allow for all of this previous information to be controlled for in the analysis. This becomes a problem when there is correlation between the independent variables of a new study and independent variables of previous studies. What may look like new findings could simply be due to the correlation between the new and old independent variables.

    * Since the conventional approach only has a few independent variables, it is difficult to incorporate previous variables that are known to correlate with a given brain region. Feature spaces can be quite large, for reasons we'll explore later in this lecture. As such, feature spaces can include all the old variables of interest, plus new variables you are studying. And when new findings surface, you can even re-analyze old data with a new, larger feature space that incorporates those new findings.

# Assessing Model Performance: Predicting BOLD data

Encoding models use feature spaces that are very large (i.e., they have many independent variables). Using large feature spaces addresses some of the drawbacks of the conventional approach, but introduces some new problems. Namely, the more independent variables in a multiple regression model, the more the model will **overfit** to the data and give you misleading results. Model **prediction**, meaning testing how well a fitted model predicts data it was not fit on, is a technique that helps to detect and limit the effects of overfitting. For this reason, encoding models use **prediction accuracy** to evaluate the performance of the feature spaces used in the encoding models. Let's see more...

## Overfitting

Suppose we have a model with one or more unknown parameters, and a data set to which the model can be fit (the training data set). The fitting process optimizes the model parameters to make the model fit the training data as well as possible. If we then take an independent sample of testing data from the same population as the training data, it will generally turn out that the model does not fit the testing data as well as it fits the training data. This is called overfitting, and is particularly likely to happen when the size of the training data set is small, or when the number of parameters in the model is large.

Define the constants for the fake data.

In [None]:
fake_n = 100
fake_slope = 1.2
fake_intercept = 3
fake_noise_sd = 0.4

Generate some fake data using the values above

In [None]:
fake_x, fake_y = create_fake_linear(fake_n, fake_slope, fake_intercept, fake_noise_sd)

Let's remind ourselves what the sum of squared errors (SSE) is by defining a function that calculates SSE.

In [None]:
def SSE(x,y):
    if x.ndim == 1:
        x = x.reshape(-1,1)
    if y.ndim == 1:
        y = y.reshape(-1,1)
    model = LinearRegression()
    model.fit(x,y)
    y_hat = model.predict(x)
    return np.sum((y-y_hat)**2)

Now let's calculate SSE for the fake data that we just created.

In [None]:
SSE(fake_x, fake_y)

Now let's create another vector of random variables that is the same length as the fake data. We'll use this new vector as the second independent variable in a multiple regression on `fake_y` and look at the SSE for this model. 

Do we expect it to go up, down or stay the same?

In [None]:
random_1 = np.random.rand(fake_n)
fake_x2 = np.stack((fake_x, random_1), axis=1)
SSE(fake_x2, fake_y)

We see a modest decrease in the SSE, which we might expect since the second independent variable we just added is random. 

What happens if we add 10 new variables though?

In [None]:
random_multi10 = np.random.rand(fake_n, 10)
fake_x3 = np.concatenate((fake_x.reshape(-1,1), random_multi10), axis=1)
SSE(fake_x3, fake_y)

Ok, now we're getting somewhere. Now let's add enough random independent variables that the total number of independent variables equals the number of observations (or samples) in the data. What do we expect to see for the SSE here?

In [None]:
random_multi_N = np.random.rand(fake_n, fake_n-1)
fake_x4 = np.concatenate((fake_x.reshape(-1,1), random_multi_N), axis=1)
SSE(fake_x4, fake_y)

We see a value that is very close to `0`! Does this mean that these random independent variables are meaningfully related to the `fake_y` dependent variable?

Of course the answer is no, and overfitting explains why this SSE is almost `0`. All of these random independent variables are explaining some part of the noise in the data. Since we're interested in explaining the signal, and not the noise, of our data, this metric is not very useful when we have a lot of independent variables relative to the number of observations. 
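To make the point concrete, the sketch below fits a `y` that is pure noise (there is no signal at all) using more and more random regressors. It uses `np.linalg.lstsq` with an explicit intercept column rather than `LinearRegression`, and the names (`sse_of_fit`, `pure_noise_y`, and so on) are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs = 50

# The dependent variable is pure noise: there is nothing real to explain
pure_noise_y = rng.normal(size=n_obs)
random_x_all = rng.normal(size=(n_obs, n_obs - 1))

def sse_of_fit(x, y):
    """Sum of squared errors of a least-squares fit of y on x plus an intercept."""
    design = np.column_stack((x, np.ones(len(y))))
    fit_weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ fit_weights) ** 2)

sse_values = []
for n_vars in [1, 10, 25, 49]:
    sse_values.append(sse_of_fit(random_x_all[:, :n_vars], pure_noise_y))
    print("%2d random variables: SSE = %.4f" % (n_vars, sse_values[-1]))
```

With 49 random variables plus an intercept there are as many free parameters as observations, so the model can reproduce the noise exactly and the SSE drops to essentially 0, even though none of the variables has anything to do with `y`.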

In [None]:
# all_SSE[k] holds the SSE of the model with k+1 independent variables
all_SSE = np.zeros(fake_n)
all_SSE[0] = SSE(fake_x, fake_y)
random_multi_i = np.random.rand(fake_n, fake_n-1)
for i in np.arange(fake_n-1):
    # fake_x plus the first i+1 random variables
    fake_x_i = np.concatenate((fake_x.reshape(-1,1), random_multi_i[:,:i+1]), axis=1)
    all_SSE[i+1] = SSE(fake_x_i, fake_y)

plt.plot(np.arange(1, fake_n+1), all_SSE)
plt.xlabel('# of independent variables')
_ = plt.ylabel("SSE")

#### Breakout Session

1\. Can the SSE increase when adding new random variables? Why or why not?

## Cross-validation

To reduce the effects of **overfitting** we can use a technique from statistics and machine learning called **cross-validation**. There are several different types of **cross-validation**, but all of them involve fitting a model to one set of data, and then predicting a different set of data to see how well the model generalizes to new data. The simplest form of cross-validation is simply using a **held-out** data set, often called a **test set** or **validation set**, to quantify your model fit. Here are a couple of definitions:

- **Train dataset**: This is the part of the dataset you use to estimate your model. It should generally be bigger than the test dataset; about 70% of your data is a good rule of thumb.

- **Test dataset**: This is the part of your dataset that you use to quantify how well your model generalizes. This dataset should remain untouched until the very end of your analysis, where you only use it to report your results. You should never go back to your analysis and change any parameters based on the performance of your model on the test set.
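As a rough sketch of the 70% rule of thumb, here is one way to pick non-overlapping train and test indices at random. The `example_` names and the sample count are made up for this illustration, and this is only one of several reasonable ways to split a dataset.

```python
import numpy as np

rng = np.random.default_rng(4)

n_samples = 1500
train_fraction = 0.7

# Shuffle the sample indices, then cut them into two non-overlapping groups
shuffled_indices = rng.permutation(n_samples)
n_train = int(round(train_fraction * n_samples))
example_train_indices = shuffled_indices[:n_train]
example_test_indices = shuffled_indices[n_train:]

print("Training samples:", len(example_train_indices))
print("Test samples:", len(example_test_indices))
```

Every sample ends up in exactly one of the two sets, so the model never sees the test samples while it is being fit.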

We'll do cross-validation on the fake data we just created. The first step is to create a training and a test set from the fake data set. Let's do that here, but first we'll create a larger fake data set that has the same slope, intercept and noise as the previous set.

In [None]:
fake_n = 1500
fake_x_big, fake_y_big = create_fake_linear(fake_n, fake_slope, fake_intercept, fake_noise_sd)

Split the data into 2/3 training and 1/3 test

In [None]:
train_indices = range(1000)
test_indices = range(1000, 1500)

Subset the data to get the training data, then reshape it

In [None]:
fake_x_train = fake_x_big[train_indices].reshape(-1,1)
fake_y_train = fake_y_big[train_indices].reshape(-1,1)

Subset the data to get the testing data, then reshape it

In [None]:
fake_x_test = fake_x_big[test_indices].reshape(-1,1)
fake_y_test = fake_y_big[test_indices].reshape(-1,1)

## Model Performance: Correlating Predicted and Real Data

In order to reduce overfitting we use cross-validation. But cross-validation requires that we calculate some number, a statistic, that quantifies how well the model fits. There are many values that can be used. Here we will use the correlation between the predicted BOLD data and the actual BOLD data. This correlation indicates how similarly the two BOLD data signals change together. It is nice because it is easily interpreted, as the correlation is always between -1 and 1.

Let's calculate the correlation between the predicted BOLD data and the real BOLD data. This correlation we'll call **prediction accuracy**, and will be the metric we use to determine how well our encoding model fits the data.

Prediction accuracy can be calculated on the training data used to train the model, or on a held out test set. When the prediction accuracy is calculated on the training data itself, this is not cross-validation. It is still useful to calculate the prediction accuracy on both the training and test data, because the difference between the two indicates the extent to which your model is overfitting to the training data.
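Before applying it, it may help to see what the correlation does and does not measure. The sketch below builds a made-up "actual" signal and a noisy "prediction" of it, computes the correlation as the mean product of z-scores (the same idea as the `correlate` helper defined earlier), and checks it against `np.corrcoef`. The signals and the `pearson_r` name are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up "actual" signal and an imperfect "prediction" of it
actual = rng.normal(size=200)
predicted = 0.8 * actual + 0.5 * rng.normal(size=200)

def pearson_r(x, y):
    """Correlation computed as the mean product of z-scores."""
    x_z = (x - x.mean()) / x.std()
    y_z = (y - y.mean()) / y.std()
    return np.mean(x_z * y_z)

r_manual = pearson_r(predicted, actual)
r_numpy = np.corrcoef(predicted, actual)[0, 1]

# Rescaling and shifting the prediction leaves the correlation unchanged
r_rescaled = pearson_r(10 * predicted + 3, actual)

print("z-score product: %.4f" % r_manual)
print("np.corrcoef:     %.4f" % r_numpy)
print("after rescaling: %.4f" % r_rescaled)
```

All three values agree. One consequence worth keeping in mind is that correlation only measures whether the predicted and actual signals go up and down together, so a prediction with the wrong scale or offset can still have a high prediction accuracy.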

Fit the model on the fake training data

In [None]:
model_fake = LinearRegression()
model_fake.fit(fake_x_train, fake_y_train)
print('Original Data')
print('The estimated weight is: %.04f' % (model_fake.coef_[0][0]))

Calculate model performance by correlating the predicted and real training data

In [None]:
pred_fake_train = model_fake.predict(fake_x_train)
corr_fake_train = correlate(pred_fake_train, fake_y_train)[0]
print('Correlation with training data: %.04f' % (corr_fake_train))

Calculate model performance by correlating the predicted and real testing data

In [None]:
pred_fake_test = model_fake.predict(fake_x_test)
corr_fake_test = correlate(pred_fake_test, fake_y_test)[0]
print('Correlation with testing data: %.04f' % (corr_fake_test))

#### Breakout Session
1\. Does it seem this model is overfitting to the training data set? By how much? To figure that out, simply subtract the test prediction accuracy from the training prediction accuracy.

### Effects of an Outlier on Prediction Accuracy

When there's an outlier in either the training or the test data set, the prediction accuracy values can change drastically. Let's add an outlier and see how it changes the prediction accuracy of the test and training data.

To determine what value will result in an outlier, let's find the difference between the largest value in the fake x training data and the mean of that data. That is the largest deviation. Then we'll multiply that deviation by 20 and add it back to the mean. This will be our outlier.

In [None]:
deviation_fake = max(fake_x_train) - fake_x_train.mean()
outlier_value = deviation_fake*20 + fake_x_train.mean()

Now we'll simply replace the last value of the y-data with the outlier we just calculated for both the fake train and test data. 

In [None]:
fake_y_train_outlier = np.append(fake_y_train[:-1], outlier_value).reshape(-1,1)
fake_y_test_outlier = np.append(fake_y_test[:-1], outlier_value).reshape(-1,1)

Now let's re-calculate the prediction accuracy of the test and training data using the fake y data with outliers.

Fit the model on the fake training data that has an outlier

In [None]:
model_fake_outlier = LinearRegression()
model_fake_outlier.fit(fake_x_train, fake_y_train_outlier)
print('\nOutlier Training Data')
print('The estimated weight is: %.04f' % (model_fake_outlier.coef_[0][0]))

Calculate model performance by correlating the predicted and real training data with an outlier 

In [None]:
pred_fake_train_outlier = model_fake_outlier.predict(fake_x_train)
corr_fake_train_outlier = correlate(pred_fake_train_outlier, fake_y_train_outlier)[0]
print('Correlation with training data: %.04f' % (corr_fake_train_outlier))

Calculate model performance by correlating the predicted and real testing data with an outlier

In [None]:
pred_fake_test_outlier = model_fake_outlier.predict(fake_x_test)
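# correlate the outlier model's predictions with the test data that contains the outlier
corr_fake_test_outlier = correlate(pred_fake_test_outlier, fake_y_test_outlier)[0]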
print('Correlation with testing data: %.04f' % (corr_fake_test_outlier))
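
The sensitivity of correlation to a single extreme value can also be seen in isolation, without fitting any model. The sketch below uses made-up data (the slope and noise level are assumptions) and builds the outlier with the same "20 times the largest deviation" recipe, here applied to the y values.

```python
import numpy as np

rng = np.random.default_rng(2)
x_demo = rng.standard_normal(500)
y_demo = 2.0 * x_demo + rng.standard_normal(500) * 0.5   # illustrative slope and noise

# Correlation on the clean data
r_clean = np.corrcoef(x_demo, y_demo)[0, 1]

# Replace only the last y value with an extreme value
y_outlier = y_demo.copy()
y_outlier[-1] = 20 * (y_demo.max() - y_demo.mean()) + y_demo.mean()
r_outlier = np.corrcoef(x_demo, y_outlier)[0, 1]

print('clean: %.04f, with one outlier: %.04f' % (r_clean, r_outlier))
```

Only one of the 500 values changed, yet it inflates the variance of y enough to pull the correlation down substantially.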

#### Breakout Session

But be careful: the test correlation is not always a trustworthy measure of how good the model is! It depends on which data set contains the outlier, as can be seen here.

1. Calculate the prediction accuracy of the original model (no outlier) on the fake y test data that has the outlier. Print both correlations: the one for the test data with no outlier, and the one you just calculated for the test data with an outlier.

## Prediction Accuracy of Real fMRI Data

Now let's use **cross-validation** to calculate model performance on a **held-out** dataset for some real fMRI data. We'll do this on the same visual localizer data we've been using this semester.

### Split the Data into Training and Test Sets

Let's do the first step of cross-validation with a held-out set by splitting the data into training and test sets. With fMRI data this is most easily done by splitting across runs: because the HRF spreads the response to each stimulus over several TRs, neighboring TRs within a run are not independent, so a split inside a run would leak information between the training and test sets. We'll use the first 2 runs of data as the training set, and the third run as the test data.
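
To make the point about the HRF concrete, the sketch below convolves a single brief event with a toy gamma-shaped kernel and lists the TRs whose response is noticeably above zero. The kernel is an assumption chosen for simplicity; it is not the Glover HRF used by `convolve_designmat`, although it uses the same 2-second TR and 32-second window.

```python
import numpy as np

tr = 2.0                             # seconds per TR
t = np.arange(0, 32, tr)             # 32 s window sampled once per TR
toy_hrf = t ** 5 * np.exp(-t)        # toy gamma-shaped kernel, NOT the Glover HRF
toy_hrf /= toy_hrf.sum()             # normalize so the kernel sums to 1

stimulus = np.zeros(40)
stimulus[10] = 1                     # one brief event at TR 10

# Same convolve-and-truncate pattern as convolve_designmat
response = np.convolve(stimulus, toy_hrf)[:len(stimulus)]

# TRs where the response exceeds 1% of its peak
affected_trs = np.flatnonzero(response > 0.01 * response.max())
print(affected_trs)
```

A single event leaves a trace in several consecutive TRs, which is why adjacent time points should stay together on the same side of the train/test split.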

In [None]:
data_train = np.concatenate((data01, data02), axis = 0)
data_test = data03

designmat_stimulus_train = designmat_stimulus[:len(data_train),:]
designmat_response_train = convolve_designmat(designmat_stimulus_train)

designmat_stimulus_test = designmat_stimulus[-len(data_test):,:]
designmat_response_test = convolve_designmat(designmat_stimulus_test)

### Fit a Model to the Training Data

Now fit the same multiple linear regression model that we've been fitting over the last several lectures. The only difference is that there is less data (2 runs instead of 3).

In [None]:
model_train = LinearRegression()
_ = model_train.fit(designmat_response_train, data_train)

### Predict the test data

Now we'll predict both the training and test data using the two design matrices we created above.

In [None]:
pred_train = model_train.predict(designmat_response_train)
pred_test = model_train.predict(designmat_response_test)

### Calculate Prediction Accuracy

Now calculate prediction accuracy by correlating the predicted BOLD data with the real BOLD data. We'll do this for both the training and test data so we can see how much we've overfit to the noise.

In [None]:
pred_accuracy_train = correlate(pred_train, data_train)
pred_accuracy_test = correlate(pred_test, data_test)
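
To see what "overfitting to the noise" looks like when there is no signal at all, the sketch below fits ordinary least squares to pure-noise "voxels" and compares training and test correlations. Everything here is synthetic and assumed for illustration (array sizes, random features, and the `add_intercept` and `columnwise_corr` helpers); it uses `np.linalg.lstsq` rather than `LinearRegression` to stay self-contained.

```python
import numpy as np

rng = np.random.default_rng(3)
n_train, n_test, n_features, n_voxels = 100, 50, 5, 200

def add_intercept(X):
    # append a column of ones so the fit includes an intercept
    return np.column_stack([X, np.ones(len(X))])

def columnwise_corr(a, b):
    # Pearson correlation between matching columns
    a_z = (a - a.mean(0)) / a.std(0)
    b_z = (b - b.mean(0)) / b.std(0)
    return (a_z * b_z).mean(0)

X_train = add_intercept(rng.standard_normal((n_train, n_features)))
X_test = add_intercept(rng.standard_normal((n_test, n_features)))
Y_train = rng.standard_normal((n_train, n_voxels))   # pure noise: nothing to learn
Y_test = rng.standard_normal((n_test, n_voxels))

# Least-squares weights for all voxels at once
weights, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)

acc_train = columnwise_corr(X_train @ weights, Y_train)
acc_test = columnwise_corr(X_test @ weights, Y_test)
print('mean train r: %.3f, mean test r: %.3f' % (acc_train.mean(), acc_test.mean()))
```

Even though the features carry no information about these voxels, the training correlations come out positive, while the test correlations scatter around zero. The gap between the two is the overfitting.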

### Plot flatmaps of model performance

Finally, let's plot three different flatmaps. The first will show the prediction accuracy for the training data.

In [None]:
_ = cortex.quickshow(cortex.Volume(pred_accuracy_train, 's01', 'catloc'))

The second will show prediction accuracy for the testing data.

In [None]:
_ = cortex.quickshow(cortex.Volume(pred_accuracy_test, 's01', 'catloc'))

The third will show the extent of overfitting across cortex by plotting the difference between the training and test prediction accuracies.

In [None]:
_ = cortex.quickshow(cortex.Volume(pred_accuracy_train-pred_accuracy_test, 's01', 'catloc'))