# Overview
In the last lecture of the class, we will discuss how to apply the modeling techniques we have studied to more complex stimuli than those that we have used so far. 

# Goals
We will model responses to a new data set from an experiment with a less structured design. This experiment has no pre-specified *conditions*; it shows a number of words, paired with pictures. Our goal will be to formally specify what *about* the words elicits responses in the brain. We will:

* Generate hypotheses about what features or aspects of the words are related to brain responses
* Fit a model based on a hypothesis to the data from the new experiment
* Use that model to predict responses to novel stimuli

This approach is called the *encoding model* approach. 

- Neuroscience concepts
    - Using feature spaces to represent the properties of complex stimuli
    - Modeling brain responses as a function of stimulus features
- Coding concepts
    - Implementation of cross validation
- Data science concepts
    - Predicting held out data
    - Testing and training sets
    - Testing model performance (using correlation)

> Blackboard: Recap of "What do experiments do?"

In [None]:
# First: update neurods (new functions for computing OLS)
import neurods
_, _, version = [int(x) for x in neurods.__version__.split('.')]
if version <= 1:
    neurods.io.update_neurods()
# Restart your kernel!

In [None]:
# Imports
import neurods
import numpy as np
import os
import matplotlib.pyplot as plt
import nibabel
import cortex
# Configure defaults for plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.aspect'] = 'auto'
plt.rcParams['image.cmap'] = 'viridis'
%matplotlib inline
#%config InlineBackend.figure_format = 'retina' # optional
from scipy.stats import zscore

# Complex Stimuli

The approach we have studied so far allows us to estimate responses to different categories of stimuli that have been shown in an experiment. We can compare responses between different conditions to test whether the brain reliably responds more to one condition compared to another, and we can predict the response of each voxel to new examples of the specific categories of stimuli that appear in our experiment. 

... But what if we are interested in more than responses to faces, bodies, places, and objects? What if we are interested in responses to more complex stimuli? For example, how might the brain respond when we speak, read, or listen to a variety of words? 

To explore this question, we will use freely available data from an influential paper (Mitchell et al. 2008, *Science*): https://www.cs.cmu.edu/afs/cs/project/theo-73/www/science2008/data.html

The experiment consists of subjects looking at a variety of words and accompanying line drawings of those words, as shown below.

In [None]:
from IPython.display import Image
from IPython.display import HTML  # IPython.core.display is deprecated for these imports
Image(url="http://www.cs.cmu.edu/~lwehbe/files/science.png")

In this experiment, a stimulus was presented every 10 seconds. Each stimulus was repeated 6 times. To generate the data we have downloaded, for each repeat of each word/picture, the activity between 4 and 8 seconds after the onset of the word/picture was averaged, resulting in one brain image (an event-related average) for each word/picture. 

This means that the hemodynamic response is *already accounted for*, because we are not dealing with raw TRs any more for this data set! Thus, no convolution of our design matrix will be necessary.
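To make the averaging step concrete, here is a minimal sketch of how an event-related average could be computed from a raw time series. It is not the preprocessing actually used for the downloaded data: the 1-second TR, the toy array sizes, the inclusive 4 to 8 second window, the averaging across repeats, and the helper name `event_related_average` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw time series: 1-second TRs, 3 stimuli x 2 repeats,
# with one stimulus onset every 10 seconds.
tr = 1.0
n_voxels = 5
onsets = np.arange(6) * 10.0              # onset time (s) of each presentation
stim_id = np.array([0, 1, 2, 0, 1, 2])    # which stimulus was shown at each onset
raw = rng.standard_normal((60, n_voxels)) # (time x voxels)

def event_related_average(raw, onsets, stim_id, tr, start=4.0, stop=8.0):
    """Average TRs from `start` to `stop` s after each onset, then over repeats."""
    times = np.arange(raw.shape[0]) * tr
    n_stim = stim_id.max() + 1
    out = np.zeros((n_stim, raw.shape[1]))
    for s in range(n_stim):
        # One windowed average per repeat of stimulus s
        windows = [raw[(times >= t0 + start) & (times <= t0 + stop)].mean(0)
                   for t0 in onsets[stim_id == s]]
        out[s] = np.mean(windows, axis=0)
    return out

era = event_related_average(raw, onsets, stim_id, tr)
print(era.shape)  # one averaged image (row) per stimulus
```

Because each row of the result already summarizes the delayed response to one stimulus, the rows can be modeled directly, with one row of the design matrix per stimulus.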

### Load data (Y)

In [None]:
# Experiment directory
basedir = os.path.join(neurods.io.data_list['fmri'], 'word_picture')

# Load the mask
mask_file = os.path.join(basedir,'s03_mask.nii')
mask = neurods.io.load_fmri_data(mask_file, do_zscore=False).astype(bool)  # np.bool was removed from recent NumPy

# Load the fMRI data
data_file = os.path.join(basedir,'s03.nii.gz')
data = neurods.io.load_fmri_data(data_file, do_zscore=True, mask=mask)
print(data.shape)

### Load description of data (X)

In [None]:
# Here we load a variable that contains information about the stimulus, 
# including the 60 words that comprise our stimuli
feature_data = np.load(os.path.join(basedir, 'features.npz'))
words = feature_data['words']

print("Here are the stimulus words: \n")
print(words)

In [None]:
word_num = 1 # change the word number

sample_image = np.zeros(mask.shape)
sample_image[mask==True] = data[word_num]
h = cortex.mosaic(sample_image, vmin=-3, vmax=3)
plt.colorbar()
plt.title(words[word_num], size=30);

### Display data, with description

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
ax.imshow(data)
y_idx = np.arange(0, 60, 10)
plt.yticks(y_idx, [words[yi] for yi in y_idx], fontsize=12)
plt.grid(axis='y', color='w', lw=2)
plt.ylabel('Word/Image', fontsize=14);
plt.xlabel('Voxel', fontsize=14);

> Blackboard: Differences in the data in this experiment vs. previous experiments

If we treat each separate word as a condition, we have too many conditions! How can we *summarize* the data in this experiment, aside from contrasting one condition with another?
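One way to see why a condition per word is too many: with 60 images and 60 conditions, the design matrix is the identity, and a regression simply memorizes the data. The sketch below uses random numbers as a stand-in for the real (words x voxels) array and plain NumPy least squares rather than the course functions; both choices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_voxels = 60, 4

# One condition per word: the design matrix is the identity (60 regressors, 60 images)
X = np.eye(n_words)
Y = rng.standard_normal((n_words, n_voxels))  # stand-in for the real data

# Ordinary least squares; with an identity design the weights are the data themselves
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(B, Y))      # the "model" just stores each image
print(np.allclose(X @ B, Y))  # perfect fit, even though Y is pure noise
```

A model like this fits perfectly but says nothing about a word it has not seen, which is the motivation for grouping words by shared properties.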

> Breakout session 1:
Working together with the whole class on the Google Doc, put the words into groups! Give each group a title (e.g. "Furniture"), and make a Python list of strings for each group, like this:

    # Words that start with S (this is not a good reason to group words, it's just an example)
    group_1 = ['saw', 'screwdriver', 'shirt', 'skirt', 'spoon']
    
> Try to use *most* of the words in at least one group. These groups should have something in common that you think might affect responses in the brain!

In [None]:
# Fill in your groups!
group_1 = []
group_2 = []
# add as many groups as you want, up to ~5
groups = [group_1, group_2]  # append any additional groups to this list

In [None]:
### STUDENT ANSWER


In [None]:
# This function will convert your groups into an array that we can use to model the brain data from this experiment!
def group_to_design(*groups, words=words):
    """Convert lists of words in different groups into a design matrix
    
    Parameters
    ----------
    groups : lists
        Any number of lists of strings, passed as positional arguments. Each element
        of each group must be a word in the `words` list. 
    words : list
        The list of words to be divided into groups. 
        
    Returns
    -------
    design : array
        2D array of (time [images] x group label [features])
    """
    design = np.zeros((len(words), len(groups)))
    for igrp, grp in enumerate(groups):
        for grp_word in grp: 
            if grp_word not in words:
                raise ValueError('{} is not in list!'.format(grp_word))
        design[:, igrp] = np.array([w in grp for w in words])
    return design

In [None]:
# ... and it can also be used to see what words you haven't yet categorized:
design = group_to_design(*groups)
print(words[~design.any(1)])

In [None]:
# Show feature assignments to each word
_ = plt.imshow(design)
plt.xticks(range(design.shape[1]))
plt.xlabel("Feature")
plt.ylabel("Word/Image");

# Conditions/Categories vs Features

We have created a category label model that *summarizes* over the different conditions (different words/images) in our experiment. This is OK, but it still may be hard to predict how the brain would respond to words that aren't in our original data set. (How would the brain respond to "goat"? To "pen"? To "book"?) Also, not every word/image fits equally well into its category. We can do better.

We can make use of the fact that new words have some features in common with the set of words in this experiment. What if we could learn the responses to specific *features* or *properties* of words (e.g. whether or not they are animate, whether or not they are edible etc.)? Then we could predict the activity of a new word as a combination of the activities associated with its features. For example, we can learn how the brain responds to objects that are manmade, inanimate, made of wood and that are used as tools. The degree to which a noun is manmade, inanimate, or made of wood can be considered *features* of that noun. We can estimate the brain response to any word (e.g. "goat", "pen", or "book") as a weighted combination of feature values. We will use multivariate regression to estimate the weights for each feature for each voxel in the brain.
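The idea of predicting a new word from its feature values can be sketched in a few lines. Everything here is simulated: the feature ratings, the "true" weights, and the ratings of the new word are made up, and `np.linalg.lstsq` stands in for whichever regression function the course provides.

```python
import numpy as np

rng = np.random.default_rng(1)
n_words, n_features, n_voxels = 48, 3, 5

# Hypothetical feature ratings (words x features) and true weights (features x voxels)
X = rng.standard_normal((n_words, n_features))
B_true = rng.standard_normal((n_features, n_voxels))
Y = X @ B_true + 0.1 * rng.standard_normal((n_words, n_voxels))

# Estimate one weight per feature per voxel with ordinary least squares
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# A new word is described only by its feature values (made-up ratings) ...
x_new = np.array([[1.0, -0.5, 2.0]])  # (1 x features)
# ... and its predicted response is a weighted combination of those values
y_new_pred = x_new @ B_hat            # (1 x voxels)
print(B_hat.shape, y_new_pred.shape)
```

Note that the new word never appeared in the training data; only its feature values are needed to generate a prediction for every voxel.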

First, we need an annotation of the features or properties (e.g. degree of animacy, etc.) of these words. From looking at the list of words, it's clear that there are many properties that different sets of words share. (We used some of these properties to define our groups.) 

A better way to do this is to use labels for the degree to which each word has a certain property. We have access to a set of 218 questions for which every word has been labeled by multiple users on Amazon Mechanical Turk (Sudre et al., Neuroimage, 2012). These questions were designed to capture the semantic properties of these objects. Additionally, 11 features describing the visual properties of the line drawings are also provided.

The features are rated on a scale of 1-5, with
 - 1 meaning the word definitely does not have a given property (e.g. not animate), and 
 - 5 meaning the word definitely has that property (e.g. animate).

In [None]:
feature_names = feature_data['feature_names']
features = feature_data['features']
print("We have {} features that describe the stimulus.\n".format(len(feature_names)))
print("Each word is described by ratings on each of these features\n")

word_index = 0 # Play with this number
n_features = 20
print("First {} features for {}:\n".format(n_features, words[word_index]))
for feature_index in range(n_features):
    print(feature_names[feature_index], features[word_index, feature_index])

Thus, to describe the whole stimulus, we have an array that is much more complicated than our previous experimental design matrix.

In [None]:
# Label the axes on this plot!
plt.imshow(feature_data['features'])
print(feature_data['features'].shape)
### STUDENT ANSWER


For simplicity, we will only use the visual features for now.

In [None]:
features = feature_data['features']
feature_names = feature_data['feature_names']
vis_features = features[:, 218:]
vis_feature_names = feature_names[218:]

In [None]:
# (Same labels)
_ = plt.imshow(vis_features)
print(vis_features.shape)
### STUDENT ANSWER


In [None]:
# There are several new functions in neurods.stats that we have used the last few weeks.
from neurods import stats as nds

## BUILDING A PREDICTIVE MODEL

### IT IS VERY IMPORTANT NOT TO USE TEST DATA IN TRAINING!!

To test whether a model has learned a general relationship between stimulus features and brain responses, we need to test it on data that was not included in the model training data set. 

Imagine you have a small dataset with voxel responses to features, and some of the voxels have some noise that is correlated to one of the features. The probability of such an event becomes smaller as the dataset size increases, but at low sample sizes there is a good chance of finding spurious correlations. 

Such a correlation would allow you to build a model that predicts brain activity from the features, but only in that dataset, since the noise is independent of the features and will not repeat in the same way in other datasets. However, for the voxels that show a real and strong enough response to the features, you will be able to learn a model that predicts brain activity from the features, and that model should generalize to new data.

This is why we always test a model on held out data that was not used in training. This allows us to judge whether the model is really predicting brain activity and not just fitted to noise in the sample.
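A small simulation can illustrate the difference between fitting noise and learning a real relationship. In this sketch, half of the simulated voxels respond to the features and half are pure noise. The sample sizes, the noise level, and the helper `column_corr` are assumptions for illustration, and NumPy least squares is used in place of the course functions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 20, 12, 8, 2000
half = n_voxels // 2

X_trn = rng.standard_normal((n_train, n_features))
X_tst = rng.standard_normal((n_test, n_features))

# First half of the voxels respond to the features; second half are pure noise
B_true = rng.standard_normal((n_features, n_voxels))
B_true[:, half:] = 0
Y_trn = X_trn @ B_true + rng.standard_normal((n_train, n_voxels))
Y_tst = X_tst @ B_true + rng.standard_normal((n_test, n_voxels))

def column_corr(a, b):
    """Pearson correlation between matching columns of a and b."""
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    return (a * b).mean(0)

# Fit on the training set only, then evaluate on both sets
B_hat, *_ = np.linalg.lstsq(X_trn, Y_trn, rcond=None)
r_train = column_corr(X_trn @ B_hat, Y_trn)
r_test = column_corr(X_tst @ B_hat, Y_tst)

print("noise voxels,  training r: {:.2f}".format(r_train[half:].mean()))
print("noise voxels,  held-out r: {:.2f}".format(r_test[half:].mean()))
print("signal voxels, held-out r: {:.2f}".format(r_test[:half].mean()))
```

On the training data even the noise voxels look well predicted; only the held-out correlations separate voxels with a real response from voxels where the model fit noise.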

Here we split the words into a training set and a test set for you:

In [None]:
test_index = [0, 1, 2, 3, 4, 6, 7, 8, 10, 13, 20, 23]
train_index = sorted(set(range(60)) - set(test_index))  # sorted: set ordering is not guaranteed

train_x = zscore(design[train_index, :])
train_y = zscore(data[train_index, :])
print("Shape of training features: {0}\nShape of training fMRI data: {1}".format(train_x.shape, train_y.shape))

test_x = zscore(design[test_index, :])
test_y = zscore(data[test_index, :])
print("Shape of testing features: {0}\nShape of testing fMRI data: {1}".format(test_x.shape, test_y.shape))

> Survey 1: pycortex review

### Weight estimation and data prediction

We want to learn a function that predicts the activity for any word in terms of its features. 


> Breakout session

- Use the `nds.ols()` function to estimate the brain response to the various features for every voxel.
- Use the estimated weights to predict the activity for the held-out words, using `test_x`.
- Use the `nds.compute_correlation()` function to compute the correlation between your predicted activity and the real activity, `test_y`.
- Plot a flatmap of the prediction performance. Which regions are well predicted, and why?

In [None]:
# Note: for pycortex plotting, this subject has been normalized to a standard template brain
# Thus, use the template brain ("MNI" = Montreal Neurological Institute template) and standard atlas alignment
sub, xfm = 'MNI', 'atlas336'
# V = cortex.Volume(...)

In [None]:
### STUDENT ANSWER


# Cross validation
... the easy way:

In [None]:
from sklearn.model_selection import KFold
n_splits = 3
n_voxels = mask.sum()
# Pre-allocate 
r_cv = np.zeros((n_splits, n_voxels))
kf = KFold(n_splits=n_splits)
for icv, (trn, val) in enumerate(kf.split(range(12))):
    print("\n===Split {}===\n".format(icv))
    print("Training index:")
    print(trn)
    print("Validation index:")
    print(val)
    # Fit / predict the model here:
### STUDENT ANSWER


# Model comparison
OK, so we have a somewhat general model; now, we would like to address a few more questions / issues: 

* Our selection of a training set was arbitrary; we would like to predict ALL of our data using cross validation!
* Is our model better than a dumb model? (Are there bad ways to group the words?)
* Is our model better than the feature model that describes the visual features of each word/image?

In [None]:
# We will use these groups as a null hypothesis (since it's not terribly likely that the brain cares about
# the alphabetic grouping of these words...)
dummy_groups = [['airplane', 'ant', 'apartment', 'arch', 'arm', 'barn', 'bear', 'bed', 'bee', 'beetle', 
                 'bell', 'bicycle', 'bottle', 'butterfly'],
                ['car', 'carrot', 'cat', 'celery', 'chair', 'chimney', 'chisel', 'church', 'closet', 
                 'coat', 'corn', 'cow', 'cup', 'desk', 'dog', 'door', 'dress', 'dresser'],
                ['eye', 'fly', 'foot', 'glass', 'hammer', 'hand', 'horse', 'house', 'igloo', 'key', 'knife',
                 'leg', 'lettuce','pants', 'pliers'],
                ['refrigerator', 'saw', 'screwdriver', 'shirt', 'skirt', 'spoon', 'table', 
                 'telephone', 'tomato', 'train', 'truck', 'watch', 'window']]
dummy_design = group_to_design(*dummy_groups)

In [None]:
plt.imshow(dummy_design)

... And we will compute cross-validated predictions for each of our 3 models. There is a problem with this "dummy" model, though: it is going to be difficult to cross-validate! How can we account for the fact that different thirds of this array will have different category labels present? 
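The problem can be reproduced on a toy design. The words are sorted alphabetically, so contiguous folds (which is what `KFold` produces without shuffling) can remove an entire group from the training rows, leaving a regressor that is all zeros and cannot be estimated. The 12-word design, the helper `empty_columns`, and the interleaved folds below are assumptions for illustration; interleaving is only one possible remedy, not necessarily the intended answer to the breakout.

```python
import numpy as np

# Toy "alphabetical" design: 12 words in 3 groups of 4, sorted by group (one-hot columns)
toy_design = np.kron(np.eye(3), np.ones((4, 1)))  # shape (12, 3)

def empty_columns(design, folds):
    """For each validation fold, list the columns that are all zero in the training rows."""
    all_rows = np.arange(design.shape[0])
    out = []
    for val in folds:
        trn = np.setdiff1d(all_rows, val)
        out.append(np.flatnonzero(~design[trn].any(0)).tolist())
    return out

# Contiguous folds: each validation fold removes one whole group from training
contiguous = np.array_split(np.arange(12), 3)
print(empty_columns(toy_design, contiguous))   # [[0], [1], [2]]

# Interleaved folds: every group still appears in every training set
interleaved = [np.arange(12)[i::3] for i in range(3)]
print(empty_columns(toy_design, interleaved))  # [[], [], []]
```

An all-zero column is also a problem for z-scoring, since its standard deviation is zero.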

> Breakout session: Solve this problem! And formalize cross validation code as a function

In [None]:
### STUDENT ANSWER


In [None]:
# ... but better to do it this way:
def ols_pred_cv():
    pass
### STUDENT ANSWER


In [None]:
# Fit all three models w/ cross validation
r_category = ols_pred_cv(data, design, ri=ri)
r_dummy = ols_pred_cv(data, dummy_design, ri=ri)
r_feature = ols_pred_cv(data, vis_features, ri=ri)

### Compare models!
Each model makes a prediction at each voxel. Thus, the most straightforward (qualitative) way to compare models is to plot how well each model predicts each voxel in a scatterplot:

In [None]:
fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(r_dummy.mean(0), r_category.mean(0), color='gray', alpha=0.1)
plt.plot([-0.6,0.85], [-0.6,0.85], 'k--')
plt.xlim([-0.6,0.85])
plt.ylim([-0.6,0.85]);
### STUDENT ANSWER


> Survey 2: What are the axis labels here? What does each dot mean? (Add labels!) What is the graph telling you?

In [None]:
fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(r_feature.mean(0), r_category.mean(0), color='gray', alpha=0.1)
plt.plot([-0.6,0.85], [-0.6,0.85], 'k--')
plt.xlim([-0.6,0.85])
plt.ylim([-0.6,0.85])
### STUDENT ANSWER


> Discussion: How could we go about statistically determining which model is BETTER?

# Show where in the brain each model does better

In [None]:
V = cortex.Volume(r_feature.mean(0)-r_dummy.mean(0), sub, xfm, cmap='RdBu_r', vmin=-1, vmax=1, mask=mask)
_ = cortex.quickflat.make_figure(V)

# Model interpretation
Finally, we can examine the weights (for the categories or features) that summarize across the stimulus words / images. Further analysis will be required to determine if differences between weights for categories or features are reliable; this is just a qualitative analysis for now.

In [None]:
sub, xfm = 'MNI', 'atlas336'
for b in B:
    V = cortex.Volume(b, sub, xfm, cmap='RdBu_r', vmin=-3.0, vmax=3.0, mask=mask)
    cortex.quickflat.make_figure(V)
    # Add a title to each condition!
    plt.suptitle('')

(Do these plots maybe need a little fixing?)