# Overview 

Today's class will have two parts: 

First, we will review the homework, and describe ways to limit the amount of memory used in loading large data sets. 

Second, we will describe the structure of the experiment that produced the data we have been analyzing, and we will compute averages of activity around the time of specific experimental events. 


# Goals
* Understand ways to reduce the amount of memory used when loading data
* Understand *masking* data with logical indices
* Estimate the average response to an experimental event

# Updating resources in your server home directory
(Run the cells in this section once, then restart your kernel, reload the web page, and skip this section the next time through!)

In [None]:
if False: # You should not need to run this again if you ran it in class; if you were not in class, set this to True
    # Updating functions
    import neurods
    # Update neurods package
    neurods.io.update_neurods()

In [None]:
#cp ../Lecture04_Normalization_Masking_pycortex/figures ./ # uncomment this to run this cell

**NOTE! Added in breakout notebook:** Run this to get the other image into the notebook!

In [None]:
import os
import neurods
if not os.path.exists('figures/CategoryLocalizerDesign.001.png'):
    url = 'https://www.dropbox.com/s/sk9rbqdgyu6wf33/CategoryLocalizerDesign.001.png'
    fpath = 'figures/'
    neurods.io.download_file(url, 'CategoryLocalizerDesign.001.png', root_destination=fpath)

# Memory management and masking
A big difficulty in last week's homework - and in data science in general - is how to deal with large data sets. 

In [None]:
# Load some necessary libraries
import matplotlib.pyplot as plt
import numpy as np
import nibabel
import neurods
import cortex
import os

In [None]:
# Set plotting defaults
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
# Set matplotlib defaults!
plt.rcParams['image.cmap'] = 'viridis'
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.origin'] = 'lower'
plt.rcParams['image.aspect'] = 'equal'

### Python digression: Floating point vs integer numbers

`numpy` stores numbers in several different formats: as boolean values (True or False), as integers (0, 1, 2...), or as floating-point numbers (2.3256..., 3.63212..., etc). This is common to all programming languages that deal with images or numbers. Different formats use different amounts of memory. For floating-point types, the more bits used to store each number, the more precisely it is represented and the more memory the array takes up: each `np.float64` value takes 8 bytes, while each `np.float32` value takes 4.

Thus, converting to a less-precise format (`np.float32`) halves the memory used, which is a good trade if precision is not critically important.

In [None]:
print(np.float64(np.pi))
print(np.float32(np.pi))

In [None]:
r64 = np.random.rand(30,100,100)
r32 = r64.astype(np.float32)
print('data type of `r64` is: ', r64.dtype)
print('data type of `r32` is: ', r32.dtype)

In [None]:
whos
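
As a rough check on the memory savings, the `nbytes` attribute of an array reports how much memory its values occupy. The sketch below rebuilds arrays like `r64` and `r32` (using a seeded random generator, which is an assumption made here for reproducibility) and compares their sizes, along with the rounding error introduced by the conversion.

```python
import numpy as np

# Seeded generator so the example is reproducible (an assumption of this sketch)
rng = np.random.default_rng(0)
r64 = rng.random((30, 100, 100))   # float64 by default
r32 = r64.astype(np.float32)       # same values, stored with half the bytes

print("float64 array: {:.1f} MB".format(r64.nbytes / 1e6))
print("float32 array: {:.1f} MB".format(r32.nbytes / 1e6))

# Largest difference between the two versions of the same numbers
max_err = np.abs(r64 - r32).max()
print("largest rounding error:", max_err)
```

For values between 0 and 1, the rounding error is far smaller than the noise in typical fMRI data, which is why `float32` is usually a reasonable choice here.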

### HW Recap

In [None]:
from scipy.stats import zscore

# An OK implementation of load_data:
def load_data_ok(files, do_zscore=False):
    """Load fMRI data from files and optionally z-normalize data"""
    # Start with no data; the first file loaded initializes the array
    data = None
    for f in files:
        nii = nibabel.load(f)
        if data is None:
            data = nii.get_data().T
            if do_zscore:
                data = zscore(data, axis=0)
        else:
            tmp = nii.get_data().T
            if do_zscore:
                tmp = zscore(tmp, axis=0)            
            data = np.vstack([data, tmp])
    return data

# A better implementation
def load_data_better(files, do_zscore=False):
    """Load fMRI data from files and optionally z-normalize data
    
    Parameters
    ----------
    files : list 
        List of file names (absolute paths)
    do_zscore : bool
        Flag that determines whether to zscore data in time or not
    
    Returns
    -------
    data : array
        fMRI data array, in (time, z, y, x) format
    """
    # Create a list to store data
    data = []
    # Loop over files in list
    for f in files:
        nii = nibabel.load(f)
        tmp = nii.get_data().T
        # Optionally zscore each run independently
        if do_zscore:
            tmp = zscore(tmp, axis=0)
        data.append(tmp)
    # Concatenate full data
    data = np.vstack(data)
    return data

# The implementation we will use for this notebook
def load_data(*files, do_zscore=False, mask=None, dtype=np.float32):
    """Load fMRI data from files and optionally z-normalize data
    
    Parameters
    ----------
    files : strings 
        Absolute path names for files to be loaded
    do_zscore : bool
        Flag that determines whether to zscore data in time or not
    mask : boolean array
        Selection mask that specifies which voxels to extract from 3D brain
    dtype : numpy data type
        Data type to which to convert the loaded data

    Returns
    -------
    data : array
        fMRI data array, in (time, z, y, x) format (if not masked) or in
        (time, voxels) format (if masked)
    """
    # Create a list to store data
    data = []
    # Loop over files in list
    for f in files:
        print("Loading {}...".format(f))
        nii = nibabel.load(f)
        tmp = nii.get_data().T.astype(dtype)
        # Optionally mask data
        if mask is not None:
            tmp = tmp[:, mask]
        # Optionally zscore each run independently
        if do_zscore:
            tmp = zscore(tmp, axis=0)
        data.append(tmp)
        del tmp
    # Concatenate full data
    data = np.vstack(data)
    return data

# The extra lazy way to load data (here as an example, not used below)
def load_data_lazy(*runs, exp='categories', **kwargs):
    """Efficient wrapper for load_data
    
    Loads data for a given experiment, after specifying only run number
    
    Parameters
    ----------
    runs : integers {1,2,3}
        Run number to load for a given experiment
    exp : string
        Experiment name
    kwargs : keyword arguments
        (passed to load_data)
    
    Returns
    -------
    data : array
        fMRI data array, in (time, z, y, x) format (if not masked) or in
        (time, voxels) format (if masked)
    
    """
    if exp=='categories':
        files = [os.path.join(neurods.io.data_list['fmri'], exp, 's01_categories_%02d.nii.gz'%r) for r in runs]
    elif exp=='motor':
        files = [os.path.join(neurods.io.data_list['fmri'], exp, 's01_motorloc.nii.gz')]
    return load_data(*files, **kwargs)


In [None]:
# Demonstration that the load function works well
# (this uses the `files` list that is defined in the Masking section below)
# Set this to True to run this cell. We skip it here, because it will use up 
# a lot of memory, and thus possibly cause errors in subsequent cells. 
if False:
    # Load one to three files
    for n in range(1, 4):
        data = load_data(*files[:n], do_zscore=True)
        print("Loaded {} files, shape is:".format(n), data.shape)
        print("max={:0.3f}, min={:0.3f}".format(np.nanmax(data), np.nanmin(data)))
        print("")
        del data
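
The logic of `load_data` can be tried without any files on disk. The following is a minimal sketch, not the course implementation: `stack_runs` and `fake_runs` are hypothetical names, and random in-memory arrays stand in for the runs that `nibabel` would load. It follows the same order of operations (convert the data type, mask, z-score each run on its own, then stack).

```python
import numpy as np

def stack_runs(runs, mask=None, do_zscore=False, dtype=np.float32):
    """Hypothetical stand-in for load_data that takes arrays instead of file names."""
    out = []
    for run in runs:
        tmp = run.astype(dtype)            # convert first, to save memory
        if mask is not None:
            tmp = tmp[:, mask]             # (time, z, y, x) -> (time, voxels)
        if do_zscore:
            # Same result as scipy.stats.zscore(tmp, axis=0) with its default ddof=0
            tmp = (tmp - tmp.mean(axis=0)) / tmp.std(axis=0)
        out.append(tmp)
    return np.vstack(out)                  # concatenate runs along the time axis

rng = np.random.default_rng(0)
# Three fake runs of 20 TRs each, with a different mean signal level per run
fake_runs = [rng.normal(loc=100 * (i + 1), scale=10, size=(20, 3, 4, 5)) for i in range(3)]
fake_mask = np.zeros((3, 4, 5), dtype=bool)
fake_mask[1] = True                        # select the 20 voxels in the middle slice

stacked = stack_runs(fake_runs, mask=fake_mask, do_zscore=True)
print(stacked.shape, stacked.dtype)
```

Because each run is z-scored independently, the large differences in mean signal between the fake runs disappear after stacking.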

## Masking

As we have discussed, not all of the data in our 4D array is equally interesting to us. We are interested in the fMRI data collected IN the brain (vs outside it), and more specifically in the data collected from the cerebral cortex (the outermost layer of the brain). 

Here, we will show you how to extract (a) the data in the brain, and (b) the data in the cerebral cortex from the whole array. 

Remember our histogram of values for data, which shows a ton of voxels with zero values (from outside the brain):

In [None]:
# Specify files
files = ['s01_categories_{:02d}.nii.gz'.format(r) for r in [1, 2, 3]]
files = [os.path.join(neurods.io.data_list['fmri'], 'categories', f) for f in files]

In [None]:
# Load data for only one file
data = load_data(files[0], do_zscore=False)

In [None]:
bins = np.linspace(0,2000,31)
_ = plt.hist(data.flatten(), bins)
plt.xlabel('Raw fMRI Activity')
plt.ylabel('Count (voxels x TRs)')

So: how can we extract the data that is only from the region of the scan that contains the brain? We could try to write down an index for each data point in the data that contains a brain voxel (e.g. [25, 33, 33], [25, 33, 34]), but you can see how such a list would get quite long (tens of thousands) and would be difficult to construct. 

One simple way to find data that is in or near the brain is to threshold the data, keeping only the voxels where the signal is greater than some value. The simplest choice of threshold is zero; the cell below uses 250. 

In [None]:
# Here, consider only the first volume
brain_voxels = data[0] > 250
#print(brain_voxels) # Just displays a bunch of Trues and Falses in a big array

In [None]:
# What is this thing we have just created?
print('dtype of `brain_voxels`: ', brain_voxels.dtype) # Data type
print('Sum of `brain_voxels`: ', brain_voxels.sum()) # Number of voxels selected
print('Mean of `brain_voxels`: ', brain_voxels.mean()) # Proportion of voxels selected
print('Shape of `brain_voxels`: ', brain_voxels.shape) # Shape of array 

### Breakout session
1. Discuss what each of the values above indicates about the `brain_voxels` array.
2. What happens if you change the cell above to be brain_voxels = data[0] > X, where X is greater than zero? (What should the threshold [X] for selecting brain voxels be?)
3. While playing with the threshold value, display the `brain_voxels` variable in some sensible way. What does the array LOOK like for different thresholds (values of X)?

**NOTE** Setting the threshold higher excludes more and more low-signal voxels outside the brain (see next plot). Every voxel that is YELLOW in the following images is a True value, i.e. a voxel that will be selected by the mask that has been computed for a given threshold value.

In [None]:
### STUDENT ANSWER
fig, axes = plt.subplots(1,4, figsize=(8, 2))
for ax, threshold in zip(axes, [0, 10, 50, 250]):
    # Create selection mask (all voxels with a signal greater than `threshold`)
    brain_voxels = data[0] > threshold 
    # Choose a transverse slice of the brain to show
    brain_slice = brain_voxels[15]
    # Show the image!
    im = ax.imshow(brain_slice)
    ax.set_title('Threshold={}'.format(threshold))

You can see the whole mask using neurods.viz.slice_3d_array:

In [None]:
fig = plt.figure(figsize=(6,5))
_ = neurods.viz.slice_3d_array(brain_voxels, axis=0, fig=fig)

Now we have an array of True/False values (a boolean array). This array can be directly used to INDEX our data! 

In [None]:
# Logical indices are fun!
a = np.arange(10)
idx = np.array([True, False, True, False, True, False, True, False, True, False])
a[idx]

In [None]:
# This works in multiple dimensions, too!
a = np.arange(20).reshape(2,10)
print(a)

In [None]:
print(a[:,idx])
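
The same rule extends to a 3-D mask applied to a 4-D array: the three masked dimensions collapse into a single dimension whose length is the number of True values in the mask. Here is a small sketch on a made-up array (`toy` and `toy_mask` are illustrative names, not part of the class data):

```python
import numpy as np

# A tiny fake data set in (time, z, y, x) format, filled with 0, 1, 2, ...
toy = np.arange(2 * 2 * 3 * 4).reshape(2, 2, 3, 4)

# A 3-D boolean mask: keep voxels whose value in the first volume is even
toy_mask = toy[0] % 2 == 0
print("Mask shape:", toy_mask.shape, "| voxels selected:", toy_mask.sum())

# Index time with `:` and the three spatial dimensions with the mask
selected = toy[:, toy_mask]
print("Selected shape:", selected.shape)   # (time, voxels)
print(selected[0])                         # the even values from the first volume
```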

In [None]:
# or even for brain data!
brain_data = data[:, brain_voxels]
print(brain_data.shape)

This mask selects 64,789 voxels in the brain, and excludes the voxels outside the brain! The following plots show the array that we have now created:

### BREAKOUT SESSION
Make a histogram of `brain_data`. Z-score it, and plot it as an image.

In [None]:
### Student answer
plt.hist(brain_data.flatten(), bins)
plt.xlabel('Raw BOLD response')
plt.ylabel('Count (voxels x TRs)')
plt.figure()
plt.imshow(zscore(brain_data, axis=0), aspect='auto')
plt.xlabel("Voxels")
plt.ylabel("Time (TRs)")

This lower plot shows all the data that we are interested in. 
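
One thing to watch for when z-scoring: a voxel whose value never changes over time has a standard deviation of zero, so its z-score is 0/0, which numpy reports as NaN. This can happen for voxels outside the brain in unmasked data. The sketch below uses a hand-written `zscore_time` helper (a hypothetical name; it mirrors `scipy.stats.zscore(x, axis=0)` with default settings) on a tiny made-up array to show the effect.

```python
import numpy as np

def zscore_time(x):
    """Z-score each column (voxel) over rows (time); hypothetical helper."""
    mu = x.mean(axis=0)
    sd = x.std(axis=0)
    # Silence the divide-by-zero warnings for constant columns
    with np.errstate(divide='ignore', invalid='ignore'):
        return (x - mu) / sd

# Three time points, three voxels: one that varies, two that are constant
toy = np.array([[1., 5., 0.],
                [2., 5., 0.],
                [3., 5., 0.]])
z = zscore_time(toy)
print(z)
# NaN-aware functions ignore the undefined voxels
print("max ignoring NaNs:", np.nanmax(z))
```

Masking before z-scoring avoids most of these undefined values, since the mask removes the constant voxels outside the brain.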

A question was asked in class about whether we have now lost all information about where, for example, voxel # 23456 (indexed across the x axis of the plot) occurred in the brain. We have not lost that information, because we still have the mask! We can re-create a 3D array at any time to put our data back into a 3-D (z, y, x) or 4-D (time, z, y, x) array, like this:

In [None]:
new_brain = np.zeros(brain_voxels.shape)
# Put the first TR worth of data back into the original brain volume shape:
new_brain[brain_voxels] = brain_data[0]

# Show what we have via slice_3d_array:
_ = neurods.viz.slice_3d_array(new_brain, axis=0)
# Compare this to the original first volume of the data:
_ = neurods.viz.slice_3d_array(data[0], axis=0)

Very similar! But if you look closely at the lower plot, there are some low values for some voxels outside the brain. In the upper plot, everything outside the brain is exactly zero. Pycortex can reconstitute masked data in the same way when you create `cortex.Volume` objects (see below).

In [None]:
# If you provide a mask to cortex.Volume, it will reconstitute a 3D array from a 1D array of masked voxels
v_masked = cortex.Volume(brain_data[0], 's01', 'catloc', mask=brain_voxels, vmin=0, vmax=2000)
v_orig = cortex.Volume(data[0], 's01', 'catloc', vmin=0, vmax=2000)
# Here's another fancy trick pycortex can do: You can plot two different data sets if you pass webgl.show()
# a dict instead of a single pycortex Volume object:
to_show = {'Masked & reconstituted data' : v_masked,
           'Original data' : v_orig}
cortex.webgl.show(to_show)

You can flip between data sets in pycortex's web view by pressing `+` and `-`, or by using the drop-down menu at the top of the screen. Note that the two data sets look nearly identical (except for a few voxels that were cut out of the brain by the mask, near the occipital pole and in the temporal lobe).
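
The reconstruction shown earlier for a single volume also works for a whole (time, voxels) array. The following is a sketch on synthetic data; `unmask` is a hypothetical helper written for this example, not a function from `neurods` or pycortex.

```python
import numpy as np

def unmask(masked, mask, fill=0.0):
    """Put (time, voxels) data back into a (time, z, y, x) array (hypothetical helper)."""
    n_time = masked.shape[0]
    # Start from an array of the fill value, then write the masked voxels back in
    full = np.full((n_time,) + mask.shape, fill, dtype=masked.dtype)
    full[:, mask] = masked
    return full

rng = np.random.default_rng(1)
vol = rng.random((5, 3, 4, 4)).astype(np.float32)   # fake (time, z, y, x) data
m = vol[0] > 0.5                                     # fake 3-D "brain" mask
restored = unmask(vol[:, m], m)

print("masked shape:  ", vol[:, m].shape)
print("restored shape:", restored.shape)
```

Every voxel inside the mask gets its original values back, and every voxel outside the mask is set to the fill value.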

# Masking with pycortex
pycortex is the software that we use to map our 3-D or 4-D data onto the cortical surface. Pycortex can also be used to select voxels that are specifically located within a small distance from the cortical surface. This obviates the need for specifying an arbitrary threshold (above, we specified 250 - but why not 251? Why not 249 or 200?) - it provides a principled way to say which data you are interested in (data for voxels that fall within the cerebral cortex). 

To select voxels, we use the `cortex.db.get_mask` function. Just like cortex.Volume, this function requires two pieces of information. The first piece of information is the specific subject for whom we want to select the cortical surface (different subjects' brains are different!) - here, we specify `'s01'` (subject 1). The second piece of information is the transform (all the rotations, stretches, scaling, and left/right/up/down movements necessary to align the functional data to the anatomical data and thus to the cortical surface). Note that the subject's head may be in slightly different positions within the scanner for each different experiment - this is why we need to specify a transform, to say specifically WHERE the data was collected relative to the anatomical scan for a given subject for a given experiment. Here, `'catloc'` refers to the fact that this specific experiment is a *category localizer* experiment (see below for what that means!).

In [None]:
# Fancy syntax for setting two variables:
sub, xfm = 's01', 'catloc'
# `sub` specifies a subject and `xfm` specifies a stored transform
cortical_voxels = cortex.db.get_mask(sub, xfm, type='cortical')

In [None]:
# Display the same information for this mask as we did for the brain mask above
print("Mask data type:", cortical_voxels.dtype) 
print("Mask shape:", cortical_voxels.shape)
print("Number of voxels in mask:", cortical_voxels.sum())
print("Proportion of voxels in mask: {:0.2f}".format(cortical_voxels.mean()))

In [None]:
# Plot horizontal slices of `cortical_voxels` mask
fig1 = plt.figure(figsize=(6,5))
_ = neurods.viz.slice_3d_array(cortical_voxels, axis=0, fig=fig1)

In [None]:
# Alternative plot of mask (sagittal slices)
fig2 = plt.figure(figsize=(10,3))
_ = neurods.viz.slice_3d_array(cortical_voxels, axis=1, fig=fig2)

In [None]:
cortical_data = data[:, cortical_voxels]
print(cortical_data.shape)

Note how much smaller (in MB) `cortical_data` is compared to `data`!

In [None]:
whos

Note that this mask reduces the data size even more - down to ~12% of its original size! This will allow us to load more data and thus (eventually) to do more robust analyses. 
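
The size reduction follows directly from the mask: the masked array keeps the same number of time points and bytes per value, so its size relative to the full array equals the proportion of True values in the mask. A small sketch with a made-up mask (all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
toy_data = rng.random((10, 6, 8, 8)).astype(np.float32)   # fake (time, z, y, x) data

# Fake mask that selects a 2 x 4 x 4 block of voxels
toy_mask = np.zeros((6, 8, 8), dtype=bool)
toy_mask[2:4, 2:6, 2:6] = True

toy_masked = toy_data[:, toy_mask]
ratio = toy_masked.nbytes / toy_data.nbytes
print("Proportion of voxels in mask: {:0.3f}".format(toy_mask.mean()))
print("Masked size / full size:      {:0.3f}".format(ratio))
```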

## Load description of experiment 
The experiment we have been working with is a *localizer* experiment. It is designed to find areas of the brain that respond to particular visual categories of objects: faces, bodies, and places. It also reveals areas that respond more to objects than to scrambled versions of the same objects. This experiment is a simple replication of past work, and is commonly done as a first step to locate (or localize) a region of interest for further analysis in a subsequent experiment.

For the localizer experiment, images from each category were presented in a block design. This means that images from one category were shown one after another for a "block" of 20 seconds (10 TRs), followed by images from another category for a block of 20 seconds, and so on.

<img src="figures/CategoryLocalizerDesign.001.png" style="height: 400px;">

To analyze the data from this experiment at all, we need to know when the blocks for each category (faces, bodies, places, objects, and scrambled objects) began and ended. This information is stored in a *design matrix*, which we load below.

In [None]:
basedir = os.path.join(neurods.io.data_list['fmri'], 'categories')
design = np.load(os.path.join(basedir, 'experiment_design.npz'))
print('Experiment design variables: ', sorted(design.keys()))

In [None]:
conditions = design['conditions'].tolist()
print('Conditions: ', conditions)
design_run1 = design['run1']
print('Design shape: ', design_run1.shape)

It's often useful to show a design matrix as an image. In the image below, the yellow values indicate which time indices contained each condition!

In [None]:
_ = plt.imshow(design_run1.T, aspect='auto')

## Breakout session
> What are the dimensions here? Label the axes on the figure above!

In [None]:
# Essentials:
_ = plt.imshow(design_run1.T, aspect='auto')
_ = plt.xlabel('Time (TRs)')
_ = plt.ylabel('Condition')
# Tick labels (useful!):
_ = plt.xticks(range(0, 120, 10))
_ = plt.yticks(range(5), conditions)
# Some fanciness for an extra pretty plot:
_ = plt.grid(axis='x', color='white')
_ = plt.hlines(np.arange(0.5, 4.5), 0, 120, colors='w', alpha=0.5, linestyles=':')
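
To make the layout of a design matrix concrete, here is a toy block design built by hand. The number of conditions and the block order below are made up for illustration; they are not the design of the localizer experiment.

```python
import numpy as np

n_conditions = 3
block_len = 10                      # TRs per block, as in the text above
block_order = [0, 1, 2, 1, 0, 2]    # hypothetical order of conditions

# One row per TR, one column per condition; 1 means "this condition is on"
toy_design = np.zeros((len(block_order) * block_len, n_conditions), dtype=int)
for i, cond in enumerate(block_order):
    toy_design[i * block_len:(i + 1) * block_len, cond] = 1

print("Design shape (time, conditions):", toy_design.shape)
print("TRs per condition:", toy_design.sum(axis=0))
```

Each row has exactly one 1 in it, because in this toy design exactly one condition is shown at any time point.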

# Event-related averages
In an experiment, we are always interested in the relationship between the stimulus and brain responses. The simplest way to visualize the relationship is to examine what happened to brain responses every time that a stimulus came on. Thus, we will now create *averages* of responses after a particular type of stimulus came on.

In [None]:
# For the following analyses, we will average z-scored data
dataz = load_data(*files[:1], do_zscore=True, dtype=np.float32)
cortical_dataz = dataz[:, cortical_voxels]

In [None]:
# First, use np.nonzero to find condition onsets for the first condition
on_times = np.nonzero(design_run1[:,0])
print(on_times)

**Python note**: The parentheses around the output here indicate that `on_times` is a tuple; we don't want the array to be inside a tuple, we just want it to be an array. Here are a few ways to make sure `on_times` is an array:

In [None]:
# Easiest: Explicitly select the first element of the tuple
on_times = np.nonzero(design_run1[:,0])
on_times = on_times[0]
print(on_times) # (note that all these print commands will give you the same array)

# Do the same thing in one line:
on_times = np.nonzero(design_run1[:,0])[0]
print(on_times)

# Fancy syntax:
on_times, = np.nonzero(design_run1[:,0])
print(on_times)

# (this fancy syntax is the same as setting multiple variables from a tuple, like this:)
a, b = (1, 2)
print("Two-tuple (a,b) values are equal to:")
print(a)
print(b)
# ... but more like this:
a, = (3, )
print("one-long tuple, first value is:", a)

## Breakout session
> Convert `on_times` to onsets! (i.e., the specific time that the stimulus came on)

> Select 10 time points after each time the stimulus came on, and save them in a list

> Average each set of points together! 

In [None]:
### STUDENT ANSWER

## Logic:
# Find all the indices for which the condition was on/present
on_times, = np.nonzero(design_run1[:,0])
print('Stimulus was on at indices:')
print(on_times)
# Add a null value at the beginning of the array to make sure we capture the first onset of the condition
# (-2 rather than -1, so that a first onset at index 0 is still more than 1 away from it)
on_times_add = np.hstack([-2, on_times])
# Find indices that are at the START of blocks - where the next index is more than 1 TR away
diff_indices = np.diff(on_times_add) > 1
print('These indices (among `on_times`) are the onsets:')
print(diff_indices)
# Select those times!
print("These are the condition onsets:")
print(on_times[diff_indices])

# Formalize this all in a function:
def get_onsets(cond):
    """Convert a set of indicators for when a condition is on to onset indices for that condition
    
    Parameters
    ----------
    cond : array
        An array of 1s and 0s (or a boolean array of Trues and Falses), indicating which time indices of
        an experimental timecourse were part of a single given condition
    
    Returns
    -------
    onset_times : array
        onset time indices for `cond`
    """
    # (Note fancy syntax from above to pull out first element of a tuple)
    on_times, = np.nonzero(cond)
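    # Keep only the indices that start a block, i.e. that are more than 1 TR after the
    # previous "on" index (the leading -2 makes this work even for an onset at index 0)
    keepers = np.diff(np.hstack([-2, on_times]))>1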
    onset_times = on_times[keepers]
    return onset_times
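
Another way to find onsets, sketched here as an alternative rather than a replacement, is to look for the time points where the indicator steps from 0 to 1. `get_onsets_diff` is a hypothetical name, and the indicator vector is made up to include a block that starts at the very first time point.

```python
import numpy as np

def get_onsets_diff(cond):
    """Find onsets as 0 -> 1 transitions in an indicator vector (hypothetical alternative)."""
    cond = np.asarray(cond).astype(int)
    # Pad with a leading 0 so that a condition that is on at index 0 counts as an onset
    padded = np.concatenate([[0], cond])
    onsets, = np.nonzero(np.diff(padded) == 1)
    return onsets

toy_cond = np.array([1, 1, 0, 0, 1, 1, 1, 0, 1])
print(get_onsets_diff(toy_cond))   # [0 4 8]
```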

In [None]:
# Get onset times for condition 1
onset_times = get_onsets(design_run1[:, 0])
cond_data = []
for ot in onset_times:
    # Select 10 time points starting from each condition onset (onset to onset + 10)
    cond_data.append(cortical_dataz[ot:ot+10])
# Compute the mean of all the repeats of condition 1
all_cond1 = np.array(cond_data)
print(all_cond1.shape)
data_avg_cond1 = all_cond1.mean(0)
print(data_avg_cond1.shape)

This leaves us with 10 time points for each voxel. This is our event-related average! We will plot this in pycortex.

In [None]:
# Make a movie of the temporal average in pycortex!
sub, xfm = 's01', 'catloc'
cond1_vol = cortex.Volume(data_avg_cond1, sub, xfm, vmin=-3, vmax=3, cmap='RdBu_r')
cortex.webgl.show(cond1_vol)

Note that this is a movie - you can scroll through time with the pop-up menu at the bottom of the screen to see how the response to this condition evolves over time.
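
The averaging step can also be checked on synthetic data where the true response is known. In this sketch, `event_related_average` is a hypothetical helper, and the onsets, response shape, and noise level are all made up. Unlike the loop above, it drops any onset whose window would run past the end of the data, so that all windows have the same length.

```python
import numpy as np

def event_related_average(data, onsets, n_timepoints=10):
    """Average `n_timepoints` of (time, voxels) data after each onset (hypothetical helper)."""
    onsets = np.asarray(onsets)
    # Drop onsets too close to the end of the run to give a full window
    onsets = onsets[onsets + n_timepoints <= data.shape[0]]
    windows = np.array([data[ot:ot + n_timepoints] for ot in onsets])
    return windows.mean(axis=0)

rng = np.random.default_rng(3)
n_tr, n_vox = 60, 4
# Known response: a ramp from 0 to 1 over 10 TRs, identical in every voxel
response = np.linspace(0, 1, 10)[:, None] * np.ones((1, n_vox))
toy = rng.normal(scale=0.01, size=(n_tr, n_vox))   # low-level noise
toy_onsets = [5, 25, 45]
for ot in toy_onsets:
    toy[ot:ot + 10] += response                    # add the response after each onset

era = event_related_average(toy, toy_onsets)
print(era.shape)                                   # (time points, voxels)
print("largest deviation from the true response:", np.abs(era - response).max())
```

Averaging over repeats reduces the noise, so the recovered average stays close to the ramp that was put in.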