# Final Exam

This final is open notes/books/internet, and you will have until 10pm to finish it. Please do not turn it in late; late submissions will not receive credit.

In this exam you will prepare and run a full fMRI localizer analysis. Each exercise will walk you one step closer to the final result, a map of cortical voxels whose activity passes a hypothesis test.

To begin, run the cells below containing the modules and helper functions you'll need throughout the exam. After the imports and helper functions, a cell will use the helpers to load all the data, z-scoring it and masking it with a cortical mask so you get a 2D array of shape (time, # of voxels). This is the same category localizer data you've used often throughout the course.

**NOTE: MAKE SURE TO READ ALL INSTRUCTIONS, AND ANSWER ALL PARTS OF EACH QUESTION FOR FULL CREDIT!!**

In [None]:
# Don't change this cell; just run it. 
# The result will give you directions about how to log in to the submission system, called OK.
# Once you're logged in, you can run this cell again, but it won't ask you who you are because
# it remembers you. However, you will need to log in once per assignment.
from client.api.notebook import Notebook
ok = Notebook('final.ok')
_ = ok.auth(inline=True)

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import scipy
import scipy.stats
import cortex
import os
import nibabel
import urllib
import tempfile
import h5py
from scipy.stats import zscore
from scipy.stats import norm
from scipy.ndimage import gaussian_filter
from sklearn.linear_model import LinearRegression as SkLinearRegression
from sklearn.base import BaseEstimator
from nistats.hemodynamic_models import glover_hrf as create_hrf

np.random.seed(42)

# Set plotting defaults
%matplotlib inline

In [None]:
class LinearRegression(BaseEstimator):
    
    def __init__(self, block_size=5000, use_pinv=True, lr_args=None, fit_intercept=True, 
                 memmap_coefs=True, verbose=0):
        self.block_size = block_size
        self.use_pinv = use_pinv
        self.lr_args = lr_args
        self.fit_intercept = fit_intercept
        self.verbose = verbose
        self.memmap_coefs = memmap_coefs
    
    def zeros(self, shape, dtype='float32'):
        if isinstance(shape, int):
            shape = shape,
        if self.memmap_coefs is False:
            return np.zeros(shape, dtype=dtype)
        else:
            if self.memmap_coefs is True:
                file_id, memmap_file = tempfile.mkstemp()
                memmap_file = os.fdopen(file_id, 'rb+')
            else:
                # if filename specified
                memmap_file = open(self.memmap_coefs, 'rb+')
            memmap = np.memmap(memmap_file, dtype=dtype, shape=tuple(shape))
            memmap[:] = 0.
            return memmap
    
    def fit(self, X, Y):
        self.y_ndim = Y.ndim

        if Y.ndim == 1:
            Y = Y.reshape(-1, 1)

        self.coef_ = self.zeros([Y.shape[1], X.shape[1]])
        if self.fit_intercept:
            self.intercept_ = self.zeros(Y.shape[1])

        if self.use_pinv:
            if self.fit_intercept:
                X_pinv = np.linalg.pinv(np.hstack([X, np.ones_like(X[:, 0:1])]))
            else:
                X_pinv = np.linalg.pinv(X)

            for i in range(0, Y.shape[1], self.block_size):
                weights = X_pinv.dot(Y[:, i:i + self.block_size])
                self.coef_[i:i + self.block_size] = weights[:X.shape[1]].T
                if self.fit_intercept:
                    self.intercept_[i:i + self.block_size] = weights[-1]
                if self.verbose > 0:
                    print(".", end="")
        else:
            # Use the stored keyword arguments (an empty dict if none were given)
            lr_args = {} if self.lr_args is None else self.lr_args
            lr = SkLinearRegression(fit_intercept=self.fit_intercept, **lr_args)
            for i in range(0, Y.shape[1], self.block_size):
                lr.fit(X, Y[:, i:i + self.block_size])
                self.coef_[i:i + self.block_size] = lr.coef_
                if self.fit_intercept:
                    self.intercept_[i:i + self.block_size] = lr.intercept_
                if self.verbose > 0:
                    print(".", end="")
        return self
    
    def predict(self, X):
        p = self.zeros((X.shape[0], self.coef_.shape[0]))
        for i in range(0, self.coef_.shape[0], self.block_size):
            p[:, i:i + self.block_size] = X.dot(self.coef_.T[:, i:i + self.block_size])
            if hasattr(self, 'intercept_') and self.fit_intercept:
                p[:, i:i + self.block_size] = p[:, i:i + self.block_size] + self.intercept_[i:i + self.block_size]
        if self.y_ndim == 1:
            p = p.ravel()
        return p

In [None]:
def z_score(x):
    x = x - x.mean(axis=0)
    x = x / (x.std(axis=0) + 1e-18)
    return x

In [None]:
def correlate(x, y, block_size=5000):
    if x.ndim == 1:
        x = x.reshape(-1, 1)
    if y.ndim == 1:
        y = y.reshape(-1, 1)
    
    output = np.zeros(x.shape[1])
    for i in range(0, x.shape[1], block_size):
        x_z = z_score(x[:, i:i + block_size])
        y_z = z_score(y[:, i:i + block_size])
        output[i:i + block_size] = np.mean(x_z*y_z, axis=0)
    return output

In [None]:
def load_file(filename, zscore=True, mask=None):
    
    img = nibabel.load(filename)
    data = img.get_data().T
    if mask is not None:
        data = data[:, mask]
    if zscore:
        data = data - data.mean(0)
        data /= data.std(0) + 1e-8
    return data

def load_files(filenames, zscore=True, mask=None):
    all_data = []
    for filename in filenames:
        all_data.append(load_file(filename, zscore, mask))
    return np.concatenate(all_data, axis=0)

In [None]:
def shuffle_blocks(array, block_size=10):
    n_samples = array.shape[0]
    n_blocks = n_samples // block_size
    
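    indices = np.arange(n_samples)
    # Reshaping the contiguous slice yields a view into `indices`, so shuffling
    # its rows in place reorders `indices` block by block. Samples beyond the
    # last full block keep their original position.
    blocked_indices = indices[:n_blocks * block_size].reshape(-1, block_size)
    np.random.shuffle(blocked_indices)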
    
    return array[indices]

## Load the Data

**Note:** Take a good look at this cell: It loads the BOLD data into `data`, loads the cortical mask into `mask` and loads the localizer category labels into `category_labels`.

In [None]:
# Create the filenames for the data
filenames = []

for i in (1, 2, 3, 4, 5, 6):
    filenames.append("/data/cogneuro/fMRI/categories/s03_catloc_run{:02d}.nii.gz".format(i))
             

# get the cortical mask 
mask = cortex.db.get_mask('s03', 'category_localizer', 'cortical')

# load the BOLD data
data = load_files(filenames, mask=mask)

# load the category labels for the experiment
f = h5py.File("/data/cogneuro/fMRI/categories/s03_catloc_design.hdf")
ulab = np.concatenate([['Nothing'], list(map(str.capitalize, map(bytes.decode, f['xnames'][:])))])

# These are the category labels the way you know them.
category_labels = ulab[f['events'][:]]

# Exercise 1: Preparing the design matrix

In this exercise you will create a design matrix out of response vectors. This will serve as the basis for the regressions in later exercises. You will be asked to create some plots with detailed instructions. Make sure to incorporate all of the instructions into your plot.

**a)** [2.5pts] Create a 1D array containing an HRF using the `create_hrf()` function. Specify `tr=2`, `oversampling=1` and `time_length=32` as the arguments to that function, and store the result in the name `hrf`. Plot this HRF in a wide figure by creating a `plt.figure` called `fig_hrf` with the argument `figsize=(20, 4)`. Label the x-axis as `time / seconds` and make sure the x-ticks reflect the fact that every point of the HRF is 2 seconds apart. Add the title "Hemodynamic Response Function" and label the y-axis as `Intensity`.

In [None]:
# a)


**b)** [2.5pts] Find out which labels are in the array `category_labels`. Store the unique labels in an array named `unique_labels`. Print the unique labels (look closely: the words are capitalized). Observe that the word `Nothing` is among `unique_labels`. We do not want to make a stimulus or response vector for this condition. Create an array named `categories` which contains all the unique labels from `category_labels` except for `Nothing`. This can be done by array masking. 
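
The masking idea can be sketched on a made-up label array. This is only an illustration under stated assumptions: the names `toy_labels`, `toy_unique`, `keep` and `toy_categories` are hypothetical and are not part of the exam data.

```python
import numpy as np

# Hypothetical toy labels standing in for `category_labels`
toy_labels = np.array(['Nothing', 'Faces', 'Faces', 'Nothing', 'Places', 'Nothing'])

# np.unique returns the sorted unique entries
toy_unique = np.unique(toy_labels)

# Boolean mask that is False only where the label is 'Nothing'
keep = toy_unique != 'Nothing'
toy_categories = toy_unique[keep]

print(toy_unique)
print(toy_categories)
```

Because `np.unique` sorts its output, the resulting array is already in alphabetical order.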

In [None]:
# b)



**c)** [2.5pts] Here you'll take a look at one of the stimulus vectors. Create the stimulus vector for `Places` and call it `stim_places`. Plot it in a figure of size `(20, 4)`. Label the x-axis as time in seconds and make sure the x-ticks represent this choice of unit (remember, each TR=2s). Give the plot the title "Stimulus Vector". Label the y-axis with "Stimulus Presence".

In [None]:
# c)


**d)** [2.5pts] You'll now create the response vector for the `Places` stimulus vector by convolving the stimulus vector with the HRF. Make sure to shorten the result to the original length of the stimulus vector after the convolution operation. Call the response vector `resp_places`. Make a plot in the same way as you did for part **c)** and include both the stimulus and response vectors in it. Label the x-axis as `time / seconds` and give the plot the title "Stimulus and Response Vectors".
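
The convolve-then-truncate step can be sketched on toy data. This example assumes a made-up kernel (`toy_kernel`) in place of the real HRF and a made-up stimulus vector (`toy_stim`); all names here are hypothetical.

```python
import numpy as np

# Hypothetical toy stimulus vector: 1 while the stimulus is on, 0 otherwise
toy_stim = np.zeros(20)
toy_stim[3:6] = 1.0

# Hypothetical toy kernel standing in for the HRF (this is NOT the Glover HRF)
toy_kernel = np.array([0.0, 0.5, 1.0, 0.5, 0.0])

# The default 'full' convolution returns len(stim) + len(kernel) - 1 samples
toy_full = np.convolve(toy_stim, toy_kernel)

# Keep only the first len(stim) samples so response and stimulus align in time
toy_resp = toy_full[:len(toy_stim)]

# Time axis in seconds for TR = 2 s, usable as x-values when plotting
tr = 2
toy_time = np.arange(len(toy_stim)) * tr

print(toy_full.shape, toy_resp.shape)
```

Truncating at the end (rather than the start) keeps sample `k` of the response aligned with sample `k` of the stimulus.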

In [None]:
# d)


**e)** [2.5pts] Now make the response vectors for all the remaining categories in the array `categories`, just as you did in part **d)**. Name each one of them according to the same scheme as in **c)** and **d)** (e.g. `stim_body` and `resp_body`). 

In [None]:
# e)


**f)** [2.5pts] Gather all response vectors into a 2-D array such that each response vector becomes one of the columns of this array. **Make sure you arrange the categories in alphabetical order so that you can identify them later**. This array should be of shape `(number of time points (TRs), number of categories)`. Call it `response_design`. Plot this design matrix using `plt.imshow` with the argument `aspect='auto'` in a figure of size `(4, 10)`. Place `xticks` corresponding to the category labels.

Do the same for the stimulus vectors, creating `stimulus_design`. Display `stimulus_design` in the same way you displayed `response_design`.
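
One way to assemble columns in a reproducible order is sketched below on made-up vectors. The dictionary `toy_resps` and the other `toy_` names are hypothetical and only illustrate the stacking step.

```python
import numpy as np

# Hypothetical response vectors keyed by category name
toy_resps = {
    'Places': np.array([0.0, 0.0, 1.0, 0.5]),
    'Body': np.array([1.0, 0.5, 0.0, 0.0]),
    'Faces': np.array([0.0, 1.0, 0.5, 0.0]),
}

# Sort the names so the column order is alphabetical and reproducible
toy_names = sorted(toy_resps)

# Each vector becomes one column: shape (time points, categories)
toy_design = np.column_stack([toy_resps[name] for name in toy_names])

print(toy_names)
print(toy_design.shape)
```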

In [None]:
# f)


#### Note: The design matrix is crucial to further analysis. If you were unable to create it, load a version of it by executing the next cell. The loaded version is not identical to the design matrix you would construct in the first exercise.

# Exercise 2: Simple Regression on a single voxel

Now you're going to test your familiarity with doing simple linear regression on a single voxel.

**a)** [2.5pts] Store the BOLD data time series of the voxel with index `27334` in the name `test_voxel1`.

In [None]:
# a)



**b)** [2.5pts] Use a `LinearRegression` object to fit a linear regression model of `resp_places` onto `test_voxel1`. Remember that you need to turn the 1D array `resp_places` into a 2D column vector by using the `np.reshape` function. If you get an error message because of this, follow the instructions at the bottom of the error message.

In [None]:
# b)


**c)** [2.5pts] Get the slope and the intercept out of the `LinearRegression` object from the attributes `coef_` and `intercept_`. Through some combination of multiplication and addition between the slope, intercept and `resp_places`, calculate the predicted voxel activity for every time point of `test_voxel1`. Remember the formula for a linear model if you need a hint. Call this `predictions_a`. Then use the `LinearRegression` object to `predict` the activity for `resp_places` (remember to reshape as you did for the fitting in **b)**). Call this `predictions_b`. Verify that these predictions are equal in all entries using the `np.allclose()` function.
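
The equivalence being checked can be sketched with plain NumPy on synthetic data. This is an illustration only: it uses `np.linalg.lstsq` rather than the notebook's `LinearRegression` helper, and every name below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regressor and noisy signal
toy_x = rng.standard_normal(50)
toy_y = 0.7 * toy_x + 0.1 + 0.05 * rng.standard_normal(50)

# Least-squares fit of y = slope * x + intercept via an extra column of ones
A = np.column_stack([toy_x, np.ones_like(toy_x)])
(slope, intercept), *_ = np.linalg.lstsq(A, toy_y, rcond=None)

# Prediction written out by hand ...
pred_manual = slope * toy_x + intercept
# ... and the same prediction as a matrix product
pred_matrix = A @ np.array([slope, intercept])

print(np.allclose(pred_manual, pred_matrix))
```

`np.allclose` compares within a small tolerance, which is the appropriate check for floating-point results computed along two different routes.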

In [None]:
# c)


**d)** [2.5pts] Plot the BOLD data time series of `test_voxel1` and the prediction time series `predictions_a` into a figure of size `(20, 4)`. Give the plot a title, a legend, label the x- and y-axes, and make sure the x-ticks represent the correct timing.

In [None]:
# d)



**e)** [2.5pts] The array `test_voxel1` is already z-scored because we did that when we loaded the data. Z-score the array `resp_places` and call the result `z_resp_places`. Re-run the linear regression with the z-scored arrays. Print out the slope and the intercept. Then answer the question:
   * What is another name for the slope of this regression? 


In [None]:
# e) Code


# Exercise 3: Multiple Regression on all cortical voxels

In this exercise you will perform linear regression on all cortical voxels in order to understand which part of the brain responds to which stimulus types. You will examine the weights of one specific voxel and determine which stimulus type changes its activity most. After this, you will compute a contrast map and display it.

**a)** [2.5pts] Create a `LinearRegression` object and fit a linear regression model to the response design matrix and all the BOLD data for all cortical voxels. Store its attribute `coef_` in the name `regression_weights`.

In [None]:
# a)


**b)** [2.5pts] Store the weights of the voxel with index `22644` in `weight_voxel2`. This weight vector should have as many entries as the array `categories` is long. Each entry corresponds to this voxel's regression weight for each of the stimulus types. Make a bar plot showing each of these values. Label each bar with the stimulus type it corresponds to. Answer the following question:
* Which stimulus type does this voxel respond to most strongly?

In [None]:
# b)


In [None]:
# b) text answer


**c)** [2.5pts] Identify which indices correspond to `Places` and `Object` in the 1D array `categories`. Use these indices to extract the corresponding weights for all cortical voxels from `regression_weights` and compute the `Places - Object` contrast and store that in the name `contrast_places_object`.
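
A contrast of this kind can be sketched on a made-up weight matrix. The example assumes weights laid out as `(voxels, categories)`, which matches how the `LinearRegression` helper above allocates `coef_`; the arrays and names are hypothetical.

```python
import numpy as np

# Hypothetical categories and weights; weights have shape (voxels, categories)
toy_categories = np.array(['Body', 'Faces', 'Object', 'Places'])
toy_weights = np.array([[0.1, 0.2, 0.3, 0.9],
                        [0.0, 0.5, 0.6, 0.1]])

# Index of each category in the 1D array
idx_places = np.where(toy_categories == 'Places')[0][0]
idx_object = np.where(toy_categories == 'Object')[0][0]

# One contrast value per voxel
toy_contrast = toy_weights[:, idx_places] - toy_weights[:, idx_object]

print(toy_contrast)
```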

In [None]:
# c)


**d)** [2.5pts] Plot this contrast on a flatmap. (The subject name is `s03` and the transform name is `category_localizer`).

Note: You can use `with_rois=True` when calling the display function to see the ROI borders.

# Exercise 4: Permutation tests and p-values.
In this exercise you will perform a permutation test and compute p-values for a contrast value across all cortical voxels.

**a)** [2.5pts] Set the name `n_permutations` to 100. Create an array called `all_permuted_contrasts` filled with zeros, of the shape `(n_permutations, n_voxels)`. You can obtain `n_voxels` by looking at the shape of `data`.

In [None]:
# a)


**b)** [2.5pts] 

Use the helper function `shuffle_blocks` (located at the top of the notebook) to write a function `def get_permuted_response_design(stim_design, block_size=10)`, which takes the stimulus design matrix as input, permutes it with `shuffle_blocks`, then convolves each of the columns with the HRF and returns a permuted response design matrix.
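
The shuffle-then-convolve pattern can be sketched in a self-contained way. To keep it runnable on its own, this example uses a stand-in shuffler (`toy_shuffle_blocks`) instead of the notebook's `shuffle_blocks`, and a made-up kernel instead of `hrf`; all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_shuffle_blocks(array, block_size, rng):
    """Shuffle whole blocks of rows; leftover rows at the end stay in place."""
    n_blocks = array.shape[0] // block_size
    order = rng.permutation(n_blocks)
    blocks = array[:n_blocks * block_size].reshape(n_blocks, block_size, -1)
    shuffled = blocks[order].reshape(n_blocks * block_size, -1)
    return np.concatenate([shuffled, array[n_blocks * block_size:]], axis=0)

def toy_permuted_response_design(stim_design, kernel, block_size, rng):
    """Shuffle the stimulus design in blocks, then convolve each column."""
    permuted = toy_shuffle_blocks(stim_design, block_size, rng)
    n_time = permuted.shape[0]
    # Convolve every column and truncate back to the original length
    columns = [np.convolve(permuted[:, j], kernel)[:n_time]
               for j in range(permuted.shape[1])]
    return np.column_stack(columns)

# Hypothetical 12-sample, 2-category stimulus design and toy kernel
toy_stim_design = np.zeros((12, 2))
toy_stim_design[0:3, 0] = 1.0
toy_stim_design[6:9, 1] = 1.0
toy_kernel = np.array([0.5, 1.0, 0.5])

toy_perm = toy_permuted_response_design(toy_stim_design, toy_kernel, 3, rng)
print(toy_perm.shape)
```

Shuffling the stimulus design before the convolution (rather than shuffling the response design afterwards) keeps the smooth shape of each simulated response intact.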

In [None]:
# b)


**c)** [2.5pts]  In a for loop that iterates `n_permutations` times, count the iterations with the name `i`, and do the following once per iteration: 
1. Use `get_permuted_response_design` (from part **b)**) to obtain a permuted response design.
2. Then perform a linear regression on the permuted design matrix and `data`. 
3. Compute the weight contrast **Faces - Places** for all cortical voxels, and store it in the `i-th` row of `all_permuted_contrasts`.

Note: This should take around 20s

In [None]:
# c)


**d)** [2.5pts] Compute an unpermuted **Faces - Places** contrast using the linear regression you fit above.

**Note:** If in the above exercise you *overwrote* the name of your linear regression object or the weights matrix from before, make sure you recompute it here, or use different names in (c).

In [None]:
# d)



**e)** [2.5pts] For each voxel, count how many permuted contrast values are above the unpermuted contrast value (this is done using an array `>` operation and a summation along the correct axis). Use this count and the total number of permutations to compute the p-value. Make sure that you adjust it so that the p-value for a contrast value that is above all the permutations does not come out equal to 0.
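
The counting step can be sketched on synthetic numbers. The add-one adjustment used here is one common convention for keeping p-values above zero, chosen as an assumption for this sketch; the arrays and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_perm, n_vox = 100, 4
# Hypothetical null contrasts, shape (permutations, voxels)
toy_null = rng.standard_normal((n_perm, n_vox))
# Hypothetical observed contrasts; the last one exceeds every permutation
toy_observed = np.array([0.0, 1.0, -1.0, 10.0])

# Count, per voxel, the permutations whose contrast exceeds the observed one.
# Broadcasting compares each row of toy_null against toy_observed.
n_above = (toy_null > toy_observed).sum(axis=0)

# Add-one adjustment keeps every p-value strictly positive
toy_p = (n_above + 1) / (n_perm + 1)

print(toy_p)
```

Summing along `axis=0` collapses the permutation dimension and leaves one count per voxel.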


In [None]:
# e)


**f)** [2.5pts] Display the negative log of the p-values on a flatmap. (Remember: `s03` and `category_localizer`)

In [None]:
# f)


**g)** [2.5pts] Extract the permutation distribution for voxel `1983` from `all_permuted_contrasts`. Plot a histogram of it with 20 bins and insert a vertical line using `plt.vlines` for the true value of the same voxel in the faces/places contrast (see HW10 for a refresher on `vlines`).

In [None]:
# g)



# Exercise 5: Encoding Model Analysis

In this exercise you will analyze the response amplitudes from the event-related design experiment we used in the last lecture of the semester. You will do the analysis with an encoding model whose features come from **Gabor filters**. Gabor filters are Gaussians modulated by a sine wave, and are defined by their spatial frequency, orientation and location (among other things). They are used to find contrast differences of varying frequencies in images (akin to finding edges in the image). Below, the left image shows what a collection of Gabor filters looks like, and the right image shows filters applied to an image of a face:
<div>
<img src='figures/GaborFilters.jpg' width=300px align='left'>
<img src='figures/GaborFace.jpg' width=300px align='right'>
</div>

Many visual neuroscientists believe that many of the neurons in early visual area V1 process information in a way similar to a Gabor filter. For this reason, encoding models that consist of the output of many Gabor filters are used to model early visual regions.

You will now fit an encoding model whose features were created from a bank of 436 Gabor filters. 

Before starting you will need to download 3 files for use in this exercise. Run the cell below to start the download.

In [None]:
# download the files needed for this section.
data_filename_ER = '/home/jovyan/s04_color_natimes_data.npy'
gabor_designmat_filename = '/home/jovyan/gabor_wavelet_features.npy'
pred_acc_19_filename = '/home/jovyan/pred_acc_sem19_test.npy'

# download the data we'll be using today
_ = urllib.request.urlretrieve('https://berkeley.box.com/shared/static/z5ifpxggy2b6a34v3pgxmp7ylm1qa94s.npy', data_filename_ER)
_ = urllib.request.urlretrieve('https://berkeley.box.com/shared/static/5esxf65woj640ke56kxlbua4moltvdbq.npy', gabor_designmat_filename)
_ = urllib.request.urlretrieve('https://berkeley.box.com/shared/static/s2cjg21623q2ln5ae1f7szebim20s6gx.npy', pred_acc_19_filename)

**a) Load the Data**

You've just downloaded 3 numpy files (`.npy` file extension). Load two of them, containing the response amplitude data (from which we have removed `nan` values for you and performed z-scoring) and the Gabor filter design matrix for the event-related design experiment. The filenames for each are stored in the names `data_filename_ER` and `gabor_designmat_filename` used in the above cell to download them. Store the data into `data_ER` and the design matrix into `designmat_gabor`. Look at the shape of both. Both should be 2D matrices. The first 1260 stimuli are for training, and the last 126 are for testing (1386 total). There are response amplitudes for all cortical voxels. The design matrix has 436 features for each of the 1386 stimuli presented in the experiment.

Then split both the data (response amplitudes) and design matrix into a training and test set. Use the first 1260 stimuli for training, and the rest for testing. Call them `data_ER_train`, `data_ER_test` and `designmat_gabor_train`, `designmat_gabor_test`.
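
The split itself is plain slicing along the first (stimulus) axis. A minimal sketch on a made-up array with the same number of rows (the two-column `toy_design` is hypothetical):

```python
import numpy as np

n_train = 1260

# Hypothetical array with one row per stimulus (1386 rows, 2 columns)
toy_design = np.arange(1386 * 2).reshape(1386, 2)

# First n_train rows for training, the remaining rows for testing
toy_train = toy_design[:n_train]
toy_test = toy_design[n_train:]

print(toy_train.shape, toy_test.shape)
```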

**b) Fit the Encoding Model**

Now use a `LinearRegression` model to fit the training design matrix to the training response amplitudes. Call the model you create `lr_gabor`.

**c) Calculate Prediction Accuracy**

Now we'll quantify how well the Gabor filter model fits all of the cortical voxels. 

To do so, use the model you just fit to predict response amplitude values of the test set (remember you need to use a design matrix to predict, not the data itself). 

Then calculate the prediction accuracy between the predictions of the test set and actual response amplitudes of the test set. Use the `correlate` function we provided at the top of this notebook to do this - it is memory-efficient. Store the prediction accuracy values in a name called `pred_acc_gabor_test`.

Finally, we want to set the voxels where there was movement to `nan`, so they won't be displayed in the flatmap you'll make in the next step. To do this, set all of the values in the prediction accuracy vector that are above `.99` to `np.nan` using masking. This works because we happen to know that we won't get a prediction accuracy as high as `.99` in any of the real data, only the voxels where there actually was no data.
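
The correlate-then-mask idea can be sketched on synthetic data. This example uses a small stand-in function (`toy_columnwise_corr`) rather than the notebook's `correlate` helper, and the arrays are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical test responses and predictions, shape (stimuli, voxels)
toy_actual = rng.standard_normal((30, 3))
toy_pred = toy_actual + rng.standard_normal((30, 3))
# Make the last voxel a perfect (and therefore suspicious) match
toy_pred[:, 2] = toy_actual[:, 2]

def toy_columnwise_corr(a, b):
    """Pearson correlation between matching columns of a and b."""
    a_z = (a - a.mean(axis=0)) / a.std(axis=0)
    b_z = (b - b.mean(axis=0)) / b.std(axis=0)
    return np.mean(a_z * b_z, axis=0)

toy_acc = toy_columnwise_corr(toy_pred, toy_actual)

# Boolean masking: replace implausibly high values with nan
toy_acc[toy_acc > 0.99] = np.nan

print(toy_acc)
```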

**d) Compare Model Performance**

We learned that encoding models lend themselves very well to model comparisons. Now you'll compare the performance of the 19 semantic category model that we used in the last lecture to the performance of the gabor model you just fit. The third file you downloaded at the beginning of this exercise contains the prediction accuracy vector for the 19 semantic category model. Load the numpy file that is located in the filename stored in `pred_acc_19_filename` and store the result into `pred_acc_19_test`.

Now use a scatter plot to compare the performance of the 19 semantic category model with the Gabor filter model. Draw a line that shows where both models perform the same using the command: `plt.plot([-.4,.8], [-.4,.8])`. Then make both the x and y axis have equal units using `plt.axis('square')`. Finally, label the x and y axes with the names of the model being plotted on each.

**e) Plot Prediction Accuracy Flatmaps**

Now plot flatmaps of the Gabor filter and 19 semantic category prediction accuracy. Use a `'hot'` colormap, a `vmin` value of 0, and determine what the `vmax` value should be based on the size of the largest prediction accuracy scores you saw in the scatter plot above. 

**Remember that the `cortex.Volume` you need to create takes the arguments for subject and transform, which should be `subject='s04'` and `xfmname='color_natims'` for this data.**

Finally, submit your assignment once you're done! You only get one submission, so be sure you've looked over your exam before submitting!    

In [None]:
_ = ok.submit()