# MDMS time segment matching example

We fit the MDMS model on several datasets and test it on a completely unseen dataset to assess the quality of transfer learning between fMRI datasets. As a toy example, we use only three datasets here; using more datasets in your own experiments may give better results.

## Single node example

### Import

In [None]:
%matplotlib inline
import numpy as np
from scipy import stats  # the scipy.stats.stats namespace is deprecated
from brainiak.fcma.util import compute_correlation
import pickle as pkl
from brainiak.funcalign.mdms import MDMS, Dataset

### Parameters

In [None]:
features = 75 # number of features, k
n_iter = 30 # number of iterations of EM

### Documentation of MDMS and Dataset

In [None]:
help(MDMS)
help(Dataset)

### Load datasets

In [None]:
with open('data/multi_dataset.pickle','rb') as f:
    data = pkl.load(f)

### Two dataset structure options

Datasets can be organized in two ways:

1) A dict of lists of 2D arrays, where data[d] is the list of subject data in dataset d and d is the name of the dataset.
Element i of the list has shape=[voxels_i, samples_d] and holds the fMRI data of the i'th subject in d.

- If datasets are in this format, you still need a JSON file (or several JSON files) describing the datasets.  
Each JSON file should contain a dict or a list of dicts, where each dict describes one dataset. 
Each dict must have the keys 'dataset', 'num_of_subj', and 'subjects', where 'dataset' is the name of the dataset, 
'num_of_subj' is the number of subjects in the dataset, and 'subjects' is a list of subject names (strings) 
in the same order as the subjects appear in the dataset. 

- Example of a JSON file: 
    [{"dataset":"MyData","num_of_subj":3,"subjects":["Adam","Bob","Carol"]}, 
    {"dataset":"MyData2","num_of_subj":2,"subjects":["Tom","Bob"]}]
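
As a minimal sketch of how such a file could be produced, the snippet below writes the example structure with Python's standard `json` module and reads it back. The file name `my_datasets.json` and the subject names are hypothetical; only the three keys described above come from this notebook.

```python
import json
import os
import tempfile

# Hypothetical dataset descriptions, mirroring the example above
datasets_info = [
    {"dataset": "MyData", "num_of_subj": 3, "subjects": ["Adam", "Bob", "Carol"]},
    {"dataset": "MyData2", "num_of_subj": 2, "subjects": ["Tom", "Bob"]},
]

# Sanity check: the declared subject count must match the subject list
for info in datasets_info:
    assert info["num_of_subj"] == len(info["subjects"])

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "my_datasets.json")  # hypothetical file name
    # json.dump writes double-quoted strings, which is what valid JSON requires
    with open(json_path, "w") as f:
        json.dump(datasets_info, f, indent=2)
    with open(json_path) as f:
        loaded = json.load(f)

# Subjects that appear in both datasets link the datasets together
shared = set(loaded[0]["subjects"]) & set(loaded[1]["subjects"])
print(shared)
```

A file written this way could then be passed to `Dataset`, as this notebook does with `data/multi_dataset.json`.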

In [None]:
# Our loaded datasets are in this format; let's inspect their structure
print ('------ Datasets in data (dict keys) ------')
print (list(data.keys()))
print ('------ Type and length of the sherlock dataset (number of subjects) ------')
print (type(data['sherlock']))
print (len(data['sherlock']))
print ('------ Shape of one subject in the sherlock dataset (voxel x sample) ------')
print (data['sherlock'][0].shape)
print ('The datasets are masked to regions of interest (ROI) : Default Mode Network (DMN)')

In [None]:
# The corresponding JSON file
! cat data/multi_dataset.json

# convert the JSON file to a Dataset object
ds_struct = Dataset('data/multi_dataset.json')

# display information of datasets
print ('------ Number of datasets ------')
print (ds_struct.num_dataset)
print ('------ Number of subjects ------')
print (ds_struct.num_subj)
print ('------ Visualize connectivity between datasets: datasets as nodes, number of shared subjects as edges ------')
ds_struct.visualize_graph()

2) A dict of dicts of 2D arrays, where data[d][s] has shape=[voxels_s, samples_d] and holds the fMRI 
data of subject s in dataset d; s is the name of the subject and d is the name of the dataset.

- If the datasets are in this format, you don't need a JSON file to define the dataset structure. MDMS can infer the 
structure from the data.

- Example: data['sherlock'] = {'s1': matrix1, 's4': matrix2}, where matrix1 and matrix2 are voxels x samples matrices.
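
If you have data in the first format together with the subject names, converting it to the second format is straightforward. The sketch below uses random arrays and hypothetical dataset and subject names; it only illustrates the two layouts and does not call MDMS.

```python
import numpy as np

rng = np.random.default_rng(0)

# Format 1: dict of lists, one (voxels_i x samples_d) array per subject.
# Voxel counts may differ between subjects; sample counts match within a dataset.
data_list = {"MyData": [rng.standard_normal((5, 10)), rng.standard_normal((7, 10))]}

# Hypothetical subject names, in the same order as the arrays in the list
subjects = {"MyData": ["Adam", "Bob"]}

# Format 2: dict of dicts, keyed by dataset name and then by subject name
data_dict = {
    ds: dict(zip(subjects[ds], arrays))
    for ds, arrays in data_list.items()
}

print(data_dict["MyData"]["Bob"].shape)  # (7, 10)
```

The conversion does not copy the arrays, so both structures refer to the same underlying data.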

### Separate training and testing data and zscore data

In [None]:
# We use ['greeneye', 'sherlock'] to train and 'milky' to test. Note that we only test on subjects that are
# in at least one training dataset as well.

# get info of test data 
test_ds = 'milky'
test_subj_list = ds_struct.subj_in_dataset[test_ds]
test_data = data[test_ds]

# remove the test dataset from the dataset structure; 'data' itself is left unchanged because MDMS only trains on datasets in ds_struct
_ = ds_struct.remove_dataset([test_ds])

# remove subjects in test_ds that are not in any training dataset
train_subj = set(ds_struct.get_subjects_list()) # all subjects in training set
test_subj_idx_to_keep = [] # index of subjects to keep
for idx, subj in enumerate(test_subj_list):
    if subj in train_subj:
        test_subj_idx_to_keep.append(idx)
test_subj_list = [test_subj_list[idx] for idx in test_subj_idx_to_keep]
test_data = [test_data[idx] for idx in test_subj_idx_to_keep]

# compute the per-voxel mean and std of each subject from the training data, then use them to standardize both training and testing data
mean, std = {}, {} # mean and std of each subject
matrix_csr = ds_struct.matrix.tocsr(copy=True)
for subj in range(ds_struct.num_subj): # iterate through all subjects
    subj_name = ds_struct.idx_to_subject[subj]
    indices = matrix_csr[subj,:].indices # indices of datasets with this subject
    # aggregate all data from this subject
    for idx, ds_idx in enumerate(indices):
        if idx == 0:
            mtx_tmp = data[ds_struct.idx_to_dataset[ds_idx]][ds_struct.dok_matrix[subj,ds_idx]-1]
        else:
            mtx_tmp = np.concatenate((mtx_tmp, data[ds_struct.idx_to_dataset[ds_idx]][ds_struct.dok_matrix[subj,ds_idx]-1]),axis=1)
    # compute mean and std
    mean[subj_name] = np.mean(mtx_tmp, axis=1)
    std[subj_name] = np.std(mtx_tmp, axis=1)
    # standardize training data
    for ds_idx in indices:
        ds_name, idx_in_ds = ds_struct.idx_to_dataset[ds_idx], ds_struct.dok_matrix[subj,ds_idx]-1
        data[ds_name][idx_in_ds] = np.nan_to_num((data[ds_name][idx_in_ds]-mean[subj_name][:,None])/std[subj_name][:,None])
        
# use the mean and std computed from training data to standardize testing data
for idx, subj in enumerate(test_subj_list):
    test_data[idx] = np.nan_to_num((test_data[idx]-mean[subj][:,None])/std[subj][:,None])
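
The standardization step can be seen in isolation on a toy example. The sketch below uses synthetic arrays (not the notebook's data) to show that the statistics come from the training data only, and that `np.nan_to_num` turns the 0/0 result of a constant voxel into 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic voxels x samples matrices for one subject
train = rng.standard_normal((4, 50)) * 3.0 + 10.0
test = rng.standard_normal((4, 20)) * 3.0 + 10.0
train[2, :] = 5.0  # a constant voxel has zero standard deviation
test[2, :] = 5.0

# Per-voxel statistics are computed from the training data only
mean = train.mean(axis=1)
std = train.std(axis=1)

# Broadcasting with [:, None] applies each voxel's statistics across its samples;
# errstate silences the expected 0/0 warning for the constant voxel
with np.errstate(divide="ignore", invalid="ignore"):
    train_z = np.nan_to_num((train - mean[:, None]) / std[:, None])
    test_z = np.nan_to_num((test - mean[:, None]) / std[:, None])

print(train_z[2, :3], test_z[2, :3])
```

Note that `np.nan_to_num` also maps infinities to very large finite numbers, so a voxel that is constant in training but varies at test time would produce extreme values rather than zeros.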

### Fit MDMS model

In [None]:
model = MDMS(features=features, n_iter=n_iter)

# Two ways to fit the model based on two dataset structure options
# 1) When data is a dict of list of 2D arrays (as in our case), you need to have the ds_struct built from JSON files. 
# But you have the flexibility to keep testing data in 'data' as well, and MDMS will only train on data in ds_struct.
model.fit(data, ds_struct)

# 2) When data is a dict of dict of 2D arrays, you don't need the ds_struct, but you need to remove all data not meant 
# to be used during the training phase.
# model.fit(data)  # uncomment this line if you have this kind of dataset structure

### Time Segment Matching Experiment

In [None]:
# This experiment is an easy sanity check of the quality of fitting. The higher the accuracy, the better.
def time_segment_matching(data, win_size=6): 
    nsubjs = len(data)
    (ndim, nsample) = data[0].shape
    accu = np.zeros(shape=nsubjs)
    nseg = nsample - win_size 
    # mysseg prediction
    trn_data = np.zeros((ndim*win_size, nseg),order='f')
    # trn_data also includes the test subject's data, which is subtracted
    # when calculating A
    for m in range(nsubjs):
        for w in range(win_size):
            trn_data[w*ndim:(w+1)*ndim,:] += data[m][:,w:(w+nseg)]
    for tst_subj in range(nsubjs):
        tst_data = np.zeros((ndim*win_size, nseg),order='f')
        for w in range(win_size):
            tst_data[w*ndim:(w+1)*ndim,:] = data[tst_subj][:,w:(w+nseg)]

        A =  np.nan_to_num(stats.zscore((trn_data - tst_data),axis=0, ddof=1))
        B =  np.nan_to_num(stats.zscore(tst_data,axis=0, ddof=1))

        # compute correlation matrix
        corr_mtx = compute_correlation(B.T,A.T)
    
        for i in range(nseg):
            for j in range(nseg):
                if abs(i-j)<win_size and i != j :
                    corr_mtx[i,j] = -np.inf
        max_idx =  np.argmax(corr_mtx, axis=1)
        accu[tst_subj] = np.mean(max_idx == np.arange(nseg))

    return accu
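
To build intuition for what the function measures, here is a simplified, self-contained sketch of the same idea on synthetic data. It is not the function above: `np.corrcoef` stands in for `compute_correlation`, the explicit z-scoring is omitted, and the helper names `stack_windows` and `tsm_accuracy` are hypothetical.

```python
import numpy as np

def stack_windows(x, win_size):
    """Stack win_size consecutive time points into one column per segment."""
    nseg = x.shape[1] - win_size
    return np.vstack([x[:, w:w + nseg] for w in range(win_size)])

def tsm_accuracy(data, win_size=6):
    """Leave-one-subject-out time segment matching accuracy per subject."""
    stacked = [stack_windows(x, win_size) for x in data]
    total = np.sum(stacked, axis=0)  # sum of all subjects' segments
    nseg = total.shape[1]
    idx = np.arange(nseg)
    # Segments that overlap the target in time (but are not the target itself)
    overlap = (np.abs(idx[:, None] - idx[None, :]) < win_size) & (idx[:, None] != idx[None, :])
    accu = np.zeros(len(data))
    for s, tst in enumerate(stacked):
        ref = total - tst  # sum over the other subjects only
        # Rows: test segments; columns: reference segments
        corr = np.corrcoef(tst.T, ref.T)[:nseg, nseg:]
        corr[overlap] = -np.inf  # exclude overlapping segments from the match
        accu[s] = np.mean(np.argmax(corr, axis=1) == idx)
    return accu

# Synthetic subjects sharing one response plus a little independent noise
rng = np.random.default_rng(0)
shared = rng.standard_normal((10, 60))
data = [shared + 0.1 * rng.standard_normal((10, 60)) for _ in range(3)]
accu = tsm_accuracy(data, win_size=6)
print(accu)
```

Because the synthetic subjects share almost the same signal, nearly every segment should be matched to its correct time position; with unrelated signals the accuracy would fall toward chance.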

### Transform testing data

In [None]:
# transform the data
transformed = model.transform(test_data, test_subj_list) # test_subj_list[i] is the name of the subject whose data is test_data[i]

# zscore the transformed data
for subj in range(len(transformed)):
    transformed[subj] = stats.zscore(transformed[subj], axis=1, ddof=1)

### Run the experiment

In [None]:
accu = time_segment_matching(transformed)
accu_mean = np.mean(accu)
accu_se = stats.sem(accu)
print ('Accuracy is {} +- {}'.format(accu_mean, accu_se))
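
The reported uncertainty is the standard error of the mean across test subjects. As a small illustration with made-up accuracies (not results from this notebook), `stats.sem` with its default `ddof=1` equals the sample standard deviation divided by the square root of the number of subjects:

```python
import numpy as np
from scipy import stats

# Hypothetical per-subject accuracies, for illustration only
accu_example = np.array([0.62, 0.58, 0.71, 0.66])

# Standard error of the mean computed by hand and with SciPy
sem_manual = accu_example.std(ddof=1) / np.sqrt(len(accu_example))
sem_scipy = stats.sem(accu_example)

print('Accuracy is {:.3f} +- {:.3f}'.format(accu_example.mean(), sem_scipy))
```

With only a handful of test subjects the standard error is itself a rough estimate, so small differences between experiments should be read with care.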

### Compare the result with single dataset case 

We then train MDMS using only one dataset, 'greeneye', and test on 'milky' to show that the extra training dataset ('sherlock') indeed transfers useful information. Note that the number of testing subjects is the same as in the previous experiment.

In [None]:
# load data and JSON file again
with open('data/multi_dataset.pickle','rb') as f:
    data = pkl.load(f)
ds_struct = Dataset('data/multi_dataset.json')    
    
# get info of test data 
test_ds = 'milky'
test_subj_list = ds_struct.subj_in_dataset[test_ds]
test_data = data[test_ds]

# remove test dataset from the dataset structure
_ = ds_struct.remove_dataset([test_ds])

# remove the other training dataset so that only 'greeneye' is used for training
_ = ds_struct.remove_dataset(['sherlock'])

# remove subjects in test_ds that are not in any training dataset
train_subj = set(ds_struct.get_subjects_list()) # all subjects in training set
test_subj_idx_to_keep = [] # index of subjects to keep
for idx, subj in enumerate(test_subj_list):
    if subj in train_subj:
        test_subj_idx_to_keep.append(idx)
test_subj_list = [test_subj_list[idx] for idx in test_subj_idx_to_keep]
test_data = [test_data[idx] for idx in test_subj_idx_to_keep]

# compute the per-voxel mean and std of each subject from the training data, then use them to standardize both training and testing data
mean, std = {}, {} # mean and std of each subject
matrix_csr = ds_struct.matrix.tocsr(copy=True)
for subj in range(ds_struct.num_subj): # iterate through all subjects
    subj_name = ds_struct.idx_to_subject[subj]
    indices = matrix_csr[subj,:].indices # indices of datasets with this subject
    # aggregate all data from this subject
    for idx, ds_idx in enumerate(indices):
        if idx == 0:
            mtx_tmp = data[ds_struct.idx_to_dataset[ds_idx]][ds_struct.dok_matrix[subj,ds_idx]-1]
        else:
            mtx_tmp = np.concatenate((mtx_tmp, data[ds_struct.idx_to_dataset[ds_idx]][ds_struct.dok_matrix[subj,ds_idx]-1]),axis=1)
    # compute mean and std
    mean[subj_name] = np.mean(mtx_tmp, axis=1)
    std[subj_name] = np.std(mtx_tmp, axis=1)
    # standardize training data
    for ds_idx in indices:
        ds_name, idx_in_ds = ds_struct.idx_to_dataset[ds_idx], ds_struct.dok_matrix[subj,ds_idx]-1
        data[ds_name][idx_in_ds] = np.nan_to_num((data[ds_name][idx_in_ds]-mean[subj_name][:,None])/std[subj_name][:,None])
        
# use the mean and std computed from training data to standardize testing data
for idx, subj in enumerate(test_subj_list):
    test_data[idx] = np.nan_to_num((test_data[idx]-mean[subj][:,None])/std[subj][:,None])
         
# fit the model
model2 = MDMS(features=features, n_iter=n_iter)
model2.fit(data, ds_struct)

# transform test data
transformed = model2.transform(test_data, test_subj_list)

# zscore the transformed data
for subj in range(len(transformed)):
    transformed[subj] = stats.zscore(transformed[subj], axis=1, ddof=1)
    
# run the experiment
accu = time_segment_matching(transformed)
accu_mean = np.mean(accu)
accu_se = stats.sem(accu)
print ('Accuracy is {} +- {}'.format(accu_mean, accu_se))

The accuracy is lower than in the previous experiment, which suggests that the extra training dataset ('sherlock') does transfer useful information.

### Save model

In [None]:
# We now save the fitted model into a file
filename = 'model.pkl'
model.save(filename)

# To restore the model from the file, use:
# model_restored = MDMS()
# model_restored.restore(filename)

## Multi node example

Please see ``mdms_time_segment_matching_distributed.py`` for details. Run the following command to launch it with 4 MPI processes.

In [None]:
! mpirun -n 4 python3 mdms_time_segment_matching_distributed.py