# Tools for Decomposing Functional Connectivity Maps

This notebook contains functions for decomposing functional connectivity maps into components using PCA, Dictionary Learning, or ICA as implemented in scikit-learn. The only non-standard package required to run these scripts is nibabel, which can be installed without sudo by running "pip install nibabel --user" on the command line. This code was developed and tested with Python 3.6.

See the text below the functions to get more information on usage/functionality.

In [1]:
import nibabel as nib
import numpy as np
import glob
import os
import pandas as pd
from sklearn import decomposition
from scipy.stats import zscore
import json

In [8]:
def load_conns(path_or_list, parcel_ids_path=None):
    """
    #Function to assist in loading connectivity values.
    #This function supports a few different functionalities.
    #The input "path_or_list" can either be a path or list to
    #supported data types.
    
    #Supported data types include (1) pscalar files, (2) csv
    #files, (3) numpy arrays. In the (1) pscalar case, path_or_list
    #can either be a path to a folder containing ONLY the pscalar files
    #you want to load, or a list of paths to the individual files.
    #In the (2) csv case, path_or_list should be a string
    #that points to a csv file where the first column is named 'subject_id',
    #and the subsequent columns are named for the connectivity edges
    #whose values fill the rows (one row per subject). In the (3) numpy case,
    #similar to the (1) case, 'path_or_list' can either be a list of paths
    #to *.npy files or a path to a folder containing ONLY *.npy files
    #which are arrays to represent connectivity, AND optionally
    #a SINGLE *.txt file with one entry per line that specifies names for
    #different connectivity edges. The *.txt file is optional.
    
    #This function will then output a matrix shape <num subjects, num edges>,
    #a list of parcel_labels (if found), and a template cifti file that can
    #be used to help create visualizations (if pscalar.nii input option is
    #used)
    """
    
    starting_dir = os.getcwd()
    
    conns_list = []
    files_list = []
    template_cifti_path = None
    load_csv = False
    
    if type(path_or_list) == list:
        
        files_to_load = path_or_list
        subjects = files_to_load
        
    if type(path_or_list) == str:
        
        if path_or_list[-3:] == 'csv':
            
            load_csv = True
        
        else:
            
            os.chdir(path_or_list)
            nifti_files = glob.glob('*.nii')
            npy_files = glob.glob('*.npy')
            
            if len(nifti_files) > 0:
                
                files_to_load = nifti_files
                
            else:
                
                files_to_load = npy_files
                
            subjects = []
            for temp_file in files_to_load:
                subjects.append(temp_file.split('/')[-1])
            
    
    
    if load_csv:
        
        df = pd.read_csv(path_or_list)
        df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
        subjects = df.subject_id.values
        parcel_labels = df.columns[1:].values
        conn_inds = df.columns[1:]
        subj_conn_vals = df[conn_inds].to_numpy()
        
    else:
        
        conn_list = []
        
        if files_to_load[0][-3:] == 'nii':
            
            #Iterate through all cifti files
            for i, temp_file_path in enumerate(files_to_load):
                
                temp_cifti = nib.load(temp_file_path)
                
                #Grab the parcel labels from cifti header
                if i == 0:
                    
                    parcel_labels = temp_cifti.header.get_axis(1).name
                    #Use an absolute path so this also works when path_or_list is a list
                    template_cifti_path = os.path.abspath(temp_file_path)
                  
                conn_list.append(temp_cifti.get_fdata())
            
        else:
            
            for temp_file in files_to_load:
                
                temp_out = np.load(temp_file)
                conn_list.append(temp_out.copy())
                            
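            #If no labels file was given, look for one in the input folder
            if parcel_ids_path is None and isinstance(path_or_list, str):
                
                txt_files = glob.glob('*txt')
                if len(txt_files) > 0:
                    parcel_ids_path = txt_files[0]
            
            if parcel_ids_path is not None:
                
                with open(parcel_ids_path, 'r') as parcel_file:
                    parcel_labels = parcel_file.read().splitlines()
                    
            else:
                
                #The labels file is optional, so fall back to generic edge names
                parcel_labels = ['Connection_' + str(i) for i in range(conn_list[0].size)]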
                
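        #Convert list to numpy matrix   
        subj_conn_vals = np.vstack(conn_list)
    
    #Return to the directory we started in (os.chdir may have been called above)
    os.chdir(starting_dir)
    
    return subj_conn_vals, parcel_labels, subjects, template_cifti_path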


def make_pscalar_cifti(data_array, template_cifti_path, output_cifti_path, dimension_names=None):
    """
    #Function that creates a cifti pscalar.nii file for
    #visualizing overlays. data_array should have a shape
    #<n_dimensions, n_regions> where n_dimensions is at least
    #1. template_cifti_path is the path to a cifti file (either
    #pscalar or ptseries works) that has the same parcel definitions
    #as desired for visualizing the data_array. Output_cifti_path
    #specifies the name of the file to be saved (should end in
    #*.pscalar.nii). dimension_names is an optional list that will serve
    #as the name of different dimensions (i.e. PC1, PC2, etc.). If
    #specified, dimension_names should have one element for each dimension
    #(not region) in data_array. Function saves new file to output path,
    #doesn't return anything
    """
    
    #Set dimension names if not specified
    if dimension_names is None:
        dimension_names = ['NONAME']*data_array.shape[0]
    
    #Load template cifti file
    template_cifti = nib.load(template_cifti_path)
    
    #Copy/generate axis for new cifti file
    pscalar_axis_1 = template_cifti.header.get_axis(1)
    pscalar_axis_0 = nib.cifti2.cifti2_axes.ScalarAxis(dimension_names)
    
    #Put the axes into a header
    cifti_hdr = nib.cifti2.cifti2_axes.to_header([pscalar_axis_0, pscalar_axis_1])
    
    #put the data array and axes into a new image
    cifti_image = nib.cifti2.cifti2.Cifti2Image(dataobj=data_array, header=cifti_hdr)
    
    #Save the cifti image
    nib.save(cifti_image, output_cifti_path)
    
    return

def format_subjects_conns_as_csv(output_path, subject_ids, connectivity_matrix, connection_headers=None):
    """
    #Connectivity matrix should be formatted as subjects x connections.
    #Connection headers specify the headers to be placed in the first row,
    #otherwise will be auto-populated. Subject IDs will be set as first column
    """
    
    with open(output_path, 'w') as output_file:
        
        #Auto-populate the connection headers if none were given
        if connection_headers is None:
            connection_headers = ['Connection_' + str(i) for i in range(connectivity_matrix.shape[1])]
        
        #Header row: subject ID column followed by one column per connection
        output_file.write('subject_ids,' + ','.join(connection_headers) + '\n')
        
        #One row per subject, with no trailing comma so rows line up with the header
        for i in range(connectivity_matrix.shape[0]):
            row_values = [str(value) for value in connectivity_matrix[i,:]]
            output_file.write(str(subject_ids[i]) + ',' + ','.join(row_values) + '\n')

def calc_variance_explained(data_matrix, trained_decomposition_object):
    """
    #Function that takes as input a trained decomposition object from sklearn
    #and data compatible with the trained decomposition object, and calculates
    #the variance explained for each component. The function outputs two variables: (1)
    #components_variance_explained, which is very close to the explained_variance_
    #attribute of sklearn's PCA object, and (2) components_variance_explained_ratio,
    #which is calculated relative to the total variance across examples
    #(summed per feature), not relative to the sum of variance explained
    #by all components
    """
    
    num_components = trained_decomposition_object.components_.shape[0]
    
    components_variance_explained = np.zeros(num_components)
    components_variance_explained_ratio = np.zeros(num_components)
    
    #Total variance across examples, summed over features
    original_variance = np.sum(np.var(data_matrix,axis=0))

    comp_scores = trained_decomposition_object.transform(data_matrix)
    
    for component_number in range(num_components):
    
        nth_comp_scores = np.reshape(comp_scores[:,component_number], (data_matrix.shape[0],1))
        nth_comp_weights = np.reshape(trained_decomposition_object.components_[component_number,:], (1,data_matrix.shape[1]))

        nth_component_in_orig_space = np.matmul(nth_comp_scores,nth_comp_weights)
        difference = data_matrix - nth_component_in_orig_space
        subtracted_variance = np.sum(np.var(difference,axis=0))
        components_variance_explained[component_number] = original_variance - subtracted_variance
        
    components_variance_explained_ratio = components_variance_explained/original_variance
    
    return components_variance_explained, components_variance_explained_ratio
            
                
            
def calc_connectivity_components(input_path_or_list, output_folder, parcel_ids_path=None, algorithm_type='PCA', n_components=None, z_score_prior_to_decomp=False):
    """
    #Function to take subjects' connectivity data (generally region to whole-brain)
    #and do a PCA across subjects, to produce a number of lower-dimension connectivity 
    #component 'scores' per subject. The function will save (1) subject_component_scores,
    #(2) component_weights (at least as a csv, but if pscalars used also projected onto the
    #cifti surface), (3) the amount of variance explained for each component (currently only
    #specified for PCA), (4) a csv file that is a tabulated copy of the input connectivity
    #data. This function only saves output, does not return anything.
    #
    #
    #DISCLAIMER: Current implementation of cifti output files is limited, such that when
    #you click on different areas of the cortex, wb_view will freeze and quit.. 
    #otherwise it works fine. Also remember the cifti file contains multiple dimensions...
    #Also, any Inf values will be set to 0
    #
    #
    #output_folder - the folder where the files generated from this function will be saved
    # (this function will make the folder if it doesn't exist)
    #
    #parcel_ids_path - optional (if using *npy files), this is the path to a text file that
    #will have one name for a connectivity edge per line
    #
    #algorithm_type - the type of sklearn algorithm to use, could be 'PCA', 'FastICA', or 'DictionaryLearning',
    #PCA will return components that are ordered in terms of the amount of variance they explain, and
    #'FastICA'/'DictionaryLearning' alternatively will be unordered
    #
    #n_components (optional) the number of components to be saved, defaults to None which returns the maximum amount.
    # probably will want < 10. REMEMBER, THE MAXIMUM NUMBER OF COMPONENTS IS GENERALLY SET BY THE NUMBER OF SUBJECTS,
    #IF YOU ONLY HAVE 10 SUBJECTS, YOU CAN ONLY GET 10 COMPONENTS. ALTERNATIVELY WHEN THERE ARE MORE SUBJECTS THAN
    #CONNECTIVITY EDGES, THE NUMBER OF COMPONENTS WILL BE SET BY THE NUMBER OF CONNECTIVITY EDGES
    #
    #z_score_prior_to_decomp - defaults to false, optionally you can z_score each edge across
    #subjects so that each edge is weighted equally in decomposition process
    
    """
    
    subj_conn_vals, parcel_labels, subjects, template_cifti_path = load_conns(input_path_or_list, parcel_ids_path=parcel_ids_path)
    
    #Set any edges that are infinite (positive or negative) to 0
    subj_conn_vals[np.isinf(subj_conn_vals)] = 0
    
    #If specified, z-score conn values so that
    #each edge is weighted evenly
    if z_score_prior_to_decomp:
        subj_conn_vals = zscore(subj_conn_vals,axis=0)
    
    if algorithm_type == 'PCA':
        
        decomposition_object = decomposition.PCA(n_components = n_components)
    
    elif algorithm_type == 'FastICA':
        
        decomposition_object = decomposition.FastICA(n_components = n_components)
    
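    elif algorithm_type == 'DictionaryLearning':
        
        decomposition_object = decomposition.DictionaryLearning(n_components = n_components)
        
    else:
        
        #Fail early with a clear message instead of a NameError further down
        raise ValueError("algorithm_type must be 'PCA', 'FastICA', or 'DictionaryLearning'")
        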
        
    decomposition_object.fit(subj_conn_vals)
    component_scores = decomposition_object.transform(subj_conn_vals)
    
    dimension_names = []
    for i in range(component_scores.shape[1]):
        dimension_names.append(algorithm_type + '_0' + str(i)) 
    
    
    #to do (1), make overlays, save component scores, save weights, make plot to show variance explained
    if not os.path.exists(output_folder):
        os.mkdir(output_folder)
    
    #Save the component scores for the subjects
    format_subjects_conns_as_csv(os.path.join(output_folder, 'subject_component_scores.csv'), subjects, component_scores, connection_headers=dimension_names)

    
    #Save the weights for different components to a csv
    data_array = decomposition_object.components_
    format_subjects_conns_as_csv(os.path.join(output_folder, 'component_weights.csv'), dimension_names, data_array, connection_headers=parcel_labels)
    
    #Write the weights for different components to a cifti file if a template is found
    if template_cifti_path is not None:
        make_pscalar_cifti(data_array, template_cifti_path, os.path.join(output_folder, 'component_weights.pscalar.nii'), dimension_names=dimension_names)
        
    
    #Calculate variance explained - this seems to be having some issues
    #outside of PCA decomposition.....
    variance_explained, variance_explained_ratio = calc_variance_explained(subj_conn_vals, decomposition_object)
    format_subjects_conns_as_csv(os.path.join(output_folder,'components_variance_explained.csv'), ['Variance_Explained', 'Variance_Explained_Ratio'], 
                                 np.vstack((variance_explained, variance_explained_ratio)), connection_headers=dimension_names)
    
    #Also save the raw connectivity values
    format_subjects_conns_as_csv(os.path.join(output_folder, 'original_conn_values.csv'), subjects, subj_conn_vals, connection_headers=parcel_labels)
    
    
    #Save the settings
    settings_dict = {'input_path_or_list' : input_path_or_list,
                     'output_folder' : output_folder,
                     'parcel_ids_path' : parcel_ids_path,
                     'algorithm_type' : algorithm_type,
                     'n_components' : n_components,
                     'z_score_prior_to_decomp' : z_score_prior_to_decomp}
    
    json_object = json.dumps(settings_dict, indent=4)
    json_path = os.path.join(output_folder, 'decomposition_settings.json')
    with open(json_path, 'w') as temp_file:
        temp_file.write(json_object)
    
    
    return   

# What does this code do?

As mentioned above, this code takes connectivity data (or really any 1d map) and reduces it to a number of components specified by the user, using either PCA, Dictionary Learning, or ICA from scikit-learn for computation. This is done using the function 'calc_connectivity_components', whose usage is described above and will also be shown below. After running 'calc_connectivity_components', a new folder will be created that contains (1) a csv of subject component scores (i.e. component strengths), (2) a csv of weights for the different components (i.e. what areas contribute to the components), (3) a csv of the amount of variance explained by the different components, (4) a json describing the settings used to run the 'calc_connectivity_components' command, and (5) a csv copy of the original connectivity data used to run the decomposition.

The 'calc_connectivity_components' function supports a few different formats for input data, but if the input data are in parcellated cifti format, then the function will also output the different component weights as a cifti overlay so they can be visualized in connectome workbench.
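
As a rough illustration of what the decomposition produces, the sketch below runs scikit-learn's PCA on synthetic data standing in for a <num subjects, num edges> matrix. The array sizes and random values are made up for illustration and are not taken from the data used in this notebook; the point is only the shape of the scores and weights.

```python
import numpy as np
from sklearn import decomposition

# Synthetic stand-in for connectivity data: 20 subjects x 50 edges (made-up values)
rng = np.random.default_rng(0)
subj_conn_vals = rng.standard_normal((20, 50))

# Fit a PCA and project each subject onto the components
pca = decomposition.PCA(n_components=5)
component_scores = pca.fit_transform(subj_conn_vals)   # <n_subjects, n_components>
component_weights = pca.components_                    # <n_components, n_edges>

print(component_scores.shape, component_weights.shape)
```

The scores correspond to what is saved in subject_component_scores.csv, and the weights to what is saved in component_weights.csv.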

# What can the input connectivity data look like?

The 'calc_connectivity_components' function takes an input variable, input_path_or_list, that can take a variety of forms and points the function to the connectivity data. It supports loading connectivity data from (1) cifti pscalar files, (2) numpy arrays (*npy files), or (3) a combined csv file. If pscalar or *npy files are used, input_path_or_list can either be a path to a directory containing only *pscalar.nii files or only *.npy files (plus the optional txt file described next), or a list of paths to the individual files you want to use. Since parcel labels can't be specified directly with numpy files, a file ending in txt can also be placed in the directory, where each line has the desired (correctly ordered) label for a connectivity edge. *npy files should have shape <n_edges>. If a csv file is used, the first column should be subject_id, and the header row should contain the edge labels.

As mentioned previously, component weights will only be output in cifti if the cifti input option is used.
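
The two non-cifti input layouts described above can be sketched as follows. This is a minimal example under assumptions: the subject IDs, edge labels, file names, and temporary directory are all hypothetical, and the values are random.

```python
import os
import tempfile
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
edge_labels = ['edge_A', 'edge_B', 'edge_C']   # hypothetical edge labels
subject_ids = ['sub-01', 'sub-02']             # hypothetical subject IDs
work_dir = tempfile.mkdtemp()

# Option 1: a folder with one 1D *.npy array per subject plus a single labels *.txt file
npy_dir = os.path.join(work_dir, 'npy_inputs')
os.mkdir(npy_dir)
for subject_id in subject_ids:
    np.save(os.path.join(npy_dir, subject_id + '.npy'),
            rng.standard_normal(len(edge_labels)))
with open(os.path.join(npy_dir, 'edge_labels.txt'), 'w') as labels_file:
    labels_file.write('\n'.join(edge_labels))

# Option 2: a single csv whose first column is 'subject_id', one row per subject
csv_path = os.path.join(work_dir, 'conn_values.csv')
df = pd.DataFrame(rng.standard_normal((len(subject_ids), len(edge_labels))),
                  columns=edge_labels)
df.insert(0, 'subject_id', subject_ids)
df.to_csv(csv_path, index=False)
```

Either npy_dir or csv_path could then be passed as input_path_or_list.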

# What parameters do I need to specify?

### The function to be run is shown as:

- calc_connectivity_components(input_path_or_list, output_folder, parcel_ids_path=None, algorithm_type='PCA', n_components=None, z_score_prior_to_decomp=False)

The two required inputs are input_path_or_list (described above) and output_folder (which is where the results will be stored - the function creates this folder if it doesn't exist).

### The remaining optional parameters are as follows:

- parcel_ids_path: only for when input_path_or_list points to *npy files, this is the path to label descriptors for each connectivity edge (should be formatted one label per line)
- algorithm_type: Defaults to PCA, can also be scikit-learn's implementation of FastICA or DictionaryLearning
- n_components: When set to None will return the maximum possible number of components (constrained by either the number of subjects or number of connectivity edges) but can alternatively be set to any given integer below the maximum (probably will be < 10 for PCA, and the number for ICA/DictionaryLearning may require fine tuning as different n_components values will presumably give different weights for all components)
- z_score_prior_to_decomp: If set to true, each connectivity edge will be demeaned and variance normalized across subjects such that each edge is weighted equally in the decomposition
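
To make the last two parameters concrete, here is a small sketch on synthetic data (the array sizes and scaling are arbitrary assumptions). It z-scores each edge across subjects, as z_score_prior_to_decomp does, and shows that leaving n_components as None gives min(n_subjects, n_edges) components for PCA.

```python
import numpy as np
from scipy.stats import zscore
from sklearn import decomposition

rng = np.random.default_rng(0)
# 8 subjects x 30 edges, with edges deliberately placed on different scales
data = rng.standard_normal((8, 30)) * np.linspace(1, 10, 30)

# z-score each edge (column) across subjects so every edge has mean 0 and unit variance
data_z = zscore(data, axis=0)

# With n_components=None, PCA keeps min(n_subjects, n_edges) components
pca = decomposition.PCA(n_components=None).fit(data_z)
print(pca.components_.shape)
```

Here there are fewer subjects than edges, so the number of subjects sets the number of components.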

# Known Issues

- The cifti files generated here have an issue where when opened in connectome workbench, if the user clicks on any of the cortical areas (i.e. to see the value of a vertex) workbench will crash... otherwise visualization seems to work fine
- A custom function was built to test the variance explained by different components (as scikit-learn only has this feature for PCA and not FastICA or DictionaryLearning), which is numerically equivalent to scikit-learn's PCA attributes (except for purposeful deviation when n_components < max), but is giving values that seem to be incorrect for DictionaryLearning and probably also FastICA
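
The subtraction approach used for variance explained can be checked against scikit-learn for the PCA case. The sketch below uses synthetic data (sizes are arbitrary assumptions) and the full set of components; it is a simplified stand-in for calc_variance_explained, not a copy of it. Note that the approach relies on components being orthogonal, which PCA guarantees but FastICA and DictionaryLearning generally do not, and that may be one reason the values look wrong for those algorithms.

```python
import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(0)
data = rng.standard_normal((15, 6))   # synthetic: 15 subjects x 6 edges

pca = decomposition.PCA(n_components=None).fit(data)
scores = pca.transform(data)

# Total variance across subjects, summed over edges
total_variance = np.sum(np.var(data, axis=0))

ratios = np.zeros(pca.components_.shape[0])
for k in range(pca.components_.shape[0]):
    # Rank-one contribution of component k in the original feature space
    contribution = np.outer(scores[:, k], pca.components_[k, :])
    # Variance left over once that contribution is subtracted out
    remaining_variance = np.sum(np.var(data - contribution, axis=0))
    ratios[k] = (total_variance - remaining_variance) / total_variance

# Compare against scikit-learn's own ratio for PCA
print(np.allclose(ratios, pca.explained_variance_ratio_))
```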

# Usage Example

In [9]:
path_to_folder_with_pscalars = '/home/lnpi15-raid7/lee-data/HCP_PREPROC_PHCPA/bilateral_thal_gsr_zcorrs_pscalars'
path_to_where_i_want_data_saved = '/home/lnpi15-raid7/lee-data/HCP_PREPROC_PHCPA/bilateral_thal_gsr_zcorrs_pca'
calc_connectivity_components(path_to_folder_with_pscalars, path_to_where_i_want_data_saved, n_components=10)

In [11]:
os.listdir(path_to_where_i_want_data_saved)

['original_conn_values.csv',
 'component_weights.csv',
 'decomposition_settings.json',
 'subject_component_scores.csv',
 'component_weights.pscalar.nii',
 'components_variance_explained.csv']