## Subbundle Model Analysis

At this point, we have built clustering models for each subject, session, and bundle. The results for each experiment (the feature selection and embedding, the choice of clustering algorithm, and the corresponding model hyperparameters) are saved to:

> s3://hcp-subbundle/`experiment_name`/`session_name`/`bundle_name`/`subject`/`n_clusters`/
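
As a rough sketch of how the prefix above is assembled, the helper below joins the five path components in order. The name `model_prefix` is hypothetical and is not part of the notebook's helper modules; only the bucket name and the component order are taken from the path shown above.

```python
import posixpath

BUCKET = "hcp-subbundle"  # bucket name taken from the path above


def model_prefix(experiment_name, session_name, bundle_name, subject, n_clusters):
    """Hypothetical helper: build the S3 prefix for one model folder."""
    key = posixpath.join(
        experiment_name, session_name, bundle_name, str(subject), str(n_clusters)
    )
    return f"s3://{BUCKET}/{key}/"


print(model_prefix("MASE_FA_and_MD_Sklearn_KMeans", "HCP_1200", "SLF_R", "103818", 3))
```

Using `posixpath` rather than `os.path` keeps the separators as forward slashes regardless of the local operating system.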

Each model folder contains the following:
- `bundle_name`.trk

  _original bundle tractography_ 
  
  For each scalar feature (_optionally_):
  - `scalar_name`.nii.gz

     _original scalar image_
     
  - adjacency\_`scalar_name`\_r2.npy
  
    _scalar coefficient of determination adjacency matrix_
  
  For each cluster ($0...\text{n_clusters}-1$):
  - `model_name`\_cluster\_`cluster_id`.trk
  
    _uncleaned cluster tractography_
    
  - `model_name`\_cluster\_`cluster_id`\_density\_map.nii.gz
  
    _corresponding density map_

  - `model_name`\_idx.npy
  
    _array of length `n_streamlines` with each element representing the `cluster_id` assigned to streamline at that index_
    
  - `model_name`\_info.pkl
  
    _pandas DataFrame containing metadata information about the model_
    
    - `subject` - model constructed using this subject's data
    
    - `session` - model constructed using this session's data
    
    - `bundle` - model constructed using this bundle's data
    
    - `algorithm` - which clustering algorithm was used (e.g. KMeans); used to interpret the score
    
    - `embedding dimensions` - if using spectral clustering, the embedding dimension $d$ for this model
    
    - `max n_clusters` - if using graspologic clustering, the recommended number of clusters $K$
    
    - `n_clusters selected` - if using graspologic clustering, the number of clusters identified in the data; this may be less than or equal to `max n_clusters`
    
    - `labels` - the labels for these clusters
    
    - `scores` - silhouette scores (for KMeans) or BIC (for GMM)
    
  - `model_name`\_pairplot.png (_optionally_)
  
    _if using graspologic spectral embedding, the pair plots showing the embeddings and cluster distributions_
    
  - `model_name`\_silhouette\_scores.npy (_optionally_)
  
    _if using graspologic KMeans, the silhouette scores saved as an array, so they can be used to generate the group aggregate silhouette plots_
    
  - `model_name`\_silhouette\_scores.png (_optionally_)
  
    _if using graspologic KMeans, the silhouette scores for this individual model for $2...\text{n_clusters}$. When searching for the optimal $K$, it is best to run the model with the largest $K$, i.e. the maximum number of subbundles expected for the bundle._
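
To illustrate how the `model_name`\_idx.npy array described above can be used, here is a minimal sketch that groups streamline indices by cluster and counts the streamlines in each. The label values are synthetic stand-ins, not real model output; in practice the array would come from `np.load` on the downloaded file.

```python
import numpy as np

# Synthetic stand-in for np.load(f"{model_name}_idx.npy"); values are illustrative only.
idx = np.array([0, 2, 1, 0, 0, 2, 1, 0])
n_clusters = 3

# Streamline indices belonging to each cluster.
cluster_members = {
    cluster_id: np.flatnonzero(idx == cluster_id) for cluster_id in range(n_clusters)
}

# Streamline count per cluster; minlength keeps empty clusters in the result.
cluster_sizes = np.bincount(idx, minlength=n_clusters)

print(cluster_sizes)
```

The index arrays in `cluster_members` could then be used to select the matching streamlines from the original bundle tractogram.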

#### Constants

Constants from pyAFQ and HCP dataset

In [1]:
from subbundle_model_analysis_utils import fetch_model_data
from identify_subbundles import *
import visualizations as viz

import logging
logger = logging.getLogger('subbundle')
logger.setLevel(logging.INFO)

In [2]:
# list of pyAFQ bundle identifiers
BUNDLE_NAMES = [
    'ATR_L', 'ATR_R',
    'CGC_L', 'CGC_R',
    'CST_L', 'CST_R',
    'IFO_L', 'IFO_R',
    'ILF_L', 'ILF_R',
    'SLF_L', 'SLF_R',
    'ARC_L', 'ARC_R',
    'UNC_L', 'UNC_R',
    'FA', 'FP'
]

# list of HCP test-retest subject identifiers
SUBJECTS = [
    '103818', '105923', '111312', '114823', '115320',
    '122317', '125525', '130518', '135528', '137128',
    '139839', '143325', '144226', '146129', '149337',
    '149741', '151526', '158035', '169343', '172332',
    '175439', '177746', '185442', '187547', '192439',
    '194140', '195041', '200109', '200614', '204521',
    '250427', '287248', '341834', '433839', '562345',
    '599671', '601127', '627549', '660951', # '662551', 
    '783462', '859671', '861456', '877168', '917255'
]

# list of HCP test and retest session names
SESSION_NAMES = ['HCP_1200', 'HCP_Retest']

#### Experiment and Model Metadata

Dictionary of information passed to the helper functions:

- `metadata` dict:
  - `metadata['experiment_name']`
  - `metadata['experiment_output_dir']`
  - `metadata['experiment_bundles']`
  - `metadata['experiment_subjects']`
  - `metadata['experiment_sessions']`
  - `metadata['experiment_test_session']`
  - `metadata['experiment_retest_session']`
  - `metadata['experiment_range_n_clusters']`
  - `metadata['experiment_bundle_dict']`
  - `metadata['model_name']`
  - `metadata['model_scalars']`
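
Since every helper reads from this one dictionary, it can help to check that all of the keys listed above are present before running the pipeline. The following is a minimal sketch under that assumption; `REQUIRED_KEYS` and `missing_metadata_keys` are hypothetical names that do not exist in the notebook's helper modules.

```python
# Keys copied from the list above.
REQUIRED_KEYS = [
    'experiment_name', 'experiment_output_dir', 'experiment_bundles',
    'experiment_subjects', 'experiment_sessions', 'experiment_test_session',
    'experiment_retest_session', 'experiment_range_n_clusters',
    'experiment_bundle_dict', 'model_name', 'model_scalars',
]


def missing_metadata_keys(metadata):
    """Hypothetical helper: return the required keys absent from `metadata`."""
    return [key for key in REQUIRED_KEYS if key not in metadata]


# A deliberately incomplete dictionary to show the check.
example = {'experiment_name': 'demo', 'model_name': 'demo_model'}
print(missing_metadata_keys(example))
```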

In [12]:
from os.path import join

import random

metadata = {}

metadata['experiment_name'] = 'MASE_FA_and_MD_Sklearn_KMeans'

# TODO: output directory should exclude BUNDLE_NAME and be added at lower level helper where appropriate
metadata['experiment_output_dir'] = join('subbundles', metadata['experiment_name'])

# NOTE: right now just run one bundle at a time
# BUNDLE_NAME = random.choice(BUNDLE_NAMES)
# NOTE: Experiment was run for only SLF_L and SLF_R
BUNDLE_NAME = random.choice(['SLF_L', 'SLF_R']) 
print(BUNDLE_NAME)
metadata['experiment_bundles'] = [BUNDLE_NAME]

# NOTE: Experiment was run for first five subjects
metadata['experiment_subjects'] = SUBJECTS[:5] 
# metadata['experiment_subjects'] = random.sample(SUBJECTS, 5)
print(metadata['experiment_subjects'])

metadata['experiment_sessions'] = SESSION_NAMES
metadata['experiment_test_session'] = metadata['experiment_sessions'][0]
metadata['experiment_retest_session'] = metadata['experiment_sessions'][1]

# NOTE: Experiment was run for 2-4 clusters
metadata['experiment_range_n_clusters'] = [2, 3, 4] 

def make_bundle_dict(metadata):
    """
    create a bundle dictionary object for the largest number of clusters
    in the experiment
    """
    bundle_dict = {}
    
    maximal_n_clusters = max(metadata['experiment_range_n_clusters'])
    for bundle_name in metadata['experiment_bundles']:
        bundle_name_prefix = bundle_name.split('_')[0]
        
        for cluster_id in range(maximal_n_clusters):
            bundle_dict[bundle_name_prefix + '_' + str(cluster_id)] = {"uid" : cluster_id}
        
    return bundle_dict

metadata['experiment_bundle_dict'] = make_bundle_dict(metadata)

metadata['model_name'] = 'mase_kmeans_fa_r2_md_r2_is_mdf'
metadata['model_scalars'] = [Scalars.DTI_FA, Scalars.DTI_MD]

SLF_R
['103818', '105923', '111312', '114823', '115320']
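
To make the contents of `metadata['experiment_bundle_dict']` concrete, the sketch below restates the logic of `make_bundle_dict` with plain arguments. The function name `sketch_bundle_dict` is hypothetical. It also shows a consequence of keeping only the prefix before the underscore: `SLF_L` and `SLF_R` would produce identical keys if both were passed together, which is not an issue while the notebook runs one bundle at a time.

```python
def sketch_bundle_dict(bundles, range_n_clusters):
    """Hypothetical restatement of make_bundle_dict that takes plain arguments."""
    bundle_dict = {}
    for bundle_name in bundles:
        prefix = bundle_name.split('_')[0]  # 'SLF_R' -> 'SLF'
        for cluster_id in range(max(range_n_clusters)):
            bundle_dict[f"{prefix}_{cluster_id}"] = {"uid": cluster_id}
    return bundle_dict


single = sketch_bundle_dict(['SLF_R'], [2, 3, 4])
both = sketch_bundle_dict(['SLF_L', 'SLF_R'], [2, 3, 4])

print(single)
print(single == both)  # True: the hemisphere suffix is dropped from the keys
```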


#### Pipeline

1. set up local directory and download necessary files for model analysis

In [13]:
model_data = fetch_model_data(metadata)

INFO:subbundle:Download SLF_R data from HCP reliability study
INFO:subbundle:Download SLF_R clustering models for K=[2, 3, 4]


2. identify a consensus subject and relabel the clusters to match it; the relabeling can be evaluated with several algorithms

build `cluster_info` dict:
- `cluster_info[n_clusters]`
  - `cluster_info[n_clusters]['consensus_subject']`
  - `cluster_info[n_clusters]['centroids']`
  - `cluster_info[n_clusters]['tractograms_filenames']`
  - `cluster_info[n_clusters]['tractograms']`


In [None]:
cluster_info = get_cluster_info(metadata, BUNDLE_NAME)

relabel retest clusters based on consensus subject

In [15]:
for algorithm in algorithms:
    match_retest_clusters(metadata, cluster_info, BUNDLE_NAME, algorithm)

afq profiles

In [16]:
cluster_afq_profiles = {}

for n_clusters in metadata['experiment_range_n_clusters']:
    cluster_afq_profiles[n_clusters] = get_cluster_afq_profiles(
        metadata,
        BUNDLE_NAME,
        n_clusters,
        cluster_info[n_clusters]['consensus_subject']
    )

3. then we want to see individual (in subject space) **and** group (in MNI space):

3. 1. anatomy

3. 2. centroids

after the cluster models have been generated and published to the AWS S3 repository,
and a consensus subject has been identified, visualize the centroids

show consensus subject centroids

In [None]:
viz.display_consensus_centroids(metadata, cluster_info)

*optionally* choose a subject to investigate

Cluster Centroids Labeled by Streamline Count (default)

In [None]:
viz.display_streamline_count_centroids(metadata, cluster_info)

#### Check out the effects of different labeling algorithms

Cluster Centroids Labeled by Best Weighted Dice Coefficient

**NOTE** this algorithm may 'collapse' multiple clusters into a single bundle, because it relabels each cluster to the consensus cluster with the highest overlap

In [None]:
viz.display_maxdice_centroids(metadata, cluster_info, BUNDLE_NAME)
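
The collapse described in the note can be reproduced on toy data. The sketch below uses one common definition of the weighted Dice coefficient (the density summed over the overlapping voxels of both maps, divided by the total density of both maps); whether the notebook's helper uses exactly this formula is an assumption. The 1-D "density maps" are made up for illustration.

```python
import numpy as np


def weighted_dice(map_a, map_b):
    """Weighted Dice between two non-negative density maps (assumed definition)."""
    overlap = (map_a > 0) & (map_b > 0)
    total = map_a.sum() + map_b.sum()
    if total == 0:
        return 0.0
    return float((map_a[overlap].sum() + map_b[overlap].sum()) / total)


# Toy 1-D density maps: two consensus clusters and two retest clusters.
consensus = [np.array([4, 3, 1, 0]), np.array([0, 1, 3, 4])]
retest = [np.array([3, 3, 0, 0]), np.array([1, 2, 1, 0])]

# Rows are retest clusters, columns are consensus clusters.
dice = np.array([[weighted_dice(r, c) for c in consensus] for r in retest])

# Relabel each retest cluster to the consensus cluster with the highest overlap.
labels = dice.argmax(axis=1)
print(labels)  # both retest clusters map to consensus cluster 0
```

Because each row is labeled independently, nothing prevents two retest clusters from choosing the same consensus cluster.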

Cluster Centroids Labeled by Munkres (maximal trace) Weighted Dice Coefficient

In [None]:
viz.display_munkres_centroids(metadata, cluster_info, BUNDLE_NAME)
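
A minimal sketch of the Munkres (Hungarian) matching, assuming a precomputed matrix of weighted Dice coefficients with retest clusters as rows and consensus clusters as columns. The matrix values are illustrative, and SciPy's `linear_sum_assignment` is used here as a stand-in; the notebook's own helper may use a different implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative weighted Dice matrix: rows are retest clusters, columns are consensus clusters.
dice = np.array([
    [0.93, 0.29],
    [1.00, 0.58],
])

# Per-row argmax can assign two retest clusters to the same consensus cluster.
greedy_labels = dice.argmax(axis=1)

# Munkres assignment: a one-to-one matching that maximizes the sum of matched entries.
rows, cols = linear_sum_assignment(dice, maximize=True)
munkres_labels = cols[np.argsort(rows)]

print(greedy_labels, munkres_labels)
```

Here the one-to-one constraint forces the second retest cluster onto consensus cluster 1, even though its individual best match is cluster 0.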

Cluster Centroids Labeled by MDF

In [None]:
viz.display_mdf_centroids(metadata, cluster_info, BUNDLE_NAME)
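
For reference, MDF (minimum average direct-flip distance) compares two streamlines sampled at the same number of points and ignores their orientation. The sketch below is a plain NumPy restatement on toy centroids; the function name `mdf_distance` is hypothetical, and the notebook's helper presumably relies on a library implementation instead.

```python
import numpy as np


def mdf_distance(streamline_a, streamline_b):
    """Minimum average direct-flip distance between two equally sampled streamlines."""
    direct = np.linalg.norm(streamline_a - streamline_b, axis=1).mean()
    flipped = np.linalg.norm(streamline_a - streamline_b[::-1], axis=1).mean()
    return min(direct, flipped)


# Toy centroids with three points each; real centroids would first be
# resampled to a common number of points.
centroid = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
reversed_and_shifted = centroid[::-1] + np.array([0.0, 1.0, 0.0])

print(mdf_distance(centroid, reversed_and_shifted))
```

The second centroid is stored in the opposite direction and offset by one unit, so the flipped comparison gives the smaller distance.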

3. 3. fa profiles

In [None]:
# TODO move to visualizations
# TODO only does Munkres
def display_cluster_profiles(metadata, cluster_info, cluster_afq_profiles, bundle_name):
    """
    display the scalar profiles for each `n_clusters`
    """
    from IPython.display import Image, display
    from os.path import join
    import itertools
    
    base_dir = join(metadata['experiment_output_dir'], bundle_name)

    for n_clusters, scalar in itertools.product(metadata['experiment_range_n_clusters'], metadata['model_scalars']):
        scalar_abr = scalar.split('.')[0]
        viz.plot_cluster_reliability(
            base_dir,
            metadata['experiment_sessions'],
            metadata['experiment_subjects'],
            bundle_name,
            scalar_abr,
            cluster_afq_profiles[n_clusters][scalar],
            model_data[bundle_name]['model_names'],
            model_data[bundle_name]['cluster_names'],
            n_clusters
        )

        for session in metadata['experiment_sessions']:
            print(scalar, session, metadata['model_name'], n_clusters)
            display(Image(filename=f"{base_dir}/{session}_{metadata['model_name']}_{n_clusters}_clusters_{scalar_abr}_profile_ci.png"))
            
display_cluster_profiles(metadata, cluster_info, cluster_afq_profiles, BUNDLE_NAME)

3. 4. silhouette scores?

3. 5. pair plots?

see `subbundle_choose_k.ipynb`

TODO: merge into this notebook

4. profile tensors for each bundle and scalar

      each scalar profile tensor has shape $K \times N \times M \times S$, where:
   
      - $K$ is the number of clusters,
      - $N=44$ is the number of subjects in the full dataset ($N=5$ in the run shown here),
      - $M=100$ is the number of sampled streamline nodes, and 
      - $S=2$ is the number of sessions

In [20]:
def get_cluster_profile_tensor(cluster_afq_profiles, n_clusters, subjects, n_nodes, session_names):
    """
    convert the cluster_afq_profile dict into ndarray
    """
    import itertools
    import numpy as np
    
    tensor = np.zeros((n_clusters, len(subjects), n_nodes, len(session_names)))

    for (subject, session, cluster_id, node_id) in itertools.product(subjects, session_names, range(n_clusters), range(n_nodes)):
        tensor[cluster_id][subjects.index(subject)][node_id][session_names.index(session)] = cluster_afq_profiles[subject][session][cluster_id][node_id]
    
    return tensor

import itertools
for n_clusters, scalar in itertools.product(metadata['experiment_range_n_clusters'], metadata['model_scalars']):  
    profile_tensor = get_cluster_profile_tensor(
        cluster_afq_profiles[n_clusters][scalar],
        n_clusters,
        metadata['experiment_subjects'],
        100,
        metadata['experiment_sessions']
    )

    print(n_clusters, scalar, profile_tensor.shape)

2 DTI_FA.nii.gz (2, 5, 100, 2)
2 DTI_MD.nii.gz (2, 5, 100, 2)
3 DTI_FA.nii.gz (3, 5, 100, 2)
3 DTI_MD.nii.gz (3, 5, 100, 2)
4 DTI_FA.nii.gz (4, 5, 100, 2)
4 DTI_MD.nii.gz (4, 5, 100, 2)
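
One possible use of a tensor with this layout is a per-cluster, per-subject test-retest correlation along the node axis. The sketch below uses synthetic profiles with the same axis order; it illustrates the indexing only and is not the reliability measure computed by `viz.plot_cluster_reliability`. The function name `session_correlation` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one profile tensor: (K clusters, N subjects, M nodes, S sessions).
K, N, M, S = 3, 5, 100, 2
test_profiles = rng.random((K, N, M))
retest_profiles = test_profiles + 0.05 * rng.standard_normal((K, N, M))
profile_tensor = np.stack([test_profiles, retest_profiles], axis=-1)


def session_correlation(tensor):
    """Pearson r between session 0 and session 1 along the node axis."""
    centered = tensor - tensor.mean(axis=2, keepdims=True)
    first, second = centered[..., 0], centered[..., 1]
    numerator = (first * second).sum(axis=2)
    denominator = np.sqrt((first ** 2).sum(axis=2) * (second ** 2).sum(axis=2))
    return numerator / denominator


r = session_correlation(profile_tensor)
print(profile_tensor.shape, r.shape)
```

The result has one correlation per cluster and subject, so it can be averaged over the subject axis to compare clusters.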
