# SST Individual Discriminability

This notebook measures discriminability of conditions for each subject, within runs.

It then averages discriminability across runs for any subject with more than one run (I don't think we have any, though).

Then we can compare discriminability against other subject-level variables.



Something similar was previously done in `rsa_within_subj.ipynb`; it computed the similarity between matrices representing the GNG averages and compared that with intra-class similarity.

That notebook doesn't take the next step of relating the similarity score to across-subject variables, apparently because I did not find good evidence that intra-class similarity was actually higher than inter-class similarity.

Conclusions:

 - Maybe I should have tried discriminability, not just similarity.
 - If we're interested in discriminability, then running an ML algorithm should be superior?
 - But don't rule out going back to similarity.

## TESQ

### Machine learning

First set up.

In [1]:
import pickle
from IPython.display import display, HTML, Markdown

In [2]:
from nilearn.decoding import Decoder

In [3]:
from sklearn.model_selection import StratifiedKFold
from random import randint
import math

In [4]:
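import gc
import os
import sys

import pandas as pd

# make the shared helper modules in ../../ml/ importable
sys.path.append(os.path.abspath("../../ml/"))

from apply_loocv_and_save import *
from dev_wtp_io_utils import *

# explicit imports for names used later in this notebook, placed after the
# star imports so these bindings take precedence
import nibabel as nib
import numpy as np
from matplotlib import pyplot
from os import path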



nonbids_data_path = "/gpfs/projects/sanlab/shared/DEV/nonbids_data/"
ml_data_folderpath = "/gpfs/projects/sanlab/shared/DEV/nonbids_data/fMRI/ml"
train_test_markers_filepath = ml_data_folderpath + "/train_test_markers_20210601T183243.csv"
test_train_df = pd.read_csv(train_test_markers_filepath)

all_sst_events= pd.read_csv(ml_data_folderpath +"/SST/" + "all_sst_events.csv")


dataset_name = 'conditions'

from nilearn.decoding import DecoderRegressor, Decoder

script_path = '/gpfs/projects/sanlab/shared/DEV/DEV_scripts/fMRI/ml'
# HRF 2s

#get a PFC mask
#pfc_mask = create_mask_from_images(get_pfc_image_filepaths(ml_data_folderpath + "/"),threshold=10)


def trialtype_resp_trans_func(X):
    return(X.trial_type)



python initialized for apply_loocv_and_save


  warn("Fetchers from the nilearn.datasets module will be "


4


In [5]:

dataset_name = 'conditions'


brain_data_filepath = ml_data_folderpath + '/SST/Brain_Data_betaseries_40subs_correct_cond.pkl'
#brain_data_filepath = ml_data_folderpath + '/SST/Brain_Data_conditions_43subs_correct_cond.pkl'

def decoderConstructor(*args, **kwargs):
    return(Decoder(scoring='accuracy',verbose=0, *args, **kwargs))


relevant_mask = None

`load_and_preprocess` has problems, but I'm going to continue using it because I'm having problems with the standardization in the nilearn package, so there's not much point in deviating. We can also use the slicer in `load_and_preprocess` if we get the subject list first.
 
To get the subject list, we'll load the data once, get the list, and then load each subject individually.

In [6]:
all_subjects = load_and_preprocess(
    brain_data_filepath,
    train_test_markers_filepath,
    subjs_to_use = None,
    response_transform_func = trialtype_resp_trans_func,
    clean=None)

all_subjects['groups']

subj_list = np.unique(all_subjects['groups'])

del all_subjects
gc.collect()

checked for intersection and no intersection between the brain data and the subjects was found.
there were 40 subjects overlapping between the subjects marked for train data and the training dump file itself.
test_train_set: 9549
pkl_file: 168
brain_data_filepath: 152
train_test_markers_filepath: 141
response_transform_func: 136
sys: 72
Brain_Data_allsubs: 48
clean: 16
subjs_to_use: 16


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Brain_Data_allsubs.Y[Brain_Data_allsubs.Y=='NULL']=None


4229
4229


0

This takes too long. For that reason, even though there are accuracy challenges, I'd like to proceed with a shortcut: we do 10-fold cross-validation using the `LeaveOneGroupOut` splitter, and ensure that Go and NoGo trials are distributed as evenly across the groups as possible.

But we'll leave the capacity to do full LeaveOneOut in there because it's probably good to go back to in the future.
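As a rough illustration of the shortcut, here is a minimal sketch of building balanced groups for `LeaveOneGroupOut`. The helper `assign_balanced_groups`, the toy labels, and the seed are assumptions made for this example; they are not part of `apply_loocv_and_save`.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut


def assign_balanced_groups(y, n_groups=10, seed=0):
    """Give each sample a group id so every class is spread evenly over groups.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    groups = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)  # randomise which trials land in which group
        groups[idx] = np.arange(len(idx)) % n_groups  # round-robin within class
    return groups


# toy labels: 80 Go trials and 20 NoGo trials (made-up counts)
y = np.array(["Go"] * 80 + ["NoGo"] * 20)
groups = assign_balanced_groups(y, n_groups=10)

logo = LeaveOneGroupOut()
n_splits = logo.get_n_splits(groups=groups)

# each held-out fold is one group; collect the fold sizes as a sanity check
fold_sizes = [len(test_idx)
              for _, test_idx in logo.split(np.zeros((len(y), 1)), y, groups)]
```

With these toy counts, every held-out group contains 8 Go and 2 NoGo trials, so passing `cv=LeaveOneGroupOut()` together with `groups` behaves like a 10-fold split with matched class ratios.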

First to test the technical process, let's try doing just three manually-generated groups.

OK, so this isn't very good for closely understanding what is being classified as what; we're not getting the predictions behind the scores under the hood. But we can look at the overall predictive accuracy on a non-held-out set. Note that we can't use this for assessing model fit (for that, we would do better to average the prediction accuracies across the folds, or run an independent train-test analysis), but we can use it to understand things like class bias or to look at the image being predicted.
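To make "class bias" concrete, a cross-tabulation of observed against predicted labels is one simple option. The sketch below uses a made-up table with illustrative `y_obs` and `y_pred` columns; the values are invented and are not results from this dataset.

```python
import pandas as pd

# toy observed/predicted labels (invented for illustration)
y_pred_vs_obs = pd.DataFrame({
    "y_obs": ["correct-go"] * 6 + ["correct-stop"] * 2,
    "y_pred": ["correct-go", "correct-go", "correct-go", "correct-go",
               "correct-stop", "correct-stop", "correct-stop", "correct-go"],
})

# rows are observed classes, columns are predicted classes
confusion = pd.crosstab(y_pred_vs_obs["y_obs"], y_pred_vs_obs["y_pred"])

# normalising by row gives the share of each observed class assigned to each label
per_class_rates = pd.crosstab(
    y_pred_vs_obs["y_obs"], y_pred_vs_obs["y_pred"], normalize="index"
)
```

A row whose mass sits mostly off the diagonal indicates that the classifier is biased against that class.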

In [8]:
def get_subject_discriminability(sample_subject):
    min_splits = 2
    # iterate through each subject; for each subject:
    display("loading subject " + sample_subject)

    subj_i_processed_data = load_and_preprocess(
        brain_data_filepath,
        train_test_markers_filepath,
        subjs_to_use = [sample_subject],
        response_transform_func = trialtype_resp_trans_func,
        clean="standardize")
    
    display(subj_i_processed_data['y'].value_counts())
    
    display("setting up decoder...")
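    # we use stratified K-fold so each fold keeps the class proportions
    correct_stop_count = np.sum(subj_i_processed_data['y'] == 'correct-stop')

    if correct_stop_count < min_splits:
        # too few correct-stop trials to cross-validate; skip this subject
        # (could instead do one split per correct-stop trial,
        # because there are generally very few of them)
        return None
    # for testing, for now we'll use 3 splits
    # random_state must be an integer in [0, 2**32 - 1]
    skf = StratifiedKFold(n_splits=3, random_state=randint(0, 2**32 - 1),
                          shuffle=True)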
        
    #do this separately for each outcome group
    decoder = Decoder(standardize=True, 
                      cv = skf, #mask = mask,
                      n_jobs = cpus_to_use,#verbose=10,
                      scoring='accuracy'
                     )

    display("fitting")
    #get overfit individual predictions--only way we can assess individual predictions
    decoder_result = decoder.fit(X=subj_i_processed_data['X'],y=subj_i_processed_data['y'])
    
    display("evaluating")
    
    predictions = decoder.predict(subj_i_processed_data['X'])
    y_pred_vs_obs = pd.DataFrame({'y_obs':subj_i_processed_data['y'],'y_pred':predictions})
    overfit_accuracy = np.sum(y_pred_vs_obs['y_obs']==y_pred_vs_obs['y_pred'])/len(y_pred_vs_obs['y_obs'])
    
    # get mean_cv_scores: cross-validated scores averaged over classes and folds;
    # I'm not sure exactly what they mean because the package documentation is vague
    mean_cv_scores = np.mean([np.mean(c_scores) for c_name, c_scores in decoder.cv_scores_.items()])
    
    #alternative to this is to do our own cv stuff
    subj_discrim_results = {
        'mean_cv_scores':mean_cv_scores,
        'overfit_accuracy':overfit_accuracy,
        'overfit_y_pred_vs_obs': y_pred_vs_obs,
        'decoder_object' : decoder
    }
    display(subj_discrim_results)


    return(subj_discrim_results)

I'm not sure overfitting actually matters here. We're interested in discriminability not to test whether we really can discriminate above chance, but as an individual difference: to see whether relative discriminability relates to other things we care about.

In that sense, we probably don't have to worry about overfitting.
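If held-out per-trial predictions turn out to be wanted after all, one way to "do our own cv" is scikit-learn's `cross_val_predict`, which returns, for every trial, the prediction made by a model that did not see that trial during fitting. This is only a sketch on synthetic data: the array shapes, the injected signal, and the fixed seeds are assumptions, and it uses `LinearSVC` directly rather than the nilearn `Decoder`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import LinearSVC

# synthetic stand-in for one subject's trial-by-voxel data (made-up sizes)
rng = np.random.default_rng(0)
n_go, n_stop, n_voxels = 60, 12, 40
X = rng.normal(size=(n_go + n_stop, n_voxels))
X[n_go:, :5] += 1.5  # give the minority class a weak signal
y = np.array(["correct-go"] * n_go + ["correct-stop"] * n_stop)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# each trial is predicted by the fold model that held it out
held_out_pred = cross_val_predict(LinearSVC(max_iter=10000), X, y, cv=cv)

y_pred_vs_obs = pd.DataFrame({"y_obs": y, "y_pred": held_out_pred})
held_out_accuracy = (y_pred_vs_obs["y_obs"] == y_pred_vs_obs["y_pred"]).mean()
```

The resulting table has the same shape as the overfit one, so the two accuracy measures can be compared subject by subject.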

In [21]:
# def get_function_with_cache(function_to_run,function_path):
#     if path.exists(function_path) is False:
#         results = function_to_run()
#     else:
#         results=pickle.load(open(function_path,'rb'))
        
#     return(results)
    
def get_subject_discriminability_with_cache(sample_subject,run_desc):
    results_filepath = ml_data_folderpath + "/SST/discriminability_tt_results_" + run_desc + "_" + sample_subject + ".pkl"
    if not path.exists(results_filepath):
        subj_discrim_results = get_subject_discriminability(sample_subject)
        with open(results_filepath, 'wb') as handle:
            pickle.dump(subj_discrim_results, handle)
    else:
        # reuse the cached result; the context manager closes the file handle
        with open(results_filepath, 'rb') as handle:
            subj_discrim_results = pickle.load(handle)
        
    
    return(subj_discrim_results)

    
    
    

In [22]:
def get_all_subjs_discriminability_whole_brain():
    results_dict = {}

    for sample_subject in subj_list:
        run_desc = 'v1_whole_brain'
        results_dict[sample_subject] = get_subject_discriminability_with_cache(sample_subject,run_desc)
        
    # subjects with too few correct-stop trials return None; skip them
    summary_results = pd.concat([pd.DataFrame({
        'subid':k,
        'overfit_accuracy':[v['overfit_accuracy']],
        'mean_cv_scores':[v['mean_cv_scores']]}) 
                                 for k,v in results_dict.items()
                                 if v is not None])
    
    return(summary_results)



In [None]:
summary_results = get_all_subjs_discriminability_whole_brain()

'loading subject DEV005'

checked for intersection and no intersection between the brain data and the subjects was found.
there were 40 subjects overlapping between the subjects marked for train data and the training dump file itself.
test_train_set: 9549
pkl_file: 168
brain_data_filepath: 152
train_test_markers_filepath: 141
response_transform_func: 136
sys: 72
subjs_to_use: 64
clean: 60
Brain_Data_allsubs: 48


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Brain_Data_allsubs.Y[Brain_Data_allsubs.Y=='NULL']=None


4229
4229


correct-go      80
correct-stop     5
Name: trial_type, dtype: int64

'setting up decoder...'

'fitting'

'evaluating'

{'mean_cv_scores': 0.35303776683087024,
 'overfit_accuracy': 0.6352941176470588,
 'overfit_y_pred_vs_obs':            y_obs        y_pred
 0   correct-stop  correct-stop
 1     correct-go  correct-stop
 2     correct-go    correct-go
 3     correct-go    correct-go
 4     correct-go    correct-go
 ..           ...           ...
 80    correct-go    correct-go
 81    correct-go    correct-go
 82    correct-go    correct-go
 83    correct-go  correct-stop
 84    correct-go  correct-stop
 
 [85 rows x 2 columns],
 'decoder_object': Decoder(cv=StratifiedKFold(n_splits=3, random_state=2984842438, shuffle=True),
         estimator=LinearSVC(max_iter=10000.0), memory=Memory(location=None),
         n_jobs=3, scoring='accuracy')}

'loading subject DEV006'

checked for intersection and no intersection between the brain data and the subjects was found.
there were 40 subjects overlapping between the subjects marked for train data and the training dump file itself.
test_train_set: 9549
pkl_file: 168
brain_data_filepath: 152
train_test_markers_filepath: 141
response_transform_func: 136
sys: 72
subjs_to_use: 64
clean: 60
Brain_Data_allsubs: 48


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Brain_Data_allsubs.Y[Brain_Data_allsubs.Y=='NULL']=None


4229
4229


correct-go      92
correct-stop    18
Name: trial_type, dtype: int64

'setting up decoder...'

'fitting'

'evaluating'

{'mean_cv_scores': 0.4364364364364364,
 'overfit_accuracy': 0.7272727272727273,
 'overfit_y_pred_vs_obs':             y_obs        y_pred
 85     correct-go  correct-stop
 86     correct-go    correct-go
 87     correct-go    correct-go
 88     correct-go    correct-go
 89     correct-go    correct-go
 ..            ...           ...
 190    correct-go    correct-go
 191  correct-stop  correct-stop
 192    correct-go    correct-go
 193  correct-stop  correct-stop
 194    correct-go    correct-go
 
 [110 rows x 2 columns],
 'decoder_object': Decoder(cv=StratifiedKFold(n_splits=3, random_state=1486360763, shuffle=True),
         estimator=LinearSVC(max_iter=10000.0), memory=Memory(location=None),
         n_jobs=3, scoring='accuracy')}

'loading subject DEV009'

checked for intersection and no intersection between the brain data and the subjects was found.
there were 40 subjects overlapping between the subjects marked for train data and the training dump file itself.
test_train_set: 9549
pkl_file: 168
brain_data_filepath: 152
train_test_markers_filepath: 141
response_transform_func: 136
sys: 72
subjs_to_use: 64
clean: 60
Brain_Data_allsubs: 48


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Brain_Data_allsubs.Y[Brain_Data_allsubs.Y=='NULL']=None


4229
4229


correct-go    96
Name: trial_type, dtype: int64

'setting up decoder...'

'loading subject DEV010'

checked for intersection and no intersection between the brain data and the subjects was found.
there were 40 subjects overlapping between the subjects marked for train data and the training dump file itself.
test_train_set: 9549
pkl_file: 168
brain_data_filepath: 152
train_test_markers_filepath: 141
response_transform_func: 136
sys: 72
subjs_to_use: 64
clean: 60
Brain_Data_allsubs: 48


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Brain_Data_allsubs.Y[Brain_Data_allsubs.Y=='NULL']=None


4229
4229


correct-go      94
correct-stop    13
Name: trial_type, dtype: int64

'setting up decoder...'

'fitting'

'evaluating'

{'mean_cv_scores': 0.3457671957671957,
 'overfit_accuracy': 0.7009345794392523,
 'overfit_y_pred_vs_obs':             y_obs        y_pred
 291  correct-stop  correct-stop
 292    correct-go    correct-go
 293    correct-go    correct-go
 294    correct-go  correct-stop
 295    correct-go  correct-stop
 ..            ...           ...
 393    correct-go    correct-go
 394    correct-go    correct-go
 395    correct-go    correct-go
 396    correct-go    correct-go
 397    correct-go  correct-stop
 
 [107 rows x 2 columns],
 'decoder_object': Decoder(cv=StratifiedKFold(n_splits=3, random_state=3050342173, shuffle=True),
         estimator=LinearSVC(max_iter=10000.0), memory=Memory(location=None),
         n_jobs=3, scoring='accuracy')}

'loading subject DEV012'

checked for intersection and no intersection between the brain data and the subjects was found.
there were 40 subjects overlapping between the subjects marked for train data and the training dump file itself.
test_train_set: 9549
pkl_file: 168
brain_data_filepath: 152
train_test_markers_filepath: 141
response_transform_func: 136
sys: 72
subjs_to_use: 64
clean: 60
Brain_Data_allsubs: 48


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Brain_Data_allsubs.Y[Brain_Data_allsubs.Y=='NULL']=None


4229
4229


correct-go    95
Name: trial_type, dtype: int64

'setting up decoder...'

'loading subject DEV013'

checked for intersection and no intersection between the brain data and the subjects was found.
there were 40 subjects overlapping between the subjects marked for train data and the training dump file itself.
test_train_set: 9549
pkl_file: 168
brain_data_filepath: 152
train_test_markers_filepath: 141
response_transform_func: 136
sys: 72
subjs_to_use: 64
clean: 60
Brain_Data_allsubs: 48


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Brain_Data_allsubs.Y[Brain_Data_allsubs.Y=='NULL']=None


4229
4229


correct-go      94
correct-stop    11
Name: trial_type, dtype: int64

'setting up decoder...'

'fitting'

'evaluating'

{'mean_cv_scores': 0.3333333333333333,
 'overfit_accuracy': 0.6761904761904762,
 'overfit_y_pred_vs_obs':             y_obs        y_pred
 493  correct-stop  correct-stop
 494    correct-go    correct-go
 495    correct-go  correct-stop
 496    correct-go  correct-stop
 497    correct-go  correct-stop
 ..            ...           ...
 593    correct-go  correct-stop
 594    correct-go    correct-go
 595    correct-go    correct-go
 596    correct-go  correct-stop
 597    correct-go  correct-stop
 
 [105 rows x 2 columns],
 'decoder_object': Decoder(cv=StratifiedKFold(n_splits=3, random_state=888671607, shuffle=True),
         estimator=LinearSVC(max_iter=10000.0), memory=Memory(location=None),
         n_jobs=3, scoring='accuracy')}

How should we deal with imbalanced classes?
We're using a linear SVC (`LinearSVC`, per the decoder output above). One solution: https://chrisalbon.com/code/machine_learning/support_vector_machines/imbalanced_classes_in_svm/
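One common approach to class imbalance in an SVM is to reweight the classes inversely to their frequency, which scikit-learn exposes as `class_weight="balanced"` on `LinearSVC`. The sketch below pairs that with balanced-accuracy scoring, so that always predicting `correct-go` is not rewarded. It runs on synthetic data: the sizes, the injected signal, and the seeds are assumptions, and it uses scikit-learn directly rather than the nilearn `Decoder`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# synthetic, imbalanced two-class data (made-up sizes)
rng = np.random.default_rng(0)
n_go, n_stop, n_features = 90, 12, 50
X = rng.normal(size=(n_go + n_stop, n_features))
X[n_go:, :5] += 1.0  # weak signal for the minority class
y = np.array(["correct-go"] * n_go + ["correct-stop"] * n_stop)

# "balanced" scales each class's penalty by n_samples / (n_classes * class_count)
clf = LinearSVC(class_weight="balanced", max_iter=10000)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# balanced accuracy is the mean of per-class recall, so chance is 0.5
balanced_scores = cross_val_score(clf, X, y, cv=cv, scoring="balanced_accuracy")
mean_balanced_accuracy = balanced_scores.mean()
```

Whether this helps on the real beta-series data is an open question; it is worth comparing against the unweighted scores before adopting it.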

### Analysis

In [None]:
from analyze_results import remove_selected_outliers
from scipy.stats import pearsonr,spearmanr

In [None]:
# correlate discriminability against the individual-difference measures we care about.

summary_results2 = summary_results.rename(columns={
    'mean_cv_scores':'discriminability_mean_cv_scores',
    'overfit_accuracy':'discriminability_overfit_accuracy'})


individual_differences = pd.read_csv(ml_data_folderpath + "/" + data_by_ppt_name)
individual_differences = individual_differences.rename(columns={'SID':'subid'})
individual_differences['wave']=1
#individual_differences['wave'] = individual_differences['wave'].astype(object) # for compatibility with the wave column in the dataset
ind_div_combined = summary_results2.merge(individual_differences)

In [None]:

def remove_selected_outliers_tesq_study(ind_div_combined,show_plot=False):
    idc_outliers_removed = remove_selected_outliers(ind_div_combined,
    ['discriminability_overfit_accuracy','discriminability_mean_cv_scores',
        'BFI_extraversion','RMQ_locomotion','ses_aggregate','PLAN_cognitive_strategies',
     'SST_SSRT','BIS_11','BSCS','TESQ_E_suppression', 'TESQ_E_avoidance_of_temptations', 
     'TESQ_E_goal_deliberation', 'TESQ_E_controlling_temptations', 'TESQ_E_distraction',
     'TESQ_E_goal_and_rule_setting','EDM','RS','TRSQ','ROC_Crave_Regulate_Minus_Look',
     'SRHI_unhealthy',
     'cancer_promoting_minus_preventing_FFQ','bf_1'],
    show_plot=show_plot)
    return(idc_outliers_removed)

In [None]:
ind_div_combined_3sd = remove_selected_outliers_tesq_study(
    ind_div_combined,
    show_plot=True)
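The `_3sd` suffix suggests a three-standard-deviation cutoff. Assuming that is what `remove_selected_outliers` does (its source is not shown here), a hypothetical stand-in could look like the sketch below. The name `mask_outliers_3sd` and the choice to replace outliers with NaN, rather than dropping rows, are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def mask_outliers_3sd(df, columns, n_sd=3.0):
    """Return a copy of df with values beyond n_sd SDs of the column mean set to NaN.

    Hypothetical stand-in; not the actual analyze_results helper.
    """
    out = df.copy()
    for col in columns:
        z = (out[col] - out[col].mean()) / out[col].std()
        out.loc[z.abs() > n_sd, col] = np.nan
    return out


# toy column: nineteen ordinary values and one extreme value
demo = pd.DataFrame({"score": [float(v) for v in range(1, 20)] + [1000.0]})
cleaned = mask_outliers_3sd(demo, ["score"])
n_masked = int(cleaned["score"].isna().sum())
```

Masking with NaN keeps the subject's other columns usable, at the cost of needing NaN-aware handling in any later correlation.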

In [None]:
def display_discriminability_correlations(ind_div_combined_3sd):
    for neural_var in ['discriminability_overfit_accuracy','discriminability_mean_cv_scores']:
        display(Markdown("### " + neural_var))
        for correlate in ['BFI_extraversion','RMQ_locomotion','ses_aggregate','PLAN_cognitive_strategies',
                          'SST_SSRT','BIS_11','BSCS',
                          'TESQ_E_suppression', 'TESQ_E_avoidance_of_temptations', 
                          'TESQ_E_goal_deliberation', 'TESQ_E_controlling_temptations', 'TESQ_E_distraction',
                          'TESQ_E_goal_and_rule_setting',
                        'EDM','RS','TRSQ','ROC_Crave_Regulate_Minus_Look','SRHI_unhealthy']:
            display(Markdown("#### " + correlate))
            nan_rows = np.isnan(ind_div_combined_3sd[correlate]) | np.isnan(ind_div_combined_3sd[neural_var])
            cor2way_df = ind_div_combined_3sd.loc[nan_rows==False,]
            pearson_result = pearsonr(cor2way_df[neural_var],cor2way_df[correlate])
            display(HTML("r=" + format(pearson_result[0],".2f") +"; p-value=" + format(pearson_result[1],".4f")))
            spearman_result = spearmanr(cor2way_df[neural_var],cor2way_df[correlate])
            display(HTML("rho=" + format(spearman_result[0],".2f") +"; p-value=" + format(spearman_result[1],".4f")))
            cplot = pyplot.scatter(cor2way_df[neural_var],cor2way_df[correlate])
            cplot.axes.set_xlabel(neural_var)
            cplot.axes.set_ylabel(correlate)
            pyplot.show()

In [None]:
display_discriminability_correlations(ind_div_combined_3sd)

### Next steps

Probably repeat the whole process with some masks excluding, at a minimum, the motor and visual cortices. We are also using a 40-subject dataset. It needs to be extended to the 84-subject dataset, even though this is going to be difficult because that dataset is so much more detailed. We need to get good at only storing a minimal amount of data at a time.

In fact, extending to 84 subjects is probably the very first thing we need to handle.

### Repeating the above with 84 subjects

In [None]:
dataset_name = 'conditions'


brain_data_filepath = ml_data_folderpath + '/SST/Brain_Data_betaseries_84subs_correct_cond.pkl'
#brain_data_filepath = ml_data_folderpath + '/SST/Brain_Data_conditions_43subs_correct_cond.pkl'

def decoderConstructor(*args, **kwargs):
    return(Decoder(scoring='accuracy',verbose=0, *args, **kwargs))


relevant_mask = None

In [None]:
all_subjects = load_and_preprocess(
    brain_data_filepath,
    train_test_markers_filepath,
    subjs_to_use = None,
    response_transform_func = trialtype_resp_trans_func,
    clean=None)

all_subjects['groups']

subj_list = np.unique(all_subjects['groups'])

del all_subjects
gc.collect()

In [None]:
def get_all_subjs_discriminability_whole_brain_v_1_1():
    results_dict = {}

    for sample_subject in subj_list:
        run_desc = 'v1_1_whole_brain_84subjs'
        results_dict[sample_subject] = get_subject_discriminability_with_cache(sample_subject,run_desc)
        
    # subjects with too few correct-stop trials return None; skip them
    summary_results = pd.concat([pd.DataFrame({
        'subid':k,
        'overfit_accuracy':[v['overfit_accuracy']],
        'mean_cv_scores':[v['mean_cv_scores']]}) 
                                 for k,v in results_dict.items()
                                 if v is not None])
    
    return(summary_results)



In [None]:
## pfc only

In [None]:
pfc_mask = create_mask_from_images(get_pfc_image_filepaths(ml_data_folderpath + "/"),threshold=10)
relevant_mask = pfc_mask
