# Classifying trialwise CorrectGo and NoGo trials

There are a number of steps to this. Hopefully we can recycle previous code and be up and running fairly quickly!

1. Load beta data. Ideally this process should include a cache into a pure Python object so we don't have to reload it each time.
2. Preprocess the data.
3. Do cross-validated training and testing. Ideally there is an inner loop to select the best parameters, an outer loop to get cross-validated performance, and a final training pass over all the data to get an image. The inner loop can probably be handled within the package we use.
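
As a rough sketch of the inner/outer/final structure in step 3, here is one way it could look with plain scikit-learn. This is not the pipeline used below: the data are synthetic, the subject count and the `C` grid are made up for illustration, and a logistic regression stands in for whatever decoder we end up using.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic stand-in data: 6 hypothetical subjects, 20 trials each, 5 features.
rng = np.random.default_rng(0)
n_subjects, n_trials, n_features = 6, 20, 5
groups = np.repeat(np.arange(n_subjects), n_trials)
y = rng.integers(0, 2, size=n_subjects * n_trials)
X = rng.normal(size=(y.size, n_features)) + 0.8 * y[:, None]

param_grid = {"C": [0.01, 0.1, 1.0]}  # illustrative grid only


def make_search():
    # Inner loop: pick C by grouped CV within whatever data it is fit on.
    return GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid,
        cv=GroupKFold(n_splits=2),
        scoring="roc_auc",
    )


# Outer loop: estimate performance on subjects the search never saw.
outer_scores = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups):
    search = make_search()
    search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

# Final model: rerun the parameter search on all the data.
final_search = make_search()
final_search.fit(X, y, groups=groups)

print("outer-fold ROC AUC:", np.round(outer_scores, 3))
print("final C:", final_search.best_params_["C"])
```

The outer scores are the performance estimate; the final fit is only there to produce a single model (and eventually an image), not to be scored.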

In [1]:
import sys
import os
import pandas as pd



sys.path.append(os.path.abspath("../../ml/"))
from apply_loocv_and_save import load_and_preprocess
from dev_utils import read_yaml_for_host
import warnings


config_data = read_yaml_for_host("sst_config.yml")



python initialized for apply_loocv_and_save
cpus available; cpus to use:
10 9
10


In [2]:
import multiprocessing
import math
import nibabel as nib
import nilearn as nl
from nilearn.decoding import DecoderRegressor,Decoder
from sklearn.model_selection import KFold,GroupKFold,LeaveOneOut
cpus_available = multiprocessing.cpu_count()

cpus_to_use = min(cpus_available-1,math.floor(0.9*cpus_available))
print(cpus_to_use)

9


In [3]:
from dev_wtp_io_utils import cv_train_test_sets, asizeof_fmt
from nilearn.decoding import DecoderRegressor,Decoder

In [4]:
nonbids_data_path = config_data['nonbids_data_path']
ml_data_folderpath = nonbids_data_path + "fMRI/ml"


## Set up the paradigm

In [5]:

def trialtype_resp_trans_func(X):
    # Use the trial type label itself as the response variable.
    return X.trial_type


## Loading beta data

Beta data is generally written by `load_multisubject_brain_data_sst_w1.ipynb`.

We just have to load it.
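
The caching idea from step 1 could be sketched roughly as below. `load_with_cache` and its arguments are hypothetical names, not part of `apply_loocv_and_save`, and the sketch assumes the loaded object can be pickled, which hasn't been checked for the output of `load_and_preprocess`.

```python
import os
import pickle


def load_with_cache(cache_path, slow_loader):
    """Return the cached object if present; otherwise build and cache it.

    cache_path and slow_loader are hypothetical names for this sketch.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # Cache miss: run the expensive loader once and store the result.
    obj = slow_loader()
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    return obj
```

The slow call would be wrapped in a zero-argument function (for example a `lambda`) and passed as `slow_loader`. The cache has no invalidation, so the cache file needs deleting by hand whenever the source data changes.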

In [6]:
brain_data_filepath = ml_data_folderpath + '/SST/Brain_Data_betaseries_30subs_correct_cond_pfc.pkl'
warnings.warn("not sure if this file holds up--it was created in 2021; need to see if it's still valid")
train_test_markers_filepath = ml_data_folderpath + "/train_test_markers_20220818T144138.csv"



In [7]:


all_subjects = load_and_preprocess(
    brain_data_filepath,
    train_test_markers_filepath,
    subjs_to_use = None,
    response_transform_func = trialtype_resp_trans_func,
    clean=None)

warnings.warn("the data hasn't been cleaned at any point. the fMRIPrep cleaning pipeline has been applied; nothing else has been.")




checked for intersection and no intersection between the brain data and the subjects was found.
there were 30 subjects overlapping between the subjects marked for train data and the training dump file itself.


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Brain_Data_allsubs.Y[Brain_Data_allsubs.Y=='NULL']=None


test_train_set: 62918
brain_data_filepath: 173
pkl_file: 168
train_test_markers_filepath: 158
response_transform_func: 144
sys: 72
Brain_Data_allsubs: 48
clean: 16
subjs_to_use: 16
3171
3171
cleaning memory




In [8]:
# get the PFC mask
mask_nifti = nib.load(ml_data_folderpath + '/prefrontal_cortex.nii.gz')

In [9]:
#convert the y array to an integer array representing the string values of the y array
all_subjects['y_cat'] = all_subjects['y'].astype('category')
all_subjects['y_int']=all_subjects['y_cat'].cat.codes
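
It's worth knowing which integer ends up meaning which trial type. A small sketch with made-up labels (the real values of `all_subjects['y']` aren't shown here) illustrates how pandas assigns the codes, including what happens to missing values such as the `'NULL'` entries that were set to `None` during loading, if any survive to this point:

```python
import pandas as pd

# Hypothetical labels; None stands in for a missing trial type.
y = pd.Series(["NoGo", "CorrectGo", None, "CorrectGo"])
y_cat = y.astype("category")
y_int = y_cat.cat.codes

# Codes follow the sorted category order; missing values become -1.
code_to_label = dict(enumerate(y_cat.cat.categories))
print(code_to_label)   # {0: 'CorrectGo', 1: 'NoGo'}
print(y_int.tolist())  # [1, 0, -1, 0]
```

If any `-1` codes show up in `y_int`, they would be passed to the decoder as a third class, so they probably want dropping first.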

# Training

I'm going to start with `cv_train_test_sets` and see how that goes. It seems likely it'll have to be rewritten somewhat, but it might be a good starting point.

I think I need to run this without all the extra scaffolding--just testing the Decoder on the data until I get something sensible. At the very least we need to know the Decoder object is handling balanced classes correctly.
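
One quick sanity check on class balance, independent of the Decoder itself, is to look at the class counts and see what a majority-class baseline scores. This sketch uses synthetic labels with an arbitrary 80/20 split, not our data; it only shows why plain accuracy is misleading when classes are imbalanced and what a balance-aware metric reports instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic, deliberately imbalanced labels: 80 of class 0, 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((y.size, 1))  # features are irrelevant for the dummy baseline

class_counts = np.bincount(y)
# Baseline that always predicts the most frequent class.
baseline_pred = DummyClassifier(strategy="most_frequent").fit(X, y).predict(X)

plain_acc = accuracy_score(y, baseline_pred)
balanced_acc = balanced_accuracy_score(y, baseline_pred)
print(class_counts, plain_acc, balanced_acc)  # [80 20] 0.8 0.5
```

Any score from the real decoder should be read against the equivalent baseline computed on our own labels.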

In [None]:
# use add PFC mask.ipynb to figure out how to get a PFC mask onto this data.

In [10]:
dec_main = Decoder(standardize=True,cv=GroupKFold(3),scoring='roc_auc',n_jobs=cpus_to_use,mask=mask_nifti)
cv_results = cv_train_test_sets(
    trainset_X = all_subjects['X'],
    trainset_y = all_subjects['y_int'],
    trainset_groups = all_subjects['metadata']['subject'],
    decoders = [dec_main],
    cv=KFold(n_splits=3) # we use KFold, not GroupKFold, because the split is already done over groups (subjects)
    )

Groups are the same.
fold 1 of 3
In order to test on a training group of 20 items, holding out the following subjects:['DEV041' 'DEV006' 'DEV023' 'DEV024' 'DEV009' 'DEV010' 'DEV012' 'DEV021'
 'DEV014' 'DEV015']. prepping fold data.... fitting.... 17.1 GiB. trying decoder 1 of 1. 
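
The fold message above (20 subjects kept, 10 held out) is consistent with the splitter being applied to the list of subjects rather than to individual trials. The internals of `cv_train_test_sets` aren't shown here, so the following is only a sketch of that assumed behaviour, with made-up subject IDs: a plain `KFold` over unique subjects gives trial-level folds in which no subject appears on both sides.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical subject labels: 6 subjects with 4 trials each.
trial_subjects = np.repeat(["S1", "S2", "S3", "S4", "S5", "S6"], 4)
unique_subjects = np.unique(trial_subjects)

folds = []
for _, test_s in KFold(n_splits=3).split(unique_subjects):
    # Map the held-out subjects back to trial-level indices.
    held_out = unique_subjects[test_s]
    test_mask = np.isin(trial_subjects, held_out)
    folds.append((np.where(~test_mask)[0], np.where(test_mask)[0]))

for train_idx, test_idx in folds:
    shared = set(trial_subjects[train_idx]) & set(trial_subjects[test_idx])
    print(len(train_idx), len(test_idx), shared)  # 16 8 set()
```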

In [None]:
# trainset_X only exists inside cv_train_test_sets; use the same data that was passed in
final_prediction = dec_main.predict(all_subjects['X'])



In [None]:
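from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

y_obs = all_subjects['y_int']
# observed vs. predicted counts (wrap in print() to display from mid-cell)
pd.DataFrame({'obs':y_obs,'pred':final_prediction}).value_counts()
#get precision and recall
print(precision_recall_fscore_support(y_obs,final_prediction,average='macro'))

#get roc_auc (computed on hard labels here, not on decision scores)
print(roc_auc_score(y_obs,final_prediction))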

(0.5804711445017567, 0.6827315628209372, 0.5669378446968172, None)
0.6827315628209372
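
Note that the ROC AUC printed above is identical to the macro recall in the tuple (both 0.6827315628209372). That is what happens when `roc_auc_score` is given hard binary labels instead of continuous scores: the ROC curve has a single interior point, and the area reduces to the mean of the two per-class recalls. The sketch below shows this on synthetic data with a logistic regression standing in for the decoder; whether our decoder's continuous scores are usable in the same way is something to check.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score

# Synthetic binary problem with a modest class signal.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 3)) + 0.7 * y[:, None]

clf = LogisticRegression().fit(X, y)
hard_labels = clf.predict(X)         # 0/1 predictions
scores = clf.decision_function(X)    # continuous decision values

auc_from_labels = roc_auc_score(y, hard_labels)
macro_recall = recall_score(y, hard_labels, average="macro")
auc_from_scores = roc_auc_score(y, scores)

# AUC on hard labels collapses to macro recall; AUC on scores uses the full ranking.
print(np.isclose(auc_from_labels, macro_recall))
print(auc_from_labels, auc_from_scores)
```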


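We aren't going to get train/test results from this: the predictions above are for the same trials that were passed in as the training set, so they are not held-out results. We need to figure out how to pull the held-out observations for each fold and score those.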
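
One possible way to get held-out predictions, sketched with plain scikit-learn on synthetic data rather than with `cv_train_test_sets` or the nilearn `Decoder`: `cross_val_predict` returns, for every trial, the prediction from the fold in which that trial's subject was held out. The subject count and feature sizes are made up, and a logistic regression stands in for the decoder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold, cross_val_predict

# Synthetic stand-in: 6 hypothetical subjects, 25 trials each, 4 features.
rng = np.random.default_rng(1)
groups = np.repeat(np.arange(6), 25)
y = rng.integers(0, 2, size=groups.size)
X = rng.normal(size=(y.size, 4)) + 0.7 * y[:, None]

# Each trial's score comes from a model that never saw that trial's subject.
held_out_scores = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y,
    groups=groups, cv=GroupKFold(n_splits=3), method="decision_function",
)
held_out_auc = roc_auc_score(y, held_out_scores)
print(held_out_scores.shape, round(held_out_auc, 3))
```

Pooling scores across folds like this mixes outputs from differently fitted models, so scoring each fold separately and averaging is a reasonable alternative.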