This file tests a new pipeline in which cv_train_test_sets selects the overall best regressor, treating the choice of regressor as one more hyper-parameter.

Open question: should cv_train_test_sets also use that regressor for a final round of training and prediction, or should it just pass back the regressor object and let the wrapper class do that?

In [8]:
from apply_loocv_and_save import *


In [9]:
def cv_train_test_sets(
    trainset_X, trainset_y, trainset_groups,
    regressors = None,
    testset_X = None,testset_y = None, testset_groups = None,
    param_grid=None,
    cpus_to_use=-2,
    cv = None):
    """
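    Runs K-fold cross-validation in which the training and testing rows may come from two
    different datasets ('trainset' and 'testset') that share the same groups.
    Every group is used for both training and testing, but in each fold the training rows are
    drawn from trainset and the testing rows from testset.
    This enables us to e.g., pass in aggregated images for training and separate images for testing.

    trainset_X: X values (images) applicable for TRAINING
    trainset_y: y values matching trainset_X
    trainset_groups: group allocations for the trainset dataset
    regressors: list of regressors to compare; if None, a default DecoderRegressor is used
    testset_X: X values (images) applicable for TESTING; defaults to trainset_X
    testset_y: y values matching testset_X
    testset_groups: group allocations for the testset dataset
    param_grid: passed to the default regressor; must be None if regressors is passed
    cpus_to_use: n_jobs for the default regressor
    cv: cross-validator applied to the array of unique group names (default KFold(n_splits=5))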
    """
    if cv is None:
        cv=KFold(n_splits=5)

    if param_grid is not None and regressors is not None:
        raise Exception('values for param_grid and regressors both passed, but param_grid is ignored if regressors is passed. choose one.')

    #if the groups we're using are actually the same.
    if (testset_X is None) and (testset_y is None):
        testset_X = trainset_X
        testset_y = trainset_y
        testset_groups = trainset_groups
        print('Groups are the same.')

    results_by_trainset_item = pd.DataFrame({
        'y': trainset_y,
        'group':trainset_groups,
        'y_pred':np.repeat(None,len(trainset_y))#,
        #'y_match':np.repeat(None,len(trainset_y))#just for debugging. delete.
    })


    groups_array = np.array(list(set(testset_groups)))
    assert(set(trainset_groups)==set(testset_groups))

    #the CV that the inner Regressor uses
    cv_inner = GroupKFold(3)
    if regressors is None:
        regressors = [DecoderRegressor(standardize= True,param_grid=param_grid,cv=cv_inner,scoring="r2",
                                      n_jobs=cpus_to_use)]
        print('using default regressor',end='. ')

    #we actually use KFold on the group names themselves, then filter across that
    #that's equivalent to doing a GroupedKFold on the data.
    test_scores = []
    results = []

    if isinstance(cv, LeaveOneOut):
        outer_n=len(groups_array)
    else:
        outer_n = cv.get_n_splits()
    for i, x in enumerate(cv.split(groups_array)):
        train_i = x[0]
        test_i = x[1]
        print("fold " + str(i+1) + " of " + str(outer_n))

        fold_i_results = {}
        train_group_items, test_group_items = groups_array[train_i], groups_array[test_i]
        print('In order to test on a training group of ' +
              str(len(train_group_items)) + ' items, holding out the following subjects:' +
              str(test_group_items),end='. ')
#         print(
#             'held out ' + str(len(test_group_items)) + ' items and trained on ' + str(len(train_group_items)) + ' items',
#             end='. ')

        print('prepping fold data...',end='. ')
        #select training data from the averages
        #print('selecting training data',end='. ')
        train_selector = [idx for idx, g in enumerate(trainset_groups) if g in train_group_items]
        train_y = trainset_y[train_selector]
        train_X = nib.funcs.concat_images([trainset_X.slicer[...,s] for s in train_selector])
        train_groups = trainset_groups[train_selector]
        #print(train_X.shape,end='. ')
        #print(asizeof_fmt(train_X),end='. ')

        #select testing data from the individual values
        #print('selecting test data',end='. ')
        test_selector = [idx for idx, g in enumerate(testset_groups) if g in test_group_items]
        test_y = testset_y[test_selector]
        test_X = nib.funcs.concat_images([testset_X.slicer[...,s] for s in test_selector])
        test_groups = testset_groups[test_selector]
        #print(asizeof_fmt(test_X),end='. ')
        #print(test_X.shape,end='. ')


        print("regressing...",end='. ')
        print(asizeof_fmt(train_X),end='. ')

        val_scores = []
        #iterate through regressor objects.
        #this is my way of doing cross-validation across different regressors...
        hyper_scores = []
        train_results = {}
        inner_cv_results = {}
        for r_i, reg in enumerate(regressors):
            cur_r_results = {}
            print('trying regressor ' + str(r_i+1) + ' of ' + str(len(regressors)),end='. ')
            #if the regressor runs nested CV internally, its best hyper-parameters are already being chosen
            #we need only to finish the job by identifying the best overall regressor, as the final hyper-parameter
            reg.fit(y=train_y,X=train_X,groups=train_groups)
            print("predicting",end='. ')
            #hyper_score = reg.score(train_X,train_y)
            hyper_score = np.max([np.mean(param_values) for param_name, param_values in reg.cv_scores_.items()])
            #I think there is a bug here: we should not have to guess/ignore param names in cv_scores_.
            #need to report this.

            hyper_scores.append(hyper_score)

            cur_r_results['hyper_score'] = hyper_score
            cur_r_results['cv_scores_'] = reg.cv_scores_
            cur_r_results['cv_params_'] = reg.cv_params_
            inner_cv_results[str(reg)] = cur_r_results

        fold_i_results['train_results']= inner_cv_results

        #identify which was the best
        #print(hyper_scores)
        #print(np.where([h==np.max(hyper_scores) for h in hyper_scores])[0][0])
        best_hyper_regressor = regressors[int(np.argmax(hyper_scores))] #first index wins ties

        #print(best_hyper_regressor)

        #now run JUST that one on this fold.


        #now predict on our test split
        test_score = best_hyper_regressor.score(test_X,test_y)
        test_y_pred = best_hyper_regressor.predict(test_X)
        fold_test_rawdata = pd.DataFrame({
            'y_obs':test_y,
            'y_pred':test_y_pred,
            'y_groups':test_groups

        })
        #results_by_trainset_item.loc[train_selector,'y_pred']
        results_by_trainset_item.loc[test_selector,'y_pred'] = test_y_pred
        #results_by_trainset_item.loc[test_selector,'y_match'] = test_y
        fold_i_results['fold_test_rawdata'] = fold_test_rawdata
        #so we can do scoring externally to this function.

        test_scores.append(test_score)
        print('test score was:',end='. ')
        print(test_score)

        results.append(fold_i_results)

        del test_X
        del train_X
        gc.collect() #clean up. this is big data we're working with
        #https://stackoverflow.com/questions/1316767/how-can-i-explicitly-free-memory-in-python
        
    #TODO: run the regressor once more on the train AND test data to get an overall beta image.
    #if we ran more than one regressor, which one do we refit? We'd need to examine their performance
    #across all the folds and select one overall.
    #so we need to end up with a matrix of which regressor does best on which fold.
    #best_hyper_regressor.fit()

    return(test_scores,results,results_by_trainset_item)
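
The function splits the unique group names with the outer cross-validator and then selects the rows that belong to each side, which keeps every group on exactly one side of each fold. A minimal sketch of that idea on made-up data, assuming only numpy and scikit-learn; the subject labels and row counts are hypothetical, and np.isin stands in for the list comprehension used above:

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical group label per row (3 rows per subject); not the real data.
row_groups = np.repeat(np.array(['s1', 's2', 's3', 's4', 's5', 's6']), 3)
unique_groups = np.unique(row_groups)

fold_test_groups = []
for train_i, test_i in KFold(n_splits=3).split(unique_groups):
    train_names, test_names = unique_groups[train_i], unique_groups[test_i]
    # Map the group-level split back to row indices.
    train_rows = np.flatnonzero(np.isin(row_groups, train_names))
    test_rows = np.flatnonzero(np.isin(row_groups, test_names))
    # No group contributes rows to both sides of the split.
    assert set(row_groups[train_rows]).isdisjoint(row_groups[test_rows])
    # Every row lands on exactly one side.
    assert len(train_rows) + len(test_rows) == len(row_groups)
    fold_test_groups.append(sorted(str(name) for name in test_names))

print(fold_test_groups)
```

With LeaveOneOut() as the splitter, the same loop holds out a single group per fold.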


In [12]:

def apply_loocv_and_save(
    results_filepath,
    brain_data_filepath = '../data/Brain_Data_2sns_60subs.pkl',
    train_test_markers_filepath = "../data/train_test_markers_20210601T183243.csv",
    subjs_to_use = None, #set this to get a subset, otherwise use all of them.
    response_transform_func = None
    ):
    #pd.set_option('display.max_rows', 99)
    test_train_set = pd.read_csv(train_test_markers_filepath)

    with open(brain_data_filepath, 'rb') as pkl_file:
        Brain_Data_allsubs = pickle.load(pkl_file)
    
    dev_wtp_io_utils.check_BD_against_test_train_set(Brain_Data_allsubs,test_train_set)

    #################################################
    #######PRE-PROCESS

    
    if response_transform_func is None:
        Brain_Data_allsubs.Y = Brain_Data_allsubs.X['response'].copy()
    else:
        Brain_Data_allsubs.Y = response_transform_func(Brain_Data_allsubs.X)
    
        
    print(Brain_Data_allsubs.Y.value_counts())
    Brain_Data_allsubs.Y[Brain_Data_allsubs.Y=='NULL']=None
    print(Brain_Data_allsubs.Y.value_counts())

    import sys
    for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
                             key= lambda x: -x[1])[:10]:
        print(name + ': ' + str(size))
    print(Brain_Data_allsubs.Y.isnull().value_counts())
    Brain_Data_allsubs_nn = Brain_Data_allsubs[Brain_Data_allsubs.Y.isnull()==False]
    print(len(Brain_Data_allsubs_nn))
    print(len(Brain_Data_allsubs))


    all_subs_nn_nifti = Brain_Data_allsubs_nn.to_nifti()
    all_subs_nn_nifti_Y = Brain_Data_allsubs_nn.Y
    all_subs_nn_nifti_groups = Brain_Data_allsubs_nn.X.subject


    all_subs_nn_nifti_metadata = Brain_Data_allsubs_nn.X


    #################################################
    #######GET SUB-SET


    del Brain_Data_allsubs
    #del Brain_Data_allsubs_grouped
    gc.collect()

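    from nilearn.decoding import DecoderRegressor
    #NOTE: param_grid, cv_inner and cpus_to_use are not defined in this function. They must already
    #exist in the enclosing namespace (e.g. via the star import), otherwise this line raises NameError.
    #The recorded run below reports 2 regressors (RidgeCV and SVR), so it used a different list.
    regressors = [DecoderRegressor(standardize= True,param_grid=param_grid,cv=cv_inner,scoring="r2",
                                      n_jobs=cpus_to_use)]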

    print(asizeof_fmt(Brain_Data_allsubs_nn))

    print(asizeof_fmt(all_subs_nn_nifti))
    
    if subjs_to_use is None:
        subjs_to_use=len(np.unique(all_subs_nn_nifti_groups))

    sample_subject_items = np.unique(all_subs_nn_nifti_groups)[0:subjs_to_use] #get all of them
    sample_subject_vector = [i for i, x in enumerate(all_subs_nn_nifti_groups) if x in sample_subject_items]

    first_subs_nifti = nib.funcs.concat_images([all_subs_nn_nifti.slicer[...,s] for s in sample_subject_vector])
    first_subs_nifti_Y = all_subs_nn_nifti_Y[sample_subject_vector]
    first_subs_nifti = nil.image.clean_img(first_subs_nifti,detrend=False,standardize=True)
    first_subs_nifti_groups = all_subs_nn_nifti_groups[sample_subject_vector]

    del all_subs_nn_nifti
    gc.collect()

    first_subs_nifti_metadata = all_subs_nn_nifti_metadata.loc[sample_subject_vector,:]

    print("starting LeaveOneOut")
    #in this design, we're actually dealing with groups
    #we select group IDs and then grab the subjects
    #so we don't need to use LeaveOneGroupOut
    #the grouping is implicit
    cv_outer = LeaveOneOut()

    print("finished preprocessing")




    test_scores_same,tt_results,results_by_trainset_item = cv_train_test_sets(
        trainset_X=first_subs_nifti,
        trainset_y=first_subs_nifti_Y,
        trainset_groups=first_subs_nifti_groups,
        regressors = regressors,
        cv=cv_outer,
        cpus_to_use=cpus_available

    )

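    #NOTE: test_scores_same holds one score per fold, so [0] is the first fold only;
    #np.mean(test_scores_same) would average across all folds.
    print(test_scores_same[0])
    print(np.mean(test_scores_same[0]))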

    print('finished learning')

    with open(results_filepath, 'wb') as handle:
        pickle.dump([test_scores_same,tt_results,results_by_trainset_item],handle)


    print('saved.')
    


In [14]:
import pickle


    
apply_loocv_and_save(
    results_filepath="../data/cv_train_test_ns_5subjs_outer_n_loocv.pkl",
    brain_data_filepath = '../data/Brain_Data_ns_5subs.pkl',
    train_test_markers_filepath = "../data/train_test_markers_20210601T183243.csv"#,
#    response_transform_func =transform_normalize
    
)


checked for intersection and no intersection between the brain data and the subjects was found.
there were 5 subjects overlapping between the subjects marked for train data and the training dump file itself.
5.0    111
6.0     81
7.0     64
8.0     51
Name: response, dtype: int64
5.0    111
6.0     81
7.0     64
8.0     51
Name: response, dtype: int64
test_train_set: 9549
pkl_file: 168
results_filepath: 98
train_test_markers_filepath: 95
brain_data_filepath: 80
sys: 72
Brain_Data_allsubs: 48
subjs_to_use: 16
response_transform_func: 16
False    307
True      13
Name: response, dtype: int64
307
320
284.3 MiB
1.0 GiB
starting LeaveOneOut
finished preprocessing
Groups are the same.
fold 1 of 5
In order to test on a training group of 4 items, holding out the following subjects:['DEV001']. prepping fold data.... regressing.... 1.6 GiB. trying regressor 1 of 2. 



predicting. trying regressor 2 of 2. 



predicting. test score was:. 0.2316291643071221
fold 2 of 5
In order to test on a training group of 4 items, holding out the following subjects:['DEV006']. prepping fold data.... regressing.... 1.7 GiB. trying regressor 1 of 2. 



predicting. trying regressor 2 of 2. 



predicting. test score was:. 0.20887206935528646
fold 3 of 5
In order to test on a training group of 4 items, holding out the following subjects:['DEV009']. prepping fold data.... regressing.... 1.6 GiB. trying regressor 1 of 2. 



predicting. trying regressor 2 of 2. 



predicting. test score was:. 0.06890307588877076
fold 4 of 5
In order to test on a training group of 4 items, holding out the following subjects:['DEV010']. prepping fold data.... regressing.... 1.7 GiB. trying regressor 1 of 2. 



predicting. trying regressor 2 of 2. 



predicting. test score was:. 0.3311807547709793
fold 5 of 5
In order to test on a training group of 4 items, holding out the following subjects:['DEV005']. prepping fold data.... regressing.... 1.6 GiB. trying regressor 1 of 2. 



predicting. trying regressor 2 of 2. 



predicting. test score was:. 0.20553738643177666
0.2316291643071221
0.2316291643071221
finished learning
saved.
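
The last two numbers in the log are both the first fold's score, because the function prints test_scores_same[0] and then the mean of that single value. A small sketch of the across-fold summary, with the five fold scores copied from the log above:

```python
from statistics import fmean

# Per-fold test scores copied from the log above (held-out subject -> score).
fold_test_scores = {
    'DEV001': 0.2316291643071221,
    'DEV006': 0.20887206935528646,
    'DEV009': 0.06890307588877076,
    'DEV010': 0.3311807547709793,
    'DEV005': 0.20553738643177666,
}

# Mean across all five held-out subjects, not just the first one.
mean_test_score = fmean(fold_test_scores.values())

# The subject the model generalised to worst.
worst_subject = min(fold_test_scores, key=fold_test_scores.get)

print(round(mean_test_score, 4), worst_subject)
```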


In [16]:
with open("../data/cv_train_test_ns_5subjs_outer_n_loocv.pkl",'rb') as handle:
    loocv_results = pickle.load(handle)

In [21]:
loocv_results[1][0].keys()

dict_keys(['train_results', 'fold_test_rawdata'])

In [23]:
loocv_results[1][0]['train_results']

{'DecoderRegressor(estimator=RidgeCV(alphas=array([ 0.1,  1. , 10. ])),\n                 memory=Memory(location=None))': {'hyper_score': 0.1334553659782704,
  'cv_scores_': {'beta': [0.24546541019031898,
    0.12846594873346473,
    -0.06741689987425015,
    0.22730700486354805]},
  'cv_params_': {'beta': {}}},
 "DecoderRegressor(estimator=SVR(kernel='linear', max_iter=10000.0),\n                 memory=Memory(location=None))": {'hyper_score': 0.15676823512438506,
  'cv_scores_': {'beta': [0.2596661569483836,
    0.12787342116889444,
    -0.01240405238771003,
    0.25193741476797227]},
  'cv_params_': {'beta': {'C': [100.0, 100.0, 100.0, 100.0]}}}}
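
hyper_score is the mean of the inner-CV scores stored in cv_scores_; with a single 'beta' key the outer max has no effect. The sketch below recomputes it from the values printed above. The short names 'RidgeCV' and 'SVR' are labels chosen here for readability, not keys from the results.

```python
import numpy as np

# cv_scores_ for the two regressors in fold 0, copied from the output above.
cv_scores_by_regressor = {
    'RidgeCV': {'beta': [0.24546541019031898, 0.12846594873346473,
                         -0.06741689987425015, 0.22730700486354805]},
    'SVR': {'beta': [0.2596661569483836, 0.12787342116889444,
                     -0.01240405238771003, 0.25193741476797227]},
}

def hyper_score_from(cv_scores):
    # Same expression as in cv_train_test_sets: mean score per key, then the max over keys.
    return np.max([np.mean(values) for values in cv_scores.values()])

hyper_scores = {name: hyper_score_from(scores)
                for name, scores in cv_scores_by_regressor.items()}

# The regressor that would be selected for this fold.
best_name = max(hyper_scores, key=hyper_scores.get)
print(hyper_scores, best_name)
```

Both recomputed values agree with the stored hyper_score entries, and SVR is the regressor selected for this fold.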

In [60]:
hyper_scores_across_folds_list = []
for fold_i in range(len(loocv_results[1])):
    fold_hyper_scores = [pd.DataFrame(
        {"decoder":[k],
         "fold":fold_i,
         "hyper_score":[v['hyper_score']]}) for k,v in loocv_results[1][fold_i]['train_results'].items()]
    hyper_scores_across_folds_list = hyper_scores_across_folds_list + fold_hyper_scores
hyper_scores_across_folds = pd.concat(hyper_scores_across_folds_list).reset_index(drop=True)
hyper_scores_across_folds

                                             decoder  fold  hyper_score
0  DecoderRegressor(estimator=RidgeCV(alphas=arra...     0     0.133455
1  DecoderRegressor(estimator=SVR(kernel='linear'...     0     0.156768
2  DecoderRegressor(estimator=RidgeCV(alphas=arra...     1     0.183787
3  DecoderRegressor(estimator=SVR(kernel='linear'...     1     0.176635
4  DecoderRegressor(estimator=RidgeCV(alphas=arra...     2     0.208691
5  DecoderRegressor(estimator=SVR(kernel='linear'...     2     0.197646
6  DecoderRegressor(estimator=RidgeCV(alphas=arra...     3     0.158236
7  DecoderRegressor(estimator=SVR(kernel='linear'...     3     0.161157
8  DecoderRegressor(estimator=RidgeCV(alphas=arra...     4     0.106324
9  DecoderRegressor(estimator=SVR(kernel='linear'...     4     0.103360


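Now we need the folds to 'vote' on the best regressor.
I don't know if we should be doing this: choosing the regressor from scores pooled across the outer folds might be a form of overfitting. Not sure. For now, take the mean hyper_score per decoder across folds.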

In [61]:
hyper_scores_across_folds.groupby('decoder').hyper_score.mean()

decoder
DecoderRegressor(estimator=RidgeCV(alphas=array([ 0.1,  1. , 10. ])),\n                 memory=Memory(location=None))    0.158099
DecoderRegressor(estimator=SVR(kernel='linear', max_iter=10000.0),\n                 memory=Memory(location=None))       0.159113
Name: hyper_score, dtype: float64
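
Averaging is one way to let the folds 'vote'; counting how many folds each decoder wins is another. A small sketch comparing the two rules on the hyper_score values from the table above; 'ridge' and 'svr' are shorthand labels introduced here for the two decoder strings:

```python
import pandas as pd

# hyper_score values copied from the table above; 'ridge' and 'svr' are shorthand labels.
scores = pd.DataFrame({
    'decoder': ['ridge', 'svr'] * 5,
    'fold': [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    'hyper_score': [0.133455, 0.156768, 0.183787, 0.176635, 0.208691,
                    0.197646, 0.158236, 0.161157, 0.106324, 0.103360],
})

# One row per fold, one column per decoder.
by_fold = scores.pivot(index='fold', columns='decoder', values='hyper_score')

# Voting: each fold votes for its top-scoring decoder.
votes = by_fold.idxmax(axis=1).value_counts()

# Averaging: the rule used in the groupby above.
mean_scores = by_fold.mean()

print(votes.idxmax(), mean_scores.idxmax())
```

On these five folds the two rules disagree: ridge wins 3 of 5 folds while svr has the slightly higher mean (0.159113 vs 0.158099), so the margin looks too small to treat either one as clearly better.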

In [42]:
fold_hyper_scores

[                                                       decoder  \
 cv_params_   DecoderRegressor(estimator=RidgeCV(alphas=arra...   
 cv_scores_   DecoderRegressor(estimator=RidgeCV(alphas=arra...   
 hyper_score  DecoderRegressor(estimator=RidgeCV(alphas=arra...   
 
                                                    hyper_score  
 cv_params_                                        {'beta': {}}  
 cv_scores_   {'beta': [0.24546541019031898, 0.1284659487334...  
 hyper_score                                           0.133455  ,
                                                        decoder  \
 cv_params_   DecoderRegressor(estimator=SVR(kernel='linear'...   
 cv_scores_   DecoderRegressor(estimator=SVR(kernel='linear'...   
 hyper_score  DecoderRegressor(estimator=SVR(kernel='linear'...   
 
                                                    hyper_score  
 cv_params_       {'beta': {'C': [100.0, 100.0, 100.0, 100.0]}}  
 cv_scores_   {'beta': [0.2596661569483836, 0.12787342116889...

Right, I think we have the data we need here. How do we select the best regressor across all the train_results?

Internally, for each fold, we are comparing regressors by hyper_score (the mean inner-CV score), right?

In [None]:
best_hyper_regressor = regressors[np.where([h==np.max(hyper_scores) for h in hyper_scores])[0][0]]
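
The expression above takes the first index whose score equals the maximum. A short sketch on made-up scores, showing that np.argmax returns the same index, including when two regressors tie:

```python
import numpy as np

# Hypothetical scores with a tie for the maximum at indices 1 and 3.
hyper_scores = [0.10, 0.25, 0.20, 0.25]

# The selection rule as written in the notebook.
via_where = np.where([h == np.max(hyper_scores) for h in hyper_scores])[0][0]

# The equivalent one-call form.
via_argmax = int(np.argmax(hyper_scores))

# Both return the first index holding the maximum.
assert via_where == via_argmax == 1
```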