# Main Code used for Decoding
### Documentation created by Ana Vedoveli on 27 Feb 2017


This code is used for parameter optimization on decoding tasks, mainly to refine feature selection procedures. When I wrote this, I was just beginning to learn Python, so anyone with a keen eye will notice that major parts of this code are similar to the scikit-learn tutorial. However, having this tutorial might still be helpful to others.

I mainly used GridSearchCV to find the best feature reduction technique and the best number of features to keep. I also only did this using **whole trial decoding**. My reasoning here was that whatever parameters I found on the whole trial decoding could be generalized to the time-decoding. Otherwise, I would have to run GridSearchCV for *every one* of the n_times classifiers trained during the time decoding.
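
To make the difference between the two decoding schemes concrete, here is a minimal sketch with random numbers standing in for the real epochs. The array sizes are made up for illustration and are not the ones from my data.

```python
import numpy as np

# Hypothetical dimensions, chosen only for illustration.
n_trials, n_chans, n_times = 40, 8, 50
rng = np.random.RandomState(0)
data = rng.randn(n_trials, n_chans, n_times)

# Whole trial decoding: one feature vector per trial
# (all channels x all time points), so a single classifier.
X_whole = data.reshape(n_trials, n_chans * n_times)

# Time decoding: one feature matrix per time point,
# so one classifier (and one grid search) per time point.
X_per_time = [data[:, :, t] for t in range(n_times)]

print(X_whole.shape)        # (40, 400)
print(len(X_per_time))      # 50
print(X_per_time[0].shape)  # (40, 8)
```

With these toy sizes, a grid search on the time decoding would have to be repeated 50 times, against once for the whole trial decoding.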

**OBS**: This tutorial requires a reasonable understanding of how dictionaries work in Python.
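
As a quick refresher, the sketch below shows the kind of dictionary this tutorial relies on: keys that name a pipeline step and one of its parameters, joined by a double underscore. The step names match the ones used later in this tutorial; the values `5` and `0.1` are arbitrary examples.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([
    ('scaling', StandardScaler()),
    ('reduce_dim', SelectKBest(f_classif)),
    ('classify', SVC(kernel='linear')),
])

# A dictionary maps keys to values. Here each key is
# '<step name>__<parameter name>' and each value is a setting.
params = {'reduce_dim__k': 5, 'classify__C': 0.1}

# Every valid key shows up in get_params().
valid_keys = pipe.get_params()
print(all(key in valid_keys for key in params))  # True

# set_params unpacks the dictionary and updates the matching steps.
pipe.set_params(**params)
print(pipe.named_steps['reduce_dim'].k)  # 5
print(pipe.named_steps['classify'].C)    # 0.1
```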

**OBS2**: Sometimes people use nested grid search methods (a grid search inside each fold of an outer cross-validation) because the non-nested version might give inflated results. This tutorial implements a NON-nested GridSearchCV because of the computational cost that nested grid search implies.
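
For readers who want to see the difference, this is a rough sketch of both versions on synthetic data. It is not part of my analysis: the data come from `make_classification`, and the grid values and fold counts are small arbitrary choices so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption: 120 trials, 40 features).
X, y = make_classification(n_samples=120, n_features=40,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    ('scaling', StandardScaler()),
    ('reduce_dim', SelectKBest(f_classif)),
    ('classify', SVC(kernel='linear', class_weight='balanced')),
])
param_grid = {'reduce_dim__k': [5, 10, 20], 'classify__C': [0.1, 1]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
grid = GridSearchCV(pipe, param_grid=param_grid, cv=inner_cv, scoring='roc_auc')

# Non-nested: score of the best setting on the same folds used to pick it.
grid.fit(X, y)
non_nested_score = grid.best_score_

# Nested: the whole grid search is rerun inside each outer training fold,
# and the chosen model is scored on the held-out outer fold.
nested_scores = cross_val_score(grid, X, y, cv=outer_cv, scoring='roc_auc')

print('non-nested AUC:', non_nested_score)
print('nested AUC:', nested_scores.mean())
```

The nested version fits the full grid once per outer fold, which is where the extra computational cost comes from.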

---

As always, we start by importing the functions we are using in this code:

In [1]:
import matplotlib.pyplot as plt
from scipy.io import (loadmat, savemat)
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif, chi2
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.utils import class_weight
import os
import numpy as np
from mne.datasets import sample
from sklearn.preprocessing import LabelEncoder

Now we will define our own functions. There are three functions in this code: load_cor, which loads the data from Matlab; grid_dim_red, which implements my routine for comparing the performance of the parameters using GridSearchCV; and plot_dimcomp, which plots the results of the comparisons (*NOTE that in this example the SVM is fit with probability=False, so the AUC is NOT computed on probabilistic outputs; with scoring='roc_auc', scikit-learn scores the continuous decision_function values of the SVM*). We will go through each function now:

In [27]:
## This code is meant to work on WINDOWS. 
def load_cor(xDir, var_name='struct_cor'):
    import mne
    from mne import create_info
    from mne.epochs import EpochsArray
    import scipy.io as sio
    import numpy as np
    # load Matlab/Fieldtrip data
    mat = sio.loadmat(xDir, squeeze_me=True, struct_as_record=False)
    ft_data = mat[var_name]
    event = ft_data.trialinfo[:, 1]

    # convert to mne
    n_trial, n_chans, n_time = ft_data.trial.shape
    data = ft_data.trial

    sfreq = 200  # sampling frequency in Hz (hard-coded)
    time = ft_data.time

    # channels of interest: here, all of them
    coi = list(range(n_chans))
    data = data[:, coi, :]
    # str() keeps the names as text in Python 3
    # (.encode('ascii') would return bytes there)
    chan_names = [str(l) for l in ft_data.label[coi]]
    # every channel is treated as EEG
    chan_types = ['eeg'] * len(coi)
    info = create_info(chan_names, sfreq, chan_types)
    events = np.array([np.arange(n_trial), np.zeros(n_trial), event], int).T
    epochs = EpochsArray(data, info, events=events,
                         tmin=np.min(time), verbose=False)
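    # NOTE: read_montage was removed in recent MNE versions; there, use
    # mne.channels.make_standard_montage('GSN-HydroCel-257') instead.
    montage = mne.channels.read_montage('GSN-HydroCel-257')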
    epochs.set_montage(montage)
    return epochs, ft_data.trialinfo
    

## This is the function that runs the grid search.
# mainPath = path of your CI\Python\Subjects folder (ending with a backslash)
# name = name of your subject
# varname = name of the type of data. Can be _cor or _inter
# N_FEATURES_OPTIONS = list with the numbers of features to be tested
# C_OPTIONS = list with the values of C to be tested
def grid_dim_red(mainPath, name, varname, N_FEATURES_OPTIONS, C_OPTIONS):

    print('working on subject ' + name)
    
    # This is the same as the decoding script.
    # If you haven't seen the tutorial for the decoding
    # script yet, I would suggest that you should
    # read that tutorial first.
    filePath = mainPath + name + '\\'
    #loading labels for conditions
    yDir = filePath + 'trl_conditions.mat'
    Y = loadmat(yDir)
    conditions = Y['trl_conditions']
    Y = conditions.transpose().ravel()
    Y[Y==-1] = 0
    
    #loading data as epoch object
    print('loading data...')
    xDir = filePath + varname + '.mat'
    epochs, _ = load_cor(xDir, var_name=varname)
    
    #Retrieving data as matrix
    data = epochs.get_data()
    # As we are using this function to find the best
    # parameters for the whole trial decoding, we 
    # collapse the dimensions of channels and time points
    X = np.reshape(data, [data.shape[0], data.shape[1]*data.shape[2]])
    
    
    ## Now we start to prepare for the gridsearch CV.
    # The gridsearch CV takes as an input a dictionary 
    # with lists of parameters you would like to test.
    # From the scikit-learn documentation, the param_grid should be:
    # "A Dictionary with parameters names (string) 
    # as keys and lists of parameter settings to try as values, or a list of such 
    # dictionaries, in which case the grids spanned by each dictionary 
    # in the list are explored. 
    # This enables searching over any sequence of parameter settings."
    # We will go through this definition in more detail in
    # the next lines.
    
    
    # Here we test three different dimensionality reduction techniques:
    # - PCA
    # - Univariate feature selection
    # - K-means clustering
    print('defining reduction techniques')
    param_grid = [ # Here we define a *list of dictionaries*.
                   # Lists are defined with [].
                   # We have in total *3 dictionaries* with
                   # the different reduction methods we want 
                   # to try.
        { # This is our first dictionary. Dictionaries
          # are defined with {}. It defines the parameters
          # to be tested for the PCA.
            'reduce_dim': [PCA(iterated_power=7)], # this is our first dictionary entry.
                                                   # the key name is 'reduce_dim', and its
                                                   # value is the reduction technique you
                                                   # want to try.
            'reduce_dim__n_components': N_FEATURES_OPTIONS, # this is our second dictionary
                                                   # entry. the key name is
                                                   # 'reduce_dim__n_components' and its
                                                   # value is a list of the numbers of
                                                   # COMPONENTS of the PCA you would
                                                   # like to keep.
            
            # Pay attention: the names of the keys are NOT RANDOM.
            # They follow the pattern <step name>__<parameter name>.
            # For instance, the PCA class takes an input
            # named "n_components", that is the number of 
            # components you want to keep. Similarly, in the
            # next entry of our dictionary, the classifier
            # (named 'classify' -- see the comments next to the
            # pipeline some lines below) takes the input C,
            # thus the name of the key is classify__C. 
            
            'classify__C': C_OPTIONS # this is our third dictionary
                                     # entry. the key name is
                                     # 'classify__C' and its value is
                                     # a list of the C values
                                     # you would like to try out.
        },  
        
        # This is the second dictionary. It defines the parameters
        # to be tested for the univariate feature selection.
        # Similar to the previous example.
        {
            'reduce_dim': [SelectKBest(f_classif)],
            'reduce_dim__k': N_FEATURES_OPTIONS, # here the name is 
                                                 # reduce_dim__k because
                                                 # SelectKBest takes k as an
                                                 # input. k is the number of
                                                 # features you want to
                                                 # keep. 
            'classify__C': C_OPTIONS
        },  
        # This is the third dictionary. It defines the parameters
        # to be tested for the K-means cluster feature reduction.
        # Similar to the previous example.
        {
            'reduce_dim': [MiniBatchKMeans()],
            'reduce_dim__n_clusters': N_FEATURES_OPTIONS, # here the name is 
                                                 # reduce_dim__n_clusters 
                                                 # because the clustering class
                                                 # MiniBatchKMeans takes n_clusters
                                                 # as an input. 
            'classify__C': C_OPTIONS
        },  
        
        ## REMEMBER: C_OPTIONS AND N_FEATURES OPTIONS ARE INPUT OF OUR
        ## FUNCTION GRID_DIM_RED
        
    ]
    # now we define the labels for our reduction techniques, in the same
    # order as they appear in the list of dictionaries.
    reducer_labels = ['PCA', 'KBest(f_classif)', 'Clustering (K-means)']
    
    ## PIPELINE:
    # This is the pipeline for your classifier. 
    # The pipeline takes an important role in this
    # code.
    pipe = Pipeline([ # we give a name to each step.
                      # So the StandardScaler step is
                      # called 'scaling' and the classifier
                      # is called 'classify'. The 'reduce_dim' 
                      # step will change according to
                      # what you have previously specified 
                      # in param_grid
        ('scaling', StandardScaler()),
        ('reduce_dim', SelectKBest(f_classif)), # here we do not define k,
                                                # this is defined in the 
                                                # param_grid dictionaries.
        ('classify',  SVC(class_weight='balanced', probability=False, kernel='linear'))
    ])
    
    
    # Defining cv folds parameter
    inner_cv = StratifiedKFold(n_splits=5, shuffle=True)
    
    print('training gridsearch')
    # Now we actually perform the gridsearch. we actually use 
    # the gridsearch cv function.
    grid = GridSearchCV(pipe, cv=inner_cv, n_jobs=1, param_grid=param_grid, scoring='roc_auc')
    # now we fit our grid object with the data. This step takes a
    # reaaally long time.
    grid.fit(X, Y)
    
    print('saving results')
    # Now we want to get the comparison of the grid search.
    # The object grid has a very important attribute:
    # 'cv_results_'. This attribute is a dictionary
    # with different statistics of our grid search, such as
    # the mean fit time across folds for each
    # tested model, and the mean and std of the scores across
    # folds for each model. Here we want the key with
    # the mean results, and the name of the key is
    # 'mean_test_score'. In the next lines we get
    # the values inside of this key and save them.
    reduction_results = np.array(grid.cv_results_['mean_test_score'])
    np.save(filePath + 'reduction_results', reduction_results) 
    return grid, reduction_results # the function in the end
                                   # returns the grid object 
                                   # and the reduction results.


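## We have now computed the grid search, but we still need to visualize it.
# So this is what our next function will do!
        
# This function plots the results of the grid search.
# It reads C_OPTIONS and N_FEATURES_OPTIONS from the notebook's
# global scope, so define them before calling it.
def plot_dimcomp(name, mean_scores,
                 reducer_labels=('PCA', 'KBest(f_classif)', 'Clustering (K-means)')):
    
    # scores are in the order of param_grid iteration: one block per
    # dictionary (reducer); inside each block the keys are sorted
    # alphabetically, so C varies slowest and the number of features fastest
    mean_scores = mean_scores.reshape(len(reducer_labels), len(C_OPTIONS),
                                      len(N_FEATURES_OPTIONS))
    # select score for best C 
    # (in the case you have tried 
    #  more than one C)
    mean_scores = mean_scores.max(axis=1)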
    
    # Now we just plot. Feel free to change.
    # This plotting was completely taken
    # from the scikit-learn website!
    bar_offsets = (np.arange(len(N_FEATURES_OPTIONS)) *
                   (len(reducer_labels) + 1) + .5)

    plt.figure()
    COLORS = 'bgrcmyk'
    for i, (label, reducer_scores) in enumerate(zip(reducer_labels, mean_scores)):
        plt.bar(bar_offsets + i, reducer_scores, label=label, color=COLORS[i])

    plt.title("Comparing feature reduction techniques for " + name[0:8])
    plt.xlabel('Reduced number of features')
    plt.xticks(bar_offsets + len(reducer_labels) / 2, N_FEATURES_OPTIONS)
    plt.ylabel('Classification AUC')
    plt.ylim((0, 1))
    plt.legend(loc='upper left')
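
The plotting function depends on the order in which the grid search stores its scores, so it is worth checking that order once. The sketch below uses `ParameterGrid`, which enumerates settings in the same order as `GridSearchCV`. To keep it fast, the reducers are placeholder strings, `'reduce_dim__n'` is a made-up key standing in for `n_components` / `k` / `n_clusters`, and the "scores" are fake numbers that simply encode the position of each setting.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Small hypothetical option lists, only for illustration.
N_FEATURES_OPTIONS = [10, 20, 30]
C_OPTIONS = [0.1, 1]
reducers = ['PCA', 'KBest', 'KMeans']  # placeholder strings, not estimators

# Same structure as param_grid above: one dictionary per reduction technique.
param_grid = [
    {'reduce_dim': [r],
     'reduce_dim__n': N_FEATURES_OPTIONS,  # hypothetical key name
     'classify__C': C_OPTIONS}
    for r in reducers
]
settings = list(ParameterGrid(param_grid))

# Fake score = 100 * reducer index + 10 * C index + n_features index.
scores = np.array([
    100 * reducers.index(s['reduce_dim'])
    + 10 * C_OPTIONS.index(s['classify__C'])
    + N_FEATURES_OPTIONS.index(s['reduce_dim__n'])
    for s in settings
])

# One block per dictionary; inside a block, C varies slowest
# and the number of features varies fastest.
table = scores.reshape(len(reducers), len(C_OPTIONS), len(N_FEATURES_OPTIONS))
print(table[1, 0, 2])  # 102 -> second reducer, first C, third n_features
```

When only one value of C is tested, as in the example run further down, the C axis has length one and simply drops out.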

Great!! Now we have all our functions and we understand what they do. Let's try them out then. I always start by making a list of all the subjects I have:

In [None]:
subject = os.listdir('C:\\Users\\Ana\\Desktop\\CI\\Python\\Subjects')
subject = np.sort(subject)

Now we are going to use the function. We first have to define a list with the numbers of features we want to keep and a list with the values of C we want to test. Then, we call our function.

In [28]:
N_FEATURES_OPTIONS = [10, 20, 30, 40, 50, 60, 100]
C_OPTIONS = [1]

grid, reduction_results = grid_dim_red('C:\\Users\\Ana\\Desktop\\CI\\Python\\Subjects\\', subject[0], 'struct_cor', N_FEATURES_OPTIONS, C_OPTIONS)


working on subject DB171120v03HT
loading data...
defining reduction techniques
training gridsearch
saving results
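
The grid search above is scored with `scoring='roc_auc'`. As far as I understand scikit-learn, this scorer uses the continuous `decision_function` values when the classifier has them, which is the case for an SVM with `probability=False`. The sketch below, on synthetic data from `make_classification` (not my data), compares that with an AUC computed on the categorical predictions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data, only for illustration.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

clf = SVC(kernel='linear', class_weight='balanced', probability=False)
clf.fit(X_train, y_train)

# AUC on the categorical outputs (predicted labels 0/1).
auc_categorical = roc_auc_score(y_test, clf.predict(X_test))

# AUC on the continuous outputs (signed distance to the hyperplane).
auc_continuous = roc_auc_score(y_test, clf.decision_function(X_test))

print('AUC from predict():', auc_categorical)
print('AUC from decision_function():', auc_continuous)
```

The two numbers generally differ, because the categorical version only sees one threshold while the continuous version ranks all trials.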


In [26]:
# now we plot our results!
plot_dimcomp(subject[0], mean_scores=reduction_results)

NameError: name 'param_grid' is not defined

You can use the grid search to test everything: different classifiers, different filter cutoffs, and you can even use multiple scoring systems at the same time! It is really loads of fun, but it takes some time to compute. Have fun!
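
As an example of multiple scoring systems, here is a minimal sketch on synthetic data. It assumes a scikit-learn version that supports multi-metric scoring in GridSearchCV; the metric names `'auc'` and `'accuracy'` on the left-hand side of the dictionary are labels I chose, and the C values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, only for illustration.
X, y = make_classification(n_samples=120, n_features=20,
                           n_informative=4, random_state=0)

pipe = Pipeline([
    ('scaling', StandardScaler()),
    ('classify', SVC(kernel='linear')),
])

grid = GridSearchCV(
    pipe,
    param_grid={'classify__C': [0.01, 0.1, 1]},
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring={'auc': 'roc_auc', 'accuracy': 'accuracy'},
    refit='auc',  # the metric used to pick best_params_
)
grid.fit(X, y)

print(grid.best_params_)
# One column of results per metric, named 'mean_test_<label>'.
print(grid.cv_results_['mean_test_auc'])
print(grid.cv_results_['mean_test_accuracy'])
```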