## Use SVM to investigate which gamma frequency band represents the contrast level of grating stimuli.

For the research background and motivation of this analysis, see the paper: https://www.cell.com/current-biology/fulltext/S0960-9822(19)31020-6?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0960982219310206%3Fshowall%3Dtrue#secsectitle0035

The following analysis corresponds to the Results section __"Gratings Contrast Can Be Decoded Better Using NBG Than BBG"__ and the Methods section __"Visual grating contrast classification"__.

Two frequency bands of interest: 
<br>
1) 20-60 Hz (variable name in the code is 'NBG');
<br>
2) 70-150 Hz (variable name in the code is 'BBG').

This notebook will: 
<br>
1) read event files to extract contrast information of trials from the task.
<br>
2) read feature file to extract feature matrix as the input for the SVM modeling.
<br>
3) apply SVM modeling to the data of an example electrode.
<br>
4) show the code to loop through the pipeline with all the data contained in this project.

Note: 
<br>
This notebook doesn't provide the code for preprocessing the ECoG data. The data have already been cleaned and transformed into time-frequency data using a Morlet wavelet transform (that code is not provided in the repository either). You can take a look at 'extract_epoch.m' (for use in MATLAB) to see how I extract the task epochs into the feature file needed for the modeling. You may structure your EEG/ECoG files differently and need to adapt this step.

This notebook mainly demonstrates how the modeling is done for a single electrode in one patient.

In [2]:
import os
import scipy.io as sio
import numpy as np
import pandas as pd
import itertools
from sklearn import svm,preprocessing
from sklearn.model_selection import LeaveOneOut, permutation_test_score
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

### Section 1-3 will demonstrate the pipeline with data from one single example electrode

## 1. Read event file to extract the condition labels of task trials

Set the path of the file and read the mat file of condition events. I have provided one example event file from one task block of one patient subject in this repository.

In [3]:
# set path and read event file
# modify the event_path based on how you save your data

event_path   = 'example_data/task_events_example.mat'
event_file   = sio.loadmat(event_path)

In this project, I exclude the trials of one condition: those that present a grating stimulus at an orientation of 45 degrees. Here I delete the events of those trials.

In [4]:
# get contrast labels and delete those at 45 degrees
# (.item() replaces np.asscalar, which has been removed from recent NumPy releases)
contrast     = event_file['events_info']['trial_contrast'].item()
odd_orient   = event_file['events_info']['trial_orientation'].item()
contrast_new = np.delete(contrast, np.where(odd_orient == 45)[0]) 

I recode the three task conditions as 1, 2 and 3.

In [5]:
# label the task conditions (1 = 20% contrast, 2 = 50% contrast, 3 = 100% contrast)
label_v                 = np.reshape(contrast_new,(90,))
label_v[label_v == 1]   = 3
label_v[label_v == 0.5] = 2
label_v[label_v == 0.2] = 1

label_v is the label vector (the observed output values) needed for the SVM.
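The exclusion and recoding steps can be checked on a small synthetic example. The sketch below is not the notebook's own code: the arrays `contrast_demo` and `orient_demo` are made up, and it assumes contrast values of 0.2, 0.5 and 1. It uses `np.isclose` rather than `==`, a slightly more defensive choice for floating-point values loaded from a .mat file, and writes the labels into a new integer array.

```python
import numpy as np

# Synthetic stand-ins for the event file fields (assumed values, not real data)
contrast_demo = np.array([0.2, 0.5, 1.0, 0.2, 1.0, 0.5])
orient_demo = np.array([0, 45, 90, 135, 45, 0])

# Drop the trials shown at 45 degrees
keep = orient_demo != 45
contrast_kept = contrast_demo[keep]

# Recode contrast into integer class labels: 0.2 -> 1, 0.5 -> 2, 1.0 -> 3
labels_demo = np.zeros(contrast_kept.shape, dtype=int)
for level, code in [(0.2, 1), (0.5, 2), (1.0, 3)]:
    labels_demo[np.isclose(contrast_kept, level)] = code

print(labels_demo)
```

Writing into a separate integer array also avoids modifying the original contrast values in place.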

## 2. Read the time-series data
Real data from one electrode of one patient is provided in the repository for this notebook.

In [12]:
# read time series data
# the data will be the spectrogram data of one task block
tf_path = 'example_data/tf_epochs_example.mat'
tf_data = sio.loadmat(tf_path) # it is a dict file, the "tf_epochs" in it is the features we need

Take a look at the shape of the data to understand it.

In [13]:
tf_data['tf_epochs'].shape

(200, 250, 90)

The dimensions of tf_data['tf_epochs'] are:
<br>
"200" frequency points, from 2 Hz to 201 Hz;
<br>
"250" time points, from 250 ms to 500 ms after the stimulus onset;
<br>
"90" trials of the task -- grating stimuli at three contrast levels were shown to the subject for 90 times in one task.

Take the average of the data across the time window:

In [8]:
tf_avg = np.mean(tf_data['tf_epochs'],axis =1)

Next, define the boundaries of the two gamma frequency bands and extract the feature input matrices.

In [9]:
# define two gammas
NBG = [20,60]
BBG = [70,150]

# get feature inputs
# index = frequency - 2 because the frequency axis starts at 2 Hz
feature_NBG = tf_avg[(NBG[0]-2):(NBG[1]-1),:].T # transpose so that rows are trials and columns are frequency points
feature_BBG = tf_avg[(BBG[0]-2):(BBG[1]-1),:].T
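The index arithmetic above relies on the frequency axis running from 2 Hz to 201 Hz in 1 Hz steps. A minimal sketch of an alternative is shown below, assuming that same axis: it builds an explicit frequency vector and selects each band with a boolean mask. The names `freqs`, `tf_demo` and `band_features` are hypothetical, and the data are random numbers rather than real recordings.

```python
import numpy as np

# Assumed frequency axis: 200 points, 2 Hz to 201 Hz in 1 Hz steps
freqs = np.arange(2, 202)

# Random stand-in for tf_data['tf_epochs'] (frequency x time x trial)
rng = np.random.default_rng(0)
tf_demo = rng.random((200, 250, 90))
tf_avg_demo = tf_demo.mean(axis=1)  # average over time -> (200, 90)

def band_features(tf_avg, freqs, band):
    """Return a (trials x frequencies) matrix for an inclusive [low, high] band."""
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return tf_avg[mask, :].T

nbg_demo = band_features(tf_avg_demo, freqs, [20, 60])
bbg_demo = band_features(tf_avg_demo, freqs, [70, 150])
print(nbg_demo.shape, bbg_demo.shape)

# The mask selects the same rows as the index arithmetic used in the notebook
assert np.array_equal(nbg_demo, tf_avg_demo[(20 - 2):(60 - 1), :].T)
```

With an explicit frequency vector, the selection stays correct if the wavelet frequencies ever change, at the cost of carrying one extra array around.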

## 3. Train and test the SVM model.

The model pipeline will:
<br>
1) Scale the data using MinMax method.
<br>
2) Implement a leave-one-out-cross-validation method to get the prediction accuracy.
<br>
3) Get the significance level of the prediction accuracy using a permutation test (1000 permutations in this notebook).
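To make steps 1 and 2 concrete, here is a sketch of leave-one-out cross-validation written as an explicit loop on synthetic data (`X_demo` and `y_demo` are made up). The scaler is fitted on the training trials only and then applied to the held-out trial, which is what wrapping the scaler and classifier in a scikit-learn `Pipeline` does inside each fold. The last lines compute the same accuracy through a `Pipeline` for comparison.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic data: 30 trials, 5 features, 3 classes (made up for illustration)
rng = np.random.default_rng(0)
y_demo = np.repeat([1, 2, 3], 10)
X_demo = rng.normal(size=(30, 5)) + y_demo[:, None]  # class-dependent shift

n_correct = 0
for train_idx, test_idx in LeaveOneOut().split(X_demo):
    # Fit the scaler on the training trials only, then apply it to the held-out trial
    scaler = MinMaxScaler().fit(X_demo[train_idx])
    clf = SVC(kernel='linear').fit(scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    pred = clf.predict(scaler.transform(X_demo[test_idx]))
    n_correct += int(pred[0] == y_demo[test_idx][0])

loo_accuracy = n_correct / len(y_demo)

# Same computation through a Pipeline, for comparison
pipe = Pipeline([('scaler', MinMaxScaler()), ('clf', SVC(kernel='linear'))])
pipe_accuracy = cross_val_score(pipe, X_demo, y_demo, cv=LeaveOneOut(), scoring='accuracy').mean()
print(loo_accuracy, pipe_accuracy)
```

Fitting the scaler inside each fold keeps information from the held-out trial out of the training step.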

In neuroscience, the aim is not to build a fancy model with the highest possible prediction performance, but to use modeling to compare the prediction performance of different brain signals, or of a given signal from different brain areas.

We want to compare:
<br>
1) The prediction performance of model using NBG as input vs using BBG as input -- so we can see which frequency band is representing the stimulus contrast information.
<br>
2) The prediction performance of model using NBG within visual cortex vs outside of visual cortex -- it serves as a great control analysis to validate the SVM model. The NBG from areas outside of visual cortex should not be able to predict the stimulus visual properties.


In [9]:
# function of the SVM modeling
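def linear_SVM(feature_mx,npermu,label):
    """Run a linear SVM on the data of a single electrode.
    
    feature_mx: feature matrix (rows: trials; columns: frequency points).
    
    npermu: number of permutations for the permutation test.
    
    label: label vector (one contrast label per trial).
    
    return: leave-one-out accuracy and its permutation-test p-value. """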
 
    # SVM pipeline
    loo = LeaveOneOut()
    pipe = Pipeline([('scaler', preprocessing.MinMaxScaler()),('clf', svm.SVC(kernel='linear'))])
    score, permutation_scores, pvalue = permutation_test_score(pipe, feature_mx, label, scoring = 'accuracy', cv=loo, n_permutations=npermu, n_jobs = -1)

    return score,pvalue

In [10]:
# get the prediction results
score_NBG, pvalue_NBG = linear_SVM(feature_NBG,1000,label_v)
score_BBG, pvalue_BBG = linear_SVM(feature_BBG,1000,label_v)
print('the accuracy of model prediction using NBG is:' + str(score_NBG) + '; p-value: ' + str(pvalue_NBG))
print('the accuracy of model prediction using BBG is:' + str(score_BBG) + '; p-value: ' + str(pvalue_BBG))

the accuracy of model prediction using NBG is:0.6444444444444445; p-value: 0.000999000999000999
the accuracy of model prediction using BBG is:0.4222222222222222; p-value: 0.060939060939060936


In this example electrode, NBG decodes the contrast level with 64% accuracy (p ≈ 0.001), whereas BBG reaches 42% (p ≈ 0.06), so NBG is the band that represents the contrast level of the stimulus.
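The p-values returned by `permutation_test_score` follow the formula (C + 1) / (n_permutations + 1), where C is the number of permutations whose score is at least as high as the observed score. The sketch below reproduces that formula with a hypothetical helper, `permutation_pvalue`, on made-up permutation scores. It shows why the smallest p-value obtainable with 1000 permutations is 1/1001 ≈ 0.000999.

```python
import numpy as np

def permutation_pvalue(score, permutation_scores):
    """Share of permuted-label scores at least as high as the observed score,
    with +1 in numerator and denominator so the p-value is never exactly zero."""
    permutation_scores = np.asarray(permutation_scores)
    n_better = np.sum(permutation_scores >= score)
    return (n_better + 1) / (len(permutation_scores) + 1)

# Made-up permutation scores, for illustration only
rng = np.random.default_rng(0)
fake_perm_scores = rng.normal(loc=0.33, scale=0.05, size=1000)

# If no permuted score reaches the observed one, p equals 1 / (1000 + 1)
p_min = permutation_pvalue(2.0, fake_perm_scores)
print(p_min)
```

A p-value of 0.000999 therefore means that none of the 1000 label permutations matched the observed accuracy.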

Next I will extract the data of an electrode from areas outside of visual cortex and do the same modeling again.


In [11]:
# read data and make input feature
tf_path = 'example_data/tf_epochs_example_nonvisual.mat'
tf_data = sio.loadmat(tf_path)
tf_avg = np.mean(tf_data['tf_epochs'],axis =1)
feature_nonvisual = tf_avg[(NBG[0]-2):(NBG[1]-1),:].T

In [12]:
# get the prediction result
score_nonvisual, pvalue_nonvisual = linear_SVM(feature_nonvisual,1000,label_v)
print('the accuracy of model prediction using NBG outside of visual area is:' + str(score_nonvisual) + '; p-value: ' + str(pvalue_nonvisual))

the accuracy of model prediction using NBG outside of visual area is:0.36666666666666664; p-value: 0.17682317682317683


As expected, NBG from outside the visual cortex does not predict the contrast level of the visual stimulus (accuracy 37%, p ≈ 0.18), which supports the validity of the modeling approach.
Next, loop through all the visual electrodes from all the patients to get the prediction accuracy and p-value for each electrode.

## 4. Loop through data from all electrodes to the SVM pipeline
Because the raw data are large, I can't provide all the data used in this project in the repository. However, I will show how I store the subject information in a dataframe and the code that loops through the data using that dataframe.

Let me first make a "fake" subject information sheet.

In [22]:
# create the dataframe that stores the subject information
d = {'subject_code': ['AAA','AAB','AAC','AAD'],
     'task_block_code': ['001,013','009','014,023','005,006'],
     'electrode_code': ['097,098,099,100,122,123','065,066,077,088,089','001,002,003','123,124,125,126,127']}
df = pd.DataFrame(data = d)

In [9]:
df

| | subject_code | task_block_code | electrode_code |
| --- | --- | --- | --- |
| 0 | AAA | 001,013 | 097,098,099,100,122,123 |
| 1 | AAB | 009 | 065,066,077,088,089 |
| 2 | AAC | 014,023 | 001,002,003 |
| 3 | AAD | 005,006 | 123,124,125,126,127 |


In my real project, I have more subjects, more task blocks and more electrodes. In addition, I separate the electrodes into "visual" and "non_visual", here I just made the dataframe simple enough to understand.
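Because block and electrode codes are stored as comma-separated strings, they have to be split before use. A minimal sketch of one way to do this is shown below, using the same made-up sheet (here called `info`) and pandas' `str.split` and `explode` to get one row per electrode. This is an illustration, not the approach used in the rest of the notebook, which splits the strings inside the per-subject function.

```python
import pandas as pd

# Same made-up subject sheet as in the notebook
info = pd.DataFrame({
    'subject_code': ['AAA', 'AAB', 'AAC', 'AAD'],
    'task_block_code': ['001,013', '009', '014,023', '005,006'],
    'electrode_code': ['097,098,099,100,122,123', '065,066,077,088,089',
                       '001,002,003', '123,124,125,126,127'],
})

# Split the comma-separated electrode codes and give each electrode its own row
per_electrode = (
    info.assign(electrode_code=info['electrode_code'].str.split(','))
        .explode('electrode_code')
        .reset_index(drop=True)
)
print(per_electrode.head())
print(len(per_electrode))
```

If the sheet is stored as a CSV file, reading it back with `dtype=str` keeps the codes as strings, so leading zeros such as in '097' are preserved.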

In [45]:
# create a dataframe to save the result
# I need to record: subject code, electrode_code, accuracy, p-value
# Note: for a given electrode, data from multiple task blocks will be concatenated for SVM modeling
df_result = pd.DataFrame(columns=['subject','elec','NBG_acc','NBG_p','BBG_acc','BBG_p'])

# save it into csv
df_result.to_csv('result.csv',index = False)

In [46]:
df_result

| | subject | elec | NBG_acc | NBG_p | BBG_acc | BBG_p |
| --- | --- | --- | --- | --- | --- | --- |


In [16]:
def concate_blks(block_string,sbj,ci):
    '''
    concate the time-frequency data from different task blocks together
    
    block_string: element value in column task_block_code in subject information dataframe
    sbj: element value in column subject_code in subject information dataframe
    ci: electrode code of a single electrode, such as the '097' of subject AAA in the example dataframe
    
    return: numpy array of time-frequency data concatenated across blocks for one electrode
    '''
    
    # block codes are comma-separated, e.g. '001,013'
    blocks = block_string.split(',')
    
    for i in range(len(blocks)):
        block_code = blocks[i]
        
        # set path (it is a fake path here, modify it for your own project)
        path = '/{}/{}/{}.mat'.format(sbj,block_code,ci)
        tf_data = sio.loadmat(path)
        
        if i == 0:
            concated_tf_data = tf_data['tf_epochs']
        else:
            to_concate = tf_data['tf_epochs']
            concated_tf_data = np.concatenate((concated_tf_data,to_concate), axis = 2)
            
    return concated_tf_data
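The concatenation step can be illustrated without any files. The sketch below uses two made-up blocks shaped (frequency, time, trial), with the time axis shortened to keep the example light, and joins them along the trial axis. It also shows that the labels have to be concatenated in the same block order so that each trial keeps its label. All array names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up blocks shaped (frequency, time, trial); time axis shortened
block_a = rng.random((200, 25, 90))
block_b = rng.random((200, 25, 90))

# Collect the blocks in a list and concatenate once along the trial axis (axis 2)
stacked = np.concatenate([block_a, block_b], axis=2)
print(stacked.shape)

# Labels must be concatenated in the same block order as the data
labels_a = np.repeat([1, 2, 3], 30)
labels_b = np.repeat([3, 2, 1], 30)
labels_all = np.concatenate([labels_a, labels_b])
assert stacked.shape[2] == labels_all.shape[0]
```

Collecting the blocks in a list and calling `np.concatenate` once avoids the special case for the first block.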

In [17]:
def extract_feature_mx(tf_data,freq_band):
    '''
    get the feature input to be used in the SVM
    
    tf_data: time-frequency data
    freq_band: frequency boundary of the signal of interest

    return: numpy array of feature matrix (row: samples/task trials; col: frequency point)
    '''
    
    tf_avg = np.mean(tf_data,axis =1)
    feature_mx = tf_avg[(freq_band[0]-2):(freq_band[1]-1),:].T
    
    return feature_mx

In [47]:
def extract_label(sbj_code,block):
    '''
    return the label of one block of task from a given subject
    '''
    # set path (it is a fake path)
    event_path   = '/{}/{}.mat'.format(sbj_code,block)
    event_file   = sio.loadmat(event_path)
    
    # .item() replaces np.asscalar, which has been removed from recent NumPy releases
    contrast     = event_file['events_info']['trial_contrast'].item()
    odd_orient   = event_file['events_info']['trial_orientation'].item()
    contrast_new = np.delete(contrast, np.where(odd_orient == 45)[0])

    label_v                 = np.reshape(contrast_new,(90,))
    label_v[label_v == 1]   = 3
    label_v[label_v == 0.5] = 2
    label_v[label_v == 0.2] = 1
    
    return label_v

__A trick:__
<br>
At this point, you could use the functions above to parallelize the analysis if you have a large dataset (as is common in fMRI studies). ECoG studies, however, rarely include many subjects, so I won't use parallel processing here. <br> Instead, I use one function that runs the SVM modeling and saves the result for one subject, so I can process the subjects one by one and check each result as it finishes. A block of code that loops through all the subjects is still provided at the end of the notebook.

In [79]:
def sbj_SVM(sbj_code,npermu = 1000):
    '''
    run the pipeline of SVM for electrodes of one subject
    
    npermu : times of permutation test
    '''
    
    # read the csv file to save the result
    df_result = pd.read_csv('result.csv') # change the path as needed
    
    # get the row of data for the given subject
    row_sbj = df.loc[df['subject_code']==sbj_code] 
    
    # get block info and electrode list
    block_str = row_sbj['task_block_code'].values
    electrode_list = row_sbj['electrode_code'].values
    
    # extract the label
    blocks = block_str[0].split(',')
    
    for iblk in range(len(blocks)):
        label_v_blk = extract_label(sbj_code,blocks[iblk])
        if iblk == 0:
            label_v = label_v_blk
        else:
            label_v = np.concatenate((label_v,label_v_blk))
    
    # loop through the electrode
    electrode_list = electrode_list[0].split(',')
    
    for ielec in range(len(electrode_list)):
        ci = electrode_list[ielec]
        
        # concatenate data
        tf_data = concate_blks(block_str[0],sbj_code,ci) # pass the block string, not the array returned by .values
        
        # get the feature matrix
        feature_NBG = extract_feature_mx(tf_data,[20,60])
        feature_BBG = extract_feature_mx(tf_data,[70,150])
        
        # run SVM
        score_NBG, p_NBG = linear_SVM(feature_NBG,npermu,label_v)
        score_BBG, p_BBG = linear_SVM(feature_BBG,npermu,label_v)
        
        # append the data to the result dataframe
        data_to_add = pd.DataFrame([[sbj_code,ci,score_NBG,p_NBG,score_BBG,p_BBG]],columns = df_result.columns)
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        df_result = pd.concat([df_result, data_to_add], ignore_index = True)
    
    # save the result
    df_result.to_csv('result.csv',index = False)
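As a possible variation on the result handling, the rows can be collected in a plain list and turned into a dataframe once. The sketch below uses placeholder numbers that are not results from the study. It also round-trips the table through CSV text to show that reading the code columns back as strings keeps leading zeros in the electrode codes.

```python
import io
import pandas as pd

columns = ['subject', 'elec', 'NBG_acc', 'NBG_p', 'BBG_acc', 'BBG_p']

# Placeholder rows only; these numbers are not results from the study
rows = [
    ['AAA', '097', 0.5, 0.01, 0.4, 0.20],
    ['AAA', '098', 0.6, 0.02, 0.3, 0.50],
]

# Build the table once from the collected rows
df_demo = pd.DataFrame(rows, columns=columns)

# Round-trip through CSV text, reading the code columns back as strings
csv_text = df_demo.to_csv(index=False)
df_back = pd.read_csv(io.StringIO(csv_text), dtype={'subject': str, 'elec': str})
print(df_back)
```

Without the `dtype` argument, pandas would parse '097' as the integer 97, which no longer matches the electrode code used in the file paths.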

In [None]:
# run the following code to loop through all the subjects 

nsbj = len(df['subject_code'])

for isbj in range(nsbj):
    sbj_code = df['subject_code'][isbj]
    sbj_SVM(sbj_code)