# Model input ERP format preparation 

In this notebook: 
- Necessary inputs
- Read all epochs
- Function to create a dataframe with the average mismatch response for all participants
- Formatting dataframe as suitable model input

## Imports

In [5]:
import mne      # toolbox for analyzing and visualizing EEG data
import os       # using operating system dependent functionality (folders)
import pandas as pd # data analysis and manipulation
import numpy as np    # numerical computing (manipulating and performing operations on arrays of data)
import ipywidgets as widgets
from IPython.display import display
from numpy import trapz

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

import sys
sys.path.insert(0, '../eegyolk') # path to helper functions
#import eegyolk
import helper_functions as hf # library useful for eeg and erp data cleaning
import initialization_functions #library to import data
import epod_helper

In [2]:
metadata = pd.read_csv('metadata.csv', sep = ',')

In [3]:
metadata.head()

Unnamed: 0,eeg_file,ParticipantID,test,sex,age_months,dyslexic_parent,Group_AccToParents,path_eeg,path_epoch,path_eventmarkers,epoch_file
0,105a,105,a,f,17,f,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,105a_epo.fif
1,107a,107,a,f,16,m,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,107a_epo.fif
2,106a,106,a,m,19,f,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,106a_epo.fif
3,109a,109,a,m,21,m,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,109a_epo.fif
4,110a,110,a,m,17,m,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,110a_epo.fif


## Read all epochs from files

The function below loads the filtered epochs of all participants, using the epoch file paths and file names stored in the `metadata` dataframe. The epochs contain the data for each stimulus over a time interval of -0.2 to 0.8 s.
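
How `read_filtered_data` resolves each file is not shown in this notebook. As a rough sketch, assuming it joins the `path_epoch` and `epoch_file` columns of `metadata`, the expected file locations can be listed and checked before loading. The helper name `epoch_filepaths` and the toy rows are hypothetical.

```python
import os
import pandas as pd

def epoch_filepaths(metadata):
    """Return the full path of every epoch file listed in the metadata (hypothetical helper)."""
    return [os.path.join(folder, fname)
            for folder, fname in zip(metadata["path_epoch"], metadata["epoch_file"])]

# toy metadata with the same column names as metadata.csv (placeholder folder)
toy_metadata = pd.DataFrame({
    "eeg_file": ["105a", "107a"],
    "path_epoch": ["processed_epochs", "processed_epochs"],
    "epoch_file": ["105a_epo.fif", "107a_epo.fif"],
})

paths = epoch_filepaths(toy_metadata)
# report files that are listed in the metadata but absent on disk
missing = [p for p in paths if not os.path.exists(p)]
print(f"{len(missing)} of {len(paths)} epoch files not found")
```

Each existing path could then be read with `mne.read_epochs`; whether `read_filtered_data` does exactly this is an assumption.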

In [4]:
epochs = initialization_functions.read_filtered_data(metadata)

Checking out file: 105a_epo.fif
Checking out file: 107a_epo.fif
Checking out file: 106a_epo.fif
Checking out file: 109a_epo.fif
Checking out file: 110a_epo.fif
Checking out file: 112a_epo.fif
Checking out file: 111a_epo.fif
Checking out file: 114a_epo.fif
Checking out file: 115a_epo.fif
Checking out file: 117a_epo.fif
Checking out file: 116a_epo.fif
Checking out file: 118a_epo.fif
Checking out file: 119a_epo.fif
Checking out file: 123a_epo.fif
Checking out file: 122a_epo.fif
Checking out file: 124a_epo.fif
Checking out file: 127a_epo.fif
Checking out file: 125a_epo.fif
Checking out file: 126a_epo.fif
Checking out file: 130a_epo.fif
Checking out file: 128a_epo.fif
Checking out file: 129a_epo.fif
Checking out file: 131a_epo.fif
Checking out file: 135a_epo.fif
Checking out file: 133a_epo.fif
Checking out file: 137a_epo.fif
Checking out file: 138a_epo.fif
Checking out file: 139a_epo.fif
Checking out file: 141a_epo.fif
Checking out file: 144a_epo.fif
Checking out file: 143a_epo.fif
...

In [5]:
len(epochs)

101

## Create pandas dataframe with the average difference between standard and deviant responses

The function below takes `metadata`, the loaded `epochs`, and the definitions of the standard and deviant events as input. Define the standard and deviant events as lists. `input_mmr_prep` assumes that each deviant follows its standard event, so the deviant that belongs to a standard has the standard's event number + 1. Make sure your events are numbered this way; otherwise the function will not calculate the mismatch response correctly.
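
The numbering assumption can be checked before calling the function. The sketch below uses a hypothetical `event_dictionary` that maps event names to event numbers; the actual mapping used for these recordings is not shown in this notebook, so the numbers are placeholders.

```python
# hypothetical mapping from event name to event number (placeholder values)
event_dictionary = {
    "GiepM_S": 2, "GiepM_D": 3,
    "GiepS_S": 5, "GiepS_D": 6,
    "GopM_S": 8, "GopM_D": 9,
    "GopS_S": 11, "GopS_D": 12,
}

standard_names = ["GiepM_S", "GiepS_S", "GopM_S", "GopS_S"]
deviant_names = ["GiepM_D", "GiepS_D", "GopM_D", "GopS_D"]

def deviants_follow_standards(event_dictionary, standard_names, deviant_names):
    """Return True if every deviant number equals its standard number + 1."""
    return all(event_dictionary[dev] == event_dictionary[std] + 1
               for std, dev in zip(standard_names, deviant_names))

print(deviants_follow_standards(event_dictionary, standard_names, deviant_names))
```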

In [110]:
def input_mmr_prep(metadata, epochs, standard_events, deviant_events):
    # channels for which the features are computed
    chnames_list = ['FC5', 'Pz', 'O1', 'PO4', 'AF4']
    rows = []  # one dict per (participant, channel); DataFrame.append was removed in pandas 2.0

    # loop over all participants
    for i in range(len(metadata['eeg_file'])):
        # average the standard and the deviant epochs of this participant
        std_evoked = epochs[i][standard_events].average()
        dev_evoked = epochs[i][deviant_events].average()

        # difference wave; weights=[1, -1] computes standard minus deviant
        evoked_diff_all = mne.combine_evoked([std_evoked, dev_evoked], weights=[1, -1])

        for channel in chnames_list:
            # 1-D time course of the difference wave for this channel
            evoked_diff = evoked_diff_all.get_data(picks=channel).ravel()

            mmr_avg = evoked_diff.mean()
            mmr_std = evoked_diff.std()
            mmr_sur = trapz(evoked_diff)  # area under the curve with unit sample spacing

            # number of sign changes between consecutive samples (exact zeros are not counted)
            signs = np.sign(evoked_diff)
            mmr_zero = int(np.count_nonzero(signs[:-1] * signs[1:] < 0))

            # add 'paradigm': paradigm if we want to separate the paradigms
            rows.append({'eeg_file': metadata['eeg_file'][i], 'channel': channel,
                         'mean': mmr_avg, 'std': mmr_std, 'sur': mmr_sur, 'zero': mmr_zero})

    return pd.DataFrame(rows, columns=["eeg_file", "channel", "mean", "std", "sur", "zero"])

In [111]:
# define the events for standard and deviant
standard_events = ['GiepM_S', 'GiepS_S', 'GopM_S', 'GopS_S']
deviant_events = ['GiepM_D', 'GiepS_D', 'GopM_D', 'GopS_D']


df = input_mmr_prep(metadata, epochs, standard_events, deviant_events)
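
The four features stored per channel (`mean`, `std`, `sur` for the area under the curve, and `zero` for the number of zero crossings) can be illustrated on a short made-up signal. This is a minimal sketch that does not use MNE; the helper name `waveform_features` and the sample values are hypothetical, and the area is computed with the trapezoidal rule at unit sample spacing.

```python
import numpy as np

def waveform_features(signal):
    """Mean, standard deviation, area (unit sample spacing) and zero crossings of a 1-D signal."""
    signal = np.asarray(signal, dtype=float)
    # trapezoidal rule with unit spacing, written out explicitly
    area = float(np.sum((signal[:-1] + signal[1:]) / 2.0))
    # a crossing is a strict sign change between two consecutive samples
    signs = np.sign(signal)
    zero_crossings = int(np.count_nonzero(signs[:-1] * signs[1:] < 0))
    return {"mean": float(signal.mean()), "std": float(signal.std()),
            "sur": area, "zero": zero_crossings}

# made-up difference wave with three sign changes
toy_wave = [1.0, 2.0, -1.0, -3.0, 2.0, 0.5, -0.5]
features = waveform_features(toy_wave)
print(features)
```

Samples that are exactly zero do not count as a crossing here, which matches the strict `> 0` and `< 0` comparisons in `input_mmr_prep`.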


In [96]:
df

Unnamed: 0,eeg_file,channel,mean,std,sur,zero
0,105a,FC5,2.082464e-06,2.365847e-06,0.004263,9
1,105a,Pz,8.545349e-07,1.060541e-06,0.001750,13
2,105a,O1,4.316986e-06,2.899746e-06,0.008843,7
3,105a,PO4,-6.360535e-07,1.299175e-06,-0.001303,14
4,105a,AF4,3.254148e-06,3.685909e-06,0.006662,7
...,...,...,...,...,...,...
500,221a,FC5,2.459493e-08,9.421346e-07,0.000051,13
501,221a,Pz,-1.907806e-07,9.055398e-07,-0.000390,9
502,221a,O1,-1.660748e-06,1.022857e-06,-0.003402,3
503,221a,PO4,8.192106e-08,9.280260e-07,0.000167,13


In [97]:
df = df.drop_duplicates(subset=['eeg_file','channel']) # ,'paradigm'

## Transpose dataframe into combination of paradigm and channel per participant

We now want a single row per participant, with one column for every combination of feature and channel (the paradigm is currently not used as a separate level). The code below generates this dataframe.

In [98]:
# transformation of the dataframe
df = df.pivot(index='eeg_file', columns=['channel']) # 'paradigm',

In [99]:
df.columns = ['_'.join(str(s).strip() for s in col if s) for col in df.columns]

In [100]:
df.reset_index(inplace=True)
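
The reshaping steps can be followed on a toy table. This is a minimal sketch with made-up values; only the column names `eeg_file`, `channel`, `mean` and `zero` are taken from the dataframe above.

```python
import pandas as pd

# toy long-format table: one row per (participant, channel), with one duplicated row
long_df = pd.DataFrame({
    "eeg_file": ["101a", "101a", "101a", "103a", "103a"],
    "channel":  ["FC5",  "Pz",   "Pz",   "FC5",  "Pz"],
    "mean":     [0.1,    0.2,    0.2,    0.3,    0.4],
    "zero":     [9,      13,     13,     7,      5],
})

# pivot raises on duplicate (index, column) pairs, so remove them first
long_df = long_df.drop_duplicates(subset=["eeg_file", "channel"])

# one row per participant; columns become a (feature, channel) MultiIndex
wide_df = long_df.pivot(index="eeg_file", columns="channel")

# flatten the MultiIndex into names such as "mean_FC5"
wide_df.columns = ["_".join(col) for col in wide_df.columns]
wide_df = wide_df.reset_index()
print(wide_df.columns.tolist())
```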

## Merge and save dataframe

We still need to merge some of the metadata into the dataframe, so that each row also contains the age, sex and group label of the participant.

In [101]:
df = pd.merge(df, metadata, on='eeg_file')
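
`pd.merge` performs an inner join by default, so participants that appear in only one of the two tables are dropped without a warning. The sketch below, on made-up rows, adds the `validate` argument as an optional safeguard against duplicated `eeg_file` keys; it is an illustration, not part of the original workflow.

```python
import pandas as pd

features = pd.DataFrame({"eeg_file": ["101a", "103a"], "mean_FC5": [0.1, 0.3]})
toy_metadata = pd.DataFrame({
    "eeg_file": ["101a", "103a", "104a"],
    "age_months": [20, 20, 18],
    "sex": ["m", "f", "m"],
})

# inner join (the default): only participants present in both tables are kept;
# validate raises a MergeError if eeg_file is duplicated on either side
merged = pd.merge(features, toy_metadata, on="eeg_file", validate="one_to_one")
print(merged)
```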

In [102]:
pd.set_option('display.max_columns', None)

In [103]:
df

Unnamed: 0,eeg_file,mean_AF4,mean_FC5,mean_O1,mean_PO4,mean_Pz,std_AF4,std_FC5,std_O1,std_PO4,std_Pz,sur_AF4,sur_FC5,sur_O1,sur_PO4,sur_Pz,zero_AF4,zero_FC5,zero_O1,zero_PO4,zero_Pz,ParticipantID,test,sex,age_months,dyslexic_parent,Group_AccToParents,path_eeg,path_epoch,path_eventmarkers,epoch_file
0,101a,-1.117406e-06,-6.776742e-07,-5.976661e-07,-2.431471e-06,-1.670712e-06,1.384333e-06,1.019953e-06,0.000002,1.925103e-06,1.365521e-06,-0.002290,-0.001389,-0.001222,-0.004980,-0.003422,11,9,10,5,1,101,a,m,20,m,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,101a_epo.fif
1,103a,-4.819496e-07,-1.517147e-06,1.613725e-05,3.867000e-06,4.153652e-06,1.525937e-06,2.198310e-06,0.000016,3.247296e-06,3.280949e-06,-0.000986,-0.003107,0.033049,0.007922,0.008508,15,18,5,7,12,103,a,f,20,m,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,103a_epo.fif
2,104a,-6.350983e-07,-3.097631e-07,5.055879e-07,1.914926e-07,-2.296004e-07,1.344636e-06,7.629572e-07,0.000001,1.635251e-06,1.049152e-06,-0.001301,-0.000635,0.001034,0.000391,-0.000471,15,14,19,12,15,104,a,m,18,f,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,104a_epo.fif
3,105a,3.254148e-06,2.082464e-06,4.316986e-06,-6.360535e-07,8.545349e-07,3.685909e-06,2.365847e-06,0.000003,1.299175e-06,1.060541e-06,0.006662,0.004263,0.008843,-0.001303,0.001750,7,9,7,14,13,105,a,f,17,f,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,105a_epo.fif
4,106a,-2.109778e-07,-1.575644e-06,3.118421e-07,9.674898e-07,9.443440e-07,1.523331e-06,1.672483e-06,0.000002,2.113902e-06,1.382373e-06,-0.000429,-0.003228,0.000640,0.001983,0.001934,16,7,17,9,13,106,a,m,19,f,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,106a_epo.fif
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,217a,-4.344623e-07,7.895747e-07,9.808674e-07,2.532761e-06,1.870996e-07,1.168110e-06,1.105110e-06,0.000002,1.851433e-06,8.426879e-07,-0.000892,0.001617,0.002010,0.005189,0.000383,24,23,12,6,20,217,a,f,18,f,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,217a_epo.fif
97,218a,3.762315e-07,2.818658e-07,-4.217384e-07,-1.218666e-06,2.278748e-07,1.354678e-06,1.530969e-06,0.000001,1.151433e-06,8.372269e-07,0.000771,0.000575,-0.000865,-0.002496,0.000467,11,12,25,12,14,218,a,m,18,Nee,Control,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,218a_epo.fif
98,219a,1.682623e-06,9.445005e-07,3.314331e-06,2.087868e-06,2.146474e-06,1.533014e-06,1.305128e-06,0.000002,1.746317e-06,1.377061e-06,0.003447,0.001934,0.006789,0.004278,0.004397,7,23,5,7,9,219,a,m,18,m,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,219a_epo.fif
99,220a,-2.031275e-07,5.980443e-07,-1.078600e-06,-2.165710e-06,-2.057660e-06,8.269482e-07,8.132629e-07,0.000002,1.702823e-06,1.294922e-06,-0.000415,0.001225,-0.002209,-0.004435,-0.004214,22,31,28,2,3,220,a,m,19,Nee,Control,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,220a_epo.fif


Drop some unnecessary columns. 

In [104]:
df = df.drop(columns=['eeg_file', 'dyslexic_parent', 'path_eeg', 'path_epoch',
                      'epoch_file', 'path_eventmarkers'])

In [105]:
df['sex'] = np.where(
    (df['sex']=='m'), 1,0)

df['Group_AccToParents'] = np.where(
    (df['Group_AccToParents']=='At risk'), 1,0)
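
`np.where` assigns 0 to every value that is not an exact match, including typos and missing values. As a hedged alternative, assuming the only labels are `m`/`f` and `At risk`/`Control` as in the tables above, an explicit mapping makes unexpected labels visible. The toy rows are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["m", "f", "f"],
    "Group_AccToParents": ["At risk", "Control", "At risk"],
})

# explicit mappings; any value missing from a mapping becomes NaN instead of 0
toy["sex"] = toy["sex"].map({"m": 1, "f": 0})
toy["Group_AccToParents"] = toy["Group_AccToParents"].map({"At risk": 1, "Control": 0})

# fail early if a label was not covered by the mappings
assert not toy.isna().any().any(), "unmapped label found"
print(toy)
```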

In [106]:
first = df.pop('Group_AccToParents')
df.insert(0, 'Group_AccToParents', first)

In [107]:
df.to_csv('df_avg_mmr.csv', index=False) # save dataframe

## PCA for feature reduction

In [20]:
X = df.drop(columns=['Group_AccToParents', 'test']) # 'test' holds strings such as 'a', which PCA cannot convert to float
y = df['Group_AccToParents']

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [22]:
pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

In [23]:
explained_variance = pca.explained_variance_ratio_

In [None]:
explained_variance
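
The feature columns live on very different scales (amplitudes around 1e-6, zero-crossing counts in the tens, age in months), and PCA is sensitive to scale. The sketch below standardises the columns before the PCA and keeps the components that explain 95% of the variance. It runs on synthetic data; the 95% threshold and the pipeline itself are assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# synthetic stand-in for the feature matrix: columns on very different scales
X_toy = np.column_stack([
    rng.normal(0, 1e-6, 80),   # e.g. a mean amplitude
    rng.normal(0, 1e-3, 80),   # e.g. an area feature
    rng.integers(0, 30, 80),   # e.g. a zero-crossing count
    rng.integers(16, 22, 80),  # e.g. age in months
]).astype(float)

# standardise each column, then keep the components explaining 95% of the variance
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95, svd_solver="full"))
X_reduced = pipeline.fit_transform(X_toy)

explained = pipeline.named_steps["pca"].explained_variance_ratio_
print(X_reduced.shape, explained.round(3))
```

Fitting the pipeline on the training split only and calling `pipeline.transform` on the test split keeps the test data out of the scaling and the PCA fit.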

In [None]:
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
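
With about 100 participants, a single 80/20 split leaves roughly 20 test rows, so the accuracy can change noticeably with the split. A hedged sketch of stratified 5-fold cross-validation on synthetic data is given below; the number of folds and the random data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 6))      # synthetic features, about as many rows as participants
y_toy = rng.integers(0, 2, size=100)   # synthetic binary labels

classifier = RandomForestClassifier(max_depth=2, random_state=0)
# stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(classifier, X_toy, y_toy, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```

Because the labels here are random, the mean accuracy should stay near chance level; on the real features the spread across folds gives a rough idea of how stable the estimate is.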

In [None]:
X.shape