# Model input ERP format preparation 

In this notebook: 
- Necessary inputs
- Read all epochs
- Function to create a dataframe with the average mismatch response for all participants
- Formatting dataframe as suitable model input

## Imports

In [1]:
import mne      # toolbox for analyzing and visualizing EEG data
import os       # using operating system dependent functionality (folders)
import pandas as pd # data analysis and manipulation
import numpy as np    # numerical computing (manipulating and performing operations on arrays of data)
import ipywidgets as widgets
from IPython.display import display

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

import sys
sys.path.insert(0, '../eegyolk') # path to helper functions
#import eegyolk
import helper_functions as hf # library useful for eeg and erp data cleaning
import initialization_functions #library to import data
import epod_helper

In [2]:
metadata = pd.read_csv('metadata.csv', sep = ',')

In [3]:
metadata.head()

Unnamed: 0,eeg_file,ParticipantID,test,sex,age_months,dyslexic_parent,Group_AccToParents,path_eeg,path_epoch,path_eventmarkers,epoch_file
0,105a,105,a,f,17,f,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,105a_epo.fif
1,107a,107,a,f,16,m,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,107a_epo.fif
2,106a,106,a,m,19,f,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,106a_epo.fif
3,109a,109,a,m,21,m,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,109a_epo.fif
4,110a,110,a,m,17,m,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,110a_epo.fif


## Read all epochs from files

The function below loads all filtered epochs listed in the metadata dataframe, which holds the path and filename of each epoch file. Each loaded item contains the epochs for every stimulus, spanning a time interval of -0.2 to 0.8 s.

In [4]:
epochs = initialization_functions.read_filtered_data(metadata)

Checking out file: 105a_epo.fif
Checking out file: 107a_epo.fif
Checking out file: 106a_epo.fif
Checking out file: 109a_epo.fif
Checking out file: 110a_epo.fif
Checking out file: 112a_epo.fif
Checking out file: 111a_epo.fif
Checking out file: 114a_epo.fif
Checking out file: 115a_epo.fif
Checking out file: 117a_epo.fif
Checking out file: 116a_epo.fif
Checking out file: 118a_epo.fif
Checking out file: 119a_epo.fif
Checking out file: 123a_epo.fif
Checking out file: 122a_epo.fif
Checking out file: 124a_epo.fif
Checking out file: 127a_epo.fif
Checking out file: 125a_epo.fif
Checking out file: 126a_epo.fif
Checking out file: 130a_epo.fif
Checking out file: 128a_epo.fif
Checking out file: 129a_epo.fif
Checking out file: 131a_epo.fif
Checking out file: 135a_epo.fif
Checking out file: 133a_epo.fif
Checking out file: 137a_epo.fif
Checking out file: 138a_epo.fif
Checking out file: 139a_epo.fif
Checking out file: 141a_epo.fif
Checking out file: 144a_epo.fif
Checking out file: 143a_epo.fif
...

In [5]:
len(epochs)

101
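
The internals of `initialization_functions.read_filtered_data` are not shown in this notebook. As a rough sketch only, and assuming it simply joins the `path_epoch` and `epoch_file` columns of `metadata` and reads each file with `mne.read_epochs`, a loader could look like the following. The names `epoch_paths` and `load_epochs` are hypothetical and not part of the eegyolk library.

```python
import os
import pandas as pd


def epoch_paths(metadata):
    """Build one epoch file path per metadata row (hypothetical helper)."""
    return [
        os.path.join(folder, filename)
        for folder, filename in zip(metadata["path_epoch"], metadata["epoch_file"])
    ]


def load_epochs(metadata):
    """Read every epoch file listed in metadata, keeping row order (sketch)."""
    import mne  # imported here so epoch_paths works without MNE installed

    epochs = []
    for path in epoch_paths(metadata):
        print("Checking out file:", os.path.basename(path))
        epochs.append(mne.read_epochs(path, preload=False, verbose=False))
    return epochs


# Toy metadata with the same column names as metadata.csv (folder name is made up)
toy = pd.DataFrame({
    "path_epoch": ["processed_epochs", "processed_epochs"],
    "epoch_file": ["105a_epo.fif", "107a_epo.fif"],
})
paths = epoch_paths(toy)
```

Because the list keeps the row order of `metadata`, `epochs[i]` lines up with row `i`, which the function in the next section relies on.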

## Create pandas dataframe with the average difference between standard and deviant responses

The function below takes `metadata`, the loaded `epochs`, and the names of the standard and deviant events as input. Define the standard and deviant events as lists. The function `input_mmr_prep` assumes that each deviant follows its standard event, so the deviant belonging to a standard has the standard's event number + 1. Make sure your events are numbered this way; otherwise the function will not calculate the mismatch response.
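
One way to verify the numbering convention described above is to inspect the event dictionary before computing anything. The sketch below assumes that event names end in `_S` for standards and `_D` for deviants, as in this dataset, and that the mapping from names to integer codes is available as a dictionary (in MNE this is `epochs[i].event_id`). The helper `check_standard_deviant_pairs` and the integer codes are hypothetical.

```python
def check_standard_deviant_pairs(event_id, suffix_std="_S", suffix_dev="_D"):
    """Return the standard names whose deviant is not numbered standard + 1."""
    mismatched = []
    for name, code in event_id.items():
        if not name.endswith(suffix_std):
            continue
        deviant = name[: -len(suffix_std)] + suffix_dev
        # A missing deviant yields None, which also counts as a mismatch
        if event_id.get(deviant) != code + 1:
            mismatched.append(name)
    return mismatched


# Hypothetical event codes; the real ones come from epochs[i].event_id
example_event_id = {"GiepM_S": 2, "GiepM_D": 3, "GiepS_S": 5, "GiepS_D": 6}
bad_event_id = {"GopM_S": 8, "GopM_D": 10}

ok = check_standard_deviant_pairs(example_event_id)
bad = check_standard_deviant_pairs(bad_event_id)
```

An empty list means every standard has a deviant numbered directly after it; any names returned point to pairs that break the assumption.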

In [38]:
def input_mmr_prep(metadata, epochs, standard_events, deviant_events):
    # channels of interest; use evoked_diff.info['ch_names'] for all channels
    chnames_list = ['Pz', 'PO3', 'O1', 'Oz', 'O2', 'PO4']
    rows = []  # one dict per (participant, channel); add 'paradigm' to separate the paradigms

    # loop over all participants; epochs[i] belongs to the i-th row of metadata
    for i, eeg_file in enumerate(metadata['eeg_file']):
        std_evoked = epochs[i][standard_events].average()
        dev_evoked = epochs[i][deviant_events].average()

        # difference wave between the evoked responses (standard minus deviant)
        evoked_diff = mne.combine_evoked([std_evoked, dev_evoked], weights=[1, -1])

        # compute for every channel the features of the mismatch line
        for channel in chnames_list:
            chnames = mne.pick_channels(evoked_diff.info['ch_names'], include=[channel])
            roi_dict = dict(left_ROI=chnames)  # combine_channels only takes a dictionary as input
            roi_evoked = mne.channels.combine_channels(evoked_diff, roi_dict, method='mean')
            mmr = roi_evoked.to_data_frame()['left_ROI']

            rows.append({'eeg_file': eeg_file, 'channel': channel,
                         'mean': mmr.mean(), 'std': mmr.std(), 'skew': mmr.skew(),
                         'kurt': mmr.kurtosis(), 'var': mmr.var(),
                         'min': mmr.min(), 'max': mmr.max()})

    # DataFrame.append was removed in pandas 2.0, so build the frame in one go
    columns = ['eeg_file', 'channel', 'mean', 'kurt', 'max', 'min', 'skew', 'std', 'var']
    return pd.DataFrame(rows, columns=columns)
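
The arithmetic inside `input_mmr_prep` can be illustrated without any EEG files. The sketch below uses synthetic single-channel averages (not real data) and plain NumPy and pandas to show what the `weights=[1, -1]` combination and the seven summary statistics compute. The variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic single-channel averages (not real EEG), sampled from -0.2 to 0.8 s
times = np.linspace(-0.2, 0.8, 11)
std_avg = np.zeros_like(times)                  # flat standard response
dev_avg = np.where(times >= 0.25, 2.0, 0.0)     # deviant deflects after 0.25 s

# Same weighting as combine_evoked(..., weights=[1, -1]): standard minus deviant
diff = pd.Series(1 * std_avg + (-1) * dev_avg)

features = {
    "mean": diff.mean(),
    "std": diff.std(),        # sample standard deviation (ddof=1)
    "var": diff.var(),        # sample variance, the square of "std"
    "skew": diff.skew(),
    "kurt": diff.kurtosis(),
    "min": diff.min(),
    "max": diff.max(),
}
```

With these weights, a deviant that is more positive than the standard gives a negative difference wave. If the conventional deviant-minus-standard mismatch response is wanted, the weights would need to be swapped.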

In [39]:
# define the events for standard and deviant
standard_events = ['GiepS_S'] #'GiepM_S','GiepS_S','GopM_S','GopS_S'
deviant_events = ['GiepS_D'] #'GiepM_D','GiepS_D','GopM_D','GopS_D'


df = input_mmr_prep(metadata, epochs, standard_events, deviant_events)


In [40]:
df

Unnamed: 0,eeg_file,channel,mean,kurt,max,min,skew,std,var
0,105a,Pz,2.682689,-0.585351,8.334603,-2.723028,0.045122,2.400998,5.764790
1,105a,PO3,2.561561,0.264753,11.240470,-3.645459,0.743415,3.307474,10.939385
2,105a,O1,5.828814,-0.444278,15.308534,-2.952760,0.075801,4.419833,19.534927
3,105a,Oz,2.843554,0.745639,12.816242,-3.083279,0.926846,3.364777,11.321726
4,105a,O2,2.111776,0.701477,13.064098,-3.626968,1.123020,3.970303,15.763302
...,...,...,...,...,...,...,...,...,...
601,221a,PO3,0.473636,-0.196523,6.917947,-3.822040,0.624424,2.519234,6.346540
602,221a,O1,2.230484,-0.765944,9.261145,-3.797041,0.532549,3.330508,11.092286
603,221a,Oz,3.437303,-0.964063,10.835565,-3.223536,0.379832,3.829308,14.663602
604,221a,O2,5.172815,-1.107701,13.839930,-2.037918,0.255450,4.602085,21.179188


In [41]:
df = df.drop_duplicates(subset=['eeg_file','channel']) # ,'paradigm'

## Transpose dataframe into combination of paradigm and channel per participant

We now want a single row for every participant, with one column for each combination of statistic and channel (and of paradigm, if the paradigms are kept separate). The code below generates this dataframe.

In [42]:
# transformation of the dataframe
df = df.pivot(index='eeg_file', columns=['channel']) # 'paradigm',

In [43]:
df.columns = ['_'.join(str(s).strip() for s in col if s) for col in df.columns]

In [44]:
df.reset_index(inplace=True)
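
The pivot and column-flattening steps above can be tried on a toy table first. The sketch below uses made-up values and hypothetical participant labels (`p1`, `p2`) to show how a long table with one row per file and channel becomes a wide table with one row per file.

```python
import pandas as pd

# Toy long-format table: one row per (file, channel), made-up values
long_df = pd.DataFrame({
    "eeg_file": ["p1", "p1", "p2", "p2"],
    "channel": ["Pz", "O1", "Pz", "O1"],
    "mean": [2.7, 5.8, 1.0, 0.5],
    "std": [2.4, 4.4, 1.1, 0.9],
})

# Every remaining column is spread over the channels
wide = long_df.pivot(index="eeg_file", columns="channel")

# Columns are now (statistic, channel) tuples; join them into flat names
wide.columns = ["_".join(col) for col in wide.columns]
wide = wide.reset_index()
```

`pivot` raises a `ValueError` when the same combination of `eeg_file` and `channel` occurs more than once, which is why the `drop_duplicates` step comes first.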

## Merge and save dataframe

We still need to merge some of the metadata into the dataframe so that it contains the age, sex and label of each participant.

In [45]:
df = pd.merge(df, metadata, on='eeg_file')

In [46]:
pd.set_option('display.max_columns', None)

In [47]:
df

Unnamed: 0,eeg_file,mean_O1,mean_O2,mean_Oz,mean_PO3,mean_PO4,mean_Pz,kurt_O1,kurt_O2,kurt_Oz,kurt_PO3,kurt_PO4,kurt_Pz,max_O1,max_O2,max_Oz,max_PO3,max_PO4,max_Pz,min_O1,min_O2,min_Oz,min_PO3,min_PO4,min_Pz,skew_O1,skew_O2,skew_Oz,skew_PO3,skew_PO4,skew_Pz,std_O1,std_O2,std_Oz,std_PO3,std_PO4,std_Pz,var_O1,var_O2,var_Oz,var_PO3,var_PO4,var_Pz,ParticipantID,test,sex,age_months,dyslexic_parent,Group_AccToParents,path_eeg,path_epoch,path_eventmarkers,epoch_file
0,101a,-4.803768,-6.861465,-5.568198,-5.941640,-9.107199,-4.611721,-0.562165,-0.496082,-0.515277,-1.067546,-0.390698,-0.413064,3.136691,2.297604,2.512451,2.088773,3.404835,2.746709,-12.544278,-13.177281,-12.386485,-13.026352,-16.054792,-9.184929,0.319892,0.782728,0.469263,0.257558,0.938183,0.667729,3.413841,4.026678,3.605027,4.170460,5.253442,2.885322,11.654309,16.214139,12.996221,17.392739,27.598650,8.325084,101,a,m,20,m,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,101a_epo.fif
1,103a,19.883446,24.975000,15.546797,10.583465,9.278510,8.980769,-0.639914,-1.046457,-0.910234,-0.776038,-0.979723,-1.021471,45.649649,78.843023,31.866957,26.217370,22.372235,22.617439,-30.133953,-22.639233,-5.406720,-5.550557,-4.475777,-9.147466,-0.901687,0.164626,-0.375382,-0.001471,0.094579,-0.020761,22.908922,25.244952,10.078847,7.751403,6.891567,7.446865,524.818704,637.307599,101.583147,60.084244,47.493699,55.455804,103,a,f,20,m,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,103a_epo.fif
2,104a,5.025056,3.728181,4.172964,3.445599,2.649602,-1.490597,-0.188521,-0.654150,-0.162730,0.660011,-0.079210,0.578608,13.435692,11.707743,12.351547,12.921998,11.219894,6.268313,-4.453991,-3.708660,-4.800157,-5.815088,-3.516262,-6.186405,-0.289918,-0.137595,-0.304990,-0.133156,0.554728,0.484340,3.715777,3.210571,3.366349,3.028814,2.929036,2.560409,13.807001,10.307765,11.332309,9.173715,8.579250,6.555693,104,a,m,18,f,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,104a_epo.fif
3,105a,5.828814,2.111776,2.843554,2.561561,0.504324,2.682689,-0.444278,0.701477,0.745639,0.264753,0.503167,-0.585351,15.308534,13.064098,12.816242,11.240470,10.009000,8.334603,-2.952760,-3.626968,-3.083279,-3.645459,-4.808089,-2.723028,0.075801,1.123020,0.926846,0.743415,1.076142,0.045122,4.419833,3.970303,3.364777,3.307474,3.217218,2.400998,19.534927,15.763302,11.321726,10.939385,10.350494,5.764790,105,a,f,17,f,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,105a_epo.fif
4,106a,-2.111612,-2.724727,-1.171973,-1.092150,-2.354315,0.664995,0.525556,-0.330947,-0.641247,1.719099,-0.172471,0.607976,5.648503,7.069859,7.499098,6.330233,6.946855,7.860720,-12.637993,-10.112772,-9.666548,-14.364977,-10.058398,-4.158007,-0.505836,0.557716,-0.039128,-1.028704,0.582196,0.512282,3.602988,3.846539,3.753863,3.958567,3.743417,2.346960,12.981522,14.795859,14.091485,15.670253,14.013175,5.508222,106,a,m,19,f,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,106a_epo.fif
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,217a,2.933243,2.492899,2.115465,1.034877,2.157737,0.865691,-1.121746,-1.101995,-1.060556,-0.938454,-0.681131,-0.580103,12.762514,8.994805,11.119629,8.401492,8.551001,5.652159,-7.199085,-5.869103,-5.197457,-6.193635,-4.182073,-3.568588,-0.100631,-0.007074,0.030095,0.014958,0.292719,0.101375,5.044667,3.715081,4.043698,3.679427,2.886913,2.036773,25.448661,13.801827,16.351494,13.538183,8.334265,4.148446,217,a,f,18,f,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,217a_epo.fif
97,218a,-6.042049,-7.197263,-8.965705,-5.575467,-4.222334,-3.472159,-1.116926,-1.200517,-1.330322,-1.041544,-0.553457,-0.926702,2.981168,3.283967,3.257127,2.495629,3.815094,4.618657,-12.993363,-14.364736,-18.633659,-12.356152,-9.596063,-9.452912,0.426191,0.457209,0.291107,0.459677,0.673844,0.364758,4.433269,5.002648,6.527243,3.937542,3.168611,3.197134,19.653870,25.026488,42.604898,15.504241,10.040093,10.221666,218,a,m,18,Nee,Control,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,218a_epo.fif
98,219a,3.985731,3.818194,3.572298,3.353821,-1.580273,-0.252640,0.674328,-0.498430,-0.493440,-0.726549,-0.619165,0.164671,9.810595,11.390052,10.380866,9.205712,4.791675,3.961985,-5.509886,-1.970815,-3.064241,-2.458841,-6.744724,-3.980556,-0.619171,0.505276,0.274808,-0.198036,0.318454,0.044923,2.982900,3.448064,3.044555,2.707432,2.461528,1.443024,8.897694,11.889146,9.269315,7.330190,6.059118,2.082318,219,a,m,18,m,At risk,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,219a_epo.fif
99,220a,0.330157,1.419880,0.322775,0.893187,-2.031868,-2.223419,-0.554777,-0.705564,-0.319308,0.187801,-0.187100,-0.445076,7.459067,10.076513,5.818233,5.969766,2.428267,0.971704,-9.828997,-8.815611,-6.635341,-5.572658,-9.173635,-8.254329,-0.192240,-0.066602,-0.252990,-0.280798,-0.628134,-0.694669,3.564100,4.608070,2.719781,2.456889,2.668229,2.320235,12.702807,21.234314,7.397207,6.036302,7.119445,5.383488,220,a,m,19,Nee,Control,../../volume-ceph/ePodium_projectfolder/dataset,../../volume-ceph/nadine_storage/processed_epochs,../../volume-ceph/ePodium_projectfolder/events,220a_epo.fif


Drop some unnecessary columns. 

In [48]:
df = df.drop(['eeg_file',
       'dyslexic_parent', 'path_eeg','path_epoch',
       'epoch_file', 'path_eventmarkers'], axis =1)

In [49]:
df['sex'] = np.where(
    (df['sex']=='m'), 1,0)

df['Group_AccToParents'] = np.where(
    (df['Group_AccToParents']=='At risk'), 1,0)
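
Note that `np.where` in the cell above encodes every value that does not match as 0, including missing or misspelled entries. A possible alternative, sketched here on a toy table with made-up rows, is an explicit mapping with `Series.map`, which leaves unexpected values as `NaN` so they can be spotted.

```python
import numpy as np
import pandas as pd

# Toy rows; the missing 'sex' entry is deliberate
toy = pd.DataFrame({
    "sex": ["m", "f", None],
    "Group_AccToParents": ["At risk", "Control", "At risk"],
})

# np.where, as in the cell above: everything that is not 'm' becomes 0
sex_where = np.where(toy["sex"] == "m", 1, 0)

# Explicit mapping: values missing from the dictionary become NaN instead
sex_map = toy["sex"].map({"m": 1, "f": 0})
group_map = toy["Group_AccToParents"].map({"At risk": 1, "Control": 0})
```

Whether silent zeros or visible `NaN` values are preferable depends on how clean the metadata is; the mapping variant may be worth considering when labels could be incomplete.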

In [50]:
first = df.pop('Group_AccToParents')
df.insert(0, 'Group_AccToParents', first)

In [51]:
df.to_csv('df_avg_mmr.csv', index=False) # save dataframe

## PCA for feature reduction

In [20]:
# 'test' holds strings such as 'a', which PCA cannot convert to float
X = df.drop(columns=['Group_AccToParents', 'test'])
y = df['Group_AccToParents']

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [22]:
pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

In [23]:
explained_variance = pca.explained_variance_ratio_

In [None]:
explained_variance
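
To make clear what the explained variance ratio measures, the sketch below reproduces the calculation by hand with NumPy on a synthetic feature matrix. It follows the same recipe that scikit-learn's `PCA` uses for `explained_variance_ratio_` (centre the data, take the singular values, normalise the squared values), but the data and variable names here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic feature matrix: 50 samples, 3 features with very different spreads
X_toy = rng.normal(size=(50, 3)) * np.array([10.0, 1.0, 0.1])

# Centre the columns, then take the singular values of the centred matrix
X_centred = X_toy - X_toy.mean(axis=0)
singular_values = np.linalg.svd(X_centred, compute_uv=False)

# Variance along each component and its share of the total variance
component_variance = singular_values ** 2 / (X_toy.shape[0] - 1)
explained_ratio = component_variance / component_variance.sum()
```

In this toy case the first component takes almost all of the variance simply because one feature has a much larger scale. The features in this notebook also differ widely in scale (for example the `var_*` columns versus the `skew_*` columns), so standardising them before PCA may be worth considering.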

In [None]:
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))

In [None]:
X.shape