# Model input ERP format preparation 

In this notebook: 
- Necessary inputs
- Read all epochs
- Function to create dataframe with average mismatch response for all participants (needs to be transformed to function)
- Formatting dataframe as suitable model input

## Imports

In [1]:
import mne      # toolbox for analyzing and visualizing EEG data
import os       # using operating system dependent functionality (folders)
import pandas as pd # data analysis and manipulation
import numpy as np    # numerical computing (manipulating and performing operations on arrays of data)
import ipywidgets as widgets
from IPython.display import display

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

import eegyolk
import eegyolk.helper_functions as hf # library useful for eeg and erp data cleaning
from eegyolk import initialization_functions #library to import data
import eegyolk.epod_helper

In [2]:
metadata = pd.read_csv('metadata.csv', sep = ',')

In [3]:
metadata

| Unnamed: 0 | eeg_file | ParticipantID | test | sex | age_months | dyslexic_parent | Group_AccToParents | path_eeg | path_epoch | path_eventmarkers | epoch_file |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101a | 101 | a | m | 20 | m | At risk | F:/Stage/ePODIUM/Data/ePodium_projectfolder/Da... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ep... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ev... | 101a_epo.fif |
| 1 | 102a | 102 | a | f | 20 | Nee | Control | F:/Stage/ePODIUM/Data/ePodium_projectfolder/Da... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ep... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ev... | 102a_epo.fif |
| 2 | 103a | 103 | a | f | 20 | m | At risk | F:/Stage/ePODIUM/Data/ePodium_projectfolder/Da... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ep... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ev... | 103a_epo.fif |
| 3 | 104a | 104 | a | m | 18 | f | At risk | F:/Stage/ePODIUM/Data/ePodium_projectfolder/Da... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ep... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ev... | 104a_epo.fif |
| 4 | 105a | 105 | a | f | 17 | f | At risk | F:/Stage/ePODIUM/Data/ePodium_projectfolder/Da... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ep... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ev... | 105a_epo.fif |
| 5 | 106a | 106 | a | m | 19 | f | At risk | F:/Stage/ePODIUM/Data/ePodium_projectfolder/Da... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ep... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ev... | 106a_epo.fif |
| 6 | 107a | 107 | a | f | 16 | m | At risk | F:/Stage/ePODIUM/Data/ePodium_projectfolder/Da... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ep... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ev... | 107a_epo.fif |
| 7 | 109a | 109 | a | m | 21 | m | At risk | F:/Stage/ePODIUM/Data/ePodium_projectfolder/Da... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ep... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ev... | 109a_epo.fif |
| 8 | 110a | 110 | a | m | 17 | m | At risk | F:/Stage/ePODIUM/Data/ePodium_projectfolder/Da... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ep... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ev... | 110a_epo.fif |
| 9 | 111a | 111 | a | m | 20 | m | At risk | F:/Stage/ePODIUM/Data/ePodium_projectfolder/Da... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ep... | F:/Stage/ePODIUM/Data/ePodium_projectfolder/ev... | 111a_epo.fif |


## Read all epochs from files

The function below loads all filtered epochs listed in the metadata dataframe, which contains the epoch file paths and file names. Each epoch is an array for one stimulus, covering the time interval from -0.3 to 0.7 s.

In [5]:
epochs = initialization_functions.read_filtered_data(metadata)

AttributeError: module 'eegyolk.initialization_functions' has no attribute 'read_filtered_data'
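
The call above fails in this environment because `eegyolk.initialization_functions` exposes no `read_filtered_data`. As a possible fallback, the epoch files could be read directly with `mne.read_epochs`. The sketch below assumes that `path_epoch` holds the folder and `epoch_file` the file name, as the metadata table suggests. `read_epochs_from_metadata` and its `reader` parameter are hypothetical names introduced here and are not part of eegyolk.

```python
import os
import pandas as pd


def read_epochs_from_metadata(metadata, reader=None):
    """Return one loaded epochs object per metadata row, in row order."""
    if reader is None:
        import mne  # imported lazily so a custom reader works without MNE

        def reader(path):
            return mne.read_epochs(path, preload=False, verbose=False)

    epochs = []
    # Assumption: path_epoch is the folder, epoch_file the file name.
    for folder, fname in zip(metadata["path_epoch"], metadata["epoch_file"]):
        epochs.append(reader(os.path.join(folder, fname)))
    return epochs
```

With this helper, `epochs = read_epochs_from_metadata(metadata)` would give a list in the same order as the metadata rows, which matters because the later cells pair `epochs[i]` with row `i` of `metadata`.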

In [None]:
len(epochs)

## Create pandas dataframe with the average difference between standard and deviant responses

The function below takes `metadata`, the loaded `epochs`, and the standard and deviant events as input. Define the standard and deviant events as lists of event numbers. `input_mmr_prep` assumes that each deviant directly follows its standard, so the deviant belonging to a standard has the standard's event number + 1. Make sure your events are numbered this way; otherwise the function will not compute a meaningful mismatch response.
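
The numbering convention described above can be checked before any epochs are averaged. The following is a minimal sketch; `paired_deviants` and `check_event_pairs` are hypothetical helpers introduced for illustration, not eegyolk functions.

```python
def paired_deviants(standard_events):
    """Return the deviant event number for each standard event (standard + 1)."""
    return [event + 1 for event in standard_events]


def check_event_pairs(standard_events, deviant_events):
    """Raise ValueError if the deviants do not follow the standard + 1 convention."""
    expected = paired_deviants(standard_events)
    if list(deviant_events) != expected:
        raise ValueError(f"expected deviants {expected}, got {list(deviant_events)}")


standard_events = [2, 5, 8, 11]
deviant_events = paired_deviants(standard_events)
check_event_pairs(standard_events, deviant_events)  # passes silently
```

Deriving the deviants from the standards, rather than typing both lists by hand, keeps the two lists from drifting apart.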

In [None]:
def input_mmr_prep(metadata, epochs, standard_events, deviant_events):
    # Event names are looked up as strings; a list of ints would select epochs by index.
    # Assumption: the event_id keys of the epochs are the event numbers as strings.
    std_names = [str(event) for event in standard_events]
    dev_names = [str(event) for event in deviant_events]

    rows = []  # one dict per participant/channel combination

    # loop over all participants
    for i in range(len(metadata['eeg_file'])):
        std_evoked = epochs[i][std_names].average()
        dev_evoked = epochs[i][dev_names].average()

        # calculate the mismatch response (standard minus deviant evoked)
        evoked_diff = mne.combine_evoked([std_evoked, dev_evoked], weights=[1, -1])

        # get a list of all channels
        chnames_list = evoked_diff.info['ch_names']

        # compute for every channel the features of the mismatch line
        for channel in chnames_list:
            chnames = mne.pick_channels(evoked_diff.info['ch_names'], include=[channel])
            roi_dict = dict(left_ROI=chnames)  # combine_channels only takes a dictionary as input
            roi_evoked = mne.channels.combine_channels(evoked_diff, roi_dict, method='mean')
            mmr = roi_evoked.to_data_frame()

            # DataFrame.append was removed in pandas 2.0, so collect rows in a list instead
            rows.append({
                'eeg_file': metadata['eeg_file'].iloc[i],
                'channel': channel,
                'mean': mmr['left_ROI'].mean(),
                'std': mmr['left_ROI'].std(),
                'skew': mmr['left_ROI'].skew(),
                'kurt': mmr['left_ROI'].kurtosis(),
                'var': mmr['left_ROI'].var(),
            })  # add 'paradigm': paradigm if we want to separate the paradigms

    return pd.DataFrame(rows, columns=['eeg_file', 'channel', 'mean', 'std', 'skew', 'kurt', 'var'])

In [None]:
# define the events for standard and deviant
standard_events = [2,5,8,11]
deviant_events = [3,6,9,12]

df = input_mmr_prep(metadata, epochs, standard_events, deviant_events)

In [None]:
df

In [None]:
df = df.drop_duplicates(subset=['eeg_file','channel']) # ,'paradigm'

## Transpose dataframe into combination of paradigm and channel per participant

We now want a single row for every participant, with one column per combination of feature and channel (the paradigm is not separated here). The code below generates this dataframe.

In [None]:
# transformation of the dataframe
df = df.pivot(index='eeg_file', columns=['channel']) # 'paradigm',
df

In [None]:
df.columns = ['_'.join(str(s).strip() for s in col if s) for col in df.columns]
df

In [None]:
df.reset_index(inplace=True)
df
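
To make the reshaping steps above concrete, here is a small sketch on a toy long-format table. The participant IDs match the metadata, but the channel names (`Fz`, `Cz`) and all numbers are made-up illustration values, not results.

```python
import pandas as pd

# Toy long-format table: two participants, two channels, two features.
long_df = pd.DataFrame({
    "eeg_file": ["101a", "101a", "102a", "102a"],
    "channel": ["Fz", "Cz", "Fz", "Cz"],
    "mean": [0.1, 0.2, 0.3, 0.4],
    "std": [1.0, 1.1, 1.2, 1.3],
})

# One row per participant; columns become (feature, channel) tuples.
wide_df = long_df.pivot(index="eeg_file", columns=["channel"])

# Join each tuple into a flat name such as "mean_Fz".
wide_df.columns = ["_".join(str(s).strip() for s in col if s) for col in wide_df.columns]

# Move eeg_file from the index back to a regular column.
wide_df = wide_df.reset_index()

print(wide_df.shape)
print(sorted(wide_df.columns))
```

The result has one row per participant and one column per feature-channel pair, plus `eeg_file`, which is the key used for the merge with the metadata.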

## Merge and save dataframe

We still need to merge some of the metadata into the dataframe, so that each row carries the participant's age, sex and group label.

In [None]:
df = pd.merge(df, metadata, on='eeg_file')

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
df

Drop some unnecessary columns. 

In [None]:
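# errors='ignore' skips listed columns that are absent
# (the metadata shown above has 'age_months', not 'age_months_days')
df = df.drop(columns=['eeg_file', 'age_months_days',
                      'dyslexic_parent', 'path_eeg', 'path_epoch',
                      'epoch_file', 'path_eventmarkers'], errors='ignore')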

In [None]:
df['sex'] = np.where(
    (df['sex']=='m'), 1,0)

df['Group_AccToParents'] = np.where(
    (df['Group_AccToParents']=='At risk'), 1,0)
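
Note that `np.where` encodes every value other than the matched one as 0, so a missing or misspelled label would silently end up in the 0 class. A stricter alternative is sketched below with `Series.map`, assuming that `'m'`/`'f'` and `'At risk'`/`'Control'` are the only valid values, as in the metadata rows shown earlier. `encode_binary` is a hypothetical helper, and the toy frame only stands in for `df`.

```python
import pandas as pd


def encode_binary(series, mapping):
    """Map labels to integers and fail loudly on values outside the mapping."""
    encoded = series.map(mapping)  # values missing from the mapping become NaN
    if encoded.isna().any():
        unknown = sorted(set(series[encoded.isna()].astype(str)))
        raise ValueError(f"unmapped values: {unknown}")
    return encoded.astype(int)


toy = pd.DataFrame({"sex": ["m", "f", "f"],
                    "Group_AccToParents": ["At risk", "Control", "At risk"]})
toy["sex"] = encode_binary(toy["sex"], {"m": 1, "f": 0})
toy["Group_AccToParents"] = encode_binary(
    toy["Group_AccToParents"], {"At risk": 1, "Control": 0})
```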

In [None]:
first = df.pop('Group_AccToParents')
df.insert(0, 'Group_AccToParents', first)

To remove outliers from the data, the z-score is calculated.
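
A minimal sketch of such a z-score filter is given here. The threshold of 3 and the choice to drop a whole row when any feature exceeds it are assumptions, and `drop_zscore_outliers` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def drop_zscore_outliers(df, feature_cols, threshold=3.0):
    """Keep rows whose |z-score| stays within threshold for every feature column."""
    features = df[feature_cols].astype(float)
    # Constant columns have std 0; turn that into NaN so they never flag a row.
    std = features.std(ddof=0).replace(0, np.nan)
    z = (features - features.mean()) / std
    is_outlier = (z.abs() > threshold).any(axis=1)  # NaN compares as False
    return df[~is_outlier]
```

It could be applied as `df = drop_zscore_outliers(df, feature_cols)`, where `feature_cols` would be the numeric feature columns without the label.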

In [None]:
df.to_csv('df_avg_mmr.csv', index=False) # save dataframe

## PCA for feature reduction

In [None]:
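# keep numeric feature columns only; PCA cannot handle string columns such as 'test'
X = df.drop(columns='Group_AccToParents').select_dtypes(include='number')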
y = df['Group_AccToParents']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

In [None]:
explained_variance = pca.explained_variance_ratio_

In [None]:
explained_variance
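
The explained variance ratios can be turned into a component count by accumulating them until a target share of the variance is reached. The sketch below uses NumPy only; the 0.95 default target and the helper name `n_components_for` are assumptions, and the example ratios are illustrative values, not results from this dataset.

```python
import numpy as np


def n_components_for(explained_variance_ratio, target=0.95):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value >= target, converted to a count.
    count = int(np.searchsorted(cumulative, target)) + 1
    # Clamp in case rounding keeps the total just below the target.
    return min(count, len(cumulative))


ratios = np.array([0.6, 0.25, 0.1, 0.05])  # illustrative values, not results
print(n_components_for(ratios, target=0.8))
```

The returned count could then be passed as `n_components` when refitting `PCA`, so that the classifier is trained on fewer components.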

In [None]:
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))

In [None]:
X.shape