# EEG preprocessing 

In this notebook:
- Necessary imports
- Data loaders for events, EEG and metadata
- Filtering raw EEG data
- Transforming raw EEG into epochs
- Saving the metadata of the preprocessed files to `metadata.csv`

Preprocessing steps:
+ Prepare the EEG: (1) subtract the reference (mastoids), (2) detrend, (3) filter, (4) remove bad channels.
+ Segment the EEG into standard and deviant epochs (ERPs): (1) subtract the baseline, (2) reject artefacts, (3) average, for each marker, subject and channel separately.
+ Calculate the mismatch response (deviant minus standard for a single subject) and check the differences between channels and subjects.
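
As a rough illustration of the last two steps, the sketch below baseline-corrects and averages synthetic epochs and then takes the deviant minus standard difference. The array shapes, the number of time points and the injected effect are all hypothetical. Artefact rejection is left out, and this is not the eegyolk implementation.

```python
import numpy as np

def erp(epochs, times, baseline=(-0.2, 0.0)):
    """Baseline-correct each epoch, then average over epochs.

    epochs: array of shape (n_epochs, n_channels, n_times)
    times:  array of shape (n_times,), seconds relative to the event
    """
    mask = (times >= baseline[0]) & (times <= baseline[1])
    # subtract the mean of the pre-stimulus interval, per epoch and channel
    corrected = epochs - epochs[:, :, mask].mean(axis=2, keepdims=True)
    return corrected.mean(axis=0)  # shape (n_channels, n_times)

# Synthetic stand-in data (hypothetical): 50 epochs, 2 channels, 101 time points
rng = np.random.default_rng(0)
times = np.linspace(-0.2, 0.8, 101)
standard = rng.normal(0.0, 1.0, size=(50, 2, times.size))
deviant = rng.normal(0.0, 1.0, size=(50, 2, times.size))
deviant[:, :, times > 0.2] += 2.0  # inject an artificial post-stimulus effect

# mismatch response: deviant ERP minus standard ERP
mismatch = erp(deviant, times) - erp(standard, times)
print(mismatch.shape)
```

The result has one row per channel, so differences between channels can be inspected directly.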

## Imports

The data is processed using the MNE library and the custom-made eegyolk library. The eegyolk library contains functions to load the metadata, the EEG data and the event markers. Install it first with `pip install eegyolk`.

In [1]:
import mne      # toolbox for analyzing and visualizing EEG data
import os       # using operating system dependent functionality (folders)
import pandas as pd # data analysis and manipulation
import numpy as np
import ipywidgets as widgets
from IPython.display import display
import matplotlib.pyplot as plt
import json

import eegyolk
from eegyolk import helper_functions, initialization_functions, epod_helper
from eegyolk.config import Config
from eegyolk.rawf import RawData


## Load metadata and eeg files

The config file must contain the data paths for your own workspace. The eegyolk `readme.md` explains how to set up `config.json`.

In [2]:
config = Config()
data_path = config.get_directory('root_2022')
path_eeg = config.get_directory('data')
path_metadata = config.get_directory('metadata')
path_eventmarkers = config.get_directory('events')
path_epochs = config.get_directory('preprocessed')

Three paths are needed to import the data: the paths to the raw EEG files, the metadata and the event markers. The files are loaded with functions from `initialization_functions`. All event markers need to be stored in a separate folder. If they have not been saved yet, `initialization_functions.save_event_markers` saves them first.

In [3]:
# load metadata
files_metadata = ["children.txt", "cdi.txt", "parents.txt", "CODES_overview.txt"]  
children, cdi, parents, codes = eegyolk.initialization_functions.i_load_metadata(path_metadata, files_metadata)

In [4]:
# load eeg
eeg, eeg_filename =  initialization_functions.load_dataset(path_eeg, preload=False) # preload must be set to True once on the cloud


248 EEG files loaded


In [5]:
# load events
events_files = os.listdir(path_eventmarkers)
if len(events_files) == 0: # the separate event marker folder is still empty
    initialization_functions.save_event_markers(path_eventmarkers, eeg, eeg_filename) # save event markers

event_markers = initialization_functions.load_events(path_eventmarkers, eeg_filename) # load event markers
event_markers_simplified = epod_helper.group_events_12(event_markers) # simplify events

248 Event Marker files loaded


## Filtering raw EEG 

Filtering takes three steps:
1. Set the filter parameters:
    - Lower cutoff frequency of the bandpass filter (`lowpass`)
    - Upper cutoff frequency of the bandpass filter (`highpass`)
    - Notch filter frequencies (`freqs`)
    - Event dictionary
    - Time window for each epoch
2. Filter the raw EEG data with a generator
3. Transform the filtered EEG into epochs

### Set filter parameters

Below you can define the cutoff frequencies of the bandpass filter. In this notebook `lowpass` is the lower cutoff and `highpass` the upper cutoff; both widgets accept values between 0 and 100 Hz. A common choice is a bandpass between 0.1 and 40 Hz.

In [6]:
lowpass = widgets.BoundedFloatText(
    value=0.1,
    min=0,
    max=100,
    step=0.1,
    description='lowpass:',
    disabled=False
)

highpass = widgets.BoundedFloatText(
    value=40,
    min=0,
    max=100,
    step=0.1,
    description='highpass:',
    disabled=False
)

widgets.VBox([lowpass,highpass])


VBox(children=(BoundedFloatText(value=0.1, description='lowpass:', step=0.1), BoundedFloatText(value=40.0, des…

In [7]:
# convert the widget values to floats
lowpass = float(lowpass.value)
highpass = float(highpass.value)
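
To see what the two cutoffs do, here is a minimal sketch on a synthetic signal. It uses a SciPy Butterworth bandpass as a stand-in; the filter that eegyolk and MNE apply may be designed differently (MNE uses FIR filters by default). The sampling rate and the test signal are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 512.0                               # hypothetical sampling rate in Hz
lower_cutoff, upper_cutoff = 0.1, 40.0   # the notebook's `lowpass` and `highpass`

# synthetic signal: a 10 Hz component (inside the band) plus 100 Hz (outside)
t = np.arange(0, 10, 1 / fs)
signal = np.sin(2 * np.pi * 10 * t) + np.sin(2 * np.pi * 100 * t)

# 4th-order Butterworth bandpass, applied forward and backward (zero phase)
sos = butter(4, [lower_cutoff, upper_cutoff], btype="bandpass", fs=fs, output="sos")
filtered = sosfiltfilt(sos, signal)

def amplitude_at(x, f):
    """Amplitude of the spectral component closest to frequency f (Hz)."""
    spectrum = np.abs(np.fft.rfft(x)) / (x.size / 2)
    frequencies = np.fft.rfftfreq(x.size, d=1 / fs)
    return spectrum[np.argmin(np.abs(frequencies - f))]

print("10 Hz amplitude after filtering:", amplitude_at(filtered, 10))
print("100 Hz amplitude after filtering:", amplitude_at(filtered, 100))
```

The 10 Hz component passes almost unchanged, while the 100 Hz component is strongly attenuated.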

The number of notch frequencies can be adjusted by changing `n`. The frequencies used for this analysis are `[60, 120, 180, 240]`.

In [8]:
n = 4
freq = list(widgets.BoundedIntText(
    description='freq[{}]'.format(i),
    min=0,
    max=300,
    step=1,
    value=(i+1)*60)
    for i in range(n))

widgets.VBox(children=freq)

VBox(children=(BoundedIntText(value=60, description='freq[0]', max=300), BoundedIntText(value=120, description…

In [9]:
freqs = [f.value for f in freq]
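
The effect of notch filtering at these frequencies can be sketched with SciPy's `iirnotch`. This is an illustration under assumptions (sampling rate, quality factor `Q` and the synthetic signal are hypothetical) and not necessarily how eegyolk applies its notch filter.

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

fs = 2048.0                   # hypothetical sampling rate; notch frequencies must lie below fs / 2
freqs = [60, 120, 180, 240]   # notch frequencies used in this notebook

# synthetic signal: a 10 Hz component plus 60 Hz line noise
t = np.arange(0, 5, 1 / fs)
brain = np.sin(2 * np.pi * 10 * t)
noisy = brain + 0.5 * np.sin(2 * np.pi * 60 * t)

# apply one narrow notch per frequency, forward and backward (zero phase)
cleaned = noisy.copy()
for f0 in freqs:
    b, a = iirnotch(f0, Q=30.0, fs=fs)
    cleaned = filtfilt(b, a, cleaned)

def rms(x):
    return float(np.sqrt(np.mean(x ** 2)))

# compare away from the edges, where filter transients have died out
core = slice(int(fs), -int(fs))
print("residual noise before:", rms((noisy - brain)[core]))
print("residual noise after: ", rms((cleaned - brain)[core]))
```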

Epochs are created by cutting the EEG data into windows around specific events. `mne.Epochs` applies a baseline correction by default and can reject artefacts, for example when rejection thresholds are passed via `reject`.

In [10]:
event_dictionary = eegyolk.epod_helper.event_dictionary
event_dictionary

{'GiepM_FS': 1,
 'GiepM_S': 2,
 'GiepM_D': 3,
 'GiepS_FS': 4,
 'GiepS_S': 5,
 'GiepS_D': 6,
 'GopM_FS': 7,
 'GopM_S': 8,
 'GopM_D': 9,
 'GopS_FS': 10,
 'GopS_S': 11,
 'GopS_D': 12}
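
The event names combine a condition with a stimulus type. The sketch below groups the event codes by the suffix after the underscore. Reading `FS`, `S` and `D` as first standard, standard and deviant is an assumption based on the naming and on the preprocessing outline above.

```python
from collections import defaultdict

# copied from the output above
event_dictionary = {
    'GiepM_FS': 1, 'GiepM_S': 2, 'GiepM_D': 3,
    'GiepS_FS': 4, 'GiepS_S': 5, 'GiepS_D': 6,
    'GopM_FS': 7, 'GopM_S': 8, 'GopM_D': 9,
    'GopS_FS': 10, 'GopS_S': 11, 'GopS_D': 12,
}

# group event codes by stimulus type (the part after the last underscore)
codes_by_type = defaultdict(list)
for name, code in event_dictionary.items():
    condition, stimulus_type = name.rsplit("_", 1)
    codes_by_type[stimulus_type].append(code)

print(dict(codes_by_type))
```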

To create the epochs, the window around each event needs to be defined: `tmin` and `tmax` are the start and stop times in seconds relative to the event. The defaults are -0.2 and 0.8, so each epoch runs from 0.2 s before to 0.8 s after the event.

In [11]:
tmin = widgets.BoundedFloatText(
    value=-0.2,
    min=-1,
    max=1,
    step=0.1,
    description='tmin:',
    disabled=False
)

tmax = widgets.BoundedFloatText(
    value=0.8,
    min=-1,
    max=1,
    step=0.1,
    description='tmax:',
    disabled=False
)

widgets.VBox([tmin,tmax])

VBox(children=(BoundedFloatText(value=-0.2, description='tmin:', max=1.0, min=-1.0, step=0.1), BoundedFloatTex…

In [12]:
tmin = float(tmin.value)
tmax = float(tmax.value)
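
How `tmin` and `tmax` translate into samples can be sketched with plain NumPy. The sampling rate, the single synthetic channel and the event positions are hypothetical; in the notebook itself this step is handled by eegyolk and MNE.

```python
import numpy as np

sfreq = 100.0            # hypothetical sampling rate in Hz
tmin, tmax = -0.2, 0.8   # window relative to each event, in seconds

data = np.arange(1000, dtype=float)    # one synthetic channel
event_samples = np.array([200, 500])   # hypothetical event onsets (sample indices)

# convert the window from seconds to sample offsets
start = int(round(tmin * sfreq))   # 20 samples before the event
stop = int(round(tmax * sfreq))    # 80 samples after the event

# cut one window per event; both end points are included
epochs = np.stack([data[s + start : s + stop + 1] for s in event_samples])
times = np.arange(start, stop + 1) / sfreq

print(epochs.shape)
```

Each row of `epochs` is one window, and `times` gives the time of every sample relative to the event.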

### Filter generator

The first preprocessing step is a filter for the raw EEG data. It applies a bandpass filter with the parameters `lowpass` and `highpass`, and a notch filter that removes power line noise at the frequencies in `freqs`. The `mastoid_channels` are used to subtract the reference from the raw EEG data. Finally, the channels listed in `drop_ch` are dropped from the EEG.

In [13]:
mastoid_channels = ['EXG1', 'EXG2']
drop_ch = ['EXG1', 'EXG2','EXG3', 'EXG4', 'EXG5', 'EXG6', 'EXG7', 'EXG8', 'Status']

def filter_raweeg_gen(eeg, lowpass, highpass, freqs, mastoid_channels, drop_ch): # filters the raw eeg
    for i in range(len(eeg)): #loops over all files
        processed_file = os.path.join(path_epochs, eeg_filename[i]+"_epo.fif") # creates new filename
        if not os.path.exists(processed_file): # if file isn't processed yet, it uses the filter function from the eegyolk library to preprocess
            yield helper_functions.filter_eeg_raw(eeg[i].load_data(), lowpass, highpass, freqs, mastoid_channels, drop_ch)
        else:
            yield None # file already processed: yield a placeholder so the index stays aligned with eeg_filename
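
What subtracting the mastoid reference and dropping channels amounts to can be sketched on a plain array. The internals of `helper_functions.filter_eeg_raw` are not shown here, so this is an assumed, typical version: re-reference to the mean of the two mastoid channels, then drop them. The channel subset and the random data are hypothetical.

```python
import numpy as np

ch_names = ["Fz", "Cz", "EXG1", "EXG2"]   # hypothetical channel subset
mastoid_channels = ["EXG1", "EXG2"]

rng = np.random.default_rng(1)
data = rng.normal(size=(len(ch_names), 1000))   # shape (n_channels, n_times)

# reference signal: mean of the mastoid channels at every time point
ref_idx = [ch_names.index(ch) for ch in mastoid_channels]
reference = data[ref_idx].mean(axis=0)

# subtract the reference from every channel (broadcast over channels)
rereferenced = data - reference

# drop the reference channels afterwards
keep = [i for i, ch in enumerate(ch_names) if ch not in mastoid_channels]
eeg_only = rereferenced[keep]

print(eeg_only.shape)
```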
            

The second part loops over the filtered files. To save memory and time, each file is first checked to see whether it has already been processed. Filtering is computationally expensive; if the code fails with a memory error, restart the kernel and rerun the notebook until all files are processed. For each event, a window from `tmin` before to `tmax` after the event is cut out of the recording; this window is an epoch. The function automatically performs a baseline correction.

In [14]:
filtered_eegs = filter_raweeg_gen(eeg, lowpass, highpass, freqs, mastoid_channels, drop_ch)

if not os.path.exists(path_epochs):
    os.mkdir(path_epochs)

for idx, single_eeg in enumerate(filtered_eegs):
    processed_file = os.path.join(path_epochs, eeg_filename[idx]+"_epo.fif")
    if not os.path.exists(processed_file): # only files that are not processed yet
        epoch = helper_functions.create_epoch(single_eeg, event_markers_simplified[idx], tmin, tmax)
        epoch.save(processed_file, overwrite=True)
        print("\n", idx+1, " saved.")


## Create DataFrame with metadata and eeg/epoch paths

This final part creates a dataframe that links every preprocessed epoch file to the corresponding metadata, using simple pandas operations.

In [15]:
# create dependent variable Group_AccToParents
children = children.drop(['Group_AccToParents'],axis=1)
parents['Group_AccToParents'] = np.where(((parents['dyslexia_mother_accToMother']=='Ja') | (parents['dyslexia_father_accToFather']=='Ja')) , 1,0)

# create key to merge 
parents.rename(columns = {'child':'ParticipantID'}, inplace=True)
cdi.rename(columns = {'participant':'ParticipantID'}, inplace=True)
metadata = pd.merge(cdi, children, on="ParticipantID")
metadata = pd.merge(metadata, parents, on="ParticipantID")

# create filepath columns
metadata['eeg_file']= metadata['ParticipantID'].astype(str) + metadata['test']

epoch_filename = []
for path in os.listdir(path_epochs): # iterate directory
    if os.path.isfile(os.path.join(path_epochs, path)): # check if current path is a file
        epoch_filename.append(path)

df_eegfilenames = pd.DataFrame(eeg_filename, columns=['eeg_file'])
df_epochfilenames = pd.DataFrame(epoch_filename, columns=['epoch_file'])
df_epochfilenames['eeg_file'] = df_epochfilenames.epoch_file.str[:4]

metadata['path_eeg'] = path_eeg
metadata['path_epoch'] = path_epochs 
metadata['path_eventmarkers'] = path_eventmarkers

# merge to final dataframe
df = pd.merge(df_eegfilenames, metadata, on='eeg_file')
df = pd.merge(df, df_epochfilenames, on='eeg_file')
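
The inner merges above silently drop every recording that has no match. A possible way to list such recordings is a left merge with `indicator=True`, sketched below. The file names are taken from this notebook, but which of them have an epoch file is hypothetical. The sketch also strips the `_epo.fif` suffix instead of slicing the first four characters, so longer names such as `107b (deel 1+2)` keep their full key.

```python
import pandas as pd

eeg_filename = ["105a", "107a", "107b (deel 1+2)"]
epoch_filename = ["105a_epo.fif", "107b (deel 1+2)_epo.fif"]   # hypothetical folder content

df_eeg = pd.DataFrame({"eeg_file": eeg_filename})
df_epochs = pd.DataFrame({"epoch_file": epoch_filename})

# derive the merge key by removing the suffix rather than slicing
df_epochs["eeg_file"] = df_epochs["epoch_file"].str.replace("_epo.fif", "", regex=False)

# the left merge keeps all recordings; `_merge` tells where each row was found
merged = pd.merge(df_eeg, df_epochs, on="eeg_file", how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", "eeg_file"].tolist()

print("recordings without an epoch file:", missing)
```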

The dataframe contains many columns that are not used in this research, so only the relevant columns are kept. Some files are dropped as well, based on the outcome of the `data_analysis.ipynb` notebook, which showed that those files are missing events or contain bad channels. Finally, all recordings from test `b` are removed, so only test `a` remains.

In [16]:
# drop columns and files
df = df[['eeg_file', 'ParticipantID', 'test', 'sex', 'age_months',
       'dyslexic_parent', 'Group_AccToParents', 'path_eeg', 'path_epoch',
       'path_eventmarkers', 'epoch_file']]

drop_files = ["102a","113a", "107b (deel 1+2)", "132a", "121b(2)", "113b", "107b (deel 3+4)", "147a",
                "121a", "134a", "143b", "121b(1)","136a", "145b", "150a","152a", "184a", "165a", "151a", "163a", "179a","179b", "182b", "186a", "193b", "207a"]

df = df[~df['eeg_file'].isin(drop_files)]
df = df.drop(df[df['test'] == "b"].index).reset_index(drop=True)

## Save metadata

The `metadata.csv` file is used in the follow-up notebooks `data_analysis.ipynb` and `model_preperation.ipynb`, where the data is analysed and further transformed to serve as input for the different models.

In [17]:
df.to_csv('metadata.csv', index=False)
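
A follow-up notebook could read the file back and build the full path to each epoch file, roughly as sketched below. The two rows are taken from this notebook's output and reduced to three columns; the column name `epoch_path` is hypothetical. An in-memory string stands in for `metadata.csv` so the sketch runs on its own.

```python
import io
import os
import pandas as pd

# stand-in for metadata.csv, reduced to the columns needed here
csv_text = """eeg_file,path_epoch,epoch_file
105a,/volume-ceph/ePodium_projectfolder/epochs_fif,105a_epo.fif
107a,/volume-ceph/ePodium_projectfolder/epochs_fif,107a_epo.fif
"""

metadata = pd.read_csv(io.StringIO(csv_text))   # in practice: pd.read_csv("metadata.csv")

# combine folder and file name into one path per recording
metadata["epoch_path"] = [
    os.path.join(folder, name)
    for folder, name in zip(metadata["path_epoch"], metadata["epoch_file"])
]

print(metadata["epoch_path"].tolist())
```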

In [18]:
df

| | eeg_file | ParticipantID | test | sex | age_months | dyslexic_parent | Group_AccToParents | path_eeg | path_epoch | path_eventmarkers | epoch_file |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 105a | 105 | a | f | 17 | f | 1 | /volume-ceph/ePodium_projectfolder/dataset | /volume-ceph/ePodium_projectfolder/epochs_fif | /volume-ceph/ePodium_projectfolder/events | 105a_epo.fif |
| 1 | 107a | 107 | a | f | 16 | m | 1 | /volume-ceph/ePodium_projectfolder/dataset | /volume-ceph/ePodium_projectfolder/epochs_fif | /volume-ceph/ePodium_projectfolder/events | 107a_epo.fif |
| 2 | 106a | 106 | a | m | 19 | f | 0 | /volume-ceph/ePodium_projectfolder/dataset | /volume-ceph/ePodium_projectfolder/epochs_fif | /volume-ceph/ePodium_projectfolder/events | 106a_epo.fif |
| 3 | 109a | 109 | a | m | 21 | m | 0 | /volume-ceph/ePodium_projectfolder/dataset | /volume-ceph/ePodium_projectfolder/epochs_fif | /volume-ceph/ePodium_projectfolder/events | 109a_epo.fif |
| 4 | 110a | 110 | a | m | 17 | m | 1 | /volume-ceph/ePodium_projectfolder/dataset | /volume-ceph/ePodium_projectfolder/epochs_fif | /volume-ceph/ePodium_projectfolder/events | 110a_epo.fif |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 104 | 222a | 222 | a | m | 18 | Nee | 0 | /volume-ceph/ePodium_projectfolder/dataset | /volume-ceph/ePodium_projectfolder/epochs_fif | /volume-ceph/ePodium_projectfolder/events | 222a_epo.fif |
| 105 | 223a | 223 | a | f | 18 | m | 1 | /volume-ceph/ePodium_projectfolder/dataset | /volume-ceph/ePodium_projectfolder/epochs_fif | /volume-ceph/ePodium_projectfolder/events | 223a_epo.fif |
| 106 | 228a | 228 | a | m | 19 | Nee | 0 | /volume-ceph/ePodium_projectfolder/dataset | /volume-ceph/ePodium_projectfolder/epochs_fif | /volume-ceph/ePodium_projectfolder/events | 228a_epo.fif |
| 107 | 225a | 225 | a | f | 16 | m | 0 | /volume-ceph/ePodium_projectfolder/dataset | /volume-ceph/ePodium_projectfolder/epochs_fif | /volume-ceph/ePodium_projectfolder/events | 225a_epo.fif |
