# Notebook for the ePodium dataset

This notebook contains four sections. Access to the ePodium dataset is required to run the notebook properly.

+ In section 1. [Loading EEG-data and metadata](#1epod), the EEG data and metadata are loaded.
+ In section 2. [The ePodium experiment](#2epod), the structure of the auditory oddball experiment is explained and visualised, with an emphasis on the different events.
+ Section 3. [Processing](#3epod) processes and cleans up the raw files. The raw data is split into epochs, i.e. partitions surrounding the events. These epochs are stored in a *.fif* file, the standard file format of the *mne* package.
+ In section 4. [The Event Related Potential (ERP)](#4epod), the ERPs that result from the events are explored.



#### Import Packages

In [None]:
import ipywidgets
import mne
import numpy as np
import os
import glob
import wave # Optional package for analyzing .wav audio files

import local_paths
from functions import data_io
from functions import processing
from functions import display_helper

from functions.epodium import Epodium
epodium = Epodium()
from functions.sequences import EpodiumSequence

<br>

---
<a id='1epod'></a>
## 1. Loading EEG-data and metadata

Make sure *local_paths.ePod_dataset* contains the path to the __dataset__ and *local_paths.ePod_metadata* contains the path to the __metadata__ files.

* *children.txt* contains the age and sex of each child, and the risk of dyslexia due to at least one dyslexic parent.
* *cdi.txt* contains additional information about the child's vocabulary, collected with the Communicative Development Inventories questionnaire.
* *parents.txt* contains information on the dyslexia tests and diagnoses of the parents.
* *CODES_overview* contains the mapping of event condition and stimulus to an event number.

In [None]:
experiments_raw, experiments_id = data_io.load_raw_dataset(local_paths.ePod_dataset, 
                                                       file_extension='.bdf', 
                                                       preload=False)

epod_children, epod_cdi, epod_parents = \
    data_io.load_metadata(local_paths.ePod_metadata, epodium.metadata_filenames)

#### Participants Info

In [None]:
clean_b_dataframe = epod_children.loc[epod_children['Age_days_b'].str.isnumeric()]

print(f"The participants are between the age of {epod_children['Age_days_a'].min()} "
      f"and {int(clean_b_dataframe['Age_days_b'].max())} days. "
      f"({round(epod_children['Age_months_a'].min(), 1)} to "
      f"{float(clean_b_dataframe['Age_months_b'].max())} months)")

<br>

---
<a id='2epod'></a>
## 2. The ePodium experiment

The ePodium experiment is an *auditory oddball experiment*. Children listen to a sequence that contains __80% standard__ and __20% deviant__ syllables in order to elicit the *mismatch response*.
For the measurement, 34 electrodes are used: __32 channels__ and 2 mastoid references. The sampling frequency is __2048 Hz__.
The experiment takes around __30 minutes__ and is made up of sequences of around 7.5 minutes, covering four different conditions:

+ Condition 1 __GiepMT__: standard "*giep*", deviant "*gip*": multiple pronunciations
+ Condition 2 __GiepST__: standard "*giep*", deviant "*gip*": single pronunciation
+ Condition 3 __GopMT__: standard "*gop*", deviant "*goep*": multiple pronunciations
+ Condition 4 __GopST__: standard "*gop*", deviant "*goep*": single pronunciation
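
To make the oddball structure concrete, the sketch below generates a toy sequence of labels for one condition. It is only an illustration under assumptions: the function name, the number of trials, the number of first standards and the purely random ordering are hypothetical and are not taken from the ePodium protocol.

```python
import numpy as np

def make_oddball_sequence(n_trials=100, deviant_fraction=0.2,
                          n_first_standards=3, seed=0):
    """Return a toy list of 'FS' (first standard), 'S' and 'D' labels.

    Hypothetical helper: real oddball designs often add constraints
    (e.g. no consecutive deviants) that are not modelled here.
    """
    rng = np.random.default_rng(seed)
    n_deviants = int(round(n_trials * deviant_fraction))
    labels = np.array(["S"] * (n_trials - n_deviants) + ["D"] * n_deviants)
    rng.shuffle(labels)  # random order of standards and deviants
    return ["FS"] * n_first_standards + labels.tolist()

sequence = make_oddball_sequence()
n_scored = len(sequence) - sequence.count("FS")
print(sequence[:12])
print(f"deviant fraction: {sequence.count('D') / n_scored:.2f}")
```

With the default arguments, 20 of the 100 trials after the first standards are deviants, which matches the 80% / 20% split described above.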

#### Analyse audio stimulus

In [None]:
def print_sound_duration(sounds):
    path_sound = os.path.join(local_paths.ePod_metadata, 'sounds', sounds)
    with wave.open(path_sound) as mywav:
        duration_seconds = mywav.getnframes() / mywav.getframerate()
        print(f"Length of the WAV file: {duration_seconds:.3f} s")
        
epod_sounds = sorted(os.listdir(os.path.join(local_paths.ePod_metadata, 'sounds')))
ipywidgets.interact(print_sound_duration, sounds=epod_sounds);

### Load events
Events are stored in external .txt files for faster loading.
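
The caching idea can be sketched with plain NumPy. The internals of `data_io.save_events` and `data_io.load_events` are not shown in this notebook, so the file name and the text format below are assumptions; the only thing relied on is the MNE convention that an events array has one row per event with three integer columns.

```python
import os
import tempfile
import numpy as np

# MNE-style events array: one row per event with the columns
# (sample index, previous trigger value, event id).
events = np.array([[2048, 0, 1],
                   [4096, 0, 2],
                   [6144, 0, 1]])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, one file per experiment.
    path = os.path.join(tmp_dir, "participant_events.txt")
    np.savetxt(path, events, fmt="%d")             # write as integer text
    loaded = np.loadtxt(path, dtype=int, ndmin=2)  # read back as a 2-D array

print(np.array_equal(events, loaded))
```

Reading a small text file like this avoids re-scanning the trigger channel of every raw recording each time the notebook is run.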


In [None]:
## Store events in local path
n_events_stored = len(glob.glob(os.path.join(local_paths.ePod_dataset_events, '*.txt')))
if n_events_stored != len(experiments_raw):
    data_io.save_events(local_paths.ePod_dataset_events, experiments_raw, experiments_id)

## Load events
events = data_io.load_events(local_paths.ePod_dataset_events, experiments_id)

## Group multiple pronunciations under the same event id to reduce the number of unique events from 78 to 12.
events_12 = epodium.group_events_12(events)

#### Choose which participant to analyse

In [None]:
def f(experiments):
    return experiments

participant_widget = ipywidgets.interactive(f, experiments=sorted(experiments_id))
display(participant_widget)

#### Show part of the EEG signal
When a new experiment is chosen, this cell needs to be run again to visualise its EEG data.

In [None]:
## Makes the plot interactive, comment out if not working:
# %matplotlib widget 

participant_raw = experiments_raw[experiments_id.index(participant_widget.result)]
participant_events = events_12[experiments_id.index(participant_widget.result)]

fig = mne.viz.plot_raw(participant_raw, participant_events, n_channels=3, scalings=50e-6, duration=0.5, start=1000)

#### Plot events across time

The events are grouped into __12 event types__, 3 for each condition. 

Ideally, each condition has __120 deviants__ (D) and __360 standards__ (S).

The test also contains __first standards__ (FS) to make the child accustomed to the standard. First standards are discarded when calculating the mismatch response. 
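
A quick way to check how close a recording comes to these ideal counts is to tally the event ids. The sketch below does this on a tiny hand-made events array; the dictionary is a hypothetical stand-in for `epodium.event_dictionary`, whose actual keys and codes are not shown here.

```python
import numpy as np

# Hypothetical mapping for one condition; the real mapping is
# provided by epodium.event_dictionary.
event_dictionary = {"GiepM_FS": 1, "GiepM_S": 2, "GiepM_D": 3}

# Toy MNE-style events array: (sample index, previous value, event id).
events = np.array([[100, 0, 1],
                   [200, 0, 2],
                   [300, 0, 2],
                   [400, 0, 3],
                   [500, 0, 2]])

# Count how often each event id occurs in the third column.
ids, counts = np.unique(events[:, 2], return_counts=True)
count_by_id = dict(zip(ids.tolist(), counts.tolist()))

# Translate ids back to readable names; missing ids count as 0.
counts_by_name = {name: count_by_id.get(code, 0)
                  for name, code in event_dictionary.items()}
print(counts_by_name)
```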

In [None]:
fig = mne.viz.plot_events(participant_events, event_id=epodium.event_dictionary, 
                          color=display_helper.color_dictionary, sfreq=epodium.frequency)

<br>

---
<a id='3epod'></a>
## 3. Processing
#### Filtering ePodium dataset and rejecting bad trials

The EEG data located in _local_paths.ePod_dataset_ is processed with the following techniques:
+ A high-pass filter with a cutoff frequency of 0.1 Hz is applied to the raw EEG sequence to remove slow trends.
+ The raw data is split into 1-second epochs in which the event occurs at 0.2 s.
+ The epochs are cleaned with the [autoreject](https://autoreject.github.io/stable/index.html) library, which automatically rejects bad trials and repairs bad sensors in EEG data. The AutoReject and Ransac classes are used.
+ A low-pass filter is applied to the epochs.

The function *process_raw* splits the raw files into epochs and saves the events externally. Processing each file takes a while, mainly due to the complexity of the autoreject method. To save time, multiple raw files are processed simultaneously via multiprocessing.
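
The epoching step can be illustrated with NumPy alone. This is a simplified sketch on synthetic data, not the implementation of `process_raw`: the window of -0.2 s to 0.8 s around each event is an assumption derived from "1 second epochs in which the event occurs at 0.2 s", and filtering and autoreject are left out.

```python
import numpy as np

sfreq = 2048            # sampling frequency in Hz (see section 2)
tmin, tmax = -0.2, 0.8  # assumed window around each event, in seconds

rng = np.random.default_rng(0)
data = rng.normal(size=(32, 10 * sfreq))  # synthetic (n_channels, n_samples)
event_samples = np.array([2048, 6144, 10240, 20000])

start = int(round(tmin * sfreq))  # negative offset before the event
stop = int(round(tmax * sfreq))   # offset after the event

epochs = []
for sample in event_samples:
    lo, hi = sample + start, sample + stop
    if lo < 0 or hi > data.shape[1]:
        continue  # skip events too close to the edges of the recording
    epochs.append(data[:, lo:hi])
epochs = np.stack(epochs)  # (n_epochs, n_channels, n_times)
print(epochs.shape)
```

The last event is dropped because its window runs past the end of the recording. Note that this sketch uses a half-open window; *mne* includes the sample at `tmax`, so its epochs contain one extra sample for the same window.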

In [None]:
raw_paths = sorted(glob.glob(os.path.join(local_paths.ePod_dataset, '*' + epodium.file_extension)))

## Multiprocessing:
processing.process_raw_multiprocess(experiments_id, raw_paths, epodium, local_paths.ePod_epochs)

## Single processing:
# for i in range(len(raw_paths)):
#     processing.process_raw(i, experiments_id, raw_paths, epodium, local_paths.ePod_epochs)

#### Extract valid experiments
Processed files with too few standards and deviants are considered invalid.
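
The validity check amounts to comparing event counts against thresholds. The sketch below assumes that the thresholds apply to each condition separately; whether `processing.valid_experiments` counts per condition or over the whole experiment is not stated here, and the experiment names and counts are made up.

```python
def is_valid(counts, min_standards=180, min_deviants=80):
    """Check one experiment.

    counts maps a condition name to (n_standards, n_deviants)
    remaining after cleaning. Hypothetical helper.
    """
    return all(n_standards >= min_standards and n_deviants >= min_deviants
               for n_standards, n_deviants in counts.values())

# Made-up counts for two hypothetical experiments.
counts_by_experiment = {
    "experiment_A": {"GiepM": (350, 118), "GiepS": (341, 110),
                     "GopM": (322, 101), "GopS": (301, 95)},
    "experiment_B": {"GiepM": (340, 112), "GiepS": (150, 60),
                     "GopM": (330, 105), "GopS": (310, 99)},
}

valid = sorted(name for name, counts in counts_by_experiment.items()
               if is_valid(counts))
print(valid)
```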

In [None]:
valid_experiments = processing.valid_experiments(epodium, local_paths.ePod_epochs_events, min_standards=180, min_deviants=80)

<br>

---
<a id='4epod'></a>
## 4. The Event Related Potential (ERP)

+ The voltage change in the brain in response to an event is called the *event-related potential* (ERP).
+ The difference between the ERP of a standard and the ERP of a deviant is called the *mismatch response* (MMR).
+ The mismatch response can be analysed to predict differences between participants.
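
In terms of arrays, an ERP is the average over all epochs of one event type, and the mismatch response is the difference between two such averages. The sketch below uses synthetic data with reduced dimensions and assumes the common deviant-minus-standard convention; the sign convention used by `display_helper.plot_ERP` is not shown in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_channels, n_times = 4, 256  # reduced sizes to keep the example light

# Synthetic epochs with shape (n_epochs, n_channels, n_times).
standard_epochs = rng.normal(size=(90, n_channels, n_times))
deviant_epochs = rng.normal(size=(30, n_channels, n_times))

# The ERP is the mean over epochs of the same event type.
erp_standard = standard_epochs.mean(axis=0)
erp_deviant = deviant_epochs.mean(axis=0)

# Mismatch response, assuming the deviant-minus-standard convention.
mismatch = erp_deviant - erp_standard
print(mismatch.shape)
```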


#### Choose experiment to analyse

In [None]:
def load_participant_data(experiment):
    global epochs
    paths_epoch = os.path.join(local_paths.ePod_epochs, experiment + "_epo.fif")
    epochs = mne.read_epochs(paths_epoch, verbose = 0)
    print(f"Loaded experiment: {experiment} ")

ipywidgets.interact(load_participant_data, experiment=valid_experiments);

#### Widget for plotting standard, deviant, and mismatch ERPs

In [None]:
condition = ipywidgets.RadioButtons(options=epodium.conditions, 
                                    description='Condition:',
                                    value="GiepM")
event_type = ipywidgets.RadioButtons(options=["standard", "deviant", "MMN"], 
                                     description='Event type:', 
                                     value="standard")

def plot_ERP_widget(con, ev):
    display_helper.plot_ERP(epochs, con, ev)

ui = ipywidgets.HBox([condition, event_type])
out = ipywidgets.interactive_output(plot_ERP_widget, {'con': condition, 'ev': event_type})
display(ui, out)

#### Widget for modifying data

In the deep learning notebook, the model uses a sequence of data. This data is shown in the plot below. Tweak the widget to see the effect of changing the parameters on an ERP. The plot can take a while to load.

+ *sample_rate*: The number of data points per second in each channel.
+ *n_trials_averaged*: The number of trials averaged to form the ERP.
+ *gaussian_noise*: The standard deviation of the noise added to each data point to reduce overfitting.
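
How these three parameters could act on the epochs is sketched below with NumPy. This is not the code of `EpodiumSequence`: the function name, the order of the steps and the naive resampling by picking evenly spaced samples (without anti-alias filtering) are all assumptions made for illustration.

```python
import numpy as np

def make_instance(epochs, sample_rate, n_trials_averaged, gaussian_noise,
                  original_rate=2048, seed=0):
    """Hypothetical sketch: average trials, resample, then add noise."""
    rng = np.random.default_rng(seed)

    # Average a random subset of trials to form one ERP.
    chosen = rng.choice(len(epochs), size=n_trials_averaged, replace=False)
    erp = epochs[chosen].mean(axis=0)

    # Naive resampling: keep evenly spaced samples along the time axis.
    n_times = erp.shape[-1]
    n_out = int(round(n_times * sample_rate / original_rate))
    idx = np.linspace(0, n_times - 1, n_out).round().astype(int)
    erp = erp[:, idx]

    # Additive Gaussian noise with the given standard deviation.
    return erp + rng.normal(scale=gaussian_noise, size=erp.shape)

rng = np.random.default_rng(1)
epochs = rng.normal(scale=1e-5, size=(60, 4, 2048))  # synthetic epochs
x = make_instance(epochs, sample_rate=512, n_trials_averaged=30,
                  gaussian_noise=1e-6)
print(x.shape)
```

Averaging more trials reduces the noise in the ERP, while a lower sample rate shrinks each instance that is fed to the model.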

In [None]:
def show_sequence_data(p, f, n, g):
    labels = epodium.create_labels(local_paths.ePod_metadata)
    sequence = EpodiumSequence(valid_experiments, labels, local_paths.ePod_epochs, batch_size=1, 
                               sample_rate=f, n_trials_averaged=n, gaussian_noise=g*1e-6)
    x, y = sequence.__getitem__(p, True)
    
    print(f"The shape of one data instance is {x[0].shape}")
    display_helper.plot_array_as_evoked(x[0], epodium.channels_epod, frequency=f, n_trials=n)
    print(f"In this experiment the age of the participant is {int(y[0])} days.")

# Widget sliders
participant = ipywidgets.IntSlider(description="participant")
sample_rate = ipywidgets.IntSlider(description="sample_rate", value=2049, min=10, max=2049)
n_trials_averaged = ipywidgets.IntSlider(description="n_trials_averaged", value=30)
gaussian_noise = ipywidgets.FloatSlider(description="gaussian_noise", max=3)
# Widget setup
ui = ipywidgets.HBox([participant, sample_rate, n_trials_averaged, gaussian_noise])
out = ipywidgets.interactive_output(show_sequence_data, {'p':participant, 'f': sample_rate, "n": n_trials_averaged, "g": gaussian_noise})
display(ui, out)

Beware: the absolute value on the y-axis (microvolts) is meaningless due to data normalization. Each data point is divided by the standard deviation of all the signals.
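
A minimal sketch of this kind of normalization on synthetic data is given below. It assumes a single standard deviation computed over all channels and time points of one instance; over which signals the actual code computes the standard deviation is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
erp_microvolt = rng.normal(scale=5.0, size=(32, 256))  # synthetic ERP in microvolts

# Divide every data point by one global standard deviation.
normalized = erp_microvolt / erp_microvolt.std()

# After scaling, the data has unit standard deviation, so the
# original microvolt scale can no longer be read from the plot.
print(np.isclose(normalized.std(), 1.0))
```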