## Applying deep learning methods to the ePodium dataset to predict dyslexia

#### Import Packages

In [1]:
import mne
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from functions import epodium

import PATH

Same comments on the names as in the other notebook, but fine for now. I would add at least one markdown cell explaining what epochs are in this context. Remember that for AI people who may read this (future job interviews?), "epochs" has a different meaning, and you should stay very clear on your terms for everyone.
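
A minimal sketch of the two meanings of "epoch", in case it helps with that markdown cell. In EEG analysis an epoch is a fixed-length segment of the recording cut around one stimulus event; in deep learning it is one full pass over the training data. The array sizes and axis order below are made up for illustration and are not taken from the ePodium files.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and axis order, chosen only for illustration.
n_eeg_epochs, n_channels, n_times = 100, 32, 501

# EEG epochs: one (channels, times) segment per stimulus event.
eeg_epochs = rng.normal(size=(n_eeg_epochs, n_channels, n_times))

# Averaging over the EEG epochs gives one evoked response (ERP).
evoked = eeg_epochs.mean(axis=0)

# Training epochs: how many passes the optimizer makes over the training set.
# This number is unrelated to the number of EEG epochs above.
n_training_epochs = 100

print(eeg_epochs.shape, evoked.shape, n_training_epochs)
```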

#### Check number of epochs in each experiment
Experiments with enough epochs are added to *clean_list*

In [3]:
standard_minimum = 180  # total of 360
deviant_minimum = 80  # total size of 120
firststandard_minimum = 80  # total size of 120

count_analyzed = 0
count_bad = 0

clean_list = []

firststandard_index = [1, 4, 7, 10]
standard_index = [2, 5, 8, 11]
deviant_index = [3, 6, 9, 12]

# REVIEW: Using glob.glob() would've been more appropriate here:
# for event_file in glob.glob(os.path.join(PATH.ePod_processed_autoreject_events, '?' * 4 + '.txt'))
# This would remove the need for the following `if' inside the loop and for the
# `count_analyzed' variable, since the number of files retrieved would be its
# final value.  Note that glob.glob() returns full paths rather than bare file names.
for event_file in os.listdir(PATH.ePod_processed_autoreject_events):
    if event_file.endswith('.txt') and len(event_file) == 8:
        # print(f"Analyzing {event_file}")
        count_analyzed += 1
        event = np.loadtxt(os.path.join(
            PATH.ePod_processed_autoreject_events, event_file), dtype=int)

        # Count how many events are left in standard, deviant, and FS
        # REVIEW: What is the significance of number 4?
        # REVIEW: This would've been easier to understand if it was a separate
        # function.
        # REVIEW: Removed unnecessary parentheses.
        # REVIEW: It's better to make things that do similar things look similar.
        # This code is repeatedly calling `np.count_nonzero()' on a similar argument.
        # Essentially, the question this condition is asking is something like:
        # "are the numbers of events within desired range".  This may be made to
        # stand out more if expressed as:
        # if any(exceeds_limit(kind, limit) for kind, limit in event_ranges[i]):
        #     ...
        # -----
        # where `event_ranges = ((standard_index, standard_minimum), ...)'
        # REVIEW: If you decide to rewrite this loop as a separate function, you
        # could also avoid having `count_bad' variable, the desired value would be the
        # difference between the total number of processed files and the number of
        # files yielded by the generator.
        for i in range(4):
            # The outer parentheses are required so the condition can span several lines.
            if (np.count_nonzero(event[:, 2] == standard_index[i]) < standard_minimum
                    or np.count_nonzero(event[:, 2] == deviant_index[i]) < deviant_minimum
                    or np.count_nonzero(event[:, 2] == firststandard_index[i]) < firststandard_minimum):
                count_bad += 1
                break
            # REVIEW: Note: this condition is nearly equivalent to adding an `else' clause to the loop:
            # for i in range(...):
            #     if condition:
            #        break
            # else:
            #     this code is executed if `break' was never reached
            if i == 3:  # No bads found at end of for loop
                clean_list.append(event_file)

clean_list = sorted(clean_list)
print(f"Analyzed: {count_analyzed}, bad: {count_bad}")
print(f"{len(clean_list)} files have enough epochs for analysis.")

Analyzed: 188, bad: 37
151 files have enough epochs for analysis.


In [None]:
I improved the formatting (minor issue). If there is time, make this a function on the back end.
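
A minimal sketch of what that back-end function could look like, assuming the event array layout used in the cell above (event code in the third column). The name `has_enough_epochs` and the helper `make_events` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Event codes per condition and default minimums, copied from the cell above.
FIRSTSTANDARD_INDEX = [1, 4, 7, 10]
STANDARD_INDEX = [2, 5, 8, 11]
DEVIANT_INDEX = [3, 6, 9, 12]


def has_enough_epochs(events, standard_minimum=180, deviant_minimum=80,
                      firststandard_minimum=80):
    """Return True if every event type keeps at least its minimum number of epochs.

    `events` is an (n_events, 3) integer array with the event code in column 2.
    """
    codes = events[:, 2]
    limits = (
        (STANDARD_INDEX, standard_minimum),
        (DEVIANT_INDEX, deviant_minimum),
        (FIRSTSTANDARD_INDEX, firststandard_minimum),
    )
    for indexes, minimum in limits:
        for code in indexes:
            if np.count_nonzero(codes == code) < minimum:
                return False
    return True


def make_events(counts):
    """Build a synthetic event array from a {event_code: n_events} mapping."""
    codes = np.repeat(list(counts), list(counts.values()))
    return np.column_stack([np.arange(codes.size),
                            np.zeros(codes.size, dtype=int),
                            codes])


good_counts = {code: 200 for code in range(1, 13)}
bad_counts = dict(good_counts)
bad_counts[3] = 50  # too few deviants in the first condition

print(has_enough_epochs(make_events(good_counts)))
print(has_enough_epochs(make_events(bad_counts)))
```

With a function like this, the clean list becomes a single filter over the event files, and the number of bad files is the difference between the files analyzed and the files kept.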

#### Split into train and test dataset
Both the train and test sets have the same proportion of participants who did experiment a only, experiment b only, or both.

In [4]:
# Split test/train on participant
# REVIEW: I believe, that what you wanted here is `os.path.splitext(file)[0]'
experiments = [file.replace('.txt', '') for file in clean_list]

# Split experiments into participants that did a, b, and both
# REVIEW: Could you also just check if 'a' is in the file name?
experiments_a = [file.replace('a', '') for file in experiments]
experiments_a = [item for item in experiments_a if len(item) == 3]
experiments_b = [file.replace('b', '') for file in experiments]
experiments_b = [item for item in experiments_b if len(item) == 3]
# REVIEW: Is order important here? Otherwise these are best accomplished
# by using methods defined on `set()'
# Something like:
# ids_a = {exp[:-1] for exp in experiments if exp.endswith('a')}
# ids_b = {exp[:-1] for exp in experiments if exp.endswith('b')}
# a_and_b = ids_a & ids_b
experiments_a_and_b = [file for file in experiments_a if file in experiments_b]
experiments_a_only = [file for file in experiments_a if file not in experiments_b]
experiments_b_only = [file for file in experiments_b if file not in experiments_a]

participants = sorted(experiments_a_and_b + experiments_a_only + experiments_b_only)

# Split participants into train and test dataset
train_ab, test_ab = train_test_split(experiments_a_and_b, test_size=0.25)
train_a, test_a = train_test_split(experiments_a_only, test_size=0.25)
train_b, test_b = train_test_split(experiments_b_only, test_size=0.25)

# REVIEW: See my comment above. If you simply check for 'a' or 'b'
# being present, you don't need to add them back.
train = [x + 'a' for x in train_ab] + [x + 'b' for x in train_ab] + \
        [x + 'a' for x in train_a] + [x + 'b' for x in train_b]
test = [x + 'a' for x in test_ab] + [x + 'b' for x in test_ab] + \
       [x + 'a' for x in test_a] + [x + 'b' for x in test_b]
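
A minimal sketch of the set-based grouping, assuming every experiment name is a participant ID followed by a single `a` or `b` (as in `101a`). The function name `group_by_participant` and the example names are hypothetical.

```python
def group_by_participant(experiments):
    """Group experiment names such as '101a' by which experiments each participant did.

    Returns three sorted lists of participant IDs: both, a only, b only.
    """
    ids_a = {name[:-1] for name in experiments if name.endswith('a')}
    ids_b = {name[:-1] for name in experiments if name.endswith('b')}
    return sorted(ids_a & ids_b), sorted(ids_a - ids_b), sorted(ids_b - ids_a)


# Made-up experiment names, used only to show the grouping.
both, a_only, b_only = group_by_participant(['101a', '101b', '102a', '103b'])
print(both, a_only, b_only)  # ['101'] ['102'] ['103']
```

The three lists can then go to `train_test_split` exactly as above. Sorting the results keeps the order stable from run to run, which sets alone do not guarantee.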

#### Create Iterator Sequence as input to feed the model
https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence


In [18]:
from tensorflow.keras.utils import Sequence

class EvokedIterator(Sequence):

    # REVIEW: Fixed formatting: extra white space around named arguments,
    # also a lot of extra white space in empty lines and at the end of lines.
    # Also, spaces after commas and unnecessary parentheses.
    def __init__(self, experiments, n_experiments=8, n_trials_averaged=60):
        self.experiments = experiments
        self.n_experiments = n_experiments
        self.n_trials_averaged = n_trials_averaged

        metadata_path = os.path.join(PATH.ePod_metadata, "children.txt")
        self.metadata = pd.read_table(metadata_path)

        event_types = 12  # (FS/S/D in 4 conditions)
        self.n_files = len(self.experiments) * event_types
        self.batch_size = self.n_experiments * event_types

    def __len__(self):
        # The number of batches in the Sequence.
        return int(np.ceil(len(self.experiments) / self.n_experiments))
    
    def __getitem__(self, index):

        x_batch = []
        y_batch = []

        for i in range(self.n_experiments):
            participant_index = (index * self.n_experiments + i) % len(self.experiments)
            participant_id = self.experiments[participant_index][:3]
            participant_metadata = self.metadata.loc[self.metadata['ParticipantID'] == float(participant_id)]

            for key in epodium.event_dictionary:

                # Get file
                # REVIEW: If you are already using f-strings, be consistent, use them here too.
                npy_name = self.experiments[participant_index] + "_" + key + ".npy"
                npy_path = os.path.join(PATH.ePod_processed_autoreject_epochs_split_downsampled, npy_name)
                npy = np.load(npy_path)
                
                # Create ERP from averaging 'n_trials_averaged' trials.
                trial_indexes = np.random.choice(npy.shape[0], self.n_trials_averaged, replace=False)
                evoked = np.mean(npy[trial_indexes, :, :], axis=0)
                x_batch.append(evoked)

                # Create labels
                y = np.zeros(5)
                if participant_metadata["Sex"].item() == "F":
                    y[0] = 1
                if participant_metadata["Group_AccToParents"].item() == "At risk":
                    y[1] = 1
                
                if key.endswith("_FS"):
                    y[2] = 1
                if key.endswith("_S"):
                    y[3] = 1
                if key.endswith("_D"):
                    y[4] = 1
                y_batch.append(y)        

        return np.array(x_batch), np.array(y_batch)

train_sequence = EvokedIterator(train)
test_sequence = EvokedIterator(test)
# x,y = train_sequence.__getitem__(0)
# x.shape

#### Train model

The data is an *evoked* response, or *ERP*, from a participant in the ePodium experiment. Each ERP is the average of 60 EEG epochs, each spanning -0.2 to +0.8 seconds relative to the onset of an event. This is done for each of the 12 event types separately.

dimensions: 
+ x (batches, timesteps, channels)
+ y (batches, labels)

labels: 
+ (Sex, At risk of dyslexia, first standard, standard, deviant)
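
A minimal sketch of how one input and its label vector are built, mirroring the steps in `EvokedIterator.__getitem__`. The helper names, the array sizes, and the event key `cond1_D` are hypothetical; the real keys come from `epodium.event_dictionary`.

```python
import numpy as np


def make_evoked(epochs, n_trials_averaged=60, rng=None):
    """Average a random subset of trials (first axis) into one ERP."""
    rng = np.random.default_rng() if rng is None else rng
    trial_indexes = rng.choice(epochs.shape[0], n_trials_averaged, replace=False)
    return epochs[trial_indexes].mean(axis=0)


def make_label(sex, group, key):
    """Encode (Sex, At risk, first standard, standard, deviant) as a 0/1 vector."""
    y = np.zeros(5)
    y[0] = sex == "F"
    y[1] = group == "At risk"
    y[2] = key.endswith("_FS")
    y[3] = key.endswith("_S")
    y[4] = key.endswith("_D")
    return y


rng = np.random.default_rng(0)
# Made-up data: 80 trials, 501 timesteps, 32 channels.
epochs = rng.normal(size=(80, 501, 32))

evoked = make_evoked(epochs, n_trials_averaged=60, rng=rng)
label = make_label("F", "At risk", "cond1_D")  # hypothetical event key
print(evoked.shape, label)
```

Drawing the trials without replacement means a file needs at least `n_trials_averaged` trials, which is why the minimum epoch counts are checked earlier.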


In [None]:
import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau

from models.DNN import fully_connected_model
from models.Transformer import TransformerModel

# fit network
# REVIEW: Move all library imports to the top, please. Also, ideally notebooks
# need to execute in one go. Please don't rely on variables being defined in the
# cells *following* the cell, or later inside the same cell (as here) that uses
# the variable.  In this case, writing a function that applies to `model'
# would've been an appropriate alternative.
try:
    print(f"{model} already loaded")
except NameError:
    print("initialise model")
    model = fully_connected_model()
    # REVIEW: Be consistent in the way you import definitions.
    # A good rule of thumb is to import classes from modules.  In this
    # instance, this means:
    # from tensorflow.keras.optimizers import Adam
    # and later use it as simply `Adam' instead of `tf.keras.optimizers.Adam'
    # REVIEW: Also formatting.
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss=tf.keras.losses.BinaryCrossentropy(),
        metrics=[
            tf.keras.metrics.Precision(),
            tf.keras.metrics.BinaryAccuracy(),
            tf.keras.metrics.Recall(),
        ],
    )

    output_filename = 'fully_connected_model'
    output_file = os.path.join(PATH.models, output_filename)
    # REVIEW: Formatting.
    checkpointer = ModelCheckpoint(filepath=output_file + ".hdf5", monitor='val_loss', verbose=1, save_best_only=True)
    earlystopper = EarlyStopping(monitor='val_loss', patience=1200, verbose=1)
    reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=200, min_lr=0.0001, verbose=1)

history = model.fit(x=train_sequence,
                    validation_data=test_sequence,
                    epochs=100,
                    callbacks=[checkpointer, earlystopper, reduce_lr])

<keras.engine.functional.Functional object at 0x7f56d48631f0> already loaded
Epoch 1/100
Epoch 1: val_loss did not improve from 0.65021
Epoch 2/100
Epoch 2: val_loss did not improve from 0.65021
Epoch 3/100
Epoch 3: val_loss did not improve from 0.65021
Epoch 4/100
Epoch 4: val_loss did not improve from 0.65021
Epoch 5/100
Epoch 5: val_loss did not improve from 0.65021
Epoch 6/100
Epoch 6: val_loss did not improve from 0.65021
Epoch 7/100
Epoch 7: val_loss did not improve from 0.65021
Epoch 8/100
Epoch 8: val_loss did not improve from 0.65021
Epoch 9/100
Epoch 9: val_loss did not improve from 0.65021
Epoch 10/100
Epoch 10: val_loss did not improve from 0.65021
Epoch 11/100
Epoch 11: val_loss improved from 0.65021 to 0.64824, saving model to /volume-ceph/models/model.hdf5
Epoch 12/100
Epoch 12: val_loss did not improve from 0.64824
Epoch 13/100
Epoch 13: val_loss did not improve from 0.64824
Epoch 14/100
Epoch 14: val_loss improved from 0.64824 to 0.64788, saving model to /volume-ceph/m

And as mentioned before, save a graph of things like the loss; then you can clear the outputs before saving the notebook, which will look better.
This needs a final conclusion in a markdown cell about what was shown here. We want to know whether the accuracy went up or not.
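
A minimal sketch of saving such a graph, assuming the loss values are taken from `history.history` (the dictionary Keras attaches to the object returned by `model.fit`). The function name, the file name `loss.png`, and the numbers in `example_history` are hypothetical and are not results from this notebook.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt


def save_loss_plot(history_dict, path):
    """Plot training (and, if present, validation) loss per training epoch."""
    fig, ax = plt.subplots()
    ax.plot(history_dict["loss"], label="train")
    if "val_loss" in history_dict:
        ax.plot(history_dict["val_loss"], label="validation")
    ax.set_xlabel("training epoch")
    ax.set_ylabel("loss")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
    return Path(path)


# Placeholder values standing in for `history.history`.
example_history = {"loss": [0.70, 0.67, 0.66], "val_loss": [0.69, 0.66, 0.65]}
saved_path = save_loss_plot(example_history, Path("loss.png"))
```

Once the figure is saved to disk, the long per-epoch log can be cleared from the cell output and the image referenced from the concluding markdown cell.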