# <center>Analyzing Neonatal EEG Seizure Recordings</center>
## <center>November 9th, 2019</center>
Easton Potokar

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import mne                               #package to handle EEG data files
import os, seaborn, re
from scipy import io                     #for loading matlab file

plt.style.use("seaborn")
plt.rcParams["figure.figsize"] = [12,5]
plt.rcParams["figure.dpi"] = 200

fs = 256                                 #sampling frequency in Hz

In [2]:
def read_data(filename):
    return pd.read_pickle(filename, compression="gzip")

## 1 Introduction

Neonatal seizures are common and require immediate care. Detection is only possible through continuous electroencephalogram (EEG) monitoring. Unfortunately, this places a heavy burden on NICUs (Newborn Intensive Care Units), because interpreting EEGs requires specialized expertise that is generally not available in a NICU. One alternative is a simplified, easy-to-read trend of the EEG output known as an amplitude-integrated EEG (aEEG). While it has its strengths, seizures of short duration or low amplitude can be missed entirely.

Continuous multichannel EEG is the gold standard for detecting seizures, but expert interpretation is not readily available to NICUs. Another alternative is to give experts remote access to the EEG, but this still requires 24-hour surveillance, which is also a heavy load.

The dataset used here is available through a public repository and contains EEG recordings of 79 term neonates admitted to the NICU, with a median duration of 74 minutes [1]. Each EEG includes 10 channels of data, each recorded at 256 Hz and thus containing frequencies up to 128 Hz. Three experts examined these recordings and labeled each one-second interval as either containing a seizure or not.
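
The sampling arithmetic in the paragraph above can be sketched as follows. This is only an illustration of the numbers quoted there, not part of the original analysis; the variable names other than `fs` are hypothetical.

```python
fs = 256                       # sampling frequency in Hz, as stated above
nyquist = fs / 2               # highest frequency representable at this rate
median_minutes = 74            # median recording duration quoted above

# number of samples per channel in a recording of median length
samples_per_channel = fs * 60 * median_minutes

print(nyquist, samples_per_channel)
```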

# 2 Data Preparation
## 2.1 Data Scraping

The data can be found at https://zenodo.org/record/2547147 and is best downloaded using the pip package `zenodo-get`. Running `pip install zenodo-get` followed by `zenodo_get.py 10.5281/zenodo.2547147` downloads all of the data and checks the md5sums to ensure everything downloaded properly. Since this process was simple enough, no additional scraping methods were needed.
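
For readers who want to re-check a downloaded file by hand, a checksum comparison might look like the sketch below. It uses only the standard library; the helper names `md5_of_file` and `verify_md5` are hypothetical and are not part of `zenodo-get`, which performs its own check.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # read fixed-size chunks so large EEG files never sit fully in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_md5(path, expected_hex):
    """Return True when the file's digest matches the expected checksum."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

The expected checksum would come from whatever checksum listing accompanies the download.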

## 2.2 Data Cleaning

### 2.2.1 Cleaning Clinical Information

The data is stored in a mixture of `.csv` and `.edf` files, along with a `.mat` file of annotations. The `.edf` format is a standard for EEG data and can be read using the Python package `mne`. First we load the csv files and clean the data found in them.

In [4]:
#load in clinical data
ci = pd.read_csv("data-og/clinical_information.csv", index_col="ID", usecols=["EEG file", "ID", "Gender", "GA (weeks)", "BW (g)"])

#replace weight string values with intervals
replaceBW = {"less than 2500g": pd.Interval(0, 2500),
              "2500 to 3000g": pd.Interval(2500, 3000),
              "3000 to 3500g": pd.Interval(3000, 3500),
              "3500 to 4000g": pd.Interval(3500, 4000),
              "greater than 4000g": pd.Interval(4000, 4500),
              }
ci.replace(replaceBW, inplace=True)

#replace gestational age string values with intervals
def interval(weeks):
    if not isinstance(weeks, str) and np.isnan(weeks):
        return weeks
    values = re.compile(r"(\d{2})").findall(weeks)
    return pd.Interval(int(values[0]), int(values[1]))
ci['GA (weeks)'] = ci['GA (weeks)'].apply(interval)

#load all experts' analyses from the .mat file. Note we store numpy arrays b/c each child has a different length of recording
annot = io.loadmat('data-og/annotations_2017.mat')['annotat_new']
ci['expertA'] = [annot[0,i-1][0,:] for i in ci.index]
ci['expertB'] = [annot[0,i-1][1,:] for i in ci.index]
ci['expertC'] = [annot[0,i-1][2,:] for i in ci.index]
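
To make the gestational-age parsing concrete, here is a small self-contained sketch of the same regex idea applied to made-up strings. The function name `parse_week_range` and the sample strings are assumptions for illustration; the wording in the real file may differ.

```python
import re
import numpy as np
import pandas as pd

def parse_week_range(text):
    """Turn a string holding two two-digit week counts into a pd.Interval."""
    if not isinstance(text, str):
        return text                       # leave NaN (missing) entries untouched
    low, high = re.findall(r"\d{2}", text)[:2]
    return pd.Interval(int(low), int(high))

# hypothetical inputs, including one missing value
sample = pd.Series(["37 to 38 weeks", np.nan, "40 to 41 weeks"])
parsed = sample.apply(parse_week_range)
print(parsed.tolist())
```

Storing intervals rather than strings keeps the bounds available as numbers through `.left` and `.right`.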

### 2.2.2 Cleaning EEG Data

Next we use `mne` to read all of the data into pandas DataFrames. The baby's information (ID, gender, birth weight, and gestational age) is saved as metadata on each separate DataFrame, while the experts' analyses stay in `ci`. Each DataFrame is then saved as a pickle file for later analysis.

In [5]:
#iterate through all of the files
for i in ci.index:
    #read in all of the raw data
    raw = mne.io.read_raw_edf("data-og/{}.edf".format(ci["EEG file"][i]))
    channels = raw.ch_names
    signals = raw[channels][0]
    time = raw[0,:][1]

    #save into pandas DataFrame
    df = pd.DataFrame(signals.T, columns=channels, index=time)
    df.ID = i
    df.gender = ci['Gender'][i]
    df.bw = ci['BW (g)'][i]
    df.ga = ci['GA (weeks)'][i]
    df._metadata = ['gender', 'bw', 'ga', 'ID']
    
    #check that each expert has exactly one label per second of recording
    for expert in ['expertA', 'expertB', 'expertC']:
        if len(ci[expert][i]) != len(df.index) / fs:
            print(len(ci[expert][i]), len(df.index) / fs)
            raise ValueError("EEG {} has mismatched {} and time stamps".format(i, expert))
        
    #save to gzipped pickle file and update location in ci
    print("Saving {} to pklz...".format(ci["EEG file"][i]))
    df.to_pickle('data/{}.pklz'.format(ci["EEG file"][i]), compression="gzip", protocol=-1)
    ci.loc[i, "EEG file"] = "data/{}.pklz".format(ci["EEG file"][i])
    
ci.to_pickle('data/ci_cleaned.pklz', compression="gzip", protocol=-1)

Extracting EDF parameters from /home/contagon6/Documents/nicu-eeg-dataproject/data-og/eeg1.edf...
EDF file detected
Setting channel info structure...
Creating raw.info structure...
Saving eeg1 to pklz...
Extracting EDF parameters from /home/contagon6/Documents/nicu-eeg-dataproject/data-og/eeg2.edf...
EDF file detected
Setting channel info structure...
Creating raw.info structure...
Saving eeg2 to pklz...
Extracting EDF parameters from /home/contagon6/Documents/nicu-eeg-dataproject/data-og/eeg3.edf...
EDF file detected
Setting channel info structure...
Creating raw.info structure...
Saving eeg3 to pklz...
Extracting EDF parameters from /home/contagon6/Documents/nicu-eeg-dataproject/data-og/eeg4.edf...
EDF file detected
Setting channel info structure...
Creating raw.info structure...
Saving eeg4 to pklz...
Extracting EDF parameters from /home/contagon6/Documents/nicu-eeg-dataproject/data-og/eeg5.edf...
EDF file detected
Setting channel info structure...
Creating raw.info structure...
Sav
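
As a quick sanity check on the storage format, a gzipped pickle round trip can be sketched with a toy DataFrame. The channel names `ch1` and `ch2`, the two-second length, and the file name are hypothetical stand-ins for a real recording.

```python
import os
import tempfile
import numpy as np
import pandas as pd

fs = 256                                   # sampling frequency in Hz
time = np.arange(2 * fs) / fs              # two seconds of time stamps
toy = pd.DataFrame(
    {"ch1": np.sin(2 * np.pi * 10 * time),  # 10 Hz test tone
     "ch2": np.cos(2 * np.pi * 10 * time)},
    index=time,
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pklz")
    toy.to_pickle(path, compression="gzip")
    restored = pd.read_pickle(path, compression="gzip")

print(restored.equals(toy))
```

The compression is passed explicitly on both sides because the `.pklz` extension is not one pandas can infer it from.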

## 2.3 Potential Data Problems

I believe the source of the data to be quite reliable. The data was recorded at a hospital in Finland by a third party, which should limit bias or selective picking of data. The group that posted it was also looking to implement different ML algorithms to detect seizures and appears to have been at least moderately successful using an SVM. As long as the group didn't cherry-pick data for their model, which seems unlikely since that would be unethical and the work has been published, the data should be sufficiently reliable.

Upon examining the data, I found that the length of some recordings didn't match the length of the corresponding analyses by "Expert A", which raises concerns. Upon further inspection, there appears to be something wrong with the `.csv` file containing Expert A's annotations. Fortunately, a `.mat` file was also included in the dataset, and in it the lengths match for each expert and the EEG data. Beyond that, all data appears to contain valid information.
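
A check of this kind can be sketched as a small helper that reports which annotators disagree with the recording length instead of raising immediately. The function name and the toy data are hypothetical, and it assumes one label per second at 256 Hz.

```python
import numpy as np

def find_length_mismatches(n_samples, annotations, fs=256):
    """Return the names of annotators whose label count is not one per second."""
    expected = n_samples / fs
    return [name for name, labels in annotations.items()
            if len(labels) != expected]

# hypothetical 10-second recording with one deliberately short annotation
n_samples = 10 * 256
annotations = {
    "expertA": np.zeros(9, dtype=int),
    "expertB": np.zeros(10, dtype=int),
    "expertC": np.zeros(10, dtype=int),
}
mismatched = find_length_mismatches(n_samples, annotations)
print(mismatched)
```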

## 3 Feature Engineering

In [None]:
ci['expertA_avg'] = [np.sum(ci['expertA'][i]) / len(ci['expertA'][i]) for i in ci.index]
ci['expertB_avg'] = [np.sum(ci['expertB'][i]) / len(ci['expertB'][i]) for i in ci.index]
ci['expertC_avg'] = [np.sum(ci['expertC'][i]) / len(ci['expertC'][i]) for i in ci.index]
ci['minutes'] = [len(ci['expertA'][i])/60 for i in ci.index]
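
The averages above are the fraction of seconds each expert marked as seizure. A toy example, using made-up labels, shows that the mean of a 0/1 array gives the same number as the sum divided by the length:

```python
import numpy as np

# hypothetical per-second labels: 1 = seizure, 0 = no seizure
labels = np.array([0, 0, 1, 1, 1, 0, 0, 0])

seizure_fraction = labels.mean()           # same as labels.sum() / len(labels)
minutes = len(labels) / 60                 # recording length in minutes

print(seizure_fraction, minutes)
```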

In [114]:

loaded = io.loadmat('data-og/annotations_2017.mat')


In [133]:
print(loaded['annotat_new'][0,2][0,:].shape)
print(loaded['annotat_new'][0,2][0,:])

(4412,)
[0 0 0 ... 0 0 0]
