### Handling Missing Values  
  
This notebook focuses on handling EEGs and spectrograms with missing values. The first section of code finds every EEG and spectrogram with at least one missing value and then determines the distribution of the target variable for them. It also determines what the distribution of the target variable would be if those EEGs and spectrograms were removed instead of having their missing values replaced.  
  
Before handling missing values, I'll need to handle outliers. What exactly this looks like, I'm not sure. Some outlier values are likely informative, but that isn't always the case. One GRDA example, which will be shown in the extreme values notebook, has 9999.0 in every column for the first 2k rows of the EEG. I think it's unlikely this is the result of anything other than an error in data entry or processing.
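
One option for an obvious placeholder like 9999.0 is to convert it to NaN so that it gets handled together with the other missing values. This is only a minimal sketch of that idea on a toy frame: the column names, the helper name `mask_sentinel`, and the choice to treat exactly 9999.0 as a sentinel are assumptions for illustration, not something I've settled on.

```python
import numpy as np
import pandas as pd

def mask_sentinel(eeg: pd.DataFrame, sentinel: float = 9999.0) -> pd.DataFrame:
    """Return a copy of `eeg` with every entry equal to `sentinel` set to NaN.

    Hypothetical helper: the sentinel value is an assumption based on the
    single GRDA example mentioned above.
    """
    return eeg.mask(eeg == sentinel)

# Toy EEG (made-up column names): the first three rows hold the sentinel
# in every column, followed by two rows of ordinary readings.
toy_eeg = pd.DataFrame({
    'Fp1': [9999.0, 9999.0, 9999.0, 12.5, -3.1],
    'O1':  [9999.0, 9999.0, 9999.0, 7.0, 4.2],
})

cleaned = mask_sentinel(toy_eeg)
print(cleaned.isna().sum())  # number of NaNs per column after masking
```

The original frame is left untouched, so the masked copy can be compared against it when deciding whether a flagged value was really an error.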

It's important to handle outliers before filling in missing values because extreme values that aren't handled first will influence missing value imputation. Previously, I had chosen median imputation to handle missing values, but for EEGs with a large number of consecutive missing values, this inserts a flat section resembling an interdischarge interval into EEGs which shouldn't have one. Only 2 of my 6 brain activity types should have an interdischarge interval in their signals: Lateralized Periodic Discharge and Generalized Periodic Discharge.  
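
The flat section is easy to reproduce on synthetic data. The sketch below uses a made-up sine wave in place of a real electrode and blanks out 50 consecutive samples. The signal, the gap position, and the gap length are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic single-electrode signal (a sine wave), purely illustrative.
t = np.linspace(0, 4 * np.pi, 200)
signal = pd.Series(np.sin(t))

# Blank out 50 consecutive samples to mimic a long run of missing values.
signal.iloc[80:130] = np.nan

# Median imputation: every missing sample receives the same value.
filled = signal.fillna(signal.median())

gap = filled.iloc[80:130]
print(gap.nunique())            # distinct values inside the filled gap
print(gap.max() - gap.min())    # range of the filled gap
```

Every sample in the gap ends up identical, which is the flat stretch that can be mistaken for an interdischarge interval.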
  
For lateralized activity, if every electrode is missing the same set of values, then this flat section resembling an interdischarge interval is introduced consistently across both hemispheres, which is inconsistent with lateralized activity. This affects at least two categories: Lateralized Periodic Discharge and Lateralized Rhythmic Delta Activity. It seems that EEGs with missing values are often missing them consecutively across every electrode, so these problems would show up often.  
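
One way to check how often this happens would be to measure, per EEG, the longest run of consecutive rows in which every electrode is missing at once. This is a rough sketch on a toy frame. The helper name `longest_shared_gap` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

def longest_shared_gap(eeg: pd.DataFrame) -> int:
    """Length of the longest run of consecutive rows where every column is NaN."""
    all_missing = eeg.isna().all(axis=1).to_numpy()
    longest = current = 0
    for flag in all_missing:
        # Extend the current run on a fully missing row, otherwise reset it.
        current = current + 1 if flag else 0
        longest = max(longest, current)
    return longest

# Toy EEG with made-up column names: rows 1-3 are missing in both columns,
# rows 5 and 6 are each missing in only one column.
toy_eeg = pd.DataFrame({
    'left':  [1.0, np.nan, np.nan, np.nan, 2.0, np.nan, 3.0],
    'right': [0.5, np.nan, np.nan, np.nan, 1.5, 4.0, np.nan],
})

print(longest_shared_gap(toy_eeg))
```

Rows where only some electrodes are missing don't count toward the run, so the result isolates the gaps that would be filled identically across both hemispheres.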
  
I think an algorithmic approach which takes into account the signal typical of that EEG's activity type, and also the signal typical of that electrode within that activity type, is likely best, but I'm not sure how to approach that. There doesn't seem to be a default approach based on the papers I've looked up so far. Some people use neural networks like CNNs or Transformers, but I'm not taking a deep learning approach to this, so I don't intend to use one for this step of EEG preprocessing.  
  
If I can't implement an effective algorithmic method for missing value imputation, then another idea is to replace missing values with random values drawn from within that electrode's interquartile range (IQR). This at least avoids inserting a flat section across electrodes and across activity types. It won't recreate the signal that likely would have been recorded during that section of the EEG, but it shouldn't introduce a signal that is characteristic of any of my activity types into the EEGs of all activity types. Given the nature of EEG data, I expect this method to be superior to median imputation.
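
A minimal sketch of the IQR idea follows. It assumes the random values are drawn uniformly between each electrode's 25th and 75th percentiles, which is one possible reading of "random values in the IQR". The helper name `impute_within_iqr`, the seed, and the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_within_iqr(eeg: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Fill each column's NaNs with uniform draws between its own Q1 and Q3."""
    rng = np.random.default_rng(seed)
    filled = eeg.copy()
    for col in filled.columns:
        missing = filled[col].isna()
        n_missing = int(missing.sum())
        # Skip columns with nothing to fill, or with no observed values
        # from which to estimate the quartiles.
        if n_missing == 0 or n_missing == len(filled):
            continue
        q1, q3 = filled[col].quantile([0.25, 0.75])  # NaNs are ignored
        filled.loc[missing, col] = rng.uniform(q1, q3, size=n_missing)
    return filled

# Toy EEG with made-up column names; 'dead' has no observed values at all.
toy_eeg = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, np.nan, 8.0],
    'dead': [np.nan] * 8,
})

imputed = impute_within_iqr(toy_eeg)
print(imputed)
```

Because each missing sample gets its own draw, consecutive gaps are filled with varying values instead of a constant. A column with no observed values is left as NaN, since it has no quartiles to sample between.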

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import fastparquet, pyarrow

In [2]:
df = pd.read_csv('train.csv')

In [3]:
df.head()

Unnamed: 0,eeg_id,eeg_sub_id,eeg_label_offset_seconds,spectrogram_id,spectrogram_sub_id,spectrogram_label_offset_seconds,label_id,patient_id,expert_consensus,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote
0,1628180742,0,0.0,353733,0,0.0,127492639,42516,Seizure,3,0,0,0,0,0
1,1628180742,1,6.0,353733,1,6.0,3887563113,42516,Seizure,3,0,0,0,0,0
2,1628180742,2,8.0,353733,2,8.0,1142670488,42516,Seizure,3,0,0,0,0,0
3,1628180742,3,18.0,353733,3,18.0,2718991173,42516,Seizure,3,0,0,0,0,0
4,1628180742,4,24.0,353733,4,24.0,3080632009,42516,Seizure,3,0,0,0,0,0


In [4]:
# Positional row number, used later to identify metadata rows
df['row'] = np.arange(df.shape[0])

In [5]:
np.unique(df['eeg_id'])

array([    568657,     582999,     642382, ..., 4294455489, 4294858825,
       4294958358])

##### Finding All EEGs with Missing Values

In [6]:
eeg_null_ids = []     # eeg_ids whose file contains at least one null
eeg_null_ratios = []  # fraction of null entries in each of those files
for i in np.unique(df['eeg_id']):
    eeg = pd.read_parquet('train_eegs/{}.parquet'.format(i), engine = 'pyarrow')
    num_entries = eeg.shape[0] * eeg.shape[1]
    num_nulls = pd.isnull(eeg).sum().sum()
    ratio = num_nulls / num_entries
    if ratio > 0.0:
        eeg_null_ids.append(i)
        eeg_null_ratios.append(ratio)

##### Finding All Spectrograms with Missing Values

In [None]:
spec_null_ids = []     # spectrogram_ids whose file contains at least one null
spec_null_ratios = []  # fraction of null entries in each of those files
for i in np.unique(df['spectrogram_id']):
    spec = pd.read_parquet('train_spectrograms/{}.parquet'.format(i), engine = 'pyarrow')
    num_entries = spec.shape[0] * spec.shape[1]
    num_nulls = pd.isnull(spec).sum().sum()
    ratio = num_nulls / num_entries
    if ratio > 0.0:
        spec_null_ids.append(i)
        spec_null_ratios.append(ratio)

##### Metadata Dataframe of EEGs and Spectrograms with Missing Values

In [None]:
# Metadata rows whose EEG or spectrogram contains at least one missing value.
# isin() selects them in one pass and also works when either list is empty.
missingval_df = df[df['eeg_id'].isin(eeg_null_ids) | df['spectrogram_id'].isin(spec_null_ids)]

In [None]:
null_rows = np.unique(missingval_df['row'])

In [None]:
# Select every affected metadata row in one step
null_df = df[df['row'].isin(null_rows)]

In [None]:
null_df.to_csv('null_metadata.csv', index = False)

##### Target Variable Distribution for EEGs and Spectrograms with Missing Values

In [None]:
null_df['expert_consensus'].value_counts(normalize = True)

##### Target Variable Distribution for All EEGs and Spectrograms

In [None]:
df['expert_consensus'].value_counts(normalize = True)

##### Target Variable Distribution if All Null EEGs and Spectrograms Were Removed

In [None]:
# df still has its default integer index, so the 'row' values are valid index labels
removal_df = df.drop(index = null_rows)

In [None]:
removal_df['expert_consensus'].value_counts(normalize = True)
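
The three distributions above are easier to compare side by side. A small sketch of one way to line them up, using toy stand-ins for `df`, `null_df`, and `removal_df`. The helper name `compare_distributions` and the toy data are hypothetical.

```python
import pandas as pd

def compare_distributions(**frames: pd.DataFrame) -> pd.DataFrame:
    """One column of normalized `expert_consensus` frequencies per named frame."""
    table = pd.DataFrame({
        name: frame['expert_consensus'].value_counts(normalize = True)
        for name, frame in frames.items()
    })
    # A class absent from one frame has frequency 0 there, not NaN.
    return table.fillna(0.0)

# Toy stand-ins for df, null_df and removal_df (made-up labels).
toy_all = pd.DataFrame({'expert_consensus': ['Seizure', 'Seizure', 'LPD', 'Other']})
toy_null = toy_all.iloc[[0, 2]]
toy_kept = toy_all.drop(index = toy_null.index)

comparison = compare_distributions(
    all_rows = toy_all, with_nulls = toy_null, nulls_removed = toy_kept
)
print(comparison)
```

With the real frames, the call would be `compare_distributions(all_rows = df, with_nulls = null_df, nulls_removed = removal_df)`.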

In [None]:
removal_df.to_csv('nulls_dropped.csv', index = False)

##### Activity Type DFs

In [None]:
seizure_df = df[df['expert_consensus'] == 'Seizure']
lpd_df = df[df['expert_consensus'] == 'LPD']
gpd_df = df[df['expert_consensus'] == 'GPD']
lrda_df = df[df['expert_consensus'] == 'LRDA']
grda_df = df[df['expert_consensus'] == 'GRDA']
other_df = df[df['expert_consensus'] == 'Other']

In [None]:
# drop = True discards the old index instead of adding it as an 'index' column
seizure_df = seizure_df.reset_index(drop = True)
lpd_df = lpd_df.reset_index(drop = True)
gpd_df = gpd_df.reset_index(drop = True)
lrda_df = lrda_df.reset_index(drop = True)
grda_df = grda_df.reset_index(drop = True)
other_df = other_df.reset_index(drop = True)