# Harmful Brain Activity Detection

## Approach

Because the data consists of EEG recordings and spectrograms, we use signal-processing methods for pre-processing and feature extraction, then choose a neural network suited to each type of feature for prediction.

### Data Pre-Processing

* **Band Pass Filter** - Noise removal
* **ICA** - Artifact removal (eye blinks, heartbeats, muscle movements, etc.)

### Feature Extraction

* **Power Spectral Density** - Frequency domain
* **FFT** - Frequency domain
* **Waveform** - Time-Frequency Domain


### Predictive Model

* CNN for FFT 
* Vision Transformer
* LSTM for Power Spectral Density

### Evaluation

* Kullback-Leibler (KL) divergence

In [1]:
import dask
import multiprocessing
import keras
import os
import pandas as pd
import scipy
from scipy.signal import butter, lfilter
from mne.preprocessing import ICA

In [2]:
# Path to files

path_eeg = "hms-harmful-brain-activity-classification/train_eegs/"
path_spec = "hms-harmful-brain-activity-classification/train_spectrograms/"
path_train_csv = "hms-harmful-brain-activity-classification/train.csv"
path_test_csv = "hms-harmful-brain-activity-classification/test.csv"

### Read EEG and Spectrogram files

In [3]:
# Function to list the EEG and spectrogram file paths in a folder (the files are not loaded here)
def read_file(folder):
    files = [os.path.join(folder, file) for file in os.listdir(folder)]
    return files

In [4]:
# Paths to the EEG files
eeg_files = read_file(path_eeg)

In [6]:
# Paths to the spectrogram files
spec_files = read_file(path_spec)

In [7]:
# Read metadata
df_train = pd.read_csv(path_train_csv)

In [8]:
df_train

Unnamed: 0,eeg_id,eeg_sub_id,eeg_label_offset_seconds,spectrogram_id,spectrogram_sub_id,spectrogram_label_offset_seconds,label_id,patient_id,expert_consensus,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote
0,1628180742,0,0.0,353733,0,0.0,127492639,42516,Seizure,3,0,0,0,0,0
1,1628180742,1,6.0,353733,1,6.0,3887563113,42516,Seizure,3,0,0,0,0,0
2,1628180742,2,8.0,353733,2,8.0,1142670488,42516,Seizure,3,0,0,0,0,0
3,1628180742,3,18.0,353733,3,18.0,2718991173,42516,Seizure,3,0,0,0,0,0
4,1628180742,4,24.0,353733,4,24.0,3080632009,42516,Seizure,3,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106795,351917269,6,12.0,2147388374,6,12.0,4195677307,10351,LRDA,0,0,0,3,0,0
106796,351917269,7,14.0,2147388374,7,14.0,290896675,10351,LRDA,0,0,0,3,0,0
106797,351917269,8,16.0,2147388374,8,16.0,461435451,10351,LRDA,0,0,0,3,0,0
106798,351917269,9,18.0,2147388374,9,18.0,3786213131,10351,LRDA,0,0,0,3,0,0
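
The six `*_vote` columns hold raw expert vote counts. A minimal sketch of turning them into per-row probability targets follows, assuming that normalized votes serve as the training target (the notebook does not state this). The small `demo_votes` frame stands in for `df_train`, and its third row is made up for illustration.

```python
import pandas as pd

vote_cols = ["seizure_vote", "lpd_vote", "gpd_vote",
             "lrda_vote", "grda_vote", "other_vote"]

# Illustrative stand-in for df_train[vote_cols]; the last row is hypothetical
demo_votes = pd.DataFrame(
    [[3, 0, 0, 0, 0, 0],
     [0, 0, 0, 3, 0, 0],
     [1, 0, 2, 0, 0, 1]],
    columns=vote_cols,
)

# Divide each row by its total vote count so that every row sums to 1
vote_probs = demo_votes.div(demo_votes.sum(axis=1), axis=0)
print(vote_probs)
```

The same two lines should work on the real frame by replacing `demo_votes` with `df_train[vote_cols]`.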


In [13]:
df_spec = pd.read_parquet(spec_files[1])

In [15]:
df_spec

Unnamed: 0,time,LL_0.59,LL_0.78,LL_0.98,LL_1.17,LL_1.37,LL_1.56,LL_1.76,LL_1.95,LL_2.15,...,RP_18.16,RP_18.36,RP_18.55,RP_18.75,RP_18.95,RP_19.14,RP_19.34,RP_19.53,RP_19.73,RP_19.92
0,1,0.51,0.540000,0.520000,0.440000,0.37,0.230000,0.210000,0.200000,0.150000,...,0.00,0.00,0.00,0.00,0.00,0.00,0.01,0.01,0.02,0.02
1,3,6.62,8.440000,8.630000,7.770000,7.27,5.810000,5.640000,5.660000,5.920000,...,0.05,0.04,0.04,0.04,0.04,0.05,0.05,0.06,0.07,0.07
2,5,14.52,24.900000,24.549999,24.600000,27.67,22.740000,18.969999,17.250000,18.230000,...,0.17,0.11,0.12,0.11,0.08,0.07,0.11,0.10,0.08,0.13
3,7,19.66,27.580000,45.419998,25.860001,27.67,44.610001,17.200001,15.400000,28.559999,...,0.19,0.12,0.11,0.12,0.06,0.09,0.11,0.13,0.14,0.13
4,9,14.28,16.770000,16.620001,14.760000,13.15,8.970000,27.740000,28.740000,29.080000,...,0.05,0.05,0.05,0.03,0.03,0.03,0.02,0.06,0.06,0.07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
303,607,4.98,4.760000,5.240000,6.090000,5.72,5.080000,5.980000,5.450000,4.990000,...,0.14,0.10,0.10,0.12,0.09,0.09,0.11,0.05,0.07,0.06
304,609,7.95,13.450000,14.750000,15.810000,14.82,13.140000,9.650000,9.600000,6.840000,...,0.09,0.11,0.13,0.12,0.12,0.12,0.09,0.10,0.11,0.10
305,611,8.49,11.330000,8.090000,9.480000,5.57,12.530000,13.100000,15.330000,16.250000,...,0.13,0.15,0.12,0.13,0.14,0.08,0.08,0.11,0.09,0.09
306,613,11.71,16.850000,16.860001,13.550000,13.43,24.600000,25.100000,25.820000,23.160000,...,0.12,0.11,0.11,0.13,0.07,0.05,0.06,0.07,0.08,0.08


In [4]:
df_train_eeg = pd.read_parquet("EEG.parquet")

In [5]:
df_train_eeg

Unnamed: 0,Fp1,F3,C3,P3,F7,T3,T5,O1,Fz,Cz,Pz,Fp2,F4,C4,P4,F8,T4,T6,O2,EKG
0,-105.849998,-89.230003,-79.459999,-49.230000,-99.730003,-87.769997,-53.330002,-50.740002,-32.250000,-42.099998,-43.270000,-88.730003,-74.410004,-92.459999,-58.930000,-75.739998,-59.470001,8.210000,66.489998,1404.930054
1,-85.470001,-75.070000,-60.259998,-38.919998,-73.080002,-87.510002,-39.680000,-35.630001,-76.839996,-62.740002,-43.040001,-68.629997,-61.689999,-69.320000,-35.790001,-58.900002,-41.660000,196.190002,230.669998,3402.669922
2,8.840000,34.849998,56.430000,67.970001,48.099998,25.350000,80.250000,48.060001,6.720000,37.880001,61.000000,16.580000,55.060001,45.020000,70.529999,47.820000,72.029999,-67.180000,-171.309998,-3565.800049
3,-56.320000,-37.279999,-28.100000,-2.820000,-43.430000,-35.049999,3.910000,-12.660000,8.650000,3.830000,4.180000,-51.900002,-21.889999,-41.330002,-11.580000,-27.040001,-11.730000,-91.000000,-81.190002,-1280.930054
4,-110.139999,-104.519997,-96.879997,-70.250000,-111.660004,-114.430000,-71.830002,-61.919998,-76.150002,-79.779999,-67.480003,-99.029999,-93.610001,-104.410004,-70.070000,-89.250000,-77.260002,155.729996,264.850006,4325.370117
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,-45.540001,-26.459999,-23.209999,-25.250000,-21.559999,-36.549999,10.730000,-16.290001,-55.919998,-28.670000,-29.770000,-22.000000,3.710000,8.470000,0.480000,9.950000,33.959999,110.510002,58.599998,301.239990
9996,-26.860001,4.350000,7.410000,7.830000,5.260000,7.750000,50.130001,4.150000,1.720000,22.100000,7.150000,-6.820000,38.070000,32.880001,21.990000,32.990002,60.209999,-156.949997,-275.929993,-4634.799805
9997,-133.759995,-111.190002,-119.180000,-105.760002,-130.039993,-104.059998,-68.290001,-86.480003,-57.130001,-68.830002,-95.839996,-107.540001,-86.449997,-94.099998,-97.050003,-86.339996,-68.040001,-14.880000,66.440002,1667.800049
9998,-78.889999,-59.660000,-60.770000,-59.810001,-63.020000,-60.020000,-20.690001,-42.820000,-68.669998,-54.740002,-62.810001,-52.869999,-34.099998,-31.500000,-37.810001,-32.259998,-10.870000,137.559998,193.839996,2743.379883


In [6]:
df_train_spec = pd.read_parquet("Spectrogram.parquet")

In [7]:
df_train_spec

Unnamed: 0,time,LL_0.59,LL_0.78,LL_0.98,LL_1.17,LL_1.37,LL_1.56,LL_1.76,LL_1.95,LL_2.15,...,RP_18.16,RP_18.36,RP_18.55,RP_18.75,RP_18.95,RP_19.14,RP_19.34,RP_19.53,RP_19.73,RP_19.92
0,1,28.680000,53.990002,67.629997,59.880001,50.880001,74.309998,78.480003,63.080002,59.869999,...,0.13,0.14,0.08,0.11,0.04,0.03,0.05,0.05,0.04,0.05
1,3,29.639999,38.959999,44.009998,66.800003,48.509998,42.180000,47.340000,48.599998,40.880001,...,0.15,0.13,0.08,0.08,0.07,0.06,0.07,0.06,0.06,0.06
2,5,8.890000,9.020000,16.360001,23.559999,27.340000,30.040001,27.559999,23.290001,15.120000,...,0.12,0.11,0.08,0.08,0.09,0.10,0.12,0.14,0.13,0.14
3,7,1.770000,1.930000,1.810000,1.600000,1.430000,1.280000,1.190000,1.110000,1.010000,...,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.04,0.04,0.04
4,9,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,591,2.210000,2.280000,2.200000,1.280000,1.350000,1.930000,2.300000,2.440000,2.310000,...,0.03,0.03,0.03,0.03,0.03,0.04,0.04,0.03,0.03,0.03
296,593,2.490000,2.540000,2.150000,1.490000,1.360000,1.570000,1.970000,2.050000,1.890000,...,0.02,0.02,0.02,0.03,0.02,0.02,0.02,0.01,0.01,0.01
297,595,0.240000,0.190000,0.210000,0.120000,0.110000,0.080000,0.060000,0.040000,0.040000,...,0.00,0.00,0.00,0.01,0.01,0.01,0.01,0.00,0.00,0.00
298,597,0.990000,1.230000,1.370000,1.620000,1.940000,2.190000,2.270000,2.310000,2.300000,...,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.01,0.01,0.01


## Data Pre-Processing

### Noise Removal - Bandpass Filter

We use a Butterworth band-pass filter (1-40 Hz, with the EEG sampled at 200 Hz) to remove noise from the EEG data.

In [8]:
# Function to get the Butterworth band-pass filter coefficients
def get_filter_Coeff(lowcut, highcut, freq, order):
    nyq_freq = 0.5 * freq  # Nyquist frequency (half the sampling rate)
    low = lowcut / nyq_freq  # cutoffs normalized to the range (0, 1)
    high = highcut / nyq_freq
    b, a = butter(order, [low, high], btype='band')
    return b, a


# Function to apply the filter to a 1-D signal
# Note: lfilter is causal (one forward pass), so the output is phase-shifted
def Bandpass_filter(data, lowcut, highcut, freq, order=5):
    b, a = get_filter_Coeff(lowcut, highcut, freq, order)
    fil_sig = lfilter(b, a, data)
    return fil_sig
    

In [9]:
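lowcut = 1    # lower cutoff in Hz
highcut = 40  # upper cutoff in Hz
freq = 200    # sampling frequency in Hz

# Filter each channel (column) independently
filtered_df = df_train_eeg.apply(lambda x: Bandpass_filter(x, lowcut, highcut, freq, order=5))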

In [10]:
filtered_df

Unnamed: 0,Fp1,F3,C3,P3,F7,T3,T5,O1,Fz,Cz,Pz,Fp2,F4,C4,P4,F8,T4,T6,O2,EKG
0,-2.100889,-1.771019,-1.577106,-0.977107,-1.979421,-1.742041,-1.058483,-1.007077,-0.640091,-0.835592,-0.858814,-1.761095,-1.476875,-1.835127,-1.169631,-1.503272,-1.180348,0.162950,1.319680,27.884764
1,-14.189864,-12.021789,-10.574690,-6.583092,-13.221607,-12.096371,-7.082100,-6.696018,-5.331571,-6.214313,-5.961406,-11.834955,-10.007022,-12.288901,-7.665859,-10.108626,-7.846102,4.862965,12.426104,233.359360
2,-40.552595,-33.997911,-28.993482,-17.495107,-36.539446,-35.232135,-18.527892,-17.939053,-18.271340,-18.839905,-16.394510,-33.455683,-27.727652,-34.052362,-19.882688,-27.927036,-20.702065,24.199447,43.072475,737.523959
3,-62.472702,-50.101602,-39.609503,-20.813742,-51.819248,-54.260173,-20.909384,-22.943880,-32.751130,-28.591775,-20.569729,-50.506138,-38.265283,-48.466303,-23.230328,-38.887589,-24.947541,49.975845,68.594786,1038.486706
4,-54.575278,-38.177927,-23.775645,-4.220588,-36.277114,-45.040232,-0.866430,-10.064749,-31.333848,-19.603869,-5.815211,-42.707473,-23.029403,-34.210271,-5.036138,-24.878711,-6.941964,44.088911,39.389186,350.525396
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,-7.707664,-15.220291,-8.998526,-0.933985,-5.735699,-1.117211,2.444315,-0.151495,-19.481629,-11.082190,-2.293582,-6.699586,-10.638649,-11.240802,-4.351119,-11.004733,-13.866837,-9.532745,-5.929960,343.669056
9996,-5.175235,-12.466265,-7.778360,0.068199,-3.299358,1.795548,5.440420,0.937345,-17.103852,-9.379634,-1.250840,-5.855717,-9.344584,-11.770231,-4.778636,-9.393136,-12.747615,-10.137379,-15.071165,198.782118
9997,-9.760735,-17.791755,-13.805102,-6.335713,-8.901127,-3.596039,0.877066,-3.672686,-21.868252,-15.586611,-7.692149,-11.621666,-15.901708,-19.780175,-12.397122,-15.025822,-20.332449,4.235858,-1.603616,451.175092
9998,-12.772188,-20.442583,-15.306716,-9.385226,-11.015012,-6.618801,0.380710,-5.601837,-27.812488,-20.648349,-11.686888,-15.146141,-19.004209,-23.308414,-15.767406,-18.387827,-24.072190,6.142635,-3.274736,449.836045
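
The filter above uses `lfilter`, which runs once forward in time and therefore delays the signal by a frequency-dependent amount. If waveform timing matters, a zero-phase variant is one option. The sketch below is an alternative, not the notebook's method: it assumes SciPy's second-order-section form (`output="sos"`) with `sosfiltfilt`, and the helper name `bandpass_zero_phase` and the synthetic test signal are made up for illustration.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def bandpass_zero_phase(data, lowcut, highcut, fs, order=5):
    """Band-pass filter applied forward and backward, giving zero phase shift."""
    # Passing fs lets butter take the cutoffs directly in Hz
    sos = butter(order, [lowcut, highcut], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, data, axis=-1)


fs = 200  # sampling frequency in Hz
t = np.arange(0, 10, 1 / fs)
in_band = np.sin(2 * np.pi * 10 * t)         # 10 Hz, inside the 1-40 Hz pass band
out_band = 0.5 * np.sin(2 * np.pi * 70 * t)  # 70 Hz, outside the pass band

filtered = bandpass_zero_phase(in_band + out_band, 1, 40, fs)

# Away from the edges the output should closely track the in-band component
middle = slice(600, -600)
max_error = np.max(np.abs(filtered[middle] - in_band[middle]))
print(max_error)
```

Filtering forward and backward applies the magnitude response twice, so the effective roll-off is steeper than that of a single `lfilter` pass of the same order.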


### ICA - Independent Component Analysis

We perform ICA to remove artifacts such as eye blinks, heartbeats, and muscle movements.

First, we convert the DataFrame into an array in the layout MNE expects for ICA.

In [11]:
import mne
import numpy as np

sfreq = 200  # sampling frequency in Hz
channel_names = filtered_df.columns.tolist()  # Get the column names as channel names
channel_types = ['eeg'] * len(channel_names)  # Treats every channel as EEG, including EKG (which is really an ECG channel)

# Convert DataFrame to numpy array
data = filtered_df.T.to_numpy()  # Transpose because MNE expects channels x times

# Create an MNE Info object
info = mne.create_info(ch_names=channel_names, sfreq=sfreq, ch_types=channel_types)

# MNE expects EEG data in volts; uncomment to convert if the data is in microvolts (uV)
# data = data / 1e6

# Create RawArray
raw = mne.io.RawArray(data, info)



Creating RawArray with float64 data, n_channels=20, n_times=10000
    Range : 0 ... 9999 =      0.000 ...    49.995 secs
Ready.


In [13]:
raw

Measurement date,Unknown
Experimenter,Unknown
Participant,Unknown

Digitized points,Not available
Good channels,20 EEG
Bad channels,
EOG channels,Not available
ECG channels,Not available

Sampling frequency,200.00 Hz
Highpass,0.00 Hz
Lowpass,100.00 Hz
Duration,00:00:50 (HH:MM:SS)


#### ICA - (WIP)

In [14]:
sfreq = raw.info['sfreq']

# Create and fit an ICA model
ica = ICA(n_components=15, random_state=97, max_iter=800)
ica.fit(raw)

# Plot the components to manually identify artifacts
ica.plot_components()

Fitting ICA to data using 20 channels (please be patient, this may take a while)
Selecting by number: 15 components


  ica.fit(raw)


Fitting ICA took 0.3s.




RuntimeError: No digitization points found.

## Feature Extraction

#### Power Spectral Density - Frequency domain
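
This section is still a stub. As a rough sketch of what it might contain, the block below estimates a PSD with Welch's method and sums it into band powers. The use of `scipy.signal.welch`, the 2-second segment length, the band edges, and the `band_power` helper are all assumptions for illustration, and the signal is synthetic.

```python
import numpy as np
from scipy.signal import welch

fs = 200  # sampling frequency in Hz, as used earlier in the notebook
t = np.arange(0, 10, 1 / fs)

# Synthetic single-channel signal: a 10 Hz sine plus a little noise
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.standard_normal(t.size)

# Welch PSD with 2-second segments, giving 0.5 Hz frequency resolution
freqs, psd = welch(signal, fs=fs, nperseg=2 * fs)


def band_power(freqs, psd, low, high):
    """Approximate the power in [low, high) Hz by summing PSD bins."""
    mask = (freqs >= low) & (freqs < high)
    return psd[mask].sum() * (freqs[1] - freqs[0])


# Hypothetical band edges in Hz
bands = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}
features = {name: band_power(freqs, psd, lo, hi) for name, (lo, hi) in bands.items()}

peak_freq = freqs[np.argmax(psd)]
print(peak_freq, features)
```

Applied per channel, this yields a small fixed-length feature vector for each EEG window.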

#### FFT - Frequency domain
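
This section is also a stub. A minimal sketch of an FFT-based amplitude spectrum for one channel follows, assuming NumPy's real-input FFT (`np.fft.rfft`); the two-tone signal is synthetic and chosen so that both tones fall exactly on frequency bins.

```python
import numpy as np

fs = 200    # sampling frequency in Hz
n = 2 * fs  # 2 seconds of samples, giving 0.5 Hz bin spacing
t = np.arange(n) / fs

# Synthetic signal: 10 Hz at amplitude 1.0 plus 25 Hz at amplitude 0.5
signal = 1.0 * np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 25 * t)

spectrum = np.fft.rfft(signal)          # complex one-sided spectrum
freqs = np.fft.rfftfreq(n, d=1 / fs)    # frequency of each bin in Hz
amplitude = 2 * np.abs(spectrum) / n    # one-sided amplitude for the non-DC, non-Nyquist bins

# Frequencies of the two largest peaks, in ascending order
top_two = sorted(freqs[np.argsort(amplitude)[-2:]])
print(top_two)
```

For real EEG windows, a taper (for example a Hann window) before the FFT would reduce leakage from tones that do not fall on a bin.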


#### Waveform - Time-Frequency Domain

## Modelling

Depending on the type of feature extracted, a matching model will be trained on each feature set, and an ensemble of these models will be used for prediction/inference.

The models will be evaluated with the `Kullback-Leibler divergence` metric.
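
For a target distribution $p$ and a predicted distribution $q$ over the six classes, the KL divergence is $D_{KL}(p \parallel q) = \sum_i p_i \log(p_i / q_i)$. A minimal pure-Python sketch follows. The function name `kl_divergence` and the clipping constant `eps` are assumptions for illustration, not taken from the notebook.

```python
import math


def kl_divergence(target, predicted, eps=1e-15):
    """KL divergence D(target || predicted) for two discrete distributions."""
    total = 0.0
    for p, q in zip(target, predicted):
        if p > 0:  # terms with p == 0 contribute nothing
            q = min(max(q, eps), 1.0)  # clip to avoid log of zero
            total += p * math.log(p / q)
    return total


# Unanimous "Seizure" target scored against a uniform prediction
target = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
uniform = [1 / 6] * 6
print(kl_divergence(target, uniform))  # ln(6), about 1.79
```

A perfect prediction scores 0, and the score grows as probability mass moves away from the classes the experts voted for.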

**Predictive Models** :

1) CNN for FFT
2) LSTM for Power Spectral Density
3) Wavenets for Waveforms

In [17]:
import torch

# Check whether a CUDA-capable GPU is available for training
torch.cuda.is_available()

False