## EEG Data Preprocessing and TCN Model Training
### Introduction

This guide walks through loading EEG recordings from EDF files, downsampling and segmenting them, and training a temporal convolutional network (TCN) with TensorFlow/Keras. The data come from the PhysioNet dataset of EEG recordings taken from subjects before and during mental arithmetic tasks.



In [1]:
import os
import pyedflib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from scipy.signal import resample
import matplotlib.pyplot as plt


## Step 1: Loading EDF Files and Extracting Signals
### Objective: Load EEG data from EDF files and extract the signals into numpy arrays.

Explanation: EDF (European Data Format) files are commonly used for storing EEG data. This step uses the pyedflib library to read every EDF file in the specified folder, zero-pads each recording to the length of the longest one, and stacks the signals into a single NumPy array of shape (files, channels, samples).

In [2]:
# Step 1: Load and Extract Signals from EDF Files
def load_edf_files(folder_path):
    edf_files = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.endswith('.edf')]
    all_signals = []
    max_samples = 0
    
    # Determine the maximum number of samples across all files
    for edf_file in edf_files:
        edf_reader = pyedflib.EdfReader(edf_file)
        max_samples = max(max_samples, edf_reader.getNSamples()[0])
        edf_reader.close()
    
    for edf_file in edf_files:
        edf_reader = pyedflib.EdfReader(edf_file)
        num_channels = edf_reader.signals_in_file
        signals = np.zeros((num_channels, max_samples))
        
        for i in range(num_channels):
            signal = edf_reader.readSignal(i)
            if len(signal) < max_samples:
                # Pad with zeros if the signal is shorter than the max_samples
                signal = np.pad(signal, (0, max_samples - len(signal)), 'constant')
            signals[i, :] = signal
        
        all_signals.append(signals)
        edf_reader.close()
    
    return np.array(all_signals), edf_files

folder_path = "C:/Users/kames/Downloads/eeg-during-mental-arithmetic-tasks-1.0.0/eeg-during-mental-arithmetic-tasks-1.0.0"
X_raw, edf_files = load_edf_files(folder_path)
print(f"X_raw shape: {X_raw.shape}")

X_raw shape: (72, 21, 94000)
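
Because every recording is zero-padded to the longest one, shorter recordings end in a flat run of zeros that later steps treat as signal. The sketch below shows, on synthetic arrays, one way the true lengths could be kept next to the padded stack so that padded regions can be masked or dropped later. The helper `pad_to_common_length` and the toy data are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

def pad_to_common_length(recordings):
    """Zero-pad (channels, time) arrays to a common length and keep the true lengths."""
    lengths = np.array([rec.shape[1] for rec in recordings])
    max_len = int(lengths.max())
    num_channels = recordings[0].shape[0]
    stacked = np.zeros((len(recordings), num_channels, max_len))
    for i, rec in enumerate(recordings):
        stacked[i, :, :rec.shape[1]] = rec
    return stacked, lengths

# Two toy "recordings" with 3 channels and different durations.
rng = np.random.default_rng(0)
toy = [rng.normal(size=(3, 10)), rng.normal(size=(3, 6))]
X_toy, true_lengths = pad_to_common_length(toy)

# Boolean mask that is True only where real samples exist.
valid_mask = np.arange(X_toy.shape[-1])[None, :] < true_lengths[:, None]
print(X_toy.shape, true_lengths, valid_mask.sum(axis=1))
```

With the lengths at hand, segments that fall entirely inside the padded region could be skipped during segmentation.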


In [3]:
import pandas as pd

# Load subject information
subject_info = pd.read_csv("C:/Users/kames/Downloads/subject-info.csv")

# Extract labels (assuming 'Count quality' is the column for labels)
labels_dict = {str(row['Subject']): row['Count quality'] for _, row in subject_info.iterrows()}

# Extract labels from file names
def get_labels(edf_files, labels_dict):
    labels = []
    for file in edf_files:
        subject_id = os.path.basename(file).split('_')[0]
        if subject_id in labels_dict:
            labels.append(labels_dict[subject_id])
        else:
            labels.append(None)  # Handle missing labels if necessary
    return np.array(labels)

y_raw = get_labels(edf_files, labels_dict)
print(f"y_raw shape: {y_raw.shape}")
print(f"y_raw: {y_raw}")


y_raw shape: (72,)
y_raw: [0 0 1 1 1 1 1 1 0 0 1 1 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1
 1 0 0 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1]
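
The printed labels are imbalanced: 20 of the 72 recordings carry label 0 and 52 carry label 1. A quick check like the one below gives the accuracy a classifier would reach by always predicting the majority class, which is a useful reference point for the results later on. `y_demo` is a stand-in with the same class counts as the printed `y_raw`, not the real array.

```python
import numpy as np

# Stand-in with the same class counts as the y_raw printed above (20 zeros, 52 ones).
y_demo = np.array([0] * 20 + [1] * 52)

classes, counts = np.unique(y_demo, return_counts=True)
majority_baseline = counts.max() / counts.sum()

print(dict(zip(classes.tolist(), counts.tolist())))
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Under these counts the baseline is roughly 72%, so accuracies should be read relative to that figure rather than to 50%.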


In [4]:
from sklearn.preprocessing import StandardScaler

def preprocess_signals(signals):
    # Transpose from (recordings, channels, time_points) to (recordings, time_points, channels)
    signals = np.transpose(signals, (0, 2, 1))
    
    # Add a trailing axis for CNN input: (recordings, time_points, channels, 1)
    samples, time_points, channels = signals.shape
    signals = signals.reshape(samples, time_points, channels, 1)
    
    # Standardize each channel to zero mean and unit variance over all recordings and time points
    scaler = StandardScaler()
    signals_scaled = scaler.fit_transform(signals.reshape(-1, channels)).reshape(signals.shape)
    
    return signals_scaled

# Note: X_scaled is not used in later steps; Step 4 redefines preprocess_signals for the segmented data
X_scaled = preprocess_signals(X_raw)
print(f"X_scaled shape: {X_scaled.shape}")


X_scaled shape: (72, 94000, 21, 1)


## Step 2: Downsampling the Signals
### Objective: Reduce the sampling rate of the EEG signals to make the data more manageable for processing.

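Explanation: EEG is often recorded at a high sampling rate; at 1000 Hz, for example, each channel produces 1000 data points per second. Downsampling reduces the number of points per second and with it the memory and compute cost of the later steps. The code below assumes the recordings are sampled at 1000 Hz and resamples them by a factor of 4, to a nominal 250 Hz. The actual sampling rate should be confirmed from the EDF headers.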

In [5]:
from scipy.signal import resample

def downsample_signals(signals, target_length):
    downsampled_signals = []
    for signal in signals:
        num_channels = signal.shape[0]
        downsampled_signal = np.zeros((num_channels, target_length))
        for i in range(num_channels):
            downsampled_signal[i, :] = resample(signal[i, :], target_length)
        downsampled_signals.append(downsampled_signal)
    return np.array(downsampled_signals)

# Downsample by a factor of 4 (250 Hz from an assumed 1000 Hz source).
# 94000 samples at 1000 Hz correspond to 94 seconds of signal.
original_length = X_raw.shape[-1]  # 94000 for this run
target_length = int(original_length * 250 / 1000)

X_downsampled = downsample_signals(X_raw, target_length)
print(f"X_downsampled shape: {X_downsampled.shape}")


X_downsampled shape: (72, 21, 23500)
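
The target length follows directly from the ratio of the two sampling rates, so the arithmetic is easy to check on its own. The sketch below uses the 1000 Hz source rate assumed in this step; the real rate can be read per channel with pyedflib's `EdfReader.getSampleFrequency`, and if it differs, the 23500-sample output no longer corresponds to 250 Hz. The helper name `resampled_length` is hypothetical.

```python
def resampled_length(num_samples, source_rate_hz, target_rate_hz):
    """Number of samples after resampling, keeping the duration unchanged."""
    return int(round(num_samples * target_rate_hz / source_rate_hz))

num_samples = 94000
assumed_source_rate_hz = 1000  # assumption taken from this step; check the EDF header
target_rate_hz = 250

duration_s = num_samples / assumed_source_rate_hz
target_length = resampled_length(num_samples, assumed_source_rate_hz, target_rate_hz)
print(f"{duration_s:.0f} s of signal -> {target_length} samples at {target_rate_hz} Hz")
```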


## Step 3: Segmenting the Signals
### Objective: Split the continuous EEG signals into shorter, fixed-length segments.

Explanation: Segmenting the EEG data into shorter chunks (e.g., 1-second segments) can improve the performance of machine learning models by providing more training samples. Overlapping segments can also be used to increase the dataset size further. In this step, we split the signals into 1-second segments with a 50% overlap.

In [6]:
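def segment_signals(signals, segment_length, overlap):
    # `overlap` acts as a divisor: the window advances by segment_length // overlap samples,
    # so overlap=2 gives a step of half a segment, i.e. 50% overlap between neighbours.
    step = segment_length // overlap
    segments = []
    for signal in signals:
        num_channels, num_time_points = signal.shape
        for start in range(0, num_time_points - segment_length + 1, step):
            segments.append(signal[:, start:start + segment_length])
    return np.array(segments)

segment_length = 250  # 1-second segments at 250 Hz
overlap = 2  # step = segment_length // 2, i.e. 50% overlap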

X_segmented = segment_signals(X_downsampled, segment_length, overlap)
print(f"X_segmented shape: {X_segmented.shape}")


X_segmented shape: (13464, 21, 250)
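
The segment count can be checked by hand: a recording of T samples cut into windows of length L with a step of h samples yields (T - L) // h + 1 windows. The sketch below confirms this for T = 23500, L = 250, h = 125 and shows an alternative, loop-free way to cut the windows with NumPy's `sliding_window_view`. The function `segment_with_hop` is a hypothetical helper, not the notebook's `segment_signals`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def segment_with_hop(signal, segment_length, hop):
    """Split a (channels, time) array into windows of segment_length, hop samples apart."""
    windows = sliding_window_view(signal, segment_length, axis=-1)  # (channels, T - L + 1, L)
    return windows[:, ::hop].transpose(1, 0, 2)  # (num_windows, channels, L)

num_time_points, segment_length, hop = 23500, 250, 125
expected = (num_time_points - segment_length) // hop + 1

# Toy signal whose value equals its time index, so window starts are easy to inspect.
toy_signal = np.tile(np.arange(num_time_points, dtype=float), (21, 1))
segments = segment_with_hop(toy_signal, segment_length, hop)

print(segments.shape, expected, 72 * expected)
print("first sample of window 1:", segments[1, 0, 0])
```

Each recording gives 187 windows, and 72 recordings give the 13464 segments reported above.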


## Step 4: Preprocessing the Signals
### Objective: Normalize and reshape the segmented signals to prepare them for input into the neural network.

Explanation: Normalization puts the channels on similar scales, which typically helps training converge faster. Each channel is standardized to zero mean and unit variance, and the segments are transposed and reshaped to (samples, time points, channels, 1). Note that the scaler here is fitted on all segments, before the train/test split in Step 6.

In [7]:
def preprocess_signals(signals):
    samples, channels, time_points = signals.shape
    signals = signals.transpose((0, 2, 1)).reshape(samples, time_points, channels, 1)
    
    scaler = StandardScaler()
    signals_scaled = scaler.fit_transform(signals.reshape(-1, channels)).reshape(signals.shape)
    
    return signals_scaled

X_preprocessed = preprocess_signals(X_segmented)
print(f"X_preprocessed shape: {X_preprocessed.shape}")

X_preprocessed shape: (13464, 250, 21, 1)
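
Fitting the scaler on all segments lets statistics from the future test set influence the training inputs. A minimal sketch of the alternative, assuming the split happens first: compute per-channel mean and standard deviation on the training segments only and apply them to both sets. The helper names and the random demo arrays are hypothetical; the array layout matches the (samples, time points, channels, 1) shape used here.

```python
import numpy as np

def fit_channel_stats(X):
    """Per-channel mean and std over the sample and time axes of (samples, time, channels, 1)."""
    mean = X.mean(axis=(0, 1), keepdims=True)
    std = X.std(axis=(0, 1), keepdims=True)
    return mean, np.where(std == 0, 1.0, std)  # avoid dividing by zero on flat channels

def apply_channel_stats(X, mean, std):
    """Standardize X with previously fitted statistics."""
    return (X - mean) / std

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=3.0, size=(8, 250, 21, 1))
X_test_demo = rng.normal(loc=5.0, scale=3.0, size=(2, 250, 21, 1))

mean, std = fit_channel_stats(X_train_demo)        # statistics from training data only
X_train_std = apply_channel_stats(X_train_demo, mean, std)
X_test_std = apply_channel_stats(X_test_demo, mean, std)
print(mean.shape, X_train_std.shape, X_test_std.shape)
```

The population standard deviation used by `np.std` matches what `StandardScaler` computes, so the two approaches differ only in which data the statistics come from.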


## Step 5: Creating Labels
### Objective: Ensure that each segmented signal has a corresponding label.

Explanation: Segmentation turns each recording into many segments, so each recording's label is repeated once per segment. This works because every recording was padded to the same length and therefore yields the same number of segments, stored in recording order.



In [8]:
# Repeat labels to match the number of segments per original sample
num_segments_per_sample = X_segmented.shape[0] // len(X_downsampled)
y_repeated = np.repeat(y_raw, num_segments_per_sample)
print(f"y_repeated shape: {y_repeated.shape}")



y_repeated shape: (13464,)
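
`np.repeat` repeats each element consecutively, which matches the order in which `segment_signals` appends segments (all segments of recording 0, then all of recording 1, and so on). The toy sketch below shows the resulting layout and, as an optional extra, builds a parallel array of recording indices that makes it possible to trace every segment back to its source. The values and the name `recording_ids` are illustrative assumptions.

```python
import numpy as np

y_demo = np.array([0, 1, 1])      # one label per recording (toy values)
num_segments_per_recording = 4    # 187 in the run above

# Each label is repeated for all segments of its recording, in recording order.
y_segments = np.repeat(y_demo, num_segments_per_recording)

# Parallel array telling which recording each segment came from.
recording_ids = np.repeat(np.arange(len(y_demo)), num_segments_per_recording)

print(y_segments)
print(recording_ids)
```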


## Step 6: Splitting the Dataset
### Objective: Split the dataset into training and testing sets.

Explanation: To evaluate the performance of the model, we need to separate the data into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance on unseen data.

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y_repeated, test_size=0.2, random_state=42)
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")


X_train shape: (10771, 250, 21, 1)
X_test shape: (2693, 250, 21, 1)
y_train shape: (10771,)
y_test shape: (2693,)
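
A random split over segments places overlapping windows from the same recording, and both recordings of the same subject, on both sides of the split, so the test score can overestimate performance on unseen subjects. A minimal sketch of a subject-wise split with plain NumPy follows; `subject_wise_split` and the toy layout are hypothetical, and scikit-learn's `GroupShuffleSplit` offers similar behaviour.

```python
import numpy as np

def subject_wise_split(subject_ids, test_fraction=0.2, seed=42):
    """Boolean train/test masks such that no subject appears on both sides."""
    rng = np.random.default_rng(seed)
    unique_subjects = np.unique(subject_ids)
    num_test = max(1, int(round(test_fraction * len(unique_subjects))))
    test_subjects = rng.choice(unique_subjects, size=num_test, replace=False)
    test_mask = np.isin(subject_ids, test_subjects)
    return ~test_mask, test_mask

# Toy layout: 10 subjects, 2 recordings each, 5 segments per recording.
subject_ids = np.repeat(np.arange(10), 2 * 5)
train_mask, test_mask = subject_wise_split(subject_ids)

train_subjects = set(subject_ids[train_mask].tolist())
test_subjects = set(subject_ids[test_mask].tolist())
print(len(train_subjects), len(test_subjects), train_subjects & test_subjects)
```

The masks can index the segment array and the label array alike, for example `X_preprocessed[train_mask]` and `y_repeated[train_mask]`, provided a per-segment subject array has been built first.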


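## Step 7: Define the TCN Model
### Objective: Implement a temporal convolutional network (TCN) architecture.

Explanation: The model stacks residual blocks of dilated causal 1D convolutions. Each block applies two convolutions, each followed by batch normalization, ReLU, and spatial dropout, and then adds the block input back through a skip connection (using a 1x1 convolution when the channel counts differ). Raising the dilation rate from block to block widens the temporal context. Global average pooling followed by a softmax layer produces the class probabilities.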

In [13]:
from tensorflow.keras.layers import Input, Conv1D, BatchNormalization, Activation, SpatialDropout1D, Add, Dense, GlobalAveragePooling1D
from tensorflow.keras.models import Model

def TCN(input_shape, num_classes, num_filters=32, kernel_size=3, dilation_rates=(1, 2, 4), dropout_rate=0.2):
    inputs = Input(shape=input_shape)
    x = inputs

    for dilation_rate in dilation_rates:
        prev_x = x
        x = Conv1D(filters=num_filters, kernel_size=kernel_size, padding='causal', dilation_rate=dilation_rate)(x)
        x = BatchNormalization()(x)
        x = Activation('relu')(x)
        x = SpatialDropout1D(rate=dropout_rate)(x)
        
        x = Conv1D(filters=num_filters, kernel_size=kernel_size, padding='causal', dilation_rate=dilation_rate)(x)
        x = BatchNormalization()(x)
        x = Activation('relu')(x)
        x = SpatialDropout1D(rate=dropout_rate)(x)
        
        if prev_x.shape[-1] != x.shape[-1]:
            prev_x = Conv1D(filters=num_filters, kernel_size=1, padding='same')(prev_x)
        
        x = Add()([prev_x, x])
    
    x = GlobalAveragePooling1D()(x)
    outputs = Dense(num_classes, activation='softmax')(x)
    
    return Model(inputs, outputs)

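# Input shape per segment is (time_points, channels). X_train carries an extra trailing axis of size 1;
# if Keras rejects that shape, pass X_train[..., 0] and X_test[..., 0] instead.
input_shape = (250, 21)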
num_classes = len(np.unique(y_train))

# Create and compile the model
tcn_model = TCN(input_shape, num_classes, num_filters=16, kernel_size=3, dilation_rates=[1, 2], dropout_rate=0.1)
tcn_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Print the model summary
tcn_model.summary()
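
How much temporal context the convolutional stack sees before pooling can be worked out by hand: every causal convolution with kernel size k and dilation d extends the receptive field by (k - 1) * d samples, and each block in `TCN` contains two such convolutions. The sketch below applies that formula to the configuration used here and to the function's defaults. The helper name is hypothetical, and the millisecond figures assume the 250 Hz segment rate used in this notebook.

```python
def tcn_receptive_field(kernel_size, dilation_rates, convs_per_block=2):
    """Receptive field, in samples, of stacked dilated causal convolutions."""
    return 1 + sum(convs_per_block * (kernel_size - 1) * d for d in dilation_rates)

sampling_rate_hz = 250  # segment rate assumed in this notebook

for rates in ([1, 2], [1, 2, 4]):
    rf = tcn_receptive_field(kernel_size=3, dilation_rates=rates)
    print(f"dilations {rates}: {rf} samples (~{1000 * rf / sampling_rate_hz:.0f} ms)")
```

With dilations [1, 2] each output step depends on 13 input samples, and with [1, 2, 4] on 29. The global average pooling layer then averages these local features over the whole 250-sample segment.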


## Step 8: Training and Testing of the TCN Model
### Objective: Train the TCN model using the preprocessed EEG data

Explanation: The model is trained on the training set for up to 100 epochs. The test set is passed as validation data, so the validation metrics printed after each epoch are computed on the same data that is used for the final evaluation.

In [16]:
history = tcn_model.fit(X_train, y_train, epochs=100, batch_size=32, validation_data=(X_test, y_test))

Epoch 1/100
337/337 ━━━━━━━━━━━━━━━━━━━━ 34s 101ms/step - accuracy: 0.8314 - loss: 0.3861 - val_accuracy: 0.8504 - val_loss: 0.3592
Epoch 2/100
337/337 ━━━━━━━━━━━━━━━━━━━━ 36s 107ms/step - accuracy: 0.8349 - loss: 0.3820 - val_accuracy: 0.8704 - val_loss: 0.3306
Epoch 3/100
337/337 ━━━━━━━━━━━━━━━━━━━━ 36s 107ms/step - accuracy: 0.8368 - loss: 0.3757 - val_accuracy: 0.8589 - val_loss: 0.3396
Epoch 4/100
337/337 ━━━━━━━━━━━━━━━━━━━━ 35s 104ms/step - accuracy: 0.8442 - loss: 0.3638 - val_accuracy: 0.8648 - val_loss: 0.3219
Epoch 5/100
337/337 ━━━━━━━━━━━━━━━━━━━━ 34s 100ms/step - accuracy: 0.8449 - loss: 0.3628 - val_accuracy: 0.8693 - val_loss: 0.3250
Epoch 6/100
337/337 ━━━━━━━━━━━━━━━━━━━━ 34s 99ms/step - accuracy: 0.8483 - loss: 0.3559 - val_accuracy: 0.8789 - val_loss: 0.3141
...

In [18]:
score = tcn_model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.2600855231285095
Test accuracy: 0.8960267305374146
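
Since the labels are imbalanced, plain accuracy can hide weak performance on the minority class. A small sketch of a confusion matrix and balanced accuracy (the mean of per-class recall) is given below. The labels and predictions are made up for illustration; in the notebook, predicted classes could be obtained with `tcn_model.predict(X_test).argmax(axis=1)`.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # count each (true, predicted) pair
    return cm

# Hypothetical labels and predictions, for illustration only.
y_true_demo = np.array([0, 0, 1, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 1, 1, 1, 1, 1, 1, 1])

cm = confusion_matrix(y_true_demo, y_pred_demo, num_classes=2)
recall_per_class = cm.diagonal() / cm.sum(axis=1)

print(cm)
print("accuracy:", cm.trace() / cm.sum())
print("balanced accuracy:", recall_per_class.mean())
```

In this toy case accuracy is 0.875 while balanced accuracy is 0.75, because half of the class 0 examples are misclassified.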
